
Copyright © 2002 Tech Science Press, CMES, vol.3, 2002

Development of a Nanoelectronic 3-D (NEMO 3-D) Simulator for Multimillion Atom Simulations and Its Application to Alloyed Quantum Dots

Gerhard Klimeck 1,2, Fabiano Oyafuso 2, Timothy B. Boykin 3, R. Chris Bowen 2, Paul von Allmen 4

Abstract: Material layers with a thickness of a few nanometers are commonplace in today's semiconductor devices. Before long, device fabrication methods will reach a point at which the other two device dimensions are scaled down to a few tens of nanometers. The total atom count in such deca-nano devices is reduced to a few million. Only a small, finite number of free electrons will operate such nano-scale devices because of quantized electron energies and electron charge. This work demonstrates that the simulation of electronic structure and electron transport on these length scales must not only be fundamentally quantum mechanical, but must also include the atomic granularity of the device. This paper presents various elements of the theoretical, numerical, and software foundation of the prototype Nanoelectronic Modeling tool (NEMO 3-D), which enables this class of device simulation on Beowulf cluster computers. The electronic system is represented in a sparse complex Hamiltonian matrix of the order of hundreds of millions. A custom parallel matrix-vector multiply algorithm that is coupled to a Lanczos and/or Rayleigh-Ritz eigenvalue solver has been developed. Benchmarks of the parallel electronic structure and the parallel strain calculation performed on various Beowulf cluster computers and an SGI Origin 2000 are presented. The Beowulf cluster benchmarks show that competition for memory access on dual-CPU PC boards renders one of the two CPUs useless if the memory usage per node is about 1-2 GB.
A new strain treatment for the sp$^3$s$^*$ and sp$^3$d$^5$s$^*$ tight-binding models is developed and parameterized for bulk material properties of GaAs and InAs. The utility of the new tool is demonstrated by an atomistic analysis of the effects of disorder in alloys. In particular, bulk In$_x$Ga$_{1-x}$As and In$_{0.6}$Ga$_{0.4}$As quantum dots are examined. The quantum dot simulations show that the random atom configurations in the alloy, without any size or shape variations, can lead to optical transition energy variations of several meV. The electron and hole wave functions show significant spatial variations due to spatial disorder, indicating variations in electron and hole localization.

keyword: quantum dot, alloy, nanoelectronic, sparse matrix-vector multiplication, tight-binding, optical transition, simulation.

1 gekco@jpl.nasa.gov
2 Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA
3 Department of Electrical and Computer Engineering and LICOS, The University of Alabama in Huntsville, Huntsville, AL
4 Motorola Labs, Solid State Research Center, 77 S.River Pkwy., Tempe, AZ 8284

1 Introduction

Ongoing miniaturization of semiconductor devices has given rise to a multitude of applications unfathomed a few decades ago. Although the reduction in minimum feature size of semiconductor devices has thus far exceeded every expectation and overcome every predicted technological obstacle, it will nevertheless ultimately be limited by the atomic granularity of the underlying crystalline lattice and the small number of free electrons. Before long, device fabrication methods will reach a point at which both quantum mechanical effects and effects induced by the atomistic granularity of the underlying medium (Fig. 1) need to be considered in the device design. Quantum dots represent one incarnation of semiconductor devices at the end of the roadmap.
Quantum dots can be characterized roughly as well-conducting, low-energy regions surrounded on a nanometer scale by insulating materials. The self-capacitance of the spatial confinement region decreases with decreasing size. A situation can arise in which the capacitive energy associated with adding a single electron to the system is larger than the thermal energy, and charge quantization occurs. State quantization can occur if the central region is clean enough (footnote 5) and if the region's dimensions are roughly on the length scale of an electron wavelength. Quantum dot implementations in various material systems (including silicon) have been examined since the late 1980s (Fig. 1b), and several designs have succeeded at room temperature operation. In particular, pyramidal self-assembled quantum dot arrays appear to be promising candidates for use in quantum well lasers and detectors [Liu, Gao, McCaffrey (2001)] within a few years. Although simulation has proven, especially in recent years, to be an important (and cost-effective) component of device design (footnote 6), existing commercial device simulators typically ignore or patch in the quantum mechanical and atomistic effects that must be included in the next generation of electronic devices. This document describes the development of an atomistic simulation tool, NEMO 3-D, that incorporates quantum mechanical and atomistic effects by expanding the valence electron wave function in terms of a set of localized orbitals for each atom in the simulation. NEMO 3-D, an extension of the successful 1-D Nanoelectronic Modeling tool (NEMO) [Bowen, et al. (1997a); Klimeck, et al. (1997); Bowen, et al. (1997b)], models the electronic structure of extended systems of atoms on the length scale of tens of nanometers. Section 2 explains why nanoelectronic device modeling is exciting: it bridges the gap between large-scale, classical semiconductor device models and molecular-level modeling.
The theoretical, numerical, and software methods used in NEMO 3-D are described in detail in Section 3. They include the theoretical background underlying the sp$^3$s$^*$ and sp$^3$d$^5$s$^*$ tight-binding models, the strain computation used to determine the atomic spatial configuration, sparse matrix eigenvalue solvers, and object-oriented I/O.

Footnote 5: "Clean" refers to a small number of unintentional impurities and crystal defects.
Footnote 6: Physics-based device simulation tools have typically only been used to improve individual device performance after careful calibration of the simulation parameters.

Any atomistic, 3-D, nano-scale simulation of a physically realistic semiconductor heterostructure-based system must include a very large number of atoms. For example, modeling an individual, self-assembled InAs quantum dot of 30 nm diameter and 5 nm height embedded in GaAs with a buffer width of 5 nm requires a simulation domain of 40 × 40 × 15 nm$^3$, containing approximately one million atoms. A horizontal array of four such dots separated by 20 nm requires a simulation domain of 90 × 90 × 15 nm$^3$, or about 5.2 million atoms. A 7 7 7nm 3 cube of Silicon contained in an ultra-scaled CMOS device contains about 1 million atoms. The memory and computation time required to model these realistic systems necessitate the use of parallel computers. Section 4 discusses the specific parallel implementation and parallel performance of NEMO 3-D. The tight-binding model employed by NEMO 3-D is semi-empirical in nature. Since the employed basis set is not complete in a mathematical sense, the parameters that enter the model do not correspond precisely to actual orbital overlaps. Instead, a genetic algorithm package is used to establish a set of parameters that reproduces a large body of physical data for the bulk binary system well. Section 5 presents the parameterization of the tight-binding models in detail. Finally, Sections 6 and 7 discuss effects of disorder on identical alloyed quantum dots (i.e.
quantum dots that differ only in the distribution of their constituent atoms). Significant variations in the spatial distribution of hole eigenfunctions and a spread of several meV in transition energies are demonstrated.

2 Nanoelectronic Modeling: A Problem of Conflicting Scales

Nano-scale device technology is currently a heavily investigated research field. Nanoelectronic device modeling in particular is the intriguing area where the two worlds of micrometer-scale carrier transport simulations (engineering) and nanometer-scale electronic structure calculations (solid-state physics) collide. Effects that could traditionally be safely ignored in the semiconductor device engineering world for reasons of computational complexity, such as quantum effects and material granularity, are the key ingredients in the other world. By the same token, electronic structure calculations typically do not address carrier transport or carrier interactions with their environment, again for reasons of computational complexity. Nanoelectronic device modeling must address all of these issues at once.
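The atom counts quoted in the Introduction for realistic simulation domains can be checked with a back-of-the-envelope estimate. The sketch below is not part of NEMO 3-D; it assumes a uniform zincblende crystal with 8 atoms per conventional cubic cell and the bulk GaAs lattice constant, and it ignores strain. The function names, the 40 × 40 × 15 nm example domain, and the choice of 20 basis states per atom (an sp$^3$d$^5$s$^*$ basis with spin) are illustrative assumptions.

```python
# Rough atom count for a zincblende (or diamond) simulation domain.
# Assumptions: 8 atoms per conventional cubic cell of edge a, and the bulk
# GaAs lattice constant everywhere (strain and the dot material are ignored).

A_GAAS_NM = 0.565325  # bulk GaAs lattice constant in nm


def atoms_in_box(lx_nm, ly_nm, lz_nm, a_nm=A_GAAS_NM):
    """Approximate number of atoms in an lx x ly x lz box (lengths in nm)."""
    atoms_per_nm3 = 8.0 / a_nm ** 3
    return lx_nm * ly_nm * lz_nm * atoms_per_nm3


def eigenvector_bytes(n_atoms, states_per_atom=20):
    """Bytes needed for one complex double-precision eigenvector.

    states_per_atom=20 is an assumed basis of 10 orbitals times 2 spins.
    """
    return n_atoms * states_per_atom * 16


# Illustrative domain: 40 x 40 x 15 nm of GaAs.
n_atoms = atoms_in_box(40.0, 40.0, 15.0)
print(f"atoms in domain: {n_atoms / 1e6:.2f} million")
print(f"one eigenvector: {eigenvector_bytes(n_atoms) / 2**20:.0f} MiB")
```

Under these assumptions a single eigenvector of a one-million-atom system already occupies several hundred megabytes, which is consistent with the need for parallel computers stated in the Introduction.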

[Figure 1, two panels. Panel (a): minimum feature size (µm) versus year, with regions labeled DRAM SIA Roadmap, CMOS Devices, CMOS Devices with Quantum Effects, and Quantum Devices ("single electronics", "spintronics"). Panel (b): number of electrons versus year, with labels SIA projection for SRAM, dopant fluctuations, particle noise problems, electrically gated q. dots, and 3-D q. dot heterostruct.]

Figure 1: (a) Minimum 2D feature size as projected on the SIA roadmap. Layer thicknesses of .1µm in the next generation devices are not captured in this graph. (b) Number of electrons under a CMOS SRAM gate. Dopant fluctuations and particle noise may make reliable circuit design impossible, since each device may vary significantly from the next.

2.1 Top-Down Approaches

Traditionally, industrial semiconductor device research has approached nano-scale dimensions from micrometer dimensions. The object of this miniaturization is to make the current state-of-the-art devices operate faster, use less power, and perform at the same level of reliability. Commercial simulators of industrial silicon-based semiconductor devices are based on drift diffusion models, which treat electrons and holes in their respective bands as electron gases. The concept of individual electrons never explicitly appears, since the electron gas is described by its density alone. Furthermore, the underlying matter is approximated by a so-called jellium, with atoms represented by a uniform positive background. Effects due to interactions with impurities, phonons, and other particles are included via mobility models, interaction rates, and other effective potentials. More sophisticated and computationally much more demanding models solve the Boltzmann Transport Equation (BTE) within a Monte Carlo framework.
Electrons and holes are treated as semi-classical particles moving like billiard balls in the six-dimensional phase space and interacting with their environment through adequately weighted random scattering events. The most comprehensive commercially available simulator of this kind is DAMOCLES (footnote 7), built at IBM, but other BTE-based simulators have also recently appeared on the market (footnotes 8, 9).

Footnote 7: See Damocles at DAMOCLES or search for Damocles on IBM web-site, in
Footnote 8: See MocaSim at products/ vwf/ mocasim/ mocasim br.html or search for MocaSim on Silvaco website, in
Footnote 9: Search for DESSIS on the ISE, Integrated Systems Engineering web-site at
Footnote 10: See Medici at Avant!/ SolutionsProducts/ Products/ Item/ 1,1,192,.html or search for Medici on the Avant! website, in
Footnote 11: See Atlas at products/ vwf/ atlas/ atlas.html or search for Atlas on Silvaco web-site, in

The hydrodynamic approximation to the BTE has recently given rise to a class of models that is a step between the drift diffusion approach and the full-fledged BTE solver. Whereas the drift diffusion approach essentially considers only the zeroth-order moment of the BTE, the hydrodynamic model extends the approximation to the first- and second-order moments. This treatment of higher-order moments yields familiar momentum and energy conservation equations for an ideal fluid, with additional terms for the electric and possibly the magnetic field. The hydrodynamic method enjoys considerable popularity since it describes hot carrier transport better than drift diffusion models, yet it is significantly faster than the Monte Carlo BTE method. An industry dedicated to the development and maintenance of such semiconductor device simulators has evolved (footnotes 10, 11). However, quantum mechanical effects such as tunneling and state quantization are not explicitly included in these models. Current efforts in the traditional

device simulation community mainly focus on including these quantum mechanical effects in existing device simulation models at the least possible computational expense, with the overriding requirement of preserving the overall framework of the existing simulation tools. The problem with such simulator extensions is that they depend heavily on empirical parameterization to operate well on existing devices. These tools and their parameterizations are generally not accurate for the next generation of devices.

2.2 Bottom-up Approaches

While the industry-oriented semiconductor device research community approaches nano-scale transport from the top down, the physics-oriented solid state research community approaches the same regime from the bottom up. The models in the latter approach are fully quantum mechanical and emphasize high accuracy, so they can only be applied to relatively small systems. The systems are often periodic, with unit cells containing a few hundred to a few thousand atoms, and the main output is the electronic structure and the equilibrium atomic configuration, with emphasis on surface and interface reconstruction and on impurity and defect levels. Charge transport is usually not included at the fundamental level, although some attempts are mentioned below. In contrast to the methods discussed in Section 2.1, electronic structure calculations explicitly include the granularity of condensed matter and describe the atoms at various levels of sophistication. At a fundamental level, the electrons are described by a many-body Schrödinger equation in which the Hamiltonian contains interaction potentials with the atoms as well as electron-electron interaction terms. In this approach, it has already been assumed that the electrons adiabatically follow the motion of the atoms.
Effects beyond this approximation lead to electron-phonon interaction terms that are evaluated in subsequent steps. In most cases, even the full electron problem is intractable, and calculations involving more than a handful of atoms rely on the so-called single electron approximation. The single electron approximation circumvents the difficulties raised by the interaction between the electrons by introducing a local, or sometimes a non-local, potential into a one-particle Schrödinger equation. Familiar implementations of this idea are the Hartree-Fock approximation [McWeeny (1992)] and density functional theory [Hohenberg, Kohn (1964); Kohn, Sham (1965); Jones, Gunnarson (198)]. Alternative approaches that use the full Hamiltonian, explicitly including electron-electron interaction, are based on methods such as quantum Monte Carlo [Needs, Foulkes, Mitas, Rajagopal (2001)]. Within the one-electron picture, it is in some cases possible to solve the all-electron problem, which means that all the electrons in the atoms are explicitly included in the self-consistent solution of the density-dependent Schrödinger equation. The atom is then simply described by a Coulomb potential with the appropriate charge for the nucleus. However, in most situations only a restricted number of electrons in the atom (the valence electrons) participate in the chemical bonding and transport properties. Several methods have emerged in which the core electrons are taken into account by modifying the Coulomb potential of the nucleus with an additional repulsive potential, which describes the interaction of the core electrons with the valence electrons. The resulting potential is termed a pseudopotential. A number of approaches that have been explored to build these crucial components of electronic structure calculations are described below. Pseudopotentials are divided into several classes. Empirical pseudopotentials are fitted so that a set of calculated properties matches experimental results.
Such empirical pseudopotentials can be defined in real space by a parameterized function or directly in reciprocal space, which offers advantages for periodic systems and was one of the first avenues explored [Cohen, Bergstresser (1966)]. The real space pseudopotentials [Appelbaum, Hamann (1973); RamanaMurty, Atwater (199)] offer the advantage that non-bulk systems such as interfaces and surfaces can be described more realistically. First principles pseudopotentials do not require any fitting procedure, but they do require the knowledge of the eigenstates and eigenenergies for isolated atoms. A number of schemes have been devised, most of which strive to eliminate the nodes in the valence band electronic wave functions within the core region, to reduce the computational cost of the numerical solutions. These schemes, in turn, can be divided into two categories. Norm conserving pseudopotentials are derived (through inversion of the Schrödinger equation) from pseudowave functions with the reassuring property that the associated integrated charge inside the core region is identical to the charge obtained with the exact eigenfunctions. The most famous example and the most widely

used for benchmarks is the method by Bachelet, Hamann and Schlüter (1982). Another method, which has gained considerable popularity in conjunction with a plane wave expansion for the numerical solution, was later developed by Troullier and Martins (1991). Troullier and Martins' method differs in the prescription used to build the pseudo wave functions. Norm-conserving pseudopotentials require a large plane wave cutoff for elements of the first row, oxygen, and a number of other elements because the pseudo wave function cannot be made sufficiently smooth in the core region. Correspondingly, a real space method would require a very fine mesh. Vanderbilt (1990) introduced a successful ultra-soft pseudopotential for which the norm-conserving constraint is relaxed. The disadvantages of Vanderbilt's method are more complex coding and the need to solve a generalized eigenvalue problem rather than a standard eigenvalue problem. While this document has reviewed the description of atoms with pseudopotentials, it should be mentioned that a number of important issues have been omitted, namely improvements to density functional theory and the development of efficient numerical methods, both of which lie at the core of other current investigations in the field. Finally, although most electronic structure calculations using pseudopotentials are, as mentioned earlier, restricted to systems much smaller than the quantum dots of interest in this work, it is worth noting that, with a number of approximations, Canning, Wang, Williamson, Zunger (2000) have recently managed to extend some of their pseudopotential work to systems containing up to one million atoms. Zunger's method has been applied extensively (see for example []) to model quantum dot structures, however, without yet including transport calculations.
2.3 An Intermediary Approach

Whereas traditional semiconductor device simulators are insufficiently equipped to describe quantum effects at atomic dimensions, most ab-initio methods from condensed matter physics are still computationally too demanding for application to practical devices, even devices as small as quantum dots. A number of intermediary methods have therefore been developed in recent years. The methods can be divided into two major categories: atomistic and non-atomistic. The non-atomistic approaches do not attempt to model each individual atom in the structure, but introduce a variety of approximations that are usually based on a continuous, jellium-type description of matter. At the lowest order of approximation, such approaches retain only effective masses and band edges from the full electronic band structure. They have given rise to the well-known effective mass approximation, in which an envelope function that varies slowly on the scale of atomic distances describes the carriers. That envelope function is the solution to a one-particle effective mass Schrödinger equation. The general k·p method leads to a straightforward extension of that approximation by including the coupling between multiple bands. The k·p method has given rise to the popular multi-band effective mass approximation [Schuurmans, 't Hooft (1985); von Allmen (1992a)], in which an envelope function is associated with each band explicitly included in the calculation, and a set of coupled Schrödinger-like equations is solved. It should be noted that the limitation to slowly varying perturbations remains in the multi-band version [von Allmen (1992b)]. The different materials are described by space-dependent parameters, which are determined separately for each of the materials in the device. One strength of the effective mass approximation is the capability to discretize realistically sized systems without the tremendous computational expense of the previously mentioned ab-initio methods.
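As a concrete illustration of how little input the single-band effective mass approximation needs, the following sketch evaluates the levels of an idealized, infinitely deep one-dimensional well, $E_n = \hbar^2 \pi^2 n^2 / (2 m^* L^2)$. This is a textbook idealization rather than a calculation from this work; the 10 nm width, the GaAs conduction band mass ratio of 0.067, and the function name are assumptions chosen for illustration.

```python
import math

HBAR = 1.054571817e-34   # reduced Planck constant, J s
M0 = 9.1093837015e-31    # free-electron mass, kg
EV = 1.602176634e-19     # joules per electron volt
KB_EV = 8.617333262e-5   # Boltzmann constant, eV/K


def well_level_mev(n, width_nm, mass_ratio):
    """Level n (n = 1, 2, ...) of an infinite square well, in meV.

    mass_ratio is the effective mass in units of the free-electron mass.
    """
    width_m = width_nm * 1e-9
    energy_j = (HBAR * math.pi * n) ** 2 / (2.0 * mass_ratio * M0 * width_m ** 2)
    return energy_j / EV * 1e3


# Assumed example: electron in a 10 nm wide GaAs well, m* = 0.067 m0.
for level in (1, 2, 3):
    print(f"E_{level} = {well_level_mev(level, 10.0, 0.067):.1f} meV")
print(f"thermal energy at 300 K = {KB_EV * 300.0 * 1e3:.1f} meV")
```

The only material inputs are the effective mass and the confinement length, which is why the approach is cheap, and also why it carries no information about individual atoms.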
However, the approximation inherently does not contain any direct atomic-level information and is therefore, from a fundamental perspective, not well suited for the representation of nano-scale features such as interfaces and disorder. This limitation has sparked lively discussions concerning the validity of the near-zone-center plane wave expansion k·p basis and the need to include each atom in the simulation [Fu, Wang, Zunger (1998a); Fu, Wang, Zunger (1998b); Efros, Rosen (1998)]. Despite its limitations, the effective mass approximation has provided excellent agreement with measurements for a large number of experiments. Another interesting issue [Keating (1966); Pryor, Kim, Wang (1998)] of particular relevance to quantum dots relates to the most appropriate treatment of strain: should continuum or atomistic models be preferred? This work uses the atomistic valence force field method by Keating (1966). Atomistic approaches attempt to work directly with the electronic wave function of each individual atom. Ab-initio methods overcome the shortcomings of the effective mass approximation; however, additional approximations must be introduced to reduce computational costs. As described briefly in the previous section, one of the critical questions is the choice of a basis set for the representation of the electronic wave function. Many approaches have been considered, ranging from traditional numerical methods, such as finite differences and finite elements, and plane wave expansions [Canning, Wang, Williamson, Zunger (2000)], to methods that exploit the natural properties of chemical bonding in condensed matter. Among these latter approaches, local orbital methods are particularly attractive. While the method of using atomic orbitals as a basis set has a long history in solid state physics, new basis sets with compact support have recently been developed [Sankey (1989)]. Together with specific energy minimization schemes, these new basis sets result in computational costs that increase linearly with the number of atoms in the system without much degradation in accuracy [Ordejon, Drabold, Grumbach, Martin (1993); Ordejon, Galli, Car (1993)]. However, even with such methods, only a few thousand atoms can be described with present-day computational resources. NEMO 3-D uses an empirical tight-binding method [Vogl, Hjalmarson, Dow (1983); Jancu, Scholz, Beltram, Bassani (1998)] that is conceptually related to the local orbital method and that combines the advantages of an atomic-level description with the intrinsic accuracy of empirical methods. It has already demonstrated considerable success [Bowen, et al. (1997a); Klimeck, et al. (1997); Bowen, et al. (1997b)] in quantum mechanical modeling of electron transport as well as in the electronic structure modeling of small quantum dots [Lee, Joensson, Klimeck (2001)].
The underlying idea of the empirical tight-binding method is the selection of a basis consisting of atomic orbitals (such as s, p, and d), from which a single-electron Hamiltonian is built that represents the bulk electronic properties of the material. Interactions between different orbitals within an atom and between nearest-neighbor atoms are treated as empirical fitting parameters. A variety of parameterizations of nearest-neighbor and second-nearest-neighbor tight-binding models have been published, including different orbital configurations [Vogl, Hjalmarson, Dow (1983); Boykin, Klimeck, Bowen, Lake (1997); Boykin (1997); Boykin, Gamble, Klimeck, Bowen (1999); Jancu, Scholz, Beltram, Bassani (1998); Klimeck, et al. (2000); Klimeck, Bowen, Boykin, Cwik (2000)]. NEMO 3-D typically uses an sp$^3$s$^*$ or sp$^3$d$^5$s$^*$ model that consists of five or ten spin-degenerate basis states, respectively. For the modeling of quantum dots, three main methods have been used in recent years: k·p [Pryor (1998); Stier, Grundmann, Bimberg (1999)], pseudopotentials [Canning, Wang, Williamson, Zunger (2000)], and empirical tight-binding [Lee, Joensson, Klimeck (2001)]. It is fair to note that each of these methods grapples with the same intrinsic difficulty: the full description of about a million interacting atoms and all of their electrons. It should also be emphasized that, for most semiconductor compounds, only fragmentary experimental data exist for the band gaps and effective masses and their dependence on stress and strain. While ab-initio pseudopotential calculations beyond density functional theory do in principle predict such properties, the computational cost is high even for simple properties such as the electronic band gap [Hybertsen, Louie (1993)]. It should also be noted that effective masses, which are a crucial element in the determination of correct electronic state quantization, are rarely listed as a result of first principles calculations.
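The tight-binding idea described above can be illustrated with a deliberately minimal toy model: a periodic one-dimensional chain with a single s-like orbital per atom, one on-site energy, and one nearest-neighbor coupling. This is not the sp$^3$s$^*$ or sp$^3$d$^5$s$^*$ Hamiltonian used in NEMO 3-D; the parameter values and all function names are hypothetical. The sketch stores only the non-zero matrix elements and checks that plane waves are eigenvectors with energy $E(k) = \varepsilon + 2t\cos k$.

```python
import cmath
import math


def chain_hamiltonian(n_sites, onsite, hopping):
    """Sparse Hamiltonian of a periodic chain as {row: {col: value}}."""
    if n_sites < 3:
        raise ValueError("need at least 3 sites for distinct neighbours")
    h = {}
    for i in range(n_sites):
        left, right = (i - 1) % n_sites, (i + 1) % n_sites
        h[i] = {i: onsite, left: hopping, right: hopping}
    return h


def matvec(h, vec):
    """Sparse matrix-vector product that touches only stored elements."""
    return [sum(value * vec[j] for j, value in h[i].items())
            for i in range(len(h))]


def bloch_state(n_sites, m):
    """Normalised plane wave with wave number k = 2*pi*m/n_sites."""
    k = 2.0 * math.pi * m / n_sites
    norm = math.sqrt(n_sites)
    return k, [cmath.exp(1j * k * j) / norm for j in range(n_sites)]


def residual(h, vec, energy):
    """Largest component of |H v - E v|; zero for an exact eigenpair."""
    hv = matvec(h, vec)
    return max(abs(a - energy * b) for a, b in zip(hv, vec))


ONSITE, HOPPING = 0.0, -1.0   # hypothetical fitting parameters in eV
ham = chain_hamiltonian(8, ONSITE, HOPPING)
for m in range(8):
    k, psi = bloch_state(8, m)
    energy = ONSITE + 2.0 * HOPPING * math.cos(k)
    print(f"k = {k:5.3f}  E = {energy:+.3f} eV  "
          f"residual = {residual(ham, psi, energy):.1e}")
```

In an empirical scheme the two numbers `ONSITE` and `HOPPING` would be fitted to measured bulk properties. In a confined structure such as a quantum dot the wave number is no longer a good quantum number, so the eigenstates must instead be obtained numerically from the sparse matrix.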
On the other hand, more empirical approaches such as k·p and tight-binding use high-quality bulk parameterizations and can achieve good agreement with experiment in quantum dot simulations. The question remains, however, whether these parameterizations are valid in the presence of variations at the atomic scale. These ongoing efforts can be viewed as complementary rather than as mutually exclusive competitors, and each method can benefit greatly from insightful cross-fertilization. The perspective taken in this work is that empirical tight-binding models link the physical content of the atomic-level wave functions of the pseudopotential calculations to the jellium approach of k·p, and are the method of choice for realistic modeling of transport in quantum dot structures. Finally, as will be discussed in further detail, it should be emphasized that the quality of the empirical tight-binding results depends strongly on a good parameterization of the bulk material properties.

2.4 Nanoelectronics with Transport

Nanoelectronic device simulation must ultimately include both the sophisticated physics-oriented electronic structure calculations and the engineering-oriented transport simulations. Extensive scientific arguments have recently ensued regarding transport theory, basis representation, and practical implementation of a simulator capable of describing a realistic device. Starting from the field of molecular chemistry, Mujica, Kemp, Roitberg, Ratner (1996) applied tight-binding based approaches to the modeling of transport in molecular wires. Later, Derosa and Seminario (2001) modeled molecular charge transport using density functional theory and Green functions. Further significant advances in the understanding of the electronic structure in technologically relevant devices were recently achieved through ab initio simulation of MOS devices by Demkov and Sankey (1999). Ballistic transport through a thin dielectric barrier was evaluated using standard Green function techniques [Demkov, Liu, Zhang, Loechelt (2000); Demkov, Zhang, Drabold (2001)] without scattering mechanisms. Conversely, starting from the field of semiconductor device simulation, various efforts have been undertaken over the past eight years to develop quantum mechanics-based device simulators that incorporate scattering mechanisms at a fundamental level. The Nanoelectronic Modeling tool (NEMO 1-D), built at Texas Instruments / Raytheon, is possibly the first large-scale device simulator based on the non-equilibrium Green function technique (NEGF) to meet the challenge. Its initial objective was to achieve a comprehensive simulation of the electron transport in resonant tunneling diodes (RTDs). NEGF is a powerful formalism capable of combining tight-binding band structure, self-consistent charging effects, electron-phonon interactions, and disorder effects with the important concept of charge transport from one electron reservoir to another.
The concept of electron transport between reservoirs was pioneered in a simpler approach by Landauer (197) and Büttiker (1986), and later expanded for the NEGF formalism by Caroli, Combescot, Nozieres, Saint-James (1971). Tunneling through silicon dioxide barriers, which is a classical problem of great technological interest for the development of thin dielectrics, was studied using tight-binding models within NEMO [Bowen, et al. (1997b)] as well as in a large 3-D cell model by Städele, Tuttle, Hess (2001). Other research groups [Ren (2000); Ren et al. (2000); Ren, Venugopal, Datta, Lundstrom (2001)] have since started to develop NEGF-based simulators to model MOSFET devices in a 2-D simulation domain. These simulations are computationally extremely intensive and fully exploit the computing power of the parallel supercomputers and cluster computers that are realistically available. Quantum mechanical simulations of electron transport through 3-D confined structures such as quantum dots have not yet reached the maturity of the 1-D and 2-D simulation capabilities mentioned above. Early efforts were rate-equation based [Klimeck, Lake, Datta, Bryant (1994); Klimeck, Chen, Datta (1994); Chen et al. (1994)] and assumed a simplified electronic structure. In the related area of molecular structures, detailed studies of charge transport have recently become a hot research topic, where simulations are providing an improved understanding of experimental data [Damle, Ghosh, Datta (2001); Anantram, Govindan (1998)]. At this development stage, NEMO 3-D focuses on the atomistic electronic structure calculation of realistically sized quantum dots. This work complements quantum dot simulations [Williamson, Wang, Zunger (2000); Wang, Kim, Zunger (1999); Stopa (1996); Pryor (1998); Stier, Grundmann, Bimberg (1999); Sheng, Leburton (2001)] performed with the other methods discussed in this section. NEMO 3-D currently does not include carrier transport.
However, the Lanczos algorithm (see Section 3.6) has already been tested successfully for non-Hermitian matrices, which are introduced by open boundary conditions (see Section 3.3), and the code is structured such that transport simulations can be incorporated in the future without major rewrites of the software.

3 Theoretical, Numerical, and Software Methods

3.1 Tight Binding Formulation Without Strain

Quantum dots are characterized by confinement in all three spatial dimensions, so that the Hamiltonian no longer commutes with any of the discrete translation operators. The wave vector is hence not a good quantum number in any direction. The most natural basis for representing such a highly confined wave function is, therefore, one consisting of atomic-like orbitals centered on each atom of the crystal. Solving for the electronic structure of a quantum dot requires detailed modeling of the local environment on an atomic scale and, therefore, introduces material considerations into the calculation. While quantum dots may be fabricated in any number of materials systems, from an electronic structure point of view, the treatment employed mainly

8 68 Copyright c 22 Tech Science Press CMES, vol.3, no., pp , 22 depends on whether the bulk lattice constants of all materials are the same. When the bulk lattice constants are the same the system is said to be latticematched; when they are not, the system is said to be lattice-mismatched. Lattice-matched examples include GaAs/AlAs and its alloys Ga x Al 1 x As, as well as In.3 Ga.47 As/In.2 Al.47 As. An InAs quantum dot surrounded by Al x Ga 1 x As and an InAs or AlAs layer in a high performance In.3 Ga.47 As/InP resonant tunneling diode are examples of lattice-mismatched devices. The treatment of the two cases is necessarily somewhat different, since a matrix element of the Hamiltonian between two orbitals centered on different atoms depends, in general, on the position of the atoms. In this work the two-center approximation is made, so that only the relative position of neighboring atoms is important. In a lattice-matched system, the atoms constitute a perfect crystal with uniform unit cells; in a lattice-mismatched system, the atomic positions vary and are only semi regular. In other words, in such a system one can roughly discern unit cells, but these cells vary somewhat in size, and the atomic positions within them vary. The Keating [Keating (1966)] valence force field model described later is employed in NEMO 3-D to determine the atomic positions. For both types of materials systems, the atomic-like orbitals are assumed to be orthonormal, following Slater and Koster (194). Bravais lattice points can describe a crystal in a lattice-matched system: R n1,n2,n3 = n 1 a 1 +n 2 a 2 +n 3 a 3 (1) where a i are primitive direct lattice translation vectors and n i are integers. If there is more than one atom per cell, as is the case with, for example, GaAs or Si, the atoms within a cell are indexed by µ, and the location of the µ th atom within the cell located at Eq.(1) is given by R n1,n2,n3 +v µ, where v µ is the displacement relative to the cell origin. 
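As an illustration of Eq. (1), the following minimal sketch enumerates atom positions $\mathbf{R}_{n_1,n_2,n_3} + \mathbf{v}_\mu$ for an unstrained zincblende crystal. The fcc primitive vectors, the two-atom basis, the 0-based cell indices, and the function names are assumptions made for the example; they are not taken from the NEMO 3-D source.

```python
from itertools import product

# Assumed fcc primitive vectors in units of the lattice constant a.
A1 = (0.0, 0.5, 0.5)
A2 = (0.5, 0.0, 0.5)
A3 = (0.5, 0.5, 0.0)

# Assumed two-atom zincblende basis: displacement v_mu relative to the cell origin.
BASIS = {"anion": (0.0, 0.0, 0.0), "cation": (0.25, 0.25, 0.25)}


def cell_origin(n1, n2, n3, a=1.0):
    """Eq. (1): R = n1*a1 + n2*a2 + n3*a3."""
    return tuple(a * (n1 * A1[k] + n2 * A2[k] + n3 * A3[k]) for k in range(3))


def atom_positions(N1, N2, N3, a=1.0):
    """List (species, position) for all atoms in N1 x N2 x N3 primitive cells."""
    atoms = []
    for n1, n2, n3 in product(range(N1), range(N2), range(N3)):
        R = cell_origin(n1, n2, n3, a)
        for species, v in BASIS.items():
            atoms.append((species, tuple(R[k] + a * v[k] for k in range(3))))
    return atoms


atoms = atom_positions(2, 2, 2)
print(len(atoms), "atoms; first cation at", atoms[1][1])
```

In a lattice-mismatched system these ideal positions would only serve as a starting guess for the strain relaxation described in Section 3.4.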
The wavefunction is normalized over a volume consisting of $N_i$ cells in the $\mathbf{a}_i$ ($i = 1,2,3$) direction, and the state is represented as a general expansion in terms of localized atomic-like orbitals:

$$|\Psi\rangle = \frac{1}{\sqrt{N_1 N_2 N_3}} \sum_{n_1=1}^{N_1} \sum_{n_2=1}^{N_2} \sum_{n_3=1}^{N_3} \sum_{\alpha,\mu} C^{(\alpha\mu)}_{n_1 n_2 n_3}\, |\alpha\mu;\ \mathbf{R}_{n_1,n_2,n_3} + \mathbf{v}_\mu\rangle \tag{2}$$

In Eq. (2), $\alpha$ indexes the atomic-like orbitals centered on the $\mu$ atoms within each cell $(n_1,n_2,n_3)$. The Schrödinger equation thus appears as a system of simultaneous equations given by:

$$\langle \alpha\mu;\ \mathbf{R}_{n_1,n_2,n_3} + \mathbf{v}_\mu |\, H - E \,| \Psi \rangle = 0 \tag{3}$$

In Eq. (3) the matrix elements between localized orbitals are expressed as tight-binding parameters with the additional limitation of interactions to nearest neighbors. The sp3s* model of Vogl et al. (1983), as well as the sp3d5s* model of Jancu et al. (1998), are employed within the two-center approximation, in which the matrix elements depend only upon the relative positions of the orbitals. The expressions for the matrix elements between these types of orbitals in the two-center approximation are given by Slater and Koster (194) as functions of the relative atomic positions.

3.2 Tight Binding Formulation With Strain

In a lattice-mismatched system several additional complications arise. First, the cells are no longer regularly placed, so that the $\mathbf{R}_{n_1,n_2,n_3}$ are no longer representable in the form given by Eq. (1). In a lattice-mismatched quantum dot fabricated from zincblende crystal materials, the $\mathbf{R}_{n_1,n_2,n_3}$ are best considered as giving the location of an anion-cation pair. Likewise, in Eq. (3), the displacements now depend on both the specific cell and the atom type, and are more correctly written as $\mathbf{v}^{\,n_1 n_2 n_3}_{\mu}$. These complications, though important, are rather minor and are automatically accommodated, since there is no assumption of a wave vector in any dimension in Eq. (2). The second complication affects the nearest neighbor parameters.
As mentioned above, in the two-center approximation these nearest neighbor parameters depend upon the relative atomic positions. For example, the Hamiltonian matrix element between an s-orbital centered about an atom at the origin and a $p_x$-orbital centered about an atom located at $\mathbf{d} = d\,(l\hat{x} + m\hat{y} + n\hat{z})$, where $d$ is the distance between the atoms and $l$, $m$, and $n$ are the direction cosines, is:

$$E_{sx} = l\,V_{sp\sigma} \tag{4}$$

Since the bond angle between atoms is no longer uniform in a lattice-mismatched system, the direction cosines vary in magnitude for different pairs of nearest neighbor atoms, even in nominally zincblende or diamond structure materials. Furthermore, the two-center parameters such as $V_{sp\sigma}$ no longer take on their ideal values, as the distance $d$ between the atoms in each pair is in general different from its ideal (bulk crystal) value. The two-center parameters are assumed to scale as:

$$V_{\alpha\beta\gamma} = V^{(0)}_{\alpha\beta\gamma} \left(\frac{d_0}{d}\right)^{\eta_{\alpha\beta\gamma}} \tag{5}$$

where for the given pair of atoms $d_0$ is the ideal separation, $d$ is the actual separation, and $V^{(0)}_{\alpha\beta\gamma}$ is the ideal parameter for the orbitals involved. The exponents are chosen to reproduce known bulk behavior under conditions such as hydrostatic pressure. From the work of Harrison [Harrison (1999)], it is expected that most of these exponents should be approximately 2.

The same-site parameters are also, in general, changed from their bulk values. In a lattice-matched system, however, the changes are usually small. In the sp3d5s* model, there may be no change at all, since in this model it is often possible to use a single set of onsite parameters for a given atom type, independent of the material. For example, As has the same parameters in GaAs, AlAs, and InAs (see Table 3). In a lattice-mismatched system, atom displacements affect the same-site parameters more strongly. To understand the reason for this shift, recall that the atomic-like orbitals are assumed to be orthogonal. They are, thus, not true atomic orbitals, but are more properly Löwdin functions [Loewdin (19)], which are orthogonal yet transform under symmetry operations of the crystal as would the atomic orbital whose label they bear. When atoms are displaced in a lattice-mismatched system, not only do the tight-binding parameters of Eq. (4) change; so, too, do the overlaps of the true atomic orbitals from which the Löwdin functions are constructed.

The resulting first-order expression for the shifted same-site parameter, whose origin and symbols are explained in the following paragraph, is:

$$E_{i\alpha} \approx E^{(0)}_{i\alpha} + \sum_{j\beta} C_{i\alpha,j\beta}\, \frac{\left(E^{(0)}_{(i\alpha,j\beta)}\right)^{2} - \left(E_{(i\alpha,j\beta)}\right)^{2}}{E^{(0)}_{i\alpha} + E^{(0)}_{j\beta}} \tag{6}$$
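The position dependence described by Eqs. (4) and (5) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the ideal parameter value of 1.0 is a placeholder rather than a fitted tight-binding parameter, and the default exponent of 2 follows the Harrison estimate quoted in the text.

```python
import math


def direction_cosines(displacement):
    """Return the bond length d and the direction cosines (l, m, n)."""
    d = math.sqrt(sum(c * c for c in displacement))
    l, m, n = (c / d for c in displacement)
    return d, (l, m, n)


def scaled_two_center(v_ideal, d_ideal, d_actual, eta=2.0):
    """Eq. (5): V = V0 * (d0 / d)**eta."""
    return v_ideal * (d_ideal / d_actual) ** eta


def e_sx(displacement, v_sp_sigma_ideal, d_ideal, eta=2.0):
    """Eq. (4) with a strain-scaled two-center integral: E_sx = l * V_sp_sigma."""
    d, (l, _, _) = direction_cosines(displacement)
    return l * scaled_two_center(v_sp_sigma_ideal, d_ideal, d, eta)


# Ideal zincblende bond along (1,1,1)/4 in units of the lattice constant.
d0 = math.sqrt(3.0) / 4.0
ideal = e_sx((0.25, 0.25, 0.25), 1.0, d0)
# The same bond stretched by 1 percent and slightly tilted.
strained = e_sx((0.2525, 0.2525, 0.26), 1.0, d0)
print(ideal, strained)
```

Both the direction cosine and the magnitude of the two-center integral change with the atomic displacement, which is why the strained Hamiltonian must be rebuilt from the relaxed atom positions.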
While the overlaps do not appear in an orthogonal, empirical tight-binding approach such as the one employed here, a reasonable approximation is to assume that the overlap between two nearest neighbor orbitals is proportional to their Hamiltonian matrix element divided by the sum of the vacuum-referenced onsite energies of the orbitals [Harrison (1999)]. With this approximation Löwdin's formula is used to first order in the orbital overlaps to obtain an onsite Hamiltonian matrix element that includes the effect of the displacement of the nearest neighbor atoms, Eq. (6). In Eq. (6), $E^{(0)}_{i\alpha}$ is the vacuum-referenced ideal same-site Löwdin orbital parameter for an $\alpha$-orbital on the $i$th atom, $E_{i\alpha}$ is the corresponding shifted vacuum-referenced same-site Löwdin orbital parameter, $E^{(0)}_{(i\alpha,j\beta)}$ ($E_{(i\alpha,j\beta)}$) is the ideal (lattice-mismatched) nearest neighbor parameter between an $\alpha$-orbital on the $i$th atom and a $\beta$-orbital on the $j$th atom, and $C_{i\alpha,j\beta}$ is a proportionality constant fit to properly reproduce bulk strain behavior. The sum covers all orbitals $\beta$ and atoms $j$ that are nearest neighbors of the atom $i$. The difference in squared matrix elements effectively removes the onsite shift implicit in the ideal onsite parameter and replaces it with the lattice-mismatched shift. Parameterizations of InAs and GaAs, including the strain-induced shift of the on-site elements, are discussed in a later section.

3.3 Electronic Structure Boundary Conditions

The finite simulation domain that is represented in the electronic structure calculation as a sparse matrix must be terminated by physically meaningful boundary conditions. There are currently two kinds of boundary conditions implemented in NEMO 3-D: periodic and closed system. Periodic boundary conditions, which satisfy Bloch's theorem, allow for a study of the bulk properties of alloys as long as the periodicity of the domain is much larger than the largest feature size within the domain.
Closed system boundary conditions terminate the bonds of the surface atoms abruptly. The dangling bonds are passivated with fixed potentials to avoid the inclusion of surface states in the energy range of interest. The thickness of an isolating GaAs buffer around an InAs quantum dot does influence the energy of the confined states, and the buffer size must be chosen adequately large.

Another desirable boundary condition, developed in the NEMO 1-D code, is the open boundary through which particles can be injected from reservoirs and through which particles can escape to reservoirs. The boundary conditions developed [Klimeck, et al. (199); Lake, Klimeck, Bowen, Jovanovic (1997)] for NEMO 1-D were the key to the success in the transport simulations through realistically sized resonant tunneling diodes [Bowen, et al. (1997a); Klimeck, et al. (1997)] and MOS devices [Bowen, et al. (1997b)]. These boundary conditions change the character of the Hamiltonian matrix from Hermitian to non-Hermitian, and the imaginary part of the quasi-bound state eigen-energies now corresponds to the lifetime of the state in the confinement. To enable the simulation of charge transport in NEMO 3-D, an open boundary condition for the 3-D system is currently under development.

3.4 Atomistic Strain Calculation

An accurate calculation of the electronic structure within the tight-binding model necessitates an accurate representation of the positions of each atom. The atom positions in strained materials are shifted from the ideal bulk positions to minimize the overall strain energy of the system. NEMO 3-D uses a valence force field (VFF) model [Keating (1966); Pryor, Kim, Wang (1998)] in which the total strain energy, expressed as a local nearest neighbor functional of atomic positions, is minimized. The local strain energy at atom $i$ is given by:

$$E_i = \frac{3}{16} \sum_{j}^{n} \left[ \frac{\alpha_{ij}}{2 d_{ij}^{2}} \left( R_{ij}^{2} - d_{ij}^{2} \right)^{2} + \sum_{k>j}^{n} \frac{\sqrt{\beta_{ij}\beta_{ik}}}{d_{ij} d_{ik}} \left( \mathbf{R}_{ij}\cdot\mathbf{R}_{ik} - \mathbf{d}_{ij}\cdot\mathbf{d}_{ik} \right)^{2} \right] \tag{7}$$

where the sum is over the $n$ neighbors $j$ of atom $i$. Here, $d_{ij}$ and $R_{ij}$ are the equilibrium and actual distances between atoms $i$ and $j$, respectively, and bold symbols denote the corresponding bond vectors. Eq. 7 is included as Eq. 14 in reference [] except for some corrected coefficients. The local parameters $\alpha_{ij}$ and $\beta_{ij}$ represent the force constants for bond-length and bond-angle distortions in bulk zinc-blende materials, respectively, and, in the absence of Coulomb corrections, are related to the bulk elastic moduli by:

$$C_{11} + 2C_{12} = \frac{\sqrt{3}}{4 d_{ij}} \left( 3\alpha_{ij} + \beta_{ij} \right), \qquad C_{11} - C_{12} = \frac{\sqrt{3}}{d_{ij}}\, \beta_{ij}, \qquad C_{44} = \frac{\sqrt{3}}{4 d_{ij}}\, \frac{4\alpha_{ij}\beta_{ij}}{\alpha_{ij} + \beta_{ij}} \tag{8}$$

In zinc-blende materials, however, these relations are modified by the inclusion of Coulomb effects due to the unequal charge distribution between the anion and cation sublattices.
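A minimal sketch of the local strain energy of Eq. (7) for one atom and its nearest neighbors is given below. It assumes that the bond-angle term is built from bond vectors, that the overall prefactor is the one printed in the text, and that the force constants are placeholders (alpha = 1.0, beta = 0.25) rather than the Martin values; all function names are hypothetical.

```python
import math

# Overall factor as printed in Eq. (7); it rescales the energy but does not
# change which atomic positions minimize it.
PREFACTOR = 3.0 / 16.0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def local_strain_energy(bonds, prefactor=PREFACTOR):
    """Keating-type strain energy of one atom.

    bonds: one tuple (R_ij, d_ij, alpha_ij, beta_ij) per neighbor j, where
    R_ij is the actual and d_ij the ideal bond vector (3-tuples).
    """
    energy = 0.0
    for j, (R_j, d_j, alpha_j, beta_j) in enumerate(bonds):
        d2 = dot(d_j, d_j)
        # Bond-length distortion term.
        energy += alpha_j / (2.0 * d2) * (dot(R_j, R_j) - d2) ** 2
        # Bond-angle distortion terms, each pair (j, k > j) counted once.
        for R_k, d_k, _, beta_k in bonds[j + 1:]:
            lengths = math.sqrt(d2 * dot(d_k, d_k))  # d_ij * d_ik
            bend = dot(R_j, R_k) - dot(d_j, d_k)
            energy += math.sqrt(beta_j * beta_k) / lengths * bend ** 2
    return prefactor * energy


TETRAHEDRON = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]


def tetrahedral_bonds(scale=1.0, bond_length=1.0, alpha=1.0, beta=0.25):
    """Four ideal bonds, optionally stretched uniformly by `scale`."""
    unit = bond_length / math.sqrt(3.0)
    bonds = []
    for direction in TETRAHEDRON:
        ideal = tuple(unit * c for c in direction)
        actual = tuple(scale * c for c in ideal)
        bonds.append((actual, ideal, alpha, beta))
    return bonds


print(local_strain_energy(tetrahedral_bonds()))            # unstrained
print(local_strain_energy(tetrahedral_bonds(scale=1.01)))  # 1 percent stretch
```

The total strain energy is the sum of such local terms over all atoms; a relaxation would minimize that sum with respect to all atom positions.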
In this paper, the $\alpha$'s and $\beta$'s obtained by Martin (197) to account for the Coulomb correction are used. The total strain energy is computed as the sum of the local strain energies over all atoms.

3.5 Atomistic Strain Boundary Conditions

Several boundary conditions for the strain calculation are currently implemented in NEMO 3-D. To model systems of finite extent, three boundary conditions are available: 1) the hard wall condition, in which all outer shell atoms are fixed to user-determined lattice constants, 2) the soft wall condition, in which no atom position is fixed, and 3) the soft wall condition in which one atom position in the system is fixed. To enable the simulation of bulk systems, periodic boundary conditions have been implemented. In this case the dimensions of the fundamental domain and, therefore, the separations between neighboring boundary atoms are not known a priori. Thus, the crystal is allowed to breathe, such that the strain energy is also minimized with respect to the period in each direction in which periodic boundary conditions are applied.

3.6 Eigenvalue Solution

One simulation objective is to solve the eigenvalue problem for low lying electron and hole states near the band edge. The nearest neighbor tight-binding Hamiltonian can be represented as a sparse matrix. A one million atom system represented in the sp3d5s* basis establishes a matrix size of 20 million × 20 million. A direct solver, in which the entire column space is worked on, is completely infeasible for a variety of reasons, especially due to the full matrix storage requirement of $(20\times10^{6})^{2} \times 16$ bytes = 6400 TB. A variety of sparse matrix eigenvalue and eigenvector algorithms have been developed, some of which are publicly available (footnote 12). Most of these eigenvalue/vector algorithms are some form of a Krylov/Lanczos/Arnoldi subspace approach [Golub, Van Loan (1989)]. These methods approximate the solution on a small subspace, which is increased until a desired tolerance is achieved.
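The subspace idea can be illustrated with a small, matrix-free Lanczos sketch. It is not the NEMO 3-D solver: the toy 1-D chain Hamiltonian, the function names, and the bisection step are assumptions made for the example, and the full reorthogonalization keeps every Lanczos vector, which is affordable only at toy scale.

```python
import math


def chain_matvec(x):
    """Matrix-free H*x for a toy 1-D nearest-neighbor chain (on-site 2, hopping -1)."""
    y = [2.0 * xi for xi in x]
    for i in range(len(x) - 1):
        y[i] -= x[i + 1]
        y[i + 1] -= x[i]
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lanczos(matvec, v0, steps):
    """Return diagonal (alpha) and off-diagonal (beta) of the Lanczos tridiagonal matrix."""
    norm = math.sqrt(dot(v0, v0))
    basis = [[c / norm for c in v0]]
    alpha, beta = [], []
    for j in range(steps):
        w = matvec(basis[j])
        alpha.append(dot(w, basis[j]))
        # Two Gram-Schmidt passes against all stored vectors (toy-scale robustness).
        for _ in range(2):
            for q in basis:
                c = dot(w, q)
                w = [wi - c * qi for wi, qi in zip(w, q)]
        b = math.sqrt(dot(w, w))
        if j == steps - 1 or b < 1e-12:
            break
        beta.append(b)
        basis.append([wi / b for wi in w])
    return alpha, beta


def count_below(alpha, beta, x):
    """Number of eigenvalues of the tridiagonal matrix smaller than x (Sturm count)."""
    count, d = 0, 1.0
    for i, a in enumerate(alpha):
        d = a - x - (beta[i - 1] ** 2 / d if i > 0 else 0.0)
        if d == 0.0:
            d = 1e-300
        if d < 0.0:
            count += 1
    return count


def lowest_eigenvalue(alpha, beta, tol=1e-13):
    """Bisection for the smallest eigenvalue within the Gershgorin bounds."""
    m = len(alpha)
    radius = [(beta[i - 1] if i > 0 else 0.0) + (beta[i] if i < m - 1 else 0.0)
              for i in range(m)]
    lo = min(a - r for a, r in zip(alpha, radius))
    hi = max(a + r for a, r in zip(alpha, radius))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_below(alpha, beta, mid) >= 1:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


n = 12
alpha, beta = lanczos(chain_matvec, [1.0 / (i + 1) for i in range(n)], n)
print(lowest_eigenvalue(alpha, beta), 2.0 - 2.0 * math.cos(math.pi / (n + 1)))
```

The solver touches the Hamiltonian only through `matvec`, so the matrix never needs to be stored as a dense array; this is the property that makes the matrix-vector multiplication the performance-critical kernel.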
One major advantage is that only memory of the order of the length of several eigenvectors is required. At the lowest level of the algorithm, trial vectors are repeatedly multiplied by the matrix of interest. Storage of the matrix is not mandatory if the matrix can be reconstructed on the fly during the matrix-vector multiply process. The performance of these algorithms operating on large systems, therefore, strongly depends on the efficient implementation of a matrix-vector multiply algorithm for the problem at hand.

Footnote 12: See ARPACK at software/ ARPACK/ index.html

The Lanczos-based solver technology for non-Hermitian matrices developed [Bowen, Frensley, Klimeck, Lake (199)] for NEMO 1-D was applied to NEMO 3-D as well. Early in the development of NEMO 3-D, the Lanczos eigenvalue solver prototype was compared with ARPACK. For a system of about 1, atoms it was found that our custom solver was significantly faster (footnote 13) than ARPACK. Therefore, parallelization of our custom solver was implemented to attack large-scale problems.

The folded-spectrum method [Wang, Zunger (1994)], which is based on a minimization of the squared target matrix, has been proposed, implemented, and heavily used by Zunger et al. Before the matrix is squared it is shifted to the energy range of interest, i.e. close to the expected eigenenergies. The overall algorithm is then based on a conjugate gradient minimization of a trial vector. This method also relies heavily on a matrix-vector multiply algorithm, and it has been implemented in NEMO 3-D.

3.7 Software Methods

The NEMO 3-D project leverages some of the software technology developed in the original NEMO 1-D project [Blanks, et al. (1997); Klimeck et al. (1997)] as well as improvements of NEMO 1-D undertaken at JPL (footnote 14) [Klimeck (22)]. NEMO 1-D contains roughly 2, lines of C, FORTRAN and F9 code. Data management is performed in an object-oriented fashion in C, without using C++. On the lowest level, FORTRAN and F9 are used to perform small matrix operations such as matrix inversions and matrix-vector multiplication.
The language hybrid structure was introduced to utilize fast FORTRAN and F9 compilers that were available on the SGI, HP, and Sun development machines in the early stages of NEMO 1-D. At that time identical algorithms written in FORTRAN and C showed that FORTRAN could outperform C by about a factor of 4. On today's Intel-based cluster computers such a speed discrepancy may no longer exist, in part due to the advancements in C compilers and the lack of competition for fast FORTRAN compilers.

Footnote 13: We speculate that this is in part due to our utilization of the Hermiticity of H.

Footnote 14: JPL Technical Report, NEMO Benchmarks on SUN, HP, SGI, and Intel Pentium II. PEP/ gekco/ parallel/ benchmark.html

One major software component in NEMO 1-D is the representation of materials in a tight-binding basis including various orbitals and nearest neighbor counts. Adding a new tight-binding model amounts to adding a new Hamiltonian constructor. Bulk band structure and charge transport calculations are almost independent of the underlying Hamiltonian details and form a higher level building block by themselves. This modular design enables the introduction of more advanced tight-binding models as they become available, without interfering with higher level algorithms. The sp3d5s* model has been added at JPL recently within this architecture.

A hierarchically higher software block in NEMO 1-D accesses the bulk bandstructure routines through a script-based database module. The ASCII database can be modified outside the NEMO 1-D core to contain arbitrary tight-binding input parameters as well as a variety of different database entries. The relatively simple database access to bulk bandstructure has enabled a straightforward integration of NEMO into a genetic algorithm based optimization tool. This tool is used for tight-binding parameter optimization as discussed in Section 5.1. The material parameter database is also accessed in the new NEMO 3-D code.
Most research-oriented simulators must be fed a laundry list of parameters, some of which depend on others, some of which may be superfluous, and some of which may cause crashes unless some other options have been set. Often these dependencies require an expert user, which raises the initial barrier to simulator usage. The NEMO 1-D input has been structured hierarchically such that the user can provide information in automated dependent blocks. Information is, therefore, requested from the user as a progressively dependent input. Such input presentation is customary in a properly implemented graphical user interface (GUI). Such well organized user input presentation is relatively simple to incorporate with a static GUI in software whose input is well specified. Research software under rapid development, however, tends to change its requirements frequently. Rapid changes force a static GUI to always lag behind the actual theory software that it operates. Such static design also creates a maintenance nightmare, since new options must be added at two places independently, in the theory code and in the GUI.

Such issues are addressed in NEMO 1-D and NEMO 3-D in a way that is at least novel in the electronic device simulator and electronic structure simulator field. The input groups are formulated as hierarchical C data structures that are used by the theory code as well as the GUI. The input structures are formatted by translator functions into user-friendly and storage-friendly representations, such as windows and html-like text, respectively. With the translators in place, GUI options are generated dynamically from the data structures that are determined by the requirements of the theory code. The theory programmer can add more options and data structures as needed, without concern for the representation of that information to the user or the transfer of it into and out of the simulator. With the design of the data structure translators, the development of the GUI and the theory code are essentially decoupled, and GUI, theory, and numerical developers can work on their respective blocks of code independently. The input/output design has been presented in some detail in reference [].

In NEMO 3-D this approach has been generalized significantly. The architecture of the threading of the various input/output options and data structures has been implemented in NEMO 3-D as an object-oriented, table-based inheritance. Options that require more input are associated with the creation function of that child data structure. As the user input is translated into the content of the data structure, new creation functions are put on the stack of non-entered user input. User input is requested until the stack of required user input is empty. This object-oriented input completely precludes if... then...
else input parsing in NEMO 3-D.

To tackle the data management on the various cluster computers in the High Performance Computing (HPC) group at JPL, a Tcl/Tk client-server based interface was built. This interface works with NEMO 3-D and with other completely independent simulators, such as the genetic algorithm-based optimization tools GENES (Genetically Engineered Nanostructured Devices) [Klimeck, Salazar-Lazaro, Stoica, Cwik (1999)] and EHWPack (Evolvable Hardware Package) [Keymeulen et al. (2)]. To improve the generality of this approach and to enable a web-based treatment of the overall device simulation on a remote computing cluster, a JAVA / XML based approach (footnote 15) is currently being developed.

4 Numerical Implementations and Parallel Performance

4.1 Hardware and Software Specifications

The performance of the parallelized eigenvalue solver and strain minimization algorithm implemented in NEMO 3-D is benchmarked on four different parallel computers. Three of these computers are commodity PC clusters (Beowulf) of various generations, and the fourth one is a shared memory SGI Origin 2. The three Beowulf clusters (P4, P8, and P933) are based on Intel Pentium III processors running at 4MHz, 8MHz, and 933 MHz in various memory, CPU, and network configurations. Details are shown in Table 1. The P8 has two networking systems that can operate simultaneously: 1) the standard 1Mbps Ethernet, and 2) the advanced, low latency, high bandwidth (and high breakdown experience) 1.8Gbps Myricom network (footnote 16). Most of the benchmarks discussed here are based on the P8 performance. The other machines are used to analyze issues of memory latency and speed increase with increased clock and communication speed.

Hyglac, the grandfather of Beowulf clusters, was built in the High Performance Computing (HPC) Group at JPL by Thomas Sterling et al. in 1997, and it won the Gordon Bell prize for lowest Cost/Performance at Supercomputing. Hyglac is based on a cluster of 16 2MHz Pentium Pro processors with 128MB RAM each. JPL's HPC group continued to push on Beowulf computers and is currently focused on the use of high-speed networks with real world MPI applications and large memory usage.

All of the parallel algorithms discussed in this paper are implemented with the message passing interface (MPI) [Gropp, Lusk, Skjellum (1997); Gropp, Lusk (1997)]. The SGI has its own proprietary implementation of MPI, which utilizes the fast SGI interconnect as well as the shared memory within one 4-CPU board. Various MPI/MPICH [Gropp, Lusk, Skjellum (1997); Gropp, Lusk (1997)] releases have been installed on the hardware in Table 1 throughout the last three years. On the dual CPU Beowulf, the shared memory versus distributed memory configurations of MPICH have been examined for their relative performance. Small performance increases due to the shared memory / reduced communication cost have been found in the electronic structure calculation. Even if the shared memory option is turned off, the communication from one CPU to the other on the same board is faster than to a CPU off-board. Apparently the network card relays the communication back to the on-board CPU without actually sending the message to the switch. A disadvantage of the shared memory implementation is the a priori determination of a maximum message buffer size as an environment variable before the software is executed. The simulation will fail if it exceeds that maximum communication buffer size. Due to this static handicap and the minimal performance increase, the non-shared memory model is typically chosen.

Footnote 15: See WIGLAF at subpages/ reports/ 1report/ WIGLAF/ WIGLAF-1.htm

Footnote 16: See Myricom, in

Table 1: Specifications of the parallel computers used in this work.
Columns: Name; CPU; Clock (MHz); RAM per node (GB); Bus (MHz); CPUs per node; Nodes; CPUs; RAM (GB); Network; Purchase; Motherboard.
SGI: CPU R; motherboard SGI Origin 2.
P4: CPU PIII; network Mbps; purchase 1999; motherboard Shuttle, Intel 44BX chipset.
P8: CPU PIII; network Mbps, 1.8Gbps; purchase 2; motherboard Supermicro 37DLE, Intel LE chipset.
P933: CPU PIII; network Mbps; purchase 21; motherboard Supermicro 37DL3, Intel LE chipset.

Parallelization efficiency using OpenMP has been explored in the early stages of the development process as an enhancement to MPI. The objective is to communicate from CPU board to CPU board with MPI, and within a board with OpenMP and shared memory. In the example algorithms that have been explored, the creation and destruction of threads using OpenMP were found to cause such a large overhead that the parallel efficiency was unsatisfactory. For that reason the combined MPI and OpenMP approach was abandoned. OpenMP was not pursued as an overall parallel communication scheme across the cluster, since no reliable cluster-based OpenMP compilers were available.
4.2 Parallel Implementation of Sparse Matrix-Vector Multiplication

The numerically most intensive step in the iterative eigenvalue solution discussed in Section 3.6 is the sparse matrix-vector multiplication of the matrix H and the trial vector $|\Psi_n\rangle$. For example, the matrix-vector multiplication of the tight-binding Hamiltonian in a 1 million atom system with 4 neighbors per atom in a 10 orbital, explicit spin basis (sp3d5s*) requires roughly 5 million full 20 × 20 complex matrix-vector multiplications. This corresponds to $5\times10^{6} \times 400 = 2\times10^{9}$ complex multiplications, each of which costs several double precision multiplications and additions. The single matrix-vector multiplication step can, therefore, be estimated as 12 Gflop. In the sp3s* basis used in the benchmarks shown in Section 4.4 the operation count is reduced by a factor of 4 to about 3 Gflop. These estimates exclude overhead for the sparse matrix reconstruction, memory alignment, and construction of the fully assembled target vector $|\Psi_{n+1}\rangle$. With an expected iteration count in the Lanczos algorithm of 2, a total number of operations of 3 Tflop and 12 Tflop are anticipated for the sp3s* and sp3d5s* model, respectively. With a single CPU operating at . Gflops, such computations continue through .7 and 2.8 days, respectively. Actually, . Gflops appears to be a high estimate for sustained computational throughput on the latest 2 GHz Pentium 4 chips. Three years ago, when this project was initiated, peak performance was about a factor of slower. The reduction in wall clock time for the completion of such a computation is highly desirable. This is particularly true for systems in excess of ten million atoms. The 3 to 12 Gflop needed to perform a single matrix-vector multiplication correspond to 3 or 12 seconds on a single . Gflop machine. This load is large enough to warrant parallelization on multiple CPUs.

For implementation on a distributed memory platform, data must be partitioned across processors to facilitate this fundamental operation. For good load balance, the device is partitioned into approximately equally sized sets of atoms, which are mapped to individual processors. Because only nearest neighbor interactions are modeled, a naive partition of the device by parallel slices creates a mapping such that any atom must communicate with neighbors that are, at most, one processor away. This scheme, shown in Figure 2a), lends itself to a 1-D chain network topology and results in a block-tridiagonal Hamiltonian for non-periodic boundary conditions, in which each block corresponds to a pair of processors and each processor holds the column of blocks associated with its atoms (Figure 2b). The gray squares in the corners symbolize fill-in regions due to periodic boundary conditions. Communication cost, roughly proportional to the boundary separating these sets, scales only with surface area ($O(n^{2/3})$) rather than with volume ($O(n)$), where $n$ is the number of atoms.

In a matrix-vector multiplication, both the sparse Hamiltonian and the dense vector are partitioned among processors in an intuitive way: each processor $p$ holds unique copies of both the nonzero matrix elements of the sparse Hamiltonian associated with the orbitals of the atoms mapped to processor $p$ and the components of the dense vector associated with atomic orbitals mapped to $p$. The matrix-vector multiplication is performed in a column-wise fashion as shown in Fig. 2b).
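The slab partition described above can be sketched as follows. This is a simplified illustration, not the NEMO 3-D partitioner: atoms are sorted along one axis and cut into contiguous sets of nearly equal size, and the function name and the choice of the x axis are assumptions. A production code would additionally align the cuts with atomic planes so that every neighbor stays at most one processor away.

```python
def partition_into_slabs(atoms, num_procs, axis=0):
    """Split atom positions into contiguous slabs of nearly equal atom count."""
    ordered = sorted(atoms, key=lambda r: r[axis])
    base, extra = divmod(len(ordered), num_procs)
    slabs, start = [], 0
    for p in range(num_procs):
        size = base + (1 if p < extra else 0)  # spread the remainder evenly
        slabs.append(ordered[start:start + size])
        start += size
    return slabs


# Toy device: a 10 x 3 x 3 block of sites on a simple cubic grid.
atoms = [(x, y, z) for x in range(10) for y in range(3) for z in range(3)]
slabs = partition_into_slabs(atoms, 4)
print([len(s) for s in slabs])
```

Because the slab boundaries are planes, the number of boundary atoms per processor stays fixed as the slab thickness grows, which is the surface versus volume argument made in the text.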
That is, processor $j$ computes:

$$y_{i,j} = H_{i,j}\, x_j \qquad (i = j,\ j \pm 1) \tag{9}$$

where $H_{i,j}$ is the block of the Hamiltonian associated with nodes $i$ and $j$, and $x_j$ are the components of $x$ stored locally on node $j$. There are three results generated by the multiplication on processor $j$: the diagonal components $y_{j,j}$, which are needed locally by processor $j$, and two off-diagonal components $y_{j-1,j}$ and $y_{j+1,j}$, which need to be communicated to processors $j-1$ and $j+1$, respectively. Within the same scheme, processors $j-1$ and $j+1$ each share one of their off-diagonal results with processor $j$.

This scheme lends itself to a two-step communication process. In the first step of the two-step process all even numbered CPUs, $2n$, communicate to the CPU to the right, $2n+1$. All odd numbered CPUs, $2n+1$, issue a communication command to CPUs $2n$. This communication is issued with the MPI command MPI_Sendrecv, which can be implemented in the underlying MPI library as a full duplexing operation. That means that once the communication channel is established, which can take a significant time on a standard 1 Mbps Ethernet, the information packages can be exchanged in both directions simultaneously. In the second communication step all even numbered CPUs, $2n$, communicate to the left, $2n-1$. Simultaneously all odd CPUs $2n-1$ communicate to the even CPUs $2n$. Within this communication scheme collisions between messages do not occur, and messages do not accumulate on one CPU while other CPUs wait for the completion of the communication (footnote 17).

The message size can be reduced by a compression scheme, since most of the off-diagonal blocks are zero. The sparse structure of the blocks depends on the particular crystal structure in question. In practice a sufficient fraction of zero rows exists such that compressing the matrix-vector multiplication by removing structurally guaranteed zeros is worthwhile, despite the additional level of indirection required to track the non-zero structure.
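The column-wise multiplication of Eq. (9) and the two-step exchange can be mimicked serially, as in the sketch below. It is a sketch under stated assumptions: the "processors" are list indices, each paired exchange stands in for one MPI_Sendrecv call, only non-periodic boundaries are handled, and the block values and function names are made up for the example.

```python
def matvec(block, x):
    return [sum(a * b for a, b in zip(row, x)) for row in block]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def exchange_pairs(num_procs, phase):
    """Phase 0 pairs CPUs (2n, 2n+1); phase 1 pairs CPUs (2n-1, 2n)."""
    start = 0 if phase == 0 else 1
    return [(p, p + 1) for p in range(start, num_procs - 1, 2)]


def parallel_matvec(H, x):
    """H[(i, j)] couples slab j to slab i; x[j] is the vector section of slab j."""
    P = len(x)
    # Local work on processor j: y_ij = H_ij x_j for i = j-1, j, j+1 (Eq. 9).
    partial = {(i, j): matvec(H[(i, j)], x[j])
               for j in range(P) for i in (j - 1, j, j + 1) if (i, j) in H}
    y = [partial[(j, j)] for j in range(P)]
    for phase in (0, 1):
        for left, right in exchange_pairs(P, phase):
            # Both directions are exchanged in one step, as with MPI_Sendrecv.
            y[right] = add(y[right], partial[(right, left)])
            y[left] = add(y[left], partial[(left, right)])
    return y


# Toy block-tridiagonal matrix: 5 slabs with 2 x 2 blocks of arbitrary values.
P = 5
H = {(i, j): [[1.0 + i + 2 * j, 0.5 * (i - j)], [0.25 * i, 2.0 - j]]
     for j in range(P) for i in (j - 1, j, j + 1) if 0 <= i < P}
x = [[1.0 + j, -0.5 * j] for j in range(P)]
print(parallel_matvec(H, x))
```

Every neighboring pair appears in exactly one of the two phases, so no processor has to serve two partners at the same time.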
The 1-D decomposition scheme performs well when the ratio of the number of atoms on the surface of the slab to the total number of atoms in the slab is small. As the number of CPUs in the parallel computation increases, for a given problem size, the surface to volume atom ratio increases to a limit of one, and the communication to computation ratio increases as well. Spatial decomposition schemes more elaborate than the 1-D scheme presented here can be implemented. One example is the 3-D decomposition into small cubes. Such schemes would probably enable the efficient participation of more CPUs in the computation; however, they come with immediately increased communication overhead, since each CPU must exchange data with six rather than two surrounding CPUs. The following sections explore the scaling of the simple 1-D topology parallel algorithms and show reasonable scaling for the mid-size clusters that are available at the High Performance Computing Group at JPL.

Footnote 17: Only if periodic boundary conditions are applied with an odd number of CPUs in the MPI run does one need three communication cycles, due to a conflict at the first and the last CPU communication.

Figure 2: (a) The device is decomposed into slabs (layers of atoms) which are directly mapped to individual processors. The gray blocks in the corner indicate the optional filling due to periodic boundary conditions. (b) Example matrix-vector multiplication on processors performed in a column-wise fashion, so that the $j$-th block column and section $x_j$ are stored on processor $j$. The nearest neighbor model with non-periodic boundary conditions guarantees that the Hamiltonian is block-tridiagonal, so that communication is performed only with nearest neighbor processors.

4.3 Hamiltonian Storage and Memory Usage Reduction

The first NEMO 3-D prototypes were focused on the generality of the tight-binding orbitals and explored the reduction of the memory requirements to simulate realistically sized structures of several million atoms. The memory requirement for storing the sparse matrix tight-binding Hamiltonian for a 1 million atom system in a 10 spin-degenerate orbital basis can be estimated as $10^{6}$ atoms × 5 diagonals × (20 × 20 basis) × 16 bytes / 2 (for Hermiticity) = 16 GB. Additional memory storage is needed for atom positions, eigenvectors, etc.; therefore the 16 GB available in the P4 is inadequate. If the system of interest is unstrained, as is the case for free standing quantum dots [Lee, Joensson, Klimeck (21)], the memory requirement is reduced dramatically, since only a few uniquely different neighbor interactions need to be stored. The overall Hamiltonian can be generated from the replication of the few unique elements.
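The storage estimate above is simple arithmetic and can be checked with a short sketch. The function names and the use of decimal gigabytes (10^9 bytes) are assumptions made for the example; the dense figure is included only to contrast with the sparse one.

```python
def sparse_hamiltonian_bytes(atoms, neighbors=4, orbitals=10, spin=2,
                             bytes_per_complex=16, hermitian=True):
    """Storage for one basis x basis block per atom and per nearest neighbor."""
    basis = orbitals * spin                  # 20 for sp3d5s* with explicit spin
    blocks_per_atom = neighbors + 1          # on-site block plus 4 neighbors
    total = atoms * blocks_per_atom * basis * basis * bytes_per_complex
    return total // 2 if hermitian else total


def dense_matrix_bytes(atoms, orbitals=10, spin=2, bytes_per_complex=16):
    """Storage for the full (dense) matrix of the same system."""
    order = atoms * orbitals * spin
    return order * order * bytes_per_complex


print(sparse_hamiltonian_bytes(10**6) / 1e9, "GB sparse")
print(dense_matrix_bytes(10**6) / 1e12, "TB dense")
```

Passing `orbitals=5` gives the corresponding estimate for the smaller sp3s* basis, which is a factor of 4 below the sp3d5s* value.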
Since immediate interest was focused on solid-state implementations on a bulk substrate, such simplifications were not in the immediate development path and they have not yet been implemented in NEMO 3-D. However, such a scheme was pursued in the NEMO 1-D transport code, where the memory storage was arranged such that the Hamiltonian matrix elements fit completely into cache memory. This scheme allowed the rapid computation of the transport kernel [Bowen et al. (1997)] using the recursive Green function algorithm, which scales linearly with the number N of lattice sites. The resulting computation time for a single energy pass through the whole Hamiltonian is so small that the computation of a single transport kernel element cannot be parallelized efficiently [Klimeck (22)].

The individual tight-binding Hamiltonian construction can be formulated as a table look-up operation, which is not, in principle, time consuming, except for the scaling of the nearest neighbor coupling elements due to strain (Eqs. and 6). Therefore, the first implementation of the matrix-vector multiplication does not store the Hamiltonian, but re-computes the Hamiltonian on the fly in each multiplication step. Hamiltonian storage became more feasible for million atom size systems when P8 with its 64 GB of total memory came on-line in the year 2. The first Hamiltonian storage implementation stores the entire block of size basis × basis for each atom and its neighbor interactions. This storage scheme preserves the generality of the code and the independent choice of the number of orbitals. Timing experiments similar to those presented in Section 4.4 show that the speed increase due to Hamiltonian

storage is surprisingly small on the Beowulf systems, but is significant on the SGI. The low speed increase on the Beowulf may be associated with memory latency issues of the Pentium architecture. A further reduction in memory usage is, therefore, desirable.

A more detailed analysis of the sp3s* and sp3d5s* Hamiltonian blocks provides insight into the memory allocation actually needed to store the Hamiltonian. The diagonal blocks are only filled on their diagonal and on a small number of off-diagonal sites. These off-diagonals are in general complex and describe the spin-orbit coupling of the spin-up and the spin-down Hamiltonian blocks. The off-diagonal blocks of the Hamiltonian can be separated into smaller spin-up and spin-down components, which are identical and real. This symmetry can be exploited to reduce the Hamiltonian storage requirement by a factor of 8 for both the sp3s* and the sp3d5s* models. A priori knowledge of which matrix elements are real and which are complex can be utilized to increase the speed of the custom matrix-vector multiplication. A speed increase due to the compact storage scheme of slightly over compared to the original storage scheme has been observed. This custom storage and matrix-vector multiplication scheme is used in the benchmarks in this paper when the Hamiltonian is stored. The utilization of C data management and the simple explicit access to real and imaginary elements of complex numbers leads to significantly faster small matrix-vector multiply algorithms in C compared to FORTRAN or F

4.4 Lanczos Scaling with CPU Number

This section describes the performance analysis of 3 Lanczos iterations on P8 in a variety of load distribution and memory storage schemes as a function of utilized CPUs.
The execution time for seven different systems consisting of 1/4 to 16 million atoms, for a Hamiltonian matrix that is reconstructed at each matrix-vector-multiplication step, is shown in Figure 3a). The sp3s* model is used in these simulations, resulting in 10 × 10 Hamiltonian matrix sub-blocks. In the 1 million atom system case, the problem is equivalent to a matrix of , and the Myricom communication path is utilized. The nearest neighbor CPU communication limitation (discussed in Section 4.2) limits the 1/4, 1/2, 1, and 2 million atom systems to a maximum number of parallel processes of 32, 4, 1, and 63, respectively. The 4, 8, and 16 million atom systems cannot run on a single CPU, because the single CPU RAM on P8 would be exceeded. Even without Hamiltonian storage, these larger systems require at least 2, 1, and 16 CPUs, respectively, to avoid swapping.

Since P8 consists of 32 dual CPU nodes, a variety of loading schemes are possible in the distribution of MPI processes to the various CPUs. Figure 3a) explores two schemes: 1) dashed lines with crosses: one process per node (1 CPU idle), and 2) solid lines with circles: two processes per node (both CPUs active). Although the single process per node distribution incurs an increased cost in communication off the node, the overall computation time is slightly less than in the two processes per node case for system sizes of 1/4 to 4 million atoms. Larger systems (8 and 16 million atoms) perform significantly better with the one process per node configuration. It appears more efficient to leave one CPU idle and utilize all the memory on board, rather than use all the CPUs and share the memory between two CPUs on the same board. This behavior can be associated with a memory latency / competition problem, and it is examined further below. The green dashed lines in Figure 3a) indicate perfect scaling for the 1 and 4 million atom system sizes.
An increasing deviation from ideal scaling is observed with an increased number of CPUs. However, the computation time is still reduced when the number of CPUs is increased. Figure 3b) shows the efficiency computed as the ratio of ideal time to actual time (1 and 4 million atom systems in red and blue, respectively). A serial to parallel code ratio of 1.6% can be extracted if the 1 million atom, two processes per node efficiency curve is fitted to Amdahl's law. This ratio indicates a high degree of parallelism in the code.

Figure 3: (a) Execution time of 3 Lanczos iterations on P8. The green dashed lines illustrate ideal scaling. Solid line: 2 processes per node (2Px); dashed line: 1 process per node (1Px). First row: recomputed Hamiltonian (x=r); second row: stored Hamiltonian (x=s). (b) Efficiency, defined as the ratio of ideal compute time to actual compute time. (c) and (d) Similar to (a) and (b), except that the Hamiltonian is not recomputed in each step, but stored in the first step. (e) Speed-up due to Hamiltonian storage for 1 and 4 million atom systems. (f) Execution time on 24 processors as a function of system size.

The reconstruction of the Hamiltonian matrix at each matrix-vector-multiplication step saves memory, but does require additional computation time. The performance of the matrix-vector-multiplication step can be improved through Hamiltonian matrix storage and the utilization of the sp3s* and sp3d5s* Hamiltonian submatrix symmetries (see Section 4.3). Analogous to Figure 3a), Figure 3c) shows the parallel performance in the case of Hamiltonian storage. With the increased storage requirements, the minimum number of CPUs required for the swap-free matrix-vector multiplication for systems containing 1, 2, 4, and 8 million atoms increases to 2, 4, 6, and 16 CPUs from 1, 1, 2, and 1 CPUs. The 16 million atom system no longer fits onto P8.

With the increased memory requirement, the distribution of processes onto different compute nodes becomes much more critical, even for smaller problem sizes. This result indicates clearly that the 2 CPUs on each motherboard compete for memory access at a significant performance cost. It appears to be more efficient to place a single process on each node for system sizes that are larger than about 4 million atoms when the Hamiltonian is stored, compared to 8 million atoms when the Hamiltonian is reconstructed. The 8 million atom simulation incurs dramatic performance losses if run on 2 processes per node, similar to the 16 million atom case without Hamiltonian storage shown in Figure 3a).

Figure 3d) shows a greater parallel efficiency of the stored Hamiltonian algorithm versus the recomputed Hamiltonian algorithm of Figure 3b). However, the point of ideal performance increases from 1 CPU, since the problems no longer fit onto a single CPU. Comparing the ideal scaling indicated by the green lines in Figure 3a) and (c) shows that the stored Hamiltonian algorithm scales better with an increasing number of CPUs. This observation contradicts the expectation that a more CPU intensive calculation, such as the slower recomputed Hamiltonian algorithm, should scale better than a lower intensity job, such as the faster stored Hamiltonian algorithm. At this time, an explanation of why the stored Hamiltonian algorithm scales better than the recomputed Hamiltonian algorithm is not available.

Figure 3e) shows the speed increase due to Hamiltonian storage for systems of 1 and 4 million atoms, derived from the data shown in Figure 3a) and (c). Both system sizes show a greater speed increase when one process rather than two resides on a node.
The speed increase due to storage is not constant, but increases with an increasing number of CPUs. The total memory used per CPU decreases with an increasing number of participating CPUs. This memory reduction reduces the competition for memory access, and the speed increase curves therefore rise with an increasing number of CPUs. Competition for memory between the 2 processes on a single node with 2 CPUs is again visible.

With an estimate of 3 Gflop for a single matrix-vector multiplication in a 1 million atom system (see Section 4.2) and an execution time of about 1247 seconds for 3 iterations on a single CPU in Figure 3a), an operation rate of .7 Gflops is obtained. Using 24 CPUs and 81 seconds, the operation rate is 1.1 Gflops. For the largest achievable 16 million atom system running on 2 CPUs for 23 seconds, a .61 Gflops rating can be achieved. These operation rates exclude the operations needed to reconstruct the Hamiltonian on the fly. Hamiltonian storage roughly triples or quadruples these Gflops ratings. Figure 3 shows that the Lanczos algorithm performs well enough to enable the simulation of 8 and 16 million atom systems on reasonably sized Beowulf clusters. The sustained Gflop results are well within the expectations of a realistic application.

4.5 Lanczos Scaling With System Size

The preceding Section 4.4 presented the scaling of the Lanczos algorithm as a function of the number of CPUs employed for different system sizes on the P8 cluster. This section discusses a subset of the same data as a function of system size for a fixed number of 24 CPUs. Four different data sets are considered, based on the cross-product combination of 1 or 2 processes per node (symbols 1Px and 2Px, respectively) and stored or recomputed Hamiltonian (x=s and x=r, respectively). Figure 3f) shows a plot of wall clock time as a function of the number of atoms in the simulation domain, N. The curves appear to be almost linear in N.
Through linear regression the curves can be fitted to:

T(2Pr) = N , R =
T(1Pr) = N^.99821, R =
T(2Ps) = N^1.264, R =
T(1Ps) = N , R =

The fitted exponents range from N^.998 to N with a high regression value R > .999. The total computation time depends not only on the time consumed by the matrix-vector multiplication, but also on the number of iterations needed for convergence within the Lanczos algorithm. Experience shows that the number of iterations needed to obtain a certain number of bound eigen-states in a quantum dot system depends weakly on the system size. Typical iteration counts are of the order of 1 to. The Lanczos solver presented in this work, therefore, scales roughly linearly with the system size.
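A power-law fit of wall time against system size, as used above, amounts to a linear regression on the logarithms of both quantities. The sketch below shows one way to do this with the standard library only. The function name `fit_power_law` and the timing data are hypothetical: the numbers are synthetic values generated for illustration and are not the benchmark results of Figure 3f).

```python
import math


def fit_power_law(sizes, times):
    """Fit T = a * N**b by least squares on log(T) versus log(N).

    Returns the prefactor a, the exponent b, and the correlation
    coefficient r of the log-log regression.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    prefactor = math.exp(mean_y - exponent * mean_x)
    correlation = sxy / math.sqrt(sxx * syy)
    return prefactor, exponent, correlation


if __name__ == "__main__":
    # Synthetic example: system sizes in millions of atoms and wall
    # times in seconds that scale exactly linearly with size.
    sizes = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
    times = [80.0 * n for n in sizes]
    a, b, r = fit_power_law(sizes, times)
    print(f"T = {a:.1f} * N^{b:.3f}, r = {r:.4f}")
```

An exponent close to one, together with a correlation coefficient close to one, is what "almost linear in N" means in this context.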

4.6 Lanczos Performance with Different Network Speeds

The Myricom 1.8 Gbps networking system is utilized in the simulations shown in Figure 3. The Myricom network can be directly compared to a standard 1 Mbps Ethernet on P8, since both networks are installed independently. For the benchmarks shown in Figure 3, virtually identical results are obtained if the simulation is performed on the significantly slower Ethernet network. This result indicates that the algorithm is not communication limited.

4.7 Examination of Memory Latency by Comparison of Different Machines

Section 4.4 showed that the dual CPU P8 machine suffers from performance degradation due to memory access in the computation of large systems or a stored Hamiltonian. This section examines this performance bottleneck further by comparing the Pentium-based cluster machines with the SGI machine (see Table 1 for the machine specifications). Figure 4a) compares execution times of 3 Lanczos iterations on P8 (red), P4 (blue), and SGI (black) with (dashed line) and without (solid line) storage of a 2 million atom Hamiltonian. The P4 outperforms the SGI without Hamiltonian storage by a factor of 1.6 to 1.9. The fast, yet expensive memory of the SGI produces a more dramatic speed increase due to storage than on P4, so that with a stored Hamiltonian the two machines have roughly the same performance on this problem. Figure 4b) shows that the speed increase for SGI reaches a factor of about 9 while it reaches a factor of. on P4. P8 only achieves speed increase factors of about 3 to 4 due to Hamiltonian storage, depending on the node load configuration; however, P8 still outperforms the significantly more expensive (and 2 years older) SGI by a factor of approximately 2.
The memory latency problem can also be examined by comparing execution times of the same executable and the same communication network type (1Mbps) on the P4, P8, and P933 machines, with the number of CPU cycles plotted as a function of the number of parallel CPUs employed. The number of cycles is estimated as the total wall time multiplied by the frequency rating of the CPU in MHz. Figure 4c) shows such a plot for a system of 1 million atoms. If the Hamiltonian is recomputed on the fly and the required memory is small, all three machines require an almost identical number of cycles to compute the 3 Lanczos iterations and the curves lie on top of each other. By contrast, if the memory usage is increased due to the Hamiltonian storage, P4 requires fewer CPU cycles to compute the same problem than the machines with a higher frequency rating. The additional cycles are spent waiting for data to arrive from memory at the fast CPUs, which perform the computation faster than the memory delivery takes place.

4.8 Parallel Strain Algorithm Performance

The minimization of the total strain energy is numerically significantly less taxing than the electronic structure calculation. The strain computation was therefore not immediately parallelized. However, simulations of systems with 1 million atoms or more show that the serial strain computation becomes computationally as taxing as the parallel electronic structure calculation that it precedes. The mechanical strain calculation has therefore been parallelized as well. This strain parallelization, combined with the parallel electronic structure calculation, enabled some of the alloy simulations shown in this paper as well as the bulk alloy simulations shown previously [Oyafuso, Klimeck, Bowen, Boykin (22)]. Data are distributed in the same manner as in the electronic calculation: the simulation domain is decomposed into slabs such that atomic information associated with atoms within a slab is held by only one processor (see Figure 2a)).
Message passing then takes place only between neighboring processors and the message size is proportional to the surface area of each slab, since the locality of the strain energy requires only that positions of atoms on the boundary be passed. Since the gradient of the strain energy in Eq.(7) is just as computationally inexpensive to determine as the total strain energy itself, a conjugate-gradient-based method that uses the derivative with respect to atomic configuration and periodicity to perform the line search (see footnote 18) is used to minimize the strain energy. The parallelization of the algorithm occurs on two levels. First, the conjugate-gradient-based minimization involves computation of various inner products through a sum reduction and broadcast. Second, the function (and gradient) call to determine the local strain energy at an atomic site requires information about neighboring atoms that may lie on neighboring processors.

18 See, for example, macopt in mackay/c/ macopt.html

Only position information of atoms on neighboring

processors that are on the boundary is sent.

Figure 4: (a) Execution time of 3 Lanczos iterations on three different machines: P8 (red), P4 (blue), and SGI (black) with (dashed line) and without (solid line) storage of a 2 million atom Hamiltonian. (b) Corresponding speed increase due to Hamiltonian storage (solid lines). Dashed line corresponds to P8 with 1 process per node. (c) Number of CPU cycles (wall time times CPU frequency) for the P4, P8, and P933 machines with and without stored Hamiltonian.

Figure 5: Wall clock time to compute the strain in a 1 million atom system on P933 with its 1Mbps network (dashed line with circles) and on P8 with its 1.8Gbps network (solid line with stars).

Figure 5 shows scaling results for the wall time required to achieve convergence for a system of size nm consisting of approximately one million atoms. The simulation was run on two different hardware configurations: P8 connected by the 2 Gbps Myrinet (solid line with stars) and P933 connected by standard 1 Mbps Ethernet (dashed line with circles). No shared memory was used in either case. On a single processor, there are no communication costs, and P933 outperforms P8 by about the ratio of the clock frequencies, 8/933. As the number of processors is increased, however, the ratio of communication cost to computational cost increases: the communication expense is proportional to the surface area of the slabs, which remains fixed, while the computational cost is proportional to the slab volumes and is thus inversely proportional to the number of processors. This reduction in efficiency with processor number is most evident for the slow 1 Mbps network.
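The argument that a fixed slab surface and a shrinking slab volume drive up the communication-to-computation ratio can be illustrated with a toy cost model. The sketch below is not the timing model of the strain code; the function names and the two cost parameters are hypothetical quantities introduced only to show the trend.

```python
def slab_costs(num_procs, total_compute_time, boundary_exchange_time):
    """Return (compute, communicate) time per processor for a 1-D slab split.

    total_compute_time     -- hypothetical serial compute time; it is divided
                              evenly because slab volume scales as 1/num_procs
    boundary_exchange_time -- hypothetical time to exchange one slab surface;
                              it stays fixed because the slab cross-section
                              does not change with num_procs
    """
    compute = total_compute_time / num_procs
    communicate = 0.0 if num_procs == 1 else boundary_exchange_time
    return compute, communicate


def comm_to_compute_ratio(num_procs, total_compute_time, boundary_exchange_time):
    """Ratio of communication time to computation time per processor."""
    compute, communicate = slab_costs(
        num_procs, total_compute_time, boundary_exchange_time
    )
    return communicate / compute


if __name__ == "__main__":
    for procs in (1, 2, 4, 8, 16, 32):
        ratio = comm_to_compute_ratio(procs, 1000.0, 5.0)
        print(f"{procs:>2} processors: communication/computation = {ratio:.3f}")
```

In this model the ratio grows linearly with the number of processors, so a slower network (a larger `boundary_exchange_time`) loses efficiency at a smaller processor count.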
Using Ethernet, the execution time is more than a factor of two greater than with the Myricom 1.8 Gbps network. For the mechanical strain calculation, a significant improvement of the scaling with an increasing number of CPUs is observed when a faster, low latency network is used. This result differs from the electronic structure calculation discussed in Section 4.4. In that case no speed increase or improved scaling with an increasing number of CPUs was observed (and therefore not shown in a graph). This discrepancy is a result of the larger computational demand in the electronic structure calculation. The mechanical strain calculation deals only with three real numbers (the displacements from some ideal position) for each atom and with the relative distance to its surrounding four neighbors. The electronic structure calculation, by contrast, deals with 10 × 10 and 20 × 20 complex matrices for each atom and its four neighbors.

5 Bulk Material Parameterizations and Properties

5.1 Genetic Algorithm-Based Fitting

Electronic structure calculations in the lowest conduction and the highest valence band require a good parameterization of the band gaps, effective masses and band-anisotropies (for the holes). One of the drawbacks of the empirical tight-binding models is that there is no simple relation between these physical observables and the orbital energies. The analytical formulas that have been developed in the past [Boykin, Klimeck, Bowen, Lake (1997); Boykin (1997); Boykin, Gamble, Klimeck, Bowen (1999)] serve as a guide for the general capability of a particular model and show that the optimization space is not smooth. The fitting of the parameters using these formulas has led to dramatic improvements in the simulation capabilities of high performance resonant tunneling diodes [Bowen, et al. (1997a); Klimeck, et al. (1997)], although the process remained tedious at best.

A thorough parameterization of the sp3d5s* model has been published by [Jancu, Scholz, Beltram, Bassani (1998)]. A large number of the technically relevant III-V materials as well as elemental semiconductors have been parameterized in their work. They have also optimized orbital-dependent distance scaling exponents η to fit strain-dependent quantities such as deformation potentials. To enhance the performance of the model with strain in a layered superlattice configuration, Jancu et al. have developed a method where the d orbital on-site energy is shifted as a function of strain. For the general 3-D electronic structure case that is the subject of this work, a more general treatment of the on-site energies as a function of strain must be included. In the NEMO 3-D implementation of the tight-binding model, all on-site energies can be shifted due to strain in an arbitrary 3-D configuration.
To automate the fitting of the orbital tight-binding parameters to the desired bulk material properties [Madelung (1996); Landolt-Bornstein (1982); Jancu, Scholz, Beltram, Bassani (1998)], a genetic algorithm (GA) based software package was developed. The details of this algorithm and several improved material parameterizations are described elsewhere [Klimeck et al. (2); Klimeck, Bowen, Boykin, Cwik (2)]. The general idea of the GA is the stochastic exploration of a parameter space with a large set of individuals, which represent different parameter configurations. The individuals are measured against a certain desired fitness function and ranked. Some of the individuals (for example 1% of the worst performers) are thrown out of the gene pool and replaced by new individuals that are derived from better performers by cross-over and mutation operations. The parameterizations used in this work have been obtained using this GA approach, starting from earlier parameterizations [Boykin, Klimeck, Bowen, Lake (1997); Boykin (1997); Boykin, Gamble, Klimeck, Bowen (1999); Jancu, Scholz, Beltram, Bassani (1998)]. The following sections present the parameterization data and the resulting unstrained and strained bulk properties.

5.2 Parameter Tables and Bulk Properties

Table 2 lists the parameters that enter the sp3s* model used in this paper. The parameterization for InAs was obtained from a GA, while the GaAs data was originally delivered by Boykin to the NEMO 1-D project. No effort has been made in this parameterization to fit the off-diagonal or the diagonal matrix element strain corrections. All off-diagonal matrix elements are scaled with the ideal exponent [Harrison (1999)] of η=2 and the diagonal correction is set to zero. An explicit InAs valence band offset vs. GaAs of .2279 is used in this parameterization. The sp3d5s* parameterization, in contrast, is based on common atom potentials and has the valence band offset built into the parameter set.
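Returning briefly to the fitting procedure of Section 5.1, the rank, discard, cross-over and mutate cycle described there can be sketched in a few lines. The code below is a toy illustration and not the GA package referenced in the text: the fitness function (a sum of squared deviations from made-up target values), the population size, the replaced fraction and the mutation scale are all hypothetical choices.

```python
import random


def fitness(params, targets):
    """Toy fitness, lower is better: squared deviation from the targets."""
    return sum((p - t) ** 2 for p, t in zip(params, targets))


def evolve(targets, pop_size=40, generations=60, replace_fraction=0.1,
           mutation_scale=0.1, seed=0):
    """Run a minimal genetic algorithm and return (best, history).

    Each generation the population is ranked, the worst performers are
    discarded, and they are replaced by children built from better
    performers through cross-over and mutation.
    """
    rng = random.Random(seed)
    dim = len(targets)
    population = [[rng.uniform(-5.0, 5.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    n_replace = max(1, int(replace_fraction * pop_size))
    history = []
    for _ in range(generations):
        population.sort(key=lambda ind: fitness(ind, targets))
        history.append(fitness(population[0], targets))
        survivors = population[:pop_size - n_replace]
        children = []
        for _ in range(n_replace):
            # Parents are drawn from the better half of the population.
            mother, father = rng.sample(survivors[:pop_size // 2], 2)
            # Cross-over: take each parameter from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: add a small random perturbation to every parameter.
            child = [gene + rng.gauss(0.0, mutation_scale) for gene in child]
            children.append(child)
        population = survivors + children
    population.sort(key=lambda ind: fitness(ind, targets))
    history.append(fitness(population[0], targets))
    return population[0], history


if __name__ == "__main__":
    best, history = evolve([1.0, -2.0, 0.5])
    print("best parameters:", [round(value, 3) for value in best])
    print("fitness: first", round(history[0], 4), "last", round(history[-1], 6))
```

Because the best individual always survives into the next generation, the best fitness in this sketch can never get worse from one generation to the next. In a real fit the fitness would compare computed band gaps, masses and band edges against their weighted targets instead of comparing the parameters directly.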
Table 3 shows the complete parameterization of GaAs and InAs in our sp3d5s* tight-binding model, including the off-diagonal and diagonal strain scaling parameters. In this model a good fit based on common atomic potentials of the As in GaAs and InAs has been obtained. A valence band offset of the unstrained materials of .229 eV is built into the parameter set. The sp3d5s* model is rich enough in its physical content to enable the fitting of GaAs, InAs, and AlAs with common As potentials and built-in valence band offsets. Common atom potentials and built-in valence band offsets cannot be achieved in the sp3s* model, unless some of the fitting requirements are severely relaxed.

Table 4 summarizes the major unstrained bulk material properties that have been targeted in the sp3s* and sp3d5s* parameterizations for GaAs and InAs. The target parameters are taken from various experimental and theoretical references [Madelung (1996); Landolt-Bornstein (1982); Jancu, Scholz, Beltram, Bassani (1998)]. The major parameters of interest are associated with the lowest conduction and the two highest valence bands. In Table 4 these properties are separated by a horizontal line

Table 3: sp3d5s* tight-binding model parameters for GaAs and InAs. All energies are in units of eV, the lattice constant is in units of nm, and the strain parameters η and C are unitless.

Columns: TB Parameter | GaAs | InAs | η strain | GaAs | InAs | C strain | GaAs | InAs

lattice E(s a ) E(p a ) E(s c ) E(p c ) E(s a) E(s c ) E(d a ) E(d c ) a / c / E shift V(s,s) ss σ..68 C(s,s) V(s,s ) s s σ C(s,s ) V(s a,s c ) ssσ C(s a,s c ) V(s a,s c) spσ C(s a,s c) V(s a, p c ) ppσ C(s a, p c ) V(s c, p a ) ppπ C(s c, p a ) V(s a, p c) sdσ C(s a, p c) V(s c, p a ) s pσ C(s c, p a ) V(s a,d c ) pdσ C(s a,d c ) V(s c,d a ) pdπ C(s c,d a ) V(s a,d c ) C diag C(s a,d c ) V(s c,d a ) ddσ C(s c,d a ) V(p, p, σ) ddπ C(p, p) V(p, p, π) ddδ V(p a,d c,σ) s dσ C(p a,d c ) V(p c,d a,σ) C(p c,d a ).. V(p a,d c,π) V(p c,d a,π) V(d, d, σ) C(d, d) V(d, d, π) V(d, d, δ)

Table 2: sp3s* tight-binding parameters for GaAs and InAs. All energies are in units of eV and the lattice constant is in units of nm. For this parameterization all relevant off-diagonal strain scaling parameters are set to η=2 and all diagonal strain scaling parameters are set to C=0.

Columns: Parameter | GaAs | InAs

lattice/(nm) E(s a ) E(p a ) E(s c ) E(p c ) E(s a) E(s c) V (s, s) V (x, x) V (x, y) V (s a, p c ) V (s c, p a ) V (s a, p c ) V (p a, s c) a c offset Ev

from parameters that are outside these central bands of interest. The upper and lower band edges as well as the minimum point k_L in the [111] direction are included in the optimization target with a relatively small weight. These properties are included in the optimization to preserve an overall good shape of the bands outside the range of major interest. If they are not included, the upper and lower bands will distort significantly to aid the desired perfect properties of the central bands. This distortion can lead to undesired band crossings on and off the zone center.

Also included (yet not shown in the table) is another restriction on the GaAs and InAs parameters: they must alloy well within the virtual crystal approximation (VCA). It has been found that parameter sets that represent the individual GaAs and InAs quite well can result in an In_xGa_{1-x}As alloy representation that has completely wrong behavior of the bands as a function of x (dramatic non-linear bowing). Typically a target that linearly interpolates the central conduction and valence band edges for In_xGa_{1-x}As from GaAs and InAs as a function of x is included. Bowing is not built into these VCA parameters, but establishes itself in the 3-D disordered system (see reference [] for an Al_xGa_{1-x}As example and Section 6.2 for a discussion on In_xGa_{1-x}As).
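The linear interpolation target used for the VCA restriction can be written down directly. The sketch below interpolates every entry of two parameter sets with the In fraction x. The function name is hypothetical and the band-edge numbers are rounded placeholder values chosen for illustration only; they are not the entries of Tables 2, 3 or 4.

```python
def vca_interpolate(gaas_params, inas_params, x):
    """Linearly interpolate parameters for In_x Ga_(1-x) As.

    x is the In fraction: x = 0 returns the GaAs values and x = 1
    returns the InAs values.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("the In fraction x must lie between 0 and 1")
    return {name: (1.0 - x) * gaas_params[name] + x * inas_params[name]
            for name in gaas_params}


if __name__ == "__main__":
    # Placeholder band edges in eV (illustrative, not from the tables).
    gaas = {"E_c": 1.42, "E_v": 0.00}
    inas = {"E_c": 0.58, "E_v": 0.23}
    for x in (0.0, 0.6, 1.0):
        edges = vca_interpolate(gaas, inas, x)
        gap = edges["E_c"] - edges["E_v"]
        print(f"x = {x:.1f}: E_c = {edges['E_c']:.3f} eV, "
              f"E_v = {edges['E_v']:.3f} eV, gap = {gap:.3f} eV")
```

By construction this target has no bowing: the interpolated band edges are straight lines in x, which matches the statement that bowing only appears in the disordered 3-D system.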
Compared to the sp3s* model, the sp3d5s* model generally provides better fits to the hole effective masses and the electron effective masses at Γ and L. The failure of the sp3s* model to properly reproduce the transverse effective mass on the line towards X is well understood [Klimeck, et al. (2)]. The sp3d5s* model does allow the proper modeling of the effective masses in that part of the Brillouin zone.

Figure 6 shows the bulk dispersion of GaAs (left column) and InAs (right column) computed from the tight-binding parameters listed in Tables 2 and 3 without strain. The dispersion corresponding to the sp3s* model is plotted as a dashed line and compared to the results from the sp3d5s* model, plotted as a solid line. The first row in Figure 6 shows the bands in a relatively large energy range, including the lowest valence band in the models as well as several excited conduction bands. The second row in Figure 6 zooms in on the central bands of interest. The sp3s* and sp3d5s* models agree reasonably well with each other at the Γ point, both in the energies and in the curvatures of the central bands of interest. Off the zone center the deviation between the two models becomes significant. Some of the band energies are hard to probe experimentally and are only known from other theoretical models [Madelung (1996); Landolt-Bornstein (1982); Jancu, Scholz, Beltram, Bassani (1998)]. However, the conduction band energies at X and L and their corresponding masses are well known, and the sp3s* model fails to deliver a good fit. The sp3s* model generally appears to deviate strongly from the sp3d5s* model in the [111] direction, even for the central bands of interest.

5.3 Band Edges as a Function of Strain

The deformation of atomic positions from their ideal values in a relaxed semiconductor crystal modifies the interaction between atomic neighbors and therefore the electronic bandstructure.
The ability to form strained structures without defects opens a new design space exploited by many commercially relevant devices, including, for example, InGaAsP-based laser diodes operating at 1.µm. Although good qualitative results have been obtained for the strain-dependence of the effects of interest in these devices [Silver, O'Reilly (199)], very precise measurements of all the empirical parameters that influence strain are still lacking. The baseline strain parameterization to which the calculation is compared and fitted has been presented by Van de Walle (1989). Van de

Table 4: Optimization targets and optimized results for the sp3s* and sp3d5s* models for GaAs and InAs.

Columns: Property | GaAs Target | sps | %dev | spds | %dev | InAs Target | sps | %dev | spds | %dev

Eg Γ Ec Γ V hh m c [1] m lh [1] m lh [11] m lh [111] m hh [1] m hh [11] m hh [111] m so[1] Ec X EΓ c m X [long] m X [trans] k X Ec L EΓ c m L [long] m L [trans] k L so E6v Γ Eso E6c Γ E7c Γ E8c Γ E6v X E7v X E6c X E7c X E4v L Ev L E6v L E7v L E6c L E7c L

Figure 6: E(k) dispersion for GaAs (left column) and InAs (right column) computed with the sp3s* (dashed line) and sp3d5s* (solid line) models. Horizontal axes: momentum along L, Γ, X. Vertical axes: energy (eV).

Walle's parameterization is not purely empirically based, but partially dependent on a k·p expansion following Pollak and Cardona (1968). For this work, Van de Walle's parameters have been slightly modified to represent room temperature bandgaps. Figure 7 shows the conduction and valence band edges for GaAs (left column) and InAs (right column) as a function of hydrostatic strain (top row) and bi-axial strain (bottom row). Three parameterizations are compared in each graph: 1) reference data by Van de Walle (dashed line), 2) data computed from the sp3d5s* model (circles), and 3) data computed from the sp3s* model (solid line).

The test application in this paper is the modeling of a strained InGaAs system grown on top of a GaAs substrate. Since InAs has a larger lattice constant than GaAs, one needs to model effects on InAs as it is compressed towards the GaAs lattice constant (7% negative strain). GaAs bonds, by contrast, are expected to be stretched towards the InAs bond length at interfaces (positive strain). Since the InGaAs quantum dots grown on GaAs, which are considered in the next two Sections 6 and 7, are significantly larger in their width than their height, one can expect the strain in the dot to be mostly bi-axial. However, some hydrostatic strain distributions can be expected as well, due to the finite extent of the InAs quantum dots inside the GaAs buffer. The z-directional strain component in the bi-axial strain case is computed as $\varepsilon_{zz} = -2\,\frac{c_{12}}{c_{11}}\,\varepsilon_{xx}$ with $\varepsilon_{yy} = \varepsilon_{xx}$. Both tight-binding models follow the trends set by the Van de Walle reference reasonably well.
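The bi-axial strain relation can be evaluated for the case discussed above, an InAs layer compressed in-plane to the GaAs lattice constant. The sketch below uses the standard relation between the in-plane and the perpendicular strain of a cubic layer. The lattice constants and the InAs elastic constants are approximate literature values inserted as assumptions for illustration; they are not taken from the tables of this paper.

```python
def biaxial_strain(a_substrate, a_layer, c11, c12):
    """Return (eps_xx, eps_yy, eps_zz) for a layer strained to a substrate.

    The in-plane strain follows from the lattice mismatch, and the
    perpendicular strain from eps_zz = -2 * (c12 / c11) * eps_xx.
    """
    eps_xx = (a_substrate - a_layer) / a_layer
    eps_zz = -2.0 * (c12 / c11) * eps_xx
    return eps_xx, eps_xx, eps_zz


if __name__ == "__main__":
    # Assumed, approximate literature values (not from this paper):
    a_gaas = 0.56533      # GaAs lattice constant in nm
    a_inas = 0.60583      # InAs lattice constant in nm
    c11_inas = 83.29      # InAs elastic constant c11 in GPa
    c12_inas = 45.26      # InAs elastic constant c12 in GPa
    eps_xx, eps_yy, eps_zz = biaxial_strain(a_gaas, a_inas, c11_inas, c12_inas)
    print(f"eps_xx = eps_yy = {eps_xx:+.4f}, eps_zz = {eps_zz:+.4f}")
```

With these inputs the in-plane strain comes out close to the 7% compression mentioned in the text, and the perpendicular strain is positive, meaning that the layer expands along z as it is squeezed in the plane.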
Generally speaking, the sp3d5s* model performs better than the sp3s* model (which was not optimized for its strain performance). It has been particularly hard to improve the under-prediction of the InAs band gap (Figure 7d) under large compressive bi-axial strain. The reasonably good fit has been obtained by compromising the fit of the conduction band under hydrostatic compressive strain. In contrast, fitting the InAs valence bands to the Van de Walle data has not posed any problem.

5.4 Effective Masses as a Function of Strain

Previous nanoelectronic transport simulations [Bowen, et al. (1997a); Klimeck, et al. (1997); Bowen, et al. (1997b)] have shown that it is essential to properly model the band edges and effective masses in the

Figure 7: GaAs and InAs conduction band, heavy-hole, light-hole, and split-off band edges as a function of hydrostatic and bi-axial strain. Dashed line: parameterization of Van de Walle (1989). Circles: sp3d5s* model. Solid line: sp3s* model. [Panels: (a) GaAs and (b) InAs under hydrostatic strain ε_xx = ε_yy = ε_zz; (c) GaAs and (d) InAs under bi-axial strain ε_xx = ε_yy; vertical axes: energy (eV); curves labeled E_c, E_HH, E_LH, E_so.]

Figure 8: Electron, heavy-hole (HH), and light-hole (LH) effective masses in the [1] direction as a function of hydrostatic (circles) and bi-axial (no symbols) strain for the sp3s* and the sp3d5s* model. Left column GaAs, right column InAs. Negative strain numbers correspond to compressive strain. [Panels: (a), (b) m*(el); (c), (d) m*(lh); (e), (f) m*(hh); horizontal axes: strain.]

heterostructure. In a single-band model the eigenenergy of a confined state, measured from the band edge, is inversely proportional to the effective mass. With the strong dependence of the band edges on the strain shown in Figure 7, one can also expect a strong dependence of the effective masses on the strain. Figure 8 shows the electron (first row), light hole (second row), and heavy hole (third row) effective masses for GaAs (first column) and InAs (second column) as a function of hydrostatic strain (lines with circles) and bi-axial strain (lines without symbols) for the sp3s* model (dashed line) and the sp3d5s* model (solid line) computed in the [1] direction. Negative strain values correspond to compressive strain.

For the electron mass the sp3s* and the sp3d5s* model show roughly the same trends for GaAs as well as InAs. The GaAs mass drops towards the smaller InAs mass as GaAs is stretched towards InAs. In InAs the electron mass increases towards the heavier GaAs mass as the material is compressed towards GaAs. The sp3d5s* model shows a larger difference between the effects of hydrostatic and bi-axial strain than the sp3s* model. The change in the effective mass in InAs under compressive bi-axial strain is quite important. Under 7% bi-axial strain the effective mass approximately doubles. This increase in the effective mass lowers the confinement energies in the quantum dots, effectively increasing the confinement. The spacing between the confined electron states will also be significantly reduced.

The light hole masses (Figure 8c,d) show a similar linear dependence on hydrostatic strain as the electron masses for both band structure models. Under bi-axial compressive strain, however, the light hole mass increases dramatically towards the heavy hole mass. Both tight-binding models predict roughly the same behavior.
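The inverse dependence of confinement energy on effective mass can be made concrete with a minimal sketch. It assumes an infinite one-dimensional square well as a stand-in for the single-band picture; the well width and the relative masses are hypothetical illustration values, not parameters from this paper.

```python
import math

HBAR = 1.054571817e-34   # reduced Planck constant, J s
M0 = 9.1093837015e-31    # free electron mass, kg
EV = 1.602176634e-19     # joules per electron volt


def box_level_ev(m_rel: float, width_nm: float, n: int = 1) -> float:
    """Level n of an infinite 1-D well, in eV above the band edge."""
    width_m = width_nm * 1e-9
    return (n * math.pi * HBAR) ** 2 / (2.0 * m_rel * M0 * width_m ** 2) / EV


# Hypothetical numbers: a 5 nm well and a relative mass that doubles under strain.
WIDTH_NM = 5.0
for m_rel in (0.025, 0.050):
    e1 = box_level_ev(m_rel, WIDTH_NM, 1)
    e2 = box_level_ev(m_rel, WIDTH_NM, 2)
    print(f"m* = {m_rel:.3f} m0: E1 = {e1:.3f} eV, E2 - E1 = {e2 - e1:.3f} eV")
```

In this idealized model, doubling the mass halves both the ground state energy and the level spacing, which matches the qualitative trend described for InAs under compressive bi-axial strain.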
In the case of thin InGaAs quantum dots strained on GaAs this implies that the light hole confinement is much stronger and the light hole state separation is much smaller than the unstrained LH effective mass would indicate. Note, however, that the LH band is significantly separated from the HH band due to strain, as indicated in Figure 7d.

While the two tight-binding models show similar trends for the electron and light hole effective masses for GaAs and InAs under both strain types, the two models show different trends for the heavy hole masses. In the case of GaAs under hydrostatic strain the two models still predict the same trends for the HH mass. However, with increasing bi-axial strain the sp3s* model predicts an increase in the GaAs HH mass, while the sp3d5s* model shows the opposite trend. In the case of InAs the two models predict conflicting trends in both strain regimes. Note that the two models have slightly different zero-strain origins, as indicated in Table 4. The difference in the strain dependence of the HH mass in the two tight-binding models may result in different hole confinements and hole state separations predicted by the two models. Although the conflicting trends are somewhat disturbing and warrant further examination of their effects on confined hole states, it is also important to note that the overall variation due to strain is small, within about 1%. Variations with strain in the electron and light hole masses are much more significant, on the order of 1%, and both models predict the same trend.

6 Application of NEMO 3-D to InGaAs Alloyed Systems

The previous section discusses the parameterization of GaAs and InAs in the sp3s* and sp3d5s* models. All the material properties in that section were computed on the basis of a single primitive fcc-based cell.
This section and Section 7 focus on the properties of the alloy InxGa1-xAs, modeled by the constituents GaAs and InAs in a 3-D chunk of material consisting of tens of thousands to over 6 million atoms. Two different systems are considered in detail: 1) bulk InxGa1-xAs and its properties as a function of In concentration x, and 2) In0.6Ga0.4As dome-shaped quantum dots embedded in GaAs. Within each system the strain properties are examined first, followed by an analysis of the electronic structure. Throughout this section the sp3d5s* model is used for all the electronic structure calculations.

6.1 Strain Properties of Bulk InxGa1-xAs

The starting point of many atomistic electronic structure calculations is a determination of the atomic configuration through a minimization of the total strain energy. The strain calculation discussed earlier is applied to a small, periodic InxGa1-xAs system consisting of approximately 13 atoms. Figure 9 shows the mean bond lengths for such a small system. The curve in red (blue) corresponds to the mean In-As (Ga-As) bond length and is bounded by dotted curves that delimit the range of bond lengths that lie within one standard deviation of the mean. Clearly, as the material in question becomes less alloy-like (i.e., more GaAs-like or InAs-like) the standard

deviations approach zero. The curve in green is the average of the Ga-As and In-As bond lengths weighted by the concentration of each cation and represents the mean bond length throughout the crystal. Note that this mean is strongly linear with a very slight upward bowing and is consistent with Vegard's law [Chen, Sher (199)]. Also evident is the bimodal nature of the bond length distribution, which demonstrates that on a local scale the crystalline structure around any particular cation retains to a large degree the character of the binary bulk material. The computed bond lengths show reasonable agreement with those determined from experiment [Mikkelsen, Boyce (1982)] (shown in black), but tend closer to the mean crystal value.

Figure 9: In-As (red) and Ga-As (blue) bond length average (with an error margin of one standard deviation) as a function of In concentration x. Black line corresponds to experimental data reported by Mikkelsen and Boyce (1982). Dot-dashed green line corresponds to a VCA result representing the mean bond length of the entire crystal. [Axes: bond length (nm) versus In concentration.]

6.2 Electronic Properties of Bulk InxGa1-xAs

With the atomic configurations the electronic properties of the In0.6Ga0.4As system can be obtained. Figure 10 compares the experimentally measured [Landolt-Bornstein (1982)] energy gap (shown in green) of InxGa1-xAs as a function of In concentration x with numerical results obtained in two different ways. The red curve is the VCA result, obtained by diagonalizing the tight-binding Hamiltonian on a single unit cell with periodic boundary conditions, in which the cation-anion coupling potentials are determined by a strict average of the In-As and Ga-As coupling potentials. The lattice constant of the single cubic unit cell is determined by Vegard's law [Chen, Sher (199)]. The resulting energy gap is mostly linear, but displays a very slight upward bowing. The blue curve is obtained by diagonalizing the full Hamiltonian of the alloyed system. The system size is sufficiently large that variations of the energy gap due to configurational noise (see analysis in Section 7) are not visible on the energy scale shown in the figure.

The determined energy gap differs from the VCA result by a maximum of 6 meV and displays a slight downward bowing, although significantly less than that of the experimental result [Landolt-Bornstein (1982)]. The linear behavior of the VCA-computed bandgap is included in the tight-binding parameter fitting as discussed in Section 5.2. The bowing can therefore be attributed to the random cation disorder in the 3-D bulk system. In similar AlxGa1-xAs simulations [Oyafuso, Klimeck, Bowen, Boykin (22)] much better agreement between the 3-D simulation and the experimental results has been achieved. Some bowing might have to be built into the VCA-based parameterization of GaAs and InAs to accommodate the larger bowing in the InxGa1-xAs system compared to the AlxGa1-xAs system.

Figure 10: Experimentally measured [Landolt-Bornstein (1982)] energy gap (solid line) of InxGa1-xAs as a function of In concentration x compared with results based on the 3-D random alloy simulation (circles) and a virtual crystal approximation (stars). [Axes: energy gap (eV) versus In concentration x.]
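The two ingredients of the VCA reference calculation, a Vegard's-law lattice constant and linearly averaged coupling parameters, can be sketched as follows. The lattice constants are commonly quoted room-temperature values and are assumptions here; the parameter names and numbers in the two dictionaries are purely hypothetical placeholders, not the fitted tight-binding parameters of this paper.

```python
import math

# Commonly quoted room-temperature lattice constants in nm (assumed, not from the paper).
A_GAAS_NM = 0.56533
A_INAS_NM = 0.60583


def vegard_lattice_constant(x: float) -> float:
    """Lattice constant of InxGa1-xAs by linear interpolation (Vegard's law)."""
    return x * A_INAS_NM + (1.0 - x) * A_GAAS_NM


def zincblende_bond_length(a_nm: float) -> float:
    """Nearest-neighbor cation-anion distance in an ideal zincblende cell."""
    return math.sqrt(3.0) / 4.0 * a_nm


def vca_average(p_inas: dict, p_gaas: dict, x: float) -> dict:
    """Strict linear average of matching parameters, weighted by In concentration x."""
    return {key: x * p_inas[key] + (1.0 - x) * p_gaas[key] for key in p_inas}


# Hypothetical placeholder parameters (eV), for illustration only.
P_INAS = {"E_s_cation": -0.5, "V_ss_sigma": -1.6}
P_GAAS = {"E_s_cation": -0.3, "V_ss_sigma": -1.7}

x = 0.6
a = vegard_lattice_constant(x)
print(f"a(x={x}) = {a:.5f} nm, mean bond length = {zincblende_bond_length(a):.5f} nm")
print(vca_average(P_INAS, P_GAAS, x))
```

The sketch produces a strictly linear dependence on x by construction, so any bowing in the full 3-D result has to come from effects outside this averaging, such as the random cation disorder discussed above.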

6.3 Strain in an Alloyed Quantum Dot

This section demonstrates an example of the strain calculation in an alloyed quantum dot using NEMO 3-D. The model problem is a dome-shaped In0.6Ga0.4As QD of diameter 3 nm and height nm enclosed in a GaAs box of size nm^3. The entire structure, which contains roughly 6.3 million atoms, is allowed to expand freely to minimize the total strain energy (no fixed boundary conditions). The diagonal part of the local strain tensor is examined along the x-axis, which lies halfway between the top and bottom of the dome and parallel to its base. Figures 11a and 11b show the components ε_xx (blue), ε_yy (green), ε_zz (red), and Tr(ε) (black) of the local strain tensor of the primitive cell centered about the Ga and In cations, respectively.

Within the QD, the In-As bonds (see Figure 11b) are compressively strained roughly equally in the x and y directions (approximately 4.69% and 4.99%, respectively). There is a very slight tensile strain in the z direction (.2%). Three competing effects determine the sign and magnitude of this strain. First, there is a negative hydrostatic component due to the smaller lattice constant of the buffer. Second, the flatness of the dome means that close to the center of the QD the strain field should approach that of a quantum well, in which the cubic cell is compressively strained laterally (i.e., in x and y) and stretched in z. Finally, the presence of nearby Ga cations provides an additional negative hydrostatic component to the strain. The combination of these three effects gives rise to a large bi-axial compressive strain and a nearly vanishing strain component normal to the flat dome.

The Ga cations within the dome (see Figure 11a) are subject to only one of these effects, that of the bi-axial strain of nearby In-As bonds. Interestingly, the Ga-As bond lengths are reduced laterally (-1.98% (x) and -1.9% (y)) from their bulk values.
This reduction likely serves to match the very large z-component of the In-As bond lengths. The resulting average tension in the z direction is 2.14%. Just outside the dome, along x, the Ga atoms suffer tensile strain in y and in z (although more so in z) to match the effective lattice constant on the boundary of the dome. This stretching results in compressive strain along x, as indicated in Figure 11a by the negative value of ε_xx outside the QD.

Figure 11: Components of the strain tensor for primitive cells centered around (a) Ga and (b) In cations along an axis midway between the top and bottom of the dome and parallel to its base. [Panels: (a) Ga-As bonds, (b) In-As bonds; curves: ε_xx, ε_yy, ε_zz, Tr{ε}; axes: strain versus position along x (nm).]

6.4 Local Band Structure in an Alloyed Quantum Dot

Figure 12 shows the effect of the deformation of the primitive cells under strain on the local electron and hole band structure. Each point represents a local eigenenergy obtained by constructing a bulk solid from the primitive cell formed from the four As anions enclosing each cation. One sees that outside the QD, the tensile strain the GaAs cells experience reduces the conduction band edge slightly from its bulk value and splits the degenerate valence band (shown in black) into heavy hole (HH) and light hole (LH) bands.

Within the QD, both GaAs and InAs cells are squeezed laterally, resulting in an increase in local electron eigenenergies. The resulting mean electron band edge along the x-axis and within the QD is indicated by the thin solid line. Bi-axial compressive strain also raises (lowers) the local HH (LH) eigenenergies within the QD for both InAs and GaAs cells, and, again, the average HH QD band edge is indicated by a thin solid line. Clearly, the random distribution induces a large variation in local potentials, which will shortly be seen to strongly affect shallow hole states.

Figure 12: Local conduction (a) and valence (b) band edges determined by imposing periodic boundary conditions on a primitive cell constructed from the four anions surrounding a given cation. [Legend: InAs (EL), GaAs (EL), InAs (HH), InAs (LH), GaAs (HH), GaAs (LH); axes: energy (eV) versus position along x (nm).]

6.5 Wave Functions in an Alloyed Quantum Dot

This section examines the effect of disorder on electron and hole eigenfunctions.
Three different alloy configurations are examined for the same quantum dot size, shape, and number of included atoms: two different random alloy configurations that differ only by the random placement of the In and Ga atoms in the In0.6Ga0.4As, and a VCA-based configuration without spatial disorder. In the VCA representation all cations within the QD are of a fictitious type In0.6Ga0.4 in which all tight-binding parameters (and the strain parameters) are linearly averaged between InAs and GaAs parameters (see footnote 19). This case corresponds most closely to a jellium description and is used as our baseline reference. The disordered wave functions are shown to be significantly different from each other as well as from the homogeneous VCA system wave functions. A detailed statistical analysis of the computed eigenenergies as a result of the wave function variations is deferred to Section 7.

19 The anion As parameters are in general averaged as well in the VCA approximation. However, in the sp3d5s* parameterization discussed in Section 5.2 all As parameters are already identical.

The quantum dot geometry and composition are identical to those discussed above. However, to reduce the computational expense, the GaAs buffer is reduced to a size of nm^3 and contains roughly 3.6 million atoms. Figure 13 shows four different representations of the ground state electron wave function obtained for three different configurations. The first column shows results for a VCA implementation. The other two columns display results for two separate random distributions of In and Ga atoms within the QD. The first row depicts scatter plots of the probability density, where the red points mark atomic sites where the probability density exceeds one-third of the maximum value, and green and blue points mark higher values. Clearly there is not much difference between the three plots except that the VCA result is somewhat smoother. Also, the VCA plot is slightly larger, indicating a slower decay as one moves away from the central axis of the QD. The next two rows are contour plots of a slice parallel to the base of the dome and midway up in height. Here, the impact of the disorder on the wave function is quite evident, although the s-like character of the wave function still closely resembles that of the homogeneous QD. Also, the difference between the two disordered QDs is not significant. The eigenenergies differ by about 1.38 meV. The last row depicts a surface plot of the wave function (normalized to unity over the entire simulation domain) and shows that the homogeneous result is a smoothed version of the disordered wave function with a lower maximum.

Figure 13: Electron ground state wave functions without disorder (VCA, first column) and for two different random alloy configurations (middle and right columns). First row: scatter plot of the wave function in 3-D. Second, third, and fourth rows: colored contour plot, outlined contour plot, and surface plot, respectively, sliced through the middle of the quantum dot at constant z. [Panels (a)-(l); axes: x (nm), y (nm), z (nm).]

Figure 14 shows a set of hole wave functions analogous to those shown in Figure 13 for the electron ground state. First note that the VCA scatter plot looks similar to the ground state VCA electron wave function, except that it is flatter. The stronger localization in the z direction reflects the greater confinement due to the larger hole mass relative to that of the electron. The larger hole mass also makes the wave function more susceptible to perturbations in the local potential. This effect is demonstrated in the three hole scatter plots, where the disorder strongly changes the appearance of the wave function. Note, also, that different placements of cations can produce noticeably different results, as seen in the contour plots, where the locations of the wave function peaks vary by several nm. The greater localization in systems with disorder also manifests itself in the much larger peaks in the surface plots, where the probability density is, again, normalized to unity over the entire simulation domain in each of the three cases.
The hole eigenvalues differ by -3.44 meV compared to a difference of +1.38 meV for the electrons. A more detailed statistical analysis of the distribution of eigenvalues is the topic of Section 7.

Figure 15 shows the six lowest electron (rows 1 and 2) and hole (rows 3 and 4) states for a similar system with the same dome dimensions, but enclosed in a buffer of size nm^3. First, note that the electron states more closely resemble the states one would expect from a homogeneous QD. Also, the three lowest hole states correspond well to their electron counterparts, but higher energy states differ. There are two possible explanations. First, the Lanczos algorithm might not have yet converged on an intermediate eigenvalue. Second, the disorder in the system may rearrange the ordering of the eigenstates. Note that there exist several pairs of states (electron 2 and 3; electron 5 and 6; and hole 2 and 3) that stem from degenerate states in the homogeneous system, yet the disorder splits their eigenenergies by up to 1.4 meV. Since this splitting due to disorder is roughly the same order of magnitude as the separation of the excited states, it is conceivable that the disorder can rearrange the ordering of the eigenstates.

7 Statistical Analysis of Random Disorder in Alloyed Quantum Dots

7.1 Set-up of the Numerical Experiment

This section considers the same dome-shaped In0.6Ga0.4As quantum dot as the previous sections. Since the In and Ga ions inside the alloyed dot are randomly distributed, different alloy configurations exist and optical transition energies from one dot to the next may vary, even if the size and the shape of the dot are assumed to be fixed. This section seeks to answer the question: What is the minimal optical line width that can be expected for such an alloyed dot, neglecting any experimental size variations?
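The idea of sampling many alloy configurations that differ only in the random placement of cations can be sketched as below. This is a minimal illustration under stated assumptions: each cation site is drawn independently as In with probability 0.6 (the paper may instead fix the exact composition), the site count is a small hypothetical number, and the function names are not NEMO 3-D routines.

```python
import random
import statistics


def random_cations(n_sites: int, x_in: float, seed: int) -> list:
    """Assign In or Ga to each cation site independently, reproducibly for a given seed."""
    rng = random.Random(seed)
    return ["In" if rng.random() < x_in else "Ga" for _ in range(n_sites)]


def indium_fraction(cations: list) -> float:
    """Realized In concentration of one configuration."""
    return cations.count("In") / len(cations)


# Hypothetical sample: 20 configurations of 10,000 cation sites each, one seed per sample.
N_SITES, X_IN = 10_000, 0.6
fractions = [indium_fraction(random_cations(N_SITES, X_IN, seed)) for seed in range(20)]

print(f"mean In fraction  = {statistics.mean(fractions):.4f}")
print(f"sample-to-sample spread (std) = {statistics.pstdev(fractions):.4f}")
```

In a study of the kind described here, each configuration would be passed through the strain minimization and the eigenvalue solver, and the same mean and standard deviation would be taken over the resulting transition energies rather than over the In fraction.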
To enable the simulation of about 1 different configurations the required simulation time was reduced by three additional approximations / simplifications: 1) the surrounding GaAs buffer is reduced to nm in each direction around the quantum dot. This results in a total simulation domain of approximately 1,, atoms with about 718, atoms in the quantum dot itself, 2) the use of sp 3 s model instead of the sp 3 d s model (a reduction of the required compute time by about 4 ), and 3) the computation of the eigenvalues without the corresponding eigenvectors (resulting in a reduction of compute time by exactly a factor of two). With these approximations and simplifications the wall clock time to obtain one set of eigenvalues for one particular alloy configuration took about 2 minutes on 31 processors of P different alloy samples therefore required approximately 42 hours or 17.4 days wall clock time or about 13, hours or 38 days single processor computing time. The mechanical strain is minimized using a valence force field method [Keating (1966); Pryor, Kim, Wang (1998)] as discussed in Section 3.4 for each alloy configuration. Changing the random seed


Figure 14: Hole ground state wave functions without disorder (VCA, first column) and for two different random alloy configurations (middle and right columns). First row: scatter plot of the wave function in 3-D. Second, third, and fourth rows: colored contour plot, outlined contour plot, and surface plot, respectively, sliced through the middle of the quantum dot at constant z. [Panels (a)-(l); axes: x (nm), y (nm), z (nm).]


Figure 15: Electron and hole wave functions with disorder. [Panels: (a)-(f) electron states 1 to 6; (g)-(l) hole states 1 to 6; axes: x (nm), y (nm).]
