VASP Benchmark Report
wszhang@ustc.edu.cn
April 8, 2018

Contents
1   Introduction ....................................... 2
2   Choosing the parallelization parameters ............ 2
3   Test systems ....................................... 2
4   Parameter tests .................................... 2
    4.1  INCAR settings ................................ 2
    4.2  Recommended values ............................ 3
5   Hardware platforms ................................. 4
    5.1  Platforms tested .............................. 4
    5.2  Cross-platform comparison ..................... 5
6   Compilation options ................................ 6
    6.1  Build variants ................................ 6
    6.2  Results ....................................... 6
7   Detailed results ................................... 7
    7.1  E5V4 .......................................... 7
    7.2  NCORE ......................................... 8
    7.3  E5V3 .......................................... 9
    7.4  E3V5 .......................................... 10
    7.5  Fat144 ........................................ 11
    7.6  KNL (two modes) ............................... 12
    7.7  NELMDL and compilation tests .................. 14
1 Introduction

This report benchmarks VASP on the supercomputing platforms at USTC, focusing on the parallelization parameters (NPAR/NCORE/KPAR) and on how performance varies with hardware and build options.

2 Choosing the parallelization parameters

Geun Ho Gu (University of Delaware) gives the following advice:

  "For the NPAR, I recommend doing a test to find out the most efficient number, e.g. run the same calculation multiple times with different NPAR. Also, do the same for the LPLANE parameter as well. The manual instructs to use the number of nodes as NPAR, as each parallel calculation can be run at each node, minimizing communication overhead between the nodes. If not optimized, VASP takes extra time to communicate between nodes, eating up your computation time. However, I have found that this instruction does not always hold up, and, really, this parameter is heavily dependent on the batch server/node configuration. So it is wise to do your own test to optimize this parameter (and other parameters as well)."

With poorly chosen parallelization settings, VASP can lose on the order of 50% of its performance [1], so these parameters are worth tuning for each machine.

3 Test systems

The test systems are ZrNCl cells with Gamma-centered KPOINTS meshes:

1. unit cell, 18 atoms (Zr6N6Cl6), 272 irreducible k-points
2. 2x2x1 supercell, 71 atoms (Zr24N24Cl23), 36 irreducible k-points
3. 2x2x1 supercell, 71 atoms (Zr24N24Cl23), Gamma point only
4. 4x4x1 supercell, 284 atoms (Zr96N96Cl92), Gamma point only
5. 6x6x1 supercell, 630 atoms, Gamma point only

Below, these are labeled A18K272, A71K36, A71K1, A284K1 and A630K1 (atom count / irreducible k-point count). Runs with multiple k-points use vasp_std; Gamma-point-only runs use the Gamma-only build vasp_gam.

4 Parameter tests

4.1 INCAR settings
SYSTEM = ZrNCl
ISTART = 0
ISMEAR = 0
SIGMA  = 0.4
ENCUT  = 400
PREC   = Normal
NELM   = 5
NELMIN = 5
NELMDL = 0
ISYM   = 1
EDIFF  = 1E-7
LREAL  = Auto
LPLANE = .TRUE.
KPAR   = $KPAR    # 1 2 4 8 16; default: 1
NCORE  = $NCORE   # 1 2 4 8 16; default: 1
#NPAR  = $NPAR    # 4 6 8 16
#NSIM  = $NSIM    # default: 4

KPAR and NPAR/NCORE control the two levels of parallelism. KPAR splits the run over k-points: the MPI ranks are divided into KPAR groups, each handling a subset of the k-points. Within one group, NCORE is the number of cores that work together on a single band, so NPAR = cores-per-group / NCORE bands are treated in parallel. For example, on 32 cores with KPAR=4 and NCORE=4, the k-points are split into 4 groups of 8 cores each; within a group each band is handled by 4 cores, so 2 bands are computed simultaneously. NSIM (the blocking of the RMM-DIIS algorithm) was left at its default.

4.2 Recommended values

The tests below were run mainly on the TC4600 cluster. For jobs with many k-points, the VASP documentation suggests

  NPAR = 4 ~ approx SQRT(number of cores)

and, for the orbital distribution:

  "For optimal performance we recommend to set NCORE = 4 - approx SQRT(number of cores). NCORE specifies how many cores store one orbital (NPAR = cpu/NCORE). This setting can greatly improve the performance of VASP for DFT." [2]

For k-point-dominated jobs it instead suggests NPAR = number of cores per compute node [2]. Reference [3] does "not recommend attempting to run with KPAR > compute nodes, even though you may have more k-points than compute nodes."

The following results come from scans of KPAR, NCORE and NPAR on the E5V4 partition for the k-point-heavy test systems.
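As the advice in Section 2 suggests, the most reliable way to pick KPAR/NCORE is to benchmark them yourself. A minimal sketch of such a scan, assuming a setup like the INCAR above; the launch line, rank count (32) and binary name are placeholders for your own cluster:

```shell
#!/bin/sh
# Sketch of the parameter scan recommended above: generate one run
# directory per KPAR/NCORE combination, each with an INCAR based on
# Section 4.1, then launch VASP in each (the launch line is a
# placeholder -- mpirun path, rank count and binary depend on your site).
set -e

for KPAR in 1 2 4 8 16; do
  for NCORE in 1 2 4 8 16; do
    dir="run_kpar${KPAR}_ncore${NCORE}"
    mkdir -p "$dir"
    cat > "$dir/INCAR" <<EOF
SYSTEM = ZrNCl
ISTART = 0
ISMEAR = 0
SIGMA = 0.4
ENCUT = 400
PREC = Normal
NELM = 5
NELMIN = 5
NELMDL = 0
ISYM = 1
EDIFF = 1E-7
LREAL = Auto
LPLANE = .TRUE.
KPAR = ${KPAR}
NCORE = ${NCORE}
EOF
    # On a real system, also copy POSCAR/POTCAR/KPOINTS here, then e.g.:
    # (cd "$dir" && mpirun -np 32 vasp_std > stdout.log)
  done
done
```

Comparing the wall time of the short NELM=5 runs across directories then identifies the best combination for the machine at hand.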
Following [1], NPAR and NCORE are equivalent handles on the same band distribution (NPAR = total cores / (KPAR x NCORE)); the scans below vary NCORE. On every platform except the 4-core E3V5 nodes, NCORE = 8 works well, and any value in [4,16] is reasonable, whereas the VASP defaults of KPAR=1 & NCORE=1 are clearly suboptimal. The E5V4-A18K272 and E5V4-A71K36 scans ran from 24 up to 256 cores; for the 630-atom E5V4-A630K1 case, going from 256 to 384 cores still reduces the runtime from 354 s to 183 s.

How many cores to use depends on NKpoints and Natoms. Following [3], a reasonable rule of thumb is about Natoms/2 cores per k-point group, with KPAR chosen so that each group handles on the order of 8-16 k-points when NKpoints is large. Increasing NCORE within a group reduces the number of bands treated in parallel; once KPAR is so large that each group holds only 1-2 k-points, the band parallelism dominates the behavior.

NCORE also interacts with the node topology. Figure 7.2 shows A18K272 on 56 cores (two 28-core E5V4 nodes) with KPAR = 1 and 4 and NCORE = 8/7: the measured times are 130/124 s at KPAR=1 and 40/28 s at KPAR=4. NCORE=7 divides the 28 cores per node evenly, while NCORE=8 splits orbitals across nodes, so NCORE=7 is consistently faster (parallel efficiencies of roughly 79% vs 72%).

5 Hardware platforms

5.1 Platforms tested

The platforms cover Intel Xeon CPUs (E5V3, E5V4, E3V5 and the fat node Fat144) and Intel Xeon Phi KNL nodes in two configurations: All-to-All cluster mode with Flat memory mode ("AF mode") and Quadrant cluster mode with Cache memory mode ("QC mode"). For running VASP on GPUs, see [4]. All memory is DDR4.

Platform  CPU                                             Cores  Memory           Disk    Interconnect
E5V4      2 x E5-2680 v4 (2.4-3.3 GHz, 35 MB L3 cache)     28    128 GB 2400 MHz  240 GB  100 Gbps OPA
E5V3      2 x E5-2680 v3 (2.5-3.3 GHz, 30 MB L3 cache)     24     64 GB 2133 MHz  300 GB   56 Gbps FDR
E3V5      1 x E3-1240 v5 (3.5-3.9 GHz, 8 MB cache)          4     32 GB 2400 MHz  500 GB  100 Gbps EDR
Fat144    8 x E7-8860 v4 (2.2-3.2 GHz, 45 MB L3 cache)    144      1 TB 2400 MHz  480 GB  100 Gbps OPA
KNL-AF    1 x Xeon Phi 7210 (64 cores, 1.3-1.5 GHz,        64     96 GB 2133 MHz  160 GB  100 Gbps OPA
          16 GB MCDRAM, AF mode)
KNL-QC    1 x Xeon Phi 7210 (64 cores, 1.3-1.5 GHz,        64     96 GB 2133 MHz  160 GB  100 Gbps OPA
          16 GB MCDRAM, QC mode)

Table 1: Hardware platforms (per node).
5.2 Cross-platform comparison

On the TC4600 cluster, E5V4 nodes outperform E5V3 nodes by 10%-30%, consistent with the larger cache, faster memory and the 100 Gbps OPA interconnect; the advantage of E5V4 is largest for A284K1.

The 4-core E3V5 nodes are surprisingly competitive for VASP thanks to their high clock speed and 100 Gbps EDR network. For A18K272 on 128 cores, E3V5 takes 24 s versus 31 s on E5V4 (best setting KPAR x NCORE = 1 x 16). For A71K36, 96 E3V5 cores are on par with 128 E5V4 cores. For the Gamma-only A284K1 on 128 cores, E3V5/E5V4 take 36 s/44 s, and when scaling from 96 to 128 cores their parallel efficiencies are 0.86/0.77.

The Fat144 node (8 sockets x 18 cores) is only efficient up to roughly 32~48 cores for VASP; beyond 64 cores performance stagnates (Section 7.5), so the dual-socket E5V4 nodes make far better use of their CPUs.

Each KNL node has 64 cores; of the 4 nodes tested, 2 were configured in QC mode and 2 in AF mode. For VASP, the AF configuration is the faster of the two (Section 7.6), although node KNL3 behaved anomalously for A284K1, and AF runs with 128 ranks (2 per core) brought no benefit over 64. Comparing single nodes, KNL versus E5V4 takes roughly 120 s versus 90 s for A18K272 and 130 s versus 140 s for A284K1; Intel's own comparison of a 56-core E5V4 node against KNL quotes 3637 : 1997. KNL also stops scaling much earlier: A18K272 saturates on KNL near 96 ranks at 123 s while E5V4 still scales at 256 cores (22 s), and A284K1 takes 133 s on 48 KNL ranks versus 52 s on 96 E5V4 cores (see Section 7).
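The parallel-efficiency figures quoted above follow from E = (t1 * n1) / (t2 * n2) for a strong-scaling step from n1 cores (wall time t1) to n2 cores (wall time t2). A small helper, with purely illustrative numbers (not taken from the report):

```shell
# Parallel efficiency of a strong-scaling step:
#   E = (t1 * n1) / (t2 * n2)
# where t is the wall time and n the core count of each run.
peff() {
  awk -v t1="$1" -v n1="$2" -v t2="$3" -v n2="$4" \
      'BEGIN { printf "%.2f\n", (t1 * n1) / (t2 * n2) }'
}

# Illustrative numbers: 100 s on 96 cores, then 80 s on 128 cores.
peff 100 96 80 128   # prints 0.94
```

An efficiency near 1 means the extra cores are fully used; values like the 0.77 above indicate the job is approaching its scaling limit.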
6 Compilation options

6.1 Build variants

The build options tested are:

- Intel MKL: Sequential vs OpenMP (threaded)
- ScaLAPACK: enabled vs disabled
- FFT implementation: Intel MKL FFTW wrapper vs the built-in FFT by Juergen Furtmueller (JF)
- DCACHE_SIZE: 4000 vs 0

The baseline build V8 uses Intel MKL Sequential, ScaLAPACK enabled, the Intel FFT, and DCACHE_SIZE = 4000. Three variants change one option each:

- V12: JF FFT
- V14: ScaLAPACK disabled
- V16: DCACHE_SIZE = 0

For V16, note that "CACHE_SIZE=0 has a special meaning: it performs the FFTs in the x and y directions plane by plane", with the z-direction transform done after an abc -> cba transposition of the data.

6.2 Results

Comparing V8 with V12, the Intel MKL FFT is clearly faster than the JF FFT; since the FFTs account for roughly 1/5 to 2/5 of the total VASP run time, the overall gain is noticeable. Comparing V8 with V14, ScaLAPACK pays off once the run spans more than one node (NP > 24/28). Comparing V8 with V16, setting DCACHE_SIZE to 0 slows VASP down; the plane-by-plane z-direction FFT of V16 is not advantageous here.
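The V8/V12/V14/V16 variants map onto the VASP build configuration roughly as follows. This is a sketch assuming a VASP 5.x makefile.include with the Intel toolchain; the exact flag and object lists depend on the installation:

```
# Baseline V8: ScaLAPACK enabled, FFT cache blocking of 4000
# (the "-D" compiler prefix is why the text writes "DCACHE_SIZE").
CPP_OPTIONS = -DMPI -DscaLAPACK -DCACHE_SIZE=4000   # plus the usual flags

# FFT objects for the Intel MKL FFTW wrapper (V8, V14, V16):
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
# V12: Juergen Furtmueller's built-in FFT instead:
# OBJECTS = fft3dfurth.o fft3dlib.o

# V14: drop -DscaLAPACK from CPP_OPTIONS.
# V16: use -DCACHE_SIZE=0 (x/y FFTs performed plane by plane).
```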
7 Detailed results

7.1 E5V4

Figure 7.1: VASP scaling on E5V4 for the 5 test systems.
7.2 NCORE

Figure 7.2: Effect of NCORE on E5V4 (28 cores per node) for 3 test systems; A18K272 on 56 cores with KPAR=1, comparing NCORE=8 and NCORE=7.
7.3 E5V3

Figure 7.3: VASP scaling on E5V3.
7.4 E3V5

Figure 7.4: VASP scaling on E3V5.
7.5 Fat144

Figure 7.5: VASP scaling on Fat144.
7.6 KNL (two modes)

Figure 7.6: KNL in AF mode. Curves: KNL3, KNL3 with 128 ranks, KNL2+KNL4 combined (KNL24), and KNL2, KNL4 individually with 64 ranks.
Figure 7.7: KNL in QC mode. Curves: KNL7+KNL8 combined (KNL78), KNL6+KNL7+KNL8 combined (KNL678), and KNL6, KNL7, KNL8 individually with 60 ranks.
7.7 NELMDL and compilation tests

Figure 7.8: Effect of NELMDL.
Figure 7.9: Test results for various compilation options and math libraries.
References

[1] http://www.hector.ac.uk/support/documentation/software/vasp/ncore_and_npar_summary.pdf
[2] https://cms.mpi.univie.ac.at/vasp/vasp/parallelisation_npar_ncore_lplane_kpar_tag.html
[3] https://www.nsc.liu.se/~pla/blog/2015/01/12/vasp-how-many-cores/
[4] https://www.slideshare.net/jmskelton/vaspgpu-on-balena-usage-and-some-benchmarks