Applied Mathematics and Mechanics, Vol. 34, No. 9, Sep. 15, 2013, ISSN 1000-0887

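Article No.: 1000-0887(2013)09-0956-09

Direct Numerical Simulation of the Wall-Bounded Turbulent Flow by Lattice Boltzmann Method Based on Multi-GPU*

XU Ding, CHEN Gang, WANG Xian, LI Yue-ming
(State Key Laboratory for Strength and Vibration of Mechanical Structures, School of Aerospace, Xi'an Jiaotong University, Xi'an 710049, P. R. China)

Abstract: The lattice Boltzmann method (LBM) was implemented on graphics processing units (GPUs). The single-instruction multiple-thread (SIMT) architecture of the GPU matches the inherent parallelism of the LBM, so the GPU-based LBM solver runs with high efficiency. Using 8 GPUs, a direct numerical simulation (DNS) of wall-bounded turbulent flow was carried out on a mesh of $6.7\times10^7$ nodes with a non-dimensional mesh size of $\Delta^+ = 1.41$; $3\times10^6$ LBM steps took 24 h. The results agree well with those of Moser et al., which verifies that the GPU-based lattice Boltzmann method is capable of DNS of turbulent flow.

Key words: lattice Boltzmann method; GPU; DNS
CLC number: TB126; O351    Document code: A
DOI: 10.3879/j.issn.1000-0887.2013.09.009

Introduction

The lattice Boltzmann method (LBM) is a mesoscopic method whose macroscopic limit recovers the Navier-Stokes equations [1-2], and it has been applied to the DNS and LES of turbulence [3-4]. In this work the LBM is used for the DNS of turbulent channel flow at $Re_\tau = 180$, and the results are compared with the DNS data of Moser et al. and of Kim, Moin and Moser [5-6].

* Received 2013-05-30; Revised 2013-06-05
Foundation item: The National Natural Science Foundation of China (11242010; 11102150)
Biographies: XU Ding (1980-), E-mail: dingxu@mail.xjtu.edu.cn; WANG Xian (1977-), E-mail: wangxian@mail.xjtu.edu.cn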

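The DNS of Moser et al. [5] used a grid of $128^3$ points with $\Delta^+ = 4.4$, which is coarser than the near-wall Kolmogorov scale (see also the DNS of Kim, Moin and Moser, KMM [6]). Because DNS on CPUs is expensive and its cost grows rapidly with the Reynolds number, DNS has so far been limited to relatively low Re.

GPUs, which can be programmed with CUDA or OpenCL, are increasingly used for general-purpose computing (GPGPU, general-purpose graphics processing unit), an approach that began around 2003; Nvidia introduced CUDA [7] to make GPU programming practical. A Kepler-architecture Tesla GPU (Tesla K20 series) has up to 2 688 cores and a single-precision peak of 3.95 TFlops. For the Navier-Stokes equations solved with the MAC (marker and cell) method, GPU speed-ups of 30 to 40 have been reported, while for the lattice Boltzmann method speed-ups of about 100 have been achieved [8-11]. Since 2006, GPU computing has grown from the tera-scale ($10^{12}$ flops) to the peta-scale ($10^{15}$ flops) [12-15]. In this paper, a multi-GPU LBM solver is used for the DNS of turbulent channel flow at $Re_\tau = 180$.

1 Lattice Boltzmann method

The LBM is based on the Boltzmann equation with the BGK collision approximation (the Boltzmann-BGK equation):
$$\frac{\partial f_\alpha}{\partial t} + \mathbf e_\alpha\cdot\nabla f_\alpha = -\frac{1}{\lambda}\left(f_\alpha - f_\alpha^{\mathrm{eq}}\right) \qquad (1)$$
where $f_\alpha$ is the particle distribution function, $f_\alpha^{\mathrm{eq}}$ the equilibrium distribution, $\lambda$ the relaxation time and $\mathbf e_\alpha$ ($\alpha = 0, 1, 2, \dots, N-1$) the discrete velocities. $N$ is the number of discrete velocities: $N = 9$ for D2Q9, and $N = 13, 15, 19, 27$ for D3Q13, D3Q15, D3Q19 and D3Q27. The equilibrium distribution is
$$f_\alpha^{\mathrm{eq}} = \rho\,\omega_\alpha\left[1 + 3\,\mathbf e_\alpha\cdot\mathbf u + \frac{9}{2}\left(\mathbf e_\alpha\cdot\mathbf u\right)^2 - \frac{3}{2}\,\mathbf u\cdot\mathbf u\right] \qquad (2)$$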
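The equilibrium of Eq. (2) and the moment constraints of Eq. (3) can be checked numerically. The following is a minimal pure-Python sketch, assuming lattice units ($c = 1$) and the usual D3Q19 ordering (rest vector, 6 face neighbours, 12 edge neighbours), which matches the weights $\omega_0 = 1/3$, $\omega_1$ to $\omega_6 = 1/18$, $\omega_7$ to $\omega_{18} = 1/36$ given with Fig. 1. The names `d3q19_velocities`, `equilibrium` and `moments` are illustrative only, not the authors' code.

```python
import itertools

def d3q19_velocities():
    """Rest vector, then the 6 face neighbours, then the 12 edge neighbours."""
    cube = list(itertools.product((-1, 0, 1), repeat=3))
    faces = [v for v in cube if sum(abs(c) for c in v) == 1]
    edges = [v for v in cube if sum(abs(c) for c in v) == 2]
    return [(0, 0, 0)] + faces + edges

E = d3q19_velocities()
W = [1.0 / 3.0] + [1.0 / 18.0] * 6 + [1.0 / 36.0] * 12  # omega_0, omega_1..6, omega_7..18

def equilibrium(rho, u):
    """Eq. (2): rho*w*[1 + 3 e.u + 9/2 (e.u)^2 - 3/2 u.u] in lattice units."""
    uu = sum(c * c for c in u)
    feq = []
    for e, w in zip(E, W):
        eu = sum(ei * ui for ei, ui in zip(e, u))
        feq.append(rho * w * (1.0 + 3.0 * eu + 4.5 * eu * eu - 1.5 * uu))
    return feq

def moments(f):
    """Zeroth and first moments used in Eq. (3): density and momentum."""
    rho = sum(f)
    momentum = tuple(sum(fa * e[d] for fa, e in zip(f, E)) for d in range(3))
    return rho, momentum

if __name__ == "__main__":
    rho, mom = moments(equilibrium(1.0, (0.05, -0.01, 0.02)))
    print(rho, [m / rho for m in mom])
```

Because the D3Q19 weights satisfy $\sum\omega_\alpha = 1$, $\sum\omega_\alpha\mathbf e_\alpha = 0$ and $\sum\omega_\alpha e_{\alpha i}e_{\alpha j} = \delta_{ij}/3$, the zeroth and first moments of $f^{\mathrm{eq}}$ return $\rho$ and $\rho\mathbf u$ up to round-off.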

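In Eq. (2), $\omega_\alpha$ are weight coefficients; for the D3Q19 model shown in Fig. 1, $\omega_0 = 1/3$, $\omega_1$ to $\omega_6 = 1/18$ and $\omega_7$ to $\omega_{18} = 1/36$. $\rho$ and $\mathbf u$ are the macroscopic density and velocity.

Fig. 1 D3Q19-LBM model

The equilibrium distribution conserves mass and momentum:
$$\sum_{\alpha=0}^{N-1}\left(f_\alpha - f_\alpha^{\mathrm{eq}}\right) = 0,\qquad \sum_{\alpha=0}^{N-1}\left(f_\alpha - f_\alpha^{\mathrm{eq}}\right)\mathbf e_\alpha = 0,$$
$$\rho = \sum_{\alpha=0}^{N-1} f_\alpha = \sum_{\alpha=0}^{N-1} f_\alpha^{\mathrm{eq}},\qquad \rho\mathbf u = \sum_{\alpha=0}^{N-1} f_\alpha\mathbf e_\alpha = \sum_{\alpha=0}^{N-1} f_\alpha^{\mathrm{eq}}\mathbf e_\alpha \qquad (3)$$
In the LBM the speed of sound is $c_s = 1/\sqrt3$, and the pressure is
$$p = \rho c_s^2 = \rho/3 \qquad (4)$$
Discretizing Eq. (1) in space $\mathbf x$ and time $t$ gives the lattice BGK (LBGK) equation
$$f_\alpha(\mathbf x + \mathbf e_\alpha\Delta t,\, t + \Delta t) - f_\alpha(\mathbf x, t) = -\frac{1}{\tau}\left[f_\alpha(\mathbf x, t) - f_\alpha^{\mathrm{eq}}(\mathbf x, t)\right] \qquad (5)$$
where $\tau = \lambda/\Delta t$ is the dimensionless relaxation time, related to the kinematic viscosity by
$$\nu = \frac{1}{3}\left(\tau - \frac{1}{2}\right)\Delta t \qquad (6)$$
With $\Delta t = 1$, Eq. (5) is split into a collision step
$$\tilde f_\alpha(\mathbf x, t) = f_\alpha(\mathbf x, t) - \frac{1}{\tau}\left[f_\alpha(\mathbf x, t) - f_\alpha^{\mathrm{eq}}(\mathbf x, t)\right] \qquad (7)$$
and a streaming step
$$f_\alpha(\mathbf x + \mathbf e_\alpha,\, t + 1) = \tilde f_\alpha(\mathbf x, t) \qquad (8)$$
At the walls, the post-collision distribution at a boundary node is written as $\tilde f_\alpha = f_\alpha^{\mathrm{eq}} + f_\alpha^{\mathrm{neq}}$, and the non-equilibrium part is approximated by that of the neighbouring fluid node, $f_\alpha^{\mathrm{neq}} \approx f_{\alpha,\mathrm{neighbor}}^{\mathrm{neq}} = \tilde f_{\alpha,\mathrm{neighbor}} - f_{\alpha,\mathrm{neighbor}}^{\mathrm{eq}}$ (non-equilibrium extrapolation), so that
$$\tilde f_\alpha = f_\alpha^{\mathrm{eq}} + \tilde f_{\alpha,\mathrm{neighbor}} - f_{\alpha,\mathrm{neighbor}}^{\mathrm{eq}} \qquad (9)$$
where $f_\alpha^{\mathrm{eq}}$ at the boundary node is evaluated from Eq. (2) with the boundary velocity $\mathbf u$ and density $\rho$. Each time step thus consists of collision (Eq. (7) with Eq. (2)), streaming (Eq. (8)) and the boundary treatment (Eq. (9)).

The driving force is introduced as in the LBE channel simulation of [16]: $f^{\mathrm{eq}}$ is computed from Eq. (2) with a shifted velocity $\mathbf u^{\mathrm{eq}}$ defined by
$$\rho\mathbf u^{\mathrm{eq}} = \rho\mathbf u + \tau\mathbf G \qquad (10)$$
where $\mathbf G$ is the body force that drives the flow, and $\mathbf u^{\mathrm{eq}}$ replaces $\mathbf u$ in Eq. (2).

2 Comparison of NSE and LBE solvers on a GPU

To compare how well the Navier-Stokes (NSE) and lattice Boltzmann (LBE) solvers suit the GPU, the 2D flow around a cylinder was simulated on a 1 024 × 512 grid with one GPU (GeForce GTX280) and a single CPU (Intel Xeon E5420, 2.5 GHz). In the NSE solver the Poisson equation is solved with a red-black iteration. Table 1 shows that the GPU runs the NSE solver 13.7 times faster than the CPU overall. On the GPU, however, the Poisson solver takes 82% of the elapsed time, against 57% on the CPU, because it gains far less from the GPU (speed-up 9.5) than the advection-diffusion step (speed-up 56.0), whose share of the time drops from 36% on the CPU to 8.8% on the GPU.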
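Unlike the Poisson solve, which couples all nodes through repeated global sweeps, one LBGK time step (Eqs. (7), (8) and (10) of Section 1) reads only a node's own distributions during collision and writes only to nearest neighbours during streaming, which is why the method maps naturally onto one-thread-per-node SIMT execution. A minimal pure-Python sketch on a small box follows. It assumes full periodicity with no walls (so Eq. (9) is omitted) and stores the field in a dict keyed by (x, y, z); the function names are illustrative, not the authors' GPU kernels.

```python
import itertools

# D3Q19 lattice: rest vector, 6 face and 12 edge neighbours (weights 1/3, 1/18, 1/36)
_CUBE = list(itertools.product((-1, 0, 1), repeat=3))
E = ([(0, 0, 0)] + [v for v in _CUBE if sum(map(abs, v)) == 1]
     + [v for v in _CUBE if sum(map(abs, v)) == 2])
W = [1.0 / 3.0] + [1.0 / 18.0] * 6 + [1.0 / 36.0] * 12

def equilibrium(rho, u):
    """Eq. (2) in lattice units."""
    uu = u[0] ** 2 + u[1] ** 2 + u[2] ** 2
    out = []
    for e, w in zip(E, W):
        eu = e[0] * u[0] + e[1] * u[1] + e[2] * u[2]
        out.append(rho * w * (1.0 + 3.0 * eu + 4.5 * eu * eu - 1.5 * uu))
    return out

def macroscopic(f_node):
    """Density and momentum of one node, Eq. (3)."""
    rho = sum(f_node)
    mom = [sum(fa * e[d] for fa, e in zip(f_node, E)) for d in range(3)]
    return rho, mom

def collide(f_node, tau, G):
    """Eq. (7) with the forcing shift of Eq. (10); uses data from this node only."""
    rho, mom = macroscopic(f_node)
    u_eq = [(m + tau * g) / rho for m, g in zip(mom, G)]  # rho*u_eq = rho*u + tau*G
    feq = equilibrium(rho, u_eq)
    return [fa - (fa - fe) / tau for fa, fe in zip(f_node, feq)]

def stream(post, shape):
    """Eq. (8): copy each value one link along e_alpha, wrapping periodically."""
    nx, ny, nz = shape
    new = {node: [0.0] * len(E) for node in post}
    for (x, y, z), f_node in post.items():
        for a, e in enumerate(E):
            new[((x + e[0]) % nx, (y + e[1]) % ny, (z + e[2]) % nz)][a] = f_node[a]
    return new

def lbgk_step(f, shape, tau, G):
    """One time step: local collision at every node, then streaming."""
    post = {node: collide(f_node, tau, G) for node, f_node in f.items()}
    return stream(post, shape)

if __name__ == "__main__":
    shape = (4, 4, 4)
    f = {n: equilibrium(1.0, (0.0, 0.0, 0.0)) for n in itertools.product(*map(range, shape))}
    for _ in range(3):
        f = lbgk_step(f, shape, tau=0.51, G=(1e-5, 0.0, 0.0))
    print(macroscopic(f[(0, 0, 0)]))
```

Since the force enters only through $\mathbf u^{\mathrm{eq}}$, each collision adds exactly $\mathbf G$ to a node's momentum while leaving its density unchanged, so a uniform state accelerates by $\mathbf G$ per step; this mass and momentum balance is a quick sanity check before walls are added.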

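For the LBE solver (Table 2), the GPU is 87.4 times faster than the CPU overall: 124.2 times for the collision step (including the boundary condition) and 62.9 times for the streaming step (including the computation of macroscopic quantities). The GPU time of the collision step, 6.15 s, is listed in Table 2 as 4.28 s + 1.87 s. The LBM is therefore much better suited to the GPU than the NSE solver, which motivates the GPU-based LBM (LBM-GPU) solver used below.

Table 1 Elapsed time, performance and speed-up of CPU & GPU on simulating 2D flow around a cylinder (1 024 × 512) by NSE

| Step | CPU time t/ms | CPU performance P/GFlops | GPU time t/ms | GPU performance P/GFlops | Speed-up |
|---|---|---|---|---|---|
| overall | 282.12 | 1.24 | 20.65 | 16.89 | 13.7 |
| advection-diffusion | 102.43 (36%) | 1.15 | 1.83 (8.8%) | 64.23 | 56.0 |
| divergence of U | 3.97 | 0.79 | 0.18 | 17.14 | 21.7 |
| Poisson | 160.41 (57%) | 1.32 | 16.94 (82%) | 12.49 | 9.5 |
| gradient of p | 6.48 | 0.65 | 0.26 | 16.03 | 24.8 |

Table 2 Elapsed time, performance and speed-up of CPU & GPU on simulating 2D flow around a cylinder (1 024 × 512) by LBE

| Step | CPU time t/s | CPU performance P/GFlops | GPU time t/s | GPU performance P/GFlops | Speed-up |
|---|---|---|---|---|---|
| overall | 1 345.7 | 0.57 | 15.40 | 50.34 | 87.4 |
| collision (including BC) | 763.6 | 0.88 | 6.15 (4.28 + 1.87) | 109.10 | 124.2 |
| streaming (including macro value computation) | 582.1 | 0.18 | 9.25 | 11.30 | 62.9 |

3 DNS of turbulent channel flow

The computational model is the flow between two parallel plates (Fig. 2), where x, y and z are the streamwise, wall-normal and spanwise directions, $L_x$ and $L_z$ the domain lengths in x and z, and $\delta$ the channel half-width. Two Reynolds numbers are used. The centre-line Reynolds number is $Re_c = \delta u_c/\nu$, where $u_c$ is the centre-line velocity. With the friction velocity $u_\tau = \sqrt{G\delta/\rho}$, the viscous length $l_\tau = \nu/u_\tau$ and the viscous time $t_\tau = l_\tau/u_\tau$, the friction Reynolds number is
$$Re_\tau = \frac{\delta}{l_\tau} = \delta^+ = \frac{\delta u_\tau}{\nu}$$

Following Jiménez and Moin [17], the domain size is characterized by the aspect ratios $a_x = L_x/\delta$ and $a_z = L_z/\delta$. Spasov et al. [18] simulated the channel at $Re_\tau = 180$ with the LBM using $a_x = 4$ and $a_z = 1$; for comparison with Moser et al. [5] at the same Reynolds number, $a_x = 8$ and $a_z = 2$ are used in the present DNS.

For a DNS, the grid spacing $\Delta$ must resolve the Kolmogorov scale $\eta$; near the wall, the smallest Kolmogorov scale in wall units is $\eta^+ \approx 1.5$ [19].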
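The definitions above fix every simulation parameter once $Re_\tau$, $\delta$ and $u_\tau$ are chosen. The sketch below strings them together with Eq. (6) for $\tau$, the Kolmogorov-based resolution bound, the log-law estimate of $u_\tau$ and the initial mean profile described next. It assumes the values used in this section ($Re_\tau = 180$, $\delta = 128$, $u_c = 0.1$, $\kappa = 0.4$, $b = 6$, $y_w = 11.6$, $\rho = 1$, $\Delta t = \Delta = 1$); all helper names are hypothetical.

```python
import math

def friction_velocity(u_c, re_tau, kappa=0.4, b=6.0):
    """Log-law estimate: u_c/u_tau = (1/kappa) ln(Re_tau) + b."""
    return u_c / (math.log(re_tau) / kappa + b)

def lattice_parameters(re_tau, delta, u_tau, rho=1.0):
    """Convert (Re_tau, delta, u_tau) into LBM quantities, lattice units."""
    nu = u_tau * delta / re_tau        # Re_tau = delta * u_tau / nu
    tau = 3.0 * nu + 0.5               # Eq. (6) with dt = 1: nu = (tau - 1/2)/3
    G = rho * u_tau ** 2 / delta       # from u_tau = sqrt(G * delta / rho)
    l_tau = nu / u_tau                 # viscous length
    return {"nu": nu, "tau": tau, "G": G, "l_tau": l_tau,
            "t_tau": l_tau / u_tau,    # viscous time
            "delta_plus": 1.0 / l_tau} # lattice spacing (Delta = 1) in wall units

def min_half_width(re_tau, eta_plus=1.5):
    """Delta+ = Re_tau/delta < eta+  implies  delta > Re_tau/eta+."""
    return re_tau / eta_plus

def initial_mean_velocity(y_plus, u_tau, kappa=0.4, b=6.0, y_w=11.6):
    """Initial profile: u+ = y+ for y+ <= y_w, log law beyond."""
    u_plus = y_plus if y_plus <= y_w else math.log(y_plus) / kappa + b
    return u_plus * u_tau

if __name__ == "__main__":
    u_tau = friction_velocity(0.1, 180.0)
    p = lattice_parameters(180.0, 128, u_tau)
    print(f"u_tau={u_tau:.5f} tau={p['tau']:.4f} Delta+={p['delta_plus']:.2f}")
    print("minimum half-width:", min_half_width(180.0))
```

With these inputs $\Delta^+ = 180/128 \approx 1.41$ and $\tau \approx 0.511$, close to the limit $\tau = 1/2$ at which the viscosity of Eq. (6) vanishes.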

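Since the lattice spacing in the LBM is $\Delta = 1$, the condition $\Delta^+ < \eta^+ = 1.5$ becomes $\Delta^+ = \Delta/l_\tau = Re_\tau/\delta < 1.5$, i.e. $\delta > Re_\tau/1.5$; for $Re_\tau = 180$ this requires $\delta > 120$. The grid $L_x \times L_y \times L_z = 1\,024 \times 256 \times 256$ is therefore chosen ($\delta = 128$), which gives $\Delta^+ = 1.41$.

The friction velocity is estimated from the log law at the centre line, $u_c/u_\tau = (1/\kappa)\ln Re_\tau + b$, where $\kappa \approx 0.4$ is the von Kármán constant, $b \approx 6$, and $u_c = 0.1$. The initial field is $u = \bar u + u'$, $v = \bar v + v'$, $w = \bar w + w'$, $\rho = \rho_0 = 1$, with $\bar v = \bar w = 0$ and the mean streamwise velocity given by
$$u^+ = \begin{cases} y^+, & y^+ \le y_w \\ \ln y^+/\kappa + b, & y^+ > y_w \end{cases}$$
where $u^+ = \bar u/u_\tau$ and $y_w = 11.6$. The initial fluctuations are random, $u' = v' = w' = u_c\,(r_{\mathrm{rand}} - 0.5)$, where $r_{\mathrm{rand}}$ is a random number in $[0, 1]$.

Following [16], the characteristic time $t_f = a_x\delta/u_\tau$ is used, and about $5t_f$ is needed before the flow becomes statistically steady. In this DNS the flow was developed for $2\times10^6$ steps, and the whole run comprised $3\times10^6$ LBM steps; Fig. 3 shows the evolution of the average velocity at the central plane.

Fig. 2 Calculation model for flow between plates
Fig. 3 Profile of average velocity with time at the central plane

4 Multi-GPU implementation of the LBM

The DNS uses the D3Q19 LBM. A GPU with 3 GB of memory can hold about $10^7$ D3Q19 nodes, so the $6.7\times10^7$ nodes of this DNS require at least 7 GPUs, and data must be exchanged between GPUs through the CPUs. Following [11], the GPU-LBM solver runs on 8 NVIDIA Tesla M2050 GPUs: MPI (message passing interface) handles communication between processes, and the CUDA function cudaMemcpy transfers data between GPU and CPU [11]. As shown in Fig. 4, the domain is decomposed one-dimensionally in the y direction, so that each GPU holds a 1 024 × 32 × 256 subdomain and exchanges data only across x-z planes. In the D3Q19 model only the distributions with a nonzero y-component cross an x-z plane, 5 of the 19 in each direction (about 1/4).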
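The size of the y-slab decomposition and of the messages it implies can be estimated with a few lines. This is a back-of-envelope sketch, not the authors' implementation: it assumes double-precision (8-byte) distributions and two distribution arrays per node (a common but here hypothetical layout), and the function names are illustrative.

```python
import itertools

_CUBE = list(itertools.product((-1, 0, 1), repeat=3))
E = ([(0, 0, 0)] + [v for v in _CUBE if sum(map(abs, v)) == 1]
     + [v for v in _CUBE if sum(map(abs, v)) == 2])  # D3Q19 velocity set

def slab_bounds(rank, n_ranks, ny):
    """Half-open y-range [y0, y1) owned by `rank` in a 1D decomposition along y."""
    if ny % n_ranks:
        raise ValueError("ny must be divisible by the number of GPUs")
    h = ny // n_ranks
    return rank * h, (rank + 1) * h

def crossing_directions(sign):
    """Indices of D3Q19 velocities leaving a slab through its +y (sign=+1) or -y (sign=-1) face."""
    return [a for a, e in enumerate(E) if e[1] == sign]

def halo_bytes(nx, nz, n_dirs, bytes_per_value=8):
    """Bytes sent through one x-z face per time step (8-byte values assumed)."""
    return nx * nz * n_dirs * bytes_per_value

def subdomain_bytes(nx, ny_local, nz, q=19, copies=2, bytes_per_value=8):
    """Distribution storage per GPU, assuming two arrays of q 8-byte values per node."""
    return nx * ny_local * nz * q * copies * bytes_per_value

if __name__ == "__main__":
    nx, ny, nz, n_gpu = 1024, 256, 256, 8
    y0, y1 = slab_bounds(3, n_gpu, ny)
    up = crossing_directions(+1)
    print((nx, y1 - y0, nz), len(up), len(up) / len(E))
    print(halo_bytes(nx, nz, len(up)) / 2**20, "MiB per face per step")
    print(subdomain_bytes(nx, y1 - y0, nz) / 1e9, "GB of distributions per GPU")
```

Under these assumptions each GPU stores about 2.55 GB of distributions for its 1 024 × 32 × 256 slab, which fits in the 3 GB of a Tesla M2050, and each face exchange moves only 5 of the 19 distributions.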

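Fig. 4 One-dimensional domain decomposition

5 Results and discussion

The DNS ran $3\times10^6$ LBM steps on 8 GPUs in 24 h, a performance of $2.33\times10^9$ lattice-node updates per second. Spasov et al. [18] used 36 CPUs and achieved $5.60\times10^6$ node updates per second; the present solver is thus about 416 times faster.

Fig. 5 shows the iso-surface of the second invariant of the velocity gradient tensor. Fig. 6 compares the mean velocity profile $u(y)$, averaged over x-z planes, with the DNS of Moser et al. [5] for the aspect ratios $(a_x, a_z) = (4, 1)$, $(8, 1)$ and $(4, 2)$.

Fig. 5 Iso-surface of second invariant of velocity gradient tensor
Fig. 6 Profile of average velocity in the normal (y) direction

Fig. 7 shows the Reynolds stress $R_{uv}(y)$ (Fig. 7(a)) and the rms velocity fluctuations (Figs. 7(b) to (d)), compared with the DNS of Moser et al. [5] for the same three cases, $(a_x, a_z) = (4, 1)$, $(8, 1)$ and $(4, 2)$.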
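The throughput figures above follow from simple arithmetic on the run length. The sketch below assumes the 1 024 × 256 × 256 grid of Section 3 and treats the reported 24 h as exact wall-clock time; it reproduces the $2.33\times10^9$ updates per second and the factor of about 416 relative to the $5.60\times10^6$ quoted for [18]. The function names are illustrative.

```python
def lattice_updates_per_second(nodes, steps, seconds):
    """Throughput in lattice-node updates per second (LUPS)."""
    return nodes * steps / seconds

def wall_time_hours(nodes, steps, lups):
    """Inverse use: wall-clock hours for a run at a given throughput."""
    return nodes * steps / lups / 3600.0

if __name__ == "__main__":
    nodes = 1024 * 256 * 256  # about 6.7e7 nodes
    steps = 3_000_000
    lups = lattice_updates_per_second(nodes, steps, 24 * 3600)
    print(f"{lups / 1e6:.0f} MLUPS total, {lups / 8 / 1e6:.0f} MLUPS per GPU")
    print(f"speed-up over 5.60e6 LUPS: {lups / 5.60e6:.0f}")
    print(f"hours at 5.60e6 LUPS: {wall_time_hours(nodes, steps, 5.60e6):.0f}")
```

The same two functions can be used to budget a run before launching it, for example to estimate the wall time of a finer grid at a measured throughput.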

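Fig. 7 Profiles of Reynolds stress $R_{uv}(y)$ and rms of velocity fluctuations in the normal (y) direction: (a) $R_{uv}(y) = \overline{u'v'}/u_\tau^2$; (b) $u_{rms}(y) = \sqrt{\overline{u'^2}}/u_\tau$; (c) $v_{rms}(y) = \sqrt{\overline{v'^2}}/u_\tau$; (d) $w_{rms}(y) = \sqrt{\overline{w'^2}}/u_\tau$

6 Conclusions

1) The DNS by the lattice Boltzmann method agrees with the DNS of Moser et al., which shows that the lattice Boltzmann method can be used for DNS of wall-bounded turbulence.
2) With the GPU-based lattice Boltzmann solver, the DNS with $6.7\times10^7$ nodes and $3\times10^6$ LBM steps took 24 h, a performance of 2 330 MLUPS (million lattice updates per second), i.e. more than $2\times10^9$ lattice updates per second.
3) The comparison of the NSE and LBE solvers shows that the LBM is much better suited to the GPU, and that GPU-based LBM is a promising tool for large-scale CFD such as DNS.

References
[1] Chen S Y, Doolen G D. Lattice Boltzmann method for fluid flows[J]. Annual Review of Fluid Mechanics, 1998, 30: 329-364.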

[2] HE Ya-ling, WANG Yong, LI Qing. Lattice Boltzmann Method: Theory and Applications[M]. Beijing: Science Press, 2009. (in Chinese)
[3] Yu H, Girimaji S S, Luo L S. DNS and LES of decaying isotropic turbulence with and without frame rotation using lattice Boltzmann method[J]. Journal of Computational Physics, 2005, 209(2): 599-616.
[4] Yu H, Luo L S, Girimaji S S. LES of turbulent square jet flow using an MRT lattice Boltzmann model[J]. Computers & Fluids, 2006, 35(8/9): 957-965.
[5] Moser R D, Kim J, Mansour N N. Direct numerical simulation of turbulent channel flow up to Re_τ = 590[J]. Physics of Fluids, 1999, 11(4): 943-945.
[6] Kim J, Moin P, Moser R D. Turbulence statistics in fully developed channel flow at low Reynolds number[J]. Journal of Fluid Mechanics, 1987, 177: 133-166.
[7] Nvidia. NVIDIA CUDA Programming Guide[K]. Version 2.0. 2008.
[8] Ogawa S, Aoki T. GPU computing for 2-dimensional incompressible-flow simulation based on multi-grid method[C]//Transactions of JSCES, Paper No. 20090021. 2009.
[9] Harada T. Smoothed particle hydrodynamics on GPUs[C]//Proceedings of the Spring Conference on Computer Graphics. 2007: 235-241.
[10] Rossinelli D, Bergdorf M, Cottet G-H, Koumoutsakos P. GPU accelerated simulations of bluff body flows using vortex particle methods[J]. Journal of Computational Physics, 2010, 229(9): 3316-3333.
[11] Wang X, Aoki T. Multi-GPU performance of incompressible flow computation by lattice Boltzmann method on GPU cluster[J]. Parallel Computing, 2011, 37(9): 521-535.
[12] Shimokawabe T, Aoki T, Takaki T, Endo T, Yamanaka A, Maruyama N, Nukada A, Matsuoka S. Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer[C]//Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. New York, USA, 2011.
[13] Shimokawabe T, Aoki T, Ishida J, Kawano K, Muroi C. 145 TFlops performance on 3990 GPUs of TSUBAME 2.0 supercomputer for an operational weather prediction[C]//Proceedings of the International Conference on Computational Science, ICCS 2011. 2011, 4: 1535-1544.
[14] Wang X, Aoki T. High performance computation by multi-node GPU cluster-TSUBAME 2.0 on the air flow in an urban city using lattice Boltzmann method[J]. International Journal of Aerospace and Lightweight Structures, 2012, 2(1): 77-86.
[15] Miki T, Wang X, Aoki T, Imai Y, Ishikawa T, Takase K, Yamaguchi T. Patient-specific modeling of pulmonary air flow using GPU cluster for the application in medical practice[J]. Computer Methods in Biomechanics and Biomedical Engineering, 2012, 15(7): 771-778.
[16] Lammers P, Beronov K N, Volkert R, Brenner G, Durst F. Lattice BGK direct numerical simulation of fully developed turbulence in incompressible plane channel flow[J]. Computers & Fluids, 2006, 35(10): 1137-1153.
[17] Jiménez J, Moin P. The minimal flow unit in near-wall turbulence[J]. Journal of Fluid Mechanics, 1991, 225(1): 213-240.
[18] Spasov M, Rempfer D, Mokhasi P. Simulation of turbulent channel flow with an entropic lattice Boltzmann method[J]. Int J Numer Meth Fluids, 2009, 60(11): 1240-1258.
[19] Pope S B. Turbulent Flows[M]. Cambridge: Cambridge University Press, 2000.

Direct Numerical Simulation of the Wall-Bounded Turbulent Flow by Lattice Boltzmann Method Based on Multi-GPU

XU Ding, CHEN Gang, WANG Xian, LI Yue-ming
(State Key Laboratory for Strength and Vibration of Mechanical Structures, School of Aerospace, Xi'an Jiaotong University, Xi'an 710049, P. R. China)

Abstract: The wall-bounded turbulent flow was simulated directly (DNS) by the lattice Boltzmann method (LBM) through multi-GPU parallel computing. The data-parallel SIMT (single-instruction multiple-thread) characteristic of the GPU matches the parallelism of the LBM well, which leads to high efficiency of the GPU-based LBM solver and at the same time makes large-scale DNS possible on a desktop supercomputer. In this DNS work, 8 GPUs were adopted. The mesh contained $6.7\times10^7$ nodes, which resulted in a non-dimensional mesh size of $\Delta^+ = 1.41$ for the whole solution domain. It took only 24 hours for the GPU-LBM solver to simulate $3\times10^6$ LBM steps. Both the mean velocity and turbulence statistics such as the Reynolds stress and velocity fluctuations agree well with the results of Moser et al. The capability and validity of the LBM in simulating turbulent flow are thus verified.

Key words: lattice Boltzmann method; multi-GPU; parallel computing; wall-bounded turbulent flow; DNS

Foundation item: The National Natural Science Foundation of China (11242010; 11102150)