Vol. 31, No. 4, Nov. 2013    PROGRESS IN ASTRONOMY    doi: 10.3969/j.issn.1000-8349.2013.04.05

Applying GPU Parallel Computing Technologies to Process Herschel Far-Infrared Galactic Plane Survey Data

ZHU Jia-li 1,2, HUANG Mao-hai 1
(1. National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100012, China;
 2. University of Chinese Academy of Sciences, Beijing 100049, China)

Key words: GPU; parallel computing; CUDA; PyOpenCL; Hi-GAL
Classification numbers: N37; P141.91    Document code: A

1  Introduction

The graphics processing unit (GPU) was originally developed for accelerating graphics rendering. Unlike the CPU, which devotes much of its chip area to control logic and cache, the GPU devotes most of its transistors to arithmetic units, so modern GPUs from NVIDIA and AMD offer far higher peak floating-point throughput than contemporary CPUs. For example, the NVIDIA Tesla C2075 has 448 processing cores and a peak performance of 1030 Gflops, well beyond what a CPU of the same generation can deliver. This computing power makes the GPU attractive for processing and mining massive astronomical data sets such as Hi-GAL [1] (the Herschel infrared Galactic Plane Survey).

Received 2013-05-20; accepted 2013-06-26. Supported by grant KJCX2-YW-T20.
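The quoted 1030 Gflops peak of the Tesla C2075 follows directly from its core count and clock rate (the 1.15 GHz shader clock is taken from NVIDIA's published specifications, not from the text above):

```python
# Quick check of the Tesla C2075's quoted peak throughput:
# 448 cores, each retiring one fused multiply-add (2 floating-point
# operations) per cycle at the 1.15 GHz shader clock.
cores = 448
clock_ghz = 1.15
flops_per_cycle = 2
peak_gflops = cores * clock_ghz * flops_per_cycle
print(round(peak_gflops, 1))  # -> 1030.4
```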
During the science demonstration phase (SDP), Hi-GAL observed two 2° × 2° fields, centred at (l = 30°, b = 0°) and (l = 59°, b = 0°), in five bands: PACS 70 µm and 160 µm, and SPIRE 250 µm, 350 µm and 500 µm. Each image contains 10^5 to 10^6 pixels, so processing and mining the Hi-GAL data is computationally demanding and well suited to GPU acceleration. Section 2 introduces GPU parallel programming; Section 3 presents two applications of GPU parallel computing to Hi-GAL data processing.

2  GPU parallel programming

CUDA (Compute Unified Device Architecture) is NVIDIA's GPU programming platform. CUDA C extends C/C++ so that C programs can run on the GPU [2]; it is restricted to NVIDIA GPUs. OpenCL [3] (Open Computing Language) is an open standard supported by NVIDIA, AMD, IBM, Intel and others; OpenCL programs can run not only on GPUs but also on CPUs. The kernel languages of both CUDA and OpenCL are based on C.

A CUDA or OpenCL program consists of two parts [2, 3]: (1) host code, which runs on the CPU, and (2) kernels, which run on the GPU. In CUDA, the threads executing a kernel are grouped into blocks; in OpenCL, work items are grouped into work groups. Each thread (work item) has an ID, from which it determines which part of the data it processes, in parallel with all the others.

3  Applying GPU parallel computing to Hi-GAL data processing

This section presents two applications of the GPU to Hi-GAL data, using CUDA C in the first example and PyOpenCL (a Python wrapper of OpenCL) in the second.
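The ID-based decomposition can be illustrated with a small Python simulation (illustrative only; the names and sizes are not from the paper): every thread computes, from its block and thread IDs, the set of data elements it is responsible for, and together the threads cover every element exactly once.

```python
# Simulation of the CUDA/OpenCL indexing scheme: each "thread" is
# identified by its (block ID, thread ID) pair and claims elements
# with a grid-stride loop.
BLOCK_NUM, THREADS_NUM = 4, 8    # illustrative launch configuration
N = 100                          # number of data elements (e.g. pixels)

def elements_for(bid, tid):
    """Indices processed by thread (bid, tid)."""
    stride = BLOCK_NUM * THREADS_NUM
    return list(range(bid * THREADS_NUM + tid, N, stride))

# Together the 32 threads cover every element exactly once:
claimed = sorted(e for b in range(BLOCK_NUM)
                   for t in range(THREADS_NUM)
                   for e in elements_for(b, t))
print(claimed == list(range(N)))  # -> True
```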
3.1  Fitting the dust SED with CUDA C

3.1.1  The problem

Interstellar dust grains (with sizes of order 0.01 µm and larger) emit a thermal continuum that dominates the Galactic plane at far-infrared wavelengths, longward of about 50 µm [4]. For the Hi-GAL l = 30° field we fit the spectral energy distribution (SED) of each pixel using four bands: PACS 160 µm and SPIRE 250 µm, 350 µm and 500 µm. After regridding all images to the SPIRE 500 µm pixel grid, there are 313 917 pixels to fit. Fitting them one by one on a CPU is very slow, but since every pixel is fitted independently with the same procedure, the problem is ideally suited to the GPU.

3.1.2  The GPU implementation

The per-pixel fitting procedure, which uses model spectra computed with CLOUDY [5], is identical for every pixel, so the pixels can be processed in parallel on the GPU. The CUDA C program (Fig. 1) has three parts: (1) the CPU reads the image data; (2) the CPU allocates GPU memory and copies the data from the CPU to the GPU; (3) the GPU kernel fits all pixels in parallel, and the results are copied back from the GPU to the CPU.

    /* GPU kernel: each thread fits a share of the pixels */
    __global__ static void pixels(float *gpudata, float *res)
    {
        long which;
        const int tid = threadIdx.x;  /* thread ID within the block */
        const int bid = blockIdx.x;   /* block ID */
        /* grid-stride loop over the pixels */
        for (which = bid * THREADS_NUM + tid; which < LINE;
             which += THREADS_NUM * BLOCK_NUM) {
            /* the same per-pixel fitting code as the CPU version */ ...
        }
    }

    /* CPU host code */
    int main()
    {
        /* read the images */ ...
        /* allocate memory on the GPU */ ...
        cudaMalloc((void **)&gpudata, sizeof(float) * WIDTH * LINE);
        cudaMemcpy2D(gpudata, sizeof(float) * WIDTH, sed30,
                     sizeof(float) * WIDTH, sizeof(float) * WIDTH,
                     LINE, cudaMemcpyHostToDevice); ...
        /* launch the kernel on the GPU with the <<<...>>> syntax */
        pixels<<<BLOCK_NUM, THREADS_NUM, parameters>>>(gpudata, res);
        /* copy the results back from the GPU to the CPU */
        cudaMemcpy2D(result, sizeof(float) * WIDTH, res,
                     sizeof(float) * WIDTH, sizeof(float) * WIDTH,
                     LINE, cudaMemcpyDeviceToHost); ...
    }

Fig. 1  Structure of the CUDA C program for parallel SED fitting
3.1.3  Results

We first tested the program on a sample of 87 500 pixels. With IDL 7.1 on the CPU (one core of a Xeon E5620), the fitting took 34 h; the GPU version on a laptop's built-in NVIDIA GeForce 9400M took 0.5 h, a speedup factor of 68 (Table 1). The GeForce 9400M is a low-performance GPU, yet it still far outperforms the CPU.

For the full fitting procedure, a test of 1000 iterations took 45 s with IDL 7.1 on the CPU (one core of a Xeon E5620); fitting the entire l = 30° field would take an estimated 827 h on the CPU. The CUDA version on an NVIDIA Tesla C2075 GPU finishes the field in 9 min, a speedup factor of 5513 (Table 2).

Table 1  Time cost of the test fit on the CPU and GPU (t/min)

    Processor                       Time     Speedup
    CPU (Xeon E5620, 1 core)        2040
    GPU (NVIDIA GeForce 9400M)        30     68

Table 2  Time cost of fitting the l = 30° field on the CPU and GPU (t/min)

    Processor                       Time     Speedup
    CPU (Xeon E5620, 1 core)       49620
    GPU (NVIDIA Tesla C2075)           9     5513

Fig. 2 presents the fitting result for the l = 30° field, computed with the NVIDIA Tesla C2075 GPU [6].

Fig. 2  Fitting result for the l = 30° field
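The speedup factors in Tables 1 and 2 follow directly from the measured run times once both are expressed in minutes:

```python
# Speedup factors implied by the run times in Tables 1 and 2.
cpu_test_min = 34 * 60       # 34 h, IDL 7.1 on one Xeon E5620 core
gpu_test_min = 0.5 * 60      # 0.5 h on the GeForce 9400M
print(round(cpu_test_min / gpu_test_min))   # -> 68   (Table 1)

cpu_field_min = 827 * 60     # estimated 827 h for the l = 30 field
gpu_field_min = 9            # 9 min on the Tesla C2075
print(round(cpu_field_min / gpu_field_min)) # -> 5513 (Table 2)
```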
3.2  A 3D model of the interstellar medium with PyOpenCL

3.2.1  The model

Rather than a Cartesian grid of 10 pc × 10 pc × 10 pc cells, the 3D model uses spherical coordinates (r, θ, φ) = (d, π/2 − b, l) centred on the Sun, where d is the heliocentric distance and l and b are Galactic longitude and latitude. The distribution of molecular clouds is taken from GRS [7] (the Galactic Ring Survey). At a grid resolution of 100 pc × 0.2° × 0.2°, the l = 30° field contains 4493 cells responsible for absorbing FUV photons (cells containing molecular material).

Computing the absorption of these cells once on the CPU (one core of a Xeon E5620, Python 2.7) at the 100 pc × 0.2° × 0.2° resolution takes 6.8 min. To match the Hi-GAL 500 µm resolution (~0.01°), a far finer grid would be needed, and the CPU cost becomes prohibitive; this part of the computation is therefore moved to the GPU.

3.2.2  The GPU implementation

The Monte Carlo fitting alternates between the CPU and the GPU: in each iteration, the CPU samples the free parameters, the GPU computes the absorption of all cells in parallel, and the CPU then computes the goodness of fit by comparing the result with the observations. The absorption calculation accounts for 95% of the total time, and it is exactly the part that parallelizes well. On an NVIDIA Tesla C2075 GPU, one iteration of the absorption calculation (20 000 cells) takes 0.043 s, 9535 times as fast as the CPU (Table 3). Since the Monte Carlo fitting needs of order 10^5 iterations, the GPU acceleration is essential.
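The coordinate convention above can be written out explicitly (a minimal sketch; the function name is illustrative, not from the paper):

```python
import math

# A point at heliocentric distance d along Galactic direction (l, b)
# gets model coordinates (r, theta, phi) = (d, pi/2 - b, l), with the
# angles in radians.
def model_coords(d_pc, l_rad, b_rad):
    return (d_pc, math.pi / 2 - b_rad, l_rad)

# A cloud in the Galactic plane (b = 0) lies at theta = pi/2, i.e. in
# the equatorial plane of the spherical grid:
r, theta, phi = model_coords(1000.0, math.radians(30.0), 0.0)
print(math.isclose(theta, math.pi / 2))  # -> True
```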
Table 3  Time cost per iteration on the CPU and GPU (t/s)

    Processor                       Time     Speedup
    CPU (Xeon E5620, 1 core)        410
    GPU (NVIDIA Tesla C2075)        0.043    9535

This example is implemented with PyOpenCL [8], a Python wrapper of OpenCL: the host code is written in Python, while the kernel is written in OpenCL C, as in CUDA C. Each cell of the grid is labelled by an index triple (i, j, k), with 0 ≤ i < MAX(i), 0 ≤ j < MAX(j) and 0 ≤ k < MAX(k), where MAX(i), MAX(j) and MAX(k) are the numbers of cells along the three axes. Each GPU work item computes the absorption of one cell, identified by its index triple. For example, the cell at (500 pc, 0.4°, 0.6°) has index (5, 2, 3).
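The mapping from a position in the model volume to its cell index can be sketched as follows (an illustrative reconstruction; the names and the floating-point guard are assumptions, not from the paper):

```python
# Cell size of the spherical grid: 100 pc x 0.2 deg x 0.2 deg.
DR, DTHETA, DPHI = 100.0, 0.2, 0.2

def cell_index(r_pc, theta_deg, phi_deg, eps=1e-9):
    """(i, j, k) of the cell containing the given point.

    eps guards against floating-point rounding: 0.6 / 0.2 evaluates
    to slightly less than 3 in IEEE arithmetic.
    """
    return (int(r_pc / DR + eps),
            int(theta_deg / DTHETA + eps),
            int(phi_deg / DPHI + eps))

# The example from the text: (500 pc, 0.4 deg, 0.6 deg) -> (5, 2, 3)
print(cell_index(500.0, 0.4, 0.6))  # -> (5, 2, 3)
```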
3.2.3  Dividing the work between the CPU and the GPU

In this example the work is divided between the CPU and the GPU. In general, a task is worth moving to the GPU when: (1) the amount of computation is large compared with the cost of transferring the data between the CPU and the GPU; (2) the task decomposes into many independent sub-tasks executing the same code, so that the GPU's many cores can be kept busy. Serial tasks, or tasks involving little computation, are better left on the CPU.

3.3  Optimizing the GPU code

Two general optimizations for GPU programs are [3]: (1) minimizing data transfer between the CPU and the GPU; (2) using the fast local memory (Local Memory) in preference to the slower global memory (Global Memory). In addition, replacing the NVIDIA Tesla C2075 with a K20m GPU keeps the time per iteration well below 0.1 s; Table 4 shows that the K20m needs about 40% less time than the C2075. The launch configuration also matters: the maximum block dimensions of these GPUs are (1024, 1024, 64), and the numbers of work groups and work items should be matched to the number of cells so that the GPU's cores are fully occupied. With these considerations, the GPU can be used efficiently for this kind of cell-by-cell parallel computation.
Table 4  Time cost per iteration on the K20m and C2075 (t/s)

    Number of cells    K20m     C2075    K20m/C2075
    20 000             0.027    0.043    63%
    200 000            0.029    0.049    59%

4  Summary

We have applied GPU parallel computing technologies to two Hi-GAL data processing problems: fitting the far-infrared SED of every pixel of the l = 30° field with CUDA C, and computing the FUV absorption of a 3D model of the interstellar medium with PyOpenCL. With modest programming effort, GPUs achieved speedup factors from 68 to 9535 over a single CPU core. As astronomical survey data volumes keep growing, GPU parallel computing will play an increasingly important role in data processing and mining.

Acknowledgements  We thank Rainer Spurzem, Peter Berczik and Peter Schwekendiek for providing access to the NVIDIA Tesla K20m GPU.

References

[1] Molinari S, et al. A&A, 2010, 518: L100
[2] NVIDIA Corporation. NVIDIA CUDA C Programming Guide, version 4.2. http://wwwteor.mi.infn.it/vicini/cuda/doc/CUDA_C_Programming_Guide.pdf, 2012
[3] Advanced Micro Devices, Inc. AMD Accelerated Parallel Processing OpenCL Programming Guide. http://www.primeval-slayer.com/amd-accelerated-parallel-processing-opencl-programming-guide/read/41855/, 2012
[4] Draine B T. ARA&A, 2003, 41: 241
[5] Ferland G J, Korista K T, Verner D A, et al. PASP, 1998, 110: 761
[6] Zhu J L, Huang M. 2013, in preparation
[7] Roman-Duval J, Jackson J M, Heyer M, et al. ApJ, 2009, 699: 1153
[8] Klöckner A. PyOpenCL documentation. http://documen.tician.de/pyopencl/, 2013
Applying GPU Parallel Computing Technologies to Process Herschel Far-Infrared Galactic Plane Survey Data

ZHU Jia-li 1,2, HUANG Mao-hai 1
( 1. National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100012, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China )

Abstract: The Hi-GAL (Herschel infrared Galactic Plane Survey) images provide data with extraordinary spatial coverage and resolution for studying the FIR emission in the Galactic plane. Graphics processing unit (GPU) parallel computing technologies are well suited to accelerating the processing and mining of these massive data. We illustrate the application of GPU parallel computing technologies in two examples of Hi-GAL data processing. We omit unnecessary physical details and focus on the method of using GPUs in Herschel infrared data processing. In the first example, we demonstrate a simple and straightforward application of GPU parallel computing technologies by fitting the far-infrared spectral energy distribution of the dust continuum emission in the Hi-GAL l = 30° field. There are over 3 × 10^5 pixels in the image of the l = 30° field. The fitting procedure for every pixel is performed in parallel by a GPU. Comparing the time cost for fitting the entire image, the acceleration factor of the built-in GPU of a low-performance laptop is 68, and a specialized GPU is 5513 times faster than a Xeon E5620 with one core. In the second example, we demonstrate a more sophisticated application of GPU parallel computing technologies. Based on the Hi-GAL l = 30° field data, the distribution of molecular clouds derived from GRS (Galactic Ring Survey) data, and the properties of H II regions, we construct a 3D model of the interstellar medium to calculate the absorption of dust grains associated with molecular clouds. The resolution of the 3D model is 100 pc × 0.2° × 0.2° on a spherical grid. At this resolution, there are 4493 cells in total responsible for absorbing FUV photons.
The absorption of these cells is calculated in parallel by a GPU. The resulting absorption is then compared with observations using a Monte Carlo fitting method. In every iteration, the CPU samples the free parameters and computes the goodness of fit. The GPU part of the calculation accounts for 95% of the total time. Comparing the time cost for one iteration, the NVIDIA Tesla C2075 GPU is 9535 times as fast as a Xeon E5620 with one core.

Key words: GPU; parallel computing; Hi-GAL; data analysis