Parallel Scientific Computing by Computer Cluster Jiun-Hwa Lin Department of Electrical Engineering National Taiwan Ocean University
Outline
- Introduction
- Simple Cluster Setup
- Real Examples at NTOU
- Conclusions
What is Parallel Computing? Solving problems collaboratively and simultaneously with a group of processors. The processors are interconnected; while doing their own work, they need to talk to each other. from Dept. of Computer Science & Information Management, Providence Univ.
Why Parallel Computing? To solve larger and more complex problems: Grand Challenge Problems
Large-scale First-Principles Simulations of Shocks in Deuterium http://www.llnl.gov/asci/
ASCI White The simulation involved 1320 atoms and ran for several days on 2640 processors of ASCI White: a 512-node SMP (16 CPUs/node) with a peak speed of 12+ TeraOP/s
High-Resolution Simulations of Global Climate (figure panels at ~300 km, ~75 km, and ~50 km resolution)
High-Resolution Simulations of Global Climate: performing a series of global climate simulations using the NCAR CCM3 atmospheric model
More Applications from Dept. of Computer Science & Information Management, Providence Univ.
Why Parallel Computing? Research areas that demand high-performance computing:
- Nano-scale Electronics
- Computational Chemistry
- Aerospace
- Molecular Modeling
- Computational Electromagnetics
- Computational Acoustics
- Computational Fluid Dynamics
- Seismic Wave Propagation
- Plasma Physics
- and others
What is a Computer Cluster? The poor man's supercomputer: a commodity-based cluster system designed as a cost-effective alternative to large supercomputers
PC Cluster: M²COTS (Mass-Market Commodity-Off-The-Shelf) based systems that hook up PCs or workstations. from KAOS, Univ. of Kentucky
Beowulf-Class Systems Beowulf was the legendary sixth-century hero from a distant realm who freed the Danes of Heorot by destroying the oppressive monster. from Eric Fraser
Beowulf PC Clusters As a metaphor, Beowulf has been applied to a new strategy in high-performance computing that exploits mass-market technologies to overcome the oppressive costs in time and money of supercomputing, thus freeing scientists, engineers, and others to devote themselves to their respective disciplines. from http://www.heorot.dk; from Störtebeker Cluster Project
Beowulf PC Clusters Beowulf, both in myth and reality, challenges and conquers dominant obstacles in their respective domains, thus opening the way to future development. (diagram: rank0-rank3 nodes and a server connected through a hub or switch)
Why Computer Clusters? Low cost, high performance, configurability, scalability, high availability. from Störtebeker Cluster Project
How to Parallel Compute: Processors (PCs, workstations, multi-CPU, SMP, DMP, ...) (photos: Athlon 64, 64-bit Itanium, 64-bit PowerPC G5, dual-CPU motherboard from ASUS)
SMP: Shared-Memory Multiprocessor System from Dept. of Computer Science & Information Eng., Tunghai Univ.
DMP: Distributed-Memory Multiprocessor System from Dept. of Computer Science & Information Eng., Tunghai Univ.
How to Parallel Compute: Interconnects (networking, switches, ...): Ethernet (10 Mbps), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps), Myrinet (2 Gbps)
How to Parallel Compute: Software (O.S., languages, libraries, algorithms, compilers, ...) from Dept. of Computer Science & Information Eng., Tunghai Univ.
How to Parallel Compute: Logical view of cluster systems from Dept. of Computer Science & Information Eng., Tunghai Univ.
Linux: Free & Stable. "Software is like sex: it's better when it's free." (Linus Torvalds) (image: Tux, the Linux mascot)
Simple Cluster Setup Recipe: Hardware Configuration
- Intel Pentium III 500 MHz
- 512 MB SDRAM
- IDE hard disk
- Fast Ethernet interface cards (100 Mbps)
- Category 5 cables
- Hub or switch (100 Mbps)
- Monitor, keyboard & mouse
(photo: PC Cluster in EMLAB at NTOU-EE)
Simple Cluster Setup Recipe: Software Configuration
- Operating system: Red Hat Linux 7.2 (kernel 2.4.7-10)
- Compilers: gcc-g++-2.96
- Parallel interface: MPICH 1.2.4 (P4 device)
5.3 Directions
5.3.1 First Step
- Install Linux on each of the PCs.
- Edit the file /etc/hosts on each of the 4 PCs:
192.168.1.1 octopus0.ee.ntou.edu.tw octopus0
192.168.1.2 octopus1.ee.ntou.edu.tw octopus1
192.168.1.3 octopus2.ee.ntou.edu.tw octopus2
192.168.1.4 octopus3.ee.ntou.edu.tw octopus3
5.3.2 Second Step
- Edit the file /etc/hosts.equiv on each of the 4 PCs:
octopus0
octopus1
octopus2
octopus3
- This configures the computers so that MPICH's P4 device may be used to execute a distributed parallel application.
5.3.3 Third Step
- On the server node, make a directory /home/mirror. Configure the server to be an NFS server, and in /etc/exports add this line:
/home/mirror octopus0(rw) octopus1(rw) octopus2(rw) octopus3(rw)
5.3.4 Fourth Step
- On the other (non-server) nodes, make a directory /home/mirror. Add this line to /etc/fstab:
octopus0:/home/mirror /home/mirror nfs rw,bg,soft 0 0
- This exports the directory /home/mirror from the server and mounts it on each of the clients for easy distribution of software between the nodes.
- On the server node, install MPICH.
5.3.5 Fifth Step
- For each user that you create on the cluster, it is advised that you create a subdirectory owned by that user in the /home/mirror directory, such as /home/mirror/ryjou, where the user can put MPI programs and shared data files.
5.4 Installing MPICH-1.2.4
5.4.1 First Step
- Download MPICH from www.mcs.anl.gov/mpi/mpich/download.html or from ftp.mcs.anl.gov in the directory pub/mpi. Get mpich.tar.gz.
- Unpack mpich.tar.gz:
% cd /tmp
% tar zxovf mpich.tar.gz
- If tar does not accept the z option, use:
% cd /tmp
% gunzip -c mpich.tar.gz | tar xovf -
5.4.2 Second Step
- Configuration (install directory /usr/local/mpich-1.2.4/):
% ./configure --prefix=/usr/local/mpich-1.2.4 |& tee c.log
- Making:
% make |& tee make.log
- Running the examples:
% cd examples/basic
% make cpi
% ../../bin/mpirun -np 4 cpi
5.4.3 Third Step
- Installing (as root):
% make install
- Setting the path: edit /home/ryjou/.cshrc
...
setenv PATH /usr/sbin:/sbin:${PATH}
...
set path = ($path /usr/local/mpich-1.2.4/bin)
5.4.4 Fourth Step
- Check the installation:
% source .cshrc
% rehash
% which mpirun
/usr/local/mpich-1.2.4/bin/mpirun
- Setting the machines used: edit /usr/local/mpich-1.2.4/share/machines.linux
octopus0.ee.ntou.edu.tw
octopus1.ee.ntou.edu.tw
octopus2.ee.ntou.edu.tw
octopus3.ee.ntou.edu.tw
5.5 Compiling, Linking, & Running a Program
In the directory /home/ryjou/pmlfma:
- Compiling:
% mpicc -c mmtps.cpp
- Linking:
% mpicc -o mmtps mmtps.o
- Compiling & linking in a single command:
% mpicc -o mmtps mmtps.cpp
In the directory /home/mirror/ryjou/pmlfma:
- Running:
% mpirun -np 4 mmtps
(mmtps has to be copied to this directory first)
5.6 Some MPI Statements
5.6.1 Basic
- #include "mpi.h" (or #include <mpi++.h> for the C++ bindings)
- int main(int argc, char *argv[])
- void MPI::Init(int& argc, char**& argv)
- void MPI::Finalize()
- MPI::Intracomm::Bcast(void* buffer, int count, const Datatype& datatype, int root) const
- MPI::Intracomm::Reduce(const void* sendbuf, void* recvbuf, int count, const Datatype& datatype, const Op& op, int root) const
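As a minimal illustration of the calls listed above (a sketch only, not code from the slides; the interval count, variable names, and output are made up), the classic pi-approximation program can be written with these C++ bindings as follows:

// Minimal sketch using the bindings above: rank 0 broadcasts the number of
// intervals, every rank sums its share, and Reduce collects the result.
#include <mpi++.h>   // MPICH-1.2.4 C++ bindings as listed above; newer MPIs use <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    MPI::Init(argc, argv);
    int rank = MPI::COMM_WORLD.Get_rank();
    int size = MPI::COMM_WORLD.Get_size();

    int n = 10000;                                   // number of intervals, chosen by root
    MPI::COMM_WORLD.Bcast(&n, 1, MPI::INT, 0);

    double h = 1.0 / n, sum = 0.0;
    for (int i = rank + 1; i <= n; i += size) {      // each rank takes every size-th interval
        double x = h * (i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    double mypi = h * sum, pi = 0.0;
    MPI::COMM_WORLD.Reduce(&mypi, &pi, 1, MPI::DOUBLE, MPI::SUM, 0);

    if (rank == 0)
        std::printf("pi is approximately %.16f\n", pi);

    MPI::Finalize();
    return 0;
}

It can be built and run with the same commands as in Section 5.5, e.g. compiling with MPICH's C++ wrapper and launching with mpirun -np 4.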
5.6.2 Some Statements Used for Communication in the NTOU PMLFMA
- void Intracomm::Allgatherv(const void* sendbuf, int sendcount, const Datatype& sendtype, void* recvbuf, const int recvcounts[], const int displs[], const Datatype& recvtype) const
  Gathers data from all tasks and delivers it to all.
- void Intracomm::Barrier() const
  Blocks until all processes have reached this routine.
- Request Comm::Irecv(void* buf, int count, const Datatype& datatype, int source, int tag) const
  Begins a nonblocking receive.
- Request Comm::Isend(const void* buf, int count, const Datatype& datatype, int dest, int tag) const
  Begins a nonblocking send.
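A compact sketch of how these four calls fit together (illustrative only, not the PMLFMA source; the data sizes and the ring exchange are arbitrary choices):

// Each rank owns rank+1 values and gathers everyone's values with Allgatherv;
// a nonblocking Isend/Irecv pair then passes a token around a ring, and
// Barrier holds all ranks back until the exchange is finished.
#include <mpi++.h>
#include <vector>
#include <cstdio>

int main(int argc, char *argv[])
{
    MPI::Init(argc, argv);
    int rank = MPI::COMM_WORLD.Get_rank();
    int size = MPI::COMM_WORLD.Get_size();

    // Allgatherv: ranks contribute different amounts of data.
    std::vector<double> mine(rank + 1, double(rank));       // rank i owns i+1 copies of i
    std::vector<int> counts(size), displs(size);
    int total = 0;
    for (int i = 0; i < size; ++i) { counts[i] = i + 1; displs[i] = total; total += counts[i]; }
    std::vector<double> all(total);
    MPI::COMM_WORLD.Allgatherv(&mine[0], counts[rank], MPI::DOUBLE,
                               &all[0], &counts[0], &displs[0], MPI::DOUBLE);

    // Isend / Irecv: nonblocking exchange with the next and previous rank.
    int next = (rank + 1) % size, prev = (rank + size - 1) % size;
    double out = rank, in = -1.0;
    MPI::Request rreq = MPI::COMM_WORLD.Irecv(&in, 1, MPI::DOUBLE, prev, 0);
    MPI::Request sreq = MPI::COMM_WORLD.Isend(&out, 1, MPI::DOUBLE, next, 0);
    rreq.Wait();
    sreq.Wait();

    MPI::COMM_WORLD.Barrier();                               // everyone waits here before printing
    std::printf("rank %d gathered %d values, token from rank %.0f\n", rank, total, in);

    MPI::Finalize();
    return 0;
}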
Real Examples in the EM Field: NTOU's PMLFMA (diagram: incident EM wave, induced current on the conductor, scattered EM wave)
Multilevel Fast Multipole Method (MLFMA)
- Enclose the object in a cube.
- Divide the cube into 8 subcubes.
- Each subcube is recursively divided into smaller subcubes until the subcube length is 0.5.
- Retain only the nonempty cubes in the whole oct-tree structure.
A schematic sketch of this subdivision follows below.
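The following is a schematic sketch of that recursive oct-tree subdivision (an illustration only, not the NTOU PMLFMA code; the Point and Cube types and the minimum edge length parameter are assumptions made for the example):

// Recursively split a cube into 8 children until the edge length reaches a
// minimum, keeping only the nonempty children ("retain the nonempty cubes").
#include <vector>

struct Point { double x, y, z; };

struct Cube {
    double cx, cy, cz, half;           // center coordinates and half of the edge length
    std::vector<int> members;          // indices of the points inside this cube
    std::vector<Cube> children;        // nonempty sub-cubes only
};

void subdivide(Cube& c, const std::vector<Point>& pts, double minEdge)
{
    if (2.0 * c.half <= minEdge || c.members.size() <= 1) return;   // finest level reached
    for (int oct = 0; oct < 8; ++oct) {                             // 8 candidate children
        Cube child;
        child.half = 0.5 * c.half;
        child.cx = c.cx + ((oct & 1) ? child.half : -child.half);
        child.cy = c.cy + ((oct & 2) ? child.half : -child.half);
        child.cz = c.cz + ((oct & 4) ? child.half : -child.half);
        for (size_t k = 0; k < c.members.size(); ++k) {             // assign points to this child
            const Point& p = pts[c.members[k]];
            if (p.x >= child.cx - child.half && p.x < child.cx + child.half &&
                p.y >= child.cy - child.half && p.y < child.cy + child.half &&
                p.z >= child.cz - child.half && p.z < child.cz + child.half)
                child.members.push_back(c.members[k]);
        }
        if (!child.members.empty()) {                               // retain nonempty cubes only
            subdivide(child, pts, minEdge);
            c.children.push_back(child);
        }
    }
}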
Triangular-Patch Modeling of Objects: unknowns = 120,182; incident frequency = 0.9 GHz
Current Distribution
RCS (Radar Cross Section)
Real Examples in the EM Field: CPU time for the EM-LAB PC Cluster on Linux

Unknowns                60,165    120,182
Pre-Iteration (sec)      4,249     52,486
Iteration (sec)         30,762     12,686
Each Iteration (sec)      67.6      111.2
Total (sec)             35,011     65,172
Real Examples in the EM Field: Memory Requirement for the EM-LAB PC Cluster on Linux

Unknowns                     60,165    120,182    243,706 (estimated)
Required Memory (KB)/node    91,960    388,884    1,000,000 (estimated)
Real Examples in the EM Field: NTOU's PFDTD (figures: layout of the 3-D FDTD computational domain, including the PML, and the sub-blocks handled by each PC)
Real Examples in the EM Field (figures: methods for distributing the data on the block boundaries, schemes (1) and (2)). A sketch of such a boundary exchange follows below.
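As an illustration of this kind of boundary exchange (a sketch under assumptions, not the NTOU PFDTD code: the slab decomposition along one axis, the plane buffers, and the function name are all made up), each rank can swap its boundary planes of field values with its neighbors using the nonblocking calls from Section 5.6.2:

// The FDTD volume is assumed to be split along one axis; at every time step
// each rank sends its two boundary planes and receives the neighbors' planes
// into ghost buffers, using Isend/Irecv and a Waitall on all four requests.
#include <mpi++.h>
#include <vector>

// 'lower' and 'upper' are the planes we send; the received ghost planes
// overwrite 'ghostLo' and 'ghostHi'. A neighbor of MPI::PROC_NULL means
// "no neighbor on that side" and turns the call into a no-op.
void exchange_planes(std::vector<double>& lower, std::vector<double>& upper,
                     std::vector<double>& ghostLo, std::vector<double>& ghostHi,
                     int below, int above)
{
    int n = int(lower.size());
    MPI::Request reqs[4];
    reqs[0] = MPI::COMM_WORLD.Irecv(&ghostLo[0], n, MPI::DOUBLE, below, 0);
    reqs[1] = MPI::COMM_WORLD.Irecv(&ghostHi[0], n, MPI::DOUBLE, above, 1);
    reqs[2] = MPI::COMM_WORLD.Isend(&lower[0],   n, MPI::DOUBLE, below, 1);
    reqs[3] = MPI::COMM_WORLD.Isend(&upper[0],   n, MPI::DOUBLE, above, 0);
    MPI::Request::Waitall(4, reqs);                   // wait for all four transfers
}

Ranks at the two ends of the decomposition would pass MPI::PROC_NULL for the missing neighbor, so no special-case code is needed there.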
Conclusions
- What is required of users: MPI, parallel algorithms, and knowledge of parallel computing.
- Cluster computing systems are rapidly becoming the standard platforms for high-performance computing.
- Message-passing programming is the most obvious approach to taking advantage of cluster performance.
- New trends in hardware and software technologies are likely to make clusters even more promising.
GPGPU: General-Purpose computation on the Graphics Processing Unit
Modern graphics chips are highly programmable. Because they usually provide very high memory bandwidth and a large number of execution units, the idea arose of using graphics chips to help with general computational work; this is GPGPU. CUDA (Compute Unified Device Architecture) is NVIDIA's GPGPU model. NVIDIA's newer graphics chips, including the GeForce 8 series and later, support CUDA. NVIDIA provides the CUDA development tools (Windows and Linux versions), sample programs, documentation, and so on for free; they can be downloaded from the CUDA Zone.
Pros and Cons of GPGPU
Compared with using a CPU, using a graphics chip for computation has several advantages:
- Graphics chips usually have much higher memory bandwidth. For example, NVIDIA's GeForce 8800GTX has more than 50 GB/s of memory bandwidth, while current high-end CPUs offer around 10 GB/s.
- Graphics chips have far more execution units. For example, the GeForce 8800GTX has 128 "stream processors" clocked at 1.35 GHz. CPUs usually have higher clock rates but far fewer execution units.
- Compared with high-end CPUs, graphics cards are inexpensive. For example, a GeForce 8800GT with 512 MB of memory currently costs about the same as a 2.4 GHz quad-core CPU.
Pros and Cons of GPGPU
Using a graphics chip also has some drawbacks:
- Graphics chips have a very large number of execution units, so they offer little help for work that cannot be highly parallelized.
- Graphics chips currently usually support only 32-bit floating point, and most do not fully comply with the IEEE 754 standard, so some operations may be less accurate.
- Many current graphics chips have no separate integer execution units, so integer arithmetic is relatively inefficient.
- Graphics chips usually lack complex flow-control hardware such as branch prediction, so programs with heavy branching run relatively poorly.
- Current GPGPU programming models are still immature, and there is no accepted standard; for example, NVIDIA and AMD/ATI each have their own programming model.
Pros and Cons of GPGPU
Overall, a graphics chip behaves much like a stream processor and is suited to performing a large amount of identical work at once, whereas a CPU is more flexible and can handle more varied work at the same time.
CUDA Architecture
CUDA is NVIDIA's GPGPU model. It is based on the C language, so programs that run on the graphics chip can be written directly in the C that most people are familiar with, without learning chip-specific instructions or special structures. Under the CUDA architecture, a program is divided into two parts: the host side and the device side. The host side is the part that runs on the CPU, while the device side runs on the graphics chip; the device-side program is also called a "kernel". Typically, the host program prepares the data, copies it into the graphics card's memory, has the graphics chip execute the device program, and finally copies the results back from the card's memory. A minimal sketch of this flow is given below.
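The following is a minimal CUDA C++ sketch of that host/device flow (an illustration only, not from the slides; the kernel, array size, and all names are assumptions made for the example):

// Host prepares an array, copies it to the card, the kernel squares every
// element on the device, and the host copies the result back.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void square(float* data, int n)           // "kernel": runs on the device
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per array element
    if (i < n) data[i] = data[i] * data[i];
}

int main()
{
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = float(i);

    float* dev = 0;
    cudaMalloc((void**)&dev, n * sizeof(float));                       // allocate card memory
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device

    square<<<(n + 255) / 256, 256>>>(dev, n);                          // launch the kernel

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(dev);

    std::printf("host[10] squared = %f\n", host[10]);
    return 0;
}

Such a program would be compiled with NVIDIA's nvcc compiler from the CUDA toolkit mentioned above.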
CUDA Architecture (diagram)