mpic_2002


MPI Parallel Programming in C

Author: 鄭守成
Date: January 1, 2002 (ROC year 91)
Phone: (03) x
E-mail: c00tch00@nchc.gov.tw

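Contents

Chapter 1  Introduction
    MPI parallel computing software
    The parallel computing environment at the National Center for High-Performance Computing
    Using MPI on the IBM SP2
        MPI C compile commands on the IBM SP2
        Job command files on the IBM SP2
        Running parallel programs on the IBM SP2
    Using MPI on a PC cluster
        MPI C compile commands on a PC cluster
        Job command files on a PC cluster
        Running parallel programs on a PC cluster

Chapter 2  Parallel programs without boundary data exchange
    Basic MPI functions
        The mpi.h include file
        MPI_Init, MPI_Finalize
        MPI_Comm_size, MPI_Comm_rank
        MPI_Send, MPI_Recv
    T2SEQ: sequential program without boundary data exchange
    T2CP: parallel program without data partition
    MPI_Scatter, MPI_Gather, MPI_Reduce
        MPI_Scatter, MPI_Gather
        MPI_Reduce, MPI_Allreduce
    T2DCP: parallel program with data partition

Chapter 3  Parallel programs that need boundary data exchange
    MPI_Sendrecv, MPI_Bcast
        MPI_Sendrecv
        MPI_Bcast
    T3SEQ: sequential program with boundary data exchange
    T3CP: boundary data exchange without data partition
    T3DCP_: boundary data exchange with data partition (1)
    T3DCP_: boundary data exchange with data partition (2)

Chapter 4  Parallel programs whose grid count is not evenly divisible
    T4SEQ: sequential program whose grid count is not evenly divisible
    MPI_Scatterv, MPI_Gatherv
    MPI_Pack, MPI_Unpack, MPI_Barrier, MPI_Wtime
        MPI_Pack, MPI_Unpack
        MPI_Barrier, MPI_Wtime
    4.4 T4DCP: parallel program with data partition

Chapter 5  Parallel programs with multidimensional arrays
    T5SEQ: sequential program with multidimensional arrays
    T5CP: multidimensional arrays without data partition
    T5DCP: multidimensional arrays partitioned on the first dimension
    MPI functions for defining a two-dimensional partition
    Cartesian topology
    MPI functions for a two-dimensional partition: MPI_Cart_create, MPI_Cart_coords, MPI_Cart_shift
    MPI functions for fixed-stride data: MPI_Type_vector, MPI_Type_commit
    T5_2D: multidimensional arrays partitioned on the first two dimensions

Chapter 6  Improving the efficiency of MPI programs
    Nonblocking data transfer
    Combining data transfers
    Replacing boundary data exchange with boundary data computation
    Arranging input and output data
    Partitioning the input data in advance
    Collecting the partitioned output data afterwards

Chapter 7  Derived data types
    Derived data types
    Array transposition
    Two-way recursion and the pipeline method

Chapter 8  Multi-directional dependence and SOR solvers
    Four-way dependence and the SOR solver
    Red-black SOR
    Zebra SOR
    Eight-way dependence and four-colour SOR

Chapter 9  Finite element programs
    Sequential finite element program
    Parallel finite element program

References

Parallel Processing of 1-D Arrays without Partition
Parallel Processing of 1-D Arrays with Partition
Parallel on the 1st Dimension of 2-D Arrays without Partition
Parallel on the 1st Dimension of 2-D Arrays with Partition
Partition on the 1st dimension of 3-D Arrays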

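Chapter 1  Introduction

This chapter introduces the MPI parallel computing software, the parallel computing environment currently available at the National Center for High-Performance Computing (NCHC), and how to use MPI on each type of machine.

Section 1.1 gives a brief introduction to the MPI software.
Section 1.2 describes the parallel computing environment at NCHC.
Section 1.3 explains how to use MPI on the IBM SP2: setting the paths, compiling a parallel program, and running it.
Section 1.4 explains the same steps for a PC cluster.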

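1.1 MPI Parallel Computing Software

MPI (Message Passing Interface) is the first standardized message-passing interface for parallel programming. It can be used from programs written in Fortran, C and C++. MPI programs run on distributed-memory parallel systems and also on clusters of shared-memory machines.

The MPI software currently supplied by system vendors implements MPI 1.2, which offers more than one hundred functions for the programmer to choose from. The MPI Forum published the MPI 2.0 specification in 1997, and software implementing MPI 2.0 should become available within a few years.

Argonne National Laboratory in the United States has released the complete MPICH 1.2 package, which includes part of the MPI 2.0 functionality. Interested readers can download it free of charge from the web, or by anonymous ftp from ftp.mcs.anl.gov, directory pub/mpi, file mpich tar.z or mpich tar.gz. The same directory holds a good deal of other MPI-related material.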

1.2 The Parallel Computing Environment at NCHC

The IBM SP2, IBM SP2 SMP, HP SPP2000, SGI Origin2000 and Fujitsu VPP300 systems at NCHC each carry the vendor's own MPI software. The PC cluster uses the public-domain MPICH. All of them can run parallel programs.

So far, however, only the PC cluster, the IBM SP2 and the IBM SP2 SMP provide an environment in which one CPU runs only one program. For example, a user who obtains four CPUs on the IBM SP2 has those four CPUs to himself until the parallel program finishes; no other program competes for CPU time. On the other machines (the HP SPP2000, for instance), if the total demand from all users exceeds the number of CPUs in the system, each of the four CPUs may be time-shared with other programs.

The HP SPP2000 and the SGI Origin2000 are shared-memory parallel systems in which 16 CPUs share one memory. The SP2 and the VPP300 are distributed-memory parallel systems in which every CPU has its own private memory. The SP2 SMP is a hybrid of the two: each node has 4 CPUs sharing one memory, and the SMP cluster currently has 42 nodes.

Batch jobs are scheduled as follows:

    SP2, SP2 SMP                   LoadLeveler (the system's own job scheduler). Prepare a LoadLeveler job command file and submit it with llsubmit.
    SPP2000, Origin2000, VPP300    NQS (Network Queue System). Prepare an NQS job command file and submit it with qsub.
    PC cluster                     DQS (Distributed Queue System), used in much the same way as NQS.

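1.3 Using MPI on the IBM SP2

C shell users must first add the following paths to the .cshrc file in their home directory. Without them the shell cannot find the include files (mpif.h, mpif90.h, mpi.h), the compile commands (mpxlf, mpxlf90, mpcc, mpCC), the MPI library, or the LoadLeveler commands (llsubmit, llq, llstatus, llcancel).

    set lpath=(. ~ /usr/lpp/ppe.poe/include /usr/lpp/ppe.poe/lib)
    set lpath=($lpath /usr/lpp/ppe.poe/bin /home/loadl/bin )
    set path=($path $lpath)

After adding the paths, save .cshrc and run source .cshrc. Parallel programs can then be compiled and run. After logging out and logging in again, source .cshrc no longer needs to be run by hand.

1.3.1 MPI C compile commands on the IBM SP2

The compiler for C programs that use MPI is usually called mpicc, but on the IBM SP2 and SP2 SMP it is called mpcc. Commonly used mpcc options are:

    mpcc -O3 -qarch=auto -qstrict -o file.x file.c

    -O3            highest level of optimization (level 3), which can speed up the computation several times
    -qarch=auto    tells the compiler that the program will run on the same type of machine it is compiled on
    -qstrict       tells the compiler not to change the order of computation
    -o file.x      names the executable file.x; the default name is a.out

1.3.2 Job command files on the IBM SP2

To run a parallel program on the IBM SP2 (ivy), the user must prepare a LoadLeveler job command file. The following job command file, called jobp4, runs the parallel program file.x on four CPUs.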

    #!/bin/csh
    #@ executable = /usr/bin/poe
    #@ arguments = /your_working_directory/file.x -euilib us
    #@ output = outp4
    #@ error = outp4
    #@ job_type = parallel
    #@ class = medium
    #@ min_processors = 4
    #@ max_processors = 4
    #@ requirements = (Adapter == "hps_user")
    #@ wall_clock_limit = 20
    #@ queue

where

    executable = /usr/bin/poe     fixed; poe stands for Parallel Operating Environment
    arguments                     full path and file name of the executable
    output                        file for standard output (stdout)
    error                         file for error messages
    class                         SP2 CPU group; the llclass command lists the groups:
                                      short   (CPU time limit 12 hours, 10 CPUs at 120 MHz)
                                      medium  (CPU time limit 24 hours, 64 CPUs at 160 MHz)
                                      long    (CPU time limit 96 hours, 24 CPUs at 120 MHz)
    min_processors                minimum number of CPUs
    max_processors                maximum number of CPUs
    requirements = (Adapter == "hps_user")    fixed
    wall_clock_limit              maximum time the job needs, in minutes
    queue                         fixed

The number of CPUs available to a parallel job is at most 4 in the short class, 32 in the medium class, and 8 in the long class. Because MPI 1.2 cannot acquire, control or release CPUs, min_processors and max_processors must be set to the same number. When the job needs only a short time, adding wall_clock_limit lets it be scheduled earlier.

To run a parallel program on the IBM SP2 SMP (ivory), the user must likewise prepare a LoadLeveler job command file. The following job command file, called jobp4, runs the parallel program file.x on four CPUs.

    #!/bin/csh
    #@ network.mpi = css0,shared,us
    #@ executable = /usr/bin/poe
    #@ arguments = /your_working_directory/file.x -euilib us
    #@ output = outp4
    #@ error = outp4
    #@ job_type = parallel
    #@ class = medium
    #@ tasks_per_node = 4
    #@ node = 1
    #@ wall_clock_limit = 20
    #@ queue

Each node of the IBM SP2 SMP has four 375 MHz CPUs sharing 4 GB or 8 GB of memory.

    class             SP2 SMP CPU group; the llclass command lists the groups:
                          short   (CPU time limit 12 hours, 3 nodes with 6 CPUs in total)
                          medium  (CPU time limit 24 hours, 32 nodes with 128 CPUs in total)
                          bigmem  (CPU time limit 48 hours, 4 nodes with 16 CPUs in total; each node in this class has 8 GB of shared memory)
    tasks_per_node=4  use four CPUs on each node
    node=1            use one node, four CPUs in total

A parallel job in the medium class may use up to 16 nodes (64 CPUs). The other classes have no limit.

1.3.3 Running parallel programs on the IBM SP2

Once the LoadLeveler job command file is ready, submit it with llsubmit to queue it for execution on the IBM SP2 or SP2 SMP. The job command file jobp4 from the previous section is submitted with:

    llsubmit jobp4

After submission, the state of the job can be checked with llq. To narrow the listing, pipe the output of llq through grep with the class or user id of interest. jobp4 uses the medium class, so it can be found with:

    llq | grep medium

llq displays the following items:

    job_id   user_id    submitted     status  priority  class    running on
    ivy      u43ycc00   8/13 11:24    R       50        medium   ivy39
    ivy      u50pao00   8/13 20:12    R       50        short    ivy35

where

    job_id        job identifier assigned by LoadLeveler
    user_id       login name of the user
    submitted     time the job was submitted, as month/day hour:minute
    status        state of the job:
                      R   Running
                      I   Idle (waiting in queue)
                      ST  Start execution
                      NQ  Not Queued (still outside the queue)
    priority      priority of the job; leave it unchanged
    class         CPU group
    running on    identifier of the first CPU running the job

To stop a submitted job, kill it with llcancel:

    llcancel job_id

Here job_id is the job identifier shown by llq. After llcancel has run, llq no longer lists the job.

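1.4 Using MPI on a PC Cluster

C shell users of MPICH must first add the following paths to the .cshrc file in their home directory. Without them the shell cannot find the include files (mpif.h, mpi.h), the compile commands (mpif77, mpicc, mpiCC), the MPI library, or the DQS commands. The locations differ from one PC cluster to another, so ask the system administrator. A typical setting is:

    setenv PGI /usr/local/pgi
    set path = (. ~ /usr/local/pgi/linux86/bin $path)
    set path = ( /home/package/dqs/bin $path)
    set path = ( /home/package/mpich/bin $path)

The first line gives the location of the software from PGI (Portland Group Inc.). The second line gives the location of the PGI C and C++ compilers pgcc and pgCC. The third line gives the location of the DQS batch scheduler, and the fourth the location of the MPICH compile system. If the PGI software has not been purchased, the first two lines can be omitted.

1.4.1 MPI C compile commands on a PC cluster

The MPICH compiler for parallel C programs is called mpicc. It uses GNU gcc underneath, so gcc optimization options can be used. For example:

    mpicc -O3 -o file.x file.c

    -O3          highest gcc optimization level
    -o file.x    names the executable file.x; the default name is a.out
    file.c       the parallel C program

If the mpicc built for the PGI compilers is used, it calls pgcc underneath, so pgcc optimization options can be used. A sample makefile follows.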

    OBJ   = file.o
    EXE   = file.x
    MPI   = /home/package/mpich_pgi
    LIB   = $(MPI)/lib/libmpich.a
    MPICC = $(MPI)/bin/mpicc
    OPT   = -O2 -I$(MPI)/include

    $(EXE) : $(OBJ)
            $(MPICC) $(LFLAG) -o $(EXE) $(OBJ) $(LIB)

    .c.o :
            $(MPICC) $(OPT) -c $<

The two command lines under the targets must begin with a tab character. Once the makefile is ready, the make command starts the compilation.

1.4.2 Job command files on a PC cluster

If the PC cluster uses DQS to schedule batch jobs, the user must prepare a DQS job command file in order to run a parallel program. The following job command file, called jobp4, runs the parallel program hubksp on four CPUs:

    #!/bin/csh
    #$ -l qty.eq.4,hpcs00
    #$ -N HUP4
    #$ -A user_id
    #$ -cwd
    #$ -j y
    cat $HOSTS_FILE > MPI_HOST
    mpirun -np 4 -machinefile MPI_HOST hubksp >& outp4

where

    #!/bin/csh               marks the file as a C shell script
    #$ -l qty.eq.4,hpcs00    asks DQS for four CPUs; qty means quantity, and HPCS is the queue class code of the single-CPU cluster
    #$ -N HUP4               names the job HUP4
    #$ -A user_id            sets the account to be charged, which is the user's account
    #$ -cwd                  runs the program in the current working directory; the default is the home directory
    #$ -j y                  sends error messages to the standard output file
    $HOSTS_FILE              the node list that DQS assigns to the job
    -np 4 hubksp             tells mpirun to run the parallel program hubksp on four CPUs
    >& outp4                 writes standard output to the file outp4

1.4.3 Running parallel programs on a PC cluster

Once the DQS job command file is ready, submit it with qsub to queue it for execution on the PC cluster. The job command file jobp4 from the previous section is submitted with:

    qsub jobp4

After submission, qstat (with no arguments) shows the jobs on the whole cluster, and qstat -f shows the state of every node. After qsub jobp4, qstat displays:

    c00tch00 HUP4 hpcs :1 r RUNNING 02/26/99 10:51:23
    c00tch00 HUP4 hpcs :1 r RUNNING 02/26/99 10:51:23
    c00tch00 HUP4 hpcs :1 r RUNNING 02/26/99 10:51:23
    c00tch00 HUP4 hpcs :1 r RUNNING 02/26/99 10:51:
    Pending Jobs
    c00tch00 RAD5 70 0:2 QUEUED 02/26/99 19:24:32

The first column is the user_id, the second the job name, the third the CPU identifier, and the fourth the job_id assigned by DQS (62). In the fifth column, 0:1, the 0 is the priority of the job and the 1 indicates that this is the user's first submitted job. The r in the sixth column and RUNNING in the seventh show that the job is running. The last item is the time the job was submitted, as month/day/year hour:minute:second. Jobs waiting to run appear under Pending Jobs, with QUEUED in place of RUNNING.

To stop a submitted job, kill it with qdel:

    qdel job_id

Here job_id is the fourth column of the qstat display. After qdel has run, qstat no longer lists the job.

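Chapter 2  Parallel Programs without Boundary Data Exchange

The simplest parallel program is one that needs no boundary data exchange. This chapter takes a very simple sequential program, parallelizes it with MPI calls, and compares the results to verify them.

Section 2.1 introduces the six basic MPI functions MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send and MPI_Recv.
Section 2.2 presents T2SEQ, a sequential program without boundary data exchange.
Section 2.3 uses the six basic functions to parallelize T2SEQ into the parallel program T2CP.
Section 2.4 introduces four more commonly used functions: MPI_Scatter, MPI_Gather, MPI_Reduce and MPI_Allreduce.
Section 2.5 uses these functions to parallelize T2SEQ into the parallel program T2DCP.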

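2.1 Basic MPI Functions

MPI has six basic functions, each introduced in this section: MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send and MPI_Recv.

2.1.1 The mpi.h include file

A parallel C program that uses MPI must contain the statement #include <mpi.h> before the main program. The file mpi.h holds the MPI names and MPI constants needed to compile an MPI program. For example:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        ...
        MPI_Finalize();
        return 0;
    }

    int startend(int myid, int nproc, int is1, int is2, int *istart, int *iend)
    {
        ...
        return 0;
    }

The contents of mpi.h can be inspected in the directory where the MPI software is installed. The values of the MPI constants may differ between vendors, but the MPI names are identical everywhere.

2.1.2 MPI_Init, MPI_Finalize

MPI_Init must be called before any other MPI function. It starts the parallel computation of the program on multiple CPUs. MPI_Finalize must be called before the program ends, to close down the parallel computation. Each of the two is therefore called exactly once in the main program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        ...
        MPI_Finalize();
        return 0;
    }

2.1.3 MPI_Comm_size, MPI_Comm_rank

After MPI_Init, a program normally calls MPI_Comm_size to learn how many CPUs take part in the parallel computation (nproc) and MPI_Comm_rank to learn which CPU it is running on (myid). CPUs are numbered from 0, so the first CPU has myid 0, the second has myid 1, the third has myid 2, and so on.

The number of CPUs is normally decided when the run command is issued, not fixed in the program. A programmer may of course record an intended CPU count in the program, but that value serves only as a reference. The number actually used is set by min_processors and max_processors in the job command file, or by the -np option.

MPI_Comm_size and MPI_Comm_rank are called as follows:

    MPI_Comm_size (MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank (MPI_COMM_WORLD, &myid);

The argument MPI_COMM_WORLD is MPI's default communicator. All CPUs taking part in the program belong to this communicator, and only CPUs in the same communicator can send data to one another.

MPI 1.2 cannot acquire or control CPUs, so the number of CPUs is fixed from the start of the program to its end. These two functions therefore need to be called only once in a program. For example: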

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int nproc, myid;

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Finalize();
        return 0;
    }

2.1.4 MPI_Send, MPI_Recv

CPUs taking part in a parallel computation transfer data in two ways: point-to-point communication and collective communication. This section introduces the point-to-point functions MPI_Send and MPI_Recv. Other commonly used point-to-point and collective functions are introduced later.

A transfer between one CPU and one other CPU is a point-to-point communication. The sending CPU calls MPI_Send and the receiving CPU calls MPI_Recv. Every MPI_Send needs a matching MPI_Recv to complete the transfer.

MPI_Send is called as follows:

    MPI_Send ((void *)&data, icount, DATA_TYPE, idest, itag, MPI_COMM_WORLD);

Arguments:

    data         start of the data to be sent; it may be a scalar or an array
    icount       number of items to send; when icount is greater than one, data must be an array
    DATA_TYPE    type of the data to send; MPI's predefined data types are listed in Table 1.1
    idest        id of the CPU that receives the data
    itag         tag of the data being sent

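    MPI data type         C data type           description
    MPI_CHAR              signed char           1-byte character
    MPI_SHORT             signed short int      2-byte integer
    MPI_INT               signed int            4-byte integer
    MPI_LONG              signed long int       4-byte integer
    MPI_UNSIGNED_CHAR     unsigned char         1-byte unsigned character
    MPI_UNSIGNED_SHORT    unsigned short int    2-byte unsigned integer
    MPI_UNSIGNED          unsigned int          4-byte unsigned integer
    MPI_UNSIGNED_LONG     unsigned long int     4-byte unsigned integer
    MPI_FLOAT             float                 4-byte floating point
    MPI_DOUBLE            double                8-byte floating point
    MPI_LONG_DOUBLE       long double           8-byte floating point
    MPI_PACKED

    Table 1.1  Commonly used basic MPI data types in C

The sizes are those of the systems described here. The sizes of long and long double in particular depend on the platform and compiler.

MPI_Recv is called as follows:

    MPI_Recv ((void *)&data, icount, DATA_TYPE, isrc, itag, MPI_COMM_WORLD, istat);

Arguments:

    data         start of the storage for the received data
    icount       number of items to receive
    DATA_TYPE    type of the data to receive
    isrc         id of the CPU that sends the data
    itag         tag of the data to receive
    istat        status after MPI_Recv has executed

In C, istat has type MPI_Status, a structure defined in mpi.h. The programs in this text declare it as an array whose length is the constant MPI_STATUS_SIZE:

    MPI_Status istat[MPI_STATUS_SIZE];

MPI_STATUS_SIZE is the length of the status array in Fortran, and the C header often does not define it. In that case any integer can be used instead:

    MPI_Status istat[8];

Only the first element is used. A single MPI_Status variable passed by address would also do.

When one CPU receives data from several CPUs and takes the messages in order of arrival rather than in a fixed order, the call is:

    MPI_Recv( (void *)&buff, icount, DATA_TYPE, MPI_ANY_SOURCE, itag, MPI_COMM_WORLD, istat);

To find out which CPU sent the data, read it from the status variable:

    isrc = istat[0].MPI_SOURCE;

When MPI transfers data (MPI_Send, MPI_Recv), it identifies a message by an envelope made up of four items:

    1. the id of the sending CPU
    2. the id of the receiving CPU
    3. the data tag
    4. the communicator

So when one CPU sends several kinds of data to another CPU, each kind must carry a different tag to tell them apart.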
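The envelope rule can be pictured with a few lines of plain C. The sketch below is not MPI code: the `envelope` struct, the `ANY_SOURCE` constant (a stand-in for MPI_ANY_SOURCE) and the function `envelope_matches` are hypothetical names, introduced only to illustrate why two messages from the same sender to the same receiver need different tags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for MPI_ANY_SOURCE; the real value comes from mpi.h. */
#define ANY_SOURCE (-1)

/* Illustrative model of a message envelope (not an MPI type). */
typedef struct {
    int src;   /* rank of the sending CPU */
    int dest;  /* rank of the receiving CPU */
    int tag;   /* data tag chosen by the programmer */
    int comm;  /* communicator, modelled here as a plain integer id */
} envelope;

/* Returns true when a posted receive would accept the incoming message. */
bool envelope_matches(const envelope *recv, const envelope *msg)
{
    assert(recv != NULL && msg != NULL);
    if (recv->comm != msg->comm) return false;
    if (recv->dest != msg->dest) return false;
    if (recv->tag != msg->tag) return false;
    /* a wildcard source accepts a message from any sender */
    return recv->src == ANY_SOURCE || recv->src == msg->src;
}
```

With this model, a receive posted for tag 10 from CPU 0 rejects a message that CPU 0 sent with tag 20, which is how the receiver keeps arrays b and c apart in the programs that follow.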

2.2 T2SEQ: Sequential Program without Boundary Data Exchange

T2SEQ is a sequential program that needs no boundary data exchange. Its test data generation section sets the values of arrays b, c and d and writes them to a disk file, so that the later example programs can read the same data for their parallel computation and their results can be checked. The computing part has a single for loop containing only two statements, which makes it easy to explain how loops of this kind are parallelized. A real application may have hundreds or thousands of for loops, but the method is the same.

    /* PROGRAM T2SEQ
       sequential version of 1-dimensional array operation */
    #include <stdio.h>
    #include <stdlib.h>
    #define n 200

    int main(void)
    {
        double suma, a[n], b[n], c[n], d[n];
        int i, j;
        FILE *fp;

        /* test data generation and write out to file 'input.dat' */
        for (i = 0; i < n; i++) {
            j = i + 1;
            b[i] = 3. / (double) j + 1.0;
            c[i] = 2. / (double) j + 1.0;
            d[i] = 1. / (double) j + 1.0;
        }
        fp = fopen("input.dat", "w");
        fwrite((void *)&b, sizeof(b), 1, fp);
        fwrite((void *)&c, sizeof(c), 1, fp);
        fwrite((void *)&d, sizeof(d), 1, fp);
        fclose(fp);

        /* read 'input.dat', compute and write out the result */
        fp = fopen("input.dat", "r");
        fread((void *)&b, sizeof(b), 1, fp);
        fread((void *)&c, sizeof(c), 1, fp);
        fread((void *)&d, sizeof(d), 1, fp);
        fclose(fp);
        suma = 0.;
        for (i = 0; i < n; i++) {
            a[i] = b[i] + c[i] * d[i];
            suma += a[i];
        }
        for (i = 0; i < n; i += 40) {
            printf("%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
                   a[i], a[i+5], a[i+10], a[i+15], a[i+20], a[i+25], a[i+30], a[i+35]);
        }
        printf("sum of A=%f\n", suma);
        return 0;
    }

The test result of the sequential program T2SEQ is:

    sum of A=
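Because the test data are generated from simple formulas, each a[i] can be checked independently of any file or parallel run. The sketch below recomputes one element from the generation formulas in T2SEQ and compares it with the expanded form 2 + 6/j + 2/j^2, where j = i + 1. The function names `t2seq_a` and `t2seq_a_closed` are hypothetical helpers, not part of the original program.

```c
#include <assert.h>

/* a[i] as T2SEQ computes it, for 0-based index i, with j = i + 1. */
double t2seq_a(int i)
{
    assert(i >= 0);
    double j = (double)(i + 1);
    double b = 3.0 / j + 1.0;
    double c = 2.0 / j + 1.0;
    double d = 1.0 / j + 1.0;
    return b + c * d;
}

/* The same value from the expansion b + c*d = 2 + 6/j + 2/j^2. */
double t2seq_a_closed(int i)
{
    assert(i >= 0);
    double j = (double)(i + 1);
    return 2.0 + 6.0 / j + 2.0 / (j * j);
}
```

For i = 0 the inputs are b = 4, c = 3 and d = 2, so a[0] = 10. A value like this is a quick sanity check on the first number printed by any of the parallel versions.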

2.3 T2CP: Parallel Program without Data Partition

A parallel program can be decomposed (partitioned) in two ways. One partitions the computation but not the data. The other partitions both the computation and the data. The first cannot reduce memory use, which is its drawback, but the arrays are declared exactly as in the sequential version, so the program is easy to read and maintain. The second saves memory, which is its main advantage, but the array declarations differ considerably from the sequential version, so the program is harder to read and maintain.

How is the sequential program T2SEQ parallelized? This section covers partitioning the computation without partitioning the data. Section 2.5 covers partitioning both.

Suppose T2SEQ is to run on four CPUs without data partition. The one-dimensional arrays a, b, c and d are divided evenly into four segments, and each CPU computes one of them. A function startend computes the first and last index of the segment assigned to each CPU. It gives the first segment to CPU0, the second to CPU1, the third to CPU2, and so on, as shown in Figure 2.1.

    [Figure 2.1  Computation partition without data partition. Each of cpu0 to cpu3 holds the whole array of length ntotal and owns the elements from its istart to its iend. The legend distinguishes array elements inside the territory from array elements outside the territory.]

In Figure 2.1 one symbol marks the array elements inside a CPU's territory and another marks those outside it. Each CPU computes the range from its own istart to its own iend.

Because MPI 1.2 has no parallel I/O, the input data are read by CPU0 (myid equal to zero), which then uses a for loop to send (MPI_Send) the segments to the other three CPUs. The other three CPUs (myid greater than zero) receive the array segments that CPU0 sends them. Note that different array segments must be sent with different tags (itag), and every MPI_Send must have a matching MPI_Recv.

When each CPU has finished its own segment, it sends its result (part of array a) to CPU0, which uses a for loop to receive the segments from the other three CPUs one by one. CPU0 alone then computes suma, the sum of all elements of a, and prints a and suma.

    /* PROGRAM T2CP
       computation partition without data partition of 1-dimensional arrays */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #define n 200

    void startend(int nproc, int is1, int is2,
                  int gstart[16], int gend[16], int gcount[16]);

    int main(int argc, char **argv)
    {
        double suma, a[n], b[n], c[n], d[n];
        int i;
        FILE *fp;
        int nproc, myid, istart, iend, icount;
        int itag, isrc, idest, istart1, icount1;
        int gstart[16], gend[16], gcount[16];
        MPI_Status istat[8];
        MPI_Comm comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        startend(nproc, 0, n - 1, gstart, gend, gcount);
        istart = gstart[myid];
        iend = gend[myid];
        comm = MPI_COMM_WORLD;
        printf("NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n", nproc, myid, istart, iend);

        /* READ 'input.dat', COMPUTE AND WRITE OUT THE RESULT */
        if (myid == 0) {
            fp = fopen("input.dat", "r");
            fread((void *)&b, sizeof(b), 1, fp);
            fread((void *)&c, sizeof(c), 1, fp);
            fread((void *)&d, sizeof(d), 1, fp);
            fclose(fp);
            for (idest = 1; idest < nproc; idest++) {
                istart1 = gstart[idest];
                icount1 = gcount[idest];
                itag = 10;
                MPI_Send((void *)&b[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
                itag = 20;
                MPI_Send((void *)&c[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
                itag = 30;
                MPI_Send((void *)&d[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
            }
        } else {
            icount = gcount[myid];
            isrc = 0;
            itag = 10;
            MPI_Recv((void *)&b[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
            itag = 20;
            MPI_Recv((void *)&c[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
            itag = 30;
            MPI_Recv((void *)&d[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
        }

        /* compute, collect computed result and write out the result */
        for (i = istart; i <= iend; i++) {
            a[i] = b[i] + c[i] * d[i];
        }
        itag = 110;
        if (myid > 0) {
            icount = gcount[myid];
            idest = 0;
            MPI_Send((void *)&a[istart], icount, MPI_DOUBLE, idest, itag, comm);
        } else {
            for (isrc = 1; isrc < nproc; isrc++) {
                icount1 = gcount[isrc];
                istart1 = gstart[isrc];
                MPI_Recv((void *)&a[istart1], icount1, MPI_DOUBLE, isrc, itag, comm, istat);
            }
        }
        if (myid == 0) {
            for (i = 0; i < n; i += 40) {
                printf("%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
                       a[i], a[i+5], a[i+10], a[i+15], a[i+20], a[i+25], a[i+30], a[i+35]);
            }
            suma = 0.0;
            for (i = 0; i < n; i++)
                suma += a[i];
            printf("sum of A=%f\n", suma);
        }
        MPI_Finalize();
        return 0;
    }

    /* split the index range is1..is2 over nproc CPUs */
    void startend(int nproc, int is1, int is2,
                  int gstart[16], int gend[16], int gcount[16])
    {
        int i, ilength, iblock, ir;

        ilength = is2 - is1 + 1;
        iblock = ilength / nproc;
        ir = ilength - iblock * nproc;
        for (i = 0; i < nproc; i++) {
            if (i < ir) {
                gstart[i] = is1 + i * (iblock + 1);
                gend[i] = gstart[i] + iblock;
            } else {
                gstart[i] = is1 + i * iblock + ir;
                gend[i] = gstart[i] + iblock - 1;
            }
            if (ilength < 1) {
                gstart[i] = 1;
                gend[i] = 0;
            }
            gcount[i] = gend[i] - gstart[i] + 1;
        }
    }

The test result of the parallel program T2CP is:

    ATTENTION: nodes allocated by LoadLeveler, continuing...
    NPROC,MYID,ISTART,IEND=
    NPROC,MYID,ISTART,IEND=
    NPROC,MYID,ISTART,IEND=
    NPROC,MYID,ISTART,IEND=
    sum of A=

A parallel program written this way is called an SPMD (Single Program Multiple Data) program. When it runs on several CPUs, every CPU executes the same program. In some places the rank (myid) decides which branch of an if statement is executed, and in others different index ranges select which part of a for loop is executed. Any part of the program that is not selected by index or rank is executed by every CPU. In T2CP every CPU therefore executes the following statements at the start of the program:

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank (MPI_COMM_WORLD, &myid);
    startend( nproc, 0, n - 1, gstart, gend, gcount);
    istart=gstart[myid];
    iend=gend[myid];
    printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);

The statements are identical, but each CPU obtains a different myid, so istart and iend differ as well, as the printed output confirms. In an SPMD program of this kind every CPU has work to do. CPU0 does the extra work of reading and writing the data, but in most computational programs I/O takes only a small share of the time, so the load on the CPUs is fairly even and there is no distinction between a master CPU and slave CPUs.
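The index arithmetic inside startend can be tried on its own, without MPI. The sketch below computes the range for a single rank using the same formulas; `block_range` is a hypothetical helper written for this illustration, and it assumes the same convention as startend for an empty range (start 1, end 0).

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive index range [*start, *end] owned by `rank` when is1..is2 is
   split over nproc ranks. The first (length % nproc) ranks get one
   extra element, as in startend. */
void block_range(int nproc, int rank, int is1, int is2, int *start, int *end)
{
    assert(nproc > 0 && rank >= 0 && rank < nproc);
    assert(start != NULL && end != NULL);
    int length = is2 - is1 + 1;
    if (length < 1) {              /* empty range: same convention as startend */
        *start = 1;
        *end = 0;
        return;
    }
    int block = length / nproc;
    int extra = length - block * nproc;
    if (rank < extra) {
        *start = is1 + rank * (block + 1);
        *end = *start + block;
    } else {
        *start = is1 + rank * block + extra;
        *end = *start + block - 1;
    }
}
```

For the 200 elements of T2CP on four CPUs the segments are 0..49, 50..99, 100..149 and 150..199. With 10 elements on four CPUs the first two ranks get three elements each and the last two get two each.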

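2.4 MPI_Scatter, MPI_Gather, MPI_Reduce

MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Reduce and MPI_Allreduce are all collective communication functions. In this kind of data transfer every CPU in the communicator takes part, so every CPU must call the same function.

2.4.1 MPI_Scatter, MPI_Gather

With MPI_Scatter, the CPU iroot divides an array t into nproc equal segments (nproc is the number of CPUs taking part), each of length n, and distributes them to the CPUs in order of CPU id, including iroot itself. The first segment goes to CPU0, the second to CPU1, the third to CPU2, and so on, as shown in Figure 2.2.

    before, on CPU0:      t = | t1 | t2 | t3 | t4 |
    after MPI_Scatter:    CPU0 b = t1    CPU1 b = t2    CPU2 b = t3    CPU3 b = t4

    Figure 2.2  MPI_Scatter

MPI_Scatter is called as follows:

    iroot = 0;
    MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_DOUBLE, iroot, comm);

Note that all segments must have the same length. The arguments are, in order:

    t             start of the array to be sent
    n             number of items sent to each CPU
    MPI_DOUBLE    type of the data sent
    b             start of the storage for the received data; if n is greater than one, b must be an array
    n             number of items received
    MPI_DOUBLE    type of the data received
    iroot         id of the CPU that sends the data

MPI_Gather does the opposite of MPI_Scatter. The CPU idest collects the array a sent by every CPU and stores the pieces in array t in order of CPU id. The n elements received from CPU0 go into the first segment of t, the n elements from CPU1 into the second segment, the n elements from CPU2 into the third, and so on, as shown in Figure 2.3.

    before:               CPU0 a = t1    CPU1 a = t2    CPU2 a = t3    CPU3 a = t4
    after MPI_Gather:     on CPU0, t = | t1 | t2 | t3 | t4 |

    Figure 2.3  MPI_Gather

MPI_Gather is called as follows:

    idest = 0;
    MPI_Gather ((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, idest, comm);

Note that all segments must have the same length. The arguments of MPI_Gather are, in order:

    a             start of the data to be sent; if n is greater than one, a must be an array
    n             number of items sent
    MPI_DOUBLE    type of the data sent
    t             start of the array that stores the received data
    n             number of items received from each CPU
    MPI_DOUBLE    type of the data received
    idest         id of the CPU that collects the data

MPI_Allgather is called as follows:

    MPI_Allgather ((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, comm);

MPI_Allgather works like MPI_Gather, except that MPI_Gather stores the result on one designated CPU while MPI_Allgather stores it on every CPU.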
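The data movement of a scatter and a gather can be mimicked serially to see which elements end up where. The sketch below is a plain-C model, not MPI code: `model_scatter` and `model_gather` are hypothetical helpers that copy the segment belonging to one rank, under the assumption stated in the text that every segment has the same length n.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* What rank `rank` receives in b when the root scatters t
   (nproc segments of n doubles each). */
void model_scatter(const double *t, int n, int rank, double *b)
{
    assert(t != NULL && b != NULL && n > 0 && rank >= 0);
    memcpy(b, t + (size_t)rank * (size_t)n, (size_t)n * sizeof *b);
}

/* Where the root stores the segment a contributed by rank `rank`
   when it gathers into t. */
void model_gather(const double *a, int n, int rank, double *t)
{
    assert(a != NULL && t != NULL && n > 0 && rank >= 0);
    memcpy(t + (size_t)rank * (size_t)n, a, (size_t)n * sizeof *a);
}
```

Scattering an array and gathering the pieces back in rank order reproduces the original array, which is the property T2DCP relies on when it collects array a into the temporary array t.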

    before:                 CPU0 a = t1    CPU1 a = t2    CPU2 a = t3    CPU3 a = t4
    after MPI_Allgather:    on every CPU, t = | t1 | t2 | t3 | t4 |

    Figure 2.4  MPI_Allgather

2.4.2 MPI_Reduce, MPI_Allreduce

Another kind of collective transfer is the reduction operation. Examples are adding up the partial sums computed by the CPUs, or finding the maximum or minimum of a variable across the CPUs. MPI_Reduce stores the result only on the designated CPU (iroot). MPI_Allreduce stores the result on every CPU. Figure 2.5 shows how MPI_Reduce works and Figure 2.6 shows how MPI_Allreduce works.

    before:                          CPU0 suma    CPU1 suma    CPU2 suma    CPU3 suma
    after MPI_Reduce (MPI_SUM):      CPU0 sumall

    Figure 2.5  MPI_Reduce

    before:                          CPU0 suma      CPU1 suma      CPU2 suma      CPU3 suma
    after MPI_Allreduce (MPI_SUM):   CPU0 sumall    CPU1 sumall    CPU2 sumall    CPU3 sumall

    Figure 2.6  MPI_Allreduce

MPI_Reduce and MPI_Allreduce are called as follows:

    iroot = 0;
    MPI_Reduce ((void *)&suma, (void *)&sumall, count, MPI_DOUBLE, MPI_SUM, iroot, comm);
    MPI_Allreduce((void *)&suma, (void *)&sumall, count, MPI_DOUBLE, MPI_SUM, comm);

Arguments:

    suma          variable to be reduced (summed)
    sumall        storage for the result (the sum of suma over all CPUs)
    count         number of items to be reduced
    MPI_DOUBLE    data type of suma and sumall
    MPI_SUM       reduction function; the available functions are listed in Table 2.1
    iroot         id of the CPU that stores the result

Each item of sumall equals the sum of the corresponding items of suma on all CPUs.

    MPI function                     Operation                                          C data types
    MPI_SUM                          sum                                                MPI_INT, MPI_FLOAT,
    MPI_PROD                         product                                            MPI_DOUBLE, MPI_LONG_DOUBLE
    MPI_MAX                          maximum
    MPI_MIN                          minimum
    MPI_MAXLOC                       max value and location                             MPI_FLOAT_INT, MPI_DOUBLE_INT,
    MPI_MINLOC                       min value and location                             MPI_LONG_INT, MPI_2INT
    MPI_LAND, MPI_LOR, MPI_LXOR      logical AND, logical OR, logical exclusive OR      MPI_SHORT, MPI_LONG, MPI_INT, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG
    MPI_BAND, MPI_BOR, MPI_BXOR      bitwise AND, bitwise OR, bitwise exclusive OR      MPI_SHORT, MPI_LONG, MPI_INT, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG

    Table 2.1  MPI reduction functions

The data types used by MPI_MAXLOC and MPI_MINLOC are C structures, as follows:

    Data type         Description (C structure)
    MPI_FLOAT_INT     {MPI_FLOAT, MPI_INT}
    MPI_DOUBLE_INT    {MPI_DOUBLE, MPI_INT}
    MPI_LONG_INT      {MPI_LONG, MPI_INT}
    MPI_2INT          {MPI_INT, MPI_INT}
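Two points about reductions are easy to miss: with count greater than one the operation is applied item by item, and MPI_MAXLOC works on value and index pairs. The sketch below models both serially in plain C. It is not MPI code: `double_int`, `model_reduce_sum` and `model_reduce_maxloc` are hypothetical names, and the contributions of all CPUs are assumed to sit in one array, one row per rank.

```c
#include <assert.h>
#include <stddef.h>

/* A value and an index, the pair layout used with MPI_DOUBLE_INT. */
typedef struct {
    double val;
    int    loc;
} double_int;

/* Model of MPI_SUM with count > 1: contrib holds nproc rows of count
   values, and out[k] becomes the sum over ranks of contrib[r*count + k]. */
void model_reduce_sum(const double *contrib, int nproc, int count, double *out)
{
    assert(contrib != NULL && out != NULL && nproc > 0 && count > 0);
    for (int k = 0; k < count; k++) {
        out[k] = 0.0;
        for (int r = 0; r < nproc; r++)
            out[k] += contrib[r * count + k];
    }
}

/* Model of MPI_MAXLOC with one pair per rank: the largest value wins,
   and a tie goes to the smaller loc. */
double_int model_reduce_maxloc(const double_int *contrib, int nproc)
{
    assert(contrib != NULL && nproc > 0);
    double_int best = contrib[0];
    for (int r = 1; r < nproc; r++) {
        if (contrib[r].val > best.val ||
            (contrib[r].val == best.val && contrib[r].loc < best.loc))
            best = contrib[r];
    }
    return best;
}
```

A typical use of the pair form is to store each CPU's local maximum in val and its rank (or a global array index) in loc, so that the reduction reports both the maximum and where it was found.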

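2.5 T2DCP: Parallel Program with Data Partition

T2DCP partitions both the computation and the data. When it runs on np CPUs, the arrays a, b, c and d need only one np-th of their original length ntotal. The input arrays b, c and d in the file still have length ntotal, however, so a temporary array t of length ntotal is declared to hold each input array in turn as it is read, and to hold the whole output array a. Each time CPU0 reads an input array, it uses MPI_Scatter to distribute the segments to all CPUs:

    iroot=0;
    MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_DOUBLE, iroot, comm);

When each CPU has finished computing its segment, it uses MPI_Gather to send its piece of array a back to CPU0:

    idest=0;
    MPI_Gather ((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, idest, comm);

Comparing T2CP with T2DCP shows that the program using MPI_Scatter and MPI_Gather is much more concise. With these functions, however, the segment sent to each CPU must have the same length. MPI_Send and MPI_Recv have no such restriction.

If the array dimensions are constants, then the number of CPUs np, the original array length ntotal, and the partitioned array length n (equal to ntotal / np) can be set with define statements:

    #define ntotal 200
    #define np 4
    #define n 50

Here ntotal must be divisible by np. The array dimension then changes from ntotal to n:

    double a[n], b[n], c[n], d[n], t[ntotal];

Each CPU now runs its for loop from 0 to n-1, so the suma it computes is the sum of only one np-th of the ntotal elements of a, that is, a partial sum. MPI_Reduce is therefore called to add up the partial sums suma of all CPUs and store the total in sumall on CPU0.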
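With the data partitioned, each CPU indexes its segment from 0 to n-1, so it helps to be explicit about the divisibility requirement and about how a local index maps back to a position in the full array t. The sketch below shows both in plain C; `segment_length` and `global_index` are hypothetical helpers, and the mapping assumes the rank-ordered layout that MPI_Scatter and MPI_Gather use.

```c
#include <assert.h>

/* Length of each CPU's segment, or -1 when ntotal is not a multiple of np. */
int segment_length(int ntotal, int np)
{
    assert(ntotal > 0 && np > 0);
    return (ntotal % np == 0) ? ntotal / np : -1;
}

/* Position in the full array t of local element i on CPU `rank`,
   where every CPU holds n elements. */
int global_index(int rank, int n, int i)
{
    assert(rank >= 0 && n > 0 && i >= 0 && i < n);
    return rank * n + i;
}
```

For ntotal = 200 and np = 4 the segment length is 50, and local element 0 on CPU2 corresponds to element 100 of the full array. With np = 3 the check fails, which is the case Chapter 4 addresses.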

    iroot=0;
    MPI_Reduce ((void *)&suma, (void *)&sumall, 1, MPI_DOUBLE, MPI_SUM, iroot, comm);

The complete program T2DCP, with both computation and data partitioned, is:

    /* PROGRAM T2DCP */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #define ntotal 200
    #define np 4
    #define n 50

    int main(int argc, char **argv)
    {
        /* Data & Computational Partition Using MPI_Scatter, MPI_Gather
           value of n must be modified when run on other than 4 processors */
        int i;
        FILE *fp;
        double a[n], b[n], c[n], d[n], t[ntotal], suma, sumall;
        int nproc, myid, istart, iend, iroot, idest;
        MPI_Comm comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        comm = MPI_COMM_WORLD;
        istart = 0;
        iend = n - 1;

        /* read input data and distribute input data */
        if (nproc != np) {
            printf("nproc not equal to np= %d\t%d\t", nproc, np);
            printf(" program will stop");
            MPI_Finalize();
            return 0;
        }
        if (myid == 0) {
            fp = fopen("input.dat", "r");
            fread((void *)&t, sizeof(t), 1, fp);
        }
        iroot = 0;
        MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_DOUBLE, iroot, comm);
        if (myid == 0) {
            fread((void *)&t, sizeof(t), 1, fp);
        }
        MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&c, n, MPI_DOUBLE, iroot, comm);
        if (myid == 0) {
            fread((void *)&t, sizeof(t), 1, fp);
            fclose(fp);
        }
        MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&d, n, MPI_DOUBLE, iroot, comm);

        /* compute, gather computed data, and write out the result */
        suma = 0.0;
        /* for(i=0; i<ntotal; i++) { */
        for (i = istart; i <= iend; i++) {
            a[i] = b[i] + c[i] * d[i];
            suma = suma + a[i];
        }
        idest = 0;
        MPI_Gather((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, idest, comm);
        MPI_Reduce((void *)&suma, (void *)&sumall, 1, MPI_DOUBLE, MPI_SUM, idest, comm);
        if (myid == 0) {
            for (i = 0; i < ntotal; i += 40) {
                printf("%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
                       t[i], t[i+5], t[i+10], t[i+15], t[i+20], t[i+25], t[i+30], t[i+35]);
            }
            printf("sum of A=%f\n", sumall);
        }
        MPI_Finalize();
        return 0;
    }

The test result of T2DCP is:

    ATTENTION: nodes allocated by LoadLeveler, continuing
    sum of A=

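Chapter 3  Parallel Programs That Need Boundary Data Exchange

The most common kind of parallel program is one that must exchange boundary data during the computation. This chapter takes a sequential program that needs boundary data exchange, parallelizes it with MPI calls, and compares the results to verify them.

Section 3.1 introduces two MPI functions, MPI_Sendrecv and MPI_Bcast. MPI_Sendrecv is used to exchange boundary data and MPI_Bcast to distribute input data.
Section 3.2 presents T3SEQ, a sequential program that needs boundary data exchange.
Section 3.3 uses MPI_Sendrecv, MPI_Send and MPI_Recv to parallelize T3SEQ into the parallel program T3CP_1, and then replaces the MPI_Send and MPI_Recv calls of T3CP_1 with MPI_Bcast to obtain T3CP_2. Both T3CP_1 and T3CP_2 partition the computation but not the data.
Section 3.4 presents T3DCP_1, a data-partitioned program that exchanges one array element.
Section 3.5 presents T3DCP_2, a data-partitioned program that exchanges two array elements.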

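3.1 MPI_Sendrecv, MPI_Bcast

MPI_Sendrecv is a point-to-point communication function and MPI_Bcast is a collective communication function.

3.1.1 MPI_Sendrecv

When CPU A has to send some data to CPU B and also receive other data from CPU C, a single call to MPI_Sendrecv does both jobs. It is equivalent to one MPI_Send plus one MPI_Recv. The following call sends the last array element of the territory to the right neighbour and at the same time receives, from the left neighbour, the element just before the start of the territory:

    itag = 110;
    MPI_Sendrecv ((void *)&b[iend], icount, DATA_TYPE, r_nbr, itag,
                  (void *)&b[istartm1], icount, DATA_TYPE, l_nbr, itag, comm, istat);

Arguments:

    b[iend]        start of the data to be sent
    icount         number of items to send
    DATA_TYPE      type of the data to send
    r_nbr          id of the destination CPU (right neighbour)
    itag           tag of the data to send
    b[istartm1]    start of the array that receives the data
    icount         number of items to receive
    DATA_TYPE      type of the data to receive
    l_nbr          id of the source CPU (left neighbour)
    itag           tag of the data to receive
    istat          status after the call has executed

3.1.2 MPI_Bcast

MPI_Bcast is a collective communication function; Bcast is short for Broadcast. It is used to send the same data to all the other CPUs in the same communicator. Because it is a collective function, every CPU taking part in the computation must call it. It is not allowed for a few CPUs to call it while the others do not.

MPI_Bcast is called as follows:

    iroot=0;
    MPI_Bcast ( (void *)&b, icount, DATA_TYPE, iroot, comm);

Arguments:

    b            start of the data to be sent; a simple variable or an array name
    icount       number of items to send
    DATA_TYPE    type of the data to send
    iroot        id of the CPU that sends the data

Figure 3.1 shows how MPI_Bcast works.

    before, on CPU0:     b = | b1 | b2 | b3 | b4 |
    after MPI_Bcast:     on every CPU (CPU0 to CPU3), b = | b1 | b2 | b3 | b4 |

    Figure 3.1  MPI_Bcast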
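The shift performed by the MPI_Sendrecv call in 3.1.1 can be pictured with a serial model in which one array slot stands for each CPU. The sketch below is plain C, not MPI code: `model_shift_right` and `NO_NEIGHBOR` (a stand-in for MPI_PROC_NULL, used in section 3.3) are hypothetical names. It assumes the left-to-right rank ordering described in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MPI_PROC_NULL; the real constant is in mpi.h. */
#define NO_NEIGHBOR (-1)

/* Model of every rank sending last[rank] to its right neighbour and
   receiving into halo[rank] from its left neighbour. Rank 0 has no
   left neighbour, so its halo entry is left untouched. */
void model_shift_right(const double *last, double *halo, int nproc)
{
    assert(last != NULL && halo != NULL && nproc > 0);
    for (int rank = 0; rank < nproc; rank++) {
        int l_nbr = (rank == 0) ? NO_NEIGHBOR : rank - 1;
        if (l_nbr != NO_NEIGHBOR)
            halo[rank] = last[l_nbr];
    }
}
```

In the real program each CPU makes the same MPI_Sendrecv call and the library pairs the sends with the receives. The model only shows the net effect: every CPU except the first ends up holding its left neighbour's last element.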

3.2 T3SEQ: Sequential Program with Boundary Data Exchange

In the for loop of T3SEQ, computing a[i] on the left of the equals sign uses c[i], d[i] and b[i-1], b[i], b[i+1] on the right. In a sequential program this is perfectly ordinary, but in a parallel program with partitioned computation it is a major issue, because data outside the CPU's territory are needed. This is the problem of boundary data exchange. The program also finds amax, the maximum of the elements of array a.

    /* PROGRAM T3SEQ
       Boundary Data Exchange Program - Sequential Version */
    #include <stdio.h>
    #include <stdlib.h>
    #define ntotal 200

    double max(double, double);

    int main(void)
    {
        double amax, a[ntotal], b[ntotal], c[ntotal], d[ntotal];
        int i;
        FILE *fp;

        /* read 'input.dat', compute, and write out the result */
        fp = fopen("input.dat", "r");
        fread((void *)&b, sizeof(b), 1, fp);
        fread((void *)&c, sizeof(c), 1, fp);
        fread((void *)&d, sizeof(d), 1, fp);
        fclose(fp);
        amax = -1.0e12;
        /* the end points a[0] and a[ntotal-1] are not computed by this loop */
        for (i = 1; i < ntotal - 1; i++) {
            a[i] = c[i] * d[i] + (b[i-1] + 2.0 * b[i] + b[i+1]) * 0.25;
            amax = max(amax, a[i]);
        }
        for (i = 0; i < ntotal; i += 40) {
            printf("%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
                   a[i], a[i+5], a[i+10], a[i+15], a[i+20], a[i+25], a[i+30], a[i+35]);
        }
        printf("MAXIMUM VALUE OF A ARRAY is=%f\n", amax);
        return 0;
    }

    double max(double a, double b)
    {
        if (a >= b)
            return a;
        else
            return b;
    }

The test result of the sequential program T3SEQ is:

    MAXIMUM VALUE OF A ARRAY is=

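3.3 T3CP: Boundary Data Exchange without Data Partition

How is the sequential program T3SEQ parallelized? This section covers partitioning the computation without partitioning the data. Sections 3.4 and 3.5 cover partitioning both.

T3CP_1 also uses the function startend to divide the array and to compute the first and last index of each CPU's territory. The first segment of the array goes to CPU0, the second to CPU1, and so on, as shown in Figure 3.2.

    [Figure 3.2  Boundary data exchange without data partition. The CPUs are laid out from left to right with MPI_PROC_NULL beyond each end. Every CPU holds the whole array of length ntotal. The labels mark istart, iend, istart2, iend1, istart-1 and iend+1 on each CPU, and the legend distinguishes owned data from exchanged data.]

In Figure 3.2 one symbol marks array elements inside the territory, the symbol . marks array elements outside the territory, another symbol marks the elements received from the left and right neighbours, and the arrows show the direction in which elements are sent. Each CPU computes the range from its own istart to its own iend.

The computing part of T3SEQ is:

    amax = -1.e12;
    for (i = 1; i < ntotal-1; i++) {
        a[i] = c[i]*d[i] + ( b[i-1] + 2.0*b[i] + b[i+1] )*0.25;
        amax = max(amax, a[i]);
    }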

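The index i runs from 1 to ntotal-2. After partitioning, only CPU0 starts from 1 and all other CPUs start from istart, so a variable istart2 is introduced to handle this:

    istart2 = istart;
    if (myid == 0) istart2 = 1;

At the end of the loop only the last CPU stops at ntotal-2, that is, iend-1, while all other CPUs stop at iend, so another variable iend1 is introduced:

    iend1 = iend;
    if (myid == nproc-1) iend1 = iend - 1;

When i in a[i] equals istart, b[istart-1] is needed, and when i equals iend, b[iend+1] is needed. Two more variables, istartm1 (istart minus 1) and iendp1 (iend plus 1), are introduced for these indices of b:

    istartm1 = istart - 1;
    iendp1 = iend + 1;

MPI_Sendrecv must be called before a for loop that needs data from outside the territory (i-1, i+1 and so on). Before that, each CPU has to know which CPUs are its left and right neighbours. The function startend gives the first segment to CPU0, the second to CPU1, the third to CPU2, and so on, so the left neighbour l_nbr of a CPU is its CPU id minus one and its right neighbour r_nbr is its CPU id plus one. The first and the last CPU are exceptions: the first has no left neighbour and the last has no right neighbour. In those cases the neighbour is given the special name MPI_PROC_NULL, a constant defined in mpi.h.

    l_nbr = myid - 1;
    r_nbr = myid + 1;
    if (myid == 0)       l_nbr = MPI_PROC_NULL;
    if (myid == nproc-1) r_nbr = MPI_PROC_NULL;

Now consider the boundary data exchange for b[i-1] and b[i+1]. It takes two MPI_Sendrecv calls, one for b[i-1] and one for b[i+1].

Take b[i-1] first. Looking at CPU1 in Figure 3.2, it must send b[iend] to its right neighbour, to serve as the right neighbour's b[istartm1], and it must receive the left neighbour's b[iend] to serve as its own b[istartm1]. If the partner is MPI_PROC_NULL, no transfer takes place. When every CPU does the same, the exchange for b[istartm1] is complete. That is: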
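The neighbour ids and trimmed loop limits described above depend only on myid, nproc, istart and iend, so they can be computed and checked without MPI. The sketch below gathers them into one plain-C function. `rank_layout`, `make_layout` and `NO_NEIGHBOR` (a stand-in for MPI_PROC_NULL) are hypothetical names, and the loop start is called istart2 as in program T3CP.

```c
#include <assert.h>

/* Hypothetical stand-in for MPI_PROC_NULL; the real constant is in mpi.h. */
#define NO_NEIGHBOR (-1)

typedef struct {
    int l_nbr, r_nbr;    /* left and right neighbour ranks */
    int istart2, iend1;  /* loop limits after trimming the global end points */
} rank_layout;

/* Neighbours and loop limits for rank myid owning indices istart..iend. */
rank_layout make_layout(int myid, int nproc, int istart, int iend)
{
    assert(nproc > 0 && myid >= 0 && myid < nproc && istart <= iend);
    rank_layout lay;
    lay.l_nbr = (myid == 0) ? NO_NEIGHBOR : myid - 1;
    lay.r_nbr = (myid == nproc - 1) ? NO_NEIGHBOR : myid + 1;
    lay.istart2 = (myid == 0) ? istart + 1 : istart;       /* skip global index 0 */
    lay.iend1 = (myid == nproc - 1) ? iend - 1 : iend;     /* skip global last index */
    return lay;
}
```

With 200 elements on four CPUs, CPU0 loops over 1..49 with no left neighbour and CPU3 loops over 150..198 with no right neighbour. The two middle CPUs keep their full ranges and have neighbours on both sides.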

42 itag = 110; MPI_Sendrecv ((void *)&b[iend], 1, MPI_DOUBLE, r_nbr, itag, (void *)&b[istartm1], 1, MPI_DOUBLE, l_nbr, itag, comm, istat) 再來解決 b[i+1] 的邊界資料交換問題從圖 3.2 中的 CPU1 來看, 它要送 b[istart] 給左鄰 " 當作左鄰的 b[iendp1]", 又要自右鄰取得 " 右鄰的 b[istart]" 做為它自己的 b[iendp1] 如果要傳送的對象是 MPI_PROC_NULL 時, 是沒有傳送動作發生的每一個 CPU 都做同樣的動作, 就解決了 b[iendp1] 的邊界資料交換問題也就是 : itag = 120; MPI_Sendrecv ((void *)&b[istart], 1, MPI_DOUBLE, l_nbr, itag, (void *)&b[iendp1], 1, MPI_DOUBLE, r_nbr, itag, comm, istat); 現在, 每一個 CPU 執行的 for loop 範圍是從 istart 到 iend, 所以算出來的 amax 是 a 陣列 ntotal 個元素裏 np 分之一個陣列元素的最大值, 只是部份資料的最大值所以此處叫用 MPI_Allreduce 找出各個 CPU 轄區裏的最大值 amax 之中的最大值 gmax (global maximum), 存放在每一個 CPU 裏 MPI_Allreduce ( (void *)&amax, (void *)&gmax, 1, MPI_DOUBLE, MPI_MAX, comm ); 至於甚麼時候使用 reduce, 甚麼時候使用 allreduce, 視需要而定每一個 CPU 都要用到 reduce 的結果時就要用 MPI_Allreduce, 只有一個 CPU 需要用到 reduce 的結果時只要用 MPI_Reduce 就可以了, 因為 MPI_Allreduce 比較耗時間所以完整的邊界資料交換平行程式如下 : /* PROGRAM T3CP Boundary data exchange with computing partition without data partition Using MPI_Send, MPI_Recv to distribute input data */ #include <stdio.h> #include <stdlib.h> #include <mpi.h> #define ntotal 200 main ( argc, argv) 42

    int argc;
    char **argv;
    {
      double amax, gmax, a[ntotal], b[ntotal], c[ntotal], d[ntotal];
      int i, j, k;
      FILE *fp;
      int nproc, myid, istart, iend, icount, r_nbr, l_nbr, lastp;
      int itag, isrc, idest, istart1, icount1, istart2, iend1, istartm1, iendp1;
      int gstart[16], gend[16], gcount[16];
      MPI_Status istat[8];
      MPI_Comm comm;
      extern double max(double, double);

      MPI_Init (&argc, &argv);
      MPI_Comm_size (MPI_COMM_WORLD, &nproc);
      MPI_Comm_rank (MPI_COMM_WORLD, &myid);
      comm=MPI_COMM_WORLD;
      /* territory of every CPU, then of this CPU */
      startend (nproc, 0, ntotal-1, gstart, gend, gcount);
      istart=gstart[myid];
      iend=gend[myid];
      icount=gcount[myid];
      lastp=nproc-1;
      printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
      /* ghost indices and trimmed loop bounds */
      istartm1=istart-1;
      iendp1=iend+1;
      istart2=istart;
      if (myid == 0) istart2=istart+1;
      iend1=iend;
      if (myid == lastp) iend1=iend-1;
      /* left and right neighbors */
      l_nbr = myid - 1;
      r_nbr = myid + 1;
      if (myid == 0) l_nbr=MPI_PROC_NULL;
      if (myid == lastp) r_nbr=MPI_PROC_NULL;

    /* READ 'input.dat', and distribute input data */
      if ( myid==0) {
        fp = fopen( "input.dat", "r");
        fread( (void *)&b, sizeof(b), 1, fp );
        fread( (void *)&c, sizeof(c), 1, fp );
        fread( (void *)&d, sizeof(d), 1, fp );
        fclose( fp );
        for (idest = 1; idest < nproc; idest++) {
          istart1=gstart[idest];
          icount1=gcount[idest];
          itag=10;
          MPI_Send ((void *)&b[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
          itag=20;
          MPI_Send ((void *)&c[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
          itag=30;
          MPI_Send ((void *)&d[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
        }
      }
      else {
        isrc=0;
        itag=10;
        MPI_Recv ((void *)&b[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
        itag=20;
        MPI_Recv ((void *)&c[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
        itag=30;
        MPI_Recv ((void *)&d[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
      }
    /* Exchange data outside the territory */
      itag=110;
      MPI_Sendrecv((void *)&b[iend], 1, MPI_DOUBLE, r_nbr, itag,
                   (void *)&b[istartm1], 1, MPI_DOUBLE, l_nbr, itag, comm, istat);
      itag=120;
      MPI_Sendrecv((void *)&b[istart], 1, MPI_DOUBLE, l_nbr, itag,
                   (void *)&b[iendp1], 1, MPI_DOUBLE, r_nbr, itag, comm, istat);

    /* Compute, gather and write out the computed result */
      amax= -1.0e12;
      for (i=istart2; i<=iend1; i++) {
        a[i]=c[i]*d[i]+(b[i-1]+2.0*b[i]+b[i+1])*0.25;
        amax=max(amax,a[i]);
      }
      itag=130;
      if (myid > 0) {
        idest=0;
        MPI_Send((void *)&a[istart], icount, MPI_DOUBLE, idest, itag, comm);
      }
      else {
        for (isrc=1; isrc<nproc; isrc++) {
          istart1=gstart[isrc];
          icount1=gcount[isrc];
          MPI_Recv((void *)&a[istart1], icount1, MPI_DOUBLE, isrc, itag, comm, istat);
        }
      }
      MPI_Allreduce((void *)&amax, (void *)&gmax, 1, MPI_DOUBLE, MPI_MAX, comm);
      amax=gmax;
      if( myid == 0) {
        for (i = 0; i < ntotal; i+=40) {
          printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
                  a[i],a[i+5],a[i+10],a[i+15],a[i+20],a[i+25],a[i+30],a[i+35]);
        }
        printf ("MAXIMUM VALUE OF ARRAY A is %f\n", amax);
      }
      MPI_Finalize();
      return 0;
    }

    double max(double a, double b)
    {
      if(a >= b)
        return a;
      else
        return b;
    }
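The index bookkeeping that T3CP performs before its exchange can be checked without MPI. The sketch below is only an illustration under stated assumptions: territory_of and the Territory struct are hypothetical names, the even block split stands in for the document's startend function (whose body is not shown here), and NO_NEIGHBOR is a placeholder for MPI_PROC_NULL so that the code compiles without mpi.h.

```c
#include <assert.h>

/* Stand-in for MPI_PROC_NULL so the sketch compiles without mpi.h (assumption). */
#define NO_NEIGHBOR (-1)

typedef struct {
    int istart, iend;     /* owned global index range, inclusive */
    int istart2, iend1;   /* loop bounds after trimming the global end points */
    int l_nbr, r_nbr;     /* left and right neighbor ranks */
} Territory;

/* Hypothetical even block split of indices 0..ntotal-1 over nproc ranks.
   Assumes ntotal is divisible by nproc, as in T3CP (200 elements, 4 CPUs). */
Territory territory_of(int myid, int nproc, int ntotal)
{
    Territory t;
    int chunk;

    assert(nproc > 0 && myid >= 0 && myid < nproc);
    assert(ntotal % nproc == 0);

    chunk = ntotal / nproc;
    t.istart = myid * chunk;
    t.iend   = t.istart + chunk - 1;

    /* The sequential loop covers 1..ntotal-2, so only the first and the
       last rank shrink their range by one element. */
    t.istart2 = (myid == 0)         ? t.istart + 1 : t.istart;
    t.iend1   = (myid == nproc - 1) ? t.iend - 1   : t.iend;

    /* Ranks at the two ends have no neighbor on the outer side. */
    t.l_nbr = (myid == 0)         ? NO_NEIGHBOR : myid - 1;
    t.r_nbr = (myid == nproc - 1) ? NO_NEIGHBOR : myid + 1;
    return t;
}
```

Calling territory_of(0, 4, 200) gives the range 0 to 49 with loop bounds 1 to 49 and no left neighbor; territory_of(3, 4, 200) gives the range 150 to 199 with loop bounds 150 to 198 and no right neighbor.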

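The test results of the boundary data exchange parallel program T3CP, with computing partition but without data partition, are as follows:

    ATTENTION: nodes allocated by LoadLeveler, continuing...
    NPROC,MYID,ISTART,IEND=
    NPROC,MYID,ISTART,IEND=
    NPROC,MYID,ISTART,IEND=
    NPROC,MYID,ISTART,IEND=
    MAXIMUM VALUE OF ARRAY A is

In a parallel program whose arrays are not partitioned, CPU0 can also use MPI_Bcast to send the whole array to every CPU after reading the data. The program is simpler to write this way, but a lot of extra data is transferred. Which approach is better depends on the total time each transfer scheme takes; a single transfer of a large amount of data is usually faster than many transfers of small amounts. How to obtain timing information for a running parallel program is covered in Chapter 4. The following fragment uses MPI_Bcast in place of the MPI_Send and MPI_Recv calls of T3CP to distribute the input data:

      if ( myid==0) {
        fp = fopen( "input.dat", "r");
        fread( (void *)&b, sizeof(b), 1, fp );
        fread( (void *)&c, sizeof(c), 1, fp );
        fread( (void *)&d, sizeof(d), 1, fp );
        fclose( fp );
      }
      iroot=0;
      MPI_Bcast( (void *)&b, ntotal, MPI_DOUBLE, iroot, comm);
      MPI_Bcast( (void *)&c, ntotal, MPI_DOUBLE, iroot, comm);
      MPI_Bcast( (void *)&d, ntotal, MPI_DOUBLE, iroot, comm);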
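The "extra data" in the broadcast variant can be put into numbers. The sketch below only counts array elements delivered to the non-root CPUs for one array; it says nothing about elapsed time, which is what actually decides between the two schemes. Both function names are hypothetical, and the segment count assumes that ntotal is divisible by nproc.

```c
#include <assert.h>

/* Elements received by all non-root ranks when the root sends each rank
   only its own segment (the MPI_Send/MPI_Recv scheme of T3CP).
   Assumes ntotal is divisible by nproc. */
long elements_by_segments(long ntotal, int nproc)
{
    assert(nproc > 0 && ntotal % nproc == 0);
    return (ntotal / nproc) * (nproc - 1);
}

/* Elements received by all non-root ranks when the whole array is
   broadcast (the MPI_Bcast scheme). */
long elements_by_broadcast(long ntotal, int nproc)
{
    assert(nproc > 0);
    return ntotal * (nproc - 1);
}
```

For ntotal = 200 on 4 CPUs, the segment scheme delivers 150 elements per array and the broadcast delivers 600, in exchange for a single collective call per array instead of a loop of sends.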

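3.4 Boundary Data Exchange Parallel Program with Data Partition (1): T3DCP_1

When both the computation and the data are partitioned and the program runs on np CPUs, each array only needs length n, which is 1/np of the original length ntotal. In this case ntotal must be divisible by np. Adding one storage position in front and one behind for the elements that come from outside the territory, the required array length is n+2, that is, the dimension is [n+2], and the data inside the territory are stored at indices 1 to n. The arrays are therefore declared as:

    double a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal], amax, gmax;

as shown in Figure 3.3:

[Figure 3.3: Parallel computation with data partition and exchange of one boundary element. cpu0 through cpu3 each hold owned data and exchanged data cells; the neighbor beyond the left end and beyond the right end is MPI_PROC_NULL.]

With this layout, the territory of every CPU starts at index 1 and ends at index n, so:

    istart = 1;
    iend = n;

The for loop starts at 2 on the first CPU and at 1 on the other CPUs; it ends at n-1 on the last CPU and at n on the other CPUs. That is: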

    istart2 = istart;
    if (myid == 0) istart2 = 2;
    iend1 = iend;
    if (myid == nproc-1) iend1 = iend - 1;

Each CPU sends the last array element of its territory, at index iend, to the CPU on its right, and stores the element received from the CPU on its left at index istart-1. So:

    istartm1 = istart - 1;
    itag = 110;
    MPI_Sendrecv ((void *)&b[iend], 1, MPI_DOUBLE, r_nbr, itag,
                  (void *)&b[istartm1], 1, MPI_DOUBLE, l_nbr, itag, comm, istat);

Each CPU also sends the first array element of its territory, at index istart, to the CPU on its left, and stores the element received from the CPU on its right at index iend+1. So:

    iendp1 = iend + 1;
    itag = 120;
    MPI_Sendrecv ((void *)&b[istart], 1, MPI_DOUBLE, l_nbr, itag,
                  (void *)&b[iendp1], 1, MPI_DOUBLE, r_nbr, itag, comm, istat);

After the ntotal elements of the b, c, and d data have been read into array t, they are distributed to the CPUs with MPI_Scatter. The b, c, and d arrays are indexed from zero, but the input data are stored starting at index 1, so the MPI_Scatter call must say that storage starts at b[1], c[1], and d[1]. The same applies to MPI_Gather.

    iroot = 0;
    MPI_Scatter (t, n, MPI_DOUBLE, &b[1], n, MPI_DOUBLE, iroot, comm);
    MPI_Gather (&a[1], n, MPI_DOUBLE, t, n, MPI_DOUBLE, iroot, comm);

The boundary data exchange parallel program T3DCP_1, with both computation and data partitioned, is as follows:

    /* PROGRAM T3DCP_1
       Boundary data exchange with data & computing partition
       Using MPI_Gather, MPI_Scatter to gather & scatter data */

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #define ntotal 200
    #define n 50
    #define np 4
    main ( argc, argv)
    int argc;
    char **argv;
    {
      double amax, gmax, a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal];
      int i, j, k;
      FILE *fp;
      int nproc, myid, istart, iend, istart2, iend1, istartm1, iendp1;
      int r_nbr, l_nbr, lastp, iroot, itag;
      MPI_Status istat[8];
      MPI_Comm comm;
      extern double max(double, double);

      MPI_Init (&argc, &argv);
      MPI_Comm_size (MPI_COMM_WORLD, &nproc);
      MPI_Comm_rank (MPI_COMM_WORLD, &myid);
      comm=MPI_COMM_WORLD;
      /* owned data live at local indices 1..n */
      istart=1;
      iend=n;
      lastp=nproc-1;
      printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
      istartm1=istart-1;
      iendp1=iend+1;
      istart2=istart;
      if(myid == 0) istart2=2;
      iend1=iend;
      if(myid == lastp ) iend1=iend-1;

      l_nbr = myid - 1;
      r_nbr = myid + 1;
      if(myid == 0) l_nbr=MPI_PROC_NULL;
      if(myid == lastp) r_nbr=MPI_PROC_NULL;
    /* READ 'input.dat', and distribute input data */
      if( myid==0) {
        fp = fopen( "input.dat", "r");
        fread( (void *)&t, sizeof(t), 1, fp );
      }
      iroot=0;
      MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b[1], n, MPI_DOUBLE, iroot, comm);
      if( myid==0)
        fread( (void *)&t, sizeof(t), 1, fp );
      MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&c[1], n, MPI_DOUBLE, iroot, comm);
      if( myid==0) {
        fread( (void *)&t, sizeof(t), 1, fp );
        fclose( fp );
      }
      MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&d[1], n, MPI_DOUBLE, iroot, comm);
    /* Exchange data outside the territory */
      itag=110;
      MPI_Sendrecv((void *)&b[iend], 1, MPI_DOUBLE, r_nbr, itag,
                   (void *)&b[istartm1], 1, MPI_DOUBLE, l_nbr, itag, comm, istat);
      itag=120;
      MPI_Sendrecv((void *)&b[istart], 1, MPI_DOUBLE, l_nbr, itag,
                   (void *)&b[iendp1], 1, MPI_DOUBLE, r_nbr, itag, comm, istat);
    /* Compute, gather and write out the computed result */
      amax= -1.0e12;
      for (i=istart2; i<=iend1; i++) {
        a[i]=c[i]*d[i] + ( b[i-1] + 2.0*b[i] + b[i+1] )*0.25;
        amax=max(amax,a[i]);

      }
      MPI_Gather((void *)&a[istart], n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, iroot, comm);
      MPI_Allreduce((void *)&amax, (void *)&gmax, 1, MPI_DOUBLE, MPI_MAX, comm);
      amax=gmax;
      if( myid == 0) {
        for (i = 0; i < ntotal; i+=40) {
          printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
                  t[i],t[i+5],t[i+10],t[i+15],t[i+20],t[i+25],t[i+30],t[i+35]);
        }
        printf ("MAXIMUM VALUE OF ARRAY A is %f\n", amax);
      }
      MPI_Finalize();
      return 0;
    }

    double max(double a, double b)
    {
      if(a >= b)
        return a;
      else
        return b;
    }

The test results of the boundary data exchange parallel program T3DCP_1, with both computation and data partitioned, are as follows:

    ATTENTION: nodes allocated by LoadLeveler, continuing...
    NPROC,MYID,ISTART,IEND=
    NPROC,MYID,ISTART,IEND=
    NPROC,MYID,ISTART,IEND=
    NPROC,MYID,ISTART,IEND=
    MAXIMUM VALUE OF ARRAY A is
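The data flow of T3DCP_1 (scatter into local indices 1 to n, fill the two ghost cells, compute with trimmed bounds, gather) can be modeled serially to confirm that it reproduces the sequential result. The sketch below is such a model and not an MPI program: the "CPUs" are rows of a two-dimensional array, the exchange is a plain copy, the function name halo_demo_max_error is hypothetical, and the input values are arbitrary test data chosen for the sketch because the contents of input.dat are not shown.

```c
#include <assert.h>

#define NTOTAL 200            /* global array length, as in T3DCP_1 */
#define NP 4                  /* number of simulated CPUs */
#define N (NTOTAL / NP)       /* owned elements per CPU */

/* Serial model of the T3DCP_1 data flow. Returns the largest absolute
   difference between the gathered result and a sequential reference. */
double halo_demo_max_error(void)
{
    static double b[NTOTAL], c[NTOTAL], d[NTOTAL], a_ref[NTOTAL], t[NTOTAL];
    static double lb[NP][N + 2], lc[NP][N + 2], ld[NP][N + 2], la[NP][N + 2];
    double diff, worst = 0.0;
    int r, i, g, lo, hi;

    /* Arbitrary test data (assumption; the real input.dat is not shown). */
    for (g = 0; g < NTOTAL; g++) {
        b[g] = 3.0 / (g + 1) + 1.0;
        c[g] = 2.0 / (g + 1) + 1.0;
        d[g] = 1.0 / (g + 1) + 1.0;
    }

    /* Sequential reference over global indices 1..NTOTAL-2. */
    for (g = 1; g < NTOTAL - 1; g++)
        a_ref[g] = c[g] * d[g] + (b[g - 1] + 2.0 * b[g] + b[g + 1]) * 0.25;

    /* "Scatter": owned data go to local indices 1..N. */
    for (r = 0; r < NP; r++)
        for (i = 1; i <= N; i++) {
            lb[r][i] = b[r * N + i - 1];
            lc[r][i] = c[r * N + i - 1];
            ld[r][i] = d[r * N + i - 1];
        }

    /* "Sendrecv": fill ghost cells 0 and N+1 from the neighbors; the outer
       ends have no neighbor, which plays the role of MPI_PROC_NULL. */
    for (r = 0; r < NP; r++) {
        if (r > 0)      lb[r][0]     = lb[r - 1][N];
        if (r < NP - 1) lb[r][N + 1] = lb[r + 1][1];
    }

    /* Local compute with trimmed bounds, then "gather" into t. */
    for (r = 0; r < NP; r++) {
        lo = (r == 0) ? 2 : 1;
        hi = (r == NP - 1) ? N - 1 : N;
        for (i = lo; i <= hi; i++)
            la[r][i] = lc[r][i] * ld[r][i]
                     + (lb[r][i - 1] + 2.0 * lb[r][i] + lb[r][i + 1]) * 0.25;
        for (i = 1; i <= N; i++)
            t[r * N + i - 1] = la[r][i];
    }

    /* Compare on the range the sequential loop actually computes. */
    for (g = 1; g < NTOTAL - 1; g++) {
        diff = t[g] - a_ref[g];
        if (diff < 0.0) diff = -diff;
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

A return value at or near zero indicates that the local index shift by one and the trimmed bounds on the first and last CPU line up with the global loop range 1 to ntotal-2.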

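3.5 Boundary Data Exchange Parallel Program with Data Partition (2): T3DCP_2

Suppose the for loop of program T3DCP_1 in section 3.4 needs two neighboring array elements on each side, as in:

    for (i=2; i<ntotal-2; i++)
      a[i]=c[i]*d[i]+( b[i-2] + 2.0*b[i-1] + 2.0*b[i] + 2.0*b[i+1] + b[i+2] )*0.125;

The parallelization is the same as for T3DCP_1. The partitioned arrays just need four extra elements, so the dimension becomes [n+4], with the data inside the territory stored at indices 2 to n+1, as shown in Figure 3.4:

    double a[n+4], b[n+4], c[n+4], d[n+4], t[ntotal], amax, gmax;
    istart = 2;
    iend = n+1;

[Figure 3.4: Parallel computation with data partition and exchange of two boundary elements. cpu0 through cpu3 each hold owned data and exchanged data cells; the neighbor beyond the left end and beyond the right end is MPI_PROC_NULL.]

The start and end indices of the for loop also have to change. The first CPU starts at local index 4 and the last CPU ends at iend-2, so:

    istart3 = istart;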

    if (myid == 0) istart3 = 4;
    iend2 = iend;
    if (myid == nproc-1) iend2 = iend - 2;

Naturally, the amount of boundary data exchanged also goes from one element to two. Each CPU sends the last two array elements of its territory, starting at iend-1, to the CPU on its right, and stores the two elements received from the CPU on its left starting at istart-2. So:

    iendm1 = iend - 1;
    istartm2 = istart - 2;
    itag = 110;
    MPI_Sendrecv ((void *)&b[iendm1], 2, MPI_DOUBLE, r_nbr, itag,
                  (void *)&b[istartm2], 2, MPI_DOUBLE, l_nbr, itag, comm, istat);

Each CPU also sends the first two array elements of its territory, starting at istart, to the CPU on its left, and stores the two elements received from the CPU on its right starting at iend+1. So:

    iendp1 = iend + 1;
    itag = 120;
    MPI_Sendrecv ((void *)&b[istart], 2, MPI_DOUBLE, l_nbr, itag,
                  (void *)&b[iendp1], 2, MPI_DOUBLE, r_nbr, itag, comm, istat);

The parallel program that exchanges two boundary array elements is as follows:

    /* PROGRAM T3DCP_2
       Two elements of boundary data exchange with data & computing partition
       Using MPI_Gather, MPI_Scatter to gather & scatter data */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #define ntotal 200
    #define n 50
    #define np 4
    main ( argc, argv)

    int argc;
    char **argv;
    {
      double amax, gmax, a[n+4], b[n+4], c[n+4], d[n+4], t[ntotal];
      int i, j, k;
      FILE *fp;
      int nproc, myid, istart, iend, istart3, iend2, istartm2, iendm1, iendp1;
      int r_nbr, l_nbr, lastp, iroot, itag;
      MPI_Status istat[8];
      MPI_Comm comm;
      extern double max(double, double);

      MPI_Init (&argc, &argv);
      MPI_Comm_size (MPI_COMM_WORLD, &nproc);
      MPI_Comm_rank (MPI_COMM_WORLD, &myid);
      comm=MPI_COMM_WORLD;
      /* owned data live at local indices 2..n+1 */
      istart=2;
      iend=n+1;
      lastp=nproc-1;
      printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
      istartm2=istart-2;
      iendp1=iend+1;
      iendm1=iend-1;
      istart3=istart;
      if(myid == 0) istart3=4;
      iend2=iend;
      if(myid == lastp ) iend2=iend-2;
      l_nbr = myid - 1;
      r_nbr = myid + 1;
      if(myid == 0) l_nbr=MPI_PROC_NULL;
      if(myid == lastp) r_nbr=MPI_PROC_NULL;
    /* READ 'input.dat', and distribute input data */

      if ( myid==0) {
        fp = fopen( "input.dat", "r");
        fread( (void *)&t, sizeof(t), 1, fp );
      }
      iroot=0;
      MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b[2], n, MPI_DOUBLE, iroot, comm);
      if( myid==0)
        fread( (void *)&t, sizeof(t), 1, fp );
      MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&c[2], n, MPI_DOUBLE, iroot, comm);
      if ( myid==0) {
        fread( (void *)&t, sizeof(t), 1, fp );
        fclose( fp );
      }
      MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&d[2], n, MPI_DOUBLE, iroot, comm);
    /* Exchange data outside the territory */
      itag=110;
      MPI_Sendrecv((void *)&b[iendm1], 2, MPI_DOUBLE, r_nbr, itag,
                   (void *)&b[istartm2], 2, MPI_DOUBLE, l_nbr, itag, comm, istat);
      itag=120;
      MPI_Sendrecv((void *)&b[istart], 2, MPI_DOUBLE, l_nbr, itag,
                   (void *)&b[iendp1], 2, MPI_DOUBLE, r_nbr, itag, comm, istat);
    /* Compute, gather and write out the computed result */
      amax= -1.0e12;
      for (i=istart3; i<=iend2; i++) {
        a[i]=c[i]*d[i] + ( b[i-2] + 2.0*b[i-1] + 2.0*b[i] + 2.0*b[i+1] + b[i+2] )*0.125;
        amax=max(amax,a[i]);
      }
      MPI_Gather((void *)&a[istart], n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, iroot, comm);
      MPI_Allreduce((void *)&amax, (void *)&gmax, 1, MPI_DOUBLE, MPI_MAX, comm);
      amax=gmax;
      if( myid == 0) {
        for (i = 0; i < ntotal; i+=40) {
          printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
                  t[i],t[i+5],t[i+10],t[i+15],t[i+20],t[i+25],t[i+30],t[i+35]);

        }
        printf ("MAXIMUM VALUE OF ARRAY A is %f\n", amax);
      }
      MPI_Finalize();
      return 0;
    }

    double max(double a, double b)
    {
      if(a >= b)
        return a;
      else
        return b;
    }

The test results of the parallel program T3DCP_2, which exchanges two boundary array elements, are as follows:

    ATTENTION: nodes allocated by LoadLeveler, continuing...
    NPROC,MYID,ISTART,IEND=
    NPROC,MYID,ISTART,IEND=
    NPROC,MYID,ISTART,IEND=
    NPROC,MYID,ISTART,IEND=
    MAXIMUM VALUE OF ARRAY A is
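T3DCP_1 and T3DCP_2 differ only in how many neighboring elements the loop reaches for, so their index setup can be written once in terms of a halo width h. The sketch below is a generalization made for illustration and is not code from the document: HaloLayout and halo_layout are hypothetical names, and it assumes the sequential loop covers global indices h to ntotal-1-h, which matches the loop bounds used by both programs.

```c
#include <assert.h>

typedef struct {
    int len;          /* local array length: n owned cells plus 2*h ghost cells */
    int istart, iend; /* first and last owned local index */
    int lo, hi;       /* local for-loop bounds */
    int send_right;   /* first index of the h cells sent to the right neighbor */
    int recv_left;    /* first index of the h ghost cells filled from the left */
    int send_left;    /* first index of the h cells sent to the left neighbor */
    int recv_right;   /* first index of the h ghost cells filled from the right */
} HaloLayout;

/* Index layout for one rank when each rank owns n elements and the loop
   reads h neighbors on each side (h = 1 for T3DCP_1, h = 2 for T3DCP_2). */
HaloLayout halo_layout(int myid, int nproc, int n, int h)
{
    HaloLayout L;

    assert(nproc > 0 && myid >= 0 && myid < nproc);
    assert(h >= 1 && n >= h);

    L.len    = n + 2 * h;
    L.istart = h;
    L.iend   = n + h - 1;

    /* Only the first and last ranks skip the h elements at the global ends. */
    L.lo = (myid == 0)         ? L.istart + h : L.istart;
    L.hi = (myid == nproc - 1) ? L.iend - h   : L.iend;

    L.send_right = L.iend - h + 1;
    L.recv_left  = L.istart - h;
    L.send_left  = L.istart;
    L.recv_right = L.iend + 1;
    return L;
}
```

With n = 50 and h = 1 this gives istart = 1, iend = 50, and a loop start of 2 on the first CPU, as in T3DCP_1. With h = 2 it gives istart = 2, iend = 51, a loop start of 4 on the first CPU, a loop end of 49 on the last CPU, and a send to the right that begins at iend-1, as in T3DCP_2.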

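Chapter 4 Parallel Programs Whose Grid Points Are Not Evenly Divisible

This chapter looks at what to do when the number of grid points cannot be divided evenly by the number of CPUs taking part in the parallel computation. The number of grid points is simply the dimension of the arrays.

Section 4.1: in the sequential program T4SEQ the array dimension is 161, which is divisible only by 7 and 23.
Section 4.2 introduces MPI_Scatterv and MPI_Gatherv, two "collective communication" functions. They work like MPI_Scatter and MPI_Gather, except that the data sent to or collected from the individual CPUs need not have the same length.
Section 4.3 introduces MPI_Pack and MPI_Unpack, which pack and unpack data, together with the synchronization call MPI_Barrier and the wall-clock time call MPI_Wtime.
Section 4.4 shows how to use these MPI calls to rewrite the sequential program T4SEQ as the parallel program T4DCP.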

4.1 Sequential Program with Non-Divisible Grid Points: T4SEQ

In the sequential program T4SEQ the array dimension is 161, which is divisible only by 7 and 23. Besides the array data a, b, c, and d, the program uses three scalars p, q, and r. After the initial values of the variables are set, they are also written to a disk file, which makes it easier to verify the parallel version.

    /* PROGRAM T4SEQ
       Sequential Version of an odd-dimensioned array with -1, +1 access */
    #include <stdio.h>
    #include <stdlib.h>
    #define ntotal 161
    main ()
    {
      double a[ntotal], b[ntotal], c[ntotal], d[ntotal], p, q, r, pqr[3];
      int i, j;
      FILE *fp;
      extern double max(double, double);
    /* READ 'input.dat', COMPUTE AND WRITE OUT THE RESULT */
      for (i = 0; i < ntotal; i++) {
        b[i]=3.0/(double)(i+1)+1.0;
        c[i]=2.0/(double)(i+1)+1.0;
        d[i]=1.0/(double)(i+1)+1.0;
      }
      p=1.45;
      q=2.62;
      r=0.5;
      pqr[0]=p;
      pqr[1]=q;
      pqr[2]=r;
      fp = fopen( "input.dat", "w");
      fwrite((void *)&b, sizeof(b), 1, fp );
      fwrite((void *)&c, sizeof(c), 1, fp );

      fwrite((void *)&d, sizeof(d), 1, fp );
      fwrite((void *)&pqr, sizeof(pqr), 1, fp );
      fclose( fp );
      fp = fopen( "input.dat", "r");
      fread( (void *)&b, sizeof(b), 1, fp );
      fread( (void *)&c, sizeof(c), 1, fp );
      fread( (void *)&d, sizeof(d), 1, fp );
      fread( (void *)&pqr, sizeof(pqr), 1, fp );
      fclose( fp );
      p=pqr[0];
      q=pqr[1];
      r=pqr[2];
      for (i = 1; i < ntotal-1; i++) {
        a[i]=c[i]*d[i]*p+(b[i-1]+2.0*b[i]+b[i+1])*q+r;
      }
      for (i = 0; i < ntotal-1; i+=40) {
        printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
                a[i],a[i+5],a[i+10],a[i+15],a[i+20],a[i+25],a[i+30],a[i+35]);
      }
      return 0;
    }

The test results of the sequential program T4SEQ are as follows:

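4.2 MPI_Scatterv, MPI_Gatherv

Suppose the sequential program T4SEQ of section 4.1 is to be parallelized with its arrays partitioned as well. The array dimension is 161, which is divisible only by 7 and 23, so it cannot be divided evenly by the number of CPUs taking part in the parallel computation. On such a number of CPUs, MPI_Scatter cannot be used to distribute the input data, and MPI_Gather cannot be used to collect the data from the CPUs. MPI_Send and MPI_Recv can of course still do the distribution and collection of the array data. MPI also provides MPI_Scatterv and MPI_Gatherv, which distribute and collect array data of unequal lengths.

MPI_Scatterv works like MPI_Scatter, and MPI_Gatherv works like MPI_Gather. MPI_Scatter sends the same amount of data to every CPU and MPI_Gather collects the same amount of data from every CPU; MPI_Scatterv and MPI_Gatherv do not have this restriction. MPI_Scatterv is called as follows:

    MPI_Scatterv ((void *)&t, gcount, gdisp, MPI_DOUBLE,
                  (void *)&c[1], mycount, MPI_DOUBLE, iroot, comm);

With MPI_Scatterv, the CPU iroot sends an array t in segments, in CPU id order, to every CPU, including itself. Because the amount of data sent to each CPU can differ, an array gcount holds the number of items sent to each CPU, and another array gdisp holds the position in t, relative to its start, of the data sent to each CPU. The arguments are, in order:

    t           start of the array to be sent
    gcount      integer array holding the number of items to send to each CPU
    gdisp       integer array holding, for each CPU, where its data start relative to the start of t
    MPI_DOUBLE  type of the data to be sent
    c[1]        start of the storage for the received data
    mycount     number of items to be received
    MPI_DOUBLE  type of the received data
    iroot       CPU id of the CPU that sends the data

Here gcount and gdisp are computed by calling the function startend: gcount holds the length of each CPU's array segment after partitioning, and gdisp holds the start index of each segment in the unpartitioned array, minus one. Because CPU ids count from zero, gcount and gdisp are also indexed from zero. So:

    double a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal];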
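The count and displacement arrays that MPI_Scatterv expects can be built with a few lines of integer arithmetic. The sketch below is a hypothetical helper named block_counts and not the document's startend function, which may hand out the leftover elements differently; here the first ntotal % nproc CPUs each get one extra element. Displacements are zero-based, so gdisp[r] is the number of elements of t that precede the segment of CPU r.

```c
#include <assert.h>

/* Hypothetical helper (not the document's startend): split ntotal items
   over nproc ranks, giving the first (ntotal % nproc) ranks one extra item.
   gcount and gdisp must each have room for nproc ints. */
void block_counts(int ntotal, int nproc, int *gcount, int *gdisp)
{
    int base, extra, r, offset = 0;

    assert(nproc > 0 && ntotal >= 0);
    base  = ntotal / nproc;
    extra = ntotal % nproc;

    for (r = 0; r < nproc; r++) {
        gcount[r] = base + (r < extra ? 1 : 0);
        gdisp[r]  = offset;        /* elements of t that come before rank r */
        offset   += gcount[r];
    }
}
```

For ntotal = 161 on 4 CPUs this yields counts 41, 40, 40, 40 and displacements 0, 41, 81, 121; each CPU would then pass its own gcount[myid] as mycount.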
