Research on Efficient Collective Communication Algorithms of Interconnection Networks for Multicomputers Dissertation for the Doctor Degree of Univers



Similar documents
University of Science and Technology of China A dissertation for master s degree Research of e-learning style for public servants under the context of

IP TCP/IP PC OS µclinux MPEG4 Blackfin DSP MPEG4 IP UDP Winsock I/O DirectShow Filter DirectShow MPEG4 µclinux TCP/IP IP COM, DirectShow I

致 谢 开 始 这 篇 致 谢 的 时 候, 以 为 这 是 最 轻 松 最 愉 快 的 部 分, 而 此 时 心 头 却 充 满 了 沉 甸 甸 的 回 忆 和 感 恩, 一 时 间 竟 无 从 下 笔 虽 然 这 远 不 是 一 篇 完 美 的 论 文, 但 完 成 这 篇 论 文 要 感 谢

WTO

输电线路智能监测系统通信技术应用研究

A dissertation for Master s degree Metro Indoor Coverage Systems Analysis And Design Author s Name: Sheng Hailiang speciality: Supervisor:Prof.Li Hui,

Microsoft Word - 专论综述1.doc

2006中國文學研究範本檔

Construction of Chinese pediatric standard database A Dissertation Submitted for the Master s Degree Candidate:linan Adviser:Prof. Han Xinmin Nanjing

Public Projects A Thesis Submitted to Department of Construction Engineering National Kaohsiung First University of Science and Technology In Partial

Microsoft Word - 7-針刺.doc

Shanghai International Studies University THE STUDY AND PRACTICE OF SITUATIONAL LANGUAGE TEACHING OF ADVERB AT BEGINNING AND INTERMEDIATE LEVEL A Thes

Abstract / / B-ISDN ATM Crossbar Batcher banyan N DPA Modelsim Verilog Synopsys Design Analyzer Modelsim FPGA ISE FPGA ATM ii

UDC 厦门大学博硕士论文摘要库

XML SOAP DOM B2B B/S B2B B2B XML SOAP

Microsoft Word - netcontr.doc

國立中山大學學位論文典藏.PDF

F4

Microsoft Word - 論文封面 修.doc

untitled

A VALIDATION STUDY OF THE ACHIEVEMENT TEST OF TEACHING CHINESE AS THE SECOND LANGUAGE by Chen Wei A Thesis Submitted to the Graduate School and Colleg

我国原奶及乳制品安全生产和质量安全管理研究

國立中山大學學位論文典藏.PDF


课题调查对象:

本科毕业设计(论文)工作细则&撰写规范

[ ],,,,,,,,,,,,,,, [ ] ; ; ; [ ]F120 [ ]A [ ] X(2018) , :,,,, ( ),,,,,,, [ ] [ ],, 5

1

Abstract Today, the structures of domestic bus industry have been changed greatly. Many manufacturers enter into the field because of its lower thresh

Microsoft Word M 黃士種

國立中山大學學位論文典藏

中國文化大學政治學研究所

第三章 国内外小组合作学习的应用情况

Thesis for the Master degree in Engineering Research on Negative Pressure Wave Simulation and Signal Processing of Fluid-Conveying Pipeline Leak Candi

Mechanical Science and Technology for Aerospace Engineering October Vol No. 10 Web SaaS B /S Web2. 0 Web2. 0 TP315 A

报 告 1: 郑 斌 教 授, 美 国 俄 克 拉 荷 马 大 学 医 学 图 像 特 征 分 析 与 癌 症 风 险 评 估 方 法 摘 要 : 准 确 的 评 估 癌 症 近 期 发 病 风 险 和 预 后 或 者 治 疗 效 果 是 发 展 和 建 立 精 准 医 学 的 一 个 重 要 前


硕 士 学 位 论 文 论 文 题 目 : 北 岛 诗 歌 创 作 的 双 重 困 境 专 业 名 称 : 中 国 现 当 代 文 学 研 究 方 向 : 中 国 新 诗 研 究 论 文 作 者 : 奚 荣 荣 指 导 老 师 : 姜 玉 琴 2014 年 12 月


~ ~ ~

Microsoft PowerPoint - talk8.ppt

Microsoft Word - a8_wu_guangyun

東方設計學院文化創意設計研究所

Microsoft Word - bxyj2007_01_zongdi225.doc

Microsoft Word - 2.v3n1.gjtm.docx

國家圖書館典藏電子全文

闲 旅 游 现 已 成 为 城 市 居 民 日 常 生 活 的 重 要 部 分 袁 它 的 出 现 标 志 着 现 代 社 会 文 明 的 进 步 遥 据 国 外 学 者 预 测 袁 2015 年 左 右 袁 发 达 国 家 将 陆 续 进 入 野 休 闲 时 代 冶 袁 发 展 中 国 家 也 将

m 3 m m 84 m m m m m m m

Microsoft Word 張嘉玲-_76-83_

标题

Time Estimation of Occurrence of Diabetes-Related Cardiovascular Complications by Ching-Yuan Hu A thesis submitted in partial fulfillment of the requi


http / /yxxy. cbpt. cnki. net / % % %


Microsoft Word - 1-編者的話

中山大學學位論文典藏

50% SWEET 甜 蜜 五 分 仔 - 橋 頭 糖 廠 紀 念 商 品 開 發 設 計 之 研 究 50% SWEET - The Study on the Development and Design of Souvenirs of Qiao Tou Sugar Plant 研 究 生 : 陳

Improved Preimage Attacks on AES-like Hash Functions: Applications to Whirlpool and Grøstl

Microsoft Word 谢雯雯.doc

JAEA-Technology indb

Microsoft Word - 11-秦华伟.doc

文档 9

Microsoft Word - Alan Jameson's Master's Thesis.pdf




Microsoft Word - 口試本封面.doc

廣州舊城區的保護和發展

排版稿.FIT)

10 ( ) ( ) [5] 1978 : [1] (P13) [6] [1] (P217) [7] [1] (P19) : : [1] [4] (P1347) (P18) 1985 : [1] (P343) 1300 : [1] (P12) 1984 :

黑面琵鷺2015

Microsoft Word - 专论综述1.doc

永遠的革新號--側論《筆匯》遺漏在文學史上的密碼

Microsoft Word - xb 牛尚鹏.doc

\\Lhh\07-02\黑白\内页黑白1-16.p

096STUT DOC

Microsoft Word - A doc

國立高雄大學○○○○○○學系(研究所)(標楷體18號字

陶艳.doc


Microsoft Word - 18.doc

1 科 学 谋 划, 有 序 促 进 扶 贫 工 作 的 持 续 发 展 1.1 科 学 定 位, 精 准 发 现 地 方 的 需 求 按 照 国 家 生 态 功 能 区 的 划 分, 库 伦 旗 属 重 点 生 态 保 护 开 发 区 这 里 生 态 环 境 优 良 特 色 作 物 资 源 优 势

第16卷 第2期 邯郸学院学报 年6月


本 論 文 獲 行 政 院 客 家 委 員 會 99 年 客 家 研 究 優 良 博 碩 士 論 文 獎 助

Microsoft Word - 刘藤升答辩修改论文.doc

“先污染、后治理”是太湖、滇池治污失败的根本原因


中 国 药 事 2016 年 7 月 第 30 卷 第 7 期 709 体 外 诊 断 试 剂 是 指 用 生 物 化 学 免 疫 学 微 生 物 学 分 子 生 物 学 等 原 理 或 方 法 制 备, 在 体 外 用 于 对 人 体 疾 病 的 诊 断 筛 查 或 监 测 及 流 行 病 学 调

BC04 Module_antenna__ doc

LH_Series_Rev2014.pdf

VASP应用运行优化

14A 0.1%5% 14A 14A

穨_2_.PDF

(Chi)_.indb

Microsoft PowerPoint Zhang Guohua.ppt [Compatibility Mode]

Monographic Study.Employment for the Disabled / 专 题 研 究. 残 疾 人 就 业 / 表 1 高 等 院 校 残 疾 大 学 生 录 取 情 况 和 普 通 大 学 生 录 取 情 况 (2005 ~ 2013) 年 份 普 通 高 等 院 校 录

9 Internet 10 Internet

本 論 文 獲 行 政 院 客 家 委 員 會 一 年 客 家 研 究 優 良 博 碩 士 論 文 獎 助

在 应 用 实 践 上 指 导 性 建 议 ( 黄 白,2008) 近 几 年 来, 国 家 政 府 在 教 育 方 面 高 度 重 视 教 育 信 息 化 工 作, 相 继 出 台 一 系 列 政 策 文 件 和 规 范 来 促 进 和 推 动 信 息 技 术 在 教 育 教 学 领 域 的 广

Transcription:

2006 5

Research on Efficient Collective Communication Algorithms of Interconnection Networks for Multicomputers Dissertation for the Doctor Degree of University of Science and Technology of China by Liu Gang ( Computer System Architecture ) Dissertation Supervisor: Professor Gu Naijie May 2006

DCE MCCE MCCE MPICH LAM/MPI I

Clos 3 2 12N ON ( ) VLSI k-fold Qos Clos k-fold k-fold Clos II

Abstract Abstract The development of modern science and life has resulted in increased demand for powerful computing capacity, and the design of parallel systems that have the speed of teraflops and even petaflops requires high performance interconnection networks to interconnect numerous processors. Meanwhile, with the growing of the system size, interprocessor communication overhead more and more become a bottleneck. In large-scale scientific computing and engineering applications, the overhead of collective communication usually takes absolute majority of the entire communication overhead. Thus, the research on interconnection networks and collective communication is crucial to increase the performance of multicomputers so as to improve the execution efficiency of parallel applications. The research works presented in this dissertation are mainly concentrated on improving the performance of collective communication operations on interconnection networks. First of all, this dissertation extensively studied the algorithms for complete exchange on single-port torus. Torus has become more and more popular for interconnecting the processors in parallel and distributed systems due to its topological features, and has been widely adopted by many supercomputers today. Moreover, complete exchange, also known as all-to-all personalized communication, occurs in numerous important applications in parallel computing environment. This dissertation proposes a novel network partitioning scheme on multidimensional single-port torus. By adopting the network partitioning scheme, a near-optimal (in terms of transmission time) indirect algorithm has been presented on multidimensional single-port torus. Compared with other existing algorithms, the algorithm for complete exchange not only has a better scalability, but also prominently improves the communication performance. Secondly, this dissertation improves the algorithm for complete exchange, which has minimum startup cost on single-port 2D and 3D torus, by adopting a bottom-up-send-back approach to scheduling parallel communication. Compared with the original algorithm, the improved algorithm not only achieves minimum startup cost, but also has better communication performance. Moreover, there are few researches that have been done for complete exchange on all-port torus. This dissertation firstly proposes indirect algorithms for complete exchange on all-port 1D ring, 2D and 4D torus, which completely achieve the theoretical lower bounds on message transmission. The new algorithms fully utilize all ports and communication links in all-port torus, and transmit messages along shortest paths. The analytical results show that the proposed algorithms on all-port torus have the best communication performance among all existing algorithms, when the message size is enough long. The dissertation proposes the direct algorithm DCE and indirect algorithm MCCE for complete exchange on clusters connected by Ethernet switched hierarchical network. The two algorithms fully utilized the bandwidth in the bottleneck links and theoretically achieve the lower bounds on message transmission. In addition, the algorithm MCCE significantly reduces message startup cost and synchronization cost so as to further improve the communication performance of complete exchange. Experimental results show that the two proposed algorithms significantly outperform other complete exchange algorithms included in MPICH and LAM/MPI, on Ethernet switched III

Abstract clusters with hierarchical network topologies when the message size is long. To minimize multicast latency and eliminate the memory bottleneck induced by traditional multicast in software, this dissertation takes into account multistage interconnection networks which support concurrent hardware multicast in the internal switch fabric of routers and switches. This dissertation proposes a cost-effective wide-sense nonblocking four stage Clos-type multicast network and the multicast routing algorithm. Compared with previous wide-sense non-blocking multicast networks, the proposed multicast network has following advantages: 1) The hardware 3 2 cost, in terms of the number of crosspoints, is about 12N, which is only a small constant factor higher than that of wide-sense non-blocking permutation networks; 2) The time complexity of multicast routing algorithm is O(N), which is conceptually simple and easily implemented by hardware; 3) It reduces the number of ports in each switch module, which is better for VLSI implement; 4) The additional stage of switches can effectively balance multicast loads, therefore improves the flexibility of the network. Finally, this dissertation extensively studies wide-sense nonblocking k-fold multicast networks, in which any destination node can be involved in up to k simultaneous multicast connections in a nonblocking manner. These networks can effectively reduce external blocking of multicast assignment, so that they can provide better QoS (Quality of Service) function for arbitrary multicast communication. In addition, the networks have a much lower hardware cost than that of simply putting k copies of 1-fold network together. By recalculating the number of switches of the intermediate stages in a wide-sense non-blocking four-stage Clos network, this dissertation provides sufficient internal paths between inputs and outputs to realize k-fold multicast routing. Inspired by this idea, two wide-sense nonblocking k-fold multicast routing algorithms and corresponding hardware conditions have been proposed. Although the whole research work is conducted to meet the definite requirement of applications in parallel and distributed systems, the proposed wide-sense nonblocking multicast networks are not limited to multicomputers. They can be also applied in routers and switches for wide area networks and local area networks. Key Words: Interconnection Network, Collective Communication, Torus, All-to-all Personalized Communication, Complete Exchange, Cluster, Wide-sense Nonblocking, Multicast, Multistage Interconnection Network (MIN), Parallel Communication Model IV

I ABSTRACT.....III... V 1... 1 1.1..1 1.2...2 1.3..4 1.3.1 4 1.3.2 5 1.3.3 8 1.3.4 9 1.4..9 1.5 10 1.5.1..11 1.5.2..11 1.5.3..12 1.5.4 12 1.6.14 1.7 16 1.8 18 2 20 2.1 20 2.2 21 2.3 22 2.3.1..22 2.3.2..23 2.3.3..24 2.3.4..24 2.3.5..25 2.4 25 2.4.1 Hockney..25 2.4.2 LogP..25 2.4.3 LogGP...27 2.4.4 LogGP...27 2.4.5..27 2.5 28 3 29 3.1 29 V

3.1.1..29 3.1.2..30 3.2 31 3.3 32 3.3.1..32 3.3.2..34 3.3.3..36 3.3.4..37 3.3.5 k MTk 40 3.3.6..40 3.4 42 3.4.1..42 3.4.2..47 3.4.3..54 3.5.56 4 57 4.1 57 4.2 57 4.2.1..58 4.2.2..59 4.3 60 4.3.1..60 4.3.2..61 4.3.3....63 4.3.4..66 4.4 66 4.5 68 4.6 70...70 5 75 5.1 75 5.2 77 5.2.1 (Ring) 78 5.2.2 (Pairwise exchange).79 5.3 79 5.3.1..79 5.3.2 DCE.80 5.3.3 MCCE..82 5.3.4..85 5.4 90 6 86 6.1 86 6.2 88 VI

6.2.1 Clos.89 6.2.2 Clos...89 6.2.3..91 6.3 Clos 92 6.3.1 Clos..92 6.3.2 PMPM..95 6.3.3 PMPM.96 6.3.4.. 98 6.4 k-fold.99 6.4.1 k-pmpm k-fold. 102 6.4.2 k-mmmm k-fold..104 6.4.3 108 6.5..109 7.110 7.1..110 7.2..111 7.2..112 A..114 118....126 127 128 VII

VLSI SAN System-Area NetworkATM Asynchronous Transfer Mode IP / 80 18 [1] teraflopspetaflops 2005 11 500 [2] IBM Blue Gene/L [3] 12 280.6Tera (10 ) 65,536 1

collective communication VLSI Distributed MemoryShared Memory (a) (b) 1.1 1.2 1.1(a) Intel Paragon MIT J-Machine ncube3 message passing 2

1.1(b) Uniform Memory Access UMA Symmetric Multiprocessors SMP Cray C-90 Cray T-90 NEC SX-4 SMP IBM R50 SGI Power Challenge DEC Alpha 8400 DSM Distributedshared Memory 1.2 DSM NUMA Non-Uniform Memory Access Stanford Dash Flash Cray T3D SGI Origin Cluster SMP PC Beowulf Myrinet SeverNet Infiniband SMP IBM ASCI Blue 3

Pacific 1464 4 [4] shared-medium network direct network indirect network switch-based network hybrid network 1.3 contention bus token bustoken ring backplane bus 1.3 Cache 4

mesh torus hypercube k n- tree cube-connected cycle star graph 1.4(a) 1.4(b) 6 6 6 6 1.4(c) 4 a 6 6 b 6 6 b 4 1.4 router 1.5 1.5 5

1.6 injection channel consumption channel ejection channel delivery channel (LC) FIFO First In First Out 1.6 1.6 6

(routing delay) (flow control delay) [5] [1] Node Degree in degreeout degree I/O Network Diameter Bisection Width Symmetry VLSI regularity Scalability N N 1 7