Research on Efficient Collective Communication Algorithms of Interconnection Networks for Multicomputers

A Dissertation for the Doctoral Degree of the University of Science and Technology of China

by Liu Gang (Computer System Architecture)

Dissertation Supervisor: Professor Gu Naijie

May 2006
Abstract

The development of modern science and engineering has created a growing demand for powerful computing capacity, and the design of parallel systems that reach teraflops and even petaflops speeds requires high-performance interconnection networks to connect large numbers of processors. Meanwhile, as system size grows, interprocessor communication overhead increasingly becomes a bottleneck. In large-scale scientific computing and engineering applications, collective communication usually accounts for the vast majority of the total communication overhead. Research on interconnection networks and collective communication is therefore crucial for improving the performance of multicomputers and the execution efficiency of parallel applications. The work presented in this dissertation concentrates on improving the performance of collective communication operations on interconnection networks.

First, this dissertation studies algorithms for complete exchange on the single-port torus. The torus has become increasingly popular for interconnecting processors in parallel and distributed systems because of its topological features, and has been adopted by many modern supercomputers. Complete exchange, also known as all-to-all personalized communication, occurs in numerous important applications in parallel computing. This dissertation proposes a novel network partitioning scheme for the multidimensional single-port torus and, based on it, a near-optimal (in terms of transmission time) indirect complete-exchange algorithm. Compared with existing algorithms, the proposed algorithm not only scales better but also markedly improves communication performance.
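Since complete exchange recurs throughout the dissertation, it helps to fix its semantics: among P processes, process p starts with a distinct block for every destination q, and afterwards process q holds block (p, q) from every source p, i.e., a transpose of the block matrix across processes. The sketch below simulates a generic direct schedule; the function and variable names are invented for illustration, and this is not the dissertation's partitioning-based algorithm.

```python
# Minimal simulation of a direct complete-exchange (all-to-all personalized
# communication) schedule among P processes. In step i, rank p sends the
# block destined for rank (p + i) mod P and receives from rank (p - i) mod P,
# so every process is active in every step and no two messages target the
# same destination simultaneously. Illustrative only; not the dissertation's
# network-partitioning algorithm.

def complete_exchange(send_bufs):
    """send_bufs[p][q] is the block process p holds for process q.
    Returns recv_bufs with recv_bufs[q][p] == send_bufs[p][q]."""
    P = len(send_bufs)
    recv_bufs = [[None] * P for _ in range(P)]
    for step in range(P):  # P steps, step 0 being the local copy
        for p in range(P):
            dst = (p + step) % P
            recv_bufs[dst][p] = send_bufs[p][dst]
    return recv_bufs

bufs = [[f"blk{p}->{q}" for q in range(4)] for p in range(4)]
result = complete_exchange(bufs)
```

The result is exactly the blockwise transpose of the initial distribution, which is why complete exchange is the communication kernel of parallel matrix transposition and multidimensional FFTs.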
Second, this dissertation improves the complete-exchange algorithm with minimum startup cost on single-port 2D and 3D tori by adopting a bottom-up-send-back approach to scheduling parallel communication. Compared with the original algorithm, the improved algorithm retains the minimum startup cost while achieving better communication performance. Moreover, little prior work addresses complete exchange on the all-port torus. This dissertation is the first to propose indirect complete-exchange algorithms for the all-port 1D ring and 2D and 4D tori that fully achieve the theoretical lower bounds on message transmission. The new algorithms utilize all ports and communication links of the all-port torus and transmit messages along shortest paths. Analysis shows that, when messages are sufficiently long, the proposed all-port algorithms deliver the best communication performance among all existing algorithms.

The dissertation also proposes the direct algorithm DCE and the indirect algorithm MCCE for complete exchange on clusters connected by hierarchical switched Ethernet. Both algorithms fully utilize the bandwidth of the bottleneck links and achieve the theoretical lower bounds on message transmission. In addition, MCCE significantly reduces message startup and synchronization costs, further improving the performance of complete exchange. Experiments show that, for long messages on Ethernet-switched clusters with hierarchical network topologies, the two proposed algorithms significantly outperform the complete-exchange algorithms included in MPICH and LAM/MPI.

To minimize multicast latency and eliminate the memory bottleneck induced by traditional software multicast, this dissertation considers multistage interconnection networks that support concurrent hardware multicast within the internal switch fabric of routers and switches. It proposes a cost-effective wide-sense nonblocking four-stage Clos-type multicast network together with a multicast routing algorithm. Compared with previous wide-sense nonblocking multicast networks, the proposed network has the following advantages: 1) its hardware cost, measured in crosspoints, is about 12N^(3/2), only a small constant factor higher than that of wide-sense nonblocking permutation networks; 2) the multicast routing algorithm runs in O(N) time, is conceptually simple, and is easily implemented in hardware; 3) it reduces the number of ports per switch module, which favors VLSI implementation; 4) the additional stage of switches effectively balances multicast load, improving the flexibility of the network.

Finally, this dissertation studies wide-sense nonblocking k-fold multicast networks, in which any destination node can participate in up to k simultaneous multicast connections without blocking. Such networks effectively reduce the external blocking of multicast assignments and can therefore provide better QoS (Quality of Service) for arbitrary multicast communication; moreover, their hardware cost is much lower than that of simply replicating k copies of a 1-fold network. By recalculating the number of middle-stage switches in a wide-sense nonblocking four-stage Clos network, this dissertation provides sufficient internal paths between inputs and outputs to realize k-fold multicast routing.
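The N^(3/2) scaling of crosspoint cost can be sanity-checked on the classic three-stage point-to-point Clos network with N = n*r ports, whose cost is m*r*(2n + r) crosspoints with m = 2n - 1 middle switches (Clos's strictly nonblocking condition). Optimizing n gives roughly 5.66*N^(3/2), versus N^2 for a single crossbar. This is the standard textbook bound, shown only to illustrate the scaling; the dissertation's four-stage multicast network and its 12N^(3/2) constant are not reproduced here, and the function name is invented.

```python
def clos_crosspoints(N):
    """Minimum crosspoint count of a strictly nonblocking three-stage
    Clos network with N = n*r ports: r input switches of size n-by-m,
    m middle switches of size r-by-r, and r output switches of size
    m-by-n give m*r*(2n + r) crosspoints, with m = 2n - 1.
    We search over all divisors n of N for the cheapest configuration."""
    best = None
    for n in range(1, N + 1):
        if N % n:
            continue
        r = N // n
        m = 2 * n - 1
        cost = m * r * (2 * n + r)
        best = cost if best is None else min(best, cost)
    return best

# The optimized cost grows roughly as 5.66 * N**1.5, far below the
# N**2 crosspoints of a single crossbar for large N.
for N in (64, 256, 1024):
    print(N, clos_crosspoints(N))
```

The same divide-into-stages idea underlies the four-stage multicast construction: multicast capability raises the constant factor, but the N^(3/2)-type growth is what keeps the design far cheaper than an N-by-N crossbar.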
Inspired by this idea, two wide-sense nonblocking k-fold multicast routing algorithms, together with the corresponding hardware conditions, are proposed. Although this research is motivated by the concrete requirements of applications in parallel and distributed systems, the proposed wide-sense nonblocking multicast networks are not limited to multicomputers; they can also be applied in routers and switches for wide-area and local-area networks.

Key Words: Interconnection Network, Collective Communication, Torus, All-to-all Personalized Communication, Complete Exchange, Cluster, Wide-sense Nonblocking, Multicast, Multistage Interconnection Network (MIN), Parallel Communication Model
VLSI; SAN (System-Area Network); ATM (Asynchronous Transfer Mode); IP

... [1] ... teraflops ... petaflops ... In the November 2005 Top 500 list [2], the IBM Blue Gene/L [3] achieved 280.6 teraflops (10^12 floating-point operations per second) with 65,536 processors.
... collective communication ... VLSI ... Parallel computers are broadly divided into distributed-memory and shared-memory architectures (Figure 1.1(a) and (b); Figure 1.2). Distributed-memory multicomputers (Figure 1.1(a)), such as the Intel Paragon, MIT J-Machine, and nCUBE-3, communicate by message passing.
Shared-memory multiprocessors (Figure 1.1(b)) with uniform memory access (UMA) are known as symmetric multiprocessors (SMPs); examples include the Cray C-90, Cray T-90, NEC SX-4, IBM R50, SGI Power Challenge, and DEC Alpha 8400. Distributed shared memory (DSM) systems (Figure 1.2) provide non-uniform memory access (NUMA); examples include the Stanford Dash and Flash, the Cray T3D, and the SGI Origin. Clusters are built from SMP or PC nodes: Beowulf-style clusters are interconnected by networks such as Myrinet, ServerNet, or InfiniBand, while SMP clusters include the IBM ASCI Blue Pacific with 1464 four-processor SMP nodes [4].

Interconnection networks can be classified as shared-medium networks, direct networks, indirect (switch-based) networks, and hybrid networks (Figure 1.3). In a shared-medium network, all nodes contend for a single shared transmission medium; examples include the token bus, token ring, and backplane bus ... cache ...
Direct networks connect nodes point-to-point through topologies such as the mesh, torus, hypercube (more generally, the k-ary n-cube), tree, cube-connected cycles, and star graph. Figure 1.4 shows examples: (a) a 6x6 mesh, (b) a 6x6 torus, and (c) a 4-dimensional hypercube. Each node contains a router (Figure 1.5) that forwards messages between neighboring nodes and the local processor.
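The wraparound links are what distinguish the torus in Figure 1.4(b) from the mesh in Figure 1.4(a): each dimension contributes at most floor(k/2) hops to a shortest path, so a k-ary n-dimensional torus has diameter n*floor(k/2). A small sketch of this routing-distance calculation (function names are invented for illustration):

```python
# Shortest-path distance in a k-ary n-dimensional torus. Because every
# dimension wraps around, a message can travel in either direction, and
# the cheaper of the two choices is taken per dimension.

def torus_distance(a, b, k):
    """Hop count between nodes a and b (coordinate tuples) in a k-ary
    torus: each dimension contributes min(d, k - d) hops."""
    return sum(min((x - y) % k, (y - x) % k) for x, y in zip(a, b))

def torus_diameter(n, k):
    """Diameter of the k-ary n-dimensional torus: n * floor(k/2)."""
    return n * (k // 2)

# In the 6x6 torus of Figure 1.4(b), opposite corners (0,0) and (5,5)
# are only 2 hops apart via the wraparound links, and the diameter is
# 2 * 3 = 6; the 6x6 mesh without wraparound has diameter 10.
```

This per-dimension independence is also what makes shortest-path scheduling of complete exchange on the torus tractable: each dimension can be routed as a ring.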
A typical router (Figure 1.6) connects to the local processor through injection channels and consumption channels (the latter also called ejection or delivery channels), and to neighboring routers through physical links managed by link controllers (LC); messages are buffered in FIFO (First-In-First-Out) queues.
Two components of a router's per-hop latency are the routing delay and the flow control delay [5]. The main topological properties of an interconnection network include [1]: node degree (the number of links per node, counting in-degree and out-degree, which determines a node's I/O cost); network diameter (the maximum shortest-path distance between any two nodes); bisection width (the minimum number of links that must be cut to partition the network into two equal halves); symmetry and regularity (which simplify routing and VLSI layout); and scalability ...
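These metrics have standard closed forms for the common topologies; the sketch below computes them for the hypercube and the 2-D torus (textbook formulas, included only to make the definitions concrete; function names are invented for this example).

```python
# Closed-form topological metrics for two common direct networks.
# These are standard textbook results, not specific to the dissertation.

def hypercube_metrics(n):
    """n-dimensional hypercube with N = 2**n nodes: every node has
    degree n, the diameter is n hops, and the bisection width is N/2
    (cutting all links along one dimension halves the network)."""
    N = 2 ** n
    return {"nodes": N, "degree": n, "diameter": n, "bisection": N // 2}

def torus_2d_metrics(k):
    """k x k 2-D torus (k even): constant degree 4, diameter
    2 * (k // 2) thanks to wraparound links, and bisection width 2*k
    (a bisecting cut crosses k links in each of two rings of channels)."""
    return {"nodes": k * k, "degree": 4,
            "diameter": 2 * (k // 2), "bisection": 2 * k}

# The hypercube trades a node degree that grows with the dimension for a
# diameter logarithmic in N; the torus keeps the degree constant, which
# eases VLSI implementation, at the cost of a larger diameter.
```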