1 An Adaptive Collective Communication Suppressing Contention
Taura Lab. M2 Shota Yoshitomi

2 Outline
Introduction
– Problems in collective communication
– Contribution
Problem settings
Our approach
Conclusion

3 Background
Grid computing has become widely used.
– More opportunities to perform parallel computation in grid environments, e.g. InTrigger (Japan) and Grid5000 (France)
– Large-scale parallel computation
– Data-intensive applications
Message passing (e.g. MPI)
– Point-to-point communication
– Collective communication, possibly over a WAN

4 Problems in collective communication
Heterogeneous networks
– LAN / WAN
– Differences in latency and bandwidth
Contention
– Network congestion
Connectivity
– Scalability
– NAT, firewalls
[Figure: nodes connected through a switch (SW) across LAN and WAN, with contention at the switch]

5 Contribution
Designing an efficient collective operation algorithm
– Suppressing network contention
– Adaptive and scalable in large networks
Focusing on Many-to-One and Many-to-Many operations
– Many-to-One (Gather)
– Many-to-Many (All-to-all)
Implementation and evaluation
– Our algorithm achieved better performance than existing MPI libraries.

6 Outline
Introduction
Problem settings
– Effect of network contention
– Related work
Our approach
Conclusion

7 Gather operation behavior
Gather operation
– The root node receives different data from each of the other nodes.
[Figure: before/after — N0 (root) collects D1–D3 from N1–N3 through a switch (SW)]
Contention
– Messages from N1–Nk flow into N0's link at the same time.
– N0 can only receive part of them, reaching N0's receive-capacity limit.
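The gather semantics above can be sketched in a few lines of plain Python (illustrative only, not real MPI — the function name and data labels are my own):

```python
# Gather: the root ends up holding a distinct message D_i from every
# node N_i, while each non-root node still holds only its own data.
def gather(messages):
    """messages[i] is node i's local data D_i; return what the root holds."""
    return {i: data for i, data in enumerate(messages)}

collected = gather(["D0", "D1", "D2", "D3"])
# collected maps node index -> its data; node 0 (the root) now has all of it.
```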

8 Effect of network contention
[Figure: completion time (msec) vs. message size (KB) for "Theoretical" and "Concurrent"; the Concurrent curve leaps by about 200 msec around 3 KB]
– The completion time of "Concurrent" is up to 400 times the "Theoretical" value.
– There is a leap in the completion time of "Concurrent" around 3 KB.
Experimental settings: 14 nodes on one LAN; SW: PowerConnect 5324; network: 500 Mbps.

9 Findings
The leap is caused by TCP behavior:
– A packet is lost at a switch.
– The receiver waits for retransmission of the lost packet.
  RTO (retransmission timeout): 200 msec or more (Linux kernel 2.6.18)
– The sender retransmits the packet.
Requirements
– Prevent packet losses at every switch.
– Control, at every switch, the number of nodes communicating with a common destination at a time.

10 Related work (MPI implementations)
OpenMPI
– Flat tree
MPICH
– Binomial tree
MagPIe [Kielmann et al. 1999]
– Binomial tree (LAN)
– Flat tree (WAN)
Network contention may degrade the performance of the gather operation in these MPI implementations.

11 Outline
Introduction
Problem settings
Our approach
– Basic idea
  Pipeline transfer
  Synchronized transfer
– Evaluation
Conclusion

12 Necessary conditions
Prerequisite: messages do NOT flow into a link
– at the same time
– from two or more different sources
Assumptions
– All nodes can send and receive different messages concurrently.
– No node communicates with other nodes in two or more gathers at the same time.
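The no-contention condition above can be expressed as a simple check over a schedule of transfers (a toy model of my own, assuming discrete time slots and per-link, per-direction bookkeeping):

```python
# A link is contended if, in the same time slot and the same direction,
# it carries messages from two or more different sources.
def violates_condition(transfers):
    """transfers: list of (time_slot, link, direction, source) tuples."""
    seen = {}
    for t, link, direction, src in transfers:
        key = (t, link, direction)
        if key in seen and seen[key] != src:
            return True      # two sources share the link at the same time
        seen[key] = src
    return False

ok = violates_condition([(0, "L1", "up", "N1"), (1, "L1", "up", "N2")])
bad = violates_condition([(0, "L1", "up", "N1"), (0, "L1", "up", "N2")])
# ok is False (the senders take turns); bad is True (they overlap).
```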

13 Basic idea
Immediate goal
– Suppress contention at every switch and router.
Our algorithm combines two approaches:
– Sequential send with synchronization
– Pipeline transfer
Communication graph configuration
– Combine pipeline transfer and synchronized transfer to improve the performance of the gather operation.

14 Sequential send with synchronization
1. N1 sends its message to N0.
2. N0 sends a 1-byte packet to N2.
3. When N2 has received the 1-byte packet from N0, N2 starts sending its message to N0.
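The three steps above can be sketched as a trace of who sends what, in order (a minimal model of my own; the first sender goes immediately, and each later sender waits for the root's 1-byte signal):

```python
# Sequential send with synchronization: exactly one data message
# occupies the root's link at any time.
def sequential_gather(senders):
    """senders: node ids in the order the root lets them transmit."""
    trace = [f"N{senders[0]} -> root: full message"]
    for n in senders[1:]:
        trace.append(f"root -> N{n}: 1-byte go packet")
        trace.append(f"N{n} -> root: full message")
    return trace

trace = sequential_gather([1, 2, 3])
# N1 sends first; N2 and N3 each wait for the root's go packet.
```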

15 The weakness of "sequential send with synchronization"
Sequential send does not always achieve the most efficient communication.
[Figure: a chain of nodes behind one slow link (cost 1000) and fast links (cost 1); total cost 7000]
– NOT scalable
– High synchronization cost

16 Pipeline transfer
– N1 sends its message to N0.
– N2 sends its message to N1.
– When N1 has completely received the message from N2, N1 immediately forwards it to N0.
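The relay behavior above can be simulated step by step (a toy store-and-forward model of my own, one message hop per step and no synchronization packets):

```python
# Pipeline transfer along a chain N0 <- N1 <- N2 <- ...: each node
# forwards a buffered message one hop toward the root per step.
def pipeline_gather(chain):
    """chain[0] is the root; returns messages in root-arrival order."""
    buffers = {node: [f"data({node})"] for node in chain[1:]}
    buffers[chain[0]] = []
    while any(buffers[n] for n in chain[1:]):
        # Visit nodes nearest the root first, so a message that just
        # arrived at a node moves at most one hop in this pass.
        for i in range(1, len(chain)):
            if buffers[chain[i]]:
                buffers[chain[i - 1]].append(buffers[chain[i]].pop(0))
    return buffers[chain[0]]

arrivals = pipeline_gather(["N0", "N1", "N2", "N3"])
# Nearer nodes' data reaches the root first: N1, then N2, then N3.
```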

17 Features of "pipeline transfer"
– No synchronization
– A low-bandwidth link in the middle of the pipeline often becomes the bottleneck.
[Figure: the same chain; the slow link (cost 1000) is the bottleneck; total cost 1003]
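The two cost figures on the slides (7000 for sequential, 1003 for pipeline) are consistent with a simple model; the exact node counts behind the diagrams are my reading of them (7 senders each crossing the slow link sequentially, versus a 4-hop pipeline):

```python
# Toy cost model: one slow link (cost 1000) on the path to the root,
# fast hops of cost 1 elsewhere.
def sequential_cost(slow, k):
    """k senders take turns; each one pays the slow link in full."""
    return k * slow

def pipeline_cost(slow, fast, hops):
    """Transfers overlap: the slow link is crossed once, plus one fast
    hop of fill/drain overhead per remaining pipeline stage."""
    return slow + (hops - 1) * fast

seq = sequential_cost(1000, 7)      # 7000, matching the slide's figure
pipe = pipeline_cost(1000, 1, 4)    # 1003, matching the slide's figure
```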

18 Graph configuration
First, configure a pipeline following the layer-2 network topology.
This meets the conditions for avoiding contention:
– Messages do not flow into any link, in the same direction, from more than one source.
※ Getting network information:
– Topology inference [Shirai et al. 2006]
– Bandwidth estimation [Naganuma et al. 2008]
[Figure: a pipelined transfer toward the root through a switch (SW)]

19 Improving the performance
Reconfigure the communication graph.
A node can send its message to its pipeline predecessor (pipeline transfer) or to one of several other nodes (synchronized transfer).
E.g.
1. Calculate the arrival time at which the node's message reaches the root node via each candidate: sending to one candidate takes X seconds, to another takes Y seconds, and so on.
2. Select the route by which the node's message arrives at the root node soonest.
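The selection step above amounts to a per-node minimization over candidate next hops. A minimal sketch, assuming a toy `arrival_time` estimator (the real algorithm would use the measured topology and bandwidths; all names here are illustrative):

```python
# For each node, pick the candidate next hop through which the node's
# message is estimated to reach the root soonest.
def choose_parent(node, candidates, arrival_time):
    """arrival_time(node, parent) -> estimated seconds until node's
    message arrives at the root when sent through `parent`."""
    return min(candidates, key=lambda p: arrival_time(node, p))

# Hypothetical estimates: via the root directly takes 5 s, relaying
# through N3 takes 3 s, so N3 is chosen.
est = {("N5", "root"): 5.0, ("N5", "N3"): 3.0}
best = choose_parent("N5", ["root", "N3"], lambda n, p: est[(n, p)])
```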

20 Experimentation
Compared algorithms:
– OpenMPI (Concurrent): flat tree; all nodes send their messages concurrently.
– MPICH / MagPIe: binomial tree; flat tree (MagPIe over WAN).
– Sequential: our algorithm using only synchronized transfer.
– OURS: pipeline transfer and synchronized transfer combined.

21 Experiment results (1)
55 nodes each send a 10 KB – 1 MB message to the root node.
Our algorithm performs better than the other algorithms in almost all cases.
Experimental settings: two switched LANs (47 nodes and 9 nodes, the root in the latter); SW: FastIron GS; network: 1 Gbps.

22 Experiment results (2)
Settings
– Each node sends a 20 KB message to the root node.
– Scaling from 1 cluster / 50 nodes up to 9 clusters / 190 nodes.
The results show that our algorithm avoids contention and prevents the communication performance from being degraded (no 200 msec contention penalty).

23 Conclusion
– Proposed an algorithm that avoids contention in large networks.
– The algorithm achieved better performance than existing MPI libraries.
Future work
– Designing a more adaptive and precise communication-graph configuration algorithm
– Considering wide-area bandwidth
– Designing a contention-free algorithm for Many-to-Many operations

24 Publications
Shota Yoshitomi, Ken Hironaka, Kenjiro Taura. An Adaptive Gather Operation Algorithm that Prevents Message Collisions. Symposium on Advanced Computing Systems and Infrastructures (SACSIS2009), May 2009 (to be presented).
Shota Yoshitomi, Hideo Saito, Kenjiro Taura, Takashi Chikayama. An MPI Collective Communication Algorithm Based on Automatically Obtained Network Configuration Information. Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP2008), Aug 2008.
Shota Yoshitomi, Hideo Saito, Kenjiro Taura, Takashi Chikayama. Improvement of an MPI Collective Communication Algorithm Based on Automatically Obtained Network Configuration Information. IPSJ National Convention 2008, Mar 2008.



27 Reconfiguration
[Figure: before/after communication graphs through a switch (SW), with synchronized transfers toward the root]
Reconfiguration order: nodes are arranged in descending order of their bandwidth to the root node.

