Pipelined Broadcast on Ethernet Switched Clusters Pitch Patarasuk, Ahmad Faraj, Xin Yuan Department of Computer Science Florida State University Tallahassee,

Presentation transcript:

Pipelined Broadcast on Ethernet Switched Clusters
Pitch Patarasuk, Ahmad Faraj, Xin Yuan
Department of Computer Science, Florida State University, Tallahassee, FL 32306

Broadcast communication (MPI_Bcast)
Before: one node holds the message ABCD; the other nodes among n0, n1, n2, n3 hold nothing.
After: n0, n1, n2, and n3 each hold ABCD.
Let T(msize) = time to send a message of size msize. Then Broadcast(msize) >= T(msize).

Ethernet Switched Cluster
[Figure: cluster topology; surviving label: switch]

Problem statement
How can the broadcast operation be realized efficiently for large messages on Ethernet switched clusters?
Pipelined broadcast can achieve near-optimal results (about T(msize) time to broadcast a message of size msize).
This requires two things:
Finding a contention-free broadcast tree
Finding a good segment size

Traditional broadcast algorithms
Linear tree, flat tree
Time = (P-1) x T(msize)

Binary tree, k-ary tree
Time (binary tree) = 2 x (log2(P+1) - 1) x T(msize)

Binomial tree
Time = log2(P) x T(msize)

Scatter/Allgather
Before: one node holds the message ABCD.
After scatter: n0, n1, n2, n3 each hold one piece (A, B, C, D).
After allgather: n0, n1, n2, n3 each hold ABCD.
Time = 2 x T(msize)

Time complexity for large messages
Algorithm            Time
Linear tree          (P-1) x T(msize)
Flat tree            (P-1) x T(msize)
Binary tree          2 x (log2(P+1) - 1) x T(msize), approx. 2 x log2(P) x T(msize)
Binomial tree        log2(P) x T(msize)
Scatter/allgather    2 x T(msize)
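To make the comparison concrete, the formulas in this table can be evaluated for a given number of processes P. The following is a minimal sketch; the function name and the choice to report times as multiples of T(msize) are illustrative assumptions, not part of the slides.

```python
import math


def traditional_broadcast_times(P):
    """Estimated broadcast time for large messages, in units of T(msize).

    The formulas are taken from the time-complexity table; the function
    itself is only an illustration (hypothetical name).
    """
    return {
        "linear tree": P - 1,
        "flat tree": P - 1,
        "binary tree": 2 * (math.log2(P + 1) - 1),
        "binomial tree": math.log2(P),
        "scatter/allgather": 2,
    }


# Example: print the estimates for a 7-process cluster.
for name, factor in traditional_broadcast_times(7).items():
    print(f"{name}: {factor:.2f} x T(msize)")
```

Only scatter/allgather stays constant as P grows; every tree-based scheme grows with P or log2(P).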

Pipelined Broadcast Algorithm
Linear pipeline
[Figure: linear pipeline over nodes 0, 1, 2, 3]

Performance of pipelined broadcast
Assume there is no network contention and a message of size msize is broken into X segments of size msize/X.
H: tree height; D: number of children per node.
Time of one pipeline stage: D x T(msize/X)
Total time: T = (X + H - 1) x (D x T(msize/X))
When X is much larger than H and T() grows roughly linearly with message size, T is approximately D x T(msize):
Linear tree: H = P, D = 1, T is about T(msize)
Binary tree: H = log2(P), D = 2, T is about 2 x T(msize)
k-ary tree: H = log_k(P), D = k, T is about k x T(msize); in general not as efficient as the binary tree.
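The total-time formula on this slide can be checked numerically. The sketch below implements (X + H - 1) x (D x T(msize/X)) directly; the parameter names and the bandwidth-only cost function T(s) = s used in the example are assumptions made for illustration.

```python
def pipelined_time(msize, segments, height, children, send_time):
    """Total pipelined broadcast time: (X + H - 1) * (D * T(msize / X)).

    segments  = X, number of segments the message is split into
    height    = H, height of the broadcast tree
    children  = D, number of children per node
    send_time = T, a function mapping a message size to a send time
                (its shape is an assumption supplied by the caller)
    """
    stage = children * send_time(msize / segments)
    return (segments + height - 1) * stage


def bandwidth_only(size):
    """Assumed cost model for the example: time proportional to size."""
    return size


# With many segments, the linear pipeline approaches T(msize) and the
# binary pipeline approaches 2 x T(msize).
linear = pipelined_time(1000, 100, height=8, children=1, send_time=bandwidth_only)
binary = pipelined_time(1000, 100, height=3, children=2, send_time=bandwidth_only)
print(linear, binary)
```

Under this cost model T(1000) = 1000, so both results sit slightly above the limits of T(msize) and 2 x T(msize); the gap is the pipeline fill time (H - 1 extra stages).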

Time complexity for large messages
Algorithm            Time
Pipelined (linear)   T(msize)
Pipelined (binary)   2 x T(msize)
k-ary pipeline       k x T(msize)
Binomial tree        log2(P) x T(msize)
Scatter/allgather    2 x T(msize)

Pipelined broadcast
How to find a contention-free broadcast tree?
How to select the best segment size?
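For the segment-size question, one simple way to explore the trade-off is to search over the number of segments X using the total-time formula (X + H - 1) x (D x T(msize/X)). The sketch below is not the authors' selection method; it assumes a hypothetical latency-plus-bandwidth cost model T(s) = alpha + beta x s, where alpha and beta are made-up parameters.

```python
def pipelined_time(msize, segments, height, children, send_time):
    """Total time (X + H - 1) * (D * T(msize / X)) from the performance model."""
    return (segments + height - 1) * children * send_time(msize / segments)


def latency_bandwidth(alpha, beta):
    """Assumed cost model: fixed startup cost alpha plus beta per unit of data."""
    return lambda size: alpha + beta * size


def best_segment_count(msize, height, children, send_time, max_segments):
    """Brute-force search for the X in 1..max_segments with the lowest modeled time."""
    candidates = range(1, max_segments + 1)
    return min(
        candidates,
        key=lambda x: pipelined_time(msize, x, height, children, send_time),
    )


# Example with made-up parameters: linear pipeline of height 5.
send = latency_bandwidth(alpha=1.0, beta=1.0)
x = best_segment_count(2500, height=5, children=1, send_time=send, max_segments=2500)
print("segments:", x, "segment size:", 2500 / x)
```

Too few segments leave the pipeline underused, while too many pay the per-segment startup cost alpha repeatedly; the search finds the balance point for whatever cost model is plugged in.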

Example of network contention
Binary tree
Topology (figure): n0, n1, n2, n3 on one switch; n4, n5, n6, n7 on another.
There is link contention caused by the communications (1 -> 4), (2 -> 5), (2 -> 6), and (3 -> 7).

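Linear tree
Topology (figure): n0, n1, n4, n5 on one switch; n2, n3, n6, n7 on another.
The linear tree 0 -> 1 -> 2 -> 3 -> ... -> 7 has contention caused by (1 -> 2) and (5 -> 6), which cross between the two groups of machines in the same direction.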
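Both contention examples can be reproduced with a small check that counts how many tree edges cross the same inter-switch link in the same direction. This is a sketch under stated assumptions: the two groups of machines are taken to sit on two directly connected switches (named "s0" and "s1" here), the binary tree is assumed to use the usual heap layout (children of i are 2i+1 and 2i+2), and links are treated as full duplex so opposite directions do not conflict.

```python
from collections import defaultdict


def shared_switch_links(tree_edges, switch_of):
    """Return directed inter-switch links used by more than one tree edge.

    tree_edges: list of (sender, receiver) machine pairs
    switch_of:  dict mapping each machine to its switch (hypothetical names)
    Assumes any two distinct switches are directly connected, which holds
    for the two-switch examples only.
    """
    usage = defaultdict(list)
    for src, dst in tree_edges:
        a, b = switch_of[src], switch_of[dst]
        if a != b:
            usage[(a, b)].append((src, dst))
    return {link: edges for link, edges in usage.items() if len(edges) > 1}


# Binary tree example: n0..n3 on one switch, n4..n7 on the other.
binary_edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6), (3, 7)]
binary_switch = {n: "s0" if n < 4 else "s1" for n in range(8)}
print(shared_switch_links(binary_edges, binary_switch))

# Linear tree example: n0, n1, n4, n5 on one switch; n2, n3, n6, n7 on the other.
linear_edges = [(i, i + 1) for i in range(7)]
linear_switch = {n: "s0" if n in (0, 1, 4, 5) else "s1" for n in range(8)}
print(shared_switch_links(linear_edges, linear_switch))
```

In the linear case the edge (3 -> 4) also crosses between the switches, but in the opposite direction, so it is not reported as contending with (1 -> 2) and (5 -> 6).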

Algorithm for constructing contention free linear tree
Step 1: Traverse all switches using depth-first search (DFS), and number the switches S0, S1, S2, ... in the order in which the DFS first visits them.
Step 2: The linear tree consists of all machines in switch S0, followed by all machines in S1, then S2, and so on.

Example of contention free linear tree
Switch S0: n0, n1, n4, n5
Switch S1: n2, n3, n6, n7
Switch S2: n8, n9, n10, n11
Switch S3: n12, n13, n14, n15
Linear tree: n0 -> n1 -> n4 -> n5 -> n2 -> n3 -> n6 -> n7 -> n8 -> n9 -> ... -> n15
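The two-step construction (DFS over the switches, then machines listed switch by switch) can be sketched as follows. The machine-to-switch assignment is taken from this example, but the links between switches are not recoverable from the transcript, so the topology below (switch "a" linked to "b" and "c", and "c" linked to "d") is an assumption, as are all function and variable names.

```python
def contention_free_linear_order(switch_links, machines, root_switch):
    """Order machines for a linear broadcast tree.

    Step 1: visit switches depth-first starting at root_switch.
    Step 2: list all machines of the first visited switch, then the
            second, and so on.
    switch_links: dict switch -> list of neighbouring switches
    machines:     dict switch -> list of machines attached to it
    """
    visit_order = []
    seen = set()

    def dfs(switch):
        seen.add(switch)
        visit_order.append(switch)
        for neighbour in switch_links.get(switch, []):
            if neighbour not in seen:
                dfs(neighbour)

    dfs(root_switch)

    order = []
    for switch in visit_order:
        order.extend(machines.get(switch, []))
    return order


# Assumed switch topology; physical switch names are hypothetical.
switch_links = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
machines = {
    "a": [0, 1, 4, 5],
    "b": [2, 3, 6, 7],
    "c": [8, 9, 10, 11],
    "d": [12, 13, 14, 15],
}
print(contention_free_linear_order(switch_links, machines, "a"))
```

With this assumed topology the DFS visits a, b, c, d, which play the roles of S0, S1, S2, S3, and the resulting order matches the linear tree listed in the example.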

Algorithm for constructing contention free binary tree
Start with a contention-free linear tree.
Recursively divide the tree into 2 sub-trees.
Make sure that there cannot be contention.
The sub-trees are chosen such that the height of the whole tree is minimal.

Binary tree height
The performance of binary pipelined broadcast depends on the height of the binary tree.
Even though a contention-free binary tree may not be a complete binary tree, its height is not much greater than that of a complete binary tree.

Average tree heights for 20 randomly generated topologies

Evaluation
Contention-free pipelined algorithms:
Routine generators produce the broadcast routines from topology information.
The generated routines are based on MPICH point-to-point primitives.
Trees evaluated: linear tree, binary tree, 3-ary tree
Targets for comparison:
MPICH: binomial tree, scatter/allgather
LAM: flat tree, binomial tree
Topology-unaware pipelined linear and binary algorithms

Evaluation

Performance of different pipelined trees (topology 1)

Comparing pipelined broadcast with other schemes

Topology unaware and contention-free pipelined broadcast

Segment size for pipelined broadcast

Conclusions
Pipelined broadcast is faster than the current broadcast algorithms for medium and large messages.
The linear pipeline has a completion time roughly equal to T(msize).
The binary pipeline is best for medium messages.
A contention-free broadcast tree is necessary for pipelined algorithms.
A good segment size for pipelined broadcast is not difficult to find.

Questions?