1 Techniques for Pipelined Broadcast on Ethernet Switched Clusters
SELECTED TOPICS FOR DISTRIBUTED COMPUTING [SKR 5800]
DEPARTMENT OF COMMUNICATION TECHNOLOGY AND NETWORKING
LECTURER: DR. NOR ASILAH WATI BT ABD HAMID
PREPARED BY: MUHAMAD RAFIQ BIN OSMAN
MATRIC NO.: GS18838
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
UNIVERSITI PUTRA MALAYSIA

2 Contents
 Introduction
 Literature review
 Problem statements
 Objectives
 Methodology
 Cluster designs
 Broadcast trees
 Contention-free linear tree
 Contention-free binary tree
 Heuristic algorithms
 Model for computing appropriate segment sizes
 Experiments
 Results/findings
 Conclusion

3 Introduction
 Broadcast: the root process sends a message to all other processes in the system.
 Atomic broadcast algorithms apply when there is only one broadcast operation and the broadcast message cannot be split.
 Pipelined broadcast algorithms apply when there are multiple broadcast operations or the message can be split into message segments (see the sketch below).
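To make the pipelined approach concrete, here is a minimal sketch (an illustration, not the paper's implementation) of pipelined broadcast along a linear chain of MPI ranks 0 -> 1 -> ... -> P-1; the function name and the blocking send/receive structure are choices of this sketch:

    #include <mpi.h>

    /* Rank 0 is the root; msize and seg_size are in bytes.  The message is
     * split into X segments, and each rank forwards segment k to its
     * successor while later segments are still flowing in behind it. */
    void pipelined_bcast_chain(char *buf, int msize, int seg_size, MPI_Comm comm)
    {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);

        int nsegs = (msize + seg_size - 1) / seg_size;   /* X segments */

        for (int k = 0; k < nsegs; k++) {
            int off = k * seg_size;
            int len = (off + seg_size <= msize) ? seg_size : msize - off;
            if (rank > 0)            /* receive segment k from the predecessor */
                MPI_Recv(buf + off, len, MPI_BYTE, rank - 1, k, comm,
                         MPI_STATUS_IGNORE);
            if (rank < nprocs - 1)   /* forward segment k to the successor */
                MPI_Send(buf + off, len, MPI_BYTE, rank + 1, k, comm);
        }
    }

Once the pipeline is full, all links of the chain carry different segments at the same time, which is why splitting the message pays off for large messages.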

4 Literature review
 Binomial-tree-based pipelined broadcast algorithms have been developed in [11], [13], [23], [24], and [25].
 The k-binomial tree algorithm [13] has been shown to perform better than traditional binomial trees.
 This paper does not propose new pipelined broadcast schemes; instead, it develops practical techniques to facilitate the deployment of pipelined broadcast on clusters connected by multiple Ethernet switches.

5 Problem statements
 The problems to be addressed are:
a) Determining the proper broadcast tree to use with pipelined broadcast.
b) Two or more communications can be active simultaneously only when they come from different branches of the tree.
c) Selecting appropriate segment sizes: a small segment size may incur excessive start-up overheads, while a large segment size may decrease pipeline efficiency.

6 Objectives
 The paper has three objectives:
a) Broadcast large messages using the pipelined broadcast approach.
b) Develop adaptive MPI routines that use different algorithms according to the message size.
c) Allow these algorithms and the complementary algorithms for broadcasting small messages to co-exist in one MPI routine.

7 Methodology
 A path is represented as a set of directed links; e.g., path(n0 -> n3) = {(n0,s0), (s0,s1), (s1,s3), (s3,n3)}.
 A contention-free pattern is a pattern where no two communications in the pattern have contention (see the contention-check sketch below).
 [Figure: example topology; switches s0-s3 interconnected, machines n0-n5 attached to the switches.]
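This path representation makes contention easy to test: two communications have contention exactly when their paths share a directed link. A small sketch under that assumption, with a hypothetical integer encoding of nodes:

    #include <stdbool.h>

    /* A directed link is a (from, to) pair of node ids; a path is an array
     * of links, e.g. path(n0 -> n3) = {(n0,s0), (s0,s1), (s1,s3), (s3,n3)}. */
    typedef struct { int from, to; } Link;

    /* true iff the two paths share at least one directed link */
    bool paths_contend(const Link *p1, int n1, const Link *p2, int n2)
    {
        for (int i = 0; i < n1; i++)
            for (int j = 0; j < n2; j++)
                if (p1[i].from == p2[j].from && p1[i].to == p2[j].to)
                    return true;
        return false;
    }

A pattern is then contention-free when paths_contend() is false for every pair of communications in the pattern.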

8 Cont. (1) Broadcast trees
 [Figure: the five broadcast tree shapes compared: linear tree, 3-ary tree, binary tree, flat tree, and binomial tree.]

9 Cont. (2) Completion times of the broadcast trees (P = number of machines, X = number of segments, T(m) = time to communicate a message of size m):

Broadcast tree  | Height H         | Nodal degree D   | Completion time
Linear tree     | P                | 1                | (X + P - 1) * T(msize/X)
Binary tree     | log2(P)          | 2                | (X + log2(P) - 1) * 2 * T(msize/X)
k-ary tree      | logk(P)          | k                | (X + logk(P) - 1) * k * T(msize/X)
Binomial tree   | atomic broadcast |                  | log(P) * T(msize)
Flat tree       | atomic broadcast |                  | (P - 1) * T(msize)

10 Contention-free linear trees
 All communications in a contention-free linear tree must be contention-free.
 Let G = (S ∪ M, E) be the tree graph, where S = switches, M = machines, E = edges; P = |M|, and G' = (S, E') is the subgraph of G restricted to the switches.
 Step 1: starting from the switch that the root nr is connected to, perform Depth-First Search (DFS) on G' and number the switches in DFS arrival order (see the sketch below).
 Step 2: number the machines attached to switch si as ni,0, ni,1, ..., ni,Xi-1 (Xi = 0 when no machine is attached to si).
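A sketch of Step 1, assuming the switch subgraph G' is stored as adjacency lists in hypothetical global arrays and that dfs() is first called on the switch that the root machine nr is attached to:

    #define MAX_SW 64

    int adj[MAX_SW][MAX_SW];  /* adj[s][0..deg[s]-1]: neighbors of switch s in G' */
    int deg[MAX_SW];
    int dfs_no[MAX_SW];       /* dfs_no[s]: DFS arrival number of switch s */
    int visited[MAX_SW];
    int next_no = 0;

    void dfs(int s)
    {
        visited[s] = 1;
        dfs_no[s] = next_no++;          /* number switches in DFS arrival order */
        for (int i = 0; i < deg[s]; i++)
            if (!visited[adj[s][i]])
                dfs(adj[s][i]);
    }

Step 2 then lists the machines switch by switch in increasing dfs_no, i.e. n0,0, ..., n0,X0-1, n1,0, ..., which yields the chain order of the contention-free linear tree.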

11 Contention-free binary tree
 The tree height affects the time to complete the operation; the smallest tree height is ideal for a pipelined broadcast binary tree.
 Example (i < j ≤ k < l and a ≤ b ≤ c ≤ d):
 Path (mi -> mj) has three components: (mi, sa), path(sa -> sb), and (sb, mj).
 Path (mk -> ml) has three components: (mk, sc), path(sc -> sd), and (sd, ml).
 When a = b, communication mi -> mj does not have contention with communication mk -> ml, since (mi, sa) and (sb, mj) are not in path(sc -> sd), and vice versa.
 Question: what about a k-ary broadcast tree? Can it be kept contention-free with up to k children?

12 Heuristic algorithms
 Tree[i][j] stores tree(i,j), and best[i][j] stores the height of tree(i,j).
 Tree(i,j), j > i+2, is formed by taking mi as the root, tree(i+1, k-1) as the left child, and tree(k, j) as the right child.
 Ensure that mi -> mk does not have contention with the communications in tree(i+1, k-1); this guarantees that the binary tree is contention-free.
 Choose the k with the smallest max(best[i+1][k-1], best[k][j]) + 1, which minimizes the tree height.
 Tree[0][P-1] stores the resulting contention-free binary tree (a dynamic-programming sketch follows below).
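A sketch of this dynamic program. The contention test is abstracted behind a hypothetical predicate no_contention(i, k, j), and the base cases for one and two machines are assumptions of this sketch:

    #include <limits.h>

    #define MAXP 128

    int best[MAXP][MAXP];   /* best[i][j]: min height of contention-free tree(i,j) */
    int split[MAXP][MAXP];  /* the k chosen for tree(i,j), to rebuild the tree */

    /* assumed to be provided elsewhere: checks that mi -> mk has no
     * contention with the communications in tree(i+1, k-1) */
    extern int no_contention(int i, int k, int j);

    void build_trees(int P)
    {
        for (int i = 0; i < P; i++)
            best[i][i] = 0;             /* single machine */
        for (int i = 0; i + 1 < P; i++)
            best[i][i + 1] = 1;         /* root plus one child */

        for (int len = 3; len <= P; len++) {        /* interval length */
            for (int i = 0; i + len - 1 < P; i++) {
                int j = i + len - 1;
                best[i][j] = INT_MAX;   /* INT_MAX: no feasible split found */
                for (int k = i + 2; k <= j; k++) {
                    int lh = best[i + 1][k - 1];    /* left child tree(i+1,k-1) */
                    int rh = best[k][j];            /* right child tree(k,j)   */
                    if (lh == INT_MAX || rh == INT_MAX)
                        continue;
                    int h = (lh > rh ? lh : rh) + 1;
                    if (h < best[i][j] && no_contention(i, k, j)) {
                        best[i][j] = h;
                        split[i][j] = k;
                    }
                }
            }
        }
        /* best[0][P-1] and split[][] now describe tree(0, P-1) */
    }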

13 Model for computing appropriate segment sizes
 Point-to-point communication performance is characterized by five parameters:
1) L = latency.
2) os(m) = the time the sending CPU is busy sending a message of size m.
3) or(m) = the time the receiving CPU is busy receiving a message of size m.
4) g(m) = the minimum time interval between consecutive transmissions or receptions of messages of size m.
5) P = the number of processors in the system.
 os, or, and g are functions of the message size, which allows the communication time of large messages to be modeled more accurately.

14 Pipelined broadcast
 Linear tree: the completion time does not need to consider network contention; it is computed by combining the times of the point-to-point communications on the critical path. Completion time: (P-1)[L(msize/X) + g(msize/X)] + (X-1)g(msize/X), where segments propagated in pipelined fashion contribute the (X-1)*g(m) term.
 Binary tree: the completion time depends on the tree topology; each node in the broadcast tree first sends a segment to the left child and then to the right child. Completion time: A*L(msize/X) + B*g(msize/X) + 2(X-1)g(msize/X), where A and B depend on the tree, and segments propagated in pipelined fashion contribute the (X-1)*2g(m) term.
 A sketch of selecting the segment size from the linear-tree formula follows below.
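The linear-tree formula above can be used directly to pick a segment size. A minimal sketch, assuming hypothetical helpers L_of(m) and g_of(m) that return the measured L(m) and g(m) (e.g., interpolated from the benchmark data), and searching the 256 B to 32 kB candidate range used later in the slides:

    /* assumed measured parameters of the extended LogP model */
    extern double L_of(int m);
    extern double g_of(int m);

    int pick_segment_size(int msize, int P)
    {
        int best_s = 256;
        double best_t = 1e30;
        for (int s = 256; s <= 32 * 1024; s *= 2) {
            int X = (msize + s - 1) / s;   /* number of segments */
            /* linear-tree completion time from the slide:
             * (P-1)[L(msize/X) + g(msize/X)] + (X-1) g(msize/X) */
            double t = (P - 1) * (L_of(s) + g_of(s)) + (X - 1) * g_of(s);
            if (t < best_t) { best_t = t; best_s = s; }
        }
        return best_s;
    }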

15 Experiments
 Evaluate the performance of pipelined broadcast with various broadcast trees on 100 Mbps (Fast Ethernet) and 1000 Mbps (Gigabit Ethernet) clusters with different physical topologies.
 Topology (1) contains 16 machines connected by a single switch.
 Topologies (2), (3), (4) and (5) are 32-machine clusters with different network connectivity.
 Topologies (4) and (5) have the same physical topology but different node assignments.

16 Cont. (1) Machine specifications:
Processor: Dell Dimension 2400 with 2.8 GHz P4
Memory: 640 MB
Disk space: 40 GB
Operating system: Linux (Fedora)
Ethernet card: Broadcom BCM 5705 (10/100/1000 Mbps) with the driver from Broadcom
Ethernet switches (100 Mbps): Dell PowerConnect 2224 (24-port 100 Mbps)
Ethernet switches (1000 Mbps): Dell PowerConnect 2724 (24-port 1000 Mbps)

17 Cont. (2)
 The extended parameterized LogP model characterizes the system with five parameters: L(m), os(m), or(m), g(m), and P.
 The range of potential segment sizes is 256 B to 32 kB.
 To obtain L(m), use a pingpong program to measure the round-trip time RTT(m) for messages of size m and derive L(m) from RTT(m) = L(m) + g(m) + L(m) + g(m) (see the sketch below).
 With 1000 Mbps Ethernet the CPU is the bottleneck when the message size is more than 8 kB; this is why L(m) decreases as m increases from 8 kB to 32 kB on the 1000 Mbps clusters.
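A minimal pingpong sketch for estimating RTT(m) between ranks 0 and 1; the iteration count and the absence of warm-up rounds are simplifications of this sketch, not the paper's measurement procedure:

    #include <mpi.h>

    double measure_rtt(char *buf, int m, MPI_Comm comm)
    {
        int rank, iters = 100;
        MPI_Comm_rank(comm, &rank);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {             /* send, then wait for the echo */
                MPI_Send(buf, m, MPI_BYTE, 1, 0, comm);
                MPI_Recv(buf, m, MPI_BYTE, 1, 0, comm, MPI_STATUS_IGNORE);
            } else if (rank == 1) {      /* echo the message back */
                MPI_Recv(buf, m, MPI_BYTE, 0, 0, comm, MPI_STATUS_IGNORE);
                MPI_Send(buf, m, MPI_BYTE, 0, 0, comm);
            }
        }
        return (MPI_Wtime() - t0) / iters;   /* average RTT(m) */
    }

With g(m) measured separately, L(m) follows from the slide's formula as RTT(m)/2 - g(m).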

18 Cont. (3)
 The predicted optimal segment size sometimes differs from the measured one. Factors:
a) The model assumes a 1-port model in which each node can send and receive at the link speed simultaneously. The assumption holds for the 100 Mbps Ethernet clusters, but the processor cannot keep up with sending and receiving at 1000 Mbps at the same time.
b) Inaccuracy in the performance parameter measurements.

19 Results on 100 Mbps Ethernet switched clusters
 The time for binary trees is about twice the time to send a single message.
 The segment size has little impact on pipelined broadcast: changing the segment size from 512 bytes to 2048 bytes does not significantly affect performance, especially in comparison with the differences between algorithms.

20 Performance of different broadcast trees, 100 Mbps
 The linear tree offers the best performance when the message size is large (≥ 32 kB); the communication completion time for linear trees is very close to T(msize).
 The binary tree offers the best performance for medium-sized messages (8-16 kB).

21 Performance of different algorithms, 100 Mbps (LAM + MPICH), topology 4
 [Figure: performance graph with callouts, one marking poor performance.]
 MPICH (scatter followed by all-gather algorithm) gradually approaches the performance of pipelined broadcast with binary trees.
 Pipelined broadcast with the linear tree is about twice as fast as MPICH.

22 Performance of different algorithms, 100 Mbps (LAM + MPICH), topology 5
 Topology-unaware algorithms are sensitive to the physical topology, which manifests the advantage of pipelined broadcast with contention-free trees.
 All algorithms in LAM and MPICH perform poorly.

23 Results on 1000 Mbps Ethernet switched clusters
 The linear tree performs better than the binary tree when the message is larger than 1 MB. Factors:
a) The processor cannot keep up with sending and receiving data at 1000 Mbps at the same time, and the binary tree pipelined broadcast algorithm is less computationally intensive than the linear tree algorithm.
b) Larger software start-up overheads on 1000 Mbps Ethernet.

24 Performance with different broadcast trees, 1000 Mbps
 The 3-ary tree is always worse than the binary tree, which confirms that k-ary trees with k > 2 are not effective.
 Insufficient CPU speed significantly affects the linear tree algorithm.

25 Performance of different algorithms, 1000 Mbps (LAM + MPICH), topology 4
 The recursive-doubling algorithm introduces severe network contention and yields extremely poor performance.
 Although MPICH performs well, it is still 64% slower than the contention-free broadcast tree.

26 Performance of different algorithms, 1000 Mbps (LAM + MPICH), topology 5
 Pipelined broadcast performs better than the algorithms in MPICH and LAM on 1000 Mbps clusters in all situations.
 All algorithms used by LAM and MPICH incur severe network contention and perform much worse across all message sizes.

27 Properties of pipelined broadcast algorithms
 Two conditions for pipelined broadcast to be effective:
a) The software overhead for splitting a large message into segments should not be excessive.
b) The pipeline term must dominate the delay term.
 For 100 Mbps, when the segment size ≥ 1024 bytes, X*T(msize/X) is within 10% of T(msize).
 For 1000 Mbps, when the segment size ≥ 8 kB, X*T(msize/X) is within 10% of T(msize).

28 Cont. (1)
 When the segment size is smaller than these thresholds, the communication start-up overheads increase dramatically. However, the optimal segment size may still be less than the thresholds because of the compromise between software overhead and pipeline efficiency.
 The linear tree pipelined algorithm is efficient for broadcasting on a small number of processes, while the binary tree algorithm applies to a large number of processes.

29 Cont. (2) Recommended minimum message sizes for broadcasting (a dispatch sketch using these thresholds follows below):

Algorithm              | 100 Mbps cluster            | 1000 Mbps cluster
Linear tree algorithm  | 2*(32-1)*1 kB = 62 kB       | 2*(32-1)*8 kB = 496 kB
Binary tree algorithm  | 2*(log2(32)-1)*1 kB = 8 kB  | 2*(log2(32)-1)*8 kB = 64 kB
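These thresholds suggest how the adaptive routine from objective (b) might dispatch on the message size. A sketch using the 100 Mbps values from the table above; the routine names are placeholders, not the paper's API:

    /* placeholder implementations, assumed to exist elsewhere */
    extern void atomic_bcast(char *buf, int msize);            /* small messages  */
    extern void binary_tree_pipelined(char *buf, int msize);   /* medium messages */
    extern void linear_tree_pipelined(char *buf, int msize);   /* large messages  */

    void adaptive_bcast(char *buf, int msize)
    {
        if (msize < 8 * 1024)          /* below the binary-tree threshold */
            atomic_bcast(buf, msize);
        else if (msize < 62 * 1024)    /* below the linear-tree threshold */
            binary_tree_pipelined(buf, msize);
        else
            linear_tree_pipelined(buf, msize);
    }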

30 Conclusions
 The modeled optimal segment size can differ from the measured one, but the modeled performance matches the measured performance.
 Pipelined broadcast is more efficient than other commonly used broadcast algorithms on contemporary 100 Mbps and 1000 Mbps Ethernet switched clusters in many situations.
 The techniques can be applied to other types of clusters.
 Near-optimal broadcast performance can be achieved on an irregular topology by finding a spanning tree and applying the same techniques.

31 References
 [1] O. Beaumont, A. Legrand, L. Marchal, Y. Robert, Pipelined broadcasts on heterogeneous platforms, IEEE Transactions on Parallel and Distributed Systems 16 (4) (2005).
 [2] O. Beaumont, L. Marchal, Y. Robert, Broadcast trees for heterogeneous platforms, in: The 9th IEEE Int'l Parallel and Distributed Processing Symposium, 2005, p. 80b.
 [3] K.W. Cameron, X.-H. Sun, Quantifying locality effect in data access delay: Memory LogP, in: IEEE Int'l Parallel and Distributed Processing Symposium (IPDPS), 2003, p. 48b.
 [4] K.W. Cameron, R. Ge, X.-H. Sun, LognP and log3P: Accurate analytical models of point-to-point communication in distributed systems, IEEE Transactions on Computers 56 (3) (2007).
 [5] D. Culler et al., LogP: Towards a realistic model of parallel computation, in: Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 1993.
 [6] A. Faraj, X. Yuan, Automatic generation and tuning of MPI collective communication routines, in: The 19th ACM International Conference on Supercomputing, 2005.
 [7] A. Faraj, X. Yuan, P. Patarasuk, A message scheduling scheme for all-to-all personalized communication on Ethernet switched clusters, IEEE Transactions on Parallel and Distributed Systems 18 (2) (2007).
 [8] A. Faraj, P. Patarasuk, X. Yuan, A study of process arrival patterns for MPI collective operations, International Journal of Parallel Programming (in press).
 [9] A. Faraj, P. Patarasuk, X. Yuan, Bandwidth efficient all-to-all broadcast on switched clusters, International Journal of Parallel Programming (in press).
 [10] J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G.E. Fagg, E. Gabriel, J. Dongarra, Performance analysis of MPI collective operations, in: The 19th IEEE International Parallel and Distributed Processing Symposium, 2005.
 [11] S.L. Johnsson, C.T. Ho, Optimum broadcasting and personalized communication in hypercubes, IEEE Transactions on Computers 38 (9) (1989).
 [12] A. Karwande, X. Yuan, D.K. Lowenthal, An MPI prototype for compiled communication on Ethernet switched clusters, Journal of Parallel and Distributed Computing 65 (10) (2005).
 [13] R. Kesavan, D.K. Panda, Optimal multicast with packetization and network interface support, in: Proceedings of the International Conference on Parallel Processing, 1997.

32 References (2)
 [14] T. Kielmann, H.E. Bal, K. Verstoep, Fast measurement of LogP parameters for message passing platforms, in: Proceedings of the 2000 IPDPS Workshop on Parallel and Distributed Processing, Cancun, Mexico, May 2000.
 [15] LAM/MPI Parallel Computing.
 [16] R.G. Lane, S. Daniels, X. Yuan, An empirical study of reliable multicast protocols over Ethernet-connected networks, Performance Evaluation Journal 64 (2007).
 [17] P.K. McKinley, H. Xu, A. Esfahanian, L.M. Ni, Unicast-based multicast communication in wormhole-routed networks, IEEE Transactions on Parallel and Distributed Systems 5 (12) (1994).
 [18] The MPI Forum, The MPI-2: Extensions to the Message Passing Interface.
 [19] MPICH: A portable implementation of MPI.
 [20] P. Sanders, J.F. Sibeyn, A bandwidth latency tradeoff for broadcast and reduction, Information Processing Letters 86 (1) (2003).
 [21] SCI-MPICH: MPI for SCI-connected clusters. Available at: aachen.de/users/joachim/SCI-MPICH/pcast/html.
 [22] A. Tanenbaum, Computer Networks, 4th edition.
 [23] J.-Y. Tien, C.-T. Ho, W.-P. Yang, Broadcasting on incomplete hypercubes, IEEE Transactions on Computers 42 (11) (1993).
 [24] J.L. Traff, A. Ripke, Optimal broadcast for fully connected networks, in: Proceedings of High-Performance Computing and Communications (HPCC-05), 2005.
 [25] J.L. Traff, A. Ripke, An optimal broadcast algorithm adapted to SMP-clusters, EuroPVM/MPI (2005).
 [26] S.S. Vadhiyar, G.E. Fagg, J. Dongarra, Automatically tuned collective communications, in: Proceedings of SC'00: High Performance Networking and Computing (CDROM Proceedings), 2000.
 [27] J. Watts, R. van de Geijn, A pipelined broadcast for multidimensional meshes, Parallel Processing Letters 5 (2) (1995).
 [28] X. Yuan, R. Melhem, R. Gupta, Algorithms for supporting compiled communication, IEEE Transactions on Parallel and Distributed Systems 14 (2) (2003).

33 THE END
 Thank you.
 Questions and answers.