Collectives on Two-tier Direct Networks – EuroMPI 2012. Nikhil Jain, JohnMark Lau, Laxmikant Kale. 26th September 2012.

Motivation  Collectives are an important component of parallel programs  They impact performance and scalability  Performance of large-message collectives is constrained by network bandwidth  Topology-aware implementations are required to extract the best performance  Clos, fat-tree and torus networks are low-radix and have large diameters  The multiplicity of hops makes them congestion prone  Carefully designed collective algorithms are needed

Two-tier Direct Networks  A new network topology  IBM PERCS  Dragonfly (Cray Aries)  High-radix networks with multiple levels of connections  At level 1, multi-core chips (nodes) are clustered to form supernodes/racks  At level 2, connections are provided between the supernodes

Two-tier Direct Networks [Figure: first-tier connections within a supernode – the LL and LR links of one node – and second-tier D links connecting a supernode to other supernodes]

Topology Oblivious Algorithms
 Scatter/Gather – Binomial Tree
 Allgather – Ring, Recursive Doubling
 Broadcast – van de Geijn's Scatter with Allgather
 Reduce-scatter – Pairwise Exchange
 Reduce – Rabenseifner's Reduce-scatter with Gather

Topology Aware Algorithms  Blue Gene  Multi-color non-overlapping spanning trees  Generalized n-dimensional bucket algorithm  Tofu  The Trinaryx3 Allreduce  Clos  Distance-halving allgather algorithm

Assumptions/Conventions  Task – (core ID, node ID, supernode ID)  core ID and node ID are local  All-to-all connections between the nodes of a supernode  nps – nodes per supernode, cpn – cores per node  Connections from a supernode to other supernodes originate from its nodes in a round-robin manner  The link from supernode S1 to supernode S2 originates at node (S2 modulo nps) in supernode S1

 Focus on large messages – startup cost ignored [Figure: connections among supernodes – nodes N0 and N1 of SN 0 link to SN 1 through SN 7; the link back to SN 0 is unused]
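As a concrete (and hedged) illustration of these conventions, the sketch below maps a global rank to (core ID, node ID, supernode ID) and applies the round-robin link-origination rule; the core-major rank layout is an assumption made for illustration, not something the slides specify.

```c
#include <stdio.h>

#define NPS 16   /* nodes per supernode (value used in the experiments) */
#define CPN 16   /* cores per node */

/* Decompose a global rank into (core ID, node ID, supernode ID),
 * assuming cores are numbered fastest, then nodes, then supernodes. */
static void task_coords(int rank, int *core, int *node, int *sn) {
    *core = rank % CPN;
    *node = (rank / CPN) % NPS;
    *sn   = rank / (CPN * NPS);
}

/* The D link from supernode s1 to supernode s2 originates at node
 * (s2 mod nps) of s1, per the round-robin convention. */
static int origin_node(int s2) { return s2 % NPS; }

int main(void) {
    int core, node, sn;
    task_coords(300, &core, &node, &sn);
    printf("rank 300 -> (core %d, node %d, supernode %d)\n", core, node, sn);
    printf("SN 5 -> SN 37 link originates at node %d of SN 5\n",
           origin_node(37));
    return 0;
}
```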

Two-tier Algorithms  How can we take advantage of the cliques and the multiple levels of connections?  SDTA – Stepwise Dissemination, Transfer or Aggregation  Simultaneous exchange of data within level 1  Minimize the amount of data transferred at level 2  (0, 0, 0) is assumed to be the root

Scatter using SDTA  (0, 0, 0) → (0, *, 0)  Data sent to (0, x, 0) is the data that belongs to the supernodes connected to (0, x, 0)  (0, *, 0) scatter the data to the corresponding nodes in other supernodes  (0, x, *) distribute the data within their supernodes  (0, *, *) provide data to the other cores in their nodes

Scatter among 4 supernodes – Step 1: disseminate within the source supernode; Step 2: transfer to other supernodes; Step 3: disseminate in all supernodes [Figure: the three steps on supernodes SN 0–SN 3, each with nodes N0–N3]
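The three steps can be traced in code. The sketch below follows each supernode's chunk of an SDTA scatter through the round-robin wiring; the 4-node, 4-supernode sizes match the figure, and the landing node (ROOT_SN mod nps) is inferred from the wiring convention rather than stated on the slide.

```c
#include <stdio.h>

#define NPS     4   /* nodes per supernode, as in the figure */
#define NUM_SN  4   /* supernodes, as in the figure */
#define ROOT_SN 0   /* the scatter root lives in supernode 0 */

int main(void) {
    for (int dst = 0; dst < NUM_SN; dst++) {
        if (dst == ROOT_SN) {
            printf("SN %d: chunk stays local, spread over L links\n", dst);
            continue;
        }
        int relay = dst % NPS;      /* Step 1: root -> (0, relay, 0)     */
        int land  = ROOT_SN % NPS;  /* Step 2: D link lands on this node */
        /* Step 3: the landing node scatters within its supernode.       */
        printf("SN %d: (0,0,0) -> (0,%d,0) -> (0,%d,%d) -> nodes of SN %d\n",
               dst, relay, land, dst, dst);
    }
    return 0;
}
```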

Broadcast using SDTA  Can be done using an algorithm similar to scatter – but that is not optimal  (0, 0, 0) divides the data into nps chunks; sends chunk x to (0, x, 0)  (0, *, 0) send their chunks to exactly one connected node (in another supernode)  Every node that receives data acts like a broadcast source  It sends the data to all other nodes in its supernode  These nodes forward the data to other supernodes  The recipients in other supernodes share it within their supernodes

Allgather using SDTA  The all-to-all networks facilitate parallel base broadcasts  Steps:  Every node shares its data with all other nodes in its supernode  Every node shares the data it has so far with the corresponding nodes in other supernodes  Nodes share the data within their supernodes  The majority of communication happens at level 1, with minimal communication at level 2
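A minimal MPI sketch of these steps, assuming one rank per node and a node-major rank layout (both assumptions for illustration). Note that the paper restricts step 2 to directly connected nodes and completes the result with a final level-1 share, whereas this sketch folds steps 2 and 3 into one sub-communicator allgather:

```c
#include <mpi.h>
#include <stdlib.h>

/* SDTA-style allgather sketch: one MPI rank per node, ranks laid out
 * node-major within supernodes (illustrative assumptions). `all` must
 * hold n * nps * num_supernodes elements. */
void sdta_allgather(const double *mine, int n, double *all, int nps,
                    MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    int sn = rank / nps, node = rank % nps;

    MPI_Comm sn_comm, peer_comm;
    MPI_Comm_split(comm, sn, node, &sn_comm);    /* nodes of one supernode */
    MPI_Comm_split(comm, node, sn, &peer_comm);  /* node i of every SN     */

    double *local = malloc((size_t)n * nps * sizeof *local);

    /* Step 1: every node shares its data within the supernode (L links). */
    MPI_Allgather(mine, n, MPI_DOUBLE, local, n, MPI_DOUBLE, sn_comm);

    /* Steps 2-3: exchange the supernode's data with the corresponding
     * node of every other supernode (D links); the full sub-communicator
     * exchange below subsumes the final within-supernode share. */
    MPI_Allgather(local, n * nps, MPI_DOUBLE, all, n * nps, MPI_DOUBLE,
                  peer_comm);

    free(local);
    MPI_Comm_free(&sn_comm);
    MPI_Comm_free(&peer_comm);
}
```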

Computation Collectives  Owner core – a core that has been assigned a part of the data that needs to be reduced  Given a clique of k cores with data of size m, consider the following known approach:  Each core is made the owner of m/k of the data  Every core sends the data corresponding to each owner core to that owner, using the all-to-all network  The owner cores reduce the data they own
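In MPI terms, this known clique approach is precisely a reduce-scatter; a one-call sketch (buffer names are hypothetical, with chunk = m/k):

```c
#include <mpi.h>

/* Clique reduction with owner cores: each of the k cores contributes
 * its full m-element buffer and ends up owning the reduced m/k-element
 * chunk at its own offset. */
void clique_reduce(const double *data, double *owned, int chunk,
                   MPI_Comm clique) {
    MPI_Reduce_scatter_block(data, owned, chunk, MPI_DOUBLE,
                             MPI_SUM, clique);
}
```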

Multi-phased Reduce  Perform a reduction among the cores of every node; collect the data at core 0  Perform a reduction among the nodes of every supernode – decide the owners carefully  Perform a reduction among supernodes; collect the data at the root
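A hedged sketch of the three phases using MPI communicator splits is shown below. Taking local rank 0 as the owner at every level simplifies the slide's "decide owners carefully" step, and the color arithmetic assumes node IDs below 1024; both are illustrative choices.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void multi_phase_reduce(const double *buf, double *out, int n,
                        int core_id, int node_id, int sn_id, MPI_Comm comm) {
    double *tmp = malloc((size_t)n * sizeof *tmp);

    /* Phase 1: reduce among the cores of every node onto core 0. */
    MPI_Comm node_comm;
    MPI_Comm_split(comm, sn_id * 1024 + node_id, core_id, &node_comm);
    MPI_Reduce(buf, tmp, n, MPI_DOUBLE, MPI_SUM, 0, node_comm);
    MPI_Comm_free(&node_comm);

    /* Phase 2: reduce among the nodes of every supernode (core 0s only). */
    MPI_Comm sn_comm;
    int color = (core_id == 0) ? sn_id : MPI_UNDEFINED;
    MPI_Comm_split(comm, color, node_id, &sn_comm);
    if (sn_comm != MPI_COMM_NULL) {
        MPI_Reduce(tmp, out, n, MPI_DOUBLE, MPI_SUM, 0, sn_comm);
        MPI_Comm_free(&sn_comm);
    }

    /* Phase 3: reduce among supernodes onto the root (0, 0, 0). */
    MPI_Comm top_comm;
    color = (core_id == 0 && node_id == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(comm, color, sn_id, &top_comm);
    if (top_comm != MPI_COMM_NULL) {
        int trank;
        MPI_Comm_rank(top_comm, &trank);
        MPI_Reduce(out, tmp, n, MPI_DOUBLE, MPI_SUM, 0, top_comm);
        if (trank == 0)  /* final result lands on the global root */
            memcpy(out, tmp, (size_t)n * sizeof *out);
        MPI_Comm_free(&top_comm);
    }
    free(tmp);
}
```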

Reduce-scatter  The first two steps are the same as in Reduce  In the reduction among supernodes, choose the owners carefully – a supernode is the owner of the data that should be deposited on its cores as part of the Reduce-scatter  The nodes that hold the data then scatter it to the other nodes within their supernodes

Cost Comparison [Table: cost-model comparison of the two-tier and topology-oblivious algorithms]

Experiments  Rank-order mapping  pattern-generator produces a list of communication exchanges between MPI ranks  linkUsage computes the amount of traffic that will flow on each link in the given two-tier network  64 supernodes, nps = 16, cpn = 16  4032 L2 links and L1 links in the system

L1 Links Used [Chart: number of L1 links used by each algorithm]

L1 Links Maximum Traffic [Chart: maximum traffic (MB) on any L1 link]

L2 Links Used [Chart: number of L2 links used by each algorithm]

L2 Links Maximum Traffic [Chart: maximum traffic (MB) on any L2 link]

Conclusion and Future Work  Proposed topology-aware algorithms for large-message collectives on two-tier direct networks  Comparisons based on a cost model and analytical modeling promise good performance  Future work: implement these algorithms on a real system  Explore short-message collectives