Slide 1: Berkeley UPC: Optimizing Bandwidth Limited Problems Using One-Sided Communication and Overlap
Christian Bell (1,2), Dan Bonachea (1), Rajesh Nishtala (1), and Katherine Yelick (1,2)
(1) UC Berkeley, Computer Science Division
(2) Lawrence Berkeley National Laboratory

Slide 2: Conventional Wisdom
- Send few, large messages
  - Allows the network to deliver the most effective bandwidth
- Isolate computation and communication phases
  - Uses bulk-synchronous programming
  - Allows for packing to maximize message size
- Message passing is the preferred paradigm for clusters
- Global Address Space (GAS) languages are primarily useful for latency-sensitive applications
- GAS languages mainly help productivity
  - However, they are not well known for their performance advantages

Slide 3: Our Contributions
- Increasingly, the cost of HPC machines is in the network
- The one-sided communication model is a better match to modern networks
  - GAS languages simplify programming for this model
- How to use these communication advantages:
  - Case study with the NAS Fourier Transform (FT) benchmark
  - Algorithms designed to relieve communication bottlenecks
    - Overlap communication and computation
    - Send messages early and often to maximize overlap

Slide 4: UPC Programming Model
- Global address space: any thread/process may directly read/write data allocated by another
- Partitioned: data is designated as local (near) or global (possibly far); the programmer controls layout
- Three of the current languages: UPC, CAF, and Titanium
  - Emphasis in this talk is on UPC (based on C)
  - However, the programming paradigms presented in this work are not limited to UPC
- Global arrays: allow any processor to directly access data on any other processor
(Figure: shared vs. private memory regions, showing a global array "g" and a private variable "l" across Proc 0 .. Proc n-1)
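
To make the model concrete, here is a minimal UPC sketch (ours, not from the talk) of a shared array that any thread can read or write directly, next to ordinary private C data; the names, size, and neighbor read are purely illustrative.

    #include <stdio.h>
    #include <upc.h>        /* UPC: THREADS, MYTHREAD, shared, upc_forall, upc_barrier */

    #define N 8

    /* Shared (global) array: with the default block size, element i has
       affinity to thread i % THREADS, but ANY thread may read or write
       any element directly; no receive call on the owner is needed. */
    shared double g[N * THREADS];

    /* Ordinary C objects are private: each thread gets its own copy. */
    double l;

    int main(void) {
        int i;

        /* Each thread initializes the elements it has affinity to. */
        upc_forall (i = 0; i < N * THREADS; i++; &g[i])
            g[i] = MYTHREAD;

        upc_barrier;

        /* Direct remote read from a neighbor's partition of the shared space. */
        l = g[(MYTHREAD + 1) % THREADS];
        printf("thread %d of %d read %g from its neighbor\n", MYTHREAD, THREADS, l);
        return 0;
    }

With the Berkeley UPC toolchain a program like this is compiled with upcc and launched with upcrun; the exact flags depend on the installation.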

Slide 5: Advantages of GAS Languages
- Productivity
  - GAS supports construction of complex shared data structures
  - High-level constructs simplify parallel programming
  - Related work has already focused on these advantages
- Performance (the main focus of this talk)
  - GAS languages can be faster than two-sided MPI
  - The one-sided communication paradigm of GAS languages is a more natural fit to modern cluster networks
  - Enables novel algorithms to leverage the power of these networks
  - GASNet, the communication system in the Berkeley UPC project, is designed to take advantage of this communication paradigm

Slide 6: One-Sided vs. Two-Sided Communication
- A one-sided put/get can be handled entirely by a network interface with RDMA support
  - The CPU can dedicate more time to computation rather than handling communication
- A two-sided message can employ RDMA for only part of the communication
  - Each message requires the target to provide the destination address
  - Offloaded to the network interface in networks like Quadrics
- RDMA makes it apparent that MPI has added costs, associated with ordering, to make it usable as an end-user programming model
(Figure: a one-sided put (e.g., GASNet) carries the destination address with its data payload, while a two-sided message (e.g., MPI) carries a message id; the diagram shows the host CPU, memory, and network interface)
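
As a concrete illustration (ours, not the talk's code), the fragment below performs a one-sided UPC put: the initiator names the destination address itself, so the transfer maps directly onto RDMA and no receive call runs on the target. A two-sided transfer of the same data would additionally require the target to post a matching receive (e.g., MPI_Recv) that supplies that address. Buffer names and sizes are made up.

    #include <upc.h>

    #define NELEM 1024

    /* One landing zone per thread in the shared space; with block size NELEM,
       row t has affinity to thread t. */
    shared [NELEM] double dst[THREADS][NELEM];

    double src[NELEM];          /* private source buffer on each thread */

    int main(void) {
        int i, peer = (MYTHREAD + 1) % THREADS;

        for (i = 0; i < NELEM; i++)
            src[i] = MYTHREAD + i;

        /* One-sided put: the initiator supplies the data AND the destination
           address. A two-sided design would need the peer to post a matching
           receive naming dst[peer] before this data could land. */
        upc_memput(&dst[peer][0], src, NELEM * sizeof(double));

        upc_barrier;            /* synchronization is decoupled from the transfer */
        return 0;
    }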

Slide 7: Latency Advantages
- Comparison:
  - One-sided: the initiator can always transmit the remote address
    - Close semantic match to high-bandwidth, zero-copy RDMA
  - Two-sided: the receiver must provide the destination address
- Latency measurements correlate with software overhead
  - Much of the small-message latency is due to time spent in software/firmware processing
- The one-sided implementation consistently outperforms its two-sided counterpart
(Chart: small-message latency comparison; lower is better)

Slide 8: Bandwidth Advantages
- One-sided semantics are a better match to RDMA-supported networks
  - Relaxing the point-to-point ordering constraint can allow higher performance on some networks
  - GASNet saturates to the hardware peak at lower message sizes
  - Synchronization is decoupled from data transfer
- MPI semantics are designed for the end user
  - Comparison is against a good MPI implementation
  - Semantic requirements hinder MPI performance
  - Synchronization and data transfer are coupled together in message passing
- Over a factor of 2 improvement for 1 kB messages
(Chart: bandwidth vs. message size; higher is better)

Slide 9: Bandwidth Advantages (continued)
- GASNet and MPI saturate to roughly the same bandwidth for "large" messages
- GASNet consistently outperforms MPI for "mid-range" message sizes
(Chart: bandwidth vs. message size; higher is better)

Slide 10: A Case Study: NAS FT
- How do we use the potential that the microbenchmarks reveal?
- Perform a large 3-dimensional Fourier transform
  - Used in many areas of computational science: molecular dynamics, computational fluid dynamics, image processing, signal processing, nanoscience, astrophysics, etc.
- Representative of a class of communication-intensive algorithms
  - Sorting algorithms rely on a similarly intensive communication pattern
  - Requires every processor to communicate with every other processor
  - Limited by bandwidth

Slide 11: Performing a 3D FFT
- NX x NY x NZ elements spread across P processors
- We use a 1-dimensional layout in the Z dimension
  - Each processor gets NZ/P "planes" of NX x NY elements each
(Figure: 1D partition of the cube along Z; example with P = 4, processors p0..p3 each owning NZ/P planes)
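
A small, self-contained C snippet (ours, using the slide's P = 4, 16^3 example) that simply prints which Z planes each processor owns under this 1D layout:

    #include <stdio.h>

    int main(void) {
        const int NX = 16, NY = 16, NZ = 16;    /* slide's example dimensions */
        const int P  = 4;                       /* number of processors       */
        const int planes = NZ / P;              /* NZ/P planes per processor  */
        int p;

        for (p = 0; p < P; p++) {
            int z_lo = p * planes;              /* first plane owned by p */
            int z_hi = z_lo + planes - 1;       /* last plane owned by p  */
            printf("p%d owns planes z = %d..%d  (%d x %d x %d = %d elements)\n",
                   p, z_lo, z_hi, NX, NY, planes, NX * NY * planes);
        }
        return 0;
    }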

Slide 12: Performing a 3D FFT (part 2)
- Perform an FFT in all three dimensions
- With the 1D layout, 2 out of the 3 dimensions are local while the last (Z) dimension is distributed
- Step 1: FFTs on the columns (all elements local)
- Step 2: FFTs on the rows (all elements local)
- Step 3: FFTs in the Z dimension (requires communication)
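
The three steps follow from the standard separability of the 3D discrete Fourier transform (this identity is ours, not reproduced from the slides). With \omega_N = e^{-2\pi i / N}:

    X[k_x,k_y,k_z] = \sum_{n_z=0}^{N_Z-1} \omega_{N_Z}^{k_z n_z}
                     \sum_{n_y=0}^{N_Y-1} \omega_{N_Y}^{k_y n_y}
                     \sum_{n_x=0}^{N_X-1} \omega_{N_X}^{k_x n_x}\, x[n_x,n_y,n_z]

Under the 1D Z layout, the inner sums over n_x and n_y (Steps 1 and 2) touch only data resident on one processor, while the outer sum over n_z (Step 3) runs along the distributed dimension, which is why it needs the global transpose described next.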

Slide 13: Performing the 3D FFT (part 3)
- Steps 1 and 2 can be performed without communication, since all the data is locally available
- Perform a global transpose of the cube
  - Allows Step 3 to proceed
(Figure: the cube before and after the global transpose)

Slide 14: The Transpose
- Each processor has to scatter its input domain to the other processors
  - Every processor divides its portion of the domain into P pieces
  - Each of the P pieces is sent to a different processor
- Three different ways to break up the messages:
  1. Packed Slabs (i.e., a single packed "Alltoall" in MPI parlance)
  2. Slabs
  3. Pencils
  (Moving from 1 to 3: an order-of-magnitude increase in the number of messages and an order-of-magnitude decrease in the size of each message)
- "Slabs" and "Pencils" allow overlapping communication with computation and leverage the RDMA support in modern networks

Slide 15: Algorithm 1: Packed Slabs
- Example with P = 4, NX = NY = NZ = 16
  1. Perform all row and column FFTs
  2. Perform a local transpose
     - Data destined for a remote processor are grouped together
  3. Perform P puts of the data
(Figure: local transpose followed by one put to each of proc 0..3)
- For the grid across 64 processors: send 64 messages of 512 kB each
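
A rough UPC sketch of the Packed Slabs exchange, assuming a fixed thread count at compile time so THREADS can appear in the array dimensions; CHUNK, recvbuf, sendbuf, and the stub routines are our own illustrative names, not the talk's code (which used FFTW for the local FFTs).

    #include <upc.h>

    #define CHUNK 4096   /* doubles each thread sends to each destination (illustrative) */

    /* Destination thread d owns recvbuf[d]; source s deposits its piece into
       recvbuf[d][s]. Block size THREADS*CHUNK gives row d affinity to thread d. */
    shared [THREADS * CHUNK] double recvbuf[THREADS][THREADS][CHUNK];

    static double sendbuf[THREADS][CHUNK];   /* private, packed per destination */

    static void row_and_column_ffts(void) { /* local FFTW calls in the real code */ }

    static void local_transpose(void) {
        /* Group the elements destined for each thread into sendbuf[dest][...]. */
        int d, c;
        for (d = 0; d < THREADS; d++)
            for (c = 0; c < CHUNK; c++)
                sendbuf[d][c] = (double)(MYTHREAD * CHUNK + c);   /* placeholder data */
    }

    int main(void) {
        int d;

        row_and_column_ffts();      /* Step 1: all row and column FFTs   */
        local_transpose();          /* Step 2: pack data per destination */

        /* Step 3: one bulk put per destination, i.e., P messages of
           CHUNK * sizeof(double) bytes each, all issued back to back. */
        for (d = 0; d < THREADS; d++)
            upc_memput(&recvbuf[d][MYTHREAD][0], sendbuf[d], CHUNK * sizeof(double));

        upc_barrier;                /* transpose complete; Z-dimension FFTs follow */
        return 0;
    }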

Slide 16: Bandwidth Utilization
- NAS FT (Class D) with 256 processors on Opteron/InfiniBand
  - Each processor sends 256 messages of 512 kB
  - The global transpose (i.e., all-to-all exchange) achieves only 67% of the peak point-to-point bidirectional bandwidth
  - Many factors could cause this slowdown:
    - Network contention
    - The number of processors each processor communicates with
- Can we do better?

Slide 17: Algorithm 2: Slabs
- Waiting to send all data in one phase bunches up communication events
- Algorithm sketch:
  - For each of the NZ/P planes:
    - Perform all column FFTs
    - For each of the P "slabs" (a slab is NX/P rows):
      - Perform FFTs on the rows in the slab
      - Initiate a one-sided put of the slab
  - Wait for all puts to finish
  - Barrier
- Non-blocking RDMA puts allow data movement to be overlapped with computation; puts are spaced apart by the time needed to perform FFTs on NX/P rows
(Figure: puts to proc 0..3 issued from plane 0 while computation on the next plane starts)
- For the grid across 64 processors: send 512 messages of 64 kB each
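
Below is a sketch of the Slabs schedule using explicit-handle non-blocking puts. It assumes the UPC non-blocking transfer library (upc_nb.h: upc_handle_t, upc_memput_nb, upc_sync); the talk's implementation used Berkeley UPC's equivalent asynchronous-put extensions. Sizes, buffer names, and the FFT stubs are illustrative, and a fixed thread count at compile time is assumed.

    #include <upc.h>
    #include <upc_nb.h>     /* upc_handle_t, upc_memput_nb, upc_sync (optional library) */

    #define NPLANES 4       /* NZ/P planes owned by this thread (illustrative) */
    #define SLAB    2048    /* doubles in one slab = NX/P rows of a plane      */

    /* Thread d owns recvbuf[d]; source s writes its slab for plane z at
       offset (z*THREADS + s)*SLAB within that row. */
    shared [NPLANES * THREADS * SLAB] double recvbuf[THREADS][NPLANES * THREADS * SLAB];

    /* A distinct private buffer per outstanding put, so no buffer is reused
       while a transfer is still in flight. */
    static double slab[NPLANES][THREADS][SLAB];

    static void column_ffts(int z) { (void)z; /* local FFTW calls here */ }
    static void row_ffts(int z, int d, double *out) {
        int i;
        for (i = 0; i < SLAB; i++) out[i] = z + d + i * 1e-3;  /* placeholder */
    }

    int main(void) {
        upc_handle_t h[NPLANES][THREADS];
        int z, d;

        for (z = 0; z < NPLANES; z++) {
            column_ffts(z);                         /* all column FFTs for plane z */
            for (d = 0; d < THREADS; d++) {
                row_ffts(z, d, slab[z][d]);         /* FFT the NX/P rows headed to d */
                /* Initiate the put and keep computing: the NIC moves this slab
                   while the next slab's row FFTs (or the next plane) proceed. */
                h[z][d] = upc_memput_nb(
                    &recvbuf[d][(size_t)(z * THREADS + MYTHREAD) * SLAB],
                    slab[z][d], SLAB * sizeof(double));
            }
        }

        for (z = 0; z < NPLANES; z++)               /* wait for all puts to finish */
            for (d = 0; d < THREADS; d++)
                upc_sync(h[z][d]);
        upc_barrier;                                /* transpose done; Z FFTs follow */
        return 0;
    }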

Slide 18: Algorithm 3: Pencils
- Further reduce the granularity of communication
  - Send a row (pencil) as soon as it is ready
- Algorithm sketch:
  - For each of the NZ/P planes:
    - Perform all 16 column FFTs
    - For r = 0; r < NX/P; r++:
      - For each slab s in the plane:
        - Perform the FFT on row r of slab s
        - Initiate a one-sided put of row r
  - Wait for all puts to finish
  - Barrier
- Large increase in message count
- Communication events are finely diffused through the computation
  - Maximum amount of overlap
  - Communication starts early
(Figure: puts issued per pencil while computation on the next plane starts)
- For the grid across 64 processors: send 4096 messages of 8 kB each
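
The same schedule at pencil granularity, this time sketched with the implicit-handle flavor of the non-blocking library (upc_memput_nbi / upc_synci, again an assumption about the available API; the talk used Berkeley UPC's asynchronous extensions). One put is issued per row as soon as that row's FFT completes.

    #include <upc.h>
    #include <upc_nb.h>     /* upc_memput_nbi, upc_synci (implicit-handle variants) */

    #define NPLANES 4       /* NZ/P planes owned by this thread (illustrative) */
    #define ROWS    4       /* rows per slab = NX/P                            */
    #define ROWLEN  512     /* doubles in one pencil (one row)                 */

    /* Thread d owns recvbuf[d]; source s writes row r of its slab for plane z
       at offset ((z*THREADS + s)*ROWS + r)*ROWLEN. */
    shared [NPLANES * THREADS * ROWS * ROWLEN] double
        recvbuf[THREADS][NPLANES * THREADS * ROWS * ROWLEN];

    static double pencil[NPLANES][ROWS][THREADS][ROWLEN];  /* distinct buffer per put */

    static void column_ffts(int z) { (void)z; /* local FFTW calls here */ }
    static void row_fft(int z, int d, int r, double *out) {
        int i;
        for (i = 0; i < ROWLEN; i++) out[i] = z + d + r + i * 1e-3;  /* placeholder */
    }

    int main(void) {
        int z, r, d;

        for (z = 0; z < NPLANES; z++) {
            column_ffts(z);
            for (r = 0; r < ROWS; r++)
                for (d = 0; d < THREADS; d++) {
                    row_fft(z, d, r, pencil[z][r][d]);
                    /* Put each pencil as soon as it is ready; the implicit
                       handle keeps the bookkeeping out of the inner loop. */
                    upc_memput_nbi(
                        &recvbuf[d][(size_t)((z * THREADS + MYTHREAD) * ROWS + r) * ROWLEN],
                        pencil[z][r][d], ROWLEN * sizeof(double));
                }
        }

        upc_synci();        /* wait for every implicit-handle put issued above */
        upc_barrier;
        return 0;
    }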

Slide 19: Communication Requirements
- For the grid across 64 processors:
  - Alg 1: Packed Slabs: send 64 messages of 512 kB
  - Alg 2: Slabs: send 512 messages of 64 kB
  - Alg 3: Pencils: send 4096 messages of 8 kB
- With Slabs, GASNet is slightly faster than MPI
- GASNet achieves close to peak bandwidth with Pencils, but MPI is about 50% less efficient at 8 kB
- With the message sizes in Packed Slabs, both communication systems reach the same peak bandwidth
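
A quick check on the slide's numbers: all three algorithms move the same total volume per processor; only the granularity (and hence the opportunity for overlap and the per-message overhead) changes:

    64 \times 512\,\text{kB} \;=\; 512 \times 64\,\text{kB} \;=\; 4096 \times 8\,\text{kB} \;=\; 32\,\text{MB per processor}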

Slide 20: Platforms
- Opteron/InfiniBand (NERSC)
  - Processor: dual 2.2 GHz Opteron (320 nodes, 4 GB/node)
  - Network: Mellanox Cougar InfiniBand 4x HCA
  - Software: Linux 2.6.5, Mellanox VAPI, MVAPICH 0.9.5, Pathscale CC/F
- Alpha/Elan3 (PSC)
  - Processor: quad 1 GHz Alpha (750 nodes, 4 GB/node)
  - Network: Quadrics QsNet1 Elan3 with dual rail (one rail used)
  - Software: Tru64 v5.1, Elan3 libelan, Compaq C, HP Fortran Compiler X5.5A E1K
- Itanium2/Elan4 (LLNL)
  - Processor: quad 1.4 GHz Itanium2 (1024 nodes, 8 GB/node)
  - Network: Quadrics QsNet2 Elan4
  - Software: Linux (CHAOS), Elan4 libelan, Intel ifort, icc
- P4/Myrinet (UC Berkeley Millennium Cluster)
  - Processor: dual 3.0 GHz Pentium 4 Xeon (64 nodes, 3 GB/node)
  - Network: Myricom Myrinet 2000 M3S-PCI64B
  - Software: Linux, GM, Intel ifort, icc

Slide 21: Comparison of Algorithms
- Compare the 3 algorithms against the original NAS FT
  - All versions, including the Fortran one, use FFTW for the local 1D FFTs
  - Largest class that fits in memory (usually Class D)
- All UPC flavors outperform the original Fortran/MPI implementation by at least 20%
  - One-sided semantics allow even exchange-based implementations to improve over MPI implementations
  - The overlap algorithms spread the messages out, easing the bottlenecks
  - ~1.9x speedup in the best case
(Chart: performance comparison; higher is better)

Slide 22: Time Spent in Communication
- Implemented the 3 algorithms in MPI using Irecvs and Isends
- Compare the time spent initiating or waiting for communication to finish
  - UPC consistently spends less time in communication than its MPI counterpart
  - MPI was unable to handle the Pencils algorithm in some cases (marked "MPI Crash (Pencils)" in the chart)
(Chart: communication time; lower is better)

Slide 23: Performance Summary
(Chart: MFLOPS per processor; higher is better)

Slide 24: Conclusions
- The one-sided semantics used in GAS languages, such as UPC, provide a more natural fit to modern networks
  - Benchmarks demonstrate these advantages
- Use these advantages to alleviate communication bottlenecks in bandwidth-limited applications
  - Paradoxically, it helps to send more, smaller messages
- Both two-sided and one-sided implementations can benefit from overlap
  - One-sided implementations consistently outperform their two-sided counterparts because the communication model is a more natural fit
- Send early and often to avoid communication bottlenecks

Slide 25: Try It!
- Berkeley UPC is open source
  - Download it from the Berkeley UPC website
  - Install it with the CDs that we have here

Slide 26: Contact Us
- Authors:
  - Christian Bell
  - Dan Bonachea
  - Rajesh Nishtala
  - Katherine A. Yelick
- Special thanks to the fellow members of the Berkeley UPC Group: Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome
- Associated paper: IPDPS '06 Proceedings
- Berkeley UPC Website:
- GASNet Website: