Unified Parallel C
Kathy Yelick, EECS, U.C. Berkeley and NERSC/LBNL
NERSC Team: Dan Bonachea, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Christian Bell

Outline
Global Address Space Languages in General
–Programming models
Overview of Unified Parallel C (UPC)
–Programmability advantage
–Performance opportunity
Status
–Next steps
Related projects

Programming Model 1: Shared Memory
Program is a collection of threads of control.
–Many languages allow threads to be created dynamically.
Each thread has a set of private variables, e.g., local variables on the stack, and the threads collectively share a set of shared variables, e.g., static variables, shared common blocks, global heap.
–Threads communicate implicitly by writing/reading shared variables.
–Threads coordinate using synchronization operations on shared variables.
(Figure: threads P0..Pn, each with private state, reading and writing shared variables x and y.)
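A minimal sketch (added for illustration, not from the original slides) of the shared memory model in C with POSIX threads: one thread writes a shared variable, the main thread reads it, and the join provides the synchronization.
/* Minimal shared-memory sketch: implicit communication through a shared variable. */
#include <pthread.h>
#include <stdio.h>
static int x = 0;                               /* shared variable */
static void *writer(void *arg) {
    (void)arg;
    x = 42;                                     /* implicit communication: write shared data */
    return NULL;
}
int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);     /* dynamic thread creation */
    pthread_join(t, NULL);                      /* synchronization before reading */
    printf("x = %d\n", x);                      /* read of the shared variable */
    return 0;
}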

Programming Model 2: Message Passing
Program consists of a collection of named processes.
–Usually fixed at program startup time
–Thread of control plus local address space -- NO shared data
–Logically shared data is partitioned over local processes
Processes communicate by explicit send/receive pairs
–Coordination is implicit in every communication event
–MPI is the most common example
(Figure: processes P0..Pn with private data X and Y communicating via matched send/recv.)
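A minimal MPI sketch (added for illustration, not from the slides) of the explicit send/receive pairing described above; run with at least two processes.
/* Minimal message-passing sketch: process 0 sends X, process 1 receives it into Y. */
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, X = 42, Y = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(&X, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&Y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received Y = %d\n", Y);
    }
    MPI_Finalize();
    return 0;
}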

Tradeoffs Between the Models
Shared memory
+Programming is easier
  Can build large shared data structures
–Machines don't scale
  SMPs typically < 16 processors (Sun, DEC, Intel, IBM)
  Distributed shared memory < 128 (SGI)
–Performance is hard to predict and control
Message passing
+Machines are easier to build from commodity parts
+Can scale (given sufficient network)
–Programming is harder
  Distributed data structures exist only in the programmer's mind
  Tedious packing/unpacking of irregular data structures

Global Address Space Programming
Intermediate point between message passing and shared memory
Program consists of a collection of processes.
–Fixed at program startup time, like MPI
Local and shared data, as in the shared memory model
–But shared data is partitioned over local processes
–Remote data stays remote on distributed memory machines
–Processes communicate by reads/writes to shared variables
Examples are UPC, Titanium, CAF, Split-C
Note: these are not data-parallel languages
–Heroic compilers are not required
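A minimal UPC sketch (added, not part of the original slides) of the global address space model: shared data is partitioned across threads, yet any thread can read all of it directly with ordinary array syntax.
/* Each thread writes its own element; thread 0 then reads them all. */
#include <upc_relaxed.h>
#include <stdio.h>
shared int counts[THREADS];          /* one element with affinity to each thread */
int main(void) {
    counts[MYTHREAD] = MYTHREAD;     /* write to the locally owned element */
    upc_barrier;                     /* make all writes visible */
    if (MYTHREAD == 0) {
        int i, sum = 0;
        for (i = 0; i < THREADS; i++)
            sum += counts[i];        /* remote reads on a distributed-memory machine */
        printf("sum = %d\n", sum);
    }
    return 0;
}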

GAS Languages on Clusters of SMPs
Clusters of SMPs (CLUMPs)
–IBM SP: 16-way SMP nodes
–Berkeley Millennium: 2-way and 4-way nodes
What is an appropriate programming model?
–Use message passing throughout
  Most common model
  Unnecessary packing/unpacking overhead
–Hybrid models
  Write 2 parallel programs (MPI + OpenMP or threads)
–Global address space
  Only adds a test (on-node/off-node) before each local read/write
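As a rough sketch of the "test before local read/write" idea (added; all names below are hypothetical and are not the Berkeley UPC runtime API), a compiler targeting a cluster of SMPs might lower a shared read roughly as follows.
/* Hypothetical lowering of a shared read on a cluster of SMPs. */
#include <stddef.h>
typedef struct { int node; void *addr; } gas_ptr_t;              /* hypothetical fat pointer: (node, local address) */
extern int my_node;                                              /* hypothetical: this process's SMP node */
extern void net_get(void *dst, int node, void *src, size_t n);   /* hypothetical one-sided remote get */
static void gas_read_int(int *dst, gas_ptr_t p) {
    if (p.node == my_node)
        *dst = *(int *)p.addr;                       /* on-node: ordinary load through shared memory */
    else
        net_get(dst, p.node, p.addr, sizeof(int));   /* off-node: small-message remote get */
}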

Support for GAS Languages
Unified Parallel C (UPC)
–Funded by the NSA
–Compaq compiler for Alpha/Quadrics
–HP, Sun and Cray compilers under development
–Gcc-based compiler for SGI (Intrepid)
–Gcc-based compiler (SRC) for Cray T3E
–MTU and Compaq effort for MPI-based compiler
–LBNL compiler based on Open64
Co-Array Fortran (CAF)
–Cray compiler
–Rice and UMN effort based on Open64
SPMD Java (Titanium)
–UCB compiler available for most machines

Parallelism Model in UPC
UPC uses an SPMD model of parallelism
–A set of THREADS threads working independently
Two compilation models
–THREADS may be fixed at compile time, or
–Dynamically set at program startup time
MYTHREAD specifies the thread index (0..THREADS-1)
Basic synchronization mechanisms
–Barriers (normal and split-phase), locks
What UPC does not do automatically:
–Determine data layout
–Load balance – move computations
–Caching – move data
These are intentionally left to the programmer
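A minimal UPC sketch (added for illustration) of the SPMD model: THREADS and MYTHREAD, a lock protecting a shared counter, and a split-phase barrier.
/* Every thread runs main(); the lock serializes updates to the shared counter. */
#include <upc_relaxed.h>
#include <stdio.h>
shared int hits;                               /* static shared storage, zero-initialized */
int main(void) {
    upc_lock_t *lock = upc_all_lock_alloc();   /* collective: all threads get the same lock */
    upc_lock(lock);
    hits += 1;                                 /* one update per thread, protected by the lock */
    upc_unlock(lock);
    upc_notify;                                /* split-phase barrier: signal arrival */
    /* independent local work could overlap with the barrier here */
    upc_wait;                                  /* complete the barrier */
    if (MYTHREAD == 0)
        printf("hits = %d, THREADS = %d\n", hits, THREADS);
    return 0;
}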

UPC Pointers
Pointers may point to shared or private variables
Same syntax for use, just add a qualifier
shared int *sp;
int *lp;
sp is a pointer to an integer residing in the shared memory space.
sp is called a shared pointer (somewhat sloppy).
(Figure: in the global address space, sp may reference the shared variable x (value 3); lp references private data.)
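A slightly fuller sketch (added) of the two pointer flavors: lp may only reference this thread's private data, while sp may reference shared data with affinity to any thread.
#include <upc_relaxed.h>
shared int x[THREADS];                               /* one shared element per thread */
int main(void) {
    int local = 0;
    int *lp = &local;                                /* private pointer to private data */
    shared int *sp = &x[(MYTHREAD + 1) % THREADS];   /* private pointer to (possibly remote) shared data */
    *lp = MYTHREAD;                                  /* ordinary local store */
    *sp = MYTHREAD;                                  /* may turn into a remote write */
    upc_barrier;
    return 0;
}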

Shared Arrays in UPC
Shared array elements are spread across the threads
shared int x[THREADS];    /* One element per thread */
shared int y[3][THREADS]; /* 3 elements per thread */
shared int z[3*THREADS];  /* 3 elements per thread, cyclic */
In the pictures (assume THREADS = 4; elements with affinity to thread 0 are shown in red), x and y appear blocked while z is cyclic. Of course, y is really a 2D array.
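A small UPC sketch (added): x and z as declared above, plus a layout qualifier (the [3] block size on w is an addition, not in the slide) and a upc_forall loop whose affinity clause keeps every write local.
#include <upc_relaxed.h>
shared int x[THREADS];          /* one element per thread, as above */
shared int z[3*THREADS];        /* default layout: cyclic, element i on thread i % THREADS */
shared [3] int w[3*THREADS];    /* explicit block size 3: 3 consecutive elements per thread */
int main(void) {
    int i;
    upc_forall (i = 0; i < 3*THREADS; i++; &z[i])
        z[i] = MYTHREAD;        /* affinity clause assigns iteration i to the owner of z[i] */
    upc_barrier;
    return 0;
}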

Overlapping Communication in UPC
Programs with fine-grained communication require overlap for performance
The UPC compiler does this automatically for "relaxed" accesses.
–Accesses may be designated as strict, relaxed, or unqualified (the default).
–There are several ways of designating the ordering type:
  A type qualifier, strict or relaxed, can be used to affect all variables of that type.
  Labels strict or relaxed can be used to control the accesses within a statement:
    strict : { x = y; z = y+1; }
  A strict or relaxed cast can be used to override the current label or type qualifier.
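A minimal UPC sketch (added) of the strict/relaxed distinction: the data accesses are relaxed, and so may be reordered or overlapped, while a strict flag orders them for a simple producer/consumer handoff. It assumes at least two threads.
#include <upc_relaxed.h>           /* unqualified shared accesses default to relaxed */
shared int data[THREADS];          /* relaxed: may be overlapped by the compiler/runtime */
strict shared int ready[THREADS];  /* strict: ordered, acts as a fence for earlier relaxed accesses */
int main(void) {
    if (MYTHREAD == 0) {
        data[1] = 42;              /* relaxed write */
        ready[1] = 1;              /* strict write: becomes visible only after data[1] does */
    } else if (MYTHREAD == 1) {
        while (ready[1] == 0)      /* strict read: cannot be hoisted out of the loop */
            ;
        /* here data[1] is guaranteed to be 42 */
    }
    upc_barrier;
    return 0;
}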

Performance of UPC
Reasons why UPC may be slower than MPI
–Shared array indexing is expensive
–Small messages are encouraged by the model
Reasons why UPC may be faster than MPI
–MPI encourages synchrony
–Buffering is required for many MPI calls
–A remote read/write of a single word may require very little overhead
  Cray T3E, Quadrics interconnect (next version)
Assuming overlapped communication, the real issue is overhead: how much time does it take to issue a remote read/write?

UPC vs. MPI: Sparse MatVec Multiply
Short term goal:
–Evaluate language and compilers using small applications
Longer term, identify large applications
Show advantage of the T3E network model and UPC
Performance on the Compaq machine is worse:
–Serial code
–Communication performance
–New compiler just released

UPC versus MPI for Edge Detection
(Figures: a. Execution time; b. Scalability)
Performance from the Cray T3E; benchmark developed by El-Ghazawi's group at GWU

UPC versus MPI for Matrix Multiplication
(Figures: a. Execution time; b. Scalability)
Performance from the Cray T3E; benchmark developed by El-Ghazawi's group at GWU

Implementing UPC
UPC extensions to C are small
–< 1 person-year to implement in an existing compiler
Simplest approach
–Reads and writes of shared pointers become small message puts/gets
–UPC has a "relaxed" keyword for nonblocking communication
–Small message performance is key
Advanced optimizations include conversion to bulk communication by either
–the application programmer, or
–the compiler
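As an added sketch of the bulk-communication conversion mentioned above, the element-by-element loop below can be replaced by a single upc_memget transfer (both forms read one peer's block; shown side by side for comparison).
#include <upc_relaxed.h>
#define N 1024
shared [N] double a[N*THREADS];   /* N contiguous elements per thread */
double buf[N];
int main(void) {
    int i, peer = (MYTHREAD + 1) % THREADS;
    upc_barrier;                                      /* assume a[] was initialized earlier */
    for (i = 0; i < N; i++)                           /* fine-grained: one small get per element */
        buf[i] = a[peer*N + i];
    upc_memget(buf, &a[peer*N], N * sizeof(double));  /* bulk: one large transfer */
    return 0;
}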

Overview of NERSC Compiler
1) Compiler
–Portable compiler infrastructure (UPC->C)
–Explore optimizations: communication, shared pointers
–Based on Open64: plan to release sources
2) Runtime systems for multiple compilers
–Allow use by other languages (Titanium and CAF)
–And in other UPC compilers, e.g., Intrepid
–Performance of small message put/get is key
–Designed to be easily ported, then tuned
–Also designed for low overhead (macros, inline functions)

Compiler and Runtime Status
Basic parsing and type-checking complete
Generates code for small serial kernels
–Still testing and debugging
–Needs runtime for complete testing
UPC runtime layer
–Initial implementation should be done this month
–Based on processes (not threads) on GASNet
GASNet
–Initial specification complete
–Reference implementation done on MPI
–Working on Quadrics and IBM (LAPI…)

Benchmarks for GAS Languages
EEL – end-to-end latency, or time spent sending a short message between two processes
BW – large message network bandwidth
Parameters of the LogP model
–L – "latency", or time spent on the network
  During this time, the processor can be doing other work
–O – "overhead", or processor busy time on the sending or receiving side
  During this time, the processor cannot be doing other work
  We distinguish between "send" and "recv" overhead
–G – "gap", the rate at which messages can be pushed onto the network
–P – the number of processors

LogP Parameters: Overhead & Latency
(Figure: timelines for P0 and P1 showing o_send, L, and o_recv in two cases.)
Non-overlapping overhead: EEL = o_send + L + o_recv
Send and recv overhead can overlap: EEL = f(o_send, L, o_recv)

Benchmarks
Designed to measure the network parameters
–Also provide: gap as a function of queue depth
–Measured for the "best case" in general
Implemented once in MPI
–For portability and comparison to the target-specific layer
Implemented again in each target-specific communication layer:
–LAPI
–ELAN
–GM
–SHMEM
–VIPL
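A minimal MPI ping-pong sketch (added; the actual benchmark suite is more elaborate) of the kind used to measure end-to-end latency (EEL) for short messages; run with two processes.
/* Round-trip ping-pong; EEL is estimated as half the average round-trip time. */
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, i, iters = 10000;
    char msg = 0;
    double t0, t1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("EEL ~ %g us\n", (t1 - t0) / (2.0 * iters) * 1e6);
    MPI_Finalize();
    return 0;
}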

Results: EEL and Overhead

Results: Gap and Overhead

Send Overhead Over Time
Overhead has not improved significantly; the T3D was best
–Lack of integration; lack of attention in software

Summary
Global address space languages offer an alternative to MPI for large machines
–Easier to use: shared data structures
–Recover users left behind on shared memory?
–Performance tuning still possible
Implementation
–Small compiler effort given lightweight communication
–Portable communication layer: GASNet
–Difficulty with small message performance on the IBM SP platform