Unified Parallel C (UPC) and the Berkeley UPC Compiler
Wei Chen and the Berkeley UPC Group
Qualifying Exam, 3/11/07


Slide 1: Title
Unified Parallel C (UPC) and the Berkeley UPC Compiler. Wei Chen, the Berkeley UPC Group. 3/11/07.

Slide 2: Parallel Programming

Most parallel programs are written using either:
- Message passing with an SPMD model
  - Usually for scientific applications, with C++/Fortran
  - Scales easily: user-controlled data layout
  - Hard to use: send/receive matching, message packing/unpacking
- Shared memory with OpenMP/pthreads/Java
  - Usually for non-scientific applications
  - Easier to program: direct reads and writes to shared data
  - Hard to scale: (mostly) limited to SMPs, no concept of locality
PGAS is an alternative, hybrid model.

Slide 3: Partitioned Global Address Space

- The PGAS model uses a global address space abstraction
  - Shared memory is partitioned among processors
  - User-controlled data layout (global pointers and distributed arrays)
- One-sided communication: uses RDMA support for reads/writes of shared variables
  - Much faster than message passing for small/medium-size messages
- The hybrid model works for both SMPs and clusters
- Languages: Titanium, Co-Array Fortran, UPC
(Figure: the global address space split into a shared partition per thread, holding X[0], X[1], ..., X[P], and a private space per thread holding an ordinary pointer.)
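As a concrete sketch of the shared/private split (UPC syntax: shared, MYTHREAD, and THREADS are part of the language, while the variable names here are made up for illustration):

```
shared int X[THREADS];  /* one element of X has affinity to each thread */
int *p;                 /* ordinary C pointer: private, thread-local data */
shared int *sp;         /* pointer-to-shared: may reference any partition */

/* Any thread may read or write any X[i] directly (one-sided access);
   X[MYTHREAD] lives in the local partition and needs no communication. */
sp = &X[(MYTHREAD + 1) % THREADS];
int v = *sp;            /* a one-sided remote read on a distributed machine */
```

The read through sp is what distinguishes PGAS from message passing: no matching receive is posted on the owning thread.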

Slide 4: Unified Parallel C

- An SPMD parallel extension of C
- PGAS: adds a shared qualifier to the type system
- Several kinds of shared array distributions
- Fine-grained and bulk communication
- Commercial compilers from Cray/HP/IBM
- Open-source compiler: Berkeley UPC

Vector addition in UPC:

    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];     /* cyclic layout */
    void main() {
        for (int i = 0; i < N; i++)
            if (MYTHREAD == i % THREADS) /* SPMD ownership test */
                sum[i] = v1[i] + v2[i];
    }

Slide 5: Overview of the Berkeley UPC Compiler

Two goals: portability and high performance.

The compilation stack, top to bottom:
- UPC-to-C translator: lowers UPC code into ISO C code (platform-independent, network-independent)
- Translator-generated C code
- Berkeley UPC runtime system: shared-memory management and pointer operations (compiler-independent)
- GASNet communication system: uniform get/put interface over the underlying networks (language-independent)
- Network hardware

Slide 6: UPC-to-C Translator

Pipeline: preprocessed UPC source -> parsing -> WHIRL with shared types -> optimizer -> optimized WHIRL -> lowering -> WHIRL with runtime calls -> WHIRL2C -> ISO C code -> backend C compiler.

- Based on Open64
  - Extends the IR with shared types
  - Reuses its analysis framework
  - Adds UPC-specific optimizations
- Portable translation
  - High-level IR
  - Config file for platform-dependent information
  - Re-includes library headers
- Lowering converts shared-memory operations into runtime calls
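To make the last step concrete, here is a hypothetical before/after for lowering a shared write into a runtime call. The upcr_* names and the fat-pointer representation are illustrative placeholders, not the runtime's exact API:

```
/* UPC source */
shared int s;
s = x + 1;

/* Conceptually, the translator emits ISO C along these lines: */
upcr_shared_ptr_t s_ptr;          /* "fat" pointer: (thread, phase, address) */
upcr_put_shared_val(s_ptr, x + 1, sizeof(int)); /* runtime resolves locality */
```

The backend C compiler never sees the shared qualifier; it only sees ordinary C plus calls into the runtime library.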

Slide 7: Optimization Framework

- A combination of language, compiler, and runtime support
  - Transparent to the user
  - Performance portable
- Short-term goal: effective on different cluster networks
- Long-term goal: code designed for SMPs gets good performance on clusters
- Optimize regular array accesses (e.g., A[i][j][k]): loop framework for message vectorization and strip mining
- Optimize irregular pointer accesses (e.g., p->x->y): PRE framework with split-phase access and coalescing
- Nonblocking bulk communication (e.g., upc_memget(dst, src, size)): runtime framework for communication overlap
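The split-phase idea can be sketched with Berkeley UPC's nonblocking bulk extensions. The bupc_memget_async/bupc_waitsync names follow the Berkeley UPC extension documentation, but exact signatures and headers vary by release; do_independent_work and consume are placeholders:

```
shared [N] double A[THREADS][N];
double buf[N];

/* Initiate the bulk get, overlap it with unrelated computation,
   then synchronize only when the data is actually needed. */
bupc_handle_t h = bupc_memget_async(buf, &A[(MYTHREAD + 1) % THREADS][0],
                                    N * sizeof(double));
do_independent_work();   /* runs while the transfer is in flight */
bupc_waitsync(h);        /* blocks until the get has completed */
consume(buf);
```

Splitting initiation from completion is what lets the compiler and runtime hide network latency behind computation.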

Slide 8: Application Performance: LU Decomposition

- UPC performance is comparable to MPI/HPL (Linpack) with less than half the code size
- Uses lightweight multithreading atop SPMD, making it latency tolerant
- Highly adaptable to different problem and machine sizes

Slide 9: Application Performance: 3D FFT

- The one-sided UPC approach sends more, smaller messages
  - Same total volume of data, but sent earlier and more often
  - Aggressively overlaps the transpose with the second 1-D FFT
- The same approach is less effective in MPI due to its higher per-message cost
- Consistently outperforms MPI-based implementations, by as much as 2x
(Figure: MFLOPS per processor; up is good.)

Slide 10: Current Status

- Public release v2.4 in November 2006
  - Fully compliant with the UPC 1.2 specification
  - Communication optimizations
  - Extensions for performance and programmability
- Supported from laptops to supercomputers
  - OS: UNIX (Linux, BSD, AIX, Solaris, etc.), Mac, Cygwin
  - Architectures: x86, Itanium, Opteron, Alpha, PPC, SPARC, Cray X1, NEC SX-6, Blue Gene, etc.
  - Networks: SMP, Myrinet, Quadrics, InfiniBand, IBM LAPI, MPI, Ethernet, SHMEM, etc.
- Give us a try at

Slide 11: Summary

- UPC is designed to be consistent with C
  - Exposes memory layout
  - Flexible communication with pointers and arrays
  - Gives users more control to achieve high performance
- The Berkeley UPC compiler provides an open-source, portable implementation
- Hand-optimized UPC programs match and often beat MPI's performance
- Research goal: productive users plus an efficient compiler