Unified Parallel C at LBNL/UCB The Berkeley UPC Compiler: Implementation and Performance. Wei Chen, the LBNL/Berkeley UPC Group

Unified Parallel C at LBNL/UCB Overview of the Berkeley UPC Compiler
Two goals: portability and high performance.
Compilation pipeline (all encapsulated in one "upcc" command):
UPC Code -> Translator -> Generated C Code -> Berkeley UPC Runtime System -> GASNet Communication System -> Network Hardware
- Translator: platform-independent
- Generated C code: network-independent
- Runtime system: compiler-independent
- GASNet: language-independent

Unified Parallel C at LBNL/UCB A Layered Design
UPC-to-C translator: translates UPC code into C, inserting calls to the runtime library for parallel features
UPC runtime: allocates/initializes shared data and performs operations on pointers-to-shared
GASNet: a uniform interface for low-level communication primitives
Portable:
- C is our intermediate language
- GASNet itself has a layered design with a small core
High-performance:
- The native C compiler optimizes the serial code
- The translator can perform communication optimizations
- GASNet can access the network directly

Unified Parallel C at LBNL/UCB Implementing the UPC-to-C Translator
Pipeline: Source File -> UPC front end -> Whirl w/ shared types -> Backend lowering -> Whirl w/ runtime calls -> Whirl2c -> ANSI-compliant C Code
- Based on Open64
- Supports both 32- and 64-bit platforms
- Designed to incorporate the existing optimization frameworks in Open64 (LNO, IPA, WOPT)
- Communicates with the runtime via a standard API and configuration files
- We will present our implementation at the Open64 workshop in March

Unified Parallel C at LBNL/UCB Components in the Translator
Front end:
- UPC extensions to C: the shared qualifier, block sizes, forall loops, built-in functions and values (THREADS, memget, etc.), strict/relaxed
- Parses and type-checks UPC code and generates Whirl, with UPC-specific information available in the symbol table
Backend:
- Transforms shared reads and writes into calls to the runtime library; the calls can be blocking, non-blocking, bulk, or register-based (see the sketch below)
- Applies standard optimizations and analyses
Whirl2c:
- Converts Whirl back to C, with shared variables declared as opaque pointer-to-shared types
- Special handling for static user data
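To make the lowering concrete, here is a rough sketch of the kind of code the backend emits for a simple shared read. The upcr_*-style names and signatures are illustrative placeholders rather than the exact Berkeley runtime API; the real runtime also provides the non-blocking, bulk, and register-based variants mentioned above.

    /* UPC source (fragment) */
    shared int x;
    int read_x_plus_one(void) { return x + 1; }

    /* Translated C (schematic; runtime names are placeholders) */
    upcr_pshared_ptr_t x_handle;            /* opaque pointer-to-shared for x */
    int read_x_plus_one(void) {
        int tmp;
        /* blocking 4-byte read from the shared location that x_handle names */
        upcr_get_pshared(&tmp, x_handle, 0 /* offset */, sizeof(int));
        return tmp + 1;
    }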

Unified Parallel C at LBNL/UCB Pointer-to-Shared: Phases
UPC has three different kinds of distributed arrays:
- Block-cyclic: shared [4] double a[n];
- Cyclic: shared double a[n];
- Indefinite (local to the allocating thread): shared [] double *a = (shared [] double *) upc_alloc(n);
A pointer needs a "phase" to keep track of where it is within a block
- This is a source of overhead when updating and dereferencing pointers (see the sketch below)
Special cases for "phaseless" pointers:
- Cyclic pointers always have phase 0
- Indefinite pointers have only one block
- No need to keep a phase in pointer operations for cyclic and indefinite pointers
- No need to update the thread ID for indefinite pointers
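A minimal sketch of the phase bookkeeping described above, assuming a struct representation with address, thread, and phase fields. The field names, the helper name sptr_incr, and the layout are illustrative, not the Berkeley runtime's actual code.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uintptr_t addr;    /* local address within the owning thread's partition */
        unsigned  thread;  /* owning thread ID                                   */
        unsigned  phase;   /* position within the current block                  */
    } sptr_t;

    /* Advance p by one element of size elem_sz in a block-cyclic array with
       block size blk (in elements) distributed over nthreads threads. */
    sptr_t sptr_incr(sptr_t p, size_t elem_sz, size_t blk, unsigned nthreads)
    {
        p.phase++;
        p.addr += elem_sz;
        if (p.phase == blk) {                 /* crossed a block boundary       */
            p.phase = 0;
            p.thread = (p.thread + 1) % nthreads;
            if (p.thread != 0)                /* same block row, next thread:   */
                p.addr -= blk * elem_sz;      /* rewind to the start of the row */
            /* if we wrapped back to thread 0, addr already points at the next row */
        }
        return p;
    }

This per-increment branching is exactly what the phaseless special cases avoid: cyclic pointers never carry a nonzero phase, and indefinite pointers never change their thread field.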

Unified Parallel C at LBNL/UCB Thread 1Thread N -1 Address ThreadPhase 0 2addr Phase Shared Memory Thread 0 block size start of array object … … Accessing Shared Memory in UPC

Unified Parallel C at LBNL/UCB Pointer-to-Shared Representation
Important to performance, since it affects all shared operations
Shared pointer representation trade-offs (see the sketch below):
- Using scalar types rather than a struct may improve backend code quality
- Faster pointer manipulation (e.g., ptr + int) as well as faster dereferencing
- These matter in C, because array references are based on pointers
- A smaller pointer size may also help performance
- A packed 8-byte format allows a pointer to reside in a single register
- But very large machines may require a longer representation
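A hedged sketch of the two representations being traded off. The bit widths in the packed format (10 phase bits, 16 thread bits, 38 address bits) are invented for illustration; the actual Berkeley layouts may differ.

    #include <stdint.h>

    /* Struct representation: straightforward, but wider than one register. */
    typedef struct {
        void    *addr;     /* local virtual address       */
        uint32_t thread;   /* owning thread               */
        uint32_t phase;    /* offset within current block */
    } sptr_struct_t;

    /* Packed representation: the whole pointer fits in one 64-bit register,
       so ptr + int and dereference paths compile to plain integer arithmetic. */
    typedef uint64_t sptr_packed_t;

    #define SPTR_PHASE_BITS  10
    #define SPTR_THREAD_BITS 16
    #define SPTR_ADDR_BITS   38   /* 10 + 16 + 38 = 64 */

    static inline uint64_t sptr_phase(sptr_packed_t p)  {
        return p & ((1ULL << SPTR_PHASE_BITS) - 1);
    }
    static inline uint64_t sptr_thread(sptr_packed_t p) {
        return (p >> SPTR_PHASE_BITS) & ((1ULL << SPTR_THREAD_BITS) - 1);
    }
    static inline uint64_t sptr_addr(sptr_packed_t p)   {
        return p >> (SPTR_PHASE_BITS + SPTR_THREAD_BITS);
    }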

Unified Parallel C at LBNL/UCB Let the Users Decide
The compiler offers two pointer-to-shared configurations:
- A packed 8-byte format that gives better performance
- A struct format for large-scale programs
Balancing portability and performance in the UPC compiler:
- The representation is hidden in the runtime layer
- It can easily be switched at compiler installation time
- The modular design makes it easy to add new representations (the packed format was done in one day)
- Phaseless pointers may get a different representation of their own (skipping the phase field)

Unified Parallel C at LBNL/UCB Preliminary Performance
Testbed:
- Compaq AlphaServer at ORNL, with the Quadrics conduit
- Compaq C compiler for the translated C code
Microbenchmarks:
- Measure the cost of UPC language features and constructs
- Shared pointer arithmetic, forall, allocation, etc.
- Vector addition: no remote communication
Performance-tuning benchmarks (Costin):
- Measure the effectiveness of various communication optimizations
- Scale: tests message pipelining and software pipelining
NAS Parallel Benchmarks (Parry):
- EP: no communication
- IS: large bulk memory operations
- MG: bulk memput

Unified Parallel C at LBNL/UCB Performance of Shared Pointer Arithmetic
- Phaseless pointers are an important optimization
- Indefinite pointers are almost as fast as regular C pointers
- Packing also helps, especially for pointer-plus-integer addition
(1 cycle = 1.5 ns)

Unified Parallel C at LBNL/UCB Comparison with HP UPC v1.7
- HP is a little faster because it generates native code
- The gap for addition would likely shrink with further hand-tuning
(1 cycle = 1.5 ns)

Unified Parallel C at LBNL/UCB Cost of Shared Memory Access
- Local shared accesses are somewhat slower than private accesses
- HP has improved local access performance in its new version
- Remote accesses are worse than local ones, as expected
- The runtime/GASNet layering for portability is not a problem

Unified Parallel C at LBNL/UCB UPC Loops
UPC has a "forall" construct for distributing computation:
    shared int v1[N], v2[N], v3[N];
    upc_forall(i = 0; i < N; i++; &v3[i])
        v3[i] = v2[i] + v1[i];
Two kinds of affinity expressions:
- Integer (compared with the thread ID)
- Shared address (check the affinity of the address)
Affinity tests are performed on every iteration (see the lowering sketch below):
    Affinity expression    # cycles
    none                   6
    integer                17
    shared address         10
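A minimal sketch of how the address-affinity loop above can be lowered when no further optimization is applied; the per-iteration test is the cost measured in the cycle counts above. upc_threadof() and MYTHREAD are standard UPC, and the arrays are the v1/v2/v3 declared above.

    #include <upc.h>

    /* Naive lowering of:
         upc_forall(i = 0; i < N; i++; &v3[i]) v3[i] = v2[i] + v1[i];
       The affinity test runs on every iteration. */
    for (int i = 0; i < N; i++) {
        if (upc_threadof(&v3[i]) == MYTHREAD)   /* shared-address affinity test */
            v3[i] = v2[i] + v1[i];
    }
    /* With an integer affinity expression the test would instead be
       (i % THREADS == MYTHREAD). */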

Unified Parallel C at LBNL/UCB Overhead of Dynamic Allocation
- Faster than HP, thanks to Active Messages
- Shared allocation functions are easy to implement, and they scale well
- Should perform even better once collectives (broadcast) are added to GASNet
- Shared locks are also easy to implement using AM
(Standard usage of the allocation and lock functions is sketched below.)
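The allocation and lock functions being measured are the standard UPC 1.1 library calls; a small self-contained usage sketch follows (the array size and the writes are purely for illustration).

    #include <upc.h>

    #define ELEMS_PER_THREAD 1024

    int main(void) {
        /* Collective allocation: all threads get the same pointer to THREADS
           blocks of ELEMS_PER_THREAD ints, one block with affinity to each thread. */
        shared [ELEMS_PER_THREAD] int *a = (shared [ELEMS_PER_THREAD] int *)
            upc_all_alloc(THREADS, ELEMS_PER_THREAD * sizeof(int));

        /* Collective lock allocation, then a short critical section
           (only to illustrate the lock API). */
        upc_lock_t *lock = upc_all_lock_alloc();
        upc_lock(lock);
        a[MYTHREAD * ELEMS_PER_THREAD] = MYTHREAD;  /* first element of this thread's block */
        upc_unlock(lock);

        upc_barrier;
        if (MYTHREAD == 0) upc_free(a);
        return 0;
    }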

Unified Parallel C at LBNL/UCB Overhead of Local Shared Accesses
- Berkeley beats HP, but neither performs well
- Culprit: the cost of the affinity test and of pointer-to-shared arithmetic
- Privatizing local shared accesses improves performance by an order of magnitude (see the sketch below)
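Privatization here means casting a pointer-to-shared that has local affinity down to an ordinary C pointer and running the inner loop through that. A minimal sketch using the cyclic v1/v2/v3 arrays from the forall example, assuming N is a multiple of THREADS:

    /* Each thread's elements of a cyclic array are contiguous in its local
       partition, so l1[i] below is v1[MYTHREAD + i*THREADS], and so on.
       The cast to a private pointer is legal only for elements with affinity
       to the calling thread. */
    int *l1 = (int *) &v1[MYTHREAD];
    int *l2 = (int *) &v2[MYTHREAD];
    int *l3 = (int *) &v3[MYTHREAD];
    for (int i = 0; i < N / THREADS; i++)
        l3[i] = l2[i] + l1[i];   /* plain C accesses: no pointer-to-shared cost */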

Unified Parallel C at LBNL/UCB Observations on the Results
- Acceptable overhead for shared memory operations and access latencies
- Phaseless pointers are good for performance
- The packed representation is also effective
- Good performance compared to HP UPC v1.7
- Still lots of opportunities for optimization

Unified Parallel C at LBNL/UCB Compiler Status
- Targeting a 3/31 release that is fully UPC 1.1 compliant
- The compiler builds with gcc 2.96 and 3.2 on Linux; a remote compilation option covers other platforms
- The runtime and GASNet are tested on the AlphaServer (Quadrics), Linux (Myrinet), and the IBM SP (LAPI)
- Successfully built and ran the NAS UPC benchmarks (EP, IS, CG, MG), ~2000 lines of code
- A paper has been submitted for publication

Unified Parallel C at LBNL/UCB Challenges That We Solved
Portability is non-trivial to achieve:
- Double inclusion: the translator can't simply output declarations from system headers, because the runtime/GASNet may #include them as well
- Porting the translator: Open64 originally compiled only with gcc 2.96
- IA-32 support: Open64 was designed to generate IA-64 code
- Preprocessor issues: Open64's C front end was not ANSI-compliant
- Static user data: an elaborate scheme to allocate and initialize static global data
- Memory management, machine-specific information, and many more

Unified Parallel C at LBNL/UCB Future Work: Optimizations
- Overlapping communication with computation: separate get() and put() as far as possible from the matching sync() (see the split-phase sketch below)
- Privatizing accesses to local memory: can be done in conjunction with eliminating forall-loop affinity tests
- Message pipelining: effectiveness varies with the network
- Message coalescing/aggregation/vectorization: reduces the amount of small-message traffic
- Prefetching and software caching: the difficulty lies in understanding the UPC memory model
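As a sketch of the first item, expressed against the GASNet extended API that the runtime sits on (treat the exact call names and signatures as assumptions; in practice the translator would emit the runtime-level equivalent rather than raw GASNet):

    #include <stddef.h>
    #include <gasnet.h>

    /* Split-phase remote read: start the get, overlap independent computation,
       and sync only when the data is actually needed.  Assumes the usual
       gasnet_init()/gasnet_attach() setup has already happened. */
    void overlapped_get(void *local_buf, gasnet_node_t node, void *remote_addr,
                        size_t nbytes, void (*independent_work)(void))
    {
        gasnet_handle_t h = gasnet_get_nb(local_buf, node, remote_addr, nbytes);
        independent_work();      /* computation overlaps the network transfer   */
        gasnet_wait_syncnb(h);   /* complete the get before touching local_buf  */
    }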

Unified Parallel C at LBNL/UCB Future Work: Functionality
- Pthreads and System V shared memory support
- Port the translator to more platforms
- Debugging support
- Merge with ORC (Open Research Compiler) to pick up new optimizations and bug fixes
- (Possibly) native code generation for IA-64