NERSC/LBNL UPC Compiler Status Report
Costin Iancu and the UCB/LBL UPC group


UPC Compiler – Status Report
- Current status:
  - UPC-to-C translator implemented in Open64, compliant with rev 1.0 of the UPC spec.
  - "Translates" the GWU test suite and test programs from Intrepid.

UPC Compiler – Future Work
[Diagram: the Open64 lowering pipeline, Front-ends -> Very High WHIRL -> High WHIRL -> Mid WHIRL -> Low WHIRL -> Very Low WHIRL -> CG (MIR), with the VHO, inliner, IPA, PREOPT, LNO, WOPT, and RVI1 phases attached at their respective IR levels.]

UPC Compiler – Future Work
[Diagram: the same Open64 lowering pipeline as on the previous slide.]
- Integrate with GASNet and the UPC runtime
- Test runtime and translator (32/64 bit)
- Investigate interaction between translator and optimization packages (legal C code)
- UPC-specific optimizations
- Open64 code generator

UPC Optimizations - Problems
- Shared pointer: a logical tuple (addr, thread, phase), i.e. {void *addr; int thread; int phase;}
- Expensive pointer arithmetic and address generation (sketched below):
    p+i -> p.phase  = (p.phase+i)%B
           p.thread = (p.thread+(p.phase+i)/B)%T
- Parallelism expressed by forall and the affinity test
- Overhead of fine-grained communication can become prohibitive
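
For concreteness, here is a minimal C sketch of the pointer arithmetic above. This is not the actual Berkeley UPC runtime code: the struct layout, the helper name, and the omitted local-address update are assumptions for illustration only.

    /* Sketch only: block-cyclic shared-pointer increment with block size B
       over T threads; the local address (addr) update is elided. */
    struct shared_ptr { void *addr; int thread; int phase; };

    struct shared_ptr shared_add(struct shared_ptr p, int i, int B, int T)
    {
        int off = p.phase + i;                /* offset from the current block start */
        p.thread = (p.thread + off / B) % T;  /* may hop to another owning thread */
        p.phase  = off % B;                   /* position inside the new block */
        return p;                             /* p.addr adjustment omitted */
    }

A call like this sits on the critical path of every shared access, which is why the later slides focus on hoisting or eliminating it.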

Translated UPC Code

UPC source:

    #include <upc.h>   /* header name lost in transcription; <upc.h> assumed */
    shared float *a, *b;
    int main() {
        int i, k;
        upc_forall(k=7; k < 234; k++; &a[k]) {
            upc_forall(i = 0; i < 1000; i++; 333) {
                a[k] = b[k+1];
            }
        }
    }

Generated C (excerpt; a closing brace missing in the transcription has been restored):

    k = 7;
    while (k <= 233) {
        Mreturn_temp_0 = upcr_add_shared(a.u0, 4, k, 1);
        __comma1 = upcr_threadof_shared(Mreturn_temp_0);
        if (MYTHREAD == __comma1) {    /* affinity test for &a[k] */
            i = 0;
            while (i <= 999) {
                Mreturn_temp_2 = upcr_add_shared(a.u0, 4, k, 1);
                Mreturn_temp_1 = upcr_add_shared(b.u0, 4, k + 1, 1);
                __comma = upcr_get_nb_shared_float(Mreturn_temp_1, 0);
                __comma0 = upcr_wait_syncnb_valget_float(__comma);
                upcr_put_nb_shared_float(Mreturn_temp_2, 0, __comma0);
                _3:;
                i = i + 1;
            }
        }
        _2:;
        k = k + 1;
    }
    ...

UPC Optimizations
- "Generic" scalar and loop optimizations (unrolling, pipelining, ...)
- Address generation optimizations:
  - Eliminate run-time tests
  - Table lookup / basis vectors
  - Simplify pointer/address arithmetic
  - Address components reuse
  - Localization
- Communication optimizations:
  - Vectorization
  - Message combination
  - Message pipelining
  - Prefetching for irregular data accesses

Run-Time Test Elimination
- Problem: find the sequence of local memory locations that processor P accesses during the computation
- Well explored in the context of HPF
- Several techniques proposed for block-cyclic distributions:
  - table lookup (Chatterjee, Kennedy)
  - basis vectors (Ramanujam, Thirumalai)
- UPC layouts (cyclic, pure block, indefinite block size) are particular cases of block-cyclic

Array Address Lookup

UPC source:

    upc_forall(i=l; i<u; i+=s; &a[i])
        a[i] = EXP();

UPC-to-C translation (run-time resolution):

    i = l;
    while (i < u) {
        t_0 = upcr_add_shared(a, 4, i, 1);
        __comma1 = upcr_threadof_shared(t_0);
        if (MYTHREAD == __comma1) {   /* affinity test on every iteration */
            t_2 = upcr_add_shared(a.u0, 4, i, 1);
            upcr_put_shared_float(t_2, 0, EXP());
        }
        _1:
        i += s;
    }

Table-based lookup (Kennedy):

    compute T, next, start;       /* offset and successor tables built once */
    base = startmem;
    i = startoffset;
    while (base < endmem) {
        *base = EXP();            /* direct local store, no affinity test */
        base += T[i];             /* stride to the next local element */
        i = next[i];
    }

Array Address Lookup
- Encouraging results: speedups between 50x and 200x versus run-time resolution
- Lookup is a time vs. space tradeoff; Kennedy introduces a demand-driven technique
- UPC arrays are simpler than HPF arrays
- UPC language restriction: no aliasing between pointers with different block sizes
- Existing HPF techniques are also applicable to UPC pointer-based programs

Address Arithmetic Simplification
- Address components reuse:
  - Idea: view shared pointers as three separate components (A, T, P) = (addr, thread, phase)
  - Exploit the implicit reuse of the thread and phase fields
- Pointer localization (see the sketch below):
  - Determine which accesses can be performed through local pointers
  - Optimize for indefinite block size
  - Requires heap analysis/LQI and a dependency analysis similar to that of the lookup techniques
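
As a concrete illustration of localization, here is a minimal hedged sketch; the surrounding variables (p's initialization, n) are hypothetical, but the cast itself is legal UPC whenever the data has affinity to the calling thread:

    /* Sketch: an indefinite-block-size array lives entirely on one thread,
       so once affinity is established the shared pointer can be cast to a
       plain C pointer and all further address arithmetic is local. */
    shared [] float *p;                /* indefinite block size: single owner */
    if (upc_threadof((shared void *)p) == MYTHREAD) {
        float *lp = (float *)p;        /* localized pointer */
        for (int i = 0; i < n; i++)
            lp[i] = 0.0f;              /* no runtime address generation */
    }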

Communication Optimizations
- Message vectorization: hoist and prefetch an array slice
- Message combination: combine messages with the same target processor into a larger message
- Communication pipelining: separate the initiation of a communication operation from its completion, and overlap communication with computation (see the sketch below)
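
A hedged sketch of communication pipelining, reusing the non-blocking upcr_* calls from the translated code earlier; the handle type upcr_handle_t, the src array, and consume() are illustrative assumptions:

    /* Issue all gets first, compute while they are in flight, then drain. */
    upcr_handle_t h[N];
    for (int j = 0; j < N; j++)
        h[j] = upcr_get_nb_shared_float(src[j], 0);    /* initiate transfers */
    /* ... independent computation overlaps with the transfers here ... */
    for (int j = 0; j < N; j++)
        consume(upcr_wait_syncnb_valget_float(h[j]));  /* complete and use */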

Communication Optimizations
- Some optimizations are complementary
- Choi & Snyder (Paragon/T3D, PVM/shmem), Krishnamurthy (CM-5), Chakrabarti (SP2/NOW)
- Speedups in the range of 10%-40%
- Optimizations are more effective for high-latency transport layers: ~25% speedup (PVM/NOW) vs. ~10% speedup (shmem/SP2)

Prefetching of Irregular Data Accesses
- For serial programs: hide cache latency
- "Simpler" for parallel programs: hide communication latency
- Irregular data accesses:
  - Array-based programs: a[b[i]]
  - Irregular (pointer-based) data structures

Prefetching of Irregular Data Accesses
- Array-based programs: a well-explored topic ("inspector-executor", Saltz)
- Irregular data structures:
  - Not well explored in the context of SPMD programs
  - Serial techniques: jump pointers, linearization (Mowry); see the sketch below
  - Is there a good case for it?
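
For reference, a minimal sketch of the jump-pointer idea in serial C; the node layout and process() are illustrative assumptions, while __builtin_prefetch is the standard GCC/Clang builtin:

    /* A setup pass installs 'jump' pointers that point D nodes ahead;
       the traversal then fetches the node D hops away while working. */
    struct node {
        struct node *next;
        struct node *jump;    /* node D positions downstream, or NULL */
        float payload;
    };

    void traverse(struct node *p) {
        for (; p != NULL; p = p->next) {
            if (p->jump)
                __builtin_prefetch(p->jump);  /* overlap fetch with work */
            process(p->payload);
        }
    }

In the SPMD setting, the same structure could initiate a non-blocking get of the remote node instead of a cache prefetch.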

Conclusions
- We start with a clean slate
- Infrastructure for pointer analysis and array dependency analysis is already in Open64
- Communication optimizations and address calculation optimizations share common analyses
- Address calculation optimizations are likely to offer better performance improvements at this stage

The End

Address Arithmetic Simplification
- Address components reuse:
  - Idea: view shared pointers as three separate components (A, T, P) = (addr, thread, phase)
  - Exploit the implicit reuse of the thread and phase fields

    shared [B] float a[N], b[N];
    upc_forall(i=l; i<u; i+=s; &a[i])
        a[i] = b[i+k];

Address Component Reuse
[Diagram: blocks B1..B6 of a and b laid out block-cyclically across threads P0 and P1; each block Bi spans local indices bi to ei. For a[i] = b[i+k], with a -> (Aa, Ta, Pa) and b -> (Ab, Tb, Pb), the thread components coincide (Ta = Tb) and the phases differ by the constant k (Pb = Pa + k) for the first B-k elements of each block.]

Address Component Reuse

    Ta = 0;
    for (i = first_block; i < last_block; i = next_block) {
        /* First B-k elements of the block: b[i+k] lives on the same thread,
           so the thread component Ta is reused and only the phase shifts. */
        for (j = bi, Pa = 0; j < ei - k; j++, Pa++)
            put(Aa, Ta, Pa, get(Ab, Ta, Pa + k));
        ...
        /* Remaining elements: b[i+k] has wrapped onto the next thread. */
        for (; j < ei; j++)
            put(Aa, Ta, Pa, get(Ab, Ta + 1, Pa - j));
        ...
    }