Code Optimization of Parallel Programs
Vivek Sarkar, Rice University

2 Parallel Software Challenges & Focus Area for this Talk
Layers of the parallel software stack, with representative examples:
 Application Libraries: parallel application libraries, e.g., linear algebra, graphics imaging, signal processing, security
 Domain-specific Programming Models: domain-specific implicitly parallel programming models, e.g., Matlab, stream processing, map-reduce (Sawzall)
 Languages: explicitly parallel languages, e.g., OpenMP, Java Concurrency, .NET Parallel Extensions, Intel TBB, CUDA, Cilk, MPI, Unified Parallel C, Co-Array Fortran, X10, Chapel, Fortress
 Middleware: parallelism in middleware, e.g., transactions, relational databases, web services, J2EE containers
 Programming Tools: parallel debugging and performance tools, e.g., Eclipse Parallel Tools Platform, TotalView, Thread Checker
 Static & Dynamic Optimizing Compilers and Multicore Back-ends: parallel intermediate representation, optimization of synchronization & data transfer, automatic parallelization; code partitioning for accelerators, data transfer optimizations, SIMDization, space-time scheduling, power management
 Parallel Runtime & System Libraries: task scheduling, synchronization, parallel data structures
 OS and Hypervisors: virtualization, scalable management of heterogeneous resources per core (frequency, power)

3 Outline  Paradigm Shifts  Anomalies in Optimizing Parallel Code  Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code  Rice Habanero Multicore Software project

4 Our Current Paradigm for Code Optimization has served us well for Fifty Years …
[Figure: Stretch-Harvest compiler organization: Fortran, Autocoder II, and ALPHA front ends translate to an IL; an optimizer, register allocator, and assembler produce object code for the STRETCH and STRETCH-HARVEST machines]
Source: “Compiling for Parallelism”, Fran Allen, Turing Lecture, June 2007

5 … and has been adapted to meet challenges along the way …  Interprocedural analysis  Array dependence analysis  Pointer alias analysis  Instruction scheduling & software pipelining  SSA form  Profile-directed optimization  Dynamic compilation  Adaptive optimization  Auto-tuning ...

6 … but is now under siege because of parallelism  Proliferation of parallel hardware  Multicore, manycore, accelerators, clusters, …  Proliferation of parallel libraries and languages  OpenMP, Java Concurrency, .NET Parallel Extensions, Intel TBB, Cilk, MPI, UPC, CAF, X10, Chapel, Fortress, …

7 Paradigm Shifts  "The Structure of Scientific Revolutions”, Thomas S. Kuhn (1970)  A paradigm is a scientific structure or framework consisting of Assumptions, Laws, Techniques  Normal science is a puzzle-solving activity governed by the rules of the paradigm.  It is uncritical of the current paradigm.  Crisis sets in when a series of serious anomalies appears  “The emergence of new theories is generally preceded by a period of pronounced professional insecurity”  Scientists engage in philosophical and metaphysical disputes.  A revolution or paradigm shift occurs when an entire paradigm is replaced by another

8 Kuhn’s History of Science
[Diagram: Immature Science → Normal Science → Anomalies → Crisis → Revolution]
Revolution: A new paradigm emerges
Old Theory: well established, many followers, many anomalies
New Theory: few followers, untested, new concepts/techniques, accounts for anomalies, asks new questions
Source: ug_phil_sci1h/phil_sci_files/L10_Kuhn1.ppt

9 Some Well Known Paradigm Shifts  Newton’s Laws to Einstein's Theory of Relativity  Ptolemy’s geocentric view to Copernicus and Galileo’s heliocentric view  Creationism to Darwin’s Theory of Evolution

10 Outline  Paradigm Shifts  Anomalies in Optimizing Parallel Code  Incremental vs. Comprehensive Approaches  Rice Habanero Multicore Software project

11 What anomalies do we see when optimizing parallel code? Examples: 1. Control flow rules, 2. Data flow rules, 3. Load elimination rules

12 1. Control Flow Rules from Sequential Code Optimization  Control Flow Graph  Node = Basic Block  Edge = Transfer of Control Flow  Succ(b) = successors of block b  Pred(b) = predecessors of block b  Dominators  Block d dominates block b if every (sequential) path from START to b includes d  Dom(b) = set of dominators of block b  Every block has a unique immediate dominator (parent in dominator tree)
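To make the dominator definition above concrete, here is a minimal sketch (not from the talk) of the textbook iterative dominator computation over a sequential CFG; the graph encoding, function name, and block labels are illustrative assumptions.

```python
# Iterative dominator computation for a sequential CFG (illustrative sketch).
# Dom(b) = {b} U ( intersection of Dom(p) over all predecessors p of b )

def dominators(succ, entry):
    """succ maps each block to its successor list; entry is the START block."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)
    dom = {n: set(nodes) for n in nodes}   # optimistic init: "all blocks"
    dom[entry] = {entry}
    changed = True
    while changed:                         # iterate to a fixed point
        changed = False
        for n in nodes - {entry}:
            if not pred[n]:
                continue
            new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

# The CFG of the dominator example on the next slide:
# START -> BB1 -> {BB2, BB3} -> BB4 -> STOP
cfg = {"START": ["BB1"], "BB1": ["BB2", "BB3"], "BB2": ["BB4"],
       "BB3": ["BB4"], "BB4": ["STOP"], "STOP": []}
print(sorted(dominators(cfg, "START")["BB4"]))   # ['BB1', 'BB4', 'START']
```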

13 Dominator Example
[Figure: Control Flow Graph: START → BB1; BB1 branches (T/F) to BB2 and BB3; BB2 and BB3 → BB4; BB4 → STOP. Dominator Tree: START is the root; BB1 is its child; BB2, BB3, and BB4 are children of BB1; STOP is a child of BB4.]

14 Anomalies in Control Flow Rules for Parallel Code
BB1 ; parbegin BB2 || BB3 parend ; BB4
 Does BB4 have a unique immediate dominator?
 Can the dominator relation be represented as a tree?
[Figure: Parallel Control Flow Graph: BB1 → FORK; FORK → BB2 and BB3; BB2 and BB3 → JOIN; JOIN → BB4]

15 2. Data Flow Rules from Sequential Code Optimization Example: Reaching Definitions  REACH_in(n) = set of definitions d s.t. there is a (sequential) path from d to n in the CFG, and d is not killed along that path.
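For reference, a small sketch of the standard sequential reaching-definitions fixpoint that this slide alludes to; the CFG, gen/kill sets, and definition labels below are made-up examples, not taken from the talk.

```python
# Forward "reaching definitions" fixpoint over a sequential CFG (illustrative
# sketch; block names, definitions, and gen/kill sets below are made up).
#   REACH_in(n)  = union of REACH_out(p) over predecessors p of n
#   REACH_out(n) = GEN(n) | (REACH_in(n) - KILL(n))

def reaching_definitions(succ, gen, kill):
    pred = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            pred[t].append(n)
    reach_in = {n: set() for n in succ}
    reach_out = {n: set(gen[n]) for n in succ}
    changed = True
    while changed:
        changed = False
        for n in succ:
            new_in = set().union(*(reach_out[p] for p in pred[n]))
            new_out = gen[n] | (new_in - kill[n])
            if (new_in, new_out) != (reach_in[n], reach_out[n]):
                reach_in[n], reach_out[n], changed = new_in, new_out, True
    return reach_in

# d1: x defined in B1, d2: x redefined in B2; both definitions merge at B4.
succ = {"B1": ["B2", "B3"], "B2": ["B4"], "B3": ["B4"], "B4": []}
gen  = {"B1": {"d1"}, "B2": {"d2"}, "B3": set(), "B4": set()}
kill = {"B1": {"d2"}, "B2": {"d1"}, "B3": set(), "B4": set()}
print(sorted(reaching_definitions(succ, gen, kill)["B4"]))   # ['d1', 'd2']
```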

16 Anomalies in Data Flow Rules for Parallel Code
What definitions reach COEND? What if there were no synchronization edges? How should the data flow equations be defined for parallel code?
S1: X1 := …
parbegin
  // Task 1
  S2: X2 := … ; post(ev2);
  S3: ... ; post(ev3);
  S4: wait(ev8); X4 := …
||
  // Task 2
  S5: ...
  S6: wait(ev2);
  S7: X7 := …
  S8: wait(ev3); post(ev8);
parend
...
(edges in the figure: control and sync)

17 3. Load Elimination Rules from Sequential Code Optimization  A load instruction at point P, T3 := *q, is redundant if the value of *q is available at point P
Before:  T1 := *q ; T2 := *p ; T3 := *q
After:   T1 := *q ; T2 := *p ; T3 := T1
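A minimal sketch of how a local load-elimination pass might implement this rule over a toy three-address form; the instruction encoding and the conservative "kill everything on a store" policy are assumptions for illustration (the next two slides show why a parallel write complicates exactly this reasoning).

```python
# Local redundant-load elimination over a toy three-address form (illustrative
# sketch).  Instructions: ("load", dst, addr), ("store", addr, src), other.

def eliminate_redundant_loads(block):
    available = {}                 # addr -> temp known to hold the value of *addr
    out = []
    for ins in block:
        if ins[0] == "load":
            _, dst, addr = ins
            if addr in available:              # *addr already loaded into a temp
                out.append(("copy", dst, available[addr]))
            else:
                out.append(ins)
                available[addr] = dst
        elif ins[0] == "store":
            available.clear()                  # conservative: any store may alias
            out.append(ins)
        else:
            out.append(ins)
    return out

# The example from this slide: the second load of *q becomes a copy from T1.
block = [("load", "T1", "q"), ("load", "T2", "p"), ("load", "T3", "q")]
print(eliminate_redundant_loads(block))
# [('load', 'T1', 'q'), ('load', 'T2', 'p'), ('copy', 'T3', 'T1')]
```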

18 Anomalies in Load Elimination Rules for Parallel Code (Original Version)
Assume that p = q, and that *p = *q = 0 initially.
Task 1: ... T1 := *q ; T2 := *p ; T3 := *q ; print T1, T2, T3
Task 2: ... *p = 1 ...
Question: Is [0, 1, 0] permitted as a possible output?
Answer: It depends on the programming model.  It is not permitted by Sequential Consistency [Lamport 1979]  But it is permitted by Location Consistency [Gao & Sarkar 1993, 2000]

19 Anomalies in Load Elimination Rules for Parallel Code (After Load Elimination)
Assume that p = q, and that *p = *q = 0 initially.
Task 1: ... T1 := *q ; T2 := *p ; T3 := T1 ; print T1, T2, T3
Task 2: ... *p = 1 ...
Question: Is [0, 1, 0] permitted as a possible output?
Answer: Yes, it will be permitted by Sequential Consistency, if load elimination is performed!

20 Outline  Paradigm Shifts  Anomalies in Optimizing Parallel Code  Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code  Rice Habanero Multicore Software project

21 Incremental Approaches to coping with Parallel Code Optimization  Large investment in infrastructures for sequential code optimization  Introduce ad hoc rules to incrementally extend them for parallel code optimization  Code motion fences at synchronization operations  Task creation and termination via function call interfaces  Use of volatile storage modifiers ...

22 More Comprehensive Changes will be needed for Code Optimization of Parallel Programs in the Future  Need for a new Parallel Intermediate Representation (PIR) with robust support for code optimization of parallel programs  Abstract execution model for PIR  Storage classes (types) for locality and memory hierarchies  General framework for task partitioning and code motion in parallel code  Compiler-friendly memory model  Combining automatic parallelization and explicit parallelism ...

23 Program Dependence Graphs [Ferrante, Ottenstein, Warren 1987]  A Program Dependence Graph, PDG = (N', E_cd, E_dd), is derived from a CFG and consists of a set of nodes N', a set of control dependence edges E_cd, and a set of data dependence edges E_dd

24 PDG Example
/* S1 */ max = a[i];
/* S2 */ div = a[i] / b[i];
/* S3 */ if ( max < b[i] )
/* S4 */    max = b[i];
[Figure: PDG data dependence edges: S1 → S3 on max (true), S1 → S4 on max (output), S3 → S4 on max (anti)]
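As an illustration of how the data dependence edges in this example might be derived, here is a small sketch that classifies flow (true), anti, and output dependences from hand-written def/use sets; the def/use encoding is an assumption, and control dependence (S4 on S3) is not computed here.

```python
# Classifying the data dependences in the example above from hand-written
# def/use sets (an illustrative sketch; control dependence is not computed).

stmts = {
    "S1": {"def": {"max"}, "use": {"a[i]"}},           # max = a[i];
    "S2": {"def": {"div"}, "use": {"a[i]", "b[i]"}},   # div = a[i] / b[i];
    "S3": {"def": set(),   "use": {"max", "b[i]"}},    # if (max < b[i])
    "S4": {"def": {"max"}, "use": {"b[i]"}},           # max = b[i];
}
order = ["S1", "S2", "S3", "S4"]

deps = []
for i, a in enumerate(order):
    for b in order[i + 1:]:
        for v in stmts[a]["def"] & stmts[b]["use"]:
            deps.append((a, b, v, "flow/true"))    # definition reaches a later use
        for v in stmts[a]["use"] & stmts[b]["def"]:
            deps.append((a, b, v, "anti"))         # use before a later definition
        for v in stmts[a]["def"] & stmts[b]["def"]:
            deps.append((a, b, v, "output"))       # two definitions of the same name
print(deps)
# [('S1', 'S3', 'max', 'flow/true'), ('S1', 'S4', 'max', 'output'),
#  ('S3', 'S4', 'max', 'anti')]
```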

25 PDG restrictions  Control Dependence  Predicate-ancestor condition: if there are two disjoint c.d. paths from (ancestor) node A to node N, then A cannot be a region node, i.e., A must be a predicate node  No-postdominating-descendant condition: if node P postdominates node N in the CFG, then there cannot be a c.d. path from node N to node P

26 Violation of the Predicate-Ancestor Condition can lead to “non-serializable” PDGs [LCPC 1993]  Node 4 is executed twice in this acyclic PDG “Parallel Program Graphs and their Classification”, V. Sarkar & B. Simons, LCPC 1993

27 PDG restrictions (contd)  Data Dependence  There cannot be a data dependence edge in the PDG from node A to node B if there is no path from A to B in the CFG  The context C of a data dependence edge (A,B,C) must be plausible, i.e., it cannot identify a dependence from an execution instance I_A of node A to an execution instance I_B of node B if I_B precedes I_A in the CFG's execution  e.g., a data dependence from iteration i+1 to iteration i is not plausible in a sequential program

28 Limitations of Program Dependence Graphs  PDGs and CFGs are tightly coupled  A transformation in one must be reflected in the other  PDGs reveal maximum parallelism in the program  CFGs reveal sequential execution  Neither is well suited for code optimization of parallel programs e.g., how do we represent a partitioning of { 1, 3, 4 } and { 2 } into two tasks?

29 Another Limitation: no Parallel Execution Semantics defined for PDGs  What is the semantics of control dependence edges with cycles?  What is the semantics of data dependences when a source or destination node may have zero, one or more instances?
A[f(i,j)] = …
… = A[g(i)]

30 Parallel Program Graphs: A Comprehensive Representation that Subsumes CFGs and PDGs [LCPC 1992] A Parallel Program Graph, PPG = (N, E_control, E_sync), consists of:  N, a set of compute, predicate, and parallel nodes  A parallel node creates parallel threads of computation for each of its successors  E_control, a set of labeled control edges. Edge (A,B,L) in E_control identifies a control edge from node A to node B with label L.  E_sync, a set of synchronization edges. Edge (A,B,F) in E_sync defines a synchronization from node A to node B with synchronization condition F, which identifies execution instances of A and B that need to be synchronized “A Concurrent Execution Semantics for Parallel Program Graphs and Program Dependence Graphs”, V. Sarkar, LCPC 1992
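One possible, illustrative rendering of the PPG triple as plain data (class and field names are assumptions, not the paper's notation); the synchronization condition F is kept as an opaque string.

```python
# One possible rendering of PPG = (N, E_control, E_sync) as plain data
# (class and field names are illustrative assumptions, not the paper's notation).
from dataclasses import dataclass, field
from typing import List

@dataclass
class PPGNode:
    name: str
    kind: str            # "compute", "predicate", or "parallel"

@dataclass
class ControlEdge:
    src: str
    dst: str
    label: str           # branch label L ("T", "F", "unconditional", ...)

@dataclass
class SyncEdge:
    src: str
    dst: str
    cond: str            # synchronization condition F, kept symbolic here

@dataclass
class PPG:
    nodes: List[PPGNode] = field(default_factory=list)
    control: List[ControlEdge] = field(default_factory=list)
    sync: List[SyncEdge] = field(default_factory=list)

# A parallel node PAR forking compute nodes N1 and N2, with one sync edge N1 -> N2.
g = PPG(nodes=[PPGNode("PAR", "parallel"), PPGNode("N1", "compute"), PPGNode("N2", "compute")],
        control=[ControlEdge("PAR", "N1", "unconditional"), ControlEdge("PAR", "N2", "unconditional")],
        sync=[SyncEdge("N1", "N2", "all instances")])
print(len(g.nodes), len(g.control), len(g.sync))   # 3 2 1
```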

31 PPG Example

32 Relating CFGs to PPGs  Construction of PPG for a sequential program  PPG nodes = CFG nodes  PPG control edges = CFG edges  PPG synchronization edges = empty set

33 Relating PDGs to PPGs  Construction of PPG for PDGs  PPG nodes = PDG nodes  PPG parallel nodes = PDG region nodes  PPG control edges = PDG control dependence edges  PPG synchronization edges = PDG data dependence edges  Synchronization condition F in PPG synchronization edge mirrors context of PDG data dependence edge
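The two constructions on the last two slides are essentially relabelings; the following sketch spells them out using simple dictionaries and tuples (this graph encoding, and the small driver example, are assumptions made for illustration, not the paper's).

```python
# Sketches of the two constructions above, with graphs kept as simple dicts
# (this encoding is an assumption made for illustration, not the paper's).

def ppg_from_cfg(cfg_nodes, cfg_edges):
    """Sequential program: PPG nodes and control edges mirror the CFG; no sync edges."""
    return {"nodes": list(cfg_nodes),
            "control": list(cfg_edges),          # (src, dst, label) triples
            "sync": []}

def ppg_from_pdg(pdg_nodes, region_nodes, cd_edges, dd_edges):
    """PDG: region nodes become parallel nodes, control dependence edges become
    control edges, and data dependence edges become sync edges whose condition
    mirrors the dependence context."""
    kinds = {n: ("parallel" if n in region_nodes else "compute") for n in pdg_nodes}
    return {"nodes": kinds,
            "control": list(cd_edges),                          # (src, dst, label)
            "sync": [(a, b, ctx) for (a, b, ctx) in dd_edges]}  # context becomes F

# e.g. a PDG for the code of slide 24: a hypothetical region node R over S1..S4,
# with S4 control dependent on S3 and the three data dependences on max.
ppg = ppg_from_pdg(
    pdg_nodes=["R", "S1", "S2", "S3", "S4"], region_nodes={"R"},
    cd_edges=[("R", "S1", ""), ("R", "S2", ""), ("R", "S3", ""), ("S3", "S4", "T")],
    dd_edges=[("S1", "S3", "flow on max"), ("S1", "S4", "output on max"),
              ("S3", "S4", "anti on max")])
print(ppg["nodes"]["R"], len(ppg["sync"]))   # parallel 3
```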

34 Example of Transforming PPGs

35 Abstract Interpreter for PPGs  Build a partial order of dynamic execution instances of PPG nodes as PPG execution unravels.  Each execution instance I_A is labeled with its history (calling context), H(I_A).  Initialize the partial order to a singleton set containing an instance of the start node, I_START, with H(I_START) initialized to the empty sequence.

36 Abstract Interpreter for PPGs (contd) Each iteration of the scheduling algorithm:  Selects an execution instance I_A in the partial order such that all of I_A's predecessors in the partial order have been scheduled  Simulates execution of I_A and evaluates branch label L  Creates an instance I_B of each c.d. successor B of A for label L  Adds (I_B, I_C) to the partial order, if instance I_C has been created and there exists a PPG synchronization edge from B to C (or from a PPG descendant of B to C)  Adds (I_C, I_B) to the partial order, if instance I_C has been created and there exists a PPG synchronization edge from C to B (or from a PPG descendant of C to B)
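A deliberately simplified sketch of this scheduling loop: execution instances and ordering pairs are supplied up front (the real interpreter creates instances as branch labels and synchronization edges are evaluated), so only the "schedule an instance once all of its predecessors have been scheduled" discipline is modeled; names and the driver, taken from the example on the next slide, are illustrative.

```python
# A much-simplified sketch of the scheduling loop above: instances and ordering
# pairs are given up front, so only the discipline "schedule an instance once
# all of its predecessors are scheduled" is modeled.

def schedule(instances, order_pairs, simulate):
    preds = {i: set() for i in instances}
    for before, after in order_pairs:
        preds[after].add(before)
    done, trace = set(), []
    while len(done) < len(preds):
        ready = [i for i in preds if i not in done and preds[i] <= done]
        if not ready:
            raise RuntimeError("cycle in the partial order")
        inst = ready[0]              # any ready instance may be picked
        simulate(inst)               # "simulate execution of the instance"
        done.add(inst)
        trace.append(inst)
    return trace

# Instances and ordering from the example on the next slide: START, then a
# parallel node PAR forking instances 1, 2, 3, plus one sync-induced pair (1, 3).
pairs = {("I_START", "I_PAR"), ("I_PAR", "I_1"), ("I_PAR", "I_2"),
         ("I_PAR", "I_3"), ("I_1", "I_3")}
print(schedule(["I_START", "I_PAR", "I_1", "I_2", "I_3"], pairs,
               simulate=lambda inst: None))   # one valid schedule
```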

37 Abstract Interpreter for PPGs: Example
1. Create I_START
2. Schedule I_START
3. Create I_PAR
4. Schedule I_PAR
5. Create I_1, I_2, I_3
6. Add (I_1, I_3) to the partial order
7. Schedule I_2
8. Schedule I_1
9. Schedule I_3

38 Weak (Deterministic) Memory Model for PPGs  All memory accesses are assumed to be non-atomic  Read-write hazard --- if I_a reads a location for which there is a parallel write of a different value, then the execution result is an error  Analogous to an exception thrown if a data race occurs  May be thrown when read or write operation is performed  Write-write hazard --- if I_a writes into a location for which there is a parallel write of a different value, then the resulting value in the location is undefined  Execution results in an error if that location is subsequently read  Separation of data communication and synchronization:  Data communication specified by read/write operations  Sequencing specified by synchronization and control edges

39 Soundness Properties  Reordering Theorem  For a given Parallel Program Graph, G, and input store σ_i, the final store σ_f = G(σ_i) obtained is the same for all possible scheduled sequences in the abstract interpreter  Equivalence Theorem  A sequential program and its PDG have identical semantics, i.e., they yield the same output store when executed with the same input store

40 Reaching Definitions Analysis on PPGs [LCPC 1997] “Analysis and Optimization of Explicitly Parallel Programs using the Parallel Program Graph Representation”, V.Sarkar, LCPC 1997 A definition D is redefined at program point P if there is a control path from D to P, and D is killed along all paths from D to P.

41 Reaching Definitions Analysis on PPGs
[Figure: PPG for the parallel example from slide 16, with control edges and sync edges]
S1: X1 := …
// Task 1:  S2: X2 := … post(ev2);   S3: ... post(ev3);   S4: wait(ev8); X4 := …
// Task 2:  S5: ...   S6: wait(ev2);   S7: X7 := …   S8: wait(ev3); post(ev8);

42 PPG Limitations  Past work has focused on comprehensive representation and semantics for deterministic programs  Extensions needed for  Atomicity and mutual exclusion  Stronger memory models  Storage classes with explicit locality

43 Issues in Modeling Synchronized/Atomic Blocks [LCPC 1999] Questions:  Can the load of p.x be moved below the store of q.y?  Can the load of p.x be moved outside the synchronized block?  Can the load of r.z be moved inside the synchronized block?  Can the load of r.z be moved back outside the synchronized block?  How should the data dependences be modeled?
a = ...
synchronized (L) {
   ... = p.x
   q.y = ...
   b = ...
}
... = r.z
“Dependence Analysis for Java”, C. Chambers et al, LCPC 1999

44 Outline  Paradigm Shifts  Anomalies in Optimizing Parallel Code  Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code  Rice Habanero Multicore Software project

45 Habanero Project (habanero.rice.edu)
[Diagram: Parallel Applications; Sequential C, Fortran, Java, … via a Foreign Function Interface; X10 …; Vendor Compiler & Libraries; Multicore Hardware]
1) Habanero Programming Language
2) Habanero Static Compiler
3) Habanero Virtual Machine
4) Habanero Concurrency Library
5) Habanero Toolkit

46 2) Habanero Static Parallelizing & Optimizing Compiler
[Diagram: compiler pipeline with a Front End, AST, IRGen, Interprocedural Analysis, Parallel IR (PIR), PIR Analysis & Optimization, and Classfile Transformations, producing Annotated Classfiles for a Portable Managed Runtime and Partitioned Code via a Platform-specific static compiler; inputs include the X10/Habanero Language, Sequential C, Fortran, Java, … via a Foreign Function Interface, and C / Fortran restricted code regions for targeting accelerators & high-end computing]

47 Habanero Target Applications and Platforms
Applications:
Parallel Benchmarks  SSCA’s #1, #2, #3 from DARPA HPCS program  NAS Parallel Benchmarks  JGF, JUC, SciMark benchmarks
Medical Imaging  Back-end processing for Compressive Sensing  Contacts: Rich Baraniuk (Rice), Jason Cong (UCLA)
Seismic Data Processing  Rice Inversion project  Contact: Bill Symes (Rice), James Gunning (CSIRO)
Computer Graphics and Visualization  Mathematical modeling and smoothing of meshes  Contact: Joe Warren (Rice)
Computational Chemistry  Fock Matrix Construction  Contacts: David Bernholdt, Wael Elwasif, Robert Harrison, Annirudha Shet (ORNL)
Habanero Compiler  Implement Habanero compiler in Habanero so as to exploit multicore parallelism within the compiler
Platforms:  AMD Barcelona Quad-Core  Clearspeed Advance X620  DRC Coprocessor Module w/ Xilinx Virtex FPGA  IBM Cell  IBM Cyclops-64 (C-64)  IBM Power5+, Power6  Intel Xeon Quad-Core  NVIDIA Tesla S870  Sun UltraSparc T1, T2 ...
Additional suggestions welcome!

48 Habanero Research Topics 1) Language Research  Explicit parallelism: portable constructs for homogeneous & heterogeneous multicore  Implicit deterministic parallelism: array views, single-assignment constructs  Implicit non-deterministic parallelism: unordered iterators, partially ordered statement blocks  Builds on our experiences with the X10, CAF, HPF, Matlab D, Fortran 90 and Sisal languages 2) Compiler research  New Parallel Intermediate Representation (PIR)  Automatic analysis, transformation, and parallelization of PIR  Optimization of high-level arrays and iterators  Optimization of synchronization, data transfer, and transactional memory operations  Code partitioning for accelerators  Builds on our experiences with the D System, Massively Scalar, Telescoping Languages Framework, ASTI and PTRAN research compilers

49 Habanero Research Topics (contd.) 3) Virtual machine research  VM support for work-stealing scheduling algorithms with extensions for places, transactions, task groups  Runtime support for other Habanero language constructs (phasers, regions, distributions)  Integration and exploitation of lightweight profiling in VM scheduler and memory management system  Builds on our experiences with the Jikes Research Virtual Machine 4) Concurrency library research  New nonblocking data structures to support the Habanero runtime  Efficient software transactional memory libraries  Builds on our experiences with the java.util.concurrent and DSTM2 libraries 5) Toolkit research  Program analysis for common parallel software errors  Performance attribution of shared code regions (loops, procedure calls) using static and dynamic calling context  Builds on our experiences with the HPCToolkit, Eclipse PTP and DrJava projects

50 Opportunities for Broader Impact  Education  Influence how parallelism is taught in future Computer Science curricula  Open Source  Build an open source testbed to grow ecosystem for researchers in Parallel Software area  Industry standards  Use research results as proofs of concept for new features that can be standardized  Infrastructure can provide foundation for reference implementations  Collaborations welcome!

51 Habanero Team (Nov 2007) Send email to Vivek Sarkar if you are interested in a PhD, postdoc, research scientist, or programmer position in the Habanero project, or in collaborating with us!

52 Other Challenges in Code Optimization of Parallel Code  Optimization of task coordination  Task creation and termination --- fork, join  Mutual exclusion --- locks, transactions  Synchronization --- semaphores, barriers  Data Locality Optimizations  Computation and data alignment  Communication optimizations  Deployment and Code Generation  Homogeneous Multicore  Heterogeneous Multicore and Accelerators  Automatic Parallelization Revisited ...

53 Related Work (Incomplete List)  Analysis of nondeterministic sequentially consistent parallel programs  [Shasha, Snir 1988], [Midkiff et al 1989], [Chow, Harrison 1992], [Lee et al 1997], …  Analysis of deterministic parallel programs with copy-in/copy-out semantics  [Srinivasan 1994], [Ferrante et al 1996], …  Value-oriented semantics for functional subsets of PDGs  [Selke 1989], [Cartwright, Felleisen 1989], [Beck, Pingali 1989], [Ottenstein, Ballance, Maccabe 1990], …  Serialization of restricted subsets of PDGs  [Ferrante, Mace, Simons 1988], [Simons et al 1990], …  Concurrency analysis  [Long, Clarke 1989], [Duesterwald, Soffa 1991], [Masticola, Ryder 1993], [Naumovich, Avrunin 1998], [Agarwal et al 2007], …

54 PLDI 2008 Tutorial (Tucson, AZ)  Analysis and Optimization of Parallel Programs  Intermediate representations for parallel programs  Data flow analysis frameworks for parallel programs  Locality analyses: scalar/array privatization, escape analysis of objects, locality types  Memory models and their impact on code optimization of locks and transactional memory operations  Optimizations of task partitions and synchronization operations  Sam Midkiff, Vivek Sarkar  Sunday afternoon (June 8, 2008, 1:30pm - 5:00pm)

55 Conclusions  New paradigm shift in Code Optimization due to Parallel Programs  Foundations of Code Optimization will need to be revisited from scratch  Foundations will impact high-level and low-level optimizers, as well as tools  Exciting times to be a compiler researcher!