Code Optimization of Parallel Programs. Vivek Sarkar, Rice University


1 Code Optimization of Parallel Programs. Vivek Sarkar, Rice University

2 Parallel Software Challenges & Focus Area for this Talk
- Domain-specific Programming Models: domain-specific implicitly parallel programming models, e.g., Matlab, stream processing, map-reduce (Sawzall)
- Application Libraries: parallel application libraries, e.g., linear algebra, graphics imaging, signal processing, security
- Middleware: parallelism in middleware, e.g., transactions, relational databases, web services, J2EE containers
- Languages: explicitly parallel languages, e.g., OpenMP, Java Concurrency, .NET Parallel Extensions, Intel TBB, CUDA, Cilk, MPI, Unified Parallel C, Co-Array Fortran, X10, Chapel, Fortress
- Programming Tools: parallel debugging and performance tools, e.g., Eclipse Parallel Tools Platform, TotalView, Thread Checker
- Static & Dynamic Optimizing Compilers: parallel intermediate representation, optimization of synchronization & data transfer, automatic parallelization
- Multicore Back-ends: code partitioning for accelerators, data transfer optimizations, SIMDization, space-time scheduling, power management
- Parallel Runtime & System Libraries: parallel runtime and system libraries for task scheduling, synchronization, parallel data structures
- OS and Hypervisors: virtualization, scalable management of heterogeneous resources per core (frequency, power)

3 Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
- Rice Habanero Multicore Software project

4 Our Current Paradigm for Code Optimization has served us well for Fifty Years …
[Figure: Stretch – Harvest Compiler Organization ( ). Diagram labels: Translation; Fortran; Autocoder II; ALPHA; IL; OPTIMIZER; REGISTER ALLOCATOR; IL; ASSEMBLER; OBJECT CODE; STRETCH; STRETCH-HARVEST]
Source: "Compiling for Parallelism", Fran Allen, Turing Lecture, June 2007

5 … and has been adapted to meet challenges along the way …
- Interprocedural analysis
- Array dependence analysis
- Pointer alias analysis
- Instruction scheduling & software pipelining
- SSA form
- Profile-directed optimization
- Dynamic compilation
- Adaptive optimization
- Auto-tuning
- ...

6 … but is now under siege because of parallelism
- Proliferation of parallel hardware
  - Multicore, manycore, accelerators, clusters, …
- Proliferation of parallel libraries and languages
  - OpenMP, Java Concurrency, .NET Parallel Extensions, Intel TBB, Cilk, MPI, UPC, CAF, X10, Chapel, Fortress, …

7 Paradigm Shifts
- "The Structure of Scientific Revolutions", Thomas S. Kuhn (1970)
- A paradigm is a scientific structure or framework consisting of assumptions, laws, and techniques.
- Normal science is a puzzle-solving activity governed by the rules of the paradigm.
  - It is uncritical of the current paradigm.
- Crisis sets in when a series of serious anomalies appears.
  - "The emergence of new theories is generally preceded by a period of pronounced professional insecurity"
  - Scientists engage in philosophical and metaphysical disputes.
- A revolution or paradigm shift occurs when an entire paradigm is replaced by another.

8 Kuhn's History of Science
[Figure: cycle with stages Immature Science, Normal Science, Anomalies, Crisis, Revolution]
- Revolution: a new paradigm emerges
- Old Theory: well established, many followers, many anomalies
- New Theory: few followers, untested, new concepts/techniques, accounts for anomalies, asks new questions
Source: ug_phil_sci1h/phil_sci_files/L10_Kuhn1.ppt

9 Some Well-Known Paradigm Shifts
- Newton's Laws to Einstein's Theory of Relativity
- Ptolemy's geocentric view to Copernicus and Galileo's heliocentric view
- Creationism to Darwin's Theory of Evolution

10 Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches
- Rice Habanero Multicore Software project

11 What anomalies do we see when optimizing parallel code?
Examples:
1. Control flow rules
2. Data flow rules
3. Load elimination rules

12 1. Control Flow Rules from Sequential Code Optimization
- Control Flow Graph
  - Node = Basic Block
  - Edge = Transfer of Control Flow
  - Succ(b) = successors of block b
  - Pred(b) = predecessors of block b
- Dominators
  - Block d dominates block b if every (sequential) path from START to b includes d
  - Dom(b) = set of dominators of block b
  - Every block has a unique immediate dominator (parent in dominator tree)

13 Dominator Example
[Figure: Control Flow Graph with nodes START, BB1, BB2, BB3, BB4, STOP and branch labels T and F, shown next to its Dominator Tree over the same nodes]
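The dominator definitions above can be made concrete with a short sketch. It assumes the figure shows the usual diamond shape (BB1 branches to BB2 on T and to BB3 on F, and both rejoin at BB4); that edge list, and the helper names `dominators` and `immediate_dominators`, are illustrative assumptions rather than part of the slides.

```python
def dominators(succ, start):
    """Iteratively compute Dom(b) for every block of a sequential CFG."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, successors in succ.items():
        for s in successors:
            pred[s].add(n)

    # Start from the most conservative guess and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for n in nodes - {start}:
            pred_doms = [dom[p] for p in pred[n]]
            common = set.intersection(*pred_doms) if pred_doms else set()
            new = {n} | common
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def immediate_dominators(dom, start):
    """The immediate dominator is the strict dominator closest to the block."""
    idom = {}
    for n, ds in dom.items():
        if n == start:
            continue
        strict = ds - {n}
        # The closest strict dominator is the one with the most dominators.
        idom[n] = max(strict, key=lambda d: len(dom[d]))
    return idom


# Assumed edges for the example CFG (diamond shape).
cfg = {
    "START": ["BB1"],
    "BB1": ["BB2", "BB3"],
    "BB2": ["BB4"],
    "BB3": ["BB4"],
    "BB4": ["STOP"],
    "STOP": [],
}
dom = dominators(cfg, "START")
idom = immediate_dominators(dom, "START")
print(sorted(dom["BB4"]), idom["BB4"])
```

Under these assumed edges, neither BB2 nor BB3 dominates BB4, because each lies on only one of the two sequential paths, so the immediate dominator of BB4 is BB1.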

14 Anomalies in Control Flow Rules for Parallel Code
  BB1
  parbegin
    BB2 || BB3
  parend
  BB4
- Does BB4 have a unique immediate dominator?
- Can the dominator relation be represented as a tree?
[Figure: Parallel Control Flow Graph with nodes BB1, FORK, BB2, BB3, JOIN, BB4]

15 2. Data Flow Rules from Sequential Code Optimization
Example: Reaching Definitions
- REACH_in(n) = set of definitions d such that there is a (sequential) path from d to n in the CFG, and d is not killed along that path.
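For reference, the path-based definition above is usually computed with the standard textbook data flow equations. The names GEN(n), KILL(n), and REACH_out(n) do not appear on the slide and are introduced here as assumptions, following common compiler-textbook notation:

```latex
\begin{aligned}
\mathrm{REACH}_{in}(n)  &= \bigcup_{p \,\in\, \mathrm{Pred}(n)} \mathrm{REACH}_{out}(p) \\
\mathrm{REACH}_{out}(n) &= \mathrm{GEN}(n) \,\cup\, \bigl(\mathrm{REACH}_{in}(n) \setminus \mathrm{KILL}(n)\bigr)
\end{aligned}
```

Here GEN(n) is the set of definitions created in block n and KILL(n) is the set of definitions of the same variables that n overwrites. The equations are solved iteratively until no set changes.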

16 Anomalies in Data Flow Rules for Parallel Code
- What definitions reach COEND (the parend)?
- What if there were no synchronization edges?
- How should the data flow equations be defined for parallel code?
[Figure legend: control edges, sync edges]
  S1: X1 := …
  parbegin
    // Task 1
    S2: X2 := …
        post(ev2);
    S3: ...
        post(ev3);
    S4: wait(ev8);
        X4 := …
  ||
    // Task 2
    S5: ...
    S6: wait(ev2);
    S7: X7 := …
    S8: wait(ev3);
        post(ev8);
  parend
  ...
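One way to explore the first two questions is to enumerate every legal interleaving of the two tasks by brute force and record which definition is the last write when both tasks finish. The sketch below rests on several assumptions not stated on the slide: X1, X2, X4, and X7 are read as definitions of the same variable X at statements S1, S2, S4, and S7; `post` and `wait` are modeled as one-shot events; and "reaches COEND" is approximated as "can be the last write before parend". It illustrates the anomaly and is not the analysis from the talk.

```python
def final_definitions(tasks, initial_def="X1"):
    """Return every definition that can be the last write when all tasks end."""
    results = set()

    def explore(pcs, posted, last_def):
        if all(pc == len(task) for pc, task in zip(pcs, tasks)):
            results.add(last_def)
            return
        for i, task in enumerate(tasks):
            if pcs[i] == len(task):
                continue  # this task has already finished
            kind, arg = task[pcs[i]]
            if kind == "wait" and arg not in posted:
                continue  # blocked until the event is posted
            next_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
            next_posted = posted | {arg} if kind == "post" else posted
            next_def = arg if kind == "def" else last_def
            explore(next_pcs, next_posted, next_def)

    explore((0,) * len(tasks), frozenset(), initial_def)
    return results


def strip_sync(task):
    """Drop post/wait operations to model the program without sync edges."""
    return [op for op in task if op[0] not in ("post", "wait")]


# Operations are (kind, argument) pairs; "nop" stands for the elided statements.
task1 = [("def", "X2"), ("post", "ev2"), ("nop", None),
         ("post", "ev3"), ("wait", "ev8"), ("def", "X4")]
task2 = [("nop", None), ("wait", "ev2"), ("def", "X7"),
         ("wait", "ev3"), ("post", "ev8")]

with_sync = final_definitions([task1, task2])
without_sync = final_definitions([strip_sync(task1), strip_sync(task2)])
print(sorted(with_sync), sorted(without_sync))
```

Under these assumptions the synchronization edges order the writes as X2, X7, X4, so only X4 can be the last write; removing them lets either X4 or X7 come last. A sequential reaching-definitions analysis that follows only control edges has no way to express this difference.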

17 3. Load Elimination Rules from Sequential Code Optimization
- A load instruction at point P, T3 := *q, is redundant if the value of *q is available at point P
Before:
  T1 := *q
  T2 := *p
  T3 := *q
After:
  T1 := *q
  T2 := *p
  T3 := T1

18 Anomalies in Load Elimination Rules for Parallel Code (Original Version)
Assume that p = q, and that *p = *q = 0 initially.
  TASK 1
    ...
    T1 := *q
    T2 := *p
    T3 := *q
    print T1, T2, T3
  TASK 2
    ...
    *p = 1
    ...
Question: Is [0, 1, 0] permitted as a possible output?
Answer: It depends on the programming model.
- It is not permitted by Sequential Consistency [Lamport 1979]
- But it is permitted by Location Consistency [Gao & Sarkar 1993, 2000]

19 Anomalies in Load Elimination Rules for Parallel Code (After Load Elimination)
Assume that p = q, and that *p = *q = 0 initially.
  TASK 1
    ...
    T1 := *q
    T2 := *p
    T3 := T1
    print T1, T2, T3
  TASK 2
    ...
    *p = 1
    ...
Question: Is [0, 1, 0] permitted as a possible output?
Answer: Yes, it will be permitted by Sequential Consistency, if load elimination is performed!
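The two answers can be checked by enumerating every sequentially consistent interleaving of the single store in TASK 2 with the three statements of TASK 1. The sketch below assumes, as the slides do, that p = q and the shared location starts at 0; the function name `sc_outcomes` and the tuple encoding of statements are illustrative choices.

```python
def sc_outcomes(task1):
    """Enumerate the outputs of TASK 1 under sequential consistency.

    Each statement is (destination, source), where the source is either
    "mem" (a load of the shared location) or the name of an earlier
    temporary. TASK 2 performs a single store of 1, which may land
    before any statement of TASK 1 or after all of them.
    """
    results = set()
    for store_position in range(len(task1) + 1):
        mem = 0  # *p = *q = 0 initially
        temps = {}
        for index, (dest, source) in enumerate(task1):
            if index == store_position:
                mem = 1  # TASK 2's store takes effect here
            temps[dest] = mem if source == "mem" else temps[source]
        results.add((temps["T1"], temps["T2"], temps["T3"]))
    return results


original = [("T1", "mem"), ("T2", "mem"), ("T3", "mem")]
after_load_elimination = [("T1", "mem"), ("T2", "mem"), ("T3", "T1")]

print(sorted(sc_outcomes(original)))
print(sorted(sc_outcomes(after_load_elimination)))
```

The original version can only print outputs in which a 1 is never followed by a 0, so (0, 1, 0) is excluded. After load elimination, placing the store between the first and second loads yields (0, 1, 0), which means the transformation introduces a behavior that the original program cannot exhibit under sequential consistency.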

20 Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
- Rice Habanero Multicore Software project

21 Incremental Approaches to Coping with Parallel Code Optimization
- Large investment in infrastructures for sequential code optimization
- Introduce ad hoc rules to incrementally extend them for parallel code optimization
  - Code motion fences at synchronization operations
  - Task creation and termination via function call interfaces
  - Use of volatile storage modifiers
  - ...

22 More Comprehensive Changes will be needed for Code Optimization of Parallel Programs in the Future
- Need for a new Parallel Intermediate Representation (PIR) with robust support for code optimization of parallel programs
  - Abstract execution model for PIR
  - Storage classes (types) for locality and memory hierarchies
  - General framework for task partitioning and code motion in parallel code
  - Compiler-friendly memory model
  - Combining automatic parallelization and explicit parallelism
  - ...

23 Program Dependence Graphs [Ferrante, Ottenstein, Warren 1987]
- A Program Dependence Graph, PDG = (N', E_cd, E_dd), is derived from a CFG and consists of:

24 PDG Example
  /* S1 */ max = a[i];
  /* S2 */ div = a[i] / b[i];
  /* S3 */ if ( max < b[i] )
  /* S4 */     max = b[i];
[Figure: PDG with nodes S1, S2, S3, S4 and edge labels max (true), max (output), max (anti)]

25 PDG restrictions
- Control Dependence
  - Predicate-ancestor condition: if there are two disjoint c.d. paths from (ancestor) node A to node N, then A cannot be a region node, i.e., A must be a predicate node
  - No-postdominating-descendant condition: if node P postdominates node N in the CFG, then there cannot be a c.d. path from node N to node P

26 Violation of the Predicate-Ancestor Condition can lead to "non-serializable" PDGs [LCPC 1993]
- Node 4 is executed twice in this acyclic PDG
"Parallel Program Graphs and their Classification", V. Sarkar & B. Simons, LCPC 1993

27 PDG restrictions (contd)
- Data Dependence
  - There cannot be a data dependence edge in the PDG from node A to node B if there is no path from A to B in the CFG
  - The context C of a data dependence edge (A,B,C) must be plausible, i.e., it cannot identify a dependence from an execution instance I_A of node A to an execution instance I_B of node B if I_B precedes I_A in the CFG's execution
  - e.g., a data dependence from iteration i+1 to iteration i is not plausible in a sequential program

28 Limitations of Program Dependence Graphs
- PDGs and CFGs are tightly coupled
  - A transformation in one must be reflected in the other
- PDGs reveal maximum parallelism in the program
- CFGs reveal sequential execution
- Neither is well suited for code optimization of parallel programs, e.g., how do we represent a partitioning of { 1, 3, 4 } and { 2 } into two tasks?

29 Another Limitation: no Parallel Execution Semantics defined for PDGs
- What is the semantics of control dependence edges with cycles?
- What is the semantics of data dependences when a source or destination node may have zero, one or more instances?
  A[f(i,j)] = …
  … = A[g(i)]

30 Parallel Program Graphs: A Comprehensive Representation that Subsumes CFGs and PDGs [LCPC 1992]
A Parallel Program Graph, PPG = (N, E_control, E_sync), consists of:
- N, a set of compute, predicate, and parallel nodes
  - A parallel node creates parallel threads of computation for each of its successors
- E_control, a set of labeled control edges. Edge (A,B,L) in E_control identifies a control edge from node A to node B with label L.
- E_sync, a set of synchronization edges. Edge (A,B,F) in E_sync defines a synchronization from node A to node B with synchronization condition F, which identifies execution instances of A and B that need to be synchronized
"A Concurrent Execution Semantics for Parallel Program Graphs and Program Dependence Graphs", V. Sarkar, LCPC 1992

31 PPG Example

32 Relating CFGs to PPGs
- Construction of PPG for a sequential program
  - PPG nodes = CFG nodes
  - PPG control edges = CFG edges
  - PPG synchronization edges = empty set
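The PPG triple and the CFG-to-PPG construction above can be sketched as a small data structure. The class name `PPG`, its field names, the edge label "U" for unconditional edges, and the rule for classifying a node as a predicate (more than one distinct outgoing label) are illustrative assumptions, not definitions taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class PPG:
    """PPG = (N, E_control, E_sync), stored as plain Python containers."""
    nodes: dict = field(default_factory=dict)          # name -> node kind
    control_edges: set = field(default_factory=set)    # (A, B, L) triples
    sync_edges: set = field(default_factory=set)       # (A, B, F) triples


def ppg_from_cfg(cfg_edges):
    """Build the PPG of a sequential program from labeled CFG edges."""
    ppg = PPG()
    outgoing_labels = {}
    for source, target, label in cfg_edges:
        ppg.control_edges.add((source, target, label))  # control edges = CFG edges
        outgoing_labels.setdefault(source, set()).add(label)
        outgoing_labels.setdefault(target, set())
    for node, labels in outgoing_labels.items():
        # Assumed rule: a node that branches on a label is a predicate node.
        ppg.nodes[node] = "predicate" if len(labels) > 1 else "compute"
    # A sequential program needs no synchronization edges, so sync_edges stays empty.
    return ppg


# Assumed diamond-shaped CFG; "U" marks an unconditional edge.
cfg_edges = [
    ("START", "BB1", "U"),
    ("BB1", "BB2", "T"),
    ("BB1", "BB3", "F"),
    ("BB2", "BB4", "U"),
    ("BB3", "BB4", "U"),
    ("BB4", "STOP", "U"),
]
ppg = ppg_from_cfg(cfg_edges)
print(ppg.nodes)
```

The resulting graph has no parallel nodes and no synchronization edges, which is what makes a CFG a degenerate special case of a PPG.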

33 Relating PDGs to PPGs
- Construction of PPG for PDGs
  - PPG nodes = PDG nodes
  - PPG parallel nodes = PDG region nodes
  - PPG control edges = PDG control dependence edges
  - PPG synchronization edges = PDG data dependence edges
  - Synchronization condition F in a PPG synchronization edge mirrors the context of the PDG data dependence edge

34 Example of Transforming PPGs

35 Abstract Interpreter for PPGs
- Build a partial order ≤ of dynamic execution instances of PPG nodes as PPG execution unravels.
- Each execution instance I_A is labeled with its history (calling context), H(I_A).
- Initialize ≤ to a singleton set containing an instance of the start node, I_START, with H(I_START) initialized to the empty sequence.

36 Abstract Interpreter for PPGs (contd)
Each iteration of the scheduling algorithm:
- Selects an execution instance I_A in ≤ such that all of I_A's predecessors in ≤ have been scheduled
- Simulates execution of I_A and evaluates branch label L
- Creates an instance I_B of each c.d. successor B of A for label L
- Adds (I_B, I_C) to ≤, if instance I_C has been created in ≤ and there exists a PPG synchronization edge from B to C (or from a PPG descendant of B to C)
- Adds (I_C, I_B) to ≤, if instance I_C has been created in ≤ and there exists a PPG synchronization edge from C to B (or from a PPG descendant of C to B)

37 Abstract Interpreter for PPGs: Example
1. Create I_START
2. Schedule I_START
3. Create I_PAR
4. Schedule I_PAR
5. Create I_1, I_2, I_3
6. Add (I_1, I_3) to ≤
7. Schedule I_2
8. Schedule I_1
9. Schedule I
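The scheduling rule (an instance may run only after all of its predecessors in the partial order have been scheduled) can be illustrated by listing every schedule the example permits. This is a deliberately simplified sketch: it takes the final set of instances and ordering pairs as given instead of creating them incrementally, it assumes the control edges START to PAR and PAR to each of nodes 1, 2, 3, and it assumes the truncated last step schedules I_3.

```python
from itertools import permutations


def valid_schedules(instances, order):
    """Return every total order of instances that respects the partial order."""
    schedules = []
    for candidate in permutations(instances):
        position = {instance: i for i, instance in enumerate(candidate)}
        if all(position[a] < position[b] for a, b in order):
            schedules.append(candidate)
    return schedules


instances = ["I_START", "I_PAR", "I_1", "I_2", "I_3"]
order = {
    ("I_START", "I_PAR"),  # assumed control edge
    ("I_PAR", "I_1"),      # assumed control edge
    ("I_PAR", "I_2"),      # assumed control edge
    ("I_PAR", "I_3"),      # assumed control edge
    ("I_1", "I_3"),        # synchronization pair added in step 6
}

schedules = valid_schedules(instances, order)
for schedule in schedules:
    print(" -> ".join(schedule))
```

Three schedules survive, and the order shown on the slide (I_2, then I_1, then the remaining instance) is one of them. In every one of them I_1 runs before I_3, which is the constraint the synchronization edge contributes.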

38 Weak (Deterministic) Memory Model for PPGs
- All memory accesses are assumed to be non-atomic
- Read-write hazard: if I_a reads a location for which there is a parallel write of a different value, then the execution result is an error
  - Analogous to an exception thrown if a data race occurs
  - May be thrown when read or write operation is performed
- Write-write hazard: if I_a writes into a location for which there is a parallel write of a different value, then the resulting value in the location is undefined
  - Execution results in an error if that location is subsequently read
- Separation of data communication and synchronization:
  - Data communication specified by read/write operations
  - Sequencing specified by synchronization and control edges
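The two hazard rules can be sketched as a check on a pair of execution instances that are unordered with respect to each other. The function name `check_hazards`, the dictionary encoding of the store and write sets, and the reading of "a different value" as "different from the value currently in the store" for the read-write case are all assumptions made for illustration.

```python
def check_hazards(store, reads_a, writes_a, writes_b):
    """Report hazards between instance a and a parallel instance b.

    store    : dict mapping location -> current value
    reads_a  : set of locations read by instance a
    writes_a : dict mapping location -> value written by instance a
    writes_b : dict mapping location -> value written by parallel instance b
    """
    hazards = []
    for location in sorted(reads_a):
        # Read-write hazard: a parallel write would change the value a reads.
        if location in writes_b and writes_b[location] != store.get(location):
            hazards.append(("read-write", location))
    for location in sorted(writes_a):
        # Write-write hazard: two parallel writes of different values.
        if location in writes_b and writes_b[location] != writes_a[location]:
            hazards.append(("write-write", location))
    return hazards


store = {"x": 0, "y": 5}
racy = check_hazards(store, reads_a={"x"}, writes_a={"y": 7},
                     writes_b={"x": 1, "y": 8})
benign = check_hazards(store, reads_a={"x"}, writes_a={"y": 7},
                       writes_b={"x": 0, "y": 7})
print(racy, benign)
```

Parallel writes of the same value are not flagged in this sketch, which matches the "different value" wording of both rules and is what allows the model to stay deterministic.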

39 Soundness Properties
- Reordering Theorem
  - For a given Parallel Program Graph, G, and input store, σ_i, the final store σ_f = G(σ_i) obtained is the same for all possible scheduled sequences in the abstract interpreter
- Equivalence Theorem
  - A sequential program and its PDG have identical semantics, i.e., they yield the same output store when executed with the same input store

40 Reaching Definitions Analysis on PPGs [LCPC 1997]
- A definition D is redefined at program point P if there is a control path from D to P, and D is killed along all paths from D to P.
"Analysis and Optimization of Explicitly Parallel Programs using the Parallel Program Graph Representation", V. Sarkar, LCPC 1997

41 Reaching Definitions Analysis on PPGs
[Figure legend: control edges, sync edges]
  S1: X1 := …
  // Task 1
  S2: X2 := …
      post(ev2);
  S3: ...
      post(ev3);
  S4: wait(ev8);
      X4 := …
  // Task 2
  S5: ...
  S6: wait(ev2);
  S7: X7 := …
  S8: wait(ev3);
      post(ev8);

42 PPG Limitations
- Past work has focused on comprehensive representation and semantics for deterministic programs
- Extensions needed for
  - Atomicity and mutual exclusion
  - Stronger memory models
  - Storage classes with explicit locality

43 Issues in Modeling Synchronized/Atomic Blocks [LCPC 1999]
  a = ...
  synchronized (L) {
      ... = p.x
      q.y = ...
      b =
  }
  ... = r.z
Questions:
- Can the load of p.x be moved below the store of q.y?
- Can the load of p.x be moved outside the synchronized block?
- Can the load of r.z be moved inside the synchronized block?
- Can the load of r.z be moved back outside the synchronized block?
- How should the data dependences be modeled?
"Dependence Analysis for Java", C. Chambers et al, LCPC 1999

44 Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
- Rice Habanero Multicore Software project

45 Habanero Project (habanero.rice.edu)
1) Habanero Programming Language
2) Habanero Static Compiler
3) Habanero Virtual Machine
4) Habanero Concurrency Library
5) Habanero Toolkit
[Figure: other diagram labels: Parallel Applications; X10 …; Sequential C, Fortran, Java, …; Foreign Function Interface; Vendor Compiler & Libraries; Multicore Hardware]

46 2) Habanero Static Parallelizing & Optimizing Compiler
[Figure: compiler organization. Diagram labels: X10/Habanero Language; Front End; AST; IRGen; Parallel IR (PIR); Interprocedural Analysis; PIR Analysis & Optimization; Classfile Transformations; Annotated Classfiles; Portable Managed Runtime; C / Fortran (restricted code regions for targeting accelerators & high-end computing); Platform-specific static compiler; Partitioned Code; Sequential C, Fortran, Java, …; Foreign Function Interface]

47 Habanero Target Applications and Platforms
Applications:
- Parallel Benchmarks
  - SSCA's #1, #2, #3 from DARPA HPCS program
  - NAS Parallel Benchmarks
  - JGF, JUC, SciMark benchmarks
- Medical Imaging
  - Back-end processing for Compressive Sensing (www.dsp.ece.rice.edu/cs)
  - Contacts: Rich Baraniuk (Rice), Jason Cong (UCLA)
- Seismic Data Processing
  - Rice Inversion project (www.trip.caam.rice.edu)
  - Contact: Bill Symes (Rice), James Gunning (CSIRO)
- Computer Graphics and Visualization
  - Mathematical modeling and smoothing of meshes
  - Contact: Joe Warren (Rice)
- Computational Chemistry
  - Fock Matrix Construction
  - Contacts: David Bernholdt, Wael Elwasif, Robert Harrison, Annirudha Shet (ORNL)
- Habanero Compiler
  - Implement Habanero compiler in Habanero so as to exploit multicore parallelism within the compiler
Platforms:
- AMD Barcelona Quad-Core
- Clearspeed Advance X620
- DRC Coprocessor Module w/ Xilinx Virtex FPGA
- IBM Cell
- IBM Cyclops-64 (C-64)
- IBM Power5+, Power6
- Intel Xeon Quad-Core
- NVIDIA Tesla S870
- Sun UltraSparc T1, T2
- ...
Additional suggestions welcome!

48 Habanero Research Topics
1) Language research
- Explicit parallelism: portable constructs for homogeneous & heterogeneous multicore
- Implicit deterministic parallelism: array views, single-assignment constructs
- Implicit non-deterministic parallelism: unordered iterators, partially ordered statement blocks
- Builds on our experiences with the X10, CAF, HPF, Matlab D, Fortran 90 and Sisal languages
2) Compiler research
- New Parallel Intermediate Representation (PIR)
- Automatic analysis, transformation, and parallelization of PIR
- Optimization of high-level arrays and iterators
- Optimization of synchronization, data transfer, and transactional memory operations
- Code partitioning for accelerators
- Builds on our experiences with the D System, Massively Scalar, Telescoping Languages Framework, ASTI and PTRAN research compilers

49 Habanero Research Topics (contd.)
3) Virtual machine research
- VM support for work-stealing scheduling algorithms with extensions for places, transactions, task groups
- Runtime support for other Habanero language constructs (phasers, regions, distributions)
- Integration and exploitation of lightweight profiling in VM scheduler and memory management system
- Builds on our experiences with the Jikes Research Virtual Machine
4) Concurrency library research
- New nonblocking data structures to support the Habanero runtime
- Efficient software transactional memory libraries
- Builds on our experiences with the java.util.concurrent and DSTM2 libraries
5) Toolkit research
- Program analysis for common parallel software errors
- Performance attribution of shared code regions (loops, procedure calls) using static and dynamic calling context
- Builds on our experiences with the HPCToolkit, Eclipse PTP and DrJava projects

50 Opportunities for Broader Impact
- Education
  - Influence how parallelism is taught in future Computer Science curricula
- Open Source
  - Build an open source testbed to grow ecosystem for researchers in Parallel Software area
- Industry standards
  - Use research results as proofs of concept for new features that can be standardized
  - Infrastructure can provide foundation for reference implementations
- Collaborations welcome!

51 Habanero Team (Nov 2007)
Send to Vivek Sarkar if you are in a PhD, postdoc, research scientist, or programmer position in the Habanero project, or in collaborating with us!

52 Other Challenges in Code Optimization of Parallel Code
- Optimization of task coordination
  - Task creation and termination: fork, join
  - Mutual exclusion: locks, transactions
  - Synchronization: semaphores, barriers
- Data Locality Optimizations
  - Computation and data alignment
  - Communication optimizations
- Deployment and Code Generation
  - Homogeneous Multicore
  - Heterogeneous Multicore and Accelerators
- Automatic Parallelization Revisited
- ...

53 Related Work (Incomplete List)
- Analysis of nondeterministic sequentially consistent parallel programs
  - [Shasha, Snir 1988], [Midkiff et al 1989], [Chow, Harrison 1992], [Lee et al 1997], …
- Analysis of deterministic parallel programs with copy-in/copy-out semantics
  - [Srinivasan 1994], [Ferrante et al 1996], …
- Value-oriented semantics for functional subsets of PDGs
  - [Selke 1989], [Cartwright, Felleisen 1989], [Beck, Pingali 1989], [Ottenstein, Ballance, Maccabe 1990], …
- Serialization of restricted subsets of PDGs
  - [Ferrante, Mace, Simons 1988], [Simons et al 1990], …
- Concurrency analysis
  - [Long, Clarke 1989], [Duesterwald, Soffa 1991], [Masticola, Ryder 1993], [Naumovich, Avrunin 1998], [Agarwal et al 2007], …

54 PLDI 2008 Tutorial (Tucson, AZ)
- Analysis and Optimization of Parallel Programs
  - Intermediate representations for parallel programs
  - Data flow analysis frameworks for parallel programs
  - Locality analyses: scalar/array privatization, escape analysis of objects, locality types
  - Memory models and their impact on code optimization of locks and transactional memory operations
  - Optimizations of task partitions and synchronization operations
- Sam Midkiff, Vivek Sarkar
- Sunday afternoon (June 8, 2008, 1:30pm - 5:00pm)

55 Conclusions
- New paradigm shift in Code Optimization due to Parallel Programs
- Foundations of Code Optimization will need to be revisited from scratch
- Foundations will impact high-level and low-level optimizers, as well as tools
- Exciting times to be a compiler researcher!

