Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers Martin C. Rinard Pedro C. Diniz University of California, Santa Barbara Santa Barbara, California

Goal Develop a Parallelizing Compiler for Object-Oriented Computations Current Focus Irregular Computations Dynamic Data Structures Future Persistent Data Distributed Computations New Analysis Technique: Commutativity Analysis

Structure of Talk Model of Computation Example Commutativity Testing Steps To Practicality Experimental Results Conclusion

Model of Computation [diagram: an operation executes on an object, transforming the initial object state into a new object state and invoking further operations]

Graph Traversal Example class graph { int val, sum; graph *left, *right; }; void graph::traverse(int v) { sum += v; if (left !=NULL) left->traverse(val); if (right!=NULL) right->traverse(val); } Goal Execute left and right traverse operations in parallel

Parallel Traversal

Commuting Operations in Parallel Traversal

Model of Computation Operations: Method Invocations In Example: Invocations of graph::traverse left->traverse(3) right->traverse(2) Objects: Instances of Classes In Example: Graph Nodes Instance Variables Implement Object State In Example: val, sum, left, right

Separable Operations Each Operation Consists of Two Sections Object Section Only Accesses Receiver Object Invocation Section Only Invokes Operations Both Sections Can Access Parameters
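To make the split concrete, here is a minimal sketch (not from the original slides) that annotates the graph::traverse method from the example, with the receiver accesses conceptually hoisted into the object section:
void graph::traverse(int v) {
  // Object section: all accesses to the receiver object (and the parameter v)
  sum += v;
  graph *l = left, *r = right;  // receiver fields conceptually copied here
  int d = val;                  // so the invocation section needs no receiver access
  // Invocation section: only invokes operations, using the copied values
  if (l != NULL) l->traverse(d);
  if (r != NULL) r->traverse(d);
}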

Basic Approach Compiler Chooses A Computation to Parallelize In Example: Entire graph::traverse Computation Compiler Computes Extent of the Computation Representation of all Operations in Computation Current Representation: Set of Methods In Example: { graph::traverse } Do All Pairs of Operations in Extent Commute? No - Generate Serial Code Yes - Generate Parallel Code In Example: All Pairs Commute

Code Generation For Each Method in Parallel Computation Augments Class Declaration With Mutual Exclusion Lock Generates Driver Version of Method Invoked from Serial Code to Start Parallel Execution Invokes Parallel Version of Operation Waits for Entire Parallel Computation to Finish Generates Parallel Version of Method Object Section Lock Acquired at Beginning Lock Released at End Ensure Atomic Execution Invocation Section Invoked Operations Execute in Parallel Invokes Parallel Version

Code Generation In Example Class Declaration class graph { lock mutex; int val, sum; graph *left, *right; }; Driver Version void graph::traverse(int v){ parallel_traverse(v); wait(); }

Parallel Version In Example void graph::parallel_traverse(int v) { mutex.acquire(); sum += v; mutex.release(); if (left != NULL) spawn(left->parallel_traverse(val)); if (right != NULL) spawn(right->parallel_traverse(val)); }

Compiler Structure [flowchart: Computation Selection (entire computation of each method), then Extent Computation (traverse call graph to extract extent), then Commutativity Testing (all pairs of operations in extent); if all operations commute, Generate Parallel Code; if operations may not commute, Generate Serial Code]

Traditional Approach Data Dependence Analysis Analyzes Reads and Writes Independent Pieces of Code Execute in Parallel Demonstrated Success for Array-Based Programs

Data Dependence Analysis in Example For Data Dependence Analysis To Succeed in Example left and right traverse Must Be Independent left and right Subgraphs Must Be Disjoint Graph Must Be a Tree Depends on Global Topology of Data Structure Analyze Code that Builds Data Structure Extract and Propagate Topology Information Fails For Graphs
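As a hypothetical illustration (the node and pointer names are made up, and the fields are assumed public and NULL-initialized for brevity), the following builds a graph whose left and right subgraphs share a node; data dependence analysis cannot prove the two recursive traverse calls independent, even though they commute:
graph *shared = new graph();
graph *root = new graph();
root->left = new graph();
root->right = new graph();
root->left->right = shared;   // both subgraphs reach the same node,
root->right->left = shared;   // so both traversals update shared->sum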

Properties of Commutativity Analysis Oblivious to Data Structure Topology Local Analysis Simple Analysis Wide Range of Computations Lists, Trees and Graphs Updates to Central Data Structure General Reductions Introduces Synchronization Relies on Commuting Operations

Commutativity Testing

Commutativity Testing Conditions Do Two Operations A and B Commute? Compiler Considers Two Execution Orders A;B - A executes before B B;A - B executes before A Compiler Must Check Two Conditions Instance Variables New values of instance variables are same in both execution orders Invoked Operations A and B together directly invoke same set of operations in both execution orders

Commutativity Testing Conditions

Commutativity Testing Algorithm Symbolic Execution: Compiler Executes Operations Computes with Expressions not Values Compiler Symbolically Executes Operations In Both Execution Orders Expressions for New Values of Instance Variables Expressions for Multiset of Invoked Operations

Expression Simplification and Comparison Compiler Applies Rewrite Rules to Simplify Expressions a*(b+c) -> (a*b)+(a*c) b+(a+c) -> (a+b+c) a+if(b<c,d,e) -> if(b<c,a+d,a+e) Compiler Compares Corresponding Expressions If All Equal - Operations Commute If Not All Equal - Operations May Not Commute

Commutativity Testing Example Two Operations r->traverse(v1) and r->traverse(v2) In Order r->traverse(v1) ; r->traverse(v2) Instance Variables New sum= (sum+v1)+v2 Invoked Operations if(right!=NULL,right->traverse(val)), if(left!=NULL,left->traverse(val)), if(right!=NULL,right->traverse(val)), if(left!=NULL,left->traverse(val)) In Order r->traverse(v2) ; r->traverse(v1) Instance Variables New sum= (sum+v2)+v1 Invoked Operations if(right!=NULL,right->traverse(val)), if(left!=NULL,left->traverse(val)), if(right!=NULL,right->traverse(val)), if(left!=NULL,left->traverse(val))
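Using the rewrite rules from the previous slide (which flatten and reorder sums), both new values of sum simplify to the same expression, sum+v1+v2; the two multisets of invoked operations are already syntactically identical, so both conditions hold and the two traverse operations commute.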

Important Special Case Independent Operations Commute Conditions for Independence Operations Have Different Receivers Neither Operation Writes an Instance Variable that Other Operation Accesses Detecting Independent Operations In Type-Safe Languages Class Declarations Instance Variable Accesses Pointer or Alias Analysis
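For instance (a hypothetical helper, assuming n1 and n2 are two distinct graph nodes), the following pair of operations satisfies both independence conditions: the receivers differ, and each operation writes only its own receiver's sum field:
void example(graph *n1, graph *n2) {
  n1->traverse(1);  // writes n1->sum; reads n1->val, n1->left, n1->right
  n2->traverse(2);  // writes n2->sum; reads n2->val, n2->left, n2->right
}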

Analysis in Current Compiler Dependence Analysis Operations on Objects of Different Classes Independent Operations on Objects of Same Class Symbolic Commutativity Testing Dependent Operations on Objects of Same Class Future Integrate Pointer or Alias Analysis Integrate Array Data Dependence Analysis

Steps to Practicality

Programming Model Extensions Extensions for Read-Only Data Allow Operations to Freely Access Read-Only Data Enhances Ability of Compiler to Represent Expressions Increases Set of Programs that Compiler can Analyze Analysis Granularity Extensions Integrate Operations into Callers for Analysis Purposes Coarsens Commutativity Testing Granularity Reduces Number of Pairs Tested for Commutativity Enhances Effectiveness of Commutativity Testing

Optimizations Synchronization Optimizations Eliminate Synchronization Constructs in Methods that Only Access Read-Only Data Reduce Number of Acquire and Release Constructs Parallel Loop Optimization Suppress Exploitation of Excess Concurrency

Extent Constants Motivation: Allow Parallel Operations to Freely Access Read-Only Data Extent Constant Variable - Global variable or instance variable written by no operation in extent Extent Constant Expression - Expression whose value depends only on extent constant variables or parameters Extent Constant Value - Value computed by extent constant expression Extent Constant - Automatically generated opaque constant used to represent an extent constant value Requires: Interprocedural Data Usage Analysis Result Summarizes How Operations Access Instance Variables Interprocedural Pointer Analysis for Reference Parameters

Extent Constant Variables In Example void graph::traverse(int v) { sum += v; if (left != NULL) left->traverse(val); if (right != NULL) right->traverse(val); } Here val is the Extent Constant Variable: it is read but never written by any operation in the extent.

Advantages of Extent Constants Extent Constants Extend Programming Model Enable Direct Global Variable Access Enable Direct Access of Objects other than Receiver Extent Constants Make Compiler More Effective Enable Compact Representations of Large Expressions Enable Compiler to Represent Values Computed by Otherwise Unanalyzable Constructs

Auxiliary Operations Motivation: Coarsen Granularity of Commutativity Testing An Operation is an Auxiliary Operation if its Entire Computation Only Computes Extent Constant Values and its Only Externally Visible Writes are to Local Variables of the Caller Auxiliary Operations are Conceptually Part of the Caller Analysis Integrates Auxiliary Operations into Caller Represents Computed Values using Extent Constants Requires: Interprocedural Data Usage Analysis Interprocedural Pointer Analysis for Reference Parameters Intraprocedural Reaching Definition Analysis

Auxiliary Operation Example int graph::square_and_add(int v) { return(val*val + v); } void graph::traverse(int v) { sum += square_and_add(v); if (left != NULL) left->traverse(val); if (right != NULL) right->traverse(val); } In square_and_add, val*val + v is an Extent Constant Expression: val is an Extent Constant Variable and v is a Parameter.

Advantages of Auxiliary Operations Coarsen Granularity of Commutativity Testing Reduces Number of Pairs Tested for Commutativity Enhances Effectiveness of Commutativity Testing Algorithm Support Modular Programming

Synchronization Optimizations Goal: Eliminate or Reduce Synchronization Overhead Synchronization Elimination If an Operation Only Computes Extent Constant Values Then the Compiler Does Not Generate Lock Acquire and Release Lock Coarsening Data - Use One Lock for Multiple Objects Computation - Generate One Lock Acquire and Release for Multiple Operations on the Same Object

Data Lock Coarsening Example
Original Code:
class vector { lock mutex; double val[NDIM]; };
void vector::add(double *v){ mutex.acquire(); for(int i=0; i < NDIM; i++) val[i] += v[i]; mutex.release(); }
class body { lock mutex; double phi; vector acc; };
void body::gravsub(body *b){ double p, v[NDIM]; mutex.acquire(); p = computeInter(b,v); phi -= p; mutex.release(); acc.add(v); }
Optimized Code:
class vector { double val[NDIM]; };
void vector::add(double *v){ for(int i=0; i < NDIM; i++) val[i] += v[i]; }
class body { lock mutex; double phi; vector acc; };
void body::gravsub(body *b){ double p, v[NDIM]; mutex.acquire(); p = computeInter(b,v); phi -= p; acc.add(v); mutex.release(); }

Computation Lock Coarsening Example
Original Code:
class body { lock mutex; double phi; vector acc; };
void body::gravsub(body *b){ double p, v[NDIM]; mutex.acquire(); p = computeInter(b,v); phi -= p; acc.add(v); mutex.release(); }
void body::loopsub(body *b){ int i; for (i = 0; i < N; i++) { this->gravsub(b+i); } }
Optimized Code:
class body { lock mutex; double phi; vector acc; };
void body::gravsub(body *b){ double p, v[NDIM]; p = computeInter(b,v); phi -= p; acc.add(v); }
void body::loopsub(body *b){ int i; mutex.acquire(); for (i = 0; i < N; i++) { this->gravsub(b+i); } mutex.release(); }

Parallel Loops Goal: Generate Efficient Code for Parallel Loops If a Loop is in the Following Form for (i = exp1; i < exp2; i += exp3) { exp4->op(exp5,exp6,...); } Where exp1, exp2, ... are Extent Constant Expressions Then the Compiler Generates Parallel Loop Code

Parallel Loop Optimization Without Parallel Loop Optimization Each Loop Iteration Generates a Task Tasks are Created and Scheduled Sequentially Each Iteration Incurs Task Creation and Scheduling Overhead With Parallel Loop Optimization Generated Code Immediately Exposes All Iterations Scheduler Operates on Chunks of Loop Iterations Each Chunk of Iterations Incurs Scheduling Overhead Advantages Enables Compact Representation for Loop Computation Reduces Task Creation and Scheduling Overhead Parallelizes Overhead
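The following is a rough sketch of the chunking idea only, not the compiler's actual generated code; the function and parameter names are illustrative:
// Workers would claim chunks of [lo, hi) concurrently; shown sequentially here.
void run_parallel_loop(int lo, int hi, int chunk, void (*body)(int)) {
  for (int start = lo; start < hi; start += chunk) {  // one scheduling event per chunk
    int end = (start + chunk < hi) ? (start + chunk) : hi;
    for (int i = start; i < end; i++)
      body(i);                                        // one loop iteration
  }
}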

Suppressing Excess Concurrency Goal: Reduce Overhead of Exploiting Parallelism Goal Achieved by Generating Computations that Execute Operations Serially with No Parallelization Overhead Use Synchronization Required to Execute Safely in Parallel Context Mechanism: Mutex Versions of Methods Object Section Acquires Lock at Beginning Releases Lock at End Invocation Section Operations Execute Serially Invokes Mutex Version Current Policy: Each Parallel Loop Iteration Invokes Mutex Version of Operation Suppresses Parallel Execution Within Iterations of Parallel Loops
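A minimal sketch of what the mutex version of the earlier traverse example might look like (the name mutex_traverse is hypothetical): the object section still executes atomically under the lock, but the invocation section calls the mutex versions serially instead of spawning:
void graph::mutex_traverse(int v) {
  mutex.acquire();
  sum += v;                                        // object section, atomic
  mutex.release();
  if (left != NULL) left->mutex_traverse(val);     // executed serially, no spawn
  if (right != NULL) right->mutex_traverse(val);   // executed serially, no spawn
}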

Experimental Results

Methodology Built Prototype Compiler Built Run Time System Concurrency Generation and Task Management Dynamic Load Balancing Synchronization Acquired Two Complete Applications Barnes-Hut N-Body Solver Water Code Automatically Parallelized Applications Ran Applications on Stanford DASH Machine Compared Performance with Highly Tuned, Explicitly Parallel Versions from SPLASH-2 Benchmark Suite

Prototype Compiler Clean Subset of C++ Sage++ is Front End Structured As a Source-To-Source Translator Analysis Finds Parallel Loops and Methods Compiler Generates Annotation File Identifies Parallel Loops and Methods Classes to Augment with Locks Code Generator Reads Annotation File Generates Parallel Versions of Methods Inserts Synchronization and Parallelization Code Parallelizes Unannotated Programs

Major Restrictions Motivation: Simplify Implementation of Prototype No Virtual Methods No Operator or Method Overloading No Multiple Inheritance or Templates No typedef, struct, union or enum types Global Variables must be Class Types No Static Members or Pointers to Members No Default Arguments or Variable Numbers of Arguments No Operation Accesses a Variable Declared in a Class from which its Receiver Class Inherits

Run Time Library Motivation: Provide Basic Concurrency Management Single Program, Multiple Data Execution Model Single Address Space Alternate Serial and Parallel Phases Library Provides Task Creation and Synchronization Primitives Dynamic Load Balancing Implemented on the Stanford DASH Shared-Memory Multiprocessor and SGI Shared-Memory Multiprocessors

Applications Barnes-Hut O(N lg N) N-Body Solver Space Subdivision Tree 1500 Lines of C++ Code Water Simulates Liquid Water O(N^2) Algorithm 1850 Lines of C++ Code

Obtaining Serial C++ Version of Barnes-Hut Started with Explicitly Parallel Version (SPLASH-2) Removed Parallel Constructs to get Serial C Converted to Clean Object-Based C++ Major Structural Changes Eliminated Scheduling Code and Data Structures Split a Loop in Force Computation Phase Introduced New Field into Particle Data Structure

Obtaining Serial C++ Version of Water Started with Serial C translated from FORTRAN Converted to Clean Object-Based C++ Major Structural Change Auxiliary Objects for O(N^2) phases

Commutativity Statistics for Barnes-Hut [chart: for each parallel extent - Position (3 Methods), Force (6 Methods), Velocity (3 Methods) - the number of pairs tested for commutativity, the number of independent pairs, and the number of symbolically executed pairs]

Auxiliary Operation Statistics for Barnes-Hut [chart: for each parallel extent - Position (3 Methods), Force (6 Methods), Velocity (3 Methods) - the total number of call sites and the number of auxiliary operation call sites]

Performance Results for Barnes-Hut [graphs: speedup vs. number of processors on DASH for the 8K and 16K particle data sets, comparing the Commutativity Analysis version with the SPLASH-2 version against ideal speedup]

Performance Analysis Motivation: Understand Behavior of Parallelized Program Instrumented Code to Measure Execution Time Breakdowns Parallel Idle - Time Spent Idle in Parallel Section Serial Idle - Time Spent Idle in a Serial Section Blocked - Time Spent Waiting to Acquire a Lock Held by Another Processor Parallel Compute - Time Spent Doing Useful Work in a Parallel Section Serial Compute - Time Spent Doing Useful Work in a Serial Section

Performance Analysis for Barnes-Hut [graphs: cumulative total time (seconds) vs. number of processors on DASH for the 8K and 16K particle data sets, broken down into Serial Compute, Parallel Compute, Blocked, Serial Idle, and Parallel Idle]

Performance Results for Water [graphs: speedup vs. number of processors on DASH for two molecule data sets, comparing the Commutativity Analysis version with the SPLASH-2 version against ideal speedup]

Performance Results for Computation Replication Version of Water [graphs: speedup vs. number of processors on DASH for two molecule data sets, comparing the Commutativity Analysis version with the SPLASH-2 version against ideal speedup]

Commutativity Statistics for Water [chart: for each parallel extent - Virtual (3 Methods), Forces (2 Methods), Loading (4 Methods), Momenta (2 Methods), Energy (5 Methods) - the number of pairs tested for commutativity, the number of independent pairs, and the number of symbolically executed pairs]

Auxiliary Operation Statistics for Water [chart: for each parallel extent - Virtual (3 Methods), Forces (2 Methods), Loading (4 Methods), Momenta (2 Methods), Energy (5 Methods) - the total number of call sites and the number of auxiliary operation call sites]

Performance Analysis for Water [graphs: cumulative total time (seconds) vs. number of processors on DASH for two molecule data sets, broken down into Serial Compute, Parallel Compute, Blocked, Serial Idle, and Parallel Idle]

Future Work Relative Commutativity Integrate Other Analysis Frameworks Pointer or Alias Analysis Array Data Dependence Analysis Analysis Problems Synchronization Optimizations Analysis Granularity Optimizations Generation of Self-Tuning Code Message Passing Implementation

Related Work Dependence Analysis: Bernstein (IEEE Transactions on Computers 1966) Dependence Analysis for Pointer-Based Data Structures: Landi, Ryder and Zhang (PLDI 93); Hendren, Hummel and Nicolau (PLDI 92); Plevyak, Karamcheti and Chien (LCPC 93); Chase, Wegman and Zadeck (PLDI 90); Larus and Hilfinger (PLDI 88); Ghiya and Hendren (POPL 96); Ruf (PLDI 95); Wilson and Lam (PLDI 95); Deutsch (PLDI 94); Choi, Burke and Carini (POPL 93) Reduction Analysis: Ghuloum and Fisher (PPOPP 95); Pinter and Pinter (POPL 92); Callahan (LCPC 91) Commuting Operations in Parallel Languages: Rinard and Lam (PPOPP 91); Steele (POPL 90); Barth, Nikhil and Arvind (FPCA 91)

Conclusions

Conclusion Commutativity Analysis New Analysis Framework for Parallelizing Compilers Basic Idea Recognize Commuting Operations Generate Parallel Code Current Focus Dynamic, Pointer-Based Data Structures Good Initial Results Future Persistent Data Distributed Computations

Latest Version of Paper

What if Operations Do Not Commute? Parallel Tree Traversal Example: Distance of Node from Root class tree { int distance; tree *left; tree *right; }; void tree::set_distance(int d) { distance = d; if (left != NULL) left->set_distance(d+1); if (right != NULL) right->set_distance(d+1); }

Equivalent Computation with Commuting Operations void tree::sum_distance(int d) { distance = distance + d; if (left != NULL) left->sum_distance(d+1); if (right != NULL) right->sum_distance(d+1); } void tree::zero_distance() { distance = 0; if (left != NULL) left->zero_distance(); if (right != NULL) right->zero_distance(); } void tree::set_distance(int d) { zero_distance(); sum_distance(d); }

Theoretical Result For Any Tree Traversal on Data With a Commutative Operator (for example +) that has a Zero Element (for example 0) There Exists a Program P such that P Computes the Traversal and Commutativity Analysis Can Automatically Parallelize P Complexity Results: Program P is asymptotically optimal if the Data Structure is a Perfectly Balanced Tree Program P has complexity O(N^2) if the Data Structure is a Linked List

Pure Object-Based Model of Computation Goal Obtain a Powerful, Clean Model of Computation Enable Compiler to Analyze Program Objects: Instances of Classes Implement State with Instance Variables Primitive Types from Underlying Language (int,...) References to Other Objects Nested Objects Operations: Invocations of Methods Each Operation Has Single Receiver Object
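As an illustration only (not from the slides, and assuming NDIM is defined as in the earlier vector example), an object's state in this model can mix all three kinds of instance variables, and each method has a single receiver:
class vector { double val[NDIM]; };  // nested object type
class body {
  double phi;          // primitive instance variable
  body *next;          // reference to another object
  vector acc;          // nested object
  void sub(body *b);   // operation: invoked on a single receiver object
};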