The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization Lawrence Rauchwerger and David A. Padua PLDI.

Presentation transcript:

The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization
Lawrence Rauchwerger and David A. Padua, PLDI 1995
Presented by Seung-Jai Min

Introduction
Motivation: current parallelizing compilers cannot handle access patterns that are complex or insufficiently defined at compile time (input-dependent data, run-time-dependent conditions, subscripted subscripts, etc.).
The LRPD Test:
- speculatively executes the loop as a doall
- applies a fully parallel, cross-iteration data dependence test
- if the test fails, the loop is re-executed serially

Inspector-Executor Method
Inspector/Executor:
- extract and analyze the memory access pattern (inspector)
- transform the loop if necessary and execute it (executor)
Disadvantages:
- cost and side effects: the access pattern cannot be extracted separately if the address computation of the array under test depends on data computed by the loop itself
- parallel execution of the inspector loop is not always possible
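
As a rough illustration of the inspector/executor idea (not the paper's code), here is a minimal C sketch for a loop with subscripted subscripts; the loop body, the array names, and the inspector_says_parallel helper are assumptions made for this example:

#include <stdbool.h>
#include <stdlib.h>

/* Inspector/executor sketch for the loop:
 *   for i in 0..n-1:  A[L[i]] = A[K[i]] + C[i]
 * The inspector walks only the subscript arrays K and L, so it is cheap
 * and side-effect free as long as K and L do not depend on values that
 * the loop itself computes.                                              */
static bool inspector_says_parallel(const int *K, const int *L, int n, int s)
{
    int *last_read  = malloc(s * sizeof(int));   /* last iteration reading A[e] */
    int *last_write = malloc(s * sizeof(int));   /* last iteration writing A[e] */
    for (int e = 0; e < s; e++) last_read[e] = last_write[e] = -1;

    bool parallel = true;
    for (int i = 0; i < n && parallel; i++) {
        if (last_write[K[i]] >= 0) parallel = false;     /* flow dependence   */
        last_read[K[i]] = i;
        if (last_write[L[i]] >= 0) parallel = false;     /* output dependence */
        if (last_read[L[i]] >= 0 && last_read[L[i]] != i)
            parallel = false;                            /* anti dependence   */
        last_write[L[i]] = i;
    }
    free(last_read);
    free(last_write);
    return parallel;
}

void executor(double *A, const double *C, const int *K, const int *L, int n, int s)
{
    if (inspector_says_parallel(K, L, n, s)) {
        #pragma omp parallel for          /* no cross-iteration dependences   */
        for (int i = 0; i < n; i++)
            A[L[i]] = A[K[i]] + C[i];
    } else {
        for (int i = 0; i < n; i++)       /* dependences found: run serially  */
            A[L[i]] = A[K[i]] + C[i];
    }
}

If the subscript arrays K and L were instead computed from A inside the loop, the inspector could not be separated from the computation, which is exactly the disadvantage noted above.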

Speculative run-time parallelization (overview diagram). Compile time: static analysis and run-time transformations (Polaris). Run time: checkpoint, speculative parallel execution, test; on pass, continue; on fail, restore and fall back to sequential execution (heuristic, reorder).

Hazards (during the speculative execution)
Exceptions:
- invalidate the parallel execution
- clear the exception flag, restore the values of any altered variables, and execute the loop serially
Cross-iteration dependences in the loop:
- detected by the LRPD test
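
The overall flow in the two slides above can be read as the following hedged C sketch of a speculative driver; every helper here (checkpointing, the speculative doall, the exception flag, the test, the serial fallback) is a hypothetical name, not an interface from the paper:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical hooks; their bodies depend on the particular loop.            */
void run_speculative_doall(double *A, int s);  /* parallel execution + marking */
bool exception_raised(void);                   /* e.g. invalid subscript seen  */
bool lrpd_test_passed(void);                   /* analysis phase over shadows  */
void run_serial_loop(double *A, int s);        /* original sequential loop     */

void speculative_execute(double *A, int s)
{
    /* Checkpoint the state that the speculative execution may overwrite.     */
    double *saved = malloc(s * sizeof(double));
    memcpy(saved, A, s * sizeof(double));

    run_speculative_doall(A, s);               /* speculative parallel execution */

    if (exception_raised() || !lrpd_test_passed()) {
        memcpy(A, saved, s * sizeof(double));  /* restore altered values         */
        run_serial_loop(A, s);                 /* fall back to serial execution  */
    }
    free(saved);
}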

LPD Test (The Lazy Privatizing Doall Test)
1. Marking Phase (performed during the speculative parallel execution)
For each shared array A[1:s], maintain read, write, and not-private shadow arrays A_r[1:s], A_w[1:s], and A_np[1:s].
(a) Uses: if this array element has not been modified, set the corresponding elements in A_r and A_np.
(b) Defs: set the corresponding element in A_w, and clear it in A_r if set.
(c) tw_i(A): count the total number of write accesses to A that are marked in iteration i.
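
One way to read the marking rules above is the following C sketch (sequential, one iteration at a time; real implementations keep per-processor shadow structures and merge them, which is omitted here, and all names are illustrative):

#include <string.h>

enum { S = 1024 };                   /* assumed size of the shared array A[1:s]   */

int A_w[S], A_r[S], A_np[S];         /* write / read / not-private shadow arrays  */
int written_now[S];                  /* was A[e] already written in this iteration? */
int tw;                              /* running total: sum of tw_i(A) over all i   */

void begin_iteration(void)
{
    memset(written_now, 0, sizeof written_now);
}

/* (a) Uses: element not yet modified in this iteration -> mark A_r and A_np. */
void markread(int e)
{
    if (!written_now[e]) {
        A_r[e]  = 1;
        A_np[e] = 1;                 /* read before write: not privatizable      */
    }
}

/* (b) Defs: mark A_w (clearing A_r if set) and (c) count toward tw_i(A).     */
void markwrite(int e)
{
    A_w[e] = 1;
    if (A_r[e]) A_r[e] = 0;          /* as stated in rule (b) above              */
    if (!written_now[e]) {
        written_now[e] = 1;
        tw++;                        /* each element counted once per iteration  */
    }
}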

LPD Test (The Lazy Privatizing Doall Test)
2. Analysis Phase (performed after the speculative execution)
(a) Compute:
(i) tw(A) = sum over all iterations i of tw_i(A)
(ii) tm(A) = sum(A_w[1:s])
(iii) if tm(A) != tw(A), there is a cross-iteration output dependence
(b) If any(A_w[:] & A_r[:]), then the loop is not a doall and the phase ends: a value is defined and used at the same location in different iterations (flow/anti dependence).

LPD Test (The Lazy Privatizing Doall Test)
2. Analysis Phase (continued)
(c) Else if tw(A) == tm(A), then the loop is a doall (without privatizing the array A).
(d) Else if any(A_w[:] & A_np[:]), then the array A is not privatizable (there is at least one iteration in which some element of A was used before being modified) and the loop is not a doall.
(e) Otherwise, the loop is made into a doall by privatizing the shared array A.
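
Continuing the marking sketch above, the analysis phase reduces to a few passes over the shadow arrays (illustrative names; FAIL means the speculation is discarded and the loop re-executed serially):

enum result { FAIL, DOALL, DOALL_PRIVATIZED };

/* Analysis phase, run after the speculative parallel execution.             */
enum result analysis_phase(int s)
{
    int tm = 0;
    for (int e = 0; e < s; e++)
        tm += A_w[e];                      /* tm(A) = sum(A_w[1:s])           */

    for (int e = 0; e < s; e++)            /* (b) flow/anti dependence        */
        if (A_w[e] && A_r[e])
            return FAIL;

    if (tw == tm)                          /* (c) no element written in two   */
        return DOALL;                      /*     iterations: plain doall     */

    for (int e = 0; e < s; e++)            /* (d) read before write somewhere */
        if (A_w[e] && A_np[e])
            return FAIL;                   /*     A is not privatizable       */

    return DOALL_PRIVATIZED;               /* (e) doall after privatizing A   */
}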

Dynamic dead reference elimination
To avoid introducing false dependences, the marking of the read and not-private shadow arrays, A_r and A_np, can be postponed until the value of the shared variable is actually used.
Definition: a dynamic dead read reference in a loop is a read access of a shared variable that does not contribute to the computation of any shared variable that is live at loop end.
The "lazy" marking employed by the LPD test, i.e., this dynamic dead reference elimination technique, allows it to qualify more loops than the PD test.

PD Test
Source loop:
do i = 1, 5
  z = A(K(i))
  if (B1(i) .eq. .true.) then
    A(L(i)) = z + C(i)
  endif
enddo
Loop instrumented for the PD test:
do i = 1, 5
  markread(K(i))
  z = A(K(i))
  if (B1(i) .eq. .true.) then
    markwrite(L(i))
    A(L(i)) = z + C(i)
  endif
enddo
B1(1:5) = ( )   K(1:5) = ( )   L(1:5) = ( )
PD test shadow arrays (elements 1-4) and tw, tm:
  element:           1 2 3 4
  A_w:
  A_r:               1 1 1 1
  A_np:              1 1 1 1
  A_w(:) & A_r(:):
  A_w(:) & A_np(:):

PD Test
Source loop:
do i = 1, 5
  z = A(K(i))
  if (B1(i) .eq. .true.) then
    A(L(i)) = z + C(i)
  endif
enddo
Loop instrumented for the PD test (reads are marked eagerly, before the value is known to be used):
do i = 1, 5
  markread(K(i))
  z = A(K(i))
  if (B1(i) .eq. .true.) then
    markwrite(L(i))
    A(L(i)) = z + C(i)
  endif
enddo
B1(1:5) = ( )   K(1:5) = ( )   L(1:5) = ( )
PD test shadow arrays (elements 1-4) and tw, tm:
  element:           1 2 3 4
  A_w:
  A_r:               1 0 1 0
  A_np:              1 1 1 1
  A_w(:) & A_r(:):   0 0 0 0
  A_w(:) & A_np(:):  0 1 0 1

LPD Test
Source loop:
do i = 1, 5
  z = A(K(i))
  if (B1(i) .eq. .true.) then
    A(L(i)) = z + C(i)
  endif
enddo
Loop instrumented for the LPD test (lazy marking: markread is moved inside the conditional, where the value read into z is actually used):
do i = 1, 5
  z = A(K(i))
  if (B1(i) .eq. .true.) then
    markread(K(i))
    markwrite(L(i))
    A(L(i)) = z + C(i)
  endif
enddo
B1(1:5) = ( )   K(1:5) = ( )   L(1:5) = ( )
LPD test shadow arrays (elements 1-4) and tw, tm:
  element:           1 2 3 4
  A_w:
  A_r:               1 0 1 0
  A_np:              1 0 1 0
  A_w(:) & A_r(:):   0 0 0 0
  A_w(:) & A_np(:):  0 0 0 0
Since both A_w(:) & A_r(:) and A_w(:) & A_np(:) are zero, the LPD test qualifies the loop as a doall.

Run-time Reduction Parallelization
Reduction parallelization = recognition of the reduction variable + parallel execution of the reduction.
Why static, pattern-matching identification is not enough:
- the data-dependence test needed to qualify a statement as a reduction statement cannot be performed statically in the presence of input-dependent access patterns
- syntactic pattern matching cannot identify all potential reduction variables (e.g., subscripted subscripts)
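
Once a reduction on A has been validated, one standard way to execute it in parallel is with privatized partial accumulators that are merged after the loop; a hedged C/OpenMP sketch (expression() and all other names are placeholders, not the paper's code):

#include <stdlib.h>

double expression(int i);                       /* placeholder for exp()        */

/* Parallel execution of:  do i:  A(R(i)) = A(R(i)) + exp()                    */
void parallel_reduction(double *A, const int *R, int n, int s)
{
    #pragma omp parallel
    {
        /* Private copy of A, initialized to the reduction identity (0.0).     */
        double *A_priv = calloc(s, sizeof(double));

        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            A_priv[R[i]] += expression(i);      /* accumulate locally           */

        #pragma omp critical                    /* merge partial sums into A    */
        for (int e = 0; e < s; e++)
            A[e] += A_priv[e];

        free(A_priv);
    }
}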

The LRPD Test: Extending the LPD Test for Reduction Validation
(a) Source program:
do i = 1, n
  S1: A(K(i)) = .........
  S2: ......... = A(L(i))
  S3: A(R(i)) = A(R(i)) + exp()
enddo
(b) Transformed program:
doall i = 1, n
  markwrite(K(i))
  markredux(K(i))
  S1: A(K(i)) = .........
  markread(L(i))
  markredux(L(i))
  S2: ......... = A(L(i))
  markwrite(R(i))
  S3: A(R(i)) = A(R(i)) + exp()
enddo
The markredux operation sets the corresponding element of the shadow array A_nx to true. A_nx checks only that the reduction variable is not accessed outside the single reduction statement (here S3); markredux is therefore applied to the accesses in S1 and S2, not to S3.

LRPD Test: Modified Analysis Phase
- 2(d') Else if any(A_w[:] & A_np[:] & A_nx[:]), then some element of A written in the loop is neither a reduction variable nor privatizable; the loop is not a doall and the phase ends.
- 2(e') Otherwise, the loop is made into a doall by applying reduction parallelization and privatization.
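
Combined with the earlier sketch, the extension amounts to one more shadow array and one more check (A_nx and markredux follow the description above; names remain illustrative):

int A_nx[S];                    /* set when A[e] is referenced outside the        */
                                /* single recognized reduction statement           */

void markredux(int e)           /* called at the non-reduction references S1, S2   */
{
    A_nx[e] = 1;
}

/* Modified steps 2(d') and 2(e') of the analysis phase.                          */
enum result lrpd_reduction_check(int s)
{
    for (int e = 0; e < s; e++)
        if (A_w[e] && A_np[e] && A_nx[e])
            return FAIL;        /* written, not privatizable, and not a reduction  */

    return DOALL_PRIVATIZED;    /* doall via reduction parallelization + privatization */
}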

Performance (1)

Performance (2)

Experimental Results Summary

Other Run-time Parallelization Papers
“Techniques for Speculative Run-Time Parallelization of Loops”, Manish Gupta and Rahul Nim, SC’98.
- more efficient run-time array privatization
- no rollback of the entire loop computation; the loop is completed instead (by generating synchronization)
- early hazard detection

Other Run-time Parallelization Papers
“Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors”, Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas, HPCA.
- Motivation: software run-time parallelization techniques are often computationally expensive and not general enough.
- Idea: execute the code in parallel speculatively and let extended cache-coherence protocol hardware detect any dependence violations.
- Performance: speedup of 7.3 on 16 processors, and about 50% faster than the software-only approach.