Automatic Parallelization
Nick Johnson
COS 597c: Parallelism
30 Nov 2010

Automatic Parallelization is… …the extraction of concurrency from sequential code by the compiler. Variations: – Granularity: Instruction, Data, Task – Explicitly- or implicitly-parallel languages 2

Overview This time: preliminaries. Soundness Dependence Analysis and Representation Parallel Execution Models and Transforms – DOALL, DOACROSS, DSWP Family Next time: breakthroughs. 3

SOUNDNESS Why is automatic parallelization hard? 4

int main() { printf("Hello "); printf("World "); return 0; }
Expected output: Hello World

int main() { printf("Hello "); printf("World "); return 0; }
Expected output: Hello World
Invalid output: World Hello
Can we formally describe the difference?

Soundness Constraint Compilers must preserve the observable behavior of the program. Observable behavior – Consuming bytes of input. – Program output. – Program termination. – etc. 7

Corollaries Compiler must prove that a transform preserves observable behavior. – Same side effects, in the same order. In absence of a proof, the compiler must be conservative. 8

Semantics: simple example. Observable behavior consists of operations plus a partial order over them. The compiler must respect that partial order when optimizing. main: printf("Hello "); printf("World "); return 0; [The first printf must happen before the second.]

Importance to Parallelism. Parallel execution means task interleaving. If two operations P and Q are ordered, concurrent execution of P and Q may violate the partial order. To schedule operations for concurrent execution, the compiler must be aware of this partial order! [Timing diagram: Scenario A and Scenario B interleave P and Q differently across contexts T1 and T2.]

DEPENDENCE ANALYSIS How the compiler discovers its freedom. 11

Sequential languages present a total order of the program statements. Only a partial order is required to preserve observable behavior. The partial order must be discovered. float foo(float a, float b) { float t1 = sin(a); float t2 = cos(b); return t1 / t2; }

Although t1 appears before t2 in the program, re-ordering t1 and t2 cannot change observable behavior. float foo(float a, float b) { float t1 = sin(a); float t2 = cos(b); return t1 / t2; }

Dependence Analysis Source-code order is pessimistic. Dependence analysis identifies a more precise partial order. This gives the compiler freedom to transform the code. 14

Analysis is incomplete: a precise answer in the best case, a conservative approximation in the worst case. What counts as 'conservative' depends on the user. Approximation begets spurious dependences, which limit compiler freedom.

Program Order from Data Flow. Data Dependence: one operation computes a value which is used by another. P = …; Q = …; R = P + Q; S = Q + 1; [Dependence graph: edges P->R, Q->R, Q->S.]

Program Order from Data Flow. Data Dependence: one operation computes a value which is used by another. P = …; Q = …; R = P + Q; S = Q + 1; Sub-types: Flow (read after write), Anti (write after read), Output (write after write). Anti and output dependences are artifacts of a shared resource.

Program Order from Control Flow. Control Dependence: one operation may enable/disable the execution of another, and the target sources dependences to operations outside of the control region. Dependent: P enables Q or R. Independent: S will execute no matter what. if( P ) Q; else R; S;

Control Dep: Example. The effect of X is local to this region. Executing X outside of the if-statement cannot change behavior, so X is independent of the if-statement. if( P ) { X = Y + 1; print(X); }

Program Order from SysCalls / Side Effects. Observable behavior is accomplished via system calls. Very difficult to prove that system calls are independent. print(P); print(Q);

Analysis is non-trivial. Consider two iterations: the earlier iteration stores at B; the later iteration loads at A. Dependence?
n = list->front;
while( n != null ) {
  t = n->value;      // A
  n->value = t + 1;  // B
  n = n->next;       // C
}

Intermediate Representation Summarize a high-level view of program semantics. For parallelism, we want – Explicit dependences. 22

The Program Dependence Graph. A directed multigraph – Vertices: operations; Edges: dependences. Benefits: – Dependence is explicit. Detriments: – Expensive to compute: O(N²) dependence queries. – Loop structure not always visible. [Ferrante et al, 1987]

PDG Example.
void foo(LinkedList *lst) {
  Node *j = lst->front;
  while( j ) {
    int q = work(j->value);
    printf("%d\n", q);
    j = j->next;
  }
}
[PDG: nodes (entry), j=lst->front, while(j), q=work(j->value), printf(), j=j->next, connected by control and data dependence edges; loop-carried edges are marked L.]

PARALLEL EXECUTION MODELS 25

A parallel execution model is… a general strategy for the distribution of work across multiple computational units. Today, we will cover: – DOALL (IMT, “Embarrassingly Parallel”) – DOACROSS (CMT) – DSWP (PMT, Pipeline) 26

Visual Vocabulary: Timing Diagrams. [Legend: execution contexts T1, T2, T3; a Time axis; work units labeled by name and iteration number (e.g., A1); arrows for communication and synchronization; empty slots are idle contexts (wasted parallelism).]

The Sequential Model. All subsequent models are compared to this. [Timing diagram: work units W1, W2, W3 execute one after another on context T1; T2 and T3 sit idle.]

IMT: “Embarrassingly Parallel” Model. A set of independent work units Wi. No synchronization necessary between work units. Speedup proportional to the number of contexts. Can be automated for independent iterations of a loop. [Timing diagram: W1, W2, W3, … distributed round-robin across contexts T1, T2, T3 with no communication.]

The DOALL Transform. No cite available; older than history. Search for loops without dependences between iterations. Partition the iteration space across contexts.
Before:
void foo() {
  for(i=0; i<N; ++i)
    array[i] = work(array[i]);
}
After:
void foo() {
  start( task(0,4) );
  start( task(1,4) );
  start( task(2,4) );
  start( task(3,4) );
  wait();
}
void task(k,M) {
  for(i=k; i<N; i+=M)
    array[i] = work(array[i]);
}

Limitations of DOALL. Inapplicable when iterations are not independent: a loop-carried data dependence (first example), an early exit that creates a loop-carried control dependence (second), or an iteration space that is not known in advance, as in pointer chasing (third).
void foo() {
  for(i=0; i<N; ++i)
    array[i] = work(array[i-1]);
}
void foo() {
  for(i=0; i<N; ++i) {
    array[i] = work(array[i]);
    if( array[i] > 4 ) break;
  }
}
void foo() {
  for(i in LinkedList)
    *i = work(*i);
}

Variants of DOALL. Different iteration orders to optimize for the memory hierarchy – skewing, tiling, the polyhedral model, etc. Enabling transformations – reductions, privatization.

CMT: A more universal model. Any dependence which crosses contexts can be respected via synchronization or communication. [Timing diagram: work units W1..W7 spread across T1, T2, T3, with synchronization between consecutive iterations.]

The DOACROSS Transform [Cytron, 1986]. The source loop:
void foo(LinkedList *lst) {
  Node *j = lst->front;
  while( j ) {
    int q = work(j->value);
    printf("%d\n", q);
    j = j->next;
  }
}
(Its PDG, shown earlier, identifies the loop-carried dependences that must be communicated.)


The DOACROSS Transform [Cytron, 1986], continued. After the transform, loop-carried values flow through queues: q1 carries the pointer j, and q2 carries a token that orders the printf calls.
void foo(lst) {
  Node *j = lst->front;
  start( task() );
  produce(q1, j);
  produce(q2, 'io);
  wait();
}
void task() {
  while( true ) {
    j = consume(q1);
    if( !j ) break;
    q = work( j->value );
    consume(q2);
    printf("%d\n", q);
    produce(q2, 'io);
    j = j->next;
    produce(q1, j);
  }
  produce(q1, null);
}


Limitations of DOACROSS [1/2]. [Timing diagram: the source loop from before running under DOACROSS; a synchronized region inside each work unit lies between consecutive iterations across T1, T2, T3.]

Limitations of DOACROSS [2/2]. Dependences are on the critical path. Work/time decreases with communication latency. [Timing diagram: each work unit waits for the previous iteration's value to arrive before starting.]

PMT: The Pipeline Model. Execution in stages. Communication in one direction; cycles are local to a stage. Work/time is insensitive to communication latency. Speedup limited by the latency of the slowest stage. [Timing diagram: stage X on T1, stage Y on T2, stage Z on T3; iteration i's X feeds its Y, which feeds its Z, while T1 has already begun iteration i+1.]


Decoupled Software Pipelining (DSWP) Goal: partition pieces of an iteration for – acyclic communication. – balanced stages. Two pieces, often confused: – DSWP: analysis [Rangan et al, ‘04]; [Ottoni et al, ‘05] – MTCG: code generation [Ottoni et al, ’05] 42

Finding a Pipeline. Start with the PDG of the program. Compute the Strongly Connected Components (SCCs) of the PDG – the result is a DAG. Greedily assign SCCs to stages so as to balance the pipeline.

DSWP: Source Code.
void foo(LinkedList *lst) {
  Node *j = lst->front;
  while( j ) {
    int q = work(j->value);
    printf("%d\n", q);
    j = j->next;
  }
}

DSWP: PDG. [PDG of the loop: nodes (entry), j=lst->front, while(j), q=work(j->value), printf(), j=j->next, connected by control and data dependence edges; loop-carried edges are marked L.]

DSWP: Identify SCCs. [The same PDG with its strongly connected components identified: the loop-traversal cycle {while(j), j=j->next} forms one SCC; q=work(j->value) and printf() are the others.]

DSWP: Assign to Stages. [The SCCs are assigned to three stages: {while(j); j=j->next}, {q=work(j->value)}, and {printf()}; all cross-stage dependences flow in one direction.]

Multithreaded Code Generation (MTCG). Given a partition: – Stage 0: { while(j); j=j->next; } – Stage 1: { q = work( j->value ); } – Stage 2: { printf("%d", q); } Next, MTCG will generate code. Special care for deps which span stages! [Ottoni et al, ‘05]

MTCG: empty per-stage skeletons.
void stage0(Node *j) { }
void stage1() { }
void stage2() { }

MTCG: Copy Instructions.
void stage0(Node *j) {
  while( j ) {
    j = j->next;
  }
}
void stage1() {
  q = work(j->value);
}
void stage2() {
  printf("%d", q);
}

MTCG: Replicate Control.
void stage0(Node *j) {
  while( j ) {
    produce(q1, 'c);
    produce(q2, 'c);
    j = j->next;
  }
  produce(q1, 'b);
  produce(q2, 'b);
}
void stage1() {
  while( true ) {
    if( consume(q1) != 'c ) break;
    q = work(j->value);
  }
}
void stage2() {
  while( true ) {
    if( consume(q2) != 'c ) break;
    printf("%d", q);
  }
}

MTCG: Communication.
void stage0(Node *j) {
  while( j ) {
    produce(q1, 'c);
    produce(q2, 'c);
    produce(q3, j);
    j = j->next;
  }
  produce(q1, 'b);
  produce(q2, 'b);
}
void stage1() {
  while( true ) {
    if( consume(q1) != 'c ) break;
    j = consume(q3);
    q = work(j->value);
    produce(q4, q);
  }
}
void stage2() {
  while( true ) {
    if( consume(q2) != 'c ) break;
    q = consume(q4);
    printf("%d", q);
  }
}

MTCG Example. [Timing diagram: the generated pipeline running across T1, T2, T3; stage 0 (while(j), j=j->next) on T1 feeds stage 1 (q=work(j->value)) on T2, which feeds stage 2 (printf()) on T3, with successive iterations overlapped.]


Loop Speedup. [Chart: loop speedup results, assuming a 32-entry hardware queue; Rangan et al, ‘08.]

Summary. Soundness limits compiler freedom. Execution models are general strategies for distributing work and managing communication: DOALL, DOACROSS, DSWP. Next class: DSWP+, speculation.