1 EECS 583 – Class 16: Automatic Parallelization via Decoupled Software Pipelining
University of Michigan, November 7, 2018

2 Announcements + Reading Material
Midterm exam – next Wednesday in class; starts at 10:40am sharp; open book/notes
No regular class next Monday; group office hours in class in 2246 SRB
Reminder – sign up for a paper presentation slot on my door if you haven't done so already
Today's class reading:
"Automatic Thread Extraction with Decoupled Software Pipelining," G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, November 2005.
"Revisiting the Sequential Programming Model for Multi-Core," M. J. Bridges, N. Vachharajani, Y. Zhang, T. Jablin, and D. I. August, Proceedings of the 40th IEEE/ACM International Symposium on Microarchitecture, December 2007.

3 Moore's Law
(Figure: transistor counts over time. Source: Intel/Wikipedia)

4 Single-Threaded Performance Not Improving

5 What about Parallel Programming? – or – What is Good About the Sequential Model?
Sequential is easier:
People think about programs sequentially
Simpler to write a sequential program
Deterministic execution: reproducing errors for debugging, testing for correctness
No concurrency bugs: deadlock, livelock, atomicity violations; locks are not composable
Performance extraction: sequential programs are portable – are parallel programs? Ask GPU developers ☺
Performance debugging of sequential programs is straightforward

6 Compilers are the Answer? – Proebsting's Law
"Compiler Advances Double Computing Power Every 18 Years"
Run your favorite set of benchmarks with your favorite state-of-the-art optimizing compiler. Run the benchmarks both with and without optimizations enabled. The ratio of those numbers represents the entirety of the contribution of compiler optimizations to speeding up those benchmarks. Let's assume that this ratio is about 4X for typical real-world applications, and let's further assume that compiler optimization work has been going on for about 36 years. Therefore, compiler optimization advances double computing power every 18 years. QED.
Conclusion – compilers are not about performance!

7 Can We Convert Single-threaded Programs into Multi-threaded?

8 Loop Level Parallelization
(Figure: a loop over i = 0–39 chunked into i = 0–9, 10–19, 20–29, 30–39 across Thread 0 and Thread 1.)
Bad news: limited number of parallel loops in general-purpose applications – 1.3x speedup for SPECint2000 on 4 cores.
Rich tradition in parallelizing scientific applications (e.g., SUIF). Our solution: resolve cross-iteration dependences with speculative compiler transformations to expose more parallelism. The compiler can resolve many control, register, and memory dependences on a speculative platform.
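To make the chunking concrete, here is a minimal DOALL sketch in C with pthreads (my illustration, not from the slides; the array a, the chunk sizes, and the thread count are assumptions). Each thread gets a disjoint iteration range, which is only legal because the loop body has no cross-iteration dependences:

#include <pthread.h>
#include <stdio.h>

#define N 40
#define NTHREADS 4

static int a[N];

typedef struct { int lo, hi; } range_t;

/* Each worker runs one contiguous chunk of the iteration space. */
static void *chunk(void *arg) {
    range_t *r = (range_t *)arg;
    for (int i = r->lo; i < r->hi; i++)
        a[i] = a[i] + 1;   /* independent iterations */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    range_t r[NTHREADS];
    for (int k = 0; k < NTHREADS; k++) {
        r[k].lo = k * (N / NTHREADS);        /* i = 0-9, 10-19, ... */
        r[k].hi = (k + 1) * (N / NTHREADS);
        pthread_create(&t[k], NULL, chunk, &r[k]);
    }
    for (int k = 0; k < NTHREADS; k++)
        pthread_join(t[k], NULL);
    printf("a[0] = %d\n", a[0]);
    return 0;
}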

9 DOALL Loop Coverage

10 What's the Problem?
1. Memory dependence analysis – does *q alias *p across iterations?
for (i = 0; i < 100; i++) {
    . . . = *p;
    *q = . . .;
}
Solution: memory dependence profiling and speculative parallelization.

11 DOALL Coverage – Provable and Profiled
(Figure: DOALL coverage per benchmark; the y-axis is the fraction of execution time. Blue bars: time spent in loops the compiler can prove DOALL. Red bars: DOALL loops identified by the profiler.)
Pointer analysis drives the provable case and succeeds in many applications (e.g., SPECfp), but even a single unresolved dependence ruins everything. Profiling can identify more parallel loops. Compiler analysis is limited for general applications; the profiler finds more loops, which is good but still not good enough – we must either remove or isolate the remaining dependences.

12 What's the Next Problem?
2. Data dependences
while (ptr != NULL) {
    . . .
    ptr = ptr->next;
    sum = sum + foo;
}
Solution: compiler transformations.

13 We Know How to Break Some of These Dependences – Recall ILP Optimizations
Apply accumulator variable expansion: replace the loop-carried sum += x with private accumulators – Thread 0 runs sum1 += x, Thread 1 runs sum2 += x – then combine with sum = sum1 + sum2 after the loop.
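A minimal sketch of the expansion with pthreads (my illustration, not compiler output; the array x and the two-way split are assumptions). Each thread owns a private partial sum, so the loop-carried dependence on sum disappears:

#include <pthread.h>
#include <stdio.h>

#define N 1000
static int x[N];

typedef struct { int lo, hi; long sum; } part_t;

/* Each thread accumulates into its private copy: sum1 / sum2. */
static void *partial_sum(void *arg) {
    part_t *p = (part_t *)arg;
    for (int i = p->lo; i < p->hi; i++)
        p->sum += x[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) x[i] = 1;
    pthread_t t0, t1;
    part_t p0 = {0, N / 2, 0}, p1 = {N / 2, N, 0};
    pthread_create(&t0, NULL, partial_sum, &p0);
    pthread_create(&t1, NULL, partial_sum, &p1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    long sum = p0.sum + p1.sum;   /* sum = sum1 + sum2 */
    printf("sum = %ld\n", sum);
    return 0;
}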

14 Data Dependences Inhibit Parallelization
Accumulator, induction, and min/max expansion only capture a small set of dependences.
Two options:
1) Break more dependences – new transformations
2) Parallelize in the presence of dependences – more than DOALL parallelization
We will talk about both, but for now ignore this issue.

15 What's the Next Problem? – Low-Level Reality
3. C/C++ is too restrictive.
char *memory;
void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}
(Figure: calls alloc1 … alloc6 serialized over time across Cores 1–3.)
Here we have a very simple memory allocator that just allocates blocks from a character array malloc'd at the beginning of the program. In each iteration there are two allocation sites, one at the beginning of the loop and one at the end. It would require truly heroic speculation to break the loop-carried dependence, since the value of memory is highly unpredictable.

16 Loops Cannot Be Parallelized Even If the Computation Is Independent – Low-Level Reality
(Figure: the same allocator; alloc1 … alloc6 still serialized across Cores 1–3.)
Similarly, scheduling provides no benefit: the first allocation feeds most of the iteration, which then feeds the later allocation. At this point the parallelism seems impossible to extract – this dependence must be respected for correct execution, yet we can't speculate it, schedule around it, or otherwise deal with it in the compiler. As programmers, though, we know that alloc obeys the contract of malloc, and that we only care that alloc returns a unique memory location. The value returned by any particular call to alloc is irrelevant, so the calls could occur in any order. But the sequential programming language constrains the compiler to maintain the single sequential order of dependences inside alloc.

17 Commutative Extension
Interchangeable call sites: the programmer doesn't care about the order in which a particular function is called – multiple different orders are all defined as correct. This is impossible to express in C.
The prime example is a memory allocation routine: the programmer does not care which address is returned on each call, just that the proper space is provided.
This enables the compiler to break the dependences that flow from one invocation to the next and force sequential behavior.

18 Low-Level Reality with @Commutative
char *memory;
void *alloc(int size);   @Commutative

void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}
(Figure: alloc1 … alloc6 now executing in parallel across Cores 1–3.)
The solution is to add an annotation to the sequential programming model that tells the compiler that calls to this function can be executed in any order. We call this annotation Commutative, borrowing the programmer's intuitive notion of a commutative mathematical operator. Armed with this information, the compiler can now execute iterations in parallel.
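What the annotation buys, sketched in plain C (my illustration; @Commutative itself is the paper's annotation, but the mutex below is an assumed stand-in for whatever atomicity the compiler and runtime enforce): calls may now complete in any order, as long as each call atomically hands out a unique block.

#include <pthread.h>

static char memory_pool[1 << 20];
static char *memory = memory_pool;
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Declared @Commutative: the compiler may reorder whole calls freely.
 * Atomicity of each individual call (assumed here via a mutex) is
 * still required so every caller gets a unique block. */
void *alloc(int size) {
    pthread_mutex_lock(&alloc_lock);
    void *ptr = memory;
    memory = memory + size;
    pthread_mutex_unlock(&alloc_lock);
    return ptr;
}

int main(void) {
    int *a = alloc(sizeof(int));
    int *b = alloc(sizeof(int));
    return a == b;   /* distinct blocks, so this returns 0 */
}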

19 Implementation Dependences Should Not Cause Serialization
(Figure: with @Commutative, alloc calls complete out of order across Cores 1–3, following the dynamic dependence pattern.)
The actual dependence pattern is the dynamic dependence pattern. Though the details of Commutative are in the paper, several things are worth noting. First, the annotation is NOT a parallel annotation in the spirit of locks or OpenMP pragmas. It is also not a hint to the compiler about something it can already figure out, à la the register and inline keywords in C. It is an annotation that communicates dependence information to the compiler. Additionally, though we arrived at Commutative by looking at program executions, we do not expect the programmer to do so. Instead, we expect the programmer to integrate Commutative directly into the programming process, annotating the function declaration, not the implementation: Commutative is a property of the function's contract, not of its potentially many implementations.

20 What is the Next Problem?
4. C does not allow any prescribed non-determinism.
Sequential semantics must therefore be assumed even though they are not necessary, restricting parallelism (useless dependences).
Non-deterministic branch: the programmer does not care about individual outcomes; they attach a probability to control how statistically often the branch should be taken. This allows the compiler to trade off 'quality' (e.g., compression rates) for performance – for example, when to create a new dictionary in a compression scheme.

21 Sequential Program vs. Parallel Program
Sequential program (the dictionary restarts only when compression is unprofitable; figure: restarts after 280, 3000, 70, and 520 characters):
dict = create_dict();
while ((char = read(1))) {
    profitable = compress(char, dict);
    if (!profitable) {
        dict = restart(dict);
    }
}
finish_dict(dict);

Parallel program (the dictionary also restarts every CUTOFF characters; figure: restarts every 100 characters):
#define CUTOFF 100
dict = create_dict();
count = 0;
while ((char = read(1))) {
    profitable = compress(char, dict);
    if (!profitable) dict = restart(dict);
    if (count == CUTOFF) {
        dict = restart(dict);
        count = 0;
    }
    count++;
}
finish_dict(dict);

22 2-Core and 64-Core Parallel Programs
2-core parallel program – reset every ~2500 characters (figure: chunks of 1500, 2500, 800, and 1200 characters):
dict = create_dict();
while ((char = read(1))) {
    profitable = compress(char, dict);
    @YBRANCH(probability=.01)
    if (!profitable) {
        dict = restart(dict);
    }
}
finish_dict(dict);
64-core parallel program – reset every ~100 characters (figure: chunks of roughly 100 and 80 characters).
Compilers are best situated to make the tradeoff between output quality and performance.

23 Capturing the Output/Performance Tradeoff: Y-Branches in 164.gzip
Original:
dict = create_dict();
while ((char = read(1))) {
    profitable = compress(char, dict);
    if (!profitable) {
        dict = restart(dict);
    }
}
finish_dict(dict);

With a Y-branch (annotation only; no structural change):
dict = create_dict();
while ((char = read(1))) {
    profitable = compress(char, dict);
    @YBRANCH(probability=.00001)
    if (!profitable) {
        dict = restart(dict);
    }
}
finish_dict(dict);

Manual alternative (the programmer breaks the dependence by hand):
#define CUTOFF
dict = create_dict();
count = 0;
while ((char = read(1))) {
    profitable = compress(char, dict);
    if (!profitable) dict = restart(dict);
    if (count == CUTOFF) {
        dict = restart(dict);
        count = 0;
    }
    count++;
}
finish_dict(dict);

Another language limitation occurs when a programmer manually breaks a dependence to enable parallelization.

24 Modified Only 60 LOC out of ~500,000 LOC
(Table: per-benchmark programmer modifications across 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, and 300.twolf – columns for iteration-invariant value speculation, alias speculation, Commutative, Y-Branch, increased scope (loop, nested parallel), and lines of code changed. Recoverable LoC counts: 164.gzip 26, 175.vpr 1, 176.gcc 18, 186.crafty 9, 197.parser 3.)
Modified only 60 LOC out of ~500,000 LOC.

25 Performance Potential
The aggressive framework coupled with the Commutative and Y-Branch annotations allowed us to extract large amounts of parallelism. For several benchmarks – 164.gzip, 186.crafty, and 197.parser – the parallelism scales almost linearly with the number of threads. However, these numbers are not upper bounds on the extractable performance: with the exception of two benchmarks, mcf and crafty, only one loop in each benchmark was parallelized. We firmly believe that more parallelism can and will be extracted in the future.
What prevents the automatic extraction of parallelism?
1) Lack of an aggressive compilation framework
2) The sequential programming model

26 What About Non-Scientific Codes???
Scientific codes (FORTRAN-like):
for (i = 1; i <= N; i++)     // C
    a[i] = a[i] + 1;         // X
Independent Multithreading (IMT), e.g., DOALL parallelization: iterations run fully in parallel (figure: Core 1 runs C:1/X:1, C:3/X:3, …; Core 2 runs C:2/X:2, C:4/X:4, …).

General-purpose codes (legacy C/C++):
while (ptr = ptr->next)        // LD
    ptr->val = ptr->val + 1;   // X
Cyclic Multithreading (CMT), e.g., DOACROSS [Cytron, ICPP 86]: the loop-carried LD recurrence is communicated core to core each iteration (figure: LD:1/X:1 on Core 1, LD:2/X:2 on Core 2, LD:3/X:3 back on Core 1, …).

27 Alternative Parallelization Approaches
while (ptr = ptr->next)        // LD
    ptr->val = ptr->val + 1;   // X
Cyclic Multithreading (CMT): alternate whole iterations across cores, communicating the recurrence every iteration.
Pipelined Multithreading (PMT), e.g., DSWP [PACT 2004]: Core 1 runs all the LDs (LD:1, LD:2, …) and streams node pointers to Core 2, which runs all the Xs (X:1, X:2, …).

28 Comparison: IMT, PMT, CMT
(Figure: side-by-side two-core schedules – IMT interleaves independent iterations across cores; PMT pipelines LD on Core 1 and X on Core 2; CMT alternates iterations between cores.)

29 Comparison: IMT, PMT, CMT
With lat(comm) = 1, all three schemes sustain 1 iter/cycle (figure: the same three schedules, annotated with throughput).

30 Comparison: IMT, PMT, CMT
With lat(comm) = 1: 1 iter/cycle for all three. With lat(comm) = 2: IMT and PMT still sustain 1 iter/cycle, but CMT drops to 0.5 iter/cycle, because the communication latency sits on CMT's cross-core recurrence (figure: the CMT schedule stretches while IMT and PMT are unaffected).

31 Comparison: IMT, PMT, CMT
Thread-local recurrences → fast execution (PMT keeps the LD recurrence on one core).
Cross-thread dependences → wide applicability (PMT and CMT handle loops that IMT cannot).
(Figure: the three two-core schedules side by side.)

32 Our Objective: Automatic Extraction of Pipeline Parallelism Using DSWP
Example: 197.parser, pipelined as Find English Sentences → Parse Sentences (95% of execution) → Emit Results, using Decoupled Software Pipelining / PS-DSWP (speculative DOALL middle stage).
(Speaker note: this is parser with two input sets shown in green; the red dots are the historic performance trend lines from the first slide. We can automatically parallelize parser in a scalable way – all we need is for the programmer to remove or mark the custom allocator. That is much easier than rewriting parser in a new language.)

33 Decoupled Software Pipelining

34 Decoupled Software Pipelining (DSWP) [MICRO 2005]
A: while(node)
B:     ncost = doit(node);
C:     cost += ncost;
D:     node = node->next;
Build the dependence graph (intra-iteration and loop-carried register, memory, and control dependences), collapse strongly connected components to form the DAG_SCC, then partition: Thread 1 gets the traversal SCC {A, D}; Thread 2 gets {B, C}. Cross-thread values flow through a communication queue.
Inter-thread communication latency is a one-time cost.
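A minimal two-thread sketch of the transformed loop (my illustration, assuming a simple blocking queue for produce/consume and a NULL sentinel for termination; the real DSWP runtime uses per-dependence queues):

#include <pthread.h>
#include <stdio.h>

typedef struct node { int val; struct node *next; } node_t;

#define QSIZE 64
static node_t *q[QSIZE];
static int qhead, qtail, qcount;
static pthread_mutex_t qm = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qcv = PTHREAD_COND_INITIALIZER;

static void produce(node_t *n) {          /* blocking enqueue */
    pthread_mutex_lock(&qm);
    while (qcount == QSIZE) pthread_cond_wait(&qcv, &qm);
    q[qtail] = n; qtail = (qtail + 1) % QSIZE; qcount++;
    pthread_cond_broadcast(&qcv);
    pthread_mutex_unlock(&qm);
}

static node_t *consume(void) {            /* blocking dequeue */
    pthread_mutex_lock(&qm);
    while (qcount == 0) pthread_cond_wait(&qcv, &qm);
    node_t *n = q[qhead]; qhead = (qhead + 1) % QSIZE; qcount--;
    pthread_cond_broadcast(&qcv);
    pthread_mutex_unlock(&qm);
    return n;
}

static int doit(node_t *n) { return n->val; }

static void *thread1(void *arg) {         /* SCC {A, D}: traversal */
    node_t *node = (node_t *)arg;
    while (node) {                        /* A */
        produce(node);
        node = node->next;                /* D */
    }
    produce(NULL);                        /* end-of-stream sentinel */
    return NULL;
}

static void *thread2(void *arg) {         /* SCC {B, C}: work + sum */
    long cost = 0;
    node_t *node;
    while ((node = consume()) != NULL)
        cost += doit(node);               /* B, C */
    *(long *)arg = cost;
    return NULL;
}

int main(void) {
    node_t n3 = {3, NULL}, n2 = {2, &n3}, n1 = {1, &n2};
    long cost = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, &n1);
    pthread_create(&t2, NULL, thread2, &cost);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("cost = %ld\n", cost);         /* 1 + 2 + 3 = 6 */
    return 0;
}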

35 Implementing DSWP
Build the dependence graph for the loop (intra-iteration and loop-carried register, memory, and control dependences, plus auxiliary structures).
Choose a valid partitioning. Goals: balance load, reduce synchronization.
Transform the code.

36 Optimization: Node Splitting to Eliminate Cross-Thread Control
(Figure: a control-dependence node is replicated into each thread so branch outcomes need not be communicated; legend – intra-iteration, loop-carried, register, memory, control.)

37 Optimization: Node Splitting to Reduce Communication
(Figure: a node feeding consumers in multiple threads is duplicated so its value is computed locally instead of communicated; legend – intra-iteration, loop-carried, register, memory, control.)

38 Constraint: Strongly Connected Components
Consider: splitting a dependence cycle across threads creates cyclic communication, which eliminates the pipelined/decoupled property.
Solution: form the DAG_SCC – collapse each strongly connected component into a single node and partition that acyclic graph, so every recurrence stays within one thread.
Then choose a valid partitioning (goals: balance load, reduce synchronization) and transform the code.

39 Two Extensions to the Basic Transformation
Speculation: break statistically unlikely dependences to form better-balanced pipelines.
Parallel stages: execute multiple copies of certain "large" stages; stages that contain inner loops are perfect candidates.

40 Why Speculation?
A: while(node)
B:     ncost = doit(node);
C:     cost += ncost;
D:     node = node->next;
(Figure: dependence graph and DAG_SCC for the loop; legend – intra-iteration, loop-carried, register, control, communication queue.)

41 Why Speculation?
A: while(cost < T && node)
B:     ncost = doit(node);
C:     cost += ncost;
D:     node = node->next;
Adding the early-exit condition cost < T makes A depend on C, collapsing the whole loop into one big SCC {A, B, C, D} – but the exit condition is a predictable dependence.

42 Why Speculation?
A: while(cost < T && node)
B:     ncost = doit(node);
C:     cost += ncost;
D:     node = node->next;
Speculating the predictable exit condition removes A's dependence on C, breaking the big SCC back into pipelineable pieces (figure: DAG_SCC with A speculated off the recurrence; legend – intra-iteration, loop-carried, register, control, communication queue).

43 Misspeculation Recovery
Execution paradigm: stages run ahead speculatively; when misspeculation is detected (e.g., in iteration 4), recovery reruns the misspeculated iteration (figure: DAG_SCC D → B → C with A speculated; legend – intra-iteration, loop-carried, register, control, communication queue).

44 Understanding PMT Performance
(Figure: two two-core pipeline schedules – a balanced one whose slowest stage takes 1 cycle/iter, achieving 1 iter/cycle, and an unbalanced one with idle time whose slowest stage takes 2 cycles/iter, achieving 0.5 iter/cycle.)
The per-iteration time t_i of a stage is at least as large as its longest dependence recurrence. It is NP-hard to find the longest recurrence, and large loops make the problem difficult in practice.
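As a quick worked check (my arithmetic, matching the figure's numbers), steady-state pipeline throughput is set by the slowest stage:

\[ \text{rate} = \frac{1}{\max_i t_i} \]

So a balanced pipeline with \( t_1 = t_2 = 1 \) cycle/iter sustains 1 iter/cycle, while an unbalanced one with \( t_1 = 1, t_2 = 2 \) sustains only \( 1/2 = 0.5 \) iter/cycle.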

45 Selecting Dependences to Speculate
A: while(cost < T && node)
B:     ncost = doit(node);
C:     cost += ncost;
D:     node = node->next;
With the exit condition speculated, the DAG_SCC partitions across four threads: Thread 1 gets D, Thread 2 gets B, Thread 3 gets C, and Thread 4 checks the speculated condition A (legend – intra-iteration, loop-carried, register, control, communication queue).

46 Detecting Misspeculation
Thread 1:
A1: while(consume(4))
D :     node = node->next;
        produce({0,1}, node);

Thread 2:
A2: while(consume(5))
B :     ncost = doit(node);
        produce(2, ncost);
D2:     node = consume(0);

Thread 3:
A3: while(consume(6))
B3:     ncost = consume(2);
C :     cost += ncost;
        produce(3, cost);

Thread 4:
A : while(cost < T && node)
B4:     cost = consume(3);
C4:     node = consume(1);
        produce({4,5,6}, cost < T && node);

47 Detecting Misspeculation
With the exit speculated, Threads 1–3 spin on while(TRUE) instead of consuming the branch outcome:
Thread 1:
A1: while(TRUE)
D :     node = node->next;
        produce({0,1}, node);

Thread 2:
A2: while(TRUE)
B :     ncost = doit(node);
        produce(2, ncost);
D2:     node = consume(0);

Thread 3:
A3: while(TRUE)
B3:     ncost = consume(2);
C :     cost += ncost;
        produce(3, cost);

Thread 4:
A : while(cost < T && node)
B4:     cost = consume(3);
C4:     node = consume(1);
        produce({4,5,6}, cost < T && node);

48 Detecting Misspeculation
Thread 4 now flags misspeculation when the speculated condition fails:
Thread 1:
A1: while(TRUE)
D :     node = node->next;
        produce({0,1}, node);

Thread 2:
A2: while(TRUE)
B :     ncost = doit(node);
        produce(2, ncost);
D2:     node = consume(0);

Thread 3:
A3: while(TRUE)
B3:     ncost = consume(2);
C :     cost += ncost;
        produce(3, cost);

Thread 4:
A : while(cost < T && node)
B4:     cost = consume(3);
C4:     node = consume(1);
        if (!(cost < T && node)) FLAG_MISSPEC();

49 Breaking False Memory Dependences
(Figure: dependence graph D → B → C with a false memory dependence between iterations; speculative stores are tagged with memory versions, e.g., Memory Version 3, and the oldest version is committed by the recovery thread. Legend – false memory, intra-iteration, loop-carried, register, control, communication queue.)
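To make the versioning idea concrete, here is a deliberately simplified single-threaded sketch (my illustration, not the paper's runtime, which relies on hardware/runtime support): each speculative iteration logs its stores in a private version buffer, versions are committed to real memory in iteration order only after validation, and a misspeculated version is simply discarded.

#include <stdio.h>
#include <stdbool.h>

#define MAX_WRITES 16

/* Per-iteration write log: stores are buffered here instead of going
 * directly to memory, so a misspeculated iteration can be discarded. */
typedef struct {
    int *addr[MAX_WRITES];
    int  val[MAX_WRITES];
    int  n;
} version_t;

static void versioned_store(version_t *v, int *addr, int val) {
    v->addr[v->n] = addr;
    v->val[v->n] = val;
    v->n++;
}

/* Commit: apply the buffered stores to real memory, oldest version
 * first, in iteration order. */
static void commit(version_t *v) {
    for (int i = 0; i < v->n; i++)
        *v->addr[i] = v->val[i];
    v->n = 0;
}

/* Squash: a misspeculated version's stores are simply dropped. */
static void squash(version_t *v) { v->n = 0; }

int main(void) {
    int x = 0;
    version_t v3 = {.n = 0};          /* "Memory Version 3" */
    versioned_store(&v3, &x, 42);     /* speculative store */
    bool misspeculated = false;       /* outcome of validation (stubbed) */
    if (misspeculated) squash(&v3);
    else commit(&v3);
    printf("x = %d\n", x);            /* prints 42 */
    return 0;
}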

50 Adding Parallel Stages to DSWP
while (ptr = ptr->next)        // LD
    ptr->val = ptr->val + 1;   // X
Assume LD = 1 cycle, X = 2 cycles, communication latency = 2 cycles. (Figure: Core 1 runs LD:1, LD:2, …; Cores 2 and 3 each run a replica of stage X, alternating iterations.)
The schedule shows how PS-DSWP operates when the latency of X and the inter-core communication latency are both 2 cycles. PS-DSWP replicates the stage containing X onto two cores, so two Xs from different iterations execute concurrently, achieving a throughput of 1 iteration per cycle – in contrast to the 1/2 iteration per cycle of DOACROSS and DSWP. While X is shown as a single statement with a latency of 2, it could be a stage with varying latency – for instance, an inner loop with a varying iteration count that DSWP could not partition if it were a single recurrence. A key observation enabling PS-DSWP is that a recurrence in a loop does not necessarily imply the presence of loop-carried dependences within a stage.
Throughput – DSWP: 1/2 iteration/cycle; DOACROSS: 1/2 iteration/cycle; PS-DSWP: 1 iteration/cycle.

51 Things to Think About – Speculation
How do you decide which dependences to speculate?
Look solely at profile data? How do you ensure enough profile coverage? What about code structure?
What if you are wrong? Undo speculation decisions at run time?
How do you manage speculation in a pipeline? The traditional definition of a transaction is broken: transaction execution is spread out across multiple cores.

52 Things to Think About 2 – Pipeline Structure
When is a pipeline a good/bad choice for parallelization?
Is pipelining good or bad for cache performance? Is DOALL better or worse for cache?
Can a pipeline be adjusted when the number of available cores increases/decreases, or based on what else is running on the processor?
How many cores can DSWP realistically scale to?

