EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

1 EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining
University of Michigan November 12, 2012

2 Announcements + Reading Material
1st paper review due today
- Should have been submitted to andrew.eecs.umich.edu:/y/submit
Next Monday - midterm exam in class
- See room assignments
Today's class reading
- “Automatic Thread Extraction with Decoupled Software Pipelining,” G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, Nov. 2005.
Next class reading
- “Spice: Speculative Parallel Iteration Chunk Execution,” E. Raman, N. Vachharajani, R. Rangan, and D. I. August, Proc. 2008 Intl. Symposium on Code Generation and Optimization, April 2008.

3 Midterm Exam
When: Monday, Nov 19, 2012, 10:40-12:30
Where:
- 1003 EECS: uniquenames (not first or last names!) starting with A-E
- 1005 EECS: uniquenames starting with F-N
- 3150 Dow (our classroom): uniquenames starting with P-Z
What to expect:
- Open book/notes, no laptops
- Apply techniques we discussed in class on examples
- Reason about solving compiler problems - why things are done
- A couple of thinking problems
- No LLVM code
- Reasonably long, but you should finish

4 Midterm Exam
Last 2 years' exams are posted on the course website
- Note: past exams may not accurately predict future exams!!
Office hours between now and Monday if you have questions
- James: Tue, Thu, Fri: 1-3pm
- Scott: Tue 4:30-5:30pm, Fri 5-6pm (extra slot)
Studying
- Yes, you should study even though it's open notes
- There is lots of material that you have likely forgotten from early this semester; refresh your memory
- No memorization required, but you need to be familiar with the material to finish the exam
- Go through the lecture notes, especially the examples!
- If you are confused on a topic, go through the reading
- If still confused, come talk to me or James
- Go through the practice exams as the final step

5 Exam Topics
Control flow analysis
- Control flow graphs, dom/pdom, loop detection
- Trace selection, superblocks
Predicated execution
- Control dependence analysis, if-conversion, hyperblocks
- Can ignore control height reduction
Dataflow analysis
- Liveness, reaching defs, DU/UD chains, available defs/exprs
- Static single assignment
Optimizations
- Classical: dead code elimination, constant/copy propagation, CSE, LICM, induction variable strength reduction
- ILP optimizations: unrolling, renaming, tree height reduction, induction/accumulator expansion
- Speculative optimization: like HW 1

6 Exam Topics - Continued
Acyclic scheduling
- Dependence graphs, Estart/Lstart/slack, list scheduling
- Code motion across branches, speculation, exceptions
- Can ignore sentinel scheduling
Software pipelining
- DSA form, ResMII, RecMII, modulo scheduling
- Make sure you can modulo schedule a loop!
- Execution control with LC, ESC
Register allocation
- Live ranges, graph coloring
Research topics
- Can ignore automatic parallelization

7 Last Class
Scientific codes - successful parallelization
- KAP, SUIF, Parascope, gcc w/ Graphite
- Affine array dependence analysis
- DOALL parallelization
C programs
- Not dominated by array accesses; classic parallelization fails
- Speculative parallelization: Hydra, Stampede, speculative multithreading
- Profiling to identify statistical DOALL loops
- But not all loops are DOALL, and outer loops typically are not!!
This class: parallelizing loops with dependences

8 What About Non-Scientific Codes???
Scientific codes (FORTRAN-like):
    for (i=1; i<=N; i++)         // C
      a[i] = a[i] + 1;           // X
General-purpose codes (legacy C/C++):
    while (ptr = ptr->next)      // LD
      ptr->val = ptr->val + 1;   // X
[Figure: two-core schedules. Independent Multithreading (IMT), e.g. DOALL parallelization, runs iterations C:i / X:i independently on each core. Cyclic Multithreading (CMT), e.g. DOACROSS [Cytron, ICPP 86], passes dependent iterations LD:i / X:i back and forth between the cores.]
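For contrast, the scientific-code loop is DOALL: its iterations are independent, so a sketch like the following splits them directly across cores (OpenMP used for illustration; not part of the slides). The linked-list loop cannot be split this way, because LD carries a dependence from each iteration to the next.

    #include <omp.h>

    #define N 1024

    /* DOALL / IMT sketch: iterations carry no dependences, so each
       core can take a chunk of the index space independently. */
    void increment_all(int a[N]) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)      /* C */
            a[i] = a[i] + 1;             /* X */
    }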

9 Alternative Parallelization Approaches
    while (ptr = ptr->next)      // LD
      ptr->val = ptr->val + 1;   // X
[Figure: two-core schedules. Cyclic Multithreading (CMT) alternates each iteration's LD/X pair between cores; Pipelined Multithreading (PMT), e.g. DSWP [PACT 2004], keeps every LD on Core 1 and every X on Core 2.]

10 Comparison: IMT, PMT, CMT
[Figure: the IMT, PMT, and CMT two-core schedules side by side for comparison.]

11 Comparison: IMT, PMT, CMT
[Figure: the IMT, PMT, and CMT schedules side by side. With lat(comm) = 1, all three sustain 1 iter/cycle.]

12 Comparison: IMT, PMT, CMT
[Figure: the same three schedules, now with a 2-cycle communication latency stretching the CMT schedule.]
lat(comm) = 1: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 1 iter/cycle
lat(comm) = 2: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 0.5 iter/cycle
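One way to summarize these numbers (my formulas, interpreting the schedules, not equations from the slides): CMT puts lat(comm) on the critical path of every iteration, while PMT pays it only once during pipeline fill:

    \[ \mathrm{Rate}_{\mathrm{CMT}} \approx \frac{1}{\max(t_{\mathrm{iter}},\ \mathrm{lat(comm)})}
       \qquad
       \mathrm{Rate}_{\mathrm{PMT}} \approx \frac{1}{\max_i t_i} \]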

13 Comparison: IMT, PMT, CMT
Thread-local recurrences → fast execution
Cross-thread dependences → wide applicability
[Figure: the IMT, PMT, and CMT two-core schedules repeated; PMT keeps the LD recurrence local to Core 1 while communicating the cross-thread LD→X dependence once per iteration.]

14 Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP
197.parser pipeline: Find English Sentences → Parse Sentences (95%) → Emit Results
Decoupled Software Pipelining: PS-DSWP (speculative DOALL middle stage)
[Speaker notes: This is parser with two input sets, shown in green. The red dots are the historic performance trend lines (our goal, shown on the first slide). We can automatically parallelize parser in a scalable way; all we need is for the programmer to remove or mark the custom allocator. That is so much easier than rewriting parser in a new language.]

15 Decoupled Software Pipelining

16 Decoupled Software Pipelining (DSWP) [MICRO 2005]
    A: while(node)
    B:   ncost = doit(node);
    C:   cost += ncost;
    D:   node = node->next;
[Figure: the loop's dependence graph and its DAGSCC. Thread 1 receives the {A, D} recurrence; Thread 2 receives {B, C}. Edge legend: intra-iteration, loop-carried, register, control, communication queue.]
Inter-thread communication latency is a one-time cost
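A minimal sketch of what the transformed code for this partition might look like (my code, not the paper's generated output; produce/consume are hypothetical blocking queue primitives, one possible implementation of which is sketched under slide 17):

    /* Thread 1 keeps the {A, D} recurrence; Thread 2 gets {B, C}; the
       node pointer crosses cores through a blocking queue. */
    typedef struct node { struct node *next; long val; } node_t;  /* assumed layout */
    extern long  doit(node_t *n);               /* the loop body's work (assumed) */
    extern void  produce(int queue, void *v);   /* hypothetical queue primitives */
    extern void *consume(int queue);

    void thread1(node_t *node) {             /* traversal stage: A + D */
        while (node) {                       /* A */
            produce(0, node);                /* send node to the compute stage */
            node = node->next;               /* D */
        }
        produce(0, NULL);                    /* NULL signals loop exit */
    }

    long thread2(void) {                     /* compute stage: B + C */
        long cost = 0;
        node_t *node;
        while ((node = consume(0)) != NULL) {
            long ncost = doit(node);         /* B */
            cost += ncost;                   /* C */
        }
        return cost;
    }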

17 Implementing DSWP
Choose a valid partitioning
- Goals: balance load, reduce sync
Transform the code
[Figure: the loop (L1) and its dataflow graph, plus auxiliary structures. Edge legend: intra-iteration, loop-carried, register, memory, control.]
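For concreteness, a possible software implementation of the numbered produce/consume queues assumed above (illustrative only; the original DSWP work assumes low-latency queue support, e.g. a hardware synchronization array):

    /* Bounded single-producer/single-consumer ring buffers with
       blocking semantics; one per cross-thread dependence. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define NQUEUES 8
    #define QSIZE   256                  /* capacity per queue */

    typedef struct {
        _Atomic uint32_t head, tail;     /* consumer / producer cursors */
        void *buf[QSIZE];
    } queue_t;

    static queue_t queues[NQUEUES];

    void produce(int q, void *v) {
        queue_t *Q = &queues[q];
        uint32_t t = atomic_load_explicit(&Q->tail, memory_order_relaxed);
        while (t - atomic_load_explicit(&Q->head, memory_order_acquire) == QSIZE)
            ;                            /* spin while the queue is full */
        Q->buf[t % QSIZE] = v;
        atomic_store_explicit(&Q->tail, t + 1, memory_order_release);
    }

    void *consume(int q) {
        queue_t *Q = &queues[q];
        uint32_t h = atomic_load_explicit(&Q->head, memory_order_relaxed);
        while (atomic_load_explicit(&Q->tail, memory_order_acquire) == h)
            ;                            /* spin while the queue is empty */
        void *v = Q->buf[h % QSIZE];
        atomic_store_explicit(&Q->head, h + 1, memory_order_release);
        return v;
    }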

18 Optimization: Node Splitting To Eliminate Cross Thread Control
[Figure: dependence graph before/after splitting a node so that control dependences no longer cross threads. Edge legend: intra-iteration, loop-carried, register, memory, control.]

19 Optimization: Node Splitting To Reduce Communication
[Figure: dependence graph before/after splitting a node to reduce the number of cross-thread communication edges. Edge legend: intra-iteration, loop-carried, register, memory, control.]

20 Constraint: Strongly Connected Components
Consider: a partition that splits a dependence cycle across threads; the backward communication eliminates the pipelined/decoupled property.
Solution: build the DAGSCC (the DAG of strongly connected components) and partition it instead.
Choose a valid partitioning
- Goals: balance load, reduce sync
Transform the code
[Figure: an example dependence graph, its SCCs, and the resulting DAGSCC. Edge legend: intra-iteration, loop-carried, register, memory, control.]

21 2 Extensions to the Basic Transformation
Speculation
- Break statistically unlikely dependences
- Form better-balanced pipelines
Parallel Stages
- Execute multiple copies of certain “large” stages
- Stages that contain inner loops are perfect candidates

22 Why Speculation?
    A: while(node)
    B:   ncost = doit(node);
    C:   cost += ncost;
    D:   node = node->next;
[Figure: the dependence graph and its DAGSCC: nodes {A, D}, {B}, {C}. Edge legend: intra-iteration, loop-carried, register, control, communication queue.]

23 Why Speculation?
    A: while(cost < T && node)
    B:   ncost = doit(node);
    C:   cost += ncost;
    D:   node = node->next;
[Figure: with the early exit, C now feeds the loop condition A, so the dependence graph collapses into a single SCC {A, B, C, D} in the DAGSCC. Statistically predictable dependences are highlighted. Edge legend: intra-iteration, loop-carried, register, control, communication queue.]

24 Why Speculation?
    A: while(cost < T && node)
    B:   ncost = doit(node);
    C:   cost += ncost;
    D:   node = node->next;
[Figure: speculating the predictable dependences removes A from the cycle: the DAGSCC breaks back into {D}, {B}, {C}, with the speculated A handled separately. Edge legend: intra-iteration, loop-carried, register, control, communication queue.]

25 Misspeculation Recovery
[Figure: execution paradigm for the speculated pipeline: stages D, B, and C stream iterations through their communication queues while A is checked off the critical path. When misspeculation is detected, misspeculation recovery reruns iteration 4. Edge legend: intra-iteration, loop-carried, register, control, communication queue.]

26 Understanding PMT Performance
[Figure: two pipeline schedules on two cores. Left: balanced stages A and B, no idle time. Right: an unbalanced pipeline in which the faster thread sits idle waiting on the slower one.]
The per-iteration time t_i of a thread is at least as large as its longest dependence recurrence.
Finding the longest recurrence is NP-hard, and large loops make the problem difficult in practice.
Slowest thread: 1 cycle/iter vs. 2 cycles/iter → iteration rate: 1 iter/cycle vs. 0.5 iter/cycle.
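Restating the slide's claim as formulas (my notation, not the slides'): with per-iteration stage times t_i, throughput is set by the slowest thread, and each t_i is bounded below by the longest dependence recurrence kept in that thread:

    \[ \mathrm{Rate} = \frac{1}{\max_i t_i},
       \qquad
       t_i \ \ge \max_{r \,\in\, \mathrm{recurrences}(i)} \mathrm{latency}(r) \]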

27 Selecting Dependences To Speculate
    A: while(cost < T && node)
    B:   ncost = doit(node);
    C:   cost += ncost;
    D:   node = node->next;
[Figure: the speculated DAGSCC split four ways: Thread 1 gets D, Thread 2 gets B, Thread 3 gets C, and Thread 4 gets the speculated exit test A. Edge legend: intra-iteration, loop-carried, register, control, communication queue.]

28 Detecting Misspeculation
Thread 1:
    A1: while(consume(4))
    D:    node = node->next;
          produce({0,1}, node);
Thread 2:
    A2: while(consume(5))
    B:    ncost = doit(node);
          produce(2, ncost);
    D2:   node = consume(0);
Thread 3:
    A3: while(consume(6))
    B3:   ncost = consume(2);
    C:    cost += ncost;
          produce(3, cost);
Thread 4:
    A:  while(cost < T && node)
    B4:   cost = consume(3);
    C4:   node = consume(1);
          produce({4,5,6}, cost < T && node);
[Figure: the four-thread DAGSCC from the previous slide, with queue numbers matching the produce/consume pairs.]

29 Detecting Misspeculation
Thread 1:
    A1: while(TRUE)
    D:    node = node->next;
          produce({0,1}, node);
Thread 2:
    A2: while(TRUE)
    B:    ncost = doit(node);
          produce(2, ncost);
    D2:   node = consume(0);
Thread 3:
    A3: while(TRUE)
    B3:   ncost = consume(2);
    C:    cost += ncost;
          produce(3, cost);
Thread 4:
    A:  while(cost < T && node)
    B4:   cost = consume(3);
    C4:   node = consume(1);
          produce({4,5,6}, cost < T && node);
[Figure: same DAGSCC; speculating the exit test lets Threads 1-3 loop unconditionally, leaving only Thread 4 to evaluate the real condition.]

30 Detecting Misspeculation
Thread 1:
    A1: while(TRUE)
    D:    node = node->next;
          produce({0,1}, node);
Thread 2:
    A2: while(TRUE)
    B:    ncost = doit(node);
          produce(2, ncost);
    D2:   node = consume(0);
Thread 3:
    A3: while(TRUE)
    B3:   ncost = consume(2);
    C:    cost += ncost;
          produce(3, cost);
Thread 4:
    A:  while(cost < T && node)
    B4:   cost = consume(3);
    C4:   node = consume(1);
          if (!(cost < T && node)) FLAG_MISSPEC();
[Figure: Thread 4 no longer broadcasts the exit condition through queues 4-6; it flags misspeculation when the speculated "keep iterating" prediction turns out wrong.]

31 Breaking False Memory Dependences
[Figure: a false memory dependence in the graph is broken by memory versioning: each iteration writes into its own memory version (version 3 shown), and the oldest version is committed by the recovery thread. Edge legend: false memory, intra-iteration, loop-carried, register, control, communication queue.]

32 Adding Parallel Stages to DSWP
    while (ptr = ptr->next)      // LD, 1 cycle
      ptr->val = ptr->val + 1;   // X, 2 cycles
Comm. latency = 2 cycles
[Figure: three-core PS-DSWP schedule: Core 1 runs LD:1, LD:2, ...; Cores 2 and 3 alternate X:1, X:2, ... from different iterations.]
Throughput:
- DSWP: 1/2 iteration/cycle
- DOACROSS: 1/2 iteration/cycle
- PS-DSWP: 1 iteration/cycle
[Speaker notes: The schedule shows how PS-DSWP operates when the latency of X and the inter-core communication latency are both 2 cycles. PS-DSWP replicates the stage containing X onto two cores, so two Xs from different iterations execute concurrently, achieving a throughput of 1 iteration per cycle, in contrast to the 1/2 iteration per cycle of DOACROSS and DSWP. While X is shown here as a single 2-cycle statement, it could be a stage with varying latency; for instance, an inner loop with a varying trip count, which DSWP could not partition if it is a single recurrence. As we will see later, a key observation enabling PS-DSWP is that a recurrence in a loop does not necessarily imply loop-carried dependences.]
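Working through the slide's throughput numbers (my notation): with t_LD = 1, t_X = 2, and the X stage replicated over k cores, the steady-state rate is bounded by the slowest effective stage:

    \[ \mathrm{Rate} = \min\!\left(\frac{1}{t_{\mathrm{LD}}},\ \frac{k}{t_{\mathrm{X}}}\right)
       \;\Rightarrow\;
       \mathrm{DSWP}\ (k=1):\ \tfrac{1}{2}\ \mathrm{iter/cycle};
       \quad
       \mathrm{PS\text{-}DSWP}\ (k=2):\ 1\ \mathrm{iter/cycle} \]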

33 Thread Partitioning
       p = list; sum = 0;
    A: while (p != NULL) {
    B:   id = p->id;
    E:   q = p->inner_list;
    C:   if (!visited[id]) {
    D:     visited[id] = true;
    F:     while (foo(q))
    G:       q = q->next;
    H:     if (q != NULL)
    I:       sum += p->value;
         }
    J:   p = p->next;
       }
[Figure: the loop's program dependence graph (PDG); statement nodes carry profile weights (A, B, C, D: 10 each; E, H, J: 5 each; F, G: 50 each; I: 3, with its self edge marked Reduction). The three non-trivial SCCs are highlighted.]
[Speaker notes: PS-DSWP works on the program dependence graph representation of the loop. In the figure, the PDG's nodes represent statements and its edges represent dependences between statements. The actual code in the loop is not important for our discussion, except to note that it contains complex control flow and irregular memory accesses, making it hard for many techniques to parallelize. The nodes are annotated with profile weights, which represent the number of times the corresponding statement is executed. The self edge on node I is annotated as a reduction edge, since sum reduction (accumulator expansion) can be applied to statement I. Once the PDG is constructed, the nodes are divided into a set of strongly connected components; the three non-trivial SCCs are highlighted in the figure.]

34 Thread Partitioning: DAGSCC
[Speaker notes: The next step is to construct the DAGSCC. As the name indicates, it is a DAG whose nodes represent the SCCs of the original PDG.]

35 Thread Partitioning
Merging invariants:
- No cycles
- No loop-carried dependence inside a doall node
[Figure: the DAGSCC with nodes {C,D} (weight 20), {B} (10), {A,J} (15), {E} (5), {F,G} (100), {H} (5), {I} (3), each labeled doall or sequential.]
[Speaker notes: A DAGSCC node is labeled sequential if the corresponding SCC's nodes are connected by a loop-carried dependence, unless that edge is marked as a reduction edge; all other nodes are labeled doall. Two things are worth noting. Node I is labeled doall even though it has a self arc, because that arc is a reduction arc. Node FG, which corresponds to the entire inner loop, is also doall: its execution latency could be arbitrarily high depending on trip count, and it could not be partitioned across cores since it is a single SCC. The key observation is that what matters when replicating a stage is whether it has loop-carried dependences with respect to the loop being parallelized, which is the outer loop here. At this point each DAGSCC node could be made a pipeline stage (replicated if doall), but that would produce more stages than available threads, so nodes are merged until each remaining node can become a stage. The current partitioning algorithm aims for one large doall stage and a minimal number of sequential stages; this works well when the loop is almost DOALL except for a small serialized portion, though a new heuristic is in development for the cases where it does poorly. The doall nodes are merged first, greedily, respecting the two invariants above: merging must not form a cycle, and two doall nodes connected by a loop-carried dependence cannot be merged. Starting with the largest doall node (FG here), it is merged with a heavy neighbor (E); note that this greedy choice prevents B and E from being merged, which yields a slightly poorer partitioning, as seen later. Nodes H and I are merged in subsequently.]
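A compact sketch of the greedy doall-merging step the notes describe (data layout and helper names are mine, not the paper's; merge_ok() is a placeholder for the two invariant checks):

    #include <stdbool.h>

    #define MAXN 64

    typedef struct {
        int  weight;             /* profile weight of the (merged) node */
        bool doall;              /* no loop-carried dep (or reduction only) */
        bool alive;              /* false once absorbed into another node */
        bool adj[MAXN];          /* neighbor relation, kept symmetric */
    } scc_node;

    /* Placeholder for the two merging invariants: the merge must not
       create a cycle in the DAG, and two doall nodes joined by a
       loop-carried dependence may not merge. A real implementation
       needs reachability and dependence queries here. */
    static bool merge_ok(scc_node g[], int u, int v) {
        (void)g; (void)u; (void)v;
        return true;
    }

    /* Greedily grow one large doall stage: repeatedly take the heaviest
       live doall node and absorb its heaviest mergeable doall neighbor
       (FG absorbs E, then H and I, in the running example). */
    void merge_doall_nodes(scc_node g[], int n) {
        for (;;) {
            int u = -1, v = -1;
            for (int i = 0; i < n; i++)      /* heaviest live doall node */
                if (g[i].alive && g[i].doall &&
                    (u < 0 || g[i].weight > g[u].weight))
                    u = i;
            if (u < 0) return;
            for (int j = 0; j < n; j++)      /* heaviest mergeable neighbor */
                if (j != u && g[j].alive && g[j].doall && g[u].adj[j] &&
                    merge_ok(g, u, j) && (v < 0 || g[j].weight > g[v].weight))
                    v = j;
            if (v < 0) return;               /* no legal merge remains */
            g[u].weight += g[v].weight;      /* absorb v into u */
            g[v].alive = false;
            for (int k = 0; k < n; k++)      /* u inherits v's edges */
                if (k != u && g[v].adj[k])
                    g[u].adj[k] = g[k].adj[u] = true;
        }
    }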

36 Thread Partitioning
[Figure: the DAGSCC after doall merging: {E,F,G,H,I} (weight 113) and {B} (10, now treated as sequential), plus sequential nodes {C,D} (20) and {A,J} (15).]
[Speaker notes: Thus we end up with a large doall node EFGHI and one other doall node, B. Since we are looking for a single doall stage, we relabel B as a sequential node. (Had E and B been merged earlier, the single parallel stage would have had slightly less weight.) The next step is to greedily merge the sequential nodes, taking care that no cycles form in the resulting graph.]

37 Thread Partitioning
[Figure: the final partition: one sequential stage ABCDJ (weight 45) and the doall stage EFGHI (weight 113).]
[Speaker notes: This results in a single sequential stage, ABCDJ. One thread is allotted to the sequential stage and the remaining threads to the doall stage. From this partitioning, multi-threaded code is generated with a modified version of MTCG [Ottoni, MICRO '05].]

38 Discussion Point 1 – Speculation
How do you decide which dependences to speculate?
- Look solely at profile data? How do you ensure enough profile coverage?
- What about code structure?
What if you are wrong?
- Undo speculation decisions at run-time?
How do you manage speculation in a pipeline?
- The traditional definition of a transaction is broken: a transaction's execution is spread out across multiple cores

39 Discussion Point 2 – Pipeline Structure
When is a pipeline a good/bad choice for parallelization?
Is pipelining good or bad for cache performance?
- Is DOALL better/worse for cache?
Can a pipeline be adjusted when the number of available cores increases/decreases?

