EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining


EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining University of Michigan November 12, 2012

Announcements + Reading Material
- 1st paper review due today
  - Should have submitted to andrew.eecs.umich.edu:/y/submit
- Next Monday – Midterm exam in class
  - See room assignments
- Today's class reading
  - "Automatic Thread Extraction with Decoupled Software Pipelining," G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, Nov. 2005.
- Next class
  - "Spice: Speculative Parallel Iteration Chunk Execution," E. Raman, N. Vachharajani, R. Rangan, and D. I. August, Proc. of the 2008 Intl. Symposium on Code Generation and Optimization, April 2008.

Midterm Exam
- When: Monday, Nov 19, 2012, 10:40-12:30
- Where
  - 1003 EECS: Uniquenames (not first or last names!) starting with A-E
  - 1005 EECS: Uniquenames starting with F-N
  - 3150 Dow (our classroom): Uniquenames starting with P-Z
- What to expect
  - Open book/notes, no laptops
  - Apply techniques we discussed in class on examples
  - Reason about solving compiler problems – why things are done
  - A couple of thinking problems
  - No LLVM code
  - Reasonably long, but you should finish

Midterm Exam
- Last 2 years' exams are posted on the course website
  - Note – Past exams may not accurately predict future exams!!
- Office hours between now and Monday if you have questions
  - James: Tue, Thu, Fri: 1-3pm
  - Scott: Tue 4:30-5:30pm, Fri 5-6pm (extra slot)
- Studying
  - Yes, you should study even though it's open notes
    - Lots of material that you have likely forgotten from early this semester
    - Refresh your memory
  - No memorization required, but you need to be familiar with the material to finish the exam
  - Go through the lecture notes, especially the examples!
  - If you are confused on a topic, go through the reading
  - If still confused, come talk to me or James
  - Go through the practice exams as the final step

Exam Topics
- Control flow analysis
  - Control flow graphs, Dom/pdom, Loop detection
  - Trace selection, superblocks
- Predicated execution
  - Control dependence analysis, if-conversion, hyperblocks
  - Can ignore control height reduction
- Dataflow analysis
  - Liveness, reaching defs, DU/UD chains, available defs/exprs
  - Static single assignment
- Optimizations
  - Classical: Dead code elim, constant/copy prop, CSE, LICM, induction variable strength reduction
  - ILP optimizations – unrolling, renaming, tree height reduction, induction/accumulator expansion
  - Speculative optimization – like HW 1

Exam Topics - Continued
- Acyclic scheduling
  - Dependence graphs, Estart/Lstart/Slack, list scheduling
  - Code motion across branches, speculation, exceptions
  - Can ignore sentinel scheduling
- Software pipelining
  - DSA form, ResMII, RecMII, modulo scheduling
  - Make sure you can modulo schedule a loop!
  - Execution control with LC, ESC
- Register allocation
  - Live ranges, graph coloring
- Research topics
  - Can ignore automatic parallelization

Last Class
- Scientific codes – Successful parallelization
  - KAP, SUIF, Parascope, gcc w/ Graphite
  - Affine array dependence analysis
  - DOALL parallelization
- C programs
  - Not dominated by array accesses – classic parallelization fails
  - Speculative parallelization – Hydra, Stampede, speculative multithreading
  - Profiling to identify statistical DOALL loops
  - But not all loops are DOALL, and outer loops typically are not!!
- This class – Parallelizing loops with dependences

What About Non-Scientific Codes???
- Scientific Codes (FORTRAN-like) – Independent Multithreading (IMT), Example: DOALL parallelization
    for (i = 1; i <= N; i++)      // C
      a[i] = a[i] + 1;            // X
- General-purpose Codes (legacy C/C++) – Cyclic Multithreading (CMT), Example: DOACROSS [Cytron, ICPP 86]
    while (ptr = ptr->next)       // LD
      ptr->val = ptr->val + 1;    // X
[Schedule diagrams: IMT runs iterations C:i/X:i of the array loop independently on Core 1 and Core 2; CMT alternates iterations LD:i/X:i of the pointer-chasing loop between Core 1 and Core 2.]
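To make the CMT/DOACROSS schedule concrete, here is a minimal sketch (not from the course materials) of the pointer-chasing loop split across two workers: each worker forwards the loop-carried ptr to the other worker as soon as its LD completes, so the next iteration can start on the other core while X finishes locally. The spsc_t queue, the produce/consume helpers, and the thread structure are illustrative assumptions.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    typedef struct node { struct node *next; int val; } node_t;

    /* Illustrative single-producer/single-consumer queue (busy-waiting). */
    #define QSIZE 1024
    typedef struct {
        void *slot[QSIZE];
        atomic_size_t head, tail;              /* head: next write, tail: next read */
    } spsc_t;

    static void produce(spsc_t *q, void *v) {
        size_t h = atomic_load(&q->head);
        while (h - atomic_load(&q->tail) == QSIZE) ;   /* spin while full */
        q->slot[h % QSIZE] = v;
        atomic_store(&q->head, h + 1);
    }

    static void *consume(spsc_t *q) {
        size_t t = atomic_load(&q->tail);
        while (atomic_load(&q->head) == t) ;           /* spin while empty */
        void *v = q->slot[t % QSIZE];
        atomic_store(&q->tail, t + 1);
        return v;
    }

    static spsc_t to_core[2];                  /* to_core[i] feeds worker i */

    /* DOACROSS: iterations alternate between the two workers.  The only
     * loop-carried value (ptr) is forwarded right after LD, so iteration
     * i+1 can begin remotely before iteration i's X has finished here. */
    static void *doacross_worker(void *arg) {
        int me = (int)(size_t)arg;
        for (;;) {
            node_t *ptr = consume(&to_core[me]);    /* LD result of previous iteration */
            if (!ptr) { produce(&to_core[1 - me], NULL); break; }
            produce(&to_core[1 - me], ptr->next);   /* LD: forward immediately */
            ptr->val = ptr->val + 1;                /* X */
        }
        return NULL;
    }

    int main(void) {
        node_t nodes[8] = {0};                 /* small list, val = 0 everywhere */
        for (int i = 0; i < 7; i++) nodes[i].next = &nodes[i + 1];

        pthread_t t0, t1;
        pthread_create(&t0, NULL, doacross_worker, (void *)0);
        pthread_create(&t1, NULL, doacross_worker, (void *)1);
        produce(&to_core[0], nodes[0].next);   /* seed with the first ptr->next */
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        for (int i = 0; i < 8; i++) printf("%d ", nodes[i].val);
        printf("\n");                          /* nodes 1..7 incremented once */
        return 0;
    }

Because the loop-carried ptr crosses cores on every iteration, CMT's throughput is tied to the inter-core communication latency, which is the point the next few slides make.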

Alternative Parallelization Approaches
    while (ptr = ptr->next)       // LD
      ptr->val = ptr->val + 1;    // X
- Cyclic Multithreading (CMT)
- Pipelined Multithreading (PMT) – Example: DSWP [PACT 2004]
[Schedule diagrams: CMT alternates LD:i/X:i between Core 1 and Core 2; PMT keeps every LD on Core 1 and every X on Core 2.]
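For contrast, a minimal PMT/DSWP-style sketch of the same loop: one thread executes every LD and streams the pointers to a second thread that executes every X. Only the two thread bodies are shown; node_t and the produce/consume helpers are the illustrative ones from the previous sketch.

    static spsc_t ld_to_x;                     /* stage 1 -> stage 2 queue */

    /* Stage 1: the traversal recurrence (LD) stays entirely in this thread. */
    static void *ld_stage(void *arg) {
        node_t *ptr = (node_t *)arg;
        while ((ptr = ptr->next) != NULL)      /* LD */
            produce(&ld_to_x, ptr);            /* stream work downstream */
        produce(&ld_to_x, NULL);               /* end-of-stream marker */
        return NULL;
    }

    /* Stage 2: the update (X) consumes pointers in order. */
    static void *x_stage(void *arg) {
        (void)arg;
        node_t *ptr;
        while ((ptr = consume(&ld_to_x)) != NULL)
            ptr->val = ptr->val + 1;           /* X */
        return NULL;
    }

All communication flows one way (stage 1 to stage 2), so queue latency delays only the pipeline fill, not every iteration.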

Comparison: IMT, PMT, CMT
[Schedule diagrams: IMT, PMT, and CMT executions of the example loops on Core 1 and Core 2.]

Comparison: IMT, PMT, CMT
- lat(comm) = 1: IMT, PMT, and CMT each achieve 1 iter/cycle
[Schedule diagrams on Core 1 / Core 2 omitted.]

Comparison: IMT, PMT, CMT

                  IMT             PMT             CMT
    lat(comm)=1   1 iter/cycle    1 iter/cycle    1 iter/cycle
    lat(comm)=2   1 iter/cycle    1 iter/cycle    0.5 iter/cycle

[Schedule diagrams on Core 1 / Core 2 omitted.]
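A back-of-the-envelope model for the table (my simplification, not from the slides), assuming one statement per cycle and writing t_i for the per-iteration work of pipeline stage i: CMT carries the communication latency on its cross-iteration critical path, while PMT pays it only once to fill the pipeline.

    \[
      \text{Rate}_{\mathrm{CMT}} \approx \frac{1}{\max\!\left(1,\ \mathrm{lat_{comm}}\right)} \ \text{iter/cycle},
      \qquad
      \text{Rate}_{\mathrm{PMT}} \approx \frac{1}{\max_i t_i} \ \text{iter/cycle (after pipeline fill)}.
    \]

With lat_comm = 2 and every t_i = 1, this gives 0.5 vs. 1 iter/cycle, matching the table.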

Comparison: IMT, PMT, CMT
- Thread-local recurrences → fast execution
- Cross-thread dependences → wide applicability
[Schedule diagrams for IMT, PMT, and CMT on Core 1 / Core 2 omitted.]

Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP
- 197.parser, pipelined as: Find English Sentences → Parse Sentences (95%) → Emit Results
- Decoupled Software Pipelining / PS-DSWP (speculative DOALL middle stage)
[Speaker notes] This is parser with two input sets, shown in green. The red dots are the historic performance trend lines, our goal, shown on the first slide. We can automatically parallelize parser in a scalable way – all we need is for the programmer to remove or mark the custom allocator. That is so much easier than rewriting parser in a new language.

Decoupled Software Pipelining

Decoupled Software Pipelining (DSWP) [MICRO 2005]
    A: while (node)
    B:   ncost = doit(node);
    C:   cost += ncost;
    D:   node = node->next;
[Dependence graph and DAG_SCC: the recurrence {A, D} (loop control and pointer chasing) is assigned to Thread 1; B and C (the work and the cost accumulation) go to Thread 2. Edge legend: intra-iteration, loop-carried, register, control, communication queue.]
Inter-thread communication latency is a one-time cost.
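A minimal two-thread sketch of the partition above, again assuming the illustrative produce/consume SPSC queue helpers; the queue name, thread signatures, and the doit declaration are assumptions, not the compiler-generated (MTCG) output.

    /* Assumed from the earlier sketches: node_t, spsc_t, produce(), consume(). */
    static spsc_t q_node;                      /* thread 1 -> thread 2 */
    extern long doit(node_t *node);            /* the loop's work function */

    /* Thread 1: the {A, D} recurrence -- loop control and pointer chasing. */
    static void *thread1(void *arg) {
        node_t *node = (node_t *)arg;
        while (node) {                         /* A */
            produce(&q_node, node);            /* send this iteration's node */
            node = node->next;                 /* D (recurrence stays local) */
        }
        produce(&q_node, NULL);                /* tell thread 2 to stop */
        return NULL;
    }

    /* Thread 2: B and C -- the work and the cost accumulation. */
    static void *thread2(void *arg) {
        long cost = 0;
        node_t *node;
        while ((node = consume(&q_node)) != NULL) {
            long ncost = doit(node);           /* B */
            cost += ncost;                     /* C (recurrence stays local) */
        }
        *(long *)arg = cost;                   /* report the final cost */
        return NULL;
    }

Thread 1 would be launched with the list head, thread 2 with a pointer to a long that receives the final cost. Since every dependence crosses the threads in one direction only, a longer communication latency adds pipeline fill time rather than slowing each iteration – the "one-time cost" point above.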

Implementing DSWP
- Choose a valid partitioning – Goals: balance load, reduce sync
- Transform the code
[Diagram (labels: L1, DFG, Aux); edge legend: intra-iteration, loop-carried, register, memory, control.]

Optimization: Node Splitting to Eliminate Cross-Thread Control
[Diagram; edge legend: intra-iteration, loop-carried, register, memory, control.]

Optimization: Node Splitting to Reduce Communication
[Diagram; edge legend: intra-iteration, loop-carried, register, memory, control.]

Constraint: Strongly Connected Components
- Consider: a partition that splits a dependence cycle across threads – this eliminates the pipelined/decoupled property
- Solution: DAG_SCC – collapse each strongly connected component of the dependence graph into a single node, then choose a valid partitioning (Goals: balance load, reduce sync) and transform the code
[Diagram; edge legend: intra-iteration, loop-carried, register, memory, control.]
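Because a valid partition must never split a dependence cycle, the transformation starts by collapsing SCCs. Below is a compact Tarjan-style sketch over the four-statement example; the adjacency matrix is my reading of the example's dependences, not the output of a real PDG builder.

    #include <stdio.h>

    #define N 4                    /* statements A, B, C, D of the example loop */

    /* dep[i][j] = 1 means there is a dependence edge i -> j.  Edges assumed:
     * A controls B, C, D; D feeds A and B (node) and itself (loop-carried);
     * B feeds C (ncost); C feeds itself (loop-carried cost). */
    static const int dep[N][N] = {
        /*        A  B  C  D */
        /* A */ { 0, 1, 1, 1 },
        /* B */ { 0, 0, 1, 0 },
        /* C */ { 0, 0, 1, 0 },
        /* D */ { 1, 1, 0, 1 },
    };

    static int idx[N], low[N], on_stack[N], scc_id[N];
    static int stack[N], sp, next_index, next_scc;

    static int min(int a, int b) { return a < b ? a : b; }

    static void strongconnect(int v) {
        idx[v] = low[v] = ++next_index;
        stack[sp++] = v; on_stack[v] = 1;
        for (int w = 0; w < N; w++) {
            if (!dep[v][w]) continue;
            if (idx[w] == 0) { strongconnect(w); low[v] = min(low[v], low[w]); }
            else if (on_stack[w]) low[v] = min(low[v], idx[w]);
        }
        if (low[v] == idx[v]) {                /* v is the root of an SCC */
            int w;
            do { w = stack[--sp]; on_stack[w] = 0; scc_id[w] = next_scc; }
            while (w != v);
            next_scc++;
        }
    }

    int main(void) {
        const char *name = "ABCD";
        for (int v = 0; v < N; v++)
            if (idx[v] == 0) strongconnect(v);
        for (int v = 0; v < N; v++)
            printf("%c -> SCC %d\n", name[v], scc_id[v]);
        /* Expected grouping here: {A, D}, {B}, {C} -- the DAG_SCC nodes. */
        return 0;
    }

Each resulting SCC becomes one node of the DAG_SCC, and partitioning then assigns whole SCCs to threads.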

2 Extensions to the Basic Transformation
- Speculation
  - Break statistically unlikely dependences
  - Form better-balanced pipelines
- Parallel Stages
  - Execute multiple copies of certain "large" stages
  - Stages that contain inner loops are perfect candidates

Why Speculation?
    A: while (node)
    B:   ncost = doit(node);
    C:   cost += ncost;
    D:   node = node->next;
[Dependence graph and DAG_SCC: nodes {A, D}, B, and C. Edge legend: intra-iteration, loop-carried, register, control, communication queue.]

Why Speculation?
    A: while (cost < T && node)
    B:   ncost = doit(node);
    C:   cost += ncost;
    D:   node = node->next;
[Dependence graph and DAG_SCC: the early exit (cost < T) adds a dependence from C back to A, so A, B, C, and D collapse into a single SCC. Predictable dependences are marked. Edge legend: intra-iteration, loop-carried, register, control, communication queue.]

Why Speculation?
    A: while (cost < T && node)
    B:   ncost = doit(node);
    C:   cost += ncost;
    D:   node = node->next;
[Dependence graph and DAG_SCC: speculating the predictable dependences breaks the single large SCC back into separate nodes D, B, C, and A. Edge legend: intra-iteration, loop-carried, register, control, communication queue.]

Misspeculation Recovery Execution Paradigm
[Diagram: the DAG_SCC stages (D, B, C, A) run as a pipeline; when misspeculation is detected, recovery reruns the misspeculated iteration. Edge legend: intra-iteration, loop-carried, register, control, communication queue.]

Understanding PMT Performance
- Throughput is set by the slowest thread: the per-iteration time t_i of a stage is at least as large as its longest dependence recurrence
- NP-hard to find the longest recurrence; large loops make the problem difficult in practice
- Example: slowest thread at 1 cycle/iter → iteration rate 1 iter/cycle; slowest thread at 2 cycles/iter → 0.5 iter/cycle
[Schedule diagrams showing idle time on Core 1 / Core 2 omitted.]

Selecting Dependences To Speculate
    A: while (cost < T && node)
    B:   ncost = doit(node);
    C:   cost += ncost;
    D:   node = node->next;
[Dependence graph and DAG_SCC: after speculation, D → Thread 1, B → Thread 2, C → Thread 3, A → Thread 4. Edge legend: intra-iteration, loop-carried, register, control, communication queue.]

Detecting Misspeculation
[DAG_SCC: D, B, C, A]
Thread 1:
    A1: while (consume(4))
    D :   node = node->next
          produce({0,1}, node);
Thread 2:
    A2: while (consume(5))
    B :   ncost = doit(node);
          produce(2, ncost);
    D2:   node = consume(0);
Thread 3:
    A3: while (consume(6))
    B3:   ncost = consume(2);
    C :   cost += ncost;
          produce(3, cost);
Thread 4:
    A : while (cost < T && node)
    B4:   cost = consume(3);
    C4:   node = consume(1);
          produce({4,5,6}, cost < T && node);

Detecting Misspeculation
[DAG_SCC: D, B, C, A]
Thread 1:
    A1: while (TRUE)
    D :   node = node->next
          produce({0,1}, node);
Thread 2:
    A2: while (TRUE)
    B :   ncost = doit(node);
          produce(2, ncost);
    D2:   node = consume(0);
Thread 3:
    A3: while (TRUE)
    B3:   ncost = consume(2);
    C :   cost += ncost;
          produce(3, cost);
Thread 4:
    A : while (cost < T && node)
    B4:   cost = consume(3);
    C4:   node = consume(1);
          produce({4,5,6}, cost < T && node);

Detecting Misspeculation
[DAG_SCC: D, B, C, A]
Thread 1:
    A1: while (TRUE)
    D :   node = node->next
          produce({0,1}, node);
Thread 2:
    A2: while (TRUE)
    B :   ncost = doit(node);
          produce(2, ncost);
    D2:   node = consume(0);
Thread 3:
    A3: while (TRUE)
    B3:   ncost = consume(2);
    C :   cost += ncost;
          produce(3, cost);
Thread 4:
    A : while (cost < T && node)
    B4:   cost = consume(3);
    C4:   node = consume(1);
          if (!(cost < T && node)) FLAG_MISSPEC();

Breaking False Memory Dependences
[Diagram: dependence graph for A, B, C, D with a false memory dependence broken by memory versioning – the oldest version (e.g., Memory Version 3) is committed by the recovery thread. Edge legend: false memory, intra-iteration, loop-carried, register, control, communication queue.]

Adding Parallel Stages to DSWP
    while (ptr = ptr->next)       // LD, 1 cycle
      ptr->val = ptr->val + 1;    // X, 2 cycles
- Comm. latency = 2 cycles
- Throughput – DSWP: 1/2 iteration/cycle; DOACROSS: 1/2 iteration/cycle; PS-DSWP: 1 iteration/cycle
[Schedule diagram: Core 1 runs LD:1, LD:2, ...; Cores 2 and 3 run the replicated X stage on alternating iterations.]
[Speaker notes] The schedule here shows how PS-DSWP would operate when the latency of X and the inter-core communication latency are both 2 cycles. PS-DSWP replicates the stage containing X onto two cores, so two Xs from two different iterations can execute concurrently on two different cores. This helps PS-DSWP achieve a throughput of 1 iteration per cycle, in contrast to the 1/2 iteration per cycle throughput provided by DOACROSS and DSWP. While in the example we have shown X as a single statement with a latency of 2, X could be a stage with varying latency. For instance, X could be an inner loop with a varying number of iterations, which could not be partitioned by DSWP if it is a single recurrence. As we will see later, one of the key observations that enables PS-DSWP to perform well is that a recurrence in a loop does not necessarily imply the presence of loop-carried dependences.
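A minimal sketch of the parallel-stage idea, once more assuming the illustrative produce/consume queue helpers: the sequential LD stage deals pointers round-robin to two replicated X workers, which is legal only because X carries no loop-carried dependence with respect to this loop. Names and the dispatch policy are assumptions.

    #define NREP 2                               /* replicas of the X stage */
    static spsc_t to_x[NREP];                    /* one queue per replica   */

    /* Sequential stage: owns the pointer-chasing recurrence (LD). */
    static void *ld_stage_ps(void *arg) {
        node_t *ptr = (node_t *)arg;
        long i = 0;
        while ((ptr = ptr->next) != NULL)        /* LD */
            produce(&to_x[i++ % NREP], ptr);     /* deal iterations round-robin */
        for (int r = 0; r < NREP; r++)
            produce(&to_x[r], NULL);             /* stop every replica */
        return NULL;
    }

    /* Parallel stage: each replica updates the nodes it was dealt. */
    static void *x_stage_ps(void *arg) {
        spsc_t *in = (spsc_t *)arg;              /* pass &to_x[r] to replica r */
        node_t *ptr;
        while ((ptr = consume(in)) != NULL)
            ptr->val = ptr->val + 1;             /* X: no loop-carried dependence */
        return NULL;
    }

With X taking 2 cycles and two replicas, a new X can start every cycle, which is where the slide's 1 iteration/cycle figure comes from.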

Thread Partitioning
    p = list; sum = 0;
    A: while (p != NULL) {
    B:   id = p->id;
    E:   q = p->inner_list;
    C:   if (!visited[id]) {
    D:     visited[id] = true;
    F:     while (foo(q))
    G:       q = q->next;
    H:     if (q != NULL)
    I:       sum += p->value;
         }
    J:   p = p->next;
       }
[PDG: nodes annotated with profile weights (10, 5, 50, 3, ...); the self edge on I is marked as a Reduction edge; the three non-trivial SCCs are highlighted.]
[Speaker notes] PS-DSWP works on the program dependence graph (PDG) representation of the loop. In the figure, we can see the PDG of the loop on the left, with the nodes of the PDG representing statements and the edges representing dependences between statements. The actual code in the loop is not important for the purpose of our discussion, except to note that it contains complex control flow and irregular memory accesses, making it hard for many techniques to parallelize. The nodes in the PDG are annotated with profile weights, which represent the number of times the corresponding statement is executed. The edges have annotations too: the self edge on node I is annotated as a reduction edge, since sum reduction or accumulator expansion can be applied to statement I. Once the PDG is constructed, the nodes are divided into a set of strongly connected components. The three non-trivial SCCs are highlighted in the figure.
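The partitioner needs little more from the PDG than nodes tagged with profile weights and edges classified by kind. A minimal C sketch of such a representation, with field names that are my assumptions rather than the paper's implementation:

    /* Dependence kinds from the slides' edge legend. */
    typedef enum { DEP_REGISTER, DEP_MEMORY, DEP_CONTROL } dep_kind_t;

    typedef struct pdg_edge {
        int src, dst;                 /* statement/node ids                    */
        dep_kind_t kind;              /* register, memory, or control          */
        int loop_carried;             /* 0: intra-iteration, 1: loop-carried   */
        int reduction;                /* loop-carried edge removable by sum    */
                                      /* reduction / accumulator expansion     */
        struct pdg_edge *next;        /* next edge out of src                  */
    } pdg_edge_t;

    typedef struct pdg_node {
        const char *label;            /* e.g. "I: sum += p->value"             */
        long profile_weight;          /* execution count from profiling        */
        pdg_edge_t *out;              /* outgoing dependences                  */
        int scc_id;                   /* filled in when SCCs are collapsed     */
    } pdg_node_t;

Everything the later merging steps look at – weights, DOALL vs. sequential, the reduction exception – can be computed from these fields.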

Thread Partitioning: DAG_SCC
[Speaker notes] The next step is to construct the DAG_SCC. As the name indicates, it is a DAG. The nodes represent SCCs in the original PDG.

Thread Partitioning
- Merging Invariants
  - No cycles
  - No loop-carried dependence inside a doall node
[DAG_SCC with node weights (20, 10, 15, 5, 100, 3, ...); doall and sequential nodes labeled.]
[Speaker notes] The nodes in the DAG_SCC are annotated as DOALL or sequential nodes. A node in the DAG_SCC is a sequential node if the corresponding SCC nodes are connected by a loop-carried dependence, unless that loop-carried edge is marked as a reduction edge. All the other nodes are labeled as doall. A couple of interesting things to note in this graph: node I is labeled doall even though it has a self arc, because that self arc is a reduction arc. Node FG, which corresponds to the entire inner loop, is a doall. The execution latency of this node could be arbitrarily high depending on its trip count, but it could not be partitioned across multiple cores by DSWP since it is a single SCC. The key observation here is that what matters when replicating a stage is whether it has any loop-carried dependences with respect to the loop being parallelized, which is the outer loop in this case. At this point, each node in the DAG_SCC could be made into a pipeline stage and, if it is a doall stage, replicated. But this would result in more stages than available threads, so we merge nodes together until there are just a few, each of which can be made into a stage. In our current partitioning algorithm, the goal is to get one large doall stage and a minimal number of sequential stages. This works well in many cases where the loop is almost a doall except for a small serialized portion; however, there are cases where this partitioning does not do well, and we are developing a new heuristic. Since we want a large doall stage, we adopt a greedy strategy and merge the doall nodes first. The merging of doall nodes must respect two invariants: first, no cycles should be formed in the graph as a result of merging; second, two doall nodes cannot be merged if they are connected by a loop-carried dependence. We merge greedily: we start with the largest doall node, FG in this case, and merge it with a neighbor with a large weight, E in this case. Note that this greedy strategy prevents nodes B and E from being merged, which results in a slightly poorer partitioning, as will be seen later. Subsequently we merge nodes H and I as well.

Thread Partitioning
[DAG_SCC after merging the doall nodes: node weights 20, 10, 15, and 113; node B is treated as sequential.]
[Speaker notes] Thus we end up with a large doall node EFGHI and another doall node B. Since we are looking for one single doall stage, we relabel B as a sequential node. Because B and E were not merged, our single parallel stage has a slightly smaller weight than it could have had. The next step is to greedily merge the sequential nodes, taking care that no cycles are formed in the resulting graph.

Thread Partitioning
[Final partition: a sequential stage of weight 45 and a doall stage of weight 113.]
- Modified MTCG [Ottoni, MICRO '05] to generate code from the partition
[Speaker notes] This results in a single sequential stage ABCDJ. We allot one thread to the sequential stage and the rest of the threads to the doall stage. From this partitioning, we generate multi-threaded code by adapting the MTCG algorithm proposed by Ottoni et al.
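A sketch of the greedy doall-merging step described in the notes, over a DAG_SCC stored as adjacency matrices; the two invariants are checked explicitly. This is my reconstruction of the heuristic as described, not the implementation from the paper, and the small graph in main is made up for illustration.

    #include <stdio.h>

    #define MAXN 16
    static int  n;                          /* number of DAG_SCC nodes         */
    static long weight[MAXN];               /* profile weight per node         */
    static int  doall[MAXN];                /* 1 if the node can be replicated */
    static int  edge[MAXN][MAXN];           /* dependence a -> b               */
    static int  loop_carried[MAXN][MAXN];   /* loop-carried dependence a -> b  */

    /* Is b reachable from a through at least one intermediate node?
     * If so, merging a and b would close a cycle in the DAG. */
    static int path_len2(int a, int b) {
        int seen[MAXN] = {0}, stack[MAXN], sp = 0;
        for (int s = 0; s < n; s++)
            if (edge[a][s] && s != b) { stack[sp++] = s; seen[s] = 1; }
        while (sp) {
            int v = stack[--sp];
            if (v == b) return 1;
            for (int s = 0; s < n; s++)
                if (edge[v][s] && !seen[s]) { stack[sp++] = s; seen[s] = 1; }
        }
        return 0;
    }

    static int can_merge_doall(int a, int b) {
        if (!doall[a] || !doall[b]) return 0;
        if (loop_carried[a][b] || loop_carried[b][a]) return 0;   /* invariant 2 */
        if (path_len2(a, b) || path_len2(b, a)) return 0;         /* invariant 1 */
        return 1;
    }

    /* Merge b into a: move edges over, add weights, retire b. */
    static void do_merge(int a, int b) {
        for (int v = 0; v < n; v++) {
            edge[a][v] |= edge[b][v];  edge[v][a] |= edge[v][b];
            loop_carried[a][v] |= loop_carried[b][v];
            loop_carried[v][a] |= loop_carried[v][b];
        }
        for (int v = 0; v < n; v++) {           /* drop all of b's edges */
            edge[b][v] = edge[v][b] = 0;
            loop_carried[b][v] = loop_carried[v][b] = 0;
        }
        edge[a][a] = loop_carried[a][a] = 0;    /* internal edges disappear */
        weight[a] += weight[b];
        weight[b] = 0;  doall[b] = 0;
    }

    /* Greedy: repeatedly pick the heaviest doall node that still has a
     * mergeable doall neighbor, and merge it with its heaviest such neighbor. */
    static void merge_doall_nodes(void) {
        for (;;) {
            int best_a = -1, best_b = -1;
            for (int a = 0; a < n; a++) {
                if (!doall[a]) continue;
                if (best_a >= 0 && weight[a] <= weight[best_a]) continue;
                int cand = -1;
                for (int b = 0; b < n; b++)
                    if (b != a && (edge[a][b] || edge[b][a]) && can_merge_doall(a, b)
                            && (cand < 0 || weight[b] > weight[cand]))
                        cand = b;
                if (cand >= 0) { best_a = a; best_b = cand; }
            }
            if (best_a < 0) return;             /* no legal merge left */
            do_merge(best_a, best_b);
        }
    }

    int main(void) {
        /* Hypothetical 4-node chain: 0(seq,5) -> 1(doall,100) -> 2(doall,40) -> 3(seq,5). */
        n = 4;
        long w[4] = { 5, 100, 40, 5 };  int d[4] = { 0, 1, 1, 0 };
        for (int i = 0; i < n; i++) { weight[i] = w[i]; doall[i] = d[i]; }
        edge[0][1] = edge[1][2] = edge[2][3] = 1;
        merge_doall_nodes();                    /* merges node 2 into node 1 */
        for (int i = 0; i < n; i++)
            if (weight[i]) printf("node %d: weight %ld (%s)\n",
                                  i, weight[i], doall[i] ? "doall" : "sequential");
        return 0;
    }

After the doall merge, the remaining sequential nodes would be merged with the same no-cycle check, yielding the single sequential stage shown on the slide.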

Discussion Point 1 – Speculation
- How do you decide what dependences to speculate?
  - Look solely at profile data? How do you ensure enough profile coverage? What about code structure?
  - What if you are wrong? Undo speculation decisions at run-time?
- How do you manage speculation in a pipeline?
  - Traditional definition of a transaction is broken
  - Transaction execution spread out across multiple cores

Discussion Point 2 – Pipeline Structure
- When is a pipeline a good/bad choice for parallelization?
- Is pipelining good or bad for cache performance? Is DOALL better/worse for cache?
- Can a pipeline be adjusted when the number of available cores increases/decreases?