1 EECS 583 – Class 20
Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling
University of Michigan, November 30, 2011
Guest Speaker Today: Daya Khudia

2 Announcements & Reading Material
- This class: "Orchestrating the Execution of Stream Programs on Multicore Platforms," M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008.
- Next class – GPU compilation: "Program optimization space pruning for a multithreaded GPU," S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008.

3 Stream Graph Modulo Scheduling (SGMS)
- Coarse-grain software pipelining
  - Equal work distribution
  - Communication/computation overlap
  - Synchronization costs
- Target: Cell processor
  - Cores with disjoint address spaces
  - Explicit copy to access remote data; DMA engine independent of the PEs
- Filters = operations, cores = function units
(Figure: Cell architecture – PPE (PowerPC) and SPE0–SPE7 on the EIB, with DRAM; each SPU has a 256 KB local store and an MFC (DMA engine).)

4 Preliminaries
- Synchronous Data Flow (SDF) [Lee '87]
- StreamIt [Thies '02]

A stateless StreamIt filter pushes and pops items from its input/output FIFOs:

    int->int filter FIR(int N, int wgts[N]) {
      work pop 1 push 1 {
        int i, sum = 0;
        for (i = 0; i < N; i++)
          sum += peek(i) * wgts[i];
        push(sum);
        pop();
      }
    }

A filter that carries internal state across firings, e.g.

    int wgts[N];
    wgts = adapt(wgts);

is stateful.
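The FIFO semantics above (peek reads ahead without consuming, pop consumes one item, push emits one) can be sketched in plain C. This is an illustrative model of one firing, not StreamIt's actual runtime; all names here are invented for the sketch:

```c
#define N 4  /* number of FIR taps; illustrative */

typedef struct { const int *buf; int head; } in_fifo;
typedef struct { int *buf; int tail; } out_fifo;

/* peek(i): read the i-th not-yet-consumed input item */
static int peek(const in_fifo *in, int i) { return in->buf[in->head + i]; }
/* pop(): consume one input item */
static void pop(in_fifo *in) { in->head++; }
/* push(v): emit one output item */
static void push(out_fifo *out, int v) { out->buf[out->tail++] = v; }

/* One firing of the FIR filter: peek a window of N items,
 * push 1 result, pop 1 input (matching "work pop 1 push 1"). */
void fir_work(in_fifo *in, out_fifo *out, const int wgts[N]) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += peek(in, i) * wgts[i];
    push(out, sum);
    pop(in);
}
```

Because the work function reads and writes only its FIFOs, the filter is stateless and can be replicated freely, which is what filter fission exploits later.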

5 SGMS Overview
(Figure: a stream graph partitioned across PE0–PE3 and software pipelined; after a prologue fills the pipeline, computation on all PEs overlaps with DMA in steady state for a speedup of ≈ 4, and an epilogue drains the pipeline.)

6 SGMS Phases
- Fission + processor assignment (for load balance)
- Stage assignment (for causality and DMA overlap)
- Code generation

7 Processor Assignment: Maximizing Throughput
- Assigns each filter i = 1, …, N to a PE j = 1, …, P
- Objective: minimize II, the initiation interval (the maximal load on any PE)
(Example: a six-filter stream graph A–F (W = per-filter workload) on four processing elements. A balanced assignment achieves the minimum II of 50, i.e., maximum throughput. Sequential execution takes T1 = 170 per iteration; with T2 = II = 50, speedup = T1/T2 = 3.4.)
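The II of a given filter-to-PE assignment is simply the maximum total workload placed on any one PE. A minimal C sketch (the workloads and assignments in the test are illustrative, chosen only to reproduce the slide's T1 = 170 and II = 50, not the lecture's exact graph):

```c
/* II (initiation interval) of an assignment = max load on any PE. */
int min_ii(const int work[], const int pe[], int n_filters, int n_pes) {
    int load[16] = {0};               /* assumes n_pes <= 16 */
    for (int i = 0; i < n_filters; i++)
        load[pe[i]] += work[i];       /* accumulate per-PE workload */
    int ii = 0;
    for (int j = 0; j < n_pes; j++)
        if (load[j] > ii)
            ii = load[j];             /* bottleneck PE determines II */
    return ii;
}
```

Throughput is 1/II, so minimizing the bottleneck load is the same as maximizing throughput.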

8 Need More Than Just Processor Assignment
- Assign filters to processors; goal: equal work distribution
- Graph partitioning? Bin packing?
(Example: in the original stream program, A (work 5) feeds B (40) and C (10), which join at D (5). The best 2-PE partition puts B alone on one PE, giving speedup = 60/40 = 1.5. Fissing B into B1 and B2, with a split S and join J, yields a modified stream program with speedup = 60/32 ≈ 2.)

9 Filter Fission Choices
(Figure: candidate fission choices for spreading the graph across PE0–PE3 – which one achieves speedup ≈ 4?)

10 Integrated Fission + PE Assignment
- Exact solution based on Integer Linear Programming (ILP), with split/join overhead factored in
- Objective function: minimize the maximal load on any PE
- Result:
  - Number of times to "split" each filter
  - Filter → processor mapping
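Ignoring fission and split/join overhead, the core of the assignment problem can be written as a small ILP. This is a sketch, not the paper's exact formulation: let w_i be the filter workloads and a_ij a binary variable that is 1 iff filter i is mapped to PE j:

```latex
\begin{aligned}
\text{minimize }\; & II \\
\text{subject to }\; & \sum_{j=1}^{P} a_{ij} = 1 \qquad \forall\, i = 1,\dots,N \\
& \sum_{i=1}^{N} w_i\, a_{ij} \le II \qquad \forall\, j = 1,\dots,P \\
& a_{ij} \in \{0,1\}
\end{aligned}
```

The full formulation additionally chooses how many times to split each stateless filter and charges the split/join overhead against the PEs that host those actors.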

11 Step 2: Forming the Software Pipeline
- To achieve speedup:
  - All chunks should execute concurrently
  - Communication should be overlapped
- Processor assignment alone is insufficient information
(Figure: with A on PE0 feeding B on PE1, running B_i immediately after A_i serializes the two PEs; software pipelining instead overlaps A_{i+2} with B_i, so the A→B DMA transfer is hidden behind computation.)

12 Stage Assignment
- Preserve causality (producer-consumer dependence): if filter i feeds filter j on the same PE, then S_j ≥ S_i.
- Communication-computation overlap: if i on PE1 feeds j on PE2, the DMA gets its own stage, with S_DMA > S_i and S_j = S_DMA + 1.
- Assign stages with a data-flow traversal of the stream graph, applying the two rules above.

13 Stage Assignment Example
(Example: the fissed graph from before, with A, S, B1, D on PE0 and C, B2, J on PE1 – stage 0: A, S, B1; stage 1: DMA; stage 2: C, B2, J; stage 3: DMA; stage 4: D.)
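The two stage-assignment rules reduce to a single pass over the graph in data-flow order. A simplified C sketch (the graph encoding is invented for illustration; the +2 on cross-PE edges reflects the intervening DMA stage):

```c
/* Assign pipeline stages by data-flow traversal.
 * Same-PE edge u -> v:  stage[v] >= stage[u]            (causality)
 * Cross-PE edge u -> v: the DMA occupies stage[u] + 1,
 *                       so stage[v] >= stage[u] + 2      (overlap)
 * Nodes must be topologically numbered and edges sorted by source. */
void assign_stages(int n_nodes, const int pe[],
                   const int src[], const int dst[], int n_edges,
                   int stage[]) {
    for (int v = 0; v < n_nodes; v++)
        stage[v] = 0;
    for (int e = 0; e < n_edges; e++) {
        int u = src[e], v = dst[e];
        int need = (pe[u] == pe[v]) ? stage[u] : stage[u] + 2;
        if (stage[v] < need)
            stage[v] = need;          /* take the max over all producers */
    }
}
```

On the fissed example (A, S, B1, D on PE0; C, B2, J on PE1) this reproduces the slide's stages: 0 for A, S, B1; 2 for C, B2, J; 4 for D, with the DMAs at stages 1 and 3.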

14 Step 3: Code Generation for Cell
- Target the Synergistic Processing Elements (SPEs)
  - PS3 – up to 6 SPEs
  - QS20 – up to 16 SPEs
- One thread per SPE
- Challenge: making a collection of independent threads implement a software pipeline; adapt the kernel-only code schema of a modulo schedule

15 Complete Example
Kernel-only code for SPE1; stage predicates switch the stages on one by one as the pipeline fills:

    void spe1_work() {
      char stage[5] = {0};
      int i, j;
      stage[0] = 1;
      for (i = 0; i < MAX; i++) {
        if (stage[0]) { A(); S(); B1(); }
        if (stage[1]) { /* DMA stage */ }
        if (stage[2]) { JtoD(); CtoD(); }
        if (stage[3]) { /* DMA stage */ }
        if (stage[4]) { D(); }
        for (j = 4; j > 0; j--)   /* advance the stage predicates */
          stage[j] = stage[j-1];
        barrier();
      }
    }

(Timeline: over successive iterations, SPE1 runs A, S, B1 while DMA1 moves AtoC, StoB2, and B1toJ; SPE2 runs C, B2, J while DMA2 moves JtoD and CtoD; finally SPE1 runs D.)

16 SGMS (ILP) vs. Greedy (MIT method, ASPLOS '06)
- Solver time < 30 seconds for 16 processors

17 SGMS Conclusions
- Streamroller: efficient mapping of stream programs to multicore via coarse-grain software pipelining
- Performance summary: 14.7x speedup on 16 cores; up to 35% better than the greedy solution (11% on average)
- Scheduling framework trades off memory space vs. load balance
  - Memory-constrained (embedded) systems
  - Cache-based systems

18 Discussion Points
- Is it possible to convert stateful filters into stateless ones?
- What if the application does not behave as you expect? (Filters change execution time? Memory faster/slower than expected?)
- Could this be adapted for a more conventional multiprocessor with caches?
- Can C code be automatically streamized?
- You have now seen 3 forms of software pipelining: 1) instruction-level modulo scheduling, 2) decoupled software pipelining, 3) stream graph modulo scheduling. Where else can it be used?

19 “Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures,”

20 Static versus Dynamic Scheduling
- Graph modulo scheduling is performed statically on the stream graph.
- What happens in case of dynamic resource changes?
(Figure: a fissed stream graph – A, B1–B4, C, D1–D2, E, F with splitters and joiners – mapped onto cores with private memories; one core's availability is in question.)

21 Overview of Flextream
Goal: perform adaptive stream graph modulo scheduling.
Input: streaming application; output: MSL commands.
Static phases (find an optimal modulo schedule for a virtualized member of a family of processors):
- Prepass replication: adjust the amount of parallelism for the target system by replicating actors.
- Work partitioning: find an optimal schedule for the virtualized processor family member.
Dynamic phases (light-weight adaptation of the schedule to the current configuration of the target hardware):
- Partition refinement: tunes the actor-processor mapping to the real configuration of the underlying hardware (load balance).
- Stage assignment: specifies how actors execute in time in the new actor-processor mapping.
- Buffer allocation: tries to efficiently allocate the storage requirements of the new schedule into the available memory units.

22 Overall Execution Flow
- Every application may go through multiple iterations of: resource change requested → resource change granted.

23 Prepass Replication [static]
(Example: actors A–F with workloads 10, 86, 246, 326, 566, 10 give badly skewed per-processor loads (P0:10, P1:86, P2:246, P3:326, P4:566, P5:10, P6:0, P7:0). Replicating the heavy actors – C into C0–C3, D into D0–D1, E into E0–E3 – and inserting split/join actors balances the loads to roughly 141.5–184.5 across P0–P7.)

24 Partition Refinement [dynamic 1]
- Available resources at runtime can be more limited than the resources in the static target architecture.
- Partition refinement tunes the actor-to-processor mapping for the active configuration.
- A greedy iterative algorithm is used to achieve this goal.

25 Partition Refinement Example
- Pick the processors with the greatest number of actors.
- Sort those actors.
- Find the processor with maximum work.
- Assign minimum-work actors until the threshold is reached.
(Example: starting from per-processor loads of roughly 140–184.5 on P0–P7, shrinking to fewer processors redistributes B, C0–C3, E2, and the split/join actors, ending with loads of roughly 270.5–289 on P1, P3, P4, and P5.)
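A simplified version of the greedy step, assuming the refinement boils down to re-assigning actors stranded on removed processors, heaviest first, onto the least-loaded processor still available (the data in the usage below is illustrative, not the slide's):

```c
/* Greedy partition refinement sketch: actors on processors that are
 * no longer available are moved, heaviest first, to the least-loaded
 * available processor. A simplification of Flextream's heuristic. */
void refine_partition(int n_actors, const int work[], int pe[],
                      int n_pes, const int avail[]) {
    int load[16] = {0};               /* assumes n_pes <= 16 */
    for (int a = 0; a < n_actors; a++)
        if (avail[pe[a]])
            load[pe[a]] += work[a];   /* loads of surviving PEs */
    for (;;) {
        /* heaviest actor stranded on an unavailable processor */
        int moved = -1;
        for (int a = 0; a < n_actors; a++)
            if (!avail[pe[a]] && (moved < 0 || work[a] > work[moved]))
                moved = a;
        if (moved < 0)
            break;                    /* every actor is placed */
        /* least-loaded available processor */
        int tgt = -1;
        for (int j = 0; j < n_pes; j++)
            if (avail[j] && (tgt < 0 || load[j] < load[tgt]))
                tgt = j;
        pe[moved] = tgt;
        load[tgt] += work[moved];
    }
}
```

Being greedy and local, this runs in time proportional to the number of moved actors, which is what makes it cheap enough to invoke at a runtime resource change.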

26 Stage Assignment [dynamic 2]
- Processor assignment only specifies how actors are distributed across processors, not how they overlap in time.
- Stage assignment finds how actors are overlapped in time.
- The relative start times of the actors are based on stage numbers.
- DMA operations get a separate stage.

27 Stage Assignment Example
(Figure: the refined actor-to-processor mapping annotated with stage numbers – the actors run at even stages 0 through 18, with DMA operations occupying the stages between each producer and its consumer.)

28 Performance Comparison

29 Overhead Comparison

30 Flextream Conclusions
- Static scheduling approaches are promising, but not enough.
- Dynamic adaptation is necessary for future systems.
- Flextream provides a hybrid static/dynamic approach to improve efficiency.

