
Stream Compilation for Real-time Embedded Systems
Yoonseo Choi, Yuan Lin, Nathan Chong§, Scott Mahlke, and Trevor Mudge
University of Michigan, Electrical Engineering and Computer Science; §ARM Ltd.

Stream Programming

Programming style
– Embedded domain: audio/video (H.264), wireless (WCDMA)
– Mainstream: continuous query processing, search
Stream
– Collection of data records
Kernels/Filters
– Functions applied to streams
– Input/output are streams
– Coarse-grain dataflow
– Amenable to aggressive compiler optimizations [ASPLOS'02, '06, PLDI '03]

Compiling Stream Programs

[Figure: a compiler maps a stream program onto a multicore system (Core 1 to Core 4, each with its own memory)]

Coarse-grain software pipelining [PLDI'08]
– Equal work distribution
– Communication/computation overlap
– Assumed an infinite amount of local memory
Local storage constraints
– Spilling to main memory, infeasible solution
Latency constraints
– Often found in stream programs

Target Architecture

Target: Cell processor
– Cores with disjoint address spaces
– Explicit copy to access remote data
– DMA engine independent of PEs

[Figure: Cell processor with SPE0 to SPE7, each an SPU with a 256 KB LS and an MFC (DMA), connected through the EIB to the PPE (PowerPC) and DRAM]

Outline

– Review: stream graph modular scheduling
– Memory-aware stream graph scheduling
– Latency-aware stream graph scheduling
– Experimental results

Processor Assignment: Maximizing Throughput

Assigns each filter to a processor
– Formulated over all filters i = 1, …, N and all PEs j = 1, …, P
– Objective: minimize II
– Partition problem: NP-hard

Example with four processing elements (PE0 to PE3)
– All filters on PE0: T1 = 170
– Filters spread over four PEs: minimum II = 50, so T2 = 50
– Balanced workload, maximum throughput: T1/T2 = 3.4

[Figure: stream graph with filters A to F annotated with workloads (W: 20, W: 30, W: 50, W: 30) and its mapping onto PE0 to PE3]
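The balancing step can be approximated with a simple heuristic. The sketch below is not the formulation used on the slide (which minimizes II exactly); it is a greedy longest-processing-time placement, shown only to make the relationship between per-PE load and II concrete. The function name `assign_filters_lpt`, the `MAX_PES` limit, and the workload vector used in the usage note are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PES 16  /* illustrative limit, not taken from the slides */

/* Greedy longest-processing-time heuristic (a stand-in for the exact
 * partitioning): repeatedly take the heaviest unplaced filter and put it
 * on the least-loaded PE. Returns the resulting II, i.e. the largest
 * per-PE load; pe_of[i] receives the PE chosen for filter i. */
int assign_filters_lpt(const int *work, size_t n_filters,
                       size_t n_pes, int *pe_of)
{
    int load[MAX_PES] = {0};
    int ii = 0;

    assert(n_pes > 0 && n_pes <= MAX_PES);

    for (size_t i = 0; i < n_filters; i++)
        pe_of[i] = -1;                      /* -1 marks "not yet placed" */

    for (size_t placed = 0; placed < n_filters; placed++) {
        size_t best = n_filters;            /* heaviest unplaced filter */
        size_t target = 0;                  /* least-loaded PE */

        for (size_t i = 0; i < n_filters; i++)
            if (pe_of[i] < 0 && (best == n_filters || work[i] > work[best]))
                best = i;

        for (size_t p = 1; p < n_pes; p++)
            if (load[p] < load[target])
                target = p;

        pe_of[best] = (int)target;
        load[target] += work[best];
    }

    for (size_t p = 0; p < n_pes; p++)
        if (load[p] > ii)
            ii = load[p];
    return ii;
}
```

Assuming workloads of 20, 20, 20, 30, 30, 50 for A to F (which sum to the T1 = 170 quoted above), this heuristic reaches II = 50 on four PEs and II = 170 on one. A greedy placement is not guaranteed to find the minimum in general, which is why the exact problem is treated as NP-hard.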

Forming Pipelines: Stage Assignment

Assigns each filter to a pipeline stage
Producer-consumer dependence
– Filters i and j on the same PE: $S_j \ge S_i$
Communication-computation overlap
– Filters i and j on different PEs, connected by a DMA: $S_{DMA} > S_i$ and $S_j = S_{DMA} + 1$
Stages are assigned by traversing the graph in dataflow order

[Figure: stream graph A to F with DMA transfers placed in stages S:0 to S:8, and the resulting software pipeline on PE0 to PE3 showing the prologue, the steady state that repeats every II, and the epilogue]
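The two stage rules can be sketched as a single pass over the graph in dataflow order. This is a minimal illustration, assuming filters are numbered topologically (every edge goes from a lower to a higher index) and that each DMA is placed in the earliest legal stage, $S_i + 1$; `struct edge` and `assign_stages` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct edge { int src; int dst; };  /* producer -> consumer */

/* Earliest-stage assignment. Same PE: S_j >= S_i. Different PEs: the DMA
 * sits in stage S_i + 1 and the consumer in S_DMA + 1 = S_i + 2.
 * Assumes filter indices are already in dataflow (topological) order. */
void assign_stages(const struct edge *edges, size_t n_edges,
                   const int *pe_of, size_t n_filters, int *stage)
{
    for (size_t i = 0; i < n_filters; i++)
        stage[i] = 0;

    for (size_t i = 0; i < n_filters; i++) {
        for (size_t e = 0; e < n_edges; e++) {
            int j, need;

            if ((size_t)edges[e].src != i)
                continue;
            j = edges[e].dst;
            assert(edges[e].src < j);       /* topological numbering */

            need = (pe_of[i] == pe_of[j]) ? stage[i] : stage[i] + 2;
            if (need > stage[j])
                stage[j] = need;
        }
    }
}
```

For a diamond-shaped graph 0→1, 0→2, 1→3, 2→3 with every filter on its own PE, each hop costs two stages (one for the DMA, one for the consumer), so the stages come out as 0, 2, 2, 4.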

Excess Buffer Requirements

Multiple buffering
– Filters i and j on the same PE: $S_j - S_i + 1$ buffer copies
– Filters i and j on different PEs: $S_{DMA} - S_i + 1$ copies on the producer's PE and $S_j - S_{DMA} + 1$ copies on the consumer's PE

Example (II = 50, LS size: 14)
– Local storage required on LS0 to LS3: 12, 18, 8, 2
– Maximum throughput, but not feasible: the requirement of 18 exceeds the LS size
– Infeasible schedule

[Figure: stream graph A to F with stages S:0 to S:8, per-edge buffer counts between 1 and 3, and the mapping onto PE0 to PE3]
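The buffer-count rules translate directly into a per-PE local storage estimate. The sketch below assumes each edge carries one buffer of a known size and that the DMA stage is already fixed; `struct stream_edge`, `ls_usage`, and the field names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct stream_edge {
    int src, dst;    /* producer and consumer filter indices */
    int dma_stage;   /* stage of the DMA; ignored when both filters share a PE */
    int size;        /* size of one buffer copy, in the same units as the LS size */
};

/* Accumulates the local storage needed on each PE under multiple buffering:
 *   same PE:       (S_j - S_i + 1) copies
 *   different PEs: (S_DMA - S_i + 1) copies at the producer,
 *                  (S_j - S_DMA + 1) copies at the consumer. */
void ls_usage(const struct stream_edge *edges, size_t n_edges,
              const int *pe_of, const int *stage,
              int *ls_used, size_t n_pes)
{
    for (size_t p = 0; p < n_pes; p++)
        ls_used[p] = 0;

    for (size_t e = 0; e < n_edges; e++) {
        int i = edges[e].src;
        int j = edges[e].dst;

        assert((size_t)pe_of[i] < n_pes && (size_t)pe_of[j] < n_pes);

        if (pe_of[i] == pe_of[j]) {
            ls_used[pe_of[i]] += (stage[j] - stage[i] + 1) * edges[e].size;
        } else {
            int d = edges[e].dma_stage;
            ls_used[pe_of[i]] += (d - stage[i] + 1) * edges[e].size;
            ls_used[pe_of[j]] += (stage[j] - d + 1) * edges[e].size;
        }
    }
}
```

A schedule is feasible only if every entry of `ls_used` stays within the LS size, which is the check that fails in the example on this slide.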

Memory-aware Stream Graph Scheduling

Previous approach
– Processor assignment for balancing workloads
– Stage assignment for handling data dependences
– Only considers balancing workloads over PEs, without considering the limited local storage per PE

Memory-aware approach
– Buffer requirement estimation using conservative stage assignment (polynomial); produces the memory requirement
– Processor assignment under memory constraints (NP-hard); produces the best-so-far processor assignment
– Stage optimization for reducing buffers/DMAs and stages (polynomial); produces the scheduling result

Properties
– Phased approach for solving each step optimally
– Maximizes the usage of limited local storage
– Attempts to find more solutions without degrading performance

Buffer Usage Estimation Using Conservative Stage Assignment

– Compute buffer requirements for a filter: $S_j - S_i + 1$
– Variable: assignment of filter i to PE j
– Constraint input: buffer usage of filter i
– Formulated over all filters i = 1, …, N and all PEs j = 1, …, P; objective: minimize II
– Maximize throughput under memory constraints!

[Figure: the given stream graph A to F (W: 20, W: 30, W: 50, W: 30) and its conservative stage assignment, with filters and DMAs in stages S:0 to S:8 and buffer counts 1, 2, 3]
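The formulas on this slide did not survive extraction. One plausible way to write the memory-constrained assignment, consistent with the quantifiers and objective listed above, is sketched below. All symbols are assumptions introduced here: $a_{ij}$ is 1 when filter $i$ is placed on PE $j$, $w_i$ is the workload of filter $i$, $b_i$ is its buffer usage from the conservative stage assignment, and $M_{LS}$ is the local storage size.

```latex
\begin{align*}
\text{minimize}\quad   & II \\
\text{subject to}\quad & \sum_{j=1}^{P} a_{ij} = 1            && i = 1,\dots,N \\
                       & \sum_{i=1}^{N} w_i\, a_{ij} \le II     && j = 1,\dots,P \\
                       & \sum_{i=1}^{N} b_i\, a_{ij} \le M_{LS} && j = 1,\dots,P \\
                       & a_{ij} \in \{0, 1\}
\end{align*}
```

Dropping the third constraint gives the throughput-only assignment; keeping it rules out placements whose buffers overflow a PE's local storage.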

Memory-aware Processor Assignment

– Starts with the same filter workloads, the same local storage size, and the same processors
– Considers the buffer requirements per filter that will be allocated to the local storage of the assigned processor
– Generates a different processor assignment that fits into the LS

Result (LS size: 14)
– Local storage required on LS0 to LS3: 14, 8, 8, 10
– Minimum II: 50
– Maximum throughput, fitting into LS!

[Figure: conservative stage assignment of the stream graph A to F (stages S:0 to S:8) and the resulting processor assignment onto PE0 to PE3]

Reducing Overheads: Stage Optimization

Always minimizes buffers/DMAs/stages from the given schedule

[Figure: the stream graph A to F with stages S:0 to S:8 shown three times, as initial stages, earliest stages, and optimized stages; arrows mark stage numbers that decreased or increased]

Latency-aware Stream Graph Scheduling

– Does not always maximize throughput
– Achieves the throughput that matches the given latencies
– Generates a schedule that satisfies the latency constraints using the fewest PEs

Flow: latency constraints → calculate the target throughput → processor assignment for achieving the target throughput

Latency Constraints within a Stream Graph

LAT = {lat(A, C), lat(B, E)}
lat(A, C) = start(C) – completion(A)

In the pipelined schedule, the latency from filter i to filter j is $(S_j - S_i + 1) \times II$
– lat(A, C): $(2 - 0 + 1) \times II$
– lat(B, E): $(6 - 2 + 1) \times II$

[Figure: stream graph A to F with stages S:0 to S:8, and a timeline of pipelined iterations A0 to A9, B0 to B7, C0 to C7, E0 to E3]

Latency-aware Stream Graph Scheduling

Calculate the target throughput
– Conservative stage assignment to make $(S_j - S_i + 1)$ a constant value
– Calculates $\alpha$, where $II \le \alpha$ and $\alpha = \min\left(\mathrm{lat}(i, j) / (S_j - S_i + 1)\right)$
Processor assignment
– Minimize the number of PEs while achieving $\alpha$ (bin-packing)

Example
– $(s(C) - s(A) + 1) \times II = (2 - 0 + 1) \times II \le \mathrm{lat}(A, C) = 300$
– $(s(E) - s(B) + 1) \times II = (6 - 2 + 1) \times II \le \mathrm{lat}(B, E) = 450$
– $\alpha = \min(300 / 3, 450 / 5) = 90$
– Workloads A: 20, B: 20, C: 20, D: 30, E: 30, F: 50 packed under 90: two PEs!
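Both steps can be sketched in a few lines: compute $\alpha$ from the latency constraints, then pack filter workloads into PEs of capacity $\alpha$. The packing below uses first-fit decreasing, which is a common bin-packing heuristic and an assumption here, not necessarily the method behind the slide. `struct latency_constraint`, `target_ii`, `min_pes_ffd`, and `MAX_FILTERS` are hypothetical names; the usage note assumes the six workloads sum to 170, with 30 for filter D.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FILTERS 64  /* illustrative limit, not taken from the slides */

struct latency_constraint {
    int src_stage;    /* S_i under the conservative stage assignment */
    int dst_stage;    /* S_j */
    int max_latency;  /* lat(i, j), in the same time units as II */
};

/* alpha = min over constraints of lat(i, j) / (S_j - S_i + 1).
 * Integer division rounds down, which keeps II <= alpha safe. */
int target_ii(const struct latency_constraint *lat, size_t n)
{
    int alpha = 0;

    assert(n > 0);
    for (size_t k = 0; k < n; k++) {
        int span = lat[k].dst_stage - lat[k].src_stage + 1;
        int bound;

        assert(span > 0);
        bound = lat[k].max_latency / span;
        if (k == 0 || bound < alpha)
            alpha = bound;
    }
    return alpha;
}

/* First-fit decreasing bin packing: number of PEs needed so that no PE
 * carries more than `capacity` work. Returns -1 if a single filter is
 * larger than the capacity. A heuristic, so not always the true minimum. */
int min_pes_ffd(const int *work, size_t n, int capacity)
{
    int sorted[MAX_FILTERS];
    int load[MAX_FILTERS];
    size_t n_bins = 0;

    assert(n <= MAX_FILTERS);

    for (size_t i = 0; i < n; i++) {        /* insertion sort, descending */
        size_t k = i;
        while (k > 0 && sorted[k - 1] < work[i]) {
            sorted[k] = sorted[k - 1];
            k--;
        }
        sorted[k] = work[i];
    }

    for (size_t i = 0; i < n; i++) {
        size_t b = 0;

        if (sorted[i] > capacity)
            return -1;
        while (b < n_bins && load[b] + sorted[i] > capacity)
            b++;                            /* first PE with enough room */
        if (b == n_bins)
            load[n_bins++] = 0;             /* open a new PE */
        load[b] += sorted[i];
    }
    return (int)n_bins;
}
```

With the constraints (stages 0 to 2, latency 300) and (stages 2 to 6, latency 450), `target_ii` gives 90. Packing workloads 20, 20, 20, 30, 30, 50 under 90 takes two PEs, and packing them under 50 takes four.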

Bounds on the Number of PEs

– $LB_{PE}$: solution from latency-aware scheduling
– $II_{best}$: best possible II, the largest workload among all filters
– $UB_{PE}$: solution from latency-aware scheduling when $\alpha$ is substituted by $II_{best}$
– $LB_{PE} \le \mathrm{num}(PE) \le UB_{PE}$

Example
– $\alpha = 90$: two PEs (PE0, PE1)
– $II_{best} = 50$: four PEs (PE0 to PE3)
– $2 \le \mathrm{num}(PE) \le 4$

Design Space Exploration: Memory and Latency

Inputs
– Maximum workload
– Timing constraints
– Memory constraints

Flow
1. Calculate $UB_{pe}$ and $LB_{pe}$
2. $UB_{pe} = \min(UB_{pe}, Available_{pe})$
3. If $UB_{pe} < LB_{pe}$: no feasible solution
4. $P = LB_{pe}$
5. Run memory-aware scheduling with P PEs; if a solution exists: solution found
6. Otherwise $P = P + 1$; if $P \le UB_{pe}$ go back to step 5, else: no feasible solution
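The exploration flow is a bounded linear search over the PE count. The sketch below treats the memory-aware scheduler as an opaque callback, since its internals are not shown on this slide; `schedule_fn`, `explore_pe_count`, and the stand-in `fits_when_enough_pes` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Callback standing in for memory-aware scheduling: returns nonzero when a
 * schedule that fits the local storage exists for n_pes PEs. */
typedef int (*schedule_fn)(int n_pes, void *ctx);

/* Returns the smallest feasible PE count in [lb_pe, min(ub_pe, available_pe)],
 * or -1 when there is no feasible solution. */
int explore_pe_count(int lb_pe, int ub_pe, int available_pe,
                     schedule_fn memory_aware_schedule, void *ctx)
{
    assert(memory_aware_schedule != NULL);

    if (available_pe < ub_pe)
        ub_pe = available_pe;       /* UB_pe = min(UB_pe, Available_pe) */
    if (ub_pe < lb_pe)
        return -1;                  /* latency bound cannot be met */

    for (int p = lb_pe; p <= ub_pe; p++)
        if (memory_aware_schedule(p, ctx))
            return p;               /* solution found */
    return -1;                      /* memory constraints never satisfied */
}

/* Toy stand-in scheduler: pretends a schedule fits once at least *ctx PEs
 * are available. A real scheduler would run the phased approach instead. */
int fits_when_enough_pes(int n_pes, void *ctx)
{
    return n_pes >= *(const int *)ctx;
}
```

Starting at the lower bound means the first feasible schedule found also uses the fewest PEs that satisfy both the latency and the memory constraints.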

Experimental Results

Benchmarks: software defined radio protocols
– WCDMA: common 3G protocol
– DVB: digital media broadcasting protocol
– 4G: next generation wireless protocol
– 10 to 20 filters
Platform
– PS3: up to 6 SPEs
Software
– SPEX-C to C: SUIF
– IBM Cell SDK 3.0

Scalability of Memory-aware Scheduling

[Figure: speedup versus number of PEs (1 to 6) for 4G, DVB, and WCDMA, comparing the calculated II with the measured execution time; the panels are annotated Ub: 15, Ub: 4, and Ub: 5]

Sources of the gap between calculated and measured results
– Synchronization cost
– Unhidden communication cost
– Imbalanced task set: tiny workloads smaller than a DMA, centralized DMAs

Memory-aware Stream Graph Scheduling

Feasible schedules by LS size, one mark per PE count (# PE from 1 to 6):

LS size    4G        DVB       WCDMA
1M         ******    ******    ******
512K       *****     ******    ******
256K       ******    *****     ******
128K       +++       +++++     +++++
64K        +                   ++++
32K

– Found more feasible solutions!
– Achieved the same II in many cases!
– Sum of total data sizes: 4G: 200 KB, DVB: 133 KB, WCDMA: 90 KB

Conclusions

Coarse-grain software pipelining of stream programs considering
– Memory constraints
– Latency constraints
Performance summary
– Up to 50% more scheduling solutions
– Does not degrade the quality of the solutions
Future directions
– Modeling DMA costs, reducing synchronization costs
– Getting uniform workload

Thank you!

Input Language

Input language: stylized C

A kernel function:

    func_a(int* a, int* b) {
        int i, j;
        int dat;
        for (i = 0; i < counter; i++) {
            dat = b[i];
            dat = dat * dat + 10;
            a[i] = dat;
        }
    }

Main function:

    stream {  // enclosing the stream structure
        for (i = 0; i < 1000; i++) {
            func_a(aout, ain);
            func_b(bout, aout);
            func_c(cout, bout);
            func_d(ifout, cout);
            func_e(eout, ifout);
        }
    }

Parallelized Code on Cell

Function offloading: threads on the PPU send commands to the SPUs.

PPU (one thread per SPU):

    void thread() {
        for (…) {
            if (s[0]) {
                doDMA(..);
                blockingRead(..);
            }
            if (s[1]) {
                runfilter(..);
                blockingRead(..);
            }
            …
            barrier();
        }
    }

SPU:

    // kernel function definitions
    // kernel stub definitions
    // data buffer definitions
    …
    while (1) {
        switch (cmd) {
        case 'runFilter':
        case 'DMA':
        …
        // send ACK to PPE
        }
    }
