Slide 1: Stream Compilation for Real-time Embedded Systems
Yoonseo Choi, Yuan Lin, Nathan Chong§, Scott Mahlke, and Trevor Mudge
University of Michigan (Electrical Engineering and Computer Science); §ARM Ltd.

Slide 2: Stream Programming
Programming style
– Embedded domain: audio/video (H.264), wireless (WCDMA)
– Mainstream: continuous query processing, search
Stream
– Collection of data records
Kernels/Filters
– Functions applied to streams
– Inputs and outputs are streams
– Coarse-grain dataflow
– Amenable to aggressive compiler optimizations [ASPLOS '02, '06; PLDI '03]

Slide 3: Compiling Stream Programs
The compiler maps a stream program onto a multicore system (cores with local memories).
Coarse-grain software pipelining [PLDI '08]
– Equal work distribution
– Communication/computation overlap
– Assumed an infinite amount of local memory
Two constraints the prior work ignores:
– Local storage constraints: spilling to main memory leads to an infeasible solution
– Latency constraints: often found in stream programs

Slide 4: Target Architecture
Target: the Cell processor
– Cores with disjoint address spaces
– Explicit copy to access remote data
– DMA engines independent of the PEs
Organization: eight SPEs (SPE0..SPE7), each with an SPU, a 256 KB local store (LS), and an MFC (DMA engine), connected by the EIB to the PPE (PowerPC) and DRAM.

Slide 5: Outline
– Review: stream graph modulo scheduling
– Memory-aware stream graph scheduling
– Latency-aware stream graph scheduling
– Experimental results

Slide 6: Processor Assignment: Maximizing Throughput
Assigns each filter to a processor: for all filters i = 1, …, N and all PEs j = 1, …, P, minimize II.
Example: filters A–F (W denotes workload: A:20, B:20, C:20, D:30, E:30, F:50) mapped onto four processing elements (PE0–PE3).
– Sequential execution time T1 = 170; pipelined initiation interval T2 = 50, so T1/T2 = 3.4
– A balanced workload gives the minimum II of 50, i.e., maximum throughput
– The underlying partition problem is NP-hard

Slide 7: Forming Pipelines: Stage Assignment
Assigns each filter to a pipeline stage, traversing the graph in dataflow order:
– Producer-consumer dependence on the same PE: S_j ≥ S_i
– Producer i on PE1, consumer j on PE2: the DMA gets its own stage, with S_DMA > S_i and S_j = S_DMA + 1
Separating communication into DMA stages gives communication-computation overlap.
In the running example (II = 50), filters A–F and their DMAs occupy stages S:0 through S:8; the resulting schedule consists of a prologue, a steady state repeating every II cycles, and an epilogue.

Slide 8: Excess Buffer Requirements
With multiple buffering, a producer-consumer pair i → j needs S_j − S_i + 1 buffers for the stream between them; when a DMA sits between two PEs, the producer side needs S_DMA − S_i + 1 buffers and the consumer side S_j − S_DMA + 1.
In the running example (II = 50, stages S:0–S:8), the per-PE buffer requirements exceed the local store size (LS size: 14 units): the schedule achieves maximum throughput but is not feasible.

Slide 9: Memory-aware Stream Graph Scheduling
Previous approach: processor assignment balances workloads over the PEs and stage assignment handles data dependences, but the limited local storage per PE is not considered.
This work uses a phased approach, solving each step optimally:
1. Buffer requirement estimation using a conservative stage assignment (polynomial)
2. Processor assignment under memory constraints (NP-hard), keeping the best assignment found so far
3. Stage optimization to reduce buffers, DMAs, and stages (polynomial)
It maximizes the usage of the limited local storage and finds more feasible solutions without degrading performance.

Slide 10: Buffer Usage Estimation Using Conservative Stage Assignment
Given the stream graph (filters A–F with their workloads), a conservative stage assignment places a DMA stage on every edge. This makes each filter's stage, and hence its buffer requirement (S_j − S_i + 1), computable before any processor assignment exists.
With the buffer usage of each filter i known, the assignment variables (filter i to PE j) can maximize throughput under memory constraints: for all filters i = 1, …, N and all PEs j = 1, …, P, minimize II subject to the per-PE buffer usage fitting in the local store.

Slide 11: Memory-aware Processor Assignment
Starting from the same filter workloads, the same local store sizes, and the same processors, the assignment now accounts for the buffer requirement of each filter, which must be allocated in the local store of its assigned processor.
This generates a different processor assignment that fits into the local stores (LS size: 14 units) while still achieving the minimum II of 50: maximum throughput, and feasible.

Slide 12: Reducing Overheads: Stage Optimization
Starting from the initial stage assignment (stages S:0–S:8), each filter is moved to its earliest legal stage. Shortening a producer-consumer stage span decreases that edge's buffering, so this step always reduces (never increases) the buffers, DMAs, and stages of the given schedule.

Slide 13: Latency-aware Stream Graph Scheduling
Latency-aware scheduling does not always maximize throughput; instead it achieves the throughput that matches the given latency constraints, and generates a schedule that satisfies them using the least number of PEs:
1. From the latency constraints, calculate the target throughput.
2. Perform processor assignment to achieve the target throughput.

Slide 14: Latency Constraints within a Stream Graph
Latency constraints are given between pairs of filters, e.g., LAT = {lat(A, C), lat(B, E)}.
For a software-pipelined schedule, lat(i, j) = start(j) − completion(i) = (S_j − S_i + 1) × II.
In the example, A is at stage 0 and C at stage 2, so lat(A, C) spans (2 − 0 + 1) × II; B is at stage 2 and E at stage 6, so lat(B, E) spans (6 − 2 + 1) × II.

Slide 15: Latency-aware Stream Graph Scheduling
Calculate the target throughput:
– A conservative stage assignment makes each span (S_j − S_i + 1) a constant.
– Calculate α = min over constraints of lat(i, j) / (S_j − S_i + 1); any schedule with II ≤ α meets all latency constraints.
Processor assignment:
– Minimize the number of PEs while achieving α (a bin-packing problem).
Example: with lat(A, C) = 300 and lat(B, E) = 450,
  (s(C) − s(A) + 1) × II = (2 − 0 + 1) × II ≤ 300
  (s(E) − s(B) + 1) × II = (6 − 2 + 1) × II ≤ 450
so α = min(300 / 3, 450 / 5) = 90. Packing the workloads (A:20, B:20, C:20, D:30, E:30, F:50) into bins of capacity 90 needs only two PEs.

Slide 16: Bounds on the Number of PEs
LB_PE ≤ num(PE) ≤ UB_PE
– LB_PE: the solution from latency-aware scheduling (here α = 90 gives two PEs)
– II_best: the best possible II, equal to the largest workload among all filters (here II_best = 50)
– UB_PE: the solution from latency-aware scheduling when α is substituted by II_best (here four PEs)
For the example workloads (A:20, B:20, C:20, D:30, E:30, F:50): 2 ≤ num(PE) ≤ 4.

Slide 17: Design Space Exploration: Memory and Latency
Inputs: the maximum workload, the timing constraints, and the memory constraints.
1. Calculate UB_PE and LB_PE; set UB_PE = min(UB_PE, available PEs).
2. If UB_PE < LB_PE, no feasible solution exists.
3. Otherwise, for P = LB_PE, LB_PE + 1, …, UB_PE: run memory-aware scheduling on P PEs; if a solution exists, stop (solution found).
4. If P exceeds UB_PE without a solution, no feasible solution exists.

Slide 18: Experimental Results
Benchmarks: software defined radio protocols, each with 10 to 20 filters
– WCDMA: common 3G protocol
– DVB: digital media broadcasting protocol
– 4G: next-generation wireless protocol
Platform: PS3, up to 6 SPEs
Software: SPEX-C-to-C translation with SUIF; IBM Cell SDK 3.0

Slide 19: Scalability of Memory-aware Scheduling
Speedup versus the number of PEs for 4G, DVB, and WCDMA, comparing the calculated II against the measured execution time (speedup upper bounds: 4G: 15, DVB: 4, WCDMA: 5).
The measured times fall short of the calculated II because of:
– Synchronization cost
– Unhidden communication cost
– Imbalanced task sets (tiny workloads smaller than a DMA) and centralized DMAs

Slide 20: Memory-aware Stream Graph Scheduling
For each benchmark, scheduling was attempted across local store sizes from 32 KB to 1 MB and varying numbers of PEs (sum of total data sizes: 4G: 200 KB, DVB: 133 KB, WCDMA: 90 KB).
The memory-aware scheduler found more feasible solutions than the baseline, particularly at the smaller local store sizes, and achieved the same II in many cases.

Slide 21: Conclusions
Coarse-grain software pipelining of stream programs considering
– memory constraints
– latency constraints
Performance summary
– Up to 50% more scheduling solutions
– Does not degrade the quality of the solutions
Future directions
– Modeling DMA costs, reducing synchronization costs
– Achieving uniform workloads

Slide 22: Thank you!

Slide 23: Input Language
The input language is stylized C.

A kernel function:

  func_a(int* a, int* b) {
      int i;
      int dat;
      for (i = 0; i < counter; i++) {
          dat = b[i];
          dat = dat * dat + 10;
          a[i] = dat;
      }
  }

The main function, with the stream structure enclosed in a stream block:

  stream {  // enclosing the stream structure
      for (i = 0; i < 1000; i++) {
          func_a(aout, ain);
          func_b(bout, aout);
          func_c(cout, bout);
          func_d(ifout, cout);
          func_e(eout, ifout);
      }
  }

Slide 24: Parallelized Code on Cell
Function offloading: the PPU sends commands to each SPU, which runs a command loop over its resident kernels.

SPU-resident command loop:

  // kernel function definitions
  // kernel stub definitions
  // data buffer definitions
  ...
  while (1) {
      switch (cmd) {
      case 'runFilter': ...
      case 'DMA': ...
      ...
      }
      // send ACK to PPE
  }

Per-stage driver thread:

  void thread() {
      for (...) {
          if (s[0]) { doDMA(..); blockingRead(..); }
          if (s[1]) { runfilter(..); blockingRead(..); }
          ...
          barrier();
      }
  }