1 Static Translation of Stream Programming to a Parallel System
S. M. Farhad, PhD Student
Supervisor: Dr. Bernhard Scholz
Programming Language Group, School of Information Technology, University of Sydney

2 Uniprocessor Performance

3 Motivation
[Figure: number of cores per chip (1 to 512) plotted against year (1970 to 2005 and beyond, "20??"). Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045]

4 Motivation
For uniprocessors, C was:
–Portable
–High Performance
–Composable
–Malleable
–Maintainable
Uniprocessors: C is the common machine language
[Figure: number of cores per chip (1 to 512) plotted against year (1970 to "20??")]

5 Motivation
What is the common machine language for multicores?
[Figure: number of cores per chip (1 to 512) plotted against year (1970 to "20??")]

6 Common Machine Languages
Uniprocessors:
–Common properties: single flow of control; single memory image
–Differences: register file, ISA, functional units (handled by register allocation, instruction selection, instruction scheduling)
Multicores:
–Common properties: multiple flows of control; multiple local memories
–Differences: number and capabilities of cores; communication model; synchronization model
von-Neumann languages represent the common properties and abstract away the differences
Stream Programming Language is a common machine language for multicores

7 Properties of Stream Programs [W. Thies ‘02]
–A large (possibly infinite) amount of data
–Limited lifespan of each data item
–Little processing of each data item
–A regular, static computation pattern
–Stream program structure is relatively constant
–A lot of opportunities for compiler optimizations

8 Application of Streaming Programming

9 Model of Computation
Synchronous Dataflow [Lee ‘92]
–Graph of autonomous filters
–Communicate via FIFO channels
Static I/O rates [Edward ‘87]
–Compiler decides on an order of execution (schedule)
–Static estimation of computation
[Figure: example stream graph with filters AtoD, FMDemod, Scatter, LPF 1, LPF 2, LPF 3, HPF 1, HPF 2, HPF 3, Gather, Adder, Speaker]
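
Static I/O rates are what let a compiler fix the schedule ahead of time. The sketch below shows the standard synchronous-dataflow balance equations: for every channel, items pushed per steady state must equal items popped. It is a minimal illustration, not the compiler's actual implementation; the function name `repetition_vector`, the edge tuple layout, and the example rates are assumptions made here.

```python
from fractions import Fraction
from math import gcd, lcm  # multi-argument gcd/lcm need Python 3.9+


def repetition_vector(actors, edges):
    """Return the minimal firing count per actor for one steady state.

    edges is a list of (src, dst, push, pop): src pushes `push` items per
    firing and dst pops `pop` items per firing (hypothetical layout).
    """
    # Build an undirected view so rates can be propagated in both directions.
    adj = {a: [] for a in actors}
    for src, dst, push, pop in edges:
        adj[src].append((dst, Fraction(push, pop)))  # q_dst = q_src * push/pop
        adj[dst].append((src, Fraction(pop, push)))  # q_src = q_dst * pop/push

    rate = {actors[0]: Fraction(1)}
    stack = [actors[0]]
    while stack:
        a = stack.pop()
        for b, ratio in adj[a]:
            r = rate[a] * ratio
            if b not in rate:
                rate[b] = r
                stack.append(b)
            elif rate[b] != r:
                raise ValueError("inconsistent rates: no steady-state schedule")

    if len(rate) != len(actors):
        raise ValueError("graph is not connected")

    # Scale the rational rates to the smallest positive integers.
    scale = lcm(*(r.denominator for r in rate.values()))
    counts = {a: int(rate[a] * scale) for a in actors}
    g = gcd(*counts.values())
    return {a: c // g for a, c in counts.items()}


# Hypothetical three-filter pipeline: A pushes 2, B pops 3 and pushes 1, C pops 2.
print(repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)]))
```

With these example rates, A fires 3 times, B twice and C once per steady state, so every channel returns to its starting occupancy.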

10 StreamIt Language Overview [Thies ‘04]
StreamIt is a novel language for streaming
–Exposes parallelism and communication
–Architecture independent
–Modular and composable: simple structures are composed to create complex graphs
–Malleable: program behavior changes with small modifications
[Figure: StreamIt constructs: filter; pipeline; splitjoin (splitter, parallel computation, joiner); feedback loop (joiner, splitter). Each component may be any StreamIt language construct]

11 Mapping of Filters to Multicores
–Task Parallelism [Edward ‘87]
–Fine-Grained Data Parallelism [Michael ‘06]
–3-phase solution [Michael ‘06]
–Orchestrating the Execution of Stream Programs [Kudlur ‘08]

12 Baseline 1: Task Parallelism
Inherent task parallelism between two processing pipelines
Task Parallel Model:
–Only parallelize explicit task parallelism
–Fork/join parallelism
Execute this on a 2 core machine: ~2x speedup over single core
[Figure: Splitter feeding two parallel pipelines, each with BandPass, Compress, Process, Expand, BandStop; then Joiner and Adder]

13 Baseline 2: Fine-Grained Data Parallelism
Each of the filters in the example is stateless
Fine-grained Data Parallel Model:
–Fiss each stateless filter N ways (N is number of cores)
–Remove scatter/gather if possible
We can introduce data parallelism
–Example: 4 cores
Each fission group occupies the entire machine
[Figure: the two-pipeline graph (BandPass, Compress, Process, Expand, BandStop, Adder) with every filter replaced by a Splitter, its fissed copies, and a Joiner]
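
To make "fiss a stateless filter N ways" concrete, here is a minimal sketch assuming a filter that pops one item and pushes one item per firing. The names `fiss` and `work` are hypothetical, and the lanes run sequentially here; on a real machine each lane would execute on its own core.

```python
def fiss(work, items, n):
    """Apply a stateless one-in/one-out filter through n fissed copies.

    A round-robin splitter distributes the input, each copy processes its
    lane independently, and a round-robin joiner restores the original order.
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    # Round-robin splitter: item i goes to lane i % n.
    lanes = [items[i::n] for i in range(n)]

    # Each copy of the filter is independent because the filter keeps no state.
    outputs = [[work(x) for x in lane] for lane in lanes]

    # Round-robin joiner: take one item from each lane in turn.
    joined = []
    for k in range(max(len(out) for out in outputs)):
        for out in outputs:
            if k < len(out):
                joined.append(out[k])
    return joined


# The fissed version produces the same stream as the original filter.
data = list(range(10))
print(fiss(lambda x: x * 2, data, 4) == [x * 2 for x in data])
```

The equivalence only holds because the filter is stateless; a filter that carries state between firings would see its inputs in a different order in each lane.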

14 3-Phase Solution [Michael ‘06]
Target a 4 core machine
Data Parallel, but too little work!
[Figure: stream graph RectPolar; Splitter; AdaptDFT (two branches); Joiner; Splitter; two branches of Amplify, Diff, UnWrap, Accum; Joiner; PolarRect. Some filters are marked "Data Parallel". Numeric labels as extracted: 6 6 20 2 1 1 1 2 1 1 1]

15 Data Parallelize
Target a 4 core machine
[Figure: the same graph after fission, with RectPolar and PolarRect each wrapped in a Splitter/Joiner pair. Numeric labels as extracted: 66 20 2 1 1 1 2 1 1 1 5 5]

16 Data + Task Parallel Execution
Target 4 core machine
[Figure: schedule of the fissed graph on 4 cores (axes: Cores, Time); value shown on the time axis: 21. Numeric labels as extracted: 66 2 1 1 1 2 1 1 1 5 5]

17 Better Mapping
Target 4 core machine
[Figure: improved schedule on 4 cores (axes: Cores, Time); value shown on the time axis: 16. Numeric labels as extracted: 66 2 1 1 1 2 1 1 1 5 5]

18 Phase 3: Coarse-Grained Software Pipelining
New steady-state is free of dependencies
Schedule new steady-state using a greedy partitioning
[Figure: RectPolar graph split into a Prologue and a New Steady State]

19 Greedy Partitioning [Michael ‘06]
Target 4 core machine
[Figure: filters listed under "To Schedule:" are packed onto 4 cores (axes: Cores, Time); value shown on the time axis: 16]
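
The slide does not spell out the greedy heuristic. One common choice, assumed here purely for illustration, is longest-processing-time-first: sort filters by estimated work and always give the next filter to the least loaded core. The function name `greedy_partition` and the work estimates are hypothetical.

```python
import heapq


def greedy_partition(work, cores):
    """Assign filters to cores, heaviest first, onto the least loaded core.

    work maps filter name -> estimated work per steady state.
    Returns (assignment, makespan), where assignment maps core -> filters.
    """
    heap = [(0, c) for c in range(cores)]  # (current load, core id)
    heapq.heapify(heap)
    assignment = {c: [] for c in range(cores)}

    # Heaviest filters first; ties are broken by name for determinism.
    for name, w in sorted(work.items(), key=lambda kv: (-kv[1], kv[0])):
        load, core = heapq.heappop(heap)
        assignment[core].append(name)
        heapq.heappush(heap, (load + w, core))

    makespan = max(load for load, _ in heap)
    return assignment, makespan


# Hypothetical work estimates for six dependence-free filters.
work = {"f1": 8, "f2": 7, "f3": 6, "f4": 5, "f5": 4, "f6": 2}
assignment, makespan = greedy_partition(work, cores=4)
print(assignment, makespan)
```

This only works as a pure load-balancing problem because the software-pipelined steady state has no dependencies between filters within an iteration.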

20 Static Translation of Stream Programs [Proposal]
We study
–A mathematical model and algorithms to resolve bottlenecks in stream programs
–How to map actors of stream programs to processors in a parallel system
–How to compute a schedule for each processor
The goal is to statically optimize the throughput of a stream program, assuming constant input bandwidth

21 Research Question: Removing the bottleneck from the stream graph
Original stream graph (filters A, B, C, D): Filter B is the bottleneck
After removing the bottleneck (filters A, C, D, B, B́, with splitter S and joiner J): Filter B is duplicated

22 Research Method
–Perform a quantitative analysis that detects bottlenecks in the stream graph
–The bottleneck resolver duplicates actors that impose a bottleneck
–The process continues until the program is bottleneck free
–The actors are then mapped to processors via Integer Linear Programming
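
The detect-and-duplicate loop can be sketched as follows. The slides do not define "bottleneck free", so this sketch assumes one plausible criterion: an actor is a bottleneck while its per-copy load exceeds the ideal per-processor share (total work divided by the number of processors). It also assumes every actor is stateless and ignores splitter/joiner overhead. The name `resolve_bottlenecks` and the load values are hypothetical, and the ILP mapping step is not shown.

```python
from fractions import Fraction


def resolve_bottlenecks(load, processors):
    """Return how many copies of each actor are needed under the assumed rule.

    load maps actor name -> work per steady state. An actor is treated as a
    bottleneck while load / copies > total_load / processors (an assumption).
    """
    total = sum(load.values())
    copies = {actor: 1 for actor in load}

    while True:
        # Quantitative analysis: find the actor with the highest per-copy load.
        worst = max(load, key=lambda a: Fraction(load[a], copies[a]))
        # Integer form of: load[worst] / copies[worst] <= total / processors.
        if load[worst] * processors <= total * copies[worst]:
            return copies
        # Bottleneck resolver: duplicate the offending actor once more.
        copies[worst] += 1


# Hypothetical loads for a four-actor graph on 4 processors.
print(resolve_bottlenecks({"A": 3, "B": 8, "C": 4, "D": 5}, processors=4))
```

With these example loads only B exceeds the ideal share of 5, and a single duplicate brings its per-copy load down to 4, which mirrors the B and B́ picture in the research question.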

23 Plan
–Background study
–Research question
–Proposal
–Implementation
–Results
–Publication

24 Question?

