Static Translation of Stream Programming to a Parallel System. S. M. Farhad, PhD Student. Supervisor: Dr. Bernhard Scholz. Programming Language Group, School of Information Technology, University of Sydney.


Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School of Information Technology University of Sydney

Uniprocessor Performance

Motivation

[Chart: # of cores over time, from uniprocessors (Pentium, P2, P3, P4, Itanium) to multicores (Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, Pentium Extreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045)]

Motivation

Uniprocessors: C is the common machine language. For uniprocessors, C was:
– Portable
– High performance
– Composable
– Malleable
– Maintainable

[Chart of processors vs. # of cores, as on the previous slide]

Motivation

What is the common machine language for multicores?

[Chart of processors vs. # of cores, as on the previous slide]

Common Machine Languages

Uniprocessors:
– Common properties: single flow of control; single memory image
– Differences: register file, ISA, functional units — hidden by register allocation, instruction selection, instruction scheduling

Multicores:
– Common properties: multiple flows of control; multiple local memories
– Differences: number and capabilities of cores, communication model, synchronization model

Von Neumann languages represent the common properties and abstract away the differences. A stream programming language can serve as a common machine language for multicores.

Properties of Stream Programs [W. Thies '02]
– A large (possibly infinite) amount of data
– Limited lifespan of each data item
– Little processing of each data item
– A regular, static computation pattern: stream program structure is relatively constant
– Many opportunities for compiler optimizations

Applications of Stream Programming

Model of Computation

Synchronous Dataflow [Lee '92]:
– Graph of autonomous filters
– Communicate via FIFO channels

Static I/O rates [Lee '87]:
– Compiler decides on an order of execution (schedule)
– Static estimation of computation

[Example stream graph: AtoD → FMDemod → Scatter → band filters (LPF 1–3, HPF 1–3) → Gather → Adder → Speaker]
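Because I/O rates are static, the compiler can solve the SDF balance equations to find how many times each filter fires per steady-state iteration. The following is a minimal sketch (not from the talk) for a simple chain of filters; `chain_repetitions` is a hypothetical helper name.

```python
from math import gcd

def chain_repetitions(rates):
    """Repetition vector for a chain of SDF filters.

    rates[i] = (push, pop): filter i pushes `push` tokens per firing and
    its downstream neighbour pops `pop` tokens per firing.  Solves the
    balance equations r_i * push_i = r_{i+1} * pop_{i+1} in smallest
    positive integers.
    """
    reps = [1]
    for push, pop in rates:
        g = gcd(push, pop)
        num, den = push // g, pop // g          # r_next / r_cur = num / den
        scale = den // gcd(reps[-1], den)       # make r_cur divisible by den
        reps = [r * scale for r in reps]
        reps.append(reps[-1] * num // den)
    return reps
```

For a two-filter chain where the producer pushes 2 tokens and the consumer pops 3, the solver yields the familiar 3:2 firing ratio.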

StreamIt Language Overview [Thies '04]

StreamIt is a novel language for streaming:
– Exposes parallelism and communication
– Architecture independent
– Modular and composable: simple structures are composed to create complex graphs
– Malleable: program behavior can be changed with small modifications

Constructs: filter, pipeline, splitjoin (splitter/joiner), and feedback loop; each slot in a construct may hold any StreamIt language construct.
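To illustrate the composability described above, here is a small Python sketch (hypothetical helpers, not StreamIt itself) of pipeline and splitjoin composition over list-valued streams:

```python
def pipeline(*stages):
    """Compose filters sequentially: each stage's output feeds the next."""
    def run(items):
        for stage in stages:
            items = stage(items)
        return list(items)
    return run

def roundrobin_split(items, n):
    """Deal items round-robin into n sub-streams."""
    return [items[i::n] for i in range(n)]

def roundrobin_join(parts):
    """Interleave sub-streams back into one stream (equal lengths assumed)."""
    out = []
    for group in zip(*parts):
        out.extend(group)
    return out

def splitjoin(branches):
    """Round-robin split, run each branch on its slice, round-robin join.
    A branch may itself be a pipeline or another splitjoin (composability)."""
    def run(items):
        parts = roundrobin_split(list(items), len(branches))
        return roundrobin_join([list(b(p)) for b, p in zip(branches, parts)])
    return run
```

Because `pipeline` and `splitjoin` both take and return stream functions, arbitrary nesting mirrors the "may be any StreamIt language construct" property of the real language.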

Mapping of Filters to Multicores
– Task parallelism [Lee '87]
– Fine-grained data parallelism [Gordon '06]
– 3-phase solution [Gordon '06]
– Orchestrating the execution of stream programs [Kudlur '08]

Baseline 1: Task Parallelism

There is inherent task parallelism between the two processing pipelines (BandPass → Compress → Process → Expand → BandStop, one per branch under a splitter, merged by a joiner into an Adder).

Task parallel model:
– Only parallelize explicit task parallelism
– Fork/join parallelism

Executing this on a 2-core machine gives roughly 2x speedup over a single core.

Baseline 2: Fine-Grained Data Parallelism

Each of the filters in the example is stateless, so we can introduce data parallelism.

Fine-grained data parallel model:
– Fiss each stateless filter N ways (N is the number of cores)
– Remove scatter/gather where possible

Example with 4 cores: each fission group occupies the entire machine.

[Stream graph: each filter (BandPass, Compress, Process, Expand, BandStop, Adder) replicated under its own splitter/joiner pair]
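Fission of a stateless filter can be sketched as scatter/compute/gather: replica i handles every N-th item, and outputs are gathered back in stream order. A minimal illustration (hypothetical `fiss` helper, sequential simulation of the parallel replicas):

```python
def fiss(filter_fn, n):
    """Fiss a stateless per-item filter n ways.

    Replica i processes items at positions i, i+n, i+2n, ... (round-robin
    scatter); writing results back by index reproduces the gather, so the
    output order matches the unfissed filter exactly.
    """
    def fissed(items):
        items = list(items)
        out = [None] * len(items)
        for i in range(n):                      # one loop body per replica
            for j in range(i, len(items), n):   # this replica's slice
                out[j] = filter_fn(items[j])
        return out
    return fissed
```

Statelessness is what makes this legal: since each firing depends only on its own input item, the replicas need no synchronization beyond the scatter and gather.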

3-Phase Solution [Gordon '06]

Target: a 4-core machine.

[Stream graph: AdaptDFT → splitter → two Amplify/Diff/UnWrap/Accum pipelines → joiner → RectPolar → splitter/joiner stages → PolarRect]

The middle stages are data parallel, but there is too little work per filter!

Data Parallelize

Target: a 4-core machine. The stateless RectPolar filter is fissed under its own splitter/joiner, spreading its work across the cores.

Data + Task Parallel Execution

Target: a 4-core machine.

[Schedule diagram: cores vs. time for the combined data + task parallel mapping, with the fissed RectPolar stage occupying the whole machine; execution time: 21]

Better Mapping

Target: a 4-core machine.

[Schedule diagram: a better assignment of the same filters to cores shortens the schedule]

Phase 3: Coarse-Grained Software Pipelining

After a prologue, execution reaches a new steady state in which different pipeline stages (e.g., RectPolar) work on different iterations concurrently. The new steady state is free of dependencies, so it can be scheduled using a greedy partitioning.
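The prologue/steady-state structure can be made concrete with a small sketch (hypothetical helper, not the compiler's actual scheduler): if stage s of iteration i runs in time step i + s, then after a prologue of len(stages) - 1 steps every step runs all stages, each on a different iteration.

```python
def software_pipeline(stages, iterations):
    """Map (stage, iteration) pairs to time steps for a pipelined schedule.

    Stage s of iteration i executes in step i + s.  Steps before
    len(stages) - 1 form the prologue (pipeline filling); later steps are
    the dependence-free steady state where all stages run concurrently.
    """
    schedule = {}
    for i in range(iterations):
        for s, name in enumerate(stages):
            schedule.setdefault(i + s, []).append((name, i))
    return schedule
```

In the steady state each time step contains one firing of every stage, which is exactly the condition that lets the partitioner place stages independently.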

Greedy Partitioning [Gordon '06]

Target: a 4-core machine.

[Schedule diagram: the filters to schedule are packed onto 4 cores; resulting execution time: 16]
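A common form of greedy partitioning is longest-processing-time-first bin packing; the sketch below (an assumption about the flavor of greedy used, not the paper's exact algorithm) assigns each filter, heaviest first, to the currently least-loaded core:

```python
def greedy_partition(work, n_cores):
    """LPT-style greedy partitioning.

    work: {filter_name: work_estimate}.  Returns (assignment, makespan),
    where assignment maps each filter to a core and makespan is the
    maximum per-core load, i.e. the resulting steady-state period.
    """
    loads = [0] * n_cores
    assignment = {}
    for name, w in sorted(work.items(), key=lambda kv: -kv[1]):
        core = loads.index(min(loads))   # least-loaded core so far
        assignment[name] = core
        loads[core] += w
    return assignment, max(loads)
```

Because the steady state is dependence-free after software pipelining, minimizing the maximum core load directly minimizes the period and hence maximizes throughput.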

Static Translation of Stream Programs [Proposal]

We study:
– A mathematical model and algorithms to resolve bottlenecks in stream programs
– Mapping actors of stream programs to processors in a parallel system
– Computing a schedule for each processor

The goal is to statically optimize the throughput of a stream program, assuming constant input bandwidth.

Research Question: Removing the Bottleneck from the Stream Graph

[Original stream graph: filters A, B, C, D, where filter B is the bottleneck. After removing the bottleneck: B is duplicated into B and B́ behind a splitter S and joiner J, so the two copies share B's load.]
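One way to decide how far to duplicate a bottleneck filter is to compare each filter's work against the ideal per-core share of total work. The sketch below (a hypothetical bottleneck test, not the proposal's actual model) computes how many copies of each filter would be needed so that no single replica exceeds that share:

```python
import math

def replica_counts(work, n_cores):
    """Replicas needed per filter so no copy exceeds the ideal share.

    work: {filter_name: work_per_steady_state_iteration}.  The ideal
    per-core share is total_work / n_cores; a filter whose work exceeds
    it is a bottleneck and gets ceil(work / share) copies.
    """
    share = sum(work.values()) / n_cores
    return {name: max(1, math.ceil(w / share)) for name, w in work.items()}
```

On a graph where one filter carries most of the work, only that filter is duplicated, which matches the splitter/joiner transformation shown on the slide.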

Research Method
– Perform a quantitative analysis that detects bottlenecks in the stream graph.
– The bottleneck resolver duplicates actors that impose a bottleneck; the process continues until the program is bottleneck free.
– The actors are then mapped to processors via Integer Linear Programming.
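The ILP's objective — minimize the maximum per-processor load over all actor-to-processor assignments — can be illustrated with a tiny exhaustive search (a stand-in for an ILP solver, feasible only for small graphs; names are hypothetical):

```python
from itertools import product

def optimal_mapping(work, n_cores):
    """Brute-force stand-in for the ILP mapping step.

    Enumerates every actor-to-core assignment and keeps the one with the
    smallest makespan (maximum core load).  An ILP solver finds the same
    optimum without enumeration for realistic graph sizes.
    """
    names = list(work)
    best, best_cost = None, float("inf")
    for combo in product(range(n_cores), repeat=len(names)):
        loads = [0] * n_cores
        for name, core in zip(names, combo):
            loads[core] += work[name]
        cost = max(loads)
        if cost < best_cost:
            best, best_cost = dict(zip(names, combo)), cost
    return best, best_cost
```

Comparing this optimum against the greedy partitioner's makespan quantifies how much throughput the exact mapping recovers.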

Plan
– Background study
– Research question
– Proposal
– Implementation
– Results
– Publication

Questions?