StreamIt on Raw StreamIt Group: Michael Gordon, William Thies, Michal Karczmarek, David Maze, Jasper Lin, Jeremy Wong, Andrew Lamb, Ali S. Meli, Chris.


StreamIt on Raw StreamIt Group: Michael Gordon, William Thies, Michal Karczmarek, David Maze, Jasper Lin, Jeremy Wong, Andrew Lamb, Ali S. Meli, Chris Leger, Sam Larsen, and Saman Amarasinghe MIT Laboratory for Computer Science MIT Computer Architecture Workshop September 19, 2002

Von Neumann Languages
Why did C (FORTRAN, C++, etc.) become so successful?
–It abstracted away the differences between von Neumann machines: register set structure, functional units and capabilities, pipeline depth/width, memory/cache organization.
–It directly exposed their common properties: a single memory image, a single control flow, a clear notion of time.
–It can be mapped very efficiently onto a von Neumann machine.
Today von Neumann languages are a curse!

StreamIt: A Spatially-Aware Language A language for streaming applications –Provides high-level stream abstraction A filter is the autonomous unit of computation. Breaks the von Neumann language barrier –Each filter has its own PC –Each filter has its own address space –No global time –Explicit data movement between filters

The Filter
A filter communicates using FIFO channels, with the following operations:
–pop(): dequeue the next item from the incoming channel.
–peek(index): return the value at position index in the incoming channel without dequeuing it.
–push(value): enqueue value on the outgoing channel.
The pop, peek, and push rates for each firing of a filter must be statically determined.
Each filter contains:
–An initialization function
–A steady-state “work” function
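The channel operations and static rates can be illustrated with a small Python sketch. It is not StreamIt syntax; the `Channel` and `MovingAverage` classes and the `run` helper are hypothetical names introduced only for this illustration. The filter peeks at 3 items, pops 1, and pushes 1 on every firing.

```python
from collections import deque


class Channel:
    """FIFO channel between two filters (hypothetical helper)."""

    def __init__(self, items=()):
        self._items = deque(items)

    def push(self, value):
        self._items.append(value)       # enqueue on the outgoing side

    def pop(self):
        return self._items.popleft()    # dequeue the next item

    def peek(self, index):
        return self._items[index]       # read without dequeuing

    def __len__(self):
        return len(self._items)


class MovingAverage:
    """Hypothetical filter with static rates: peek 3, pop 1, push 1."""

    PEEK, POP, PUSH = 3, 1, 1  # fixed before execution starts

    def __init__(self, inp, out):
        self.inp, self.out = inp, out

    def can_fire(self):
        return len(self.inp) >= self.PEEK

    def work(self):
        total = sum(self.inp.peek(i) for i in range(self.PEEK))
        self.out.push(total / self.PEEK)
        self.inp.pop()


def run(values):
    """Fire the filter for as long as enough input is available."""
    inp, out = Channel(values), Channel()
    f = MovingAverage(inp, out)
    while f.can_fire():
        f.work()
    return [out.pop() for _ in range(len(out))]
```

For example, `run([1, 2, 3, 4, 5])` fires three times; the last two items stay in the input channel because fewer than 3 items remain to peek at.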

StreamIt Language A collection of filters connected by channels. Structured Streams –Streaming applications have structure, not a free-form graph. –Use a few constructs: pipeline, splitjoin and feedback –Hierarchical composition –Intuitive textual representation –Greatly simplify compiler analysis

Hierarchical Structures
pipeline
–Sequential composition of streams
splitjoin
–Parallel composition of streams
feedback loop
–Cyclic composition of streams
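A rough Python sketch of hierarchical composition follows, modelling a stream as a function from a list of items to a list of items. The helper names (`pipeline`, `splitjoin`, `scale`, `offset`) are illustrative assumptions, the splitter and joiner are assumed to be round-robin, and the input length is assumed to be a multiple of the branch count. The feedback loop is left out because it needs a cyclic schedule.

```python
def pipeline(*stages):
    """Sequential composition: the output of each stage feeds the next."""
    def run(items):
        for stage in stages:
            items = stage(items)
        return items
    return run


def splitjoin(*branches):
    """Parallel composition with a round-robin splitter and joiner."""
    def run(items):
        n = len(branches)
        # Splitter: item i goes to branch i mod n.
        outputs = [branch(items[i::n]) for i, branch in enumerate(branches)]
        # Joiner: take one item from each branch in turn.
        joined = []
        for group in zip(*outputs):
            joined.extend(group)
        return joined
    return run


def scale(k):
    """Hypothetical leaf filter: multiply every item by k."""
    return lambda items: [k * x for x in items]


def offset(k):
    """Hypothetical leaf filter: add k to every item."""
    return lambda items: [x + k for x in items]


# Streams nest freely: a splitjoin and a pipeline inside a pipeline.
program = pipeline(
    scale(2),
    splitjoin(offset(0), offset(100)),
    pipeline(scale(1), offset(1)),
)
```

Because every construct returns something of the same shape as a leaf filter, a construct can appear anywhere a filter can, which is what makes the composition hierarchical.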

Compiler Flow Summary
Compiler passes, in order: Kopi Front-End, SIR Conversion, Graph Expansion, Partitioning, Layout, Communication Scheduler, Code Generation
Intermediate forms and outputs, in order: StreamIt Code, Parse Tree, SIR (unexpanded), SIR (expanded), Load-balanced Stream Graph, Filters assigned to Raw tiles, Switch Code, Processor Code

Partitioning
Goal: The granularity of the stream graph should match the target architecture.
For Raw, we want the number of filters in the stream graph to equal the number of tiles.
The final stream graph needs to be load balanced.
Partitioning is currently driven by a simple greedy algorithm.
Two primary transformations:
–Fission
–Fusion
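The slides do not spell out the greedy algorithm, so the following Python sketch is only one plausible strategy, not the compiler's actual heuristic. It assumes a plain pipeline, a hypothetical per-filter work estimate, and a rule that repeatedly fuses the adjacent pair with the smallest combined work until the filter count fits the tile count. `greedy_fuse` is an invented name.

```python
def greedy_fuse(work, tiles):
    """Fuse adjacent pipeline filters until at most `tiles` remain.

    `work` lists a hypothetical work estimate per filter, in pipeline order.
    Each step fuses the adjacent pair with the smallest combined work
    (the first such pair on ties).
    """
    parts = [[w] for w in work]
    while len(parts) > tiles:
        sums = [sum(parts[i]) + sum(parts[i + 1])
                for i in range(len(parts) - 1)]
        i = sums.index(min(sums))
        parts[i:i + 2] = [parts[i] + parts[i + 1]]   # fuse the pair
    return parts
```

A fuller version would also apply fission to the heaviest filter when the graph has fewer filters than tiles; that half is omitted here.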

Partitioning - Fission
Fission - splitting streams
–Duplicate a filter, placing the duplicates in a splitjoin to expose parallelism.
[Figure: a single Filter becomes a Splitter, duplicated Filters, and a Joiner]
–Split a filter into a pipeline for load balancing.
[Figure: a single Filter becomes the pipeline Filter0, Filter1, …, FilterN]
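The first form of fission can be checked on a toy case in Python. The sketch assumes a stateless filter that pops one item and pushes one item per firing and does not peek further ahead, with a round-robin splitter and joiner; `square`, `run_single`, and `run_fissed` are hypothetical names.

```python
def square(x):
    """Hypothetical stateless filter body: pop 1, push 1."""
    return x * x


def run_single(items):
    """Reference: the original, unduplicated filter."""
    return [square(x) for x in items]


def run_fissed(items, copies):
    """Run `copies` duplicates of the filter inside a round-robin splitjoin."""
    # Splitter: duplicate k receives items k, k + copies, k + 2*copies, ...
    lanes = [[square(x) for x in items[k::copies]] for k in range(copies)]
    # Joiner: read one item from each duplicate in turn.
    out = []
    for i in range(len(items)):
        out.append(lanes[i % copies][i // copies])
    return out
```

Both functions return the same sequence for any input, which is why the duplicates can run on separate tiles without changing the program's output.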

Partitioning - Fusion
Fusion - merging streams
–Reduce the number of filters in a construct for load balancing and synchronization removal.
[Figure: a splitjoin (Splitter, Filter0 … FilterN, Joiner) is fused into a single Filter]
[Figure: a pipeline Filter0, Filter1, …, FilterN is fused into a single Filter]

Partitioning Example (Sort)
[Figure: the Sort stream graph with 242 filters, and the partitioned graph with 16 filters]

Layout Goal: To assign each filter to exactly one Raw tile. The layout algorithm is implemented using Simulated Annealing. The cost function (energy) tries to measure the added synchronization imposed by the layout. Want to avoid: –Crossed routes –Routes passing through tiles assigned to filters Because of the static properties of StreamIt, exact communication properties of the stream graph are known at compile time. –Cost function is quite accurate –Leads to excellent layouts
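A minimal simulated-annealing sketch in Python is given below. The cost function is a stand-in, not the compiler's: it counts the intermediate tiles on each route, which is zero exactly when every pair of communicating filters sits on adjacent tiles. The function names, the linear cooling schedule, and the assumption that the filters exactly fill a square grid are all illustrative choices.

```python
import math
import random


def cost(layout, edges):
    """Count intermediate tiles over all routes (0 if every edge is adjacent)."""
    total = 0
    for a, b in edges:
        (ax, ay), (bx, by) = layout[a], layout[b]
        total += abs(ax - bx) + abs(ay - by) - 1
    return total


def anneal(n_filters, edges, width, steps=20000, t0=2.0, seed=0):
    """Assign filter i to tile layout[i]; assumes n_filters == width * width."""
    rng = random.Random(seed)
    tiles = [(x, y) for y in range(width) for x in range(width)]
    layout = tiles[:n_filters]            # start: filter i on tile i
    current = cost(layout, edges)
    best, best_cost = list(layout), current
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9   # linear cooling
        i, j = rng.sample(range(n_filters), 2)
        layout[i], layout[j] = layout[j], layout[i]       # propose a swap
        new = cost(layout, edges)
        if new <= current or rng.random() < math.exp((current - new) / temp):
            current = new                                  # accept
            if new < best_cost:
                best, best_cost = list(layout), new
        else:
            layout[i], layout[j] = layout[j], layout[i]   # reject: undo
    return best, best_cost
```

Because the stream graph's communication is known at compile time, a cost like this can be evaluated exactly for each candidate layout rather than estimated from profiles.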

Layout Example (FFT) Partitioned Stream Graph Zero-cost layout

Layout Example (Radio) Partitioned Stream Graph Best layout

Routing At this time, data items are routed using a simple dimension-ordered router. The router traces the path from source to destination by first routing the Y dimension and then the X dimension. All items are sent over the first static network. The second static network and the dynamic network are unused.
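Dimension-ordered routing as described above can be sketched in a few lines of Python. Tiles are assumed to be addressed as `(x, y)` pairs; the `route` function is a hypothetical helper, not part of the compiler.

```python
def route(src, dst):
    """Return the tiles visited from src to dst: Y dimension first, then X."""
    x, y = src
    path = [(x, y)]
    while y != dst[1]:                  # first resolve the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    while x != dst[0]:                  # then resolve the X dimension
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    return path
```

The path between two tiles is fully determined by their coordinates, so the compiler can compute every route statically.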

Communication Scheduling The communication scheduler maps StreamIt’s channel abstraction to Raw’s static network. The communication scheduler simulates the execution of a given schedule, recording the communication as it simulates. –Assume that each filter fires instantaneously. –Record the routing instruction for the source, destination, and intermediate hops.
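The simulation idea can be sketched in Python as follows. This is a simplified illustration under several assumptions: each filter has at most one output channel, routes are Y-then-X, y grows towards "S", and a switch instruction is written as an `(input port, output port)` pair with `"P"` standing for the tile's processor. None of this is Raw's real switch instruction format, and all function names are hypothetical.

```python
def route(src, dst):
    """Tiles visited from src to dst, Y dimension first, then X."""
    x, y = src
    path = [(x, y)]
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    return path


def direction(a, b):
    """Port on tile a that faces the neighbouring tile b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    return {(1, 0): "E", (-1, 0): "W", (0, 1): "S", (0, -1): "N"}[(dx, dy)]


def schedule_communication(firings, layout, channels):
    """Simulate the firings in order and record per-tile switch routes.

    firings:  filter names in schedule order; each firing is treated as
              instantaneous and pushes its items immediately.
    layout:   filter name -> (x, y) tile.
    channels: producer name -> (consumer name, items pushed per firing).
    """
    switch_code = {tile: [] for tile in layout.values()}
    for name in firings:
        if name not in channels:
            continue                        # sink filter: nothing to send
        consumer, rate = channels[name]
        path = route(layout[name], layout[consumer])
        for _ in range(rate):
            for k, tile in enumerate(path):
                # Source, intermediate hops, and destination each get an entry.
                src = "P" if k == 0 else direction(tile, path[k - 1])
                dst = "P" if k == len(path) - 1 else direction(tile, path[k + 1])
                switch_code.setdefault(tile, []).append((src, dst))
    return switch_code
```

Each tile's list is the ordered sequence of routing steps its switch would perform for the simulated schedule.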

Code Generation For the compute-processor, we generate C code that is compiled using Raw's GCC port. We introduce an internal buffer for each filter. –The buffer is necessary because of the peek operation. –All items are received into this buffer. Loop “work” function infinitely in steady-state: –Each filter buffers its input until it has peek items in its buffer, then it fires. –pop() and peek(index) are reads from the buffer. –A push(value) is a static network send.
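The buffering scheme can be mirrored in Python to show the control structure. The compiler itself emits C, so this is only a structural sketch: `receive` and `send` stand in for static-network operations, the function name is hypothetical, the loop stops when the input runs out instead of running forever, and `pop_rate` is assumed to be at least 1.

```python
def steady_state(receive, send, peek_rate, pop_rate, work):
    """Run one filter's steady-state loop until the input is exhausted.

    receive: iterator standing in for static-network receives.
    send:    callable standing in for a static-network send (push).
    work:    callable taking a peek function; returns the values to push.
    """
    buffer = []
    firings = 0
    for item in receive:
        buffer.append(item)                 # all items land in the buffer first
        while len(buffer) >= peek_rate:     # enough items buffered: fire
            for value in work(lambda i: buffer[i]):
                send(value)                 # push(value) becomes a network send
            del buffer[:pop_rate]           # pops consume from the buffer
            firings += 1
    return firings
```

With `peek_rate=3` and `pop_rate=1`, two items always remain in the buffer after a firing, which is why the buffer is needed whenever a filter peeks beyond what it pops.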

Results
We have detailed performance measurements for our 9 benchmarks in our upcoming ASPLOS paper, but we will not give them here.
–This is our initial implementation and we are working on optimizations.
But the results show that we are not communication limited.
–We need to focus on optimizing the generated compute-processor code.
In the following slides we give a comparison of StreamIt and C code for our benchmarks.

Speedup Over Single Tile
–For Radio we obtained the C implementation from a 3rd party.
–For FIR, Sort, FFT, Filterbank, and 3GPP we wrote the C implementation following a reference algorithm.

Intel® Xeon™ Comparison
[Figure: bar chart over the benchmarks FIR, Radar, Radio, Sort, FFT, Filterbank, GSM, Vocoder, and 3GPP. Y-axis: throughput / cycle normalized to a 2.2GHz [...]. Series: sequential C program on 1 tile; StreamIt program on 16 tiles. One value label is shown: 37.]
–For Radio, GSM, and Vocoder we obtained the C implementation from a 3rd party.
–For FIR, Sort, FFT, Filterbank, Radar, and 3GPP we wrote the C implementation following a reference algorithm.
–For Radar, GSM, and Vocoder the C implementation did not fit on a single Raw tile.

Conclusion First step toward a portable stream language for communication-exposed architectures. Future work: –Optimizing the implementation –Support more features of StreamIt Other cool StreamIt projects: –New syntax –DSP domain specific linear dataflow analysis and transformation. –Constrained scheduling

StreamIt Homepage
For More Information
William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A Language for Streaming Applications. 2002 International Conference on Compiler Construction, Grenoble, France. To appear in the Springer-Verlag Lecture Notes in Computer Science.
Michael I. Gordon, William Thies, et al. A Stream Compiler for Communication-Exposed Architectures. Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, October.
Michael I. Gordon. A Stream-Aware Compiler for Communication-Exposed Architectures. S.M. Thesis, Massachusetts Institute of Technology, August 2002.