A Stream Compiler for Communication-Exposed Architectures
Michael Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali Meli, Andrew Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman Amarasinghe
Laboratory for Computer Science, Massachusetts Institute of Technology

The Streaming Domain
Widely applicable and increasingly prevalent:
– Embedded systems: cell phones, handheld computers, DSPs
– Desktop applications: streaming media, real-time encryption, software radio, graphics packages
– High-performance servers: software routers (example: Click), cell phone base stations, HDTV editing consoles
Based on audio, video, or data streams: the predominant data types in the current data explosion.

Properties of Stream Programs
A large (possibly infinite) amount of data
– Limited lifetime of each data item
– Little processing of each data item
Computation: apply multiple filters to data
– Each filter takes an input stream, does some processing, and produces an output stream
– Filters are independent and self-contained
A regular, static computation pattern
– The filter graph is relatively constant
– A lot of opportunities for compiler optimizations

StreamIt: A Spatially-Aware Language and Compiler
A language for streaming applications
– Provides a high-level stream abstraction
– Breaks the von Neumann language barrier: each filter has its own control flow and its own address space, there is no global time, data movement between filters is explicit, and the compiler is free to reorganize the computation
A spatially-aware compiler
– Intermediate representation with stream constructs
– Provides a host of stream analyses and optimizations

Structured Streams
Basic programmable unit: Filter
Hierarchical structures:
– Pipeline
– SplitJoin
– FeedbackLoop

Filter Example: LowPassFilter

float->float filter LowPassFilter(int N) {
  float[N] weights;
  init {
    for (int i=0; i<N; i++)
      weights[i] = calcWeights(i);
  }
  work push 1 pop 1 peek N {
    float result = 0;
    for (int i=0; i<N; i++)
      result += weights[i] * peek(i);
    push(result);
    pop();
  }
}

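To make the peek/pop/push semantics concrete, here is a minimal Python sketch of the same sliding-window FIR computation. The calc_weights stub and the generator interface are illustrative assumptions, not part of StreamIt or its compiler.

from collections import deque

def calc_weights(n):
    # Hypothetical stand-in for calcWeights(i); a real low-pass design
    # would compute windowed-sinc coefficients here.
    return [1.0 / n] * n

def low_pass_filter(n, samples):
    """Mimics the work function: peek n items, push 1 result, pop 1 item."""
    weights = calc_weights(n)
    window = deque()
    for x in samples:
        window.append(x)
        if len(window) < n:          # not enough peeked data yet
            continue
        yield sum(w * v for w, v in zip(weights, window))   # push(result)
        window.popleft()             # pop()

# Example: a 4-tap moving average over a short input stream.
print(list(low_pass_filter(4, range(10))))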

Example: Radar Array Front End
[Stream graph: a duplicate splitter feeding a FIRFilter pipeline per channel into a roundrobin joiner, followed by a duplicate splitter feeding a (Vector Mult, FirFilter, Magnitude, Detector) pipeline per beam into a joiner]

complex->void pipeline BeamFormer(int numChannels, int numBeams) {
  add splitjoin {
    split duplicate;
    for (int i=0; i<numChannels; i++) {
      add pipeline {
        add FIR1(N1);
        add FIR2(N2);
      };
    }
    join roundrobin;
  };
  add splitjoin {
    split duplicate;
    for (int i=0; i<numBeams; i++) {
      add pipeline {
        add VectorMult();
        add FIR3(N3);
        add Magnitude();
        add Detect();
      };
    }
    join roundrobin(0);
  };
}

How to Execute a Stream Graph? Method 1: Time Multiplexing
Run one filter at a time.
Pros:
– Scheduling is easy
– Synchronization comes through memory
Cons:
– If a filter run is too short, the filter load overhead is high
– If a filter run is too long, data spills down the cache hierarchy and latency is long
– Lots of memory traffic and bad cache effects
– Does not scale to spatially-aware architectures

How to Execute a Stream Graph? Method 2: Space Multiplexing
Map one filter per tile and run it forever.
Pros:
– No filter-swapping overhead
– Exploits spatially-aware architectures and scales well
– Reduced memory traffic
– Localized communication
– Tighter latencies
– Smaller live data set
Cons:
– Load balancing is critical
– Not good for dynamic behavior
– Requires # filters ≤ # processing elements

The MIT Raw Machine
A scalable computation fabric
– 4 x 4 mesh of tiles; each tile is a simple microprocessor
An ultra-fast interconnect network
– Exposes the wires to the compiler
– The compiler orchestrates the communication

Radar Array Front End on Raw
[Per-tile execution visualization; legend: Executing Instructions, Blocked on Static Network, Pipeline Stall]

Bridging the Abstraction Layers
The StreamIt language exposes the data movement
– The graph structure is architecture-independent
Each architecture differs in granularity and topology
– Communication is exposed to the compiler
The compiler needs to bridge the abstraction efficiently
– Map the computation and communication pattern of the program onto the PEs, memory, and the communication substrate
The StreamIt compiler phases:
– Partitioning
– Placement
– Scheduling
– Code generation

Partitioning: Choosing the Granularity
Mapping filters to tiles
– The # of filters should equal (or be a few less than) the # of tiles
– Each filter should have a similar amount of work; throughput is determined by the filter with the most work
Compiler algorithm
– Two primary transformations: filter fission and filter fusion
– Uses a greedy heuristic (a rough sketch follows the fission and fusion slides below)

Partitioning - Fission
Fission is splitting streams:
– Duplicate a filter, placing the duplicates in a SplitJoin, to expose parallelism (Filter becomes Splitter, Filter0 ... FilterN, Joiner)
– Split a filter into a pipeline for load balancing (Filter becomes Filter0, Filter1, ..., FilterN)

Partitioning - Fusion
Fusion is merging streams:
– Merge filters into one filter for load balancing and synchronization removal (a SplitJoin of Filter0 ... FilterN, or a pipeline Filter0, Filter1, ..., FilterN, collapses into a single Filter)
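
The slides name the ingredients of the partitioner (work estimates, fission, fusion, a greedy heuristic) but not the exact algorithm, so the Python sketch below shows only the fusion half under an assumed cost model: repeatedly merge the adjacent pair of filters with the smallest combined load until the filter count fits the tile count. The filter names and work_estimate values are hypothetical.

def greedy_fuse(filters, num_tiles, work_estimate):
    """Rough sketch of fusion-only greedy partitioning for a pipeline.

    filters: ordered filter names in a pipeline (illustrative).
    work_estimate: dict mapping filter name -> estimated cycles per steady state.
    Returns a list of fused groups, at most num_tiles long.
    """
    groups = [[f] for f in filters]

    def load(g):
        return sum(work_estimate[f] for f in g)

    while len(groups) > num_tiles:
        # Fuse the adjacent pair whose combined load is smallest, so the
        # bottleneck (the maximum per-tile load) grows as little as possible.
        i = min(range(len(groups) - 1),
                key=lambda k: load(groups[k]) + load(groups[k + 1]))
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups

# Hypothetical work estimates for one beamformer branch, mapped onto 2 tiles.
work = {"VectorMult": 20, "FIR": 95, "Magnitude": 10, "Detector": 5}
print(greedy_fuse(["VectorMult", "FIR", "Magnitude", "Detector"], 2, work))

Fission would work in the opposite direction: when a single filter dominates the critical path and tiles remain free, it is duplicated inside a SplitJoin (or split into a pipeline) to spread its work.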

Example: Radar Array Front End (Original)
[Stream graph: Splitter, per-channel FIRFilter pipelines, Joiner, then Splitter, four per-beam (Detector, Magnitude, FirFilter, Vector Mult) pipelines, Joiner]

Example: Radar Array Front End (partitioning steps)
[Animation frames: the FIRFilter stage is fissed and the per-beam filters are fused until the number of filters fits the number of tiles]

Example: Radar Array Front End (Balanced)
[Resulting graph: Splitter, FIRFilters, Joiner, then Splitter, four balanced (Vector Mult, FIRFilter, Magnitude, Detector) pipelines, Joiner]

Placement: Minimizing Communication
Assign filters to tiles
– Communicating filters: try to make them adjacent
– Reduce overlapping communication paths
– Reduce or eliminate cyclic communication if possible
Compiler algorithm
– Uses simulated annealing
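
The slides only say that placement uses simulated annealing; the cost function in the sketch below (items sent per steady state, weighted by Manhattan distance on the mesh) and the swap-based move set are assumptions for illustration, not the compiler's exact objective.

import math
import random

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def anneal_placement(filters, edges, mesh=(4, 4), steps=20000, seed=0):
    """Toy simulated-annealing placement.

    filters: filter names, one per tile (the graph after partitioning).
    edges: dict mapping (src, dst) filter pairs to items sent per steady state.
    Returns a dict: filter -> (row, col) tile coordinate.
    """
    rng = random.Random(seed)
    tiles = [(r, c) for r in range(mesh[0]) for c in range(mesh[1])]
    place = dict(zip(filters, rng.sample(tiles, len(filters))))

    def cost(p):
        return sum(v * manhattan(p[s], p[d]) for (s, d), v in edges.items())

    cur = cost(place)
    temp = 1.0
    for _ in range(steps):
        a, b = rng.sample(filters, 2)                # propose swapping two filters
        place[a], place[b] = place[b], place[a]
        new = cost(place)
        if new <= cur or rng.random() < math.exp((cur - new) / max(temp, 1e-9)):
            cur = new                                # accept the move
        else:
            place[a], place[b] = place[b], place[a]  # revert it
        temp *= 0.9995                               # cool the temperature
    return place

# Hypothetical partitioned graph: a chain of four fused filters.
print(anneal_placement(["FIR", "VectorMult", "MagDet", "Joiner"],
                       {("FIR", "VectorMult"): 12, ("VectorMult", "MagDet"): 4,
                        ("MagDet", "Joiner"): 4}))

A real placer would also consider route overlap on the static network; minimizing weighted distance is only a first-order proxy for that.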

Placement for the Partitioned Radar Array Front End
[Mesh layout showing the FIR, Vector Mult, Magnitude, Detector, and Joiner filters assigned to tiles so that communicating filters sit on adjacent tiles]

Scheduling: Communication Orchestration
Create a communication schedule
Compiler algorithm:
– Calculate an initialization and a steady-state schedule
– Simulate the execution of an entire cyclic schedule
– Place static route instructions at the appropriate time

Steady-State Schedule
All data pop/push rates are constant, so a steady-state schedule can be found:
– The # of items in the buffers is the same before and after executing the schedule
– There exists a unique minimal steady-state schedule
Example: a pipeline A, B, C with A: push=2; B: pop=3, push=1; C: pop=2
– Building up the schedule one firing at a time gives { A, A, B, A, B, C }
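
For this example the minimal multiplicities follow from the balance equations 2·a = 3·b (channel A to B) and 1·b = 2·c (channel B to C), giving a=3, b=2, c=1, which matches the { A, A, B, A, B, C } schedule above. The Python sketch below performs that calculation for a simple pipeline; the real compiler solves the same balance equations for arbitrary stream graphs, and the function name here is only illustrative.

from math import gcd

def pipeline_multiplicities(rates):
    """rates: list of (pop, push) pairs per filter, in pipeline order.
    Returns minimal integer firing counts so that every channel is balanced:
    counts[i] * push[i] == counts[i+1] * pop[i+1]."""
    counts = [1]
    for i in range(1, len(rates)):
        push_prev = rates[i - 1][1]
        pop_cur = rates[i][0]
        produced = counts[i - 1] * push_prev
        # Scale all earlier counts so the production divides evenly.
        scale = pop_cur // gcd(produced, pop_cur)
        counts = [c * scale for c in counts]
        counts.append(counts[i - 1] * push_prev // pop_cur)
    return counts

# A: push=2; B: pop=3, push=1; C: pop=2
print(pipeline_multiplicities([(0, 2), (3, 1), (2, 0)]))   # -> [3, 2, 1]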

Initialization Schedule
When peek > pop (for example peek=4, pop=3, push=1), the buffer cannot be empty after firing a filter
– Buffers are not empty at the beginning/end of the steady-state schedule
– Need to fill the buffers before starting the steady-state execution
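
In the peek=4, pop=3 example, every firing of the downstream filter leaves peek - pop = 1 item that must already be sitting in its input buffer, so the upstream filter has to fire extra times before the steady state begins. The sketch below captures that prefill rule for a single channel; it is a simplification, since the compiler derives initialization schedules for whole graphs.

import math

def init_firings(push_up, pop_down, peek_down):
    """Extra upstream firings needed before the steady state can start on one
    channel: the buffer must always hold at least (peek_down - pop_down) items."""
    leftover = max(peek_down - pop_down, 0)
    return math.ceil(leftover / push_up)

# Upstream pushes 1 item per firing; downstream peeks 4 and pops 3.
print(init_firings(push_up=1, pop_down=3, peek_down=4))   # -> 1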

Code Generation: Optimizing Tile Performance
Creates the code that runs on each tile
– Optimized by the existing node compiler
Generates the switch code for the communication
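
As a rough illustration of what the switch code expresses, the sketch below expands a steady-state schedule and a placement into per-firing routes, using dimension-ordered (XY) routing as an assumed policy; the real compiler derives routes during its schedule simulation and emits actual Raw switch instructions rather than Python tuples.

def xy_route(src, dst):
    """Hops from src to dst tile using dimension-ordered (XY) routing,
    an assumption made here for illustration."""
    r, c = src
    hops = []
    while c != dst[1]:
        c += 1 if dst[1] > c else -1
        hops.append((r, c))
    while r != dst[0]:
        r += 1 if dst[0] > r else -1
        hops.append((r, c))
    return hops

def switch_routes(firing_order, placement, edges):
    """For each producer firing in the schedule, list the route its output takes.
    firing_order: steady-state schedule, e.g. ["A", "A", "B", ...].
    placement: filter -> (row, col) tile; edges: producer -> consumer."""
    routes = []
    for f in firing_order:
        if f in edges:
            routes.append((f, xy_route(placement[f], placement[edges[f]])))
    return routes

# Hypothetical 3-filter pipeline mapped onto three tiles of the mesh.
placement = {"A": (0, 0), "B": (0, 1), "C": (1, 1)}
print(switch_routes(["A", "A", "B", "A", "B", "C"], placement, {"A": "B", "B": "C"}))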

Performance Results for the Radar Array Front End
[Per-tile execution visualization; legend: Executing Instructions, Blocked on Static Network, Pipeline Stall]

Performance of Radar Array Front End

Utilization of Radar Array Front End

StreamIt Applications: FM Radio with an Equalizer
[Stream graph: Low Pass filters and an FM Demodulator feeding a duplicate splitter, per-band Float Diff filters, a round-robin joiner, and a Float Adder filter]

StreamIt Applications: Vocoder
[Stream graph: DFT filters, duplicate and round-robin splitters and joiners, an FIR Smoothing filter, Identity, Deconvolve, Linear Interpolator and Decimator filters, Multiplier and Const Multiplier filters, and a Phase Unwrapper filter]

StreamIt Applications: GSM Decoder
[Stream graph: Hold, LTP Input, Additional Update, Reflection Coeff, Short Term Synth, and Post Processing filters, plus Identity filters, connected through round-robin and duplicate splitters and joiners]

StreamIt Applications: 3GPP Radio Access Protocol – Physical Layer

Application Performance

Scalability of StreamIt Bitonic Sort

Related Work
Stream-C / Kernel-C (Dally et al.)
– Compiled to Imagine with time multiplexing
– Extensions to C to deal with finite streams
– Programmer explicitly calls stream "kernels"
– Needs program analysis to overlap streams / vary target granularity
Brook (Buck et al.)
– Architecture-independent counterpart of Stream-C / Kernel-C
– Designed to be more parallelizable
Ptolemy (Lee et al.)
– Heterogeneous modeling environment for DSP
– Many scheduling results shared with StreamIt
– Does not focus on language development or optimized code generation
Other languages
– Occam, SISAL: not statically schedulable
– LUSTRE, Lucid, Signal, Esterel: do not focus on parallel performance

Conclusion
Streaming programming model
– An important class of applications
– Can break the von Neumann bottleneck
– A natural fit for a large class of applications
– Straightforward mapping to the architectural model
StreamIt: a machine language for communication-exposed architectures
– Exposes the common properties: multiple instruction streams, software-exposed communication, fast local memory co-located with execution units
– Hides the differences: granularity of execution units, type and topology of the communication network, memory hierarchy
A good compiler can eliminate the overhead of abstraction.