

A Compiler Infrastructure for Stream Programs
Bill Thies
Joint work with Michael Gordon, Michal Karczmarek, Jasper Lin, Andrew Lamb, David Maze, Rodric Rabbah and Saman Amarasinghe
Massachusetts Institute of Technology
IBM PL Day, May 21, 2004

Streaming Application Domain
Based on audio, video, or data stream
Increasingly prevalent and important
– Embedded systems
    Cell phones, handheld computers
– Desktop applications
    Streaming media
    Real-time encryption
    Software radio
    Graphics packages
– High-performance servers
    Software routers (Example: Click)
    Cell phone base stations
    HDTV editing consoles

Properties of Stream Programs
A large (possibly infinite) amount of data
– Limited lifetime of each data item
– Little processing of each data item
Computation: apply multiple filters to data
– Each filter takes an input stream, does some processing, and produces an output stream
– Filters are independent and self-contained
A regular, static communication pattern
– Filter graph is relatively constant
– A lot of opportunities for compiler optimizations

The StreamIt Project
Goals:
– Provide a high-level stream programming model
– Invent new compiler technology for streams
Contributions:
– Language Design, Structured Streams, Buffer Management (CC 2002)
– Exploiting Wire-Exposed Architectures (ASPLOS 2002, ISCA 2004)
– Scheduling of Static Dataflow Graphs (LCTES 2003)
– Domain Specific Optimizations (PLDI 2003)
– Public release: Fall 2003

Outline
– Introduction
– StreamIt Language
– Domain-specific Optimizations
– Targeting Parallel Architectures
– Public Release
– Conclusions


Model of Computation
Synchronous Dataflow [Lee 1992]
– Graph of independent filters
– Communicate via channels
– Static I/O rates
[Figure: frequency band detector stream graph with A/D, Band pass, Duplicate, and four Detect/LED branches]
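
Because the I/O rates are static, the number of times each filter must fire in a steady state can be worked out before the program runs. The following is a minimal Python sketch of that idea for a simple pipeline only; the function name `steady_state_repetitions` and the `(pop, push)` rate encoding are assumptions made for illustration and are not taken from the StreamIt compiler.

```python
from fractions import Fraction
from math import gcd


def steady_state_repetitions(rates):
    """Smallest positive firing counts that balance every channel of a pipeline.

    rates: list of (pop, push) pairs, one per filter, in pipeline order
           (hypothetical encoding chosen for this sketch).
    """
    # Balance equation per channel: push[i] * reps[i] == pop[i+1] * reps[i+1]
    reps = [Fraction(1)]
    for (_, push_prev), (pop_next, _) in zip(rates, rates[1:]):
        reps.append(reps[-1] * push_prev / pop_next)

    # Scale the rational solution to integers.
    scale = 1
    for r in reps:
        scale = scale * r.denominator // gcd(scale, r.denominator)
    ints = [int(r * scale) for r in reps]

    # Divide out any common factor to get the minimal solution.
    common = 0
    for v in ints:
        common = gcd(common, v)
    return [v // common for v in ints]


# Source pushes 1; middle filter pops 2 and pushes 3; sink pops 1.
print(steady_state_repetitions([(0, 1), (2, 3), (1, 0)]))  # [2, 1, 3]
```

In the example, two firings of the source feed one firing of the middle filter, whose three outputs are drained by three firings of the sink, so every channel returns to its starting occupancy.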

Filter Example: LowPassFilter

float->float filter LowPassFilter (int N, float freq) {
    float[N] weights;

    init {
        weights = calcWeights(N, freq);
    }

    work peek N pop 1 push 1 {
        float result = 0;
        for (int i=0; i<weights.length; i++) {
            result += weights[i] * peek(i);
        }
        push(result);
        pop();
    }
}
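
To make the peek/pop/push rates concrete, here is a small Python model of how such a work function could be fired over a finite input. It is only a sketch: `run_fir` is a hypothetical name, the input channel is modeled as a deque, and the weights are passed in directly because `calcWeights` is not shown on the slide.

```python
from collections import deque


def run_fir(weights, samples):
    """Model of a 'peek N pop 1 push 1' work function over a finite input."""
    n = len(weights)
    channel = deque(samples)  # input channel (assumed FIFO)
    out = []
    # The filter may fire only while N items are available to peek.
    while len(channel) >= n:
        result = sum(w * channel[i] for i, w in enumerate(weights))  # peek(i)
        out.append(result)   # push 1
        channel.popleft()    # pop 1
    return out


# Two-tap moving average used as stand-in weights.
print(run_fir([0.5, 0.5], [1, 3, 5, 7]))  # [2.0, 4.0, 6.0]
```

Each firing reads N items but consumes only one, which is the data reuse the peek primitive exposes.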


Composing Filters: Structured Streams
Hierarchical structures:
– Pipeline
– SplitJoin
– Feedback Loop
Basic programmable unit: Filter

Freq Band Detector in StreamIt
[Figure: stream graph with A/D, Band pass, Duplicate, and four Detect/LED branches]

void->void pipeline FrequencyBand {
    float sFreq = 4000;
    float cFreq = 500/(sFreq*2*pi);
    float wFreq = 100/(sFreq*2*pi);
    add D2ASource(sFreq);
    add BandPassFilter(100, cFreq-wFreq, cFreq+wFreq);
    add splitjoin {
        split duplicate;
        for (int i=0; i<4; i++) {
            add pipeline {
                add Detect (i/4);
                add LED (i);
            }
        }
        join roundrobin(0);
    }
}
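
As a rough illustration of how pipelines and splitjoins compose, the following Python sketch treats each stream as a function from a list of items to a list of items. The helpers `pipeline`, `splitjoin_duplicate`, and `scale` are hypothetical, and the joiner here takes one item from each branch per round so that output is visible, whereas the program above uses `roundrobin(0)` because its branches end in LED sinks.

```python
def pipeline(*stages):
    """Run stages one after another, feeding each output to the next stage."""
    def run(items):
        for stage in stages:
            items = stage(items)
        return items
    return run


def splitjoin_duplicate(*branches):
    """Duplicate splitter plus a round-robin joiner taking one item per branch."""
    def run(items):
        # Every branch sees a copy of the whole input stream.
        outputs = [branch(list(items)) for branch in branches]
        joined = []
        for group in zip(*outputs):  # one item from each branch, in order
            joined.extend(group)
        return joined
    return run


def scale(k):
    """Toy filter: multiply every item by k."""
    return lambda items: [k * x for x in items]


graph = pipeline(scale(2), splitjoin_duplicate(scale(1), scale(10)))
print(graph([1, 2]))  # [2, 20, 4, 40]
```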

Radar-Array Front End

Filterbank

FM Radio with Equalizer

Bitonic Sort

FFT

Block Matrix Multiply

MP3 Decoder


Conventional DSP Design Flow
Signal Processing Expert in Matlab
– Design the Datapaths (no control flow)
– DSP Optimizations
Produces: Spec. (data-flow diagram), Coefficient Tables
Software Engineer in C and Assembly
– Rewrite the program
– Architecture-specific Optimizations (performance, power, code size)
Produces: C/Assembly Code

Any Design Modifications?
Center frequency from 500 Hz to 1200 Hz?
– According to TI, in the conventional design flow:
    Redesign filter in MATLAB
    Cut-and-paste values to EXCEL
    Recalculate the coefficients
    Update assembly
Source: Application Report SPRA414, Texas Instruments, 1999
[Figure: frequency band detector stream graph]

Ideal DSP Design Flow
Application Programmer
– Application-Level Design
Produces: High-Level Program (dataflow + control)
Compiler
– DSP Optimizations
– Architecture-Specific Optimizations
Produces: C/Assembly Code
Challenge: maintaining performance

Our Focus: Linear Filters
Most common target of DSP optimizations
– FIR filters
– Compressors
– Expanders
– DFT/DCT
Output is weighted sum of inputs
Optimizations:
1. Combining Adjacent Nodes
2. Translating to Frequency Domain
3. Selecting the Best Transformations

Extracting Linear Representation

work peek N pop 1 push 1 {
    float sum = 0;
    for (int i=0; i<N; i++) {
        sum += h[i]*peek(i);
    }
    push(sum);
    pop();
}

Linear Dataflow Analysis → ⟨A, b⟩, where y = x A + b
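
For the FIR work function on this slide, the pair of A and b can be written down by hand, which gives a quick check of what y = x A + b means. The Python sketch below assumes the convention x[i] = peek(i) (the slide does not state the ordering), and the names `fir_linear_form`, `apply_linear`, and `work` are made up for the illustration.

```python
def fir_linear_form(h):
    """Linear form of the FIR filter: A is N x 1, b has a single entry.

    Assumes x[i] = peek(i); the analysis in the talk may order x differently.
    """
    A = [[w] for w in h]
    b = [0.0]
    return A, b


def apply_linear(x, A, b):
    """Compute y = x A + b for a row vector x."""
    return [sum(x[i] * A[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def work(h, window):
    """Direct transcription of the slide's work function for one firing."""
    total = 0.0
    for i in range(len(h)):
        total += h[i] * window[i]
    return [total]


h = [1.0, 2.0, 3.0]
window = [4.0, 5.0, 6.0]
A, b = fir_linear_form(h)
print(apply_linear(window, A, b), work(h, window))  # [32.0] [32.0]
```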

1) Combining Linear Filters
Pipelines and splitjoins can be collapsed
Example: pipeline
– Filter 1: y = x A
– Filter 2: z = y B
– Combined Filter: z = x A B = x C, where C = A B
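
The pipeline case is ordinary matrix multiplication, which a few lines of Python can confirm. This sketch assumes Filter 1 produces exactly the items Filter 2 consumes per firing, so no rate matching is needed; `matmul` and `apply` are illustrative helpers and the matrices are arbitrary example values.

```python
def matmul(A, B):
    """Plain matrix product of A (r x k) and B (k x c)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def apply(x, M):
    """Row vector times matrix: x M."""
    return [sum(x[i] * M[i][j] for i in range(len(x)))
            for j in range(len(M[0]))]


A = [[1, 2, 3], [4, 5, 6]]   # Filter 1: 2 inputs -> 3 outputs
B = [[1], [0], [2]]          # Filter 2: 3 inputs -> 1 output
C = matmul(A, B)             # Combined filter: 2 inputs -> 1 output

x = [1, 1]
print(apply(apply(x, A), B), apply(x, C))  # [23] [23]
```

Running the two filters separately costs 6 + 3 multiplications per output here, while the combined filter needs only 2.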

Combination Example
Filter 1 followed by Filter 2: 6 mults per output
Combined Filter, C = [ 32 ]: 1 mult per output

Floating-Point Operations Reduction 0.3%

2) From Time to Frequency Domain
Convolutions can be done cheaply in the Frequency Domain
– Time domain: Σ X_i * W_{n-i}
– Frequency domain: X ← F(x) (FFT); Y ← X .* H (VVM); y ← F^{-1}(Y) (IFFT)
Painful to do by hand
– Blocking
– Coefficient calculations
– Startup
– Multiple outputs
– Interfacing with FFT library
– Verification
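
The three frequency-domain steps can be checked against a direct convolution on a small input. The Python sketch below uses a textbook recursive radix-2 FFT written for this example, not the FFT library the compiler interfaces with, and it ignores the blocking and startup issues listed above by transforming the whole (zero-padded) signal at once.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    The inverse transform is returned unscaled (caller divides by n).
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def convolve_direct(x, h):
    """Time-domain convolution: one multiply per (input, weight) pair."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def convolve_fft(x, h):
    """Same result via FFT, pointwise multiply, and inverse FFT."""
    out_len = len(x) + len(h) - 1
    n = 1
    while n < out_len:  # pad to a power of two
        n *= 2
    X = fft(list(x) + [0.0] * (n - len(x)))   # X <- F(x)
    H = fft(list(h) + [0.0] * (n - len(h)))   # H <- F(h)
    Y = [a * b for a, b in zip(X, H)]         # Y <- X .* H
    y = fft(Y, inverse=True)                  # y <- F^-1(Y), unscaled
    return [v.real / n for v in y[:out_len]]


x, h = [1, 2, 3, 4], [1, -1, 0.5]
direct = convolve_direct(x, h)
fast = convolve_fft(x, h)
print(all(abs(a - b) < 1e-9 for a, b in zip(direct, fast)))  # True
```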

Floating-Point Operations Reduction -140% 0.3%

3) When to Apply Transformations?
Estimate minimal cost for each structure:
– Linear combination
– Frequency translation
– No transformation
– If hierarchical, consider all rectangular groupings of children
Overlapping sub-problems allow an efficient dynamic programming search
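
A much-simplified version of this search can be sketched for a flat pipeline of FIR filters, where the only groupings are contiguous runs of filters. Everything in the Python below is an assumption made for illustration: the cost functions `direct_cost` and `frequency_cost` are invented stand-ins rather than the compiler's cost model, and the real selection also handles splitjoins and rectangular groupings in hierarchical graphs.

```python
from functools import lru_cache
from math import ceil, log2


def direct_cost(taps):
    """Multiplies per output for a time-domain FIR (assumed cost model)."""
    return taps


def frequency_cost(taps):
    """Hypothetical per-output cost of a frequency-domain implementation."""
    return 4 * ceil(log2(2 * taps)) + 2


def best_plan(taps):
    """Cheapest (cost, plan) for a pipeline of FIR filters with given tap counts."""
    taps = tuple(taps)

    @lru_cache(maxsize=None)  # memoize overlapping sub-problems
    def solve(lo, hi):
        # Collapsing filters lo..hi gives one FIR with this many taps.
        combined = sum(taps[lo:hi + 1]) - (hi - lo)
        label = "filter" if lo == hi else "combine"
        options = [
            (direct_cost(combined), ((label, lo, hi),)),
            (frequency_cost(combined), (("freq", lo, hi),)),
        ]
        # Or transform the two halves independently.
        for mid in range(lo, hi):
            left_cost, left_plan = solve(lo, mid)
            right_cost, right_plan = solve(mid + 1, hi)
            options.append((left_cost + right_cost, left_plan + right_plan))
        return min(options)

    return solve(0, len(taps) - 1)


print(best_plan([2, 2]))    # (3, (('combine', 0, 1),))
print(best_plan([64, 64]))  # (34, (('freq', 0, 1),))
```

Under this toy cost model, short filters are best combined in the time domain, while long ones are cheaper once combined and moved to the frequency domain.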

Radar (Transformation Selection)
[Figure: Radar stream graph (Input, splitters, Dec/CFilt stages, BeamFrm, Filter, Mag, Detect, round-robin joiners, Sink) shown before and after transformation]
– Maximal Combination and Shifting to Frequency Domain: 2.4 times as many FLOPS
– Using Transformation Selection: half as many FLOPS

Floating-Point Operations Reduction -140% 0.3%


Compiling to the Raw Architecture
1. Partitioning: adjust granularity of graph
2. Layout: assign filters to tiles
3. Scheduling: route items across network
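
The partitioning step can be pictured with a greedy toy version: fuse adjacent filters of a pipeline until there are no more groups than tiles. This Python sketch is only meant to illustrate "adjusting granularity"; `partition_pipeline` and its work estimates are hypothetical, it assumes at least one tile, and it leaves out fission, non-pipeline graphs, layout, and scheduling.

```python
def partition_pipeline(work, tiles):
    """Greedily fuse adjacent filters until at most `tiles` groups remain.

    work:  estimated work per filter, in pipeline order (assumed inputs)
    tiles: number of available tiles, assumed >= 1
    Returns (groups of filter indices, total work per group).
    """
    groups = [[i] for i in range(len(work))]
    loads = list(work)
    while len(groups) > tiles:
        # Fuse the neighbouring pair with the smallest combined work.
        k = min(range(len(loads) - 1), key=lambda i: loads[i] + loads[i + 1])
        groups[k:k + 2] = [groups[k] + groups[k + 1]]
        loads[k:k + 2] = [loads[k] + loads[k + 1]]
    return groups, loads


groups, loads = partition_pipeline([1, 1, 8, 2, 2], tiles=3)
print(groups, loads)  # [[0, 1], [2], [3, 4]] [2, 8, 4]
```

A trivial layout would then place group g on tile g; a real layout step would also account for communication distance between tiles.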

Scalability Results

Raw vs. Pentium III


StreamIt Compiler Infrastructure
Built on Kopi Java compiler (GNU license)
– StreamIt frontend is on MIT license
High-level hierarchical IR for streams
– Host of graph transformations
    Filter fusion, filter fission
    Synchronization removal
    Splitjoin refactoring
    Graph canonicalization
Low-level “flat graph” for backends
– Eliminates structure; point-to-point connections
Streaming benchmark suite

Compiler Flow
StreamIt code → StreamIt Front-End → Legal Java file
Legal Java file → Any Java Compiler (with StreamIt Java Library) → Class file
Legal Java file → Kopi Front-End → Parse Tree → SIR Conversion → SIR (unexpanded) → Graph Expansion → SIR (expanded)
SIR (expanded) → Scheduler, Linear Optimizations
– Uniprocessor Backend → ANSI C code
– Raw Backend → C code for tiles, Assembly code for switch

Building on StreamIt
StreamIt to VIRAM [Yelick et al.]
– Automatically generate permutation instructions
StreamBit: bit-level optimization [Bodik et al.]
Integration with IBM Eclipse Platform

StreamIt Graphical Editor
StreamIt Component-Shortcuts
– Create Filters, Pipelines, SplitJoins, Feedback Loops, FIFOs
Juan C. Reyes, M.Eng. Thesis

StreamIt Debugging Environment
– StreamIt Text Editor
– StreamIt Graph Zoom Panel
– StreamIt Graph Components: expanded and collapsed views of basic programmable unit; communication buffer with live data
– General Debugging Information
– Compiler and Output Consoles
Not shown: the StreamIt On-Line Help Manual
Kimberly Kuo, M.Eng. Thesis


Related Work
Stream languages
– KernelC/StreamC, Brook: augment C with data-parallel kernels
– Cg: allow low-level programming of graphics processors
– SISAL, functional languages: expose temporal parallelism
– StreamIt exposes more task parallelism, easier to analyze
Control languages for embedded systems
– LUSTRE, Esterel, etc.: can verify safety properties
– Do not expose high-bandwidth data flow for optimization
Prototyping environments
– Ptolemy, Simulink, etc.: provide graphical abstractions
– StreamIt has more of a compiler focus

Future Work
Backend optimizations for linear filters
– Template assembly code: asymptotically optimal
Fault-tolerance on a cluster of workstations
– Automatically recover if a machine fails
Supporting dynamic events
– Point-to-point control messages
– Re-initialization for parts of the stream

Conclusions
StreamIt: compiler infrastructure for streams
– Raising the level of abstraction in stream programming
– Language design for both programmer and compiler
Public release: many opportunities for collaboration

Compiler Analysis | Language features exploited
Linear Optimizations | peek primitive (data reuse); atomic work function; structured streams
Backend for Raw Architecture | exposed communication patterns; exposed parallelism; static work estimate

Extra Slides

StreamIt Language Summary

Feature | Programmer Benefit | Compiler Benefit
Autonomous Filters | Encapsulation; Automatic scheduling | Local memories; Exposes parallelism; Exposes communication
Peek primitive | Automatic buffer management | Exposes data reuse
Structured Streams | Natural syntax | Exposes symmetry; Eliminate corner cases
Scripts for Graph Construction | Reusable components; Easy to maintain | Two-stage compilation

Linear Expansion
AB for any A and B?
[Figure: original and expanded matrix layouts, built from copies of A, used to match pop rates]

Backend Support for Linear Filters
Can generate custom code for linear filters
– Many architectures have special support for matrix multiplication
– On Raw: assembly code templates for tiles and switch
    Substitute coefficients, peek/pop/push rates
Preliminary result: FIR on Raw
– StreamIt code: 15 lines
– Manually-tuned C code: 352 lines
– Both achieve 99% utilization of Raw FPUs
    Asymptotically optimal
Current focus: integrating with general backend