Algorithmic Skeletons for Stream Programming in Embedded Heterogeneous Parallel Image Processing Applications (April 28, 2006)

Presentation transcript:

Algorithmic Skeletons for Stream Programming in Embedded Heterogeneous Parallel Image Processing Applications
IPDPS 2006, April 28, 2006
Wouter Caarls, Pieter Jonker, Henk Corporaal
Quantitative Imaging Group, Department of Imaging Science and Technology

Overview
- Stream programming
- Writing stream kernels
- Algorithmic skeletons
- Writing algorithmic skeletons
- Skeleton merging
- Results
- Conclusion & future work

Stream Programming
- FIFO-connected kernels processing series of data elements (a minimal sketch follows below)
- Well suited to signal processing applications
- Explicit communication and task decomposition: ideal for distributed-memory systems
- Each data element is processed (mostly) independently: ideal for data-parallel systems such as SIMDs
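As a rough illustration of FIFO-connected kernels, here is a generic C sketch; it is not the framework presented in these slides, and fifo_t, source and scale are made-up names. The actual STREAM-based program appears in the backup slides.

    #include <stdio.h>

    #define FIFO_LEN 8

    /* A tiny FIFO connecting two kernels. A real stream runtime would make
       this concurrent and bounded-blocking; this sketch runs sequentially. */
    typedef struct { int data[FIFO_LEN]; int head, tail; } fifo_t;

    static void fifo_put(fifo_t *f, int v) { f->data[f->tail++ % FIFO_LEN] = v; }
    static int  fifo_get(fifo_t *f)        { return f->data[f->head++ % FIFO_LEN]; }

    /* Producer kernel: emits a series of data elements. */
    static void source(fifo_t *out, int n)
    {
        for (int i = 0; i < n; i++)
            fifo_put(out, i);
    }

    /* Consumer kernel: processes each element independently. */
    static void scale(fifo_t *in, int n)
    {
        for (int i = 0; i < n; i++)
            printf("%d\n", 2 * fifo_get(in));
    }

    int main(void)
    {
        fifo_t f = {0};
        source(&f, FIFO_LEN);   /* in a real stream system these two kernels  */
        scale(&f, FIFO_LEN);    /* would run concurrently, element by element */
        return 0;
    }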

Kernel Examples from Image Processing
- Pixel processing (color space conversion): perfect match
- Local neighborhood processing (convolution): requires 2D access
- Recursive neighborhood processing (distance transform): regular data dependencies
- Stack processing (region growing): irregular data dependencies
Generality and architectural requirements increase down this list.

Writing Kernels
- The language for writing kernels should be restricted, to allow efficient compilation to constrained architectures
- But also general, so many different algorithms can be specified
- Solution: a different language for each type of kernel; the user selects the most restricted language that supports his kernel
- Goals: retargetability, efficiency, ease of use

Algorithmic skeletons* as kernel languages
- An algorithmic skeleton captures a pattern of computation
- It is conceptually a higher-order function, repeatedly calling a kernel function with certain parameters (see the sketch after this list)
- The iteration strategy may be parallel
- Kernel parameters restrict dependencies
- The skeleton provides the environment in which the kernel runs, and can be seen as a very restricted DSL
* M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation, 1989
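A minimal C sketch of the higher-order-function view, assuming a hypothetical pixel-to-pixel skeleton; pixel_kernel_t, pixel_to_pixel_op and binarize128 are illustrative names, not the actual skeleton API of this framework.

    #include <stddef.h>

    /* A kernel is a function applied to one input pixel, producing one output pixel. */
    typedef unsigned char (*pixel_kernel_t)(unsigned char in);

    /* The skeleton supplies the iteration structure (here a plain sequential loop,
       but it could equally be a SIMD or multi-core implementation). */
    static void pixel_to_pixel_op(pixel_kernel_t kernel,
                                  const unsigned char *in, unsigned char *out,
                                  size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = kernel(in[i]);
    }

    /* Example kernel: binarize against a fixed threshold. */
    static unsigned char binarize128(unsigned char p)
    {
        return p > 128 ? 255 : 0;
    }

    /* Usage: pixel_to_pixel_op(binarize128, image_in, image_out, width * height); */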

Sequential neighborhood skeleton

Kernel definition:

    NeighborhoodToPixelOp()
    Average(in stream float i[-1..1][-1..1], out stream float *o)
    {
        int ky, kx;
        float acc = 0;

        for (ky = -1; ky <= 1; ky++)
            for (kx = -1; kx <= 1; kx++)
                acc += i[ky][kx];
        *o = acc/9;
    }

Resulting operation (generated by the skeleton):

    void Average(float **i, float **o)
    {
        for (int y = 1; y < HEIGHT-1; y++)
            for (int x = 1; x < WIDTH-1; x++) {
                float acc = 0;
                acc += i[y-1][x-1]; acc += i[y-1][x  ]; acc += i[y-1][x+1];
                acc += i[y  ][x-1]; acc += i[y  ][x  ]; acc += i[y  ][x+1];
                acc += i[y+1][x-1]; acc += i[y+1][x  ]; acc += i[y+1][x+1];
                o[y][x] = acc/9;
            }
    }

Skeleton tasks
- Implement structure (outer loop, border handling, buffering, parallel implementation): just write C code
- Transform the kernel (stream access, translation to the target language): term rewriting
- How to combine both in a single language? Partial evaluation

Term rewriting (1)

Input:

    *o = acc/9;

Rewrite rule (applied top-down to all nodes):

    replace(`o`, `&o[y][x]`);

Output:

    o[y][x] = acc/9;

Term rewriting (2): Using Stratego*

Input:

    acc += i[ky][kx];

Rewrite rule (applied top-down to all nodes):

    RelativeToAbsolute: |[ i[~e1][~e2] ]| -> |[ i[y + ~e1][x + ~e2] ]|

Output:

    acc += i[y+ky][x+kx];

* E. Visser. Stratego: A Language for Program Transformation Based on Rewriting Strategies, 2001

PEPCI (1): Rule composition and code generation in C

Rule definition:

    stratego RelativeToAbsolute(code i, code body) {
        main = (body) RelativeToAbsolute': |[ ~i[~e1][~e2] ]| -> |[ ~i[y + ~e1][x + ~e2] ]|
    }

Rule composition:

    for (a=0; a < arguments; a++)
        if (args[a].type == ARG_STREAM_IN)
            body = RelativeToAbsolute(args[a].id, body);
        else if (args[a].type == ARG_STREAM_OUT)
            body = DerefToArrayIndex(args[a].id, body);

Code generation:

    for (y=1; y < HEIGHT-1; y++)
        for (x=1; x < WIDTH-1; ...

PEPCI (2): Combining rule composition and code generation

How to distinguish rule composition from code generation?

    for (a=0; a < arguments; a++)
        body = DerefToArrayIndex(args[a].id, body);

    for (x=0; x < stride; ...

Partial evaluation: evaluate only the parts of the program that are known, and output the rest.
- arguments, DerefToArrayIndex, args[a].id and body are known -> evaluate
- stride is unknown -> output

PEPCI (3): Partial evaluation by interpretation

Input:

    double n, x=1;
    int ii, iterations=3;
    scanf("%lf", &n);
    for (ii=0; ii < iterations; ii++)
        x = (x + n/x)/2;
    printf("sqrt(%f) = %f\n", n, x);

Output:

    double n;
    double x;
    int ii;
    int iterations;
    x = 1;
    iterations = 3;
    scanf("%lf", &n);
    ii = 0;
    x = (1 + n/1)/2;
    ii = 1;
    x = (x + n/x)/2;
    ii = 2;
    x = (x + n/x)/2;
    ii = 3;
    printf("sqrt(%f) = %f\n", n, x);

Symbol table tracked during interpretation: double n, double x, int ii, int iterations.

Kernelization overheads
- Kernelizing an application impacts performance: mapping, scheduling, buffer management, lost ILP
- Solution: merge kernels (see the sketch after this list)
  - Extract static kernel sequences
  - Schedule them statically at compile time
  - Replace the sequence with a merged kernel
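A minimal C sketch of why merging removes these overheads; it is illustrative only (brighten, binarize and the pipeline functions are made-up names, not part of the presented framework).

    #include <stddef.h>

    /* Two pixel kernels. */
    static unsigned char brighten(unsigned char p) { return p > 215 ? 255 : p + 40; }
    static unsigned char binarize(unsigned char p) { return p > 128 ? 255 : 0; }

    /* Unmerged: two passes, an intermediate buffer, two scheduling decisions. */
    static void pipeline_unmerged(const unsigned char *in, unsigned char *tmp,
                                  unsigned char *out, size_t n)
    {
        for (size_t i = 0; i < n; i++) tmp[i] = brighten(in[i]);
        for (size_t i = 0; i < n; i++) out[i] = binarize(tmp[i]);
    }

    /* Merged: one pass, no intermediate buffer, and the compiler can
       recover ILP across the two kernel bodies. */
    static void pipeline_merged(const unsigned char *in, unsigned char *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = binarize(brighten(in[i]));
    }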

Skeleton merging
- Skeletons are completely general functions, and so cannot be properly analyzed or reasoned about
- Restrict skeleton generality by using metaskeletons
- Skeletons using the same metaskeleton can be merged
- The merged operation still uses the original metaskeleton, and can be merged recursively

Example
- Philips Inca+ smart camera
  - 640x480 sensor
  - XeTaL: 16 MHz, 320-way SIMD
  - TriMedia: 180 MHz, 5-issue VLIW
- Ball detection application: filtering, segmentation, Hough transform

Results

Setup                      Time to process a frame (ms)
TriMedia baseline          133
TriMedia optimized         100
TriMedia kernelized        160   (overhead: buffers, scheduling, lost ILP)
TriMedia merged            134   (ILP not fully recovered)
TriMedia + XeTaL merged     54

Conclusion
- Stream programming is a natural fit for running image processing applications on distributed-memory systems
- Algorithmic skeletons efficiently exploit data parallelism by letting the user select the most restricted skeleton that supports their kernel
  - Extensible (new skeletons)
  - Retargetable (new skeleton implementations)
- PEPCI effectively combines what is needed to implement algorithmic skeletons efficiently
  - Term rewriting (by embedding Stratego)
  - Partial evaluation (to automatically separate rule composition from code generation)

Future Work
- Better merging of kernels: merge more efficiently, merge across different metaskeletons
- Implement on a more general architecture
- Implement more demanding applications, and more involved skeletons

End (backup slides follow)

Partial evaluation (2): Free optimizations
- Loop unrolling, if the conditions are known and the body isn't
- Function inlining
- Aggressive constant folding, including external "pure" functions (a before/after sketch follows)
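A conceptual before/after sketch of what these free optimizations buy, assuming a partial evaluator that knows the loop bound and treats sqrtf as a pure function; the functions and the exact folded constant are illustrative, not PEPCI output.

    #include <math.h>

    /* Before partial evaluation: a small loop with a known bound and an
       external "pure" function applied to a constant argument. */
    float scale_before(float p)
    {
        float gain = 1.0f;
        for (int k = 0; k < 3; k++)        /* bound is known: can be unrolled */
            gain *= 1.1f;
        return p * gain * sqrtf(2.0f);     /* sqrtf is pure: can be folded */
    }

    /* After partial evaluation (conceptually): loop unrolled, constants folded,
       only the part depending on the unknown pixel value p remains. */
    float scale_after(float p)
    {
        return p * 1.882318f;              /* 1.1^3 * sqrt(2), precomputed */
    }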

Kernel translation
- SIMD processors are not programmed in C, but in parallel C derivatives
- The skeleton should translate the kernel to the target language
- Extend PEPCI with C-derivative syntax, though it is only minimally interpreted

Example: local neighborhood operation in XTC

Kernel definition:

    NeighbourhoodToPixelOp()
    sobelx(in stream unsigned char i[-1..1][-1..1], out stream int *o)
    {
        int x, y, temp;

        temp = 0;
        for (y=-1; y < 2; y++)
            for (x=-1; x < 2; x=x+2)
                temp = temp + x*i[y][x];
        *o = temp;
    }

Resulting XTC code:

    static lmem _in2;
    static lmem _in1;
    {
        lmem temp;
        temp = (0)+((-1)*(_in2[-1..0]));
        temp = (temp)+((1)*(_in2[1..2]));
        temp = (temp)+((-1)*(_in1[-1..0]));
        temp = (temp)+((1)*(_in1[1..2]));
        temp = (temp)+((-1)*(larg0[-1..0]));
        temp = (temp)+((1)*(larg0[1..2]));
        larg1 = temp;
    }
    _in2 = _in1;
    _in1 = larg0;

Stream program

    int main(int argc, char **argv)
    {
        STREAM a, b, c;
        int maxval, dummy, maxc;

        scInit(argc, argv);
        while (1) {
            capture(&a);
            interpolate(&a, &a);
            sobelx(&a, &b);
            sobely(&a, &c);
            magnitude(&b, &c, &a);
            direction(&b, &c, &b);
            mask(&b, &a, &a, scint(128));
            hough(&a, &a);
            display(&a);
            imgMax(&a, scint(0), &maxval, scint(0), &dummy, scint(0), &maxc);
            _block(&maxc, &maxval);
            printf("Ball found at %d with strength %d\n", maxc, maxval);
        }
        return scExit();
    }

Programming with algorithmic skeletons (1)

    PixelToPixelOp()
    binarize(in stream int *i, out stream int *o, in int *threshold)
    {
        *o = (*i > *threshold);
    }

    NeighbourhoodToPixelOp()
    average(in stream int i[-1..1][-1..1], out stream int *o)
    {
        int x, y;

        *o = 0;
        for (y=-1; y < 2; y++)
            for (x=-1; x < 2; x++)
                *o += i[y][x];
        *o /= 9;
    }

Programming with algorithmic skeletons (2)

    StackOp(in stream int *init)
    propagate(in stream int *i[-1..1][-1..1], out stream int *o)
    {
        int x, y;

        for (y=-1; y < 2; y++)
            for (x=-1; x < 2; x++)
                if (i[y][x] && !*o) {
                    *o = 1;
                    push(y, x);
                }
    }

    AssocPixelReductionOp()
    max(in stream int *i, out int *res)
    {
        if (*i > *res)
            *res = *i;
    }

Algorithmic Skeletons (diagram slide)

Term rewriting (1): From code to abstract syntax tree

Source:

    acc += i[ky][kx];

Term representation (the slide draws the corresponding tree of Stat, AssignPlus, ArrayIndex and Id nodes):

    Stat(AssignPlus(Id("acc"), ArrayIndex(ArrayIndex(Id("i"), Id("ky")), Id("kx"))))