
Slide 1: Algorithmic Skeletons for Stream Programming in Embedded Heterogeneous Parallel Image Processing Applications
  IPDPS 2006, April 28, 2006
  Wouter Caarls, Pieter Jonker, Henk Corporaal
  Quantitative Imaging Group, Department of Imaging Science and Technology

Slide 2 (April 26, 2006): Overview
  Stream programming
  Writing stream kernels
  Algorithmic skeletons
  Writing algorithmic skeletons
  Skeleton merging
  Results
  Conclusion & Future work

Slide 3: Stream Programming
  FIFO-connected kernels processing series of data elements
    Well suited to signal processing applications
  Explicit communication and task decomposition
    Ideal for distributed-memory systems
  Each data element processed (mostly) independently
    Ideal for data-parallel systems such as SIMDs
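As a rough illustration of FIFO-connected kernels, the sketch below chains two kernels through a small ring buffer in plain C. It is not taken from the presentation: the names Fifo, FIFO_CAPACITY, scale_kernel and threshold_kernel are hypothetical, and a real stream runtime would schedule the kernels and size the buffers itself.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAPACITY 8  /* hypothetical buffer size, chosen for the example */

/* Fixed-size ring buffer standing in for the FIFO between two kernels. */
typedef struct {
    float data[FIFO_CAPACITY];
    size_t head, tail, count;
} Fifo;

static void fifo_init(Fifo *f) {
    assert(f != NULL);
    f->head = f->tail = f->count = 0;
}

/* Returns 1 on success, 0 if the FIFO is full. */
static int fifo_push(Fifo *f, float v) {
    if (f->count == FIFO_CAPACITY) return 0;
    f->data[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAPACITY;
    f->count++;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty. */
static int fifo_pop(Fifo *f, float *v) {
    if (f->count == 0) return 0;
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return 1;
}

/* Kernel 1: multiplies each element by a gain. Elements are independent. */
static void scale_kernel(Fifo *in, Fifo *out, float gain) {
    float v;
    while (out->count < FIFO_CAPACITY && fifo_pop(in, &v))
        fifo_push(out, v * gain);
}

/* Kernel 2: thresholds each element; returns the number of results written. */
static size_t threshold_kernel(Fifo *in, int *out, size_t max_out, float threshold) {
    size_t n = 0;
    float v;
    while (n < max_out && fifo_pop(in, &v))
        out[n++] = (v > threshold);
    return n;
}
```

The kernels communicate only through the FIFOs, so each could run on a different processor with its own memory; because every element is handled independently, each kernel could also process several elements at once on a SIMD unit.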

Slide 4: Kernel Examples from Image Processing
  Pixel processing (color space conversion)
    Perfect match
  Local neighborhood processing (convolution)
    Requires 2D access
  Recursive neighborhood processing (distance transform)
    Regular data dependencies
  Stack processing (region growing)
    Irregular data dependencies
  (Ordered by increasing generality and architectural requirements)

Slide 5: Writing Kernels
  The language for writing kernels should be restricted
    To allow efficient compilation to constrained architectures
  But also general
    So many different algorithms can be specified
  -> Solution: a different language for each type of kernel
    User selects the most restricted language that supports their kernel
  Retargetability, efficiency, ease of use

Slide 6: Algorithmic skeletons* as kernel languages
  An algorithmic skeleton captures a pattern of computation
  Is conceptually a higher-order function, repetitively calling a kernel function with certain parameters
    Iteration strategy may be parallel
    Kernel parameters restrict dependencies
  Provides the environment in which the kernel runs, and can be seen as a very restricted DSL

  * M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation, 1989
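The higher-order-function view can be sketched in plain C with a function pointer. This is only an analogy: in the presentation the skeleton transforms the kernel source rather than calling it at run time. The names PixelKernel and pixel_to_pixel_op are hypothetical; binarize follows the kernel of the same name shown on a later slide.

```c
#include <assert.h>
#include <stddef.h>

/* A pixel kernel sees exactly one input pixel and one output pixel,
   so it cannot introduce dependencies between pixels. */
typedef void (*PixelKernel)(const int *in, int *out, const void *param);

/* The skeleton owns the iteration strategy. The iterations are
   independent, so this loop could equally be run in parallel. */
static void pixel_to_pixel_op(PixelKernel kernel, const int *in, int *out,
                              size_t n, const void *param) {
    assert(kernel != NULL && in != NULL && out != NULL);
    for (size_t p = 0; p < n; p++)
        kernel(&in[p], &out[p], param);
}

/* Example kernel: 1 where the pixel exceeds the threshold, else 0. */
static void binarize(const int *i, int *o, const void *param) {
    const int *threshold = param;
    *o = (*i > *threshold);
}
```

A call such as pixel_to_pixel_op(binarize, in, out, n, &threshold) then applies the kernel to every pixel. The kernel signature is what restricts the dependencies: the kernel has no way to reach neighbouring pixels.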

Slide 7: Sequential neighborhood skeleton

  Kernel definition:

    NeighborhoodToPixelOp()
    Average(in stream float i[-1..1][-1..1], out stream float *o)
    {
      int ky, kx;
      float acc=0;
      for (ky=-1; ky <=1; ky++)
        for (kx=-1; kx <=1; kx++)
          acc += i[ky][kx];
      *o = acc/9;
    }

  Resulting operation (kernel + skeleton):

    void Average(float **i, float **o)
    {
      for (int y=1; y < HEIGHT-1; y++)
        for (int x=1; x < WIDTH-1; x++)
        {
          float acc=0;
          acc += i[y-1][x-1]; acc += i[y-1][x ]; acc += i[y-1][x+1];
          acc += i[y  ][x-1]; acc += i[y  ][x ]; acc += i[y  ][x+1];
          acc += i[y+1][x-1]; acc += i[y+1][x ]; acc += i[y+1][x+1];
          o[y][x] = acc/9;
        }
    }

Slide 8: Skeleton tasks
  Implement structure
    Outer loop, border handling, buffering, parallel implementation
    -> Just write C code
  Transform kernel
    Stream access, translation to target language
    -> Term rewriting
  How to combine in a single language?
    -> Partial evaluation

Slide 9: Term rewriting (1)
  Input:
    *o = acc/9;
  Rewrite rule (applied topdown to all nodes):
    replace(`o`, `&o[y][x]`);
  Output:
    o[y][x] = acc/9;

Slide 10: Term rewriting (2)
  Using Stratego*
  Input:
    acc += i[ky][kx];
  Rewrite rule (applied topdown to all nodes):
    RelativeToAbsolute:
      |[ i[~e1][~e2] ]| -> |[ i[y + ~e1][x + ~e2] ]|
  Output:
    acc += i[y+ky][x+kx];

  * E. Visser. Stratego: A language for program transformation based on rewriting strategies, 2001
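To make the rewrite step concrete, here is a minimal C sketch of a topdown traversal applying the RelativeToAbsolute rule to a tiny expression tree. It only mimics what the Stratego rule does and is not the authors' implementation; the Node type and the helpers id, add, idx, topdown and emit are hypothetical, and nodes are never freed to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { ID, ADD, INDEX } Kind;

typedef struct Node {
    Kind kind;
    const char *name;        /* ID only */
    struct Node *lhs, *rhs;  /* ADD: operands; INDEX: array and subscript */
} Node;

static Node *node(Kind k, const char *name, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->kind = k; n->name = name; n->lhs = l; n->rhs = r;
    return n;
}
static Node *id(const char *s)     { return node(ID, s, NULL, NULL); }
static Node *add(Node *l, Node *r) { return node(ADD, NULL, l, r); }
static Node *idx(Node *a, Node *s) { return node(INDEX, NULL, a, s); }

static int is_id(const Node *n, const char *s) {
    return n != NULL && n->kind == ID && strcmp(n->name, s) == 0;
}

/* Rule: stream[e1][e2] -> stream[y + e1][x + e2], rewritten in place. */
static void relative_to_absolute(Node *n, const char *stream) {
    if (n->kind == INDEX && n->lhs->kind == INDEX && is_id(n->lhs->lhs, stream)) {
        n->lhs->rhs = add(id("y"), n->lhs->rhs);
        n->rhs = add(id("x"), n->rhs);
    }
}

/* Topdown strategy: try the rule at this node, then visit the children. */
static void topdown(Node *n, const char *stream) {
    if (n == NULL) return;
    relative_to_absolute(n, stream);
    topdown(n->lhs, stream);
    topdown(n->rhs, stream);
}

/* Pretty-printer: appends the expression as C source text to buf. */
static void emit(const Node *n, char *buf, size_t cap) {
    switch (n->kind) {
    case ID:
        strncat(buf, n->name, cap - strlen(buf) - 1);
        break;
    case ADD:
        emit(n->lhs, buf, cap);
        strncat(buf, "+", cap - strlen(buf) - 1);
        emit(n->rhs, buf, cap);
        break;
    case INDEX:
        emit(n->lhs, buf, cap);
        strncat(buf, "[", cap - strlen(buf) - 1);
        emit(n->rhs, buf, cap);
        strncat(buf, "]", cap - strlen(buf) - 1);
        break;
    }
}
```

Building the tree for i[ky][kx] with idx(idx(id("i"), id("ky")), id("kx")), running topdown on it and printing it with emit yields i[y+ky][x+kx], matching the output on the slide.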

Slide 11: PEPCI (1)
  Rule composition and code generation in C

  Rule definition:

    stratego RelativeToAbsolute(code i, code body)
    {
      main = (body)
      RelativeToAbsolute':
        |[ ~i[~e1][~e2] ]| -> |[ ~i[y + ~e1][x + ~e2] ]|
    }

  Rule composition:

    for (a=0; a < arguments; a++)
      if (args[a].type == ARG_STREAM_IN)
        body = RelativeToAbsolute(args[a].id, body);
      else if (args[a].type == ARG_STREAM_OUT)
        body = DerefToArrayIndex(args[a].id, body);

  Code generation:

    for (y=1; y < HEIGHT-1; y++)
      for (x=1; x < WIDTH-1; x++)
        @body;

Slide 12: PEPCI (2)
  Combining rule composition and code generation
  How to distinguish rule composition from code generation?

    for (a=0; a < arguments; a++)
      body = DerefToArrayIndex(args[a].id, body);
    for (x=0; x < stride; x++)
      @body;

  -> Partial evaluation: evaluate only the parts of the program that are known. Output the rest.
    arguments is known, DerefToArrayIndex is known, args[a].id is known, body is known -> evaluate
    stride is unknown -> output
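The "evaluate what is known, output the rest" rule can be illustrated on arithmetic expressions. The following C sketch is a toy partial evaluator, not PEPCI itself: the Expr and Binding types and the functions mk, peval and show are hypothetical, and the example expression stride * (arguments + 1) is made up for illustration (only the variable names come from the slide).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum { CONST, VAR, ADD, MUL } Kind;

typedef struct Expr {
    Kind kind;
    int value;            /* CONST only */
    const char *name;     /* VAR only */
    struct Expr *l, *r;   /* ADD and MUL only */
} Expr;

/* A variable whose value is known at specialization time. */
typedef struct { const char *name; int value; } Binding;

static Expr *mk(Kind k, int v, const char *name, Expr *l, Expr *r) {
    Expr *e = malloc(sizeof *e);
    assert(e != NULL);
    e->kind = k; e->value = v; e->name = name; e->l = l; e->r = r;
    return e;
}

/* Evaluate the parts that are known; rebuild (output) the parts that are not. */
static Expr *peval(const Expr *e, const Binding *known, size_t n) {
    if (e->kind == CONST) return mk(CONST, e->value, NULL, NULL, NULL);
    if (e->kind == VAR) {
        for (size_t k = 0; k < n; k++)
            if (strcmp(known[k].name, e->name) == 0)
                return mk(CONST, known[k].value, NULL, NULL, NULL);
        return mk(VAR, 0, e->name, NULL, NULL);  /* unknown: stays in the output */
    }
    Expr *l = peval(e->l, known, n);
    Expr *r = peval(e->r, known, n);
    if (l->kind == CONST && r->kind == CONST) {  /* both operands known: evaluate */
        int v = (e->kind == ADD) ? l->value + r->value : l->value * r->value;
        free(l);
        free(r);
        return mk(CONST, v, NULL, NULL, NULL);
    }
    return mk(e->kind, 0, NULL, l, r);           /* otherwise: residual code */
}

/* Appends the residual expression as text to buf. */
static void show(const Expr *e, char *buf, size_t cap) {
    size_t len = strlen(buf);
    if (e->kind == CONST) {
        snprintf(buf + len, cap - len, "%d", e->value);
    } else if (e->kind == VAR) {
        snprintf(buf + len, cap - len, "%s", e->name);
    } else {
        strncat(buf, "(", cap - strlen(buf) - 1);
        show(e->l, buf, cap);
        strncat(buf, e->kind == ADD ? "+" : "*", cap - strlen(buf) - 1);
        show(e->r, buf, cap);
        strncat(buf, ")", cap - strlen(buf) - 1);
    }
}
```

With arguments bound to 3 and stride left unbound, specializing stride * (arguments + 1) gives the residual expression (stride*4): the known subexpression is evaluated, and the part that depends on stride is emitted.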

Slide 13: PEPCI (3)
  Partial evaluation by interpretation

  Input:

    double n, x=1;
    int ii, iterations=3;
    scanf("%lf", &n);
    for (ii=0; ii < iterations; ii++)
      x = (x + n/x)/2;
    printf("sqrt(%f) = %f\n", n, x);

  Output:

    double n;
    double x;
    int ii;
    int iterations;
    x = 1;
    iterations = 3;
    scanf("%lf", &n);
    ii = 0;
    x = (1 + n/1)/2;
    ii = 1;
    x = (x + n/x)/2;
    ii = 2;
    x = (x + n/x)/2;
    ii = 3;
    printf("sqrt(%f) = %f\n", n, x);

  Symbol table during interpretation (? = unknown), successive states:

    double n   double x   int ii   int iterations
    ?          1          ?        3
    ?          1          0        3
    ?          ?          0        3
    ?          ?          1        3
    ?          ?          2        3
    ?          ?          3        3
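The two programs on this slide can be compared directly. The C sketch below wraps them in functions with the hypothetical names newton_sqrt and newton_sqrt_3; it only shows that the residual program, with the loop unrolled and the known value x = 1 folded into the first step, computes the same result as the original for iterations = 3.

```c
#include <assert.h>

/* Original program: the loop bound is an ordinary run-time variable. */
static double newton_sqrt(double n, int iterations) {
    assert(n > 0 && iterations >= 0);
    double x = 1;
    for (int ii = 0; ii < iterations; ii++)
        x = (x + n / x) / 2;
    return x;
}

/* Residual program for iterations == 3. The loop counter was known, so the
   loop is gone; x was known to be 1 in the first step, so it was folded in.
   n is only available at run time, so every expression using it remains. */
static double newton_sqrt_3(double n) {
    assert(n > 0);
    double x;
    x = (1 + n / 1) / 2;
    x = (x + n / x) / 2;
    x = (x + n / x) / 2;
    return x;
}
```

For n = 2 both versions return 577/408, approximately 1.414216. The specialized version no longer tests or increments a loop counter.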

Slide 14: Kernelization overheads
  Kernelizing an application impacts performance
    Mapping
    Scheduling
    Buffer management
    Lost ILP
  -> Merge kernels
    Extract static kernel sequences
    Statically schedule at compile-time
    Replace sequence with merged kernel
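The effect of replacing a static kernel sequence with a merged kernel can be sketched in plain C. The kernels invert and binarize, the buffer size MAX_PIXELS, and the driver functions are all hypothetical; the presentation performs the merge on skeleton-generated code at compile time, whereas this example merges by hand.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PIXELS 1024  /* hypothetical size of the intermediate stream buffer */

static void invert(const int *i, int *o)   { *o = 255 - *i; }
static void binarize(const int *i, int *o) { *o = (*i > 128); }

/* Kernelized version: two separately scheduled passes connected by a
   buffer that holds the whole intermediate stream. */
static void run_separate(const int *in, int *out, size_t n) {
    static int tmp[MAX_PIXELS];
    assert(n <= MAX_PIXELS);
    for (size_t p = 0; p < n; p++) invert(&in[p], &tmp[p]);
    for (size_t p = 0; p < n; p++) binarize(&tmp[p], &out[p]);
}

/* Merged kernel: the sequence invert -> binarize fixed at compile time.
   The intermediate value lives in a local variable instead of a buffer. */
static void invert_binarize(const int *i, int *o) {
    int t;
    invert(i, &t);
    binarize(&t, o);
}

/* Merged version: one pass, no intermediate buffer, one kernel to schedule. */
static void run_merged(const int *in, int *out, size_t n) {
    for (size_t p = 0; p < n; p++) invert_binarize(&in[p], &out[p]);
}
```

Both drivers produce the same output. The merged one needs no intermediate buffer or inter-kernel scheduling, and gives the compiler a larger loop body in which to look for instruction-level parallelism.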

Slide 15: Skeleton merging
  Skeletons are completely general functions
    Cannot be properly analyzed or reasoned about
  -> Restrict skeleton generality by using metaskeletons
    Skeletons using the same metaskeleton can be merged
    Merged operation still uses the original metaskeleton, and can be recursively merged

Slide 16: Example
  Philips Inca+ smart camera
    640x480 sensor
    XeTaL: 16 MHz, 320-way SIMD
    TriMedia: 180 MHz, 5-issue VLIW
  Ball detection
    Filtering, Segmentation, Hough transform

Slide 17: Results

  Setup                      Time to process a frame (ms)
  TriMedia baseline          133
  TriMedia optimized         100
  TriMedia kernelized        160
  TriMedia merged            134
  TriMedia + XeTaL merged     54

  Annotations on the slide:
    Buffers, Scheduling, ILP
    ILP not fully recovered

Slide 18: Conclusion
  Stream programming is a natural fit for running image processing applications on distributed-memory systems
  Algorithmic skeletons efficiently exploit data parallelism, by allowing the user to select the most restricted skeleton that supports their kernel
    Extensible (new skeletons)
    Retargetable (new skeleton implementations)
  PEPCI effectively combines the necessities of efficiently implementing algorithmic skeletons
    Term rewriting (by embedding Stratego)
    Partial evaluation (to automatically separate rule composition and code generation)

Slide 19: Future Work
  Better merging of kernels
    Merge more efficiently
    Merge different metaskeletons
  Implement on a more general architecture
  Implement more demanding applications
    And more involved skeletons

Slide 20: End

Slide 21: Partial evaluation (2)
  Free optimizations
    Loop unrolling
      If the conditions are known, and the body isn't
    Function inlining
    Aggressive constant folding
      Including external "pure" functions

Slide 22: Kernel translation
  SIMD processors are not programmed in C, but in parallel derivatives
  Skeleton should translate kernel to target language
  -> Extend PEPCI with C derivative syntax
    Though only minimally interpreted

Slide 23: Example: local neighborhood operation in XTC

  Kernel definition:

    NeighbourhoodToPixelOp()
    sobelx(in stream unsigned char i[-1..1][-1..1], out stream int *o)
    {
      int x, y, temp;
      temp = 0;
      for (y=-1; y < 2; y++)
        for (x=-1; x < 2; x=x+2)
          temp = temp + x*i[y][x];
      *o = temp;
    }

  Generated XTC code:

    static lmem _in2;
    static lmem _in1;
    {
      lmem temp;
      temp = (0)+((-1)*(_in2[-1.. 0]));
      temp = (temp)+((1)*(_in2[1.. 2]));
      temp = (temp)+((-1)*(_in1[-1.. 0]));
      temp = (temp)+((1)*(_in1[1.. 2]));
      temp = (temp)+((-1)*(larg0[-1.. 0]));
      temp = (temp)+((1)*(larg0[1.. 2]));
      larg1 = temp;
    }
    _in2 = _in1;
    _in1 = larg0;

Slide 24: Stream program

    int main(int argc, char **argv)
    {
      STREAM a, b, c;
      int maxval, dummy, maxc;

      scInit(argc, argv);

      while (1)
      {
        capture(&a);
        interpolate(&a, &a);
        sobelx(&a, &b);
        sobely(&a, &c);
        magnitude(&b, &c, &a);
        direction(&b, &c, &b);
        mask(&b, &a, &a, scint(128));
        hough(&a, &a);
        display(&a);
        imgMax(&a, scint(0), &maxval, scint(0), &dummy, scint(0), &maxc);
        _block(&maxc, &maxval);
        printf("Ball found at %d with strength %d\n", maxc, maxval);
      }

      return scExit();
    }

Slide 25: Programming with algorithmic skeletons (1)

    PixelToPixelOp()
    binarize(in stream int *i, out stream int *o, in int *threshold)
    {
      *o = (*i > *threshold);
    }

    NeighbourhoodToPixelOp()
    average(in stream int i[-1..1][-1..1], out stream int *o)
    {
      int x, y;
      *o = 0;
      for (y=-1; y < 2; y++)
        for (x=-1; x < 2; x++)
          *o += i[y][x];
      *o /= 9;
    }

Slide 26: Programming with algorithmic skeletons (2)

    StackOp(in stream int *init)
    propagate(in stream int *i[-1..1][-1..1], out stream int *o)
    {
      int x, y;
      for (y=-1; y < 2; y++)
        for (x=-1; x < 2; x++)
          if (i[y][x] && !*o)
          {
            *o = 1;
            push(y, x);
          }
    }

    AssocPixelReductionOp()
    max(in stream int *i, out int *res)
    {
      if (*i > *res)
        *res = *i;
    }
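One possible reading of AssocPixelReductionOp is a reduction skeleton that may regroup the work because the kernel is associative. The C sketch below is an assumption about how such a skeleton could behave, not the authors' implementation: ReductionKernel, LANES and assoc_pixel_reduction_op are hypothetical names, and the lanes are simulated sequentially.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define LANES 4  /* hypothetical number of parallel lanes */

/* Same shape as the max kernel on the slide: fold one pixel into res. */
typedef void (*ReductionKernel)(const int *i, int *res);

static void max_kernel(const int *i, int *res) {
    if (*i > *res)
        *res = *i;
}

/* Each lane reduces its own share of the pixels, then the partial results
   are folded with the same kernel. The regrouping is only valid because
   the kernel is associative; the pixels are also spread over the lanes in
   interleaved order, which additionally needs commutativity (max has both). */
static int assoc_pixel_reduction_op(ReductionKernel kernel, const int *in,
                                    size_t n, int identity) {
    int partial[LANES];
    assert(kernel != NULL && in != NULL);
    for (size_t l = 0; l < LANES; l++)
        partial[l] = identity;
    for (size_t p = 0; p < n; p++)        /* lanes would run concurrently */
        kernel(&in[p], &partial[p % LANES]);
    int res = identity;
    for (size_t l = 0; l < LANES; l++)    /* combine the partial results */
        kernel(&partial[l], &res);
    return res;
}
```

Calling assoc_pixel_reduction_op(max_kernel, pixels, n, INT_MIN) returns the largest pixel value. The identity argument plays the role of the initial value of res, which the slide leaves implicit.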

Slide 27: Algorithmic Skeletons
  [Figure: three diagrams, each combining threshold tests (<=t, >t) with an accumulation (+=)]

Slide 28: Term rewriting (1)
  From code to abstract syntax tree
  Code:
    acc += i[ky][kx];
  Term:
    Stat(AssignPlus(Id("acc"),ArrayIndex(ArrayIndex(Id("i"),Id("ky")), Id("kx"))))
  [Figure: the corresponding tree, with nodes Stat, AssignPlus, Id, ArrayIndex and leaves "acc", "i", "ky", "kx"]
