Presentation transcript: Embarrassingly Parallel (or pleasantly parallel)

1 Embarrassingly Parallel (or pleasantly parallel)
–The domain is divisible into a large number of independent parts
–Minimal or no communication
–Each processor performs the same calculation independently
"Nearly embarrassingly parallel"
–Communication is small relative to the computation
–Communication limited to the distribution and gathering of data
–Computation is time consuming and hides the communication

2 Embarrassingly Parallel Examples
[Diagram: an embarrassingly parallel application with fully independent processes P1, P2, P3, and a nearly embarrassingly parallel application in which a master P0 sends data to and receives results from slave processes P1, P2, P3]

3 Low Level Image Processing
Storage
–A two-dimensional array of pixels
–A pixel may be represented by one bit, one byte, or three bytes
–Operations may involve only local data
Image Applications
–Shifting: newX = x + delta; newY = y + delta
–Scaling: newX = x * scale; newY = y * scale
–Rotating: newX = x cos θ + y sin θ; newY = -x sin θ + y cos θ
–Clipping: newX = x if x_min <= x < x_max, 0 otherwise; newY = y if y_min <= y <= y_max, 0 otherwise
Other Applications
–Smoothing, Edge Detection, Pattern Matching
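A minimal sequential C sketch of these per-pixel coordinate transforms; the Pixel struct, the function names, and the parameters (delta, scale factor, theta, clip bounds) are illustrative, not from the slides:

#include <math.h>

/* Apply the slide's coordinate transforms to a single pixel (x, y). */
typedef struct { int x, y; } Pixel;

Pixel shift(Pixel p, int delta)  { return (Pixel){ p.x + delta, p.y + delta }; }
Pixel scale(Pixel p, double s)   { return (Pixel){ (int)(p.x * s), (int)(p.y * s) }; }
Pixel rotate(Pixel p, double theta) {
    /* newX = x cos θ + y sin θ, newY = -x sin θ + y cos θ */
    return (Pixel){ (int)( p.x * cos(theta) + p.y * sin(theta)),
                    (int)(-p.x * sin(theta) + p.y * cos(theta)) };
}
Pixel clip(Pixel p, int xmin, int xmax, int ymin, int ymax) {
    return (Pixel){ (xmin <= p.x && p.x <  xmax) ? p.x : 0,
                    (ymin <= p.y && p.y <= ymax) ? p.y : 0 };
}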

4 Process Partitioning
[Diagram: a 1024 × 768 image partitioned into groups of 128 rows or columns, with one group assigned to each process (e.g. P21)]
Partitioning might assign groups of rows or columns to processors
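A small C sketch of how a master could compute each process's strip of rows. The function name partition_rows and the choice of giving any remainder to the last process are illustrative assumptions; 768 rows over 6 workers gives the 128-row strips of the diagram:

#include <stdio.h>

/* Divide `height` rows among `nprocs` worker processes, giving any
   remainder to the last worker. Prints each worker's row range. */
static void partition_rows(int height, int nprocs) {
    int rows_per_proc = height / nprocs;
    for (int p = 0; p < nprocs; p++) {
        int first = p * rows_per_proc;
        int last  = (p == nprocs - 1) ? height : first + rows_per_proc;
        printf("process %d: rows %d..%d\n", p, first, last - 1);
    }
}

int main(void) {
    partition_rows(768, 6);   /* a 768-row image over 6 workers: 128 rows each */
    return 0;
}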

5 Image Shifting Application (see code on page 84)
Master
–Send a starting row number to each slave
–Initialize a new array to hold the shifted image
–FOR each message received
  o Update the new bitmap coordinates
Slave
–Receive the starting row
–Compute the translated coordinates and transmit them back to the master
Questions
–Where is the initial image?
–What happens if a remote processor fails?
–How does the master decide how much to assign to each processor?
–Is the load balanced (are all processors working equally)?
–Is the initial transmission of the row numbers needed?
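A minimal message-passing sketch of this master/slave exchange, written with MPI for concreteness; it is not the textbook's page 84 program. The image size, shift amount, message tags, and the one-message-per-pixel reply (the structure a later slide criticizes) are illustrative assumptions:

#include <mpi.h>

#define WIDTH  640           /* illustrative image size */
#define HEIGHT 480
#define DELTA  5             /* shift amount (assumption) */

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int slaves = nprocs - 1;                     /* run with at least 2 processes */
    int rows_per_slave = HEIGHT / slaves;        /* assume HEIGHT divides evenly */

    if (rank == 0) {                             /* master */
        static int newmap[HEIGHT][WIDTH];        /* holds the shifted image */
        for (int s = 1; s <= slaves; s++) {      /* send each slave its starting row */
            int start = (s - 1) * rows_per_slave;
            MPI_Send(&start, 1, MPI_INT, s, 0, MPI_COMM_WORLD);
        }
        int coords[4];                           /* x, y, newX, newY */
        for (long i = 0; i < (long)HEIGHT * WIDTH; i++) {
            MPI_Recv(coords, 4, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (coords[2] < WIDTH && coords[3] < HEIGHT)
                newmap[coords[3]][coords[2]] = 1;   /* mark the shifted position */
        }
    } else {                                     /* slave */
        int start, coords[4];
        MPI_Recv(&start, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int y = start; y < start + rows_per_slave; y++)
            for (int x = 0; x < WIDTH; x++) {
                coords[0] = x;          coords[1] = y;
                coords[2] = x + DELTA;  coords[3] = y + DELTA;   /* the shift */
                MPI_Send(coords, 4, MPI_INT, 0, 1, MPI_COMM_WORLD);
            }
    }
    MPI_Finalize();
    return 0;
}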

6 Analysis (program on page 84)
Computation
–Host: 3 * rows * cols;  Slave: 2 * rows * cols / (P-1)
Communication (t_comm = t_startup + m * t_data)
1. Host: (t_startup + t_data) * (P-1) + rows * cols * (t_startup + 4 * t_data)
2. Slaves: (t_startup + t_data) + rows * cols / (P-1) * (t_startup + 4 * t_data)
Total (taking t_startup = t_data = 1)
–T_s = 4 * rows * cols
–T_p = 3 * rows * cols + (t_startup + t_data) * (P-1) + rows * cols * (t_startup + 4 * t_data)
     = 3 * rows * cols + 2 * (P-1) + 5 * rows * cols = 8 * rows * cols + 2 * (P-1)
–S(P) < 1/2
Computation/communication ratio = t_comp / t_comm = (3 * rows * cols) / (5 * rows * cols + 2 * (P-1)) ≈ 3/5
Questions
–Can the transmission of the rows be done in parallel?
–How is it possible to reduce the communication cost?
–Is this an Amdahl or a Gustafson application?
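As a concrete check of these formulas, take the 1024 × 768 image from the Process Partitioning slide and P = 8 (an illustrative processor count), with t_startup = t_data = 1:

T_s = 4 * 1024 * 768 ≈ 3.1 million time units
T_p = 8 * 1024 * 768 + 2 * 7 ≈ 6.3 million time units, so S(P) ≈ 0.5 (just under 1/2)
Computation/communication ratio = (3 * 1024 * 768) / (5 * 1024 * 768 + 14) ≈ 3/5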

7 Mandelbrot Set
Complex numbers
–a + bi where i = (-1)^(1/2)
Complex plane
–Horizontal axis: real values
–Vertical axis: imaginary values
The Mandelbrot Set is a set of complex-plane points found by iterating a prescribed function over a bounded area:
–The iteration stops when the function value reaches a limit
–The iteration stops when the iteration count reaches a limit
–Each point gets a color according to its final iteration count

8 Pseudo code
FOR each point c = c_x + i*c_y in a bounded area
   SET z = z_real + i*z_imaginary = 0 + i*0
   SET iterations = 0
   DO
      SET z = f(z, c)
      SET value = (z_real^2 + z_imaginary^2)^(1/2)
      iterations++
   WHILE value < limit AND iterations < max
   point = c_x and c_y scaled to the display
   picture[point] = color[iterations]
Notes:
1. Set each point's color based on its final iteration count
2. Some points converge quickly, others slowly, and others not at all
3. The non-converging points (those exceeding the maximum iterations) are said to lie in the Mandelbrot Set (black on the previous slide)
4. A common Mandelbrot function is z = z^2 + c
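A sequential C sketch of the inner loop for a single point, using the common function z = z^2 + c from note 4; the escape limit of 2 and the function name are illustrative:

/* Return the iteration count for one complex point c = cx + i*cy,
   iterating z = z^2 + c until |z| reaches the limit or max iterations. */
int mandelbrot_iterations(double cx, double cy, int max) {
    double zx = 0.0, zy = 0.0;        /* z starts at 0 + i0 */
    int iterations = 0;
    do {
        double new_zx = zx * zx - zy * zy + cx;   /* real part of z^2 + c */
        double new_zy = 2.0 * zx * zy + cy;       /* imaginary part of z^2 + c */
        zx = new_zx;
        zy = new_zy;
        iterations++;
    } while (zx * zx + zy * zy < 4.0 && iterations < max);   /* |z| < limit of 2 */
    return iterations;
}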

9 Scaling and Zooming
Display range of points
–From c_min = x_min + i*y_min to c_max = x_max + i*y_max
Display range of pixels
–From the pixel at (0, 0) to the pixel at (width, height)
Pseudo code
FOR pixel_x = 0 to width
   FOR pixel_y = 0 to height
      c_x = x_min + pixel_x * (x_max – x_min) / width
      c_y = y_min + pixel_y * (y_max – y_min) / height
      color = mandelbrot(c_x, c_y)
      picture[pixel_x][pixel_y] = color
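Putting the scaling and the iteration together in a sequential C sketch; it reuses the mandelbrot_iterations() function sketched above, and the display size and complex-plane bounds are illustrative values:

#include <stdio.h>

#define WIDTH    800          /* illustrative display size */
#define HEIGHT   600
#define MAX_ITER 256

int mandelbrot_iterations(double cx, double cy, int max);   /* from the earlier sketch */

static int picture[WIDTH][HEIGHT];

int main(void) {
    /* Bounded area of the complex plane to display (illustrative values). */
    double x_min = -2.0, x_max = 1.0, y_min = -1.5, y_max = 1.5;

    for (int px = 0; px < WIDTH; px++)
        for (int py = 0; py < HEIGHT; py++) {
            /* Scale pixel coordinates into the complex plane. */
            double cx = x_min + px * (x_max - x_min) / WIDTH;
            double cy = y_min + py * (y_max - y_min) / HEIGHT;
            picture[px][py] = mandelbrot_iterations(cx, cy, MAX_ITER);
        }

    printf("iteration count at the centre pixel: %d\n", picture[WIDTH / 2][HEIGHT / 2]);
    return 0;
}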

10 Parallel Implementation
Load balancing
–Algorithms used to keep processors from becoming idle
–Note: does NOT mean that every processor has the same work load
Static approach
–The load is partitioned once at the start of the run
–Mandelbrot: assign each processor a group of rows
–Deficiencies of the book's approach
  o Separate messages per coordinate
  o No accounting for processes that fail
Dynamic approach
–The load is partitioned during the run
–Mandelbrot: slaves ask for work when they complete a section (see the work-pool sketch below)
–Improvement over the book's approach
  o Ask for work before completion (double buffering)
–Question: How does the program terminate?
Static and dynamic load balancing approaches are shown in chapter 3
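A minimal sketch of the dynamic (work-pool) approach, again using MPI for concreteness: the master hands out one row at a time, and a negative row number serves as the termination signal. The tags, buffer layout, and termination convention are illustrative assumptions, not the book's code:

#include <mpi.h>

#define WIDTH      800
#define HEIGHT     600
#define MAX_ITER   256
#define WORK_TAG   1
#define RESULT_TAG 2

int mandelbrot_iterations(double cx, double cy, int max);   /* from the earlier sketch */

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                                  /* master: hand out rows on demand */
        static int picture[HEIGHT][WIDTH];
        int result[WIDTH + 1];                        /* row number followed by that row's colors */
        int next_row = 0, active = 0;
        MPI_Status status;

        /* Prime each slave with one row (assumes more rows than slaves). */
        for (int s = 1; s < nprocs && next_row < HEIGHT; s++) {
            MPI_Send(&next_row, 1, MPI_INT, s, WORK_TAG, MPI_COMM_WORLD);
            next_row++;
            active++;
        }
        while (active > 0) {
            MPI_Recv(result, WIDTH + 1, MPI_INT, MPI_ANY_SOURCE, RESULT_TAG, MPI_COMM_WORLD, &status);
            for (int x = 0; x < WIDTH; x++)
                picture[result[0]][x] = result[x + 1];
            int next = (next_row < HEIGHT) ? next_row++ : -1;   /* -1 = no more work */
            MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, WORK_TAG, MPI_COMM_WORLD);
            if (next < 0) active--;
        }
    } else {                                          /* slave: compute rows until told to stop */
        int row, result[WIDTH + 1];
        for (;;) {
            MPI_Recv(&row, 1, MPI_INT, 0, WORK_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (row < 0) break;                       /* termination signal */
            result[0] = row;
            for (int x = 0; x < WIDTH; x++)
                result[x + 1] = mandelbrot_iterations(-2.0 + x * 3.0 / WIDTH,
                                                      -1.5 + row * 3.0 / HEIGHT, MAX_ITER);
            MPI_Send(result, WIDTH + 1, MPI_INT, 0, RESULT_TAG, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}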

11 Analysis of Static Approach
Assumptions (different from the text)
–Slaves send a row at a time
–Display time is equal to computation time
–t_startup = t_data = 1
Master
1. Computation: height * width
2. Communication: height * (t_startup + width * t_data) ≈ height * width
Slaves
1. Computation: avgIterations * height / (P-1) * width
2. Communication: height / (P-1) * (t_startup + width * t_data) ≈ height * width / (P-1)
Speed-up
–S(P) ≈ 2 * height * width * avgIterations / (avgIterations * height * width / (P-1) + height * width / (P-1)) ≈ P-1
Computation/communication ratio
–2 * height * width * avgIterations / (height * (t_startup + width * t_data)) ≈ avgIterations

12 Monte Carlo Methods (Section 3.2.3 of the text)
Pseudo-code (throw darts to converge at a solution)
–Compute a definite integral
  WHILE more iterations needed
     Randomly pick a point
     Evaluate the function
     Add to the answer
  Compute the average: integral ≈ (x_max – x_min) * (1/N) * ∑_{i=1..N} f(pick_i.x)
–Calculation of PI
  WHILE more iterations needed
     Randomly pick a point
     IF the point is in the circle THEN within++
  Compute PI = 4 * within / iterations
Parallel Implementation
–Needs a parallel pseudo-random number generator (see the last slide)
–Minimal communication requirements
Note: we can also use only the upper right quadrant
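A sequential C sketch of the PI calculation using the upper-right quadrant; rand() merely stands in for the parallel generator of the last slide, and the iteration count and seed are illustrative:

#include <stdio.h>
#include <stdlib.h>

/* Estimate PI by sampling random points in the unit square [0,1] x [0,1]
   and counting how many fall inside the quarter circle. */
int main(void) {
    long iterations = 10000000, within = 0;
    srand(12345);                                /* fixed seed for repeatability */

    for (long i = 0; i < iterations; i++) {
        double x = (double)rand() / RAND_MAX;    /* random point in the unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)                /* inside the quarter circle */
            within++;
    }
    printf("PI approx %f\n", 4.0 * within / iterations);
    return 0;
}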

13 Computation of PI
∫ (1 - x^2)^(1/2) dx = π/4 for 0 <= x <= 1;  ∫ (1 - x^2)^(1/2) dx = π/2 for -1 <= x <= 1
Total points / Points within = Total area / Area in shape
A point counts as within if point.x^2 + point.y^2 <= 1
Questions:
–How do we handle the boundary condition?
–What is the best accuracy that we can achieve?

14 Parallel Random Number Generator
Numbers of a pseudo-random sequence should be:
–Uniformly distributed, with a large period, repeatable, and statistically independent
–Each processor must generate a unique sequence
–Accuracy depends upon the precision of the random sequence
Sequential linear generator (m is prime; c = 0)
1. x_{i+1} = (a * x_i + c) mod m (e.g. a = 16807, m = 2^31 – 1, c = 0)
2. Many other generators are possible
Parallel linear generator with unique sequences ("leapfrog" across P processes)
1. x_{i+P} = (A * x_i + C) mod m
2. A = a^P mod m, C = c * (a^{P-1} + a^{P-2} + … + a + 1) mod m
[Diagram: the sequence x_1, x_2, …, x_{P-1}, x_P, x_{P+1}, …, x_{2P-2}, x_{2P-1}, … with each process taking every P-th element]
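A small C sketch of the leapfrog idea: with c = 0 the constant C vanishes, so process k seeds its stream at x_{k+1} and then multiplies by A = a^P mod m to jump P places per draw. The process count, seed, and printing are illustrative, and 64-bit arithmetic keeps the modular products from overflowing:

#include <stdio.h>
#include <stdint.h>

#define A 16807ULL                 /* multiplier a */
#define M 2147483647ULL            /* modulus m = 2^31 - 1 */

/* a^p mod m, computed by repeated multiplication (p is small here). */
static uint64_t power_mod(uint64_t a, unsigned p, uint64_t m) {
    uint64_t result = 1;
    while (p--) result = (result * a) % m;
    return result;
}

int main(void) {
    const unsigned P = 4;                      /* number of processes (illustrative) */
    uint64_t big_a = power_mod(A, P, M);       /* A = a^P mod m: the leapfrog multiplier */

    /* Simulate what each process would do: process k starts at x_{k+1}
       (taking x_0 = 1 as the seed) and generates x_{k+1+P}, x_{k+1+2P}, ... */
    for (unsigned k = 0; k < P; k++) {
        uint64_t x = power_mod(A, k + 1, M);   /* starting element x_{k+1} */
        printf("process %u:", k);
        for (int i = 0; i < 3; i++) {
            printf(" %llu", (unsigned long long)x);
            x = (big_a * x) % M;               /* jump P elements ahead */
        }
        printf("\n");
    }
    return 0;
}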

