Embarrassingly Parallel Computations

Embarrassingly Parallel (or Pleasantly Parallel)

Characteristics
- The domain is divisible into a large number of independent parts.
- Minimal or no communication between parts.
- Each processor performs the same calculation independently.

"Nearly embarrassingly parallel"
- Communication is small relative to computation.
- Communication is limited to distributing the input data and gathering the results.
- The computation is time consuming and hides the communication cost.

Embarrassingly Parallel Examples

[Figure: in an embarrassingly parallel application, processes P0, P1, P2 compute with no interaction; in a nearly embarrassingly parallel application, a master P0 sends data to slaves P1, P2, P3 and receives their results.]

Low Level Image Processing

Storage
- A two-dimensional array of pixels.
- One bit, one byte, or three bytes may represent each pixel.
- Operations may involve only local data.

Image applications
- Shift: newX = x + delta_x; newY = y + delta_y
- Scale (by factor s): newX = x * s; newY = y * s
- Rotate (by angle theta): newX = x cos(theta) + y sin(theta); newY = -x sin(theta) + y cos(theta)
- Clip: newX = x if x_min <= x < x_max, 0 otherwise; newY = y if y_min <= y < y_max, 0 otherwise

Other applications
- Smoothing, edge detection, pattern matching
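Each of these operations is an independent function of a single pixel's coordinates, so every pixel can be processed with no reference to any other pixel. A minimal C sketch of the four transformations listed above (the Pixel type and function names are illustrative, not from the text):

#include <math.h>

typedef struct { int x, y; } Pixel;

/* Shift by (delta_x, delta_y). */
Pixel shift(Pixel p, int delta_x, int delta_y) {
    Pixel q = { p.x + delta_x, p.y + delta_y };
    return q;
}

/* Scale by factor s about the origin. */
Pixel scale(Pixel p, double s) {
    Pixel q = { (int)(p.x * s), (int)(p.y * s) };
    return q;
}

/* Rotate by angle theta (radians) about the origin. */
Pixel rotate(Pixel p, double theta) {
    Pixel q = { (int)( p.x * cos(theta) + p.y * sin(theta)),
                (int)(-p.x * sin(theta) + p.y * cos(theta)) };
    return q;
}

/* Clip: keep coordinates inside [min, max), otherwise set them to 0. */
Pixel clip(Pixel p, int x_min, int x_max, int y_min, int y_max) {
    Pixel q = { (p.x >= x_min && p.x < x_max) ? p.x : 0,
                (p.y >= y_min && p.y < y_max) ? p.y : 0 };
    return q;
}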

Process Partitioning

Partitioning might assign groups of rows, or groups of columns, of the image to processors.

Image Shifting Application (see code on page 84 of the text)

Master
- Send a starting row number to each slave.
- Initialize a new array to hold the shifted image.
- FOR each message received: update the new bitmap coordinates.

Slave
- Receive the starting row.
- Compute the translated coordinates and transmit them back to the master.

Questions
- Where is the initial image?
- What happens if a remote processor fails?
- How does the master decide how much work to assign to each processor?
- Is the load balanced (are all processors working equally)?
- Is the initial transmission of the row numbers needed?
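A hedged MPI sketch of this master/slave structure is shown below; it is not the textbook's page-84 program, and the image size, message tags, and fixed shift (DX, DY) are illustrative assumptions. Like the book's version, it sends one small message per pixel, which is exactly the inefficiency examined in the analysis that follows.

#include <mpi.h>

#define W  640                /* image width  (assumed) */
#define H  480                /* image height (assumed) */
#define DX 5                  /* shift amounts (assumed) */
#define DY 10

static int new_image[H][W];   /* master's array for the shifted image */

/* Assumes H is divisible by the number of slaves (nprocs - 1). */
void master(int nprocs) {
    int rows_per_slave = H / (nprocs - 1);
    /* Send each slave the first row it is responsible for. */
    for (int s = 1; s < nprocs; s++) {
        int start_row = (s - 1) * rows_per_slave;
        MPI_Send(&start_row, 1, MPI_INT, s, 0, MPI_COMM_WORLD);
    }
    /* Receive one (newX, newY) pair per original pixel and update the bitmap. */
    for (long i = 0; i < (long)H * W; i++) {
        int c[2];
        MPI_Recv(c, 2, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (c[0] >= 0 && c[0] < W && c[1] >= 0 && c[1] < H)
            new_image[c[1]][c[0]] = 1;
    }
}

void slave(int nprocs) {
    int start_row, rows_per_slave = H / (nprocs - 1);
    MPI_Recv(&start_row, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (int y = start_row; y < start_row + rows_per_slave; y++)
        for (int x = 0; x < W; x++) {
            int c[2] = { x + DX, y + DY };   /* translated coordinates */
            MPI_Send(c, 2, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }
}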

Analysis

Computation
- Host: 3 * rows * cols
- Slave: 2 * rows * cols / (P-1)

Communication (t_comm = t_startup + m * t_data)
1. Host: (t_startup + t_data) * (P-1) + rows * cols * (t_startup + 4 * t_data)
2. Slaves: (t_startup + t_data) + (rows * cols / (P-1)) * (t_startup + 4 * t_data)

Total
- T_s = 4 * rows * cols
- T_p = 3 * rows * cols + (t_startup + t_data) * (P-1) + rows * cols * (t_startup + 4 * t_data)
      = 3 * rows * cols + 2*(P-1) + 5 * rows * cols   (taking t_startup = t_data = 1)
      = 8 * rows * cols + 2*(P-1)
- S(p) = T_s / T_p < 1/2
- Computation/communication ratio = t_comp / t_comm = (3 * rows * cols) / (5 * rows * cols + 2*(P-1)) ≈ 3/5

Questions
- Can the transmission of the rows be done in parallel?
- How is it possible to reduce the communication cost?
- Is this an Amdahl or a Gustafson application?

(Program on page 84 of the text.)

Mandelbrot Set

Complex numbers
- a + bi, where i = sqrt(-1)

Complex plane
- Horizontal axis: real values
- Vertical axis: imaginary values

The Mandelbrot set is a set of points in the complex plane that are iterated using a prescribed function over a bounded area:
- The iteration stops when the function value reaches a limit, or
- the iteration count reaches a maximum.
- Each point gets a color according to its final iteration count.

Pseudo code

FOR each point c = c_x + i*c_y in a bounded area
    SET z = z_real + i*z_imag = 0 + i0
    SET iterations = 0
    DO
        SET z = f(z, c)
        SET value = (z_real^2 + z_imag^2)^(1/2)
        iterations++
    WHILE value < limit AND iterations < max
    point = (c_x, c_y) scaled to the display
    picture[point] = color[iterations]

Notes:
1. Set each point's color based on its final iteration count.
2. Some points converge quickly, others slowly, and others not at all.
3. The non-converging points (those that exceed the maximum iteration count) are said to lie in the Mandelbrot set (black on the previous slide).
4. A common Mandelbrot function is z = z^2 + c.
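A minimal C version of this loop body, using the common function z = z^2 + c (the constants MAX_ITER and LIMIT are assumed values):

#define MAX_ITER 256          /* assumed iteration limit */
#define LIMIT    2.0          /* assumed magnitude limit */

int mandel_iterations(double c_x, double c_y) {
    double z_real = 0.0, z_imag = 0.0;
    int iterations = 0;
    /* Comparing |z|^2 with LIMIT^2 avoids the square root in the pseudocode. */
    while (z_real * z_real + z_imag * z_imag <= LIMIT * LIMIT &&
           iterations < MAX_ITER) {
        double t = z_real * z_real - z_imag * z_imag + c_x;   /* real part of z^2 + c */
        z_imag  = 2.0 * z_real * z_imag + c_y;                /* imaginary part       */
        z_real  = t;
        iterations++;
    }
    return iterations;        /* index into the colour table */
}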

Scaling and Zooming

Display range of points
- From c_min = x_min + i*y_min to c_max = x_max + i*y_max

Display range of pixels
- From the pixel at (0, 0) to the pixel at (width, height)

Pseudo code

FOR pixel_x = 0 to width
    FOR pixel_y = 0 to height
        c_x = x_min + pixel_x * (x_max - x_min) / width
        c_y = y_min + pixel_y * (y_max - y_min) / height
        color = mandelbrot(c_x, c_y)
        picture[pixel_x][pixel_y] = color
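A sketch of this scaling in C, reusing mandel_iterations() from the previous sketch (the picture array layout is an assumption):

void compute_image(double x_min, double x_max, double y_min, double y_max,
                   int width, int height, int picture[width][height]) {
    double step_x = (x_max - x_min) / width;    /* real-axis step per pixel      */
    double step_y = (y_max - y_min) / height;   /* imaginary-axis step per pixel */
    for (int px = 0; px < width; px++)
        for (int py = 0; py < height; py++) {
            double c_x = x_min + px * step_x;
            double c_y = y_min + py * step_y;
            picture[px][py] = mandel_iterations(c_x, c_y);
        }
}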

Parallel Implementation

Load balancing
- Algorithms used to keep processors from becoming idle.
- Note: this does NOT mean that every processor has the same work load.

Static approach
- The load is partitioned once, at the start of the run.
- Mandelbrot: assign each processor a group of rows.
- Deficiencies of the book's approach:
  - Separate messages per coordinate.
  - No accounting for processes that fail.

Dynamic approach
- The load is partitioned during the run.
- Mandelbrot: slaves ask for more work when they complete a section.
- Improvement over the book's approach: ask for work before completion (double buffering).
- Question: how does the program terminate?

(Static and dynamic load balancing approaches are shown in Chapter 3.)
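A hedged MPI sketch of the dynamic (work-pool) approach: the master hands out one row at a time, slaves ask for more work by returning a finished row, and a sentinel row number of -1 answers the termination question. The tags, buffer sizes, and sentinel value are illustrative assumptions.

#include <mpi.h>

#define WIDTH      640
#define HEIGHT     480
#define TAG_ROW    1
#define TAG_RESULT 2

void master(int nprocs) {
    int next_row = 0, active = 0;
    int result[WIDTH + 1];                       /* row number followed by its colours */

    /* Prime every slave with one row (assumes HEIGHT >= nprocs - 1). */
    for (int s = 1; s < nprocs; s++) {
        MPI_Send(&next_row, 1, MPI_INT, s, TAG_ROW, MPI_COMM_WORLD);
        next_row++;
        active++;
    }
    while (active > 0) {
        MPI_Status st;
        MPI_Recv(result, WIDTH + 1, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                 MPI_COMM_WORLD, &st);
        /* ... store result[1..WIDTH] into row result[0] of the picture ... */
        if (next_row < HEIGHT) {                 /* more work for this slave */
            MPI_Send(&next_row, 1, MPI_INT, st.MPI_SOURCE, TAG_ROW, MPI_COMM_WORLD);
            next_row++;
        } else {                                 /* no more work: send the sentinel */
            int stop = -1;
            MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_ROW, MPI_COMM_WORLD);
            active--;
        }
    }
}

void slave(void) {
    int row, result[WIDTH + 1];
    for (;;) {
        MPI_Recv(&row, 1, MPI_INT, 0, TAG_ROW, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (row < 0) break;                      /* sentinel: terminate */
        result[0] = row;
        for (int x = 0; x < WIDTH; x++)
            result[x + 1] = 0;   /* replace with the Mandelbrot colour for pixel (x, row) */
        MPI_Send(result, WIDTH + 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
    }
}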

Analysis of Static Approach

Assumptions (different from the text)
- Slaves send a row at a time.
- Display time is equal to computation time.
- t_startup = t_data = 1.

Master
1. Computation: height * width
2. Communication: height * (t_startup + width * t_data) ≈ height * width

Slaves
1. Computation: avgIterations * (height / (P-1)) * width
2. Communication: (height / (P-1)) * (t_startup + width * t_data) ≈ height * width / (P-1)

Speed-up
- S(p) ≈ 2 * height * width * avgIterations / (avgIterations * height * width / (P-1) + height * width / (P-1)) ≈ P-1

Computation/communication ratio
- 2 * height * width * avgIterations / (height * (t_startup + width * t_data)) ≈ avgIterations

Monte Carlo Methods

Pseudo-code (throw darts to converge at a solution)

Compute a definite integral:
- WHILE more iterations are needed
  - Pick a random point
  - Evaluate the function at that point
  - Add the value to a running sum
- Compute the average

Calculation of PI:
- WHILE more iterations are needed
  - Randomly pick a point
  - IF the point is inside the circle: within++
- Compute PI = 4 * within / iterations

Parallel implementation
- Needs a parallel pseudo-random number generator (see below).
- Minimal communication requirements.
- Note: we can also use just the upper right quadrant.

Monte Carlo estimate of the integral (see text):
Integral of f(x) over [x_min, x_max] ≈ (x_max - x_min) * (1/N) * Σ_{i=1..N} f(x_i), where the x_i are N random picks in [x_min, x_max].
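A minimal sequential C sketch of the PI calculation above (a parallel version would additionally need an independent random stream per process, as discussed two slides below):

#include <stdio.h>
#include <stdlib.h>

double estimate_pi(long iterations) {
    long within = 0;
    for (long i = 0; i < iterations; i++) {
        double x = (double)rand() / RAND_MAX;   /* random point in the unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)               /* inside the quarter-circle       */
            within++;
    }
    return 4.0 * within / iterations;
}

int main(void) {
    srand(12345);                               /* arbitrary seed */
    printf("pi ~= %f\n", estimate_pi(10000000L));
    return 0;
}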

Computation of PI

- Integral of sqrt(1 - x^2) over 0 <= x <= 1 is π/4; over -1 <= x <= 1 it is π/2.
- Total points / points within the shape = total area / area of the shape.
- A point is within the circle if point.x^2 + point.y^2 <= 1.

Questions:
- How do we handle the boundary condition?
- What is the best accuracy that we can achieve?

Parallel Random Number Generator

Properties of a pseudo-random sequence
- Uniformly distributed, large period, repeatable, statistically independent.
- Each processor must generate a unique sequence.
- Accuracy depends on the precision of the random sequence.

Sequential linear congruential generator (commonly c = 0 and m prime)
1. x_{i+1} = (a * x_i + c) mod m  (e.g., a = 16807, m = 2^31 - 1, c = 0)
2. Many other generators are possible.

Parallel linear generator with unique sequences (leapfrog over P processes)
1. x_{i+P} = (A * x_i + C) mod m
2. A = a^P mod m, C = c * (a^(P-1) + a^(P-2) + ... + a + 1) mod m

[Figure: the base sequence x_1, x_2, ..., x_P, x_{P+1}, ..., x_{2P-1}, ... is dealt out so that each process takes every P-th number.]
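A hedged C sketch of this leapfrog scheme (the struct and function names are illustrative; in an MPI program the rank and process count would come from MPI_Comm_rank / MPI_Comm_size). Each process starts at a different point of the base sequence and then jumps P steps per call:

#include <stdint.h>

#define A0 16807ULL           /* base multiplier a    */
#define C0 0ULL               /* base increment  c    */
#define M  2147483647ULL      /* modulus m = 2^31 - 1 */

typedef struct { uint64_t A, C, x; } LeapfrogRNG;

/* Build the generator for process 'rank' out of P processes:
   A = a^P mod m, C = c*(a^(P-1) + ... + a + 1) mod m, and the starting
   value is obtained by advancing the base sequence 'rank' steps. */
LeapfrogRNG leapfrog_init(uint64_t seed, int rank, int P) {
    LeapfrogRNG g;
    uint64_t A = 1, S = 0;
    for (int i = 0; i < P; i++) {
        S = (S + A) % M;                  /* accumulates a^0 + a^1 + ... + a^(P-1) */
        A = (A * A0) % M;                 /* ends up as a^P mod m                  */
    }
    g.A = A;
    g.C = (C0 * S) % M;
    g.x = seed % M;
    for (int i = 0; i < rank; i++)        /* advance to this process's first value */
        g.x = (A0 * g.x + C0) % M;
    return g;
}

/* Each call jumps P steps along the base sequence. */
uint64_t leapfrog_next(LeapfrogRNG *g) {
    g->x = (g->A * g->x + g->C) % M;
    return g->x;
}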