Embarrassingly Parallel (or pleasantly parallel)


Embarrassingly Parallel (or pleasantly parallel)
Characteristics
– Domain is divisible into a large number of independent parts
– Little or no communication between processors
– Each processor performs the same calculation independently
"Nearly embarrassingly parallel"
– Communication is limited to distributing and gathering data
– Computation dominates the communication
Definition: problems that scale well to thousands of processors

Embarrassingly Parallel Examples
[Diagram: an embarrassingly parallel application runs on P0, P1, P2 with no interaction; a nearly embarrassingly parallel application has P0 send data to P1, P2, P3 and receive results back from them]

Low Level Image Processing
Storage
– A two-dimensional array of pixels
– One bit, one byte, or three bytes may represent each pixel
– Operations may involve only local data
Image Applications
– Shift: newX = x + deltaX; newY = y + deltaY
– Scale: newX = x * scale; newY = y * scale
– Rotate a point about the origin: newX = x·cosθ + y·sinθ; newY = −x·sinθ + y·cosθ
– Clip: newX = x if x_min ≤ x ≤ x_max, 0 otherwise; newY = y if y_min ≤ y ≤ y_max, 0 otherwise
Note: Does not include communication to a graphics adapter
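A minimal C sketch of these per-pixel transformations (the function names are illustrative, not from the slides; each output depends only on its own input coordinates, which is what makes the work embarrassingly parallel):

#include <math.h>

/* Rotate the point (x, y) about the origin by theta radians. */
void rotate(double x, double y, double theta, double *newX, double *newY) {
    *newX =  x * cos(theta) + y * sin(theta);
    *newY = -x * sin(theta) + y * cos(theta);
}

/* Shift a point by (deltaX, deltaY). */
void shift(int x, int y, int deltaX, int deltaY, int *newX, int *newY) {
    *newX = x + deltaX;
    *newY = y + deltaY;
}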

Non-trivial Image Processing
Smoothing
– A function that preserves important patterns while eliminating noise or artifacts
– Linear smoothing: apply a linear transformation to the picture
– Convolution over an m×n neighborhood: P_new(x,y) = ∑_{j=0..m−1} ∑_{k=0..n−1} P_old(x+j, y+k) · f(j,k)
Edge Detection
– A function that searches for discontinuities or variations in depth, surface orientation, or color
– Purpose: significantly reduce the follow-up processing
– Uses: pattern recognition and computer vision
– One approach: differentiate the image to identify large changes
Pattern Matching
– Match an image against a template or a group of features
– Example: ∑_{i=0..X} ∑_{j=0..Y} (Picture(x+i, y+j) − Template(i,j))
Note: This is another digital signal processing application
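A sketch of the convolution sum in C, assuming an m×n kernel and skipping border pixels for brevity (all names are illustrative):

/* new_img(x,y) = sum over the m x n neighborhood of old, weighted by f.
   Each output pixel reads a fixed input neighborhood and writes once,
   so rows can be assigned to processors independently. */
void convolve(int rows, int cols, const double *old, double *new_img,
              int m, int n, const double *f) {
    for (int x = m/2; x < rows - m/2; x++)
        for (int y = n/2; y < cols - n/2; y++) {
            double sum = 0.0;
            for (int j = 0; j < m; j++)
                for (int k = 0; k < n; k++)
                    sum += old[(x + j - m/2)*cols + (y + k - n/2)] * f[j*n + k];
            new_img[x*cols + y] = sum;
        }
}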

Array Storage
The C language stores arrays in row-major order; Matlab and Fortran use column-major order
Loops can be extremely slow in C if the outer loop processes columns, because of how the system memory cache operates
int A[2][3] = { {1, 2, 3}, {4, 5, 6} };
In memory: 1 2 3 4 5 6
int A[2][3][2] = {{{1,2}, {3,4}, {5,6}}, {{7,8}, {9,10}, {11,12}}};
In memory: 1 2 3 4 5 6 7 8 9 10 11 12
Translate multi-dimensional indices to single-dimension offsets
– Two dimensions: offset = row*COLS + column
– Three dimensions: offset = i*DIM2*DIM3 + j*DIM3 + k
– What is the formula for four dimensions?
Row-major: each row (fixed leftmost index) is stored contiguously, one row after another
Column-major: each column (fixed rightmost index) is stored contiguously, one column after another
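A small C illustration of why the loop order matters and how the offset formulas extend (the four-dimensional line answers the question above; the array sizes are arbitrary):

#define ROWS 1024
#define COLS 1024
double A[ROWS][COLS];

/* Cache-friendly in C: the inner loop visits consecutive addresses. */
double sum_row_major(void) {
    double total = 0.0;
    for (int row = 0; row < ROWS; row++)
        for (int col = 0; col < COLS; col++)   /* stride-1 accesses */
            total += A[row][col];
    return total;           /* swapping the loops makes every access stride by COLS */
}

/* Flat-array offsets; the pattern simply nests:
   2-D: row*COLS + col
   3-D: i*DIM2*DIM3 + j*DIM3 + k
   4-D: i*DIM2*DIM3*DIM4 + j*DIM3*DIM4 + k*DIM4 + l */
int offset2(int row, int col) { return row * COLS + col; }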

Process Partitioning
Partitioning might assign groups of rows or columns to processors
[Diagram: a 1024×1024 image drawn as an 8×8 grid of cells, 128 rows and 128 columns per displayed cell]
Row-major pixel numbering examples:
– 8-column image: rows hold pixels 0: 0-7, 1: 8-15, 2: 16-23; pixel 21 falls at row 2, column 5
– 1024-column image: rows hold pixels 0: 0-1023, 1: 1024-2047, 2: 2048-3071; pixel 2053 falls at row 2, column 5

Typical Static Partitioning
Master
– Scatter or broadcast the image along with the assigned processor rows
– Gather the updated data back and perform final updates if necessary
Slave
– Receive data
– Compute translated coordinates
– Participate in the collective gather operation
Questions
– How does the master decide how much to assign to each processor?
– Is the load balanced (all processors working equally)?
Notes on the text's shift example
– It employs individual sends/receives, which is much slower
– However, if coordinate positions change or results do not represent contiguous pixel positions, this might be required
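A minimal MPI sketch of this static pattern, assuming the number of processes divides ROWS evenly and substituting a trivial per-pixel operation (inversion) for the slide's coordinate translation:

#include <mpi.h>
#include <stdlib.h>

#define ROWS 1024
#define COLS 1024

int main(int argc, char **argv) {
    static unsigned char image[ROWS * COLS];   /* meaningful only on rank 0 */
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* rank 0 acts as the master */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = (ROWS / nprocs) * COLS;        /* assumes nprocs divides ROWS */
    unsigned char *mine = malloc(chunk);

    /* Master distributes contiguous row blocks; every rank receives one. */
    MPI_Scatter(image, chunk, MPI_UNSIGNED_CHAR,
                mine,  chunk, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++)            /* independent per-pixel work */
        mine[i] = 255 - mine[i];

    /* Collective gather pulls the updated blocks back to the master. */
    MPI_Gather(mine,  chunk, MPI_UNSIGNED_CHAR,
               image, chunk, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    free(mine);
    MPI_Finalize();
    return 0;
}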

Mandelbrot Set
Definition: those points c = (x,y) = x + iy in the complex plane for which the iteration z_{n+1} = z_n² + c, starting from z_0 = 0 + 0i, remains bounded
Implementation
– Display the complex plane: the horizontal axis holds real values, the vertical axis imaginary values
– For each point c = (x,y) with x and y in [-2,+2], iterate z_n until either
   – the iteration count reaches a limit (the point is taken to be in the set), or
   – z_n is out of bounds, |z_n| > 2 (the point is not in the set)
– Save the iteration count, which maps to a display color

Scaling and Zooming
Display range of points
– From c_min = x_min + i·y_min to c_max = x_max + i·y_max
Display range of pixels
– From the pixel at (0,0) to the pixel at (ROWS−1, COLUMNS−1)
Pseudo code
For row = row_start to row_end (the rows assigned to this processor)
  For col = 0 to COLUMNS−1
    c_y = y_min + (y_max − y_min) * row / ROWS
    c_x = x_min + (x_max − x_min) * col / COLUMNS
    color = mandelbrot(c_x, c_y)
    picture[COLUMNS*row + col] = color
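The scaling step is a one-liner in C (a sketch; the names mirror the pseudo code):

/* Map pixel index i of n onto the interval [vmin, vmax]. */
double scale(double vmin, double vmax, int i, int n) {
    return vmin + (vmax - vmin) * i / n;
}
/* c_x = scale(x_min, x_max, col, COLUMNS); c_y = scale(y_min, y_max, row, ROWS); */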

Pseudo code: mandelbrot(c_x, c_y)
SET z = z_real + i·z_imaginary = 0 + 0i
SET iterations = 0
DO
  SET z = z² + c   // temp = z_real; z_real = z_real² − z_imaginary² + c_x
                   // z_imaginary = 2 · temp · z_imaginary + c_y
  SET value = z_real² + z_imaginary²
  iterations++
WHILE value <= 4 and iterations < max
RETURN iterations
Notes:
1. The final iteration count determines each point's color
2. Some points converge quickly, others slowly, and others not at all
3. Non-converging points are in the Mandelbrot set (black on the previous slide)
4. Since 4^(1/2) = 2, testing value ≤ 4 is equivalent to |z| ≤ 2 and avoids computing a square root
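The pseudo code translates almost directly into C (MAX_ITER is an illustrative limit):

#define MAX_ITER 256

/* Iterate z = z*z + c from z = 0; return the iteration count. */
int mandelbrot(double cx, double cy) {
    double zr = 0.0, zi = 0.0, value;
    int iterations = 0;
    do {
        double temp = zr;
        zr = zr * zr - zi * zi + cx;   /* real part of z^2 + c */
        zi = 2.0 * temp * zi + cy;     /* imaginary part of z^2 + c */
        value = zr * zr + zi * zi;     /* |z|^2, compared against 4 */
        iterations++;
    } while (value <= 4.0 && iterations < MAX_ITER);
    return iterations;                 /* MAX_ITER means "in the set" */
}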

Parallel Implementation
Load balancing
– Algorithms used to keep processors from becoming idle
– Note: a balanced load does NOT require identical work assignments, only that processors finish at roughly the same time
Static approach
– The load is assigned once, at the start of the run
– Mandelbrot: assign each processor a group of rows
– Deficiency: not load balanced, since rows that intersect the set iterate far longer than rows whose points escape quickly
Dynamic approach
– The load is assigned dynamically during the run
– Mandelbrot: slaves ask for more work when they complete a section
Both the static and dynamic algorithms are examples of load balancing

The Dynamic Approach
1. The master's work is increased somewhat
   a. It must send rows when it receives requests from slaves
   b. It must be responsive to slave requests; a separate thread can help, or the master can use MPI's asynchronous receive calls
2. Termination
   a. Slaves terminate when they receive a "no work" indication in a message
   b. The master must not terminate until all of the slaves complete
3. Partitioning of the load
   a. The master receives blocks of pixels; slaves receive ranges of (x,y) coordinates
   b. Partitions can be in columns or in rows; which is better?
4. Refinement: ask for more work before completing the current section (double buffering)
A skeleton of this master/slave work pool appears below.
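A work-pool skeleton in MPI (the tags, the one-row granularity, and the omitted pixel transfer are illustrative choices, not the text's exact code):

#include <mpi.h>

#define ROWS 1024
#define WORKTAG 1
#define DIETAG  2

void master(int nprocs) {           /* run on rank 0 */
    int next_row = 0, active = 0;
    MPI_Status status;
    /* Prime every slave with one row. */
    for (int p = 1; p < nprocs && next_row < ROWS; p++, next_row++, active++)
        MPI_Send(&next_row, 1, MPI_INT, p, WORKTAG, MPI_COMM_WORLD);
    while (active > 0) {
        int finished_row;
        MPI_Recv(&finished_row, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        /* ...also receive and store that row's pixel colors here... */
        if (next_row < ROWS) {
            MPI_Send(&next_row, 1, MPI_INT, status.MPI_SOURCE, WORKTAG,
                     MPI_COMM_WORLD);
            next_row++;
        } else {                    /* "no work" indication */
            MPI_Send(&next_row, 1, MPI_INT, status.MPI_SOURCE, DIETAG,
                     MPI_COMM_WORLD);
            active--;               /* master exits only after all slaves finish */
        }
    }
}

void slave(void) {                  /* run on every other rank */
    MPI_Status status;
    int row;
    for (;;) {
        MPI_Recv(&row, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (status.MPI_TAG == DIETAG) break;
        /* ...compute this row's colors, then send them back with the index... */
        MPI_Send(&row, 1, MPI_INT, 0, WORKTAG, MPI_COMM_WORLD);
    }
}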

Monte Carlo Methods
Pseudo code (throw random darts to converge on a solution)
1. Compute a definite integral of f over [x_min, x_max]
   While more iterations are needed
     pick a random point x in [x_min, x_max]
     total += f(x)
   result = (x_max − x_min) * total / iterations
   i.e. the estimate is (x_max − x_min) · (1/N) ∑_{i=1..N} f(x_i)
2. Calculation of π
   While more iterations are needed
     randomly pick a point in the square
     if the point falls inside the circle, within++
   Compute π = 4 * within / iterations
   (Sampling only the upper right quadrant gives the same factor of 4: the quarter circle covers π/4 of the unit square)
Note: Parallel programs shouldn't use the standard random number generator, since every process would produce the identical sequence
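A serial sketch of the π calculation (rand() stands in for the parallel generator discussed on the parallel random number generator slide below):

#include <stdio.h>
#include <stdlib.h>

/* Throw darts at the unit square; the fraction landing inside the
   quarter circle approximates pi/4. */
int main(void) {
    long iterations = 10000000, within = 0;
    for (long i = 0; i < iterations; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) within++;
    }
    printf("pi ~= %f\n", 4.0 * within / iterations);
    return 0;
}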

Computation of π
∫ (1−x²)^(1/2) dx = π/4 over 0 ≤ x ≤ 1;  ∫ (1−x²)^(1/2) dx = π/2 over −1 ≤ x ≤ 1
Points within the shape / total points = area of the shape / total area
A point is within the circle if point.x² + point.y² ≤ 1
Questions:
– How do we handle the boundary condition?
– What is the best accuracy that we can achieve?

Parallel Random Number Generator
Numbers of a pseudo-random sequence should be
– Uniformly distributed, with a large period, repeatable, and statistically independent
– Each processor must generate a unique sequence
– Solution accuracy depends upon the quality of the random sequence
Sequential linear congruential generator (m prime; here c = 0)
1. x_{i+1} = (a·x_i + c) mod m (e.g. a = 16807, m = 2³¹ − 1, c = 0)
2. Many other generators are possible
Parallel linear generator with unique sequences (leapfrog)
1. x_{i+k} = (A·x_i + C) mod m, where k is the "jump" constant
2. A = a^k mod m, C = c·(a^{k−1} + a^{k−2} + … + a + 1) mod m
3. If k = P (the number of processors), compute A, C, and the first k random numbers to get started; processor p then produces x_p, x_{p+P}, x_{p+2P}, …
[Diagram: the sequential sequence x_1, x_2, …, x_P, x_{P+1}, …, x_{2P−1}, … divided among P processors so their streams interleave into the original sequence]
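A sketch of the leapfrog scheme with c = 0 and the constants from this slide; 64-bit arithmetic keeps every product below overflow, and the helper names are illustrative:

#include <stdint.h>

#define A0 16807ULL               /* multiplier a */
#define M  2147483647ULL          /* modulus m = 2^31 - 1 */

typedef struct { uint64_t x, A; } leapfrog_t;

/* Process p of P: start at x_p and jump P steps per call, so the P
   per-process streams interleave into the one sequential sequence.
   The seed must be nonzero, since c = 0 makes zero a fixed point. */
void leapfrog_init(leapfrog_t *g, uint64_t seed, int p, int P) {
    g->A = 1;
    for (int i = 0; i < P; i++) g->A = (g->A * A0) % M;   /* A = a^P mod m */
    g->x = seed % M;
    for (int i = 0; i < p; i++) g->x = (g->x * A0) % M;   /* advance to x_p */
}

uint64_t leapfrog_next(leapfrog_t *g) {
    g->x = (g->x * g->A) % M;     /* x_{i+P} = (a^P * x_i) mod m */
    return g->x;
}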