Embarrassingly Parallel Computations


Characteristics of Embarrassingly Parallel Computations
– Easily parallelizable
– Little or no interaction between processes
– Can give maximum speedup if all available processors are kept busy
– The only constructs required are simply those to distribute the data and to start the processes
– Since the data is not shared, message-passing multicomputers are appropriate for such computations

Representation of Images
The most basic way to store a two-dimensional image is a pixmap, in which each pixel is stored as a binary number in a two-dimensional array:
– black-and-white: 1 bit per pixel
– greyscale: 8 bits per pixel
– color: 24 bits per pixel (RGB)
Geometric transformations perform mathematical operations on pixel coordinates:
– Transformations move a pixel’s position without affecting its value.
– Transformations must be done at high speed to be acceptable.
Pixel transformations are independent of one another:
– Truly embarrassingly parallel computations.
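
As a concrete illustration of such a transformation, here is a minimal sequential sketch of a shift: each pixel at (row, col) moves to (row + delta_x, col + delta_y) while keeping its value. The greyscale pixel type and the fixed 640 x 480 dimensions are illustrative assumptions; delta_x and delta_y are the shift amounts also used in the slave code later on.

#include <string.h>

#define HEIGHT 480
#define WIDTH  640

/* Shift every pixel of a greyscale image by (delta_x, delta_y).
   Each pixel is handled independently of all the others, which is
   what makes the operation embarrassingly parallel. */
void shift_image(unsigned char map[HEIGHT][WIDTH], int delta_x, int delta_y)
{
    static unsigned char temp_map[HEIGHT][WIDTH];
    memset(temp_map, 0, sizeof(temp_map));

    for (int row = 0; row < HEIGHT; row++)
        for (int col = 0; col < WIDTH; col++) {
            int newrow = row + delta_x;
            int newcol = col + delta_y;
            if (newrow >= 0 && newrow < HEIGHT && newcol >= 0 && newcol < WIDTH)
                temp_map[newrow][newcol] = map[row][col];   /* value unchanged */
        }
    memcpy(map, temp_map, sizeof(temp_map));
}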

Parallel Programming Concerns
The input data is the bitmap, typically held in a file and copied into an array.
Main parallel programming concern: dividing the bitmap into groups of pixels, one group for each process.
– Typically there are many more pixels than processes.
Two general methods of grouping:
– By square/rectangular regions
– By columns/rows
Example: a 640 x 480 image and 48 processes
– Divide the display area into 48 square areas of 80 x 80 pixels each, or
– Divide the display area into 48 rows of 640 x 10 pixels.
This method of division appears in many applications involving the processing of 2-D data.
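
The arithmetic behind both groupings is simple. The following sketch (illustrative, not from the slides) computes the pixel range owned by process i for the 640 x 480 image and 48 processes above: row partitioning gives each process a strip of 10 rows, while square partitioning arranges the 48 processes as an 8 x 6 grid of 80 x 80 blocks.

/* Compute the half-open pixel range [first_row, last_row) x [first_col, last_col)
   assigned to process i, for a 640 x 480 image and 48 processes. */
void my_region(int i, int by_rows,
               int *first_row, int *last_row,
               int *first_col, int *last_col)
{
    if (by_rows) {                        /* 48 strips of 640 x 10 pixels */
        *first_row = i * 10;        *last_row = *first_row + 10;
        *first_col = 0;             *last_col = 640;
    } else {                              /* 8 x 6 grid of 80 x 80 blocks */
        *first_row = (i / 8) * 80;  *last_row = *first_row + 80;
        *first_col = (i % 8) * 80;  *last_col = *first_col + 80;
    }
}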

Partition into Rows: Master Process

for (i = 0, row = 0; i < 48; i++, row = row + 10)    /* send starting row number to each slave */
    send(row, P[i]);
for (i = 0; i < 480; i++)                            /* initialize temporary map */
    for (j = 0; j < 640; j++)
        temp_map[i][j] = 0;
for (i = 0; i < (640 * 480); i++) {                  /* for each pixel, accept transformed coordinates */
    recv(oldrow, oldcol, newrow, newcol, P[ANY]);
    if (!((newrow >= 480) || (newcol >= 640)))       /* discard pixels moved off the display area */
        temp_map[newrow][newcol] = map[oldrow][oldcol];
}
for (i = 0; i < 480; i++)                            /* copy temporary map back into the bitmap */
    for (j = 0; j < 640; j++)
        map[i][j] = temp_map[i][j];

Slave Processes

recv(row, P[MASTER]);                                /* receive starting row number */
for (oldrow = row; oldrow < (row + 10); oldrow++)
    for (oldcol = 0; oldcol < 640; oldcol++) {       /* transform each pixel in the strip */
        newrow = oldrow + delta_x;
        newcol = oldcol + delta_y;
        send(oldrow, oldcol, newrow, newcol, P[MASTER]);
    }
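
The send() and recv() calls above are generic message-passing pseudocode. As a rough guide to what they correspond to in practice, here is a hedged MPI rendering of the same row-partitioned shift; the message tags, the shift amounts, and the 49-rank layout (rank 0 as master, ranks 1 to 48 as slaves) are assumptions, not part of the original slides.

#include <mpi.h>
#include <string.h>

#define H 480
#define W 640
#define SLAVES 48
#define ROWS_PER_SLAVE (H / SLAVES)

static unsigned char map[H][W], temp_map[H][W];

int main(int argc, char *argv[])
{
    int rank;
    int delta_x = 5, delta_y = 10;                  /* example shift amounts */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                                /* master */
        /* map[][] is assumed to have been loaded from a file here */
        memset(temp_map, 0, sizeof(temp_map));
        for (int i = 0; i < SLAVES; i++) {          /* send starting row numbers */
            int row = i * ROWS_PER_SLAVE;
            MPI_Send(&row, 1, MPI_INT, i + 1, 0, MPI_COMM_WORLD);
        }
        for (long i = 0; i < (long)H * W; i++) {    /* accept transformed coordinates */
            int coords[4];                          /* oldrow, oldcol, newrow, newcol */
            MPI_Recv(coords, 4, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (coords[2] >= 0 && coords[2] < H && coords[3] >= 0 && coords[3] < W)
                temp_map[coords[2]][coords[3]] = map[coords[0]][coords[1]];
        }
        memcpy(map, temp_map, sizeof(map));
    } else if (rank <= SLAVES) {                    /* slaves */
        int row;
        MPI_Recv(&row, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int oldrow = row; oldrow < row + ROWS_PER_SLAVE; oldrow++)
            for (int oldcol = 0; oldcol < W; oldcol++) {
                int coords[4] = { oldrow, oldcol,
                                  oldrow + delta_x, oldcol + delta_y };
                MPI_Send(coords, 4, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
    }

    MPI_Finalize();
    return 0;
}

Run with 49 ranks (e.g. mpirun -np 49), this mirrors the pseudocode: only coordinates travel between processes, and the pixel values themselves never leave the master.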

Program Analysis
Suppose each pixel requires two computational steps and there are n x n pixels.
– Sequential time: t_s = 2n^2, which is O(n^2).
Communication, with p processes:
– Before the computation, the starting row numbers must be sent to each process.
– Afterwards, the individual processes have to send back the transformed coordinates of their group of pixels.
– t_comm = p(t_startup + t_data) + n^2(t_startup + 4 t_data) = O(p + n^2)
Computation:
– Each process handles a group of n^2/p pixels.
– Each pixel requires 2 additions.
– t_comp = 2(n^2/p) = O(n^2/p)
For fixed p, the overall time complexity is O(n^2).
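
Putting these terms together (a restatement of the slide's expressions with the speedup written out explicitly; not additional analysis from the source):

t_p = t_{comp} + t_{comm} = \frac{2n^2}{p} + p\,(t_{startup} + t_{data}) + n^2\,(t_{startup} + 4\,t_{data})

S(p) = \frac{t_s}{t_p} = \frac{2n^2}{\frac{2n^2}{p} + p\,(t_{startup} + t_{data}) + n^2\,(t_{startup} + 4\,t_{data})}

Because the communication term is itself O(n^2), the speedup is bounded by a constant no matter how large p becomes, which is why this particular computation is dominated by communication rather than computation.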

Mandelbrot Set
The Mandelbrot set is a widely used test in parallel computer systems:
– It is computationally intensive.
Displaying this set is another example of processing a bit-mapped image.
In contrast to the previous example, the image is computed in this case.

Mandelbrot Set (cont’d)
Computing the function z_{k+1} = z_k^2 + c is simplified by recognizing that, for z = a + bi,
– z^2 = a^2 + 2abi + (bi)^2 = a^2 - b^2 + 2abi
Hence, if z_real is the real part of z and z_imag is the imaginary part, the next iteration values can be produced by computing:
– z_real = z_real^2 - z_imag^2 + c_real
– z_imag = 2 z_real z_imag + c_imag
(using the old value of z_real in the second line).
The following C structure can be used to represent z:

typedef struct {
    float real;
    float imag;
} complex;
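
The cal_pixel() routine called in the following slides computes this iteration for one point. A minimal sketch, using the complex structure above and assuming the usual choices of a 256-iteration limit and the divergence test |z| >= 2 (i.e. |z|^2 >= 4):

int cal_pixel(complex c)
{
    int count = 0;
    int max = 256;                          /* assumed iteration limit */
    complex z = { 0.0f, 0.0f };
    float temp, lengthsq;

    do {                                    /* iterate z = z*z + c */
        temp   = z.real * z.real - z.imag * z.imag + c.real;
        z.imag = 2 * z.real * z.imag + c.imag;   /* uses the old z.real */
        z.real = temp;
        lengthsq = z.real * z.real + z.imag * z.imag;
        count++;
    } while (lengthsq < 4.0f && count < max);

    return count;                           /* iteration count is used as the pixel color */
}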

Mandelbrot Set (cont’d)
The code for computing and displaying the points requires scaling the coordinate system of the display area onto the complex plane.
– The actual viewing area will usually be a rectangular window of any size, sited anywhere of interest in the complex plane.
Let disp_height and disp_width be the display height and width, and let (x, y) be the coordinates of a point in the display area.
If this window is to display the part of the complex plane with minimum values (real_min, imag_min) and maximum values (real_max, imag_max), each (x, y) point needs to be scaled by:

c.real = real_min + x * (real_max - real_min) / disp_width;
c.imag = imag_min + y * (imag_max - imag_min) / disp_height;

For computational efficiency, let:
– scale_real = (real_max - real_min) / disp_width
– scale_imag = (imag_max - imag_min) / disp_height

Mandelbrot Set (cont’d)
Including the scaling, the code could be of the form:

for (x = 0; x < disp_width; x++)
    for (y = 0; y < disp_height; y++) {
        c.real = real_min + ((float) x * scale_real);
        c.imag = imag_min + ((float) y * scale_imag);
        color = cal_pixel(c);
        display(x, y, color);
    }

Static Task Assignment: Master

for (i = 0, row = 0; i < 48; i++, row = row + 10)    /* send starting row number to each slave */
    send(&row, P[i]);
for (i = 0; i < (480 * 640); i++) {                  /* receive and display each computed pixel */
    recv(&c, &color, P[ANY]);
    display(c, color);
}

Static Task Assignment (cont’d): Slave (process i)

recv(&row, P[MASTER]);                               /* receive starting row number */
for (x = 0; x < disp_width; x++)
    for (y = row; y < row + 10; y++) {
        c.real = real_min + ((float) x * scale_real);
        c.imag = imag_min + ((float) y * scale_imag);
        color = cal_pixel(c);
        send(&c, &color, P[MASTER]);                 /* return coordinates and color */
    }

Dynamic Task Assignment
The Mandelbrot set requires significant iterative computation per pixel, and the number of iterations will generally be different for each pixel.
Computers in a cluster may also be of different types and speeds.
Ideally, we want all processors to complete together, achieving a system efficiency of 100%.
Assigning regions of different sizes to different processors also has problems:
– We would need to know each processor’s speed a priori.
Work pool approach (processor farm):
– Individual processors are supplied with work when they become idle.
Dynamic load balancing can be achieved using a work-pool approach.

Dynamic Task Assignment: Master

count = 0;                                       /* counter for termination */
row = 0;                                         /* row being sent */
for (k = 0; k < num_proc; k++) {                 /* assuming num_proc < disp_height */
    send(&row, P[k], data_tag);                  /* send an initial row to each slave */
    count++;
    row++;
}
do {
    recv(&slave, &r, color, P[ANY], result_tag); /* receive a computed row */
    count--;                                     /* reduce count as rows are received */
    if (row < disp_height) {
        send(&row, P[slave], data_tag);          /* give the now-idle slave the next row */
        row++;
        count++;
    } else
        send(&row, P[slave], terminator_tag);    /* no more rows: tell the slave to stop */
    display(r, color);
} while (count > 0);

Dynamic Task Assignment (cont’d): Slave

recv(&y, P[MASTER], source_tag);                 /* receive first row number */
while (source_tag == data_tag) {
    c.imag = imag_min + ((float) y * scale_imag);
    for (x = 0; x < disp_width; x++) {           /* compute the color of every pixel in the row */
        c.real = real_min + ((float) x * scale_real);
        color[x] = cal_pixel(c);
    }
    send(&y, color, P[MASTER], result_tag);      /* return the row number and its colors */
    recv(&y, P[MASTER], source_tag);             /* receive next row number */
}

Analysis
Exact analysis of the Mandelbrot computation is complicated by not knowing how many iterations are needed for each pixel.
The number of iterations for each pixel is some function of c but cannot exceed max.
Therefore the sequential time is t_s <= max x n, where n is the number of pixels, giving a sequential time complexity of O(n).
Let us consider just the static assignment. There are three phases:
Phase 1: Communication
– First, the row number is sent to each slave: one data item to each of the p - 1 slaves.
– t_comm1 = (p - 1)(t_startup + t_data)

Analysis (cont’d)
Phase 2: Computation
– The slaves perform the Mandelbrot computation in parallel, so t_comp <= (max x n) / (p - 1).
Phase 3: Communication
– Results are passed back to the master, one row of pixel colors at a time.
– Suppose each slave handles u rows and there are v pixels on a row; then t_comm2 = u(t_startup + v t_data).
– For static assignment, u and v are constant (unless the resolution of the image is changed), so we can take t_comm2 = k, a constant.

Overall Execution Time
Overall, the parallel time is given by
    t_p <= (max x n) / (p - 1) + (p - 1)(t_startup + t_data) + k
where the total number of processors is p.
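
Dividing the sequential time by this parallel time gives the speedup (a restatement using the terms above, not a result stated on the slides):

S(p) = \frac{t_s}{t_p} \approx \frac{\mathit{max}\cdot n}{\dfrac{\mathit{max}\cdot n}{p-1} + (p-1)(t_{startup}+t_{data}) + k}

When the per-pixel work max x n dominates the communication terms, the speedup approaches p - 1, i.e. the number of slaves.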