4.1 Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen


4.1 Chapter 4: Partitioning and Divide-and-Conquer Strategies

Two fundamental techniques in parallel programming:
- Partitioning
- Divide-and-conquer
Typical problems solved with these techniques.

4.2 Partitioning/Divide-and-Conquer Examples

Many possibilities:
- Operations on sequences of numbers, such as simply adding them together
- Several sorting algorithms, which can often be partitioned or constructed in a recursive fashion
- Numerical integration
- N-body problem

4.3 Partitioning

Dividing the problem into parts. Basis for all parallel programming, in one form or another.
- Domain decomposition – partitioning applied to the program data. Dividing the data and operating upon the divided data concurrently.
- Functional decomposition – partitioning applied to the functions of a program. Dividing the program into independent functions and executing the functions concurrently.
Domain decomposition forms the foundation of most parallel algorithms.

4.4 Partitioning a sequence of numbers into parts and adding the parts

4.5 Number Summation Revisited

Our previous implementation used broadcast() and reduce() to send the data and combine the result. The following pseudocode uses individual send()s and recv()s:

1. Master:

   s = n/p;                            /* numbers per slave */
   for (i = 0, x = 0; i < p; i++, x = x + s)
       send(&numbers[x], s, P[i]);     /* send one block to each slave */
   sum = 0;
   for (i = 0; i < p; i++) {           /* accept partial sums in any order */
       recv(&part_sum, P[ANY]);
       sum = sum + part_sum;
   }

2. Slave:

   recv(numbers, s, P[MASTER]);        /* receive this slave's block */
   part_sum = 0;
   for (i = 0; i < s; i++)
       part_sum = part_sum + numbers[i];
   send(&part_sum, P[MASTER]);         /* return partial sum to master */
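
For concreteness, here is a minimal MPI rendering of the same pattern (a sketch, not the book's code; MPI_Scatter distributes equal blocks and MPI_Reduce combines the partial sums, assuming p divides n):

   #include <stdio.h>
   #include <stdlib.h>
   #include <mpi.h>

   int main(int argc, char *argv[]) {
       int p, rank, n = 1024;             /* n is an assumption for the demo */
       MPI_Init(&argc, &argv);
       MPI_Comm_size(MPI_COMM_WORLD, &p);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       int s = n / p;                     /* block size; assumes p divides n */
       int *numbers = NULL;
       if (rank == 0) {                   /* master fills the full list */
           numbers = malloc(n * sizeof(int));
           for (int i = 0; i < n; i++) numbers[i] = 1;
       }
       int *block = malloc(s * sizeof(int));
       /* distribute one block of s numbers to every process */
       MPI_Scatter(numbers, s, MPI_INT, block, s, MPI_INT, 0, MPI_COMM_WORLD);
       int part_sum = 0, sum = 0;
       for (int i = 0; i < s; i++) part_sum += block[i];
       /* combine the partial sums on the master */
       MPI_Reduce(&part_sum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
       if (rank == 0) printf("sum = %d\n", sum);
       free(block); free(numbers);
       MPI_Finalize();
       return 0;
   }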

4.6 Divide-and-Conquer

Dividing a problem into subproblems that are of the same form as the larger problem. Further divisions into still smaller sub-problems are usually done by recursion:

   int add(int *s) {
       if (number(s) == 2)
           return (s[0] + s[1]);       /* base case: two elements left */
       else {
           divide(s, s1, s2);          /* split the list into two halves */
           part_sum1 = add(s1);
           part_sum2 = add(s2);
           return (part_sum1 + part_sum2);
       }
   }
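
The pseudocode above leaves number() and divide() undefined. A self-contained, runnable C version (our sketch, using explicit array bounds instead of those helpers) looks like this:

   /* Recursive divide-and-conquer sum over s[lo..hi);
      assumes hi - lo is a power of 2, as in the slides' analysis. */
   int add(int *s, int lo, int hi) {
       if (hi - lo == 2)
           return s[lo] + s[lo + 1];            /* base case: two elements */
       int mid = lo + (hi - lo) / 2;            /* divide into two halves */
       return add(s, lo, mid) + add(s, mid, hi);    /* conquer and combine */
   }
   /* usage: int data[8] = {1,2,3,4,5,6,7,8};  int total = add(data, 0, 8); */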

4.7 Tree construction

When each division creates two parts, a recursive divide-and-conquer formulation forms a binary tree.

4.8 Parallel Implementation

In a sequential implementation, only one node of the tree can be visited at a time. A parallel solution offers the prospect of traversing several parts of the tree simultaneously.
- Could assign one processor to each node in the tree. Requires 2^(m+1) - 1 processors to divide the tasks into 2^m parts.
- A more efficient solution is to reuse processors at each level of the tree. At each stage, each processor keeps half of the list and passes on the other half.
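
One common way to implement the processor-reuse scheme (a hedged sketch; the slides do not prescribe this exact rank arithmetic) is to halve the size of the active group at each step, with each holder sending the upper half of its list to a partner halfway across its group:

   #include <mpi.h>

   /* Division phase of the tree summation: log2(p) steps.
      Assumes p and n are powers of 2 and rank 0 initially holds all n
      numbers in list[] (every rank allocates list[n] for simplicity).
      Returns how many numbers this rank holds afterwards (n/p). */
   int divide_phase(int *list, int n, int rank, int p) {
       int held = (rank == 0) ? n : 0;
       for (int group = p; group > 1; group /= 2) {
           int stride = group / 2;            /* distance to the partner */
           if (rank % group == 0) {           /* holder: pass on upper half */
               held /= 2;
               MPI_Send(list + held, held, MPI_INT,
                        rank + stride, 0, MPI_COMM_WORLD);
           } else if (rank % group == stride) {   /* partner: receive it */
               held = n / (p / stride);       /* = n / 2^(step+1) numbers */
               MPI_Recv(list, held, MPI_INT, rank - stride, 0,
                        MPI_COMM_WORLD, MPI_STATUS_IGNORE);
           }
       }
       return held;                           /* n/p numbers per process */
   }

The combining phase simply reverses the pattern: each partner sends its partial sum back to the rank it received from.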

4.9 Dividing a list into parts

The division stops when the total number of processors is committed.

4.10 Partial summation

4.11 Adding a List of Numbers

Suppose we statically create eight processes to add a list of numbers.

Process P0:

   divide(s1, s1, s2);        /* keep lower half, pass upper half to P4 */
   send(s2, P4);
   divide(s1, s1, s2);
   send(s2, P2);
   divide(s1, s1, s2);
   send(s2, P1);
   part_sum = *s1;
   recv(&part_sum1, P1);
   part_sum = part_sum + part_sum1;
   recv(&part_sum1, P2);
   part_sum = part_sum + part_sum1;
   recv(&part_sum1, P4);
   part_sum = part_sum + part_sum1;

4.12 Adding a List of Numbers (cont'd)

Process P4:

   recv(s1, P0);
   divide(s1, s1, s2);
   send(s2, P6);
   divide(s1, s1, s2);
   send(s2, P5);
   part_sum = *s1;
   recv(&part_sum1, P5);
   part_sum = part_sum + part_sum1;
   recv(&part_sum1, P6);
   part_sum = part_sum + part_sum1;
   send(&part_sum, P0);

4.13 Analysis

Assume that n is a power of 2.
Communication: the division phase takes log p steps with p processes, the message size halving at each step:
   t_comm1 = (n/2 + n/4 + ... + n/p) t_data ≈ n·t_data   (ignoring startup times)
The combining phase takes another log p steps, each sending a single partial sum. Time complexity is O(n) for constant p.

4.14 Analysis (cont'd)

Computation: at the end of the divide phase, the n/p numbers are added together. Then one addition occurs at each stage during the combining phase, leading to
   t_comp = n/p + log p
Time complexity of O(n) for constant p. For large n and variable p, we get O(n/p). The total parallel execution time becomes
   t_p = t_comm + t_comp

4.15 M-ary Divide and Conquer

Divide and conquer can also be applied where a task is divided into more than two parts at each stage. Sample pseudocode for 4-ary divide and conquer:

   int add(int *s) {
       if (number(s) == 4)                /* base case: four elements left */
           return (s[0] + s[1] + s[2] + s[3]);
       else {
           divide(s, s1, s2, s3, s4);     /* split into four parts */
           p_sum1 = add(s1);
           p_sum2 = add(s2);
           p_sum3 = add(s3);
           p_sum4 = add(s4);
           return (p_sum1 + p_sum2 + p_sum3 + p_sum4);
       }
   }

4.16 Quadtree

4.17 Dividing an image

4.18 Example 1: Sorting Using Bucket Sort

Most sequential sorting algorithms are based on comparing and exchanging pairs of numbers. Bucket sort is not based on compare-and-exchange, but is naturally a partitioning method. It works well only if the numbers to be sorted are uniformly distributed across a known interval. If all the numbers lie in the interval 0 to a-1:
- Divide the numbers into m regions: 0 to a/m-1, a/m to 2a/m-1, ...
- One "bucket" is assigned to hold the numbers that fall in each region.

4.19 Bucket Sort

4.20 Sequential Bucket Sort: Partitioning into Buckets

Need to determine the placement of each number into the appropriate bucket.
- Given a number r, compare it with the start of regions a/m, 2a/m, 3a/m, ... This can take m-1 comparisons to determine the right bucket.
- Better alternative: multiply r by m/a to determine its bucket. Requires one computational step, except that division can be expensive.
- If m is a power of 2, bucket identification is even faster: if m = 2^k, the upper k bits of r determine r's bucket (see the sketch after this slide).
Assume that placing a number into a bucket requires one step; placing all the numbers then requires n steps. If the numbers are uniformly distributed, each bucket will contain about n/m numbers.
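
As an illustration (our sketch; variable names are ours, and a is assumed to be a power of 2 for the shift version), the two bucket-index computations look like this:

   #include <stdint.h>

   /* General case: bucket index by one multiply/divide.
      r in [0, a), m buckets of equal width a/m. */
   int bucket_of(uint32_t r, uint32_t m, uint32_t a) {
       return (int)((uint64_t)r * m / a);   /* 64-bit product avoids overflow */
   }

   /* Power-of-two case: m = 2^k buckets, a = 2^b, so the upper k bits
      of r select the bucket with a single shift. */
   int bucket_of_pow2(uint32_t r, int k, int b) {
       return (int)(r >> (b - k));
   }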

4.21 Sequential Bucket Sort: Sorting the Buckets

After placing the numbers, each bucket needs to be sorted.
- n numbers can be sorted sequentially in O(n log n) time, the best time for sequential compare-and-exchange sorting.
- Thus, each bucket of n/m numbers can be sorted sequentially in O((n/m) log(n/m)) time.
Assuming no concatenation overhead, the sequential time is
   t_s = n + m·(n/m) log(n/m) = n + n log(n/m) = O(n log(n/m))
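
For reference, a straightforward sequential version (our sketch; qsort stands in for any O((n/m) log(n/m)) within-bucket sort, and the buffer sizing is deliberately generous):

   #include <stdlib.h>

   static int cmp_int(const void *x, const void *y) {
       int a = *(const int *)x, b = *(const int *)y;
       return (a > b) - (a < b);
   }

   /* Sequential bucket sort of a[0..n) with values in [0, range), m buckets. */
   void bucket_sort(int *a, int n, int m, int range) {
       int **bucket = malloc(m * sizeof(int *));
       int *count = calloc(m, sizeof(int));
       for (int b = 0; b < m; b++)
           bucket[b] = malloc(n * sizeof(int));    /* worst-case capacity */
       for (int i = 0; i < n; i++) {               /* place: one step per number */
           int b = (int)((long long)a[i] * m / range);
           bucket[b][count[b]++] = a[i];
       }
       int k = 0;
       for (int b = 0; b < m; b++) {               /* sort and concatenate */
           qsort(bucket[b], count[b], sizeof(int), cmp_int);
           for (int j = 0; j < count[b]; j++) a[k++] = bucket[b][j];
           free(bucket[b]);
       }
       free(bucket); free(count);
   }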

4.22 Parallel Bucket Sort

Can be parallelized by assigning one processor per bucket. The computation time will now be (with m = p):
   t_p = n + (n/p) log(n/p)
This equation means that each processor examines each of the n numbers – inefficient. Can be improved by removing each examined number from the list as it is bucketed.

4.23 Parallel version of bucket sort

Simple approach: assign one processor for each bucket.

4.24 Further Parallelization

Partition the sequence into p regions, one region for each processor.
- Each processor maintains p "small" buckets and separates the numbers in its region into its own small buckets.
- The small buckets are then emptied into p final buckets for sorting. This requires each processor to send one small bucket to each of the other processors (bucket i to processor i).
The algorithm has four phases (see the sketch after this slide):
1. Partitioning the numbers
2. Placement into p small buckets
3. Sending small buckets to large buckets
4. Sorting the large buckets
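
Phase 3 is exactly what MPI's all-to-all operation provides. A hedged sketch of phases 2–3 (the helper name and buffer handling are our assumptions, not the book's code; counts vary per bucket, so MPI_Alltoallv is used for the data):

   #include <mpi.h>
   #include <stdlib.h>

   /* region[] holds this process's rn = n/p numbers in [0, a).
      On return, *big points to every number belonging to this
      process's final bucket, ready for phase 4 (sorting). */
   void exchange_buckets(int *region, int rn, int p, int a,
                         int **big, int *big_n) {
       int *scount = calloc(p, sizeof(int));
       for (int i = 0; i < rn; i++)            /* phase 2: count per bucket */
           scount[(int)((long long)region[i] * p / a)]++;
       int *sdisp = calloc(p, sizeof(int));
       for (int i = 1; i < p; i++) sdisp[i] = sdisp[i-1] + scount[i-1];
       int *sbuf = malloc(rn * sizeof(int));
       int *fill = calloc(p, sizeof(int));
       for (int i = 0; i < rn; i++) {          /* pack small buckets contiguously */
           int b = (int)((long long)region[i] * p / a);
           sbuf[sdisp[b] + fill[b]++] = region[i];
       }
       int *rcount = malloc(p * sizeof(int));
       /* tell every process how many numbers to expect from us */
       MPI_Alltoall(scount, 1, MPI_INT, rcount, 1, MPI_INT, MPI_COMM_WORLD);
       int *rdisp = calloc(p, sizeof(int));
       for (int i = 1; i < p; i++) rdisp[i] = rdisp[i-1] + rcount[i-1];
       *big_n = rdisp[p-1] + rcount[p-1];
       *big = malloc(*big_n * sizeof(int));
       /* phase 3: the all-to-all exchange of small buckets */
       MPI_Alltoallv(sbuf, scount, sdisp, MPI_INT,
                     *big, rcount, rdisp, MPI_INT, MPI_COMM_WORLD);
       free(scount); free(sdisp); free(sbuf); free(fill);
       free(rcount); free(rdisp);
   }

Process i then sorts the received big bucket locally, completing phase 4.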

4.25 Another parallel version of bucket sort

Introduces a new message-passing operation: all-to-all broadcast.

4.26 Analysis

Phase 1 – Computation and Communication
The p partitions, containing n/p numbers each, are sent to the processes. Using a broadcast (or scatter) routine:
   t_comm1 = t_startup + n·t_data

Phase 2 – Computation
Separating each partition of n/p numbers into p small buckets requires
   t_comp2 = n/p

4.27 Analysis

Phase 3 – Communication
Next, the small buckets are distributed (no computation). Each small bucket will have about n/p² numbers. Each process must send the contents of p-1 small buckets to other processes. Since each of the p processes must make this communication, we have
   t_comm3 = p(p-1)(t_startup + (n/p²)t_data)
The above assumes the communications do not overlap in time and individual send()s are used. If the communications could overlap (as with MPI_Alltoall()), we have
   t_comm3 = (p-1)(t_startup + (n/p²)t_data)

4.28 Analysis

Phase 4 – Computation
Sort the large buckets simultaneously:
   t_comp4 = (n/p) log(n/p)

Overall execution time:
   t_p = t_startup + n·t_data + n/p + (p-1)(t_startup + (n/p²)t_data) + (n/p) log(n/p)

4.29 "all-to-all" broadcast routine

Sends data from each process to every other process.

4.30 The "all-to-all" routine actually transfers the rows of an array to columns: it transposes a matrix.

4.31 Example 2: Numerical Integration

In many areas of mathematics and science it is necessary to calculate the area bounded by y = f(x), the x-axis, and the lines x = a and x = b. The area can be calculated using definite integrals, but this can sometimes be difficult or impossible analytically. Typically, approximation methods are used to calculate such areas:
- Trapezoidal rule
- Simpson's rule
- Adaptive quadrature methods
The area is approximated by partitioning the interval [a, b] into n equal pieces and inscribing a rectangle above each sub-interval thus obtained.

4.32 Numerical integration using rectangles

Each region is calculated using an approximation given by rectangles; aligning the rectangles with the curve improves the approximation.

4.33 Trapezoidal Rule: Static Assignment

One process is assigned statically to compute each region, before the computation begins. As before, an SPMD programming model is used, and each calculation has the same form. With p processes to compute the area between x = a and x = b, each process will handle a region of size (b-a)/p. Each process computes its area in the manner shown by the code on slide 4.35.

4.34 Numerical integration using the trapezoidal method

May not be better!

4.35 Trapezoidal Rule (cont'd)

   if (i == master) {                 /* master reads number of intervals */
       printf("Enter # of intervals ");
       scanf("%d", &n);
   }
   bcast(&n, p[group]);               /* broadcast n to all processes */
   region = (b - a)/p;                /* sub-range per process */
   start = a + region * i;
   end = start + region;
   d = (b - a)/n;                     /* width of one interval */
   area = 0.0;
   for (x = start; x < end; x = x + d)
       area = area + 0.5 * (f(x) + f(x + d)) * d;
   reduce_add(&integral, &area, p[group]);
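
A minimal MPI rendering of the same SPMD pattern (our sketch; the integrand f(), the limits a and b, and the interval count are assumptions for the demo):

   #include <stdio.h>
   #include <mpi.h>

   double f(double x) { return x * x; }   /* example integrand (assumption) */

   int main(int argc, char *argv[]) {
       int i, p, n = 1000;
       double a = 0.0, b = 1.0;           /* integration limits (assumption) */
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &i);
       MPI_Comm_size(MPI_COMM_WORLD, &p);
       MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* as bcast() above */
       double region = (b - a) / p;       /* this process's sub-range */
       double start = a + region * i;
       double end = start + region;
       double d = (b - a) / n;            /* strip width */
       double area = 0.0, integral = 0.0;
       for (double x = start; x < end; x += d)
           area += 0.5 * (f(x) + f(x + d)) * d;        /* one trapezoid */
       /* as reduce_add() above: sum partial areas onto process 0 */
       MPI_Reduce(&area, &integral, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
       if (i == 0) printf("integral = %f\n", integral);
       MPI_Finalize();
       return 0;
   }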

4.36 Trapezoidal Rule: Optimization

Each interior point is shared by two adjacent trapezoids, so f need only be evaluated once per point. An implementation of this optimization:

   area = 0.5 * (f(start) + f(end));
   for (x = start + d; x < end; x = x + d)
       area = area + f(x);
   area = area * d;

4.37 Adaptive Quadrature

The solution adapts to the shape of the curve. Use three areas, A, B, and C. The computation terminates when the largest of A and B is sufficiently close to the sum of the other two areas.
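
A recursive sketch of the idea (our formulation; the slides describe the termination test only in words): split an interval in half and stop refining when the two half-areas agree with the coarse estimate over the whole interval to within a tolerance.

   #include <math.h>

   /* One trapezoid over [lo, hi] */
   static double trap(double (*f)(double), double lo, double hi) {
       return 0.5 * (f(lo) + f(hi)) * (hi - lo);
   }

   /* Adaptive quadrature: recurse until the refined estimate A + B is
      sufficiently close to the coarse estimate over the whole interval. */
   double adapt(double (*f)(double), double lo, double hi, double tol) {
       double mid = 0.5 * (lo + hi);
       double whole = trap(f, lo, hi);
       double A = trap(f, lo, mid), B = trap(f, mid, hi);
       if (fabs(A + B - whole) < tol)     /* termination test (our variant) */
           return A + B;
       return adapt(f, lo, mid, tol / 2) + adapt(f, mid, hi, tol / 2);
   }

In a parallel setting the two recursive calls are independent, so they can be assigned to different processes just as in the binary-tree summation earlier in the chapter.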