Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Principles of Parallel Programming First Edition by Calvin Lin Lawrence Snyder.




1 Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Principles of Parallel Programming, First Edition, by Calvin Lin and Lawrence Snyder. Chapter 5: Scalable Algorithmic Techniques

2 Scalable program construction: scalability can often be improved by increasing the problem size. The focus is on data-parallel computation.

3 Ideal parallel computation: large blocks of independent computation. Examples are the BOINC projects at Berkeley, such as the SETI@home project. Projects of this kind are atypical.

4 Important principle: parallel programs are more scalable when they emphasize blocks of computation (typically, the larger the block the better) that minimize inter-thread dependencies.

5 Schwartz's algorithm: the tree should be used to connect processes rather than items. Given P < n: encode as in 1.3; each process adds n/P items locally, and the P intermediate sums are then combined with a P-leaf tree that connects the processes. All processes work directly on the problem.
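The local-sum-then-tree idea on this slide can be sketched in a few lines. This is a sequential simulation, not the Peril-L code of Figure 5.2: the function name `schwartz_sum` and the block partitioning are assumptions made for illustration, and each list slot stands in for one process.

```python
def schwartz_sum(values, num_procs):
    """Sum `values` by local accumulation followed by a pairwise tree combine."""
    n = len(values)
    # Block partition: "process" p owns values[bounds[p]:bounds[p + 1]].
    bounds = [p * n // num_procs for p in range(num_procs + 1)]
    partial = [sum(values[bounds[p]:bounds[p + 1]]) for p in range(num_procs)]

    # Tree combine over the P partial sums. At each level, process p absorbs
    # the value held by process p + stride; process 0 takes part at every level.
    stride = 1
    while stride < num_procs:
        for p in range(0, num_procs - stride, 2 * stride):
            partial[p] += partial[p + stride]
        stride *= 2
    return partial[0]
```

Only about log2(P) combining levels are needed after the local sums, which is why the tree connects processes rather than individual items.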

6 Figure 5.1 Schwartz - Process-induced tree. Each process computes locally on a sequence of values (heavy lines), and then combines the results pair-wise, inducing a tree; notice that process 0 participates at each level in the tree.

7 Figure 5.2 Schwartz algorithm inducing the tree of Figure 5.1. Line 8 loads the locally computed value into the tree; line 14 performs the summation when both operands are available. Threads exit when they have nothing left to do.

8 The authors advocate the use of reduce and scan, even where the programming language does not provide them. Code them as functions: they are high level and convey information about the program logic.

9 Reduce and scan are common and important. Reduce combines a set of values, for example to compare or combine results. Scan (parallel prefix) performs a sequential operation in parts and carries intermediate results along.

10 Kinds of scans, given A = {2, 4, 6}. Inclusive: +\A = {2, 6, 12}; used by Peril-L. Exclusive: +\A = {0, 2, 6}; the first item is the identity element of the operation.
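The two kinds of scan can be reproduced with the standard library. The helper names `inclusive_scan` and `exclusive_scan` are illustrative, and the `initial` argument of `itertools.accumulate` assumes Python 3.8 or later.

```python
from itertools import accumulate
import operator

def inclusive_scan(values, op=operator.add):
    """Element i of the result combines values[0..i]."""
    return list(accumulate(values, op))

def exclusive_scan(values, identity=0, op=operator.add):
    """Element i of the result combines values[0..i-1]; element 0 is the identity."""
    if not values:
        return []
    # Start from the identity and drop the last input so the lengths match.
    return list(accumulate(values[:-1], op, initial=identity))

A = [2, 4, 6]
inclusive = inclusive_scan(A)   # [2, 6, 12]
exclusive = exclusive_scan(A)   # [0, 2, 6]
```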

11 Examples of reduces. Second smallest array element: track the smallest and second smallest values; if an array value is smaller, update each accordingly. Histogram with k intervals: use min and max reduces to find the smallest and largest values; initialize a k-element array, hist, to 0s; iterate through the data, counting the interval each value belongs to.
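A minimal sketch of the histogram reduce described on this slide, assuming the data has already been split into non-empty per-process blocks. The names `local_histogram` and `histogram`, and the choice of equal-width intervals, are assumptions for illustration.

```python
def local_histogram(block, lo, hi, k):
    """Count how many values of one block fall into each of k equal intervals."""
    hist = [0] * k
    width = (hi - lo) / k
    for x in block:
        # Clamp so that x == hi lands in the last interval; if all values
        # are equal (width == 0), everything goes into interval 0.
        idx = 0 if width == 0 else min(int((x - lo) / width), k - 1)
        hist[idx] += 1
    return hist

def histogram(blocks, k):
    lo = min(min(b) for b in blocks)   # min reduce over all blocks
    hi = max(max(b) for b in blocks)   # max reduce over all blocks
    total = [0] * k
    for b in blocks:                   # combine: elementwise addition
        local = local_histogram(b, lo, hi, k)
        total = [t + h for t, h in zip(total, local)]
    return total
```

The per-block counts are independent of one another, so only the min, the max, and the final elementwise addition require communication.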

12 Examples (cont.). Length of the longest run of consecutive 1s: start with current = 0 and longest = 0, where current is the length of the current run of 1s; the answer is max(current, longest). Index of the first x: create a two-element temp array with temp[0] = x and temp[1] = +infinity; iterate looking for x, keeping the smaller of the saved index and the found index.
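The slide gives the sequential logic for the longest run of 1s. When the data is split across processes, a run can straddle a block boundary, so the tally has to carry a little more state. The sketch below is one way to do that and is an assumption, not the book's code: the tally is a tuple (length, leading run, trailing run, longest run), and `run_tally`, `combine`, and `longest_run` are hypothetical names.

```python
from functools import reduce

def run_tally(block):
    """Local accumulation: (length, leading 1s, trailing 1s, longest run)."""
    current = longest = prefix = 0
    seen_zero = False
    for bit in block:
        if bit == 1:
            current += 1
            longest = max(longest, current)
            if not seen_zero:
                prefix = current
        else:
            seen_zero = True
            current = 0
    return (len(block), prefix, current, longest)

def combine(left, right):
    """Merge the tallies of two adjacent blocks, left block first."""
    ln, lp, ls, ll = left
    rn, rp, rs, rl = right
    prefix = lp if lp < ln else ln + rp   # left block is all 1s: run continues
    suffix = rs if rs < rn else rn + ls   # right block is all 1s: run continues
    return (ln + rn, prefix, suffix, max(ll, rl, ls + rp))

def longest_run(blocks):
    return reduce(combine, map(run_tally, blocks))[3]
```

Because `combine` is associative, the same tallies could be merged pairwise in a tree instead of left to right.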

13 Basic structure of reduce and scan. A local variable, tally, stores intermediate results. Four functions: init() initializes tally; accum() performs local accumulation; combine() composes intermediate tally results and passes them to the parent; x-gen(), where x is reduce or scan, takes the global result and generates the final answer. These will vary for scan and reduce.

14 Example: +/A (reduce). init() sets tally = 0. accum(tally, val) sets tally = tally + val, where val is the current element A[i]. combine(left, right) adds the left and right tally values and passes the result to the parent. reduce-gen(root) has nothing to do and returns its argument as the global result. The logic is shown on the next slide.
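The four-function structure can be sketched as a small driver that takes the functions as arguments. This is a sequential stand-in for the Peril-L logic of Figure 5.4, assuming at least one block; `generalized_reduce` and the way the blocks are passed in are illustrative choices.

```python
def generalized_reduce(blocks, init, accum, combine, reduce_gen):
    """Apply init/accum per block, combine tallies pairwise, then reduce_gen."""
    tallies = []
    for block in blocks:              # each iteration stands in for one process
        tally = init()
        for val in block:
            tally = accum(tally, val)
        tallies.append(tally)

    # Pairwise tree combine; left-to-right order of the tallies is preserved.
    while len(tallies) > 1:
        merged = [combine(tallies[i], tallies[i + 1])
                  for i in range(0, len(tallies) - 1, 2)]
        if len(tallies) % 2:          # odd one out moves up a level unchanged
            merged.append(tallies[-1])
        tallies = merged
    return reduce_gen(tallies[0])

# The +/A reduce from this slide, expressed with the four functions.
total = generalized_reduce(
    [[1, 2], [3, 4], [5]],
    init=lambda: 0,
    accum=lambda tally, val: tally + val,
    combine=lambda left, right: left + right,
    reduce_gen=lambda root: root,
)
```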

15 Figure 5.3

16 Figure 5.4 Peril-L code for the generalized reduce logic. Notice the sites for the four component functions. The tree combining relies on the use of full/empty memory, which drives the tree accumulation. As threads complete their roles in the combining tree, they terminate.

17 Figure 5.5 The four generalized reduce functions implementing secondMin reduce. The tally is a two-element struct.
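Since the figure itself is not reproduced in this transcript, here is a hedged sketch of what four secondMin functions could look like. It is not the code of Figure 5.5: the tally is modeled as a two-element tuple, duplicates are counted as separate elements, and `second_min` is a hypothetical driver added so the example runs on its own.

```python
import math
from functools import reduce

def init():
    return (math.inf, math.inf)          # (smallest, second smallest)

def accum(tally, val):
    smallest, second = tally
    if val < smallest:
        return (val, smallest)
    if val < second:
        return (smallest, val)
    return tally

def combine(left, right):
    # Merging two tallies is the same as accumulating both of the
    # right tally's values into the left tally.
    return accum(accum(left, right[0]), right[1])

def reduce_gen(root):
    return root[1]                       # the second smallest overall

def second_min(blocks):
    tallies = [reduce(accum, block, init()) for block in blocks]
    return reduce_gen(reduce(combine, tallies))
```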

18 Generalized scan: like reduce, except that after combining, the intermediate results are passed down the combining tree. The value that each process receives from its parent is the tally for the values that are to the left of the parent’s leftmost leaf.
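The up sweep and down sweep can be sketched with two small recursive helpers. This is a sequential illustration under stated assumptions, not the Peril-L program of Figure 5.7: it assumes at least one block, and the names `generalized_scan`, `up_sweep`, `down_sweep`, and `plus_scan_gen` are made up for the example.

```python
from functools import reduce

def generalized_scan(blocks, init, accum, combine, scan_gen):
    tallies = [reduce(accum, block, init()) for block in blocks]
    prefix = [None] * len(blocks)        # tally of everything left of each block

    def up_sweep(lo, hi):
        """Build the combining tree for tallies[lo:hi]; nodes are (tally, left, right)."""
        if hi - lo == 1:
            return (tallies[lo], None, None)
        mid = (lo + hi) // 2
        left, right = up_sweep(lo, mid), up_sweep(mid, hi)
        return (combine(left[0], right[0]), left, right)

    def down_sweep(node, lo, hi, from_parent):
        """Pass down the tally of all values to the left of this subtree."""
        if hi - lo == 1:
            prefix[lo] = from_parent
            return
        mid = (lo + hi) // 2
        _, left, right = node
        down_sweep(left, lo, mid, from_parent)
        down_sweep(right, mid, hi, combine(from_parent, left[0]))

    down_sweep(up_sweep(0, len(blocks)), 0, len(blocks), init())
    return [scan_gen(p, b) for p, b in zip(prefix, blocks)]

def plus_scan_gen(ptally, block):
    """Inclusive +scan of one block, starting from the tally of its left neighbors."""
    out = []
    for val in block:
        ptally += val
        out.append(ptally)
    return out

result = generalized_scan([[2, 4], [6], [1, 3]], lambda: 0,
                          lambda t, v: t + v, lambda l, r: l + r, plus_scan_gen)
```

Here `result` holds the inclusive +scan of 2, 4, 6, 1, 3, still grouped by block.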

19 Examples of scan: team standings; keeping the longest sequence of 1s; index of last occurrence.

20 Figure 5.6

21 Figure 5.7 Generalized scan program. The down sweep of the tally values, beginning on line 35, distributes intermediate results to all threads to compute the final result (line 44).

22 Figure 5.7 Generalized scan program (cont.).

23 Examples of scan. Given an array A of integers in 1, …, k, lastOccurrence\A returns in position i the index of the most recent occurrence of A[i]. accum stores i in tally[j] as the last occurrence of the value j. combine takes the elementwise max. The scan generator reprocesses the block of data using ptally as its initial value.
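A hedged sketch of the lastOccurrence scan follows. Two assumptions are made loudly here: "most recent occurrence" is read as the most recent occurrence strictly before position i (with -1 meaning none), and the combining tree is replaced by a simple left-to-right pass over the per-process tallies. The function name `last_occurrence_scan` is hypothetical.

```python
def last_occurrence_scan(a, k, num_procs):
    """For each i, the index of the previous occurrence of a[i], or -1."""
    n = len(a)
    bounds = [p * n // num_procs for p in range(num_procs + 1)]

    # init + accum: each process records the last index of every value
    # (1..k) that appears in its own block.
    tallies = []
    for p in range(num_procs):
        tally = [-1] * (k + 1)
        for i in range(bounds[p], bounds[p + 1]):
            tally[a[i]] = i
        tallies.append(tally)

    result = [-1] * n
    ptally = [-1] * (k + 1)              # tally of everything to the left
    for p in range(num_procs):
        # Scan generator: reprocess the block starting from ptally.
        local = list(ptally)
        for i in range(bounds[p], bounds[p + 1]):
            result[i] = local[a[i]]
            local[a[i]] = i
        # combine: elementwise max (stands in for the tree sweeps).
        ptally = [max(x, y) for x, y in zip(ptally, tallies[p])]
    return result
```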

24 Figure 5.8 Customized scan functions to return the index of the last occurrence of the element in the ith operand position; the tally is a globally allocated array of k elements.

25 Assigning work to processes statically. Block allocations exploit locality; they are better than allocating complete rows because they yield less communication. With 16 elements per process, a 4 × 4 block has 16 edge elements, whereas a 16-element row has 2 × 16 = 32 edge elements.
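The edge-element arithmetic can be checked with a short function. The sketch assumes a four-point stencil on an interior block, and that a complete row spans the full array width so it has no left or right neighbors; `edge_elements` and its `full_width` flag are illustrative names.

```python
def edge_elements(block_rows, block_cols, full_width=False):
    """Neighbor values a process must obtain for a four-point stencil."""
    top_bottom = 2 * block_cols
    # A block that spans the whole array width has no left/right neighbors.
    left_right = 0 if full_width else 2 * block_rows
    return top_bottom + left_right

square_block = edge_elements(4, 4)                     # 16
complete_row = edge_elements(1, 16, full_width=True)   # 32
```

Both allocations hold 16 elements, but the square block communicates half as many values.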

26 Figure 5.9

27 Overlap regions: a stencil computation references neighboring elements, so each process allocates extra space to hold its neighbors' values.
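A one-dimensional sketch of overlap regions: each local array gets one extra slot at each end, filled from the neighboring block before a three-point average is applied. The function name `smooth_with_overlap`, the 1D layout, and the rule of replicating the edge value at the array boundary are assumptions for illustration; blocks are assumed non-empty.

```python
def smooth_with_overlap(blocks):
    """Three-point average per block, using one overlap cell on each side."""
    padded = []
    for p, block in enumerate(blocks):
        # Fill the overlap cells from the neighbors (edge value at the ends).
        left = blocks[p - 1][-1] if p > 0 else block[0]
        right = blocks[p + 1][0] if p < len(blocks) - 1 else block[-1]
        padded.append([left] + list(block) + [right])

    # With the overlap cells filled, each block is computed without
    # referring to any other process's data.
    return [[(loc[i - 1] + loc[i] + loc[i + 1]) / 3
             for i in range(1, len(loc) - 1)]
            for loc in padded]

smoothed = smooth_with_overlap([[3, 6], [9, 12, 15]])
```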

28 Figure 5.10

29 Cyclic and block-cyclic allocations. A block allocation may result in poor load balance when work is not proportional to the amount of data. Processes that own the black and white portions have less work to do. After 25% of the computation is done, 7 processes have nothing to do; the last 25% is done by P15 alone.

30 Figure 5.11

31 Solution: use a cyclic distribution. Allocate elements to processes in round-robin fashion. Cyclic allocation balances hot spots. A small block size incurs communication overhead with neighbors and does not exploit locality, so the block size must be chosen carefully.
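Round-robin allocation reduces to a modulus on the element's position. The sketch below assumes the 2D array is dealt out in row-major order, which may differ from the exact layout used in the book's figure; `cyclic_owner` and `cyclic_layout` are illustrative names.

```python
def cyclic_owner(row, col, num_cols, num_procs):
    """Process that owns element (row, col) under a row-major cyclic deal."""
    return (row * num_cols + col) % num_procs

def cyclic_layout(num_rows, num_cols, num_procs):
    return [[cyclic_owner(r, c, num_cols, num_procs) for c in range(num_cols)]
            for r in range(num_rows)]

layout = cyclic_layout(8, 8, 5)
# Count how many elements each of the five processes receives.
counts = [sum(row.count(p) for row in layout) for p in range(5)]
```

For an 8 × 8 array on five processes the counts differ by at most one element, and every region of the array is shared among all processes.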

32 Figure 5.12 Illustration of a cyclic distribution of an 8 × 8 array onto five processes.

33 Figure 5.13 Block-cyclic allocation of 3 × 2 blocks to a 14 × 14 array distributed to four processes (colors).
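A block-cyclic owner function can be sketched as follows. It assumes the four processes are arranged as a 2 × 2 grid and that blocks are dealt cyclically along each dimension; the figure may use a different process arrangement, and `block_cyclic_owner` is a hypothetical name.

```python
def block_cyclic_owner(row, col, block_rows, block_cols, proc_rows, proc_cols):
    """Process that owns element (row, col) under a block-cyclic allocation."""
    grid_row = (row // block_rows) % proc_rows   # which block row, dealt cyclically
    grid_col = (col // block_cols) % proc_cols   # which block column, dealt cyclically
    return grid_row * proc_cols + grid_col

# 3 x 2 blocks on a 14 x 14 array, four processes as an assumed 2 x 2 grid.
owners = [[block_cyclic_owner(r, c, 3, 2, 2, 2) for c in range(14)]
          for r in range(14)]
```

The last block row and block column are partial because 14 is not a multiple of 3, which the integer division handles without special cases.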

34 Figure 5.14 The block-cyclic allocation of Figure 5.13 midway through the computation; the blocks to the right summarize the active values for each process.

35 Julia sets need load balancing. The iteration is $z_{n+1} = z_n^2 + c$, where c is a complex coefficient that determines the shape.
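The reason Julia sets need load balancing is that the number of iterations varies from point to point. A minimal sketch, assuming the usual escape test with a bound of 2 and an iteration cap of 100 (both parameters, and the name `julia_iterations`, are assumptions):

```python
def julia_iterations(z, c, max_iter=100, bound=2.0):
    """Iterations of z <- z*z + c before |z| exceeds the bound (capped)."""
    for n in range(max_iter):
        if abs(z) > bound:
            return n
        z = z * z + c
    return max_iter

# Points that never escape cost max_iter iterations; others finish at once.
slow_point = julia_iterations(0j, 0j)        # stays at 0 and hits the cap
fast_point = julia_iterations(3 + 0j, 0j)    # already outside the bound
```

A block of pixels that falls inside the set therefore costs far more than an equally sized block outside it.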

36 Figure 5.15 Julia set generated from the site http://aleph0.clarku.edu/~djoyce.

37 Figure 5.16 Example of an unstructured grid representing the pressure distribution on two airfoils. Image from http://fun3d.larc.nasa.gov/example-24.html.

38 Assigning work dynamically. A work queue is a data structure for dynamically assigning work to threads or processes. Tasks are added at one end and removed from the other. The example in the text is the Collatz Conjecture, an instance of producer/consumer.
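A work queue in the producer/consumer style can be sketched with the standard `queue` and `threading` modules. This is not the code of Figure 5.17: the task here counts Collatz steps rather than computing the expansion factor, and `collatz_steps`, `run_work_queue`, and the `None` sentinel are choices made for the example.

```python
import queue
import threading

def collatz_steps(n):
    """Number of Collatz steps needed to reach 1 from n (n >= 1)."""
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

def run_work_queue(start, stop, num_workers=4):
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():                       # consumer: take tasks until the sentinel
        while True:
            n = tasks.get()
            if n is None:
                break
            steps = collatz_steps(n)
            with lock:
                results[n] = steps

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for n in range(start, stop):        # producer: add tasks at one end
        tasks.put(n)
    for _ in threads:                   # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return results
```

Each worker takes a new number whenever it finishes the previous one, so long-running inputs do not hold up the others. In CPython the threads mainly illustrate the structure; CPU-bound tasks would likely need processes to run in parallel.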

39 Figure 5.17 Code for computing the expansion factor for the Collatz Conjecture.

40 Figure 5.17 Code for computing the expansion factor for the Collatz Conjecture (cont.).

41 Hoard memory allocator: solves the problem of memory that is allocated in one process and freed in another. Principles: limit local memory usage; manage memory in large blocks (p. 139).

42 Trees. Challenges: local pointers; trees are dynamic, which may cause communication issues; their irregular structure makes it hard to reason about communication and load balancing.

43 Figure 5.18 Cap allocation for a binary tree on P = 8 processes. Each process is allocated one of the leaf subtrees, along with a copy of the cap (shaded).

44 Figure 5.19 Logical tree representations: (a) a binary tree where P = 8; (b) a binary tree where P = 6.

45 Figure 5.20 Enumerating the Tic-Tac-Toe game tree; a process is assigned to search the games beginning with each of the four initial move sequences.

