Introduction to parallel algorithms
COT 5405 – Fall 2012
Ashok Srinivasan, Florida State University
Outline
Background
Primitives
Algorithms
Important points
Background
Terminology: speedup, efficiency, scalability
Communication cost model
Time complexity
Time complexity
In a parallel computation, a group of processors works together to solve a problem. The time required for the computation is the period from when the first processor starts working until the last processor stops.
(figure: timelines comparing sequential, bad parallel, ideal parallel, and realistic parallel executions)
Other terminology
Notation: P = number of processors, T1 = time on one processor, TP = time on P processors
Speedup: S = T1/TP
Efficiency: E = S/P
Work: W = P TP
Scalability: How does TP decrease as we increase P to solve the same problem? How should the problem size increase with P, to keep E constant?
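For example (with made-up numbers), if T1 = 100 and TP = 20 on P = 10 processors, then S = 100/20 = 5, E = 5/10 = 0.5, and W = 10 × 20 = 200, so the parallel run performs twice the work of the sequential one.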
Communication cost model
Processes spend some time doing useful work and some time communicating. Model the cost of one message as TC = ts + L tb, where L is the message size, ts is the start-up cost, and tb is the transfer cost per unit of data.
The cost is independent of the locations of the processes: any process can communicate with any other process, and a process can simultaneously send and receive one message.
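For example (illustrative numbers), with ts = 50 and tb = 1, a message of L = 10 words costs 50 + 10 = 60 time units, while L = 1000 costs 1050; short messages are dominated by the start-up cost.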
I/O model
We will ignore I/O issues, for the most part, and assume that input and output are distributed across the processors in a manner of our choosing.
Example: sorting
Input: x1, x2, ..., xn, with xi initially on processor i
Output: the sorted sequence xp1, xp2, ..., xpn, with xpi on processor i and xpi ≤ xpi+1
Primitives
Reduction
Broadcast
Gather/Scatter
All gather
Prefix
Reduction -- 1
Every other processor sends its value to one processor, which performs all the additions itself. (figure: x2, ..., xn all sent to the processor holding x1)
Tn = (n-1) + (n-1)(ts + tb)
Since T1 = n-1, the speedup is Sn = 1/(1 + ts + tb).
Reduction -- 2
Apply reduction-1 to {x1, ..., xn/2} and {xn/2+1, ..., xn} concurrently, then combine the two partial results with one more message and one more addition.
Tn = (n/2 - 1) + (n/2 - 1)(ts + tb) + (ts + tb) + 1 = n/2 + (n/2)(ts + tb)
Sn ~ 2/(1 + ts + tb)
Reduction -- 3
Apply reduction-2 recursively: divide and conquer into halves, quarters, and so on, down to groups of one. (figure: recursion tree with reduction-1 applied to {x1, ..., xn/4}, {xn/4+1, ..., xn/2}, {xn/2+1, ..., x3n/4}, and {x3n/4+1, ..., xn})
Tn ~ log2 n + (ts + tb) log2 n
Sn ~ (n / log2 n) × 1/(1 + ts + tb)
Note that any associative operator can be used in place of +.
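To make the recursion concrete, here is a minimal MPI sketch of this reduction pattern (not code from the slides), assuming the number of processes is a power of two and each process holds a single value:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double x = rank + 1.0;  /* this process's value; sum is p(p+1)/2 */

        /* log2(P) rounds: in each round, half of the remaining processes
           send their partial sums to a partner and drop out. */
        for (int dist = 1; dist < p; dist *= 2) {
            int partner = rank ^ dist;
            if (rank & dist) {            /* sender: finished after this */
                MPI_Send(&x, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
                break;
            } else {                      /* receiver: accumulate */
                double y;
                MPI_Recv(&y, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                x += y;                   /* any associative operator works */
            }
        }
        if (rank == 0) printf("sum = %g\n", x);

        MPI_Finalize();
        return 0;
    }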
Parallel addition features
If n >> P: each processor adds n/P distinct numbers, and a parallel reduction then combines the P partial sums.
TP ~ n/P + (1 + ts + tb) log P
The optimal P is obtained by differentiating with respect to P: Popt ~ n/(1 + ts + tb). If communication cost is high, then fewer processors ought to be used.
E = [1 + (1 + ts + tb) P log P / n]^-1
As the problem size increases, efficiency increases; as the number of processors increases, efficiency decreases.
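A minimal MPI sketch of this scheme, assuming n is divisible by P; the data is made up, and the reduction step uses the library's MPI_Reduce:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        const long n = 1L << 20;          /* illustrative size, n >> P */
        long chunk = n / p;               /* each process adds n/P numbers */

        double local = 0.0;
        for (long i = rank * chunk; i < (rank + 1) * chunk; i++)
            local += (double)i;           /* stand-in for the real data */

        /* parallel reduction on the P partial sums */
        double total;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %g (expect %g)\n", total, 0.5 * n * (n - 1));

        MPI_Finalize();
        return 0;
    }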
Some common collective operations
Broadcast: one process sends the same data to all processes.
Gather: each process contributes one piece of data, which the root collects.
Scatter: the root distributes a distinct piece of data to each process.
All gather: like gather, but every process receives the complete collection.
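In MPI these four operations are provided as library collectives. A minimal sketch (buffer sizes assume at most 64 processes):

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double item = rank;           /* one item per process */
        double all[64];               /* room for P items; assumes p <= 64 */

        /* Broadcast: the root's item is copied to every process */
        MPI_Bcast(&item, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Gather: one item from each process is collected at the root */
        MPI_Gather(&item, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Scatter: the root hands one item back to each process */
        MPI_Scatter(all, 1, MPI_DOUBLE, &item, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* All gather: every process ends up with all P items */
        MPI_Allgather(&item, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }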
Broadcast
T ~ (ts + L tb) log P, where L is the length of the data. (figure: broadcast by recursive doubling; in each round, every process that already has the data forwards it to one that does not)
Gather/Scatter
Gather: data move towards the root, with message sizes doubling each round: L, 2L, 4L, ... (figure: tree combining x1, ..., x8 into x18 at the root)
Note: the total data received by the root is (sum over i = 0 to log P - 1 of 2^i) L = (2^(log P) - 1) L = (P-1) L ~ P L.
T ~ ts log P + P L tb
Scatter is the same pattern in reverse: review question.
All gather
Equivalent to each processor broadcasting its data to all the processors.
(figures: recursive doubling on 8 processors; neighbors first exchange single items x1, ..., x8 with messages of size L, then pairs exchange the combined blocks x12, x34, x56, x78 with messages of size 2L, then x14 and x58 with messages of size 4L, after which every processor holds x18)
Tn ~ ts log P + P L tb
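A sketch of the recursive-doubling exchange pictured above (an illustration under stated assumptions, not the slides' code): P is taken to be a power of two, at most 64, with one double per process; the block a process holds doubles each round.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double buf[64];                   /* indexed by rank; assumes p <= 64 */
        buf[rank] = rank;                 /* this process's own value */

        int have_lo = rank, have_cnt = 1; /* block currently held */
        for (int dist = 1; dist < p; dist *= 2) {
            int partner = rank ^ dist;
            int partner_lo = (partner / dist) * dist; /* partner's block start */
            /* swap equal-sized blocks; message size doubles each round */
            MPI_Sendrecv(&buf[have_lo], have_cnt, MPI_DOUBLE, partner, 0,
                         &buf[partner_lo], have_cnt, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (partner_lo < have_lo) have_lo = partner_lo;
            have_cnt *= 2;
        }
        /* buf[0..p-1] now holds every process's value on every process */

        MPI_Finalize();
        return 0;
    }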
Review question: Pipelining
Pipelining is useful when repeatedly and regularly performing a large number of primitive operations.
The optimal time for one broadcast is ~ log P, so performing n broadcasts one after another takes n log P time. Pipelining the broadcasts takes n + P time, and n + P << n log P when n >> P, so the amortized time per broadcast is almost constant.
Review question: How can you accomplish this time complexity? (One possible scheme is sketched below.)
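One possible answer, sketched here under illustrative assumptions: arrange the processes in a line and have each process forward item i as soon as it arrives, so successive items travel one hop behind each other.

    #include <mpi.h>

    #define N 1024                        /* number of broadcasts to pipeline */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double item[N];
        if (rank == 0)                    /* the root owns the data */
            for (int i = 0; i < N; i++) item[i] = i;

        /* Each process receives item i from its left neighbor and
           immediately forwards it right; item i+1 can be in flight
           one hop behind, giving total time ~ N + P instead of N log P. */
        for (int i = 0; i < N; i++) {
            if (rank > 0)
                MPI_Recv(&item[i], 1, MPI_DOUBLE, rank - 1, i,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (rank < p - 1)
                MPI_Send(&item[i], 1, MPI_DOUBLE, rank + 1, i, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }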
Sequential prefix
Input: values xi, 1 ≤ i ≤ n
Output: Xi = x1 * x2 * ... * xi, 1 ≤ i ≤ n, where * is an associative operator
Algorithm:
X1 = x1
for i = 2 to n: Xi = Xi-1 * xi
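A direct C transcription (0-based indexing, + as the associative operator):

    void prefix(const double *x, double *X, int n) {
        /* X[i] = x[0] + x[1] + ... + x[i]; any associative op can replace + */
        X[0] = x[0];
        for (int i = 1; i < n; i++)
            X[i] = X[i - 1] + x[i];
    }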
Parallel prefix
Input: processor i has xi. Output: processor i has x1 * x2 * ... * xi.
Define f(a,b) as follows:
if a == b: Xi = xi on Proc Pi
else: compute f(a, (a+b)/2) and f((a+b)/2 + 1, b) in parallel; then each Pi in the lower half, a ≤ i ≤ (a+b)/2, and its partner Pj in the upper half, j = i + (b-a+1)/2, send Xi and Xj to each other, and set Xi = Xi*Xj on Pi and Xj = Xi*Xj on Pj.
Divide and conquer: f(a,b) yields Xi = xa * ... * xi on Proc Pi, a ≤ i ≤ b, so f(1,n) solves the problem.
T(n) = T(n/2) + (ts + tb) => T(n) = O(log n)
An iterative implementation improves the constant.
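In MPI this entire computation is available as the MPI_Scan collective; a minimal sketch with + as the operator:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double x = rank + 1.0, X;
        /* inclusive prefix: process i receives x_1 + ... + x_(i+1) */
        MPI_Scan(&x, &X, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d: prefix = %g\n", rank, X);

        MPI_Finalize();
        return 0;
    }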
Iterative parallel prefix example
(figure: step-by-step worked example; a code sketch of the iterative scheme follows)
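Since the worked example was a figure, here instead is a sketch of one common iterative scheme (an illustration, with one value per process assumed): in round k, each process combines its running result with the result of the process 2^k below it.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double v = rank + 1.0;            /* this process's x_i */

        for (int dist = 1; dist < p; dist *= 2) {
            double in = 0.0;              /* stays 0 if there is no sender */
            int dst = (rank + dist < p) ? rank + dist : MPI_PROC_NULL;
            int src = (rank - dist >= 0) ? rank - dist : MPI_PROC_NULL;
            /* send the current running value up; receive one from below */
            MPI_Sendrecv(&v, 1, MPI_DOUBLE, dst, 0,
                         &in, 1, MPI_DOUBLE, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            v += in;
        }
        /* after log2(P) rounds, v = x_1 + ... + x_(rank+1) */

        MPI_Finalize();
        return 0;
    }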
Algorithms
Linear recurrence
Matrix-vector multiplication
Linear recurrence
Determine each xi, 2 ≤ i ≤ n, where xi = ai xi-1 + bi xi-2, with x0 and x1 given.
Sequential solution:
for i = 2 to n: xi = ai xi-1 + bi xi-2
This follows directly from the recurrence, but each iteration depends on the previous two, so this approach is not easily parallelized.
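The loop in C (a direct transcription; the caller is assumed to supply the coefficient arrays and the initial values x[0] and x[1]):

    void recurrence(const double *a, const double *b, double *x, int n) {
        /* x[i] depends on x[i-1] and x[i-2], so the iterations cannot
           simply be run in parallel */
        for (int i = 2; i <= n; i++)
            x[i] = a[i] * x[i - 1] + b[i] * x[i - 2];
    }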
Linear recurrence in parallel
Given xi = ai xi-1 + bi xi-2, write one step for a pair of consecutive terms:
x2i = a2i x2i-1 + b2i x2i-2
x2i+1 = a2i+1 x2i + b2i+1 x2i-1
Rewrite this in matrix form with Xi = [x2i, x2i+1]^T and Xi-1 = [x2i-2, x2i-1]^T:
Xi = Ai Xi-1, where
Ai = | b2i         a2i               |
     | a2i+1 b2i   b2i+1 + a2i+1 a2i |
Then Xi = Ai Ai-1 ... A1 X0. This is a parallel prefix computation, since matrix multiplication is associative; it is solved in O(log n) time.
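A sketch of this idea using MPI_Scan with a user-defined, non-commutative operator; the coefficients here are made up, and each process holds one companion matrix Ai:

    #include <mpi.h>

    typedef struct { double m[4]; } Mat2;   /* 2x2 matrix, row-major */

    /* MPI combine function: inout = inout * in. The operand from the
       lower rank ('in') multiplies on the right, so on process i the
       scan yields A_i A_(i-1) ... A_1. */
    static void matmul(void *invec, void *inoutvec, int *len, MPI_Datatype *dt) {
        Mat2 *a = (Mat2 *)invec, *b = (Mat2 *)inoutvec;
        for (int k = 0; k < *len; k++) {
            Mat2 c;
            c.m[0] = b[k].m[0] * a[k].m[0] + b[k].m[1] * a[k].m[2];
            c.m[1] = b[k].m[0] * a[k].m[1] + b[k].m[1] * a[k].m[3];
            c.m[2] = b[k].m[2] * a[k].m[0] + b[k].m[3] * a[k].m[2];
            c.m[3] = b[k].m[2] * a[k].m[1] + b[k].m[3] * a[k].m[3];
            b[k] = c;
        }
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Datatype mat_t;
        MPI_Type_contiguous(4, MPI_DOUBLE, &mat_t);
        MPI_Type_commit(&mat_t);

        MPI_Op op;
        MPI_Op_create(matmul, 0 /* not commutative */, &op);

        /* companion matrix A_i built from illustrative coefficients */
        double a2i = 1.0 / (rank + 2), a2i1 = 1.0 / (rank + 3);
        double b2i = 0.5, b2i1 = 0.25;
        Mat2 A = {{ b2i,        a2i,
                    a2i1 * b2i, b2i1 + a2i1 * a2i }};

        Mat2 prefix;                         /* A_i ... A_1 on process i */
        MPI_Scan(&A, &prefix, 1, mat_t, op, MPI_COMM_WORLD);
        /* X_i is then prefix times X_0 */

        MPI_Op_free(&op);
        MPI_Type_free(&mat_t);
        MPI_Finalize();
        return 0;
    }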
Matrix-vector multiplication
c = A b, often performed repeatedly: bi = A bi-1. We therefore need the same data distribution for c and b.
One-dimensional decomposition. Example: A row-wise block striped, b and c replicated.
Each process computes its components of c independently; then the components of c are all-gathered.
1-D matrix-vector multiplication
c: replicated; A: row-wise block striped; b: replicated
Each process computes its components of c independently. Time: Θ(n^2/P)
Then all-gather the components of c. Time: ts log P + tb n
Note: P ≤ n
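A sketch of this algorithm, assuming n is divisible by P; the matrix entries are made up (A = 2I) so the result can be checked:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 512                          /* illustrative n; assume P | N */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int rows = N / p;                  /* this process's block of rows */
        double *a = malloc((size_t)rows * N * sizeof *a);
        double *c_local = malloc((size_t)rows * sizeof *c_local);
        double b[N], c[N];

        /* fill the local rows of A (here 2*I) and the replicated b */
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < N; j++)
                a[i * N + j] = (j - rank * rows == i) ? 2.0 : 0.0;
        for (int j = 0; j < N; j++) b[j] = 1.0;

        /* each process computes its components of c independently */
        for (int i = 0; i < rows; i++) {
            c_local[i] = 0.0;
            for (int j = 0; j < N; j++)
                c_local[i] += a[i * N + j] * b[j];
        }

        /* all-gather the components so that c is replicated again */
        MPI_Allgather(c_local, rows, MPI_DOUBLE, c, rows, MPI_DOUBLE,
                      MPI_COMM_WORLD);
        if (rank == 0) printf("c[0] = %g (expect 2)\n", c[0]);

        free(a); free(c_local);
        MPI_Finalize();
        return 0;
    }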
2-D matrix-vector multiplication
(figure: A partitioned into √P × √P blocks Aij on processes Pij, with block Bj of b and block Ci of c; Bi starts on Pi0)
Step 1: each process Pi0 sends Bi to P0i. Time: ts + tb n/√P
Step 2: each process P0j broadcasts Bj to all Pij in its column. Time: ts log √P + tb n log √P / √P
Step 3: each process Pij computes Cij = Aij Bj. Time: Θ(n^2/P)
Step 4: the Cij are reduced onto Pi0, 0 ≤ i < √P.
Total time: Θ(n^2/P + ts log P + tb n log P / √P). This requires only P ≤ n^2, so it is more scalable than the one-dimensional decomposition.
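A sketch of these four steps (illustrative, with dummy data), assuming P is a perfect square and √P divides n; the row and column communicators come from MPI_Comm_split:

    #include <mpi.h>
    #include <math.h>
    #include <stdlib.h>

    #define N 256                            /* illustrative n */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int q = (int)(sqrt((double)p) + 0.5);  /* assume p == q*q */
        int row = rank / q, col = rank % q;
        int nb = N / q;                        /* block size; assume q | N */

        MPI_Comm row_comm, col_comm;
        MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm); /* fixed row */
        MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm); /* fixed column */

        double *Ablk = malloc((size_t)nb * nb * sizeof *Ablk);
        double *Bblk = malloc((size_t)nb * sizeof *Bblk);
        double *Cblk = malloc((size_t)nb * sizeof *Cblk);
        double *Cres = malloc((size_t)nb * sizeof *Cres);
        for (int i = 0; i < nb * nb; i++) Ablk[i] = 1.0;   /* dummy data */
        for (int i = 0; i < nb; i++) Bblk[i] = 1.0;        /* B_i on P_i0 */

        /* step 1: P_i0 sends B_i to P_0i (world rank of P_0i is i) */
        if (col == 0 && row != 0)
            MPI_Send(Bblk, nb, MPI_DOUBLE, row, 0, MPI_COMM_WORLD);
        if (row == 0 && col != 0)
            MPI_Recv(Bblk, nb, MPI_DOUBLE, col * q, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        /* step 2: P_0j broadcasts B_j down column j */
        MPI_Bcast(Bblk, nb, MPI_DOUBLE, 0, col_comm);

        /* step 3: local block multiply, C_ij = A_ij B_j */
        for (int i = 0; i < nb; i++) {
            Cblk[i] = 0.0;
            for (int j = 0; j < nb; j++)
                Cblk[i] += Ablk[i * nb + j] * Bblk[j];
        }

        /* step 4: reduce the partial C_ij across each row onto P_i0 */
        MPI_Reduce(Cblk, Cres, nb, MPI_DOUBLE, MPI_SUM, 0, row_comm);

        free(Ablk); free(Bblk); free(Cblk); free(Cres);
        MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
        MPI_Finalize();
        return 0;
    }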
Important points
Efficiency increases with problem size and decreases with the number of processors.
Aggregating tasks to increase granularity reduces communication overhead.
Data distribution matters: a 2-dimensional decomposition may be more scalable than a 1-dimensional one, and the distribution affects load balance as well.
General techniques: divide and conquer, pipelining.