Download presentation

Presentation is loading. Please wait.

Published bySarina Turney Modified over 2 years ago

1
Parallel Algorithms

2
Computation Models Goal of computation model is to provide a realistic representation of the costs of programming. Model provides algorithm designers and programmers a measure of algorithm complexity which helps them decide what is good (i.e. performance-efficient)

3
Goal for Modeling We want to develop computational models which accurately represent the cost and performance of programs If model is poor, optimum in model may not coincide with optimum observed in practice ModelReal World x A B optimum Y

4
Models of Computation Whats a model good for?? Provides a way to think about computers. Influences design of: Architectures Languages Algorithms Provides a way of estimating how well a program will perform. Cost in model should be roughly same as cost of executing program

5
The Random Access Machine Model RAM model of serial computers: –Memory is a sequence of words, each capable of containing an integer. –Each memory access takes one unit of time –Basic operations (add, multiply, compare) take one unit time. –Instructions are not modifiable –Read-only input tape, write-only output tape

6
Has RAM influenced our thinking? Language design: No way to designate registers, cache, DRAM. Most convenient disk access is as streams. How do you express atomic read/modify/write? Machine & system design: Its not very easy to modify code. Systems pretend instructions are executed in-order. Performance Analysis: Primary measures are operations/sec (MFlop/sec, MHz,...) Whats the difference between Quicksort and Heapsort??

7
What about parallel computers RAM model is generally considered a very successful bridging model between programmer and hardware. Since RAM is so successful, lets generalize it for parallel computers...

8
PRAM [Parallel Random Access Machine] PRAM composed of: – P processors, each with its own unmodifiable program. –A single shared memory composed of a sequence of words, each capable of containing an arbitrary integer. –a read-only input tape. –a write-only output tape. PRAM model is a synchronous, MIMD, shared address space parallel computer. (Introduced by Fortune and Wyllie, 1978)

9
PRAM model of computation p processors, each with local memory Synchronous operation Shared memory reads and writes Each processor has unique id in range 1-p Shared memory

10
Characteristics At each unit of time, a processor is either active or idle (depending on id) All processors execute same program At each time step, all processors execute same instruction on different data (data- parallel) Focuses on concurrency only

11
Variants of PRAM model

12
More PRAM taxonomy Different protocols can be used for reading and writing shared memory. –EREW - exclusive read, exclusive write A program isnt allowed to have two processors access the same memory location at the same time. –CREW - concurrent read, exclusive write –CRCW - concurrent read, concurrent write Needs protocol for arbitrating write conflicts –CROW – concurrent read, owner write Each memory location has an official owner PRAM can emulate a message-passing machine by partitioning memory into private memories.

13
Sub-variants of CRCW Common CRCW –CW iff all processors writing same value Arbitrary CRCW –Arbitrary value of write set stored Priority CRCW –Value of min-index processor stored Combining CRCW

14
Why study PRAM algorithms? Well-developed body of literature on design and analysis of such algorithms Baseline model of concurrency Explicit model –Specify operations at each step –Scheduling of operations on processors Robust design paradigm

15
Work-Time paradigm Higher-level abstraction for PRAM algorithms WT algorithm = (finite) sequence of time steps with arbitrary number of operations at each step Two complexity measures –Step complexity T(n) –Work complexity W(n) WT algorithm work-efficient if W(n) = (T S (n)) optimal sequential Algorithm

16
Designing PRAM algorithms Balanced trees Pointer jumping Euler tours Divide and conquer Symmetry breaking...

17
Balanced trees Key idea: Build balanced binary tree on input data, sweep tree up and down Tree not a data structure, often a control structure (e.g., recursion)

18
Alg : Sum Given: Sequence a of n = 2 k elements Given: Binary associative operator + Compute: S = a a n

19
WT description of sum integer B[1..n] forall i in 1 : n do B[i] := a i enddo for h = 1 to k do forall i in 1 : n/2 h do B[i] := B[2i-1] + B[2i] enddo S := B[1]

20
Points to note about WT pgm Global program: no references to processor id Contains both serial and concurrent operations Semantics of forall Order of additions different from sequential order: associativity critical

21
Analysis of scan operation Algorithm is correct (lg n) steps, (n) work EREW model Two variants –Inclusive: as discussed –Exclusive: s 1 = I, s k = x x k-1 If n not power of 2, pad to next power

22
Complexity measures of Sum Recall definitions of step complexity T(n) and work complexity W(n) Concurrent execution reduces number of steps

23
How to do prefix sum ? Input: Sequence x of n = 2 k elements, binary associative operator + Output: Sequence s of n = 2 k elements, with s k = x x k Example: x = [1, 4, 3, 5, 6, 7, 0, 1] s = [1, 5, 8, 13, 19, 26, 26, 27]

24
List Ranking List ranking problem –Given a singly linked list L with n objects, for each node, compute the distance to the end of the list If d denotes the distance –node.d = 0 if node.next = nil – node.next.d + 1 otherwise Serial algorithm: O(n) Parallel algorithm –Assign one processor for each node –Assume there are as many processors as list objects –For each node i, perform 1.i.d = i.d + i.next.d 2.i.next = i.next.next// pointer jumping {

25
List Ranking - Pointer Jumping List_ranking(L) 1.for each node i, in parallel do 2. if i.next = nil then i.d = 0 3. else i.d = 1 4.while exists a node i, such that i.next != nil do 5. for each node i, in parallel do 6. if i.next != nil then 7. i.d = i.d + i.next.d// i updates i itself 8. i.next = i.next.next Analysis –After a pointer jumping, a list is transformed into two (interleaved) lists –After that, four (interleaved) lists –Each pointer jumping doubles the number of lists and halves their length –After log n, all lists contain only one node –Total time: O(log n)

26
List Ranking - Example

27
List Ranking - Discussion Synchronization is important –In step 8 (i.next = i.next.next), all processors must read right hand side before any processor write left hand side The list ranking algorithm is EREW –If we assume in step 7 (i.d = i.d + i.next.d) all processors read i.d and then read i.next.d –If j.next = i, i and j do not read i.d concurrently Work performance –performs O(n log n) work since n processors in O(log n) time Work efficient –A PRAM algorithm is work efficient w.r.t another algorithm if two algorithms are within a constant factor –Is the link ranking algorithm work-efficient w.r.t the serial algorithm? No, because O(n log n) versus O(n) Speedup –S = n / log n

28
Parallel Prefix on a List Prefix computation –Input, a binary, associative operator –Output –Prefix computation: y k = x 1 x 2.. x k Example –if x k = 1 for k=1..n and = + –Then y k = k, for k = 1..n Serial algorithm: O(n) Notation –[i, j] = x i x i+1.. x j [k, k] = x k [i, k] [k+1, j] = [i, j] Idea: perform prefix computation on a linked list so that –each node k contains [k, k] = x k initially –finally each node k contains [1, k] = y k

29
Parallel Prefix on a List (2) List_prefix(L, X) // L: list, X: 1.for each node i, in parallel 2. i.y = x i 3.While exists a node i such that i.next != nil do 4. for each node i, in parallel do 5. if i.next != nil then 6. i.next.y = i.y i.next.y // i updates its successor 7. i.next = i.next.next Analysis –Initially k-th node has [k,k] as y-value, points to (k+1)-th node –At the first iteration, k-th node fetches [k+1,k+1] from its successor and perform [k,k] [k+1,k+1] resulting in [k,k+1] and update its successor –At the second iteration k-th node fetches [k+1,k+2] from its successor and perform [k-1,k] [k+1,k+2] resulting in [k-1,k+2] and update its successor

30
Parallel Prefix on a List (3)

31
Parallel Prefix on a List (4) Running time: O(log n) –After log n, all lists contain only one node Work performed: O(n log n) Speedup –S = n / log n

32
Pointer jumping Fast parallel processing of linked data structures (lists, trees) Convention: Draw trees with edges directed from children to parents Example: Finding the roots of forest represented as parent array P –P[i] = j if and only if (i, j) is a forest edge –P[i] = i if and only if i is a root

33
Algorithm (Roots of forest) forall i in 1:n do S[i] := P[i] while S[i] != S[S[i]] do S[i] := S[S[i]] endwhile enddo

34
Initial state of forest

35
After one iteration

36
After another iteration

37
Concurrent Read – Finding Roots

38
Analysis of pointer jumping Termination detection? At each step, tree distance between i and S[i] doubles unless S[i] is a root CREW model Correctness by induction on h O(lg h) steps, O(n lg h) work T S (n) = O(n) Not work-efficient unless h constant

39
Concurrent Read – Finding Roots This is a CREW algorithm Suppose Exclusive-Read is used, what will be the running time? –Initially only one node i has root information –First iteration: Another node reads from the node i Totally two nodes are filled up –Second iteration: Another two nodes can reads from the two nodes Totally four nodes are filled up –k-th iteration: 2 k-1 nodes are filled up –If there are n nodes, k=log n –So Find_root with Exclusive-Read takes O(log n). O(log log n) vs. O(log n)

40
Euler tours Technique for fast optimal processing of tree data Euler circuit of directed graph: directed cycle that traverses each edge exactly once Represent (rooted) tree by Euler circuit of its directed version

41
Trees = balanced parentheses ( ( ( ) ( ) ) ( ) ( ( ) ( ) ( ) ) ) Key property: The parenthesis subsequence corresponding to a subtree is balanced.

42
Computing the Depth Problem definition –Given a binary tree with n nodes, compute the depth of each node Serial algorithm takes O(n) time A simple parallel algorithm –Starting from root, compute the depths level by level –Still O(n) because the height of the tree could be as high as n Euler tour algorithm –Uses parallel prefix computation

43
Computing the Depth (2) Euler tour: A cycle that traverses each edge exactly once in a graph –It is a directed version of a tree Regard an undirected edge into two directed edges –Any directed version of a tree has an Euler tour by traversing the tree in a DFS way forming a linked list. Employ 3*n processors –Each node i has fields i.parent, i.left, i.right –Each node i has three processors, i.A, i.B, and i.C. Three processors in each node of the tree are linked as follows –i.A = i.left.A if i.left != nil – i.B if i.left = nil –i.B = i.right.A if i.right != nil – i.C if i.right = nil –i.C = i.parent.B if i is the left child – i.parent.C if i is the right child – nil if i.parent = nil { { {

44
Computing the Depth (3) Algorithm –Construct the Euler tour for the tree – O(1) time –Assign 1 to all A processors, 0 to B processors, -1 to C processors –Perform a parallel prefix computation –The depth of each node resides in its C processor O(log n) –Actually log 3n EREW because no concurrent read or write Speedup –S = n/log n

45
Computing the Depth (4)

46
Broadcasting on a PRAM Broadcast can be done on CREW PRAM in O(1) steps: –Broadcaster sends value to shared memory –Processors read from shared memory Requires lg(P) steps on EREW PRAM. M PPPPPPPP B

47
Concurrent Write – Finding Max Finding max problem –Given an array of n elements, find the maximum(s) –sequential algorithm is O(n) Data structure for parallel algorithm –Array A[1..n] –Array m[1..n]. m[i] is true if A[i] is the maximum –Use n 2 processors Fast_max(A, n) 1.for i = 1 to n do, in parallel 2. m[i] = true// A[i] is potentially maximum 3.for i = 1 to n, j = 1 to n do, in parallel 4. if A[i] < A[j] then 5. m[i] = false 6.for i = 1 to n do, in parallel 7. if m[i] = true then max = A[i] 8.return max Time complexity: O(1)

48
Concurrent Write – Finding Max Concurrent-write –In step 4 and 5, processors with A[i] < A[j] write the same value false into the same location m[i] –This actually implements m[i] = (A[i] A[1]) … (A[i] A[n]) Is this work efficient? –No, n 2 processors in O(1) –O(n 2 ) work vs. sequential algorithm is O(n) What is the time complexity for the Exclusive-write? –Initially elements think that they might be the maximum –First iteration: For n/2 pairs, compare. n/2 elements might be the maximum. –Second iteration: n/4 elements might be the maximum. log n th iteration: one element is the maximum. –So Fast_max with Exclusive-write takes O(log n). O(1) (CRCW) vs. O(log n) (EREW)

49
Simulating CRCW with EREW CRCW algorithms are faster than EREW algorithms –How much fast? Theorem –A p-processor CRCW algorithm can be no more than O(log p) times faster than the best p-processor EREW algorithm Proof by simulating CRCW steps with EREW steps –Assumption: A parallel sorting takes O(log n) time with n processors –When CRCW processor p i write a datum x i into a location l i, EREW p i writes the pair (l i, x i ) into a separate location A[i] Note EREW write is exclusive, while CRCW may be concurrent –Sort A by l i O(log p) time by assumption –Compare adjacent elements in A –For each group of the same elements, only one processor, say first, write x i into the global memory l i. Note this is also exclusive. –Total time complexity: O(log p)

50
Simulating CRCW with EREW (2)

51
CRCW versus EREW - Discussion CRCW –Hardware implementations are expensive –Used infrequently –Easier to program, runs faster, more powerful. –Implemented hardware is slower than that of EREW In reality one cannot find maximum in O(1) time EREW –Programming model is too restrictive Cannot implement powerful algorithms

Similar presentations

© 2016 SlidePlayer.com Inc.

All rights reserved.

Ads by Google