
1 Parallel and Distributed Algorithms Eric Vidal. Reference: R. Johnsonbaugh and M. Schaefer, Algorithms (International Edition), Pearson Education, 2004.

2 Outline Introduction (case study: maximum element) – Work-optimality The Parallel Random Access Machine – Shared memory modes – Accelerated cascading Other Parallel Architectures (case study: sorting) – Circuits – Linear processor networks – (Mesh processor networks) Distributed Algorithms – Message-optimality – Broadcast and echo – (Leader election)

3 Introduction

4 Why use parallelism? A job that takes p steps on 1 printer takes 1 step on p printers; p = speed-up factor (best case). Given a sequential algorithm, how can we parallelize it? – Some algorithms are inherently sequential (P-complete)

5 Case Study: Maximum Element
In: a[]. Out: maximum element in a.

sequential_maximum(a) {
  n = a.length
  max = a[0]
  for i = 1 to n – 1 {
    if (a[i] > max)
      max = a[i]
  }
  return max
}

Example: a = {21, 11, 23, 17, 48, 33, 22, 41}; the running maximum takes the values 21, 23, 48. Running time: O(n).

6 Parallel Maximum
Idea: Use ⌈n/2⌉ processors to compare pairs, halving the candidates at each step. Note the idle processors after the first step!
Example tournament: 21 11 23 17 48 33 22 41 → 21 23 48 41 → 23 48 → 48. Running time: O(lg n).

7 Work-Optimality
Work = number of algorithmic steps × number of processors.
Work of the parallelized maximum algorithm = O(lg n) steps × (n/2) processors = O(n lg n).
Not work-optimal! The sequential algorithm's work is only O(n).
– Workaround: accelerated cascading…

8 Formal Algorithm for Parallel Maximum But first!...

9 The Parallel Random Access Machine

10 The Parallel Random Access Machine (PRAM)
New construct: the parallel loop
  for i = 1 to n in parallel { … }
Assumption 1: n processors execute this loop (processors are synchronized).
Assumption 2: memory is shared across all processors.

11 Example: Parallel Search
In: a[], x. Out: true if x is in a, false otherwise.

parallel_search(a, x) {
  n = a.length
  found = false
  for i = 0 to n – 1 in parallel {
    if (a[i] == x)
      found = true
  }
  return found
}

Is this work-optimal?
Shared memory modes: Exclusive Read (ER), Concurrent Read (CR), Exclusive Write (EW), Concurrent Write (CW). Real-world systems are most commonly CREW.
parallel_search runs on what type?
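
As an aside for modern readers: below is a minimal Python simulation of this loop, using a thread pool to stand in for the n synchronized PRAM processors (an approximation; real PRAM steps run in lockstep, and the harness names are assumptions). Every processor that finds x writes the same value to found, which is the benign kind of concurrent write.

  # Simulation of parallel_search using a thread pool.
  # Each worker plays one PRAM processor; several of them may write
  # `found` at once, which is safe here because every writer writes
  # the same value (a common concurrent-write resolution rule).
  from concurrent.futures import ThreadPoolExecutor

  def parallel_search(a, x):
      found = False                      # shared memory cell

      def processor(i):
          nonlocal found
          if a[i] == x:
              found = True               # concurrent write, same value

      with ThreadPoolExecutor(max_workers=max(1, len(a))) as pool:
          pool.map(processor, range(len(a)))
      return found

  print(parallel_search([21, 11, 23, 17, 48, 33, 22, 41], 48))  # True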

12 Formal Algorithm for Parallel Maximum
In: a[]. Out: maximum element in a.

parallel_maximum(a) {
  n = a.length
  for i = 0 to ⌈lg n⌉ – 1 {
    for j = 0 to ⌈n / 2^(i+1)⌉ – 1 in parallel {
      if (j × 2^(i+1) + 2^i < n)   // boundary check
        a[j × 2^(i+1)] = max(a[j × 2^(i+1)], a[j × 2^(i+1) + 2^i])
    }
  }
  return a[0]
}

Theorem: parallel_maximum is CREW and finds the maximum element in parallel time O(lg n) and work O(n lg n).
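
The doubling index arithmetic is easy to misread, so here is a sketch that simulates the algorithm round by round in Python. The inner loop bodies are independent of one another, which is what would let ⌈n/2^(i+1)⌉ processors execute them in parallel on a PRAM.

  import math

  def parallel_maximum(a):
      # Simulates the PRAM algorithm: round i merges pairs 2^i apart.
      a = list(a)                        # don't clobber the caller's array
      n = len(a)
      for i in range(math.ceil(math.log2(n))):
          # Each j below would be one processor in round i.
          for j in range(math.ceil(n / 2 ** (i + 1))):
              lo = j * 2 ** (i + 1)
              hi = lo + 2 ** i
              if hi < n:                 # boundary check
                  a[lo] = max(a[lo], a[hi])
      return a[0]

  print(parallel_maximum([21, 11, 23, 17, 48, 33, 22, 41]))  # 48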

13 Accelerated Cascading
Phase 1: Use sequential_maximum on blocks of lg n elements.
– We use n / lg n processors
– O(lg n) sequential steps per processor
– Total work = O(lg n) steps × (n / lg n) processors = O(n)
Phase 2: Use parallel_maximum on the resulting n / lg n elements.
– lg(n / lg n) parallel steps = lg n – lg lg n = O(lg n)
– Total work = O(lg n) steps × ((n / lg n) / 2) processors = O(n)

14 Formal Algorithm for Optimal Maximum
In: a[]. Out: maximum element in a.

optimal_maximum(a) {
  n = a.length
  block_size = ⌈lg n⌉
  block_count = ⌈n / block_size⌉
  create array block_results[block_count]
  for i = 0 to block_count – 1 in parallel {
    start = i × block_size
    end = min(n – 1, start + block_size – 1)
    block_results[i] = sequential_maximum(a[start..end])
  }
  return parallel_maximum(block_results)
}
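
A Python sketch of the same two-phase scheme, reusing the parallel_maximum simulation above; the block arithmetic follows the pseudocode, and the "in parallel" loop is simulated sequentially.

  import math

  def sequential_maximum(a):
      m = a[0]
      for x in a[1:]:
          if x > m:
              m = x
      return m

  def optimal_maximum(a):
      n = len(a)
      if n == 1:
          return a[0]
      block_size = max(1, math.ceil(math.log2(n)))
      # Phase 1: each (simulated) processor scans one lg n-sized block.
      block_results = [sequential_maximum(a[s:s + block_size])
                       for s in range(0, n, block_size)]
      # Phase 2: parallel tournament over the n / lg n block maxima.
      return parallel_maximum(block_results)

  print(optimal_maximum([21, 11, 23, 17, 48, 33, 22, 41]))  # 48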

15 Some Notes
All CR algorithms can be converted to ER algorithms!
– "Broadcasting" an ER variable to all processors for concurrent access takes O(lg n) parallel time
maximum is a "semigroup algorithm"
– Semigroup = a set of elements + an associative binary operation (max, min, +, ×, etc.)
– The same accelerated-cascading method applies to min-element, summation, product of n numbers, etc.!

16 Other Parallel Architectures

17 PRAM may not be the best model
Shared memory = expensive!
– Some algorithms require communication between processors (= memory-locking issues)
– Better to use channels!
Extreme case: very simple processors with no shared memory (just communication channels)

18 Circuits
Each processor is a gate with a specialized function (e.g., a comparator gate). A circuit is a layout of gates that performs a full task (e.g., sorting).
(Figure: a comparator gate takes inputs x and y and outputs min(x, y) and max(x, y).)

19 Sorting circuit for 4 elements (depth 3)
(Figure: a 4-element sorting network of depth 3; the input 17, 42, 23, 7 is sorted over steps 1–3.)

20 Sorting circuit for n elements?
Simpler problem: max element. Idea: a diagonal of comparators moves the maximum to the last wire; add as many of these diagonals as needed.

21 Odd-Even Transposition Network
Theorem: The odd-even transposition network sorts n numbers in n steps and O(n²) processors.
Example trace (n = 8, one row per step):
  18 42 31 56 12 11 19 34
  18 42 31 56 11 12 19 34
  18 31 42 11 56 12 19 34
  18 31 11 42 12 56 19 34
  18 11 31 12 42 19 56 34
  11 18 12 31 19 42 34 56
  11 12 18 19 31 34 42 56
  11 12 18 19 31 34 42 56
  11 12 18 19 31 34 42 56 (sorted; the remaining steps change nothing)
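
In software, step k of the network is one pass of compare-exchanges starting at index k mod 2; a sketch that reproduces the trace above:

  def odd_even_transposition_sort(a):
      # n steps; step k compares pairs starting at index k % 2,
      # exactly the k-th column of comparators in the network.
      a = list(a)
      n = len(a)
      for step in range(n):
          for i in range(step % 2, n - 1, 2):
              if a[i] > a[i + 1]:
                  a[i], a[i + 1] = a[i + 1], a[i]
      return a

  print(odd_even_transposition_sort([18, 42, 31, 56, 12, 11, 19, 34]))
  # [11, 12, 18, 19, 31, 34, 42, 56]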

22 Zero-One Principle of Sorting Networks
Lemma: If a sorting network works correctly on all inputs consisting of only 0's and 1's, it works for any arbitrary input.
– Assume there is a network that sorts 0-1 sequences but not some arbitrary input a_0..a_(n-1)
– Let b_0..b_(n-1) be the output of that network
– Then there must exist s < t with b_s > b_t
– Label every a_i < b_s with 0 and all others with 1
– (Key step: comparators commute with this monotone labeling, so each labeled value traces through the network exactly alongside the original value)
– If we run a_0..a_(n-1) through the network with their labels, then b_s's label will be 1 and b_t's label will be 0, i.e., a 1 comes out before a 0
– Contradiction: the network is assumed to sort 0-1 sequences properly but did not do so here!
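
The principle also yields a cheap exhaustive test: a comparator network on n wires sorts everything iff it sorts all 2^n 0-1 inputs. A sketch, with the network represented as a list of (i, j) wire pairs (an assumed encoding), tested on the classic depth-3, 5-comparator network on 4 wires:

  from itertools import product

  def sorts_all_zero_one(network, n):
      # Zero-one principle: checking 2^n binary inputs suffices,
      # instead of all n! orderings of arbitrary values.
      for bits in product([0, 1], repeat=n):
          w = list(bits)
          for i, j in network:           # apply each comparator in order
              if w[i] > w[j]:
                  w[i], w[j] = w[j], w[i]
          if any(w[k] > w[k + 1] for k in range(n - 1)):
              return False
      return True

  # A depth-3 sorting network on 4 wires (wiring may differ from the figure):
  net4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
  print(sorts_all_zero_one(net4, 4))     # True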

23 Correctness of the Odd-Even Transposition Network
Assume a binary sequence a_0..a_(n-1) (justified by the zero-one principle). Let a_i be the first 0 in the sequence. Two cases: i is odd or even. To sort a_0..a_i, we need i steps (worst case). Induction: given that a_0..a_k (where k ≥ i) sorts in k steps, will a_0..a_(k+1) get sorted in k + 1 steps?
(Figure: example 0-1 sequences traced through the network.)

24 Better Sorting Networks
Batcher's Bitonic Sorter (1968)
– Depth O(lg² n), size O(n lg² n)
– Idea: sort the 2 halves (recursively), then merge using a network that can sort bitonic sequences
AKS Network (1983)
– Ajtai, Komlós and Szemerédi
– Depth O(lg n), size O(n lg n)
– Not practical! Hides a very large constant c in the c·n·lg n size
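
A compact recursive sketch of Batcher's idea, assuming n is a power of two: the two halves are sorted in opposite directions so the merger always receives a bitonic sequence.

  def bitonic_sort(a, ascending=True):
      # Batcher's bitonic sorter: sort both halves in opposite orders,
      # producing a bitonic sequence, then merge it.
      n = len(a)
      if n <= 1:
          return list(a)
      first = bitonic_sort(a[:n // 2], True)
      second = bitonic_sort(a[n // 2:], False)
      return bitonic_merge(first + second, ascending)

  def bitonic_merge(a, ascending):
      # A bitonic merger of width n has depth lg n.
      n = len(a)
      if n <= 1:
          return list(a)
      a = list(a)
      for i in range(n // 2):            # one layer of n/2 comparators
          if (a[i] > a[i + n // 2]) == ascending:
              a[i], a[i + n // 2] = a[i + n // 2], a[i]
      return (bitonic_merge(a[:n // 2], ascending)
              + bitonic_merge(a[n // 2:], ascending))

  print(bitonic_sort([18, 42, 31, 56, 12, 11, 19, 34]))
  # [11, 12, 18, 19, 31, 34, 42, 56]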

25 More Intelligent Processors: Processor Networks
– Star: diameter 2
– Linear/Ring: diameter n – 1 (or n – 2)
– Completely-connected: diameter 1
– Mesh

26 Sorting on Linear Networks
Emulate an odd-even transposition network! O(n) steps; work is O(n²).
– We can't expect better on a linear network
Example trace (n = 7, one row per step):
  18 42 31 56 12 11 19
  18 42 31 56 11 12 19
  18 31 42 11 56 12 19
  18 31 11 42 12 56 19
  18 11 31 12 42 19 56
  11 18 12 31 19 42 56
  11 12 18 19 31 42 56
  11 12 18 19 31 42 56

27 Sorting on Mesh Networks: Shearsort
Arrange numbers in "boustrophedon" order: a = {15, 4, 10, 6, 1, 5, 7, 11, 12, 14, 13, 8, 9, 16, 2, 3}. Sort rows, sort columns, repeat.
Row phase:
  15  4 10  6
  11  7  5  1
  12 14 13  8
   9 16  2  3

28 Sorting on Mesh Networks: Shearsort
Column phase (rows now sorted in alternating directions):
   4  6 10 15
  11  7  5  1
   8 12 13 14
  16  9  3  2

29 Sorting on Mesh Networks: Shearsort
Row phase (columns now sorted top to bottom):
   4  6  3  1
   8  7  5  2
  11  9 10 14
  16 12 13 15

30 Sorting on Mesh Networks: Shearsort
Column phase:
   1  3  4  6
   8  7  5  2
   9 10 11 14
  16 15 13 12

31 Sorting on Mesh Networks: Shearsort
Row phase:
   1  3  4  2
   8  7  5  6
   9 10 11 12
  16 15 13 14

32 Sorting on Mesh Networks: Shearsort
Column phase:
   1  2  3  4
   8  7  6  5
   9 10 11 12
  16 15 14 13

33 Sorting on Mesh Networks: Shearsort
Done! The grid is sorted in boustrophedon (snake) order:
   1  2  3  4
   8  7  6  5
   9 10 11 12
  16 15 14 13

34 Sorting on Mesh Networks: Shearsort
Theorem: Shearsort sorts n² elements in O(n lg n) steps on an n × n mesh.
We can use the Zero-One Principle!
– Only because the algorithm is comparison-exchange (it can be implemented using comparators only) and oblivious (the outcome of a comparator does not influence which comparisons are made later on)
– (Disclaimer: the reference is actually very unclear about this)
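
A Python sketch of shearsort, with ordinary sorts standing in for the odd-even transposition passes a real mesh would run; on the slide's 4×4 example it reproduces the trace above.

  import math

  def shearsort(grid):
      # grid: n x n list of lists; sorted into boustrophedon order.
      n = len(grid)
      phases = 2 * math.ceil(math.log2(n)) + 1   # end on a row phase
      for p in range(phases):
          if p % 2 == 0:                         # row phase
              for r in range(n):
                  grid[r].sort(reverse=(r % 2 == 1))  # snake directions
          else:                                  # column phase
              for c in range(n):
                  col = sorted(grid[r][c] for r in range(n))
                  for r in range(n):
                      grid[r][c] = col[r]
      return grid

  g = [[15, 4, 10, 6], [1, 5, 7, 11], [12, 14, 13, 8], [9, 16, 2, 3]]
  for row in shearsort(g):
      print(row)
  # [1, 2, 3, 4] / [8, 7, 6, 5] / [9, 10, 11, 12] / [16, 15, 14, 13]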

35 Correctness of Shearsort
By the zero-one principle, consider an arbitrary 0-1 grid:
  0 1 0 0 1 0 0 1
  0 1 1 1 0 1 1 1
  0 0 1 0 1 0 0 1
  1 0 0 1 0 0 1 0
  1 1 1 0 1 0 1 0
  0 0 0 1 1 1 0 1
  0 0 1 1 1 1 1 1
  1 1 0 0 1 1 1 1

36 00000111 11111100 00000111 11100000 00011111 11110000 00111111 11111100 1 full row of 1’s 1 full row of 0’s 1 full row of 1’s

37 Correctness of Shearsort (continued)
2 × lg n phases, each phase takes n steps.
  0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 0
  0 0 1 1 0 1 0 0
  1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1
The "dirty" (mixed) region is guaranteed to be halved after every 2 phases, so 2 lg n phases of n steps each give O(n lg n) total.

38 Distributed Algorithms

39 Different concerns altogether…
Problems are usually easy to parallelize. The main problems:
– Inherently asynchronous
– How to broadcast data and ensure every node gets it
– How to minimize bandwidth usage
– What to do when nodes go down (decentralization)
– (Do we trust the results given by the nodes?)
(Figure: example projects — distributed prime searches (2, 3, 5, 7, 13, …; Mersenne primes 2^42643801 – 1 and 2^43112609 – 1), brute-forcing 56-bit DES, SETI@Home.)

40 Message-Optimality
New language constructs:
  send <message> to p
  receive <message> from p
  terminate
Message complexity = number of messages sent by a distributed algorithm (also uses O-notation).

41 Broadcast
Initiators vs. noninitiators. Simple case: a ring network with one initiator.

init_ring_broadcast() {
  send token to successor
  receive token from predecessor
  terminate
}

ring_broadcast() {
  receive token from predecessor
  send token to successor
  terminate
}

Theorem: init_ring_broadcast + ring_broadcast broadcasts to n machines using time and message complexity O(n).
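
A toy Python simulation, with one queue per machine standing in for its incoming channel (the harness is an assumption, not part of the reference):

  import threading, queue

  def ring_broadcast_demo(n):
      chan = [queue.Queue() for _ in range(n)]   # chan[i] = machine i's inbox

      def machine(i, initiator):
          if initiator:
              chan[(i + 1) % n].put("token")     # send to successor
              chan[i].get()                      # receive from predecessor
          else:
              chan[i].get()                      # receive from predecessor
              chan[(i + 1) % n].put("token")     # send to successor
          print(f"machine {i} done")             # terminate

      threads = [threading.Thread(target=machine, args=(i, i == 0))
                 for i in range(n)]
      for t in threads: t.start()
      for t in threads: t.join()

  ring_broadcast_demo(6)    # n messages in total, time O(n)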

42 Broadcast on a tree network

init_broadcast() {
  N = { q | q is a child neighbor of p }
  for each q ∈ N
    send token to q
  terminate
}

broadcast() {
  receive token from parent
  N = { q | q is a child neighbor of p }
  for each q ∈ N
    send token to q
  terminate
}

Note: no acknowledgment!
(Figure: example tree network on nodes 1–6.)

43 Echo
Creates a spanning tree out of any connected network.

init_echo() {
  N = { q | q is a neighbor of p }
  for each q ∈ N
    send token to q
  counter = 0
  while (counter < |N|) {
    receive token
    counter = counter + 1
  }
  terminate
}

echo() {
  receive token from parent
  N = { q | q is a neighbor of p } – { parent }
  for each q ∈ N
    send token to q
  counter = 0
  while (counter < |N|) {
    receive token
    counter = counter + 1
  }
  send token to parent
  terminate
}

(Figure: example network on nodes 1–6.)

44–52 Echo (animation)
(Figures: the same code traced on the example network; each node's counter climbs from 0 to |N| as tokens arrive, children echo back to their parents, and the initiator finishes once all of its neighbors have reported.)

53 Echo (conclusion)
Creates a spanning tree out of any connected network.
Theorem: init_echo + echo has time complexity O(diameter) and message complexity O(edges).
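
A threaded Python sketch of echo. Since the slide's network figure does not survive in this transcript, the adjacency list below is an arbitrary connected graph on nodes 1–6 (an assumption), and the queue-per-node harness is likewise assumed.

  import threading, queue

  def echo_demo(adj, root):
      inbox = {v: queue.Queue() for v in adj}

      def node(p):
          if p == root:                           # init_echo
              parent = None
              N = list(adj[p])
              for q in N:
                  inbox[q].put(("token", p))
          else:                                   # echo
              _, parent = inbox[p].get()          # first token names parent
              N = [q for q in adj[p] if q != parent]
              for q in N:
                  inbox[q].put(("token", p))
          for _ in N:                             # wait for |N| tokens
              inbox[p].get()
          if parent is None:
              print("root: echo complete")
          else:
              inbox[parent].put(("token", p))     # echo back to parent
              print(f"node {p}: parent in spanning tree is {parent}")

      ts = [threading.Thread(target=node, args=(v,)) for v in adj]
      for t in ts: t.start()
      for t in ts: t.join()

  echo_demo({1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 5], 4: [2, 6],
             5: [3, 6], 6: [4, 5]}, root=1)

Each neighbor in N delivers exactly one token to p (a flood token if p is not its parent, an echo if it is), so counting |N| receipts is enough for termination.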

54 Leader Election (for ring networks)

init_election() {
  send token, p.ID to successor
  min = p.ID
  receive token, token_id
  while (p.ID != token_id) {
    if (token_id < min)
      min = token_id
    send token, token_id to successor
    receive token, token_id
  }
  if (p.ID == min)
    i_am_the_leader = true
  else
    i_am_the_leader = false
  terminate
}

election() {
  i_am_the_leader = false
  do {
    receive token, token_id
    send token, token_id to successor
  } while (true)
}

Theorem: init_election + election runs in n steps with message complexity O(n²).
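
A sequential Python simulation in which every machine initiates, which is what drives the O(n²) bound: n tokens each make a full lap of the ring. The synchronous-sweep harness is an assumption for illustration.

  from collections import deque

  def ring_election(ids):
      # ids[i] = unique ID of machine i; i's successor is (i + 1) % n.
      n = len(ids)
      inbox = [deque() for _ in range(n)]
      for i in range(n):                     # init: everyone sends its own ID
          inbox[(i + 1) % n].append(ids[i])
      mins = list(ids)                       # running minimum seen per machine
      messages, done = n, 0
      while done < n:
          for i in range(n):                 # one synchronous step
              if inbox[i]:
                  t = inbox[i].popleft()
                  if t == ids[i]:            # own token came back: decide
                      done += 1
                  else:                      # track min and forward
                      mins[i] = min(mins[i], t)
                      inbox[(i + 1) % n].append(t)
                      messages += 1
      leader = [ids[i] for i in range(n) if ids[i] == mins[i]]
      print("leader:", leader[0], "- messages:", messages)

  ring_election([5, 3, 8, 1, 9, 2])          # leader: 1 - messages: 36 = n^2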

