Parallel and Distributed Algorithms
Eric Vidal
Reference: R. Johnsonbaugh and M. Schaefer, Algorithms (International Edition). Pearson Education, 2004.

Outline
Introduction (case study: maximum element) – Work-optimality
The Parallel Random Access Machine – Shared memory modes – Accelerated cascading
Other Parallel Architectures (case study: sorting) – Circuits – Linear processor networks – (Mesh processor networks)
Distributed Algorithms – Message-optimality – Broadcast and echo – (Leader election)

Introduction

Why use parallelism? A job that takes p steps on 1 printer takes 1 step on p printers; p is the speed-up factor (best case). Given a sequential algorithm, how can we parallelize it? – Some problems are inherently sequential (P-complete) and cannot be parallelized effectively

Case Study: Maximum Element
In: a[]
Out: maximum element in a

sequential_maximum(a) {
  n = a.length
  max = a[0]
  for i = 1 to n – 1 {
    if (a[i] > max)
      max = a[i]
  }
  return max
}

Running time: O(n)

Parallel Maximum
Idea: Use ⌈n/2⌉ processors in a pairwise tournament: each step compares the surviving elements in pairs and keeps the larger of each pair, halving the number of candidates. Note the idle processors after the first step! Running time: O(lg n)

Work-Optimality
Work = number of algorithmic steps × number of processors
Work of the parallelized maximum algorithm = O(lg n) steps × (n/2) processors = O(n lg n)
Not work-optimal! The sequential algorithm's work is only O(n) – Workaround: accelerated cascading…

Formal Algorithm for Parallel Maximum But first!...

The Parallel Random Access Machine

The Parallel Random Access Machine (PRAM)
New construct: parallel loop
  for i = 1 to n in parallel { … }
Assumption 1: n processors execute this loop in lockstep (processors are synchronized)
Assumption 2: memory is shared across all processors

Example: Parallel Search
In: a[], x
Out: true if x is in a, false otherwise

parallel_search(a, x) {
  n = a.length
  found = false
  for i = 0 to n – 1 in parallel {
    if (a[i] == x)
      found = true
  }
  return found
}

Is this work-optimal? (Yes: O(1) steps × n processors = O(n), matching sequential search.)
Shared memory modes: Exclusive Read (ER) or Concurrent Read (CR), and Exclusive Write (EW) or Concurrent Write (CW). Real-world systems are most commonly CREW. What type does parallel_search run on? (All processors may read x at once and several may set found at once, so it needs CRCW; the concurrent writes are safe because every writer stores the same value.)
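As a concreteness check, here is a minimal Python sketch (mine, not from the reference) that simulates the synchronous parallel loop with a thread pool; every concurrent write to found stores the same value, True, which is why the CW behavior is harmless here.

  from concurrent.futures import ThreadPoolExecutor

  def parallel_search(a, x):
      # Simulate "for i = 0 to n - 1 in parallel": one task per processor.
      found = [False]  # shared memory cell

      def processor(i):
          if a[i] == x:
              found[0] = True  # concurrent write, but all writers agree

      with ThreadPoolExecutor(max_workers=max(1, len(a))) as pool:
          pool.map(processor, range(len(a)))
      return found[0]

  print(parallel_search([4, 8, 15, 16, 23, 42], 15))  # True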

Formal Algorithm for Parallel Maximum
In: a[]
Out: maximum element in a

parallel_maximum(a) {
  n = a.length
  for i = 0 to ⌈lg n⌉ – 1 {
    for j = 0 to ⌈n / 2^(i+1)⌉ – 1 in parallel {
      if (j × 2^(i+1) + 2^i < n)  // boundary check
        a[j × 2^(i+1)] = max(a[j × 2^(i+1)], a[j × 2^(i+1) + 2^i])
    }
  }
  return a[0]
}

Theorem: parallel_maximum is CREW and finds the maximum element in parallel time O(lg n) and work O(n lg n)
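A sequential Python simulation of the same tournament (a sketch of mine, not the book's code) makes the index arithmetic concrete; each iteration of the outer loop stands for one synchronous parallel step.

  from math import ceil, log2

  def parallel_maximum(a):
      a = list(a)  # work on a copy of the shared array
      n = len(a)
      for i in range(ceil(log2(n))):  # O(lg n) parallel steps
          stride = 2 ** (i + 1)
          # On a PRAM, each j below would be its own processor.
          for j in range(ceil(n / stride)):
              lo, hi = j * stride, j * stride + stride // 2
              if hi < n:  # boundary check
                  a[lo] = max(a[lo], a[hi])
      return a[0]

  print(parallel_maximum([15, 4, 10, 6, 1, 5, 7, 11]))  # 15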

Accelerated Cascading
Phase 1: Run sequential_maximum on blocks of lg n elements each
– We use n / lg n processors, one per block
– O(lg n) sequential steps per processor
– Total work = O(lg n) steps × (n / lg n) processors = O(n)
Phase 2: Run parallel_maximum on the resulting n / lg n block maxima
– lg(n / lg n) parallel steps = lg n – lg lg n = O(lg n)
– Total work = O(lg n) steps × ((n / lg n) / 2) processors = O(n)
Overall: O(lg n) parallel time with O(n) work, which is work-optimal

Formal Algorithm for Optimal Maximum
In: a[]
Out: maximum element in a

optimal_maximum(a) {
  n = a.length
  block_size = ⌈lg n⌉
  block_count = ⌈n / block_size⌉
  create array block_results[block_count]
  for i = 0 to block_count – 1 in parallel {
    start = i × block_size
    end = min(n – 1, start + block_size – 1)
    block_results[i] = sequential_maximum(a[start .. end])
  }
  return parallel_maximum(block_results)
}
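Continuing the sketch (and reusing the parallel_maximum simulation above), the cascade is a blocked sequential pass followed by the tournament; Python's built-in max stands in for sequential_maximum.

  from math import ceil, log2

  def optimal_maximum(a):
      n = len(a)
      if n <= 2:
          return max(a)
      block_size = ceil(log2(n))
      # Phase 1: one "processor" per block runs the sequential algorithm.
      block_results = [max(a[start:start + block_size])
                       for start in range(0, n, block_size)]
      # Phase 2: tournament on the roughly n / lg n block maxima.
      return parallel_maximum(block_results)

  print(optimal_maximum([15, 4, 10, 6, 1, 5, 7, 11, 12, 14]))  # 15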

Some Notes
All CR algorithms can be converted to ER algorithms! – "Broadcasting" a variable so every processor gets its own copy for exclusive access takes O(lg n) parallel time (the number of processors holding the value doubles each step)
maximum is a "semigroup algorithm" – Semigroup = a set of elements plus an associative binary operation (max, min, +, ×, etc.) – The same accelerated-cascading method applies to the min element, summation, product of n numbers, etc.!
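To underline the semigroup point, here is a hypothetical generalization (the name optimal_reduce is mine, not the reference's) that runs the same cascade with any associative binary operation op:

  from functools import reduce
  from math import ceil, log2

  def optimal_reduce(a, op):
      # Accelerated cascading for any associative operation (max, min, +, *, ...).
      n = len(a)
      block_size = ceil(log2(n)) if n > 2 else 1
      # Phase 1: sequential reduction inside each block of ~lg n elements.
      blocks = [reduce(op, a[i:i + block_size])
                for i in range(0, n, block_size)]
      # Phase 2: pairwise tournament, one list pass per parallel step.
      while len(blocks) > 1:
          merged = [op(x, y) for x, y in zip(blocks[::2], blocks[1::2])]
          if len(blocks) % 2:  # an odd block carries over to the next step
              merged.append(blocks[-1])
          blocks = merged
      return blocks[0]

  print(optimal_reduce([1, 2, 3, 4, 5], lambda x, y: x + y))  # 15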

Other Parallel Architectures

PRAM may not be the best model
Shared memory = expensive! – Some algorithms require communication between processors (= memory-locking issues) – Better to use channels!
Extreme case: very simple processors with no shared memory (just communication channels)

Circuits
Each processor is a gate with a specialized function, e.g. a comparator gate, which takes inputs x and y and outputs min(x, y) on one wire and max(x, y) on the other
Circuit = a layout of gates that performs a full task (e.g., sorting)

Sorting circuit for 4 elements: the comparators are arranged in 3 steps, so the depth of the network is 3
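Writing the depth-3 network out as code makes the gate layout explicit. This is a sketch with an assumed (standard) wiring: compare wires (0,1) and (2,3), then (0,2) and (1,3), then (1,2).

  def compare_exchange(a, i, j):
      # One comparator gate: wire i gets min(x, y), wire j gets max(x, y).
      if a[i] > a[j]:
          a[i], a[j] = a[j], a[i]

  def sort4(a):
      compare_exchange(a, 0, 1); compare_exchange(a, 2, 3)  # step 1 (parallel gates)
      compare_exchange(a, 0, 2); compare_exchange(a, 1, 3)  # step 2
      compare_exchange(a, 1, 2)                             # step 3
      return a

  print(sort4([3, 1, 4, 2]))  # [1, 2, 3, 4]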

Sorting circuit for n elements? Start with a simpler problem: the max element. A single diagonal of comparators pushes the maximum to the last wire. Idea: add as many of these diagonals as needed

Odd-Even Transposition Network
Theorem: The odd-even transposition network sorts n numbers in n steps and O(n²) processors
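A direct sequential simulation of the network (a sketch; each pass of the outer loop is one step, and all comparators within a step touch disjoint wires, so they could fire in parallel):

  def odd_even_transposition_sort(a):
      a = list(a)
      n = len(a)
      for step in range(n):
          # Even steps compare (0,1), (2,3), ...; odd steps compare (1,2), (3,4), ...
          for i in range(step % 2, n - 1, 2):
              if a[i] > a[i + 1]:
                  a[i], a[i + 1] = a[i + 1], a[i]
      return a

  print(odd_even_transposition_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]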

Zero-One Principle of Sorting Networks
Lemma: If a sorting network works correctly on all inputs consisting of only 0's and 1's, it works for any arbitrary input
– Assume a network sorts all 0-1 sequences but fails on some arbitrary input a_0 .. a_(n–1)
– Let b_0 .. b_(n–1) be the output of the network on that input
– Then there must exist positions s < t with b_s > b_t
– Label every a_i < b_s with 0 and all the others with 1 (a monotone relabeling, so each comparator routes the labels exactly as it routed the values)
– If we run a_0 .. a_(n–1) through the network with their labels, then b_s's label will be 1 and b_t's label will be 0: a 0 after a 1
– Contradiction: the network was assumed to sort 0-1 sequences properly but did not do so here!
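The lemma also gives a cheap exhaustive test: check a candidate network on all 2^n binary inputs instead of all n! permutations. A small sketch, reusing the sort4 network above:

  from itertools import product

  def sorts_all_zero_one_inputs(network, n):
      # By the zero-one principle, passing every binary input of length n
      # proves the network sorts arbitrary inputs of length n.
      return all(network(list(bits)) == sorted(bits)
                 for bits in product([0, 1], repeat=n))

  print(sorts_all_zero_one_inputs(sort4, 4))  # True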

Correctness of the Odd-Even Transposition Network
Assume a binary sequence a_0 .. a_(n–1) (justified by the zero-one principle)
Let a_i be the first 0 in the sequence; two cases: i is odd or i is even
To sort a_0 .. a_i, we need at most i steps (worst case)
Induction: given that a_0 .. a_k (where k ≥ i) sorts in k steps, will a_0 .. a_(k+1) get sorted in k + 1 steps?

Better Sorting Networks
Batcher's Bitonic Sorter (1968) – Depth O(lg² n), size O(n lg² n) – Idea: recursively sort the two halves in opposite directions, then merge them using a network that sorts bitonic sequences
AKS Network (1983) – Ajtai, Komlós, and Szemerédi – Depth O(lg n), size O(n lg n) – Not practical! Hides a very large constant c in the c n lg n size
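For flavor, a compact recursive sketch of Batcher's bitonic sorter (assuming, as the classical construction does, that the input length is a power of two):

  def bitonic_sort(a, ascending=True):
      n = len(a)
      if n <= 1:
          return list(a)
      # Sorting the halves in opposite directions yields a bitonic sequence.
      first = bitonic_sort(a[:n // 2], True)
      second = bitonic_sort(a[n // 2:], False)
      return bitonic_merge(first + second, ascending)

  def bitonic_merge(a, ascending):
      # Sorts a bitonic sequence; the n/2 comparators here form one step.
      n = len(a)
      if n <= 1:
          return list(a)
      a = list(a)
      for i in range(n // 2):
          if (a[i] > a[i + n // 2]) == ascending:
              a[i], a[i + n // 2] = a[i + n // 2], a[i]
      return (bitonic_merge(a[:n // 2], ascending)
              + bitonic_merge(a[n // 2:], ascending))

  print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8]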

More Intelligent Processors: Processor Networks
– Star: diameter = 2
– Linear/Ring: diameter = n – 1 (or n – 2)
– Completely-connected: diameter = 1
– Mesh

Sorting on Linear Networks
Emulate an odd-even transposition network! O(n) steps, so the work is O(n²) – We can't expect better on a linear network: its diameter is n – 1, so a single element may need to travel across all n processors

Sorting on Mesh Networks: Shearsort
Arrange the numbers on an n × n mesh in "boustrophedon" (snake-like) order, e.g. a = { 15, 4, 10, 6, 1, 5, 7, 11, 12, 14, 13, 8, 9, 16, 2, 3 } on a 4 × 4 mesh
Then alternate phases: a row phase sorts every row along the snake (even rows left-to-right, odd rows right-to-left), and a column phase sorts every column top-to-bottom
Sort rows, sort columns, repeat; for this 4 × 4 example, three row phases alternating with three column phases leave the mesh fully sorted in snake order. Done!
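A sequential Python sketch of shearsort (mine; it runs a comfortable number of phases per the lg(n) × 2 analysis below, plus a finishing row phase):

  from math import ceil, log2

  def shearsort(grid):
      # grid: n x n list of lists; sorts into boustrophedon (snake) order.
      n = len(grid)
      for _ in range(ceil(log2(n)) + 1 if n > 1 else 1):
          for r in range(n):  # row phase: alternate directions along the snake
              grid[r].sort(reverse=(r % 2 == 1))
          for c in range(n):  # column phase: every column top-to-bottom
              col = sorted(grid[r][c] for r in range(n))
              for r in range(n):
                  grid[r][c] = col[r]
      for r in range(n):      # one final row phase finishes the sort
          grid[r].sort(reverse=(r % 2 == 1))
      return grid

  a = [[15, 4, 10, 6], [1, 5, 7, 11], [12, 14, 13, 8], [9, 16, 2, 3]]
  for row in shearsort(a):
      print(row)  # [1, 2, 3, 4] / [8, 7, 6, 5] / [9, 10, 11, 12] / [16, 15, 14, 13]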

Sorting on Mesh Networks: Shearsort
Theorem: Shearsort sorts n² elements in O(n lg n) steps on an n × n mesh
We can use the Zero-One Principle to prove this! – It applies only because the algorithm is comparison-exchange (it can be implemented using comparators only) and oblivious (the outcome of a comparator does not influence which comparisons are made later on) – (Disclaimer: the reference is actually very unclear about this)

Correctness of Shearsort
Consider a 0-1 input (allowed by the zero-one principle), and call a row "dirty" if it is neither a full row of 0's nor a full row of 1's. After a row phase, two adjacent rows of the snake hold their 0's and 1's at opposite ends, so the following column phase turns at least one row of each adjacent pair into a full row of 0's (pushed toward the top) or a full row of 1's (pushed toward the bottom)
(Figure omitted: the dirty region shrinking between an emerging full row of 0's above and a full row of 1's below)
So the dirty region is guaranteed to be at least halved after every 2 phases: lg(n) × 2 phases suffice, and each phase takes n steps, giving O(n lg n) steps in total

Distributed Algorithms

Different concerns altogether…
Problems are usually easy to parallelize. The main problems:
– Inherently asynchronous execution
– How to broadcast data and ensure every node gets it
– How to minimize bandwidth usage
– What to do when nodes go down (decentralization)
– (Do we trust the results given by the nodes?)
(Figure omitted: example distributed computations, such as a search for primes (2, 3, 5, 7, 13, …) and brute-forcing a 56-bit DES key)

Message-Optimality
New language constructs:
  send … to p
  receive … from p
  terminate
Message complexity = number of messages sent by a distributed algorithm (also expressed in O-notation)

Broadcast
Initiators vs. noninitiators
Simple case: ring network with one initiator

init_ring_broadcast() {
  send token to successor
  receive token from predecessor
  terminate
}

ring_broadcast() {
  receive token from predecessor
  send token to successor
  terminate
}

Theorem: init_ring_broadcast + ring_broadcast broadcasts to n machines using time and message complexity O(n)
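A small Python simulation of the ring broadcast (mine: one thread per machine, a FIFO queue per channel, machine 0 as the initiator):

  import threading, queue

  def simulate_ring_broadcast(n):
      channels = [queue.Queue() for _ in range(n)]  # channels[i] feeds machine i

      def machine(i):
          successor = channels[(i + 1) % n]
          if i == 0:                 # init_ring_broadcast
              successor.put("token")
              channels[i].get()
          else:                      # ring_broadcast
              channels[i].get()
              successor.put("token")
          print("machine", i, "terminates")

      threads = [threading.Thread(target=machine, args=(i,)) for i in range(n)]
      for t in threads: t.start()
      for t in threads: t.join()

  simulate_ring_broadcast(5)  # exactly n messages are sent in total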

Broadcast on a tree network

init_broadcast() {
  N = { q | q is a child neighbor of p }
  for each q ∈ N
    send token to q
  terminate
}

broadcast() {
  receive token from parent
  N = { q | q is a child neighbor of p }
  for each q ∈ N
    send token to q
  terminate
}

Note: no acknowledgment!

Echo
Creates a spanning tree out of any connected network: the token floods outward from the initiator, each machine adopts the first sender it hears from as its parent, and echoes flow back up once a machine has received a token from every neighbor

init_echo() {
  N = { q | q is a neighbor of p }
  for each q ∈ N
    send token to q
  counter = 0
  while (counter < |N|) {
    receive token
    counter = counter + 1
  }
  terminate
}

echo() {
  receive token from parent
  N = { q | q is a neighbor of p } – { parent }
  for each q ∈ N
    send token to q
  counter = 0
  while (counter < |N|) {
    receive token
    counter = counter + 1
  }
  send token to parent
  terminate
}

Theorem: init_echo + echo has time complexity O(diameter) and message complexity O(edges)
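A threaded Python sketch of echo on an arbitrary connected graph (my simulation; machine 0 is the initiator, and the parent pointers it returns are the edges of the discovered spanning tree):

  import threading, queue

  def simulate_echo(adj):  # adj: {node: set of neighbors}
      inbox = {p: queue.Queue() for p in adj}
      parent = {}

      def init_echo(p):
          for q in adj[p]:
              inbox[q].put(p)          # send token to every neighbor
          for _ in adj[p]:
              inbox[p].get()           # collect one token per neighbor

      def echo(p):
          parent[p] = inbox[p].get()   # first sender becomes the parent
          rest = adj[p] - {parent[p]}
          for q in rest:
              inbox[q].put(p)
          for _ in rest:
              inbox[p].get()           # tokens from non-parent neighbors
          inbox[parent[p]].put(p)      # echo back up to the parent

      threads = [threading.Thread(target=init_echo if p == 0 else echo, args=(p,))
                 for p in adj]
      for t in threads: t.start()
      for t in threads: t.join()
      return parent

  graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
  print(simulate_echo(graph))  # e.g. {1: 0, 2: 0, 3: 1} (tree edges vary by timing)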

Leader Election (for ring networks)

init_election() {
  send token, p.ID to successor
  min = p.ID
  receive token, token_id
  while (p.ID != token_id) {
    if (token_id < min)
      min = token_id
    send token, token_id to successor
    receive token, token_id
  }
  if (p.ID == min)
    i_am_the_leader = true
  else
    i_am_the_leader = false
  terminate
}

election() {
  i_am_the_leader = false
  do {
    receive token, token_id
    send token, token_id to successor
  } while (true)
}

Theorem: init_election + election runs in n steps with message complexity O(n²) (each of up to n initiators' tokens travels all the way around the ring)
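A Python simulation of the election with every machine initiating (a sketch; since all machines run init_election here, the pure forwarder election() is not needed):

  import threading, queue

  def simulate_election(ids):
      n = len(ids)
      inbox = [queue.Queue() for _ in range(n)]
      is_leader = [False] * n

      def init_election(i):
          successor = inbox[(i + 1) % n]
          successor.put(ids[i])            # send own ID around the ring
          smallest = ids[i]
          token_id = inbox[i].get()
          while token_id != ids[i]:        # forward until own ID returns
              smallest = min(smallest, token_id)
              successor.put(token_id)
              token_id = inbox[i].get()
          is_leader[i] = (ids[i] == smallest)

      threads = [threading.Thread(target=init_election, args=(i,)) for i in range(n)]
      for t in threads: t.start()
      for t in threads: t.join()
      return [ids[i] for i in range(n) if is_leader[i]]

  print(simulate_election([42, 7, 19, 3, 88]))  # [3]: the smallest ID wins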