Parallel Algorithms: Examples, Concepts & Definitions, Analysis of Algorithms



Lemma: Any complete binary tree with n leaves has n-1 internal nodes (i.e., 2n-1 total nodes) and height log2 n. Exercise: Prove it.

Warming up Consider the BTIN (Binary Tree Interconnection Network) computational model. Suppose the tree has n leaves (and hence 2n-1 processors). If we have n numbers stored at the leaves, how can we obtain the sum? How can we obtain the max or min? How can we propagate a number stored at the root to all leaves?
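A minimal Python emulation of the first warm-up question, one tree level per parallel step. This is a sketch only, assuming n is a power of 2; the function name tree_sum is ours, not the text's.

```python
def tree_sum(values):
    """Emulate summation on a BTIN: sibling leaves send their values to
    their parent, which adds them, and so on up the tree. The root holds
    the sum after log2(n) parallel steps. Assumes len(values) is a power of 2.
    """
    level = list(values)
    while len(level) > 1:          # one parallel step per tree level
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]
```

Replacing + with max or min answers the second question, and propagating a value from the root to the leaves is the same traversal in reverse, again in log2 n steps.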

Warming up Suppose we have n-1 numbers stored at n-1 arbitrary leaves. How can we move these numbers to the n-1 internal nodes? If the leftmost n/2 leaves hold numbers, how can we move them to the rightmost n/2 leaves? How many steps does each of the above computations require?

Example 1.4 Grouping in a Shared-Memory PC Given a sequence of pairs {(x1, d1), ..., (xn, dn)} where xi ∈ {0, 1, ..., m-1}, m < n, and di is an arbitrary datum. Since m < n, the pigeonhole principle guarantees that some of the xi are repeated. Write a parallel algorithm that groups these pairs according to the xi's.

Example 1.4 Grouping in a Shared-Memory Model Sequential algorithm: at each step i, read xi and insert the pair into a hash table with buckets 0, 1, ..., m-1. Time = n steps. Memory = Θ(n).

Example 1.4: Parallel algorithm Grouping in a Shared-Memory Model Shared memory with n processors P1, P2, ..., Pn. Memory = m(2n-1) cells. Think of m complete binary trees T0, T1, ..., Tm-1, each with n leaves numbered 1, 2, ..., n, corresponding to P1, ..., Pn.

Example 1.4: Parallel algorithm Grouping in a Shared-Memory Model Phase 1: Each processor Pi reads the pair (xi, di) and inserts it in leaf i of the tree Txi. Phase 2: Each processor Pi tries to move the pair (xi, di) higher up in its tree until it can go no higher, as follows:

Example 1.4: Parallel algorithm Grouping in a Shared-Memory Model Shifting-up rule: If node u is free, then the pair in its right child (if any) takes precedence in moving to u over the pair in its left child (if any).
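The following sequential emulation may make the two phases concrete. It is a sketch under our own conventions, not the book's code: each tree is a heap-ordered array (root at index 1, children of node u at 2u and 2u+1, leaves at n..2n-1), n is a power of 2, and a top-down scan stands in for the synchronous rounds (so a slot vacated in a round can be refilled in the same round, modeling the pipelined upward flow).

```python
import math

def group_pairs(pairs, m):
    """Emulate Example 1.4: group the n pairs (x_i, d_i) into m trees."""
    n = len(pairs)
    trees = [[None] * (2 * n) for _ in range(m)]

    # Phase 1 (1 parallel step): P_i writes (x_i, d_i) into leaf i of T_{x_i}.
    for i, (x, d) in enumerate(pairs):
        trees[x][n + i] = (x, d)

    # Phase 2 (log2 n parallel steps): apply the shifting-up rule.
    for _ in range(int(math.log2(n))):
        for tree in trees:
            for u in range(1, n):                    # internal nodes only
                if tree[u] is None:                  # node u is free
                    for child in (2 * u + 1, 2 * u): # right child first
                        if tree[child] is not None:
                            tree[u] = tree[child]
                            tree[child] = None
                            break
    return trees
```

After the call, the pairs of each group x sit near the root of trees[x], and the unused cells can be released.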

Example 1.4: Parallel algorithm Analysis Since on a shared-memory parallel computer all processors run a common program and execute synchronously: Phase 1 takes 1 step only. Phase 2 takes log2 n steps only, because each tree has height log2 n. Total = log2 n + 1 steps. The extra empty cells in the m(2n-1) memory can be released.

Example 1.5 Pipelining a database in a BTIN model A BTIN with n leaves (processors) stores n distinct records, one per leaf, each of the form (k, d) where k is a key and d is a datum. Suppose the root receives a query to retrieve the record whose key is K (if it exists). Write a parallel algorithm.

Example 1.5 Pipelining a database in a BTIN model Sequential Algorithm: sort the records according to the keys, then use binary search. Time = Θ(n log n).

Example 1.5: Parallel Algorithm Pipelining a database in a BTIN model The root sends the key K to its children, which in turn send it to their children, and so on, until it reaches the leaves, where it is compared with the keys stored there. The leaf that contains the key sends the corresponding record up to the root through its parent and grandparents; every other leaf sends a null message. When a parent receives a record from one of its children, it sends that record up to its own parent; otherwise it sends null, and so on ...
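A compact Python emulation of the message flow is shown below; the recursion stands in for the physical n-leaf network, one record per leaf, and btin_search is our name for it, not the text's.

```python
def btin_search(leaves, key):
    """Emulate Example 1.5: the key travels from the root down to the
    leaves; the matching record (or None) travels back up, a non-null
    message taking precedence at every parent.
    leaves: list of (k, d) records, one per leaf processor.
    On the real network this takes 2 * log2(n) communication steps.
    """
    def down_then_up(lo, hi):
        if hi - lo == 1:                 # a leaf processor compares keys
            k, d = leaves[lo]
            return (k, d) if k == key else None
        mid = (lo + hi) // 2
        left = down_then_up(lo, mid)     # left child's subtree
        right = down_then_up(mid, hi)    # right child's subtree
        return left if left is not None else right
    return down_then_up(0, len(leaves))
```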

Example 1.5 Analysis This is called the pipelining technique. All processors have the same program and work asynchronously, acting when they receive messages from their parents or children. Time = 2 log2 n steps to send the key down the tree and receive the record back.

Example 1.5: Parallel Algorithm Pipelining a database in a BTIN model What if we make m queries K1, K2, ..., Km? Solution: send them down one after another, pipelined through the tree. Total time = 2 log2 n + m - 1.

Example 1.6 Prefix (Partial) Sums Given n numbers x0, x1, ..., xn-1, where n is a power of 2, compute the partial sums Sk = x0 + x1 + ... + xk for all k = 0, 1, ..., n-1.

Example 1.6 Prefix (Partial) Sums Sequential Algorithm: compute Sk = Sk-1 + xk for k = 1, ..., n-1; we need to make the unavoidable n-1 additions.

Example 1.6: Parallel Algorithm Prefix (Partial) Sums For i = 0, 1, ..., n-1, initially let Si = xi. Then for j = 0, ..., log2 n - 1: in parallel, for every i ≥ 2^j, let Si ← Si + Si-2^j. At each round we add the element at twice the distance used in the previous round. This can be done using the combinatorial circuit model with n(log2 n + 1) processors distributed over log2 n + 1 columns and n rows.
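In Python, the column-by-column computation can be emulated as below; this is a sketch assuming n is a power of 2, with each list comprehension playing the role of one parallel column (all of its additions are independent).

```python
import math

def parallel_prefix_sums(x):
    """Emulate Example 1.6: after round j, S_i holds the sum of the last
    2**(j+1) inputs ending at position i (or all of them, if i is small)."""
    n = len(x)
    s = list(x)                          # column 0: S_i = x_i
    for j in range(int(math.log2(n))):
        step = 2 ** j                    # distance doubles each round
        # every update reads the previous column's values
        s = [s[i] + s[i - step] if i >= step else s[i] for i in range(n)]
    return s

# parallel_prefix_sums([1, 2, 3, 4]) == [1, 3, 6, 10]
```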

Example 1.6: Parallel Algorithm Analysis The number of processors in the model is n(log2 n + 1). The number of columns is log2 n + 1. Each processor does at most one step (addition). The processors in any fixed column work in parallel. Time = log2 n + 1 additions.

Summary At the cost of increasing the computation power (the number of processors & memory) we may be able to decrease the computation time drastically. !!Expensive computations!! Is it worth it?

Is it worth it? Parallel computation is done mainly to speed up computations, while requiring huge processing power. To produce teraFLOPS, the number of CPUs in a massively parallel computer can reach hundreds of thousands. This is acceptable if parallel computing is cheap enough compared to the critical time reduction for the problem we are solving. In the near future, TFLOPS will be available on a single Intel chip!!

Is it worth it? This is indeed the case with many applications in medicine, business, and science that process huge databases, deal with live streams from a huge number of sources, or require a huge number of iterations.

Analysis of Parallel Algorithms The complexity of a parallel algorithm is measured by time (the number of parallel steps) and by the number of CPUs.

Elementary Steps Computational steps: basic arithmetic or logical operations performed within a processor, e.g., comparisons, additions, swapping, etc. Each takes a constant number of time units.

Elementary Steps Routing steps: steps used by the algorithm to move data from one processor to another via shared memory or interconnections. Routing is accounted for differently in shared-memory models than in interconnection models.

Routing in Shared Memory In shared-memory models the interchange is done by accessing the common memory. This access is assumed to take constant time in the uniform model (unrealistic but easy to assume), and O(log M) in the non-uniform model with memory of size M.

Routing in Interconnections In interconnection models a routing step is measured by the number of links the message has to follow to reach the destination processor; that is, the length of the shortest path between the source and the destination processors.

Performance Measures Running time, Speedup, Work, Cost, Cost optimality, Efficiency, Success ratio.

1. Running Time It is measured by the number of elementary steps (computational and routing) the algorithm performs, from the time the first processor starts to work until the time the last processor finishes. Example: P1 performs 13 steps, is idle for 3 steps, then performs 4 more steps. P2 performs 11 steps continuously. P3 performs 5 steps, then 11 more. Total = 20 steps.

1. Running Time The running time depends on the size of the input and on the number of processors, which may itself depend on the input. Sometimes we write tp to denote the running time of a parallel algorithm that runs on a computer with p processors.

2. Speedup The speedup is defined by S(1,p) = t1 / tp, where t1 is the best time known for a sequential algorithm and tp is the time for the parallel algorithm with p CPUs.

Speedup Folklore Theorem Theorem: The speedup S(1,p) ≤ p Proof: The parallel algorithm can be simulated by a sequential one in tp × p time (in the trivial way). Since t1 is an optimal time, t1 ≤ tp × p. That is, t1 / tp ≤ p ..... Right? Well, not really!!! This is true only for algorithms that can be simulated by sequential computers in tp × p time.

Speedup Folklore Theorem Theorem: The speedup S(1,p) ≤ p Conclusions: First, tp ≥ t1/p. This means that the running time of any (simulatable) parallel algorithm tp cannot be better than t1/p. Second, a good parallel algorithm is one whose speedup is very close to the number of processors.

Example 1.14: Adding n numbers Adding n numbers can be done on a sequential computer using n-1 additions. On a parallel computer it can be done in O(log n) steps using the BTIN model: the tree has n/log n leaves. Each leaf processor adds log n of the numbers. Each parent adds the values received from its children, and so on; the root will contain the result.
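A sketch of this variant in Python, assuming n is a power of 2; the chunking emulates the leaf processors' sequential additions and the loop emulates the tree levels (btin_sum is our name, not the text's).

```python
import math

def btin_sum(nums):
    """Emulate Example 1.14: sum n numbers with about n/log2(n) leaf
    processors in O(log n) parallel steps."""
    n = len(nums)
    chunk = max(1, int(math.log2(n)))
    # Each leaf processor adds its log2(n) numbers sequentially.
    partial = [sum(nums[i:i + chunk]) for i in range(0, n, chunk)]
    # The partial sums are combined pairwise, one tree level per step.
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0)            # pad an odd-sized level
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]
```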

Analysis Time to add the numbers within each leaf processor = O(log n) steps. Time to propagate the partial sums up the tree = O(log n) steps. Speedup = O(n/log n). (Figure: a tree of height log(n/log n) over n/log n leaves.)

Example 1.15: Searching in the Shared-Memory Model Given a number x and an array A[1..n] containing n distinct numbers sorted increasingly, all stored in shared memory. Write a parallel algorithm that searches for x in the list and returns its position.

Example 1.15: Searching in the Shared-Memory Model Sequential algorithm: binary search solves the problem in O(log n) time. However, one can achieve O(1) parallel time! In the shared-memory model with n processors, let each processor Pi compare x with the cell A[i]. Designate a location in the memory called answer, initially answer = 0. Any processor that finds x writes its index in the location answer; i.e., if answer = i, then Pi found x in A[i].
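As a sketch, the one-step search looks as follows in Python; the loop body is what each processor does, and on the real machine all iterations run in a single parallel step because they are independent (parallel_search is our name).

```python
def parallel_search(A, x):
    """Emulate Example 1.15: processor P_i compares x with A[i]."""
    answer = 0                           # shared cell; 0 means "not found"
    for i in range(1, len(A) + 1):       # P_1 .. P_n, all in parallel
        if A[i - 1] == x:
            answer = i                   # keys are distinct, so at most one
    return answer                        # processor ever writes here
```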

Example 1.15: Analysis Clearly the overall time is O(1) steps. Speedup = O(log n), far below n = the number of processors used. Is it possible to use only O(log n) processors and achieve the same performance?

How about searching in the BTIN model? Use the BTIN model with n leaves (and 2n-1 total processors), where leaf processor Pi holds A[i]. The root takes the input x and propagates it to the leaves, and the answer returns back to the root. Any parallel algorithm has to take at least Ω(log n) steps just to traverse the links between the processors. Speedup = O(1) at best, which is way smaller than 2n-1, the number of processors.

Having said that ... It appears that the speedup theorem is not always true, especially if the parallel computer has many different input streams which cannot be simulated properly on a sequential computer. Counter-example: read 1.17.

Slowdown Folklore Theorem Theorem: If a certain problem can be solved with p processors in tp time and with q processors in tq time, where q < p, then tp ≤ tq ≤ tp + p·tp/q. That is, when the number of CPUs decreases from p to q, the running time can slow down by a factor of at most (1 + p/q) in the worst case; equivalently, when the number of CPUs increases from q to p, the running time can be reduced by a factor of 1/(1 + p/q) in the best case.

Idea of Proof Suppose you have a parallel algorithm that runs on a computer with p processors. To run the same algorithm on a computer with q processors, distribute the tasks that the p processors do (at most p·tp steps in total) among the q processors as evenly as possible: each original round of at most p steps is replayed in at most ⌈p/q⌉ ≤ 1 + p/q rounds, and so tq ≤ tp + p·tp/q. For details see Page 18.
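The counting behind the bound can be checked with a few lines of Python; this is an illustration of the argument, not code from the text, and assumes we know how many operations s_t the p-processor algorithm performs in each round t.

```python
def time_on_fewer_cpus(ops_per_round, q):
    """Replay each round of at most p operations on q processors:
    round t costs ceil(s_t / q) rounds, so the total is at most
    t_p + (sum of s_t)/q <= t_p + p*t_p/q, matching the theorem."""
    return sum(-(-s // q) for s in ops_per_round)   # ceil division

# A 4-processor run with rounds [4, 4, 2] replayed on q = 2 CPUs:
# time_on_fewer_cpus([4, 4, 2], 2) == 5  <=  3 + 4*3/2 = 9
```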

Slowdown Folklore Theorem (Brent's Theorem) Notice that if q = 1, then 1 ≤ t1/tp = S(1,p) ≤ 1 + p. The slowdown theorem is not always true either, especially when the distribution of the input data or the communications impose overhead on the running time. See Example 1.19.

3. Number of Processors ... Why? Algorithms (with the same running time, on the same model) that use fewer processors are preferred (less expensive). Sometimes optimal times and speedups can be achieved only with a certain number of processors. A minimum number of processors may be required for the computation to succeed. The slowdown and speedup theorems show that the number of processors is important. Certain computational models may not accommodate the required number of processors (e.g., they may allow only perfect squares or prime numbers). In combinatorial circuits each CPU is used at most once, which gives an upper bound on the time.

4. The Work The work is defined as the exact total number of elementary steps executed by all processors. (By contrast, the running time is the maximum number of elementary steps used by any single processor.) Exercise: Think about the combinatorial circuit model.

5. The Cost The cost C(n) is an upper bound on the total number of elementary steps used by all processors, defined as C(n) = t(n) × p(n), where t(n) = the running time and p(n) = the number of processors. Note: not all processors are necessarily active during all t(n) time units. In the combinatorial circuit model, C(n) = p(n) by definition.

Example Six processors P1, ..., P6 over 8 time steps (see figure): Running time = 8. Cost = 8 × 6 = 48. Work = 6 + 4 + 4 + 8 + 5 + 5 = 32. In general, RT ≤ Work ≤ Cost.
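These three measures are simple enough to compute directly from the per-processor step counts; the helpers below are a sketch (ours, not the text's) using the slide's numbers.

```python
def running_time(steps): return max(steps)               # slowest processor
def work(steps):         return sum(steps)               # steps executed
def cost(steps):         return max(steps) * len(steps)  # t(n) * p(n)

steps = [6, 4, 4, 8, 5, 5]       # the six processors from the example
assert running_time(steps) == 8
assert work(steps) == 32
assert cost(steps) == 48
assert running_time(steps) <= work(steps) <= cost(steps)
```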

5.1 Cost Optimality Notice that the cost is precisely the worst-case running time needed to simulate the parallel algorithm on a sequential computer (when such a simulation is possible). In what follows, we restrict ourselves to "simulatable" parallel algorithms only. If Ω(f(n)) steps are needed to solve a problem sequentially, and the cost of a parallel algorithm is O(f(n)), then we say that the algorithm is asymptotically cost optimal.

5.1 Cost Optimality Recall Example 1.14 of adding n numbers via BTIN. We used p = O(n/log n) processors to achieve O(log n) running time, so the cost = O(n). But adding n numbers needs Ω(n) sequential steps; thus the cost is optimal. Notice: if we use a BTIN with n leaves instead, the cost is O(n log n), which is not optimal.

This means that .. If Ω(f(n)) is a lower bound on the number of steps required to solve a problem of size n, then Ω(f(n)/p) is a lower bound on the running time of any parallel algorithm with p processors. This follows from the speedup theorem, which says that the running time can be reduced by at most a factor of 1/p. Example: Any parallel algorithm that uses n processors needs Ω(log n) steps to sort n numbers, because sequential sorting needs Ω(n log n) steps.

5.1 Cost Optimality 2. The cost of a parallel algorithm is not optimal if an equivalent sequential algorithm exists whose worst-case running time is better than the cost. Recall Example 1.4 of grouping n pairs into m groups: we used n processors in a shared-memory model to solve the problem in O(log n) time. The cost is O(n log n) steps, which is not optimal because the sequential algorithm uses only O(n) steps.

5.1 Cost Optimality 3. Unknown cost optimality is possible when we have a parallel algorithm whose cost matches the best known sequential running time, but we don't know whether that sequential running time is optimal. Example: Matrix multiplication requires Ω(n^2) steps. The best known sequential algorithm takes O(n^c) steps, where 2 < c < 2.38. We don't know whether it is optimal, though.

5.2 Efficiency The efficiency of a parallel algorithm is defined by E(1,p) = t1 / (p × tp), where t1 is the running time of the best known sequential algorithm and tp is the running time of the parallel algorithm on a computer with p processors.
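For completeness, the two ratios side by side as a small Python sketch (the names are ours):

```python
def speedup(t1, tp):
    return t1 / tp

def efficiency(t1, tp, p):
    # E(1,p) = S(1,p) / p: the speedup normalized by the processor count
    return t1 / (p * tp)
```

Up to constants, cost optimality corresponds to E(1,p) = Θ(1): the cost p × tp then matches t1.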

5.2 Efficiency If the parallel algorithm is within our restriction (simulatable), then E(1,p) ≤ 1. If E(1,p) < 1, the parallel algorithm is not cost optimal .... NOT GOOD. If E(1,p) = 1 and t1 is optimal, then the parallel algorithm is cost optimal .... GOOD. If E(1,p) > 1, then the simulated sequential algorithm (if doable!) would be faster than the best known sequential algorithm .... IDEAL. Read Example 1.26.

Summary It is unfair to compare the running time tp of a parallel algorithm to the running time t1 of the best known sequential algorithm. We should compare the cost = p × tp to t1.

6. Success Ratio Consider algorithms that solve problems correctly only with certain probabilities. Let Pr(p) = the probability that a parallel algorithm correctly solves a given problem, and Pr(1) = the probability that a sequential algorithm correctly solves the same problem. The success ratio is defined by sr(1,p) = Pr(p) / Pr(1).

6. Success Ratio The scaled success ratio is ssr(1,p) = Pr(p) / (p × Pr(1)). Usually: sr(1,p) ≤ p and ssr(1,p) ≤ 1.