Lecture 2: Parallel Reduction Algorithms & Their Analysis on Different Interconnection Topologies
Shantanu Dutt, Univ. of Illinois at Chicago

An example of an SPMD message-passing parallel program

SPMD message-passing parallel program (contd.): node xor D, 1
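The program body on this slide did not transcribe; only the fragment "node xor D, 1" survives, which suggests each node computes its round-k partner by flipping one bit of its own id. A minimal sketch of that idea in C (the function name and the reading of the fragment are assumptions, not the lecture's code):

    /* Hypothetical reconstruction: in round k of a hypercube-style SPMD program,
       a node's partner is the node whose id differs from its own in bit k
       (the fragment "node xor D, 1" would correspond to k = 0). */
    int partner(int node_id, int k) {
        return node_id ^ (1 << k);   /* flip bit k of the node id */
    }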

Reduction Computations & Their Parallelization

- The prior max computation is a reduction computation, defined as x = f(D), where D is a data set (e.g., a vector) and x is a scalar quantity. In the max computation, f is the max function and D is the set/vector of numbers whose max is to be computed.
- Reduction computations that are associative [defined as f(a,b,c) = f(f(a,b), c) = f(a, f(b,c))] can be easily parallelized using a recursive reduction approach (the final value f(D) needs to be at some processor at the end of the parallel computation); see the sketch after this list:
  - The data set D is evenly distributed among P processors: each processor Pi has a disjoint subset Di of |D|/P data elements.
  - Each processor Pi performs the computation f(Di) on its data subset Di. By associativity, the final result = f(f(D1), f(D2), ..., f(DP)). This is achieved via communication between the processors and f computations over successively larger subsets of D.
  - A natural communication pattern is that of a binary tree, but a more general pattern is described below.
  - Each processor then engages in (log P) rounds of message passing with other processors. In the k'th round, Pi communicates with a unique partner processor Pj = partner(Pi, k), in which it either sends its current f result to Pj (if, say, its id is greater than Pj's) or receives Pj's current f result (if its id is less than Pj's).
  - If Pi receives a partial result b from Pj in the k'th round, it computes a = f(a,b), where a is its current result, and participates in the (k+1)'th round of communication. If Pi has sent its result to Pj, it does not participate in any further rounds of communication or computation; its task is done.
  - At the end of the (log P) rounds of communication, the processor with the least id (= 0) holds f(D).
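A minimal C/MPI sketch of the recursive-halving reduction just described, with f = max. The power-of-2 assumption on P, the partner rule (flip bit k of the id), and the placeholder local data are illustrative assumptions rather than the lecture's exact program.

    /* Recursive-halving max reduction: after log2(P) rounds, rank 0 holds f(D). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int P, me;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &P);      /* assume P is a power of 2 */
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        double a = (double)me;                  /* stand-in for f(Di), this rank's local partial result */

        for (int mask = 1; mask < P; mask <<= 1) {
            int p = me ^ mask;                  /* partner(Pi, k): flip bit k of the id */
            if (me > p) {                       /* higher id sends its result and drops out */
                MPI_Send(&a, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
                break;
            } else {                            /* lower id receives and combines: a = f(a, b) */
                double b;
                MPI_Recv(&b, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (b > a) a = b;
            }
        }
        if (me == 0) printf("max = %f\n", a);
        MPI_Finalize();
        return 0;
    }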

Reduction Computations & Their Parallelization (contd.)

- Assuming (Pi, Pj), where Pj = partner(Pi, k), is a unique send-recv pair in round k (the pairs may or may not be disjoint from each other), the # of processors holding the required partial computation results halves after each round; hence the term recursive halving for such parallel computations.
- In general, there are other variations of parallel reduction computations (generally dictated by the interconnection topology) in which the # of processors reduces by some factor other than 2 in each round of communication. The general term for such parallel computations is recursive reduction.
- A topology-independent recursive-halving communication pattern is shown below. Note also that as the # of processors involved halves, the # of initial data subsets that each "active" partial result represents/covers doubles (a recursive doubling of coverage of data subsets by each active partial result). Total # of msgs sent is P-1 (Why? See the count below.)
- Basic metrics of performance we will use initially: parallel time, speedup (= seq. time / parallel time), efficiency (= speedup / P, where P = # of procs.), and total number or size of msgs.

[Figure: topology-independent recursive-halving communication pattern over time steps 1-3]
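As a check on the P-1 count (numbers illustrative): in round k, P/2^(k-1) processors are still active and half of them send one msg each, so the total is P/2 + P/4 + ... + 1 = P - 1. For example, with P = 8 the rounds contribute 4 + 2 + 1 = 7 = P - 1 msgs.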

Reduction Computations & Their Parallelization (contd.)

- A variation of a reduction computation is one in which every processor needs to hold the f(D) value.
- In this case, instead of a send from Pj to its k'th-round partner Pi (assuming id(Pi) < id(Pj)), Pi and Pj exchange their respective partial results a and b, each computes f(a,b), and each engages in the (k+1)'th round of communication with a different partner.
- Thus there is no recursive halving of the # of processors involved in each subsequent round (the # of participating processors always remains P). However, the # of initial data subsets that each active partial result covers recursively doubles in each round, as in the recursive-halving computation.
- The "exchange" communication pattern for a reduction computation is shown below (total # of msgs sent is P(log P)); a code sketch follows.

[Figure: exchange communication pattern across the (log P) time steps]
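A C/MPI sketch of the exchange variant (an all-reduce) under the same illustrative assumptions (P a power of 2, partner = id XOR 2^k, f = max); MPI_Sendrecv performs the pairwise swap in each round:

    #include <mpi.h>

    /* Exchange-based reduction: every rank holds f(D) after log2(P) rounds.
       Fragment; assumes MPI has already been initialized by the caller. */
    double allreduce_max(double a, int me, int P) {
        for (int mask = 1; mask < P; mask <<= 1) {
            int partner = me ^ mask;            /* a different partner in every round */
            double b;
            MPI_Sendrecv(&a, 1, MPI_DOUBLE, partner, 0,
                         &b, 1, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (b > a) a = b;                   /* a = f(a, b); no rank ever drops out */
        }
        return a;                               /* same value on all P ranks */
    }

Note that each of the P ranks sends one msg in each of the (log P) rounds, which accounts for the P(log P) message total stated above.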

Analysis of Parallel Reduction on Different Topologies

- Recursive-halving-based reduction on a hypercube:
  - Initial computation time = Theta(N/P); N = # of data items, P = # of processors.
  - Communication time = Theta(log P), as there are (log P) msg-passing rounds; in each round all msgs are sent in parallel, each msg is a 1-hop msg, and there is no conflict among msgs.
  - Computation time during the commun. rounds = Theta(log P) [1 reduction oper. in each receiving processor in each round]. Note that computation and communication are sequentialized (i.e., they do not overlap), so the total time is the sum of the two.
  - The comput. and commun. times are the same for exchange communication on a hypercube.
  - Speedup = S(P) = Seq._time / Parallel_time(P) = Theta(N) / [Theta(N/P) + Theta(2 log P)] ~ Theta(P) if N >> P; an illustrative plug-in follows.

[Figure: (a) hypercubes of dimensions 0 to 4; (b) msg pattern for a reduction comput. using recursive halving over the time steps, with processor 000 holding the final result; (c) msg pattern for a reduction comput. using exchange communication, with all processors holding the final result]
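As an illustrative plug-in (numbers assumed, not from the lecture): with N = 2^20 and P = 64, the parallel time is roughly N/P + 2 log P = 16384 + 12 = 16396 unit operations versus about N = 1048576 sequentially, so S(P) ~ 1048576 / 16396 ~ 64 ~ P, consistent with S(P) ~ Theta(P) when N >> P.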

Analysis of Parallel Reduction on Different Topologies (contd.)

- Recursive reduction on a direct tree:
  - Initial computation time = Theta(N/P); N = # of data items, P = # of processors.
  - Communication time = Theta(log((P+1)/2)) = Theta(log P), as there are log((P+1)/2) msg-passing rounds; in each round all msgs are sent in parallel, each msg is a 1-hop msg, and there is no conflict among msgs.
  - Computation time during the commun. rounds = Theta(2 log((P+1)/2)) [2 reduction opers. in each receiving "parent" processor in each round] = Theta(2 log P). Again, comput. and commun. are not overlapped.
  - Speedup = S(P) = Seq._time / Parallel_time(P) = Theta(N) / [Theta(N/P) + Theta(3 log P)] ~ Theta(P) if N >> P; a small worked example follows.

[Figure: recursive reduction in (a) a direct tree network and (b) an indirect tree network, with each round labeled (round #, hops)]
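For a concrete check (numbers assumed): a complete binary tree with P = 15 processors has (P+1)/2 = 8 leaves, so there are log((P+1)/2) = 3 msg-passing rounds; with 2 reduction opers. at each receiving parent per round, the computation during the rounds takes about 2*3 = 6 oper. times along the critical path, matching the Theta(2 log P) term above.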

Analysis of Parallel Reduction on Different Topologies (contd.)

- Recursive reduction on an indirect tree:
  - Initial computation time = Theta(N/P); N = # of data items, P = # of processors.
  - Communication time = 2 + 4 + ... + 2(log P) = 2[(log P)((log P) + 1)/2] = (log P)((log P) + 1) = Theta((log P)^2), as there are (log P) msg-passing rounds; in round k all msgs are sent in parallel, each msg is a (2k)-hop msg, and there is no conflict among msgs.
  - Computation time during the commun. rounds = Theta(log P) [1 reduction oper. in each receiving processor in each round].
  - Speedup = S(P) = Seq._time / Parallel_time(P) = Theta(N) / [Theta(N/P) + Theta((log P)^2) + Theta(log P)] ~ Theta(P) if N >> P; a small worked example follows.

[Figure: recursive reduction in (a) a direct tree network and (b) an indirect tree network, with each round labeled (round #, hops), e.g., (1, 2), (2, 4)]
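For a concrete check (numbers assumed): with P = 16 processors, log P = 4 and the per-round hop counts are 2, 4, 6, 8, which sum to 20 = (log P)((log P) + 1) = 4*5, in agreement with the communication-time formula above.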