CDP Tutorial 3: Basics of Parallel Algorithm Design. Uses some of the slides for chapters 3 and 5 accompanying "Introduction to Parallel Computing", Addison Wesley, 2003.

Presentation transcript:


Preliminaries: Decomposition, Tasks, and Dependency Graphs
A parallel algorithm should decompose the problem into tasks that can be executed concurrently. A decomposition can be illustrated in the form of a directed graph, called a task dependency graph.

Dependency Graphs
Nodes represent tasks and edges represent dependencies: the result of one task is required for processing the next. Each node can be weighted by its task's runtime.

Degree of Concurrency
The maximum number of tasks that can actually run in parallel at any point in the execution. It is an upper bound on the parallel algorithm's speedup. In the example graph, the degree of concurrency = 3.

Critical Path
The longest (weighted) path in the dependency graph. Its length is a lower bound on the program's runtime. In the example graph, the critical path length = 13.

Average Degree of Concurrency
The ratio of the total amount of work (the sum of all task runtimes) to the critical path length. It estimates the speedup the parallel algorithm can achieve. In the example graph, the critical path length is 13, so the average degree of concurrency is the total task runtime divided by 13.
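To make these definitions concrete, here is a minimal sketch that computes the critical path length and the average degree of concurrency of a task dependency graph. The four-task graph and its runtimes are made up for illustration, not taken from the slides.

```python
from functools import lru_cache

# A made-up task dependency graph (not the one from the slides):
# task -> runtime, and task -> list of prerequisite tasks.
runtime = {"a": 4, "b": 4, "c": 4, "d": 1}
deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}

@lru_cache(maxsize=None)
def finish_time(task):
    """Longest weighted path from any source up to and including `task`."""
    return runtime[task] + max((finish_time(p) for p in deps[task]), default=0)

critical_path = max(finish_time(t) for t in runtime)   # 9 for this graph
total_work = sum(runtime.values())                     # 13
print("critical path length:", critical_path)
print("average degree of concurrency:", total_work / critical_path)  # ~1.44
```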

Example: Multiplying a Dense Matrix with a Vector
The computation of each element of the output vector y is independent of the other elements. Based on this, a dense matrix-vector product can be decomposed into n independent tasks, one per row.

Example: Multiplying a Dense Matrix with a Vector (cont.)
While the tasks share data (namely, the vector b), they have no control dependencies: no task needs to wait for the (partial) completion of any other. All tasks are of the same size in terms of the number of operations. Is this the maximum number of tasks we could decompose this problem into? A sketch of the row-per-task decomposition follows.
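Below is a minimal sketch of the one-task-per-row decomposition of y = A x b. The names (A, b, n) follow the slides; the thread-pool scheduling and random data are illustrative assumptions, not the slides' implementation.

```python
from concurrent.futures import ThreadPoolExecutor
import random

n = 8
A = [[random.random() for _ in range(n)] for _ in range(n)]
b = [random.random() for _ in range(n)]

def row_task(i):
    """Task i: compute one element of the output vector, independently
    of all other tasks (they only share read access to b)."""
    return sum(A[i][j] * b[j] for j in range(n))

with ThreadPoolExecutor() as pool:
    y = list(pool.map(row_task, range(n)))   # n independent tasks
```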

Multiplying a dense matrix with a vector – 2n CPUs available
[Figure: the row decomposition from the previous slide, now with two tasks shown per row.] On what kind of platform will we have 2n processors?

Multiplying a dense matrix with a vector – 2n CPUs available (cont.)
[Figure: tasks 1 through 2n each compute half of a row's dot product, and tasks 2n+1 through 3n add the two partial results of each row, giving 3n tasks in total.] A sketch of this finer decomposition follows.
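Here is a minimal sketch of that 3n-task decomposition, assuming the reading of the figure given above (two half-row partial dot products plus one addition task per row); the thread-pool coordination is an illustrative choice.

```python
from concurrent.futures import ThreadPoolExecutor
import random

n = 8
A = [[random.random() for _ in range(n)] for _ in range(n)]
b = [random.random() for _ in range(n)]
half = n // 2

def partial_dot(i, lo, hi):
    """One of the two half-row tasks for row i."""
    return sum(A[i][j] * b[j] for j in range(lo, hi))

with ThreadPoolExecutor() as pool:
    # 2n independent half-row tasks...
    left  = [pool.submit(partial_dot, i, 0, half) for i in range(n)]
    right = [pool.submit(partial_dot, i, half, n) for i in range(n)]
    # ...followed by n addition tasks, each depending on one pair of partials.
    y = [l.result() + r.result() for l, r in zip(left, right)]
```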

Granularity of Task Decompositions
The size of the tasks into which a problem is decomposed is called granularity: many small tasks give a fine-grained decomposition, few large tasks a coarse-grained one.

Parallelization Limitations

Guidelines for Good Parallel Algorithm Design
Maximize concurrency. Spread tasks evenly between processors to avoid idling and to achieve load balancing. Execute tasks on the critical path as soon as their dependencies are satisfied. Minimize communication between processors, or overlap it with computation.

Evaluating a parallel algorithm
[Slides 14–17 presented the evaluation formulas as images; only the headings survived in the transcript.]
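Since the slides' formulas were lost, here is a hedged reconstruction of the standard performance metrics from the accompanying textbook (chapter 5), which slides with this title conventionally define; the symbols T_s, T_p and p are assumptions, not recovered from the slides.

```latex
% Assumed standard definitions (the slides' own formulas were lost):
% T_s = best serial runtime, T_p = parallel runtime on p processors.
\[
  S = \frac{T_s}{T_p} \quad \text{(speedup)}, \qquad
  E = \frac{S}{p} = \frac{T_s}{p\,T_p} \quad \text{(efficiency)}, \qquad
  C = p\,T_p \quad \text{(cost)}.
\]
```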

Simple example
[Slides 18–19 showed a worked example as figures; only stray labels ("p", "1", "n", "p/2") and the caption "This is the result" survive in the transcript.]

Summing n numbers on p CPUs
[Slides 20–21 presented this example as figures; the analysis itself did not survive in the transcript. A sketch of the usual algorithm follows.]
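As a substitute for the lost figures, here is a minimal sketch of the standard scheme for this problem: each CPU sums a block of about n/p numbers locally, and the p partial sums are then combined in a tree reduction of about log2(p) steps. This is the textbook approach, not the slides' own code.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, p):
    """Sum len(data) numbers using p workers: local block sums,
    then a pairwise (tree) reduction of the p partial results."""
    n = len(data)
    chunk = (n + p - 1) // p
    with ThreadPoolExecutor(max_workers=p) as pool:
        partial = list(pool.map(sum, (data[i*chunk:(i+1)*chunk] for i in range(p))))
    # Tree reduction: roughly log2(p) rounds of pairwise additions.
    while len(partial) > 1:
        merged = [a + b for a, b in zip(partial[0::2], partial[1::2])]
        if len(partial) % 2:
            merged.append(partial[-1])
        partial = merged
    return partial[0]

print(parallel_sum(list(range(1000)), p=4))   # 499500
```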

Parallel multiplication of 3 matrices
A, B, C are N x N matrices. We want to compute A x B x C in parallel. How do we achieve the maximum performance?

Parallel multiplication of 3 matrices (cont.)
First compute T = A x B, then R = T x C. Every cell in T can be calculated independently of the others. Are the blue and yellow tasks (in the slide's figure) dependent?

Parallel multiplication of 3 matrices (cont.)
[Figure: the dependencies between cells of R = T x C and the cells of T they require; the labels show R(x,y) depending on T(1,y) … T(n,y), and R(x,v) on T(1,v) … T(n,v).]
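A hedged sketch of this two-stage decomposition follows: each cell of T = A x B is an independent task, and each cell of R = T x C then uses completed values of T. The futures-based coordination and the matrix size are illustrative assumptions; for simplicity the sketch waits for all of T before starting R, which is coarser than the per-cell dependencies in the figure.

```python
from concurrent.futures import ThreadPoolExecutor
import random

N = 4
def rand_matrix():
    return [[random.random() for _ in range(N)] for _ in range(N)]
A, B, C = rand_matrix(), rand_matrix(), rand_matrix()

def cell(row, col):
    """Dot product of one row of the left matrix with one column of the right."""
    return sum(a * b for a, b in zip(row, col))

with ThreadPoolExecutor() as pool:
    # Stage 1: N*N independent tasks, one per cell of T = A x B.
    t_futs = [[pool.submit(cell, A[i], [B[k][j] for k in range(N)])
               for j in range(N)] for i in range(N)]
    T = [[f.result() for f in row] for row in t_futs]
    # Stage 2: N*N tasks for R = T x C, each using a full row of T.
    r_futs = [[pool.submit(cell, T[i], [C[k][j] for k in range(N)])
               for j in range(N)] for i in range(N)]
    R = [[f.result() for f in row] for row in r_futs]
```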

Dynamic Task Creation: Simple Master-Worker
Identify independent computations. Create a queue of tasks. Each process picks tasks from the queue and may insert new tasks into the queue. A sketch follows.
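Below is a minimal master-worker sketch, assuming a shared task queue and a fixed pool of worker threads; the task kinds ("square", "record") and the poison-pill shutdown are illustrative choices, not the slides' design. Note how a worker may push a follow-up task back onto the queue.

```python
import queue
import threading

tasks = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        task = tasks.get()
        if task is None:                  # poison pill: shut down
            tasks.task_done()
            return
        kind, value = task
        if kind == "square":              # a task may spawn a new task
            tasks.put(("record", value * value))
        else:                             # "record": store the result
            with lock:
                results.append(value)
        tasks.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for v in range(10): tasks.put(("square", v))   # master seeds the queue
tasks.join()                              # wait until all tasks are done
for _ in threads: tasks.put(None)
for t in threads: t.join()
print(sorted(results))                    # squares of 0..9
```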

Task Granularity
How do we select the size of each task? Tasks that are too fine-grained incur scheduling and communication overhead; tasks that are too coarse-grained limit concurrency and load balancing.

Hypothetical Implementation: Multiple Threads
Let's assign a separate thread to each task we defined before. How do we coordinate the threads? Initialize each thread to know its dependencies, and build producer-consumer-like logic for every thread; a sketch follows.
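Here is a minimal sketch of that producer-consumer coordination, assuming one thread per task and one completion event per task: a thread blocks on its prerequisites' events (consume) and sets its own event when done (produce). The three-task graph is a made-up illustration, not the slides' example.

```python
import threading

deps_of = {"t1": [], "t2": [], "r": ["t1", "t2"]}   # task -> prerequisites
done = {name: threading.Event() for name in deps_of}
values = {}

def run_task(name, compute):
    for dep in deps_of[name]:
        done[dep].wait()                 # consume: block until producers finish
    values[name] = compute()
    done[name].set()                     # produce: wake up dependent tasks

threads = [
    threading.Thread(target=run_task, args=("t1", lambda: 2)),
    threading.Thread(target=run_task, args=("t2", lambda: 3)),
    threading.Thread(target=run_task, args=("r",  lambda: values["t1"] + values["t2"])),
]
for t in threads: t.start()
for t in threads: t.join()
print(values["r"])                       # 5
```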