Auburn University http://www.eng.auburn.edu/~xqin COMP7330/7336 Advanced Parallel and Distributed Computing Task Interactions and Mapping Techniques Dr. Xiao Qin xqin@auburn.edu Slides are adopted from Drs. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

Review: Characteristics of Tasks Once a problem has been decomposed into independent tasks, the characteristics of these tasks critically impact the choice and performance of parallel algorithms. Relevant task characteristics include: task generation, task sizes, and the size of data associated with tasks.

Characteristics of Task Interactions Regular or irregular interactions? A simple example of a regular, static interaction pattern is image dithering, in which the underlying communication pattern is a structured 2-D mesh. Answer: regular.

Characteristics of Task Interactions Q1: Regular or irregular interactions? The multiplication of a sparse matrix with a vector is a good example of a static, irregular interaction pattern; the structure of the sparse matrix determines the associated interaction pattern. Answer: irregular.

Characteristics of Task Interactions read-only vs. read-write Interactions may be read-only or read-write. Read-only interactions: tasks only read data items associated with other tasks. Read-write interactions: tasks both read and modify data items associated with other tasks. Read-write interactions are harder to code. Q2: Why? Because they require additional synchronization primitives, as the sketch below illustrates.
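The following is a minimal sketch (not from the slides) in C with OpenMP of why a read-write interaction needs extra synchronization: every thread updates the shared variable sum, so the update must be protected (here with a reduction clause); a purely read-only interaction would need no such protection.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int N = 1000000;
    long sum = 0;               /* shared, read-write data item */

    /* Read-write interaction: all threads modify the shared sum, so an
       additional synchronization primitive (the reduction clause, or an
       atomic update) is required to keep the result correct.            */
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N; i++)
        sum += i;               /* i is read-only; sum is read-write */

    printf("sum = %ld\n", sum);
    return 0;
}
```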

Database Query Processing read-only or read-write? The given query is decomposed into a number of tasks; edges in the task graph denote that the output of one task is needed to accomplish the next.

Characteristics of Task Interactions one-way vs. two-way One-way interaction: initiated and completed by only one of the two interacting tasks. Two-way interaction: requires participation from both tasks involved in the interaction. One-way interactions are somewhat harder to code in message-passing APIs.
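A minimal sketch (not from the slides) of how a one-way interaction can be expressed in MPI using one-sided communication: rank 0 initiates and completes the transfer on its own, while rank 1 only exposes a memory window. It assumes the program is run with at least two ranks.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank exposes one int as a window for one-sided access. */
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int value = 42;
        /* One-way interaction: initiated and completed by rank 0 alone;
           rank 1 issues no matching receive.                            */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 received %d via one-sided put\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```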

Characteristics of Task Interactions Q3: One-way or two-way? The multiplication of a sparse matrix with a vector is a good example; consider the sparse matrix and its associated interaction pattern. Answer: two-way.

Mapping Techniques Once a problem has been decomposed into concurrent tasks, these tasks must be mapped to processes (that can be executed on a parallel platform). Mappings must minimize overheads. The primary overheads are communication and idling. Minimizing both is often a conflicting objective: assigning all work to one processor trivially minimizes communication, but at the expense of significant idling.

Mapping Techniques for Minimum Idling A mapping must simultaneously (1) minimize idling and (2) balance the load. Q4: (1) Which case has a balanced load? (2) Which case has minimum idling? Merely balancing the load does not minimize idling.

Mapping Techniques for Minimum Idling Static vs. dynamic. Static mapping: tasks are mapped to processes a priori; we must have a good estimate of the size of each task, and even then the mapping problem may be NP-complete. Dynamic mapping: tasks are mapped to processes at runtime. Q5: Why must tasks be mapped at runtime? Because the tasks may be generated at runtime, or their sizes may not be known in advance. Other factors that determine the choice of technique are the size of the data associated with a task and the nature of the underlying domain. A small sketch of dynamic mapping follows.
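A minimal sketch (not from the slides) in C with OpenMP of dynamic mapping: task sizes vary and are treated as unknown, so threads pull tasks at runtime via a dynamic schedule instead of a fixed a priori assignment. The task_work cost model is purely hypothetical.

```c
#include <stdio.h>
#include <omp.h>

/* Hypothetical per-task cost: sizes vary, so a static block assignment
   would leave some threads idle while others are still working.        */
static long task_work(int t) {
    long s = 0;
    for (long i = 0; i < (long)(t % 7 + 1) * 100000; i++) s += i;
    return s;
}

int main(void) {
    const int ntasks = 64;
    long total = 0;

    /* Dynamic mapping: threads grab tasks one at a time at runtime,
       balancing the load when task sizes are unknown or uneven.      */
    #pragma omp parallel for schedule(dynamic, 1) reduction(+ : total)
    for (int t = 0; t < ntasks; t++)
        total += task_work(t);

    printf("total = %ld\n", total);
    return 0;
}
```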

Schemes for Static Mapping Mappings based on data partitioning. Mappings based on task graph partitioning. Hybrid mappings.

Mappings Based on Data Partitioning We can combine data partitioning with the "owner-computes" rule to partition the computation into subtasks. The simplest data decomposition schemes for dense matrices are 1-D block distribution schemes.
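A minimal sketch (not from the slides) of a 1-D block distribution in C: each of np processes owns a contiguous band of rows of an n-row matrix, and by the owner-computes rule it performs exactly the computations that produce its own rows. The function and variable names are illustrative.

```c
#include <stdio.h>

/* 1-D block distribution: process p of np owns a contiguous band of rows.
   The first (n % np) processes get one extra row when np does not divide n. */
static void block_range(int n, int np, int p, int *lo, int *hi) {
    int base = n / np, rem = n % np;
    *lo = p * base + (p < rem ? p : rem);
    *hi = *lo + base + (p < rem ? 1 : 0);   /* half-open range [lo, hi) */
}

int main(void) {
    int n = 10, np = 4;
    for (int p = 0; p < np; p++) {
        int lo, hi;
        block_range(n, np, p, &lo, &hi);
        /* Owner-computes: process p computes exactly the rows it owns. */
        printf("process %d owns rows [%d, %d)\n", p, lo, hi);
    }
    return 0;
}
```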

Block Array Distribution Schemes Block distribution schemes can be generalized to higher dimensions as well.

Block Array Distribution Schemes: Examples Multiplying two dense matrices A and B, we can partition the output matrix C using a block decomposition. For load balance, we give each task the same number of elements of C. (Note that each element of C corresponds to a single dot product.) The choice of precise decomposition (1-D or 2-D) is determined by the associated communication overhead. In general, a higher-dimensional decomposition allows the use of a larger number of processes.
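A minimal sketch (not from the slides) of a 2-D block decomposition of the output matrix C for dense matrix multiplication; the q x q task grid and the assumption that q divides n are illustrative choices.

```c
#include <stdio.h>

/* 2-D block decomposition of the n x n output matrix C over a q x q grid
   of tasks: task (ti, tj) owns an (n/q) x (n/q) block and computes every
   dot product that falls inside it, so all tasks get equal work.          */
int main(void) {
    int n = 8, q = 2;           /* assumes q divides n for simplicity */
    int b = n / q;              /* block size in each dimension       */

    for (int ti = 0; ti < q; ti++)
        for (int tj = 0; tj < q; tj++)
            printf("task (%d,%d) owns C[%d..%d][%d..%d]\n",
                   ti, tj, ti * b, (ti + 1) * b - 1,
                   tj * b, (tj + 1) * b - 1);
    return 0;
}
```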

Data Sharing in Dense Matrix Multiplication

Cyclic and Block Cyclic Distributions If the amount of computation associated with data items varies, a block decomposition may lead to significant load imbalances. A simple example arises in the LU decomposition (or Gaussian elimination) of dense matrices.

LU Factorization of a Dense Matrix A decomposition of LU factorization into 14 tasks (listed on the slide); notice the significant load imbalance.

Block Cyclic Distributions A variation of the block distribution scheme that can be used to alleviate the load-imbalance and idling problems: partition the array into many more blocks than the number of available processes, and assign blocks to processes in a round-robin manner so that each process gets several non-adjacent blocks.
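A minimal sketch (not from the slides) in C of a block-cyclic assignment: n items are split into blocks of size b, and the blocks are dealt round-robin to np processes, so each process receives several non-adjacent blocks. The sizes are illustrative.

```c
#include <stdio.h>

int main(void) {
    int n = 16, b = 2, np = 4;
    int nblocks = (n + b - 1) / b;   /* more blocks than processes */

    for (int blk = 0; blk < nblocks; blk++) {
        int owner = blk % np;                    /* round-robin owner  */
        int lo = blk * b;
        int hi = (lo + b < n) ? lo + b : n;      /* half-open [lo, hi) */
        printf("block %d -> process %d (items %d..%d)\n",
               blk, owner, lo, hi - 1);
    }
    return 0;
}
```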

Block-Cyclic Distribution for Gaussian Elimination The active part of the matrix in Gaussian elimination shrinks as the computation proceeds. By assigning blocks in a block-cyclic fashion, each processor receives blocks from different parts of the matrix and therefore keeps a share of the active part throughout the computation.

Block-Cyclic Distribution: Examples One- and two-dimensional block-cyclic distributions among 4 processes (shown on the slide).

Block-Cyclic Distribution A cyclic distribution is the special case in which the block size is one. A block distribution is the special case in which the block size is n/p, where n is the dimension of the matrix and p is the number of processes. (p. 123)

Graph Partitioning Based Data Decomposition In the case of sparse matrices, block decompositions are more complex. Consider the problem of multiplying a sparse matrix with a vector.

Graph Partitioning Based Data Decomposition The graph of the matrix is a useful indicator of the work (number of nodes) and the communication (the degree of each node). Partition the graph so as to assign an equal number of nodes to each process while minimizing the edge cut of the partition.
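A minimal sketch (not from the slides) in C of the idea: a small, hypothetical undirected graph is split into parts of equal node count, and the resulting edge cut, i.e. the number of edges crossing parts, measures the communication the mapping will incur. A real partitioner (e.g., recursive bisection) would also try to minimize this cut rather than just balance node counts.

```c
#include <stdio.h>

#define N 6
/* Hypothetical undirected graph in adjacency-list form. */
static const int deg[N]    = {2, 3, 2, 3, 2, 2};
static const int adj[N][3] = {{1,2}, {0,2,3}, {0,1}, {1,4,5}, {3,5}, {3,4}};

int main(void) {
    int np = 2, part[N];

    /* Naive partition: equal node counts per part (balances the work,
       but ignores the edge cut, unlike a real graph partitioner).      */
    for (int v = 0; v < N; v++)
        part[v] = v / ((N + np - 1) / np);

    /* Edge cut: edges whose endpoints land in different parts. */
    int cut = 0;
    for (int v = 0; v < N; v++)
        for (int k = 0; k < deg[v]; k++)
            if (part[v] != part[adj[v][k]])
                cut++;
    printf("edge cut = %d\n", cut / 2);   /* each cut edge counted twice */
    return 0;
}
```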

Partitioning the Graph of Lake Superior Random partitioning vs. partitioning for minimum edge cut (shown on the slide).

Mappings Based on Task Partitioning Partitioning a given task-dependency graph across processes. Determining an optimal mapping for a general task-dependency graph is an NP-complete problem, but excellent heuristics exist for structured graphs.