COMP7330/7336 Advanced Parallel and Distributed Computing Task Partitioning Dr. Xiao Qin Auburn University

COMP7330/7336 Advanced Parallel and Distributed Computing Task Partitioning Dr. Xiao Qin Auburn University. Slides are adapted from Drs. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar.

Review: Characteristics of Task Interactions (read-only vs. read-write) Interactions may be read-only or read-write. In read-only interactions, tasks only read data items associated with other tasks. In read-write interactions, tasks both read and modify data items associated with other tasks. In general, read-write interactions are harder to code, since they require additional synchronization primitives.
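As a hedged illustration of the extra synchronization that read-write interactions require, here is a minimal sketch that guards a shared data item with a Pthreads read-write lock (the mechanism from the earlier Pthread RW Locks lecture). The shared counter, thread count, and function names are illustrative assumptions, not taken from the slides.

/* Sketch: a read-write interaction protected by a Pthreads read-write lock.
 * Readers may hold the lock concurrently; writers need exclusive access.
 * Build with: gcc -pthread rwlock_demo.c */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static long shared_value = 0;                /* data item shared among tasks */

static void *reader(void *arg) {
    pthread_rwlock_rdlock(&lock);            /* read-only interaction */
    printf("reader %ld sees %ld\n", (long)arg, shared_value);
    pthread_rwlock_unlock(&lock);
    return NULL;
}

static void *writer(void *arg) {
    pthread_rwlock_wrlock(&lock);            /* read-write interaction: exclusive */
    shared_value += (long)arg;
    pthread_rwlock_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    pthread_create(&t[0], NULL, writer, (void *)1L);
    pthread_create(&t[1], NULL, reader, (void *)1L);
    pthread_create(&t[2], NULL, writer, (void *)2L);
    pthread_create(&t[3], NULL, reader, (void *)2L);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("final value: %ld\n", shared_value);
    return 0;
}

A purely read-only interaction would need no lock at all once the data is in place, which is why it is the easier case to program.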

Review: Characteristics of Task Interactions (one-way vs. two-way) Interactions may be one-way or two-way. A one-way interaction can be initiated and completed by one of the two interacting tasks. A two-way interaction requires participation from both tasks involved in the interaction. One-way interactions are somewhat harder to code in message-passing APIs.

Mapping Techniques Once a problem has been decomposed into concurrent tasks, these tasks must be mapped to processes (which can then be executed on a parallel platform). Mappings must minimize overheads. – The primary overheads are communication and idling. – Minimizing these overheads often involves conflicting objectives: assigning all work to one processor trivially minimizes communication, but at the expense of significant idling.

Block-Cyclic Distribution: Examples One- and two-dimensional block-cyclic distributions among 4 processes (figure).

Block-Cyclic Distribution A cyclic distribution is a special case in which the block size is one. A block distribution is a special case in which the block size is n/p, where n is the dimension of the matrix and p is the number of processes.
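As a minimal sketch (the function and parameter names are illustrative, not from the slides), the owner of a global index under a one-dimensional block-cyclic distribution can be computed as follows; choosing the block size b as 1 or as n/p recovers the cyclic and block distributions mentioned above.

/* Sketch: owning process of global index i under a 1-D block-cyclic
 * distribution over p processes with block size b (illustrative names). */
int block_cyclic_owner(int i, int b, int p) {
    return (i / b) % p;        /* block number modulo number of processes */
}
/* b == 1    -> cyclic distribution (owner = i % p)
 * b == n/p  -> block distribution  (owner = i / (n/p)) */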

Graph Partitioning Based Data Decomposition In the case of sparse matrices, block decompositions are more complex. Consider the problem of multiplying a sparse matrix with a vector.

Graph Partitioning Based Data Decomposition The graph of the matrix is a useful indicator of the work (number of nodes) and the communication (the degree of each node). Partition the graph so as to assign an equal number of nodes to each process, while minimizing the edge count of the graph partition.
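To make the partitioning objective concrete, the following sketch counts the edge-cut of a given partition of the matrix graph; the CSR arrays (xadj, adjncy) and the part array follow a METIS-like convention that is assumed here for illustration.

/* Sketch: edge-cut of a graph partition. The graph is stored in CSR form
 * (xadj/adjncy, METIS-style naming assumed) and part[v] is the process
 * assigned to vertex v. Every cut edge implies communication between two
 * processes in the sparse matrix-vector product. */
long edge_cut(int nvtxs, const int *xadj, const int *adjncy, const int *part) {
    long cut = 0;
    for (int v = 0; v < nvtxs; v++)
        for (int e = xadj[v]; e < xadj[v + 1]; e++)
            if (part[v] != part[adjncy[e]])
                cut++;
    return cut / 2;            /* each undirected edge appears twice in CSR */
}

A good partitioner keeps the number of vertices per process roughly equal (balanced work) while keeping this count small (low communication).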

Partitioning the Graph of Lake Superior Random partitioning vs. partitioning for minimum edge-cut (figure).

Mappings Based on Task Partitioning Partitioning a given task-dependency graph across processes. Determining an optimal mapping for a general task-dependency graph is an NP-complete problem. Excellent heuristics exist for structured graphs.

Task Partitioning: Mapping a Binary Tree Dependency Graph The example illustrates the dependency graph of one view of quicksort and how it can be assigned to processes in a hypercube.

Task Partitioning: Mapping a Sparse Graph Sparse graph for computing a sparse matrix-vector product and its mapping (figure).

Hierarchical Mappings Sometimes a single mapping technique is inadequate. For example, the task mapping of the binary tree (quicksort) cannot use a large number of processors. For this reason, task mapping can be used at the top level and data partitioning within each level.

An example of task partitioning at the top level with data partitioning at the lower level (figure).

Schemes for Dynamic Mapping Dynamic mapping is sometimes also referred to as dynamic load balancing, since load balancing is the primary motivation for dynamic mapping. Dynamic mapping schemes can be centralized or distributed.

Centralized Dynamic Mapping Processes are designated as masters or slaves. When a process runs out of work, it requests more work from the master. When the number of processes increases, the master may become a bottleneck. To alleviate this, a process may pick up a number of tasks (a chunk) at one time; this is called chunk scheduling. Selecting large chunk sizes may lead to significant load imbalances as well, so a number of schemes have been used to gradually decrease the chunk size as the computation progresses.
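A hedged sketch of centralized dynamic mapping with chunk scheduling in MPI appears below; the tags, chunk size, task count, and message protocol are illustrative assumptions rather than a prescribed scheme.

/* Sketch: master-worker (centralized dynamic mapping) with chunk scheduling.
 * Rank 0 is the master; other ranks request chunks of CHUNK tasks until the
 * pool is exhausted. Tags, sizes, and the "task" itself are illustrative. */
#include <mpi.h>

#define NTASKS   1000
#define CHUNK    10
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                           /* master: hand out chunks on request */
        int next = 0, stopped = 0, dummy;
        MPI_Status st;
        while (stopped < size - 1) {
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);     /* a worker asks for more work */
            if (next < NTASKS) {
                int chunk[2] = { next, next + CHUNK < NTASKS ? next + CHUNK : NTASKS };
                MPI_Send(chunk, 2, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next = chunk[1];
            } else {
                MPI_Send(&dummy, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                stopped++;
            }
        }
    } else {                                   /* worker: request, compute, repeat */
        for (;;) {
            int req = 0, chunk[2];
            MPI_Status st;
            MPI_Send(&req, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(chunk, 2, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            for (int t = chunk[0]; t < chunk[1]; t++) {
                /* process task t here */
            }
        }
    }
    MPI_Finalize();
    return 0;
}

Decreasing CHUNK as the task pool drains (for example, handing out a fixed fraction of the remaining tasks) is one way to realize the gradually shrinking chunk sizes mentioned above.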

Distributed Dynamic Mapping Each process can send work to or receive work from other processes. This alleviates the bottleneck in centralized schemes. There are four critical questions: how are sending and receiving processes paired together, who initiates the work transfer, how much work is transferred, and when is a transfer triggered? Answers to these questions are generally application specific. We will look at some of these techniques later in this class.

Minimizing Interaction Overheads Maximize data locality: where possible, reuse intermediate data, and restructure the computation so that data can be reused in smaller time windows. Minimize the volume of data exchange: there is a cost associated with each word that is communicated, so we must minimize the volume of data communicated. Minimize the frequency of interactions: there is a startup cost associated with each interaction, so try to merge multiple interactions into one where possible. Minimize contention and hot spots: use decentralized techniques and replicate data where necessary.

Minimizing Interaction Overheads (continued) Overlap computations with interactions: use non-blocking communications, multithreading, and prefetching to hide latencies. Replicate data or computations. Use group communications instead of point-to-point primitives. Overlap interactions with other interactions.
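A minimal sketch of overlapping computation with interaction using non-blocking MPI calls follows; the neighbor ranks, buffer size, and the split into interior and boundary work are illustrative assumptions.

/* Sketch: hiding communication latency with non-blocking MPI. Post the
 * sends/receives first, do work that does not need the incoming data,
 * then wait only when the boundary computation requires it. */
#include <mpi.h>

#define N 1024

static void compute_interior(void) { /* work independent of incoming data */ }
static void compute_boundary(const double *halo) { (void)halo; /* work that needs it */ }

void exchange_and_compute(double *sendbuf, double *recvbuf,
                          int left, int right, MPI_Comm comm) {
    MPI_Request req[2];

    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, comm, &req[0]);   /* post the interaction */
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, comm, &req[1]);

    compute_interior();                        /* overlap: compute while messages move */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);  /* synchronize only when the data is needed */
    compute_boundary(recvbuf);
}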

Parallel Algorithm Models An algorithm model is a way of structuring a parallel algorithm by selecting a decomposition and mapping technique and applying the appropriate strategy to minimize interactions. Data Parallel Model: tasks are statically (or semi-statically) mapped to processes and each task performs similar operations on different data. Task Graph Model: starting from a task-dependency graph, the interrelationships among the tasks are utilized to promote locality or to reduce interaction costs.

Parallel Algorithm Models (continued) Master-Slave Model: one or more processes generate work and allocate it to worker processes. This allocation may be static or dynamic. Pipeline / Producer-Consumer Model: a stream of data is passed through a succession of processes, each of which performs some task on it. Hybrid Models: a hybrid model may be composed either of multiple models applied hierarchically or of multiple models applied sequentially to different phases of a parallel algorithm.
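As a closing illustration, the pipeline / producer-consumer model can be sketched with each MPI rank acting as one stage of the pipeline; the per-stage operation and the stream length below are made up for this example.

/* Sketch of the pipeline model: rank 0 produces a stream of items, each
 * intermediate rank transforms and forwards them, and the last rank consumes
 * the results. The stage operation and NITEMS are illustrative. */
#include <mpi.h>
#include <stdio.h>

#define NITEMS 100

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < NITEMS; i++) {
        double item;
        if (rank == 0)
            item = (double)i;                              /* producer */
        else
            MPI_Recv(&item, 1, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        item = item * 2.0 + rank;                          /* this stage's (made-up) work */

        if (rank < size - 1)
            MPI_Send(&item, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("item %d -> %f\n", i, item);            /* consumer */
    }
    MPI_Finalize();
    return 0;
}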