Principles of Parallel Algorithm Design Prof. Dr. Cevdet Aykanat Bilkent Üniversitesi Bilgisayar Mühendisliği Bölümü.


Principles of Parallel Algorithm Design
Identifying concurrent tasks
Mapping tasks onto multiple processes
Distributing input, output, and intermediate data
Managing access to shared data
Synchronizing processors

Principles of Parallel Algorithm Design
several choices for each step
relatively few combinations lead to a good parallel algorithm
different choices yield best performance on
–different parallel architectures
–different parallel programming paradigms

Decomposition, Tasks
decomposition:
–dividing a computation into smaller parts
–some or all parts can be executed concurrently
atomic task:
–user defined
–indivisible unit of computation
–same size or different sizes

Task Dependence Graphs (TDG)
directed acyclic graph
nodes: atomic tasks
directed edges: dependencies
–some tasks use data produced by other tasks
TDG can be weighted:
–node wgt: amount of computation
–edge wgt: amount of data
multiple ways of expressing certain computations
–different ways of arranging computations lead to different TDGs

Granularity, Concurrency
granularity: number (#) and size of tasks
–fine grain: large # of small tasks
–coarse grain: small # of large tasks
degree of concurrency (DoC):
–# of tasks that can be executed simultaneously
max DoC: max DoC at any given time
–tree TDGs: max DoC = # of leaves (usually)
avg DoC: average DoC over the entire duration

Degree of Concurrency
depends on granularity
–finer task granularity: larger DoC
–there is a bound on the finest granularity of a decomposition
depends on shape of TDG
–shallow and wide TDG: larger DoC
–deep and thin TDG: smaller DoC
critical path: longest directed path between a start node and a finish node
–critical path length = sum of node wgts along the path
–avg DoC = total work / critical path length
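The critical-path quantities above can be sketched in a few lines. This is an illustrative example, not from the slides; the graph, weights, and function names are made up:

```python
# A minimal sketch of computing critical-path length and average degree
# of concurrency for a node-weighted TDG. Graph and weights are
# illustrative values, not taken from the lecture.

def critical_path_length(weights, edges):
    """Longest weighted path in a DAG given as {node: weight} and a
    list of (u, v) dependency edges (u must finish before v starts)."""
    succs = {u: [] for u in weights}
    for u, v in edges:
        succs[u].append(v)

    memo = {}
    def longest_from(u):  # weight of the heaviest path starting at u
        if u not in memo:
            memo[u] = weights[u] + max(
                (longest_from(v) for v in succs[u]), default=0)
        return memo[u]

    return max(longest_from(u) for u in weights)

# Diamond-shaped TDG: task 0 feeds tasks 1 and 2, which feed task 3.
weights = {0: 4, 1: 2, 2: 6, 3: 3}
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]

cp = critical_path_length(weights, edges)  # path 0 -> 2 -> 3: 4 + 6 + 3 = 13
avg_doc = sum(weights.values()) / cp       # avg DoC = total work / critical path
```

With total work 15 and critical path 13, the average DoC is barely above 1: a deep, thin TDG in the sense of the slide.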

Task Interaction Graph (TIG)
tasks share input, output, or intermediate data
–interactions among independent tasks of a TDG
TIG: pattern of interactions among tasks
–node: task
–edge: connects tasks that interact with each other
TIG can be weighted:
–node wgt: amount of computation
–edge wgt: amount of interaction

Processes and Mapping
process vs processor:
–processes: logical computing agents that perform tasks
mapping: assigning tasks to processes
conflicting goals in a good mapping
–maximize concurrency
  map independent tasks to different processes
–minimize idle time / interaction overhead
  map tasks along critical path to the same process
  map tasks with high interaction to the same process
  e.g., map all tasks to the same process

Decomposition Techniques
recursive decomposition
data decomposition
exploratory decomposition
speculative decomposition

Recursive Decomposition
divide-and-conquer strategy → natural concurrency
divide: problem into a set of independent subproblems
conquer: recursively solve each subproblem
combine: solutions to subproblems into a solution of the problem
if sequential algorithm is not based on DAC
–restructure computation as a DAC algorithm
–recursive decomposition to extract concurrency
–e.g., finding minimum of an array A of n numbers
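The minimum-of-an-array example above can be sketched as a divide-and-conquer routine. This is a sequential sketch; the point is that the two half-problems are independent, so a parallel version could run them concurrently:

```python
# Sketch of recursive decomposition: minimum of an array by
# divide-and-conquer. The two recursive calls touch disjoint halves,
# so they are independent tasks; they are shown sequentially here.

def rec_min(a, lo, hi):
    """Minimum of a[lo:hi] via divide-and-conquer."""
    if hi - lo == 1:              # base case: a single element
        return a[lo]
    mid = (lo + hi) // 2
    left = rec_min(a, lo, mid)    # independent subproblem (divide)
    right = rec_min(a, mid, hi)   # independent subproblem (divide)
    return min(left, right)       # combine step (conquer)

A = [7, 3, 9, 1, 8, 2]
m = rec_min(A, 0, len(A))  # 1
```

The TDG of this computation is a binary tree, so the maximum DoC equals the number of leaves, matching the earlier slide.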

Data Decomposition
partition/decompose computational data domain
use this partition to induce task decomposition
–tasks: similar operations on different data parts
partitioning output data
–each output can be computed independently as a fn of input
–example: block matrix multiplication
–data decomposition may not lead to a unique task decomposition
–another example: computing itemset frequencies
  input: transactions & output: itemset frequencies

Data Decomposition
partitioning input data
–partitioning output data may not be possible or desirable
  e.g., finding min or sum of a set of numbers, sorting
–a task created for each part of the input data
–task: all computations that can be done using local data
–a combine step may be needed to combine results of tasks
–example: finding the sum of an array A of n numbers
–example: computing itemset frequencies
partitioning both output and input data
–output data partitioning is feasible
–partitioning of input data offers additional concurrency
–example: computing itemset frequencies
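The array-sum example above can be sketched directly: one task per contiguous part of the input computes a partial sum from its local data, and a combine step adds the partial results. The part count and data are illustrative:

```python
# Sketch of input-data partitioning for summing an array: each 'task'
# owns one contiguous part and sums it locally; a combine step adds
# the per-task partial results.

def partial_sums(a, p):
    """Split a into p near-equal contiguous parts; each part's task
    returns the sum of its local data."""
    n = len(a)
    bounds = [k * n // p for k in range(p + 1)]
    return [sum(a[bounds[k]:bounds[k + 1]]) for k in range(p)]

A = list(range(10))          # 0 + 1 + ... + 9 = 45
parts = partial_sums(A, 3)   # three per-task partial results
total = sum(parts)           # combine step
```

In a real parallel run the p partial sums would be computed concurrently and combined with a reduction.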

Data Decomposition
partitioning intermediate data
–multistage computations
–partitioning input or output data of an intermediate stage
–may lead to higher concurrency
–some restructuring of the algorithm may be needed
–example: block matrix multiplication
owner-computes rule
–each part performs all computations involving data it owns
–input partitioning: perform all computations that can be done using local data
–output partitioning: compute all data in the partition

Other Decomposition Techniques
exploratory decomposition
–search of a configuration space for a solution
–partition the search space into smaller parts
–search each part concurrently
–total parallel work may be more or less than total serial work
–example: 15-puzzle problem
speculative decomposition
hybrid decompositions
–computation structured into multiple stages
–may apply different decompositions in different stages
–examples: finding min of an array and quicksort
–data decomposition, then recursive decomposition

Characteristics of Tasks
task generation: static vs dynamic
–static: all tasks are known a priori to execution of the algorithm
  data decomposition: matrix multiplication
  recursive decomposition: finding min of an array
–dynamic: actual tasks and TDG/TIG not available a priori
  rules or guidelines governing task generation may be known
  recursive decomposition: quicksort
  another example: ray tracing
task sizes: uniform vs non-uniform
–complexity of mapping depends on this
–tasks in matrix multiplication: uniform
–tasks in quicksort: non-uniform

Characteristics of Tasks
knowledge of task sizes
–can be used in mapping
–known: tasks in decompositions for matrix multiplication
–unknown: tasks in 15-puzzle problem
  do not know a priori how many moves will lead to a solution
size of data associated with tasks
–associated data must be available to the process
–size and location of the associated data matter
–consider data migration overhead in the mapping

Characteristics of Inter-Task Interactions
static vs dynamic (temporal structure)
–static: pattern and timing of interactions known a priori
–static interaction example: decompositions for matrix multiplication
–message-passing paradigm (MPP): active involvement of both interacting tasks
  static interactions easy to program
  dynamic interactions harder to program
  tasks assigned additional synchronization and polling responsibilities
–shared-address-space paradigm (SASP): can handle both equally easily
regular vs irregular (spatial structure)
–regular: structure that can be exploited for efficient implementation
  structured/curvilinear grids (implicit connectivity)
  image dithering (example)
–irregular: no such regular pattern exists
  unstructured grids (connectivity maintained explicitly)
  SpMxV (sparse matrix-vector multiplication)
–irregular and dynamic interactions are harder to handle in MPP

Characteristics of Inter-Task Interactions
read-only vs read-write
–read-only: tasks require read-only access to shared data
  example: decompositions for matrix multiplication
–read-write: tasks need to read and write shared data
  example: heuristic search for the 15-puzzle problem
one-way vs two-way
–two-way: data/work needed by a task is explicitly supplied by another
  usually involves predefined producer and consumer
–one-way: only one of a pair of communicating tasks initiates and completes the interaction
–read-only → one-way; read-write → either one-way or two-way
–SASP can handle both interactions equally easily
–MPP cannot handle one-way interactions directly
  source of data should explicitly send it to the recipient
  static one-way: easily converted to two-way via program restructuring
  dynamic one-way: nontrivial program restructuring for converting to two-way
–polling: task checks for pending requests from others at regular intervals

Mapping Techniques
minimize overheads of parallel task execution
–overhead: inter-process interaction
–overhead: process idle time (uneven load distribution)
load balancing
–balanced aggregate load: necessary but not sufficient
–computations & interactions should be well balanced at each stage
–example: 12-task decomposition (tasks 9-12 depend on tasks 1-8)

Static vs Dynamic Mapping
static: distribute tasks prior to execution
–static task generation: either static or dynamic mapping
–good mapping requires knowledge of task sizes, data sizes, TIG
–non-trivial problem (usually NP-hard)
–task sizes known but non-uniform, even with no TDG/TIG → number-partitioning problem
dynamic: distribute workload during execution
–dynamic task generation: requires dynamic mapping
–task sizes unknown: dynamic mapping more effective
–large data size: dynamic mapping costly (in MPP)

Static-Mapping Schemes
mapping based on data partitioning
–data partitioning induces a decomposition
–partitioning selected with final mapping in mind, i.e., p-way data decomposition
–dense arrays
–sparse data structures, graphs (FE meshes)
mapping based on task partitioning
–task dependence graphs, task interaction graphs
hierarchical partitioning
–hybrid decomposition and mapping techniques

Array Distribution Schemes
block distributions: exploit spatial locality of interaction
–each process receives a contiguous block of entries
–1D: each part contains a block of consecutive rows, i.e., the kth part contains rows kn/p ... (k+1)n/p - 1
–2D: checkerboard partitioning
–higher-dimensional distributions
  higher degree of concurrency
  less inter-process interaction
  example: matrix multiplication
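The 1D block formula above can be sketched directly (the function name is made up for illustration):

```python
# Sketch of the 1D block distribution: the k-th of p parts owns rows
# k*n/p through (k+1)*n/p - 1. Integer arithmetic keeps part sizes
# within one row of each other when p does not divide n.

def block_rows(n, p, k):
    """Half-open row range [start, end) owned by part k of p."""
    return (k * n // p, (k + 1) * n // p)

# 10 rows over 4 processes: part sizes 2, 3, 2, 3; every row owned once.
ranges = [block_rows(10, 4, k) for k in range(4)]
```

Note the ranges tile the rows exactly: each part's end index is the next part's start index.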

Array Distribution Schemes
cyclic distribution
–amount of work differs for different matrix entries
  examples: ray casting, dense LU factorization
  block distribution leads to load imbalance
–all processes have tasks from all parts of the matrix
–good load balance, but complete loss of locality
block-cyclic distribution
–partition array into more than p blocks
–map blocks to processes in a round-robin (scattered) manner
randomized block distribution
–when the distribution of work has some special pattern
adaptive 2D array partitionings
–rectilinear, jagged, orthogonal bisection
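The 1D block-cyclic mapping above reduces to one line of index arithmetic. A minimal sketch, with an illustrative block size and process count:

```python
# Sketch of a 1D block-cyclic distribution: the array is cut into more
# than p blocks of size b, and block i goes to process i mod p in
# round-robin fashion, spreading unevenly distributed work across all
# processes while keeping b-sized runs of locality.

def block_cyclic_owner(i, b, p):
    """Owner process of array index i for block size b, p processes."""
    return (i // b) % p

# 12 indices, block size 2, 3 processes:
# blocks [0,1]->P0, [2,3]->P1, [4,5]->P2, [6,7]->P0, [8,9]->P1, [10,11]->P2
owners = [block_cyclic_owner(i, 2, 3) for i in range(12)]
```

Setting b = n/p recovers the plain block distribution; b = 1 recovers the fully cyclic one, which is why block-cyclic is the usual compromise between locality and balance.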

Dynamic Mapping Schemes
centralized schemes
–all tasks maintained in a common pool or by a master process
–idle processes take task(s) from the central pool or master process
–easier to implement
–limited scalability: central pool/process becomes a bottleneck
–chunk scheduling: idle processes get a group of tasks
  danger of load imbalance due to large chunk sizes
  decrease chunk size as program progresses
–e.g., sorting entries in each row of a matrix
  non-uniform tasks & unknown task sizes
–e.g., image-space parallel ray casting
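The "decrease chunk size as program progresses" idea can be sketched with a guided-self-scheduling-style rule, where each chunk is a fixed fraction of the tasks still in the pool. The rule and sizes are illustrative, not from the slides:

```python
# Sketch of centralized chunk scheduling with shrinking chunks: each
# time an idle process visits the pool it takes ceil(remaining / p)
# tasks, so early chunks are large (low scheduling overhead) and late
# chunks are small (low load-imbalance risk near the end).

def guided_chunks(n_tasks, p):
    """Yield successive chunk sizes handed out from a pool of n_tasks
    to p processes; each chunk is ceil(remaining / p), at least 1."""
    remaining = n_tasks
    while remaining > 0:
        chunk = max(1, -(-remaining // p))  # ceil division
        yield chunk
        remaining -= chunk

chunks = list(guided_chunks(20, 4))  # sizes shrink: 5, 4, 3, 2, ... , 1
```

Every task is handed out exactly once, and chunk sizes never grow, which is the property the slide's "decrease chunk size" heuristic relies on.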

Dynamic Mapping Schemes
distributed schemes
–tasks are distributed among processes
–more scalable (no bottleneck)
–critical parameters of distributed load balancing
  how are sending and receiving processes paired?
  who initiates the work transfer: sender or receiver?
  how much work is transferred in each exchange?
  when is the work transfer performed?
suitability to parallel architectures
–both schemes can be implemented in both SAS and MP paradigms
–dynamic schemes require movement of tasks
–computational granularity of tasks should be high in MP systems

Methods for Containing Interaction Overheads
factors:
–volume and frequency of interaction
–spatial and temporal pattern of interactions
maximizing data locality
–minimize volume of data exchange
  minimize overall volume of shared data
  similar to maximizing temporal data locality
–minimize frequency of interaction
  high startup cost associated with each interaction
  restructure algorithm: shared data accessed in large pieces
  similar to increasing spatial locality of data access
minimizing contention and hot spots
–multiple tasks try to access the same resource concurrently
  multiple simultaneous accesses to the same memory block/bank
  multiple processes sending messages to the same process at the same time

Methods for Containing Interaction Overheads
minimizing contention and hot spots (cont.)
–e.g., matrix multiplication based on 2D partitioning
overlapping computations with interactions
–early initiation of an interaction
–needs support from programming paradigm, OS, hardware
–MP: non-blocking message-passing primitives

Methods for Containing Interaction Overheads
replicating data or computation
–replicate frequently accessed read-only shared data
–MP paradigm benefits more from data replication
–replicate computation for shared intermediate results
using optimized collective interaction operations
–usually use available implementations (e.g., by MPI)
–sometimes it may be better to write your own procedure
overlapping interactions with other interactions
–example: one-to-all broadcast

Parallel Algorithm Models
data-parallel model
–data parallelism: identical operations applied concurrently on different data items
task-graph model
–task parallelism: independent tasks in a TDG
–quicksort, sparse matrix factorization
work-pool or task-pool model
–dynamic mapping of tasks onto processes
–mapping may be centralized or distributed

Parallel Algorithm Models
master-slave or manager-worker model
–master process generates work & allocates it to worker processes
pipeline or producer-consumer model
–stream parallelism: execution of different programs on a data stream
–each process in the pipeline is:
  a consumer of the sequence of data items from the preceding process
  a producer of data for the following process in the pipeline
–pipeline may not be a linear chain (it can be a DAG)
hybrid models
–multiple models applied hierarchically
–multiple models applied sequentially to different stages
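The producer-consumer chain above can be sketched with generators, where each stage consumes the stream of the preceding stage and produces one for the next. The stage functions are made-up illustrations, and a real pipeline would run the stages concurrently rather than lazily in one thread:

```python
# Sketch of the pipeline (producer-consumer) model: a linear chain of
# three stages. Each middle stage is a consumer of the preceding
# stage's stream and a producer for the following one.

def produce(n):            # first stage: pure producer of a data stream
    yield from range(n)

def square(stream):        # middle stage: consumer of produce, producer for total
    for x in stream:
        yield x * x

def total(stream):         # last stage: pure consumer of the stream
    return sum(stream)

result = total(square(produce(5)))  # 0 + 1 + 4 + 9 + 16
```

Because generators pull items one at a time, an item flows through all stages before the next is produced, mimicking how a pipeline keeps every stage busy on a different data item once filled.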