Load Balancing Parallel Applications on Heterogeneous Platforms.

Heterogeneous Platforms
- So far we've only talked about platforms in which all processors/nodes are identical, which is representative of most supercomputers
- Clusters often "end up" heterogeneous: built homogeneous, then new nodes with faster CPUs are added, so a cluster that stays up 3 years will likely have several generations of processors in it
- Networks of workstations are heterogeneous: couple together all the machines in your lab to run a parallel application (this became popular with PVM)
- "Grids" couple together different clusters, supercomputers, and workstations
- It is important to develop parallel applications that can leverage heterogeneous platforms

Heterogeneous Load Balancing
- There is an impressively large literature on load-balancing for heterogeneous platforms
- In this lecture we're looking only at "static" load balancing: before execution, we know how much load is given to each processor (as opposed to a dynamic algorithm that assigns work to processors when they're "ready")
- We will look at:
  - 2 simple load-balancing algorithms
  - application to our 1-D distributed image processing application
  - application to the 1-D distributed LU factorization
  - a discussion of load-balancing for 2-D data distributions (which turns out to be a very difficult problem)
- We assume a homogeneous network and heterogeneous compute nodes

Static task allocation
- Let's consider p processors
- Let t_1, ..., t_p be the "cycle times" of the processors, i.e., the time to process one elemental unit of computation (work unit) for the application (T_comp)
- Let B be the number of (identical) work units
- Let c_1, ..., c_p be the number of work units processed by each processor, with c_1 + c_2 + ... + c_p = B
- Perfect load balancing happens if c_i x t_i is constant, i.e., each processor receives work in proportion to its computing speed 1/t_i

Static task allocation
- If each c_i = B x (1/t_i) / (1/t_1 + ... + 1/t_p) is an integer (for instance when B is a multiple of L/t_1 + ... + L/t_p, where L = lcm(t_1, ..., t_p)), then we can have perfect load balancing
- But in general, the formula for c_i does not give an integer solution
- There is a simple algorithm that gives the optimal (integer) allocation of work units to processors in O(p^2)
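
In formula form (a direct transcription of the allocation rule described above, written in LaTeX):

c_i \;=\; B \cdot \frac{1/t_i}{\sum_{j=1}^{p} 1/t_j},
\qquad\text{so that}\qquad
c_i \, t_i \;=\; \frac{B}{\sum_{j=1}^{p} 1/t_j} \quad\text{for all } i.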

Simple Algorithm

// initialize with the fractional values, rounded down
for i = 1 to p:
    c_i = floor( B * (1/t_i) / (1/t_1 + ... + 1/t_p) )

// iteratively increase the c_i values
while (c_1 + ... + c_p < B):
    find k in {1, ..., p} such that t_k * (c_k + 1) = min_i { t_i * (c_i + 1) }
    c_k = c_k + 1
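
A minimal runnable sketch of this algorithm in Python (illustration only; the function name allocate and the variable names are mine, not from the slides):

import math

def allocate(t, B):
    """Optimal integer allocation of B identical work units to p processors
    with cycle times t[0..p-1]."""
    p = len(t)
    inv_sum = sum(1.0 / ti for ti in t)
    # initialize with the fractional values, rounded down
    c = [math.floor(B * (1.0 / ti) / inv_sum) for ti in t]
    # hand out the remaining units one by one to the processor that would
    # finish its extra unit earliest (at most p-1 iterations, each scanning
    # p processors, hence the O(p^2) bound)
    while sum(c) < B:
        k = min(range(p), key=lambda i: t[i] * (c[i] + 1))
        c[k] += 1
    return c

print(allocate([3, 5, 8], 10))   # -> [5, 3, 2], the example on the next slide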

Simple Algorithm
- Example: 3 processors, 10 work units
- With cycle times t_1 = 3, t_2 = 5, t_3 = 8 (the same example is reused on a later slide), the algorithm yields c_1 = 5, c_2 = 3, c_3 = 2

An incremental algorithm
- Note that the previous algorithm can be modified slightly to record the optimal allocation for B=1, B=2, etc.
- One can write the result as a list of processors:
  B=1: P1
  B=2: P1, P2
  B=3: P1, P2, P1
  B=4: P1, P2, P1, P3
  B=5: P1, P2, P1, P3, P1
  B=6: P1, P2, P1, P3, P1, P2
  B=7: P1, P2, P1, P3, P1, P2, P1
  etc.
- We will see how this comes in handy for some load-balancing problems (e.g., LU factorization)
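
The incremental variant is just the greedy rule applied one unit at a time; a small Python sketch (again an illustration, with hypothetical names):

def allocation_sequence(t, B):
    """For each work unit 1..B, return the (1-based) index of the processor
    that receives it under the greedy rule min_i t_i * (c_i + 1)."""
    c = [0] * len(t)
    seq = []
    for _ in range(B):
        k = min(range(len(t)), key=lambda i: t[i] * (c[i] + 1))
        c[k] += 1
        seq.append(k + 1)
    return seq

print(allocation_sequence([3, 5, 8], 7))
# -> [1, 2, 1, 3, 1, 2, 1], i.e., P1, P2, P1, P3, P1, P2, P1 as in the list above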

Image processing algorithm
- [Figure: the image is processed in parallelogram-shaped blocks (first block, second block); n = 44, p = 4, r = 3, k = 4]
- k = width of the parallelograms
- r = height of the parallelograms (number of rows per processor)

On a heterogeneous platform
- Compute a different r value for each processor using the simple load-balancing algorithm:
  - pick B
  - compute the static allocation
  - allocate rows in that pattern, in a cyclic fashion
- Example: B = 10, t_1 = 3, t_2 = 5, t_3 = 8; allocation: c_1 = 5, c_2 = 3, c_3 = 2, i.e., blocks of 5 rows, 3 rows, and 2 rows
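
A sketch of the cyclic allocation (hypothetical helper, only to illustrate the pattern): with c = (5, 3, 2), each period of 10 rows gives 5 consecutive rows to P1, 3 to P2 and 2 to P3, and the pattern repeats down the image.

def owner_of_row(row, c):
    """Map image row 'row' (0-based) to a processor id (1-based), repeating the
    static allocation pattern cyclically: c[0] rows to P1, then c[1] to P2, ..."""
    pattern = [proc + 1 for proc, count in enumerate(c) for _ in range(count)]
    return pattern[row % len(pattern)]

print([owner_of_row(r, [5, 3, 2]) for r in range(12)])
# -> [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 1, 1]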

LU Decomposition
- [Figure: one step of the factorization: Reduction, Broadcast, Compute, Broadcasts, Compute]
- Reduction to find the max a_ji, which is needed to compute the scaling factor
- Independent computation of the scaling factors
- Every update needs the scaling factor and the element from the pivot row
- It is the independent computations that require load-balancing

Load-balancing
- One possibility is to directly use the static allocation for all columns of the matrix
- This balances the total work but not the work remaining at each step: as the leftmost columns are eliminated, processors holding only early columns run out of work (in the figure, processor 1 is underutilized and processor 2 is idle)

Load-balancing
- Another possibility is to rebalance at each step (or every X steps)
- But redistribution is expensive in terms of communication

Load-balancing
- Better: use the distribution obtained with the incremental algorithm, reversed
- B=10: P1, P2, P1, P3, P1, P2, P1, P1, P2, P3
- Assign the columns of the matrix to processors by reading this list from right to left; then, at any point of the factorization, the columns that are still active correspond to a prefix of the list
- Hence the allocation stays optimal as the computation proceeds: optimal load-balancing for 10 columns, for 7 columns, for 4 columns, ...
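
A quick check of this property (illustration only), using the B=10 sequence given above:

from collections import Counter

seq = [1, 2, 1, 3, 1, 2, 1, 1, 2, 3]    # incremental allocation for t = (3, 5, 8)
columns = list(reversed(seq))            # column i of the matrix goes to processor columns[i]

for eliminated in (0, 3, 6):
    remaining = columns[eliminated:]     # columns still active in the factorization
    print(len(remaining), sorted(Counter(remaining).items()))
# 10 active columns -> [(1, 5), (2, 3), (3, 2)]   (optimal for B = 10)
#  7 active columns -> [(1, 4), (2, 2), (3, 1)]   (optimal for B = 7)
#  4 active columns -> [(1, 2), (2, 1), (3, 1)]   (optimal for B = 4)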

Load-balancing
- Of course, this should be done for blocks of columns, not individual columns
- Also, it should be done with a "motif" that spans some number of columns (B=10 in our example) and is repeated cyclically throughout the matrix
- Provided that B is large enough, we get a good approximation of the optimal load balance

2-D Data Distributions
- What we've seen so far works well for 1-D data distributions: use the simple algorithm, then apply the allocation pattern over blocks in a cyclic fashion
- We have seen that a 2-D distribution is what's most appropriate, for instance for matrix multiplication
- We use matrix multiplication as our driving example

2-D Matrix Multiplication
- C = A x B
- Let us assume that we have a p x q processor grid and that p = q = n, so that all 3 matrices are distributed identically (one block per processor, as in the 3 x 3 figure)
- [Figure: 3 x 3 processor grid P_1,1 ... P_3,3 and the matching block decomposition A_1,1 ... A_3,3]

2-D Matrix Multiplication
- We have seen 3 algorithms to do a matrix multiplication (Cannon, Fox, Snyder)
- We will see here another algorithm, called the "outer-product algorithm", used in practice in the ScaLAPACK library because it:
  - does not require initial redistribution of the data
  - is easily adapted to rectangular processor grids, rectangular matrices, and rectangular blocks
- As usual, one should use a block-cyclic distribution, but we will describe the algorithm in its non-blocked, non-cyclic version

Outer-product Algorithm
- Proceeds in steps k = 1, ..., n
- Horizontal broadcasts: P_i,k, for all i = 1, ..., p, broadcasts a_ik to the processors in its processor row
- Vertical broadcasts: P_k,j, for all j = 1, ..., q, broadcasts b_kj to the processors in its processor column
- Independent computations: processor P_i,j can then update c_ij = c_ij + a_ik x b_kj
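
The sequential skeleton of this algorithm is a sum of rank-1 (outer-product) updates; a NumPy sketch of the arithmetic (in the parallel version each P_i,j only performs its own c_ij update after receiving the two broadcasts):

import numpy as np

n = 4
rng = np.random.default_rng(0)
A, B = rng.random((n, n)), rng.random((n, n))
C = np.zeros((n, n))

for k in range(n):
    # step k: column k of A is broadcast along rows, row k of B along columns,
    # then every processor adds its rank-1 contribution a_ik * b_kj
    C += np.outer(A[:, k], B[k, :])

print(np.allclose(C, A @ B))   # True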

Outer-product Algorithm

Load-balancing
- Let t_i,j be the cycle time of processor P_i,j
- We assign to processor P_i,j a rectangle of size r_i x c_j
- [Figure: 3 x 3 processor grid with column widths c_1, c_2, c_3 and row heights r_1, r_2, r_3]

Load-balancing
- First, let us note that it is not always possible to achieve perfect load-balancing
- There are theorems showing that it is only possible if the matrix of processor cycle times (t_ij placed at its spot in the processor grid) has rank 1: intuitively, if t_ij = u_i x v_j, then taking r_i proportional to 1/u_i and c_j proportional to 1/v_j makes r_i x c_j x t_ij constant
- Each processor computes for r_i x c_j x t_ij time
- Therefore, the total execution time is T = max_i,j { r_i x c_j x t_ij }
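
A quick numerical illustration of the rank-1 case (the speed factors below are made up for the example, not from the slides):

import numpy as np

u = np.array([1.0, 2.0, 4.0])          # per-row factors (assumed)
v = np.array([1.0, 3.0, 5.0])          # per-column factors (assumed)
T = np.outer(u, v)                     # rank-1 cycle-time matrix t_ij = u_i * v_j

r = (1.0 / u) / np.sum(1.0 / u)        # row heights proportional to 1/u_i
c = (1.0 / v) / np.sum(1.0 / v)        # column widths proportional to 1/v_j

work = np.outer(r, c) * T              # r_i * c_j * t_ij for every processor
print(np.allclose(work, work[0, 0]))   # True: all processors equally loaded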

Load-balancing as optimization
- Load-balancing can be expressed as a constrained minimization problem:
- minimize max_i,j { r_i x c_j x t_ij }
- with the constraints that the row heights and the column widths each sum to the matrix dimension (r_1 + ... + r_p = c_1 + ... + c_q = n)
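
In LaTeX form (the constraints were cut off in the transcript; the natural reading, for an n x n matrix, is that the row heights and column widths each sum to n):

\min_{r_1,\dots,r_p,\; c_1,\dots,c_q} \; \max_{i,j} \; r_i \, c_j \, t_{ij}
\qquad \text{subject to} \qquad
\sum_{i=1}^{p} r_i = n, \quad \sum_{j=1}^{q} c_j = n, \quad r_i \ge 0, \; c_j \ge 0.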

Load-balancing as optimization
- The load-balancing problem is in fact much more complex
- One can place the processors anywhere in the processor grid
- One must look for the optimum over all possible arrangements of processors in the grid (and thus solve an exponential number of instances of the optimization problem defined on the previous slide)
- The load-balancing problem is NP-hard (the proof is complex)
- A few (non-guaranteed) heuristics have been developed, and they are quite complex

"Free" 2-D distribution
- So far we've looked at distributions that look like this: [Figure: a grid-based partition, with full rows and columns of rectangles]
- But how about this? [Figure: an arbitrary ("free") partition of the square into rectangles]

Free 2-D distribution
- Each rectangle must have a surface (area) proportional to the processor's speed
- One can "play" with the width and the height of the rectangles to try to minimize communication costs
- A communication involves sending/receiving amounts of data proportional to the rectangles' half-perimeters
- One must minimize the sum of the half-perimeters if communications happen sequentially
- One must minimize the maximum of the half-perimeters if communications happen in parallel

Problem Formulation
- Let us consider p numbers s_1, ..., s_p such that s_1 + s_2 + ... + s_p = 1 (we just normalize the processors' cycle times so that they sum to 1)
- Find a partition of the unit square into p rectangles, rectangle i having area s_i and shape h_i x v_i, such that h_1 + v_1 + h_2 + v_2 + ... + h_p + v_p is minimized
- This problem is NP-hard
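
In symbols (a direct transcription of the statement above, in LaTeX):

\text{minimize } \sum_{i=1}^{p} (h_i + v_i)
\quad \text{subject to } h_i \, v_i = s_i \ \ (1 \le i \le p)
\ \text{and the } h_i \times v_i \text{ rectangles tiling the unit square.}

Note that, by the AM-GM inequality, any rectangle of area s_i has half-perimeter at least 2*sqrt(s_i), so 2 * (sqrt(s_1) + ... + sqrt(s_p)) is a simple lower bound on the objective.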

Guaranteed Heuristic
- There is a guaranteed heuristic (i.e., one that stays within some fixed factor of the optimal); it is non-trivial
- It only considers partitions made of processor columns
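
To make the column-based restriction concrete, here is a small evaluation sketch (my own illustration, not the actual heuristic): the unit square is cut into vertical columns, each column's width is the sum of the areas assigned to it, and the processors of a column are stacked inside it.

def half_perimeter_sum(columns):
    """columns: a grouping of the normalized areas s_i into processor columns
    (all areas sum to 1). Returns the sum of half-perimeters of the resulting
    column-based partition of the unit square."""
    total = 0.0
    for col in columns:
        width = sum(col)             # this column's share of the unit square
        for s in col:
            height = s / width       # rectangles are stacked inside the column
            total += width + height  # half-perimeter of this rectangle
    return total

# Five made-up normalized speeds; compare two groupings into columns.
s = [0.36, 0.25, 0.16, 0.13, 0.10]
print(half_perimeter_sum([[0.36, 0.25], [0.16, 0.13, 0.10]]))  # ~4.39
print(half_perimeter_sum([s]))                                 # 6.0 (a single column)
# (the simple lower bound 2 * sum(sqrt(s_i)) is about 4.35 here)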