CS 484 Load Balancing

Similar presentations
Partitioning Screen Space for Parallel Rendering

U of Houston – Clear Lake
Resource Management. A resource can be logical, such as a shared file, or physical, such as a CPU (a node of the distributed system). One of the functions.
Parallel Sorting Sathish Vadhiyar. Sorting  Sorting n keys over p processors  Sort and move the keys to the appropriate processor so that every key.
CS 484. Discrete Optimization Problems A discrete optimization problem can be expressed as (S, f) S is the set of all feasible solutions f is the cost.
Lecture 7-2 : Distributed Algorithms for Sorting Courtesy : Michael J. Quinn, Parallel Programming in C with MPI and OpenMP (chapter 14)
CIS December '99 Introduction to Parallel Architectures Dr. Laurence Boxer Niagara University.
Hierarchical Decompositions for Congestion Minimization in Networks Harald Räcke 1.
CISC October Goals for today: Foster’s parallel algorithm design –Partitioning –Task dependency graph Granularity Concurrency Collective communication.
CS 584. Review n Systems of equations and finite element methods are related.
Advanced Topics in Algorithms and Data Structures An overview of the lecture 2 Models of parallel computation Characteristics of SIMD models Design issue.
Parallel Simulation etc Roger Curry Presentation on Load Balancing.
CMPE 150- Introduction to Computer Networks 1 CMPE 150 Fall 2005 Lecture 22 Introduction to Computer Networks.
Dynamic Hypercube Topology Stefan Schmid URAW 2005 Upper Rhine Algorithms Workshop University of Tübingen, Germany.
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.
High Performance Computing 1 Parallelization Strategies and Load Balancing Some material borrowed from lectures of J. Demmel, UC Berkeley.
1 Tuesday, September 26, 2006 Wisdom consists of knowing when to avoid perfection. -Horowitz.
Models of Parallel Computation Advanced Algorithms & Data Structures Lecture Theme 12 Prof. Dr. Th. Ottmann Summer Semester 2006.
CS 584. Discrete Optimization Problems A discrete optimization problem can be expressed as (S, f) S is the set of all feasible solutions f is the cost.
Strategies for Implementing Dynamic Load Sharing.
Parallel Programming: Case Studies Todd C. Mowry CS 495 September 12, 2002.
Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer.
Domain decomposition in parallel computing Ashok Srinivasan Florida State University COT 5410 – Spring 2004.
1 Reasons for parallelization Can we make GA faster? One of the most promising choices is to use parallel implementations. The reasons for parallelization.
MGR: Multi-Level Global Router Yue Xu and Chris Chu Department of Electrical and Computer Engineering Iowa State University ICCAD
Parallel Adaptive Mesh Refinement Combined With Multigrid for a Poisson Equation CRTI RD Project Review Meeting Canadian Meteorological Centre August.
Fast Spectrum Allocation in Coordinated Dynamic Spectrum Access Based Cellular Networks Anand Prabhu Subramanian*, Himanshu Gupta*,
Load Balancing and Termination Detection Load balance : - statically before the execution of any processes - dynamic during the execution of the processes.
Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.
High Performance Computing 1 Load-Balancing. High Performance Computing 1 Load-Balancing What is load-balancing? –Dividing up the total work between processes.
Graph Algorithms. Definitions and Representation An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite.
Researchers: Preet Bola Mike Earnest Kevin Varela-O’Hara Han Zou Advisor: Walter Rusin Data Storage Networks.
Distributed Algorithms Rajmohan Rajaraman Northeastern University, Boston May 2012 Chennai Network Optimization WorkshopDistributed Algorithms1.
Parallel Simulation of Continuous Systems: A Brief Introduction
Application Paradigms: Unstructured Grids CS433 Spring 2001 Laxmikant Kale.
CS 584. Load Balancing Goal: All processors working all the time Efficiency of 1 Distribute the load (work) to meet the goal Two types of load balancing.
Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC.
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M
Lecture 4 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South Carolina Department of.
CS 484 Designing Parallel Algorithms Designing a parallel algorithm is not easy. There is no recipe or magical ingredient Except creativity We can benefit.
Adaptive Mesh Applications Sathish Vadhiyar Sources: - Schloegel, Karypis, Kumar. Multilevel Diffusion Schemes for Repartitioning of Adaptive Meshes. JPDC.
Partitioning using Mesh Adjacencies  Graph-based dynamic balancing Parallel construction and balancing of standard partition graph with small cuts takes.
An Evaluation of Partitioners for Parallel SAMR Applications Sumir Chandra & Manish Parashar ECE Dept., Rutgers University Submitted to: Euro-Par 2001.
CS 584. Discrete Optimization Problems A discrete optimization problem can be expressed as (S, f) S is the set of all feasible solutions f is the cost.
Domain decomposition in parallel computing Ashok Srinivasan Florida State University.
Data Structures and Algorithms in Parallel Computing Lecture 7.
Static Process Scheduling
Pipelined and Parallel Computing Partition for 1 Hongtao Du AICIP Research Nov 3, 2005.
Data Structures and Algorithms in Parallel Computing
Scalable and Topology-Aware Load Balancers in Charm++ Amit Sharma Parallel Programming Lab, UIUC.
Predictive Load Balancing Using Mesh Adjacencies for Mesh Adaptation  Cameron Smith, Onkar Sahni, Mark S. Shephard  Scientific Computation Research Center.
Example Apply hierarchical clustering with d_min to below data where c=3. Nearest neighbor clustering (d_min, d_max) will form elongated clusters!
COMMUNICATING VIA FIREFLIES: GEOGRAPHIC ROUTING ON DUTY-CYCLED SENSORS S. NATH, P. B. GIBBONS IPSN 2007.
Load Balancing : The Goal Given a collection of tasks comprising a computation and a set of computers on which these tasks may be executed, find the mapping.
Dynamic Load Balancing Tree and Structured Computations.
COMP7330/7336 Advanced Parallel and Distributed Computing Task Partitioning Dr. Xiao Qin Auburn University
COMP7330/7336 Advanced Parallel and Distributed Computing Task Partitioning Dynamic Mapping Dr. Xiao Qin Auburn University
High Performance Computing Seminar
Auburn University
Parallel Graph Algorithms
Parallel Programming By J. H. Wang May 2, 2017.
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Mapping Techniques Dr. Xiao Qin Auburn University.
Nithin Michael, Yao Wang, G. Edward Suh and Ao Tang Cornell University
Parallel Programming in C with MPI and OpenMP
CS 584.
Integrating Efficient Partitioning Techniques for Graph Oriented Applications My dissertation work represents a study of load balancing and data locality.
CS 6290 Many-core & Interconnect
Load Balancing Definition: A load is balanced if no processes are idle
Parallel Programming in C with MPI and OpenMP
Presentation transcript:

CS 484 Load Balancing

Goal: keep all processors working all the time, i.e., an efficiency of 1. Distribute the load (work) to meet this goal. There are two types of load balancing: static and dynamic.

Load Balancing. The load balancing problem can be reduced to the bin-packing problem, which is NP-complete. For simple cases we can do well, but heterogeneity complicates matters: a system may contain different types of resources (processors, networks, etc.).

Evaluation of Load Balancing. Efficiency: are the processors always working? How much processing overhead does the load-balancing algorithm add? Communication: does load balancing introduce or change the communication pattern? How much communication overhead does the algorithm add? How many edges of the communication graph are cut?

Partitioning Techniques. Regular grids are the easy case: striping, blocking, or using processing power to divide the load more fairly. Generalized graphs need more sophisticated methods: levelization, scattered decomposition, and recursive bisection.

Example: consider a set of twelve independent tasks with the following execution times: {10, 6, 4, 4, 2, 2, 2, 2, 1, 1, 1, 1}. How would you distribute these tasks among 4 processors?

With a consecutive block assignment, the execution time for these twelve tasks would be 20 time units. A schedule that spreads the large tasks across processors would take only 10 time units.
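
To make this concrete, here is a minimal sketch (my own Python illustration, not from the slides) of the longest-processing-time greedy heuristic: sort the tasks by decreasing time and always give the next task to the least-loaded processor. On this task set it finds the 10-unit schedule.

    import heapq

    def lpt_schedule(times, p):
        # Min-heap of (load, processor id, assigned tasks); the next task
        # always goes to the currently least-loaded processor.
        procs = [(0, i, []) for i in range(p)]
        heapq.heapify(procs)
        for t in sorted(times, reverse=True):
            load, i, tasks = heapq.heappop(procs)
            heapq.heappush(procs, (load + t, i, tasks + [t]))
        return sorted(procs, key=lambda x: x[1])

    times = [10, 6, 4, 4, 2, 2, 2, 2, 1, 1, 1, 1]
    for load, i, tasks in lpt_schedule(times, 4):
        print(f"P{i}: {tasks} -> {load} units")   # makespan is 10 units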

Evaluation of Load Balancing. Goal: find a good mapping from the application graph G = (V,E) onto the processor graph H = (U,F). Consider: Load: the maximum number of nodes from G assigned to any single node of H. Dilation: the maximum distance in H of the route of any single edge from G. Congestion: the maximum number of edges from G that must be routed via any single edge of H.
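
As an illustration of these three measures, here is a small sketch (mine, not from the slides) that computes load, dilation, and congestion for a given mapping; since congestion depends on the routes chosen, this sketch routes each application edge along one BFS shortest path in H.

    from collections import Counter, deque

    def bfs_parents(H, src):
        # BFS tree from src in the processor graph H (adjacency dict),
        # used to recover one shortest route per application edge.
        parent = {src: None}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in H[u]:
                if v not in parent:
                    parent[v] = u
                    q.append(v)
        return parent

    def evaluate_mapping(G_edges, mapping, H):
        load = Counter(mapping.values())          # G nodes per H node
        dilation, congestion = 0, Counter()
        for a, b in G_edges:
            u, v = mapping[a], mapping[b]
            parent = bfs_parents(H, u)
            path = [v]
            while parent[path[-1]] is not None:   # walk the route back to u
                path.append(parent[path[-1]])
            dilation = max(dilation, len(path) - 1)
            for e in zip(path, path[1:]):
                congestion[frozenset(e)] += 1
        return (max(load.values()), dilation,
                max(congestion.values(), default=0))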

Overall goal: find a mapping π that minimizes all three measures: load, dilation, and congestion. Note: today's networks make dilation inconsequential to some extent.

Levelization. Begin with a boundary and number these nodes level 1. All nodes connected to a level-1 node are labeled level 2, and so on. Partitioning is then performed as follows: determine the number of nodes per processor, count off the nodes of each level until that quota is exhausted, then proceed to the next level.
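
A minimal levelization sketch (my own Python illustration, assuming the mesh is given as an adjacency dict and the boundary as a list of nodes): a breadth-first sweep assigns levels, then nodes are counted off in level order into p nearly equal parts.

    from collections import deque

    def levelize(G, boundary):
        # Level 1 = boundary nodes; everything adjacent to level k
        # gets level k+1. Returns the nodes in level order.
        level = {v: 1 for v in boundary}
        order = list(boundary)
        q = deque(boundary)
        while q:
            u = q.popleft()
            for v in G[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    order.append(v)
                    q.append(v)
        return order

    def levelized_partition(G, boundary, p):
        order = levelize(G, boundary)
        quota = (len(order) + p - 1) // p         # nodes per processor
        return {v: i // quota for i, v in enumerate(order)}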

Levelization

Recursive Coordinate Bisection. Divide the domain based on the physical coordinates of the nodes: pick a dimension, divide in half, and recurse. RCB uses no connectivity information, so many edges cross partition boundaries and partitions may be disconnected. Some newer research based on graph separators overcomes some of these problems.
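
A sketch of RCB (mine, not from the slides), assuming the nodes are coordinate tuples: recursively sort along one dimension, cut so each side gets a proportional share of the nodes, and alternate dimensions.

    def rcb(points, parts, dim=0):
        # Recursive coordinate bisection: median-style cut along the
        # current dimension, alternating dimensions at each level.
        if parts == 1:
            return [points]
        left = parts // 2
        pts = sorted(points, key=lambda pt: pt[dim])
        cut = len(pts) * left // parts            # proportional split point
        nxt = (dim + 1) % len(pts[0])
        return rcb(pts[:cut], left, nxt) + rcb(pts[cut:], parts - left, nxt)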

Unbalanced Recursive Bisection. An attempt at reducing communication costs: create subgrids that have better aspect ratios. Instead of dividing the grid in half, consider unbalanced subgrids of sizes 1/p and (p-1)/p, 2/p and (p-2)/p, etc., and choose the partition size that minimizes the subgrid aspect ratio.

Unbalanced Recursive Bisection

Graph-Theory-Based Algorithms. Geometric algorithms are generally low quality because they do not take connectivity into account. Graph-theory algorithms apply what we know about generalized graphs to the partitioning problem; the hope is that they reduce the cut size.

Greedy Bisection. Start with a vertex of the smallest degree (the least number of edges). Mark all its neighbors, then all its neighbors' neighbors, and so on. The first n/p marked vertices form one subdomain; apply the algorithm to the remaining vertices.
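
A sketch of this greedy growing scheme (my own illustration over an adjacency-dict graph):

    from collections import deque

    def greedy_partition(G, p):
        # Repeatedly seed a BFS at a minimum-degree unassigned vertex
        # and grab n/p vertices for each subdomain.
        part, unassigned = {}, set(G)
        target = len(G) // p
        for k in range(p):
            seed = min(unassigned, key=lambda v: len(G[v]))
            q, grabbed = deque([seed]), set()
            while q and len(grabbed) < target:
                u = q.popleft()
                if u in grabbed:
                    continue
                grabbed.add(u)
                q.extend(v for v in G[u]
                         if v in unassigned and v not in grabbed)
            for v in grabbed:
                part[v] = k
            unassigned -= grabbed
        for v in unassigned:                      # leftovers join the last part
            part[v] = p - 1
        return part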

Recursive Graph Bisection. Based on graph distance rather than coordinate distance. Determine the two most widely separated nodes, then organize and partition the nodes according to their distance from these extremities. Finding the true extremities is computationally expensive, so approximation methods are used.
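
One common approximation (my own sketch) is a double BFS sweep: BFS from an arbitrary node to find a far node a, BFS from a to find b, then split the vertices by their relative distance to a and b.

    from collections import deque

    def bfs_dist(G, src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in G[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist

    def graph_bisect(G):
        # Double sweep: a is far from an arbitrary start, b is far
        # from a, giving an approximate "diameter" pair of extremities.
        da = bfs_dist(G, next(iter(G)))
        a = max(da, key=da.get)
        dist_a = bfs_dist(G, a)
        b = max(dist_a, key=dist_a.get)
        dist_b = bfs_dist(G, b)
        # Sort by closeness to a vs. b and split in half for balance.
        order = sorted(G, key=lambda v: dist_a[v] - dist_b[v])
        half = len(order) // 2
        return set(order[:half]), set(order[half:])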

Recursive Spectral Bisection. Minimize the number of edges cut by the partition.
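
RSB works with the Fiedler vector, the eigenvector for the second-smallest eigenvalue of the graph Laplacian L = D − A; splitting the vertices at its median tends to produce a small cut. A dense NumPy sketch of my own (fine for small graphs; production codes use sparse Lanczos-type eigensolvers):

    import numpy as np

    def spectral_bisect(G):
        # Split the vertices at the median of the Fiedler vector.
        nodes = sorted(G)
        idx = {v: i for i, v in enumerate(nodes)}
        n = len(nodes)
        L = np.zeros((n, n))
        for u in nodes:
            L[idx[u], idx[u]] = len(G[u])         # degree on the diagonal
            for v in G[u]:
                L[idx[u], idx[v]] = -1.0          # -1 for each edge
        w, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
        fiedler = vecs[:, 1]                      # second-smallest eigenvalue
        median = np.median(fiedler)
        part0 = {v for v in nodes if fiedler[idx[v]] <= median}
        return part0, set(nodes) - part0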

Comparing cut sizes on an example mesh: RCB, 529 edges cut; RGB, 618 edges cut; RSB, 299 edges cut.

Dynamic Load Balancing

The load is statically partitioned initially; adjust it when an imbalance is detected. Objectives: rebalance the load, keep the edge cut (communication) minimized, and avoid too much overhead.

Dynamic Load Balancing. Consider adaptive algorithms: after an interval of computation, the mesh is adjusted according to an estimate of the discretization error, coarsened in some areas and refined in others. This mesh adjustment causes load imbalance.

Centralized DLB. Control of the load is centralized. Two approaches: Master-worker (task scheduling): tasks are kept in a central location and workers ask for tasks; this requires lots of tasks with weak locality requirements and no major communication between workers. Load monitor: periodically monitor the load on the processors and adjust it to keep an optimal balance.
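
A minimal master-worker sketch (my own illustration using Python threads in place of processors; an MPI version would exchange messages rather than share a queue):

    import queue
    import threading

    tasks = queue.Queue()
    for t in [10, 6, 4, 4, 2, 2, 2, 2, 1, 1, 1, 1]:
        tasks.put(t)                              # master's central task pool

    done = [0] * 4                                # work completed per worker

    def worker(wid):
        while True:
            try:
                t = tasks.get_nowait()            # ask for the next task
            except queue.Empty:
                return                            # no tasks left: stop
            done[wid] += t                        # stand-in for t units of work

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    print(done)   # uneven tasks self-balance because idle workers pull work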

Decentralized DLB. Generally focused on a work pool. Two approaches: hierarchical and fully distributed.

Fully Distributed DLB. Lower overhead than centralized schemes, but no global information: the load is only locally optimized, propagation is slow, and the balance may not be as good as with a centralized scheme. Three steps: flow calculation (how much to move), mesh node selection (which work to move), and the actual mesh node migration.

Flow Calculation. View it as a network flow problem on the processor communication graph: add source and sink nodes, connect the source to every processor with an edge valued at that processor's current load, and connect every processor to the sink with an edge valued at the mean load.
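
A sketch of that construction (mine; solving the resulting problem is left to any max-flow routine):

    def build_flow_network(comm_graph, load):
        # comm_graph: processor adjacency dict; load: current work per
        # processor. The flow on each processor-processor edge in a
        # solution says how much load to migrate across that edge.
        mean = sum(load.values()) / len(load)
        cap = {}
        for p in comm_graph:
            cap[("src", p)] = load[p]             # source edge = current load
            cap[(p, "sink")] = mean               # sink edge = mean load
            for q in comm_graph[p]:
                cap[(p, q)] = float("inf")        # communication edges
        return cap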

Flow Calculation. Many network flow algorithms are more intensive than necessary and are not parallel. Use simpler, more scalable algorithms instead, such as random matchings: pick random pairs of neighboring processes and exchange some load; eventually you may reach balance.

Diffusion. Each processor balances its load with all its neighbors. How much work should I have? li(t+1) = li(t) + α Σ over neighbors j of (lj(t) − li(t)), where α is the weighting factor. How much to send on an edge? α (li − lj). Repeat until the load is balanced.
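
A first-order diffusion sketch (my own illustration; choosing α ≤ 1/(maximum degree + 1) keeps the iteration stable):

    def diffuse(G, load, alpha, steps):
        # Each step, processor i gains alpha*(lj - li) from every
        # neighbor j (negative values mean i sends load away).
        l = dict(load)
        for _ in range(steps):
            delta = {i: sum(alpha * (l[j] - l[i]) for j in G[i]) for i in G}
            l = {i: l[i] + delta[i] for i in G}
        return l

    ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    print(diffuse(ring, {0: 40, 1: 0, 2: 0, 3: 0}, alpha=1/3, steps=25))
    # -> all loads near 10.0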

Diffusion. Convergence to load balance can be slow, but it can be improved with over-relaxation: monitor what is sent in each step, and determine how much to send from the current imbalance and from what was sent in previous steps. Load diffuses across the machine in small steps.

Dimension Exchange. Rather than communicating with all neighbors each round, communicate with only one (a synchronous algorithm). The idea comes from the dimensions of a hypercube; edge coloring extends it to general graphs. Exchange load with the neighbor along the current dimension: l = (li + lj)/2. This converges in d steps on a d-dimensional hypercube. Some graphs need a different exchange factor to converge faster: l = li * a + lj * (1 − a).
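
A hypercube dimension-exchange sketch (my own illustration; processor i's round-k partner is i with bit k flipped, and a = 0.5 gives the plain averaging exchange):

    def dimension_exchange(load, d, a=0.5):
        # load: list of 2**d processor loads. Round k pairs each
        # processor with its neighbor across hypercube dimension k;
        # each pair's total is conserved for any factor a.
        l = list(load)
        for k in range(d):
            l = [a * l[i] + (1 - a) * l[i ^ (1 << k)] for i in range(len(l))]
        return l

    print(dimension_exchange([8, 0, 4, 0, 0, 0, 0, 4], d=3))   # -> all 2.0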