CS 584

Load Balancing. Goal: all processors working all the time, i.e., an efficiency of 1. Distribute the load (work) to meet the goal. Two types of load balancing: static and dynamic.

Load Balancing. The load balancing problem can be reduced to the bin-packing problem, which is NP-complete. For simple cases we can do well, but heterogeneity complicates things: there are different types of resources (processor, network, etc.).

Evaluation of Load Balancing. Efficiency: are the processors always working? How much processing overhead does the load balancing algorithm add? Communication: does load balancing introduce or affect the communication pattern? How much communication overhead does the algorithm add? How many edges of the communication graph are cut?

Partitioning Techniques. Regular grids (easy): striping, blocking, or using processing power to divide the load more fairly. Generalized graphs: levelization, scattered decomposition, recursive bisection.

Levelization. Begin with a boundary and number these nodes level 1. All nodes connected to a level-1 node are labeled level 2, and so on. Partitioning is performed by determining the number of nodes per processor, counting off the nodes of a level until that processor is full, and proceeding to the next level.
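
The procedure above can be sketched in a few lines of Python. This is an illustration, not code from the slides: `adj` is an assumed adjacency-list graph and the boundary set is chosen by hand.

```python
from collections import deque

def levelize(adj, boundary):
    """BFS from the boundary: boundary nodes are level 1,
    their unvisited neighbors level 2, and so on."""
    level = {v: 1 for v in boundary}
    q = deque(boundary)
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                q.append(w)
    return level

def partition_by_levels(adj, boundary, p):
    """Count off roughly n/p nodes in level order for each processor."""
    level = levelize(adj, boundary)
    order = sorted(level, key=lambda v: level[v])
    per_proc = -(-len(order) // p)              # ceil(n / p)
    return {v: i // per_proc for i, v in enumerate(order)}

# 2x4 grid graph, boundary = the left column {0, 4}
adj = {0: [1, 4], 1: [0, 2, 5], 2: [1, 3, 6], 3: [2, 7],
       4: [0, 5], 5: [4, 1, 6], 6: [5, 2, 7], 7: [6, 3]}
part = partition_by_levels(adj, [0, 4], 2)      # left half vs right half
```

On this grid the count-off splits the mesh down the middle, so only the two edges between the halves are cut.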

Levelization

We want to ensure nearest-neighbor communication. Let p be the number of processors and n the number of nodes. Let r_i be the total number of nodes in contiguous levels i and i+1, and let r = max{r_1, r_2, …, r_n}. Nearest-neighbor communication is assured if n/p > r.
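
The condition is easy to check once the level sizes are known; a minimal sketch (`level_sizes` is a hypothetical list giving the number of nodes in each level):

```python
def nearest_neighbor_ok(level_sizes, p):
    """Check the slide's condition n/p > r, where r is the largest
    combined size of two contiguous levels."""
    n = sum(level_sizes)
    r = max(level_sizes[i] + level_sizes[i + 1]
            for i in range(len(level_sizes) - 1))
    return n / p > r

ok = nearest_neighbor_ok([1, 1, 1, 1, 1, 1], 2)   # n=6, r=2, 3 > 2
bad = nearest_neighbor_ok([4, 4], 2)              # n=8, r=8, 4 > 8 fails
```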

Scattered Decomposition. Used for highly irregular grids. Partition the load into a large number r of rectangular clusters such that r >> p. Each processor is given a disjoint set of r/p clusters. Communication overhead can be a problem for highly irregular problems.

Recursive Bisection. Recursively divide the domain into two pieces at each step. Three methods: recursive coordinate bisection, recursive graph bisection, and recursive spectral bisection.

Recursive Coordinate Bisection. Divide the domain based on the physical coordinates of the nodes: pick a dimension and divide in half. RCB uses no connectivity information, so many edges cross partition boundaries and partitions may be disconnected. Some newer research based on graph separators overcomes some of these problems.
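
A minimal RCB sketch on point coordinates (pure Python; the `points` dict mapping node id to coordinates is an assumed representation, not from the slides):

```python
def rcb(points, ids, depth):
    """Recursive coordinate bisection: at each level, sort the nodes
    along the axis with the largest spread and split them in half."""
    if depth == 0:
        return [ids]
    xs = [points[i] for i in ids]
    dims = len(xs[0])
    # pick the dimension with the widest extent
    spread = [max(p[d] for p in xs) - min(p[d] for p in xs)
              for d in range(dims)]
    d = spread.index(max(spread))
    ordered = sorted(ids, key=lambda i: points[i][d])
    mid = len(ordered) // 2
    return (rcb(points, ordered[:mid], depth - 1) +
            rcb(points, ordered[mid:], depth - 1))

pts = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0)}
parts = rcb(pts, list(pts), 1)   # one bisection -> two halves
```

Note that nothing here looks at edges, which is exactly why RCB can cut many of them.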

Inertial Bisection. Coordinate bisection is often susceptible to the orientation of the mesh. Solution: find the principal axis of the communication graph.

Graph-Theory-Based Algorithms. Geometric algorithms are generally low quality because they do not take connectivity into account. Graph theory algorithms apply what we know about generalized graphs to the partitioning problem; hopefully, they reduce the cut size.

Greedy Bisection. Start with a vertex of smallest degree (the fewest edges). Mark all its neighbors, then all its neighbors' neighbors, and so on. The first n/p marked vertices form one subdomain; apply the algorithm to the remaining vertices.
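
One greedy step can be sketched as a breadth-first growth from a minimum-degree vertex (a sketch assuming an adjacency-list dict, not the slides' code):

```python
from collections import deque

def greedy_bisect(adj, size):
    """Grow a region breadth-first from a minimum-degree vertex
    until it holds `size` (roughly n/p) vertices."""
    start = min(adj, key=lambda v: len(adj[v]))
    marked = {start}
    q = deque([start])
    while q and len(marked) < size:
        u = q.popleft()
        for w in adj[u]:
            if w not in marked and len(marked) < size:
                marked.add(w)
                q.append(w)
    return marked, set(adj) - marked

# path 0-1-2-3: growth starts at endpoint 0 and takes {0, 1}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
part1, part2 = greedy_bisect(path, 2)
```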

Recursive Graph Bisection. Based on graph distance rather than coordinate distance. Determine the two furthest-separated nodes, then organize and partition the nodes according to their distance from these extremities. Computationally expensive, but approximation methods can be used.
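
One common cheap approximation for the two furthest-separated nodes (my assumption; the slide does not name a method) is a double BFS: sweep once to find a far node, then sweep again from it.

```python
from collections import deque

def bfs_far(adj, s):
    """Return the last node discovered by BFS from s (a furthest
    node) together with the distance map."""
    dist = {s: 0}
    q = deque([s])
    last = s
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
                last = w
    return last, dist

def pseudo_extremes(adj, start):
    """Approximate the two most separated nodes: BFS from an
    arbitrary start to find a, then BFS from a to find b."""
    a, _ = bfs_far(adj, start)
    b, dist = bfs_far(adj, a)
    return a, b, dist   # dist orders all nodes by distance from a

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
a, b, dist = pseudo_extremes(path, 1)   # finds the two endpoints
```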

Recursive Spectral Bisection. Uses the discrete Laplacian. Let A be the adjacency matrix and D the diagonal matrix where D[i,i] is the degree of node i. Then L_G = A - D.

Recursive Spectral Bisection. L_G is negative semidefinite: its largest eigenvalue is zero, and the corresponding eigenvector is all ones. The magnitude of the second-largest eigenvalue gives a measure of the connectivity of the graph, and its corresponding eigenvector gives a measure of the distances between nodes.

Recursive Spectral Bisection. The eigenvector corresponding to the second-largest eigenvalue is the Fiedler vector. Calculating the Fiedler vector is computationally intensive, but RSB yields connected partitions that are very well balanced.
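
A toy illustration of the idea, under assumptions of mine: the slides do not prescribe an eigensolver, so this uses power iteration, and it works with L = D - A (the negative of the slide's L_G, which has the same eigenvectors). Deflating the all-ones eigenvector each step makes the iteration converge to the Fiedler vector, and splitting by its sign gives the bisection.

```python
def fiedler_partition(adj, iters=500):
    """Spectral bisection sketch: power iteration on M = c*I - L,
    with L = D - A and shift c chosen so M's eigenvalues are
    nonnegative. With the all-ones eigenvector deflated, the
    iteration converges to the Fiedler vector; split by its sign."""
    nodes = sorted(adj)
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    c = 2 * max(len(adj[v]) for v in nodes)   # eigenvalues of L lie in [0, c]
    v = [(-1) ** i for i in range(n)]         # deterministic start vector
    for _ in range(iters):
        m = sum(v) / n                        # deflate the all-ones component
        v = [x - m for x in v]
        # w = M v = c*v - deg*v + sum over neighbors
        w = [c * v[i] - len(adj[u]) * v[i] + sum(v[idx[x]] for x in adj[u])
             for i, u in enumerate(nodes)]
        norm = max(abs(x) for x in w) or 1.0
        v = [x / norm for x in w]
    return ({u for u in nodes if v[idx[u]] < 0},
            {u for u in nodes if v[idx[u]] >= 0})

# two triangles joined by a single edge: the cut should be that edge
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
left, right = fiedler_partition(adj)
```

Production codes use far better eigensolvers (e.g., Lanczos), which is exactly why the slide calls the Fiedler computation expensive.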

Example

RCB: 529 edges cut. RGB: 618 edges cut. RSB: 299 edges cut.

Global vs. Local Partitioning. Global methods produce a “good” partitioning; local methods can then be used to improve it.

The Kernighan-Lin Algorithm. Swap pairs of nodes to decrease the cut, allowing intermediate increases in cut size to escape certain local minima. Loop: choose the pair of nodes with the largest benefit of swapping, exchange them logically (not for real), and lock those nodes; repeat until all nodes are locked. Then find the prefix of the swap sequence that yields the largest accumulated benefit and perform those swaps for real.
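
A brute-force sketch of one KL pass on an unweighted graph. It recomputes the cut for every candidate pair, so it is cubic and purely illustrative; real implementations maintain incremental gain values.

```python
def kl_pass(adj, A, B):
    """One Kernighan-Lin pass: greedily pick the best pair to swap
    (negative gains allowed), logically swap and lock it, then keep
    only the prefix of swaps with the largest accumulated gain."""
    A, B = set(A), set(B)

    def cut():
        return sum(1 for a in A for x in adj[a] if x in B)

    locked, swaps, gains = set(), [], []
    for _ in range(min(len(A), len(B))):
        before = cut()
        best = None
        for a in A - locked:
            for b in B - locked:
                A.remove(a); B.remove(b); A.add(b); B.add(a)   # try swap
                g = before - cut()
                A.remove(b); B.remove(a); A.add(a); B.add(b)   # undo
                if best is None or g > best[0]:
                    best = (g, a, b)
        g, a, b = best
        A.remove(a); B.remove(b); A.add(b); B.add(a)           # logical swap
        locked |= {a, b}
        swaps.append((a, b)); gains.append(g)
    # keep the prefix with the largest accumulated gain, undo the rest
    totals = [sum(gains[:k]) for k in range(len(gains) + 1)]
    best_k = max(range(len(totals)), key=totals.__getitem__)
    for a, b in reversed(swaps[best_k:]):
        A.remove(b); B.remove(a); A.add(a); B.add(b)
    return A, B

# two triangles joined by edge 2-3, starting from a bad partition
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
A2, B2 = kl_pass(adj, {0, 1, 3}, {2, 4, 5})   # swap 3 and 2: cut 5 -> 1
```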

The Kernighan-Lin Algorithm

Helpful-Sets. Two steps: find a set of nodes in one partition and move it to the other partition to decrease the cut size, then rebalance the load. The set of nodes moved must be helpful: the helpfulness of a node is the change in cut size if the node is moved.

Helpful-Sets. All of these sets are 2-helpful.

Helpful-Sets Algorithm

The Helpful-Sets Algorithm: Theory. If there is a bisection and its cut size is not “too small”, then there exists a small 4-helpful set in one side or the other. This 4-helpful set can be moved and will reduce the cut by 4. If the imbalance is not “too large” and the cut of the unbalanced partition is not “too small”, then it is possible to rebalance without increasing the cut size by more than 2. Apply the theory iteratively until the “too small” condition is met.

Multi-level Hybrid Methods. For very large graphs, the time to partition can be extremely costly. Reduce it by coarsening the graph: shrink a large graph to a smaller one that has similar characteristics. Coarsen by heavy-edge matching; partition the coarse graph with simple heuristics.
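
A sketch of heavy-edge matching, assuming edge weights stored in a dict keyed by node pairs (my representation, not the slides'): each unmatched vertex pairs with the unmatched neighbor joined by the heaviest edge, and each matched pair collapses into one coarse vertex.

```python
def heavy_edge_matching(nodes, wedges):
    """Greedy heavy-edge matching: visit vertices in order and match
    each unmatched vertex with its heaviest unmatched neighbor."""
    w = {}
    for (u, v), wt in wedges.items():
        w.setdefault(u, {})[v] = wt
        w.setdefault(v, {})[u] = wt
    matched = {}
    for u in nodes:
        if u in matched:
            continue
        cands = [(wt, v) for v, wt in w.get(u, {}).items()
                 if v not in matched]
        if cands:
            _, v = max(cands)        # heaviest incident edge wins
            matched[u] = v
            matched[v] = u
    # each pair becomes one coarse vertex; unmatched vertices carry over
    return {tuple(sorted((u, v))) for u, v in matched.items()}

pairs = heavy_edge_matching([0, 1, 2, 3],
                            {(0, 1): 5, (1, 2): 1, (2, 3): 7})
```

Collapsing heavy edges first keeps as much edge weight as possible inside coarse vertices, so the coarse graph's cuts approximate the fine graph's cuts.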

Multi-level Hybrid Methods

Comparisons. [Flattened results table: cut sizes for Chaco (ML, IN, IN+KL), Metis (PMetis), and Party (all, all+HS) on the graphs airfoil, crack, wave, lh, mat, and DEBR; the numbers in parentheses are run times in seconds.] ML = multilevel (spectral on the coarse graph, KL on intermediate graphs). IN = inertial. Party = 5 or 6 different methods.

Dynamic Load Balancing. The load is statically partitioned initially; adjust the load when an imbalance is detected. Objectives: rebalance the load, keep the edge cut minimized (communication), and avoid too much overhead.

Dynamic Load Balancing. Consider adaptive algorithms: after an interval of computation, the mesh is adjusted according to an estimate of the discretization error, coarsened in some areas and refined in others. Mesh adjustment causes load imbalance.

Dynamic Load Balancing After refinement, node 1 ends up with more work

Centralized DLB. Control of the load is centralized. Two approaches. Master-worker (task scheduling): tasks are kept in a central location and workers ask for tasks; requires lots of tasks with weak locality requirements and no major communication between workers. Load monitor: periodically monitor the load on the processors and adjust it to keep an optimal balance.
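
A threaded sketch of the master-worker approach, with a shared queue standing in for the master: idle workers pull the next task, so the load balances itself as long as there are enough tasks.

```python
import threading
import queue

def master_worker(tasks, nworkers, work_fn):
    """Centralized task scheduling sketch: tasks sit in one shared
    queue and idle workers pull the next one until none remain."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()     # ask the "master" for a task
            except queue.Empty:
                return                 # no tasks left; worker retires
            r = work_fn(t)
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

out = master_worker(range(10), 4, lambda x: x * x)
```

With uneven task costs, fast workers simply pull more tasks, which is the scheme's whole appeal; the queue itself is the scalability bottleneck.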

Repartitioning. Consider: the dynamic situation is simply a sequence of static situations. Solution: repartition the load after each change; some partitioning algorithms are very quick. Issues: scalability problems, how different the current and new load distributions are, and data dependencies.

Decentralized DLB. Generally focused on a work pool. Two approaches: hierarchical and fully distributed.

Fully Distributed DLB. Lower overhead than centralized schemes, but no global information: the load is locally optimized, propagation is slow, and the balance may not be as good as with a centralized scheme. Three steps: flow calculation (how much to move), mesh node selection (which work to move), and actual mesh node migration.

Flow Calculation. View it as a network flow problem on the processor communication graph: add source and sink nodes, connect the source to all nodes with edge value equal to the current load, and connect the sink to all nodes with edge value equal to the mean load.

Flow Calculation. Many network flow algorithms are more intense than necessary and not parallel. Use simpler, more scalable algorithms, e.g., random matchings: pick random neighboring processes and exchange some load; eventually you may get there.

Diffusion. Each processor balances its load with all of its neighbors: how much work should I have, and how much should I send on each edge? Repeat until all load is balanced.

Diffusion. Convergence to load balance can be slow, but it can be improved with over-relaxation: monitor what is sent in each step, and determine how much to send based on the current imbalance and how much was sent in previous steps.
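
A sketch of plain first-order diffusion (no over-relaxation). The step size `alpha` is my assumption and must be small enough for the iteration to be stable; each step, every edge carries `alpha` times the load difference toward the lighter side.

```python
def diffuse(adj, load, alpha=0.2, steps=200):
    """Diffusion sketch: repeatedly move alpha * (load difference)
    across every edge of the processor graph."""
    load = dict(load)
    for _ in range(steps):
        delta = {v: 0.0 for v in load}
        for u in adj:
            for v in adj[u]:
                if u < v:                      # handle each edge once
                    f = alpha * (load[u] - load[v])
                    delta[u] -= f
                    delta[v] += f
        for v in load:
            load[v] += delta[v]
    return load

# a path of 4 processors with all the work piled on one end
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
balanced = diffuse(path, {0: 40.0, 1: 0.0, 2: 0.0, 3: 0.0})
```

The total load is conserved and every processor converges toward the mean (10.0 here), but slowly: the excess has to percolate edge by edge, which is the slowness the slide refers to.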

Dimension Exchange. Rather than communicating with all neighbors each round, communicate with only one. The idea comes from the dimensions of a hypercube; use edge coloring for general graphs. Exchange load with the neighbor along a dimension: l = (l_i + l_j)/2. This converges in d steps on a d-dimensional hypercube. Some graphs may need a different factor to converge faster: l = a*l_i + (1 - a)*l_j.
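
On a d-dimensional hypercube, node i's neighbor along dimension k is i XOR 2^k, which makes the scheme a few lines (a sketch with a = 1/2, the hypercube case):

```python
def dimension_exchange(loads):
    """Dimension exchange sketch on a d-dimensional hypercube with
    n = 2**d nodes: in round k, each node averages its load with the
    neighbor across dimension k. Converges to the exact mean in d rounds."""
    loads = list(loads)
    n = len(loads)
    d = n.bit_length() - 1            # n = 2**d
    for k in range(d):
        bit = 1 << k
        for i in range(n):
            j = i ^ bit               # neighbor along dimension k
            if i < j:                 # handle each pair once
                avg = (loads[i] + loads[j]) / 2
                loads[i] = loads[j] = avg
    return loads

# 3-cube: all the load starts on node 0, exact balance after 3 rounds
balanced = dimension_exchange([8, 0, 0, 0, 0, 0, 0, 0])
```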

Diffusion & Dimension Exchange. Diffusion can be viewed as a Jacobi method, and dimension exchange as Gauss-Seidel. Multi-level variants are possible: divide the processor communication graph in half, determine the load to shift across the cut, and recursively rebalance each half.

Mesh Node Selection. Must identify which mesh nodes to migrate while minimizing edge cut and overhead; this is very problem dependent. The shape and size of a partition may play a role in accuracy (aspect ratio maintenance). Move items that are furthest from the center of gravity.

Load Balancing Schemes (who do I request work from?). Asynchronous round robin: each processor maintains its own target; ask the target, then increment it. Global round robin: the target is maintained by a master node. Random polling: randomly select a donor; each processor has equal probability of being chosen.
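
Two of these donor-selection rules can be sketched as follows (the function names and the `state` dict are mine, not from the slides):

```python
import random

def next_donor_arr(state, me, p):
    """Asynchronous round robin sketch: each processor keeps its own
    target pointer in `state` and advances it after every request,
    skipping itself."""
    target = state.get(me, (me + 1) % p)
    nxt = (target + 1) % p
    if nxt == me:
        nxt = (nxt + 1) % p
    state[me] = nxt
    return target

def next_donor_random(me, p, rng=random):
    """Random polling sketch: every other processor is an equally
    likely donor."""
    t = rng.randrange(p - 1)
    return t if t < me else t + 1     # shift past self

# processor 1 of 4 cycles through its peers 2, 3, 0, 2, ...
state = {}
donors = [next_donor_arr(state, 1, 4) for _ in range(4)]

rng = random.Random(0)
picks = {next_donor_random(1, 4, rng) for _ in range(50)}
```

Global round robin looks like the asynchronous version except the `state` lives on one master node, which every requester must contact; that contention is its main drawback.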