High Performance Computing 1 Load-Balancing

What is load-balancing?
–Dividing up the total work between processes when running codes on a parallel machine
Load-balancing constraints
–Minimize interprocess communication
Also called: partitioning, mesh partitioning, or domain decomposition

Know your data and memory
Memory is organized in banks; between accesses to any one bank there is a latency period.
Matrix entries are stored column-wise in FORTRAN.

[Figure: matrix addressing in FORTRAN]
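
To make the storage rule concrete, here is a minimal sketch (not from the slides; 0-based indices as seen from C++) of the column-major offset computation the figure illustrates:

```cpp
#include <cstdio>

// Column-major (Fortran-style) storage: element (i,j) of an n-by-m matrix
// lives at offset i + j*n with 0-based indices (Fortran itself is 1-based,
// giving (i-1) + (j-1)*n). Successive row indices i are adjacent in memory.
int colMajorOffset(int i, int j, int n) { return i + j * n; }

int main() {
    // In a 4x4 matrix, a21 (i=1, j=0) immediately follows a11 (offset 0).
    std::printf("offset of a21: %d\n", colMajorOffset(1, 0, 4));  // prints 1
}
```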

Addressing Memory
For illustration purposes, let's imagine 8 banks (128 or 256 are common on chips today), with a bank busy time (bbt) of 8 cycles between accesses. For a 4x4 matrix stored column-wise we thus have:

bank:  1    2    3    4    5    6    7    8
data:  a11  a21  a31  a41  a12  a22  a32  a42
data:  a13  a23  a33  a43  a14  a24  a34  a44

Addressing Memory
If we access data column-wise, we proceed through each bank in order; by the time we return to bank 1 for a13, we (just) avoid the bbt.
On the other hand, if we access data row-wise, we get a11 in bank 1, a12 in bank 5, then a13 in bank 1 again - so instead of an access on clock cycle 3, we have to wait until cycle 9. Then we get a14 in bank 5 again on cycle 10, etc.
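
To check the cycle counts above, here is a minimal simulation sketch (illustrative, not from the slides) of the bank-busy-time rule for the 4x4 example; banks are numbered from 0 here rather than 1:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Completion cycle of a sequence of bank accesses: one access is issued
// per cycle, but a bank cannot be touched again until `bbt` cycles after
// its previous access.
int finishCycle(const std::vector<int>& banks, int bbt, int numBanks) {
    std::vector<int> lastUse(numBanks, -bbt);  // last cycle each bank was used
    int cycle = 0;
    for (int b : banks) {
        cycle = std::max(cycle + 1, lastUse[b] + bbt);
        lastUse[b] = cycle;
    }
    return cycle;
}

int main() {
    const int n = 4, numBanks = 8, bbt = 8;
    std::vector<int> colWise, rowWise;
    // Column-major storage: element (i,j) falls in bank (i + j*n) % numBanks.
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            colWise.push_back((i + j * n) % numBanks);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            rowWise.push_back((i + j * n) % numBanks);
    std::printf("column-wise done at cycle %d\n", finishCycle(colWise, bbt, numBanks));  // 16
    std::printf("row-wise    done at cycle %d\n", finishCycle(rowWise, bbt, numBanks));  // 40
}
```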

Indirect addressing
If addressing is indirect (through an index array), we may wind up jumping all over memory and suffer performance hits because of it.
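
A minimal illustration of the pattern (names are illustrative): the index array dictates the access order at run time, so consecutive iterations may land in arbitrary banks or cache lines.

```cpp
#include <vector>

// Indirect addressing: which element of `a` is touched next is decided
// by `idx` at run time, so the memory access pattern is data-dependent.
double gatherSum(const std::vector<double>& a, const std::vector<int>& idx) {
    double sum = 0.0;
    for (int k : idx) sum += a[k];  // may jump all over memory
    return sum;
}
```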

Shared Memory
Bank conflicts depend on the granularity of memory.
If there are N memory references per cycle, p processors, and a memory with a bbt of b cycles, we need p*N*b memory banks to see uninterrupted access to data.
With B banks, the granularity is g = B/(p*N*b).
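
As an illustrative worked example (numbers assumed, not from the slides): with p = 4 processors, N = 1 memory reference per cycle, and b = 8 cycles of bank busy time, uninterrupted access requires p*N*b = 32 banks; a machine with B = 64 banks then has granularity g = 64/(4*1*8) = 2.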

Moral
Separate the selection of data from its processing.
Each subtask requires its own data structure.
Be prepared to change structures between tasks.

Load-balancing nomenclature
Objects get distributed among the different processes.
Edges represent information that needs to be shared between objects.

Partitioning
Divides up the work: in the example, 5 and 4 objects are assigned to the two processes.
Creates "edge-cuts": edges between objects on different processes, which represent necessary communications between processes.
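
The edge-cut gives a simple measure of the communication a partition induces; here is a minimal counting sketch (graph representation assumed, not from the slides):

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Count edges whose endpoints are assigned to different processes;
// part[v] is the process that owns object v.
int edgeCut(const std::vector<std::pair<int, int>>& edges,
            const std::vector<int>& part) {
    int cut = 0;
    for (const auto& e : edges)
        if (part[e.first] != part[e.second]) ++cut;
    return cut;
}

int main() {
    std::vector<std::pair<int, int>> edges = {{0, 1}, {1, 2}, {2, 3}};
    std::vector<int> part = {0, 0, 1, 1};  // objects 0,1 on process 0; 2,3 on process 1
    std::printf("edge-cut = %d\n", edgeCut(edges, part));  // prints 1: edge (1,2) is cut
}
```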

Work/Edge Weights
Need a good measure of what the expected work may be:
–Molecular dynamics: number of molecules per region
–FEM/finite difference/finite volume, etc.: degrees of freedom, cells/elements
If edge weights are used, we also need a good measure of how strongly objects are coupled to each other.

Static/Dynamic Load-Balancing
Static load-balancing
–Done as a "preprocessing" step before the actual calculation
–Suitable when the objects and edges change little or not at all
Dynamic load-balancing
–Done during the calculation
–Needed when there are significant changes in the objects and/or edges

Dynamic Load-Balancing Example
h-adapted mesh: the workload changes as the computation proceeds
–Calculate a new partition
–Migrate the elements to their assigned processes

Static vs. Dynamic Load Balancing
Static partitioning is insufficient for many applications:
–Adaptive mesh refinement
–Multi-phase/multi-physics computations
–Particle simulations
–Crash simulations
–Parallel mesh generation
–Heterogeneous computers
These need dynamic load balancing.

Dynamic Load-Balancing Constraints
Minimize load-balancing time
–Respect memory constraints
Minimize data migration: prefer incremental partitions
–Small changes in the computation should result in small changes in the partitioning
–Calculating the new partition and migrating the data should take less time than the time saved by computing on the new grid
Must be done in parallel

Methods of Load-Balancing
Geometric
–Based on geometric location
–Faster load-balancing time, with medium-quality results
Graph-based
–Create a graph to represent the objects and their connections
–Slower load-balancing time, but high-quality results
Incremental methods
–Use a graph representation and "shuffle" objects around

Choosing a Load-Balancing Algorithm/Method
No algorithm/method is appropriate for all applications!
Graph load-balancing algorithms for:
–Static load-balancing
–Computations where the ratio of computation time to load-balancing time is high, e.g., implicit schemes with linear and non-linear solution phases

Choosing a Load-Balancing Algorithm/Method
Geometric load-balancing algorithms for:
–Computations where the ratio of computation time to load-balancing time is low, e.g., explicit time-stepping calculations with many time steps and varying workload (MD, FEM crash simulations, etc.)
–Problems with many load-balancing objects

Geometric Load-Balancing
Based on the objects' coordinates
–Want a unique coordinate associated with each object: node coordinates, element centroid, molecule coordinate/centroid, etc.
Partition "space", which induces a partition of the load-balancing objects
Edge cuts are usually not explicitly dealt with

Geometric Load-Balancing Assumptions
Objects that are close together will likely need to share information
–Want compact partitions: high volume-to-surface-area (3D) or area-to-perimeter (2D) ratios
Coordinate information is available
Bounded domain

Geometric Load-Balancing Algorithms
Recursive Coordinate Bisection (RCB)
–Berger & Bokhari
Recursive Inertial Bisection (RIB)
–Taylor & Nour-Omid
Space Filling Curves (SFC)
–Warren & Salmon; Ou, Ranka & Fox; Baden & Pilkington
Octree Partitioning / Refinement-tree Partitioning
–Loy & Flaherty; Mitchell

Recursive Coordinate Bisection
1. Choose an axis for the cut
2. Find the proper location of the cut
3. Group objects together according to their location relative to the cut
4. If more partitions are needed, go to step 1
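
A minimal serial sketch of these four steps (2-D points, names illustrative; production codes also handle weights, ties, and distributed data):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Point { double x, y; int part = 0; };

// Recursive Coordinate Bisection: cut at the median along the longest
// axis of the bounding box, then recurse on each half.
void rcb(std::vector<Point*> pts, int firstPart, int numParts) {
    if (numParts <= 1) {                       // base case: assign the part
        for (Point* p : pts) p->part = firstPart;
        return;
    }
    // Step 1: choose the cut axis -- the longer side of the bounding box.
    double xmin = 1e300, xmax = -1e300, ymin = 1e300, ymax = -1e300;
    for (const Point* p : pts) {
        xmin = std::min(xmin, p->x); xmax = std::max(xmax, p->x);
        ymin = std::min(ymin, p->y); ymax = std::max(ymax, p->y);
    }
    bool cutX = (xmax - xmin) >= (ymax - ymin);
    // Step 2: locate the cut -- the median splits work proportionally to parts.
    size_t mid = pts.size() * (size_t)(numParts / 2) / (size_t)numParts;
    std::nth_element(pts.begin(), pts.begin() + mid, pts.end(),
                     [cutX](const Point* a, const Point* b) {
                         return cutX ? a->x < b->x : a->y < b->y;
                     });
    // Steps 3-4: group by side of the cut and recurse while parts remain.
    std::vector<Point*> lo(pts.begin(), pts.begin() + mid);
    std::vector<Point*> hi(pts.begin() + mid, pts.end());
    rcb(lo, firstPart, numParts / 2);
    rcb(hi, firstPart + numParts / 2, numParts - numParts / 2);
}

int main() {
    std::vector<Point> mesh = {{0, 0}, {1, 0}, {0, 1}, {1, 1}, {2, 2}, {3, 3}};
    std::vector<Point*> ptrs;
    for (Point& p : mesh) ptrs.push_back(&p);
    rcb(ptrs, 0, 4);  // divide among 4 processes
    for (const Point& p : mesh)
        std::printf("(%g,%g) -> part %d\n", p.x, p.y, p.part);
}
```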

Recursive Inertial Bisection
1. Choose a direction for the cut (not necessarily a coordinate axis)
2. Find the proper location of the cut
3. Group objects together according to their location relative to the cut
4. If more partitions are needed, go to step 1
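
The only change from RCB is step 1: the cut direction comes from the data rather than a coordinate axis (commonly the principal axis of inertia). A minimal 2-D sketch of computing that direction by power iteration on the covariance matrix (illustrative only):

```cpp
#include <cmath>
#include <vector>

struct Vec2 { double x, y; };

// Direction of maximal spread: the dominant eigenvector of the 2x2
// covariance (inertia) matrix, found by power iteration. RIB then cuts
// at the median of the points projected onto this direction.
Vec2 principalAxis(const std::vector<Vec2>& pts) {
    double cx = 0, cy = 0;                       // centroid
    for (const Vec2& p : pts) { cx += p.x; cy += p.y; }
    cx /= pts.size(); cy /= pts.size();
    double sxx = 0, sxy = 0, syy = 0;            // covariance entries
    for (const Vec2& p : pts) {
        double dx = p.x - cx, dy = p.y - cy;
        sxx += dx * dx; sxy += dx * dy; syy += dy * dy;
    }
    Vec2 v = {1.0, 0.3};                         // arbitrary start vector
    for (int it = 0; it < 50; ++it) {            // power iteration
        Vec2 w = {sxx * v.x + sxy * v.y, sxy * v.x + syy * v.y};
        double norm = std::sqrt(w.x * w.x + w.y * w.y);
        if (norm == 0.0) break;                  // degenerate: all points coincide
        v = {w.x / norm, w.y / norm};
    }
    return v;
}
```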

Space Filling Curves
A space filling curve is a 1-dimensional curve that passes through every point of an n-dimensional domain.

Load-Balancing with Space Filling Curves
The SFC gives a 1-dimensional ordering of objects located in an n-dimensional domain
–It is easier to work with objects in 1 dimension than in n dimensions
Algorithm:
1. Sort objects by their location on the SFC
2. Calculate cuts along the SFC
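
A common concrete choice is the Morton (Z-order) curve; the slides do not fix a particular curve, so the following is an illustrative sketch: quantize coordinates, interleave their bits into a key, sort by key, and cut the sorted list into equal pieces.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Interleave the low 16 bits of x and y into a 32-bit Morton key;
// objects that are close along the Z-order curve tend to be close in space.
uint32_t mortonKey(uint16_t x, uint16_t y) {
    uint32_t key = 0;
    for (int b = 0; b < 16; ++b)
        key |= ((uint32_t)((x >> b) & 1u) << (2 * b)) |
               ((uint32_t)((y >> b) & 1u) << (2 * b + 1));
    return key;
}

struct Obj { uint16_t x, y; int part; };

// Step 1: sort objects by their position on the SFC.
// Step 2: cut the 1-D ordering into numParts equal chunks.
void sfcPartition(std::vector<Obj>& objs, int numParts) {
    std::sort(objs.begin(), objs.end(), [](const Obj& a, const Obj& b) {
        return mortonKey(a.x, a.y) < mortonKey(b.x, b.y);
    });
    for (size_t i = 0; i < objs.size(); ++i)
        objs[i].part = (int)(i * (size_t)numParts / objs.size());
}
```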

Octree Partitioning / Refinement-Tree Partitioning
Tree-based algorithms for applications with multiple levels of data, simulation accuracy, etc.
–The tree is usually built from a specific computational scheme
–Tightly coupled with the simulation

Comparisons of RCB, RIB, and SFC
RCB and RIB usually give slightly better partitions than SFC
SFC is usually a little faster
SFC is a little better for incremental partitions
–RIB can be very unstable for incremental partitions

Load-Balancing Libraries
There are many load-balancing libraries downloadable from the web, mostly graph partitioning libraries
–Static: Chaco, Metis, Party, Scotch
–Dynamic: ParMetis, DRAMA, Jostle, Zoltan
Zoltan
–Dynamic load-balancing library with: SFC, RCB, RIB, Octree, ParMetis, Jostle
–Same interface to all load-balancing algorithms

Methods to Avoid Communication
Avoiding load-balancing
–Load-balancing is not needed every time the workload and/or edge connectivity changes
Ghost cells
Predictive load-balancing

Accessing Information on Other Processors
Needs communication between processors
Use "ghost" cells; the data in ghost cells must be kept consistent

Ghost Cells
Copies of cells assigned to other processors; they make needed information available locally
No solution values are computed at the ghost cells
Ghost cell information needs to be updated whenever necessary
Ghost cells need to be recalculated dynamically because of the changing mesh and dynamic load-balancing
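
A minimal ghost-cell update sketch for a 1-D block decomposition with MPI (illustrative names; assumes u[0] and u[n+1] are the ghosts and u[1..n] are the owned cells):

```cpp
#include <mpi.h>
#include <vector>

// Refresh the two ghost cells of a 1-D block-distributed array.
// Boundary ranks exchange with MPI_PROC_NULL, which is a no-op.
void updateGhosts(std::vector<double>& u, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int n = (int)u.size() - 2;                        // number of owned cells
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    // Send my first owned cell left; receive my right ghost from the right.
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                 &u[n + 1], 1, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
    // Send my last owned cell right; receive my left ghost from the left.
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);
}
```

Called after each solve step (and after any migration), this keeps the ghost data consistent with the owning processes.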

Predictive Load-Balancing
Predict the workload and/or edge connectivity and load-balance with that information
–Assumes that the workload and/or edge connectivity can be predicted
–Communication is still needed, but data migration is reduced

Predictive Load-Balancing
In the example: refining then load-balancing migrates 4 objects; predictive load-balancing followed by refinement migrates only 1 object.