Integrating Efficient Partitioning Techniques for Graph Oriented Applications
Mark Bilderback and Prashant Soni
NSF ERC for Computational Field Simulation, Mississippi State University
My dissertation work is a study of load balancing and data locality in the parallelization of the Fast Multipole Algorithm (FMA). The research has been conducted under the direction of my advisor, Professor Susan Flynn Hummel of Polytechnic University. Scientific problems are large, often irregular, and computationally intensive.
Overview
Graph Oriented Applications (e.g., CFD)
Graph Partitioning
Load Balancing via Fractiling
Environment Used for the Experiments
Experimental Results
Conclusions & Future Work
The thesis studies an important class of scientific problems, N-body problems, using the Fast Multipole Algorithm (FMA) of Leslie Greengard, one of the most efficient hierarchical methods. I concentrate on parallelizing the FMA and identify and survey the critical factors that affect its performance on parallel machines. I then introduce an effective technique for mapping the FMA onto parallel architectures using a dynamic scheduling technique, Fractiling, by Susan Flynn Hummel. Next, I present our implementations on the KSR1 at the Cornell Theory Center and summarize our experimental results. I conclude with some of the insights we have gained from this work and directions in which the research can grow.
Load Balancing
Load balancing: evenly divide work among processors. Graph applications suffer from load imbalance because of:
System characteristics: operating system interference, etc.
Problem characteristics: nonuniform distribution of vertices
Algorithmic characteristics: uneven weights of vertices/edges
Fractiling
Dynamic scheduling that exploits the self-similarity of fractals
Accommodates load imbalance caused by:
predictable phenomena (irregular data)
unpredictable phenomena (latency, etc.)
Code simplicity
Fractiling = Factoring + Tiling
Fractiling
Factoring: allocation of work in decreasing-size chunks; goal: minimize load imbalance
Tiling: static partitioning of the space into regions of suitable granularity and shape; goal: minimize inter-tile communication
Factoring
Allocate half of the remaining work in P equal chunks, then half of what remains, and so on.
Example: 4 processors (P = 4), 1024 leaf boxes:
[128, 128, 128, 128, 64, 64, 64, 64, 32, 32, 32, 32, 16, 16, 16, 16, 8, 8, 8, 8, 4, 4, 4, 4, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1]
Idle processors obtain chunks of the next size.
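The factoring rule above can be sketched in a few lines of Python (the helper name is mine, not from the thesis):

```python
def factoring_chunks(n, p):
    """Chunk sizes produced by factoring: repeatedly hand out half of the
    remaining work in p equal chunks until nothing remains."""
    chunks = []
    remaining = n
    while remaining > 0:
        size = max(1, remaining // (2 * p))  # half the remainder, split p ways
        for _ in range(p):
            if remaining == 0:
                break
            c = min(size, remaining)
            chunks.append(c)
            remaining -= c
    return chunks
```

For n = 1024 and p = 4 this reproduces the chunk schedule listed on the slide.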
Tiling
Maximizing data reuse (illustrated by blocked matrix multiplication, C = A × B).
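The data-reuse idea behind tiling is easiest to see in a blocked matrix multiply, where each T × T tile of A and B is reused for an entire tile of C while it is still resident in cache (a generic illustration, not code from the thesis):

```python
def tiled_matmul(A, B, T=2):
    """C = A * B computed tile by tile; each T x T block of A and B is
    reused across a whole block of C, maximizing cache reuse."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T):            # tile of C rows
        for jj in range(0, n, T):        # tile of C columns
            for kk in range(0, n, T):    # tile of the inner dimension
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + T, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```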
Fractiling in N-Body Simulations
Shuffle order; self-similarity. (Figure: shuffle-ordered, self-similar subdivision of the domain.)
Fractiling Algorithm
Initially, the computation space is divided into P tiles.
While work remains in my tile:
    get a global fractile size
    allocate a subtile of that size from my tile
While work remains in some tile:
    allocate a subtile of that size from an unfinished tile
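The two loops above can be modeled by a sequential trace. This is a toy sketch under my own assumptions (fractile sizes follow the factoring rule, idle processors steal from the fullest unfinished tile), not the thesis implementation:

```python
def fractiling_schedule(tiles, p):
    """Sequential trace of the fractiling loops. tiles[i] is the work in
    processor i's tile; returns (processor, source_tile, size) allocations
    in the order a round-robin execution would make them."""
    remaining = list(tiles)
    total = sum(remaining)
    trace = []
    while total > 0:
        size = max(1, total // (2 * p))  # global fractile size (factoring rule)
        for proc in range(p):
            # take from my own tile while it has work, else from the
            # fullest unfinished tile
            src = proc if remaining[proc] > 0 else max(range(p), key=lambda t: remaining[t])
            take = min(size, remaining[src])
            if take == 0:
                continue
            remaining[src] -= take
            total -= take
            trace.append((proc, src, take))
    return trace
```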
Fractiling Execution Example (N-Body Simulation)
(Figure: execution trace showing which processor computed each subtile, with subtile sizes shrinking according to the factoring rule.)
The Graph Partitioning Problem
Divide a graph G = (V, E) into P disjoint subsets, called partitions, such that:
Each partition is nearly equal in size: |V_1| ≈ |V_2| ≈ ... ≈ |V_P| ≈ |V| / P
Every vertex is assigned to one and only one partition: V_1 ∪ V_2 ∪ ... ∪ V_P = V and V_i ∩ V_j = ∅ for i ≠ j
The number of edges connecting vertices in separate partitions (the edge-cut) is minimized, where E_{i,j} = { (v, w) ∈ E | v ∈ V_i, w ∈ V_j } and the edge-cut is Σ_{i ≠ j} |E_{i,j}|
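The two objectives above (balance and edge-cut) are straightforward to evaluate for any candidate partition; a minimal sketch, with helper names of my own choosing:

```python
from collections import Counter

def edge_cut(edges, part):
    """Edge-cut: count the edges whose endpoints lie in different partitions."""
    return sum(1 for v, w in edges if part[v] != part[w])

def is_balanced(part, p, slack=1):
    """True if every partition holds about |V|/p vertices (within `slack`)."""
    sizes = Counter(part.values())
    target = len(part) / p
    return all(abs(sizes.get(k, 0) - target) <= slack for k in range(p))
```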
The SuperMSPARC Architecture
Hardware characteristics:
distributed memory multicomputer designed and constructed at the NSF/ERC
32 processors organized in 8 clusters of four tightly coupled processors, arranged in a mesh topology
each cluster contains four 90 MHz Ross hyperSPARC processors
each cluster shares 288 Mbytes of RAM; total RAM 2.3 Gbytes
connected via a 32-bit SBus
Graph Used
An unstructured, unweighted 3D tetrahedral grid, converted to its dual graph:
45,538 vertices
244,939 edges
Graph Partitioning Algorithm Types
Global algorithms: construction algorithms
Local algorithms: refinement algorithms
Multilevel algorithms: coarsening, partitioning, uncoarsening
Graph Partitioning Packages
Chaco, version 2.0 - Sandia National Laboratories
Jostle, version 2.0 - University of Greenwich
METIS, version 2.0 - University of Minnesota
ParMETIS, version 1.0 - University of Minnesota
Party, version 1.1 - Paderborn University
Chaco
Linear - partition = i div (|V|/P), i.e., assign vertices to partitions in contiguous blocks
Scattered - partition = i mod P
Random - randomly assign
Inertial - sort along the elongated axis and assign in a linear manner
Spectral - sort along the Fiedler vector of the Laplacian matrix L = D − A and assign in a linear manner
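The linear and scattered schemes are one-liners, which is why they serve as baselines; a sketch with hypothetical function names:

```python
def linear(i, n, p):
    """Linear assignment: contiguous runs of about n/p vertices per partition."""
    return i * p // n

def scattered(i, p):
    """Scattered assignment: round-robin, partition = i mod p."""
    return i % p
```

Linear preserves any locality in the vertex numbering; scattered deliberately destroys it but balances perfectly.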
Chaco Results
METIS
Graph Growing - select a vertex randomly and grow a region around it in a breadth-first manner
Greedy Graph Growing - select an initial vertex randomly; add vertices with the least increase in edge-cut
Spectral - sort along the Fiedler vector of the Laplacian matrix L = D − A
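The breadth-first region growing used by the graph-growing scheme can be sketched as follows (a generic illustration of the idea, not METIS source code):

```python
from collections import deque

def grow_region(adj, seed, target):
    """Grow a region of `target` vertices around `seed` in breadth-first
    order; the region becomes one partition, the remainder the other."""
    region = {seed}
    queue = deque([seed])
    while queue and len(region) < target:
        v = queue.popleft()
        for w in adj[v]:
            if w not in region and len(region) < target:
                region.add(w)
                queue.append(w)
    return region
```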
METIS Results
ParMETIS
PARKMETIS
PARGKMETIS
PARGMETIS
ParMETIS Results
Party
Linear - partition = i div (|V|/P), i.e., assign vertices to partitions in contiguous blocks
Scattered - partition = i mod P
Random - randomly assign
Gain - start with all vertices in one partition; fill the other partitions one at a time, selecting the vertices that increase the total edge-cut the least
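The gain scheme can be sketched as a greedy fill; this is an O(n²) toy version under my own assumptions (ties broken by vertex number, perfectly divisible partition sizes), not Party's implementation:

```python
def gain_partition(adj, n, p):
    """Gain-style sketch: all vertices start in the last partition; partitions
    0..p-2 are filled one at a time, always taking the unassigned vertex
    whose move increases the edge-cut the least."""
    part = [p - 1] * n               # the last partition is the source pool
    for k in range(p - 1):
        for _ in range(n // p):
            best, best_cost = None, None
            for v in range(n):
                if part[v] != p - 1:
                    continue
                # change in edge-cut if v moves from the pool to partition k
                cost = sum(1 for w in adj[v] if part[w] == p - 1) \
                     - sum(1 for w in adj[v] if part[w] == k)
                if best_cost is None or cost < best_cost:
                    best, best_cost = v, cost
            part[best] = k
    return part
```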
Party
Farhat - start with all partitions empty; select the vertex with the lowest degree; assign vertices in a breadth-first manner
Coordinate Sorting - sort along the elongated axis; assign using the linear algorithm
Party Results
Jostle - version 2.0 Developed at University of Greenwich by Chris Walshaw
Overall Results
Conclusions and Future Work
Implement a static graph application using Fractiling and several graph partitioning algorithms on unweighted graphs, and compare the results to non-Fractiled applications.
Implement a dynamic graph application, again using Fractiling and several graph partitioning algorithms on unweighted graphs, and compare the results to non-Fractiled applications.
Repeat both studies using weighted graphs.