Integrating Efficient Partitioning Techniques for Graph Oriented Applications
Mark Bilderback and Prashant Soni
NSF ERC for Computational Field Simulation, Mississippi State University
My dissertation work is a study of load balancing and data locality in the parallelization of the Fast Multipole Algorithm (FMA). The research has been conducted under the direction of my advisor, Professor Susan Flynn Hummel of Polytechnic University. Scientific problems are large, often irregular, and computationally intensive.
Overview
Graph Oriented Applications (e.g., CFD)
Graph Partitioning
Load Balancing via Fractiling
Environment Used for the Experiments
Experimental Results
Conclusions & Future Work
The thesis studies an important class of scientific problems, N-body problems, using the Fast Multipole Algorithm (FMA) of Leslie Greengard, one of the most efficient hierarchical methods. I concentrate on parallelizing the FMA and identify and survey the critical factors that affect its performance on parallel machines. I then introduce an effective technique for mapping the FMA onto parallel architectures using a dynamic scheduling technique, Fractiling, by Susan Flynn Hummel. Next, I present our implementations on the KSR1 at the Cornell Theory Center and summarize our experimental results. I conclude with some of the insights we have gained from this work and directions in which the research can grow.
Load Balancing
Load balancing: evenly divide work among processors. Graph applications suffer from load imbalance because of:
System characteristics: operating system interference, etc.
Problem characteristics: nonuniform distribution of vertices
Algorithmic characteristics: uneven weights of vertices/edges
Fractiling
Dynamic scheduling that exploits the self-similarity of fractals
Accommodates load imbalance caused by:
predictable phenomena (irregular data)
unpredictable phenomena (latency, etc.)
Code simplicity
Fractiling = Factoring + Tiling
Fractiling
Factoring: allocation of work in decreasing-size chunks; goal: minimize load imbalance
Tiling: static partitioning of the space into regions of suitable granularity and shape; goal: minimize inter-tile communication
Factoring
Allocate half of the remaining work in P equal chunks, then half of what remains, and so on.
Example: 4 processors (P = 4), 1024 leaf boxes:
[128, 128, 128, 128, 64, 64, 64, 64, 32, 32, 32, 32, 16, 16, 16, 16, 8, 8, 8, 8, 4, 4, 4, 4, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1]
Idle processors obtain chunks of the next size.
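The factoring rule above can be sketched in a few lines of Python (the helper name is mine, not from the thesis):

```python
def factoring_chunks(n, p):
    """Chunk sizes produced by factoring: repeatedly hand out half of the
    remaining work in p equal chunks until nothing remains."""
    chunks = []
    remaining = n
    while remaining > 0:
        size = max(1, remaining // (2 * p))  # half the remainder, split p ways
        for _ in range(p):
            if remaining == 0:
                break
            c = min(size, remaining)
            chunks.append(c)
            remaining -= c
    return chunks
```

For n = 1024 and p = 4 this reproduces the chunk schedule listed on the slide.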
Tiling
Maximizing data reuse (illustrated by blocked matrix multiplication, C = A × B).
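The data-reuse idea behind tiling is easiest to see in a blocked matrix multiply, where each T × T tile of A and B is reused for an entire tile of C while it is still resident in cache (a generic illustration, not code from the thesis):

```python
def tiled_matmul(A, B, T=2):
    """C = A * B computed tile by tile; each T x T block of A and B is
    reused across a whole block of C, maximizing cache reuse."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T):            # tile of C rows
        for jj in range(0, n, T):        # tile of C columns
            for kk in range(0, n, T):    # tile of the inner dimension
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + T, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```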
Fractiling in N-Body Simulations
Shuffle order; self-similarity. (Figure: shuffle-ordered, self-similar subdivision of the domain.)
Fractiling Algorithm
Initially, the computation space is divided into P tiles.
While work remains in my tile:
    get a global fractile size
    allocate a subtile of that size from my tile
While work remains in some tile:
    allocate a subtile of that size from an unfinished tile
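The two loops above can be modeled by a sequential trace. This is a toy sketch under my own assumptions (fractile sizes follow the factoring rule, idle processors steal from the fullest unfinished tile), not the thesis implementation:

```python
def fractiling_schedule(tiles, p):
    """Sequential trace of the fractiling loops. tiles[i] is the work in
    processor i's tile; returns (processor, source_tile, size) allocations
    in the order a round-robin execution would make them."""
    remaining = list(tiles)
    total = sum(remaining)
    trace = []
    while total > 0:
        size = max(1, total // (2 * p))  # global fractile size (factoring rule)
        for proc in range(p):
            # take from my own tile while it has work, else from the
            # fullest unfinished tile
            src = proc if remaining[proc] > 0 else max(range(p), key=lambda t: remaining[t])
            take = min(size, remaining[src])
            if take == 0:
                continue
            remaining[src] -= take
            total -= take
            trace.append((proc, src, take))
    return trace
```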
Fractiling Execution Example (N-Body Simulation)
(Figure: execution trace showing which processor computed each subtile, with subtile sizes shrinking according to the factoring rule.)
The Graph Partitioning Problem
Divide a graph G = (V, E) into P disjoint subsets, called partitions, such that:
Each partition is nearly equal in size: |V_1| ≈ |V_2| ≈ ... ≈ |V_P| ≈ |V| / P
Every vertex is assigned to one and only one partition: V_1 ∪ V_2 ∪ ... ∪ V_P = V and V_i ∩ V_j = ∅ for i ≠ j
The number of edges connecting vertices in separate partitions (the edge-cut) is minimized, where E_{i,j} = { (v, w) ∈ E | v ∈ V_i, w ∈ V_j } and the edge-cut is Σ_{i ≠ j} |E_{i,j}|
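The two objectives above (balance and edge-cut) are straightforward to evaluate for any candidate partition; a minimal sketch, with helper names of my own choosing:

```python
from collections import Counter

def edge_cut(edges, part):
    """Edge-cut: count the edges whose endpoints lie in different partitions."""
    return sum(1 for v, w in edges if part[v] != part[w])

def is_balanced(part, p, slack=1):
    """True if every partition holds about |V|/p vertices (within `slack`)."""
    sizes = Counter(part.values())
    target = len(part) / p
    return all(abs(sizes.get(k, 0) - target) <= slack for k in range(p))
```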
The SuperMSPARC Architecture
Hardware characteristics:
distributed memory multicomputer designed and constructed at the NSF/ERC
32 processors organized in 8 clusters of four tightly coupled processors, arranged in a mesh topology
each cluster contains four 90 MHz Ross hyperSPARC processors
each cluster shares 288 Mbytes of RAM; total RAM 2.3 Gbytes
connected via a 32-bit SBus
Graph Used
An unstructured, unweighted 3D tetrahedral grid, converted to its dual graph:
45,538 vertices
244,939 edges
Graph Partitioning Algorithm Types
Global algorithms: construction algorithms
Local algorithms: refinement algorithms
Multilevel algorithms: coarsening, partitioning, uncoarsening
Graph Partitioning Packages
Chaco, version 2.0 - Sandia National Laboratories
Jostle, version 2.0 - University of Greenwich
METIS, version 2.0 - University of Minnesota
ParMETIS, version 1.0 - University of Minnesota
Party, version 1.1 - Paderborn University
Chaco
Linear - partition = i div (|V|/P), i.e., assign vertices to partitions in contiguous blocks
Scattered - partition = i mod P
Random - randomly assign
Inertial - sort along the elongated axis and assign in a linear manner
Spectral - sort along the Fiedler vector of the Laplacian matrix L = D − A and assign in a linear manner
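The linear and scattered schemes are one-liners, which is why they serve as baselines; a sketch with hypothetical function names:

```python
def linear(i, n, p):
    """Linear assignment: contiguous runs of about n/p vertices per partition."""
    return i * p // n

def scattered(i, p):
    """Scattered assignment: round-robin, partition = i mod p."""
    return i % p
```

Linear preserves any locality in the vertex numbering; scattered deliberately destroys it but balances perfectly.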
Chaco Results
METIS
Graph Growing - select a vertex randomly and grow a region around it in a breadth-first manner
Greedy Graph Growing - select an initial vertex randomly; add vertices with the least increase in edge-cut
Spectral - sort along the Fiedler vector of the Laplacian matrix L = D − A
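The breadth-first region growing used by the graph-growing scheme can be sketched as follows (a generic illustration of the idea, not METIS source code):

```python
from collections import deque

def grow_region(adj, seed, target):
    """Grow a region of `target` vertices around `seed` in breadth-first
    order; the region becomes one partition, the remainder the other."""
    region = {seed}
    queue = deque([seed])
    while queue and len(region) < target:
        v = queue.popleft()
        for w in adj[v]:
            if w not in region and len(region) < target:
                region.add(w)
                queue.append(w)
    return region
```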
METIS Results
ParMETIS
PARKMETIS
PARGKMETIS
PARGMETIS
ParMETIS Results
Party
Linear - partition = i div (|V|/P), i.e., assign vertices to partitions in contiguous blocks
Scattered - partition = i mod P
Random - randomly assign
Gain - start with all vertices in one partition; fill the other partitions one at a time, selecting the vertices that increase the total edge-cut the least
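The gain scheme can be sketched as a greedy fill; this is an O(n²) toy version under my own assumptions (ties broken by vertex number, perfectly divisible partition sizes), not Party's implementation:

```python
def gain_partition(adj, n, p):
    """Gain-style sketch: all vertices start in the last partition; partitions
    0..p-2 are filled one at a time, always taking the unassigned vertex
    whose move increases the edge-cut the least."""
    part = [p - 1] * n               # the last partition is the source pool
    for k in range(p - 1):
        for _ in range(n // p):
            best, best_cost = None, None
            for v in range(n):
                if part[v] != p - 1:
                    continue
                # change in edge-cut if v moves from the pool to partition k
                cost = sum(1 for w in adj[v] if part[w] == p - 1) \
                     - sum(1 for w in adj[v] if part[w] == k)
                if best_cost is None or cost < best_cost:
                    best, best_cost = v, cost
            part[best] = k
    return part
```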
Party
Farhat - start with all partitions empty; select the vertex with the lowest degree; assign vertices in a breadth-first manner
Coordinate Sorting - sort along the elongated axis; assign using the linear algorithm
Party Results
Jostle - version 2.0 Developed at University of Greenwich by Chris Walshaw
Overall Results
Conclusions and Future Work
Implement a static graph application using Fractiling and several graph partitioning algorithms on unweighted graphs, and compare the results to non-Fractiled applications.
Implement a dynamic graph application, again using Fractiling and several graph partitioning algorithms on unweighted graphs, and compare the results to non-Fractiled applications.
Repeat both studies using weighted graphs.