Domain decomposition in parallel computing
Ashok Srinivasan (www.cs.fsu.edu/~asriniva), Florida State University
COT 5410 – Spring 2004

Outline
- Background
- Geometric partitioning
- Graph partitioning
  - Static
  - Dynamic
- Important points

Background
Tasks in a parallel computation need access to certain data, and the same datum may be needed by multiple tasks.
- Example: in matrix-vector multiplication c = Ab, the entry b_2 is needed to compute every c_i, 1 ≤ i ≤ n
- If a process does not "own" a datum needed by its task, it has to get it from a process that has it; this communication is expensive
Aims of domain decomposition:
- Distribute the data so that the required communication is minimized
- Ensure that the computational loads on processes are balanced

Domain decomposition example
Finite difference computation: the new value of a node depends on the old values of its neighbors. We want to divide the nodes among the processes so that:
- Communication is minimized (a measure of partition quality)
- The computational load is evenly balanced

Geometric partitioning
Partition a set of points, using only coordinate information.
- The load is balanced explicitly; the heuristic tries to ensure that communication costs are low
- Algorithms are typically fast, but the partitions are not of high quality
Examples:
- Orthogonal recursive bisection
- Inertial
- Space filling curves

Orthogonal recursive bisection
Recursively bisect orthogonal to the longest dimension.
- Assumes communication is proportional to the surface area of the domain and aligned with the coordinate axes
- Recursive bisection: divide into two pieces, keeping the load balanced; apply recursively until the desired number of partitions is obtained
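The recursive bisection above can be sketched in a few lines; this is an illustrative Python sketch (the function names are mine, not from the slides), splitting each point set at the median of its longest bounding-box dimension.

```python
import numpy as np

def orb(points, nparts):
    """Orthogonal recursive bisection: return a partition label per point.

    points: (n, d) array of coordinates; nparts: desired number of parts
    (assumed a power of two here, for simplicity).
    """
    labels = np.zeros(len(points), dtype=int)

    def bisect(idx, first, count):
        if count == 1:
            labels[idx] = first
            return
        pts = points[idx]
        # Cut orthogonal to the longest dimension of the bounding box.
        dim = np.argmax(pts.max(axis=0) - pts.min(axis=0))
        order = idx[np.argsort(pts[:, dim])]
        half = len(order) // 2          # equal halves keep the load balanced
        bisect(order[:half], first, count // 2)
        bisect(order[half:], first + count // 2, count - count // 2)

    bisect(np.arange(len(points)), 0, nparts)
    return labels
```

For example, `orb(np.random.rand(100, 2), 4)` yields four parts of 25 points each.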

Inertial
ORB may not be effective if cuts along the x, y, or z directions are not good ones.
- Inertial: recursively bisect orthogonal to the inertial axis (the principal axis of the point set)
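One inertial bisection step can be sketched as follows: the inertial axis is taken as the principal eigenvector of the covariance matrix of the points, and the cut is at the median projection. This is an illustrative sketch; the function name is mine.

```python
import numpy as np

def inertial_bisect(points):
    """Split points by their projection onto the principal (inertial) axis."""
    centered = points - points.mean(axis=0)
    # Principal axis = eigenvector of the covariance matrix with the
    # largest eigenvalue; we cut orthogonal to it.
    cov = centered.T @ centered
    eigvals, eigvecs = np.linalg.eigh(cov)
    axis = eigvecs[:, -1]              # eigh returns eigenvalues ascending
    proj = centered @ axis
    return proj <= np.median(proj)     # boolean mask: True = first half
```

Recursing on each half, as with ORB, gives the full inertial partitioner.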

Space filling curves
A space filling curve is a continuous curve that fills the space.
- Order the points based on their relative position along the curve
- Choose a curve that preserves proximity: points that are close in space should be close in the ordering too
Example: the Hilbert curve

Hilbert curve
The Hilbert curve is the limit of a sequence of curves H_1, H_2, ..., H_i, H_{i+1}, ...: Hilbert curve = lim_{n→∞} H_n
(Figure: the first iterates H_1 and H_2, and the refinement from H_i to H_{i+1}.)

Domain decomposition with a space filling curve
- Order the points based on their position along the curve
- Divide the ordering into P parts, where P is the number of processes
Space filling curves can be used in adaptive computations too, and they extend to higher dimensions.
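One standard way to compute positions along the Hilbert curve is the classic rotate-and-flip construction; this sketch (function name mine) assumes the points have been snapped to a 2^order × 2^order grid.

```python
def hilbert_index(order, x, y):
    """Position of grid point (x, y) along the Hilbert curve on a
    2**order x 2**order grid (rotate-and-flip construction)."""
    n = 2 ** order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so each sub-curve is traversed consistently.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

To decompose a domain, sort the points by `hilbert_index` and give each of the P processes a contiguous block of the ordering.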

Graph partitioning
Model the computation as graph partitioning, with graph G = (V, E).
- Each task is represented by a vertex; a vertex weight can represent the computational effort
- An edge exists between two tasks if one needs data owned by the other; weights can be associated with edges too
Goal: partition the vertices into P parts such that each part has equal vertex weight, while minimizing the weight of the edges cut. This problem is NP-hard.
- Edge cut metric: judge the quality of a partitioning by the number of edges cut
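The edge cut metric and the balance constraint above are simple to state in code; this is a minimal sketch with names of my own choosing.

```python
def edge_cut(edges, part):
    """Edge cut metric: number of edges whose endpoints lie in
    different parts. part maps vertex -> part id."""
    return sum(1 for u, v in edges if part[u] != part[v])

def balanced(part, nparts):
    """True if (unweighted) part sizes differ by at most one vertex."""
    sizes = [list(part.values()).count(p) for p in range(nparts)]
    return max(sizes) - min(sizes) <= 1
```

A partitioner searches over assignments `part` that keep `balanced` true while driving `edge_cut` down.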

Static graph partitioning
- Combinatorial
  - Levelized nested dissection
  - Kernighan-Lin / Fiduccia-Mattheyses
- Spectral partitioning
- Multilevel methods

Combinatorial partitioning
Uses only connectivity information.
Examples:
- Levelized nested dissection
- Kernighan-Lin / Fiduccia-Mattheyses

Levelized nested dissection (LND)
The idea is similar to the geometric methods, but coordinate information is not available.
- Instead of projecting vertices along the longest axis, order them by distance from a vertex that may be at one extreme of the longest dimension of the graph: a pseudo-peripheral vertex
- To find one, perform a breadth-first search starting from an arbitrary vertex; the vertex encountered last may be a good approximation to a peripheral vertex

LND example: finding a pseudo-peripheral vertex
(Figure: BFS levels from the initial vertex; the last vertex reached is the pseudo-peripheral vertex.)

LND example: partitioning
(Figure: the partition obtained by splitting the BFS ordering from the pseudo-peripheral vertex.) Recursively bisect the subgraphs.
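The two LND steps above can be sketched directly: one BFS to find the pseudo-peripheral vertex, a second BFS from it to order the vertices, then split the ordering in half. An illustrative sketch; the function names are mine.

```python
from collections import deque

def pseudo_peripheral(adj, start):
    """Approximate a peripheral vertex: BFS from start, return the
    last vertex dequeued. adj: dict vertex -> list of neighbors."""
    seen, queue, last = {start}, deque([start]), start
    while queue:
        last = queue.popleft()
        for v in adj[last]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return last

def lnd_bisect(adj, start):
    """Order vertices by BFS from a pseudo-peripheral vertex and
    split the ordering in half."""
    root = pseudo_peripheral(adj, start)
    order, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    half = len(order) // 2
    return set(order[:half]), set(order[half:])
```

On a path graph this recovers the natural split at the middle edge.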

Kernighan-Lin / Fiduccia-Mattheyses
Refines an existing partition.
Kernighan-Lin:
- Consider pairs of vertices from different partitions
- Choose a pair whose swap yields the best improvement in partition quality (the best available improvement may actually be a worsening, which lets the method escape local minima)
- Perform several passes, and keep the best partition encountered
Fiduccia-Mattheyses: similar, but more efficient
Boundary Kernighan-Lin: consider only boundary vertices to swap... and many other variants

Kernighan-Lin example
(Figure: an existing partition with edge cut 4; swapping one marked pair of vertices yields a better partition with edge cut 3.)
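The gain of a Kernighan-Lin swap has a standard closed form: gain(a, b) = D_a + D_b - 2·c(a, b), where D_v is the external minus internal edge count of v and c(a, b) is 1 if {a, b} is an edge. A minimal sketch of one greedy step (function names mine; a full KL pass would repeat this with tentative swaps):

```python
def swap_gain(adj, part, a, b):
    """Reduction in edge cut from swapping a and b across the cut:
    D_a + D_b - 2*c(a, b)."""
    def D(v):
        ext = sum(1 for u in adj[v] if part[u] != part[v])
        return ext - (len(adj[v]) - ext)   # external minus internal edges
    return D(a) + D(b) - 2 * (1 if b in adj[a] else 0)

def best_swap(adj, part):
    """Cross-partition pair with the largest swap gain: (gain, a, b)."""
    A = [v for v in adj if part[v] == 0]
    B = [v for v in adj if part[v] == 1]
    return max((swap_gain(adj, part, a, b), a, b) for a in A for b in B)
```

For two triangles joined by an edge but partitioned badly, the best swap restores the natural split.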

Spectral method
Based on the observation that a Fiedler vector of a graph contains connectivity information.
Laplacian of a graph, L:
- l_ii = d_i (degree of vertex i)
- l_ij = -1 if edge {i,j} exists, otherwise 0
The smallest eigenvalue of L is 0, with the all-ones eigenvector. All other eigenvalues are positive for a connected graph.
Fiedler vector: an eigenvector corresponding to the second smallest eigenvalue.

Fiedler vector
Consider a partitioning of V into A and B.
- Let y_i = 1 if v_i ∈ A, and y_i = -1 if v_i ∈ B
- For load balance: Σ_i y_i = 0
- Also, Σ_{{i,j} ∈ E} (y_i - y_j)^2 = 4 × (number of edges across partitions)
- Also, y^T L y = Σ_i d_i y_i^2 - 2 Σ_{{i,j} ∈ E} y_i y_j = Σ_{{i,j} ∈ E} (y_i - y_j)^2

Optimization problem
The optimal partition is obtained by solving:
- Minimize y^T L y
- Constraints: y_i ∈ {-1, 1}, Σ_i y_i = 0
This is NP-hard.
Relaxed problem:
- Minimize y^T L y
- Constraints: Σ_i y_i = 0, plus a constraint on a norm of y, for example ||y||_2 = n^(1/2)
Note that (1, 1, ..., 1)^T is an eigenvector with eigenvalue 0. For a connected graph, all other eigenvalues are positive and their eigenvectors are orthogonal to this one, which implies Σ_i y_i = 0 for them. The objective function is therefore minimized by a Fiedler vector.

Spectral algorithm
- Find a Fiedler vector of the Laplacian of the graph (note: the Fiedler value, the second smallest eigenvalue, yields a lower bound on the communication cost when the load is balanced)
- Bisect the graph using the Fiedler vector: all vertices with components greater than the median go into one part, the rest into the other
- Apply recursively to each partition
Note: finding the Fiedler vector of a large graph can be time consuming.
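A dense-matrix sketch of one spectral bisection step (illustrative only; production codes use sparse eigensolvers such as Lanczos, and the function name is mine):

```python
import numpy as np

def spectral_bisect(adj_matrix):
    """Bisect a graph at the median of its Fiedler vector.
    adj_matrix: dense symmetric 0/1 numpy array."""
    degrees = adj_matrix.sum(axis=1)
    L = np.diag(degrees) - adj_matrix        # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                  # second smallest eigenvalue
    return fiedler > np.median(fiedler)      # boolean mask: True = one part
```

For two triangles joined by a single edge, the Fiedler vector separates the two triangles, cutting only the joining edge.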

Multilevel methods
Idea: it takes time to partition a large graph, so partition a small graph instead!
Three phases:
- Graph coarsening: combine vertices to create a smaller graph (for example, by finding a suitable matching); apply this recursively until a suitably small graph is obtained
- Partitioning: use spectral or another partitioning algorithm to partition the small graph
- Multilevel refinement: uncoarsen the graph to get a partitioning of the original graph, performing some graph refinement at each level
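The matching-based coarsening phase can be sketched as follows: greedily match each vertex with an unmatched neighbor and merge matched pairs into single coarse vertices. This is a minimal sketch (function name mine; real partitioners use weighted matchings and keep vertex/edge weights).

```python
def coarsen_by_matching(adj):
    """One coarsening step. adj: dict vertex -> set of neighbors.
    Returns (coarse_adj, coarse_id) where coarse_id maps fine -> coarse."""
    matched = {}
    for u in adj:                        # greedy maximal matching
        if u in matched:
            continue
        partner = next((v for v in adj[u] if v not in matched), None)
        if partner is not None:
            matched[u] = partner
            matched[partner] = u
    coarse_id, next_id = {}, 0           # one coarse id per pair/singleton
    for u in adj:
        if u not in coarse_id:
            coarse_id[u] = next_id
            if u in matched:
                coarse_id[matched[u]] = next_id
            next_id += 1
    coarse_adj = {i: set() for i in range(next_id)}
    for u in adj:                        # project edges onto coarse graph
        for v in adj[u]:
            cu, cv = coarse_id[u], coarse_id[v]
            if cu != cv:
                coarse_adj[cu].add(cv)
                coarse_adj[cv].add(cu)
    return coarse_adj, coarse_id
```

A path on 4 vertices coarsens to a path on 2 coarse vertices.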

Multilevel example (without refinement)
(Figures: a sequence of slides showing successive coarsening steps, partitioning of the coarsest graph, and uncoarsening back to a partition of the original graph.)

Dynamic partitioning
We have an initial partitioning, and then the graph changes.
- Determine a good new partition, fast
- Also minimize the number of vertices that need to be moved
Examples:
- PLUM
- JOSTLE
- Diffusion

PLUM
Partition based on the initial mesh; only the vertex and edge weights change.
Map partitions to processors:
- Use more partitions than processors, which ensures finer granularity
- Compute a similarity matrix based on the data already on each process; it measures the savings in data redistribution cost for each (process, partition) pair
- Choose an assignment of partitions to processors:
  - Example: maximum weight matching, duplicating each processor (# of partitions)/P times
  - Alternative: a greedy approximation algorithm that assigns in order of maximum similarity value

JOSTLE
Uses Hu and Blake's scheme for load balancing.
- Solve Lx = b using conjugate gradients, where L is the Laplacian of the processor graph and b_i = (weight on process P_i) - (average weight)
- Move max(x_i - x_j, 0) weight from P_i to P_j
This leads to a balanced load:
- It is equivalent to P_i sending x_i load to each neighbor, and each neighbor P_j sending x_j load to P_i
- Net loss in load for P_i = d_i x_i - Σ_{neighbors j} x_j = L_(i) x = b_i, where L_(i) is row i of L and d_i is the degree of i
- New load for P_i = (weight on P_i) - b_i = average weight
Using the max(x_i - x_j, 0) rule, it also leads to the minimum L_2 norm of load moved.
Vertices to move are selected based on relative gain.
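Hu and Blake's scheme above can be sketched with a dense solve; here `np.linalg.pinv` stands in for conjugate gradients (which would be used in practice, since L is singular), and the function name is mine.

```python
import numpy as np

def hu_blake_flow(adj_matrix, load):
    """Solve Lx = b with b_i = load_i - average; the flow on edge (i, j)
    is x_i - x_j. adj_matrix: symmetric 0/1 processor graph."""
    degrees = adj_matrix.sum(axis=1)
    L = np.diag(degrees) - adj_matrix
    b = load - load.mean()
    # L is singular (all-ones null space); the pseudo-inverse picks the
    # minimum-norm solution, matching the minimum-L2-flow property.
    x = np.linalg.pinv(L) @ b
    flow = np.subtract.outer(x, x) * adj_matrix   # flow[i, j] = x_i - x_j
    return flow
```

After moving the positive flows, each processor holds exactly the average load: new load_i = load_i - Σ_j flow[i, j].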

Diffusion
Involves only communication with neighbors.
A simple scheme:
- Processor P_i repeatedly sends α·w_i weight to each neighbor, where w_i is the weight on P_i
- In matrix form: w^k = (I - αL) w^(k-1), where w^k is the weight vector at iteration k
- Simple criteria exist for choosing α to ensure convergence, for example α = 0.5 / (max_i d_i)
More sophisticated schemes exist.
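A minimal dense-matrix sketch of the diffusion iteration above, using the convergence-safe choice α = 0.5 / (max degree); the function name and iteration count are my own choices.

```python
import numpy as np

def diffuse(adj_matrix, load, iters=200):
    """Diffusive load balancing: w_k = (I - alpha*L) w_{k-1}."""
    degrees = adj_matrix.sum(axis=1)
    L = np.diag(degrees) - adj_matrix
    alpha = 0.5 / degrees.max()          # guarantees convergence
    w = load.astype(float).copy()
    for _ in range(iters):
        w = w - alpha * (L @ w)          # each P_i trades alpha*w with neighbors
    return w
```

On a connected processor graph the weights converge to the average load.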

Important points
- Goals of domain decomposition: balance the load, minimize communication
- Space filling curves
- Graph partitioning model
  - Spectral method: relax the NP-hard integer optimization to a real-valued one, then discretize to get an approximate integer solution
  - Multilevel methods: three phases
- Dynamic partitioning has additional requirements: use the old solution to find the new one fast, and minimize the number of vertices moved