Application Paradigms: Unstructured Grids CS433 Spring 2001 Laxmikant Kale

2 Unstructured Grids
Typically arise in the finite element method:
–E.g. space is tiled with triangles of varying size and shape
–In 3D: may be tetrahedra, or hexahedra
–Allows one to adjust the resolution in different regions
The base data structure is a graph
–Often represented as a bipartite graph: e.g. triangles (elements) and nodes

3 Unstructured grid computations
Typically:
–Attributes (stresses, strains, pressure, temperature, velocities) are attached to nodes and elements
–Programs loop over elements and loop over nodes, separately
Each time you “visit” an element:
–Need to access, and possibly modify, all nodes connected to it
Each time you visit a node:
–Typically, access and modify only node attributes
–Rarely: access/modify attributes of elements connected to it
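
The bipartite element/node structure of slide 2 and the two kinds of loops above translate almost directly into code. A minimal C++ sketch (the struct and attribute names are illustrative, not taken from the course materials):

```cpp
#include <vector>

// One mesh node: coordinates plus node-attached attributes.
struct Node {
  double x, y;
  double temperature;   // example node attribute
  double force;         // accumulated from surrounding elements
};

// One triangular element: indices of its three nodes plus element attributes.
struct Element {
  int node[3];
  double stress;        // example element attribute
};

struct Mesh {
  std::vector<Node> nodes;
  std::vector<Element> elements;
};

// Loop over elements: each visit reads, and possibly updates, all connected nodes.
void elementLoop(Mesh &m) {
  for (Element &e : m.elements) {
    double avgT = 0.0;
    for (int k = 0; k < 3; ++k) avgT += m.nodes[e.node[k]].temperature;
    avgT /= 3.0;
    e.stress = 0.5 * avgT;                        // update element attribute
    for (int k = 0; k < 3; ++k)
      m.nodes[e.node[k]].force += e.stress / 3.0; // scatter back to connected nodes
  }
}

// Loop over nodes: typically touches only node attributes.
void nodeLoop(Mesh &m) {
  for (Node &n : m.nodes)
    n.temperature += 0.01 * n.force;
}
```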

4 Unstructured grids: parallelization issues
Two concerns:
–The unstructured grid graph must be partitioned across processors (more generally, across vprocs: virtual processors)
–Boundary values must be shared
What to partition and what to duplicate (at the boundaries)?
–Partition elements (so each element belongs to exactly one vproc)
–Share nodes at the boundary: each such node potentially has several ghost copies
–Why is this better than partitioning nodes and sharing elements?
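
One way to realize “partition elements, share nodes” is to give each vproc a local copy of every node its elements touch, and to record which of those copies also exist on other vprocs. The sketch below assumes triangle elements and made-up names (Chunk, buildChunks); it is an illustration, not the actual API of any framework:

```cpp
#include <vector>
#include <map>
#include <set>

struct Element { int node[3]; };        // global node indices

// Data owned by one vproc: its elements (each element has exactly one owner)
// plus local copies of every node those elements touch.
struct Chunk {
  std::vector<Element> elems;           // connectivity, in LOCAL node indices
  std::vector<int>     globalNode;      // local node index -> global node index
  std::set<int>        sharedLocal;     // local indices of nodes that also live on other chunks
};

std::vector<Chunk> buildChunks(const std::vector<Element> &elems,
                               const std::vector<int> &elemToChunk, int numChunks) {
  std::vector<Chunk> chunks(numChunks);
  std::vector<std::map<int,int>> g2l(numChunks);   // per chunk: global -> local node index
  std::map<int, std::set<int>> nodeHolders;        // global node -> chunks holding a copy

  for (size_t i = 0; i < elems.size(); ++i) {
    int c = elemToChunk[i];
    Element local;
    for (int k = 0; k < 3; ++k) {
      int g = elems[i].node[k];
      auto it = g2l[c].find(g);
      if (it == g2l[c].end()) {                    // first time this chunk sees the node: copy it
        int l = (int)chunks[c].globalNode.size();
        g2l[c][g] = l;
        chunks[c].globalNode.push_back(g);
        it = g2l[c].find(g);
      }
      local.node[k] = it->second;
      nodeHolders[g].insert(c);
    }
    chunks[c].elems.push_back(local);
  }
  // Any node held by more than one chunk is a shared node with ghost copies.
  for (int c = 0; c < numChunks; ++c)
    for (size_t l = 0; l < chunks[c].globalNode.size(); ++l)
      if (nodeHolders[chunks[c].globalNode[l]].size() > 1)
        chunks[c].sharedLocal.insert((int)l);
  return chunks;
}
```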

5 Partitioning unstructured grids
Not as simple as structured grids
–“By rows”, “by columns”, “rectangular”, ... don’t work
Geometric?
–Applicable only if each node has coordinates
–Even when applicable, may not lead to good performance
What performance metrics to use?
–Load balance: the number of elements in each partition
–Communication:
  Total number of shared nodes
  Maximum number of shared nodes for any one partition
  Maximum number of “neighbor partitions” for any partition (why? per-message cost)
Geometric methods: difficult to optimize both
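
Given an element-to-partition map, all of the metrics above can be computed in one pass over the mesh. A sketch (names are illustrative), reporting the load balance, the total number of shared nodes, the maximum shared-node count per partition, and the maximum neighbor count per partition:

```cpp
#include <vector>
#include <set>
#include <map>
#include <algorithm>
#include <cstdio>

struct Element { int node[3]; };

void partitionMetrics(const std::vector<Element> &elems,
                      const std::vector<int> &elemToPart, int P) {
  std::vector<int> load(P, 0);                       // elements per partition
  std::map<int, std::set<int>> nodeParts;            // global node -> partitions touching it
  for (size_t i = 0; i < elems.size(); ++i) {
    load[elemToPart[i]]++;
    for (int k = 0; k < 3; ++k) nodeParts[elems[i].node[k]].insert(elemToPart[i]);
  }

  int totalShared = 0;
  std::vector<int> sharedPer(P, 0);
  std::vector<std::set<int>> nbrs(P);
  for (const auto &np : nodeParts) {
    const std::set<int> &ps = np.second;
    if (ps.size() < 2) continue;                     // interior node: not shared
    totalShared++;
    for (int p : ps) {
      sharedPer[p]++;
      for (int q : ps) if (q != p) nbrs[p].insert(q);
    }
  }

  size_t maxNbrs = 0;
  for (int p = 0; p < P; ++p) maxNbrs = std::max(maxNbrs, nbrs[p].size());

  std::printf("max load          : %d elements\n", *std::max_element(load.begin(), load.end()));
  std::printf("total shared nodes: %d\n", totalShared);
  std::printf("max shared / part : %d\n", *std::max_element(sharedPer.begin(), sharedPer.end()));
  std::printf("max neighbors     : %zu\n", maxNbrs);
}
```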

6 MP issues: Charm++ help:
–Today (Wed, 2/21), 2 pm to 5:30 pm
–2504, 2506, 2508 DCL (Parallel Programming Laboratory)
My office hours for this week:
–Thursday, 10:00 A.M. to 12:00 noon

7 Grid partitioning
When communication costs are relatively low
–Either because the data set is large or the computation per element is large
–Geometric methods can be used
Orthogonal Recursive Bisection (ORB)
–Basic idea: recursively divide sets into two; keep shapes squarish as long as possible
–For each set:
  Find the bounding box (Xmax, Xmin, Ymax, Ymin, ...)
  Find the longer dimension (X or Y or ...)
  Find a cut along the longer dimension that divides the set equally (it doesn’t have to be at the midpoint of the section)
  Partition the elements into the two sets based on the cut
  Repeat for each set
–Variation: non-power-of-two processors
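
A recursive 2-D ORB over element centroids might look like the sketch below: it cuts at the element-count median of the longer bounding-box dimension, and splits counts proportionally when the number of processors is not a power of two. Names and details are illustrative assumptions:

```cpp
#include <vector>
#include <algorithm>

struct Centroid { double x, y; int elem; };   // one entry per mesh element

// Assign the elements in c[lo..hi) to partitions [firstPart, firstPart + nParts).
void orb(std::vector<Centroid> &c, int lo, int hi, int firstPart, int nParts,
         std::vector<int> &elemToPart) {
  if (lo >= hi) return;
  if (nParts == 1) {
    for (int i = lo; i < hi; ++i) elemToPart[c[i].elem] = firstPart;
    return;
  }
  // Find the bounding box of this subset.
  double xmin = c[lo].x, xmax = c[lo].x, ymin = c[lo].y, ymax = c[lo].y;
  for (int i = lo; i < hi; ++i) {
    xmin = std::min(xmin, c[i].x); xmax = std::max(xmax, c[i].x);
    ymin = std::min(ymin, c[i].y); ymax = std::max(ymax, c[i].y);
  }
  // Cut along the longer dimension.
  bool cutX = (xmax - xmin) >= (ymax - ymin);

  // Split the element COUNT in proportion to the processor counts
  // (this handles the non-power-of-two variation on the slide).
  int leftParts = nParts / 2;
  int mid = lo + (int)((long long)(hi - lo) * leftParts / nParts);

  // nth_element places the count-based cut at 'mid' without a full sort;
  // note the cut is generally NOT at the midpoint of the bounding box.
  std::nth_element(c.begin() + lo, c.begin() + mid, c.begin() + hi,
                   [cutX](const Centroid &a, const Centroid &b) {
                     return cutX ? a.x < b.x : a.y < b.y;
                   });
  orb(c, lo,  mid, firstPart,             leftParts,          elemToPart);
  orb(c, mid, hi,  firstPart + leftParts, nParts - leftParts, elemToPart);
}
```

Called as orb(centroids, 0, (int)centroids.size(), 0, P, elemToPart), this assigns every element to one of P partitions of nearly equal size.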

8 Grid partitioning: quad/oct trees
Another geometric technique: at each step, divide the set into 2^D subsets, where D is the number of physical dimensions
–In 2-D: 4 quadrants
–The dividing lines go through the geometric midpoint of the box
–The bounding box is NOT recalculated each time in the recursion
Comparison with ORB
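
By contrast with ORB, a quadtree split always cuts at the geometric midpoint of the current (fixed) box and never recomputes the bounding box. A 2-D sketch with illustrative names:

```cpp
#include <vector>

struct Pt { double x, y; int elem; };   // element centroid

// Recursively split the FIXED box [x0,x1] x [y0,y1] at its geometric midpoint into
// 4 quadrants until a leaf holds at most maxLeaf elements; each leaf becomes a partition.
void quadtree(const std::vector<Pt> &pts, double x0, double y0, double x1, double y1,
              int maxLeaf, std::vector<int> &elemToPart, int &nextPart) {
  // Leaf (or degenerate box): assign all elements here to one new partition.
  if ((int)pts.size() <= maxLeaf || (x1 - x0) < 1e-12) {
    int p = nextPart++;
    for (const Pt &q : pts) elemToPart[q.elem] = p;
    return;
  }
  double xm = 0.5 * (x0 + x1), ym = 0.5 * (y0 + y1);   // midpoint of the box, NOT a median
  std::vector<Pt> quad[4];
  for (const Pt &q : pts) {
    int i = (q.x >= xm ? 1 : 0) + (q.y >= ym ? 2 : 0);
    quad[i].push_back(q);
  }
  quadtree(quad[0], x0, y0, xm, ym, maxLeaf, elemToPart, nextPart);   // lower-left
  quadtree(quad[1], xm, y0, x1, ym, maxLeaf, elemToPart, nextPart);   // lower-right
  quadtree(quad[2], x0, ym, xm, y1, maxLeaf, elemToPart, nextPart);   // upper-left
  quadtree(quad[3], xm, ym, x1, y1, maxLeaf, elemToPart, nextPart);   // upper-right
}
```

Compared with ORB, the cuts are cheap and nested, but the leaf element counts (and hence the load balance) can vary much more widely.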

9 Grid partitioning: graph partitioners
Chaco and METIS are well-known programs
They optimize both load imbalance and communication overhead
–But often ignore per-message cost, or the maximum per-partition costs
Earlier algorithm: Kernighan–Lin (KL)
–METIS first coarsens the graph, applies KL to it, and then refines the graph
–It does this not just once, but as a k-level coarsening–refining scheme
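
For reference, a call to a graph partitioner typically looks like the following. This assumes the METIS 5.x C API (METIS_PartGraphKway) with the graph already in CSR form; treat it as a sketch of the calling convention rather than anything from the lecture:

```cpp
// g++ -o part part.cpp -lmetis     (assumes METIS 5.x is installed)
#include <metis.h>
#include <cstdio>

int main() {
  // A tiny example graph in CSR form: 4 vertices in a ring, each adjacent to 2 others.
  idx_t nvtxs = 4, ncon = 1, nparts = 2;
  idx_t xadj[]   = {0, 2, 4, 6, 8};
  idx_t adjncy[] = {1, 3, 0, 2, 1, 3, 0, 2};
  idx_t part[4], objval;

  int rc = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                               NULL, NULL, NULL,           // no vertex/edge weights
                               &nparts, NULL, NULL, NULL,  // default targets and options
                               &objval, part);             // objval = resulting edge cut
  if (rc != METIS_OK) return 1;
  std::printf("edge cut = %ld\n", (long)objval);
  for (int i = 0; i < 4; ++i)
    std::printf("vertex %d -> partition %ld\n", i, (long)part[i]);
  return 0;
}
```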

10 Crack Propagation
Explicit FEM code
Zero-volume cohesive elements inserted near the crack
As the crack propagates, more cohesive elements are added near the crack, which leads to severe load imbalance
Framework handles:
–Partitioning elements into chunks
–Communication between chunks
–Load balancing
[Figure: decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right); the middle area contains cohesive elements. Pictures: S. Breitenfeld and P. Geubelle]

11 Crack Propagation
[Figure: decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right); the middle area contains cohesive elements. Both decompositions were obtained using METIS. Pictures: S. Breitenfeld and P. Geubelle]

12 Unstructured grid: managing communication
Suppose triangles A, B, and C are on different processors
–Node 1 is shared between all 3 processors
–It must have a copy on all 3 processors
When values need to be added up:
–Option 1 (star): let A (say) be the “owner” of node 1; B and C send their copies of node 1 to A; A combines them (usually just adding them up); A sends the updated value back to B and C
–Option 2 (symmetric): each processor sends its copy of node 1 to both of the others
Which one is better?
[Figure: triangles A, B, and C sharing node 1]
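
The star scheme (Option 1) can be sketched without any message-passing machinery: for each shared node, one holder plays the owner, gathers the partial values, adds them, and returns the total. The names below are hypothetical; in a real Charm++ or MPI program the two inner loops would become send/receive phases:

```cpp
#include <vector>
#include <map>
#include <utility>

// vals[{vproc, globalNode}] = that vproc's partial contribution to the node.
using NodeValues = std::map<std::pair<int,int>, double>;

// For each shared node, one holder (say the first in the list) acts as the owner:
// it gathers everyone's partial values, adds them up, and returns the total to all
// holders.  Here the "sends" are just in-memory reads and writes.
void starCombine(NodeValues &vals,
                 const std::map<int, std::vector<int>> &holders) {
  for (const auto &h : holders) {
    int node = h.first;
    const std::vector<int> &vprocs = h.second;     // vprocs[0] plays the owner's role
    double sum = 0.0;
    for (int v : vprocs) sum += vals[{v, node}];   // others "send" to the owner; owner adds
    for (int v : vprocs) vals[{v, node}] = sum;    // owner "sends" the combined value back
  }
}
```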

13 Unstructured grid: managing communication
In either scheme:
–Each vproc maintains a list of neighboring vprocs
–For each neighbor: maintains a list of shared nodes
Each node has a local index (“my 5th node”)
The same list works in both directions:
–Send
–Receive
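
Those per-neighbor lists translate directly into the send and receive code: the same array of local node indices orders both the outgoing buffer and the incoming one, so no global node numbers need to travel with the data. A sketch with assumed names:

```cpp
#include <vector>

// What one vproc keeps for each neighboring vproc.
struct CommList {
  int neighborVproc;
  std::vector<int> sharedLocal;   // local indices of nodes shared with that neighbor
};

// Pack my current values for the shared nodes, in list order, for sending.
std::vector<double> packShared(const CommList &cl, const std::vector<double> &nodeVal) {
  std::vector<double> buf;
  buf.reserve(cl.sharedLocal.size());
  for (int l : cl.sharedLocal) buf.push_back(nodeVal[l]);
  return buf;
}

// Unpack a neighbor's buffer: because both sides ordered their lists identically,
// entry i of the buffer corresponds to my sharedLocal[i].
void addShared(const CommList &cl, const std::vector<double> &buf,
               std::vector<double> &nodeVal) {
  for (size_t i = 0; i < cl.sharedLocal.size(); ++i)
    nodeVal[cl.sharedLocal[i]] += buf[i];
}
```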

14 Adaptive variations: structured grids
Suppose you need a different level of refinement at different places in the grid: Adaptive Mesh Refinement (AMR)
–Quadtrees and octrees can be used
–Neighboring regions may have resolutions that differ by one level, requiring (possibly complex) interpolation algorithms
–The fact that you have to do the refinement in the middle of a parallel computation makes a difference:
  It happens again and again, though often not every step
  You must adjust your communication lists
  Alternatively, put a layer of software in the middle to do the interpolations, so that each square chunk thinks it has exactly one neighbor on each side
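
As one concrete (and deliberately simple) instance of the interpolation problem at a one-level jump, the sketch below fills a fine block's ghost cells along one face from a coarse neighbor by piecewise-linear interpolation of cell-centered values. The 2:1 refinement ratio and the 3/4–1/4 weights are assumptions for illustration, not a scheme prescribed by the course:

```cpp
#include <vector>

// The fine block has twice the resolution of its coarse neighbor along this face.
// coarseFace holds the coarse neighbor's cell values adjacent to the shared face;
// the result has 2*n entries and fills the fine block's ghost layer.
std::vector<double> fillFineGhosts(const std::vector<double> &coarseFace) {
  size_t n = coarseFace.size();
  std::vector<double> fine(2 * n);
  for (size_t i = 0; i < n; ++i) {
    // Neighboring coarse values, clamped at the ends of the face.
    double left  = (i == 0)     ? coarseFace[i] : coarseFace[i - 1];
    double right = (i == n - 1) ? coarseFace[i] : coarseFace[i + 1];
    // Each coarse cell maps to two fine ghost cells whose centers sit a quarter
    // cell to either side of the coarse center; interpolate linearly toward each side.
    fine[2 * i]     = 0.75 * coarseFace[i] + 0.25 * left;
    fine[2 * i + 1] = 0.75 * coarseFace[i] + 0.25 * right;
  }
  return fine;
}
```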

15 Adaptive variations: unstructured grids
The mesh may be refined in places, dynamically:
–This is much harder to do (even sequentially) than for structured grids
Think about triangles:
–Quality restriction: avoid long, skinny triangles
–From the parallel computing point of view:
  Need to change the list of shared nodes
  Load balance may shift
Load balancing:
–Abandon the partitioning and repartition
–Or incrementally adjust (typically with virtualization)
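
The quality restriction is usually enforced with a per-triangle test applied before and after refinement, for example flagging any triangle whose smallest angle falls below a threshold. A sketch (the 20-degree threshold is an arbitrary illustration):

```cpp
#include <cmath>
#include <algorithm>

struct P2 { double x, y; };

// Returns the smallest interior angle (in radians) of triangle (a, b, c).
double minAngle(P2 a, P2 b, P2 c) {
  auto len = [](P2 u, P2 v) { return std::hypot(u.x - v.x, u.y - v.y); };
  double A = len(b, c), B = len(a, c), C = len(a, b);   // side lengths opposite each vertex
  // Law of cosines at each corner.
  double alpha = std::acos((B * B + C * C - A * A) / (2 * B * C));
  double beta  = std::acos((A * A + C * C - B * B) / (2 * A * C));
  double gamma = std::acos((A * A + B * B - C * C) / (2 * A * B));
  return std::min({alpha, beta, gamma});
}

// "Skinny" if the smallest angle is below roughly 20 degrees (illustrative threshold).
bool isSkinny(P2 a, P2 b, P2 c) {
  const double kDegToRad = 3.14159265358979323846 / 180.0;
  return minAngle(a, b, c) < 20.0 * kDegToRad;
}
```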