Parallel Programming with Object Assemblies
Swarat Chaudhuri (Penn State), Roberto Lublinerman (Penn State), Pavol Cerny (IST Austria)

Taming parallelism
Data parallelism:
- Highly coarse-grained (MapReduce)
- Highly fine-grained (numeric computations on dense arrays)
- Problem-specific methods
Task parallelism
Message passing

Taming parallelism
Our target: data-parallel computations over large, unstructured, shared-memory graphs
- Unknown granularity
- High-level correctness as well as efficiency

Delaunay mesh refinement
Triangulate a given set of points.
Delaunay property: no point is contained within the circumcircle of any triangle.
Quality property: no bad triangles, i.e., triangles with an angle greater than 120°.
Mesh refinement: fix bad triangles through an iterative algorithm.
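As a concrete illustration of these two properties, here is a small Java sketch of the geometric checks; the Pt and TriangleChecks names are hypothetical helpers, not code from the paper. The in-circumcircle test is the standard sign-of-determinant predicate (assuming counter-clockwise vertex order), and the quality test uses the slide's criterion of a largest angle above 120°.

// Hypothetical 2-D point; names are illustrative, not from the JChorus sources.
final class Pt {
    final double x, y;
    Pt(double x, double y) { this.x = x; this.y = y; }
}

final class TriangleChecks {
    /** Delaunay test: is p strictly inside the circumcircle of (a, b, c)?
        Assumes a, b, c are given in counter-clockwise order. */
    static boolean inCircumcircle(Pt a, Pt b, Pt c, Pt p) {
        double ax = a.x - p.x, ay = a.y - p.y;
        double bx = b.x - p.x, by = b.y - p.y;
        double cx = c.x - p.x, cy = c.y - p.y;
        double det = (ax * ax + ay * ay) * (bx * cy - cx * by)
                   - (bx * bx + by * by) * (ax * cy - cx * ay)
                   + (cx * cx + cy * cy) * (ax * by - bx * ay);
        return det > 0;   // positive determinant: p lies inside the circumcircle
    }

    /** Quality test from the slide: a triangle is "bad" if its largest angle exceeds 120 degrees. */
    static boolean isBad(Pt a, Pt b, Pt c) {
        return largestAngle(a, b, c) > Math.toRadians(120);
    }

    private static double largestAngle(Pt a, Pt b, Pt c) {
        return Math.max(angleAt(a, b, c),
               Math.max(angleAt(b, c, a), angleAt(c, a, b)));
    }

    // Angle at vertex v of triangle (v, p, q), via the law of cosines.
    private static double angleAt(Pt v, Pt p, Pt q) {
        double vp = dist(v, p), vq = dist(v, q), pq = dist(p, q);
        return Math.acos((vp * vp + vq * vq - pq * pq) / (2 * vp * vq));
    }

    private static double dist(Pt s, Pt t) {
        return Math.hypot(s.x - t.x, s.y - t.y);
    }
}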

Retriangulation
Cavity: all triangles whose circumcircle contains the new point.
The quality constraint may not hold for all new triangles.
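A hedged sketch of how such a cavity could be collected: starting from the seed triangle, flood outward and keep every triangle whose circumcircle contains the new point. The Triangle interface and its circumcircleContains helper are assumptions for illustration, not part of the Chorus API shown on these slides.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical mesh triangle with adjacency and a circumcircle test.
interface Triangle {
    List<Triangle> neighbors();
    boolean circumcircleContains(double px, double py);
}

final class CavityBuilder {
    /** Collect the cavity of a new point (px, py): flood outward from the seed
        triangle, keeping every triangle whose circumcircle contains the point. */
    static Set<Triangle> collectCavity(Triangle seed, double px, double py) {
        Set<Triangle> cavity = new HashSet<>();
        ArrayDeque<Triangle> frontier = new ArrayDeque<>();
        cavity.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            Triangle t = frontier.poll();
            for (Triangle n : t.neighbors()) {
                if (!cavity.contains(n) && n.circumcircleContains(px, py)) {
                    cavity.add(n);
                    frontier.add(n);
                }
            }
        }
        return cavity;   // retriangulation replaces exactly these triangles
    }
}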

Sequential mesh refinement

Mesh m = /* read input mesh */;
Worklist wl = new Worklist(m.getBad());
foreach triangle t in wl {
    Cavity c = new Cavity(t);
    c.expand();
    c.retriangulate();
    m.updateMesh(c);
    wl.add(c.getBad());
}

Cavities are contiguous "regions" in the mesh.
Worst-case cavities can encompass the whole mesh.

Parallelization
Computation over complex, unstructured graphs:
- Mesh = heap-allocated graph; nodes = triangles; edges = adjacency.
Atomicity: cavities must be retriangulated atomically.
Non-overlapping cavities can be processed in parallel.
Seems impossible to handle with static analysis:
- The shape of the data structure changes greatly over time.
- The shape of the data structure is highly input-dependent.
- Without deep algorithmic knowledge, it is impossible to say statically whether cavities will overlap.
Lots of recent work, notably by Pingali et al.

List of similar applications
- Delaunay mesh refinement, Delaunay triangulation
- Agglomerative clustering, ray tracing
- Social network maintenance
- Minimum spanning tree, maximum flow
- N-body simulation, epidemiological simulation
- Sparse matrix-vector multiplication, sparse Cholesky factorization
- Belief propagation, survey propagation in Bayesian inference
- Iterative dataflow analysis, Petri net simulation
- Finite-difference PDE solution

Locality of updates in Chorus
On a mesh of ~100,000 triangles from the Lonestar benchmarks:
- Average cavity size = 3.75 triangles
- Maximum cavity size = 12 triangles
Average-case locality is the essence of parallelism.
Chorus: parallel computation driven by "neighborhoods" in heaps.

Heaps, regions, assemblies
- Heap = directed graph; nodes = objects; labeled edges = pointers
- Region = induced subgraph
- Assembly = region + thread of control; typically speculative and short-lived

Programs, assembly classes
Assembly class = set of local variables + set of guarded updates + constructor + public variables.
Program = set of classes.
Guarded update syntax:  :: Guard : Update
Synchronization happens in guard evaluation.
An assembly is either ready (waiting to be preempted or to execute its next update), busy executing an update, or terminated.
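A toy Java model of this execution discipline, purely illustrative (the JChorus runtime is not structured this way): an assembly holds a list of guarded updates, and one step runs a single enabled update while the assembly moves from ready to busy and back.

import java.util.List;
import java.util.function.BooleanSupplier;

// A guarded update: a condition over local variables / owned objects, plus a body.
final class GuardedUpdate {
    final BooleanSupplier guard;
    final Runnable update;
    GuardedUpdate(BooleanSupplier guard, Runnable update) {
        this.guard = guard;
        this.update = update;
    }
}

final class ToyAssembly {
    enum State { READY, BUSY, TERMINATED }

    private State state = State.READY;
    private final List<GuardedUpdate> actions;
    ToyAssembly(List<GuardedUpdate> actions) { this.actions = actions; }

    /** Run one enabled update, if any; synchronization (e.g., merges) would
        happen at guard evaluation, as the slide notes. */
    boolean step() {
        if (state != State.READY) return false;
        for (GuardedUpdate a : actions) {
            if (a.guard.getAsBoolean()) {
                state = State.BUSY;
                a.update.run();
                state = State.READY;
                return true;
            }
        }
        return false;
    }

    void terminate() { state = State.TERMINATED; }
}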

Guards can merge assemblies
Syntax:
  :: merge(u.f): S
  :: merge(u.f) when g: S
g is a condition on the local variables and owned objects of the assembly.
The assembly issuing the merge gets a bigger region and keeps its local state; the assembly it absorbs (the one owning the object that u.f points to) dies.
The absorbed assembly must be in its ready state while the merge happens.
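Below is a minimal sketch, in plain Java, of what a merge does to regions under the semantics described here; RegionAssembly and its fields are hypothetical names, not the JChorus runtime. The absorbed assembly's objects move into the initiator's region and the absorbed assembly is terminated; a real runtime would serialize this step (the implementation slides mention assembly-level locks).

import java.util.HashSet;
import java.util.Set;

// Illustrative model of merge semantics: an assembly owns a region (a set of
// heap objects); merging moves the target's region into the initiator and
// terminates the target.
final class RegionAssembly {
    enum State { READY, BUSY, TERMINATED }

    State state = State.READY;
    final Set<Object> region = new HashSet<>();   // objects this assembly owns

    /** Merge 'other' into this assembly; 'other' must be in its ready state. */
    boolean merge(RegionAssembly other) {
        if (other.state != State.READY) return false;
        region.addAll(other.region);       // initiator gets a bigger region, keeps its local state
        other.region.clear();
        other.state = State.TERMINATED;    // the absorbed assembly dies
        return true;
    }
}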

Updates can split an assembly
split(T): split into assemblies of class T.
Other assemblies are not affected.
Not a synchronization construct.
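Continuing the toy RegionAssembly sketch above (still hypothetical names), a split hands each owned object to a fresh assembly. This matches the Cavity-to-Triangle split in the Delaunay example; in general split(T) may partition the region differently.

import java.util.ArrayList;
import java.util.List;

// Illustrative split on the RegionAssembly sketch: the region is partitioned
// into fresh assemblies (here, one per owned object); other assemblies are
// untouched and no synchronization is involved.
final class Splitter {
    static List<RegionAssembly> split(RegionAssembly a) {
        List<RegionAssembly> parts = new ArrayList<>();
        for (Object o : a.region) {
            RegionAssembly part = new RegionAssembly();
            part.region.add(o);
            parts.add(part);
        }
        a.region.clear();                              // the old assembly gives up its objects
        a.state = RegionAssembly.State.TERMINATED;     // and terminates
        return parts;
    }
}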

Local updates
x = u.f;
x.f = y;
Attempts to access objects outside the region lead to exceptions.
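A small illustration of this rule, with hypothetical names (not the JChorus API): reads and writes look like ordinary statements, but every dereference first checks that the object is owned by the assembly's region.

import java.util.Set;

// A node with a single pointer field, standing in for arbitrary object fields.
final class Node {
    Node f;
}

final class LocalUpdates {
    /** Performs 'x = u.f; x.f = y;' under the region discipline. */
    static void update(Set<Node> region, Node u, Node y) {
        Node x = owned(region, u).f;   // x = u.f  (u must be owned)
        owned(region, x).f = y;        // x.f = y  (x must be owned too)
    }

    private static Node owned(Set<Node> region, Node n) {
        if (!region.contains(n)) {
            throw new IllegalStateException("access to an object outside the assembly's region");
        }
        return n;
    }
}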

Delaunay mesh refinement
Use two assembly classes: Triangle and Cavity.
- Cavity = local region in the mesh.
Each triangle:
- Determines if it is bad (a local check).
- If so, merges with its neighbors to become a cavity.
Each cavity:
- Determines if it is complete (a local check).
- If not, merges with a neighbor.
- If yes, retriangulates (locally) and splits.

Delaunay mesh refinement: sketch

assembly Triangle::
  ...
  action::
    merge(v.f, Cavity) when isBad: skip

assembly Cavity::
  ...
  action::
    merge(v.f) when (not isComplete): ...
    isComplete: retriangulate(); split(Triangle)

Delaunay mesh refinement: sketch (continued; same sketch as on the previous slide)

What happens on a conflict?
- Cavity i is "absorbed" by cavity j.
- Cavity j now has some "unnecessary" triangles.
- j will later split.

Boruvka's algorithm for minimum spanning tree
Assembly = spanning tree (fragment).
Initially, each assembly has one node.
As the algorithm progresses, trees merge.
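For reference, a sequential Java sketch of the Boruvka merging pattern (hypothetical code, not the Chorus encoding itself): each component plays the role of an assembly, starting as a single node and repeatedly merging along its lightest outgoing edge; distinct edge weights are assumed for simplicity.

import java.util.List;

final class Boruvka {
    record Edge(int u, int v, double w) {}

    /** Weight of a minimum spanning tree of a connected graph on n vertices. */
    static double mstWeight(int n, List<Edge> edges) {
        int[] comp = new int[n];                 // component id per vertex
        for (int i = 0; i < n; i++) comp[i] = i;
        int components = n;
        double total = 0;
        while (components > 1) {
            Edge[] best = new Edge[n];           // lightest outgoing edge per component
            for (Edge e : edges) {
                int cu = comp[e.u()], cv = comp[e.v()];
                if (cu == cv) continue;
                if (best[cu] == null || e.w() < best[cu].w()) best[cu] = e;
                if (best[cv] == null || e.w() < best[cv].w()) best[cv] = e;
            }
            for (Edge e : best) {
                if (e == null) continue;
                int cu = comp[e.u()], cv = comp[e.v()];
                if (cu == cv) continue;          // already merged this round
                total += e.w();
                for (int i = 0; i < n; i++)      // merge: relabel cv into cu
                    if (comp[i] == cv) comp[i] = cu;
                components--;
            }
        }
        return total;
    }
}

For example, mstWeight(3, List.of(new Edge(0, 1, 1.0), new Edge(1, 2, 2.0), new Edge(0, 2, 3.0))) returns 3.0.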

Race-freedom
No aliasing, only ownership transfer.
An assembly can merge with another only when the latter is not in the middle of an update.

Deadlock-freedom
Classic definition: process P waits for a resource held by Q and vice versa.
Deadlock in Chorus:
- An assembly has a locally enabled merge with another assembly,
- and no other progress is possible.
But one of the merges can always be carried out (an assembly can always be killed at its ready state).

JChorus
Chorus + sequential Java.
Assembly classes in addition to object classes.

assembly Cavity {
  action {
    // expand cavity
    merge(outgoingedges, TriangleObject t): {
      outgoingedges.remove(t);
      frontier.add(t);
      build();
    }
  }
  Set members;
  Set border;
  Queue frontier;        // current frontier
  List outgoingedges;    // outgoing edges on which to merge
  TriangleObject initial;
  ...

Division-based implementation
Division = set of assemblies mapped to a core.
Local access:
- Merge-actions within a division
- Split-actions
- Local updates
Remote access:
- Merge-actions issued across divisions
- Uses assembly-level locks

Implementation strategies
- Adaptive divisions: a heuristic for reducing the number of remote merges. During a merge, not only the target assembly but also assemblies reachable by k pointer indirections are migrated. The adaptation heuristic does elementary load balancing.
- Union-find data structure to relate objects to the assemblies they belong to; needed for splits and merges.
- Token-passing for deadlock prevention and termination detection.
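A minimal union-find sketch of the object-to-assembly mapping mentioned above (hypothetical names, not the JChorus implementation), using path halving and union by size.

// Each object id maps to a representative; the representative identifies the
// group currently owned by one assembly. Merges union two groups; a split
// would allocate fresh representatives for the new assemblies.
final class ObjectToAssembly {
    private final int[] parent;
    private final int[] size;

    ObjectToAssembly(int numObjects) {
        parent = new int[numObjects];
        size = new int[numObjects];
        for (int i = 0; i < numObjects; i++) { parent[i] = i; size[i] = 1; }
    }

    /** Representative of the group that object x belongs to (path halving). */
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    /** Record that the assemblies owning a and b have merged (union by size). */
    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (size[ra] < size[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;
        size[ra] += size[rb];
    }
}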

Experiments: Delaunay refinement from the Lonestar benchmarks
Large dataset from the Lonestar benchmarks:
- 100,364 triangles
- 47,768 initially bad
1 to 8 threads.
Competing approaches:
- Object-level locking
- DSTM (software transactional memory)

Locality: mesh snapshots
The initial mesh and divisions; the mesh after several thousand retriangulations.

Delaunay: Speedup over sequential

Delaunay: Self-relative speedup

Delaunay: Conflicts

Related models
- Threads + explicit locking: global heap abstraction, arbitrary aliasing.
- Software transactions: the burden of reasoning is passed to the transaction manager; in most implementations, the heap is viewed as global.
- Static data partitioning: the unpredictable nature of the computation makes static analysis hard.
- Actors: based on low-level messaging. Sending references risks races; copying triangles is inefficient.
- Pingali et al.'s Galois: addresses the same problem; ours is an alternative.

More information
Parallel programming with object assemblies. Roberto Lublinerman, Swarat Chaudhuri, Pavol Cerny. OOPSLA 2009.