The Galois Project Keshav Pingali University of Texas, Austin Joint work with Milind Kulkarni, Martin Burtscher, Patrick Carribault, Donald Nguyen, Dimitrios Prountzos, Zifei Zhong

Overview of Galois Project

Focus of the Galois project:
– parallel execution of irregular programs: pointer-based data structures like graphs and trees
– raise the abstraction level for "Joe programmers": explicit parallelism is too difficult for most programmers, and the performance penalty for abstraction should be small

Research approach:
a) study algorithms to find common patterns of parallelism and locality: Amorphous Data-Parallelism (ADP)
b) design programming constructs for expressing these patterns
c) implement these constructs efficiently: Abstraction-Based Speculation (ABS)

For more information:
– papers in PLDI 2007, ASPLOS 2008, SPAA 2008
– website:

Organization

Case study of amorphous data-parallelism:
– Delaunay mesh refinement
Galois system (PLDI 2007):
– Programming model
– Baseline implementation
Galois system optimizations:
– Scheduling (SPAA 2008)
– Data and computation partitioning (ASPLOS 2008)
Experimental results
Ongoing work

Delaunay Mesh Refinement

Iterative refinement to remove badly shaped triangles:

    while there are bad triangles do {
        pick a bad triangle;
        find its cavity;
        retriangulate the cavity;   // may create new bad triangles
    }

Order in which bad triangles should be refined:
– the final mesh depends on the order in which bad triangles are processed
– but all bad triangles are eventually eliminated, regardless of order

Sequential code:

    Mesh m = /* read in mesh */;
    WorkList wl;
    wl.add(m.badTriangles());
    while (!wl.empty()) {
        Triangle e = wl.get();
        if (e no longer in mesh) continue;  // an earlier retriangulation removed e
        Cavity c = new Cavity(e);
        c.expand();                // determine the cavity
        c.retriangulate();         // re-triangulate the cavity
        m.update(c);               // update the mesh
        wl.add(c.badTriangles());  // newly created bad triangles
    }

Delaunay Mesh Refinement

Parallelism:
– triangles with non-overlapping cavities can be processed in parallel
– if the cavities of two triangles overlap, they must be processed serially
– in practice, there is lots of parallelism

Exploiting this parallelism:
– compile-time parallelization techniques like points-to and shape analysis cannot expose this parallelism (it is a property of the algorithm, not of the program)
– runtime dependence checking is needed

Galois approach: optimistic parallelization

Take-away lessons

Amorphous data-parallelism:
– data structures: graphs, trees, etc.
– iterative algorithm over an unordered or ordered work-list
– elements can be added to the work-list during the computation
– complex patterns of dependences between computations on different work-list elements (possibly input-sensitive)
– but many of these computations can be done in parallel

Contrast: crystalline (regular) data-parallelism:
– data structures: dense matrices
– iterative algorithm over a fixed integer interval
– simple dependence patterns: affine subscripts in array accesses (mostly input-insensitive)

    for i = 1, N
      for j = 1, N
        for k = 1, N
          C[i,j] = C[i,j] + A[i,k]*B[k,j]

Take-away lessons (contd.)

Amorphous data-parallelism is ubiquitous:
– Delaunay mesh generation: points to be inserted into the mesh
– Delaunay mesh refinement: list of bad triangles
– Agglomerative clustering: priority queue of points from the data-set
– Boykov-Kolmogorov algorithm for image segmentation
– Reduction-based interpreters for the λ-calculus: list of redexes
– Iterative dataflow analysis algorithms in compilers
– Approximate SAT solvers: survey propagation, WalkSAT
– …

Take-away lessons (contd.)

Amorphous data-parallelism is obscured within while loops, exit conditions, etc. in conventional languages:
– we need transparent syntax, analogous to FOR loops for crystalline data-parallelism

Optimistic parallelization is necessary in general:
– compile-time approaches using points-to analysis or shape analysis may be adequate for some cases
– in general, runtime dependence checking is needed
– this is a property of algorithms, not of programs

Handling of dependence conflicts depends on the application:
– Delaunay mesh generation: roll back any conflicting computation
– Agglomerative clustering: must respect priority-queue order

Organization

Case study of amorphous data-parallelism:
– Delaunay mesh refinement
Galois system (PLDI 2007):
– Programming model
– Baseline implementation
Galois system optimizations:
– Scheduling (SPAA 2008)
– Data and computation partitioning (ASPLOS 2008)
Experimental results
Ongoing work

Galois Design Philosophy

Do not worry about "dusty decks" (for now):
– restructuring existing code to expose amorphous data-parallelism is not our focus (cf. Google map/reduce)

Evolution, not revolution:
– modification of existing programming paradigms: OK
– radical solutions like functional programming: not OK

No reliance on parallelizing compiler technology:
– it will not work for many of our applications anyway
– parallelizing compilers are very complex software artifacts

Support two classes of programmers:
– domain experts (Joe): should be shielded from the complexities of parallel programming; most programmers will be Joes
– parallel programming experts (Steve): a small number of highly trained people
– analogs: the industry model even for sequential programming; the norm in domains like numerical linear algebra, where Steves implement BLAS libraries and Joes express their algorithms in terms of BLAS routines

Galois system

Application program:
– has well-defined sequential semantics (current implementation: sequential Java)
– uses optimistic iterators to highlight, for the runtime system, opportunities for exploiting parallelism

Class libraries:
– like the Java collections library, but with additional information for concurrency control

Runtime system:
– manages optimistic parallelism

Optimistic set iterators

for each e in Set S do B(e)
– evaluate block B(e) for each element in set S
– sequential semantics: set elements are unordered, so there is no a priori order on iterations; there may be dependences between iterations
– set S may acquire new elements during execution

for each e in OrderedSet S do B(e)
– evaluate block B(e) for each element in set S
– sequential semantics: perform iterations in the order specified by the OrderedSet; there may be dependences between iterations
– set S may acquire new elements during execution
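
To make the sequential semantics above concrete, here is a minimal Java sketch of both iterator forms; the method names and the use of plain JDK collections are illustrative assumptions, not the actual Galois API.

    import java.util.ArrayDeque;
    import java.util.PriorityQueue;
    import java.util.function.Consumer;

    class IteratorSemantics {
        // Unordered iterator: any element may be chosen next; the body may
        // add new elements (by capturing the work-set in its closure).
        static <T> void foreach(ArrayDeque<T> workset, Consumer<T> body) {
            while (!workset.isEmpty()) {
                body.accept(workset.poll());
            }
        }

        // Ordered iterator: elements are processed in priority order,
        // matching the OrderedSet form of the iterator.
        static <T> void foreachOrdered(PriorityQueue<T> workset, Consumer<T> body) {
            while (!workset.isEmpty()) {
                body.accept(workset.poll());
            }
        }
    }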

Galois version of mesh refinement

    Mesh m = /* read in mesh */;
    Set wl;
    wl.add(m.badTriangles());      // initialize the work-set wl
    for each e in Set wl do {      // unordered Set iterator
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();                // determine the cavity
        c.retriangulate();         // re-triangulate the cavity
        m.update(c);               // update the mesh
        wl.add(c.badTriangles());  // add new bad triangles to the Set
    }

The scheduling policy for the iterator is controlled by the implementation of the Set class; a good choice for temporal locality is a stack.

Parallel execution model

[Figure: master program with "for each" iterators, worker threads, and shared-memory objects]

Object-based shared-memory model, with a master thread and some number of worker threads:
– the master thread begins execution of the program and executes the code between iterators
– when it encounters an iterator, worker threads help by executing iterations concurrently with the master
– threads synchronize by barrier synchronization at the end of the iterator

Threads invoke methods to access the internal state of objects:
– how do we ensure that the sequential semantics of the program are respected?
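
A hedged sketch of this execution model using a plain Java thread pool; the real Galois runtime also performs conflict detection and rollback, which are omitted here, and all names are illustrative.

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Consumer;

    class ParallelIterator {
        // The master calls this when it reaches an iterator; workers drain
        // the work-set, and the master waits at the implicit barrier.
        static <T> void run(ConcurrentLinkedQueue<T> workset,
                            Consumer<T> body, int nThreads) throws InterruptedException {
            ExecutorService workers = Executors.newFixedThreadPool(nThreads);
            for (int t = 0; t < nThreads; t++) {
                workers.submit(() -> {
                    T e;
                    while ((e = workset.poll()) != null) {
                        body.accept(e);  // conflict detection/rollback omitted
                    }
                    // Note: a worker that sees an empty queue exits, so a real
                    // runtime must also track in-flight iterations that may
                    // still create new work.
                });
            }
            workers.shutdown();                           // barrier at the end
            workers.awaitTermination(1, TimeUnit.HOURS);  // of the iterator
        }
    }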

Baseline solution: PLDI 2007

An iteration must lock an object to invoke one of its methods. Two types of objects:
– catch-and-keep policy: the lock is held even after the method invocation completes; all locks are released at the end of the iteration; poor performance for programs with collections and accumulators
– catch-and-release policy: like the Java locking policy; permits method invocations from different concurrent iterations to be interleaved; how do we make sure this is safe?

[Figure: timeline of iterations i and j accessing shared-memory objects]

Catch and keep: iteration rollback

What does iteration j do if an object is already locked by some other iteration i?
– one possibility: wait and try to acquire the lock again, but this might lead to deadlock
– our implementation: the runtime system rolls back one of the iterations by undoing its updates to shared objects

Undoing updates: any "copy-on-write" solution works:
– make a copy of the entire object when you acquire the lock (wasteful for large objects)
– the runtime system maintains an "undo log" that holds information for undoing side-effects to objects as they are made (cf. software transactional memory)
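
A minimal Java sketch of the undo-log idea, assuming the runtime registers an inverse action for each side-effect as it happens; the class and method names are illustrative, not the PLDI 2007 implementation.

    import java.util.ArrayDeque;
    import java.util.Deque;

    class UndoLog {
        private final Deque<Runnable> inverses = new ArrayDeque<>();

        // Record the inverse of a side-effect just performed, e.g. after
        // worklist.add(t), record "() -> worklist.remove(t)".
        void record(Runnable inverse) {
            inverses.push(inverse);
        }

        // On a conflict, undo the iteration's side-effects in reverse order.
        void rollback() {
            while (!inverses.isEmpty()) {
                inverses.pop().run();
            }
        }

        // On a successful commit, the log is simply discarded.
        void commit() {
            inverses.clear();
        }
    }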

Problem with catch and keep

Poor performance for programs that deal with mutable collections and accumulators:
– work-sets are mutable collections
– accumulators are ubiquitous

Example: Delaunay refinement:
– the work-set is a (mutable) collection of bad triangles
– some thread grabs the lock on the work-set object, gets a bad triangle, and removes it from the work-set
– that thread must retain the lock on the work-set until its iteration completes, which shuts out all other threads (the same problem arises with transactional memory)

Lesson:
– for some objects, we need to interleave method invocations from different iterations
– but we must not lose serializability

Galois solution: selective catch and release

Example: accumulator
– two methods: add(int) and read() // return value
– adds commute with other adds, and reads commute with other reads

Interleaving of commuting method invocations from different iterations → OK
Interleaving of non-commuting method invocations from different iterations → trigger an abort

Rolling back side-effects: the programmer must provide "inverse" methods for the forward methods:
– the inverse method for add(n) is subtract(n)
– a semantic inverse, not a representational inverse

This solution works for sets as well.

[Figure: two iterations' interleaved add() calls on a shared accumulator, with read() observing the combined value]
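
A sketch of what such an accumulator class might look like; in the actual system the commutativity and inverse information is metadata consumed by the runtime, whereas here it appears only as comments.

    class Accumulator {
        private int value = 0;

        // Forward method: adds commute with other adds.
        synchronized void add(int n) { value += n; }

        // Semantic inverse of add(n), invoked by the runtime on abort.
        synchronized void subtract(int n) { value -= n; }

        // Reads commute with other reads, but not with add().
        synchronized int read() { return value; }
    }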

Abstraction-based Speculation

Library writer:
– specifies commutativity and inverse information for some classes

Runtime system:
– uses catch-and-release locking for these classes
– keeps track of forward method invocations
– checks commutativity of forward method invocations
– invokes the appropriate inverse methods on abort

More details: PLDI 2007

Related work:
– logical locks in database systems
– Herlihy et al.: PPoPP 2008

Organization

Case study of amorphous data-parallelism:
– Delaunay mesh refinement
Galois system (PLDI 2007):
– Programming model
– Baseline implementation
Galois system optimizations:
– Scheduling (SPAA 2008)
– Data and computation partitioning (ASPLOS 2008)
Experimental results
Ongoing work

Scheduling iterators

Control scheduling by changing the implementation of the work-set class (stack, queue, etc.). Scheduling can have a profound effect on performance.

Example: Delaunay mesh refinement
– 10,156 triangles, of which 4,837 were bad
– sequential code, work-set is a stack: 21,918 iterations completed + 0 aborted
– 4-processor Itanium 2, work-set implementations:
  stack: 21,736 iterations completed + 28,290 aborted
  array + random choice: 21,908 iterations completed + 49 aborted
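
To illustrate the two policies compared above, here is a hedged Java sketch of interchangeable work-set implementations; the interface is an assumption for illustration, since the actual Galois scheduling framework (SPAA 2008) is richer.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Random;

    interface Workset<T> {
        void add(T e);
        T next();   // returns null when empty
    }

    class LifoWorkset<T> implements Workset<T> {
        private final ArrayDeque<T> stack = new ArrayDeque<>();
        public void add(T e) { stack.push(e); }
        public T next() { return stack.poll(); }   // most recently added first
    }

    class RandomWorkset<T> implements Workset<T> {
        private final ArrayList<T> items = new ArrayList<>();
        private final Random rng = new Random();
        public void add(T e) { items.add(e); }
        public T next() {
            if (items.isEmpty()) return null;
            int i = rng.nextInt(items.size());         // random element
            T e = items.get(i);
            items.set(i, items.get(items.size() - 1)); // swap with last
            items.remove(items.size() - 1);            // O(1) removal
            return e;
        }
    }

One plausible reading of the abort counts: under the LIFO policy, concurrent threads tend to pick recently created, spatially clustered triangles whose cavities overlap, while random choice spreads the threads across the mesh.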

Scheduling iterators (SPAA 2008)

Crystalline data-parallelism: DO-ALL loops
– the main scheduling concerns are locality and load balancing
– OpenMP: static, dynamic, guided, etc.

Amorphous data-parallelism: many more issues
– conflicts
– dynamically created work
– algorithmic issues: efficiency of data structures

SPAA 2008 paper:
– a scheduling framework for exploiting amorphous data-parallelism
– generalizes the OpenMP DO-ALL loop scheduling constructs

Data Partitioning (ASPLOS 2008)

Partition the graph between cores. Data-centric assignment of work:
– a core gets bad triangles from its own partitions
– improves locality
– can dramatically reduce conflicts

Lock coarsening:
– associate locks with partitions

Over-decomposition:
– improves core utilization
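
A minimal sketch of lock coarsening over a partitioned graph: locking any node acquires its whole partition's lock. The node-to-partition map is assumed to come from a graph partitioner (e.g., Metis); all names are illustrative.

    import java.util.concurrent.locks.ReentrantLock;

    class PartitionedLocks {
        private final ReentrantLock[] locks;
        private final int[] partitionOf;   // node id -> partition id

        PartitionedLocks(int[] partitionOf, int nPartitions) {
            this.partitionOf = partitionOf;
            this.locks = new ReentrantLock[nPartitions];
            for (int i = 0; i < nPartitions; i++) {
                locks[i] = new ReentrantLock();
            }
        }

        // One lock per partition instead of one lock per node.
        void lockNode(int node)   { locks[partitionOf[node]].lock(); }
        void unlockNode(int node) { locks[partitionOf[node]].unlock(); }
    }

With over-decomposition, nPartitions would be several times the number of cores, so a core blocked on one partition still has other partitions to work on.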

Organization

Case study of amorphous data-parallelism:
– Delaunay mesh refinement
Galois system (PLDI 2007):
– Programming model
– Baseline implementation
Galois system optimizations:
– Scheduling (SPAA 2008)
– Data and computation partitioning (ASPLOS 2008)
Experimental results
Ongoing work

Small-scale multiprocessor results

4-processor Itanium 2:
– 16 KB L1, 256 KB L2, 3 MB L3 cache

Versions:
– GAL: using a stack as the worklist
– PAR: partitioned mesh + data-centric work assignment
– LCO: locks on partitions
– OVD: over-decomposed version (factor of 4)

Large-scale multiprocessor results

128-core Sun Fire E25K (1 GHz):
– 64 dual-core processors
– Sun Solaris

First "out-of-the-box" results:
– speed-up of 20 on 32 cores for refinement
– mesh partitioning is still sequential: its time starts to dominate after 8 processors (32 partitions)
– need parallel mesh partitioning: ParMetis (Karypis et al.)

Galois version of mesh refinement

    Mesh m = /* read in mesh */;
    Set wl;
    wl.add(m.badTriangles());      // initialize the work-set wl
    for each e in Set wl do {      // unordered Set iterator
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        m.update(c);
        wl.add(c.badTriangles());  // add new bad triangles to the Set
    }

[Figure: the same program running on top of a partitioned work-set and a partitioned graph in the Galois runtime system]

Results for BK Image Segmentation

Versions:
– GAL: standard Galois version
– PAR: partitioned graph
– LCO: locks on partitions
– OVD: over-decomposed version

Related work

Transactions:
– the programming model is explicitly parallel
– assumes someone else is responsible for parallelism, locality, load balancing, and scheduling, and focuses only on synchronization
– Galois: the main concerns are parallelism, locality, load balancing, and scheduling; "catch-and-keep" classes can use TM for rollback, but this is probably overkill

Thread-level speculation:
– it is not clear where to speculate in C programs, so power is wasted on useless speculation
– many schemes require extensive hardware support
– no notion of abstraction-based speculation
– no analogs of data partitioning or scheduling
– overall results are disappointing

Ongoing work

Case studies of irregular programs:
– understand "parallelism and locality patterns" in irregular programs

Lonestar benchmark suite for irregular programs:
– joint work with Calin Cascaval's group at IBM Yorktown Heights

Optimizing the Galois runtime system:
– improve performance for iterators in which the work per iteration is relatively low

Compiler analysis to reduce the overheads of optimistic parallel execution

Scalability studies:
– larger numbers of cores

Distributed-memory implementation:
– billion-element meshes?

Program analysis to verify assertions about class methods:
– needs a semantic specification of the class

Summary

Irregular applications have amorphous data-parallelism:
– work-list-based iterative algorithms over unordered and ordered sets

Amorphous data-parallelism may be inherently data-dependent:
– pointer/shape analysis cannot work for these applications

Optimistic parallelization is essential for such applications:
– analysis might still be useful to optimize parallel program execution

Exploiting abstractions and high-level semantics is critical:
– Galois knows about sets, ordered sets, accumulators, …

The Galois approach provides a unified view of data-parallelism in regular and irregular programs:
– the baseline is optimistic parallelism
– use compiler analysis to make decisions at compile time whenever possible