1 The Galois Project

Keshav Pingali, University of Texas, Austin
Joint work with Milind Kulkarni, Martin Burtscher, Patrick Carribault, Donald Nguyen, Dimitrios Prountzos, Zifei Zhong

2 Overview of Galois Project

Focus of Galois project:
– parallel execution of irregular programs: pointer-based data structures like graphs and trees
– raise the abstraction level for “Joe programmers”: explicit parallelism is too difficult for most programmers, and the performance penalty for abstraction should be small

Research approach:
a) study algorithms to find common patterns of parallelism and locality: Amorphous Data-Parallelism (ADP)
b) design programming constructs for expressing these patterns
c) implement these constructs efficiently: Abstraction-Based Speculation (ABS)

For more information:
– papers in PLDI 2007, ASPLOS 2008, SPAA 2008
– website: http://iss.ices.utexas.edu

3 Organization

Case study of amorphous data-parallelism
– Delaunay mesh refinement
Galois system (PLDI 2007)
– Programming model
– Baseline implementation
Galois system optimizations
– Scheduling (SPAA 2008)
– Data and computation partitioning (ASPLOS 2008)
Experimental results
Ongoing work

4 Delaunay Mesh Refinement

Iterative refinement to remove badly shaped triangles:

    while there are bad triangles do {
        Pick a bad triangle;
        Find its cavity;
        Retriangulate cavity;   // may create new bad triangles
    }

Order in which bad triangles should be refined:
– the final mesh depends on the order in which bad triangles are processed
– but all bad triangles are ultimately eliminated, regardless of order

    Mesh m = /* read in mesh */
    WorkList wl;
    wl.add(mesh.badTriangles());
    while (true) {
        if (wl.empty()) break;
        Triangle e = wl.get();
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();          // determine cavity
        c.retriangulate();   // re-triangulate cavity
        m.update(c);         // update mesh
        wl.add(c.badTriangles());
    }

5 Delaunay Mesh Refinement

Parallelism:
– triangles with non-overlapping cavities can be processed in parallel
– if the cavities of two triangles overlap, they must be processed serially
– in practice, there is lots of parallelism

Exploiting this parallelism:
– compile-time parallelization techniques like points-to and shape analysis cannot expose this parallelism (it is a property of the algorithm, not the program)
– runtime dependence checking is needed

Galois approach: optimistic parallelization

6 Take-away lessons

Amorphous data-parallelism
– data structures: graphs, trees, etc.
– iterative algorithm over an unordered or ordered work-list
– elements can be added to the work-list during the computation
– complex patterns of dependences between computations on different work-list elements (possibly input-sensitive)
– but many of these computations can be done in parallel

Contrast: crystalline (regular) data-parallelism
– data structures: dense matrices
– iterative algorithm over a fixed integer interval
– simple dependence patterns: affine subscripts in array accesses (mostly input-insensitive)

    for i = 1, N
      for j = 1, N
        for k = 1, N
          C[i,j] = C[i,j] + A[i,k]*B[k,j]

7 Take-away lessons (contd.)

Amorphous data-parallelism is ubiquitous:
– Delaunay mesh generation: points to be inserted into the mesh
– Delaunay mesh refinement: list of bad triangles
– Agglomerative clustering: priority queue of points from the data-set
– Boykov-Kolmogorov algorithm for image segmentation
– Reduction-based interpreters for the λ-calculus: list of redexes
– Iterative dataflow analysis algorithms in compilers
– Approximate SAT solvers: survey propagation, WalkSAT
– …

8 Take-away lessons (contd.)

Amorphous data-parallelism is obscured within while loops, exit conditions, etc. in conventional languages
– need transparent syntax, analogous to FOR loops for crystalline data-parallelism

Optimistic parallelization is necessary in general
– compile-time approaches using points-to analysis or shape analysis may be adequate for some cases
– in general, runtime dependence checking is needed
– this is a property of algorithms, not programs

Handling of dependence conflicts depends on the application
– Delaunay mesh generation: roll back any conflicting computation
– Agglomerative clustering: must respect priority queue order

9 Organization

Case study of amorphous data-parallelism
– Delaunay mesh refinement
Galois system (PLDI 2007)
– Programming model
– Baseline implementation
Galois system optimizations
– Scheduling (SPAA 2008)
– Data and computation partitioning (ASPLOS 2008)
Experimental results
Ongoing work

10 Galois Design Philosophy

Do not worry about “dusty decks” (for now)
– restructuring existing code to expose amorphous data-parallelism is not our focus (cf. Google map/reduce)

Evolution, not revolution
– modification of existing programming paradigms: OK
– radical solutions like functional programming: not OK

No reliance on parallelizing compiler technology
– will not work for many of our applications anyway
– parallelizing compilers are very complex software artifacts

Support two classes of programmers:
– domain experts (Joe): should be shielded from the complexities of parallel programming; most programmers will be Joes
– parallel programming experts (Steve): a small number of highly trained people
– analogs: this is the industry model even for sequential programming, and the norm in domains like numerical linear algebra: Steves implement BLAS libraries, and Joes express their algorithms in terms of BLAS routines

11 Galois system

Application program
– has well-defined sequential semantics (current implementation: sequential Java)
– uses optimistic iterators to highlight, for the runtime system, opportunities for exploiting parallelism

Class libraries
– like the Java collections library, but with additional information for concurrency control

Runtime system
– manages optimistic parallelism

12 Optimistic set iterators

    for each e in Set S do B(e)

– evaluate block B(e) for each element in set S
– sequential semantics: set elements are unordered, so there is no a priori order on iterations; there may be dependences between iterations
– set S may get new elements during execution

    for each e in OrderedSet S do B(e)

– evaluate block B(e) for each element in set S
– sequential semantics: perform iterations in the order specified by the OrderedSet; there may be dependences between iterations
– set S may get new elements during execution
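The sequential semantics of the unordered iterator can be sketched as a simple work-list drain. This is a minimal illustrative sketch in plain Java, not the Galois API: the `foreach` helper and `Body` interface are hypothetical names, and the real system would execute these iterations speculatively in parallel.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SetIterator {
    // Hypothetical iteration body: receives the element and may add new work.
    interface Body<T> { void run(T e, Deque<T> workset); }

    // Sequential semantics of "for each e in Set S do B(e)":
    // no a priori order on iterations, and the set may grow while iterating.
    static <T> void foreach(Deque<T> workset, Body<T> body) {
        while (!workset.isEmpty()) {
            T e = workset.poll();   // pick some element (order unspecified)
            body.run(e, workset);   // the body may add new elements
        }
    }

    public static void main(String[] args) {
        Deque<Integer> ws = new ArrayDeque<>();
        ws.add(8);
        final int[] processed = {0};
        foreach(ws, (e, w) -> {
            processed[0]++;
            if (e > 1) { w.add(e / 2); w.add(e / 2); }  // dynamically created work
        });
        // 8 spawns two 4s, four 2s, eight 1s: 1 + 2 + 4 + 8 = 15 iterations
        assert processed[0] == 15 : processed[0];
        System.out.println("processed=" + processed[0]);
    }
}
```

The key property the runtime must preserve is exactly this sequential behavior, even when the iterations are run concurrently.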

13 Galois version of mesh refinement

    Mesh m = /* read in mesh */
    Set wl;
    wl.add(mesh.badTriangles());      // initialize the Set wl
    for each e in Set wl do {         // unordered Set iterator
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        m.update(c);
        wl.add(c.badTriangles());     // add new bad triangles to Set
    }

The scheduling policy for the iterator is controlled by the implementation of the Set class
– a good choice for temporal locality: a stack

14 Parallel execution model

[Figure: a master program with two for-each iterators; threads access shared-memory objects]

Object-based shared-memory model

Master thread and some number of worker threads
– the master thread begins execution of the program and executes the code between iterators
– when it encounters an iterator, worker threads help by executing iterations concurrently with the master
– threads synchronize by barrier synchronization at the end of the iterator

Threads invoke methods to access the internal state of objects
– how do we ensure that the sequential semantics of the program are respected?

15 Baseline solution: PLDI 2007

An iteration must lock an object to invoke its methods

Two types of objects:
– catch-and-keep policy: the lock is held even after the method invocation completes, and all locks are released at the end of the iteration; poor performance for programs with collections and accumulators
– catch-and-release policy: like the Java locking policy; permits method invocations from different concurrent iterations to be interleaved; how do we make sure this is safe?

16 Catch and keep: iteration rollback

What does iteration j do if an object is already locked by some other iteration i?
– one possibility: wait and try to acquire the lock again, but this might lead to deadlock
– our implementation: the runtime system rolls back one of the iterations by undoing its updates to shared objects

Undoing updates: any “copy-on-write” solution works
– make a copy of the entire object when you acquire the lock (wasteful for large objects)
– or the runtime system maintains an “undo log” that holds information for undoing side-effects to objects as they are made (cf. software transactional memory)
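The undo-log alternative can be sketched in a few lines. This is an illustrative sketch, not the Galois runtime: the `UndoLog` class and its methods are hypothetical names; the idea is only that each side-effect is preceded by logging an action that reverses it, and rollback replays the log in reverse (LIFO) order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class UndoLog {
    private final Deque<Runnable> undo = new ArrayDeque<>();

    void log(Runnable inverse) { undo.push(inverse); }     // record before mutating
    void rollback() { while (!undo.isEmpty()) undo.pop().run(); }  // LIFO replay
    void commit() { undo.clear(); }                        // iteration succeeded

    public static void main(String[] args) {
        java.util.Set<String> mesh =
            new java.util.HashSet<>(java.util.List.of("t1", "t2"));
        UndoLog log = new UndoLog();

        // An iteration speculatively removes t1 and adds t3,
        // logging the inverse of each side-effect first.
        log.log(() -> mesh.add("t1"));    mesh.remove("t1");
        log.log(() -> mesh.remove("t3")); mesh.add("t3");

        // Conflict detected: roll the iteration back.
        log.rollback();
        assert mesh.equals(java.util.Set.of("t1", "t2"));
        System.out.println(mesh);
    }
}
```

Compared with copying the whole object on lock acquisition, the log costs space proportional to the updates actually made, which matters for large objects like meshes.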

17 Problem with catch and keep

Poor performance for programs that deal with mutable collections and accumulators
– work-sets are mutable collections
– accumulators are ubiquitous

Example: Delaunay refinement
– the work-set is a (mutable) collection of bad triangles
– some thread grabs the lock on the work-set object, gets a bad triangle, and removes it from the work-set
– that thread must retain the lock on the work-set until its iteration completes, which shuts out all other threads (the same problem arises with transactional memory)

Lesson:
– for some objects, we need to interleave method invocations from different iterations
– but we must not lose serializability

18 Galois solution: selective catch and release

Example: accumulator with two methods:
    add(int)
    read()    // return value
– adds commute with other adds, and reads commute with other reads

Interleaving of commuting method invocations from different iterations → OK
Interleaving of non-commuting method invocations from different iterations → trigger an abort

[Figure: two interleavings of add(5), add(8), add(3), add(-4) from different iterations reach the same accumulator value]

Rolling back side-effects: the programmer must provide “inverse” methods for forward methods
– the inverse method for add(n) is subtract(n)
– a semantic inverse, not a representational inverse

This solution works for sets as well.
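The accumulator example can be made concrete. This is a minimal sketch, not the Galois library: the `commutes` check and method names here are hypothetical, standing in for the commutativity specification the runtime would consult; `subtract(n)` is the semantic inverse of `add(n)` used on abort.

```java
public class Accumulator {
    private int value = 0;

    void add(int n)      { value += n; }
    void subtract(int n) { value -= n; }   // semantic inverse of add(n)
    int  read()          { return value; }

    // Hypothetical commutativity specification: add/add and read/read
    // commute; add/read (in either order) does not, and would trigger
    // an abort if the invocations come from different iterations.
    static boolean commutes(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        Accumulator acc = new Accumulator();
        // Interleaved adds from two iterations: any order gives the same value.
        acc.add(5); acc.add(8); acc.add(3); acc.add(-4);
        assert acc.read() == 12;

        assert Accumulator.commutes("add", "add");
        assert !Accumulator.commutes("add", "read");  // conflict -> abort

        // Roll back one iteration's adds (8 and -4) via the semantic inverse.
        acc.subtract(8); acc.subtract(-4);
        assert acc.read() == 8;
        System.out.println(acc.read());
    }
}
```

Note that `subtract` is a semantic inverse: it restores the abstract value of the accumulator, not necessarily the exact bit-level state the object had before the iteration ran.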

19 Abstraction-based Speculation

Library writer:
– specifies commutativity and inverse information for some classes

Runtime system:
– catch-and-release locking for these classes
– keeps track of forward method invocations
– checks commutativity of forward method invocations
– invokes the appropriate inverse methods on abort

More details: PLDI 2007

Related work:
– logical locks in database systems
– Herlihy et al.: PPoPP 2008

20 Organization

Case study of amorphous data-parallelism
– Delaunay mesh refinement
Galois system (PLDI 2007)
– Programming model
– Baseline implementation
Galois system optimizations
– Scheduling (SPAA 2008)
– Data and computation partitioning (ASPLOS 2008)
Experimental results
Ongoing work

21 Scheduling iterators

Control scheduling by changing the implementation of the work-set class
– stack, queue, etc.

Scheduling can have a profound effect on performance

Example: Delaunay mesh refinement
– 10,156 triangles, of which 4,837 were bad
– sequential code, work-set is a stack: 21,918 completed iterations + 0 aborted
– 4-processor Itanium 2, work-set implementations:
    stack: 21,736 iterations completed + 28,290 aborted
    array + random choice: 21,908 iterations completed + 49 aborted
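The scheduling knob is just the work-set implementation. The sketch below illustrates the idea with a hypothetical `Worklist` interface (these names are not the Galois API): a stack yields LIFO order, which is good for temporal locality but makes neighboring triangles collide in parallel runs, while random choice from an array scatters the work and, as the abort counts above show, drastically reduces conflicts.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

interface Worklist<T> { void add(T e); T get(); boolean empty(); }

class StackPolicy<T> implements Worklist<T> {
    private final ArrayDeque<T> s = new ArrayDeque<>();
    public void add(T e) { s.push(e); }
    public T get() { return s.pop(); }           // LIFO: good temporal locality
    public boolean empty() { return s.isEmpty(); }
}

class RandomPolicy<T> implements Worklist<T> {
    private final ArrayList<T> a = new ArrayList<>();
    private final Random r = new Random(0);      // seeded for reproducibility
    public void add(T e) { a.add(e); }
    public T get() { return a.remove(r.nextInt(a.size())); }  // scatters work
    public boolean empty() { return a.isEmpty(); }
}

public class Schedules {
    static List<Integer> drain(Worklist<Integer> wl) {
        for (int i = 0; i < 4; i++) wl.add(i);
        List<Integer> order = new ArrayList<>();
        while (!wl.empty()) order.add(wl.get());
        return order;
    }

    public static void main(String[] args) {
        assert drain(new StackPolicy<>()).equals(List.of(3, 2, 1, 0));
        System.out.println(drain(new RandomPolicy<>()));  // some permutation of 0..3
    }
}
```

Because the iterator only asks the work-set for "some element", swapping policies requires no change to the application code.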

22 Scheduling iterators (SPAA 2008)

Crystalline data-parallelism: DO-ALL loops
– the main scheduling concerns are locality and load-balancing
– OpenMP: static, dynamic, guided, etc.

Amorphous data-parallelism: many more issues
– conflicts
– dynamically created work
– algorithmic issues: efficiency of data structures

SPAA 2008 paper:
– a scheduling framework for exploiting amorphous data-parallelism
– generalizes the OpenMP DO-ALL loop scheduling constructs

23 Data Partitioning (ASPLOS 2008)

Partition the graph between cores

Data-centric assignment of work:
– each core gets bad triangles from its own partitions
– improves locality
– can dramatically reduce conflicts

Lock coarsening:
– associate locks with partitions

Over-decomposition:
– improves core utilization
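The partitioning and lock-coarsening scheme can be sketched as two mappings. Everything here is a hypothetical stand-in (a real system would use a graph partitioner, not the modulo function): nodes map to partitions, each partition has one coarse lock, and with an over-decomposition factor of 4 each core owns several partitions, so it can switch to another of its partitions when one is locked.

```java
public class Partitions {
    static final int CORES = 4, OVERDECOMP = 4;          // assumed factors
    static final int NUM_PARTITIONS = CORES * OVERDECOMP;

    // Stand-in for a real graph partitioner's node-to-partition map.
    static int partitionOf(int node) { return node % NUM_PARTITIONS; }

    // Contiguous blocks of partitions are owned by each core.
    static int ownerCore(int partition) { return partition / OVERDECOMP; }

    public static void main(String[] args) {
        // Lock coarsening: one lock per partition, not per graph node.
        Object[] locks = new Object[NUM_PARTITIONS];
        for (int i = 0; i < NUM_PARTITIONS; i++) locks[i] = new Object();

        int node = 37;
        int p = partitionOf(node);
        assert p == 5;               // 37 % 16
        assert ownerCore(p) == 1;    // core 1 owns partitions 4..7

        synchronized (locks[p]) {
            // process a work item whose cavity lies in this partition
        }
        System.out.println("node " + node + " -> partition " + p
                           + " -> core " + ownerCore(p));
    }
}
```

Coarse per-partition locks trade conflict precision for much cheaper conflict checks; over-decomposition recovers utilization when a core's current partition is blocked.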

24 Organization

Case study of amorphous data-parallelism
– Delaunay mesh refinement
Galois system (PLDI 2007)
– Programming model
– Baseline implementation
Galois system optimizations
– Scheduling (SPAA 2008)
– Data and computation partitioning (ASPLOS 2008)
Experimental results
Ongoing work

25 Small-scale multiprocessor results

4-processor Itanium 2
– 16 KB L1, 256 KB L2, 3 MB L3 cache

Versions:
– GAL: using a stack as the worklist
– PAR: partitioned mesh + data-centric work assignment
– LCO: locks on partitions
– OVD: over-decomposed version (factor of 4)

26 Large-scale multiprocessor results

Maverick@TACC
– 128-core Sun Fire E25K, 1 GHz
– 64 dual-core processors
– Sun Solaris

First “out-of-the-box” results
– speed-up of 20 on 32 cores for refinement

Mesh partitioning is still sequential
– the time for mesh partitioning starts to dominate after 8 processors (32 partitions)

Need parallel mesh partitioning
– ParMETIS (Karypis et al.)

27 Galois version of mesh refinement

    Mesh m = /* read in mesh */
    Set wl;
    wl.add(mesh.badTriangles());      // initialize the Set wl
    for each e in Set wl do {         // unordered Set iterator
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        m.update(c);
        wl.add(c.badTriangles());     // add new bad triangles to Set
    }

[Figure: the same code running on a partitioned work-set and a partitioned graph, managed by the Galois runtime system]

28 Results for BK Image Segmentation

Versions:
– GAL: standard Galois version
– PAR: partitioned graph
– LCO: locks on partitions
– OVD: over-decomposed version

29 Related work

Transactions
– programming model is explicitly parallel
– assumes someone else is responsible for parallelism, locality, load-balancing, and scheduling, and focuses only on synchronization
– Galois: the main concerns are parallelism, locality, load-balancing, and scheduling
– “catch-and-keep” classes can use TM for rollback, but this is probably overkill

Thread-level speculation
– not clear where to speculate in C programs, so power is wasted in useless speculation
– many schemes require extensive hardware support
– no notion of abstraction-based speculation
– no analogs of data partitioning or scheduling
– overall results are disappointing

30 Ongoing work

Case studies of irregular programs
– understand “parallelism and locality patterns” in irregular programs

Lonestar benchmark suite for irregular programs
– joint work with Calin Cascaval’s group at IBM Yorktown Heights

Optimizing the Galois runtime system
– improve performance for iterators in which the work per iteration is relatively low

Compiler analysis to reduce the overheads of optimistic parallel execution

Scalability studies
– larger numbers of cores

Distributed-memory implementation
– billion-element meshes?

Program analysis to verify assertions about class methods
– needs a semantic specification of the class

31 Summary

Irregular applications have amorphous data-parallelism
– work-list-based iterative algorithms over unordered and ordered sets

Amorphous data-parallelism may be inherently data-dependent
– pointer/shape analysis cannot work for these applications

Optimistic parallelization is essential for such applications
– analysis might be useful to optimize parallel program execution

Exploiting abstractions and high-level semantics is critical
– Galois knows about sets, ordered sets, accumulators, …

The Galois approach provides a unified view of data-parallelism in regular and irregular programs
– the baseline is optimistic parallelism
– use compiler analysis to make decisions at compile time whenever possible

