A Shape Analysis for Optimizing Parallel Graph Programs

1 A Shape Analysis for Optimizing Parallel Graph Programs
Dimitrios Prountzos (1), Keshav Pingali (1,2), Roman Manevich (2), Kathryn S. McKinley (1)
1: Department of Computer Science, The University of Texas at Austin
2: Institute for Computational Engineering and Sciences, The University of Texas at Austin

2 Motivation
Graph algorithms are ubiquitous: computational biology, social networks, computer graphics
Goal: compiler analysis for optimization of parallel graph algorithms

3 Organization
Parallelization of graph algorithms in the Galois system
– Speculative execution
– Example: Boruvka MST algorithm
Optimization opportunities
– Reduce speculation overheads
– Analysis problem: lockset shape analysis
Lockset shape analysis
– Abstract Data Type (ADT) modeling
– Hierarchy summarization abstraction
– Predicate discovery
Evaluation
– Fast and infers all available optimizations
– Optimizations give speedups of up to 12x

4 Boruvka’s Minimum Spanning Tree Algorithm
Build the MST bottom-up:
repeat {
  pick arbitrary node ‘a’
  merge with lightest neighbor ‘lt’
  add edge ‘a-lt’ to MST
} until graph is a single node
[Figure: example weighted graph on nodes a–g, before and after one merge step in which a is merged with its lightest neighbor lt]
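The repeat loop above can be tried out sequentially. The following is a minimal sketch, assuming an undirected graph stored as nested adjacency maps; the class `BoruvkaSketch` and its methods are hypothetical names for illustration and are not part of the Galois library. It contracts the graph in place, keeping the lighter weight when a merge creates parallel edges.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sequential sketch of the slide's pseudocode; not the Galois code.
final class BoruvkaSketch {

    // Adds the undirected edge u-v with weight w to an adjacency-map graph.
    static void addEdge(Map<String, Map<String, Integer>> adj, String u, String v, int w) {
        adj.computeIfAbsent(u, k -> new HashMap<>()).put(v, w);
        adj.computeIfAbsent(v, k -> new HashMap<>()).put(u, w);
    }

    // Contracts the graph in place and returns the weights of the chosen MST edges.
    static List<Integer> mstWeights(Map<String, Map<String, Integer>> adj) {
        List<Integer> mst = new ArrayList<>();
        Deque<String> worklist = new ArrayDeque<>(adj.keySet());
        while (!worklist.isEmpty()) {
            String a = worklist.poll();
            Map<String, Integer> aEdges = adj.get(a);
            if (aEdges == null || aEdges.isEmpty()) {
                continue; // a was merged into another node, or has no neighbors left
            }
            // Find the lightest neighbor lt of a.
            String lt = null;
            for (Map.Entry<String, Integer> e : aEdges.entrySet()) {
                if (lt == null || e.getValue() < aEdges.get(lt)) {
                    lt = e.getKey();
                }
            }
            mst.add(aEdges.get(lt));
            // Merge lt into a: drop edge a-lt, then redirect lt's other edges to a.
            aEdges.remove(lt);
            Map<String, Integer> ltEdges = adj.remove(lt);
            ltEdges.remove(a);
            for (Map.Entry<String, Integer> e : ltEdges.entrySet()) {
                String n = e.getKey();
                int w = e.getValue();
                adj.get(n).remove(lt);
                aEdges.merge(n, w, Integer::min);     // keep the lighter parallel edge
                adj.get(n).merge(a, w, Integer::min);
            }
            worklist.add(a); // a is active again after the merge
        }
        return mst;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> g = new HashMap<>();
        addEdge(g, "a", "b", 1);
        addEdge(g, "b", "c", 2);
        addEdge(g, "a", "c", 3);
        System.out.println(mstWeights(g)); // two edge weights that sum to 3
    }
}
```

The order of the returned weights depends on the worklist order, which matches the slide's "pick arbitrary node"; the total weight does not.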

5 Parallelism in Boruvka
Algorithm = repeated application of an operator to a graph
– Active node: node where computation is needed
– Activity: application of the operator to an active node
– Neighborhood: sub-graph read/written to perform an activity
– Unordered algorithms: active nodes can be processed in any order
Amorphous data-parallelism
– Parallel execution of activities, subject to neighborhood constraints
Neighborhoods are functions of runtime values
– Parallelism cannot be uncovered at compile time in general
[Figure: activities i1, i2, i3 in the example graph]

6 Optimistic Parallelization in Galois
Programming model
– Client code has sequential semantics
– Library of concurrent data structures
Parallel execution model
– Thread-level speculation (TLS)
– Activities executed speculatively
Conflict detection
– Each node/edge has an associated exclusive lock
– Graph operations acquire locks on read/written nodes/edges
– Lock owned by another thread ⇒ conflict ⇒ iteration rolled back
– All locks released at the end
Two main overheads
– Locking
– Undo actions
[Figure: activities i1, i2, i3]
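The conflict-detection rule above can be illustrated with a small ownership table. This is a sketch under stated assumptions: `SpeculativeLocks`, `acquire`, and `releaseAll` are hypothetical names, activities are identified by integers, and a single monitor guards the table. It is not how the Galois runtime is implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of exclusive per-object locks owned by speculative activities.
final class SpeculativeLocks {
    private final Map<Object, Integer> owner = new HashMap<>();     // object -> owning activity
    private final Map<Integer, List<Object>> held = new HashMap<>(); // activity -> objects it owns

    // Returns true if the activity now owns obj (re-acquiring is a no-op),
    // false if another activity owns it, i.e. a conflict that forces a rollback.
    synchronized boolean acquire(int activity, Object obj) {
        Integer current = owner.putIfAbsent(obj, activity);
        if (current == null) {
            held.computeIfAbsent(activity, k -> new ArrayList<>()).add(obj);
            return true;
        }
        return current == activity;
    }

    // Releases every lock of the activity, at commit or after a rollback.
    synchronized void releaseAll(int activity) {
        List<Object> objs = held.remove(activity);
        if (objs != null) {
            for (Object o : objs) {
                owner.remove(o);
            }
        }
    }
}
```

Every `acquire` call pays for a table lookup even when the activity already owns the object, which is the kind of redundant locking the analysis aims to remove.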

7 Overheads (I): Locking
Optimizations
– Redundant locking elimination
– Lock removal for iteration-private data
– Lock removal for lock domination
ACQ(P): set of definitely acquired locks per program point P
Given method call M at P: Locks(M) ⊆ ACQ(P) ⇒ redundant locking

8 Overheads (II): Undo actions
[Code: the Boruvka foreach loop, shown in full on slide 9, split into a region where the lockset grows and a region where the lockset is stable; the boundary between the two regions is the failsafe point]
Program point P is failsafe if: ∀Q : Reaches(P,Q) ⇒ Locks(Q) ⊆ ACQ(P)
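The failsafe condition can be checked directly once ACQ(P) and the per-point lock sets are known. The sketch below assumes a control-flow graph given as successor lists over integer program points, lock names given as strings, and Reaches(P,Q) read as "Q is reachable from P in one or more steps"; `FailsafeCheck` and `isFailsafe` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: checks  forall Q. Reaches(P,Q) => Locks(Q) subset of ACQ(P).
final class FailsafeCheck {

    // succ:   successor program points of each point (missing key = no successors)
    // locks:  locks that the operation at each point may acquire (missing key = none)
    // acqAtP: locks definitely held at point p
    static boolean isFailsafe(int p,
                              Map<Integer, List<Integer>> succ,
                              Map<Integer, Set<String>> locks,
                              Set<String> acqAtP) {
        Deque<Integer> stack = new ArrayDeque<>(succ.getOrDefault(p, List.of()));
        Set<Integer> seen = new HashSet<>();
        while (!stack.isEmpty()) {
            int q = stack.pop();
            if (!seen.add(q)) {
                continue; // already visited
            }
            if (!acqAtP.containsAll(locks.getOrDefault(q, Set.of()))) {
                return false; // q may acquire a lock not yet held at p
            }
            stack.addAll(succ.getOrDefault(q, List.of()));
        }
        return true;
    }
}
```

After a failsafe point no later operation can hit a conflict, so the iteration can no longer be rolled back and undo actions need not be recorded.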

9 Lockset Analysis
    GSet wl = new GSet();
    wl.addAll(g.getNodes());
    GBag mst = new GBag();
    foreach (Node a : wl) {
      // find the lightest neighbor lt of a
      Set aNghbrs = g.neighbors(a);
      Node lt = null;
      for (Node n : aNghbrs) {
        minW,lt = minWeightEdge((a,lt), (a,n));
      }
      g.removeEdge(a, lt);
      // move the edges of lt over to a, keeping the lighter weight
      Set ltNghbrs = g.neighbors(lt);
      for (Node n : ltNghbrs) {
        Edge e = g.getEdge(lt, n);
        Weight w = g.getEdgeData(e);
        Edge an = g.getEdge(a, n);
        if (an != null) {
          Weight wan = g.getEdgeData(an);
          if (wan.compareTo(w) < 0) w = wan;
          g.setEdgeData(an, w);
        } else {
          g.addEdge(a, n, w);
        }
      } // end of loop over ltNghbrs
      g.removeNode(lt);
      mst.add(minW);
      wl.add(a);
    }
Redundant locking: Locks(M) ⊆ ACQ(P)
Undo elimination: ∀Q : Reaches(P,Q) ⇒ Locks(Q) ⊆ ACQ(P)
Both conditions need ACQ(P)

10 Analysis Challenges
The usual suspects:
– Unbounded memory ⇒ undecidability
– Aliasing, destructive updates
Specific challenges:
– Complex ADTs: unstructured graphs
– Heap objects are locked
– Adapt abstraction to ADTs
We use Abstract Interpretation [CC’77]
– Balance precision and realistic performance

12 Shape Analysis Overview
[Figure: pipeline from Boruvka.java through Predicate Discovery and Shape Analysis to Optimized Boruvka.java]
ADT specifications
– Graph Spec: Graph { @rep nodes @rep edges … }
– Set Spec: Set { @rep cont … }
Concrete ADT implementations in the Galois library
– HashMap-Graph, Tree-based Set, …

13 ADT Specification
Abstract ADT state by virtual set fields
Graph Spec:
    Graph {
      @rep set nodes
      @rep set edges
      @locks(n + n.rev(src) + n.rev(src).dst + n.rev(dst) + n.rev(dst).src)
      @op( nghbrs = n.rev(src).dst + n.rev(dst).src,
           ret = new Set(cont=nghbrs) )
      Set neighbors(Node n);
    }
Client call in Boruvka.java: Set S1 = g.neighbors(n);
Assumption: implementation satisfies spec

14 Modeling ADTs
Graph Spec: as on slide 13
[Figure: concrete graph with nodes a, b, c; each edge has src and dst fields]

15 Modeling ADTs
Graph Spec: as on slide 13
[Figure: abstract state for the graph with nodes a, b, c: virtual set fields nodes and edges, edge fields src and dst, and the result set ret whose cont field holds nghbrs]
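To make the path expressions in the Graph Spec concrete, the sketch below evaluates them on an explicit edge list. It assumes that n.rev(src) denotes the edges whose src field is n (and likewise for dst), which is how the spec reads; `LockPathEval`, `Edge`, `neighbors`, and `locks` are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: evaluates the @op and @locks paths of neighbors(n) on a concrete graph.
final class LockPathEval {

    static final class Edge {
        final String src;
        final String dst;

        Edge(String src, String dst) {
            this.src = src;
            this.dst = dst;
        }
    }

    // nghbrs = n.rev(src).dst + n.rev(dst).src
    static Set<String> neighbors(String n, List<Edge> edges) {
        Set<String> nghbrs = new LinkedHashSet<>();
        for (Edge e : edges) {
            if (e.src.equals(n)) nghbrs.add(e.dst); // n.rev(src).dst
            if (e.dst.equals(n)) nghbrs.add(e.src); // n.rev(dst).src
        }
        return nghbrs;
    }

    // n + n.rev(src) + n.rev(src).dst + n.rev(dst) + n.rev(dst).src
    static Set<Object> locks(String n, List<Edge> edges) {
        Set<Object> locked = new LinkedHashSet<>();
        locked.add(n);
        for (Edge e : edges) {
            if (e.src.equals(n) || e.dst.equals(n)) {
                locked.add(e); // n.rev(src) and n.rev(dst): the incident edges
            }
        }
        locked.addAll(neighbors(n, edges)); // the endpoints on the far side
        return locked;
    }
}
```

For the edges a→b, c→a, b→c, the call on node a locks a, the two incident edges, and the neighbors b and c, but not the edge b→c.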

17 Abstraction Scheme
[Figure: sets S1 and S2 with cont fields, abstracted to (S1 ≠ S2) ∧ L(S1.cont) ∧ L(S2.cont)]
Parameterized by a set of LockPaths: L(Path) ≡ ∀o. o ∊ Path ⇒ Locked(o)
– Tracks a subset of must-be-locked objects
Abstract domain elements have the form: Aliasing-configs × 2^LockPaths × …

18 Joining Abstract States
[Figure: join of two abstract states]
Aliasing is crucial for precision
May-be-locked does not enable our optimizations
#Aliasing-configs: small constant (≤ 6)
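One way to picture a join that keeps only must-be-locked facts is sketched below. It assumes an abstract state is a map from an aliasing configuration (a string such as "a!=lt") to the set of lock paths known to hold under it; the join intersects the lock paths of configurations present on both sides and keeps one-sided configurations unchanged. This encoding and the names `AbstractJoin` and `join` are illustrative assumptions, not the analysis as implemented on TVLA.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: join of abstract states partitioned by aliasing configuration.
final class AbstractJoin {

    static Map<String, Set<String>> join(Map<String, Set<String>> s1,
                                         Map<String, Set<String>> s2) {
        Map<String, Set<String>> out = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : s1.entrySet()) {
            out.put(e.getKey(), new HashSet<>(e.getValue()));
        }
        for (Map.Entry<String, Set<String>> e : s2.entrySet()) {
            // Same aliasing configuration on both sides: a path stays only if
            // it is definitely locked in both incoming states.
            out.merge(e.getKey(), new HashSet<>(e.getValue()), (x, y) -> {
                x.retainAll(y);
                return x;
            });
        }
        return out;
    }
}
```

Keeping the configurations apart is what preserves precision here: merging "a!=lt" with "a==lt" into one set would intersect away facts that hold in only one of them.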

19 Example Invariant in Boruvka
[Code: Boruvka loop as on slide 9, with a and lt marked in the example graph]
The immediate neighbors of a and lt are locked:
(a ≠ lt) ∧ L(a) ∧ L(a.rev(src)) ∧ L(a.rev(dst)) ∧ L(a.rev(src).dst) ∧ L(a.rev(dst).src) ∧ L(lt) ∧ L(lt.rev(dst)) ∧ L(lt.rev(src)) ∧ L(lt.rev(dst).src) ∧ L(lt.rev(src).dst) ∧ …

20 Heuristics for Finding Paths
Hierarchy Summarization (HS)
– Paths of the form x.( fld )*
– Type hierarchy graph acyclic ⇒ bounded number of paths
– Preflow-Push: L(S.cont) ∧ L(S.cont.nd)
– Nodes in set S and their data are locked
[Figure: type hierarchy Set S –cont→ Node –nd→ NodeData]
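Because the type hierarchy graph is acyclic, the candidate paths x.( fld )* can be enumerated by a plain recursive walk. The sketch below assumes the hierarchy is given as a map from type name to its fields and their target types, and it lists only paths with at least one field, as in the Preflow-Push example; `HierarchyPaths` and `paths` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: enumerates field paths from a variable over an acyclic type hierarchy.
final class HierarchyPaths {

    // fields.get(T) maps each field name of type T to the type it points to.
    static List<String> paths(String var, String type, Map<String, Map<String, String>> fields) {
        List<String> out = new ArrayList<>();
        collect(var, type, fields, out);
        return out;
    }

    // Terminates because the type hierarchy graph is assumed to be acyclic.
    private static void collect(String prefix, String type,
                                Map<String, Map<String, String>> fields, List<String> out) {
        for (Map.Entry<String, String> f : fields.getOrDefault(type, Map.of()).entrySet()) {
            String path = prefix + "." + f.getKey();
            out.add(path);
            collect(path, f.getValue(), fields, out);
        }
    }
}
```

With Set –cont→ Node –nd→ NodeData and a variable S of type Set, the walk yields S.cont and S.cont.nd, the two paths in the invariant above.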

21 Footprint Graph Heuristic
Footprint Graphs (FG) [Calcagno et al. SAS’07]
– All acyclic paths from arguments of ADT method to locked objects
– Paths of the form x.( fld | rev(fld) )*
– Delaunay Refinement: L(S.cont) ∧ L(S.cont.rev(src)) ∧ L(S.cont.rev(dst)) ∧ L(S.cont.rev(src).dst) ∧ L(S.cont.rev(dst).src)
– Nodes in set S and all of their immediate neighbors are locked
Composition of HS, FG
– Preflow-Push: L(a.rev(src).ed)

23 Experimental Evaluation
Implemented on top of TVLA
– Encode abstraction by 3-Valued Shape Analysis [SRW TOPLAS’02]
Evaluation on 4 Lonestar Java benchmarks
Inferred all available optimizations
Number of abstract states practically linear in program size
Benchmark | Analysis Time (sec)
Boruvka MST | 6
Preflow-Push Maxflow | 7
Survey Propagation | 12
Delaunay Mesh Refinement | 16

24 Impact of Optimizations for 8 Threads
Machine: 8-core Intel Xeon @ 3.00 GHz

25 Related Work
Safe programmable speculative parallelism [Prabhu et al. PLDI’10]
– Focused on value speculation on ordered algorithms
– Different rollback freedom condition
Transactional Memory compiler optimizations [Harris et al. PLDI’06, Dragojevic et al. SPAA’09]
– Similar optimizations
– Don’t target rollback freedom
– Imprecise for unbounded data structures
Optimizations for parallel graph programs [Mendez-Lojo et al. PPOPP’10]
– Manual optimizations
– Failsafe subsumes cautious
Verifying conformance of ADT implementation to specification
– The Jahob project (Kuncak, Rinard, Wies et al.)

26 Conclusion
New application for static analysis
– Optimization of optimistically parallelized graph programs
Novel shape analysis
– Utilizes observations on the structure of concrete states and programming style
Enables optimizations crucial for performance

27 Thank You!

28 Backup

29 Outline of Boruvka MST Code
[Code: Boruvka loop as on slide 9, annotated with its four phases]
– Pick arbitrary worklist node
– Find lightest neighbor
– Update neighbors of lightest
– Update worklist and MST

30 Approximating Sets of Locked Objects
Reachability-based scheme
[Figure: sets S1 and S2 with cont fields, summarized as Reach(S1.cont) and Reach(S2.cont)]

31 Approximating Sets of Locked Objects
Hierarchy summarization scheme
[Figure: sets S1 and S2 with cont fields, summarized as (S1 ≠ S2) ∧ L(S1.cont) ∧ L(S2.cont)]

32 Enabling Optimizations in Galois
Graph API uses flags to enable/disable locking and storing undo actions
– removeEdge(Node src, Node dst, Flag f);
Challenge: find minimal flag per ADT method call
Solution: lockset analysis
[Figure: flags NONE, UNDO, LOCKS, ALL; addEdge(src, dst) acquires Lock(src), Lock(dst), Lock((src,dst))]
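Given the analysis results, picking the minimal flag for a call reduces to two independent checks. The sketch below assumes the four flag names shown on the slide and treats "locks already held" and "past a failsafe point" as separate conditions; the enum `Flag` and the method `minimalFlag` are hypothetical stand-ins and may not match the actual Galois API.

```java
import java.util.Set;

// Hypothetical sketch: choose the weakest flag that is still safe for one ADT method call.
final class FlagChoice {

    enum Flag { NONE, LOCKS, UNDO, ALL }

    // locksOfCall: Locks(M) for the call; acqAtCall: ACQ(P) at the call site;
    // failsafe: whether the call site lies at or after a failsafe point.
    static Flag minimalFlag(Set<String> locksOfCall, Set<String> acqAtCall, boolean failsafe) {
        boolean needLocks = !acqAtCall.containsAll(locksOfCall); // locking is not redundant
        boolean needUndo = !failsafe;                            // a rollback is still possible
        if (needLocks && needUndo) return Flag.ALL;
        if (needLocks) return Flag.LOCKS;
        if (needUndo) return Flag.UNDO;
        return Flag.NONE;
    }
}
```

For example, a removeEdge call whose locks are all in ACQ(P) but which precedes the failsafe point would get UNDO: it can skip locking but must still log its undo action.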

33 Optimization Conditions

34 Speculation Overheads and Optimizations
Source of Overhead | Optimization
Locking shared objects | Redundant locking elimination; lock elision for iteration-private data; lock domination
Backup original state for rollback | Avoid backups after failsafe points

35 Modeling ADTs
Graph Spec: as on slide 13, with the virtual set fields named ns (nodes) and es (edges)
[Figure: concrete graph with nodes a, b, c, d and edge weights 7, 2, 3, 4, 5; abstract state with fields ns, es, src, dst, ed and the result set ret whose cont field holds nghbrs]

36 Failsafe Points – Eliminating Undo Actions
[Code: Boruvka loop as on slide 9, with the call g.neighbors(lt) highlighted]
[Figure: example graph with nodes a, b, c, d and edge weights 7, 2, 3, 4, 5; legend: graph node, graph edge, edge data; a and lt marked]
Cautious operator [Mendez-Lojo et al. PPOPP’10]

37 Boruvka’s Minimum Spanning Tree Algorithm
Build the MST bottom-up:
repeat {
  pick arbitrary active node ‘a’
  merge with lightest neighbor ‘lt’
  add edge ‘a-lt’ to MST
} until graph is a single node
[Figure: example weighted graph on nodes a–g]

38 Parallelism in Boruvka’s Algorithm
Dependences between activities are functions of runtime values
Parallelism cannot be uncovered at compile time in general
Don’t-care non-determinism
– All produced MSTs are correct and optimal
[Figure: example weighted graph on nodes a–g]

39 Failsafe Points – Eliminating Undo Actions
[Code: Boruvka loop as on slide 9, with the call g.neighbors(lt) highlighted]
[Figure: example graph with nodes a, b, c, d and edge weights 7, 2, 3, 4, 5; legend: graph node, graph edge, edge data, acquired, new lock, redundant; a and lt marked]
Cautious operator [Mendez-Lojo et al. PPOPP’10]

40 Redundant Locking Example
[Code: Boruvka loop as on slide 9]
[Figure: example graph with nodes a, b, c, d and edge weights 7, 2, 3, 4, 5; legend: graph node, graph edge, edge data, acquired, new lock, redundant; a and lt marked]

41 Boruvka’s Minimum Spanning Tree Algorithm
[Code: Boruvka loop as on slide 9]
Build MST iteratively
– Pick random active node
– Contract edge with lightest neighbor
[Figure: example weighted graph on nodes a–g]

42 Parallelism in Boruvka’s Algorithm
[Code: Boruvka loop as on slide 9]
Dependences between activities are functions of runtime values
Parallelism cannot be uncovered at compile time in general
Don’t-care non-determinism
– All produced MSTs are correct and optimal
[Figure: example weighted graph on nodes a–g]

43 Abstraction of a Single State
[Figure: example graph with nodes a, b, c, d and edge weights 7, 2, 3, 4, 5, with a and lt marked: the state after the first loop in Boruvka]
We maintain | We lose
Set of definitely locked objects, denoted by lockpaths | Maybe-locked information
Aliasing of top-level variables | Cardinality of sets
Uniquely pointed-to types and objects referenced from the stack | Content sharing of multiple collections

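44 Hierarchy Summarization Intuition
[Figure: type hierarchy graph over the types Graph, Set, Weight, Node, Edge, Iterator, GSet, GBag, and Void, with field edges ns, es, src, dst, cont, ed, nd, gcont, bcont, all, and past/at/future, and with the program variables g, aNghbrs, ltNghbrs, nIter, a, lt, n, w, wan, minW, e, an, wl, and mst attached to their types]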

