Parallelizing Irregular Applications through the Exploitation of Amorphous Data-parallelism
Keshav Pingali (UT Austin), Mario Mendez-Lojo (UT Austin), Donald Nguyen (UT Austin)

Objectives
What are irregular applications?
– how are they different from regular applications?
– why are they so hard to parallelize?
Is there parallelism in irregular applications?
– operator formulation of algorithms
– amorphous data-parallelism (ADP)
– Galois programming model and baseline execution model
– ParaMeter tool for estimating ADP in Galois programs
Is there structure in irregular algorithms?
– structural analysis of irregular algorithms
– exploiting structure for efficient parallel implementation
– Galois system
What are the open research problems?

Schedule
1:30 PM-3:00 PM
– Irregular applications, amorphous data-parallelism, structure (Keshav Pingali)
3:00 PM-3:30 PM
– Break
3:30 PM-4:15 PM
– ParaMeter demo (Donald Nguyen)
4:15 PM-5:00 PM
– Galois demo (Mario Mendez-Lojo)

Irregular applications

Definitions
Regular applications
– key data structures are vectors and dense matrices
– simple access patterns: (e.g.) array indices are affine functions of for-loop indices
– dependences are mostly independent of runtime values
– examples: MMM, Cholesky and LU factorizations, stencil codes, FFT, …
Irregular applications
– key data structures are lists, priority queues, trees, DAGs, and graphs, usually implemented using pointers or references
– complex access patterns
– dependences are mostly functions of runtime values
– examples: see next slide

Examples (application/domain: algorithms)
AI: Bayesian inference, survey propagation
Compilers: iterative and elimination-based dataflow algorithms
Functional interpreters: graph reduction, static and dynamic dataflow
Maxflow: preflow-push, augmenting paths
Computational biology: protein homology networks, phylogenetic reconstruction
Event-driven simulation: Chandy-Misra-Bryant, Jefferson Timewarp
Meshing: generation/refinement/partitioning
Social networks: connecting the dots
Data-mining: clustering

Problem
Handwritten parallel implementations exist for some irregular problems
No systematic way of thinking about parallelism in irregular applications
– no abstractions for thinking about parallelism
– many mechanisms exist, but it is not clear when these mechanisms are useful:
  static parallelization (dependence analysis and restructuring)
  inspector-executor parallelization
  interference graphs
  optimistic parallelization
  …
Each new irregular application is a new phenomenon, seemingly unrelated to other applications
Need a science of parallel programming
– analysis: framework for thinking about parallelism and locality in applications
– synthesis: guide for producing efficient parallel implementations

Science of electromagnetism
Seemingly unrelated phenomena
Unifying abstractions
Specialized models that exploit structure

Organization of talk
Seemingly unrelated parallel algorithms and data structures
– Stencil codes
– Delaunay mesh refinement
– Event-driven simulation
– Graph reduction of functional languages
– …
Unifying abstractions
– Operator formulation of algorithms
– Amorphous data-parallelism
– Galois programming model
– Baseline parallel implementation
Specialized implementations that exploit structure
– Structure of algorithms
– Optimized compiler and runtime system support for different kinds of structure

Seemingly unrelated algorithms

Stencil computation: Jacobi iteration
Finite-difference method for solving PDEs
– discrete representation of domain: grid
Values at interior points are updated using values at neighbors
– values at boundary points are fixed
Data structure:
– dense arrays
Parallelism:
– values at next time step can be computed simultaneously
– parallelism is not dependent on runtime values
Compiler can find the parallelism
– spatial loops are DO-ALL loops

// Jacobi iteration with 5-point stencil
// initialize array A
for time = 1, nsteps
  for <i,j> in [2,n-1]x[2,n-1]:
    temp(i,j) = 0.25 * (A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1))
  for <i,j> in [2,n-1]x[2,n-1]:
    A(i,j) = temp(i,j)

[Figure: arrays A and temp at time steps t_n and t_n+1]
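The pseudocode above translates directly into Java; here is a minimal runnable sketch (grid size, step count, and initialization are illustrative choices, not from the slide):

import java.util.Arrays;

// Minimal Java sketch of the 5-point Jacobi iteration above.
// n, nsteps, and the initial values are illustrative; boundary
// values stay fixed, as on the slide.
public class Jacobi {
    public static void main(String[] args) {
        int n = 8, nsteps = 10;
        double[][] a = new double[n][n];
        double[][] temp = new double[n][n];
        a[n / 2][n / 2] = 1.0;                    // arbitrary interior initialization
        for (int t = 0; t < nsteps; t++) {
            for (int i = 1; i < n - 1; i++)       // both spatial loops are DO-ALL:
                for (int j = 1; j < n - 1; j++)   // iterations are independent
                    temp[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    a[i][j] = temp[i][j];
        }
        System.out.println(Arrays.toString(a[n / 2]));
    }
}

Because the dependence structure is fixed by the grid, a compiler can parallelize the two inner loop nests without any runtime information.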

Delaunay Mesh Refinement
Iterative refinement to remove badly shaped triangles:

while there are bad triangles do {
  pick a bad triangle;
  find its cavity;
  retriangulate cavity;  // may create new bad triangles
}

Don't-care non-determinism:
– final mesh depends on the order in which bad triangles are processed
– applications do not care which mesh is produced
Data structure:
– graph in which nodes represent triangles and edges represent triangle adjacencies
Parallelism:
– bad triangles with cavities that do not overlap can be processed in parallel
– parallelism is dependent on runtime values, so compilers cannot find this parallelism
– (Gary Miller et al.) at runtime, repeatedly build an interference graph and find maximal independent sets for parallel execution

Mesh m = /* read in mesh */;
WorkList wl;
wl.add(m.badTriangles());
while (true) {
  if (wl.empty()) break;
  Element e = wl.get();
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);   // determine new cavity
  c.expand();
  c.retriangulate();          // re-triangulate region
  m.update(c);                // update mesh
  wl.add(c.badTriangles());
}

Event-driven simulation
Stations communicate by sending messages with time-stamps on FIFO channels
Stations have internal state that is updated when a message is processed
Messages must be processed in time-order at each station
Data structure:
– messages in event queue, sorted in time-order
Parallelism:
– activities created in the future may interfere with current activities, so the interference-graph approach will not work
– Jefferson Timewarp: a station can fire when it has an incoming message on any edge; requires roll-back if a speculative conflict is detected
– Chandy-Misra-Bryant: conservative event-driven simulation; requires null messages to avoid deadlock

[Figure: three stations A, B, C exchanging time-stamped messages]
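Sequentially, the event loop is just a priority queue ordered by time-stamp; a minimal Java sketch (the Event type and the particular time-stamps are illustrative, loosely following the slide's figure):

import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch of sequential event-driven simulation: messages are processed
// in global time order. Event and the station names are illustrative,
// not part of any Galois API.
public class EventSim {
    record Event(long time, String station) {}

    public static void main(String[] args) {
        PriorityQueue<Event> queue =
            new PriorityQueue<>(Comparator.comparingLong(Event::time));
        queue.add(new Event(2, "A"));    // time-stamped messages in flight
        queue.add(new Event(5, "A"));
        queue.add(new Event(3, "B"));
        queue.add(new Event(4, "B"));
        queue.add(new Event(6, "C"));
        while (!queue.isEmpty()) {
            Event e = queue.poll();      // earliest event: the only safe one sequentially
            System.out.println("t=" + e.time() + ": process message at station " + e.station());
            // processing may update station state and enqueue new, later events
        }
    }
}

The parallelization problem is exactly that poll() serializes everything: Timewarp fires stations speculatively and rolls back on conflicts, while Chandy-Misra-Bryant fires only events proven safe by null messages.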

Remarks
Diverse algorithms and data structures
– relatively few algorithms use dense arrays
– more common: graphs, trees, lists, priority queues, …
Exploiting parallelism is very complex
– DMR: (Miller et al.) interference graph + maximal independent sets
– event-driven simulation: (Jefferson) Timewarp algorithm
Complexity arises from several sources:
– parallelism can be dependent on runtime values (DMR, event-driven simulation, graph reduction, …)
– don't-care non-determinism, which has nothing to do with concurrency (DMR, graph reduction)
– activities created in the future may interfere with current activities (event-driven simulation, …)

Organization of talk
Seemingly unrelated parallel algorithms and data structures
– Stencil codes
– Delaunay mesh refinement
– Event-driven simulation
– Graph reduction of functional languages
– …
Unifying abstractions
– Operator formulation of algorithms
– Amorphous data-parallelism
– Galois programming model
– Baseline parallel implementation
Specialized implementations that exploit structure
– Structure of algorithms
– Optimized compiler and runtime system support for different kinds of structure
Research topics

Unifying abstractions
Should provide a model of parallelism in irregular algorithms
Ideally, a unified treatment of parallelism in regular and irregular algorithms
– parallelism in regular algorithms should emerge as a special case of the general model
– (cf.) correspondence principles in physics
Abstractions should be effective
– it should be possible to write an interpreter that executes algorithms in parallel

Traditional abstraction
Dependence graph
– nodes are computations
– edges are dependences
Parallelism
– width of the computation graph
– critical path: longest path from START to END
Effective parallel computation graph model
– dataflow model of Dennis, Arvind et al.
Inadequate for irregular applications
– dependences between computations are functions of runtime values
– don't-care non-determinism
– conflicting work may be created dynamically
– …
Data structures play almost no role in this abstraction
– in most programs, parallelism comes from data-parallelism (concurrent operations on data structure elements)
New abstraction
– data-centric: data structures play a central role
– we will use the graph ADT to illustrate the concepts
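In this traditional abstraction, the critical path is mechanical to compute once the graph is known; a minimal sketch over a DAG in adjacency-list form (node count along the longest path, relaxed in topological order):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Critical path of a dependence DAG: with unbounded processors, execution
// time is the length of the longest START-to-END path. Relax nodes in
// topological order; dist[v] = edges on the longest path ending at v.
class CriticalPath {
    static int length(List<List<Integer>> dag) {
        int n = dag.size();
        if (n == 0) return 0;
        int[] indeg = new int[n];
        for (List<Integer> succs : dag)
            for (int s : succs) indeg[s]++;
        Deque<Integer> ready = new ArrayDeque<>();
        for (int v = 0; v < n; v++)
            if (indeg[v] == 0) ready.add(v);
        int[] dist = new int[n];
        int longest = 0;
        while (!ready.isEmpty()) {
            int v = ready.poll();
            longest = Math.max(longest, dist[v]);
            for (int s : dag.get(v)) {
                dist[s] = Math.max(dist[s], dist[v] + 1);
                if (--indeg[s] == 0) ready.add(s);
            }
        }
        return longest + 1;   // count nodes, not edges
    }
}

The point of the slide is that for irregular applications this graph is not available before execution, so the computation above has nothing to run on.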

Graph ADT
Parallel algorithms are usually introduced using matrices and matrix algorithms
More useful abstraction for us: graph
– abstract data type (ADT): some number of nodes and edges; edges may be directed; nodes/edges may be labeled with values
– concrete representation may be different for different applications, platforms, and graphs
– application program written in terms of the ADT (cf. relational databases)
Graphs are usually sparse in most applications
– even the Jacobi example: the 2D grid is a sparse graph in which each internal node has four neighbors!
Special graphs
– cliques: isomorphic to square dense matrices
– labeled graphs w/o edges: isomorphic to sets/multisets

[Figure: graph ADT vs. concrete representation]
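A minimal sketch of the separation the slide describes (the interface and method names are hypothetical, not the actual Galois library API):

import java.util.ArrayList;
import java.util.List;

// Illustrative graph ADT and one concrete representation. The application
// programs against Graph<T>; the representation behind it can change per
// application, platform, or graph, as with schemas in relational databases.
interface Graph<T> {
    int addNode(T label);                  // returns a node handle
    void addEdge(int src, int dst);
    Iterable<Integer> neighbors(int node);
    T label(int node);
}

class AdjListGraph<T> implements Graph<T> {
    private final List<T> labels = new ArrayList<>();
    private final List<List<Integer>> adj = new ArrayList<>();

    public int addNode(T label) {
        labels.add(label);
        adj.add(new ArrayList<>());
        return labels.size() - 1;
    }
    public void addEdge(int src, int dst) { adj.get(src).add(dst); }
    public Iterable<Integer> neighbors(int node) { return adj.get(node); }
    public T label(int node) { return labels.get(node); }
}

Swapping AdjListGraph for, say, a compressed-sparse-row representation leaves application code untouched.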

Operator formulation of algorithms
Algorithm = repeated application of an operator to a graph
– active element: node or edge where computation is needed
  DMR: nodes representing bad triangles
  event-driven simulation: station with an incoming message
– activity: applying the operator to an active element
– neighborhood: set of nodes and edges read/written to perform the computation
  DMR: cavity of the bad triangle
  event-driven simulation: the station
  usually distinct from the active node's neighbors in the graph
– ordering: order in which active elements must be executed in a sequential implementation
  any order (Jacobi, DMR, graph reduction): unordered algorithms
  some problem-dependent order (event-driven simulation): ordered algorithms

[Figure: graph with active nodes i1-i5 and their neighborhoods]
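The formulation amounts to a worklist loop; a minimal sketch of its generic shape (the Operator interface is an illustrative placeholder, not Galois code):

import java.util.Deque;

// Generic shape of an unordered algorithm in the operator formulation:
// pick an active element, apply the operator to its neighborhood, and
// add any newly activated elements back onto the worklist.
class OperatorLoop {
    interface Operator<A> {
        Iterable<A> apply(A activeElement);   // returns newly activated elements
    }

    static <A> void run(Deque<A> worklist, Operator<A> op) {
        while (!worklist.isEmpty()) {
            A e = worklist.poll();            // any choice is legal in unordered algorithms
            for (A created : op.apply(e))     // operator reads/writes only e's neighborhood
                worklist.add(created);
        }
    }
}

For ordered algorithms the Deque becomes a priority queue, and the choice on each iteration is forced.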

Parallelism
Amorphous data-parallelism (ADP)
– active nodes can be processed in parallel, subject to neighborhood constraints and ordering constraints
Unordered algorithms
– activities are independent if their neighborhoods do not overlap
– independent activities can be processed in parallel
– more generally: it is enough that elements in the intersection of neighborhoods are not written; commutativity relations can be exploited as well
– how do we find independent activities?
Ordered algorithms
– independence is not enough
– how do we determine what is safe to execute w/o violating ordering?

[Figure: active nodes i1-i5 with overlapping neighborhoods; event-driven simulation stations A, B, C]
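The baseline independence test for unordered algorithms is just set disjointness of neighborhoods; a minimal sketch (node-id sets are an illustrative encoding):

import java.util.Collections;
import java.util.Set;

// Conservative independence test: two activities may run in parallel if
// their neighborhoods share no elements. Weaker tests allow overlap that
// is read-only, or operations that commute.
class Independence {
    static boolean independent(Set<Integer> nbhd1, Set<Integer> nbhd2) {
        return Collections.disjoint(nbhd1, nbhd2);
    }
}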

Galois programming model (PLDI 2007)
Program written in terms of the abstractions in the model
Two-level programming model
Joe programmers
– sequential, OO model
– Galois set iterators for iterating over unordered and ordered sets of active elements

for each e in Set S do B(e)
– evaluate B(e) for each element in set S
– no a priori order on iterations
– set S may get new elements during execution

for each e in OrderedSet S do B(e)
– evaluate B(e) for each element in set S
– perform iterations in the order specified by the OrderedSet
– set S may get new elements during execution

Stephanie programmers
– Galois concurrent data structure library
– (Wirth) Algorithms + Data structures = Programs

DMR using Galois iterators:

Mesh m = /* read in mesh */;
Set ws;
ws.add(m.badTriangles());            // initialize ws
for each tr in Set ws do {           // unordered Set iterator
  if (tr no longer in mesh) continue;
  Cavity c = new Cavity(tr);
  c.expand();
  c.retriangulate();
  m.update(c);
  ws.add(c.badTriangles());          // add new bad triangles
}

Galois parallel execution model

[Figure: master thread running a Joe program (main with Galois iterators) over concurrent data structures]

Parallel execution model:
– shared-memory
– optimistic execution of Galois iterators
Implementation:
– master thread begins execution of the program
– when it encounters an iterator, worker threads help by executing iterations concurrently
– several work-distribution strategies implemented in Galois (cf. OpenMP)
– barrier synchronization at end of iterator
Independence of neighborhoods:
– software TM variety
– logical locks on nodes and edges
Ordering constraints for the ordered set iterator:
– execute iterations out of order but commit in order
– cf. out-of-order CPUs

ParaMeter tool (PPoPP 2009)
Idealized execution model:
– unbounded number of processors
– applying the operator at an active node takes one time step
– execute a maximal set of active nodes, subject to neighborhood and ordering constraints
Measures amorphous data-parallelism in irregular program execution
Useful as an analysis tool
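One way to realize the idealized model is sketched below; it simplifies by ignoring newly created activities and taking each activity's neighborhood as given (assumptions of the sketch; ParaMeter itself profiles real Galois executions):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// ParaMeter-style measurement sketch: at each time step, execute a
// maximal set of non-conflicting activities and record its size; the
// per-step counts form the available-parallelism profile.
class ParallelismProfile {
    static List<Integer> profile(Deque<Set<Integer>> active) {
        List<Integer> perStep = new ArrayList<>();
        while (!active.isEmpty()) {
            Set<Integer> touched = new HashSet<>();
            Deque<Set<Integer>> deferred = new ArrayDeque<>();
            int executed = 0;
            for (Set<Integer> nbhd : active) {
                if (Collections.disjoint(nbhd, touched)) {
                    touched.addAll(nbhd);    // runs in this step
                    executed++;
                } else {
                    deferred.add(nbhd);      // conflicts: deferred to a later step
                }
            }
            perStep.add(executed);
            active = deferred;
        }
        return perStep;
    }
}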

Example: DMR
Input mesh:
– produced by Triangle (Shewchuk)
– 550K triangles
– roughly half are badly shaped
Available parallelism:
– how many non-conflicting triangles can be expanded at each time step?
Parallelism intensity:
– what fraction of the total number of bad triangles can be expanded at each step?

Summary of ADP abstraction
Old abstraction: dependence graphs
– inadequate for most irregular algorithms
New abstraction: operator formulation
– active elements
– neighborhoods
– ordering of active elements
Baseline execution model
– Galois programming model: sequential, OO; uses abstractions in the model
– optimistic parallel execution
– conflict detection scheme: many choices (exclusive locks, read/write locks, commutativity relations)
Amorphous data-parallelism
– generalizes conventional data-parallelism

Organization of talk
Seemingly unrelated parallel algorithms and data structures
– Stencil codes
– Delaunay mesh refinement
– Event-driven simulation
– Graph reduction of functional languages
– …
Unifying abstractions
– Operator formulation of algorithms
– Amorphous data-parallelism
– Galois programming model
– Baseline parallel implementation
Specialized implementations that exploit structure
– Structure of algorithms
– Optimized compiler and runtime system support for different kinds of structure
Ongoing work

Structure in irregular algorithms
Baseline implementation is general but usually inefficient
– (e.g.) dynamic scheduling of iterations is not needed for stencil codes, since the grid structure is known at compile-time
– (e.g.) hand-written parallel implementations of DMR and stencil codes do not buffer updates to the neighborhood until the commit point
Efficient execution requires exploiting structure in algorithms and data structures
How do we talk about structure in algorithms?
– previous approaches: like descriptive biology
  Mattson et al. book
  Parallel programming patterns (PPP): Snir et al.
  Berkeley motifs: Patterson, Yelick, et al.
– our approach: like molecular biology
  structural analysis of algorithms, based on the amorphous data-parallelism framework

Structural analysis of irregular algorithms
Irregular algorithms are classified along several dimensions:
– topology: general graph, grid, tree
– operator: morph (refinement, coarsening, general), local computation, reader
– ordering: unordered, ordered
– identification of active elements: topology-driven, data-driven
Examples:
– Jacobi: topology grid, operator local computation, ordering unordered
– DMR, graph reduction: topology graph, operator morph, ordering unordered
– Event-driven simulation: topology graph, operator local computation, ordering ordered

Exploiting structure to reduce overheads

Cautious operators (PPoPP 2010)
Cautious operator implementation:
– reads all the elements in its neighborhood before modifying any of them
– (e.g.) Delaunay mesh refinement
Algorithm structure:
– cautious operator + unordered active elements
Optimization: optimistic execution w/o buffering
– grab locks on elements during the read phase
  conflict: someone else has the lock, so release your locks
– once the update phase begins, no new locks will be acquired
  update in place w/o making copies: zero-buffering
– note: this is not two-phase locking
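A minimal sketch of the zero-buffering protocol, simplified to take the neighborhood as given rather than discovering it element by element (Element and the locking discipline are illustrative, not the Galois runtime API):

import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Optimistic execution of a cautious operator without buffering: lock
// elements during the read phase; on conflict, release everything and
// retry later. After the read phase no new locks are needed, so the
// update phase can modify elements in place (zero-buffering).
class CautiousExecutor {
    static class Element {
        final ReentrantLock lock = new ReentrantLock();
        double value;
    }

    // returns true if the activity committed, false if it must be retried
    static boolean tryActivity(List<Element> neighborhood) {
        int acquired = 0;
        for (Element e : neighborhood) {            // read phase: lock as you read
            if (!e.lock.tryLock()) {
                for (int i = 0; i < acquired; i++)  // conflict: back off, release all
                    neighborhood.get(i).lock.unlock();
                return false;
            }
            acquired++;
        }
        try {
            for (Element e : neighborhood)          // update phase: in place, no copies
                e.value += 1.0;                     // stand-in for the real operator
            return true;
        } finally {
            for (Element e : neighborhood) e.lock.unlock();
        }
    }
}

Unlike two-phase locking, a failed activity releases its locks and simply re-executes from scratch; no undo log is needed because nothing was written before all locks were held.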

Eliminating speculation
Coordinated execution of activities is possible if we can build the dependence graph:
– Run-time scheduling: cautious operator + unordered active elements
  execute all activities partially to determine their neighborhoods
  create the interference graph and find an independent set of activities
  execute the independent set of activities in parallel w/o synchronization
– Just-in-time scheduling: local computation + topology-driven
  (e.g.) tree walks, sparse MVM
  inspector-executor approach
– Compile-time scheduling: previous case + graph is known at compile-time
  (e.g.) Jacobi, FFT, MMM, Cholesky
  make all scheduling decisions at compile-time
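For the run-time scheduling case, a sketch of the scheduling step itself: given the neighborhoods discovered by partial execution, pick a greedy maximal independent set of the interference graph (illustrative code, not the Galois scheduler):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Activities whose neighborhoods overlap interfere; a maximal independent
// set of the interference graph can run in parallel w/o synchronization.
class RuntimeScheduler {
    static List<Integer> independentSet(List<Set<Integer>> neighborhoods) {
        List<Integer> chosen = new ArrayList<>();
        Set<Integer> touched = new HashSet<>();
        for (int a = 0; a < neighborhoods.size(); a++) {
            if (Collections.disjoint(neighborhoods.get(a), touched)) {
                chosen.add(a);                      // no overlap with activities chosen so far
                touched.addAll(neighborhoods.get(a));
            }
        }
        return chosen;
    }
}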

DMR Results
Input: 0.5M triangles, 0.25M bad triangles
Best serial variant: locallifo.flagopt
Serial time: 17002 ms
Best parallel time: 3745 ms
Best speedup: 4.5X
Platform: 2 x Xeon X5570 (4 cores, 2.93 GHz), Java 1.6.0_0-b11, Linux 2.6.24-27 x86_64, 20 GB heap size
Mario and Donald will talk about the Galois programming model and optimizations

New C++ implementation
24-core machine: 4 x 6-core Intel Xeon, 128 GB RAM

Andersen-style points-to analysis (OOPSLA 2010)
Points-to analysis:
– infer the variables that pointers in the program may point to
Results:
– input: Linux kernel
– sequential implementation in C++
– chunked FIFO scheduling
– speedup: 3.75x (8-core Xeon)
– sequential phases limit speedup

FAQs
How is this different from TM?
Programming model:
– TM: explicitly parallel (threads)
  transactions: a synchronization mechanism for threads
  mostly memory-level conflict detection + boosting
– Galois: Joe programs are sequential OO programs
  ADT-level conflict detection
Where do threads come from?
– TM: someone else's problem
Focus:
– Galois project: sources of parallelism in algorithms, structure of parallelism in algorithms, (Wirth) Algorithms + Data structures = Programs
– Galois: focus on algorithms and algorithmic structure; TM: focus on concurrent data structures
– Galois: mechanisms customized to algorithm structure; few algorithms require full-blown speculation/optimistic parallelization

FAQs (contd.)
How is this different from TLS?
Programming model:
– Galois: separation between the ADT and its implementation is critical
  permits separation of the Joe and Stephanie layers (cf. relational databases)
  permits more aggressive conflict detection schemes, like commutativity relations
– TLS: FORTRAN/C, so no separation between ADT and implementation
– Galois: don't-care non-determinism plays a central role
– TLS: FORTRAN/C, so only ordered algorithms
Exploiting algorithmic structure:
– Galois exploits algorithmic structure to reduce speculative overheads and to focus speculation only where needed

Summary
Seemingly unrelated parallel algorithms and data structures
– Stencil codes
– Delaunay mesh refinement
– Event-driven simulation
– …
Unifying abstractions
– Operator formulation of algorithms
– Amorphous data-parallelism
– Galois programming model
– Baseline parallel implementation
Specialized implementations that exploit structure
– Structure of algorithms
– Optimized compiler and runtime system support for different kinds of structure

Schedule
1:30 PM-3:00 PM
– Irregular algorithms, amorphous data-parallelism, structure (Keshav Pingali)
3:00 PM-3:30 PM
– Break
3:30 PM-4:15 PM
– ParaMeter (Donald Nguyen)
4:15 PM-4:50 PM
– Galois (Mario Mendez-Lojo)
4:50 PM-5:00 PM
– Research problems (Keshav Pingali)

Research ideas
Application studies:
– Lonestar benchmark suite for irregular programs: http://iss.ices.utexas.edu/lonestar/
Development of the ADP model:
– locality
– intra-operator parallelism
Identifying algorithmic structure useful in parallel implementations:
– (e.g.) refinement operators (see Mario's paper in OOPSLA 2010)
Language/compiler analysis:
– how do you find/communicate structure to the implementation?
– (e.g.) use program analysis to find cautious operators (Dimitris, POPL 2011)
Specializing data structure implementations to particular algorithms:
– can this be done semi-automatically?
– programmer writes a sequential implementation of the ADT, and a tool generates the thread-safe version for use in Galois programs?

Research ideas (contd.)
Auto-tuning:
– concrete implementations for graphs
– scheduling strategies
– contention-management policies
– …
Load-balancing:
– data-centric work-stealing
– see Milind Kulkarni's paper in SPAA 2010
Other execution models:
– GPUs
– distributed-memory machines
Architecture:
– existing core designs use lots of real estate for caches
– may not be well-suited for irregular applications: little spatial and temporal locality, relatively small amount of computation per memory access
– one idea: use multi-threading to hide latency; an NVIDIA GPU-like architecture for irregular algorithms?
Hardware-software codesign:
– can we generate custom cores for particular applications to exploit ADP?

Science of Parallel Programming
Seemingly unrelated algorithms
Unifying abstractions
Specialized models that exploit structure

Objectives
1. Irregular applications
– what are irregular applications?
– why are they hard to parallelize?
– is there parallelism in irregular applications?
2. Parallelism in irregular applications
– amorphous data-parallelism (ADP)
– Galois programming model and baseline execution model
– ParaMeter tool for estimating ADP in Galois programs
3. Structure in irregular algorithms
– structural analysis of irregular algorithms
– exploiting structure for efficient parallel implementation
– Galois system
4. Research opportunities

