Galois System Tutorial
Mario Méndez-Lojo, Donald Nguyen



Writing Galois programs
- Galois data structures
  - choosing the right implementation
  - API: basic, flags (advanced)
- Galois iterators
- Scheduling
  - assigning work to threads

Motivating example: spanning tree
- Compute a spanning tree of an undirected graph
- Parallelism comes from independent edges
- The release contains minimum spanning tree examples: Borůvka, Prim, Kruskal

Spanning tree: pseudo-code

    // create the graph, initialize the worklist and the spanning tree
    Graph graph = read graph from file
    Node startNode = pick random node from graph
    startNode.inSpanningTree = true
    Worklist worklist = create worklist containing startNode   // elements can be processed in any order
    List result = create empty list
    foreach src : worklist
        foreach Node dst : src.neighbors
            if not dst.inSpanningTree              // neighbor not processed?
                dst.inSpanningTree = true
                Edge edge = new Edge(src, dst)
                result.add(edge)                   // add the edge to the solution
                worklist.add(dst)                  // add the node to the worklist
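The pseudo-code above can be sketched as a runnable serial program using plain java.util collections; the adjacency-list representation and the names here are illustrative, not the Galois API.

```java
import java.util.*;

public class SerialSpanningTree {
    // Compute a spanning tree of a connected undirected graph given as
    // adjacency lists; returns the list of tree edges as (src, dst) pairs.
    static List<int[]> spanningTree(Map<Integer, List<Integer>> graph, int start) {
        Set<Integer> inTree = new HashSet<>();
        Deque<Integer> worklist = new ArrayDeque<>();
        List<int[]> result = new ArrayList<>();
        inTree.add(start);
        worklist.push(start);                  // LIFO scheduling, as in the serial code
        while (!worklist.isEmpty()) {
            int src = worklist.pop();
            for (int dst : graph.get(src)) {
                if (inTree.add(dst)) {         // neighbor not processed yet?
                    result.add(new int[]{src, dst});
                    worklist.push(dst);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A 4-node cycle: 0-1, 1-2, 2-3, 3-0
        Map<Integer, List<Integer>> g = new HashMap<>();
        g.put(0, List.of(1, 3));
        g.put(1, List.of(0, 2));
        g.put(2, List.of(1, 3));
        g.put(3, List.of(2, 0));
        List<int[]> tree = spanningTree(g, 0);
        // A spanning tree over n nodes always has n - 1 edges
        System.out.println(tree.size());  // prints 3
    }
}
```

Because the `inTree.add(dst)` check both tests and marks a node, each node enters the tree exactly once, which is the property the parallel version must preserve transactionally.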

Outline
1. Serial algorithm
   - Galois data structures: choosing the right implementation, basic API
2. Galois (parallel) version
   - Galois iterators
   - scheduling: assigning work to threads
3. Optimizations
   - Galois data structures: advanced API (flags)

Galois data structures
- "Galoized" implementations
  - concurrent
  - transactional semantics
- Also, serial implementations
- galois.object package: Graph, GMap, GSet, ...

Graph API (class diagram)
- Graph<N>: createNode(data: N), add(node: GNode), remove(node: GNode), addNeighbor(s: GNode, d: GNode), removeNeighbor(s: GNode, d: GNode), ...
- GNode<N>: setData(data: N), getData()
- ObjectGraph<N,E> (implementations: ObjectMorphGraph, ObjectLocalComputationGraph): addEdge(s: GNode, d: GNode, data: E), setEdgeData(s: GNode, d: GNode, data: E), ...
- Mappable<T>: map(closure: LambdaVoid<T>), map(closure: Lambda2Void<...>), ...

Mappable interface
Implicit iteration over collections of type T:

    interface Mappable<T> {
        void map(LambdaVoid<T> body);
    }

LambdaVoid = closure:

    interface LambdaVoid<T> {
        void call(T arg);
    }

Graph and GNode are Mappable:
- graph.map(body): "apply the closure once per node in the graph"
- node.map(body): "apply the closure once per neighbor of this node"
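The closure-based iteration style above can be sketched in plain Java; these are illustrative analogues of the interfaces on the slide, not the actual Galois classes.

```java
import java.util.*;

public class MappableDemo {
    // Analogues of the interfaces shown on the slide (names assumed from it)
    interface LambdaVoid<T> { void call(T arg); }
    interface Mappable<T> { void map(LambdaVoid<T> body); }

    // Wrap a list so it exposes Mappable-style implicit iteration
    static <T> Mappable<T> of(List<T> items) {
        return body -> { for (T item : items) body.call(item); };
    }

    public static void main(String[] args) {
        List<Integer> sums = new ArrayList<>();
        Mappable<Integer> nodes = of(List.of(1, 2, 3));
        nodes.map(x -> sums.add(x * 10));   // the closure runs once per element
        System.out.println(sums);           // prints [10, 20, 30]
    }
}
```

Handing the loop body to the collection (instead of iterating externally) is what lets a runtime like Galois interpose scheduling and conflict detection around each `call`.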

Spanning tree: serial code

    // graphs are created using the builder pattern
    Graph graph = new MorphGraph.GraphBuilder().create()
    GNode startNode = Graphs.getRandom(graph)     // graph utilities
    startNode.inSpanningTree = true
    Stack worklist = new Stack(startNode)         // LIFO scheduling
    List result = new ArrayList()
    while !worklist.isEmpty()
        src = worklist.pop()
        // for every neighbor of the active node
        src.map(new LambdaVoid() {
            void call(GNode dst) {
                NodeData dstData = dst.getData()
                if !dstData.inSpanningTree        // has the node been processed?
                    dstData.inSpanningTree = true
                    result.add(new Edge(src, dst))
                    worklist.add(dst)
            }
        })

Outline
1. Serial algorithm
   - Galois data structures: choosing the right implementation, basic API
2. Galois (parallel) version
   - Galois iterators
   - scheduling: assigning work to threads
3. Optimizations
   - Galois data structures: advanced API (flags)

Galois iterators

    static <T> void GaloisRuntime.foreach(
        Iterable<T> initial,                     // initial worklist
        Lambda2Void<T, ForeachContext<T>> body,  // closure applied to each active element
        Rule schedule)                           // scheduling policy

- This is the unordered iterator; GaloisRuntime also provides ordered iterators, runtime statistics, etc.
- Upon foreach invocation:
  - threads are spawned
  - transactional semantics are guaranteed: conflicts and rollbacks are transparent to the user
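The unordered-iterator pattern can be sketched in plain Java: a pool of workers drains a shared worklist, and the operator may push new active elements through a context. This is a minimal sketch of the scheduling skeleton only; it deliberately omits the conflict detection and rollback that the Galois runtime adds, and all names are illustrative.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;

public class ForeachDemo {
    // The operator uses the context to add newly discovered work
    interface Context<T> { void add(T item); }

    static <T> void foreachUnordered(Collection<T> initial,
                                     BiConsumer<T, Context<T>> body,
                                     int threads) throws InterruptedException {
        ConcurrentLinkedQueue<T> worklist = new ConcurrentLinkedQueue<>(initial);
        AtomicInteger pending = new AtomicInteger(initial.size());
        Context<T> ctx = item -> { pending.incrementAndGet(); worklist.add(item); };
        Runnable worker = () -> {
            while (pending.get() > 0) {               // work remains or is in flight
                T item = worklist.poll();
                if (item == null) { Thread.onSpinWait(); continue; }
                try { body.accept(item, ctx); }
                finally { pending.decrementAndGet(); }
            }
        };
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) pool.submit(worker);
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // Each active element n spawns n-1 until reaching 0, so seed 3
        // produces the items 3, 2, 1, 0 in some order.
        AtomicInteger processed = new AtomicInteger();
        foreachUnordered(List.of(3), (n, ctx) -> {
            processed.incrementAndGet();
            if (n > 0) ctx.add(n - 1);
        }, 4);
        System.out.println(processed.get());  // prints 4
    }
}
```

The `pending` counter is the termination detector: it counts items that are queued or being processed, so workers exit only when no operator can still create work.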

Scheduling
- Good scheduling → better performance
- Available schedules: FIFO, LIFO, random, chunked FIFO/LIFO/random, etc.; they can be composed
- Usage:

    GaloisRuntime.foreach(initialWorklist,
        new ForeachBody() {
            void call(GNode src, ForeachContext context) {
                src.map(src, new LambdaVoid() {
                    void call(GNode dst) {
                        ...
                        context.add(dst)   // new active elements are added through the context
                    }
                })
            }
        },
        Priority.first(ChunkedFIFO.class))   // use this scheduling strategy

- scheduling → implementation: a synthesis algorithm; see Donald's paper in ASPLOS'11
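The effect of the schedule on processing order can be shown with a plain-Java worklist; this is an illustrative sketch, not the Galois scheduler, and the policy is reduced to FIFO vs. LIFO for clarity.

```java
import java.util.*;

public class SchedulingDemo {
    // Drain a worklist, pulling elements according to the given policy.
    // This mimics how a scheduler decides which active element comes next.
    static List<Integer> run(List<Integer> initial, boolean fifo) {
        Deque<Integer> worklist = new ArrayDeque<>(initial);
        List<Integer> order = new ArrayList<>();
        while (!worklist.isEmpty()) {
            int item = fifo ? worklist.pollFirst() : worklist.pollLast();
            order.add(item);
            // an operator could push new active elements here via worklist.addLast(...)
        }
        return order;
    }

    public static void main(String[] args) {
        List<Integer> init = List.of(1, 2, 3);
        System.out.println(run(init, true));   // FIFO: prints [1, 2, 3]
        System.out.println(run(init, false));  // LIFO: prints [3, 2, 1]
    }
}
```

Both orders produce correct results for an unordered algorithm; they differ only in locality and load balance, which is why changing the schedule changes performance but not the answer.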

Spanning tree: Galois code

    Graph graph = builder.create()
    GNode startNode = Graphs.getRandom(graph)
    startNode.inSpanningTree = true
    Bag result = Bag.create()                            // ArrayList replaced by a Galois multiset
    Iterable initialWorklist = Arrays.asList(startNode)  // worklist facade
    GaloisRuntime.foreach(initialWorklist,
        new ForeachBody() {
            // gets an element from the worklist and applies the closure (operator)
            void call(GNode src, ForeachContext context) {
                src.map(src, new LambdaVoid() {
                    void call(GNode dst) {
                        dstData = dst.getData()
                        if !dstData.inSpanningTree
                            dstData.inSpanningTree = true
                            result.add(new Pair(src, dst))
                            context.add(dst)
                    }
                })
            }
        },
        Priority.defaultOrder())

Outline
1. Serial algorithm
   - Galois data structures: choosing the right implementation, basic API
2. Galois (parallel) version
   - Galois iterators
   - scheduling: assigning work to threads
3. Optimizations
   - Galois data structures: advanced API (flags)

Optimizations: "flagged" methods
- Speculation overheads are associated with invocations on Galois objects:
  - conflict detection
  - undo actions
- Flagged versions of Galois methods take an extra parameter:

    N getNodeData(GNode src)
    N getNodeData(GNode src, byte flags)

- Flags change the runtime's default behavior:
  - deactivate conflict detection, undo actions, or both
  - better performance
  - might violate transactional semantics
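The idea behind flagged methods can be sketched in plain Java: a flag argument selects which speculation overheads a data-structure operation pays. The flag constants, field names, and locking scheme below are hypothetical illustrations, not the Galois implementation.

```java
import java.util.*;

public class FlagDemo {
    // Hypothetical flag bits mirroring the slide's idea (not the real constants)
    static final byte CHECK_CONFLICT = 1, SAVE_UNDO = 2;
    static final byte ALL = CHECK_CONFLICT | SAVE_UNDO, NONE = 0;

    final Map<Integer, Integer> data = new HashMap<>();
    final Set<Integer> locked = new HashSet<>();
    final Deque<Runnable> undoLog = new ArrayDeque<>();

    void setData(int node, int value, byte flags) {
        // conflict detection: abstract lock on the node, first-come-first-served
        if ((flags & CHECK_CONFLICT) != 0 && !locked.add(node))
            throw new IllegalStateException("conflict on node " + node);
        // undo action: remember how to restore the old value on rollback
        if ((flags & SAVE_UNDO) != 0) {
            Integer old = data.get(node);
            undoLog.push(() -> { if (old == null) data.remove(node); else data.put(node, old); });
        }
        data.put(node, value);
    }

    public static void main(String[] args) {
        FlagDemo g = new FlagDemo();
        g.setData(7, 42, ALL);    // locks node 7 and logs an undo action
        g.setData(7, 99, NONE);   // skips both overheads; safe only because 7 is already owned
        System.out.println(g.data.get(7) + " " + g.undoLog.size());  // prints 99 1
    }
}
```

The second call is the optimization the slides describe: once an iteration already owns a node, repeating the lock acquisition and undo logging is pure overhead, so passing NONE is safe there but would break transactional semantics anywhere else.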

Spanning tree: Galois code (explicit flags)

    GaloisRuntime.foreach(initialWorklist,
        new ForeachBody() {
            void call(GNode src, ForeachContext context) {
                src.map(src, new LambdaVoid() {
                    void call(GNode dst) {
                        dstData = dst.getData(MethodFlag.ALL)
                        if !dstData.inSpanningTree
                            dstData.inSpanningTree = true
                            result.add(new Pair(src, dst), MethodFlag.ALL)
                            context.add(dst, MethodFlag.ALL)
                    }
                }, MethodFlag.ALL)
            }
        },
        Priority.defaultOrder())

MethodFlag.ALL is the default behavior: acquire abstract locks + store undo actions.

Spanning tree: Galois code (final version)

    GaloisRuntime.foreach(initialWorklist,
        new ForeachBody() {
            void call(GNode src, ForeachContext context) {
                src.map(src, new LambdaVoid() {
                    void call(GNode dst) {
                        dstData = dst.getData(MethodFlag.NONE)      // we already hold the lock on dst
                        if !dstData.inSpanningTree
                            dstData.inSpanningTree = true
                            // nothing to lock + cannot be aborted:
                            result.add(new Pair(src, dst), MethodFlag.NONE)
                            context.add(dst, MethodFlag.NONE)
                    }
                }, MethodFlag.CHECK_CONFLICT)   // acquire locks on src and its neighbors
            }
        },
        Priority.defaultOrder())

Flags can be inferred automatically, without loss of precision, by static analysis [D. Prountzos et al., POPL 2011]; that analysis is not included in this release.

Galois roadmap (flowchart, reconstructed as steps)
1. Write the serial irregular app using Galois objects.
2. Use foreach instead of the loop, with the default flags.
3. Correct parallel execution? If NO, consider alternative data structures and repeat.
4. Efficient parallel execution? If NO, change the scheduling and adjust the flags, then re-check.

Experiments (Xeon machine, 8 cores)

Delaunay Refinement: refine triangles in a mesh
Results:
- input: 500K triangles, half of them "bad"
- little work is available by the end of refinement
- "chunked FIFO, then LIFO" scheduling
- speedup: 5x

Barnes-Hut: n-body simulation
Results:
- input: 1M bodies
- embarrassingly parallel → flag = NONE, so overheads are low
- comparable to the hand-tuned SPLASH implementation
- speedup: 7x

Points-to Analysis: infer which variables each pointer in the program may point to
Results:
- input: the Linux kernel
- sequential implementation in C++
- "chunked FIFO" scheduling
- sequential phases limit the speedup
- speedup: 3.75x

Irregular applications included
Lonestar suite: the algorithms already described, plus...
- minimum spanning tree: Borůvka, Prim, Kruskal
- maximum flow: preflow-push
- mesh generation: Delaunay
- graph partitioning: Metis
- SAT solver: survey propagation
Check the apps directory for more examples!

Thank you for attending this tutorial! Questions?
Download Galois at

Scheduling (II)
- Order hierarchy: apply the 1st rule; in case of a tie, use the 2nd, and so on

    Priority.first(ChunkedFIFO.class).then(LIFO.class).then(...)

- Local order: apply ...

    Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)

- Strict order: ordered + comparator

    Priority.first(Ordered.class, new Comparator() {
        int compare(Object o1, Object o2) { ... }
    })
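The strict, comparator-driven order can be sketched with a plain-Java priority worklist; this is an illustrative analogue of the ordered schedule, not the Galois classes.

```java
import java.util.*;

public class OrderedScheduleDemo {
    // Drain a comparator-ordered worklist: a strict schedule always hands out
    // the highest-priority (smallest, per the comparator) element next.
    static List<Integer> drain(Collection<Integer> initial, Comparator<Integer> priority) {
        Queue<Integer> worklist = new PriorityQueue<>(priority);
        worklist.addAll(initial);
        List<Integer> order = new ArrayList<>();
        while (!worklist.isEmpty())
            order.add(worklist.poll());   // elements come out in comparator order
        return order;
    }

    public static void main(String[] args) {
        System.out.println(drain(List.of(5, 1, 4, 2, 3), Comparator.naturalOrder()));
        // prints [1, 2, 3, 4, 5]
    }
}
```

Unlike the unordered iterator, a strict order constrains which element may be processed next, so the runtime pays extra coordination cost in exchange for the ordering guarantee.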