Galois System Tutorial
Donald Nguyen, Mario Méndez-Lojo

Writing Galois programs
- Galois data structures
  - choosing the right implementation
  - API: basic, flags (advanced)
- Galois iterators
- Scheduling
  - assigning work to threads

Motivating example: spanning tree
- Compute a spanning tree of an undirected graph
- Parallelism comes from independent edges
- The release contains minimum spanning tree examples: Borůvka, Prim, Kruskal

Spanning tree - pseudocode

  // create graph, initialize worklist and spanning tree
  Graph graph = read graph from file
  Node startNode = pick random node from graph
  startNode.inSpanningTree = true
  Worklist worklist = create worklist containing startNode
  List result = create empty list

  // worklist elements can be processed in any order
  foreach src : worklist
    foreach Node dst : src.neighbors
      if not dst.inSpanningTree          // neighbor not processed?
        dst.inSpanningTree = true
        Edge edge = new Edge(src, dst)
        result.add(edge)                 // add edge to solution
        worklist.add(dst)                // add to worklist

Outline
1. Serial algorithm
   - Galois data structures: choosing the right implementation, basic API
2. Galois (parallel) version
   - Galois iterators
   - scheduling: assigning work to threads
3. Optimizations
   - Galois data structures: advanced API (flags)


Galois data structures
- "Galoized" implementations
  - concurrent
  - transactional semantics
- Also, serial implementations
- galois.object package
  - Graph
  - GMap, GSet
  - ...

Graph API

  Graph
    createNode(data: N)
    add(node: GNode)
    remove(node: GNode)
    addNeighbor(s: GNode, d: GNode)
    removeNeighbor(s: GNode, d: GNode)
    ...

  GNode
    setData(data: N)
    getData()

  ObjectGraph (edges carry data of type E)
    addEdge(s: GNode, d: GNode, data: E)
    setEdgeData(s: GNode, d: GNode, data: E)
    ...

  Mappable
    map(closure: LambdaVoid)
    map(closure: Lambda2Void)
    ...

Graph implementations include ObjectMorphGraph and ObjectLocalComputationGraph.
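
The edge-data methods of ObjectGraph are the only part of this API not exercised by the examples later in the tutorial, so here is a minimal sketch of them. How an ObjectGraph instance is obtained is not shown in these slides, so the graph is taken as a parameter; the generic parameters (ObjectGraph<String, Integer>, GNode<String>) and the method wrapper are my reconstruction, since the transcript drops angle brackets.

  // sketch only: exercises createNode/add/addEdge/setEdgeData as listed above
  static void edgeDataDemo(ObjectGraph<String, Integer> g) {
    GNode<String> a = g.createNode("a");
    GNode<String> b = g.createNode("b");
    g.add(a);
    g.add(b);
    g.addEdge(a, b, 5);        // edge a-b carries the edge datum 5
    g.setEdgeData(a, b, 7);    // relabel the same edge
    System.out.println(a.getData() + " -- " + b.getData());
  }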

Mappable interface
- Implicit iteration over collections of type T:
    void map(LambdaVoid<T> body);
- LambdaVoid<T> is a closure:
    void call(T arg);
- Graph is Mappable: "apply the closure once per node in the graph"
- GNode is Mappable: "apply the closure once per neighbor of this node"
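
To make both Mappable instances concrete, here is a small sketch written against the single-argument map declared above (the parallel examples later use an overload that also passes the source node). It assumes Integer node data as in the 2x2 grid example at the end of the tutorial; the generic parameters are my reconstruction, since the transcript drops angle brackets.

  // print every node in the graph, then every neighbor of one node
  static void printNodesAndNeighbors(Graph<Integer> graph, GNode<Integer> start) {
    // Graph is Mappable: the closure runs once per node in the graph
    graph.map(new LambdaVoid<GNode<Integer>>() {
      public void call(GNode<Integer> node) {
        System.out.println("node " + node.getData());
      }
    });
    // GNode is Mappable: the closure runs once per neighbor of 'start'
    start.map(new LambdaVoid<GNode<Integer>>() {
      public void call(GNode<Integer> neighbor) {
        System.out.println("neighbor " + neighbor.getData());
      }
    });
  }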

Spanning tree - serial code

  // graphs are created using the builder pattern
  Graph graph = new MorphGraph.GraphBuilder().create()
  // Graphs: graph utilities
  GNode startNode = Graphs.getRandom(graph)
  startNode.getData().inSpanningTree = true
  // a stack gives LIFO scheduling
  Stack worklist = new Stack(startNode)
  List result = new ArrayList()

  while !worklist.isEmpty()
    src = worklist.pop()
    // for every neighbor of the active node
    src.map(new LambdaVoid() {
      void call(GNode dst) {
        NodeData dstData = dst.getData()
        // has the node been processed?
        if !dstData.inSpanningTree
          dstData.inSpanningTree = true
          result.add(new Edge(src, dst))
          worklist.add(dst)
      }
    })

Outline
1. Serial algorithm
   - Galois data structures: choosing the right implementation, basic API
2. Galois (parallel) version
   - Galois iterators
   - scheduling: assigning work to threads
3. Optimizations
   - Galois data structures: advanced API (flags)

Galois iterators

  // unordered iterator
  static void GaloisRuntime.foreach(
      Iterable<T> initial,                      // initial worklist
      Lambda2Void<T, ForeachContext<T>> body,   // apply closure to each active element
      Rule schedule)                            // scheduling policy

- GaloisRuntime also provides ordered iterators, runtime statistics, etc.
- Upon foreach invocation:
  - threads are spawned
  - transactional semantics are guaranteed; conflicts and rollbacks are transparent to the user

Scheduling
- Good scheduling → better performance
- Available schedules:
  - FIFO, LIFO, Random
  - ChunkedFIFO / ChunkedLIFO / ChunkedRandom
  - many others (see the Javadoc)
- Usage (Priority.defaultOrder() gives the default scheduling, ChunkedFIFO):

  GaloisRuntime.foreach(initialWorklist,          // set of initial active elements
      new ForeachBody() {
        void call(GNode src, ForeachContext context) {
          src.map(src, new LambdaVoid() {
            void call(GNode dst) {
              ...
              context.add(dst)                    // new active elements are added through the context
            }
          });
        }
      },
      Priority.defaultOrder())

Spanning tree - Galois code

  Graph graph = builder.create()
  GNode startNode = Graphs.getRandom(graph)
  startNode.getData().inSpanningTree = true
  // ArrayList replaced by a Galois multiset
  Bag result = Bag.create()
  Iterable initialWorklist = Arrays.asList(startNode)

  // foreach gets an element from the worklist and applies the closure (operator)
  GaloisRuntime.foreach(initialWorklist,
      new ForeachBody() {
        // the context is a worklist facade
        void call(GNode src, ForeachContext context) {
          src.map(src, new LambdaVoid() {
            void call(GNode dst) {
              dstData = dst.getData()
              if !dstData.inSpanningTree
                dstData.inSpanningTree = true
                result.add(new Pair(src, dst))
                context.add(dst)
            }
          });
        }
      },
      Priority.defaultOrder())

Outline
1. Serial algorithm
   - Galois data structures: choosing the right implementation, basic API
2. Galois (parallel) version
   - Galois iterators
   - scheduling: assigning work to threads
3. Optimizations
   - Galois data structures: advanced API (flags)

Optimizations - "flagged" methods
- Speculation overheads are associated with invocations on Galois objects:
  - conflict detection
  - undo actions
- Flagged version of Galois methods → extra parameter:
    N getNodeData(GNode src)
    N getNodeData(GNode src, byte flags)
- Change the runtime's default behavior:
  - deactivate conflict detection, undo actions, or both
  - better performance
  - might violate transactional semantics
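
As a bridge to the two operator variants on the next slides, here is the data access from the operator body written with each MethodFlag value that those slides use (dst and NodeData come from the running example; the d1..d4 names are illustrative, and the comments restate the behavior described above).

  NodeData d1 = dst.getData();                            // default: conflict detection + undo actions
  NodeData d2 = dst.getData(MethodFlag.ALL);              // same as the default, made explicit
  NodeData d3 = dst.getData(MethodFlag.CHECK_CONFLICT);   // conflict detection only, no undo action
  NodeData d4 = dst.getData(MethodFlag.NONE);             // no conflict detection, no undo action:
                                                          // fastest, but safe only if another call
                                                          // already protects dst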

Spanning tree - Galois code

  GaloisRuntime.foreach(initialWorklist,
      new ForeachBody() {
        void call(GNode src, ForeachContext context) {
          src.map(src, new LambdaVoid() {
            void call(GNode dst) {
              // MethodFlag.ALL: acquire abstract locks + store undo actions
              dstData = dst.getData(MethodFlag.ALL)
              if !dstData.inSpanningTree
                dstData.inSpanningTree = true
                result.add(new Pair(src, dst), MethodFlag.ALL)
                context.add(dst, MethodFlag.ALL)
            }
          }, MethodFlag.ALL);
        }
      },
      Priority.defaultOrder())

Spanning tree - Galois code (final version)

  GaloisRuntime.foreach(initialWorklist,
      new ForeachBody() {
        void call(GNode src, ForeachContext context) {
          src.map(src, new LambdaVoid() {
            void call(GNode dst) {
              // we already have the lock on dst
              dstData = dst.getData(MethodFlag.NONE)
              if !dstData.inSpanningTree
                dstData.inSpanningTree = true
                // nothing to lock + cannot be aborted
                result.add(new Pair(src, dst), MethodFlag.NONE)
                context.add(dst, MethodFlag.NONE)
            }
          // CHECK_CONFLICT: acquire locks on src and its neighbors
          }, MethodFlag.CHECK_CONFLICT);
        }
      },
      Priority.defaultOrder())

Galois roadmap
1. Write the serial irregular app using Galois objects.
2. Use foreach instead of the loop, with flags.
3. Correct parallel execution? If not, adjust the flags and retry.
4. Efficient parallel execution? If not, change the scheduling or consider alternative data structures.

Irregular applications included (Lonestar suite)
- N-body simulation: Barnes-Hut
- Minimum spanning tree: Borůvka, Prim, Kruskal
- Maximum flow: preflow-push
- Mesh generation and refinement: Delaunay
- Graph partitioning: Metis
- SAT solver: survey propagation
Check the apps directory for more examples!

Questions?

Create a 2x2 grid, print contents

  Graph graph = builder.create()
  GNode n0 = graph.createNode(0);
  // create the other three nodes
  ...
  graph.addNeighbor(n0, n1);
  graph.addNeighbor(n0, n2);
  // add the other two edges
  ...
  graph.map(new LambdaVoid<GNode>() {
    void call(GNode node) {
      int label = node.getData();
      System.out.println(label);
    }
  });

Scheduling (II)
- Order hierarchy: apply the 1st rule; in case of a tie, use the 2nd, and so on

    Priority.first(ChunkedFIFO.class).then(LIFO.class).then(...)

- Local order: apply ...

    Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)

- Strict order: ordered + comparator

    Priority.first(Ordered.class, new Comparator() {
      int compare(Object o1, Object o2) { ... }
    });
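
For context, a rule built this way is passed to foreach in place of the Priority.defaultOrder() used throughout the spanning tree example; a minimal sketch reusing that example's names (initialWorklist, ForeachBody), with the operator body elided, so only the last argument changes.

  GaloisRuntime.foreach(initialWorklist,
      new ForeachBody() {
        void call(GNode src, ForeachContext context) {
          // ... spanning tree operator as before ...
        }
      },
      // chunked FIFO order, breaking ties with LIFO
      Priority.first(ChunkedFIFO.class).then(LIFO.class));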