The Tao of Parallelism in Algorithms. Keshav Pingali, The University of Texas at Austin

Presentation transcript:

The Tao of Parallelism in Algorithms
Keshav Pingali, The University of Texas at Austin
Joint work with D. Nguyen, M. Kulkarni, M. Burtscher, A. Hassaan, R. Kaleem, T-H. Lee, A. Lenharth, R. Manevich, M. Mendez-Lojo, D. Prountzos, X. Sui

Message of paper
Parallel programming needs new foundations
– dependence graphs are inadequate
– cannot represent parallelism in irregular algorithms
New foundation:
– operator formulation
– data-centric abstraction
– regular algorithms become special case
Key insights
– amorphous data-parallelism (ADP) is ubiquitous
– TAO analysis: structure in algorithms
– use TAO structure to exploit ADP efficiently
[Figure: dependence graph for FFT]

Inadequacy of static dependence graphs
Delaunay mesh refinement
Don’t-care non-determinism
– final mesh depends on order in which bad triangles are processed
Data structure: graph
– nodes: triangles
– edges: triangle adjacencies
Parallelism
– triangles with disjoint cavities can be processed in parallel
– parallelism depends on runtime values
– static dependence graph cannot be generated
– parallelization must be done at runtime
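The point that parallelism depends on runtime values can be illustrated with a minimal sketch. It assumes each bad triangle's cavity has already been computed and is given as a set of triangle ids; the function name `pick_independent` and the sample data are hypothetical and not taken from the slides or from any mesh library.

```python
def pick_independent(cavities):
    """Greedily select bad triangles whose cavities are pairwise disjoint.

    cavities maps a bad triangle id to the set of triangle ids in its cavity.
    The selected triangles could be refined in parallel in one round.
    """
    claimed = set()   # triangles already covered by a selected cavity
    chosen = []
    for tri, cavity in cavities.items():
        if claimed.isdisjoint(cavity):
            chosen.append(tri)
            claimed |= cavity
    return chosen

# Hypothetical cavities: t1 and t4 overlap on t3, t5 is independent of both.
cavities = {
    "t1": {"t1", "t2", "t3"},
    "t5": {"t5", "t6"},
    "t4": {"t3", "t4"},
}
first_round = pick_independent(cavities)
```

Which triangles end up in `first_round` is decided entirely by the cavity contents, so the selection cannot be made before the mesh is known.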

Operator formulation of algorithms
Algorithm formulated in data-centric terms
– active element: node or edge where computation is needed
– activity: application of operator to active element
– neighborhood: set of nodes and edges read/written by activity
– ordering: order of execution of active elements in a sequential implementation
    – any order
    – problem-dependent order
Amorphous data-parallelism (ADP)
– process active nodes in parallel, subject to neighborhood and ordering constraints
– how do we exploit ADP?
[Figure legend: active node, neighborhood]
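The vocabulary above can be made concrete with a small sequential sketch. The example is a worklist-based shortest-path relaxation, picked here only because any processing order gives the same distances; it assumes non-negative edge weights, and the function name `sssp_worklist` and the sample graph are illustrative assumptions rather than material from the slides.

```python
from collections import deque

def sssp_worklist(graph, source):
    """Unordered worklist algorithm in operator-formulation terms.

    graph maps each node to a list of (neighbor, weight) pairs.
    Active element: a node whose distance changed.
    Neighborhood: the node plus its outgoing edges and their targets.
    """
    dist = {n: float("inf") for n in graph}
    dist[source] = 0
    worklist = deque([source])
    while worklist:
        node = worklist.popleft()          # any order is valid here
        for nbr, weight in graph[node]:    # the operator: relax outgoing edges
            if dist[node] + weight < dist[nbr]:
                dist[nbr] = dist[node] + weight
                worklist.append(nbr)       # nbr becomes an active element
    return dist

graph = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2)],
    "d": [],
}
distances = sssp_worklist(graph, "a")
```

Two active nodes whose neighborhoods do not overlap could be relaxed at the same time, which is the amorphous data-parallelism the slide refers to.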

TAO analysis: structure in algorithms
[Figure legend: active node, neighborhood]
Cf.: parallel programming patterns (Snir, Intel), Berkeley motifs (Patterson)

Parallelization
When can you produce a parallel schedule for program?
– compile-time: static parallelization
– after input is given but before execution: inspector-executor
– during program execution: interference graph
– after program is finished: optimistic parallelization
Algorithm classes named on the slide:
– data-driven, ordered algorithms (discrete-event simulation, Dijkstra SSSP, ...)
– structured topology, topology-driven algorithms (dense linear algebra, FFT, finite-differences, ...)
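Optimistic parallelization can be sketched roughly as follows. This is a toy model under stated assumptions, not the Galois runtime: each activity is represented only by its neighborhood (a set of element names), the operator just increments a counter on every element it touches, and the name `run_optimistically` is hypothetical. Locks are tried in sorted order without blocking; an activity that hits a conflict aborts and is retried in a later round.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_optimistically(activities, state, workers=4):
    """Run activities speculatively; return how many attempts were aborted."""
    locks = {elem: threading.Lock() for elem in state}

    def attempt(neighborhood):
        held = []
        try:
            for elem in sorted(neighborhood):
                if not locks[elem].acquire(blocking=False):
                    return False           # conflict detected: abort this attempt
                held.append(elem)
            for elem in neighborhood:      # all locks held before the first write,
                state[elem] += 1           # so an abort never needs an undo
            return True
        finally:
            for elem in held:
                locks[elem].release()

    aborts = 0
    pending = list(activities)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            results = list(pool.map(attempt, pending))
            pending = [a for a, ok in zip(pending, results) if not ok]
            aborts += len(pending)
    return aborts

state = {"a": 0, "b": 0, "c": 0, "d": 0}
activities = [{"a", "b"}, {"b", "c"}, {"d"}, {"a", "c"}]
aborted = run_optimistically(activities, state)
```

The number of aborts varies from run to run, but the final counters do not, because every activity eventually commits exactly once.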

Galois system
Programming model:
– Algorithms: sequential, OO language (Joe); Java/C++ with Galois set iterators
– Concurrent data structure library: expert programmers (Stephanie)
Execution model:
– optimistic and static parallelization
Galois system (Java):

Performance
Machine: 4x6-core Intel Xeon X7540
– DMR: 500K triangles
– Barnes-Hut

Andersen-style points-to analysis
Structural analysis
– topology: general graph
– operator: morph
– ordering: unordered
Optimizations
– cautious operator
– lock optimization
Comparison
– Hardekopf & Lin (PLDI 2007)
– red lines in graphs
Mendez-Lojo et al. (OOPSLA 2010)
Intel 8-core Xeon
[Figure axis: Threads]
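For readers unfamiliar with the analysis, here is a minimal sequential sketch of inclusion-based (Andersen-style) points-to analysis as a worklist algorithm. It is a textbook-style formulation written for illustration, not the parallel implementation evaluated on the slide; the function name `andersen`, the tuple encoding of constraints, and the sample program are all assumptions. The operator adds edges to the constraint graph as it runs, which is consistent with the "morph" classification above.

```python
from collections import defaultdict, deque

def andersen(addr_of, copies, loads, stores):
    """addr_of: (p, x) for p = &x      copies: (p, q) for p = q
    loads:   (p, q) for p = *q         stores: (p, q) for *p = q"""
    pts = defaultdict(set)     # points-to sets
    succ = defaultdict(set)    # edge q -> p means pts[q] is a subset of pts[p]
    loads_from = defaultdict(list)
    stores_to = defaultdict(list)
    for p, x in addr_of:
        pts[p].add(x)
    for p, q in copies:
        succ[q].add(p)
    for p, q in loads:
        loads_from[q].append(p)
    for p, q in stores:
        stores_to[p].append(q)

    worklist = deque(list(pts))
    while worklist:
        n = worklist.popleft()             # unordered: any node may be taken
        for v in list(pts[n]):
            for p in loads_from[n]:        # p = *n  adds edge v -> p
                if p not in succ[v]:
                    succ[v].add(p)
                    worklist.append(v)
            for q in stores_to[n]:         # *n = q  adds edge q -> v
                if v not in succ[q]:
                    succ[q].add(v)
                    worklist.append(q)
        for m in list(succ[n]):            # propagate along inclusion edges
            if not pts[n] <= pts[m]:
                pts[m] |= pts[n]
                worklist.append(m)
    return {var: targets for var, targets in pts.items() if targets}

# Hypothetical program: p = &a; q = &b; r = p; *r = q; s = *p
result = andersen(
    addr_of=[("p", "a"), ("q", "b")],
    copies=[("r", "p")],
    loads=[("s", "p")],
    stores=[("r", "q")],
)
```

In the sample program, `*r = q` makes `a` point to `b` because `r` aliases `p`, and the load `s = *p` then picks up `b` as well.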

Summary