Enabling Speculative Parallelization via Merge Semantics in STMs Kaushik Ravichandran Santosh Pande College of Computing Georgia Institute of Technology

Outline
Introduction
Connected Components Problem and Speculative Parallelization
STMs and the Merge Construct
Evaluation
Conclusion

Parallelism in Applications
Regular Parallel Applications: HPC applications, dense matrix computations, simulations, … Large amount of parallelism; easily parallelized.
Irregular Parallel Applications: pointer-based applications — graphs, trees, linked lists, … Significant amount of parallelism available; difficult to parallelize.

Irregular Parallelism
Parallelism depends on runtime values
Pointer based, which makes it difficult to statically parallelize
Non-overlapping sections can be processed in parallel
Speculation coupled with optimistic execution allows parallelization
Examples: applications with disconnected sparse graphs

Outline
Introduction
Connected Components Problem and Speculative Parallelization
STMs and the Merge Construct
Evaluation
Conclusion

Connected Components Problem (CCP)
Applications:
Video Processing
Image Retrieval
Island Discovery in ODE (Open Dynamics Engine)
Object Recognition
Many more…

Connected Components Problem (CCP)

Serial Solution Pseudo-code (DFS Strategy)
insert node into nodes_stack
for each node in nodes_stack:
    if node is marked: continue
    mark node
    insert node into marked_nodes
    for each neighbor of node:
        insert neighbor into nodes_stack
Each iteration of the loop depends on the previous iteration
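The DFS strategy above can be sketched as runnable Python; the graph representation (a plain adjacency dict) is an assumption for illustration, and the names mirror the pseudocode:

```python
# Sequential sketch of the DFS strategy from the slide.
# `graph` maps each node id to a list of neighbor ids (an assumption here).
def connected_component(graph, start, marked):
    """Mark every node reachable from `start`; return that component."""
    nodes_stack = [start]
    marked_nodes = []
    while nodes_stack:
        node = nodes_stack.pop()
        if node in marked:            # already claimed: skip
            continue
        marked.add(node)
        marked_nodes.append(node)
        for neighbor in graph[node]:  # neighbors may already be marked;
            nodes_stack.append(neighbor)  # the check above filters them
    return marked_nodes

graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
marked = set()
print(sorted(connected_component(graph, 0, marked)))  # [0, 1, 2]
```

Each pop depends on what earlier iterations pushed and marked, which is the serial dependence the slide points out.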

Speculative Parallelization

Serial Thread

Speculative Parallelization

Parallelized Computation

Speculative Parallelization: Wrong Speculation

Speculative Parallelization: Serial Thread Wrapped in STM
STMs provide an "atomic" construct and take care of conflicts by rolling back one thread

Speculative Parallelization: Serial Thread Wrapped in STM
A form of data parallelism over an irregular data structure. Can we do better?

Outline
Introduction
Connected Components Problem and Speculative Parallelization
STMs and the Merge Construct
Evaluation
Conclusion

Software Transactional Memories (STMs)
STMs optimistically execute code and provide atomicity and isolation
Monitor for conflicts at runtime and perform rollbacks on conflicts
Can be used for speculative computation but not designed for it
Main STM overheads:
Logging
Rollback after conflict

STMs optimistically execute code and provide atomicity and isolation
Monitor for conflicts at runtime and perform rollbacks on conflicts
Can be used for speculative computation but not designed for it
Main overheads:
Logging
Rollback after conflict — the inherent cost of rollback, plus the cost of lost work
Do we have to discard all the work?
Checkpointing state (Safe compiler-driven transaction checkpointing and recovery, OOPSLA '12, Jaswanth Sreeram, Santosh Pande)
Can we try merging the state from the aborting transaction?

Merge Construct
STMs discard the work from the aborting transaction
Can we salvage the work that it has done?
Can try to merge what it has processed with the other transaction

Merge Construct
STMs discard the work from the aborting transaction
Can we salvage the work that it has done?
Can try to merge what it has processed with the other transaction
Application dependent

Merge for CCP
Conceptually simple: a transaction conflicts only because it was working on the same component
Before a transaction is discarded, take its nodes_stack and marked_nodes lists and add them to the continuing transaction
Call this MERGE function after a conflict
(t1: continuing transaction, t2: aborting transaction)

Merge for CCP
We need to deal with two main issues:
Consistency of data structures in T2 (aborting transaction)
Safety of updates to data structures in T1 (continuing transaction)
We use two user-defined functions, MERGE and UPDATE
(t1: continuing transaction, t2: aborting transaction)
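The division of labor between the two functions can be sketched in Python as follows. The `Txn` class and the `merge`/`update` names are assumptions for illustration, not the paper's API; a real STM runtime would invoke them as callbacks on conflict and at safe update points:

```python
# Hypothetical sketch of the MERGE/UPDATE pair for CCP.
class Txn:
    def __init__(self):
        self.nodes_stack, self.marked_nodes = [], []
        # MERGE-specific structures: written by the aborting side,
        # consumed later by the continuing side at a safe point.
        self.merged_nodes_stack, self.merged_marked_nodes = [], []

def merge(t1, t2):
    """On conflict: salvage t2's work into t1's MERGE-specific
    structures, not its live ones (t1 may still be executing)."""
    t1.merged_nodes_stack.extend(t2.nodes_stack)
    t1.merged_marked_nodes.extend(t2.marked_nodes)

def update(t1):
    """At a SAFE_UPDATE_POINT: fold the salvaged work into t1's
    live data structures."""
    t1.nodes_stack.extend(t1.merged_nodes_stack)
    t1.marked_nodes.extend(t1.merged_marked_nodes)
    t1.merged_nodes_stack.clear()
    t1.merged_marked_nodes.clear()
```

Keeping the merged work in side structures is what makes the scheme safe: `merge` never touches state that t1 is concurrently reading or writing.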

Data Structures in T2
When can T2 abort? It can abort only when it reads or writes shared state (A or B)
Are its data structures nodes_stack and marked_nodes safe to use in the MERGE function?

Data Structures in T2
Irrespective of a conflict at A or B, both data structures nodes_stack and marked_nodes either have the node or do not have it
Valid state of data structures
(t1: continuing transaction, t2: aborting transaction)

Data Structures in T2
Switch around lines 20 and 21

Data Structures in T2
If a conflict occurs at B: marked_nodes has the node while nodes_stack does not have its neighbors
Invalid state of data structures
(t1: continuing transaction, t2: aborting transaction)

Detecting Invalid or Valid States
Step 1: Identify conflict points
An easy step, since STMs typically wrap these instructions in special calls
Step 2: Identify the possibility of an invalid state
At each of the points from Step 1, check whether the MERGE function would see a set of valid data structures, as described on the previous slides
If valid, nothing needs to be done
If invalid, or if validity cannot be determined easily, use the SNAPSHOT API
SNAPSHOT API:
A call made by the programmer at a point in the code (typically the start/end of a loop)
Makes a copy of the data structures
Only this copy is used in the MERGE function
Example coming up
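A minimal sketch of what a SNAPSHOT call might do, assuming a transaction object carrying the two CCP lists (the `snapshot` name and attributes are assumptions, not the paper's API). Shallow copies suffice when the elements are plain node ids:

```python
from types import SimpleNamespace

def snapshot(txn):
    """Copy the data structures at a point where they are known to be
    consistent; MERGE then reads only the copies, never the live
    (possibly inconsistent) state."""
    txn.snap_nodes_stack = list(txn.nodes_stack)
    txn.snap_marked_nodes = list(txn.marked_nodes)

txn = SimpleNamespace(nodes_stack=[1, 2], marked_nodes=[1])
snapshot(txn)               # e.g. at the top of the loop body
txn.nodes_stack.append(3)   # later mutations do not disturb the snapshot
```

This trades a per-iteration copy for the guarantee that MERGE always starts from a valid state, which matches the overhead the evaluation later attributes to snapshot-ing.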

Data Structures in T1
T1 is still executing while T2 is aborting (performing MERGE)
Provide MERGE-specific data structures, merged_nodes_stack and merged_marked_nodes, for use in MERGE

Data Structures in T1
Now the information needs to be incorporated back into the main data structures
The programmer defines an UPDATE function
Safe points to invoke UPDATE are indicated using the SAFE_UPDATE_POINT() call

Putting It All Together

Outline
Introduction
Connected Components Problem and Speculative Parallelization
STMs and the Merge Construct
Evaluation
Conclusion

Experimental Evaluation
Minimum Spanning Tree (MST) benchmark
Connected Components benchmark
Configuration:
Dual quad-core Intel Xeon E5540 (2.53 GHz)
GCC with -O3 on Ubuntu
OpenMP used to parallelize the code
TinySTM (ETL)

Minimum Spanning Tree (MST) Pseudo-code
while node do:
    if node is marked: return
    mark node with tree number
    insert node into marked_nodes
    find the next edge from the marked_nodes frontier, else return
    add this edge to tree_edges
    node = this edge's unmarked node
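The pseudocode above (Prim-style growth from a seed node) can be sketched sequentially in Python; the adjacency representation `node -> [(weight, neighbor)]` and the `grow_tree` name are assumptions for illustration:

```python
# Sequential sketch of the MST pseudocode: grow one tree from `seed`,
# always taking the cheapest edge leaving the marked frontier.
def grow_tree(graph, seed, tree_no, mark):
    tree_edges, marked_nodes = [], []
    node = seed
    while node is not None:
        if node in mark:              # claimed by another tree: stop
            break
        mark[node] = tree_no          # mark node with tree number
        marked_nodes.append(node)
        # find the next (cheapest) edge from the marked_nodes frontier
        best = None
        for u in marked_nodes:
            for w, v in graph[u]:
                if v not in mark and (best is None or w < best[0]):
                    best = (w, u, v)
        if best is None:              # frontier exhausted: tree complete
            break
        tree_edges.append(best)       # add this edge to tree_edges
        node = best[2]                # this edge's unmarked node
    return tree_edges

g = {0: [(1, 1), (4, 2)], 1: [(1, 0), (2, 2)], 2: [(4, 0), (2, 1)]}
print(grow_tree(g, 0, 1, {}))  # [(1, 0, 1), (2, 1, 2)]
```

As in CCP, each iteration depends on the previous one through `mark`, which is the shared state that speculative transactions conflict on.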

MST Benchmark
Results with 4 different configurations:
Serial implementation (no parallelism and no STM overheads)
STM-based parallelization
STM-based parallelization with Merge
STM-based parallelization with Merge and Snapshot
Results from 4 different datasets (randomly generated parameterized graphs):
DS1 (6000 nodes): X = 6000, T = 6, N = 6
DS2 (9000 nodes): X = 9000, T = 5, N = 6
DS3 (12000 nodes): X = 12000, T = 4, N = 8
DS4 (16000 nodes): X = 16000, T = 3, N = 8

MST Results
Parallelization using simple STM-based speculation gives a performance improvement over the serial implementation
Performance of simple STM-based speculation drops after 4 threads (due to higher contention)
STM parallelization with Merge outperforms the other implementations and scales
Snapshot adds slight overhead but still demonstrates good scalability

MST Results
STM parallelization with Merge scales well
Snapshot-ing shows overheads but still scales well
At 8 threads, 90% faster than serial
Actually demonstrates super-linear speedups

Conclusion
Speculation is needed to deal with irregular parallelism
STMs can be used for speculation but are not designed for it
The merge construct provides explicit support for speculation by reducing the overheads of mis-speculation
We deal with the issues of consistency and safety of data structures during the merge
We demonstrate good scalability compared to regular STM-based speculation by reducing the overheads of mis-speculation

Q & A
Looking for applications; suggestions welcome
Code shortly available at:
Contact at:
Thank you!