Compiler and Runtime Support for Efficient Software Transactional Memory Vijay Menon Programming Systems Lab Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian.

Presentation transcript:

Compiler and Runtime Support for Efficient Software Transactional Memory Vijay Menon Programming Systems Lab Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman

2 Motivation
Locks are hard to get right
- Programmability vs. scalability
Transactional memory is an appealing alternative
- Simpler programming model
- Stronger guarantees: atomicity, consistency, isolation; deadlock avoidance
- Closer to programmer intent
- Scalable implementations
Questions
- How to lower TM overheads, particularly in software?
- How to balance granularity / scalability?

3 Our System
Java Software Transactional Memory (STM) System
- Pure software implementation (McRT-STM, PPoPP 06)
- Language extensions in Java (Polyglot)
- Integrated with JVM & JIT (ORP & StarJIT)
Novel Features
- Rich transactional language constructs in Java
- Efficient, first-class nested transactions
- Complete GC support
- RISC-like STM API / IR
- Compiler optimizations
- Per-type word- and object-level conflict detection

4 Transactional Java
Transactional Java:
    atomic {
        S;
    }
is translated to standard Java + STM API:
    while(true) {
        TxnHandle th = txnStart();
        try {
            S;
            break;
        } finally {
            if(!txnCommit(th))
                continue;
        }
    }
Other language constructs, built on prior research:
- retry (STM Haskell, …)
- orelse (STM Haskell)
- tryatomic (Fortress)
- when (X10, …)
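The lowering of an atomic block into a commit-or-retry loop can be sketched in plain Java. This is a simplified illustration, not the McRT-STM API: the class `AtomicLoweringSketch`, its `txnStart`/`txnCommit` stand-ins, and the simulated commit failures are all hypothetical, and the try/finally shape of the slide is flattened into a plain loop.

```java
import java.util.function.IntSupplier;

// Hypothetical stand-in for an STM runtime; only the retry structure is the point.
final class AtomicLoweringSketch {
    private int failuresLeft;   // the first `failuresLeft` commits are rejected
    int attempts = 0;           // how many times the body was started

    AtomicLoweringSketch(int simulatedFailures) {
        this.failuresLeft = simulatedFailures;
    }

    private Object txnStart() {
        attempts++;
        return new Object();    // placeholder transaction handle
    }

    private boolean txnCommit(Object handle) {
        if (failuresLeft > 0) {
            failuresLeft--;
            return false;       // simulated conflict detected at commit
        }
        return true;
    }

    // Lowered form of `atomic { S; }`: run S, then retry until the commit succeeds.
    int atomic(IntSupplier body) {
        while (true) {
            Object th = txnStart();
            int result = body.getAsInt();   // S
            if (txnCommit(th)) {
                return result;              // committed: leave the loop
            }
            // Commit failed: a real STM would roll back S's effects before retrying.
        }
    }

    public static void main(String[] args) {
        AtomicLoweringSketch stm = new AtomicLoweringSketch(2);
        int value = stm.atomic(() -> 41 + 1);
        System.out.println(value + " after " + stm.attempts + " attempts");
    }
}
```

With two simulated commit failures, the body runs three times before the result is returned.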

5 Tight integration with JVM & JIT
StarJIT & ORP
- On-demand cloning of methods (Harris 03)
- Identifies transactional regions in Java+STM code
- Inserts read/write barriers in transactional code
- Maps STM API to first-class opcodes in StarJIT IR (STIR)
A good compiler representation yields greater optimization opportunities

6 Representing Read/Write Barriers
Source:
    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
        …
    }
With traditional barriers:
    stmWr(&a.x, t1)
    stmWr(&a.y, t2)
    if(stmRd(&a.z) == 0) {
        stmWr(&a.x, 0);
        stmWr(&a.z, t3)
    }
Traditional barriers hide redundant locking/logging

7 An STM IR for Optimization
Source:
    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }
STM IR, with redundancies exposed:
    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnOpenForWrite(a)
    txnLogObjectInt(&a.y, a)
    a.y = t2
    txnOpenForRead(a)
    if(a.z == 0) {
        txnOpenForWrite(a)
        txnLogObjectInt(&a.x, a)
        a.x = 0
        txnOpenForWrite(a)
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }

8 Optimized Code
Source:
    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }
Optimized STM IR:
    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnLogObjectInt(&a.y, a)
    a.y = t2
    if(a.z == 0) {
        a.x = 0
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }
Fewer & cheaper STM operations
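The effect of redundancy elimination on the STM operations can be illustrated with a toy pass over a straight-line list of operations. This is only a sketch under strong assumptions: operations are modeled as strings, control flow is ignored (a real compiler would rely on dominance and its existing CSE machinery), and the class `BarrierEliminationSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy pass: drops STM operations made redundant by an earlier identical or stronger one.
final class BarrierEliminationSketch {
    static List<String> eliminate(List<String> ops) {
        Set<String> seen = new HashSet<>();            // STM operations already emitted
        Set<String> openedForWrite = new HashSet<>();  // objects already opened for write
        List<String> out = new ArrayList<>();
        for (String op : ops) {
            if (op.startsWith("txn")) {
                if (seen.contains(op)) {
                    continue;                          // identical open/log already done
                }
                String arg = op.substring(op.indexOf('(') + 1, op.indexOf(')'));
                if (op.startsWith("txnOpenForRead(") && openedForWrite.contains(arg)) {
                    continue;                          // open-for-write subsumes open-for-read
                }
                if (op.startsWith("txnOpenForWrite(")) {
                    openedForWrite.add(arg);
                }
                seen.add(op);
            }
            out.add(op);                               // plain loads/stores are kept
        }
        return out;
    }

    // The operations of the example, flattened to a straight line for illustration.
    static List<String> example() {
        return Arrays.asList(
            "txnOpenForWrite(a)", "txnLogObjectInt(&a.x, a)", "a.x = t1",
            "txnOpenForWrite(a)", "txnLogObjectInt(&a.y, a)", "a.y = t2",
            "txnOpenForRead(a)",
            "txnOpenForWrite(a)", "txnLogObjectInt(&a.x, a)", "a.x = 0",
            "txnOpenForWrite(a)", "txnLogObjectInt(&a.z, a)", "a.z = t3");
    }

    public static void main(String[] args) {
        for (String op : eliminate(example())) {
            System.out.println(op);
        }
    }
}
```

On the flattened example, one open-for-write and one undo-log entry per field remain, which matches the shape of the optimized code.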

9 Compiler Optimizations for Transactions
Standard optimizations
- CSE, dead-code elimination, …
- Careful IR representation exposes opportunities and enables optimizations with almost no modifications
- Subtle in presence of nesting
STM-specific optimizations
- Immutable field / class detection & barrier removal (vtable/String)
- Transaction-local object detection & barrier removal
- Partial inlining of STM fast paths to eliminate call overhead

10 McRT-STM
PPoPP 2006 (Saha et al.)
C / C++ STM
Pessimistic writes:
- Strict two-phase locking
- Update in place
- Undo on abort
Optimistic reads:
- Versioning
- Validation before commit
Benefits
- Fast memory accesses (no buffering / object wrapping)
- Minimal copying (no cloning for large objects)
- Compatible with existing types & libraries
Similar STMs: Ennals (FastSTM), Harris et al. (PLDI 06)
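The combination of pessimistic in-place writes with an undo log and optimistic versioned reads can be sketched as follows. This is a single-threaded illustration of the bookkeeping only, with no atomic operations or memory-ordering concerns; the classes `Cell` and `Txn` are hypothetical and do not reflect the McRT-STM implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class UndoLogSketch {
    static final class Cell {
        int value;
        int version;   // bumped on every committed write
        Txn owner;     // non-null while write-locked by a transaction
        Cell(int v) { value = v; }
    }

    static final class Txn {
        private final Deque<Runnable> undoLog = new ArrayDeque<>();
        private final List<Cell> locked = new ArrayList<>();
        private final List<Cell> readCells = new ArrayList<>();
        private final List<Integer> readVersions = new ArrayList<>();

        // Optimistic read: remember the version seen, validate at commit.
        int read(Cell c) {
            if (c.owner != null && c.owner != this) {
                throw new IllegalStateException("read conflict");
            }
            readCells.add(c);
            readVersions.add(c.version);
            return c.value;
        }

        // Pessimistic write: lock, log the old value, update in place.
        void write(Cell c, int v) {
            if (c.owner == null) {
                c.owner = this;
                locked.add(c);
            } else if (c.owner != this) {
                throw new IllegalStateException("write conflict");
            }
            final int old = c.value;
            undoLog.push(() -> c.value = old);
            c.value = v;
        }

        boolean commit() {
            for (int i = 0; i < readCells.size(); i++) {
                Cell c = readCells.get(i);
                boolean changed = c.version != readVersions.get(i);
                boolean lockedByOther = c.owner != null && c.owner != this;
                if (changed || lockedByOther) {
                    abort();
                    return false;
                }
            }
            for (Cell c : locked) { c.version++; c.owner = null; }  // release locks
            locked.clear();
            undoLog.clear();
            return true;
        }

        void abort() {
            while (!undoLog.isEmpty()) undoLog.pop().run();  // undo newest first
            for (Cell c : locked) c.owner = null;
            locked.clear();
        }
    }

    public static void main(String[] args) {
        Cell x = new Cell(1);
        Txn t = new Txn();
        t.write(x, 5);
        t.write(x, 9);
        t.abort();
        System.out.println("after abort: " + x.value);
    }
}
```

Because writes go straight to memory, reads inside the transaction need no buffer lookup; the cost is moved to the abort path, which replays the undo log in reverse.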

11 STM Data Structures
Per-thread:
- Transaction Descriptor
  - Per-thread info for version validation, acquired locks, rollback
  - Maintained in Read / Write / Undo logs
- Transaction Memento
  - Checkpoint of logs for nesting / partial rollback
Per-data:
- Transaction Record
  - Pointer-sized field guarding a set of shared data
  - Transactional state of data
    - Shared: version number (odd)
    - Exclusive: owner's transaction descriptor (even / aligned)
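The odd/even encoding of the transaction record can be sketched with a single `long` word. The details are assumptions made for illustration: an aligned descriptor pointer is modeled as a descriptor id shifted left by three bits (so the word is always even), versions advance by two to stay odd, and the class `TxnRecordSketch` is hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

final class TxnRecordSketch {
    // Shared state: the low bit is set, and the word holds a version number.
    static boolean isShared(long word) {
        return (word & 1L) == 1L;
    }

    // Exclusive state: an "aligned pointer" to the owner, modeled as id << 3.
    static long ownerWord(int descriptorId) {
        return ((long) descriptorId) << 3;
    }

    static int ownerOf(long word) {
        return (int) (word >>> 3);
    }

    // Acquire for writing: swap the version we observed for our owner word.
    static boolean tryAcquire(AtomicLong record, long seenVersion, int descriptorId) {
        return isShared(seenVersion)
            && record.compareAndSet(seenVersion, ownerWord(descriptorId));
    }

    // Release after commit: publish the next odd version.
    static void release(AtomicLong record, long versionBeforeAcquire) {
        record.set(versionBeforeAcquire + 2L);
    }

    public static void main(String[] args) {
        AtomicLong record = new AtomicLong(1L);   // initial version, shared
        long seen = record.get();
        boolean acquired = tryAcquire(record, seen, 5);
        System.out.println("acquired=" + acquired + " owner=" + ownerOf(record.get()));
        release(record, seen);
        System.out.println("shared again=" + isShared(record.get()));
    }
}
```

A reader that later finds a different odd value than the one it logged knows a writer committed in between, which is what validation before commit checks.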

12 Mapping Data to Transaction Record
Every data item has an associated transaction record
Object granularity
- Transaction record embedded in object
- class Foo { int x; int y; }
- Object layout (diagram): TxR, x, y, vtbl
Word granularity
- Object words hash into table of TxRs (TxR 1, TxR 2, TxR 3, …, TxR n)
- Hash is f(obj.hash, offset)
- class Foo { int x; int y; }
- Object layout (diagram): hash, x, y, vtbl

13 Granularity of Conflict Detection
Object-level
- Cheaper operation
- Exposes CSE opportunities
- Lower overhead on 1P
Word-level
- Reduces false sharing
- Better scalability
Mix & Match
- Per-type basis
- E.g., word-level for arrays, object-level for non-arrays
Example:
    // Thread 1
    a.x = …
    a.y = …
    // Thread 2
    … = … a.z …
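The two granularities can be contrasted with a small sketch of the two-thread example. Everything here is an assumption for illustration: the table size, the mixing function standing in for f(obj.hash, offset), the field offsets, and the class `GranularitySketch`. Word-level detection can still report false conflicts when two words collide in the table.

```java
final class GranularitySketch {
    static final int TABLE_SIZE = 1024;   // hypothetical size of the shared TxR table

    // Word granularity: (object hash, field offset) selects a slot in the TxR table.
    // The mixing function is an assumption; the slides only give f(obj.hash, offset).
    static int wordRecordIndex(int objHash, int offsetBytes) {
        return Math.floorMod(31 * objHash + (offsetBytes >>> 2), TABLE_SIZE);
    }

    // Object granularity: two accesses conflict whenever they touch the same object.
    static boolean objectLevelConflict(int objHashA, int objHashB) {
        return objHashA == objHashB;
    }

    // Word granularity: two accesses conflict only if they map to the same record.
    static boolean wordLevelConflict(int hashA, int offA, int hashB, int offB) {
        return wordRecordIndex(hashA, offA) == wordRecordIndex(hashB, offB);
    }

    public static void main(String[] args) {
        int a = 12345;                    // hypothetical hash of object `a`
        int offX = 8, offZ = 16;          // hypothetical field offsets in bytes
        // Thread 1 writes a.x, Thread 2 reads a.z.
        System.out.println("object-level conflict: " + objectLevelConflict(a, a));
        System.out.println("word-level conflict:   " + wordLevelConflict(a, offX, a, offZ));
    }
}
```

Under object-level detection the write to a.x and the read of a.z share one record, so they conflict even though they touch different fields; under word-level detection they map to different records here.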

14 Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
- L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four)
Workloads
- Hashtable, Binary tree, OO7 (OODBMS)
  - Mix of gets, in-place updates, insertions, and removals
- Object-level conflict detection by default
  - Word / mixed where beneficial

15 Effect of Compiler Optimizations
1P overheads over thread-unsafe baseline
Prior STMs typically incur ~2x on 1P
With compiler optimizations:
- < 40% over no concurrency control
- < 30% over synchronization

16 Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
- Thread-unsafe, without concurrency control
Synchronized
- Coarse-grain synchronization via SynchronizedMap wrapper
Concurrent (java.util.concurrent.ConcurrentHashMap)
- Multi-year effort: JSR 166 -> Java 5
- Optimized for concurrent gets (no locking)
- For updates, divides bucket array into 16 segments (size / locking)
Atomic
- Transactional version via AtomicMap wrapper
Atomic Prime
- Transactional version with minor hand optimization
- Tracks size per segment à la ConcurrentHashMap
Execution
- 10,000,000 operations / 200,000 elements
- Defaults: load factor, threshold, concurrency level
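The three standard-library variants in the shootout can be constructed as below. This is a sketch rather than the benchmark harness: the class `MapShootoutSketch`, the tiny workload, and the thread counts are hypothetical, and the transactional AtomicMap wrapper is omitted because it is not part of the JDK. The `ConcurrentHashMap` arguments spell out the defaults (initial capacity 16, load factor 0.75, concurrency level 16).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class MapShootoutSketch {
    // Thread-unsafe baseline: only safe when used from a single thread.
    static Map<Integer, Integer> unsafeVariant() {
        return new HashMap<>();
    }

    // Coarse-grain synchronization: one lock around every operation.
    static Map<Integer, Integer> synchronizedVariant() {
        return Collections.synchronizedMap(new HashMap<>());
    }

    // Fine-grain concurrent map; concurrency level 16 matches the 16 segments.
    static Map<Integer, Integer> concurrentVariant() {
        return new ConcurrentHashMap<>(16, 0.75f, 16);
    }

    // Each thread inserts a disjoint key range; returns the final map size.
    static int fill(Map<Integer, Integer> map, int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    map.put(base + i, i);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println("unsafe (1 thread): " + fill(unsafeVariant(), 1, 1000));
        System.out.println("synchronized:      " + fill(synchronizedVariant(), 4, 1000));
        System.out.println("concurrent:        " + fill(concurrentVariant(), 4, 1000));
    }
}
```

The unsafe variant is driven by a single thread only, since concurrent puts on a plain `HashMap` are a data race.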

17 Scalability: 100% Gets
Atomic wrapper is competitive with ConcurrentHashMap
Effect of compiler optimizations scales

18 Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on 16 segments
Atomic still scales

19 20% Inserts and Removes
Atomic conflicts on entire bucket array
- The array is an object

20 20% Inserts and Removes: Word-Level
We still conflict on the single size field in java.util.HashMap

21 20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size / segment, lowering the bottleneck
No degradation, modest performance gain
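Tracking the size per segment instead of in one shared field can be sketched as follows. This is an illustration of the idea only: the class `SegmentedSizeSketch` is hypothetical, and `AtomicIntegerArray` stands in for the plain per-segment fields that a transactional version would update inside atomic blocks.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Per-segment element counts: updates to different segments touch different
// counters, so they no longer contend on one shared size field.
final class SegmentedSizeSketch {
    private final AtomicIntegerArray sizes;

    SegmentedSizeSketch(int segments) {
        this.sizes = new AtomicIntegerArray(segments);
    }

    private int segmentFor(int keyHash) {
        return Math.floorMod(keyHash, sizes.length());
    }

    void onInsert(int keyHash) {
        sizes.incrementAndGet(segmentFor(keyHash));
    }

    void onRemove(int keyHash) {
        sizes.decrementAndGet(segmentFor(keyHash));
    }

    // The total is computed on demand by summing all segments.
    int size() {
        int total = 0;
        for (int i = 0; i < sizes.length(); i++) {
            total += sizes.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        SegmentedSizeSketch counter = new SegmentedSizeSketch(16);
        for (int key = 0; key < 100; key++) {
            counter.onInsert(key);
        }
        counter.onRemove(42);
        System.out.println("size = " + counter.size());
    }
}
```

The trade-off is that `size()` becomes a sum over all segments, while inserts and removes on different segments stop conflicting on a single word.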

22 20% Inserts and Removes: Mixed-Level
Mixed-level preserves wins & reduces overheads
- Word-level for arrays
- Object-level for non-arrays

23 Key Takeaways
Optimistic reads + pessimistic writes is a nice sweet spot
Compiler optimizations significantly reduce STM overhead
- % over thread-unsafe
- % over synchronized
Simple atomic wrappers are sometimes good enough
Minor modifications give performance competitive with complex fine-grain synchronization
Word-level conflict detection is crucial for large arrays
Mixed-level conflict detection provides the best of both

24 Novel Contributions
- Rich transactional language constructs in Java
- Efficient, first-class nested transactions
- Complete GC support
- RISC-like STM API
- Compiler optimizations
- Per-type word- and object-level conflict detection
