Compiler and Runtime Support for Efficient Software Transactional Memory
Vijay Menon, Programming Systems Lab
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman

2 Motivation
Locks are hard to get right
– Programmability vs. scalability
Transactional memory is an appealing alternative
– Simpler programming model
– Stronger guarantees: atomicity, consistency, isolation; deadlock avoidance
– Closer to programmer intent
– Scalable implementations
Questions
– How to lower TM overheads, particularly in software?
– How to balance granularity / scalability?

3 Our System
Java Software Transactional Memory (STM) System
– Pure software implementation (McRT-STM – PPoPP 06)
– Language extensions in Java (Polyglot)
– Integrated with JVM & JIT (ORP & StarJIT)
Novel Features
– Rich transactional language constructs in Java
– Efficient, first-class nested transactions
– Complete GC support
– RISC-like STM API / IR
– Compiler optimizations
– Per-type word- and object-level conflict detection

4 Transactional Java
Transactional Java:
    atomic {
        S;
    }
Standard Java + STM API:
    while(true) {
        TxnHandle th = txnStart();
        try {
            S;
            break;
        } finally {
            if(!txnCommit(th))
                continue;
        }
    }
Other language constructs, built on prior research
– retry (STM Haskell, …)
– orelse (STM Haskell)
– tryatomic (Fortress)
– when (X10, …)
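
The retry loop on this slide relies on a Java detail that is easy to miss: a continue inside finally overrides the pending break, so a failed commit re-runs the body. The following is a minimal, single-threaded sketch of that control flow only. TxnHandle, txnStart and txnCommit here are toy stand-ins written for this example (the failure injection and the saved counter value are assumptions), not the real McRT-STM API.

```java
// Toy model of the atomic-block retry loop. All names are hypothetical stand-ins.
class AtomicLoopSketch {
    static final class TxnHandle {
        final int savedCounter;            // value to restore if the commit fails
        TxnHandle(int savedCounter) { this.savedCounter = savedCounter; }
    }

    static int counter = 0;                // shared data updated by S
    static int attempts = 0;               // how many times the body ran
    static int conflictsToInject = 0;      // pretend the first N commits conflict

    static TxnHandle txnStart() {
        attempts++;
        return new TxnHandle(counter);
    }

    // Returns true on success. On a simulated conflict, undoes the update and returns false.
    static boolean txnCommit(TxnHandle th) {
        if (conflictsToInject > 0) {
            conflictsToInject--;
            counter = th.savedCounter;     // roll back the in-place update
            return false;
        }
        return true;
    }

    static void atomicIncrement() {
        while (true) {
            TxnHandle th = txnStart();
            try {
                counter++;                 // S
                break;                     // taken only if the commit below succeeds
            } finally {
                if (!txnCommit(th))
                    continue;              // overrides the pending break: retry the body
            }
        }
    }

    public static void main(String[] args) {
        conflictsToInject = 2;
        atomicIncrement();
        System.out.println(counter + " after " + attempts + " attempts");
    }
}
```

With two injected conflicts the body runs three times and the counter ends at 1, because each failed attempt is rolled back before the retry.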

5 Tight integration with JVM & JIT
StarJIT & ORP
– On-demand cloning of methods (Harris 03)
– Identifies transactional regions in Java+STM code
– Inserts read/write barriers in transactional code
– Maps STM API to first-class opcodes in StarJIT IR (STIR)
Good compiler representation gives greater optimization opportunities

6 Representing Read/Write Barriers
Source:
    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
        …
    }
With traditional barriers:
    stmWr(&a.x, t1)
    stmWr(&a.y, t2)
    if(stmRd(&a.z) == 0) {
        stmWr(&a.x, 0)
        stmWr(&a.z, t3)
    }
Traditional barriers hide redundant locking/logging

7 An STM IR for Optimization
Source:
    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }
Redundancies exposed:
    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnOpenForWrite(a)
    txnLogObjectInt(&a.y, a)
    a.y = t2
    txnOpenForRead(a)
    if(a.z == 0) {
        txnOpenForWrite(a)
        txnLogObjectInt(&a.x, a)
        a.x = 0
        txnOpenForWrite(a)
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }

8 Optimized Code
Source:
    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }
Optimized:
    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnLogObjectInt(&a.y, a)
    a.y = t2
    if(a.z == 0) {
        a.x = 0
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }
Fewer & cheaper STM operations
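
The savings between slides 7 and 8 can be counted with a small model. This is a sketch under assumptions: the Txn class below is a toy written for this example that only counts open and log operations and keeps an undo log of old values, and the branch condition follows the source code on the slide (a.z == 0). It also shows why the second write to a.x needs no second log entry: the first entry already holds the value to restore.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model: counts STM operations and keeps an undo log. Not the StarJIT IR itself.
class BarrierSketch {
    static final class Obj {
        int x, y, z;
        Obj(int x, int y, int z) { this.x = x; this.y = y; this.z = z; }
    }

    static final class Txn {
        int opens = 0, logs = 0;
        final Deque<Runnable> undoLog = new ArrayDeque<>();

        void openForWrite(Obj o) { opens++; }   // real STM: acquire o's transaction record
        void openForRead(Obj o)  { opens++; }   // real STM: record o's version
        void log(Runnable restoreOldValue) { logs++; undoLog.push(restoreOldValue); }
        void abort() { while (!undoLog.isEmpty()) undoLog.pop().run(); }  // newest first
    }

    static void logX(Txn t, Obj a) { int old = a.x; t.log(() -> a.x = old); }
    static void logY(Txn t, Obj a) { int old = a.y; t.log(() -> a.y = old); }
    static void logZ(Txn t, Obj a) { int old = a.z; t.log(() -> a.z = old); }

    // One open and one log per access, as on slide 7.
    static void unoptimized(Txn t, Obj a, int t1, int t2, int t3) {
        t.openForWrite(a); logX(t, a); a.x = t1;
        t.openForWrite(a); logY(t, a); a.y = t2;
        t.openForRead(a);
        if (a.z == 0) {
            t.openForWrite(a); logX(t, a); a.x = 0;
            t.openForWrite(a); logZ(t, a); a.z = t3;
        }
    }

    // Redundant opens and the repeated log of a.x removed, as on slide 8.
    static void optimized(Txn t, Obj a, int t1, int t2, int t3) {
        t.openForWrite(a); logX(t, a); a.x = t1;
        logY(t, a); a.y = t2;
        if (a.z == 0) {
            a.x = 0;
            logZ(t, a); a.z = t3;
        }
    }

    public static void main(String[] args) {
        Txn slow = new Txn(), fast = new Txn();
        unoptimized(slow, new Obj(1, 2, 0), 10, 20, 30);
        optimized(fast, new Obj(1, 2, 0), 10, 20, 30);
        System.out.println("unoptimized: " + slow.opens + " opens, " + slow.logs + " logs");
        System.out.println("optimized:   " + fast.opens + " opens, " + fast.logs + " logs");
    }
}
```

When the branch is taken, the unoptimized version performs 5 opens and 4 logs while the optimized version performs 1 open and 3 logs, and aborting either one restores the original field values.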

9 Compiler Optimizations for Transactions
Standard optimizations
– CSE, dead-code elimination, …
– Careful IR representation exposes opportunities and enables optimizations with almost no modifications
– Subtle in presence of nesting
STM-specific optimizations
– Immutable field / class detection & barrier removal (vtable/String)
– Transaction-local object detection & barrier removal
– Partial inlining of STM fast paths to eliminate call overhead

10 McRT-STM
PPoPP 2006 (Saha et al.)
C / C++ STM
Pessimistic writes:
– strict two-phase locking
– update in place
– undo on abort
Optimistic reads:
– versioning
– validation before commit
Benefits
– Fast memory accesses (no buffering / object wrapping)
– Minimal copying (no cloning for large objects)
– Compatible with existing types & libraries
Similar STMs: Ennals (FastSTM), Harris et al. (PLDI 06)

11 STM Data Structures
Per-thread:
Transaction Descriptor
– Per-thread info for version validation, acquired locks, rollback
– Maintained in Read / Write / Undo logs
Transaction Memento
– Checkpoint of logs for nesting / partial rollback
Per-data:
Transaction Record
– Pointer-sized field guarding a set of shared data
– Transactional state of data
– Shared: version number (odd)
– Exclusive: owner's transaction descriptor (even / aligned)
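
The odd/even encoding of the transaction record can be sketched as follows. This is an illustration under assumptions, not McRT-STM code: an AtomicLong stands in for the pointer-sized field, an even long id stands in for the aligned descriptor address, the read and write logs are plain maps, and releasing a record always installs the old version plus 2 so the word stays odd.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Toy transaction record: odd word = shared version, even word = exclusive owner.
class TxnRecordSketch {
    static boolean isVersion(long word) { return (word & 1L) == 1L; }

    static final class TxnRecord {
        final AtomicLong word = new AtomicLong(1L);   // starts shared, version 1
    }

    static final class Txn {
        final long id;   // even; stands in for an aligned descriptor address
        final Map<TxnRecord, Long> readLog = new HashMap<>();    // record -> version seen
        final Map<TxnRecord, Long> writeLog = new HashMap<>();   // record -> version replaced

        Txn(long id) {
            if (isVersion(id)) throw new IllegalArgumentException("descriptor id must be even");
            this.id = id;
        }

        // Optimistic read: remember the version, take no lock.
        boolean openForRead(TxnRecord r) {
            long w = r.word.get();
            if (w == id) return true;              // already owned by this transaction
            if (!isVersion(w)) return false;       // owned by another transaction: conflict
            readLog.putIfAbsent(r, w);
            return true;
        }

        // Pessimistic write: swap the version for our descriptor id.
        boolean openForWrite(TxnRecord r) {
            long w = r.word.get();
            if (w == id) return true;
            if (!isVersion(w) || !r.word.compareAndSet(w, id)) return false;
            writeLog.put(r, w);
            return true;
        }

        // Every version read must be unchanged, or replaced only by this transaction.
        boolean validate() {
            for (Map.Entry<TxnRecord, Long> e : readLog.entrySet()) {
                TxnRecord r = e.getKey();
                long seen = e.getValue();
                long now = r.word.get();
                boolean unchanged = now == seen
                        || (now == id && writeLog.get(r).longValue() == seen);
                if (!unchanged) return false;
            }
            return true;
        }

        // Validates, then releases owned records with a new odd version.
        // A real STM would replay its undo log before releasing when validation fails.
        boolean commit() {
            boolean ok = validate();
            for (Map.Entry<TxnRecord, Long> e : writeLog.entrySet())
                e.getKey().word.set(e.getValue() + 2);
            readLog.clear();
            writeLog.clear();
            return ok;
        }
    }
}
```

A reader that opened the record at version 1 fails validation once a writer has committed and moved the record to version 3, which is the "validation before commit" step of the optimistic read path.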

12 Mapping Data to Transaction Record
Every data item has an associated transaction record
Object granularity
– Transaction record embedded in object
– class Foo { int x; int y; } holds: vtbl, TxR, x, y
Word granularity
– Object words hash into table of TxRs (TxR 1, TxR 2, TxR 3, …, TxR n)
– Hash is f(obj.hash, offset)
– class Foo { int x; int y; } holds: vtbl, hash, x, y
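
For word granularity, the slide only says that the hash is f(obj.hash, offset). The sketch below picks one possible f to make the idea concrete; the table size, the mixing function, and the use of System.identityHashCode as the per-object hash are all assumptions of this example.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Toy global table of transaction records for word-level conflict detection.
class TxnRecordTableSketch {
    static final int TABLE_SIZE = 1 << 10;   // assumed size; must be a power of two
    static final AtomicLongArray TABLE = new AtomicLongArray(TABLE_SIZE);
    static {
        for (int i = 0; i < TABLE_SIZE; i++) TABLE.set(i, 1L);   // all shared, version 1
    }

    // One possible f(obj.hash, offset): combine, mix high bits down, mask to the table.
    static int recordIndex(int objHash, int offset) {
        int h = objHash * 31 + offset;
        h ^= (h >>> 16);
        return h & (TABLE_SIZE - 1);
    }

    // Convenience overload using the JVM identity hash as the per-object hash.
    static int recordIndexFor(Object obj, int offset) {
        return recordIndex(System.identityHashCode(obj), offset);
    }
}
```

Two fields of the same object usually land in different table slots, which is what removes false conflicts, but unrelated words can still collide in one slot; a larger table lowers that probability.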

13 Granularity of Conflict Detection
Object-level
– Cheaper operation
– Exposes CSE opportunities
– Lower overhead on 1P
Word-level
– Reduces false sharing
– Better scalability
Mix & match
– Per-type basis
– E.g., word-level for arrays, object-level for non-arrays
Example:
    // Thread 1
    a.x = …
    a.y = …
    // Thread 2
    … = … a.z …
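
The two-thread example on this slide conflicts under object-level detection and not under word-level detection. A small model of the per-type choice makes that visible. Everything here is hypothetical: RecordKey stands for "which transaction record guards this location", slot numbers stand for field or element offsets, and the policy is the one the slide gives as an example.

```java
// Toy model of per-type conflict-detection granularity.
class GranularitySketch {
    enum Granularity { OBJECT, WORD }

    // Example policy from the slide: word-level for arrays, object-level otherwise.
    static Granularity policyFor(Object obj) {
        return obj.getClass().isArray() ? Granularity.WORD : Granularity.OBJECT;
    }

    // Identifies the transaction record guarding a location.
    static final class RecordKey {
        final Object obj;
        final int slot;   // -1 means "the whole object"
        RecordKey(Object obj, int slot) { this.obj = obj; this.slot = slot; }

        @Override public boolean equals(Object other) {
            if (!(other instanceof RecordKey)) return false;
            RecordKey k = (RecordKey) other;
            return k.obj == obj && k.slot == slot;
        }
        @Override public int hashCode() {
            return 31 * System.identityHashCode(obj) + slot;
        }
    }

    static RecordKey recordFor(Object obj, int slot) {
        return policyFor(obj) == Granularity.OBJECT
                ? new RecordKey(obj, -1)     // one record per object
                : new RecordKey(obj, slot);  // one record per word
    }

    // A write and a read conflict when they are guarded by the same record.
    static boolean conflicts(Object written, int writtenSlot, Object read, int readSlot) {
        return recordFor(written, writtenSlot).equals(recordFor(read, readSlot));
    }

    public static void main(String[] args) {
        Object a = new Object();
        int[] arr = new int[4];
        System.out.println("object, different fields: " + conflicts(a, 0, a, 2));
        System.out.println("array, different elements: " + conflicts(arr, 0, arr, 2));
    }
}
```

Writing one field of a non-array object conflicts with reading another field of it, while accesses to different elements of an array do not conflict in this model.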

14 Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
– L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four)
Workloads
– Hashtable, Binary tree, OO7 (OODBMS)
– Mix of gets, in-place updates, insertions, and removals
Object-level conflict detection by default
– Word / mixed where beneficial

15 Effect of Compiler Optimizations
1P overheads over thread-unsafe baseline
Prior STMs typically incur ~2x on 1P
With compiler optimizations:
– < 40% over no concurrency control
– < 30% over synchronization

16 Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
– Thread-unsafe w/o concurrency control
Synchronized
– Coarse-grain synchronization via SynchronizedMap wrapper
Concurrent (java.util.concurrent.ConcurrentHashMap)
– Multi-year effort: JSR 166 -> Java 5
– Optimized for concurrent gets (no locking)
– For updates, divides bucket array into 16 segments (size / locking)
Atomic
– Transactional version via AtomicMap wrapper
Atomic Prime
– Transactional version with minor hand optimization
– Tracks size per segment à la ConcurrentHashMap
Execution
– 10,000,000 operations / 200,000 elements
– Defaults: load factor, threshold, concurrency level
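
The three non-transactional contenders can be built from the standard library with default settings, as sketched below. The driver is a single-threaded illustration written for this example, not the benchmark harness behind the slides: the operation mix, the seed parameter, and the reduced sizes in main are assumptions, and the Atomic and Atomic Prime variants are omitted because they need the atomic language extension.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the shootout setup using only standard library maps.
class MapShootoutSketch {
    static Map<Integer, Integer> unsafeMap()       { return new HashMap<>(); }
    static Map<Integer, Integer> synchronizedMap() { return Collections.synchronizedMap(new HashMap<>()); }
    static Map<Integer, Integer> concurrentMap()   { return new ConcurrentHashMap<>(); }

    static void populate(Map<Integer, Integer> m, int elements) {
        for (int i = 0; i < elements; i++) m.put(i, i);
    }

    // Runs `ops` operations: getPercent% gets, the rest in-place updates of
    // existing keys. Returns how many gets were issued.
    static int run(Map<Integer, Integer> m, int elements, int ops, int getPercent, long seed) {
        Random rnd = new Random(seed);
        int gets = 0;
        for (int i = 0; i < ops; i++) {
            int key = rnd.nextInt(elements);
            if (rnd.nextInt(100) < getPercent) {
                m.get(key);
                gets++;
            } else {
                m.put(key, i);   // key already present, so the size does not change
            }
        }
        return gets;
    }

    public static void main(String[] args) {
        // Reduced from the slide's 10,000,000 operations / 200,000 elements.
        Map<Integer, Integer> m = concurrentMap();
        populate(m, 2_000);
        int gets = run(m, 2_000, 100_000, 20, 42L);
        System.out.println(gets + " gets, size " + m.size());
    }
}
```

A real scalability run would drive each map from several threads and time it; the unsafe HashMap is only a valid baseline on one thread.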

17 Scalability: 100% Gets
Atomic wrapper is competitive with ConcurrentHashMap
Effect of compiler optimizations scales

18 Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on 16 segments
Atomic still scales

19 20% Inserts and Removes
Atomic conflicts on entire bucket array
– The array is an object

20 20% Inserts and Removes: Word-Level
We still conflict on the single size field in java.util.HashMap

21 20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size per segment, lowering the bottleneck
No degradation, modest performance gain
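
Tracking size per segment replaces one hot counter with several. The sketch below shows the bookkeeping only; the class name, the choice of 16 segments (mirroring the ConcurrentHashMap default mentioned earlier), and the low-bits segment function are assumptions of this example. Inside a transactional map the counters would be plain fields updated within atomic blocks, so two inserts conflict on size only when they hit the same segment.

```java
// Toy per-segment size counter in the spirit of the Atomic Prime variant.
class SegmentedSizeSketch {
    static final int SEGMENTS = 16;            // assumed; must be a power of two
    final int[] sizes = new int[SEGMENTS];     // one counter per segment

    // Assumed mapping from a key's hash to its segment.
    static int segmentFor(int hash) {
        return hash & (SEGMENTS - 1);
    }

    void inserted(int hash) { sizes[segmentFor(hash)]++; }
    void removed(int hash)  { sizes[segmentFor(hash)]--; }

    // The total size is the sum over all segments; this reads every counter.
    int size() {
        int total = 0;
        for (int s : sizes) total += s;
        return total;
    }
}
```

The trade-off is that size() now touches all counters, which is acceptable when inserts and removes are far more frequent than size queries.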

22 20% Inserts and Removes: Mixed-Level
Mixed-level preserves wins & reduces overheads
– word-level for arrays
– object-level for non-arrays

23 Key Takeaways
Optimistic reads + pessimistic writes is a nice sweet spot
Compiler optimizations significantly reduce STM overhead
– 20-40% over thread-unsafe
– 10-30% over synchronized
Simple atomic wrappers are sometimes good enough
Minor modifications give performance competitive with complex fine-grain synchronization
Word-level conflict detection is crucial for large arrays
Mixed-level conflict detection provides the best of both

24 Novel Contributions
– Rich transactional language constructs in Java
– Efficient, first-class nested transactions
– Complete GC support
– RISC-like STM API
– Compiler optimizations
– Per-type word- and object-level conflict detection
