Syed Ali Raza Jafri et al.1 LiteTM: Reducing HTM State Overhead T. N. Vijaykumar with Ali Jafri & Mithuna Thottethodi in HPCA ‘10.

2 Transactional Memory (TM)
- Multicores require parallel programming, which is significantly harder than sequential programming
- Locks may cause incorrect behavior: deadlocks/livelocks and data races
- TM appears to make correct programming easier
- TM implementations can be efficient
- Transactions may provide better programmability and performance than locks

3 Previous Work
- Hardware, software, and hybrid TMs
  - HTMs piggyback conflict detection on coherence
  - STMs and HybridTMs detect conflicts in software
- Recent HTMs support many features
  - Transaction time and footprint not limited by hardware: can exceed caches and even be swapped out of memory
  - Transaction-OS interactions not restricted: in-flight context switches, page/thread migrations
  - Modest hardware complexity: no coherence protocol changes (a very big deal)
- Supporting these features incurs high hardware cost

4 HTM Cost: State Overhead
- HTMs need large state throughout the memory hierarchy
  - Numerous state bits in L1 and L2 (figure labels: 19 bits, 16 bits)
  - Hijack memory ECC → weaker protection (e.g., 25% fewer SECDED bits in TokenTM)
- Supporting all features → large state in caches + weaker memory ECC → high barrier for adoption

5 HTM State Overhead
- Thread id/sharer count + state bits per block
  - Thread id to determine conflictors or own blocks
  - Sharer count to track multiple readers (ideally we need all ids, but that is too much state → make do with counts)
- Avoid coherence changes → extra bits (beyond R, W); e.g., TokenTM uses 5 bits instead of the usual R, W
- Thread ids + sharer counts in hardware → detect conflicts and identify conflictors mostly in hardware

6 LiteTM: Key Observations
- Most state information is not needed in the common case
- Eliminate thread ids and sharer counts
  - Intended for conflicts on L1-evicted blocks, but conflicts are usually on L1-resident blocks
  - Coherence trivially identifies L1-resident conflictors and their count
- Merge R and W into a single T bit
  - Coherence's "Modified" state can approximate W
  - False positives possible but rare
- Uncommon case: scan the transactional log
- LiteTM detects conflicts in hardware (all cases, like all HTMs); it identifies conflictors in hardware (common case) and software (uncommon case)
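The W-approximation idea above can be sketched in a few lines. This is our own toy model (names are ours, not the paper's): separate R and W bits are merged into one T bit, and "transactionally written" is approximated as T AND coherence-Modified, which admits the rare false positive the slide mentions.

```python
# Toy model of LiteTM's W approximation (our own names, not the paper's).
class CacheLine:
    def __init__(self):
        self.t = False         # T bit: block touched inside a transaction
        self.modified = False  # coherence Modified (dirty) state

def txn_read(line):
    line.t = True              # reads and writes both set only T

def txn_write(line):
    line.t = True
    line.modified = True       # a transactional write also dirties the line

def approx_written(line):
    # False positive: a line dirtied *before* the transaction and then
    # only read inside it looks written (T and Modified both set).
    return line.t and line.modified

line = CacheLine()
line.modified = True           # dirtied by pre-transaction code
txn_read(line)                 # the transaction merely reads the line
assert approx_written(line)    # the rare false positive from the slide
```

A false positive here costs only an unnecessary log walk or abort, not correctness, which is why the slide calls the approximation acceptable.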

7 LiteTM: Contributions (1)
- LiteTM reduces transactional state (figure: 19-bit and 16-bit state fields reduced to 2 bits)
- Average (worst) case 4% (10%) performance loss in STAMP (8 cores)
- The key reduction is removal of the thread id/count (the W approximation is secondary)

8 LiteTM: Contributions (2)
- LiteTM compensates for the loss of thread ids, read-sharer counts, and separate R, W bits via novel mechanisms:
  - Self log walks
  - Lazy clearing of L1-spilled transactional state
  - W approximation
  - All log walks (a la TokenTM)
- Smaller state in caches & fewer hijacked memory ECC bits → significantly lower barrier for adoption

9 LiteTM in the HTM-STM Spectrum
- LiteTM improves HTM by pushing more into software, i.e., by moving HTMs closer to STMs!
- LiteTM differs from HybridTMs in the hardware-software split
  - Hybrids: conflict detection in hardware if the transaction fits in cache, otherwise in software
  - LiteTM: conflict detection always in hardware; resolution in software
- Key point: conflict detection
  - Is needed for all accesses → must be fast
  - Is a global operation → usually hard to do fast in software
  - Closely matches coherence, which is fast → easy to piggyback
  - Hence it is always in hardware in LiteTM (like all HTMs)

10 Outline
- Introduction
- LiteTM transactional state
- Lazy clearing
- Experimental results
- Conclusion

11 Transactional State in L1
TokenTM (~16 bits):
- R, W: transactionally read/written
- R', W' + id: read/written and moved to another cache; upon coherence movement there is no change in coherence, and the id identifies the conflictor
- R+ + count: fusion of multiple read copies
LiteTM (2 bits):
- T + clean/modified: transactionally read/written
- T': T moved to another cache
- No id → all log walk on a conflict
- Upon conflict, abort the writer and all but one reader, or all readers
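The last bullet can be made concrete with a small sketch. This is our reading of the slide, not the paper's code, and the mapping of "or all readers" to the readers-only case is our assumption: without thread ids, hardware cannot distinguish readers, so on a reader-writer conflict it aborts the writer and all but one reader.

```python
# Hedged sketch of LiteTM conflict resolution without thread ids
# (our interpretation of the slide, not the paper's implementation).
def resolve_conflict(readers, writer):
    """Return the set of transactions to abort. `readers` are ids known
    from L1-resident coherence state; `writer` is the conflicting
    writer's id, or None for a readers-only conflict (our assumption)."""
    aborts = set()
    if writer is not None:
        aborts.add(writer)          # abort the writer...
        aborts.update(readers[1:])  # ...and all but one reader
    else:
        aborts.update(readers)      # otherwise, abort all readers
    return aborts
```

Aborting "all but one" rather than picking a specific survivor is exactly what the lack of per-reader ids forces: any one reader may be kept, since they are indistinguishable.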

12 Transactional State in L2 & Memory
TokenTM (~16 bits):
- States in L2 & memory: Idle (transactionally clean), Single reader + id, Single writer + id, Multiple readers + count
- Conflict on multiple readers → all log walks
LiteTM (2 bits):
- States in L2 & memory: Idle, Single reader, Single writer, Multiple readers
- Conflict in any state → all log walks
- No id → self log walk
- No count → no decrement of a count → lazy clearing of 'Multiple readers'
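The four LiteTM metastates above form a simple state machine. The following is our own toy encoding (names hypothetical, transition policy inferred from the slide): readers fuse into 'Multiple readers' with no count, and any access that touches already-claimed state signals a conflict.

```python
# Toy encoding of LiteTM's 2-bit L2/memory metastate (our model).
IDLE, SINGLE_READER, SINGLE_WRITER, MULTI_READERS = (
    "idle", "single_reader", "single_writer", "multi_readers")

def next_state(state, op):
    """Advance the metastate for a transactional 'read' or 'write'.
    Returns (new_state, conflict?)."""
    if op == "read":
        if state == IDLE:
            return SINGLE_READER, False
        if state in (SINGLE_READER, MULTI_READERS):
            return MULTI_READERS, False  # readers fuse; no count is kept
        return state, True               # reading a written block: conflict
    if op == "write":
        if state == IDLE:
            return SINGLE_WRITER, False
        return state, True               # writing a touched block: conflict
    raise ValueError(op)
```

Note what the missing count implies: once the state reaches MULTI_READERS, commits cannot decrement anything, so the state lingers until the lazy clearing described on the next slide.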

13 Lazy Clearing
- A 'Multiple readers' conflict/commit leaves state behind: with no count, we don't know who the last reader is → cannot clear
- Lazily clear on the next conflict via all log walks
- The all-log-walk check and the state clearing must be atomic: hardware address buffers + software support
- Details in the HPCA '10 paper
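A minimal sketch of the lazy-clearing step, under our own model (not the HPCA '10 implementation, and omitting the atomicity machinery the slide mentions): at the next conflict, every in-flight transaction's read set is walked; if none still holds the block, the 'Multiple readers' state is stale and can be cleared.

```python
# Minimal model of lazy clearing via an all log walk (our sketch).
class Block:
    def __init__(self, addr, state):
        self.addr, self.state = addr, state

def lazy_clear_on_conflict(block, txn_read_sets):
    """All log walk on a conflict against `block`: if no in-flight
    transaction still holds the block in its read set, the
    'multi_readers' state is stale, so clear it. Returns True if
    cleared."""
    if block.state != "multi_readers":
        return False
    if any(block.addr in rs for rs in txn_read_sets):
        return False              # a live reader exists: genuine conflict
    block.state = "idle"          # stale state left by committed readers
    return True
```

In the real design the walk-and-clear must be atomic with respect to new transactional accesses (hardware address buffers plus software support, per the slide); this sketch ignores that race.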

14 Outline
- Introduction
- LiteTM transactional state
- Lazy clearing
- Experimental results
- Conclusion

15 Methodology
- GEMS HTM simulator on top of Simics
- 8-core, 1 GHz, in-order-issue processors
- Typical memory hierarchy parameters
- All STAMP benchmarks; multiple runs for statistical significance
- Transactional state bits: TokenTM 16 vs. LiteTM 2
  - Also show LiteTM-1bit, in which read sharing triggers log walks
- Hybrid-bound: emulate spilled transactions in hybrid TMs (1 extra hash-table write per first transactional access)

16 LiteTM Performance
- Mostly 1-3% loss; contentious, long transactions lose 10%
- Labyrinth's contention hurts the base optimistic TM → small relative loss
- LiteTM-1bit: the lack of distinction between read sharing and conflict degrades performance
- Hybrid-bound: conflict detection in software degrades performance

17 LiteTM Aborts & Log Walks

Benchmark                        | % false aborts (W approx) | self log walks/commit | all log walks/commit
ssca2, km-low, km-high, intruder | 0                         |                       | ~0
genome                           |                           |                       | ~0
vac-low                          | 0                         |                       | ~0
vac-high                         |                           |                       |
yada                             | 0.9                       | 0.3                   | ~0
bayes                            |                           |                       |
labyrinth                        |                           |                       |

Overhead increases with contention yet stays low.

18 Conclusion
- Current HTMs support many key features but incur high transactional state overhead
  - Many state bits in all caches & hijacked memory ECC bits → high barrier for adoption
- LiteTM significantly reduces transactional state
  - Most state information is not needed in the common case; novel mechanisms handle the uncommon case
  - LiteTM reduces TokenTM's 16 bits/block to 2 bits
  - Average (worst) case 4% (10%) performance loss in STAMP
- LiteTM significantly lowers the barrier for adoption

19 A Couple of Points on Cliff's Talk
- Main problem: conflicts due to auxiliary data
  - This problem exists for all optimistic TMs: HTMs, STMs, and hybrids
- Options:
  - Learn from past conflicts to skew the schedule (prevent the conflict)
  - Repair transactional state, as in Martin et al., ISCA '10 (cure the conflict)
  - Instead of learning, the compiler can provide hints to aid prevention
- These problems don't seem big enough to give up on HTMs

20 Questions?

21 Is TokenTM Overhead Really High?
- 16 bits/L1-block is a lot in absolute terms
- 16 bits in memory may be hijacked from ECC: 25% fewer SECDED bits → weaker protection
- Alternatively, the 16 bits may be placed in main memory, increasing bandwidth requirements

22 Narrow Topic?
- LiteTM separates conflict detection (hardware) from conflictor identification (software)
- This split is fundamental and can be applied to other unbounded HTMs

23 Focus on TokenTM
- TokenTM is the only design that supports all the features mentioned previously, hence we attempt to improve TokenTM
- Our design is applicable to other HTMs as well: OneTM-concurrent's ids, VTM's ids (pointers to the XSW in the XADT), and VTM's counts (#entries in the XADT)

24 What About UFO?
- UFO is not a TM; it supports strong atomicity in hybrids/STMs
- We compare against an upper bound on hybrids

25 Read-Sharing Support
- LiteTM allows read sharing: multiple L1s can have T bits set, and L2 has a multiple-read-sharing state
- Read sharing is disallowed if T bit + Modified (uncommon)

26 Should Logs Be Locked to Avoid Racing Conflicts?
- Recall: a conflicting access faults and retries
- Suppose thread F is checking thread N's log, looking for block X
  - N makes a racing access to X, taking coherence permissions away from F
  - After the log walk, F will retry the access to X; the coherence action will cause F to fault again
- A backstop is available to prevent livelock
- Context switches are handled similarly

27 Coherence Actions Are Completed
- Invalidation of a reader: the T' bit is sent to the writer; T' states that a token exists
- Read sharing of a writer: the T' bit is sent to the reader

28 Is STM Acceleration Easier?
- STM acceleration provides weaker semantics, requires at least one bit per memory block, and needs UFO-like mechanisms
- LiteTM needs only 2 bits per block and no changes to the coherence protocol
- LiteTM performs better than the STM-accelerated approach, as shown by our hybrid upper-bound comparison

29 Smallest Input Dataset
- 8-core setup
- Suitable scaling for all benchmarks
- Reasonable simulation times and statistical variation

30 Hybrid Better Than Signature HTMs
- Signature saturation causes serialized execution
- TokenTM and LiteTM use per-block metastate

31 Support for SMT Cores
- LiteTM can support multithreaded cores: replicate the T bits per hardware context
- A single T' bit suffices: T' indicates a remote transactional access

32 Aren't Bits Everywhere Hard?
- No: adding nacks/delays to coherence is hard, since it leads to deadlocks/livelocks
- Adding bits is quite easy

33 Validity of the Hybrid Bound
- An upper bound on hybrids that retry spilled transactions in software
- Does not apply to other self-proclaimed hybrids, e.g., SigTM, which uses signatures for conflict detection
- Signature-based TMs have other issues: signature saturation causes serialization

34 TokenTM vs. LiteTM: Transactional State for Conflict Detection (figure comparing the two state encodings)

35 Sensitivity to Busy Buffers
- With no buffers, all L1 misses wait for lazy clearing → significant loss under high contention

36 Hybrid Upper Bound
- An upper bound for any hybrid that retries transactions in an STM (with software conflict detection) after a failure in HTM mode

37 Transactional State Overheads
- Thread id/sharer count + state bits per block
- Avoid coherence changes → extra bits (beyond R, W)
  - Previously, a conflicting access was nacked (it could not complete); such nacks are invasive changes to coherence (they cause deadlocks)
  - TokenTM allows coherence to complete even on a conflict; the access itself does not complete and excepts
  - Requires transactional state to move with blocks under coherence; tracks non-local transactional state, e.g., TokenTM's R', W', R+
- Thread ids + sharer counts in hardware → detect conflicts and identify conflictors mostly in hardware