An Integrated Hardware-Software Approach to Transactional Memory. Sean Lie, 6.895 Theory of Parallel Systems. Monday, December 8th, 2003.

Transactional Memory

Locks:

    if (i < j) { a = i; b = j; }
    else       { a = j; b = i; }
    Lock(L[a]); Lock(L[b]);
    Flow[i] = Flow[i] - X;
    Flow[j] = Flow[j] + X;
    Unlock(L[b]); Unlock(L[a]);

Transactional Memory:

    StartTransaction;
    Flow[i] = Flow[i] - X;
    Flow[j] = Flow[j] + X;
    EndTransaction;

Transactional memory provides atomicity without the problems associated with locks. I propose an integrated hardware-software approach to transactional memory.

Hardware Transactional Memory (HTM): Transactional memory can be implemented in hardware using the cache and the cache-coherency mechanism [Herlihy & Moss]. Uncommitted transactional data is stored in the cache; transactional data is marked using one additional bit per cache line (Tag | Status | Trans | Data). HTM has very low overhead but has size and length limitations.

Software Transactional Memory (STM): Transactional memory can be implemented in software using compiler and library support (the FLEX Software Transaction System). Uncommitted transactional data is stored in a copy of the object; transactional data is marked by flagging the object field. STM does not have size or length limitations but has high overhead.

[Figure: an original object whose flagged field points to its transactional version, which carries a readers list; sample field values 248, 'a', 77, "Mass. Ave.".]

Results
An integrated approach gives the best of both worlds.
- Common case: HTM mode. Small/short transactions run fast.
- Uncommon case: STM mode. Large/long transactions are slower but possible.
An integrated hardware-software transactional memory system was implemented and evaluated.
- HTM was implemented in the UVSIM software simulator.
- A subset of STM functionality was implemented for the benchmark applications.
- HTM was modified to be software-compatible.

Hardware vs. Software
HTM has much lower overhead than STM. A network-flow µbenchmark (node-push) was implemented for evaluating overhead. However, HTM has two serious limitations.

1-processor overheads (cycles, % of base):

    Atomicity Mechanism   "Worst"   "More Realistic"
    Locks                  505%      136%
    HTM                    153%      104%
    STM                   1879%      206%

"Worst" case: back-to-back small transactions. "More realistic" case: some processing between small transactions.

Hardware Limitation: Cache Capacity
HTM uses the cache to hold all transactional data. Therefore, HTM aborts transactions larger than the cache. Restricting transaction size is awkward and not modular.
- The size limit depends on associativity, block size, etc., in addition to cache size.
- The cache configuration changes from processor to processor.

Hardware Limitation: Context Switches
The cache is the only transactional buffer for all threads. Therefore, HTM aborts transactions on context switches. Restricting context switches is awkward and not modular.
- Context switches occur regularly in modern systems (e.g., TLB exceptions).

[Figure: one processor running threads a and b over a shared cache that holds transactional data for thread a only.]

HSTM: An Integrated Approach
Transactions are switched from HTM to STM when necessary: when a transaction aborts in HTM, it is restarted in STM. HTM is modified to be software-compatible.

[Figure: a transactionally marked cache line (Tag | Status | Trans | Data) alongside the STM object layout with its FLAG field and readers list.]

Software-Compatible HTM
One new instruction: xAbort. Additional checks are performed in software:
- On loads, check whether the memory location is set to FLAG.
- On stores, check whether there is a readers list.
If a check fails, the transaction aborts; otherwise the memory operation proceeds normally. Software checks are slow: performing them adds a 2.2x performance overhead over pure HTM in the "worst" case (1.1x in the "more realistic" case).

[Figure: flowchart of the load check (FLAG?) and store check (readers list?), each branching to Abort or Normal Memory Operation.]

Overcoming Size Limitations
The node-push benchmark was modified to touch more nodes to evaluate size limitations. HSTM uses HTM when possible and STM when necessary.

[Graph: execution time vs. nodes touched; HTM transactions stop fitting in the cache beyond a certain size, where HSTM falls back to STM.]

Overcoming Context-Switching Limitations
Context switches occur on TLB exceptions. The node-push benchmark was modified to choose from a larger set of nodes.
- More nodes mean a higher probability of a TLB miss, and thus of an abort (P_abort).
HSTM behaves like HTM when P_abort is low and like STM when P_abort is high.

Conclusions
An integrated approach gives the best of both worlds.
- Common case: HTM mode. Small/short transactions run fast.
- Uncommon case: STM mode. Large/long transactions are slower but possible.
Trade-offs:
- "STM mode" is not as fast as pure STM. This is acceptable since it is uncommon.
- "HTM mode" is not as fast as pure HTM. Is this acceptable?

Future Work
- Full implementation of STM in UVSIM.
- Integration of software-compatible HTM into the FLEX compiler.
- Evaluate how software-compatible HTM performs for parallel applications.
- Should the software-compatible modifications be moved into hardware?
- Can a transaction be transferred from hardware to software during execution?