Sutirtha Sanyal (Barcelona Supercomputing Center, Barcelona) Accelerating Hardware Transactional Memory (HTM) with Dynamic Filtering of Privatized Data.


A TM framework should be able to execute transactions as efficiently as possible even when they are defined in a coarse-grained fashion, i.e., around a large piece of code, which is how programmers usually define them. An analysis shows that many variables accessed inside a transaction are not truly shared across multiple threads; rather, they are completely local to an individual thread.

ALGORITHM AND IMPLEMENTATIONS

We assume the programmer has a-priori knowledge of some data structures that are thread-local, and we require the programmer to use a dual version of malloc, named local_malloc(), for such structures. To filter out stack accesses of the transaction (and of any function called from within a transaction), we use the stack-pointer and frame-pointer registers.
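The local_malloc() interface above can be sketched as follows. This is only an illustration under our own assumptions: in the real design the locality information lives in the page table, whereas here a small software table emulates it, and the helper is_private() is a name we introduce, not part of the paper.

```c
#include <stdlib.h>

#define MAX_LOCAL_ALLOCS 1024

/* Software emulation of the per-page locality tag (illustration only). */
static void *local_ptrs[MAX_LOCAL_ALLOCS];
static int n_local;

/* Hypothetical local_malloc(): allocate and remember that the block is
 * thread-private, so the HTM may skip it during conflict detection
 * and commit. */
void *local_malloc(size_t sz)
{
    void *p = malloc(sz);
    if (p && n_local < MAX_LOCAL_ALLOCS)
        local_ptrs[n_local++] = p;
    return p;
}

/* Would a transactional access to this base address be filtered out? */
int is_private(const void *addr)
{
    for (int i = 0; i < n_local; i++)
        if (local_ptrs[i] == addr)
            return 1;
    return 0;
}
```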

IMPLEMENTATION OF HTM The implemented HTM is modeled after TCC (ISCA 2004) in M5, a full-system simulator from the University of Michigan, Ann Arbor. It belongs to the lazy-lazy class of TM, where both conflict detection and the global memory update occur at commit time: aborts are cheap, commits are expensive.

IMPLEMENTATION OF HTM (cont.) The cache line is modified to track the read set and write set of a transaction. Each thread is identified by its unique Process Control Block base register value (this is Alpha-specific). The cache coherence protocol is modified to allow multiple updated copies of the same cache line. At each store, the address and value are inserted into a queue called the commit queue; however, if the SL bit of that word is set, the store is not included (explained later). During commit, each store in the queue is replayed.
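The store path described above can be sketched as below. The structure names and the fixed queue capacity are our assumptions for illustration; the point is only that a store whose word has the SL bit set never enters the commit queue.

```c
#include <stdint.h>
#include <stddef.h>

#define QCAP 256

typedef struct {
    uintptr_t addr;
    uint64_t  value;
} StoreEntry;

/* Per-thread commit queue of speculative stores awaiting replay. */
typedef struct {
    StoreEntry q[QCAP];
    size_t     n;
} CommitQueue;

/* In hardware, sl_bit would be read from the cache line's per-word
 * metadata; here the caller passes it in. */
void tx_store(CommitQueue *cq, uintptr_t addr, uint64_t value, int sl_bit)
{
    if (sl_bit)
        return;                    /* private word: filtered, no entry */
    if (cq->n < QCAP) {
        cq->q[cq->n].addr  = addr; /* shared word: replayed at commit */
        cq->q[cq->n].value = value;
        cq->n++;
    }
}
```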

IMPLEMENTATION OF HTM (cont.) The modifications to the coherence protocol are the following: A) When a processor writes, it does not invalidate the other copies; hence a processor write does not generate a bus write. B) When a processor wants to read a value for the first time, it is forced to go to the bus, but the other processors must not reply with their own modified values; hence the response to a Bus Read request is deactivated. This means the request ultimately gets satisfied by the non-speculative level of the memory hierarchy.
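Our simplified reading of these two protocol changes can be condensed into one decision function (the enum names are ours, not from the paper):

```c
typedef enum { OP_READ, OP_WRITE } MemOp;
typedef enum { BUS_NONE, BUS_READ_FROM_MEMORY } BusAction;

/* Which bus action, if any, does a transactional access trigger under
 * the modified protocol? */
BusAction bus_action(MemOp op, int first_access)
{
    if (op == OP_WRITE)
        return BUS_NONE;              /* (A) writes never invalidate peers */
    if (first_access)
        return BUS_READ_FROM_MEMORY;  /* (B) peers stay silent; the
                                         non-speculative memory answers */
    return BUS_NONE;                  /* later reads hit the local cache */
}
```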

IMPLEMENTATION OF HTM (cont.) When a thread wants to commit, it locks the bus (in first-come, first-served order) for the entire commit duration. Other threads can continue executing their transactions, or else they must spin waiting for commit permission. As each store passes over the common bus during commit, the other threads snoop the address and invalidate their own copies if there is a conflict. A doomed transaction is retried immediately.
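The snooping step can be sketched as a lookup of each broadcast store address against the local read set. Modeling the read set as a flat address array is our simplification; real hardware checks the R bits in the cache.

```c
#include <stdint.h>
#include <stddef.h>

/* While another thread commits, every store address it broadcasts is
 * snooped against this thread's read set; a hit dooms the transaction,
 * which then aborts and retries immediately. */
int snoop_conflict(const uintptr_t *readset, size_t n,
                   uintptr_t committed_addr)
{
    for (size_t i = 0; i < n; i++)
        if (readset[i] == committed_addr)
            return 1;   /* conflict detected: abort */
    return 0;           /* no conflict: keep running */
}
```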

The cache line is augmented with new bits: R denotes that the cache line has been read in a transaction; W denotes an update to the cache line; SL (Speculative Local), one bit per word, denotes that the word read or written is local to the thread.
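One possible layout of this metadata is sketched below. The line geometry (a 64-byte line of eight 8-byte words) and the struct name are our assumptions; the slides only specify the R, W, and per-word SL bits.

```c
#include <stdint.h>

#define WORDS_PER_LINE 8   /* assumed: 64-byte line, 8-byte words */

typedef struct {
    uint8_t  r  : 1;               /* line was read in a transaction   */
    uint8_t  w  : 1;               /* line was speculatively written   */
    uint8_t  sl;                   /* one SL bit per word (8 words)    */
    uint64_t data[WORDS_PER_LINE];
} TxCacheLine;

static inline void set_sl(TxCacheLine *l, int word)
{
    l->sl |= (uint8_t)(1u << word);
}

static inline int get_sl(const TxCacheLine *l, int word)
{
    return (l->sl >> word) & 1;
}
```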

IMPLEMENTATION OF HTM (cont.) The TLB structure is not modified; however, a previously unused bit in the protection field is used to hold the locality information of a page (bit number 21 in the case of Alpha).
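Reusing that bit amounts to a pair of trivial PTE operations, sketched below. Only the bit position (21, per the slides) is from the source; any further PTE layout is outside this sketch.

```c
#include <stdint.h>

#define PTE_LOCAL_BIT 21   /* unused protection-field bit on Alpha */

/* Is the page holding thread-local data? */
static inline int pte_is_local(uint64_t pte)
{
    return (int)((pte >> PTE_LOCAL_BIT) & 1u);
}

/* Mark the page as thread-local (e.g. on local_malloc's pages). */
static inline uint64_t pte_mark_local(uint64_t pte)
{
    return pte | (1ull << PTE_LOCAL_BIT);
}
```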

IMPLEMENTATION OF HTM (cont.) To filter out stack accesses, two new registers are added. They hold the stack bounds of the currently executing transaction.
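The bounds check these registers enable is a simple range test, sketched below with the two registers modeled as struct fields (the names are ours; the slides say only that the bounds come from the stack-pointer and frame-pointer registers).

```c
#include <stdint.h>

/* The two new bound registers, modeled as variables. */
typedef struct {
    uintptr_t stack_low;    /* e.g. stack pointer at the access     */
    uintptr_t stack_high;   /* e.g. frame base at transaction begin */
} StackBounds;

/* An access inside [stack_low, stack_high) is stack-private and is
 * filtered from the transaction's read/write sets. */
int is_stack_access(const StackBounds *b, uintptr_t addr)
{
    return addr >= b->stack_low && addr < b->stack_high;
}
```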

Source of speed-up During commit, a substantial amount of bus bandwidth is saved that would otherwise be wasted on committing local variables. For local variables, the commit is done simply by clearing the SL bits in the corresponding cache lines.
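The asymmetry behind the bandwidth saving can be sketched as follows: each queued shared store costs one bus transfer at commit, while private data is committed for free by clearing its SL bits. The function shape is ours, for illustration only.

```c
#include <stdint.h>
#include <stddef.h>

/* Commit sketch: returns the number of bus transfers performed.
 * Shared stores are replayed over the bus one by one; local data is
 * committed without bus traffic by zeroing the per-line SL bits. */
size_t commit(size_t queued_shared_stores, uint8_t *sl_bits, size_t n_lines)
{
    for (size_t i = 0; i < n_lines; i++)
        sl_bits[i] = 0;             /* local data: no bus traffic */
    return queued_shared_stores;    /* only shared stores hit the bus */
}
```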

RESULTS Filtered vs. unfiltered read/write set size (in bytes)

SPEED-UP Numbers

An average speed-up of 1.14x across the STAMP benchmarks is observed for the scalable-TCC type of HTM.

SPEED-UP Numbers (cont.) An average speed-up of 1.24x across the STAMP benchmarks is observed for the conventional-TCC type of HTM.

Commit Expedition Reduction in commit cycle time