Scalable, Reliable, Power-Efficient Communication for Hardware Transactional Memory
Seth Pugsley, Manu Awasthi, Niti Madan, Naveen Muralimanohar and Rajeev Balasubramonian

School of Computing, University of Utah

Introduction
Multi-cores have established themselves as the de facto architecture for current and coming processor generations. To fully exploit the processing power supplied by these cores, low-overhead, scalable methods for exploiting concurrency must be devised at both the hardware and system-software levels. Hardware Transactional Memory (HTM) systems proposed for this purpose have proven effective but not scalable: much of the overhead lies in the commit process, and it can be avoided with the novel commit algorithms we propose in this work. Specifically, we propose improvements to Stanford's Scalable-TCC, one of the frontrunners among HTM systems, to reduce commit overheads and increase scalability.

Background: Hardware Transactional Memory (HTM)
- A paradigm that simplifies parallel programming: instead of lock/unlock, the programmer uses transaction Begin and End.
- Can yield better performance and eliminate deadlocks.
- The programmer can freely encapsulate code sections in transactions without worrying about the impact on performance or correctness.
- The programmer specifies the code sections that should execute atomically; the hardware takes care of the rest and provides the illusion of atomicity.
- HTM systems are usually classified by their choice of data-versioning and conflict-detection mechanisms.

Stanford's Scalable-TCC
- Lazy versioning: changes are made locally, and the "master copy" is updated only when the transaction commits successfully.
- Lazy conflict detection: conflicts with other transactions are checked only after a transaction has finished all of its "work".
- Quick aborts.
- Commit is slow and is the bottleneck.

TCC Commit Process
- A centralized TID (Transaction ID) vendor grants an ID to each transaction, serializing the commit order.
- Probe write-set directories: check whether each directory in the write set is done serving older committing transactions.
- Send Skip messages to directories not in the write set.
- Send Mark messages: propagate write updates.
- Probe read-set directories: conflict detection.
- If the read check passes, send a final Commit message to all directories in the commit set and make the changes permanent.

Issues in Scalable-TCC
- The centralized TID vendor becomes a bottleneck as the number of cores grows.
- A large number of on-chip network messages are exchanged, so bandwidth and power requirements are high.
- The number of Skip messages grows with the number of cores in the system.
- Commit delays are a bottleneck when most transactions are relatively short.

Proposed Commit Algorithm (Sequential Commit, SEQ)
- Each directory has an "Occupied" bit.
- A committing transaction "occupies" all directories in its commit set sequentially (a sketch of this occupancy protocol follows this list).
- Requests to already-occupied directories are either buffered or NACKed.
- Once all directories are occupied, the transaction sends its updates and commits, possibly aborting other transactions.
- The sequential occupancy order makes the algorithm deadlock-free.
- The number of network messages is reduced.
- The algorithm is scalable: no centralized agent is involved.
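To make the commit step concrete, here is a minimal, single-threaded Python sketch of the sequential-occupancy idea. It is not the authors' implementation: the Directory class, the try_occupy/release methods, and the retry loop standing in for hardware buffering/NACKs are all illustrative assumptions.

```python
# Minimal sketch of the SEQ commit step (illustrative, not the paper's design).

class Directory:
    def __init__(self, dir_id):
        self.dir_id = dir_id
        self.occupied_by = None          # models the per-directory "Occupied" bit

    def try_occupy(self, txn_id):
        """Grant occupancy if free (or already held by this txn); otherwise NACK."""
        if self.occupied_by is None or self.occupied_by == txn_id:
            self.occupied_by = txn_id
            return True
        return False                     # NACK: directory busy with another commit

    def release(self, txn_id):
        if self.occupied_by == txn_id:
            self.occupied_by = None


def seq_commit(txn_id, commit_set, directories):
    """Occupy the commit-set directories one at a time, in ascending directory ID."""
    for dir_id in sorted(commit_set):
        while not directories[dir_id].try_occupy(txn_id):
            pass                         # in hardware: request is buffered or NACKed and retried
    # All directories occupied: propagate write updates and make them permanent.
    print(f"txn {txn_id}: committing to directories {sorted(commit_set)}")
    for dir_id in commit_set:
        directories[dir_id].release(txn_id)


if __name__ == "__main__":
    dirs = {i: Directory(i) for i in range(4)}
    seq_commit(txn_id=1, commit_set={2, 0, 3}, directories=dirs)
```

Because every committer acquires directories in the same fixed (ascending) order, two transactions can never hold resources in opposite orders, which is why the occupancy protocol cannot deadlock.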
Optimizations to SEQ
- SEQ-PRO (Parallel Readers Optimization): parallel reads to a directory do not cause conflicts, so reads from multiple transactions are allowed to occupy a directory at the same time.
- SEQ-TS (Timestamp Optimization): transactions carry timestamps, and an "older" transaction can steal occupied directories from a younger one. Directories can then be occupied in parallel, improving performance. (A small sketch of the timestamp rule appears after the references.)

Results
- Commit delays are reduced by 7x compared to Scalable-TCC.
- The number of network messages is reduced by up to 48x.

References
[1] S. Pugsley, M. Awasthi, N. Madan, N. Muralimanohar and R. Balasubramonian, "Scalable, Reliable, Power-Efficient Communication for Hardware Transactional Memory", SoC Technical Report UUCS, University of Utah, Jan 2008.
[2] H. Chafi et al., "A Scalable, Non-blocking Approach to Transactional Memory", HPCA 2007.
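As a rough illustration of the SEQ-TS rule described in the optimizations above, the sketch below models a single directory that grants, steals, or NACKs an occupancy request based on transaction timestamps. The class and method names are assumptions for illustration only, and the abort/retry of the transaction that loses its directory is not modeled.

```python
# Hedged sketch of the SEQ-TS stealing rule (timestamps assumed assigned at transaction begin).

class Directory:
    def __init__(self, dir_id):
        self.dir_id = dir_id
        self.holder = None               # (txn_id, timestamp) of the current occupant, if any

    def request(self, txn_id, timestamp):
        """Occupy if free, or steal from a *younger* occupant (larger timestamp)."""
        if self.holder is None:
            self.holder = (txn_id, timestamp)
            return "GRANTED"
        held_txn, held_ts = self.holder
        if timestamp < held_ts:          # requester is older: allowed to steal the directory
            self.holder = (txn_id, timestamp)
            return f"STOLEN from txn {held_txn}"
        return "NACK"                    # younger requester must wait for the older commit


if __name__ == "__main__":
    d = Directory(0)
    print(d.request(txn_id=7, timestamp=20))   # GRANTED
    print(d.request(txn_id=3, timestamp=10))   # older transaction steals: STOLEN from txn 7
    print(d.request(txn_id=9, timestamp=30))   # NACK
```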