Transactional Memory Coherence and Consistency
Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun
Presented by Peter Gilbert, ECE259, Spring 2008

Motivation
- Shared memory supports a simple programming model.
- However, shared memory requires complex consistency and cache coherence protocols.
- Consistency: must provide rules for ordering loads and stores; there is a tradeoff between ease of use and performance (e.g., SC vs. RC).
- Cache coherence: must track ownership of cache lines, which requires low latency for many small coherence messages.

Motivation
- The latency of small coherence messages is unlikely to scale well, while interprocessor bandwidth is likely to scale better.
- Message-passing models can exploit this bandwidth, but they are hard to program.
- Can we design a shared memory model with a simple programming model, simple hardware, and good performance, while taking advantage of available bandwidth?

TCC
- A transaction is a sequence of instructions that executes speculatively on a processor and completes only as an atomic unit.
- All writes are buffered locally and committed to shared memory only when the transaction completes.
- Commit is atomic: the entire effect of the transaction is committed to shared memory state at once.
- A transaction's effects are not visible to other processors until the commit is performed.
- The processor broadcasts its entire write buffer in one large commit packet; high broadcast bandwidth is needed, but latency is less important, and the interconnect need not provide ordering.
- System-wide view: transactions appear to execute in commit order.
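The buffered-write, atomic-commit semantics can be sketched in software. This is an illustrative Python model, not the paper's hardware; the `Transaction` class, `shared_mem` dictionary, and method names are all invented for this sketch.

```python
shared_mem = {}

class Transaction:
    """Toy model of a TCC transaction: writes go to a local buffer and
    become visible to shared memory only at commit, all at once."""

    def __init__(self, mem):
        self.mem = mem            # shared memory the transaction targets
        self.write_buffer = {}    # speculative writes, invisible to others

    def read(self, addr):
        # Reads see this transaction's own buffered writes first.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        return self.mem.get(addr, 0)

    def write(self, addr, value):
        self.write_buffer[addr] = value   # buffered, not yet visible

    def commit(self):
        # Atomic commit: the entire write buffer becomes visible at once.
        # In hardware this is one broadcast commit packet on the interconnect.
        packet = dict(self.write_buffer)
        self.mem.update(packet)
        self.write_buffer.clear()
        return packet                     # the "commit packet" others snoop

t = Transaction(shared_mem)
t.write("A", 42)
assert "A" not in shared_mem      # not visible before commit
t.commit()
assert shared_mem["A"] == 42      # entire effect visible after commit
```

The commit packet returned here stands in for the single large broadcast the slide describes: other processors see either none of the transaction's writes or all of them.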

Dependence violations
- Each processor must snoop commit packets to detect violations.
- A transaction that detects a violation must roll back and restart, which requires a register checkpointing mechanism.
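The snooping check itself is simple set intersection: each processor tracks the addresses its in-flight transaction has read (its read set) and compares them against every incoming commit packet. A minimal sketch, with all names invented for illustration:

```python
def snoop(read_set, commit_packet):
    """Return the set of violated addresses: addresses this transaction
    read speculatively that another processor's commit just overwrote.
    An empty result means the transaction may keep running."""
    return read_set & set(commit_packet)

# The in-flight transaction on CPU 1 has speculatively read A and B.
read_set = {"A", "B"}

# CPU 0 broadcasts a commit packet writing B and C.
violations = snoop(read_set, {"B": 7, "C": 9})
assert violations == {"B"}                   # CPU 1 must roll back and restart
assert snoop(read_set, {"D": 1}) == set()    # no overlap: keep running
```

On a non-empty result the hardware would restore the register checkpoint and re-execute the transaction from its start.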

Consistency
Consistency is simplified:
- Other consistency models define ordering rules between individual memory references; TCC enforces sequential ordering only between transaction commits.
- All references from an earlier commit appear to occur "before" all references from a later commit.
- Interleaving between memory accesses by different processors is allowed only at transaction boundaries.
- TCC can provide the illusion of uniprocessor execution by imposing the original program's transaction order.

Coherence
Coherence is simplified:
- No ownership of cache lines; writes are buffered.
- Invalidation or update occurs only by snooping commit packets.
- TCC does not rely on many latency-sensitive coherence messages.
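The two snooping options can be contrasted in a few lines. This is a hedged sketch of the idea, not the paper's cache logic; the function names and dictionary-as-cache model are invented here.

```python
def snoop_invalidate(cache, packet):
    """Invalidate protocol: lines matching a committed address are dropped;
    the next local read misses and fetches the committed data."""
    for addr in packet:
        cache.pop(addr, None)

def snoop_update(cache, packet):
    """Update protocol: lines matching a committed address are overwritten
    in place with the committed data (costs more broadcast bandwidth)."""
    for addr, value in packet.items():
        if addr in cache:
            cache[addr] = value

c1 = {"A": 1, "B": 2}
snoop_invalidate(c1, {"B": 7})
assert c1 == {"A": 1}            # B dropped, will be refetched on demand

c2 = {"A": 1, "B": 2}
snoop_update(c2, {"B": 7, "C": 9})
assert c2 == {"A": 1, "B": 7}    # B refreshed; C not cached, so ignored
```

Either way, all coherence work happens at commit time by snooping the one large packet, never through per-line ownership messages.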

Antidependencies
TCC automatically handles WAW and WAR hazards. [Figure: WAW and WAR examples]

Programming model
- The programmer inserts transaction boundaries.
- Similar to threading, but with no locks, so fewer errors.
- One hard rule: a transaction break cannot be placed between a load and a subsequent store of a shared value.
Steps for parallelizing code with TCC:
1. Divide the program into potentially parallel transactions (e.g., loop iterations, code after function calls). Transactions need not be independent; dependence violations are caught at runtime.
2. Specify ordering where needed.
3. Tune performance.
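Step 1 can be illustrated with a toy simulation of one ordered loop: each iteration becomes a transaction, all iterations execute speculatively against a possibly stale memory snapshot, and any iteration whose read set overlaps an earlier commit is simply re-executed. Everything here (the function, the `"sum"` cell, the execute/commit machinery) is invented for illustration; it is not the paper's interface.

```python
def sum_loop_as_transactions(data):
    """Toy TCC-style loop: iterations speculate out of order but commit
    in program order; violated iterations re-execute before committing."""
    mem = {"sum": 0}                       # shared memory
    n = len(data)

    def execute(i, snapshot):
        # Transaction i: read the running sum, add its element, buffer the write.
        read_set = {"sum"}
        writes = {"sum": snapshot["sum"] + data[i]}
        return read_set, writes

    # All iterations first execute speculatively against the initial memory.
    speculative = [execute(i, dict(mem)) for i in range(n)]

    # Commit in program order; snooping catches each violation at runtime.
    for i in range(n):
        read_set, writes = speculative[i]
        # Iteration i read "sum", which iteration i-1 just committed a write
        # to: a dependence violation, so re-execute with current memory.
        if i > 0 and "sum" in read_set:
            read_set, writes = execute(i, dict(mem))
        mem.update(writes)                 # atomic commit of the write buffer
    return mem["sum"]

assert sum_loop_as_transactions([1, 2, 3, 4]) == 10
```

The point of the sketch is the programming model: the iterations here are *not* independent, yet the code is still correct, because violations are detected and repaired at runtime rather than reasoned about by the programmer.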

Transaction ordering
- Most programs require ordering between certain transactions.
- Solution: assign a phase number to each transaction; only transactions from the oldest phase may commit.
- This mechanism can implement barriers or full ordering.
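The phase rule amounts to a priority scheme at the commit arbiter. A minimal sketch of the ordering decision (the function name and tuple representation are invented; the paper's arbiter is hardware):

```python
def commit_order(pending):
    """pending: list of (name, phase) for transactions waiting to commit.
    Return the names in the order the arbiter would grant commits:
    oldest phase first; transactions sharing a phase are unordered with
    respect to each other, so ties here keep arrival order."""
    return [name for name, _ in sorted(pending, key=lambda t: t[1])]

# T1 and T2 are in phase 0, T3 in phase 1: T3 must wait for both,
# which is exactly a barrier between phase 0 and phase 1.
pending = [("T3", 1), ("T1", 0), ("T2", 0)]
assert commit_order(pending) == ["T1", "T2", "T3"]
```

Giving every transaction a distinct phase yields full sequential ordering; giving a group of transactions the same phase and incrementing it together yields a barrier.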

Performance tuning
How to choose transactions:
- Large transactions amortize startup and commit overhead.
- Smaller transactions should be used when violations are frequent, to minimize the amount of lost work.
- The TCC system provides feedback about violations to facilitate tuning.

Hardware requirements
- A write buffer.
- Read bit(s), a modified bit, and optional renamed bits for L1 cache lines.
- System-wide commit arbitration.

Extensions: double buffering
- Allows a processor to work on the next transaction while the previous transaction waits to commit.
- Uses additional write buffers and additional sets of read and modified bits. [Figure: without double buffering, with an extra write buffer, and with an extra write buffer plus read bits]

Extensions: hardware-controlled transactions and I/O
- Hardware-controlled transactions: automatically divide the program into transactions as buffers overflow, taking full advantage of available buffer space. The programmer must still mark critical regions where transaction boundaries cannot occur. Small transactions can be merged automatically into larger ones.
- I/O: must guarantee no rollback after input is read, so the processor obtains commit permission before reading input. If the ordering of outputs is important, the same idea applies: request commit permission immediately.

Evaluation
- Can TCC extract parallelism from shared memory benchmarks?
- How large are the read and write states that must be buffered?
- What is the broadcast bandwidth requirement?

Simulation results
An optimal TCC model (infinite bus bandwidth, no memory delays) extracted parallelism well for many benchmarks. [Figure: speedups for automatically and manually parallelized benchmarks under the optimal TCC model]

Read and write state
- Most benchmarks needed 6-12 KB of read state and 4-8 KB of write state, which is reasonable for current caches and on-chip write buffers.
- Most of the benchmarks requiring large read and write state can probably be divided into smaller transactions (e.g., radix_l vs. radix_s). [Figure: write state for the smallest 10%, 50%, and 90% of iterations]

Broadcast bandwidth
- With an invalidate protocol and 32-bit addresses: an average of 0.5 bytes/cycle for 32 processors.
- With an update protocol: up to 16 bytes/cycle for 32 processors; if only dirty data is sent, only 8 bytes/cycle. [Figure: average bytes/cycle broadcast by a 1-IPC system with an update protocol]

Other parameters
- Snooping requirement: significantly less than 1 address/cycle; a single snoop port per processor is sufficient for up to 32 processors.
- Commit arbitration overhead: compiler-parallelized applications were insensitive, while performance suffered for applications with smaller transactions.
- The extensions did not yield benefits in most cases: extra read state bits (per-word rather than per-line) mattered for only a few applications, and double buffering did not help, though it should become useful when bandwidth is limited.

Conclusions
- TCC simplifies consistency and coherence: no rules for ordering individual memory references, and no latency-sensitive coherence messages.
- TCC provides a simple and flexible programming model: correctness is guaranteed without error-prone locks, tuning performance based on observed violations is straightforward, and uniprocessor ordering can be achieved by ordering all transactions.
- An optimal TCC implementation extracts parallelism well for a wide range of benchmarks.
- Performance will be limited by broadcast bandwidth and commit arbitration overhead; evaluation on a realistic hardware model is necessary.