1
UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 668 Multiprocessor Systems III Csaba Andras Moritz Papers: Hammond et al., Transactional Memory Coherence and Consistency, at Stanford [ISCA 2004]; Y. Guo et al., Synchronization Coherence (SyC), at UMass [JPDC 2007]; Textbook. Slides adapted from the authors'/groups' papers and focused on class needs. SyC - full Copyright Csaba Andras Moritz 2016
2
Slides Outline
Transactional Memory Coherence and Consistency (TCC) - Synchronization and Coherence with Transactions
Synchronization Coherence (SyC) - Cache Coherence with Fine-Grained Synchronization Hardware Primitives
3
Issues with Parallel Systems
Cache coherence protocols are complex
  Must track ownership of cache lines
  Difficult to implement and verify all corner cases
Memory consistency protocols are complex
  Must provide rules to correctly order individual loads/stores
  Difficult for both hardware and software
Protocols rely on low latency, not bandwidth
  Critical short control messages on ownership transfers
  Latency of short messages is unlikely to scale well in the future
  Bandwidth is likely to scale much better (as discussed in L1)
    High-speed interchip connections
    Multicore (CMP) = on-chip bandwidth
4
Desirable Multiprocessor System
A shared memory system with:
  A simple, easy programming model (unlike message passing)
  A simple, low-complexity hardware implementation (unlike conventional shared memory implementations)
  Good performance (unlike many protocols that generate bus transactions)
5
Lock Freedom: Why is using synchronization locks bad?
Common problems in conventional locking mechanisms:
  Priority inversion: a low-priority process is preempted while holding a lock needed by a high-priority process
  Convoying: when a process holding a lock is de-scheduled (e.g., on a page fault), there is no forward progress for other processes capable of running
  Deadlock (or livelock): processes/threads may be poorly designed by programmers so that they lock the same set of objects in different orders (see the sketch after this list)
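A minimal C++ sketch of the lock-ordering deadlock mentioned in the last bullet (my own illustration, not from the slides; the lock and function names are hypothetical):

```cpp
#include <mutex>

std::mutex lock_a, lock_b;   // locks protecting two shared objects

// Thread 1 acquires A then B; Thread 2 acquires B then A.
// If each thread grabs its first lock before the other releases,
// both wait forever on the second lock: a classic deadlock.
void thread1_work() {
    std::lock_guard<std::mutex> ga(lock_a);
    std::lock_guard<std::mutex> gb(lock_b);   // may block forever
    // ... update both shared objects ...
}

void thread2_work() {
    std::lock_guard<std::mutex> gb(lock_b);
    std::lock_guard<std::mutex> ga(lock_a);   // may block forever
    // ... update both shared objects ...
}
```

A consistent lock ordering (or acquiring both locks together with std::scoped_lock) avoids this; transactions sidestep the problem entirely.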
6
Using Transactions What is a transaction?
A sequence of instructions that is guaranteed to execute and complete only as an atomic unit:
  Begin Transaction
    Inst #1
    Inst #2
    Inst #3
    …
  End Transaction
Transactions satisfy the following properties:
  Serializability: transactions appear to execute serially.
  Atomicity (or failure-atomicity): a transaction either commits its changes when complete, making them visible to all, or aborts, discarding its changes (and will retry).
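As a rough illustration of the all-or-nothing retry semantics (not the TCC hardware; the tx_begin/tx_commit names are hypothetical and are stubbed with a global lock only so the sketch compiles):

```cpp
#include <mutex>

// Hypothetical transactional interface; a real HTM would buffer the writes
// and detect conflicts instead of taking a lock.
static std::mutex g_tx_lock;
void tx_begin()  { g_tx_lock.lock(); }
bool tx_commit() { g_tx_lock.unlock(); return true; }  // the stub never aborts

void atomic_increment(int &counter) {
    do {
        tx_begin();
        int tmp = counter;    // would be added to the transaction's read set
        counter = tmp + 1;    // would be buffered until commit
    } while (!tx_commit());   // all-or-nothing: retry the whole body on abort
}
```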
7
TCC (Stanford) Transactional Coherence and Consistency (TCC)
Programmer-defined groups of instructions within a program:
  Begin Transaction   -> start buffering results
    Inst #1
    Inst #2
    Inst #3
    …
  End Transaction     -> commit results now
Only commit machine state at the end of each transaction
  Each transaction must update machine state atomically, all at once
  To other processors, all instructions within one transaction appear to execute only when the transaction commits
  These commits impose an order on how processors may modify machine state
8
Transaction Code Example
MIT LTM instruction set:

  xstart:   XBEGIN on_abort     // begin transaction; on abort, control transfers to on_abort
            lw   r1, 0(r2)      // transactional load
            addi r1, r1, 1
            . . .
            XEND                // attempt to commit the transaction

  on_abort: …                   // back off
            j xstart            // retry
9
Transactional Memory Transactions appear to execute in commit order
Flow (RAW) dependencies cause transaction violation and restart.
[Figure: Transactions A, B, and C execute in parallel. Transaction A (ld 0xdddd, st 0xbeef) arbitrates and commits, broadcasting its write to 0xbeef. Transaction B (ld 0xdddd, ld 0xbbbb) commits without conflict. Transaction C speculatively loaded 0xbeef before A's commit, so it detects a violation when A's commit is broadcast and re-executes with the new data.]
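A small sketch of the violation check implied by the figure, assuming each transaction tracks a read set of speculatively loaded addresses (the data structures and names are mine, not the paper's hardware):

```cpp
#include <unordered_set>
#include <cstdint>

// If a committing transaction's write set intersects another transaction's
// speculative read set, that reader has a RAW violation and must re-execute
// with the newly committed data.
struct Transaction {
    std::unordered_set<uint64_t> read_set;   // addresses speculatively loaded
    bool violated = false;
};

void snoop_commit(Transaction &t,
                  const std::unordered_set<uint64_t> &commit_writes) {
    for (uint64_t addr : commit_writes) {
        if (t.read_set.count(addr)) {   // someone committed to a line we read
            t.violated = true;          // -> roll back and re-execute
            return;
        }
    }
}
```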
10
Transactional Memory
Output and anti-dependencies are automatically handled
  WAW is handled by writing buffers back only in commit order (think about sequential consistency)
[Figure: Transactions A and B each store X into their local buffers; the stores reach shared memory only at commit, in commit order.]
11
Transactional Memory
Output and anti-dependencies are automatically handled
  WAW is handled by writing buffers back only in commit order
  WAR is handled by keeping all writes private until commit; local stores supply data to the transaction's own later loads
[Figure: Transaction A stores X = 1 and Transaction B stores X = 3. Each transaction's own loads of X return its locally buffered value (local stores supply data), and the buffered values reach shared memory only at commit, in commit order: X = 1, then X = 3.]
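A software sketch of the private write buffer behavior described above (illustrative only; a real TCC implementation keeps this state in the cache and write buffer hardware):

```cpp
#include <unordered_map>
#include <cstdint>

// Speculative stores stay private until commit, and local loads are satisfied
// from the buffer first, so in-flight writes never create WAW/WAR hazards
// visible to other transactions.
struct WriteBuffer {
    std::unordered_map<uint64_t, uint32_t> pending;   // addr -> buffered value

    void store(uint64_t addr, uint32_t value) { pending[addr] = value; }

    // Load: the local buffered value wins; otherwise read committed memory.
    // (Assumes addr is already present in 'memory'.)
    uint32_t load(uint64_t addr,
                  const std::unordered_map<uint64_t, uint32_t> &memory) const {
        auto it = pending.find(addr);
        return it != pending.end() ? it->second : memory.at(addr);
    }

    // Commit: drain the whole buffer into shared memory in one step,
    // in this transaction's commit-order slot.
    void commit(std::unordered_map<uint64_t, uint32_t> &memory) {
        for (const auto &kv : pending) memory[kv.first] = kv.second;
        pending.clear();
    }
};
```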
12
TCC System Similar to prior thread-level speculation (TLS) techniques
CMU Stampede, Stanford Hydra, Wisconsin Multiscalar, UIUC speculative multithreading CMP
Loosely coupled TLS system
Completely eliminates conventional cache coherence and consistency models
  No MESI-style cache coherence protocol
  But requires new hardware support
13
The TCC Cycle Transactions run in a cycle
Speculatively execute code and buffer results
Wait for commit permission
  Phase provides synchronization, if necessary
  Arbitrate with other processors
Commit stores together (as a packet)
  Provides a well-defined write ordering
  Can invalidate or update other caches
  Large packet utilizes bandwidth effectively
And repeat
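A control-flow sketch of the cycle above; the functions are empty stubs standing in for hardware actions, and the names are mine:

```cpp
// Each stub marks one step of the TCC cycle.
void execute_and_buffer()      { /* run transaction code, buffer all stores */ }
void wait_for_phase()          { /* hold commit until this transaction's phase is up */ }
void arbitrate_for_commit()    { /* win system-wide commit permission */ }
void broadcast_commit_packet() { /* send buffered stores as one large packet */ }

void tcc_cycle() {
    for (;;) {
        execute_and_buffer();
        wait_for_phase();            // provides synchronization, if necessary
        arbitrate_for_commit();
        broadcast_commit_packet();   // other caches snoop, update/invalidate, flag violations
    }                                // ...and repeat with the next transaction
}
```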
14
Advantages of TCC
Trades bandwidth for simplicity and latency tolerance
  Easier to build
  Not dependent on the timing/latency of loads and stores
Transactions eliminate locks
  Transactions are inherently atomic
  Catches most common parallel programming errors
Shared memory consistency is simplified
  The conventional model sequences individual loads and stores
  Now the hardware only has to sequence transaction commits
Shared memory coherence is simplified
  Processors may have copies of cache lines in any state (no MESI!)
  Commit order implies an ownership sequence
15
How to Use TCC Divide code into potentially parallel tasks
Usually loop iterations
  For the initial division, tasks = transactions
  But tasks can be subdivided or grouped to match hardware limits (buffering)
Similar to threading in conventional parallel programming, but:
  We do not have to verify parallelism in advance
  Locking is handled automatically
  Easier to get parallel programs running correctly
The programmer then orders transactions as necessary
  Ordering techniques are implemented using phase numbers
  Deadlock-free (at least one transaction is always the oldest)
  Livelock-free (watchdog hardware can easily insert barriers anywhere)
16
How to Use TCC Three common ordering scenarios
Unordered, for purely parallel tasks
Fully ordered, to specify sequential tasks (algorithm level)
Partially ordered, to insert synchronization such as barriers
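One way to picture phase-number ordering (a simplified software sketch; the atomic counter and function names are my illustration, not TCC's actual mechanism):

```cpp
#include <atomic>

// A transaction may commit only once the global phase reaches its own phase:
//  - Unordered tasks: all transactions share one phase number.
//  - Fully ordered tasks: transaction i gets phase i; the phase advances
//    after each commit.
//  - Partially ordered tasks: a barrier is a point where every task moves
//    to the next phase number.
std::atomic<int> global_phase{0};

void wait_for_commit_permission(int my_phase) {
    while (global_phase.load(std::memory_order_acquire) < my_phase) {
        // spin: transactions with older phase numbers commit first
    }
    // ... arbitrate and broadcast the commit packet here ...
}

void advance_phase() {
    global_phase.fetch_add(1, std::memory_order_release);
}
```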
17
Basic TCC Transaction Control Bits
In each local cache:
  Read bits (per cache line, or per word to eliminate false sharing)
    Set on speculative loads
    Snooped by a committing transaction (writes by another CPU)
  Modified bits (per cache line)
    Set on speculative stores
    Indicate what to roll back if a violation is detected
    Different from the dirty bit
  Renamed bits (optional)
    At word or byte granularity
    Indicate local updates (WAR) that do not cause a violation
    Subsequent reads of words with these bits set do NOT set read bits, because a local WAR is not considered a violation
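A sketch of how these control bits might sit next to a cache line (field widths and names are illustrative; I assume 8 words per line):

```cpp
#include <bitset>

// Per-line transaction control state, kept alongside the cache tag.
struct TccLineState {
    std::bitset<8> read_bits;      // per word: set on speculative loads
    std::bitset<8> renamed_bits;   // per word: set when the word was written locally first
    bool modified = false;         // per line: set on speculative stores (rollback target)
};

// Speculative load: only mark the read bit if the word was not locally
// renamed -- a local write-then-read (WAR) is not a violation.
void on_speculative_load(TccLineState &line, unsigned word) {
    if (!line.renamed_bits.test(word))
        line.read_bits.set(word);
}

// Snoop from a committing transaction: a write to a word we read is a violation.
bool on_commit_snoop(const TccLineState &line, unsigned written_word) {
    return line.read_bits.test(written_word);   // true -> violation, roll back
}
```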
18
During A Transaction Commit
Need to collect all of the modified cache lines together into a commit packet
Potential solutions:
  A separate write buffer, or
  An address buffer maintaining a list of the line tags to be committed (what size?)
Broadcast all writes out as one single (large) packet to the rest of the system
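A sketch of assembling a commit packet from the address-buffer option above (the data layout is my illustration, and I assume the tag list and modified-line list are index-aligned):

```cpp
#include <vector>
#include <cstdint>

struct CommitPacket {
    std::vector<uint64_t> line_tags;               // which lines changed
    std::vector<std::vector<uint8_t>> line_data;   // their new contents
};

CommitPacket build_commit_packet(const std::vector<uint64_t> &modified_tags,
                                 const std::vector<std::vector<uint8_t>> &modified_lines) {
    CommitPacket pkt;
    for (size_t i = 0; i < modified_tags.size(); ++i) {
        pkt.line_tags.push_back(modified_tags[i]);
        pkt.line_data.push_back(modified_lines[i]);   // copy the modified line
    }
    return pkt;   // broadcast as one large packet, then clear the modified bits
}
```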
19
Re-execute A Transaction
Rollback is needed when a transaction cannot commit
Checkpoints are needed prior to a transaction
Checkpoint memory
  Use the local cache
  Overflow issue: conflict or capacity misses require all the victim lines to be kept somewhere (e.g., a victim cache)
Checkpoint register state
  Hardware approach: flash-copying the rename table / architectural register file
  Software approach: extra instruction overheads
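A register-checkpoint sketch of the idea above (32 architectural registers assumed; real hardware would flash-copy the rename table rather than copy values one by one):

```cpp
#include <array>
#include <cstdint>

struct RegFile { std::array<uint64_t, 32> r{}; };   // architectural registers

struct Checkpoint {
    RegFile saved;
    void take(const RegFile &live)    { saved = live; }   // at transaction begin
    void restore(RegFile &live) const { live = saved; }   // on violation/rollback
};
```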
20
Sample TCC Hardware Write buffers and L1 Transaction Control Bits
Write buffer in the processor, before broadcast
A broadcast bus or network to distribute commit packets
  All processors see the commits in a single order
  Snooping on broadcasts triggers violations, if necessary
Commit arbitration/sequence logic
21
Ideal Speedups with TCC
equake_l: long transactions
equake_s: short transactions
22
Speculative Write Buffer Needs
Only a few KB of write buffering needed
  Set by the natural transaction sizes in applications
  A small write buffer can capture 90% of modified state
23
Broadcast Bandwidth
Broadcast is bursty
Average bandwidth:
  Needs ~16 at 32 processors with whole modified lines
  Needs ~8 at 32 processors with dirty data only
High, but feasible on-chip
24
TCC vs MESI [PACT 2005]
[Figure: results by application, protocol, and processor count]
25
Further Readings
M. Herlihy and J. E. B. Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993.
R. Rajwar and J. R. Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001.
R. Rajwar and J. R. Goodman, “Transactional Lock-Free Execution of Lock-Based Programs,” ASPLOS 2002.
J. F. Martinez and J. Torrellas, “Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications,” ASPLOS 2002.
L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, “Transactional Memory Coherence and Consistency,” ISCA 2004.
C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie, “Unbounded Transactional Memory,” HPCA 2005.
A. McDonald, J. Chung, H. Chafi, C. C. Minh, B. D. Carlstrom, L. Hammond, C. Kozyrakis, and K. Olukotun, “Characterization of TCC on Chip-Multiprocessors,” PACT 2005.
26
Synchronization Coherence
A Transparent Hardware Mechanism for Cache Coherence and Fine-Grained Synchronization, Yao Guo, Vladimir Vlassov, Raksit Ashok, Richard Weiss, and Csaba Andras Moritz, Journal of Parallel and Distributed Computing, 2007. The paper is accessible from my group's site; follow the link to papers.
27
Goals of SyC
Provide transparent fine-grained synchronization and caching in multiprocessor machines and single-chip multiprocessors
Merge fine-grained synchronization (FGS) mechanisms with the traditional cache coherence protocols we discussed
  FGS reduces network utilization as well as synchronization-related processing overheads (similar to dataflow execution in CPUs)
Minimal added hardware complexity compared to cache coherence mechanisms or previously reported fine-grained synchronization techniques
28
Why FGS?
One popular approach improves the efficiency of traditional coarse-grained synchronization (combined with speculative execution beyond synchronization points). Discussed this before.
  Potential scalability limitations in many-cores with 100s of cores
  Speculation may be costly in power consumption!
Another suggested alternative is fine-grained synchronization, such as synchronization at the word level.
  Efficient, but how do you combine it with caching? SyC does it transparently.
29
MIT Alewife CC-NUMA Machine
Implements FGS
Directory-based CC-NUMA multiprocessor with a full/empty tagged shared memory.
Hardware support for fine-grained word-level synchronization includes full/empty bits, special memory operations, and a special full/empty trap.
Software support includes trap handler routines for context switching and waiting for synchronization.
While the Alewife architecture supports fine-grained synchronization and shows demonstrable benefits over a coarse-grained approach, it still implements synchronization in a software layer above the cache coherence protocol layer. Keeping the two layers separate entails additional communication overhead.
30
Alewife FGS Synch Structures
A J-structure is an abstract data type that can be used for producer-consumer process interaction. A J-structure has the semantics of a write-once variable: it can be written only once, and a read from an empty J-structure suspends until the structure is filled. An instance of a J-structure is initially empty. See more in the paper.
A lockable L-structure is an abstract data type that has three operations: a non-locking peek (L-peek), a locking read (L-read), and an unlocking write (L-write). An L-peek is similar to a J-read: it waits until the structure is full, and then returns its value without changing the state. An L-write is similar to a J-write: it stores the value to an empty structure, changes its state to full, and releases all pending L-reads, if any.
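For intuition, a software analogy of one J-structure element using a mutex and condition variable (my own sketch; Alewife and SyC implement the full/empty semantics in hardware, without this locking overhead):

```cpp
#include <mutex>
#include <condition_variable>

// Write-once element: a read of an empty element blocks until it is filled.
class JElement {
    std::mutex m;
    std::condition_variable cv;
    int value = 0;
    bool full = false;            // plays the role of the full/empty bit
public:
    void j_write(int v) {         // fill an empty element, wake pending readers
        std::lock_guard<std::mutex> g(m);
        if (full) return;         // write-once: a second write is ignored in this sketch
        value = v;
        full = true;
        cv.notify_all();
    }
    int j_read() {                // suspend until the element is full
        std::unique_lock<std::mutex> g(m);
        cv.wait(g, [&] { return full; });
        return value;
    }
};
```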
31
SyC Overview Assume a multiprocessor with full/empty tagged shared memory (FE-memory) where each location (e.g. word) can be tagged as full or empty (i.e., it has a full/empty bit (FE-bit) associated with it): if the bit is set the location is full, otherwise the location is empty. In order to provide fine-grained synchronization such as MIT J-structures and L-structures described above, this multiprocessor supports special synchronized FE-memory operations (reads and writes) that can depend on the FE-bit and can alter the FE-bit in addition to reading or writing the target location. Architecture should include corresponding FE-memory instructions as well as full/empty conditional branch instructions.
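A minimal sketch of a full/empty-tagged word and synchronized read/write semantics as described above (field and function names are mine; a real implementation would trap or record a pending operation rather than return false):

```cpp
#include <cstdint>

struct FEWord {
    uint32_t data = 0;
    bool     full = false;   // the FE-bit
};

// Synchronized write: succeeds only if the word is empty, then sets full.
bool fe_write(FEWord &w, uint32_t v) {
    if (w.full) return false;    // hardware would record a pending write instead
    w.data = v;
    w.full = true;
    return true;
}

// Synchronized read: succeeds only if the word is full; optionally empties it
// (e.g., a consuming/locking read such as an L-read).
bool fe_read(FEWord &w, uint32_t &out, bool consume) {
    if (!w.full) return false;   // hardware would record a pending read instead
    out = w.data;
    if (consume) w.full = false;
    return true;
}
```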
32
SyC contd. A vector of FE-bits and a vector of P-bits associated with words in a memory block are stored in the coherence directory and as an extra field in the cache tag when the block is cached. A P-bit indicates whether there are operations (reads or writes) pending on a corresponding word: if set, it means there are pending synchronized reads (if the word is empty) or pending writes (if the word is full) for the corresponding word. This information is required so that a successfully executed altering operation can immediately satisfy one or more pending operations. This way, a tag (directory) lookup includes tag match and inspection of full/empty bits. Each home node (directory) also contains a State Miss Buffer (SMB) that holds information regarding which nodes have pending operations for a given word and whether operations are altering or not. The synchronization miss is treated as a cache miss, and information on the miss is recorded at the cache, e.g., in Miss Information/Status Holding Register (MSHR) and a write buffer.
33
Architectural Support
Assumes a processor system with a 32-byte (4-word) cache block.
In addition to the 32-byte block and the tag bits, each cache block has 8 extra bits: 4 FE-bits and 4 P-bits (about 3% storage overhead per cache block).
The cache controller not only matches the tag but also checks the full/empty bits depending on the instruction, and makes a decision based on the state of the cache line as well as the associated FE-bits and P-bits.
The directory controller is also modified to account for the Synch Miss Buffer (SMB) implementing the SyC protocol in the directory.
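A sketch of the extended cache block layout and the combined tag/FE-bit lookup described above (types and field names are illustrative):

```cpp
#include <array>
#include <bitset>
#include <cstdint>

// 32-byte block (4 x 8-byte words) with a 4-bit FE vector and a 4-bit P
// vector kept next to the tag.
struct SyCCacheBlock {
    uint64_t tag = 0;
    std::array<uint64_t, 4> words{};
    std::bitset<4> fe;   // full/empty bit per word
    std::bitset<4> p;    // pending-operation bit per word
};

// Lookup = tag match plus FE-bit inspection for synchronized accesses.
// Returns true if a synchronized read of 'word' can complete now; an empty
// word would instead be treated as a synchronization miss.
bool sync_read_hit(const SyCCacheBlock &blk, uint64_t tag, unsigned word) {
    return blk.tag == tag && blk.fe.test(word);
}
```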
34
Architectural Support Overview for SyC
35
Programming Model – FE-M Instruct.
Requires compiler support, L-/J-structures, or a suitable programming model similar to dataflow; KTH initiated the Extended Dataflow Architecture model.
36
Protocol Transactions Support
Some extensions to MESI are needed for non-ordinary (synchronized) reads and writes. See Section 3.3 of the paper if interested.
37
Example Results