CS5102 High Performance Computer Systems Distributed Shared Memory


CS5102 High Performance Computer Systems Distributed Shared Memory Prof. Chung-Ta King Department of Computer Science National Tsing Hua University, Taiwan (Slides are from textbook, Prof. O. Mutlu, Prof. Hsien-Hsin Lee, Prof. K. Asanovic, http://compas.cs.stonybrook.edu/courses/cse502-s14/)

Outline: Introduction; Centralized shared-memory architectures (Sec. 5.2); Distributed shared-memory and directory-based coherence (Sec. 5.4); Synchronization: the basics (Sec. 5.5); Models of memory consistency (Sec. 5.6)

Distributed Shared Memory (DSM). Non-uniform memory access (NUMA) architecture: memory is distributed among the processors but logically shared, and all processors can directly access all memory. Can use scalable point-to-point interconnection networks → no single point of coordination, simultaneous communications.

Can a snoopy protocol work for cache coherence on a DSM? It relies on broadcast; how would write propagation and write serialization be ensured over a point-to-point network?

Races in DSM. Consider a DSM with a mesh interconnection network. Different caches (PEs) will see different orders of writes: races due to the network. [Figure: a mesh of PEs with routers (R), each PE containing a processor, cache, network interface, and memory; two stores to X (st X,0 and st X,2) from different PEs propagate along different paths.]

Scalable Cache Coherence. We need a mechanism to serialize/order writes. Idea: instead of relying on the interconnection network to provide serialization, ask a coordinating node → the directory. [Figure: the same mesh, with all writes to X ordered through the directory node for X.]

Scalable Cache Coherence. Directory: a single point of serialization, with one entry per block. An entry tracks the cached copies (sharer set) of its block. Processors make requests for blocks through the directory. The directory coordinates invalidation appropriately and communicates only with processors that have copies, e.g., P1 asks the directory for an exclusive copy, the directory asks all sharers to invalidate, waits for ACKs, then responds to P1. Communication with the directory and the copies is through network transactions, but is independent of the network as long as it provides point-to-point communication. The directory can be centralized or distributed.

Directory-based Cache Coherence. The directory tracks who has what: every memory block has an entry in the directory. HW overhead for the directory is roughly (# blocks × # nodes). Can work with a UMA SMP, too! [Figure: processors with private caches connected through an interconnection network to memory and a (centralized) directory; each entry holds a modified bit and presence bits, one for each node.]

Directory-based Cache Coherence. [Figure: for each cache block C(k), C(k+1), ..., C(k+j) in memory, the (centralized) directory keeps 1 modified bit and 1 presence bit per processor.]
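
To make the storage layout above concrete, here is a minimal C sketch of a full-bit-vector directory; the sizes (NUM_NODES, NUM_BLOCKS) and helper names are assumptions for illustration, not part of any particular machine.

#include <stdbool.h>
#include <stdint.h>

#define NUM_NODES  16      /* assumed number of nodes (fits one uint32_t)   */
#define NUM_BLOCKS 4096    /* assumed number of memory blocks per home node */

/* One directory entry per memory block: a modified bit plus one presence
 * bit per node (full bit-vector sharer set).                              */
typedef struct {
    bool     modified;     /* true => exactly one presence bit names the owner */
    uint32_t presence;     /* bit i set => node i may hold a copy of the block */
} dir_entry_t;

static dir_entry_t directory[NUM_BLOCKS];   /* ~ NUM_BLOCKS * (NUM_NODES + 1) bits */

static void add_sharer(dir_entry_t *e, int node)      { e->presence |=  (1u << node); }
static void remove_sharer(dir_entry_t *e, int node)   { e->presence &= ~(1u << node); }
static bool is_sharer(const dir_entry_t *e, int node) { return (e->presence >> node) & 1u; }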

Distributed Directory Cache Coherence. Distributed directories track their local memory to maintain cache coherence (CC-NUMA). Assumptions: reliable network, FIFO message delivery between any given source-destination pair.

Directory Coherence Protocol: Read Miss. Every memory block has a "home" node, where its directory entry resides; the home node can be calculated from the block address. [Figure: P0 read-misses on block Z, which is shared (clean) in other caches; the request goes to the home node of Z, whose directory entry records the sharers with presence bits.]
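
One common mapping (an assumption here, not mandated by the slides) interleaves memory across nodes at block granularity, so the home node falls out of the block address directly:

#include <stdint.h>

#define BLOCK_SIZE 64   /* bytes per cache block (assumed) */
#define NUM_NODES  16   /* assumed number of nodes         */

/* Home node under simple block interleaving: block number modulo the
 * number of nodes. Real machines may instead use high-order address
 * bits or a programmable mapping table.                               */
static inline int home_node(uint64_t phys_addr)
{
    uint64_t block_number = phys_addr / BLOCK_SIZE;
    return (int)(block_number % NUM_NODES);
}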

Directory Coherence Protocol: Read Miss, dirty block. Block Z is dirty ("Modified" in Pn-1); after the miss is serviced, block Z is clean, "Shared" by the nodes involved (M→S). Design choice: who is responsible for sending the up-to-date copy to the requesting node? [Figure: P0 read-misses on Z; the request goes to the home node, which asks the owner Pn-1; the owner replies with the block, and a reply is sent to the requesting node P0; presence bits are updated.]

Directory Coherence Protocol: Write Miss. P0 write-misses on block Z; after the miss is serviced, P0 can write to block Z (I→M). Design choice: who invalidates the other caches, the directory or the requesting node? [Figure: the request goes to the home node; the block is replied to P0, the sharers are invalidated, and acknowledgements are collected; presence bits are updated.]

Directory: Basic Operations. Follow the semantics of a snoop-based system (MSI), but with explicit request and reply messages. The directory: receives Read and ReadEx requests from nodes; sends Invalidate messages to sharers if needed; forwards the request to memory if needed; replies to the requestor and updates the sharing state. The protocol design is flexible: exact forwarding paths depend on the implementation, e.g., does the directory node or the requesting node perform all bookkeeping operations? The protocol must not have race conditions.
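
To make the explicit request/reply structure concrete, here is a hedged sketch of the message types and a message record such a protocol might use; the names mirror the slides where possible, but the exact set and fields are assumptions.

#include <stdint.h>

/* Coherence messages exchanged between caches and the home directory.
 * Requests flow from a node to the directory; the directory replies
 * and, when needed, sends invalidations or forwards to the owner.     */
typedef enum {
    MSG_READ_REQ,       /* read miss: ask the home for a shared copy      */
    MSG_READ_EX_REQ,    /* write miss: ask the home for an exclusive copy */
    MSG_INVALIDATE,     /* directory -> sharer: drop your copy            */
    MSG_INV_ACK,        /* sharer -> directory: invalidation done         */
    MSG_FWD_READ_REQ,   /* directory -> owner: supply data to requestor   */
    MSG_DATA_REPLY,     /* carries the cache block to the requestor       */
    MSG_WRITEBACK       /* owner -> home: replacing a dirty block         */
} msg_type_t;

typedef struct {
    msg_type_t type;
    int        src_node;     /* sender of this message                  */
    int        requestor;    /* node whose miss started the transaction */
    uint64_t   block_addr;   /* block this message refers to            */
    uint8_t    data[64];     /* payload for data-carrying messages      */
} coherence_msg_t;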

A Possible 4-hop Implementation. L has a cache read miss on a load instruction; H is the home node; R is the current owner of the block, which is in the Modified state and therefore holds the most up-to-date data. Directory entry at H: State M, Owner R. Message sequence: (1) L sends a Read Req to H; (2) H sends a Recall Req to R; (3) R sends a Recall Reply back to H; (4) H sends a Read Reply to L.

A Possible 3-hop Implementation. Same scenario: L has a cache read miss on a load instruction, H is the home node, and R is the current owner of the block, which is in the Modified state and therefore holds the most up-to-date data. Directory entry at H: State M, Owner R. Message sequence: (1) L sends a Read Req to H; (2) H forwards the Read Req to R; (3) R sends a Read Reply directly to L and a Fwd'd Read Ack to H.
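
The design difference shows up in how the home node handles a read request when the block is Modified; the sketch below, with a hypothetical send_msg() helper, contrasts the two choices (it is illustrative, not the protocol of any specific machine).

#include <stdint.h>
#include <stdio.h>

/* Hypothetical message types and transport; names are illustrative only. */
typedef enum { MSG_RECALL_REQ, MSG_FWD_READ_REQ } fwd_msg_t;

static void send_msg(int dest, fwd_msg_t type, int requestor, uint64_t addr)
{
    printf("to node %d: msg %d (requestor %d, block 0x%llx)\n",
           dest, (int)type, requestor, (unsigned long long)addr);
}

/* 4-hop: the home recalls the block from owner R and later replies to L
 * itself (hops: L->H, H->R, R->H, H->L).                                 */
static void home_read_modified_4hop(int L, int R, uint64_t addr)
{
    send_msg(R, MSG_RECALL_REQ, L, addr);
    /* On the Recall Reply from R, the home sends the Read Reply to L. */
}

/* 3-hop: the home forwards the request; owner R sends the data directly
 * to L and an ack to the home (hops: L->H, H->R, R->L).                  */
static void home_read_modified_3hop(int L, int R, uint64_t addr)
{
    send_msg(R, MSG_FWD_READ_REQ, L, addr);
    /* R then sends the Read Reply to L and a Fwd'd Read Ack to H. */
}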

Example Cache States. For each block, its home directory maintains a state. Shared: one or more nodes have the block cached, and the value in memory is up-to-date. Modified: exactly one node has a dirty copy of the cache block, and the value in memory is out-of-date. Transient states may be added to indicate that the block is waiting for previous coherence operations to complete. The caches in the nodes also need to track the state (e.g., MSI) of their cached blocks. Nodes send coherence messages to the home directory; the home directory only sends messages to the nodes that care.
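
A minimal encoding of these states might look as follows; the single "busy" transient state is a simplification (real protocols typically need several), and the names are assumptions.

/* Directory-side state kept at the home node for each block. */
typedef enum {
    DIR_UNCACHED,   /* no node holds a copy                              */
    DIR_SHARED,     /* one or more clean copies; memory is up-to-date    */
    DIR_MODIFIED,   /* exactly one dirty copy; memory is out-of-date     */
    DIR_BUSY        /* transient: waiting for a previous coherence
                       operation (e.g., outstanding invalidations)       */
} dir_state_t;

/* Per-block state kept in each node's own cache (MSI). */
typedef enum { CACHE_INVALID, CACHE_SHARED, CACHE_MODIFIED } cache_state_t;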

Operations in Directory (MSI). For an uncached block: Read miss: the requesting node is sent the requested data and is made the only sharing node; the block is now shared in the directory and in the requesting node. Write miss: the requesting node is sent the requested data and becomes the owner node; the block is now modified in the directory and in the requesting node.

Operations in Directory (MSI). For a shared block: Read miss: the requesting node is sent the requested data from memory, the node is added to the sharing set, and the block is shared in the directory and in the requesting node. Write miss: the requesting node is sent the block, all nodes in the sharing set are sent invalidate messages, the sharing set now contains only the requesting node, and the block is now modified in the directory and in the requesting node.

Operations in Directory. For a modified block: Read miss: the owner is sent a data request message and the block becomes shared; the owner replies with the block to the directory, the block is written back to memory, the sharing set now contains the old owner and the requestor, and the block is shared in the directory and in the requesting node. Write miss: the owner is sent an invalidation message, the requestor becomes the new owner, and the block remains modified in the directory and in the requesting node. Data write-back: the owner replaces and writes back the block, the block becomes uncached, and the sharing set is empty.
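
Pulling the three cases together, here is a hedged C sketch of a home-directory handler. The helper functions (send_data, send_invalidate, fetch_from_owner, write_to_memory) are hypothetical stubs, and letting the home itself fetch from the owner and forward the data is one implementation option, not the only one.

#include <stdint.h>
#include <stdio.h>

#define NUM_NODES 16

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;
typedef enum { READ_MISS, WRITE_MISS, DATA_WRITEBACK } request_t;

typedef struct {
    dir_state_t state;
    uint32_t    sharers;   /* presence bits; in DIR_MODIFIED only the owner's bit is set */
} dir_entry_t;

/* Hypothetical stubs standing in for real network/memory operations. */
static void send_data(int node, uint64_t a)        { printf("data   -> node %d (0x%llx)\n", node, (unsigned long long)a); }
static void send_invalidate(int node, uint64_t a)  { printf("inval  -> node %d (0x%llx)\n", node, (unsigned long long)a); }
static void fetch_from_owner(int node, uint64_t a) { printf("fetch  <- node %d (0x%llx)\n", node, (unsigned long long)a); }
static void write_to_memory(uint64_t a)            { printf("memory <- write-back (0x%llx)\n", (unsigned long long)a); }

static int owner_of(const dir_entry_t *e)
{
    for (int i = 0; i < NUM_NODES; i++)
        if ((e->sharers >> i) & 1u) return i;
    return -1;
}

/* Handle a request from 'node' for the block at 'addr' whose entry is 'e'. */
void directory_handle(dir_entry_t *e, request_t req, int node, uint64_t addr)
{
    switch (e->state) {
    case DIR_UNCACHED:
        /* Read miss: requestor becomes the only sharer (S).
         * Write miss: requestor becomes the owner (M).       */
        send_data(node, addr);
        e->sharers = 1u << node;
        e->state   = (req == READ_MISS) ? DIR_SHARED : DIR_MODIFIED;
        break;

    case DIR_SHARED:
        if (req == READ_MISS) {
            send_data(node, addr);          /* memory is up-to-date         */
            e->sharers |= 1u << node;       /* add requestor to sharing set */
        } else {                            /* WRITE_MISS                   */
            for (int i = 0; i < NUM_NODES; i++)
                if (((e->sharers >> i) & 1u) && i != node)
                    send_invalidate(i, addr);
            send_data(node, addr);
            e->sharers = 1u << node;        /* only the new owner remains   */
            e->state   = DIR_MODIFIED;
        }
        break;

    case DIR_MODIFIED: {
        int owner = owner_of(e);
        if (req == READ_MISS) {
            fetch_from_owner(owner, addr);  /* owner supplies the block      */
            write_to_memory(addr);          /* memory becomes up-to-date     */
            send_data(node, addr);
            e->sharers |= 1u << node;       /* old owner and requestor share */
            e->state    = DIR_SHARED;
        } else if (req == WRITE_MISS) {
            send_invalidate(owner, addr);   /* old owner gives up its copy   */
            fetch_from_owner(owner, addr);  /* and supplies the latest data  */
            send_data(node, addr);
            e->sharers = 1u << node;        /* ownership moves to requestor  */
            /* state stays DIR_MODIFIED */
        } else {                            /* DATA_WRITEBACK from the owner */
            write_to_memory(addr);
            e->sharers = 0;
            e->state   = DIR_UNCACHED;
        }
        break;
    }
    }
}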

Read Miss to an Uncached or Shared Block. The transaction proceeds as follows:
1. Load request at the head of the CPU→Cache queue.
2. Load misses in the cache.
3. Cache sends a ReadReq message to the directory.
4. Message received at the directory controller.
5. Directory controller accesses the state and directory entry for the block; the block's state is S, with 0 or more sharers.
6. Directory is updated by setting the presence bit for the new sharer.
7. Directory controller sends a ReadReply message with the contents of the cache block.
8. ReadReply arrives at the cache.
9. Cache updates its tag and data and returns the load data to the CPU.
Notes: delay is a fact of life; serialization is now by the directory, no longer the bus; sequential consistency (SC) is not guaranteed.

Write Miss to a Read-Shared Block (multiple sharers). The transaction proceeds as follows:
1. Store request at the head of the CPU→Cache queue.
2. Store misses in the cache.
3. Cache sends a ReadReqX message to the directory.
4. ReadReqX message received at the directory controller.
5. Directory controller accesses the state and directory entry for the block; the block's state is S, with some set of sharers.
6. Directory controller sends one InvReq message to each sharer.
7. InvReq arrives at each sharer's cache.
8. Each sharer invalidates its cache block and sends an InvRep to the directory.
9. Each InvRep received clears the corresponding sharer bit.
10. When no sharers remain, the directory sends an ExRep to the requesting cache.
11. ExRep arrives at the cache.
12. Cache updates its tag and data, then stores the data from the CPU.
Notes: delay is a fact of life; serialization is now by the directory, no longer the bus; SC is not guaranteed.
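
Steps 6-10 are where the directory must count invalidation acknowledgements before granting exclusivity. A minimal sketch of that bookkeeping, assuming one outstanding exclusive request per block and hypothetical send_inv_req()/send_ex_rep() helpers:

#include <stdint.h>
#include <stdio.h>

#define NUM_NODES 16

typedef struct {
    uint32_t sharers;       /* presence bits                          */
    int      pending_invs;  /* InvReps still expected for this block  */
    int      waiting_node;  /* requestor waiting for the ExRep, or -1 */
} dir_entry_t;

static void send_inv_req(int node) { printf("InvReq -> node %d\n", node); }
static void send_ex_rep(int node)  { printf("ExRep  -> node %d\n", node); }

/* Step 6: on a ReadReqX to a Shared block, send one InvReq per sharer and
 * remember how many InvReps must arrive before the ExRep can be sent.     */
void start_exclusive_grant(dir_entry_t *e, int requestor)
{
    e->pending_invs = 0;
    e->waiting_node = requestor;
    for (int i = 0; i < NUM_NODES; i++) {
        if (((e->sharers >> i) & 1u) && i != requestor) {
            send_inv_req(i);
            e->pending_invs++;
        }
    }
    if (e->pending_invs == 0) {        /* no other sharers: grant at once */
        send_ex_rep(requestor);
        e->sharers = 1u << requestor;
        e->waiting_node = -1;
    }
}

/* Steps 9-10: each InvRep clears a sharer bit; when the count reaches zero,
 * the ExRep is finally sent to the waiting requestor.                       */
void on_inv_rep(dir_entry_t *e, int from_node)
{
    e->sharers &= ~(1u << from_node);
    if (--e->pending_invs == 0) {
        send_ex_rep(e->waiting_node);
        e->sharers = 1u << e->waiting_node;
        e->waiting_node = -1;
    }
}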

Directory: Data Structures. The key operation to support is a set-inclusion test. False positives are OK: we want to know which caches may contain a copy of a block, and spurious invalidations are simply ignored; the false-positive rate determines performance. The most accurate (and most expensive) representation is a full bit vector; compressed representations, linked lists, and Bloom filters are all possible. [Directory entry as before: a modified bit plus presence bits, one for each node.]
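
The set-inclusion point can be made concrete with a coarse (compressed) bit vector, in which each presence bit covers a group of nodes; the test may answer "yes" for a node that holds no copy (a harmless spurious invalidation) but never "no" for a real sharer. A sketch under those assumptions:

#include <stdbool.h>
#include <stdint.h>

#define NUM_NODES     64
#define NODES_PER_BIT  4    /* coarseness: one directory bit per 4 nodes */

typedef struct {
    uint16_t coarse;        /* NUM_NODES / NODES_PER_BIT = 16 bits        */
} coarse_sharers_t;

static void add_sharer(coarse_sharers_t *s, int node)
{
    s->coarse |= (uint16_t)(1u << (node / NODES_PER_BIT));
}

/* Set-inclusion test: may report a node that never cached the block
 * (false positive, costing only a spurious invalidation), but never
 * misses a real sharer. Coarser groups shrink directory storage at
 * the price of a higher false-positive rate, i.e. more useless
 * invalidation traffic.                                              */
static bool may_be_sharer(const coarse_sharers_t *s, int node)
{
    return (s->coarse >> (node / NODES_PER_BIT)) & 1u;
}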

Issues with Contention Resolution. There may be concurrent transactions to cache blocks: multiple requests can be in flight to the same cache block! Race conditions must be avoided by NACKing requests to busy (e.g., pending-invalidate) entries, with the original requestor retrying; or by queuing requests and granting them in sequence; or by some combination thereof. Fairness: which requestor should be preferred in a conflict? Interconnect delivery order and distance both matter.
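
A minimal sketch of the NACK-and-retry option, assuming one "busy" flag per directory entry and a hypothetical send_nack() helper; the queuing alternative would instead append the request to a per-entry list and replay it when the pending operation completes.

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool busy;   /* a coherence operation (e.g., pending invalidations)
                    for this block is still in flight                    */
    /* ... state, sharer set, etc. ... */
} dir_entry_t;

static void send_nack(int node) { printf("NACK -> node %d (please retry)\n", node); }

/* Returns true if the request was accepted, false if it was NACKed.
 * After a NACK the original requestor retries; repeated retries are
 * where the fairness questions above come from.                       */
bool directory_accept(dir_entry_t *e, int requestor)
{
    if (e->busy) {
        send_nack(requestor);
        return false;
    }
    e->busy = true;   /* entry stays busy until the transaction completes */
    return true;
}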

An Example Race: Writeback and Read. L has a dirty copy and wants to write it back to H; R concurrently sends a read to H, and H forwards the read to L before the writeback has been processed, so the writeback and the forwarded read race. Races like this require complex intermediate (transient) states. [Figure: directory at H initially State M, Owner L; messages include (1) WB Req from L to H, (2) Read Req from R to H, (3) Fwd'd Read Req from H to L, and (5) Read Reply; the annotations note that no extra acks are needed, and the final directory state is S.]

Hybrid Snoopy and Directory. Stanford DASH (4 CPUs per cluster, 16 clusters in total). Invalidation-based cache coherence; the home directory keeps one of 3 states for each cache block: uncached, shared (unmodified), or dirty. [Figure: clusters of processors with caches on snoop buses, each cluster with its own memory and directory, connected by an interconnection network.]

Snoopy vs. Directory Coherence
Snoopy:
+ Miss latency (critical path) is short
+ Global serialization is easy: bus arbitration
+ Simple: adapts bus-based uniprocessors easily
- Relies on broadcast seen by all caches (in the same order) → single point of serialization (the bus): not scalable
Directory:
- Adds miss latency: request → directory → memory
- Requires extra storage space to track sharer sets
- Protocols and race conditions are more complex
+ Does not require broadcast to all caches
+ Exactly as scalable as the interconnect and directory storage
(See the SGI Origin paper for a real-world example.)