CMSC 611: Advanced Computer Architecture
Distributed Shared Memory
Most slides adapted from David Patterson; some from Mohomed Younis.

Centralized Shared Memory
[Figure: several processors, each with one or more levels of private cache, sharing zero or more levels of common cache in front of a single memory and I/O system]
- Processors share a single centralized memory
- Feasible only for small processor counts, to limit memory contention
- The model for multi-core CPUs

Distributed Memory
[Figure: processors, each with a private cache and a local memory, connected by an interconnection network]
- Uses physically distributed memory to support large processor counts (avoids memory contention)
- Advantages: a cost-effective way to scale memory bandwidth; reduces local memory latency
- Disadvantage: increased complexity of communicating data

NUMA vs. UMA (PE = Processing Element)
- UMA (Uniform Memory Access): shared memory with the same access cost for every PE
- NUMA (Non-Uniform Memory Access): all memory is associated with some PE; access is faster for that PE, slower for the other PEs

Shared Address Model
- Physical locations: each PE can name every physical location in the machine
- Shared data: each process can name all data it shares with other processes

Shared Address Model
- Data transfer: use load and store; virtual memory maps each address to a local or remote location (a sketch follows)
- Extra memory level: cache remote data
- Significant research on making the translation transparent and scalable for many nodes
- Handling data consistency and protection is challenging
- Latency depends on the underlying hardware (bus bandwidth, memory access time, and support for address translation)
- Scalability is limited because the communication model is so tightly coupled with the process address space
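As a concrete illustration of data transfer through plain loads and stores, here is a minimal sketch assuming a POSIX system; the shared mapping and the wait()-based coordination are our additions, not the slides':

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* Map one page both processes share; the VM system, not an
           explicit message, carries the data between them. */
        int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (shared == MAP_FAILED) return 1;
        *shared = 0;

        if (fork() == 0) {   /* child plays the "remote" processor */
            *shared = 42;    /* data transfer is just a store */
            _exit(0);
        }
        wait(NULL);          /* crude coordination, enough for a demo */
        printf("parent loads %d\n", *shared);   /* ...and a load */
        return 0;
    }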

Three Fundamental Issues (#1: Naming)
- What data is shared? How is it addressed? What operations can access the data? How do processes refer to each other?
- The choice of naming affects the code produced by a compiler: just compute and load an address, or keep track of a processor number plus a local virtual address for message passing
- The choice of naming affects replication of data: in the cache memory hierarchy, or via software replication and consistency

Naming Address Spaces
- Global physical address space: any processor can generate an address and access it in a single operation
- Global virtual address space: the address space of each process can be configured to contain all shared data of the parallel program; memory can be anywhere, and virtual address translation handles it
- Segmented shared address space: locations are named <process number, address> uniformly for all processes of the parallel program

Three Fundamental Issues (#2: Synchronization)
- To cooperate, processes must coordinate
- Message passing is implicit coordination: it comes with the transmission or arrival of data
- A shared address space needs additional operations to coordinate explicitly: e.g., write a flag, awaken a thread, interrupt a processor (a flag sketch follows)
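The "write a flag" idiom as a minimal sketch using C11 atomics and POSIX threads; the release/acquire pairing is our addition, since the slide does not prescribe a memory-ordering discipline:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int payload;          /* ordinary shared data */
    atomic_int ready;     /* the coordination flag */

    void *producer(void *arg) {
        (void)arg;
        payload = 42;     /* write the data first... */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;      /* ...then publish the flag */
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;             /* spin until the flag is set */
        printf("%d\n", payload);   /* guaranteed to see 42 */
        pthread_join(t, NULL);
        return 0;
    }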

Three Fundamental Issues (#3: Latency & Bandwidth)
- Bandwidth: need high bandwidth in communication; match the limits of network, memory, and processor; communication overhead is a problem in many machines
- Latency: affects performance, since the processor may have to wait; affects ease of programming, since it requires more thought to overlap communication with computation
- Latency hiding: how can a mechanism help hide latency? Examples: overlap a message send with computation, prefetch data, switch to other tasks (a prefetch sketch follows)
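Of those examples, prefetching is the easiest to show in a few lines. A sketch using the GCC/Clang extension __builtin_prefetch; the lookahead distance of 16 elements is an illustrative guess, not a tuned value:

    /* Hide memory latency by requesting a future element while
       computing on the current one. */
    long sum(const long *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16]);  /* overlap the miss */
            s += a[i];
        }
        return s;
    }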

Snooping Cache Coherency
- Send all requests for data to all processors
- Processors snoop to see if they have a copy and respond accordingly
- Requires broadcast, since caching information resides at the processors
- Works well with a bus (a natural broadcast medium)

Directory Cache Coherency
- Keep track of what is being shared in one centralized place
- Distributed memory ⇒ distribute the directory too, for scalability (avoids bottlenecks)
- Send point-to-point requests to processors over the network
- Scales better than snooping
- Actually predates snooping-based schemes

Distributed Directory Multiprocessors
- A directory tracks the state of every block in every cache: which caches have a copy of the block, dirty vs. clean, ...
- Info per memory block vs. per cache block? Per memory block gives a simpler protocol (centralized, one location), but the directory is O(memory size) rather than O(cache size)
- To prevent the directory itself from becoming a bottleneck, distribute the directory entries with the memory: each node keeps track of which processors have copies of its blocks (see the mapping sketch below)
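A common way to distribute directory entries with memory is to derive a block's home node from its address. A minimal sketch; the block size and node count are illustrative assumptions:

    #define BLOCK_SIZE 64    /* bytes per cache block (assumed) */
    #define NUM_NODES  16    /* power of two keeps the modulo cheap */

    /* Block-interleaved placement: consecutive blocks live on
       consecutive nodes, and each home node owns the directory
       entries for the blocks it holds. */
    static inline unsigned home_node(unsigned long paddr) {
        return (paddr / BLOCK_SIZE) % NUM_NODES;
    }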

Directory Protocol
- Similar to the snoopy protocol: three states
- Shared: one or more processors have the block cached, and the contents of the block in memory (as well as in all those caches) are up to date
- Uncached: no processor has a copy of the block (not valid in any cache)
- Exclusive: exactly one processor (the owner) has the block cached, and the contents of the block in memory are out of date (the block is dirty)
- In addition to the cache state, the directory must track which processors have the data when it is in the shared state: usually a bit vector, with bit i set if processor i has a copy (see the sketch below)
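A directory entry then needs little more than the state and the sharer bit vector. A minimal sketch, assuming at most 64 processors; the names are ours:

    #include <stdint.h>

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } dir_state_t;

    typedef struct {
        dir_state_t state;
        uint64_t    sharers;   /* bit i set => processor i has a copy */
    } dir_entry_t;

    static inline void add_sharer(dir_entry_t *e, int p) {
        e->sharers |= 1ULL << p;
    }
    static inline int is_sharer(const dir_entry_t *e, int p) {
        return (e->sharers >> p) & 1;
    }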

Directory Protocol
- Keep it simple(r): a write to non-exclusive data is treated as a write miss; the processor blocks until the access completes; assume messages are received and acted upon in the order sent
- Terms (typically three processors are involved): the local node is where a request originates; the home node is where the memory location of an address resides; a remote node has a copy of the cache block, whether exclusive or shared
- There is no bus and we do not want to broadcast: the interconnect is no longer a single arbitration point, so all messages have explicit responses

Example Directory Protocol
- A message sent to the directory causes two actions: update the directory, and send more messages to satisfy the request
- We assume operations are atomic, but they are not; reality is much harder, and the protocol must avoid deadlock when the network runs out of buffers

Directory Protocol Messages

Type              SRC             DEST            MSG   Function
Read miss         local cache     home directory  P, A  P has a read miss at A; request data and make P a read sharer
Write miss        local cache     home directory  P, A  P has a write miss at A; request data and make P the exclusive owner
Invalidate        home directory  remote cache    A     Invalidate the shared copy of A
Fetch             home directory  remote cache    A     Fetch block A home; change A's remote state to shared
Fetch/invalidate  home directory  remote cache    A     Fetch block A home; invalidate the remote copy
Data value reply  home directory  local cache     D     Return a data value from home memory
Data write back   remote cache    home directory  A, D  Write back the data value for A
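The same table rendered as a C type, for concreteness; the field names are our invention:

    #include <stdint.h>

    typedef enum {
        READ_MISS, WRITE_MISS, INVALIDATE, FETCH,
        FETCH_INVALIDATE, DATA_VALUE_REPLY, DATA_WRITE_BACK
    } msg_type_t;

    typedef struct {
        msg_type_t    type;
        int           src, dest;  /* cache / directory node ids */
        int           proc;       /* P: requesting processor (misses) */
        unsigned long addr;       /* A: block address */
        uint64_t      data;       /* D: data value (replies, write backs) */
    } dir_msg_t;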

Cache Controller State Machine
- States are identical to the snoopy case, and the transactions are very similar
- Miss messages go to the home directory
- Explicit invalidate and data fetch requests arrive from the directory
- This is the state machine for CPU requests, kept per memory block (a write-side sketch follows)
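For flavor, here is how the cache side might handle a CPU write in each state. A sketch only: send_write_miss() is a hypothetical primitive, and a real controller blocks until the data value reply arrives before changing state:

    typedef enum { INVALID, SHARED_C, EXCLUSIVE_C } cache_state_t;

    extern void send_write_miss(int self, unsigned long addr);

    void cpu_write(cache_state_t *st, unsigned long addr, int self) {
        switch (*st) {
        case INVALID:       /* write miss: ask home for the block */
        case SHARED_C:      /* upgrades also go through the home node */
            send_write_miss(self, addr);
            *st = EXCLUSIVE_C;   /* once the reply arrives */
            break;
        case EXCLUSIVE_C:   /* write hit: no messages needed */
            break;
        }
    }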

Directory Controller State Machine
- Same states and structure as the transition diagram for an individual cache
- Actions: update the directory state and send messages to satisfy requests
- Tracks all copies of each memory block
- The sharer set can be implemented as a bit vector with one bit per processor, kept for each block
- This is the state machine for directory requests, kept per memory block (a read-miss sketch follows)
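As an illustration, the home node's handling of a read miss, reusing dir_entry_t and add_sharer() from the directory-entry sketch above; the send helpers and first_sharer() are hypothetical:

    extern void send_data_value_reply(int proc, unsigned long addr);
    extern void send_fetch(int owner, unsigned long addr);
    extern int  first_sharer(const dir_entry_t *e);  /* the owner's bit */

    void dir_read_miss(dir_entry_t *e, int p, unsigned long addr) {
        switch (e->state) {
        case UNCACHED:
        case SHARED:                     /* memory is up to date */
            send_data_value_reply(p, addr);
            add_sharer(e, p);
            e->state = SHARED;
            break;
        case EXCLUSIVE:                  /* owner holds dirty data */
            send_fetch(first_sharer(e), addr);   /* write it back home */
            send_data_value_reply(p, addr);      /* then supply requester */
            add_sharer(e, p);            /* old owner remains a sharer */
            e->state = SHARED;
            break;
        }
    }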

Example
The trace below follows the access sequence P1: Write 10 to A1; P1: Read A1; P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2. It assumes memory blocks A1 and A2 map to the same cache block.

Step                 P1 cache      P2 cache      Interconnect message  Directory         Memory
P1: Write 10 to A1   Excl. A1 10                 WrMs P1 A1            A1 Excl. {P1}
                                                 DaRp P1 A1
P1: Read A1          Excl. A1 10                 (hit: no messages)
P2: Read A1                        Shar. A1      RdMs P2 A1
                     Shar. A1 10                 Ftch P1 A1 10                           A1 = 10 (write back)
                                   Shar. A1 10   DaRp P2 A1 10         A1 Shar. {P1,P2}  A1 = 10
P2: Write 20 to A1                 Excl. A1 20   WrMs P2 A1
                     Inv.                        Inval. P1 A1          A1 Excl. {P2}     A1 = 10 (stale)
P2: Write 40 to A2                               WrMs P2 A2            A2 Excl. {P2}
                                                 WrBk P2 A1 20         A1 Unca. {}       A1 = 20
                                   Excl. A2 40   DaRp P2 A2            A2 Excl. {P2}