Chip-Multiprocessor.


Multiprocessing: Flynn's Taxonomy of Parallel Machines
How many instruction streams? How many data streams?
- SISD: Single I stream, Single D stream. A uniprocessor.
- SIMD: Single I, Multiple D streams. Each "processor" works on its own data, but all execute the same instructions in lockstep, e.g. a vector processor or MMX.

Flynn's Taxonomy (continued)
- MISD: Multiple I, Single D stream. Not used much; stream processors are the closest thing to MISD.
- MIMD: Multiple I, Multiple D streams. Each processor executes its own instructions and operates on its own data. This is your typical off-the-shelf multiprocessor (built from a bunch of "normal" processors), and it includes multi-core processors.

Multiprocessors
Why do we need multiprocessors?
- Uniprocessor speed keeps improving, but some workloads need even more speed. Wait a few years for Moore's law to catch up, or use multiple processors and do it now?
- The multiprocessor software problem: most code is sequential (written for uniprocessors), which is MUCH easier to write and debug. Correct parallel code is very, very difficult to write; efficient and correct is even harder; and debugging is more difficult still (Heisenbugs).
- Have ILP limits been reached?

MIMD Multiprocessors: two organizations
- Centralized shared memory
- Distributed memory

Centralized-Memory Machines
- Also called "Symmetric Multiprocessors" (SMP) or "Uniform Memory Access" (UMA): all memory locations have similar latencies.
- Data sharing happens through memory reads and writes: P1 can write data to a physical address A, and P2 can then read physical address A to get that data.
- Problem: memory contention. All processors share the one memory, so memory bandwidth becomes the bottleneck. Used only for smaller machines, most often 2, 4, or 8 processors.
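
To make the reads/writes-through-memory idea concrete, here is a minimal Python sketch using two threads as stand-ins for P1 and P2. The names shared and ready are invented for this sketch, and the Event only enforces the ordering the slide assumes.

# Minimal sketch of shared-memory communication between two "processors"
# (threads), illustrating the P1-writes / P2-reads pattern above.
import threading

shared = {"A": None}          # stands in for physical address A
ready = threading.Event()     # ordering: P2 must read after P1's write

def p1():
    shared["A"] = 42          # P1 writes data to "address" A
    ready.set()

def p2():
    ready.wait()              # wait until the write is visible
    print("P2 read A =", shared["A"])

t1, t2 = threading.Thread(target=p1), threading.Thread(target=p2)
t1.start(); t2.start()
t1.join(); t2.join()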

Distributed-Memory Machines: two kinds
- Distributed Shared-Memory (DSM): all processors can address all memory locations, and data sharing works as in an SMP. Also called NUMA (non-uniform memory access): latencies of different memory locations can differ (local access is faster than remote access).
- Message-Passing: a processor can directly address only its local memory; to communicate with other processors, it must explicitly send and receive messages. Also called multicomputers or clusters.
- Most accesses are local, so there is less memory contention (can scale to well over 1000 processors).

Memory Hierarchy in a Multiprocessor
[Figure: four organizations — shared cache; bus-based shared memory; fully-connected shared memory through an interconnection network; distributed shared memory with per-node memory on an interconnection network.]

What's the problem with a shared cache?

Cache Coherency
- The closest cache level is private, so multiple copies of a cache line can be present across different processor nodes.
- Local updates lead to an incoherent state. The problem exhibits in both write-through and writeback caches.
- Bus-based interconnect: writes are globally visible. Point-to-point interconnect: writes are visible only to the communicating processor nodes.

Example (Writeback Cache)
[Figure: three caches each hold X = -100. One processor writes X = 505 into its writeback cache; memory still holds X = -100, so the other processors' subsequent reads (Rd?) see the stale value.]

Example (Write-through Cache)
[Figure: one processor writes X = 505, which propagates through to memory, but the other caches still hold the stale copy X = -100, so a subsequent read (Rd?) in those caches hits the stale value.]

Defining Coherence
An MP is coherent if the results of any execution of a program can be reconstructed by some hypothetical serial order. Implicit in this definition:
- Write propagation: writes become visible to other processors.
- Write serialization: all writes to the same location are seen in the same order by all processors (applied to all locations, this is called write atomicity). E.g., if a read from P1 sees w1 followed by w2, then reads by every other processor Pi see them in the same order.
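
Write serialization can be phrased as a checkable property: every processor's observation log must be a subsequence of one global write order. A small, illustrative Python checker (all logs below are invented):

# Hedged sketch: checking write serialization for one location.
# Each processor's log lists the distinct values it observed for X,
# in observation order; serialization requires every log to be a
# subsequence of one global write order.
def is_serialized(global_order, logs):
    """True if every per-processor log is a subsequence of global_order."""
    for log in logs:
        it = iter(global_order)
        if not all(v in it for v in log):   # subsequence test
            return False
    return True

writes = [100, 505, 333]                    # one hypothetical serial order
ok  = [[100, 505, 333], [505, 333], [100, 333]]
bad = [[100, 505], [505, 100]]              # one processor sees writes reversed

print(is_serialized(writes, ok))    # True
print(is_serialized(writes, bad))   # False: violates write serialization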

Sounds Easy?
[Figure: P0 writes A=1 and B=2 (initially A=0, B=0). At times T1 and T2 the updates propagate to P1-P3; some processors see A's update before B's, while others see B's update before A's.]

Bus Snooping Based on a Write-Through Cache
Every write appears as a transaction on the shared bus to memory. Two protocols:
- Update-based protocol
- Invalidation-based protocol

Bus Snooping (Update-based Protocol on a Write-Through Cache)
[Figure: a write of X = 505 appears as a bus transaction; memory and the sharing cache replace the old value X = -100.]
- Each processor's cache controller constantly snoops on the bus.
- It updates local copies upon a snoop hit.

Bus Snooping (Invalidation-based Protocol on a Write-Through Cache)
[Figure: a write of X = 505 appears as a bus transaction; the sharing cache's copy X = -100 is invalidated, and a later Load X refetches X = 505.]
- Each processor's cache controller constantly snoops on the bus.
- It invalidates local copies upon a snoop hit.

A Simple Invalidate-based Snoopy Coherence Protocol for a WT, No-Write-Allocate Cache
Notation: observed event / transaction taken. PrRd and PrWr are processor-initiated; BusRd and BusWr are bus-snooper-initiated.
- Valid: PrRd / --- (stay Valid); PrWr / BusWr (stay Valid; write-through); BusWr / --- (go to Invalid).
- Invalid: PrRd / BusRd (go to Valid); PrWr / BusWr (stay Invalid; no write-allocate).
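
Since this is only a two-state machine, it can be written down directly as a lookup table. A minimal Python sketch; event and action names follow the slide, while the driver sequence at the end is invented:

# A direct encoding of the Valid/Invalid snoopy protocol as a
# transition table: (state, event) -> (next_state, bus_action).
VI = {
    ("V", "PrRd"):  ("V", None),      # read hit, no bus traffic
    ("V", "PrWr"):  ("V", "BusWr"),   # write-through: every write on the bus
    ("V", "BusWr"): ("I", None),      # another cache wrote: invalidate
    ("I", "PrRd"):  ("V", "BusRd"),   # read miss: fetch, become Valid
    ("I", "PrWr"):  ("I", "BusWr"),   # no write-allocate: stay Invalid
    ("I", "BusWr"): ("I", None),      # nothing to invalidate
}

state = "I"
for ev in ["PrRd", "PrWr", "BusWr", "PrWr"]:
    state, action = VI[(state, ev)]
    print(f"{ev:5s} -> state {state}, bus {action}")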

How About a Writeback Cache?
A WB cache reduces the bandwidth requirement: the majority of local writes are hidden behind the processor nodes. This raises two questions:
- How to snoop?
- How to preserve write ordering?

Cache Coherence Protocols for WB Caches
- A cache has an exclusive copy of a line if it is the only cache holding a valid copy; memory may or may not have it.
- Modified (dirty) cache line: the cache holding the line is the owner of the line, because it must supply the block.

Cache Coherence Protocol (Update-based Protocol on a Writeback Cache)
[Figure: Store X broadcasts an update on the bus; every sharer's copy changes from X = -100 to X = 505.]
Update the data for all processor nodes that share it.

Cache Coherence Protocol (Update-based Protocol on a Writeback Cache)
[Figure: another Store X (X = 333) updates all sharers again; a later Load X in a sharing cache hits.]
If a processor node keeps updating the same memory location, a lot of traffic is incurred.

Cache Coherence Protocol (Invalidation-based Protocol on a Writeback Cache)
[Figure: Store X (X = 505) puts an invalidation on the bus; the sharers' copies of X = -100 are invalidated.]
Invalidate the data copies of the sharing processor nodes.

Cache Coherence Protocol (Invalidation-based Protocol on a Writeback Cache)
[Figure: a later Load X in an invalidated cache misses; the cache holding the modified copy snoops the miss (snoop hit) and supplies X = 505.]
Traffic is reduced when a processor node keeps updating the same memory location.

Cache Coherence Protocol (Invalidation-based Protocol on a Writeback Cache)
[Figure: the owning processor keeps storing to X (505, 987, 333, 444) with no further bus transactions after the initial invalidation.]
Traffic is reduced when a processor node keeps updating the same memory location.
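
The traffic claim can be made concrete with a toy count of bus transactions for a single, repeatedly written line; the numbers and the access pattern below are illustrative only.

# Rough traffic comparison matching the slides above: invalidation wins
# when one node keeps writing the same line. Counts are bus transactions.
def update_traffic(n_writes):
    return n_writes                  # update protocol: each write broadcasts

def invalidate_traffic(n_writes):
    return 1 if n_writes else 0      # one BusRdX, then silent write hits in M

for n in (1, 10, 100):
    print(f"{n:3d} writes: update={update_traffic(n)}, "
          f"invalidate={invalidate_traffic(n)}")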

MSI Writeback Invalidation Protocol
- Modified (dirty): only this cache has a valid copy.
- Shared: memory is consistent; one or more caches have a valid copy.
- Invalid
Writeback protocol: a cache line can be written multiple times before the memory is updated.

MSI Writeback Invalidation Protocol
Two types of request from the processor: PrRd and PrWr.
Three types of bus transactions posted by the cache controller:
- BusRd: PrRd misses the cache; memory or another cache supplies the line.
- BusRdX (BusRd eXclusive, i.e. read-to-own): PrWr is issued to a line which is not in the Modified state.
- BusWB: writeback due to replacement; the processor is not directly involved in initiating this transaction.

MSI Writeback Invalidation Protocol (Processor Requests)
Processor-initiated transitions (event / bus transaction):
- Invalid: PrRd / BusRd → Shared; PrWr / BusRdX → Modified.
- Shared: PrRd / --- (stay Shared); PrWr / BusRdX → Modified.
- Modified: PrRd / ---; PrWr / --- (stay Modified).

MSI Writeback Invalidation Protocol (Bus Transactions)
Bus-snooper-initiated transitions:
- Modified: BusRd / Flush → Shared; BusRdX / Flush → Invalid.
- Shared: BusRd / --- (stay Shared); BusRdX / --- → Invalid.
Flushing puts the data on the bus: both memory and the requestor grab the copy, so the requestor gets the data either by cache-to-cache transfer or from memory.

MSI Writeback Invalidation Protocol (Bus Transactions): Another Possible, Valid Implementation
- On BusRd, a Modified line goes to Invalid (BusRd / Flush → Invalid) instead of Shared.
- This anticipates no more reads from this processor.
- The performance concern: it saves the later "invalidation" trip if the requesting cache writes the shared line afterwards.

MSI Writeback Invalidation Protocol (Complete)
Processor-initiated:
- Invalid: PrRd / BusRd → Shared; PrWr / BusRdX → Modified.
- Shared: PrRd / ---; PrWr / BusRdX → Modified.
- Modified: PrRd / ---; PrWr / ---.
Bus-snooper-initiated:
- Shared: BusRd / ---; BusRdX / --- → Invalid.
- Modified: BusRd / Flush → Shared; BusRdX / Flush → Invalid.
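
The state machine above is small enough to execute. Below is a minimal, illustrative Python sketch of a bus-based MSI snooper: it tracks states only (no data), and "Flush" stands for the owner putting the block on the bus. Class and method names are invented for this sketch.

# Minimal MSI simulator: one MSICache per processor, one shared Bus.
class MSICache:
    def __init__(self, name):
        self.name, self.state = name, "I"

    def cpu(self, op, bus):                    # op: "PrRd" or "PrWr"
        if self.state == "M" or (op == "PrRd" and self.state == "S"):
            print(f"{self.name} {op}: hit in {self.state}, no bus transaction")
            return
        tx = "BusRd" if op == "PrRd" else "BusRdX"
        supplier = bus.broadcast(tx, self)
        self.state = "S" if op == "PrRd" else "M"
        print(f"{self.name} {op}: -> {self.state}, {tx}, data from {supplier}")

    def snoop(self, tx):                       # bus-snooper side
        if self.state == "M":                  # owner must flush
            self.state = "S" if tx == "BusRd" else "I"
            return "Flush"
        if self.state == "S" and tx == "BusRdX":
            self.state = "I"                   # invalidate shared copy
        return None

class Bus:
    def __init__(self, caches): self.caches = caches
    def broadcast(self, tx, requester):
        supplier = "Memory"                    # default data supplier
        for c in self.caches:
            if c is not requester and c.snoop(tx) == "Flush":
                supplier = f"{c.name} cache"   # cache-to-cache transfer
        return supplier

p1, p2, p3 = (MSICache(n) for n in ("P1", "P2", "P3"))
bus = Bus([p1, p2, p3])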

MSI Example
[Figure: P1, P2, P3 with private caches on a shared bus; X = 10 in memory initially.]

Processor action     P1   P2   P3   Bus transaction   Data supplier
P1 reads X           S    ---  ---  BusRd             Memory
P3 reads X           S    ---  S    BusRd             Memory
P3 writes X (X=-25)  I    ---  M    BusRdX            Memory
P1 reads X           S    ---  S    BusRd             P3 cache
P2 reads X           S    S    S    BusRd             Memory

Note: on P3's write, the data does not come from memory if the protocol has a separate "BusUpgrade" transaction.
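
The example table can be replayed with the MSI sketch from the earlier code block (this snippet assumes the MSICache/Bus definitions above).

# Replay the MSI example: same action sequence as the table.
for cache, op in [(p1, "PrRd"), (p3, "PrRd"), (p3, "PrWr"),
                  (p1, "PrRd"), (p2, "PrRd")]:
    cache.cpu(op, bus)
# Expected suppliers, matching the table:
# Memory, Memory, Memory, P3 cache, Memory.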

What’s not good about MSI?

MESI Writeback Invalidation Protocol
Goal: reduce two types of unnecessary bus transactions:
- A BusRdX that snoops and converts the block from S to M even when this cache is the sole owner of the block.
- A BusRd that fetches the line in the S state when there are no other sharers (which leads to the overhead above).
Introduce the Exclusive state: a processor can write to an Exclusive copy without generating a BusRdX.
Also known as the Illinois Protocol, proposed by Papamarcos and Patel in 1984; employed in Intel, PowerPC, and MIPS processors.

MESI Example
[Figure: P1, P2, P3 with private caches on a shared bus; X = 10 in memory initially. "noS" means the shared signal was not asserted.]

Processor action     P1   P2   P3   Bus transaction   Data supplier
P1 reads X           E    ---  ---  BusRd(noS)        Memory
P3 reads X           S    ---  S    BusRd             Memory
P3 writes X (X=-25)  I    ---  M    BusRdX            Memory
P1 reads X           S    ---  S    BusRd             P3 cache
P2 reads X           S    S    S    BusRd             Memory

Note: on P3's write, the data does not come from memory if the protocol has a separate "BusUpgrade" transaction.

MESI Writeback Invalidation Protocol: Processor Requests (Illinois Protocol)
S: shared signal. Processor-initiated transitions:
- Invalid: PrRd / BusRd(not-S) → Exclusive; PrRd / BusRd(S) → Shared; PrWr / BusRdX → Modified.
- Exclusive: PrRd / ---; PrWr / --- → Modified (silent upgrade).
- Shared: PrRd / ---; PrWr / BusRdX → Modified.
- Modified: PrRd, PrWr / ---.

MESI Writeback Invalidation Protocol: Bus Transactions (Illinois Protocol)
Bus-snooper-initiated transitions:
- Modified: BusRd / Flush → Shared; BusRdX / Flush → Invalid.
- Exclusive: BusRd / Flush (or ---) → Shared; BusRdX / Flush → Invalid.
- Shared: BusRd / Flush*; BusRdX / Flush* → Invalid.
Flush*: flush for the data supplier; no action for the other sharers.
Whenever possible, the Illinois protocol performs a cache-to-cache ($-to-$) transfer rather than having memory supply the data, using a selection algorithm if there are multiple suppliers. (Alternatives: add an O state, or force an update of memory.) Most MESI implementations simply write back to memory.

MESI Writeback Invalidation Protocol (Illinois Protocol, Complete)
Processor-initiated:
- Invalid: PrRd / BusRd(not-S) → Exclusive; PrRd / BusRd(S) → Shared; PrWr / BusRdX → Modified.
- Exclusive: PrRd / ---; PrWr / --- → Modified.
- Shared: PrRd / ---; PrWr / BusRdX → Modified.
- Modified: PrRd, PrWr / ---.
Bus-snooper-initiated:
- Exclusive: BusRd / Flush (or ---) → Shared; BusRdX / Flush → Invalid.
- Shared: BusRd / Flush*; BusRdX / Flush* → Invalid.
- Modified: BusRd / Flush → Shared; BusRdX / Flush → Invalid.
S: shared signal. Flush*: flush for the data supplier; no action for the other sharers.
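
The whole point of the E state can be compressed into two functions: a read miss consults the shared signal, and a write in E needs no bus transaction. An illustrative Python sketch (function names invented):

# MESI's two refinements over MSI, as code.
def read_miss(shared_signal_asserted):
    # BusRd: other caches assert the S signal if they hold the line
    return "S" if shared_signal_asserted else "E"

def write(state):
    if state == "E":
        return "M", None          # silent upgrade: no BusRdX needed
    if state in ("S", "I"):
        return "M", "BusRdX"
    return "M", None              # already Modified

print(read_miss(False))           # 'E': sole copy, a future write is free
print(write("E"))                 # ('M', None)
print(write("S"))                 # ('M', 'BusRdX')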

MOESI Protocol
Adds one additional state: the Owner state.
- Similar to the Shared state, but the O-state processor is responsible for supplying the data (the copy in memory may be stale).
- Employed by Sun UltraSPARC and AMD Opteron.
- In the dual-core Opteron, cache-to-cache transfer is done through a System Request Interface (SRI) running at full CPU speed.
[Figure: CPU0 and CPU1 with private L2s connect to the System Request Interface, then to a crossbar, HyperTransport, and the memory controller.]
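
A tiny sketch of what the O state buys: on a BusRd, the owner (M or O), not possibly-stale memory, supplies the data. The states dictionary below is invented for illustration.

# Supplier selection under MOESI: the owner is responsible for the data.
def pick_supplier(states):        # states: cache name -> MOESI state
    for name, s in states.items():
        if s in ("M", "O"):
            return name           # owner supplies; memory may be stale
    return "Memory"

print(pick_supplier({"P1": "S", "P2": "O", "P3": "S"}))  # P2
print(pick_supplier({"P1": "S", "P2": "S"}))             # Memory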

Implication for Multi-Level Caches
How do we guarantee coherence in a multi-level cache hierarchy?
- Snoop all cache levels? Intel's 8870 chipset has a "snoop filter" for quad-core.
- Maintain the inclusion property: ensure that data present in the inner level is also present in the outer level, then snoop only the outermost level (e.g. L2).
- L2 needs to know when L1 has write hits: either use a write-through L1, or use a write-back L1 but maintain another "modified-but-stale" bit in L2.

Inclusion Property: Not So Easy...
- Replacement: different levels observe different access activity; e.g., L2 may replace a line that is frequently accessed in L1.
- Split L1 caches: imagine all caches are direct-mapped; an instruction line and a data line can conflict in L2 even though they coexist in the split L1s.
- Different cache line sizes.

Inclusion Property
- Use specific cache configurations, e.g., a direct-mapped L1 plus a bigger direct-mapped or set-associative L2 with the same cache line size.
- Explicitly propagate L2 actions to L1: an L2 replacement flushes the corresponding L1 line, and an observed BusRdX bus transaction invalidates the corresponding L1 line.
- To avoid excess traffic, L2 maintains an Inclusion bit per line for filtering (to indicate whether the line is in L1 or not).
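
A minimal sketch of the back-invalidate-on-L2-eviction rule described above, with the inclusion bit used to filter unnecessary L1 traffic; the data structures are invented for illustration.

# Explicit inclusion maintenance: L2 eviction back-invalidates L1.
class L2Line:
    def __init__(self, tag):
        self.tag, self.in_l1 = tag, False   # inclusion bit filters traffic

def l2_evict(line, l1_tags):
    if line.in_l1:                 # only bother L1 if the bit says so
        l1_tags.discard(line.tag)  # back-invalidate (flush if dirty)
        line.in_l1 = False

l1_tags = {0x40, 0x80}             # tags currently resident in L1
victim = L2Line(0x40); victim.in_l1 = True
l2_evict(victim, l1_tags)
print(sorted(hex(t) for t in l1_tags))  # ['0x80']: inclusion preserved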

Cache Coherency Implementation
- Snooping-based protocol: every cache must watch every memory request from every processor, so each transaction involves all N nodes of an N-node MP. Not a scalable solution for maintaining coherence in large shared-memory systems.
- Directory protocol: directory-based bookkeeping of who has what; HW overhead to keep the directory (~ #lines x #processors).
[Figure: caches on an interconnection network; each directory entry holds a modified bit and presence bits, one per node.]

Directory-based Coherence Protocol
[Figure: caches C(k), C(k+1), ..., C(k+j) on an interconnection network; memory holds a directory entry per cache block.]
- 1 modified bit for each cache block in memory.
- 1 presence bit per processor for each cache block in memory.

Directory-based Coherence Protocol (Limited Directory)
[Figure: as before, but each directory entry stores a few encoded node pointers instead of a full presence-bit vector.]
- 1 modified bit for each cache block in memory.
- Presence encoding is NULL or not.
- Encoded presence pointers (log2 N bits each); in this example each cache line can reside in at most 2 processors.
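
The storage overheads of the two directory organizations are easy to compare with a back-of-envelope calculation; the parameters (N, k) below are illustrative.

# Directory overhead per cache block: full-map keeps one presence bit
# per node; a limited directory keeps k pointers of log2(N) bits each.
import math

def full_map_bits(n_nodes):
    return 1 + n_nodes                        # modified bit + presence bits

def limited_dir_bits(n_nodes, k_pointers):
    return 1 + k_pointers * math.ceil(math.log2(n_nodes))

for n in (16, 256, 1024):
    print(f"N={n:5d}: full={full_map_bits(n):5d} bits/line, "
          f"limited(k=2)={limited_dir_bits(n, 2)} bits/line")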

Distributed Directory Coherence Protocol
- A centralized directory is less scalable (contention).
- Use distributed shared memory (DSM) for a large MP system: the interconnection network is no longer a shared bus, and each node keeps the directory for its own memory.
- Maintain cache coherence over it (CC-NUMA): each address has a "home" node.
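
One common (illustrative) realization of "each address has a home" is to interleave blocks across nodes by block address; the block size and node count below are invented.

# Home-node mapping by block-address interleaving.
BLOCK = 64                         # bytes per cache block (assumed)
NODES = 16                         # number of nodes (assumed)

def home_node(addr):
    return (addr // BLOCK) % NODES

print(home_node(0x0000))           # 0
print(home_node(0x0040))           # 1: the next block lives on the next node
print(home_node(0x0400))           # 0 again after wrapping around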

Distributed Directory Coherence Protocol: Stanford DASH
- 4 CPUs in each cluster, 16 clusters in total; each cluster has a snooping bus and its own directory, and clusters are connected by an interconnection network.
- Invalidation-based cache coherence.
- The directory keeps one of 3 states for each cache block at its home node: Uncached, Shared (unmodified), or Dirty.

DASH Memory Hierarchy
A reference traverses up to four levels:
- Processor level
- Local cluster level
- Home cluster level (where the address is at home); if the block is dirty, the home needs to get it from the remote node that owns it
- Remote cluster level

Directory Coherence Protocol: Read Miss (Clean Data)
[Figure: P misses on a read of Z and sends the request to Z's home node. Z is shared (clean), so the home memory supplies the data and sets the requester's presence bit.]

Directory Coherence Protocol: Read Miss (Dirty Data)
[Figure: P misses on a read of Z and goes to the home node; the home responds with the owner's identity; the requester sends a data request to the owner, which supplies dirty Z. Afterwards Z is clean and shared by 3 nodes.]

Directory Coherence Protocol: Write Miss
[Figure: P0 misses on a write of Z and goes to the home node; the home responds with the list of sharers; invalidations are sent to the sharers, which ACK back; the write of Z can then proceed in P0.]
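
The three miss flows above can be summarized as directory actions at the home node. A minimal Python sketch with a full-map entry (dirty bit plus presence bits); message sends are modeled as prints, and everything here is an illustration rather than the DASH implementation.

# Home-node directory actions for read and write misses.
class DirEntry:
    def __init__(self, n_nodes):
        self.dirty = False
        self.presence = [False] * n_nodes

def read_miss(entry, requester):
    if entry.dirty:
        owner = entry.presence.index(True)
        print(f"forward request to owner P{owner}; owner flushes")
        entry.dirty = False                # line becomes clean/shared
    entry.presence[requester] = True
    print(f"supply data to P{requester}; presence={entry.presence}")

def write_miss(entry, requester):
    for node, present in enumerate(entry.presence):
        if present and node != requester:
            print(f"invalidate P{node}; wait for ACK")
            entry.presence[node] = False
    entry.dirty = True
    entry.presence[requester] = True
    print(f"grant write to P{requester}; presence={entry.presence}")

d = DirEntry(4)
read_miss(d, 1); read_miss(d, 2)   # Z shared clean by P1 and P2
write_miss(d, 0)                   # P0 writes: P1 and P2 invalidated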

Questions?