
1 Chip-Multiprocessor

2 Multiprocessing: Flynn's Taxonomy of Parallel Machines
Machines are classified by how many instruction streams and how many data streams they have:
- SISD: Single Instruction stream, Single Data stream. A uniprocessor.
- SIMD: Single Instruction, Multiple Data streams. Each "processor" works on its own data, but all execute the same instructions in lockstep, e.g. a vector processor or MMX.

3 Flynn's Taxonomy (continued)
- MISD: Multiple Instruction, Single Data stream. Not used much; stream processors are the closest match.
- MIMD: Multiple Instruction, Multiple Data streams. Each processor executes its own instructions and operates on its own data. This is your typical off-the-shelf multiprocessor (made using a bunch of "normal" processors), and it includes multi-core processors.

4 Multiprocessors Why do we need multiprocessors?
Uniprocessor speed keeps improving, but some workloads need even more speed. Wait a few years for Moore's law to catch up, or use multiple processors and get that speed now?
The multiprocessor software problem: most code is sequential (written for uniprocessors), which is MUCH easier to write and debug. Correct parallel code is very, very difficult to write, efficient and correct parallel code is harder still, and debugging is even more difficult (Heisenbugs). Have ILP limits been reached?

5 MIMD Multiprocessors
Two basic organizations: Centralized Shared Memory and Distributed Memory.

6 Centralized-Memory Machines
Also called "Symmetric Multiprocessors" (SMP) or "Uniform Memory Access" (UMA) machines: all memory locations have similar latencies. Data sharing happens through memory reads and writes: P1 can write data to a physical address A, and P2 can then read physical address A to get that data.
Problem: memory contention. All processors share the one memory, so memory bandwidth becomes the bottleneck. Used only for smaller machines, most often 2, 4, or 8 processors.
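As a software-level illustration of this read/write sharing model, here is a minimal sketch using Python threads; the names and the threading.Event handshake are illustrative, not from the slides:

    import threading

    shared = {"A": None}               # stands in for physical address A
    ready = threading.Event()          # handshake so P2 reads after P1 writes

    def p1():
        shared["A"] = 42               # P1 writes data to "address" A
        ready.set()

    def p2():
        ready.wait()
        print("P2 read", shared["A"])  # P2 reads the same location: 42

    t1 = threading.Thread(target=p1)
    t2 = threading.Thread(target=p2)
    t2.start(); t1.start()
    t1.join(); t2.join()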

7 Distributed-Memory Machines
Two kinds:
- Distributed Shared-Memory (DSM): all processors can address all memory locations, and data sharing works as in an SMP. Also called NUMA (non-uniform memory access), because latencies of different memory locations can differ (local access is faster than remote access).
- Message-Passing: a processor can directly address only its local memory; to communicate with other processors, it must explicitly send and receive messages. Such machines are also called multicomputers or clusters. Most accesses are local, so there is less memory contention, and these machines can scale to well over 1000 processors. (A minimal sketch follows.)
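For contrast with the shared-memory sketch above, a minimal message-passing sketch using Python's multiprocessing.Pipe; the process layout and the doubling reply are illustrative:

    from multiprocessing import Pipe, Process

    def worker(conn):
        msg = conn.recv()              # explicit receive: no shared addresses
        conn.send(msg * 2)             # the reply is a message, not a store
        conn.close()

    if __name__ == "__main__":
        parent, child = Pipe()
        p = Process(target=worker, args=(child,))
        p.start()
        parent.send(21)                # explicit send to the other process
        print("got", parent.recv())    # got 42
        p.join()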

8 Memory Hierarchy in a Multiprocessor
[Diagram: four organizations — shared cache; bus-based shared memory; fully-connected shared memory, with per-processor caches reaching memory through an interconnection network; and distributed shared memory, with a memory attached to each processor node.]

9 What's the problem with a shared cache?

10 Cache Coherency
The closest cache level is private, so multiple copies of a cache line can be present across different processor nodes. Local updates lead to an incoherent state. The problem shows up in both write-through and writeback caches. On a bus, writes are globally visible; on a point-to-point interconnect, they are visible only to the processor nodes involved in the communication.

11 Example (Writeback Cache)
[Diagram: three caches initially hold X = -100. One processor writes X = 505 into its writeback cache; memory still holds X = -100, so the other processors' reads ("Rd?") return the stale value -100.]

12 Example (Write-through Cache)
[Diagram: one processor writes X = 505 through to memory, but another cache still holds the stale copy X = -100, so its local read ("Rd?") returns -100.]
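The two examples can be mimicked with a toy model; the per-processor "caches" are plain dicts and all names are illustrative:

    # Toy model of slides 11-12: nothing snoops, so copies go stale.
    memory = {"X": -100}
    caches = [{}, {}, {}]                      # one private cache per processor

    def read(p, addr):
        if addr not in caches[p]:              # miss: fill from memory
            caches[p][addr] = memory[addr]
        return caches[p][addr]

    def write_back(p, addr, val):
        caches[p][addr] = val                  # stays local until writeback

    def write_through(p, addr, val):
        caches[p][addr] = val
        memory[addr] = val                     # memory updated, caches are not

    read(0, "X"); read(1, "X"); read(2, "X")   # everyone caches X = -100
    write_back(0, "X", 505)                    # slide 11: writeback case
    print(read(1, "X"))                        # -100: stale (memory stale too)
    write_through(0, "X", 505)                 # slide 12: write-through case
    print(read(2, "X"))                        # -100: stale copy still cached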

13 Defining Coherence
An MP is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order. This implicit definition has two parts:
- Write propagation: writes become visible to other processors.
- Write serialization: all writes to the same location are seen in the same order by all processors (extended to all locations, this is called write atomicity). E.g., if a read by P1 sees w1 followed by w2, then reads by all other processors Pi see them in that same order.

14 Sounds Easy?
[Diagram: P0 initializes A = 0 and B = 0, then writes A = 1 at time T1 and B = 2 at time T2; the updates propagate to P1, P2, and P3.]
Without write serialization, some processors may see A's update before B's, while others see B's update before A's.

15 Bus Snooping based on Write-Through Cache
Every write appears as a transaction on the shared bus to memory, so every cache can observe it. Two protocols:
- Update-based Protocol
- Invalidation-based Protocol

16 Bus Snooping (Update-based Protocol on Write-Through cache)
[Diagram: a write of X = 505 appears as a bus transaction; memory and the sharing cache both update their copies of X from -100 to 505.]
Each processor's cache controller constantly snoops on the bus and updates its local copy upon a snoop hit.

17 Bus Snooping (Invalidation-based Protocol on Write-Through cache)
[Diagram: a write of X = 505 appears as a bus transaction; memory is updated and the other cache's copy of X is invalidated, so a later Load X by that processor misses and fetches the new value.]
Each processor's cache controller constantly snoops on the bus and invalidates its local copy upon a snoop hit.

18 A Simple Invalidate-based Snoopy Coherence Protocol for a Write-Through, No-Write-Allocate Cache
Two states per line, with edges labeled "Observed event / Transaction":
- Valid: PrRd / --- (read hit); PrWr / BusWr (write-through to memory); on an observed BusWr, go to Invalid (---).
- Invalid: PrRd / BusRd, go to Valid; PrWr / BusWr (no write-allocate, so the line stays Invalid).
PrRd and PrWr are processor-initiated transactions; the BusWr invalidation is bus-snooper-initiated.
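A minimal sketch of this two-state machine in Python (the state names and function signatures are assumptions, not from the slides):

    # Two-state (Valid/Invalid) machine for a write-through,
    # no-write-allocate cache.
    VALID, INVALID = "V", "I"

    def on_processor(state, op):
        # Returns (next_state, bus_transaction_or_None)
        if op == "PrRd":
            return (VALID, None) if state == VALID else (VALID, "BusRd")
        if op == "PrWr":
            return (state, "BusWr")     # write-through; no allocate on a miss
        raise ValueError(op)

    def on_snoop(state, transaction):
        if transaction == "BusWr":
            return INVALID              # another cache wrote this line
        return state                    # an observed BusRd changes nothing

    print(on_snoop(VALID, "BusWr"))     # a remote write invalidates: prints I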

19 How about Writeback Cache?
A writeback cache reduces the bandwidth requirement: the majority of local writes stay hidden behind the processor nodes. Two questions arise: how to snoop, and how to preserve write ordering?

20 Cache Coherence Protocols for WB caches
A cache has an exclusive copy of a line if it is the only cache holding a valid copy; memory may or may not have an up-to-date copy. For a modified (dirty) cache line, the cache holding the line is the owner of the line, because it must supply the block.

21 Cache Coherence Protocol (Update-based Protocol on Writeback cache)
[Diagram: Store X broadcasts an update on the bus; the sharing caches update their copies of X from -100 to 505.]
Update the data in all processor nodes that share it. If one processor node keeps updating the same memory location, a lot of traffic is incurred.

22 Cache Coherence Protocol (Update-based Protocol on Writeback cache)
[Diagram: Store X broadcasts X = 333; both sharing caches update from 505 to 333, so a subsequent Load X in another processor hits in its cache.]
Update the data in all processor nodes that share it. If one processor node keeps updating the same memory location, a lot of traffic is incurred.

23 Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache)
[Diagram: Store X puts an invalidation on the bus; the other sharing cache's copy of X = -100 is invalidated, and the writer's cache now holds X = 505.]
Invalidate the data copies in the sharing processor nodes. Traffic is reduced when a processor node keeps updating the same memory location.

24 Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache)
[Diagram: Load X misses in the reader's cache; the bus snoop hits in the owner's cache, which supplies X = 505.]
Invalidate the data copies in the sharing processor nodes. Traffic is reduced when a processor node keeps updating the same memory location.

25 Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache)
[Diagram: the writing processor stores X = 987, then 333, then 444 entirely in its own cache, with no further bus transactions since the other copies were already invalidated.]
Invalidate the data copies in the sharing processor nodes. Traffic is reduced when a processor node keeps updating the same memory location.

26 MSI Writeback Invalidation Protocol
- Modified: dirty; only this cache has a valid copy.
- Shared: memory is consistent; one or more caches have a valid copy.
- Invalid.
Writeback protocol: a cache line can be written multiple times before the memory is updated.

27 MSI Writeback Invalidation Protocol
Two types of request from the processor: PrRd and PrWr. Three types of bus transactions posted by the cache controller:
- BusRd: PrRd misses the cache; memory or another cache supplies the line.
- BusRdX (read-to-own): PrWr is issued to a line that is not in the Modified state.
- BusWB: writeback due to replacement; the processor is not directly involved in initiating this transaction.

28 MSI Writeback Invalidation Protocol (Processor Request)
Processor-initiated transitions:
- Invalid: PrRd / BusRd, go to Shared; PrWr / BusRdX, go to Modified.
- Shared: PrRd / ---; PrWr / BusRdX, go to Modified.
- Modified: PrRd / ---; PrWr / ---.

29 MSI Writeback Invalidation Protocol (Bus Transaction)
Bus-snooper-initiated transitions:
- Modified: BusRd / Flush, go to Shared; BusRdX / Flush, go to Invalid.
- Shared: BusRd / ---; BusRdX / ---, go to Invalid.
Flush puts the data on the bus; both memory and the requestor grab the copy, so the requestor gets the data either by cache-to-cache transfer or from memory.

30 MSI Writeback Invalidation Protocol (Bus Transaction)
Another possible, valid implementation: on an observed BusRd, go from Modified to Invalid (BusRd / Flush) instead of to Shared, anticipating no more reads from this processor. This is a performance concern: it saves an "invalidation" trip if the requesting cache writes the shared line later.
Bus-snooper-initiated transitions:
- Modified: BusRd / Flush, go to Invalid (instead of Shared); BusRdX / Flush, go to Invalid.
- Shared: BusRd / ---; BusRdX / ---, go to Invalid.

31 MSI Writeback Invalidation Protocol
Combined state diagram (processor-initiated and bus-snooper-initiated):
- Invalid: PrRd / BusRd, go to Shared; PrWr / BusRdX, go to Modified.
- Shared: PrRd / ---; PrWr / BusRdX, go to Modified; BusRd / ---; BusRdX / ---, go to Invalid.
- Modified: PrRd / ---; PrWr / ---; BusRd / Flush, go to Shared; BusRdX / Flush, go to Invalid.
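The combined machine is small enough to sketch directly; the two functions below mirror the two halves of the diagram (function names and return conventions are assumptions, not the slides'):

    # Sketch of the combined MSI machine; one M/S/I state per cache line.
    M, S, I = "M", "S", "I"

    def processor_side(state, op):
        # Returns (next_state, bus_transaction_issued_or_None)
        if op == "PrRd":
            return (S, "BusRd") if state == I else (state, None)
        if op == "PrWr":
            return (M, None) if state == M else (M, "BusRdX")
        raise ValueError(op)

    def snooper_side(state, transaction):
        # Returns (next_state, snoop_action_or_None) for an observing cache
        if transaction == "BusRd":
            return (S, "Flush") if state == M else (state, None)
        if transaction == "BusRdX":
            return (I, "Flush") if state == M else (I, None)
        raise ValueError(transaction)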

32 MSI Example
P1 reads X: P1 issues BusRd and memory supplies X = 10.
Action        P1   P2   P3   Bus transaction   Data supplier
P1 reads X    S    ---  ---  BusRd             Memory

33 MSI Example
P3 reads X: P3 issues BusRd and memory supplies X = 10 again.
Action        P1   P2   P3   Bus transaction   Data supplier
P1 reads X    S    ---  ---  BusRd             Memory
P3 reads X    S    ---  S    BusRd             Memory

34 MSI Example
P3 writes X = -25: P3 issues BusRdX, which invalidates P1's copy. (The data would not come from memory if the protocol had a "BusUpgrade" transaction.)
Action        P1   P2   P3   Bus transaction   Data supplier
P1 reads X    S    ---  ---  BusRd             Memory
P3 reads X    S    ---  S    BusRd             Memory
P3 writes X   I    ---  M    BusRdX            Memory

35 MSI Example
P1 reads X: P3 flushes X = -25 on the bus; memory and P1 both grab it, and P1 and P3 end in Shared.
Action        P1   P2   P3   Bus transaction   Data supplier
P1 reads X    S    ---  ---  BusRd             Memory
P3 reads X    S    ---  S    BusRd             Memory
P3 writes X   I    ---  M    BusRdX            Memory
P1 reads X    S    ---  S    BusRd             P3 Cache

36 MSI Example
P2 reads X: memory is now up to date and supplies X = -25.
Action        P1   P2   P3   Bus transaction   Data supplier
P1 reads X    S    ---  ---  BusRd             Memory
P3 reads X    S    ---  S    BusRd             Memory
P3 writes X   I    ---  M    BusRdX            Memory
P1 reads X    S    ---  S    BusRd             P3 Cache
P2 reads X    S    S    S    BusRd             Memory
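A hypothetical driver that replays the five actions above with the MSI sketch from slide 31; the printed rows match the table, with supplier attribution simplified:

    states = {"P1": I, "P2": I, "P3": I}

    def access(requestor, op):
        new_state, bus = processor_side(states[requestor], op)
        supplier = "Memory"
        if bus:                                   # the others snoop the bus
            for p in states:
                if p != requestor:
                    states[p], action = snooper_side(states[p], bus)
                    if action == "Flush":
                        supplier = p + " Cache"   # cache-to-cache transfer
        states[requestor] = new_state
        print(requestor, op, states, bus, supplier)

    access("P1", "PrRd")   # P1=S            BusRd   Memory
    access("P3", "PrRd")   # P1=S, P3=S      BusRd   Memory
    access("P3", "PrWr")   # P1=I, P3=M      BusRdX  Memory
    access("P1", "PrRd")   # P1=S, P3=S      BusRd   P3 Cache
    access("P2", "PrRd")   # all S           BusRd   Memory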

37 What’s not good about MSI?

38 MESI Writeback Invalidation Protocol
The goal is to eliminate two kinds of unnecessary bus transactions:
- the BusRdX that snoops and converts a block from S to M when this cache is already the sole owner of the block, and
- the BusRd that fetches a line in the S state when there are no sharers (which is what leads to the overhead above).
Introduce the Exclusive state: a cache can write an Exclusive copy without generating a BusRdX. Also known as the Illinois Protocol, proposed by Papamarcos and Patel in 1984 and employed in Intel, PowerPC, and MIPS processors.

39 MESI Example
P1 reads X with no sharers: the bus "shared" signal is not asserted, so P1 loads X = 10 in the Exclusive state.
Action        P1   P2   P3   Bus transaction   Data supplier
P1 reads X    E    ---  ---  BusRd(noS)        Memory

40 MESI Example
P3 reads X: the shared signal is now asserted, so both P1 and P3 end in Shared.
Action        P1   P2   P3   Bus transaction   Data supplier
P1 reads X    E    ---  ---  BusRd(noS)        Memory
P3 reads X    S    ---  S    BusRd             Memory

41 MESI Example
P3 writes X = -25: P3 issues BusRdX, invalidating P1's copy. (The data would not come from memory if the protocol had a "BusUpgrade" transaction.)
Action        P1   P2   P3   Bus transaction   Data supplier
P1 reads X    E    ---  ---  BusRd(noS)        Memory
P3 reads X    S    ---  S    BusRd             Memory
P3 writes X   I    ---  M    BusRdX            Memory

42 MESI Example
P1 reads X: P3 flushes X = -25; P1 and P3 end in Shared and memory is updated.
Action        P1   P2   P3   Bus transaction   Data supplier
P1 reads X    E    ---  ---  BusRd(noS)        Memory
P3 reads X    S    ---  S    BusRd             Memory
P3 writes X   I    ---  M    BusRdX            Memory
P1 reads X    S    ---  S    BusRd             P3 Cache

43 MESI Example
P2 reads X: memory is now up to date and supplies X = -25.
Action        P1   P2   P3   Bus transaction   Data supplier
P1 reads X    E    ---  ---  BusRd(noS)        Memory
P3 reads X    S    ---  S    BusRd             Memory
P3 writes X   I    ---  M    BusRdX            Memory
P1 reads X    S    ---  S    BusRd             P3 Cache
P2 reads X    S    S    S    BusRd             Memory

44 MESI Writeback Invalidation Protocol Processor Request (Illinois Protocol)
Processor-initiated transitions (S: shared signal on the bus):
- Invalid: PrRd / BusRd(S), go to Shared; PrRd / BusRd(not-S), go to Exclusive; PrWr / BusRdX, go to Modified.
- Shared: PrRd / ---; PrWr / BusRdX, go to Modified.
- Exclusive: PrRd / ---; PrWr / ---, go to Modified (no bus transaction needed).
- Modified: PrRd, PrWr / ---.

45 MESI Writeback Invalidation Protocol Bus Transactions (Illinois Protocol)
Whenever possible, the Illinois protocol performs a cache-to-cache ($-to-$) transfer rather than having memory supply the data; a selection algorithm is used if there are multiple possible suppliers. (Alternatives: add an O state, or force an update of memory.) Most MESI implementations simply write back to memory.
Bus-snooper-initiated transitions:
- Modified: BusRd / Flush, go to Shared; BusRdX / Flush, go to Invalid.
- Exclusive: BusRd / Flush (or ---), go to Shared; BusRdX / Flush, go to Invalid.
- Shared: BusRd / Flush*; BusRdX / Flush*, go to Invalid.
Flush*: Flush for the data supplier; no action for the other sharers.

46 MESI Writeback Invalidation Protocol (Illinois Protocol)
Combined state diagram (S: shared signal; Flush*: Flush for the data supplier, no action for the other sharers):
- Invalid: PrRd / BusRd(S), go to Shared; PrRd / BusRd(not-S), go to Exclusive; PrWr / BusRdX, go to Modified.
- Shared: PrRd / ---; PrWr / BusRdX, go to Modified; BusRd / Flush*; BusRdX / Flush*, go to Invalid.
- Exclusive: PrRd / ---; PrWr / ---, go to Modified; BusRd / Flush (or ---), go to Shared; BusRdX / Flush, go to Invalid.
- Modified: PrRd, PrWr / ---; BusRd / Flush, go to Shared; BusRdX / Flush, go to Invalid.
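A sketch of the MESI additions on top of the earlier MSI functions (this version replaces them; the others_share flag stands in for the bus shared signal, and all names are illustrative):

    M, E, S, I = "M", "E", "S", "I"

    def processor_side(state, op, others_share=False):
        if op == "PrRd":
            if state == I:                     # miss: E if no sharers, else S
                return (S, "BusRd(S)") if others_share else (E, "BusRd(noS)")
            return (state, None)               # M/E/S read hits are silent
        if op == "PrWr":
            if state in (M, E):
                return (M, None)               # E -> M needs no bus transaction
            return (M, "BusRdX")               # S or I must invalidate others
        raise ValueError(op)

    def snooper_side(state, transaction):
        if transaction.startswith("BusRd("):   # another cache's plain read
            if state in (M, E):
                return (S, "Flush")            # supply the line, drop to S
            return (state, "Flush*" if state == S else None)
        if transaction == "BusRdX":
            if state in (M, E):
                return (I, "Flush")
            return (I, "Flush*" if state == S else None)
        raise ValueError(transaction)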

47 MOESI Protocol
Adds one additional state, the Owner state. It is similar to the Shared state, but the processor holding the line in the O state is responsible for supplying the data (the copy in memory may be stale). Employed by the Sun UltraSPARC and AMD Opteron. In the dual-core Opteron, cache-to-cache transfer is done through a System Request Interface (SRI) running at full CPU speed.
[Diagram: each Opteron core (CPU0, CPU1) with its L2 connects to the System Request Interface, which feeds a crossbar linking the HyperTransport links and the memory controller.]

48 Implication on Multi-Level Caches
How do we guarantee coherence in a multi-level cache hierarchy?
- Snoop all cache levels? Intel's 8870 chipset has a "snoop filter" for quad-core.
- Maintain the inclusion property: ensure that data in the inner level (e.g. L1) is always also present in the outer level (e.g. L2), and snoop only the outermost level. L2 then needs to know when L1 has write hits: either use a write-through L1, or use a writeback L1 and maintain an extra "modified-but-stale" bit in L2.

49 Inclusion Property: Not So Easy…
- Replacement: the two levels observe different access activity, so L2 may replace a line that is frequently accessed in L1.
- Split L1 caches: an instruction line and a data line may conflict in L2 (imagine all caches are direct-mapped).
- Different cache line sizes between levels.

50 Inclusion Property
Use specific cache configurations, e.g. a direct-mapped L1 plus a bigger direct-mapped or set-associative L2 with the same cache line size (see the sketch below).
Explicitly propagate L2 actions to L1:
- An L2 replacement flushes the corresponding L1 line.
- An observed BusRdX transaction invalidates the corresponding L1 line.
- To avoid excess traffic, L2 maintains an inclusion bit per line for filtering (indicating whether the line is in L1 or not).
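A toy check of the first configuration: a direct-mapped L1 under a larger direct-mapped L2 with the same line size, under the simplifying assumption that every access fills both levels (all parameters illustrative):

    import random

    # Any L2 victim shares the incoming line's L2 index, hence its L1 index
    # too, so the same fill overwrites it in L1: inclusion is preserved.
    L1_SETS, L2_SETS = 4, 16
    l1, l2 = {}, {}                            # set index -> cached line

    def access(line_addr):
        l2[line_addr % L2_SETS] = line_addr    # fill/replace in L2...
        l1[line_addr % L1_SETS] = line_addr    # ...and in L1

    def inclusion_holds():
        return all(line in l2.values() for line in l1.values())

    for _ in range(1000):
        access(random.randrange(256))
        assert inclusion_holds()               # never fires for this config
    print("inclusion maintained")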

51 Cache coherency implementation
[Diagram: caches connected through an interconnection network to memory, with a directory holding a modified bit and presence bits, one per node.]
- Snooping-based protocol: N transactions for an N-node MP, and all caches must watch every memory request from every processor. Not a scalable solution for maintaining coherence in large shared-memory systems.
- Directory protocol: directory-based bookkeeping of who has what, at the cost of the hardware to keep the directory (roughly #lines × #processors bits).

52 Directory-based Coherence Protocol
[Diagram: per cache block in memory, the directory stores 1 modified bit plus 1 presence bit per processor; entries C(k), C(k+1), …, C(k+j) each carry their own bit vector.]
1 modified bit for each cache block in memory; 1 presence bit per processor for each cache block in memory.
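A sketch of one full-bit-vector directory entry and the bookkeeping it does on read and write misses (the class and method names are assumptions):

    class DirEntry:
        # One directory entry: a modified bit plus presence bits, one per node.
        def __init__(self, num_nodes):
            self.modified = False
            self.presence = [False] * num_nodes

        def read_miss(self, node):
            # If the line is dirty, the single owner must flush it first.
            owner = self.presence.index(True) if self.modified else None
            self.modified = False              # memory becomes consistent
            self.presence[node] = True         # record the new sharer
            return owner                       # where to get the data (or None)

        def write_miss(self, node):
            # All other copies must be invalidated before the write proceeds.
            sharers = [i for i, p in enumerate(self.presence) if p and i != node]
            self.presence = [False] * len(self.presence)
            self.presence[node] = True         # requester becomes sole owner
            self.modified = True
            return sharers                     # nodes that need an INV message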

53 Directory-based Coherence Protocol (Limited Dir)
[Diagram: as above, but each directory entry stores encoded pointers instead of a full bit vector.]
1 modified bit for each cache block in memory, plus a presence encoding that is either NULL or a set of encoded present pointers (log2 N bits each); in this example each cache line can reside in at most 2 processors.
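The storage trade-off is easy to quantify; a small worked computation comparing the full bit vector with a two-pointer limited directory (parameter choices illustrative):

    import math

    def full_vector_bits(nodes):
        return 1 + nodes                       # modified bit + 1 presence bit/node

    def limited_dir_bits(nodes, pointers=2):   # 2 pointers, as in the example
        return 1 + pointers * math.ceil(math.log2(nodes))

    for n in (16, 64, 1024):
        print(n, full_vector_bits(n), limited_dir_bits(n))
    # per memory line: 16 nodes: 17 vs 9 bits; 64: 65 vs 13; 1024: 1025 vs 21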

54 Distributed Directory Coherence Protocol
[Diagram: processor + cache + memory nodes, each with its own directory, connected by an interconnection network.]
A centralized directory is less scalable (contention). For a large MP system, use distributed shared memory (DSM): the interconnection network is no longer a shared bus, cache coherence is still maintained (CC-NUMA), and each address has a "home" node.

55 Distributed Directory Coherence Protocol
[Diagram: clusters of processors on snoop buses, each cluster with its own memory and directory, joined by an interconnection network.]
Stanford DASH (4 CPUs in each cluster, 16 clusters in total) uses invalidation-based cache coherence. The directory keeps one of 3 statuses for each cache block at its home node:
- Uncached
- Shared (unmodified)
- Dirty

56 DASH Memory Hierarchy
[Diagram: processors and caches on snoop buses within clusters; clusters with memory and directory joined by an interconnection network.]
Four levels:
- Processor level
- Local cluster level
- Home cluster level (the cluster where the address is at home)
- Remote cluster level: if the block is dirty, it must be fetched from the remote node that owns it

57 Directory Coherence Protocol: Read Miss
[Diagram: a read miss on Z goes to Z's home node; the directory shows Z shared (clean), so the home memory supplies the data and sets the requester's presence bit.]

58 Directory Coherence Protocol: Read Miss
[Diagram: a read miss on Z goes to the home node, but Z is dirty, so the home responds with the owner's identity; the requester then sends a data request to the owner, which supplies Z. Contrast: if Z were clean and shared by 3 nodes, the home memory would supply it directly.]

59 Directory Coherence Protocol: Write Miss
[Diagram: a write miss on Z goes to the home node, which responds with the sharer list; invalidations are sent to each sharer, and once all ACKs return, the write to Z can proceed in P0.]
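A sketch of this write-miss flow, reusing the DirEntry class from slide 52; the caches are dicts and the ACK counting is simplified:

    # P1 and P2 share Z; P0 suffers a write miss (as on the slide).
    caches = [{}, {"Z": 10}, {"Z": 10}, {}]
    directory = DirEntry(num_nodes=4)          # entry for Z at its home node
    directory.presence[1] = directory.presence[2] = True

    def write_miss(requester, addr):
        to_invalidate = directory.write_miss(requester)
        acks = 0
        for n in to_invalidate:
            caches[n].pop(addr, None)          # home sends INV; copy dropped
            acks += 1                          # each sharer ACKs
        assert acks == len(to_invalidate)      # wait for all ACKs...
        caches[requester][addr] = "new value"  # ...then the write proceeds

    write_miss(0, "Z")
    print(caches)   # Z gone from P1/P2; the modified copy lives only in P0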

60 Questions?

