CS492B Analysis of Concurrent Programs: Coherence. Jaehyuk Huh, Computer Science, KAIST. Parts of the slides are based on CS:App from CMU.

Two Classes of Protocols
Sharing state: which caches have a copy of a given address?
Snoop-based protocols
– No centralized repository of sharing state
– All requests must be broadcast to all nodes: the requestor does not know who may have a copy
– Common in small- to medium-sized shared-memory MPs
– Hard to scale due to the difficulty of efficient broadcasting
– Most commercial MPs up to ~64 processors
Directory-based protocols
– Logically centralized repository of sharing state: the directory
– Needs a directory entry for every memory block
– Invalidation requests go to the directory first and are forwarded only to the sharers
– A lot of research effort, but only a few commercial MPs

Snoop-based Cache Coherence
No explicit sharing state information → all caches must participate in snooping
1. Any cache miss request must be put on the bus
2. All caches and memory observe bus requests
3. All caches snoop each request and check their cache tags
4. Caches put responses on the bus
– Just sharing state ("I have a copy!")
– Data transfer ("I have a modified copy, and am sending it to you!")
[Figure: processors with private caches ($) connected to memory over a shared bus]
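To make the snooping flow concrete, below is a minimal C sketch (not from the slides; the names Cache, SnoopReply, and snoop_all are illustrative assumptions) of how a miss that has been put on the bus is observed by every other cache, which checks its tags and replies with either a sharing-state response or an offer of modified data. The combined reply decides whether the data comes from a peer cache or from memory.

/* Sketch of servicing a miss on a snooping bus: the request is broadcast,
   every other cache looks up its tags, and the replies are combined. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum SnoopReply { SNOOP_MISS, SNOOP_HIT_CLEAN, SNOOP_HIT_MODIFIED };

struct Cache {
    uint64_t tag;       /* single-line "cache", just for illustration */
    bool     valid;
    bool     dirty;
};

/* Every cache observes the broadcast address and reports what it has. */
static enum SnoopReply snoop_one(const struct Cache *c, uint64_t addr) {
    if (!c->valid || c->tag != addr) return SNOOP_MISS;
    return c->dirty ? SNOOP_HIT_MODIFIED : SNOOP_HIT_CLEAN;
}

/* Broadcast a miss from `requestor` to all other caches and combine replies. */
static enum SnoopReply snoop_all(struct Cache caches[], int n,
                                 int requestor, uint64_t addr) {
    enum SnoopReply combined = SNOOP_MISS;
    for (int i = 0; i < n; i++) {
        if (i == requestor) continue;
        enum SnoopReply r = snoop_one(&caches[i], addr);
        if (r > combined) combined = r;   /* a modified copy dominates */
    }
    return combined;
}

int main(void) {
    struct Cache caches[4] = { {0}, {0x40, true, true}, {0}, {0} };
    enum SnoopReply r = snoop_all(caches, 4, /*requestor=*/0, 0x40);
    printf("data supplied by %s\n",
           r == SNOOP_HIT_MODIFIED ? "owning cache" : "memory");
    return 0;
}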

Architecture for Snoopy Protocols
Extended cache states in tags
– Cache tags must keep the coherence state (extending the Valid and Dirty bits of single-processor cache states)
Broadcast medium (e.g., bus)
– Need to send all requests (including invalidations) to the other caches
– Logically a set of wires connecting all nodes and memory
Serialization by bus
– Only one processor is allowed to send an invalidation at a time
– Provides a total ordering of memory requests
Snooping bus transactions
– Every cache must observe all transactions on the bus
– For every transaction, caches need to look up their tags to check whether any action is necessary
– If necessary, a snoop may cause a state transition and a new bus transaction

Cache State Transition
Cache controller
– Determines the next state
– A state transition may initiate actions, such as sending bus transactions
Two sources of state transitions
– CPU: load or store instructions
– Snoop: requests from other processors
Snoop tag lookup
– Need to snoop all requests on the bus
– Consumes a lot of cache tag bandwidth
– May add duplicate tags used only for snooping
– Two identical tag arrays, one for CPU requests and the other for snoops
– Duplicate tags must be kept synchronized

MSI Protocol
A simple three-state protocol
M (Modified)
– Valid and dirty
– Only one M-state copy can exist for each block address in the entire system
– Can be updated without invalidating other caches
– Must be written back to memory when evicted
S (Shared)
– Valid and clean
– Other caches may have copies
– Cannot be updated in place (a store requires an upgrade)
I (Invalid)
– Invalid
State transition diagrams in the next four slides are from D. Patterson, EECS, UC Berkeley
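As a concrete reference for the diagrams that follow, here is a small C sketch (assumed names, not lecture code) of the three MSI states together with a check of the key invariant: for any block, at most one M copy exists, and an M copy never coexists with S copies.

/* Minimal sketch of the MSI states and their key invariant. */
#include <stdbool.h>

typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;

/* For one block address, given its state in every cache, check that at most
   one M copy exists and that an M copy never coexists with S copies. */
static bool msi_invariant_holds(const msi_state_t states[], int ncaches) {
    int modified = 0, shared = 0;
    for (int i = 0; i < ncaches; i++) {
        if (states[i] == MSI_M) modified++;
        else if (states[i] == MSI_S) shared++;
    }
    return modified <= 1 && !(modified == 1 && shared > 0);
}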

State Transition
CPU requests
– Processor Read (PrRd): load instruction
– Processor Write (PrWr): store instruction
– Generate bus requests when needed
Bus requests (snooped)
– Bus Read (BusRd)
– Bus RFO (BusRFO): Read For Ownership
– Bus Upgrade (BusUp)
– Bus Writeback (BusWB)
– May need to send data to the requestor
Notation: A / B
– A: the event that causes the state transition
– B: the action generated by the state transition

MSI State Transition - CPU
State transitions by CPU requests. States: Invalid (I), Shared (S, read-only), Modified (M, read/write). Edges are event / generated bus action:
– I, PrRd / BusRd → S
– I, PrWr / BusRFO → M
– S, PrRd / --- → S (stay)
– S, PrWr / BusUp → M
– M, PrRd / --- → M (stay)
– M, PrWr / --- → M (stay)

MSI State Transition - Snoop
State transitions by snooped bus requests:
– M, BusRd / BusWB → S
– M, BusRFO / BusWB → I
– M, BusUp / BusWB → I
– S, BusRd / --- → S (stay)
– S, BusRFO / --- → I
– S, BusUp / --- → I
– I: no action for any bus request
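The two MSI diagrams can be written down directly as a pair of transition functions. The sketch below is an illustrative C rendering (the names msi_cpu and msi_snoop are assumptions, not lecture code); each function returns the next state and reports the bus action it generates.

/* MSI transitions from the two diagrams above: one function for CPU-side
   events, one for snooped bus requests. BUS_NONE means no bus action. */
#include <stdio.h>

typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
typedef enum { PR_RD, PR_WR } cpu_event_t;
typedef enum { BUS_NONE, BUS_RD, BUS_RFO, BUS_UP, BUS_WB } bus_action_t;

/* CPU-side transitions (slide "MSI State Transition - CPU"). */
static msi_state_t msi_cpu(msi_state_t s, cpu_event_t e, bus_action_t *act)
{
    *act = BUS_NONE;
    switch (s) {
    case MSI_I:
        if (e == PR_RD) { *act = BUS_RD;  return MSI_S; }
        else            { *act = BUS_RFO; return MSI_M; }
    case MSI_S:
        if (e == PR_WR) { *act = BUS_UP;  return MSI_M; }
        return MSI_S;                     /* PrRd hits, no bus action */
    case MSI_M:
        return MSI_M;                     /* reads and writes hit */
    }
    return s;
}

/* Snoop-side transitions (slide "MSI State Transition - Snoop"). */
static msi_state_t msi_snoop(msi_state_t s, bus_action_t req, bus_action_t *act)
{
    *act = BUS_NONE;
    if (s == MSI_M && (req == BUS_RD || req == BUS_RFO || req == BUS_UP)) {
        *act = BUS_WB;                    /* flush the modified copy */
        return (req == BUS_RD) ? MSI_S : MSI_I;
    }
    if (s == MSI_S && (req == BUS_RFO || req == BUS_UP))
        return MSI_I;
    return s;                             /* S on BusRd, or I: no change */
}

int main(void)
{
    bus_action_t a;
    msi_state_t s = msi_cpu(MSI_I, PR_RD, &a);   /* I -> S, issues BusRd */
    s = msi_cpu(s, PR_WR, &a);                   /* S -> M, issues BusUp */
    s = msi_snoop(s, BUS_RD, &a);                /* other reader: M -> S, flush */
    printf("final state=%d, last bus action=%d\n", s, a);
    return 0;
}

Feeding this the request sequence of the example on the next slide should reproduce the state columns shown there.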

Example
Step              P1        P2        P3        Bus           Mem
                  St  Val   St  Val   St  Val   Action  Proc  Value
Initial           I         I         I                       10
P1 read A         S   10    I         I         BusRd   P1    10
P2 read A         S   10    S   10    I         BusRd   P2    10
P2 write A (20)   I         M   20    I         BusUp   P2    10
P3 read A         I         S   20    S   20    BusRd   P3    20
P1 write A (30)   M   30    I         I         BusRFO  P1    20

Supporting Cache Coherence
Coherence
– Deals with how one memory location is seen by multiple processors
– Ordering among multiple memory locations → consistency
– Must support write propagation and write serialization
Write propagation
– Writes become visible to other processors
Write serialization
– All writes to a location must be seen in the same order by all processors
– For two writes w1 and w2 to a location A: if one processor sees w1 before w2, then all processors must see w1 before w2

Review: Snoop-based Coherence
No explicit sharing state
– The requestor cannot know which nodes have copies
– Broadcast requests to all nodes
– Every node must snoop all bus transactions
Traditional implementation uses a bus
– Allows one transaction at a time → will be relaxed later
– Serializes all memory requests (total ordering) → will be relaxed later
Write serialization
– Conflicting stores are serialized by the bus

Review of the MSI Protocol
Load → store sequences are common
  Load R1, 0(R10)   → brings in a read-only copy
  Add R1, R1, R2
  Store R1, 0(R10)  → needs to upgrade for modification
High chance that no other cache has a copy
– Private data are common (especially in well-parallelized programs)
– Even shared data may not be in other caches (due to limited cache capacity)
MSI protocol
– Always installs a new line in S state
– A subsequent store causes another bus transaction (BusUp) to upgrade the line to M
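For a concrete instance of the pattern, a one-line increment in C compiles to exactly this load/add/store sequence; the comment marks where MSI pays the extra upgrade transaction (sketch only, variable names assumed).

/* Read-modify-write: the load misses and installs the line in S, then the
   store to the same line must upgrade it to M (BusUp under MSI), even if no
   other cache ever held a copy. MESI removes this extra transaction by
   installing the line in E when no sharer responds. */
#include <stdint.h>

void increment(int64_t *counter) {
    *counter += 1;   /* load (miss -> S), add, store (S -> M upgrade) */
}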

MESI Protocol
Adds an E (Exclusive) state to MSI
E (Exclusive)
– Valid and clean
– No other cache has a copy of the block
Must check the sharing state when installing a block
– For a BusRd transaction, all nodes place a response: either snoop hit ("I have a copy") or snoop miss ("I don't have a copy")
– If no other cache has a copy, the new block is installed in E state
– If any cache has a copy, the new block is installed in S state
E → M transition is free (no bus transaction)
– Exclusivity is guaranteed in E state
– For stores, upgrade E to M without sending invalidations

MESI State Transition - CPU
State transitions by CPU requests. States: Invalid (I), Shared (S, read-only), Exclusive (E, read-only), Modified (M, read/write):
– I, PrRd / BusRd (snoop hit) → S
– I, PrRd / BusRd (snoop miss) → E
– I, PrWr / BusRFO → M
– S, PrRd / --- → S (stay)
– S, PrWr / BusUp → M
– E, PrRd / --- → E (stay)
– E, PrWr / --- → M
– M, PrRd / --- → M (stay)
– M, PrWr / --- → M (stay)

MESI State Transition - Snoop
State transitions by snooped bus requests:
– M, BusRd / BusWB → S
– M, BusRFO / BusWB → I
– M, BusUp / BusWB → I
– E, BusRd / --- → S
– E, BusRFO / --- → I
– E, BusUp / --- → I
– S, BusRd / --- → S (stay)
– S, BusRFO / --- → I
– S, BusUp / --- → I
– I: no action for any bus request
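As with MSI, the two MESI diagrams map directly onto a pair of transition functions. The C sketch below (assumed names, not lecture code) adds the snoop-hit/snoop-miss choice on a read fill and the silent E → M upgrade.

/* MESI transitions from the two diagrams above. The CPU-side function also
   takes the combined snoop response so a read miss can install in E or S. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_state_t;
typedef enum { PR_RD, PR_WR } cpu_event_t;
typedef enum { BUS_NONE, BUS_RD, BUS_RFO, BUS_UP, BUS_WB } bus_action_t;

/* CPU-side transitions; `snoop_hit` is the OR of the other caches' replies. */
static mesi_state_t mesi_cpu(mesi_state_t s, cpu_event_t e,
                             bool snoop_hit, bus_action_t *act)
{
    *act = BUS_NONE;
    switch (s) {
    case MESI_I:
        if (e == PR_RD) { *act = BUS_RD;  return snoop_hit ? MESI_S : MESI_E; }
        else            { *act = BUS_RFO; return MESI_M; }
    case MESI_S:
        if (e == PR_WR) { *act = BUS_UP;  return MESI_M; }
        return MESI_S;
    case MESI_E:
        return (e == PR_WR) ? MESI_M : MESI_E;   /* E -> M needs no bus action */
    case MESI_M:
        return MESI_M;
    }
    return s;
}

/* Snoop-side transitions for a bus request from another cache. */
static mesi_state_t mesi_snoop(mesi_state_t s, bus_action_t req,
                               bus_action_t *act)
{
    *act = BUS_NONE;
    if (s == MESI_M && (req == BUS_RD || req == BUS_RFO || req == BUS_UP)) {
        *act = BUS_WB;                           /* flush the modified copy */
        return (req == BUS_RD) ? MESI_S : MESI_I;
    }
    if ((s == MESI_E || s == MESI_S) && req == BUS_RD) return MESI_S;
    if ((s == MESI_E || s == MESI_S) && (req == BUS_RFO || req == BUS_UP))
        return MESI_I;
    return s;
}

int main(void)
{
    bus_action_t a;
    mesi_state_t s = mesi_cpu(MESI_I, PR_RD, /*snoop_hit=*/false, &a); /* -> E */
    s = mesi_cpu(s, PR_WR, false, &a);        /* E -> M, silent upgrade */
    printf("state=%d, bus action=%d\n", s, a);
    return 0;
}

Compared with the MSI sketch, the only differences are the E fill path and the free E → M transition, which removes the BusUp in the private load-then-store case discussed earlier.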

Example
Step              P1        P2        P3        Bus           Mem
                  St  Val   St  Val   St  Val   Action  Proc  Value
Initial           I         I         I                       10
P1 read A         E   10    I         I         BusRd   P1    10
P1 write A (15)   M   15    I         I         None          10
P2 read A         S   15    S   15    I         BusRd   P2    15
P2 write A (20)   I         M   20    I         BusUp   P2    15
P3 read A         I         S   20    S   20    BusRd   P3    20
P1 write A (30)   M   30    I         I         BusRFO  P1    20

Coherence Misses
Three traditional classes of misses
– Cold, capacity, and conflict misses
A new type of miss occurs only in invalidation-based MPs
– Cache miss caused by invalidation
– P1 reads address A (S state)
– P2 writes to address A (I state in P1, M state in P2)
– P1 reads address A → a cache miss caused by the invalidation
Why do coherence misses occur? True and false sharing
True sharing
– The producer generates a new value (invalidating the copy in the consumer's cache)
– The consumer reads the new value
False sharing
– A block can be invalidated even if the updated part is not used by the other cache

True Sharing
[Figure: timeline T1–T4 of a reader cache and a writer cache sharing a block. The writer's write to Y invalidates the reader's Shared copy; the reader's later read of Y misses and fetches the updated block (a true-sharing miss).]

False Sharing
[Figure: timeline T1–T4 of a reader cache and a writer cache. The block holds several words; the writer's writes to Y invalidate the reader's copy even though the reader only accesses a different word of the same block, so the reader's next read misses (a false-sharing miss).]
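The effect of false sharing is easy to reproduce in software. The C sketch below (not from the slides; the 64-byte block size, counter names, and iteration count are assumptions) runs two threads that update independent counters, first placed in the same cache block and then padded into different blocks.

/* False-sharing sketch: two threads increment independent counters.
   In `struct shared` the counters sit in the same cache block, so each store
   invalidates the other core's copy; `struct padded` separates them by an
   assumed 64-byte block size. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

struct shared { volatile long a; volatile long b; };               /* same block */
struct padded { volatile long a; char pad[64]; volatile long b; }; /* different blocks */

static struct shared s;
static struct padded p;

static void *bump_shared_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) s.b++;   /* invalidates the block holding s.a */
    return NULL;
}

static void *bump_padded_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) p.b++;   /* no interference with p.a */
    return NULL;
}

int main(void) {
    pthread_t t;

    pthread_create(&t, NULL, bump_shared_b, NULL);
    for (long i = 0; i < ITERS; i++) s.a++;   /* suffers coherence misses */
    pthread_join(t, NULL);

    pthread_create(&t, NULL, bump_padded_b, NULL);
    for (long i = 0; i < ITERS; i++) p.a++;   /* runs at cache speed */
    pthread_join(t, NULL);

    printf("%ld %ld %ld %ld\n", s.a, s.b, p.a, p.b);
    return 0;
}

Timing the two phases (e.g., with clock_gettime around each loop) on a multicore machine typically shows the padded version running several times faster.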

Basic Operation of a Directory
k processors
With each cache block in memory: k presence bits, 1 dirty bit
With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit
Read from main memory by processor i:
– If dirty bit OFF: { read from main memory; turn p[i] ON; }
– If dirty bit ON: { recall line from the dirty processor (its cache state goes to Shared); update memory; turn dirty bit OFF; turn p[i] ON; supply the recalled data to i; }
Write to main memory by processor i:
– If dirty bit OFF: { supply data to i; send invalidations to all caches that have the block; turn dirty bit ON; turn p[i] ON; ... }
– ...
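The pseudocode above translates almost line by line into a small directory controller sketch. In the illustrative C below, the presence bits are packed into a 64-bit mask (so k ≤ 64 is assumed), recall_line/invalidate/supply_data are stubs standing in for the actual coherence messages, and __builtin_ctzll (a GCC/Clang builtin) finds the owner from its presence bit.

/* Sketch of a directory entry and the read/write handling described above. */
#include <stdbool.h>
#include <stdint.h>

struct dir_entry {
    uint64_t presence;   /* bit j set => processor j has a copy */
    bool     dirty;      /* set => the single present processor holds it in M */
};

static void recall_line(int owner)     { (void)owner;     } /* stub: owner writes back; its copy -> S */
static void invalidate(int sharer)     { (void)sharer;    } /* stub: sharer's copy -> I */
static void supply_data(int requestor) { (void)requestor; } /* stub: send the block to the requestor */

/* Read request for this block from processor i. */
void dir_read(struct dir_entry *d, int i)
{
    if (d->dirty) {
        int owner = __builtin_ctzll(d->presence);  /* the one presence bit set */
        recall_line(owner);                        /* memory updated, owner -> S */
        d->dirty = false;
    }
    d->presence |= 1ull << i;                      /* turn p[i] ON */
    supply_data(i);
}

/* Write (read-for-ownership) request for this block from processor i. */
void dir_write(struct dir_entry *d, int i)
{
    if (d->dirty)
        recall_line(__builtin_ctzll(d->presence)); /* fetch the modified data */
    for (int j = 0; j < 64; j++)                   /* invalidate all other sharers */
        if (((d->presence >> j) & 1ull) && j != i)
            invalidate(j);
    d->presence = 1ull << i;                       /* p[i] ON, others OFF */
    d->dirty = true;                               /* dirty bit ON */
    supply_data(i);
}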

Example Directory Protocol (1st Read)
[Figure: P1 executes ld vA, generating a read of physical address pA. The request goes to the directory controller, which replies with the data; P1 installs the block in S, and the directory records P1 as a sharer of pA.]

Example Directory Protocol (Read Share)
[Figure: P2 also executes ld vA and sends a read of pA to the directory. The directory replies with the data; both P1 and P2 now hold the block in S, and the directory records both as sharers.]

Example Directory Protocol (Write to Shared)
[Figure: P1 executes st vA, sending a read-for-ownership of pA to the directory. The directory invalidates P2's shared copy, collects the invalidation ACK, and grants ownership to P1, whose copy moves to M; the directory now lists P1 as the exclusive owner.]

Example Directory Protocol (Write to Modified)
[Figure: P2 executes st vA while P1 holds pA in M. P2's read-for-ownership goes to the directory, which invalidates P1's copy and recalls the modified data (write-back); the data is forwarded to P2, which installs the block in M, and the directory records P2 as the new owner.]

Multi-level Caches
Cache coherence uses physical addresses → caches must be physically tagged
Two-level caches without the inclusion property
– Both L1 and L2 must snoop
Two-level caches with the complete inclusion property
– Snoop only the L2 cache first
– If the snoop hits in L2, forward the snoop request to L1
– L1 may have a modified copy: the data must be flushed down to L2 and sent to the other caches
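A small illustration (assumed names, single-line "caches") of how an inclusive L2 filters snoops: the snoop looks up L2 first, and only on an L2 hit is L1 probed; a modified L1 copy is flushed down before the block is supplied to the bus.

/* Snoop filtering with an inclusive L2, reduced to single-line caches. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct line { uint64_t tag; bool valid, dirty; };

static struct line L1 = { 0x40, true, true  };
static struct line L2 = { 0x40, true, false };

static bool hits(const struct line *c, uint64_t addr) {
    return c->valid && c->tag == addr;
}

/* Handle a snooped bus request for `addr` with an inclusive L2. */
static void snoop(uint64_t addr) {
    if (!hits(&L2, addr)) return;       /* inclusion: L1 cannot have it either */
    if (hits(&L1, addr) && L1.dirty) {  /* modified copy sits in L1 */
        L2.dirty = true;                /* flush the data down to L2 */
        L1.dirty = false;
    }
    printf("supply block %#lx to the bus\n", (unsigned long)addr);
    /* state downgrades (M -> S or -> I) omitted for brevity */
}

int main(void) {
    snoop(0x80);   /* misses in L2: filtered, L1 is never disturbed */
    snoop(0x40);   /* hits: L1's dirty data is flushed down and supplied */
    return 0;
}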

Snoopy Bus with Switched Networks
A physical bus (shared wires) does not scale well
– Tree-based address networks (fat tree)
– Ring-based address networks
– Need an arbitration (serialization) point: how to serialize requests?

AMD HyperTransport
Snoop-based cache coherence
Integrated on-chip coherence and interconnect controllers (glue logic for chip-to-chip connection)
Uses point-to-point, packet-based switched networks

AMD HyperTransport
How to broadcast requests?
– Requests are sent to the home node
– The home node broadcasts the request to all nodes
Home node
– The node where the physical address is mapped to DRAM
– Statically determined by the physical address
– The home node serializes accesses to the same address
Snoop-based, but uses point-to-point networks with the home node as the serialization point
– Resembles directory-based protocols
Supports various interconnection topologies
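For illustration only (this is not AMD's actual address map): a static home-node assignment can be as simple as interleaving physical addresses across nodes at cache-block granularity, so every requestor computes the same home for a given address and that node can serialize all accesses to it. The block size and node count below are assumptions.

/* Illustrative static home-node assignment by block-granularity interleaving.
   Real systems use configurable DRAM address maps. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 6      /* assumed 64-byte coherence blocks */
#define NUM_NODES  4      /* assumed 4-socket system */

static unsigned home_node(uint64_t paddr) {
    return (unsigned)((paddr >> BLOCK_BITS) % NUM_NODES);
}

int main(void) {
    /* Every requestor computes the same home, so that node can order
       all requests to the same block address. */
    printf("home of 0x1000 = node %u\n", home_node(0x1000));
    printf("home of 0x1040 = node %u\n", home_node(0x1040));
    return 0;
}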

Read Transaction

Performance Scalability

Intel QPI
Limitation of AMD HyperTransport
– All snoop requests are broadcast through the home node to avoid conflicts
– The home node serializes conflicting requests
What happens if snoop requests are sent to the caches directly?
– What if two caches attempt to send a ReadInvalidation to the same address?
Intel QPI
– Allows direct snoop requests from a requester to all nodes
– However, an extra ordered request is sent to the home node too
– The home node checks for possible conflicts and resolves them only when a conflict actually occurs

Coherence within a Shared Cache
Multiple cores share an LLC (usually the L3 cache)
How to make the multiple L1s and L2s coherent?