1 Dynamic Decentralized Cache Schemes for MIMD Parallel Processors Larry Rudolph Zary Segall Presenter: Tu Phan.

2 Outline
- Cache Coherence Problem
- Two cache schemes
  - RB
  - RWB
- Synchronization using these schemes
  - Test-and-test-and-set instruction
- Use of these schemes in a multi-shared bus configuration
- Conclusion

3 Cache Coherence Problem
- Process communication in shared-memory multiprocessors can be implemented by exchanging information through shared variables.
- This sharing can result in several copies of a shared block in one or more caches at the same time.

Time  Event                  Cache contents  Cache contents  Memory contents
                             for CPU A       for CPU B       for location X
0                                                            1
1     CPU A reads X          1                               1
2     CPU B reads X          1               1               1
3     CPU A stores 0 into X  0               1               0
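The table above can be replayed with a small sketch. It assumes two private write-through caches that never invalidate each other; the class name `WriteThroughCache` and the dictionary-based memory are illustrative choices, not something the slides define.

```python
class WriteThroughCache:
    """A private cache that writes through to memory but never invalidates peers."""

    def __init__(self, memory):
        self.memory = memory      # shared memory, modeled as a dict
        self.lines = {}           # this cache's private copies

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value              # update the local copy
        self.memory[addr] = value             # write through to memory


memory = {"X": 1}                 # time 0: memory holds 1
cpu_a = WriteThroughCache(memory)
cpu_b = WriteThroughCache(memory)
cpu_a.read("X")                   # time 1: CPU A caches 1
cpu_b.read("X")                   # time 2: CPU B caches 1
cpu_a.write("X", 0)               # time 3: CPU A stores 0
stale = cpu_b.read("X")           # CPU B still sees its old copy
```

After the last step, CPU B's cache and memory disagree about X, which is exactly the situation a coherence scheme has to prevent.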

4 Terminology
- Coherence
  - Defines what values can be returned by a read
  - A memory system is coherent if:
    - A read of X by processor P that follows a write to X by P, with no writes to X by other processors in between, returns the value written by P
    - A read of X by one processor that follows a write to X by another processor returns the written value, provided the read and write are sufficiently separated and no other writes to X occur between the two accesses
    - Writes to the same location are serialized, i.e., two writes to the same location by any two processors are seen in the same order by all processors
- Consistency
  - Determines when a written value will be returned by a read

5 Techniques to Enforce Coherence
- Software based
  - Often used in clusters of workstations or PCs
- Hardware based
  - Directory schemes
    - A centralized directory holds the sharing status of each block of physical memory
    - Used in DSM machines
  - Snooping schemes
    - No centralized directory
    - Each cache "snoops" or listens to maintain coherency among caches
    - Used in CSM machines using a bus

6 Architecture Assumption
- A logically single bus connecting the n PEs and I/O with memory
- A bus arbitrator
- Each processor has a cache which can listen to the bus activity and detect the referenced address, the activity (read or write), and the data
- A cache has the ability to interrupt the current bus activity and to replace it with one of its own
- Both set size and block size are assumed to be one 'word'

[Figure: PE 1 ... PE n, each with its own cache, connected by a bus to shared memory]

7 The RB – Read Broadcast Cache Scheme
- Caches note bus reads, bus writes, and data returned in response to bus reads
- Read/write data items are dynamically reclassified as one of the other two classes of data, local or read-only:
  - Whenever there is a write, the item is considered to be local to the writer
  - Whenever there is a read, the item is considered to be a read-only data item
- Each address line is associated with a set of tag bits:
  - R (Readable): the data is valid and consistent with main memory
  - I (Invalid): the data in the cache is assumed to be incorrect; any reference to it will cause a corresponding bus action
  - L (Local): the data can be read or written locally, causing no bus activity

8 Definition of configuration
- A configuration is the collection of cache states for a particular address.
- A configuration can be seen as an n-vector (S_1, ..., S_n), where S_i denotes the state of variable X in cache i.
- Two possible types of configuration:
  - Local configuration: exactly one cache i has S_i = L, and S_j = I for every other cache j
  - Shared configuration: S_i is R or I for every cache i
- These configurations are dynamically assigned by the scheme
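A configuration can be checked mechanically. The sketch below assumes that a local configuration means exactly one cache in state L with all others in I, and that a shared configuration means every cache is in R or I; the function name `classify` is hypothetical.

```python
def classify(config):
    """Return 'local', 'shared', or None for a tuple of per-cache states.

    Assumed definitions (not spelled out on the slide):
      local  -> exactly one 'L', every other cache 'I'
      shared -> every cache is 'R' or 'I'
    """
    if config.count("L") == 1 and all(s in ("L", "I") for s in config):
        return "local"
    if all(s in ("R", "I") for s in config):
        return "shared"
    return None        # e.g. an L next to an R: not a legal RB configuration


examples = {
    ("R", "R", "I"): classify(("R", "R", "I")),
    ("I", "L", "I"): classify(("I", "L", "I")),
    ("L", "R"): classify(("L", "R")),
}
```

Under these assumed definitions a vector in which every cache is invalid counts as shared, since no cache claims local ownership.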

9 The RB scheme functioning
X is in the shared configuration
- Read: simply fetches the cached value; no bus activity is generated.
  - Stay in the shared configuration
- Write:
  1. The value of X in cache i is updated and a bus write is broadcast.
  2. Cache i changes to state L.
  3. The bus write then updates the memory and,
  4. at the same time, causes all other caches to change into state I.
  - Change from shared configuration to local configuration

Numbers in parentheses refer to the steps above.

Time  Event               Cache i          Cache j          Memory
                          State  Value     State  Value
0                         R      0         R      0         0
1     PE i writes 1 to X  L (2)  1 (1)     I (4)  0         1 (3)

10 The RB scheme functioning
X is in the local configuration with cache i in the local state:
- Case 1: X is read by PE i
  - Simply fetch the cached value
  - No bus activity is generated.
  - Stay in local configuration
- Case 2: X is written by PE i
  - The cached value in cache i is updated.
  - No bus activity is generated.
  - Stay in local configuration

11 The RB scheme functioning
X is in the local configuration with cache i in the local state:
- Case 3: X is written by PE j
  1. Update the cached value in cache j
  2. Change cache j state to L
  3. Generate a bus write to X with the new value
  4. This bus write causes all other caches to change into state I
  - Change from the local configuration with respect to i to the local configuration with respect to j

Numbers in parentheses refer to the steps above; x marks a memory value that is out of date.

Time  Event               Cache i          Cache j          Memory
                          State  Value     State  Value
0                         L      1         I      0         x
1     PE j writes 2 to X  I (4)  1         L (2)  2 (1)     2 (3)

12 The RB scheme functioning
X is in the local configuration with cache i in the local state:
- Case 4: X is read by PE j
  1. A bus read is issued, which is "seen" by the other caches.
  2. Cache i, which is in state L, interrupts the bus read and performs its own bus write, updating memory to the correct value.
  3. The original bus read is retried immediately to fetch the correct value.
  4. The bus read is noticed by all other caches, which then read the value returned from the read, cache it, and change into state R.
  - Change from local configuration to shared configuration

Time  Event          Cache i          Cache j          Memory
                     State  Value     State  Value
0                    L      1         I      0         x
1     PE j reads X   R (2)  1         R (4)  1 (3)     1 (1)
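The four cases can be exercised with a minimal simulator for a single word X. This is a sketch under stated assumptions, not the authors' implementation: the class name `RBSystem`, the `bus_log` list, and the choice to log the retried bus read as a separate entry are all illustrative.

```python
class RBSystem:
    """One shared word X, n private caches, and memory, following the RB rules."""

    def __init__(self, n, initial=0):
        self.state = ["R"] * n        # start in the shared configuration
        self.value = [initial] * n
        self.memory = initial
        self.bus_log = []             # "BR" = bus read, "BW" = bus write

    def write(self, i, v):
        self.value[i] = v
        if self.state[i] == "L":      # cases 1-2: local write, no bus activity
            return
        self.bus_log.append("BW")     # write through and broadcast
        self.memory = v
        for j in range(len(self.state)):
            self.state[j] = "L" if j == i else "I"

    def read(self, i):
        if self.state[i] in ("R", "L"):   # hit: no bus activity
            return self.value[i]
        self.bus_log.append("BR")         # miss in state I: bus read
        if "L" in self.state:
            owner = self.state.index("L")
            self.bus_log.append("BW")     # owner interrupts and updates memory
            self.memory = self.value[owner]
            self.state[owner] = "R"
            self.bus_log.append("BR")     # the original read is retried
        # Read broadcast: every invalid cache picks up the returned value.
        for j, s in enumerate(self.state):
            if s == "I":
                self.value[j] = self.memory
                self.state[j] = "R"
        return self.value[i]


rb = RBSystem(2)          # cache i is index 0, cache j is index 1
rb.write(0, 1)            # shared -> local with respect to cache 0
seen_by_j = rb.read(1)    # case 4: local -> shared
```

After `rb.write(0, 1)` the state vector is `["L", "I"]`; the read by the other cache returns 1 and leaves both caches in state R with memory updated.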

13 Cache replacement policy
- Only those overwritten items that are tagged local need to be written back to the memory.
- A reference to an item not in the cache behaves exactly as if it were in the invalid state:
  - A write results in a bus write and the state changes to local
  - A read results in a bus read and the state changes to readable
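A minimal sketch of the replacement rule follows. The dictionary layout of a cache line and the function names `evict` and `state_after_miss` are assumptions made for illustration only.

```python
def evict(line, memory):
    """Evict a cache line; write it back only if it is tagged local (L).

    `line` is a dict with keys 'addr', 'state', and 'value' (assumed layout).
    Returns True when a write-back to memory was needed.
    """
    wrote_back = line["state"] == "L"
    if wrote_back:
        memory[line["addr"]] = line["value"]   # memory was stale: update it
    line["state"] = "I"                        # an absent item acts as invalid
    return wrote_back


def state_after_miss(request):
    """State of a line after a reference to an item not in the cache."""
    return {"CW": "L", "CR": "R"}[request]     # write -> local, read -> readable


memory = {"X": 0, "Y": 7}
local_line = {"addr": "X", "state": "L", "value": 5}
readable_line = {"addr": "Y", "state": "R", "value": 7}
needed_x = evict(local_line, memory)       # local data must be written back
needed_y = evict(readable_line, memory)    # readable data already matches memory
```

Readable lines can be dropped silently because state R guarantees the cached value already agrees with main memory.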

14 RB Scheme - State transitions
- R
  - Write: the writer changes to L, the others change to I
- I
  - Bus read: changes state to R in all processors
  - Bus write: the writer changes state to L, the others change to I
- L
  - Bus write: changes state to I
  - Bus read: is interrupted and replaced by a bus write of the correct value; the state changes to R in all processors

15 RB Scheme - State Diagram
[Figure: state diagram over the states R, I, and L; edges are labeled with the requests and modifiers listed below]

Legend
- CW – CPU write request
- CR – CPU read request
- BW – Bus write request
- BR – Bus read request

Modifiers
1. generates a BW (write through)
2. interrupts bus read and supplies data from the cache
3. generates a BR (cache miss)
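The diagram's edges can be written as a lookup table. The table below is a reconstruction from the textual rules on the preceding slides, not a copy of the figure, so individual entries (in particular the self-loops) should be treated as assumptions. The second element of each entry is the modifier number from the legend, or `None`.

```python
# (current state, request) -> (next state, modifier)
RB_TRANSITIONS = {
    ("R", "CR"): ("R", None),   # read hit
    ("R", "CW"): ("L", 1),      # write through, others are invalidated
    ("R", "BR"): ("R", None),
    ("R", "BW"): ("I", None),   # another cache wrote: invalidate
    ("I", "CR"): ("R", 3),      # cache miss: generate a bus read
    ("I", "CW"): ("L", 1),
    ("I", "BR"): ("R", None),   # read broadcast: pick up the returned data
    ("I", "BW"): ("I", None),
    ("L", "CR"): ("L", None),   # local read, no bus activity
    ("L", "CW"): ("L", None),   # local write, no bus activity
    ("L", "BR"): ("R", 2),      # interrupt the read and supply the data
    ("L", "BW"): ("I", None),
}


def step(state, request):
    """Return (next_state, modifier) for one cache line."""
    return RB_TRANSITIONS[(state, request)]


def run(state, requests):
    """Apply a sequence of requests and return the visited states."""
    visited = [state]
    for request in requests:
        state, _ = step(state, request)
        visited.append(state)
    return visited


trace = run("R", ["CW", "CW", "BR", "BW", "CR"])
```

The trace follows one line as its owner writes twice, another cache reads it, a third cache writes it, and the owner finally misses on a read.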

16 Proof of Consistency
- Consistent: a read by a processor will always fetch the "latest" value written.
- The latest value written is defined in terms of a corresponding virtual serial execution.
- Theorem: Each PE always reads the latest value written.

17 RWB Cache Scheme
- The caches also note the data part of the bus writes.
- A new tag F (first-write) is associated with an address line
- A new bus signal, called bus invalidate (BI), is introduced

18 RWB – Read Write Broadcast
- Only when a variable is used exclusively by one PE does it reenter the local configuration
- The first write by PE(i) causes all other caches to change state to R, and PE(i) to change to F.
- A subsequent write by PE(i) causes PE(i) to change to L; the others change to I.
- A write by PE(j) causes all but j to change state to R, and PE(j) to change state to F.
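The rules above can be sketched for a single word X. This is a hedged model, not the paper's specification: the class name `RWBSystem` is hypothetical, the bus invalidate is assumed to carry no data (so memory is left untouched until the owner is forced to write back), and read misses are handled as in the RB scheme.

```python
class RWBSystem:
    """One shared word X under the RWB rules (states R, I, F, L)."""

    def __init__(self, n, initial=0):
        self.state = ["R"] * n
        self.value = [initial] * n
        self.memory = initial

    def write(self, i, v):
        n = len(self.state)
        if self.state[i] == "L":          # exclusive use: purely local write
            self.value[i] = v
        elif self.state[i] == "F":        # second write in a row: bus invalidate
            self.value[i] = v             # assumption: BI carries no data
            self.state = ["L" if j == i else "I" for j in range(n)]
        else:                             # first write: bus write, data broadcast
            self.memory = v
            self.value = [v] * n          # every cache notes the written data
            self.state = ["F" if j == i else "R" for j in range(n)]

    def read(self, i):
        # A miss can only happen in state I, which in this one-word model
        # implies some other cache is in state L (never F).
        if self.state[i] == "I":
            owner = self.state.index("L")
            self.memory = self.value[owner]   # owner interrupts and writes back
            self.state[owner] = "R"
            for j, s in enumerate(self.state):
                if s == "I":                  # read broadcast
                    self.value[j] = self.memory
                    self.state[j] = "R"
        return self.value[i]


rwb = RWBSystem(3)
rwb.write(1, 1)     # first write by cache 1: it becomes F, the others stay R
rwb.write(1, 0)     # second write by cache 1: it becomes L, the others I
```

A design consequence worth noting: a single write never invalidates other copies, so a variable written once and then read by many PEs stays readable everywhere.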

19 RWB Scheme - State Diagram
[Figure: state diagram over the states I, R, L, and F; edges are labeled with the requests and modifiers listed below, plus BI (bus invalidate)]

Legend
- CW – CPU write request
- CR – CPU read request
- BW – Bus write request
- BR – Bus read request

Modifiers
1. generates a BW (write through)
2. interrupts bus read and supplies data from the cache
3. generates a BR (cache miss)
4. generates a BI

20 Synchronization Using Caches
- Synchronization mechanisms are built with user-level software routines that rely on hardware-supplied synchronization instructions
- A hardware-supplied synchronization primitive is an uninterruptible instruction capable of atomically retrieving and changing a value
- Test-and-set is such an instruction.

21 Test-and-Set instruction
- Atomically set bit in word iff old value was zero, return old value
- Acquire:
    test: tsl lockflag, R0    /* tsl leaves old value in R0 */
          bnz R0, test        /* was it busy? try again */
- If many PEs simultaneously test-and-set the same memory location, high bus traffic and memory contention will result.
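The acquire loop can be mimicked in software. Python has no test-and-set instruction, so the sketch below emulates the atomicity with a `threading.Lock`; the names `LockWord`, `test_and_set`, `clear`, and `acquire` are illustrative, and the internal lock only stands in for what the hardware would guarantee.

```python
import threading


class LockWord:
    """A lock flag with an emulated atomic test-and-set."""

    def __init__(self):
        self.flag = 0
        self._atomic = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self):
        """Set the flag and return its old value in one indivisible step."""
        with self._atomic:
            old = self.flag
            self.flag = 1
            return old

    def clear(self):
        """Release: reset the flag to zero."""
        self.flag = 0


def acquire(word):
    """Spin on test-and-set until the old value was 0; return the attempt count."""
    attempts = 1
    while word.test_and_set() != 0:       # was it busy? try again
        attempts += 1
    return attempts


word = LockWord()
first = word.test_and_set()    # lock was free: old value 0, caller now owns it
second = word.test_and_set()   # lock is busy: old value 1
word.clear()
attempts = acquire(word)       # free again, so the first attempt succeeds
```

Every failed attempt in `acquire` is still a write to the lock word, which is why spinning this way keeps the bus busy.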

22 Test-and-Test-and-Set Instruction
- Test-and-test-and-set keeps the test (read) local to the cache.
- Test-and-Test-and-Set:
    do {
        while (lock != 0) { /* spin on the cached copy */ }
    } while (Test-and-Set(lock) != 0);
    Critical Section;
- Advantage: Spinning happens in the cache
- Disadvantage: Can still generate a lot of traffic when many processors attempt the test-and-set at the same time, for example right after the lock is released
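A runnable sketch of a test-and-test-and-set lock is shown below. As before, the atomic test-and-set is emulated with a `threading.Lock`, and the names `TTASLock`, `tas_calls`, and `run` are hypothetical. The `time.sleep(0)` in the spin loop is only there so that Python threads yield to each other; it has no counterpart in the hardware scheme.

```python
import threading
import time


class TTASLock:
    """Spin lock that reads the flag before attempting test-and-set."""

    def __init__(self):
        self._flag = 0
        self._atomic = threading.Lock()   # emulates an atomic instruction
        self.tas_calls = 0                # how often test-and-set was issued

    def _test_and_set(self):
        with self._atomic:
            self.tas_calls += 1
            old = self._flag
            self._flag = 1
            return old

    def acquire(self):
        while True:
            while self._flag != 0:        # test: plain read of the cached copy
                time.sleep(0)             # let the lock holder make progress
            if self._test_and_set() == 0: # lock looked free: try to take it
                return

    def release(self):
        self._flag = 0


def run(num_threads, iterations):
    """Increment a shared counter under the lock; return (count, tas_calls)."""
    lock = TTASLock()
    shared = {"count": 0}

    def worker():
        for _ in range(iterations):
            lock.acquire()
            shared["count"] += 1          # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"], lock.tas_calls
```

Calling `run(4, 200)` should return a count of 800 if mutual exclusion holds; the number of test-and-set calls depends on thread scheduling and is at least the number of successful acquisitions.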

23 Synchronization using RB scheme
Test-and-Set example

PE(1)  PE(2)  …  PE(m)  S  Observation
R(0)   R(0)   …  R(0)   0  Initial state
I(0)   L(1)   …  I(0)   1  PE(2) gets lock S
R(1)   R(1)   …  R(1)   1  Some PE(i) attempts to get lock
…      …      …  …         Bus traffic
I(1)   L(0)   …  I(1)   0  PE(2) releases S
L(1)   I(0)   …  I(1)   1  PE(1) gets S

24 Synchronization using RB scheme
Test-and-Test-and-Set example

PE(1)  PE(2)  …  PE(m)  S  Observation
R(0)   R(0)   …  R(0)   0  Initial state
I(0)   L(1)   …  I(0)   1  PE(2) gets lock S
R(1)   R(1)   …  R(1)   1  Some PE(i) attempts to get lock
…      …      …  …         No bus traffic
I(1)   L(0)   …  I(1)   0  PE(2) releases S
R(0)   R(0)   …  R(0)   0  New value is broadcast
L(1)   I(0)   …  I(0)   1  PE(1) gets S
R(1)   R(1)   …  R(1)   1
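The difference in bus traffic between the two spinning styles can be estimated with a small counting model of the RB scheme. Everything here is an assumption made for illustration: test-and-set is modeled as a read followed by a write, the retried bus read is counted as its own bus operation, the spinning PEs take turns in a fixed order, and the names `RBWord` and `spin` are hypothetical.

```python
class RBWord:
    """One lock word under the RB rules, with a bus-operation counter."""

    def __init__(self, n):
        self.state = ["R"] * n
        self.value = [0] * n
        self.memory = 0
        self.bus_ops = 0

    def read(self, i):
        if self.state[i] == "I":
            self.bus_ops += 1                    # bus read
            if "L" in self.state:
                owner = self.state.index("L")
                self.bus_ops += 2                # owner's bus write + retried read
                self.memory = self.value[owner]
                self.state[owner] = "R"
            for j, s in enumerate(self.state):   # read broadcast
                if s == "I":
                    self.value[j] = self.memory
                    self.state[j] = "R"
        return self.value[i]

    def write(self, i, v):
        self.value[i] = v
        if self.state[i] != "L":
            self.bus_ops += 1                    # bus write
            self.memory = v
            self.state = ["L" if j == i else "I" for j in range(len(self.state))]

    def test_and_set(self, i):
        old = self.read(i)       # modeled as read then write,
        self.write(i, 1)         # assumed to be atomic on the bus
        return old


def spin(n, rounds, use_ttas):
    """Bus operations caused by n-1 PEs spinning while PE 0 holds the lock."""
    word = RBWord(n)
    word.test_and_set(0)                         # PE 0 takes the lock
    before = word.bus_ops
    for _ in range(rounds):
        for pe in range(1, n):                   # the waiting PEs spin in turn
            if use_ttas and word.read(pe) != 0:
                continue                         # test failed: skip test-and-set
            word.test_and_set(pe)
    return word.bus_ops - before


tas_traffic = spin(4, 10, use_ttas=False)
ttas_traffic = spin(4, 10, use_ttas=True)
```

In this model, three PEs spinning for ten rounds generate 120 bus operations with plain test-and-set and 3 with test-and-test-and-set: after the first miss, every waiting cache holds the lock word in state R and spins locally.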

25 Synchronization using RWB scheme
Test-and-Test-and-Set example

PE(1)  PE(2)  …  PE(m)  S  Observation
R(0)   R(0)   …  R(0)   0  Initial state
R(1)   F(1)   …  R(1)   1  PE(2) gets lock S
…      …      …  …         No bus traffic
I(0)   L(0)   …  I(0)   0  PE(2) releases S
R(0)   R(0)   …  R(0)   0  New value broadcast
F(1)   R(1)   …  R(1)   1  PE(1) gets S

RWB improves test-and-test-and-set performance because when a PE takes the lock, every other cache observes the write and holds the correct value.

26 Shared Bus Bandwidth
- Problem: when the number of processors is increased, bus saturation may occur.
- Solution: employing multiple shared buses
  - Private caches and memory are divided into banks
  - RB and RWB schemes can be easily extended to function correctly

[Figure: processors (P), caches (C), and memory banks (M) on multiple shared buses]
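The slides do not say how addresses are assigned to banks. One common choice, assumed here purely for illustration, is low-order interleaving, where each bank (and its bus) serves the addresses that share a remainder modulo the number of banks. The function names `bank_of` and `split_by_bank` are hypothetical.

```python
def bank_of(address, num_banks):
    """Low-order interleaving: consecutive addresses go to consecutive banks."""
    return address % num_banks


def split_by_bank(addresses, num_banks):
    """Group a stream of referenced addresses by the bank that serves them."""
    banks = [[] for _ in range(num_banks)]
    for address in addresses:
        banks[bank_of(address, num_banks)].append(address)
    return banks


per_bank = split_by_bank(range(6), 2)
```

Because every address maps to exactly one bank, each bus only ever carries traffic for its own addresses, so a snooping scheme can run independently on each bus.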

27 Summary
- The paper is concerned with private caches and their coherence schemes:
  - RB ("read broadcast")
    - The cache snoops bus reads, bus writes, and data returned in response to bus reads
    - Each address line is associated with a set of tag bits, R, I, and L, that indicate the state of the cache line
  - RWB ("read write broadcast")
    - The cache also notes the data part of the bus writes.
    - A new tag F (first-write) is added
- The test-and-test-and-set instruction is introduced to reduce bus traffic and memory contention.
- These schemes can be easily extended to function correctly in a multiple shared bus configuration

