2 Outline
- Introduction
- Cache coherence problem
- Cache coherence definition
- Cache coherence solutions:
  - Software solutions
  - Hardware solutions
- Cache coherence mechanisms:
  - Snoopy protocol
  - Directory-based protocol
- Team's implementation plan
3 Introduction
- Modern systems rely on shared-memory multiprocessors to reduce execution time.
- Each processor has its own private cache.
- Caches are important because they speed up processing: a read or write that hits in the cache completes in just a few CPU cycles.
- Multiple copies of the same data may exist in different caches.
- The big question: how do we keep all of those copies consistent?
4 Importance of Cache Regarding Performance
- Cache reduces the average latency
  - A main memory access costs from 100 to 1000 cycles
  - Cache reduces the latency to a small number of cycles
- Cache reduces the average bandwidth required to access main memory
  - Fewer accesses to the shared bus or interconnect
- Cache allows for automatic migration of data
  - Data is moved closer to the processor
- Cache automatically replicates data
  - Replication is done based on need
  - Processors can share data efficiently
- But private caches can create a problem!
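The latency argument can be made concrete with the usual average-access-time formula (hit time plus miss rate times miss penalty). The sketch below is illustrative only: the 2-cycle hit time and the 5% miss rate are assumed values, and the 100-cycle penalty is the low end of the range quoted above.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time: hit time plus the expected cost of a miss."""
    return hit_cycles + miss_rate * miss_penalty_cycles


# Assumed numbers, for illustration only.
without_cache = average_access_cycles(0, 1.0, 100)   # every access goes to memory
with_cache = average_access_cycles(2, 0.05, 100)     # 5% of accesses miss

print(f"without cache: {without_cache:.1f} cycles")
print(f"with cache:    {with_cache:.1f} cycles")
```

Even with these modest assumptions the average access cost drops by more than a factor of ten, which is the effect the slide describes.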
5 The Cache Coherence Problem
- What happens when different processors read and write the same memory location?
- Copies of the same data in different processors' caches may end up holding different values.
- When a processor modifies its copy, the change might NOT become visible to the others.
- The other processors are then left with a stale value in their caches.
6 Example
Time | Event
1    | Processor 1 reads A
2    | Processor 2 reads A
3    | Processor 1 writes 0 in A
4    | Read wrong value (1)
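The sequence in this example can be reproduced with a minimal sketch of two private caches that never talk to each other. The names (`memory`, `cache1`, `cache2`, `read`, `write`) are made up for the illustration, and the write is assumed to go through to memory; the point is only that nothing tells the second cache its copy is out of date.

```python
memory = {"A": 1}
cache1, cache2 = {}, {}


def read(cache, addr):
    """Return the cached value, fetching from memory on a miss."""
    if addr not in cache:
        cache[addr] = memory[addr]
    return cache[addr]


def write(cache, addr, value):
    """Update the local copy and memory, but send nothing to other caches."""
    cache[addr] = value
    memory[addr] = value


read(cache1, "A")           # time 1: Processor 1 reads A
read(cache2, "A")           # time 2: Processor 2 reads A
write(cache1, "A", 0)       # time 3: Processor 1 writes 0 in A
stale = read(cache2, "A")   # time 4: Processor 2 hits on its old copy

print("memory:", memory["A"], "cache 2 sees:", stale)
```

Memory and cache 1 hold 0, while cache 2 still returns the old value 1.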
7 Cache Coherence
What does cache coherence mean?
- Cache coherence is the process of ensuring that all copies of shared data are consistent.
- For correct execution, coherence must be enforced between the caches.
8 A memory is coherent if the following conditions are fulfilled:
- Write propagation: after a processor writes and some time elapses, the written value must become visible to the other processors.
- Write serialization: if two processors write to the same location at the same time, all processors must see the two writes in the same order.
9 Cache Coherence
Four primary design issues must be considered for a coherence mechanism to get the best performance:
- Coherence detection strategy
- Coherence enforcement strategy
- Precision of block-sharing information
- Cache block size
10 Four primary design issues
- Coherence detection strategy:
  - How incoherent memory accesses are detected.
  - Detection can occur at run time or at compile time.
- Precision of block-sharing information:
  - How precise the coherence detection strategy is.
  - There is a trade-off between performance and implementation cost.
11 Four primary design issues (cont.)
- Cache block size:
  - How the cache block size affects the performance of the memory system.
- Coherence enforcement strategy:
  - Whether to update or invalidate data so that no processor reads invalid data.
12 Cache Coherence Solutions
- Software solutions:
  - Software solutions rely on the operating system and the compiler.
- Hardware solutions:
  - We focus on hardware solutions because they are more common and more closely related to our course.
13 Hardware Solutions
Two basic methods for cache-memory coherence:
- Write-back: memory is updated only when the block in the cache is replaced.
- Write-through: memory is updated every time the cache is updated.
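The difference between the two policies shows up in how often memory is written. The following is a minimal single-cache sketch; the `Cache` class and `count_memory_writes` helper are hypothetical names introduced for the illustration, not part of any real simulator.

```python
class Cache:
    """One private cache using either the write-through or write-back policy."""

    def __init__(self, policy):
        self.policy = policy      # "write-through" or "write-back"
        self.lines = {}           # address -> cached value
        self.dirty = set()        # addresses modified but not yet in memory
        self.memory_writes = 0

    def write(self, memory, addr, value):
        self.lines[addr] = value
        if self.policy == "write-through":
            memory[addr] = value          # memory updated on every write
            self.memory_writes += 1
        else:
            self.dirty.add(addr)          # memory updated later, on replacement

    def evict(self, memory, addr):
        """Replace the block; a dirty write-back block is flushed first."""
        if addr in self.dirty:
            memory[addr] = self.lines[addr]
            self.memory_writes += 1
            self.dirty.discard(addr)
        self.lines.pop(addr, None)


def count_memory_writes(policy, n_writes=10):
    memory = {"A": 0}
    cache = Cache(policy)
    for value in range(1, n_writes + 1):
        cache.write(memory, "A", value)
    cache.evict(memory, "A")
    return cache.memory_writes, memory["A"]


print(count_memory_writes("write-through"))
print(count_memory_writes("write-back"))
```

Both policies leave the same final value in memory; write-back simply reaches it with a single memory write at replacement time instead of one per store.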
14 Hardware Solutions
Two basic methods for cache-cache coherence:
- Write-invalidate: before writing, the processor invalidates all other copies of the block so that it has exclusive access to it.
- Write-update: when a processor writes to a block, all other copies of the block are updated as well.
15 Cache Coherence Protocols
The methods explained in the last slide are most commonly used by the following mechanisms:
- Snoop-based protocol
- Directory-based protocol
16 Snoopy-based protocol
- Snoopy-based coherence is very popular in multi-core systems because it is simple and has low overhead.
- The bus lets each processor monitor all transactions to the shared memory.
- A controller (the “snooper”) in each cache responds to bus transactions and to other processors' requests.
- Snooping is fast when enough bus bandwidth is available, and it provides low average miss latency.
17 [Figure: Processors 1 to 4, each with a cache and a snooper, connected through a shared bus to main memory.]
18 Cont.
- All coherence transactions are broadcast, so every processor sees all of them.
- When a cache's snooper sees a write on the bus, it invalidates the line in its own cache if the line is present.
- When a cache's snooper sees a read request on the bus, it checks whether it holds the most recent copy of the data and, if so, responds to the bus request.
19 Snoop-based protocol (cont.)
Snoop-based protocols use two major methods:
- Write-invalidate
- Write-update
20 Snoop-based protocol (cont.)
Write-update protocol:
- The processor writes to its copy of the shared block.
- The write is then broadcast on the bus to the other processors.
- The other caches snoop the bus and update all of their copies of the block.
- Memory is always kept up to date.
- This method is not preferred because every write needs a broadcast, which requires more bandwidth and creates more traffic.
21 Snoop-based protocol (cont.)
Write-invalidate protocol:
- A block has one writer and many readers.
- To write to shared data:
  - An invalidate is sent to all caches, which snoop the bus and invalidate any copies they hold.
- When a read miss occurs:
  - With write-back caches, the other caches are snooped to find the most recent copy.
- It is used in most modern multicore systems because it generates less bus traffic.
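The traffic difference between the two methods can be illustrated with a simple counting model. This is a sketch under stated assumptions: one processor performs a run of consecutive writes to a block that other caches initially share, and no other processor reads the block in between. The function name `bus_messages` is hypothetical.

```python
def bus_messages(protocol, n_writes, n_sharers):
    """Count bus broadcasts for n_writes consecutive writes by one processor
    to a block that n_sharers other caches hold at the start."""
    messages = 0
    sharers = n_sharers
    for _ in range(n_writes):
        if sharers == 0:
            continue                 # writer holds the only copy: no bus traffic
        messages += 1                # broadcast either the new value or an invalidate
        if protocol == "write-invalidate":
            sharers = 0              # all other copies are gone after the first write
    return messages


print("write-update:    ", bus_messages("write-update", 10, 3))
print("write-invalidate:", bus_messages("write-invalidate", 10, 3))
```

Under these assumptions write-update needs one broadcast per write, while write-invalidate needs one broadcast for the whole run. If other processors read the block between writes, the gap narrows, since each invalidated reader then has to miss and re-fetch.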
22 Some Snoop Cache Types (based on the block states)
- Basic protocol: Modified, Shared and Invalid
- Berkeley protocol: Owned Exclusive, Owned Shared, Shared and Invalid
- Illinois protocol: Private Dirty, Private Clean, Shared and Invalid
- MESI protocol: Modified, Exclusive, Shared and Invalid
23 Snoopy-based protocol
Each block of main memory is in one of these states:
- Clean: in all caches and up to date in memory (shared)
- Dirty: in exactly one cache (exclusive)
- Un-cached: not in any cache
Each cache block can be in one of the following states:
- Modified: the only valid copy in any cache, and its value differs from the main memory copy.
- Shared: a valid copy, but other caches may also have it.
- Invalid: the block has no valid data.
- Exclusive: the copy has not been modified yet, but it is the only valid copy in any cache.
24 Example 1: If processor 1 wants to read block A
- Read hit: block A is in its own cache.
- Read miss: block A is not in its own cache, so the processor broadcasts a request on the bus:
  - If any other cache has a valid copy of the block, the block is supplied from there.
  - If not, the block is fetched from main memory, as in the following slide.
25 [Figure: Example 1. Processor 1 has a read miss on block A and reads it from main memory over the bus; P1's copy of A is in state S and the memory copy is clean.]
26 Example 2: Processor 3 wants to read block A, and the block is in the cache of P1
27 [Figure: Example 2. Processor 3 has a read miss on block A while P1 holds a copy; P1 and P3 both hold A in state S and the memory copy is clean.]
28 Example 3: Processor 4 wants to write to block A, and the block is in the caches of P1 and P3
29 [Figure: Example 3. Processor 4 has a write miss on block A; the copies in P1 and P3 change from S to I, P4 holds A in state M, and the memory copy is now dirty.]
30 Example 4: Processor 3 has an invalid copy of block A and wants to read it, and P4 holds a modified copy
31 [Figure: Example 4. Processor 3 has a read miss on block A; P4 writes its modified copy back to main memory and changes from M to S, P3's copy changes from I to S, and the memory copy changes from dirty to clean.]
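32 Requests from the processor:
Request    | Source | Block state      | Action
Read hit   | Proc   | Shared/Exclusive | Read data in cache
Read miss  | Proc   | Invalid          | Place read miss on bus
Read miss  | Proc   | Shared           | Conflict miss: place read miss on bus
Read miss  | Proc   | Modified         | Write back block, place read miss on bus
Write hit  | Proc   | Exclusive        | Write data in cache
Write hit  | Proc   | Shared           | Broadcast on bus to invalidate other copies
Write miss | Proc   | Invalid          | Place write miss on bus
Write miss | Proc   | Shared           | Conflict miss: place write miss on bus
Write miss | Proc   | Modified         | Write back, place write miss on bus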
33 Requests from the bus:
Request    | Source | Block state | Action
Read miss  | Bus    | Shared      | No action; allow memory to respond
Read miss  | Bus    | Modified    | Place block on bus; change to shared
Write miss | Bus    | Shared      | Invalidate block
Write miss | Bus    | Modified    | Write back block; change to invalid
34 Important Observations
- A processor that wants to write to a block it holds in the shared state must first upgrade the block from shared to exclusive.
- With the write-back method, main memory is updated when the processor holding a modified copy changes its state to shared.
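The snoopy examples and the request tables can be tied together in a toy simulator. The sketch below is a simplification, not a full protocol: it tracks a single block, uses only the three states M, S and I (write-invalidate with write-back), and the class name `SnoopyBus` and its methods are hypothetical.

```python
class SnoopyBus:
    """Toy MSI write-invalidate, write-back snooping protocol for one block."""

    def __init__(self, n_caches, initial_value):
        self.memory = initial_value
        self.state = ["I"] * n_caches    # per-cache block state: "M", "S" or "I"
        self.data = [None] * n_caches
        self.log = []                    # bus events, in order

    def _write_back_owner(self):
        """If some cache holds the block Modified, flush it to memory."""
        for c, s in enumerate(self.state):
            if s == "M":
                self.memory = self.data[c]
                self.log.append(f"P{c + 1} writes back")
                return c
        return None

    def read(self, p):
        i = p - 1
        if self.state[i] != "I":
            return self.data[i]                    # read hit
        self.log.append(f"P{p} places read miss on bus")
        owner = self._write_back_owner()
        if owner is not None:
            self.state[owner] = "S"                # owner downgrades M -> S
        self.data[i] = self.memory
        self.state[i] = "S"
        return self.data[i]

    def write(self, p, value):
        i = p - 1
        if self.state[i] != "M":
            # Write miss, or write hit on a shared copy: other copies must go.
            self.log.append(f"P{p} places write miss/invalidate on bus")
            self._write_back_owner()
            for c in range(len(self.state)):
                if c != i:
                    self.state[c] = "I"
        self.state[i] = "M"
        self.data[i] = value


bus = SnoopyBus(n_caches=4, initial_value=1)
bus.read(1)            # Example 1: P1 reads A from memory
bus.read(3)            # Example 2: P3 reads A while P1 holds it
bus.write(4, 0)        # Example 3: P4 writes A, P1 and P3 are invalidated
value = bus.read(3)    # Example 4: P3 re-reads A, P4 writes back first
print(bus.state, bus.memory, value)
```

After the four steps P3 and P4 hold the block in S, P1 is invalid, and memory has been brought up to date by P4's write-back.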
35 Directory-based protocol
- Each processor (or cluster of processors) has its own memory.
- The directory is distributed along with the corresponding memory.
- Each processor has:
  - Fast access to its local memory
  - Slower access to “remote” memory, which is located at other processors
- The physical address is enough to determine the location of the memory.
- Processing nodes:
  - The nodes are connected by a scalable interconnect, so messages are routed from sender to receiver instead of being broadcast.
  - Snooping is no longer possible, so the sharing state of each block is recorded in the directory.
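How a physical address identifies its home node can be sketched in a few lines. The layout here is an assumption made for the illustration: each node owns one contiguous 1 GB slice of the address space and nodes are numbered from 1. The function name `home_node` is hypothetical.

```python
GB = 1 << 30


def home_node(physical_address, memory_per_node=GB):
    """Return the 1-based number of the node whose local memory holds the address.

    Assumes equal-sized, contiguous slices of the address space per node.
    """
    return physical_address // memory_per_node + 1


print(home_node(int(2.5 * GB)))
```

With this assumed layout an address of 2.5 GB falls in the third slice, so node 3 is the home node.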
36 Directory-based protocol (cont.)
Typically three processors (nodes) are involved:
- Local node: where a request originates.
- Home node: where the memory location of an address resides.
- Remote node: holds a copy of the cache block, either exclusive or shared.
37 Directory-based protocol (cont.)
Cache states:
- Shared:
  - At least one processor has the data cached
  - Memory is up to date
  - Any processor can read the block
- Exclusive:
  - Only one processor (the owner) has the data cached
  - Memory is stale
  - Only that processor can write to the block
- Invalid (un-cached):
  - No processor has the data cached
A bit vector is used to track which processors hold the data in the shared state, or which single processor holds it in the exclusive state.
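A directory entry with a bit vector can be sketched as follows. The encoding is an assumption for illustration: bit i-1 of `sharers` is set when processor i holds a copy, and the state letters U, S and E stand for un-cached, shared and exclusive. The class name `DirectoryEntry` is hypothetical.

```python
class DirectoryEntry:
    """Directory record for one block: a state plus a presence bit vector."""

    def __init__(self):
        self.state = "U"     # "U" un-cached, "S" shared, "E" exclusive
        self.sharers = 0     # bit (p - 1) set => processor p holds a copy

    def add(self, p):
        self.sharers |= 1 << (p - 1)

    def remove(self, p):
        self.sharers &= ~(1 << (p - 1))

    def holders(self, n_processors):
        """List the processors whose presence bit is set."""
        return [p for p in range(1, n_processors + 1)
                if (self.sharers >> (p - 1)) & 1]


entry = DirectoryEntry()
entry.state = "S"
entry.add(1)
entry.add(2)
print(format(entry.sharers, "04b"), entry.holders(4))
```

One bit per processor keeps the entry compact, and in the exclusive state exactly one bit is set, identifying the owner.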
39 Directory-based protocol (cont.)
Assume processor 1 wants to read block A, whose address is “2.5 GB”:
- From the address, processor 1 recognizes that block A is in the memory of processor 3.
- Processor 1 sends a request to node 3.
- The directory of node 3 checks the state of the block, sends a copy of A to P1, puts the block in the shared state, and keeps tracking it.
40 [Figure: Example 1 on the interconnection network. P1 has a read miss on block A (address 2.5 GB) and sends a read request to node 3; the directory of node 3 checks the state of the block, sends a copy of A to P1, puts the block in the shared state, and records S: P1.]
41 Directory-based protocol (cont.)
Assume processor 2 now wants to read block A, whose address is “2.5 GB”:
- From the address, processor 2 recognizes that the block is in the memory of processor 3; the block is already in the shared state with processor 1.
- Processor 2 sends a request to node 3.
- The directory of node 3 checks the state of the block, confirms that it is in the shared state, and keeps tracking it.
42 [Figure: Example 2 on the interconnection network. P2 has a read miss on block A (address 2.5 GB); the directory of node 3 checks the state of the block, keeps it in the shared state, and records S: P1, P2.]
43 Example 3: Assume processor 4 now wants to WRITE to block A, whose address is “2.5 GB”
1. From the address, processor 4 recognizes that the block is in the memory of processor 3; the block is in the shared state with processors 1 and 2. Processor 4 sends a request to node 3.
44 2. The directory of node 3 checks the state of the block and finds it in the shared state. It then sends node-to-node requests to P1 and P2 to change the state of A from shared to invalid, and waits for their ACKs, since no bus is used here.
3. The directory is updated: the entries for the copies in P1 and P2 are deleted, the copy in P4 is recorded in the exclusive state, and the directory keeps tracking the block.
45 [Figure: Example 3 on the interconnection network. P4 has a write miss on block A (address 2.5 GB) and sends a request to node 3; node 3 sends node-to-node messages to P1 and P2 to change their copies from S to I and receives an ACK from each; P4's copy is in state E, and the directory entry changes from S: P1, P2 to E: P4.]
46 Example 4: Assume processor 1 now wants to READ block A, BUT its copy is invalid. The address of the block is “2.5 GB”.
1. From the address, processor 1 recognizes that the block is in the memory of processor 3 (the block is in the exclusive state with processor 4), so processor 1 sends a request to node 3.
2. The directory of node 3 checks the state of the block and finds that it is in the exclusive state with P4.
47 3. Node 3 forwards the request to node 4, which changes the block state to shared and writes the updated copy of block A back to the memory of node 3.
4. After that, either node 3 (the home node) or node 4 (the remote node) sends the copy of block A to node 1 (the local node).
5. Finally, the directory of node 3 updates its table and keeps tracking the block.
48 [Figure: Example 4 on the interconnection network. P1 has a read miss on block A (address 2.5 GB) and sends a request to node 3, which forwards "Read A for P1" to node 4; P4's copy changes from M to S and is written back to the memory of node 3, P1's copy changes from I to S, and the directory entry changes from E: P4 to S: P1, P4.]
49 Directory Actions
- If the block is in the un-cached state:
  - Read miss: send data, make block shared
  - Write miss: send data, make block exclusive
- If the block is in the shared state:
  - Read miss: send data, add node to sharers list
  - Write miss: send data, invalidate sharers, make exclusive
- If the block is in the exclusive state:
  - Read miss: ask owner for data, write back to memory, send data, make shared, add node to sharers list
  - Data write-back: write to memory, make un-cached
  - Write miss: ask owner for data, write to memory, send data, update identity of new owner, remain exclusive
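The directory actions can be sketched as a small state machine for a single block. This is a simplified illustration, not a complete protocol: acknowledgements are logged but not modelled, the owner-initiated "data write back" action is omitted, and the names (`Directory`, `read_miss`, `write_miss`) are hypothetical. The usage lines replay the four directory examples.

```python
class Directory:
    """Home-node directory for one block, following the action list above."""

    def __init__(self, value):
        self.state = "U"        # "U" un-cached, "S" shared, "E" exclusive
        self.sharers = set()    # in state "E" this holds only the owner
        self.memory = value     # copy in the home node's memory
        self.cache = {}         # node -> cached value (stands in for remote caches)
        self.messages = []

    def _fetch_from_owner(self):
        (owner,) = self.sharers                  # exactly one node in state "E"
        self.messages.append(f"home -> P{owner}: fetch")
        self.memory = self.cache[owner]          # write back to home memory
        return owner

    def read_miss(self, node):
        self.messages.append(f"P{node} -> home: read miss")
        if self.state == "E":
            self._fetch_from_owner()             # owner keeps a shared copy
        self.state = "S"
        self.sharers.add(node)
        self.cache[node] = self.memory
        return self.memory

    def write_miss(self, node, value):
        self.messages.append(f"P{node} -> home: write miss")
        if self.state == "E":
            owner = self._fetch_from_owner()
            if owner != node:
                del self.cache[owner]
        elif self.state == "S":
            for s in sorted(self.sharers - {node}):
                self.messages.append(f"home -> P{s}: invalidate (ACK expected)")
                del self.cache[s]
        self.state = "E"
        self.sharers = {node}
        self.cache[node] = value


d = Directory(value=1)
d.read_miss(1)            # Example 1: S: P1
d.read_miss(2)            # Example 2: S: P1, P2
d.write_miss(4, 0)        # Example 3: P1 and P2 invalidated, E: P4
value = d.read_miss(1)    # Example 4: fetched from P4, S: P1, P4
print(d.state, sorted(d.sharers), d.memory, value)
```

Unlike the snoopy case, every message here has a single destination, which is why the invalidations in Example 3 are sent one per sharer and acknowledged individually.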
50 Snoopy-Based Advantages and Disadvantages
Advantages:
- The average miss latency is low, especially for cache-to-cache misses.
- With a small number of processors, snooping is fast.
Disadvantages:
- Coherence overhead and the speed of the shared bus limit the bandwidth available for broadcasting messages to all processors.
- It does not scale to large systems, since each request is broadcast to all processors.
- Buses have scalability limits:
  - Physical (number of devices that can be attached)
  - Performance (contention on a shared resource: the bus)
51 Directory-Based Advantages and Disadvantages
Advantages:
- It scales much better than snoopy protocols (no broadcast required).
- It can exploit arbitrary point-to-point interconnects.
Disadvantages:
- The directory access and the extra interconnect traversal are on the critical path of cache-to-cache misses.
- Latency is longer than with a snoopy protocol, since there are 3 hops (request, forward, response).
52 Observation study:
- Snoopy-based protocols outperform directory-based protocols when bandwidth is high.
- As the number of processors increases, directory-based protocols outperform snoopy-based protocols.
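A rough counting model hints at why the balance shifts as processors are added. It is only a sketch under simplifying assumptions: a snooped write miss is looked up by every other cache, while a directory write miss costs one request and one reply plus an invalidate and an ACK per sharer. Both function names are hypothetical, and real systems differ in many details.

```python
def snoop_lookups(n_processors):
    """Caches that must check a broadcast request (all except the requester)."""
    return n_processors - 1


def directory_messages(n_sharers):
    """Request + reply, plus one invalidate and one ACK per sharer."""
    return 2 + 2 * n_sharers


for n in (4, 16, 64):
    print(n, "processors:", snoop_lookups(n), "snoop lookups vs",
          directory_messages(2), "directory messages (2 sharers)")
```

In this model the snooping cost grows with the number of processors, while the directory cost depends only on how many caches actually share the block.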
55 Our Implementation Plan
- We will implement these two schemes:
  - Snoopy-based protocol
  - Directory-based protocol
- We will also simulate the following:
  - Cores
  - Local caches
  - Memory access patterns
56 Our Implementation Plan (cont.)
- The following parameters will be varied to understand how they affect the performance of each scheme:
  - Number of processors
  - Cache/block size
  - Applied coherence protocol
- The collected results will include the number of hits and misses for each cache level.
- Misses will be classified as compulsory, capacity, or conflict misses.
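One common way to classify misses is sketched below. It is not necessarily the method the project will use, and it makes several assumptions: the cache under study is direct-mapped, the trace is a list of block numbers, and a fully associative LRU cache of the same capacity serves as the reference. A first-ever reference is a compulsory miss, a miss that the reference cache would also suffer is a capacity miss, and a miss that the reference cache would have avoided is a conflict miss. The function name `classify_misses` is hypothetical.

```python
from collections import OrderedDict


def classify_misses(block_trace, n_lines):
    """Classify the misses of a direct-mapped cache with n_lines lines."""
    direct = [None] * n_lines     # direct-mapped: block maps to block % n_lines
    fully = OrderedDict()         # fully associative LRU cache, same capacity
    seen = set()                  # blocks referenced at least once
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_trace:
        # Update the fully associative reference cache.
        fa_hit = block in fully
        if fa_hit:
            fully.move_to_end(block)
        else:
            if len(fully) == n_lines:
                fully.popitem(last=False)     # evict the least recently used block
            fully[block] = True

        # Access the direct-mapped cache under study.
        line = block % n_lines
        if direct[line] == block:
            counts["hit"] += 1
        else:
            if block not in seen:
                counts["compulsory"] += 1
            elif fa_hit:
                counts["conflict"] += 1
            else:
                counts["capacity"] += 1
            direct[line] = block
        seen.add(block)
    return counts


print(classify_misses([0, 2, 0, 2], n_lines=2))
print(classify_misses([0, 1, 2, 3, 0], n_lines=2))
```

In the first trace blocks 0 and 2 compete for the same line although the cache has room for both, so the repeated misses count as conflict misses; in the second trace the working set simply exceeds the cache, so the final miss is a capacity miss.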