
1 Cache Coherence Mechanisms (Research Project) CSCI-5593 Prepared by Sultan Almakdi, Abdulwahab Alazeb, Mohammed Alshehri

2 Outline:
- Introduction
- Cache coherence problem
- Cache coherence definition
- Cache coherence solutions: software solutions, hardware solutions
- Cache coherence mechanisms: snoopy protocol, directory-based protocol
- Team's implementation plan

3 Introduction
- Modern systems rely on shared-memory multiprocessors to speed up execution.
- Each processor has its own private cache.
- Caches are essential to performance: reads and writes that hit in the cache complete in just a few CPU cycles.
- There may be multiple copies of the same data in different caches.
- The big question: how do we keep all of those copies of the data consistent?

4 Importance of Cache Regarding Performance
- Cache minimizes average latency: a main memory access costs 100 to 1000 cycles, while a cache hit takes only a few cycles.
- Cache minimizes the average bandwidth required to access main memory, reducing traffic on the shared bus or interconnect.
- Cache allows automatic migration of data: data is moved closer to the processor that uses it.
- Cache automatically replicates data based on need, so processors can share data efficiently.
But private caches can create a problem!
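The latency benefit can be made concrete with the standard average-memory-access-time formula. The numbers below (2-cycle hit, 200-cycle memory access, 5% miss rate) are illustrative assumptions, not figures from the slides:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time in cycles: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Assumed, illustrative numbers: a 2-cycle cache hit, a 200-cycle main-memory
# access, and a 5% miss rate.
with_cache = amat(hit_time=2, miss_rate=0.05, miss_penalty=200)
without_cache = 200  # every access pays the full main-memory latency

print(with_cache, without_cache)  # 12.0 vs. 200 cycles
```

Even a modest hit rate collapses the average latency toward the cache's hit time, which is why the slides call the cache the key to performance.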

5 The Cache Coherence Problem
- What happens when different processors read and write the same memory location?
- Multiple copies of the data in different processors' caches may end up holding different values for the same data.
- When one processor modifies its copy of the data, the change may NOT become visible to the others.
- The other processors are then left with stale values in their caches.

6 Example

Time  Event                      Cache-1  Cache-2          Memory
1     Processor 1 reads A        1        -                1
2     Processor 2 reads A        1        1                1
3     Processor 1 writes 0 to A  0        1                0
4     Processor 2 reads A        0        1 (wrong value)  0
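The timeline above can be reproduced with a minimal sketch of two private caches and no coherence protocol; the variable and function names here are illustrative:

```python
memory = {"A": 1}          # main memory holds A = 1
cache1, cache2 = {}, {}    # private caches of processor 1 and processor 2

def read(cache, addr):
    if addr not in cache:            # miss: fetch the value from memory
        cache[addr] = memory[addr]
    return cache[addr]

def write(cache, addr, value):
    cache[addr] = value              # update this processor's copy
    memory[addr] = value             # and memory, but NOT the other cache

read(cache1, "A")            # time 1: P1 reads A and caches 1
read(cache2, "A")            # time 2: P2 reads A and caches 1
write(cache1, "A", 0)        # time 3: P1 writes 0; cache2 is untouched
stale = read(cache2, "A")    # time 4: P2 still sees the old value 1
print(stale, memory["A"])    # 1 0  (P2 read the wrong value)
```

Without a mechanism to propagate P1's write, P2's cached copy silently goes stale, which is exactly the problem coherence protocols solve.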

7 Cache Coherence: What Does It Mean?
- Cache coherence is the process of ensuring that all shared data are consistent.
- To get a correct execution, coherence must be enforced between the caches.

8 A memory system is coherent if the following conditions are fulfilled:
1. Write propagation: when any processor writes, the written value must, after some time, become visible to the other processors.
2. Write serialization: if two processors write to the same location at the same time, the writes are seen in the same order by all processors.

9 Cache Coherence
Four primary design issues must be considered to get the best performance from a coherence mechanism:
1. Coherence detection strategy.
2. Coherence enforcement strategy.
3. Precision of block-sharing information.
4. Cache block size.

10 Four primary design issues
1. Coherence detection strategy: how incoherent memory accesses are detected; detection can happen at run time or at compile time.
2. Precision of block-sharing information: how precise the sharing information kept by the detection strategy is; there is a trade-off between performance and implementation cost.

11 Four primary design issues (cont.)
3. Cache block size: how the cache block size affects the performance of the memory system.
4. Coherence enforcement strategy: whether to update or invalidate copies, so that stale data will not be read by any processor.

12 Cache Coherence Solutions
1. Software solutions: rely on the operating system and the compiler.
2. Hardware solutions: our focus, because they are more common and more relevant to this course.

13 Hardware Solutions
Two basic methods for cache-memory coherence:
1. Write-back: memory is updated only when the block in the cache is replaced.
2. Write-through: memory is updated every time the cache is updated.

14 Hardware Solutions
Two basic methods for cache-to-cache coherence:
1. Write-invalidate: the writing processor invalidates all other copies of the block, so that it gains exclusive access to it.
2. Write-update: when a processor writes to a block, all other copies of the block are updated as well.
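The difference between the two methods can be sketched over a list of private caches (plain dicts here; the function names are ours, not from the slides):

```python
def write_invalidate(caches, writer, addr, value):
    # Invalidate every other copy, then write: the writer ends up with
    # the only valid copy of the block.
    for i, cache in enumerate(caches):
        if i != writer:
            cache.pop(addr, None)    # drop (invalidate) the copy if present
    caches[writer][addr] = value

def write_update(caches, writer, addr, value):
    # Broadcast the new value: every cache that holds a copy is updated.
    for i, cache in enumerate(caches):
        if i == writer or addr in cache:
            cache[addr] = value
```

Write-update keeps every copy fresh at the cost of a broadcast per write; write-invalidate pays the broadcast only on the first write after other caches have picked up the block.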

15 Cache Coherence Protocols
The methods explained on the last slide are most commonly used by the following mechanisms:
1. Snoop-based protocol
2. Directory-based protocol

16 Snoopy-based protocol
- Snoopy-based coherence protocols are very popular in multi-core systems since they are simple and have low overhead.
- The bus allows each processor to monitor all transactions to the shared memory.
- A controller, the "snooper", in each cache responds to requests from other processors on the bus.
- Snooping is fast when there is enough bandwidth, and it provides low average miss latency.

17 [Figure: four processors, each with a private cache and a snooper, connected to main memory by a shared bus]

18 Cont.
- All coherence transactions are broadcast, so each one is seen by all other processors.
- If a cache's snooper sees a write on the bus, it invalidates the corresponding line in its own cache if present.
- If a cache's snooper sees a read request on the bus, it checks whether it holds the most recent copy of the data, and if so, responds to the request.

19 Snoop-based protocol(cont..)  Two major methods are used by Snoop- based protocol: 1.Write-invalidate. 2.Write-update. 19

20 Snoop-based protocol (cont.)
In the case of the write-update protocol:
- A write to a shared block is broadcast on the bus.
- All snoopers update their copies of the block.
- Memory is always kept up to date.
- This method is not preferred, since every write requires a broadcast, which needs more bandwidth and leads to more traffic.

21 Snoop-based protocol (cont.)
In the case of the write-invalidate protocol:
- There is one writer and many readers.
- To write to shared data, an invalidate is sent on the bus; all caches snoop it and invalidate any copies they hold.
- On a read miss with write-back caches, the caches are snooped to find the most recent copy.
- Most modern multicore systems use write-invalidate, since it causes less bus traffic.

22 Some Snoop Cache Types (based on the block states)
- Basic protocol: Modified, Shared, Invalid
- Berkeley protocol: Owned Exclusive, Owned Shared, Shared, Invalid
- Illinois protocol: Private Dirty, Private Clean, Shared, Invalid
- MESI protocol: Modified, Exclusive, Shared, Invalid

23 Snoopy-based protocol
Each block of main memory is in one of three states:
- Clean in all caches and up to date in memory (shared)
- Dirty in exactly one cache (exclusive)
- Un-cached: not in any cache
Each cache block can be in one of the following states:
- Modified: the only valid copy in any cache, and its value differs from the main memory copy.
- Shared: a valid copy, but other caches may also hold it.
- Invalid: the block holds no valid data.
- Exclusive: the copy has not been modified yet, but it is the only valid copy in any cache.
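The four cache-block states map naturally onto a small enumeration; the helper below illustrates why the Exclusive state is useful (this is our sketch, not code from the presentation):

```python
from enum import Enum

class MESI(Enum):
    MODIFIED = "M"    # only valid copy; differs from main memory (dirty)
    EXCLUSIVE = "E"   # only valid copy; identical to main memory (clean)
    SHARED = "S"      # valid copy; other caches may also hold one
    INVALID = "I"     # no valid data

def can_write_silently(state):
    # In M or E no other cache holds the line, so a write needs no bus
    # transaction (E simply upgrades to M); from S an invalidate must be
    # broadcast first, and from I a write miss goes on the bus.
    return state in (MESI.MODIFIED, MESI.EXCLUSIVE)

print(can_write_silently(MESI.EXCLUSIVE))  # True
print(can_write_silently(MESI.SHARED))     # False
```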

24 Example 1: processor 1 wants to read block A:
- Read hit: if block A is in its own cache, the read hits.
- Read miss: if block A is not in its own cache, processor 1 broadcasts a request to see:
  - whether any other cache has a valid copy of this block, in which case it gets the copy from there;
  - if not, it gets the block from main memory, as follows:

25 [Figure: P1's read miss for block A is broadcast on the bus; no other cache has a copy, so main memory supplies A and P1 caches it in the Shared (clean) state]

26 Example 2: processor 3 wants to read block A, which is in P1's cache

27 [Figure: Example 2. P3's read miss is broadcast on the bus; P1 holds a clean copy of block A and supplies it, and both P1's and P3's copies end up in the Shared state]

28 Example 3: processor 4 wants to write to block A, which is in the caches of P1 and P3

29 [Figure: Example 3. P4's write miss is broadcast on the bus; P1 and P3 invalidate their copies (S to I), and P4's copy becomes Modified (dirty)]

30 Example 4: processor 3 has an invalid copy of block A and wants to read it, while a modified copy of A is in P4's cache

31 [Figure: Example 4. P3's read miss is broadcast on the bus; P4 writes block A back to main memory, and both P3's and P4's copies end up in the Shared (clean) state]

32 Requests from the processor:

Request     Source  Block state       Action
Read hit    Proc    Shared/Exclusive  Read data in cache
Read miss   Proc    Invalid           Place read miss on bus
Read miss   Proc    Shared            Conflict miss: place read miss on bus
Read miss   Proc    Modified          Write back block, place read miss on bus
Write hit   Proc    Exclusive         Write data in cache
Write hit   Proc    Shared            Broadcast on bus to invalidate other copies
Write miss  Proc    Invalid           Place write miss on bus
Write miss  Proc    Shared            Conflict miss: place write miss on bus
Write miss  Proc    Modified          Write back block, place write miss on bus

33 Requests from the bus:

Request     Source  Block state  Action
Read miss   Bus     Shared       No action; allow memory to respond
Read miss   Bus     Modified     Place block on bus; change to shared
Write miss  Bus     Shared       Invalidate block
Write miss  Bus     Modified     Write back block; change to invalid
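The two tables can be encoded as a pair of transition functions for one cache line. This is a simplified sketch of the tables, using single-letter states (M, E, S, I) and ignoring the conflict-miss rows (which concern a different block mapped to the same line):

```python
def on_processor(state, op):
    """Transition for a request from the local processor; returns (new_state, actions)."""
    if op == "read":
        if state in ("M", "E", "S"):
            return state, ["read data in cache"]
        return "S", ["place read miss on bus"]       # I -> S
    # op == "write"
    if state == "M":
        return "M", ["write data in cache"]
    if state == "E":
        return "M", ["write data in cache"]          # silent upgrade, no bus traffic
    if state == "S":
        return "M", ["broadcast invalidate"]         # kill the other copies first
    return "M", ["place write miss on bus"]          # I -> M

def on_bus(state, op):
    """Transition for a request snooped from the bus; returns (new_state, actions)."""
    if op == "read" and state == "M":
        return "S", ["place block on bus"]           # supply data, downgrade to shared
    if op == "write" and state == "S":
        return "I", ["invalidate block"]
    if op == "write" and state == "M":
        return "I", ["write back block"]
    return state, []                                 # shared read: memory responds
```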

34 Important Observations
- If a processor now wants to write to its block, it has to upgrade the block's state from shared to exclusive.
- With the write-back method, main memory is updated once the processor holding a modified copy changes its state to shared.

35 Directory-based protocol
- Each processor (or cluster of processors) has its own memory.
- The directory is distributed along with the corresponding memory.
- Each processor has:
  - fast access to its local memory;
  - slower access to "remote" memory located at other processors.
- The physical address alone is enough to determine the location of the memory.
- The nodes are connected by a scalable interconnect, so messages are routed from sender to receiver instead of being broadcast.
- Snooping is no longer possible, so a record of the sharing state is kept in the directory in order to track each block.
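Because memory is divided among the nodes, the physical address alone identifies the home node. A sketch under the assumption of 4 nodes with 1 GB of memory each, which matches the "address 2.5 GB is in node 3" examples on the following slides:

```python
GB = 1 << 30          # bytes in a gigabyte
NODE_MEM = 1 * GB     # assumed memory per node (illustrative)
NUM_NODES = 4

def home_node(addr):
    """Home node (numbered 1..NUM_NODES) of the given physical address."""
    node = addr // NODE_MEM + 1
    assert 1 <= node <= NUM_NODES, "address out of range"
    return node

print(home_node(int(2.5 * GB)))   # 3: address 2.5 GB lives in node 3's memory
```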

36 Directory-based protocol (cont.)
Typically three nodes are involved:
- Local node: where a request originates.
- Home node: contains the memory location of the address.
- Remote node: holds a copy of the cache block, either exclusive or shared.

37 Directory-based protocol (cont.)
Cache states:
- Shared: at least one processor has the data cached; memory is up to date; any processor may read the block.
- Exclusive: only one processor (the owner) has the data cached; memory is stale; only that processor may write to it.
- Invalid (un-cached): no processor has the data cached.
A bit-vector is used to track which processors hold the data in the shared state, or which single processor holds it exclusively.
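One directory entry therefore needs only a state and a sharers bit-vector, one bit per processor. A minimal sketch (class and method names are ours):

```python
class DirectoryEntry:
    """State plus sharers bit-vector for one memory block."""

    def __init__(self):
        self.state = "U"       # U = un-cached, S = shared, E = exclusive
        self.sharers = 0       # bit i set  <=>  processor i holds a copy

    def add_sharer(self, p):
        self.state = "S"
        self.sharers |= 1 << p

    def make_exclusive(self, p):
        self.state = "E"
        self.sharers = 1 << p  # exactly one bit: the owner

    def holders(self):
        return [p for p in range(self.sharers.bit_length())
                if (self.sharers >> p) & 1]

entry = DirectoryEntry()
entry.add_sharer(0); entry.add_sharer(1)
print(entry.state, entry.holders())   # S [0, 1]
entry.make_exclusive(3)
print(entry.state, entry.holders())   # E [3]
```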

38 [Figure: four nodes, each with a processor and its caches, a local memory with a directory, and an I/O module, connected by an interconnection network]

39 Directory-based protocol (cont.)
Assume processor 1 wants to read block A, whose address is 2.5 GB:
1. From the address, processor 1 recognizes that block A is in the memory of node 3.
2. Processor 1 sends a request to node 3.
3. The directory of node 3 checks the state of the block, makes sure it is in the shared state, and keeps tracking it.

40 [Figure: P1's read miss for block A (address 2.5 GB) is sent to node 3; the directory checks the block's state, sends a copy of A to P1, puts the block in the shared state, and records the sharer list S: P1]

41 Directory-based protocol (cont.)
Assume now processor 2 wants to read block A, again at address 2.5 GB:
1. Processor 2 recognizes that the block is in the memory of node 3 and that it is in the shared state with processor 1.
2. Processor 2 sends a request to node 3.
3. The directory of node 3 checks the state of the block, makes sure it is in the shared state, and keeps tracking it.

42 [Figure: P2's read miss for block A (address 2.5 GB) is sent to node 3; the directory checks the state, sends a copy of A to P2, and updates the sharer list to S: P1, P2]

43 Example 3: assume now processor 4 wants to WRITE to block A at address 2.5 GB:
1. Processor 4 recognizes that the block is in the memory of node 3 and that it is in the shared state with processors 1 and 2.
2. Processor 4 sends a request to node 3.

44
3. The directory of node 3 checks the state of the block and confirms it is shared; it then sends node-to-node requests to P1 and P2 to change the state of A from shared to invalid, and waits for their ACKs, since there is no bus here.
4. The directory is updated: the entries for the copies at P1 and P2 are deleted, P4's copy is recorded in the exclusive state, and the block keeps being tracked.

45 [Figure: P4's write miss for block A (address 2.5 GB) is sent to node 3; the directory sends node-to-node invalidation messages to P1 and P2, waits for their ACKs, and records the block as exclusive with owner E: P4]

46 Example 4: assume now processor 1 wants to READ block A, BUT its copy is invalid. So, from the address 2.5 GB:
1. Processor 1 recognizes that the block is in the memory of node 3, BUT it is in the exclusive state at processor 4, so processor 1 sends a request to node 3.
2. The directory of node 3 checks the state of the block and finds it is in the exclusive state at P4.

47
3. Node 3 forwards the request to node 4, which changes the block's state to shared and, using the write-back technique, updates the memory of node 3 with the up-to-date copy of block A.
4. After that, either node 3 (home node) or node 4 (remote node) sends the copy of block A to node 1 (local node).
5. Finally, the directory of node 3 updates its table and keeps tracking the block.

48 [Figure: Example 4. P1's read miss is sent to node 3, which forwards it to owner P4; P4 writes block A back to node 3's memory and downgrades to shared, node 3 or node 4 supplies the copy to P1, and the directory records S: P1, P4]

49 Directory Actions
If the block is in the un-cached state:
- Read miss: send data, make the block shared.
- Write miss: send data, make the block exclusive.
If the block is in the shared state:
- Read miss: send data, add the node to the sharers list.
- Write miss: send data, invalidate the sharers, make the block exclusive.
If the block is in the exclusive state:
- Read miss: ask the owner for the data, write it back to memory, send the data, make the block shared, add the node to the sharers list.
- Data write-back: write to memory, make the block un-cached.
- Write miss: ask the owner for the data, write it to memory, send the data, update the identity of the new owner, remain exclusive.
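The action list above can be written as one dispatch function over the directory state. A sketch with our own names; it returns the messages the home node sends plus the updated (state, sharers) pair:

```python
def directory_action(state, sharers, event, node):
    """Handle `event` ('read_miss', 'write_miss', 'data_write_back') from `node`."""
    if state == "uncached":
        if event == "read_miss":
            return ["send data"], ("shared", {node})
        if event == "write_miss":
            return ["send data"], ("exclusive", {node})
    elif state == "shared":
        if event == "read_miss":
            return ["send data"], ("shared", sharers | {node})
        if event == "write_miss":
            msgs = [f"invalidate {s}" for s in sorted(sharers)] + ["send data"]
            return msgs, ("exclusive", {node})
    elif state == "exclusive":
        owner = next(iter(sharers))   # exactly one owner in this state
        if event == "read_miss":
            return [f"fetch from {owner}", "write back to memory", "send data"], \
                   ("shared", sharers | {node})
        if event == "data_write_back":
            return ["write to memory"], ("uncached", set())
        if event == "write_miss":
            return [f"fetch/invalidate {owner}", "send data"], ("exclusive", {node})
    raise ValueError(f"unhandled: {state}/{event}")
```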

50 Snoopy-Based Advantages and Disadvantages
Advantages:
- The average miss latency is low, especially for cache-to-cache misses.
- With a small number of processors, snooping is fast.
Disadvantages:
- The cache coherence overhead and the speed of the shared bus limit the bandwidth available for broadcasting messages to all processors.
- It does not scale to large systems, since each request is broadcast to all processors.
- Buses have inherent scalability limits:
  - physical (the number of devices that can be attached);
  - performance (contention on a shared resource: the bus).

51 Directory-Based Advantages and Disadvantages
Advantages:
- Directory protocols scale much better than snoopy protocols (no broadcast required).
- They can exploit arbitrary point-to-point interconnects.
Disadvantages:
- The directory access and the extra interconnect traversal are on the critical path of cache-to-cache misses.
- The latency is longer than in a snoopy protocol, since there are three hops (request, forward, response).

52 Observation study: the snoop-based protocol outperforms the directory-based one when bandwidth is high; as the number of processors increases, the directory-based protocol outperforms the snoop-based one [5].


55 Our Implementation Plan
We will implement these two schemes:
- Snoopy-based protocol
- Directory-based protocol
We will also simulate the following:
- Cores
- Local caches
- Memory access patterns

56 Our Implementation Plan (cont.)
In this implementation, the following parameters will be varied in order to understand how changing them affects the performance of each scheme:
- Number of processors
- Cache/block size
- Applied coherence protocol
The collected results will include the counts of hits and misses for each cache level.
In this project, we are going to classify each miss as a compulsory miss, a capacity miss, or a conflict miss.
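The planned miss classification can be sketched as follows, assuming a direct-mapped cache: a miss is compulsory on the first touch of a block, a capacity miss if it would also miss in a fully-associative LRU cache of the same total size, and a conflict miss otherwise. The function and parameter names are ours:

```python
from collections import OrderedDict

def classify_accesses(refs, num_blocks):
    """Classify each block reference as 'hit', 'compulsory', 'capacity' or 'conflict'."""
    direct = {}               # set index -> resident block (direct-mapped cache)
    lru = OrderedDict()       # fully-associative LRU cache of the same size
    seen = set()
    kinds = []
    for block in refs:
        s = block % num_blocks
        if direct.get(s) == block:
            kinds.append("hit")
        elif block not in seen:
            kinds.append("compulsory")        # first touch ever
        elif block not in lru:
            kinds.append("capacity")          # fully-associative cache missed too
        else:
            kinds.append("conflict")          # only the set mapping caused the miss
        direct[s] = block
        seen.add(block)
        lru[block] = True
        lru.move_to_end(block)
        if len(lru) > num_blocks:
            lru.popitem(last=False)           # evict least recently used
    return kinds
```

For example, with 4 blocks, the reference stream 0, 4, 0 yields a compulsory miss, a compulsory miss, then a conflict miss (blocks 0 and 4 collide in set 0 even though the cache has room).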

57 Thanks a lot…

58 References:
1. J. Hennessy, D. Patterson. Computer Architecture: A Quantitative Approach (5th ed.). Morgan Kaufmann, 2011.
2. Hashemi, B., "Simulation and Evaluation Snoopy Cache Coherence Protocols with Update Strategy in Shared Memory Multiprocessor Systems," Ninth IEEE International Symposium on Parallel and Distributed Processing with Applications Workshops (ISPAW), pp. 256-259, May 2011.
3. Ahmed, R.E.; Dhodhi, M.K., "Directory-based cache coherence protocol for power-aware chip-multiprocessors," Canadian Conference on Electrical and Computer Engineering (CCECE), 8-11 May.
4. Emil Gustafsson and Bruno Nilbert, "Cache Coherence in Parallel Multiprocessors," Department of Computer Science, Uppsala University, 24 February 1997.
5. Milo M. K. Martin, Daniel J. Sorin, Mark D. Hill, and David A. Wood, "Bandwidth Adaptive Snooping," 8th Annual International Symposium on High-Performance Computer Architecture (HPCA-8), 2002.

