CS 704 Advanced Computer Architecture

Lecture 28: Memory Hierarchy Design (Cache Design and Policies)
Prof. Dr. M. Ashraf Chughtai
Welcome to the 28th lecture in the series of lectures on Advanced Computer Architecture

Lecture 28 Memory Hierarchy (4)
Today’s Topics
Recap: Cache Addressing Techniques
Placement and Replacement Policies
Cache Write Strategy
Cache Performance Enhancement
Summary
MAC/VU-Advanced Computer Architecture Lecture 28 Memory Hierarchy (4)

Recap: Block Size Trade-off
Impact of block size on cache performance and the categories of cache design
The trade-off of block size versus miss rate, miss penalty, and average access time, the basic CPU performance metrics

Recap: Block Size Trade-off
A larger block size reduces the miss rate, but
if the block size is too big relative to the cache size, the miss rate goes up; and
the miss penalty goes up as the block size increases; and
combining these two parameters gives the third parameter, the average access time
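The way miss rate and miss penalty combine into the average access time can be sketched with the standard relation: average access time = hit time + miss rate x miss penalty. The block sizes, miss rates, and penalties below are hypothetical values chosen only to show the shape of the trade-off; they are not measurements from the lecture.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average access time = hit time + miss rate * miss penalty (all in cycles)."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical (miss rate, miss penalty in cycles) per block size in bytes.
# These numbers are made up for illustration only.
HYPOTHETICAL = {
    16: (0.08, 40),    # small blocks: higher miss rate, low penalty
    64: (0.04, 56),    # medium blocks: lower miss rate, moderate penalty
    256: (0.06, 120),  # blocks too big for the cache: miss rate and penalty both rise
}

# Assume a hit time of 1 cycle for every configuration.
amat = {
    size: average_access_time(1, miss_rate, penalty)
    for size, (miss_rate, penalty) in HYPOTHETICAL.items()
}
best_block_size = min(amat, key=amat.get)
```

With these assumed numbers the middle block size gives the lowest average access time, which is the behaviour the recap describes.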

Recap: Cache Organizations
Under the block placement policy, we studied three cache organizations:

Recap: Cache Organizations
Direct mapped, where each block has only one place it can appear in the cache (which leads to conflict misses)
Fully associative, where any block of main memory can be placed anywhere in the cache; and
Set associative, which allows a block to be placed anywhere within one set of places in the cache

Memory Hierarchy Designer’s Concerns
Block placement: Where can a block be placed in the upper level?
Block identification: How is a block found if it is in the upper level?
Block replacement: Which block should be replaced on a miss?
Write strategy: What happens on a write?

Block Placement Policy
Fully Associative: a block can be placed anywhere in the upper level (cache)
E.g., block 12 from main memory can be placed at block 2, 6, or any of the 8 block locations in the cache

Set Associative: a block can be placed anywhere within one set in the upper level (cache)
The set number in the upper level is given as: (Block No.) MOD (Number of Sets)
E.g., an 8-block, 2-way set-associative cache has 4 sets [0-3] of two blocks each; therefore block 12 or block 16 of main memory can go anywhere in set 0, as (12 MOD 4 = 0) and (16 MOD 4 = 0)
Similarly, block 14 can be placed at either of the 2 locations in set 2 (14 MOD 4 = 2)

Direct Mapped (1-way associative): a block can be placed at only one specific location in the upper level (cache)
The location in the cache is given by: (Block No.) MOD (Number of Cache Blocks)
E.g., block 12 or block 20 can be placed only at location 4 in a cache having 8 blocks, as:
12 MOD 8 = 4
20 MOD 8 = 4
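The three placement rules can be seen as one formula with a different number of ways. The sketch below is an illustration under two assumptions that the slides do not state: the frames of a set are numbered contiguously, and the function name `candidate_frames` is hypothetical.

```python
def candidate_frames(block_no: int, num_frames: int, ways: int) -> list[int]:
    """Return the cache frames where memory block `block_no` may be placed.

    ways == 1           -> direct mapped
    ways == num_frames  -> fully associative
    anything in between -> set associative
    """
    if num_frames % ways != 0:
        raise ValueError("num_frames must be a multiple of ways")
    num_sets = num_frames // ways
    set_no = block_no % num_sets   # (Block No.) MOD (Number of Sets)
    first = set_no * ways          # assumption: frames of a set are contiguous
    return list(range(first, first + ways))


direct = candidate_frames(12, num_frames=8, ways=1)    # location 4 only
two_way = candidate_frames(12, num_frames=8, ways=2)   # the two frames of set 0
fully = candidate_frames(12, num_frames=8, ways=8)     # any of the 8 frames
```

Direct mapped is the `ways=1` case, so the set number is also the frame number; fully associative is the case with a single set.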

Block Identification
How is a block found if it is in the upper level?
Tag/Block
A tag is associated with each block frame
The tag gives the block address
All possible tags where a block may be placed are checked in parallel
A valid bit is used to identify whether the block contains correct data
No need to check the index or block offset

Block Identification: Direct Mapped
[Figure: direct-mapped cache lookup. Lower-level (main) memory: 4 GB, 32-bit address. Address fields: Cache Tag (22 bits, e.g. 0x00), Cache Index (5 bits), Byte Select. Each cache entry holds a valid bit, a 22-bit cache tag, and 32 bytes of cache data (Byte 0 to Byte 31, Byte 32 to Byte 63, ...).]
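A minimal sketch of how a direct-mapped lookup might split a 32-bit address, assuming the field widths named in the figure (22-bit tag, 5-bit index) and a 5-bit byte select for 32-byte blocks. The function names and the `(valid, tag)` tuple layout are hypothetical.

```python
def split_address(addr: int, index_bits: int = 5, offset_bits: int = 5):
    """Split a 32-bit address into (tag, index, byte_select)."""
    byte_select = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, byte_select


def is_hit(lines, addr: int) -> bool:
    """A lookup hits when the indexed entry is valid and its stored tag matches."""
    tag, index, _ = split_address(addr)
    valid, stored_tag = lines[index]
    return valid and stored_tag == tag


# 32 entries of (valid bit, tag); only entry 1 holds a block, with tag 1.
lines = [(False, 0)] * 32
lines[1] = (True, 1)

fields = split_address(0x00000421)   # tag 1, index 1, byte 1
```

Only one tag comparison is needed here because a direct-mapped cache has exactly one candidate entry per index.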

Block Identification
[Figure: cache lookup with two tag/data banks. Address fields: Cache Tag (23 bits), Cache Index (4 bits), Byte Select, with field boundaries marked at bits 31, 9, 8, and 4. Each bank has entries 0 to 15, and each entry holds a valid bit, a 23-bit tag, and 32 bytes of data (Byte 0 to Byte 31, Byte 32 to Byte 63, ...).]

Block Replacement Policy
In case of a cache miss, a new block needs to be brought in
If the candidate block locations, as defined by the block placement policy, are all filled, then an existing block has to be evicted, based on
the cache mapping; and
some block replacement policy

For the direct-mapped cache, block replacement is very simple, as a block can be placed at only one location, given by:
(Block No.) MOD (Number of Cache Blocks)
There are three commonly used schemes for fully associative and set-associative caches. These policies are:

Random: replace any block
It is the simplest and easiest policy to implement
The candidates for replacement are randomly selected
Some designers use pseudo-random block numbers

Least Recently Used (LRU): replace the block that was either never used or used longest ago
It reduces the chances of throwing out information that may be needed soon
Here, the time of each block's most recent access is recorded
The block replaced is the one that has not been used for the longest time
E.g., if the blocks are accessed in the sequence 0, 2, 3, 0, 4, 3, 0, 1, 8, 0, the victim to replace is block 2
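The LRU example can be replayed with a small sketch. The slide does not state the cache size, so a capacity of 6 blocks is assumed here (enough to hold every block in the sequence); the function name `lru_replay` is hypothetical.

```python
from collections import OrderedDict


def lru_replay(accesses, capacity):
    """Replay accesses through an LRU cache.

    Returns (resident blocks ordered from least to most recently used,
             list of blocks evicted along the way).
    """
    cache = OrderedDict()
    evicted = []
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)            # hit: becomes most recently used
        else:
            if len(cache) == capacity:          # miss in a full cache: evict the LRU block
                victim, _ = cache.popitem(last=False)
                evicted.append(victim)
            cache[block] = True
    return list(cache), evicted


order, evicted = lru_replay([0, 2, 3, 0, 4, 3, 0, 1, 8, 0], capacity=6)
next_victim = order[0]   # the block that would be replaced on the next miss
```

Under this assumption nothing is evicted during the sequence, and block 2 ends up least recently used, matching the slide.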

First-in, First-out (FIFO): the block that was placed in the cache first is thrown out first
E.g., if the blocks are accessed in the sequence 2, 3, 4, 5, 3, 4, then to bring a new block into the cache, block 2 will be thrown out, as it is the block that entered the cache earliest
FIFO is used as an approximation to LRU, as LRU can be complicated to calculate
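A comparable sketch for FIFO, assuming a 4-block cache (the slide does not give a size) and a hypothetical new block 6 arriving after the sequence:

```python
from collections import deque


def fifo_replay(accesses, capacity):
    """Replay accesses through a FIFO-replaced cache.

    Returns (resident blocks ordered from oldest to newest arrival,
             list of blocks evicted along the way).
    """
    queue = deque()
    evicted = []
    for block in accesses:
        if block in queue:
            continue                          # hit: arrival order is NOT updated
        if len(queue) == capacity:
            evicted.append(queue.popleft())   # miss in a full cache: oldest arrival leaves
        queue.append(block)
    return list(queue), evicted


resident, evicted = fifo_replay([2, 3, 4, 5, 3, 4], capacity=4)
# Block 6 is a hypothetical new block that forces a replacement.
resident_after, evicted_after = fifo_replay([2, 3, 4, 5, 3, 4, 6], capacity=4)
```

The difference from LRU is the hit path: re-accessing blocks 3 and 4 does not change their position in the queue.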

Block Replacement Policy: Conclusion
Associativity     2-way               4-way               8-way
Cache size     LRU     Random      LRU     Random      LRU     Random
16 KB          %       %           4.7%    5.3%        %       5.0%
64 KB          %       2.0%        1.5%    1.7%        %       1.5%
256 KB         %       1.17%       1.13%   1.13%       %       1.12%

Write Strategy
Must not overwrite a cache block unless main memory is up to date
Multiple CPUs may have individual caches
I/O may address main memory directly
Memory is accessed for both read and write purposes

Write Strategy (Cont’d)
Instruction cache accesses are all reads
Instruction fetch dominates the cache traffic; writes are typically 10% of cache accesses
Furthermore, in the data cache, 10%-20% of the overall memory accesses are writes

Write Strategy (Cont’d)
To optimize cache performance, according to Amdahl’s law, we make the common case fast
Fortunately, the common case, i.e., the cache read, is easy to make fast:
a read can be optimized by performing the tag check and the data transfer in parallel
Thus, cache read performance is good

Write Strategy (Cont’d)
However, in the case of a cache write, modification of the cache contents cannot begin until the tag has been checked for an address hit
Therefore, a cache write cannot proceed in parallel with the tag check
Another complication is that the processor specifies the size of the write, which is usually only a portion of the block
Therefore, writes need special consideration

Write Strategy (Cont’d)
Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced
Write through: the information is written to both the block in the cache and the block in the lower-level memory

Write Strategy: Pros and Cons of each
Write Back:
No write to the lower level for repeated writes to the cache
A dirty bit is commonly used to indicate whether the cache block is modified (dirty) or not modified (clean)
Reduces memory-bandwidth requirements, and hence reduces memory power requirements

Write Strategy: Pros and Cons of each
Write Through:
Simplifies the replacement procedure: the block is always clean, so unlike the write-back strategy, read misses cannot result in writes to the lower level
Always combined with write buffers so that the CPU does not wait for the lower-level memory
Simplifies data coherency, as the next lower level has the most recent copy (we will discuss this later)
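The bandwidth difference between the two strategies can be illustrated by counting how many writes reach the lower level. This is a simplified sketch, assuming every store hits in the cache and every dirty block is eventually replaced exactly once; the function name and the store sequence are hypothetical.

```python
def count_memory_writes(store_blocks, policy):
    """Count writes that reach lower-level memory for a sequence of store hits.

    store_blocks: block numbers written by the CPU, in order.
    policy: "write-through" or "write-back".
    """
    if policy == "write-through":
        # Every store goes to the cache and to the lower level.
        return len(store_blocks)
    if policy == "write-back":
        dirty = set()
        for block in store_blocks:
            dirty.add(block)      # set the block's dirty bit; no memory traffic yet
        # Each dirty block is written back once, when it is replaced.
        return len(dirty)
    raise ValueError(f"unknown policy: {policy}")


stores = [7, 7, 7, 9, 7, 9]       # repeated writes to two blocks
through = count_memory_writes(stores, "write-through")
back = count_memory_writes(stores, "write-back")
```

Six stores cost six memory writes under write-through but only two write-backs under write-back, which is the "no write to the lower level for repeated writes" advantage.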

Write Buffer for Write Through
[Figure: Processor, Cache, Write Buffer, DRAM]

Write buffer is just a FIFO:
Typical number of entries: 4
Once the data is written into the write buffer, and assuming a cache hit, the CPU is done with the write
The memory controller then moves the write buffer’s contents to the real memory behind the scenes

DRAM cycle time sets the upper limit on how frequently you can write to main memory
The write buffer works as long as the frequency of stores, with respect to time, is not too high, i.e.,
Store frequency << 1 / DRAM write cycle time

If the stores are too close together, or the CPU is so much faster than the DRAM cycle time, you can end up overflowing the write buffer, and the CPU must stop and wait
A memory system designer’s nightmare is when the store frequency with respect to time approaches 1 over the DRAM write cycle time, i.e.,
CPU cycle time <= DRAM write cycle time
We call this Write Buffer Saturation
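Saturation can be illustrated with a toy cycle-by-cycle model. This is a sketch under simplifying assumptions that are not from the lecture: stores arrive at a fixed interval, the DRAM handles one write at a time, and a store that finds the buffer full stalls the CPU until a slot frees up. All names and parameter values are hypothetical.

```python
from collections import deque


def simulate_write_buffer(store_interval, dram_write_cycle, entries=4, cycles=10_000):
    """Return the number of cycles the CPU stalls because the write buffer is full."""
    buffer = deque()
    remaining = 0      # cycles left for the DRAM write currently in progress
    next_store = 0     # earliest cycle at which the CPU issues its next store
    stalls = 0
    for t in range(cycles):
        # Memory side: start draining the oldest entry, retire it when the write completes.
        if remaining == 0 and buffer:
            remaining = dram_write_cycle
        if remaining > 0:
            remaining -= 1
            if remaining == 0:
                buffer.popleft()
        # CPU side: try to place a store into the buffer.
        if t >= next_store:
            if len(buffer) < entries:
                buffer.append(t)
                next_store = t + store_interval
            else:
                stalls += 1    # buffer full: CPU must wait and retry next cycle
    return stalls


slow_stores = simulate_write_buffer(store_interval=20, dram_write_cycle=10)
fast_stores = simulate_write_buffer(store_interval=5, dram_write_cycle=10)
big_buffer = simulate_write_buffer(store_interval=5, dram_write_cycle=10, entries=64)
```

When stores arrive more slowly than the DRAM can retire them, the CPU never stalls. When they arrive faster, the model stalls with 4 entries and still stalls with 64: a larger buffer only delays the point at which it fills.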

Write Buffer Saturation
In that case, it does NOT matter how big you make the write buffer; it will still overflow, because you are simply feeding things into it faster than you can empty it
There are two solutions to this problem:
The first solution is to get rid of the write buffer and replace the write-through cache with a write-back cache

Write Buffer Saturation
[Figure: Processor, Cache, Write Buffer, L2, DRAM]

Write-Miss Policy
In case of a write miss, one of two options is used:
Write Allocate: a block is allocated on a write miss, followed by the write-hit action
No-write Allocate: write misses do not affect the cache; rather, the block is modified only in the lower-level memory

Let’s look at our 1KB direct-mapped cache again. Assume we do a 16-bit write to memory location 0x that causes a cache miss in our 1KB direct-mapped cache, which has 32-byte blocks. After we write the cache tag into the cache and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2, 3, ... Byte 31) from memory? If we do read the rest of the block in, it is called write allocate. But stop and think for a second. Is it really necessary to bring in the rest of the block on a write miss? True, the principle of spatial locality implies that we are likely to access those bytes soon. But the type of access we are going to do is likely to be another write. So even if we do read in the data, we may end up overwriting it anyway, so it is common practice NOT to read in the rest of the block on a write miss. If you don’t bring in the rest of the block (to use the more technical term, Write Not Allocate), you had better have some way to tell the processor that the rest of the block is no longer valid. This brings us to the topic of sub-blocking.

Write-Miss Policy
With no-write allocate, blocks stay out of the cache until the program tries to read them, but
with write allocate, even blocks that are only written will be in the cache
Let us discuss this with the help of an example

Write-miss Policy
Assume we do a 16-bit write to memory location 0x that causes a cache miss in our 1KB direct-mapped cache, which has 32-byte blocks
[Figure: 1 KB direct-mapped cache with 32-byte blocks. Address fields: Cache Tag (example: 0x00), Cache Index, Byte Select. Each entry holds a valid bit, a cache tag, and cache data, from Byte 0 to Byte 31 in the first entry through Byte 992 to Byte 1023 in the last.]

Write-miss Policy
Assume we do a 16-bit write to memory location 0x that causes a cache miss in our 1KB direct-mapped cache, which has 32-byte blocks
After we write the cache tag into the cache and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2, 3, ... Byte 31) from memory?
If we do read the rest of the block in, it is called write allocate

Write-miss Policy
The principle of spatial locality implies that we are likely to access the rest of the block soon
But the type of access we are going to do is likely to be another write

Write-miss Policy
So even if we do read in the data, we may end up overwriting it anyway, so it is common practice NOT to read in the rest of the block on a write miss
If you don’t bring in the rest of the block (to use the more technical term, Write Not Allocate), you had better have some way to tell the processor that the rest of the block is no longer valid

No-write allocate versus write allocate: Example
Let us consider a fully associative write-back cache with cache entries that start empty
Consider the following sequence of five memory operations and find
the number of hits and misses when using no-write allocate versus write allocate

Write Mem[100]
Write Mem[100]
Read Mem[200]
Write Mem[200]
Write Mem[100]
For no-write allocate, the address [100] is not in the cache (i.e., its tag is not in the cache)

So the first two writes result in MISSES
Address [200] is also not in the cache, so the read is also a miss
The subsequent write [200] is a hit
The last write [100] is still a miss
The result is 4 MISSES and 1 HIT

For the write-allocate policy
The first accesses to [100] and [200] are MISSES
The rest are HITS, as [100] and [200] are both found in the cache
The result is 2 MISSES and 3 HITS
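The two tallies can be checked with a short sketch. The five-operation sequence used here (two writes to 100, a read of 200, a write to 200, a final write to 100) is the one implied by the hit/miss walk-through. The cache is modelled as a set that never evicts, which is an assumption consistent with a fully associative cache that starts empty and only sees two addresses; the function name is hypothetical.

```python
def count_hits_misses(ops, write_allocate):
    """Return (hits, misses) for a fully associative cache that starts empty."""
    cache = set()    # resident addresses; assumed large enough that nothing is evicted
    hits = misses = 0
    for kind, addr in ops:
        if addr in cache:
            hits += 1
        else:
            misses += 1
            # Reads always bring the block in; writes do so only under write allocate.
            if kind == "read" or write_allocate:
                cache.add(addr)
    return hits, misses


ops = [
    ("write", 100),
    ("write", 100),
    ("read", 200),
    ("write", 200),
    ("write", 100),
]
no_write_allocate = count_hits_misses(ops, write_allocate=False)
with_write_allocate = count_hits_misses(ops, write_allocate=True)
```

The only difference between the two runs is whether a write miss inserts the address into the cache, and that alone accounts for the gap between 4 misses and 2 misses.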

No-write allocate versus write allocate: Conclusion
Either write-miss policy could be used with write-through or write-back
Normally, write-back caches use write allocate, hoping that subsequent writes to the block will be captured by the cache

No-write allocate versus write allocate: Conclusion
Write-through caches often use no-write allocate; the reason is that even if there is a subsequent write to the block, the write must still go to the lower-level memory

Allah Hafiz And Aslam-U-Alacum
