Main Memory Cache Architectures


1 Main Memory Cache Architectures
COEN 180 Main Memory Cache Architectures

2 Basics Processor speed is about 100–300 times faster than main memory access. Use faster memory as a cache. Actually: instruction queue (part of the processor); L1 cache ~ 32 KB on the processor chip; L2 cache ~ 1 MB; (L3 cache ~ 4 MB); caches on DRAM processor chips.

3 Basics Cache algorithms need to be implemented in hardware and be simple. MM is byte addressable, but only whole words are moved; typically, the last two bits of the address are not even transmitted. Hit rate needs to be high.

4 Basics Cache versus Main Memory
Cache: contains some of the data; fast. Direct Mapped Cache: a given item can only be in one cache line. Main Memory: contains all the data; slow.

5 Basics Average Access Time = (Hit Rate)*(Access to Cache) +
(Miss Rate)*(Access to MM)

6 Your Turn Assume cache access for an on-chip cache is 5 nsec.
Assume main memory access is 145 nsec. Access time for a miss is 5 nsec + 145 nsec = 150 nsec. Calculate the access times for a hit rate of 50%, 90%, 99%. Results: 77.50 nsec, 19.50 nsec, 6.45 nsec. Conclusion: hit rates need to be high.
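A minimal sketch of this calculation in Python, using the hit/miss times from this slide (5 nsec cache hit, 150 nsec miss):

```python
def average_access_time(hit_rate, cache_ns=5.0, mm_ns=145.0):
    """Hits cost only the cache access; misses cost the cache lookup
    plus the main-memory access (the miss penalty)."""
    miss_ns = cache_ns + mm_ns                     # 150 ns for a miss
    return hit_rate * cache_ns + (1 - hit_rate) * miss_ns

for h in (0.50, 0.90, 0.99):
    print(f"hit rate {h:.0%}: {average_access_time(h):.2f} nsec")
# hit rate 50%: 77.50 nsec, 90%: 19.50 nsec, 99%: 6.45 nsec
```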

7 Basics Main Memory contents An address can be 32b and a MM word can be 32b,
but addresses and contents are ontologically different things.

8 Virtual Memory Gives the impression of much more memory than there really is. Pages memory pages into and out of disk. Handled by the MMU (Memory Management Unit). Distinguish between virtual addresses and physical addresses.

9 Virtual Memory Virtual addresses are 32b long.
Or 64b for a 64b processor. Physical addresses are smaller. Correspond to maximum MM-size. Since most MM is byte addressable, but data is moved in words (4B, 8B, ...), the least significant bits of physical address are not part of the address bus.

10 Virtual Memory Can use caches at the virtual memory level
Using virtual memory addresses. Or at the physical memory level. Using physical memory addresses. If nothing is said, assume virtual memory addresses.

11 Cache Replacement Which items should be in the cache?
Algorithm needs to be very fast and simple. Need to implement algorithm in hardware. Simplest scheme: If MM item is read or written, put it in the cache. Throw out old item.

12 Direct Mapped Cache Each item in MM can be located in only one position in cache. MM addresses typically refer to a single byte (an ASCII text character), for historical reasons; hard to change. Physically, only complete words are accessed.

13 Direct Mapped Cache Address 0110 1100 1110 1110 0101 1010 1111 0010
The last two bits (byte in word) are 10 (= 2 dec): go to byte 2 in the word.

14 Direct Mapped Cache Address is split into Tag (highest order bits);
Index; Byte in word address. Typically the two least significant bits for 4B per word.
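A small sketch of this split in Python. The index width is an assumption for illustration (19b, matching the 2MB direct-mapped example later in the deck); the tag is whatever remains of the address:

```python
def split_address(addr, index_bits=19, byte_bits=2):
    """Split a 32b MM address into (tag, index, byte-in-word)."""
    byte_in_word = addr & ((1 << byte_bits) - 1)
    index = (addr >> byte_bits) & ((1 << index_bits) - 1)
    tag = addr >> (byte_bits + index_bits)
    return tag, index, byte_in_word

# The address from slide 13, written in hex:
tag, index, byte_in_word = split_address(0x6CEE5AF2)
print(tag, index, byte_in_word)    # byte_in_word is 0b10 = 2
```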

15 Direct Mapped Cache Tag serves to identify the data item in the cache.
Index is the address of the word in the cache.

16 Direct Mapped Cache MM[0110 1100 1110 1110 0101 1010 1111 0010]
The index tells us where the contents of MM[...] are stored in the cache, namely at the cache line (location) given by the index bits.

17 Direct Mapped Cache Contents of main memory address ...
and of main memory address ... would be stored at the same location in cache. To know which one is stored there, keep the tag with the contents.

18 Direct Mapped Cache Tag: identifies the item in the cache. Index: where the item is in the cache (cache line / address). Cache contents: contents of MM[...].

19 Direct Mapped Cache Your Turn
Why are the most significant bits of the address the tag and not the index? Answer: A whole region of main memory can be loaded into cache. Makes sense because of spatial locality. Neighboring MM addresses have different indices but the same tag. Otherwise, neighboring MM addresses have different tags and same index, that is, they are competing for the same cache location.

20 Direct Mapped Cache Example
Memory words are 2B long. Memory contains 128B and is byte addressable: 128 addressable items, 2^7 addresses. Memory addresses are 7b long. Cache contains 4 words: 2b cache address = index. Memory address split into: 4b tag, 2b index, 1b byte-in-word address.

21 Direct Mapped Cache Example
Main Memory contents: [table of 7b MM addresses, each with 2B of contents: FF, FF, 00, 00, FF, FF, AF, AB; the addresses themselves did not survive transcription].

22 Direct Mapped Cache Example
Assume item MM[0000010] is in cache. Cache contains the complete MM line. Split the address into tag, index, and byte-in-word address: Tag is 0000, Index is 01, Byte in Word is 0.

23 Direct Mapped Cache Example
View of Cache: a table with columns Cache line, Tag, Byte 0, Byte 1 (example contents FF FF, AB CD). Only the tag and data portion is stored; a cache line contains 2.5 B (2B contents + 4b tag). Cache line addresses are implicit.

24 Direct Mapped Cache Cache lines contain
Contents Tags Some metadata (as we will see). Distinguish between cache capacity and cache storage needs. Difference is cache storage overhead.

25 Direct Mapped Cache Vocabulary Byte addressable: one address per byte.
Cache lines: items stored at a single cache address (index).

26 Direct Mapped Cache Your Turn:
Main Memory contains 512 MB. 8 B in a word. Byte addressable. What is the length of an address? Solution: 512M = 2^9 * 2^20 = 2^29 addressable items. Addresses are 29 bits long.

27 Direct Mapped Cache Your Turn
Main Memory contains 512 MB. 8 B in a word. Byte addressable. Cache contains 1 MB; a cache line consists of 1 word (of 8B), i.e. 8B of contents plus the tag. How many cache lines? How long are indices? 1M / 8 = 128K = 2^17 cache lines. Indices are 17b long.

28 Direct Mapped Cache Your Turn
MM address is 29 bits. Index is 17 bits. How is a MM address split up? Solution: 8 B in a word ⇒ 3 bits for "Byte in Word"; 17 bits for index; 9 bits for tag. Tag: 9b, Index: 17b, Byte in Word: 3b.

29 Direct Mapped Cache Your Turn
What is the cache storage overhead? Solution: Overhead per cache line is the tag. A cache line contains 8B of contents and a 9b tag. (Plus possibly metadata, which we ignore.) Overhead is 9b / 8B = 9/64 = 14.06%.
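The same arithmetic as a quick check (a sketch under this slide's assumptions: 8B of contents and a 9b tag per cache line, metadata ignored):

```python
data_bits = 8 * 8      # 8 B of contents per cache line
tag_bits = 9           # 9 b tag
overhead = tag_bits / data_bits
print(f"{overhead:.2%}")   # 14.06%
```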

30 Reads from a Cache Input is the MM location.
Calculate the cache line from the MM location. This is where the item might be. Use the tag to check whether this is the correct item.
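A minimal sketch of this read path in Python (illustrative only: main memory is modeled as a dict of word-aligned addresses, and each cache line holds one (tag, word) pair or nothing):

```python
class DirectMappedCache:
    def __init__(self, index_bits=19, byte_bits=2):
        self.index_bits = index_bits
        self.byte_bits = byte_bits
        self.lines = [None] * (1 << index_bits)   # each line: (tag, word) or None

    def split(self, addr):
        index = (addr >> self.byte_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.byte_bits + self.index_bits)
        return tag, index

    def read(self, addr, main_memory):
        """Return (word, hit_flag) for a word-aligned address."""
        tag, index = self.split(addr)
        line = self.lines[index]
        if line is not None and line[0] == tag:    # tag matches: HIT
            return line[1], True
        word = main_memory[addr]                   # MISS: satisfy from MM
        self.lines[index] = (tag, word)            # simplest policy: always cache it
        return word, False
```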

31 Reads from a cache Assume the memory address is ...
Go to the cache line given by the index. Check whether the tags are the same. They are: the result of the look-up is the stored contents. This is a HIT.

32 Reads from a cache Assume the memory address is ...
Go to the cache line given by the index. Check whether the tags are the same. They are not: the requested word is not in the cache. A MISS.

33 Reads from a Cache Miss Penalty:
The added time necessary to find the word. In this case, go to main memory and satisfy the request from there.

34 Reads from a Cache (Miss)
Processor: "Give me MM[address]." Go to the cache; find out that it is a miss; go to main memory. Main Memory: "Here is the result, sorry it took so long." Miss Penalty: the time to go to Main Memory.

35 Your Turn Why don't we send requests to both cache and MM at the same time? This way, cache access and MM access overlap, and there is less miss penalty. Answer: main memory would be overwhelmed with all these read requests.

36 Cache Writes A "write to cache" operation updates: the contents; the tag field,
if the written item replaces another item rather than overwriting an item already in the cache; and the metadata.

37 Write Policies Write-through: Copy-back:
A write is performed to both the cache and to the main memory. Copy-back: A write is performed only to cache. If an item is replaced by another item, then the item to be replaced is copied back to main memory.

38 Write Through The processor updates the cache and main memory simultaneously.

39 Write-Through Cache and MM always contain the same contents.
When an item is replaced by another one in the cache, there is no need for additional synchronization. Write traffic goes to both cache and MM.

40 Cache Operations Write-Through
READ: Extract Tag and Index from Address. Go to the cache line given by Index. See whether the Tag matches the Tag stored there. If they match: Hit. Satisfy read from cache. If they do not match: Miss. Satisfy read from main memory. Also store item in cache. (Replacement policy, as we will see.)

41 Cache Operations Write-Through
WRITE: Extract Tag and Index from the address. Write the datum in the cache at the location given by Index. Set the tag field in the cache line to Tag. Write the datum in main memory.
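A sketch of this write operation, reusing the illustrative DirectMappedCache from the read sketch above:

```python
def write_through(cache, addr, datum, main_memory):
    """Write-through: update the cache line (contents and tag) and main memory."""
    tag, index = cache.split(addr)
    cache.lines[index] = (tag, datum)   # overwrite the line, resetting its tag
    main_memory[addr] = datum           # every write also goes to MM
```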

42 Copy Back Writes only to cache.
MM and cache are not in the same state after a write. Need to save values in the cache if item in cache is replaced.

43 Copy Back Write item MM[0000 0000 0000 1111 1111 1111 1111 1111].
Puts item MM[...] into cache. Read item MM[...]. Both items have the same index ⇒ the latter item overwrites the first item. First item not yet updated in MM: it is dirty. Need to write the contents of MM[...] to MM before putting MM[...] into cache.

44 Copy Back Read item MM[0000 0000 0000 1111 1111 1111 1111 1111].
Puts item MM[...] into cache. Read item MM[...]. Both items have the same index ⇒ the latter item overwrites the first item. First item is already in MM: it is clean. No need to write the contents of MM[...] to MM before putting MM[...] into cache.

45 Copy Back Use a “dirty bit” to distinguish between clean and dirty items. When an item is put into cache, set the dirty bit to 0. (Item is clean.) When we write to an item in cache, set the dirty bit to 1. (Item is now dirty.) When we replace item in cache, read the dirty bit. If the dirty bit is 0, no synchronization is necessary. If the dirty bit is 1, write the contents of the item into MM before replacing it.
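A sketch of the dirty-bit bookkeeping for copy-back. Here a cache line is a (tag, word, dirty) triple, and the address of the replaced item is reconstructed from its tag and the index (all names are illustrative):

```python
def copy_back_write(lines, index_bits, byte_bits, addr, datum, main_memory):
    """Copy-back write: write only to the cache and mark the line dirty.
    A dirty item being replaced is first copied back to main memory."""
    index = (addr >> byte_bits) & ((1 << index_bits) - 1)
    tag = addr >> (byte_bits + index_bits)
    old = lines[index]
    if old is not None and old[0] != tag and old[2]:           # replacing a dirty item
        old_addr = ((old[0] << index_bits) | index) << byte_bits
        main_memory[old_addr] = old[1]                          # synchronize with MM first
    lines[index] = (tag, datum, True)                           # freshly written: dirty bit = 1
```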

46 Copy Back vs. Write Through
Copy Back: Less write traffic to MM. Reads can be slower: possibly need to synchronize cache and MM if the replaced item is dirty. 1b more overhead per cache line (dirty bit).
Write Through: Write traffic at MM can slow down MM speed. Higher miss penalty. Fast cache replacement ⇒ fast reads.

47 Your Turn Use virtual memory addresses. Assume 32b = 4B words.
Memory is byte addressable. What is the storage overhead for a cache with 2MB capacity?

48 Your Turn
2MB capacity, 4B per cache line: 2M/4 = 512K = 2^19 cache lines. The index is 19b long. Virtual memory: 32b addresses; 2b for "Byte in Word"; 19b index; 11b tag (2 + 19 + 11 = 32).

49 Your turn Cache overhead per line, direct mapped cache
Cache line contains 32b of data. Tag is 11b. Copy-Back: additional dirty bit. Cache overhead per line: 12b / 32b = 37.5%.

50 Cache Misses Cache loading (when a process starts)
All data (incl. instructions) is in MM. All accesses are cache misses. Mandatory misses. Contention / Conflicts: the process needs two (or more) items that map to the same cache location. Worst case: all accesses to these items are misses.

51 Block Cache Direct mapped cache only exploits temporal locality.
Temporal locality: Items recently used are more likely to be reaccessed. MM typically is designed for larger accesses than a whole word. Block Cache moves several words (a block) into and out of cache. Exploits spatial locality. Spatial locality: Neighbors of recently accessed items are more likely to be accessed.

52 Blocks If there is a miss on a read, bring all words in the block into the cache. For a write operation: either write the word and bring the other words of the block into the cache, or write directly to main memory and do not update the cache.

53 Block Cache To look up data in the cache, we need to specify the word in the block. The address has four components: Tag; Index (address of the cache line); Word-in-block address; Byte-in-word address (if byte addressable).
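A sketch of this four-way split, using the widths from the example that follows (18b index, 2b word-in-block, 2b byte-in-word; these widths are the example's assumptions, not fixed by the scheme):

```python
def split_block_address(addr, index_bits=18, word_bits=2, byte_bits=2):
    """Split a 32b address into (tag, index, word-in-block, byte-in-word)."""
    byte_in_word = addr & ((1 << byte_bits) - 1)
    word_in_block = (addr >> byte_bits) & ((1 << word_bits) - 1)
    index = (addr >> (byte_bits + word_bits)) & ((1 << index_bits) - 1)
    tag = addr >> (byte_bits + word_bits + index_bits)
    return tag, index, word_in_block, byte_in_word
```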

54 Block Cache Block cache with copy back.

55 Block Cache Example The cache uses virtual 32b addresses.
Memory is byte addressable. Memory is accessed in words of 4B. A cache line contains 4 words of 4B. Cache size is 4MB.

56 Block Cache Example Cache line contains 4 words of 4B. Cache has 4MB capacity. Cache line contains 16B of data. 4MB/16B = 256K = 2^18 cache lines. Index is 18b long.

57 Block Cache Example There are four bytes per word.
Use 2b for “Byte in Word” address Block consists of 4 words. Word in block address is 2b.

58 Block Cache Example Address split: Tag 10b (= 32b - 18b - 2b - 2b); Index 18b; Word in Block 2b; Byte in Word 2b.

59 Block Cache Example Read MM[0101 1111 1100 0011 0101 1100 0110 1000].
Tag is 0101 1111 11. Index is 00 0011 0101 1100 0110. Word in block is 10 (= 2 dec). Byte in word is 00 (= 0 dec).

60 Block Cache Example Go to cache line 00 0011 0101 1100 0110.
Check whether the tag is 0101 1111 11. If it is (hit): read the first byte of the third word, that is, byte 1000 (= 8 dec), i.e. the ninth byte of the cache line. If it is not (miss): read all bytes from memory starting at MM[0101 1111 1100 0011 0101 1100 0110 0000] and finishing with MM[0101 1111 1100 0011 0101 1100 0110 1111].
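A quick check of this decode using the split sketched after slide 53 (the hex constant is just the slide's binary address rewritten):

```python
addr = 0x5FC35C68   # 0101 1111 1100 0011 0101 1100 0110 1000
tag, index, word_in_block, byte_in_word = split_block_address(addr)
print(f"tag={tag:010b} index={index:018b} word={word_in_block} byte={byte_in_word}")
# tag=0101111111 index=000011010111000110 word=2 byte=0

block_start = addr & ~0xF          # 16B blocks: clear the low 4 bits
print(hex(block_start), hex(block_start + 15))   # 0x5fc35c60 0x5fc35c6f
```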

61 Block Cache Uses spatial locality.
Moving larger blocks between MM and cache is more efficient because of the MM architecture. Can lead to more contention because there are fewer cache lines. Write implementations are more challenging.

62 Block Cache Your Turn Calculate the cache storage overhead for a virtual memory cache of storage capacity 512KB. Memory is byte-addressable. 4 words of 4B in block. Copy-Back

63 Block Cache Your Turn Cache line contains 4*4B = 16B of data.
512KB / 16B = 2^19 / 2^4 = 2^15 = 32K cache lines. Index is 15b. The address is split up into: Tag 13b (= 32 - 15 - 2 - 2); Index 15b; Word in block 2b; Byte in word 2b.

64 Block Cache Your Turn Cache line stores 16B = 128b Overhead:
Tag 13b Dirty bit 1b Total: 14b Storage overhead is 14b/128b = 10.9%

65 Block Cache Block Size Tradeoffs
Larger blocks take better advantage of spatial locality. Larger blocks mean bigger miss penalty Takes longer to transfer the block to cache. Larger blocks can increase miss rate If there are too few blocks stored in the cache.

66 Block Cache Block Size Tradeoffs
[Figure: miss rate, miss penalty, and average access time plotted against block size. Larger blocks exploit spatial locality but increase the miss penalty; with fewer blocks in the cache, temporal locality is compromised, so miss rate and average access time rise again for large blocks.]

67 Cache Misses Cache loading (when a process starts)
All data (incl. instructions) is in MM. All accesses are cache misses. Mandatory misses. Contention / Conflicts: the process needs two (or more) items that map to the same cache location. Worst case: all accesses to these items are misses.

68 Set Associative Cache In a direct mapped cache, there is only one possible location for a datum in the cache. Contention between two (or more) popular data is possible, resulting in low hit rates. Solution: Place more than one item in a cache line.

69 Set Associative Cache An n-way set associative cache has a cache line consisting of n pairs of Tag + Datum. The larger the associativity n, the larger the hit rate. The larger the associativity n, the more complex the read, since all n tags need to be compared in parallel. Cache replacement is more difficult to implement.

70 Set Associative Cache

71 Set Associative Cache Reads
Split up address into tag, index, byte in word Go to the cache line specified by index Test whether any of the tags match. If one does (hit) Read contents. Otherwise (miss) Go to main memory.
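A minimal sketch of this lookup in Python. Each cache line is modeled as a list of up to n (tag, word) ways; in hardware the tag comparisons happen in parallel, and the miss path here simply appends or defers to a replacement policy (all names are illustrative):

```python
def set_associative_read(lines, index_bits, byte_bits, addr, main_memory, ways=4):
    """Look up a word in an n-way set-associative cache; return (word, hit_flag)."""
    index = (addr >> byte_bits) & ((1 << index_bits) - 1)
    tag = addr >> (byte_bits + index_bits)
    line = lines[index]                      # list of (tag, word) pairs for this set
    for stored_tag, word in line:
        if stored_tag == tag:                # any matching tag: HIT
            return word, True
    word = main_memory[addr]                 # MISS: go to main memory
    if len(line) < ways:
        line.append((tag, word))             # a free way is available
    else:
        line[0] = (tag, word)                # otherwise a replacement policy picks the victim
    return word, False
```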

72 Set Associative Cache A cache with 4-way associativity
stores 4B words. Byte addressable memory. Cache capacity 8 MB.

73 Set Associative Cache Each cache line contains 4*4B = 16B.
8MB/16B = 512K = 2^19 cache lines. Index is 19b long. Byte in Word is 2b. Tag is 11b.

74 Set Associative Cache Read to location 0xf0a53b00. Tag is 0x785. Index is 0x14ec0. Byte address is 00. Go to cache line 0x14ec0.

75 Set Associative Cache Assume that cache line 0x14ec0 contains, among its four ways, an entry with tag 0x785 and data 00001ef4.
Compare all tags with 0x785. This can be done in parallel. One checks out, so the item is in cache. Our access is a hit; return 00001ef4.

76 Set Associative Cache Assume, however, cache line contents in which no tag matches 0x785: we need to satisfy the request from main memory. We also need to load the item at address 0xf0a53b00 into the cache. Now we have four choices (one per way).

77 Set Associative Cache Replacement policy
The replacement policy decides which item we should replace. Remember, it must be implemented in hardware

78 Set Associative Cache Replacement policy: Pseudo-random.
Or: replace clean items (with dirty bit zero), because we do not need to copy back the item into main memory. Or: use Least Recently Used (LRU), which replaces the item that has been least recently used. A good, simple, proven heuristic.

79 Set Associative Cache Implementation of LRU
Maintain the ordering of accesses in a bit field: 2-way associative: two orderings: 1b. 4-way associative: 4! = 24 orderings: 6b. But already too difficult to encode and decode fast. Or maintain the position in an explicit field: 2-way associative: need 1b per word; 4-way associative: need 2b per word.
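A sketch of the explicit-field variant for a 4-way set: 2b of "age" per way, where age 0 is the most recently used way and the largest age marks the LRU victim (names and encoding are illustrative):

```python
def touch(ages, way):
    """Record an access to `way`: ways that were more recent age by one."""
    for w in range(len(ages)):
        if ages[w] < ages[way]:
            ages[w] += 1
    ages[way] = 0

def lru_victim(ages):
    """The replacement victim is the way with the largest age."""
    return max(range(len(ages)), key=lambda w: ages[w])

ages = [0, 1, 2, 3]   # way 0 most recent, way 3 least recent
touch(ages, 3)        # access way 3
print(ages, lru_victim(ages))   # [1, 2, 3, 0] -> way 2 is now the LRU victim
```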

80 Set Associative Cache Implementation of LRU: Approximate
Use a recent bit. At every access, set all recent bits to zero except for the accessed item. Or: go cyclically through the cache and reset the recent bits; whenever an item is accessed, set its recent bit.
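A sketch of the first recent-bit variant on this slide (on every access, only the accessed way keeps its recent bit set; a victim is any way whose bit is clear):

```python
def touch_recent(recent, way):
    """Approximate LRU: set the accessed way's bit, clear all others."""
    for w in range(len(recent)):
        recent[w] = (w == way)

def recent_victim(recent):
    """Replace some way whose recent bit is clear (here: the first one found)."""
    return recent.index(False)
```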

81 Set Associative Cache Implementation of LRU: Example
4-way associative cache. Use 2b per item to indicate the order. [Table: a sequence of accesses and the resulting cache contents with their 2b order fields.]

82 Set Associative Cache
[Figure: a 2-way set associative cache. Each set holds two entries of (Valid, Cache Tag, Cache Data). The address tag (Adr Tag) is compared against both stored tags in parallel; the compare results are ORed to produce the Hit signal and drive the mux selects (Sel0/Sel1) that pick which way's cache block is returned.]

83 Set Associative Cache Increased cache access time.
Checking tags in parallel adds to the time. Avoids some collisions. Decreasing benefit of increasing associativity level.

84 Set Associative Block Cache
Combines set associativity and blocking

85 Associative Memory An item can be anywhere in cache.
Tag needs to be complete address. Need to compare all tags in the cache to find the item.

86 Cache Misses Compulsory, Conflict (collision), Capacity, Invalidation
Compulsory: cold start, the first access ever to the data. Conflict (collision): multiple data items map to the same cache location; can increase the cache size or increase associativity. Capacity: the cache cannot contain all the blocks in the working set. Invalidation: data is updated through I/O.

87 Invalidation The processor is not the only one to update memory.
Memory can also be updated through direct I/O. Need to tell the cache that a changed data item is invalid. Add a VALID bit to all cache contents.

