Main Memory Cache Architectures

COEN 180 Main Memory Cache Architectures

Basics Processor speed is about 100 – 300 times faster than main memory access. Use faster memory as a cache. In practice there are several levels: Instruction queue (part of the processor); L1 cache ~ 32 KB on the processor chip; L2 cache ~ 1 MB; (L3 cache ~ 4 MB); caches on the DRAM chips.

Basics Cache algorithms need to be implemented in hardware and be simple. MM is byte addressable, but only whole words are moved; typically the last two bits of the address are not even transmitted. The hit rate needs to be high.

Basics Cache versus Main Memory. Cache: contains some of the data; fast. Main Memory: contains all the data; slow. Direct mapped cache: a given item can only be in one cache line.

Basics Average Access Time = (Hit Rate)*(Cache Access Time) + (Miss Rate)*(Cache Access Time + MM Access Time), since a miss first checks the cache and then goes to main memory.

Your Turn Assume cache access for an on-chip cache is 5 nsec and main memory access is 145 nsec, so the access time for a miss is 5 nsec + 145 nsec. Calculate the average access times for hit rates of 50%, 90%, and 99%. Answers: 77.50 nsec, 19.50 nsec, 6.45 nsec. Conclusion: hit rates need to be high.
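
The arithmetic from this slide can be checked with a few lines of C. Only the 5 nsec / 145 nsec figures come from the slide; the rest is a minimal sketch of the average-access-time formula.

```c
#include <stdio.h>

/* Average access time: a hit costs only the cache access,
   a miss costs the cache access plus the main-memory access. */
int main(void) {
    const double t_cache = 5.0;    /* nsec, on-chip cache access */
    const double t_mm    = 145.0;  /* nsec, main memory access   */
    const double hit_rates[] = {0.50, 0.90, 0.99};

    for (int i = 0; i < 3; i++) {
        double h = hit_rates[i];
        double avg = h * t_cache + (1.0 - h) * (t_cache + t_mm);
        printf("hit rate %.0f%% -> average access %.2f nsec\n",
               100.0 * h, avg);
    }
    return 0;
}
```

Running it reproduces the slide's answers: 77.50, 19.50, and 6.45 nsec.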

Basics Main Memory contents: an address can be 32b and an MM word can be 32b, but addresses and contents are ontologically different things.

Virtual Memory Gives the impression of much more memory than there really is. Pages memory pages into and out of disk. Handled by the MMU (Memory Management Unit). Distinguish between virtual addresses and physical addresses.

Virtual Memory Virtual addresses are 32b long (or 64b for a 64b processor). Physical addresses are shorter; they correspond to the maximum MM size. Since most MM is byte addressable but data is moved in words (4B, 8B, ...), the least significant bits of the physical address are not part of the address bus.

Virtual Memory Can use caches at the virtual memory level Using virtual memory addresses. Or at the physical memory level. Using physical memory addresses. If nothing is said, assume virtual memory addresses.

Cache Replacement Which items should be in the cache? Algorithm needs to be very fast and simple. Need to implement algorithm in hardware. Simplest scheme: If MM item is read or written, put it in the cache. Throw out old item.

Direct Mapped Cache Each item in MM can be located in only one position in the cache. MM addresses typically refer to a single byte (an ASCII text character), for historical reasons that are hard to change. Physically, only complete words are accessed.

Direct Mapped Cache Address 0110 1100 1110 1110 0101 1010 1111 0010: the last two bits, 10 (= 2 dec), select the byte in the word at 0110 1100 1110 1110 0101 1010 1111 00.

Direct Mapped Cache Address is split into Tag (highest order bits); Index; Byte in word address. Typically the two least significant bits for 4B per word.

Direct Mapped Cache Tag serves to identify the data item in the cache. Index is the address of the word in the cache.
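
A minimal C sketch of this split. The 2b byte-in-word offset and the 12b index are read off the worked example on the following slides (they are not fixed in general); the address constant is the slide's binary address written in hex.

```c
#include <stdio.h>
#include <stdint.h>

/* Split a 32b main-memory address into tag, index and byte-in-word.
   Widths are parameters; 2b offset / 12b index matches the example
   on the next slides. */
int main(void) {
    uint32_t addr = 0x6CEE5AF2u;   /* 0110 1100 1110 1110 0101 1010 1111 0010 */
    unsigned byte_bits  = 2;       /* 4B words                                */
    unsigned index_bits = 12;      /* determined by the cache size            */

    uint32_t byte_in_word = addr & ((1u << byte_bits) - 1);
    uint32_t index = (addr >> byte_bits) & ((1u << index_bits) - 1);
    uint32_t tag   = addr >> (byte_bits + index_bits);

    printf("tag   = 0x%05x\n", tag);        /* the 18 high-order bits */
    printf("index = 0x%03x\n", index);      /* cache line number      */
    printf("byte  = %u\n", byte_in_word);   /* byte within the word   */
    return 0;
}
```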

Direct Mapped Cache MM[0110 1100 1110 1110 0101 1010 1111 0010]: the index tells us where the contents of MM[0110 1100 1110 1110 0101 1010 1111 0010] are stored in the cache, namely at cache line (location) 01 1010 1111 00.

Direct Mapped Cache Contents of main memory address 0110 1100 1110 1110 0101 1010 1111 0010 and of main memory address 1100 1111 0000 1110 0101 1010 1111 0010 would be stored at the same location in cache. To know which one is stored there, keep the tag with the contents.

Direct Mapped Cache 0110 1100 1110 1110 0101 1010 1111 0010. Tag: identifies the item in the cache. Index: where the item is in the cache (cache line / address). Cache line 01 1010 1111 00 holds tag 0110 1100 1110 1110 01 together with the contents of MM[0110 1100 1110 1110 0101 1010 1111 0010], here 0101 0101 0101 0101 0101 0101 0101 0101.

Direct Mapped Cache Your Turn Why are the most significant bits of the address the tag and not the index? Answer: so that a whole region of main memory can be loaded into the cache, which makes sense because of spatial locality: neighboring MM addresses have different indices but the same tag. Otherwise, neighboring MM addresses would have different tags and the same index, that is, they would be competing for the same cache location (see the sketch below).
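
A small illustration of this point, reusing the same assumed 18b tag / 12b index / 2b offset split from the running example: consecutive word addresses keep the same tag but get consecutive indices, so they occupy different cache lines at the same time.

```c
#include <stdio.h>
#include <stdint.h>

/* Neighboring word addresses: same tag, different indices. */
int main(void) {
    uint32_t base = 0x6CEE5AF0u;
    for (int w = 0; w < 4; w++) {
        uint32_t addr  = base + 4u * (uint32_t)w;   /* next word */
        uint32_t index = (addr >> 2) & 0xFFFu;
        uint32_t tag   = addr >> 14;
        printf("addr 0x%08x  tag 0x%05x  index 0x%03x\n", addr, tag, index);
    }
    return 0;
}
```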

Direct Mapped Cache Example Memory words are 2B long. Memory contains 128B and is byte addressable: 128 = 2^7 addressable items, so memory addresses are 7 b long. Cache contains 4 words: 2 b cache address = index. Memory address split into 4b tag, 2b index, 1b byte-in-word address.

Direct Mapped Cache Example Main Memory contents (MM address : contents; one byte per address, words are 2B): 000 0000: FF 000 0001: FF 000 0010: 00 000 0011: 00 000 0100: 00 000 0101: 00 000 0110: FF 000 0111: FF 000 1000: AF 000 1001: AB ...

Direct Mapped Cache Example Assume item MM[000 0010] is in the cache; the cache line contains the complete MM word. Split the address 000 0010 into tag, index, and byte-in-word: Tag is 0000, Index is 01, Byte in Word is 0.

Direct Mapped Cache Example View of Cache:
Cache line | Tag  | Byte 0 | Byte 1
00         | 0000 | FF     | FF
01         | 0000 | 00     | 00
10         | 0000 | 00     | 00
11         | 1100 | AB     | CD
Only the tag and the data bytes are stored, so a cache line contains 2.5 B; cache line addresses are implicit.

Direct Mapped Cache Cache lines contain Contents Tags Some metadata (as we will see). Distinguish between cache capacity and cache storage needs. Difference is cache storage overhead.

Direct Mapped Cache Vocabulary Byte addressable: one address per byte. Cache lines: items stored at a single cache address (index).

Direct Mapped Cache Your Turn: Main Memory Contains 512 MB. 8 B in a word. Byte addressable. What is the length of an address? Solution: 512M = 2^9 * 2^20 = 2^29 addressable items. Addresses are 29 bits long.

Direct Mapped Cache Your Turn Main Memory contains 512 MB, 8 B in a word, byte addressable. Cache contains 1 MB; a cache line consists of 1 word (of 8B) plus tags. How many cache lines? How long are indices? 1M / 8 = 128K = 2^17 cache lines. Indices are 17b long.

Direct Mapped Cache Your Turn MM address is 29 bits, index is 17 bits. How is an MM address split up? Solution: 8 B in a word gives 3 bits for “Byte in Word”; 17 bits for the index; 9 bits for the tag. (Tag: 9b, Index: 17b, Byte in Word: 3b.)

Direct Mapped Cache Your Turn What is the cache storage overhead? Solution Overhead per cache line is the tag. Cache line contains 8B contents. Cache line contains 9b tag. (Plus possibly metadata, which we ignore.) Overhead is 9b / 8B = 9/64 = 14.0625 %
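
The same arithmetic as a small C program. All the numbers come from this Your Turn; the loop-based log2 is just a convenience.

```c
#include <stdio.h>

/* Geometry for the Your Turn: 512 MB byte-addressable MM, 8 B words,
   1 MB direct-mapped cache with one word per line. */
int main(void) {
    long long mm_bytes    = 512LL << 20;   /* 2^29 */
    long long cache_bytes = 1LL << 20;     /* 1 MB */
    int word_bytes = 8;

    int addr_bits = 0;
    while ((1LL << addr_bits) < mm_bytes) addr_bits++;        /* 29 */

    long long lines = cache_bytes / word_bytes;               /* 2^17 */
    int index_bits = 0;
    while ((1LL << index_bits) < lines) index_bits++;         /* 17 */

    int byte_bits = 3;                                        /* 8 B per word */
    int tag_bits  = addr_bits - index_bits - byte_bits;       /* 9  */

    double overhead = (double)tag_bits / (8.0 * word_bytes);  /* 9b / 64b */
    printf("address bits %d, cache lines %lld, index %db, tag %db\n",
           addr_bits, lines, index_bits, tag_bits);
    printf("storage overhead = %.4f%%\n", 100.0 * overhead);  /* 14.0625%% */
    return 0;
}
```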

Reads from a Cache Input is the MM location. Calculate the cache line from the MM location: this is where the item might be. Use the tag to check whether this is the correct item.

Reads from a cache Assume the memory address is 0110 1100 1110 1110 0101 1010 1111 0010. Go to cache line 01 1010 1111 00. The cache holds tag 0110 1100 1110 1110 01 with data 0101 0101 0101 0101 0101 0101 0101 0101. Check whether the tags are the same. They are: the result of the look-up is 0101 0101 0101 0101 0101 0101 0101 0101. This is a HIT.

Reads from a cache Assume the memory address is 0110 1111 1110 1110 0101 1010 1111 0010. Go to cache line 01 1010 1111 00. The cache holds tag 0110 1100 1110 1110 01 with data 0101 0101 0101 0101 0101 0101 0101 0101. Check whether the tags are the same. They are not: the requested word is not in the cache. A MISS.
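
A minimal sketch in C of a direct-mapped read with the tag check. The line structure, the 12b index, and the valid bit are illustrative assumptions, not a full hardware design; the two addresses are the hit and miss examples from the slides.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Direct-mapped read: the index selects the line, then the stored tag
   is compared against the address tag. */
#define INDEX_BITS 12
#define LINES (1u << INDEX_BITS)

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[LINES];

bool cache_read(uint32_t addr, uint32_t *out) {
    uint32_t index = (addr >> 2) & (LINES - 1);
    uint32_t tag   = addr >> (2 + INDEX_BITS);
    if (cache[index].valid && cache[index].tag == tag) {
        *out = cache[index].data;      /* hit */
        return true;
    }
    return false;                      /* miss: go to main memory */
}

int main(void) {
    uint32_t index = (0x6CEE5AF2u >> 2) & (LINES - 1);
    cache[index].valid = true;
    cache[index].tag   = 0x6CEE5AF2u >> 14;
    cache[index].data  = 0x55555555u;

    uint32_t v;
    printf("0x6CEE5AF2: %s\n", cache_read(0x6CEE5AF2u, &v) ? "HIT" : "MISS");
    printf("0x6FEE5AF2: %s\n", cache_read(0x6FEE5AF2u, &v) ? "HIT" : "MISS");
    return 0;
}
```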

Reads from a Cache Miss Penalty: the added time necessary to find the word. In this case, go to main memory and satisfy the request from there.

Reads from a Cache On a miss: the processor asks “Give me MM[address]”; the cache is checked and it turns out to be a miss; the request goes on to main memory, which eventually answers “Here is the result, sorry it took so long.” Miss Penalty: the time to go to Main Memory.

Your Turn Why don’t we send requests to both cache and MM at the same time? This way, cache access and MM access would overlap, and there would be less miss penalty. Answer: main memory would be overwhelmed with all these read requests.

Cache Writes A “write to cache” operation updates the contents, the tag field (if the written item replaces another item instead of updating an existing value), and the metadata.

Write Policies Write-through: a write is performed to both the cache and to the main memory. Copy-back: a write is performed only to the cache; if an item is replaced by another item, the item to be replaced is copied back to main memory.

Write Through (diagram: the processor updates cache and main memory simultaneously).

Write-Through Cache and MM always contain the same contents. When an item is replaced by another one in the cache, there is no need for additional synchronization. Write traffic goes to both cache and MM.

Cache Operations Write-Through READ: Extract Tag and Index from Address. Go to the cache line given by Index. See whether the Tag matches the Tag stored there. If they match: Hit. Satisfy read from cache. If they do not match: Miss. Satisfy read from main memory. Also store item in cache. (Replacement policy, as we will see.)

Cache Operations Write-Through WRITE: Extract Tag and Index from the address. Write the datum in the cache at the location given by Index. Reset the tag field in the cache line with Tag. Write the datum in main memory.
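
A sketch of the write-through write path in C. mm_write() is a hypothetical stand-in for the memory bus, and the 12b index is only illustrative.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Write-through write: the datum goes into the cache line (its tag
   field is reset to the new tag) and into main memory. */
#define INDEX_BITS 12
#define LINES (1u << INDEX_BITS)

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[LINES];
static uint32_t main_memory[1u << 20];      /* small stand-in for MM */

static void mm_write(uint32_t addr, uint32_t value) {
    main_memory[(addr >> 2) % (1u << 20)] = value;
}

void write_through(uint32_t addr, uint32_t value) {
    uint32_t index = (addr >> 2) & (LINES - 1);
    uint32_t tag   = addr >> (2 + INDEX_BITS);
    cache[index].valid = true;     /* write into the cache line ...     */
    cache[index].tag   = tag;      /* ... resetting its tag field ...   */
    cache[index].data  = value;
    mm_write(addr, value);         /* ... and into main memory as well  */
}

int main(void) {
    write_through(0x6CEE5AF0u, 0xDEADBEEFu);
    printf("cache and MM updated together\n");
    return 0;
}
```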

Copy Back Writes only to the cache. MM and cache are not in the same state after a write. Need to write values back from the cache if an item in the cache is replaced.

Copy Back Write item MM[0000 0000 0000 1111 1111 1111 1111 1111]. This puts the item into the cache. Read item MM[1111 0000 0000 1111 1111 1111 1111 1111]. Both items have the same index, so the latter item overwrites the first item. The first item has not been updated in main memory: it is dirty. Need to write the contents of MM[0000 0000 0000 1111 1111 1111 1111 1111] back to MM before putting MM[1111 0000 0000 1111 1111 1111 1111 1111] into the cache.

Copy Back Read item MM[0000 0000 0000 1111 1111 1111 1111 1111]. This puts the item into the cache. Read item MM[1111 0000 0000 1111 1111 1111 1111 1111]. Both items have the same index, so the latter item overwrites the first item. The first item is unchanged and already in MM: it is clean. No need to write it back before putting MM[1111 0000 0000 1111 1111 1111 1111 1111] into the cache.

Copy Back Use a “dirty bit” to distinguish between clean and dirty items. When an item is put into cache, set the dirty bit to 0. (Item is clean.) When we write to an item in cache, set the dirty bit to 1. (Item is now dirty.) When we replace item in cache, read the dirty bit. If the dirty bit is 0, no synchronization is necessary. If the dirty bit is 1, write the contents of the item into MM before replacing it.
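
A sketch of copy-back with a dirty bit in C, using the two addresses from the previous slides. mm_read()/mm_write() are hypothetical stand-ins for the memory bus, the 12b index is illustrative, and write-allocate is assumed.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Copy-back: writes only touch the cache and set the dirty bit;
   a dirty line is written back to MM only when it is replaced. */
#define INDEX_BITS 12
#define LINES (1u << INDEX_BITS)

struct line { bool valid, dirty; uint32_t tag, data; };
static struct line cache[LINES];

static void mm_write(uint32_t addr, uint32_t v) { (void)addr; (void)v; }
static uint32_t mm_read(uint32_t addr)          { (void)addr; return 0; }

static void fill_line(uint32_t addr) {
    uint32_t index = (addr >> 2) & (LINES - 1);
    uint32_t tag   = addr >> (2 + INDEX_BITS);
    struct line *l = &cache[index];
    if (l->valid && l->dirty) {
        /* victim is dirty: synchronize MM before replacing it */
        uint32_t victim_addr = (l->tag << (2 + INDEX_BITS)) | (index << 2);
        mm_write(victim_addr, l->data);
    }
    l->valid = true; l->dirty = false;          /* clean on arrival */
    l->tag = tag; l->data = mm_read(addr);
}

void cb_write(uint32_t addr, uint32_t value) {
    uint32_t index = (addr >> 2) & (LINES - 1);
    uint32_t tag   = addr >> (2 + INDEX_BITS);
    if (!(cache[index].valid && cache[index].tag == tag))
        fill_line(addr);                        /* write-allocate (assumed) */
    cache[index].data  = value;
    cache[index].dirty = true;                  /* item is now dirty */
}

int main(void) {
    cb_write(0x000FFFFFu, 1);    /* first item, becomes dirty           */
    fill_line(0xF00FFFFFu);      /* same index: forces the write-back   */
    printf("dirty victim written back before replacement\n");
    return 0;
}
```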

Copy Back vs. Write Through Copy Back: less write traffic to MM; reads can be slower (higher miss penalty), since the cache and MM may need to be synchronized if the replaced item is dirty; 1b more overhead per cache line (dirty bit). Write Through: write traffic at MM can slow down MM speed; fast cache replacement → fast reads.

Your Turn Use virtual memory addresses. Assume 32b = 4B words. Memory is byte addressable. What is the storage overhead for a cache with 2MB capacity?

Your Turn Virtual memory: 32b addresses. 2MB capacity, 4B per cache line: 2M/4 = 512K = 2^19 cache lines, so the index is 19b long. The address splits into 2b for “Byte in Word”, 19b index, 11b tag (11 + 19 + 2 = 32).

Your turn Cache overhead per line Direct mapped cache: cache line contains 32b of data; tag is 11b. Copy-Back: one additional dirty bit. Cache overhead per line: 12b / 32b = 37.5%.

Cache Misses Cache loading (when process starts) All data (incl. instructions) is in MM. All accesses are cache misses. Mandatory misses. Contention / Conflicts Process needs two (or more items) that map to the same cache location. Worst case: all accesses to these items are misses.

Block Cache Direct mapped cache only exploits temporal locality. Temporal locality: Items recently used are more likely to be reaccessed. MM typically is designed for larger accesses than a whole word. Block Cache moves several words (a block) into and out of cache. Exploits spatial locality. Spatial locality: Neighbors of recently accessed items are more likely to be accessed.

Blocks If there is a miss on a read, bring all words in the block into the cache. For a write operation, either write the word and bring the other words of the block into the cache, or write directly to main memory and do not update the cache.

Block Cache To look up data in the cache, we need to specify the word in the block. The address has four components: Tag; Index (address of cache line); Word-in-block address; Byte-in-word address (if byte addressable).

Block Cache Block cache with copy back.

Block Cache Example Cache virtual 32b addresses. Memory is byte addressable. Memory is accessed in words of 4B. Cache line contains 4 words of 4B. Cache size is 4MB.

Block Cache Example Cache line contains 4 words of 4B. Cache has 4MB capacity. Cache line contains 16B data. 4MB/16B = 256K = 2^18 cache lines. Index is 18 b long.

Block Cache Example There are four bytes per word. Use 2b for “Byte in Word” address Block consists of 4 words. Word in block address is 2b.

Block Cache Example Address split: Tag 10b ( = 32b - 18b - 2b - 2b), Index 18b, Word in Block 2b, Byte in Word 2b.

Block Cache Example Read MM[0101 1111 1100 0011 0101 1100 0110 1000]. Tag is 01 0111 1111. Index is 00 0011 0101 1100 0110. Word in block is 10 ( = 2 dec). Byte in word is 00 ( = 0 dec).

Block Cache Example Go to cache line 00 0011 0101 1100 0110. Check whether the tag is 01 0111 1111. If it is (hit): read the first byte of the third word, that is, byte 1000 ( = 8 dec ), i.e. the ninth byte of the block. If it is not (miss): read all bytes from memory starting at MM[0101 1111 1100 0011 0101 1100 0110 0000] and finishing with MM[0101 1111 1100 0011 0101 1100 0110 1111].
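
The decomposition of this address can be checked in C; the field widths (10b tag, 18b index, 2b word-in-block, 2b byte-in-word) are the ones derived above, and the constant is the slide's binary address in hex.

```c
#include <stdio.h>
#include <stdint.h>

/* Block-cache address decomposition for the example. */
int main(void) {
    uint32_t addr = 0x5FC35C68u;   /* 0101 1111 1100 0011 0101 1100 0110 1000 */

    uint32_t byte_in_word  =  addr        & 0x3u;       /* 2b  */
    uint32_t word_in_block = (addr >> 2)  & 0x3u;       /* 2b  */
    uint32_t index         = (addr >> 4)  & 0x3FFFFu;   /* 18b */
    uint32_t tag           =  addr >> 22;               /* 10b */

    printf("tag 0x%03x, index 0x%05x, word %u, byte %u\n",
           tag, index, word_in_block, byte_in_word);

    /* on a miss the whole 16B block is fetched */
    printf("block spans 0x%08x .. 0x%08x\n",
           addr & ~0xFu, (addr & ~0xFu) + 15u);
    return 0;
}
```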

Block Cache Uses spatial locality. Moving larger blocks between MM and cache is more efficient because of the MM architecture. Can lead to more contention because there are fewer cache lines. Write implementations are more challenging.

Block Cache Your Turn Calculate the cache storage overhead for a virtual memory cache of storage capacity 512KB. Memory is byte-addressable. 4 words of 4B in block. Copy-Back

Block Cache Your Turn Cache line contains 4*4B = 16B data. 512KB / 16B = 2^19 / 2^4 = 2^15 cache lines. Index is 15b. The address is split up into Tag 13b ( = (32-15-2-2)b ), Index 15b, Word in block 2b, Byte in word 2b.

Block Cache Your Turn Cache line stores 16B = 128b Overhead: Tag 13b Dirty bit 1b Total: 14b Storage overhead is 14b/128b = 10.9%
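
The same overhead computation in C; all the numbers are from this Your Turn.

```c
#include <stdio.h>

/* Storage overhead of a copy-back block cache line: tag bits plus one
   dirty bit, relative to the data bits in the line. */
int main(void) {
    int tag_bits   = 13;
    int dirty_bits = 1;
    int data_bits  = 16 * 8;        /* 16 B of data per line */

    double overhead = (double)(tag_bits + dirty_bits) / data_bits;
    printf("overhead = %db / %db = %.1f%%\n",
           tag_bits + dirty_bits, data_bits, 100.0 * overhead);
    return 0;
}
```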

Block Cache Block Size Tradeoffs Larger blocks take better advantage of spatial locality. Larger blocks mean bigger miss penalty Takes longer to transfer the block to cache. Larger blocks can increase miss rate If there are too few blocks stored in the cache.

Block Cache Block Size Tradeoffs (figure): as the block size grows, the miss penalty increases; the miss rate first drops (exploits spatial locality) and then rises again (fewer blocks compromises temporal locality); the average access time therefore has a minimum, beyond which the increased miss penalty and miss rate dominate.

Set Associative Cache In a direct mapped cache, there is only one possible location for a datum in the cache. Contention between two (or more) popular data is possible, resulting in low hit rates. Solution: Place more than one item in a cache line.

Set Associative Cache An n-way set associative cache has a cache line consisting of n pairs of Tag + Datum. The larger the associativity n, the larger the hit rate. The larger the associativity n, the more complex the read, since all n tags need to be compared in parallel. Cache replacement is more difficult to implement.

Set Associative Cache

Set Associative Cache Reads Split up address into tag, index, byte in word Go to the cache line specified by index Test whether any of the tags match. If one does (hit) Read contents. Otherwise (miss) Go to main memory.
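
A sketch of this read procedure in C. The hardware compares the tags of all ways in parallel; the loop here is only a model, and the sizes (10b index, 4 ways) are scaled-down illustrative choices rather than the example that follows.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* n-way set associative read: the index selects a set, then all n tags
   in the set are compared against the address tag. */
#define WAYS 4
#define INDEX_BITS 10
#define SETS (1u << INDEX_BITS)

struct way { bool valid; uint32_t tag; uint32_t data; };
static struct way cache[SETS][WAYS];

bool sa_read(uint32_t addr, uint32_t *out) {
    uint32_t index = (addr >> 2) & (SETS - 1);
    uint32_t tag   = addr >> (2 + INDEX_BITS);
    for (int w = 0; w < WAYS; w++) {            /* parallel compare in HW */
        if (cache[index][w].valid && cache[index][w].tag == tag) {
            *out = cache[index][w].data;        /* hit */
            return true;
        }
    }
    return false;                               /* miss: go to main memory */
}

int main(void) {
    uint32_t addr  = 0xF0A53B00u;
    uint32_t index = (addr >> 2) & (SETS - 1);
    cache[index][2].valid = true;               /* preload one way */
    cache[index][2].tag   = addr >> (2 + INDEX_BITS);
    cache[index][2].data  = 0x00001EF4u;

    uint32_t v;
    if (sa_read(addr, &v)) printf("hit, data = 0x%08x\n", v);
    return 0;
}
```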

Set Associative Cache A cache with 4-way associativity Stores words. Byte addressable memory Cache capacity 8 MB.

Set Associative Cache Each cache line contains 4*4B = 16B. 8MB/16B = 512K = 2^19 cache lines. Index is 19b long. Byte in Word is 2b. Tag is 11b.

Set Associative Cache Read to location 1111 0000 1010 0101 0011 1011 0000 0000 = 0xf0a53b00 Tag is 111 1000 0101 = 0x785 Index is 001 0100 1110 1100 0000 = 0x14ec0 Byte address is 00 Go to cache line 0x14ec0

Set Associative Cache Assume that cache line 0x14ec0 contains four tag/data pairs (table on the slide). Compare all tags with 0x785. This can be done in parallel. If one checks out, then the item is in the cache. Our access is a hit; return 00001ef4.

Set Associative Cache Assume however cache line contents in which no tag matches 0x785 (table on the slide). We need to satisfy the request from main memory. We also need to load the item at address 0xf0a53b00. Now we have four choices of which entry to replace.

Set Associative Cache Replacement policy The replacement policy decides which item we should replace. Remember, it must be implemented in hardware

Set Associative Cache Replacement policies: Pseudo-random. Replace clean items (with dirty bit zero), because we do not need to copy the item back into main memory. Least Recently Used (LRU): replaces the item that has been least recently used. Good, simple, proven heuristics.

Set Associative Cache Implementation of LRU Maintain the ordering of accesses in a bit field: 2-way associative: two orderings: 1b. 4-way associative: 4! = 24 orderings: 6b, but already too difficult to encode and decode fast. Or maintain the position in an explicit field: 2-way associative: need 1b per word. 4-way associative: need 2b per word.

Set Associative Cache Implementation of LRU Approximate Use a recent bit. At every access, set all recent bits to zero, but for the accessed item. Or: go cyclically through the cache and reset the recent bit. If an item is accessed, set the recent bit.

Set Associative Cache Implementation of LRU: Example 4-way associative cache. Use 2b per item to indicate the access order (table of accesses and cache contents on the slide).
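
A sketch, in C, of one way to keep the 2b-per-item order fields up to date. The update rule (age everything that was more recent than the touched way, make the touched way most recent) is a standard LRU counter scheme, assumed here rather than taken from the slide.

```c
#include <stdio.h>
#include <stdint.h>

/* LRU with an explicit 2b order field per way of a 4-way set:
   0 = most recently used, 3 = least recently used. */
#define WAYS 4

static uint8_t order[WAYS] = {0, 1, 2, 3};   /* one 2b field per way */

void touch(int way) {
    uint8_t old = order[way];
    for (int w = 0; w < WAYS; w++)
        if (order[w] < old) order[w]++;      /* age the more recent ways */
    order[way] = 0;                          /* touched way is newest    */
}

int victim(void) {                           /* way to replace = LRU way */
    for (int w = 0; w < WAYS; w++)
        if (order[w] == WAYS - 1) return w;
    return 0;
}

int main(void) {
    int accesses[] = {2, 0, 2, 3};
    for (int i = 0; i < 4; i++) {
        touch(accesses[i]);
        printf("access way %d -> order %u %u %u %u, victim %d\n",
               accesses[i], order[0], order[1], order[2], order[3], victim());
    }
    return 0;
}
```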

Set Associative Cache (diagram: 2-way set associative cache — two banks of valid / cache tag / cache data entries are indexed in parallel; the address tag is compared against both stored tags; the comparator outputs Sel0/Sel1 select one cache block through a mux and are ORed to form the Hit signal).

Set Associative Cache Increased cache access time. Checking tags in parallel adds to the time. Avoids some collisions. Decreasing benefit of increasing associativity level.

Set Associative Block Cache Combines set associativity and blocking

Associative Memory An item can be anywhere in cache. Tag needs to be complete address. Need to compare all tags in the cache to find the item.

Cache Misses Compulsory: cold start, the first access ever to the data. Conflict (collision): multiple data items map to the same cache location; can increase cache size or increase associativity. Capacity: the cache cannot contain all the blocks in the working set. Invalidation: data is updated through I/O.

Invalidation The processor is not the only one to update memory; memory can also be updated through direct I/O. Need to tell the cache that a changed data item is invalid. Add a VALID bit to all cache contents.