1
Caches
CSC/EE/CPE 3760 – Dr. Timothy Heil – Winter 2018
MW 3:00 PM – 5:00 PM, OMH 232. Office: OMH 246, (206) …
Some slides/material courtesy Dr. Kevin Bolding. All copyright Seattle Pacific University.
2
Locality
On an open-book exam, you may look up a formula in:
- Your memory
- A sheet of notes
- Course handouts
- The textbook
More information → slower access. Big is slow!
(Dilbert comic; copyright Scott Adams.)
3
Big is Slow, Also True for Computers
We have discussed a system with two kinds of memory:
- Registers: close to the CPU, small in number, fast (1 clock cycle)
- Main memory: "far" from the CPU, big, slow (hundreds of clock cycles)
[Figure: CPU registers connected to main memory by load/I-fetch and store paths.]
Assembly language programmers and compilers manage all transfers between registers and main memory via LW/SW.
4
DRAM Access Times Today
[Figure: DRAM access latency breakdown across the CPU, memory controller, and DRAM array: 10 ns, 20 ns, 20 ns, and 10 ns of link/controller delays, 14 ns for the physical memory access, plus 16 ns of protocol and latency overhead. Note: 30 ns is the page-closed access latency; a page hit is more like 16 ns. Total: 90 ns.]
5
Memory is Slow!
[Figure: the five-stage pipeline (IF RF EX M WB) for an LW, highlighting the instruction fetch and memory access stages.]
DRAM access takes around 90 ns. At 1 GHz, that's 90 cycles; at 3 GHz, that's 270 cycles! Since every instruction has to be fetched from memory, we lose big time, and we lose double when executing a load or store.
6
Why Not SRAM?
DRAM latency: ~15 ns. DRAM cell size: one transistor, one capacitor.
SRAM latency: ~0.5 ns. Hooray? But SRAM cell size (and thus cost) is much greater (a typical SRAM cell uses six transistors), and larger SRAM arrays have worse access times.
Isn't there anything better? Flash? Not really (yet!)
7
Solution: Caching!
Keep frequently accessed data close to the CPU; leave rarely-used data in memory; migrate data back and forth as needed.
[Figure: CPU/registers (small, close, fast) → cache (SRAM) → main memory (DRAM; big, far, slow), connected by load/I-fetch and store paths. Figure courtesy D. Patterson.]
Challenge: which addresses go in the cache? Should the programmer have to worry about this? (No way!)
8
What to Cache?
Suppose main memory = 64 GB and the cache = 32 KB. What fraction of main memory can you put in the cache? One two-millionth! (32 KB / 64 GB = 2^15 / 2^36 = 1/2^21 ≈ 1/2,000,000.)
9
Locality
Fortunately for us, almost all programs exhibit locality of access:
- Spatial locality: data that is nearby is more likely to be accessed.
- Temporal locality: data that was recently used is more likely to be accessed again.
We can use these properties to guess which data is likely to be used in the near future. Prediction again! Keep around the data most likely to be accessed.
10
Locality Example
    for (i = 0; i < a_len; i++) { A[i] = B[i] + C[i]; }
    if (i < 20) { z = i*i + 3*i - 2; }
    q = A[i];

    name = employee.name;
    rank = employee.rank;
    salary = employee.salary;
Temporal locality: the program is very likely to access the same data again and again over time (e.g., i and a_len on every loop iteration).
Spatial locality: the program is very likely to access data that is close together (e.g., consecutive array elements, adjacent struct fields).
11
Cache Design Questions
[Figure: a main-memory fragment (word addresses 1000–1056 holding values such as 5600, 3223, 23, 1122, …) beside a small cache holding a few of those words, e.g., 5600 from address 1000, 43 from 1028, and 2447 from 1048.]
- How do we find addresses in the cache?
- What if an address is not in the cache?
- If the cache is full, which address gets replaced?
- What about writes?
12
Design Goals
- Functionally correct: reads and writes should behave as if they all went to main memory, just faster!
- Complete: data may come from anywhere in main memory.
- Fast lookup: we have to look up data in the cache on every memory access, in about 1 processor clock.
- Good hit rate: exploit temporal locality by caching recently accessed data, and spatial locality by caching data near recently accessed data.
13
Direct-Mapped Caches
[Figure: a main memory with 6-bit addresses (16 words) and a 4-entry direct-mapped cache with indices 00–11. Each cache entry holds a valid bit, a tag, and data; the address splits into tag | index | byte offset (the byte offset is always zero for word accesses). Sample entries: (tag 00, valid, 5600), (tag 11, valid, 775), (tag 01, valid, 845), (tag 00, invalid, 33234).]
In a direct-mapped cache:
- Each memory address corresponds to exactly one location in the cache.
- Multiple memory locations share each cache entry (four in this case).
14
Hits and Misses
When the CPU reads from memory:
- Calculate the index and tag from the address.
- If the address tag equals the tag stored in the cache at that index: hit! The data is in the cache, and the read is fast.
- If the tags differ: miss. The data is not in the cache, and the read is slow.
Handling misses:
- Read the word from memory (slow) and give it to the CPU.
- Replace the current cache data with the new data, setting the tag, valid bit, and data appropriately. This exploits temporal locality!
The hit rate is the percentage of memory accesses that are hits; typically, hit rates are around 95%. The miss rate is the percentage that are misses (100% − hit rate).
15
A Direct-Mapped Cache with 1024 Entries
[Figure: a 1024-entry direct-mapped cache. The 32-bit address splits into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset (bits 1–0). The index selects one of entries 0–1023, each one block: V, 20-bit tag, 32-bit data. The stored tag is compared with the address tag (1 = hit, 0 = miss) and the data word is read out.]
16
Example – 1024-entry DM-cache
[Figure: the 1024-entry cache, partially filled. Assume the cache has been in use for a while, so it's not empty; among its entries, index 3 holds a valid line with tag 14.]
LW $t3, 0x0000E00C($0): address 0x0000E00C → tag = 14, index = 3, byte offset = 0. Hit: the data at index 3 is returned.
LB $t3, 0x00003005($0) (let's assume the word at mem[0x00003004] = 8764): tag = 3, index = 1, byte offset = 1. Miss: load the word from mem[0x00003004] and write it into the cache at index 1 (tag 3, data 8764).
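A tiny C check of the field arithmetic above, using the 20-bit tag / 10-bit index / 2-bit byte offset layout (the helper name is mine):

    #include <stdint.h>
    #include <stdio.h>

    /* Split a byte address: tag = bits 31..12, index = bits 11..2,
     * byte offset = bits 1..0. */
    static void decode(uint32_t addr) {
        uint32_t tag    = addr >> 12;
        uint32_t index  = (addr >> 2) & 0x3FF;
        uint32_t offset = addr & 0x3;
        printf("0x%08X -> tag=%u index=%u byte offset=%u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
    }

    int main(void) {
        decode(0x0000E00C);  /* LW example: tag 14, index 3, offset 0 */
        decode(0x00003005);  /* LB example: tag 3,  index 1, offset 1 */
        return 0;
    }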
17
Cache Practice Problem #1
Let's trace the first three memory accesses for the direct-mapped cache. First, we'll need to determine the index, tag, and byte offset.
18
Exploiting Spatial Locality
Spatial locality says that physically close data is likely to be accessed close together. So on a cache miss, don't just grab the word needed; also grab the words nearby.
- Organize memory into multi-word cache blocks.
- Memory transfers between the cache and memory are always one full block.
[Figure: a main-memory fragment organized into 4-word blocks; each block is 16 bytes.]
Example of 4-word blocks: on a miss, the cache copies the entire block that contains the desired word.
19
Blocks
The block size may be any power of 2 words: 1, 2, 4, 8, 16, …
[Figure: one cache entry holding a 4-word block: V, tag, then word 3, word 2, word 1, word 0. The 32-bit address splits into an 18-bit tag (bits 31–14), a 10-bit index (bits 13–4), a 2-bit block offset (bits 3–2), and a 2-bit byte offset (bits 1–0).]
The requested word may be at any position within a block. All words in the same block have the same index and tag.
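In hardware the block offset drives a word-select mux; a one-function C sketch of that selection (the function name is mine):

    #include <stdint.h>

    /* 4-word blocks: address bits 3..2 pick the word within the block. */
    uint32_t select_word(const uint32_t block[4], uint32_t addr) {
        uint32_t block_offset = (addr >> 2) & 0x3;  /* which of the 4 words */
        return block[block_offset];                 /* the mux, in software */
    }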
20
32KByte/4-Word Block D.M. Cache
32 KB / (4 words/block) / (4 bytes/word) → 2K blocks (2^11 = 2K).
[Figure: a 2048-entry direct-mapped cache with 4-word blocks. The address splits into a 17-bit tag (bits 31–15), an 11-bit index (bits 14–4), a 2-bit block offset (bits 3–2), and a 2-bit byte offset (bits 1–0). The index selects one of entries 0–2047; the 17-bit tags are compared for hit/miss, and a 4-to-1 mux driven by the block offset selects the requested 32-bit word.]
21
Examples
128-byte cache, 4-word blocks, 10-bit addresses, direct-mapped. How do we calculate the fields?
- Byte offset (2 bits): we have 32-bit = 4-byte words, so 2 bits choose between the 4 bytes (on a load-byte).
- Block offset (2 bits): we have 4 words per block (given); lg(4) = 2 bits choose which of the 4 words we want.
- Number of entries: (cache size) / (words/block) / (bytes/word). Each block (entry) holds 4 words = 16 bytes, so 128 bytes / 16 bytes per block = 8 blocks.
- Index (3 bits): to index 8 blocks, we need lg(8) = 3 bits.
- Tag (3 bits): whatever is left of the 10-bit address = 10 − 2 − 2 − 3 = 3 bits.
[Figure: the 8-entry direct-mapped cache (indices 000–111) traced over a sequence of accesses, giving Miss, Hit, Miss, Miss, Miss, Miss, Hit, Miss.]
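The same arithmetic as a small C sketch (the lg2 helper is mine; the printed values match the bullets above):

    #include <stdio.h>

    /* lg() for exact powers of two: count the shifts. */
    static unsigned lg2(unsigned x) {
        unsigned n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    int main(void) {
        unsigned cache_bytes = 128, words_per_block = 4, bytes_per_word = 4;
        unsigned addr_bits = 10;

        unsigned byte_off  = lg2(bytes_per_word);                            /* 2 */
        unsigned block_off = lg2(words_per_block);                           /* 2 */
        unsigned blocks    = cache_bytes / words_per_block / bytes_per_word; /* 8 */
        unsigned index     = lg2(blocks);                                    /* 3 */
        unsigned tag       = addr_bits - index - block_off - byte_off;       /* 3 */

        printf("blocks=%u index=%u tag=%u block_off=%u byte_off=%u\n",
               blocks, index, tag, block_off, byte_off);
        return 0;
    }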
22
Performance
Miss rates for the DEC 3100 (MIPS machine), with separate 64KB instruction/data caches (16K 1-word blocks):

    Benchmark   Instruction miss rate   Data miss rate   Combined miss rate
    gcc         6.1%                    2.1%             5.4%
    spice       1.2%                    1.3%             1.2%

Note: the combined rate isn't just the average of the two; the instruction and data rates are weighted by how often each kind of access occurs.
23
Impact of Spatial Locality?
Miss rates for the DEC 3100 (MIPS machine), with separate 64KB instruction/data caches (16K 1-word blocks or 4K 4-word blocks):

    Benchmark   Block size (words)   Instruction miss rate   Data miss rate   Combined miss rate
    gcc         1                    6.1%                    2.1%             5.4%
    gcc         4                    2.0%                    1.7%             1.9%
    spice       1                    1.2%                    1.3%             1.2%
    spice       4                    0.3%                    0.6%             0.4%

Notice that the gcc instruction miss rate went down dramatically (from 6.1% to 2.0%)! Think about instruction access patterns. Why would that be?
24
What Block Size? Large block sizes help with spatial locality, but...
- It takes time to read the memory in: larger block sizes increase the miss penalty.
- It uses up more memory bandwidth, and DRAM bandwidth is often a precious resource.
- It reduces the number of blocks (entries) in the cache: number of blocks = cache size / block size.
Need to find a middle ground: … bytes works nicely. Cache designers simulate caches to look at the tradeoffs.
25
What About Writes?
What should we do on a store (hit or miss)? It won't do to just write to the cache: the cache would then have a different (newer) value than main memory.
- Simple write-through: write both the cache and memory. Works correctly, but slowly.
- Buffered write-through: write the cache and buffer a write request to main memory. 1 to 10 buffer slots are typical.
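A hedged C sketch of buffered write-through (the buffer depth and names are illustrative assumptions; a real controller drains the buffer whenever the memory bus is free and stalls the CPU if the buffer fills):

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_SLOTS 8   /* assumption; 1 to 10 slots are typical */

    typedef struct { uint32_t addr, data; } WriteReq;

    static WriteReq write_buf[WB_SLOTS];
    static int wb_count = 0;

    /* Store: update the cache now, queue the (slow) memory write. */
    bool cache_write(uint32_t addr, uint32_t data,
                     void (*cache_update)(uint32_t, uint32_t)) {
        cache_update(addr, data);      /* cache holds the newer value */
        if (wb_count == WB_SLOTS)
            return false;              /* buffer full: CPU must stall */
        write_buf[wb_count++] = (WriteReq){ addr, data };
        return true;                   /* CPU continues; memory catches up */
    }

    /* Drain one buffered write when the memory bus is free. */
    void drain_one(void (*mem_write)(uint32_t, uint32_t)) {
        if (wb_count == 0) return;
        mem_write(write_buf[0].addr, write_buf[0].data);
        for (int i = 1; i < wb_count; i++)
            write_buf[i - 1] = write_buf[i];
        wb_count--;
    }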
26
Fully Associative Caches
[Figure: a 16-entry direct-mapped cache (indices 0–15, each with V, tag, data) beside a fully associative cache whose entries have V, tag, data and no index.]
Direct-mapped: each address has only one possible location. Address = Tag | Index | Block offset.
Fully associative: a block may go in any entry, so there is no index. Address = Tag | Block offset.
27
Comparison Fully associative caches provide much greater flexibility
Can pick the "best" block to throw out. Hmmm… how would we pick that?
Direct-mapped caches are more rigid: on each miss, only one specific block can be thrown out. What if two commonly accessed lines happen to map to the same cache block? They keep evicting each other.
Drawbacks? Fully associative caches require a complete search through all the tags to see if there's a hit; direct-mapped caches only need to look in one place.
28
Set Associative Caches – A Compromise
Divide the cache into sets (a power of 2 of them). Each set contains N blocks, called "ways"; N is the associativity of the cache.
Each block address maps to one set but can be in any one of that set's ways.
Address lookup: compare all tags in the set to see if the desired block is in the cache, typically in parallel with N comparators. Upon a miss, we get to replace the "best" of the N blocks in that set.
[Figure: a 2-way set-associative cache with sets 0–7; block X can live in either way of its set, and the tag comparisons produce hit/miss.]
Don't mix up your sets and ways! A "set" is a mathematical term referring to a container of N unordered elements.
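A minimal C model of the set lookup (the names are mine; the 2-way, 8-set shape matches the figure). The loop stands in for what hardware does in parallel with N comparators:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS 8
    #define NUM_WAYS 2

    typedef struct { bool valid; uint32_t tag; uint32_t data[4]; } Line;

    static Line cache[NUM_SETS][NUM_WAYS];

    /* Compare all tags in the set; return the hitting way, or -1 on a miss. */
    int lookup(uint32_t set, uint32_t tag) {
        for (int way = 0; way < NUM_WAYS; way++)       /* N comparators */
            if (cache[set][way].valid && cache[set][way].tag == tag)
                return way;
        return -1;  /* miss: the replacement policy picks one of the N ways */
    }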
29
Direct-Mapped (Associativity = 1) → Set-Associative (Associativity = N) → Fully Associative (Associativity = Blocks in Cache)
30
Examples

                        Direct-Mapped   2-Way           8-Way
    Size                32KB            32KB            32KB
    Block size          4 words / 16B   4 words / 16B   4 words / 16B
    Byte offset bits    2               2               2
    Word offset bits    2               2               2
    Blocks              2K              2K              2K
    Sets                N/A             1K              256
    Index bits          11              10              8
    Tag bits            17              18              20
    Address bits        32              32              32

Notice that set-associativity makes caches slightly larger: more tag bits are needed.
31
Associativity Does Not Have to be a Power-of-2
[Figure: a 3-way set-associative cache; each set (set 0 … set N) holds three ways, each with V, tag, and data.]

    Size                48KB
    Block size          4 words / 16B
    Byte offset bits    2
    Word offset bits    2
    Blocks              3K
    Sets                1K
    Index bits          10
    Tag bits            18
    Address bits        32
32
Example
128-byte cache, 4-word blocks, 10-bit addresses, 1- to 4-way associativity. Byte offset: 2 bits; block offset: 2 bits; index: 1–3 bits; tag: 3–5 bits (as associativity grows, the index shrinks and the tag grows).
[Figure: the same access trace run against a direct-mapped cache (8 entries, indices 000–111), a 2-way set-associative cache (4 sets, indices 00–11), and a 4-way set-associative cache (2 sets, indices 0–1). The flattened trace outcomes read: Miss, Miss, Miss, Miss, Miss, Miss, Miss, Hit, Hit, Miss, Miss, Miss, Miss, Miss, Hit.]
33
Performance Comparison
Miss rates for the DEC 3100 (MIPS machine), with separate 64KB instruction/data caches (4K 4-word blocks):

    Benchmark   Associativity   Instruction miss rate   Data miss rate   Combined miss rate
    gcc         Direct          2.0%                    1.7%             1.9%
    gcc         2-way           1.6%                    1.4%             1.5%
    gcc         4-way           1.6%                    1.4%             1.5%
    spice       Direct          0.3%                    0.6%             0.4%
    spice       2-way           0.3%                    0.6%             0.4%
    spice       4-way           0.3%                    0.6%             0.4%
34
Logistics – Mon 3/4
HW #5 due tomorrow. Prog #3 due Wed. HW #6 assigned.
35
Recap
Four cache parameters:
- Address bits (given)
- Size
- Block size
- Associativity (associativity = 1 → direct-mapped; associativity = number of blocks → fully associative)
From those we can derive the number of blocks and the number of sets, as in the sketch below.
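The derivation as a tiny C sketch (illustrative values drawn from the earlier table: a 32 KB, 8-way cache with 16-byte blocks, which comes out to the same 256 sets as the 8-way column above):

    #include <stdio.h>

    int main(void) {
        /* The four parameters: address bits (given), size, block size,
         * associativity. */
        unsigned size = 32 * 1024, block_size = 16, assoc = 8;

        unsigned blocks = size / block_size;  /* 2048 */
        unsigned sets   = blocks / assoc;     /* 256  */

        printf("blocks=%u sets=%u\n", blocks, sets);
        return 0;
    }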
36
Cache Access Process
1. Split the address into tag, index, and offset.
2. Use the index to read the cache set.
3. Check each way in the set for a hit: the line is valid and its tag matches.
4. Hit: use the offset to select the needed data word.
5. Miss: read the block from memory, replace a block in the cache, then use the offset to select the needed word.
37
Set Associative Caches
Each block address maps to one set and can be in any one of the ways.
Replacement policy: when we miss, any particular way could get replaced. Which one to pick? Any one will work functionally, but good choices will improve the hit rate.
[Figure: the 2-way, 8-set cache again; on a miss, block X could replace either way of its set.]
38
Block Replacement
"Hey kid, block #8080 won't be used again in this program, kick that one out!"

    Replacement policy          Performance                          Implementation
    Ideal (oracle)              Best possible!                       Requires knowledge of the future
    Least recently used (LRU)   Very good                            Complex bookkeeping beyond about 4-way
    Random                      Not as bad as you might think        No bookkeeping at all
    Approximate LRU             Close to LRU, esp. for high assoc.   Lots of simpler schemes; lots of variations
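A hedged sketch of true-LRU bookkeeping for one set (names are mine; one age counter per way). Keeping counters like these exact for every set is the bookkeeping that gets complex beyond about 4-way:

    #define NUM_WAYS 4

    /* One set's LRU state: higher age = less recently used. */
    static unsigned age[NUM_WAYS];

    /* Call on every access that hits (or fills) 'way'. */
    void touch(int way) {
        for (int w = 0; w < NUM_WAYS; w++)
            if (age[w] < age[way])
                age[w]++;          /* ways more recent than 'way' get older */
        age[way] = 0;              /* the accessed way is now the newest    */
    }

    /* Victim on a miss: the oldest (least recently used) way. */
    int lru_victim(void) {
        int victim = 0;
        for (int w = 1; w < NUM_WAYS; w++)
            if (age[w] > age[victim])
                victim = w;
        return victim;
    }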
39
Categorizing Misses
- Compulsory misses: the first time a memory location is accessed, it is always a miss. Also known as cold-start misses. The only way to decrease compulsory misses is to increase the block size.
- Capacity misses: occur when a program is using more data than can fit in the cache; some misses happen simply because the cache isn't big enough. Increasing the size of the cache solves this problem.
- Conflict misses: occur when a block forces out another block with the same index. Increasing associativity reduces conflict misses; they are worst in direct-mapped caches and non-existent in fully associative ones.
40
How big should the cache be?
What are some pros and cons of bigger caches?
- Bigger → higher hit rate → better performance.
- Bigger → slower → worse performance.
- Bigger → more chip real estate → more $s.
- Bigger → more power. Power can also be a real concern.
[Figure: CPU/registers → cache → main memory (DRAM) hierarchy.]
41
Multi-Level Caches
The difference between a cache hit (1 cycle) and a miss (100s of cycles) is huge. Introduce a series of larger but slower caches to smooth out the difference:
- L1 cache: typically 1–2 cycles
- L2 cache: typically 6 to 12 cycles
- L3 cache: typically 30 to 50 cycles
- L4 cache: … the industry's getting there…
[Figure: CPU/registers → L1 cache → L2 cache → L3 cache → main memory (DRAM).]
42
Example: Intel Core i7 4770K
Per core (core 0 … core 3): an L1 I-cache (32KB, 8-way) and an L1 D-cache (32KB, 8-way), backed by a per-core L2 cache (256KB, 8-way).
Shared by all cores: an L3 cache (8MB, 16-way), then memory.
43
What about Instructions?
[Figure: the single-cycle MIPS datapath: PC → instruction memory; register file and ALU; data memory; control signals (PCSrc, MemToReg, MemRead, MemWrite, ALUOp, ALUSrc, RegWrite, RegDest); sign-extended immediate. Note that the instruction memory and the data memory can both be accessed in the same cycle.]
44
What about instructions?
It is common to use two separate L1 caches for instructions and data: an L1 I-cache and an L1 D-cache ("Harvard-style caches"). This allows the CPU to access the I-cache at the same time it is accessing the D-cache.
[Figure: CPU → separate L1 I-cache and L1 D-cache → unified L2 cache → L3 cache → main memory (DRAM).]
45
ARM Cortex-A8 Cache Performance
L1: 32KB, 4-way set-associative, separate instruction/data caches. L2: 1MB, 8-way set-associative, unified instruction/data.
Uses MinneSPEC (SPEC CPU2000 integer, but with smaller inputs to reduce runtime).
46
Core i7 920 Cache Performance
L1: 32KB, 8-way set-associative, separate instruction/data caches. L2: 256KB, 8-way set-associative, unified instruction/data. L3: 8MB, 16-way set-associative.
Uses the full SPEC CPU2006.