CS203A Graduate Computer Architecture Lecture 13: Cache Design

Cache Organization to Reduce Miss Rate
Assume the total cache size is not changed. What happens, and which of the 3Cs is obviously affected, if we:
1) Change the block size: compulsory misses.
2) Change the associativity: more subtle; it will change the mapping, so conflict misses.
3) Change the compiler: which of the 3Cs is affected?

Larger Block Size (fixed total size & associativity)
Reduced compulsory misses, but increased conflict misses (fewer blocks fit in the cache). What else drives up block size?

Associativity
(Figure: 3Cs miss-rate breakdown vs. cache size; the conflict component shrinks as associativity increases.)

3Cs Relative Miss Rate
(Figure: the same breakdown normalized to total miss rate; the conflict component is highlighted.)

Fast Hit Time + Low Conflict => Victim Cache
How to combine the fast hit time of direct mapped and still avoid conflict misses? Add a small buffer that holds data discarded from the cache.
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache. Used in Alpha and HP machines.
(Figure: four victim-cache entries, each a tag plus comparator and one cache line of data, between the cache and the next lower level in the hierarchy.)
A toy model of the lookup path follows below.
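
As a rough illustration (a sketch, not Jouppi's actual hardware), the following Python toy models the lookup path: a direct-mapped array of block tags backed by a small fully associative victim store. The class and method names are made up for the example.

```python
from collections import OrderedDict

class DirectMappedWithVictim:
    """Toy model: a direct-mapped cache of block addresses backed by a
    small fully associative victim cache of recently evicted blocks."""

    def __init__(self, num_sets, victim_entries=4):
        self.num_sets = num_sets
        self.victim_entries = victim_entries
        self.lines = [None] * num_sets      # one block address per set
        self.victim = OrderedDict()         # block address -> True (insertion order)

    def access(self, block):
        index = block % self.num_sets
        if self.lines[index] == block:
            return "hit"
        if block in self.victim:
            # Swap: promote the block from the victim cache, demote the
            # conflicting line into it.
            del self.victim[block]
            if self.lines[index] is not None:
                self._evict_to_victim(self.lines[index])
            self.lines[index] = block
            return "victim_hit"
        # Real miss: fetch from the next level; the evicted line goes
        # to the victim cache instead of being discarded.
        if self.lines[index] is not None:
            self._evict_to_victim(self.lines[index])
        self.lines[index] = block
        return "miss"

    def _evict_to_victim(self, block):
        if len(self.victim) >= self.victim_entries:
            self.victim.popitem(last=False)   # drop the oldest entry (FIFO)
        self.victim[block] = True

cache = DirectMappedWithVictim(num_sets=16)
print([cache.access(b) for b in (0, 16, 0, 16)])
# ['miss', 'miss', 'victim_hit', 'victim_hit'] -- the ping-pong conflict
# between blocks 0 and 16 is absorbed by the victim cache.
```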

Reducing Misses by Hardware Prefetching of Instructions & Data
E.g., instruction prefetching: the Alpha 21064 fetches 2 blocks on a miss (sequential prefetch). To avoid cache pollution if the prefetch goes unused, the extra block is placed in a "stream buffer" rather than the cache; on a miss, the stream buffer is checked first.
Works with data blocks too: Jouppi [1990] found 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 streams caught 43%. Palacharla & Kessler [1994] found that, for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches.
Data prediction is difficult, and prefetching relies on having extra memory bandwidth that can be used without penalty.
Question: what to prefetch, and when to prefetch? Instruction prefetch is fine, but data? A sketch of a stream buffer follows below.
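
A minimal sketch of the stream-buffer idea in Python; the depth of 4 and the integer "block addresses" are simplifying assumptions for illustration, not the 21064's parameters.

```python
from collections import deque

class StreamBuffer:
    """Toy sequential prefetcher: on a miss, prefetch the next `depth`
    block addresses; later misses check the buffer head before memory."""

    def __init__(self, depth=4):
        self.depth = depth
        self.entries = deque()
        self.next_addr = None

    def lookup(self, block):
        """Return True if the missed block is at the head of the buffer."""
        if self.entries and self.entries[0] == block:
            self.entries.popleft()
            self._refill()                    # keep the stream running
            return True
        return False

    def start_stream(self, miss_addr):
        """A miss the buffer cannot serve restarts the stream."""
        self.entries.clear()
        self.next_addr = miss_addr + 1
        self._refill()

    def _refill(self):
        while len(self.entries) < self.depth:
            self.entries.append(self.next_addr)   # model a prefetch request
            self.next_addr += 1

sb = StreamBuffer()
sb.start_stream(100)                          # miss on block 100
print(sb.lookup(101), sb.lookup(102), sb.lookup(200))
# True True False -- sequential misses are served from the buffer
```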

Leave It to the Programmer? Software Prefetching of Data
Explicit prefetch instructions come in two flavors:
Binding (register) prefetch: load data directly into a register, as with HP PA-RISC loads. Must be the correct address and register!
Non-binding (cache) prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9). Can be incorrect; these special instructions cannot cause faults, making them a form of speculative execution.
Issuing prefetch instructions takes time. Is the cost of issuing prefetches < the savings in reduced misses? Wider superscalar issue reduces the difficulty of finding issue bandwidth.

Summary: Miss Rate Reduction
3 Cs: Compulsory, Capacity, Conflict
0. Larger cache
1. Reduce misses via larger block size
2. Reduce misses via higher associativity
3. Reduce misses via victim cache
4. Reduce misses via pseudo-associativity
5. Reduce misses by HW prefetching of instructions and data
6. Reduce misses by SW prefetching of data
7. Reduce misses by compiler optimizations
Prefetching comes in two flavors: binding (requests load directly into a register; must be the correct address and register) and non-binding (load into the cache; can be incorrect, which frees HW/SW to guess).

Review: Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

1. Reducing Miss Penalty: Read Priority over Write on Miss
Write-through with write buffers risks RAW conflicts between buffered writes and main-memory reads on cache misses. Simply waiting for the write buffer to empty can increase the read miss penalty (by 50% on the old MIPS 1000). Better: check the write buffer contents before the read; if there are no conflicts, let the memory access continue.
Write-back caches want the buffer to hold displaced blocks. On a read miss replacing a dirty block, the normal approach is to write the dirty block to memory and then do the read. Instead, copy the dirty block to a write buffer, do the read, and then do the write; the CPU stalls less since it restarts as soon as the read completes.

Write Buffers
Write buffers (for write-through): L1 buffers the words to be written to the L2 cache/memory along with their addresses. Typically 2 to 4 entries deep. All read misses are checked (associatively) against pending writes for dependencies, which allows reads to proceed ahead of writes and allows coalescing of writes to the same address.
Write-back buffers: sit between a write-back cache and L2 or main memory. Algorithm: move the dirty block to the write-back buffer, read the new block, then write the dirty block to L2 or MM. Can be combined with a victim cache (later).
(Figure: CPU to L1, with a write buffer between L1 and L2.)

Merging Write Buffer to Reduce Miss Penalty
The write buffer allows the processor to continue while waiting for the write to memory. If the buffer contains modified blocks, the addresses can be checked to see whether the address of new data matches the address of a valid write-buffer entry; if so, the new data are combined with that entry. For a write-through cache issuing writes to sequential words or bytes, this increases the effective size of each buffered write, and multiword writes are more efficient to memory. The Sun T1 (Niagara) processor, among many others, uses write merging. A toy model follows below.
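
A toy model of write merging, assuming a 4-entry buffer with 4-word blocks (both numbers are illustrative, not any particular machine's):

```python
class MergingWriteBuffer:
    """Toy merging write buffer: pending writes to the same block
    coalesce into one entry instead of occupying separate slots."""

    def __init__(self, num_entries=4, words_per_block=4):
        self.num_entries = num_entries
        self.words_per_block = words_per_block
        self.entries = {}                     # block address -> {offset: data}

    def write(self, word_addr, data):
        """Buffer one word; return False if the CPU would have to stall."""
        block, offset = divmod(word_addr, self.words_per_block)
        if block in self.entries:             # merge with the pending entry
            self.entries[block][offset] = data
            return True
        if len(self.entries) >= self.num_entries:
            return False                      # full: no free entry to allocate
        self.entries[block] = {offset: data}
        return True

buf = MergingWriteBuffer()
for addr in range(8):                         # 8 sequential word writes
    assert buf.write(addr, addr * 10)
print(len(buf.entries))                       # 2 merged entries, not 8
```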

Write Merge

2. Reduce Miss Penalty: Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU.
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first.
Generally useful only with large blocks. Spatial locality means the CPU tends to want the next sequential word anyway, so it is not clear how much early restart helps. A back-of-envelope comparison follows below.
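
A hedged back-of-envelope comparison; the timing parameters below are assumptions chosen for illustration, not figures from the lecture.

```python
# Hypothetical timing: an 8-word block, 30 cycles to the first word,
# 4 cycles per additional word (all assumed numbers).
words_per_block = 8
first_word_latency = 30
per_word_latency = 4

full_block = first_word_latency + (words_per_block - 1) * per_word_latency
critical_word_first = first_word_latency      # CPU restarts on word arrival

print(full_block, critical_word_first)        # 58 vs 30 cycles to restart
# The gap grows with block size, which is why the technique matters
# mainly for large blocks.
```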

3. Reduce Miss Penalty: Non-blocking Caches to Reduce Stalls on Misses
A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss; it requires full/empty bits on registers or out-of-order execution, and multi-bank memories.
"Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests.
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses. This significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses, and it requires multiple memory banks (otherwise the overlap cannot be supported). The Pentium Pro allows 4 outstanding memory misses.

Lock-Up Free Cache Using MSHR (Miss Status Holding Register)
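
The slide presents the MSHR structure as a figure. As a sketch of the bookkeeping it implies (class and field names are ours, not a standard API): a miss either merges into an existing MSHR, allocates a new one, or stalls when all are busy.

```python
from dataclasses import dataclass, field

@dataclass
class MSHR:
    """One Miss Status Holding Register: records an outstanding miss so
    the cache can keep serving hits while the block is being fetched."""
    block: int
    targets: list = field(default_factory=list)   # (dest_register, offset)

class LockupFreeCache:
    """Sketch of the miss bookkeeping only."""

    def __init__(self, num_mshrs=4):
        self.num_mshrs = num_mshrs
        self.mshrs = {}                  # block address -> MSHR

    def record_miss(self, block, dest_reg, offset):
        """Return True if the miss proceeds without stalling the CPU."""
        if block in self.mshrs:          # secondary miss: merge the target
            self.mshrs[block].targets.append((dest_reg, offset))
            return True
        if len(self.mshrs) >= self.num_mshrs:
            return False                 # structural stall: all MSHRs busy
        self.mshrs[block] = MSHR(block, [(dest_reg, offset)])
        return True

    def fill(self, block):
        """When the block returns from memory, satisfy all waiting targets."""
        return self.mshrs.pop(block).targets
```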

Value of Hit Under Miss for SPEC
(Figure: "hit under n misses" relative to the blocking base as n grows 0->1, 1->2, 2->64, for integer and floating-point programs.)
FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
Integer programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
8 KB data cache, direct mapped, 32B blocks, 16-cycle miss penalty, SPEC 92.

4. Add a Second-Level Cache: L2 Equations
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
Definitions:
Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (this is Miss Rate_L2 above).
Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2 for the L2).
The global miss rate is what matters. The formulas are restated as code below.
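
The same equations in executable form (a direct transcription; the function names are chosen for clarity):

```python
def amat_two_level(hit_l1, mr_l1, hit_l2, local_mr_l2, penalty_l2):
    """AMAT for a two-level hierarchy; local_mr_l2 is the LOCAL L2
    miss rate (L2 misses / L2 accesses)."""
    miss_penalty_l1 = hit_l2 + local_mr_l2 * penalty_l2
    return hit_l1 + mr_l1 * miss_penalty_l1

def global_mr_l2(mr_l1, local_mr_l2):
    """Fraction of all CPU memory accesses that miss in both levels."""
    return mr_l1 * local_mr_l2

print(global_mr_l2(0.04, 0.5))   # e.g. 4% miss in L1, half of those miss in L2
```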

Comparing Local and Global Miss Rates
(Figure: miss rate vs. log cache size, for a 32 KByte first-level cache and an increasingly large second-level cache.)
The global miss rate is close to the single-level-cache miss rate, provided L2 >> L1; don't use the local miss rate to evaluate an L2.
L2 is not tied to the CPU clock cycle! Consider cost and A.M.A.T.
Generally: target fast hit times in L1; since hits in L2 are few, target miss reduction there.

Example
For every 1000 instructions: 40 misses in L1 and 20 misses in L2. The hit time is 1 cycle in L1 and 10 cycles in L2; the miss penalty from L2 to memory is 100 cycles; and there are 1.5 memory references per instruction. What are the AMAT and the average stall cycles per instruction?
AMAT = 1 + 40/1000 x (10 + 20/40 x 100) = 3.4 cycles
AMAT without L2 = 1 + 40/1000 x 100 = 5 cycles => an improvement of 1.6 cycles due to L2
Average stall cycles per instruction = 1.5 x 40/1000 x 10 + 1.5 x 20/1000 x 100 = 3.6 cycles
Note: we have not distinguished reads and writes, and L2 is accessed only on an L1 miss, i.e., a write-back L1. The arithmetic is verified below.
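
The slide's arithmetic, reproduced and checked in Python; note that 40/1000 and 20/1000 are misses per instruction (not per access), which is how the slide uses them:

```python
l1_miss, l2_miss = 40 / 1000, 20 / 1000
local_mr_l2 = l2_miss / l1_miss                        # 20/40 = 0.5

amat = 1 + l1_miss * (10 + local_mr_l2 * 100)          # 3.4 cycles
amat_no_l2 = 1 + l1_miss * 100                         # 5.0 cycles
stalls = 1.5 * (l1_miss * 10 + l2_miss * 100)          # 3.6 cycles/instr

# Matches the general formula from the previous slide:
assert amat == amat_two_level(1, l1_miss, 10, local_mr_l2, 100)
print(amat, amat_no_l2, stalls)
```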

L2 Cache Block Size & A.M.A.T.
(Figure: relative execution time vs. L2 block size; 32KB L1, 8-byte path to memory.)

Reducing Miss Penalty Summary
Four techniques:
1. Read priority over write on miss
2. Early restart and critical word first on miss
3. Non-blocking caches (hit under miss, miss under miss)
4. Second-level cache
These can be applied recursively in multilevel caches. The danger is that the time to DRAM will grow with multiple levels in between; first attempts at L2 caches can make things worse, since the increased worst case is worse.

Cache Optimization Summary
(MR = miss rate, MP = miss penalty, HT = hit time)

Technique                          MR   MP   HT   Complexity
Larger block size                  +    –         0
Higher associativity               +         –    1
Victim caches                      +              2
Pseudo-associative caches          +              2
HW prefetching of instr/data       +              2
Compiler-controlled prefetching    +              3
Compiler reduce misses             +              0
Priority to read misses                 +         1
Early restart & critical word 1st       +         2
Non-blocking caches                     +         3
Second-level caches                     +         2

How to Improve Cache Performance? 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

1. Small and Simple Caches
Small on-chip L1 caches have lower access time. A direct-mapped cache is faster because no hardware comparison among the blocks of a set is needed, but it has a higher miss ratio. A common compromise: a direct-mapped L1 cache and a set-associative L2 cache. How to predict cache access time at the design stage? Use the CACTI program.

Impact of a Change in cc
Suppose a processor has the following parameters: CPI = 2 (without memory stalls) and 1.5 memory accesses per instruction. Compare AMAT and CPU time for a direct-mapped cache and a 2-way set-associative cache, assuming:

Cache        cc              Hit cycles   Miss penalty   Miss rate
Direct map   1 ns            1            75 ns          1.4%
2-way        1.25 ns (why?)  1            75 ns          1.0%

AMAT_d = hit time + miss rate x miss penalty = 1 x 1 + 0.014 x 75 = 2.05 ns
AMAT_2 = 1 x 1.25 + 0.01 x 75 = 2 ns < 2.05 ns
Execution times:
CPU_d = (CPI x cc + mem stall time) x IC = (2 x 1 + 1.5 x 0.014 x 75) x IC = 3.575 x IC
CPU_2 = (2 x 1.25 + 1.5 x 0.01 x 75) x IC = 3.625 x IC > CPU_d !
A change in cc affects all instructions, while a reduction in miss rate benefits only memory instructions. The same comparison is coded below.
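
The same comparison in Python; cpu_time_per_instr is just the slide's formula, with the miss penalty expressed directly in nanoseconds:

```python
def cpu_time_per_instr(cpi_base, cc_ns, refs_per_instr, miss_rate, penalty_ns):
    """ns per instruction = base CPI x clock cycle + memory stall time."""
    return cpi_base * cc_ns + refs_per_instr * miss_rate * penalty_ns

direct = cpu_time_per_instr(2, 1.00, 1.5, 0.014, 75)   # 3.575 ns/instr
two_way = cpu_time_per_instr(2, 1.25, 1.5, 0.010, 75)  # 3.625 ns/instr
print(direct, two_way)   # the slower clock outweighs the lower miss rate
```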

Miss Penalty for an Out-of-Order (OOO) Execution Processor
In OOO processors, memory stall cycles are overlapped with the execution of other instructions, so the miss penalty should not include the overlapped part:
mem stall cycles per instruction = mem misses per instruction x (total miss penalty - overlapped miss penalty)
For the previous example: suppose 30% of the 75 ns miss penalty can be overlapped. What are the AMAT and CPU time, assuming the direct-mapped cache with cc = 1.25 ns to handle out-of-order execution?
AMAT_d = 1 x 1.25 + 0.014 x (75 x 0.7) = 1.985 ns
With 1.5 memory accesses per instruction, CPU time = (2 x 1.25 + 1.5 x 0.014 x (75 x 0.7)) x IC = 3.6025 x IC < CPU_2
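
Continuing the snippet above, reusing cpu_time_per_instr with only the non-overlapped 70% of the penalty exposed:

```python
exposed = 75 * (1 - 0.30)                                   # 52.5 ns
amat_ooo = 1 * 1.25 + 0.014 * exposed                       # 1.985 ns
cpu_ooo = cpu_time_per_instr(2, 1.25, 1.5, 0.014, exposed)  # 3.6025 ns/instr
print(amat_ooo, cpu_ooo)    # beats the 2-way cache's 3.625 ns/instr
```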

Avg. Memory Access Time vs. Miss Rate
Associativity reduces miss rate but increases hit time, due to the increase in hardware complexity! Example: for an on-chip cache, assume the clock cycle time (CCT) is 1.10x for 2-way, 1.12x for 4-way, and 1.14x for 8-way, relative to the direct-mapped CCT.

Cache size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

(Red in the original marked cells where A.M.A.T. is not improved by more associativity.)

Fast Hit Times via Small and Simple Caches
Indexing the tag memory and then comparing takes time. A small cache helps hit time because a smaller memory takes less time to index; e.g., the L1 caches are the same size across 3 generations of AMD microprocessors: K6, Athlon, and Opteron. Keeping the L2 cache small enough to fit on chip with the processor also avoids the time penalty of going off chip.
Simple means direct mapped: the tag check can be overlapped with data transmission, since there is no way selection.
Access time estimates for 90 nm using the CACTI 4.0 model: the median ratios of access time relative to direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches.

2. Increasing Cache Bandwidth via Multiple Banks
Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses. E.g., the T1 ("Niagara") L2 has 4 banks.
Banking works best when the accesses naturally spread themselves across the banks, so the mapping of addresses to banks affects the behavior of the memory system. A simple mapping that works well is "sequential interleaving": spread block addresses sequentially across the banks. E.g., with 4 banks, bank 0 holds all blocks whose address modulo 4 is 0, bank 1 holds all blocks whose address modulo 4 is 1, and so on, as in the snippet below.
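
Sequential interleaving is just a modulo on the block address:

```python
def bank_of(block_addr, num_banks=4):
    """Sequential interleaving: consecutive block addresses map to
    consecutive banks."""
    return block_addr % num_banks

print([bank_of(b) for b in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]
```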

Advanced Cache Optimizations Summary
(+ means the technique improves the factor, – means it hurts it.)

Technique                                  Hit time  Bandwidth  Miss penalty  Miss rate  HW cost/complexity  Comment
Small and simple caches                    +                                  –          0                   Trivial; widely used
Way-predicting caches                      +                                             1                   Used in Pentium 4
Trace caches                               +                                             3                   Used in Pentium 4
Pipelined cache access                     –         +                                   1                   Widely used
Nonblocking caches                                   +          +                        3                   Widely used
Banked caches                                        +                                   1                   Used in L2 of Opteron and Niagara
Critical word first and early restart                           +                        2                   Widely used
Merging write buffer                                            +                        1                   Widely used with write through
Compiler techniques to reduce cache misses                                    +          0                   Software is a challenge; some computers have compiler option
Hardware prefetching of instr. and data                         +             +          2 instr., 3 data    Many prefetch instructions; AMD Opteron prefetches data
Compiler-controlled prefetching                                 +             +          3                   Needs non-blocking cache; in many CPUs