EENG449b/Savvides Lec 18.1 4/13/04 — April 13, 2004, Prof. Andreas Savvides, Spring 2004, EENG 449bG/CPSC 439bG Computer Systems.


EENG 449bG/CPSC 439bG Computer Systems, Lecture 18: Memory Hierarchy Design, Part II (April 13, 2004, Prof. Andreas Savvides)

Q1: Where Can a Block Be Placed in a Cache?

Set Associativity

Direct mapped = one-way set associative. Fully associative = set associative with a single set. The most popular cache configurations in today's processors are direct mapped, 2-way set associative, and 4-way set associative.

Examples

32 KB cache for a byte-addressable processor with a 32-bit address space. Which bits of the address are used for the tag, index, and byte-within-block in the following configuration?

a) 8-byte block size, direct mapped
–8-byte block size => 3 bits for byte-within-block
–32 KB / 8 B = 4K blocks in the cache => need 12 bits to index
–32 bits – (12 + 3) bits = 17 bits remaining => need 17 bits for every tag

Address layout: tag (bits 31–15) | index (bits 14–3) | byte-within-block (bits 2–0)

Examples

32 KB cache for a byte-addressable processor with a 32-bit address space. Which bits of the address are used for the tag, index, and byte-within-block in the following configuration?

b) 4-byte block size, 8-way set associative
–4-byte block size => 2 bits for byte-within-block
–32 KB / (4 B x 8 ways) = 1K sets in the cache => need 10 bits to index
–32 bits – (10 + 2) bits = 20 bits remaining => need 20 bits for every tag

Address layout: tag (bits 31–12) | index (bits 11–2) | byte-within-block (bits 1–0)
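The two breakdowns above follow from three power-of-two calculations. A short sketch that reproduces them (`address_fields` and its parameter names are mine, not from the lecture):

```python
import math

def address_fields(cache_bytes, block_bytes, ways, addr_bits=32):
    """Return (tag, index, offset) bit widths for a byte-addressed cache."""
    offset = int(math.log2(block_bytes))         # byte-within-block bits
    sets = cache_bytes // (block_bytes * ways)   # blocks (or sets) to index
    index = int(math.log2(sets))                 # index bits
    tag = addr_bits - index - offset             # whatever remains is the tag
    return tag, index, offset

print(address_fields(32 * 1024, 8, 1))   # a) direct mapped: (17, 12, 3)
print(address_fields(32 * 1024, 4, 8))   # b) 8-way:         (20, 10, 2)
```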

Q2: How Is a Block Found If It Is in the Cache?

The address splits into three fields: the block offset selects the desired data from the block, the index selects the set, and the tag is compared against the stored tags for a hit. If the cache size remains the same, increasing associativity increases the number of blocks per set, which decreases the index size and increases the tag size.

Examples

A processor contains a 16-word, direct-mapped cache with a 4-word block size. Which of the following addresses will hit in the cache? 0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12
–4-word block size => 2 bits for word-within-block
–16 words / 4 words = 4 blocks in the cache => need 2 bits to index
–6 bits – (2 + 2) bits = 2 bits remaining => need 2 bits for every tag (6-bit word addresses suffice for this trace)

Stepping through the trace, replacing the resident block on each miss, the hits are the accesses to 1, 5, 6, the second 4, and 12.

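The trace can be replayed mechanically. A minimal sketch of the direct-mapped case, using word addresses as on the slide (`simulate_direct_mapped` and its argument names are mine):

```python
def simulate_direct_mapped(trace, num_words=16, block_words=4):
    """Replay a trace of word addresses; return the addresses that hit."""
    num_blocks = num_words // block_words   # 4 blocks of 4 words each
    tags = [None] * num_blocks              # one tag per cache block
    hits = []
    for addr in trace:
        block = addr // block_words         # block address
        index = block % num_blocks          # low bits pick the cache block
        tag = block // num_blocks           # high bits are the tag
        if tags[index] == tag:
            hits.append(addr)
        else:
            tags[index] = tag               # miss: replace the resident block
    return hits

trace = [0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12]
print(simulate_direct_mapped(trace))        # [1, 5, 6, 4, 12]
```

Note that 16 and 35 both miss even though their blocks were recently resident: each conflicts with an earlier block mapped to the same index.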

Examples

A processor contains a 16-word, 4-way set-associative cache with a 1-word block size and LRU replacement. Which of the following addresses will hit in the cache? 0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12
–1-word block size => 0 bits for word-within-block
–16 words / (4 ways x 1 word) = 4 sets in the cache => need 2 bits to index
–6 bits – 2 bits = 4 bits remaining => need 4 bits for every tag

With 1-word blocks there is no spatial locality to exploit, so only the repeated access to address 4 hits; every other access misses.

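The same trace can be replayed for the 4-way LRU configuration; a sketch under the slide's parameters (1-word blocks, 4 sets; function name mine):

```python
def simulate_4way_lru(trace, num_words=16, ways=4, block_words=1):
    """Replay word addresses through a 4-way LRU cache; return the hits."""
    num_sets = num_words // (ways * block_words)   # 4 sets
    sets = [[] for _ in range(num_sets)]           # tags, LRU first, MRU last
    hits = []
    for addr in trace:
        block = addr // block_words
        index = block % num_sets
        tag = block // num_sets
        tags = sets[index]
        if tag in tags:
            tags.remove(tag)                       # hit: move to MRU position
            tags.append(tag)
            hits.append(addr)
        else:
            if len(tags) == ways:
                tags.pop(0)                        # set full: evict LRU entry
            tags.append(tag)
    return hits

trace = [0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12]
print(simulate_4way_lru(trace))                    # [4]
```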

Address Breakdown

The physical address is 44 bits wide: a 38-bit block address and a 6-bit offset (blocks are 64 bytes, so the offset needs 6 bits). With a 9-bit index (512 sets), the tag size = 38 – 9 = 29 bits.
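The breakdown is two subtractions; a sketch with the slide's numbers (the 9-bit index, i.e. 512 sets, is read off the tag equation):

```python
addr_bits = 44
offset_bits = 6                             # 64-byte blocks
block_addr_bits = addr_bits - offset_bits   # 38 bits of block address
index_bits = 9                              # 512 sets, per the tag equation
tag_bits = block_addr_bits - index_bits
print(block_addr_bits, tag_bits)            # 38 29
```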

How to Improve Cache Performance?

Four main categories of optimizations (some covered last week, the rest today):
1. Reducing miss penalty – multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches
2. Reducing miss rate – larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimizations
3. Reducing miss penalty or miss rate via parallelism – non-blocking caches, hardware prefetching, and compiler prefetching
4. Reducing the time to hit in the cache – small and simple caches, avoiding address translation, pipelined cache access

Reducing Miss Rate

Way prediction:
–Perform the tag comparison with a single predicted block in each set
»Fewer comparisons -> simpler hardware -> faster clock

Pseudoassociative caches:
–An access proceeds as in a direct-mapped cache for a hit
–On a miss, compare against a second entry, chosen so that it can be found quickly

Reducing Miss Rate

Compiler optimizations – loop interchange and blocking:
–Exchange the nesting of loops so the code accesses the data in the order it is stored, maximizing use of the data before it is replaced

for (j = 0; j < 100; j++)
  for (i = 0; i < 5000; i++)
    x[i][j] = 2 * x[i][j];

becomes

for (i = 0; i < 5000; i++)
  for (j = 0; j < 100; j++)
    x[i][j] = 2 * x[i][j];
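Loop interchange changes only the iteration order, never the result; the cache benefit comes from walking the array in storage order and is best measured in a language with guaranteed row-major layout such as C. This Python sketch (array sizes from the slide, function names mine) just checks that the two orderings compute the same thing:

```python
ROWS, COLS = 5000, 100

def scale_column_major(x):
    for j in range(COLS):           # inner loop strides across rows:
        for i in range(ROWS):       # poor spatial locality in row-major data
            x[i][j] = 2 * x[i][j]

def scale_row_major(x):
    for i in range(ROWS):           # inner loop walks one row at a time:
        for j in range(COLS):       # unit stride, cache friendly
            x[i][j] = 2 * x[i][j]

a = [[1] * COLS for _ in range(ROWS)]
b = [[1] * COLS for _ in range(ROWS)]
scale_column_major(a)
scale_row_major(b)
print(a == b)                       # True: interchange preserves the result
```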

Reducing Miss Penalty

Methods include:

1) Multilevel caches

L2 equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Definitions:
–Local miss rate – misses in this cache divided by the total number of accesses to this cache (Miss Rate_L2)
–Global miss rate – misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
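The three equations compose mechanically. A sketch with hypothetical latencies and miss rates (the numbers below are illustrative, not from the lecture):

```python
def amat(hit_l1, miss_l1, hit_l2, local_miss_l2, penalty_l2):
    """Average memory access time, in cycles, for a two-level cache."""
    miss_penalty_l1 = hit_l2 + local_miss_l2 * penalty_l2
    return hit_l1 + miss_l1 * miss_penalty_l1

# hypothetical: 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
# 25% local L2 miss rate, 100-cycle memory penalty
value = amat(1, 0.04, 10, 0.25, 100)   # 1 + 0.04 * (10 + 0.25 * 100) = 2.4
global_miss_l2 = 0.04 * 0.25           # global L2 miss rate: 1% of all accesses
```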

Reducing Miss Penalty

2) Critical word first and early restart – don't wait for the full block to be loaded before restarting the CPU
»Early restart – as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
»Critical word first – request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in
–Most useful with large blocks
–Caveat for early restart: because of spatial locality we often want the next sequential word soon, so it is not always a benefit

Reducing Miss Penalty

3) Prioritize read misses over writes
»Write buffers can create RAW hazards with main-memory reads on cache misses
»Simply waiting for the write buffer to empty might increase the read miss penalty by roughly 50%
»Instead, check the write buffer contents before the read: if there is no conflict, let the memory access continue
»Write-back caches: a read miss may require writing back a dirty block. Normally the dirty block is written to memory before the read. Instead, copy the dirty block to the write buffer, do the read, and then do the write. The CPU stalls less since it can restart as soon as the read completes.

Reducing Miss Penalty

4) Merging write buffer
–The CPU stalls if the write buffer is full
–The buffer may already contain an entry matching the address being written; if so, the new write is merged into that entry instead of consuming a new slot
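A toy model of the mechanism (class and parameter names are mine; real buffers merge at block granularity in hardware): writes to a block already in the buffer update that entry, and the CPU stalls only when a write needs a slot and none is free.

```python
class MergingWriteBuffer:
    """Toy write buffer that merges writes to the same block."""
    def __init__(self, capacity=4, block_words=4):
        self.capacity = capacity
        self.block_words = block_words
        self.entries = {}                 # block address -> {offset: value}

    def write(self, addr, value):
        """Return True if accepted, False if the CPU would stall."""
        block, offset = divmod(addr, self.block_words)
        if block in self.entries:         # merge into the existing entry
            self.entries[block][offset] = value
            return True
        if len(self.entries) >= self.capacity:
            return False                  # buffer full: stall
        self.entries[block] = {offset: value}
        return True

buf = MergingWriteBuffer()
buf.write(100, 7)                         # new entry for block 25
buf.write(101, 8)                         # same block: merged, still one entry
print(len(buf.entries))                   # 1
```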

Reducing Miss Penalty

5) Victim caches – how to get the hit time of direct-mapped yet still avoid conflict misses?
–Add a small buffer that holds data recently discarded from the cache and check it on a miss
–Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflict misses for a 4 KB direct-mapped data cache
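A sketch of the idea (word-sized blocks and FIFO victim replacement are simplifying assumptions of mine, not Jouppi's design): blocks evicted from the direct-mapped cache go to a small fully associative buffer that is checked on a miss, so ping-ponging conflict misses can be served from the buffer instead of memory.

```python
def simulate_with_victim(trace, num_blocks=4, victim_entries=4):
    """Direct-mapped cache plus a small victim buffer; count both hit kinds."""
    tags = [None] * num_blocks
    victim = []                            # FIFO of (index, tag) pairs
    hits = victim_hits = 0
    for addr in trace:
        index, tag = addr % num_blocks, addr // num_blocks
        if tags[index] == tag:
            hits += 1                      # ordinary cache hit
            continue
        if (index, tag) in victim:         # conflict miss caught by the buffer
            victim.remove((index, tag))
            victim_hits += 1
        if tags[index] is not None:        # evicted block moves to the buffer
            victim.append((index, tags[index]))
            if len(victim) > victim_entries:
                victim.pop(0)
        tags[index] = tag
    return hits, victim_hits

# two addresses that collide in a direct-mapped cache ping-pong forever;
# the victim buffer serves every reference after the first two misses:
print(simulate_with_victim([0, 4, 0, 4, 0, 4]))   # (0, 4)
```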