Cache Performance 1 Computer Organization II © CS:APP & McQuain

Cache Memory and Performance

Many of the following slides are taken with permission from Complete PowerPoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP), Randal E. Bryant and David R. O'Hallaron. The book is used explicitly in CS 2505 and CS 3214 and as a reference in CS 2506.

Cache Performance 2 Computer Organization II © CS:APP & McQuain

Cache Memories

Cache memories are small, fast SRAM-based memories managed automatically in hardware.
- Hold frequently accessed blocks of main memory

The CPU looks first for data in the caches (e.g., L1, L2, and L3), then in main memory.

Typical system structure:
[Diagram: CPU chip containing the register file, ALU, cache memories, and bus interface, connected via the system bus and I/O bridge to the memory bus and main memory]

Cache Performance 3 Computer Organization II © CS:APP & McQuain

General Cache Organization (S, E, B)

S = 2^s sets
E = 2^e lines (blocks) per set
B = 2^b bytes per cache block (the data)

Each line holds a valid bit, a tag, and a block of B data bytes (byte offsets 0 to B-1).

Cache size: C = S x E x B data bytes

Cache Performance 4 Computer Organization II © CS:APP & McQuain

Cache Organization Types

The "geometry" of the cache is defined by:
S = 2^s  the number of sets in the cache
E = 2^e  the number of lines (blocks) in a set
B = 2^b  the number of bytes in a line (block)

E = 1 (e = 0): direct-mapped cache
- only one possible location in the cache for each DRAM block

S > 1, E = K > 1: K-way set-associative cache
- K possible locations (in the same cache set) for each DRAM block

S = 1 (only one set): fully-associative cache
- E = # of cache blocks; each DRAM block can be at any location in the cache
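As a minimal sketch (not part of the slides), the s and b parameters determine how a memory address is split for a cache lookup: the low b bits give the byte offset within a block, the next s bits select the set, and the remaining high bits form the tag compared against the E lines in that set. The geometry constants below are illustrative; with s = 6, e = 3, b = 6 they give C = 64 x 8 x 64 = 32 KB.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Illustrative geometry: S = 64 sets, E = 8 lines/set, B = 64 bytes,
     * so C = S * E * B = 32 KB of data. */
    const unsigned s = 6, b = 6;

    uint64_t addr = 0x7ffe3a54bce8;   /* an arbitrary example address */

    uint64_t offset = addr & ((1ULL << b) - 1);        /* low b bits  */
    uint64_t set    = (addr >> b) & ((1ULL << s) - 1); /* next s bits */
    uint64_t tag    = addr >> (b + s);                 /* the rest    */

    printf("tag=%#llx set=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)set,
           (unsigned long long)offset);
    return 0;
}
```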

Cache Performance 5 Computer Organization II © CS:APP & McQuain

Cache Performance Metrics

miss rate: fraction of memory references not found in cache (# misses / # accesses) = 1 - hit rate
Typical miss rates:
- 3-10% for L1
- can be quite small (e.g., < 1%) for L2, depending on cache size and locality

hit time: time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
Typical times:
- 1-2 clock cycles for L1
- 5-20 clock cycles for L2

miss penalty: additional time required for data access because of a cache miss
- typically 50-200 cycles for main memory
Trend is for increasing # of cycles… why?
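These three metrics combine into the standard average memory access time (AMAT) relation, which the next two slides evaluate for a concrete two-level hierarchy:

```latex
\[
\mathrm{AMAT} \;=\; t_{\mathrm{hit}} \;+\; m \cdot t_{\mathrm{penalty}},
\qquad
\mathrm{AMAT}_{L1/L2} \;=\; t_{L1} \;+\; m_{L1}\bigl(t_{L2} + m_{L2}\, t_{\mathrm{DRAM}}\bigr)
\]
```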

Cache Performance 6 Computer Organization II © CS:APP & McQuain

Calculating Average Access Time

Let's say that we have two levels of cache, backed by DRAM:
- L1 cache costs 1 cycle to access and has miss rate of 10%
- L2 cache costs 10 cycles to access and has miss rate of 2%
- DRAM costs 80 cycles to access (and has miss rate of 0%)

Then the average memory access time (AMAT) would be:
  1                      always access L1 cache
+ 0.10 * 10              probability miss in L1 cache * time to access L2
+ 0.10 * 0.02 * 80       probability miss in L1 cache * probability miss in L2 cache * time to access DRAM
= 2.16 cycles

Cache Performance 7 Computer Organization II © CS:APP & McQuain

Calculating Average Access Time

Alternatively, we could say:
  0.90 * 1                      probability hit in L1 cache * time to access L1
+ 0.10 * 0.98 * (10 + 1)        probability miss in L1 cache * probability hit in L2 cache * time to access L1 then L2
+ 0.10 * 0.02 * (80 + 10 + 1)   probability miss in L1 cache * probability miss in L2 cache * time to access L1 then L2 then DRAM
= 2.16 cycles
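A runnable sketch (mine, not from the slides) checking that the two formulations agree on these numbers:

```c
#include <assert.h>
#include <math.h>
#include <stdio.h>

/* Reproduces the two AMAT calculations above:
 * L1: 1 cycle, 10% miss rate; L2: 10 cycles, 2% miss rate; DRAM: 80 cycles. */
int main(void) {
    double l1 = 1.0, l2 = 10.0, dram = 80.0;
    double m1 = 0.10, m2 = 0.02;

    /* Miss-path form: always pay L1; add L2 on an L1 miss; add DRAM on both misses. */
    double amat_miss = l1 + m1 * l2 + m1 * m2 * dram;

    /* Hit-path form: weight each complete access path by its probability. */
    double amat_hit = (1 - m1) * l1
                    + m1 * (1 - m2) * (l2 + l1)
                    + m1 * m2 * (dram + l2 + l1);

    assert(fabs(amat_miss - amat_hit) < 1e-9);
    printf("AMAT = %.2f cycles\n", amat_miss);   /* prints 2.16 */
    return 0;
}
```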

Cache Performance 8 Computer Organization II © CS:APP & McQuain

Let's think about those numbers

There can be a huge difference between the cost of a hit and a miss: could be 100x, if just L1 and main memory.

Would you believe 99% hits is twice as good as 97%?
Consider:
- L1 cache hit time of 1 cycle
- L1 miss penalty of 100 cycles (to DRAM)

Average access time:
97% L1 hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
99% L1 hits: 1 cycle + 0.01 * 100 cycles = 2 cycles

Cache Performance 9 Computer Organization II © CS:APP & McQuain

Measuring Cache Performance

Components of CPU time:
- Program execution cycles (includes cache hit time)
- Memory stall cycles (mainly from cache misses)

With simplifying assumptions:
Memory stall cycles = (Memory accesses / Program) * Miss rate * Miss penalty
                    = (Instructions / Program) * (Misses / Instruction) * Miss penalty

Cache Performance 10 Computer Organization II © CS:APP & McQuain

Cache Performance Example

Given:
- Instruction-cache miss rate = 2%
- Data-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (with ideal cache performance) = 2
- Loads & stores are 36% of instructions

Miss cycles per instruction:
- Instruction-cache: 0.02 × 100 = 2
- Data-cache: 0.36 × 0.04 × 100 = 1.44

Actual CPI = 2 + 2 + 1.44 = 5.44
- Ideal CPU is 5.44/2 = 2.72 times faster
- We spend 3.44/5.44 = 63% of our execution time on memory stalls!
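A small sketch of this arithmetic (the parameter names are mine); calling it with a base CPI of 1.5 also reproduces the next slide's variant:

```c
#include <stdio.h>

/* Stall cycles per instruction: every instruction is fetched (I-cache),
 * but only loads/stores (36% of instructions) touch the D-cache. */
static void report(double base_cpi) {
    double icache_miss = 0.02, dcache_miss = 0.04, penalty = 100.0;
    double mem_ops = 0.36;                 /* loads+stores per instruction */

    double stalls = icache_miss * penalty             /* 2.00 */
                  + mem_ops * dcache_miss * penalty;  /* 1.44 */
    double cpi = base_cpi + stalls;

    printf("base CPI %.2f -> actual CPI %.2f, ideal speedup %.2fx, "
           "%.0f%% of time stalled\n",
           base_cpi, cpi, cpi / base_cpi, 100.0 * stalls / cpi);
}

int main(void) {
    report(2.0);   /* this slide: CPI 5.44, 2.72x, 63% */
    report(1.5);   /* improved datapath (next slide): CPI 4.94, 3.29x, 70% */
    return 0;
}
```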

Cache Performance 11 Computer Organization II © CS:APP & McQuain

Reduce Ideal CPI

What if we improved the datapath so that the average ideal CPI was reduced?
- Instruction-cache miss rate = 2%
- Data-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (with ideal cache performance) = 1.5
- Loads & stores are 36% of instructions

Miss cycles per instruction will still be the same as before.

Actual CPI = 1.5 + 2 + 1.44 = 4.94
- Ideal CPU is 4.94/1.5 = 3.29 times faster
- We spend 3.44/4.94 = 70% of our execution time on memory stalls!

Cache Performance 12 Computer Organization II © CS:APP & McQuain

Performance Summary

- When CPU performance increases, the effect of the miss penalty becomes more significant
- Decreasing base CPI: a greater proportion of time is spent on memory stalls
- Increasing clock rate: memory stalls account for more CPU cycles

Can't neglect cache behavior when evaluating system performance.

Cache Performance 13 Computer Organization II © CS:APP & McQuain

Multilevel Cache Considerations

Primary (L1) cache:
- Focus on minimal hit time

L2 cache:
- Focus on low miss rate to avoid main memory access
- Hit time has less overall impact

Results:
- L1 cache usually smaller than a single cache
- L1 block size smaller than L2 block size

Cache Performance 14 Computer Organization II © CS:APP & McQuain

Intel Core i7 Cache Hierarchy

[Diagram: processor package with four cores; each core has its own registers, L1 d-cache, L1 i-cache, and L2 unified cache; all cores share the L3 unified cache, which connects to main memory]

- L1 i-cache and d-cache: 32 KB, 8-way, Access: 4 cycles
- L2 unified cache: 256 KB, 8-way, Access: 11 cycles
- L3 unified cache: 8 MB, 16-way, Access: 30-40 cycles
- Block size: 64 bytes for all caches

Cache Performance 15 Computer Organization II © CS:APP & McQuain

i7 Sandy Bridge

[Figure: i7 Sandy Bridge]

Cache Performance 16 Computer Organization II © CS:APP & McQuain

What about writes?

Multiple copies of data may exist:
- L1
- L2
- DRAM
- Disk

Remember: each level of the hierarchy is a subset of the one below it.

Suppose we write to a data block that's in L1.
- If we update only the copy in L1, then we will have multiple, inconsistent versions!
- If we update all the copies, we'll incur a substantial time penalty!

And what if we write to a data block that's not in L1?

Cache Performance 17 Computer Organization II © CS:APP & McQuain

What about writes?

What to do on a write-hit?
- Write-through: write immediately to memory
- Write-back: defer the write to memory until replacement of the line
  (needs a dirty bit to record whether the line differs from memory)

Cache Performance 18 Computer Organization II © CS:APP & McQuain

What about writes?

What to do on a write-miss?
- Write-allocate: load the block into the cache, then update the line in the cache
  (good if more writes to the location follow)
- No-write-allocate: write immediately to memory, bypassing the cache

Typical combinations:
- Write-through + No-write-allocate
- Write-back + Write-allocate
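A minimal runnable sketch (hypothetical structures, not from the slides) of the write-back + write-allocate combination for a single cache line, showing the dirty bit deferring memory traffic until eviction; the small `memory` array stands in for DRAM:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 8
#define NUM_BLOCKS 4

static unsigned char memory[NUM_BLOCKS][BLOCK_SIZE];  /* stand-in for DRAM */

typedef struct {
    bool valid, dirty;          /* dirty: line differs from memory */
    unsigned tag;
    unsigned char data[BLOCK_SIZE];
} Line;

static void store_byte(Line *ln, unsigned tag, unsigned off, unsigned char v) {
    if (!(ln->valid && ln->tag == tag)) {              /* write miss */
        if (ln->valid && ln->dirty)                    /* flush evicted line */
            memcpy(memory[ln->tag], ln->data, BLOCK_SIZE);
        memcpy(ln->data, memory[tag], BLOCK_SIZE);     /* write-allocate */
        ln->valid = true; ln->tag = tag; ln->dirty = false;
    }
    ln->data[off] = v;          /* update only the cached copy...          */
    ln->dirty = true;           /* ...and defer the memory write (write-back) */
}

int main(void) {
    Line ln = {0};
    store_byte(&ln, 1, 3, 0xAB);   /* miss: allocate block 1, then write  */
    store_byte(&ln, 1, 4, 0xCD);   /* hit: no memory traffic at all       */
    store_byte(&ln, 2, 0, 0xEF);   /* miss: dirty block 1 is written back */
    printf("memory[1][3] = %#x (written back on eviction)\n", memory[1][3]);
    return 0;
}
```

Write-through + no-write-allocate would instead replace both memcpy calls and the dirty bit with a direct store to `memory` on every write, which is simpler but generates memory traffic on each store.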