
Lecture 6. Cache #2 Prof. Taeweon Suh Computer Science & Engineering Korea University ECM534 Advanced Computer Architecture

Korea Univ Performance Execution time is composed of CPU execution cycles and memory stall cycles. Assuming that a cache hit does not require any extra CPU cycle for execution (that is, the MA stage takes 1 cycle on a hit): CPU time = #insts × CPI × T = clock cycles × T = (CPU execution clock cycles + memory-stall clock cycles) × T. Memory stall cycles come from cache misses: memory-stall clock cycles = read-stall cycles + write-stall cycles. For simplicity, assume that a read miss and a write miss incur the same miss penalty, so memory-stall clock cycles = (memory accesses/program) × (miss rate) × (miss penalty).

Korea Univ Impacts of Misses on Performance Consider a processor with an I-cache miss rate of 2% (98% hit rate), a perfect D-cache (100% hit rate), a miss penalty of 100 cycles, and a base CPI of 1 with an ideal cache. What is the execution cycle count on this processor? CPU time = clock cycles × T = (CPU execution clock cycles + memory-stall clock cycles) × T, and memory-stall clock cycles = (memory accesses/program) × (miss rate) × (miss penalty). Assume that the number of instructions executed is n. CPU execution clock cycles = n. Memory-stall clock cycles for the I$: n × 0.02 × 100 = 2n. Total execution cycles = n + 2n = 3n. That is 3X slower than the ideal case, even with a 98% hit rate!

Korea Univ Impacts of Misses on Performance Consider a processor with an I-cache miss rate of 2% (98% hit rate), a D-cache miss rate of 4% (96% hit rate) where loads (lw) and stores (sw) are 36% of instructions, a miss penalty of 100 cycles, and a base CPI of 1 with an ideal cache. What is the execution cycle count on this processor? CPU time = clock cycles × T = (CPU execution clock cycles + memory-stall clock cycles) × T, and memory-stall clock cycles = (memory accesses/program) × (miss rate) × (miss penalty). Assume that the number of instructions executed is n. CPU execution clock cycles = n. Memory-stall clock cycles: I$: n × 0.02 × 100 = 2n; D$: (n × 0.36) × 0.04 × 100 = 1.44n; total memory-stall cycles = 2n + 1.44n = 3.44n. Total execution cycles = n + 3.44n = 4.44n.
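As a quick check of the arithmetic above, here is a minimal C sketch (not from the original slides) that plugs the example's parameters (base CPI of 1, a 2% I-cache miss rate, a 4% D-cache miss rate on the 36% of instructions that are loads or stores, and a 100-cycle miss penalty) into the stall-cycle formula.

#include <stdio.h>

int main(void) {
    double base_cpi     = 1.0;    /* CPI with an ideal cache                  */
    double icache_miss  = 0.02;   /* I-cache miss rate                        */
    double dcache_miss  = 0.04;   /* D-cache miss rate                        */
    double mem_frac     = 0.36;   /* fraction of instructions that are lw/sw  */
    double miss_penalty = 100.0;  /* cycles per miss                          */

    /* All quantities are per executed instruction, i.e., multiples of n. */
    double icache_stalls = icache_miss * miss_penalty;            /* 2.00n */
    double dcache_stalls = mem_frac * dcache_miss * miss_penalty; /* 1.44n */
    double total_cycles  = base_cpi + icache_stalls + dcache_stalls;

    printf("I-cache stall cycles   = %.2fn\n", icache_stalls);
    printf("D-cache stall cycles   = %.2fn\n", dcache_stalls);
    printf("Total execution cycles = %.2fn\n", total_cycles);     /* 4.44n */
    return 0;
}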

Korea Univ Average Memory Access Time (AMAT) Hit time is also important for performance. Hit time is not always less than 1 CPU cycle; for example, the hit latency of the Core 2 Duo's L1 cache is 3 cycles! To capture the fact that the time to access data on both hits and misses affects performance, designers sometimes use average memory access time (AMAT), the average time to access memory considering both hits and misses: AMAT = Hit time + Miss rate × Miss penalty. Example: a CPU with a 1ns clock (1GHz), hit time = 1 cycle, miss penalty = 20 cycles, cache miss rate = 5%. AMAT = 1 + 0.05 × 20 = 2 cycles, i.e., 2ns for each memory access.
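The AMAT number above can be reproduced with a few lines of C; this is only an illustrative sketch using the example's parameters (1GHz clock, 1-cycle hit time, 5% miss rate, 20-cycle miss penalty).

#include <stdio.h>

int main(void) {
    double clock_ns     = 1.0;   /* 1 GHz -> 1 ns per cycle */
    double hit_cycles   = 1.0;
    double miss_rate    = 0.05;
    double miss_penalty = 20.0;  /* cycles */

    /* AMAT = hit time + miss rate x miss penalty */
    double amat_cycles = hit_cycles + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles = %.1f ns per memory access\n",
           amat_cycles, amat_cycles * clock_ns);
    return 0;
}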

Korea Univ Improving Cache Performance How can we improve cache performance? Simply increase the cache size? That would surely improve the hit rate, but a larger cache also has a longer access time, and at some point the longer access time outweighs the improvement in hit rate, leading to a decrease in performance. Recalling that AMAT = Hit time + Miss rate × Miss penalty, we'll explore two different techniques. The first technique reduces the miss rate by reducing the probability that different memory blocks will contend for the same cache location. The second technique reduces the miss penalty by adding additional cache levels to the memory hierarchy.
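To make the size/hit-time tradeoff concrete, the sketch below compares AMAT for two hypothetical configurations (the hit times and miss rates are made-up illustrative numbers, not data from the slides): a small cache with a 1-cycle hit time, and a larger cache whose better miss rate is outweighed by its 3-cycle hit time.

#include <stdio.h>

/* AMAT = hit time + miss rate x miss penalty */
static double amat(double hit_cycles, double miss_rate, double penalty_cycles) {
    return hit_cycles + miss_rate * penalty_cycles;
}

int main(void) {
    double penalty = 100.0;  /* cycles to main memory (assumed) */

    double small_fast = amat(1.0, 0.05, penalty);  /* 1 + 5 = 6 cycles */
    double large_slow = amat(3.0, 0.04, penalty);  /* 3 + 4 = 7 cycles */

    printf("small, fast cache : AMAT = %.1f cycles\n", small_fast);
    printf("large, slow cache : AMAT = %.1f cycles\n", large_slow);
    return 0;
}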

Korea Univ Reducing Cache Miss Rates (#1) Alternatives for more flexible block placement: A direct mapped cache maps a memory block to exactly one location in the cache. A fully associative cache allows a memory block to be mapped to any cache block. An n-way set associative cache divides the cache into sets, each of which consists of n different locations (ways); a memory block is mapped to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices).

Korea Univ Associative Cache Example [Figure: block placement options for an 8-block cache: direct mapped (8 sets of one block), 2-way set associative (4 sets), and fully associative (only 1 set)]

Korea Univ Spectrum of Associativity [Figure: the spectrum of associativity for a cache with 8 entries, from direct mapped (1-way, 8 sets) through 2-way (4 sets) and 4-way (2 sets) to fully associative (8-way, 1 set)]

Korea Univ 4-Way Set Associative Cache A 16KB cache, 4-way set associative, with 4B (32-bit) blocks. How many sets are there? 16KB / (4 ways × 4B per block) = 1024 sets, so a 32-bit address splits into a 2-bit byte offset, a 10-bit index, and a 20-bit tag. [Figure: four ways (Way 0 to Way 3), each with valid, tag, and data arrays addressed by the index; the four tag comparisons drive a 4-to-1 mux that selects the hitting way's 32-bit data word]

Korea Univ Associative Caches Fully associative: a memory block can go in any cache entry, which requires all cache entries to be searched and a comparator per entry (expensive). n-way set associative: each set contains n entries; (block address) % (#sets in cache) determines which set to search; all n entries in the indexed set are searched, requiring n comparators (less expensive).

Korea Univ Associativity Example Assume three different small caches, each with four one-word blocks, and a CPU that generates 8-bit addresses to the cache: direct mapped, 2-way set associative, and fully associative. Find the number of misses for each cache organization given the following sequence of block addresses: 0, 8, 0, 6, 8. We assume the sequence consists entirely of memory reads (lw).

Korea Univ Associativity Example Direct mapped cache: with four blocks, the cache index is (block address) mod 4.
Byte address | Block address | Cache index | Hit or miss?
0x00         | 0x00          | 0           | miss
0x20         | 0x08          | 0           | miss
0x00         | 0x00          | 0           | miss
0x18         | 0x06          | 2           | miss
0x20         | 0x08          | 0           | miss
All five accesses miss: blocks 0 and 8 map to the same index and keep evicting each other, and the cache ends up holding Mem[8] (index 0) and Mem[6] (index 2).

Korea Univ Associativity Example (Cont) 2-way set associative cache: with four blocks in two sets of two ways, the cache index is (block address) mod 2, so blocks 0, 8, and 6 all map to set 0.
Byte address | Block address | Cache index | Hit or miss?
0x00         | 0x00          | 0           | miss
0x20         | 0x08          | 0           | miss
0x00         | 0x00          | 0           | hit
0x18         | 0x06          | 0           | miss (replaces Mem[8], the LRU block)
0x20         | 0x08          | 0           | miss
Four misses in total.

Korea Univ Associativity Example (Cont) Fully associative cache: any block can go in any of the four entries, so there is no index field and every valid entry's tag is compared.
Byte address | Block address | Hit or miss?
0x00         | 0x00          | miss
0x20         | 0x08          | miss
0x00         | 0x00          | hit
0x18         | 0x06          | miss
0x20         | 0x08          | hit
Three misses in total, the fewest of the three organizations for this sequence.
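The three results above (5, 4, and 3 misses) can be verified with a small simulation. The following C sketch is not from the slides; it models a 4-block cache with configurable associativity and LRU replacement (preferring an invalid way first, as described on the replacement-policy slide) and replays the block-address trace 0, 8, 0, 6, 8.

#include <stdio.h>

#define NBLOCKS 4  /* total cache blocks, as in the example */

/* Count misses for the trace on a cache with the given associativity:
   1 = direct mapped, 2 = 2-way set associative, 4 = fully associative. */
static int count_misses(const int *trace, int n, int assoc) {
    int nsets = NBLOCKS / assoc;
    int tag[NBLOCKS], valid[NBLOCKS], age[NBLOCKS];  /* indexed by set*assoc + way */
    int misses = 0;

    for (int i = 0; i < NBLOCKS; i++) { valid[i] = 0; age[i] = 0; tag[i] = 0; }

    for (int t = 0; t < n; t++) {
        int set = trace[t] % nsets;       /* index = block address mod #sets */
        int base = set * assoc;
        int hit = -1;

        for (int w = 0; w < assoc; w++)
            if (valid[base + w] && tag[base + w] == trace[t]) hit = w;

        if (hit < 0) {                    /* miss: pick an invalid way, else the LRU way */
            int victim = 0;
            for (int w = 0; w < assoc; w++) {
                if (!valid[base + w]) { victim = w; break; }
                if (age[base + w] > age[base + victim]) victim = w;
            }
            tag[base + victim] = trace[t];
            valid[base + victim] = 1;
            hit = victim;
            misses++;
        }

        for (int w = 0; w < assoc; w++) age[base + w]++;  /* everyone gets older   */
        age[base + hit] = 0;                              /* accessed way is MRU   */
    }
    return misses;
}

int main(void) {
    int trace[] = {0, 8, 0, 6, 8};
    printf("direct mapped     : %d misses\n", count_misses(trace, 5, 1));  /* 5 */
    printf("2-way set assoc.  : %d misses\n", count_misses(trace, 5, 2));  /* 4 */
    printf("fully associative : %d misses\n", count_misses(trace, 5, 4));  /* 3 */
    return 0;
}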

Korea Univ Benefits of Set Associative Caches The choice between direct mapped and set associative depends on the cost of a miss versus the cost of implementation. Data from Hennessy & Patterson, Computer Architecture, 2003. The largest gains are achieved when going from direct mapped to 2-way (more than a 20% reduction in miss rate).

Korea Univ Range of Set Associative Caches For a fixed-size cache, each factor-of-2 increase in associativity doubles the number of blocks per set (the number of ways) and halves the number of sets. Example: increasing associativity from 4 ways to 8 ways decreases the size of the index by 1 bit and increases the size of the tag by 1 bit. The block address is divided into a tag (used for tag comparison), an index (selects the set), and a block offset (selects the word within the block), followed by the byte offset. As associativity increases the index shrinks and the tag grows; a fully associative cache has only one set, so the tag is all the bits except the block and byte offsets, while a direct mapped cache (only one way) has the smallest tags and needs only a single comparator.
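A short C sketch (an illustration, not from the slides) of this bookkeeping: given a total cache size, block size, and associativity, it derives the number of sets and the index/offset/tag widths for 32-bit addresses, reproducing the 4-way-to-8-way example above for the 16KB, 4-byte-block cache from the earlier 4-way slide.

#include <stdio.h>

/* Print the 32-bit address breakdown for a cache of 'size' bytes
   with 'block'-byte blocks and the given associativity.           */
static void breakdown(int size, int block, int assoc) {
    int nsets = size / (block * assoc);
    int index_bits = 0, offset_bits = 0;
    while ((1 << index_bits)  < nsets) index_bits++;
    while ((1 << offset_bits) < block) offset_bits++;
    printf("%2d-way: %4d sets, index = %2d bits, byte offset = %d bits, tag = %2d bits\n",
           assoc, nsets, index_bits, offset_bits, 32 - index_bits - offset_bits);
}

int main(void) {
    /* 16KB cache with 4-byte blocks, as in the 4-way set associative example */
    for (int assoc = 1; assoc <= 8; assoc *= 2)
        breakdown(16 * 1024, 4, assoc);
    return 0;
}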

Korea Univ Replacement Policy Upon a miss, which way's block should be picked for replacement? A direct mapped cache has no choice. A set associative cache prefers a non-valid entry if there is one; otherwise it chooses among the entries in the set. Least recently used (LRU): choose the block unused for the longest time. The hardware must keep track of when each way's block was used relative to the other blocks in the set; this is simple for 2-way (it takes one bit per set), manageable for 4-way, and too hard beyond that. Random: gives approximately the same performance as LRU for high-associativity caches.
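For the 2-way case, the "one bit per set" claim is easy to see in code. The sketch below (illustrative only, not from the slides) keeps a single LRU bit per set that always names the way to evict next; touching a way on a hit or a fill marks the other way as the next victim.

#include <stdio.h>

typedef struct {
    unsigned char lru_way;   /* which of the two ways to evict next (0 or 1) */
} set2_lru_t;

/* Call whenever 'way' in this set is accessed (hit or fill). */
static void lru_touch(set2_lru_t *s, int way) {
    s->lru_way = (unsigned char)(1 - way);   /* the other way becomes least recently used */
}

/* Call on a miss to pick the victim way. */
static int lru_victim(const set2_lru_t *s) {
    return s->lru_way;
}

int main(void) {
    set2_lru_t set = {0};
    lru_touch(&set, 0);   /* access way 0 */
    lru_touch(&set, 1);   /* access way 1 */
    printf("next victim: way %d\n", lru_victim(&set));   /* way 0 is now LRU */
    return 0;
}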

Korea Univ Example Cache Structure with LRU 2-way set associative data cache. [Figure: n sets (Set 0 through Set n-1), each with two ways (Way 0 and Way 1) holding valid, dirty, tag, and data fields, plus one LRU bit per set]

Korea Univ Reducing Cache Miss Penalty (#2) Use multiple levels of caches. The primary cache (level-1 or L1) is attached to the CPU: small, but fast. The level-2 (L2) cache services misses from the primary cache: larger and slower than L1, but faster than L3. The level-3 (L3) cache services misses from the L2 cache: the largest and slowest cache, but still faster than main memory. Main memory services L3 cache misses. Advances in semiconductor technology allow enough room on the die for L1, L2, and L3 caches. L2 and L3 caches are typically unified, meaning that they hold both instructions and data. [Figure: CPU core with register file and split 32KB L1 I$ and D$, backed by a 256KB L2 cache, an 8MB L3 cache, and main memory]

Korea Univ Core i7 Example [Figure: each Core i7 core has its own register file, split 32KB L1 I$ and D$, and a private 256KB L2 cache; all cores share an 8MB L3 cache]

Korea Univ Multilevel Cache Performance Example Given: CPU base CPI = 1, clock frequency = 4GHz, miss rate = 2% per instruction, main memory access time = 100ns.
With just an L1 cache: L1 access time = 0.25ns (1 cycle); L1 miss penalty = 100ns/0.25ns = 400 cycles; CPI = 1 + 0.02 × 400 = 9 (= 98% × 1 + 2% × (1 + 400)).
Now add an L2 cache with a 5ns access time and a global miss rate to main memory of 0.5%. An L1 miss that hits in L2 has a penalty of 5ns/0.25ns = 20 cycles; an L1 miss that also misses in L2 pays an extra 400 cycles. CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4 (equivalently, 1 + 2% × (75% × 20 + 25% × (20 + 400)), since 0.5%/2% = 25% of L1 misses also miss in L2).
Performance ratio = 9/3.4 = 2.6, so adding the L2 cache in this example achieves a 2.6x speedup.
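The CPI arithmetic above is easy to script; here is a small C sketch (not part of the slides) that uses the example's numbers and prints the 9.0, 3.4, and roughly 2.6x figures.

#include <stdio.h>

int main(void) {
    double base_cpi     = 1.0;
    double clock_ns     = 0.25;    /* 4 GHz clock                               */
    double l1_miss_rate = 0.02;    /* L1 misses per instruction                 */
    double mem_ns       = 100.0;   /* main memory access time                   */
    double l2_ns        = 5.0;     /* L2 access time                            */
    double global_miss  = 0.005;   /* misses per instruction that reach memory  */

    double mem_penalty = mem_ns / clock_ns;   /* 400 cycles */
    double l2_penalty  = l2_ns  / clock_ns;   /*  20 cycles */

    double cpi_l1_only = base_cpi + l1_miss_rate * mem_penalty;     /* 9.0 */
    double cpi_with_l2 = base_cpi + l1_miss_rate * l2_penalty
                                  + global_miss  * mem_penalty;     /* 3.4 */

    printf("CPI with L1 only   : %.1f\n", cpi_l1_only);
    printf("CPI with L1 and L2 : %.1f\n", cpi_with_l2);
    printf("speedup from L2    : %.1fx\n", cpi_l1_only / cpi_with_l2);
    return 0;
}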

Korea Univ Multilevel Cache Design Considerations Design considerations for L1 and L2 caches are very different. L1 should focus on minimizing hit time in support of a shorter clock cycle: L1 is smaller (i.e., faster) but has a higher miss rate. L2 (and L3) should focus on reducing the miss rate, and thereby the L1 miss penalty, through high associativity and large cache size; for the L2 (L3) cache, hit time is less important than miss rate. The miss penalty of the L1 cache is significantly reduced by the presence of the L2 and L3 caches. AMAT = Hit time + Miss rate × Miss penalty.

Korea Univ Two Machines' Cache Parameters

Parameter                    | Intel Nehalem                                        | AMD Barcelona
L1 cache organization & size | Split I$ and D$; 32KB for each per core; 64B blocks  | Split I$ and D$; 64KB for each per core; 64B blocks
L1 associativity             | 4-way (I), 8-way (D) set assoc.; ~LRU replacement    | 2-way set assoc.; LRU replacement
L1 write policy              | write-back, write-allocate                           | write-back, write-allocate
L2 cache organization & size | Unified; 256KB (0.25MB) per core; 64B blocks         | Unified; 512KB (0.5MB) per core; 64B blocks
L2 associativity             | 8-way set assoc.; ~LRU                               | 16-way set assoc.; ~LRU
L2 write policy              | write-back                                           | write-back, write-allocate
L3 cache organization & size | Unified; 8192KB (8MB) shared by cores; 64B blocks    | Unified; 2048KB (2MB) shared by cores; 64B blocks
L3 associativity             | 16-way set assoc.                                    | 32-way set assoc.; evict block shared by fewest cores
L3 write policy              | write-back, write-allocate                           | write-back; write-allocate

Korea Univ Software Techniques Misses occur when sequentially accessed array elements come from different cache lines. To avoid these misses, exploit the cache structure in your programs to reduce the miss rate. These are code optimizations (no hardware change!) that rely on programmers or compilers. Loop interchange: in nested loops, the outer loop becomes the inner loop and vice versa. Loop blocking: partition a large array into smaller blocks so that the accessed array elements fit in the cache, enhancing the reuse of cache blocks.

Korea Univ Loop Interchange Row-major ordering: the 'before' loop strides down a column on each inner iteration, while the 'after' loop walks each row sequentially, improving cache efficiency.

/* Before */
for (j=0; j<100; j++)
    for (i=0; i<5000; i++)
        x[i][j] = 2*x[i][j];

/* After */
for (i=0; i<5000; i++)
    for (j=0; j<100; j++)
        x[i][j] = 2*x[i][j];

What is the worst that could happen? Slide from Prof. Sean Lee at Georgia Tech.

Korea Univ Cache-friendly Code In C, a 2-dimensional array is stored according to row-major ordering. Example: int A[10][8].

for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        sum += a[i][j];

[Figure: each row (A[0][0] through A[0][7], A[1][0] through A[1][7], and so on) is contiguous in memory, so a cache line (block) in the data cache holds consecutive elements of a row, which this loop nest accesses sequentially]
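Row-major layout also explains the address arithmetic the compiler generates: element a[i][j] of an array with 8 columns sits i*8 + j elements past the start. The tiny sketch below (illustrative, not from the slides) prints both addresses so you can see that they match.

#include <stdio.h>

int main(void) {
    int a[10][8];          /* same shape as the slide's int A[10][8] */
    int i = 3, j = 5;

    /* Row-major ordering: consecutive j values are adjacent in memory,
       so they tend to fall in the same cache line.                    */
    int *flat = (int *)a;
    printf("&a[%d][%d]         = %p\n", i, j, (void *)&a[i][j]);
    printf("flat + (%d*8 + %d) = %p\n", i, j, (void *)(flat + i * 8 + j));
    return 0;
}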

Korea Univ Example (Loop Interchange)

#include <stdio.h>

#define n 1024 // 1K
#define m 1024 // 1K

int main()
{
    int i, j, count;
    static long long int a[n][m];  // static: the 8MB array would overflow a default-size stack
    long long int sum;

    for (i=0; i<n; i++)
        for (j=0; j<m; j++)
            a[i][j] = i+j;

    sum = 0;

    for (count=0; count<1000; count++) {
        for (j=0; j<m; j++)        // exchange these two loops to compare
            for (i=0; i<n; i++)
                sum += a[i][j];
    }

    printf("sum is %lld, 0x%llx\n", sum, sum);

    return 0;
}

Run the program on Ubuntu. To compile the program, use gcc: % gcc loop_interchange.c -o loop_interchange. To measure the execution time, use the time utility: % time ./loop_interchange. To collect cache statistics, use the perf utility: % perf stat -e LLC-load-misses -e LLC-store-misses ./loop_interchange. Repeat the above steps after exchanging the two inner for loops (highlighted in red on the original slide).

Korea Univ Backup 29

Korea Univ Why Loop Blocking?

/* Before */
for (i=0; i<N; i++)
    for (j=0; j<N; j++)
        for (k=0; k<N; k++)
            x[i][j] += y[i][k]*z[k][j];

[Figure: access patterns of y[i][k] (one row), z[k][j] (an entire column per element of x), and x[i][j]; this order does not exploit locality!] Modified slide from Prof. Sean Lee at Georgia Tech.

Korea Univ Loop Blocking (Loop Tiling) Partition the loop's iteration space into many smaller chunks and ensure that the data stays in the cache until it is reused.

/* After */
for (jj=0; jj<N; jj=jj+B)   // B: blocking factor
    for (kk=0; kk<N; kk=kk+B)
        for (i=0; i<N; i++)
            for (j=jj; j<min(jj+B,N); j++)
                for (k=kk; k<min(kk+B,N); k++)
                    x[i][j] += y[i][k]*z[k][j];

[Figure: with blocking, a B-wide tile of z[k][j] and the matching slice of y[i][k] are reused from the cache while the corresponding tile of x[i][j] is computed] Modified slide from Prof. Sean Lee at Georgia Tech.
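For completeness, here is a self-contained sketch that wraps the 'before' and 'after' loops in a runnable program; the matrix size N and blocking factor B are illustrative choices, not values from the slides. Both versions compute the same product, so the interesting comparison is their run time under time or perf stat, as on the loop-interchange slide.

#include <stdio.h>
#include <string.h>

#define N 512   /* matrix dimension (illustrative) */
#define B 32    /* blocking factor  (illustrative) */

static double x[N][N], y[N][N], z[N][N];

static int min(int a, int b) { return a < b ? a : b; }

static void matmul_naive(void) {
    memset(x, 0, sizeof(x));
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                x[i][j] += y[i][k] * z[k][j];
}

static void matmul_blocked(void) {
    memset(x, 0, sizeof(x));
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min(jj + B, N); j++)
                    for (int k = kk; k < min(kk + B, N); k++)
                        x[i][j] += y[i][k] * z[k][j];
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { y[i][j] = i + j; z[i][j] = i - j; }

    matmul_naive();
    double check_naive = x[N - 1][N - 1];

    matmul_blocked();
    double check_blocked = x[N - 1][N - 1];

    /* Same result either way; only the cache behavior (and run time) differs. */
    printf("naive   x[N-1][N-1] = %.0f\n", check_naive);
    printf("blocked x[N-1][N-1] = %.0f\n", check_blocked);
    return 0;
}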

Korea Univ Two Machines' Cache Parameters

Parameter        | Intel P4                                | AMD Opteron
L1 organization  | Split I$ and D$                         | Split I$ and D$
L1 cache size    | 8KB for D$, 96KB for trace cache (~I$)  | 64KB for each of I$ and D$
L1 block size    | 64 bytes                                | 64 bytes
L1 associativity | 4-way set assoc.                        | 2-way set assoc.
L1 replacement   | ~LRU                                    | LRU
L1 write policy  | write-through                           | write-back
L2 organization  | Unified                                 | Unified
L2 cache size    | 512KB                                   | 1024KB (1MB)
L2 block size    | 128 bytes                               | 64 bytes
L2 associativity | 8-way set assoc.                        | 16-way set assoc.
L2 replacement   | ~LRU                                    | ~LRU
L2 write policy  | write-back                              | write-back

Korea Univ Impacts of Cache Performance Consider a processor with an I-cache miss rate of 2%, a D-cache miss rate of 4% where loads (lw) and stores (sw) are 36% of instructions, a miss penalty of 100 cycles, and a base CPI of 2 with an ideal cache. What is the execution cycle count on this processor? CPU time = CPU clock cycles × T = (CPU execution clock cycles + memory-stall clock cycles) × T, and memory-stall clock cycles = (memory accesses/program) × (miss rate) × (miss penalty). Assume that the number of instructions executed is I. CPU execution clock cycles = 2I. Memory-stall clock cycles: I$: I × 0.02 × 100 = 2I; D$: (I × 0.36) × 0.04 × 100 = 1.44I; total memory-stall cycles = 2I + 1.44I = 3.44I. Total execution cycles = 2I + 3.44I = 5.44I.

Korea Univ Impacts of Cache Performance (Cont) How much faster is a processor with a perfect cache? The execution cycle count of the CPU with a perfect cache is 2I, so the CPU with the perfect cache is 2.72x (= 5.44I/2I) faster. If you design a new computer with CPI = 1, how much faster is the system with the perfect cache? Total execution cycles = 1I + 3.44I = 4.44I, so the CPU with the perfect cache is 4.44x faster, and the fraction of execution time spent on memory stalls has risen from 63% (= 3.44/5.44) to 77% (= 3.44/4.44). Thoughts: a 2% instruction cache miss rate with a 100-cycle miss penalty increases CPI by 2! Data cache misses also increase CPI dramatically. As the CPU is made faster, the time spent on memory stalls will take up an increasing fraction of the execution time.