Cache Performance Improvements


Cache Performance Improvements
Average memory-access time = Hit time + Miss rate x Miss penalty
Cache optimizations:
- Reducing the miss rate
- Reducing the miss penalty
- Reducing the hit time

Cache Performance Equations
CPU time = (CPU execution cycles + Mem stall cycles) * Cycle time
Mem stall cycles = Mem accesses * Miss rate * Miss penalty
CPU time = IC * (CPI_execution + Mem accesses per instr * Miss rate * Miss penalty) * Cycle time
Misses per instr = Mem accesses per instr * Miss rate
CPU time = IC * (CPI_execution + Misses per instr * Miss penalty) * Cycle time
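
A minimal C sketch that plugs numbers into these equations and into the AMAT formula above; every parameter value is an assumed, made-up figure for illustration, not data from the slides:

#include <stdio.h>

int main(void) {
    /* Assumed example parameters -- illustrative only. */
    double ic                 = 1e9;  /* instruction count */
    double cpi_execution      = 1.0;  /* base CPI without memory stalls */
    double accesses_per_instr = 1.3;  /* 1 ifetch + 0.3 data refs */
    double hit_time           = 1.0;  /* cycles */
    double miss_rate          = 0.02;
    double miss_penalty       = 50.0; /* cycles */
    double cycle_time         = 1e-9; /* 1 ns clock */

    double amat = hit_time + miss_rate * miss_penalty;
    double misses_per_instr = accesses_per_instr * miss_rate;
    double cpi = cpi_execution + misses_per_instr * miss_penalty;
    double cpu_time = ic * cpi * cycle_time;

    printf("AMAT = %.2f cycles, CPI = %.2f, CPU time = %.3f s\n",
           amat, cpi, cpu_time);
    return 0;
}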

Types of Cache Misses
Compulsory
- First-reference or cold-start misses: the very first access to a block cannot be in the cache
Capacity
- The working set is too big for the cache
- Occur even in a fully associative cache
Conflict (collision)
- Many blocks map to the same block frame (line)
- Affects set-associative and direct-mapped caches

3Cs Absolute Miss Rates
[Figure: absolute miss rates vs. cache size, broken down into compulsory, capacity, and conflict components; the conflict component illustrates the 2:1 cache rule. Chart not reproduced in this transcript.]

Reducing the 3Cs Cache Misses
1. Larger block size
2. Higher associativity
3. Victim caches
4. Pseudo-associative caches
5. Hardware prefetching
6. Compiler-controlled prefetching
7. Compiler optimizations
Execution time is the only final measure.

1. Larger Block Size
Effects of larger block sizes:
- Reduction of compulsory misses, by exploiting spatial locality
- Increase of the miss penalty (longer transfer time)
- Fewer blocks in the cache, so a potential increase of conflict misses
Latency and bandwidth of the lower-level memory determine the best block size:
- High latency and high bandwidth favor large blocks, since a larger block adds only a small increase in miss penalty

Example
[Worked example comparing miss rates and AMAT across block sizes; the figures did not survive the transcript. A sketch of the calculation follows.]
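
Since the slide's numbers are lost, the following C sketch only illustrates the shape of such a calculation; the memory model (40 cycles of overhead plus 16 bytes transferred every 2 cycles) and the per-block-size miss rates are assumed values:

#include <stdio.h>

int main(void) {
    int    block_bytes[] = { 16, 32, 64, 128, 256 };
    /* Assumed miss rates for a fixed-size cache: falling at first
       (spatial locality), then flattening as conflicts grow. */
    double miss_rate[]   = { 0.035, 0.027, 0.022, 0.020, 0.021 };
    double hit_time = 1.0; /* cycles */

    for (int i = 0; i < 5; i++) {
        /* 40-cycle overhead + 2 cycles per 16 bytes transferred. */
        double miss_penalty = 40.0 + (block_bytes[i] / 16.0) * 2.0;
        double amat = hit_time + miss_rate[i] * miss_penalty;
        printf("block %3dB: penalty %5.1f cycles, AMAT %.3f cycles\n",
               block_bytes[i], miss_penalty, amat);
    }
    return 0;
}

With these made-up numbers the AMAT minimum falls at 64-byte blocks: the miss-rate reduction from spatial locality eventually loses to the growing transfer time.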

2. Higher Associativity
- Eight-way set associative is, for practical purposes, as effective as fully associative
- 2:1 Cache Rule: the miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2
- Higher associativity can increase the clock cycle time
  - Hit time for 2-way vs. 1-way: external cache +10%, internal +2%
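
A small worked comparison in the same spirit; the 2.0% and 1.6% miss rates are assumed values, while the +10% clock-cycle stretch is the slide's external-cache figure:

#include <stdio.h>

int main(void) {
    double cycle_1way = 1.0;   /* ns */
    double cycle_2way = 1.10;  /* ns, +10% per the slide */
    double penalty_ns = 50.0;  /* fixed in time, independent of the clock */

    double amat_1way = 1.0 * cycle_1way + 0.020 * penalty_ns;
    double amat_2way = 1.0 * cycle_2way + 0.016 * penalty_ns;

    printf("AMAT 1-way: %.2f ns, 2-way: %.2f ns\n", amat_1way, amat_2way);
    /* Which organization wins depends entirely on the numbers:
       the associativity gain must outweigh the slower clock. */
    return 0;
}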

3. Victim Caches
- A small, fully associative cache that holds blocks recently evicted from the main cache
- Reduces conflict misses without impairing the clock rate
[Diagram: CPU with address and data-in/data-out paths, tag compare (=?) on the main cache, the victim cache with its own tag compare between the main cache and the lower-level memory, plus a write buffer.]
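
A minimal C sketch of the lookup-and-swap logic, assuming a 256-set direct-mapped main cache and a 4-entry victim cache with round-robin replacement (all sizes and names are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define SETS        256 /* assumed direct-mapped main cache */
#define VICTIM_WAYS 4   /* small, fully associative victim cache */

typedef struct { bool valid; uint32_t tag;  } MainLine;   /* tag = addr / SETS */
typedef struct { bool valid; uint32_t addr; } VictimLine; /* full block address */

static MainLine   cache[SETS];
static VictimLine victim[VICTIM_WAYS];
static int        victim_next; /* round-robin replacement pointer */

static void put_victim(uint32_t block_addr) {
    victim[victim_next].valid = true;
    victim[victim_next].addr  = block_addr;
    victim_next = (victim_next + 1) % VICTIM_WAYS;
}

/* Returns true on any hit; a hit in the victim cache swaps the block
   back into the main cache, so a conflict miss is avoided. */
bool access_block(uint32_t block_addr) {
    uint32_t set = block_addr % SETS;
    uint32_t tag = block_addr / SETS;

    if (cache[set].valid && cache[set].tag == tag)
        return true; /* ordinary direct-mapped hit */

    for (int i = 0; i < VICTIM_WAYS; i++) {
        if (victim[i].valid && victim[i].addr == block_addr) {
            /* Victim hit: swap with the conflicting main-cache line. */
            if (cache[set].valid)
                victim[i].addr = cache[set].tag * SETS + set;
            else
                victim[i].valid = false;
            cache[set].valid = true;
            cache[set].tag   = tag;
            return true;
        }
    }

    /* Miss everywhere: the evicted line moves into the victim cache,
       and the requested block is fetched from the lower level. */
    if (cache[set].valid)
        put_victim(cache[set].tag * SETS + set);
    cache[set].valid = true;
    cache[set].tag   = tag;
    return false;
}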

4. Pseudo-Associative Caches
- Goal: the fast hit time of a direct-mapped cache with the lower conflict-miss rate of a 2-way set-associative cache
- Divide the cache: on a miss, check the other half of the cache; if the block is there, it is a pseudo-hit (slow hit)
- Drawback: CPU pipeline design is hard if a hit can take 1 or 2 cycles
  - Better for caches not tied directly to the processor (e.g., L2)
  - Used in the MIPS R10000 L2 cache; similar in UltraSPARC
- Access times satisfy: hit time < pseudo-hit time < miss penalty

Pseudo-Associative Cache
[Diagram: (1) the CPU probes the primary block frame with a tag compare (=?); (2) on a miss, the alternate frame is checked; (3) on a miss in both frames, the request goes to the lower-level memory, with a write buffer on the path.]
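
A minimal sketch of the probe sequence, assuming a power-of-two number of sets and a flip-the-top-index-bit rehash rule (a common choice, not mandated by the slide):

#include <stdbool.h>
#include <stdint.h>

#define SETS 256 /* assumed cache size, power of two */

/* For simplicity each line records the full block address;
   a real design stores a tag plus one extra bit. */
typedef struct { bool valid; uint32_t addr; } Line;
static Line cache[SETS];

/* Returns 0 = fast hit, 1 = slow (pseudo) hit, -1 = miss. */
int probe(uint32_t block_addr) {
    uint32_t set = block_addr % SETS;

    if (cache[set].valid && cache[set].addr == block_addr)
        return 0;                     /* fast hit in the primary frame */

    uint32_t alt = set ^ (SETS / 2);  /* flip the top index bit */
    if (cache[alt].valid && cache[alt].addr == block_addr)
        return 1;  /* pseudo-hit: extra cycles, usually swapped back */

    return -1;                        /* genuine miss */
}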

Example
Compare direct-mapped, 2-way set-associative, and pseudo-associative caches using AMAT.
It takes two extra cycles to find an entry in the alternate location (one cycle to check and one cycle to swap).
Miss rate_pseudo = Miss rate_2-way
Miss penalty_pseudo = Miss penalty_1-way + 1
Hit time_pseudo = Hit time_1-way + Alternate hit rate_pseudo * 2
Alternate hit rate_pseudo = Hit rate_2-way - Hit rate_1-way = Miss rate_1-way - Miss rate_2-way
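
Plugging assumed numbers into the slide's model (the 2.0% and 1.6% miss rates and the 50-cycle penalty are made-up values, not from the slides):

#include <stdio.h>

int main(void) {
    double hit_1way  = 1.0;   /* cycles */
    double penalty   = 50.0;  /* cycles, direct-mapped miss penalty */
    double miss_1way = 0.020;
    double miss_2way = 0.016;

    /* The slide's pseudo-associative model: */
    double alt_hit_rate = miss_1way - miss_2way;         /* hits in alternate half */
    double hit_pseudo   = hit_1way + alt_hit_rate * 2.0; /* 2 extra cycles */
    double amat_pseudo  = hit_pseudo + miss_2way * (penalty + 1.0);

    double amat_1way = hit_1way + miss_1way * penalty;

    printf("AMAT 1-way:  %.3f cycles\n", amat_1way);   /* 2.000 */
    printf("AMAT pseudo: %.3f cycles\n", amat_pseudo); /* 1.824 */
    return 0;
}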

5. Hardware Prefetching Instruction Prefetching Alpha 21064 fetches 2 blocks on a miss Extra block placed in stream buffer On miss check stream buffer Works with data blocks too: 1 data stream buffer gets 25% misses from 4KB DM cache; 4 streams get 43% For scientific programs: 8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative caches Prefetching relies on having extra memory bandwidth that can be used without penalty
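
A toy C model of a single sequential stream buffer, with an assumed 64-set direct-mapped cache (all sizes are illustrative); on a purely sequential walk it catches every miss after the first:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SETS 64 /* assumed tiny direct-mapped cache */

typedef struct { bool valid; uint32_t addr; } Line;
static Line     cache[SETS];
static uint32_t stream_addr;   /* single-entry stream buffer */
static bool     stream_valid;
static long     misses, stream_hits;

/* On a miss, fetch the block and prefetch the next sequential block
   into the stream buffer; a later miss that finds its block there is
   serviced from the buffer instead of from lower-level memory. */
static void access_block(uint32_t b) {
    if (cache[b % SETS].valid && cache[b % SETS].addr == b)
        return;                       /* ordinary cache hit */
    if (stream_valid && stream_addr == b)
        stream_hits++;                /* serviced from the stream buffer */
    else
        misses++;                     /* full-penalty miss */
    cache[b % SETS].valid = true;     /* fill the cache line */
    cache[b % SETS].addr  = b;
    stream_addr  = b + 1;             /* prefetch the next block */
    stream_valid = true;
}

int main(void) {
    for (uint32_t b = 0; b < 1000; b++) /* purely sequential walk */
        access_block(b);
    printf("misses: %ld, stream-buffer hits: %ld\n", misses, stream_hits);
    return 0;  /* prints: misses: 1, stream-buffer hits: 999 */
}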

6. Software Prefetching Compiler inserts data prefetch instructions Load data into register (HP PA-RISC loads) Cache Prefetch: load into cache (MIPS IV, PowerPC) Special prefetching instructions cannot cause faults; a form of speculative execution Nonblocking cache: overlap execution with prefetch Issuing Prefetch Instructions takes time Is cost of prefetch issues < savings in reduced misses? Higher superscalar reduces difficulty of issue bandwidth
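
As a sketch of the same idea in portable C: GCC and Clang expose the __builtin_prefetch builtin, standing in here for the MIPS IV / PowerPC cache-prefetch instructions named on the slide. PREFETCH_AHEAD is an arbitrary tuning knob, not a value from the slides:

#define PREFETCH_AHEAD 16

void scale(double *x, int n, double k) {
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            /* Hints: will be written (1), modest temporal locality (1).
               The prefetch cannot fault, so it is safe to issue early. */
            __builtin_prefetch(&x[i + PREFETCH_AHEAD], 1, 1);
        x[i] = k * x[i];
    }
}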

7. Compiler Optimizations
Reduce misses without any hardware changes.
Instructions:
- Use profiling to look at conflicts between groups of instructions
Data:
- Merging arrays: improve spatial locality with a single array of compound elements instead of 2 separate arrays
- Loop interchange: change the nesting of loops to access data in the order it is stored in memory
- Loop fusion: combine 2 independent loops that have the same looping structure and some overlapping variables
- Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of walking down whole columns or rows

Merging Arrays

/* Before: 2 sequential arrays */
int key[SIZE];
int val[SIZE];

/* After: 1 array of structures */
struct merge {
    int key;
    int val;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.

Loop Interchange

/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality. The same number of instructions is executed.

Loop Fusion

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Before: 2 misses per access to a and c. After: 1 miss per access; improved temporal locality.

Blocking (1/2)

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

The two inner loops:
- Read all NxN elements of z[]
- Read N elements of 1 row of y[] repeatedly
- Write N elements of 1 row of x[]
Capacity misses are a function of N and the cache size: if the cache can hold all three NxN matrices of 4-byte elements (3 x N x N x 4 bytes), there are no capacity misses.
Idea: compute on a BxB submatrix that fits in the cache.

Blocking (2/2)

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

B is called the blocking factor.
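
How big should B be? One common rule of thumb (an assumption here, not from the slides) is to keep roughly three BxB submatrices of the working set resident at once; a back-of-the-envelope sketch assuming an 8KB data cache and 4-byte elements:

#include <math.h>
#include <stdio.h>

int main(void) {
    /* Assumed: 8KB data cache, 4-byte elements; the heuristic keeps
       about three BxB submatrices resident: 3*B*B*4 <= capacity. */
    double cache_bytes = 8.0 * 1024.0;
    int B = (int)sqrt(cache_bytes / (3.0 * 4.0));
    printf("blocking factor B <= %d\n", B); /* 26 for these numbers */
    return 0;
}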

Compiler Optimization Performance
[Chart: measured performance impact of the preceding compiler optimizations across benchmarks; not reproduced in this transcript.]

Summary
3 Cs: Compulsory, Capacity, Conflict misses
Reducing the miss rate:
1. Larger block size
2. Higher associativity
3. Victim cache
4. Pseudo-associativity
5. HW prefetching of instructions and data
6. SW prefetching of data
7. Compiler optimizations
Beware the danger of concentrating on just one parameter when evaluating performance.