Lecture 9: Memory Hierarchy (3)


Lecture 9: Memory Hierarchy (3)
Michael B. Greenwald
Computer Architecture CIS 501, Fall 1999

Improving Cache Performance

Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)

Improve performance by:
1. Reducing the miss rate,
2. Reducing the miss penalty, or
3. Reducing the time to hit in the cache.

Reducing Misses

Techniques so far:
1) Increasing block size
2) Increasing associativity
3) Victim caches

4. Reducing Misses via "Pseudo-Associativity"

How can we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
- Divide the cache: on a miss, check the other half of the cache; if the block is there, it is a pseudo-hit (slow hit)
- Drawback: CPU pipeline design is hard if a hit can take 1 or 2 cycles
- Better for caches not tied directly to the processor (L2)
- Used in the MIPS R10000 L2 cache; similar in UltraSPARC
(Figure: access time axis, with hit time < pseudo-hit time < miss penalty)
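
A minimal sketch of the lookup, assuming a direct-mapped array whose alternate ("pseudo") set is the index with its top bit flipped; the sizes and names are illustrative, not the MIPS or UltraSPARC design:

    #include <stdbool.h>
    #include <stdint.h>

    #define INDEX_BITS 10                  /* assumed: 1024 lines        */
    #define SETS       (1u << INDEX_BITS)
    #define BLOCK_BITS 6                   /* assumed: 64-byte blocks    */

    typedef struct { bool valid; uint32_t tag; } line_t;
    static line_t cache[SETS];

    /* returns 0 = fast hit, 1 = pseudo (slow) hit, 2 = miss */
    static int pseudo_assoc_lookup(uint32_t addr)
    {
        uint32_t block = addr >> BLOCK_BITS;
        uint32_t index = block & (SETS - 1);
        /* the stored tag keeps the top index bit, so a line is
           identified unambiguously in its home slot or the alternate */
        uint32_t tag = block >> (INDEX_BITS - 1);

        if (cache[index].valid && cache[index].tag == tag)
            return 0;                        /* fast hit: 1 cycle        */

        uint32_t alt = index ^ (SETS >> 1);  /* flip top index bit       */
        if (cache[alt].valid && cache[alt].tag == tag)
            return 1;  /* pseudo-hit: extra cycle; real designs also
                          swap the lines so the next access is fast     */
        return 2;                            /* miss: go to next level   */
    }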

5. Reducing Misses by Hardware Prefetching of Instructions & Data

Example: instruction prefetching
- The Alpha 21064 fetches 2 blocks on a miss
- The extra block is placed in a "stream buffer"; on a miss, check the stream buffer
- 1 block catches 15-25% of misses; 4 blocks, 50%; 16 blocks, 72% [Jouppi 1990]
Works with data blocks too:
- Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 streams caught 43%
- Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches
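
A toy, software-level model of the stream-buffer idea; the depth, the head-only comparison, and the flush-on-miss policy are simplifying assumptions rather than Jouppi's exact design:

    #include <stdbool.h>
    #include <stdint.h>

    #define SB_DEPTH 4                 /* assumed buffer depth          */

    static uint32_t sb_block[SB_DEPTH]; /* prefetched block addresses   */
    static bool     sb_valid[SB_DEPTH];
    static int      sb_head = 0;        /* oldest (next expected) entry */
    static uint32_t sb_next;            /* next block to prefetch       */

    /* (re)start the buffer streaming at the successors of a miss */
    static void stream_buffer_start(uint32_t miss_block)
    {
        for (int i = 0; i < SB_DEPTH; i++) {
            sb_block[i] = miss_block + 1 + i;
            sb_valid[i] = true;
        }
        sb_head = 0;
        sb_next = miss_block + 1 + SB_DEPTH;
    }

    /* checked on a cache miss: a head hit supplies the block without a
       memory access, and the freed slot prefetches one block further */
    static bool stream_buffer_lookup(uint32_t block)
    {
        if (sb_valid[sb_head] && sb_block[sb_head] == block) {
            sb_block[sb_head] = sb_next++;     /* keep streaming ahead  */
            sb_head = (sb_head + 1) % SB_DEPTH;
            return true;
        }
        return false;  /* head miss: flush and restart at block + 1    */
    }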

5. Reducing Misses by Hardware Prefetching of Instructions & Data (cont'd)

- Prefetching relies on extra memory bandwidth that can be used without penalty
- What if memory bandwidth is inadequate? Prefetches slow down the actual demand misses.
- If all prefetched blocks end up being referenced, there is no net loss; but that would require a 100% hit rate on prefetched blocks.

6. Reducing Misses by Software Prefetching Data

Prefetch explicit blocks that are known to be referenced soon:
- The compiler models the cache and predicts likely misses (algorithms: fixed horizon, aggressive, forestall [Kimbrel et al., OSDI '96])
Special instructions:
- Load data into a register (HP PA-RISC loads)
- Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC V9)
- Release (implemented?)
- Kimbrel et al.: prefetching outperforms demand fetching even when demand fetching has optimal replacement
- Prefetching instructions cannot cause faults; they are a form of speculative execution

6. Reducing Misses by Software Prefetching Data (cont'd)

- Requires a "non-blocking cache": the CPU must be able to proceed in parallel with the prefetch
- Issuing prefetch instructions takes time: is the cost of issuing prefetches less than the savings in reduced misses? Wider superscalar issue reduces the difficulty of finding issue bandwidth for them.
- Must prefetch early enough that the data arrives before the reference, but prefetching too early can increase conflict or capacity misses. Timing is critical.
- Prefetching matters more than optimizing release (intuition: if you release the wrong line, a good prefetch can recover!). A sketch using a compiler builtin follows.
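
A minimal sketch using the GCC/Clang __builtin_prefetch intrinsic (a non-faulting cache prefetch, in the slide's terms); the prefetch distance is a tuning assumption:

    /* Prefetch a fixed distance ahead of the current element.  The
       distance must hide the miss latency without letting the line be
       evicted before use; 16 elements is an assumption to tune. */
    #define PREFETCH_AHEAD 16

    double sum_with_prefetch(const double *a, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + PREFETCH_AHEAD < n)
                __builtin_prefetch(&a[i + PREFETCH_AHEAD],
                                   /*rw=*/0, /*locality=*/1);
            sum += a[i];
        }
        return sum;
    }

If the loop body is short, a wider distance may be needed to cover the full miss latency; too wide, and the prefetched line may be evicted before use, raising exactly the conflict/capacity concern noted above.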

7. Reducing Misses by Compiler/Software Optimizations

McFarling [1989] reduced cache misses by 75% in software on an 8KB direct-mapped cache with 4-byte blocks.
Instructions:
- Reorder procedures in memory so as to reduce conflict misses
- Use profiling to look at conflicts (with tools they developed)
Data:
- Merging arrays: improve spatial locality with a single array of compound elements instead of 2 arrays
- Loop interchange: change the nesting of loops to access data in the order it is stored in memory
- Loop fusion: combine 2 independent loops that have the same loop bounds and overlapping variables
- Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of going down whole columns or rows

7. Reducing Misses by Compiler/Software Optimizations (cont'd)

Explicit management by the programmer (MP3D: Cheriton, Goosen, et al.):
- Alignment: start data structures on cache-line boundaries
- Performance measurement: pad data to avoid conflicts with instructions
- Improves both spatial and temporal locality
- However, this increases the total memory footprint of the application even though it decreases the instantaneous cache footprint, and it can degrade TLB performance, etc.
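
A small sketch of the alignment and padding ideas, assuming C11 and a 64-byte cache line (the line size is a per-machine assumption):

    #include <assert.h>
    #include <stdalign.h>

    #define LINE 64   /* cache-line size: assumed, machine dependent */

    /* Alignment: start the structure on a line boundary so a small,
       hot record never straddles two cache lines. */
    struct record {
        alignas(LINE) int key;
        int val;
    };

    /* Padding: round each element up to a full line so neighbouring
       elements never share, and thus never conflict in, a line. */
    struct padded_counter {
        long value;
        char pad[LINE - sizeof(long)];
    };

    static_assert(sizeof(struct padded_counter) == LINE,
                  "each counter occupies exactly one line");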

Merging Arrays Example

    /* Before: 2 sequential arrays */
    int val[SIZE];
    int key[SIZE];

    /* After: 1 array of structures */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];

Reduces conflicts between val and key and improves spatial locality: both fields sit in the same cache line.

Loop Interchange Example

    /* Before */
    for (k = 0; k < 100; k = k+1)
        for (j = 0; j < 100; j = j+1)
            for (i = 0; i < 5000; i = i+1)
                x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k = k+1)
        for (i = 0; i < 5000; i = i+1)
            for (j = 0; j < 100; j = j+1)
                x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Loop Fusion Example

    /* Before: two independent loop nests */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            d[i][j] = a[i][j] + c[i][j];

    /* After: one fused loop nest */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }

2 misses per access to a & c versus one miss per access; improves temporal locality (each a[i][j] and c[i][j] is reused while still in the cache).

Blocking Example

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }

The two inner loops:
- read all NxN elements of z[]
- read N elements of 1 row of y[] repeatedly
- write N elements of 1 row of x[]
Capacity misses are a function of N and cache size:
- if the cache can hold 3 NxN matrices of 4-byte elements, there are no capacity misses; otherwise ...
Idea: compute on a BxB submatrix that fits in the cache.

Blocking Example (cont'd)

    /* After */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B, N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B, N); k = k+1)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }

B is called the blocking factor.
Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2.
Conflict misses, too?

Blocking (cont'd)

Blocking can be used for registers, too, with a small B; this reduces loads and stores. Typically blocking is used to reduce capacity misses, but small blocks can sometimes reduce conflict misses too. A register-blocked sketch follows.
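
A sketch of register blocking applied to the matrix multiply above: a 2x2 unroll-and-jam so the four partial sums live in registers and each loaded element of y and z is used twice (N is assumed even for brevity):

    /* C99: variably-modified array parameters */
    void matmul_regblock(int N, double x[N][N],
                         double y[N][N], double z[N][N])
    {
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < N; j += 2) {
                /* 2x2 block of partial sums held in registers */
                double r00 = 0, r01 = 0, r10 = 0, r11 = 0;
                for (int k = 0; k < N; k++) {
                    r00 += y[i][k]   * z[k][j];
                    r01 += y[i][k]   * z[k][j+1];
                    r10 += y[i+1][k] * z[k][j];
                    r11 += y[i+1][k] * z[k][j+1];
                }
                x[i][j]     = r00;  x[i][j+1]   = r01;
                x[i+1][j]   = r10;  x[i+1][j+1] = r11;
            }
    }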

Reducing Conflict Misses by Blocking

(Figure: conflict misses in caches that are not fully associative, versus blocking factor.)
Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a factor of 48, even though both fit in the cache.

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

These techniques can't be applied blindly; their benefit is strongly application dependent.

Summary

4 Cs: Compulsory, Capacity, Conflict, Coherence
1. Reduce misses via larger block size
2. Reduce misses via higher associativity
3. Reduce misses via a victim cache
4. Reduce misses via pseudo-associativity
5. Reduce misses by hardware prefetching of instructions and data
6. Reduce misses by software prefetching of data
7. Reduce misses by compiler optimizations
Remember the danger of concentrating on just one parameter when evaluating performance.

Improving Cache Performance

Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)

Improve performance by:
1. Reducing the miss rate,
2. Reducing the miss penalty, or
3. Reducing the time to hit in the cache.

1. Reducing Miss Penalty: Read Priority over Write on Miss

Write through with write buffers creates RAW conflicts between main-memory reads on cache misses and buffered writes:
- Simply waiting for the write buffer to empty can increase the read miss penalty (by 50% on the old MIPS 1000)
- Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue (a sketch follows this list)
Write back?
- A read miss may replace a dirty block
- Normal: write the dirty block to memory, then do the read
- Instead: copy the dirty block to a write buffer, then do the read, then do the write
- The CPU stalls less, since it restarts as soon as the read completes
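
A toy, simulation-level sketch of that check; the buffer depth and the newest-first search are assumptions, not any specific machine:

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4                /* assumed write-buffer depth */

    struct wb_entry { bool valid; uint32_t addr; uint32_t data; };
    static struct wb_entry write_buffer[WB_ENTRIES];

    /* On a read miss, scan the write buffer.  A match means the
       freshest copy is still queued: forward it rather than waiting
       for the buffer to drain.  No match: let the read go to memory. */
    static bool wb_forward(uint32_t addr, uint32_t *data)
    {
        for (int i = WB_ENTRIES - 1; i >= 0; i--)   /* newest first */
            if (write_buffer[i].valid && write_buffer[i].addr == addr) {
                *data = write_buffer[i].data;
                return true;
            }
        return false;   /* no conflict: the memory read may proceed */
    }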

2. Reduce Miss Penalty: Subblock Placement

- Don't have to load the full block on a miss
- Keep valid bits per subblock to indicate which subblocks are valid
(Figure: a cache line with one tag, per-subblock valid bits, and subblocks)

2. Reduce Miss Penalty: Subblock Placement (cont'd)

- Originally invented to reduce tag storage by increasing block size
- This increases wasted space (fragmentation) but reduces transfer time
- Works best when the block size is chosen optimally and the sub-block size is chosen to reduce transfer time (see the sketch below)
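
A minimal sketch of the resulting line format, with sizes chosen only for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define SUBBLOCKS 4     /* assumed: 4 sub-blocks per block */

    /* one tag per block, one valid bit per sub-block, so a miss need
       only fetch the referenced sub-block */
    struct line {
        bool     valid[SUBBLOCKS];
        uint32_t tag;
    };

    /* hit only if the tag matches AND the referenced sub-block is valid */
    static bool subblock_hit(const struct line *l, uint32_t tag, int sub)
    {
        return l->tag == tag && l->valid[sub];
    }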

3. Reduce Miss Penalty: Early Restart and Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in. Also called wrapped fetch or requested-word-first.
- Generally useful only for large blocks
- Spatial locality is a problem: the CPU tends to want the next sequential word soon, so it is not clear how much early restart helps
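
A small sketch of the wrapped (critical-word-first) fill order:

    #define WORDS_PER_BLOCK 8   /* assumed block size in words */

    /* Deliver the block starting at the word the CPU asked for, then
       wrap around modulo the block size, so the critical word arrives
       first and early restart can resume the CPU immediately. */
    static void fill_wrapped(int critical, int order[WORDS_PER_BLOCK])
    {
        for (int i = 0; i < WORDS_PER_BLOCK; i++)
            order[i] = (critical + i) % WORDS_PER_BLOCK;
    }
    /* e.g. critical = 5 gives delivery order 5,6,7,0,1,2,3,4 */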

4. Reduce Miss Penalty: Non-blocking Caches to Reduce Stalls on Misses

- A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss; this requires an out-of-order execution CPU to exploit, and it also benefits prefetching (technique 6 above)
- "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
- This significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses (a sketch of the bookkeeping follows)
- It also requires multiple memory banks; otherwise the outstanding misses cannot be serviced in parallel
- The Pentium Pro allows 4 outstanding memory misses
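
A sketch of that bookkeeping in the style of Miss Status Holding Registers (MSHRs); the structure and the four-entry limit (echoing the Pentium Pro figure) are illustrative, not any specific controller:

    #include <stdbool.h>
    #include <stdint.h>

    #define MSHRS 4                      /* assumed: 4 outstanding misses */

    struct mshr { bool busy; uint32_t block_addr; };
    static struct mshr mshrs[MSHRS];

    /* On a miss: merge with a matching outstanding miss ("miss under
       miss" to the same block), else allocate a free MSHR, else the
       cache must finally stall the CPU. */
    enum miss_action { MERGE, ALLOCATE, STALL };

    static enum miss_action handle_miss(uint32_t block_addr)
    {
        int free_slot = -1;
        for (int i = 0; i < MSHRS; i++) {
            if (mshrs[i].busy && mshrs[i].block_addr == block_addr)
                return MERGE;            /* block already being fetched */
            if (!mshrs[i].busy)
                free_slot = i;
        }
        if (free_slot >= 0) {
            mshrs[free_slot].busy = true;
            mshrs[free_slot].block_addr = block_addr;
            return ALLOCATE;             /* new outstanding miss        */
        }
        return STALL;                    /* all MSHRs busy              */
    }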

5. Reduce Miss Penalty: Second-Level (L2) Cache

L2 equations:

    AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
    Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
    AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
- The global miss rate is what matters
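
A worked instance of these equations, with assumed numbers (1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, 100-cycle memory penalty):

    #include <stdio.h>

    int main(void)
    {
        double hit_l1 = 1.0,  miss_rate_l1 = 0.04;
        double hit_l2 = 10.0, miss_rate_l2 = 0.50;  /* local L2 rate */
        double mem_penalty = 100.0;

        /* Miss Penalty_L1 = 10 + 0.5 * 100 = 60 cycles */
        double miss_penalty_l1 = hit_l2 + miss_rate_l2 * mem_penalty;
        /* AMAT = 1 + 0.04 * 60 = 3.4 cycles */
        double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;
        /* global L2 miss rate = 0.04 * 0.5 = 2% */
        double global_l2 = miss_rate_l1 * miss_rate_l2;

        printf("AMAT = %.1f cycles, global L2 miss rate = %.1f%%\n",
               amat, global_l2 * 100);
        return 0;
    }

Note how the local L2 miss rate (50%) looks alarming while the global rate (2%) is what actually enters AMAT; that is the point of the next slide.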

Comparing Local and Global Miss Rates

- 32KB first-level cache; increasing second-level cache size
- The global miss rate is close to the single-level-cache miss rate, provided L2 >> L1
- Don't use the local miss rate
- L2 is not tied to the CPU clock cycle! What matters is cost and A.M.A.T.
- L2 generally offers fast hit times and fewer misses
- Since L2 hits are few, target miss-rate reduction
(Figure: miss rates versus cache size, plotted on linear and log cache-size axes)