Advanced Caching Techniques — CSE P548, Spring 2003 (lecture slides)

Slide 1: Advanced Caching Techniques
Approaches to improving memory system performance:
• eliminate memory operations
• decrease the number of misses
• decrease the miss penalty
• decrease the cache/memory access times
• hide memory latencies
• increase cache throughput
• increase memory bandwidth

Slide 2: Handling a Cache Miss the Old Way
(1) Send the address to the next level of the hierarchy
(2) Signal a read operation
(3) Wait for the data to arrive
(4) Update the cache entry with the data*, rewrite the tag, turn the valid bit on, clear the dirty bit (unless an instruction cache)
(5) Resend the memory address; this time there will be a hit

* There are variations:
• fill the cache block, then send the requested word to the CPU
• send the requested word to the CPU as soon as it arrives at the cache (early restart)
• the requested word is sent from memory first, then the rest of the block follows (requested word first)

Early restart and requested word first have a lower miss penalty, because they return execution control to the CPU earlier.
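These steps map naturally onto a software cache simulator. A minimal sketch in C of the fill-then-resend flow (without early restart); the cache geometry and the backing-store routine mem_read_block are illustrative assumptions, not from the slides:

#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_SETS    1024

typedef struct {
    uint32_t tag;
    int      valid;
    int      dirty;
    uint8_t  data[BLOCK_BYTES];
} CacheLine;

static CacheLine cache[NUM_SETS];

/* Hypothetical backing-store read: fetches one full block from memory. */
extern void mem_read_block(uint32_t block_addr, uint8_t *dst);

/* Handle a read miss "the old way": fetch the whole block, update the
 * entry, then let the caller retry the access (which will now hit). */
void handle_read_miss(uint32_t addr)
{
    uint32_t block_addr = addr / BLOCK_BYTES;
    uint32_t set        = block_addr % NUM_SETS;  /* direct-mapped index */
    CacheLine *line     = &cache[set];

    mem_read_block(block_addr, line->data);  /* (1)-(3): request & wait */
    line->tag   = block_addr / NUM_SETS;     /* (4): rewrite the tag */
    line->valid = 1;                         /*      turn the valid bit on */
    line->dirty = 0;                         /*      clear the dirty bit */
    /* (5): the CPU resends the address and hits this time. */
}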

Slide 3: Non-blocking Caches
A non-blocking cache (lockup-free cache) allows the CPU to continue executing instructions while a miss is handled:
• some processors allow only 1 outstanding miss ("hit under miss")
• some processors allow multiple outstanding misses ("miss under miss")

Miss status holding registers (MSHRs) are the hardware structure for tracking outstanding misses. Each entry holds:
• the physical address of the block
• which word in the block
• the destination register number (if data)
• a mechanism to merge requests to the same block
• a mechanism to ensure accesses to the same location execute in program order
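One way to picture the MSHR fields listed above is as a small table that new misses are merged into. A minimal sketch in C; the layout, sizes, and names are assumptions for illustration, not a description of any particular processor:

#include <stdint.h>

#define WORDS_PER_BLOCK 8
#define NUM_MSHRS       4

typedef struct {
    int      valid;                        /* entry tracks an outstanding miss */
    uint32_t block_addr;                   /* physical address of the block */
    uint8_t  word_needed[WORDS_PER_BLOCK]; /* which words in the block */
    int8_t   dest_reg[WORDS_PER_BLOCK];    /* destination register (if data) */
} MSHR;

static MSHR mshr[NUM_MSHRS];

/* Record a miss: merge with an existing entry for the same block if one
 * exists, otherwise allocate a new entry.  Returns -1 if the MSHR file
 * is full, in which case the cache must stall (i.e., block). */
int mshr_record_miss(uint32_t block_addr, int word, int8_t dest)
{
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshr[i].valid && mshr[i].block_addr == block_addr) {
            mshr[i].word_needed[word] = 1;   /* merge request to same block */
            mshr[i].dest_reg[word]    = dest;
            return i;
        }
        if (!mshr[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;                           /* no MSHR free: stall */
    mshr[free_slot].valid      = 1;
    mshr[free_slot].block_addr = block_addr;
    mshr[free_slot].word_needed[word] = 1;
    mshr[free_slot].dest_reg[word]    = dest;
    return free_slot;
}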

Slide 4: Non-blocking Caches (cont.)
A non-blocking cache (lockup-free cache) can be used with both in-order and out-of-order processors:
• in-order processors stall only when an instruction that uses the load data is the next instruction to be executed (non-blocking loads)
• out-of-order processors can execute instructions past the load consumer

How do non-blocking caches improve memory system performance?

Slide 5: Victim Cache
A victim cache is a small, fully associative cache that contains the most recently replaced blocks of a direct-mapped cache:
• an alternative to a 2-way set-associative cache
• check it on a cache miss
• on a victim-cache hit, swap the direct-mapped block and the victim cache block

How do victim caches improve memory system performance? Why do victim caches work?
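A sketch of this miss path in C, assuming a direct-mapped L1 and a 4-entry victim cache; names and sizes are illustrative:

#include <stdint.h>

#define BLOCK_BYTES 32
#define NUM_SETS    256
#define VICTIM_WAYS 4          /* small, fully associative */

typedef struct {
    uint32_t block_addr;       /* full block address doubles as the tag */
    int      valid;
    uint8_t  data[BLOCK_BYTES];
} Line;

static Line l1[NUM_SETS];      /* direct-mapped L1 */
static Line victim[VICTIM_WAYS];

/* On an L1 miss, probe the victim cache; on a victim hit, swap the two
 * blocks so the hot block moves back into the direct-mapped cache. */
int victim_probe_and_swap(uint32_t block_addr)
{
    uint32_t set = block_addr % NUM_SETS;
    for (int w = 0; w < VICTIM_WAYS; w++) {
        if (victim[w].valid && victim[w].block_addr == block_addr) {
            Line tmp  = l1[set];    /* the block being displaced */
            l1[set]   = victim[w];
            victim[w] = tmp;        /* displaced block becomes the victim */
            return 1;               /* serviced without going to memory */
        }
    }
    return 0;                       /* true miss: go to the next level */
}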

Slide 6: Sub-block Placement
Divide a block into sub-blocks:
• sub-block = the unit of transfer on a cache miss
• one valid bit per sub-block

Misses:
• block-level miss: the tags didn't match
• sub-block-level miss: the tags matched, but the valid bit was clear

+ only the transfer time of a sub-block
+ fewer tags than if each sub-block were a block
- less implicit prefetching

How does sub-block placement improve memory system performance?
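The two miss types can be made concrete with per-sub-block valid bits. A minimal sketch in C (layout and sizes are assumptions):

#include <stdint.h>

#define SUBBLOCKS 4           /* sub-blocks per cache block */
#define NUM_SETS  512

typedef struct {
    uint32_t tag;
    uint8_t  sub_valid[SUBBLOCKS];  /* one valid bit per sub-block */
} SubBlockedLine;

static SubBlockedLine cache[NUM_SETS];

typedef enum { HIT, SUBBLOCK_MISS, BLOCK_MISS } Outcome;

/* Classify an access: a block-level miss means the tag didn't match;
 * a sub-block-level miss means the tag matched but the sub-block's
 * valid bit was clear, so only that sub-block must be fetched. */
Outcome lookup(uint32_t set, uint32_t tag, int sub)
{
    if (cache[set].tag != tag)
        return BLOCK_MISS;        /* refill the tag; fetch one sub-block */
    if (!cache[set].sub_valid[sub])
        return SUBBLOCK_MISS;     /* tag ok; fetch the missing sub-block */
    return HIT;
}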

Slide 7: Pseudo-set-associative Cache
Pseudo-set-associative cache:
• access the cache
• on a miss, invert the high-order index bit & access the cache again

+ miss rate of a 2-way set-associative cache
+ close to the access time of a direct-mapped cache
- increase in hit time (relative to 2-way associative) if you always try the slow hit first

Instead, predict which is the fast-hit block: keep the fast-hit block in the primary location, and swap the blocks if the prediction was wrong.

How does pseudo-set associativity improve memory system performance?
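A sketch of the two-probe lookup in C; storing the full block address as the tag is a simplification so a block can live in either of its two candidate sets (all names and sizes are illustrative):

#include <stdint.h>

#define INDEX_BITS 10
#define NUM_SETS   (1u << INDEX_BITS)

typedef struct { uint32_t block_addr; int valid; } Line;
static Line cache[NUM_SETS];

/* Probe the primary location; on a miss, invert the high-order index
 * bit and probe the alternate ("slow hit") location. */
int pseudo_assoc_lookup(uint32_t block_addr)
{
    uint32_t index = block_addr % NUM_SETS;
    if (cache[index].valid && cache[index].block_addr == block_addr)
        return 1;                                   /* fast hit */
    uint32_t alt = index ^ (1u << (INDEX_BITS - 1)); /* flip high index bit */
    if (cache[alt].valid && cache[alt].block_addr == block_addr) {
        Line tmp     = cache[index];   /* swap the blocks so the next */
        cache[index] = cache[alt];     /* access to this block is a   */
        cache[alt]   = tmp;            /* fast hit                    */
        return 1;                      /* slow hit */
    }
    return 0;                          /* miss in both locations */
}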

Slide 8: Pipelined Cache Access
Pipelined cache access: a simple 2-stage pipeline
• stage 1: cache access
• stage 2: data transfer
• tag check & hit/miss logic are performed with the shorter of the two stages

How do pipelined caches improve memory system performance?

Slide 9: Mechanisms for Prefetching
Prefetching:
+ reduces the number of misses

Stream buffers:
• where prefetched instructions/data are held
• if the requested block is in the stream buffer, the cache's access to memory for that block is cancelled
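A sketch of the stream-buffer check in C, assuming a simple 4-deep sequential buffer; issuing the follow-on prefetch is omitted, and names and sizes are illustrative:

#include <stdint.h>

#define STREAM_DEPTH 4

typedef struct {
    uint32_t block_addr[STREAM_DEPTH];  /* prefetched blocks, in order */
    int      valid[STREAM_DEPTH];
    int      head;
} StreamBuffer;

/* On a cache miss, check the head of the stream buffer.  A hit there
 * cancels the memory access: the block is supplied from the buffer,
 * and the buffer advances (the prefetch of the next sequential block
 * is not shown). */
int stream_buffer_check(StreamBuffer *sb, uint32_t block_addr)
{
    if (sb->valid[sb->head] && sb->block_addr[sb->head] == block_addr) {
        sb->valid[sb->head] = 0;                  /* consume the entry */
        sb->head = (sb->head + 1) % STREAM_DEPTH;
        return 1;                                 /* refill from the buffer */
    }
    return 0;                                     /* go to the next level */
}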

Slide 10: Trace Cache
Trace cache contents:
• cache blocks contain instructions from the dynamic instruction stream
+ a more efficient use of I-cache space
+ reduces the number of misses by fetching statically noncontiguous instructions in a single cycle
• cache tag is the high branch address bits + predictions for all branches within the block
• 2 next addresses (target & fall-through code)

Accessing a trace cache:
• access the trace cache + branch predictor, BTB, TLB, and I-cache in parallel
• compare the PC & prediction history of the current branch instruction to the trace cache tag
• hit: the I-cache fetch is ignored
• miss: use the I-cache; start constructing a new trace
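The tag-plus-predictions match described above might be modeled as follows. This is a loose sketch in C of one possible trace-cache line, not the layout of any real design; all field names and sizes are assumptions:

#include <stdint.h>

#define TRACE_INSNS    16   /* max instructions per trace */
#define TRACE_BRANCHES 3    /* max branches per trace */

/* One trace-cache line: a dynamic instruction sequence plus enough
 * metadata to match it against the fetch PC and branch predictions. */
typedef struct {
    uint32_t tag;                        /* high bits of the starting PC */
    uint8_t  pred_taken[TRACE_BRANCHES]; /* predictions embedded in the trace */
    uint32_t next_addr_taken;            /* next fetch address if taken */
    uint32_t next_addr_fallthru;         /* ... and if not taken */
    uint32_t insns[TRACE_INSNS];         /* the dynamic instruction stream */
    int      valid;
} TraceLine;

/* A hit requires both the tag and the current branch predictions to
 * match; otherwise fetch falls back to the conventional I-cache. */
int trace_hit(const TraceLine *t, uint32_t pc_tag,
              const uint8_t cur_pred[TRACE_BRANCHES])
{
    if (!t->valid || t->tag != pc_tag)
        return 0;
    for (int b = 0; b < TRACE_BRANCHES; b++)
        if (t->pred_taken[b] != cur_pred[b])
            return 0;
    return 1;
}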

Slide 11: Cache-friendly Compiler Optimizations
Exploit spatial locality:
• schedule for array misses: hoist the first load to a cache block

Improve spatial locality:
• group & transpose: makes portions of vectors that are accessed together lie in memory together
• loop interchange: so the inner loop follows the memory layout (see the sketch below)

Improve temporal locality:
• loop fusion: do multiple computations on the same portion of an array
• tiling (also called blocking): do all computation on a small block of memory that will fit in the cache
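For loop interchange, a standard before/after pair in C makes the point; this example assumes row-major storage and is illustrative, not from the slides:

/* Loop interchange: with x stored in row-major order, the original
 * column-order traversal strides through memory; swapping the loops
 * makes the inner loop walk consecutive addresses. */
#define N 1024
double x[N][N];

void before(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2.0 * x[i][j];  /* stride of N doubles: poor locality */
}

void after(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];  /* unit stride: one miss per block */
}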

Slide 12: Effect of Tiling (figure)

Slide 13: Tiling Example (matrix multiply excerpt; min(a,b) returns the smaller of its arguments)

/* before */
for (i = 0; i < n; i = i+1)
    for (j = 0; j < n; j = j+1) {
        r = 0;
        for (k = 0; k < n; k = k+1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* after: loops blocked by a factor B; x must be zeroed beforehand,
   since each (jj,kk) tile now adds a partial sum into x[i][j] */
for (jj = 0; jj < n; jj = jj+B)
    for (kk = 0; kk < n; kk = kk+B)
        for (i = 0; i < n; i = i+1)
            for (j = jj; j < min(jj+B, n); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B, n); k = k+1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }
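A note on the blocking factor (not on the original slide): B is chosen so the data a tile touches, roughly a B×B block of z plus B-element slices of x and y per row, fits comfortably in the cache. For example, with 8-byte doubles and B = 32, the z tile alone occupies 32 × 32 × 8 = 8 KB.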

Slide 14: Memory Banks
Interleaved memory: multiple memory banks
• word locations are assigned across banks
• interleaving factor: the number of banks
• a single address can access all banks at once
+ get more data per transfer: increases memory bandwidth, and the extra data will probably be used (why?)
+ similar implementation to non-banked memory
- larger chip capacity means fewer banks
- power is an issue
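The word-interleaved mapping can be written down directly. A small sketch in C; the bank count and word size are illustrative:

#include <stdint.h>

#define NUM_BANKS  4    /* interleaving factor */
#define WORD_BYTES 4

/* Word-interleaved mapping: consecutive word addresses rotate across
 * banks, so one access can draw NUM_BANKS consecutive words in
 * parallel.  E.g., byte addresses 0, 4, 8, 12 land in banks 0-3. */
uint32_t bank_of(uint32_t addr)        { return (addr / WORD_BYTES) % NUM_BANKS; }
uint32_t offset_in_bank(uint32_t addr) { return (addr / WORD_BYTES) / NUM_BANKS; }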

Slide 15: Memory Banks (cont.)
Independent memory banks:
• different banks can be accessed at once, with different addresses
• allows parallel access, possibly parallel data transfer
• multiple memory controllers, one to do each access
• different controllers cannot access the same bank
• cheaper than dual porting

How do independent memory banks improve memory system performance?
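A sketch in C of the one-controller-per-bank constraint (all names are illustrative): an access can be issued only if its bank's controller is idle, so accesses to different banks overlap while accesses to the same bank serialize.

#include <stdint.h>

#define NUM_BANKS 4

typedef struct { int busy; uint32_t addr; } Controller;
static Controller ctrl[NUM_BANKS];   /* one memory controller per bank */

/* Issue an access to the controller owning the target bank; fails if
 * that bank is already busy with another access (a bank conflict). */
int issue_access(uint32_t block_addr)
{
    uint32_t bank = block_addr % NUM_BANKS;
    if (ctrl[bank].busy)
        return 0;                    /* conflict: same bank already in use */
    ctrl[bank].busy = 1;
    ctrl[bank].addr = block_addr;
    return 1;                        /* accesses to different banks overlap */
}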

Slide 16: Machine Comparison (table)

Slide 17: Advanced Caching Techniques (summary)
Approaches to improving memory system performance:
• eliminate memory operations
• decrease the number of misses
• decrease the miss penalty
• decrease the cache/memory access times
• hide memory latencies
• increase cache throughput
• increase memory bandwidth

Wrap-up
• Victim cache
• TLB
• Hardware- or compiler-based prefetching
• Cache-conscious compiler optimizations
• Coupling a write-through memory update policy with a write buffer
• Handling the read miss before replacing a block with a write-back memory update policy
• Sub-block placement
• Non-blocking caches
• Merging requests to the same cache block in a non-blocking cache
• Requested word first or early restart
• Cache hierarchies
• Virtual caches
• Pipelined cache accesses
• Pseudo-set-associative cache
• Banked or interleaved memories
• Independent memory banks
• Wider bus