Slide 1: Lecture 17: Reducing Cache Miss Penalty and Reducing Cache Hit Time
Adapted from UC Berkeley CS252 S01. Topics: hardware prefetching and stream buffers, software prefetching, and virtually indexed caches.

Slide 2: Reducing Misses by Hardware Prefetching of Instructions and Data
- Example, instruction prefetching: the Alpha fetches 2 blocks on a miss; the extra block is placed in a "stream buffer"; on a miss, the stream buffer is checked.
- Works with data blocks too:
  - Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4 KB cache; 4 streams caught 43%.
  - Palacharla and Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64 KB, 4-way set-associative caches.
- Prefetching relies on having extra memory bandwidth that can be used without penalty.

Slide 3: Stream Buffer Diagram
[Figure: a direct-mapped cache (tags and data, one cache block per entry) with a stream buffer placed between the processor/cache and the next level of cache. Each stream-buffer entry holds a tag, a comparator, and one block of data; entries are managed as a FIFO queue (head and tail pointers) that prefetches the next sequential block (+1). Shown with a single stream buffer (way); multiple ways and a filter may be used. Source: Jouppi, ISCA 1990.]
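To make the check-on-miss behavior concrete, here is a minimal C sketch of a single four-entry stream buffer. The structure and function names, the FIFO refill policy, and the sequential (+1) prefetch rule are a simplification of Jouppi's scheme for illustration, not code from the lecture.

```c
/* Minimal sketch of a single 4-entry stream buffer (illustrative names). */
#include <stdio.h>
#include <stdbool.h>

#define SB_ENTRIES 4

typedef struct {
    unsigned long tag[SB_ENTRIES];  /* block addresses queued in FIFO order */
    bool valid[SB_ENTRIES];
    int head;                       /* oldest (next expected) entry */
    unsigned long next_prefetch;    /* block address the buffer fetches next */
} stream_buffer;

/* On a cache miss, check the head of the stream buffer.  A hit would move the
 * block into the cache (not modeled) and advances the FIFO, which triggers a
 * prefetch of the next sequential block into the freed slot. */
static bool sb_lookup(stream_buffer *sb, unsigned long block_addr)
{
    if (sb->valid[sb->head] && sb->tag[sb->head] == block_addr) {
        sb->tag[sb->head] = sb->next_prefetch++;  /* refill freed slot */
        sb->head = (sb->head + 1) % SB_ENTRIES;
        return true;
    }
    return false;
}

/* On a stream-buffer miss, flush the buffer and start a new stream at the
 * block after the one that missed (the "+1" in the diagram). */
static void sb_restart(stream_buffer *sb, unsigned long block_addr)
{
    for (int i = 0; i < SB_ENTRIES; i++) {
        sb->tag[i] = block_addr + 1 + i;
        sb->valid[i] = true;
    }
    sb->head = 0;
    sb->next_prefetch = block_addr + 1 + SB_ENTRIES;
}

int main(void)
{
    stream_buffer sb = {0};
    sb_restart(&sb, 100);                       /* miss on block 100 starts a stream */
    for (unsigned long b = 101; b <= 106; b++)  /* sequential misses now hit the buffer */
        printf("block %lu: %s\n", b, sb_lookup(&sb, b) ? "stream-buffer hit" : "miss");
    return 0;
}
```

A real implementation would also install the hit block in the cache and keep several buffers (ways) to track multiple streams at once.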

Slide 4: Victim Buffer Diagram
[Figure: the same direct-mapped cache (tags and data, one cache block per entry) backed by a small, fully associative victim cache between the cache and the next level of cache; each victim-cache entry holds a tag, a comparator, and one block of data. Proposed in the same paper: Jouppi, ISCA 1990.]
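Along the same lines, a rough C sketch of the victim cache's lookup-on-miss and insert-on-eviction operations follows; the entry count, names, and round-robin replacement are illustrative assumptions rather than details from the paper.

```c
/* Minimal sketch of a 4-entry, fully associative victim cache. */
#include <stdio.h>
#include <stdbool.h>

#define VC_ENTRIES 4

typedef struct {
    unsigned long block_addr[VC_ENTRIES];
    bool valid[VC_ENTRIES];
    int next_victim;               /* round-robin replacement pointer */
} victim_cache;

/* On a main-cache miss, search all entries (fully associative compare). */
static bool vc_lookup(const victim_cache *vc, unsigned long block_addr)
{
    for (int i = 0; i < VC_ENTRIES; i++)
        if (vc->valid[i] && vc->block_addr[i] == block_addr)
            return true;           /* would be swapped with the conflicting block */
    return false;
}

/* A block evicted from the direct-mapped cache is placed in the victim cache. */
static void vc_insert(victim_cache *vc, unsigned long evicted_block)
{
    vc->block_addr[vc->next_victim] = evicted_block;
    vc->valid[vc->next_victim] = true;
    vc->next_victim = (vc->next_victim + 1) % VC_ENTRIES;
}

int main(void)
{
    victim_cache vc = {0};
    vc_insert(&vc, 42);            /* block 42 evicted from the main cache */
    printf("block 42: %s\n", vc_lookup(&vc, 42) ? "victim-cache hit" : "miss");
    return 0;
}
```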

Slide 5: Reducing Misses by Software Prefetching of Data
- Data prefetch:
  - Register prefetch: load the data into a register (HP PA-RISC loads).
  - Cache prefetch: load the data into the cache (MIPS IV, PowerPC, SPARC V9).
  - Special prefetching instructions cannot cause faults; this is a form of speculative execution.
- Prefetching comes in two flavors:
  - Binding prefetch: loads directly into a register; the address and register must be correct.
  - Non-binding prefetch: loads into the cache; it may be incorrect, which frees hardware/software to guess.
- Issuing prefetch instructions takes time:
  - Is the cost of issuing prefetches less than the savings from reduced misses?
  - Wider superscalar issue makes the extra issue bandwidth easier to absorb.
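As a concrete illustration of a non-binding cache prefetch, the sketch below uses the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements is an assumed value for illustration, not one given in the lecture.

```c
/* Non-binding software prefetch via the GCC/Clang intrinsic. */
#include <stdio.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* assumed distance, tune for the machine */

double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* Hints a block into the cache; a useless address cannot fault,
             * it only wastes bandwidth. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], /*rw=*/0, /*locality=*/1);
        sum += a[i];
    }
    return sum;
}

int main(void)
{
    double a[1024];
    for (size_t i = 0; i < 1024; i++)
        a[i] = (double)i;
    printf("sum = %f\n", sum_array(a, 1024));
    return 0;
}
```

Each prefetch still occupies an issue slot, which is exactly the cost-versus-savings trade-off raised above.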

Slide 6: Improving Cache Performance
1. Reducing miss rate: larger block size, larger cache size, higher associativity, victim caches, way prediction and pseudo-associativity, compiler optimizations.
2. Reducing miss penalty: multilevel caches, critical word first, read miss first, merging write buffers.
3. Reducing miss penalty or miss rate via parallelism: non-blocking caches, hardware prefetching, compiler prefetching.
4. Reducing cache hit time: small and simple caches, avoiding address translation, pipelined cache access, trace caches.

Slide 7: Fast Hits by Avoiding Address Translation
[Figure: three organizations of the CPU, translation buffer (TB), cache ($), and memory.
- Conventional organization: the CPU sends the virtual address (VA) to the TB, and the resulting physical address (PA) indexes the cache, which then accesses memory.
- Virtually addressed cache: the CPU sends the VA directly to the cache; translation happens only on a miss. This raises the synonym problem.
- Overlapped organization: the cache is accessed with the VA while the TB translates in parallel, and the hit is determined by comparing PA tags; a physically addressed L2 cache sits below. This requires the cache index to remain invariant across translation.]

Slide 8: Fast Hits by Avoiding Address Translation
- Send the virtual address to the cache? This is called a virtually addressed cache, or just a virtual cache, as opposed to a physical cache.
- Every time the process is switched, the cache logically must be flushed; otherwise it can return false hits. The cost is the time to flush plus the "compulsory" misses from the then-empty cache.
- Aliases (sometimes called synonyms) must be handled: two different virtual addresses that map to the same physical address.
- I/O uses physical addresses, and to interact with a virtual cache it needs the corresponding virtual addresses.
- Antialiasing solution: guarantee that all aliases agree in the address bits that cover the index field; with a direct-mapped cache this makes the cache location unique. The OS restriction that enforces this is called page coloring.
- Solution to cache flushing: add a process-identifier tag that identifies the process as well as the address within the process, so a hit cannot occur for the wrong process.
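A minimal sketch of the process-ID idea, assuming an ASID-style identifier stored alongside each line's virtual tag (the field names and widths below are hypothetical):

```c
/* Virtual-cache tag check extended with a process identifier (ASID). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t va_tag;   /* virtual-address tag */
    uint16_t asid;     /* process identifier stored with the line */
    bool     valid;
} cache_line;

/* A hit now also requires the ASID to match, so a line left behind by
 * another process can never produce a false hit. */
static bool line_hits(const cache_line *line, uint64_t va_tag, uint16_t cur_asid)
{
    return line->valid && line->va_tag == va_tag && line->asid == cur_asid;
}

int main(void)
{
    cache_line line = { .va_tag = 0x1234, .asid = 7, .valid = true };
    printf("same process:  %s\n", line_hits(&line, 0x1234, 7) ? "hit" : "miss");
    printf("other process: %s\n", line_hits(&line, 0x1234, 8) ? "hit" : "miss");
    return 0;
}
```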

Slide 9: Fast Cache Hits by Avoiding Translation: Process-ID Impact
[Figure: miss rate (y-axis, up to 20%) versus cache size (x-axis, 2 KB to 1024 KB) for three cases: uniprocess (black), multiprocess with the cache flushed on each switch (light gray), and multiprocess using a process-ID tag (dark gray).]

Slide 10: Fast Cache Hits by Avoiding Translation: Index with the Physical Portion of the Address
- If a direct-mapped cache is no larger than a page, the index comes entirely from the physical portion of the address (the page offset), so the tag access can start in parallel with translation and the result can be compared against the physical tag.
- This limits the cache to the page size. What if we want a bigger cache while using the same trick?
  - Higher associativity moves the barrier to the right.
  - Page coloring.
- How does this compare with a virtual cache used with page coloring?
[Address layout: Page Address | Page Offset on top, aligned with Address Tag | Index | Block Offset below.]
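The index-invariance condition reduces to a simple arithmetic check: the bytes addressed by the index and block offset (cache size divided by associativity) must not exceed the page size. A small sketch with assumed, illustrative parameters:

```c
/* Check the virtually-indexed, physically-tagged size constraint. */
#include <stdio.h>

int main(void)
{
    unsigned page_size  = 4096;   /* 12-bit page offset (assumed) */
    unsigned cache_size = 16384;  /* total capacity in bytes (assumed) */
    unsigned assoc      = 4;      /* ways (assumed) */

    /* Index + block offset address cache_size / assoc bytes (one way); they
     * stay inside the page offset only if that amount <= page_size. */
    unsigned bytes_per_way = cache_size / assoc;

    if (bytes_per_way <= page_size)
        printf("OK: index comes from the page offset (%u <= %u)\n",
               bytes_per_way, page_size);
    else
        printf("Too big: raise associativity or use page coloring (%u > %u)\n",
               bytes_per_way, page_size);
    return 0;
}
```

With these numbers, the 16 KB cache passes only because it is 4-way: each way spans exactly one 4 KB page, which is what "higher associativity moves the barrier" means.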

Slide 11: Pipelined Cache Access
- For a multi-issue processor, cache bandwidth affects the effective cache hit time: queueing delay adds up if the cache does not have enough read/write ports.
- Pipelined cache accesses reduce the cache cycle time and improve bandwidth.
- Cache organizations for high bandwidth: duplicated cache, banked cache, double-clocked cache.
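As one example of a high-bandwidth organization, a banked cache interleaves blocks across banks so that independent accesses can proceed in parallel. The sketch below shows the usual bank-selection rule with assumed parameters (4 banks, 64-byte blocks); it is an illustration, not a design from the lecture.

```c
/* Block-level address interleaving for a banked cache. */
#include <stdio.h>

#define NUM_BANKS         4     /* assumed; must be a power of two */
#define BLOCK_OFFSET_BITS 6     /* assumed 64-byte blocks */

static unsigned bank_of(unsigned long addr)
{
    /* Low-order block-address bits select the bank, so consecutive
     * blocks land in different banks. */
    return (addr >> BLOCK_OFFSET_BITS) & (NUM_BANKS - 1);
}

int main(void)
{
    for (unsigned long addr = 0; addr < 512; addr += 64)
        printf("addr 0x%03lx -> bank %u\n", addr, bank_of(addr));
    return 0;
}
```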

Slide 12: Pipelined Cache Access: the Alpha Data Cache
- The cache is 64 KB, 2-way set associative, and cannot be accessed within one cycle.
- One cycle is used for address transfer and one for data transfer, pipelined with the data-array access.
- The cache clock frequency is double the processor frequency; wave pipelining is used to achieve this speed.

Slide 13: Trace Cache
- A trace is a dynamic sequence of instructions, including taken branches.
- Traces are constructed dynamically by the processor hardware, and frequently used traces are stored in the trace cache.
- Example: the Intel Pentium 4 processor stores about 12K micro-ops in its trace cache.

Slide 14: Summary of Reducing Cache Hit Time
- Small and simple caches: used for L1 instruction/data caches. Most L1 caches today are small but set associative and pipelined (emphasizing throughput?), and are paired with a large L2 cache or with L2/L3 caches.
- Avoiding address translation while indexing the cache: avoids the additional delay of a TLB access.

Slide 15: What Is the Impact of What We Have Learned About Caches?
- 1960-1985: Speed = ƒ(no. of operations).
- 1990: pipelined execution and fast clock rates; out-of-order execution; superscalar instruction issue.
- 1998: Speed = ƒ(non-cached memory accesses).
- What does this mean for compilers? Operating systems? Algorithms? Data structures?

Slide 16: Cache Optimization Summary
(MP = miss penalty, MR = miss rate, HT = hit time; "+" improves the factor, "-" hurts it)

Technique               MP   MR   HT   Complexity
Multilevel caches       +              2
Critical word first     +              2
Read miss first         +              1
Merging write buffer    +              1
Victim caches           +    +         2
Larger block size       -    +         0
Larger cache size            +    -    1
Higher associativity         +    -    1
Way prediction               +         2
Pseudo-associativity         +         2
Compiler techniques          +         0

(The first four techniques target miss penalty; the remainder target miss rate.)

Slide 17: Cache Optimization Summary (continued)
(MP = miss penalty, MR = miss rate, HT = hit time; "+" improves the factor, "-" hurts it)

Technique                      MP   MR   HT   Complexity
Non-blocking caches            +              3
Hardware prefetching           +              2/3
Software prefetching           +    +         3
Small and simple caches             -    +    0
Avoiding address translation             +    2
Pipelined cache access                   +    1
Trace cache                              +    3

(The first three techniques target miss penalty; the remainder target hit time.)