Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers Norman P. Jouppi Presenter: Shrinivas Narayani

Contents
 Cache basics
 Types of cache misses
 Cost of cache misses
 How to reduce cache misses
– Larger block size
– Adding associativity (reducing conflict misses): miss cache; victim cache, an improvement over the miss cache
– Removing capacity and compulsory misses: prefetch techniques; stream buffers
 Conclusion

Mapping
Cache index = (Block address) modulo (Number of blocks in the cache). The cache is indexed using the low-order bits of the block address; e.g., in an 8-block cache, memory addresses 00001 and 11101 map to cache locations 001 and 101, respectively. The remaining high-order bits of the address are stored as the tag and compared on each access.
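As a rough illustration of this mapping, the C sketch below extracts the index and tag from an address. The line size, block count, and function names are made-up example values for illustration, not parameters from the paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative direct-mapped index/tag extraction.
   LINE_BYTES and NUM_BLOCKS are assumed example values. */
#define LINE_BYTES 32u
#define NUM_BLOCKS  8u

static uint32_t cache_index(uint32_t addr) {
    uint32_t block = addr / LINE_BYTES;   /* block address */
    return block % NUM_BLOCKS;            /* low-order bits select the cache location */
}

static uint32_t cache_tag(uint32_t addr) {
    uint32_t block = addr / LINE_BYTES;
    return block / NUM_BLOCKS;            /* high-order bits form the tag */
}

int main(void) {
    uint32_t addr = 0x12345678;
    printf("index = %u, tag = 0x%x\n",
           (unsigned)cache_index(addr), (unsigned)cache_tag(addr));
    return 0;
}
```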

Direct Mapped Cache

Cache Terminology
 Cache hit
 Cache miss
 Miss penalty: the time to replace a block in the upper level with the corresponding block from the lower level.

 In a direct-mapped cache, there is only one place the newly requested item can go, and hence only one choice of what to replace.

Types of Misses
– Compulsory: the very first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses that occur even in an infinite cache.)
– Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur as blocks are discarded and later retrieved. (Misses due to cache size.)
– Conflict: if the block-placement strategy is set-associative or direct-mapped, conflict misses occur (in addition to compulsory and capacity misses) because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (Misses present in an N-way associative cache but not in a fully-associative cache of the same size.)
– Coherence: misses that result from invalidations needed to preserve multiprocessor cache consistency.

Conflict misses account for between 20% and 40% of all direct-mapped cache misses.

Cost of Cache Misses
Cycle time has been decreasing much faster than main-memory access time, and the average number of machine cycles per instruction has also been decreasing dramatically. Together, these two effects multiply the relative cost of a cache miss. E.g., a cache miss on a VAX 11/780 cost only about 60% of the average instruction execution time, so even if every instruction missed, the machine would slow down by only about 60%; on faster, more recent machines the relative cost of a miss is much higher.
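To make this concrete, the small C sketch below computes how misses inflate the cycles per instruction. All of the numbers (base CPI, references per instruction, miss rate, miss penalty) are illustrative assumptions, not measurements from the paper.

```c
#include <stdio.h>

/* Illustrative calculation of how cache misses inflate CPI.
   Every input value here is an assumed example, not a figure from the paper. */
int main(void) {
    double base_cpi     = 1.5;   /* cycles per instruction with a perfect cache */
    double refs_per_ins = 1.3;   /* memory references per instruction */
    double miss_rate    = 0.05;  /* fraction of references that miss */
    double miss_penalty = 50.0;  /* cycles to service a miss from main memory */

    double stall_cpi = refs_per_ins * miss_rate * miss_penalty;
    double total_cpi = base_cpi + stall_cpi;

    printf("stall cycles per instruction: %.2f\n", stall_cpi);
    printf("effective CPI: %.2f (%.1fx slowdown)\n",
           total_cpi, total_cpi / base_cpi);
    return 0;
}
```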

How to Reduce Cache Misses
 Increase block size
 Increase associativity
 Use a victim cache
 Use a pseudo-associative cache
 Hardware prefetching
 Compiler-controlled prefetching
 Compiler optimizations

Increasing Block Size
One way to reduce the miss rate is to increase the block size:
– Reduces compulsory misses. Why? Larger blocks take advantage of spatial locality.
However, larger blocks have disadvantages:
– May increase the miss penalty (more data must be fetched on each miss)
– May increase hit time (more data to read from the cache and a larger mux)
– May increase conflict and capacity misses (fewer blocks fit in the same total capacity)

Adding Associativity: Miss Cache
[Figure: a small fully-associative miss cache sits between the direct-mapped cache and the next lower level. Each entry holds a tag, a comparator, and one cache line of data, ordered from MRU to LRU.]
– On every access, the upper (direct-mapped) cache and the miss cache are probed in parallel.
– When a miss occurs, the data returned from the next lower cache is placed in both the direct-mapped cache and the miss cache.
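The C sketch below approximates the miss-cache lookup described above: both structures are checked on each access, and a line returned from the next lower level is placed in both. The sizes, the tag-only model (no data stored), and the function names are simplifying assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sketch of a direct-mapped L1 backed by a tiny fully-associative miss cache.
   Sizes and struct layout are assumptions; real hardware also holds the data. */
#define L1_SETS      64
#define MISS_ENTRIES  4   /* the miss cache holds only a few lines */

typedef struct { bool valid; uint32_t tag;  } L1Line;
typedef struct { bool valid; uint32_t addr; } MissLine;  /* full block address */

static L1Line   l1[L1_SETS];
static MissLine miss_cache[MISS_ENTRIES];  /* index 0 = MRU, last = LRU */

static void miss_cache_insert(uint32_t block_addr) {
    /* insert at the MRU position, pushing the LRU entry out */
    memmove(&miss_cache[1], &miss_cache[0], (MISS_ENTRIES - 1) * sizeof(MissLine));
    miss_cache[0].valid = true;
    miss_cache[0].addr = block_addr;
}

/* returns true on a hit in either the L1 or the miss cache */
bool access(uint32_t block_addr) {
    uint32_t set = block_addr % L1_SETS;
    uint32_t tag = block_addr / L1_SETS;

    /* in hardware both caches are probed in parallel */
    if (l1[set].valid && l1[set].tag == tag)
        return true;                         /* L1 hit */

    for (int i = 0; i < MISS_ENTRIES; i++) {
        if (miss_cache[i].valid && miss_cache[i].addr == block_addr) {
            l1[set].valid = true;            /* short on-chip refill from the miss cache */
            l1[set].tag = tag;
            return true;
        }
    }

    /* true miss: the line from the next lower cache is placed in both structures */
    l1[set].valid = true;
    l1[set].tag = tag;
    miss_cache_insert(block_addr);
    return false;
}

int main(void) {
    access(5);            /* miss */
    access(5 + L1_SETS);  /* conflicting block evicts block 5 from L1 */
    printf("block 5 again: %s\n", access(5) ? "hit (miss cache)" : "miss");
    return 0;
}
```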

Performance of the Miss Cache
– Replaces a long off-chip miss penalty with a short, one-cycle on-chip miss.
– Removes a larger fraction of data-cache conflict misses than of instruction-cache misses.

Disadvantage of the Miss Cache
 Storage space is wasted because lines in the miss cache duplicate lines in the direct-mapped cache.

Victim Cache
– An improvement over the miss cache: it is loaded with the victim line (the line evicted from the direct-mapped cache) instead of the requested line, so no line is duplicated.
– On a miss that hits in the victim cache, the contents of the direct-mapped cache entry and the matching victim-cache entry are swapped.
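A minimal C sketch of this behavior is given below, using the same simplified tag-only model as the miss-cache sketch: on a hit in the victim cache, the direct-mapped entry and the victim entry are swapped. Sizes, names, and the replacement shortcut on a true miss are illustrative assumptions, not the paper's exact design.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of a direct-mapped L1 with a small fully-associative victim cache. */
#define L1_SETS        64
#define VICTIM_ENTRIES  4

typedef struct { bool valid; uint32_t tag;  } L1Line;
typedef struct { bool valid; uint32_t addr; } VictimLine;  /* full block address */

static L1Line     l1[L1_SETS];
static VictimLine victim[VICTIM_ENTRIES];

bool access(uint32_t block_addr) {
    uint32_t set = block_addr % L1_SETS;
    uint32_t tag = block_addr / L1_SETS;

    if (l1[set].valid && l1[set].tag == tag)
        return true;                                  /* L1 hit */

    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim[i].valid && victim[i].addr == block_addr) {
            /* swap: the requested line moves into L1, the evicted L1 line
               (the victim) takes its place in the victim cache */
            uint32_t evicted = l1[set].tag * L1_SETS + set;
            bool had_line = l1[set].valid;
            l1[set].valid = true;
            l1[set].tag = tag;
            victim[i].valid = had_line;
            victim[i].addr = evicted;
            return true;                              /* hit in the victim cache */
        }
    }

    /* true miss: the evicted L1 line goes to the victim cache
       (entry 0 used here for brevity instead of the true LRU slot) */
    if (l1[set].valid) {
        victim[0].valid = true;
        victim[0].addr = l1[set].tag * L1_SETS + set;
    }
    l1[set].valid = true;
    l1[set].tag = tag;
    return false;
}

int main(void) {
    access(5);            /* miss: block 5 fills L1 set 5 */
    access(5 + L1_SETS);  /* conflict miss: block 5 is evicted to the victim cache */
    printf("block 5 again: %s\n", access(5) ? "hit (victim cache)" : "miss");
    return 0;
}
```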

The Effect of Direct-Mapped Cache Size on Victim Cache Performance
– As the direct-mapped cache size increases, the likelihood that a conflict miss can be removed by the victim cache decreases.

Reducing Capacity and Compulsory Misses
Use prefetch techniques:
1. Prefetch always
2. Prefetch on miss
3. Tagged prefetch

Prefetch Techniques
– Prefetch always: prefetches the next line after every reference.
– Prefetch on miss: fetches the next sequential line whenever a miss occurs.
– Tagged prefetch: each block has a tag bit associated with it. When a block is prefetched its tag bit is set to zero; when the block is first used the bit is set to one, and on this zero-to-one transition the next sequential block is prefetched.
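The C sketch below approximates the tagged-prefetch rule: a prefetched block's tag bit starts at zero, and the zero-to-one transition on first use triggers a prefetch of the next sequential block. The cache model and the fetch_from_memory() stub are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of tagged prefetch on a simple direct-mapped cache model. */
#define SETS 64

typedef struct { bool valid; uint32_t tag; bool used; } Line;  /* used = tag bit */
static Line cache[SETS];

static void fetch_from_memory(uint32_t block_addr) { (void)block_addr; /* stub */ }

static void install(uint32_t block_addr, bool used_bit) {
    uint32_t set = block_addr % SETS;
    cache[set].valid = true;
    cache[set].tag = block_addr / SETS;
    cache[set].used = used_bit;
    fetch_from_memory(block_addr);
}

void reference(uint32_t block_addr) {
    uint32_t set = block_addr % SETS;
    uint32_t tag = block_addr / SETS;

    if (cache[set].valid && cache[set].tag == tag) {
        if (!cache[set].used) {
            cache[set].used = true;            /* zero-to-one transition */
            install(block_addr + 1, false);    /* prefetch next block, tag bit 0 */
        }
        return;                                /* hit */
    }

    install(block_addr, true);                 /* demand miss: tag bit starts at 1 */
    install(block_addr + 1, false);            /* and prefetch the next block */
}

int main(void) {
    reference(10);   /* demand miss: fetch block 10, prefetch block 11 */
    reference(11);   /* first use of block 11: block 12 is prefetched */
    printf("block 12 prefetched: %s\n",
           (cache[12 % SETS].valid && cache[12 % SETS].tag == 12 / SETS) ? "yes" : "no");
    return 0;
}
```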

Stream Buffers
Start the prefetch before a tag transition can take place.

A stream buffer consists of a series of entries, each holding a tag, an available bit, and a data line. On a miss, it begins fetching successive lines starting at the miss target. Lines after the requested line are placed in the buffer rather than the cache, which avoids polluting the cache with data that may never be needed.
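Below is a minimal C sketch of a single stream buffer: a small FIFO of sequentially prefetched lines whose head is compared on each cache miss. A head hit shifts the FIFO and prefetches a new line into the freed slot; a mismatch flushes the buffer and restarts the stream at the miss target. The depth, the prefetch stub, and the function names are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of a single stream buffer: only the head entry is compared. */
#define DEPTH 4

typedef struct { bool available; uint32_t addr; } Entry;

static Entry    buf[DEPTH];      /* buf[0] is the head of the FIFO */
static uint32_t next_prefetch;   /* next sequential line address to fetch */

static void start_prefetch(uint32_t block_addr, int slot) {
    buf[slot].addr = block_addr;
    buf[slot].available = true;  /* assume the line arrives instantly in this sketch */
}

static void restart(uint32_t miss_addr) {
    /* flush the buffer and begin fetching successive lines after the miss target */
    for (int i = 0; i < DEPTH; i++)
        start_prefetch(miss_addr + 1 + i, i);
    next_prefetch = miss_addr + 1 + DEPTH;
}

/* called on a cache miss; returns true if the head of the buffer supplies
   the line (a quick refill instead of a full memory access) */
bool stream_buffer_lookup(uint32_t miss_addr) {
    if (buf[0].available && buf[0].addr == miss_addr) {
        /* shift the FIFO up and prefetch a new line into the freed tail slot */
        for (int i = 0; i < DEPTH - 1; i++)
            buf[i] = buf[i + 1];
        start_prefetch(next_prefetch++, DEPTH - 1);
        return true;
    }
    restart(miss_addr);          /* head mismatch: flush and restart the stream */
    return false;
}

int main(void) {
    restart(100);   /* a miss at block 100 starts a stream at block 101 */
    printf("miss at 101: %s\n", stream_buffer_lookup(101) ? "served by buffer" : "memory");
    printf("miss at 102: %s\n", stream_buffer_lookup(102) ? "served by buffer" : "memory");
    return 0;
}
```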

Multi-Way Stream Buffers
▪ A single data stream buffer removes only about 25% of data-cache misses, because data references tend to consist of interleaved streams from different sources.
▪ Multi-way stream buffers place four stream buffers in parallel.
▪ Instruction-stream performance is unchanged, but data-stream performance is roughly twice that of a single stream buffer.

Stream Buffers vs. Prefetch
– Stream buffers are feasible to implement and have lower latency on a hit.
– The extra hardware required by stream buffers is comparable to the additional tag storage required by tagged prefetch.

Stream Buffer Performance vs. Cache Size
– Only the data stream buffer's performance improves as the cache size increases, since a larger cache can contain data for reference patterns that access several sets of data.

Conclusion
– The miss cache is beneficial for removing data-cache conflict misses.
– The victim cache is an improvement over the miss cache: it saves the victim of the cache miss instead of the target (requested) line.
– Stream buffers reduce capacity and compulsory misses.
– Multi-way stream buffers are a set of stream buffers that can prefetch down several streams concurrently.

References
– Norman P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers."
– D. Patterson and J. Hennessy, Computer Organization and Design.