Cache (Memory) Performance Optimization

Average memory access time = Hit time + Miss rate x Miss penalty

To improve performance:
- reduce the miss rate (e.g., larger cache)
- reduce the miss penalty (e.g., L2 cache)
- reduce the hit time

The simplest design strategy is to design the largest primary cache without slowing down the clock or adding pipeline stages.
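The formula above can be evaluated directly. The following is a minimal sketch; the function names and the example numbers (1-cycle hit, 5% miss rate, 100-cycle memory, 10-cycle L2 with a 20% local miss rate) are illustrative assumptions, not figures from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time and miss_penalty."""
    return hit_time + miss_rate * miss_penalty


def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, mem_penalty):
    """AMAT with an L2 cache: the L1 miss penalty is itself the AMAT of the L2."""
    l1_miss_penalty = amat(l2_hit, l2_local_miss_rate, mem_penalty)
    return amat(l1_hit, l1_miss_rate, l1_miss_penalty)


# Hypothetical numbers, in clock cycles.
single_level = amat(1, 0.05, 100)                 # roughly 6 cycles
with_l2 = amat_two_level(1, 0.05, 10, 0.2, 100)   # roughly 2.5 cycles
print(single_level, with_l2)
```

Under these assumed numbers, adding the L2 leaves the hit time and miss rate of the primary cache unchanged and lowers only the miss penalty term.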

Compulsory: first reference to a block, a.k.a. cold-start misses
- misses that would occur even with an infinite cache
Capacity: cache is too small to hold all data needed by the program
- misses that would occur even under a perfect placement and replacement policy
Conflict: misses that occur because of collisions due to the block-placement strategy
- misses that would not occur with full associativity
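The three definitions above suggest a way to classify misses in a trace: compare the real cache against an infinite cache and against a fully associative cache of the same size. The sketch below assumes LRU replacement and a trace of block addresses; the function name `classify_misses` and its parameters are hypothetical.

```python
from collections import OrderedDict


def classify_misses(block_addresses, num_blocks, num_sets):
    """Count compulsory, capacity, and conflict misses for an LRU cache.

    num_blocks is the total cache size in blocks; num_sets = num_blocks
    gives a direct-mapped cache, num_sets = 1 a fully associative one.
    """
    ways = num_blocks // num_sets
    seen = set()                                     # infinite-cache model
    full = OrderedDict()                             # fully associative LRU model
    sets = [OrderedDict() for _ in range(num_sets)]  # the real cache
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}

    for b in block_addresses:
        # Reference the fully associative model.
        full_hit = b in full
        if full_hit:
            full.move_to_end(b)
        else:
            if len(full) == num_blocks:
                full.popitem(last=False)             # evict least recently used
            full[b] = True

        # Reference the real cache.
        s = sets[b % num_sets]
        real_hit = b in s
        if real_hit:
            s.move_to_end(b)
        else:
            if len(s) == ways:
                s.popitem(last=False)
            s[b] = True

        if not real_hit:
            if b not in seen:
                counts["compulsory"] += 1            # misses even in an infinite cache
            elif not full_hit:
                counts["capacity"] += 1              # misses even with full associativity
            else:
                counts["conflict"] += 1              # would have hit with full associativity
        seen.add(b)
    return counts


# Blocks 0 and 4 collide in a 4-block direct-mapped cache.
print(classify_misses([0, 4, 0, 4], num_blocks=4, num_sets=4))
```

In the usage line, the first two misses are compulsory and the last two are conflict misses, since a fully associative cache of four blocks would hold both blocks at once.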

Tags are too large, i.e., too much overhead
- Simple solution: larger blocks, but the miss penalty could be large
Sub-block placement
- A valid bit is added to units smaller than the full block, called sub-blocks
- Only read a sub-block on a miss
- If a tag matches, is the word in the cache?
The main reason for sub-block placement is to reduce tag overhead.
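A small model helps with the question above: with sub-block placement, a tag match alone is not enough, because the valid bit of the requested sub-block must also be set. This is a minimal sketch of one cache line; the class `SubBlockedLine` and its method are hypothetical names.

```python
class SubBlockedLine:
    """One cache line with a single tag and a valid bit per sub-block."""

    def __init__(self, num_sub_blocks):
        self.tag = None
        self.valid = [False] * num_sub_blocks

    def access(self, tag, sub_index):
        """Return True on a hit; on a miss, fetch only the requested sub-block."""
        if self.tag != tag:
            # Tag mismatch: the line is reassigned, so all sub-blocks become invalid.
            self.tag = tag
            self.valid = [False] * len(self.valid)
        if self.valid[sub_index]:
            return True
        # Tag matches (or was just installed) but this sub-block is not present.
        self.valid[sub_index] = True
        return False


line = SubBlockedLine(num_sub_blocks=4)
print(line.access(tag=7, sub_index=0))  # False: cold miss
print(line.access(tag=7, sub_index=0))  # True: tag and valid bit both match
print(line.access(tag=7, sub_index=1))  # False: tag matches, sub-block invalid
```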

- Writes take two cycles in the memory stage: one cycle for the tag check plus one cycle for the data write if hit
- Design a data RAM that can perform a read and a write in one cycle, and restore the old value after a tag miss
- Hold the write data for a store in a single buffer ahead of the cache, and write the cache data during the next store's tag check
- Need to bypass from the write buffer if a read matches the write buffer tag
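The delayed-write scheme in the last two points can be sketched as a behavioral model. This is an illustration only: it assumes every access hits, uses a dictionary as a stand-in for the data array, and the class name `PipelinedWriteCache` is hypothetical.

```python
class PipelinedWriteCache:
    """Behavioral model of a cache with a single delayed-write buffer."""

    def __init__(self):
        self.data = {}        # stand-in for the cache data array
        self.pending = None   # buffered store: (address, value), or None

    def store(self, address, value):
        # While this store's tag is being checked, the previous store's
        # data is written into the data array.
        if self.pending is not None:
            old_address, old_value = self.pending
            self.data[old_address] = old_value
        self.pending = (address, value)

    def load(self, address):
        # Bypass: the newest value may still sit in the write buffer.
        if self.pending is not None and self.pending[0] == address:
            return self.pending[1]
        return self.data.get(address)


cache = PipelinedWriteCache()
cache.store(0x10, 1)
print(cache.load(0x10))   # 1, served from the write buffer
cache.store(0x20, 2)      # commits the store to 0x10 into the data array
print(cache.load(0x10))   # 1, now served from the data array
```

Without the bypass in `load`, the first read of 0x10 would return stale data, because the store has not yet reached the data array.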

Speculate on future instruction and data accesses and fetch them into cache(s)
- Instruction accesses are easier to predict than data accesses
Varieties of prefetching
- Hardware prefetching
- Software prefetching
- Mixed schemes
What types of misses does prefetching affect?

- Usefulness: should produce hits
- Timeliness: not late and not too early
- Cache and bandwidth pollution

Instruction prefetch in Alpha AXP
- Fetch two blocks on a miss: the requested block and the next consecutive block
- Requested block is placed in the cache, and the next block in the instruction stream buffer

- Prefetch-on-miss accessing contiguous blocks
- Tagged prefetch accessing contiguous blocks
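The difference between the two schemes on contiguous blocks can be seen in a small simulation. The sketch below assumes the usual one-block-lookahead definitions (prefetch-on-miss fetches block b+1 whenever block b misses; tagged prefetch also fetches the next block on the first reference to a prefetched block), an unbounded cache, and prefetches that always arrive in time. The function names are hypothetical.

```python
def prefetch_on_miss(blocks):
    """Count misses when each miss to block b also fetches block b + 1."""
    cache, misses = set(), 0
    for b in blocks:
        if b not in cache:
            misses += 1
            cache.update((b, b + 1))   # demand fetch plus the next block
    return misses


def tagged_prefetch(blocks):
    """Count misses when a miss or the first use of a prefetched block
    triggers a prefetch of the next block."""
    cache, untouched, misses = set(), set(), 0
    for b in blocks:
        if b not in cache:
            misses += 1
            cache.add(b)
            trigger = True
        else:
            trigger = b in untouched   # tag bit: prefetched, not yet referenced
            untouched.discard(b)
        if trigger and b + 1 not in cache:
            cache.add(b + 1)
            untouched.add(b + 1)
    return misses


sequential = list(range(8))
print(prefetch_on_miss(sequential), tagged_prefetch(sequential))
```

On this sequential trace, prefetch-on-miss misses on every other block, while tagged prefetch misses only on the first block.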

What property do we require of the cache for prefetching to work?

Restructuring code affects the data block access sequence
- Group data accesses together to improve spatial locality
- Re-order data accesses to improve temporal locality
Prevent data from entering the cache
- Useful for variables that are only accessed once
Kill data that will never be used
- Streaming data exploits spatial locality but not temporal locality
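As an illustration of restructuring, the sketch below compares two traversal orders of the same matrix and counts misses in a tiny LRU cache. Loop interchange is chosen here as an example and is not named in the slide; the row-major layout, the 16 x 16 matrix, the 8-element blocks, and the 4-block cache are all assumptions.

```python
from collections import OrderedDict


def count_misses(addresses, block_size=8, num_blocks=4):
    """Count misses of a fully associative LRU cache on an address trace."""
    cache, misses = OrderedDict(), 0
    for a in addresses:
        b = a // block_size
        if b in cache:
            cache.move_to_end(b)
        else:
            misses += 1
            if len(cache) == num_blocks:
                cache.popitem(last=False)   # evict least recently used block
            cache[b] = True
    return misses


N = 16  # N x N matrix stored row-major: element (i, j) lives at address i * N + j

# Inner loop over rows: consecutive accesses are N elements apart.
column_order = [i * N + j for j in range(N) for i in range(N)]
# Loops interchanged: consecutive accesses are adjacent in memory.
row_order = [i * N + j for i in range(N) for j in range(N)]

print(count_misses(column_order), count_misses(row_order))
```

Both orders touch the same 256 elements, but under these assumptions the interchanged loop misses once per block (32 misses) while the original order misses on every access (256 misses).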

What type of locality does this improve?

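Upon a cache miss
- 4 clocks to send the address
- 24 clocks for the access time per word
- 4 clocks to send a word of data
Latency worsens with increasing block size.
A 4-word block needs 128 or 116 clocks: 4 * (4 + 24 + 4) = 128 for a dumb memory that repeats all three steps for every word, or 4 + 4 * (24 + 4) = 116 if the address is sent only once.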

Alpha AXP bits wide memory and cache.

Banks are often 1 word wide
- Send an address to all the banks
- How long to get 4 words back?
4 + 24 + 4 * 4 clocks = 44 clocks from interleaved memory.

- Send an address to all the banks
- How long to get 4 words back?
4 + 24 + 4 = 32 clocks from main memory for 4 words.
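The clock counts for the different memory organizations follow from the same three timing parameters (4 clocks for the address, 24 for the access, 4 per word transferred). The sketch below recomputes them; the function names are hypothetical, and the wide-memory case assumes the memory and bus are at least as wide as the block.

```python
ADDR, ACCESS, XFER = 4, 24, 4   # clocks: send address, access a word, send a word


def dumb(words):
    """Every word repeats all three steps."""
    return words * (ADDR + ACCESS + XFER)


def address_sent_once(words):
    """Address sent once, then each word is accessed and transferred in turn."""
    return ADDR + words * (ACCESS + XFER)


def wide(words):
    """Memory and bus as wide as the block: one access, one transfer."""
    return ADDR + ACCESS + XFER


def interleaved(words):
    """Banks are accessed in parallel; words are transferred one at a time."""
    return ADDR + ACCESS + words * XFER


for organization in (dumb, address_sent_once, wide, interleaved):
    print(organization.__name__, organization(4))
```

For a 4-word block this gives 128, 116, 32, and 44 clocks respectively.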

Consider a 128-bank memory in the NEC SX/3 where each bank can service independent requests
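How many banks actually work in parallel depends on where consecutive requests fall. The sketch below assumes word-interleaved banks (bank number = word address mod 128), which is an assumption for illustration and not a statement about the SX/3 design; the function name is hypothetical.

```python
def banks_touched(stride, num_banks=128, num_requests=128):
    """Number of distinct banks hit by a strided stream of word addresses,
    assuming bank = address mod num_banks."""
    return len({(i * stride) % num_banks for i in range(num_requests)})


for stride in (1, 2, 3, 64, 128):
    print(stride, banks_touched(stride))
```

Under this mapping, unit and odd strides spread requests over all 128 banks, while a stride of 128 sends every request to the same bank, so the requests can no longer be serviced independently.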