Caches
Where is a block placed in a cache?
–Three possible answers → three different types:
Anywhere → Fully associative
Only into one block → Direct mapped
Into a subset of blocks → Set associative

How is a block found?
Cache has an address tag for each block
Tags are checked in parallel for a match
Also has a valid bit
Processor address: | Block Address (Tag | Index) | Block Offset |
–Tag identifies the block, Index identifies the set, Block offset identifies the data within the block
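As an illustration of the address split (mine, not from the slides; the geometry below is assumed), a minimal C sketch extracting the tag, index, and offset fields for a cache whose set count and block size are powers of two:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry for illustration: 64-byte blocks, 512 sets
   (e.g. a 64 kB two-way set associative cache). */
#define BLOCK_SIZE  64
#define NUM_SETS    512
#define OFFSET_BITS 6          /* log2(BLOCK_SIZE) */
#define INDEX_BITS  9          /* log2(NUM_SETS)   */

int main(void) {
    uint32_t addr   = 0x0040A3C4;                              /* example processor address        */
    uint32_t offset = addr & (BLOCK_SIZE - 1);                 /* identifies the byte in the block */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  /* identifies the set               */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);      /* compared against the stored tags */
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```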

Which block should be replaced on a miss?
Direct mapped: –Simple (there can only be one!)
Associative caches: –Choice involved –Three techniques: Random, Least-recently used (LRU, often only approximated), FIFO (approximates LRU)
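A minimal sketch (mine, not the slides') of true LRU within one set, kept as an age counter per way; real hardware usually approximates this, e.g. with a single FIFO/round-robin bit per set as in the Alpha data cache described later. The associativity and field choices here are assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define WAYS 4   /* assumed associativity for the example */

/* One cache set with true LRU kept as an age per way:
   age 0 = most recently used, WAYS-1 = least recently used. */
struct cache_set {
    uint32_t tag[WAYS];
    int      valid[WAYS];
    int      age[WAYS];
};

/* Mark 'used' as most recently used; every way that was younger ages by one. */
void lru_touch(struct cache_set *s, int used) {
    for (int w = 0; w < WAYS; w++)
        if (s->age[w] < s->age[used])
            s->age[w]++;
    s->age[used] = 0;
}

/* Pick a victim: an invalid way if one exists, otherwise the oldest way. */
int lru_victim(const struct cache_set *s) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w])
            return w;
        if (s->age[w] > s->age[victim])
            victim = w;
    }
    return victim;
}

int main(void) {
    struct cache_set s = { .age = { 0, 1, 2, 3 }, .valid = { 1, 1, 1, 1 } };
    lru_touch(&s, 2);                                 /* way 2 becomes most recently used */
    printf("victim is way %d\n", lru_victim(&s));     /* way 3 is now the oldest          */
    return 0;
}
```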

Random vs LRU (16kB cache) [chart: Miss Rate (%)]

Random vs LRU (256kB cache) [chart: Miss Rate (%)]

What happens on a write?
Reads predominate –Instruction fetches, and more loads than stores –MIPS instruction mix: 10% stores, 37% loads
Writes: 7% of memory traffic, 21% of data traffic
Amdahl’s Law: we can’t ignore them!

Write Strategy
Writes must complete the tag check before the write can start –A read can sometimes proceed safely while tags are checked
Writes must modify only the addressed part of the block –Reads can read more than is required

Write Strategy
Two main approaches:
–Write through: CPU writes update both the cache and main memory
–Write back: CPU writes update only the cache; a dirty bit marks modified blocks, which are written to main memory when replaced

Advantages
Write back –Writes occur at cache speed –Only one memory access after multiple writes to a block (lower memory bandwidth)
Write through –Efficient read misses (no dirty block to write back) –Simple implementation –Memory and cache are consistent: good for multiprocessors!

Optimising Write Through Reduce write stalls –Write buffer Processor continues while write buffer updates memory

Handling Write Misses Write allocate –Fetch block into cache on miss –Good with write back No-write allocate –Memory is updated without loading block into cache –Good with write through
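To make the two write-miss policies concrete, here is a toy, self-contained C sketch (assumed, not from the slides); the direct-mapped cache model and the addresses are purely illustrative:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy direct-mapped cache: 16 blocks of 64 bytes; only tags are modelled. */
#define BLOCKS     16
#define BLOCK_SIZE 64

static uint32_t tags[BLOCKS];
static bool     valid[BLOCKS];

static bool cache_lookup(uint32_t addr) {
    uint32_t idx = (addr / BLOCK_SIZE) % BLOCKS;
    return valid[idx] && tags[idx] == addr / (BLOCK_SIZE * BLOCKS);
}

static void fetch_block(uint32_t addr) {            /* install the block's tag */
    uint32_t idx = (addr / BLOCK_SIZE) % BLOCKS;
    valid[idx] = true;
    tags[idx]  = addr / (BLOCK_SIZE * BLOCKS);
}

static void handle_store(uint32_t addr, bool write_allocate) {
    if (cache_lookup(addr)) {
        printf("write hit  @0x%x\n", (unsigned)addr);   /* update the block (and memory too, if write through) */
    } else if (write_allocate) {
        fetch_block(addr);                              /* write allocate: bring the block in, then write it   */
        printf("write miss @0x%x -> block allocated\n", (unsigned)addr);
    } else {
        printf("write miss @0x%x -> memory updated only\n", (unsigned)addr);   /* no-write allocate */
    }
}

int main(void) {
    handle_store(0x1000, true);    /* miss, block allocated (write-back style)        */
    handle_store(0x1004, true);    /* hit: same block as before                       */
    handle_store(0x2000, false);   /* miss, memory updated only (write-through style) */
    return 0;
}
```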

Alpha Data Cache
Data cache –64kB –64-byte blocks –2-way set associative –Write back, write allocate –Victim buffer (similar to a write buffer): 8 blocks

Alpha Data Cache Hit

Data Cache Uses FIFO (one bit per set) If victim buffer is full, CPU must stall Write miss: –Write allocate –Similar to read miss

Performance
Hit –3 cycles (three-cycle load delay)
Miss –9 ns to transfer data from the next level (6 clock cycles at 667 MHz)

Alpha Instruction Cache Instruction cache –Separate from data cache –64kB

Separate Caches
Doubles available bandwidth –Prevents fetch unit stalling on data accesses
Caches can be optimised separately
–UltraSPARC: Data cache: 16kB, direct mapped, 2 × 16-byte sub-blocks; Instruction cache: 16kB, 2-way set associative, 32-byte blocks

Unified Caches
Hold both data and instructions
Miss rates for instructions are much lower than for data (an order of magnitude)
Unified cache may have a slightly better overall miss rate –16kB data cache: 11.4% –16kB instruction cache: 0.4% –32kB unified cache: 3.18% (vs 3.24% effective miss rate for the split pair)
BUT: extra cycle stall for the unified cache, so average memory access time is slower (4.44 rather than 4.24 cycles)

5.3. Cache Performance Miss rate can be misleading –See last example! Better measure is average memory access time = Hit time + Miss rate × Miss penalty
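A quick numeric sketch of the formula; the hit time, miss rate, and miss penalty below are made-up values for illustration only:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative values only: 1-cycle hit, 2% miss rate, 100-cycle miss penalty. */
    double hit_time = 1.0, miss_rate = 0.02, miss_penalty = 100.0;
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.02 * 100 = 3.00 cycles */
    return 0;
}
```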

Performance Issues
Cache is a very significant factor –Example: CPU time increased by a factor of 4
Particularly for: –Low CPI machines –Fast clock speeds
Simplicity of a direct-mapped cache may give a faster clock rate

Miss Penalty and Out of Order Execution Processor may be able to do useful work during cache miss Makes analysis of cache performance very difficult! Can have a significant impact

Improving Cache Performance
Very important topic –1600 papers in 6 years! (2nd Edition) –5000 papers in 13 years! (3rd Edition)

Improving Cache Performance Four categories of optimisation: –Reduce miss rate –Reduce miss penalty –Reduce miss rate or miss penalty using parallelism –Reduce hit time AMAT = Hit time + Miss rate × Miss penalty

5.4. Reducing Miss Penalty Traditionally, focus on miss rate But, cost of miss penalties is increasing dramatically

Multi-level Caches
Two caches –A small, fast one close to the CPU –A big, slower one between the first cache and memory
[diagram: CPU → L1 cache → L2 cache → Main Memory]

Second-level caches
–Complicate analysis

Analysis of two-level caches Local miss rate –Number of misses / number of accesses to this cache –Artificially high for L2 cache Global miss rate –Number of misses / number of accesses by CPU –Miss rate L1 × Miss rate L2 for L2 cache
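A small worked sketch (the access counts are illustrative, not from the slides) showing how the local and global L2 miss rates relate:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative counts: of 1000 CPU accesses, 40 miss in L1 and 20 of those also miss in L2. */
    double cpu_accesses = 1000.0, l1_misses = 40.0, l2_misses = 20.0;

    double l1_miss_rate        = l1_misses / cpu_accesses;   /* local = global for L1: 4%     */
    double l2_local_miss_rate  = l2_misses / l1_misses;      /* misses / accesses to L2: 50%  */
    double l2_global_miss_rate = l2_misses / cpu_accesses;   /* misses / CPU accesses: 2%     */

    printf("L1 miss rate        = %.2f%%\n", 100 * l1_miss_rate);
    printf("L2 local miss rate  = %.2f%%\n", 100 * l2_local_miss_rate);
    printf("L2 global miss rate = %.2f%%\n", 100 * l2_global_miss_rate);
    /* Global L2 miss rate = Miss rate L1 x local Miss rate L2 = 0.04 * 0.5 = 0.02 */
    return 0;
}
```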

Design of two-level caches Second level cache should be large –Minimises local miss rate –Big blocks are more feasible (reducing miss rate) Multilevel inclusion property –All data in L1 is also in L2 –Useful for multiprocessor consistency Can be enforced at L2

Early restart & critical word first
Minimise CPU waiting time
Early restart –As soon as the requested word arrives, send it to the CPU
Critical word first –Request the required word from memory first, then fill the rest of the cache block
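A minimal sketch (assumed; the block size and requested word are illustrative) of the wrap-around order a critical-word-first refill might use: the requested word is fetched first, then the remaining words of the block:

```c
#include <stdio.h>

#define WORDS_PER_BLOCK 8

int main(void) {
    int critical = 5;   /* index of the word the CPU actually asked for (illustrative) */
    /* With critical word first the CPU can restart as soon as word 5 arrives;
       with early restart alone, words arrive 0..7 and the CPU restarts at word 5. */
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf("fetch word %d\n", (critical + i) % WORDS_PER_BLOCK);
    return 0;
}
```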

Prioritising read misses
Write-through caches normally make use of a write buffer
Problem: may lead to RAW hazards (a read miss could bypass a buffered write to the same address)
Solution: stall the read miss until the write buffer empties –May increase the read miss penalty by as much as 50%
Better solution: check the write buffer for a conflict
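A rough sketch (assumed, not from the slides) of the "check the write buffer for a conflict" alternative: on a read miss the buffer is searched for the missing address, and only if there is no match may the read go to memory ahead of the queued writes:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES 4

/* Simple write buffer: pending stores waiting to reach memory. */
struct wb_entry { uint32_t addr; uint32_t data; bool valid; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss, forward data from the write buffer if the address matches;
   only when there is no conflict may the read bypass the buffered writes. */
bool read_miss_check_buffer(uint32_t addr, uint32_t *data_out) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data_out = write_buffer[i].data;   /* RAW hazard: take the newest value */
            return true;
        }
    }
    return false;   /* no conflict: the memory read may proceed */
}

int main(void) {
    write_buffer[0] = (struct wb_entry){ .addr = 0x100, .data = 42, .valid = true };
    uint32_t v;
    if (read_miss_check_buffer(0x100, &v))
        printf("forwarded %u from the write buffer\n", (unsigned)v);
    return 0;
}
```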

Prioritising read misses Write-back caches –Long read misses due to writing back dirty block Solution: –Write buffer –Handle read miss then write back the dirty block –Need to do the same conflict checking (or stall for the write buffer to drain)

Merging Write Buffer Write buffers merge data being written to the same area of memory Benefits: –More efficient use of buffer –Reduces stalls due to write buffer being full
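An assumed sketch of write merging (entry count, block size, and word size are illustrative): a new store is combined with an existing buffer entry when it falls in the same block, instead of consuming a fresh entry:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES      4
#define WORDS_PER_ENTRY 8      /* one entry covers a 32-byte block of 4-byte words */

struct wb_entry {
    uint32_t block_addr;                 /* block-aligned address   */
    uint32_t data[WORDS_PER_ENTRY];
    uint8_t  word_valid;                 /* bitmap of words present */
    bool     valid;
};
static struct wb_entry wb[WB_ENTRIES];

/* Returns false if the buffer is full and the CPU would have to stall. */
bool wb_write(uint32_t addr, uint32_t value) {
    uint32_t block = addr & ~(uint32_t)(WORDS_PER_ENTRY * 4 - 1);
    int word = (addr >> 2) & (WORDS_PER_ENTRY - 1);

    for (int i = 0; i < WB_ENTRIES; i++)             /* try to merge with an existing entry */
        if (wb[i].valid && wb[i].block_addr == block) {
            wb[i].data[word] = value;
            wb[i].word_valid |= 1u << word;
            return true;
        }
    for (int i = 0; i < WB_ENTRIES; i++)             /* otherwise take a free entry */
        if (!wb[i].valid) {
            wb[i].valid = true;
            wb[i].block_addr = block;
            wb[i].data[word] = value;
            wb[i].word_valid = 1u << word;
            return true;
        }
    return false;                                    /* buffer full: stall */
}

int main(void) {
    wb_write(0x1000, 1);   /* takes a new entry           */
    wb_write(0x1004, 2);   /* merges into the same entry  */
    wb_write(0x2000, 3);   /* different block: new entry  */
    printf("entry 0 word-valid bitmap: 0x%x\n", (unsigned)wb[0].word_valid);
    return 0;
}
```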

Victim Caches
Small (around 5 entries), fully associative cache on the refill path –Holds recently discarded blocks
Exploits temporal locality
–Experiment (4kB, direct-mapped cache): a 4-entry victim cache removed 20% to 95% of conflict misses
AMD Athlon: 8-entry victim cache
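A rough sketch (assumed, not the Athlon's actual mechanism) of how a victim cache sits on the refill path: a block evicted from the main cache drops into the small fully associative buffer, and a main-cache miss checks it before going to the next level:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VICTIM_ENTRIES 4

struct victim_entry { uint32_t block_addr; bool valid; };
static struct victim_entry victim_cache[VICTIM_ENTRIES];
static int victim_next;                       /* simple FIFO replacement */

/* Called when the main cache evicts a block. */
void victim_insert(uint32_t block_addr) {
    victim_cache[victim_next].block_addr = block_addr;
    victim_cache[victim_next].valid = true;
    victim_next = (victim_next + 1) % VICTIM_ENTRIES;
}

/* Called on a main-cache miss; a hit here avoids the trip to the next level. */
bool victim_lookup(uint32_t block_addr) {
    for (int i = 0; i < VICTIM_ENTRIES; i++)      /* fully associative: check every entry */
        if (victim_cache[i].valid && victim_cache[i].block_addr == block_addr)
            return true;
    return false;
}

int main(void) {
    victim_insert(0x4000 >> 6);                                      /* block evicted from the main cache */
    printf("hit in victim cache: %d\n", victim_lookup(0x4000 >> 6)); /* 1: recently discarded block found */
    printf("hit in victim cache: %d\n", victim_lookup(0x8000 >> 6)); /* 0: must go to the next level      */
    return 0;
}
```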