Associativity in Caches - Lecture 25 (CDA 5155)

New Topic: Memory Systems
- Cache 101: review of undergraduate material
- Associativity and other organization issues
- Advanced designs and interactions with pipelines
- Tomorrow's cache design (power/performance)
- Advances in memory design
- Virtual memory (and how to do it fast)

Direct-mapped cache
[Figure: a direct-mapped cache probed with the 5-bit memory address 01011. The address splits into a 2-bit tag, a 2-bit line index, and a 1-bit block offset; each cache line holds a valid bit (V), a dirty bit (d), a tag, and data.]
The three kinds of misses:
- Compulsory miss: first reference to a memory block
- Capacity miss: the working set doesn't fit in the cache
- Conflict miss: the working set maps to the same cache line
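
To make the address split concrete, here is a minimal sketch that decomposes the slide's example address 01011. The field widths come from the slide; everything else is illustrative.

```c
#include <stdio.h>
#include <stdint.h>

/* Field widths from the slide: 1-bit block offset, 2-bit line
 * index, and the remaining bits are the tag. */
#define OFFSET_BITS 1
#define INDEX_BITS  2

int main(void) {
    uint8_t addr = 0x0B;                                  /* 01011 from the slide */
    uint8_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint8_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint8_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=%u index=%u offset=%u\n", tag, index, offset);  /* tag=1 index=1 offset=1 */
    return 0;
}
```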

2-way set associative cache
[Figure: the same cache reorganized into two ways, probed with memory address 01101. The block offset is unchanged, the set index shrinks to 1 bit, and the tag grows to 3 bits.]
Rule of thumb: increasing associativity decreases conflict misses. A 2-way set associative cache has about the same hit rate as a direct-mapped cache of twice the size.
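
A sketch of what the lookup does with the extra ways: compare the request tag against every valid way in the selected set (in parallel in hardware, sequentially here). The struct and names are illustrative, not from the slides; a direct-mapped cache is the special case WAYS == 1.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 2

typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t tag;
    uint8_t  data[4];
} line_t;

/* Returns the matching way, or -1 on a miss. */
int lookup(line_t set[WAYS], uint32_t tag) {
    for (int way = 0; way < WAYS; way++)        /* one comparator per way in hardware */
        if (set[way].valid && set[way].tag == tag)
            return way;
    return -1;
}
```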

Effects of Varying Cache Parameters
Total cache size = block size × # sets × associativity (for example, 32-byte blocks × 128 sets × 4 ways = 16 KB).
Increasing the total size:
Positives:
- Should decrease miss rate
Negatives:
- May increase hit time
- Increased area requirements

Effects of Varying Cache Parameters: Bigger block size
Positives:
- Exploits spatial locality; reduces compulsory misses
- Reduces tag overhead (bits)
- Reduces transfer overhead (address, burst data mode)
Negatives:
- Fewer blocks for a given size; increases conflict misses
- Increases miss transfer time (multi-cycle transfers)
- Wasted bandwidth for non-spatial data

Effects of Varying Cache Parameters: Increasing associativity
Positives:
- Reduces conflict misses
- Low-associativity caches can show pathological behavior (very high miss rates)
Negatives:
- Increased hit time
- More hardware requirements (comparators, muxes, bigger tags)
- Diminishing improvements past 4- or 8-way

Effects of Varying Cache Parameters: Replacement strategy (for associative caches)
How is the evicted line chosen?
- LRU: intuitive; difficult to implement at high associativity; worst-case performance can occur (e.g., cycling through an (N+1)-element array that maps to one N-way set misses every time)
- Random: pseudo-random is easy to implement; performance is close to LRU at high associativity
- Optimal: replace the block whose next reference is farthest in the future (Belady replacement); hard to implement, since it requires future knowledge
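
A minimal sketch of age-counter LRU for one set. Real designs usually keep pseudo-LRU bits rather than full counters; the names and the counter scheme here are illustrative only.

```c
#include <stdint.h>

#define WAYS 4

/* One age counter per way; assume ages start distinct (0..WAYS-1). */
typedef struct { uint8_t age[WAYS]; } lru_t;

void lru_touch(lru_t *s, int way) {
    uint8_t old = s->age[way];
    for (int w = 0; w < WAYS; w++)
        if (s->age[w] < old)
            s->age[w]++;           /* everything more recent ages by one */
    s->age[way] = 0;               /* touched way is now most recent */
}

int lru_victim(const lru_t *s) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (s->age[w] > s->age[victim])
            victim = w;
    return victim;                 /* oldest way is evicted */
}
```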

Other Cache Design Decisions
Write policy: how to deal with write misses?
- Write-through / no-allocate
  - Total traffic: read misses × block size + writes
  - Common for L1 caches backed by an L2 (esp. on-chip)
- Write-back / write-allocate
  - Needs a dirty bit to determine whether cache data differs from memory
  - Total traffic: (read misses + write misses) × block size + dirty-block evictions × block size
  - Common for L2 caches (memory-bandwidth limited)
- Variation: write-validate
  - Write-allocate without fetch-on-write
  - Needs a sub-block cache with valid bits for each word/byte
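
A worked example of the two traffic formulas. All counts below are made up, and the word size used for write-through traffic is an added assumption (the slide doesn't give one).

```c
#include <stdio.h>

int main(void) {
    long block = 32, word = 4;     /* assumed block and word sizes, in bytes */
    long read_miss = 1000, write_miss = 400, writes = 2500, dirty_evict = 300;

    long wt = read_miss * block + writes * word;                      /* write-through / no-allocate */
    long wb = (read_miss + write_miss) * block + dirty_evict * block; /* write-back / write-allocate */

    printf("write-through traffic: %ld bytes\n", wt);  /* 42000 */
    printf("write-back traffic:    %ld bytes\n", wb);  /* 54400 */
    return 0;
}
```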

Other Cache Design Decisions
Write buffering:
- Delay writes until bandwidth is available: put them in a FIFO buffer, and only stall on a write if the buffer is full
- Use bandwidth for reads first (since reads have latency problems)
- Important for write-through caches, since write traffic is frequent
Write-back buffer:
- Holds evicted (dirty) lines for write-back caches
- Also allows reads to have priority on the L2 or memory bus
- Usually only needs a small buffer
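
A minimal sketch of the FIFO write buffer just described; the depth and all names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define WBUF_SIZE 8   /* illustrative depth */

typedef struct { uint32_t addr; uint32_t data; } entry_t;

static entry_t buf[WBUF_SIZE];
static int head = 0, count = 0;

/* Enqueue a write; returns false when full, i.e., the processor must stall. */
bool wbuf_push(uint32_t addr, uint32_t data) {
    if (count == WBUF_SIZE) return false;
    buf[(head + count) % WBUF_SIZE] = (entry_t){addr, data};
    count++;
    return true;
}

/* Drain one entry when the bus is idle (reads get priority). */
bool wbuf_drain_one(void) {
    if (count == 0) return false;
    /* write buf[head] to the next memory level here */
    head = (head + 1) % WBUF_SIZE;
    count--;
    return true;
}
```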

Adding a Victim Cache
[Figure: a direct-mapped L1 (V/d/tag/data) alongside a small fully associative victim cache of 4 lines; references 11010011 and 01010011 conflict in the direct-mapped array but can coexist once the victim cache holds the evicted line.]
- A small victim cache adds associativity to "hot" lines
- Blocks evicted from the direct-mapped cache go to the victim cache
- Tag compares are made to both the direct-mapped cache and the victim cache
- Victim hits cause lines to swap between L1 and the victim cache
- Not very useful for associative L1 caches
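
A sketch of the combined lookup-and-swap. One detail worth noting: because the victim cache is fully associative, its tags must cover the whole block address (tag plus index bits). All names here are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define VICTIM_LINES 4

typedef struct { bool valid; uint32_t tag; uint8_t data[4]; } line_t;

/* Probe the direct-mapped L1 and the victim cache together.
 * On a victim hit, swap the two lines so the hot line moves
 * back into L1 and the evicted L1 line lands in the victim. */
bool lookup(line_t *l1, line_t victim[VICTIM_LINES],
            uint32_t tag, uint32_t index, uint32_t index_bits) {
    uint32_t block_addr = (tag << index_bits) | index;

    if (l1[index].valid && l1[index].tag == tag)
        return true;                                   /* L1 hit */

    for (int v = 0; v < VICTIM_LINES; v++) {
        if (victim[v].valid && victim[v].tag == block_addr) {
            line_t evicted = l1[index];                /* L1 line moves out... */
            evicted.tag = (evicted.tag << index_bits) | index;
            l1[index] = victim[v];
            l1[index].tag = tag;                       /* ...hot line moves back in */
            victim[v] = evicted;
            return true;                               /* victim hit (small extra latency) */
        }
    }
    return false;                                      /* true miss: go to L2/memory */
}
```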

Hash-Rehash Cache
[Figure sequence: a direct-mapped cache (V/d/tag/data) probed with references 11010011, 01010011, 01000011, and 11000011. A reference that misses its primary line is probed again at a rehashed index; if that second probe also misses, the block is allocated, and a rehash bit (R) marks lines stored in their alternate location. The final reference misses its primary line but hits on the rehash probe ("Rehash hit!").]
The idea: get associativity-like conflict behavior from a direct-mapped array by giving each block two candidate lines, probed sequentially.

Hash-Rehash Cache: Calculating performance
- Primary hit time: same as a normal direct-mapped cache
- Rehash hit time: slower, since tag lookups are sequential
- Block swap time?
- Hit rate comparable to a 2-way associative cache
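
Plugging illustrative numbers into that breakdown shows how the formula combines the two probe levels. All rates and latencies below are made up; the structure of the computation is the point.

```c
#include <stdio.h>

/* AMAT = p_hit1*t1 + p_hit2*t2 + p_miss*t_miss */
int main(void) {
    double p_hit1 = 0.90, t1 = 1.0;      /* primary probe hits: 1 cycle  */
    double p_hit2 = 0.05, t2 = 3.0;      /* rehash probe hits: 3 cycles  */
    double p_miss = 0.05, tmiss = 20.0;  /* both probes miss: go to L2   */

    double amat = p_hit1 * t1 + p_hit2 * t2 + p_miss * tmiss;
    printf("AMAT = %.2f cycles\n", amat);   /* 0.90 + 0.15 + 1.00 = 2.05 */
    return 0;
}
```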

Compiler support for caching
- Array merging (array of structs vs. two parallel arrays)
- Loop interchange (row- vs. column-order access)
- Structure padding and alignment (malloc)
- Cache-conscious data placement: pack the working set into the same line; map to non-conflicting addresses if packing is impossible
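
For the loop-interchange item, a short sketch in C: arrays are stored row-major, so the inner loop should walk the rightmost index to get unit-stride, cache-friendly accesses.

```c
#define N 1024

double a[N][N];

/* Poor: inner loop strides by N doubles, touching a new line each time. */
double sum_column_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Good: inner loop walks consecutive elements within each cache line. */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```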

Prefetching
Caches already prefetch in a sense: a miss brings in an entire line, assuming spatial locality. Extending this idea:
- Next-line prefetch: bring in the next block in memory as well as the missed line (very good for the I-cache)
- Software prefetch: loads to R0 have no data dependency, so they serve as pure prefetch hints (see the sketch below)
- Aggressive/speculative prefetch is useful for L2; speculative prefetch is problematic for L1
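
A sketch of software prefetching. The slide's "load to R0" is the classic non-binding MIPS/Alpha idiom; GCC and Clang expose the same idea as the __builtin_prefetch intrinsic used here. The lookahead distance is a made-up tuning parameter.

```c
#define PREFETCH_AHEAD 16   /* illustrative; tune to miss latency / loop cost */

double sum(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&x[i + PREFETCH_AHEAD]);  /* hint only: no fault, no dependency */
        s += x[i];
    }
    return s;
}
```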

Calculating the Effects of Latency Does a cache miss reduce performance? It depends on whether there are critical instructions waiting for the result

Calculating the Effects of Latency
It also depends on whether critical resources are held up.
- Blocking cache: when a miss occurs, all later references to the cache must wait; this is a resource conflict.
- Non-blocking cache: later references may access the cache while a miss is being processed; generally there is some limit on how many outstanding misses can be bypassed.
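
A sketch of the bookkeeping that bounds outstanding misses in a non-blocking cache. The slide doesn't name the structure, but such entries are commonly called miss status holding registers (MSHRs); everything here is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_OUTSTANDING 4   /* illustrative MSHR count */

typedef struct { bool busy; uint32_t block_addr; } mshr_t;
static mshr_t mshr[MAX_OUTSTANDING];

/* Can this miss proceed without stalling? A blocking cache is the
 * special case MAX_OUTSTANDING == 1. */
bool miss_can_issue(uint32_t block_addr) {
    for (int i = 0; i < MAX_OUTSTANDING; i++)
        if (mshr[i].busy && mshr[i].block_addr == block_addr)
            return true;    /* merge with an existing miss to the same block */
    for (int i = 0; i < MAX_OUTSTANDING; i++)
        if (!mshr[i].busy)
            return true;    /* a free entry is available */
    return false;           /* structural stall: all entries busy */
}
```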