
1 CS 7960-4 Lecture 10 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers N.P. Jouppi, Proceedings of ISCA-17, 1990

2 Cache Basics [Figure: two-way cache; the set address drives a decoder into the tag and data arrays, comparators check the tags of both ways, and a mux selects the matching way's data]

3 Multiplexing [Figure]

4 Banking Sets get distributed Words/Ways get distributed Banking reduces access time per bank and overall power Allows multiple accesses without true multiporting [Figure: banked array showing wordlines and bitlines]

5 Virtual Memory A single physical address (A) can map to multiple virtual addresses (X, Y) The CPU provides addresses X and Y and the cache must make sure that both map to the same cache location Naive solution: perform virtual-to-physical translation (TLB) before accessing the cache

6 Page Coloring To identify potential cache locations and initiate the RAM look-up, only index bits are needed If the OS ensures that virtual index bits always match physical index bits, you can start the RAM look-up before completing the TLB look-up When both finish, use the newly obtained physical address for the tag comparison (note: can't use the virtual address for tag comparison) Virtually-Indexed, Physically-Tagged
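The index-bit constraint behind page coloring can be sketched in a few lines. The parameters below (4 KB pages, a 32 KB direct-mapped cache with 32 B lines) are illustrative assumptions, not from the slides: because the index spans 3 bits above the page offset, the OS must restrict each virtual page to physical pages sharing those 3 "color" bits.

```python
# Hypothetical parameters: 4 KB pages, 32 KB direct-mapped cache, 32 B lines.
PAGE_OFFSET_BITS = 12   # 4 KB page
LINE_OFFSET_BITS = 5    # 32 B line
INDEX_BITS = 10         # 32 KB / 32 B = 1024 sets

def index_of(addr):
    """Cache-set index: the bits just above the line offset."""
    return (addr >> LINE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

# The index uses address bits [5..14], but only bits [0..11] are the
# untranslated page offset.  Bits [12..14] come from the page number,
# so the OS must "color" pages: a virtual page may only map to a
# physical page whose low page-number bits match.
COLOR_BITS = LINE_OFFSET_BITS + INDEX_BITS - PAGE_OFFSET_BITS   # = 3

def color(page_number):
    return page_number & ((1 << COLOR_BITS) - 1)

# With matching colors, virtual and physical addresses index the same set:
vaddr = 0x12345678
vpage = vaddr >> PAGE_OFFSET_BITS
ppage = (0x0AB8 & ~((1 << COLOR_BITS) - 1)) | color(vpage)  # same color
paddr = (ppage << PAGE_OFFSET_BITS) | (vaddr & ((1 << PAGE_OFFSET_BITS) - 1))
assert index_of(vaddr) == index_of(paddr)
```

This is why the RAM look-up can start from the virtual address alone: every bit the index needs is either untranslated or guaranteed to match by the coloring policy.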

7 Memory Wall
Year | Clock speed | Memory latency | Latency in cycles
1997 | 0.75 GHz | 50+20 ns | 53 cycles
2011 | 10 GHz | 16 ns | 160 cycles
Memory latency improves by 10%/year Clock speed has traditionally improved by 50%/year, but will improve by only ~20%/year in the future
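The cycle counts in the table are just latency times clock rate; a one-line helper (an illustration, not from the slides) reproduces both rows:

```python
import math

def miss_cycles(latency_ns, clock_ghz):
    """Memory latency expressed in CPU cycles (ns * GHz, rounded up)."""
    return math.ceil(latency_ns * clock_ghz)

# 1997: 0.75 GHz clock, 50 ns DRAM access + 20 ns overhead -> 53 cycles
# 2011 (projected): 10 GHz clock, 16 ns latency -> 160 cycles
```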

8 Bottlenecks

9 Conflict Misses Direct-mapped caches have lower access times, but suffer from conflict misses Most conflict misses are localized to a few sets -- an associativity of 1.2 is desirable?

10 Victim Caches Every eviction from L1 gets put in the victim cache (VC and L1 are exclusive) Victim cache associative look-up can happen in parallel with L1 look-up – VC hit results in a swap L1 Victim cache
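A minimal sketch of the victim-cache policy described above, assuming a small LRU-managed fully-associative buffer (the class name, entry count, and data representation are illustrative, not from the paper):

```python
from collections import OrderedDict

class VictimCache:
    """Small fully-associative victim cache with LRU replacement (sketch)."""
    def __init__(self, entries=4):
        self.entries = entries
        self.lines = OrderedDict()   # line address -> line data

    def insert(self, addr, data):
        """Called on every L1 eviction (L1 and VC stay exclusive)."""
        self.lines[addr] = data
        self.lines.move_to_end(addr)
        if len(self.lines) > self.entries:
            self.lines.popitem(last=False)   # evict the LRU line

    def lookup(self, addr):
        """Probed in parallel with L1.  A hit removes the line here,
        because the caller swaps it back into L1 (exclusivity)."""
        return self.lines.pop(addr, None)
```

A line evicted from L1 can then be recovered on the very next conflict miss:

```python
vc = VictimCache(entries=4)
vc.insert(0x100, "line A")              # L1 evicts line A
assert vc.lookup(0x100) == "line A"     # next miss to A hits the VC
assert vc.lookup(0x100) is None         # line has moved back to L1
```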

11 Results The cache and line size influence the percentage of misses attributable to conflicts A 15-entry victim cache eliminates half the conflict misses, but the reduction in total cache misses is less than 20%

12 Prefetch Techniques Prefetch on miss fetches multiple lines for every cache miss Tagged prefetch waits until a prefetched line is touched before bringing in more lines Prefetch deals with capacity and compulsory misses, but causes cache pollution
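Tagged prefetch can be sketched as follows: each line carries a tag bit set when it arrives via prefetch and cleared on first use, and either a demand miss or that first use triggers the next sequential prefetch. The class and its one-line-ahead policy are illustrative assumptions, not the paper's exact hardware.

```python
class TaggedPrefetcher:
    """Tagged sequential prefetch (sketch).  Addresses are line numbers."""
    def __init__(self):
        self.cache = {}   # line addr -> tag bit (True = prefetched, untouched)

    def access(self, addr):
        """Returns the list of line addresses prefetched by this access."""
        prefetches = []
        if addr not in self.cache:        # demand miss: fetch + prefetch next
            self.cache[addr] = False
            prefetches.append(self._prefetch(addr + 1))
        elif self.cache[addr]:            # first touch of a prefetched line
            self.cache[addr] = False
            prefetches.append(self._prefetch(addr + 1))
        return prefetches                 # repeat touches prefetch nothing

    def _prefetch(self, addr):
        if addr not in self.cache:
            self.cache[addr] = True       # mark as prefetched, not yet used
        return addr
```

On a sequential walk this stays one line ahead of the demand stream, while a line that is prefetched but never touched generates no further prefetches, limiting pollution.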

13 Stream Buffers On a cache miss, fill the stream buffer with contiguous cache lines When you read the top of the queue, bring in the next line If the top of the queue does not service a miss, the stream buffer flushes and starts from scratch [Figure: L1 backed by a stream buffer holding sequential lines]
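The head-only FIFO behavior above can be sketched directly (the depth and method names are illustrative assumptions):

```python
from collections import deque

class StreamBuffer:
    """Jouppi-style stream buffer (sketch): a FIFO of sequential line
    addresses.  Only the head entry may service a miss; any other miss
    flushes the buffer and restarts the stream."""
    def __init__(self, depth=4):
        self.depth = depth
        self.queue = deque()

    def allocate(self, miss_addr):
        """On a cache miss, fill with the lines following the miss."""
        self.queue = deque(miss_addr + i for i in range(1, self.depth + 1))

    def probe(self, miss_addr):
        """Probe on a subsequent L1 miss; True = serviced by the buffer."""
        if self.queue and self.queue[0] == miss_addr:
            self.queue.popleft()              # head hit: line moves to L1
            self.queue.append(self.queue[-1] + 1 if self.queue
                              else miss_addr + 1)   # refill the tail
            return True
        self.allocate(miss_addr)              # head miss: flush and restart
        return False
```

A sequential miss stream hits the head repeatedly, while one non-sequential miss discards the whole buffer, which is exactly the top-of-queue constraint the next slide proposes to relax.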

14 Results Eight entries are enough to eliminate most capacity and compulsory misses 72% of I-cache misses and 25% of D-cache misses are eliminated Multiple stream buffers help eliminate 43% of D-cache misses Large cache lines diminish the stream buffer's impact (the stream buffer removes only 10% of D-cache misses at a 128B line size)

15 Potential Improvements Relax the top-of-q constraint for the stream buffer Maintain a stride value to detect non-sequential accesses

16 Bottlenecks Again For 4KB caches, 16B lines

17 Harmonic and Arithmetic Means HM of IPC = N / (1/IPC_a + 1/IPC_b + 1/IPC_c) = N / (CPI_a + CPI_b + CPI_c) = 1 / (AM of CPI) This weights each benchmark as if every benchmark executes the same number of instructions If you instead want to assume each benchmark executes for the same time, HM of CPI or AM of IPC is appropriate
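The identity on this slide (harmonic mean of IPC = reciprocal of the arithmetic mean of CPI, since CPI = 1/IPC) is easy to check numerically; the IPC values below are made up for illustration:

```python
def hmean(xs):
    """Harmonic mean: N over the sum of reciprocals."""
    return len(xs) / sum(1.0 / x for x in xs)

def amean(xs):
    """Arithmetic mean."""
    return sum(xs) / len(xs)

ipcs = [2.0, 1.0, 0.5]           # illustrative per-benchmark IPCs
cpis = [1.0 / i for i in ipcs]   # the corresponding CPIs

# HM of IPC equals the reciprocal of AM of CPI:
assert abs(hmean(ipcs) - 1.0 / amean(cpis)) < 1e-12
```

Note that the naive arithmetic mean of IPC here is about 1.17, while the harmonic mean is about 0.86; the two summaries answer different questions about the workload.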

18 Next Week’s Paper “Memory Dependence Prediction Using Store Sets”, Chrysos and Emer, ISCA-25, 1998


