
1 EECS 470 Cache Systems Lecture 13 Coverage: Chapter 5

2 Memory pyramid (Cache Design 101)
Reg (100s of bytes): 1-cycle access (early in pipeline)
L1 Cache (several KB): 1-3 cycle access
L2 Cache (½-32MB): 6-15 cycle access
Memory (128MB - few GB): 50-300 cycle access
Disk (many GB): millions of cycles access!

3 Direct-mapped cache
(Figure: direct-mapped cache with V/d/tag/data lines beside a small memory; example address 01101.)
Address fields: Tag (2-bit) | Line Index (2-bit) | Block Offset (1-bit)
Compulsory Miss: first reference to a memory block
Capacity Miss: working set doesn't fit in the cache
Conflict Miss: working set maps to the same cache line
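To make the field split concrete, here is a minimal C sketch that decodes the slide's example address 01101 using the widths above (the constants and helper names are mine, not from the slides):

```c
#include <stdio.h>

/* Field widths from the slide: tag (2) | line index (2) | block offset (1). */
#define OFFSET_BITS 1
#define INDEX_BITS  2

int main(void) {
    unsigned addr   = 0x0D;  /* 01101, the example address on the slide */
    unsigned offset = addr & ((1u << OFFSET_BITS) - 1);
    unsigned index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    unsigned tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=%u index=%u offset=%u\n", tag, index, offset);  /* tag=1 index=2 offset=1 */
    return 0;
}
```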

4 2-way set associative cache
(Figure: 2-way set associative cache with V/d/tag/data lines beside a small memory; example address 01101.)
Address fields: Larger (3-bit) Tag | 1-bit Set Index | Block Offset (unchanged)
Rule of thumb: increasing associativity decreases conflict misses. A 2-way associative cache has about the same hit rate as a direct-mapped cache twice the size.
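A sketch of the 2-way lookup for this geometry (1-bit set index, 3-bit tag); the structs and names are illustrative, not from the slides:

```c
#include <stdbool.h>

struct line { bool valid; unsigned tag; /* + data */ };
struct set  { struct line way[2]; };

/* Probe both ways of the selected set; 1-bit offset, 1-bit set index. */
bool lookup(const struct set sets[2], unsigned addr) {
    unsigned set_idx = (addr >> 1) & 1;  /* 1-bit set index   */
    unsigned tag     = addr >> 2;        /* remaining 3 bits  */
    for (int w = 0; w < 2; w++)
        if (sets[set_idx].way[w].valid && sets[set_idx].way[w].tag == tag)
            return true;   /* hit in way w */
    return false;          /* miss in both ways */
}
```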

5 Effects of Varying Cache Parameters
Total cache size = block size × # sets × associativity
–Positives: Should decrease miss rate
–Negatives: May increase hit time; increased area requirements
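A quick worked instance of the size formula; the parameters are assumed examples, not from the slides:

```c
#include <stdio.h>

int main(void) {
    /* total size = block size x #sets x associativity (assumed example) */
    unsigned block_bytes = 32, num_sets = 128, ways = 2;
    printf("total = %u KB\n", block_bytes * num_sets * ways / 1024);  /* 8 KB */
    return 0;
}
```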

6 Effects of Varying Cache Parameters
Bigger block size
–Positives: Exploits spatial locality; reduces compulsory misses
Reduces tag overhead (bits)
Reduces transfer overhead (address, burst data mode)
–Negatives: Fewer blocks for a given size; increases conflict misses
Increases miss transfer time (multi-cycle transfers)
Wastes bandwidth for non-spatial data

7 Effects of Varying Cache Parameters
Increasing associativity
–Positives: Reduces conflict misses
Avoids the pathological behavior (very high miss rates) a low-associativity cache can exhibit
–Negatives: Increased hit time
More hardware requirements (comparators, muxes, bigger tags)
Diminishing improvements past 4- or 8-way

8 Effects of Varying Cache Parameters
Replacement Strategy (for associative caches):
LRU: intuitive; difficult to implement with high associativity; worst-case behavior can occur (e.g., cycling through an N+1-element array in an N-way cache)
Random: pseudo-random is easy to implement; performance close to LRU for high associativity
Optimal: replace the block whose next reference is farthest in the future; hard to implement (requires knowledge of future references)
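The parenthetical N+1 case is worth spelling out: cyclically touching one more conflicting block than a set can hold makes LRU evict exactly the block that is needed next, so every access misses. A sketch, assuming a 4-way set and a stride that maps all blocks to the same set:

```c
/* LRU worst case: 5 blocks cycled through a 4-way set all miss.
 * WAYS and STRIDE are assumed values for illustration. */
enum { WAYS = 4, STRIDE = 4096 };  /* stride chosen so blocks share a set */

void thrash(volatile char *buf) {
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i <= WAYS; i++)   /* N+1 = 5 conflicting blocks */
            (void)buf[i * STRIDE];        /* under LRU, each one misses */
}
```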

9 Other Cache Design Decisions
Write Policy: how to deal with write misses?
–Write-through / no-allocate
Total traffic? read misses × block size + writes
Common for L1 caches backed by an L2 (esp. on-chip)
–Write-back / write-allocate
Needs a dirty bit to determine whether cache data differs from memory
Total traffic? (read misses + write misses) × block size + dirty-block evictions × block size
Common for L2 caches (memory-bandwidth limited)
–Variation: write-validate
Write-allocate without fetch-on-write
Needs a sub-blocked cache with valid bits for each word/byte
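A sketch evaluating the two traffic expressions above side by side; all counts are assumed example numbers (and writes are taken as 4-byte words):

```c
#include <stdio.h>

int main(void) {
    unsigned block = 32;                              /* bytes per block */
    unsigned read_misses = 1000, write_misses = 400;  /* assumed counts  */
    unsigned writes = 5000, dirty_evictions = 300;    /* assumed counts  */

    /* Write-through / no-allocate: every write goes to memory. */
    unsigned wt = read_misses * block + writes * 4;

    /* Write-back / write-allocate: misses fetch a block,
     * dirty evictions write a block back. */
    unsigned wb = (read_misses + write_misses) * block + dirty_evictions * block;

    printf("write-through: %u B, write-back: %u B\n", wt, wb);
    return 0;
}
```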

10 Other Cache Design Decisions
Write Buffering
–Delay writes until bandwidth is available
Put them in a FIFO buffer
Only stall on a write if the buffer is full
Use bandwidth for reads first (since they have latency problems)
–Important for write-through caches, since write traffic is frequent
Write-back buffer
–Holds evicted (dirty) lines for write-back caches
Also allows reads to have priority on the L2 or memory bus
Usually only needs a small buffer
Ref: Eager Writeback Caches
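A minimal sketch of the FIFO write buffer described above; the size and names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define WBUF_ENTRIES 8   /* assumed: "usually only needs a small buffer" */

struct wbuf_entry { uint32_t addr; uint32_t data; };
struct wbuf { struct wbuf_entry e[WBUF_ENTRIES]; unsigned head, tail, count; };

/* Accept a write; returns false (stall) only when the buffer is full. */
bool wbuf_push(struct wbuf *b, uint32_t addr, uint32_t data) {
    if (b->count == WBUF_ENTRIES) return false;
    b->e[b->tail] = (struct wbuf_entry){ addr, data };
    b->tail = (b->tail + 1) % WBUF_ENTRIES;
    b->count++;
    return true;
}

/* Drain one entry when the bus is not needed for reads. */
bool wbuf_drain_one(struct wbuf *b, struct wbuf_entry *out) {
    if (b->count == 0) return false;
    *out = b->e[b->head];
    b->head = (b->head + 1) % WBUF_ENTRIES;
    b->count--;
    return true;
}
```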

11 Adding a Victim cache
(Figure: direct-mapped L1 beside a small fully associative victim cache (4 lines); example references 11010011 and 01010011.)
A small victim cache adds associativity to "hot" lines
Blocks evicted from the direct-mapped cache go to the victim cache
Tag compares are made to both the direct-mapped cache and the victim cache
Victim hits cause lines to swap between L1 and the victim cache
Not very useful for associative L1 caches
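A sketch of the swap-on-victim-hit behavior (structures and names are illustrative; real victim caches must store the full tag, since lines from any set can land there):

```c
#include <stdbool.h>

#define VICTIM_LINES 4   /* the 4-line victim cache from the slide */

struct vline { bool valid; unsigned tag; /* + data */ };

/* Probe the fully associative victim cache; on a hit, swap the line
 * with the conflicting direct-mapped L1 line. */
bool victim_probe(struct vline victim[VICTIM_LINES],
                  struct vline *l1_line, unsigned full_tag) {
    for (int i = 0; i < VICTIM_LINES; i++) {
        if (victim[i].valid && victim[i].tag == full_tag) {
            struct vline tmp = *l1_line;   /* swap L1 <-> victim */
            *l1_line = victim[i];
            victim[i] = tmp;
            return true;                   /* victim hit */
        }
    }
    return false;                          /* miss in both */
}
```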

12 Hash-Rehash Cache
(Figure: direct-mapped cache; example references 11010011 and 01010011.)

13 Hash-Rehash Cache
(Figure: reference 01000011: primary miss, then rehash miss; allocate?)

14 Hash-Rehash Cache
(Figure: after the double miss, the block is allocated and the displaced line is marked as a rehash (R) line.)

15 Hash-Rehash Cache
(Figure: reference 11000011: primary miss, then rehash hit.)

16 Hash-Rehash Cache
Calculating performance:
–Primary hit time (same as a normal direct-mapped cache)
–Rehash hit time (sequential tag lookups)
–Block swap time?
–Hit rate comparable to a 2-way set-associative cache
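Putting the pieces above into one number: a sketch of the resulting average access time, with assumed probabilities and latencies (none of these figures are from the slides):

```c
#include <stdio.h>

int main(void) {
    double p_hit1 = 0.90;  /* assumed: primary-probe hit rate           */
    double p_hit2 = 0.05;  /* assumed: rehash-probe hit rate            */
    double t1     = 1.0;   /* primary hit time (cycles)                 */
    double t2     = 3.0;   /* rehash hit time: sequential second lookup */
    double t_miss = 20.0;  /* assumed miss penalty to L2/memory         */

    double avg = p_hit1 * t1
               + p_hit2 * t2
               + (1.0 - p_hit1 - p_hit2) * (t2 + t_miss);
    printf("average access time = %.2f cycles\n", avg);
    return 0;
}
```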

17 Compiler support for caching
Array merging (array of structs vs. 2 arrays)
Loop interchange (row vs. column access)
Structure padding and alignment (malloc)
Cache-conscious data placement
–Pack the working set into the same line
–Map to non-conflicting addresses if packing is impossible
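Two of these transformations in miniature (names and sizes are illustrative):

```c
#include <stddef.h>

#define N 1024

/* Array merging: one array of structs keeps key[i] and val[i] in the
 * same cache line, instead of two parallel arrays. */
struct pair { int key; int val; };
struct pair merged[N];            /* vs. int key[N]; int val[N]; */

/* Loop interchange: C matrices are row-major, so keep the column
 * index in the inner loop for stride-1 accesses. */
double m[N][N];

double sum_row_major(void) {
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)       /* rows outer              */
        for (size_t j = 0; j < N; j++)   /* columns inner: stride-1 */
            sum += m[i][j];
    return sum;
}
```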

18 Prefetching
Already done: bring in an entire line, assuming spatial locality
Extend this...
Next-line prefetch
–Bring in the next block in memory as well as the missed line (very good for the I-cache)
Software prefetch
–Loads to R0 have no data dependency
Aggressive/speculative prefetch is useful for L2
Speculative prefetch is problematic for L1
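On a modern compiler the load-to-R0 trick becomes a builtin; a sketch using GCC/Clang's __builtin_prefetch, where the 16-element prefetch distance is an assumed tuning value:

```c
#include <stddef.h>

double sum_with_prefetch(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)  /* prefetch 16 elements ahead (assumed distance) */
            __builtin_prefetch(&a[i + 16]);
        sum += a[i];
    }
    return sum;
}
```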

19 Calculating the Effects of Latency
Does a cache miss reduce performance?
–It depends on whether there are critical instructions waiting for the result

20 Calculating the Effects of Latency
–It depends on whether critical resources are held up
–Blocking: when a miss occurs, all later references to the cache must wait. This is a resource conflict.
–Non-blocking: allows later references to access the cache while a miss is being processed. Generally there is some limit to how many misses can be outstanding at once.
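A sketch of the kind of code that benefits from a non-blocking cache: the gathered loads below are independent, so several misses can be outstanding at once, while a blocking cache would serialize them (function and names are illustrative):

```c
#include <stddef.h>

long sum_gather(const long *a, const size_t *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[idx[i]];   /* independent misses can overlap */
    return sum;
}
```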

21 P4 Overview (Todd's slides)
Latest IA-32 processor from Intel
–Equipped with the full set of IA-32 SIMD operations
–First flagship architecture since the P6 microarchitecture
–Pentium 4 ISA = Pentium III ISA + SSE2
–SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating-point operations + prefetch

22 Comparison Between Pentium III and Pentium 4

23 Trace Cache
Primary instruction cache in the P4 architecture
–Stores 12k decoded µops
–On a miss, instructions are fetched from L2
–Trace predictor connects traces
Trace cache removes
–Decode latency after mispredictions
–Decode power for all pre-decoded instructions

24 Execution Pipeline

25 Store and Load Scheduling
Out-of-order store and load operations
–Stores always commit in program order
–Up to 48 loads and 24 stores can be in flight
Store/load buffers are allocated at the allocation stage
–Total: 24 store buffers and 48 load buffers

26 Data Stream of Pentium 4 Processor

27 On-chip Caches
L1 instruction cache (Trace Cache)
L1 data cache
L2 unified cache
–All caches use a pseudo-LRU replacement algorithm
Parameters: (table not captured in the transcript)

28 L1 Data Cache
Non-blocking
–Supports up to 4 outstanding load misses
Load latency
–2 clocks for integer loads
–6 clocks for floating-point loads
1 load and 1 store per clock
Load speculation
–Assume the access will hit the cache
–"Replay" the dependent instructions when a miss is detected

29 L2 Cache
Non-blocking
Load latency
–Net load access latency of 7 cycles
Bandwidth
–1 load and 1 store in one cycle
–A new cache operation may begin every 2 cycles
–256-bit wide bus between L1 and L2
–48 GB/s @ 1.5 GHz
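The quoted bandwidth follows from the bus width and clock, assuming one 256-bit transfer per core cycle; a quick check:

```c
#include <stdio.h>

int main(void) {
    double bus_bytes = 256.0 / 8.0;  /* 256-bit bus = 32 bytes/transfer */
    double clock_hz  = 1.5e9;        /* 1.5 GHz                         */
    printf("%.0f GB/s\n", bus_bytes * clock_hz / 1e9);  /* 48 GB/s      */
    return 0;
}
```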

30 L2 Cache Data Prefetcher
A hardware prefetcher monitors reference patterns
Brings cache lines in automatically
Attempts to fetch 256 bytes ahead of the current access
Prefetches for up to 8 simultaneous independent streams

