Lecture 11 & 12: Caches Cache overview 4 Hierarchy questions More on Locality Please bring these slides to the next lecture!

Exam Tuesday 2/17 –7:15-9:15 –1504 GGBrown –Study old exams!!! Exam Q&A –Sunday 2/15 4-6pm in 1670 Beyster –90% will be old exam questions –Bring paper copies of the old exams. –In class on Tuesday similar Q&A.

Class project Project restrictions –The I-cache and D-cache size is limited to the size it has in the tarball. –Can’t have more CDBs then degree of super scalarness. –Must use our multiplier (P2) M-token –Get one from the computer showcase if you haven’t. Milestone 1: It is due on Thursday 2/24 (7 days after exam!) by 9pm Should have high-level design worked on over the weekend. Module you hand in should be: –Self testing, well written. –Be aware of sample (single) RS.

Memory pyramid Disk (Many GB) Memory (128MB – fewGB) L2 Cache (½-32MB) L1 Cache (several KB) Reg 100s bytes Cache Design 101 1 cycle access (early in pipeline) 1-4 cycle access 6-15 cycle access 100-500 cycle access Millions cycle access!

Cache 1 1 a : a hiding place especially for concealing and preserving provisions or implements b : a secure place of storage 3 : a computer memory with very short access time used for storage of frequently used instructions or data -- called also cache memory 1 From Merriam-Webster on-line Cache overview

Locality of Reference Principle of Locality: –Programs tend to reuse data and instructions near those they have used recently. –Temporal locality: recently referenced items are likely to be referenced in the near future. –Spatial locality: items with nearby addresses tend to be referenced close together in time. sum = 0; for (i = 0; i < n; i++) sum += a[i]; *v = sum; Locality in Example: Data –Reference array elements in succession (spatial) Instructions –Reference instructions in sequence (spatial) –Cycle through loop repeatedly (temporal)

Caching: The Basic Idea Main Memory –Stores words A–Z in example Cache –Stores subset of the words 4 in example –Organized in lines Multiple words To exploit spatial locality Access –Word must be in cache for processor to access Big, Slow Memory A B C Y Z Small, Fast Cache A B G H Processor

Direct-mapped cache 29 123 150 162 18 33 19 210 00000 00010 00100 00110 01000 01010 01100 01110 10000 10010 10100 10110 11000 11010 11100 11110 Cache V d tag data Memory 78 120 71 173 21 28 200 225 0 0 0 0 Address 01101 218 44 141 28 33 181 119 66 23 10 16 214 98 129 42 74 Block Offset (1-bit) Line Index (2-bit) Tag (2-bit) C ompulsory Miss: first reference to memory block C apacity Miss: Working set doesn’t fit in cache C onflict Miss: Working set maps to same cache line 3-C’s Cache overview

2-way set associative cache 29 123 150 162 18 33 19 210 00000 00010 00100 00110 01000 01010 01100 01110 10000 10010 10100 10110 11000 11010 11100 11110 Cache V d tag data Memory 78 120 71 173 21 28 200 225 0 0 0 0 Address 01101 218 44 141 28 33 181 119 66 23 10 16 214 98 129 42 74 Block Offset (unchanged) 1-bit Set Index Larger (3-bit) Tag Impact on the 3C’s? Cache overview

Parameters Total cache size –( block size  # sets  associativity) Associativity (Number of “ways”) Block size (bytes per block) Number of sets Cache overview

Performance Measures –Miss rate % of memory refereces which are not found in the cache. A related measure is #misses per 1000 instructions –Average memory access time MR*T Miss + (1-MR)* T Hit –T Hit & T Miss --Access time for a hit or miss But what do we want to measure? –Impact on program execution time. What are some flaws of using –Miss Rate? –Ave. Memory Access Time? –Program execution time? –Misses per 1000 instructions? Cache overview

Effects of Varying Cache Parameters Total cache size? –Positives: Should decrease miss rate –Negatives: May increase hit time Increased area requirements Increased power (mainly static) –Interesting paper: »Krisztián Flautner, Nam Sung Kim, Steve Martin, David Blaauw, Trevor N. Mudge: Drowsy Caches: Simple Techniques for Reducing Leakage Power. ISCA 2002: 148-157 Cache overview

Effects of Varying Cache Parameters Bigger block size? –Positives: Exploit spatial locality ; reduce compulsory misses Reduce tag overhead (bits) Reduce transfer overhead (address, burst data mode) –Negatives: Fewer blocks for given size; increase conflict misses Increase miss transfer time (multi-cycle transfers) Wasted bandwidth for non-spatial data Cache overview

Effects of Varying Cache Parameters Increasing associativity –Positives: Reduces conflict misses Low-assoc cache can have pathological behavior (very high miss) –Negatives: Increased hit time More hardware requirements (comparators, muxes, bigger tags) Minimal improvements past 4- or 8- way. Cache overview

Effects of Varying Cache Parameters Replacement Strategy: (for associative caches) –LRU: intuitive; difficult to implement with high assoc; worst case performance can occur (N+1 element array) –Random: Pseudo-random easy to implement; performance close to LRU for high associativity –Optimal: replace block that has next reference farthest in the future; hard to implement (need to see the future) Cache overview

Effects of Varying Cache Parameters Write Policy: How to deal with write misses? –Write-through / no-allocate Total traffic? Read misses  block size + writes Common for L1 caches back by L2 (esp. on-chip) –Write-back / write-allocate Needs a dirty bit to determine whether cache data differs Total traffic? (read misses + write misses)  block size + dirty-block-evictions  block size Common for L2 caches (memory bandwidth limited) –Variation: Write validate Write-allocate without fetch-on-write Needs sub-block cache with valid bits for each word/byte

4 Hierarchy questions Where can a block be placed? How do you find a block (and know you’ve found it)? Which block should be replaced on a miss? What happens on a write? Cache overview

So from here… We need to think in terms of both the hierarchy questions as well as performance. –We often will use Average Access Time as a predictor of the impact on execution time. But we will try to keep in mind they may not be the same thing! Even all these questions don’t get at everything! Cache overview

Set Associative as a change from Direct Mapped Impact of being more associative? –MR? T Miss ? T Hit ? Hierarchy questions: –Where can a block be placed? –How do you find a block (and know you’ve found it)? –Which block should be replaced on a miss? –What happens on a write? 4 Hierarchy questions

Hash cache Idea: –Grab some bits from the tag and use them, as well as the old index bits, to select a set. –Simplest version would be if N sets, grab the 2N lowest order bits after the offset and XOR them in groups of 2. Impact: –Impact of being more associative? MR? T Miss ? T Hit ? –Hierarchy questions: Where can a block be placed? How do you find a block (and know you’ve found it)? Which block should be replaced on a miss? What happens on a write? 4 Hierarchy questions

Skew cache Idea: –As hash cache but a different and independent hashing function is used for each way. Impact: –Impact of being more associative? MR? T Miss ? T Hit ? –Hierarchy questions: Where can a block be placed? How do you find a block (and know you’ve found it)? Which block should be replaced on a miss? What happens on a write? 4 Hierarchy questions

Victim cache Idea: –A small fully-associative cache (4-8 lines typically) that is accessed in parallel with the main cache. This victim cache is managed as if it were an L2 cache (even though it is as fast as the main L1 cache). Impact: –Impact of being more associative? MR? T Miss ? T Hit ? –Hierarchy questions: Where can a block be placed? How do you find a block (and know you’ve found it)? Which block should be replaced on a miss? What happens on a write? 4 Hierarchy questions

Critical Word First Idea: –For caches where the line size is greater than the word size, send the word which causes the miss first Impact: –Impact of being more associative? MR? T Miss ? T Hit ? –Hierarchy questions: Where can a block be placed? How do you find a block (and know you’ve found it)? Which block should be replaced on a miss What happens on a write? 4 Hierarchy questions

Lots and lots of other things How do I really do replacement on highly- associative caches (4+ ways)? We’ve skipped over writing

Pseudo-LRU replacement # of bits needed to maintain order among N items? So for N=16 we need: ____ bits. Any better ideas? 45

Psuedo LRU Way 0Way 1Way 2Way 3 General theme: On a hit or replacement, switch all the bits to point away from you. Replace the one pointed to. 0 0 0 1 1 1

Multi-lateral caches (1/3) There are two cache structures, the main cache and the companion buffer. –The main cache is set-associative –The companion buffer can hold any line Any given access can be put in either structure and both need to be searched. Set 0 Set 1 Way 0Way 1 Main cache Companion buffer

Multi-lateral caches (2/3) Lots of cache schemes can be viewed as multi-lateral caches –Victim caches –Assist caches Some things are super-sets of multi-lateral caches –Skew caches (only use one line of one of the ways) –Exclusive Multi-level caches (at least in structure)

Multi-lateral caches (3/3) Why is this term important? –First the idea of having two structures in parallel has an attractive sound to it. Bypassing 1 is brave, but keeping low-locality accesses around for only a short time seems smart. –Of course you have to identify low-locality accesses… Can somehow segregate data streams (much as an I vs. D cache does) may improve performance. –Some ideas include keeping the heap, stack, and global variables in different structures! 1 This means not putting the data into the cache

3C’s model Break cache misses into three categories –Compulsory miss –Capacity miss –Conflict miss Compulsory –The block in question had never been accessed before. Capacity –A fully-associative cache of the same size would also have missed this access given the same reference stream. Conflict –That fully-associative cache would have gotten a hit.

3C’s example Consider the “stream” of blocks 0, 1, 2, 0, 2, 1 –Given a direct-mapped cache with 2 lines, which would hit, which would miss? –Classify each type of miss.

3C’s – sum-up. What’s the point? –Well, if you can figure out what kind of misses you are getting, you might be able to figure out how to solve the problem. How would you “solve” each type? What are the problems?

Reference stream A memory reference stream is an n-tuple of addresses which corresponds to n ordered memory accesses. –A program which accesses memory locations 4, 8 and then 4 again would be represented as (4,8,4).

Locality of reference The reason small caches get such a good hit-rate is that the memory reference steam is not random. –A given reference tends to be to an address that was used recently. This is called temporal locality. –Or it may be to an address that is near another address that was used recently. This is called spatial locality. Therefore, keeping recently accessed blocks in the cache can result in a remarkable number of hits. –But there is no known way to quantify the amount of locality in the reference stream.

Stack distance – A measure of locality Consider the reference stream (0, 8, 4, 4, 12, 16, 32, 4) If the cache line size is 8 bytes, the block reference stream is (0, 1, 0, 0, 1, 2, 4, 0) Now define the stack distance of a reference to be the number of unique block addresses between the current reference and the previous reference to the same block number. In this case ( , , 1, 0, 1, , , 3)

Stack distance – A measure of locality (2) Memory reference stream 08441216324 Block reference stream 01001240 Stack distance  101  3 Number non-infinite accessesCumulative stack distance

Stack distances of the SPEC benchmarks – cumulative

Stack distances of selected SPECfp benchmarks

Why is this interesting? It is possible to distinguish locality from conflict. –Pure miss-rate data of set-associative caches depends upon both the locality of the reference stream and the conflict in that stream. It is possible to make qualitative statements about locality –For example, SPECint has a higher degree of locality than SPECfp.

Fully-associative caches A fully-associative LRU cache consisting of n lines will get a hit on any memory reference access with a stack distance of n-1 or less. –Fully associative caches of size n store the last n uniquely accessed blocks. –This means the locality curves are simply a graph of the hit rate on a fully associative cache of size n-1.

Direct-mapped caches Consider a direct-mapped cache of n cache lines and the block reference stream (x, y, x). –The second reference to x will be a hit unless y is mapped to the same cache line as x. –If x and y are independent, the odds x being a hit is (n-1)/n In general if there are m unique accesses between the two references to x, the odds of a hit are:

Expected hit rate at a given stack distance for a 128-line cache

Verification – direct mapped 128-entry cache SPECfpSPECint

Why is this interesting? It is possible to see exactly why more associative caches do better than less associative caches –It also becomes possible to see when they do worse. –The area under the curve is always equal to the number of cache lines. It provides an expectation of performance. –If things are worse, there must be excessive conflict. –If things are better, it is likely due to spatial locality.

The 3C’s model The 3’Cs model describes misses as conflict, capacity or compulsory by comparing a direct-mapped cache to a fully- associative cache. –Those accesses in the gray area are conflict misses. –The in the green area are capacity misses. –The blue area is where both caches get a hit. –The yellow area is ignored by the 3C’s model. Perhaps a “conflict hit”? The 3C’s model is much more limited. It cannot distinguish between expected conflict and excessive conflict.

64-line DM 64-line FA 128-line FA 128-line DM

The filtering of locality gcc after a 64KB direct-mapped cache

The filtering of locality (2) Notice that there fewer references at the lowest stack distances than at the highest. –Yet it is at the lowest stack distances where highly-associative caches concentrate their power. –The locality seen by the L2 cache is fundamentally different than that seen by the L1 cache A large enough cache can make this effect go away…

Measuring non-random conflict Combining the cache and locality models makes it possible to predict a hit rate. –On these reference streams the cache tends to do better than predicted. Although some benchmarks, like mgrid, see significantly worse performance. BenchmarkAverage predicted hit rate Average actual hit rate SPECint90.81%91.38% SPECfp83.19%83.84%

Measuring non-random conflict (2) A more advanced technique, using a hash- cache, allows us to roughly quantify the amount of excessive conflict and scant conflict. –This can be useful when deciding if a hash cache is appropriate. –It is also useful to provide feedback to the compiler about its data-layout choices. Other compilers (gcc for example) tend to have a higher degree of excessive conflict. –So this technique may also be able to tell us something about compliers.

Understanding non-standard caches 128-line direct-mapped component and a 6-line victim component

Some review Consider the access pattern A, B, C, A. Assume the three accesses are all independently randomly placed with uniform probability –In a direct-mapped cache with 8 lines, what is the probability of a miss? –A two-way associative cache with 4 lines? –A victim cache of 1 one backing up a direct-mapped cache of 4 lines? –A skew cache with 8 lines? What is bogus about the above assumptions?

VIPT caches (Done on board)

Lecture 11 & 12: Caches Cache overview 4 Hierarchy questions More on Locality Please bring these slides to the next lecture!

Similar presentations

Presentation on theme: "Lecture 11 & 12: Caches Cache overview 4 Hierarchy questions More on Locality Please bring these slides to the next lecture!"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Lecture 11 & 12: Caches Cache overview 4 Hierarchy questions More on Locality Please bring these slides to the next lecture!

Similar presentations

Presentation on theme: "Lecture 11 & 12: Caches Cache overview 4 Hierarchy questions More on Locality Please bring these slides to the next lecture!"— Presentation transcript:

Similar presentations

About project

Feedback