1 COMP Microprocessors and Embedded Systems, Lecture 12: Cache Memory - III. Modified from notes by Saeid Nooshabadi. Some of the slides are adapted from David Patterson (UCB).
3 Review
- We would like to have the capacity of disk at the speed of the processor; unfortunately this is not feasible.
- So we create a memory hierarchy:
  - each successively higher level contains the "most used" data from the next lower level
  - this exploits temporal locality
- Locality of reference is a Big Idea
4 Big Idea Review
- Mechanism for transparent movement of data among levels of a storage hierarchy:
  - set of address/value bindings
  - address => index to set of candidates
  - compare desired address with tag
  - service hit or miss
  - load new block and binding on miss
[Diagram: cache rows with a Valid bit, a Tag, and data words 0x0-3, 0x4-7, 0x8-b, 0xc-f; the address is split into tag, index, and offset fields]
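To make the address/value binding concrete, here is a minimal C sketch of the tag/index/offset split. The geometry (16-byte blocks, 16 rows) is a made-up example, not a cache from the lecture:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical direct-mapped geometry: 16-byte blocks (4 offset bits)
     * and 16 rows (4 index bits); the remaining high bits form the tag. */
    #define OFFSET_BITS 4
    #define INDEX_BITS  4

    int main(void) {
        uint32_t addr = 0x12345678;

        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        /* Prints: addr=0x12345678 tag=0x123456 index=7 offset=8 */
        printf("addr=0x%08x tag=0x%x index=%u offset=%u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }

On a hit, the cache compares the stored tag for row `index` against `tag` and, on a match, returns the bytes selected by `offset`.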
5 Types of Cache Misses (#1/2)
- Compulsory Misses
  - occur when a program is first started
  - cache does not contain any of that program's data yet, so misses are bound to occur
  - can't be avoided easily, so we won't focus on these in this course
6 Types of Cache Misses (#2/2)
- Conflict Misses
  - a miss that occurs because two distinct memory addresses map to the same cache location
  - two blocks (which happen to map to the same location) can keep overwriting each other
  - a big problem in direct-mapped caches
  - how do we lessen the effect of these?
7 Dealing with Conflict Misses
- Solution 1: Make the cache size bigger
  - fails at some point
- Solution 2: Multiple distinct blocks can fit in the same cache index?
(Speaker notes) Let's go back to our 4-byte direct-mapped cache and increase its block size to 4 bytes. Now we end up with one cache entry instead of 4 entries. What do you think this will do to the miss rate? Well, the miss rate will probably go to hell. It is true that if an item is accessed, it is likely to be accessed again soon, but probably NOT as soon as the very next access, so the next access will cause a miss again. What we end up with is loading data into the cache, only for the data to be forced out by another cache miss before we have a chance to use it again. This is called the ping-pong effect: the data acts like a ping-pong ball bouncing in and out of the cache. It is one of the nightmare scenarios cache designers hope never happens. We also have a term for this type of cache miss, a miss caused by different memory locations mapping to the same cache index: a conflict miss. There are two solutions we can use to reduce conflict misses. The first is to increase the cache size. The second is to increase the number of cache entries per cache index. Let me show you what I mean.
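To see the conflict numerically, here is a tiny C sketch with a hypothetical 256-byte direct-mapped cache (16 rows of 16-byte blocks): any two addresses that differ by a multiple of the cache size land on the same row and evict each other:

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 4
    #define INDEX_BITS  4   /* 16 rows x 16-byte blocks = 256-byte cache */

    static uint32_t row(uint32_t addr) {
        return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    }

    int main(void) {
        uint32_t a = 0x00000040;                              /* row 4 */
        uint32_t b = a + (1u << (OFFSET_BITS + INDEX_BITS));  /* a + 256: also row 4 */
        printf("row(a)=%u row(b)=%u conflict=%s\n",
               (unsigned)row(a), (unsigned)row(b),
               row(a) == row(b) ? "yes" : "no");
        return 0;
    }

Alternating accesses to `a` and `b` would miss every time: the ping-pong effect described above.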
8 Fully Associative Cache (#1/4)
- Memory address fields:
  - Tag: same as before
  - Offset: same as before
  - Index: non-existent
- What does this mean?
  - no "rows": any block can go anywhere in the cache
  - must compare with all tags in the entire cache to see if the data is there
9 Fully Associative Cache (#2/4)
- Fully Associative Cache (e.g., 32 B blocks)
- compare tags in parallel
[Diagram: each entry holds a Valid bit, a 27-bit Cache Tag, and data bytes B0-B31; the incoming tag is compared against every entry's tag simultaneously, with the byte offset selecting within the block]
10 Fully Associative Cache (#3/4)
- CAM (Content Addressable Memory): a RAM cell with an in-built comparator
11 Fully Associative Cache (#4/4)
- Benefit of Fully Assoc Cache:
  - no Conflict Misses (since data can go anywhere)
- Drawbacks of Fully Assoc Cache:
  - need a hardware comparator for every single entry: if we have 64 KB of data in the cache with 4 B entries, we need 16K comparators: infeasible
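Software can only mimic the CAM's parallel comparison with a scan over every entry. A minimal sketch of a fully associative lookup, with a hypothetical 8-entry cache of 32-byte blocks (all names and sizes are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_ENTRIES 8
    #define OFFSET_BITS 5   /* 32-byte blocks, as in the diagram */

    struct line { bool valid; uint32_t tag; uint8_t data[1u << OFFSET_BITS]; };
    static struct line cache[NUM_ENTRIES];

    bool lookup(uint32_t addr, uint8_t *out) {
        uint32_t tag    = addr >> OFFSET_BITS;        /* no index field at all */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        for (int i = 0; i < NUM_ENTRIES; i++) {       /* compare with ALL tags */
            if (cache[i].valid && cache[i].tag == tag) {
                *out = cache[i].data[offset];         /* hit */
                return true;
            }
        }
        return false;                                 /* miss */
    }

The loop is what the hardware must do in a single cycle, which is why one comparator per entry is needed.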
12 Third Type of Cache Miss
- Capacity Misses
  - a miss that occurs because the cache has a limited size
  - a miss that would not occur if we increased the size of the cache
  - sketchy definition, so just get the general idea
- This is the primary type of miss for Fully Associative caches.
13 N-Way Set Associative Cache (#1/5)
- Memory address fields:
  - Tag: same as before
  - Offset: same as before
  - Index: points to the correct "row" (called a set in this case)
- So what's the difference?
  - each set contains multiple blocks
  - once we've found the correct set, we must compare with all tags in that set to find our data
14 N-Way Set Associative Cache (#2/5)
- 2-Way Set Associative Cache Organisation
[Diagram: the index selects a set via a decoder; the tag RAM entries for that set are compared against the address tag, a mux selects the data RAM output of the matching way, and a hit signal is raised on a match]
15 N-Way Set Associative Cache (#3/5)
- Summary:
  - cache is direct-mapped with respect to sets
  - each set is fully associative
  - basically N direct-mapped caches working in parallel: each has its own valid bit and data
16 N-Way Set Associative Cache (#4/5)
- Given a memory address:
  1. Find the correct set using the Index value.
  2. Compare the Tag with all Tag values in the determined set.
  3. If a match occurs, it's a hit; otherwise a miss.
  4. Finally, use the Offset field as usual to find the desired data within the block.
(A sketch of this lookup in C follows below.)
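The four steps map directly onto code. A minimal C sketch, assuming a hypothetical 2-way cache with 16 sets and 16-byte blocks:

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS        2
    #define INDEX_BITS  4
    #define OFFSET_BITS 4
    #define SETS        (1u << INDEX_BITS)

    struct line { bool valid; uint32_t tag; uint8_t data[1u << OFFSET_BITS]; };
    static struct line cache[SETS][WAYS];

    bool lookup(uint32_t addr, uint8_t *out) {
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & (SETS - 1);  /* 1. find the set */
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        for (int w = 0; w < WAYS; w++) {                       /* 2. compare tags */
            if (cache[index][w].valid && cache[index][w].tag == tag) {
                *out = cache[index][w].data[offset];           /* 3. hit, 4. offset */
                return true;
            }
        }
        return false;                                          /* 3. miss */
    }

With WAYS = 1 this degenerates into a direct-mapped lookup; with a single set it becomes fully associative.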
17 N-Way Set Associative Cache (#5/5)
- What's so great about this?
  - even a 2-way set assoc cache avoids a lot of conflict misses
  - hardware cost isn't that bad: we only need N comparators
- In fact, for a cache with M blocks:
  - it's Direct-Mapped if it's 1-way set assoc
  - it's Fully Assoc if it's M-way set assoc
  - so these two are just special cases of the more general set associative design
21 Reading Material
- Steve Furber: ARM System-on-Chip Architecture, 2nd Ed, Addison-Wesley, 2000. Chapter 10.
22 Block Replacement Policy (#1/2)
- Direct-Mapped Cache: the index completely specifies which position a block can go in on a miss
- N-Way Set Assoc (N > 1): the index specifies a set, but the block can occupy any position within the set on a miss
- Fully Associative: the block can be written into any position
- Question: if we have the choice, where should we write an incoming block?
23 Block Replacement Policy (#2/2)
- Solution:
  - If there are any locations with the valid bit off (empty), then usually write the new block into the first one.
  - If all possible locations already have a valid block, we must pick a replacement policy: the rule by which we determine which block gets "cached out" on a miss.
24 Block Replacement Policy: LRU
- LRU (Least Recently Used)
  - Idea: cache out the block which has been accessed (read or write) least recently
  - Pro: temporal locality => recent past use implies likely future use; in fact, this is a very effective policy
  - Con: with 2-way set assoc, it is easy to keep track (one LRU bit per set); with 4-way or greater, it requires complicated hardware and much time to keep track of this
25 Block Replacement Example
- We have a 2-way set associative cache with a four-word total capacity and one-word blocks. We perform the following word accesses (ignore bytes for this problem):
  0, 2, 0, 1, 4, 0, 2, 3, 5, 4
- How many hits and how many misses will there be with the LRU block replacement policy?
26 Block Replacement Example: LRU
- Addresses 0, 2, 0, 1, 4, 0, ...
  - 0: miss, bring into set 0 (loc 0); set 0 = {0}
  - 2: miss, bring into set 0 (loc 1); set 0 = {0, 2}, 0 is LRU
  - 0: hit; set 0 = {0, 2}, 2 is LRU
  - 1: miss, bring into set 1 (loc 0); set 1 = {1}
  - 4: miss, bring into set 0 (loc 1, replace 2); set 0 = {0, 4}, 0 is LRU
  - 0: hit; set 0 = {0, 4}, 4 is LRU
(A simulator for the full access sequence follows below.)
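The whole sequence is easy to check in C. A small simulator matching the example's configuration (2 sets, 2 ways, 1-word blocks, LRU); run on the full trace it should report 2 hits and 8 misses:

    #include <stdio.h>

    #define SETS 2
    #define WAYS 2

    int main(void) {
        int tag[SETS][WAYS], valid[SETS][WAYS] = {{0}}, lru[SETS] = {0};
        int trace[] = {0, 2, 0, 1, 4, 0, 2, 3, 5, 4};
        int hits = 0, misses = 0;

        for (int i = 0; i < 10; i++) {
            int addr = trace[i], set = addr % SETS, way, hit = 0;
            for (way = 0; way < WAYS; way++)
                if (valid[set][way] && tag[set][way] == addr) { hit = 1; break; }
            if (hit) hits++;
            else {                     /* miss: use an empty way, else evict LRU */
                misses++;
                way = !valid[set][0] ? 0 : (!valid[set][1] ? 1 : lru[set]);
                valid[set][way] = 1;
                tag[set][way] = addr;
            }
            lru[set] = !way;           /* with 2 ways, the other way is now LRU */
            printf("%d: %s\n", addr, hit ? "hit" : "miss");
        }
        printf("hits=%d misses=%d\n", hits, misses);
        return 0;
    }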
27 Ways to Reduce Miss Rate
- Larger cache
  - limited by cost and technology
  - hit time of first-level cache < cycle time
- More places in the cache to put each block of memory: associativity
  - fully associative: any block can go in any line
  - k-way set associative: k places for each block
  - direct-mapped: k = 1
28 Big Idea
- How do we choose between the options of associativity, block size, and replacement policy?
- Design against a performance model
  - Minimize: Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
  - influenced by technology and program behavior
- Create the illusion of a memory that is large, cheap, and fast - on average
29 Example
- Assume:
  - Hit Time = 1 cycle
  - Miss Rate = 5%
  - Miss Penalty = 20 cycles
- Avg mem access time = 1 + 0.05 x 20 = 2 cycles
30 Improving Miss Penalty
- When caches first became popular, Miss Penalty ~ 10 processor clock cycles
- Today: a 1000 MHz processor (1 ns per clock cycle) and 100 ns to go to DRAM => 100 processor clock cycles!
- Solution: another cache between main memory and the processor cache: the Second Level (L2) Cache
[Diagram: Proc - L1 - L2 - DRAM memory hierarchy]
31 Analyzing Multi-level Cache Hierarchy
[Diagram: Proc - L1 - L2 - DRAM, annotated with L1 hit time, L1 miss rate, L1 miss penalty, L2 hit time, L2 miss rate, L2 miss penalty]
- Avg Mem Access Time = L1 Hit Time + L1 Miss Rate x L1 Miss Penalty
- L1 Miss Penalty = L2 Hit Time + L2 Miss Rate x L2 Miss Penalty
- So: Avg Mem Access Time = L1 Hit Time + L1 Miss Rate x (L2 Hit Time + L2 Miss Rate x L2 Miss Penalty)
32 Typical Scale
- L1:
  - size: tens of KB
  - hit time: complete in one clock cycle
  - miss rates: 1-5%
- L2:
  - size: hundreds of KB
  - hit time: a few clock cycles
  - miss rates: 10-20%
- The L2 miss rate is the fraction of L1 misses that also miss in L2
  - why so high?
33 Example
- Assume:
  - L1 Hit Time = 1 cycle
  - L1 Miss Rate = 5%
  - L2 Hit Time = 5 cycles
  - L2 Miss Rate = 15% (% of L1 misses that miss)
  - L2 Miss Penalty = 100 cycles
- L1 Miss Penalty = 5 + 0.15 x 100 = 20 cycles
- Avg mem access time = 1 + 0.05 x 20 = 2 cycles
34 Example: Without L2 Cache
- Assume:
  - L1 Hit Time = 1 cycle
  - L1 Miss Rate = 5%
  - L1 Miss Penalty = 100 cycles
- Avg mem access time = 1 + 0.05 x 100 = 6 cycles
- 3x faster with the L2 cache
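Both examples follow from plugging the slides' numbers into the multi-level AMAT formula; a quick check in C:

    #include <stdio.h>

    /* AMAT = L1_hit + L1_miss_rate * (L2_hit + L2_miss_rate * L2_penalty) */
    int main(void) {
        double l1_hit = 1.0, l1_miss_rate = 0.05;
        double l2_hit = 5.0, l2_miss_rate = 0.15, l2_penalty = 100.0;

        double l1_penalty = l2_hit + l2_miss_rate * l2_penalty;   /* 20 cycles */
        double with_l2    = l1_hit + l1_miss_rate * l1_penalty;   /*  2 cycles */
        double without_l2 = l1_hit + l1_miss_rate * 100.0;        /*  6 cycles */

        printf("with L2: %.1f cycles, without L2: %.1f cycles (%.1fx)\n",
               with_l2, without_l2, without_l2 / with_l2);
        return 0;
    }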
35 What to Do on a Write Hit?
- Write-through
  - update the word in the cache block and the corresponding word in memory
- Write-back
  - update the word in the cache block only; allow the memory word to be "stale"
  - => add a 'dirty' bit to each line indicating that memory needs to be updated when the block is replaced
  - => the OS flushes the cache before I/O!!! So that cache values become the same as memory values changed by I/O
- Performance trade-offs?
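A minimal C sketch of a write hit under the two policies (the line layout and function names are made up for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    struct line { bool valid, dirty; uint32_t tag; uint32_t word[4]; };

    /* Write-through: update the cache block and memory together,
     * so memory never goes stale. */
    void write_through_hit(struct line *l, int i, uint32_t v, uint32_t *mem) {
        l->word[i] = v;
        *mem = v;
    }

    /* Write-back: update only the cache block and set the dirty bit;
     * memory stays stale until the line is evicted. */
    void write_back_hit(struct line *l, int i, uint32_t v) {
        l->word[i] = v;
        l->dirty = true;
    }

    /* On eviction, a dirty write-back line must be copied out first. */
    void evict(struct line *l, uint32_t *mem_block /* 4 words */) {
        if (l->valid && l->dirty)
            for (int i = 0; i < 4; i++) mem_block[i] = l->word[i];
        l->valid = l->dirty = false;
    }

The trade-off: write-through keeps memory current at the cost of a memory access on every write; write-back batches that traffic into one block transfer per eviction.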
36 Things to Remember (#1/2)
- Caches are NOT mandatory:
  - the processor performs arithmetic
  - memory stores data
  - caches simply make data transfers go faster
- Each level of the memory hierarchy is just a subset of the next lower level
- Caches speed things up due to temporal locality: store data used recently
- Block size > 1 word speeds things up due to spatial locality: store words adjacent to the ones used recently
37 Things to Remember (#2/2)
- Cache design choices:
  - size of cache: speed v. capacity
  - direct-mapped v. associative
  - for N-way set assoc: choice of N
  - block replacement policy
  - 2nd-level cache?
  - write-through v. write-back?
- Use a performance model to pick between choices, depending on programs, technology, budget, ...