2 Memory Challenges
Ideally one desires a huge amount of very fast memory for little cost, but:
- Fast memory is expensive
- Cheap memory is slow
The solution on a fixed budget is a memory hierarchy:
- A small amount of very fast memory (think SRAM)
- A medium amount of slower memory (think DRAM)
- A large amount of slower yet memory (think Disk)
Comparing:
  Technology   Access Time   Cost/GB
  SRAM         0.5 - 5 ns    $4,000 - $10,000
  DRAM         50 - 70 ns    $100 - $200
  Disk         5 - 20 ms     $0.50 - $2
Recall: We used 200 ps (0.2 ns) in our pipeline study. Why the difference?
3 The "Memory Wall"
The logic vs DRAM speed gap continues to grow.
[Figure: clocks per DRAM access and clocks per instruction, plotted over time]
4 Philosophically
How does one UTILIZE the very fast memory effectively? Think "The Principle of Locality":
- Temporal Locality (close in time): memory that has been accessed recently is likely to be accessed again soon
- Spatial Locality (close in location): memory that is close to recently accessed memory is likely to be accessed soon
So, organize memory in blocks:
- Keep blocks likely to be used soon in the very fast memory
- Keep the next most likely blocks in medium fast memory
- Keep those not likely to be used soon in slower memory
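To make the two kinds of locality concrete, here is a minimal C sketch (not from the slides; the array size is an arbitrary assumption):

```c
#include <stdio.h>

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++) a[i] = i;

    /* "sum" and "i" are touched on every iteration: temporal locality. */
    int sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += a[i];   /* a[0], a[1], a[2], ... are adjacent in memory:
                          spatial locality, so one cached block of the
                          array serves several consecutive iterations. */

    printf("sum = %d\n", sum);
    return 0;
}
```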
6 Cache Memory
What is cache?
- A small amount of very high speed memory between the "main memory" and the CPU
How is it organized?
- In a number of uniform sized blocks of memory that have a high likelihood of being used
How is it kept "current"?
- When a block in main memory is more likely to be needed, that block replaces a block in the cache
How do we know a block is needed?
- An access fails to find the word in the cache
Where does it get placed in the cache?
- Likely in place of the last used block
How do we rate the performance of the cache?
- Based upon hit rates and miss rates
Should there be separate Instruction Caches and Data Caches?
7 Hierarchical Memory Organization
- Registers are the fastest
- Cache is the fastest "memory" (SRAM)
- DRAM makes good main memory
- Disk is best for the rest (the majority)
8 The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
Increasing distance from the processor in access time:
- Processor <-> L1$: 4-8 bytes (words)
- L1$ <-> L2$: 8-32 bytes (block)
- L2$ <-> Main Memory: 1 to 4 blocks
- Main Memory <-> Secondary Memory: 1,024+ bytes (disk sector = page)
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM.
[Figure: pyramid showing the (relative) size of the memory at each level]
9 The Memory Hierarchy: Pictorially
- Temporal Locality (locality in time): keep most recently accessed data items closer to the processor
- Spatial Locality (locality in space): move blocks consisting of contiguous words to the upper levels
[Figure: upper level memory exchanging blocks (Blk X, Blk Y) with lower level memory, with the processor above]
Notes: How does the memory hierarchy work? It is rather simple, at least in principle. To take advantage of temporal locality, the hierarchy keeps the most recently accessed data items closer to the processor, because chances are the processor will access them again soon. To take advantage of spatial locality, not only do we move the item that has just been accessed to the upper level, we also move the data items that are adjacent to it.
10 The Memory Hierarchy: Terminology
- Hit: data is in some block in the upper level (Blk X)
  - Hit Rate: the fraction of memory accesses found in the upper level
  - Hit Time: time to access the upper level = RAM access time + time to determine hit/miss
- Miss: data is not in the upper level, so it needs to be retrieved from a block in the lower level (Blk Y)
  - Miss Rate = 1 - Hit Rate
  - Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit Time << Miss Penalty
Notes: A hit is when the data the processor wants is found in the upper level (Blk X); the fraction of memory accesses that hit is the hit rate. Hit time consists of (a) the time to access this level and (b) the time to determine whether the access is a hit or a miss. If the data cannot be found in the upper level, we have a miss and must retrieve the block (Blk Y) from the lower level. By definition, the miss rate is just 1 minus the hit rate. The miss penalty also has two parts: (a) the time to replace a block in the upper level and (b) the time to deliver the new block to the processor. It is very important that the hit time be much, much smaller than the miss penalty; otherwise there would be no reason to build a memory hierarchy.
11 How is the Hierarchy Managed?
- registers <-> memory: by the compiler (or programmer?)
- cache <-> main memory: by the cache controller hardware
- main memory <-> disks:
  - by the operating system (virtual memory)
  - virtual to physical address mapping assisted by the hardware (TLB)
  - by the programmer (files)
12 Direct Mapped Caching
Two questions to answer (in hardware):
- Q1: How do we know if a data item is in the cache?
- Q2: If it is, how do we find it?
Direct mapped: for each item of data at the lower level, there is exactly one location in the cache where it might be, so lots of items at the lower level must share locations in the upper level.
Address mapping:
  cache block index = (block address) modulo (# of blocks in the cache)
First consider block sizes of one word.
13-14 Caching: A Simple First Example
Main memory: sixteen word addresses 0000xx through 1111xx; the two low order bits define the byte within the word (32-bit words).
Cache: four entries, each with Valid, Tag, and Data fields, at indices 00, 01, 10, 11.
- Q2: How do we find it? Use the next 2 low order memory address bits - the index - to determine which cache block (i.e., modulo the number of blocks in the cache).
- Q1: Is it there? Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache.
- The valid bit indicates whether an entry contains valid information; if the bit is not set, there cannot be a match for this block.
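To make Q1 and Q2 concrete for this 4-block, one-word-per-block cache, here is a minimal C lookup sketch (not from the slides; the lookup helper and test addresses are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data;
} CacheBlock;

static CacheBlock cache[4];   /* indices 00, 01, 10, 11 */

/* Byte address layout here: 2 tag bits | 2 index bits | 2 byte-offset bits. */
static bool lookup(uint32_t addr) {
    uint32_t word  = addr >> 2;    /* strip the 2 byte-offset bits   */
    uint32_t index = word & 0x3;   /* Q2: next 2 bits pick the block */
    uint32_t tag   = word >> 2;    /* Q1: high 2 bits are the tag    */
    return cache[index].valid && cache[index].tag == tag;
}

int main(void) {
    cache[1] = (CacheBlock){ .valid = true, .tag = 0, .data = 42 };
    printf("%d\n", lookup(0x04));  /* word 1 -> index 01, tag 00: hit (1)  */
    printf("%d\n", lookup(0x14));  /* word 5 -> index 01, tag 01: miss (0) */
    return 0;
}
```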
15-16 Direct Mapped Cache
Consider the main memory word reference string 0 1 2 3 4 3 4 15. Start with an empty cache - all blocks initially marked as not valid.
  0  miss: 00 Mem(0)
  1  miss: 00 Mem(0), 00 Mem(1)
  2  miss: 00 Mem(0), 00 Mem(1), 00 Mem(2)
  3  miss: 00 Mem(0), 00 Mem(1), 00 Mem(2), 00 Mem(3)
  4  miss: 01 Mem(4) replaces 00 Mem(0); 00 Mem(1), 00 Mem(2), 00 Mem(3)
  3  hit:  01 Mem(4), 00 Mem(1), 00 Mem(2), 00 Mem(3)
  4  hit:  01 Mem(4), 00 Mem(1), 00 Mem(2), 00 Mem(3)
  15 miss: 11 Mem(15) replaces 00 Mem(3); 01 Mem(4), 00 Mem(1), 00 Mem(2)
8 requests, 6 misses
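A small C simulation (an illustrative sketch, not part of the original deck) reproduces this trace and the 8 requests / 6 misses count:

```c
#include <stdio.h>
#include <stdbool.h>

#define NBLOCKS 4   /* four one-word blocks, as on the slide */

int main(void) {
    int refs[] = {0, 1, 2, 3, 4, 3, 4, 15};   /* word addresses */
    int n = sizeof refs / sizeof refs[0];

    bool valid[NBLOCKS] = {false};
    int  tag[NBLOCKS];
    int  misses = 0;

    for (int i = 0; i < n; i++) {
        int index = refs[i] % NBLOCKS;   /* (block address) mod (# blocks) */
        int t     = refs[i] / NBLOCKS;   /* remaining high-order bits      */
        if (valid[index] && tag[index] == t) {
            printf("%2d: hit\n", refs[i]);
        } else {
            printf("%2d: miss\n", refs[i]);
            valid[index] = true;
            tag[index]   = t;
            misses++;
        }
    }
    printf("%d requests, %d misses\n", n, misses);  /* 8 requests, 6 misses */
    return 0;
}
```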
17 MIPS Direct Mapped Cache Example
One word/block, cache size = 1K words.
Address split: 20-bit Tag | 10-bit Index | 2-bit Byte offset. The cache has 1024 entries (indices 0 through 1023), each with a Valid bit, a 20-bit Tag, and 32 bits of Data; a Hit is signaled when the indexed entry is valid and its tag matches.
Notes: Let's use a specific example with realistic numbers: assume we have a 1K word (4 KB) direct mapped cache with a block size of 4 bytes (1 word). In other words, each block associated with a cache tag holds 4 bytes. With a block size of 4 bytes, the 2 least significant bits of the address are used as the byte select within the cache block. Since the cache size is 1K words, the upper 32 - (10 + 2) = 20 bits of the address are stored as the cache tag. The remaining 10 address bits in the middle, bits 2 through 11, are used as the cache index to select the proper cache entry.
What kind of locality are we taking advantage of? Temporal!
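Here is a C sketch of the address split this slide describes (the example address is an arbitrary assumption):

```c
#include <stdint.h>
#include <stdio.h>

/* Field widths for a 1K-word, one-word-per-block direct mapped cache. */
#define BYTE_OFFSET_BITS 2    /* 4 bytes per word */
#define INDEX_BITS       10   /* 1024 blocks      */

int main(void) {
    uint32_t addr = 0x12345678;  /* hypothetical 32-bit byte address */

    uint32_t byte_offset = addr & 0x3;
    uint32_t index = (addr >> BYTE_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag   = addr >> (BYTE_OFFSET_BITS + INDEX_BITS); /* upper 20 bits */

    printf("tag=0x%05x index=%u byte=%u\n", tag, index, byte_offset);
    return 0;
}
```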
18 Handling Cache Hits
Read hits (I$ and D$):
- This is what we want - no challenges
Write hits (D$ only):
- What is the problem here? Two strategies:
- Allow cache and memory to be inconsistent
  - Write the data only into the cache block (write-back the cache contents to the next level in the memory hierarchy when that cache block is "evicted")
  - Need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted
- Require the cache and memory to be consistent
  - Always write the data into both the cache block and the next level in the memory hierarchy (write-through), so no dirty bit is needed
  - Writes run at the speed of the next level in the memory hierarchy - so slow! - or use a write buffer, so we only stall if the write buffer is full
A sketch of the two write-hit policies follows.
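This is a minimal C sketch under simplified assumptions; the CacheLine layout and the memory_write_word helper are hypothetical stand-ins, not the deck's design:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int      valid;
    int      dirty;   /* used only by the write-back policy */
    uint32_t tag;
    uint32_t data;
} CacheLine;

/* Stand-in for the next level of the hierarchy (hypothetical helper). */
static void memory_write_word(uint32_t addr, uint32_t value) {
    printf("DRAM write: [0x%08x] <- 0x%08x\n", addr, value);
}

/* Write-through: keep cache and memory consistent on every write hit. */
static void write_hit_through(CacheLine *line, uint32_t addr, uint32_t value) {
    line->data = value;
    memory_write_word(addr, value);   /* slow unless buffered */
}

/* Write-back: update only the cache; mark dirty so the block gets
   written to memory when it is eventually evicted. */
static void write_hit_back(CacheLine *line, uint32_t value) {
    line->data  = value;
    line->dirty = 1;
}

int main(void) {
    CacheLine line = { .valid = 1, .tag = 0x12345, .dirty = 0, .data = 0 };
    write_hit_through(&line, 0x1000, 0xDEADBEEF);
    write_hit_back(&line, 0xCAFEF00D);
    printf("dirty=%d data=0x%08x\n", line.dirty, line.data);
    return 0;
}
```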
19 Read / Write Strategies
- Read Through: word read from memory directly
- No Read Through: word read from the cache after the block is read from memory
- Write Through: word written to both cache and memory
- Write Back: word written only to the cache
- Write Allocate: block is loaded on a write miss, followed by a write hit
- Write No Allocate: block is modified in memory on a write miss but not loaded into the cache
Pairings (* marks the combinations commonly used together):
  Write Hit Policy    Write Miss Policy
  Write Through       Write Allocate
  Write Through *     Write No Allocate *
  Write Back *        Write Allocate *
  Write Back          No Write Allocate
20 Write Buffer for Write-Through Caching
Processor -> Cache, with a write buffer between the cache and DRAM.
- Write buffer between the cache and main memory
  - Processor: writes data into the cache and the write buffer
  - Memory controller: writes contents of the write buffer to memory
- The write buffer is just a FIFO
  - Typical number of entries: 4
  - Works fine if store frequency (w.r.t. time) << 1 / DRAM write cycle
- Memory system designer's nightmare
  - The store frequency (w.r.t. time) approaches 1 / DRAM write cycle, leading to write buffer saturation
  - One solution is to use a write-back cache; another is to use an "L2" cache
Notes: We don't really write to memory directly; we write to the write buffer. Once the data is in the write buffer (and assuming a cache hit), the CPU is done with the write; the memory controller then moves the buffer's contents to the real memory behind the scenes. The write buffer works as long as the store frequency is not too high - frequency with respect to time, not with respect to the number of instructions. Recall the DRAM cycle time we talked about last time: it sets the upper limit on how frequently you can write to main memory. If stores come too close together, or the CPU is much faster than the DRAM cycle time, the write buffer overflows and the CPU must stop and wait. When the store frequency approaches 1 over the DRAM write cycle time, it does not matter how big you make the write buffer: you are feeding it faster than it can drain. This is write buffer saturation; I have seen it happen in simulation, and when it happens the processor effectively runs at DRAM cycle time - very, very slow. The first solution is to get rid of the write buffer and replace the write-through cache with a write-back cache; another is to install a second level cache between the write buffer and memory and make that second level write-back.
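As a rough illustration (an assumption-laden sketch, not the deck's design), a 4-entry write buffer is just a circular FIFO: the processor stalls only when wb_push finds it full, and the memory controller drains one entry per DRAM write cycle via wb_pop:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define WB_ENTRIES 4   /* typical depth from the slide */

typedef struct { uint32_t addr, data; } WbEntry;

typedef struct {
    WbEntry entry[WB_ENTRIES];
    int head, tail, count;
} WriteBuffer;

/* Processor side: returns false (stall) only when the FIFO is full. */
static bool wb_push(WriteBuffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES) return false;        /* buffer saturated */
    wb->entry[wb->tail] = (WbEntry){addr, data};
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* Memory-controller side: drains one entry per DRAM write cycle. */
static bool wb_pop(WriteBuffer *wb, WbEntry *out) {
    if (wb->count == 0) return false;
    *out = wb->entry[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;
}

int main(void) {
    WriteBuffer wb = {0};
    for (uint32_t i = 0; i < 6; i++)         /* 6 stores, no drain between */
        if (!wb_push(&wb, 0x1000 + 4 * i, i))
            printf("store %u would stall: buffer full\n", i);
    WbEntry e;
    while (wb_pop(&wb, &e))
        printf("drain [0x%08x] <- %u\n", e.addr, e.data);
    return 0;
}
```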
21-22 Another Reference String Mapping
Consider the main memory word reference string 0 4 0 4 0 4 0 4. Start with an empty cache - all blocks initially marked as not valid.
  0 miss: 00 Mem(0)
  4 miss: 01 Mem(4) replaces 00 Mem(0)
  0 miss: 00 Mem(0) replaces 01 Mem(4)
  4 miss: 01 Mem(4) replaces 00 Mem(0)
  0 miss: 00 Mem(0) replaces 01 Mem(4)
  4 miss: 01 Mem(4) replaces 00 Mem(0)
  0 miss: 00 Mem(0) replaces 01 Mem(4)
  4 miss: 01 Mem(4) replaces 00 Mem(0)
8 requests, 8 misses
Ping pong effect due to conflict misses: two memory locations that map into the same cache block keep evicting each other.
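As a quick check, plugging refs[] = {0, 4, 0, 4, 0, 4, 0, 4} into the direct mapped simulation sketched after slides 15-16 reproduces exactly this ping pong behavior: every reference maps to index 0 with alternating tags, giving 8 requests, 8 misses.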
23 Sources of Cache Misses
- Compulsory (cold start or process migration; first reference):
  - First access to a block; a "cold" fact of life, not a whole lot you can do about it
  - If you are going to run "millions" of instructions, compulsory misses are insignificant
- Conflict (collision):
  - Multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size or block length
  - Solution 2: increase associativity
- Capacity:
  - Cache cannot contain all blocks accessed by the program
  - Solution: increase cache size
What about the relationship between cache size and block length?
Notes: Capacity misses occur because the cache is simply not large enough to contain all the blocks accessed by the program; the solution is straightforward - increase the cache size. To summarize the other types: compulsory misses are the misses we cannot avoid; they are incurred when we first start the program. Conflict misses are caused by multiple memory locations being mapped to the same cache location. There are two solutions to reduce conflict misses: once again, increase the cache size, or increase the associativity - for example, use a 2-way set associative cache instead of a direct mapped cache. But keep in mind that the miss rate is only one part of the equation; you also have to worry about cache access time and miss penalty, so do NOT optimize miss rate alone. Finally, there is another source of cache misses we will not cover today: invalidation misses, caused when another process (such as I/O) updates main memory, so the cache must be flushed to avoid inconsistency between memory and cache.
24 Handling Cache Misses
Read misses (I$ and D$):
- Stall the entire pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache, send the requested word to the processor, then let the pipeline resume
Write misses (D$ only), one of three options:
1. Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume
2. Write allocate (normally used in write-back caches): just write the word into the cache, updating both the tag and data; no need to check for a cache hit, no need to stall
3. No-write allocate (normally used in write-through caches with a write buffer): skip the cache write and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full; must invalidate the cache block since it will be inconsistent (now holding stale data)
Notes: Consider a 1KB direct mapped cache with 32-byte blocks, and assume a 16-bit write that misses. After we write the cache tag into the cache and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2, 3, ..., Byte 31) from memory? If we do read the rest of the block in, it is called write allocate. But stop and think for a second: is it really necessary to bring in the rest of the block on a write miss? True, the principle of spatial locality implies that we are likely to access it soon, but the access we are going to do is likely to be another write. So even if we do read in the data, we may end up overwriting it anyway; it is therefore common practice NOT to read in the rest of the block on a write miss. If you don't bring in the rest of the block - or, to use the more technical term, write no allocate - you had better have some way to tell the processor that the rest of the block is no longer valid. This brings us to the topic of sub-blocking.
25 Multiword Block Direct Mapped Cache
Four words/block, cache size = 1K words.
Address split: 20-bit Tag | 8-bit Index | 2-bit Block offset | 2-bit Byte offset. The cache has 256 entries (indices 0 through 255), each with a Valid bit, a 20-bit Tag, and four 32-bit data words; the block offset selects the word within the block.
To take advantage of spatial locality, we want a cache block that is larger than one word in size.
What kind of locality are we taking advantage of? Spatial!
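The corresponding address split in C (a sketch; the example address is an arbitrary assumption):

```c
#include <stdint.h>
#include <stdio.h>

/* 1K-word cache, four words per block: 256 blocks.
   Split: | 20-bit tag | 8-bit index | 2-bit block offset | 2-bit byte |. */
int main(void) {
    uint32_t addr = 0x0000004C;                  /* hypothetical byte address */
    uint32_t byte_offset  = addr & 0x3;
    uint32_t block_offset = (addr >> 2) & 0x3;   /* word within the block */
    uint32_t index        = (addr >> 4) & 0xFF;  /* which of 256 blocks   */
    uint32_t tag          = addr >> 12;          /* upper 20 bits         */
    printf("tag=%u index=%u word=%u byte=%u\n",
           tag, index, block_offset, byte_offset);
    return 0;
}
```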
26-27 Taking Advantage of Spatial Locality
Let the cache block hold more than one word. Consider the same word reference string 0 1 2 3 4 3 4 15 with two-word blocks, starting with an empty cache - all blocks initially marked as not valid.
  0  miss: 00 Mem(1) Mem(0)
  1  hit:  00 Mem(1) Mem(0)
  2  miss: 00 Mem(1) Mem(0), 00 Mem(3) Mem(2)
  3  hit:  00 Mem(1) Mem(0), 00 Mem(3) Mem(2)
  4  miss: 01 Mem(5) Mem(4) replaces 00 Mem(1) Mem(0); 00 Mem(3) Mem(2)
  3  hit:  01 Mem(5) Mem(4), 00 Mem(3) Mem(2)
  4  hit:  01 Mem(5) Mem(4), 00 Mem(3) Mem(2)
  15 miss: 11 Mem(15) Mem(14) replaces 00 Mem(3) Mem(2); 01 Mem(5) Mem(4)
8 requests, 4 misses
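Extending the earlier simulation sketch to two-word blocks (an illustrative assumption: the slide's cache holds two such blocks) reproduces the 8 requests / 4 misses count:

```c
#include <stdio.h>
#include <stdbool.h>

#define NBLOCKS         2   /* two blocks in the cache     */
#define WORDS_PER_BLOCK 2   /* each block holds two words  */

int main(void) {
    int refs[] = {0, 1, 2, 3, 4, 3, 4, 15};
    int n = sizeof refs / sizeof refs[0];
    bool valid[NBLOCKS] = {false};
    int tag[NBLOCKS], misses = 0;

    for (int i = 0; i < n; i++) {
        int block = refs[i] / WORDS_PER_BLOCK;  /* word -> block address */
        int index = block % NBLOCKS;
        int t     = block / NBLOCKS;
        if (valid[index] && tag[index] == t) {
            printf("%2d: hit\n", refs[i]);
        } else {
            printf("%2d: miss\n", refs[i]);
            valid[index] = true;
            tag[index]   = t;
            misses++;
        }
    }
    printf("%d requests, %d misses\n", n, misses);  /* 8 requests, 4 misses */
    return 0;
}
```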
28 Miss Rate vs Block Size vs Cache Size
Miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses).
29 Block Size Tradeoff
Larger block sizes take advantage of spatial locality, but:
- If the block size is too big relative to the cache size, the miss rate will go up (fewer blocks compromises temporal locality)
- Larger block size means larger miss penalty: latency to first word in block + transfer time for remaining words
[Figure: three curves vs block size - miss rate (exploits spatial locality, then rises as fewer blocks compromise temporal locality), miss penalty (grows steadily), and average access time (falls, then rises with the increased miss penalty and miss rate)]
Notes: As I said earlier, block size is a tradeoff. In general, a larger block size will reduce the miss rate because it takes advantage of spatial locality. But remember, the miss rate is NOT the only cache performance metric; you also have to worry about the miss penalty. As you increase the block size, the miss penalty goes up, because it takes longer to fill a larger block. And even looking at the miss rate by itself (which you should not), a bigger block size does not always win: with the cache size held constant, the miss rate drops off rapidly at first due to spatial locality, but past a certain point it actually goes up. As a result of these two curves, the average access time - really the more important performance metric - goes down initially, because the miss rate is dropping much faster than the miss penalty is increasing. But eventually, as you keep increasing the block size, the average access time can rise rapidly, because not only is the miss penalty increasing, the miss rate is increasing as well. In general, Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty.
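A worked instance of the closing formula (the hit time, miss rate, and miss penalty values are arbitrary assumptions):

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty. */
int main(void) {
    double hit_time     = 1.0;    /* cycles, assumed */
    double miss_rate    = 0.05;   /* 5%, assumed     */
    double miss_penalty = 100.0;  /* cycles, assumed */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);  /* 1 + 0.05 * 100 = 6.0 */
    return 0;
}
```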
30 Multiword Block Considerations
Read misses (I$ and D$):
- Processed the same as for single word blocks - a miss returns the entire block from memory
- Miss penalty grows as block size grows
  - Early restart: the datapath resumes execution as soon as the requested word of the block is returned
  - Requested word first: the requested word is transferred from memory to the cache (and datapath) first
- Nonblocking cache: allows the datapath to continue to access the cache while the cache is handling an earlier miss
Write misses (D$):
- Can't use write allocate in the "just write the word" sense, or we end up with a "garbled" block in the cache (e.g., for 4-word blocks: a new tag, one word of data from the new block, and three words of data from the old block), so we must fetch the block from memory first and pay the stall time
Notes: Early restart works best for instruction caches (since accesses are mostly sequential) - if the memory system can deliver a word every clock cycle, it can return words just in time. But if the processor needs a word from a different block before the previous transfer is complete, it will have to stall until the memory is no longer busy - unless you have a nonblocking cache, which comes in two flavors:
- Hit under miss: allow additional cache hits during a miss, with the goal of hiding some of the miss latency
- Miss under miss: allow multiple outstanding cache misses (needs a high bandwidth memory system to support it)
31 Cache Summary
The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time.
- Temporal Locality: locality in time
- Spatial Locality: locality in space
Three major categories of cache misses:
- Compulsory misses: sad facts of life; example: cold start misses
- Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping pong effect!
- Capacity misses: increase cache size
Cache design space:
- total size, block size, associativity (replacement policy)
- write-hit policy (write-through, write-back)
- write-miss policy (write allocate, write buffers)
Notes: Let's summarize today's lecture. I know you have heard this many times and many ways, but it is still worth repeating: memory hierarchies work because of the Principle of Locality, which says a program will access a relatively small portion of the address space at any instant of time. There are two types of locality: temporal locality (locality in time) and spatial locality (locality in space). So far, we have covered three major categories of cache misses. Compulsory misses are cache misses due to cold start; you cannot avoid them, but if you are going to run billions of instructions anyway, they usually don't bother you. Conflict misses are caused by multiple memory locations being mapped to the same cache location; the nightmare scenario is the ping pong effect, where a block is read into the cache but is immediately forced out by another conflict miss before we have a chance to use it. You can reduce conflict misses by increasing the cache size, increasing the associativity, or both. Finally, capacity misses occur when the cache is not big enough to contain all the cache blocks required by the program; you can reduce this miss rate by making the cache larger. There are two policies for cache writes: write-through, which requires a write buffer (the nightmare scenario being stores so frequent that they saturate the buffer), and write-back, where you write only to the cache, and write the block back to memory only when it is replaced. No fancy replacement policy is needed for a direct mapped cache - a block has only one place to go, which is what causes the direct mapped cache's conflict-miss trouble to begin with.
32 Measuring Cache Performance
Assuming cache hit costs are included as part of the normal CPU execution cycle:
  CPU time = IC x CPIstall x CC
           = IC x (CPIideal + Memory-stall cycles per instruction) x CC
Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls):
  Read-stall cycles  = reads/program x read miss rate x read miss penalty
  Write-stall cycles = (writes/program x write miss rate x write miss penalty) + write buffer stalls
For write-through caches, we can simplify this to:
  Memory-stall cycles = miss rate x miss penalty
A reasonable write buffer depth (e.g., four or more words) and a memory capable of accepting writes at a rate that significantly exceeds the average write frequency mean that write buffer stalls are small.
  Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
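A worked instance of the CPU-time formula, with illustrative numbers that are assumptions rather than slide data:

```c
#include <stdio.h>

/* CPU time = IC * (CPI_ideal + memory-stall cycles per instruction) * CC. */
int main(void) {
    double ic = 1e9;                   /* instruction count, assumed        */
    double cc = 0.5e-9;                /* 0.5 ns clock cycle, assumed       */
    double cpi_ideal = 1.0;            /* assumed                           */
    double accesses_per_instr = 1.36;  /* 1 fetch + 0.36 load/store, assumed */
    double miss_rate = 0.02, miss_penalty = 100.0;

    double stalls   = accesses_per_instr * miss_rate * miss_penalty; /* 2.72 */
    double cpu_time = ic * (cpi_ideal + stalls) * cc;
    printf("stalls/instr = %.2f, CPU time = %.2f s\n", stalls, cpu_time);
    return 0;
}
```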
33 Impacts of Cache Performance
- The relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI):
  - Memory speed is unlikely to improve as fast as processor cycle time. When calculating CPIstall, the cache miss penalty is measured in processor clock cycles needed to handle a miss.
  - The lower the CPIideal, the more pronounced the impact of stalls.
- Example: a processor with a CPIideal of 2, a 100 cycle miss penalty, 36% load/store instructions, and 2% I$ and 4% D$ miss rates:
  Memory-stall cycles = 2% x 100 + 36% x 4% x 100 = 3.44
  So CPIstall = 2 + 3.44 = 5.44
- What if CPIideal is reduced to 1? To 0.5?
- What if the processor clock rate is doubled (doubling the miss penalty)?
Notes: For CPIideal = 1, CPIstall = 1 + 3.44 = 4.44, and the fraction of execution time spent on memory stalls rises from 3.44/5.44 = 63% to 3.44/4.44 = 77%. For a miss penalty of 200 cycles, memory-stall cycles = 2% x 200 + 36% x 4% x 200 = 6.88, so CPIstall = 2 + 6.88 = 8.88. This all assumes that hit time is not a factor in determining cache performance; a larger cache would have a longer access time (even with a lower miss rate), meaning either a slower clock cycle or more stages in the pipeline for memory access.
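The slide's arithmetic, checked in C (only the parameters given above are used; the program itself is an illustrative sketch):

```c
#include <stdio.h>

/* CPI_ideal = 2, miss penalty = 100 cycles,
   36% load/stores, 2% I$ miss rate, 4% D$ miss rate. */
int main(void) {
    double cpi_ideal = 2.0, penalty = 100.0;
    double ls_frac = 0.36, imiss = 0.02, dmiss = 0.04;

    double stalls    = imiss * penalty + ls_frac * dmiss * penalty; /* 3.44 */
    double cpi_stall = cpi_ideal + stalls;                          /* 5.44 */
    printf("stalls/instr = %.2f, CPI_stall = %.2f\n", stalls, cpi_stall);
    printf("time stalled = %.0f%%\n", 100.0 * stalls / cpi_stall);  /* 63%  */

    /* Doubling the clock rate doubles the penalty in cycles: */
    stalls = imiss * 200.0 + ls_frac * dmiss * 200.0;               /* 6.88 */
    printf("CPI_stall at 2x clock = %.2f\n", cpi_ideal + stalls);   /* 8.88 */
    return 0;
}
```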
34 Reducing Cache Miss Rates #1: Allow More Flexible Block Placement
- In a direct mapped cache, a memory block maps to exactly one cache block
- At the other extreme, we could allow a memory block to be mapped to any cache block - a fully associative cache
- A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices):
  set index = (block address) modulo (# sets in the cache)
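A sketch of the set-selection arithmetic (the cache geometry and block address below are illustrative assumptions):

```c
#include <stdint.h>
#include <stdio.h>

/* set = (block address) mod (number of sets). */
#define NBLOCKS 8
#define NWAYS   2
#define NSETS   (NBLOCKS / NWAYS)   /* 4 sets */

int main(void) {
    uint32_t block_addr = 13;           /* hypothetical block address        */
    uint32_t set = block_addr % NSETS;  /* index field picks the set         */
    uint32_t tag = block_addr / NSETS;  /* compared against all NWAYS ways   */
    printf("set=%u tag=%u (block may go in any of %d ways)\n",
           set, tag, NWAYS);
    return 0;
}
```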