Memory Hierarchy Design
Chapter 5
Karin Strauss
Background
1980: no caches
1995: two levels of caches
2004: even three levels of caches
Why? The processor-memory gap.
Processor-Memory Gap
[Figure: processor vs. DRAM performance, 1980-2000. CPU performance improves ~60%/yr while DRAM improves only ~7%/yr, so the gap widens every year.]
Source: lecture handouts, Prof. John Kubiatowicz, CS252, U.C. Berkeley
Because…
Memory speed is a limiting factor in performance. Caches are small and fast, and they leverage the principle of locality:
- Temporal locality: data that has been referenced recently tends to be re-referenced soon
- Spatial locality: data close (in the address space) to recently referenced data tends to be referenced soon
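As an illustrative sketch (cache sizes and access patterns here are invented for the example), a tiny direct-mapped cache model shows why sequential accesses benefit from spatial locality while block-strided accesses do not:

```python
# Hypothetical sketch: count hits in a tiny direct-mapped cache to show
# spatial locality. Block size: 4 words; 8 block frames. Illustrative only.

def simulate(addresses, block_words=4, num_blocks=8):
    """Return the hit count for a stream of word addresses."""
    tags = [None] * num_blocks          # one tag per cache block frame
    hits = 0
    for addr in addresses:
        block = addr // block_words     # block address
        index = block % num_blocks      # which frame (direct mapped)
        tag = block // num_blocks       # remaining high-order bits
        if tags[index] == tag:
            hits += 1                   # reuse of a cached block
        else:
            tags[index] = tag           # miss: fill the frame
    return hits

sequential = list(range(32))            # walks each 4-word block in order
strided = list(range(0, 32 * 4, 4))     # jumps to a new block every access
print(simulate(sequential))  # 24 hits: 3 of every 4 accesses reuse a block
print(simulate(strided))     # 0 hits: no spatial reuse at all
```

Both streams touch 32 words, but only the sequential one turns spatial locality into hits.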
Review
Cache block: minimum unit of information that can be present in the cache (several contiguous memory positions)
Cache hit: requested data can be found in the cache
Cache miss: requested data cannot be found in the cache
The four design questions:
1. Where can a block be placed?
2. How can a block be found?
3. Which block should be replaced?
4. What happens on a write?
Where can a block be placed?
Suppose we need to place block 10 in an 8-block cache:
- Direct mapped (1-way): 10 mod 8 = set 2
- 2-way set associative: 10 mod 4 = set 2
- 4-way set associative: 10 mod 2 = set 0
- Fully associative (8-way, in this case): anywhere
Placement: set = (block address) mod (# sets), where (# sets) = (# blocks) / (# ways)
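The placement rule can be sketched directly; this minimal helper just reproduces the slide's arithmetic for the 8-block example:

```python
# set = (block address) mod (# sets), where (# sets) = (# blocks) / ways.

def placement_set(block_addr, num_blocks, ways):
    """Return the set index a block maps to for a given associativity."""
    num_sets = num_blocks // ways
    return block_addr % num_sets

print(placement_set(10, 8, 1))  # direct mapped (1-way): set 2
print(placement_set(10, 8, 2))  # 2-way set associative: set 2
print(placement_set(10, 8, 4))  # 4-way set associative: set 0
print(placement_set(10, 8, 8))  # fully associative: one set, goes anywhere
```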
How can a block be found? Look at the address!
Address layout: Tag | Index | Block Offset (Tag | Index together form the block address)
- Index: determines the set (no index in fully associative caches)
- Block offset: determines the offset within the block
- Tag: the block's unique id, its "primary key"
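Assuming power-of-two block size and set count (the sizes below are invented for illustration), splitting the address into the three fields is a matter of shifts and masks:

```python
# Illustrative sketch: split an address into tag / index / block offset.
# Assumes block size and set count are powers of two; field widths are log2.

def split_address(addr, block_bytes=64, num_sets=128):
    offset_bits = block_bytes.bit_length() - 1   # log2(64)  = 6
    index_bits = num_sets.bit_length() - 1       # log2(128) = 7
    offset = addr & (block_bytes - 1)            # position within the block
    index = (addr >> offset_bits) & (num_sets - 1)   # which set
    tag = addr >> (offset_bits + index_bits)     # the block's "primary key"
    return tag, index, offset

print(split_address(0x12345))  # (9, 13, 5)
```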
Which block should be replaced?
- Random
- Least Recently Used (LRU): true LRU may be too costly to implement in hardware (requires maintaining a full recency ordering, i.e., a stack), so simplified LRU is used instead
- First In, First Out (FIFO)
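True LRU's bookkeeping can be sketched for a single set; this illustration keeps the recency "stack" as an ordered dictionary, which is exactly the state hardware tries to avoid maintaining:

```python
# Illustrative true-LRU model for one cache set (software sketch only;
# real hardware uses cheaper approximations such as pseudo-LRU trees).
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()          # least recently used first

    def access(self, tag):
        """Return True on hit; on a miss, evict the LRU block if full."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[tag] = True              # fill with the new block
        return False

s = LRUSet(2)
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
# [False, False, True, False, False]: C evicts B, then B evicts A
```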
What happens on a write?
Write through: every time a block is written, the new value is propagated to the next memory level
- Easier to implement
- Makes displacement simple and fast: reads never have to wait for a displacement to finish
- Writes may have to wait; use a write buffer
Write back: the new value is propagated to the next memory level only when the block is displaced
- Makes writes fast
- Uses less memory bandwidth; a dirty bit may save additional bandwidth (no need to write back clean blocks)
- Saves power
What happens on a write? (cont.)
Write allocate ("fetch on write"): the entire block is brought into the cache on a write miss
No write allocate ("write around"): the written word is sent directly to the next memory level
Write policy and write-miss policy are independent, but usually:
- Write back → write allocate
- Write through → no write allocate
Cache Performance
AMAT = (hit time) + (miss rate) * (miss penalty)

              Unified cache   Split caches (I-cache / D-cache)
Size          32KB            16KB / 16KB
Miss rate     1.99%           0.64% / 6.47%
Hit time      I:1 / D:2       1 / 1
Miss penalty  50              50

75% of accesses are instruction references. Which system is faster?
Solution:
AMAT(split) = 0.75*(1 + 0.64%*50) + 0.25*(1 + 6.47%*50) = 2.05
AMAT(unified) = 0.75*(1 + 1.99%*50) + 0.25*(2 + 1.99%*50) = 2.24
Miss rate(split) = 0.75*0.64% + 0.25*6.47% = 2.10%
Miss rate(unified) = 1.99%
Although split has a higher miss rate, it is faster on average!
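The solution above can be checked in a few lines; all numbers come straight from the table:

```python
# Recomputing the slide's AMAT comparison.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles."""
    return hit_time + miss_rate * miss_penalty

# Split: 75% instruction refs (I-cache), 25% data refs (D-cache).
split = 0.75 * amat(1, 0.0064, 50) + 0.25 * amat(1, 0.0647, 50)
# Unified: hit time is 1 for instructions but 2 for data.
unified = 0.75 * amat(1, 0.0199, 50) + 0.25 * amat(2, 0.0199, 50)

print(split)    # about 2.05 cycles
print(unified)  # about 2.24 cycles
```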
Processor Performance
CPU time = (proc cycles + mem stall cycles) * (clock cycle time)
proc cycles = IC * CPI
mem stall cycles = (mem accesses) * (miss rate) * (miss penalty)

CPI (proc): 2.0; miss penalty: 50 cycles; miss rate: 2%; mem refs/inst: 1.33

What is the total CPU time, including the caches, as a function of IC and clock cycle time?
CPU time = (IC*2.0 + IC*1.33*0.02*50) * (clock cycle time)
CPU time = IC * 3.33 * (clock cycle time)
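The same calculation as code: IC and clock cycle time stay symbolic, so the sketch computes the effective cycles per instruction instead.

```python
# Effective CPI = base CPI + memory stall cycles per instruction,
# using the slide's numbers (2.0 CPI, 1.33 refs/inst, 2% misses, 50-cycle penalty).

def effective_cpi(base_cpi, mem_refs_per_inst, miss_rate, miss_penalty):
    # memory stall cycles per instruction = refs/inst * miss rate * penalty
    return base_cpi + mem_refs_per_inst * miss_rate * miss_penalty

cpi = effective_cpi(2.0, 1.33, 0.02, 50)
print(cpi)  # about 3.33, so CPU time = IC * 3.33 * (clock cycle time)
```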
Processor Performance
AMAT has a large impact on performance:
- If CPI decreases, memory stall cycles represent a larger fraction of total cycles
- If clock cycle time decreases, memory stall cycles represent more cycles
Note: in out-of-order execution processors, part of the memory access latency is overlapped with computation
Improving Cache Performance
AMAT = (hit time) + (miss rate) * (miss penalty)
Reducing hit time:
- Small and simple caches
- No address translation on the critical path
- Pipelined cache access
- Trace caches
Reducing miss rate:
- Larger block size
- Larger cache size
- Higher associativity
- Way prediction or pseudo-associative caches
- Compiler optimizations (code/data layout)
Reducing miss penalty:
- Multilevel caches
- Critical word first
- Read miss before write miss
- Merging write buffers
- Victim caches
Reducing miss rate and miss penalty by increasing parallelism:
- Non-blocking caches
- Prefetching (hardware or software)