Computation I pg 1 Embedded Computer Architecture Memory Hierarchy: Cache Recap Course 5KK73 Henk Corporaal November 2014


1 Computation I pg 1 Embedded Computer Architecture Memory Hierarchy: Cache Recap Course 5KK73 Henk Corporaal November 2014

2 Memory Hierarchy, why?
Users want large and fast memories!
– SRAM access times are 1 – 10 ns
– DRAM access times are ... ns
– Disk access times are 5 to 10 million ns, but its bits are very cheap
Get the best of both worlds, fast and large memories:
– build a memory hierarchy
[Figure: CPU connected to Level 1, Level 2, ..., Level n, with axes Size and Speed]

3 Memory recap
We can build a memory: a logical k × m array of stored bits.
– Address space: number of locations (usually a power of 2); with an n-bit address, k = 2^n locations
– Addressability: m, the number of bits per location (e.g., byte-addressable); usually m = 8 bits per location
[Figure: memory array with an n-bit address in, k = 2^n locations, and m bits of data per entry]

4 Memory element: SRAM vs DRAM
SRAM:
– value is stored with a pair of inverting gates
– very fast but takes up more space than DRAM (4 to 6 transistors)
DRAM:
– value is stored as a charge on a capacitor
– very small but slower than SRAM (factor of 5 to 10)
– charge leaks, so refresh is needed

5 Latest Intel: i7 Ivy Bridge, 22 nm
– Sandy Bridge 32 nm -> 22 nm
– includes graphics, USB3, etc.; 3 levels of cache

6 Exploiting Locality
Locality = principle that makes having a memory hierarchy a good idea.
If an item is referenced:
– temporal locality: it will tend to be referenced again soon
– spatial locality: nearby items will tend to be referenced soon
Why does code have locality?
Our initial focus: two levels (upper, lower)
– block: minimum unit of data
– hit: data requested is in the upper level
– miss: data requested is not in the upper level
[Figure: blocks moving between the upper level ($) and the lower level]

7 Cache operation
[Figure: blocks (lines) held in memory (lower level) and in the cache (higher level); each cache line has tags and data]

8 Direct Mapped Cache
Mapping: the cache address is the memory (block) address modulo the number of blocks in the cache.
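The modulo mapping can be sketched in a few lines. This is only an illustration: the 8-block cache size and the function name `cache_index` are assumptions, not taken from the slides.

```python
def cache_index(block_address: int, num_blocks: int) -> int:
    """Return the direct-mapped cache slot for a memory block address."""
    return block_address % num_blocks


NUM_BLOCKS = 8  # assumed cache size, chosen only for illustration

# Block addresses that differ by a multiple of NUM_BLOCKS share one slot,
# so they evict each other in a direct mapped cache.
for addr in (1, 9, 17, 25):
    print(addr, "->", cache_index(addr, NUM_BLOCKS))
```

All four addresses land in the same slot, which is exactly the kind of conflict that associativity is meant to reduce.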

9 Direct Mapped Cache
Q: What kind of locality are we taking advantage of in this example?
[Figure: address (bit positions) split into tag, index, and byte offset; each cache entry holds a valid bit, a tag, and data; outputs are Hit and Data]

10 Direct Mapped Cache
This example (also) exploits spatial locality by having larger blocks.
[Figure: address (bit positions) for a direct mapped cache with larger blocks]

11 Hits vs. Misses
Read hits
– this is what we want!
Read misses
– stall the CPU, fetch the block from memory, deliver it to the cache, restart the load instruction
Write hits:
– replace the data in both cache and memory (write-through)
– write the data only into the cache (write back to memory later)
Write misses:
– read the entire block into the cache, then write the word (allocate on write miss)
– do not read the cache line; just write to memory (no allocate on write miss)

12 Splitting first level cache
Use split Instruction and Data caches
– Caches can be tuned differently
– Avoids a dual ported cache
[Figure: CPU with split L1 caches (I$ and D$), a unified L2 cache (I&D $), and main memory]

13 Let’s look at cache & memory performance
T_exec = N_cycles x T_cycle = N_inst x CPI x T_cycle
with
CPI = CPI_ideal + CPI_stall
CPI_stall = %reads x missrate_read x misspenalty_read + %writes x missrate_write x misspenalty_write
or:
T_exec = (N_normal-cycles + N_stall-cycles) x T_cycle
with
N_stall-cycles = N_reads x missrate_read x misspenalty_read + N_writes x missrate_write x misspenalty_write (+ write-buffer stalls)

14 Performance example (1)
Assume an application with:
– Icache missrate 2%
– Dcache missrate 4%
– Fraction of ld-st instructions = 36%
– CPI_ideal (i.e. without cache misses) is 2.0
– Misspenalty 40 cycles
Calculate CPI taking misses into account:
CPI = CPI_ideal + CPI_stall = 2.0 + CPI_stall
CPI_stall = (Instruction-miss cycles + Data-miss cycles) / N_instr
Instruction-miss cycles = N_instr x 0.02 x 40 = 0.80 N_instr
Data-miss cycles = N_instr x 0.36 x 0.04 x 40 = 0.576 N_instr, taken as 0.56 N_instr here
CPI = 2.0 + 0.80 + 0.56 = 3.36
Slowdown: 3.36 / 2.0 = 1.68 !!
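The calculation above can be checked with a small sketch. The function name `cpi_with_misses` and its parameter names are made up for illustration; the inputs are the ones on the slide. Without intermediate rounding the result comes out slightly above the 3.36 quoted on the slide, because 0.36 x 0.04 x 40 is 0.576.

```python
def cpi_with_misses(cpi_ideal, icache_missrate, dcache_missrate,
                    ldst_fraction, miss_penalty):
    """CPI = CPI_ideal + stall cycles per instruction."""
    # Every instruction is fetched, so every instruction can miss in the Icache.
    instr_stall = icache_missrate * miss_penalty
    # Only loads and stores access the Dcache.
    data_stall = ldst_fraction * dcache_missrate * miss_penalty
    return cpi_ideal + instr_stall + data_stall


cpi = cpi_with_misses(cpi_ideal=2.0, icache_missrate=0.02,
                      dcache_missrate=0.04, ldst_fraction=0.36,
                      miss_penalty=40)
print(f"CPI = {cpi:.3f}, slowdown = {cpi / 2.0:.3f}")
```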

15 Performance example (2)
1. What if the ideal processor had CPI = 1.0 (instead of 2.0)?
– CPI = 1.0 + 1.36 = 2.36, so the slowdown would be 2.36 !
2. What if the processor is clocked twice as fast => penalty becomes 80 cycles
– CPI = 2.0 + 0.02 x 80 + 0.36 x 0.04 x 80 ≈ 4.75
– Speedup = N x CPI_a x T_clock / (N x CPI_b x T_clock / 2) = 3.36 / (4.75 / 2)
– Speedup is not 2, but only 1.41 !!
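A short sketch of the clock-doubling case, under the same assumptions as the example (2% Icache miss rate, 4% Dcache miss rate, 36% loads and stores). The helper name `cpi` and its defaults are illustrative only. The memory latency in nanoseconds stays fixed, so doubling the clock doubles the miss penalty in cycles.

```python
def cpi(cpi_ideal, miss_penalty, icache_mr=0.02, dcache_mr=0.04, ldst=0.36):
    """CPI including instruction-miss and data-miss stall cycles."""
    return cpi_ideal + (icache_mr + ldst * dcache_mr) * miss_penalty


cpi_a = cpi(2.0, miss_penalty=40)   # original clock
cpi_b = cpi(2.0, miss_penalty=80)   # clock twice as fast, penalty doubles

# Execution time is N x CPI x T_clock; machine b has half the cycle time.
speedup = cpi_a / (cpi_b / 2)
print(f"CPI_a = {cpi_a:.3f}, CPI_b = {cpi_b:.3f}, speedup = {speedup:.2f}")
```

The speedup stays well below 2 because the stall cycles grow with the clock rate while only the non-stall part benefits.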

16 Improving cache / memory performance
Ways of improving performance:
– decreasing the miss ratio (avoiding conflicts): associativity
– decreasing the miss penalty: multilevel caches
– adapting the block size: see earlier slides
– Note: there are many more ways to improve memory performance (see e.g. master course 5MD00)

17 How to reduce CPI_stall?
CPI_stall = %reads x missrate_read x misspenalty_read + %writes x missrate_write x misspenalty_write
Reduce missrate:
– Larger cache: avoids capacity misses; however, a large cache may increase T_cycle
– Larger block (line) size: exploits spatial locality (see previous lecture)
– Associative cache: avoids conflict misses
Reduce misspenalty:
– Add a 2nd level of cache

18 Decreasing miss ratio with associativity
[Figure: cache organizations with 2 blocks / set, 4 blocks / set, and 8 blocks / set]

19 An implementation: 4 way associative

20 Performance of Associative Caches
[Figure: performance of associative caches for cache sizes 1 KB, 2 KB, and 8 KB]

21 Further Cache Basics
cache_size = N_sets x Associativity x Block_size
block_address = Byte_address DIV Block_size (in bytes)
index = Block_address MOD N_sets
Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently.
[Figure: 32-bit address, bit 31 ... 2 1 0, split into tag, index, and block offset; tag and index together form the block address]
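As a sketch of why powers of two make DIV and MOD cheap, the split into tag, index, and block offset can be done with shifts and masks. The function name `split_address` and the example address are assumptions for illustration.

```python
def split_address(byte_address: int, block_size: int, num_sets: int):
    """Split a byte address into (tag, index, block offset).

    Assumes block_size (in bytes) and num_sets are powers of two.
    """
    offset_bits = block_size.bit_length() - 1   # log2(block_size)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)

    offset = byte_address & (block_size - 1)        # byte_address MOD block_size
    block_address = byte_address >> offset_bits     # byte_address DIV block_size
    index = block_address & (num_sets - 1)          # block_address MOD num_sets
    tag = block_address >> index_bits               # remaining upper bits
    return tag, index, offset


# Hypothetical address, 16-byte blocks, 4096 sets
tag, index, offset = split_address(0x12345678, block_size=16, num_sets=4096)
print(hex(tag), hex(index), hex(offset))
```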

22 Comparing different (1-level) caches (1)
Assume:
– Cache of 4K blocks
– 4 word block size
– 32 bit address
Direct mapped (associativity = 1):
– 16 bytes per block = 2^4, so 4 bits of block offset
– 32 bit address: 32 - 4 = 28 bits for index and tag
– #sets = #blocks / associativity = 4K; log2 of 4K = 12, so 12 bits for index
– Total number of tag bits: (28 - 12) x 4K = 64 Kbits
2-way associative:
– #sets = #blocks / associativity = 2K sets
– 1 bit less for indexing, 1 bit more for tag
– Tag bits: (28 - 11) x 2 x 2K = 68 Kbits
4-way associative:
– #sets = #blocks / associativity = 1K sets
– again 1 bit less for indexing, 1 bit more for tag (compared with 2-way)
– Tag bits: (28 - 10) x 4 x 1K = 72 Kbits
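The tag-bit totals can be reproduced with a small helper. The name `total_tag_bits` is hypothetical, and the sketch assumes 4-byte words, so that a 4-word block is 16 bytes as stated on the slide.

```python
def total_tag_bits(num_blocks, block_bytes, associativity, address_bits=32):
    """Total tag storage in bits for a cache; sizes must be powers of two."""
    num_sets = num_blocks // associativity
    offset_bits = block_bytes.bit_length() - 1   # log2(block size in bytes)
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    tag_bits = address_bits - offset_bits - index_bits
    return tag_bits * num_blocks                 # one tag per block


for ways in (1, 2, 4):
    kbits = total_tag_bits(num_blocks=4096, block_bytes=16,
                           associativity=ways) // 1024
    print(f"{ways}-way: {kbits} Kbits of tag storage")
```

Each doubling of associativity halves the number of sets, which moves one bit from the index to the tag of every block.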

23 Comparing different (1-level) caches (2)
3 caches, each consisting of 4 one-word blocks:
– Cache 1: fully associative
– Cache 2: two-way set associative
– Cache 3: direct mapped
Suppose the following sequence of block addresses: 0, 8, 0, 6, 8

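24 Direct Mapped

| Block address | Cache block |
|---|---|
| 0 | 0 mod 4 = 0 |
| 6 | 6 mod 4 = 2 |
| 8 | 8 mod 4 = 0 |

| Address of memory block | Hit or miss | Location 0 | Location 1 | Location 2 | Location 3 |
|---|---|---|---|---|---|
| 0 | miss | Mem[0] | | | |
| 8 | miss | Mem[8] | | | |
| 0 | miss | Mem[0] | | | |
| 6 | miss | Mem[0] | | Mem[6] | |
| 8 | miss | Mem[8] | | Mem[6] | |

On the slide, coloured cells mark a new entry, i.e. a miss.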

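25 2-way Set Associative: 2 sets

| Block address | Cache set |
|---|---|
| 0 | 0 mod 2 = 0 |
| 6 | 6 mod 2 = 0 |
| 8 | 8 mod 2 = 0 |

(so all blocks go to set 0)

| Address of memory block | Hit or miss | Set 0 entry 0 | Set 0 entry 1 | Set 1 entry 0 | Set 1 entry 1 |
|---|---|---|---|---|---|
| 0 | Miss | Mem[0] | | | |
| 8 | Miss | Mem[0] | Mem[8] | | |
| 0 | Hit | Mem[0] | Mem[8] | | |
| 6 | Miss | Mem[0] | Mem[6] | | |
| 8 | Miss | Mem[8] | Mem[6] | | |

On each miss in a full set, the least recently used block is replaced.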

26 Fully associative (4 way assoc., 1 set)

| Address of memory block | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3 |
|---|---|---|---|---|---|
| 0 | Miss | Mem[0] | | | |
| 8 | Miss | Mem[0] | Mem[8] | | |
| 0 | Hit | Mem[0] | Mem[8] | | |
| 6 | Miss | Mem[0] | Mem[8] | Mem[6] | |
| 8 | Hit | Mem[0] | Mem[8] | Mem[6] | |
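The three traces can be replayed with a minimal simulator. This is a sketch under stated assumptions: LRU replacement in every set, block addresses only (no data), and a hypothetical function name `count_misses`.

```python
from collections import OrderedDict


def count_misses(block_addresses, num_blocks, associativity):
    """Simulate a set-associative cache with LRU replacement; return the miss count."""
    num_sets = num_blocks // associativity
    # One ordered dict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in block_addresses:
        cache_set = sets[addr % num_sets]
        if addr in cache_set:
            cache_set.move_to_end(addr)          # hit: mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= associativity:
                cache_set.popitem(last=False)    # evict the least recently used block
            cache_set[addr] = True
    return misses


trace = [0, 8, 0, 6, 8]
for name, ways in (("direct mapped", 1), ("2-way", 2), ("fully associative", 4)):
    print(name, count_misses(trace, num_blocks=4, associativity=ways))
```

For this trace the miss count drops from 5 (direct mapped) to 4 (2-way) to 3 (fully associative), matching the tables.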

27 Review: Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
– Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the upper level? (Block identification)
– Tag/Block
Q3: Which block should be replaced on a miss? (Block replacement)
– Random, FIFO, LRU
Q4: What happens on a write? (Write strategy)
– Write Back or Write Through (with Write Buffer)

28 Classifying Misses: the 3 Cs
– Compulsory: the first access to a block is always a miss. Also called cold start misses. (= misses in an infinite cache)
– Capacity: misses resulting from the finite capacity of the cache. (= misses in a fully associative cache with optimal replacement strategy)
– Conflict: misses occurring because several blocks map to the same set. Also called collision misses. (= the remaining misses)

29 3 Cs: Compulsory, Capacity, Conflict
In all cases, assume the total cache size is not changed. What happens if we:
1) Change block size: which of the 3 Cs is obviously affected? Compulsory misses
2) Change cache size: which of the 3 Cs is obviously affected? Capacity misses
3) Introduce higher associativity: which of the 3 Cs is obviously affected? Conflict misses

30 3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (among them conflict misses)]

31 Second Level Cache (L2)
Most CPUs
– have an L1 cache small enough to match the cycle time (reduce the time to hit the cache)
– have an L2 cache large enough and with sufficient associativity to capture most memory accesses (reduce miss rate)
L2 equations, Average Memory Access Time (AMAT):
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
Definitions:
– Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
– Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
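The AMAT equations translate directly into code. The sketch below uses made-up numbers purely for illustration (they are not from the slides), and the function names `amat` and `global_miss_rate_l2` are hypothetical. Note that the L2 miss rate in the AMAT formula is the local one.

```python
def amat(hit_time_l1, miss_rate_l1, hit_time_l2, local_miss_rate_l2,
         miss_penalty_l2):
    """Average memory access time for a two-level cache, in cycles."""
    miss_penalty_l1 = hit_time_l2 + local_miss_rate_l2 * miss_penalty_l2
    return hit_time_l1 + miss_rate_l1 * miss_penalty_l1


def global_miss_rate_l2(miss_rate_l1, local_miss_rate_l2):
    """Fraction of all CPU memory accesses that miss in both L1 and L2."""
    return miss_rate_l1 * local_miss_rate_l2


# Illustrative values only: 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit,
# 40% local L2 miss rate, 100-cycle memory penalty.
print(amat(1, 0.05, 10, 0.4, 100))
print(global_miss_rate_l2(0.05, 0.4))
```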

32 Second Level Cache (L2)
Suppose a processor with:
– base CPI of 1.0
– clock rate of 500 MHz (2 ns per cycle)
– main memory access time: 200 ns
– miss rate per instruction of the primary cache: 5%
What improvement do we get with a second cache having a 20 ns access time that reduces the miss rate to memory to 2%?
Miss penalty: 200 ns / 2 ns per cycle = 100 clock cycles
Effective CPI = base CPI + memory stall cycles per instruction
– 1-level cache: total CPI = 1 + 5% x 100 = 6
– 2-level cache: a miss in the first level cache is satisfied by the second cache or by memory
  – access to the second level cache: 20 ns / 2 ns per cycle = 10 clock cycles
  – on a miss in the second cache, access memory: in 2% of the cases
  – Total CPI = 1 + primary stalls per instruction + secondary stalls per instruction
  – Total CPI = 1 + 5% x 10 + 2% x 100 = 3.5
Machine with L2 cache: 6 / 3.5 = 1.7 times faster
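The same arithmetic as a short script, using only the numbers given in the example; the variable and function names are illustrative.

```python
CLOCK_HZ = 500e6
CYCLE_NS = 1e9 / CLOCK_HZ          # 2 ns per cycle at 500 MHz


def ns_to_cycles(ns):
    """Convert an access time in nanoseconds to clock cycles."""
    return ns / CYCLE_NS


BASE_CPI = 1.0
L1_MISS_RATE = 0.05                # misses per instruction in the primary cache
MEMORY_MISS_RATE = 0.02            # misses per instruction that still go to memory

memory_penalty = ns_to_cycles(200)  # cycles to reach main memory
l2_penalty = ns_to_cycles(20)       # cycles to reach the second level cache

cpi_one_level = BASE_CPI + L1_MISS_RATE * memory_penalty
cpi_two_level = (BASE_CPI + L1_MISS_RATE * l2_penalty
                 + MEMORY_MISS_RATE * memory_penalty)

print(f"1-level CPI = {cpi_one_level:.1f}")
print(f"2-level CPI = {cpi_two_level:.1f}")
print(f"speedup     = {cpi_one_level / cpi_two_level:.2f}")
```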

33 Second Level Cache
– The global miss rate of the L2 cache is similar to the miss rate of a single cache of the same size as the second level cache, provided the L2 cache is much bigger than L1.
– The local miss rate is NOT a good measure of secondary caches, as it is a function of the L1 cache.
– The global miss rate should be used instead.

34 Second Level Cache

35 How to connect the cache to next level?
Make reading multiple words easier by using banks of memory.
It can get a lot more complicated...

