Chapter 5 Memory III CSE 820. Michigan State University Computer Science and Engineering Miss Rate Reduction (cont’d)

1 Chapter 5 Memory III CSE 820

2 Miss Rate Reduction (cont’d)

3 Larger Block Size
Reduces compulsory misses through spatial locality
But:
– miss penalty increases: higher memory bandwidth helps
– miss rate can increase: a fixed cache size with larger blocks means fewer blocks in the cache
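The trade-off can be seen with a tiny direct-mapped cache model. This is a sketch; the cache size, block sizes, and access patterns below are illustrative choices, not figures from the slides.

```python
def count_misses(addresses, cache_bytes, block_bytes):
    """Count misses in a direct-mapped cache (tag store only)."""
    num_lines = cache_bytes // block_bytes
    tags = [None] * num_lines          # block tag held by each line
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        line = block % num_lines
        if tags[line] != block:
            misses += 1
            tags[line] = block
    return misses

# Sequential scan: larger blocks exploit spatial locality,
# cutting compulsory misses.
seq = range(4096)
assert count_misses(seq, 1024, 16) == 256    # 4096/16
assert count_misses(seq, 1024, 64) == 64     # 4096/64

# Scattered working set revisited repeatedly: with 256-byte blocks
# the 1KB cache has only 4 lines, so blocks that coexisted with
# 16-byte blocks now conflict and keep evicting each other.
scattered = [0, 100, 300, 500, 700, 900, 1100, 1300] * 10
small_blocks = count_misses(scattered, 1024, 16)    # 8 cold misses only
large_blocks = count_misses(scattered, 1024, 256)   # ongoing conflicts
assert small_blocks == 8
assert large_blocks > small_blocks
```

This is the "U" shape mentioned on the next slide: the sequential pattern favors larger blocks, the scattered pattern punishes them.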

4 Notice the “U” shape in the miss-rate-versus-block-size curve: some increase in block size is good; too much is bad.

5 Larger Caches
Reduces capacity misses
But:
– increased hit time
– increased cost ($)
Over time, L2 and higher-level cache sizes have increased

6 Higher Associativity
Reduces miss rate by reducing conflict misses
But:
– increased hit time (tag check)
Note: an 8-way set-associative cache has close to the same miss rate as a fully associative cache

7 Way Prediction
Predict which way of an L1 cache will be accessed next
– Alpha 21264: a correct prediction costs 1 cycle; an incorrect prediction costs 3 cycles
– On SPEC95, the prediction is 85% correct
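With the Alpha 21264 numbers from the slide, the expected L1 access time works out as a simple weighted average:

```python
# Expected L1 access time under way prediction, using the slide's
# Alpha 21264 figures: 1 cycle on a correct prediction, 3 cycles on
# a mispredict, 85% accuracy on SPEC95.
correct_cycles = 1
wrong_cycles = 3
accuracy = 0.85

avg_cycles = accuracy * correct_cycles + (1 - accuracy) * wrong_cycles
# 0.85 * 1 + 0.15 * 3 = 1.3 cycles on average
```

So way prediction keeps the common case at direct-mapped speed while paying a modest average penalty for mispredictions.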

8 Compiler Techniques
Code reordering can reduce conflicts in the I-cache: a 1989 study showed misses reduced by 50% for a 2KB cache and by 75% for an 8KB cache
The D-cache behaves differently

9 Compiler Data Optimizations: Loop Interchange
Before:
    for (j = …)
        for (i = …)
            x[i][j] = 2 * x[i][j];
After:
    for (i = …)
        for (j = …)
            x[i][j] = 2 * x[i][j];
Improved spatial locality: the inner loop now walks consecutive elements of each row
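The locality gain comes from the address stride of the inner loop. In row-major storage, `x[i][j]` lives at `base + (i*C + j) * w` for a C-column array of w-byte elements, so the sketch below (array shape and element size are illustrative) compares the inner-loop stride before and after interchange:

```python
# Address stride of the inner loop, before vs after loop interchange,
# for a row-major R x C array of w-byte elements (illustrative sizes).
R, C, w = 4, 8, 8

def addresses(inner_varies):
    if inner_varies == "i":    # before: j outer, i inner
        return [(i * C + j) * w for j in range(C) for i in range(R)]
    else:                      # after: i outer, j inner
        return [(i * C + j) * w for i in range(R) for j in range(C)]

before = addresses("i")
after = addresses("j")
assert sorted(before) == sorted(after)   # same elements, new order

stride_before = before[1] - before[0]    # C*w = 64 bytes: a whole row
stride_after = after[1] - after[0]       # w = 8 bytes: adjacent elements
```

A stride of one element means each fetched cache block is fully used before the loop moves on; a stride of a whole row touches a new block on nearly every access.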

10 Blocking: Improve Spatial Locality (before/after access patterns shown in figure)
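The idea behind blocking (tiling) is to process the matrices in B x B sub-blocks so that each tile is reused while it is still cache-resident. A minimal pure-Python sketch on small matrices (the sizes n and B are illustrative, not from the slide):

```python
# Naive vs blocked matrix multiply; blocking reorders the loop nest
# so each B x B tile of x and y is reused before being evicted.
n, B = 8, 4
x = [[i * n + j for j in range(n)] for i in range(n)]
y = [[(i + j) % 7 for j in range(n)] for i in range(n)]

def naive(x, y):
    z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                z[i][j] += x[i][k] * y[k][j]
    return z

def blocked(x, y):
    z = [[0] * n for _ in range(n)]
    for ii in range(0, n, B):            # iterate over tiles
        for jj in range(0, n, B):
            for kk in range(0, n, B):
                for i in range(ii, ii + B):   # work within one tile
                    for j in range(jj, jj + B):
                        for k in range(kk, kk + B):
                            z[i][j] += x[i][k] * y[k][j]
    return z

assert blocked(x, y) == naive(x, y)      # same result, better locality
```

Both versions perform the same multiply-adds; only the order changes, which is exactly why the transformation is legal for the compiler to apply.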

11 Miss Rate and Miss Penalty Reduction via Parallelism

12 Nonblocking Caches
Reduces stalls on a cache miss
– a blocking cache refuses all requests while waiting for data
– a nonblocking cache continues to handle other requests while waiting for data on an earlier miss
Increases cache-controller complexity

13 Nonblocking Cache (8KB direct-mapped L1; 32-byte blocks)

14 Hardware Prefetch
On a miss, fetch two blocks: the desired block plus the next one
– the “next” block goes into a “stream buffer”
– on a fetch, check the stream buffer first
Performance (instruction prefetch):
– a single-block stream buffer caught 15% to 25% of L1 misses
– a 4-block stream buffer caught 50%
– a 16-block stream buffer caught 72%

15 Hardware Prefetch (cont’d)
Data prefetch:
– a single data stream buffer caught 25% of L1 misses
– 4 data stream buffers caught 43%
– 8 data stream buffers caught 50% to 70%
Prefetching from multiple addresses: the UltraSPARC III handles 8 simultaneous prefetches and calculates a “stride” for the next prediction
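A stride prefetcher can be sketched in a few lines. This is a hypothetical minimal model, not the UltraSPARC III's actual mechanism (the slide only says it computes a "stride" for its next prediction): track the last address seen, derive the stride from consecutive accesses, and predict the next address.

```python
# Minimal stride-prefetcher sketch (hypothetical model): remember the
# last demand address, compute the stride between consecutive
# accesses, and prefetch last_addr + stride.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.stride = 0

    def access(self, addr):
        """Record a demand access; return the predicted next address."""
        if self.last_addr is not None:
            self.stride = addr - self.last_addr
        self.last_addr = addr
        return addr + self.stride

p = StridePrefetcher()
preds = [p.access(a) for a in (100, 164, 228, 292)]
# once the 64-byte stride is observed, predictions track the stream:
# [100, 228, 292, 356]
```

A real implementation would track state per instruction address and add confidence counters before issuing prefetches, but the core prediction is this arithmetic.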

16 Software Prefetch
Many processors, such as the Itanium, have prefetch instructions
Remember that these prefetches are nonfaulting

17 Hit Time Reduction

18 Small, Simple Caches
Hit time has two components:
– indexing
– comparing the tag
Small ⇒ indexing is fast
Simple (direct-mapped) ⇒ the tag comparison can proceed in parallel with the data load
⇒ e.g., an L2 with tags on chip and data off chip

19 Access time vs cache size and organization (figure)

20 Perspective on the previous graph
Recall:
– a 1 ns clock is 10^-9 sec/cycle
– 1 GHz is 10^9 cycles/sec
Therefore:
– a 2 ns clock is 500 MHz
– a 4 ns clock is 250 MHz
Conclusion: small differences in ns represent large differences in MHz
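The conversion on the slide is just f = 1/T, which a few lines make concrete:

```python
# Clock period to frequency: f = 1 / T.
def mhz(period_ns):
    """Frequency in MHz for a clock period given in nanoseconds."""
    return 1.0 / (period_ns * 1e-9) / 1e6

# A 1 ns change in period halves the frequency at this scale:
# mhz(1) -> 1000 MHz, mhz(2) -> 500 MHz, mhz(4) -> 250 MHz.
```

Because frequency is the reciprocal of period, a fixed absolute slowdown in nanoseconds costs proportionally more MHz the faster the clock already is.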

21 Virtual vs Physical Address in L1
Translating a virtual address to a physical address as part of cache access adds time on the critical path
Translation is needed for both the index and the tag
Making the common case fast suggests avoiding translation on hits (misses must be translated anyway)

22 Why are (almost all) L1 caches physical?
– Security (protection): page-level protection must be checked on every access (though protection bits can be copied into the cache)
– A process switch can change the virtual mapping, requiring a cache flush (or process IDs in the tags) [see next slide]
– Synonyms: two virtual addresses can map to the same (shared) physical address

23 Virtually-addressed cache: context-switch cost (figure)

24 Hybrid: Virtually Indexed, Physically Tagged
Index with the part of the page offset that is identical in the virtual and physical addresses
– i.e., the index bits are a subset of the page-offset bits
In parallel with indexing, translate the virtual address and check the physical tag
Limitation: a direct-mapped cache can be no larger than the page size (determined by the available address bits)
– set-associative caches can be bigger, since fewer bits are needed for the index

25 Examples
Pentium III: 8KB pages with a 16KB 2-way set-associative cache
IBM 3033: 4KB pages with a 64KB 16-way set-associative cache (8-way would be sufficient for miss rate, but 16-way is needed to keep the index bits small enough)
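The constraint behind both examples is that the index plus block-offset bits must fit inside the page offset, which reduces to cache_size ≤ page_size × associativity. A quick check of the slide's examples:

```python
# Virtually-indexed, physically-tagged constraint:
# the cache can grow beyond the page size only via associativity.
def vipt_ok(cache_kb, page_kb, ways):
    return cache_kb <= page_kb * ways

assert vipt_ok(16, 8, 2)       # Pentium III: 16KB 2-way, 8KB pages
assert vipt_ok(64, 4, 16)      # IBM 3033: 64KB 16-way, 4KB pages
assert not vipt_ok(64, 4, 8)   # 8-way would not fit: 64KB > 4KB * 8
```

The failing third check is exactly why the 3033 needed 16 ways even though 8 ways already captured nearly all the conflict-miss benefit.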

26 Trace Cache
Pentium 4 NetBurst architecture
I-cache blocks are organized to contain instruction traces, including predicted-taken branches, instead of being organized around memory addresses
Advantage over regular large cache blocks, which contain branches and hence many unused instructions
– e.g., AMD Athlon 64-byte blocks contain 16-24 x86 instructions, with 1 in 5 being a branch
Disadvantage: complex addressing

27 Trace Cache (cont’d)
The P4 trace cache (I-cache) is placed after decode and branch prediction, so it contains:
– μops
– only the desired instructions
The trace cache holds 12K μops
The branch-predictor BTB has 4K entries (a 33% improvement over the PIII)

28 (figure)

29 Summary (so far)
Figure 5.26 summarizes all of these optimizations

30 Main Memory
Main-memory modifications can reduce the cache miss penalty by delivering words from memory faster:
– a wider path to memory brings in more words at a time, e.g., one address request brings in 4 words (reducing overhead)
– interleaved memory banks allow memory to respond faster
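The miss-penalty arithmetic for these organizations can be sketched with textbook-style timing assumptions. The figures below (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle to transfer one word) are illustrative, not from the slides:

```python
# Miss penalty for a 4-word block under three memory organizations.
# Timing assumptions are illustrative: 1 cycle to send the address,
# 15 cycles of DRAM access, 1 cycle to transfer one word.
ADDR, ACCESS, XFER, WORDS = 1, 15, 1, 4

# One-word-wide memory: each word pays the full round trip.
one_word_wide = WORDS * (ADDR + ACCESS + XFER)    # 4 * 17 = 68 cycles

# Four-word-wide memory: one wide access fetches the whole block.
four_word_wide = ADDR + ACCESS + XFER             # 17 cycles

# Four-way interleaved banks: accesses overlap, transfers serialize.
interleaved = ADDR + ACCESS + WORDS * XFER        # 1 + 15 + 4 = 20 cycles
```

Interleaving gets most of the wide-path benefit without widening the bus, which is why it is attractive when a wide memory path is too expensive.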

