Computer Organization CS224 Fall 2012 Lessons 45 & 46.

1 Computer Organization CS224 Fall 2012 Lessons 45 & 46

2 The Memory Hierarchy (§5.5 A Common Framework for Memory Hierarchies)
The BIG Picture:
- Common principles apply at all levels of the memory hierarchy (caches, TLB, virtual memory)
  - Based on notions of caching
- At each level in the hierarchy:
  - Block placement
  - Finding a block
  - Replacement on a miss
  - Write policy

3 Block Placement
- Determined by associativity
  - Direct mapped = 1-way set associative (1 block per set): one choice for placement
  - n-way set associative (n blocks per set): n choices within a set
  - Fully associative (1 set, all cache blocks in that set): any location
- Higher associativity reduces miss rate (e.g. Fig 5.30)
  - Increases complexity, cost, and access time
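The placement choices above can be sketched as a simple modular mapping: a block address selects a set, and associativity determines how many slots within that set are candidates. This is an illustrative model only; the contiguous slot numbering is an assumption of the sketch, not how any particular cache is wired.

```python
def candidate_slots(block_addr, num_blocks, assoc):
    """Return the cache slot indices where a block may be placed.

    Direct mapped (assoc = 1) gives exactly one choice; fully
    associative (assoc = num_blocks) allows any slot.
    Illustrative sketch: sets are modeled as contiguous runs of slots.
    """
    num_sets = num_blocks // assoc
    set_index = block_addr % num_sets
    return [set_index * assoc + way for way in range(assoc)]

# 8-block cache, block address 12:
print(candidate_slots(12, 8, 1))  # direct mapped -> [4]
print(candidate_slots(12, 8, 2))  # 2-way: set 12 % 4 = 0 -> [0, 1]
print(candidate_slots(12, 8, 8))  # fully associative -> all 8 slots
```

Note how the same block address yields one, two, or eight legal locations as associativity grows, which is exactly the trade-off the slide describes.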

4 Finding a Block
- Hardware caches
  - Reduce comparisons to reduce cost
- Virtual memory
  - Full table lookup makes full associativity feasible
  - Benefit in reduced miss rate

  Associativity           Location method                            Tag comparisons
  Direct mapped           Index                                      1
  n-way set associative   Use index, then search entries in the set  n
  Fully associative       Search all entries                         #entries
  Fully associative       Full lookup table                          0
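Finding a block starts by splitting the address into tag, index, and block offset. A minimal sketch of that decomposition (the helper name and arithmetic are illustrative; real hardware does this by selecting bit fields, not by division):

```python
def split_address(addr, block_size, num_sets):
    """Split a byte address into (tag, index, offset).

    block_size and num_sets are assumed to be powers of two, so the
    modulo/division below correspond to extracting bit fields.
    """
    offset = addr % block_size            # low bits: byte within block
    index = (addr // block_size) % num_sets   # middle bits: which set
    tag = addr // (block_size * num_sets)     # high bits: compared on lookup
    return tag, index, offset

# 32-byte direct-mapped cache with 4-byte blocks -> 8 sets
print(split_address(36, 4, 8))  # (1, 1, 0): byte 36 maps to set 1
```

The index picks the set to probe; the tag comparisons counted in the table above are then performed against the stored tags in that set.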

5 Replacement (on a miss)
- Caches (with associativity > 1)
  - Least recently used (LRU): complex and costly hardware for high associativity
  - Random: close to LRU, easier to implement
- Virtual memory (fully associative)
  - LRU approximation with hardware support
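True LRU within one set can be modeled with an ordered map. This is a software sketch for intuition; as the slide notes, real hardware only approximates LRU at high associativity.

```python
from collections import OrderedDict

class LRUSet:
    """LRU replacement within a single cache set (illustrative model)."""
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()   # ordered oldest-first

    def access(self, tag):
        """Return True on a hit; on a miss, evict the LRU tag if full."""
        if tag in self.tags:
            self.tags.move_to_end(tag)      # mark most recently used
            return True
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)   # evict least recently used
        self.tags[tag] = None
        return False

s = LRUSet(2)
for t in [1, 2, 1, 3]:
    s.access(t)
# When tag 3 arrived, tag 2 was least recently used, so it was evicted.
print(list(s.tags))  # [1, 3]
```

Random replacement would simply pick any resident tag to evict, which is why it is so much cheaper to build yet performs close to LRU in practice.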

6 Write Policy
- Write-through
  - Update both upper and lower levels
  - Simplifies replacement, but may require a write buffer
- Write-back
  - Update upper level only
  - Update lower level when the block is replaced
  - Needs to keep more state (e.g. a dirty bit)
- Virtual memory
  - Only write-back is feasible, given disk write latency
- Lowest level of cache
  - Generally uses write-back, since the rate of CPU stores exceeds the DRAM write rate [sw: 7.6%, sh: 0.1%, sb: 0.6% of instructions for SPECint2006]
  - Main memory generally can't absorb a write every 1/8.3% ≈ 12 instructions, so no write buffer size will allow write-through to work
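The dirty-bit bookkeeping for write-back can be sketched in a few lines (a minimal illustrative model, not any particular hardware): stores touch only the cache, and the lower level is updated once, at replacement, only if the block is dirty.

```python
class WriteBackBlock:
    """Write-back state for one cache block (illustrative sketch)."""
    def __init__(self):
        self.dirty = False
        self.writes_to_lower_level = 0

    def store(self):
        self.dirty = True                 # update upper level only

    def evict(self):
        if self.dirty:                    # lower level updated on replacement
            self.writes_to_lower_level += 1
        self.dirty = False

b = WriteBackBlock()
for _ in range(10):   # ten stores to the same block...
    b.store()
b.evict()
print(b.writes_to_lower_level)   # ...cost a single write to the lower level: 1
```

Under write-through, the same ten stores would mean ten lower-level writes, which is why the ~8.3% store rate in the slide rules write-through out at the DRAM level.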

7 Sources of Misses
- Compulsory misses (aka cold-start misses)
  - First access to a block
- Capacity misses
  - Due to finite cache size
  - A replaced block is later accessed again
- Conflict misses (aka collision misses)
  - In a non-fully-associative cache
  - Due to competition for entries in a set
  - Would not occur in a fully associative cache of the same total size
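The three categories can be teased apart with a toy model: compulsory misses are first touches, and conflict misses are exactly the misses that a fully associative cache of the same total size would avoid. A sketch using block addresses and a hypothetical trace:

```python
def count_misses(trace, num_blocks, fully_assoc):
    """Count misses in a tiny cache model: either direct mapped,
    or fully associative with LRU. Illustrative sketch only."""
    if fully_assoc:
        cache, misses = [], 0            # list ordered oldest-first
        for b in trace:
            if b in cache:
                cache.remove(b)          # hit: refresh recency
            else:
                misses += 1
                if len(cache) == num_blocks:
                    cache.pop(0)         # evict LRU
            cache.append(b)
        return misses
    slots, misses = {}, 0
    for b in trace:
        s = b % num_blocks               # one fixed slot per block
        if slots.get(s) != b:
            misses += 1
            slots[s] = b
    return misses

trace = [0, 4, 0, 4, 0, 4]   # blocks 0 and 4 collide in a 4-block direct-mapped cache
compulsory = len(set(trace))                          # first touches: 2
direct = count_misses(trace, 4, fully_assoc=False)    # every access misses: 6
full = count_misses(trace, 4, fully_assoc=True)       # compulsory only: 2
conflict = direct - full                              # removed by full associativity: 4
print(compulsory, full, direct, conflict)
```

Capacity misses would show up as `full - compulsory`: misses that remain even with full associativity because the cache is simply too small for the working set.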

8 Cache Design Trade-offs

  Design change           Effect on miss rate                  Negative performance effect
  Increase cache size     Decreases capacity misses            May increase access time
  Increase associativity  Decreases conflict misses            May increase access time
  Increase block size     Decreases compulsory misses          Increases miss penalty; for very
                          (and other misses) because of        large block sizes, may increase
                          increased spatial locality           miss rate

See Figure 5.30, p. 519 and Figure 5.31, p. 524

9 Multilevel On-Chip Caches (§5.10 Real Stuff: The AMD Opteron X4 and Intel Nehalem)
[Die photo: Intel Nehalem 4-core processor]
Per core: 32KB L1 I-cache, 32KB L1 D-cache, 512KB L2 cache

10 2-Level TLB Organization

                     Intel Nehalem                        AMD Opteron X4
  Virtual address    48 bits                              48 bits
  Physical address   44 bits                              48 bits
  Page size          4KB, 2/4MB                           4KB, 2/4MB
  L1 TLB             L1 I-TLB: 128 entries for small      L1 I-TLB: 48 entries
  (per core)         pages, 7 per thread (2 threads       L1 D-TLB: 48 entries
                     per core) for large pages            Both fully associative,
                     L1 D-TLB: 64 entries for small       LRU replacement
                     pages, 32 for large pages
                     Both 4-way, LRU replacement
  L2 TLB             Single L2 TLB: 512 entries,          L2 I-TLB: 512 entries
  (per core)         4-way, LRU replacement               L2 D-TLB: 512 entries
                                                          Both 4-way, round-robin LRU
  TLB misses         Handled in hardware                  Handled in hardware

11 3-Level Cache Organization

                     Intel Nehalem                        AMD Opteron X4
  L1 caches          L1 I-cache: 32KB, 64-byte blocks,    L1 I-cache: 32KB, 64-byte blocks,
  (per core)         4-way, approx LRU replacement,       2-way, LRU replacement,
                     hit time n/a                         hit time 3 cycles
                     L1 D-cache: 32KB, 64-byte blocks,    L1 D-cache: 32KB, 64-byte blocks,
                     8-way, approx LRU replacement,       2-way, LRU replacement,
                     write-back/allocate, hit time n/a    write-back/allocate, hit time 9 cycles
  L2 unified cache   256KB, 64-byte blocks, 8-way,        512KB, 64-byte blocks, 16-way,
  (per core)         approx LRU replacement,              approx LRU replacement,
                     write-back/allocate, hit time n/a    write-back/allocate, hit time n/a
  L3 unified cache   8MB, 64-byte blocks, 16-way,         2MB, 64-byte blocks, 32-way,
  (shared)           replacement n/a,                     replace block shared by fewest cores,
                     write-back/allocate, hit time n/a    write-back/allocate, hit time 32 cycles

n/a: data not available

12 Miss Penalty Reduction
- Return requested word first
  - Then back-fill the rest of the block
- Non-blocking cache
  - Used with out-of-order processors
  - Hides miss latency, allowing processing to continue
  - Hit under miss: allow hits to proceed
  - Miss under miss: allow multiple outstanding misses
- Hardware prefetch: instructions and data
  - Predicts in HW; works best for data arrays in loops
- Opteron X4: bank-interleaved L1 D-cache
  - 8 banks, 2 concurrent 128-bit accesses per cycle
  - Bank-interleaved memory is much cheaper than multiported memory

See Figure 5.40, p. 532

13 Pitfalls (§5.11 Fallacies and Pitfalls)
- Byte vs. word addressing
  - Example: 32-byte direct-mapped cache, 4-byte blocks, 8 blocks
    - Byte 36 maps to block 1 (addr 100100, index 001)
    - Word 36 maps to block 4 (addr 100100, index 100)
- Ignoring memory system effects when writing or generating code
  - Example: iterating over rows vs. columns of arrays (p. 544 code)
  - Large strides (decrements/increments) result in poor locality
  - Optimizing compilers (and good code-writers) reorganize programs to increase spatial and temporal locality
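The byte-vs-word pitfall above can be checked directly. The cache parameters come from the slide's example; the function names are illustrative:

```python
BLOCK_BYTES = 4
NUM_BLOCKS = 8    # 32-byte direct-mapped cache

def block_for_byte(byte_addr):
    """Correct mapping: strip the byte offset, then index."""
    return (byte_addr // BLOCK_BYTES) % NUM_BLOCKS

def block_for_word_mistake(word_number):
    """The pitfall: treating a word number as if it were a byte
    address skips the offset division, picking the wrong index bits."""
    return word_number % NUM_BLOCKS

print(block_for_byte(36))           # 1: addr 100100, index bits 001
print(block_for_word_mistake(36))   # 4: addr 100100, index bits 100
```

The same six address bits are involved in both cases; the error is only in which bits are interpreted as the index, which is why the mistake is so easy to make.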

14 Pitfalls
- In a multiprocessor with a shared L2 or L3 cache
  - Less associativity than cores results in conflict misses
  - More cores → need to increase associativity
- Using AMAT to evaluate performance of out-of-order processors
  - Ignores the effect of non-blocked accesses
  - Instead, evaluate performance by simulation
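AMAT (average memory access time = hit time + miss rate × miss penalty) is the standard summary metric the slide cautions against. A one-line sketch with hypothetical numbers shows what it computes, and the comment notes why it misleads for out-of-order cores:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles.

    For out-of-order processors this overstates the true cost,
    because non-blocking misses overlap with useful work; that is
    exactly the pitfall the slide describes.
    """
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers: 1-cycle hit, 5% miss rate, 100-cycle miss penalty
print(amat(1, 0.05, 100))   # 6.0 cycles
```

An out-of-order machine might hide most of those 5 extra cycles behind independent instructions, which is why the slide recommends simulation instead.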

15 Concluding Remarks (§5.12 Concluding Remarks)
- Fast memories are small; large memories are slow
  - We really want fast, large memories
  - Caching gives this illusion
- Principle of locality
  - Programs use a small part of their memory space frequently
- Memory hierarchy
  - L1 cache → L2 cache → … → DRAM memory → disk
- Memory system design is critical for multiprocessors

