
1 Chapter 5A: Exploiting the Memory Hierarchy, Part 3
Adapted from slides by Prof. Mary Jane Irwin, Penn State University, and from slides supplied by the textbook publisher.
Read Sections: 5.3, 5.7, 5.8

2 Handling Cache Hits
- Read hits (I$ and D$): this is what we want!
- Write hits (D$ only):
  1. Require the cache and memory to be consistent
     - Always write the data into both the cache block and the next level in the memory hierarchy (write-through policy)
     - Writes then run at the speed of the next level in the memory hierarchy, which is slow!
     - Or use a write buffer and stall only if the write buffer is full

3 Handling Cache Hits (continued)
- Write hits (D$ only), continued:
  2. Allow the cache and memory to be inconsistent
     - On a write hit, write the data only into the cache block
     - Write-back policy: write the cache block to the next level in the memory hierarchy when that cache block is evicted
     - Need a dirty bit for each data cache block to tell whether it must be written back to memory when it is evicted
     - Can use a write buffer to help buffer write-backs of dirty blocks
(A sketch contrasting the two write-hit policies follows below.)
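To make the two write-hit policies concrete, here is a minimal Python sketch, assuming a cache modeled as a dict of index -> [tag, data, dirty] and a flat memory list; all names are illustrative, not from the slides.

    memory = [0] * 1024
    cache = {}          # index -> [tag, data, dirty]
    write_buffer = []   # pending writes to the next memory level

    def write_hit_write_through(index, tag, addr, data):
        cache[index] = [tag, data, False]   # update the cache block...
        write_buffer.append((addr, data))   # ...and queue the memory write;
                                            # stall only if the buffer is full

    def write_hit_write_back(index, tag, data):
        cache[index] = [tag, data, True]    # update the cache only; mark dirty

    def evict(index, block_addr):
        tag, data, dirty = cache.pop(index)
        if dirty:                           # write-back defers the memory
            memory[block_addr] = data       # update until eviction

With write-through, the memory write happens on every store; with write-back, a block written many times costs only one memory write, at eviction.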

4 Classes/Types of Cache Misses
- Compulsory misses (cold start, process migration, first reference):
  - The first access to a block; a "cold" fact of life, and not a whole lot you can do about it. If you are going to run millions of instructions, compulsory misses are insignificant.
  - Solution: increase the block size (this increases the miss penalty, and very large blocks could increase the miss rate)
- Capacity misses:
  - The cache cannot contain all the blocks accessed by the program
  - Solution: increase the cache size (may increase access time)

5 Classes/Types of Cache Misses (continued)
- Conflict misses (collision):
  - Multiple memory locations map to the same cache location
  - Solution 1: increase the cache size
  - Solution 2: increase associativity

6 Handling Cache Misses (Single-Word Blocks)
- Read misses (I$ and D$):
  - Stall the pipeline
  - Fetch the block from the next level in the memory hierarchy
  - Install it in the cache
  - Send the requested word to the processor
  - Then let the pipeline resume

7 Handling Cache Misses (Single-Word Blocks) (cont.)
- Write misses (D$ only), Option 1:
  - Stall the pipeline
  - Fetch the block from the next level in the memory hierarchy
  - Install it in the cache (which may involve evicting a dirty block if using a write-back cache)
  - Write the word from the processor into the cache
  - Then let the pipeline resume
  OR

8 Handling Cache Misses (Single-Word Blocks) (cont.)
- Write misses (D$ only), Option 2: write allocate
  - Just write the word into the cache, updating both the tag and the data
  - No need to check for a cache hit
  - No need to stall
  OR

9 Handling Cache Misses (Single-Word Blocks) (cont.)
- Write misses (D$ only), Option 3: no-write allocate
  - Skip the cache write (but must invalidate that cache block, since it would otherwise hold stale data)
  - Just write the word to the write buffer (and eventually to the next memory level)
  - No need to stall if the write buffer isn't full
(A sketch of options 2 and 3 follows below.)
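Here is a minimal sketch of options 2 and 3 for a single-word-block, direct-mapped cache; the dict-based structure and all names are illustrative assumptions, not the book's design.

    cache = {}          # index -> [tag, data, valid]
    write_buffer = []   # pending writes to the next memory level

    def write_miss_write_allocate(index, tag, data):
        # Option 2: with one-word blocks the whole block is overwritten,
        # so just install the new tag and data; no fetch or stall needed.
        cache[index] = [tag, data, True]

    def write_miss_no_write_allocate(index, addr, data):
        # Option 3: bypass the cache; per the slide, invalidate the block
        # at this index so it cannot be used while holding stale data.
        if index in cache:
            cache[index][2] = False         # clear the valid bit
        write_buffer.append((addr, data))   # stall only if the buffer is full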

10 Multiword Block Considerations
- Read misses (I$ and D$):
  - Processed the same as for single-word blocks: a miss returns the entire block from memory
  - The miss penalty grows as the block size grows
    - Early restart: the processor resumes execution as soon as the requested word of the block is returned
    - Requested word first: the requested word is transferred from memory to the cache (and to the processor) first
  - Nonblocking cache: allows the processor to continue accessing the cache while the cache is handling an earlier miss

11 Multiword Block Considerations (continued)
- Write misses (D$):
  - If using write allocate, must first fetch the block from memory and then write the word into the block

12 Memory Systems that Support Caches
- The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways.
- One-word-wide organization: a one-word-wide bus and a one-word-wide memory. [Diagram: the CPU and cache on-chip, connected by the bus to DRAM memory; 32-bit data and 32-bit address per bus cycle.]
- Definition: memory-bus-to-cache bandwidth = the number of bytes accessed from memory and transferred to the cache/CPU per memory-bus clock cycle.

13 Main Memory Supporting Caches (continued)
Example cache block read:
- 1 bus cycle for the address transfer
- 15 bus cycles per DRAM word access
- 1 bus cycle per DRAM word transfer
For a 4-word block and 1-word-wide DRAM:
- Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
- Bandwidth = 16 bytes / 65 bus cycles ≈ 0.25 bytes per bus cycle

14 Increasing Memory Bandwidth
4-word-wide memory (bus and memory are both four words wide):
- Miss penalty = 1 + 15 + 1 = 17 bus cycles
- Bandwidth = 16 bytes / 17 cycles ≈ 0.94 bytes per cycle
- Compared to 0.25 bytes per bus cycle for the 1-word-wide DRAM

15 Interleaved Memory, One-Word-Wide Bus
For a block size of four words, with four DRAM banks (banks 0-3) accessed in parallel over a one-word-wide bus:
- Cycle to send the 1st address: 1
- Cycles to read the DRAM banks (overlapped): 15
- Cycles to return the four data words: 4×1 = 4
- Total clock-cycle miss penalty: 1 + 15 + 4 = 20
- Bandwidth for a single miss: (4 × 4 bytes) / 20 cycles = 0.8 bytes per clock
[Diagram: the on-chip CPU and cache connected by the bus to four interleaved DRAM memory banks.]
(The sketch below tallies all three organizations.)
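The following small Python sketch reproduces the three miss-penalty and bandwidth calculations above; the variable names are mine.

    ADDR, ACCESS, XFER = 1, 15, 1   # bus cycles: address, DRAM access, word transfer
    WORDS, BYTES = 4, 16            # 4-word block = 16 bytes

    one_word_wide = ADDR + WORDS * ACCESS + WORDS * XFER   # 65 cycles
    four_word_wide = ADDR + ACCESS + XFER                  # 17 cycles
    interleaved = ADDR + ACCESS + WORDS * XFER             # 20 cycles (bank reads overlap)

    for name, cycles in [("1-word-wide", one_word_wide),
                         ("4-word-wide", four_word_wide),
                         ("interleaved", interleaved)]:
        print(f"{name}: miss penalty {cycles} cycles, "
              f"bandwidth {BYTES / cycles:.2f} B/cycle")
    # 1-word-wide: 65 cycles, 0.25 B/cycle
    # 4-word-wide: 17 cycles, 0.94 B/cycle
    # interleaved: 20 cycles, 0.80 B/cycle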

16 DRAM Memory System Summary
- It's important to match the cache characteristics
  - caches access one block at a time (usually more than one word)
- ...with the DRAM characteristics
  - use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
- ...and with the memory-bus characteristics
  - make sure the memory bus can support the DRAM access rates and patterns
  - with the goal of increasing the memory-bus-to-cache bandwidth

17 Measuring Cache Performance
- Assuming cache hit costs are included as part of the normal CPU execution cycle:
  CPU time = IC × CPI_stall × CC = IC × (CPI_ideal + memory-stall cycles per instruction) × CC
- Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls):
  Read-stall cycles = reads/program × read miss rate × read miss penalty
  Write-stall cycles = (writes/program × write miss rate × write miss penalty) + write-buffer stalls
- For write-through caches, we can simplify this to:
  Memory-stall cycles = accesses/program × miss rate × miss penalty

18 Impacts of Cache Performance (cont.)
Calculate the memory-stall cycles given:
- a processor with a CPI_ideal of 2
- a 100-cycle miss penalty
- 36% load/store instructions
- a 2% I$ miss rate
- a 4% D$ miss rate
Memory-stall cycles = 2% × 100 + 36% × 4% × 100 = 3.44
So CPI_stall = 2 + 3.44 = 5.44, more than twice the CPI_ideal!
- What if the CPI_ideal is reduced to 1? 0.5? 0.25?
- What if the D$ miss rate goes up by 1%? 2%?
- What if the processor clock rate is doubled (doubling the miss penalty)?
(The sketch below works these variations.)
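A minimal sketch of the slide's stall arithmetic; the function and parameter names are mine, not from the text.

    def cpi_with_stalls(cpi_ideal, ld_st_frac, i_miss, d_miss, penalty):
        # Every instruction fetches from the I$; only loads/stores touch the D$.
        stalls = i_miss * penalty + ld_st_frac * d_miss * penalty
        return cpi_ideal + stalls

    print(f"{cpi_with_stalls(2, 0.36, 0.02, 0.04, 100):.2f}")  # 5.44, the slide's answer
    print(f"{cpi_with_stalls(1, 0.36, 0.02, 0.04, 100):.2f}")  # 4.44 (CPI_ideal = 1)
    print(f"{cpi_with_stalls(2, 0.36, 0.02, 0.05, 100):.2f}")  # 5.80 (D$ miss rate +1%)
    print(f"{cpi_with_stalls(2, 0.36, 0.02, 0.04, 200):.2f}")  # 8.88 (doubled miss penalty)

Note how the stall component comes to dominate as CPI_ideal shrinks: at CPI_ideal = 1, stalls account for 3.44 of the 4.44 cycles per instruction.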

19 Average Memory Access Time (AMAT)
- A larger cache will have a longer access time (so a longer hit time).
- An increase in hit time will likely add another memory stage to the pipeline.
- At some point the increase in hit time for a larger cache will overcome the improvement in hit rate, leading to a decrease in performance.
- We measure the performance of a main memory/cache system by the Average Memory Access Time (AMAT), the average time to access memory considering both hits and misses:
  AMAT = Time for a hit + Miss rate × Miss penalty

20 Average Memory Access Time (AMAT) Example
What is the AMAT for a processor with:
- a 20 psec clock
- a miss penalty of 50 clock cycles
- a miss rate of 0.02 misses per instruction
- a cache access time (hit time) of 1 clock cycle?
AMAT = 20 + 0.02 × 50 × 20 = 40 psec
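The same example in a few lines of Python, working in picoseconds; the names are illustrative.

    clock_ps = 20
    hit_time = 1 * clock_ps        # 1 cycle
    miss_penalty = 50 * clock_ps   # 50 cycles
    miss_rate = 0.02

    amat = hit_time + miss_rate * miss_penalty
    print(f"AMAT = {amat:.0f} psec")   # 40 psec, i.e. 2 clock cycles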

21 Reducing Cache Miss Rates #1
1. Allow more flexible block placement (instead of direct mapped):
- fully associative cache
- n-way set associative cache

22 Reducing Cache Miss Rates #2
Use multiple levels of caches.
- With advancing technology there is more than enough room on the die for bigger L1 caches, or for a second level of cache, normally a unified L2 cache (i.e., one that holds both instructions and data), and in some cases even a unified L3 cache.

23 Performance of a Single-Level vs. Two-Level Cache, I
- A processor with a CPI_ideal of 2
- a 100-cycle miss penalty
- 36% load/store instructions
- a 2% I$ L1 miss rate
- a 4% D$ L1 miss rate
Memory-stall cycles = 2% × 100 + 36% × 4% × 100 = 3.44
So CPI_stall = 2 + 3.44 = 5.44

24 Performance of a Single-Level vs. Two-Level Cache, II
- CPI_ideal of 2
- 100-cycle miss penalty (to main memory)
- 36% load/stores
- 2% I$ L1 miss rate, 4% D$ L1 miss rate
- 25-cycle miss penalty to a unified level-2 cache (UL2$)
- add a 0.5% UL2$ miss rate
CPI_stall = 2 + 0.02×25 + 0.36×0.04×25 + 0.005×100 + 0.36×0.005×100 = 3.54
(as compared to 5.44 with no L2$)
(The sketch below reproduces this calculation.)
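A sketch of the two-level calculation above: L1 misses now pay the 25-cycle L2 penalty, and only the 0.5% of references that also miss in the UL2$ pay the full 100 cycles. The function and parameter names are mine.

    def cpi_two_level(cpi_ideal, ld_st, i_l1, d_l1, l2_pen, l2_miss, mem_pen):
        l1_stalls = (i_l1 + ld_st * d_l1) * l2_pen          # L1 misses go to L2
        l2_stalls = (l2_miss + ld_st * l2_miss) * mem_pen   # L2 misses go to memory
        return cpi_ideal + l1_stalls + l2_stalls

    print(f"{cpi_two_level(2, 0.36, 0.02, 0.04, 25, 0.005, 100):.2f}")   # 3.54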

25 Multilevel Cache Design Considerations
- Design considerations for L1 and L2 caches are very different:
  - The primary cache should focus on minimizing hit time in support of a shorter clock cycle: smaller, with smaller block sizes
  - The secondary cache(s) should focus on reducing the miss rate to reduce the penalty of long main-memory access times: larger, with larger block sizes and higher levels of associativity

26 Multilevel Cache Design Considerations (cont.)
- The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so the L1 can be smaller (i.e., faster) but have a higher miss rate
- For the L2 cache, hit time is less important than miss rate:
  - The L2$ hit time determines the L1$'s miss penalty
  - The global miss rate is much smaller than the L2$ local miss rate

27 Two Machines' Cache Parameters

                            Intel Nehalem                          AMD Barcelona
    L1 organization & size  Split I$ and D$; 32KB each per core;   Split I$ and D$; 64KB each per core;
                            64B blocks                             64B blocks
    L1 associativity        4-way (I), 8-way (D) set assoc.;       2-way set assoc.; LRU replacement
                            ~LRU replacement
    L1 write policy         write-back, write-allocate             write-back, write-allocate
    L2 organization & size  Unified; 256KB per core; 64B blocks    Unified; 512KB per core; 64B blocks
    L2 associativity        8-way set assoc.; ~LRU                 16-way set assoc.; ~LRU
    L2 write policy         write-back, write-allocate             write-back, write-allocate
    L3 organization & size  Unified; 8192KB shared by cores;       Unified; 2048KB shared by cores;
                            64B blocks                             64B blocks
    L3 associativity        16-way set assoc.                      32-way set assoc.
    L3 write policy         write-back, write-allocate             write-back, write-allocate

28 Read Section 5.7: Using a Finite State Machine to Control a Simple Cache

29 Cache Control (Design of Controller)
Example cache characteristics:
- Direct-mapped, write-back, write-allocate (i.e., the policy on a write miss is: allocate the block, then perform the write-hit action)
- 32-bit byte addresses
- Block size: four 32-bit words (so 16B)
- Cache size: 16KB (so 1024 blocks)
- A valid bit and a dirty bit per block
- Blocking cache: the CPU waits until the access is complete
Address breakdown: Tag = bits 31-14 (18 bits), Index = bits 13-4 (10 bits), Offset = bits 3-0 (4 bits)
(The sketch below decodes an address into these fields.)
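A small sketch of the field extraction implied by that breakdown; the helper name and the sample address are mine.

    def split_address(addr):
        offset = addr & 0xF          # bits 3..0: byte within the 16B block
        index = (addr >> 4) & 0x3FF  # bits 13..4: selects one of 1024 blocks
        tag = addr >> 14             # bits 31..14: compared against the stored tag
        return tag, index, offset

    tag, index, offset = split_address(0x12345678)
    print(hex(tag), hex(index), hex(offset))   # 0x48d1 0x167 0x8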

30 Interface Signals
The cache sits between the CPU and memory. Each interface carries Read/Write, Valid, Address, Write Data, Read Data, and Ready signals:
- CPU-to-cache interface: 32-bit address, 32-bit write data, 32-bit read data
- Cache-to-memory interface: 32-bit address, 128-bit (one block) write data and read data
- Valid, when set to 1, indicates that there is a cache (respectively, memory) operation
- Ready, when set to 1, indicates that the cache (respectively, memory) operation is complete

31 Four-State Cache Controller
- Idle: on a valid CPU request, go to Compare Tag.
- Compare Tag: check for a hit (valid signal set and tags match); set the valid bit, set the tag, and, if the access is a write, set the dirty bit.
  - Cache hit: mark the cache ready and return to Idle.
  - Cache miss, old block clean: go to Allocate.
  - Cache miss, old block dirty: go to Write-Back.
- Allocate: read the new block from memory; stay here while memory is not ready, then return to Compare Tag.
- Write-Back: write the old block to memory; stay here while memory is not ready, then go to Allocate.
(See the sketch after this list.)
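An illustrative transition-function sketch of the four states; this is an assumption-level Python model of the diagram, not the book's hardware description.

    def next_state(state, cpu_valid=False, hit=False, dirty=False, mem_ready=False):
        if state == "Idle":
            return "CompareTag" if cpu_valid else "Idle"
        if state == "CompareTag":
            if hit:
                return "Idle"                   # hit: cache marked ready
            return "WriteBack" if dirty else "Allocate"
        if state == "Allocate":                 # fetching the new block
            return "CompareTag" if mem_ready else "Allocate"
        if state == "WriteBack":                # writing the old dirty block
            return "Allocate" if mem_ready else "WriteBack"

    # A miss to a clean block walks Idle -> CompareTag -> Allocate -> CompareTag:
    s = next_state("Idle", cpu_valid=True)      # CompareTag
    s = next_state(s, hit=False, dirty=False)   # Allocate
    s = next_state(s, mem_ready=False)          # Allocate (memory not ready yet)
    s = next_state(s, mem_ready=True)           # CompareTag
    print(s)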

32 Read Section 5.8: Parallelism and Memory Hierarchies: Cache Coherence

33 Cache Coherence in Multicores
- In multicore processors the cores share a common physical address space, causing a cache coherence problem.
[Diagram: Core 1 and Core 2, each with private L1 I$ and D$, over a unified (shared) L2. Core 2 reads X (getting X = 0); Core 1 then writes 1 to X, so its L1 holds X = 1 while Core 2's L1 still holds X = 0. This is the cache coherence problem.]

34 Coherence Defined
Informally: if caches are coherent, then reads return the most recently written value.
Formally:
- If core 1 writes X and then core 1 reads X (with no intervening writes), the read returns the written value.
- If core 1 writes X and then core 2 reads X (sufficiently later), the read returns the written value.
- If core 1 writes X and then core 2 writes X, cores 1 and 2 see the writes in the same order and end up with the same final value for X.

35 A Coherent Memory System
- Any read of a data item should return the most recently written value of that data item.
  - Coherence defines what values can be returned by a read. Writes to the same location are serialized: two writes to the same location must be seen in the same order by all cores.
  - Consistency determines when a written value will be returned by a read.

36 A Coherent Memory System (cont.)
- To enforce coherence, caches must provide:
  - Migration of shared data items to a core's local L1 cache: a data item is moved into the requesting core's cache instead of being fetched from the next-level shared cache (L2 in our example) on every access. Migration reduces both the latency of the access and the bandwidth demand on the shared memory.
  - Replication of shared data items in multiple cores' caches: when shared data are being simultaneously read, the L1 caches make copies of the data item for the different cores. Replication reduces both the latency of access and the contention for a read of shared data.

37 Cache Coherence Protocols
- Snooping cache coherence protocols are the most popular hardware protocols for ensuring cache coherence.
- The cache controllers snoop on (monitor) the bus to determine whether their cache has a copy of a requested block.
- Write-invalidate protocol: writes require exclusive access to the cache block and invalidate all other copies.
  - Exclusive access ensures that no other readable or writable copies of the item exist.
- If two processors attempt to write the same data at the same time, one of them wins the race, causing the other core's copy to be invalidated. For the other core to complete its write, it must obtain a new copy of the data, which now contains the updated value, thus enforcing write serialization.

38 Example of Snooping Invalidation
- When the second miss by Core 2 occurs, Core 1 responds with the value, canceling the response from the L2 cache (and also updating the L2 copy).
[Diagram: Core 2 reads X (X = 0 in both L1s); Core 1 writes 1 to X, invalidating Core 2's copy (X = I); Core 2 rereads X and receives X = 1 from Core 1.]

39 Invalidating Snooping Protocols
- A processor's cache gets exclusive access to a block when that processor writes into the block.
- The exclusive access is achieved by broadcasting an "invalidate" message on the bus, which makes the copies of the block in other processors' caches invalid.
- A subsequent read by another processor produces a cache miss, and the owning cache supplies the updated value.

40 A Write-Invalidate Cache Coherence Protocol
A write-back caching protocol with three states per cache block: Invalid, Shared (clean), and Modified (dirty). In the slide's diagram, signals from the core are shown in red and signals from the bus in blue; the transitions are:
- Invalid -> Shared: read (miss)
- Invalid -> Modified: write (miss); send invalidate
- Shared -> Shared: read (hit)
- Shared -> Modified: write (hit); send invalidate
- Shared -> Invalid: receives invalidate (write by another core to this block)
- Modified -> Modified: read (hit) or write (hit)
- Modified -> Shared: write-back due to a read miss by another core to this block
(See the sketch after this list.)
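An illustrative Python sketch of these transitions for a single cache block; the event labels are mine, and this models only the slide's diagram, not a full bus protocol.

    TRANSITIONS = {
        ("Invalid",  "core_read"):      ("Shared",   None),
        ("Invalid",  "core_write"):     ("Modified", "send invalidate"),
        ("Shared",   "core_read"):      ("Shared",   None),
        ("Shared",   "core_write"):     ("Modified", "send invalidate"),
        ("Shared",   "bus_invalidate"): ("Invalid",  None),
        ("Modified", "core_read"):      ("Modified", None),
        ("Modified", "core_write"):     ("Modified", None),
        ("Modified", "bus_read_miss"):  ("Shared",   "write back block"),
    }

    state = "Invalid"
    for event in ["core_read", "core_write", "bus_read_miss", "bus_invalidate"]:
        state, action = TRANSITIONS[(state, event)]
        print(f"{event}: -> {state}" + (f" ({action})" if action else ""))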

41 Summary: Improving Cache Performance
0. Reduce the time to hit in the cache:
- smaller cache
- direct-mapped cache
- smaller blocks
- for writes:
  - no-write allocate: no "hit" on the cache, just write to the write buffer
  - write allocate: to avoid taking two cycles, pipeline writes to the cache via a delayed write buffer

42 Summary: Improving Cache Performance (cont.)
1. Reduce the miss rate:
- bigger cache
- more flexible placement (increase associativity)
- larger blocks (16 to 64 bytes typical)
- implement a "victim" cache: a small buffer holding the most recently discarded blocks

43 Summary: Improving Cache Performance (cont. II)
2. Reduce the miss penalty:
- smaller blocks
- use a write buffer to hold dirty blocks being replaced, so there is no need to wait for the write to complete before reading
- check the write buffer (and/or victim cache) on a read miss: you may get lucky
- for large blocks, fetch the critical word first
- use multiple cache levels (the L2 cache is not tied to the CPU clock rate)
- faster backing store (i.e., improved memory bandwidth): wider buses; memory interleaving, DDR SDRAMs

44 Summary: The Cache Design Space
- Several interacting dimensions:
  - cache size
  - block size
  - associativity
  - replacement policy
  - write-through vs. write-back
  - write allocation
- The optimal choice is a compromise:
  - it depends on access characteristics: workload and use (I-cache, D-cache, TLB)
  - it depends on technology / cost
- Simplicity often wins
[Diagram: the design space sketched along axes of cache size, block size, and associativity, with good and bad regions for two interacting factors.]

