1 CMP-MSI.07 CARES/SNU
A Reusability-Aware Cache Memory Sharing Technique for High-Performance CMPs with Private Caches
Sungjune Youn, Hyunhee Kim and Jihong Kim
School of Computer Science & Engineering, Seoul National University
Computer Architecture and Embedded Systems (CARES) Laboratory
Workshop on Chip Multiprocessor Memory Systems and Interconnects (CMP-MSI) 2007, 2007.2.11
2 Outline
Introduction
Motivation
Reusability-Aware Cache Sharing Technique (RACS)
–Overview of the RACS technique
–Two major steps
 –Step 1: Block Reusability Prediction
 –Step 2: Memory Demand Prediction
Evaluation
Conclusions
3 Introduction
Chip multiprocessors (CMPs) have emerged as a dominant architectural alternative.
Most current CMPs support two levels of on-chip cache hierarchy:
–The L1 organization is almost the same across designs: a small private L1 cache.
–The L2 organization can differ substantially: private L2 cache vs. shared L2 cache.
Efficient L2 cache management is necessary:
–On-chip cache memory space is limited in CMPs.
–Off-chip memory accesses incur a much longer latency than on-chip communication.
4 L2 Cache Organization in CMPs
Private L2 cache vs. shared L2 cache:
–Private L2: short access latency, but inefficient in utilizing the total L2 cache space.
–Shared L2: utilizes capacity efficiently, but longer access latency and more on-chip network traffic.
How can we combine the strengths of private and shared caches?
5 Cooperative Caching (CMP-CC)
"Cooperative Caching for Chip Multiprocessors", ISCA 2006.
–Based on the private cache organization (P0–P3, each with a private L2 cache).
–Writes back L2 victims from the local cache to a randomly chosen peer cache, with a given probability from 0% to 100%.
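The CMP-CC spill mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function and parameter names are invented for clarity, and peer caches are modeled as plain lists.

```python
import random

def cmpcc_spill(victim, peer_caches, probability=1.0, rng=random):
    """CMP-CC-style spill: on eviction, write the L2 victim to a
    randomly chosen peer cache with the given probability (0.0-1.0).
    Returns the chosen peer cache, or None if the victim is dropped
    (i.e., it simply goes back to memory)."""
    if not peer_caches or rng.random() >= probability:
        return None                    # victim is not spilled to a peer
    peer = rng.choice(peer_caches)     # destination peer picked at random
    peer.append(victim)                # place the victim in the peer cache
    return peer
```

A probability of 100% always spills the victim; 0% reduces the scheme to a plain private-cache organization.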
6 Problem of Reusability-Oblivious Write-Backs
If a block is written back to another cache but never reused, system performance can degrade.
Such useless write-backs should be reduced.
Reusability-aware adaptive write-backs are necessary.
7 Adaptive Write-Back
An adaptive write-back requires:
–The reusability of each block: is the L2 victim's reusability low or high?
–The memory demand of each processor: which peer cache holds a block with low reusability? Which peer cache has a memory demand smaller than P0's?
8 Reusability Prediction Technique
Goal: do not write back blocks with low reusability.
The reusability of a block, i.e., its expected reuse after eviction, is predicted from its Access Time Interval and Frequency (ATIF):
–A block accessed with long time intervals has high reusability.
–A block accessed with short time intervals has low reusability.
9 Access Time Interval and Frequency Pattern
Classify blocks into 16 patterns using two counters per block:
–The number of accesses with a long time interval.
–The number of accesses with a short time interval.
The pattern index is 2 bits of the short-time-interval counter + 2 bits of the long-time-interval counter.
Monitor how many blocks are reused per pattern: if blocks in a certain pattern are highly reused, blocks in that pattern are predicted to have high reusability.
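The per-block counters and 16-pattern classification can be sketched as below. The class and method names are illustrative; the sketch assumes both counters saturate at 3 (the overhead slide later mentions a 4-bit long-interval counter, of which only 2 bits would feed the pattern index).

```python
def saturating_inc(counter, max_value=3):
    """Increment a 2-bit saturating counter (values 0-3)."""
    return min(counter + 1, max_value)

class BlockATIF:
    """Per-block ATIF state: counts accesses with long and short
    time intervals, and derives one of 16 pattern indices from the
    concatenation of the two 2-bit counter values."""
    def __init__(self):
        self.long_count = 0
        self.short_count = 0

    def record_access(self, long_interval):
        if long_interval:
            self.long_count = saturating_inc(self.long_count)
        else:
            self.short_count = saturating_inc(self.short_count)

    def pattern(self):
        # 2 bits of the short counter + 2 bits of the long counter
        return (self.short_count << 2) | self.long_count   # 0..15
```

A per-cache table of 16 reuse counters, indexed by `pattern()`, would then record how often blocks in each pattern are reused after eviction.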
10 Fraction of Unused and Reused Blocks
[Chart: fraction of unused vs. reused blocks per short/long time-interval pattern]
The larger the number of long-time-interval accesses, the larger the number of reused blocks.
11 Memory Demand Prediction Technique
Goal: do not pollute the L2 cache of a processor with a high memory demand.
Heuristic: the more replacements occur, the more memory the processor requires.
Replacement time interval history (Repl_interval_history):
–The prediction value of the memory demand, updated every time a replacement occurs.
–A smaller Repl_interval_history means the processor requires more memory.
We write the block back to a peer cache with a smaller memory demand.
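The memory demand heuristic can be sketched as below. The slides do not give the exact update rule for Repl_interval_history, so this sketch assumes a simple exponential moving average of the cycles between consecutive replacements; the class and function names are invented for illustration.

```python
class MemoryDemandPredictor:
    """Tracks Repl_interval_history for one private L2 cache.
    A smaller history value means the cache is replacing blocks
    more often, i.e., its processor demands more cache capacity."""
    def __init__(self, alpha=0.5):
        self.alpha = alpha                 # smoothing factor (assumption)
        self.last_replacement_cycle = None
        self.history = 0.0                 # Repl_interval_history

    def on_replacement(self, cycle):
        """Called every time a replacement occurs in this cache."""
        if self.last_replacement_cycle is not None:
            interval = cycle - self.last_replacement_cycle
            self.history = (1 - self.alpha) * self.history + self.alpha * interval
        self.last_replacement_cycle = cycle

def lower_demand(a, b):
    """True if cache a demands less memory than cache b
    (a larger replacement interval means lower demand)."""
    return a.history > b.history
```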
12 Experiment Setting
Based on the CATS shared-memory multiprocessor simulator.
Parameters:
–L1 I/D cache: 16KB, 1-way, 1 cycle
–L2 private cache: 256KB, 4-way, 6/40 cycles
–L2 shared cache: 1MB, 16-way, 38 cycles
–Off-chip memory latency: 500 cycles
SPLASH-2 benchmark programs used: Cholesky, FMM, LU, Radix.
13 The Number of Unused and Reused Blocks
The number of reused blocks is the same as with CMP-CC 100%, while the number of unused blocks is reduced by 62%.
14 Normalized Memory Access Latency
The RACS scheme reduces the average memory access latency by 14% and 4% on average over the private L2 scheme and CMP-CC 100%, respectively.
Although memory access latency increases as the write-back probability of CMP-CC increases, RACS still reduces it.
15 Normalized Average IPC
RACS improves the average IPC by 3% and 1% on average over the private L2 scheme and CMP-CC 100%, respectively.
16 Normalized Energy Consumption
RACS consumes 10% less energy than the private cache scheme and 2% less energy than CMP-CC 100%.
17 Conclusions
Proposed the Reusability-Aware Cache memory Sharing technique (RACS):
–Based on a private L2 cache, taking advantage of both private and shared L2 caches.
–Adaptively writes back L2 victims to peer L2 caches, using the reusability of the block and the memory demand of the processor.
RACS reduces the number of unused blocks by 60% over CMP-CC.
RACS reduces the average memory access latency by 14% and 4% over the private L2 cache and CMP-CC, respectively.
18 Thank You
19 Overhead
Hardware overhead:
–Peer-to-peer communication lines between caches: for 4 CPUs, 6 lines of 21 bits.
–Additional counters for the two prediction techniques:
 –Reusability prediction:
  –4-bit counter per block: the number of accesses with a long time interval
  –2-bit counter per block: the number of accesses with a short time interval
  –2 bits per block: which processor wrote back this block
  –1 bit per block: whether the block was reused
  –16 2-bit pattern counters per private cache
 –Memory demand prediction:
  –8-bit counter: time since the last replacement
  –8 bits: replacement interval history
–Total: 9 bits per block and 48 bits per cache => the area overhead is less than 1% of the private cache.
20 Time Overhead
The write-back decision is made after a block is evicted from the cache and placed in the write-back queue, so the decision is not on the critical path.
21 Distinguishing Short and Long Time Intervals
How do we distinguish an access with a short time interval from one with a long time interval?
–If there was any intervening access to another block in the same set: long time interval.
–If not: short time interval.
We use 2 bits per set (for a 4-way cache) to record the most recently accessed block.
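The 2-bit-per-set mechanism above can be sketched as follows; class and method names are invented for illustration. Tracking only the most recently accessed way is enough: if the MRU way differs from the way being accessed, some other block in the set must have been accessed in between.

```python
class SetIntervalClassifier:
    """Per-set classifier for ATIF: an access to a block counts as
    'long interval' if another block in the same set was accessed
    in between, and 'short interval' otherwise. For a 4-way cache
    this needs only 2 bits per set to remember the most recently
    accessed way."""
    def __init__(self):
        self.mru_way = None   # the 2-bit MRU way for this 4-way set

    def access(self, way):
        """Record an access to `way`; return True for a
        long-interval access, False for a short-interval one."""
        long_interval = self.mru_way is not None and self.mru_way != way
        self.mru_way = way
        return long_interval
```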
22 Write-Back Decision Flow
When block a is evicted from private L2 cache A:
1. Is block a in the shared state? If yes, do not write it back to a peer L2 cache.
2. Was block a written back from another L2 cache and never used? If yes, do not write it back.
3. Does block a have low reusability? If yes, do not write it back.
4. Does any peer L2 cache hold a block with low reusability at the bottom of its LRU stack? If yes, write block a back to that cache.
5. Otherwise, does any peer L2 cache have a memory demand ω times smaller than cache A's? If yes, write block a back to that cache; if no, do not write it back.
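The decision flow for an evicted block can be sketched as below. This is an interpretation of the flowchart, not the paper's code; all attribute and class names are invented, peer caches are stubs, and "ω times smaller demand" is modeled as a numeric demand value ω times smaller than the evicting cache's.

```python
class Block:
    """Victim block with the flags the decision flow consults."""
    def __init__(self, shared=False, spilled_and_unused=False,
                 low_reusability=False):
        self.shared = shared
        self.spilled_and_unused = spilled_and_unused
        self.low_reusability = low_reusability

class PeerCache:
    """Stub peer cache exposing the two queries the flow needs."""
    def __init__(self, tail_low_reusability, demand):
        self.tail_low_reusability = tail_low_reusability  # LRU-tail block low-reusability?
        self.demand = demand                              # higher = more memory demand

def decide_writeback(block, my_demand, peers, omega):
    """Sketch of the RACS write-back decision for a victim evicted
    from private L2 cache A. Returns the destination peer, or None
    to drop the victim (send it to memory only)."""
    # Shared blocks, and blocks that were spilled to us but never
    # reused, are not forwarded again.
    if block.shared or block.spilled_and_unused:
        return None
    # Blocks predicted to have low reusability are dropped.
    if block.low_reusability:
        return None
    # Prefer a peer whose LRU tail already holds a low-reusability block.
    for peer in peers:
        if peer.tail_low_reusability:
            return peer
    # Otherwise pick a peer whose memory demand is omega times smaller.
    for peer in peers:
        if peer.demand * omega <= my_demand:
            return peer
    return None
```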
23 The ω Value
If there is no peer L2 cache with an ω times smaller memory demand, we do not write the block back to a peer L2 cache, even if it has high reusability.
Each private L2 cache maintains its own ω value:
–Decreased by 1 when a block is reused.
–Increased by 1 when three blocks have been written back to other caches.
If many blocks are reused, we can write a block back to a peer cache even when the difference in memory demand is not large.
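The ω adaptation rule above can be sketched as follows. The update amounts follow the slide (minus 1 on reuse, plus 1 per three write-backs); the initial value and the min/max bounds are assumptions added so the sketch is well defined.

```python
class OmegaController:
    """Per-cache adaptive omega: decreases by 1 when a spilled
    block is reused, increases by 1 for every three blocks written
    back to peer caches. Initial value and bounds are assumptions."""
    def __init__(self, omega=4, omega_min=1, omega_max=8):
        self.omega = omega
        self.omega_min = omega_min
        self.omega_max = omega_max
        self.writeback_count = 0

    def on_spilled_block_reused(self):
        # Reuse is evidence that spilling pays off: relax the threshold.
        self.omega = max(self.omega - 1, self.omega_min)

    def on_writeback(self):
        # Every third write-back tightens the threshold again.
        self.writeback_count += 1
        if self.writeback_count == 3:
            self.writeback_count = 0
            self.omega = min(self.omega + 1, self.omega_max)
```

A lower ω makes the cache more willing to spill victims to peers, so frequent reuse of spilled blocks gradually encourages more spilling.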
24 Write-Back Protocol
(1) The evicting cache sends the set number and its Repl_interval_history to the peer L2 caches.
(2) Each peer cache checks whether it holds a block with low reusability and whether the received Repl_interval_history is larger than its own.