CMP L2 Cache Management Presented by: Yang Liu CPS221 Spring 2008 Based on: Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z.

1 CMP L2 Cache Management
Presented by: Yang Liu, CPS221, Spring 2008
Based on:
  - Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z. Chishti, M. Powell, and T. Vijaykumar
  - ASR: Adaptive Selective Replication for CMP Caches, B. Beckmann, M. Marty, and D. Wood

2 Outline
Motivation
Related Work (1) – Non-uniform Caches
CMP-NuRAPID
Related Work (2) – Replication Schemes
ASR

3 Motivation
Two options for L2 caches in CMPs
  - Shared: high latency because of wire delay
  - Private: more misses because of replication
Need hybrid L2 caches
Keep in mind
  - On-chip communication is fast
  - On-chip capacity is limited

4 NUCA
Non-Uniform Cache Architecture
Place frequently-accessed data closest to the core to allow fast access
Couple tag and data placement
Can only place one or two ways in each set close to the processor

5 NuRAPID
Non-uniform access with Replacement And Placement usIng Distance associativity
Decouple the set-associative way number from data placement
Divide the cache data array into d-groups
Use forward and reverse pointers
  - Forward: from tag to data
  - Reverse: from data to tag
  - One to one?
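
The forward and reverse pointers above can be illustrated with a small toy model. This is only a sketch: the class and method names (`TagEntry`, `Frame`, `DecoupledCache`, `place`, `move`, `lookup`) are hypothetical, and it assumes a one-to-one mapping between tags and frames, which is exactly the point the slide leaves as an open question.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TagEntry:
    address: int
    # Forward pointer: (d-group index, frame index) holding the data.
    forward: Optional[Tuple[int, int]] = None


@dataclass
class Frame:
    # Reverse pointer: the tag entry whose data lives in this frame.
    reverse: Optional[TagEntry] = None


class DecoupledCache:
    """Toy model (hypothetical): tags and data frames linked by pointers."""

    def __init__(self, num_dgroups: int, frames_per_dgroup: int) -> None:
        self.tags: Dict[int, TagEntry] = {}
        self.dgroups: List[List[Frame]] = [
            [Frame() for _ in range(frames_per_dgroup)]
            for _ in range(num_dgroups)
        ]

    def place(self, address: int, dgroup: int, frame: int) -> None:
        """Put a new block's data in a free frame and link both pointers."""
        slot = self.dgroups[dgroup][frame]
        if slot.reverse is not None:
            raise ValueError("frame already occupied")
        tag = self.tags.setdefault(address, TagEntry(address))
        tag.forward = (dgroup, frame)
        slot.reverse = tag

    def move(self, address: int, dgroup: int, frame: int) -> None:
        """Relocate the data to another frame; the tag entry stays put."""
        dest = self.dgroups[dgroup][frame]
        if dest.reverse is not None:
            raise ValueError("frame already occupied")
        tag = self.tags[address]
        old_dgroup, old_frame = tag.forward
        self.dgroups[old_dgroup][old_frame].reverse = None
        tag.forward = (dgroup, frame)
        dest.reverse = tag

    def lookup(self, address: int) -> Optional[Tuple[int, int]]:
        """Follow the forward pointer from the tag to the data location."""
        tag = self.tags.get(address)
        return tag.forward if tag else None
```

Because only the pointers change in `move`, the data can migrate between d-groups without touching the set-associative way that holds the tag.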

6 CMP-NuRAPID – Overview
Hybrid: private tag, shared data organization
Controlled Replication (CR)
In-Situ Communication (ISC)
Capacity Stealing (CS)

7 CMP-NuRAPID – Structure
Need carefully chosen d-group preference

8 CMP-NuRAPID – Data and Tag Array
Tag arrays snoop on the bus to maintain coherence
The data array is accessed through a crossbar

9 CMP-NuRAPID – Controlled Replication
For read-only sharing
First use: no copy, saving capacity
Second use: make a copy, reducing future access latency
In total, avoid off-chip misses
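
The first-use and second-use policy can be sketched as a small state model. This is an illustration under stated assumptions, not the paper's implementation: the class name `ControlledReplicationModel`, the returned strings, and the rule that d-group i is the closest one to core i are all hypothetical.

```python
from typing import Dict, Set, Tuple


class ControlledReplicationModel:
    """Toy model (hypothetical) of replicate-on-second-use for read sharing."""

    def __init__(self) -> None:
        # block -> set of d-groups that hold a data copy
        self.data_copies: Dict[int, Set[int]] = {}
        # (core, block) -> d-group that the core's private tag points to
        self.tag_pointer: Dict[Tuple[int, int], int] = {}

    def read(self, core: int, block: int) -> str:
        closest = core  # assumption: d-group i is the closest one to core i
        holders = self.data_copies.setdefault(block, set())
        if not holders:
            # Block is not on chip: fetch it into the reader's closest d-group.
            holders.add(closest)
            self.tag_pointer[(core, block)] = closest
            return "off-chip miss"
        pointer = self.tag_pointer.get((core, block))
        if pointer is None:
            # First use: point the private tag at the existing copy.
            self.tag_pointer[(core, block)] = min(holders)
            return "first use: no copy"
        if pointer != closest:
            # Second use: replicate into the reader's closest d-group.
            holders.add(closest)
            self.tag_pointer[(core, block)] = closest
            return "second use: copy made"
        return "hit in closest d-group"
```

A block that is read once by a second core costs no extra capacity; only a block that is read again earns a local copy.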

10 CMP-NuRAPID – Timing Issues
A read starts before the invalidation and ends after the invalidation
  - Mark the tag for the block being read from a farther d-group as busy
A read starts after the invalidation begins and ends before the invalidation completes
  - Put an entry in the queue that holds the order of the bus transactions before sending a read request to a farther d-group

11 CMP-NuRAPID – In-Situ Communication
For read-write sharing
Communication (C) state
Write-through for all C blocks in L1 cache

12 CMP-NuRAPID – Capacity Stealing
Demote less-frequently-used data to unused frames in the d-groups closer to the cores with lower capacity demands
Placement and Promotion
  - Place all private blocks in the d-group closest to the initiating core
  - Promote the block directly to the closest d-group for the core

13 CMP-NuRAPID – Capacity Stealing
Demotion and Replacement
  - Demote the block to the next-fastest d-group
  - Replace in the order of invalid, private, and shared
Doesn’t this kind of demotion pollute another core’s fastest d-group?
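
The demotion and replacement rules above can be written down as two small helpers. This is a minimal sketch: the function names, the string state labels, and the representation of a core's d-group preference as an ordered list are assumptions made for illustration.

```python
from typing import List, Optional, Sequence

# Lower rank is replaced first: invalid, then private, then shared.
REPLACEMENT_RANK = {"invalid": 0, "private": 1, "shared": 2}


def choose_victim(frame_states: Sequence[str]) -> int:
    """Return the index of the frame to replace (lowest index wins ties)."""
    return min(
        range(len(frame_states)),
        key=lambda i: REPLACEMENT_RANK[frame_states[i]],
    )


def next_fastest(preference: List[str], current: str) -> Optional[str]:
    """Return the d-group a block is demoted to, or None if none is left.

    `preference` is one core's (hypothetical) d-group order, fastest first.
    """
    position = preference.index(current)
    if position + 1 < len(preference):
        return preference[position + 1]
    return None
```

For example, with frames in states `["shared", "private", "invalid"]` the invalid frame is chosen, and a shared block is only evicted when nothing else is available.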

14 CMP-NuRAPID – Methodology
Simics
4-core CMP
8 MB, 8-way CMP-NuRAPID with 4 single-ported d-groups
Both multithreaded and multiprogrammed workloads

15 CMP-NuRAPID – Multithreaded

16 CMP-NuRAPID – Multiprogrammed

17 Replication Schemes
Cooperative Caching
  - Private L2 caches
  - Restrict replication under certain criteria
Victim Replication
  - Shared L2 cache
  - Allow replication under certain criteria
Both have static replication policies
How about dynamic?

18 ASR – Overview
Adaptive Selective Replication
Dynamic cache block replication
Replicate blocks when the benefits exceed the costs
  - Benefits: lower L2 hit latency
  - Costs: more L2 misses

19 ASR – Sharing Types
Single Requestor
  - Blocks are accessed by a single processor
Shared Read-Only
  - Blocks are read, but not written, by multiple processors
Shared Read-Write
  - Blocks are accessed by multiple processors, with at least one write
Focus on replicating shared read-only blocks
  - High locality
  - Little capacity
  - Large portion of requests
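
The three sharing types can be expressed as a classification over a block's access history. This is a sketch only: the function name `classify_block` and the `(processor_id, is_write)` access format are assumptions for illustration.

```python
from typing import Iterable, Tuple


def classify_block(accesses: Iterable[Tuple[int, bool]]) -> str:
    """Classify a block from its (processor_id, is_write) access history."""
    accesses = list(accesses)
    processors = {processor for processor, _ in accesses}
    if len(processors) <= 1:
        # Touched by at most one processor, regardless of reads or writes.
        return "single requestor"
    if any(is_write for _, is_write in accesses):
        return "shared read-write"
    return "shared read-only"
```

Only blocks in the last category are the replication targets the slide focuses on.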

20 ASR – SPR
Selective Probabilistic Replication
Assumes private L2 caches and selectively limits replication on L1 evictions
Uses probabilistic filtering to make local replication decisions
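
Probabilistic filtering can be sketched as a coin flip made on each L1 eviction. The probability table below is hypothetical, chosen only to show levels ranging from "never replicate" to "always replicate"; the function name is also an assumption.

```python
import random

# Hypothetical replication levels, from "never" (0.0) to "always" (1.0).
REPLICATION_PROBABILITY = [0.0, 1 / 64, 1 / 16, 1 / 4, 1 / 2, 1.0]


def should_replicate(level: int, rng: random.Random) -> bool:
    """Decide whether an evicted L1 block is replicated in the local L2."""
    # rng.random() is in [0.0, 1.0), so 0.0 never passes and 1.0 always does.
    return rng.random() < REPLICATION_PROBABILITY[level]
```

Passing a seeded `random.Random` instance keeps a simulation run reproducible while leaving each decision local to one cache.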

21 ASR – Balancing Replication

22 ASR – Replication Control
Replication levels
  - C: Current
  - H: Higher
  - L: Lower
Cycles
  - H: Hit cycles-per-instruction
  - M: Miss cycles-per-instruction

23 ASR – Replication Control

24 Wait until there are enough events to ensure a fair cost/benefit comparison
Wait until four consecutive evaluation intervals predict the same change before changing the replication level
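
The two waiting rules can be sketched as a small controller. This is a simplification under stated assumptions: the names `predict_direction` and `ReplicationController`, the `min_events` threshold, and the rule that an interval with conflicting or no predictions counts as "no change" are all hypothetical, and the cost and benefit inputs stand in for the estimated changes in hit and miss cycles.

```python
def predict_direction(benefit_up: float, cost_up: float,
                      benefit_down: float, cost_down: float) -> int:
    """Return +1 (more replication), -1 (less) or 0 (no change)."""
    go_up = benefit_up > cost_up
    go_down = benefit_down > cost_down
    if go_up and not go_down:
        return 1
    if go_down and not go_up:
        return -1
    return 0


class ReplicationController:
    """Change level only after `required_streak` matching predictions."""

    def __init__(self, num_levels: int, level: int = 0,
                 required_streak: int = 4, min_events: int = 0) -> None:
        self.num_levels = num_levels
        self.level = level
        self.required_streak = required_streak
        self.min_events = min_events  # hypothetical "enough events" threshold
        self._last = 0
        self._streak = 0

    def observe(self, direction: int, events: int = 0) -> int:
        """Feed one evaluation interval; return the resulting level."""
        if events < self.min_events:
            return self.level  # too few events for a fair comparison
        if direction == 0:
            self._last, self._streak = 0, 0
            return self.level
        if direction == self._last:
            self._streak += 1
        else:
            self._last, self._streak = direction, 1
        if self._streak >= self.required_streak:
            new_level = self.level + direction
            self.level = max(0, min(self.num_levels - 1, new_level))
            self._last, self._streak = 0, 0
        return self.level
```

Requiring a streak of matching predictions keeps the replication level from oscillating when consecutive intervals disagree.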

25 ASR – Designs Supported by SPR
SPR-VR
  - Add 1 bit per L2 cache block to identify replicas
  - Disallow replications when the local cache set is filled with owner blocks with identified sharers
SPR-NR
  - Store a 1-bit counter per remote processor for each L2 block
  - Remove the shared bus overhead (How?)
SPR-CC
  - Model the centralized tag structure using an idealized distributed tag structure
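
The SPR-VR rule can be read as a simple predicate over one cache set. This is only a sketch of that reading: the `L2Block` fields, the function name, and the assumption that a set with a free way always accepts a replica are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class L2Block:
    is_replica: bool   # the 1-bit replica flag described above
    has_sharers: bool  # owner block with identified sharers


def replication_allowed(cache_set: Sequence[L2Block], ways: int) -> bool:
    """Return False only when every way holds an owner block with sharers."""
    if len(cache_set) < ways:
        return True  # assumption: a free way can always take a replica
    return not all(
        (not block.is_replica) and block.has_sharers for block in cache_set
    )
```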

26 ASR – Methodology
Two CMP configurations: Current and Future
8 processors
Writeback, write-allocate cache
Both commercial and scientific workloads
Use throughput as the metric

27 ASR – Memory Cycles

28 ASR - Speedup

29 Conclusion
Hybrid is better
Dynamic is better
Need tradeoff
How does it scale…

