Optimizing Shared Caches in Chip Multiprocessors. Samir Sapra, Athula Balachandran, Ravishankar Krishnaswamy.


1 Optimizing Shared Caches in Chip Multiprocessors. Samir Sapra, Athula Balachandran, Ravishankar Krishnaswamy

2 Chip Multiprocessors?? “Mowry is working on the development of single-chip multiprocessors: one large chip capable of performing multiple operations at once, using similar techniques to maximize performance” -- Technology Review, 1999. “Just a few years ago, the idea of putting multiple processors on a chip was farfetched. Now it is accepted and commonplace, and virtually every new high performance processor is a chip multiprocessor of some sort…” -- Center for Electronic System Design, Univ. of California Berkeley. [Images: Core 2 Duo die; Sony's Playstation 3, 2006]

3 CMP Caches: Design Space. Architecture: – Placement of Cache/Processors – Interconnects/Routing. Cache Organization & Management: – Private/Shared/Hybrid – Fully Hardware/OS Interface. “L2 is the last line of defense before hitting the memory wall, and is the focus of our talk”

4 Private L2 Cache. [Diagram: each processor has L1 I$ and D$ and its own private L2 $; the L2s connect through an interconnect, governed by a coherence protocol, to off-chip memory.] + Less interconnect traffic + Insulates L2 units + Lower hit latency – Duplication – Load imbalance – Complexity of coherence – Higher miss rate

5 Shared-Interleaved L2 Cache. [Diagram: processors with L1 I$ and D$ connect through an interconnect, governed by a coherence protocol, to a shared L2.] – Interconnect traffic – Interference between cores – Hit latency is higher + No duplication + Balance the load + Lower miss rate + Simplicity of coherence
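
A minimal sketch of how an interleaved shared L2 could map a block to its home bank, which is what removes duplication: every address has exactly one bank. The 64-byte block size, the four banks, and the name `home_bank` are illustrative assumptions, not values from the slides.

```python
BLOCK_BYTES = 64  # assumed cache block size
NUM_BANKS = 4     # assumed number of L2 banks


def home_bank(address: int) -> int:
    """Return the single L2 bank that may hold the block containing `address`."""
    block_number = address // BLOCK_BYTES
    # Consecutive blocks go to consecutive banks, spreading load across banks.
    return block_number % NUM_BANKS
```

Because the bank depends only on the address, any core looking up the same block is sent to the same bank, at the cost of crossing the interconnect when that bank is remote.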

6 Take Home Message: Leverage on-chip access time.

7 Take Home Messages: Leverage on-chip access time. Better sharing of cache resources. Isolating performance of processors. Place data on the chip close to where it is used. Minimize inter-processor misses (in shared cache). Fairness towards processors.

8 On to some solutions… Jichuan Chang and Gurindar S. Sohi, “Cooperative Caching for Chip Multiprocessors”, International Symposium on Computer Architecture, 2006. Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki, “Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches”, International Symposium on Computer Architecture, 2009. Shekhar Srikantaiah, Mahmut Kandemir, and Mary Jane Irwin, “Adaptive Set-Pinning: Managing Shared Caches in Chip Multiprocessors”, Architectural Support for Programming Languages and Operating Systems, 2008. Each handles this problem in a different way.

9 Co-operative Caching (Chang & Sohi). Private L2 caches: attract data locally to reduce remote on-chip accesses. Lowers average on-chip misses. Co-operation among the private caches for efficient use of resources on the chip. Controlling the extent of co-operation to suit the dynamic workload behavior.

10 CC Techniques. Cache-to-cache transfer of clean data: – On a miss, transfer “clean” blocks from another L2 cache. – This is useful for “read only” data (instructions). Replication-aware data replacement: – Blocks are either singlets or replicates. – Evict a singlet only when no replicates exist. – Singlets can be “spilled” to other cache banks. Global replacement of inactive data: – Global management is needed to manage “spilling”. – N-Chance Forwarding. – Set the recirculation count to N when a block is first spilled. – Decrease the count by 1 each time the block is spilled again; once it reaches 0 the block is no longer forwarded.
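
The N-Chance Forwarding rule above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the `Block` type, the `on_evict` function, and the choice of a random peer as the spill target are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    tag: int
    is_singlet: bool                      # True if this is the only on-chip copy
    recirculation: Optional[int] = None   # None means never spilled so far


def on_evict(block: Block, n: int, peers: List[int],
             rng: random.Random) -> Optional[int]:
    """Return the peer cache to spill into, or None to drop the block."""
    if not block.is_singlet:
        return None                       # replicates are dropped, copies remain
    if block.recirculation is None:
        block.recirculation = n           # first spill: start with N chances
    else:
        block.recirculation -= 1          # spilled again: one chance used up
    if block.recirculation <= 0:
        return None                       # no chances left: leave the chip
    return rng.choice(peers)              # assumed policy: random peer cache
```

With N = 1 a singlet is forwarded once and dropped at its next eviction; with N = 0 the scheme degenerates to plain private caches with no spilling.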

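11 Set “Pinning” -- Setup. [Diagram: processors P1 to P4, each with an L1 cache, connect through an interconnect to a shared L2 cache with sets Set 0, Set 1, …, Set (S-1); a second interconnect leads to main memory.]

12 Set “Pinning” -- Problem. [Diagram: processors P1 to P4 sharing L2 sets Set 0, Set 1, …, Set (S-1), backed by main memory.]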

13 Set “Pinning” -- Types of Cache Misses. Compulsory (aka Cold), Capacity, Conflict, Coherence versus Compulsory, Inter-processor, Intra-processor.
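
A small sketch of the second classification. The slide does not define the terms, so this assumes that a non-compulsory miss is intra-processor when the block was last evicted by the requesting processor and inter-processor when another processor evicted it. The function name and the `last_evictor` bookkeeping are hypothetical.

```python
from typing import Dict


def classify_miss(block: int, proc: int, last_evictor: Dict[int, int]) -> str:
    """Classify a shared-cache miss by who evicted the block last.

    `last_evictor` maps a block to the processor whose fill evicted it
    most recently (assumed bookkeeping, for illustration only).
    """
    if block not in last_evictor:
        return "compulsory"               # never cached before, so never evicted
    if last_evictor[block] == proc:
        return "intra-processor"          # the requester displaced its own data
    return "inter-processor"              # another processor displaced it
```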

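14 [Diagram: processors P1 to P4, each paired with its own POP cache (POP 1 to POP 4), in front of main memory; each entry in a shared L2 set holds the fields Owner | Other bits | Data.]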
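
One way to read the Owner field in the diagram is sketched below. It assumes that the first processor to miss in a set becomes its owner, and that a miss by any other processor is filled into the requester's POP cache instead of evicting the owner's data. Both rules and the name `place_on_miss` are assumptions for illustration, not details stated on the slide.

```python
from typing import List, Optional


def place_on_miss(owners: List[Optional[int]], set_index: int, proc: int) -> str:
    """Decide where a missing block is filled under the assumed pinning rule."""
    if owners[set_index] is None:
        owners[set_index] = proc          # assumed: first toucher pins the set
    if owners[set_index] == proc:
        return "L2 set"                   # the owner may replace within its set
    return "POP cache"                    # non-owners fill their own POP cache
```

Under these assumptions a processor can only evict blocks from sets it owns, which is how the scheme would avoid inter-processor misses in the shared sets.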

15 R-NUCA: Use Class-Based Strategies. Solve for the common case! Most current (and future) programs have the following types of accesses: 1. Instruction Access – Shared, but Read-Only. 2. Private Data Access – Read-Write, but not Shared. 3. Shared Data Access – Read-Write (or) Read-Only, but Shared.

16 R-NUCA: Can do this online! We have information from the OS and TLB. For each memory block, classify it as: – Instruction – Private Data – Shared Data. Handle them differently: – Replicate instructions – Keep private data locally – Keep shared data globally.
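
The online classification can be sketched as a first-touch rule. The slide only says that the information comes from the OS and TLB, so tracking state per page, the `page_state` dictionary standing in for OS page-table state, and the name `classify_access` are assumptions.

```python
from typing import Dict, Union

SHARED = "shared"  # marker stored once a second core touches the page


def classify_access(page_state: Dict[int, Union[int, str]], page: int,
                    core: int, is_instruction_fetch: bool) -> str:
    """Classify an access as 'instruction', 'private' or 'shared'."""
    if is_instruction_fetch:
        return "instruction"              # replicate near the requesting core
    owner = page_state.get(page)
    if owner is None:
        page_state[page] = core           # first touch: private to this core
        return "private"
    if owner == core:
        return "private"                  # still only one core has touched it
    page_state[page] = SHARED             # a second core arrived: reclassify
    return "shared"
```

For example, a data page touched first by core 0 is private, and becomes shared for everyone as soon as core 1 accesses it.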

17 R-NUCA: Reactive Clustering. Assign clusters based on level of sharing: – Private Data given level-1 clusters (local cache) – Shared Data given level-16 clusters (16 neighboring machines), etc. Clusters ≈ Overlapping Sets in Set-Associative Mapping. Within a cluster, “Rotational Interleaving”: – Load-Balancing to minimize contention on bus and controller.
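
A one-dimensional stand-in for cluster-based lookup, using the two cluster sizes named on the slide. It is not the paper's rotational interleaving: the ring of 16 cache slices (the slide's "machines"), the consecutive-slice clusters, and the name `target_slice` are simplifying assumptions. It does keep one key property, namely that overlapping clusters agree on which slice holds a given block.

```python
CLUSTER_SIZE = {"private": 1, "shared": 16}  # cluster sizes from the slide


def target_slice(block_class: str, block_number: int, core: int,
                 num_slices: int = 16) -> int:
    """Pick the cache slice to look up for a block of the given class."""
    size = CLUSTER_SIZE[block_class]
    # Assumed cluster shape: `size` consecutive slices starting at the core.
    members = [(core + i) % num_slices for i in range(size)]
    wanted = block_number % size
    # Exactly one member matches, since `size` consecutive IDs cover every
    # residue modulo `size` (given that `size` divides `num_slices`).
    return next(m for m in members if m % size == wanted)
```

Private blocks resolve to the requester's own slice, while a shared block resolves to the same slice no matter which core asks, so only one copy exists on chip.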

18 Future Directions: Area has been closed.

19 Just Kidding… Optimize for power consumption. Assess trade-offs between more caches and more cores. Minimize usage of the OS, but still retain flexibility. Application adaptation to allocated cache quotas. Adding hardware-directed thread-level speculation.

20 Questions? THANK YOU!

21 Backup Commercial and research prototypes – Sun MAJC – Piranha – IBM Power 4/5 – Stanford Hydra

22 Backup

23 Design Space / Tradeoffs. Designs to achieve best of both worlds. Shared vs. Private: Shared L2 lowers the miss rate and eliminates coherence issues. Private L2 reduces the hit latency and complexity. On-chip vs. off-chip: low on-chip access time; TB/s of bandwidth; coherence effects; higher capacity. Hardware vs. Software: flexible.

