CACM July 2012 Talk: Mark D. Hill, Cornell University, 10/2012
Executive Summary

Today chips provide shared memory w/ HW coherence as low-level support for OS & application SW

As #cores per chip scales?
o Some argue HW coherence must go due to growing overheads
o We argue it stays, by managing the overheads

Develop scalable on-chip coherence proof-of-concept
o Inclusive caches first
o Exact tracking of sharers & replacements (key to analysis)
o Larger systems need to use hierarchy (clusters)
o Overheads similar to today's

Compatibility of on-chip HW coherence is here to stay
Let's spend programmer sanity on parallelism, not lost compatibility!
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication
2. Storage
3. Enforcing Inclusion
4. Latency
5. Energy
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
Academics Criticize HW Coherence

Choi et al. [DeNovo]:
o "Directory ... coherence ... extremely complex & inefficient .... Directory ... incurring significant storage and invalidation traffic overhead."

Kelm et al. [Cohesion]:
o "A software-managed coherence protocol ... avoids ... directories and duplicate tags, & implementing & verifying ... less traffic ..."
Industry Eschews HW Coherence

Intel 48-Core IA-32 Message-Passing Processor:
o "... SW protocols ... to eliminate the communication & HW overhead"

IBM Cell processor:
o "... the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory"

BUT ...
[Figure-only slide.] Source: Avinash Sodani, "Race to Exascale: Challenges and Opportunities," Micro 2011.
Define "Coherence as Scalable"

Define a coherent system as scalable when the cost of providing coherence grows (at most) slowly as core count increases

Our Focus
o YES: coherence
o NO: any scalable system also requires scalable HW (interconnects, memories) and SW (OS, middleware, apps)

Method
o Identify each overhead & show it can grow slowly

Expect more cores
o Moore's Law provides more transistors
o Power-efficiency improvements (w/o Dennard Scaling)
o Experts disagree on how many cores are possible
Caches & Coherence

Cache: fast, hidden memory, used to reduce
o Latency: average memory access time
o Bandwidth: interconnect traffic
o Energy: cache misses cost more energy

Caches stay hidden (from software)
o Naturally, for a single-core system
o Via a coherence protocol, for a multicore

Maintain the coherence invariant
o For a given (memory) block at a given time, either:
o Modified (M): a single core can read & write
o Shared (S): zero or more cores can read, but not write
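[Editorial note: a minimal C++ sketch, ours rather than the talk's, of the single-writer/multiple-reader invariant just stated. The type and field names are illustrative only.]

    #include <cassert>
    #include <vector>

    enum class State { I, S, M };  // Invalid, Shared (read-only), Modified (read-write)

    struct BlockTracking {
        State state = State::I;
        std::vector<bool> sharers;  // one bit per core: which private caches hold the block
        explicit BlockTracking(int cores) : sharers(cores, false) {}

        int copies() const {
            int n = 0;
            for (bool b : sharers) n += b;
            return n;
        }
        // The invariant: at any given time a block is Modified in exactly
        // one private cache, or Shared in zero or more, never both.
        void check() const {
            if (state == State::M) assert(copies() == 1);
            if (state == State::I) assert(copies() == 0);
        }
    };

    int main() {
        BlockTracking a(16);
        a.state = State::M;
        a.sharers[0] = true;  // core 0 is the single writer
        a.check();            // holds: exactly one M copy
    }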
Baseline Multicore Chip

[Figure: C cores, each with a private cache, connected through an interconnection network to a banked shared cache. A block in a private cache holds ~2 bits of state, a ~64-bit tag, and ~512 bits of data; a block in the shared cache adds ~C tracking bits.]

Intel Core i7 like
o C = 16 cores (not 8)
o Private L1/L2 caches
o Shared last-level cache (LLC)
o 64B blocks w/ ~8B tag

HW coherence pervasive in general-purpose multicore chips: AMD, ARM, IBM, Intel, Sun (Oracle)
Baseline Chip Coherence

[Figure: same chip as above, highlighting the ~C tracking bits kept with each shared-cache block.]

2B per 64+8B L2 block to track L1 copies
Inclusive L2 (w/ recall messages on LLC evictions)
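[Editorial note: a quick back-of-envelope sketch, ours, of the per-block tracking cost just described: a 16-bit sharer vector against a ~64-bit tag plus 512-bit data block.]

    #include <cstdio>

    int main() {
        const double data_bits = 512, tag_bits = 64, cores = 16;
        const double tracking_bits = cores;  // C-bit sharer vector per LLC block
        double overhead = tracking_bits / (tag_bits + data_bits);
        std::printf("tracking overhead: %.1f%%\n", overhead * 100);  // ~2.8%, i.e. the ~3% above
    }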
Coherence Example Setup

Block A in no private caches: state Invalid (I)
Block B in no private caches: state Invalid (I)

[Figure: four cores with private caches and a four-bank shared cache; the shared cache holds A: {0000} I and B: {0000} I, i.e., no sharers tracked.]
Coherence Example 1/4

Block A at Core 0 exclusive read-write: Modified (M)

[Figure: Core 0 issues Write A; its private cache now holds A in M, and the shared-cache entry becomes A: {1000} M.]
Coherence Example 2/4

Block B at Cores 1+2 shared read-only: Shared (S)

[Figure: Cores 1 and 2 issue Read B; their private caches hold B in S, and the shared-cache entry goes from {0100} S to {0110} S.]
Coherence Example 3/4

Block A moved from Core 0 to Core 3 (still M)

[Figure: Core 3 issues Write A; Core 0's copy is invalidated, Core 3's private cache holds A in M, and the shared-cache entry becomes A: {0001} M.]
Coherence Example 4/4

Block B moved from Cores 1+2 (S) to Core 1 (M)

[Figure: Core 1 issues Write B; Core 2's copy is invalidated, Core 1's private cache holds B in M, and the shared-cache entry goes from {0110} S to {1000} M.]
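[Editorial note: a hypothetical C++ sketch, ours, of the directory actions behind the four steps above: reads add a sharer (downgrading an M owner), writes invalidate all other copies before granting M. Names and bit encoding are illustrative.]

    #include <cassert>
    #include <cstdint>

    struct DirEntry {
        enum { I, S, M } state = I;
        uint32_t sharers = 0;  // bit i set => core i's private cache has a copy
    };

    // Read: fetch data (from the owner if M), downgrade to S, add the reader.
    void handleRead(DirEntry& e, int core) {
        if (e.state == DirEntry::M) e.state = DirEntry::S;  // owner writes back & downgrades
        e.sharers |= (1u << core);
        if (e.state == DirEntry::I) e.state = DirEntry::S;
    }

    // Write: invalidate every other sharer (collecting acks), then grant M.
    void handleWrite(DirEntry& e, int core) {
        e.sharers = (1u << core);  // e.g., step 3/4: A goes {1000} M -> {0001} M
        e.state = DirEntry::M;
    }

    int main() {
        DirEntry b;
        handleRead(b, 1); handleRead(b, 2);  // step 2/4: B shared by cores 1 and 2
        handleWrite(b, 1);                   // step 4/4: B becomes M at core 1 only
        assert(b.sharers == (1u << 1) && b.state == DirEntry::M);
    }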
Caches & Coherence

[Figure-only slide.]
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
1. Communication: (a) No Sharing, Dirty

[Figure: C cores and shared cache. Key: green for required, red for overhead; thin arrows are 8-byte control messages, thick arrows are 72-byte data messages.]

o W/o coherence: Request + Data + Data (writeback)
o W/ coherence: Request + Data + Data (writeback) + Ack
o Overhead = 8/(8+72+72) = 5% (independent of #cores!)
1. Communication: (b) No Sharing, Clean

[Figure: same key as before.]

o W/o coherence: Request + Data (no writeback)
o W/ coherence: Request + Data + (Evict) + Ack
o Overhead = (8 to 16)/(8+72) = 10-20% (independent of #cores!)
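[Editorial note: a sketch, ours, of the per-miss byte counts for cases (a) and (b), using the slides' 8-byte control and 72-byte data message sizes.]

    #include <cstdio>

    int main() {
        const double ctrl = 8, data = 72;  // message sizes in bytes
        // (a) dirty, no sharing: coherence adds one Ack to Request+Data+Writeback
        double dirty = ctrl / (ctrl + data + data);          // 8/152 ~ 5%
        // (b) clean, no sharing: coherence adds Evict (+ Ack) to Request+Data
        double clean_lo = ctrl / (ctrl + data);              // 8/80  = 10%
        double clean_hi = 2 * ctrl / (ctrl + data);          // 16/80 = 20%
        std::printf("dirty: %.0f%%  clean: %.0f-%.0f%%\n",
                    dirty * 100, clean_lo * 100, clean_hi * 100);
    }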
1. Communication: (c) Sharing, Read

[Figure: same key.]

o To memory: Request + Data
o To one other core: Request + Forward + Data + (Cleanup)
o Charge 1-2 control messages (independent of #cores!)
1. Communication: (d) Sharing, Write

[Figure: same key.]

o If Shared at C other cores:
o Request + {Data, C Invalidations + C Acks} + (Cleanup)
o Needed since most directory protocols send invalidations to caches that have copies, and sometimes to caches that do not
o Not scalable
1. Communication: Extra Invalidations

[Figure: cores tracked in coarse pairs {1|2, 3|4, ..., C-1|C}, so the sharer vector is inexact.]

o Core 1 Read: Request + Data
o Core C Write: Request + {Data, 2 Inv + 2 Acks} + (Cleanup)
o Charge the write for all necessary & unnecessary invalidations
o What if all invalidations are necessary? Charge the reads that got the data!
1. Communication: No Extra Invalidations

[Figure: exact tracking; only the actual sharers (e.g., cores C-1 and C) are marked.]

o Core 1 Read: Request + Data + {Inv + Ack} (charged now, sent in the future)
o Core C Write: Request + Data + (Cleanup)
o If all invalidations are necessary, coherence adds bounded overhead to each miss: independent of #cores!
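[Editorial note: a sketch, ours, of the accounting argument above. With exact tracking, every invalidation pairs with the earlier read that fetched the copy, so each read miss is charged at most one future Inv+Ack and each write miss none; both costs are constants.]

    #include <cstdio>

    int main() {
        const double ctrl = 8, data = 72;  // bytes, from the slides' key
        // Shared read miss: Request + Data now, plus the one future
        // Inv + Ack that this copy will eventually cost, charged here.
        double per_read  = (ctrl + data) + (ctrl + ctrl);   // 96B, a constant
        // Shared write miss: Request + Data; its invalidations were
        // already charged to the reads that created the copies.
        double per_write = ctrl + data;                     // 80B, a constant
        std::printf("read: %.0fB  write: %.0fB, independent of #cores\n",
                    per_read, per_write);
    }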
1. Communication Overhead

(1) Communication overhead bounded & scalable
(a) Without Sharing & Dirty
(b) Without Sharing & Clean
(c) Shared Read Miss (charge future inv + ack)
(d) Shared Write Miss (not charged for inv + acks)

But this depends on tracking exact sharers (next)
Total Communication

[Plot: total communication vs. read misses per write miss, for C cores; "Exact (unbounded storage)" tracking stays bounded while "Inexact (32b coarse vector)" grows.]

How do we get the performance of "exact" tracking w/ reasonable storage?
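[Editorial note: a sketch, ours, contrasting exact sharer bits with an inexact coarse vector. The 256-core / 8-cores-per-bit configuration is an assumed example; the slide states only a 32b coarse vector. With one bit per group, a write must invalidate every core in each marked group, including cores that never cached the block.]

    #include <bitset>
    #include <cstdio>

    int main() {
        const int cores = 256, groups = 32;          // 32b coarse vector (assumed chip size)
        const int per_group = cores / groups;        // 8 cores share each bit
        std::bitset<groups> coarse;
        coarse.set(3);                               // a single sharer sits in group 3
        int invs_exact  = 1;                         // exact tracking: 1 invalidation
        int invs_coarse = (int)coarse.count() * per_group;  // coarse: 8 invalidations
        std::printf("exact: %d inv, coarse: %d inv\n", invs_exact, invs_coarse);
    }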
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
2. Storage Overhead (Small Chip)

Track up to C = #readers (cores) per LLC block

Small #cores: C-bit vector acceptable
o e.g., 16 bits for 16 cores: 2 bytes / 72 bytes = 3%

[Figure: baseline chip as before; each shared-cache block carries ~C tracking bits alongside ~2-bit state, ~64-bit tag, and ~512-bit data.]
2. Storage Overhead (Larger Chip): Use Hierarchy!

[Figure: K-core clusters, each with an intra-cluster interconnection network and a cluster cache whose tracking bits cover its K cores; an inter-cluster interconnection network joins the clusters to a shared last-level cache whose tracking bits cover the K clusters.]
2. Storage Overhead (Larger Chip)

Medium-large #cores: use hierarchy!
o Cluster: K1 cores with an L2 cluster cache
o Chip: K2 clusters with an L3 global cache
o Enables K1*K2 cores

E.g., 16-core clusters
o 256 cores (16*16)
o 3% storage overhead!!

More generally?
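[Editorial note: a sketch, ours, of the storage math above. With K-core clusters, each level tracks only K sharers, so the per-block overhead stays ~K bits even as total cores grow as K*K; the flat (non-hierarchical) column is our extrapolation for contrast.]

    #include <cstdio>

    int main() {
        const double tag_bits = 64, data_bits = 512;
        for (int k : {8, 16, 32}) {
            double flat = (double)(k * k) / (tag_bits + data_bits);  // one bit per core, no hierarchy
            double hier = (double)k / (tag_bits + data_bits);        // one bit per core/cluster per level
            std::printf("%4d cores: flat %2.0f%%, hierarchical %.1f%%\n",
                        k * k, flat * 100, hier * 100);              // 256 cores: ~44% vs ~2.8%
        }
    }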
Storage Overhead for Scaling

[Plot: storage overhead vs. core count; hierarchy keeps the overhead low, shown for 16 clusters of 16 cores each.]

(2) Hierarchy enables scalable storage
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
3. Enforcing Inclusion (Subtle)

Inclusion: block in a private cache ⇒ in shared cache
+ Augment shared cache to track private-cache sharers (as assumed)
- Replacement in shared cache ⇒ replacement in private caches

Make recalls impossible?
o Requires too much shared-cache associativity
o E.g., 16 cores w/ 4-way private caches ⇒ 64-way associativity

Use recall messages instead ⇒ make recall messages necessary & rare
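[Editorial note: a sketch, ours, of why "make recalls impossible" fails. To guarantee an inclusive shared cache never victimizes a privately cached block, each shared-cache set must be able to hold every private way that can map to it, i.e., #cores times private ways.]

    #include <cstdio>

    int main() {
        const int private_ways = 4;  // per-core private cache associativity
        for (int cores : {16, 64, 256}) {
            // required shared-cache associativity grows linearly with cores
            std::printf("%3d cores: needs %4d-way shared cache\n",
                        cores, cores * private_ways);  // 16 cores -> 64-way, as above
        }
    }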
Inclusion Recall Example

Shared cache miss to new block C
Needs to replace (victimize) block B in the shared cache
Inclusion forces replacement of B in the private caches

[Figure: a Write C misses in the shared cache; the victim is B: {0110} S, so recall messages go to Cores 1 and 2, which hold B in S.]
Make All Recalls Necessary

Exact state tracking (covered earlier)
+ L1/L2 replacement messages (even for clean blocks)
= Every recall message finds a cached block

Every recall message is necessary & occurs after a cache miss (bounded overhead)
Make Necessary Recalls Rare

Recalls are naturally rare when shared cache size / Σ private cache sizes > 2

[Plot: recall rate vs. ratio of shared to total private cache size, assuming misses to random sets [Hill & Smith 1989]; the Core i7's ratio sits well past the knee.]

(3) Recalls made rare
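[Editorial note: a crude model, ours, in the spirit of the random-sets argument cited above, not the talk's actual analysis. If private-cache contents are spread uniformly over shared-cache sets, a shared-cache victim is privately cached with probability roughly (sum of private sizes) / (shared size); the numbers are illustrative only.]

    #include <cstdio>

    int main() {
        for (double ratio : {1.0, 2.0, 4.0, 8.0}) {  // shared size / sum of private sizes
            double recall_prob = 1.0 / ratio;        // rough per-victim recall probability
            std::printf("size ratio %.0f: ~%.0f%% of shared-cache evictions recall\n",
                        ratio, recall_prob * 100);   // ratio > 2 keeps recalls rare
        }
    }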
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
4. Latency Overhead: Often None

[Figure: same key: green for required, red for overhead; thin 8-byte control, thick 72-byte data.]

1. None: private hit
2. "None": private miss + "direct" shared cache hit
3. "None": private miss + shared cache miss
4. BUT ...
4. Latency Overhead: Some

[Figure: same key.]

X: private miss + shared cache hit with indirection(s)

How bad?
4. Latency Overhead: Indirection

X: private miss + shared cache hit with indirection(s)

(interconnect + cache + interconnect + cache + interconnect) / (interconnect + cache + interconnect)

o Acceptable today
o Relative latency similar w/ more cores/hierarchy
o Vs. magically having data at the shared cache

(4) Latency overhead bounded & scalable
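[Editorial note: illustrative numbers only, ours; the cycle counts are assumptions, not measurements. The point is that the indirection path adds one extra cache lookup and one extra interconnect hop to the direct path, a constant factor regardless of core count.]

    #include <cstdio>

    int main() {
        const double net = 15, cache = 10;                  // assumed cycles per hop/lookup
        double direct   = net + cache + net;                // miss -> shared cache hit
        double indirect = net + cache + net + cache + net;  // miss -> directory -> owner
        std::printf("indirection penalty: %.2fx\n", indirect / direct);  // ~1.6x here
    }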
5. Energy Overhead

Dynamic: small
o Extra message energy: traffic increase small/bounded
o Extra state lookup: small relative to cache block lookup
o ...

Static: also small
o Extra state: state increase small/bounded
o ...

Little effect on energy-intensive cores, cache data arrays, off-chip DRAM, secondary storage, ...

(5) Energy overhead bounded & scalable
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
o Apply analysis to caches used by AMD
Criticisms & Summary
Review: Inclusive Shared Cache

[Figure: C cores with private caches over an interconnection network; each shared-cache block holds tracking bits (~1 bit per core), ~2-bit state, ~64-bit tag, and ~512-bit data.]

Inclusive shared cache: block in a private cache ⇒ in shared cache
Blocks must be cached redundantly
Non-Inclusive Shared Cache

[Figure: C cores with private caches over an interconnection network, plus two structures:]

1. Non-inclusive shared cache (state + tag + data block)
o Any size or associativity
o Avoids redundant caching
o Allows victim caching

2. Inclusive directory, a.k.a. probe filter (tracking bits ~1 per core, ~2-bit state, ~64-bit tag; dataless)
o Ensures coherence
o But duplicates tags
Non-Inclusive Shared Cache

Non-inclusive shared cache: data block + tag (any configuration)
Inclusive directory: tag (again) + state
Inclusive directory == coherence state overhead

WITH TWO LEVELS
o Directory size proportional to sum of private cache sizes
o 64b / (48b + 512b) * 2 (for rare recalls) = 22% of Σ L1 sizes

Coherence overhead higher than w/ inclusion:

L2 / ΣL1s:   1      2      4      8
Overhead:    11%    7.6%   4.6%   2.5%
Non-Inclusive Shared Caches

WITH THREE LEVELS
Cluster has L2 cache & cluster directory
o Cluster directory points to cores w/ an L1 block (as before)
o (1) Size = 22% of ΣL1 sizes

Chip has L3 cache & global directory
o Global directory points to the cluster holding a block, in either the
o (2) cluster directory: size 22% of ΣL1s, or the
o (3) cluster L2 cache: size 22% of ΣL2s

Hierarchical overhead higher than w/ inclusion:

L3 / ΣL2s = L2 / ΣL1s:    1      2      4      8
Overhead (1)+(2)+(3):     23%    13%    6.5%   3.1%
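[Editorial note: a sketch, ours, that reproduces the two tables above from the slides' stated per-block cost: ~64b of directory entry per tracked block of ~(48b tag + 512b data), doubled so recalls stay rare. The exact composition of the three-level sum (1)+(2)+(3) is our reconstruction of the slide's arithmetic.]

    #include <cstdio>

    int main() {
        const double per_block = 2.0 * 64.0 / (48.0 + 512.0);  // ~22.9%, the "22%" above
        std::printf("two levels (directory over L1s, caches = L1s + L2):\n");
        for (double r : {1.0, 2.0, 4.0, 8.0})                  // r = L2 / sum(L1)
            std::printf("  ratio %.0f: %.1f%%\n", r, 100 * per_block / (1 + r));
        std::printf("three levels (cluster dirs + global dir):\n");
        for (double r : {1.0, 2.0, 4.0, 8.0}) {                // r = L3/sumL2 = L2/sumL1
            double caches = 1 + r + r * r;                     // L1s + L2s + L3, in units of sumL1
            double dirs   = per_block * (1 + 1 + r);           // (1) + (2) + (3)
            std::printf("  ratio %.0f: %.1f%%\n", r, 100 * dirs / caches);  // 23, 13, 6.5, 3.1
        }
    }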
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
Some Criticisms

(1) Where are workload-driven evaluations?
o We focused on robust analysis of first-order effects
(2) What about non-coherent approaches?
o We showed that compatible coherence scales
(3) What about protocol complexity?
o We have such protocols today (& ideas for better ones)
(4) What about multi-socket systems?
o Apply the non-inclusive approaches
(5) What about software scalability?
o Hard SW work remains, but it need not re-implement coherence
Executive Summary

Today chips provide shared memory w/ HW coherence as low-level support for OS & application SW

As #cores per chip scales?
o Some argue HW coherence must go due to growing overheads
o We argue it stays, by managing the overheads

Develop scalable on-chip coherence proof-of-concept
o Inclusive caches first
o Exact tracking of sharers & replacements (key to analysis)
o Larger systems need to use hierarchy (clusters)
o Overheads similar to today's

Compatibility of on-chip HW coherence is here to stay
Let's spend programmer sanity on parallelism, not lost compatibility!
Coherence NOT this Awkward

[Figure-only slide.]
Backup Slides (some old)
Outline

Baseline Multicore Chip & Coherence
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
Coherence Example (saved older version)

Block A in no private caches: state Invalid (I)
Block B in no private caches: state Invalid (I)

[Figure: stale version of the example diagram; it already shows A: {1000} M and B: {0110} S rather than the all-Invalid setup.]
1. Communication Overhead WITHOUT SHARING

o 8-byte control messages: Request, Evict, Ack
o 72-byte messages for 64-byte data

Dirty blocks
o W/o coherence: Request + Data + Data (writeback)
o W/ coherence: Request + Data + Data (writeback) + Ack
o Overhead = 8/(8+72+72) = 5%

Clean blocks
o W/o coherence: Request + Data
o W/ coherence: Request + Data + (Evict) + Ack
o Overhead = (8 to 16)/(8+72) = 10-20%

Overhead independent of #cores
1. Communication Overhead WITH SHARING

Read miss
o To memory: Request + Data
o To one other core: Request + Forward + Data + (Cleanup)
o Charge (at most) 1 Invalidation + 1 Ack

Write miss
o To one other core: Request + Forward + Data + (Cleanup)
o To C other cores: as above + C Invalidations + C Acks
o If every invalidation is useful, charge the Read, not the Write miss

(1) Communication overhead bounded & scalable
But depends on tracking exact sharers (next)
2. Storage Overhead

Track up to C = #readers (cores) per LLC block

Small #cores: C-bit vector acceptable
o e.g., 16 bits for 16 cores: 2 bytes / 72 bytes = 3%

Medium-large #cores: use hierarchy!
o Cluster: K1 cores with an L2 cluster cache
o Chip: K2 clusters with an L3 global cache
o Enables K1*K2 cores (picture next)
4. Latency Overhead

Added coherence latency (ignoring hierarchy)
1. None: private hit
2. "None": private miss + shared cache miss
3. "None": private miss + "direct" shared cache hit
X: private miss + shared cache hit with indirection(s)

(interconnect + cache + interconnect + cache + interconnect) / (interconnect + cache + interconnect)

o Acceptable today
o Not significantly changed by scale or hierarchy

(4) Latency overhead bounded & scalable
1. Communication: (d) Sharing, Write

[Figure: same key: green for required, red for overhead; thin 8-byte control, thick 72-byte data.]

o If Shared at C other cores:
o Request + {Data, C Invalidations + C Acks} + (Cleanup)
o If every invalidation is useful, charge the Read, not the Write miss
o Overhead independent of #cores for all cases!
Why On-Chip Cache Coherence is Here to Stay

Milo M. K. Martin, Univ. of Pennsylvania
Mark D. Hill, Univ. of Wisconsin
Daniel J. Sorin, Duke Univ.

October 2012

Appears in [Communications of the ACM, July 2012]

Study cache coherence performance WITHOUT trace-driven or execution-driven simulation!
A Future for HW Coherence?

Academics criticize HW coherence
o "Directory ... coherence ... extremely complex & inefficient .... Directory ... incurring significant storage and invalidation traffic overhead." - Choi et al. [DeNovo]
o "A software-managed coherence protocol ... avoids ... directories and duplicate tags, & implementing & verifying ... less traffic ..." - Kelm et al. [Cohesion]

Industry experiments with avoiding HW coherence
o Intel 48-Core IA-32 Message-Passing Processor: "... SW protocols ... to eliminate the communication & HW overhead"
o IBM Cell processor: "... the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory"