CACM July 2012 Talk: Mark D. Hill, Cornell University, 10/2012
Executive Summary

Today chips provide shared memory w/ HW coherence as low-level support for OS & application SW

As #cores per chip scales?
o Some argue HW coherence must go due to growing overheads
o We argue it stays, by managing the overheads

Develop scalable on-chip coherence proof-of-concept
o Inclusive caches first
o Exact tracking of sharers & replacements (key to analysis)
o Larger systems need to use hierarchy (clusters)
o Overheads similar to today's

Compatibility of on-chip HW coherence is here to stay
Let's spend programmer sanity on parallelism, not lost compatibility!
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication
2. Storage
3. Enforcing Inclusion
4. Latency
5. Energy
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
Academics Criticize HW Coherence

Choi et al. [DeNovo]:
o "Directory ... coherence ... extremely complex & inefficient .... Directory ... incurring significant storage and invalidation traffic overhead."

Kelm et al. [Cohesion]:
o "A software-managed coherence protocol ... avoids ... directories and duplicate tags, & implementing & verifying ... less traffic ..."
Industry Eschews HW Coherence

Intel 48-Core IA-32 Message-Passing Processor:
o "... SW protocols ... to eliminate the communication & HW overhead"

IBM Cell processor:
o "... the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory"

BUT ...
[Figure-only slide.] Source: Avinash Sodani, "Race to Exascale: Challenges and Opportunities," Micro 2011.
Define "Coherence as Scalable"

Define a coherent system as scalable when the cost of providing coherence grows (at most) slowly as core count increases

Our Focus
o YES: coherence
o NO: any scalable system also requires scalable HW (interconnects, memories) and SW (OS, middleware, apps)

Method
o Identify each overhead & show it can grow slowly

Expect more cores
o Moore's Law provides more transistors
o Power-efficiency improvements (w/o Dennard Scaling)
o Experts disagree on how many cores are possible
Caches & Coherence

Cache: fast, hidden memory, used to reduce
o Latency: average memory access time
o Bandwidth: interconnect traffic
o Energy: cache misses cost more energy

Caches stay hidden (from software)
o Naturally, for a single-core system
o Via a coherence protocol, for a multicore

Maintain the coherence invariant
o For a given (memory) block at a given time, either:
o Modified (M): a single core can read & write
o Shared (S): zero or more cores can read, but not write
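[Editorial note: a minimal C++ sketch, ours rather than the talk's, of the single-writer/multiple-reader invariant just stated. The type and field names are illustrative only.]

    #include <cassert>
    #include <vector>

    enum class State { I, S, M };  // Invalid, Shared (read-only), Modified (read-write)

    struct BlockTracking {
        State state = State::I;
        std::vector<bool> sharers;  // one bit per core: which private caches hold the block
        explicit BlockTracking(int cores) : sharers(cores, false) {}

        int copies() const {
            int n = 0;
            for (bool b : sharers) n += b;
            return n;
        }
        // The invariant: at any given time a block is Modified in exactly
        // one private cache, or Shared in zero or more, never both.
        void check() const {
            if (state == State::M) assert(copies() == 1);
            if (state == State::I) assert(copies() == 0);
        }
    };

    int main() {
        BlockTracking a(16);
        a.state = State::M;
        a.sharers[0] = true;  // core 0 is the single writer
        a.check();            // holds: exactly one M copy
    }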
Baseline Multicore Chip

[Figure: C cores, each with a private cache, connected through an interconnection network to a banked shared cache. A block in a private cache holds ~2 bits of state, a ~64-bit tag, and ~512 bits of data; a block in the shared cache adds ~C tracking bits.]

Intel Core i7 like
o C = 16 cores (not 8)
o Private L1/L2 caches
o Shared last-level cache (LLC)
o 64B blocks w/ ~8B tag

HW coherence pervasive in general-purpose multicore chips: AMD, ARM, IBM, Intel, Sun (Oracle)
Baseline Chip Coherence

[Figure: same chip as above, highlighting the ~C tracking bits kept with each shared-cache block.]

2B per 64+8B L2 block to track L1 copies
Inclusive L2 (w/ recall messages on LLC evictions)
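[Editorial note: a quick back-of-envelope sketch, ours, of the per-block tracking cost just described: a 16-bit sharer vector against a ~64-bit tag plus 512-bit data block.]

    #include <cstdio>

    int main() {
        const double data_bits = 512, tag_bits = 64, cores = 16;
        const double tracking_bits = cores;  // C-bit sharer vector per LLC block
        double overhead = tracking_bits / (tag_bits + data_bits);
        std::printf("tracking overhead: %.1f%%\n", overhead * 100);  // ~2.8%, i.e. the ~3% above
    }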
Coherence Example Setup

Block A in no private caches: state Invalid (I)
Block B in no private caches: state Invalid (I)

[Figure: four cores with private caches and a four-bank shared cache; the shared cache holds A: {0000} I and B: {0000} I, i.e., no sharers tracked.]
Coherence Example 1/4

Block A at Core 0 exclusive read-write: Modified (M)

[Figure: Core 0 issues Write A; its private cache now holds A in M, and the shared-cache entry becomes A: {1000} M.]
Coherence Example 2/4

Block B at Cores 1+2 shared read-only: Shared (S)

[Figure: Cores 1 and 2 issue Read B; their private caches hold B in S, and the shared-cache entry goes from {0100} S to {0110} S.]
Coherence Example 3/4

Block A moved from Core 0 to Core 3 (still M)

[Figure: Core 3 issues Write A; Core 0's copy is invalidated, Core 3's private cache holds A in M, and the shared-cache entry becomes A: {0001} M.]
Coherence Example 4/4

Block B moved from Cores 1+2 (S) to Core 1 (M)

[Figure: Core 1 issues Write B; Core 2's copy is invalidated, Core 1's private cache holds B in M, and the shared-cache entry goes from {0110} S to {1000} M.]
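[Editorial note: a hypothetical C++ sketch, ours, of the directory actions behind the four steps above: reads add a sharer (downgrading an M owner), writes invalidate all other copies before granting M. Names and bit encoding are illustrative.]

    #include <cassert>
    #include <cstdint>

    struct DirEntry {
        enum { I, S, M } state = I;
        uint32_t sharers = 0;  // bit i set => core i's private cache has a copy
    };

    // Read: fetch data (from the owner if M), downgrade to S, add the reader.
    void handleRead(DirEntry& e, int core) {
        if (e.state == DirEntry::M) e.state = DirEntry::S;  // owner writes back & downgrades
        e.sharers |= (1u << core);
        if (e.state == DirEntry::I) e.state = DirEntry::S;
    }

    // Write: invalidate every other sharer (collecting acks), then grant M.
    void handleWrite(DirEntry& e, int core) {
        e.sharers = (1u << core);  // e.g., step 3/4: A goes {1000} M -> {0001} M
        e.state = DirEntry::M;
    }

    int main() {
        DirEntry b;
        handleRead(b, 1); handleRead(b, 2);  // step 2/4: B shared by cores 1 and 2
        handleWrite(b, 1);                   // step 4/4: B becomes M at core 1 only
        assert(b.sharers == (1u << 1) && b.state == DirEntry::M);
    }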
Caches & Coherence

[Figure-only slide.]
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
1. Communication: (a) No Sharing, Dirty

[Figure: C cores and shared cache. Key: green for required, red for overhead; thin arrows are 8-byte control messages, thick arrows are 72-byte data messages.]

o W/o coherence: Request + Data + Data (writeback)
o W/ coherence: Request + Data + Data (writeback) + Ack
o Overhead = 8/(8+72+72) = 5% (independent of #cores!)
1. Communication: (b) No Sharing, Clean

[Figure: same key as before.]

o W/o coherence: Request + Data (no writeback)
o W/ coherence: Request + Data + (Evict) + Ack
o Overhead = (8 to 16)/(8+72) = 10-20% (independent of #cores!)
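[Editorial note: a sketch, ours, of the per-miss byte counts for cases (a) and (b), using the slides' 8-byte control and 72-byte data message sizes.]

    #include <cstdio>

    int main() {
        const double ctrl = 8, data = 72;  // message sizes in bytes
        // (a) dirty, no sharing: coherence adds one Ack to Request+Data+Writeback
        double dirty = ctrl / (ctrl + data + data);          // 8/152 ~ 5%
        // (b) clean, no sharing: coherence adds Evict (+ Ack) to Request+Data
        double clean_lo = ctrl / (ctrl + data);              // 8/80  = 10%
        double clean_hi = 2 * ctrl / (ctrl + data);          // 16/80 = 20%
        std::printf("dirty: %.0f%%  clean: %.0f-%.0f%%\n",
                    dirty * 100, clean_lo * 100, clean_hi * 100);
    }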
1. Communication: (c) Sharing, Read

[Figure: same key.]

o To memory: Request + Data
o To one other core: Request + Forward + Data + (Cleanup)
o Charge 1-2 control messages (independent of #cores!)
1. Communication: (d) Sharing, Write

[Figure: same key.]

o If Shared at C other cores:
o Request + {Data, C Invalidations + C Acks} + (Cleanup)
o Needed since most directory protocols send invalidations to caches that have copies, and sometimes to caches that do not
o Not scalable
1. Communication: Extra Invalidations

[Figure: cores tracked in coarse pairs {1|2, 3|4, ..., C-1|C}, so the sharer vector is inexact.]

o Core 1 Read: Request + Data
o Core C Write: Request + {Data, 2 Inv + 2 Acks} + (Cleanup)
o Charge the write for all necessary & unnecessary invalidations
o What if all invalidations are necessary? Charge the reads that got the data!
1. Communication: No Extra Invalidations

[Figure: exact tracking; only the actual sharers (e.g., cores C-1 and C) are marked.]

o Core 1 Read: Request + Data + {Inv + Ack} (charged now, sent in the future)
o Core C Write: Request + Data + (Cleanup)
o If all invalidations are necessary, coherence adds bounded overhead to each miss: independent of #cores!
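[Editorial note: a sketch, ours, of the accounting argument above. With exact tracking, every invalidation pairs with the earlier read that fetched the copy, so each read miss is charged at most one future Inv+Ack and each write miss none; both costs are constants.]

    #include <cstdio>

    int main() {
        const double ctrl = 8, data = 72;  // bytes, from the slides' key
        // Shared read miss: Request + Data now, plus the one future
        // Inv + Ack that this copy will eventually cost, charged here.
        double per_read  = (ctrl + data) + (ctrl + ctrl);   // 96B, a constant
        // Shared write miss: Request + Data; its invalidations were
        // already charged to the reads that created the copies.
        double per_write = ctrl + data;                     // 80B, a constant
        std::printf("read: %.0fB  write: %.0fB, independent of #cores\n",
                    per_read, per_write);
    }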
1. Communication Overhead

(1) Communication overhead bounded & scalable
(a) Without Sharing & Dirty
(b) Without Sharing & Clean
(c) Shared Read Miss (charge future inv + ack)
(d) Shared Write Miss (not charged for inv + acks)

But this depends on tracking exact sharers (next)
Total Communication

[Plot: total communication vs. read misses per write miss, for C cores; "Exact (unbounded storage)" tracking stays bounded while "Inexact (32b coarse vector)" grows.]

How do we get the performance of "exact" tracking w/ reasonable storage?
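[Editorial note: a sketch, ours, contrasting exact sharer bits with an inexact coarse vector. The 256-core / 8-cores-per-bit configuration is an assumed example; the slide states only a 32b coarse vector. With one bit per group, a write must invalidate every core in each marked group, including cores that never cached the block.]

    #include <bitset>
    #include <cstdio>

    int main() {
        const int cores = 256, groups = 32;          // 32b coarse vector (assumed chip size)
        const int per_group = cores / groups;        // 8 cores share each bit
        std::bitset<groups> coarse;
        coarse.set(3);                               // a single sharer sits in group 3
        int invs_exact  = 1;                         // exact tracking: 1 invalidation
        int invs_coarse = (int)coarse.count() * per_group;  // coarse: 8 invalidations
        std::printf("exact: %d inv, coarse: %d inv\n", invs_exact, invs_coarse);
    }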
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
2. Storage Overhead (Small Chip)

Track up to C = #readers (cores) per LLC block

Small #cores: C-bit vector acceptable
o e.g., 16 bits for 16 cores: 2 bytes / 72 bytes = 3%

[Figure: baseline chip as before; each shared-cache block carries ~C tracking bits alongside ~2-bit state, ~64-bit tag, and ~512-bit data.]
2. Storage Overhead (Larger Chip): Use Hierarchy!

[Figure: K-core clusters, each with an intra-cluster interconnection network and a cluster cache whose tracking bits cover its K cores; an inter-cluster interconnection network joins the clusters to a shared last-level cache whose tracking bits cover the K clusters.]
2. Storage Overhead (Larger Chip)

Medium-large #cores: use hierarchy!
o Cluster: K1 cores with an L2 cluster cache
o Chip: K2 clusters with an L3 global cache
o Enables K1*K2 cores

E.g., 16-core clusters
o 256 cores (16*16)
o 3% storage overhead!!

More generally?
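[Editorial note: a sketch, ours, of the storage math above. With K-core clusters, each level tracks only K sharers, so the per-block overhead stays ~K bits even as total cores grow as K*K; the flat (non-hierarchical) column is our extrapolation for contrast.]

    #include <cstdio>

    int main() {
        const double tag_bits = 64, data_bits = 512;
        for (int k : {8, 16, 32}) {
            double flat = (double)(k * k) / (tag_bits + data_bits);  // one bit per core, no hierarchy
            double hier = (double)k / (tag_bits + data_bits);        // one bit per core/cluster per level
            std::printf("%4d cores: flat %2.0f%%, hierarchical %.1f%%\n",
                        k * k, flat * 100, hier * 100);              // 256 cores: ~44% vs ~2.8%
        }
    }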
Storage Overhead for Scaling

[Plot: storage overhead vs. core count; hierarchy keeps the overhead low, shown for 16 clusters of 16 cores each.]

(2) Hierarchy enables scalable storage
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
3. Enforcing Inclusion (Subtle)

Inclusion: block in a private cache ⇒ in shared cache
+ Augment shared cache to track private-cache sharers (as assumed)
- Replacement in shared cache ⇒ replacement in private caches

Make recalls impossible?
o Requires too much shared-cache associativity
o E.g., 16 cores w/ 4-way private caches ⇒ 64-way associativity

Use recall messages instead ⇒ make recall messages necessary & rare
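[Editorial note: a sketch, ours, of why "make recalls impossible" fails. To guarantee an inclusive shared cache never victimizes a privately cached block, each shared-cache set must be able to hold every private way that can map to it, i.e., #cores times private ways.]

    #include <cstdio>

    int main() {
        const int private_ways = 4;  // per-core private cache associativity
        for (int cores : {16, 64, 256}) {
            // required shared-cache associativity grows linearly with cores
            std::printf("%3d cores: needs %4d-way shared cache\n",
                        cores, cores * private_ways);  // 16 cores -> 64-way, as above
        }
    }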
Inclusion Recall Example

Shared cache miss to new block C
Needs to replace (victimize) block B in the shared cache
Inclusion forces replacement of B in the private caches

[Figure: a Write C misses in the shared cache; the victim is B: {0110} S, so recall messages go to Cores 1 and 2, which hold B in S.]
Make All Recalls Necessary

Exact state tracking (covered earlier)
+ L1/L2 replacement messages (even for clean blocks)
= Every recall message finds a cached block

Every recall message is necessary & occurs after a cache miss (bounded overhead)
Make Necessary Recalls Rare

Recalls are naturally rare when shared cache size / Σ private cache sizes > 2

[Plot: recall rate vs. ratio of shared to total private cache size, assuming misses to random sets [Hill & Smith 1989]; the Core i7's ratio sits well past the knee.]

(3) Recalls made rare
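[Editorial note: a crude model, ours, in the spirit of the random-sets argument cited above, not the talk's actual analysis. If private-cache contents are spread uniformly over shared-cache sets, a shared-cache victim is privately cached with probability roughly (sum of private sizes) / (shared size); the numbers are illustrative only.]

    #include <cstdio>

    int main() {
        for (double ratio : {1.0, 2.0, 4.0, 8.0}) {  // shared size / sum of private sizes
            double recall_prob = 1.0 / ratio;        // rough per-victim recall probability
            std::printf("size ratio %.0f: ~%.0f%% of shared-cache evictions recall\n",
                        ratio, recall_prob * 100);   // ratio > 2 keeps recalls rare
        }
    }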
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
4. Latency Overhead: Often None

[Figure: same key: green for required, red for overhead; thin 8-byte control, thick 72-byte data.]

1. None: private hit
2. "None": private miss + "direct" shared cache hit
3. "None": private miss + shared cache miss
4. BUT ...
4. Latency Overhead: Some

[Figure: same key.]

X: private miss + shared cache hit with indirection(s)

How bad?
4. Latency Overhead: Indirection

X: private miss + shared cache hit with indirection(s)

(interconnect + cache + interconnect + cache + interconnect) / (interconnect + cache + interconnect)

o Acceptable today
o Relative latency similar w/ more cores/hierarchy
o Vs. magically having data at the shared cache

(4) Latency overhead bounded & scalable
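[Editorial note: illustrative numbers only, ours; the cycle counts are assumptions, not measurements. The point is that the indirection path adds one extra cache lookup and one extra interconnect hop to the direct path, a constant factor regardless of core count.]

    #include <cstdio>

    int main() {
        const double net = 15, cache = 10;                  // assumed cycles per hop/lookup
        double direct   = net + cache + net;                // miss -> shared cache hit
        double indirect = net + cache + net + cache + net;  // miss -> directory -> owner
        std::printf("indirection penalty: %.2fx\n", indirect / direct);  // ~1.6x here
    }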
5. Energy Overhead

Dynamic: small
o Extra message energy: traffic increase small/bounded
o Extra state lookup: small relative to cache block lookup
o ...

Static: also small
o Extra state: state increase small/bounded
o ...

Little effect on energy-intensive cores, cache data arrays, off-chip DRAM, secondary storage, ...

(5) Energy overhead bounded & scalable
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
o Apply analysis to caches used by AMD
Criticisms & Summary
Review: Inclusive Shared Cache

[Figure: C cores with private caches over an interconnection network; each shared-cache block holds tracking bits (~1 bit per core), ~2-bit state, ~64-bit tag, and ~512-bit data.]

Inclusive shared cache: block in a private cache ⇒ in shared cache
Blocks must be cached redundantly
Non-Inclusive Shared Cache

[Figure: C cores with private caches over an interconnection network, plus two structures:]

1. Non-inclusive shared cache (state + tag + data block)
o Any size or associativity
o Avoids redundant caching
o Allows victim caching

2. Inclusive directory, a.k.a. probe filter (tracking bits ~1 per core, ~2-bit state, ~64-bit tag; dataless)
o Ensures coherence
o But duplicates tags
Non-Inclusive Shared Cache

Non-inclusive shared cache: data block + tag (any configuration)
Inclusive directory: tag (again) + state
Inclusive directory == coherence state overhead

WITH TWO LEVELS
o Directory size proportional to sum of private cache sizes
o 64b / (48b + 512b) * 2 (for rare recalls) = 22% of Σ L1 sizes

Coherence overhead higher than w/ inclusion:

L2 / ΣL1s:   1      2      4      8
Overhead:    11%    7.6%   4.6%   2.5%
Non-Inclusive Shared Caches

WITH THREE LEVELS
Cluster has L2 cache & cluster directory
o Cluster directory points to cores w/ an L1 block (as before)
o (1) Size = 22% of ΣL1 sizes

Chip has L3 cache & global directory
o Global directory points to the cluster holding a block, in either the
o (2) cluster directory: size 22% of ΣL1s, or the
o (3) cluster L2 cache: size 22% of ΣL2s

Hierarchical overhead higher than w/ inclusion:

L3 / ΣL2s = L2 / ΣL1s:    1      2      4      8
Overhead (1)+(2)+(3):     23%    13%    6.5%   3.1%
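[Editorial note: a sketch, ours, that reproduces the two tables above from the slides' stated per-block cost: ~64b of directory entry per tracked block of ~(48b tag + 512b data), doubled so recalls stay rare. The exact composition of the three-level sum (1)+(2)+(3) is our reconstruction of the slide's arithmetic.]

    #include <cstdio>

    int main() {
        const double per_block = 2.0 * 64.0 / (48.0 + 512.0);  // ~22.9%, the "22%" above
        std::printf("two levels (directory over L1s, caches = L1s + L2):\n");
        for (double r : {1.0, 2.0, 4.0, 8.0})                  // r = L2 / sum(L1)
            std::printf("  ratio %.0f: %.1f%%\n", r, 100 * per_block / (1 + r));
        std::printf("three levels (cluster dirs + global dir):\n");
        for (double r : {1.0, 2.0, 4.0, 8.0}) {                // r = L3/sumL2 = L2/sumL1
            double caches = 1 + r + r * r;                     // L1s + L2s + L3, in units of sumL1
            double dirs   = per_block * (1 + 1 + r);           // (1) + (2) + (3)
            std::printf("  ratio %.0f: %.1f%%\n", r, 100 * dirs / caches);  // 23, 13, 6.5, 3.1
        }
    }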
Outline

Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
Some Criticisms

(1) Where are workload-driven evaluations?
o We focused on robust analysis of first-order effects
(2) What about non-coherent approaches?
o We showed that compatible coherence scales
(3) What about protocol complexity?
o We have such protocols today (& ideas for better ones)
(4) What about multi-socket systems?
o Apply the non-inclusive approaches
(5) What about software scalability?
o Hard SW work remains, but it need not re-implement coherence
Executive Summary

Today chips provide shared memory w/ HW coherence as low-level support for OS & application SW

As #cores per chip scales?
o Some argue HW coherence must go due to growing overheads
o We argue it stays, by managing the overheads

Develop scalable on-chip coherence proof-of-concept
o Inclusive caches first
o Exact tracking of sharers & replacements (key to analysis)
o Larger systems need to use hierarchy (clusters)
o Overheads similar to today's

Compatibility of on-chip HW coherence is here to stay
Let's spend programmer sanity on parallelism, not lost compatibility!
Coherence NOT this Awkward

[Figure-only slide.]
Backup Slides (some old)
Outline

Baseline Multicore Chip & Coherence
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
Coherence Example (saved older version)

Block A in no private caches: state Invalid (I)
Block B in no private caches: state Invalid (I)

[Figure: stale version of the example diagram; it already shows A: {1000} M and B: {0110} S rather than the all-Invalid setup.]
1. Communication Overhead WITHOUT SHARING

o 8-byte control messages: Request, Evict, Ack
o 72-byte messages for 64-byte data

Dirty blocks
o W/o coherence: Request + Data + Data (writeback)
o W/ coherence: Request + Data + Data (writeback) + Ack
o Overhead = 8/(8+72+72) = 5%

Clean blocks
o W/o coherence: Request + Data
o W/ coherence: Request + Data + (Evict) + Ack
o Overhead = (8 to 16)/(8+72) = 10-20%

Overhead independent of #cores
1. Communication Overhead WITH SHARING

Read miss
o To memory: Request + Data
o To one other core: Request + Forward + Data + (Cleanup)
o Charge (at most) 1 Invalidation + 1 Ack

Write miss
o To one other core: Request + Forward + Data + (Cleanup)
o To C other cores: as above + C Invalidations + C Acks
o If every invalidation is useful, charge the Read, not the Write miss

(1) Communication overhead bounded & scalable
But depends on tracking exact sharers (next)
2. Storage Overhead

Track up to C = #readers (cores) per LLC block

Small #cores: C-bit vector acceptable
o e.g., 16 bits for 16 cores: 2 bytes / 72 bytes = 3%

Medium-large #cores: use hierarchy!
o Cluster: K1 cores with an L2 cluster cache
o Chip: K2 clusters with an L3 global cache
o Enables K1*K2 cores (picture next)
4. Latency Overhead

Added coherence latency (ignoring hierarchy)
1. None: private hit
2. "None": private miss + shared cache miss
3. "None": private miss + "direct" shared cache hit
X: private miss + shared cache hit with indirection(s)

(interconnect + cache + interconnect + cache + interconnect) / (interconnect + cache + interconnect)

o Acceptable today
o Not significantly changed by scale or hierarchy

(4) Latency overhead bounded & scalable
1. Communication: (d) Sharing, Write

[Figure: same key: green for required, red for overhead; thin 8-byte control, thick 72-byte data.]

o If Shared at C other cores:
o Request + {Data, C Invalidations + C Acks} + (Cleanup)
o If every invalidation is useful, charge the Read, not the Write miss
o Overhead independent of #cores for all cases!
Why On-Chip Cache Coherence is Here to Stay

Milo M. K. Martin, Univ. of Pennsylvania
Mark D. Hill, Univ. of Wisconsin
Daniel J. Sorin, Duke Univ.

October 2012

Appears in [Communications of the ACM, July 2012]

Study cache coherence performance WITHOUT trace-driven or execution-driven simulation!
A Future for HW Coherence?

Academics criticize HW coherence
o "Directory ... coherence ... extremely complex & inefficient .... Directory ... incurring significant storage and invalidation traffic overhead." - Choi et al. [DeNovo]
o "A software-managed coherence protocol ... avoids ... directories and duplicate tags, & implementing & verifying ... less traffic ..." - Kelm et al. [Cohesion]

Industry experiments with avoiding HW coherence
o Intel 48-Core IA-32 Message-Passing Processor: "... SW protocols ... to eliminate the communication & HW overhead"
o IBM Cell processor: "... the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory"