
1 CACM July 2012 Talk: Mark D. Hill, Wisconsin @ Cornell University, 10/2012

2 Executive Summary. Today's chips provide shared memory w/ HW coherence as low-level support for OS & application SW. As #cores per chip scales? o Some argue HW coherence will be gone due to growing overheads o We argue it stays, by managing overheads. Develop a scalable on-chip coherence proof-of-concept o inclusive caches first o exact tracking of sharers & replacements (key to analysis) o larger systems need to use hierarchy (clusters) o overheads similar to today's ⇒ Compatibility of on-chip HW coherence is here to stay. Let's spend programmer sanity on parallelism, not lost compatibility!

3 Outline. Motivation & Coherence Background; Scalability Challenges: 1. Communication, 2. Storage, 3. Enforcing Inclusion, 4. Latency, 5. Energy; Extension to Non-Inclusive Shared Caches; Criticisms & Summary

4 Academics Criticize HW Coherence. Choi et al. [DeNovo]: o Directory…coherence…extremely complex & inefficient.... Directory … incurring significant storage and invalidation traffic overhead. Kelm et al. [Cohesion]: o A software-managed coherence protocol... avoids.. directories and duplicate tags, & implementing & verifying … less traffic...

5 Industry Eschews HW Coherence. Intel 48-Core IA-32 Message-Passing Processor: … SW protocols … to eliminate the communication & HW overhead. IBM Cell processor: … the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory. BUT…

6 Source: Avinash Sodani, "Race to Exascale: Challenges and Opportunities," Micro 2011.

7 Define "Coherence as Scalable". Define a coherent system as scalable when the cost of providing coherence grows (at most) slowly as core count increases. Our focus o YES: coherence o NO: any scalable system also requires scalable HW (interconnects, memories) and SW (OS, middleware, apps). Method o identify each overhead & show it can grow slowly. Expect more cores o Moore's Law provides more transistors o power-efficiency improvements (w/o Dennard Scaling) o experts disagree on how many cores are possible

8 Caches & Coherence. Cache: fast, hidden memory, used to reduce o Latency: average memory access time o Bandwidth: interconnect traffic o Energy: cache misses cost more energy. Caches are hidden (from software) o naturally for a single-core system o via a coherence protocol for a multicore. Maintain the coherence invariant o for a given (memory) block at a given time, either o Modified (M): a single core can read & write, or o Shared (S): zero or more cores can read, but not write
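To make the invariant concrete, here is a minimal sketch (illustrative Python; the State enum and invariant_holds helper are my own names, not from the talk) of the single-writer / multiple-reader rule for one block.

```python
from enum import Enum

class State(Enum):
    I = "Invalid"     # no copy in this private cache
    S = "Shared"      # read-only copy; many caches may hold one
    M = "Modified"    # read-write copy; at most one cache may hold it

def invariant_holds(per_core_states):
    """Single-writer / multiple-reader invariant for one block at one instant."""
    writers = sum(1 for s in per_core_states if s is State.M)
    readers = sum(1 for s in per_core_states if s is State.S)
    return (writers == 1 and readers == 0) or writers == 0

print(invariant_holds([State.I, State.S, State.S, State.I]))  # True: two read-only copies
print(invariant_holds([State.M, State.S, State.I, State.I]))  # False: a writer alongside a reader
```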

9 Baseline Multicore Chip: Intel Core i7-like, C = 16 cores (not 8), private L1/L2 caches, shared last-level cache (LLC), 64B blocks w/ ~8B tag. HW coherence is pervasive in general-purpose multicore chips: AMD, ARM, IBM, Intel, Sun (Oracle). [Figure: cores with private caches connected by an interconnection network to the shared cache; a private-cache block holds ~2 bits state + ~64 bits tag + ~512 bits data; a shared-cache block adds ~C tracking bits.]

10 Baseline Chip Coherence: 2B per 64+8B L2 block to track L1 copies; inclusive L2 (w/ recall messages on LLC evictions). [Figure: same baseline chip, highlighting the ~C tracking bits per shared-cache block.]
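A quick sanity check of that tracking cost, as a sketch under the baseline assumptions above (64B data block, ~8B tag, one presence bit per core); the function name is hypothetical.

```python
def tracking_overhead(cores, data_bytes=64, tag_bytes=8):
    # One presence bit per core, compared against the 64B data + ~8B tag per block.
    tracking_bytes = cores / 8
    return tracking_bytes / (data_bytes + tag_bytes)

print(f"{tracking_overhead(16):.1%}")    # ~2.8%: the ~3% / 2B-per-72B figure for 16 cores
print(f"{tracking_overhead(1024):.0%}")  # ~178%: a flat bit vector stops scaling, motivating hierarchy
```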

11 Coherence Example Setup: Block A in no private caches: state Invalid (I). Block B in no private caches: state Invalid (I). [Figure: Cores 0-3 with private caches and a 4-bank shared cache; directory entries A: {0000} I and B: {0000} I.]

12 Coherence Example 1/4: Block A at Core 0, exclusive read-write: Modified (M). [Figure: Core 0 writes A; its private cache now holds A in M; the directory entry becomes A: {1000} M.]

13 Coherence Example 2/4: Block B at Cores 1+2, shared read-only: Shared (S). [Figure: Cores 1 and 2 read B; both private caches hold B in S; the directory entry becomes B: {0110} S.]

14 Coherence Example 3/4: Block A moved from Core 0 to Core 3 (still M). [Figure: Core 3 writes A; Core 0's copy is invalidated; the directory entry becomes A: {0001} M.]

15 Coherence Example 4/4: Block B moved from Cores 1+2 (S) to Core 1 (M). [Figure: Core 1 writes B; Core 2's copy is invalidated; the directory entry for B becomes {0100} M.]
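The four steps above can be replayed with a toy directory model. This is a hypothetical, heavily simplified sketch (no interconnect, no private-cache state machine), meant only to show how the sharer vector and M/S state evolve at the directory.

```python
class Directory:
    """Toy directory: one entry per block, (state, set of sharer/owner core ids)."""
    def __init__(self):
        self.entries = {}

    def read(self, core, block):
        # A read leaves the block Shared and adds this core to the sharer set.
        _state, sharers = self.entries.get(block, ("I", set()))
        self.entries[block] = ("S", sharers | {core})

    def write(self, core, block):
        # A write makes this core the sole Modified owner; other copies are invalidated.
        _state, sharers = self.entries.get(block, ("I", set()))
        self.entries[block] = ("M", {core})
        return sharers - {core}   # cores whose copies must be invalidated

d = Directory()
d.write(0, "A")                  # 1/4: A Modified at core 0        -> {1000} M
d.read(1, "B"); d.read(2, "B")   # 2/4: B Shared at cores 1 and 2   -> {0110} S
d.write(3, "A")                  # 3/4: A moves to core 3 (still M) -> {0001} M
print(d.write(1, "B"))           # 4/4: B Modified at core 1; prints {2}, the copy to invalidate
print(d.entries)                 # {'A': ('M', {3}), 'B': ('M', {1})}
```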

16 Caches & Coherence

17 Outline. Motivation & Coherence Background; Scalability Challenges: 1. Communication: extra bookkeeping messages (longer section), 2. Storage: extra bookkeeping storage, 3. Enforcing Inclusion: extra recall messages (subtle), 4. Latency: indirection on some requests, 5. Energy: dynamic & static overhead; Extension to Non-Inclusive Shared Caches (subtle); Criticisms & Summary

18 1. Communication: (a) No Sharing, Dirty. o W/o coherence: Request → Data → Data (writeback) o W/ coherence: Request → Data → Data (writeback) → Ack o Overhead = 8/(8+72+72) = 5% (independent of #cores!) [Figure: message flow between a private cache and the shared cache. Key: green = required, red = overhead; thin arrows = 8-byte control messages, thick arrows = 72-byte data messages.]

19 1. Communication: (b) No Sharing, Clean. o W/o coherence: Request → Data (no third message) o W/ coherence: Request → Data → (Evict) → Ack o Overhead = (8 to 16)/(8+72) = 10-20% (independent of #cores!) [Figure: message flow; same key as before.]
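The percentages in cases (a) and (b) follow from simple message accounting. A sketch, using the talk's 8-byte control and 72-byte data message sizes:

```python
CTRL, DATA = 8, 72   # message sizes from the slide key: 8B control, 72B data (64B block + header)

# (a) No sharing, dirty: coherence adds one Ack to Request + Data + Data(writeback).
dirty = CTRL / (CTRL + DATA + DATA)
# (b) No sharing, clean: coherence adds an Ack, and optionally an explicit Evict.
clean_low  = CTRL / (CTRL + DATA)
clean_high = 2 * CTRL / (CTRL + DATA)
print(f"dirty: {dirty:.0%}, clean: {clean_low:.0%}-{clean_high:.0%}")  # -> dirty: 5%, clean: 10%-20%
```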

20 1. Communication: (c) Sharing, Read. o To memory: Request → Data o To one other core: Request → Forward → Data → (Cleanup) o Charge 1-2 control messages (independent of #cores!) [Figure: message flow; same key as before.]

21 1. Communication: (d) Sharing, Write. o If Shared at C other cores: Request → {Data, C Invalidations + C Acks} → (Cleanup) o Needed since most directory protocols send invalidations to caches that have, & sometimes do not have, copies o Not scalable [Figure: message flow; same key as before.]

22 1. Communication: Extra Invalidations. o Core 1 Read: Request → Data o Core C Write: Request → {Data, 2 Inv + 2 Acks} → (Cleanup) o Charge the write for all necessary & unnecessary invalidations o What if all invalidations are necessary? Charge the reads that get data! [Figure: coarse sharer vector {1|2 3|4 .. C-1|C} going from {0 0 .. 0} to {1 0 .. 0} to {0 0 .. 1}; same key as before.]

23 1. Communication: No Extra Invalidations. o Core 1 Read: Request → Data + {Inv + Ack} (charged now, sent in the future) o Core C Write: Request → Data → (Cleanup) o If all invalidations are necessary, coherence adds bounded overhead to each miss -- independent of #cores! [Figure: exact sharer vector {1 2 3 4 .. C-1 C} going from {0 0 0 0 .. 0 0} to {1 0 0 0 .. 0 0} to {0 0 0 0 .. 0 1}; same key as before.]
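A sketch of the charging argument (my framing of the slide's accounting, not a quote): with exact tracking, each invalidation a write triggers is billed to the earlier read miss that created that copy, so per-miss traffic is bounded regardless of the number of sharers.

```python
CTRL, DATA = 8, 72

def charged_traffic(miss):
    if miss == "read":
        # Request + Data now, plus at most one future Inv + Ack pre-charged to this read.
        return (CTRL + DATA) + (CTRL + CTRL)
    if miss == "write":
        # Request + Data + optional Cleanup; its invalidations were already charged to reads.
        return (CTRL + DATA) + CTRL

for miss in ("read", "write"):
    print(miss, charged_traffic(miss), "bytes charged, independent of the sharer count")
```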

24 1. Communication Overhead. (1) Communication overhead is bounded & scalable: (a) without sharing, dirty; (b) without sharing, clean; (c) shared read miss (charge future inv + ack); (d) shared write miss (not charged for inv + acks). But this depends on tracking exact sharers (next).

25 Total Communication. [Figure: plot of total communication vs. read misses per write miss (up to C), comparing exact sharer tracking (unbounded storage) against inexact tracking (32-bit coarse vector).] How do we get the performance of "exact" tracking with reasonable storage?
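A small sketch of why inexact tracking hurts: with a coarse vector, each bit stands for a group of cores, so a write may invalidate whole groups that contain only one actual sharer. The group mapping and parameters here are hypothetical, chosen only to illustrate the gap the plot summarizes.

```python
def invalidations(readers, cores, exact=True, vector_bits=32):
    if exact:
        return len(readers)                      # invalidate exactly the sharers
    group = max(1, cores // vector_bits)         # cores represented by each coarse-vector bit
    groups = {r // group for r in readers}
    return len(groups) * group                   # invalidate every core in each marked group

readers = [0, 37, 250]                           # three sharers on a hypothetical 1024-core chip
print(invalidations(readers, 1024, exact=True))  # 3 invalidations with exact tracking
print(invalidations(readers, 1024, exact=False)) # 96 invalidations (3 groups x 32 cores) with a coarse vector
```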

26 Outline. Motivation & Coherence Background; Scalability Challenges: 1. Communication: extra bookkeeping messages (longer section), 2. Storage: extra bookkeeping storage, 3. Enforcing Inclusion: extra recall messages, 4. Latency: indirection on some requests, 5. Energy: dynamic & static overhead; Extension to Non-Inclusive Shared Caches; Criticisms & Summary

27 2. Storage Overhead (Small Chip). Track up to C = #readers (cores) per LLC block. Small #cores: a C-bit vector is acceptable o e.g., 16 bits for 16 cores: 2 bytes / 72 bytes = 3%. [Figure: baseline chip diagram as before, with ~C tracking bits per shared-cache block.]

28 2. Storage Overhead (Larger Chip): Use Hierarchy! [Figure: clusters of K cores, each with private caches on an intra-cluster interconnection network sharing a cluster cache with per-core tracking bits; an inter-cluster interconnection network connects the cluster caches to a shared last-level cache whose tracking bits point to clusters.]

29 2. Storage Overhead (Larger Chip). Medium-large #cores: use hierarchy! o Cluster: K1 cores with an L2 cluster cache o Chip: K2 clusters with an L3 global cache o Enables K1*K2 cores. E.g., 16 16-core clusters o 256 cores (16*16) o 3% storage overhead!! More generally?
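A sketch of the hierarchy arithmetic, under the same block-layout assumptions as before: each level only tracks its own fan-out (K1 cores per cluster cache, K2 clusters per global cache), so per-level overhead stays near the flat 16-core figure even as total core count grows.

```python
def level_overhead(fanout_bits, data_bytes=64, tag_bytes=8):
    # Tracking bits at one level of the hierarchy vs. the block they annotate.
    return (fanout_bits / 8) / (data_bytes + tag_bytes)

K1, K2 = 16, 16  # 16-core clusters, 16 clusters = 256 cores total
print(f"cluster cache tracks K1 cores:   {level_overhead(K1):.1%}")       # ~3%
print(f"global cache tracks K2 clusters: {level_overhead(K2):.1%}")       # ~3%
print(f"flat vector for {K1 * K2} cores: {level_overhead(K1 * K2):.1%}")  # ~44%: why hierarchy helps
```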

30 Storage Overhead for Scaling. (2) Hierarchy enables scalable storage. [Figure: storage overhead vs. core count, using 16 clusters of 16 cores each.]

31 Outline. Motivation & Coherence Background; Scalability Challenges: 1. Communication: extra bookkeeping messages (longer section), 2. Storage: extra bookkeeping storage, 3. Enforcing Inclusion: extra recall messages (subtle), 4. Latency: indirection on some requests, 5. Energy: dynamic & static overhead; Extension to Non-Inclusive Shared Caches (subtle); Criticisms & Summary

32 3. Enforcing Inclusion (Subtle). Inclusion: block in a private cache ⇒ in shared cache. + Lets the shared cache track private-cache sharers (as assumed). - Replacement in the shared cache ⇒ replacement in the private caches. - Make such replacements impossible? Requires too much shared-cache associativity (e.g., 16 cores w/ 4-way private caches ⇒ 64-way assoc). - So use recall messages instead ⇒ make recall messages necessary & rare.

33 Inclusion Recall Example. A shared-cache miss to new block C needs to replace (victimize) block B in the shared cache; inclusion forces replacement of B in the private caches. [Figure: a write to C misses in the shared cache; the victim B is shared ({0110} S) by Cores 1 and 2, so recall messages invalidate those private copies.]

34 Make All Recalls Necessary. Exact state tracking (covered earlier) + L1/L2 replacement messages (even for clean blocks) = every recall message finds a cached block ⇒ every recall message is necessary & occurs after a cache miss (bounded overhead).

35 Make Necessary Recalls Rare. Recalls are naturally rare when (shared cache size) / (Σ private cache sizes) > 2. (3) Recalls made rare. [Figure: recall rate vs. the ratio of shared to total private cache size, assuming misses to random sets [Hill & Smith 1989]; the Core i7 ratio is marked.]
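A crude estimate of recall frequency. This is my own simplification, not the [Hill & Smith 1989] random-sets model behind the plot: if private-cache contents look like a random subset of the shared cache, a random LLC victim is in some private cache with probability roughly (sum of private sizes) / (shared size).

```python
def approx_recall_rate(sum_private_kb, shared_kb):
    # Probability a random shared-cache victim also lives in some private cache.
    return min(1.0, sum_private_kb / shared_kb)

# Core i7-like sizes (assumed): 4 cores x (32KB L1I + 32KB L1D + 256KB L2) vs. an 8MB LLC.
sum_private_kb = 4 * (32 + 32 + 256)
print(f"{approx_recall_rate(sum_private_kb, 8 * 1024):.0%} of LLC evictions trigger a recall")  # ~16%
```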

36 Outline. Motivation & Coherence Background; Scalability Challenges: 1. Communication: extra bookkeeping messages (longer section), 2. Storage: extra bookkeeping storage, 3. Enforcing Inclusion: extra recall messages, 4. Latency: indirection on some requests, 5. Energy: dynamic & static overhead; Extension to Non-Inclusive Shared Caches; Criticisms & Summary

37 4. Latency Overhead: Often None. 1. None: private hit 2. "None": private miss + "direct" shared cache hit 3. "None": private miss + shared cache miss 4. BUT … [Figure: message flow; same key as before.]

38 4. Latency Overhead: Some. 4. 1.5-2X: private miss + shared cache hit with indirection(s). How bad? [Figure: message flow with an indirection through the shared cache; same key as before.]

39 4. Latency Overhead: Indirection. 4. 1.5-2X: private miss + shared cache hit with indirection(s). Latency ratio = (interconnect + cache + interconnect + cache + interconnect) / (interconnect + cache + interconnect), vs. magically having the data at the shared cache. Acceptable today; relative latency is similar w/ more cores/hierarchy. (4) Latency overhead bounded & scalable.
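The 1.5-2X figure follows from counting traversals in the ratio above. A sketch with placeholder cycle counts (the 20- and 30-cycle values are illustrative assumptions, not measurements):

```python
def indirection_ratio(net_hop, cache_lookup):
    direct = net_hop + cache_lookup + net_hop                             # request, shared-cache hit, reply
    indirect = net_hop + cache_lookup + net_hop + cache_lookup + net_hop  # extra hop to the owning cache
    return indirect / direct

print(f"{indirection_ratio(net_hop=20, cache_lookup=30):.2f}x")  # ~1.71x with these placeholder latencies
```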

40 5. Energy Overhead. Dynamic: small o extra message energy: traffic increase is small/bounded o extra state lookup: small relative to a cache-block lookup. Static: also small o extra state: state increase is small/bounded. Little effect on the energy-intensive cores, cache data arrays, off-chip DRAM, secondary storage, … (5) Energy overhead bounded & scalable.

41 Outline. Motivation & Coherence Background; Scalability Challenges: 1. Communication: extra bookkeeping messages (longer section), 2. Storage: extra bookkeeping storage, 3. Enforcing Inclusion: extra recall messages (subtle), 4. Latency: indirection on some requests, 5. Energy: dynamic & static overhead; Extension to Non-Inclusive Shared Caches (subtle): apply the analysis to the caches used by AMD; Criticisms & Summary

42 Review: Inclusive Shared Cache. Inclusive shared cache: block in a private cache ⇒ in shared cache. Blocks must be cached redundantly (a drawback). [Figure: cores with private caches on the interconnection network; a shared-cache block holds tracking bits (~1 bit per core), state (~2 bits), tag (~64 bits), and block data (~512 bits).]

43 Non-Inclusive Shared Cache. Two structures replace the inclusive shared cache: 1. Non-inclusive shared cache (state ~2 bits, tag ~64 bits, block data ~512 bits): any size or associativity, avoids redundant caching, allows victim caching. 2. Inclusive directory (probe filter) (tracking bits ~1 bit per core, state ~2 bits, tag ~64 bits): dataless, ensures coherence, but duplicates tags (a drawback). [Figure: cores with private caches connected to both structures over the interconnection network.]

44 Non-Inclusive Shared Cache. Non-inclusive shared cache: data block + tag (any configuration). Inclusive directory: tag (again, a cost) + state; the inclusive directory == the coherence state overhead. WITH TWO LEVELS o directory size proportional to the sum of private cache sizes o 64b/(48b+512b) * 2 (for rare recalls) = 22% * ΣL1 size. Coherence overhead is higher than w/ inclusion:
L2 / ΣL1s:  1     2     4     8
Overhead:   11%   7.6%  4.6%  2.5%

45 Non-Inclusive Shared Caches WITH THREE LEVELS. Each cluster has an L2 cache & a cluster directory o the cluster directory points to cores w/ the L1 block (as before) o (1) size = 22% * ΣL1 sizes. The chip has an L3 cache & a global directory o the global directory points to the cluster holding a block, covering o (2) the cluster directories, size 22% * ΣL1s, + o (3) the cluster L2 caches, size 22% * ΣL2s. Hierarchical overhead is higher than w/ inclusion:
L3 / ΣL2s = L2 / ΣL1s:  1     2     4     8
Overhead (1)+(2)+(3):   23%   13%   6.5%  3.1%
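Both tables can be reproduced from the 64-bit directory entry, 48-bit tag + 512-bit data block, and 2x sizing (for rare recalls) quoted on these slides. This sketch reflects my reading of how the (1)+(2)+(3) terms combine; the function names and the ratio r are my own notation.

```python
DIR = 2 * 64 / (48 + 512)   # directory bits per covered cache bit, ~22.9% (the "22%" above)

def two_level(r):
    # Directory covers the L1s; total cache capacity is sum(L1) + L2 = (1 + r) * sum(L1).
    return DIR * 1 / (1 + r)

def three_level(r):
    # Cluster directories cover the L1s; the global directory covers the cluster
    # directories (~L1-sized) and the L2s; capacity is sum(L1) * (1 + r + r^2).
    return DIR * (1 + 1 + r) / (1 + r + r * r)

for r in (1, 2, 4, 8):
    print(f"r={r}: two-level {two_level(r):.1%}, three-level {three_level(r):.1%}")
# -> ~11%, 7.6%, 4.6%, 2.5% and ~23%, 13%, 6.5%, 3.1%, matching the two tables.
```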

46 Outline. Motivation & Coherence Background; Scalability Challenges: 1. Communication: extra bookkeeping messages (longer section), 2. Storage: extra bookkeeping storage, 3. Enforcing Inclusion: extra recall messages (subtle), 4. Latency: indirection on some requests, 5. Energy: dynamic & static overhead; Extension to Non-Inclusive Shared Caches (subtle); Criticisms & Summary

47 Some Criticisms. (1) Where are the workload-driven evaluations? o We focused on a robust analysis of first-order effects. (2) What about non-coherent approaches? o We showed that compatible coherence scales. (3) What about protocol complexity? o We have such protocols today (& ideas for better ones). (4) What about multi-socket systems? o Apply the non-inclusive approaches. (5) What about software scalability? o The hard SW work need not also re-implement coherence.

48 Executive Summary. Today's chips provide shared memory w/ HW coherence as low-level support for OS & application SW. As #cores per chip scales? o Some argue HW coherence will be gone due to growing overheads o We argue it stays, by managing overheads. Develop a scalable on-chip coherence proof-of-concept o inclusive caches first o exact tracking of sharers & replacements (key to analysis) o larger systems need to use hierarchy (clusters) o overheads similar to today's ⇒ Compatibility of on-chip HW coherence is here to stay. Let's spend programmer sanity on parallelism, not lost compatibility!

49 Coherence NOT this Awkward

50 Backup Slides (some old)

51 Outline. Baseline Multicore Chip & Coherence; Scalability Challenges: 1. Communication: extra bookkeeping messages (longer section), 2. Storage: extra bookkeeping storage, 3. Enforcing Inclusion: extra recall messages, 4. Latency: indirection on some requests, 5. Energy: dynamic & static overhead; Extension to Non-Inclusive Shared Caches; Criticisms & Summary

52 Coherence Example SAVE: Block A in no private caches: state Invalid (I). Block B in no private caches: state Invalid (I). [Figure: example chip diagram as before.]

53 1. Communication Overhead WITHOUT SHARING. o 8-byte control messages: Request, Evict, Ack o 72-byte messages for 64-byte data. Dirty blocks o w/o coherence: Request + Data + Data (writeback) o w/ coherence: Request + Data + Data (writeback) + Ack o overhead = 8/(8+72+72) = 5%. Clean blocks o w/o coherence: Request + Data (nothing else) o w/ coherence: Request + Data + (Evict) + Ack o overhead = (8 to 16)/(8+72) = 10-20% ⇒ Overhead independent of #cores.

54 1. Communication Overhead WITH SHARING. Read miss o to memory: Request → Data o to one other core: Request → Forward → Data → (Cleanup) o charge (at most) 1 Invalidation + 1 Ack. Write miss o to one other core: Request → Forward → Data → (Cleanup) o to C other cores: as above + C Invalidations → C Acks o if every invalidation is useful, charge the Read, not the Write miss. (1) Communication overhead bounded & scalable. But depends on tracking exact sharers (next).

55 2. Storage Overhead. Track up to C = #readers (cores) per LLC block. Small #cores: a C-bit vector is acceptable o e.g., 16 bits for 16 cores: 2 bytes / 72 bytes = 3%. Medium-large #cores: use hierarchy! o Cluster: K1 cores with an L2 cluster cache o Chip: K2 clusters with an L3 global cache o Enables K1*K2 cores (picture next).

56 4. Latency Overhead. Added coherence latency (ignoring hierarchy): 1. None: private hit 2. "None": private miss + shared cache miss 3. "None": private miss + "direct" shared cache hit 4. 1.5-2X: private miss + shared cache hit with indirection(s); ratio = (interconnect + cache + interconnect + cache + interconnect) / (interconnect + cache + interconnect). Acceptable today; not significantly changed by scale or hierarchy. (4) Latency overhead bounded & scalable.

57 1. Communication: (d) Sharing, Write. o If Shared at C other cores: Request → {Data, C Invalidations + C Acks} → (Cleanup) o If every invalidation is useful, charge the Read, not the Write miss o Overhead independent of #cores for all cases! [Figure: message flow; same key as before.]

58 Why On-Chip Cache Coherence is Here to Stay. Milo M. K. Martin, Univ. of Pennsylvania; Mark D. Hill, Univ. of Wisconsin; Daniel J. Sorin, Duke Univ. October 2012 @ Wisconsin. Appears in [Communications of the ACM, July 2012]. Studies cache coherence performance WITHOUT trace-driven or execution-driven simulation!

59 A Future for HW Coherence? Academics criticize HW coherence o Directory…coherence…extremely complex & inefficient.... Directory … incurring significant storage and invalidation traffic overhead. – Choi et al. [DeNovo] o A software-managed coherence protocol... avoids.. directories and duplicate tags, & implementing & verifying … less traffic... – Kelm et al. [Cohesion] Industry experiments with avoiding HW coherence o Intel 48-Core IA-32 Message-Passing Processor … SW protocols … to eliminate the communication & HW overhead o IBM Cell processor … the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory

