To Include or Not to Include?
Natalie Enright, Dana Vantrease
Motivation
- CMP technology affects coherence protocols differently than the previously studied MP systems did
  - New shared on-chip resources (e.g., L2)
  - Low latency between on-chip caches
  - Need for scalability in design
- Industry examples
  - IBM Power 4: inclusion
  - Piranha: exclusion
- Our goal: determine the point at which each inclusion protocol (strict inclusion, non-inclusion, and exclusion) is the best choice for CMP performance
SMP vs. CMP Opportunities
[Figure: L1 and L2 cache hierarchies, SMP vs. CMP]
Multilevel Inclusion
- Protocol given to us with the simulator
- L1 has Modified, Shared, and Invalid states
- L2 has Modified, Owned, Shared, and Invalid states
- When an L2 line is replaced, any copies present on the chip must be invalidated (the sharers are given in the directory entry)
- In a single-processor chip, only 2 L1 caches (instruction and data) are connected to a single L2 cache
- Chip multiprocessors introduce 2 more level 1 caches for each additional processor, which could make this forced inclusion harmful
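
The forced invalidation on L2 replacement can be illustrated with a toy model. This is only a sketch under stated assumptions, not the simulator's protocol: the class name `InclusiveL2`, the LRU replacement policy, and the modeling of each L1 as an unbounded Python set are all hypothetical simplifications, and coherence states are ignored entirely.

```python
from dataclasses import dataclass, field


@dataclass
class InclusiveL2:
    """Toy inclusive L2 (hypothetical): each line records which L1s share it."""
    capacity: int
    # address -> set of L1 ids holding a copy; dict order doubles as LRU order
    lines: dict = field(default_factory=dict)

    def access(self, addr, l1_id, l1_caches):
        """Fill addr into L2 and the requesting L1.

        Returns the list of (address, l1_id) copies that had to be
        invalidated because their L2 line was replaced.
        """
        invalidated = []
        if addr in self.lines:
            # Hit: pop and re-insert below so the line becomes most recent.
            sharers = self.lines.pop(addr)
        else:
            sharers = set()
            if len(self.lines) >= self.capacity:
                victim = next(iter(self.lines))  # least recently used line
                victim_sharers = self.lines.pop(victim)
                # Inclusion: every on-chip copy of the victim must go too.
                for sharer in sorted(victim_sharers):
                    l1_caches[sharer].discard(victim)
                    invalidated.append((victim, sharer))
        sharers.add(l1_id)
        self.lines[addr] = sharers
        l1_caches[l1_id].add(addr)
        return invalidated


l1_caches = [set(), set()]          # two L1s, modeled as plain sets
l2 = InclusiveL2(capacity=2)
l2.access(0xA, 0, l1_caches)
l2.access(0xA, 1, l1_caches)
l2.access(0xB, 0, l1_caches)
lost = l2.access(0xC, 1, l1_caches)  # replaces 0xA, which both L1s still hold
```

In this run the last access removes line 0xA from both L1s even though neither L1 chose to evict it, which is the effect the slide flags as potentially harmful when many L1s share one L2.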
Non-Inclusion
- Protocol courtesy of Mike
- L1 now has Owned and Exclusive states
- Complexity of the on-chip directory has increased significantly
  - States added to indicate local level 1 sharers or a local level 1 owner
  - L1 directory state also needs to be visible to external requests from other chips
- Increases effective on-chip cache storage
Directory Exclusion
- No replication of data between a single L1 and the L2
- L2 acts as a large victim cache
- Makes full use of cache space, lowering required off-chip bandwidth
- L2 is the centralized coherency point (tag lookup)
- L1 states: M, E, I, SC, SM
- L2 states: M, E, I
- No ownership: for a data request, simply ask the 1st sharer found in the tag lookup
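
The "L2 as victim cache" behavior can be sketched for the simplest case of one L1 in front of the L2. This is an illustrative model only: the class `ExclusiveHierarchy`, the LRU policy in both levels, and the single-L1 restriction are assumptions, and the coherence states and tag lookup from the slide are not modeled.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy single-L1 exclusive hierarchy (hypothetical): a line is in L1 or L2, never both."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()   # address -> True, oldest entry first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, addr):
        """Return 'l1', 'l2', or 'memory' depending on where addr was found."""
        if addr in self.l1:
            self.l1.move_to_end(addr)
            return "l1"
        if addr in self.l2:
            del self.l2[addr]      # move rather than copy, preserving exclusion
            source = "l2"
        else:
            source = "memory"      # the fill goes to L1 only, bypassing L2
        self._fill_l1(addr)
        return source

    def _fill_l1(self, addr):
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)
            if len(self.l2) >= self.l2_lines:
                self.l2.popitem(last=False)   # this line leaves the chip
            self.l2[victim] = True            # L1 victim lands in L2
        self.l1[addr] = True


h = ExclusiveHierarchy(l1_lines=2, l2_lines=2)
trace = [h.access(a) for a in (1, 2, 3, 1)]
```

Because a line never occupies both levels, the chip can hold up to `l1_lines + l2_lines` distinct lines, which is the extra effective capacity the slide credits with lowering off-chip bandwidth.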
Directory Exclusion
[Figure: three groups of L1, L2, and L1 Tags]
Tag Lookup Cache
- Aids in off-chip coherency and in directing on-chip requests
- Associativity = L1 associativity * # L1s
- # Sets = # sets in a single L1
- # Data entries = # L1s
- Data entry = whether the L1 corresponding to that entry has the data (1/0)
- Scalability?
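
The sizing rules above can be turned into a small calculation that hints at the scalability question. This is a sketch: the function name `tag_lookup_geometry` is hypothetical, the reading of "# Data entries = # L1s" as one presence bit per L1 in every tag entry is an interpretation of the slide, and the example L1 geometry (128 sets, 4-way) is made up for illustration.

```python
def tag_lookup_geometry(l1_sets, l1_assoc, num_l1s):
    """Size the tag lookup cache from the rules on the slide.

    Assumes (interpretation) that every tag entry carries one presence
    bit per L1.
    """
    associativity = l1_assoc * num_l1s
    tag_entries = l1_sets * associativity
    return {
        "sets": l1_sets,
        "associativity": associativity,
        "tag_entries": tag_entries,
        "presence_bits_per_entry": num_l1s,
        "total_presence_bits": tag_entries * num_l1s,
    }


# Hypothetical L1 geometry: 128 sets, 4-way.
# 4 cores with split I/D L1s give 8 L1s; 16 cores give 32.
four_core = tag_lookup_geometry(l1_sets=128, l1_assoc=4, num_l1s=8)
sixteen_core = tag_lookup_geometry(l1_sets=128, l1_assoc=4, num_l1s=32)
```

Under these assumptions the associativity grows linearly with the number of L1s, and the total presence-bit storage grows quadratically, which is one way to read the "Scalability?" bullet.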
Methodology
- Vary the L1 cache size to find the design point at which an inclusive protocol hurts performance
- As the number of cores increases, so does the aggregate L1 cache size
Simulation Configuration
- 4 processors per chip and 1 chip
- 2 MB of L2 cache
  - Small, but we wanted to see the effect of changing the ratio of L1 size to L2 size
- 16 processors per chip as future work
- Only simulated one chip, to isolate the effects of intra-chip coherence from inter-chip coherence
- Future work: see how extending the life of a block on chip through non-inclusion or exclusion affects other chips
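
The ratio being varied can be made concrete with a short calculation. The per-L1 sizes swept below are illustrative assumptions (the slides do not list the sizes that were simulated), as are the helper name `aggregate_l1_fraction` and the default of 2 L1s (instruction and data) per core.

```python
def aggregate_l1_fraction(l1_kb, cores, l2_kb, l1s_per_core=2):
    """Total L1 capacity on the chip, as a fraction of L2 capacity."""
    return (l1_kb * l1s_per_core * cores) / l2_kb


# Configuration from the slide: 4 processors, 2 MB (2048 KB) of L2.
# The per-L1 sizes are hypothetical sweep points.
for l1_kb in (16, 32, 64, 128):
    fraction = aggregate_l1_fraction(l1_kb, cores=4, l2_kb=2048)
    print(f"{l1_kb:>3} KB per L1 -> aggregate L1 is {fraction:.1%} of L2")
```

Under strict inclusion every L1 line is also resident in the L2, so this fraction is an upper bound on the share of L2 capacity spent duplicating L1 contents.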
Results: Inclusion vs. Non-Inclusion
Results (cont.): Inclusion vs. Pseudo-Exclusion
Conclusion/Future Work
- An inclusive protocol is less complex
  - Especially considering inter-chip communication
- Non-inclusion performs consistently better than inclusion
  - Additional complexity is only warranted once the total L1 cache size exceeds 25% of the L2 cache size
- Longer runs and more benchmarks would provide more conclusive evidence
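
To see what the 25% figure means for a given chip, the threshold can be inverted to give the per-L1 size at which it is reached. This is a back-of-the-envelope sketch: the helper `l1_size_at_threshold` is hypothetical, it assumes 2 equally sized L1s per core, and it assumes the 2 MB L2 is kept unchanged when moving to 16 cores, which the slides do not state.

```python
def l1_size_at_threshold(l2_kb, cores, threshold=0.25, l1s_per_core=2):
    """Per-L1 size (KB) at which aggregate L1 capacity equals threshold * L2."""
    return threshold * l2_kb / (cores * l1s_per_core)


four_core_kb = l1_size_at_threshold(l2_kb=2048, cores=4)
sixteen_core_kb = l1_size_at_threshold(l2_kb=2048, cores=16)
print(f"4 cores:  threshold reached at {four_core_kb:g} KB per L1")
print(f"16 cores: threshold reached at {sixteen_core_kb:g} KB per L1")
```

With the L2 size held fixed, quadrupling the core count cuts the per-L1 size at the threshold to a quarter, so under these assumptions larger chips would cross the 25% point with much smaller L1s.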
Future Work
- Ongoing: get a working exclusion protocol in the Ruby tester and Simics
  - Current status: runs 500 memory transactions in the Ruby tester
- Run tests comparable to those run for non-inclusion
- Analyze the benefits of exclusion over inclusion
- Expand to 16 cores and study scalability issues