Cooling the Hot Sets: Improved Space Utilization in Large Caches via Dynamic Set Balancing
Mainak Chaudhuri, IIT Kanpur
mainakc@iitk.ac.in

Balanced $ (IIT, Kanpur)
Talk in one slide
Closed-addressed hashing is used in traditional cache designs, with a fixed collision chain length (known as associativity)
Clustering of physical addresses to a few hot sets is a well-known phenomenon
Non-uniform set utilization leads to a high volume of conflict misses
First proposal of a fully dynamic scheme to re-balance sets by migrating blocks from “hot regions” to “cooler regions”

Sketch
Observations
Design detail
–Destination of migration
–Locating migrated blocks
–Hit/Miss critical path
–Selective migration
–Throttling migration
–Retaining migrated blocks
Scaling to CMPs
Simulation results
Summary

Observation #1

Observations #2 and #3


Design detail
Overview
–The basic idea is to migrate evicted blocks to sets with a smaller fill count
Involves the following sub-problems
–Identify a good receiver set quickly
–Locate migrated blocks efficiently
–Offer dynamic control of the hit/miss critical path
Optimizations worth exploring
–Selective migration (not all blocks are important)
–Bound migrations from a particular set
–Retain migrated blocks (the difficult part)


Destination of migration
Associate a saturating counter C(s) with each set s and a global counter G
–Increment C(s) on a refill into s
–When C(s) reaches a value equal to the associativity, increment G
–When G reaches a value equal to the number of sets, reset G and C(s) for all s
–Size C(s) so that it can count up to k times the associativity (we set k to 4)

Destination of migration
Divide the sets into clusters and associate a saturating counter D(u) with each cluster u
–Increment D(u) whenever C(s) is incremented for some s in u
–Reset D(u) when all C(s) are reset
–Have a comparator tree to compute the minimum among all D(u) whenever an increment takes place (scalable?)
–Have a second comparator tree to compute the minimum among all C(s) within the minimum u found by the first tree; the set t with this minimum is the target of migration provided C(s) > C(t) for source set s
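The counter scheme above can be sketched in software as follows. This is a minimal illustration, not the hardware design: the class name, the `cluster_size` parameter, and the use of plain `min` in place of the two comparator trees are assumptions, and D(u) is left unbounded because the slides do not give its width.

```python
class ReceiverSelector:
    """Per-set counters C(s), per-cluster counters D(u), and a global counter G."""

    def __init__(self, num_sets, assoc, cluster_size, k=4):
        self.assoc = assoc
        self.cluster_size = cluster_size      # hypothetical: sets per cluster
        self.c_max = k * assoc                # C(s) saturates at k * associativity
        self.C = [0] * num_sets
        self.D = [0] * (num_sets // cluster_size)
        self.G = 0

    def on_refill(self, s):
        if self.C[s] < self.c_max:            # saturating increment of C(s)
            self.C[s] += 1
            self.D[s // self.cluster_size] += 1
            if self.C[s] == self.assoc:       # C(s) just reached the associativity
                self.G += 1
        if self.G == len(self.C):             # every set has seen `assoc` refills
            self.C = [0] * len(self.C)
            self.D = [0] * len(self.D)
            self.G = 0

    def pick_target(self, s):
        """Return the receiver set for an eviction from set s, or None."""
        # First "comparator tree": the cluster with the smallest D(u).
        u = min(range(len(self.D)), key=lambda i: self.D[i])
        # Second "comparator tree": the set with the smallest C within that cluster.
        first = u * self.cluster_size
        t = min(range(first, first + self.cluster_size), key=lambda i: self.C[i])
        return t if self.C[s] > self.C[t] else None
```

With four sets in two clusters, three refills into set 0 and one each into sets 1 and 2 make set 3 the receiver for evictions from set 0.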

Locating migrated blocks
The migrated tags are duplicated in a migration tag cache (MTC)
–The MTC is organized as a direct-mapped table
–Each entry has a tag, a target set index, a forward pointer to an MTC entry, a backward pointer to an MTC entry, a head bit, and a tail bit
–Starting at an index of the MTC, one can follow the forward pointers in a linked list until the tail bit is encountered
–One tag list in the MTC corresponds to the migrated tags from a particular parent set in the main cache

Locating migrated blocks
Tag lookup protocol
–With each set s in the main cache, a head pointer H(s) to the MTC is maintained; H(s) points to the index of the MTC where the list of migrated tags belonging to set s begins
–The main cache is looked up first as usual
–On a miss, H(s) is read out and an MTC walk is initiated at index H(s)
–Note that on reset, the MTC is organized as a free list; a new migration from set s allocates an MTC entry, links it at the head of the list starting at H(s), and updates H(s)

Locating migrated blocks
Tag lookup protocol
–On an MTC hit, the block is swapped with the LRU block in the parent set to improve future hit latency (behaves like a folded victim cache)
–False hits must be avoided, because the same set may now contain the same tag multiple times
–Each tag is extended by log2(A) bits, where A is the associativity; the target way of a migrated tag is stored along with the tag

Locating migrated blocks
Replacement of migrated blocks
–A migrated block may get replaced due to primary or secondary replacements
–A primary migrated block replacement is again migrated to a different target set; this case is easy to handle because it requires only MTC entry modification
–But to get to the MTC entry, one needs to maintain a direct MTC entry pointer MEP(t) with each migrated tag t in the main cache

Locating migrated blocks
Replacement of migrated blocks
–A secondary migrated block replacement evicts the block from the cache
–This requires delinking the tag from its list
–Efficient delinking is possible only in doubly-linked lists, and this is why we need a backward pointer with each MTC entry
–Also, this may need updating the H(s) field in the parent set s
–To be able to get to the parent set, each MTC entry needs to store the parent set index
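The MTC bookkeeping described on these slides (free list on reset, insertion at the head, walk from H(s), retargeting on a primary replacement, delinking on a secondary replacement) can be sketched as below. This is a software illustration under assumptions: the class and method names are hypothetical, `None` pointers stand in for the head and tail bits, and the target is modeled as a (set, way) pair.

```python
class MTC:
    """Migration tag cache: table entries threaded into one list per parent set."""

    def __init__(self, size):
        self.tag = [None] * size
        self.target = [None] * size       # (target set, target way), assumed layout
        self.parent = [None] * size       # parent set index PS(m)
        self.fptr = [None] * size         # forward pointer; None marks the tail
        self.bptr = [None] * size         # backward pointer; None marks the head
        self.free = list(range(size))     # on reset the whole table is a free list
        self.H = {}                       # H(s): parent set -> index of list head

    def migrate(self, parent, tag, target):
        """Allocate an entry and link it at the head of the parent's list."""
        if not self.free:
            return None
        m = self.free.pop()
        head = self.H.get(parent)
        self.tag[m], self.target[m], self.parent[m] = tag, target, parent
        self.fptr[m], self.bptr[m] = head, None
        if head is not None:
            self.bptr[head] = m
        self.H[parent] = m
        return m                          # kept as MEP(t) next to the migrated tag

    def lookup(self, parent, tag):
        """Walk the parent's list after a main-cache miss."""
        m = self.H.get(parent)
        while m is not None:
            if self.tag[m] == tag:
                return self.target[m]
            m = self.fptr[m]
        return None

    def retarget(self, m, target):
        """Primary replacement: the block moves again, only the entry changes."""
        self.target[m] = target

    def delink(self, m):
        """Secondary replacement: remove entry m from its doubly-linked list."""
        prev, nxt, s = self.bptr[m], self.fptr[m], self.parent[m]
        if prev is None:                  # m was the head, so H(s) must change
            if nxt is None:
                del self.H[s]
            else:
                self.H[s] = nxt
        else:
            self.fptr[prev] = nxt
        if nxt is not None:
            self.bptr[nxt] = prev
        self.tag[m] = self.target[m] = self.parent[m] = None
        self.fptr[m] = self.bptr[m] = None
        self.free.append(m)
```

The backward pointer and the stored parent index are what let `delink` run in constant time from MEP(t) alone, without walking the list.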

Locating migrated blocks
Summary of structures added till now
–Per set s: one saturating counter C(s), one head pointer H(s) and VALID(H(s))
–Per tag t: MTC entry pointer MEP(t) and VALID(MEP(t)), extra way bits W(t)
–Per MTC entry m: migrated tag MT(m) including the extra way bits, target set index TS(m), parent set index PS(m), forward pointer FPTR(m), backward pointer BPTR(m), head/tail bits HT(m)
–Per set cluster u: saturating counter D(u)
–A global saturating counter G
–Two comparator trees

Hit/Miss critical path
Reducing the MTC walk latency
–Proposal #1: Make the MTC dual-ported so that a list can be walked from both ends (a win-win situation); halves hit as well as miss paths
–Add a tail pointer T(s) to each set (along with H(s)) so that the tail of a list can be accessed directly
–Proposal #2: Maintain the summary of migrated tags from a set s in a small filter F(s) attached to s
–Query F(s) first before walking the MTC; a negative response from F(s) means the tag is definitely not in the MTC; optimizes the miss path only

Hit/Miss critical path
Reducing the MTC walk latency
–We experimented with a simple design of a 60-bit F(s) with great success
–Divide the 60 bits into nine segments: each of the lower eight segments is seven bits wide and the upper segment is four bits wide
–When a tag t is queried, the lower three bits of t identify one of the lower eight segments of F(s)
–Let the contents of the identified segment be f[6:0] and the contents of the upper segment be g[3:0]

Hit/Miss critical path
Reducing the MTC walk latency
–The filter says “yes” if and only if (f[6:0] AND t[9:3]) == t[9:3] and (g[3:0] AND t[13:10]) == t[13:10]
–A newly migrated tag t is hashed into F(s) by ORing t[9:3] into the identified segment and ORing t[13:10] into the upper segment
–F(s) is not updated if a migrated tag is removed (not possible to update)
–On a false positive from F(s), all the migrated tags for the set s will have to be visited anyway; at this time F(s) is cleared and rebuilt
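The 60-bit filter can be modeled directly from the bit fields given above. The sketch below follows the slide's segment layout (eight 7-bit segments selected by t[2:0], plus one 4-bit upper segment); the class and method names are assumptions made for illustration.

```python
class MigrationFilter:
    """Model of F(s): eight 7-bit segments plus one shared 4-bit upper segment."""

    def __init__(self):
        self.seg = [0] * 8    # lower eight segments, 7 bits each
        self.upper = 0        # upper segment, 4 bits

    @staticmethod
    def _fields(t):
        # t[2:0] selects the segment, t[9:3] and t[13:10] are the hashed bits.
        return t & 0x7, (t >> 3) & 0x7F, (t >> 10) & 0xF

    def insert(self, t):
        i, f, g = self._fields(t)
        self.seg[i] |= f
        self.upper |= g

    def may_contain(self, t):
        """False means the tag is definitely not in the MTC list for this set."""
        i, f, g = self._fields(t)
        return (self.seg[i] & f) == f and (self.upper & g) == g

    def rebuild(self, tags):
        """Clear and re-hash the surviving tags, e.g. after a false positive."""
        self.seg = [0] * 8
        self.upper = 0
        for t in tags:
            self.insert(t)
```

Because bits are only ever ORed in, two inserted tags can combine to answer "yes" for a third tag that was never migrated; in this model, inserting 0x08 and 0x10 makes a query for 0x18 succeed until the filter is rebuilt.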

Selective migration
Not all blocks are important
–Unnecessary migrations waste energy and may hurt performance by using up MTC space
–Ideally, we want to migrate the most frequently missing blocks
–Usually, these blocks are associated with the hot sets
–The idea, therefore, should be to identify the hot sets and migrate only the blocks evicted from the hot sets

Selective migration
Identifying hot sets
–Associate a saturating counter R(s) with each set s to count the number of external refills to the set
–Whenever some R(s) reaches its maximum value, all R(s) are reset (leader-decides rule)
–Maintain the total refill count across all sets in a register TRC and the maximum refill count across all sets in another register MaxRC; let the average refill count be ARC = TRC >> log2(|S|)
–Definition: A set s is hot if and only if R(s) > ARC + ((MaxRC - ARC) >> delta)
–Delta is dynamically incremented
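A small model of the hot-set test follows. It reads the threshold as ARC + ((MaxRC - ARC) >> delta), which is an interpretation of the slide's formula; the class name, the `r_max` parameter, and the choice to clear TRC and MaxRC together with R(s) are also assumptions, since the slide only says that all R(s) are reset.

```python
class HotSetDetector:
    """Tracks external refills per set and classifies sets as hot."""

    def __init__(self, num_sets, r_max, delta=1):
        self.log_sets = num_sets.bit_length() - 1   # assumes a power-of-two set count
        self.r_max = r_max                          # saturation value of R(s)
        self.delta = delta
        self.R = [0] * num_sets
        self.trc = 0                                # total refill count (TRC)
        self.max_rc = 0                             # maximum refill count (MaxRC)

    def on_external_refill(self, s):
        self.R[s] += 1
        self.trc += 1
        self.max_rc = max(self.max_rc, self.R[s])
        if self.R[s] == self.r_max:
            # Leader-decides rule: one saturated counter resets all of them.
            # Clearing TRC and MaxRC as well is an assumption of this sketch.
            self.R = [0] * len(self.R)
            self.trc = 0
            self.max_rc = 0

    def is_hot(self, s):
        arc = self.trc >> self.log_sets             # average refill count (ARC)
        return self.R[s] > arc + ((self.max_rc - arc) >> self.delta)
```

Under this reading, a larger delta pulls the threshold down from MaxRC toward ARC, so incrementing delta admits more sets as hot; with delta equal to 0 the threshold is MaxRC itself and no set qualifies.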

Throttling migration
If a set becomes very hot, it may start migrating a large number of blocks
–While this may appear desirable, the monotonically increasing expected MTC walk cost soon outweighs the benefits
–We impose a limit on the length of the migrated tag list belonging to a particular set
–However, a static limit may not work; so the limit is dynamically increased by monitoring the volume of rejected migrations due to too short a length limit
–Each set s now maintains a list length register LLR(s)

Retaining migrated blocks
The number of misses between two misses to the same block is often very high
–Points to the danger of losing the migrated blocks before they get reused
–We need to design a replacement policy that gives lower replacement priority to the migrated blocks because these are the blocks we really want to retain
–Classify the sets into high-hit and low-hit sets
–For high-hit sets, continue with the baseline policy (LRU in our case)
–For low-hit sets, consider the non-migrated blocks before the migrated ones

Retaining migrated blocks
Associate a hit counter HC(s) with each set s
–Reset HC(s) when the refill counter is reset
–Count a hit on a migrated block as a hit in the parent set
–Classify a set as low-hit if and only if HC(s) ≤ h·R(s) and R(s) > r for some constants h > 1 and r < associativity
–We fix h to 4 and r to one-eighth of the associativity
More research is needed on better retention schemes
–This is going to play a big role
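The low-hit test and the resulting victim choice can be sketched as two small functions. The condition is taken as it appears in the transcript (HC(s) ≤ h·R(s) and R(s) > r); the function names and the representation of a set as an LRU-ordered list of ways are assumptions for illustration.

```python
def is_low_hit(hc, r_count, assoc, h=4):
    """Low-hit test with h = 4 and r = one-eighth of the associativity."""
    r = assoc // 8
    return hc <= h * r_count and r_count > r


def pick_victim(lru_order, migrated_ways, low_hit):
    """Choose the way to evict.

    lru_order lists the ways from least to most recently used;
    migrated_ways is the set of ways holding migrated blocks.
    """
    if low_hit:
        # Low-hit set: consider non-migrated blocks before migrated ones.
        for way in lru_order:
            if way not in migrated_ways:
                return way
    # High-hit set, or every block is migrated: fall back to plain LRU.
    return lru_order[0]
```

For example, with ways ordered `[3, 1, 0, 2]` from LRU to MRU and way 3 holding a migrated block, a low-hit set evicts way 1 while a high-hit set evicts way 3.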


Scaling to CMPs
Assume that the CMP caches will be banked
–All the policies can be applied to each bank or a subset of close-by banks independently
–No cross-bank (or cross-switch) migration
–Use cross-bank migration only for proximity enhancement (more detail in second talk)
–The entire design scales seamlessly to larger caches
In our simulations, we assume that a pair of banks share a switch on a ring and cross-bank migration is allowed only within a pair


Simulation results
Single-threaded and multi-threaded applications
Single-threaded runs are done on 2 MB 16-way L2 caches
Multi-threaded runs are done on 8 cores sharing a 4 MB 16-way L2 cache
–Each core has private L1 caches
The MTC is sized to hold half as many tags as the main cache
Space overhead of about 56 KB per 1 MB bank
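As a quick check on the overhead figure, taking 1 MB as 1024 KB (an assumption about the units used on the slide), the relative storage cost per bank works out to:

```latex
\frac{56\ \text{KB}}{1\ \text{MB}} = \frac{56}{1024} \approx 0.0547 \approx 5.5\%
```

This agrees with the "slightly over 5% extra storage" quoted in the summary.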

Simulation results


Summary
Huge potential for improving performance and saving energy with slightly over 5% extra storage
Logic simplifications need to be explored further

THANK YOU!
