1 ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors Mohammad Hammoud, Sangyeun Cho, and Rami Melhem Presenter: Socrates Demetriades Dept. of Computer Science University of Pittsburgh

2 Tiled CMP Architectures Tiled CMP architectures have recently been advocated as a scalable design. They replicate identical building blocks (tiles) connected over a switched network-on-chip (NoC). A tile typically incorporates a private L1 cache and an L2 cache bank. A traditional practice for CMP caches is one that logically shares the physically distributed L2 banks  Shared Scheme.

3 The home tile of a cache block B is designated by the HS bits of B’s physical address.  Tile T1 requests B (L2 miss).  B is fetched from main memory and mapped at its home tile (together with its directory info).  Pros: high capacity utilization; simple coherence enforcement (only for L1).
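
A minimal sketch of how a home tile can be derived from a block's physical address. The 64-byte line size matches the evaluation setup later in the deck; placing the 4 HS bits directly above the block offset is an illustrative assumption, not something the slide specifies.

```python
# Minimal sketch: home-tile selection from a block's physical address.
# Assumptions: 64-byte cache lines (6 offset bits) and the 4 HS bits sitting
# directly above the block offset on a 16-tile CMP.

BLOCK_OFFSET_BITS = 6   # log2(64-byte cache line)
HS_BITS = 4             # log2(16 tiles)

def home_tile(paddr: int) -> int:
    """Return the home tile ID encoded in the HS bits of a physical address."""
    return (paddr >> BLOCK_OFFSET_BITS) & ((1 << HS_BITS) - 1)

# Example: an address whose HS bits are 0b0100 maps to home tile T4 (slide 11).
assert home_tile(0b0100 << BLOCK_OFFSET_BITS) == 4
```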

4 Shared Scheme: Latency Problem (Cons) Access latencies to L2 banks differ depending on the distances between requester cores and target banks. This design is referred to as a Non-Uniform Cache Architecture  NUCA.

5 NUCA Solution: Block Migration Move accessed blocks closer to the requesting cores  Block Migration. HS of B = 1111 (T15). T0 requests block B: total hops = 14. B is migrated from T15 to T0. T0 requests B again: local hit, total hops = 0.

6 NUCA Solution: Block Migration HS of B = 0110 (T6). T3 requests B (hops = 6). T0 requests B (hops = 8). T8 requests B (hops = 8). Total hops = 22. Assume B is migrated to T3. T3 requests B (hops = 0). T0 requests B (hops = 11). T8 requests B (hops = 13). Total hops = 24. Though T3 saved 6 hops, in total there is a loss of 2 hops.
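
The hop totals on slides 5 and 6 can be reproduced with a short script. The accounting assumed here (one hop per tile visited in each direction, and post-migration requests routed through the home tile as a 3-way transfer) is an inference that matches the quoted numbers; the function names are illustrative.

```python
# Hop accounting for a 4x4 mesh (tiles 0..15, row-major), under the assumption
# that a "hop" counts tiles visited per direction and that a migrated block is
# reached via its home tile (3-way transfer).

MESH_DIM = 4

def tiles_visited(src: int, dst: int) -> int:
    """Tiles traversed one way between two tiles (Manhattan distance + 1)."""
    sx, sy = divmod(src, MESH_DIM)
    dx, dy = divmod(dst, MESH_DIM)
    return abs(sx - dx) + abs(sy - dy) + 1

def round_trip_hops(requester: int, holder: int, home: int) -> int:
    """Round-trip hops for one request; 0 if the block is already local."""
    if requester == holder:
        return 0
    if holder == home:                       # block still at its home tile
        return 2 * tiles_visited(requester, home)
    return (tiles_visited(requester, home)   # requester -> home
            + tiles_visited(home, holder)    # home -> current holder
            + tiles_visited(holder, requester))  # holder -> requester

home = 6                                                          # HS of B = 0110
before = sum(round_trip_hops(r, home, home) for r in (3, 0, 8))   # 6 + 8 + 8 = 22
after = sum(round_trip_hops(r, 3, home) for r in (3, 0, 8))       # 0 + 11 + 13 = 24
print(before, after)   # migrating to T3 costs 2 extra hops overall
```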

7 Our work  Collect information about tiles (sharers) that have accessed a block B.  Depend on the past to predict the future: a core that accessed a block in the past is likely to access it again in the future.  Migrate B to a tile (host) that minimizes the overall number of NoC hops needed.
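
A rough sketch of the per-block bookkeeping this implies: a sharer bit vector plus an access counter that later drives the migration decision. The class and field names (BlockState, sharers, access_count) are hypothetical, not the paper's.

```python
# Per-block sharer tracking: which tiles accessed block B, and how often,
# since the last migration decision. Names are illustrative.

class BlockState:
    def __init__(self, num_tiles: int = 16):
        self.num_tiles = num_tiles
        self.sharers = 0       # bit i set => tile i has accessed this block
        self.access_count = 0  # accesses since the last migration decision

    def record_access(self, tile_id: int) -> None:
        self.sharers |= 1 << tile_id
        self.access_count += 1

    def sharer_list(self) -> list:
        return [t for t in range(self.num_tiles) if self.sharers & (1 << t)]
```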

8 Talk roadmap  Predicting the optimal host location.  Locating migratory blocks: the cache-the-cache-tag policy.  Replacement policy upon migration: the swap-with-the-lru policy.  Quantitative evaluation.  Conclusion and future work.

9 Predicting Optimal Host Location  Keeping a cache block B at its home tile might not be optimal.  The best host location for B is not known until runtime.  Adaptive Controlled Migration (ACM): Keep a pattern recording the accessibility of B. At runtime (after a specific migration frequency level is reached for B), compute the best host to migrate B to by finding the tile that minimizes the total latency cost among the sharers of B.
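
A sketch of that host-selection step: every tile of the 4x4 mesh is evaluated as a candidate host, and the one with the smallest summed latency cost over the recorded sharers wins. The cost model used here (round-trip count of tiles visited, zero for a local sharer) is an assumption chosen to reproduce the example on the next slide.

```python
# ACM host selection, sketched: exhaustively score every tile as a candidate
# host for block B and pick the one minimizing the total latency cost over
# B's sharers. The cost convention below is assumed, not taken from the paper.

MESH_DIM = 4

def latency_cost(sharer: int, host: int) -> int:
    """Round-trip cost between a sharer and a candidate host (0 if local)."""
    if sharer == host:
        return 0
    sx, sy = divmod(sharer, MESH_DIM)
    hx, hy = divmod(host, MESH_DIM)
    return 2 * (abs(sx - hx) + abs(sy - hy) + 1)

def best_host(sharers) -> int:
    """Tile minimizing the summed latency cost to all sharers."""
    return min(range(MESH_DIM * MESH_DIM),
               key=lambda host: sum(latency_cost(s, host) for s in sharers))
```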

10 ACM: A Working Example Tiles 0 and 6 are sharers:  Case 1: Tile 3 is the host, total latency cost = 14.  Case 2: Tile 15 is the host, total latency cost = 22.  Case 3: Tile 2 is the host, total latency cost = 10.  Case 4: Tile 0 is the host, total latency cost = 8.  Select T0.
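
Reusing the host-selection sketch from the previous slide, the four quoted costs fall out directly:

```python
sharers = (0, 6)
for host in (3, 15, 2, 0):
    print(host, sum(latency_cost(s, host) for s in sharers))  # 14, 22, 10, 8
print(best_host(sharers))  # 0 (T6 also costs 8; the tie breaks toward the lower tile ID)
```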

11 Locating Migratory Blocks  After a cache block B is migrated, the HS bits of B’s physical address can’t be used anymore to locate B on a subsequent access.  Assume B has been migrated from its home tile T4 (HS of B = 0100) to a new host tile T7.  T3 requests B: false L2 miss at T4.  A tag can be kept at T4 to point to T7 (B now resides at T7).  Scenario: 3-way cache-to-cache transfer (T3, T4, and T7).  Deficiencies: useless migration; fails to exploit distance locality.

12 Locating Migratory Blocks: cache-the-cache-tag Policy  Idea: cache the tag of block B at the requester’s tile (within a data structure referred to as the MT table).  T3 requests B (HS of B = 0100, home tile T4). It looks up its MT table before reaching B’s home tile. MT miss: 3-way communication (first access).  T3 caches B’s tag in its MT table.  T3 requests B again. It looks up its MT table before reaching B’s home tile. MT hit: direct fetch (second and subsequent accesses).
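
A sketch of this lookup flow, with each MT table modeled as a plain dictionary mapping a block address to the tile currently holding the block; the names (access_block, requester_mt, home_mt) are illustrative rather than the paper's.

```python
# cache-the-cache-tag lookup, sketched. On an MT hit the requester fetches
# directly from the host (2-way); on an MT miss it goes to the home tile,
# which forwards the request (3-way), and the requester then caches the tag.

def access_block(requester_mt: dict, home_mt: dict,
                 block_addr: int, home_tile: int) -> int:
    """Return the tile that services the request, updating the requester's MT table."""
    if block_addr in requester_mt:               # MT hit: direct fetch
        return requester_mt[block_addr]
    host = home_mt.get(block_addr, home_tile)    # MT miss: home tile's local entry
    if host != home_tile:                        # block has migrated away
        requester_mt[block_addr] = host          # cache the tag for later accesses
    return host
```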

13 Locating Migratory Blocks: cache-the-cache-tag Policy  The MT table of a tile T can now hold 2 types of tags: A tag for each block B whose home tile is T and that has been migrated to another tile (local entry). Tags that keep track of the locations of migratory blocks recently accessed by T but whose home tile is not T (remote entries).  The MT table replacement policy prefers an invalid tag, otherwise the LRU remote entry.  The remote and local MT tags of B are kept consistent by extending the local entry of B at B’s home tile with a bit mask that indicates which tiles have cached corresponding remote entries.
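
A sketch of an MT entry and the replacement preference just described (an invalid way first, otherwise the LRU remote entry). The field names and the all-local fallback are assumptions; the cachers bit mask plays the role of the consistency mask kept on a local entry.

```python
from dataclasses import dataclass

@dataclass
class MTEntry:
    valid: bool = False
    local: bool = False  # True: this tile is the block's home tile (local entry)
    tag: int = 0         # block-address tag
    host: int = 0        # tile currently holding the migrated block
    cachers: int = 0     # local entries only: bit mask of tiles caching remote copies
    lru: int = 0         # larger value = more recently used

def pick_victim(mt_set: list) -> int:
    """Replacement preference: an invalid way first, else the LRU remote entry."""
    for way, entry in enumerate(mt_set):
        if not entry.valid:
            return way
    remote = [way for way, entry in enumerate(mt_set) if not entry.local]
    candidates = remote if remote else range(len(mt_set))  # all-local fallback (assumed)
    return min(candidates, key=lambda way: mt_set[way].lru)
```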

14 Replacement Policy Upon Migration: swap-with-the-lru Policy  After the ACM algorithm predicts the optimal host, H, for a block B, a decision must be made about which block to replace at H upon migrating B.  Idea: swap B with the LRU block at H (the swap-with-the-lru policy).  The LRU block at H could be: A migratory one. A non-migratory one.  The swap-with-the-lru policy is very effective, especially for workloads whose working sets are large relative to the L2 banks (it bears similarity to victim replication but is more robust).
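
A sketch of the swap step, assuming each cache line carries an LRU timestamp; the helper names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Line:
    tag: int
    lru: int  # larger value = more recently used

def swap_with_lru(src_set: list, src_way: int, dst_set: list) -> int:
    """Swap the migrating block with the LRU block of the target set at host H."""
    lru_way = min(range(len(dst_set)), key=lambda w: dst_set[w].lru)
    dst_set[lru_way], src_set[src_way] = src_set[src_way], dst_set[lru_way]
    return lru_way
```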

15 Quantitative Evaluation: Methodology and Benchmarks  We simulate a 16-way tiled CMP.  Simulator: Simics 3.0.29 (Solaris OS).  Cache line size: 64 bytes.  L1 I/D size/ways/latency: 16KB / 2 ways / 1 cycle.  L2 size/ways/latency: 512KB per bank / 16 ways / 6 cycles.  Latency per hop: 5 cycles.  Memory latency: 300 cycles.  Migration frequency level: 10.  Benchmarks (name: input):
SPECjbb: Java HotSpot™ server VM 1.5, 4 warehouses
Lu: 1024*1024 (16 threads)
Ocean: 514*514 (16 threads)
Radix: 2M integers (16 threads)
Barnes: 16K particles (16 threads)
Parser, Art, Equake, Mcf, Ammp, Vortex: reference inputs
MIX1: Vortex, Ammp, Mcf, and Equake
MIX2: Art, Equake, Parser, Mcf

16 Quantitative Evaluation: Single-threaded and Multiprogramming Results  VR successfully offsets its L2 miss-rate increase with fast replica hits for all the single-threaded benchmarks.  VR fails to offset the L2 miss increase of MIX1 and MIX2 (poor capacity utilization), whereas ACM maintains efficient capacity utilization.  For single-threaded workloads: ACM generates on average 20.5% and 3.7% better AAL than S and VR, respectively.  For multiprogramming workloads: ACM generates on average 2.8% and 31.3% better AAL than S and VR.

17 Quantitative Evaluation: Multithreaded Results  An increase in the degree of sharing suggests that the capacity occupied by replicas could increase significantly, leading to a decrease in the effective L2 cache size.  ACM exhibits AALs that are on average 27% and 37.1% better than S and VR, respectively.

18 Quantitative Evaluation: Avg. Memory Access Cycles Per 1K Instr.  ACM performs on average 18.6% and 2.6% better than S for the single-threaded and multiprogramming workloads, respectively.  ACM performs on average 20.7% better than S for multithreaded workloads.  VR performs on average 15.1% better than S, and 38.4% worse than S, for the single-threaded and multiprogramming workloads, respectively.  VR performs on average 19.6% worse than S for multithreaded workloads.

19 Quantitative Evaluation: ACM Scalability  As the number of tiles on a CMP platform increases, the NUCA problem is exacerbated.  ACM is independent of the underlying platform and always selects hosts that minimize AAL.  More exposure to the NUCA problem effectively translates into a larger benefit from ACM.  For the simulated benchmarks: with a 16-way CMP, ACM improves AAL by 11.6% over S.  With a 32-way CMP, ACM improves AAL by 56.6% on average over S.

20 Quantitative Evaluation: Sensitivity to MT Table Sizes  With MT tables half (50%) and a quarter (25%) the size of a regular L2 cache bank, ACM's AAL increases by 5.9% and 11.3%, respectively, over the base configuration (100%, i.e. identical to the L2 cache bank size).

21 Quantitative Evaluation: Sensitivity to L2 Cache Sizes  ACM maintains an AAL improvement of 39.7% over S across the evaluated L2 cache sizes.  VR fails to demonstrate such stability.

22 Conclusion  This work proposes ACM, a strategy to manage CMP NUCA caches.  ACM offers: Better average L2 access latency than traditional NUCA (20.4% on average), while maintaining the L2 miss rate of NUCA.  ACM proposes a robust location strategy (cache-the-cache-tag) that can work for any NUCA migration scheme.  ACM reveals the usefulness of the migration technique in the CMP context.

23 Future work  Improve the ACM prediction mechanism. Currently: cores are treated equally (we consider only the case with 0/1 weights, assigning 1 to a core that accessed block B and 0 to one that didn't). Improvement: reflect the non-uniformity in cores' access weights (a trade-off between access weights and storage overhead), as sketched below.  Propose an adaptive mechanism for selecting migration frequency levels.
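
For illustration only, a weighted variant of the host-selection sketch from slide 9 (it reuses latency_cost and MESH_DIM from there): each sharer's cost is scaled by its access count instead of a 0/1 weight. This is a sketch of the stated direction, not the paper's mechanism.

```python
def best_host_weighted(access_counts: dict) -> int:
    """access_counts maps tile ID -> number of accesses to block B."""
    return min(range(MESH_DIM * MESH_DIM),
               key=lambda host: sum(count * latency_cost(tile, host)
                                    for tile, count in access_counts.items()))
```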

24 ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors M. Hammoud, S. Cho, and R. Melhem Special thanks to Socrates Demetriades Dept. of Computer Science University of Pittsburgh Thank you!

