Less is More: Leveraging Belady’s Algorithm with Demand-based Learning


1 Less is More: Leveraging Belady's Algorithm with Demand-based Learning
Jiajun Wang, Lu Zhang, Reena Panda, Lizy John. The University of Texas at Austin

2 Introduction
Why is an efficient LLC replacement policy important?
- The LLC is shared by multiple cores
- LLC accesses have low temporal locality and long data reuse distances
- LLC capacity is small compared with big-data application working-set sizes
Goal: ideally, every LLC cache block gets reused before eviction (maximize the total reuse count). This requires:
- Bypassing streaming accesses
- Selecting dead blocks as victims

3 Review of Belady's Optimal Algorithm
Given knowledge of the future, Belady's algorithm yields optimal cache behavior: at the time of a miss, replace the block with the largest forward distance in the string of future references.
[Animation: access stream A, B, C, D, then A, C, B on a 2-way fully associative cache]
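The rule above can be sketched in a few lines. This is a minimal, illustrative simulator of Belady's replacement for a fully associative cache (with mandatory installation on every miss); the function name and structure are my own, not from the slides.

```python
# Minimal sketch of Belady's optimal (MIN) replacement for a fully
# associative cache. On a miss with a full cache, evict the block whose
# next reference lies farthest in the future.

def belady_misses(trace, ways):
    """Simulate Belady's optimal replacement; return the miss count."""
    cache, misses = set(), 0
    for i, addr in enumerate(trace):
        if addr in cache:
            continue                      # hit: nothing to do
        misses += 1
        if len(cache) == ways:
            future = trace[i + 1:]
            # Forward distance; blocks never referenced again are
            # treated as infinitely far away.
            def next_use(b):
                return future.index(b) if b in future else float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(addr)
    return misses

# The slide's animation trace on a 2-way cache:
print(belady_misses(list("ABCDACB"), ways=2))  # → 6
```

Even the optimal policy misses six times here: four compulsory misses plus two capacity misses, since only one of the reused blocks can be kept alongside D.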

4 Motivation
However... the same miss count does not imply the same cycle penalty:
- Miss latency varies (e.g., missed data served from the LLC vs. from DRAM)
- Access types differ in priority (e.g., writebacks and prefetches are not on the critical path)
[Animation: access types LD, ST, WB over addresses A, B, C, D]

5 Lime Proposal
Basic idea: a cache replacement policy that leverages the key idea of Belady's algorithm but focuses on demand accesses (i.e., loads and stores), which directly impact system performance, and skips the training process for writeback and prefetch accesses.
Builds on prior work:
- The caching behavior of past load instructions can guide future caching decisions [1][2]
- Belady's algorithm can be leveraged on past accesses [3]
[1] W. A. Wong and J.-L. Baer. Modified LRU policies for improving second-level cache behavior. In HPCA 2000.
[2] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer. SHiP: Signature-based hit predictor for high performance caching. In MICRO 2011.
[3] A. Jain and C. Lin. Back to the future: Leveraging Belady's algorithm for improved cache replacement. In ISCA 2016.

6 Background: Hawkeye
OPTgen reconstructs Belady's decisions for past accesses: it maintains an occupancy vector over a window of recent accesses, and on each reuse of an address it checks whether every time slot in the reuse interval has spare capacity. If so, OPT would have cached the line, and the occupancy of the interval is incremented.
[Animation: occupancy vector over unique addresses A, B, C, D, marking cached vs. non-cached intervals over time]
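A simplified, unbounded-history sketch of the OPTgen idea from Hawkeye [3]. Note that OPTgen models an OPT that may bypass lines that would miss anyway, so misses do not occupy cache space; the function and variable names are illustrative assumptions, not Hawkeye's implementation.

```python
# OPTgen sketch: replay past accesses and decide, per reuse, whether
# Belady's OPT would have cached the line, using an occupancy vector.

def optgen_decisions(trace, capacity):
    """Return, per access, whether OPT would have hit (True) or missed."""
    occupancy = [0] * len(trace)   # cache occupancy at each time slot
    last_use = {}                  # address -> time of previous access
    decisions = []
    for t, addr in enumerate(trace):
        hit = False
        if addr in last_use:
            start = last_use[addr]
            # OPT keeps this line across its reuse interval only if
            # every slot in the interval has spare capacity.
            if all(occupancy[i] < capacity for i in range(start, t)):
                for i in range(start, t):
                    occupancy[i] += 1
                hit = True
        decisions.append(hit)
        last_use[addr] = t
    return decisions

# Trace A B C D A C B with a 2-entry cache: the reuses of A and C fit,
# but B's interval crosses a fully occupied slot.
print(optgen_decisions(list("ABCDACB"), capacity=2))
```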

7 Lime Structure: Overall

8 Lime Structure: Belady's Trainer
[Figure: the Belady Trainer is a queue of entries ordered from oldest access to latest access; each entry records the PC, the address tag, a Cached? bit, and an occupancy vector]

9 Handle Writeback and Prefetch
- Load / Store: goes through the Belady Trainer; decision is Cache or Bypass; cached fills use SRRIP replacement in the data cache
- Writeback: skips training; cached; replaces way[0]
- Prefetch: skips training; cached; replaces way[0]
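The per-access-type dispatch above can be sketched as follows. The enum values, the `should_cache` hook, and the string return codes are illustrative assumptions, not the paper's interface.

```python
# Sketch of LIME's LLC fill path: demand accesses consult the trained
# predictor and may bypass; writebacks and prefetches skip training and
# are installed into way[0].

from enum import Enum, auto

class AccessType(Enum):
    LOAD = auto()
    STORE = auto()
    WRITEBACK = auto()
    PREFETCH = auto()

def install_decision(access_type, pc, should_cache):
    """Return 'bypass', 'srrip', or 'way0' for an LLC fill."""
    if access_type in (AccessType.LOAD, AccessType.STORE):
        # Demand access: trained; either install (SRRIP victim) or bypass.
        return "srrip" if should_cache(pc) else "bypass"
    # Writeback / prefetch: untrained, always installed into way[0].
    return "way0"

print(install_decision(AccessType.WRITEBACK, 0x400, lambda pc: True))  # → way0
print(install_decision(AccessType.LOAD, 0x400, lambda pc: False))      # → bypass
```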

10 Lime Structure: PC Classifier
Input: PC. Output: Cached (should the data be installed into the cache?)
- If the PC is not found in the PC Classifier: Cached = true
- Else if the PC is in the RANDOM bin: Cached = latest cache decision
- Else if the PC is in the KEEP bin: Cached = true
- Else if the PC is in the BYPASS bin: Cached = false
The PC Classifier consists of a KEEP bloom filter, a BYPASS bloom filter, and a RANDOM lookup table.
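The decision rules above can be sketched directly. For illustration the two bloom filters are modeled as plain sets and the RANDOM bin as a dict of the latest decision per PC; the class and field names are assumptions, not the paper's implementation.

```python
# Sketch of the PC Classifier's install decision.

class PCClassifier:
    def __init__(self):
        self.keep = set()      # PCs whose fills should be cached
        self.bypass = set()    # PCs whose fills should be bypassed
        self.random = {}       # PC -> latest cache decision (bool)

    def should_cache(self, pc):
        """Return True if data fetched by this PC should be installed."""
        if pc in self.random:
            return self.random[pc]   # undecided PC: follow latest decision
        if pc in self.keep:
            return True
        if pc in self.bypass:
            return False
        return True                  # unknown PC: default to caching

clf = PCClassifier()
clf.bypass.add(0x400A10)
print(clf.should_cache(0x400A10))  # → False (classified as bypass)
print(clf.should_cache(0x400B20))  # → True (unknown PC)
```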

11 Configuration
Storage cost
Workloads: SimPoint slices of 200M instructions
- Single core, 2MB LLC
- Multicore, multi-programmed, 8MB LLC
- Compared against LRU

12 Results: Single Core, w/o prefetch

13 Results: Single Core, w/ prefetch

14 Results: Multicore, w/o prefetch

15 Results: Multicore, w/ prefetch

16 Conclusion
LIME respects the observation that load/store misses are more likely to cause pipeline stalls than writeback and prefetch misses. LIME achieves significant IPC improvements, in some cases even while total misses increase.

17 Thank you!

