Improving Cache Management Policies Using Dynamic Reuse Distances
Nam Duong (1), Dali Zhao (1), Taesu Kim (1), Rosario Cammarota (1), Mateo Valero (2), Alexander V. Veidenbaum (1)
(1) University of California, Irvine
(2) Universitat Politecnica de Catalunya and Barcelona Supercomputing Center

Cache Management
- Replacement (single-core): LRU, NRU, EELRU, DIP, RRIP, …
- Bypass: SDP, …
- Partitioning (shared-cache): UCP, PIPP, TA-DIP, TA-DRRIP, Vantage, …
- Prefetch
- PDP
- Cache management has been a hot research topic

Overview
- Proposed new cache replacement and partitioning algorithms with a better balance between reuse and pollution
- Introduced a new concept, Protecting Distance (PD), which is shown to achieve such a balance
- Developed single- and multi-core hit rate models as a function of PD, cache configuration and program behavior
  - Models are used to dynamically compute the best PD
- Showed that PD-based cache management policies improve performance for both single- and multi-core systems

Outline
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

Definitions
- The (line) reuse distance: the number of accesses to the same cache set between two accesses to the same line
  - This metric is directly related to hit rate
- The reuse distance distribution (RDD): a distribution of observed reuse distances
  - A program signature for a given cache configuration
- Figure: RDDs of representative benchmarks (X-axis: the RD, < 256)
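The two definitions above can be illustrated with a minimal sketch that builds an RDD from an address trace. The function name, the address-to-set mapping, and the counting convention (two back-to-back accesses to the same line give a distance of 1) are assumptions made for illustration; the slides do not specify them.

```python
from collections import Counter, defaultdict

def reuse_distance_distribution(trace, num_sets, line_size=64, max_rd=256):
    """Return a Counter mapping reuse distance -> number of observations.

    Assumed convention: the distance is the difference between the per-set
    access counts at two consecutive accesses to the same line.
    """
    set_clock = defaultdict(int)   # accesses seen so far, per cache set
    last_seen = {}                 # line address -> set clock at its last access
    rdd = Counter()
    for addr in trace:
        line = addr // line_size
        s = line % num_sets        # hypothetical set-index function
        set_clock[s] += 1
        if line in last_seen:
            rd = set_clock[s] - last_seen[line]
            if rd < max_rd:        # the slides plot RD < 256 only
                rdd[rd] += 1
        last_seen[line] = set_clock[s]
    return rdd

# Tiny single-set example with one-byte "lines" so addresses equal line ids.
example_rdd = reuse_distance_distribution([0, 1, 2, 0, 1, 0], num_sets=1, line_size=1)
```

In this example, lines 0 and 1 are each reused once at distance 3, and line 0 is reused again at distance 2.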

Future Behavior Prediction
- Cache management policies use past reference behavior to predict future accesses
  - Prediction accuracy is critical
- Prediction in some of the prior policies
  - LRU: predicts that lines are reused after K unique accesses, where K < W (W: cache associativity)
  - Early eviction LRU (EELRU): counts evictions in two non-LRU regions (early/late) to predict a line to evict
  - RRIP: predicts whether a line will be reused in the near, long, or distant future

Balancing Reuse and Cache Pollution
- Key to good performance (high hit rate)
  - Cache lines must be reused as much as possible before eviction
  - AND must be evicted soon after the last reuse to give space to new lines
- The former can be achieved by using the reuse distance and actively preventing eviction
  - "Protecting" a line from eviction
- The latter can be achieved by evicting a line when it is not reused within this distance
- There is an optimal reuse distance balancing the two
  - It is called the Protecting Distance (PD)

Example: 436.cactusADM
- A majority of lines are reused at 64 or fewer accesses
- There are multiple peaks at different reuse distances
- Reuse is maximized if lines are kept in the cache for 64 accesses
  - Lines may not be reused if evicted before that
  - Lines kept beyond that are likely to pollute the cache
- Assume that no lines are kept longer than a given RD

The Protecting Distance (PD)
- A distance at which a majority of lines are covered
  - A single value for all sets
  - Predicted based on the current RDD
- Questions to answer/solve
  - Why does using the PD achieve the balance?
  - How to dynamically find the PD for an application and a cache configuration?
  - How to build the PD-based management policies?

Outline
1. The concept of Protecting Distance
2. Single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

The Single-Core PDP
- A cache tag contains a line's remaining PD (RPD)
  - A line can be evicted when its RPD = 0
  - The RPD of an inserted or promoted line is set to the predicted PD
  - RPDs of other lines in a set are decremented
- Example: a 4-way cache, the predicted PD is 7
  - A line is promoted on a hit
  - Figure: a set with RPDs before and after the hit access (legend: reused line; inserted (unused) line)
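The RPD update on a hit can be sketched as below. The function name and the list-of-integers representation of a set are hypothetical, and the saturation of the decrement at 0 is an assumption; the slide only states that the promoted line is set to the PD and the other lines are decremented.

```python
def update_rpds_on_hit(rpds, hit_way, pd):
    """Return the new RPDs of one set after a hit on `hit_way`.

    The promoted line gets the predicted PD; every other line is
    decremented, saturating at 0 (assumed, not stated on the slide).
    """
    return [pd if way == hit_way else max(rpd - 1, 0)
            for way, rpd in enumerate(rpds)]

# 4-way set, predicted PD = 7, hit on way 1 (RPD values chosen for illustration).
after_hit = update_rpds_on_hit([1, 4, 6, 3], hit_way=1, pd=7)
```

After this access way 0 has reached RPD = 0, so it is no longer protected and becomes eligible for eviction.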

The Single-Core PDP (Cont.)
- Selecting a victim on a miss
  - A line with an RPD = 0 can be replaced
- Two cases when all RPDs > 0 (no unprotected lines)
  - Caches without bypass (inclusive):
    - Unused lines are less likely to be reused than reused lines
    - Replace the unused line with the highest RPD first
    - No unused line: replace the line with the highest RPD
  - Caches with bypass (non-inclusive): bypass the new line
- Figure: example sets with RPDs before and after a miss for each case (legend: reused line; inserted (unused) line)
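The victim-selection rules above can be sketched as follows. The `Line` record, the `BYPASS` sentinel, and the `reused` flag are hypothetical names introduced for illustration, and invalid (empty) ways are not modeled.

```python
from dataclasses import dataclass

@dataclass
class Line:
    rpd: int       # remaining protecting distance
    reused: bool   # True once the line has hit at least once since insertion

BYPASS = -1        # hypothetical sentinel: do not allocate the incoming line

def select_victim(lines, allow_bypass):
    """Pick a way to replace on a miss, or BYPASS."""
    # An unprotected line (RPD == 0) can always be replaced.
    for way, line in enumerate(lines):
        if line.rpd == 0:
            return way
    # All lines are protected.
    if allow_bypass:
        return BYPASS              # non-inclusive cache: bypass the new line
    # Inclusive cache: prefer unused lines, then fall back to all lines,
    # and take the highest RPD within the chosen group.
    unused = [way for way, line in enumerate(lines) if not line.reused]
    candidates = unused if unused else list(range(len(lines)))
    return max(candidates, key=lambda way: lines[way].rpd)

example_set = [Line(1, True), Line(4, True), Line(6, True), Line(3, False)]
victim_no_bypass = select_victim(example_set, allow_bypass=False)
victim_bypass = select_victim(example_set, allow_bypass=True)
```

Here the only unused line (way 3) is chosen when bypass is unavailable, even though way 2 has a higher RPD, which matches the rule that unused lines go first.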

Evaluation of the Static PDP
- Static PDP: use the best static PD for each benchmark
  - PD < 256
  - SPDP-NB: static PDP with replacement only
  - SPDP-B: static PDP with replacement and bypass
- Performance: in general, DRRIP < SPDP-NB < SPDP-B
- 436.cactusADM: a 10% additional miss reduction
- The two static PDP policies have similar performance
- 483.xalancbmk: 3 different execution windows have different behavior for SPDP-B

436.cactusADM: Explaining the Performance Difference
- How do the evicted lines occupy the cache?
- DRRIP:
  - Early evicted lines: 75% of accesses, but occupy only 4% of the cache
  - Late evicted lines: 2% of accesses, but occupy 8% of the cache (pollution)
- SPDP-NB:
  - Early and late evicted lines: 42% of accesses, but occupy only 4% of the cache
- SPDP-B:
  - Late evicted lines: 1% of accesses, occupy 3% of the cache (yielding cache space to useful lines)
- PDP has less pollution caused by long-RD lines in the cache than RRIP

Case Study: 483.xalancbmk
- The best PD is different in different windows
  - And for different programs
- Need a dynamic policy that finds the best PD
  - Need a model to drive the search
- There is a close relationship between the hit rate, the PD and the RDD

A Hit Rate Model for Non-Inclusive Cache
- The model estimates the hit rate E as a function of d_p and the RDD
  - {N_i}, N_t: the RDD
  - d_p: the protecting distance
  - d_e: experimentally set to W (W: cache associativity)
- Used to find the PD maximizing the hit rate
- Figure: RDD, E and hit rate
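The model's equation is not present in this transcript. The sketch below assumes one form that is consistent with the symbols listed above: accesses with RD <= d_p count as hits, a line that hits at distance i occupies its way for i accesses, and every other line occupies its way for d_p + d_e accesses. E is then hits per unit of occupancy. This form, and the function names, are assumptions for illustration and should be checked against the paper.

```python
def e_value(rdd, n_total, dp, de):
    """Assumed model: E(dp) = hits / (occupancy of hit lines + occupancy of the rest).

    rdd     : dict mapping reuse distance i -> N_i
    n_total : N_t, total number of sampled accesses
    dp, de  : protecting distance and the extra term (set to W on the slide)
    """
    hits = sum(n for rd, n in rdd.items() if rd <= dp)
    occupancy_hits = sum(rd * n for rd, n in rdd.items() if rd <= dp)
    occupancy_rest = (n_total - hits) * (dp + de)
    denom = occupancy_hits + occupancy_rest
    return hits / denom if denom else 0.0

def best_pd(rdd, n_total, de, max_pd=255):
    """Search all candidate PDs (PD < 256 on the slides) for the largest E."""
    return max(range(1, max_pd + 1),
               key=lambda dp: e_value(rdd, n_total, dp, de))

# Hypothetical RDD with a strong peak at RD 4 and a weaker one at RD 64.
toy_rdd = {4: 60, 64: 30}
toy_pd = best_pd(toy_rdd, n_total=100, de=16)
```

With these made-up counts the search settles on the first peak: protecting lines long enough to capture the peak at 64 adds hits, but under this model the extra occupancy costs more than it gains.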

PDP Cache Organization
- RD Sampler tracks accesses to several cache sets
  - In the L2 miss/WB stream; can reduce the sampling rate
  - Measures the reuse distance of a new access
- RD Counter Array collects the number of accesses at RD = i, and N_t
  - To reduce overhead, each counter covers a range of RDs
- PD Compute Logic: finds the PD that maximizes E
  - The computed PD is used in the next interval (0.5M L3 accesses)
- Reasonable hardware overhead
  - 2 or 3 bits per tag to store the RPD
- Figure: block diagram with the LLC, RD Sampler, RD Counter Array and PD Compute Logic (labels: access address, higher level, main memory, RD, RDD, PD)
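A counter array in which each counter covers a range of RDs might look like the sketch below. The class name, the bucket width `step`, and the choice to report each bucket at its upper-bound RD are assumptions; the slide only says that counters cover ranges and that the total N_t is also collected.

```python
class RDCounterArray:
    """Coarse-grained RDD: one counter per `step` consecutive reuse distances."""

    def __init__(self, max_rd=256, step=4):
        self.step = step                        # hypothetical bucket width
        self.counters = [0] * (max_rd // step)
        self.total = 0                          # N_t: all sampled accesses

    def record(self, rd=None):
        """Record one sampled access; rd is None if no reuse was observed."""
        self.total += 1
        if rd is not None and 1 <= rd <= len(self.counters) * self.step:
            self.counters[(rd - 1) // self.step] += 1

    def rdd(self):
        """Approximate RDD, attributing each bucket to its upper-bound RD."""
        return {(i + 1) * self.step: c
                for i, c in enumerate(self.counters) if c}

    def reset(self):
        """Clear the counters at the start of a new interval."""
        self.counters = [0] * len(self.counters)
        self.total = 0

array = RDCounterArray(max_rd=16, step=4)
for sampled_rd in (1, 4, 5, None, 17):
    array.record(sampled_rd)
```

Wider buckets cut the number of counters at the cost of resolution in the PD search, since candidate PDs can then only be distinguished at bucket boundaries.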

PDP vs. Existing Policies

Management policy | Supported policy (*): Replacement | Supported policy (*): Bypass | Balance: Reuse | Balance: Pollution | Distance measurement | Model
LRU               | Yes | No  | No  | Yes | Stack-based  | No
EELRU [1]         | Yes | No  | No  | Yes | Stack-based  | Probabilistic
DIP [2]           | Yes | No  | Yes | No  | N/A          | No
RRIP [3]          | Yes | No  | Yes | No  | N/A          | No
SDP [4]           | No  | Yes | Yes | No  | N/A          | No
PDP               | Yes | Yes | Yes | Yes | Access-based | Hit rate

(*) Originally proposed

- EELRU has the concept of a late eviction point, which shares some similarities with the protecting distance
  - However, lines are not always guaranteed to be protected

[1] Y. Smaragdakis, S. Kaplan, and P. Wilson. EELRU: simple and effective adaptive page replacement. In SIGMETRICS '99.
[2] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In ISCA '07.
[3] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA '10.
[4] S. M. Khan, Y. Tian, and D. A. Jimenez. Sampling dead block prediction for last-level caches. In MICRO '10.

PD-Based Shared Cache Partitioning
- Each thread has its own PD (thread-aware)
  - Counter array replicated per thread
  - Sampler and compute logic shared
- A thread's PD determines its cache partition
  - Its lines occupy the cache longer if its PD is large
  - The cache is implicitly partitioned according to the needs of each thread, using thread PDs
- The problem is to find a set of thread PDs that together maximize the hit rate

Shared-Cache Hit Rate Model
- Extending the single-core approach
  - Compute a vector of thread PDs (T = number of threads)
  - Exhaustive search for this vector is not practical
- A heuristic search algorithm finds a combination of the threads' RDD peaks that maximizes the hit rate
  - The single-core model generates the top 3 peaks per thread
  - The complexity is O(T^2)
- See the paper for more detail

Evaluation Methodology
- CMP$im simulator, LLC replacement
- Target cache: LLC

Cache         | Params
DCache        | 32KB, 8-way, 64B, 2 cycles
ICache        | 32KB, 4-way, 64B, 2 cycles
L2Cache       | 256KB, 8-way, 64B, 10 cycles
L3Cache (LLC) | 2MB, 16-way, 64B, 30 cycles
Memory        | 200 cycles
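Since reuse distances are counted per cache set, it can help to know how many sets each level has. The following small calculation derives that from the parameters in the table, assuming binary units (1KB = 1024 bytes) and that "64B" is the line size; the helper name is hypothetical.

```python
KB, MB = 1024, 1024 * 1024

def num_sets(size_bytes, ways, line_bytes=64):
    """Number of sets = capacity / (associativity * line size)."""
    return size_bytes // (ways * line_bytes)

# (capacity, associativity) taken from the table above.
caches = {
    "DCache": (32 * KB, 8),
    "ICache": (32 * KB, 4),
    "L2Cache": (256 * KB, 8),
    "L3Cache (LLC)": (2 * MB, 16),
}
sets_per_cache = {name: num_sets(size, ways)
                  for name, (size, ways) in caches.items()}
```

Under these assumptions the LLC has 2048 sets of 16 ways each, so the W used for d_e in the hit rate model would be 16.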

Evaluation Methodology (Cont.)
- Benchmarks: SPEC CPU 2006 benchmarks
  - Excluded those which did not stress the LLC
- Single-core:
  - Compared to EELRU, SDP, DIP, DRRIP
- Multi-core
  - 4- and 16-core configurations, 80 workloads each
  - The workloads are generated by randomly combining benchmarks
  - Compared to UCP, PIPP, TA-DRRIP
- Our policy: PDP-x, where x is the number of bits per cache line

Single-Core PDP
- PDP-x, where x is the number of bits per cache line
- Each benchmark is executed for 1B instructions
- PDP performs best with 3 bits per line, but is still better than prior work with 2 bits

Adaptation to Program Phases
- 5 benchmarks which demonstrate significant phase changes
- Each benchmark is run for 5B instructions
- Figure: change of PD (X-axis: 1M LLC accesses)

Adaptation to Program Phases (Cont.)
- Figure: IPC improvement over DIP

PD-Based Cache Partitioning for 16 Cores
- Figure: results normalized to TA-DRRIP

Hardware Overhead

Policy | Per-line bits | Overhead (%)
DIP    | 4             | 0.8%
RRIP   | 2             | 0.4%
SDP    | 4             | 1.4%
PDP-2  | 2             | 0.6%
PDP-3  | 3             | 0.8%
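As a rough sanity check on the table, the per-line bits can be compared against the 64B (512-bit) data portion of a line. Using the data array alone as the baseline is an assumption; the slides do not say how the percentages were computed.

```python
LINE_BITS = 64 * 8   # 64-byte line, data bits only (assumed baseline)

def per_line_overhead_pct(bits_per_line):
    """Storage overhead of the per-line state relative to the line's data bits."""
    return 100.0 * bits_per_line / LINE_BITS

# Per-line bits taken from the table above.
per_line_only = {name: round(per_line_overhead_pct(bits), 1)
                 for name, bits in
                 {"DIP": 4, "RRIP": 2, "SDP": 4, "PDP-2": 2, "PDP-3": 3}.items()}
```

Under this assumption the DIP and RRIP rows are reproduced (0.8% and 0.4%), while the SDP and PDP rows in the table are higher than their per-line bits alone would give. One plausible reading, not confirmed by the slides, is that those figures also include structures outside the tag array, such as the sampler and counter array.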

Other Results
- Exploration of PDP cache parameters
- Cache bypass fraction
- Prefetch-aware PDP
- PD-based cache management policy for 4-core

Conclusions
- Proposed the concept of Protecting Distance (PD)
  - Showed that it can be used to better balance reuse and cache pollution
- Developed a hit rate model as a function of the PD, program behavior, and cache configuration
- Proposed PD-based management policies for both single- and multi-core systems
- PD-based policies outperform existing policies

Thank You!

Backup Slides
- RDD, E and hit rate of all benchmarks

RDDs, Modeled and Real Hit Rates of SPEC CPU 2006 Benchmarks

RDDs, Modeled and Real Hit Rates of SPEC CPU 2006 Benchmarks (Cont.)

RDDs, Modeled and Real Hit Rates of SPEC CPU 2006 Benchmarks (Cont.)
