Slide 1: Towards Practical Page Coloring Based Multi-core Cache Management
Xiao Zhang, Sandhya Dwarkadas, Kai Shen

Slide 2: The Multi-Core Challenge
Multi-core chips:
– Dominant in the market
– The last-level cache is commonly shared by sibling cores, but the sharing is not well controlled
Challenge: performance isolation
– Poor performance due to conflicts
– Unpredictable performance
– Denial-of-service attacks
(source: http://www.intel.com)

Slide 3: Possible Software Approach: Page Coloring
Partition the cache at coarse (page) granularity.
Page coloring: advocated by many previous works
– [Bershad '94, Bugnion '96, Cho '06, Tam '07, Lin '08, Soares '08]
[Figure: memory pages of Thread A and Thread B mapped to disjoint regions of the cache across ways 1..n]
Number of colors = CacheSize / (PageSize × CacheAssociativity)
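A minimal sketch of the color computation, assuming a 4 MB, 16-way set-associative cache with 4 KB pages (these parameters are illustrative, not taken from the slides):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters: 4 MB / (4 KB * 16 ways) = 64 colors. */
#define CACHE_SIZE      (4u * 1024 * 1024)
#define PAGE_SIZE       4096u
#define ASSOCIATIVITY   16u
#define NUM_COLORS      (CACHE_SIZE / (PAGE_SIZE * ASSOCIATIVITY))

/* A page's color is its physical frame number modulo the number of
 * colors: frames with the same color map to the same slice of cache
 * sets, so giving a thread only certain colors confines it to the
 * corresponding slice of the shared cache. */
static unsigned page_color(uint64_t phys_addr)
{
    uint64_t pfn = phys_addr / PAGE_SIZE;   /* physical frame number */
    return (unsigned)(pfn % NUM_COLORS);
}

int main(void)
{
    printf("colors = %u, color of frame at 0x123456000 = %u\n",
           NUM_COLORS, page_color(0x123456000ULL));
    return 0;
}
```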

Slide 4: Challenges for Page Coloring
Expensive page re-coloring
– Re-coloring is needed when the optimization goal or the co-runner changes
– Without extra hardware support, re-coloring means copying memory
– At roughly 3 microseconds per page copy and more than 10K pages to copy (about 30 ms total), this cost may be paid every time quantum
Artificial memory pressure
– Restricting an application's cache share also restricts its memory share

Slide 5: Hotness-Based Page Coloring
Basic idea
– Restrict page coloring to a small group of hot pages
Challenge
– How to efficiently identify hot pages
Outline
– Efficient hot page identification
– Cache partition policy
– Hot page coloring

Slide 6: Method to Track Page Hotness
Hardware access bits + sequential page table scan
– Access bits are generally available on x86 and set automatically by hardware
– One bit per Page Table Entry (PTE)
Conventional wisdom: scanning the whole page table is expensive
– Not entirely true; per-entry scan latency is largely hidden by hardware prefetching
– A sequential scan still spends a large portion of its time on non-accessed pages, which we can improve
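A minimal sketch of the sequential access-bit scan, assuming a flat array of leaf PTEs for the profiled address space (a real kernel would walk the multi-level page table; the PTE layout here is simplified):

```c
#include <stdint.h>
#include <stddef.h>

/* On x86 the Accessed bit is bit 5 of a page table entry. */
#define PTE_ACCESSED  (1ull << 5)

/* Sequentially scan all PTEs, count pages accessed since the last scan,
 * and clear the access bits so the next sampling window starts fresh. */
size_t scan_access_bits(uint64_t *ptes, size_t n_ptes)
{
    size_t accessed = 0;
    for (size_t i = 0; i < n_ptes; i++) {
        if (ptes[i] & PTE_ACCESSED) {
            accessed++;
            ptes[i] &= ~PTE_ACCESSED;   /* reset for the next window */
        }
    }
    return accessed;
}
```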

Slide 7: Accelerating the Sequential Scan
Programs exhibit spatial locality even at page granularity
– Page non-access correlation metric: Prob(next X neighbors are not accessed | current page is not accessed)
[Plot: this probability versus the number of contiguous non-accessed pages, for the SPECcpu2k benchmark mesa]
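A small sketch of how the non-access correlation metric on this slide could be computed from one sampling window's access bits (the bitmap representation is an assumption for illustration):

```c
#include <stddef.h>

/* Among all positions where the current page is not accessed, what
 * fraction also have their next `x` neighbors not accessed?
 * bits[i] is 1 if page i was accessed in the sampling window. */
double non_access_correlation(const unsigned char *bits, size_t n, size_t x)
{
    size_t cond = 0, joint = 0;
    for (size_t i = 0; i + x < n; i++) {
        if (bits[i])
            continue;                 /* condition: current page not accessed */
        cond++;
        size_t j;
        for (j = 1; j <= x; j++)
            if (bits[i + j])
                break;
        if (j > x)
            joint++;                  /* all x neighbors also not accessed */
    }
    return cond ? (double)joint / (double)cond : 0.0;
}
```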

Slide 8: Locality-Based Jumping
Start in sequential mode
– Switch to jumping mode once a non-accessed entry is seen
If an entry we jump to is
– not accessed: double the next jump range
– accessed: roll back, reset the jump range to 1, and switch back to sequential mode
Jump targets are randomized to avoid overlooking pathological access patterns
[Figure: a run of pages with their access bits, showing the scan jumping over contiguous non-accessed pages and rolling back when it lands on an accessed one]
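A sketch of the jumping scan, under the same flat-PTE-array assumption as before; skipped entries are treated as non-accessed, and the randomization mentioned on the slide is omitted for clarity:

```c
#include <stdint.h>
#include <stddef.h>

#define PTE_ACCESSED  (1ull << 5)   /* x86-style Accessed bit */

/* Scan sequentially until a non-accessed entry is seen, then jump with an
 * exponentially growing stride. Landing on an accessed entry rolls back to
 * re-examine the skipped entries and resumes sequentially. Returns the
 * number of entries actually examined. */
size_t jumping_scan(uint64_t *ptes, size_t n, size_t *accessed_out)
{
    size_t examined = 0, accessed = 0;
    size_t i = 0;
    size_t jump = 0;                  /* 0 = sequential mode */

    while (i < n) {
        examined++;
        if (ptes[i] & PTE_ACCESSED) {
            if (jump > 1) {
                /* Mis-jump: don't consume this entry yet; roll back and
                 * re-examine the skipped entries sequentially first. */
                i -= (jump - 1);
                jump = 0;
                continue;
            }
            accessed++;
            ptes[i] &= ~PTE_ACCESSED; /* clear for the next window */
            jump = 0;                 /* stay in sequential mode */
            i++;
        } else {
            /* Non-accessed: start jumping, or double the current range. */
            jump = jump ? jump * 2 : 1;
            i += jump;
        }
    }
    *accessed_out = accessed;
    return examined;
}
```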

Slide 9: Sampling of Access Bits
Recycle spare bits in the PTE as a hotness counter
– The counter is aged to reflect both recency and frequency
– Could be extended to support LFU page replacement
Decouple the sampling frequency from the sampling window
– Sampling frequency N (access bits are cleared every N time units)
– Sampling time window T
[Timeline: access bits are cleared at 0, N, 2N, 3N, 4N, ... and checked at T, N+T, 2N+T, 3N+T, 4N+T, ...]
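A sketch of one way the aged hotness counter could be kept in spare PTE bits; the counter width, bit positions, and aging rule below are assumptions for illustration, not the slides' exact scheme:

```c
#include <stdint.h>

#define PTE_ACCESSED  (1ull << 5)
#define HOT_SHIFT     52                 /* software-available PTE bits on x86-64 */
#define HOT_BITS      4
#define HOT_MASK      (((1ull << HOT_BITS) - 1) << HOT_SHIFT)

static unsigned get_hotness(uint64_t pte)
{
    return (unsigned)((pte & HOT_MASK) >> HOT_SHIFT);
}

/* At the end of each sampling window, age the counter (decay by half) and
 * add the freshly observed access bit, so the value reflects both the
 * frequency and the recency of accesses. */
static uint64_t update_hotness(uint64_t pte)
{
    unsigned h = get_hotness(pte) >> 1;            /* aging */
    if (pte & PTE_ACCESSED)
        h += (1u << (HOT_BITS - 1));               /* recent access weighs most */
    if (h > (1u << HOT_BITS) - 1)
        h = (1u << HOT_BITS) - 1;                  /* saturate */
    pte = (pte & ~HOT_MASK) | ((uint64_t)h << HOT_SHIFT);
    return pte & ~PTE_ACCESSED;                    /* clear bit for next window */
}
```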

Slide 10: Hot Page Identification Efficiency
Entries skipped using locality-based jumping: more than 60% on average
Runtime overhead
– Tested 12 SPECcpu2k benchmarks on an Intel 3.0 GHz Core 2 Duo processor
– On average 2%/7% overhead at 100/10 millisecond sampling frequency
– Saves 20%/58% over a purely sequential scan

Slide 11: Hot Page Identification Accuracy
No major accuracy loss due to jumping, as measured by two metrics (Jeffrey divergence and rank error rate)
The identified hot page sets remain fairly accurate

Slide 12: Roadmap
– Efficient hot page identification: locality-based jumping
– Cache partition policy: MRC-based
– Hot page coloring

Slide 13: Cache Partition Policy
Miss-Ratio-Curve (MRC) based performance model
– MRC profiled offline
– A single app's execution time ≈ Misses × Memory_Latency + Hits × Cache_Latency
Cache partition goal: optimize overall system performance
– System performance metric: geometric mean of all apps' normalized performance, where the normalization baseline is each app's performance when it monopolizes the whole cache
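A minimal sketch of this performance model; the latency values and access counts below are illustrative placeholders, not figures from the slides:

```c
#include <stdio.h>

/* Illustrative latencies (cycles); real values depend on the machine. */
#define MEMORY_LATENCY  200.0
#define CACHE_LATENCY    14.0

/* Estimated time spent on memory references:
 * time ≈ Misses * Memory_Latency + Hits * Cache_Latency,
 * where miss_ratio comes from the app's MRC at a given cache allocation. */
static double est_time(double accesses, double miss_ratio)
{
    double misses = accesses * miss_ratio;
    double hits   = accesses - misses;
    return misses * MEMORY_LATENCY + hits * CACHE_LATENCY;
}

/* Normalized performance of an app: its estimated time when it monopolizes
 * the whole cache divided by its estimated time at the proposed allocation
 * (1.0 means "as good as having the whole cache"). */
static double norm_perf(double accesses, double miss_at_alloc, double miss_at_full)
{
    return est_time(accesses, miss_at_full) / est_time(accesses, miss_at_alloc);
}

int main(void)
{
    /* Hypothetical numbers purely for demonstration. */
    printf("normalized perf = %.3f\n", norm_perf(1e9, 0.05, 0.02));
    return 0;
}
```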

Slide 14: MRC-Driven Cache Partition Policy
[Figure: Thread A's and Thread B's miss ratio curves plotted against cache allocation, with the two allocations summing to the 4 MB cache size; the optimal partition point is the split that maximizes the geometric mean of the two apps' normalized performance]
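A sketch of how the optimal partition point might be found, assuming each thread's MRC has been profiled offline as a miss ratio per number of allocated colors (arrays of NUM_COLORS+1 entries; the constants are the same illustrative assumptions as in the earlier sketches):

```c
#include <math.h>

#define NUM_COLORS      64      /* e.g. a 4 MB, 16-way cache with 4 KB pages */
#define MEMORY_LATENCY  200.0
#define CACHE_LATENCY    14.0

/* Average time per memory reference at a given miss ratio. */
static double est_time(double miss_ratio)
{
    return miss_ratio * MEMORY_LATENCY + (1.0 - miss_ratio) * CACHE_LATENCY;
}

/* miss_a[c] / miss_b[c] give each thread's miss ratio when granted c colors.
 * Try every split of the colors between A and B and keep the one that
 * maximizes the geometric mean of the two normalized performances (each
 * normalized to the thread's time with the whole cache). */
int best_partition(const double *miss_a, const double *miss_b)
{
    double full_a = est_time(miss_a[NUM_COLORS]);
    double full_b = est_time(miss_b[NUM_COLORS]);
    int best_c = 1;
    double best_gm = 0.0;

    for (int c = 1; c < NUM_COLORS; c++) {
        double perf_a = full_a / est_time(miss_a[c]);
        double perf_b = full_b / est_time(miss_b[NUM_COLORS - c]);
        double gm = sqrt(perf_a * perf_b);
        if (gm > best_gm) {
            best_gm = gm;
            best_c = c;
        }
    }
    return best_c;   /* colors assigned to A; B gets the rest */
}
```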

Slide 15: Hot Page Coloring
Budget control of page re-coloring overhead
– A percentage of the time slice, e.g. 5%
Recolor from the hottest page down until the budget is reached
– Maintain a set of hotness bins during sampling: bin[i][j] = number of pages of color i with normalized hotness in range [j, j+1)
– Given a budget of K pages, the K-th hottest page's hotness value is estimated in constant time by searching the hotness bins
– Ensure hot pages are uniformly distributed among the assigned colors
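A sketch of the hotness bins and of estimating the budget's hotness threshold; the bin dimensions reuse the illustrative color count and 4-bit counter from the earlier sketches:

```c
#include <stddef.h>

#define NUM_COLORS      64    /* assumed color count */
#define HOTNESS_LEVELS  16    /* normalized hotness values 0..15 */

/* bin[i][j] = number of pages of color i whose hotness falls in level j;
 * filled in during the access-bit sampling pass. */
static unsigned bin[NUM_COLORS][HOTNESS_LEVELS];

/* Given a re-coloring budget of k pages, estimate the hotness threshold
 * such that roughly the k hottest pages lie at or above it. Because the
 * number of bins is fixed, this takes constant time regardless of how
 * many pages the application maps. */
static int hotness_threshold(size_t k)
{
    size_t seen = 0;
    for (int level = HOTNESS_LEVELS - 1; level >= 0; level--) {
        for (int color = 0; color < NUM_COLORS; color++)
            seen += bin[color][level];
        if (seen >= k)
            return level;   /* pages at or above this level get re-colored */
    }
    return 0;               /* budget exceeds the mapped page count */
}
```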

Slide 16: Re-coloring Procedure
[Figure: example with a budget of 3 pages when the cache share decreases; each color (red, blue, green, gray) holds its pages sorted by hotness counter value, and the hottest pages are re-colored first until the budget is exhausted]
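A rough sketch of such a re-coloring step when a color is taken away; the page record, the move_page stub, and the round-robin placement are hypothetical placeholders consistent with the "uniform distribution among colors" goal on the previous slide:

```c
#include <stddef.h>

/* Hypothetical per-page record; in a real kernel this information would
 * live in (or be derived from) the page tables and struct page. */
struct page_info {
    unsigned long pfn;      /* current physical frame */
    unsigned hotness;       /* aged hotness counter value */
    int color;
};

/* Stub: a real implementation would allocate a free frame of the requested
 * color, copy the page contents, and update the mapping. */
static int move_page(struct page_info *p, int new_color)
{
    p->color = new_color;
    return 0;
}

/* When a color is removed from the thread's allocation, re-color its
 * hottest pages first, spreading them round-robin over the colors the
 * thread keeps (n_kept > 0), and stop once the budget is spent.
 * `pages` is assumed sorted by descending hotness. */
static void recolor_on_shrink(struct page_info *pages, size_t n_pages,
                              const int *kept_colors, int n_kept,
                              size_t budget)
{
    int rr = 0;
    for (size_t i = 0; i < n_pages && budget > 0; i++, budget--) {
        move_page(&pages[i], kept_colors[rr]);
        rr = (rr + 1) % n_kept;   /* distribute hot pages uniformly */
    }
}
```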

Slide 17: Performance Comparisons
Four SPECcpu2k benchmarks run on two sibling cores (Intel Core 2 Duo) that share a 4 MB L2 cache: {art, equake} vs. {mcf, twolf}.

Slide 18: Relieving Artificial Memory Pressure
[Figure: Thread A's footprint (pages A1–A5) exceeds the memory pages available in its assigned colors]
An app's footprint may be larger than the memory entitled to it by its colors
The app may then "steal" other apps' colors, i.e. "pollute" their cache share
Hotness-aware pollution: preferentially place cold pages in other apps' memory colors, in round-robin fashion so as not to impose new pressure on any single color
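A sketch of hotness-aware pollution under the same assumptions as the earlier sketches; the coldness threshold, color lists, and function name are illustrative, not the authors' interface:

```c
/* When a thread's footprint exceeds the memory behind its entitled colors,
 * spill cold pages into other threads' colors, rotating the victim colors
 * round-robin so no single color absorbs all of the extra pressure. */
static int choose_color(unsigned hotness, unsigned cold_threshold,
                        const int *own_colors, int n_own,
                        const int *victim_colors, int n_victims,
                        int *own_rr, int *victim_rr)
{
    if (hotness > cold_threshold || n_victims == 0) {
        int c = own_colors[*own_rr];           /* keep hot pages in own share */
        *own_rr = (*own_rr + 1) % n_own;
        return c;
    }
    int c = victim_colors[*victim_rr];         /* spill cold pages elsewhere */
    *victim_rr = (*victim_rr + 1) % n_victims;
    return c;
}
```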

Slide 19: Relieving Artificial Memory Pressure (Results)
On a dual-core chip, the L2 cache is initially partitioned evenly between the polluting and victim benchmarks. Because of memory pressure, the polluting benchmark moves 1/3 (~62 MB) of its footprint into the victim's colors.
[Charts: results for non-space-sensitive apps and for space-sensitive apps]

Slide 20: Summary
Contributions:
– Efficient hot page identification that can potentially be used by multiple applications
– Hotness-based page coloring to mitigate two drawbacks of page coloring: memory pressure and re-coloring cost
Caveat: a large time quantum is still required to amortize the overhead
Ongoing work:
– Exploring other possible approaches, e.g. execution-throttling-based cache management [USENIX '09]

