1 Towards Practical Page Coloring Based Multi-core Cache Management Xiao Zhang Sandhya Dwarkadas Kai Shen.

Presentation transcript:

1 Towards Practical Page Coloring Based Multi-core Cache Management Xiao Zhang Sandhya Dwarkadas Kai Shen

2 The Multi-Core Challenge Multi-core chip –Dominant in the market –Last-level cache is commonly shared by sibling cores, but the sharing is not well controlled Challenge: performance isolation –Poor performance due to conflicts –Unpredictable performance –Denial-of-service attacks

3 Possible Software Approach: Page Coloring Partition the cache at coarse granularity Page coloring: advocated by many previous works –[Bershad’94, Bugnion’96, Cho ‘06, Tam ‘07, Lin ‘08, Soares ‘08] (Figure: memory pages of threads A and B are mapped, by color, to disjoint portions of a set-associative cache.) # of colors = CacheSize / (PageSize × CacheAssociativity)
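The color-count formula above can be sketched directly; the 4 MB / 16-way / 4 KB numbers below are illustrative (a 4 MB L2 like the one used later in the talk), not values prescribed by the slide.

```python
def num_colors(cache_size, page_size, associativity):
    """Number of page colors: pages of the same color compete for
    the same cache sets, so colors partition the cache."""
    return cache_size // (page_size * associativity)

# A 4 MB, 16-way set-associative cache with 4 KB pages gives 64 colors.
print(num_colors(4 * 1024 * 1024, 4 * 1024, 16))  # 64
```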

4 Challenges for Page Coloring Expensive page re-coloring –Re-coloring is needed when the optimization goal or the co-runners change –Without extra support, re-coloring means memory copying –3 microseconds per page copy, >10K pages to copy, possibly every time quantum Artificial memory pressure –Restricting the cache share also restricts the memory share

5 Hotness-based Page Coloring Basic idea –Restrict page coloring to a small group of hot pages Challenge –How to identify hot pages efficiently Outline –Efficient hot page identification –Cache partition policy –Hot page coloring

6 Method to Track Page Hotness Hardware access bits + sequential table scan –Generally available on x86, set automatically by hardware –One bit per page table entry (PTE) Conventional wisdom: scanning the whole page table is expensive –Not entirely true: per-entry scan latency is overlapped by hardware prefetching –A sequential table scan spends a large portion of its time on non-accessed pages, but we can improve on that

7 Accelerate Sequential Scan Programs exhibit spatial locality even at page granularity –Page non-access correlation metric: Prob(next X neighbors are not accessed | current page is not accessed) (Figure: this probability plotted against the number of contiguous non-accessed pages, for the SPECcpu2K benchmark mesa)

8 Locality-based Jumping Start in sequential mode –Switch to jumping mode once a non-accessed entry is seen If an entry we jumped to is –not accessed: double the next jump range –accessed: roll back, reset the jump range to 1, and return to sequential mode Jump ranges are randomized to avoid overlooking pathological access patterns (Figure: example access-bit sequence; jumps skip runs of non-accessed pages, and a jump that lands on an accessed page triggers a roll back.)
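A simplified, hypothetical sketch of the jumping scan described above (without the randomization, and with a roll-back that rescans the whole skipped window). Pages jumped over in a window whose endpoint is also non-accessed can be missed; that is the accuracy loss the later slides quantify.

```python
def locality_jump_scan(bits):
    """Scan an access-bit array, skipping runs of non-accessed entries
    with exponentially growing jumps. Returns (set of accessed page
    indices found, number of PTE entries examined)."""
    found, examined = set(), 0
    pos, hop = 0, 1              # hop == 1 means sequential mode
    while pos < len(bits):
        examined += 1
        if bits[pos]:
            found.add(pos)
            if hop > 1:
                # A jump landed on an accessed page: roll back and
                # sequentially examine the window we skipped over.
                for k in range(max(0, pos - hop + 1), pos):
                    examined += 1
                    if bits[k]:
                        found.add(k)
            hop = 1              # reset jump range, back to sequential
            pos += 1
        else:
            hop *= 2             # non-accessed: double the jump range
            pos += hop
    return found, examined

# A hot region followed by a long cold run: 12 examinations instead of
# the 32 a sequential scan would need.
hot, cost = locality_jump_scan([1] * 8 + [0] * 24)
print(sorted(hot), cost)
```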

9 Sampling of Access Bits Recycle spare bits in the PTE as a hotness counter –The counter is aged to reflect both recency and frequency –Could be extended to support LFU page replacement Decouple the sampling frequency from the sampling window –Sampling frequency N –Sampling time window T (Figure: timeline; access bits are cleared at times 0, N, 2N, 3N, … and checked at T, N+T, 2N+T, 3N+T, …)
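One plausible form of the aged hotness counter is the classic shift-based aging used for LFU/LRU approximations: on each sampling pass, shift the counter right and make the fresh access bit the new high-order bit, so recent accesses weigh more than old ones. The 8-bit width here is an assumption, not a value from the slides.

```python
COUNTER_BITS = 8  # assumed width of the spare PTE bits used as a counter

def age_counter(counter, accessed):
    """Age the hotness counter with one new access-bit sample.
    Recent samples occupy the high-order bits, so the counter value
    reflects both frequency (how many 1s) and recency (where they are)."""
    return (counter >> 1) | (int(accessed) << (COUNTER_BITS - 1))

c = 0
for accessed in [1, 1, 0, 1]:   # sampled access history, oldest first
    c = age_counter(c, accessed)
print(bin(c))                    # 0b10110000: most recent sample on top
```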

10 Hot Page Identification Efficiency Entries skipped using locality-based jumping: >60% on average Runtime overhead –Tested 12 SPECcpu2K benchmarks on an Intel 3.0 GHz Core 2 Duo processor –On average, 2%/7% overhead at 100/10 millisecond sampling frequency –Saves 20%/58% compared to a sequential scan

11 Hot Page Identification Accuracy No major accuracy loss due to jumping, as measured by two metrics (Jeffrey divergence and rank error rate) –Results are fairly accurate

12 Roadmap Efficient hot page identification - locality jumping Cache partition policy - MRC-based Hot page coloring

13 Cache Partition Policy Miss-ratio-curve (MRC) based performance model –MRC profiled offline –Single app’s execution time ≈ Miss × Memory_Latency + Hit × Cache_Latency Cache partition goal: optimize overall system performance –System performance metric: geometric mean of all apps’ normalized performance; the normalization baseline is each app’s performance when it monopolizes the whole cache

14 MRC-driven Cache Partition Policy (Figure: thread A’s and thread B’s miss ratios plotted against cache allocation; the optimal partition point maximizes the geometric mean of the two apps’ normalized performance, subject to the two allocations summing to the 4 MB cache size.)
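The partition search can be sketched as a sweep over the possible splits, scoring each with the model from the previous slide. Everything numeric below is made up for illustration (the latencies, the 64-color cache, and the two synthetic miss-ratio curves); only the structure follows the slides.

```python
import math

MEM_LAT, CACHE_LAT = 200.0, 10.0   # assumed per-access cycle counts
CACHE_COLORS = 64                  # e.g. 4 MB / (4 KB pages * 16 ways)

def exec_time(miss_ratio):
    # Per-access cost from the slide's model: Miss*Mem_Lat + Hit*Cache_Lat.
    return miss_ratio * MEM_LAT + (1.0 - miss_ratio) * CACHE_LAT

def best_partition(mrc_a, mrc_b):
    """mrc_x[c] = offline-profiled miss ratio of app x with c colors.
    Returns (colors for A, geometric-mean score of the best split)."""
    base_a = exec_time(mrc_a[CACHE_COLORS])  # baseline: whole cache
    base_b = exec_time(mrc_b[CACHE_COLORS])
    best, best_score = None, -1.0
    for c in range(1, CACHE_COLORS):         # colors given to A
        perf_a = base_a / exec_time(mrc_a[c])
        perf_b = base_b / exec_time(mrc_b[CACHE_COLORS - c])
        score = math.sqrt(perf_a * perf_b)   # geometric mean of 2 apps
        if score > best_score:
            best, best_score = c, score
    return best, best_score

# Synthetic MRCs: A's misses drop sharply with more cache, B's slowly.
mrc_a = [0.5 / (1 + c) for c in range(CACHE_COLORS + 1)]
mrc_b = [0.3 - 0.002 * c for c in range(CACHE_COLORS + 1)]
colors_a, score = best_partition(mrc_a, mrc_b)
print(colors_a, round(score, 3))
```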

15 Hot Page Coloring Budget control of page re-coloring overhead –A % of the time slice, e.g. 5% Re-color from the hottest page until the budget is reached –Maintain a set of hotness bins during sampling: bin[ i ][ j ] = # of pages in color i with normalized hotness in range [ j, j+1 ] –Given a budget K, the K-th hottest page’s hotness value is estimated in constant time by searching the hotness bins –Make sure hot pages are uniformly distributed among colors
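The constant-time threshold lookup works because the bins have a fixed number of hotness buckets, so finding the K-th hottest page's bucket never depends on how many pages the app has. A hypothetical sketch (the 16-bucket width and the example counts are assumptions):

```python
NUM_BUCKETS = 16  # assumed number of normalized-hotness buckets

def hotness_threshold(bins, budget):
    """bins[i][j] = # of pages of color i in hotness bucket j.
    Returns the bucket of the budget-th hottest page, scanning a
    fixed number of buckets regardless of total page count."""
    remaining = budget
    for j in reversed(range(NUM_BUCKETS)):       # hottest bucket first
        in_bucket = sum(color_bins[j] for color_bins in bins)
        if in_bucket >= remaining:
            return j
        remaining -= in_bucket
    return 0                                     # budget covers all pages

# Two colors; one very hot page, two warm ones, forty cold ones.
bins = [[0] * NUM_BUCKETS for _ in range(2)]
bins[0][15] = 1
bins[1][14] = 2
bins[0][3] = 40
print(hotness_threshold(bins, 3))   # buckets 15 and 14 hold exactly 3 pages
```

Re-coloring then migrates only pages whose counter falls at or above the returned bucket, which is how the budget caps the copying cost.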

16 Re-coloring Procedure (Figure: example with a budget of 3 pages and a shrinking cache share; pages in colors Red, Blue, Green, and Gray are shown sorted by hotness counter value in ascending order, and the hottest pages are re-colored first.)

17 Performance Comparisons 4 SPECcpu2K benchmarks running on 2 sibling cores (Intel Core 2 Duo) sharing a 4 MB L2 cache: {art, equake} vs. {mcf, twolf}

18 Relieve Artificial Memory Pressure An app’s footprint may be larger than its entitled memory colors The app may then “steal” others’ colors, a.k.a. “polluting” others’ cache shares Hotness-aware pollution: preferentially copy cold pages to others’ memory colors (in round-robin fashion, so as not to impose new pressure) (Figure: thread A’s footprint A1–A5 spans more memory pages than its entitled colors in the shared cache.)

19 Relieve Artificial Memory Pressure On a dual-core chip, the L2 cache was originally evenly partitioned between the polluting and victim benchmarks. Because of memory pressure, the polluting benchmark moves 1/3 (~62 MB) of its footprint into the victim’s share. (Figure: results grouped into non-space-sensitive and space-sensitive apps.)

20 Summary Contributions: –Efficient hot page identification that can potentially serve multiple applications –Hotness-based page coloring to mitigate two drawbacks: memory pressure and re-coloring cost Caveat: a large time quantum is still required to amortize the overhead Ongoing work: –Exploring other possible approaches, e.g. execution-throttling-based cache management [USENIX’09]