15-740/18-740 Computer Architecture Lecture 18: Caching in Multi-Core Prof. Onur Mutlu Carnegie Mellon University.

Presentation transcript:

15-740/18-740 Computer Architecture Lecture 18: Caching in Multi-Core Prof. Onur Mutlu Carnegie Mellon University

Last Time
Multi-core issues in prefetching and caching
- Prefetching coherence misses: push vs. pull
- Coordinated prefetcher throttling
- Cache coherence: software versus hardware
- Shared versus private caches
- Utility-based shared cache partitioning

Readings in Caching for Multi-Core
Required
- Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006.
Recommended (covered in class)
- Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems,” HPCA 2008.
- Qureshi et al., “Adaptive Insertion Policies for High-Performance Caching,” ISCA 2007.

Software-Based Shared Cache Management
Assume no hardware support (demand-based cache sharing, i.e., LRU replacement)
How can the OS best utilize the cache?
Cache sharing aware thread scheduling
- Schedule workloads that “play nicely” together in the cache
  - E.g., their working sets together fit in the cache
  - Requires static/dynamic profiling of application behavior
- Fedorova et al., “Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler,” PACT 2007.
Cache sharing aware page coloring
- Dynamically monitor the miss rate over an interval and change the virtual-to-physical mapping to minimize it
  - Try out different partitions

OS-Based Cache Partitioning
Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems,” HPCA 2008.
Cho and Jin, “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation,” MICRO 2006.
Static cache partitioning
- Predetermines the amount of cache blocks allocated to each program at the beginning of its execution
- Divides the shared cache into multiple regions and partitions cache regions through OS-based page mapping
Dynamic cache partitioning
- Adjusts cache quota among processes dynamically
- Page re-coloring
- Dynamically changes processes’ cache usage through OS-based page re-mapping

Page Coloring
- Physical memory is divided into colors
- Colors map to different cache sets
- Cache partitioning: ensure that two threads are allocated pages of different colors
[Figure: memory pages of Thread A and Thread B map to disjoint regions of the cache (Way-1 … Way-n).]

Page Coloring
[Figure: the virtual address (virtual page number, page offset) is translated to a physical address (physical page number, page offset); the cache address consists of cache tag, set index, and block offset. The page color bits are the bits shared by the physical page number and the set index, and they are under OS control.]
- Physically indexed caches are divided into multiple regions (colors).
- All cache lines in a physical page are cached in one of those regions (colors).
- The OS can control the page color of a virtual page through address mapping (by selecting a physical page with a specific value in its page color bits).
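A minimal sketch (not from the lecture) of the color computation implied by the figure: the color is the part of the set index that overlaps the physical page number. The cache geometry below is an illustrative assumption, not the lecture's machine.

    #include <stdint.h>
    #include <stdio.h>

    enum {
        PAGE_SIZE  = 4096,                               /* 4 KB pages: 12 page-offset bits  */
        LINE_SIZE  = 64,                                 /* 64 B lines: 6 block-offset bits  */
        NUM_SETS   = 4096,                               /* e.g., a 4MB, 16-way cache        */
        NUM_COLORS = (NUM_SETS * LINE_SIZE) / PAGE_SIZE  /* set-index bits above page offset */
    };

    /* The page color is formed by the low bits of the physical page number
     * that also select the cache set. */
    static unsigned page_color(uint64_t paddr)
    {
        return (unsigned)((paddr / PAGE_SIZE) % NUM_COLORS);
    }

    int main(void)
    {
        uint64_t pa = 0x0123F000ULL;
        printf("physical page %llu maps to color %u (of %d)\n",
               (unsigned long long)(pa / PAGE_SIZE), page_color(pa), NUM_COLORS);
        return 0;
    }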

Static Cache Partitioning Using Page Coloring
[Figure: physical pages are grouped into page bins according to their page color; the OS address mapping assigns disjoint sets of colors (…, i, i+1, i+2, …) to Process 1 and Process 2, so each process occupies its own region of the physically indexed cache.]
- The shared cache is partitioned between two processes through address mapping (see the allocation sketch below).
- Cost: main memory space needs to be partitioned, too.
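The sketch below shows one way an OS could implement the page bins and color-restricted allocation described above. All structures and helpers (struct page_bin, alloc_colored_page, bin_pop) are hypothetical, not taken from any real kernel.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_COLORS 16                /* matches the 16-color setup used later */

    /* One free list ("bin") per color, and a per-process set of allowed colors. */
    struct page { struct page *next; };
    struct page_bin { struct page *free_list; };
    static struct page_bin bins[NUM_COLORS];

    struct process { bool allowed_color[NUM_COLORS]; };

    static struct page *bin_pop(struct page_bin *b)
    {
        struct page *pg = b->free_list;
        if (pg)
            b->free_list = pg->next;
        return pg;
    }

    /* Allocate a physical page for process p, honoring its static color partition.
     * Returns NULL when the process has exhausted its share of main memory -- the
     * cost noted above: memory space is partitioned along with the cache. */
    static struct page *alloc_colored_page(struct process *p)
    {
        for (int c = 0; c < NUM_COLORS; c++) {
            if (!p->allowed_color[c])
                continue;                       /* color belongs to the other process */
            struct page *pg = bin_pop(&bins[c]);
            if (pg)
                return pg;
        }
        return NULL;
    }

    int main(void)
    {
        static struct page frames[2];
        bins[2].free_list = &frames[0];         /* two free frames of color 2 */
        frames[0].next = &frames[1];
        frames[1].next = NULL;

        struct process p = { { false } };
        for (int c = 0; c < NUM_COLORS / 2; c++)
            p.allowed_color[c] = true;          /* this process owns colors 0..7 */

        printf("allocation succeeded: %s\n", alloc_colored_page(&p) ? "yes" : "no");
        return 0;
    }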

Dynamic Cache Partitioning via Page Re-Coloring
[Figure: a per-process page color table records the allocated colors (0 … N); pages of each color are kept on their own linked list.]
Page re-coloring:
- Allocate a page in the new color
- Copy memory contents
- Free the old page
Pages of a process are organized into linked lists by their colors. Memory allocation guarantees that pages are evenly distributed into all the lists (colors) to avoid hot spots.
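A user-space sketch of the three re-coloring steps above. Malloc'd buffers stand in for physical frames, and alloc_frame_of_color is a hypothetical placeholder for pulling a frame from the free list of the target color; a real OS would also update the page table and shoot down TLB entries when remapping.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    struct vpage {                  /* one virtual page of the process           */
        void    *frame;             /* stand-in for the mapped physical frame    */
        unsigned color;             /* this page's entry in the page color table */
    };

    /* Stand-in for "take a free frame from the bin of `color`". */
    static void *alloc_frame_of_color(unsigned color)
    {
        (void)color;
        return malloc(PAGE_SIZE);
    }

    static int recolor(struct vpage *p, unsigned new_color)
    {
        void *nf = alloc_frame_of_color(new_color);  /* 1. allocate page in new color */
        if (!nf)
            return -1;
        memcpy(nf, p->frame, PAGE_SIZE);             /* 2. copy memory contents       */
        free(p->frame);                              /* 3. free old page              */
        p->frame = nf;                               /* remap (page-table update + TLB shootdown in a real OS) */
        p->color = new_color;
        return 0;
    }

    int main(void)
    {
        struct vpage p = { malloc(PAGE_SIZE), 3 };
        memset(p.frame, 0xAB, PAGE_SIZE);
        if (recolor(&p, 7) == 0)
            printf("page moved from color 3 to color %u\n", p.color);
        free(p.frame);
        return 0;
    }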

Dynamic Partitioning in a Dual Core
- Init: partition the cache as (8:8)
- Repeat until the workload finishes:
  - Run the current partition (P0:P1) for one epoch
  - Try one epoch for each of the two neighboring partitions: (P0-1 : P1+1) and (P0+1 : P1-1)
  - Choose the next partitioning based on the best policy-metric measurement (e.g., cache miss rate)
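A sketch of the epoch-based search in the flowchart above, assuming a 16-color cache shared by two cores. run_epoch_with_partition is a synthetic stand-in for programming the partition, running an epoch, and reading hardware miss counters.

    #include <stdio.h>
    #include <stdlib.h>

    #define TOTAL_COLORS 16

    /* Synthetic miss-rate curve with its minimum at a 10:6 split; a real
     * system would measure this with performance counters. */
    static double run_epoch_with_partition(int p0)
    {
        return 0.05 + 0.01 * abs(p0 - 10);
    }

    int main(void)
    {
        int p0 = TOTAL_COLORS / 2;                    /* Init: partition as (8:8)       */
        for (int epoch = 0; epoch < 8; epoch++) {     /* stand-in for "until finished"  */
            double cur   = run_epoch_with_partition(p0);
            double left  = (p0 > 1) ? run_epoch_with_partition(p0 - 1) : 1e9;                /* (P0-1:P1+1) */
            double right = (p0 < TOTAL_COLORS - 1) ? run_epoch_with_partition(p0 + 1) : 1e9; /* (P0+1:P1-1) */

            /* Keep the partition with the best metric for the next epoch. */
            if (left < cur && left <= right)
                p0--;
            else if (right < cur)
                p0++;
            printf("epoch %d: next partition (%d:%d)\n", epoch, p0, TOTAL_COLORS - p0);
        }
        return 0;
    }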

Experimental Environment
Dell PowerEdge 1950
- Two-way SMP, Intel dual-core Xeon 5160
- Shared 4MB L2 cache, 16-way
- 8GB Fully Buffered DIMM
Red Hat Enterprise Linux 4.0 kernel
- Performance counter tools from HP (Pfmon)
- L2 cache divided into 16 colors

Performance – Static & Dynamic
[Results figure; see Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems,” HPCA 2008.]

Software vs. Hardware Cache Management
Software advantages
+ No need to change hardware
+ Easier to upgrade/change the algorithm (not burned into hardware)
Disadvantages
- Less flexible: large granularity (page-based instead of way/block-based)
- Limited page colors → reduced performance per application (limited physical memory space!), reduced flexibility
- Changing the partition size has high overhead → page mapping changes
- Adaptivity is slow: hardware can adapt every cycle (possibly)
- Not enough information exposed to software (e.g., number of misses due to inter-thread conflicts)

Handling Shared Data in Private Caches
Shared data and locks ping-pong between processors if caches are private
-- Increases latency to fetch shared data/locks
-- Reduces cache efficiency (many invalid blocks)
-- Scalability problem: maintaining coherence across a large number of private caches is costly
How to do better?
- Idea: store shared data and locks only in one special core’s cache. Divert all critical section execution to that core/cache.
- Essentially, a specialized core for processing critical sections
- Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009.

Non-Uniform Cache Access
Large caches take a long time to access
- Wire delay
- Nearby blocks can be accessed faster, but the farthest blocks determine the worst-case access time
Idea: variable-latency access within a single cache
Partition the cache into pieces
- Each piece has a different latency
- Which piece does an address map to?
  - Static: based on bits in the address
  - Dynamic: any address can map to any piece
    - How to locate an address?
    - Replacement and placement policies?
Kim et al., “An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches,” ASPLOS 2002.
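A sketch of the “static” mapping option above, in which address bits select the bank and each bank has its own access latency. The bank count and latency values are illustrative assumptions; dynamic NUCA, block migration, and search are not modeled here.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BANKS 16
    #define LINE_SIZE 64

    /* Per-bank latency grows with distance from the requesting core
     * (illustrative values in cycles). */
    static const int bank_latency[NUM_BANKS] = {
         8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23
    };

    /* Static NUCA: the bank index comes directly from address bits above
     * the block offset, so each block can live in exactly one bank. */
    static unsigned static_bank(uint64_t addr)
    {
        return (unsigned)((addr / LINE_SIZE) % NUM_BANKS);
    }

    int main(void)
    {
        uint64_t addr = 0xDEADBEC0ULL;
        unsigned b = static_bank(addr);
        printf("address 0x%llx -> bank %u, %d-cycle access\n",
               (unsigned long long)addr, b, bank_latency[b]);
        return 0;
    }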

Multi-Core Cache Efficiency: Bandwidth Filters
Caches act as a filter that reduces the memory bandwidth requirement
- Cache hit: no need to access memory
- This is in addition to the latency-reduction benefit of caching
- GPUs use caches to reduce memory bandwidth requirements
Efficient utilization of cache space becomes more important with multi-core
- Memory bandwidth is more valuable
  - Pin count is not increasing as fast as the number of transistors (10% vs. 2x every 2 years)
- More cores put more pressure on the memory bandwidth
How to make the bandwidth filtering effect of caches better?

Revisiting Cache Placement (Insertion)
Is inserting a fetched/prefetched block into the cache (hierarchy) always a good idea?
- No-allocate-on-write: does not allocate a block on a write miss
- How about reads?
Allocating on a read miss
-- Evicts another potentially useful cache block
+ Incoming block is potentially more useful
Ideally:
- we would like to place those blocks whose caching would be most useful in the future
- we certainly do not want to cache never-to-be-used blocks

Revisiting Cache Placement (Insertion)
Ideas:
- Hardware predicts blocks that are not going to be used
  - Lai et al., “Dead Block Prediction,” ISCA 2001.
- Software (programmer/compiler) marks instructions that touch data that is not going to be reused
  - How does software determine this?
  - Streaming versus non-streaming accesses
    - If a program is streaming through data, reuse likely occurs only for a limited period of time
    - If such instructions are marked by the software, the hardware can store the data temporarily in a smaller buffer (L0 cache) instead of the cache (see the sketch below)
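A sketch, not taken from any cited paper, of how a cache fill could be steered by a software-provided streaming hint: hinted blocks go to a small streaming buffer (the “L0”) instead of being inserted into the L2. All names and structures are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t addr;
        bool     sw_streaming_hint;   /* set by the compiler/programmer on the instruction */
    } mem_request_t;

    /* Stand-ins for the two destinations a fill can take. */
    static void insert_into_L2(uint64_t addr)         { printf("L2 fill   %#lx\n", (unsigned long)addr); }
    static void insert_into_stream_buf(uint64_t addr) { printf("L0 buffer %#lx\n", (unsigned long)addr); }

    static void handle_fill(const mem_request_t *req)
    {
        if (req->sw_streaming_hint)
            insert_into_stream_buf(req->addr);  /* short-lived reuse only: keep it out of L2 */
        else
            insert_into_L2(req->addr);          /* normal allocation */
    }

    int main(void)
    {
        mem_request_t normal    = { 0x1000, false };
        mem_request_t streaming = { 0x2000, true  };
        handle_fill(&normal);
        handle_fill(&streaming);
        return 0;
    }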

Reuse at the L2 Cache Level
- DoA blocks: blocks unused between insertion and eviction
- For the 1MB 16-way L2, 60% of lines are DoA
  - Ineffective use of cache space
[Chart: DoA lines (%).]

Why Dead-on-Arrival Blocks?
- Streaming data
  - Never reused. L2 caches don’t help.
- Working set of the application greater than the cache size
  - Solution: if working set > cache size, retain some of the working set
[Charts for art and mcf: misses per 1000 instructions vs. cache size in MB.]

Cache Insertion Policies: MRU vs. LRU
Consider an 8-way set holding blocks a–h, ordered from MRU to LRU: [ a b c d e f g h ]
- Reference to ‘i’ with the traditional LRU policy (insert at MRU): [ i a b c d e f g ]
- Reference to ‘i’ with LIP (choose the victim, but do NOT promote to MRU): [ a b c d e f g i ]
- With LIP, lines do not enter non-LRU positions unless reused
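The same example in code: an 8-way set kept as a recency stack, with the insertion position as the only difference between traditional LRU and LIP. A minimal sketch, not a full replacement-policy model (reuse promotion is omitted).

    #include <stdio.h>
    #include <string.h>

    #define WAYS 8

    static void show(const char *tag, const char set[WAYS])
    {
        printf("%-26s MRU [ ", tag);
        for (int i = 0; i < WAYS; i++)
            printf("%c ", set[i]);
        printf("] LRU\n");
    }

    /* On a miss, the victim is the block in the LRU slot (index WAYS-1).
     * pos = 0 inserts the new block at MRU (traditional LRU policy);
     * pos = WAYS-1 inserts it at LRU (LIP). */
    static void miss_insert(char set[WAYS], char blk, int pos)
    {
        if (pos < WAYS - 1)
            memmove(&set[pos + 1], &set[pos], (size_t)(WAYS - 1 - pos));
        set[pos] = blk;
    }

    int main(void)
    {
        char lru[WAYS] = { 'a','b','c','d','e','f','g','h' };
        char lip[WAYS] = { 'a','b','c','d','e','f','g','h' };

        show("initial set", lru);
        miss_insert(lru, 'i', 0);          /* LRU: i enters at MRU, h is evicted */
        show("miss on i, LRU insertion", lru);

        miss_insert(lip, 'i', WAYS - 1);   /* LIP: i replaces h in the LRU slot  */
        show("miss on i, LIP insertion", lip);
        return 0;
    }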

Other Insertion Policies: Bimodal Insertion (BIP)
LIP does not age older lines
- Infrequently insert lines in the MRU position
Let ε = bimodal throttle parameter:
if ( rand() < ε ) Insert at MRU position; else Insert at LRU position;
For small ε, BIP retains the thrashing protection of LIP while responding to changes in the working set
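A minimal sketch of the bimodal coin flip with ε = 1/32, the value used in the results a few slides later; real designs often use a counter instead of rand(), but the effect is the same.

    #include <stdio.h>
    #include <stdlib.h>

    #define EPSILON_RECIP 32          /* epsilon = 1/32 */

    /* Returns 1 if this fill should be inserted at MRU, 0 if at LRU.
     * "rand() < epsilon" is modeled as a 1-in-32 event. */
    static int bip_insert_at_mru(void)
    {
        return (rand() % EPSILON_RECIP) == 0;
    }

    int main(void)
    {
        int mru = 0, fills = 100000;
        for (int i = 0; i < fills; i++)
            mru += bip_insert_at_mru();
        printf("%d of %d fills (%.2f%%) went to the MRU position\n",
               mru, fills, 100.0 * mru / fills);
        return 0;
    }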

Analysis with the Circular Reference Model
The reference stream has T blocks and repeats N times; the cache has K blocks (K < T).
Two consecutive reference streams, with the hit rate of each policy:

Policy           (a1 a2 a3 … aT)^N    (b1 b2 b3 … bT)^N
LRU              0                    0
OPT              (K-1)/T              (K-1)/T
LIP              (K-1)/T              0
BIP (small ε)    ≈ (K-1)/T            ≈ (K-1)/T

For small ε, BIP retains the thrashing protection of LIP while adapting to changes in the working set.


LIP and BIP Performance vs. LRU
- Changes to the insertion policy increase misses for LRU-friendly workloads
[Chart: % reduction in L2 MPKI for LIP and BIP (ε = 1/32) across benchmarks.]

Dynamic Insertion Policy (DIP)
Qureshi et al., “Adaptive Insertion Policies for High-Performance Caching,” ISCA 2007.
Two types of workloads: LRU-friendly or BIP-friendly
DIP can be implemented by:
1. Monitoring both policies (LRU and BIP)
2. Choosing the best-performing policy
3. Applying the best policy to the cache
Needs a cost-effective implementation → set sampling (see the sketch below)
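A sketch of one possible set-sampling (“set dueling”) implementation: a few leader sets always use LRU insertion, a few always use BIP, and a saturating counter picks the policy for all follower sets. The constants and the leader-set selection scheme are illustrative assumptions, not the exact mechanism from the paper.

    #include <stdbool.h>
    #include <stdio.h>

    #define LEADER_STRIDE 32          /* roughly one leader set per policy per 32 sets */
    #define PSEL_MAX      1023        /* 10-bit saturating policy-selection counter    */

    static unsigned psel = PSEL_MAX / 2;

    typedef enum { FOLLOW_LRU, FOLLOW_BIP } policy_t;

    /* Simple leader-set selection by set index (one of many possible schemes). */
    static bool is_lru_leader(unsigned set) { return (set % LEADER_STRIDE) == 0; }
    static bool is_bip_leader(unsigned set) { return (set % LEADER_STRIDE) == 1; }

    /* Called on every miss to a leader set: misses in LRU leaders push PSEL
     * toward BIP, misses in BIP leaders push it toward LRU. */
    static void update_psel_on_miss(unsigned set)
    {
        if (is_lru_leader(set) && psel < PSEL_MAX) psel++;
        else if (is_bip_leader(set) && psel > 0)   psel--;
    }

    /* Insertion policy applied to a given set. */
    static policy_t policy_for_set(unsigned set)
    {
        if (is_lru_leader(set)) return FOLLOW_LRU;
        if (is_bip_leader(set)) return FOLLOW_BIP;
        return (psel > PSEL_MAX / 2) ? FOLLOW_BIP : FOLLOW_LRU;   /* followers use the winner */
    }

    int main(void)
    {
        /* Pretend the LRU leader sets miss a lot: PSEL drifts toward BIP. */
        for (int i = 0; i < 600; i++)
            update_psel_on_miss(0);
        printf("follower sets use %s insertion\n",
               policy_for_set(5) == FOLLOW_BIP ? "BIP" : "LRU");
        return 0;
    }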

Dynamic Insertion Policy Miss Rate
[Chart: % reduction in L2 MPKI for BIP and DIP (32 dedicated sets) across benchmarks.]

DIP vs. Other Policies
[Chart: DIP compared against OPT, a double-sized (2MB) cache, and hybrid replacement policies (LRU+RND), (LRU+LFU), (LRU+MRU).]