Project Summary: Fair and High Throughput Cache Partitioning Scheme for CMPs
Shibdas Bandyopadhyay, Dept. of CISE, University of Florida

Project Proposal
- The machine architecture consists of a private cache level (say L2) and a shared next-level cache (say L3).
- The aim is to further partition the private cache level (L2) according to the characteristics of the applications running on the different cores.
- For example, if the applications running on different cores share blocks among themselves, some blocks will be marked exclusively for shared use.
- This helps reduce the miss rate for applications that share data heavily; if the applications do not share data, the cache performs as before.

Motivation

- As can be seen from the previous figure, shared blocks account for the majority of memory accesses in the commercial workloads.
- Since these workloads are all web-server applications, each shares a large amount of data among the threads it spawns on multiple cores.
- In the case of virtualized server consolidation, there will be a great amount of sharing among the cores participating in a virtual server.
- So, in the spirit of Amdahl's law, if we reduce the miss rate for the shared blocks in these situations, we should be able to improve the total hit rate.

Proposed Strategy
- Each cache set has a bit vector, the Replacement Priority Vector (RPV), whose length equals the associativity of the cache (i.e., the number of blocks in the set).
- A value of 1 in position x of the vector indicates that block x of that set is reserved exclusively for shared blocks; the remaining blocks can hold both private and shared blocks.
- During replacement, one of two strategies is followed depending on the state of the incoming block, as sketched below:
- If the incoming block will be in the shared state, all blocks in the set are considered and the LRU block is replaced.
- If the incoming block will be in the private state, all blocks except those reserved exclusively for shared blocks are considered for LRU replacement.
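
A minimal sketch of this RPV-aware victim selection (the function and field names are ours, not from the project):

```python
def pick_victim(blocks, rpv, incoming_is_shared):
    """Return the index of the way to evict from one cache set.

    blocks: resident blocks, each with a .last_used timestamp
    rpv:    Replacement Priority Vector; rpv[i] == 1 reserves way i
            exclusively for shared blocks
    """
    if incoming_is_shared:
        candidates = range(len(blocks))  # a shared block may evict any way
    else:
        # a private block must skip the ways reserved for shared data
        candidates = [i for i in range(len(blocks)) if rpv[i] == 0]
    # plain LRU among the eligible ways (assumes at least one way is unreserved)
    return min(candidates, key=lambda i: blocks[i].last_used)
```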

Proposed Strategy
- The RPV for each cache set is set up by each core according to directions from the cache directory controller (we assume a directory-based cache coherence protocol).
- The directory tracks the number of misses on shared blocks in a time interval for all the processors, in a buffer called the Processor Activity Buffer (PAB).
- Each PAB entry consists of three fields: a core ID, the number of misses on shared blocks for that core in the present time interval, and the same count for the previous interval.
- If the difference for a particular core is greater than a threshold, the directory sends that core a message to increase its number of reserved shared blocks, and vice versa if it is below the threshold; a sketch follows.
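
A sketch of the directory-side interval check, assuming a single threshold and taking the slide's "vice versa" case literally (all names here are ours):

```python
from dataclasses import dataclass

@dataclass
class PABEntry:
    core_id: int
    curr_misses: int = 0  # shared-block misses in the current interval
    prev_misses: int = 0  # shared-block misses in the previous interval

def end_of_interval(pab, threshold, send_to_core):
    """Run once per time interval at the directory."""
    for e in pab:
        delta = e.curr_misses - e.prev_misses
        if delta > threshold:
            send_to_core(e.core_id, "increase shared blocks")
        elif delta < threshold:
            # "vice versa": a difference below the threshold triggers a decrease
            send_to_core(e.core_id, "decrease shared blocks")
        # roll the interval window forward
        e.prev_misses, e.curr_misses = e.curr_misses, 0
```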

Proposed Strategy
- The RPV for each set of each core is initially set to zero.
- Upon receiving an "increase shared blocks" message from the directory, the core looks at the current number of shared blocks in each cache set (a counter associated with each set is incremented when a shared block comes into the set and decremented when a shared block is replaced).
- It decides in which sets the number of reserved shared blocks will be increased.
- It then modifies the RPV of those sets by turning on one more bit, depending on the current RPV.
- On receiving a "decrease shared blocks" message from the directory, it finds the sets with the lowest number of shared blocks and modifies their RPVs accordingly; a sketch follows.
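
A sketch of the core-side reaction to directory messages, combined with the top-k set selection described later on the simulation slide (k, the helper names, and the random bit choice for the decrease case are our assumptions):

```python
import random

def handle_directory_message(rpvs, shared_counts, msg, k=10):
    """Adjust the RPVs of the k sets most (or least) populated by shared blocks.

    rpvs:          per-set bit vectors; rpvs[s][i] == 1 reserves way i of set s
    shared_counts: per-set counters of resident shared blocks
    """
    # "increase": target the sets holding the most shared blocks;
    # "decrease": target the sets holding the fewest.
    order = sorted(range(len(rpvs)), key=lambda s: shared_counts[s],
                   reverse=(msg == "increase shared blocks"))
    for s in order[:k]:
        rpv = rpvs[s]
        if msg == "increase shared blocks":
            zeros = [i for i, bit in enumerate(rpv) if bit == 0]
            if zeros:
                rpv[random.choice(zeros)] = 1  # reserve one more way for shared data
        else:
            ones = [i for i, bit in enumerate(rpv) if bit == 1]
            if ones:
                rpv[random.choice(ones)] = 0  # release one reserved way
```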

Cache Coherence Protocol
Simple directory-based coherence protocol, as described in Hennessy & Patterson. Cache-block state machine (recovered from the slide's state diagram):
- Invalid: on a CPU read, send a Read Miss message to the home directory and go to Shared; on a CPU write, send a Write Miss message to the home directory and go to Exclusive.
- Shared (read only): a CPU read hit stays in Shared; on a CPU write, send a Write Miss message to the home directory and go to Exclusive; an Invalidate from the directory moves the block to Invalid.
- Exclusive (read/write): CPU read and write hits stay in Exclusive; on a Fetch from the directory, send a Data Write Back message to the home directory and go to Shared; on a Fetch/Invalidate, send a Data Write Back message and go to Invalid; on a CPU read miss, send a Data Write Back message and a read miss to the home directory.
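
The same cache-side transitions, encoded as a small lookup table (a sketch; the encoding and event names are ours):

```python
# (state, event) -> (message sent to the home directory, next state)
CACHE_FSM = {
    ("Invalid",   "cpu_read"):         ("Read Miss",       "Shared"),
    ("Invalid",   "cpu_write"):        ("Write Miss",      "Exclusive"),
    ("Shared",    "cpu_read_hit"):     (None,              "Shared"),
    ("Shared",    "cpu_write"):        ("Write Miss",      "Exclusive"),
    ("Shared",    "invalidate"):       (None,              "Invalid"),
    ("Exclusive", "cpu_read_hit"):     (None,              "Exclusive"),
    ("Exclusive", "cpu_write_hit"):    (None,              "Exclusive"),
    ("Exclusive", "fetch"):            ("Data Write Back", "Shared"),
    ("Exclusive", "fetch_invalidate"): ("Data Write Back", "Invalid"),
}
```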

Cache Coherence Protocol
Directory state machine for a memory block (recovered from the slide's state diagram):
- Uncached: on a Read Miss, set Sharers = {P} and send a Data Value Reply, going to Shared; on a Write Miss, set Sharers = {P} and send a Data Value Reply message, going to Exclusive.
- Shared (read only): on a Read Miss, set Sharers += {P} and send a Data Value Reply.
- Exclusive (read/write): on a Read Miss, set Sharers += {P}, send a Fetch, and send a Data Value Reply message to the remote cache (the owner writes back the block), going to Shared; on a Data Write Back, set Sharers = {} (write back the block) and go to Uncached.
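
And the directory-side transitions in the same table form (again a sketch of our own encoding):

```python
# (state, incoming message) -> next state; directory actions noted in comments
DIR_FSM = {
    ("Uncached",  "read_miss"):       "Shared",     # Sharers = {P}; send Data Value Reply
    ("Uncached",  "write_miss"):      "Exclusive",  # Sharers = {P}; send Data Value Reply
    ("Shared",    "read_miss"):       "Shared",     # Sharers += {P}; send Data Value Reply
    ("Exclusive", "read_miss"):       "Shared",     # Sharers += {P}; send Fetch to owner,
                                                    # then Data Value Reply (block written back)
    ("Exclusive", "data_write_back"): "Uncached",   # Sharers = {}; write back the block
}
```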

Simulation Strategy
- Each core is represented by a process that reads the per-core trace generated for it from the MP trace file.
- Each core process connects to the directory process using sockets and sends the current address to the directory whenever it is not a hit in the local cache.
- The directory process updates the PAB if needed and sends an update to the core processes after every T requests to the directory.
- As this is a simplified coherence protocol, processes wait for the acknowledgement and data from the directory before proceeding to the next address; a sketch of the core loop follows.
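
A minimal sketch of one core process under these rules, assuming a trace of hex addresses, one per line; the wire format and the plain-set stand-in for the RPV-aware L2 model are our inventions, not the project's actual code:

```python
import socket

def core_process(core_id, trace_path, directory_addr=("localhost", 9000)):
    """Replay one core's trace, consulting the directory on every local miss."""
    sock = socket.create_connection(directory_addr)
    reader = sock.makefile("r")
    resident = set()  # addresses currently in the local cache (no eviction modeled here)
    with open(trace_path) as trace:
        for line in trace:
            addr = int(line.strip(), 16)
            if addr in resident:
                continue  # local hit: no directory traffic
            sock.sendall(f"{core_id} {addr:x}\n".encode())
            reader.readline()  # block until ack + data arrive before the next address
            resident.add(addr)
    sock.close()
```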

Simulation Strategy
- The MP trace file was given by Zhou (Thanks, Zhou!!!).
- T is chosen to be 1000.
- Each time an "increase" request comes from the directory, each core looks at the top 10 cache sets in terms of the number of shared blocks and updates the RPV by setting a randomly chosen zero bit of the RPV to 1.
- For different L2 cache associativities and sizes, the miss rate for each core is plotted alongside the case where simple LRU is used.

Results

Inference
- With higher associativity, the effect of the new policy is clearer, as allocating some blocks as shared does not hurt private data.
- Plain LRU is not really used nowadays; we should compare against newer policies like cooperative caching for better insight.
- As confirmed by Zhou, the MP workload is indeed from an application that shares most of its data apart from the code section, hence the performance improvement is more prominent.

Tunable Parameters
- This is merely a study with one workload that happened to have good sharing characteristics.
- Many parameters can be tuned, such as which cores the directory should update, and how many blocks should be chosen for modifying their RPV.
- Analysis of the RPV and the trace could reveal whether the RPV reflects the kind of sharing present.
- The impact of false sharing, and how to eliminate it, also remains to be studied.

Thank You