
1 Project Summary: Fair and High-Throughput Cache Partitioning Scheme for CMPs. Shibdas Bandyopadhyay, Dept of CISE, University of Florida

2 Project Proposal The architecture of the machine consists of a private level of cache (say L2) and a shared next level of cache (say L3). The aim is to further partition the private level of cache (L2) depending on the characteristics of the applications running on the different cores. For example, if the applications running on different cores share blocks among themselves, some blocks will be marked exclusively for shared use. This should reduce the miss rate for applications that share data heavily; if the applications do not share data, the cache performs as before.

3 Motivation

4 As can be seen from the previous figure, for the commercial workloads the shared blocks constitute the majority of memory accesses. As the workloads are all web server applications, they share a large amount of data between the threads they spawn on multiple cores. In the case of virtualized server consolidation, there will be a great amount of sharing among the cores participating in a virtual server. So, per Amdahl's law, if we reduce the miss rate for the shared blocks in these situations, we should be able to improve the total hit rate.

5 Proposed Strategy Each cache set has a bit vector (call it the Replacement Priority Vector [RPV]) of length equal to the associativity of the cache (i.e., equal to the number of blocks in the set). A value of 1 at position x in that vector indicates that block x of that set is reserved exclusively for shared blocks; the other blocks can hold both private and shared blocks. During replacement, two different strategies are followed depending on the state of the block that comes into the cache. If the incoming block will be in the shared state, all blocks in the set are considered and the LRU block is replaced. If the incoming block will be in the private state, all blocks except the ones reserved exclusively for shared blocks are considered for LRU replacement.
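The RPV-aware victim selection can be sketched as follows. This is a minimal Python sketch, not the project's actual code; the class, its fields (`rpv`, `blocks`, `lru_order`), and the fallback when every way is reserved are illustrative assumptions.

```python
# Minimal sketch of RPV-aware replacement for one set of a set-associative
# cache that keeps an LRU ordering of its ways. Names are illustrative.

class CacheSet:
    def __init__(self, associativity):
        self.assoc = associativity
        self.rpv = [0] * associativity         # 1 = way reserved for shared blocks only
        self.blocks = [None] * associativity   # None = empty way
        self.lru_order = list(range(associativity))  # front = LRU, back = MRU

    def choose_victim(self, incoming_is_shared):
        """Pick the way to evict for an incoming block."""
        if incoming_is_shared:
            # Shared blocks may evict from any way: plain LRU.
            candidates = self.lru_order
        else:
            # Private blocks must skip the ways reserved for shared data.
            candidates = [w for w in self.lru_order if self.rpv[w] == 0]
            if not candidates:
                # Assumed fallback: if every way were reserved, revert to plain LRU.
                candidates = self.lru_order
        # Prefer an empty way before evicting a valid block.
        for way in candidates:
            if self.blocks[way] is None:
                return way
        return candidates[0]   # least recently used eligible way
```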

6 Proposed Strategy The RPV for each cache set is set up by each core under the direction of the cache directory controller (we assume a directory-based cache coherence protocol). The directory tracks the number of misses on shared blocks in a time interval for all the processors, in a buffer called the Processor Activity Buffer [PAB]. Each PAB entry consists of three fields: a core id, the number of misses on shared blocks for that core in the present time interval, and the same count for the previous interval. If the difference for a particular core is greater than a threshold, the directory sends a message to that core to increase the number of reserved shared blocks, and vice versa if it is below the threshold.
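A sketch of how the directory could maintain the PAB and apply the threshold rule is shown below (Python). The threshold value and the `send_message` stub are assumptions; the "increase when above / decrease when below the threshold" branch follows the rule stated on this slide.

```python
# Sketch of the Processor Activity Buffer (PAB) kept at the directory.
# THRESHOLD and send_message() are placeholders, not defined in the slides.

THRESHOLD = 64   # assumed tuning parameter

def send_message(core_id, text):
    # Placeholder for the directory-to-core message (e.g., over a socket).
    print(f"to core {core_id}: {text}")

class PABEntry:
    def __init__(self, core_id):
        self.core_id = core_id
        self.curr_shared_misses = 0   # shared-block misses in the current interval
        self.prev_shared_misses = 0   # shared-block misses in the previous interval

class Directory:
    def __init__(self, num_cores):
        self.pab = [PABEntry(c) for c in range(num_cores)]

    def record_shared_miss(self, core_id):
        self.pab[core_id].curr_shared_misses += 1

    def end_of_interval(self):
        """Compare the two intervals for each core and notify the cores."""
        for entry in self.pab:
            diff = entry.curr_shared_misses - entry.prev_shared_misses
            if diff > THRESHOLD:
                send_message(entry.core_id, "increase shared blocks")
            else:
                # "vice versa if it is below the threshold", per the slide
                send_message(entry.core_id, "decrease shared blocks")
            entry.prev_shared_misses = entry.curr_shared_misses
            entry.curr_shared_misses = 0
```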

7 Proposed Strategy The RPV for each set of each core is initially set to zero. Upon receiving an "increase shared blocks" message from the directory, the core looks at the current number of shared blocks in each cache set (a counter associated with each set is incremented when a shared block enters the set and decremented when a shared block is replaced). It decides in which sets the number of reserved shared blocks will be increased, and then modifies the RPV of those sets by turning on a bit, depending on the current RPV. On receiving a "decrease shared blocks" message from the directory, it finds the sets with the lowest number of shared blocks and modifies their RPVs accordingly.
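A sketch of the core-side handler for these messages, reusing the `CacheSet` sketch above. The "10 sets per message" figure is taken from the simulation slides later on; the class layout and the random choice of which bit to flip are illustrative assumptions.

```python
# Sketch of the per-core RPV update on directory messages. Each set also keeps
# the shared-block counter described above. SETS_PER_UPDATE comes from the
# simulation slides; everything else is illustrative.
import random

SETS_PER_UPDATE = 10

class Core:
    def __init__(self, num_sets, associativity):
        self.sets = [CacheSet(associativity) for _ in range(num_sets)]
        self.shared_count = [0] * num_sets   # shared blocks currently in each set

    def on_directory_message(self, msg):
        if msg == "increase shared blocks":
            # Sets holding the most shared blocks each get one more reserved way.
            chosen = sorted(range(len(self.sets)),
                            key=lambda s: self.shared_count[s],
                            reverse=True)[:SETS_PER_UPDATE]
            for s in chosen:
                zero_ways = [w for w, bit in enumerate(self.sets[s].rpv) if bit == 0]
                if zero_ways:
                    self.sets[s].rpv[random.choice(zero_ways)] = 1
        elif msg == "decrease shared blocks":
            # Sets holding the fewest shared blocks each give up one reserved way.
            chosen = sorted(range(len(self.sets)),
                            key=lambda s: self.shared_count[s])[:SETS_PER_UPDATE]
            for s in chosen:
                one_ways = [w for w, bit in enumerate(self.sets[s].rpv) if bit == 1]
                if one_ways:
                    self.sets[s].rpv[random.choice(one_ways)] = 0
```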

8 Cache Coherence Protocol A simple directory-based coherence protocol, as described in Hennessy and Patterson. (State-transition diagram for an individual cache block: Invalid, Shared (read only), and Exclusive (read/write) states, with transitions driven by CPU read/write hits and misses, by Fetch, Invalidate, and Fetch/Invalidate messages from the home directory, and by Data Write Back messages sent to the home directory.)

9 Cache Coherence Protocol (State-transition diagram for the directory entry of a memory block: Uncached, Shared (read only), and Exclusive (read/write) states. On read and write misses the directory updates the sharer set and sends Data Value Reply, Fetch, and Invalidate messages as needed; on a Data Write Back it clears the sharer set and writes the block back.)
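For concreteness, below is a minimal Python sketch of the directory side of this simplified Hennessy and Patterson protocol. The `send` callback is a placeholder for whatever transport the simulator uses; only the state and sharer-set bookkeeping is the point here.

```python
class DirEntry:
    """One directory entry; states follow the diagram above."""
    UNCACHED, SHARED, EXCLUSIVE = "Uncached", "Shared", "Exclusive"

    def __init__(self):
        self.state = DirEntry.UNCACHED
        self.sharers = set()

    def read_miss(self, p, send):
        if self.state == DirEntry.EXCLUSIVE:
            owner = next(iter(self.sharers))
            send(owner, "Fetch")                 # owner writes the block back
        self.sharers.add(p)
        send(p, "Data Value Reply")
        self.state = DirEntry.SHARED

    def write_miss(self, p, send):
        if self.state == DirEntry.EXCLUSIVE:
            owner = next(iter(self.sharers))
            send(owner, "Fetch/Invalidate")      # owner writes back and invalidates
        elif self.state == DirEntry.SHARED:
            for s in self.sharers:
                send(s, "Invalidate")            # invalidate all current sharers
        self.sharers = {p}
        send(p, "Data Value Reply")
        self.state = DirEntry.EXCLUSIVE

    def data_write_back(self, p):
        self.sharers = set()                     # Sharers = {}; block written back
        self.state = DirEntry.UNCACHED
```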

10 Simulation Strategy Each core is represented by a process that reads from the trace file generated for that core from the MP trace file. Each core process connects to the directory process using sockets and sends the current address to the directory if it is not a hit in the local cache. The directory process updates the PAB if needed and sends an update to the core processes after every T requests to the directory. As this is a simplified coherence protocol, a process waits for the acknowledgement and data from the directory before proceeding to the next address.
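A sketch of one core process's loop is given below (Python sockets). The directory address, trace-line format, message framing, and the cache object's methods (`lookup`, `fill`, `on_directory_message`) are assumptions, not details given in the slides.

```python
# Sketch of one core process: read the per-core trace, consult the directory
# over a socket on every local miss, and block for the reply before moving on.
import socket

DIRECTORY_ADDR = ("localhost", 9000)   # assumed directory endpoint

def run_core(core_id, trace_path, cache):
    sock = socket.create_connection(DIRECTORY_ADDR)
    reader = sock.makefile("r")
    with open(trace_path) as trace:
        for line in trace:
            address = int(line.split()[0], 16)    # assumed: one hex address per line
            if cache.lookup(address):
                continue                           # local hit: no directory traffic
            sock.sendall(f"{core_id} {address:x}\n".encode())
            reply = reader.readline().strip()      # wait for the directory's ack/data
            if reply in ("increase shared blocks", "decrease shared blocks"):
                cache.on_directory_message(reply)  # assumed piggybacked RPV update
            cache.fill(address)
    reader.close()
    sock.close()
```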

11 Simulation Strategy The MP trace file was provided by Zhou (thanks, Zhou!). T is chosen to be 1000. Each time an "increase" request comes from the directory, each core looks at the top 10 cache sets by number of shared blocks and updates the RPV by setting a randomly chosen zero bit of the RPV to 1. For different L2 cache associativities and sizes, the miss rate of each core is plotted alongside the case where simple LRU is used.

12 Results

13

14 Inference With higher associativity, the effect of the new policy is clearer, since reserving some blocks for shared data does not hurt private data. LRU is not really used nowadays; we should compare against newer policies such as cooperative caching for better insight. As confirmed by Zhou, the MP workload is indeed from an application that shared most of its data apart from the code section, hence the performance improvement is more prominent.

15 Tunable Parameters This is merely a study with a workload that happened to have good sharing characteristics. Many parameters can be tuned, such as which cores the directory chooses to update and how many blocks should be chosen for modifying their RPV. Further work includes analyzing the RPV and the trace to infer whether the RPV reflects the kind of sharing that is present, and studying the impact of false sharing and how to eliminate it.

16 Thank You shibdas@gmail.com

