Lucía G. Menezo, Valentín Puente, José Ángel Gregorio


Flask Coherence: A Morphable Hybrid Coherence Protocol to Balance Energy, Performance and Scalability
Lucía G. Menezo, Valentín Puente, José Ángel Gregorio

Motivation
- CMPs have increasingly complex on-chip cache hierarchies.
- Keeping them coherent (and programmers sane) requires a hardware mechanism: the coherence protocol.
- What type of protocol should be used? There is no universal solution; each choice trades off cost, energy, and performance.
HPCA 2015 Lucia G. Menezo

Pure coherence: directory-based
- A structure holds all the coherence information.
- Demands directory inclusivity.
- Duplicate-tag directories become inefficient with large numbers of cores because of the associativity they require.
- To meet energy constraints, directory capacity is overprovisioned to minimize evictions in the private caches.
- As private cache sizes grow, the number of blocks that must be tracked grows with them.

Pure coherence: broadcast-based
- A miss in the private caches triggers a broadcast request.
- Better resource utilization: neither inclusivity nor additional structures to track block copies are required.
- More traffic and more cache snoops, hence lower energy efficiency; at medium/large scale the traffic impact is noticeable, possibly unsustainable.
- Better performance, although on-chip resource contention can degrade it.

Hybrid coherence? Flask
- Goal: the performance of broadcast-based coherence with the energy efficiency of directory-based coherence.
- A structure tracks shared blocks precisely (only the actively shared ones) and private blocks approximately.
- After some misses in this structure, Flask falls back to broadcast.
- Additional features minimize on-chip and off-chip traffic.

Flask Coherence Controller
- The directory only tracks the tags of actively shared blocks (not all of them).
- A filter knows (approximately) whether a block is inside the chip.
- Token coherence is used to guarantee the coherence invariants and to monitor when to update the Flask structures.
[Figure: a Flask Controller (FC) attached to each LLC slice (LLC0 ... LLCn), connected to the cores and their private caches through the on-chip network.]
Token coherence rules:
- Initially each block holds a fixed number of tokens, one per processor.
- A read request needs the data plus at least one token.
- A write request needs the data plus all the tokens.
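The token rules above can be sketched as a toy model (all names here are hypothetical; this is a simplification of token counting for illustration, not the actual Flask hardware):

```python
# Toy model of the token-counting rules: a block starts with one token
# per processor; a read needs data plus at least one token, a write
# needs data plus ALL tokens. Hypothetical sketch only.

NUM_PROCS = 4

class Block:
    def __init__(self):
        # Initially all tokens (and the data) live in one place, e.g. memory.
        self.tokens = {"mem": NUM_PROCS}

    def can_read(self, agent):
        return self.tokens.get(agent, 0) >= 1

    def can_write(self, agent):
        return self.tokens.get(agent, 0) == NUM_PROCS

    def move_tokens(self, src, dst, n):
        assert self.tokens.get(src, 0) >= n
        self.tokens[src] = self.tokens.get(src, 0) - n
        self.tokens[dst] = self.tokens.get(dst, 0) + n
        # Invariant: the total number of tokens never changes.
        assert sum(self.tokens.values()) == NUM_PROCS

b = Block()
b.move_tokens("mem", "P0", 1)              # P0 reads: data + 1 token
assert b.can_read("P0") and not b.can_write("P0")
b.move_tokens("mem", "P0", NUM_PROCS - 1)  # P0 collects all tokens to write
assert b.can_write("P0")
```

The conserved token count is what lets Flask verify coherence invariants without an exact directory: as long as tokens are neither created nor destroyed, at most one writer can exist.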

Flask Coherence Basics
- Private block: present in only one private cache, so no entry is allocated in the directory. On replacement, the block moves to lower levels of the hierarchy down to the LLC.
- Actively shared block: a block present in some private cache is requested by another core; an entry must then be allocated in the directory.
- The directory entry of an actively shared block does not have to persist for the whole time the block is present in the private caches (see the next slide).

Flask Conceptual Approach
[Figure: directory, filter, Flask coherence controllers, last-level caches, and the private caches of cores P0-P3, annotated with per-block states and token counts.]
- If the directory has the information for a requested block, Flask works as a normal directory-based coherence protocol: the request is forwarded to the owner, which sends a copy to the requestor.
- If the directory must replace an entry for lack of space, it can do so without invalidating the copies of the block present in the private caches.
- When a request later arrives for a block the directory no longer tracks, the controller has to find the block and reconstruct its sharing information (if it is shared). It checks whether the requested address is present in the filter:
  - If the filter says the address is NOT inside the chip, the request goes to memory, which replies with the block and all its tokens. The block is private, so the directory does not allocate an entry for it.
  - If the filter says the block IS inside the chip, the controller checks whether the LLC holds the data with all the tokens; if not, it broadcasts a request to every coherence agent, and each replies with the number of tokens it holds. The broadcast also carries the identity of the requestor, so the owner can resolve the pending operation during the reconstruction process, keeping it out of the critical path.
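The decision flow above can be sketched in Python, modeling the directory, filter, and LLC as plain containers (all helper names and return strings are invented for illustration):

```python
# Hypothetical sketch of the Flask request-handling decision described
# above; the real controller is hardware, these names are invented.

ALL_TOKENS = 4  # one token per core in a 4-core example

def handle_request(addr, directory, flask_filter, llc):
    """directory: addr -> sharing info; flask_filter: set of on-chip
    addresses (may give false positives); llc: addr -> tokens held."""
    if addr in directory:
        return "forward_to_owner"       # normal directory-based path
    if addr not in flask_filter:
        # Filter says the block is not on chip: fetch it from memory
        # with all its tokens; it is private, no directory entry needed.
        return "fetch_from_memory"
    if llc.get(addr, 0) == ALL_TOKENS:
        return "serve_from_llc"         # LLC already owns every token
    # Filter says on-chip, but the LLC lacks tokens: broadcast to all
    # coherence agents to reconstruct the sharing information.
    return "broadcast_reconstruction"

assert handle_request(0x40, {0x40: ["P1"]}, set(), {}) == "forward_to_owner"
assert handle_request(0x80, {}, set(), {}) == "fetch_from_memory"
assert handle_request(0x80, {}, {0x80}, {0x80: 4}) == "serve_from_llc"
assert handle_request(0x80, {}, {0x80}, {0x80: 1}) == "broadcast_reconstruction"
```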

Reconstruction process
All coherence agents respond with the number of tokens they hold for the requested block. Three possible cases:
1. The LLC has all the tokens: there is no other copy of the block on chip. The block is not actively shared, so no directory entry is allocated (in this case the reconstruction broadcast is not even sent to the private caches).
2. Some private cache holds tokens: the block is actively shared, and an entry is allocated in the directory.
3. None of them holds any tokens: the filter gave a false positive (see the paper for the probabilities involved).
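Since the token count per block is fixed, the three cases can be written as a small classifier (a hypothetical sketch; names are invented):

```python
# Classifies the outcome of a reconstruction broadcast from the token
# counts reported by each agent; hypothetical sketch of the three
# cases listed above.

ALL_TOKENS = 4  # one token per core in a 4-core example

def classify_reconstruction(llc_tokens, private_tokens):
    if llc_tokens == ALL_TOKENS:
        return "private"           # case 1: no directory entry allocated
    if any(t > 0 for t in private_tokens):
        return "actively_shared"   # case 2: allocate a directory entry
    return "false_positive"        # case 3: the filter was wrong

assert classify_reconstruction(4, [0, 0, 0, 0]) == "private"
assert classify_reconstruction(1, [3, 0, 0, 0]) == "actively_shared"
assert classify_reconstruction(0, [0, 0, 0, 0]) == "false_positive"
```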

Flask Filter
- Built from d-left Counting Bloom Filters (dlCBF), which double the efficiency of the counters of a conventional CBF.
- Each filter is attached to a directory entry and tracks all the tags that map to that entry.
- A combination of a hash function and permutations determines which bits are modified (see the paper for the Flask filter design details).
Reference: F. Bonomi et al., "An Improved Construction for Counting Bloom Filters," ESA '06, 14th Annual European Symposium on Algorithms, 2006.
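For intuition, a plain counting Bloom filter behaves as sketched below; the Flask filter uses the more space-efficient d-left variant of Bonomi et al., which this toy code does not reproduce:

```python
# Simplified counting Bloom filter illustrating the approximate
# membership the Flask filter provides: insert/remove adjust counters,
# lookup may give false positives but never false negatives.
# Hypothetical sketch; NOT the d-left construction used by Flask.
import hashlib

class CountingBloomFilter:
    def __init__(self, num_counters=64, num_hashes=3):
        self.counters = [0] * num_counters
        self.k = num_hashes

    def _indexes(self, tag):
        # Derive k counter positions from independent hashes of the tag.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{tag}".encode()).hexdigest()
            yield int(h, 16) % len(self.counters)

    def insert(self, tag):      # "increment filter", e.g. on an on-chip miss
        for i in self._indexes(tag):
            self.counters[i] += 1

    def remove(self, tag):      # "decrement filter", e.g. on LLC eviction
        for i in self._indexes(tag):
            self.counters[i] -= 1

    def __contains__(self, tag):
        return all(self.counters[i] > 0 for i in self._indexes(tag))

f = CountingBloomFilter()
f.insert(0x1A40)
assert 0x1A40 in f
f.remove(0x1A40)
assert 0x1A40 not in f
```

Counters (rather than single bits) are what make removal possible, which Flask needs when the LLC evicts a block with all its tokens.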

When is the Flask filter modified?
- Increment: after an on-chip miss (with a counter-saturation avoidance mechanism).
- Decrement: when the LLC evicts a block holding all its tokens; in 99.99% of LLC replacements this is the case, and Flask is prepared for the rare evictions (<0.01%) in which the LLC does not hold all the tokens.
- Updates are overlapped with the main-memory access or performed after the LLC replacement, i.e., outside the critical path.
(See the paper for the Flask filter design details.)

Resource Partitioning
- Directory and filter are complementary: the more actively shared blocks, the less pressure on the filter; the more private blocks, the less pressure on the directory.
- The split between them can therefore be decided dynamically according to the workload characteristics.
[Figure: a single storage array partitioned between directory entries (tag + sharers C0, C1, ..., Cn) and filter entries.]
The way the d-left filter works allows a conventional single-ported SRAM to be used, so the same storage can hold either directory or filter information, and the controller logic can set the split according to the workload characteristics.

Evaluation methodology
- Number of cores: 16 @ 3 GHz
- IWin size / issue width: 196, 6-way
- Block size: 64 B
- Private L1 size / associativity: 32 KB I/D, 4-way
- L2: 256 KB, 8-way (exclusive with L1)
- L3 (shared): 16 MB, 16-way NUCA
- Mapping: static, interleaved across slices
- Memory capacity: 4 GB
- Max. outstanding memory operations: 64
- Topology: 4x4 mesh
[Figure: 4x4 mesh of tiles, each with a router, core, L1/L2 caches, an LLC slice (LLC0-LLC15), and a Flask controller (F).]

Flask Execution Time
- Protocols compared: TokenB, Dir, and Flask with directory sizes of 160%, 80%, 40%, 20%, 10%, and 5% (X% = space to track X% of the private-cache block tags; for Dir, 160% = 8K entries and 5% = 32 entries; for Flask, 160% = 4K entries and 5% = 16 entries).
- To keep implementation cost constant, Flask dedicates half of the storage capacity to the directory and half to the filter.
- Sparse directory performance degrades below 1/4 of the aggregate private cache capacity: as fewer blocks are tracked, private caches hit less often, because blocks are evicted from the private caches for lack of space in the directory. Flask lets these blocks stay in the private caches, so the sparse directory must fetch reused blocks from the LLC while Flask keeps them closer to the processor.
- The results are consistent with the sharing degree of the applications.
- TokenB shows unstable behavior due to on-chip contention.

Analyzing On-Chip Traffic
- Broadcast traffic of both protocols, with Flask normalized to TokenB; the avoided reconstruction broadcasts are shown to expose the filter's efficiency at each directory size.
- The number of false positives increases as the directory size is reduced: as more addresses map to the same directory entry, each filter has to track more elements with the same number of bits.
- Adapting resources: running the same applications while dynamically deciding how the storage capacity is distributed, false positives decrease when applications with many private blocks (multiprogrammed and numerical) get less directory and more filter, and compulsory reconstructions decrease when applications with a high sharing degree (server) get more directory and less filter.
- On average, traffic is cut in half compared to a static resource assignment.

On-chip memory hierarchy EDP
- Configurations compared: TokenB, Flask (5%), and Sparse Dir (160%).
- All of Flask's benefits come at no cost in energy.

Conclusions and future work
- Flask re-architects directory coherence protocols with benefits taken from snoop-based coherence: it improves performance and power (even with extreme configurations) without a high toll.
- Future work: cloud-computing scenarios; morphing directory and filter during application execution; multi-CMP hierarchical coherence, using the directory + filter + token information to minimize traffic between chips.

Thank you. Questions?


Sketch of a dlCBF filter
[Figure: the address is hashed; two permutations of the hash yield (bucket, remainder) pairs; each bucket (b0-b4) stores remainders, each with an associated counter.]

Memory latency overhead
- Protocols compared: TokenB, Dir, and Flask (160% - 80% - 40% - 20% - 10% - 5%).
- Off-chip requests are induced by LLC capacity misses, and both protocols handle the LLC in the same way.
- Numerical workloads: false positives hurt because they delay the off-chip access.
- Commercial workloads: the extra traffic of compulsory reconstructions increases contention, delaying the memory access.
- The effect is below 5% in the most adverse configuration, and unnoticeable in the average access time.

Performance with different associativities
Directory conflicts have a negligible impact on Flask performance because: 1) no external invalidations need to be performed after a directory eviction, and 2) the directory only tracks actively shared blocks.

Performance with different associativities (cont.)
FT shows very low directory reuse; again, directory conflicts have a negligible impact on Flask performance (no external invalidations after a directory eviction; only actively shared blocks are tracked).

Mosaic vs. Flask
- Flask only reconstructs shared blocks.
- Multi-programmed workloads: less traffic (though it increases as the filter is reduced).
- Commercial workloads: high sharing degree, so compulsory reconstructions are unavoidable.