Cache coherence for CMPs. Miodrag Bolic. Presentation transcript:
Private cache. Each cache bank is private to a particular core. Cache coherence is maintained at the L2 cache level. Examples: Intel Montecito, AMD Opteron, and IBM POWER6.
Private cache. Advantages: short L2 cache access latency; a small amount of network traffic, since the local L2 cache bank filters most memory requests and so limits the number of coherence messages injected into the interconnection network. Disadvantages: data blocks can be duplicated across caches; if the working set accessed by the different cores is not well balanced, some caches can be over-utilized while others are under-utilized.
Shared cache. Cache coherence is maintained at the L1 level. The bits chosen to map a block to a particular bank are usually the least significant bits of the block address. Examples: Piranha, Hydra, Sun UltraSPARC T2, and Intel Merom.
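The bank mapping described on this slide can be sketched as follows. This is only an illustration: the function name `home_bank`, the default of 8 banks, and the 64-byte block size are assumptions made for the example and are not taken from any of the chips listed.

```python
def home_bank(address: int, num_banks: int = 8, block_size: int = 64) -> int:
    """Return the shared-L2 bank that holds the block containing `address`.

    Assumption of this sketch: num_banks and block_size are powers of two.
    """
    for value in (num_banks, block_size):
        if value < 1 or value & (value - 1):
            raise ValueError("num_banks and block_size must be powers of two")
    block_number = address // block_size    # drop the block-offset bits
    return block_number & (num_banks - 1)   # keep the least significant bits


# Consecutive blocks land on consecutive banks and wrap around.
banks = [home_bank(b * 64) for b in range(10)]
```

With these assumed parameters, neighbouring blocks are spread over all banks regardless of which core touches them.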
Shared caches. Advantages: a single copy of each block; workload balancing, since the utilization of each cache bank does not depend on the working set accessed by each core. Blocks are distributed uniformly among the cache banks in a round-robin fashion, which increases the effective aggregate cache capacity. Disadvantage: many requests will be serviced by remote banks (L2 NUCA architecture).
Hammer protocol. Used in AMD Opteron systems. It relies on broadcasting requests to all tiles to resolve cache misses, and it targets systems that use unordered point-to-point interconnection networks. On every cache miss, Hammer sends a request to the home tile. If the memory block is present on-chip, the home tile forwards the request to the rest of the tiles to obtain the requested block. Every tile answers the forwarded request by sending either an acknowledgement or the data message to the requesting core. The requesting core must wait until it has received a response from every other tile; it then sends an unblock message to the home tile.
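The message flow for a single miss can be sketched as a list of messages. This is a simplified model, not the actual protocol implementation: it assumes the block is on-chip, that exactly one tile (`owner`) supplies the data, and that the requester is neither the home tile nor the owner. The function name and the tuple layout are hypothetical.

```python
def hammer_miss_messages(requester: int, home: int, owner: int, num_tiles: int):
    """List the (kind, source, destination) messages for one cache miss.

    Assumptions of this sketch: the block is on-chip, `owner` holds the data,
    and the requester is neither the home tile nor the owner.
    """
    if requester == home or requester == owner:
        raise ValueError("sketch assumes requester differs from home and owner")
    others = [t for t in range(num_tiles) if t != requester]
    msgs = [("request", requester, home)]                          # hop 1
    msgs += [("forward", home, t) for t in others if t != home]    # hop 2: broadcast
    msgs += [("data" if t == owner else "ack", t, requester)       # hop 3: responses
             for t in others]
    msgs.append(("unblock", requester, home))  # sent once all responses have arrived
    return msgs


msgs = hammer_miss_messages(requester=0, home=1, owner=2, num_tiles=4)
```

In this model the message count grows linearly with the number of tiles, because every other tile must receive the forwarded request and respond to it.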
Hammer protocol. Disadvantages: it requires three hops in the critical path before the requested data block is obtained; broadcasting invalidation messages considerably increases the traffic injected into the interconnection network and, therefore, its power consumption.
Directory protocol. To speed up cache misses, the directory information is not stored in main memory. Instead, it is usually stored on-chip at the home tile of each block. In tiled CMPs, the directory structure is split into banks that are distributed across the tiles. Each directory bank tracks a particular range of memory blocks.
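A distributed directory of this kind can be sketched as one bank per tile, each bank recording the sharers of the blocks it is home for. Everything named here (`TiledDirectory`, `blocks_per_bank`, the range-based `home_tile` mapping, and the read/write handlers) is an assumption made for illustration; the slides do not specify the mapping or the entry format.

```python
class TiledDirectory:
    """One directory bank per tile; each bank maps block number -> set of sharer tiles."""

    def __init__(self, num_tiles: int, blocks_per_bank: int):
        self.blocks_per_bank = blocks_per_bank
        self.banks = [dict() for _ in range(num_tiles)]

    def home_tile(self, block: int) -> int:
        # Assumed mapping: contiguous ranges of blocks, wrapping around the tiles.
        return (block // self.blocks_per_bank) % len(self.banks)

    def read_miss(self, block: int, requester: int) -> list:
        """Record the requester as a sharer; return tiles that already hold the block."""
        entry = self.banks[self.home_tile(block)].setdefault(block, set())
        holders = sorted(entry)
        entry.add(requester)
        return holders

    def write_miss(self, block: int, requester: int) -> list:
        """Return the tiles to invalidate; the requester becomes the only holder."""
        bank = self.banks[self.home_tile(block)]
        to_invalidate = sorted(bank.get(block, set()) - {requester})
        bank[block] = {requester}
        return to_invalidate


directory = TiledDirectory(num_tiles=4, blocks_per_bank=1024)
first = directory.read_miss(block=5, requester=0)
second = directory.read_miss(block=5, requester=2)
invalidated = directory.write_miss(block=5, requester=3)
```

Because the home bank knows the sharers, a write only needs to contact the tiles that actually hold the block, rather than broadcasting to all of them.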
Directory protocol. Indirection problem: every cache miss must reach the home tile before any coherence action can be performed, which adds extra hops to the critical path of the miss. Memory overhead: the directory storage needed to keep track of the sharers of each memory block can become intolerable for large-scale configurations. Example: block size 16 bytes, 64 tiles.
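The example can be worked through with a short calculation. The sketch assumes a full-map bit vector, meaning one presence bit per tile for every memory block; the slide does not state the directory format, so this is an assumption. State bits and tags are ignored, and the function name is hypothetical.

```python
def full_map_overhead(block_size_bytes: int, num_tiles: int) -> float:
    """Sharer-tracking bits per data bit, assuming one presence bit per tile."""
    data_bits = block_size_bytes * 8
    return num_tiles / data_bits


# Slide example: 16-byte blocks, 64 tiles -> 64 sharer bits per 128 data bits.
overhead = full_map_overhead(block_size_bytes=16, num_tiles=64)
```

Under these assumptions the directory adds storage equal to half of the data it tracks, and the ratio grows linearly with the number of tiles.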