1 Physical Design of Snoop-Based Cache Coherence in Multiprocessors Muge Guher
2 Cache Coherence Definition
A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order.
- Write propagation: writes are visible to other processes.
- Write serialization: all writes to the same location are seen in the same order by all processes. (The same guarantee extended to “all” locations is called write atomicity.)
  - E.g., if a read by P1 sees w1 followed by w2, reads by every other processor Pi see them in the same order.
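The write serialization condition can be illustrated with a small check. This is only a sketch: the function name `respects_write_order` and the assumption that every write stores a distinct value are illustrative choices, not part of the slides.

```python
def respects_write_order(write_order, observed):
    """Check one processor's reads of a single location against the write order.

    write_order: values in the order the writes were serialized (assumed distinct).
    observed:    values returned by successive reads on one processor.
    A processor may miss a write or see one repeatedly, but it must never
    see an older write after a newer one.
    """
    position = {value: index for index, value in enumerate(write_order)}
    latest = -1
    for value in observed:
        index = position[value]
        if index < latest:
            return False  # an older write became visible after a newer one
        latest = index
    return True
```

With writes serialized as w1 then w2, the observation sequence w1, w2 passes the check on every processor, while w2, w1 on any processor violates write serialization.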
3 Cache Coherence Snooping
Shared-memory multiprocessor environment
- Main memory is passive.
- Caches distribute state transitions to other caches and memory.
- All caches listen to snoop messages and act on them.
- Most machines use cache coherence protocols with different trade-offs.
- But performance (latency and bandwidth) also depends on the physical implementation:
  - Bus design
  - Cache design
  - Integration with memory
4 Cache Coherence Requirements
- Protocol algorithm
  - States
  - State transitions
  - Actions/outputs
- Physical design
  - Protocol intent is implemented in FSMs
    - Cache controller FSM: multiple states per miss
    - Bus controller FSM
    - Other controllers
  - Support for:
    - Multiple bus transactions
    - Multi-level caches
    - Split-transaction buses
5 Design Wish List
Implementation should:
- Be correct
- Require minimal extra hardware
- Offer high performance
High performance can be achieved with multiple events in progress, overlapping latencies.
- Leads to numerous complex interactions between events
- More bugs!
6 Design Issues with Implementing Snooping
- Cache controller and tags
- Bus-side and processor-side interactions
- Reporting snoop results: how and when
- Handling write-backs
- Non-atomic state transitions
  - The overall set of actions for a memory operation is not atomic
  - Race conditions
- Atomic operations
- Deadlock, livelock, starvation, serialization
7 Cache Controller and Tags
- Must “monitor” bus operations and “respond” to processor operations
- Two controllers: bus-side and processor-side
- Bus transactions: the bus-side controller captures the address and performs a tag check.
  - Fail: snoop miss, no action
  - Hit: run the cache coherence protocol, with a read-modify-write (RMW) on the state bits
- For single-level caches: a duplicate set of tags and state, or a dual-ported tag and state store
- The controller is both an initiator of and a responder to bus transactions.
- Data is not duplicated.
- Both sets of tags may be updated simultaneously.
[Figure: Single-level snoopy cache organization]
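The bus-side tag check can be sketched as follows. The slide does not fix a protocol, so this assumes MESI states and the usual BusRd/BusRdX reactions; the names `State` and `snoop` and the dict used as the bus-side tag store are hypothetical.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def snoop(tags, block, bus_op):
    """Bus-side tag check for one observed bus transaction.

    tags:   dict mapping block address -> State (the bus-side tag/state copy).
    bus_op: "BusRd" or "BusRdX" (assumed transaction names).
    Returns (hit, flush): whether the block was present, and whether this
    cache must supply its dirty copy.
    """
    state = tags.get(block, State.I)
    if state is State.I:
        return False, False          # snoop miss: no action
    flush = state is State.M         # only a modified copy is dirty
    # Read-modify-write on the state bits.
    if bus_op == "BusRd":
        tags[block] = State.S
    elif bus_op == "BusRdX":
        tags[block] = State.I
    return True, flush
```

Only the tags and state are consulted here; the data array is touched only when `flush` is true, which is why duplicating the tags (and not the data) is enough to keep snoops off the processor's access path.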
8 Reporting Snoop Results
How does memory know that another cache will respond and provide a copy of the block, so that it doesn’t have to?
- Uniprocessor
  - Initiator places an address on the bus.
  - Responder must acknowledge within a time-out window (wired-OR); otherwise, bus error.
- Snooping caches
  - All caches must report on the bus before the transaction can proceed.
  - The snoop result tells main memory whether it should respond or whether a cache has a modified copy of the block.
- When and how is the snoop result reported on the bus?
- For example, to implement the MESI protocol:
  - Memory needs to know: is the block dirty? Should it respond or not?
  - The requesting cache needs to know: is the block shared?
9 When to Report Snoop Results
- Within a fixed number of clock cycles from the address issue on the bus
  - Dual set of tags, with high-priority processor access to the tags
  - Both sets are inaccessible during processor updates.
  - Extra hardware and longer snoop latency, but a simple memory subsystem
  - Pentium Pro, HP servers, Sun Enterprise
- After a variable delay
  - Memory assumes one of the caches will supply the data until all have snooped and indicated results.
  - Easier to implement: tag-access conflicts are less of a concern
  - Higher performance: no need to assume the worst-case delay
  - SGI Challenge: memory fetches the data, then stalls until the snoop is complete
- Immediately
  - Main memory maintains a state bit per block, indicating whether the block is modified in a cache.
  - Complexity is introduced into the main memory subsystem.
10 How to Report Snoop Results
- Three wired-OR signals
  - 1, 2: two for snoop results
    - Shared: asserted if any cache has a copy
    - Dirty: asserted if some cache has a dirty copy; the dirty cache knows what action to take
  - 3: one indicating snoop valid. This is an inhibit signal, asserted until all processors have completed their snoop.
- The Illinois MESI protocol allows cache-to-cache transfers.
  - Retrieve data from other caches rather than memory.
  - Priority scheme needed
  - SGI Challenge, Sun Enterprise Server: only in exclusive or modified state
  - Challenge updates memory during cache-to-cache transfer (no shared-modified state)
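The three wired-OR lines can be modelled as a logical OR over per-cache outputs. The sketch below is a software analogy under assumed names (`SnoopReply`, `combine`, `memory_should_respond`, `requester_fill_state`); it also assumes MESI fill rules in which a dirty reply implies the requester loads the block as shared.

```python
from dataclasses import dataclass


@dataclass
class SnoopReply:
    shared: bool = False   # this cache holds a copy
    dirty: bool = False    # this cache holds a modified copy
    done: bool = True      # False while this cache is still snooping


def combine(replies):
    """Wired-OR of every cache's outputs: (shared, dirty, inhibit)."""
    shared = any(r.shared for r in replies)
    dirty = any(r.dirty for r in replies)
    inhibit = any(not r.done for r in replies)   # held until all have snooped
    return shared, dirty, inhibit


def memory_should_respond(replies):
    """None while the snoop is not yet valid, else whether memory supplies data."""
    shared, dirty, inhibit = combine(replies)
    if inhibit:
        return None
    return not dirty


def requester_fill_state(replies):
    """State in which the requester loads the block after a BusRd (assumed MESI)."""
    shared, dirty, inhibit = combine(replies)
    if inhibit:
        return None
    return "S" if (shared or dirty) else "E"
```

Memory acts only once `inhibit` is released, which corresponds to the variable-delay reporting scheme; with fixed-delay reporting the inhibit line would not be needed.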
11 Single-Level Snooping Cache
Assumptions
- Single-level write-back cache
- Invalidation protocol
- Processor can have one memory request outstanding
- The system bus is atomic
[Figure: Snooping cache design]
12 Multi-level Cache Hierarchies
How would the design of a cache controller be modified for L1/L2 caches?
- Complicates coherence
  - Changes made by the processor to the L1 cache may not be visible to the L2 cache controller, which is responsible for bus operations.
  - Bus transactions are not directly visible to the L1 cache.
- A solution: independent bus-snooping hardware for each level of the cache hierarchy
  - The L1 cache is usually on the processor chip; an on-chip snooper consumes pins to monitor the shared bus.
  - Duplicating tags consumes too much on-chip area.
  - Duplication of effort between L1 and L2 snoops
- Intel’s 8870 chipset has a “snoop filter” for quad-core
13 How do you guarantee coherence in a multi-level cache hierarchy?
Better solution: based on the “inclusion property”
- If a memory block is in the L1 cache, it must also be in the L2 cache.
- If the block is in modified (or shared-modified) state in the L1 cache, then its copy in L2 must also be marked modified.
Therefore:
- Only a snooper at L2 is necessary, as it has all the required information.
- If a BusRd requests a block that is in modified state in either cache, then L2 can waive the memory access and inform L1.
Now information flows both ways:
- L1 accesses L2 for cache miss handling and block state changes.
- L2 forwards to L1 the blocks invalidated/updated by bus transactions.
14 Inclusion Property
Difficulties with maintaining the inclusion property:
- L1 and L2 may have different eviction algorithms (replacement differences).
  - A block still kept by L1 may be evicted by L2.
- Separate data and instruction caches
- Different cache block sizes
In one commonly encountered case, inclusion is maintained automatically:
- L1 is direct mapped
- L2 is either direct mapped or set associative
- Same block size for both caches
- Number of sets in L1 is smaller than in L2
15 Explicitly Maintaining Inclusion
Extend the mechanisms used for propagating coherence events to the cache hierarchy.
- Propagate L2 replacements to L1
  - Invalidate or flush messages
- Propagate bus transactions from L2 to L1
  - Send all transactions to L1 (even if the given block is not present there), or
  - Add extra state to L2 (a bit per block) that records which blocks in L2 are also in L1 (inclusion bit)
- On write: propagate modified state from L1 to L2. If L1 is:
  - Write-through (so all modifications also affect L2): invalidate
  - Write-back:
    - Add a state bit to every block in L2, “modified-but-stale”
    - Request a flush from L1 on a bus read
- L2 serves as a filter for the L1 cache, screening out irrelevant transactions from the bus; i.e., dual tags are less critical with multilevel caches
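The replacement and filtering rules above can be sketched with a toy two-level model. Everything here is illustrative: the class `TwoLevelCache` and its method names are hypothetical, block data and coherence states are omitted, and L1 is assumed to tell L2 about its own evictions so that the inclusion bit stays exact.

```python
class TwoLevelCache:
    """Toy L1/L2 pair that keeps L1 a subset of L2 (presence only, no data)."""

    def __init__(self):
        self.l1 = set()        # block addresses present in L1
        self.l2 = {}           # block address -> inclusion bit (also in L1?)
        self.l1_snoops = 0     # bus transactions actually forwarded to L1

    def fill(self, block):
        """Bring a block into both levels on a processor miss."""
        self.l2[block] = True
        self.l1.add(block)

    def l1_evict(self, block):
        """L1 replacement; assumed to notify L2 so the inclusion bit is cleared."""
        self.l1.discard(block)
        if block in self.l2:
            self.l2[block] = False

    def l2_evict(self, block):
        """L2 replacement is propagated up as an invalidation if L1 has the block."""
        if self.l2.pop(block, False):
            self.l1.discard(block)

    def bus_invalidate(self, block):
        """Bus-side invalidation: L2 filters it and disturbs L1 only if needed."""
        in_l1 = self.l2.pop(block, False)
        if in_l1:
            self.l1_snoops += 1
            self.l1.discard(block)

    def inclusion_holds(self):
        return self.l1 <= set(self.l2)
```

In this model a bus invalidation for a block that is absent from L2, or present only in L2, never reaches L1, which is the filtering effect the slide describes.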
16 Propagating Transactions for Coherence in Hierarchy
How is the transaction propagated for multilevel caches? Show some examples of modern systems.
- Only one transaction on the bus at a time
- Transactions are propagated up and down the hierarchy; bus transactions may be held until propagation completes.
- The performance penalty for holding a processor write until BusRdX has been granted is high, so there is motivation to decouple these operations.
[Figure: Two-level snoopy cache organization]
17 Split Transaction Bus
In a split-transaction bus (STB), transactions that require a response are split into two independent sub-transactions: a request transaction and a response transaction.
- Arbitrate each phase separately
- Other transactions are allowed to intervene between request and response
- Buffering between the bus and the cache controllers allows multiple outstanding transactions (waiting for snoop and/or data responses)
- Pro: pipelining bus operations uses the bus more efficiently.
- Con: increased complexity.
[Figure: split-transaction bus timing; labels: Bus arbitration, Address/CMD, Mem Access Delay, Data]
18 Issues Supporting STBs
- A new request can appear on the bus before the snoop and/or servicing of an earlier request is complete; these may be conflicting requests (same block).
- The number of buffers for incoming requests and potential data responses from bus to cache controller is usually fixed and small, so flow control is needed.
- Since requests from the bus are buffered, when and how are snoop and data responses produced on the bus?
  - In the same order as requests arrive?
  - Snoop and data response together or separately? Example: separately on Sun, together on SGI.
- There are 3 phases in a transaction:
  - A request is put on the bus
  - Snoop results are sent by other caches
  - Data is sent to the requesting cache, if needed
19 SGI Challenge Example
Features:
- Does not allow conflicting requests for the same block (8 outstanding requests)
- NACK flow control
  - NACK as soon as the request appears on the bus; requestor retries
- Separate command (incl. NACK) + address and tag + data buses
- Responses may be in a different order than requests
  - Order of transactions is determined by requests
  - Snoop results are presented on the bus with the response
Examine implementation specifics of:
- Bus design, request-response matching
- Snoop results
- Flow control
20 Bus Design
How would the design of a cache controller be modified for split-transaction buses? Show examples of modern split-transaction buses/systems. How many outstanding transactions are allowed?
Two independently arbitrated buses:
- Request: command + address (BusRd, BusWB + target address)
- Response: data
Match each response to its outstanding request, since responses arrive out of order:
- Tag each request with 3 bits (8 outstanding) when launched
- The tag arrives back with the corresponding response
- The address bus is free, as the tag is sufficient for request matching
- Address and data buses can be arbitrated separately
- Separate bus lines for arbitration, flow control and snoop results
21 Bus and Cache Controller Design
To keep track of outstanding requests on the bus:
- Each cache controller maintains an eight-entry buffer, the “request table”.
- A new request on the bus is added to all request tables at the same index.
  - The index is the 3-bit tag assigned at arbitration.
- A table entry contains the block address, request type, state in that cache, etc.
- The table is fully associative: a new entry can be placed anywhere in the table.
- It is checked for a match by the requesting processor and by all snooped requests and responses on the bus.
- The entry and tag are freed when the response is observed on the bus; the tag can then be reassigned by the bus.
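A minimal sketch of such a request table, assuming the SGI Challenge rule that conflicting requests for the same block are not allowed. The class `RequestTable`, its method names, and the choice of the lowest free index as the tag are hypothetical; a real table entry would also hold the per-cache state.

```python
class RequestTable:
    """Eight-entry table of outstanding bus requests, indexed by a 3-bit tag."""

    SIZE = 8  # 3-bit tag -> at most 8 outstanding requests

    def __init__(self):
        self.entries = [None] * self.SIZE   # tag -> (block_addr, req_type)

    def allocate(self, block_addr, req_type):
        """Record a newly granted request; return its tag, or None if refused.

        The request is refused when the block already has an outstanding
        request (conflict) or when all eight tags are in use.
        """
        if any(e is not None and e[0] == block_addr for e in self.entries):
            return None
        for tag, entry in enumerate(self.entries):
            if entry is None:
                self.entries[tag] = (block_addr, req_type)
                return tag
        return None

    def complete(self, tag):
        """Response observed on the bus: free the entry so the tag can be reused."""
        entry = self.entries[tag]
        self.entries[tag] = None
        return entry
```

Because every controller observes the same requests and responses, every controller's table ends up with identical contents at each tag, which is what lets a response carry only the 3-bit tag instead of the full address.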
22 Bus Interface and Request Table
[Figure: Bus interface logic to accommodate split-transaction bus]
23 Snoop Results & Request Conflicts
- Variable-delay snooping
- The snoop portion of the bus consists of three wired-OR lines: sharing, dirty, inhibit
- The request phase determines who will respond, but this may take many cycles, with intervening request and response transactions
- All controllers present their snoop results on the bus when they see the response
- No data response or snoop results for write-backs and upgrades
Avoid conflicts by:
- Every controller keeps a record of pending reads in its request table
- Don't issue a request for a block with an outstanding response
- Writes are performed during the request phase
- However, this does not ensure sequential consistency!
24 Flow Control
Implement flow control at:
- Incoming request buffers from bus to cache controller (write-back buffer)
- Cache subsystem has a response buffer (address + cache block of data)
  - Limit the number of outstanding requests
Flow control is also needed at main memory:
- Each of the 8 pending requests can generate a write-back to memory
- These can happen in quick succession on the bus
SGI Challenge: separate NACK lines for address and data buses
- Asserted before the ack phase of the request (response) cycle is done
- Request (response) is cancelled everywhere and retried later
- Backoff and priorities to reduce traffic and starvation
Sun Enterprise: destination initiates the retry when it has a free buffer
- Source keeps watch for this retry
- Space is guaranteed to still be there, so at most two “tries” are needed
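NACK-and-retry flow control can be sketched with a bounded buffer. The names `BoundedBuffer` and `send_with_retry` and the `on_nack` callback are hypothetical; real hardware would use backoff and priorities between retries rather than a callback.

```python
from collections import deque


class BoundedBuffer:
    """Fixed-size incoming buffer; a full buffer NACKs the sender."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, item):
        """Return True (accepted) or False (NACK: no free entry)."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(item)
        return True

    def drain_one(self):
        """Destination consumes one buffered item, freeing an entry."""
        return self.queue.popleft() if self.queue else None


def send_with_retry(buffer, item, max_tries, on_nack):
    """Retry a NACKed transfer; return the attempt that succeeded, or None."""
    for attempt in range(1, max_tries + 1):
        if buffer.offer(item):
            return attempt
        on_nack()   # stands in for the wait before the retry
    return None
```

For example, with a one-entry buffer that is already full, `send_with_retry(buf, "b", 3, buf.drain_one)` is NACKed once and succeeds on the second attempt, since the destination frees an entry in between.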
25 Preventing Violation of Sequential Consistency
- SC: serialization of operations to different locations
- With multiple outstanding requests on the bus, invalidations are buffered between bus and cache and are not applied to the cache immediately
- Commitment versus completion
  - The value produced by a committed write may not yet be visible to other processors
- Condition necessary for SC: a processor should not be allowed to actually see the new value of a write before previous writes (in bus order) are visible to it. Two ways to ensure this:
  - Not letting certain types of incoming transactions from bus to cache be reordered in the incoming queues
  - Allowing these reorderings in the queues, but then ensuring that the important orders are preserved at the necessary points in the machine
- A simpler approach is to treat all requests in FIFO order. Although simpler, this approach can have performance problems.
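The FIFO approach can be sketched as a single incoming queue drained strictly in bus order. The function `apply_next`, the tuple layout of queue entries, and the dict standing in for the cache are illustrative assumptions.

```python
from collections import deque


def apply_next(cache, incoming):
    """Apply the oldest buffered bus event to the cache (strict FIFO order).

    incoming holds (kind, block, value) tuples in bus order, where kind is
    "inval" (another processor's write) or "data" (reply to our own read).
    Returns (kind, block) for the event that was applied.
    """
    kind, block, value = incoming.popleft()
    if kind == "inval":
        cache.pop(block, None)      # drop the stale copy
    elif kind == "data":
        cache[block] = value        # new value becomes visible to the processor
    return kind, block
```

Suppose the bus order is an invalidation of A followed by a data reply carrying the new value of B. Draining in FIFO order guarantees the stale copy of A is gone by the time B becomes visible; if the data reply were allowed to bypass the invalidation, the processor could read the new B and still hit on the old A.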
26 Multi-level Caches and STB
- A request takes a considerable number of cycles to propagate through the cache hierarchy
- Allow other transactions to move up and down the hierarchy while waiting
- To maintain high bandwidth while allowing the individual units (controllers and caches) to operate at their own rates, queues are placed between levels of the hierarchy
- Leads to deadlock and serialization issues
27 Deadlock
Fetch deadlock:
- Must buffer incoming requests/responses while a request is outstanding
- One outstanding request per processor => need space to hold p requests plus one reply (the latter is essential)
- If smaller (or if multiple outstanding requests), may need to NACK
- Then need a priority mechanism in the bus arbiter to ensure progress
Buffer deadlock:
- L1-to-L2 queue filled with read requests, waiting for a response from L2
- L2-to-L1 queue filled with bus requests, waiting for a response from L1
- The latter condition arises only when a cache closer to the processor than the lowest level is write-back
- Could provide enough buffering, but that requires a lot of area and is not scalable
- Queues may need to support bypassing
28 Sequential Consistency
- Separation of commitment from completion is even greater with multi-level caches
- Do not wait for an invalidation to reach all the way up to L1 and return a reply; consider the write committed when it is placed on the bus
- Fortunately, the techniques for a single-level cache and ST bus extend; either method works:
  - Don’t allow certain reorderings of transactions at any level
  - Don’t let an outgoing operation proceed past a level before incoming invalidations/updates at that level are applied
29 Shared Cache Designs
Are there any shared L2 cache solutions that are based on a bus network? How does the bus network need to be modified to support shared caches?
Benefits of sharing a cache:
- Eliminates the need for cache coherence at this level
  - If the L1 cache is shared, there are no multiple copies of a cache block and hence no coherence problem
- Reduces the latency of communication
  - L1 communication latency is 2-10 clocks; main-memory latency is many times larger
  - Reduced latency enables finer-grained sharing of data
- Prefetching data across processors
  - With private caches, each processor incurs the miss penalty separately
- Reduces the bandwidth requirements at the next level of the hierarchy
- More effective use of long cache blocks, as there is no false sharing
- A shared cache can be smaller than the combined size of the private caches if working sets from different processors overlap
30 Shared Cache Designs
Extreme case: all processors share an L1 cache; below it is a shared memory
- Processors are connected to the shared cache by a switch, most likely a crossbar, to allow cache access by processors in parallel
- Support high bandwidth by interleaving cache and main memory
Disadvantages of sharing L1:
- Higher bandwidth demand
- Hit latency to a shared cache is higher than to a private one
- Higher cache complexity
- Shared caches are usually slower
- Instead of constructive interference (like the working set example), destructive interference can occur
31 Example of Shared Cache Designs
Alliant FX-8 machine (1980s)
- 8 custom processors
- Clock cycle 170 ns
- Processors connected using a crossbar to a 512 KB, 4-way interleaved cache
- Cache: 32-byte block size, direct mapped, write-back, 2 outstanding misses per processor
Encore Multimax (contemporary)
- Snoopy cache-coherent multiprocessor
- Each private cache supports two processors instead of one
Practical approach:
- Private L1 caches and a shared L2 cache among groups of processors
- Packaging considerations are also important