1 The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy
Computer Systems Laboratory, Stanford University

2 Designing a low-cost, high-performance multiprocessor
- Message-passing (multicomputer)
  - Distributed address space; each processor accesses only its local memory
  - More scalable
  - More cumbersome to program
- Shared-memory (multiprocessor)
  - Single address space; processors can access remote memory
  - Simpler to program (data partitioning, dynamic load distribution)
  - Consumes bandwidth and must keep caches coherent

3 DASH (Directory Architecture for Shared memory)
- Main memory is distributed among the processing nodes to provide scalable memory bandwidth
- A distributed directory-based protocol supports cache coherence

4 DASH architecture
- Processing node (cluster)
  - Bus-based multiprocessor
  - A snoopy protocol keeps caches coherent within the cluster, which amortizes the cost of the directory logic and network interface
- Set of clusters
  - Connected by a mesh interconnection network
  - A distributed directory-based protocol keeps summary information for each memory line, specifying the clusters that are caching it


6 Details
- Cache: private to each processor
- Memory: shared by the processors within the same cluster
- Directory memory: keeps track of all processors caching a block and sends them point-to-point messages (invalidate/update), avoiding broadcast
- Remote Access Cache (RAC): maintains the state of currently outstanding requests and buffers replies from the network while the waiting processor is released for bus arbitration

7 Designing a distributed directory-based protocol
- Correctness issues
  - Memory consistency model: strongly constrained or less constrained?
  - Deadlock: arises when servicing one request requires generating another, so that requests wait on each other in a cycle
  - Error handling: maintain data integrity and fault tolerance
- Performance issues
  - Latency
    - Write misses: write buffers, release consistency model
    - Read misses: minimize the number of inter-cluster messages and the delay of each message
  - Bandwidth: reduce serialization (queuing delays), traffic, and the number of messages; DASH relies on caches and distributed memory
- Distributed control and complexity issues
  - Distribute control among the components, balancing system performance against the complexity of the components


9 DASH prototype
- Cluster (node): Silicon Graphics PowerStation 4D/240
- 4 processors (MIPS R3000/R3010)
- L1: 64-Kbyte instruction cache and 64-Kbyte write-through data cache
- L2: 256-Kbyte write-back cache
  - Converts the write-through policy of L1 to a write-back policy
  - Provides the cache tags for bus snooping
  - Maintains consistency using the Illinois MESI protocol


11 Memory bus
- Separated into a 32-bit address bus and a 64-bit data bus
- Supports memory-to-cache and cache-to-cache transfers
- Transfers 16 bytes every 4 bus clocks with a latency of 6 bus clocks; maximum bandwidth 64 Mbytes/s
- Retry mechanism: when a request requires service from a remote cluster, the request is signaled to retry; the requesting processor is masked from bus arbitration and later unmasked, which avoids unnecessary retries
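The bandwidth figure on this slide follows from the transfer rate. As a quick arithmetic check, the sketch below derives the bus clock implied by the slide's numbers, reading the peak figure as 64 Mbytes/s. The resulting clock rate is inferred from those numbers and is not stated on the slide; the variable names are illustrative.

```python
# Figures taken from the slide: 16 bytes move every 4 bus clocks,
# and the peak bandwidth is read here as 64 Mbytes/s.
BYTES_PER_TRANSFER = 16
CLOCKS_PER_TRANSFER = 4
PEAK_BYTES_PER_SEC = 64_000_000

# Sustained bytes moved per bus clock at peak rate.
bytes_per_clock = BYTES_PER_TRANSFER / CLOCKS_PER_TRANSFER

# Bus clock that would be needed to reach the quoted peak (an inference).
implied_clock_hz = PEAK_BYTES_PER_SEC / bytes_per_clock

print(f"{bytes_per_clock:.0f} bytes/clock, "
      f"implied bus clock {implied_clock_hz / 1e6:.0f} MHz")
```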

12 Modifications
- Directory controller board: maintains inter-node cache coherence and interfaces to the interconnection network
  - Directory controller (DC): contains the directory memory corresponding to its portion of main memory; initiates outbound network requests
  - Pseudo-CPU (PCPU): buffers incoming requests and issues them on the bus
  - Reply controller (RC): tracks outstanding requests made by local processors, receives and buffers the corresponding replies from remote clusters, and acts as memory when a processor retries its request
  - Interconnection network: 2 wormhole-routed meshes (one for requests, one for replies)
  - Hardware monitoring logic and miscellaneous control and status registers: the logic samples directory board and bus events to derive usage and performance statistics



15 Directory memory
- Array of directory entries
- One entry for each memory block
- A single state bit (shared/dirty)
- A bit vector of pointers, one for each of the 16 clusters
- Directory information is combined with the bus operation, the address, and the result of snooping within the cluster
- From these, the DC generates network messages and bus controls

16 Assume N processors. Each cache block in memory has N presence bits (a bit vector) and 1 dirty bit (the state bit).
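The entry layout described here (N presence bits plus one dirty bit) can be sketched as a small data structure. This is a minimal illustration, not the prototype's hardware format: the class name, method names, and the use of a Python integer as the bit vector are assumptions, and N is set to the 16 clusters mentioned for the prototype.

```python
from dataclasses import dataclass

NUM_CLUSTERS = 16  # the prototype's bit vector covers 16 clusters


@dataclass
class DirectoryEntry:
    """Hypothetical full-bit-vector directory entry for one memory block."""
    presence: int = 0    # bit i set => cluster i holds a copy
    dirty: bool = False  # True => a single cluster holds a modified copy

    def add_sharer(self, cluster: int) -> None:
        self.presence |= 1 << cluster

    def set_owner(self, cluster: int) -> None:
        # Exclusive ownership: exactly one presence bit, dirty bit set.
        self.presence = 1 << cluster
        self.dirty = True

    def sharers(self) -> list[int]:
        return [c for c in range(NUM_CLUSTERS) if (self.presence >> c) & 1]

    def state(self) -> str:
        if self.presence == 0:
            return "uncached"
        return "dirty" if self.dirty else "shared"


entry = DirectoryEntry()
entry.add_sharer(3)
entry.add_sharer(7)
print(entry.state(), entry.sharers())
entry.set_owner(5)
print(entry.state(), entry.sharers())
```

Because the entry records exactly which nodes hold a copy, invalidations can be sent point-to-point to `sharers()` instead of being broadcast.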

17 Remote Access Cache (RAC)
- Maintains the state of currently outstanding requests from the RC
- Buffers replies from the network while the waiting processor is released for bus arbitration
- Supplements the functionality of the processors' caches
- Supplies data cache-to-cache when the released processor retries the access

18 DASH cache coherence protocol
- Local cluster: the cluster that contains the processor originating a given request
- Home cluster: the cluster that contains the main memory and directory for a given physical memory address
- Remote cluster: any other cluster
- Owning cluster: the cluster that owns a dirty memory block
- Local memory: the main memory associated with the local cluster
- Remote memory: any memory whose home is not the local cluster

19 DASH cache coherence protocol
- Invalidation-based ownership protocol
- Memory block states
  - Uncached-remote: not cached by any remote cluster
  - Shared-remote: cached in an unmodified state by one or more remote clusters
  - Dirty-remote: cached in a modified state by a single remote cluster
- Cache block states
  - Invalid: the copy in the cache is stale
  - Shared: other processors may also be caching that location
  - Dirty: this cache contains an exclusive copy of the memory block, and the block has been modified

20 Three primitive operations
- Read request (load)
  - Hit in L1: the cache simply supplies the data
  - Hit in L2: a fill operation finds the required block and brings it into L1
  - Otherwise: a read request is sent on the bus
    - Shared-local: the data is simply transferred over the bus
    - Dirty-local: the RAC takes ownership of the cache line
    - Uncached-remote or shared-remote: the home cluster sends the data over the reply network to the requesting cluster
    - Dirty-remote: the home cluster forwards the request to the owning cluster, which sends the data to the requesting cluster and a sharing write-back request to the home cluster
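The remote cases of the read request can be sketched as a function that runs at the home cluster and returns the messages it causes. This is a simplified model under stated assumptions: the dictionary layout, the function name, and the message strings are all hypothetical, and the directory update is applied immediately rather than when the sharing write-back actually arrives.

```python
def home_handle_read(entry: dict, requester: int) -> list[str]:
    """Return the messages triggered by a remote read request (illustrative)."""
    if entry["state"] in ("uncached-remote", "shared-remote"):
        # Home memory is up to date: reply directly over the reply network.
        entry["state"] = "shared-remote"
        entry["sharers"].add(requester)
        return [f"read-reply(data): home -> cluster {requester}"]

    # Dirty-remote: forward to the owner, which answers the requester
    # directly and sends a sharing write-back to the home cluster.
    owner = entry["owner"]
    messages = [
        f"forwarded-read: home -> cluster {owner}",
        f"read-reply(data): cluster {owner} -> cluster {requester}",
        f"sharing-write-back: cluster {owner} -> home",
    ]
    # Simplification: record the new state now instead of on write-back arrival.
    entry["state"] = "shared-remote"
    entry["owner"] = None
    entry["sharers"] = {owner, requester}
    return messages


block = {"state": "dirty-remote", "sharers": {2}, "owner": 2}
for message in home_handle_read(block, requester=5):
    print(message)
```

Having the owner reply straight to the requester, rather than through the home cluster, is what keeps the dirty-remote case to three network messages on the critical path.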

21 Forwarding strategy
- Reduces latency because the owning cluster responds directly to the requester
- Lets the directory process many requests simultaneously (multithreaded), which reduces serialization
- Adds latency when simultaneous accesses are made to the same block: the 1st request is satisfied and the dirty cluster loses ownership, so the 2nd request returns a negative acknowledge (NAK) that forces a retry

22 Read-exclusive request (store)
- Block in local memory: write, and invalidate other copies
- Dirty-remote: the owning processor invalidates the block in its cache, sends the data and a grant of ownership to the requesting cluster, and sends an ownership update message to the home cluster
- Uncached-remote or shared-remote: write; in the shared state, also send invalidation requests to the sharing clusters

23 Acknowledgments
- Needed so the requesting processor knows when the store has completed with respect to all processors
- Maintain consistency by guaranteeing that the new owner will not lose ownership before the directory has been updated
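The role of acknowledgments in a store to a shared block can be sketched with a simple counter: the home cluster picks the clusters to invalidate, and the store counts as complete only after every one of them has acknowledged. The function and class names are hypothetical, and keeping the count on the requester's side is an assumption of this sketch.

```python
def invalidation_targets(sharers: set[int], requester: int) -> list[int]:
    """Clusters that must drop their copy before the store completes."""
    return sorted(sharers - {requester})


class PendingStore:
    """Tracks invalidation acknowledgments for one read-exclusive request."""

    def __init__(self, acks_expected: int) -> None:
        self.acks_remaining = acks_expected

    def on_ack(self) -> None:
        self.acks_remaining -= 1

    @property
    def complete(self) -> bool:
        # Complete with respect to all processors once every ack is in.
        return self.acks_remaining == 0


targets = invalidation_targets({1, 4, 9}, requester=4)
store = PendingStore(len(targets))
for cluster in targets:
    print(f"invalidate -> cluster {cluster}")
    store.on_ack()  # in hardware the ack arrives later, over the reply network
print("store complete:", store.complete)
```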

24 Write-back request
- A dirty cache line that is replaced must be written back to memory
- Home cluster is local: write the line back to main memory
- Home cluster is remote: send a message to the remote home cluster, which updates its main memory and marks the block uncached-remote

25 Bus-initiated cache transactions
- Transactions made by caches snooping the bus
- Read operation: the dirty cache supplies the data and changes to the shared state
- Read-exclusive operation: all other cached copies are invalidated
- When a line in L2 is invalidated, L1 invalidates it as well

26 Exception conditions
- A request forwarded to a dirty cluster may arrive there to find that the dirty cluster no longer owns the data, because
  - A prior access changed the ownership, or
  - The owning cluster performed a write-back
- Solution: the requesting cluster is sent a NAK response and is required to reissue the request (the mask is released and the retry is treated as a new request)
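The NAK-and-reissue behavior can be sketched as a retry loop. This is a toy model: the function names are hypothetical, the list passed in stands for the owner the home directory names on each successive attempt, and the retry cap is an assumed stand-in for a hardware time-out.

```python
def forward_to_owner(actual_owner: int, target: int) -> str:
    """The target replies with data only if it still owns the block."""
    return "DATA" if target == actual_owner else "NAK"


def read_with_retry(directory_views: list[int], actual_owner: int,
                    max_tries: int = 4) -> int:
    """Reissue the request after each NAK; return the attempt that succeeded.

    directory_views[i] is the owner named by the home directory on attempt i.
    """
    for attempt, target in enumerate(directory_views[:max_tries], start=1):
        if forward_to_owner(actual_owner, target) == "DATA":
            return attempt
        # NAK: the request is reissued and handled as a brand-new request.
    raise TimeoutError("request still NAKed after the retry limit")


# First attempt reaches a stale owner (cluster 2); the directory has been
# updated by the time the request is reissued, so the second attempt succeeds.
print(read_with_retry([2, 7], actual_owner=7))
```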

27 Ownership may bounce back and forth between two remote clusters while the requesting cluster receives multiple NAKs
- The request eventually times out
- A bus error is returned
- Solution: add an additional directory state and an access queue; respond to all read-only requests, and grant ownership to each exclusive request on a pseudo-random basis

28 Because the request and reply networks are separate, some messages sent between 2 clusters can be received out of order
- Solution: replies are acknowledged, and out-of-order requests receive a NAK response

29 An invalidation request, sent to purge a read copy, can overtake the read reply that carries that copy
- Solution: when the RAC detects an invalidation request for a pending read, it changes the state of that RAC entry to invalidated-read-pending; the RC then assumes that any read reply is stale and treats it as a NAK response
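The fix for this race is a small state machine on the RAC entry. The sketch below models it under stated assumptions: the class, the state strings, and the use of `None` to mean "treat as NAK and retry" are illustrative choices, not the hardware encoding.

```python
from typing import Optional


class RacEntry:
    """Hypothetical RAC entry tracking one outstanding read."""

    def __init__(self) -> None:
        self.state = "read-pending"

    def on_invalidate(self) -> None:
        # The invalidation overtook the read reply: remember that the
        # reply, when it arrives, will carry a stale copy.
        if self.state == "read-pending":
            self.state = "invalidated-read-pending"

    def on_read_reply(self, data: bytes) -> Optional[bytes]:
        """Return the data, or None when the reply must be treated as a NAK."""
        stale = self.state == "invalidated-read-pending"
        self.state = "idle"
        return None if stale else data


normal = RacEntry()
print(normal.on_read_reply(b"block"))   # reply accepted

raced = RacEntry()
raced.on_invalidate()                    # invalidation arrives first
print(raced.on_read_reply(b"block"))    # reply discarded, processor retries
```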

30 Deadlock
- Hardware
  - 2 mesh networks with point-to-point message passing
  - Consuming an incoming message may require generating another outgoing message
- Protocol
  - Request messages: read, read-exclusive, and invalidation requests
  - Reply messages: read and read-exclusive replies, and invalidation acknowledgments
  - Each message class travels on its own mesh

31 Error handling
- Error-checking system
  - ECC on main memory
  - Parity checking on directory memory
  - Length checking of network messages
  - Checking for inconsistent bus and network messages
- Errors are reported to the processor through bus errors and associated error capture registers
- The issuing processor times out the originating request or fence operation
- The OS can clean up the state of a line by using back-door paths that allow direct addressing of the RAC and directory memory

32 Scalability of the DASH directory
- The amount of directory memory grows as memory size x number of processors
- Limit the pointers per entry, so no space is spent on processors that are not caching the line
- Allow pointers to be shared between directory entries
- Use a cache of directory entries to supplement or replace the normal directory
- Sparse directories with limited pointers and a coarse vector
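The scaling problem can be made concrete by comparing directory bits per memory block for a full bit vector and for a limited-pointer entry. This is a back-of-the-envelope sketch: the 16-byte block size, the choice of 4 pointers, and the function names are assumptions made for illustration.

```python
import math

BLOCK_BITS = 16 * 8  # assumed 16-byte memory block


def full_vector_bits(nodes: int) -> int:
    """One presence bit per node plus one state bit."""
    return nodes + 1


def limited_pointer_bits(nodes: int, pointers: int) -> int:
    """A fixed number of node pointers plus one state bit."""
    return pointers * math.ceil(math.log2(nodes)) + 1


for nodes in (16, 64, 256):
    full = full_vector_bits(nodes) / BLOCK_BITS
    limited = limited_pointer_bits(nodes, pointers=4) / BLOCK_BITS
    print(f"{nodes:4d} nodes: full vector {full:6.1%} of a block, "
          f"4 pointers {limited:6.1%}")
```

The full-vector entry grows linearly with the node count and eventually exceeds the block it describes, while the limited-pointer entry grows only with the logarithm of the node count.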

33 Validation of the protocol
- 2 software simulators as the basis for testing
  - Low-level DASH system simulator that incorporates the coherence protocol, caches, buses, and interconnection network
  - High-level functional simulator that models the processors and executes parallel programs
- 2 schemes for testing the protocol
  - Running existing parallel programs and comparing the output
  - Test scripts
- Hardware

34 Comparison with the Scalable Coherent Interface (SCI) protocol
- Similarities
  - Both rely on coherent caches maintained by distributed directories
  - Both rely on distributed memories to provide scalable memory bandwidth
- Differences
  - In SCI, the directory is a distributed sharing list maintained by the caches
  - In DASH, all the directory information is placed with main memory

35 SCI advantages and disadvantages
- Advantages
  - The amount of directory pointer storage grows naturally with the number of processors
  - Employs the SRAM technology used by the caches
  - Guarantees forward progress in all cases
- Disadvantages
  - Distributing the directory entries increases the complexity and latency of the directory protocol, because additional update messages must be sent between caches
  - Requires more inter-node communication
