LimitLESS Directories: A Scalable Cache Coherence Scheme By: David Chaiken, John Kubiatowicz, Anant Agarwal


1 LimitLESS Directories: A Scalable Cache Coherence Scheme
By: David Chaiken, John Kubiatowicz, Anant Agarwal
Presented by: Sampath Rudravaram

2 Cache Coherence
The gap between the computing power of microprocessors and that of the largest supercomputers is shrinking, while the price/performance advantage of microprocessors is increasing.
Caches enhance the performance of multiprocessors by reducing network traffic and average memory access time.
The cache coherence problem arises because multiple processors may be reading and modifying the same memory block within their own caches.
Common solutions:
-> Snoopy coherence
-> Directory-based coherence (the approach covered in this presentation)
-> Compiler-directed coherence

3 Directory (Full-map)
The message-based protocols allocate a section of the system's memory, called a directory.
Each block of memory has an associated directory entry, which contains one bit for each cache in the system.
That bit indicates whether or not the associated cache contains a copy of the memory block.

4 Directory based Coherence
The basic concept is that a processor must ask for permission to load an entry from the primary memory to its cache.
When an entry is changed, the directory must be notified either before the change is initiated or when it is complete.
When an entry is changed, the directory either updates or invalidates the other caches with that entry.

5 Directory based Coherence: Full-Map Directory Entry
Advantages:
-> No broadcast is necessary
Disadvantages:
-> Coherence traffic is high, because all requests go to the directory
-> Large memory requirement (directory size grows as Θ(N^2))
Entry layout: a State field (e.g. Read-Only) followed by one presence bit for each of the caches 1, 2, 3, ..., N (x marks a cache that holds a copy).
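The Θ(N^2) growth can be checked with a little arithmetic. The sketch below is illustrative only: `blocks_per_node` and `state_bits` are assumed parameters, not figures from the slides. It relies on the fact that total memory grows with the number of nodes N while every entry also carries N presence bits.

```python
def full_map_bits(num_nodes: int, blocks_per_node: int, state_bits: int = 2) -> int:
    """Total directory storage, in bits, for a full-map scheme.

    Assumptions (not from the slides): every node contributes
    blocks_per_node memory blocks, and each entry needs state_bits
    bits of state in addition to one presence bit per cache.
    """
    total_blocks = num_nodes * blocks_per_node   # memory grows with N
    bits_per_entry = num_nodes + state_bits      # one presence bit per cache
    return total_blocks * bits_per_entry


small = full_map_bits(64, 1024)
large = full_map_bits(128, 1024)
print(small, large, large / small)  # doubling N roughly quadruples the storage
```

Doubling the node count doubles both the number of entries and the width of each entry, which is where the quadratic term comes from.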

6 Directory based Coherence: Limited Directory Entry
Advantages:
-> Its performance is comparable to that of a full-map scheme in cases where there is limited sharing of data between processors
-> Cheaper to implement
Disadvantages:
-> The protocol is susceptible to thrashing when the number of processors sharing data exceeds the number of pointers in the directory entry
Entry layout: a State field (e.g. Read-Only) followed by four Node ID pointers (e.g. 12, 10, 13, 23).
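The thrashing behaviour can be illustrated with a toy model of a single four-pointer entry. This is a sketch under stated assumptions: the oldest-pointer eviction policy and the round-robin access pattern are chosen for illustration and are not taken from the slides.

```python
class LimitedDirectoryEntry:
    """Toy model of one limited-directory entry with a fixed number of pointers."""

    def __init__(self, num_pointers: int = 4):
        self.num_pointers = num_pointers
        self.pointers = []        # node IDs that hold a read-only copy
        self.invalidations = 0    # copies evicted to free a pointer

    def read(self, node: int) -> None:
        if node in self.pointers:
            return                # node already holds a copy
        if len(self.pointers) == self.num_pointers:
            self.pointers.pop(0)  # assumption: evict the oldest pointer
            self.invalidations += 1
        self.pointers.append(node)


def count_invalidations(num_sharers: int, rounds: int, num_pointers: int = 4) -> int:
    """Nodes 0..num_sharers-1 read the same block in round-robin order."""
    entry = LimitedDirectoryEntry(num_pointers)
    for _ in range(rounds):
        for node in range(num_sharers):
            entry.read(node)
    return entry.invalidations


print(count_invalidations(4, 10))  # 0: the sharers fit in the pointers
print(count_invalidations(5, 10))  # 46: every read after the first four evicts a copy
```

With one sharer more than there are pointers, every read after the first four forces an invalidation, so the block never settles in the caches.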

7 LimitLESS (Limited directory Locally Extended through Software Support)
The LimitLESS scheme attempts to combine the full-map and limited directory ideas in order to achieve a robust yet affordable and scalable cache coherence solution.
The main idea is to handle the common case in hardware and the exceptional case in software.
Limited directories implemented in hardware keep track of a fixed number of cached copies of each memory block.
When the capacity of a directory entry is exceeded, the directory interrupts the local processor, and a full-map directory is emulated in software.

8 Protocol messages for hardware coherence

Type             Symbol   Name               Data?
Cache to Memory  RREQ     Read Request
                 WREQ     Write Request
                 REPM     Replace Modified   yes
                 UPDATE   Update             yes
                 ACKC     Invalidate Ack.
Memory to Cache  RDATA    Read Data          yes
                 WDATA    Write Data         yes
                 INV      Invalidate
                 BUSY     Busy Signal

Directory states

Component  Name               Meaning
Memory     Read-Only          Some number of caches have read-only copies of the data
           Read-Write         Exactly one cache has a read-write copy of the data
           Read-Transaction   Holding read request, update is in progress
           Write-Transaction  Holding write request, invalidation is in progress
Cache      Invalid            Cache block may not be read or written
           Read-Only          Cache block may be read, but not written
           Read-Write         Cache block may be read or written

Annotation of the state transition diagram

Label  Input Message       Precondition              Directory Entry Change   Output Message(s)
1      i -> RREQ           --                        P = P ∪ {i}              RDATA -> i
2      i -> WREQ           P = {i}                   --                       WDATA -> i
       i -> WREQ           P = {}                    P = {i}                  WDATA -> i
3      i -> WREQ           P = {k1,…,kn} ∧ i ∉ P     P = {i}, AckCtr = n      ∀kj: INV -> kj
       i -> WREQ           P = {k1,…,kn} ∧ i ∈ P     P = {i}, AckCtr = n-1    ∀kj ≠ i: INV -> kj
4      j -> WREQ           P = {i}                   P = {j}, AckCtr = 1      INV -> i
5      j -> RREQ           P = {i}                   P = {j}, AckCtr = 1      INV -> i
6      i -> REPM           P = {i}                   P = {}                   --
7      j -> RREQ, j -> WREQ  --                      --                       BUSY -> j
       j -> ACKC           AckCtr ≠ 1                AckCtr = AckCtr - 1      --
       j -> REPM           --                        --                       --
8      j -> ACKC           AckCtr = 1, P = {i}       AckCtr = 0               WDATA -> i
       j -> UPDATE         P = {i}                   AckCtr = 0               WDATA -> i
9      j -> RREQ, j -> WREQ  --                      --                       BUSY -> j
       j -> REPM           --                        --                       --
10     j -> UPDATE         P = {i}                   AckCtr = 0               RDATA -> i
       j -> ACKC           P = {i}                   AckCtr = 0               RDATA -> i
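Transitions 1 to 3 of the annotation table can be sketched as a small function over the pointer set P. This is a hedged illustration, not the Alewife controller logic: the function name and tuple layout are hypothetical, and the next-state names are an assumption inferred from the directory-state descriptions, since the state diagram itself is not reproduced in the transcript.

```python
def handle_read_only(P: set, msg: str, i: int):
    """Apply transitions 1-3 for a block whose memory state is Read-Only.

    Returns (new_P, ack_ctr, output_messages, next_state).
    The next_state values are an assumption, see the lead-in.
    """
    if msg == "RREQ":                            # transition 1
        return P | {i}, 0, [("RDATA", i)], "Read-Only"
    if msg == "WREQ":
        if P <= {i}:                             # transition 2: P = {} or P = {i}
            return {i}, 0, [("WDATA", i)], "Read-Write"
        targets = sorted(P - {i})                # transition 3: invalidate the other sharers
        # len(targets) is n when i is not in P, and n-1 when i is in P
        return {i}, len(targets), [("INV", k) for k in targets], "Write-Transaction"
    raise ValueError(f"message {msg!r} is not covered by transitions 1-3")


print(handle_read_only({1, 2, 5}, "WREQ", 2))
```

Computing the targets as `P - {i}` covers both rows of transition 3 at once, because the requester's own copy is never invalidated.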

9 Architectural Features for LimitLESS
Alewife is a large-scale multiprocessor with distributed shared memory and a cost-effective mesh network for communication.
An Alewife node consists of a 33 MHz SPARCLE processor, 64 KB of direct-mapped cache, 4 MB of globally shared main memory, and a floating-point coprocessor.

10

11 Figures: a 16-node Alewife machine; a 128-node Alewife chassis

12 Architectural Features for LimitLESS
The machine must be capable of rapid trap handling (five to ten cycles), which relies on:
-> A rapid context-switching processor
-> A finely tuned software trap architecture
The processor needs complete access to coherence-related controller state.
The directory controller must be able to invoke processor trap handlers when necessary.
An interface to the network, the IPI (Interprocessor-Interrupt), allows the processor to launch and to intercept coherence protocol packets.
Diagram labels: Processor and Controller, connected by Condition Bits, Trap Lines, Data Bus, and Address Bus.

13 Architectural Features for LimitLESS
IPI provides a superset of the network functionality:
-> Used to send and receive cache protocol packets
-> Used to send preemptive messages to remote processors
Network packet structure:
-> Protocol opcode: for cache coherence traffic
-> Interrupt opcode: for interprocessor messages
-> Packet fields, in order: Source Processor, Packet Length, Opcode, Operand 1, Operand 2, .., Operand m-1, Data word, Data word 2, .., Data word n-1
Transmission of IPI packets: the request is enqueued on the IPI output queue.
Reception of IPI packets: the packet is placed in the IPI input queue.
IPI input traps are synchronous.

14 Queue based diagram of the Alewife controller

15 Meta States & Trap Handler
Trap handler actions:
First-time overflow:
-> The trap code allocates a full-map bit-vector in local memory
-> Empty all hardware pointers and set the corresponding bits in the vector
-> The directory mode is set to Trap-On-Write before the trap returns
Additional overflow:
-> Empty all hardware pointers and set the corresponding bits in the vector
Termination (on WREQ or local write fault):
-> Empty all hardware pointers
-> Record the identity of the requester in the directory
-> Set AckCtr to the number of bits that are set in the vector
-> Place the directory in Normal mode, Write-Transaction state
-> Invalidate all caches whose bit is set in the vector
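The trap-handler steps above can be sketched as a toy model of one directory entry. It is a simplified illustration, not the Alewife trap code: the class and method names are hypothetical, a Python integer stands in for the bit-vector, and two details are assumptions of this sketch (the requester is excluded from the invalidation list, following transition 3 of the protocol table, and the vector is released after termination).

```python
class LimitLESSEntry:
    """One directory entry: a few hardware pointers plus a software bit-vector."""

    def __init__(self, num_pointers: int = 4):
        self.num_pointers = num_pointers
        self.hw_pointers = set()   # node IDs tracked in hardware
        self.sw_vector = None      # int used as a bit-vector, allocated on first overflow
        self.mode = "Normal"
        self.state = "Read-Only"
        self.ack_ctr = 0
        self.traps = 0             # number of times software was invoked

    def _spill_pointers(self) -> None:
        """Empty the hardware pointers and set the corresponding vector bits."""
        for node in self.hw_pointers:
            self.sw_vector |= 1 << node
        self.hw_pointers.clear()

    def read_request(self, node: int) -> None:
        if node in self.hw_pointers:
            return
        if len(self.hw_pointers) == self.num_pointers:  # pointer overflow: trap
            self.traps += 1
            if self.sw_vector is None:                  # first-time overflow
                self.sw_vector = 0
                self.mode = "Trap-On-Write"
            self._spill_pointers()
        self.hw_pointers.add(node)

    def write_request(self, node: int) -> list:
        """Termination path; returns the node IDs that are sent INV."""
        if self.mode != "Trap-On-Write":
            raise NotImplementedError("the hardware-only path is not modelled")
        self.traps += 1
        self._spill_pointers()
        vec = self.sw_vector
        # Assumption: the requester itself is not invalidated.
        targets = [n for n in range(vec.bit_length()) if (vec >> n) & 1 and n != node]
        self.hw_pointers = {node}   # record the requester in the directory
        self.ack_ctr = len(targets)
        self.sw_vector = None       # assumption: vector released after termination
        self.mode, self.state = "Normal", "Write-Transaction"
        return targets


entry = LimitLESSEntry(num_pointers=4)
for reader in range(10):
    entry.read_request(reader)
print(entry.traps, entry.mode)        # 2 Trap-On-Write
print(entry.write_request(20))        # nodes 0..9 receive INV
print(entry.ack_ctr, entry.state)     # 10 Write-Transaction
```

In this model ten readers cost only two overflow traps, because each trap frees all four hardware pointers at once; the common case of few sharers never leaves the hardware path.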

16 Performance Measurement
Comparison of the performance of limited, LimitLESS, and full-map directories.
Evaluated in terms of the total number of cycles needed to execute an application on a 64-processor Alewife machine.

17 Measurement Technique
ASIM, the Alewife System Simulator

18 Performance Results

Application   Dir4NB   LimitLESS4   Full-Map
Multigrid     0.729    0.704        0.665
SIMPLE        3.579    2.902        2.553
Matexpr       1.296    0.317        0.171
Weather       1.356    0.654        0.621

-> Protocols compared: four-pointer limited protocol (Dir4NB), LimitLESS scheme with Ts = 50 (LimitLESS4), and full-map protocol
-> 64-node Alewife machine with 64 KB caches and 2D mesh networks
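To make the table easier to read, each scheme can be expressed as a slowdown relative to the full-map protocol. The short script below only re-derives ratios from the numbers on the slide; the variable and function names are illustrative.

```python
# Execution figures copied from the slide, keyed by application and scheme.
cycles = {
    "Multigrid": {"Dir4NB": 0.729, "LimitLESS4": 0.704, "Full-Map": 0.665},
    "SIMPLE":    {"Dir4NB": 3.579, "LimitLESS4": 2.902, "Full-Map": 2.553},
    "Matexpr":   {"Dir4NB": 1.296, "LimitLESS4": 0.317, "Full-Map": 0.171},
    "Weather":   {"Dir4NB": 1.356, "LimitLESS4": 0.654, "Full-Map": 0.621},
}


def slowdown_vs_full_map(results: dict) -> dict:
    """Divide each scheme's figure by the full-map figure for the same application."""
    return {
        app: {
            scheme: value / row["Full-Map"]
            for scheme, value in row.items()
            if scheme != "Full-Map"
        }
        for app, row in results.items()
    }


slowdown = slowdown_vs_full_map(cycles)
for app, row in slowdown.items():
    print(f"{app:10s} Dir4NB x{row['Dir4NB']:.2f}   LimitLESS4 x{row['LimitLESS4']:.2f}")
```

On these numbers LimitLESS4 stays within roughly 1.05x to 1.85x of full-map, while Dir4NB ranges from roughly 1.10x to 7.58x, with Matexpr as the worst case for both.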

19 Performance Results (contd..) -> Result when the variable in Weather is not optimised.

20 Performance Results (contd..) -> Result when the variable in Weather is optimised

21 Performance Results (Contd..) -> Result when emulation latency = 50 for LimitLESS protocol.

22 Conclusion
This paper proposed a new scheme for cache coherence, called LimitLESS, which is being implemented in the Alewife machine.
Hardware requirements include rapid trap handling and a flexible processor interface to the network.
Preliminary simulation results indicate that the LimitLESS scheme approaches the performance of a full-map directory protocol with the memory efficiency of a limited directory protocol.
Furthermore, the LimitLESS scheme provides a migration path toward a future in which cache coherence is handled entirely in software.

