CS 258 Parallel Computer Architecture LimitLESS Directories: A Scalable Cache Coherence Scheme David Chaiken, John Kubiatowicz, and Anant Agarwal Presented:

1 CS 258 Parallel Computer Architecture
LimitLESS Directories: A Scalable Cache Coherence Scheme
David Chaiken, John Kubiatowicz, and Anant Agarwal
Presented: March 19, 2008
Ankit Jain

2 The Background & Problems
Bus-Based Protocols
  –Do not scale because broadcasts are slow and limit parallelism
Traditional Directory-Based Protocols
  –Monolithic Directories
    »Implicitly serialize all memory requests
  –Directory accesses consume a disproportionately large fraction of available network bandwidth
  –Full Directories are Large
    »Full Map Size: Total Memory Size * Number of Processors
  –Limited Directory Protocols
    »Allow only a limited number of simultaneous cached copies of any block of data
    »Pro: Size of directory is smaller
    »Con: Potential thrashing, since copies are evicted and pointers reassigned when more simultaneous copies are needed
    »Previous studies show a small set of pointers is sufficient to capture the worker-set of processors
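To make the size argument on this slide concrete, here is a minimal sketch that compares directory storage for a full map against a limited directory. It assumes memory is counted in blocks, a full map keeps one presence bit per processor per block, and a limited directory keeps a fixed number of pointers per block, each wide enough to name any processor. State bits are ignored, and the function names and machine parameters are hypothetical.

```python
def full_map_bits(num_blocks: int, num_procs: int) -> int:
    """Full map: one presence bit per processor for every memory block."""
    return num_blocks * num_procs


def limited_dir_bits(num_blocks: int, num_procs: int, num_pointers: int) -> int:
    """Limited directory: a few pointers per block, each naming one processor."""
    # Bits needed to encode a processor id in the range 0..num_procs-1.
    pointer_width = max(1, (num_procs - 1).bit_length())
    return num_blocks * num_pointers * pointer_width


# Hypothetical machine: 2**20 memory blocks, 64 processors, 4 pointers per entry.
blocks, procs, ptrs = 2**20, 64, 4
print("full map bits:    ", full_map_bits(blocks, procs))
print("limited dir bits: ", limited_dir_bits(blocks, procs, ptrs))
```

The full map grows linearly with the processor count for every block, while the limited directory grows only with the pointer width, which is logarithmic in the processor count.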

3 Alewife Architecture
Cost Effective Mesh Network
  –Pro: Scales in terms of hardware
  –Pro: Exploits Locality
Directory Distributed along with main memory
  –Bandwidth scales with number of processors
Con: Non-Uniform Latencies of Communication
  –Have to manage the mapping of processes/threads onto processors due
  –Alewife employs techniques for latency minimization and latency tolerance so programmer does not have to manage
Context Switch in 11 cycles between processes on remote memory request which has to incur communication network latency
Cache Controller holds tags and implements the coherence protocol

4 LimitLESS Protocol + Requirements
Limited Directory that is Locally Extended through Software Support
Handle the common case (small worker set) in hardware and the exceptional case (overflow) in software
Processor with rapid trap handling (executes trap code within 5-10 cycles of initiation)
State Shared
  –Processor needs complete access to coherence related controller state in the hardware directories
  –Directory Controller can invoke processor trap handlers
Machine needs an interface to the network that allows the processor to launch and intercept coherence protocol packets

5 The Protocol
Note: In the Read-Only State, the notation S: n>p indicates that the outputs from the state are handled through a software interrupt handler if the size of the pointer set (n) is greater than the size of the limited directory (p).
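The hardware/software split described on the last two slides can be sketched as a toy directory entry. This is only an illustration under stated assumptions, not the Alewife implementation: the class and method names are hypothetical, and the sketch assumes the overflow handler moves every hardware pointer plus the new reader into software storage, leaving the hardware pointers free again.

```python
class LimitlessEntry:
    """Toy directory entry for one memory block (hypothetical names throughout)."""

    def __init__(self, p: int):
        self.p = p        # size of the limited hardware directory
        self.hw = []      # processors recorded in hardware pointers
        self.sw = set()   # processors recorded by the software handler
        self.traps = 0    # number of times the overflow trap ran

    def n(self) -> int:
        """Size of the whole pointer set (hardware plus software)."""
        return len(self.hw) + len(self.sw)

    def read_request(self, proc: int) -> str:
        """Record a reader and report which path handled the request."""
        if proc in self.hw or proc in self.sw:
            return "hardware"          # already a known sharer
        if len(self.hw) < self.p:
            self.hw.append(proc)       # common case: a free hardware pointer
            return "hardware"
        # Exceptional case: hardware pointers are full, so trap to software.
        # Assumption: the handler empties the hardware pointers into software
        # storage and records the new reader there as well.
        self.traps += 1
        self.sw.update(self.hw)
        self.sw.add(proc)
        self.hw.clear()
        return "software"


entry = LimitlessEntry(p=2)
paths = [entry.read_request(proc) for proc in (0, 1, 2, 3)]
print(paths, entry.hw, sorted(entry.sw))
```

With two hardware pointers, only the third distinct reader takes the software path; the fourth is handled in hardware again because the trap freed the pointers.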

6 An Example
Proc i has data block D from Proc d in Read-Write State
Proc j wants to write a value to data block D

Processor i
  Data Block   State
  d            Read-Write

Processor j
  Data Block   State
  d            Invalid

Processor d Directory Entry
  Data Block   State        AckCtr   Owning Processors
  d            Read-Write   0        i

7 An Example
Proc i has data block D from Proc d in Read-Write State
Proc j wants to write a value to data block D

Processor i
  Data Block   State
  d            Read-Write

Processor j
  Data Block   State
  d            Invalid

j → WREQ
Precondition: P = { i }
INV → i

Processor d Directory Entry
  Data Block   State        AckCtr   Owning Processors
  d            Read-Write   0        i

8 An Example
Proc i has data block D from Proc d in Read-Write State
Proc j wants to write a value to data block D

Processor i
  Data Block   State
  d            Invalid

Processor j
  Data Block   State
  d            Invalid

Processor d Directory Entry
  Data Block   State        AckCtr   Owning Processors
  d            Read-Write   1        j

9 An Example
Proc i has data block D from Proc d in Read-Write State
Proc j wants to write a value to data block D

Processor i
  Data Block   State
  d            Invalid

Processor j
  Data Block   State
  d            Invalid

AckCtr = 1, P = { j }
i → ACKC

Processor d Directory Entry
  Data Block   State        AckCtr   Owning Processors
  d            Read-Write   1        j

10 An Example
Proc i has data block D from Proc d in Read-Write State
Proc j wants to write a value to data block D

Processor i
  Data Block   State
  d            Invalid

Processor j
  Data Block   State
  d            Read-Write

Processor d Directory Entry
  Data Block   State        AckCtr   Owning Processors
  d            Read-Write   0        j
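The five example slides above can be replayed as a short simulation of the directory entry at processor d. This is a sketch under stated assumptions: the function names are hypothetical, the message that grants the write is labeled WRITE-PERMISSION for illustration, and only AckCtr and the owner set are modeled because those are the fields that change in the example.

```python
from dataclasses import dataclass, field


@dataclass
class DirEntry:
    """Directory entry kept at home processor d for one block."""
    ack_ctr: int = 0
    owners: set = field(default_factory=set)


def handle_wreq(entry: DirEntry, requester: str) -> list:
    """Write request: invalidate current owners, then record the requester."""
    out = [("INV", p) for p in sorted(entry.owners) if p != requester]
    entry.ack_ctr = len(out)        # one acknowledgement expected per INV
    entry.owners = {requester}
    return out


def handle_ackc(entry: DirEntry) -> list:
    """Invalidate acknowledgement: grant the write once all have arrived."""
    entry.ack_ctr -= 1
    if entry.ack_ctr == 0:
        (owner,) = entry.owners
        return [("WRITE-PERMISSION", owner)]
    return []


# Starting point of the example: i holds the block Read-Write, j is Invalid.
caches = {"i": "Read-Write", "j": "Invalid"}
entry = DirEntry(ack_ctr=0, owners={"i"})

sent = handle_wreq(entry, "j")       # j -> WREQ arrives at d
for _, target in sent:
    caches[target] = "Invalid"       # i drops its copy when INV arrives
granted = handle_ackc(entry)         # i -> ACKC arrives at d
for _, target in granted:
    caches[target] = "Read-Write"    # j may now write
print(sent, granted, caches, entry)
```

After the acknowledgement, AckCtr is back at 0, j is the only owner, and the cache states match the final slide of the example.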

11 Interprocessor-Interrupt (1/2)
Trap routine can either discard packet or store it to memory
Store-back capability permits message-passing and block transfers
Potential Deadlock Scenario with Processor Stalled and waiting for a remote cache-fill
Solution: Synchronous Trap (stored in local memory) to empty input queue

12 Interprocessor-Interrupt (2/2)
Overflow Trap Scenario
  –First Instance: Full-Map bit-vector is allocated in local memory, hardware pointers are emptied into it, and the vector is entered into a hash table
  –Otherwise: Empty hardware pointers into bit vector
  –Meta-State Set to “Trap-On-Write”
  –While emptying hardware pointers, Meta-State: “Trans-In-Progress”
Incoming Write Request Scenario
  –Empty hardware pointers to memory
  –Set AckCtr to number of bits that are set in bit-vector
  –Send invalidations to all caches except possibly requesting one
  –Free vector in memory
  –Upon invalidate acknowledgement (AckCtr == 0), send Write-Permission and set Memory State to “Read-Write”
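The two software scenarios on this slide can be sketched as handlers over a hash table of bit vectors. This is a minimal sketch, not the Alewife trap code: the class and method names are hypothetical, a Python int stands in for the full-map bit vector, and a dict stands in for the hash table. Two further assumptions: AckCtr is set to the number of invalidations actually sent (the requester is skipped), and software tracking for the block ends once the vector is freed.

```python
class SoftwareDirectory:
    """Software-extended directory state kept in local memory (illustrative)."""

    def __init__(self, num_procs: int):
        self.num_procs = num_procs
        self.vectors = {}   # hash table: block address -> full-map bit vector
        self.meta = {}      # block address -> meta-state

    def overflow_trap(self, block: int, hw_pointers: list) -> None:
        """Empty the hardware pointers into the block's bit vector."""
        self.meta[block] = "Trans-In-Progress"
        vec = self.vectors.get(block, 0)    # first instance starts a new vector
        for p in hw_pointers:
            vec |= 1 << p
        hw_pointers.clear()                 # hardware pointers are free again
        self.vectors[block] = vec
        self.meta[block] = "Trap-On-Write"

    def write_request(self, block: int, hw_pointers: list, requester: int):
        """Handle a write to an overflowed block; return (AckCtr, INV targets)."""
        vec = self.vectors.pop(block, 0)    # take the vector, then free it
        for p in hw_pointers:               # empty hardware pointers to memory
            vec |= 1 << p
        hw_pointers.clear()
        targets = [p for p in range(self.num_procs)
                   if (vec >> p) & 1 and p != requester]
        self.meta.pop(block, None)          # assumption: tracking ends here
        return len(targets), targets


sd = SoftwareDirectory(num_procs=8)
hw = [0, 1, 2, 3]
sd.overflow_trap(0x40, hw)                  # fifth reader caused an overflow
state_after_trap = (bin(sd.vectors[0x40]), sd.meta[0x40], list(hw))
hw.append(5)                                # a later reader lands in hardware
ack_ctr, targets = sd.write_request(0x40, hw, requester=1)
print(state_after_trap, ack_ctr, targets)
```

The directory would then wait for AckCtr acknowledgements before sending Write-Permission, as in the last bullet of the slide.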

13 Performance Technique
Notes:
Multigrid: Small worker sets → limited directories perform as well as full map
SIMPLE implemented barrier synchronization with single lock
Matexpr has worker sets up to 16 processors
Weather has one variable initialized by one processor and then read by all the other processors

14 Results (1/3)

15 Results (2/3)

16 Results (3/3)

17 Summary
LimitLESS directories can closely emulate Full-Map Directories while saving hardware resources
LimitLESS is not as sensitive to tuning parameters as the Limited Directory approach
The protocol is general enough to apply to other coherence techniques
In the future, it can be extended to give feedback to programmers/compilers about hot-spots, etc

18 Full Memory State Transition Diagram

