Rebound: Scalable Checkpointing for Coherent Shared Memory

Presentation transcript:

1 Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign

2 Checkpointing in Shared-Memory MPs
HW-based schemes for small CMPs use global checkpointing: all processors participate in system-wide checkpoints. Global checkpointing is not scalable: synchronization, bursty movement of data, and loss of work on rollback. [Figure: P1-P4 save a checkpoint; after a fault, all processors roll back to it.]

3 Alternative: Coordinated Local Checkpointing
Idea: threads coordinate their checkpointing in groups. Rationale: faults propagate only through communication, so the interleaving between non-communicating threads is irrelevant. [Figure: P1-P5 taking one global checkpoint vs. local checkpoints in small groups.] + Scalable: checkpoint and rollback involve only processor groups. - Complexity: inter-thread dependences must be recorded dynamically.

4 Contributions
Rebound is the first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory. It leverages the directory protocol to track inter-thread dependences. Optimizations boost checkpointing efficiency: delaying the write-back of data to safe memory at checkpoints, supporting multiple checkpoints, and optimizing checkpointing at barrier synchronization. Average performance overhead for 64 processors is 2%, compared to 15% for global checkpointing.

5 Background: In-Memory Checkpt with ReVive
[Prvulovic-02] ReVive takes in-memory checkpoints. During execution, before a dirty line is written back to memory, the old memory value is logged in an in-memory log. At a checkpoint (CHK), the application stalls, each processor dumps its registers, and all dirty cache lines are written back. [Figure: P1-P3 timeline with writes (W), write-backs (WB) with logging of old values, and the checkpoint stall.]

6 Background: In-Memory Checkpt with ReVive
[Prvulovic-02] On a fault, the caches are invalidated, the modified memory lines are reverted using the log, and the old register state from the checkpoint is restored. ReVive is global and relies on a broadcast protocol; Rebound makes checkpointing local and coordinated, on top of a scalable directory protocol. [Figure: P1-P3 roll back past the fault to the checkpoint CHK.]
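The logging in this figure is an undo log: before checkpointed memory state is overwritten by a write-back, the old value of the line is saved so memory can later be reverted. The sketch below is a minimal software model of that idea, not ReVive's actual mechanism; all names (UndoLog, LINE_SIZE) and the use of a map for memory are illustrative assumptions.

    #include <cstdint>
    #include <cstring>
    #include <unordered_map>
    #include <vector>

    constexpr size_t LINE_SIZE = 64;                 // cache-line size in bytes (assumption)

    struct LogEntry {                                // old value of one memory line
        uint64_t addr;
        uint8_t  old_data[LINE_SIZE];
    };

    class UndoLog {
        std::vector<LogEntry> entries_;              // entries since the last checkpoint
    public:
        // Called before a write-back overwrites checkpointed memory state.
        void log_old_value(uint64_t addr, const uint8_t* mem) {
            LogEntry e; e.addr = addr;
            std::memcpy(e.old_data, mem, LINE_SIZE);
            entries_.push_back(e);
        }
        // On rollback: restore old values in reverse order, then discard the log.
        void revert(std::unordered_map<uint64_t, std::vector<uint8_t>>& memory) {
            for (auto it = entries_.rbegin(); it != entries_.rend(); ++it)
                memory[it->addr].assign(it->old_data, it->old_data + LINE_SIZE);
            entries_.clear();
        }
        // On a successful checkpoint: the old values are no longer needed.
        void truncate() { entries_.clear(); }
    };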

7 Coordinated Local Checkpointing Rules
When P1 writes x and P2 later reads x, P1 is the producer and P2 the consumer. Rule 1: if P checkpoints, P's producers must checkpoint. Rule 2: if P rolls back, P's consumers must roll back. Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96].

8 Rebound Fault Model
Any part of the chip can suffer transient or permanent faults, and a fault can occur even during checkpointing. Off-chip main memory and the logs (kept in software) suffer no faults on their own (e.g., NVM). Fault detection is outside our scope; we only assume that the fault-detection latency has an upper bound of L cycles.

9 Rebound Architecture
[Figure: chip multiprocessor with processors + L1 caches, L2 with Dep registers (MyProducers, MyConsumers), a directory cache whose entries carry an LW-ID field, and off-chip main memory.]

10 Rebound Architecture
Dependence (Dep) registers in the L2 cache controller: MyProducers is a bitmap of the processors that produced data consumed by the local processor; MyConsumers is a bitmap of the processors that consumed data produced by the local processor.

11 Rebound Architecture
In addition to the Dep registers (MyProducers and MyConsumers, as above), each directory entry holds a processor ID, LW-ID: the last writer of the line in the current checkpoint interval.
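As a rough software model of this state, each node can be given two processor bitmaps (the Dep registers) and each directory entry a last-writer field. A minimal sketch, assuming 64 processors; names and sizes are illustrative, not the actual hardware:

    #include <bitset>
    #include <cstdint>
    #include <unordered_map>

    constexpr int NPROC = 64;                        // number of processors (assumption)

    // Dependence registers kept at each node's L2 cache controller.
    struct DepRegisters {
        std::bitset<NPROC> my_producers;             // procs that produced data this proc consumed
        std::bitset<NPROC> my_consumers;             // procs that consumed data this proc produced
        void clear() { my_producers.reset(); my_consumers.reset(); }
    };

    // Per-line directory state, extended with the last writer of the current interval.
    struct DirEntry {
        std::bitset<NPROC> sharers;                  // usual directory sharer vector
        int owner = -1;                              // owner of a dirty line, -1 if none
        int lw_id = -1;                              // last writer this interval, -1 if none
    };

    struct DirectoryCache {
        std::unordered_map<uint64_t, DirEntry> entries;   // indexed by line address
    };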

12 Recording Inter-Thread Dependences
Assume a MESI protocol. P1 writes a line: the directory records LW-ID = P1 and the line becomes Dirty in P1's cache. No dependence is recorded yet, so the Dep registers stay empty.

13 Recording Inter-Thread Dependences
P2 reads the line: the directory sees LW-ID = P1, so P2's MyProducers records P1 and P1's MyConsumers records P2. The dirty line is written back to memory (its old value is logged first) and the line becomes Shared.

14 Recording Inter-Thread Dependences
P1 writes the line again: the line goes from Shared back to Dirty in P1's cache, and LW-ID remains P1. The Dep registers are unchanged.

15 Recording Inter-Thread Dependences
P1 checkpoints: its dirty lines are written back (logging the old values) and its Dep registers are cleared. The LW-ID fields naming P1 should also be cleared, but only once the corresponding lines have been checkpointed; until then, LW-ID must remain set.
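Slides 12 through 15 amount to the following directory-side bookkeeping, shown here as a simplified software model that reuses the structures sketched above; the real protocol piggybacks these updates on the MESI coherence transactions.

    // dep[] holds the per-processor Dep registers; dir is the directory model from above.
    void on_write(DirectoryCache& dir, DepRegisters dep[], uint64_t addr, int writer) {
        DirEntry& e = dir.entries[addr];
        e.lw_id = writer;                      // slides 12/14: remember the last writer this interval
        e.owner = writer;                      // line is now Dirty in the writer's cache
    }

    void on_read(DirectoryCache& dir, DepRegisters dep[], uint64_t addr, int reader) {
        DirEntry& e = dir.entries[addr];
        if (e.lw_id != -1 && e.lw_id != reader) {
            dep[reader].my_producers.set(e.lw_id);   // slide 13: reader consumed lw_id's data
            dep[e.lw_id].my_consumers.set(reader);   // producer gains a consumer
        }
        e.sharers.set(reader);                 // the dirty copy is written back (old value logged)
        e.owner = -1;                          // line becomes Shared
    }

    void on_checkpoint(DepRegisters dep[], int proc) {
        dep[proc].clear();                     // slide 15: dependences up to this checkpoint resolved
    }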

16 Lazily clearing Last Writers
Clearing the LW-IDs in the directory at every checkpoint would be expensive. Instead, each processor keeps a Write Signature that encodes all line addresses it has written (or read exclusively) in the current interval. At a checkpoint, the processor simply clears its Write Signature, which leaves potentially stale LW-IDs in the directory.

17 Lazily clearing Last Writers
P2 reads a line whose LW-ID names P1. Before recording a dependence, the directory checks whether the address is in P1's Write Signature. Here the answer is no, so the LW-ID is stale: it is cleared and no dependence is recorded.
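A Write Signature can be modeled as a Bloom-filter-style hash of the line addresses written in the interval; the stale-LW-ID check then becomes a conservative membership query (false positives are possible, false negatives are not, which is safe because it can only add dependences). A minimal sketch with assumed sizes and hash functions:

    #include <bitset>
    #include <cstdint>

    // Bloom-filter-style signature of line addresses written this interval (sizes are assumptions).
    class WriteSignature {
        static constexpr int BITS = 1024;
        std::bitset<BITS> bits_;
        static unsigned h1(uint64_t a) { return (a >> 6) % BITS; }   // drop the line offset
        static unsigned h2(uint64_t a) { return ((a >> 6) * 0x9E3779B97F4A7C15ULL) % BITS; }
    public:
        void insert(uint64_t addr)            { bits_.set(h1(addr)); bits_.set(h2(addr)); }
        // May return false positives, never false negatives: conservative and safe.
        bool may_contain(uint64_t addr) const { return bits_[h1(addr)] && bits_[h2(addr)]; }
        void clear()                          { bits_.reset(); }      // done at each checkpoint
    };

    // Stale-LW-ID check from the slide: if the alleged last writer's signature does not
    // contain the address, the LW-ID predates that writer's last checkpoint and is cleared.
    bool lw_id_is_current(const WriteSignature& writer_sig, uint64_t addr) {
        return writer_sig.may_contain(addr);
    }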

18 Distributed Checkpointing Protocol in SW
Interaction Set[Pi] is the set of producer processors for Pi, built transitively using MyProducers. Example: P1 initiates a checkpoint; its interaction set starts as {P1}.

19 Distributed Checkpointing Protocol in SW
P1 sends checkpoint requests (Ck?) to its producers, P2 and P3.

20 Distributed Checkpointing Protocol in SW
Having produced data for P1, P2 and P3 join: the interaction set grows to {P1, P2, P3}.

21 Distributed Checkpointing Protocol in SW
P2 and P3 reply Accept, and a checkpoint request (Ck?) is forwarded transitively to P4.

22 Distributed Checkpointing Protocol in SW
P4 replies Decline and stays out of the interaction set; acknowledgments (Ack) flow back toward the initiator.

23 Distributed Checkpointing Protocol in SW
The final interaction set is {P1, P2, P3}. Checkpointing is a two-phase commit protocol: first the interaction set is built and its members agree to checkpoint, then the checkpoint is committed in all of them.
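The exchange in slides 18 through 23 can be approximated in software as: phase one grows the interaction set transitively through MyProducers (the Ck?/Accept/Decline messages), and phase two commits the checkpoint in every member. The sketch below is a sequential, centralized stand-in for that distributed message exchange; it reuses the DepRegisters model from the earlier sketch, and the commented-out helpers are hypothetical.

    #include <set>
    #include <vector>

    // Phase 1: build the interaction set transitively through MyProducers.
    std::set<int> build_interaction_set(const DepRegisters dep[], int initiator) {
        std::set<int> iset = {initiator};
        std::vector<int> worklist = {initiator};
        while (!worklist.empty()) {
            int p = worklist.back(); worklist.pop_back();
            for (int q = 0; q < NPROC; ++q) {
                if (dep[p].my_producers[q] && !iset.count(q)) {  // q produced data p consumed: "Ck?"
                    iset.insert(q);                              // q accepts and joins
                    worklist.push_back(q);
                }
                // processors with no producer relation to the set decline and stay out
            }
        }
        return iset;
    }

    // Phase 2: every member writes back and logs its dirty lines, dumps its registers,
    // clears its Dep registers and Write Signature, and then the checkpoint commits.
    void checkpoint(DepRegisters dep[], int initiator) {
        std::set<int> iset = build_interaction_set(dep, initiator);
        for (int p : iset) {
            // write_back_and_log_dirty_lines(p);   // hypothetical helpers, not shown
            // dump_registers(p);
            dep[p].clear();
        }
        // commit: all members acknowledge, then execution resumes
    }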

24 Distributed Rollback Protocol in SW
Rollback is handled similarly to the checkpointing protocol, except that the interaction set is built transitively using MyConsumers. Rollback involves clearing the Dep registers and Write Signature, invalidating the processor caches, and restoring the data and register context from the logs up to the latest checkpoint. There is no domino effect.
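Rollback mirrors the checkpointing sketch: the set is grown through MyConsumers and each member undoes its state from the log. A minimal sketch, again reusing the structures and headers from the previous sketches; the commented-out helpers and the UndoLog are hypothetical stand-ins for the hardware and software actions.

    std::set<int> build_rollback_set(const DepRegisters dep[], int faulty) {
        std::set<int> rset = {faulty};
        std::vector<int> worklist = {faulty};
        while (!worklist.empty()) {
            int p = worklist.back(); worklist.pop_back();
            for (int q = 0; q < NPROC; ++q)
                if (dep[p].my_consumers[q] && !rset.count(q)) {  // q consumed p's data: must roll back
                    rset.insert(q);
                    worklist.push_back(q);
                }
        }
        return rset;
    }

    void rollback(DepRegisters dep[], WriteSignature wsig[], int faulty) {
        for (int p : build_rollback_set(dep, faulty)) {
            dep[p].clear();                      // clear the Dep registers
            wsig[p].clear();                     // clear the Write Signature
            // invalidate_caches(p);             // hypothetical helpers, not shown
            // log[p].revert(memory);            // restore memory from the log
            // restore_register_context(p);      // back to the latest checkpoint
        }
    }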

25 Optimization 1: Delayed Writebacks
Checkpointing overhead is dominated by the write-back of dirty data. With the Delayed Writeback optimization, the processors synchronize and immediately resume execution while the hardware writes back the dirty lines in the background; the checkpoint is only complete once all the delayed data has been written back. Inter-thread dependences on the delayed data still need to be recorded. [Figure: timeline comparing a stall for write-backs at each checkpoint vs. write-backs overlapped with the next interval.]

26 Delayed Writeback Pros/Cons
+ Significant reduction in checkpoint overhead.
- Additional support: each processor has two sets of Dep registers and Write Signatures, and each cache line has a delayed bit.
- Increased vulnerability: a rollback event forces both intervals to roll back.

27 Delayed Writeback protocol
P2 reads a line whose LW-ID names P1. The directory checks the address against P1's two Write Signatures: it is not in the current one (WSig1) but it is in the previous, delayed one (WSig0). The dependence is therefore recorded across intervals: P2's current MyProducers1 records P1, and P1's previous-interval MyConsumers0 records P2. The line is written back and its old value logged.
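With delayed writebacks, the check above consults both of the last writer's signatures and records the dependence against whichever interval the data belongs to. A hedged sketch, reusing the WriteSignature and DepRegisters models from above; index 1 is the current interval and index 0 the previous, still-draining one.

    struct NodeState {
        WriteSignature wsig[2];     // [1] = current interval, [0] = previous (delayed) interval
        DepRegisters   dep[2];      // matching Dep register sets
    };

    // The reader accesses a line whose directory LW-ID names 'writer' (P1 in the slide).
    void record_delayed_dep(NodeState nodes[], uint64_t addr, int reader, int writer) {
        if (nodes[writer].wsig[1].may_contain(addr)) {          // produced in writer's current interval
            nodes[reader].dep[1].my_producers.set(writer);
            nodes[writer].dep[1].my_consumers.set(reader);
        } else if (nodes[writer].wsig[0].may_contain(addr)) {   // produced in the delayed interval
            nodes[reader].dep[1].my_producers.set(writer);      // consumer side is always current
            nodes[writer].dep[0].my_consumers.set(reader);      // producer side charged to interval 0
        } else {
            // neither signature contains the address: the LW-ID is stale and gets cleared
        }
    }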

28 Optimization 2: Multiple Checkpoints
Problem: fault detection is not instantaneous, so a checkpoint is safe only after the maximum fault-detection latency L has elapsed. Solution: keep multiple checkpoints, each with its own Dep registers; on a fault, roll back the interacting processors to checkpoints that are already safe. There is no domino effect. [Figure: a fault occurring at tf before Ckpt 2 may only be detected after Ckpt 2 is taken, forcing a rollback to Ckpt 1.]
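The safety condition is simple to state: a checkpoint committed at time t becomes safe only once the current time exceeds t + L, and older checkpoints (with their Dep registers and logs) can be recycled once a younger checkpoint is safe. A minimal bookkeeping sketch with an assumed value of L and illustrative names:

    #include <cstdint>
    #include <deque>

    constexpr uint64_t L_CYCLES = 1'000'000;        // assumed fault-detection latency bound (cycles)

    struct CheckpointRecord {
        uint64_t taken_at;                          // cycle at which the checkpoint committed
        // DepRegisters dep; UndoLog log; ...       // per-checkpoint state (see earlier sketches)
    };

    // A checkpoint is safe once it has aged past L with no fault detected. Older checkpoints
    // can be discarded as soon as a younger one is safe, since rollback never needs to go
    // further back than the newest safe checkpoint.
    void recycle_safe_checkpoints(std::deque<CheckpointRecord>& ckpts, uint64_t now) {
        while (ckpts.size() >= 2 && now > ckpts[1].taken_at + L_CYCLES)
            ckpts.pop_front();                      // oldest checkpoint's resources can be reused
    }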

29 Multiple Checkpoints: Pros/Cons
+ Realistic system: supports non-instantaneous fault detection.
- Additional support: each checkpoint has its own Dep registers, and they can be recycled only after the fault-detection latency has passed.
- Communication must be tracked across checkpoints.
- Combined with Delayed Writebacks, one more Dep register set is needed.

30 Optimization 3: Hiding Chkpt behind Global Barrier
Global barriers make all processors communicate, which leads to global checkpoints anyway. Optimization: proactively trigger a global checkpoint at a global barrier and hide the checkpoint overhead behind the time processors would otherwise spend spinning on barrier imbalance.

31 Hiding Checkpoint behind Global Barrier
    Lock
    count++
    if (count == numProc)
        i_am_last = TRUE        /* local var */
    Unlock
    if (i_am_last) {
        count = 0
        flag = TRUE             /* release the spinning processors */
        ...
    } else
        while (!flag) {}        /* spin until the last processor sets flag */

32 Hiding Checkpoint behind Global Barrier
Running the same barrier code, the first processor to arrive initiates the checkpoint (BarCK?). The others are notified, and the hardware writes back their data as execution proceeds to the barrier and spins on the flag. The checkpoint commits as the last processor arrives and sets the flag. After the barrier, few processors have interacted, so subsequent interaction sets (ICHK) are small.
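One way to picture the optimization is to hook hypothetical checkpoint actions into the barrier code of the previous slide: the first arrival initiates the checkpoint, write-backs proceed while processors head to the barrier and spin, and the last arrival commits. The sketch below is an illustrative one-shot barrier, not the paper's implementation; the three hook functions are placeholders.

    #include <atomic>
    #include <mutex>

    // Hypothetical placeholders for the Rebound protocol actions (not a real API).
    static void initiate_checkpoint()  { /* first arrival: start write-backs and logging */ }
    static void commit_checkpoint()    { /* all data written back: checkpoint commits    */ }
    static void background_writeback() { /* hardware drains dirty lines in background    */ }

    static std::mutex barrier_lock;
    static int count = 0;
    static std::atomic<bool> flag{false};

    // Barrier with the checkpoint hidden behind the imbalance spin (one-shot barrier sketch).
    void barrier_with_checkpoint(int num_proc) {
        barrier_lock.lock();
        bool first     = (count == 0);       // first processor to arrive
        bool i_am_last = (++count == num_proc);
        barrier_lock.unlock();

        if (first)
            initiate_checkpoint();           // BarCK?: trigger the global checkpoint early
        background_writeback();              // overlap write-backs with the waiting time

        if (i_am_last) {
            commit_checkpoint();             // last arrival: all delayed data is back, commit
            count = 0;
            flag = true;                     // release the spinning processors
        } else {
            while (!flag) { }                // spin: checkpoint overhead hidden here
        }
    }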

33 Evaluation Setup
Analysis tool using Pin, plus the SESC cycle-accurate simulator and DRAMsim. Applications: SPLASH-2, some PARSEC, and Apache. Simulated CMP architecture with up to 64 threads; checkpoint interval of 5-8 ms. Environments modeled: Global (baseline global checkpointing), Rebound (local checkpointing with delayed writebacks), and Rebound_NoDWB (Rebound without delayed writebacks).

34 Avg. Interaction Set: Set of Producer Processors
For most applications, the interaction set is small, which justifies coordinated local checkpointing. The averages are brought up by global barriers. [Chart: average interaction-set size per application, out of 64 processors (callouts: 38 and 64).]

35 Checkpoint Execution Overhead
Rebound's average checkpoint execution overhead is 2%, compared to 15% for Global. [Chart: checkpoint execution overhead per application for Global and Rebound.]

36 Checkpoint Execution Overhead
Rebound's average checkpoint execution overhead is 2%, compared to 15% for Global. Delayed Writebacks complement local checkpointing.

37 Rebound Scalability (constant problem size)
Rebound is scalable in checkpoint overhead. Delayed Writebacks help scalability.

38 Also in the Paper
Delayed writebacks are also useful in Global checkpointing. The barrier optimization is effective but not universally applicable. The power increase from the hardware additions is under 2%. Rebound increases coherence traffic by only 4%.

39 Conclusions
Rebound is the first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory. It leverages the directory protocol and boosts checkpointing efficiency with delayed write-backs, multiple checkpoints, and the barrier optimization. Average execution overhead for 64 processors is 2%. Future work: apply Rebound to machines without hardware coherence, and scale to hierarchical directories.

40 Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign

