Rebound: Scalable Checkpointing for Coherent Shared Memory

Presentation transcript:

Rebound: Scalable Checkpointing for Coherent Shared Memory Rishi Agarwal, Pranav Garg, and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

Checkpointing in Shared-Memory MPs
- HW-based schemes for small CMPs use global checkpointing: all processors participate in system-wide checkpoints.
- Global checkpointing is not scalable: synchronization, bursty movement of data, loss in rollback...
(Figure: P1-P4 all save a global checkpoint; when a fault strikes, all of them roll back to it.)

Alternative: Coordinated Local Checkpointing
- Idea: threads coordinate their checkpointing in groups.
- Rationale: faults propagate only through communication; the interleaving between non-communicating threads is irrelevant.
(Figure: a global checkpoint involves all of P1-P5, while local checkpoints involve only the communicating subsets.)
- Pro: scalable; checkpoint and rollback happen in processor groups.
- Con: inter-thread dependences must be recorded dynamically.

Contributions
- Rebound: the first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory.
- Leverages the directory protocol to track inter-thread dependences.
- Optimizations to boost checkpointing efficiency:
  - Delaying the write-back of data to safe memory at checkpoints
  - Supporting multiple checkpoints
  - Optimizing checkpointing at barrier synchronization
- Average performance overhead for 64 processors: 2%, compared to 15% for global checkpointing.

Background: In-Memory Checkpointing with ReVive [Prvulovic-02]
(Figure: during execution, when a dirty cache line is displaced and written back, the old memory value is first saved in a log kept in main memory. At a checkpoint, the application stalls while registers are dumped and all dirty cache lines are written back, again logging the old values.)

Background: In-Memory Checkpointing with ReVive [Prvulovic-02]
(Figure: on a fault, the caches are invalidated, memory lines are reverted using the log, and the old register state is restored.)
ReVive is global and assumes a broadcast protocol; Rebound targets local, coordinated checkpointing on a scalable (directory) protocol.
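The logging idea can be summarized with a small software sketch (an approximation of ReVive-style undo logging, not the paper's hardware; log_entry_t, mem_writeback(), and rollback_memory() are hypothetical names):

    /* Before a writeback overwrites a memory line, the old contents are
     * appended to an undo log so the line can be reverted on rollback. */
    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64

    typedef struct {
        uintptr_t addr;                   /* line address                  */
        uint8_t   old_data[LINE_BYTES];   /* pre-writeback memory contents */
    } log_entry_t;

    static log_entry_t log_buf[1 << 16];  /* sketch: no overflow handling  */
    static size_t      log_tail;

    /* Called on every dirty-line writeback during a checkpoint interval. */
    void mem_writeback(uint8_t *mem_line, const uint8_t *cache_line, uintptr_t addr)
    {
        log_entry_t *e = &log_buf[log_tail++];
        e->addr = addr;
        memcpy(e->old_data, mem_line, LINE_BYTES);  /* save old memory value */
        memcpy(mem_line, cache_line, LINE_BYTES);   /* perform the writeback */
    }

    /* On rollback, replay the log backwards to revert memory to the checkpoint. */
    void rollback_memory(uint8_t *(*line_at)(uintptr_t addr))
    {
        while (log_tail > 0) {
            log_entry_t *e = &log_buf[--log_tail];
            memcpy(line_at(e->addr), e->old_data, LINE_BYTES);
        }
    }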

Coordinated Local Checkpointing Rules
(Figure: P1 writes x and P2 reads x, so P1 is the producer and P2 the consumer.)
- If P checkpoints, P's producers must also checkpoint.
- If P rolls back, P's consumers must also roll back.
Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96].

Rebound Fault Model
(Figure: chip multiprocessor with off-chip main memory that holds the log, which is maintained in SW.)
- Any part of the chip can suffer transient or permanent faults; a fault can occur even during checkpointing.
- Off-chip memory and the logs suffer no faults on their own (e.g., NVM).
- Fault detection is outside our scope; fault-detection latency has an upper bound of L cycles.

Rebound Architecture
(Figure: chip multiprocessor with per-processor core plus L1, an L2 with its cache controller, the directory cache, and off-chip main memory.)
- Dependence (Dep) registers in the L2 cache controller:
  - MyProducers: bitmap of processors that produced data consumed by the local processor.
  - MyConsumers: bitmap of processors that consumed data produced by the local processor.
- Processor ID in each directory entry:
  - LW-ID: the last writer to the line in the current checkpoint interval.
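As a rough software model (an assumption for illustration only; the struct and field names below are not the paper's hardware), the per-processor Dep registers and the per-line directory state can be pictured like this:

    #include <stdint.h>

    #define NPROC 64

    typedef struct {
        uint64_t my_producers;   /* bit p set: proc p produced data we consumed */
        uint64_t my_consumers;   /* bit p set: proc p consumed data we produced */
    } dep_regs_t;

    typedef struct {
        uint64_t sharers;        /* usual directory sharer bitmap               */
        uint8_t  state;          /* usual directory state (e.g., MESI)          */
        uint8_t  lw_id;          /* last writer in the current interval         */
        uint8_t  lw_valid;       /* whether lw_id may currently be set          */
    } dir_entry_t;

    static dep_regs_t dep[NPROC];  /* one set of Dep registers per processor */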

Recording Inter-Thread Dependences (example; assume a MESI protocol)
1. P1 writes a line: the directory records LW-ID = P1 for the line, which becomes dirty in P1's cache.
2. P2 reads the line: the directory sees LW-ID = P1, so P2's MyProducers records P1 and P1's MyConsumers records P2. The dirty line is written back and its old memory value is logged; the line becomes shared.
3. P1 writes the line again: it regains exclusive ownership and LW-ID remains P1.
4. P1 checkpoints: its dirty lines are written back (logging the old values), its Dep registers are cleared, and the LW-IDs it set are cleared. An LW-ID must remain set until the line is checkpointed.
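Building on the types from the sketch above (again an illustrative software model, not the real directory controller), the dependence-recording step could look like this:

    /* Remember the last writer of a line in the current checkpoint interval. */
    void on_write(int writer, dir_entry_t *line)
    {
        line->lw_id    = (uint8_t)writer;
        line->lw_valid = 1;
    }

    /* On a read that hits a line written in the current interval, record the
     * producer/consumer dependence in both processors' Dep registers. */
    void on_read(int reader, dir_entry_t *line)
    {
        if (line->lw_valid && line->lw_id != reader) {
            int producer = line->lw_id;
            dep[reader].my_producers   |= 1ULL << producer;
            dep[producer].my_consumers |= 1ULL << reader;
        }
    }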

Lazily Clearing Last Writers
- Clearing all LW-IDs at a checkpoint is an expensive process.
- Instead, a Write Signature encodes all line addresses that the processor has written to (or read exclusively) in the current interval.
- At a checkpoint, processors simply clear their Write Signature, so LW-IDs left in the directory may be stale.

Lazily Clearing Last Writers (example)
(Figure: P2 reads a line whose directory LW-ID is still P1 from a previous interval. The line address is checked against P1's Write Signature, which answers NO, so the LW-ID is stale: it is cleared and no dependence is recorded.)
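A Write Signature can be pictured as a small Bloom-filter-like hash of line addresses (the hash functions and sizes below are illustrative assumptions, not the paper's hardware design):

    #include <stdint.h>
    #include <string.h>

    typedef struct { uint64_t bits[16]; } wsig_t;   /* 1024-bit signature */

    static unsigned h1(uintptr_t a) { return (unsigned)((a >> 6) * 0x9E3779B1u) & 1023; }
    static unsigned h2(uintptr_t a) { return (unsigned)((a >> 6) * 0x85EBCA77u) & 1023; }

    void wsig_insert(wsig_t *s, uintptr_t line_addr)
    {
        s->bits[h1(line_addr) >> 6] |= 1ULL << (h1(line_addr) & 63);
        s->bits[h2(line_addr) >> 6] |= 1ULL << (h2(line_addr) & 63);
    }

    /* May return false positives, never false negatives. */
    int wsig_maybe_contains(const wsig_t *s, uintptr_t line_addr)
    {
        return ((s->bits[h1(line_addr) >> 6] >> (h1(line_addr) & 63)) & 1) &&
               ((s->bits[h2(line_addr) >> 6] >> (h2(line_addr) & 63)) & 1);
    }

    void wsig_clear(wsig_t *s) { memset(s, 0, sizeof *s); }  /* done at a checkpoint */

A NO answer from the last writer's signature proves the LW-ID is stale, while a YES conservatively records the dependence.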

Distributed Checkpointing Protocol in SW
- Interaction Set [Pi]: the set of producer processors (transitively) for Pi, built using MyProducers.
- Example: P1 initiates a checkpoint with InteractionSet = {P1}. It sends Ck? requests to its producers P2 and P3, which accept and join, giving InteractionSet = {P1, P2, P3}. A Ck? request forwarded to P4 is declined, so P4 stays out of the interaction set.
- Checkpointing is a 2-phase commit protocol.
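Ignoring the message exchange and the two commit phases, the transitive construction of the interaction set can be sketched over the dep[] bitmaps introduced earlier (an illustrative approximation, not the paper's protocol code):

    /* Returns, as a bitmap, the interaction set for a checkpoint started by p0. */
    uint64_t build_interaction_set(int p0)
    {
        uint64_t set = 1ULL << p0;       /* the initiator is always included */
        uint64_t frontier = set;

        while (frontier) {
            uint64_t next = 0;
            for (int p = 0; p < NPROC; p++)
                if (frontier & (1ULL << p))
                    next |= dep[p].my_producers;   /* ask p's producers to join */
            frontier = next & ~set;                /* only newly reached procs  */
            set |= next;
        }
        return set;
    }

All processors in the returned set would then write back their dirty data, clear their Dep registers and Write Signatures, and commit the checkpoint together.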

Distributed Rollback Protocol in SW
- Rollback is handled like the checkpointing protocol, except that the interaction set is built transitively using MyConsumers.
- Rollback involves clearing the Dep registers and Write Signature, invalidating the processor caches, and restoring the data and register context from the logs up to the latest checkpoint.
- There is no domino effect.

Optimization 1: Delayed Writebacks
(Figure: timeline comparing a checkpoint that stalls to write back dirty lines with one whose writebacks overlap the next interval.)
- Checkpointing overhead is dominated by the data writebacks.
- Delayed Writeback optimization:
  - Processors synchronize and resume execution immediately.
  - Hardware automatically writes back the dirty lines in the background.
  - The checkpoint is only completed when all delayed data has been written back.
  - Inter-thread dependences on delayed data must still be recorded.

Delayed Writeback Pros/Cons
- Pro: significant reduction in checkpoint overhead.
- Con: additional support; each processor has two sets of Dep registers and Write Signatures, and each cache line has a delayed bit (see the sketch below).
- Con: increased vulnerability; a rollback event forces both intervals to roll back.
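Reusing the types from the earlier sketches, the extra per-core state could be modeled roughly as follows (the names are illustrative assumptions):

    typedef struct {
        dep_regs_t dep[2];    /* [0]: committing (delayed) interval, [1]: current */
        wsig_t     wsig[2];   /* matching Write Signatures                        */
        int        cur;       /* index of the currently executing interval        */
    } core_ckpt_state_t;

    /* Each cache line additionally carries a 'delayed' bit marking dirty data
     * that logically belongs to the previous, still-committing interval. */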

Delayed Writeback Protocol (example)
(Figure: P2 reads a line whose directory LW-ID is P1. The line address is checked against both of P1's Write Signatures: the current-interval signature answers NO, but the delayed-interval signature answers YES. The dependence is therefore recorded against P1's delayed interval, P1's MyConsumers0 records P2 and P2's MyProducers records P1, and the line is written back with its old value logged as usual.)

Optimization 2: Multiple Checkpoints
- Problem: fault detection is not instantaneous; a checkpoint is safe only after the maximum fault-detection latency (L) has elapsed.
(Figure: a fault at time tf forces a rollback past Ckpt 2, which is not yet safe, to Ckpt 1; each checkpoint keeps its own Dep registers.)
- Solution: keep multiple checkpoints; on a fault, roll back the interacting processors to safe checkpoints.
- There is no domino effect.
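One way to picture this (an illustrative assumption reusing dep_regs_t from the earlier sketch; the constants and helpers are hypothetical) is a small ring of checkpoints, where a checkpoint's resources can only be recycled once it is older than the fault-detection latency:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_CKPTS 4
    #define L_CYCLES  1000000ULL        /* upper bound on fault-detection latency */

    typedef struct {
        uint64_t   commit_time;         /* cycle at which the checkpoint committed */
        dep_regs_t dep;                 /* dependences recorded in that interval   */
    } ckpt_t;

    static ckpt_t ring[MAX_CKPTS];
    static int    head, count_ckpts;

    bool can_recycle_oldest(uint64_t now)
    {
        return count_ckpts > 0 && now - ring[head].commit_time > L_CYCLES;
    }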

Multiple Checkpoints: Pros/Cons
- Pro: a realistic system; supports non-instantaneous fault detection.
- Con: additional support; each checkpoint has its own Dep registers, which can be recycled only after the fault-detection latency.
- Con: communication must be tracked across checkpoints.
- Con: combined with Delayed Writebacks, one more Dep register set is needed.

Optimization 3: Hiding Checkpoints behind Global Barriers
- Global barriers require that all processors communicate, which leads to global checkpoints.
- Optimization: proactively trigger a global checkpoint at a global barrier and hide the checkpoint overhead behind the barrier's imbalance spins.

Hiding Checkpoint behind Global Barrier

    /* Barrier code from the slide */
    Lock();
    count++;
    if (count == numProc)
        I_am_last = TRUE;        /* local variable */
    Unlock();

    if (I_am_last) {
        count = 0;
        flag = TRUE;             /* release the waiting processors */
        ...
    } else {
        while (!flag) { }        /* spin until released */
    }

Hiding Checkpoint behind Global Barrier (continued)
(Figure: processors P1-P3 reach the barrier code at different times and exchange BarCK?/Notify messages; the interaction set ICHK grows as processors arrive.)
- The first arriving processor initiates the checkpoint.
- For the others, hardware writes back data as their execution proceeds toward the barrier.
- The checkpoint is committed as the last processor arrives.
- After the barrier, few processors are interacting.
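A minimal software sketch of the idea, assuming pthreads and two hypothetical hooks into the checkpoint machinery (trigger_checkpoint() and complete_checkpoint() are not the paper's API):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile int    count;
    static volatile bool   flag;

    void trigger_checkpoint(void);   /* hypothetical: start background writebacks  */
    void complete_checkpoint(void);  /* hypothetical: commit once all data is back */

    void barrier_with_hidden_checkpoint(int numProc)
    {
        bool i_am_first = false, i_am_last = false;

        pthread_mutex_lock(&lock);
        if (count == 0) i_am_first = true;
        if (++count == numProc) i_am_last = true;
        pthread_mutex_unlock(&lock);

        if (i_am_first)
            trigger_checkpoint();    /* first arriver initiates the checkpoint */

        if (i_am_last) {
            complete_checkpoint();   /* last arriver commits it                */
            count = 0;
            flag = true;             /* release the spinning waiters           */
        } else {
            while (!flag) { }        /* spin; HW writes data back meanwhile    */
        }
    }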

Evaluation Setup
- Analysis tool using Pin + the SESC cycle-accurate simulator + DRAMsim.
- Applications: SPLASH-2, some PARSEC, Apache.
- Simulated CMP architecture with up to 64 threads; checkpoint interval: 5-8 ms.
- Modeled environments:
  - Global: baseline global checkpointing.
  - Rebound: local checkpointing scheme with delayed writebacks.
  - Rebound_NoDWB: Rebound without delayed writebacks.

Average Interaction Set (set of producer processors)
(Figure: average interaction-set size per application on 64 processors.)
- For most applications the interaction set is small, which justifies coordinated local checkpointing.
- The averages are pulled up by global barriers.

Checkpoint Execution Overhead
(Figure: checkpoint execution overhead per application.)
- Rebound's average checkpoint execution overhead is 2%, compared to 15% for Global.
- Delayed Writebacks complement local checkpointing.

Rebound Scalability
- Constant problem size as the processor count grows.
- Rebound's checkpoint overhead scales well; Delayed Writebacks help scalability.

Also in the Paper
- Delayed writebacks are also useful in Global checkpointing.
- The barrier optimization is effective but not universally applicable.
- The power increase due to the hardware additions is under 2%.
- Rebound increases coherence traffic by only 4%.

Conclusions
- Rebound: the first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory.
- Leverages the directory protocol.
- Boosts checkpointing efficiency with delayed write-backs, multiple checkpoints, and the barrier optimization.
- Average execution overhead for 64 processors: 2%.
- Future work: apply Rebound to non-hardware-coherent machines; scalability to hierarchical directories.

Rebound: Scalable Checkpointing for Coherent Shared Memory Rishi Agarwal, Pranav Garg, and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu