Fault-tolerance and Availability in Distributed Systems Distributed Systems Lecture # 11.


1 Fault-tolerance and Availability in Distributed Systems
Distributed Systems, Lecture # 11

2 How to deal with failure
Do nothing
Fail-fast: Ethernet
Fail-safe: traffic light
Fail-soft: Boeing 777
Fail-mask:
  – Let’s worry about this some more...

3 Fail Masking
Must detect error
  – System must be analyzable
  – Boundaries must be clearly defined
  – Must monitor the “health” of the system
Must correct the error
  – Understand the “correct” behavior of the system
  – Typically employ redundancy

4 Analyzability
Use a language that can be easily analyzed
  – Constrained, domain-specific languages
  – Formal verification systems
  – Finite-state automata
Regression testing
Experimental evaluation
  – MTTF (mean time to failure)

5 Monitoring
Useful for black-box analysis
Periodic ping
  – De facto system monitoring
  – TCP_KEEP_ALIVE
Performance monitoring
  – System slowdown beyond a threshold
  – DDoS
Stack state
  – Java loop termination
Overhead?
  – Must keep monitoring overhead low
Increase or decrease monitoring after failure?
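The periodic-ping idea can be sketched as a timeout-based failure detector. This is a minimal illustration, not the lecture's own design: the `HeartbeatMonitor` class, the 3-second timeout, and the injected fake clock are all assumptions made so the example runs deterministically without a network.

```python
import time


class HeartbeatMonitor:
    """Records the last heartbeat per node and flags nodes that go silent."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so tests need not sleep
        self.last_seen = {}         # node name -> time of last heartbeat

    def heartbeat(self, node):
        """Called whenever a ping reply (or any message) arrives from node."""
        self.last_seen[node] = self.clock()

    def suspected(self):
        """Return nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)


# Usage with a fake clock (hypothetical node names "a" and "b").
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
monitor.heartbeat("a")
monitor.heartbeat("b")
fake_now[0] = 2.0
monitor.heartbeat("a")      # "a" keeps answering, "b" goes silent
fake_now[0] = 4.0
suspects = monitor.suspected()
```

A shorter timeout detects failures sooner but raises both the monitoring overhead and the risk of suspecting a node that is merely slow.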

6 Understand failure
What is an ‘error’?
  – Slowdown
      By how much?
  – Inconsistency
      Consistency semantics?
  – Data corruption
      Checksum
Classification of errors
  – Statistical analysis
  – False positives
      What is an acceptable rate?
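Checksum-based detection of data corruption might look like the sketch below. It uses CRC-32 from Python's standard `zlib` module; the `seal` and `verify` helper names and the 4-byte trailer layout are assumptions chosen for illustration.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload as a trailer."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(record: bytes) -> bool:
    """Recompute the CRC-32 and compare it with the stored trailer."""
    if len(record) < 4:
        return False
    payload, stored = record[:-4], record[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


record = seal(b"balance=100")
# Flip a single bit in the first byte to simulate corruption.
corrupted = bytes([record[0] ^ 0x01]) + record[1:]
```

A CRC catches accidental corruption only. It offers no protection against deliberate tampering, which would need a keyed MAC or a signature.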

7 Out of place recovery
Shadowing
  – Keep versions, never replace
  – Only update access paths
  – Disk space is cheap
Differentials
  – For each file, maintain differentials
  – Only insertions, deletions
  – Update?
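Shadowing can be illustrated with a small in-memory store in which every write creates a new version and only the access path (a name-to-version pointer) changes. The `ShadowStore` class and its method names are hypothetical. A real system would keep the versions on disk and switch the pointer atomically.

```python
class ShadowStore:
    """Out-of-place updates: old versions are kept, never overwritten."""

    def __init__(self):
        self.versions = {}   # version id -> content (written once)
        self.current = {}    # file name -> version id (the access path)
        self.next_id = 0

    def write(self, name, content):
        vid = self.next_id
        self.next_id += 1
        self.versions[vid] = content   # new copy; the old one stays intact
        self.current[name] = vid       # the only in-place change
        return vid

    def read(self, name, version=None):
        vid = self.current[name] if version is None else version
        return self.versions[vid]

    def rollback(self, name, version):
        """Recover by pointing the access path back at an older version."""
        self.current[name] = version


store = ShadowStore()
v0 = store.write("config", "old")
v1 = store.write("config", "new")
latest = store.read("config")
previous = store.read("config", v0)
```

If a crash happens before the pointer update, readers still see the previous version in full, which is what makes recovery trivial.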

8 Fault-recovery
Logging
  – Undo
  – Redo
Durable
When to flush?
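A minimal sketch of undo/redo recovery from a log follows. The record layout (`("update", tx, key, old, new)` and `("commit", tx)`) is an assumption made for this example, as is the simplification that concurrent transactions never write the same key (for instance because of strict locking).

```python
def recover(log, store):
    """Bring store to a consistent state after a crash, using the log."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Redo pass (forward): reapply every update of a committed transaction,
    # since its changes may not have reached the store before the crash.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _tx, key, _old, new = rec
            store[key] = new

    # Undo pass (backward): roll back updates of uncommitted transactions,
    # since some of their changes may already have reached the store.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _tx, key, old, _new = rec
            if old is None:
                store.pop(key, None)   # key did not exist before
            else:
                store[key] = old
    return store


log = [
    ("update", "t1", "x", None, 1),
    ("commit", "t1"),
    ("update", "t2", "x", 1, 2),     # t2 never committed
    ("update", "t2", "y", None, 9),
]
# State found on disk after the crash: t2's writes were flushed early.
recovered = recover(log, {"x": 2, "y": 9})
```

The flush question is the durability trade-off: the log record must reach stable storage before the data it describes, and a commit record must be flushed before the commit is acknowledged.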

9 Fault-correction
Redundancy
  – Encode
      FEC (forward error correction)
  – Replicate
      Aha...
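As a sketch of redundancy by encoding, the simplest erasure code is a single XOR parity block, which lets any one lost block be rebuilt from the survivors. This is an illustrative stand-in for FEC in general, not a scheme named in the lecture. The helper names are hypothetical and all blocks are assumed to have equal length.

```python
def xor_parity(blocks):
    """XOR equal-length blocks together, byte by byte."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def rebuild(surviving_blocks, parity):
    """Recover the single missing block from the survivors and the parity."""
    return xor_parity(list(surviving_blocks) + [parity])


data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(data)
# Suppose the middle block is lost.
recovered_block = rebuild([data[0], data[2]], parity)
```

Parity costs one extra block for the whole group, whereas full replication costs one extra copy per block. The price is that parity tolerates only one loss per group.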

10 Overview
Ordering
  – Lazy vs. absolute
Transactions
  – Two-phase commit
  – Three-phase commit
  – Quorum-based protocols

11 Availability and Replication
Global ordering
  – Timestamping
      Absolute
  – Vector clocks
      Causal ordering
      More available
      But lazy
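Vector clocks can be sketched as dictionaries mapping node names to counters. The function names and the two-node scenario below are assumptions for illustration. The comparison rule is the standard one: event a causally precedes event b when a's clock is less than or equal to b's in every component and the clocks differ.

```python
def tick(clock, node):
    """Return a copy of clock with node's own counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(local, received, node):
    """On message receipt: take the component-wise max, then tick locally."""
    keys = set(local) | set(received)
    merged = {k: max(local.get(k, 0), received.get(k, 0)) for k in keys}
    return tick(merged, node)


def leq(a, b):
    """True if every component of a is <= the matching component of b."""
    return all(a.get(k, 0) <= b.get(k, 0) for k in set(a) | set(b))


def happened_before(a, b):
    return leq(a, b) and not leq(b, a)


def concurrent(a, b):
    return not leq(a, b) and not leq(b, a)


a1 = tick({}, "A")           # local event on node A
b1 = tick({}, "B")           # independent local event on node B
b2 = merge(b1, a1, "B")      # B receives a message carrying A's clock
```

Here `a1` and `b1` are concurrent, so no order is imposed on them, while `a1` causally precedes `b2`. That partial order is what allows replicas to proceed without agreeing on a single absolute timestamp.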

12 Optimistic Replication
Let everyone make changes
  – Only 3% of transactions ever abort
Make changes, send updates
  – If someone else’s changes come through with T_him < T_you, your changes are overridden
Wait for a bit before committing
  – Deadlocks
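The override rule can be sketched as follows. This is one possible reading of the slide, under loudly stated assumptions: the `Replica` class is hypothetical, timestamps are unique, and every update is held as tentative during a waiting window, with the earliest timestamp winning any conflict on the same key.

```python
class Replica:
    """Applies changes optimistically and resolves conflicts by timestamp."""

    def __init__(self):
        self.committed = {}   # key -> value
        self.tentative = {}   # key -> (timestamp, value)

    def propose(self, key, value, ts):
        """Record a local or remote change; the smaller timestamp wins."""
        current = self.tentative.get(key)
        if current is None or ts < current[0]:
            self.tentative[key] = (ts, value)

    def commit(self):
        """After the waiting window, make the surviving changes permanent."""
        for key, (_ts, value) in self.tentative.items():
            self.committed[key] = value
        self.tentative.clear()


r1, r2 = Replica(), Replica()
# Replica 1 sees its own change first, then the remote one.
r1.propose("x", "mine", ts=5)
r1.propose("x", "theirs", ts=3)
# Replica 2 sees the same two changes in the opposite order.
r2.propose("x", "theirs", ts=3)
r2.propose("x", "mine", ts=5)
r1.commit()
r2.commit()
```

Because the rule depends only on the timestamps and not on arrival order, both replicas converge on the same value, provided every update arrives before the window closes.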

13 Two-phase Commit
Blocking
How?
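A minimal in-process sketch of the two-phase commit message flow is given below. The `Participant` class and the `will_vote_yes` flag are hypothetical. A real implementation would also log each state change durably and handle timeouts, both omitted here.

```python
class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "init"

    def prepare(self):
        """Phase 1: vote. A yes vote is a promise to commit if told to."""
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (voting): the coordinator asks every participant to prepare.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2 (decision): commit only if every vote was yes, else abort.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return "commit" if decision else "abort"


ok_group = [Participant("p1"), Participant("p2")]
ok_result = two_phase_commit(ok_group)

bad_group = [Participant("p1"), Participant("p2", will_vote_yes=False)]
bad_result = two_phase_commit(bad_group)
```

The blocking problem arises between the two phases: a participant in the `prepared` state has given up the right to abort on its own, so if the coordinator crashes before sending the decision, that participant must hold its locks and wait.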

14 Three-phase Commit

