Fault-tolerance and Availability in Distributed Systems
Distributed Systems, Lecture #11

How to deal with failure
Do nothing
Fail-fast: Ethernet
Fail-safe: traffic light
Fail-soft: Boeing 777
Fail-mask:
– Let’s worry about this some more...

Fail Masking
Must detect the error
– System must be analyzable
– Boundaries must be clearly defined
– Must monitor the “health” of the system
Must correct the error
– Understand the “correct” behavior of the system
– Typically employ redundancy

Analyzability
Use a language that can be easily analyzed
– Constrained, domain-specific languages
– Formal verification systems
– Finite-state automata
Regression testing
Experimental evaluation
– MTTF (mean time to failure)

Monitoring
Useful for black-box analysis
Periodic ping
– De facto system monitoring
– TCP keepalive (the SO_KEEPALIVE socket option)
Performance monitoring
– System slowdown beyond a threshold
– DDoS
Stack state
– Java loop termination
Overhead?
– Must keep monitoring overhead low
Increase or decrease monitoring after failure?
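The periodic-ping idea can be sketched as a timeout-based failure detector. This is a minimal illustration, not the lecture's own code: the class name `HeartbeatMonitor`, the `timeout_s` parameter, and the injectable `clock` are all assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each node (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # injectable so the sketch can be tested
        self.last_seen = {}     # node name -> time of last heartbeat

    def heartbeat(self, node):
        # Call whenever a ping reply (or any message) arrives from `node`.
        self.last_seen[node] = self.clock()

    def suspected(self):
        # A silent node is only *suspected* to have failed: it may be slow.
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)
```

The timeout is the knob behind the slide's overhead question: a shorter timeout (with more frequent pings) detects failures sooner but costs more traffic and raises the false-positive rate.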

Understand failure
What is an ‘error’?
– Slowdown: by how much?
– Inconsistency: consistency semantics?
– Data corruption: checksum
Classification of errors
– Statistical analysis
– False positives: what is an acceptable rate?
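For the data-corruption case, a checksum stored with each record lets a reader detect damage. The sketch below uses CRC-32 from Python's standard `zlib` module; the function names `seal` and `unseal` and the 4-byte trailer layout are assumptions chosen for illustration.

```python
import zlib


def seal(payload: bytes) -> bytes:
    # Append a 4-byte big-endian CRC-32 of the payload.
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unseal(record: bytes) -> bytes:
    # Return the payload, or raise ValueError if the stored CRC disagrees.
    payload, stored = record[:-4], record[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != stored:
        raise ValueError("checksum mismatch: data corrupted")
    return payload
```

A checksum only detects corruption. Correcting it needs redundancy, such as a replica or an error-correcting code.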

Out-of-place recovery
Shadowing
– Keep versions, never replace
– Only update access paths
– Disk space is cheap
Differentials
– For each file, maintain differentials
– Only insertions and deletions
– Update?
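Shadowing can be sketched with an in-memory store in which every write creates a new version and only the access path (an index) is updated in place. `ShadowStore` and its method names are hypothetical. A real system would keep the versions on disk and switch the pointer atomically.

```python
class ShadowStore:
    """In-memory sketch of shadowing (hypothetical names throughout)."""

    def __init__(self):
        self.versions = {}   # name -> list of every version ever written
        self.current = {}    # access path: name -> index of the live version

    def write(self, name, data):
        history = self.versions.setdefault(name, [])
        history.append(data)                    # old versions stay untouched
        self.current[name] = len(history) - 1   # the only in-place update

    def read(self, name):
        return self.versions[name][self.current[name]]

    def recover(self, name):
        # Fall back to the previous version, e.g. after a bad update.
        if self.current[name] == 0:
            raise LookupError("no earlier version to recover")
        self.current[name] -= 1
```

Because old versions are never overwritten, recovery is a pointer change rather than a data repair.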

Fault-recovery
Logging
– Undo
– Redo
Durable
When to flush?
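A minimal sketch of undo/redo logging follows, under several stated assumptions: a Python list stands in for the durable log, `None` marks "key did not exist" (so `None` cannot be stored as a value), and transactions never update the same key concurrently. The names `LoggedStore` and `recover` are illustrative only.

```python
class LoggedStore:
    def __init__(self):
        self.data = {}
        self.log = []   # stands in for a durable, append-only log

    def update(self, txn, key, new):
        # Write-ahead: log the old and new values before changing the data.
        self.log.append(("update", txn, key, self.data.get(key), new))
        self.data[key] = new

    def commit(self, txn):
        self.log.append(("commit", txn))


def recover(log):
    """Rebuild the data from the log after a crash."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    data = {}
    # Redo pass: replay every logged update in order.
    for rec in log:
        if rec[0] == "update":
            _, _txn, key, _old, new = rec
            data[key] = new
    # Undo pass: roll back uncommitted updates, newest first.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _txn, key, old, _new = rec
            if old is None:
                data.pop(key, None)
            else:
                data[key] = old
    return data
```

The "when to flush?" question is where the trade-off sits: a commit is durable only once its log records have reached stable storage, and flushing on every record is safe but slow.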

Fault-correction
Redundancy
– Encode: FEC (forward error correction)
– Replicate
Aha ….
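The simplest form of encoding-based redundancy is a single XOR parity block, which can rebuild any one lost block. The sketch below assumes equal-length blocks and marks a lost block with `None`; the function names are hypothetical, and practical FEC schemes (Reed-Solomon codes, for example) tolerate more losses.

```python
from functools import reduce


def xor_blocks(blocks):
    # Byte-wise XOR of equal-length blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(blocks):
    # Append one parity block to the k data blocks.
    return list(blocks) + [xor_blocks(blocks)]


def repair(coded):
    # Rebuild at most one missing block (marked None) from the survivors.
    missing = [i for i, b in enumerate(coded) if b is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost block")
    coded = list(coded)
    if missing:
        coded[missing[0]] = xor_blocks([b for b in coded if b is not None])
    return coded
```

Compared with full replication, parity costs one extra block for k data blocks instead of k extra blocks, at the price of tolerating fewer failures.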

Overview
Ordering
– Lazy vs. absolute
Transactions
– Two-phase commit
– Three-phase commit
– Quorum-based protocols

Availability and Replication
Global ordering
– Timestamping: absolute
– Vector clocks: causal ordering, more available, but lazy
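Vector clocks capture causal ordering: one event precedes another only if its clock is less than or equal in every component and strictly less in at least one. The sketch below represents a clock as a dict from node name to counter, with missing entries treated as zero; the function names are assumptions for this example.

```python
def vc_tick(clock, node):
    # Local event at `node`: advance that node's own component.
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def vc_merge(a, b):
    # On message receipt: take the component-wise maximum.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def happened_before(a, b):
    nodes = a.keys() | b.keys()
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


def concurrent(a, b):
    # Different clocks that are not ordered either way are concurrent.
    nodes = a.keys() | b.keys()
    differ = any(a.get(n, 0) != b.get(n, 0) for n in nodes)
    return differ and not happened_before(a, b) and not happened_before(b, a)
```

Unlike absolute timestamps, this ordering is partial: concurrent updates are left unordered, so replicas need not wait for a global order but must resolve such conflicts some other way.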

Optimistic Replication
Let everyone make changes
– Only 3% of transactions ever abort
Make changes, send updates
– If someone else’s changes come through with T_him < T_you, your changes are overridden
Wait for a bit before committing
– Deadlocks
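One reading of the override rule is sketched below: each replica holds tentative updates, a conflicting update with an earlier timestamp replaces a pending one, and a replica commits only after the waiting period. The `Replica` class, its method names, and the assumption that timestamps are unique (for example, a counter paired with a node id) are illustrative and not taken from the lecture.

```python
class Replica:
    def __init__(self):
        self.committed = {}   # key -> value
        self.tentative = {}   # key -> (timestamp, value), not yet committed

    def propose(self, key, ts, value):
        """Apply a local or remote update; return True if it now stands."""
        pending = self.tentative.get(key)
        if pending is None or ts < pending[0]:
            self.tentative[key] = (ts, value)   # earlier timestamp wins
            return True
        return False                            # overridden by earlier update

    def commit(self, key):
        # Call once the waiting period has passed with no earlier update.
        _ts, value = self.tentative.pop(key)
        self.committed[key] = value
```

Because the winner depends only on the timestamps, two replicas that see the same updates in different orders end up with the same tentative value.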

Two-Phase Commit
Blocking
How?
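The two phases can be sketched with a failure-free, single-process simulation. The `Participant` class, its state names, and the `two_phase_commit` function are hypothetical. A real implementation would exchange messages over a network and log each step durably.

```python
class Participant:
    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes
        self.state = "init"

    def prepare(self):
        # Phase 1: vote. A "ready" participant has promised it can commit.
        self.state = "ready" if self.vote_yes else "aborted"
        return self.vote_yes

    def finish(self, decision):
        # Phase 2: obey the coordinator's decision.
        self.state = decision


def two_phase_commit(participants):
    # Phase 1 (voting): ask every participant to prepare.
    votes = [p.prepare() for p in participants]
    decision = "committed" if all(votes) else "aborted"
    # Phase 2 (decision): broadcast the outcome to everyone.
    for p in participants:
        p.finish(decision)
    return decision
```

The blocking problem sits between the two loops: a participant in the "ready" state can neither commit nor abort on its own, so if the coordinator crashes before broadcasting the decision, that participant must wait for it to recover.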

Three-phase Commit
