Fault Tolerance (I).


Topics Basic concepts Physical Redundancy Information Redundancy Timing Redundancy RAID

Readings Tanenbaum: 7.1, 7.2

Introduction A characteristic feature of distributed systems that distinguishes them from single-machine systems is the notion of partial failure. A partial failure happens when one component in a distributed system fails. This failure may affect the proper operation of some components while leaving others completely unaffected.

Introduction An important goal in design is to construct the system in such a way that it can automatically recover from partial failures without seriously affecting the overall performance. The distributed system should continue to operate in an acceptable way while repairs are being made.

By the way…. Computing systems are not very reliable OS crashes frequently (Windows), buggy software, unreliable hardware, software/hardware incompatibilities Until recently: computer users were “tech savvy” Could depend on users to reboot, troubleshoot problems

By the way…. Computing systems are not very reliable (cont) Growing popularity of Internet/World Wide Web “Novice” users Need to build more reliable/dependable systems Example: what if your TV (or car) broke down every day? Users don’t want to “restart” TV or fix it (by opening it up) Need to make computing systems more reliable

Characterizing Dependable Systems Dependable systems are characterized by: Availability The percentage of time the system is ready for immediate use Reliability The length of time the system can run continuously without failure; a common measure is the mean time to failure (MTTF) Safety How serious the impact of a failure is Maintainability How long it takes to repair the system Security

Characterizing Dependable Systems Availability and reliability are not the same thing. If a system goes down for a millisecond every hour, it has an availability of over 99.9999 percent, but it is still highly unreliable. A system that never crashes but is shut down for two weeks every August has high reliability but only 96 percent availability.
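The two figures above can be checked with a quick back-of-the-envelope calculation (the downtime numbers are the ones given on this slide):

```python
# Availability = fraction of time the system is ready for use.

# System A: down 1 millisecond every hour (highly available, unreliable).
avail_flaky = 1 - 0.001 / 3600
print(f"{avail_flaky:.7%}")        # over 99.9999 percent

# System B: never crashes, but shut down 2 weeks every August.
avail_august = (52 - 2) / 52
print(f"{avail_august:.1%}")       # about 96 percent
```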

Definitions A system fails when it does not perform according to its specification. An error is the part of a system's state that may lead to a failure. A fault is the cause of an error. In other words: a system fails when it does something unexpected or incorrect, e.g., outputs an incorrect result. The failure is usually due to an error in the system state, e.g., incorrect values. The incorrect values arise because of faults such as a coding mistake or omission.

Definitions Types of Faults Transient Occur once and then disappear. If the operation is repeated, the fault goes away. Example: A bird flying through the beam of a microwave transmitter may cause lost bits on some network (not to mention a roasted bird).

Definitions Types of Faults (continued) Intermittent Occurs and then vanishes of its own accord, then reappears, etc. A loose connector will often cause an intermittent fault. Permanent Continues to exist until the faulty component is repaired. Burnt-out chips, software bugs, and disk head crashes. A fault-tolerant system does not fail in the presence of faults.

Server Failure Models

- Crash failure: A server halts, but was working correctly until it halted.
- Omission failure (receive omission, send omission): A server fails to respond to incoming requests; fails to receive incoming messages; or fails to send messages.
- Timing failure: A server's response lies outside the specified time interval.
- Response failure (value failure, state-transition failure): The server's response is incorrect; either the value of the response is wrong, or the server deviates from the correct flow of control.
- Arbitrary failure: A server may produce arbitrary responses at arbitrary times.

We can put failures into different categories. A crash failure (sometimes called fail-stop) is the easiest to deal with: a server simply runs correctly and then stops; there are no incorrect results, no actions half done. It is usually detected by noting the server's failure to respond. Omission failures occur when an otherwise working server fails to respond; note that it may or may not have received the invocation. Timing failures occur when a response falls outside the specification, e.g., an audio stream delivered so fast that it causes buffers to overflow. Response failures occur when a response is incorrect, either a wrong value or the result of being in a wrong state. Arbitrary or Byzantine failures are the most difficult, as the server, or a group of cooperating servers, can do anything to trick the client.

Server Failure Models Crash Failure (fail-stop) A server halts, but is working correctly until it halts. Example: An OS that comes to a grinding halt and for which there is only one solution: reboot

Server Failure Models Omission Failure This occurs when a server fails to respond to incoming requests, fails to receive incoming messages, or fails to send messages. There are many possible causes of an omission failure, including: The connection between a client and a server has been correctly established, but there was no thread listening for incoming requests. A send buffer overflows. A process that forks a child on every iteration of an infinite loop, exhausting resources. Note that the server must be prepared for the client to reissue its previous request.

Server Failure Models Timing Failures A server’s response lies outside the specified time interval. An e-commerce site may state that the response to a user should be no more than 5 seconds (actually this is too long). In a video-on-demand application, the client is to receive frames at 25 frames per second give or take 2 frames. Timing failures are very difficult to deal with.

Server Failure Models Response Failure A server’s response is incorrect: a wrong reply to a request is returned or when a server reacts unexpectedly to an incoming request. Example: A search engine that systematically returns web pages not related to any of the used search terms. Example: A server receives a message that it cannot recognize.

Server Failure Models Arbitrary (Byzantine) Failures Arbitrary failures occur when a server produces output it should never have produced, but which cannot be detected as being incorrect. A faulty server may even be maliciously working together with other servers to produce intentionally wrong answers.

Server Failure Models Ideally, we want fail-stop processes. A fail-stop process simply stops producing output, in such a way that its halting can be detected by other processes. The server may even be so friendly as to announce that it is about to crash. In reality, processes are rarely that friendly, so we rely on other processes to detect the failure.

Server Failure Models Problem: How do we tell the difference between a process that has halted and a process that is merely slow? Timeouts help, but theoretically you cannot place an exact bound on when to expect a response. If the timeout interval is too high, you delay the system's reaction to a real failure; if it is too low, you risk declaring slow processes dead.
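A minimal sketch of this dilemma, with a hypothetical timeout value (there is no provably correct choice):

```python
TIMEOUT = 2.0  # seconds -- an arbitrary choice; no value is provably "right"

def classify(response_delay):
    """A timeout-based failure detector can only *suspect* a crash."""
    return "alive" if response_delay <= TIMEOUT else "suspected crashed"

print(classify(1.5))   # alive
print(classify(3.0))   # suspected crashed -- but the process may only be slow
```

Lowering TIMEOUT makes the detector react faster but mislabels more slow processes; raising it delays the reaction to real failures.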

Failure Masking by Redundancy If a system is to be fault tolerant, the best it can do is to try to hide the occurrence of failures from other processes. Key technique: Use redundancy Types of redundancy Information redundancy Physical redundancy Time redundancy

Physical Redundancy Extra equipment or processes are added to make it possible for the system as a whole to tolerate the loss or malfunctioning of some components. Physical redundancy can thus be done in either hardware or in software. Examples in hardware: Aircraft: 747’s have 4 engines but fly on 3. Space shuttle: Has 5 computers Electronic circuits

Physical Redundancy Triple modular redundancy.

Physical Redundancy For electronic circuits, each device is replicated three times. Following each stage in the circuit is a triplicated voter. Each voter is a circuit that has three inputs and one output. If two or three of the inputs are the same, the output is equal to that input. If all three inputs are different, the output is undefined. This kind of design is known as TMR (Triple Modular Redundancy).
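The voter just described can be sketched as a simple majority function (a software sketch of the idea, not a circuit description):

```python
def voter(a, b, c):
    """Majority vote over three replica outputs.
    If two or three inputs agree, output that value; otherwise undefined."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None  # all three inputs differ: output undefined

# One faulty replica is masked without any explicit error handling:
print(voter(1, 1, 0))   # 1
print(voter(0, 1, 1))   # 1
print(voter(0, 1, 2))   # None
```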

Physical Redundancy TMR can be applied to any hardware unit. The TMR can completely mask the failure of one hardware unit. No explicit actions need to be performed for error detection, recovery, etc; Particularly suitable for transient failures if we assume the basic TMR scheme (one voter, three replicas).

Physical Redundancy This scheme cannot handle the failure of two units: once one unit fails, it is essential that both remaining units continue to work correctly. The TMR scheme also depends critically on the voting element. The voting element is typically a simple circuit, and highly reliable circuits of such simplicity can be built; still, the failure of a single voter cannot be tolerated.

Physical Redundancy The TMR approach can be generalized to replicating N units. This is called the NMR approach. The larger N is then the higher the number of faults that can be completely masked.

Physical Redundancy The basic TMR/NMR scheme is often complemented with sparing. Sparing is often referred to as stand-by redundancy since the redundant or spare units usually are not operating online. The restoring organ for sparing is a switch. An error detector is also required to determine when the on-line unit has failed. Failed units may be replaced by a spare.

Physical Redundancy Some reliability results: Overall reliability decreases when the degree of redundancy is increased above a certain amount. TMR provides the least potential for reliability improvement. NMR systems with spares provide the highest reliability.

Information Redundancy Coding is often used in information redundancy. Coding has been extensively used for improving the reliability of communication. The basic idea is to add check bits to the information bits such that errors in some bits can be detected, and if possible corrected. The process of adding check bits to information bits is called encoding. The reverse process of extracting information from the encoded data is called decoding.

Information Redundancy Detectability/Correctability of a Code A code defines the set of words that are possible for that code. The Hamming distance of a code is the minimum number of bit positions in which any two words in the code differ. If d is the Hamming distance, D is the number of bit errors the code can detect, and C is the number of bit errors it can correct, then the following relation always holds: d ≥ C + D + 1, with D ≥ C.

Information Redundancy Detectability/Correctability of a Code Let's say that you have a code that looks like this: 000 001 010 011 100 101 110 111 The Hamming distance is one, so you cannot detect any error. Why? Suppose a fault transforms 001 into 011. How do you know this is a fault rather than 011 simply being the correct word?

Information Redundancy Detectability/Correctability of a Code On the other hand, suppose you have the following code of 3 codewords: 0000 0011 1100 If a fault changes one bit of a correct word, the result is a word that is not in the list. This is not true if two bits are changed; hence the code can only detect single faults. Nor can it correct: if 0000 changes to 0010, you know there is an error, but you cannot tell whether the original word was 0000 or 0011.
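A short sketch that computes the Hamming distance of the 3-codeword example above and the resulting detection/correction limits:

```python
from itertools import combinations

def hamming(u, v):
    """Number of bit positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))

def code_distance(code):
    """Minimum Hamming distance over all pairs of codewords."""
    return min(hamming(u, v) for u, v in combinations(code, 2))

d = code_distance(["0000", "0011", "1100"])
print(d)             # 2: can detect up to d - 1 = 1 bit error
print((d - 1) // 2)  # 0: cannot correct any error
```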

Information Redundancy Simple Parity Bits Simple parity bits have been in common use in computer systems for many years. The parity bit is selected so that the total number of 1’s in the codeword is odd (even) for an odd-parity (even-parity) code. This means that the Hamming distance is 2. The parity bit can only detect single bit errors.

Information Redundancy Simple Parity Bits Example (assume odd parity): Data is 000; the parity bit is 1. Data is 001; the parity bit is 0. Data is 010; the parity bit is 0. Thus 000 is transferred as 0001: the parity bit is set to 1, which gives an odd number of ones (remember, we require an odd number of ones).

Information Redundancy Simple Parity Bits All errors involving an odd number of bits can be detected because such errors will produce an incorrect parity.
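The odd-parity scheme above, sketched in a few lines:

```python
def odd_parity_bit(bits):
    """Choose the parity bit so the total number of 1s (data + parity) is odd."""
    return "1" if bits.count("1") % 2 == 0 else "0"

def check(word):
    """A received word is valid iff it still has an odd number of 1s."""
    return word.count("1") % 2 == 1

w = "000" + odd_parity_bit("000")   # "0001"
print(check(w))                     # True
print(check("0011"))                # False: one bit flipped, so an even count
```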

Information Redundancy Hamming Codes Multiple parity bits are added such that each parity bit is a parity of a subset of information bits. The code can detect and also correct errors. Widely used in semiconductor memory and in disk arrays.

Information Redundancy Hamming Codes Parity bits occupy bit positions 1, 2, 4, … (the powers of 2) in the encoding. The remaining positions hold the data bits. Let k be the number of parity bits and m the number of data bits. The word length of the encoded word is m + k.

Information Redundancy Hamming Codes – Example Let k = 3 and m = 4. Bits in positions 1, 2, 4 are the parity bits; label these c1, c2 and c3. Bits in positions 3, 5, 6, 7 are the data bits; label these d1, d2, d3 and d4. The values of the parity bits are defined by the following relations: c1 = d1 ⊕ d2 ⊕ d4 c2 = d1 ⊕ d3 ⊕ d4 c3 = d2 ⊕ d3 ⊕ d4 Position table: 1 (001) c1, 2 (010) c2, 3 (011) d1, 4 (100) c3, 5 (101) d2, 6 (110) d3, 7 (111) d4.

Information Redundancy Hamming Codes – Example Let the word to be transmitted be 1011, i.e., d1 = 1, d2 = 0, d3 = 1, d4 = 1. Then c1 = d1 ⊕ d2 ⊕ d4 = 0, c2 = d1 ⊕ d3 ⊕ d4 = 1, and c3 = d2 ⊕ d3 ⊕ d4 = 0, so the encoded word (positions 1–7: c1 c2 d1 c3 d2 d3 d4) is 0110011.

Information Redundancy Hamming Codes – Example How do we come up with these relations? A Hamming code generator computes the check bits according to the following scheme. The binary representation of the position number j is jk-1 ... j1 j0. The value of a check bit ci is chosen to give odd (or even) parity over all bit positions j such that ji = 1, i.e., bit i of j's binary representation is 1. Thus each bit of the data word participates in several different check bits.

Information Redundancy Hamming Codes – Example Assume the data word arrives as 1111, i.e., the received codeword is 0110111: bit d2 (position 5) was transmitted improperly; it was originally a zero.

Information Redundancy Hamming Codes – Example Locating the bit in error: the check bits recomputed from the relations given above are XORed with the actual check bits received with the code. c1’ = d1 ⊕ d2 ⊕ d4 = 1 c2’ = d1 ⊕ d3 ⊕ d4 = 1 c3’ = d2 ⊕ d3 ⊕ d4 = 1 e1 = c1 ⊕ c1’ = 0 ⊕ 1 = 1 e2 = c2 ⊕ c2’ = 1 ⊕ 1 = 0 e3 = c3 ⊕ c3’ = 0 ⊕ 1 = 1 If every error bit ei is 0, there is no error; otherwise e3 e2 e1 gives the position of the bit in error. Here e3 e2 e1 = 101 = 5, i.e., d2 (the data bit common to c1’ and c3’ as well as to c1 and c3). Correction is done by simply complementing that bit.
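The whole worked example can be reproduced in code; this is a sketch of the (7,4) scheme exactly as laid out on these slides:

```python
def encode(d1, d2, d3, d4):
    """(7,4) Hamming code from the slides: parity bits at positions 1, 2, 4."""
    c1 = d1 ^ d2 ^ d4
    c2 = d1 ^ d3 ^ d4
    c3 = d2 ^ d3 ^ d4
    return [c1, c2, d1, c3, d2, d3, d4]  # positions 1..7

def correct(word):
    """Recompute the check bits; the syndrome gives the 1-based error position."""
    c1, c2, d1, c3, d2, d3, d4 = word
    e1 = c1 ^ d1 ^ d2 ^ d4
    e2 = c2 ^ d1 ^ d3 ^ d4
    e3 = c3 ^ d2 ^ d3 ^ d4
    pos = e3 * 4 + e2 * 2 + e1   # 0 means no error
    if pos:
        word[pos - 1] ^= 1       # complement the bit in error
    return word

sent = encode(1, 0, 1, 1)        # data word 1011 -> codeword 0110011
garbled = sent.copy()
garbled[4] ^= 1                  # flip position 5 (d2), as in the slides
print(correct(garbled) == sent)  # True
```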

Information Redundancy Hamming Codes – Example The use of Hamming codes becomes more efficient, in terms of numbers of bits needed relative to the number of data bits, as the word size increases. If the data word length is 8 bits, the number of check bits will be 4. This overhead is 50%. If the word length is 84 bits, the number of check bits will be 7 giving an overhead of 9 percent.

Information Redundancy Cyclic Redundancy Code (CRC) These codes are applied to a block of data, rather than to independent words. CRCs are commonly used for detecting errors in data communication. A sequence of bits is represented as a polynomial; a fixed generator polynomial defines the code.

Information Redundancy Cyclic Redundancy Code (CRC) If the kth bit is 1, then the polynomial contains x^k. Example: 1100101101 corresponds to x^9 + x^8 + x^5 + x^3 + x^2 + 1. Encoding To the data bit sequence, append r zero bits, where r is the degree of the generator polynomial. The extended data sequence is divided (modulo 2) by the generator polynomial. The final remainder replaces the appended zeros to form the encoded data.

Information Redundancy Cyclic Redundancy Code (CRC) Decoding The extra check bits (the r bits appended during encoding, where r is the degree of the generator polynomial) are simply discarded to obtain the original data bits. Error checking: the data bits are again divided by the generator polynomial and the resulting remainder is compared with the last r bits of the received data. If they differ, an error has occurred.

Information Redundancy Cyclic Redundancy Code (CRC) Through proper selection of the generator polynomial, CRC codes will: Detect all single-bit errors in the data stream Detect all double-bit errors in the data stream Detect any odd number of errors in the data stream Detect any burst error for which the length of the burst is less than the length of the generator polynomial Detect most larger burst errors
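A sketch of the modulo-2 division that underlies CRC encoding and checking; the generator x^3 + x + 1 and the data word below are illustrative choices, not values from the slides:

```python
def crc_remainder(data, generator):
    """Modulo-2 (XOR) long division. Returns the r-bit remainder,
    where r = len(generator) - 1 is the degree of the generator polynomial."""
    r = len(generator) - 1
    bits = list(map(int, data)) + [0] * r      # append r zero bits
    gen = list(map(int, generator))
    for i in range(len(data)):
        if bits[i]:                            # leading bit set: XOR in the generator
            for j, g in enumerate(gen):
                bits[i + j] ^= g
    return "".join(map(str, bits[-r:]))

gen = "1011"                                   # x^3 + x + 1: an illustrative generator
rem = crc_remainder("11010011101100", gen)
codeword = "11010011101100" + rem              # remainder appended to the data
# A valid codeword divides evenly, so re-dividing leaves an all-zero remainder:
print(crc_remainder(codeword, gen))            # 000
```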

Time Redundancy An action is performed and if the need arises, it is performed again. Example: If a transaction aborts, it can be redone with no harm. This is especially useful when the faults are transient or intermittent.

Case Study Let’s look at RAID (Redundant Array of Inexpensive Disks). Motivation: Improve disk access time by using arrays of disks. Disks are getting inexpensive: lower-cost disks have less capacity, but they are cheaper, smaller, and lower power.

Disk Organization 1 Interleaved disks. Used in supercomputing applications: transfer of large blocks of data at high rates. Grouped read: a single read is spread over multiple disks.

Disk Organization 1 What is interleaving? Assume you have 4 disks. Byte interleaving means that byte N is on disk (N mod 4). Block interleaving means that block N is on disk (N mod 4). All reads and writes involve all disks, which is great for large transfers
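The block-interleaving rule above, as a one-line mapping (4 disks, as in the example):

```python
DISKS = 4

def disk_for_block(n):
    """Block interleaving: block n lives on disk n mod DISKS."""
    return n % DISKS

# Blocks round-robin across the array, so a large sequential read of
# blocks 0-7 keeps all four disks busy in parallel:
print([disk_for_block(n) for n in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]
```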

Disk Organization 2 Independent disks. Used in transaction processing applications: the database is partitioned across disks, allowing concurrent reads and writes to independent items.

Problem: Reliability Disk unreliability causes frequent backups. Fault tolerance is needed, otherwise disk arrays are too unreliable to be useful. RAID: Use of extra disks containing redundant information. Similar to redundant transmission of data.

RAID Levels Different levels provide different reliability, cost, and performance. The mean time to failure (MTTF) is a function of total number of disks, number of data disks in a group (G), number of check disks per group (C), and number of groups. The number C is determined by RAID level.

First RAID Level Mirrors Most expensive approach. All disks duplicated (G=1 and C=1). Every write to data disk results in write to check disk. Reads can be from either disk. Double cost and half capacity.

Second RAID Level Data is split at the bit level and spread over data and redundancy (check) disks. Redundant bits are computed using Hamming code and placed in the redundancy disk. Interleave data across disks in a group. Add enough check disks to detect/correct error. Single parity disk detects single error. Makes sense for large data transfers. Small transfers mean all disks must be accessed (to check if data is correct).

Third and Fourth RAID Level The third RAID level is similar to the second except that the data is split at the byte level; there is one parity disk. The fourth RAID level is similar to the third except that the data is split at the block level; there is one parity disk. The fifth RAID level is similar to the fourth except that the check bits are distributed across multiple disks. There are 8 RAID levels.
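The parity used in RAID levels 3 through 5 is a plain XOR across the data disks; a sketch (the disk contents are made-up byte strings):

```python
def parity(blocks):
    """Check block = bytewise XOR of the given blocks (RAID 3/4/5 style)."""
    out = blocks[0]
    for b in blocks[1:]:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

data = [b"\x01\x02", b"\x0f\x00", b"\xa0\x55"]   # three data disks
p = parity(data)                                 # contents of the parity disk

# Disk 1 fails; its contents are the XOR of the survivors and the parity:
rebuilt = parity([data[0], data[2], p])
print(rebuilt == data[1])   # True
```

The same XOR both detects a bad stripe (parity mismatch) and reconstructs any single lost disk, which is why one check disk per group suffices.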

Process Resilience The key approach to tolerating a faulty process is to organize several identical processes into a group: when a message is sent to the group itself, all members of the group receive it, so if one member fails the others can take over. Design issues include how process groups are formed and managed.

Problems of Agreement A set of processes needs to agree on a value (decision) after one or more processes have proposed what that value (decision) should be. Examples: mutual exclusion, election, transactions. Processes may be correct, crashed, or may exhibit arbitrary (Byzantine) failures. Messages are exchanged on a one-to-one basis, and they are not signed.

Problems of Agreement The general goal of distributed agreement algorithms is to have all the nonfaulty processes reach consensus on some issue, and to establish that consensus within a finite number of steps. What if processes exhibit Byzantine failures? The name alludes to the Byzantine Empire, in which conspiracies, intrigue, and untruthfulness were alleged to be common in ruling circles.

The Two-Army Problem How can two perfect processes reach agreement about 1 bit of information ? … over an unreliable communication channel Red army: 5000 troops Blue army #1, #2: 3000 troops each How can the blue armies reach agreement on when to attack ? Their only means of communication is by sending messengers … that may be captured by the enemy ! No solution!

The Two-Army Problem Proof by contradiction: Assume there is a solution with a minimum number of messages. Suppose the commander of blue army 1 is General Alexander and the commander of blue army 2 is General Bonaparte. General Alexander sends a message to General Bonaparte reading “I have a plan; let’s attack at dawn tomorrow”. The messenger gets through, and Bonaparte sends him back with a note saying “Splendid idea, Alex. See you at dawn tomorrow.” The messenger gets back.

The Two-Army Problem Proof by contradiction (cont) Alexander wants to make sure that Bonaparte knows the messenger got back safely, so that Bonaparte is confident Alexander will attack. Alexander tells the messenger to go tell Bonaparte that his message arrived and the battle is set. The messenger gets through, but now Bonaparte worries that Alexander does not know whether the acknowledgement got through. So Bonaparte acknowledges the acknowledgement. And so on, forever: no finite protocol suffices.

History Lesson: The Byzantine Empire Time: 330–1453 AD. Place: the Balkans and modern Turkey. Endless conspiracies, intrigue, and untruthfulness were alleged to be common practice in the ruling circles of the day. That is: it was typical for intentionally wrong and malicious activity to occur among the ruling group. A similar occurrence can surface in a DS, and is known as a ‘Byzantine failure’. Question: how do we deal with such malicious group members within a distributed system?

Byzantine Generals Problem Now assume that the communication is perfect but the processes are not. This problem also occurs in military settings and is called the Byzantine Generals Problem. We still have the red army, but n blue generals. Communication is done pairwise by phone; it is instantaneous and perfect. m of the generals are traitors (faulty) and are actively trying to prevent the loyal generals from reaching agreement by feeding them incorrect and contradictory information. Is agreement still possible?

Byzantine Generals Problem We will illustrate by example where there are 4 generals, where one is a traitor (analogous to a faulty process). Step 1: Every general sends a (reliable) message to every other general announcing his troop strength. Loyal generals tell the truth. Traitors tell every other general a different lie. Example: general 1 reports 1K troops, general 2 reports 2K troops, general 3 lies to everyone (giving x, y, z respectively) and general 4 reports 4K troops.

Byzantine Generals Problem

Byzantine Generals Problem Step 2: The results of the announcements of step 1 are collected together in the form of vectors.

Byzantine Generals Problem

Byzantine Generals Problem Step 3 Consists of every general passing his vector from the previous step to every other general. Each general gets three vectors from each other general. General 3 hasn’t stopped lying. He invents 12 new values: a through l.

Byzantine Generals Problem

Byzantine Generals Problem Step 4 Each general examines the ith element of each of the newly received vectors. If any value has a majority, that value is put into the result vector. If no value has a majority, the corresponding element of the result vector is marked UNKNOWN.
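Step 4's majority rule can be sketched directly. The vectors below are the ones general 1 receives in the slides' example; the single letters stand in for the traitor's fabricated values:

```python
from collections import Counter

def majority_or_unknown(values):
    """Step 4: element-wise majority over the received vectors,
    or UNKNOWN if no value has a strict majority."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else "UNKNOWN"

# What general 1 heard from generals 2, 3, 4 about each general's strength
# (general 3 is the traitor; e, f, g, x, y, z are his fabricated values):
reports = [
    ("1K", "e", "1K"),   # about general 1
    ("2K", "f", "2K"),   # about general 2
    ("x",  "y",  "z"),   # about general 3: all differ, no majority
    ("4K", "g", "4K"),   # about general 4
]
print([majority_or_unknown(r) for r in reports])
# ['1K', '2K', 'UNKNOWN', '4K']
```

The one traitor cannot prevent the loyal generals from agreeing on each loyal general's strength; only the traitor's own entry stays UNKNOWN.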

Byzantine Generals Problem The same as in previous example, except now with 2 loyal generals and one traitor.

Byzantine Generals Problem With m faulty processes, agreement is possible only if 2m + 1 processes function correctly, for a total of 3m + 1 processes. If messages cannot be guaranteed to be delivered within a known, finite time, no agreement is possible if even one process is faulty. Why? Slow processes are indistinguishable from crashed ones.

Byzantine Generals Problem Let f be the number of faults to be tolerated. The algorithm needs f + 1 rounds. In each round, a process sends to all the other processes the values that it received in the previous round. The number of messages sent is on the order of O(N^(f+1)), where N is the number of generals. If you do not assume Byzantine faults, you need a lot less infrastructure.