A Fusion-based Approach for Tolerating Faults in Finite State Machines

Vinit Ogale, Bharath Balasubramanian
Parallel and Distributed Systems Lab, Electrical and Computer Engineering Dept., University of Texas at Austin
Vijay K. Garg
IBM India Research Lab

Outline
Motivation
Related Work
Questions and Issues Addressed
Model
Partition Lattice
Fault Graphs
Fault Tolerance in FSMs and (f,m)-fusion
Algorithms: Generating Backups and Recovery
Implementation Results
Conclusion and Future Work

Speaker notes: In distributed systems, it is often necessary to maintain the execution state of servers in case of faults. We provide a space-efficient solution to this problem. To delve further: a program consists of 2 components; earlier work covered data structures, so this was a natural progression. We mainly target settings where space is at a premium.

Motivation
Many real applications are modeled as FSMs:
Embedded systems: traffic controllers, home appliances
Sensor networks: e.g., hundreds of sensors (temperature, pressure, etc.) need to be backed up

Problem
Given a set of finite state machines (FSMs), some FSMs may either crash (fail-stop faults) or lie about their execution state (Byzantine faults).
[Figure: two example FSMs, a counter counting 'a's (states a0, a1, a2) and a counter counting 'b's (states b0, b1, b2)]

Existing Solution - Replicate
n.f extra FSMs to tolerate f crash faults; 2.n.f extra FSMs to tolerate f Byzantine faults (where n is the number of original FSMs).
[Figure: 1-crash fault tolerant setup by replication, for the counter counting 'a's and the counter counting 'b's]

Related Work
Traditional approach, redundancy: n.k backup machines to tolerate k faults in n machines
Fault Tolerance in Finite State Machines using Fusion (Balasubramanian, Ogale, Garg 08): exponential algorithm for generating machines that can tolerate crash faults; number of faults = number of machines
Fusible Data Structures (Garg, Ogale 06): fuse common data structures such as linked lists and hash tables; the fused structure is smaller than the sum of the original structures
Erasure coding: fault tolerance in data; fusions are erasure codes

Reachable Cross Product
[Figure: the reachable cross product R(A, B) of {A, B}, where A is the counter counting 'a's and B is the counter counting 'b's (states b0, b1, b2); its states are pairs such as <a1, b0>, <a1, b1>, <a1, b2>, <a2, b0>, <a2, b1>, <a2, b2>]
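As a rough illustration of the construction, the sketch below builds the reachable cross product of the two counters by breadth-first search. It assumes that the counters wrap modulo 3 and that a machine stays put on events it does not count. The function name reachable_cross_product and the (initial state, transition map) encoding are hypothetical, not taken from the slides.

```python
from collections import deque

def reachable_cross_product(machines, events):
    """Breadth-first search over tuples of component states.

    Each machine is (initial_state, delta), where delta maps
    (state, event) to the next state. An event that a machine does
    not list leaves that machine where it is (assumed convention).
    """
    start = tuple(init for init, _ in machines)
    seen = {start}
    queue = deque([start])
    transitions = {}
    while queue:
        state = queue.popleft()
        for e in events:
            nxt = tuple(
                delta.get((s, e), s) for s, (_, delta) in zip(state, machines)
            )
            transitions[(state, e)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, transitions

# Mod-3 counters for 'a' and 'b' (the wrap-around is an assumption).
counter_a = ("a0", {(f"a{i}", "a"): f"a{(i + 1) % 3}" for i in range(3)})
counter_b = ("b0", {(f"b{i}", "b"): f"b{(i + 1) % 3}" for i in range(3)})

states, transitions = reachable_cross_product([counter_a, counter_b], ["a", "b"])
print(len(states))  # 9
```

For these two independent counters every pair of states is reachable, so the product has all 9 states; machines that share events can have far fewer reachable pairs.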

Can We Do Better?
[Figure: the counter counting 'a's (mod 3), the counter counting 'b's (mod 3), and a fused machine F1 that tracks (a + b) modulo 3; example input "a a b"]

Can We Do Better?
[Figure: 2-crash fault tolerant setup, consisting of the two counters (mod 3) together with F1, which tracks (a + b) modulo 3, and F2, which tracks (a - b) modulo 3]
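One way to see why this setup survives two crashes: any two of the four machines pin down the pair (a mod 3, b mod 3). The sketch below checks this by brute force. The helper names machine_states and recover are illustrative assumptions, not part of the slides.

```python
from itertools import combinations, product

def machine_states(a, b):
    """States of the two counters and the two fused machines for counts a, b."""
    return {"A": a % 3, "B": b % 3, "F1": (a + b) % 3, "F2": (a - b) % 3}

def recover(known):
    """Return every (a, b) consistent with the surviving machines' states."""
    return [
        (a, b)
        for a, b in product(range(3), repeat=2)
        if all(machine_states(a, b)[name] == value for name, value in known.items())
    ]

# Input "a a b": a = 2, b = 1. Suppose both counters crash.
full = machine_states(2, 1)
survivors = {name: full[name] for name in ("F1", "F2")}
print(recover(survivors))  # [(2, 1)]

# Whichever two machines crash, the remaining two leave exactly one candidate.
unique = all(
    len(recover({n: machine_states(a, b)[n] for n in pair})) == 1
    for a, b in product(range(3), repeat=2)
    for pair in combinations(("A", "B", "F1", "F2"), 2)
)
print(unique)  # True
```

Replication would need four extra 3-state counters for the same guarantee, whereas here two extra 3-state machines suffice.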

Questions and Issues addressed
Can we do better than the cross product?
How many faults can be tolerated?
What is the minimum number of machines required to tolerate f crash faults?
Can these machines tolerate Byzantine faults? (For example, in the previous slide, DFSMs A and B along with F1 and F2 can tolerate one Byzantine fault.)
Main aims:
Develop theory to understand and define this problem
Efficient algorithms based on this theory to generate backup machines

Application Scenario: Sensor Network
1000 sensors (simple counters), each recording a parameter (temperature, pressure, etc.)
Sensors will be collected later and their data analyzed offline
10 sensors are expected to crash
Replication requires 1000 x 10 backup sensors to ensure fault tolerant operation
Can we use just 10 extra sensors instead of the 1000 x 10 that replication requires?
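The replication cost quoted here follows the n.f and 2.n.f formulas from the "Existing Solution" slide. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def replication_backups(n, f, byzantine=False):
    """Backups needed by replication: n*f for crash faults, 2*n*f for Byzantine."""
    return (2 if byzantine else 1) * n * f

print(replication_backups(1000, 10))        # 10000
print(replication_backups(1000, 10, True))  # 20000
```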

Model
FSMs (machines) execute independently (in parallel)
The inputs to an FSM are not determined by any other FSM
FSMs act concurrently on the same set of events
Fail-stop (crash) faults: loss of current state, underlying FSM intact
Byzantine faults: machines can "lie" about their current state

Join of Two FSMs
Join (⊔): reachable cross product; 4 states in this case instead of 9

Less Than or Equal To Relation (≤)
Given FSMs A and B: A ≤ B if and only if A ⊔ B = B
Given the state of B, we can determine the current state of A

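Partitions
Given any FSM, we can partition its states into blocks such that the transitions for all states in a block are consistent: on any input, all states of a block move into a single block.
E.g., suppose states t0 and t3 have to be combined into one block.
[Figure: FSM with states t0, t1, t2, t3 and transitions on Input 0 and Input 1]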

Largest Consistent Partition Containing {t0,t3}

Largest Consistent Partition Containing {t0,t1}
[Figure: partition containing the blocks {t0, t1, t2} and {t3}]
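A minimal sketch of how such a partition can be computed: merge the requested states, then keep merging successor blocks until the partition is consistent. "Largest" is read here as the finest consistent partition (the machine with the most states) that puts the given states in one block. The transition table below is hypothetical: it was chosen so that the two merges reproduce the blocks named on these slides, and it is not the slides' actual machine.

```python
from itertools import combinations

def largest_consistent_partition(delta, merge):
    """Finest consistent partition in which the states in `merge` share a block.

    delta[s][e] is the successor of state s on input e.
    """
    n = len(delta)
    parent = list(range(n))  # union-find forest over states

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    def union(s, t):
        u, v = find(s), find(t)
        if u != v:
            parent[max(u, v)] = min(u, v)
        return u != v

    for s in merge[1:]:
        union(merge[0], s)

    changed = True
    while changed:  # repeat until no block is split by any input
        changed = False
        for s, t in combinations(range(n), 2):
            if find(s) == find(t):
                for e in range(len(delta[0])):
                    if union(delta[s][e], delta[t][e]):
                        changed = True

    blocks = {}
    for s in range(n):
        blocks.setdefault(find(s), []).append(s)
    return sorted(blocks.values())

# Hypothetical machine T with states t0..t3 and inputs 0 and 1.
delta = [[1, 3], [2, 3], [2, 3], [1, 0]]

print(largest_consistent_partition(delta, [0, 3]))  # [[0, 3], [1], [2]]
print(largest_consistent_partition(delta, [0, 1]))  # [[0, 1, 2], [3]]
```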

Partition Lattice
The set of all FSMs corresponding to partitions of a given FSM (say T) forms a lattice with respect to the ≤ relation [HarSte66]. That is, for any two FSMs A and B formed by partitioning T, there exists a unique C ≤ T for each of the following:
C = A ⊔ B (join): A ≤ C and B ≤ C, and C is the smallest such FSM
C = A ⊓ B (meet): C ≤ A and C ≤ B, and C is the largest such FSM

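[Figure: lattice of FSMs obtained by partitioning T, with the top element ⊤ (states t0, t1, t2, t3), the machines F1 (A), F2 (B), F3, F4 and S1, S2, S3, S4 below it, and the bottom element ⊥ (single block t0,t1,t2,t3)]
Speaker note: add that the original machine can also be found in the lattice.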

Top Element (⊤)
Given a set of FSMs A = {A1, …, An}: ⊤ = A1 ⊔ A2 ⊔ … ⊔ An
All FSMs we consider henceforth are less than or equal to ⊤
Intuitively, ⊤ has information about the state of every machine in the original set A
Speaker note: intuition, replication

Bottom Element of Lattice (⊥)
Single-state FSM: contains one block with all the states
On any input it transitions to itself
Conveys no information about the current state of any machine

[Figure: the complete lattice. Top: ⊤ with states t0, t1, t2, t3. Next level: F1, F2, F3, F4, each with three blocks. Next level: S1, S2, S3, S4, each with two blocks. Bottom: ⊥ with the single block t0,t1,t2,t3]

Tolerating Faults
[Figure: machines F1 and F2]

Tolerating Faults
[Figure: machines F1 and F2, one of them marked X (faulty), and T, the reachable cross product, with states t0, t1, t2, t3]

Fault Graph: Fault tolerance indicator
[Figure: fault graph G(A, T) for A = {F1, F2} (the original machines), where T is the reachable cross product with states t0, t1, t2, t3; the six edges carry weights 1, 1, 2, 2, 2, 2]

[Figure: the lattice with A = {FSMs in yellow region}, next to the fault graph on t0, t1, t2, t3 with edge weights 1, 1, 2, 2, 2, 2]

Hamming Distance
Hamming distance d(ti, tj): weight of the edge separating the states (ti, tj) in the fault graph, e.g. d(t0, t1) = 2
Minimum Hamming distance dmin(T, A): the weight of the weakest edge in the fault graph, e.g. dmin(T, A) = 1
[Figure: fault graph on t0, t1, t2, t3 with edge weights 1, 1, 2, 2, 2, 2; dmin(T, A) = 1]
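The fault graph can be sketched in a few lines, assuming that an edge weight is the number of machines in A that place the two states in different blocks. The block structure used for F1 and F2 below is an assumption read from the lattice figure; with it, the computed weights match the values quoted on this slide.

```python
from itertools import combinations

# Each machine is a partition of T's states t0..t3 (written 0..3) into blocks.
# Assumed block structure, read from the lattice figure.
F1 = [{0, 3}, {1}, {2}]
F2 = [{0}, {1}, {2, 3}]

def separates(machine, s, t):
    """True if the machine has s and t in different blocks."""
    return not any(s in block and t in block for block in machine)

def fault_graph(machines, n_states):
    """Map each pair of states to the number of machines that separate it."""
    return {
        (s, t): sum(separates(m, s, t) for m in machines)
        for s, t in combinations(range(n_states), 2)
    }

graph = fault_graph([F1, F2], 4)
dmin = min(graph.values())

print(graph[(0, 1)])  # 2
print(dmin)           # 1
```

The two weakest edges are (t0, t3) and (t2, t3): losing F2 makes t0 and t3 indistinguishable, and losing F1 does the same for t2 and t3.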

Fault Tolerance in FSMs (crash faults)
Theorem 1: A set of machines A can tolerate up to f crash faults iff dmin(T(A), A) > f
E.g., for A = {A, B, M1, M2}, dmin(T(A), A) = 3, so the set can tolerate 2 crash faults
[Figure: fault graph on t0, t1, t2, t3 with edge weights 3, 3, 3, 3, 4, 4; dmin(T(A), A) = 3]

Fault Tolerance in FSMs (Byzantine faults)
Theorem 2: A set of machines A can tolerate up to f Byzantine faults iff dmin(T(A), A) > 2f
E.g., A = {A, B, M1, M2}. Let the machines report the following states: A = {t0, t3}, B = {t0}, M1 = {t0, t2}, M2 = {t3}
B and M1 are lying about their state (f = 2)
Since dmin(T(A), A) = 3 < 4 = 2f, we cannot determine the state of T
[Figure: fault graph on t0, t1, t2, t3 with edge weights 3, 3, 3, 3, 4, 4; dmin(T(A), A) = 3]

Fault Tolerance in FSMs (Byzantine faults)
Let the machines report the following states: A = {t0, t3}, B = {t0}, M1 = {t3}, M2 = {t3}
Only B is lying about its state (f = 1)
Since dmin(T(A), A) = 3 > 2 = 2f, we can determine the state of T as t3
Henceforth, dmin(T(A), A) is abbreviated as dmin(A)
[Figure: fault graph on t0, t1, t2, t3 with edge weights 3, 3, 3, 3, 4, 4; dmin(T(A), A) = 3]
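Theorems 1 and 2 translate directly into the number of faults a given dmin can absorb. A small sketch, with hypothetical function names:

```python
def crash_tolerance(dmin):
    """Largest f with dmin > f (Theorem 1)."""
    return max(dmin - 1, 0)

def byzantine_tolerance(dmin):
    """Largest f with dmin > 2f (Theorem 2)."""
    return max(dmin - 1, 0) // 2

print(crash_tolerance(3), byzantine_tolerance(3))  # 2 1
```

With dmin = 3, as for {A, B, M1, M2}, two liars cannot be handled but one can, which matches the two examples above.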

Fault Tolerance and (f,m)-fusion
Given a set of n machines A, the set of m machines F is an (f,m)-fusion of A if dmin(A ∪ F) > f
The set of machines in A ∪ F can tolerate f crash faults or ⌊f/2⌋ Byzantine faults
E.g., A = {A, B}, F = {M1, M2}, dmin(A ∪ F) = 3, so F = {M1, M2} is a (2,2)-fusion of A
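The definition can be checked mechanically once each machine is written as a partition of T's states. In the sketch below the block structures of A, B, M1, M2 and S4 are assumptions inferred from the lattice and fault-graph figures (they reproduce the edge weights shown there), and is_fusion is a hypothetical helper name.

```python
from itertools import combinations

def dmin(machines, n_states):
    """Smallest number of machines that separate any pair of states of T."""
    return min(
        sum(not any(s in blk and t in blk for blk in m) for m in machines)
        for s, t in combinations(range(n_states), 2)
    )

def is_fusion(originals, backups, f, n_states):
    """True if `backups` is an (f, len(backups))-fusion of `originals`."""
    return dmin(originals + backups, n_states) > f

# Assumed partitions of t0..t3 (written 0..3).
A = [{0, 3}, {1}, {2}]
B = [{0}, {1}, {2, 3}]
M1 = [{0, 2}, {1}, {3}]
M2 = [{0}, {1, 2}, {3}]
S4 = [{0, 1, 2}, {3}]

print(is_fusion([A, B], [M1, M2], 2, 4))  # True
print(is_fusion([A, B], [S4], 1, 4))      # True
print(is_fusion([A, B], [S4], 2, 4))      # False
```

Under these assumptions the two-state machine S4 alone already gives a (1,1)-fusion, which is the kind of saving over replication that the talk is after.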

Minimal Fusion
Given a set of machines A, an (f, m)-fusion ℱ is minimal if there does not exist another (f, m)-fusion ℱ' such that ∀ F ∈ ℱ, ∃ F' ∈ ℱ' : F' ≤ F and ∃ (F ∈ ℱ, F' ∈ ℱ') : F' < F

[Figure: the lattice with A = {FSMs in yellow region}, marking a (1,1)-fusion of A and a minimal (1,1)-fusion of A]

Minimal Fusion: Example
[Figure: minimal fusion example showing ⊤, F2 and S4, and the fault graph G(A, T) on states t0, t1, t2, t3 with edge weights 2, 2, 2, 2, 2, 3]

Algorithm: Generating Backups
Aim: add the least possible number of machines that tolerate f faults
Input: set of machines A, number of faults f
Output: minimal fusion set with the least size
If |T| = N and the size of the event set is |E|, the time complexity of the algorithm is O(N^3 · |E| · f)

Algorithm overview
f: number of faults, A: given set of machines, ℱ: fusion set built so far
While dmin(A ∪ ℱ) ≤ f
    M := ⊤
    While M ≠ ⊥
        Compute the lower cover of M, i.e. LC(M)
        If ∃ machine F ∈ LC(M) : dmin({F} ∪ A ∪ ℱ) > dmin(A ∪ ℱ)
            M := F
        Else
            ℱ := ℱ ∪ {M}, and leave the inner loop
Return ℱ
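The descent through the lattice can be sketched as follows. This is an illustrative reading of the overview, not the authors' implementation: machines are stored as tuples that map each state of T to a block label, the lower cover is approximated by merging two blocks and closing the result, and the example reuses the two mod-3 counters. All function names are hypothetical.

```python
from itertools import combinations

def close(labels, delta):
    """Coarsen a partition until every block moves into one block on every event."""
    parent = list(labels)  # label = smallest state of the block, used as root
    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s
    changed = True
    while changed:
        changed = False
        for s, t in combinations(range(len(parent)), 2):
            if find(s) == find(t):
                for e in range(len(delta[0])):
                    u, v = find(delta[s][e]), find(delta[t][e])
                    if u != v:
                        parent[max(u, v)] = min(u, v)
                        changed = True
    return tuple(find(s) for s in range(len(parent)))

def lower_cover(m, delta):
    """Machines immediately below m: merge two blocks, close, keep the finest."""
    cands = set()
    for b1, b2 in combinations(sorted(set(m)), 2):
        cands.add(close(tuple(b1 if x == b2 else x for x in m), delta))
    def coarser(p, q):  # p is strictly coarser than q
        return p != q and all(p[s] == p[q[s]] for s in range(len(p)))
    return [p for p in sorted(cands) if not any(coarser(p, q) for q in cands)]

def dmin(machines):
    n = len(machines[0])
    return min(sum(m[s] != m[t] for m in machines)
               for s, t in combinations(range(n), 2))

def gen_fusion(originals, delta, f):
    top = tuple(range(len(delta)))  # every state of T in its own block
    fusion = []
    while dmin(originals + fusion) <= f:
        m, base = top, dmin(originals + fusion)
        while True:  # walk down while some smaller machine still raises dmin
            better = [c for c in lower_cover(m, delta)
                      if dmin(originals + fusion + [c]) > base]
            if not better:
                break
            m = better[0]
        fusion.append(m)
    return fusion

# T: reachable cross product of the mod-3 counters; events are 'a' and 'b'.
states = [(i, j) for i in range(3) for j in range(3)]
index = {s: k for k, s in enumerate(states)}
delta = [[index[((i + 1) % 3, j)], index[(i, (j + 1) % 3)]] for i, j in states]

def partition_by(key):
    first = {}
    return tuple(first.setdefault(key(s), k) for k, s in enumerate(states))

A = partition_by(lambda s: s[0])  # counter for 'a'
B = partition_by(lambda s: s[1])  # counter for 'b'
fusion = gen_fusion([A, B], delta, f=2)
print(len(fusion), [len(set(m)) for m in fusion])  # 2 [3, 3]
```

For this input the sketch returns two 3-state machines, the same sizes as F1 and F2 from the earlier "Can We Do Better?" slides.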

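[Figure: algorithm walkthrough on the lattice with A = {FSMs in yellow region}; weakest edge weight w = 1 in the fault graph on t0, t1, t2, t3, whose edge weights are 1, 1, 2, 2, 2, 2]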

Algorithm: Recovery
Aim: recover the state of the faulty machines for f crash or ⌊f/2⌋ Byzantine faults, given the state of the remaining machines
Input: current states of all available machines in A ∪ F
Output: correct state of T
The time complexity of the algorithm is O((n + m) · f)

Algorithm overview
S: set of current states of machines in A ∪ F
count: vector of size |T|, initialized to 0
For all (s in S) do
    For all (ti in s) do
        ++count[i]
return tc : 1 ≤ c ≤ N and count[c] is the maximal element in count

Algorithm: Example
Consider machines A, B, M1, M2: dmin({A, B, M1, M2}) = 3; they can tolerate one Byzantine fault
Let the machines report the following states: A = {t0, t3}, B = {t0}, M1 = {t1, t2, t3}, M2 = {t0}
M1 is lying about its state
The recovery algorithm will return t0, since count[0] = 3 is greater than count[1] = 1, count[2] = 1 and count[3] = 2
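The voting step in this example can be reproduced with a few lines. The sketch takes the blocks reported by each machine as sets of state indices; the function name recover is an assumption for illustration.

```python
def recover(reported, n_states):
    """Vote for the state of T that appears in the most reported blocks."""
    count = [0] * n_states
    for block in reported:
        for i in block:
            count[i] += 1
    best = max(range(n_states), key=count.__getitem__)
    return best, count

# Reported blocks of A, B, M1 (lying) and M2, with t0..t3 written as 0..3.
reported = [{0, 3}, {0}, {1, 2, 3}, {0}]
print(recover(reported, 4))  # (0, [3, 1, 1, 2])
```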

Experimental Results
Original machines | f (faults) | State space for replication | State space for fusion
MESI, Counter A and B, Shift register | 2 | 7,569 | 1,521
Even and Odd Parity Checkers, Toggle Switch, Pattern Generator, MESI | 3 | 262,144 | 32,768
Counters A and B, Divider, Machine A, Machine B | | 6,724 | 504
Pattern Generator, TCP, Machine A, Machine B | | 3,136 | 2,464

Conclusion/Future Work
It is not always necessary to have n.f backups to tolerate f faults
Polynomial time algorithm to generate the smallest minimal set that tolerates f faults
An implementation of this algorithm shows that many complex state machines have efficient fusions
Will machines outside the lattice give better results?
Backup machines need to be given all events; can we do better?
