Fault tolerance and related issues in distributed computing
Shmuel Zaks
GSSI, February 2016
CS, Technion, Haifa
Part 0: Distributed computing, an overview: basic notions; seminar focus: from lower bounds, via impossibility, to fault tolerance and self-stabilization.
Part 1: Lower bounds
Part 2: Computing in spite of faults: impossibility of consensus
Part 3: Detecting faults: the snapshot algorithm
Part 4: Self-stabilization: self-recovery from faults
Part 0: An overview. Part 1: Lower bounds. Part 2: Computing in spite of faults. Part 3: Detecting faults. Part 4: Self-stabilization.
A. The model. Communication network: processors, communication; a problem to solve.
Anonymous processors.
Unique identities.
Message passing: communication lines (channels); topology.
(Message passing) Links: directed or undirected.
(Message passing) Message delivery mechanism: FIFO queues of messages; reliable, no faults; finite but arbitrary delay.
(Message passing) Distributed algorithm (protocol); an execution consists of steps: send a message, receive a message, do local computation.
Shared memory: processors (a, b, c, d, e) communicate through shared registers R1-R5.
(Shared memory) Communication is by read/write operations on the registers.
Synchronization: synchronous vs. asynchronous.
(Synchronization) Asynchronous model: no global clock; a message sent at time t is delivered at some time t+??? (finite but unknown delay).
(Synchronization) Synchronous model: a global clock; a message sent at time t is delivered at time t+d.
(Synchronization) Synchronous model: computation proceeds in rounds, giving a unique execution. Asynchronous model: many possible executions.
(Synchronization) The timing models combine with both communication models: {synchronous, asynchronous} x {shared memory, message passing}.
Asynchronous model: used for correctness and upper-bound analysis (an algorithm correct under arbitrary delays is correct under any timing). Synchronous model: used for lower-bound analysis (a lower bound for the stronger synchronous model holds in the asynchronous model as well).
Topology. Ring.
(Topology) Clique.
(Topology) General.
Why simple networks? They enable the understanding of many design issues. In existing general networks, assume a virtual simple network (e.g., a ring) is implemented on top.
Complexity measures: communication (messages, bits) and time.
Synchronous system: time (rounds).
Asynchronous system: communication; time can be measured as synchronous time, longest chain of messages, or bounded delay.
Parallel vs. distributed computing. Parallel computing: given a problem … (ex: sorting). Distributed computing: given a network … (ex: broadcast).
(Parallel vs. distributed computing) Complexity measures: parallel computing, time vs. number of processors; distributed computing, number of messages. Complexity goals: parallel computing, efficiency; distributed computing, correctness.
B. Problems. A problem (task): each processor P1, P2, P3 has an input and must produce an output. Examples: leader election (exactly one processor outputs "yes", the rest output "no"), consensus.
Issues: design and analysis of algorithms; impossibility and lower bounds; fault tolerance.
Problems: broadcast, snapshot, consensus, shortest path, maximal flow, leader election, breaking symmetry, maximum finding, spanning tree, center, termination, deadlock.
Example: broadcast.
Broadcast: BFS (breadth-first search).
Broadcast: DFS (depth-first search).
Message complexity: each edge carries exactly one message in each direction, so the message complexity is 2|E|.
Time complexity (of the DFS broadcast): synchronous time 2|E|; longest chain 2|E|; bounded delay 2|E|.
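The 2|E| count can be checked with a small simulation. A minimal sketch in Python of a token-based DFS broadcast; the example graph `adj` is an assumption for illustration, not from the slides:

```python
def dfs_broadcast(adj, root):
    """Count token traversals in a DFS broadcast: the message m travels as a
    single token, crossing every edge once in each direction."""
    visited, used = set(), set()
    msgs = 0

    def visit(u):
        nonlocal msgs
        visited.add(u)
        for v in adj[u]:
            e = frozenset((u, v))
            if e in used:
                continue           # edge already traversed in both directions
            used.add(e)
            msgs += 2              # token goes u -> v and comes back v -> u
            if v not in visited:
                visit(v)           # v explores its own neighbourhood first

    visit(root)
    return msgs, visited

# Assumed example: a 5-cycle a-b-c-d-e-a plus the chord a-c, so |V|=5, |E|=6.
adj = {'a': {'b', 'c', 'e'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'},
       'd': {'c', 'e'}, 'e': {'a', 'd'}}
E = sum(len(ns) for ns in adj.values()) // 2
msgs, visited = dfs_broadcast(adj, 'a')
print(msgs == 2 * E, visited == set(adj))   # True True
```

Since every edge contributes exactly two token traversals, the count is 2|E| regardless of the order in which neighbours are explored.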
PI (propagation of information); shout-echo.
Algorithm PI (propagation of information):
Initiator: send m to each neighbour; stop.
Any other processor, upon first receiving m along edge e: send m on all edges except e; stop.
Theorem: the following holds for every execution of the PI algorithm:
Each processor sends m at most once (upon its first receipt of m).
The execution terminates.
Each processor receives the message m.
The edges on which processors first receive m form a spanning tree.
The message complexity is 2|E|-|V|+1.
The time complexity …
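The message count and the spanning-tree claim can be checked with a small simulation. A minimal sketch in Python; the graph `adj` and the single global message queue are illustrative assumptions, not from the slides:

```python
from collections import deque

def pi_broadcast(adj, initiator):
    """Simulate the PI flooding algorithm on an undirected graph.
    adj: dict node -> set of neighbours.  Returns (messages sent,
    informed processors, edges of first receipt)."""
    queue = deque()                        # messages in transit: (sender, receiver)
    informed = {initiator}                 # processors already holding m
    tree = set()                           # edge on which each processor first got m
    sent = 0
    for v in adj[initiator]:               # initiator sends m to every neighbour
        queue.append((initiator, v)); sent += 1
    while queue:
        u, v = queue.popleft()
        if v not in informed:              # first receipt: forward on all other edges
            informed.add(v)
            tree.add((u, v))
            for w in adj[v]:
                if w != u:
                    queue.append((v, w)); sent += 1
    return sent, informed, tree

# Assumed example: a 5-cycle a-b-c-d-e-a plus the chord a-c, so |V|=5, |E|=6.
adj = {'a': {'b', 'c', 'e'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'},
       'd': {'c', 'e'}, 'e': {'a', 'd'}}
E = sum(len(ns) for ns in adj.values()) // 2
sent, informed, tree = pi_broadcast(adj, 'a')
print(sent, 2 * E - len(adj) + 1)          # 8 8: matches 2|E|-|V|+1
print(len(tree) == len(adj) - 1)           # True: first-receipt edges form a tree
```

The count is independent of delivery order: the initiator sends deg messages and every other processor sends deg-1, which sums to 2|E|-|V|+1.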
PIF (propagation of information with feedback); shout-echo.
C. In this seminar. Distributed algorithms: "positive" results (design, analysis, upper bounds); "negative" results (lower bounds, impossibility).
Part 1: Lower bounds. Leader election: each processor P1, P2, P3 has an input and outputs "yes" or "no"; exactly one processor outputs "yes".
Leader election (message passing, asynchronous): exactly one processor must be elected.
We'll see: a lower bound of Ω(n log n) messages for leader election.
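For contrast with the bound, here is a sketch of the classic Chang-Roberts election on a unidirectional ring (an assumed illustration, not from the slides): it is simple but uses O(n^2) messages in the worst case, whereas algorithms such as Hirschberg-Sinclair match the Ω(n log n) lower bound.

```python
def chang_roberts(ids):
    """Chang-Roberts leader election on a unidirectional ring: a candidate id
    is forwarded clockwise iff it beats the receiver's own id; a processor
    that sees its own id come back around is the leader."""
    n = len(ids)
    msgs = 0
    in_transit = [(i, ids[i]) for i in range(n)]   # (position, candidate id)
    leader = None
    while leader is None:
        delivered, in_transit = in_transit, []
        for i, cand in delivered:                  # deliver one hop clockwise
            j = (i + 1) % n
            msgs += 1
            if cand == ids[j]:
                leader = cand                      # own id returned: elected
            elif cand > ids[j]:
                in_transit.append((j, cand))       # forward; smaller ids die
    return leader, msgs

print(chang_roberts([3, 1, 5, 2, 4]))   # (5, 12)
```

The maximum id always survives a full trip around the ring, so the algorithm elects it; the message count depends on how the ids are arranged, which is exactly the slack the Ω(n log n) lower bound quantifies.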
Lower bound and fault tolerance: usually all processors need to compute some function, which gives a lower bound of Ω(|E|) messages.
Part 2: Computing in spite of faults. Consensus: each processor P1, P2, P3 has an input and must produce an output; all processors must agree on a common output.
Consensus (message passing, asynchronous). We'll see: the impossibility of reaching consensus (even with a single faulty processor).
Part 3: Detecting faults. Snapshot.
We'll see: the snapshot algorithm.
Part 4: Self-stabilization. Example: clock synchronization.
Let's try …
But …
We'll see: self-stabilizing algorithms, proofs, and performance analysis.
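As a taste of Part 4, a minimal sketch of Dijkstra's K-state self-stabilizing token ring (an assumed illustration; the initial configuration and the daemon's choices are arbitrary): starting from any configuration, the system converges to exactly one privilege, the "token", circulating around the ring.

```python
import random

def privileged(x):
    """Privileged processors in Dijkstra's K-state token ring: processor 0
    when x[0] == x[n-1]; processor i > 0 when x[i] != x[i-1]."""
    priv = [0] if x[0] == x[-1] else []
    priv += [i for i in range(1, len(x)) if x[i] != x[i - 1]]
    return priv               # never empty: some processor is always privileged

def move(x, K, i):
    """A privileged processor's move passes the privilege along the ring."""
    if i == 0:
        x[0] = (x[0] + 1) % K
    else:
        x[i] = x[i - 1]

random.seed(0)
K, x = 7, [3, 1, 4, 1, 5]     # K > n processors; arbitrary (illegal) start
for _ in range(200):          # even an adversarial daemon stabilizes in O(n^2)
    move(x, K, random.choice(privileged(x)))
for _ in range(20):           # legitimate behaviour: exactly one privilege
    assert len(privileged(x)) == 1
    move(x, K, privileged(x)[0])
print("stabilized")
```

The choice K > n is essential: with fewer states, the adversarial daemon can keep the ring cycling through illegal configurations forever.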