
1 Robustness of Computer Networks is a Matter of the Heart or How to Make Distributed Computer Networks Extremely Robust
Ariel Daliot (1), Danny Dolev (1), Hanna Parnas (2)
(1) School of Engineering and Computer Science, (2) Department of Neurobiology and the Otto Loewi Center for Cellular and Molecular Neurobiology, The Hebrew University of Jerusalem, Israel

2 Lecture Outline
Definition of Robustness
Robustness of Biological Systems vs. Engineered Systems
Carrying Over a Mechanism for Robustness (pulse synchronization) from Biology to Distributed Networks
“Riding Two Tigers” – Superimposing two orthogonal fault models to attain extreme robustness of almost any distributed algorithm

3 General Definition of Robustness
“Robustness is what enables a system to maintain its functionalities in spite of external and internal perturbations”

4 Robustness as modeled in non-linear dynamics
The system state is a point in phase space
There is an attractor (point or limit cycle) in phase space which represents the desired functionality of the system
Perturbations forcefully move the point representing the system’s state
Robustness is the property of attraction
The degree of robustness is characterized by the basin of attraction
Upon a perturbation, robustness can manifest itself in one of two ways:
–The system returns to its current attractor (e.g. heart rate when resting)
–The system moves to a new attractor that maintains the system’s functionality (e.g. heart rate when walking at a constant speed)
Otherwise, unstable regions of phase space can be reached (e.g. a heart rate that starts to damage the organism)
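To make the attractor picture concrete, here is a small toy sketch (our own illustration, not from the talk): a bistable system dx/dt = x − x³ has attractors at x = ±1 with the basin boundary at x = 0, so a small perturbation decays back to the current attractor while a large one pushes the system into the basin of the other attractor.

```python
# Toy illustration of robustness as attraction: dx/dt = x - x**3 has
# attractors at x = -1 and x = +1; the basin boundary sits at x = 0.
def evolve(x, steps=2000, dt=0.01):
    """Crude forward-Euler integration of dx/dt = x - x**3."""
    for _ in range(steps):
        x += (x - x**3) * dt
    return x

x_attractor = 1.0                               # system resting on the attractor +1
print(round(evolve(x_attractor + 0.4), 3))      # small perturbation -> returns to  1.0
print(round(evolve(x_attractor - 2.5), 3))      # large perturbation -> settles at -1.0
```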

5 Some Intuitive Conjectures on Robustness
It is advantageous for a system to have at least some degree of robustness
Demand for more robustness → higher complexity
There is a tradeoff between robustness and performance
There is a tradeoff between robustness and cost
There is a tradeoff between robustness and resource demands
The degree of robustness of a system is a function of the nature and probability of the faults
The world is dynamic, thus robustness facilitates evolvability (the capacity for non-lethal heritable variation), and evolution selects robust traits

6 What scientists today think about robustness of systems
“Robustness is attained by several underlying principles that are universal to both biological organisms and sophisticated engineering systems” - Kitano, Nature, 2004
“Many networks like the Internet, social networks and biological cells display an unexpected degree of robustness but are extremely vulnerable to attacks” - Barabasi, Nature, 2000
“It would be interesting to see whether any design principles connected with [biological] robustness are analogous to those in engineering” - Alon, Nature, 1999

7 Robustness in Biology
Biological systems rarely “blue-screen”

8 Robustness in Biology
Biological systems have extraordinary evolvability
Biological systems are complex and fine-tuned over the long course of evolution
Power laws are ubiquitous in nature (Pareto distribution), i.e. “20% of the causes account for 80% of the cases”
Thus biological systems have evolved to be robust to most of the perturbations that occur, i.e. most perturbations fall on trajectories in the basin of attraction of some attractor that maintains system functionality
On the other hand they are very fragile, as certain perturbations can cause catastrophic, cascading failures (e.g. dinosaurs vs. mammals)
Thus they are very robust to random failures but very vulnerable to targeted attacks; they are, or behave like, scale-free networks (Barabasi, Nature, 2000)

9 Examples of Robustness in Biology
Protein interaction networks (Giot et al., Science, 2003)
Chemotaxis in bacteria (Alon et al., Nature, 1999)
Circadian clock
Robustness against mutations
Homeostasis (e.g. mammalian temperature regulation)
Adaptability of organisms to changing environments
Mammals vs. dinosaurs
Cardiac pacemaker (Sivan, Dolev & Parnas, 2000)

10 Mechanisms that Facilitate Biological Robustness (Kitano, Nature 2004)
System control (feedback mechanisms)
Fail-safe mechanisms (redundancy and diversity)
Modularity
Decoupling (containment of faults inside the modules)
⇒ Similar principles are used to attain robustness in engineered systems!
⇒ Thus it makes much sense to search for and understand biological mechanisms for robustness that can be carried over to computer systems

11 Robustness in Engineered Systems
Low evolvability, sometimes constrained by historical and non-technical considerations
Becoming complex, but much less so than biological systems
Systems are typically designed according to the very extremes of the power laws, i.e. robust only to the most frequent perturbations, if at all
Design effort is usually invested in performance and cost, less in robustness, as this is costly and “not needed in the average case”
Engineered systems are typically not robust to many perturbations
Furthermore, they are very fragile, as certain perturbations can cause catastrophic or cascading failures (e.g. the NYC blackout); they are typically vulnerable to targeted attacks though they shouldn't be!

12 Engineered vs. Biological Robustness
Evolutionary stage: Engineered – relatively in “first steps”; Biological – billions of years of evolution
Evolvability: Engineered – low; Biological – very high
Basin of attraction: Engineered – typically small; Biological – very large
Complexity of systems: Engineered – low to medium; Biological – very high
Sensitivity to random failures: Engineered – typically very vulnerable; Biological – very robust
Sensitivity to targeted attacks: Engineered – typically vulnerable; Biological – very vulnerable
Required robustness to attacks: Engineered – desired; Biological – not relevant, virtually non-occurring

13 Importance of Robustness in Distributed Computer Systems
Distributed systems are becoming an integral part of everyday systems
Distributed systems are becoming increasingly complex
This leads to an increased need for robustness

14 Example of Robustness of an Engineered System – AFCS (Automatic Flight Control System: maintaining direction, altitude and velocity)

15 Fault Models in Distributed Computer Systems
Link/omission faults
Crash/stop/napping faults
Byzantine failures (ongoing “malicious” faults)
Transient faults (the system is temporarily forced into an arbitrary state or total chaos)
Tolerating ongoing perturbations and, at the same time, converging from any point in state space is a “wishful” property for a robust distributed system

16 Byzantine Faults
Maliciousness; two-faced behavior; code bugs that express themselves over time; hardware corruption; unpredictable behavior; unpredictable local faults; code corruption
Tolerating f faults usually requires n > 3f nodes (without authentication)
Byzantine algorithms typically focus on confining the influence of ongoing faults, assuming an initially consistent state of the correct nodes
Can be modeled by arbitrary perturbations in a fraction of the n dimensions of state space within a certain time window
Within that time window no perturbations whatsoever are allowed in the rest of the dimensions
A transient violation of the above will typically throw the system state forever into an unstable region of state space
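As a quick back-of-the-envelope illustration of the n > 3f requirement (our own snippet, with made-up node counts), the largest number of Byzantine faults that n nodes can tolerate without authentication is floor((n − 1)/3):

```python
# Illustrative only: maximum f satisfying n > 3f for a given number of nodes n.
def max_byzantine_faults(n: int) -> int:
    return (n - 1) // 3

for n in (4, 7, 10):
    f = max_byzantine_faults(n)
    print(f"n={n:2d}: tolerates up to f={f} Byzantine node(s), since n > 3*{f}")
```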

17 Self-Stabilization
Addresses the situation in which ALL nodes can concurrently be faulty for a limited period of time
Self-stabilizing algorithms focus on realizing the task following a “catastrophic” state, once the system is back within the assumption boundaries
Typically modeled by an arbitrary perturbation in state space, followed by no perturbation whatsoever in any of the dimensions until the state returns to the attractor (i.e. within the assumption boundaries)
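For readers new to the notion, a classic textbook illustration of self-stabilization (unrelated to this work) is Dijkstra's K-state token ring: started from an arbitrary state, the ring converges to legal executions in which exactly one machine is privileged at a time. A minimal sketch, with parameters chosen by us for illustration:

```python
# Dijkstra's K-state self-stabilizing token ring (classic illustration,
# not the protocol of this talk). From ANY initial state the ring converges
# to executions with exactly one privileged machine per round.
import random

N = 5          # machines in the ring
K = N + 1      # K >= N states are enough for convergence

def privileged(x, i):
    # Machine 0 is privileged when it equals its left neighbour (machine N-1);
    # every other machine is privileged when it differs from its left neighbour.
    return x[i] == x[N - 1] if i == 0 else x[i] != x[i - 1]

def move(x, i):
    # Machine 0 increments modulo K; the others copy their left neighbour.
    x[i] = (x[i] + 1) % K if i == 0 else x[i - 1]

x = [random.randrange(K) for _ in range(N)]              # arbitrary (perturbed) state
for rnd in range(30):
    movers = [i for i in range(N) if privileged(x, i)]   # at least one, always
    print(f"round {rnd:2d}: x={x}  privileged={movers}")
    move(x, random.choice(movers))                       # central scheduler picks one
# after stabilization, every round reports exactly one privileged machine
```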

18 Byzantine Faults and Self-Stabilization
Self-stabilization is “orthogonal” to Byzantine failures, i.e. these are uncorrelated fault models
They are “complementary” fault models: superimposed, both make an algorithm overcome any type of fault from any state
Very few protocols (~3) possess both properties despite decades of research; two of these protocols have super-exponential time complexity

19 Cardiac ganglion of the lobster (Sivan, Dolev & Parnas, 2000)
Four interneurons tightly synchronize their pulses in order to give the heart its optimal pulse, fault tolerantly
Able to adjust the synchronized firing pace, up to a certain bound (e.g. while escaping a predator)
(Figure: the synchronized interneurons drive the pulse trains of the motor neurons)

20 The target is to synchronize pulses from any state and in spite of any faults
(Figure: pulse trains of the nodes over time, converging within a cycle from an arbitrary state to the synchronized state σ; faulty nodes fire arbitrarily)

21 Fault tolerance in the cardiac ganglion of the lobster (Sivan, Dolev & Parnas, 2000)
Must not fire out of synchrony for prolonged times in spite of:
–Noise
–Single neuron death
–Inherent variations in the firing rate
–Neurohormones that regulate the firing frequency
–Temperature change
The vitality of the cardiac ganglion suggests it has evolved to be optimized for:
–Fault tolerance
–Re-synchronization from any state
–Tight synchronization
–Fast re-synchronization

22 Distributed Computer Systems according to Lamport “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.”

23 “Pulse Synchronization” in distributed computer systems
The computers are required to:
–Invoke regular pulses in tight synchrony
–Synchronize from ANY state (self-stabilization)
–Have a bounded pulse frequency
–Tolerate up to a third of the nodes being permanently Byzantine faulty
Examples of other synchronization problems: “Firing Squad”, “Clock Synchronization”, etc.

24 Making Any* Byzantine Algorithm Stabilize
I.e. robust to any arbitrary transient perturbation of the system state, combined with arbitrary permanent perturbations in up to a third of the dimensions
Almost as robust as a distributed computer system can get
In a sense more robust than typical biological systems, as it covers much of state space to defend against targeted attacks, but it lacks real evolvability and adaptability
Adding evolvability and adaptability could make it as robust as algorithms can be

25 * Restrictions on the Basic Algorithm
Can be initialized σ (pulse skew) time units apart
Has sampling points where the application state is safe to read
Sampling points can be identified by reading the program counter (PC)
During legal executions of the basic algorithm all the sampling points are within Δ time of each other
A snapshot of application states that are read within Δ real time of each other is “consistent”, i.e. meaningful

26 Outline of the Scheme
At the “pulse” event:
–Send the local state to all nodes and Byzantine-agree on it;
–All correct nodes now see the same global snapshot;
–Check if the global snapshot represents a legal state;
–If yes, but your own state is corrupt, then repair your state;
–If not, then reset the basic algorithm;

27 Scheme Pitfalls
The general scheme may seem very simple, but…
When the basic algorithm is not synchronized, how close do the sampling points need to be in order to get a “consistent” snapshot? And if they are not close, how do you detect that?
And if the basic algorithm is synchronized, what happens if the sampling points are around the pulse, such that subsequent to the pulse some correct nodes send their states and some don’t?
Assuming the global snapshot seems “consistent”, can the predicate-detection module always detect whether the application is in an illegal state, considering the uncertainties in the consistencies?

28 ByzStabilizer: Stabilizes any* Byzantine Algorithm
ByzStabilizer, at the “pulse” event:
Begin
1. Abort any other running ByzStabilizer;
2. If (must-reset) then reset the basic algorithm;
3. When reaching an identified state, exchange the state values and the elapsed time since the pulse;
4. “Byzantine Agree” on the (state, elapsed-time) pair sent by each node;
5. Sift through the agreed values for a set of values with elapsed times within some Δ of each other, comprising a consistent global snapshot;
6. If there is no such set, then set must-reset := true and propose a pulse;
7. Do predicate evaluation on the consistent global snapshot;
8. If the predicate is satisfied but you are not part of the set, then repair your state;
9. If the predicate is satisfied, then the basic algorithm is in a legal state ⇒ do nothing;
10. Else set must-reset := true and propose a pulse;
End
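A compact Python rendering of this skeleton may help readers follow the steps; everything here (the node interface, byzantine_agree_all, legal_predicate, the Δ value) is a hypothetical placeholder of ours, not the authors' implementation, and the real agreement, timing and repair logic is considerably more involved.

```python
# Hypothetical sketch of the ByzStabilizer skeleton (step numbers match slide 28).
from dataclasses import dataclass

DELTA = 0.5  # assumed bound on sampling-point skew (illustrative value only)

@dataclass
class Sample:
    node_id: int
    state: object   # application state of the basic algorithm
    elapsed: float  # time elapsed since this node's pulse

def sift_consistent_set(samples, delta=DELTA):
    """Step 5: find agreed samples whose elapsed times lie within delta of each
    other (a 'consistent' global snapshot); return None if no such set exists."""
    samples = sorted(samples, key=lambda s: s.elapsed)
    best = []
    for i, low in enumerate(samples):
        window = [s for s in samples[i:] if s.elapsed - low.elapsed <= delta]
        if len(window) > len(best):
            best = window
    return best or None

def on_pulse(node):
    """Run by every correct node at its 'pulse' event (steps 1-10); the node
    object's methods are assumed placeholders."""
    node.abort_running_stabilizer()                    # step 1
    if node.must_reset:                                # step 2
        node.reset_basic_algorithm()
        node.must_reset = False
    sample = node.wait_for_identified_state()          # step 3
    agreed = node.byzantine_agree_all(sample)          # step 4
    snapshot = sift_consistent_set(agreed)             # step 5
    if snapshot is None:                               # step 6
        node.must_reset = True
        node.propose_pulse()
        return
    if node.legal_predicate(snapshot):                 # step 7
        if node.node_id not in {s.node_id for s in snapshot}:
            node.repair_state(snapshot)                # step 8
        # step 9: the basic algorithm is in a legal state -> do nothing
    else:                                              # step 10
        node.must_reset = True
        node.propose_pulse()
```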


30 Agreed set of values
(Figure: timing diagram — different nodes invoke their pulse at different times (pulse uncertainty), followed by the first “Δ” uncertainty and the agreement-completion-time uncertainty; identify the f+1st value in the safe region, define the end of the region with respect to its “elapsed time”, and take the agreed set within this safe region)

31 Time Complexity and Convergence Time of ByzStabilizer
Ω[σ + Δ + Σ + (2f+1)·RTT] ≈ Ω[Σ + (2t+3)·RTT]
–σ is the pulse skew
–Δ is the sampling-point skew
–Σ is the time complexity of the basic algorithm
–RTT is the round-trip time
–t is the actual number of permanent Byzantine faults
This is roughly the complexity of the Byzantine Agreement and the basic algorithm combined
The time complexity equals the convergence time, i.e. even when everything is OK you pay the price of the convergence time
If solving the basic problem can be reduced to consensus on one value, then we can give a scheme with a time complexity of 2 RTTs!! I.e. when everything is OK you pay almost nothing
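Purely to show how the terms of the bound combine, here is a plug-in of hypothetical numbers (none of these values come from the paper or any real deployment):

```python
# Illustrative arithmetic for Omega[sigma + Delta + Sigma + (2f + 1) * RTT].
sigma = 5     # pulse skew (ms)                          -- assumed value
delta = 10    # sampling-point skew (ms)                 -- assumed value
Sigma = 200   # basic algorithm's time complexity (ms)   -- assumed value
rtt   = 50    # round-trip time (ms)                     -- assumed value
f     = 2     # tolerated permanent Byzantine faults

bound = sigma + delta + Sigma + (2 * f + 1) * rtt
print(f"convergence-time bound ~ {bound} ms")   # 5 + 10 + 200 + 5*50 = 465 ms
```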

32 Summary
Biological systems are more robust than engineered systems
Both supposedly use the same principles for robustness
We defined “pulse synchronization” in distributed systems, carried over from the cardiac pacemaker
Using this mechanism we presented the first algorithm that stabilizes any Byzantine algorithm that conforms to certain very reasonable restrictions
The cost of the algorithm is relatively low; thus we show that self-stabilization in distributed computer systems facing Byzantine faults does not carry a significant additional cost beyond the cost of tolerating Byzantine faults
This also implies that robustness does not necessarily mean high cost

33 Biological synchronization …

34 “The importance of being synchronized…”

35 Sometimes you miss …

36 “But everything will be fine again…”

37 Questions?

38 Drosophila melanogaster - Protein Interaction Network


40 After all, you don’t see this so often…


