1 EEC 688/788 Secure and Dependable Computing
Lecture 6
Wenbing Zhao
Department of Electrical and Computer Engineering, Cleveland State University

2 EEC688/788: Secure & Dependable Computing
Outline
Recovery-oriented computing: overview
Application-level fault detection
Structural behavior monitoring
Path shape analysis
Microreboot and system-level Undo/Redo

3 Recovery-Oriented Computing
On the availability of soft real-time systems:
Availability = MTTF / (MTTF + MTTR)
MTTF: mean time to failure
MTTR: mean time to recover
Availability can be improved by increasing MTTF as well as by reducing MTTR.
Recovery-oriented computing focuses on reducing MTTR:
Making fault detection faster and more accurate
Making recovery faster
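As a quick illustration of the formula above, here is a minimal sketch in Python (the MTTF and MTTR numbers are made up): cutting MTTR by 10x improves availability exactly as much as growing MTTF by 10x would, which is the motivation for focusing on recovery time.

```python
def availability(mttf, mttr):
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

# Hypothetical numbers, in hours.
print(availability(1000, 1.0))   # 0.99900... (baseline)
print(availability(1000, 0.1))   # 0.99990... (10x faster recovery)
print(availability(10000, 1.0))  # 0.99990... (10x longer MTTF: same gain)
```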

4 Fault Detection and Localization
Fault detection: determine whether some component in the system has failed.
Fault localization: pinpoint the particular component that failed.
Low-level fault detection mechanisms are based on timeouts, probing each component periodically with a heartbeat message; they cannot detect many application-level faults.
Recovery-oriented computing focuses on application-level fault detection and localization, since about 75% of recovery time is spent on application-level fault detection.
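A minimal sketch of the heartbeat-style low-level detector described above; the component names and the timeout value are invented for illustration.

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before a component is suspected

# Last heartbeat arrival time per component (names are hypothetical).
last_heartbeat = {"web": time.time(), "app": time.time(), "db": time.time()}

def record_heartbeat(component):
    """Called whenever a heartbeat message arrives from a component."""
    last_heartbeat[component] = time.time()

def suspected_failures():
    """Return components whose heartbeats are overdue.  Note this catches
    only fail-stop behavior, not application-level misbehavior."""
    now = time.time()
    return [c for c, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]
```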

5 Microreboot and System-Level Undo/Redo
Microreboot: many problems can be fixed by simply restarting the faulty component. Works best with component-based systems.
For problems that cannot be fixed by a microreboot, perform a system-level undo, fix the problem, then carry out a system-level redo.
The undo/redo mechanism is based on checkpointing and logging.

6 System Model for Recovery-Oriented Computing
Three-tier architecture:
Separates application logic and data management
Middle tier is stateless or maintains only session state
Component-based middleware:
Java Platform, Enterprise Edition (Java EE, formerly known as J2EE)
Key component type: Enterprise JavaBean (EJB)

7 Application-Level Fault Detection
Fail-stop faults can be detected using timeouts, but application-level faults can only be detected at the application level.
One plausible fault detection method is the acceptance test: developers would have to write effective and efficient acceptance-test routines (see the sketch below). This is not practical for Internet applications due to their scale, complexity, and rapid rate of change.
The ROC-based approach instead measures and monitors the structural behaviors of an application, and may detect application-level faults without a priori knowledge of the application's details.
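To make the contrast concrete, here is a sketch of an acceptance-test routine for a hypothetical checkout operation. Every invariant below is invented for illustration; each application operation would need its own hand-written routine like this, which is exactly why the approach scales poorly.

```python
def accept_checkout_result(cart, order):
    """Hypothetical acceptance test: sanity-check the checkout result
    against application-specific invariants."""
    if order is None:
        return False
    if order["total"] < 0:                         # totals must be non-negative
        return False
    if len(order["items"]) != len(cart["items"]):  # nothing dropped or added
        return False
    return True

# Usage with made-up data:
cart = {"items": ["book", "pen"]}
order = {"items": ["book", "pen"], "total": 12.50}
print(accept_checkout_result(cart, order))  # True
```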

8 Structural Behavior Monitoring
Interaction patterns between different components reflect the application-level functionality.
Each component implements a specific application function, e.g., a stateful session bean manages a user's shopping cart, and a set of singleton session beans keeps track of inventory.
The internal structural behavior can therefore be monitored to infer whether or not the application is functioning normally.
To monitor: log the runtime path for each end-user request, including all incoming messages, outgoing messages, method invocations, etc.
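A minimal sketch of what one logged runtime-path event might look like; all field names here are assumptions for illustration, not the actual instrumentation format.

```python
from dataclasses import dataclass

@dataclass
class PathEvent:
    """One event on the runtime path of a single end-user request."""
    request_id: str       # ties all events of one request together
    seq_no: int           # position of the event within the path
    component_class: str  # e.g., "ShoppingCartBean"
    instance_id: str      # e.g., "b1"
    event_type: str       # "msg_in", "msg_out", or "method_invocation"
    detail: str           # method name or message description

# A runtime path is the ordered list of a request's events:
path = [
    PathEvent("req-42", 0, "WebFrontEnd", "a1", "msg_in", "GET /cart"),
    PathEvent("req-42", 1, "ShoppingCartBean", "b1", "method_invocation", "addItem"),
]
```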

9 Structural Behavior: Runtime Path Example
Runtime path for a single end-user request: it spans 5 components and consists of 10 events.

10 Structural Behavior: Machine Learning
Train reference models using machine learning.
Historical reference model: trained with aggregated runtime-path data
Objective: anomaly detection based on historical behavior
May be trained with a real workload as well as a synthetic workload that resembles the real one
Peer reference model: trained with the most recent runtime-path data
Objective: anomaly detection with respect to peer components
Must be trained with a real workload
Fault (anomaly) detection: compare observed patterns with those in the reference models.

11 Component Interactions Modeling
Focus on the interactions between a component instance and all other component classes.
More scalable: can cope with cases where there are many instances of each class.
Suitable for using the chi-square test for anomaly detection.

12 Component Interactions Modeling
Given a system with n component classes, the interaction model for a component instance consists of a set of n − 1 weighted links between the instance and all the other n − 1 component classes.
We assume that instances of the same class do not interact with each other, and that interactions are symmetric (i.e., request and reply).
The weight assigned to each link is the probability of the component instance interacting with the linked component class.
The weights on all links sum to 1, i.e., the component instance interacts with some other component class with probability 1.

13 Component Interaction Model: Example
Class A: web component, handles end-user requests
Class B: application logic, handles conversations with end users; 3 instances
Class C and Class D: also application logic, representing shared state
Class E: database server, persistent state

14 Component Interaction Model: Example
Machine learning: determine the link weights based on training data.
Training data:
A issued 400 remote invocations on b1
b1 issued 300 local method invocations on C, and 300 invocations on D
(What happened between C & E and D & E is not relevant to b1's model)
Link weight calculation:
Total number of interactions observed at instance b1: 400 + 300 + 300 = 1000
P(b1-A) = 400/1000 = 0.4
P(b1-C) = 300/1000 = 0.3
P(b1-D) = 300/1000 = 0.3
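The weight calculation above is straightforward to express in code; a minimal sketch using the example's training counts:

```python
# Training counts for instance b1, from the example above.
interaction_counts = {"A": 400, "C": 300, "D": 300}

total = sum(interaction_counts.values())  # 1000
link_weights = {cls: n / total for cls, n in interaction_counts.items()}

print(link_weights)  # {'A': 0.4, 'C': 0.3, 'D': 0.3} -- weights sum to 1
```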

15 Anomaly Detection
Compare the current behavior with the trained behavior using the chi-square test.
Prepare the observed data as a histogram, then compare the distributions using the chi-square statistic:
D = Σ (oi − ei)² / ei, summed over the n cells
n: number of cells in the histogram
ei: expected frequency in cell i
oi: observed frequency in cell i
If ei is 0, the cell should be pruned off.
Each link is regarded as a cell. For an observation period of m requests, the expected frequency for link i is ei = m × pi.
No anomaly: ideally D = 0. In practice D is not 0 due to randomness; it follows a chi-square distribution.

16 Anomaly Detection: Chi-Square Test
Anomaly detected when D > the (1 − α) quantile of the chi-square distribution with k = n − 1 degrees of freedom, at significance level α.
A higher α => more sensitive detection => more false positives.
Level of significance: the probability of rejecting the null hypothesis in a statistical test when it is in fact true.
The null hypothesis is the general statement or default position that there is no relationship between two measured phenomena; rejecting it means concluding that there are grounds for believing a relationship exists.
Hence a higher level of significance => a higher probability of rejecting "no relationship" => a higher probability of confirming a relationship => a higher probability of detecting an abnormality.

17 Anomaly Detection: Chi-Square Test: Example
Observation period: 100 requests
A issued 45 requests on b1; b1 issued 35 invocations on C and 20 invocations on D
Link(A-b1): expected 100 × 0.4 = 40, observed 45
Link(C-b1): expected 100 × 0.3 = 30, observed 35
Link(D-b1): expected 100 × 0.3 = 30, observed 20
D = (45 − 40)²/40 + (35 − 30)²/30 + (20 − 30)²/30 ≈ 4.79
Chi-square test: 2 degrees of freedom (only 3 cells); for α = 0.1 the 90% quantile is about 4.61, and 4.79 > 4.61 => anomaly detected
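A minimal sketch reproducing this calculation, using scipy only to obtain the chi-square quantile:

```python
from scipy.stats import chi2

expected = [40, 30, 30]  # m * p_i for links A-b1, C-b1, D-b1 over m = 100 requests
observed = [45, 35, 20]

D = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
threshold = chi2.ppf(0.90, df=len(observed) - 1)  # (1 - alpha) quantile, alpha = 0.1

print(round(D, 2), round(threshold, 2))          # 4.79 4.61
print("anomaly" if D > threshold else "normal")  # anomaly
```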

18 Path Shapes Modeling
The shape of a runtime path is defined to be the ordered set of component classes traversed.
A path shape is represented as a tree in which each node represents a component class, and a directional edge represents the causal relationship between two adjacent nodes.

19 Path Shapes Modeling
A probabilistic context-free grammar (PCFG), in Chomsky Normal Form (CNF), is used for path-shape modeling. The grammar consists of:
A list of terminal symbols Tk: the component classes appearing in path shapes
A list of nonterminal symbols Ni, which denote the stages of the production rules
N1: the start symbol, often denoted S
$: marks the end of a rule
All other nonterminal symbols are replaced via production rules (see below)
A list of production rules Ni -> zj, where zj is a sequence of terminals and nonterminals
A list of rule probabilities Rij = P(Ni -> zj)

20 Path Shape Modeling: Example
Path shape for 4 end-user requests.
There is a 100% probability for a call to transit from A to B:
R1j: S -> A, p = 1.0
R2j: A -> B, p = 1.0

21 Path Shape Modeling: Example
For B there are 3 possible transitions: to C with 25% probability, to D with 25%, and to both C and D with 50%:
R3j: B -> C, p = 0.25 | B -> D, p = 0.25 | B -> CD, p = 0.5
Once a call reaches C or D, it must transit to E, hence:
R4j: C -> E, p = 1.0
R5j: D -> E, p = 1.0
E is the last stop for all paths:
R6j: E -> $, p = 1.0
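A minimal sketch of this example grammar with a conformance check. For simplicity each path is flattened to a linear sequence of component classes (so the B -> CD branching rule is listed but not exercised); a real implementation would parse tree-shaped path shapes.

```python
# Production probabilities from the example grammar above.
RULES = {
    ("S", "A"): 1.0,
    ("A", "B"): 1.0,
    ("B", "C"): 0.25,
    ("B", "D"): 0.25,
    ("B", "CD"): 0.5,
    ("C", "E"): 1.0,
    ("D", "E"): 1.0,
    ("E", "$"): 1.0,
}

def path_probability(path):
    """Probability the grammar assigns to a path; 0 means the path does
    not conform to the grammar, i.e., an anomaly is flagged."""
    prob = 1.0
    for lhs, rhs in zip(["S"] + path, path + ["$"]):
        prob *= RULES.get((lhs, rhs), 0.0)
    return prob

print(path_probability(["A", "B", "C", "E"]))  # 0.25 -- conforming
print(path_probability(["A", "B", "E"]))       # 0.0  -- anomalous
```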

22 Path Shape Modeling: Anomaly Detection
The path shape of each new request can be checked to see whether it conforms to the grammar; an anomaly is detected if a path shape does not conform.
The PCFG by itself only detects faults; it does not pinpoint the root cause (fault localization). Another method, such as a decision tree, is needed for that.

23 Microreboot
Microreboot: many problems can be fixed by simply restarting the faulty component. Works best with component-based systems.
System design guidelines:
Component-based design, such as Java EE with EJB
Separate application-logic execution from state management, so that a reboot does not cause state loss
Loose coupling, to enable localized microreboots
Reduced dependencies among components: a component should be self-contained, or its interactions with other components should be mediated (e.g., via the Java EE container)
Key: any instance of a referenced component should be able to get the job done => when one instance undergoes a microreboot, another instance can provide the same service
Resilient inter-component interactions
Lease-based resource management

24 Microreboot: Automatic Recovery with Microreboot
The system is equipped with a fault monitor and a recovery manager.
The fault monitor implements some of the fault detection and localization algorithms described previously.
The recovery manager is responsible for recovering the system from faults recursively: first microreboot the identified faulty component; if the symptom does not disappear, microreboot a group of components according to a fault-dependency graph; if microrebooting does not work, reboot the entire system; the final resort is to notify a human operator.

25 Microreboot: Fault-Dependency Graph
Fault-dependency graph (f-map): components as nodes and fault-propagation paths as edges.
Can be obtained using automatic failure-path inference (AFPI):
The f-map is first constructed by observing the system's behavior when faults are injected
It is then refined during normal operation
Cycles in the f-map: the nodes in a cycle are grouped into a single node, and the entire group is microrebooted as a single unit; this converts the f-map into a recovery map (r-map).

26 Microreboot: Automatic Recovery with Microreboot
Reboot both the reported faulty component and all components immediately downstream of it in the r-map.
If the faulty symptom persists, the upstream components in the r-map are also microrebooted.
Recovery is carried out recursively until, in the worst case, the entire system is rebooted.
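A minimal sketch of this recursive escalation policy; the r-map, the reboot action, and the health check are all made up for illustration.

```python
# Hypothetical r-map: component -> components immediately downstream of it.
R_MAP = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["E"], "E": []}
UPSTREAM = {c: [u for u, ds in R_MAP.items() if c in ds] for c in R_MAP}

def microreboot(component):
    print(f"microrebooting {component}")  # stand-in for the real restart action

def recover(faulty, symptom_gone):
    """Reboot the faulty component plus its immediate downstream components,
    then escalate: upstream components, full reboot, human operator."""
    for c in {faulty, *R_MAP[faulty]}:
        microreboot(c)
    if symptom_gone():
        return
    for c in UPSTREAM[faulty]:
        microreboot(c)
    if symptom_gone():
        return
    print("rebooting entire system")
    if not symptom_gone():
        print("notifying human operator")

# Usage with a made-up health check that never clears:
recover("C", symptom_gone=lambda: False)
```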

27 Microreboot: Implications of Microreboot
Faulty components can be microrebooted before a node-level failure occurs.
Because microreboots are cheap, more false positives in fault detection can be tolerated.
Proactive microreboots can be used for software rejuvenation.
Fault transparency for end users is enhanced.

28 Overcoming Operator Errors
System dependability is significantly reduced by human (operator) errors.
Checkpointing and logging are useful but not sufficient; repair must operate at the system level, with state repair and selective replay:
System-level undo (rewind), repair, then system-level redo (replay).

29 Exercise 1
Identify the set of most recent checkpoints that can be used to recover the system shown here after the crash of P1.

30 Exercise 2
The Chandy-Lamport distributed snapshot protocol is used to produce a consistent global state of the system shown below. Draw all control messages sent in the CL protocol and the checkpoints taken at P1 and P2, and specify the channel states for the P0 to/from P1 channels, the P1 to/from P2 channels, and the P2 to/from P0 channels.

31 Exercise 3
Prove that the Chandy-Lamport distributed snapshot protocol produces consistent checkpoints of the system.

32 Exercise 4
The following interactions occurred at instance b1 during the training period: A issued 300 remote invocations on b1, and the local method invocations involving C, D, E, and F numbered 200, 300, 200, and 200, respectively. During a later observation period, 35 remote invocations involving A and 25, 20, 15, and 25 local method invocations involving C, D, E, and F are observed. Determine whether an anomaly is present in the system.

