Chapter 3 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.


1 Chapter 3 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems Building Dependable Distributed Systems, Copyright Wenbing Zhao 1

2 Outline
- Recovery-oriented computing
  - Overview
  - Application-level fault detection
    - Structural behavior monitoring
    - Path shape analysis

3 Recovery-Oriented Computing
- On the availability of soft real-time systems
  - Availability = MTTF / (MTTF + MTTR)
  - MTTF: mean time to failure
  - MTTR: mean time to recover
  - Availability can be improved by increasing MTTF as well as by reducing MTTR
- Recovery-oriented computing focuses on reducing MTTR
  - Making fault detection faster and more accurate
  - Making recovery faster
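The availability formula above can be tried out with a few numbers. The sketch below is only an illustration: the function name and the MTTF/MTTR values are hypothetical and do not come from the slides. It shows that halving MTTR improves availability by the same amount as doubling MTTF, which is the motivation for concentrating on recovery time.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), as defined on the slide."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical values, chosen only for illustration.
baseline = availability(mttf_hours=1000.0, mttr_hours=1.0)
double_mttf = availability(mttf_hours=2000.0, mttr_hours=1.0)
half_mttr = availability(mttf_hours=1000.0, mttr_hours=0.5)

print(f"baseline:     {baseline:.6f}")
print(f"doubled MTTF: {double_mttf:.6f}")
print(f"halved MTTR:  {half_mttr:.6f}")
```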

4 Fault Detection and Localization
- Fault detection: determine whether some component in the system has failed
- Fault localization: pinpoint the particular component that failed
- Low-level fault detection mechanisms
  - Based on timeouts, probing each component periodically with a heartbeat message
  - Cannot detect many application-level faults
- Recovery-oriented computing focuses on application-level fault detection and localization
  - 75% of the recovery time is spent on application-level fault detection
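A timeout-based detector of the kind described above could look roughly like the following. This is a minimal sketch under stated assumptions: the class name, method names, component names, and the 3-second timeout are all hypothetical, and timestamps are passed in explicitly instead of being read from a clock or a network.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Hypothetical timeout-based fault detector (not from the slides)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, component: str, now: float) -> None:
        # Remember when the component last answered a probe.
        self.last_seen[component] = now

    def suspected(self, now: float) -> List[str]:
        # A component is suspected once it has been silent longer than the timeout.
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record_heartbeat("web", now=0.0)
monitor.record_heartbeat("db", now=0.0)
monitor.record_heartbeat("web", now=2.0)  # "db" stays silent
suspects = monitor.suspected(now=4.0)
print(suspects)
```

A component that keeps answering heartbeats while returning wrong results never shows up in `suspects`, which is why this mechanism misses many application-level faults.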

5 Microreboot and System-Level Undo/Redo
- Microreboot: many problems can be fixed by simply restarting the faulty component
  - Works best with component-based systems
- For problems that cannot be fixed by a microreboot: perform a system-level undo, fix the problem, then carry out a system-level redo
  - Based on checkpointing and logging

6 System Model for Recovery-Oriented Computing
- Three-tier architecture
  - Separates application logic from data management
  - The middle tier is stateless or maintains only session state
- Component-based middleware
  - Java Platform, Enterprise Edition (Java EE, often referred to as J2EE)
  - Key component: Enterprise JavaBean (EJB)

7 Application-Level Fault Detection
- Fail-stop faults can be detected using timeouts
- Application-level faults can only be detected at the application level
- One plausible fault detection method: acceptance tests
  - Developers would have to write effective and efficient acceptance test routines
  - Not practical for Internet apps because of their scale, complexity, and rapid rate of change
- ROC-based approach: measure and monitor the structural behavior of an app
  - May detect app-level faults without a priori knowledge of the app details

8 Structural Behavior Monitoring
- Interaction patterns between different components reflect the app-level functionality
  - Each component implements a specific app function, e.g.,
    - A stateful session bean to manage a user's shopping cart
    - A set of singleton session beans to keep track of inventory
  - The internal structural behavior can be monitored to infer whether or not the app is functioning normally
  - To monitor: log the runtime path for each end-user request, including all incoming messages, outgoing messages, method invocations, etc.

9 Structural Behavior: Runtime Path Example
- Runtime path for a single end-user request
  - Spans 5 components
  - Consists of 10 events

10 Structural Behavior: Machine Learning
- Train reference models using machine learning
- Historical reference model: train with aggregated runtime path data
  - Objective: anomaly detection based on historical behavior
  - May use real workload as well as synthetic workload that resembles real workload
- Peer reference model: train with the most recent runtime path data
  - Objective: anomaly detection with respect to the peer components
  - Must train with real workload
- Fault (anomaly) detection: compare observed patterns with those in the reference models

11 Component Interactions Modeling
- Focus on interactions between a component instance and all other component classes
  - More scalable: can cope with cases where there are many instances of each class
  - Suitable for using the chi-square test for anomaly detection

12 Component Interactions Modeling
- Given a system with n component classes, the interaction model for a component instance consists of a set of n-1 weighted links between the instance and all the other n-1 component classes
  - We assume that instances of the same class do not interact with each other
  - We assume that interactions are symmetric (i.e., request and reply)
  - The weight assigned to each link is the probability of the component instance interacting with the linked component class
  - The weights on all links sum to 1, i.e., the component instance interacts with the other component classes with probability 1

13 Component Interaction Model: Example
- Class A: web component, handles end-user requests
- Class B: app logic, handles conversations with end-users, 3 instances
- Class C and Class D: also app logic, representing shared state
- Class E: database server, persistent state

14 Component Interaction Model: Example
- Machine learning: determine the link weights based on training data
- Training data
  - A issued 400 remote invocations on b1
  - b1 issued 300 local method invocations on C, and 300 invocations on D
  - What happened between C and E, and between D and E, is not important
- Link weight calculation
  - Total number of interactions that occurred at instance b1: 1000
  - P(b1-A) = 400/1000 = 0.4
  - P(b1-C) = 300/1000 = 0.3
  - P(b1-D) = 300/1000 = 0.3
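The link weight calculation above amounts to normalizing interaction counts. The sketch below reproduces the numbers from the slide; the function name and the dictionary layout are assumptions made for illustration, not the author's implementation.

```python
from typing import Dict


def link_weights(interaction_counts: Dict[str, int]) -> Dict[str, float]:
    """Turn per-class interaction counts for one instance into link weights."""
    total = sum(interaction_counts.values())
    if total == 0:
        raise ValueError("no interactions observed for this instance")
    return {peer: count / total for peer, count in interaction_counts.items()}


# Training data for instance b1, taken from the slide.
b1_counts = {"A": 400, "C": 300, "D": 300}
b1_weights = link_weights(b1_counts)
print(b1_weights)
```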

15 Anomaly Detection
- Compare current behavior with the trained behavior using the chi-square test
  - Prepare the observed data as a histogram
  - Compare the distributions using the formula $D = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i}$
    - $n$: number of cells in the histogram
    - $e_i$: expected frequency in cell $i$
    - $o_i$: observed frequency in cell $i$
    - If $e_i$ is 0, the cell should be pruned
- Each link is regarded as a cell
- For an observation period of $m$ requests, the expected frequency for link $i$ is $e_i = m \cdot p_i$, where $p_i$ is the trained weight of link $i$
- No anomaly: ideally $D = 0$. In practice $D$ is not 0 because of randomness; it follows a chi-square distribution

16 Anomaly Detection: Chi-Square Test
- Anomaly detected: D is greater than the (1 - α) quantile of the chi-square distribution with k = n - 1 degrees of freedom, at a level of significance α
  - A higher α => more sensitive => more false positives
- Level of significance: the probability of rejecting the null hypothesis in a statistical test when it is true

17 Anomaly Detection: Chi-Square Test Example
- Observation period: 100 requests
  - A issued 45 requests on b1
  - b1 issued 35 invocations on C, and 20 invocations on D
- Link(A-b1): expected 100 * 0.4 = 40, observed 45
- Link(C-b1): expected 100 * 0.3 = 30, observed 35
- Link(D-b1): expected 100 * 0.3 = 30, observed 20
- $D = \frac{(45-40)^2}{40} + \frac{(35-30)^2}{30} + \frac{(20-30)^2}{30} \approx 4.79$
- Chi-square test: there are 2 degrees of freedom (only 3 cells); for α = 0.1 the 90% quantile is 4.6, and 4.79 > 4.6 => anomaly detected
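The worked example above can be checked with a few lines of code. This is a sketch, not the author's implementation, and the function names are hypothetical. To stay within the standard library, the threshold uses the closed form that holds only for 2 degrees of freedom, where the chi-square distribution is an exponential distribution with mean 2, so the (1 - α) quantile is -2 ln α. A general solution would need a statistics library.

```python
import math
from typing import Dict


def chi_square_statistic(observed: Dict[str, int],
                         weights: Dict[str, float],
                         m: int) -> float:
    """D = sum over links of (o_i - e_i)^2 / e_i, with e_i = m * p_i."""
    d = 0.0
    for link, p in weights.items():
        expected = m * p
        if expected == 0:
            continue  # cells with zero expected frequency are pruned
        d += (observed.get(link, 0) - expected) ** 2 / expected
    return d


def chi2_quantile_df2(alpha: float) -> float:
    """(1 - alpha) quantile of the chi-square distribution; valid for k = 2 only."""
    return -2.0 * math.log(alpha)


weights = {"A": 0.4, "C": 0.3, "D": 0.3}   # trained link weights for b1
observed = {"A": 45, "C": 35, "D": 20}     # counts over 100 requests

d = chi_square_statistic(observed, weights, m=100)
threshold = chi2_quantile_df2(alpha=0.1)
anomaly = d > threshold
print(f"D = {d:.2f}, threshold = {threshold:.2f}, anomaly = {anomaly}")
```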

18 Path Shapes Modeling
- The shape of a runtime path is defined to be the ordered set of component classes
- A path shape is represented as a tree in which a node represents a component class
  - A directional edge represents the causal relationship between two adjacent nodes

19 Path Shapes Modeling
- A probabilistic context-free grammar (PCFG) in Chomsky Normal Form (CNF) is used for path shape modeling
  - A list of terminal symbols, T_k: the component classes in a path shape form T_k
  - A list of nonterminal symbols, N_i
    - Denote the stages of the production rules
    - N_1: start symbol, often denoted as S
    - $: the end of a rule
    - All other nonterminal symbols are to be replaced by production rules (see below)
  - A list of production rules, N_i -> ζ_j (ζ_j is a list of terminals and nonterminals)
  - A list of probabilities, R_ij = P(N_i -> ζ_j)

20 Path Shape Modeling: Example
- Path shapes for 4 end-user requests
- 100% probability for the call to transit from A to B
  - R_1j: S -> A, p = 1.0
  - R_2j: A -> B, p = 1.0

21 Path Shape Modeling: Example
- For B, there are 3 possible transitions: to C with 25% probability, to D with 25% probability, and to both C and D with 50% probability
  - R_3j: B -> C, p = 0.25 | B -> D, p = 0.25 | B -> CD, p = 0.5
- Once a call reaches C or D, it must transit to E, hence:
  - R_4j: C -> E, p = 1.0
  - R_5j: D -> E, p = 1.0
- E is the last stop for all calls
  - R_6j: E -> $, p = 1.0
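The rule probabilities in this example can be estimated by counting rule applications across the observed path shapes. The sketch below assumes four requests whose shapes match the stated percentages (one through C, one through D, two through both); the original figure is not available, so that training set, the tuple representation of a rule, and the function name are all assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (left-hand side, right-hand side)


def estimate_rules(paths: List[List[Rule]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(lhs -> rhs) as the relative frequency of each rule per lhs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for path in paths:
        for lhs, rhs in path:
            counts[lhs][rhs] += 1
    return {
        lhs: {rhs: n / sum(c.values()) for rhs, n in c.items()}
        for lhs, c in counts.items()
    }


# Assumed training set: rule applications for four end-user requests.
via_c = [("S", "A"), ("A", "B"), ("B", "C"), ("C", "E"), ("E", "$")]
via_d = [("S", "A"), ("A", "B"), ("B", "D"), ("D", "E"), ("E", "$")]
via_both = [("S", "A"), ("A", "B"), ("B", "CD"),
            ("C", "E"), ("D", "E"), ("E", "$")]

rules = estimate_rules([via_c, via_d, via_both, via_both])
print(rules["B"])
```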

22 Path Shape Modeling: Anomaly Detection
- The path shapes of new requests can be checked to see whether they conform to the grammar
- An anomaly is detected if a path shape does not conform to the grammar
- The PCFG by itself only detects a fault; it does not pinpoint the root cause (fault localization)
  - Another method, such as a decision tree, is needed for localization
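A conformance check against the grammar might be sketched as follows. The grammar table is copied from the example rules on the preceding slides; the representation of a path as a list of rule applications, the function name, and the sample paths are assumptions. As a simplification, the check only verifies that every rule used has nonzero probability and does not validate that the derivation is complete.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (left-hand side, right-hand side)

# Production rules and probabilities from the path shape example.
GRAMMAR: Dict[str, Dict[str, float]] = {
    "S": {"A": 1.0},
    "A": {"B": 1.0},
    "B": {"C": 0.25, "D": 0.25, "CD": 0.5},
    "C": {"E": 1.0},
    "D": {"E": 1.0},
    "E": {"$": 1.0},
}


def conforms(path: List[Rule]) -> bool:
    """True if every rule application in the path is allowed by the grammar."""
    return all(GRAMMAR.get(lhs, {}).get(rhs, 0.0) > 0.0 for lhs, rhs in path)


# Hypothetical observations: a normal request, and one where B returns
# without ever calling C or D.
normal_path = [("S", "A"), ("A", "B"), ("B", "C"), ("C", "E"), ("E", "$")]
faulty_path = [("S", "A"), ("A", "B"), ("B", "$")]

print(conforms(normal_path), conforms(faulty_path))
```

The check flags `faulty_path` as an anomaly but says nothing about which component caused it, which matches the point above that localization needs a separate method.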

