Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth.

1 Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, Ming Zhang Microsoft Research Presented by Zhenyu Pan

2 Introduction Enterprise Network Service
Network service: an (IP address, port) pair.
Enterprise network
– the network of a single enterprise.
– traffic does not cross the open Internet.
– user-perceptible service degradations are rampant.

3 Introduction Sherlock System
Conventional approach
– treats each service as either up or down.
– box-centric, blind to the complex set of dependencies.
– produces meaningless alerts (15,000 alerts a day, almost universally ignored).
The new approach
– models service availability as a 3-state value:
   Up: response time is normal;
   Down: requests result in either an error status or no response at all;
   Troubled: response times fall significantly outside the normal range.
– user-centric: does not report problems that do not directly affect users.

4 Introduction Sherlock System
System components
– Detects faults and performance problems by monitoring the response time of services.
   Software agents: run on each host, analyze packets, and determine the set of services the host depends on.
– Determines the set of components that could be responsible: a service, a router, a link, etc.
   Sherlock server: assembles a multi-level, 3-state inference graph that captures the dependencies between all components.
– Localizes the problem to the most likely component.
   Ferret algorithm: localizes faults using the inference graph.
Main contributions
– Inference graph
– Ferret algorithm

5 Introduction Inference Graph An example

6 Inference Graph Node Types
Root-cause node: a physical component whose failure can cause an end user to experience failures.
– computer, service, router, IP link, etc.
– two special root-cause nodes, always troubled (AT) and always down (AD), model external factors.
Observation node: an access to a network service whose performance can be measured by Sherlock.
Meta-node: models the dependencies between root causes and observations. Three types of meta-nodes:
– noisy-max
– selector (load balancers)
– failover (failover redundancy)

7 Inference Graph Node States
The state of each node
– a three-tuple: (Pup, Ptroubled, Pdown)
– P stands for probability
– Pup + Ptroubled + Pdown = 1
The state of a root-cause node is independent of any other node.
The state of an observation node is uniquely determined by the states of its ancestors.

8 Inference Graph Edges
Edge from node A to node B:
– encodes the dependency that node A has to be in the up state for node B to be up.
– For example, a client cannot retrieve a file from a file server if the path to that file server is down. A client might still be able to retrieve the file when the DNS server is down, if the file server's name-to-IP-address mapping is found in the client's local DNS cache.
– the dependency probability on the edge indicates how strong the dependency is.

9 Inference Graph Propagation of State
Noisy-Max Meta-Nodes
– Max: the node takes the worst condition of its parents.
– Noisy: if the weight of a parent's edge is d, then with probability (1-d) the child is not affected by that parent.
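The noisy-max rule can be made concrete with a short Python sketch. It is derived only from the two bullets on this slide, under the assumption that each parent affects the child independently with probability d; the function name `noisy_max` and the input layout are illustrative choices, not taken from the presentation.

```python
from math import prod

def noisy_max(parents):
    """Combine parent states at a noisy-max meta-node (illustrative sketch).

    parents: list of (state, d) pairs, where state is a
    (p_up, p_troubled, p_down) tuple and d is the edge weight.
    Assumes parents influence the child independently.
    """
    # The child is up only if every parent is up or fails to affect it.
    p_up = prod(up + (1 - d) * (troubled + down)
                for (up, troubled, down), d in parents)
    # The child avoids "down" only if no parent is both down and affecting it.
    p_not_down = prod(1 - d * down for (_, _, down), d in parents)
    p_down = 1 - p_not_down
    # The remaining probability mass is the troubled state.
    p_troubled = p_not_down - p_up
    return (p_up, p_troubled, p_down)
```

Each product has one factor per parent, so the cost grows linearly with the number of parents.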

10 Inference Graph Propagation of State
Selector Meta-Nodes
– Used to model load balancing.
   NLB: Network Load Balancer.
   ECMP (equal-cost multipath): routers send packets to a destination along several paths.
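The slide does not give the selector formula, so the following Python sketch rests on an assumption: the child takes the state of exactly one parent, chosen with probability equal to that parent's share of the traffic. The name `selector` and the `share` values are hypothetical.

```python
def selector(parents):
    """State of a selector meta-node (sketch under a stated assumption).

    parents: list of (state, share) pairs, where state is a
    (p_up, p_troubled, p_down) tuple and share is the fraction of
    requests routed to that parent. Shares are assumed to sum to 1.
    """
    total = sum(share for _, share in parents)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    # Weighted mixture of the parent states, one component at a time.
    return tuple(sum(state[i] * share for state, share in parents)
                 for i in range(3))
```

For example, a load balancer that splits requests evenly between a healthy server and a failed one yields a child that is up half the time and down half the time.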

11 Inference Graph Propagation of State
Failover Meta-Nodes
– Failover: clients access primary servers and fail over to backup servers when the primary server is inaccessible.
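A simplified Python sketch of the failover idea follows. It assumes the child mirrors the primary unless the primary is down, in which case it takes on the backup's state; the actual truth table used by Sherlock may differ, and the name `failover` is illustrative.

```python
def failover(primary, secondary):
    """State of a failover meta-node (simplified, assumed semantics).

    primary, secondary: (p_up, p_troubled, p_down) tuples.
    Assumption: clients switch to the secondary only when the
    primary is down, and the two servers fail independently.
    """
    p_up, p_troubled, p_down = primary
    s_up, s_troubled, s_down = secondary
    return (
        p_up + p_down * s_up,              # primary up, or failed over to an up backup
        p_troubled + p_down * s_troubled,  # primary troubled, or backup troubled
        p_down * s_down,                   # both primary and backup down
    )
```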

12 Inference Graph Time to Propagate
The majority of the nodes with more than one parent are noisy-max meta-nodes.
– For these nodes, computation time is O(n), where n is the number of parents.
For selector and failover meta-nodes:
– computation still needs O(3^n) time.
– However, these two types of meta-nodes have no more than 6 parents.

13 Inference Graph Fault Localization
Assignment vector
– An assignment of state to every root-cause node, in which each root-cause node has probability 1 of being either up, troubled, or down.
Our target
– Find the assignment vector that best explains the observations.
Ferret
– sets the root causes to the states specified in the assignment vector and then propagates state probabilities downwards until they reach the observation nodes.
– for each observation node, computes a score based on how well the probabilities in the state of the observation node agree with the statistical evidence.

14 Inference Graph Fault Localization
It is infeasible to traverse all possible assignment vectors to find the one with the highest score: with r root-cause nodes and three states each, there are 3^r vectors.
– OBSERVATION 1. It is very likely that at any point in time only a few root-cause nodes are troubled or down.
– OBSERVATION 2. Since a root cause is assigned to be up in most assignment vectors, evaluating an assignment vector only requires re-evaluating the states of the descendants of root-cause nodes that are not up.
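Observation 1 suggests restricting the search to assignment vectors with only a few faulty root causes. The Python sketch below illustrates that pruning only; it is not the Ferret algorithm as published. The names `candidate_vectors` and `best_assignment`, the `max_faults` default, and the caller-supplied `score` function are all hypothetical.

```python
from itertools import combinations, product

def candidate_vectors(root_causes, max_faults):
    """Yield assignment vectors with at most max_faults non-up root causes.

    Each vector is a dict that lists only the faulty root causes;
    every root cause not mentioned is implicitly up.
    """
    for k in range(max_faults + 1):
        for faulty in combinations(root_causes, k):
            for states in product(("troubled", "down"), repeat=k):
                yield dict(zip(faulty, states))

def best_assignment(root_causes, score, max_faults=2):
    """Return the highest-scoring candidate under a caller-supplied score."""
    return max(candidate_vectors(root_causes, max_faults), key=score)
```

With three root causes and `max_faults=2`, this enumerates 19 candidates instead of all 27 vectors; the gap widens quickly as the number of root causes grows. Observation 2 would further reduce the cost of each `score` call by recomputing only the descendants of the faulty nodes, which this sketch leaves to the scoring function.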

15 Inference Graph ferret algorithm

16 Sherlock System Three-Step Process
Step 1: Service-level dependency graph
– The dependency probability of a host on service A when accessing service B is the probability that the host needs to communicate with service A before it can successfully communicate with service B.
Step 2: Inference graph
Step 3: Fault localization using Ferret
Score for a given assignment vector:
– Track the history of response times and fit two Gaussian distributions to the data, namely Gaussian_up and Gaussian_troubled.
– If the observed response time is t and the predicted state of the observation node is (p_up, p_troubled, p_down), then the score of this vector is calculated as: p_up * Prob(t | Gaussian_up) + p_troubled * Prob(t | Gaussian_troubled)
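The scoring formula for one observation can be sketched in Python as follows. This assumes Prob(t | Gaussian) is read as the Gaussian probability density at t; the function name `score_observation` and the example distribution parameters are hypothetical, not values from the presentation.

```python
from statistics import NormalDist

def score_observation(t, predicted, gaussian_up, gaussian_troubled):
    """Score one observation against a predicted node state (sketch).

    t: observed response time.
    predicted: (p_up, p_troubled, p_down) for the observation node.
    gaussian_up, gaussian_troubled: NormalDist objects fitted to the
    response-time history (how the history is split is not shown here).
    """
    p_up, p_troubled, _p_down = predicted
    return p_up * gaussian_up.pdf(t) + p_troubled * gaussian_troubled.pdf(t)

# Hypothetical fitted distributions, in seconds.
up = NormalDist(mu=0.1, sigma=0.02)
troubled = NormalDist(mu=0.5, sigma=0.1)

# A fast response agrees better with an "up" prediction than a "troubled" one.
healthy_score = score_observation(0.1, (1.0, 0.0, 0.0), up, troubled)
troubled_score = score_observation(0.1, (0.0, 1.0, 0.0), up, troubled)
```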

17 Sherlock System Discovering Service-Level Dependencies
OBSERVATION: If accessing service B depends on service A, then packets exchanged with A and B are likely to co-occur.
Dependency probability: the conditional probability of accessing service A within the dependency interval prior to accessing service B.
– Dependency interval: 10 ms
Chance of co-occurrence: first calculate the average interval, I, between accesses to the same service, and estimate the likelihood of "chance co-occurrence" as (10 ms)/I. Only dependencies whose dependency probability is much greater than the likelihood of chance co-occurrence are retained.
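Both quantities on this slide can be estimated from per-service access timestamps. The Python sketch below is one possible reading: it assumes timestamps in seconds, takes "the same service" to mean service A, and uses hypothetical function names. How much greater the dependency probability must be than chance is not specified on the slide, so that comparison is left to the caller.

```python
from bisect import bisect_left

DEPENDENCY_INTERVAL = 0.010  # 10 ms, as stated on the slide

def dependency_probability(times_a, times_b, interval=DEPENDENCY_INTERVAL):
    """Fraction of accesses to B preceded by an access to A within interval."""
    times_a = sorted(times_a)
    if not times_b:
        return 0.0
    hits = 0
    for tb in times_b:
        lo = bisect_left(times_a, tb - interval)  # first A access inside the window
        hi = bisect_left(times_a, tb)             # A accesses strictly before tb
        if hi > lo:
            hits += 1
    return hits / len(times_b)

def chance_cooccurrence(times_a, interval=DEPENDENCY_INTERVAL):
    """Estimate chance co-occurrence as interval / average gap between A accesses."""
    times_a = sorted(times_a)
    if len(times_a) < 2:
        return 0.0
    average_gap = (times_a[-1] - times_a[0]) / (len(times_a) - 1)
    if average_gap <= 0:
        return 1.0
    return min(1.0, interval / average_gap)
```

For instance, if A is accessed about once per second, the chance estimate is 0.01, so a dependency probability near 0.67 would stand out clearly.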

18 Sherlock System Constructing the Inference Graph
To build the graph, the Sherlock server:
– creates a noisy-max meta-node to represent the service.
– creates an observation node and makes the service meta-node a parent of the observation node.
– examines the service dependency information recursively.
– creates a root-cause node to represent the host on which the service runs and makes this root cause a parent of the meta-node.
– adds network topology information using traceroute results. For each path between hosts, it adds a noisy-max meta-node to represent the path and root-cause nodes to represent every router and link on the path.
– adds each of these root causes as parents of the path meta-node.
– adds the AT and AD root-cause nodes. Each edge connecting AT or AD to an observation node gets a weight of 0.001, and each edge between a router and a path meta-node gets a probability of 0.999.

19 Implementation Agent

20 Implementation Service Dependency Graphs

21 Implementation Dependency Probabilities

22 Implementation Test Bed

23 Evaluation Root Cause

24 Evaluation Performance Comparison

25 Evaluation Error Influence

26 Summary
Main contribution: the Sherlock system
– Assists IT administrators with troubleshooting.
– A multi-level probabilistic inference model.
– Automatic construction of the inference graph.
– An algorithm to localize the root cause.

