1 Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies Defense by Chen, Jiazhen & Chen, Shiqi

2 Introduction Using a network-based service can be a frustrating experience… Even inside the network of a single enterprise, where traffic does not cross the open Internet, user-perceptible service degradations are rampant.

3 Introduction When 13% of requests take 10x longer than the normal, acceptable time to process, it is time to do something.

4 Introduction Conventional approach: up OR down. Our approach: up, troubled, or down. A service is troubled when responses fall significantly outside the normal response time, including the case when only a subset of service requests are performing poorly.

5 Sherlock Detects the existence of faults and performance problems by monitoring the response time of services. Determines the set of components that could be responsible. Localizes the problem to the most likely component.

6 Inference Graph and Ferret Algorithm Inference Graph: captures the dependencies between all components of the IT infrastructure, so that the set of components that could be responsible can be determined. Ferret Algorithm: efficiently localizes faults in enterprise-scale networks to the most likely component, using the Inference Graph and the service response times measured by agents.

7 Tradeoffs Sherlock falls short of full problem diagnosis. It has not been evaluated in an environment where dependencies are deliberately and frequently changed. However, measurements indicate that the vast majority of enterprise applications do not fall into these classes.

8 Inference Graph in detail

9 Node structure Root-cause node: a physical component whose failure can cause an end-user to experience failures (plus two special root-cause nodes, AT for always troubled and AD for always down). Observation node: an access to a network service whose performance can be measured by Sherlock. Meta-node: models the dependencies between root causes and observations.

10 Node State The state of each node is expressed by a three-tuple (P_up, P_trouble, P_down), with P_up + P_trouble + P_down = 1. The state of a root-cause node is independent of any other node in the Inference Graph.
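
To make the three-state representation concrete, the following minimal sketch (class and field names are illustrative, not from the paper) shows how a node's state could be stored and checked:

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    """State of an Inference Graph node as (P_up, P_trouble, P_down)."""
    p_up: float
    p_trouble: float
    p_down: float

    def __post_init__(self):
        # The three probabilities must form a valid distribution.
        total = self.p_up + self.p_trouble + self.p_down
        assert abs(total - 1.0) < 1e-9, "state probabilities must sum to 1"

# A root-cause node's state is set independently of every other node,
# e.g. a mostly healthy server:
server_state = NodeState(p_up=0.99, p_trouble=0.005, p_down=0.005)
```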

11 Edges An edge from node A to node B encodes the dependency that node A has to be in the up state for node B to be up. Each edge carries a dependency probability indicating how strong the dependency is.

12 Meta-Nodes and the Propagation of State Noisy-Max Meta-Nodes Max: the child takes the worst state among its parents. Noisy: if the weight of a parent’s edge is d, then with probability (1-d) the child is not affected by that parent.

13 Meta-Nodes and the Propagation of State Selector Meta-Nodes Used to model load-balancing scenarios. NLB: Network Load Balancer. ECMP (Equal-Cost Multi-Path): a commonly used technique in enterprise networks where routers send packets to a destination along several paths.

14 Meta-Nodes and the Propagation of State Failover Meta-Nodes Failover: clients access primary production servers and fail over to backup servers when the primary server is inaccessible.

15 Calculation of State Propagation For Noisy-Max Meta-Nodes: time complexity is O(n) instead of O(3^n).
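
The linear-time propagation can be sketched as below, following the noisy-max semantics of the previous slide (with probability d a parent's state propagates to the child, otherwise the parent is ignored, and the child takes the worst propagated state); the function name and data layout are assumptions, not the paper's code:

```python
def noisy_max_propagate(parents):
    """Compute a noisy-max meta-node's state from its parents in O(n) time.

    `parents` is a list of ((p_up, p_trouble, p_down), d) pairs, where d is
    the dependency probability on the edge from that parent.
    """
    prob_up = 1.0        # probability that no parent pushes the child out of "up"
    prob_not_down = 1.0  # probability that no parent pushes the child into "down"
    for (p_up, _, p_down), d in parents:
        # With probability (1 - d) the parent has no effect on the child.
        prob_up *= (1 - d) + d * p_up
        prob_not_down *= 1 - d * p_down
    p_down = 1 - prob_not_down
    return (prob_up, 1 - prob_up - p_down, p_down)

# Example: one parent certainly up, one slightly troubled parent with weight 0.5.
child = noisy_max_propagate([((1.0, 0.0, 0.0), 1.0), ((0.9, 0.1, 0.0), 0.5)])
```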

16 Calculation of State Propagation For Selector and Failover meta-nodes: still needs O(3^n) time. HOWEVER, these two types of meta-nodes have no more than 6 parents.
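
Because a selector or failover meta-node has at most about 6 parents, enumerating all 3^n combinations of parent states stays cheap (3^6 = 729). A brute-force sketch, with an illustrative failover rule (the helper names are assumptions):

```python
from itertools import product

UP, TROUBLE, DOWN = 0, 1, 2

def enumerate_propagate(parent_states, child_state_of):
    """Brute-force state propagation for selector/failover meta-nodes.

    `parent_states` is a list of (p_up, p_trouble, p_down) tuples and
    `child_state_of` maps a tuple of concrete parent states to the child's
    state under the meta-node's rule.
    """
    child = [0.0, 0.0, 0.0]
    for combo in product((UP, TROUBLE, DOWN), repeat=len(parent_states)):
        prob = 1.0
        for state, dist in zip(combo, parent_states):
            prob *= dist[state]          # parents are independent
        child[child_state_of(combo)] += prob
    return tuple(child)

def failover_rule(combo):
    # Follow the primary (parent 0) unless it is down, then follow the secondary.
    return combo[1] if combo[0] == DOWN else combo[0]

# Primary is down 10% of the time, secondary is always up.
print(enumerate_propagate([(0.9, 0.0, 0.1), (1.0, 0.0, 0.0)], failover_rule))
```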

17 Fault Localization Assignment vector: an assignment of state to every root-cause node in the Inference Graph in which each root-cause node has probability 1 of being either up, troubled, or down. Our goal: find the assignment vector that best explains the observations.

18 Ferret Algorithm Basically, for EACH vector we need to compute a score for how well that vector explains the observations, and then report the most likely vectors in sorted order. But there are 3^r possible vectors for r root-cause nodes; how can we search through all of them?

19 Ferret Algorithm IMPORTANT OBSERVATION: it is very likely that at any point in time only a few root-cause nodes are troubled or down. The Ferret algorithm evaluates only the vectors with no more than k nodes troubled or down, so it processes at most (2r)^k vectors.
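
A sketch of this pruned search (the names and the scoring interface are assumptions; the score itself is described on the following slides):

```python
from itertools import combinations, product

def candidate_vectors(root_causes, k):
    """Yield assignment vectors with at most k root causes not up.

    For r root causes this yields at most on the order of (2r)^k vectors,
    instead of all 3^r.
    """
    yield {rc: "up" for rc in root_causes}            # the all-up (null) vector
    for j in range(1, k + 1):
        for abnormal in combinations(root_causes, j):
            for states in product(("trouble", "down"), repeat=j):
                vector = {rc: "up" for rc in root_causes}
                vector.update(zip(abnormal, states))
                yield vector

def ferret(root_causes, score, k=4):
    """Rank candidate vectors by how well they explain the observations."""
    return sorted(candidate_vectors(root_causes, k), key=score, reverse=True)
```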

20 Ferret algorithm The approximation error of the Ferret algorithm becomes vanishingly small for k = 4 and beyond.

21 Ferret Algorithm OBSERVATION: since a root cause is assigned to be up in most assignment vectors, evaluating an assignment vector only requires re-evaluating the states at the descendants of root-cause nodes that are not up. The Ferret algorithm can therefore be sped up further.

22 Ferret Algorithm Score for a given assignment vector: track the history of response times and fit two Gaussian distributions to the data, namely Gaussian_up and Gaussian_troubled. If the observed response time is t and the predicted state of the observation node is (p_up, p_troubled, p_down), then the score contributed by this observation is: p_up * Prob(t | Gaussian_up) + p_troubled * Prob(t | Gaussian_troubled).
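
A sketch of this scoring step, assuming the up and troubled response-time distributions have already been fit to historical data (here with scipy.stats.norm; the parameter values are placeholders):

```python
from scipy.stats import norm

# Assumed fits to this observation node's historical response times (seconds).
gaussian_up = norm(loc=0.2, scale=0.05)       # normal responses around 200 ms
gaussian_troubled = norm(loc=2.0, scale=0.5)  # troubled responses around 2 s

def observation_score(t, predicted_state):
    """Score how well a predicted (p_up, p_troubled, p_down) explains time t."""
    p_up, p_troubled, p_down = predicted_state
    # p_down contributes no density term: a down service returns no response.
    return p_up * gaussian_up.pdf(t) + p_troubled * gaussian_troubled.pdf(t)
```

The overall score of an assignment vector aggregates these per-observation scores across all observation nodes.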

23 Ferret Algorithm We use a statistical test to determine whether the prediction is sufficiently meaningful to deserve attention. Null hypothesis: all root causes are up. The prediction is reported only if Score(best prediction) - Score(null hypothesis) is significant.

24 The Sherlock System Step 1: discover service-level dependencies. Step 2: construct the Inference Graph. Step 3: localize faults using Ferret. Remark: Sherlock requires no changes to routers, applications, or middleware used in the enterprise.

25 Discovering Service-Level Dependencies OBSERVATION: if accessing service B depends on service A, then packets exchanged with A and B are likely to co-occur. Dependency probability: the conditional probability of accessing service A within the dependency interval prior to accessing service B. Dependency interval: 10 ms. Chance of co-occurrence: 10 ms / I, where I is the mean interval between accesses. We retain only dependencies whose probability is much greater than the chance of co-occurrence.
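
A rough sketch of this computation for a single client, assuming we already have timestamps of the packets exchanged with each service (thresholds and helper names are illustrative):

```python
DEP_INTERVAL = 0.010  # the 10 ms dependency interval, in seconds

def dependency_probability(accesses_a, accesses_b):
    """P(service A was accessed within DEP_INTERVAL before an access to B)."""
    if not accesses_b:
        return 0.0
    hits = sum(
        any(0 <= t_b - t_a <= DEP_INTERVAL for t_a in accesses_a)
        for t_b in accesses_b
    )
    return hits / len(accesses_b)

def chance_co_occurrence(accesses_a, trace_duration):
    """Probability of seeing an access to A in a 10 ms window by chance alone."""
    mean_interval = trace_duration / max(len(accesses_a), 1)
    return DEP_INTERVAL / mean_interval

# Keep the dependency A -> B only when its probability is much greater than
# chance, e.g. dependency_probability(a, b) > 3 * chance_co_occurrence(a, dur).
```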

26 Aggregating Probabilities across Clients OBSERVATION: many clients in an enterprise network have similar host, software, and network configurations and are likely to have similar dependencies. Aggregation lets us deal better with infrequently used services and with false dependencies.

27 Constructing the Inference Graph A: Create a noisy-max meta-node to represent the service, and create an observation node for each client.

28 Constructing the Inference Graph B: Identify the set of services D_S that the clients depend on when accessing S, and recurse.
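
A sketch of this recursive step (the dependency map and the networkx-style graph interface are assumptions):

```python
def add_service(graph, service, dependencies, visited=None):
    """Step B: add service S's meta-node and, recursively, the services in D_S.

    `dependencies` maps a service to a list of (dependent-on service, weight)
    pairs obtained from service-level dependency discovery.
    """
    visited = visited if visited is not None else set()
    if service in visited:
        return
    visited.add(service)
    graph.add_node(service, kind="noisy-max meta-node")
    for dep, weight in dependencies.get(service, []):
        add_service(graph, dep, dependencies, visited)   # recurse into D_S
        graph.add_edge(dep, service, weight=weight)      # S depends on dep
```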

29 Constructing the Inference Graph C: Create a root-cause node to represent the host on which the service runs, and make this root cause a parent of the meta-node.

30 Constructing the Inference Graph D: Add network topology information to the Inference Graph: add a noisy-max meta-node to represent each path, and a root-cause node to represent each router and link.

31 Constructing the Inference Graph E: Finally, put AT and AD into the Inference Graph. Give each edge connecting AT or AD to an observation node a weight of 0.001, and give the edges between a router and a path meta-node a probability of 0.999.
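
A sketch of step E using the weights given above, assuming a networkx-style graph and the node lists built in the earlier steps:

```python
def add_special_nodes(graph, observation_nodes, path_meta_nodes):
    """Step E: attach the always-troubled (AT) and always-down (AD) root causes."""
    graph.add_node("AT", kind="root-cause")  # always troubled
    graph.add_node("AD", kind="root-cause")  # always down
    for obs in observation_nodes:
        # Tiny weights, so AT/AD only explain what nothing else can explain.
        graph.add_edge("AT", obs, weight=0.001)
        graph.add_edge("AD", obs, weight=0.001)
    for path_node, routers in path_meta_nodes.items():
        for router in routers:
            # Routers almost always forward correctly, hence a weight of 0.999.
            graph.add_edge(router, path_node, weight=0.999)
```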

32 Implementation The Sherlock Agent is implemented as a user-level daemon on Windows XP with a pcap-based sniffer.

33 Implementation Centralized Inference Engine: aggregates information easily and scales well; even for an extremely large network (10^5 agents) the required bandwidth is about 10 Mbps.

34 Implementation The computational complexity of fault localization grows linearly with graph size.

35 Evaluation We deployed the Sherlock system in our enterprise network (40 servers, 34 routers, 54 IP links, 2 LANs) for 3 weeks. Agents periodically send requests to the web and file servers to mimic user behavior. Controlled environments: testbed and simulation.

36 Evaluation – Discovering Service Dependencies We evaluate Sherlock’s algorithm for discovering service-level dependency graphs with two examples: a web server and a file server. In the graphs, arrows point from the server that provides a service to the server that relies on it, and edge weights represent the strength of the dependencies.

37 Evaluation – Discovering Service Dependencies Service-level dependency graphs for visiting the web portal and the sales website: clients depend on name-lookup servers to access the websites, and both websites share substantial portions of their back-end dependencies.

38 Evaluation – Discovering Service Dependencies Service-level dependency graph for visiting the file server: it turns out the file server is the root name of a distributed file system.

39 Evaluation – Discovering Service Dependencies Summary: service-level dependencies vary widely; similar dependencies can have different strengths; and dependencies change over time.

40 Evaluation – Discovering Service Dependencies Number of samples: the dependency estimates converge after about 4,000 samples. Once converged, service-level dependencies are stable over several days to weeks.

41 Evaluation – Discovering Service Dependencies Number of clients: Sherlock aggregates samples from multiple clients; aggregating about 20 clients greatly reduces false dependencies.

42 Evaluation – Localizing Faults We evaluate Ferret’s ability to localize faults and its sensitivity to errors.

43 Evaluation – Localizing Faults Testbed: three web servers, an SQL server, and an authentication server. Failed or overloaded components: an overloaded link drops 5% of packets; an overloaded server has high CPU and disk utilization.

44 Evaluation – Inference Graph Service-level dependencies appear as the arrows at the bottom level.

45 Features of Ferret Localizes faults by correlating observations from multiple vantage points. No threshold setting is needed for clients’ observations of the web server’s performance. Uses a probabilistic correlation algorithm for reasoning and produces a list of suspects for the root cause. Handles multiple simultaneous failures.

46 Fault Localization Capability Affected by the number of vantage points and the number of observations available in a time window; ideally, observations from the same time period exercise all paths in the Inference Graph.

47 Fault Localization Capability In a time window of 10 seconds, at least 70 unique clients access every one of the top 20 servers, so the top 20 servers alone provide enough observations to localize faults.

48 Evaluation - Sherlock Field Deployment The resulting Inference Graph contains 2,565 nodes and 358 components that can fail independently. Experiment: running the Sherlock system over 120 hours found 1,029 performance problems.

49 Evaluation - Sherlock Field Deployment The results are plotted with time on the X-axis and one component per row on the Y-axis.

50 Evaluation - Sherlock Field Deployment Ferret’s computation shows that 87% of the problems are caused by just 16 components, for instance Server1, Server2, and the link on R1.

51 Features of Sherlock Sherlock discovers problems overlooked by traditional threshold-based techniques: requests requiring interaction between the web server and the SQL backend experienced poor performance; no threshold-based alerts were raised, but Sherlock caught it.

52 Features of Sherlock Sherlock discovers problems overlooked by traditional threshold-based techniques: there is no threshold setting that can catch this problem while avoiding false alarms.

53 Sherlock vs. prior approaches Multi-level inference graph vs. two-level; probabilistic dependencies. The comparison needs a large set of observations for which the actual root cause of the problem is known, hence simulation.

54 Sherlock vs. prior approaches We conduct experiments with simulations over a known topology and Inference Graph: randomly set the state of each root cause and repeat 1,000 times to produce 1,000 sets of observations, then compare the different techniques’ ability to identify the correct root cause given these observations. Sherlock finds up to 32% more faults than Shrink and SCORE.

55 Sherlock vs. prior approaches Comparing the different techniques’ ability to identify the correct root cause over the 1,000 sets of observations, Sherlock finds up to 32% more faults than Shrink and SCORE.

56 Sherlock vs. prior approaches Shrink uses a two-level inference graph; SCORE uses minimal set cover, in which a dependency either exists or does not.

57 Time to Localize Faults Experiment: an AMD 1.8 GHz machine with 1.5 GB RAM and an Inference Graph with 500,000 nodes, 2,300 clients, and 70 servers. Ferret takes 24 minutes to localize an injected fault.

58 Time to Localize Faults Running time: Ferret always takes less than 4 ms times the number of nodes in the Inference Graph, and Ferret is easily parallelizable.

59 Impact of Errors in the Inference Graph Errors are unavoidable when constructing inference graphs; how sensitive is Ferret to these errors? Experiment: 1. randomly add new parents to a node; 2. randomly swap one of its parents with another node; 3. randomly change weights on edges; 4. randomly add an extra hop to a path or permute its intermediate hops. The first three types model errors in the dependencies; the last type models errors in traceroute.

60 Impact of Errors in the Inference Graph From the observations (each point represents the average of 1,000 experiments): when half of the paths/nodes/weights are perturbed, Ferret still correctly localizes faults in 74.3% of the cases; perturbing the paths has the most impact.

61 Modeling Redundancy Techniques Load balancing and redundancy are modeled using specialized meta-nodes. Experiment: create 24 failure scenarios in which the root cause is a component connected to a specialized meta-node, and test Ferret using Inference Graphs with specialized meta-nodes versus Inference Graphs with noisy-max meta-nodes instead. Result: in the 14 cases involving a secondary server or backup path, the two kinds of graphs make no difference; in the 10 cases where a primary server or path failed, Ferret identifies all 10 root causes correctly with specialized meta-nodes, but misidentifies 4 root causes without them.

62 Summary of Evaluation Sherlock is able to discover service dependencies in a few hours. Since an enterprise network may contain a huge number of nodes, this justifies the need for an automatic approach. Sherlock effectively identifies performance problems, narrows down the root cause, and is robust to noise compared with other approaches.

63 Discussion – future works Virtual Machines: many enterprises put multiple servers on a single piece of hardware via VMs. All the algorithms in this paper can treat each VM as a separate host.

64 Discussion – future works Software Failures: software failures are not modeled yet; extending the Inference Graph to these common failure modes is our next step.

65 Contributions Main contribution: the Sherlock system, which assists IT admins in troubleshooting. We contribute a multi-level probabilistic inference model, a technique for the automatic construction of the Inference Graph, and an algorithm that uses the Inference Graph to localize the root cause.

66 Thanks! Any questions?

