Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth.

Presentation transcript:

Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, Ming Zhang Microsoft Research Presented by Zhenyu Pan

Introduction: Enterprise Network Service
A network service is an (IP address, port) pair.
Enterprise network:
– the network of a single enterprise;
– traffic does not cross the open Internet;
– user-perceptible service degradations are rampant.

Introduction: Sherlock System
Conventional approach:
– treats each service as either up or down;
– box-centric, blind to the complex set of dependencies between components;
– produces meaningless alerts (15,000 alerts a day, almost universally ignored).
The new approach:
– models service availability as a 3-state value:
   Up: response time is normal;
   Down: requests result in either an error status or no response at all;
   Troubled: response times fall significantly outside the normal range.
– user-centric: does not report problems that do not directly affect users.
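The 3-state model can be illustrated with a small classifier. This is only a sketch: the slide does not say how "significantly outside normal" is measured, so the cutoff used here (mean plus `k` standard deviations of past response times) and the names `classify`, `history_ms`, and `k` are assumptions made for illustration.

```python
from enum import Enum
from statistics import mean, stdev
from typing import Optional, Sequence


class State(Enum):
    UP = "up"
    TROUBLED = "troubled"
    DOWN = "down"


def classify(response_ms: Optional[float], error: bool,
             history_ms: Sequence[float], k: float = 3.0) -> State:
    """Map one observed request to a state.

    response_ms is None when no response arrived at all.
    k is an assumed cutoff: how many standard deviations above the
    historical mean count as "significantly outside" normal.
    """
    if error or response_ms is None:
        return State.DOWN
    threshold = mean(history_ms) + k * stdev(history_ms)
    return State.TROUBLED if response_ms > threshold else State.UP
```

For example, with a history of roughly 10 ms responses, a 50 ms response would be classified as troubled, while a timeout or an error status would be classified as down.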

Introduction: Sherlock System
System components:
– Detects faults and performance problems by monitoring the response time of services.
   Software agents: run on each host, analyze its packets, and determine the set of services the host depends on.
– Determines the set of components that could be responsible: a service, a router, a link, etc.
   Sherlock server: assembles a multi-level, 3-state inference graph that captures the dependencies between all components.
– Localizes the problem to the most likely component.
   Ferret algorithm: localizes faults using the inference graph.
Main contributions:
– Inference graph
– Ferret algorithm

Introduction Inference Graph An example

Inference Graph: Node Types
Root-cause node: a physical component whose failure can cause an end user to experience failures.
– computer, service, router, IP link, etc.;
– two special root-cause nodes, always troubled (AT) and always down (AD), model external factors.
Observation node: an access to a network service whose performance can be measured by Sherlock.
Meta-node: models the dependencies between root causes and observations. Three types of meta-nodes:
– noisy-max
– selector (load balancers)
– failover (failover redundancy)

Inference Graph: Node States
The state of each node:
– a three-tuple (P_up, P_troubled, P_down);
– P stands for probability;
– P_up + P_troubled + P_down = 1.
The state of a root-cause node is independent of any other node.
The state of an observation node is uniquely determined by the states of its ancestors.

Inference Graph: Edges
An edge from node A to node B:
– encodes the dependency that node A has to be in the up state for node B to be up.
– Example: a client cannot retrieve a file from a file server if the path to that file server is down. It might still be able to retrieve the file when the DNS server is down, if the file server's name-to-IP-address mapping is in the client's local DNS cache.
– The dependency probability indicates how strong the dependency is.

Inference Graph: Propagation of State
Noisy-max meta-nodes:
– Max: the node takes the worst condition among its parents.
– Noisy: if the weight of a parent's edge is d, then with probability (1-d) the child is not affected by that parent.
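The noisy-max rule can be turned into a short computation. The sketch below is derived from the two statements on this slide, under the added assumption that parents are independent: the child is up only if every parent is either up or ignored (probability 1-d), and the child is down if at least one parent that is not ignored is down. The function name `noisy_max` and the tuple layout are illustrative choices.

```python
from typing import List, Tuple

State = Tuple[float, float, float]  # (p_up, p_troubled, p_down)


def noisy_max(parents: List[Tuple[State, float]]) -> State:
    """Combine parent states through a noisy-max meta-node.

    Each entry is (parent_state, d), where d is the weight of the edge
    from that parent. Parents are assumed to be independent.
    """
    p_up = 1.0        # every parent is up, or its influence is dropped
    p_not_down = 1.0  # no parent that counts is down
    for (_up, troubled, down), d in parents:
        p_up *= 1.0 - d * (troubled + down)
        p_not_down *= 1.0 - d * down
    p_down = 1.0 - p_not_down
    p_troubled = 1.0 - p_up - p_down
    return (p_up, p_troubled, p_down)
```

The loop visits each parent once, so the cost is linear in the number of parents.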

Inference Graph: Propagation of State
Selector meta-nodes:
– Used to model load balancing.
   NLB: Network Load Balancer.
   ECMP (equal-cost multipath): routers send packets to a destination along several paths.
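One way to picture a selector meta-node is as a mixture over its parents. The sketch below assumes that each edge weight is the share of requests the load balancer sends to that parent and that the child simply takes the state of whichever parent is selected; the slide does not spell this out, so treat both the interpretation and the name `selector` as assumptions.

```python
from typing import List, Tuple

State = Tuple[float, float, float]  # (p_up, p_troubled, p_down)


def selector(parents: List[Tuple[State, float]]) -> State:
    """Mix parent states by their selection weights.

    Each entry is (parent_state, weight); weights are normalized here,
    so they only need to be positive and proportional to traffic share.
    """
    total = sum(weight for _, weight in parents)
    if total <= 0:
        raise ValueError("selection weights must sum to a positive value")
    mixed = [0.0, 0.0, 0.0]
    for state, weight in parents:
        for i in range(3):
            mixed[i] += (weight / total) * state[i]
    return (mixed[0], mixed[1], mixed[2])
```

With two equally weighted servers, one up and one down, the child comes out as half up and half down.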

Inference Graph: Propagation of State
Failover meta-nodes:
– Failover: clients access the primary server and fail over to a backup server when the primary is inaccessible.
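A simplified reading of the failover rule is sketched below. It assumes the client sees the primary's state unless the primary is down, in which case it sees the backup's state, and that the two servers fail independently. The actual system may encode failover differently, so the function `failover` is an illustration rather than the authors' definition.

```python
from typing import Tuple

State = Tuple[float, float, float]  # (p_up, p_troubled, p_down)


def failover(primary: State, backup: State) -> State:
    """Child state when the backup is consulted only if the primary is down."""
    p_up, p_troubled, p_down = primary
    b_up, b_troubled, b_down = backup
    return (
        p_up + p_down * b_up,              # primary up, or failed over to a healthy backup
        p_troubled + p_down * b_troubled,  # primary troubled, or backup troubled after failover
        p_down * b_down,                   # both primary and backup down
    )
```

Under these assumptions the child is down only when both servers are down, which is the redundancy the meta-node is meant to capture.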

Inference Graph: Time to Propagate
The majority of nodes with more than one parent are noisy-max meta-nodes.
– For these nodes, computation time is O(n) in the number of parents n.
For selector and failover meta-nodes:
– propagation still needs O(3^n) time;
– however, these two types of meta-nodes have no more than 6 parents.

Inference Graph: Fault Localization
Assignment vector:
– an assignment of state to every root-cause node, in which each node has probability 1 of being either up, troubled, or down.
Goal:
– find the assignment vector that best explains the observations.
Ferret:
– sets the root causes to the states specified in the assignment vector and propagates state probabilities downwards until they reach the observation nodes;
– for each observation node, computes a score based on how well the probabilities in the state of the observation node agree with the statistical evidence.

Inference Graph: Fault Localization
It is infeasible to traverse all possible assignment vectors to find the one with the highest score. Two observations make the search tractable:
– Observation 1: at any point in time, very likely only a few root-cause nodes are troubled or down.
– Observation 2: since a root cause is assigned to be up in most assignment vectors, evaluating an assignment vector only requires re-evaluating the states of the descendants of root-cause nodes that are not up.
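The effect of Observation 1 on the search space can be sketched by enumerating only those assignment vectors in which at most `k` root causes are not up. The generator name `candidate_vectors` and the representation (a dict listing only the non-up root causes, with every omitted node implicitly up) are illustrative assumptions, not the Ferret implementation.

```python
from itertools import combinations, product
from typing import Dict, Iterator, List


def candidate_vectors(root_causes: List[str], k: int) -> Iterator[Dict[str, str]]:
    """Yield assignment vectors with at most k root causes troubled or down.

    Each vector lists only the non-up root causes; all others are up.
    """
    for n_failed in range(k + 1):
        for failed in combinations(root_causes, n_failed):
            for states in product(("troubled", "down"), repeat=n_failed):
                yield dict(zip(failed, states))
```

With 4 root causes and k = 2 this yields 33 candidates instead of the 3^4 = 81 possible vectors, and the gap widens quickly as the graph grows. Because each candidate names only the failed nodes, an evaluator would need to recompute only the descendants of those nodes, which is the point of Observation 2.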

Inference Graph ferret algorithm

Sherlock System: Three-Step Process
Step 1: Service-level dependency graph
– The dependency probability of a host on service A when accessing service B is defined as the probability that the host needs to communicate with service A before it can successfully communicate with service B.
Step 2: Inference graph
Step 3: Fault localization using Ferret
Score for a given assignment vector:
– Track the history of response times and fit two Gaussian distributions to the data, namely Gaussian_up and Gaussian_troubled.
– If the observed response time is t and the predicted state of the observation node is (p_up, p_troubled, p_down), then the score of this vector is:
   p_up * Prob(t | Gaussian_up) + p_troubled * Prob(t | Gaussian_troubled)
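The scoring formula on this slide can be written out as follows. The sketch interprets Prob(t | Gaussian) as the Gaussian probability density at t, which is an assumption, and uses the standard-library `statistics.NormalDist`; the example means and standard deviations are made up for illustration.

```python
from statistics import NormalDist
from typing import Tuple

State = Tuple[float, float, float]  # (p_up, p_troubled, p_down)


def observation_score(t: float, predicted: State,
                      gaussian_up: NormalDist,
                      gaussian_troubled: NormalDist) -> float:
    """Score one observation node for a given assignment vector.

    t is the observed response time; predicted is the node state obtained
    by propagating the assignment vector down the inference graph.
    """
    p_up, p_troubled, _p_down = predicted
    return p_up * gaussian_up.pdf(t) + p_troubled * gaussian_troubled.pdf(t)


# Hypothetical fitted distributions (milliseconds).
up = NormalDist(mu=10.0, sigma=2.0)
troubled = NormalDist(mu=100.0, sigma=20.0)
```

A fast response (t = 10) scores higher under a vector that predicts the node is up, and a slow response (t = 100) scores higher under a vector that predicts it is troubled.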

Sherlock System: Discovering Service-Level Dependencies
Observation: if accessing service B depends on service A, then packets exchanged with A and B are likely to co-occur.
Dependency probability: the conditional probability of accessing service A within the dependency interval prior to accessing service B.
– Dependency interval: 10 ms
Chance of co-occurrence: first compute the average interval, I, between accesses to the same service, and estimate the likelihood of chance co-occurrence as (10 ms)/I. Only dependencies whose dependency probability is much greater than the likelihood of chance co-occurrence are retained.
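The co-occurrence test can be sketched from access timestamps alone. The code below assumes timestamps in seconds, counts an access to B as dependent on A if some access to A falls in the 10 ms window just before it, and computes the average interval I over accesses to service A; the slide does not say which service I refers to, so that choice and the function names are assumptions.

```python
from bisect import bisect_left
from typing import Sequence


def dependency_probability(a_times: Sequence[float], b_times: Sequence[float],
                           interval: float = 0.010) -> float:
    """Fraction of accesses to B preceded by an access to A within `interval`."""
    if not b_times:
        return 0.0
    a_sorted = sorted(a_times)
    hits = 0
    for tb in b_times:
        lo = bisect_left(a_sorted, tb - interval)  # first A access inside the window
        hi = bisect_left(a_sorted, tb)             # first A access at or after tb
        if hi > lo:
            hits += 1
    return hits / len(b_times)


def chance_cooccurrence(a_times: Sequence[float], interval: float = 0.010) -> float:
    """Estimate (interval / I), with I the mean gap between accesses to A."""
    a_sorted = sorted(a_times)
    gaps = [later - earlier for earlier, later in zip(a_sorted, a_sorted[1:])]
    if not gaps:
        return 0.0
    mean_gap = sum(gaps) / len(gaps)
    return 1.0 if mean_gap <= 0 else min(1.0, interval / mean_gap)
```

A dependency would be retained only when `dependency_probability` is much larger than `chance_cooccurrence`; how much larger is a tuning choice that this slide does not specify.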

Sherlock System: Constructing the Inference Graph
To build the graph, Sherlock:
– creates a noisy-max meta-node to represent the service;
– creates an observation node and makes the service meta-node a parent of the observation node;
– examines the service dependency information recursively;
– creates a root-cause node to represent the host on which the service runs and makes this root cause a parent of the meta-node;
– adds network topology information from traceroute results: for each path between hosts, it adds a noisy-max meta-node to represent the path and root-cause nodes to represent every router and link on the path, and makes each of these root causes a parent of the path meta-node;
– adds the AT and AD nodes, gives each edge connecting AT/AD to an observation node a weight, and gives each edge between a router and a path meta-node a probability.
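The construction steps can be sketched as a function that builds a parent list for every node. Everything about the representation is an assumption made for illustration: node names are plain strings with `obs:`, `meta:`, `path:`, and `root:` prefixes, the inputs `host_of`, `deps`, and `paths` stand in for the agents' dependency data and traceroute results, and the weights `hop_prob` and `external_weight` are left as parameters because the slide does not give their values. Edges not mentioned on the slide are given weight 1.0 as a placeholder.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, float]  # (parent node name, dependency probability)


def build_inference_graph(
    client: str,
    service: str,
    host_of: Dict[str, str],                     # service -> host it runs on
    deps: Dict[str, List[Tuple[str, float]]],    # service -> [(dependency, probability)]
    paths: Dict[Tuple[str, str], List[str]],     # (client, host) -> routers/links on the path
    hop_prob: float,
    external_weight: float,
) -> Dict[str, List[Edge]]:
    parents: Dict[str, List[Edge]] = defaultdict(list)

    def add_service(svc: str) -> str:
        meta = f"meta:{client}->{svc}"
        if meta in parents:  # already expanded; also guards against cycles
            return meta
        host = host_of[svc]
        parents[meta].append((f"root:{host}", 1.0))
        path_meta = f"path:{client}->{host}"
        if path_meta not in parents:
            for hop in paths[(client, host)]:
                parents[path_meta].append((f"root:{hop}", hop_prob))
        parents[meta].append((path_meta, 1.0))
        for dep, prob in deps.get(svc, []):  # recurse into service dependencies
            parents[meta].append((add_service(dep), prob))
        return meta

    obs = f"obs:{client}->{service}"
    parents[obs].append((add_service(service), 1.0))
    parents[obs].append(("root:AT", external_weight))
    parents[obs].append(("root:AD", external_weight))
    return dict(parents)
```

Storing parents per node keeps downward propagation simple: each meta-node can be evaluated once all of its parents have states.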

Implementation Agent

Implementation Service Dependency Graphs

Implementation Dependency Probabilities

Implementation Test Bed

Evaluation Root Cause

Evaluation Performance Comparison

Evaluation Error Influence

Summary
Main contribution: the Sherlock system
– assists IT administrators with troubleshooting;
– a multi-level probabilistic inference model;
– automatic construction of the inference graph;
– an algorithm to localize the root cause.