Automatic Network Management: Graphical Models for Fault Location
Ricardo Morla, INESC Porto / FEUP

Motivation: Managing ICT networks
- ICT networks are large and heterogeneous
  - Desktops, servers, applications, services, databases, sensors, routers, links, messages
  - Multiple vendors and goals
- Non-deterministic
  - Exponential back-off, concurrency, user interaction
- Difficult to manage
  - Configuration options
  - No single person with unified view of network
  - Log data
- Expensive to manage
  - Rule of thumb: €1 acquisition :: €2 operations and management

Failures, Faults, and Fault Location
- Failures
  - Request timeout, anomalous values, …
- Faults
  - Hardware fault, software bugs, mis-configurations, …
- Cannot monitor faults, so infer faults from failures
- Single component: trivial to locate fault
- Network: non-trivial
  - May not be able to monitor all failures
  - Failures cause failures in other components

Toy example – Fault location – I
- Simple adding example
  - C1.out = C1.in + 1
  - C2.out = C2.in + 1
  - Cannot read C1.out/C2.in
- What's the faulty component? C1 or C2?
  - C1.in = 10
  - C2.out = 13
[Figure: component C1 feeding component C2]

Toy example – Fault location – I
- What's the faulty component? C1 or C2?
  - C1.in = 10
  - C2.out = 13
- Component behavior
  - C1.out = C1.in + 1 (99.9%)
  - C1.out = C1.in + 2 (0.01%)
  - C2.out = C2.in + 1 (99.9%)
  - C2.out = C2.in + 10 (0.01%)
[Figure: component C1 feeding component C2]
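The reasoning behind this toy example can be sketched by enumerating every combination of component behaviors and keeping only those consistent with the observed C1.in and C2.out. This is a minimal illustration, not the presenter's implementation: the mode names `ok` and `faulty` and the function `fault_posterior` are made up here, and because the percentages on the slide do not add up to 100%, the sketch treats them as unnormalized weights.

```python
from itertools import product

# Behavior models from the slide: mode -> (increment, weight).
# Assumption: the slide's 99.9% / 0.01% are used as unnormalized weights,
# since they do not sum to 100% as written.
C1_MODES = {"ok": (1, 0.999), "faulty": (2, 0.0001)}
C2_MODES = {"ok": (1, 0.999), "faulty": (10, 0.0001)}


def fault_posterior(c1_in, c2_out):
    """Return P(C1 mode, C2 mode | C1.in, C2.out) by exhaustive enumeration."""
    weights = {}
    for (m1, (inc1, p1)), (m2, (inc2, p2)) in product(
        C1_MODES.items(), C2_MODES.items()
    ):
        # C1.out (= C2.in) cannot be read, so it is derived from the modes.
        predicted_out = c1_in + inc1 + inc2
        if predicted_out == c2_out:
            weights[(m1, m2)] = p1 * p2
    total = sum(weights.values())
    if total == 0:
        return {}  # no combination of modes explains the observation
    return {modes: w / total for modes, w in weights.items()}


posterior = fault_posterior(c1_in=10, c2_out=13)
print(posterior)
```

With C1.in = 10 and C2.out = 13, the only consistent combination is a faulty C1 (+2) followed by a healthy C2 (+1), so all posterior mass lands on C1.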

Toy example – Fault location – II
- Message forwarding example
  - Fault: message drop
- Fault Propagation Model
  - Fault in component A
  - Failure in component A
[Figure: fault propagation model over components A, B, C and links A-B, B-C]
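One way to read the message forwarding example is as a chain A, A-B, B, B-C, C in which a message drop at any element makes that element (if it is a node) and every node after it fail to see the message. The sketch below encodes that assumed propagation rule and inverts it to list the single faults consistent with a set of observed failures. The rule itself, and the names `failures_caused_by` and `single_fault_candidates`, are illustrative assumptions rather than the model shown in the original figure.

```python
# Forwarding chain from the example: nodes A, B, C joined by links A-B and B-C.
PATH = ["A", "A-B", "B", "B-C", "C"]
NODES = {"A", "B", "C"}


def failures_caused_by(fault):
    """Assumed propagation rule: a drop at `fault` fails every node at or after it."""
    start = PATH.index(fault)
    return {element for element in PATH[start:] if element in NODES}


def single_fault_candidates(observed_failures):
    """Return the elements whose propagated failures match the observation exactly."""
    observed = set(observed_failures)
    return [fault for fault in PATH if failures_caused_by(fault) == observed]


print(single_fault_candidates({"B", "C"}))
print(single_fault_candidates({"C"}))
```

Under this rule, failures at B and C are explained equally well by a fault in link A-B or in node B, which illustrates why fault location in a network is harder than in a single component.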

The fault location problem in ICT
- Motivating example (I)
  - Fiber-based IP network
  - Faults: fiber/splitter cuts
  - Failures: loss of connectivity between IP nodes
- Map the IP topology (routers, etc.) onto the fiber topology
  - IP links (failures) share fiber faults
  - Shared risk [Kompella05]
  - Find the smallest set of faults that best explains the observed failures
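The search for a smallest explanation can be approximated greedily: repeatedly pick the fiber that accounts for the most still-unexplained IP link failures. The sketch below is a generic greedy set-cover heuristic under the assumption of complete, noise-free observations (a fiber is a candidate only if every IP link riding on it has failed). It is not claimed to reproduce the algorithm of [Kompella05], and the fiber and link names are hypothetical.

```python
def greedy_fault_explanation(shared_risk, failed_links):
    """Greedily choose fibers until the failed IP links are explained.

    shared_risk: dict mapping a fiber name to the IP links that traverse it.
    Returns (chosen_fibers, unexplained_links).
    """
    failed = set(failed_links)
    # Assumption: observations are complete, so a cut fiber fails all its links.
    candidates = {}
    for fiber, links in shared_risk.items():
        links = set(links)
        if links and links <= failed:
            candidates[fiber] = links

    unexplained = set(failed)
    chosen = []
    while unexplained and candidates:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=lambda f: len(candidates[f] & unexplained))
        covered = candidates[best] & unexplained
        if not covered:
            break  # no remaining candidate explains anything new
        chosen.append(best)
        unexplained -= covered
    return chosen, unexplained


# Hypothetical shared-risk mapping for illustration only.
SHARED_RISK = {
    "fiber1": {"R1-R2", "R1-R3"},
    "fiber2": {"R2-R3"},
    "fiber3": {"R1-R3", "R2-R3", "R3-R4"},
}

print(greedy_fault_explanation(SHARED_RISK, {"R1-R2", "R1-R3"}))
```

Greedy set cover does not guarantee the true minimum, but it is a common baseline when exact search over all fault subsets is too expensive.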

The fault location problem in ICT
- Motivating example (II)
  - IMS networks
  - Complex architecture
    - Session Director, SIP Server, Home Subscriber Server
  - Distributed geographically, tree-based
  - Various software and hardware faults
  - Multimedia-specific KPI and failures/alarms
- Codebook approach [Reali09]
  - Minimum set of alarms
  - Robustness against spurious/missing alarms
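The codebook idea can be illustrated with a small decoder: each fault is assigned the alarm pattern it is expected to raise, and an observed pattern is matched to the closest codeword by Hamming distance. This is a generic sketch of the technique, not the scheme of [Reali09]. The fault names borrow the IMS components listed above, while the six alarm positions and the codewords are invented for illustration. The codewords differ pairwise in four positions, so a single spurious or missing alarm still decodes to the right fault.

```python
# Hypothetical codebook: fault -> expected alarm vector (1 = alarm raised).
CODEBOOK = {
    "session_director": (1, 1, 1, 0, 0, 0),
    "sip_server": (0, 0, 1, 1, 1, 0),
    "home_subscriber_server": (1, 0, 0, 0, 1, 1),
}


def hamming(a, b):
    """Number of positions in which two alarm vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(observed, codebook=CODEBOOK):
    """Return (fault, distance) for the codeword closest to the observed alarms."""
    return min(
        ((fault, hamming(code, observed)) for fault, code in codebook.items()),
        key=lambda pair: pair[1],
    )


# One alarm of the session_director pattern is missing.
print(decode((1, 0, 1, 0, 0, 0)))
```

Keeping the alarm set small while preserving a large minimum distance between codewords is the tradeoff between the "minimum set of alarms" and "robustness" bullets.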

The fault location problem in ICT
- Motivating example (III)
  - Enterprise networks [Kandula09]
- Symptoms
  - Intermittent response time from server
  - DB server refusing to start
- Faults
  - Configuration
  - Software bugs
- Difficult to get topology info and dependencies

PGM (probabilistic graphical model) for fault location
- Graphical model encodes P(Fault | Failures)
- What is the most likely set of faults that explains a given set of observed values?
  - Posterior probability
  - Highest P(Fault | Failures), which is proportional to P(Failures | Fault) P(Fault)
- This is hard:
  - topology of PGM
  - detail of probabilistic model
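Written out, the question on this slide is a maximum a posteriori (MAP) query. The notation below is an assumption introduced for illustration: F denotes a candidate set of faults and O the observed failures.

```latex
\hat{F}
  = \arg\max_{F} P(F \mid O)
  = \arg\max_{F} \frac{P(O \mid F)\, P(F)}{P(O)}
  = \arg\max_{F} P(O \mid F)\, P(F)
```

The last step drops P(O) because it does not depend on F. The likelihood P(O | F) comes from the fault propagation structure and the prior P(F) from per-component fault rates, which is where the graph topology and the detail of the probabilistic model enter.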

Challenges
- Define fault location models
  - In addition to the fault propagation model (FPM)
  - Higher model complexity in the PGM
  - Include time functions
- Adequate models of ICT systems for fault location
  - From topology/application domain
  - Automatically from data
  - Hybrid
- Fault location-based system redesign/reconfiguration
  - Performance metrics vs. fault location metrics
  - Tradeoff

Current effort
- Modeling different ICT systems for better fault location
  - Enterprise networks
  - IP Multimedia Subsystems
  - Ambient intelligent environments
  - …
- How:
  - From network topology
  - Directly from data
  - With expert input (a-priori rules)

Concluding Remarks
- ICT systems are increasingly complex
- We must be able to manage them automatically, including locating faults
- Automatic fault location has the potential to cut operations and management costs
- Applicable world-wide, across market domains