Automatic Network Management: Graphical Models for Fault Location
Ricardo Morla, INESC Porto / FEUP
Motivation: Managing ICT networks
- ICT networks are large and heterogeneous
  - Desktops, servers, applications, services, databases, sensors, routers, links, messages
  - Multiple vendors and goals
- Non-deterministic
  - Exponential back-off, concurrency, user interaction
- Difficult to manage
  - Configuration options
  - No single person with a unified view of the network
  - Log data
- Expensive to manage
  - Rule of thumb: €1 acquisition :: €2 operations and management
Failures, Faults, and Fault Location
- Failures
  - Request timeout, anomalous values, …
- Faults
  - Hardware faults, software bugs, misconfigurations, …
- Faults cannot be monitored directly, so they must be inferred from failures
- Single component: trivial to locate the fault
- Network: non-trivial
  - May not be able to monitor all failures
  - Failures cause failures in other components
Toy example – Fault location – I
- Simple adding example
  - C1.out = C1.in + 1
  - C2.out = C2.in + 1
  - Cannot read C1.out / C2.in
- What’s the faulty component?
  - C1.in = 10
  - C2.out = 13
  - C1 or C2?
[Diagram: C1 → C2]
Toy example – Fault location – I
- What’s the faulty component?
  - C1.in = 10
  - C2.out = 13
  - C1.out = C1.in + 1 (99.9%)
  - C1.out = C1.in + 2 (0.01%)
  - C2.out = C2.in + 1 (99.9%)
  - C2.out = C2.in + 10 (0.01%)
  - C1 or C2?
[Diagram: C1 → C2]
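The toy question can be answered by enumerating every combination of component behaviours and keeping those consistent with the observation. The sketch below is a minimal illustration, not part of the original slides. The mode names (`ok`, `faulty`) and the function `posterior` are hypothetical, and because the percentages on the slide do not sum to 100%, the probabilities 0.999 / 0.001 are an assumption.

```python
from itertools import product

# Assumed behaviour models: mode -> (increment, probability).
# The probabilities are illustrative assumptions, not values from the slide.
C1_MODES = {"ok": (1, 0.999), "faulty": (2, 0.001)}
C2_MODES = {"ok": (1, 0.999), "faulty": (10, 0.001)}


def posterior(c1_in, c2_out):
    """Return P(C1 mode, C2 mode | C1.in, C2.out) by exhaustive enumeration."""
    joint = {}
    for (m1, (inc1, p1)), (m2, (inc2, p2)) in product(
        C1_MODES.items(), C2_MODES.items()
    ):
        hidden = c1_in + inc1        # C1.out == C2.in, which cannot be read
        if hidden + inc2 == c2_out:  # keep hypotheses that match the observation
            joint[(m1, m2)] = p1 * p2
    total = sum(joint.values())
    if total == 0:
        return {}                    # no hypothesis explains the observation
    return {modes: p / total for modes, p in joint.items()}


result = posterior(10, 13)
print(result)  # {('faulty', 'ok'): 1.0}
```

With C1.in = 10 and C2.out = 13, only the hypothesis "C1 adds 2, C2 adds 1" reproduces the observed output, so under these assumptions C1 is the faulty component regardless of the exact probabilities.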
Toy example – Fault location – II
- Message forwarding example
- Fault: message drop
- Fault Propagation Model
  - Fault in component A
  - Failure in component A
[Figure: fault propagation model over components A, B, C and links A-B, B-C]
The fault location problem in ICT
Motivating example (I)
- Fiber-based IP network
  - Faults: fiber/splitter cuts
  - Failures: loss of connectivity between IP nodes
- Map the IP topology (routers etc.) onto the fiber topology
  - IP links (failures) share fiber faults
  - Shared risk [Kompella05]
- Smallest set of faults that best explains the observed failures
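The "smallest explanation" idea can be illustrated with an exhaustive search over fiber sets. This is a sketch for illustration only and not the algorithm of [Kompella05]. The link and fiber names in `LINK_FIBERS` and the function `smallest_explanations` are hypothetical, and the sketch assumes that link observations are error-free (a fiber carrying a working link cannot be cut).

```python
from itertools import combinations

# Hypothetical mapping: each IP link -> the fibers (shared risks) it rides on.
LINK_FIBERS = {
    "R1-R2": {"f1", "f2"},
    "R2-R3": {"f2", "f3"},
    "R1-R3": {"f4"},
}


def smallest_explanations(failed_links, working_links):
    """Return all minimum-size fiber sets that explain every failed link."""
    if not failed_links:
        return [set()]  # nothing failed, so no fault is needed
    # Assumption: a fiber used by a working link is healthy.
    healthy = set().union(*(LINK_FIBERS[l] for l in working_links))
    candidates = sorted(
        set().union(*(LINK_FIBERS[l] for l in failed_links)) - healthy
    )
    # Try fiber sets of increasing size and stop at the first size that works.
    for size in range(1, len(candidates) + 1):
        found = [
            set(combo)
            for combo in combinations(candidates, size)
            if all(LINK_FIBERS[l] & set(combo) for l in failed_links)
        ]
        if found:
            return found
    return []  # observations are inconsistent with the topology


explanations = smallest_explanations(["R1-R2", "R2-R3"], ["R1-R3"])
print(explanations)  # [{'f2'}]
```

Here both failed links share fiber f2, so a single cut on f2 is the smallest explanation. Exhaustive search is exponential in the number of candidate fibers, so it only suits small examples.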
The fault location problem in ICT
Motivating example (II)
- IMS networks
  - Complex architecture: Session Director, SIP Server, Home Subscriber Server
  - Geographically distributed, tree-based
  - Various software and hardware faults
  - Multimedia-specific KPIs and failures/alarms
- Codebook approach [Reali09]
  - Minimum set of alarms
  - Robustness against spurious/missing alarms
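A codebook maps each fault to the alarm pattern it is expected to produce, and an observed pattern is decoded to the nearest signature. The sketch below shows that general idea under assumptions: the fault names, the five alarms, and the signatures in `CODEBOOK` are hypothetical and are not taken from [Reali09]. The signatures were chosen so that any two differ in at least three positions, which lets a single spurious or missing alarm be tolerated.

```python
# Hypothetical codebook: fault -> expected alarm vector (1 = alarm raised).
# Any two signatures differ in at least 3 positions.
CODEBOOK = {
    "sip_server_fault": (1, 1, 1, 0, 0),
    "hss_fault": (0, 0, 1, 1, 1),
    "session_director_fault": (1, 0, 0, 1, 0),
}


def hamming(a, b):
    """Number of positions where two alarm vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(observed):
    """Return the fault whose signature is closest to the observed alarms."""
    return min(CODEBOOK, key=lambda fault: hamming(CODEBOOK[fault], observed))


missing_alarm = decode((1, 1, 0, 0, 0))   # third alarm was lost
spurious_alarm = decode((1, 1, 1, 0, 1))  # fifth alarm is spurious
print(missing_alarm, spurious_alarm)
```

Both noisy observations decode to `sip_server_fault`. Using fewer alarms shrinks the codebook but reduces the distance between signatures, which is the tradeoff between a minimum alarm set and robustness.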
The fault location problem in ICT
Motivating example (III)
- Enterprise networks [Kandula09]
- Symptoms
  - Intermittent response time from server
  - DB server refusing to start
- Faults
  - Configuration
  - Software bugs
- Difficult to get topology info and dependencies
Probabilistic graphical models (PGM) for fault location
- Graphical model encodes P(Fault | Failures)
- What’s the most likely set of faults that explains a given set of observed values?
  - Posterior probability: highest P(Fault | Failures)
- This is hard
  - Topology of the PGM
  - Detail of the probabilistic model
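A small worked example of finding the most likely set of faults given observed failures. This is a minimal sketch and not the author's model: it assumes a two-layer fault/failure network with noisy-OR failure nodes, and every name and number (`PRIOR`, `CAUSES`, `LEAK`, faults F1/F2, failures S1..S3) is hypothetical.

```python
from itertools import product
from math import prod

PRIOR = {"F1": 0.01, "F2": 0.05}  # assumed P(fault present)
# Assumed P(fault f triggers failure s | f present).
CAUSES = {
    "S1": {"F1": 0.9},
    "S2": {"F1": 0.8, "F2": 0.7},
    "S3": {"F2": 0.9},
}
LEAK = 0.001  # assumed probability that a failure appears with no fault


def joint(faults, observed):
    """P(faults, observed failures) under the noisy-OR assumption."""
    p = prod(PRIOR[f] if on else 1 - PRIOR[f] for f, on in faults.items())
    for s, seen in observed.items():
        # The failure stays silent only if the leak and every active cause miss.
        p_off = (1 - LEAK) * prod(
            1 - q for f, q in CAUSES[s].items() if faults[f]
        )
        p *= (1 - p_off) if seen else p_off
    return p


def map_faults(observed):
    """Return (most likely fault set, its posterior) by brute force."""
    names = list(PRIOR)
    scores = {}
    for bits in product([False, True], repeat=len(names)):
        hypothesis = dict(zip(names, bits))
        active = tuple(f for f in names if hypothesis[f])
        scores[active] = joint(hypothesis, observed)
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, scores[best] / total


best, post = map_faults({"S1": True, "S2": True, "S3": False})
print(best, round(post, 3))
```

The search visits all 2^n fault combinations, which is fine for two faults but quickly becomes infeasible. That growth is one reason the problem is hard for realistic topologies.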
Challenges
- Define fault location models
  - In addition to the fault propagation model (FPM)
  - Higher model complexity in the PGM
  - Include time functions
- Adequate models of ICT systems for fault location
  - From topology/application domain
  - Automatically from data
  - Hybrid
- Fault location-based system redesign/reconfiguration
  - Performance metrics vs. fault location metrics
  - Tradeoff
Current effort
- Modeling different ICT systems for better fault location
  - Enterprise networks
  - IP Multimedia Subsystems
  - Ambient intelligent environments
  - …
- How
  - From network topology
  - Directly from data
  - With expert input (a-priori rules)
Concluding Remarks
- ICT systems are increasingly complex
- We must be able to manage them automatically, including locating faults
- Automatic fault location has the potential to cut operations and management costs
- Applicable world-wide, across market domains