Challenges in Making Tomography Practical

Challenges in Making Tomography Practical
Yiyi Huang, Georgia Tech Nick Feamster, Georgia Tech Renata Teixeira, LIP6 Christophe Diot, Thomson

Problem Network operators need to detect and isolate faults quickly, before customers complain Plenty of existing alarms SNMP traps Active probes Anomaly detection systems Unfortunately, this set of alarms does not help operators locate and eliminate problems that induce problems on end-to-end paths

Network Tomography to the Rescue
Monitor Targets y x Send end-to-end probes through the network Monitor paths for differences in reachability Infer location of reachability problem from these differences

Some Problems Scalability vs. speed: Detection must be fast
Ambiguity: Losses are one-way but don’t always have access to both ends of the path Lack of synchronization: Different monitors see different conditions Dynamics: Topology can change, loss can be transient

Doppler: Making Tomography Practical
Fast, scalable detection Solution: Monitor selection algorithm to reduce the number of monitors and targets so that “cycle times” are fast Transient packet loss Solution: Triggered confirmation of failed paths One-way losses Solution: New algorithm based on IP spoofing Dynamic routing Solution: Periodic snapshots of the network topology Controlled evaluation on VINI, plus limited wide-area experiments.

Fast, Scalable Detection
Select monitors, targets to satisfy two conditions All interfaces are “covered” (or diagnosable) The number of monitors is small enough to ensure a short round time Two goals Coverage: When a failure occurs, system detects it Every interface is covered by at least one path Diagnosability: When a failure occurs, system locates it Every interface is covered by a unique set of paths

Offline Path Selection: Diagnosability
Step 1: Compute the set of paths that cover all interfaces (greedy set cover heuristic) Step 2: Compute hitting set for each interface Step 3: Build equivalence classes for interfaces with common hitting set For each interface in a set with more than one interface, find path that crosses only that interface

Detection, Confirmation, Correlation
Periodic (once per 5 minutes) topology snapshot from all monitors to all destinations keeps track of underlying topology before the failure Detection: Periodic probes (once per “cycle time”) detect failure Confirmation: When a probe is lost, the monitor sends three additional probes. If all three are lost, path is determined to have failed. Correlation: Paths that fail within 10 seconds of one another are grouped.

Disambiguating One-Way Losses: Spoofing
Monitor sends request to spoofer to send probe Probe has IP address of the monitor If reply reaches the monitor, reverse path is working T M Spoofer: Send spoofed packet with source address of M

Identification: NetDiagnoser
Binary network tomography algorithm [Dhamdhere et al.] Input: hosts, destinations, topology before the failure Output: Set of possible locations for the fault

Evaluation of Detection Algorithms
Controlled experiments on the VINI testbed Emulated copy of Abilene network on wide-area paths Probing strategy emulates the paths that would be probed in monitor selection algorithm Compare reduced set of paths to “aggressive” measurement approach Varied failure location and duration Duration varied from 5 to 80 seconds Test repeated for each failed link Measure detection and false alarm rates Preliminary experiments using data from real-world networks

Detection: Scale and Speed
Compute reduction in the number of paths required to achieve coverage and diagnosability Reduction from about 27,000 paths to 151 paths For real-world networks, compute corresponding reduction in cycle time Reduction from aout 3.5 minutes to < 5 seconds

Single-Link Failures More selective probing identifies more of the shorter link failures (due to shorter cycle time) Also results in fewer false alarms

Single-Node Failures Similar results to single-link failures
Selective measurements result in faster detection, fewer false alarms

Does Failure Confirmation Reduce the Total Number of Alarms?
Confirmation reduces the number of failures by > 35% Correlation further reduces the number of alarms (by about a factor of 10)

How Quickly can Doppler Identify Failures?
Answer: Roughly 20 seconds using the reduced set of paths Two main components Detection/Confirmation: Time from when failure was injected to the time Doppler could detect and confirm the failure Correlation: Time to group failures and construct reachability matrix

Detection and Confirmation Delay
Most failures are detected within 3-5 seconds

Correlation Delay Reducing the number of paths to probe significantly reduces total correlation time

Summary Making tomography practical is challenging
Asynchronous measurements Scale and speed Changing topologies Ambiguity about forward and reverse paths Doppler: Set of techniques to address many of these problems Current analysis is still performed offline Many additional challenges remain to coordinate online measurements

Challenges in Making Tomography Practical

Similar presentations

Presentation on theme: "Challenges in Making Tomography Practical"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Challenges in Making Tomography Practical

Similar presentations

Presentation on theme: "Challenges in Making Tomography Practical"— Presentation transcript:

Similar presentations

About project

Feedback