Presentation on theme: "Challenges in Making Tomography Practical"— Presentation transcript:
1 Challenges in Making Tomography Practical. Yiyi Huang, Georgia Tech; Nick Feamster, Georgia Tech; Renata Teixeira, LIP6; Christophe Diot, Thomson.
2 Problem. Network operators need to detect and isolate faults quickly, before customers complain. Plenty of alarms already exist: SNMP traps, active probes, and anomaly detection systems. Unfortunately, this set of alarms does not help operators locate and eliminate the faults that cause problems on end-to-end paths.
3 Network Tomography to the Rescue. Send end-to-end probes through the network. Monitor paths for differences in reachability. Infer the location of a reachability problem from these differences. (Figure labels: Monitor; Targets x and y.)
4 Some Problems. Scalability vs. speed: detection must be fast. Ambiguity: losses are one-way, but operators do not always have access to both ends of the path. Lack of synchronization: different monitors see different conditions. Dynamics: the topology can change, and loss can be transient.
5 Doppler: Making Tomography Practical. Fast, scalable detection. Solution: a monitor selection algorithm that reduces the number of monitors and targets so that “cycle times” are fast. Transient packet loss. Solution: triggered confirmation of failed paths. One-way losses. Solution: a new algorithm based on IP spoofing. Dynamic routing. Solution: periodic snapshots of the network topology. Controlled evaluation on VINI, plus limited wide-area experiments.
6 Fast, Scalable Detection. Select monitors and targets to satisfy two conditions: (1) all interfaces are “covered” (or diagnosable), and (2) the number of monitors is small enough to ensure a short round time. Two goals. Coverage: when a failure occurs, the system detects it; every interface is covered by at least one path. Diagnosability: when a failure occurs, the system locates it; every interface is covered by a unique set of paths.
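The two goals on this slide can be stated as simple set checks. The following is a minimal sketch, not Doppler's implementation; the function names and the representation of a path as a list of interface names are assumptions made for illustration.

```python
from collections import defaultdict


def path_signatures(paths):
    """Map each interface to the set of path ids that traverse it.

    `paths` is a hypothetical dict: path id -> list of interface names.
    """
    sig = defaultdict(set)
    for pid, interfaces in paths.items():
        for iface in interfaces:
            sig[iface].add(pid)
    return sig


def is_covered(paths, interfaces):
    """Coverage: every interface lies on at least one probed path."""
    sig = path_signatures(paths)
    return all(sig.get(iface) for iface in interfaces)


def is_diagnosable(paths, interfaces):
    """Diagnosability: every interface has a non-empty, unique set of paths."""
    sig = path_signatures(paths)
    seen = set()
    for iface in interfaces:
        key = frozenset(sig.get(iface, ()))
        if not key or key in seen:
            return False
        seen.add(key)
    return True


# Two paths sharing interface "b": all three interfaces get distinct signatures.
good = {"p1": ["a", "b"], "p2": ["b", "c"]}
# A single path: "a" and "b" are covered but indistinguishable.
weak = {"p1": ["a", "b"]}
```

With `weak`, a failure on `a` and a failure on `b` produce the same observation (only `p1` fails), which is why coverage alone does not locate the fault.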
7 Offline Path Selection: Diagnosability. Step 1: Compute the set of paths that cover all interfaces (greedy set cover heuristic). Step 2: Compute the hitting set for each interface. Step 3: Build equivalence classes of interfaces with a common hitting set. For each interface in a class with more than one interface, find a path that crosses only that interface.
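The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the "hitting set" of an interface is taken to mean the set of already selected paths that traverse it, and `candidates`, `greedy_cover`, `equivalence_classes`, and `select_paths` are hypothetical names.

```python
from collections import defaultdict


def greedy_cover(candidates, interfaces):
    """Step 1: greedy set cover; repeatedly pick the path covering the most uncovered interfaces."""
    uncovered = set(interfaces)
    chosen = []
    while uncovered:
        best = max(sorted(candidates), key=lambda p: len(uncovered & set(candidates[p])))
        gained = uncovered & set(candidates[best])
        if not gained:
            break  # the remaining interfaces are on no candidate path
        chosen.append(best)
        uncovered -= gained
    return chosen


def equivalence_classes(chosen, candidates, interfaces):
    """Steps 2-3: group interfaces that are hit by exactly the same chosen paths."""
    classes = defaultdict(list)
    for iface in interfaces:
        hit = frozenset(p for p in chosen if iface in candidates[p])
        classes[hit].append(iface)
    return list(classes.values())


def select_paths(candidates, interfaces):
    """Cover all interfaces, then add paths that split ambiguous classes."""
    chosen = greedy_cover(candidates, interfaces)
    for group in equivalence_classes(chosen, candidates, interfaces):
        if len(group) < 2:
            continue  # this interface is already uniquely identified
        for iface in group:
            others = set(group) - {iface}
            for p in sorted(candidates):
                crosses_only_iface = iface in candidates[p] and not others & set(candidates[p])
                if p not in chosen and crosses_only_iface:
                    chosen.append(p)
                    break
    return chosen


candidates = {"p1": ["a", "b", "c"], "p2": ["a"], "p3": ["b"], "p4": ["c", "d"]}
selected = select_paths(candidates, ["a", "b", "c", "d"])
```

Here `p1` and `p4` cover everything, but `a` and `b` are both hit only by `p1`, so the sketch adds `p2` and `p3` to separate them. It makes a single pass over the classes; a fuller version would recheck diagnosability after adding paths.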
8 Detection, Confirmation, Correlation. A periodic topology snapshot (once per 5 minutes) from all monitors to all destinations keeps track of the underlying topology before the failure. Detection: periodic probes (once per “cycle time”) detect failures. Confirmation: when a probe is lost, the monitor sends three additional probes; if all three are lost, the path is determined to have failed. Correlation: paths that fail within 10 seconds of one another are grouped.
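The confirmation and correlation rules on this slide can be sketched in a few lines. This is an illustrative sketch only: `send_probe` is a hypothetical callable that returns True when a reply arrives, and "within 10 seconds of one another" is interpreted here as a gap of at most 10 seconds from the previous failure in the group, which is an assumption.

```python
def confirm_failure(send_probe, path, retries=3):
    """Declare the path failed only if all follow-up probes are lost."""
    return all(not send_probe(path) for _ in range(retries))


def correlate(failures, window=10.0):
    """Group (timestamp_seconds, path) failures whose gaps are at most `window`."""
    groups = []
    last_time = None
    for ts, path in sorted(failures):
        if last_time is not None and ts - last_time <= window:
            groups[-1].append(path)  # close enough to the previous failure
        else:
            groups.append([path])  # start a new group
        last_time = ts
    return groups


events = [(0.0, "A"), (4.0, "B"), (13.0, "C"), (30.0, "D")]
grouped = correlate(events)
```

In this example the failures on A, B, and C chain together because each gap is under 10 seconds, while D starts a new group.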
9 Disambiguating One-Way Losses: Spoofing. The monitor asks a spoofer to send a probe. The probe carries the monitor’s IP address as its source address. If the reply reaches the monitor, the reverse path is working. (Figure labels: T, M, Spoofer. Spoofer: send spoofed packet with source address of M.)
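The inference behind the spoofing test can be written as a small decision function. This is a sketch of the reasoning only and sends no packets; `Verdict` and `classify_loss` are hypothetical names, and the `UNRESOLVED` branch is an assumption, since the slide states only what a received reply implies.

```python
from enum import Enum


class Verdict(Enum):
    PATH_OK = "path ok"
    FORWARD_FAILURE = "forward path (monitor to target) failed"
    UNRESOLVED = "reverse path or spoofer path suspect"


def classify_loss(direct_reply: bool, spoofed_reply: bool) -> Verdict:
    """Classify a loss between monitor M and target T.

    direct_reply:  M's own probe to T was answered.
    spoofed_reply: T's reply to the spoofer's probe (source address M) reached M.
    """
    if direct_reply:
        return Verdict.PATH_OK
    if spoofed_reply:
        # T's reply reached M, so the reverse path works; the loss must be
        # on the forward path from M to T.
        return Verdict.FORWARD_FAILURE
    # No reply at all: the reverse path may be down, or the spoofed probe
    # may never have reached T. This sketch does not try to tell them apart.
    return Verdict.UNRESOLVED
```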
10 Identification: NetDiagnoser. Binary network tomography algorithm [Dhamdhere et al.]. Input: hosts, destinations, and the topology before the failure. Output: the set of possible locations for the fault.
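To make the input and output concrete, here is a generic binary tomography sketch. It is not NetDiagnoser itself, whose details are not given on the slide: it clears every link that lies on a working path, then greedily picks the links that explain the most failed paths. The function name and the path-to-links dictionary are assumptions.

```python
from collections import Counter


def candidate_links(paths, failed):
    """Return a set of links that could explain the failed paths.

    paths:  dict of path id -> list of links (topology before the failure)
    failed: set of path ids observed as unreachable
    """
    # A link on any working path is assumed to be healthy.
    working_links = set()
    for pid, links in paths.items():
        if pid not in failed:
            working_links.update(links)

    # For each failed path, keep only the links that are still suspect.
    pending = {}
    for pid in failed:
        suspects = set(paths[pid]) - working_links
        if suspects:
            pending[pid] = suspects

    hypothesis = set()
    while pending:
        counts = Counter(link for suspects in pending.values() for link in suspects)
        top = max(counts.values())
        best = {link for link, n in counts.items() if n == top}
        hypothesis |= best  # keep ties: the output is a set of possible locations
        pending = {pid: s for pid, s in pending.items() if not s & best}
    return hypothesis


paths = {
    "m1-t1": ["l1", "l2"],
    "m1-t2": ["l1", "l3"],
    "m2-t1": ["l4", "l2"],
}
suspects = candidate_links(paths, {"m1-t1", "m2-t1"})
```

In the example, `l1` is cleared by the working path `m1-t2`, and `l2` is the only link shared by both failed paths, so it is the single reported candidate.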
11 Evaluation of Detection Algorithms. Controlled experiments on the VINI testbed: an emulated copy of the Abilene network on wide-area paths. The probing strategy emulates the paths that would be probed by the monitor selection algorithm. Compare the reduced set of paths to an “aggressive” measurement approach. Varied failure location and duration: duration varied from 5 to 80 seconds, and the test was repeated for each failed link. Measure detection and false alarm rates. Preliminary experiments using data from real-world networks.
12 Detection: Scale and Speed. Compute the reduction in the number of paths required to achieve coverage and diagnosability: from about 27,000 paths to 151 paths. For real-world networks, compute the corresponding reduction in cycle time: from about 3.5 minutes to less than 5 seconds.
13 Single-Link Failures. More selective probing identifies more of the shorter link failures (due to the shorter cycle time). It also results in fewer false alarms.
14 Single-Node Failures. Similar results to single-link failures. Selective measurements result in faster detection and fewer false alarms.
15 Does Failure Confirmation Reduce the Total Number of Alarms? Confirmation reduces the number of reported failures by more than 35%. Correlation further reduces the number of alarms (by about a factor of 10).
16 How Quickly can Doppler Identify Failures? Answer: roughly 20 seconds using the reduced set of paths. Two main components. Detection/Confirmation: time from when the failure was injected to when Doppler could detect and confirm it. Correlation: time to group failures and construct the reachability matrix.
17 Detection and Confirmation Delay. Most failures are detected within 3-5 seconds.
18 Correlation Delay. Reducing the number of paths to probe significantly reduces total correlation time.
19 Summary. Making tomography practical is challenging: asynchronous measurements, scale and speed, changing topologies, and ambiguity about forward and reverse paths. Doppler: a set of techniques to address many of these problems. Current analysis is still performed offline. Many additional challenges remain in coordinating online measurements.