Presentation on theme: "Internet monitoring is essential"— Presentation transcript:
0 Making Network Tomography Practical. Renata Teixeira, Laboratoire LIP6, CNRS and UPMC Paris Universitas.
1 Internet monitoring is essential. For network operators: monitor service-level agreements; troubleshoot failures; diagnose anomalous behavior. For users or content/application providers: verify network performance.
2 Challenge 1: Nobody controls the end-to-end path. Network operators only have data from one AS. End-hosts can only monitor end-to-end paths. (Figure: a path crossing AS1, AS2, AS3, and AS4.)
3 Challenge 2: Available data is not direct. Network operators ask "Is my network performance good?" but only have per-link counts or active probes; they ask "Is there a problem? Where?" but there may be no alarm. Users and applications ask "Is my provider’s performance good?" but only have end-to-end delay and loss.
4 Network tomography to the rescue. Inference of unknown network properties from measurable ones. Sophisticated inference algorithms: given a model and available measurements, apply statistical inference to estimate properties (maximum likelihood estimator, Bayesian inference). Unfortunately, limited practical deployment: measuring the required inputs is difficult.
5 This tutorial: Monitoring techniques to make network tomography practical.
7 Network tomography problems. Estimation of a network’s traffic matrix: given total traffic on network links, what is the traffic between a network’s entry and exit points? Inference of link performance: given end-to-end probes, what is the loss rate or delay of a link? Inference of network topology: given end-to-end loss measurements, what is the logical network topology?
8 Inference of link performance. What are the properties of network links (loss rate, delay, bandwidth, connectivity), given end-to-end measurements and no access to routers? (Figure: example topology with nodes A to F spanning AS 1 and AS 2.)
9 Multicast-based Inference of Network-internal Characteristics (MINC). Measurements: multicast probes; traces collected at receivers. Inference: exploit correlation in traces to estimate link properties. Introduced by the MINC project. (Figure: a probe sender and probe collectors.)
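10 Inferring link loss rates. Assumptions: known, logical-tree topology; losses are independent; multicast probes. Methodology: maximum likelihood estimates for the link success probabilities α_k. (Figure: monitor m multicasts probes to targets t1 and t2 over links with success probabilities α1, α2, α3; the traces of received probes at t1 and t2 yield the estimated success probabilities α̂1, α̂2, α̂3.)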
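To make the estimation step concrete, here is a minimal sketch for the smallest case only: a tree with one shared link (success probability α1) that branches to two receivers over links α2 and α3. Under the slide's independence assumption, P(t1 receives) = α1·α2, P(t2 receives) = α1·α3, and P(both receive) = α1·α2·α3, so the three values can be recovered from observed frequencies. The function name and trace format are assumptions made for this illustration, not the MINC project's code.

```python
def estimate_success_probabilities(traces):
    """Estimate (a1, a2, a3) for a two-receiver tree.

    traces: list of (r1, r2) pairs, one per multicast probe, where r1/r2 is 1
    if receiver t1/t2 got the probe and 0 otherwise (hypothetical format).
    """
    n = len(traces)
    p1 = sum(r1 for r1, _ in traces) / n        # fraction received at t1
    p2 = sum(r2 for _, r2 in traces) / n        # fraction received at t2
    p12 = sum(r1 * r2 for r1, r2 in traces) / n  # fraction received at both
    if p1 == 0 or p2 == 0 or p12 == 0:
        raise ValueError("not enough received probes to estimate link rates")
    a1 = p1 * p2 / p12   # shared link: (a1*a2)*(a1*a3) / (a1*a2*a3)
    a2 = p12 / p2        # link to t1:  (a1*a2*a3) / (a1*a3)
    a3 = p12 / p1        # link to t2:  (a1*a2*a3) / (a1*a2)
    return a1, a2, a3
```

The shared link is identifiable only because both receivers see the same multicast probe; that correlation is what the later slides try to recover with unicast probes.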
11 Binary tomography. Labels links as good or bad. Loss rate estimation requires tight correlation; instead, separate good from bad performance. If a link is bad, all paths that cross the link are bad. (Figure: tree from m to t1 and t2 over links α1, α2, α3; one path is bad, the other is good.)
12 Single-source tree: “Smallest Consistent Failure Set” algorithm. Assumes a single-source tree and known topology. Find the smallest set of links that explains the bad paths, given that bad links are uncommon. A bad link is the root of a maximal bad subtree. (Figure: tree from m to t1 and t2 with one bad path and one good path.)
13 Binary tomography with multiple sources and targets. Problem becomes NP-hard: it is a minimum hitting set problem, where the hitting set of a link is the set of paths that traverse the link. Iterative greedy heuristic: given the set of links in bad paths, iteratively choose the link that explains the maximum number of bad paths. Promising for fault identification. (Figure: monitors m1, m2 and targets t1, t2.)
14 Practical issues. Topology is often unknown: need to measure accurate topology. Limited deployment of multicast: need to extract correlation from unicast probes, even using probes from different monitors. Control of targets is not always practical: need one-way performance from round-trip probes. Links can fail for some paths, but not all: need to extend tomography algorithms.
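The "root of a maximal bad subtree" rule can be sketched as a single traversal of the tree. This is an illustrative sketch, not the published algorithm's code: the `children` mapping, the `(parent, child)` link representation, and the function name are all assumptions.

```python
def smallest_consistent_failure_set(children, root, bad_leaves):
    """Return the links (parent, child) that head maximal all-bad subtrees.

    children: dict mapping a node to the list of its child nodes
    root: the single source (monitor)
    bad_leaves: set of targets whose path from the root is bad
    """
    bad_links = []

    def all_bad(node):
        kids = children.get(node, [])
        if not kids:                      # leaf: bad if its path is bad
            return node in bad_leaves
        results = [all_bad(kid) for kid in kids]
        if all(results):                  # whole subtree bad: defer to parent
            return True
        # Mixed subtree: every fully bad child subtree is maximal here.
        bad_links.extend((node, kid) for kid, bad in zip(kids, results) if bad)
        return False

    if all_bad(root):
        # Every path is bad: blame the link(s) leaving the root.
        bad_links.extend((root, kid) for kid in children.get(root, []))
    return bad_links
```

For the two-target tree on the slide (m to a branch node, then to t1 and t2), a bad t1 alone blames only the last link to t1, while bad t1 and t2 together blame the single shared link.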
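The iterative greedy heuristic can be sketched as follows. The data layout (a dict from path identifier to the set of links on that path) and the function name are assumptions for illustration. The sketch also drops links that appear on any good path, which follows from the binary assumption that a bad link makes every path crossing it bad.

```python
def greedy_bad_links(paths, bad_paths):
    """Greedy approximation of the minimum set of links explaining bad paths.

    paths: dict mapping a path id to the set of links it traverses
    bad_paths: set of path ids observed as bad
    """
    # Links seen on a good path are assumed good (binary tomography model).
    good_links = set().union(
        *(links for p, links in paths.items() if p not in bad_paths)
    )
    # For each candidate link, the bad paths that traverse it.
    candidates = {}
    for p in bad_paths:
        for link in paths[p] - good_links:
            candidates.setdefault(link, set()).add(p)

    unexplained = set(bad_paths)
    chosen = []
    while unexplained and candidates:
        # Pick the link that explains the most still-unexplained bad paths
        # (sorted() only makes tie-breaking deterministic).
        link = max(sorted(candidates),
                   key=lambda l: len(candidates[l] & unexplained))
        covered = candidates.pop(link) & unexplained
        if not covered:
            break                      # remaining bad paths have no candidate
        chosen.append(link)
        unexplained -= covered
    return chosen
```

Being greedy, this returns a small explanation, not necessarily the minimum one.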
18 Detection techniques. Active probing (ping): send probe and collect response; no control of targets. Passive analysis of user’s traffic: tcpdump taps all incoming and outgoing packets; monitoring of TCP connections.
19 Detection with ping. Monitor m sends a probe (ICMP echo request) to target t. If m receives a reply (ICMP echo reply), then the path is good. If there is no reply before the timeout, then the path is bad.
20 Persistent failure or measurement noise? Many reasons to lose a probe or reply: timeout may be too short; rate limiting at routers; some end-hosts don’t respond to ICMP requests; transient congestion; routing change. Need to confirm that a failure is persistent; otherwise, may trigger false alarms.
21 Failure confirmation. Upon detection of a failure, trigger extra probes. Goal: minimize detection errors by sending more probes and waiting longer between probes. Tradeoff: detection error vs. detection time. (Figure: a loss burst among the packets on a path over time, leading to a detection error.)
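A minimal sketch of the confirmation step, assuming a caller-supplied `probe()` function that returns True when a reply arrives before the timeout. The default of three extra probes and the one-second spacing are hypothetical values, chosen only to show the two knobs the slide mentions.

```python
import time


def confirm_failure(probe, extra_probes=3, wait_seconds=1.0, sleep=time.sleep):
    """Classify a path as "good" or "bad" using confirmation probes.

    probe: callable returning True if a reply arrived before the timeout
    extra_probes: number of additional probes sent after the first loss
    wait_seconds: spacing between confirmation probes
    sleep: injectable for testing (defaults to time.sleep)
    """
    if probe():
        return "good"
    for _ in range(extra_probes):
        sleep(wait_seconds)        # wait so a short loss burst can end
        if probe():
            return "good"          # the loss was transient, not persistent
    return "bad"                   # every confirmation probe was also lost
```

Raising `extra_probes` or `wait_seconds` lowers the chance of a false alarm but delays the verdict by up to `extra_probes * wait_seconds` plus the probe timeouts.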
22 Passive detection. tcpdump captures all packets. Track the status of each TCP connection: RTTs, timeouts, retransmissions. Multiple timeouts indicate the path is bad. If current seq. number > last seq. number seen, the path is good. If current seq. number = last seq. number seen, a timeout has occurred. After four timeouts, declare the path as bad.
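The sequence-number rule can be sketched as a small per-connection tracker. The class and method names are hypothetical. Two simplifications are assumptions of this sketch rather than statements from the slide: the timeout counter resets whenever the connection makes progress, and sequence-number wraparound is ignored.

```python
class ConnectionTracker:
    """Track one TCP connection's outgoing sequence numbers."""

    def __init__(self, max_timeouts=4):
        self.max_timeouts = max_timeouts   # slide: four timeouts => bad
        self.last_seq = None
        self.timeouts = 0

    def observe(self, seq):
        """Feed the sequence number of a captured segment; return path status."""
        if self.last_seq is None or seq > self.last_seq:
            # New data was sent: the connection is making progress.
            self.last_seq = seq
            self.timeouts = 0              # assumption: count consecutive timeouts
        elif seq == self.last_seq:
            # Same segment seen again: treated as a retransmission timeout.
            self.timeouts += 1
        # Older sequence numbers (seq < last_seq) are ignored in this sketch.
        return "bad" if self.timeouts >= self.max_timeouts else "good"
```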
23 Passive vs. active detection. Passive, advantages: no need to inject traffic; detects all failures that affect user’s traffic; responses from targets that don’t respond to ping. Passive, drawbacks: not always possible to tap user’s traffic; only detects failures in paths with traffic. Active, advantages: no need to tap user’s traffic; detects failures in any desired path. Active, drawback: probing overhead, needed to cover a large number of paths and detect failures fast.
24 Active monitoring: reducing probing overhead. Goal: detect failures of any of the interfaces in the target network with minimum probing overhead. (Figure: monitors M1 and M2 probe target hosts, including T3, through a target network with routers A, B, C, D.)
25 Simple solution: Coverage problem. Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber’s network. Coverage problem is NP-hard. Solution: greedy set-cover heuristic.
26 Coverage solution doesn’t detect all types of failures. Detects fail-stop failures: failures that affect all packets that traverse the faulty interface, e.g., interface or router crashes, fiber cuts, bugs. But not path-specific failures: failures that affect only a subset of paths that cross the faulty interface, e.g., router misconfigurations.
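The greedy set-cover heuristic can be sketched as below. The input format (a dict from path name to the set of interfaces it crosses) and the example names are assumptions for illustration.

```python
def select_paths(paths):
    """Greedily pick paths until every interface is covered.

    paths: dict mapping a path id to the set of interfaces it traverses
    Returns the list of selected path ids, in selection order.
    """
    uncovered = set().union(*paths.values())
    selected = []
    while uncovered:
        # Choose the path covering the most still-uncovered interfaces
        # (sorted() only makes tie-breaking deterministic).
        best = max(sorted(paths), key=lambda p: len(paths[p] & uncovered))
        selected.append(best)
        uncovered -= paths[best]
    return selected


# Hypothetical example: three candidate paths over interfaces A to D.
example = {"M1-T1": {"A", "C"}, "M1-T3": {"A", "B", "D"}, "M2-T3": {"B", "D"}}
print(select_paths(example))
```

The loop always terminates because `uncovered` only contains interfaces that some path traverses, so each round covers at least one more.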
27 New formulation of failure detection problem. Select the frequency to probe each path. Lower-frequency per-path probing can achieve high-frequency probing of each interface. (Figure: monitors M1, M2 and targets T1, T2, T3 across routers A, B, C, D, with the labels “1 every 9 mins” and “1 every 3 mins”.)
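The arithmetic behind this slide can be sketched by summing per-path probing rates over the paths that share an interface. The example assumes three paths crossing one interface, which is one reading of the figure's “1 every 9 mins” and “1 every 3 mins” labels; the path and interface names are hypothetical.

```python
from fractions import Fraction


def interface_probe_rates(path_rates, path_interfaces):
    """Aggregate probing rate seen by each interface.

    path_rates: dict path id -> probes per minute on that path
    path_interfaces: dict path id -> set of interfaces the path traverses
    """
    rates = {}
    for path, rate in path_rates.items():
        for iface in path_interfaces[path]:
            rates[iface] = rates.get(iface, Fraction(0)) + rate
    return rates


# Three paths, each probed once every 9 minutes, all crossing interface "A".
path_rates = {"p1": Fraction(1, 9), "p2": Fraction(1, 9), "p3": Fraction(1, 9)}
path_interfaces = {"p1": {"A", "B"}, "p2": {"A", "C"}, "p3": {"A", "D"}}
print(interface_probe_rates(path_rates, path_interfaces)["A"])  # 1/3 per minute
```

The aggregate rate only translates into evenly spaced probes on the shared interface if the per-path probes are staggered in time.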
28 Is the failure in the forward or reverse path? Paths can be asymmetric: load balancing; hot-potato routing. (Figure: the probe from m and its reply follow different paths.)
29 Disambiguating one-way losses: Spoofing. Monitor m asks a spoofer to send a probe. The spoofed probe carries the IP address of the monitor as its source address. If the reply reaches the monitor, the reverse path is good.
30 Summary: Fault detection. Techniques to measure path reachability: active probing (ping + failure confirmation); passive analysis of TCP connections. Reducing overhead of active monitoring: select the set of paths to probe; trade-off between set of paths and probing frequency. No control of targets: only have round-trip measurements; spoofing differentiates forward/reverse failures.
32 Uncorrelated measurements lead to errors. Lack of synchronization leads to inconsistencies: probes cross links at different times; path may change between probes. (Figure: m probes t1 and t2, resulting in a mistakenly inferred failure.)
33 Sources of inconsistencies. In measurements from a single monitor: probing all targets can take time. In measurements from multiple monitors: hard to synchronize monitors for all probes to reach a link at the same time; impossible to generalize to all links.
35 Solution: Reprobe paths after failure. Monitors m1 … mK reprobe targets t1 … tN and record path reachability, e.g., (m1, t1) good, …, (m1, tN) bad; …; (mK, t1) good, …, (mK, tN) bad. Consistency has a cost: delays fault identification; cannot identify short failures.
36 Summary: Correlated measurements. Correlation is essential to tomography: lack of correlation leads to false alarms. Correlation is hard with unicast probes: probing multiple targets takes time; multiple monitors cannot probe a link simultaneously. Solution: probe paths again after fault detection. Trade-off: consistency vs. detection speed.
38 Measuring router topology. With access to routers (or “from inside”): topology of one network; routing monitors (OSPF or IS-IS). No access to routers (or “from outside”): multi-AS topology or from end-hosts; monitors issue active probes (traceroute).
39 Topology from inside. Routing protocols flood the state of each link: periodically refresh link state; report any changes (link down, up, cost change). Monitor listens to link-state messages: acts as a regular router; AT&T’s OSPFmon or Sprint’s PyRT for IS-IS. Combining link states gives the topology: easy to maintain, since messages report any changes.
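40 Inferring a path from outside: traceroute. (Figure: on the actual path from m through routers A and B to t, the probe with TTL = 1 triggers “TTL exceeded” from interface A.1 and the probe with TTL = 2 triggers “TTL exceeded” from interface B.1; the inferred path is m, A.1, B.1, t.)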
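The TTL mechanism can be sketched against a simulated network, so that the example runs without raw sockets or privileges. Both functions and the reply tuples are hypothetical constructs for this illustration; a real tool would send packets and parse ICMP replies.

```python
def make_simulated_network(hops, target):
    """Return a send_probe(ttl) function for a fixed path (simulation only).

    hops: interface addresses that answer with "TTL exceeded", in path order
    target: address of the destination
    """
    def send_probe(ttl):
        if ttl <= len(hops):
            # The router at distance `ttl` decrements TTL to 0 and reports it.
            return ("ttl_exceeded", hops[ttl - 1])
        return ("reached", target)
    return send_probe


def traceroute(send_probe, max_ttl=30):
    """Infer a path by sending probes with TTL = 1, 2, 3, ..."""
    path = []
    for ttl in range(1, max_ttl + 1):
        kind, address = send_probe(ttl)
        path.append(address)       # one inferred hop per TTL value
        if kind == "reached":
            break
    return path


print(traceroute(make_simulated_network(["A.1", "B.1"], "t")))
```

Note that each hop is reported as the interface that answered (A.1, B.1), not as a router, which is why alias resolution is needed later.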
41 A traceroute path can be incomplete. Load balancing is widely used, but traceroute only probes one path. Sometimes traceroute has no answer (stars): ICMP rate limiting; anonymous routers. Tunnelling (e.g., MPLS) may hide routers: routers inside the tunnel may not decrement TTL.
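42 Traceroute under load balancing. (Figure: in the actual topology, load balancer L splits traffic from m to t over two branches, through A and C or through B and D, which rejoin at E. The TTL = 2 and TTL = 3 probes take different branches, so the inferred path has missing nodes and links and contains a false link.)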
43 Errors happen even under per-flow load balancing. Traceroute uses the destination port as identifier. Per-flow load balancers use the destination port as part of the flow identifier. (Figure: the TTL = 2 probe uses port 2 and the TTL = 3 probe uses port 3, so they can follow different branches.)
44 Paris traceroute. Solves the problem with per-flow load balancing. Probes to a destination belong to the same flow. Changes the location of the probe identifier: uses the UDP checksum. (Figure: the TTL = 2 and TTL = 3 probes both use port 1 and differ only in checksum, 2 and 3, so they follow the same branch.)
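The difference between the two probing styles can be shown with a toy simulation. Everything here is an assumption made for illustration: the two branches (L, A, C, E and L, B, D, E), the load balancer modelled as "destination port modulo 2", and the function names. Only the idea is taken from the slides: classic traceroute varies the destination port per probe, while Paris traceroute keeps the flow identifier constant.

```python
# Two equal-cost branches between monitor m and target t (toy topology).
BRANCHES = {
    0: ["L", "A", "C", "E", "t"],
    1: ["L", "B", "D", "E", "t"],
}


def hop_for(ttl, dst_port):
    """Node that answers a probe with this TTL and destination port."""
    branch = BRANCHES[dst_port % 2]   # stand-in for the balancer's flow hash
    return branch[min(ttl, len(branch)) - 1]


def classic_traceroute(base_port=33434, max_ttl=5):
    # The destination port changes with every probe, so the flow id changes.
    return [hop_for(ttl, base_port + ttl) for ttl in range(1, max_ttl + 1)]


def paris_traceroute(port=33434, max_ttl=5):
    # The destination port is fixed, so all probes belong to one flow.
    return [hop_for(ttl, port) for ttl in range(1, max_ttl + 1)]


print(classic_traceroute())  # mixes hops from both branches
print(paris_traceroute())    # follows a single branch end to end
```

In this toy run the classic trace reports A followed by D, a link that does not exist in either branch, while the Paris-style trace reports one consistent branch.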
45 Topology from traceroutes. Inferred nodes = interfaces, not routers. Coverage depends on monitors and targets: misses links and routers; some links and routers appear multiple times. (Figure: actual topology with monitors m1, m2, routers A, B, C, D and targets t1, t2, next to the inferred topology whose nodes are the interfaces A.1, B.3, C.1, C.2, D.1.)
46 Alias resolution: Map interfaces to routers. Direct probing: probe an interface, may receive a response from another; responses from the same router will have close IP identifiers and the same TTL. Record-route IP option: records up to nine IP addresses of routers in the path. (Figure: in the inferred topology with interfaces A.1, B.3, C.1, C.2, D.1, two interfaces are marked as the same router.)
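The "close IP identifiers and same TTL" test can be sketched as a comparison of two responses collected back to back. The response dictionaries, the function name, and the gap threshold of 200 are hypothetical; the only protocol fact relied on is that the IP identification field is 16 bits wide, so the difference is taken modulo 65536.

```python
def likely_aliases(first, second, max_id_gap=200):
    """Heuristic: do two responses look like they came from one router?

    first, second: dicts with the "ip_id" and "ttl" fields of the responses
    to two probes sent back to back (first, then second) to two interfaces.
    max_id_gap: hypothetical threshold for "close" IP identifiers.
    """
    # A router with a shared IP ID counter increments it for each packet it
    # sends, so the second response should be slightly ahead of the first.
    gap = (second["ip_id"] - first["ip_id"]) % 65536   # handle 16-bit wrap
    return gap <= max_id_gap and first["ttl"] == second["ttl"]
```

A single comparison can match by coincidence, so a practical tool would repeat the test before declaring two interfaces aliases.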
47 Large-scale topology measurements. Probing a large topology takes time: e.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads). Probing more targets covers more links, but getting a topology snapshot takes longer. Snapshot may be inaccurate: paths may change during the snapshot. Hard to get an up-to-date topology: to know that a path changed, need to re-probe.
48 Faster topology snapshots. Probing redundancy: intra-monitor; inter-monitor. Doubletree: combines backward and forward probing to eliminate redundancy. (Figure: monitors m1, m2, routers A, B, C, D and targets t1, t2.)
49 Summary of techniques to measure topology. Routing messages: complete and accurate, but need access to routers. Combining traceroutes: anyone can use it, no privileged access to routers, but false or missing links and nodes. Topologies for tomography have some uncertainties: multiple topologies close to the time of an event; multiple paths between a monitor and a target.
51 Open issues. Fault detection: How to detect faults or performance degradations that impact end-users? What is the overhead and speed of large-scale deployments? Will spoofing work in large-scale deployments? Fault identification: How to keep the topology up-to-date for fast identification? Do we need new tomography techniques to cope with partial failures? Could inference be easier with cooperation from routers?
53 Network tomography theory. Survey on network tomography: R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network Tomography: Recent Developments”, Statistical Science, Vol. 19, No. 3, 2004. Traffic matrix estimation: Y. Vardi, “Network Tomography: Estimating Source-Destination Traffic Intensities from Link Data”, Journal of the American Statistical Association, Vol. 91, 1996. Inference of link performance/connectivity (MINC project): A. Adams et al., “The Use of End-to-end Multicast Measurements for Characterizing Internal Network Behavior”, IEEE Communications Magazine, May 2000.
54 Binary tomography. Single-source tree algorithm: N. Duffield, “Network Tomography of Binary Network Performance Characteristics”, IEEE Transactions on Information Theory, 2006. Applying tomography in one network: R. R. Kompella, J. Yates, A. Greenberg, A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007. Applying tomography in multiple network topology: A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, “NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data”, CoNEXT, 2007.
55 Topology from inside. IS-IS monitoring: R. Mortier, “Python Routeing Toolkit (`PyRT')”. OSPF monitoring: A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture, Design and Deployment Experience”, NSDI, 2004. Commercial products: Packet Design:
56 Topology with traceroute. Tracing accurate paths under load-balancing: B. Augustin et al., “Avoiding traceroute anomalies with Paris traceroute”, IMC, 2006. Reducing overhead to trace topology of a network and alias resolution with direct probing: N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies with Rocketfuel”, SIGCOMM, 2002. Use of record route to obtain more accurate topologies: R. Sherwood, A. Bender, N. Spring, “DisCarte: A Disjunctive Internet Cartographer”, SIGCOMM, 2008. Reducing overhead to trace a multi-network topology: B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Efficient Algorithms for Large-Scale Topology Discovery”, SIGMETRICS, 2005.
57 Reducing overhead of active fault detection. Selection of paths to probe: H. Nguyen and P. Thiran, “Active measurement for multiple link failures diagnosis in IP networks”, PAM, 2004; Y. Bejerano and R. Rastogi, “Robust monitoring of link delays and faults in IP networks”, INFOCOM, 2003. Selection of the frequency to probe paths: H. X. Nguyen, R. Teixeira, P. Thiran, and C. Diot, “Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis”, INFOCOM, 2009.
58 Internet-wide fault detection systems. Detection with BGP monitoring plus continuous pings, spoofing to disambiguate one-way failures, traceroute to locate faults: E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, “Studying Black Holes in the Internet with Hubble”, NSDI, 2008. Detection with passive monitoring of traffic of peer-to-peer systems or content distribution networks, traceroutes to locate faults: M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services”, OSDI, 2004.