0 Making Network Tomography Practical
Renata Teixeira, Laboratoire LIP6, CNRS and UPMC Paris Universitas
1 Internet monitoring is essential
For network operators: monitor service-level agreements, troubleshoot failures, diagnose anomalous behavior.
For users or content/application providers: verify network performance.
2 Challenge 1: Nobody controls the end-to-end path
Network operators only have data from one AS, even though a path may cross several (AS1, AS2, AS3, AS4).
End-hosts can only monitor end-to-end paths.
3 Challenge 2: Available data is not direct
Network operators ask: is there a problem? Where? There may be no alarm, and they only have per-link counts or active probes.
Users and applications ask: is my network performance good? Is my provider's performance good? They only have end-to-end delay and loss.
4 Network tomography to the rescue
Inference of unknown network properties from measurable ones.
Sophisticated inference algorithms: given a model and the available measurements, apply statistical inference (maximum likelihood estimation, Bayesian inference) to estimate the properties.
Unfortunately, practical deployment has been limited: measuring the required inputs is difficult.
5 This tutorial: monitoring techniques to make network tomography practical
7 Network tomography problems
Estimation of a network's traffic matrix: given the total traffic on network links, what is the traffic between a network's entry and exit points?
Inference of link performance: given end-to-end probes, what is the loss rate or delay of a link?
Inference of network topology: given end-to-end loss measurements, what is the logical network topology?
8 Inference of link performance
What are the properties of network links: loss rate, delay, bandwidth, connectivity?
Given only end-to-end measurements, with no access to routers.
(Figure: end hosts probing targets across two ASes.)
9 Multicast-based Inference of Network-internal Characteristics
Measurements: multicast probes, with traces collected at the receivers.
Inference: exploit correlation in the traces to estimate link properties.
Introduced by the MINC project.
(Figure: a probe sender multicasting to probe collectors.)
10 Inferring link loss rates
Assumptions: known, logical-tree topology; losses are independent; multicast probes.
Methodology: maximum likelihood estimates for the link success probabilities αk.
(Figure: a two-receiver tree with link success probabilities α1, α2, α3 and their estimates α̂1, α̂2, α̂3.)
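For the two-receiver tree on this slide, the maximum likelihood estimates have a closed form: writing γ1 = P(t1 receives a probe), γ2 = P(t2 receives), and γ12 = P(both receive), the tree structure gives γ1 = α1α2, γ2 = α1α3, γ12 = α1α2α3, hence α̂1 = γ̂1·γ̂2/γ̂12. A minimal Python sketch (the simulator, names, and probe count are illustrative, not from the tutorial):

```python
import random

def simulate_probes(a1, a2, a3, n, rng):
    """Send n multicast probes down a two-receiver tree; link k passes a
    probe independently with probability a_k (a1 is the shared root link)."""
    outcomes = []
    for _ in range(n):
        root = rng.random() < a1
        outcomes.append((root and rng.random() < a2,   # did t1 receive?
                         root and rng.random() < a3))  # did t2 receive?
    return outcomes

def estimate_link_success(outcomes):
    """MINC-style estimator: the correlation between receivers isolates
    the shared link's success probability."""
    n = len(outcomes)
    g1 = sum(r1 for r1, _ in outcomes) / n            # P(t1 receives)
    g2 = sum(r2 for _, r2 in outcomes) / n            # P(t2 receives)
    g12 = sum(r1 and r2 for r1, r2 in outcomes) / n   # P(both receive)
    a1_hat = g1 * g2 / g12                            # shared link
    return a1_hat, g1 / a1_hat, g2 / a1_hat
```

With a few hundred thousand probes, the estimates land within about a percentage point of the true link success probabilities.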
11 Binary tomography
Labels links as good or bad.
Loss rate estimation requires tight correlation; instead, just separate good from bad performance.
If a link is bad, all paths that cross the link are bad.
(Figure: a two-receiver tree with one bad link and the paths it makes bad.)
12 Single-source tree
The "Smallest Consistent Failure Set" algorithm assumes a single-source tree and a known topology.
Find the smallest set of links that explains the bad paths, under the assumption that bad links are uncommon: a bad link is the root of a maximal bad subtree.
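A sketch of this rule in Python (the tree encoding and names are illustrative): walk down from the source and, as soon as every receiver below a link reports a bad path, blame that link and stop descending, since one bad link high up explains the whole subtree.

```python
def scfs(children, root, path_bad):
    """Smallest Consistent Failure Set on a single-source tree.
    children: node -> list of child nodes; path_bad: leaf -> True if the
    end-to-end path to that leaf is bad. Returns the links (parent, child)
    inferred bad, i.e., the roots of maximal bad subtrees."""
    def subtree_bad(node):
        kids = children.get(node, [])
        if not kids:
            return path_bad[node]
        # a subtree is bad only if every path through it is bad
        return all([subtree_bad(k) for k in kids])

    inferred = []
    def walk(node):
        for child in children.get(node, []):
            if subtree_bad(child):
                inferred.append((node, child))  # blame this link, stop here
            else:
                walk(child)                     # descend into good subtrees
    walk(root)
    return inferred
```

For a source s with subtree A (both receivers bad) and subtree B (receiver good), the algorithm blames only the single link s->A rather than the two leaf links.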
13 Binary tomography with multiple sources and targets
The problem becomes NP-hard: it is the minimum hitting set problem, where the hitting set of a link is the set of paths that traverse it.
Iterative greedy heuristic: given the set of links on bad paths, iteratively choose the link that explains the maximum number of bad paths.
Promising for fault identification.
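The greedy heuristic can be sketched as follows, representing each bad path by the set of links it traverses (link and path names are illustrative):

```python
def greedy_fault_localization(bad_paths):
    """Greedy minimum hitting set: repeatedly pick the link that appears
    on the most still-unexplained bad paths, until every bad path is
    explained by at least one suspect link."""
    remaining = [set(p) for p in bad_paths]
    suspects = []
    while remaining:
        counts = {}
        for path in remaining:
            for link in path:
                counts[link] = counts.get(link, 0) + 1
        # deterministic tie-break on the link name
        best = max(counts, key=lambda l: (counts[l], l))
        suspects.append(best)
        remaining = [p for p in remaining if best not in p]
    return suspects
```

For bad paths {l1, l2}, {l1, l3}, and {l4}, the heuristic first blames l1 (it explains two paths) and then l4.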
14 Practical issues
Topology is often unknown: need to measure an accurate topology.
Limited deployment of multicast: need to extract correlation from unicast probes, even using probes from different monitors.
Control of targets is not always practical: need one-way performance from round-trip probes.
Links can fail for some paths, but not all: need to extend the tomography algorithms.
18 Detection techniques
Active probing (ping): send a probe and collect the response; requires no control of targets.
Passive analysis of the user's traffic: tcpdump taps all incoming and outgoing packets; monitoring of TCP connections.
19 Detection with ping
The monitor sends an ICMP echo request to the target and waits for the ICMP echo reply.
If it receives the reply, the path is good; if no reply arrives before the timeout, the path is bad.
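The detection rule fits in a few lines of Python; here send_probe is a hypothetical stand-in for issuing a real ICMP echo (which needs raw sockets and root privileges):

```python
def classify_path(send_probe, timeout=2.0):
    """Ping-style detection: the path is good if a reply arrives before
    the timeout. send_probe() stands in for an ICMP echo request; it
    returns the round-trip time in seconds, or None if no reply came."""
    rtt = send_probe()
    return 'good' if rtt is not None and rtt < timeout else 'bad'
```

The timeout value is a tuning knob: too short, and slow-but-working paths are misclassified as bad.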
20 Persistent failure or measurement noise?
There are many reasons to lose a probe or a reply: the timeout may be too short, rate limiting at routers, some end-hosts don't respond to ICMP requests, transient congestion, a routing change.
Need to confirm that the failure is persistent; otherwise, it may trigger false alarms.
21 Failure confirmation
Upon detection of a failure, trigger extra probes.
Goal: minimize detection errors, by sending more probes and waiting longer between probes (long enough to outlast a loss burst on the path).
Tradeoff: detection error vs. detection time.
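A sketch of the confirmation step, assuming a send_probe() stand-in that returns True when a reply arrives; the number of extra probes and their spacing are the illustrative knobs of the error/time trade-off:

```python
import time

def confirm_failure(send_probe, extra_probes=3, spacing=0.5):
    """After an initial probe loss, re-probe before raising an alarm:
    declare the path bad only if every follow-up probe is also lost.
    spacing (seconds) spreads the probes out to ride over transient
    loss bursts; larger values reduce false alarms but delay detection."""
    for _ in range(extra_probes):
        time.sleep(spacing)
        if send_probe():
            return False   # any reply means the failure is not persistent
    return True            # all follow-up probes lost: confirmed
```

With spacing=0 the confirmation is fast but a short loss burst can still swallow every probe; longer spacing trades detection time for fewer false alarms.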
22 Passive detection
tcpdump captures all packets; track the status of each TCP connection (RTTs, timeouts, retransmissions).
If the current sequence number is greater than the last sequence number seen, the path is good; if it equals the last one seen, a timeout has occurred.
Multiple timeouts indicate the path is bad: after four timeouts, declare the path bad.
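The per-connection rule on this slide can be sketched as one pass over the captured outgoing sequence numbers (assuming, as a simplification not stated on the slide, that forward progress resets the timeout count):

```python
def classify_connection(seq_numbers, max_timeouts=4):
    """Passive TCP-based detection: a repeated sequence number is a
    retransmission after a timeout; after max_timeouts repeats, declare
    the path bad. A larger sequence number means forward progress."""
    timeouts = 0
    last = None
    for seq in seq_numbers:
        if last is not None and seq == last:
            timeouts += 1                  # timeout + retransmission
            if timeouts >= max_timeouts:
                return 'bad'
        elif last is None or seq > last:
            timeouts = 0                   # progress: reset (assumption)
        last = seq
    return 'good'
```

A connection that retransmits the same segment four times is flagged bad, while isolated retransmissions followed by progress are not.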
23 Passive vs. active detection
Passive: no need to inject traffic; detects all failures that affect the user's traffic, even involving targets that don't respond to ping. But it is not always possible to tap the user's traffic, and it only detects failures on paths that carry traffic.
Active: no need to tap the user's traffic; detects failures on any desired path. But probing overhead grows with the need to cover a large number of paths and to detect failures fast.
24 Active monitoring: reducing probing overhead
Goal: detect failures of any of the interfaces in the target network with minimum probing overhead.
(Figure: monitors M1 and M2 probing target hosts across a target network with routers A, B, C, D and target T3.)
25 Simple solution: coverage problem
Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber's network.
The coverage problem is NP-hard; solution: a greedy set-cover heuristic.
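A sketch of the greedy set-cover heuristic (path names and interface sets are illustrative):

```python
def greedy_path_cover(paths):
    """Greedy set cover: repeatedly pick the path that covers the most
    still-uncovered interfaces, until every interface is covered.
    paths maps a path id to the set of interfaces it traverses."""
    universe = set().union(*paths.values())
    covered, chosen = set(), []
    while covered != universe:
        # pick the path with the largest marginal gain;
        # tie-break deterministically on the path id
        best = max(paths, key=lambda p: (len(paths[p] - covered), p))
        chosen.append(best)
        covered |= paths[best]
    return chosen
```

If p1 covers interfaces {a, b, c}, p2 covers {c, d}, and p3 covers {a}, the heuristic probes only p1 and p2 instead of all three paths.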
26 The coverage solution doesn't detect all types of failures
It detects fail-stop failures: failures that affect all packets that traverse the faulty interface (e.g., interface or router crashes, fiber cuts, bugs).
But not path-specific failures: failures that affect only a subset of the paths that cross the faulty interface (e.g., router misconfigurations).
27 New formulation of the failure detection problem
Select the frequency at which to probe each path.
Lower-frequency per-path probing can still achieve high-frequency probing of each interface: for example, probing each of three paths through an interface once every 9 minutes probes that interface once every 3 minutes.
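The aggregation effect on this slide (paths probed once every 9 minutes can give an interface one probe every 3 minutes) falls out of summing per-path rates; a tiny illustrative sketch:

```python
def interface_probe_rates(path_rates, paths):
    """The probing rate an interface experiences is the sum of the rates
    of all paths that traverse it. path_rates: path id -> probes per
    minute; paths: path id -> set of interfaces on the path."""
    rates = {}
    for p, rate in path_rates.items():
        for iface in paths[p]:
            rates[iface] = rates.get(iface, 0.0) + rate
    return rates
```

Three paths through interface B, each probed at 1/9 probes per minute, give B an aggregate rate of 1/3 probes per minute, i.e., one probe every 3 minutes.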
28 Is the failure on the forward or the reverse path?
Paths can be asymmetric (load balancing, hot-potato routing), so a lost reply does not reveal which direction failed.
29 Disambiguating one-way losses: spoofing
The monitor requests a spoofer to send a probe carrying the monitor's IP address as its source.
If the reply reaches the monitor, the reverse path is good.
30 Summary: Fault detection
Techniques to measure path reachability: active probing (ping plus failure confirmation) and passive analysis of TCP connections.
Reducing the overhead of active monitoring: select the set of paths to probe; trade-off between the set of paths and the probing frequency.
No control of targets: we only have round-trip measurements; spoofing differentiates forward from reverse failures.
32 Uncorrelated measurements lead to errors
Lack of synchronization leads to inconsistencies: probes cross links at different times, and the path may change between probes, so a failure can be mistakenly inferred.
33 Sources of inconsistencies
In measurements from a single monitor: probing all targets can take time.
In measurements from multiple monitors: it is hard to synchronize the monitors so that all probes reach a link at the same time, and impossible to generalize this to all links.
34 Inconsistent measurements with multiple monitors
Example path-reachability table for monitors m1, ..., mK and targets t1, ..., tN: (m1, t1) good, (mK, t1) good, ..., (m1, tN) good, but (mK, tN) bad: the measurements are inconsistent.
35 Solution: Reprobe paths after a failure
After reprobing, the table becomes consistent: both (m1, tN) and (mK, tN) are bad.
Consistency has a cost: it delays fault identification, and short failures cannot be identified.
36 Summary: Correlated measurements
Correlation is essential to tomography; lack of correlation leads to false alarms.
Correlation is hard with unicast probes: probing multiple targets takes time, and multiple monitors cannot probe a link simultaneously.
Solution: probe paths again after fault detection.
Trade-off: consistency vs. detection speed.
38 Measuring router topology
With access to routers (or "from inside"): the topology of one network, from routing monitors (OSPF or IS-IS).
With no access to routers (or "from outside"): a multi-AS topology, or measurement from end-hosts; monitors issue active probes (traceroute).
39 Topology from inside
Routing protocols flood the state of each link: they periodically refresh the link state and report any changes (link down, link up, cost change).
A monitor listens to the link-state messages, acting as a regular router; examples are AT&T's OSPFmon, and Sprint's PyRT for IS-IS.
Combining the link states gives the topology, which is easy to maintain because messages report every change.
40 Inferring a path from outside: traceroute
Send probes with increasing TTL: the TTL = 1 probe triggers a "TTL exceeded" from A.1, the TTL = 2 probe triggers one from B.1, and so on.
Actual path: m, router A (interfaces A.1, A.2), router B (interfaces B.1, B.2), t. Inferred path: m, A.1, B.1, t.
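Traceroute's control loop, sketched with a stand-in path_reply(ttl) instead of real probes (a real implementation needs raw sockets or a packet-crafting library):

```python
def traceroute(path_reply, max_ttl=30):
    """Probe with TTL = 1, 2, ... and record the interface that answers
    'TTL exceeded' at each hop. path_reply(ttl) stands in for sending a
    probe with that TTL; it returns the replying interface's address, or
    None once the destination has been reached."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        hop = path_reply(ttl)
        if hop is None:
            break          # destination reached: stop increasing the TTL
        hops.append(hop)
    return hops
```

For the path on this slide, the replies come from A.1 and B.1, the interfaces facing the monitor, which is why the inferred path lists interfaces rather than routers.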
41 A traceroute path can be incomplete
Load balancing is widely used, but traceroute only probes one path.
Sometimes traceroute gets no answer (stars): ICMP rate limiting, anonymous routers.
Tunnelling (e.g., MPLS) may hide routers: routers inside the tunnel may not decrement the TTL.
42 Traceroute under load balancing
When a load balancer spreads probes over two branches, the TTL = 2 and TTL = 3 probes may take different branches.
The inferred path then has missing nodes and links, and can even contain false links between routers that are not actually connected.
43 Errors happen even under per-flow load balancing
Traceroute uses the destination port as the probe identifier, but per-flow load balancers use the destination port as part of the flow identifier.
Consecutive probes (port 2 with TTL = 2, port 3 with TTL = 3) therefore belong to different flows and can still follow different branches.
44 Paris traceroute
Solves the problem with per-flow load balancing: all probes to a destination belong to the same flow.
It changes the location of the probe identifier, using the UDP checksum instead of the destination port, so probes keep the same ports and follow a single path.
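The difference can be illustrated with a toy flow hash (the hash function and port numbers are illustrative; real per-flow balancers hash the 5-tuple of addresses, ports, and protocol):

```python
def flow_id(src, dst, sport, dport, proto=17):
    """Toy stand-in for a per-flow load balancer's hash of the 5-tuple."""
    return hash((src, dst, sport, dport, proto))

# Classic traceroute varies the destination port per probe, so each probe
# hashes to a (potentially) different flow and may take a different branch:
classic = {flow_id('m', 't', 33434, 33434 + i) for i in range(3)}

# Paris traceroute keeps the whole 5-tuple fixed (the probe identifier is
# moved into the UDP checksum, which the balancer ignores), so every probe
# belongs to one flow and follows one path:
paris = {flow_id('m', 't', 33434, 33434) for _ in range(3)}
```

The classic probes produce three distinct flow ids while the Paris probes all map to one, which is exactly why Paris traceroute sees a consistent path under per-flow load balancing.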
45 Topology from traceroutes
Inferred nodes are interfaces, not routers.
Coverage depends on the monitors and targets: the inferred topology misses links and routers, and some links and routers appear multiple times (e.g., one router can show up as two separate interfaces C.1 and C.2).
46 Alias resolution: Map interfaces to routers
Direct probing: probe one interface and the response may come from another; responses from the same router will have close IP identifiers and the same TTL.
Record-route IP option: records up to nine IP addresses of routers along the path.
(E.g., in the inferred topology, interfaces C.1 and C.2 belong to the same router.)
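The direct-probing heuristic can be sketched as a pairwise test (the gap threshold is illustrative; real tools such as Rocketfuel's ally also re-probe to check that the two counters interleave over time):

```python
def likely_aliases(ipid_a, ipid_b, ttl_a, ttl_b, max_gap=10):
    """Two interfaces are likely aliases of the same router if their
    replies carry the same TTL and nearby IP identifiers, since many
    routers draw the IP ID from a single shared counter."""
    gap = (ipid_b - ipid_a) % 65536           # IP ID is a 16-bit counter
    wrapped_gap = min(gap, 65536 - gap)       # handle counter wrap-around
    return ttl_a == ttl_b and wrapped_gap <= max_gap
```

Replies with IP IDs 1000 and 1004 at the same TTL look like one router; IDs 1000 and 5000 do not.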
47 Large-scale topology measurements
Probing a large topology takes time; e.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads).
Probing more targets covers more links, but getting a topology snapshot takes longer, and the snapshot may be inaccurate: paths may change during the snapshot.
It is hard to keep the topology up to date: to know that a path changed, one has to re-probe it.
48 Faster topology snapshots
Probing redundancy is both intra-monitor and inter-monitor.
Doubletree combines backward and forward probing to eliminate this redundancy.
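A simplified sketch of Doubletree's two stop rules (the starting TTL and the set encodings are illustrative): each monitor probes forward until it hits an (interface, destination) pair already in a stop set shared among monitors, and backward until it hits an interface it has already seen itself.

```python
def doubletree_probe(path, start_ttl, local_stop, global_stop, dest):
    """Probe one (simulated) path starting at start_ttl. Forward probing
    stops at an (interface, dest) pair already in global_stop (another
    monitor covered the rest); backward probing stops at an interface
    already in this monitor's local_stop. Returns newly seen interfaces."""
    discovered = []
    # forward phase: toward the destination
    for h in range(start_ttl, len(path) + 1):
        iface = path[h - 1]
        if (iface, dest) in global_stop:
            break
        discovered.append(iface)
        global_stop.add((iface, dest))
    # backward phase: toward the monitor
    for h in range(start_ttl - 1, 0, -1):
        iface = path[h - 1]
        if iface in local_stop:
            break
        discovered.append(iface)
        local_stop.add(iface)
    return discovered
```

When a second monitor's path shares a suffix with one already probed, its forward phase stops immediately and it only discovers the hops near itself, which is the redundancy elimination the slide describes.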
49 Summary of techniques to measure topology
Routing messages: complete and accurate, but they require access to routers.
Combining traceroutes: anyone can use it, with no privileged access to routers, but it yields false or missing links and nodes.
Topologies for tomography carry some uncertainties: multiple topologies close to the time of an event, and multiple paths between a monitor and a target.
51 Open issues
Fault detection: how to detect faults or performance degradations that impact end-users? What are the overhead and speed of large-scale deployments? Will spoofing work in large-scale deployments?
Fault identification: how to keep the topology up to date for fast identification? Do we need new tomography techniques to cope with partial failures? Could inference be easier with cooperation from routers?
53 Network tomography theory
Survey on network tomography: R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, "Network Tomography: Recent Developments", Statistical Science, Vol. 19, No. 3, 2004.
Traffic matrix estimation: Y. Vardi, "Network Tomography: Estimating Source-Destination Traffic Intensities from Link Data", Journal of the American Statistical Association, Vol. 91, 1996.
Inference of link performance/connectivity (the MINC project): A. Adams et al., "The Use of End-to-end Multicast Measurements for Characterizing Internal Network Behavior", IEEE Communications Magazine, May 2000.
54 Binary tomography
Single-source tree algorithm: N. Duffield, "Network Tomography of Binary Network Performance Characteristics", IEEE Transactions on Information Theory, 2006.
Applying tomography in one network: R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren, "Detection and Localization of Network Black Holes", IEEE INFOCOM, 2007.
Applying tomography across multiple networks: A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, "NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data", CoNEXT, 2007.
55 Topology from inside
IS-IS monitoring: R. Mortier, "Python Routeing Toolkit (PyRT)", https://research.sprintlabs.com/pyrt/
OSPF monitoring: A. Shaikh and A. Greenberg, "OSPF Monitoring: Architecture, Design and Deployment Experience", NSDI, 2004.
Commercial products: Packet Design.
56 Topology with traceroute
Tracing accurate paths under load balancing: B. Augustin et al., "Avoiding traceroute anomalies with Paris traceroute", IMC, 2006.
Reducing the overhead of tracing one network's topology, and alias resolution with direct probing: N. Spring, R. Mahajan, and D. Wetherall, "Measuring ISP Topologies with Rocketfuel", SIGCOMM, 2002.
Using record route to obtain more accurate topologies: R. Sherwood, A. Bender, and N. Spring, "DisCarte: A Disjunctive Internet Cartographer", SIGCOMM, 2008.
Reducing the overhead of tracing a multi-network topology: B. Donnet, P. Raoult, T. Friedman, and M. Crovella, "Efficient Algorithms for Large-Scale Topology Discovery", SIGMETRICS, 2005.
57 Reducing the overhead of active fault detection
Selection of paths to probe: H. Nguyen and P. Thiran, "Active measurement for multiple link failures diagnosis in IP networks", PAM, 2004; Y. Bejerano and R. Rastogi, "Robust monitoring of link delays and faults in IP networks", INFOCOM, 2003.
Selection of the frequency to probe paths: H. X. Nguyen, R. Teixeira, P. Thiran, and C. Diot, "Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis", INFOCOM, 2009.
58 Internet-wide fault detection systems
Detection with BGP monitoring plus continuous pings, spoofing to disambiguate one-way failures, and traceroute to locate faults: E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, and T. Anderson, "Studying Black Holes in the Internet with Hubble", NSDI, 2008.
Detection with passive monitoring of the traffic of peer-to-peer systems or content distribution networks, with traceroutes to locate faults: M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, "PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services", OSDI, 2004.