Making Network Tomography Practical Renata Teixeira Laboratoire LIP6 CNRS and UPMC Paris Universitas.

1 Making Network Tomography Practical Renata Teixeira Laboratoire LIP6 CNRS and UPMC Paris Universitas

2 Internet monitoring is essential. For network operators: – Monitor service-level agreements – Troubleshoot failures – Diagnose anomalous behavior. For users or content/application providers: – Verify network performance

3 Challenge 1: Nobody controls the end-to-end path. Network operators only have data from one AS. End-hosts can only monitor end-to-end paths. [Figure labels: AS1, AS2, AS3, AS4]

4 Challenge 2: Available data is not direct. Network operators: Is my network performance good? – Only have per-link counts or active probes. Is there a problem? Where? – There may be no alarm. Users, applications: Is my provider's performance good? – Only have end-to-end delay and loss

5 Network tomography to the rescue. Inference of unknown network properties from measurable ones. Sophisticated inference algorithms – Given a model and available measurements – Apply statistical inference to estimate properties (maximum likelihood estimator, Bayesian inference). Unfortunately, limited practical deployment – Measuring the required inputs is difficult

6 This tutorial: Monitoring techniques to make network tomography practical

7 Outline Examples of network tomography problems Case study: fault diagnosis – Fault detection: continuous path monitoring – Fault identification: binary tomography Correlated path reachability Topology measurements Open issues

8 Network tomography problems. Estimation of a network's traffic matrix – Given total traffic in network links – What is the traffic between a network's entry and exit points? Inference of link performance – Given end-to-end probes – What is the loss rate or delay of a link? Inference of network topology – Given end-to-end loss measurements – What is the logical network topology?

9 Inference of link performance. What are the properties of network links? – Loss rate – Delay – Bandwidth – Connectivity. Given end-to-end measurements – No access to routers [Figure labels: nodes A, B, C, D, E, F; AS 1, AS 2]

10 Multicast-based Inference of Network-internal Characteristics (MINC). Measurements – Multicast probes – Traces collected at receivers. Inference – Exploit correlation in traces to estimate link properties. Introduced by the MINC project [Figure labels: probe sender, probe collectors]

11 Inferring link loss rates. Assumptions – Known, logical-tree topology – Losses are independent – Multicast probes. Methodology – Maximum likelihood estimates for αk [Figure: tree with source m and receivers t1, t2; links labeled with success probabilities α1, α2, α3 and their estimated success probabilities, written with hats]
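
For the two-receiver tree on this slide, the estimate has a simple closed form, sketched below. The sketch assumes link 1 is the shared link from m, link 2 leads to t1, and link 3 leads to t2; the function name and the trace format (one pair of booleans per multicast probe) are illustrative choices, not part of the original slides. With independent losses, P(t1 receives) = α1·α2, P(t2 receives) = α1·α3, and P(both receive) = α1·α2·α3, which can be solved for each αk.

```python
def estimate_link_success(traces):
    """Estimate (a1, a2, a3) for a two-receiver multicast tree.

    traces: list of (t1_received, t2_received) booleans, one per probe.
    Assumed labeling: a1 = shared link, a2 = link to t1, a3 = link to t2.
    """
    n = len(traces)
    if n == 0:
        raise ValueError("need at least one probe")
    g1 = sum(1 for r1, _ in traces if r1) / n           # fraction seen by t1
    g2 = sum(1 for _, r2 in traces if r2) / n           # fraction seen by t2
    g12 = sum(1 for r1, r2 in traces if r1 and r2) / n  # fraction seen by both
    if g12 == 0:
        raise ValueError("no probe reached both receivers; cannot separate links")
    a1 = g1 * g2 / g12   # (a1*a2)(a1*a3) / (a1*a2*a3)
    a2 = g12 / g2        # (a1*a2*a3) / (a1*a3)
    a3 = g12 / g1        # (a1*a2*a3) / (a1*a2)
    return a1, a2, a3


# Hypothetical trace: 20 probes, 4 seen by both, 4 by t1 only, 4 by t2 only.
example = ([(True, True)] * 4 + [(True, False)] * 4
           + [(False, True)] * 4 + [(False, False)] * 8)
a1, a2, a3 = estimate_link_success(example)
```

The shared-link estimate relies on probes that reach both receivers, which is why the correlation provided by multicast matters here.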

12 Binary tomography. Labels links as good or bad – Loss rate estimation requires tight correlation – Instead, separate good/bad performance – If a link is bad, all paths that cross the link are bad [Figure: tree with source m, receivers t1 and t2, and links α1, α2, α3 labeled good or bad]

13 Single-source tree. Smallest Consistent Failure Set algorithm – Assumes a single-source tree and known topology – Find the smallest set of links that explains bad paths (given that bad links are uncommon) – A bad link is the root of a maximal bad subtree [Figure: tree with source m and receivers t1, t2; links labeled good or bad]
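
The "root of a maximal bad subtree" rule can be sketched as follows. This is a minimal illustration, not the published algorithm's code: the tree is given as a hypothetical child-to-parent map, a link is named by its (parent, child) pair, and a subtree counts as bad when every receiver below it is on a bad path.

```python
def smallest_consistent_failure_set(parent, bad_leaves):
    """Blame the link above each maximal all-bad subtree.

    parent: dict node -> parent node (the source maps to None).
    bad_leaves: set of receivers whose path from the source is bad.
    Returns a set of (parent, child) links.
    """
    children = {node: [] for node in parent}
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)
    root = next(node for node, par in parent.items() if par is None)

    all_bad = {}

    def visit(node):
        kids = children[node]
        if not kids:
            all_bad[node] = node in bad_leaves
        else:
            # Evaluate every child (no short-circuit) so all_bad is complete.
            all_bad[node] = all([visit(kid) for kid in kids])
        return all_bad[node]

    visit(root)

    blamed = set()
    for node, par in parent.items():
        if par is None or not all_bad[node]:
            continue
        # Maximal: the subtree one level up is not entirely bad,
        # or there is no link above the parent (parent is the source).
        if par == root or not all_bad[par]:
            blamed.add((par, node))
    return blamed


# Hypothetical tree: m -> a, a -> t1, a -> t2
tree = {"m": None, "a": "m", "t1": "a", "t2": "a"}
one_bad = smallest_consistent_failure_set(tree, {"t2"})
both_bad = smallest_consistent_failure_set(tree, {"t1", "t2"})
```

When both receivers are bad, the sketch blames the single shared link rather than the two leaf links, which is the "bad links are uncommon" preference stated on the slide.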

14 Binary tomography with multiple sources and targets. Problem becomes NP-hard – Minimum hitting set problem (hitting set of a link = paths that traverse the link). Iterative greedy heuristic – Given the set of links in bad paths – Iteratively choose the link that explains the maximum number of bad paths. Promising for fault identification [Figure labels: monitors m1, m2; targets t1, t2]
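
A minimal sketch of the iterative greedy heuristic is shown below. The data layout (a dict from path identifier to the set of links it traverses) is hypothetical. One extra step is an assumption on my part rather than something stated on the slide: links that also appear on a good path are excluded from the candidates, since under the binary model a bad link makes every path through it bad.

```python
def greedy_bad_links(paths, bad_paths):
    """Greedy hitting-set heuristic for binary tomography.

    paths: dict path_id -> set of links on that path.
    bad_paths: set of path_ids observed as bad.
    Returns a list of links chosen to explain the bad paths.
    """
    good_links = set().union(*(links for p, links in paths.items()
                               if p not in bad_paths))
    candidates = set().union(*(paths[p] for p in bad_paths)) - good_links

    unexplained = set(bad_paths)
    chosen = []
    while unexplained and candidates:
        # Pick the link that lies on the most still-unexplained bad paths.
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda link: sum(1 for p in unexplained
                                        if link in paths[p]))
        explained = {p for p in unexplained if best in paths[p]}
        if not explained:
            break  # remaining bad paths cannot be explained by any candidate
        chosen.append(best)
        unexplained -= explained
        candidates.discard(best)
    return chosen


# Hypothetical two-monitor, two-target example.
example_paths = {
    ("m1", "t1"): {"a", "b"},
    ("m1", "t2"): {"a", "c"},
    ("m2", "t1"): {"d", "b"},
    ("m2", "t2"): {"d", "c"},
}
suspects = greedy_bad_links(example_paths, {("m1", "t1"), ("m2", "t1")})
```

Being greedy, the result is not guaranteed to be a minimum hitting set; it trades optimality for a polynomial running time.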

15 Practical issues. Topology is often unknown – Need to measure accurate topology. Limited deployment of multicast – Need to extract correlation from unicast probes – Even using probes from different monitors. Control of targets is not always practical – Need one-way performance from round-trip probes. Links can fail for some paths, but not all – Need to extend tomography algorithms

16 Outline Examples of network tomography problems Case study: fault diagnosis – Fault detection: continuous path monitoring – Fault identification: binary tomography Correlated path reachability Topology measurements Open issues

17 Steps of fault diagnosis – Detection: continuous path monitoring – Identification: binary tomography [Figure labels: AS1, AS2, AS3, AS4]


19 Detection techniques. Active probing: ping – Send probe and collect response – No control of targets. Passive analysis of users' traffic – tcpdump: tap all incoming and outgoing packets – Monitoring of TCP connections

20 Detection with ping. If the monitor receives a reply – Then, path is good. If no reply before timeout – Then, path is bad [Figure: monitor m sends a probe (ICMP echo request) to target t, which returns a reply (ICMP echo reply)]

21 Persistent failure or measurement noise? Many reasons to lose a probe or reply – Timeout may be too short – Rate limiting at routers – Some end-hosts don't respond to ICMP requests – Transient congestion – Routing change. Need to confirm that failure is persistent – Otherwise, may trigger false alarms

22 Failure confirmation. Upon detection of a failure, trigger extra probes. Goal: minimize detection errors – Sending more probes – Waiting longer between probes. Tradeoff: detection error vs. detection time [Figure labels: time, loss burst, packets on a path, detection error]
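
The confirmation step can be sketched as a small loop around a probe function. Everything named here is hypothetical: `probe` stands for any callable that returns True when a reply arrives before the timeout (for example a wrapper around ping), and the default number of extra probes and the wait between them are placeholders, not values from the tutorial.

```python
import time


def confirm_failure(probe, extra_probes=3, wait_s=1.0, sleep=time.sleep):
    """Return True only if the path stays unreachable across extra probes.

    probe: callable returning True if a reply arrived before the timeout.
    extra_probes: confirmation probes sent after the first loss.
    wait_s: pause between confirmation probes, in seconds.
    sleep: injectable for testing; defaults to time.sleep.
    """
    if probe():
        return False          # first probe answered: path is good
    for _ in range(extra_probes):
        sleep(wait_s)         # waiting longer helps outlast short loss bursts
        if probe():
            return False      # loss was transient: do not raise an alarm
    return True               # every probe lost: treat failure as persistent
```

Raising `extra_probes` or `wait_s` lowers the chance of a false alarm but lengthens detection time, which is the tradeoff named on the slide.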

23 Passive detection. tcpdump captures all packets. Track status of each TCP connection – RTTs, timeouts, retransmissions. Multiple timeouts indicate path is bad – If current seq. number > last seq. number seen: path is good – If current seq. number = last seq. number seen: timeout has occurred – After four timeouts, declare path as bad
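
The sequence-number rule can be sketched as a per-connection tracker. The class and method names are illustrative. Two details are assumptions not stated on the slide: the timeout counter is reset whenever the connection makes progress, and sequence-number wraparound is ignored for brevity.

```python
class ConnectionTracker:
    """Track one TCP connection's outgoing sequence numbers."""

    def __init__(self, max_timeouts=4):
        self.max_timeouts = max_timeouts  # "after four timeouts" on the slide
        self.last_seq = None
        self.timeouts = 0

    def observe(self, seq):
        """Feed the sequence number of the next captured data packet.

        Returns "good", "timeout", "bad", or "ignored".
        """
        if self.last_seq is None or seq > self.last_seq:
            self.last_seq = seq
            self.timeouts = 0         # assumption: progress resets the count
            return "good"
        if seq == self.last_seq:
            self.timeouts += 1        # same data sent again: a timeout occurred
            if self.timeouts >= self.max_timeouts:
                return "bad"
            return "timeout"
        return "ignored"              # older segment; not covered by the slide


tracker = ConnectionTracker()
statuses = [tracker.observe(s) for s in (100, 200, 200, 200, 200, 200)]
```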

24 Passive vs. active detection. Passive: + No need to inject traffic + Detects all failures that affect users' traffic + Responses from targets that don't respond to ping; but not always possible to tap users' traffic, and only detects failures in paths with traffic. Active: + No need to tap users' traffic + Detects failures in any desired path; but incurs probing overhead – Cover a large number of paths – Detect failures fast

25 Active monitoring: reducing probing overhead. Goal: detect failures of any of the interfaces in the target network with minimum probing overhead [Figure labels: monitors M1, M2; target hosts T1, T2, T3; target network with nodes A, B, C, D]

26 Simple solution: Coverage problem. Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber's network. Coverage problem is NP-hard – Solution: greedy set-cover heuristic [Figure labels: monitors M1, M2; targets T1, T2, T3; nodes A, B, C, D]
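
A greedy set-cover heuristic for path selection can be sketched as below. The path names and interface sets are made up for illustration; they do not reproduce the topology in the slide's figure.

```python
def select_paths(paths):
    """Greedy set cover: choose paths until every interface is covered.

    paths: dict path_id -> set of interfaces the path traverses.
    Returns the chosen path_ids in selection order.
    """
    uncovered = set().union(*paths.values())
    selected = []
    while uncovered:
        # Take the path that covers the most still-uncovered interfaces.
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(paths), key=lambda p: len(paths[p] & uncovered))
        selected.append(best)
        uncovered -= paths[best]
    return selected


# Hypothetical candidate paths and the interfaces each one crosses.
candidate_paths = {
    "M1-T1": {"A", "B"},
    "M1-T2": {"A", "C"},
    "M2-T2": {"D"},
    "M2-T3": {"D", "C"},
}
to_probe = select_paths(candidate_paths)
```

In this example two of the four paths suffice to cross every interface, so the other two need not be probed for fail-stop detection.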

27 Coverage solution doesn't detect all types of failures. Detects fail-stop failures – Failures that affect all packets that traverse the faulty interface (e.g., interface or router crashes, fiber cuts, bugs). But not path-specific failures – Failures that affect only a subset of paths that cross the faulty interface (e.g., router misconfigurations)

28 New formulation of failure detection problem. Select the frequency to probe each path – Lower-frequency per-path probing can achieve high-frequency probing of each interface [Figure labels: monitors M1, M2; targets T1, T2, T3; nodes A, B, C, D; 1 every 9 mins; 1 every 3 mins]
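
The arithmetic behind the figure's numbers can be checked with a short sketch. It assumes, as one reading of the figure, that three paths cross the same interface and each is probed once every 9 minutes; the function name and path labels are hypothetical. Probing rates add up across the paths that share an interface, so the interface is probed on average once every 3 minutes.

```python
from fractions import Fraction


def interface_probe_interval(path_interval_min, paths, interface):
    """Average minutes between probes that cross `interface`.

    path_interval_min: dict path_id -> probing interval in whole minutes.
    paths: dict path_id -> set of interfaces on that path.
    Returns None if no probed path crosses the interface.
    """
    # Per-path rates (probes per minute) add up on a shared interface.
    rate = sum(Fraction(1, path_interval_min[p])
               for p in paths if interface in paths[p])
    return 1 / rate if rate else None


# Assumed setup: three paths through interface "A", each probed every 9 min.
shared = {"p1": {"A", "B"}, "p2": {"A", "C"}, "p3": {"A", "D"}}
every_nine = {"p1": 9, "p2": 9, "p3": 9}
interval = interface_probe_interval(every_nine, shared, "A")
```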

29 Is failure in forward or reverse path? Paths can be asymmetric – Load balancing – Hot-potato routing [Figure labels: m, t, probe, reply]

30 Disambiguating one-way losses: Spoofing. Monitor asks a spoofer to send a probe – Probe carries the IP address of the monitor as its source – If the reply reaches the monitor, the reverse path is good [Figure: spoofer sends a spoofed packet with the source address of m to target t]

31 Summary: Fault detection. Techniques to measure path reachability – Active probing: ping + failure confirmation – Passive analysis of TCP connections. Reducing overhead of active monitoring – Select the set of paths to probe – Trade-off: set of paths and probing frequency. No control of targets – Only have round-trip measurements – Spoofing differentiates forward/reverse failures


33 Uncorrelated measurements lead to errors. Lack of synchronization leads to inconsistencies – Probes cross links at different times – Path may change between probes [Figure: monitor m probing targets t1 and t2, with a mistakenly inferred failure]

34 Sources of inconsistencies. In measurements from a single monitor – Probing all targets can take time. In measurements from multiple monitors – Hard to synchronize monitors for all probes to reach a link at the same time – Impossible to generalize to all links

35 Inconsistent measurements with multiple monitors [Figure: monitors m1 … mK probe targets t1 … tN; a table lists the reachability (good or bad) of each path from (m1, t1) to (mK, tN) and marks inconsistent measurements]

36 Solution: Reprobe paths after failure. Consistency has a cost – Delays fault identification – Cannot identify short failures [Figure: monitors m1 … mK probe targets t1 … tN; a table lists the reachability (good or bad) of each path from (m1, t1) to (mK, tN)]

37 Summary: Correlated measurements. Correlation is essential to tomography – Lack of correlation leads to false alarms. Correlation is hard with unicast probes – Probing multiple targets takes time – Multiple monitors cannot probe a link simultaneously. Solution: probe paths again after fault detection – Trade-off: consistency vs. detection speed


39 Measuring router topology. With access to routers (or from inside) – Topology of one network – Routing monitors (OSPF or IS-IS). No access to routers (or from outside) – Multi-AS topology or from end-hosts – Monitors issue active probes: traceroute

40 Topology from inside. Routing protocols flood state of each link – Periodically refresh link state – Report any changes: link down, up, cost change. Monitor listens to link-state messages – Acts as a regular router (AT&T's OSPFmon, or Sprint's PyRT for IS-IS). Combining link states gives the topology – Easy to maintain, messages report any changes

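41 Inferring a path from outside: traceroute [Figure: monitor m probes toward target t through routers A and B, with interfaces A.1, A.2, B.1, B.2. The TTL = 1 probe elicits "TTL exceeded" from A.1 and the TTL = 2 probe elicits "TTL exceeded" from B.1. The inferred path contains only A.1 and B.1, unlike the actual path]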

42 A traceroute path can be incomplete. Load balancing is widely used – Traceroute only probes one path. Sometimes traceroute has no answer (stars) – ICMP rate limiting – Anonymous routers. Tunnelling (e.g., MPLS) may hide routers – Routers inside the tunnel may not decrement TTL

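43 Traceroute under load balancing [Figure: load balancer L and routers A, B, C, D, E between m and t; the TTL = 2 and TTL = 3 probes take different branches, so the inferred path shows missing nodes and links and a false link compared with the actual path]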

44 Errors happen even under per-flow load balancing. Traceroute uses the destination port as identifier. Per-flow load balancers use the destination port as part of the flow identifier [Figure: load balancer L and routers A, B, C, D, E between m and t; the TTL = 2 probe uses port 2 and the TTL = 3 probe uses port 3; label "Flow 1"]

45 Paris traceroute. Solves the problem with per-flow load balancing – Probes to a destination belong to the same flow. Changes the location of the probe identifier – Use the UDP checksum [Figure: load balancer L and routers A, B, C, D, E between m and t; the TTL = 2 probe uses port 1 with checksum 2 and the TTL = 3 probe uses port 1 with checksum 3]
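
The effect of keeping the flow identifier constant can be illustrated with a toy per-flow load balancer. This is a simulation under stated assumptions, not how any real router or the Paris traceroute tool is implemented: the balancer hashes a five-tuple with CRC32, and the addresses, ports, and function names are all made up. The probe identifier that Paris traceroute places in the UDP checksum is not modeled, because it lies outside the fields the balancer hashes.

```python
import zlib


def pick_branch(flow, branches):
    """Toy per-flow load balancer: hash the flow tuple to choose a branch."""
    key = "|".join(str(field) for field in flow).encode()
    return branches[zlib.crc32(key) % len(branches)]


def classic_flows(src, dst, max_ttl, src_port=40000, base_dst_port=33434):
    """Classic traceroute: the destination port changes with every probe."""
    return {ttl: (src, dst, "udp", src_port, base_dst_port + ttl)
            for ttl in range(1, max_ttl + 1)}


def paris_flows(src, dst, max_ttl, src_port=40000, dst_port=33434):
    """Paris traceroute: every probe to a destination shares one flow tuple."""
    return {ttl: (src, dst, "udp", src_port, dst_port)
            for ttl in range(1, max_ttl + 1)}


branches = ["upper", "lower"]
classic = {ttl: pick_branch(flow, branches)
           for ttl, flow in classic_flows("192.0.2.1", "198.51.100.7", 6).items()}
paris = {ttl: pick_branch(flow, branches)
         for ttl, flow in paris_flows("192.0.2.1", "198.51.100.7", 6).items()}
```

Every entry in `paris` is the same branch by construction, whereas the entries in `classic` depend on how the varying destination port happens to hash, so consecutive TTLs may be answered by routers on different branches.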

46 Topology from traceroutes. Inferred nodes = interfaces, not routers. Coverage depends on monitors and targets – Misses links and routers – Some links and routers appear multiple times [Figure: actual topology with routers A, B, C, D between monitors m1, m2 and targets t1, t2; inferred topology with interfaces A.1, B.3, C.1, C.2, D.1]

47 Alias resolution: Map interfaces to routers. Direct probing – Probe an interface, may receive response from another – Responses from the same router will have close IP identifiers and same TTL. Record-route IP option – Records up to nine IP addresses of routers in the path [Figure: inferred topology with interfaces A.1, B.3, C.1, C.2, D.1; two interfaces marked as the same router]
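
The "close IP identifiers and same TTL" test can be sketched as a simple comparison of two responses collected back to back. The function name, the response format, and the `max_gap` threshold are assumptions for illustration; practical tools use more probes and more careful checks before declaring two interfaces aliases.

```python
def likely_same_router(resp_a, resp_b, max_gap=200):
    """Heuristic alias test on two probe responses.

    resp_a, resp_b: (ip_id, ttl) taken from responses to back-to-back probes
    sent to two different interface addresses.
    max_gap: largest IP-ID distance still considered "close" (assumed value).
    """
    id_a, ttl_a = resp_a
    id_b, ttl_b = resp_b
    # The IP identification field is 16 bits, so compare modulo 65536.
    gap = (id_b - id_a) % 65536
    gap = min(gap, 65536 - gap)
    return ttl_a == ttl_b and gap <= max_gap


same = likely_same_router((1000, 250), (1003, 250))
different = likely_same_router((1000, 250), (30000, 250))
```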

48 Large-scale topology measurements. Probing a large topology takes time – E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads) – Probing more targets covers more links – But getting a topology snapshot takes longer. Snapshot may be inaccurate – Paths may change during snapshot. Hard to get up-to-date topology – To know that a path changed, need to re-probe

49 Faster topology snapshots. Probing redundancy – Intra-monitor – Inter-monitor. Doubletree – Combines backward and forward probing to eliminate redundancy [Figure labels: monitors m1, m2; targets t1, t2; routers A, B, C, D]

50 Summary of techniques to measure topology. Routing messages – Complete and accurate – But need access to routers. Combining traceroutes – Anyone can use it, no privileged access to routers – But false or missing links and nodes. Topologies for tomography: some uncertainties – Multiple topologies close to the time of an event – Multiple paths between a monitor and a target

51 Outline Examples of network tomography problems Case study: fault diagnosis – Fault detection: continuous path monitoring – Fault identification: binary tomography Correlated path reachability Topology measurements Open issues

52 Open issues. Fault detection – How to detect faults or performance degradations that impact end-users? – What is the overhead and speed of large-scale deployments? – Will spoofing work in large-scale deployments? Fault identification – How to keep the topology up-to-date for fast identification? – Do we need new tomography techniques to cope with partial failures? – Could inference be easier with cooperation from routers?


54 Network tomography theory Survey on network tomography – R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, Network Tomography: Recent Developments, Statistical Science, Vol. 19, No. 3 (2004), Traffic matrix estimation – Y. Vardi, Network Tomography: Estimating Source-Destination Traffic Intensities from Link Data, Journal of the American Statistical Association, Vol. 91, Inference of link performance/connectivity – MINC project: – A. Adams et al., The Use of End-to-end Multicast Measurements for Characterizing Internal Network Behavior, IEEE Communications Magazine, May

55 Binary tomography Single-source tree algorithm – N. Duffield, Network Tomography of Binary Network Performance Characteristics, IEEE Transactions on Information Theory, Applying tomography in one network – R. R. Kompella, J. Yates, A. Greenberg, A. C. Snoeren, Detection and Localization of Network Blackholes, IEEE INFOCOM, Applying tomography in a multiple-network topology – A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data, CoNEXT,

56 Topology from inside IS-IS monitoring – R. Mortier, Python Routeing Toolkit (`PyRT'), OSPF monitoring – A. Shaikh and A. Greenberg, OSPF Monitoring: Architecture, Design and Deployment Experience, NSDI 2004 Commercial products – Packet Design:

57 Topology with traceroute Tracing accurate paths under load-balancing – B. Augustin et al., Avoiding traceroute anomalies with Paris traceroute, IMC, Reducing overhead to trace topology of a network and alias resolution with direct probing – N. Spring, R. Mahajan, and D. Wetherall, Measuring ISP Topologies with Rocketfuel, SIGCOMM Use of record route to obtain more accurate topologies – R. Sherwood, A. Bender, N. Spring, DisCarte: A Disjunctive Internet Cartographer, SIGCOMM, Reducing overhead to trace a multi-network topology – B. Donnet, P. Raoult, T. Friedman, and M. Crovella, Efficient Algorithms for Large-Scale Topology Discovery, SIGMETRICS,

58 Reducing overhead of active fault detection Selection of paths to probe – H. Nguyen and P. Thiran, Active measurement for multiple link failures diagnosis in IP networks, PAM, – Yigal Bejerano and Rajeev Rastogi, Robust monitoring of link delays and faults in IP networks, INFOCOM, Selection of the frequency to probe paths – H. X. Nguyen, R. Teixeira, P. Thiran, and C. Diot, " Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis", INFOCOM,

59 Internet-wide fault detection systems Detection with BGP monitoring plus continuous pings, spoofing to disambiguate one-way failures, traceroute to locate faults – E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, Studying Black Holes in the Internet with Hubble, NSDI, Detection with passive monitoring of traffic of peer-to-peer systems or content distribution networks, traceroutes to locate faults – M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services, OSDI,
