Internet Routing (COS 598A) Today: Root-Cause Analysis Jennifer Rexford Tuesdays/Thursdays 11:00am-12:20pm.


1 Internet Routing (COS 598A) Today: Root-Cause Analysis Jennifer Rexford http://www.cs.princeton.edu/~jrex/teaching/spring2005 Tuesdays/Thursdays 11:00am-12:20pm

2 Outline Network troubleshooting –Motivation for network troubleshooting –Investigating from the edge vs. inside Active probing –Traceroute –Mapping IP addresses to AS numbers Passive monitoring –Analyzing BGP update streams –Identifying location and cause of routing change –Limitations of the approach

3 Network Troubleshooting “Why can’t I reach www.cnn.com?” “Why is the performance bad?” [Figure labels: www.cnn.com, Internet]

4 Reachability Problems: What Could be Wrong? End-host problem –Web server down –DNS server down, or misconfigured Forwarding-path problem –Packet filter or firewall restricting access –Mismatch in Maximum Transmission Unit (MTU) Routing problem –User or server disconnected from Internet –Blackhole dropping all packets –Persistent loop

5 Performance Problem: What Could be Wrong? End-host problems –Overloaded Web server –Overloaded DNS server –Overloaded user machine Forwarding-path problem –High round-trip time –Link congestion Routing problem –Long-term routing instability –Transient disruption during convergence

6 Motivation for Troubleshooting Improving performance –Detect, diagnose, and fix the problem –Pick a path through another provider –Pick a different path in an overlay network Establishing accountability –Enforce Service Level Agreements –Rate service providers Characterizing the Internet –Understand causes of performance problems –Understand challenges of troubleshooting

7 Troubleshooting Outside vs. Inside Outside: from network edge –Who: users and researchers, and operators troubleshooting problems outside their network –Data: ping/traceroute, public feeds of BGP updates, and public measurement platforms –Challenges: inference from very limited data Inside: from inside the network –Who: operators running a network –Data: SNMP, fault data, traffic measurement, route monitors, and router configuration files –Challenges: collecting and joining the data Today

8 Active Probing

9 Pros and Cons of Active Probing Advantages –Can run from any end system –Measure the actual forwarding path (see black-holes, loops, and delays directly) Disadvantages –Effects of routing changes, not the cause –Current path, not the path used in the past (requires frequent probes to observe the changes) –Shows only properties of round-trip path (hard to tell if problem is on forward vs. reverse)

10 Traceroute: Measuring the Forwarding Path Time-To-Live field in IP packet header –Source sends a packet with a TTL of n –Each router along the path decrements the TTL –“TTL exceeded” sent when TTL reaches 0 Traceroute tool exploits this TTL behavior –Send packets with TTL=1, 2, 3, … and record source of “time exceeded” message [Figure labels: source, destination, TTL=1, TTL=2, Time exceeded]
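
The TTL mechanism on this slide can be sketched without raw sockets by simulating probes over a known path. This is a minimal illustration, not the real tool: the function name `simulated_traceroute` and the convention that `None` marks a silent router are assumptions made for the example, and the final hop is treated like any other responder.

```python
def simulated_traceroute(path, max_ttl=30):
    """Simulate traceroute over a known forwarding path.

    `path` lists the address at each hop, ending with the destination.
    None marks a router that never sends "time exceeded" (assumption).
    """
    report = []
    for ttl in range(1, min(max_ttl, len(path)) + 1):
        remaining = ttl
        for router in path:
            remaining -= 1          # each router decrements the TTL
            if remaining == 0:      # the probe expires at this hop
                report.append(router if router is not None else "*")
                break
    return report


hops = simulated_traceroute(["169.229.62.1", "169.229.59.225", None, "64.236.16.52"])
print(hops)
```

A silent router shows up as "*", which is how a missing response appears in traceroute output.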

11 Example Traceroute Output (Berkeley to CNN)
Columns on the slide: hop number, IP address, DNS name
Hop number and IP address:
 1 169.229.62.1
 2 169.229.59.225
 3 128.32.255.169
 4 128.32.0.249
 5 128.32.0.66
 6 209.247.159.109
 7 *
 8 64.159.1.46
 9 209.247.9.170
10 66.185.138.33
11 *
12 66.185.136.17
13 64.236.16.52
DNS names, in the order transcribed: inr-daedalus-0.CS.Berkeley.EDU, soda-cr-1-1-soda-br-6-2, vlan242.inr-202-doecev.Berkeley.EDU, gigE6-0-0.inr-666-doecev.Berkeley.EDU, qsv-juniper--ucb-gw.calren2.net, POS1-0.hsipaccess1.SanJose1.Level3.net, ?, pos8-0.hsa2.Atlanta2.Level3.net, pop2-atm-P0-2.atdn.net, ?, pop1-atl-P4-0.atdn.net, www4.cnn.com
Callouts: No response from router; No name resolution

12 Example Troubleshooting Results No packets go beyond your gateway –Gateway’s connection to Internet is dead Traceroute stops at intermediate point –Perhaps a blackhole Traceroute path has a loop –Transient or persistent forwarding loop Traceroute shows a very long path –Routing anomaly, route hijacking, etc. Traceroute shows very long delays –Delay or congestion on forward or reverse path

13 Problems with Traceroute Missing responses –Routers might not send “Time-Exceeded” –Firewalls may drop the probe packets –“Time-Exceeded” reply may be dropped Misleading responses –Probes taken while the path is changing –Name not in DNS, or DNS entry misconfigured Mapping IP addresses –Mapping interfaces to a common router –Mapping interface/router to Autonomous System

14 Map Traceroute Hops to ASes Traceroute output (hop number, IP): 1 169.229.62.1 2 169.229.59.225 3 128.32.255.169 4 128.32.0.249 5 128.32.0.66 6 209.247.159.109 7 * 8 64.159.1.46 9 209.247.9.170 10 66.185.138.33 11 * 12 66.185.136.17 13 64.236.16.52 AS path: AS25 (Berkeley), AS11423 (Calren), AS3356 (Level3), AS1668 (AOL), AS5662 (CNN) Need accurate IP-to-AS mappings (for network equipment).

15 Candidate Ways to Get IP-to-AS Mapping Routing address registry –Voluntary public registry such as whois.radb.net –Used by prtraceroute and “NANOG traceroute” –Incomplete and quite out-of-date (mergers, acquisitions, delegation to customers) Origin AS in BGP paths –Public BGP routing tables such as RouteViews –Used to translate traceroute data to an AS graph –Incomplete and inaccurate… but usually right (multiple origin ASes, no mapping, wrong mapping)

16 Example: BGP Table (“show ip bgp” at RouteViews)
   Network          Next Hop        Metric LocPrf Weight Path
*  3.0.0.0/8        205.215.45.50                      0 4006 701 80 i
*                   167.142.3.6                        0 5056 701 80 i
*                   157.22.9.7                         0 715 1 701 80 i
*                   195.219.96.239                     0 8297 6453 701 80 i
*                   195.211.29.254                     0 5409 6667 6427 3356 701 80 i
*>                  12.127.0.249                       0 7018 701 80 i
*                   213.200.87.254     929             0 3257 701 80 i
*  9.184.112.0/20   205.215.45.50                      0 4006 6461 3786 i
*                   195.66.225.254                     0 5459 6461 3786 i
*>                  203.62.248.4                       0 1221 3786 i
*                   167.142.3.6                        0 5056 6461 6461 3786 i
*                   195.219.96.239                     0 8297 6461 3786 i
*                   195.211.29.254                     0 5409 6461 3786 i
AS 80 is General Electric, AS 701 is UUNET, AS 7018 is AT&T, AS 3786 is DACOM (Korea), AS 1221 is Telstra
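
One way to turn a table like this into an IP-to-AS map is to take the last AS in each path as the origin and look up addresses by longest-prefix match. The sketch below assumes that approach; the two prefixes and origin ASes come from the slide, while the names `ORIGIN_AS` and `ip_to_as` are invented for illustration.

```python
import ipaddress

# Prefix -> origin AS (the last AS in each path on the slide).
ORIGIN_AS = {
    ipaddress.ip_network("3.0.0.0/8"): 80,
    ipaddress.ip_network("9.184.112.0/20"): 3786,
}


def ip_to_as(ip, table=ORIGIN_AS):
    """Return the origin AS of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in table if addr in net]
    if not matches:
        return None  # no mapping: the address is not covered by the table
    best = max(matches, key=lambda net: net.prefixlen)
    return table[best]


print(ip_to_as("3.1.2.3"), ip_to_as("9.184.120.1"), ip_to_as("10.0.0.1"))
```

A linear scan keeps the example short; a real table with hundreds of thousands of prefixes would normally use a trie or similar structure.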

17 Why Would IP-to-AS Mapping Be Wrong? IP addresses of equipment –Interfaces on the routers, not end hosts –Identifies equipment in routing protocols –Doesn’t need to be globally visible or consistent Three reasons the mappings may be “wrong” –Addresses of Internet Exchange Points –Sibling ASes that share address space –ASes that don’t announce their addresses Look at traceroute path vs. BGP AS path –Traceroute path after IP-to-AS mapping –BGP AS path taken from the BGP table

18 Extra AS due to Internet eXchange Points IXP: shared place where providers meet –E.g., Mae-East, Mae-West, PAIX –Large number of fan-in and fan-out ASes Ignore extra traceroute AS hop with high fan-in and fan-out [Figure: traceroute AS path vs. BGP AS path among ASes A, B, C, D, E, F, G]
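
The fan-in and fan-out test can be sketched by counting the distinct neighbors on each side of an AS across many traceroute-derived AS paths. This is only an illustration: the function names and the `min_fan` threshold are assumptions, and the slide does not say what counts as "high".

```python
from collections import defaultdict


def fan_in_out(as_paths):
    """Map each AS to (distinct predecessors, distinct successors)."""
    preds, succs = defaultdict(set), defaultdict(set)
    for path in as_paths:
        for left, right in zip(path, path[1:]):
            succs[left].add(right)
            preds[right].add(left)
    every_as = set(preds) | set(succs)
    return {asn: (len(preds[asn]), len(succs[asn])) for asn in every_as}


def likely_ixps(as_paths, min_fan=3):
    """ASes whose fan-in and fan-out both reach the (assumed) threshold."""
    return {
        asn
        for asn, (fan_in, fan_out) in fan_in_out(as_paths).items()
        if fan_in >= min_fan and fan_out >= min_fan
    }


paths = [["A", "X", "E"], ["B", "X", "F"], ["C", "X", "G"]]
print(likely_ixps(paths))
```

In this toy input, X sits between three upstream and three downstream ASes, so it is the hop that would be dropped before comparing against the BGP AS path.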

19 Extra AS due to Sibling ASes Sibling: organizations with multiple ASes –E.g., Sprint AS 1239 and AS 1791 –One AS numbers its equipment with addresses of another Merge sibling ASes that “belong together” as if they were one AS. [Figure: traceroute AS path vs. BGP AS path]
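
Merging siblings can be sketched as rewriting every AS in a group to one representative and then collapsing adjacent repeats. The function name `merge_siblings` and the choice of the lowest AS number as representative are assumptions; the Sprint pair is from the slide, and the surrounding path is made up for illustration.

```python
def merge_siblings(as_path, sibling_groups):
    """Treat each group of sibling ASes as one AS and drop adjacent repeats."""
    canonical = {}
    for group in sibling_groups:
        representative = min(group)  # arbitrary but stable choice (assumption)
        for asn in group:
            canonical[asn] = representative
    merged = []
    for asn in as_path:
        asn = canonical.get(asn, asn)
        if not merged or merged[-1] != asn:
            merged.append(asn)
    return merged


siblings = [{1239, 1791}]
traceroute_path = [7018, 1239, 1791, 80]   # illustrative path, not measured data
bgp_path = [7018, 1239, 80]                # illustrative path, not measured data
print(merge_siblings(traceroute_path, siblings) == merge_siblings(bgp_path, siblings))
```

After merging, the extra hop in the traceroute-derived path disappears and the two paths compare equal.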

20 Unannounced Infrastructure Addresses C does not announce part of its address space in BGP (e.g., 12.1.2.0/24) Fix the IP-to-AS map to associate 12.1.2.0/24 with C [Figure: ASes A, B, and C; address block 12.0.0.0/8]

21 Refining Initial IP-to-AS Mapping Start with initial IP-to-AS mapping –Mapping from BGP tables is usually correct –Good starting point for computing the mapping Collect many BGP and traceroute paths –Signaling and forwarding AS path usually match –Good way to identify mistakes in IP-to-AS map Successively refine the IP-to-AS mapping –Find add/change/delete that makes big difference –Base these “edits” on operational realities http://www.cs.princeton.edu/~jrex/papers/sigcomm03.pdf http://www.cs.princeton.edu/~jrex/papers/infocom04.pdf

22 Research Areas Better version of traceroute –Router support for active measurement –IPPM (IP Performance Measurement) –http://www1.ietf.org/mail-archive/web/imrg/current/msg00154.html Peer-to-peer troubleshooting www.cnn.com “No” “Yes”

23 Passive Monitoring

24 Limitations of Active Measurements Active measurements: traceroute-like tools –Can’t probe in the past –Shows the effect, not the cause User (s) Web Server (d) AS 1 AS 2 AS 3 AS 4

25 Appealing to Peek Inside Passive measurements: public BGP data [Figure labels: Data Collection (RouteViews, RIPE), BGP update feeds, Data Correlation, root cause]

26 Inspect BGP Routing Changes Changes in paths to reach destination d –AS 1: “1 3 4” → “1 2 4” –AS 2: “2 4” (no change) –AS 3: “3 4” → “3 1 2 4” –AS 4: “4” (no change) [Figure labels: User (s), Web Server (d), AS 1, AS 2, AS 3, AS 4]

27 Idea #1: ASes in Paths Undergoing Change Key assumption –“The AS responsible for the change appears in the old and/or the new AS path to the destination.” If an AS has a routing change –All ASes in old and new paths may be responsible –Call these ASes the “suspect set” Combining across vantage points –Consider all ASes that had a routing change –Perform the intersection across the suspect sets
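
Idea #1 can be sketched with plain set operations: each observed change yields a suspect set, and the sets are intersected across vantage points. The function names are assumptions; the input reuses the two path changes listed on slide 26.

```python
def suspect_set(old_path, new_path):
    """All ASes that appear in the old or the new AS path."""
    return set(old_path) | set(new_path)


def intersect_suspects(changes):
    """Intersect suspect sets across vantage points that saw a change."""
    sets = [suspect_set(old, new) for old, new in changes]
    return set.intersection(*sets) if sets else set()


# Changes seen by AS 1 and AS 3 on slide 26.
changes = [([1, 3, 4], [1, 2, 4]), ([3, 4], [3, 1, 2, 4])]
print(intersect_suspects(changes))
```

For this input both suspect sets contain all four ASes, so intersection alone does not narrow the field.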

28 Idea #2: Excluding ASes in Non-Changing Paths Key assumption –“If an AS has no routing change, the ASes in the path are not responsible and can be excluded.” Example –AS 1: “1 2 4” → “1 2 3 4”: suspects {1, 2, 3, 4} –AS 2: “2 4” → “2 3 4”: suspects {2, 3, 4} –AS 3: “3 4” (no change): non-suspects {3, 4} [Figure labels: AS 1, AS 2, AS 3, AS 4]
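
The slide's example can be worked through in a few lines, assuming the exclusion is applied after intersecting the suspect sets of the changed paths. The function name `narrow_suspects` is made up for this sketch.

```python
def narrow_suspects(changed, unchanged):
    """Intersect suspects from changed paths, then drop ASes on unchanged paths."""
    suspects = None
    for old_path, new_path in changed:
        current = set(old_path) | set(new_path)
        suspects = current if suspects is None else suspects & current
    suspects = suspects or set()
    for path in unchanged:
        suspects -= set(path)   # ASes on a stable path are assumed not responsible
    return suspects


changed = [([1, 2, 4], [1, 2, 3, 4]), ([2, 4], [2, 3, 4])]
unchanged = [[3, 4]]
print(narrow_suspects(changed, unchanged))
```

With the slide's numbers, the intersection {2, 3, 4} minus the non-suspects {3, 4} leaves AS 2.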

29 Idea #3: Blaming the ASes in the Better Path Key assumption –“The better path is the one that contains the AS responsible for the change.” Example –“1 2 4” → “1 2 3 4”: better path to worse path, with ASes {1,2,4} as the suspects (not AS 3) Heuristics for identifying the “better” path –E.g., the shorter AS path [Figure labels: AS 1, AS 2, AS 3, AS 4]
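
Using the shorter-AS-path heuristic named on the slide, Idea #3 reduces to picking one of the two paths. In this sketch the function name and the tie-break (prefer the old path when lengths are equal) are assumptions.

```python
def better_path_suspects(old_path, new_path):
    """Blame only the ASes on the 'better' path, taken here as the shorter one."""
    # Tie-break toward the old path is an arbitrary choice for this sketch.
    better = old_path if len(old_path) <= len(new_path) else new_path
    return set(better)


print(better_path_suspects([1, 2, 4], [1, 2, 3, 4]))
```

For the slide's change the old path is shorter, so the suspects are {1, 2, 4} and AS 3 is excluded.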

30 Idea #4: Combining Across Destinations Key assumption –“All destinations experiencing routing changes in a short period of time have a common cause.” Exploiting the observation –Form suspect sets for each destination –Perform intersections of the sets across the destinations
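
Combining across destinations can be sketched as a two-level intersection: first across vantage points for each destination, then across destinations. The function name, the destination labels, and the example paths are all hypothetical.

```python
def common_cause(changes_by_destination):
    """Intersect per-destination suspect sets across all destinations."""
    per_destination = []
    for changes in changes_by_destination.values():
        sets = [set(old) | set(new) for old, new in changes]
        if sets:
            per_destination.append(set.intersection(*sets))
    return set.intersection(*per_destination) if per_destination else set()


# Hypothetical changes observed within a short time window.
observed = {
    "d1": [([1, 2, 4], [1, 3, 4])],
    "d2": [([1, 2, 5], [1, 6, 5])],
}
print(common_cause(observed))
```

Here the two destinations share only ASes 1 and 2 in their changed paths, so under the common-cause assumption those are the remaining suspects.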

31 Difficulties With Root-Cause Analysis Misleading BGP routing changes –Responsible AS not on old or new path –Looking across destinations doesn’t resolve Missing routing changes –Some routers in an AS don’t have a change –Some subnets are not visible in BGP –Some internal changes are not visible in BGP

32 Misleading BGP Changes Myth: The AS responsible for the change appears in the old or the new AS path. old: 1,2,8,9,10 new: 1,4,5,6,7,10 [Figure: BGP data collection; ASes 1 through 11]

33 Misleading BGP Changes Myth: Looking at routing changes across prefixes resolves causes. Changes for d2, but not for d1 and d3 [Figure labels: BGP data collection; A, B, C; AS 1, AS 2, AS 3; d1, d2, d3; 10, 7, 12]

34 Missing Routing Changes Myth: The BGP updates from a single router accurately represent the AS. [Figure labels: BGP data collection; A, B, C, D; dst; AS 1, AS 2; 6, 12, 10, 7; No change]

35 Missing Routing Changes Myth: BGP data from a router accurately represents changes on that router. [Figure labels: BGP data collection; A; 12.1.1.0/24; 12.1.0.0/16]

36 Missing Routing Changes Myth: Routing changes visible in eBGP have greater end-to-end impact than changes with local scope. [Figure labels: BGP data collection; A, B, C, D; dst; AS 1, AS 2; 6, 12, 10, 5, 7]

37 Hybrid of Active and Passive Monitoring [Figure labels: i, j; Omni 1, Omni 2, Omni 3, Omni 4; User (s); Web Server (d); (i,s,d,t); (j,s,d,t’); failure link (3,4); AS 1, AS 2, AS 3, AS 4]

38 Research Questions Understanding if root-cause analysis can work –How many vantage points are needed? –Do the assumptions usually hold? –Can algorithms tolerate occasional violations? –Can some additional information help? Distributed algorithms for root-cause analysis –Can ASes cooperate in distributed fashion? –How to prevent or detect ASes that cheat? –Do all ASes have to participate? –Other hybrids of active and passive monitoring?

39 Conclusions Troubleshooting is important –Detect, diagnose, and fix problems –Accountability and service-level agreements Troubleshooting is hard –Active measurement (e.g., traceroute) not enough –Root-cause analysis techniques are not enough New innovation necessary –Hybrid active/passive approaches –Router support for active measurement –Routing protocol extensions for troubleshooting

40 For Next Time: From Inside an AS Two papers –“OSPF monitoring: Architecture, design, and deployment experience” –“Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network” Optional reading –Materials from Packet Design and Ipsum Networks Review only of first paper –Summary –Why accept –Why reject –Future work

