1 Monitoring Traffic Exits in a Multihomed I2 Environment Joe St Sauver UO Computing Center NLANR/I2 Joint Techs, Minneapolis May 18th, 2000

2 Today's Message in a Nutshell... What: You should pay attention to how application traffic is exiting your campus. Why: In an I2 multihomed environment there are important differences between the network exits you've got available. How: We have a simple and freely available tool, written in perl, which you can use to monitor your application exits.

3 Definition: Multihomed By saying that a site is "multihomed" we mean that the site is: "connected to two or more wide area networks, such as the commodity Internet and a high performance limited access network such as Abilene"

4 ALL I2 Schools Are Multihomed ALL I2 schools are multihomed, since all are required to have both commodity AND high performance connectivity In fact, many I2 schools have: -- multiple commodity transit providers, and -- multiple high performance exits, and -- local or national peerage, and -- local intraconsortia traffic…

5 Many people don't know/care how their traffic flows "It just works" "I couldn't change how my traffic is routed even if I didn't like how it is exiting" "I wouldn't do anything different even if I could get my traffic to go a different way" "I can't tell the difference between I2 and commodity connectivity, anyhow."

6 Should They Care? Yes! Users need help understanding that they should care about how traffic exits. (1) In a multihomed environment, all exits are not alike. (2) Users can adjust their applications and their collaborations to make efficient use of available exits, if they can determine what's going on, and if they are motivated.

7 Some crucial differences among various exits you may have… Different network exits may have: (1) different available capacity (a primary motivator in our case) (2) different cost structures (also important to us, as at most places) (3) different latency and loss properties (4) different wide area reachability

8 (1) Available capacity... Classic Example: I2 school with congested fractional DS3 commodity Internet transit connectivity, but lightly loaded OC3 connectivity via Abilene. A huge difference in available capacity exists between those two exits.

9 (2) Different Cost Structures for Different Connections… Oregon, like most places, has different cost structures for different sorts of connectivity. For example, we pay $1K/Mbps/month for contracted inbound commodity Internet transit capacity (excluding some categories of traffic).

10 Different Cost Structures for Different Connections… (cont.) In the case of peerage or HPC connectivity, on the other hand, usage may have no direct incremental cost (except at step boundaries where capacity would need to be added), and in fact, there may be an effective "negative incremental cost" associated with shifting inbound traffic off of commodity transit links and over to HPC connectivity.
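The "negative incremental cost" arithmetic above is straightforward; a minimal sketch using the $1K/Mbps/month figure quoted earlier (the 5 Mbps traffic volume below is a hypothetical example, not a figure from the talk):

```python
COMMODITY_RATE = 1000  # $/Mbps/month for contracted inbound commodity transit (figure from the talk)

def monthly_savings(mbps_shifted):
    """Dollars/month freed up by moving traffic off commodity transit and
    onto HPC/peerage links whose usage has no direct incremental cost
    (below a capacity-step boundary)."""
    return mbps_shifted * COMMODITY_RATE

# e.g., shifting a hypothetical 5 Mbps of inbound traffic:
print(monthly_savings(5))  # 5000
```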

11 Inbound Traffic? I Thought This Was About Monitoring Exits? We can't easily/systematically do reverse traceroutes to watch our commodity traffic inbound (even though that's what we pay for) We'll assume that if we've got outbound traffic going to you, you've probably got inbound traffic going to us We'll watch what we can and hope that the world is symmetric (hah hah, I know, I know)

12 (3) Latency and Loss Different exits may have dramatically different latency and loss characteristics. Sustained throughput at a level which might be perfectly reasonable over a low latency and low loss circuit might be completely unrealistic to attempt in the face of higher latency or higher loss

13 (4) Reachability In some cases, a change of exit may result in a material change in reachability. For example, consider a block of addresses that are advertised ONLY to high performance connectivity partners (the intent being that if the HPC connectivity falls over, the load will not flop over onto commodity transit circuits and swamp those exits).

14 How Does All This Relate to Network Engineering? "Okay, sure, it's true that there's a difference between the various exits that might be available. But how does this relate to network engineering tasks? What knobs do I need to tweak on my Cisco?"

15 Focus is on the user/applications, not on tweaking the network We assume that the network configuration is "a given" and is NOT changeable -- e.g., network engineers [are not|cannot|will not|should not] tweak routes for load balancing or other traffic management. We DO assume that end users can make application level decisions about what they do, or who they collaborate with.

16 … plus user network monitoring Assuming we really need and are really using our I2 connectivity, users who are running crucial applications over I2 need to know when it isn't there for a given collaboration, if only so they can call up and complain. :-)

17 Why not just look at periodic route snapshots? For example, why not just check: tml
Answer: too much data
Answer: HPC connectivity only
Answer: only updates every half hour
Answer: too hard to see what's changed
Answer:...

18 Oregon's Particular Application We run news servers connecting with hundreds of Usenet News peers located all around the world, some normally I2 connected, some connected via commodity transit, some connected via peerage. News traffic must not be allowed to interfere with other network traffic ==> we need to closely monitor and manage traffic flowing out these multiple exits

19 Sample Decisions Made In Part Based on How Traffic Exits: Should we peer with this site at all? What should we feed them? What should we have them feed us? If we will be feeding this site over a high performance connection, do we want to ensure that HPC traffic doesn't flop onto our commodity transit links in case HPC connectivity falls over?

20 Usenet is NOT a unique case... Usenet News is not unique when it comes to distributing data to many different sites over potentially diverse exits. Examples include: -- Unidata's LDM weather data network -- web cache hierarchies -- IRC server traffic -- any coordinated server-to-server traffic

21 Stage 1 Exit Selection Analysis: "I know, let's use traceroute..." Initially, we did what everyone else does when we wanted to figure out where traffic was exiting, and just said: % traceroute with our traffic involving five main exits...

22 Sample traceroute #1
% traceroute amber.Berkeley.EDU
traceroute to amber.Berkeley.EDU, 30 hops max, 40 byte packets
[hops 1-6: addresses and timings lost in transcription]
7  pos1-0.inr-000-eva.Berkeley.EDU
8  pos5-0-0.inr-001-eva.Berkeley.EDU
9  fast1-0-0.inr-007-eva.Berkeley.EDU
10 f8-0.inr-100-eva.Berkeley.EDU
11 amber.Berkeley.EDU
But wait, there's more… we also have HPC connectivity via Denver...

23 Sample traceroute #2
% traceroute [destination elided]
traceroute to [destination], 30 hops max, 40 byte packets
[14 hops; all addresses and timings lost in transcription]
Oregon to BC, via Indianapolis… Guess we still need a west coast HPC peering point… :-;

24 Sample traceroute #3
% traceroute [destination elided]
traceroute to [destination], 30 hops max, 40 byte packets
[hops 1-4: addresses and timings lost in transcription]
5  Hssi8-0-0.GW1.POR2.ALTER.NET / Serial4-0.GW1.POR2.ALTER.NET
   ATM3-0.XR2.SEA4.ALTER.NET
   ATM2-0.TR2.SEA1.ALTER.NET
   ATM7-0.TR2.EWR1.ALTER.NET
   ATM7-0.XR2.EWR1.ALTER.NET
   ATM9-0-0.GW1.NYC2.ALTER.NET
12 AngolaTel-gw.customer.ALTER.NET
[hops 13-14: addresses and timings lost in transcription]
Dual fractional DS3s == multiple gateway addresses we need to watch out for...

25 Sample traceroute #4
% traceroute [destination elided]
traceroute to [destination], 30 hops max, 40 byte packets
[14 hops; addresses and timings lost in transcription]

26 Sample traceroute #5
% traceroute [destination elided]
traceroute to [destination], 30 hops max, 40 byte packets
[16 hops; addresses and timings lost in transcription]

27 Couple of Slight Problem(s)... Manually tracerouting to several hundred hosts gets to be really tedious, really fast Traffic could (and did) shift without notice (sometimes dramatically, judging from MRTG graphs), yet we'd only end up doing traceroutes on rare occasions, often missing interesting phenomena

28 And once we were done tracerouting all over the place... We still needed to consolidate that information into a useful format (other than jotted notes on the back of recycled memos) We learned that we did badly when it came to noticing changes in exit behavior for one or two particular hosts out of hundreds

29 Stage 2 Exit Selection Analysis: "Let's write a filter!!!" Simple repetitive task ==> create a filter Input: list of FQDNs or dotted quads whose exit selection we want to monitor Output: same as input list, but with exit info prepended to each entry Approach: do a traceroute, grep for the gateways, tag and print the exits accordingly. Write in perl. Sort by exit.

30 Stage 2 Philosophy: Like Unix itself, build small simple tools that can be chained together to do larger, complex tasks "If we can just see where the traffic is going, that will be good enough…."

31 Stage 2 code...
#!/usr/local/bin/perl
$Cmd = "/usr/etc/traceroute";
open (SAVERR, ">&STDERR");   # save stderr...
open (STDERR, ">temp.tmp");  # ...and divert traceroute's chatter to a temp file
while ($Host = <>) {
    chop $Host;
    open(IN, "$Cmd $Host |");
    while ($line = <IN>) {
        # the gateway address patterns were elided from this transcript;
        # each m/ / matched the address of one of our exit gateways
        if ($line =~ m/ /) { print "OGIG-S: $Host\n"; }
        elsif ($line =~ m/ /) { print "OGIG-D: $Host\n"; }

32 Stage 2 code (cont.)
        elsif ($line =~ m/ /) { print "UUNet: $Host\n"; }
        elsif ($line =~ m/ /) { print "UUNet: $Host\n"; }
        elsif ($line =~ m/ /) { print "CWIX: $Host\n"; }
        elsif ($line =~ m/ /) { print "Verio: $Host\n"; }
    }
}

33 Sample Stage 2 Run... % cat |./chexit Verio: CWIX: [etc.] and/or pipe it through sort.
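Since the gateway address patterns in the perl filter were stripped from this transcript, here is a parallel sketch of the same tag-by-gateway logic in Python; the 10.x gateway addresses and the peer host name are hypothetical stand-ins (the exit names are the real ones from the talk):

```python
import re

# Hypothetical gateway addresses standing in for the site-specific ones
# elided from the transcript: first matching gateway determines the exit.
EXIT_GATEWAYS = {
    r"\b10\.0\.1\.1\b": "OGIG-S",
    r"\b10\.0\.1\.2\b": "OGIG-D",
    r"\b10\.0\.2\.1\b": "UUNet",
    r"\b10\.0\.3\.1\b": "CWIX",
    r"\b10\.0\.4\.1\b": "Verio",
}

def tag_exit(traceroute_output, host):
    """Return 'EXIT: host' for the first gateway matched, like the perl filter."""
    for line in traceroute_output.splitlines():
        for pattern, exit_name in EXIT_GATEWAYS.items():
            if re.search(pattern, line):
                return f"{exit_name}: {host}"
    return f"unknown: {host}"

# canned two-hop trace that crosses the hypothetical CWIX gateway
sample = "1  10.99.0.1  1 ms\n2  10.0.3.1  4 ms\n"
print(tag_exit(sample, "peer.example.edu"))  # CWIX: peer.example.edu
```

Its output can then be piped through sort to group peers by exit, exactly as the talk suggests.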

34 Lessons from Stage 2... Even automated, it can take a long time to traceroute all the way to several hundred sites, particularly if you go a full thirty hops (e.g., hit a firewall and loop); added -m 6 option to the traceroute (max hops of 6) Default time-to-wait is too long (use -w 2) If doing this on an automated basis, there's no need to resolve addresses on the traces (add -n option)

35 Lessons from Stage 2 (cont.) No need to send three packets; one will usually be enough (and will be faster than sending three) -- add the option -q 1 Still can't spot what's changed between runs Other folks want to see the output; need some way to share this data with others Phenomena existed that we hadn't expected
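Rolled together, the Stage 2 lessons turn the default traceroute into a much cheaper probe. A sketch that just assembles the command line (the peer host name is hypothetical):

```python
def fast_trace_cmd(host, max_hops=6, wait=2, probes=1):
    """Build a traceroute invocation using the Stage 2 lessons:
    -m caps the hop count, -w shortens the per-probe wait,
    -n skips DNS resolution, -q sends one probe per hop."""
    return ["traceroute", "-m", str(max_hops), "-w", str(wait),
            "-n", "-q", str(probes), host]

print(" ".join(fast_trace_cmd("news-peer.example.edu")))
# traceroute -m 6 -w 2 -n -q 1 news-peer.example.edu
```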

36 Unexpected phenomena included... We learned that some peers oscillated routinely between commodity providers Needed to think about what to do when sites became completely unreachable Decided that maybe we should draw an inference when ALL hosts which had been using a given exit suddenly stopped doing so

37 Stage 3 Exit Selection Analysis: The Great Webification... Abiel Reinhart joined us as an intern, and we needed a project for him to work on.

38 Goals of Stage 3… We wanted the perl filter converted into a web cgi-bin script with two main deliverables: (1) a snapshot web page showing the current mapping of specified peers to exits, and (2) a change file web page showing which peers had changed exits

39 Sample snapshot.html output (see
This report generated at 16:30:00 on 4/9/100
CWIX (68)
[etc.]
OGIG-D (62)
[etc.]
[etc.]

40 Interpreting snapshot.html Each exit has its own section Total number of peers using that exit is reported in parentheses Peers are alphabetized by reversed FQDN within each section Updated every fifteen minutes (via cron)
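Alphabetizing by reversed FQDN is a one-line sort key; a sketch with hypothetical peer names, showing how hosts end up grouped by domain:

```python
def reversed_fqdn_key(host):
    # "news.cis.example.edu" -> ("edu", "example", "cis", "news"),
    # so sorting clusters hosts from the same domain together
    return tuple(reversed(host.lower().split(".")))

peers = ["alpha.cs.univ-a.edu", "news.univ-b.edu", "feed.univ-a.edu"]
print(sorted(peers, key=reversed_fqdn_key))
# ['alpha.cs.univ-a.edu', 'feed.univ-a.edu', 'news.univ-b.edu']
```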

41 Sample changes.html output (see
15:15:00 UUNet --> OGIG-D [green]
14:45:00 CWIX --> UUNet [black]
[etc.]
12:45:00 CWIX --> OGIG-D [green]
[etc.]
06:30:00: Note: OGIG-D is now being used. [green]
06:30:00: Note: OGIG-S is now being used. [green]
[etc]
06:30:00 UUNet --> OGIG-D [green]
06:30:00 CWIX --> OGIG-D [green]
06:30:00 Verio --> OGIG-D [black]
06:30:00 Verio --> OGIG-S [black]
[etc]
06:15:00: Warning: OGIG-D is not being used. [red]
06:15:00: Warning: OGIG-S is not being used. [red]
06:15:00 OGIG-D --> UUNet [red]
06:15:00 OGIG-D --> CWIX [red]
06:15:00 OGIG-D --> Verio [black]
06:15:00 OGIG-S --> Verio [black]

42 Interpreting changes.html Entries are in reverse chronological order "Positive" changes are green, "negative" changes are red, neutral changes are black Spaces separate clumps of entries (by time) The example shows actual routing changes from 4/9 for actual news peers Also captured a 6:15-6:30 OGIG maintenance window (Greg loading S1 on the GSR...)
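The changes page falls out of comparing two consecutive snapshots. A minimal sketch of that comparison, using the exit names from the deck; the host names and the rule that moves onto the OGIG-* exits count as "positive" are assumptions for illustration:

```python
HPC_EXITS = {"OGIG-S", "OGIG-D"}  # assumed: moves onto these are "green"

def color(old, new):
    # green = moved onto HPC, red = fell off HPC, black = neutral
    if new in HPC_EXITS and old not in HPC_EXITS:
        return "green"
    if old in HPC_EXITS and new not in HPC_EXITS:
        return "red"
    return "black"

def diff_snapshots(prev, curr):
    """prev/curr map host -> exit; return changes.html-style entries,
    plus a note/warning when an exit gains or loses all of its hosts."""
    out = []
    for host in sorted(prev):
        if host in curr and prev[host] != curr[host]:
            out.append(f"{host} {prev[host]} --> {curr[host]}"
                       f" [{color(prev[host], curr[host])}]")
    for exit_ in sorted(set(prev.values()) - set(curr.values())):
        out.append(f"Warning: {exit_} is not being used. [red]")
    for exit_ in sorted(set(curr.values()) - set(prev.values())):
        out.append(f"Note: {exit_} is now being used. [green]")
    return out

prev = {"a.example.edu": "UUNet", "b.example.edu": "CWIX"}
curr = {"a.example.edu": "OGIG-D", "b.example.edu": "CWIX"}
print(diff_snapshots(prev, curr))
```

Run every fifteen minutes from cron against the latest snapshot, this produces exactly the kind of reverse-chronological change log shown in the sample.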

43 Lessons from Stage 3: Abiel did a great job. :-) Need to set up a policy to handle growth of the changes page (we set a max file size of 1000 lines) Running the code from a subnet other than the one that you're really interested in is okay, but not ideal (misses local breakage, and any applicable locally applied static routes for hosts with multiple interfaces)

44 Lessons from Stage 3 (cont.) Selection of hosts to trace to may be crucial (and should be the actual host(s) you really care about, not "straw dogs"), since at least some I2 schools are selectively advertising just portions of their network blocks

45 Lessons from Stage 3 (cont.) Assumption of symmetric routes is a bad one to make. Practically speaking, this means that if you nail up I2-only routes (based on exit data or other info), you do so at your peril. This problem may be relatively widespread; see, for example, Hank Nussbacher's paper:

46 Lessons from Stage 3 (cont.) General case: you can build useful network monitoring tools by automating simple interactively accessible building blocks Next project likely to be focused on generating daily SNMP-ish input octets/output octet summaries for all connectors of interest via the Abilene core node router proxy web interface

47 Want to try monitoring your exits? Send to and we'll be glad to share our code with you. Code includes some UO-specific stuff that you'll need to manually tweak (it isn't really productized for automatic/"zombie" installation); you'll also see that it does some extra stuff we haven't talked about (like generating I2-only static route entries), but you can simply comment that out.
