
1 Correlating Internet Performance & Route Changes to Assist in Trouble-shooting from an End-user Perspective
Les Cottrell, Connie Logg, Jiri Navratil, SLAC
Passive and Active Monitoring Workshop, Antibes, Juan-les-Pins, France, April 19-20, 2004
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP

2 Outline
A set of integrated measurement tools to aid in troubleshooting for the end “user”:
- Traceroute measurements/analysis
- Topology visualization
- Lightweight bandwidth estimation
- Overall visualization
- Automated detection of level-change anomalies
- Correlation of performance & route changes

3 Traceroute measurement
- Every 10 minutes, for each host, run standard traceroute: 2 sec timeout, 1 query/hop, <= 30 hops
- For some hosts use ICMP traceroute:
  - End host responds (7/40)
  - Intermediate host responds (1/40)
  - Two cases where UDP probes work better than ICMP
  - One case where neither ICMP nor UDP probes help
- Both forward & reverse (use ssh for the reverse route)
  - Need ssh access to the remote host for the reverse trace
  - Else no reverse route (not a disaster)
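The probe just described can be launched from a small wrapper; below is a minimal sketch assuming a standard Unix traceroute binary on the path. The run_traceroute name and the ssh comment are illustrative, not the production IEPM code.

```python
import subprocess

def run_traceroute(host: str, icmp: bool = False) -> str:
    """Run traceroute with the options described above: 2 s timeout per probe,
    1 query per hop, at most 30 hops.  Set icmp=True to use ICMP echo probes
    instead of the default UDP probes."""
    cmd = ["traceroute", "-w", "2", "-q", "1", "-m", "30"]
    if icmp:
        cmd.append("-I")   # ICMP echo probes instead of UDP datagrams
    cmd.append(host)
    # A reverse trace would run the equivalent command on the remote host over
    # ssh (e.g. ssh <remote> traceroute ... <local host>), when ssh access exists.
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    return result.stdout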

4 Significant changes
Compare the current and previous traceroutes:
- If traceroute reports “unknown host” => unknown (!)
- Else, for each hop/node:
  - If both the current & previous hops have valid IP addresses (i.e. the routers responded; a non-responding router makes traceroute report “*”):
    - If they differ, some kind of route change has occurred:
      - If the IPs share the first 3 octets => same subnet/colo (:)
      - Else if the IPs are in the same AS => same AS (a)
      - Else significant change => assign a unique route number
        - If only one hop differs => color the route # orange
        - Else => color the route # red
  - Else if 30 hops were used => no route change, but the last hop is unreachable (|)
    - If the last hop is not pingable => color it red (|)
    - Else => no route change (●)
  - Else if one or both IPs are “*” => route change unclear (*)
- If “Icmp checksum is wrong”, color the character orange
- If there is a significant bandwidth change, color the cell
- ASNs are obtained from whois servers; by caching we reduce the hits on the whois servers by a factor of 10 (90% hit rate)
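A rough sketch of this hop-by-hop classification is below. It assumes dotted-quad IP strings, “*” for non-responding routers, and a caller-supplied, cached whois-backed asn_lookup function; the marker characters and function names are illustrative, not the production analysis code.

```python
def classify_hop(prev_ip: str, curr_ip: str, asn_lookup) -> str:
    """Classify one hop of the current traceroute against the same hop of the
    previous one, roughly following the rules listed above."""
    if prev_ip == "*" or curr_ip == "*":
        return "*"          # router did not respond: route change unclear
    if prev_ip == curr_ip:
        return "."          # no change at this hop ("●" in the route table)
    if prev_ip.rsplit(".", 1)[0] == curr_ip.rsplit(".", 1)[0]:
        return ":"          # first 3 octets match: same subnet / colo
    if asn_lookup(prev_ip) == asn_lookup(curr_ip):
        return "a"          # different router but same Autonomous System
    return "R"              # significant change: would get a new route number

def classify_route(prev_hops, curr_hops, asn_lookup) -> str:
    """Return the per-hop markers, or "|" when 30 hops were used without
    reaching the destination (last hop unreachable)."""
    if len(curr_hops) >= 30:
        return "|"
    return "".join(classify_hop(p, c, asn_lookup)
                   for p, c in zip(prev_hops, curr_hops))
```

Caching the whois responses in asn_lookup is what cuts the whois load by roughly a factor of 10, as noted above.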

5 Route table
- Compact, so many routes can be seen at once
- History navigation
- Route # at the start of the day gives an idea of route stability
- Multiple route changes (due to GEANT), later restored to the original route
- Mouseover for hops & RTT
- Available bandwidth
- Raw traceroute logs for debugging
- Textual summary of traceroutes to send to the ISP
- Description of route numbers with the date last seen
- User-readable (web table) routes for this host for this day

6 Another example
- Get AS information for routes
- Level change
- Host not pingable
- TCP probe type
- Intermediate router does not respond
- ICMP checksum error

7 Midnight maintenance
Figure: available bandwidth and cross-traffic, SLAC to NIKHEF, showing the effect of midnight maintenance.

8 Topology
- Choose times and hosts and submit the request
- Nodes colored by ISP
- Mouseover shows node names
- Click on a node to see subroutes
- Click on an end node to see its path back
- Can also get raw traceroutes with ASes
Figure: traceroute topology by hour of day from SLAC via ESnet and GEANT to JAnet, IN2P3, CESnet and CLRC DL/CLRC, with an alternate route highlighted.

9 Available bandwidth
- Uses ABwE/abing (packet-pair dispersion)
- Needs a server at the remote end, or ssh to launch the server
- Fast (< 1 sec); lightweight: < 40 packets (5800 bytes) for both forward & reverse estimates
- Uses minimum delay for capacity, inter-packet dispersion for cross-traffic:
  Available BW = Capacity (from min delay) - Cross-traffic (from dispersion variation)
- Good agreement with other methods; even with poor absolute agreement (25% of cases) it can spot changes
- Also provides RTT
- Measurements are made to about 60 hosts at 5 minute intervals (deployed in IEPM, MonALISA, PlanetLab)
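A toy illustration of this capacity-minus-cross-traffic idea is sketched below. It is a simplified packet-pair model, assuming the minimum dispersion reflects the bottleneck capacity and the extra average dispersion reflects cross-traffic; it is not the actual ABwE/abing implementation, and the packet size, dispersions and function name are made up.

```python
def abwe_sketch(dispersions_s, pkt_bytes=1450):
    """Toy packet-pair estimate.  dispersions_s: measured inter-packet gaps
    (seconds) for a train of back-to-back packet pairs.
    Returns (capacity, cross_traffic, available) in Mbits/s."""
    d_min = min(dispersions_s)                    # least-queued pair -> bottleneck capacity
    d_avg = sum(dispersions_s) / len(dispersions_s)
    capacity = pkt_bytes * 8 / d_min / 1e6        # dynamic bottleneck capacity (DBC)
    # Extra dispersion beyond d_min is time the bottleneck spent on other traffic.
    cross_traffic = capacity * (d_avg - d_min) / d_avg
    available = capacity - cross_traffic          # Available BW = DBC - cross-traffic
    return capacity, cross_traffic, available
```

With made-up dispersions of 12, 15 and 13 microseconds, for instance, this gives roughly 967 Mbits/s capacity, ~97 Mbits/s cross-traffic and ~870 Mbits/s available bandwidth.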

10 Available bandwidth from SLAC to Caltech, Mar 19, 2004
Figure: dynamic bandwidth capacity (DBC), cross-traffic and iperf measurements; available bandwidth = DBC - cross-traffic.

11 Achievable throughput & file transfer
- IEPM-BW: high-impact measurements (iperf, bbftp, GridFTP ...) at regular intervals
Figure: time series of iperf, abing, bbftp and single-stream iperf throughput with minimum RTT, forward and reverse route changes marked, and a selectable focal area.

12 Put it all together
Two examples:
- Agreement of iperf & abing
- Route changes and available bandwidth

13 Forward and reverse routing changes
Figure: 28 days of bandwidth history from SLAC to Caltech (ABwE, iperf, iperf 1 stream, bbftp, RTT) with forward and reverse routing changes marked. During this time we can see several different situations caused by different routing: a new 1000 Mbits/s CENIC path, a drop to a 622 Mbits/s path, a drop to 100 Mbits/s caused by routing (BGP) errors, then back to the new CENIC path.
ABwE also works well on DSL and wireless networks.
Scatter-plot graphs of iperf versus ABwE on different paths (range 20–800 Mbits/s) show the agreement of the two methods over the 28-day history.

14 ESnet-LosNettos segment in the path
Changes in network topology (BGP) can result in dramatic changes in performance.
Figure: ABwE measurements (one/minute) for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am, showing dynamic bandwidth capacity (DBC), cross-traffic (XT) and available BW = DBC - XT in Mbits/s, together with samples of traceroute trees generated from the summary table. The drop in performance corresponds to moving from the original path (SLAC-CENIC-Caltech) to SLAC-ESnet-LosNettos (100 Mbps)-Caltech, and later back to the original path. The changes were detected by IEPM-Iperf and ABwE.
Notes:
1. Caltech was misrouted via the Los-Nettos 100 Mbps commercial net 14:00-17:00
2. ESnet/GEANT were working on routes from 2:00 to 14:00
3. A previous occurrence went unnoticed for 2 months
4. Next step is to auto-detect and notify

15 Automatic step change detection
- Too many graphs to review each morning!
- Motivated by a drop in bandwidth between SLAC & Caltech:
  - Started late August 2003
  - Reduced achievable throughput by a factor of 5
  - Not noticed until October 2003
  - Caused by faulty routing over a commercial network
  - After notifying the ISP, it was fixed in 4 hours!
- See for details
Figure: SLAC to Caltech achievable throughput, April – November 2003, with the start of the drop marked.

16 Automatic available bandwidth step change detection
Still developing, evolving from earlier NLANR work on arithmetic weighted moving averages. Roughly speaking:
- A history buffer describes past behavior (duration currently 600 mins)
- Plus a trigger buffer of data suggesting a change; the trigger buffer duration indicates how long the change has to persist
- The history mean (m) and std. dev. (s) are used by the trigger selector:
  - If new_value is outside m +- sensitivity*s, add it to the trigger buffer
  - If new_value is outside m +- 2*sensitivity*s, it is also an outlier (don't add to the stats)
  - Else it goes in the history buffer
Current parameter settings:
- History buffer duration: 600 mins
- Trigger buffer duration: 30 mins
- Threshold: 40%
- Sensitivity: 2 (NLANR typically set to 1)
Differences from the NLANR algorithm:
- Use standard deviations instead of variances
- Use a threshold instead of a count of triggers
- Low-variation prohibition (flatlining avoidance): use 10% instead of 20%
- Track both increases and decreases
In progress & futures:
- Evaluate optimum parameters
- Generate alarms with filters
- Possibly look at medians and percentiles to replace the mean and standard deviation
- See if periodic (e.g. diurnal) changes can be folded in
- Look at increases AND decreases (e.g. restoration)
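A compact sketch of this scheme follows, implementing both the history/trigger buffers described here and the decision steps on the next slide. Buffer lengths are given as sample counts (with 5-minute measurements, 120 history samples and 6 trigger samples roughly correspond to the 600 and 30 minute durations); the StepDetector name and defaults are illustrative rather than the actual IEPM code, and the flatlining-avoidance rule is omitted.

```python
from collections import deque
from statistics import mean, stdev

class StepDetector:
    """Sketch of the history/trigger-buffer step-change detector described above."""

    def __init__(self, history_len=120, trigger_len=6,
                 threshold=0.40, sensitivity=2.0):
        self.history = deque(maxlen=history_len)  # describes past behavior
        self.trigger = []                         # recent values suggesting a change
        self.trigger_len = trigger_len
        self.threshold = threshold                # relative change needed for an alert
        self.sensitivity = sensitivity            # NLANR typically uses 1
        self.direction = 0                        # sign of the suspected change

    def add(self, value):
        """Feed one measurement; return True when a step change is flagged."""
        if len(self.history) < 10:                # need a little history for the stats
            self.history.append(value)
            return False
        m, s = mean(self.history), stdev(self.history)
        if abs(value - m) <= self.sensitivity * s:
            self.history.append(value)            # ordinary sample -> history buffer
            return False
        # Trigger value: outside m +/- sensitivity*s.  (The original also marks
        # values beyond 2*sensitivity*s as outliers kept out of the statistics.)
        direction = 1 if value > m else -1
        if self.trigger and direction != self.direction:
            self.history.extend(self.trigger)     # direction flipped: reset trigger buffer
            self.trigger = []
        self.direction = direction
        self.trigger.append(value)
        if len(self.trigger) < self.trigger_len:
            return False
        mt = mean(self.trigger)                   # trigger buffer is full
        if m and abs(m - mt) / m > self.threshold:
            self.history.extend(self.trigger)     # accept the new level
            self.trigger = []
            return True                           # alert: step change detected
        self.trigger.pop(0)                       # else slide the trigger window
        return False
```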

17 Algorithm If this is a trigger value compare with m and save direction of change If this is a trigger and the direction has changed, reset trigger buffer Move trigger data to history buffer, recalculate stats, clear trigger buffer If trigger buffer full calculate trigger mean mt and st If (m - mt)/ m > threshold then a & reset trigger buffer Else remove oldest value from trigger buffer

18 Examples
- SLAC to Caltech available bandwidth, April 6-8, 2004, with alerts and a route change marked
  - History duration: 600 mins, trigger duration: 30 mins, threshold: 40%, sensitivity: 2
  - With trigger duration 60 only one alert is seen; with trigger duration 10 more alerts are caught
- SLAC to NIKHEF (Amsterdam) available bandwidth (Mbit/s), with route changes and an unreachable period marked
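Using the same StepDetector and synthetic samples from the sketches above, the effect of the trigger-buffer duration noted here can be reproduced qualitatively; the sample counts 2, 6 and 12 roughly stand in for 10, 30 and 60 minute durations at 5-minute sampling.

```python
# Sweep the trigger-buffer length: longer buffers demand more persistent
# changes and so produce fewer alerts, shorter buffers catch more.
for trig_len in (2, 6, 12):
    det = StepDetector(trigger_len=trig_len)
    alerts = sum(det.add(bw) for bw in samples)
    print(f"trigger_len={trig_len:2d}: {alerts} alert(s)")
```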

19 BW vs route changes
Route & throughput changes from 11/28/03 through 2/2/04:
- Most (80%) route changes do not result in a throughput change
- About half of the throughput changes are due to route changes

Location (# nodes) | # route chgs | # with thru inc. | # with thru decr. | # thru chgs | # thru with rte | # thru chg w/o rte
Europe (8)         | 370          | 2                | 4                 | 10          | 6               |
Canada & US (21)   | 1206         | 24               | 25                | 71          | 49              | 22 (1)
Japan (13)         | 142          | 9                | 5                 |             |                 |

(1) Of the 22 throughput changes without a route change for Canada & US, 9 were regular variations on Friday nights due to regularly scheduled large data transfers on one of the target networks.

20 More information
- ABwE
- IEPM traceroute examples
- Step change analysis

