
1 Evaluation of Techniques to Detect Significant Performance Problems using End-to-end Active Network Measurements
Les Cottrell, SLAC
2006 IEEE/IFIP Network Operations & Management Symposium
www.slac.stanford.edu/grp/scs/net/talk06/noms-detect-apr06.ppt
Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM)

2 Outline
Why do we want forecasting & anomaly detection?
What are we using for the input data
–And what are the problems
How do we make forecasts and detect anomalies?
–First approaches
–The real world
Results
Conclusions & Futures
Possible uses

3 Uses of Techniques
Automated problem identification:
–Admins cannot review hundreds of graphs each day
–Alerts for network administrators, e.g. bandwidth changes in time series, iperf, SNMP
–Alerts for systems people: OS/host metrics
–Anomalies for security
Forecasts (a fallout of the techniques) for Grid middleware, e.g. replica manager, data placement

4 Data

5 Measurement Topology
40 target hosts in 13 countries
Bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s
Traverse ~50 ASes, 15 major Internet providers
5 targets at PoPs, rest at end sites

6 Using Active IEPM-BW measurements
Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. the HEP tiered model
Makes regular measurements:
–Ping (RTT, connectivity), traceroute
–pathchirp, pathload, ABwE (packet pair dispersion)
–iperf (single & multi-stream), thrulay
Lots of analysis and visualization
Running at CERN, SLAC, FNAL, BNL, Caltech to about 40 remote sites
–http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html

7 abing
Uses packet pair dispersion of 20 packets to provide:
–Capacity, cross-traffic, available bandwidth
–At 3 minute intervals
–Very noisy time-series data (capacity shown moving-averaged over 1 hour)
Started with this: knew the developer; simple method (a sketch of the idea follows)
[Figure: minimum packet spacing is set at the bottleneck and preserved on higher-speed links downstream]
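
A minimal sketch of the packet-pair idea behind abing, not abing's actual code: the minimum spacing between back-to-back packets equals the transmission time of one packet on the bottleneck link and is preserved on faster downstream links, so capacity ≈ packet size / minimum dispersion. The packet size here is an assumed value.

```python
PACKET_BITS = 1500 * 8  # assumed probe packet size: 1500 bytes

def capacity_bps(arrival_times):
    """Estimate bottleneck capacity (bits/s) from a packet train's arrivals."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    # The smallest positive inter-arrival gap approximates the bottleneck's
    # per-packet transmission time.
    return PACKET_BITS / min(g for g in gaps if g > 0)

# Example: a 20-packet train arriving ~12 us apart -> ~1 Gbit/s bottleneck
print("%.0f Mbits/s" % (capacity_bps([i * 12e-6 for i in range(20)]) / 1e6))
```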

8 Available bandwidth
Accuracy:
–From the PAM paper, pathload is most accurate, followed by pathchirp; abing has problems at lower speeds
Overhead:
–pathload has 100 times the network overhead of pathchirp and abing
Measurement duration:
–abing takes < 1 sec per measurement
–pathchirp takes ~10 secs
–pathload takes tens of seconds, depends on RTT, and can time out
Consistency of results:
–abing very noisy
–pathchirp in between
–pathload smoother, but multi-modal
[Figure: pathload vs. pathchirp time series, SLAC-Caltech, March '06]

9 Iperf vs thrulay
Both give TCP achievable throughput
thrulay is more manageable & also gives RTT
They agree well
Throughput ~ 1/avg(RTT)
For big RTT need multi-streams
[Figure: achievable throughput (Mbits/s) vs. RTT (ms), with minimum, average and maximum RTT marked]
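
The 1/avg(RTT) scaling, and the need for multiple streams on large-RTT paths, are consistent with the standard TCP throughput model of Mathis et al. (a step not spelled out on the slide); one common statement is:

```latex
\mathrm{throughput} \;\lesssim\; \frac{MSS}{RTT}\cdot\frac{C}{\sqrt{p}},
\qquad C \approx 1.22
```

where MSS is the segment size and p the loss probability; n parallel streams raise the bound roughly n-fold, which is why big-RTT paths benefit from multi-streaming.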

10 Forecasting and Anomaly detection

11 Anomaly Detection
An anomaly is when the actual value differs significantly from the expected value
–So we need forecasts to find anomalies
–Focus was initially on abing time-series measurements:
One measurement every 3 minutes
Low network impact, BUT very noisy, so a hard test case

12 Plateau, most intuitive
For each observation (a sketch follows):
–If outside the history buffer mean m_h ± β·σ_h, add to the trigger buffer
–Else add to history, and remove the oldest point from the trigger buffer
When the trigger buffer fills past its threshold, a trigger is issued:
–Check that (m_h − m_t) / m_h > D and that ≥ 90% of the trigger points arrived in the last T minutes; if so, declare an event
–Move the trigger buffer into the history buffer
Parameters: history length = 1 day, trigger length = 3 hours, β = standard deviations = 2
[Figure: observations vs. the history mean and the history mean − 2·stdev envelope, with trigger-buffer % full and the event marked]
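
A minimal sketch of the plateau detector as described above, not the SLAC Perl implementation; it simplifies the "90% of trigger in last T minutes" recency check to "trigger buffer ≥ 90% full". With one observation every 3 minutes, 1 day ≈ 480 points and 3 hours ≈ 60 points.

```python
from collections import deque
from statistics import mean, stdev

HISTORY_LEN = 480   # history buffer, ~1 day of 3-minute observations
TRIGGER_LEN = 60    # trigger buffer, ~3 hours
BETA = 2.0          # standard-deviation multiplier for the envelope
D = 0.3             # relative-drop threshold on (m_h - m_t) / m_h

history = deque(maxlen=HISTORY_LEN)
trigger = deque(maxlen=TRIGGER_LEN)

def observe(x):
    """Feed one observation; return True when a plateau event is declared."""
    if len(history) < 10:              # warm up the history buffer first
        history.append(x)
        return False
    m_h, s_h = mean(history), stdev(history)
    if abs(x - m_h) > BETA * s_h:      # outside the envelope: candidate point
        trigger.append(x)
    else:                              # normal point: history grows, trigger ages
        history.append(x)
        if trigger:
            trigger.popleft()
    if len(trigger) >= int(0.9 * TRIGGER_LEN):
        m_t = mean(trigger)
        if (m_h - m_t) / m_h > D:      # significant drop relative to history
            history.extend(trigger)    # the new level becomes the history
            trigger.clear()
            return True
    return False
```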

13 K-S
For each observation, compare the previous 100 observations with the next 100 observations:
–Compare the maximum vertical difference between the two CDFs
–How does it differ from random CDFs?
–Expressed as a % difference (see the sketch below)
Compare K-S with Plateau
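
A minimal sketch of this sliding-window comparison, not the C implementation used at SLAC: SciPy's two-sample K-S statistic is exactly the maximum vertical CDF difference, and the slide's "% difference" is that statistic expressed as a percentage.

```python
import numpy as np
from scipy.stats import ks_2samp

WINDOW = 100      # observations on each side of the candidate change point

def ks_percent(series):
    """Yield (index, percent) for each point with full windows on both sides."""
    x = np.asarray(series, dtype=float)
    for i in range(WINDOW, len(x) - WINDOW):
        stat, _pvalue = ks_2samp(x[i - WINDOW:i], x[i:i + WINDOW])
        yield i, 100.0 * stat   # e.g. flag an event when this exceeds 70%
```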

14 Compare
Results from K-S & Plateau are very similar, using a K-S coefficient threshold of 70%
The current Plateau only finds negative changes
–Detecting positive changes too is useful to see when a condition returns to normal
K-S is implemented in C and executes faster than Plateau (in Perl); this depends on parameters
K-S is more formalized
Plateau and K-S work well for non-seasonal observations (e.g. small day/night changes)

15 Seasons & false alerts
Congestion on Monday following a quiet weekend inflates the forecast and gives a false alert
Also, a history buffer whose length is not a whole day causes the history mean to drift out of phase with the diurnal cycle of the observations

16 Effect on events
Change in bandwidth (drops) between 19:00 & 22:00 Pacific Time (7:00-10:00am PK time)
Causes more anomalous events around this time

17 Seasonal Changes
Use the Holt-Winters (H-W) technique:
–A triple exponentially weighted moving average
–Three terms, each with its own parameter (α, β, γ), that take into account local smoothing, long-term seasonal smoothing, and trends
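
For reference, one common statement of the multiplicative H-W recursions in the style of the NIST handbook cited later (parameter naming varies between formulations, so take the symbols as assumptions):

```latex
\begin{aligned}
S_t &= \alpha\,\frac{y_t}{I_{t-L}} + (1-\alpha)\,(S_{t-1}+b_{t-1}) && \text{level (local smoothing)}\\
b_t &= \gamma\,(S_t - S_{t-1}) + (1-\gamma)\,b_{t-1} && \text{trend}\\
I_t &= \beta\,\frac{y_t}{S_t} + (1-\beta)\,I_{t-L} && \text{seasonal index, season length } L\\
F_{t+m} &= (S_t + m\,b_t)\,I_{t-L+m} && \text{forecast } m \text{ steps ahead}
\end{aligned}
```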

18 H-W Implementation
Needs regularly spaced data (else going back one season is difficult and gets out of sync):
–Interpolate the data: select a bin size and average the points in each bin
–If there are no points in a first-week bin, take data from future weeks
–For following weeks, missing bins are filled from the previous week
Initial values for smoothing from the NIST "Engineering Statistics Handbook"
Choose parameters by minimizing (1/N)Σ_t(F_t − y_t)², as sketched below
–F_t = forecast for time t as a function of the parameters, y_t = observation at time t
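
A runnable sketch of choosing the smoothing parameters by minimizing the mean squared one-step forecast error, as the slide describes. The simple additive H-W recursion below is illustrative, not the SLAC code, and the synthetic series stands in for real, already-interpolated data.

```python
import numpy as np
from scipy.optimize import minimize

def hw_forecasts(y, L, alpha, beta, gamma):
    """One-step-ahead additive Holt-Winters forecasts; L = season length."""
    level = y[:L].mean()
    trend = (y[L:2 * L].mean() - y[:L].mean()) / L   # needs >= 2 seasons of data
    seasonal = list(y[:L] - level)
    F = np.empty(len(y))
    for t in range(len(y)):
        s = seasonal[t % L]
        F[t] = level + trend + s                      # forecast before seeing y[t]
        prev = level
        level = alpha * (y[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev) + (1 - beta) * trend
        seasonal[t % L] = gamma * (y[t] - level) + (1 - gamma) * s
    return F

def mse(params, y, L):
    # (1/N) * sum (F_t - y_t)^2, the objective from the slide
    return np.mean((hw_forecasts(y, L, *params) - y) ** 2)

# Example fit on synthetic data: 4 seasons of a weekly cycle in hourly bins.
L = 168
t = np.arange(4 * L)
y = 100.0 + 10.0 * np.sin(2 * np.pi * t / L) \
    + np.random.default_rng(0).normal(0, 2, t.size)
fit = minimize(mse, x0=[0.5, 0.1, 0.1], args=(y, L),
               bounds=[(0.0, 1.0)] * 3, method="L-BFGS-B")
alpha, beta, gamma = fit.x
```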

19 H-W Implementation
Three implementations evaluated (two new):
–FNAL (Maxim Grigoriev): the inspiration for evaluating this method
–Part of RRD (Brutlag): limited control over what it produces and how it works
–SLAC: implemented the NIST formulation (different formulation/parameter values from Brutlag/FNAL), and added minimizing the sum of squares to choose the parameters

20 Results

21 Example
Local smoothing: 99% of the weight from the last 24 hours
Linear trend: 50% from the last 24 hours
Seasonal term: mainly from last week, but includes several weeks
Within an 80 minute window, 80% of points outside the deviation envelope ≡ an event
[Figure: observations (1 hr avg), forecast and deviation envelope over a weekend and weekdays]
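
A minimal sketch of that event rule, assuming 3-minute bins so an 80-minute window is ~27 points; the `forecast` and `deviation` arrays would come from the H-W smoother (e.g. a smoothed absolute one-step error), which is not shown here.

```python
import numpy as np

def envelope_events(y, forecast, deviation, window=27, frac=0.8):
    """Return window-end indices where >= frac of points left the envelope."""
    outside = np.abs(np.asarray(y) - forecast) > deviation
    return [t for t in range(window, len(outside) + 1)
            if outside[t - window:t].mean() >= frac]
```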

22 Evaluation
Created a library of time series covering 100 days, June through September 2004, for 40 hosts
Analyzed using Plateau and saved all events where the trigger buffer filled (no filters on step size)
–23 hosts had 120 candidate events
–Event types: steps; diurnal changes; congestion from cron jobs, bandwidth tests, flash crowds
Classified the ~120 events as to whether they were interesting
–Interesting = a large, sharp drop in bandwidth that persists for >> 3 hrs

23 Results
K-S shows similar results to Plateau (K-S run with ±100 observations)
As parameters are adjusted to reduce false positives, missed events increase
–E.g. for Plateau (β = 2) with a trigger buffer of 3 hrs filled to 90% in < 220 minutes and a history buffer of 1 day, the effect of the threshold D = (m_h − m_t)/m_h:

D     False   Miss
10%   16%     8%
30%   2%      32%

We are generating emails from events and gathering extra diagnostics, sent as email to net admins.

24 Conclusions
A few paths (~10%) have strong seasonal effects
Plateau & K-S work well if there are only weak seasonal effects
–K-S detects both step-downs & step-ups, and also gives an accurate time estimate of the event (good for correlations)
H-W is promising for seasonal effects, but:
–Is more complex and requires more parameters, which may not be easy to estimate
–Requires regular data (an interpolation step)
–Can be used to remove seasonal effects before applying Plateau
CPU time can depend critically on the parameters chosen, e.g. increasing the K-S range from ±100 to say ±400 increases CPU time by a factor of 14
H-W works, but we still need to quantify its effectiveness

25 Current & Future Work
Try a different objective function to minimize for the H-W parameters
Study the effect of other metrics:
–Frequency of measurements, speed of detection
–Noisiness (min-RTT and pathload are smoother)
Future development:
–PCA, to enable looking at multiple measurements simultaneously (e.g. RTT, loss, capacity ...; multiple routes)
–Neural networks, wavelets, ARIMA ...
Interpolate heavyweight/infrequent measurements based on lightweight, more frequent ones
Netflow passive exploration
Manually study events, leading to the development of automated diagnosis of events: traceroutes, router utilizations, host measurements

26 More information
SLAC Plateau implementation
–www.acm.org/sigs/sigcomm/sigcomm2004/workshop_papers/nts26-logg1.pdf
SLAC H-W implementation
–www-iepm.slac.stanford.edu/monitoring/forecast/hw.html
Eng. Statistics Handbook
–http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc435.htm
IEPM-BW Measurement Infrastructure
–http://www-iepm.slac.stanford.edu/

