Les Cottrell & Yee-Ting Li, SLAC

Presentation transcript:

Terapaths Monitoring DWMI: Datagrid Wide Area Monitoring Infrastructure
Les Cottrell & Yee-Ting Li, SLAC
US-LHC End-To-End Networking Meeting, FNAL, October 25, 2006

Forecasting Network Performance
Predicting how long a file transfer will take requires forecasting network and application performance. However, such forecasting is beset with problems. These include seasonal (e.g. diurnal) variations in the measurements; the increasing difficulty of making accurate, low-intrusiveness active measurements, especially on high speed (>1 Gbits/s) networks and with Network Interface Card (NIC) offloading; the intrusiveness of making more realistic active measurements on the network; the differences between network and large file transfer performance; and the difficulty of getting sufficient relevant passive measurements to enable forecasting. We will discuss each of these problems, compare and contrast the effectiveness of various solutions, look at how some of the methods may be combined, and identify practical ways to move forward.
Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM), and by Internet2.

Active E2E monitoring is at layer 3 or 4
- Layers 1 and 2 are less well exploited, understood, or related to applications.
- There are also lots of instances: FC, FICON, 10GE, SONET, OC192, SDH …
- Check vendor specs, e.g. Cisco, Juniper etc.

SONET monitoring (from Endace): the PHYMON occupies 1U (OC3/12) or 2U (OC48/192) of vertical rack space and is equipped with two 10/100/1000 copper Ethernet interfaces for control and reporting via LAN. Key features:
- Monitors up to two OC3/OC12/OC48/OC192 network links
- Detects link-layer failures: LOS-S, LOF-S, AIS-L, REI-L, RDI-L, AIS-P, LOP-P, UNEQ-P, REI-P, RDI-P
- Derives errors: CV, ES, ESA, ESB, SES and UAS according to the Bellcore GR-253, Issue 2 Rev 2 standard
- Sends SNMP traps for all failures and error thresholds according to user configuration
- Reports current status in real time via telnet, ssh or serial connection
- Reports accumulated status for 15m, 1h, 8h, 24h, 7d intervals
- Retains historical data for 35 days
- Supplies all the underlying data for the SNMP SONET MIB (RFC 2558)

Techniques: loop back, test patterns (BERT), e.g. ones & zeroes, various ITU-T specs. Metrics: loss of signal, out of frame, loss of frame, errored seconds, code violations, unavailable seconds, alarms, near & far end, how often, history.

Active IEPM-BW measurements
- Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. the HEP tiered model
- Makes regular measurements with probe tools:
  - ping (RTT, connectivity), owamp (one-way delay)
  - traceroute (routes)
  - pathchirp, pathload (available bandwidth)
  - iperf (one & multi-stream), thrulay (achievable throughput)
  - supports bbftp, bbcp (file transfer applications, not network)
  - looking at GridFTP, but it is complex since it requires renewing certificates
- Choice of probes depends on the importance of the path, e.g.
  - for major paths (tier 0, 1 & some 2) use the full suite
  - for tier 3 use just ping and traceroute
- Running at major HEP sites: CERN, SLAC, FNAL, BNL, Caltech, Taiwan, SNV to about 40 remote sites
- http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html
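The tier-based choice of probes above can be pictured as a small lookup. This is only a sketch: the tool names come from the slide, but the function name, the `major` flag, and the decision to give non-major tier 2 paths the light suite are assumptions made for illustration, not the IEPM-BW configuration format.

```python
# Hypothetical probe-selection table. Tool names are from the slide;
# the data layout and function are illustrative assumptions.
FULL_SUITE = [
    "ping", "owamp", "traceroute",   # delay, connectivity, routes
    "pathchirp", "pathload",         # available bandwidth
    "iperf", "thrulay",              # achievable throughput
    "bbftp", "bbcp",                 # file transfer applications
]
LIGHT_SUITE = ["ping", "traceroute"]


def probes_for_path(tier, major=False):
    """Return the list of probe tools to run against a site of this tier.

    Tier 0 and 1 paths always get the full suite. The slide says only
    *some* tier 2 paths do, so the caller marks those with major=True.
    Giving the remaining tier 2 paths the light suite is an assumption.
    """
    if tier in (0, 1) or (tier == 2 and major):
        return list(FULL_SUITE)
    return list(LIGHT_SUITE)
```

For example, `probes_for_path(3)` yields just ping and traceroute, while `probes_for_path(2, major=True)` yields the full suite.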

IEPM-BW Measurement Topology
- 40 target hosts in 13 countries
- Bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s
- Traverse ~50 ASes, 15 major Internet providers
- 5 targets at PoPs, rest at end sites
- Added Sunnyvale for UltraLight
- Covers all USATLAS tier 0, 1, 2 sites
- Adding FZK Karlsruhe

Top page

Visualization & Forecasting in Real World

Examples of real data
[Figure: three time-series panels. Caltech, thrulay (labels: 800 Mbps, Nov05, Mar06): misconfigured windows, new path, very noisy. UToronto, miperf (labels: 250 Mbps, Nov05, Jan06): seasonal effects, daily & weekly. UTDallas, pathchirp, thrulay and iperf (labels: 120 Mbps, Mar-10-06, Mar-20-06).]
- Some are seasonal, others are not
- Events may affect multiple metrics
- Events can be caused by host or site congestion
- Few route changes result in bandwidth changes (~20%)
- Many significant events are not associated with route changes (~50%)

Scatter plots & histograms
- Scatter plots: quickly identify correlations between metrics
- Histograms: quickly identify variability or multimodality
[Figure: scatter plots of thrulay (Mbps) against RTT (ms) and against pathchirp & iperf (Mbps); histograms of throughput (Mbits/s) for pathchirp and thrulay.]

Changes in network topology (BGP) can result in dramatic changes in performance
[Figure: snapshot of the traceroute summary table by hour, and samples of traceroute trees generated from the table, showing Los-Nettos (100 Mbps) and the remote host.]
Notes:
1. Caltech misrouted via Los-Nettos 100 Mbps commercial net 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went unnoticed for 2 months
4. Next step is to auto-detect and notify
[Figure: ABwE measurements in Mbits/s, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am. Curves: dynamic BW capacity (DBC), cross-traffic (XT), and available BW = (DBC - XT). Annotations: drop in performance when the path changed from the original SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech; ESnet-LosNettos segment in the path (100 Mbits/s); back to original path.]
Changes detected by IEPM-Iperf and ABwE.
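The relation on the plot, available BW = DBC - XT, is simple enough to write down directly. The sketch below is illustrative only: the function name is made up, the clamp at zero is an assumption (noisy estimates of XT can exceed DBC), and the 30 Mbits/s cross-traffic figure in the usage note is a hypothetical number, not a measurement from the talk.

```python
def available_bandwidth(dbc_mbps, xt_mbps):
    """Available bandwidth as dynamic bottleneck capacity minus cross-traffic.

    Both arguments are in Mbits/s. Clamping at zero is an assumption made
    here so that a noisy cross-traffic estimate never yields a negative
    result.
    """
    return max(0.0, dbc_mbps - xt_mbps)
```

With the 100 Mbits/s Los-Nettos segment in the path and, say, 30 Mbits/s of cross-traffic, `available_bandwidth(100.0, 30.0)` gives 70.0 Mbits/s, which shows why a reroute onto a low-capacity segment appears as a sharp step down in the available bandwidth series.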

However…
- Elegant graphics are great for understanding problems, BUT:
  - There can be thousands of graphs to look at (many site pairs, many devices, many metrics)
  - Need automated problem recognition AND diagnosis
- So we are developing tools to reliably detect significant, persistent changes in performance
  - Initially using a simple plateau algorithm to detect step changes
  - Holt-Winters for forecasting if there are seasonal effects
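The slide does not spell out the plateau algorithm, so the following is only a simplified sketch of the general idea of detecting a persistent step down: keep a history buffer of "normal" samples, collect consecutive low samples in a trigger buffer, and report an event once the trigger buffer fills and its mean is well below the history mean. The function name, buffer lengths, and thresholds are all assumptions chosen for illustration, not the values used by IEPM-BW.

```python
from statistics import mean, pstdev


def detect_drops(samples, history_len=20, trigger_len=5,
                 sigmas=2.0, min_rel_drop=0.2):
    """Return a list of (index, old_mean, new_mean) for persistent drops.

    All parameter defaults are illustrative assumptions.
    """
    history = []   # recent samples considered normal
    trigger = []   # consecutive (index, value) pairs well below normal
    events = []
    for i, x in enumerate(samples):
        if len(history) < history_len:
            history.append(x)          # still learning the baseline
            continue
        m = mean(history)
        s = pstdev(history)
        if x < m - sigmas * s:
            trigger.append((i, x))
            if len(trigger) == trigger_len:
                new_mean = mean([v for _, v in trigger])
                if new_mean < m * (1.0 - min_rel_drop):
                    # Persistent and significant: report where it began.
                    events.append((trigger[0][0], m, new_mean))
                # The low plateau becomes the new baseline.
                history = [v for _, v in trigger]
                trigger = []
        else:
            # A normal sample ends any run of low samples, so isolated
            # outliers never fill the trigger buffer.
            trigger = []
            history.append(x)
            history.pop(0)
    return events
```

This version only looks for drops; detecting step increases would need the mirror-image test. Requiring several consecutive low samples is what keeps single noisy measurements from raising an event.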

Seasonal effects on events
- Change in bandwidth (drops) between 19:00 & 22:00 Pacific Time (7:00-10:00am PK time)
- Causes more anomalous events around this time

Forecasting
- Over-provisioned paths should have pretty flat time series
- But seasonal trends (diurnal, weekly) need to be accounted for on about 10% of our paths
- Use Holt-Winters triple exponential weighted moving averages:
  - Short/local term smoothing
  - Long term linear trends
  - Seasonal smoothing
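To make the three smoothing components concrete, here is a minimal sketch of additive Holt-Winters (triple exponential smoothing). It is a textbook formulation, not the IEPM-BW implementation; the function name, the initialisation from the first two seasons, and the default smoothing constants are assumptions.

```python
def holt_winters_additive(series, season_len, alpha=0.5, beta=0.1,
                          gamma=0.1, horizon=1):
    """Forecast `horizon` points ahead with additive Holt-Winters.

    alpha smooths the level (short/local term), beta the linear trend
    (long term), gamma the seasonal offsets. Defaults are illustrative.
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    # Initialise from the first two seasons (one common convention).
    first = sum(series[:season_len]) / season_len
    second = sum(series[season_len:2 * season_len]) / season_len
    level = first
    trend = (second - first) / season_len
    seasonal = [series[i] - first for i in range(season_len)]

    for t in range(season_len, len(series)):
        x = series[t]
        s = seasonal[t % season_len]
        prev_level = level
        level = alpha * (x - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (x - level) + (1 - gamma) * s

    n = len(series)
    return [level + h * trend + seasonal[(n + h - 1) % season_len]
            for h in range(1, horizon + 1)]
```

For hourly throughput samples with a diurnal pattern one would use `season_len=24`; the forecast can then be compared with each new measurement, so that only departures from the expected daily cycle are flagged as anomalous.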

Experimental alerting
- Have false positives down to a reasonable level (a few per week), so we are sending alerts to developers
- Alerts are saved in a database
- Alerts link to traceroutes, event analysis, and time series

Passive vs. active monitoring
- Active monitoring
  - Pro: regularly spaced data on known paths, can make measurements on demand
  - Con: adds data to the network, can interfere with real data and measurements
- What about passive?
  - Need a replacement for active packet pair and throughput measurements
  - Evaluating whether we can use Netflow for this

Netflow et al.
- Switch identifies a flow by src/dst addresses, ports, and protocol
- Cuts a record for each flow: src, dst, ports, protocol, QoS, start and end time
- Collect the records and analyze them
  - Can be a lot of data to collect each day, and needs a lot of CPU
  - Hundreds of MBytes to GBytes
- No intrusive traffic; real traffic, collaborators, applications
- No accounts/pwds/certs/keys
- No reservations etc.
- Characterize traffic: top talkers, applications, flow lengths etc.
- NetraMet and SCAMPI are a couple of non-commercial flow projects
- IPFIX is an IETF standardization effort for Netflow-type passive monitoring
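As a rough picture of what analysing flow records looks like, the sketch below models a record with the fields listed on the slide and ranks "top talkers" by bytes sent. The class and field names are hypothetical and do not follow any particular Netflow export format; the byte count field is an assumption, added because ranking talkers needs a volume measure.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One flow record; field names are illustrative, not a real format."""
    src: str
    dst: str
    src_port: int
    dst_port: int
    protocol: str
    start: float    # seconds since some epoch
    end: float
    nbytes: int     # assumed volume field


def top_talkers(records, n=3):
    """Return the n source addresses that sent the most bytes."""
    totals = Counter()
    for r in records:
        totals[r.src] += r.nbytes
    return totals.most_common(n)
```

The same pattern, keyed on `dst_port` or on flow duration instead of `src`, gives the application and flow-length breakdowns mentioned above.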

Application to LHCnet
- LHC-OPN requires edge routers to provide Netflow data
- SLAC is developing Netflow visualization at BNL
  - Allows selection of destinations, services
  - Displays time series, tables, pie charts, spider plots
- Will port to other LHC-OPN sites
[Figure: screenshots of the visualization, annotated "Choose aggregation" and "Choose services".]

Netflow limitations
- Use of dynamic ports makes it harder to detect the application
  - GridFTP, bbcp, bbftp can use fixed ports (but may not)
  - P2P often uses dynamic ports
  - For FTP, port 21 is only used as a control channel
- Discriminate the type of flow based on headers (not relying on ports)
  - Types: bulk data, interactive …
  - Discriminators: inter-arrival time, length of flow, packet length, volume of flow
  - Use machine learning/neural nets to cluster flows
  - E.g. http://www.pam2004.org/papers/166.pdf
- Aggregation of parallel flows (needs care, but not difficult)
- Can use for giving a performance forecast
- Unclear whether it can be used for detecting steps in performance
- SCAMPI/FFPF/MAPI allows more flexible flow definition than Netflow
  - See www.ist-scampi.org/
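The slide notes that aggregating parallel flows needs care. One possible approach, sketched here under stated assumptions, is to group flows by host pair (ignoring ports, since parallel streams use different ones) and merge those that overlap or nearly overlap in time into a single transfer. The tuple layout, function names, and the `max_gap` tolerance are all illustrative assumptions, not the method used in the SLAC analysis.

```python
def aggregate_parallel_flows(flows, max_gap=1.0):
    """Merge time-overlapping flows between the same hosts.

    flows: iterable of (src, dst, start, end, nbytes) tuples, times in
    seconds. Flows starting within max_gap seconds of the current
    transfer's end are treated as part of it (an assumed tolerance).
    Returns a list of (src, dst, start, end, total_bytes) transfers.
    """
    by_pair = {}
    for src, dst, start, end, nbytes in flows:
        by_pair.setdefault((src, dst), []).append((start, end, nbytes))

    transfers = []
    for (src, dst), items in by_pair.items():
        items.sort()                       # order by start time
        cur_start, cur_end, cur_bytes = items[0]
        for start, end, nbytes in items[1:]:
            if start <= cur_end + max_gap:
                cur_end = max(cur_end, end)
                cur_bytes += nbytes
            else:
                transfers.append((src, dst, cur_start, cur_end, cur_bytes))
                cur_start, cur_end, cur_bytes = start, end, nbytes
        transfers.append((src, dst, cur_start, cur_end, cur_bytes))
    return transfers


def throughput_mbps(transfer):
    """Average throughput of an aggregated transfer in Mbits/s."""
    _, _, start, end, nbytes = transfer
    duration = end - start
    return nbytes * 8 / duration / 1e6 if duration > 0 else 0.0
```

The per-transfer throughputs obtained this way form the passive time series that a forecasting method could consume in place of active throughput probes.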

perfSONAR (pS)
- See Joe Metzger's talk later today
- SLAC/IEPM formally joined pS
  - De-centralised management of pS allows us to concentrate more on analysis than on deployment/maintenance, provides a rich source of data to analyze, and leverages IEPM group skills
  - perfSONAR allows transparent data access
  - Brings US HEP influence closer to pS
- Plans:
  - Make iepm-bw data available via the pS infrastructure
  - Port our analysis tools to work with pS
  - Test perfSONAR APIs
  - Provide useful features such as analysis, visualization, event detection, alerting and diagnosis
- pS enables the unification of both end-to-end and router metric representation
  - Worry about finding correlations for diagnosis rather than determining how to gather the data

Questions, more information
- Comparisons of active infrastructures:
  - www.slac.stanford.edu/grp/scs/net/proposals/infra-mon.html
- Some active public measurement infrastructures:
  - www-iepm.slac.stanford.edu/
  - www-iepm.slac.stanford.edu/pinger/
  - e2epi.internet2.edu/owamp/
  - amp.nlanr.net/ (no longer funded)
- Monitoring tools:
  - www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
  - www.caida.org/tools/
  - Google for iperf, thrulay, bwctl, pathload, pathchirp, pathneck
- Event detection:
  - www.slac.stanford.edu/grp/scs/net/papers/noms/noms14224-122705-d.doc
- Netflow:
  - Internet2 backbone: http://netflow.internet2.edu/weekly/
  - SLAC: www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html
  - BNL (SLAC developed, work in progress): http://iepmbw.bnl.org/netflow/index.html