1 Performance Network Monitoring for the LHC Grid Les Cottrell, SLAC International ICFA Workshop on Grid Activities within Large Scale International Collaborations,


1 Performance Network Monitoring for the LHC Grid Les Cottrell, SLAC International ICFA Workshop on Grid Activities within Large Scale International Collaborations, Sinaia, Romania Oct 13-18, 2006 Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM), and by Internet2

2 Why & Outline Data-intensive sciences (e.g. HEP) need to move large volumes of data worldwide –Requires understanding and effective use of fast networks –Requires continuous monitoring and interpretation For the HEP LHC-OPN, focus is on Tier 0, Tier 1 and a few Tier 2 sites, i.e. just a few sites Outline of talk: –What does monitoring provide? –Active E2E measurements today and some challenges –Visualization, forecasting, problem ID –Passive monitoring (Netflow) –Some conclusions

3 Uses of Measurements Automated problem identification & troubleshooting: –Alerts for network administrators, e.g. baselines, bandwidth changes in time series, iperf, SNMP –Alerts for systems people, e.g. OS/host metrics Forecasts for Grid middleware, e.g. replica manager, data placement Engineering, planning, SLAs (set & verify), expectations Also (not addressed here): –Security: spotting anomalies, intrusion detection –Accounting

4 Active E2E Monitoring

5 E.g. Using Active IEPM-BW measurements Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. the HEP tiered model Makes regular measurements with probe tools –ping (RTT, connectivity), owamp (1-way delay), traceroute (routes) –pathchirp, pathload (available bandwidth) –iperf (single & multi-stream), thrulay (achievable throughput) –supports bbftp, bbcp (file-transfer applications, not network) Looking at GridFTP, but it is complex, requiring renewing certificates –Choice of probes depends on importance of path, e.g. for major paths (Tier 0, 1 & some 2) use full suite; for Tier 3 use just ping and traceroute Running at major HEP sites: CERN, SLAC, FNAL, BNL, Caltech, Taiwan, SNV to about 40 remote sites – bw.slac.stanford.edu/slac_wan_bw_tests.html

6 IEPM-BW Measurement Topology 40 target hosts in 13 countries Bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s Traverse ~50 ASes, 15 major Internet providers 5 targets at PoPs, rest at end sites Taiwan TWaren Added Sunnyvale for UltraLight; adding FZK Karlsruhe

7 Top page

8 Probes: Ping/traceroute Ping still useful –Is path connected/node reachable? –RTT, jitter, loss –Great for low-performance links (e.g. Digital Divide), e.g. AMP (NLANR)/PingER (SLAC) –Nothing to install, but often blocked OWAMP/I2 similar but one-way –But needs a server installed at the other end and good timers –Now built into IEPM-BW Traceroute –Needs good visualization (traceanal/SLAC) –No use on dedicated λ layer 1 or 2 paths However still want to know the topology of paths

9 Probes: Packet Pair Dispersion Used by pathload, pathchirp, ABwE for available bandwidth Send packets with known separation See how the separation changes due to the bottleneck Can be low network-intrusive, e.g. ABwE only 20 packets/direction, also fast (< 1 sec) From PAM paper, pathchirp more accurate than ABwE, but –Ten times as long (10s vs 1s) –More network traffic (~factor of 10) Pathload another factor of 10 more again IEPM-BW now supports ABwE, Pathchirp, Pathload [Figure: minimum spacing is set at the bottleneck and preserved on higher-speed links]
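The core packet-pair idea above reduces to a one-line estimate: after the bottleneck, back-to-back packets are spaced by the time the bottleneck link took to serialize one packet. A minimal sketch (not the pathchirp or ABwE algorithm itself, which add filtering and self-induced congestion):

```python
# Sketch: bottleneck capacity from packet-pair dispersion.
# Spacing observed at the receiver ~ serialization time of one packet
# at the bottleneck, so capacity ~ packet bits / spacing.

def bottleneck_capacity_bps(packet_size_bytes, dispersion_s):
    """Capacity estimate from one packet pair."""
    return packet_size_bytes * 8 / dispersion_s

# A 1500-byte packet arriving 120 microseconds behind its twin
# implies a ~100 Mbit/s bottleneck.
print(bottleneck_capacity_bps(1500, 120e-6) / 1e6)  # → 100.0
```

Real tools send many pairs (or chirps of pairs) and filter the samples, since cross-traffic perturbs individual spacings.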

10 BUT… Packet pair dispersion relies on accurate timing of inter-packet separation –At > 1 Gbps this is getting beyond the resolution of Unix clocks –AND 10GE NICs are offloading functions: coalescing interrupts, Large Send & Receive Offload, TOE Need to work with TOE vendors –Turn off offload (Neterion supports multiple channels; can eliminate offload to get more accurate timing in the host) –Do timing in NICs –No standards for interfaces Possibly use packet trains, e.g. pathneck

11 Achievable Throughput Use TCP or UDP to send as much data as possible, memory to memory, from source to destination Tools: iperf (bwctl/I2), netperf, thrulay (from Stas Shalunov/I2), udpmon … Pseudo file copy: bbcp also has a memory-to-memory mode to avoid disk/file problems
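The memory-to-memory measurement above can be sketched as a toy probe: stream bytes over a TCP socket and time the transfer. This is only the skeleton of what iperf/thrulay do; real tools add warmup, parallel streams, scheduling, and congestion-avoidance-only measurement windows. Host, port, and transfer size here are arbitrary choices for a loopback demo:

```python
# Toy memory-to-memory throughput probe: a sender thread streams bytes
# to a sink over loopback TCP, and we time the whole transfer.
import socket, threading, time

def sink(server, nbytes, result):
    conn, _ = server.accept()
    got = 0
    while got < nbytes:
        chunk = conn.recv(1 << 16)
        if not chunk:
            break
        got += len(chunk)
    result.append(got)
    conn.close()

server = socket.create_server(("127.0.0.1", 0))  # ephemeral port
port = server.getsockname()[1]
nbytes, result = 50_000_000, []
t = threading.Thread(target=sink, args=(server, nbytes, result))
t.start()

payload = b"\0" * (1 << 16)
sent, start = 0, time.time()
with socket.create_connection(("127.0.0.1", port)) as s:
    while sent < nbytes:
        s.sendall(payload)      # memory to memory: no disk involved
        sent += len(payload)
t.join()
elapsed = time.time() - start
print(f"{result[0] * 8 / elapsed / 1e6:.0f} Mbits/s memory-to-memory")
```

On loopback this mostly measures the host, which is exactly the point of memory-to-memory mode: it isolates network-stack throughput from disk and file-system effects.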

12 BUT… At 10 Gbits/s on a transatlantic path, slow start takes over 6 seconds –To get 90% of the measurement in congestion avoidance, need to measure for 1 minute (5.25 GBytes at 7 Gbits/s, today's typical performance) Needs scheduling to scale, even then … It's not disk-to-disk or application-to-application –So use bbcp, bbftp, or GridFTP
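The slow-start cost above can be estimated with a rough model: the congestion window starts near one MSS and multiplies each RTT until it covers the bandwidth-delay product. The 150 ms transatlantic RTT used here is an illustrative assumption, and real stacks (delayed ACKs, pacing, losses) take longer than the idealized doubling case, which is consistent with the > 6 s the slide reports:

```python
import math

# Rough slow-start duration model: cwnd grows by `growth` per RTT from
# one MSS until it reaches the bandwidth-delay product. With delayed
# ACKs growth is closer to 1.5x per RTT than the textbook 2x.
# 10 Gbit/s target and 150 ms RTT are illustrative assumptions.

def slow_start_seconds(target_bps, rtt_s, mss_bytes=1500, growth=1.5):
    bdp_bytes = target_bps * rtt_s / 8   # window needed to fill the pipe
    rounds = math.ceil(math.log(bdp_bytes / mss_bytes, growth))
    return rounds * rtt_s

print(slow_start_seconds(10e9, 0.150))            # ~4 s with 1.5x growth
print(slow_start_seconds(10e9, 0.150, growth=2))  # ~2.5 s idealized doubling
```

Either way the startup transient is seconds long, which is why a credible 10 Gbit/s measurement has to run on the order of a minute.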

13 AND … For testbeds such as UltraLight, UltraScienceNet etc. one has to reserve the path –So the measurement infrastructure needs the capability to reserve the path (so need an API to the reservation application) –OSCARS from ESnet is developing a web-services interface For lightweight probes, have a “persistent” capability For more intrusive probes, must reserve just before making the measurement

14 Visualization & Forecasting in Real World

15 Examples of real data Some are seasonal, others are not Events may affect multiple metrics Misconfigured windows, new paths, very noisy data Seasonal effects –Daily & weekly Events can be caused by host or site congestion Few route changes result in bandwidth changes (~20%) Many significant events are not associated with route changes (~50%) [Plots of bandwidth time series: Caltech (thrulay, from Nov 05), UToronto (miperf, from Nov 05), UTDallas (pathchirp, thrulay, iperf)]

16 Scatter plots & histograms Scatter plots: quickly identify correlations between metrics Histograms: quickly identify variability or multimodality [Plots: scatter of throughput (Mbits/s) vs RTT (ms) for thrulay, pathchirp & iperf; histograms of pathchirp and thrulay throughput (Mbps)]

17 Changes in network topology (BGP) can result in dramatic changes in performance ABwE measurements one/minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am Drop in performance when the path changed from the original SLAC–CENIC–Caltech to SLAC–ESnet–LosNettos (100 Mbps)–Caltech, then back to the original path Changes detected by IEPM-Iperf and ABwE [Figure: snapshot of the traceroute summary table, traceroute trees generated from it, and per-hour plots of dynamic bandwidth capacity (DBC), cross-traffic (XT) and available bandwidth = DBC − XT in Mbits/s] Notes: 1. Caltech misrouted via the Los-Nettos 100 Mbps commercial net 14:00-17:00 2. ESnet/GEANT working on routes from 2:00 to 14:00 3. A previous occurrence went unnoticed for 2 months 4. Next step is to auto-detect and notify

18 On the other hand Route changes may affect the RTT (in yellow) Yet have no noticeable effect on available bandwidth or throughput

19 However… Elegant graphics are great for understanding problems BUT: –Can be thousands of graphs to look at (many site pairs, many devices, many metrics) –Need automated problem recognition AND diagnosis So developing tools to reliably detect significant, persistent changes in performance –Initially using a simple plateau algorithm to detect step changes
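The plateau idea mentioned above can be sketched simply: compare a short recent window against a longer history baseline and flag a persistent deviation. The window sizes and the 3-sigma threshold below are illustrative, not the tuned IEPM values:

```python
# Minimal plateau-style step detector: flag when the mean of the last
# `recent` samples deviates from the preceding `hist`-sample baseline
# by more than n_sigma standard deviations.
from statistics import mean, stdev

def detect_step(series, hist=20, recent=5, n_sigma=3):
    """Return start indices of recent windows that deviate from baseline."""
    events = []
    for i in range(hist + recent, len(series) + 1):
        baseline = series[i - hist - recent:i - recent]
        window = series[i - recent:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(mean(window) - mu) > n_sigma * sigma:
            events.append(i - recent)
    return events

# Throughput steps down from ~90 Mbps to ~45 Mbps at index 30:
data = [90.0 + (i % 3) for i in range(30)] + [45.0 + (i % 3) for i in range(30)]
print(detect_step(data))  # windows containing the step are flagged
```

Requiring the whole recent window to deviate (rather than one sample) is what filters transient noise from the persistent changes the slide cares about.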

20 Seasonal Effects on events Change in bandwidth (drops) between 19:00 & 22:00 Pacific Time (7:00-10:00am PK time) causes more anomalous events around this time

21 Forecasting Over-provisioned paths should have pretty flat time series –Short/local-term smoothing –Long-term linear trends –Seasonal smoothing But seasonal trends (diurnal, weekly) need to be accounted for on about 10% of our paths Use Holt-Winters triple exponentially weighted moving averages
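A minimal additive Holt-Winters sketch shows how the three smoothed components (level, trend, season) combine into the forecast; the smoothing constants below are illustrative defaults, not the values used in IEPM:

```python
# Additive Holt-Winters (triple exponential smoothing) for a series
# with a known seasonal period, e.g. 24 hourly samples per day.
# alpha/beta/gamma smooth level, trend, and seasonal terms respectively.

def holt_winters_forecast(series, period, alpha=0.5, beta=0.1, gamma=0.1):
    """Return one-step-ahead forecasts for series[period:]."""
    level = series[0]
    trend = (series[period] - series[0]) / period
    season = [series[i] - level for i in range(period)]
    forecasts = []
    for i in range(period, len(series)):
        forecasts.append(level + trend + season[i % period])
        prev = level
        level = alpha * (series[i] - season[i % period]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev) + (1 - beta) * trend
        season[i % period] = gamma * (series[i] - level) + (1 - gamma) * season[i % period]
    return forecasts

# Two "days" of a perfectly diurnal throughput pattern (8 samples/day):
day = [100, 80, 60, 80, 100, 120, 140, 120]
print(holt_winters_forecast(day * 2, period=8))  # reproduces the second day
```

On a perfectly periodic input the forecasts track the pattern exactly; on real, noisy paths the point is that the seasonal term absorbs the diurnal/weekly swing so it is not flagged as an anomaly.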

22 Experimental Alerting Have false positives down to a reasonable level (a few per week), so sending alerts to developers Saved in a database Links to traceroutes, event analysis, time series

23 Passive Active monitoring –Pro: regularly spaced data on known paths; can measure on demand –Con: adds data to the network, can interfere with real data and measurements What about passive?

24 Netflow et al. Switch identifies a flow by src/dst ports and protocol Cuts a record for each flow: –src, dst, ports, protocol, TOS, start & end time Collect records and analyze Can be a lot of data to collect each day, needs a lot of CPU –Hundreds of MBytes to GBytes No intrusive traffic; sees real traffic, collaborators, applications No accounts/pwds/certs/keys No reservations etc. Characterize traffic: top talkers, applications, flow lengths etc. LHC-OPN requires edge routers to provide Netflow data Internet2 backbone; SLAC

25 Typical day’s flows Very much work in progress Look at the SLAC border Typical day: –~28K flows/day –~75 sites with > 100 KB bulk-data flows –A few hundred flows > 1 GByte Collect records for several weeks Filter on 40 major collaborator sites, big (> 100 KBytes) flows, bulk-transport apps/ports (bbcp, bbftp, iperf, thrulay, scp, ftp …) Divide by remote site, aggregate parallel streams Look at throughput distribution
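The "divide by remote site, aggregate parallel streams" step above can be sketched as follows. The record fields mirror the Netflow fields listed on the previous slide, but the sample flows and site names are invented for illustration:

```python
# Group flow records by remote site and merge parallel streams of one
# transfer: earliest start, latest end, total bytes across streams.
from collections import defaultdict

flows = [  # (remote_site, start_s, end_s, bytes)
    ("caltech.edu", 0.0, 100.0, 4_000_000_000),
    ("caltech.edu", 0.0, 100.0, 4_000_000_000),  # second parallel stream
    ("bnl.gov", 50.0, 250.0, 1_000_000_000),
]

by_site = defaultdict(lambda: [float("inf"), 0.0, 0])
for site, start, end, nbytes in flows:
    rec = by_site[site]
    rec[0] = min(rec[0], start)   # earliest start of any stream
    rec[1] = max(rec[1], end)     # latest end of any stream
    rec[2] += nbytes              # total bytes across streams

for site, (start, end, nbytes) in by_site.items():
    print(f"{site}: {nbytes * 8 / (end - start) / 1e6:.0f} Mbits/s")
```

Summing bytes over the union of the streams' time span is the "needs care" part the later slide mentions: counting each parallel stream as a separate transfer would badly understate the achieved throughput.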

26 Netflow et al. Peaks at known capacities and RTTs –RTTs might suggest windows not optimized; peaks at default OS window size (BW = Window/RTT)
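The window-limit formula above explains those peaks: a flow's throughput is capped at window / RTT, so flows stuck at a default OS window pile up at predictable rates. A worked example (64 KB is a common historical default, used here as an assumption):

```python
# TCP window limit: achievable throughput <= window / RTT.

def window_limited_bps(window_bytes, rtt_s):
    return window_bytes * 8 / rtt_s

# A 64 KB default window over a 100 ms RTT caps a flow near 5 Mbits/s,
# no matter how fast the underlying link is.
print(window_limited_bps(64 * 1024, 0.100) / 1e6)  # → 5.24288
```

Inverting the formula against the peaks in the flow data is what lets the passive analysis suggest which sites have unoptimized windows.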

27 How many sites have enough flows? In May ’05 found 15 sites at the SLAC border with > 1440 flows (1 per 30 mins) –Maybe enough for time-series forecasting of seasonal effects Three sites (Caltech, BNL, CERN) were actively monitored; the rest were “free” Only 10% of sites have big seasonal effects in active measurements; the remainder need fewer flows So promising

28 Mining data for sites Real application use (bbftp) for 4 months Gives a rough idea of throughput (and confidence) for 14 sites seen from SLAC

29 Multi-month bbcp throughput from SLAC to Padova Fairly stable with time, but large variance Many non-network-related factors

30 Netflow limitations Use of dynamic ports makes it harder to detect the application –GridFTP, bbcp, bbftp can use fixed ports (but may not) –P2P often uses dynamic ports –Discriminate type of flow based on headers (not relying on ports) Types: bulk data, interactive … Discriminators: inter-arrival time, length of flow, packet length, volume of flow Use machine learning/neural nets to cluster flows E.g. aggregation of parallel flows (needs care, but not difficult) Can use for giving a performance forecast –Unclear if can use for detecting steps in performance
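A toy version of the port-free discrimination above: classify a flow from the behavioral features listed (inter-arrival time, duration, packet length, volume). A real deployment would cluster these features with machine learning as the slide suggests; the rules and thresholds here are invented purely to illustrate the feature space:

```python
# Hypothetical rule-based flow classifier over behavioral features,
# standing in for the ML/clustering approach described in the slide.
# All thresholds are illustrative assumptions, not measured values.

def classify_flow(mean_interarrival_s, duration_s, mean_pkt_bytes, total_bytes):
    if mean_pkt_bytes > 1000 and total_bytes > 100_000:
        return "bulk-data"      # near-MTU packets, large volume
    if mean_interarrival_s > 0.5 and mean_pkt_bytes < 200:
        return "interactive"    # sparse, small packets (e.g. typing over ssh)
    return "other"

print(classify_flow(0.001, 120, 1460, 5_000_000_000))  # → bulk-data
print(classify_flow(1.2, 600, 80, 40_000))             # → interactive
```

The appeal is that these features survive dynamic ports: a GridFTP transfer on a random port still looks like bulk data.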

31 Conclusions Some tools fail at higher speeds Throughputs often depend on non-network factors: –Host: interface speeds (DSL, 10 Mbps Enet, wireless), loads, resource congestion –Configurations (window sizes, hosts, number of parallel streams) –Applications (disk/file vs mem-to-mem) Looking at distributions by site, often multi-modal Predictions may have large standard deviations Need automated assistance to diagnose events

32 In Progress Working on Netflow viz (currently at BNL & SLAC), then work with other LHC sites to deploy Add support for pathneck Look at other forecasters, e.g. ARMA/ARIMA, maybe Kalman filters, neural nets Working on diagnosis of events –Multi-metrics, multi-paths Signed collaborative agreement with Internet2 to collaborate on PerfSONAR –Provide web-services access to IEPM data –Provide analysis, forecasting and event detection for PerfSONAR data –Use PerfSONAR (e.g. router) data for diagnosis –Provide viz of PerfSONAR route information –Apply to LHCnet –Look at layer 1 & 2 information

33 Questions, More information Comparisons of active infrastructures Some active public measurement infrastructures: –www-iepm.slac.stanford.edu/ –www-iepm.slac.stanford.edu/pinger/ –e2epi.internet2.edu/owamp/ –amp.nlanr.net/ Monitoring tools –Google for iperf, thrulay, bwctl, pathload, pathchirp Event detection –www.slac.stanford.edu/grp/scs/net/papers/noms/

34 Outline Deployment, keeping in sync, management, timeouts, killing hung processes; host OS/environments differ Implementation: –MySQL dbs for data and configuration (host, tools, plotting etc.) info –Scheduler, prevents backup –Log files, analyzed for troubles –Local target as a sanity check on the monitor