perfSONAR Update. Shawn McKee/University of Michigan. LHCONE/LHCOPN Meeting, Cambridge, UK, February 9th, 2015.

Presentation transcript:

perfSONAR Update
Shawn McKee/University of Michigan
LHCONE/LHCOPN Meeting, Cambridge, UK
February 9th, 2015

Overview of Talk
- Overview of Status (Changes, Issues)
- Status of perfSONAR Monitoring for LHCONE/LHCOPN
- Discussion
February 9, 2015 LHCONE/OPN-Cambridge-Shawn McKee

LHCONE MaDDash – 09 Feb 2015
- We still have a couple of hosts with issues: ps01-nl.geant.nl (labeled "perfSONAR-latency") and the Internet2 host at ManLan (labeled "Internet2 perfSONAR").
- NOTE: labels are now generated from Mesh registration information.

LHCOPN MaDDash – 09 Feb 2015
- We still have a couple of hosts with issues:
  - KISTI had firewall issues (updated today).
  - Still LOTS of orange on the latency mesh.
  - The BW mesh is much better, but the “red” throughput results are worth examining.
- NOTE: labels are now generated from Mesh registration information.

OSG Network Service
- Open Science Grid (OSG) has deployed a network service for WLCG (and LHCONE). It consists of:
  - A datastore based upon Esmond (the new MA in perfSONAR v3.4)
  - A GUI using MaDDash
  - A service monitoring component built on OMD
  - A “mesh-creation-configuration” utility built on registered information in OIM and GOCDB
- Demo of how the mesh creation works (slides are used for this, since X509 credentials are needed)

OIM / Mesh Config / Hostgroups

OIM / Mesh Config / Parameters

OIM / Mesh Config / Configs

Mesh Config Adding Tests

MyOSG / Mesh Config

MyOSG / Mesh Config (us-atlas)

perfSONAR Monitoring Pages
- We have 3 versions of our perfSONAR monitoring pages:
  - Prototype at maddash.aglt2.org
  - Testing at OSG’s ITB instance
  - Production at OSG’s production instance
- Main monitoring types are MaDDash and OMD/Check_MK
  - Prototype:
  - Testing:
  - Production:
- Notes:
  - OSG instances rely upon the OSG Datastore
  - X509 cert needed to view check_mk/OMD pages (any IGTF cert)
  - OSG datastore currently DOWN for resource consumption debugging

Prototype OMD for LHCONE perfSONARs
- OMD (Open Monitoring Distribution) wraps a set of Nagios packages into a single pre-configured RPM.
- Needs an X509 credential from an IGTF CA.
- Very green now!

Prototype OMD for LHCOPN perfSONARs
- Almost all “green”. A few tests are failing… why?
- The problem is in the Perl library used to get/parse the HTTPS page. There is a conflict in a library that the OMD host installs.
- The “fix” for the HTTPS issues requires a newer version of that library, which conflicts with the version needed by other software on the host.
- The solution will be to update the tests to use a Python library that directly reads the JSON host information via HTTPS.
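The Python-based check described on this slide could look roughly like the sketch below. It is only an illustration using the standard library: the function names, the URL passed in, and the list of expected keys are all hypothetical, and the actual fields published by a perfSONAR host are not given in the slides. The return codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import json
import urllib.request


def fetch_host_info(url, timeout=10, context=None):
    """Fetch a JSON document over HTTPS and return it as a Python object.

    `url` is supplied by the caller; no particular perfSONAR endpoint is
    assumed here. `context` may be an ssl.SSLContext if custom certificate
    handling is needed.
    """
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return json.loads(resp.read().decode("utf-8"))


def check_host_info(info, required_keys):
    """Return a (status_code, message) pair in Nagios plugin style.

    `required_keys` is a hypothetical list of fields the check expects;
    the real field names depend on what the host publishes.
    """
    missing = [key for key in required_keys if key not in info]
    if missing:
        return 2, "CRITICAL - missing keys: " + ", ".join(missing)
    return 0, "OK - all %d expected keys present" % len(required_keys)
```

A check script would call `fetch_host_info()` on the host's JSON page, pass the result to `check_host_info()`, print the message, and exit with the status code so that Check_MK/Nagios can pick it up.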

Issue for LHCONE Monitoring
- OSG has assigned a subnet for LHC-related monitoring and the network service components: /24
- All “production” perfSONAR monitoring is on that subnet.
- The network hosting OSG’s subnet is a campus “production” network, and it is NOT willing to allow this subnet to set up a peering with an LHCONE VRF.
- The attempted solution was to use SOCKS5 proxying via AGLT2 to access LHCONE-only endpoints.
  - Problem: not really working. May require software version changes.
- For now we are keeping the Prototype instances at AGLT2 running to provide the needed coverage.

Production OMD for LHCONE perfSONARs
- Notice the difference in what the production instance measures vs. the prototype instance.
- Certain hosts are not allowing ICMP pings from the OSG subnet.
- Some checks are not working from the production host on certain systems.

Production OMD for LHCOPN perfSONARs
- Similar issues for the production LHCOPN monitoring.
- BNL hosts are not allowing ICMP pings from the OSG subnet.
- Some checks are not working from the production host on certain systems.

OSG Network Datastore
- All perfSONAR metrics should be collected into the OSG network datastore.
  - This is an Esmond datastore from perfSONAR (PostgreSQL + Cassandra backends).
  - Loaded via RSV probes; currently one probe per perfSONAR instance every 15 minutes.
  - The probes have a bug: they report TWICE the BW measured by the node.
- Datastore on pfsd.grid.iu.edu
  - JSON at
  - Python API at
  - Perl API at https://code.google.com/p/perfsonar-ps/wiki/MeasurementArchivePerlAPI
- Currently the datastore is down for debugging resource usage.
- All LHCONE and (LHC)OPN data should be stored there.
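As a rough illustration of JSON access to an Esmond datastore like the one described on this slide, the sketch below builds a query URL and averages a throughput time series. Only the host name pfsd.grid.iu.edu comes from the slides. The `/esmond/perfsonar/archive/` path, the query parameter names, the `http` scheme, and the `ts`/`val` point format reflect my understanding of the Esmond REST interface and should be checked against the Esmond documentation. The example host names are hypothetical.

```python
from urllib.parse import urlencode


def build_archive_query(host, source, destination,
                        event_type="throughput", time_range=86400):
    """Build a metadata query URL for an Esmond measurement archive.

    Assumptions (not taken from the slides): the archive path, the
    parameter names, and plain http as the scheme.
    """
    params = [
        ("source", source),
        ("destination", destination),
        ("event-type", event_type),
        ("time-range", time_range),  # look-back window in seconds
        ("format", "json"),
    ]
    return "http://%s/esmond/perfsonar/archive/?%s" % (host, urlencode(params))


def mean_gbps(points):
    """Average a list of {"ts": ..., "val": ...} points, in Gb/s.

    Assumes "val" is throughput in bits per second. Returns None when
    there are no points (i.e. missing measurements).
    """
    if not points:
        return None
    total = sum(point["val"] for point in points)
    return total / len(points) / 1e9
```

For example, `build_archive_query("pfsd.grid.iu.edu", "a.example.org", "b.example.org")` yields a URL whose JSON response could then be fetched and walked to reach the time series passed to `mean_gbps()`.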

Network Datastore Access via JSON

PuNDIT Project

Next Steps
- We are working on getting ALL WLCG/OSG perfSONAR instances fully updated and properly configured.
  - We need to be reliably gathering all network metrics centrally.
  - Feb 16 is the deadline for sites to update and configure instances.
- There are some known bugs in the data acquisition chain that need fixing. This is an ongoing effort.
- As we fix known issues and reach reliable operation, we can free up time to pursue possible issues in the network itself, rather than in the framework that gets us network metrics.

Discussion/Questions/Comments?

Useful URLs
- Open Science Grid Networking URL:
- LHCOPN instructions for perfSONAR-PS (needs update):
- MaDDash Monitoring
  - webui/index.cgi?dashboard=LHCONE%20testing%20sites
  - webui/index.cgi?dashboard=LHCONE%20Mesh%20Config
- OMD Monitoring
  - art_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhostgroup%26hostgroup%3DLHCONE
  - rt_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhostgroup%26hostgroup%3DLHCONE

LHCONE Network Matrices: 28Apr2014
- Panels: OWAMP (Latency), BWCTL (Bandwidth)
- Legend: No packet loss, packet loss> Gb, 0.5<BW<0.9 Gb, BW<0.5 Gb, orange
- The main issue was too much “orange”, indicating missing measurements/data.
- Sources are “row”; destinations are “column”.
- Each box is split into two regions indicating where the test is run: top corresponds to “row”, bottom to “column”.
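The bandwidth legend on these matrix slides amounts to a simple threshold rule, sketched below. The 0.5 Gb and 0.9 Gb thresholds and the use of orange for missing data come from the slides. The green/yellow/red names for the three bandwidth bands and the handling of values exactly at a threshold are assumptions made for illustration. The function names are hypothetical.

```python
def throughput_status(bw_gbps):
    """Map a measured throughput in Gb/s to a dashboard color.

    None stands for a missing measurement. Color names other than
    "orange" are assumed, as is the treatment of boundary values.
    """
    if bw_gbps is None:
        return "orange"   # missing measurement/data
    if bw_gbps < 0.5:
        return "red"      # BW < 0.5 Gb
    if bw_gbps < 0.9:
        return "yellow"   # 0.5 < BW < 0.9 Gb
    return "green"        # remaining band, at or above 0.9 Gb


def orange_fraction(measurements):
    """Fraction of cells in a list of measurements that have no data."""
    if not measurements:
        return 0.0
    missing = sum(1 for value in measurements if value is None)
    return missing / len(measurements)
```

Applied to one row of a mesh, `orange_fraction()` gives a quick measure of how much of that row is missing data, which is the main issue these slides track.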

LHCONE Network Matrices: 11Aug2014
- Panels: OWAMP (Latency), BWCTL (Bandwidth)
- Legend: No packet loss, packet loss> Gb, 0.5<BW<0.9 Gb, BW<0.5 Gb, orange
- The main issue is STILL too much “orange”, indicating missing measurements/data.
- Sources are “row”; destinations are “column”.
- Each box is split into two regions indicating where the test is run: top corresponds to “row”, bottom to “column”.

LHCONE Network Matrices: 15Sep2014
- Panels: OWAMP (Latency), BWCTL (Bandwidth)
- Legend: No packet loss, packet loss> Gb, 0.5<BW<0.9 Gb, BW<0.5 Gb
- Improvements since the APAN meeting… mostly due to the work of Jason Zurawski (see later slides).
- Still a little orange remaining… some problems seem to recur after we have fixed them.
- Also, we have MOST of the needed people in the room now… can we fix the rest?