perfSONAR Update. Shawn McKee/University of Michigan. LHCONE/LHCOPN Meeting, Cambridge, UK, February 9th, 2015.


1 perfSONAR Update
Shawn McKee/University of Michigan
LHCONE/LHCOPN Meeting, Cambridge, UK
February 9th, 2015

2 Overview of Talk
• Overview of Status (Changes, Issues)
• Status of perfSONAR Monitoring for LHCONE/LHCOPN
• Discussion
February 9, 2015 LHCONE/OPN-Cambridge-Shawn McKee

3 LHCONE MaDDash – 09 Feb 2015
We still have a couple of hosts with issues: ps01-nl.geant.nl (labeled “perfSONAR-latency”) and the Internet2 host at ManLan (labeled “Internet2 perfSONAR”).
NOTE: labels are now generated from Mesh registration information.

4 LHCOPN MaDDash – 09 Feb 2015
We still have a couple of hosts with issues:
• KISTI had firewall issues; updated today.
• Still LOTS of orange on the latency mesh.
• BW mesh is much better, but the “red” throughput is worth examining.
NOTE: labels are now generated from Mesh registration information.

5 OSG Network Service
• Open Science Grid (OSG) has deployed a network service for WLCG (and LHCONE). It consists of:
  • A datastore based upon Esmond (new MA in perfSONAR v3.4)
  • A GUI using MaDDash
  • A service monitoring component built on OMD
  • A “mesh-creation-configuration” utility built on registered information in OIM and GOCDB
• Demo on how the mesh-creation works (have to use slides for this since we need X509 credentials)
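As a rough illustration of what a mesh-creation step involves, the sketch below expands a host group into one test entry per directed source/destination pair. The host names, the `build_mesh_tests` helper, and the parameter dictionary are all hypothetical; they do not reflect the actual OIM/GOCDB records or the perfSONAR mesh-configuration file format.

```python
from itertools import permutations


def build_mesh_tests(hosts, test_type, parameters):
    """Expand a host group into one test entry per directed host pair.

    Hypothetical helper: the real utility reads registered hosts from
    OIM/GOCDB and writes a perfSONAR mesh configuration instead.
    """
    unique_hosts = sorted(set(hosts))  # drop duplicates, keep output stable
    return [
        {
            "type": test_type,
            "source": src,
            "destination": dst,
            "parameters": dict(parameters),  # copy so entries stay independent
        }
        for src, dst in permutations(unique_hosts, 2)
    ]


if __name__ == "__main__":
    # Placeholder host names, not real perfSONAR instances.
    group = ["ps-a.example.org", "ps-b.example.org", "ps-c.example.org"]
    for test in build_mesh_tests(group, "bandwidth", {"duration_s": 30}):
        print(test["source"], "->", test["destination"])
```

A full mesh of N hosts yields N × (N − 1) directed tests, which is why a registration-driven generator is easier to keep current than hand-maintained test lists.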

6 OIM / Mesh Config / Hostgroups

7 OIM / Mesh Config / Parameters

8 OIM / Mesh Config / Configs

9 Mesh Config Adding Tests

10 MyOSG / Mesh Config

11 MyOSG / Mesh Config (us-atlas)

12 perfSONAR Monitoring Pages
• We have 3 versions of our perfSONAR monitoring pages:
  • Prototype at maddash.aglt2.org
  • Testing at OSG’s ITB instance
  • Production at OSG’s production instance
• Main monitoring types are MaDDash and OMD/Check_MK
• Prototype:
  http://maddash.aglt2.org/maddash-webui
  https://maddash.aglt2.org/WLCGperfSONAR/check_mk
• Testing:
  http://perfsonar-itb.grid.iu.edu/maddash-webui/
  https://perfsonar-itb.grid.iu.edu/WLCGperfSONAR/check_mk/
• Production:
  http://pfmad.grid.iu.edu/maddash-webui/
  https://pfomd.grid.iu.edu/WLCGperfSONAR/check_mk
• Notes:
  • OSG instances rely upon the OSG Datastore: http://pfds.grid.iu.edu
  • X509 cert needed to view check_mk/OMD pages (any IGTF cert)
  • OSG datastore currently DOWN for resource consumption debugging

13 Prototype OMD for LHCONE perfSONARs
https://maddash.aglt2.org/WLCGperfSONAR/check_mk
• OMD (Open Monitoring Distribution) wraps a set of Nagios packages into a single preconfigured RPM
• Needs an X509 credential from an IGTF CA
• Very green now!

14 Prototype OMD for LHCOPN perfSONARs
https://maddash.aglt2.org/WLCGperfSONAR/check_mk/
• Almost all “green”. A few tests are failing… why?
• The problem is in the Perl library used to get and parse the HTTPS page. The fix for the HTTPS issues requires a newer version of a library that the OMD host installs, and that version conflicts with the one needed by other software on the host.
• The solution will be to update the tests to use a Python library that reads the JSON host information directly via HTTPS.
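The planned approach of reading JSON host information directly over HTTPS could look roughly like the sketch below, which uses only the Python standard library. The URL, the `fetch_host_info` and `missing_fields` helpers, and the field names `hostname` and `version` are assumptions for illustration; the slides do not say which library or JSON layout the real checks will use.

```python
import io
import json
import urllib.request


def fetch_host_info(url, timeout=10, opener=urllib.request.urlopen):
    """Fetch a JSON document over HTTP(S) and return the parsed object.

    `opener` is injectable so the function can be exercised offline.
    """
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def missing_fields(host_info, required=("hostname", "version")):
    """Return the required keys absent from the record (names are hypothetical)."""
    return [name for name in required if name not in host_info]


if __name__ == "__main__":
    # Offline demonstration: a fake opener returns a canned JSON body,
    # so no network access or real perfSONAR host is needed.
    def fake_opener(url, timeout=None):
        return io.BytesIO(b'{"hostname": "ps.example.org"}')

    info = fetch_host_info("https://ps.example.org/host-info.json", opener=fake_opener)
    print(info, missing_fields(info))
```

A monitoring check built this way would report a failure when the fetch raises an exception or when `missing_fields` returns a non-empty list.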

15 Issue for LHCONE Monitoring
• OSG has assigned a subnet for LHC-related monitoring and the network service components:
  • 129.79.53.0/24
  • All “production” perfSONAR monitoring is on that subnet
• The network hosting OSG’s subnet is a campus “production” network, and it is NOT willing to allow this subnet to set up a peering with an LHCONE VRF
• Attempted solution was to use SOCKS5 proxying via AGLT2 to access LHCONE-only endpoints
  • Problem: not really working. May require software version changes
• For now we are keeping the Prototype instances at AGLT2 running to provide the needed coverage.
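For sites that want to recognize traffic from the production monitoring hosts, for example when reviewing firewall rules, a membership test against the subnet above can be written with Python's standard `ipaddress` module. This is a minimal sketch; the helper name `in_monitoring_subnet` and the sample addresses are illustrative only.

```python
import ipaddress

# Subnet quoted on the slide for OSG's LHC-related monitoring.
OSG_MONITORING_SUBNET = ipaddress.ip_network("129.79.53.0/24")


def in_monitoring_subnet(address):
    """Return True if `address` (a string) lies inside the monitoring subnet."""
    return ipaddress.ip_address(address) in OSG_MONITORING_SUBNET


if __name__ == "__main__":
    # Sample addresses chosen for illustration, not real monitoring hosts.
    for candidate in ("129.79.53.10", "129.79.54.10"):
        print(candidate, in_monitoring_subnet(candidate))
```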

16 Production OMD for LHCONE perfSONARs
https://pfomd.grid.iu.edu/WLCGperfSONAR/check_mk/
• Notice the difference in what the production instance measures vs. the prototype instance.
• Certain hosts are not allowing ICMP pings from the OSG subnet.
• Some checks are not working from the production host on certain systems.

17 Production OMD for LHCOPN perfSONARs
https://pfomd.grid.iu.edu/WLCGperfSONAR/check_mk/
• Similar issues for the production LHCOPN monitoring.
• BNL hosts are not allowing ICMP pings from the OSG subnet.
• Some checks are not working from the production host on certain systems.

18 OSG Network Datastore
• All perfSONAR metrics should be collected into the OSG network datastore
  • This is an Esmond datastore from perfSONAR (PostgreSQL + Cassandra backends)
  • Loaded via RSV probes; currently one probe per perfSONAR instance every 15 minutes
  • Probes have a bug: they record TWICE the BW measured by the node
• Datastore on pfds.grid.iu.edu
  • JSON at http://pfds.grid.iu.edu/esmond/perfsonar/archive/?format=json
  • Python API at http://software.es.net/esmond/perfsonar_client.html
  • Perl API at https://code.google.com/p/perfsonar-ps/wiki/MeasurementArchivePerlAPI
• Currently the datastore is down for debugging resource usage
• All LHCONE and (LHC)OPN data should be stored there
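The JSON endpoint listed above can be queried with filters appended to the URL. The sketch below only builds such a query URL; it does not contact the datastore. The filter names `event-type` and `time-range` follow the Esmond perfSONAR REST interface as I understand it and should be checked against the Esmond documentation before use; the `archive_query_url` helper is hypothetical.

```python
from urllib.parse import urlencode

# Archive endpoint quoted on the slide.
ARCHIVE_URL = "http://pfds.grid.iu.edu/esmond/perfsonar/archive/"


def archive_query_url(base_url=ARCHIVE_URL, **filters):
    """Build a metadata query URL that asks for JSON output.

    Keyword names use underscores and are converted to the hyphenated
    form (event_type -> event-type). The filter names themselves are
    assumptions to verify against the Esmond documentation.
    """
    params = {"format": "json"}
    for key, value in filters.items():
        params[key.replace("_", "-")] = value
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Throughput metadata from the last 24 hours (86400 seconds).
    print(archive_query_url(event_type="throughput", time_range=86400))
```

The resulting URL can be passed to any HTTP client once the datastore is back in service.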

19 Network Datastore Access via JSON

20 PuNDIT Project

21 Next Steps
• We are working on getting ALL WLCG/OSG perfSONAR instances fully updated and properly configured
  • Need to be reliably gathering all network metrics centrally
  • Feb 16 is the deadline for sites to update and configure instances
• There are some bugs we know of in the data acquisition chain that need fixing. Ongoing effort on this
• As we fix known issues and get to reliable operation, we can free up time to pursue possible issues in the network itself, rather than in the framework that gets us network metrics

22 Discussion/Questions/Comments?

23 Useful URLs
• Open Science Grid Networking URL
  • https://www.opensciencegrid.org/bin/view/Documentation/NetworkingInOSG
• LHCOPN instructions for perfSONAR-PS (needs update):
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/PerfsonarPS
• MaDDash Monitoring
  • http://maddash.aglt2.org/maddash-webui/index.cgi?dashboard=LHCONE%20testing%20sites
  • http://pfmad.grid.iu.edu/maddash-webui/index.cgi?dashboard=LHCONE%20Mesh%20Config
• OMD Monitoring
  • https://maddash.aglt2.org/WLCGperfSONAR/check_mk/index.py?start_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhostgroup%26hostgroup%3DLHCONE
  • https://pfomd.grid.iu.edu/WLCGperfSONAR/check_mk/index.py?start_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhostgroup%26hostgroup%3DLHCONE

24 LHCONE Network Matrices: 28Apr2014
Panels: OWAMP (Latency) and BWCTL (Bandwidth)
Legend:
• OWAMP: no packet loss; packet loss > 0.01
• BWCTL: BW > 0.9 Gb; 0.5 < BW < 0.9 Gb; BW < 0.5 Gb
• Orange: missing measurements/data
Main issue was too much “orange”, indicating missing measurements/data.
Sources are “row”, destination is “column”.
Each box is split into two regions indicating where the test is run: top corresponds to “row”, bottom to “column”.
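The bandwidth thresholds in the legend (0.5 and 0.9 Gb) and the “orange means missing” convention can be expressed as a small classifier. This is a sketch under assumptions: the tier names, the handling of values exactly at a threshold, and the use of `None` for a missing measurement are mine, not MaDDash's actual configuration.

```python
def classify_bandwidth(gbps):
    """Map a throughput value in Gb to a tier; None means no measurement."""
    if gbps is None:
        return "missing"  # shown as orange in the matrices
    if gbps < 0.5:
        return "low"
    if gbps < 0.9:
        return "medium"
    return "high"  # assumption: exactly 0.9 counts as the top tier


def missing_fraction(measurements):
    """Return the share of cells that have no measurement at all."""
    values = list(measurements)
    if not values:
        return 0.0
    return sum(1 for value in values if value is None) / len(values)


if __name__ == "__main__":
    # Made-up sample values, not real LHCONE results.
    cells = [0.95, None, 0.3, 0.7, None]
    print([classify_bandwidth(value) for value in cells])
    print(missing_fraction(cells))
```

Tracking the missing fraction over time gives a single number for the “too much orange” problem noted on these slides.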

25 LHCONE Network Matrices: 11Aug2014
Panels: OWAMP (Latency) and BWCTL (Bandwidth)
Legend:
• OWAMP: no packet loss; packet loss > 0.01
• BWCTL: BW > 0.9 Gb; 0.5 < BW < 0.9 Gb; BW < 0.5 Gb
• Orange: missing measurements/data
Main issue is STILL too much “orange”, indicating missing measurements/data.
Sources are “row”, destination is “column”.
Each box is split into two regions indicating where the test is run: top corresponds to “row”, bottom to “column”.

26 LHCONE Network Matrices: 15Sep2014
Panels: OWAMP (Latency) and BWCTL (Bandwidth)
Legend:
• OWAMP: no packet loss; packet loss > 0.01
• BWCTL: BW > 0.9 Gb; 0.5 < BW < 0.9 Gb; BW < 0.5 Gb
Improvements since the APAN meeting… mostly due to the work of Jason Zurawski (see later slides).
Still a little orange remaining… some problems seem to be recurring after we have fixed them.
Also, we have MOST of the needed people in the room now… can we fix the rest?

