Presentation on theme: "IEPM-BW. Warren Matthews (SLAC). Presented at the UCL Monitoring Infrastructure Workshop, London, May 15-16, 2003."— Presentation transcript:

1 IEPM-BW. Warren Matthews (SLAC). Presented at the UCL Monitoring Infrastructure Workshop, London, May 15-16, 2003.

2 Overview / Goals
– IEPM-BW monitoring and results
– Other measurements
– Publishing
– Troubleshooting Tools
– Further work

3 IEPM-BW
– SLAC package for monitoring and analysis
– Currently 10 monitoring sites: SLAC, FNAL, GATech (SOX), INFN (Milan), NIKHEF, APAN (Japan), Manchester, UMich, UCL, Internet2
– 2-36 targets

4 (Slide image: map of monitoring sites and targets spanning ESnet, Abilene, CalREN, JAnet, NNW, Geant, and CESnet, with monitoring sites marked at SLAC, FNAL, GATech (SOX), INFN-Milan, NIKHEF, APAN, Manchester, UMich, UCL, and Internet2, plus EDG and PPDG/GriPhyN targets)

5 Measurement Engine
– Ping, Traceroute
– Iperf, Bbftp, Bbcp (mem and disk)
– ABwE
– GridFTP, UDPmon
– Web100
– Passive (Netflow)

6 (image-only slide; no transcript text)

7 Other Projects (U.S.)
– PingER (SLAC, FNAL)
– eJDS (SLAC, ICTP)
– AMP (NLANR)
– NIMI (ICIR, PSC)
– MAGGIE (ICIR, PSC, SLAC, LBL, ANL)
– NASA, SCNM (LBL)
– Surveyor (Internet2)
– E2E PI and PIPES (Internet2)
– Also, SLAC has a RIPE-TT box

8 Publishing
– Web Service
  – SOAP::Lite Perl module
  – Python
  – Java
– NMWG
– OGSA

9 Publishing
– NMWG Properties document
– Path.delay.roundtrip (demo)
– Hop.bandwidth.capacity (tracespeed)
– Guthrie (demo)
  – Almost 1000 nodes in database
  – PingER Networks
  – Arena

10 Advisor. Screenshot taken from the talk by Jim Ferguson at the e2e workshop, Miami, Feb 2003.

11 MonaLisa
– Front-end visualization
– Vital component for development of the LHC Computing Model
– JINI/Java and WSDL/SOAP (demo)

12 Troubleshooting
– RIPE-TT Testbox Alarm
– AMP Automatic Event Detection
– Our approach: diurnal changes

13 Diurnal Changes (1/4)
– Either performance varies during the day, or it doesn't
– No variation is the special case of variation = 0

14 Diurnal Changes (2/4)
– Either performance (within the bin) is variable, or it isn't
– No variation is the special case of variation = 0

15 Diurnal Changes (3/4)
– Parameterize performance in terms of the hour and the variability within that hourly bin (sketched below)
– Measurements can be classified by how far they differ from the historical value
– Recent problems are flagged by their difference from the historical value
– Compare to the measurement in the previous bin to reduce false positives
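As an illustration of the hourly parameterization, here is a minimal Perl sketch; the bin_key helper, the %history hash, and the day-hour key format are illustrative assumptions, not the actual IEPM-BW code.

use strict;
use warnings;
use POSIX qw(strftime);

my %history;   # bin key -> list of measurements seen in that bin

# Key a measurement by day-of-week and hour, e.g. "Mon-19" for Monday 7pm-8pm
sub bin_key {
    my ($epoch) = @_;
    return strftime('%a-%H', localtime($epoch));
}

# Accumulate an iperf throughput (value in Mbits/s, assumed) into its hourly bin
push @{ $history{ bin_key(time()) } }, 84.4;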

16 Diurnal Changes (4/4)
– Calculate the median and standard deviation of the last five measurements in the bin, e.g. Monday 7pm-8pm
– "Concerned" if the latest measurement is more than 1 s.d. from the median
– "Alarmed" if the latest measurement is more than 2 s.d. from the median (see the sketch below)
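A minimal Perl sketch of this classification, assuming the per-bin history from the previous sketch; the classify subroutine, the population standard deviation, and the sample values are illustrative assumptions rather than the actual MAGGIE code.

use strict;
use warnings;
use List::Util qw(sum);

# Classify the latest measurement against the last five in the same bin
sub classify {
    my ($latest, @history) = @_;
    my @sorted = sort { $a <=> $b } @history;
    my $median = $sorted[ int(@sorted / 2) ];
    my $mean   = sum(@history) / @history;
    my $sd     = sqrt( sum( map { ($_ - $mean) ** 2 } @history ) / @history );
    my $diff   = abs($latest - $median);
    return 'Alarm'   if $diff > 2 * $sd;   # more than 2 s.d. from the median
    return 'Concern' if $diff > 1 * $sd;   # more than 1 s.d. from the median
    return 'Within boundaries';
}

# e.g. iperf throughputs (Mbits/s) from the Monday 7pm-8pm bin
print classify(0.51, 80.1, 85.3, 83.7, 90.2, 84.4), "\n";   # prints "Alarm"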

17 Trouble Detection
$ tail maggie.log
04/28/2003 14:58:47 (1:14) gnt4 0.51 Alarm (AThresh=38.33)
04/28/2003 16:25:45 (1:16) gnt4 3.83 Concern (CThresh=87.08)
04/28/2003 17:55:21 (1:17) gnt4 169.57 Within boundaries
Columns: date and time, bin, node, throughput (iperf), status
– Only write to the log if an alarm is triggered
– Keep writing to the log until the alarm is cleared
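A minimal Perl sketch of the write-until-cleared rule above; the %in_alarm state hash, the log_status subroutine, and the line format are assumptions for illustration.

use strict;
use warnings;
use POSIX qw(strftime);

my %in_alarm;   # per-node flag: set on an alarm, cleared once back within boundaries

sub log_status {
    my ($node, $bin, $value, $status) = @_;
    $in_alarm{$node} = 1 if $status eq 'Alarm';   # start logging on an alarm
    return unless $in_alarm{$node};               # otherwise stay quiet
    my $ts = strftime('%m/%d/%Y %H:%M:%S', localtime);
    print "$ts ($bin) $node $value $status\n";
    delete $in_alarm{$node} if $status eq 'Within boundaries';   # alarm cleared
}

log_status('gnt4', '1:14', 0.51,   'Alarm');               # written
log_status('gnt4', '1:16', 3.83,   'Concern');             # still written
log_status('gnt4', '1:17', 169.57, 'Within boundaries');   # written, then cleared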

18 Trouble Status
– Tempted to make a color-coded web page
– All the hard work is still left to do
– Use knowledge to see a common point of failure
– A production table would be >> 36x700
– Instead, figure out where to flag

19 Net Rat
– Alarm System
  – Multiple tools
  – Multiple measurement points
– Cross reference
  – Trigger further measurements
  – Starting point for human intervention
  – Informant database: hop.performance
– No measurement is 'authoritative'
  – Cannot even believe a measurement

20 Limitations
– Could be over an hour before an alarm is generated
– More frequent measurements impact the network, and measurements overlap
– Low-impact tools allow finer-grained measurement

21 Where next?
– GLUE, OGSA, CIM
– Work with other projects
– Publishing and troubleshooting
– Discovery
– Security

22 Toward a Monitoring Infrastructure
– Certainly the need
  – DOE Science Community
  – Japanese Earth Simulator
  – Grid
  – Troubleshooting / E2Epi
– Many of the ingredients
  – Many monitoring projects
  – PIPES
  – MAGGIE

23 Summary: "It is widely believed that a ubiquitous monitoring infrastructure is required."

24 Links
– This talk
– IEPM-BW
– PingER
– ABwE
– AMP
– NIMI
– MAGGIE
– RIPE-TT
– Surveyor
– E2E PI
– SLAC Web Services
– GGF NMWG
– Arena
– MonaLisa
– Advisor
– Troubleshooting

25 Credits
– Les Cottrell
– Connie Logg, Jerrod Williams
– Jiri Navratil
– Fabrizio Coccetti
– Brian Tierney
– Frank Nagy, Maxim Grigoriev
– Eric Boyd, Jeff Boote
– Vern Paxson, Andy Adams
– Iosif Legrand
– Jim Ferguson, Steve Englehart
– Local admins and other volunteers
– DoE/MICS

26 Demos
This is the output from the "Publishing" demo on slide 9.

$ more soap_client.pl
#!/usr/local/bin/perl
use SOAP::Lite;
print SOAP::Lite
  -> service('http://www-iepm.slac.stanford.edu/tools/soap/wsdl/profile_0002.wsdl')
  -> hopBandwidthCapacity("brdr.slac.stanford.edu:i2-gateway.stanford.edu");

$ ./soap_client.pl
1000Mb

27 Demos
This is the output from the "tracespeed" demo on slide 9.

$ ./tracespeed thunderbird.internet2.edu
0 doris 10Mb
1 core (134.79.122.32) 1000Mb
2 brdr (134.79.235.45) 1000Mb
3 i2-gateway.stanford.edu (192.68.191.83) No Data.
4 stan.pos.calren2.net (171.64.1.213) No Data.
5 sunv--stan.pos.calren2.net (198.32.249.73) No Data.
6 abilene--qsv.pos.calren2.net (198.32.249.162) No Data.
7 kscyng-snvang.abilene.ucaid.edu (198.32.8.103) No Data.
8 iplsng-kscyng.abilene.ucaid.edu (198.32.8.80) No Data.
9 so-0-2-0x1.aa1.mich.net (192.122.183.9) No Data.
10 so-0-0-0x0.ucaid2.mich.net (198.108.90.118) No Data.
11 thunderbird.internet2.edu (207.75.164.95) No Data.

28 Aside: NetRat (1/5)
– If the last measurement was within 1 s.d., mark each hop as good (hop.performance = good)
– If the last measurement was a "Concern", mark each hop as acceptable
– If the last measurement was an "Alarm", mark each hop as poor

29 Aside: NetRat (2/5)
– A measurement generates an alarm
– Set each hop.performance = poor

30 Aside: NetRat (3/5)
– Other measurements from the same site do not generate alarms
– Set each hop.performance = good
– This immediately rules out a problem in the local LAN or host machine

31 Aside: NetRat (4/5)
– A different site monitors the same target
– No alarm is generated
– Set each hop.performance = good
– This pinpoints a possible problem in the intermediate network (see the sketch below)
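A minimal Perl sketch of the hop-marking logic in this NetRat aside (slides 28-31); the %hop_performance hash, the mark_path subroutine, and the hop names are hypothetical, and the informant database itself is not shown.

use strict;
use warnings;

my %hop_performance;   # informant data: hop -> good / acceptable / poor

# Map the status of the last measurement to a rating and apply it to
# every hop on that measurement's traceroute path.
sub mark_path {
    my ($status, @hops) = @_;
    my %rating = ('Within boundaries' => 'good',
                  'Concern'           => 'acceptable',
                  'Alarm'             => 'poor');
    $hop_performance{$_} = $rating{$status} for @hops;
}

# An alarming measurement marks its whole path as suspect (hop names hypothetical)
mark_path('Alarm', qw(local-lan border intermediate target-lan));

# Other measurements from the same site see no alarm: local LAN is ruled out
mark_path('Within boundaries', qw(local-lan border elsewhere));

# A different site reaching the same target also sees no alarm: shared hops are
# re-marked good, leaving only the intermediate network flagged poor
mark_path('Within boundaries', qw(other-lan target-lan));

printf "%-12s %s\n", $_, $hop_performance{$_} for sort keys %hop_performance;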

