John Gordon and LCG and Grid Operations John Gordon CCLRC e-Science Centre, UK LCG Grid Operations.

1 John Gordon j.c.gordon@rl.ac.uk and LCG and Grid Operations John Gordon CCLRC e-Science Centre, UK LCG Grid Operations

2 Outline
The monitoring tools
How we use them in operations
What is still to be done

3 Grid Operations
Once middleware has been developed, tested and deployed, grid operations are the set of actions and procedures that keep a grid running for its users.

4 The Vision
GOC Processes and Activities
–Coordinating Grid Operations
–Defining Service Level Parameters
–Monitoring Service Performance Levels
–First-Level Fault Analysis
–Interacting with Local Support Groups
–Coordinating Security Activities
–Operations Development

5 Have we delivered?
–Coordinating Grid Operations
–Defining Service Level Parameters
–Monitoring Service Performance Levels
–First-Level Fault Analysis
–Interacting with Local Support Groups
–Coordinating Security Activities
–Operations Development
Yes, RAL, CERN & Taipei No up or down Yes Policies, not operation Monitoring and accounting

6 Monitoring the Grid is a Challenge!

7 Monitoring Overview
Why We Monitor
–Keep systems up and running
–Notice failures; grid-wide services MDS
–Knowing what services a site should be running
–no point raising an alert if the site isn’t meant to run it!
–definition of services and which sites run them (SLA)
What Tools Do We Use
–Job Submission; GridICE; Nagios; GIIS Monitor
How
–Database
Developments Planned
–Nagios
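The point about not alerting on services a site is not meant to run can be sketched in a few lines of Python. This is only an illustration: the site names and the EXPECTED_SERVICES table are hypothetical, standing in for whatever agreed definition of services per site (the SLA) is actually used.

```python
# Hypothetical map of which services each site has agreed to run.
# In practice this would come from the agreed service definition (SLA).
EXPECTED_SERVICES = {
    "site-a": {"ce", "se"},
    "site-b": {"ce", "se", "rb", "bdii"},
}

def alerts_to_raise(failed_checks, expected=EXPECTED_SERVICES):
    """Keep only failures for services the site is meant to run.

    failed_checks is an iterable of (site, service) pairs.
    """
    return [
        (site, service)
        for site, service in failed_checks
        if service in expected.get(site, set())
    ]

# site-a is not expected to run an RB, so that failure is dropped.
failures = [("site-a", "rb"), ("site-a", "ce"), ("site-b", "rb")]
print(alerts_to_raise(failures))
```

Unknown sites fall through to an empty set here, so they never alert; whether that is the right default is a policy choice.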

8 Monitoring Challenges
We have only fragmentary information about the services that sites are running.
We don’t know what RBs/SEs/sites the VOs are using for data challenges.
We don’t know what the core services are and who is running them.
We don’t have a toolkit to test specific core services.
We have to concentrate on the functional behaviour of services, e.g. if an RB sends your job to a CE, then we must assume the RB is working fine. Is this the only test of an RB?
Not all the tests that we perform are effective at finding problems, so we must take tests written by the experts and integrate them into GOC monitoring.
We must develop tests which simulate the life cycle of real applications in a Grid environment.
There are lots of monitoring tools available, so we need to bring them together.
Do we spend time investigating new tools, or make the ones which we already have better?
…and probably lots more!

9 Monitoring Services
There are many frameworks which can be used to monitor distributed environments
–MAPCENTER http://mapcenter.in2p3.fr/
–GPPMON http://goc.grid-support.ac.uk/
–GRIDICE http://grid-ice.esc.rl.ac.uk
–NAGIOS http://www.nagios.org/
–MONALISA http://monalisa.cacr.caltech.edu/
–GIIS Monitor http://goc.grid.sinica.edu.tw/gstat/
–Ganglia
Example: MapCenter, 30 sites ~ 500 lines in config file (static version)
Example: Nagios, 30 sites, 12 individual config files with dependencies
Developed tools to configure these services to make the job easier: NAGIOS, MAPCENTER and GPPMON
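Generating monitoring configuration from a single list of nodes, rather than editing files by hand, might look like the Python sketch below. It is not the GOC tool itself. The node records and host names are made up, and "generic-host" is assumed to be a host template defined elsewhere in the Nagios configuration, as in the sample configs shipped with Nagios.

```python
# Nagios object definition for one host; {{ and }} are literal braces.
HOST_TEMPLATE = """define host {{
    use         generic-host
    host_name   {host}
    alias       {site} {node_type}
    address     {host}
}}
"""

def nagios_hosts(nodes):
    """Render one 'define host' block per node, sorted by host name."""
    ordered = sorted(nodes, key=lambda n: n["host"])
    return "\n".join(HOST_TEMPLATE.format(**node) for node in ordered)

# Hypothetical records; in practice these would be read from a site database.
nodes = [
    {"site": "site-b", "host": "se.site-b.example.org", "node_type": "se"},
    {"site": "site-a", "host": "ce.site-a.example.org", "node_type": "ce"},
]
print(nagios_hosts(nodes))
```

Sorting the output keeps the generated file stable between runs, which makes changes easy to spot with diff.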

10 GOC Configuration Database
[Diagram] GOC GridSite server backed by MySQL; Resource Centres (RC) manage their entries over https; monitoring uses the database via SQL.
Secure Database Management via HTTPS / X.509
Resource Centre Resources & Site Information: EDG, LCG-1, LCG-2, …; node types ce, se, bdii, rb
People, Contact Information, Resources
Scheduled Maintenance
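A minimal sketch of the kind of schema such a configuration database could hold, using an in-memory SQLite database in Python as a stand-in for MySQL. The table and column names are assumptions for illustration only; the slide does not give the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (id INTEGER PRIMARY KEY, name TEXT, contact_email TEXT);
CREATE TABLE node (id INTEGER PRIMARY KEY,
                   site_id INTEGER REFERENCES site(id),
                   hostname TEXT,
                   node_type TEXT);          -- ce, se, bdii, rb
CREATE TABLE maintenance (site_id INTEGER REFERENCES site(id),
                          start_utc TEXT, end_utc TEXT);
""")
conn.executemany("INSERT INTO site VALUES (?, ?, ?)", [
    (1, "site-a", "admin@site-a.example.org"),
    (2, "site-b", "admin@site-b.example.org"),
])
conn.executemany("INSERT INTO node (site_id, hostname, node_type) VALUES (?, ?, ?)", [
    (1, "ce.site-a.example.org", "ce"),
    (2, "ce.site-b.example.org", "ce"),
    (2, "se.site-b.example.org", "se"),
])
# site-b has declared a maintenance window on the morning of 21 July 2004.
conn.execute("INSERT INTO maintenance VALUES (2, '2004-07-21T08:00', '2004-07-21T12:00')")

def nodes_to_monitor(conn, now_utc):
    """Return (site, hostname, node_type) for sites not in maintenance at now_utc.

    Timestamps are ISO-8601 strings, so string comparison orders them correctly.
    """
    return conn.execute("""
        SELECT s.name, n.hostname, n.node_type
        FROM node n JOIN site s ON s.id = n.site_id
        WHERE NOT EXISTS (
            SELECT 1 FROM maintenance m
            WHERE m.site_id = s.id AND m.start_utc <= ? AND ? < m.end_utc)
        ORDER BY s.name, n.hostname
    """, (now_utc, now_utc)).fetchall()

print(nodes_to_monitor(conn, "2004-07-21T10:00"))
```

Keeping scheduled maintenance in the same database lets the monitoring skip sites that have announced downtime instead of reporting them as failures.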

11 GPPMON - 2
GOC Job Submission Flow Diagram: simple job forked on CE using Globus
–GOC (UI) builds the list of CE and RB resources with an SQL query on the site DB
–Job script GLOBUS.CE is created
–Job is sent to the CE with globus-job-run; "sent" acknowledgement recorded
–Job on the CE runs wget http://goc_ui/ack.cgi?GLOBUS.CE
–"received" acknowledgement recorded
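The send-and-acknowledge flow for the Globus test can be sketched as follows. This is an illustration under assumptions, not the GPPMON code: the CE host names and the wget path are hypothetical, and the slide's "GLOBUS.CE" tag is assumed to have the CE name substituted for "CE". The sketch only builds the command lines and does not run them.

```python
ACK_URL = "http://goc_ui/ack.cgi"  # host name exactly as written on the slide

def globus_test_command(ce_host, wget="/usr/bin/wget"):
    """Build the globus-job-run command line for one CE (not executed here)."""
    return ["globus-job-run", ce_host, wget, "-q", "-O", "/dev/null",
            f"{ACK_URL}?GLOBUS.{ce_host}"]

class AckTracker:
    """Records which CEs were sent a job and which have called back."""

    def __init__(self):
        self.state = {}

    def sent(self, ce_host):
        self.state[ce_host] = "sent"

    def received(self, ce_host):
        # Ignore callbacks for jobs we never sent.
        if ce_host in self.state:
            self.state[ce_host] = "received"

    def unacknowledged(self):
        return sorted(ce for ce, s in self.state.items() if s == "sent")

# Hypothetical CE list; in the real flow this comes from the site DB query.
ce_list = ["ce.site-a.example.org", "ce.site-b.example.org"]
tracker = AckTracker()
for ce in ce_list:
    print(" ".join(globus_test_command(ce)))
    tracker.sent(ce)

tracker.received("ce.site-a.example.org")  # simulate one callback to ack.cgi
print(tracker.unacknowledged())
```

A CE that stays in the "sent" state after a timeout is one where the job did not run or could not call back.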

12 GPPMON - 3
Simple job through local jobmanager on CE via Resource Broker
–GOC (UI) builds the list of CE and RB resources with an SQL query on the site DB
–Job script RB.CE is created
–Job is sent to the RB with edg-job-submit; "sent" acknowledgement recorded
–RB job matchmaking selects the CE using other.GlueCEUniqueID
–Job runs on a WN behind the CE and calls wget http://goc_ui/ack.cgi?RB.CE
–"received" acknowledgement recorded
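For the Resource Broker test, the job description has to pin the job to one CE so that matchmaking cannot send it elsewhere. A hedged Python sketch that writes such a JDL file is shown below. The CE identifier, the wget path and the way the acknowledgement tag is built are assumptions; only the use of other.GlueCEUniqueID and the ack.cgi URL come from the slide.

```python
def rb_test_jdl(ce_id, ack_url="http://goc_ui/ack.cgi"):
    """Return JDL text for a test job that must run on the CE named by ce_id.

    ce_id is a GlueCEUniqueID such as host:port/jobmanager-queue.
    """
    ce_host = ce_id.split(":")[0]  # use only the host part in the callback tag
    lines = [
        'Executable = "/usr/bin/wget";',
        f'Arguments = "-q -O /dev/null {ack_url}?RB.{ce_host}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        'OutputSandbox = {"std.out", "std.err"};',
        # Restrict matchmaking to exactly this CE.
        f'Requirements = other.GlueCEUniqueID == "{ce_id}";',
    ]
    return "\n".join(lines) + "\n"

# Hypothetical CE identifier for illustration.
print(rb_test_jdl("ce.site-a.example.org:2119/jobmanager-pbs-short"))
```

The resulting file would then be passed to edg-job-submit, once per CE under test.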

13 GPPMON – 1
LCG2 Site Status: 21 July 2004, 10.00am

14 GRIDICE - 1
http://grid-ice.esc.rl.ac.uk/gridice

16 Ganglia Monitoring - 1
http://gridpp.ac.uk/ganglia
Can use Ganglia to monitor a cluster
RAL Tier-1 Centre LCG PBS Server displays job status for each VO

17 Ganglia Monitoring - 2
Can also use Ganglia to monitor clusters of clusters

18 Regional Monitoring - 1
Provide ROCs with a package to monitor the resources in the region
–Tailored monitoring
–ROCs may upload their own maps
–Java GUI to automate site locations on the map
Hierarchical view of resources
–Example: GridPP made up of virtual T2 centres
[Diagram] EGEE: France, UK/I, S.E.E; UK/I: GridPP; GridPP: LondonT2, ScotGrid; LondonT2: IMPERIAL, QMUL; ScotGrid: Edinburgh
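A hierarchical view implies rolling site status up through regions. The Python sketch below uses one reading of the hierarchy named on the slide (EGEE, UK/I, GridPP, the virtual Tier-2s and their sites); the parent-child links and the status values are assumptions for illustration.

```python
# Assumed parent -> children links, read off the names on the slide.
TREE = {
    "EGEE": ["France", "UK/I", "S.E.E"],
    "UK/I": ["GridPP"],
    "GridPP": ["LondonT2", "ScotGrid"],
    "LondonT2": ["IMPERIAL", "QMUL"],
    "ScotGrid": ["Edinburgh"],
}

def leaf_sites(node, tree=TREE):
    """Return all leaves below node (a node with no children is its own leaf)."""
    children = tree.get(node, [])
    if not children:
        return [node]
    leaves = []
    for child in children:
        leaves.extend(leaf_sites(child, tree))
    return leaves

def fraction_ok(node, site_ok, tree=TREE):
    """Fraction of monitored sites below node that are OK, or None if none are monitored."""
    sites = [s for s in leaf_sites(node, tree) if s in site_ok]
    if not sites:
        return None
    return sum(1 for s in sites if site_ok[s]) / len(sites)

# Made-up status values.
site_ok = {"IMPERIAL": True, "QMUL": False, "Edinburgh": True}
for region in ("LondonT2", "ScotGrid", "GridPP", "EGEE"):
    print(region, fraction_ok(region, site_ok))
```

The same roll-up works at any level, so a ROC can look at its own subtree while a grid-wide view starts at the root.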

19 GPPMON – 1
LCG2 Site Status: 21 July 2004, 10.00am

20 Regional Monitoring - 2
http://goc.grid-support.ac.uk/roc_map/map.php
Active map to select individual regions

21 Regional Monitoring - 3
UK/I Monitoring displays GridPP and NGS resources.

22 Replica Manager Tests - 1
GOC to take over site certification testing, which is done by the CERN deployment team on a daily basis (e.g. reports by Piotr Nyczyk)
First step toward this involved running a series of replica manager tests which register files onto the grid, move them around and delete them, and make 3rd party copies from a remote SE, e.g. Castorgrid
Demonstrates that we can integrate other people’s tools into GPPMON
Development of a portal which will:
–Make it easy to retrieve debug information from the job output.
–Connect with information provided by other monitoring tools, e.g. Taipei GIIS Monitor.
–Provide testing “on-demand” to site administrators through a secure interface.

23 Replica Manager Tests - 2
http://goc.grid-support.ac.uk/gridsite/status/rmtest.php?action=table
Results of each test are shown as a coloured index on the map.
Distinguishes between jobs that have completed, have failed, or are still running.

24 Replica Manager Tests - 3
Description of the tests
Job Outputs
GIIS Monitor Information

25 GIIS Monitor
Developed by Min Tsai (GOC Taipei)
Tool to display and check information published by the site GIIS
http://goc.grid.sinica.edu.tw/gstat/

26 Job Accounting - 1
http://goc.grid-support.ac.uk/ROC/docs/accounting/accounting.php
Program publishes PBS log file information through R-GMA to the GOC
GOC aggregates data across all sites.
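As a rough idea of what extracting usage from a PBS log involves, the Python sketch below parses a job-end record from a PBS accounting file. It assumes the usual accounting layout (timestamp;record type;job id;space-separated key=value pairs) and is not the publishing program itself; the sample line and names in it are made up, and the R-GMA publishing step is left out.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS duration string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def parse_pbs_end_record(line):
    """Return a dict for a job-end ('E') record, or None for anything else."""
    parts = line.rstrip("\n").split(";", 3)
    if len(parts) != 4 or parts[1] != "E":
        return None
    stamp, _, job_id, message = parts
    fields = dict(item.split("=", 1) for item in message.split() if "=" in item)
    return {
        "job_id": job_id,
        "ended": stamp,
        "user": fields.get("user"),
        "group": fields.get("group"),
        "cpu_s": hms_to_seconds(fields.get("resources_used.cput", "0:0:0")),
        "wall_s": hms_to_seconds(fields.get("resources_used.walltime", "0:0:0")),
    }

# Made-up sample record for illustration.
sample = ("07/21/2004 10:00:00;E;1234.pbs.example.org;"
          "user=alice group=atlas queue=short "
          "resources_used.cput=00:10:00 resources_used.walltime=00:12:30")
print(parse_pbs_end_record(sample))
```

Only end records carry the resources used, so start and queue records are skipped.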

27 Job Accounting - 2
Offline testing of program using data from the CORE sites completed.
Development of an accounting portal underway to provide accounting on-demand for each site, and aggregated for each EGEE region
Challenge! Deal with large database: 1 row per LCG PBS job per site!
http://goc-dev.esc.rl.ac.uk/jpg/goc_demo.php
http://goc-dev.esc.rl.ac.uk/jpg/goc_demo3.php
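With one row per job per site, on-demand views are easier to serve from pre-aggregated summaries. A minimal Python sketch of such an aggregation, using SQLite in memory as a stand-in for the real database; the table layout and the data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job (site TEXT, job_id TEXT, vo TEXT, end_date TEXT,
                      cpu_s INTEGER, wall_s INTEGER,
                      PRIMARY KEY (site, job_id))
""")
# Made-up rows: one per job per site.
conn.executemany("INSERT INTO job VALUES (?, ?, ?, ?, ?, ?)", [
    ("site-a", "1", "atlas", "2004-07-20", 600, 750),
    ("site-a", "2", "atlas", "2004-07-21", 400, 500),
    ("site-a", "3", "cms",   "2004-07-21", 100, 120),
    ("site-b", "1", "atlas", "2004-08-01", 50, 60),
])

def monthly_summary(conn):
    """Aggregate job rows to one row per (site, VO, month)."""
    return conn.execute("""
        SELECT site, vo, substr(end_date, 1, 7) AS month,
               COUNT(*), SUM(cpu_s), SUM(wall_s)
        FROM job
        GROUP BY site, vo, month
        ORDER BY site, vo, month
    """).fetchall()

for row in monthly_summary(conn):
    print(row)
```

A portal could query a summary table like this per site or sum it again per region, without touching the per-job rows.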

28 GridPP Accounting

29 EDG-network monitoring

30 Security
Worked with Security Group
Defined a Security Policy
–and auditing procedures
Have a list for security contacts
–but not really exercised it yet
–still need to define procedures in the event of security incidents

31 Keeping the Work Flowing
Regular monitoring of job submission
–shows sites that have problems running jobs
Nagios tracks individual services
–plus certificate lifetime
RM tests show whether data can be moved
GridICE and Ganglia show what is running
Limited by RB behaviour
–we can see that jobs are not getting to sites but not why.
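Tracking certificate lifetime amounts to comparing the certificate's expiry time with the current time and mapping the result to a Nagios-style state. A small Python sketch, assuming a 14-day warning threshold (an arbitrary choice for illustration) and that the expiry time has already been read from the certificate:

```python
from datetime import datetime, timedelta, timezone

def certificate_status(not_after, now, warn_days=14):
    """Map remaining certificate lifetime to OK / WARNING / CRITICAL."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "CRITICAL"   # already expired
    if remaining <= timedelta(days=warn_days):
        return "WARNING"    # expires soon
    return "OK"

now = datetime(2004, 7, 21, 10, 0, tzinfo=timezone.utc)
for days_left in (30, 7, -1):
    print(days_left, certificate_status(now + timedelta(days=days_left), now))
```

Passing the current time in as an argument keeps the check easy to test.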

32 What have we delivered?
A set of monitoring tools
A monitoring regime
Two GOCs (RAL and Taipei)
Security Policy

33 Still to do
Effective problem tracking
–we see site problems and get them fixed
–but don’t manage long-term problems
Integration with User Support
–we track problems we see
–but problems users notice not effectively dealt with
Automatic alerts
–Nagios does but EMS from Taipei looks promising
Remote repair
–agents until middleware can support this directly
Security
Deploy accounting
Distribute monitoring to EGEE ROCs and others

34 What Next? (1)
RSS used to send tailored streams
–sites, ROCs, management can all decide what to subscribe to
Accounting
–being tested in LCG C&T testbed
–should be in next LCG release
–Then get T2 accounts
keep your PBS logs, msgs and gatekeeper logs

35 Monitoring Feeds
GOC server generates a lot of monitoring information.
Need a way to give this information to the right people, e.g. site administrators
Really Simple Syndication (RSS) is an XML format
Used by many sites which want to syndicate content, e.g. BBC, Slashdot
Client pull model: GOC creates RSS-formatted documents; clients pull these feeds and render them as HTML.
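To make the pull model concrete, here is a small Python sketch that builds an RSS 2.0 document from a list of test results using the standard library. The feed title, links and result items are invented for illustration and are not the GOC's actual feeds.

```python
import xml.etree.ElementTree as ET

def build_rss(title, link, description, items):
    """Return an RSS 2.0 document as a string.

    items is a list of dicts with 'title', 'link' and 'description' keys.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for tag, text in (("title", title), ("link", link), ("description", description)):
        ET.SubElement(channel, tag).text = text
    for item in items:
        node = ET.SubElement(channel, "item")
        for tag in ("title", "link", "description"):
            ET.SubElement(node, tag).text = item[tag]
    return ET.tostring(rss, encoding="unicode")

# Hypothetical test results for one CE.
results = [
    {"title": "Job submission: OK", "link": "http://goc.example.org/ce/1",
     "description": "Test job ran and acknowledged."},
    {"title": "Replica manager: FAILED", "link": "http://goc.example.org/ce/2",
     "description": "Copy to remote SE timed out."},
]
feed = build_rss("CE test results", "http://goc.example.org/", "Latest tests", results)
print(feed)
```

The server only has to write this file to a web-accessible location; each subscriber's aggregator fetches it on its own schedule.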

36 Aggregator
RSSReader (Windows Client)
GOC generates RSS feeds which clients can pull using an RSS aggregator.
Aggregators available for Linux, Windows and MacOS
The aggregator shown displays test results for the RAL CE. These results are archived and pop up on the desktop when the feed is updated.

37 What next? (2)
GGUS developments
–operations issues forwarded to UK GSC helpdesk
Weekly LCG GDA Operations Meeting
–see next slide
EGEE ROCs taking support load
–UK ready?
EGEE CICs taking operations load on weekly rotation

38 Proposal
2 hour weekly meeting, with VRVS for remote participation
–use the existing GDA slot
–Fully open meeting
Weekly operations reports (written in advance - previous Friday evening) from
–Each EGEE ROC (NE should include NorduGrid ops)
–Taipei GOC
–Grid3 (covering FNAL and BNL Tier 1s)
–Other LCG Tier 1 sites (where different from the above) - TRIUMF, Tokyo, others?
–ROCs and Tier 1s will report on and represent the sites they support
Weekly reports (written and submitted in advance) from customers:
–LHC experiments
–Bio-med
–Others as they come on-line
During the meeting only issues should be brought up and resolved
Need to have good representation from ROCs and Tier 1s
Need application reps involved in grid work to attend
Once a month have more general discussions (presentation style), e.g.:
–Middleware developments
–Larger issues - batch system problems, etc.
Minutes, attendance and problems will be public

39 UK view
RAL CIC will take on part of ongoing GOC work
–including development for LCG/EGEE
UK/I ROC will monitor and support UK/I sites
–Helpdesk/DTeam/GOC
–Maps tailored for Tier2s

