WLCG Operations. John Gordon, CCLRC. GridPP18, Glasgow, 21 March 2007.
3 Grids: EGEE, OSG, NorduGrid
WLCG = 3 Grids: EGEE + OSG + NDGF
- Would like it to be one seamless grid, but not yet
- High-level tasks like Simulation Production can be split into 3 parts and farmed out
- Interoperability has had some successes in job submission and information publishing
- For us, WLCG Operations = EGEE Operations
- Many parts to the infrastructure; concentrate here on the Production Service
- How does it relate to you? What action can you take?
The EGEE Infrastructure
- Test-beds & Services
  - Certification Testbeds (SA3)
  - Pre-production Service
  - Production Service
- Support Structures
  - Operations Coordination Centre
  - Regional Operations Centres
  - Global Grid User Support
  - EGEE Network Operations Centre (SA2)
  - Operational Security Coordination Team
- Security & Policy Groups
  - Operations Advisory Group (+NA4)
  - Joint Security Policy Group
  - EuGridPMA (& IGTF)
  - Grid Security Vulnerability Group
- Infrastructure:
  - Physical test-beds & services
  - Support organisations & procedures
  - Policy groups
Middleware Release
- Technical Coordination Group agrees the contents and priorities for what goes into the integration and testing process
- Not all desired new components or updates may make the next distribution
  - Depends on priorities and urgency for other pieces
- Moving away from big-bang releases to component upgrades
  - Concept of a baseline release and then updates and patches
  - New baseline when significant changes (dependencies, …)
Certification
- Extensive certification test-bed:
  - Close to 100 machines involved
  - Main test-bed at CERN, test-beds for specific tasks at SA3 partner sites
  - Emulate the deployment environments, or at least the main ones …
- Certification testing:
  - Installation and configuration
  - Component (service) functionality
  - System testing (trying to emulate real workloads and stress testing)
- Beginning to use virtualization to simplify the testing environment
- Deployment into the pre-production system
  - Final step of certification: validation by real sites
  - Validation by applications; also allows apps to be prepared for new versions
- Mostly hidden from you, but a lot of effort goes into it.
Grid management: structure
- Operations Coordination Centre (OCC): management, oversight of all operational and support activities
- Regional Operations Centres (ROC): providing the core of the support infrastructure, each supporting a number of resource centres within its region
  - Grid Operator on Duty
- Resource centres: providing resources (computing, storage, network, etc.)
- Grid User Support (GGUS): at FZK; coordination and management of user support, single point of contact for users
Grid monitoring
- The goal is to proactively monitor the operational state of the Grid and its performance, initiating corrective action to remedy problems arising with either core infrastructure or Grid resources
- [Diagram: Regional Operations Centres, each with its Resource Centres; OSCT; Grid Operator on Duty (COD). Caption: Monitoring shows a problem]
Grid Operator on Duty
- Role:
  - Watch the problems detected by the grid monitoring tools
  - Problem diagnosis
  - Report these problems (GGUS tickets)
  - Follow and escalate them if needed (well-defined procedure)
  - Provide help, propose solutions
  - Build and maintain a central knowledge database (wiki)
- Who?
  - 10 ROC teams working in pairs (one lead and one backup) on a weekly rotation
Grid monitoring tools
- Tools used by the Grid Operator on Duty team to detect problems
- Distributed responsibility
- CIC portal
  - Single entry point
  - Integrated view of monitoring tools
- Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)
- Grid Operations Centre Core Database (GOCDB)
- GIIS monitor (Gstat)
- GOC certificate lifetime
- GOC job monitor
- Others
COD Tickets
- Don't ignore them!
  - If problems seem to fix themselves (BDII load), then keep some stats (tickets/interventions) and report to Jeremy/Philippa
- Don't just fix problems
  - Report trends, repeat problems, solutions
- The problem at your site is often a symptom of an underlying problem
  - Middleware, deployment, configuration, documentation
  - Your intervention might help to fix them
SAM Availability Algorithm
- CE = OR of your CEs
- SE = OR of your SEs
- Up if CE .AND. SE .AND. BDII .AND. SRM
- If Down, then Down until next Up
- Availability = % of time Up
- Reliability = % of time Up, excluding Scheduled Downtime
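The rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAM implementation: the `Sample` record and its field names are hypothetical, samples are assumed to be equally spaced so that each state holds until the next sample (one reading of "Down until next Up"), and reliability is read as time Up divided by time outside scheduled downtime.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One monitoring snapshot of a site (hypothetical record layout)."""
    ce: List[bool]                    # one test result per CE at the site
    se: List[bool]                    # one test result per SE at the site
    bdii: bool
    srm: bool
    scheduled_downtime: bool = False  # assumed to come from a downtime calendar


def site_up(s: Sample) -> bool:
    # CE = OR of your CEs, SE = OR of your SEs,
    # Up if CE .AND. SE .AND. BDII .AND. SRM
    return any(s.ce) and any(s.se) and s.bdii and s.srm


def availability(samples: List[Sample]) -> float:
    """Percentage of (equally spaced) samples in which the site is Up."""
    if not samples:
        raise ValueError("no samples")
    return 100.0 * sum(site_up(s) for s in samples) / len(samples)


def reliability(samples: List[Sample]) -> float:
    """Percentage of samples Up, ignoring samples in scheduled downtime."""
    counted = [s for s in samples if not s.scheduled_downtime]
    if not counted:
        raise ValueError("no samples outside scheduled downtime")
    return 100.0 * sum(site_up(s) for s in counted) / len(counted)


# Illustrative data only: four snapshots of an imaginary site.
samples = [
    Sample(ce=[True, False], se=[True], bdii=True, srm=True),
    Sample(ce=[False, False], se=[True], bdii=True, srm=True),
    Sample(ce=[True, True], se=[True], bdii=True, srm=False,
           scheduled_downtime=True),
    Sample(ce=[True, True], se=[True], bdii=True, srm=True),
]
print(availability(samples), reliability(samples))
```

Note that one working CE out of two is enough for the CE term, but a single failing BDII or SRM takes the whole site Down.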
What to do?
- SAM Monitoring will be used to judge your site in many ways
  - MoU, user satisfaction, Operations
- Get used to it!
  - Complaining about the middleware doesn't work
  - Continue to raise tickets and operations reports
- Look for workarounds
- Look at SAM failures for long-term fixes
- If you can't reduce the number of problems, reduce their effect
  - Automation, alarms
  - Many other tools
  - Nagios?
- Work on your problems, but also work as a team.
Accounting
- Each Tier1 submits a manual report of:
  - CPU time, wall-clock time, disk, tape
  - Allocated and used
  - Per LHC VO
- Aggregated into a monthly report
  - Which accumulates through the year
  - Compared with MoU and installed capacity
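As a rough sketch of how monthly figures might accumulate through the year and be set against a pledge, the hypothetical helper below keeps running totals of used and pledged resource and their ratio. The function name, the argument layout and the idea of comparing cumulative totals are assumptions for illustration; the slide does not say how the real report does the comparison.

```python
from itertools import accumulate


def cumulative_vs_pledge(monthly_used, monthly_pledge):
    """Return (used_to_date, pledged_to_date, fraction) for each month.

    Both inputs are per-month numbers in the same unit (for example CPU
    hours). The fraction is None when nothing has been pledged so far.
    """
    used_total = accumulate(monthly_used)
    pledge_total = accumulate(monthly_pledge)
    return [
        (used, pledged, used / pledged if pledged else None)
        for used, pledged in zip(used_total, pledge_total)
    ]


# Placeholder numbers, not real accounting data.
for row in cumulative_vs_pledge([10, 20, 30], [20, 20, 20]):
    print(row)
```

The same function could be called once per VO and per resource type (CPU, disk, tape) if the report is broken down that way.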
Automated Accounting
- This report is being automated
  - From March the results will be taken from APEL
  - Overlap with manual report for 3 months
  - Storage Accounting too (Greg's talk)
- Once automatic, easy to extend to Tier2s
  - Be warned!
What to do
- Study APEL for your site
  - Look for gaps in data
  - Check SI2K values published
  - Compare with local records
- Check Storage Accounts
- If you are not being used by VOs, investigate
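Looking for gaps in published data is easy to automate. The sketch below assumes you have already obtained, by whatever means, the list of days for which your site has accounting records. It does not talk to APEL itself, and `find_gaps` is a hypothetical helper name.

```python
from datetime import date, timedelta


def find_gaps(published_days, start, end):
    """Return (first_missing, last_missing) date ranges within [start, end]."""
    have = set(published_days)
    gaps = []
    gap_start = None
    day = start
    while day <= end:
        if day not in have:
            # Open a new gap on the first missing day.
            if gap_start is None:
                gap_start = day
        elif gap_start is not None:
            # Data resumed: the gap ended on the previous day.
            gaps.append((gap_start, day - timedelta(days=1)))
            gap_start = None
        day += timedelta(days=1)
    if gap_start is not None:
        # The gap runs to the end of the period checked.
        gaps.append((gap_start, end))
    return gaps


# Illustrative dates only.
published = [date(2007, 3, 1), date(2007, 3, 2), date(2007, 3, 5)]
for first, last in find_gaps(published, date(2007, 3, 1), date(2007, 3, 7)):
    print("missing:", first, "to", last)
```

Running the same check against local batch-system records gives a quick way to compare the two sources.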
Summary
- Act on trouble tickets
- Work on improving your SAM figures
- Check your accounting
Message
- Site view may be from the bottom up
  - We are motivated to put constituent parts in place and run them well
- WLCG view is from the top down
  - From up there they see the Tier1s clearly and are driving them
  - They'll spot you soon, so be prepared
  - Learn from the Tier1
- GridPP has been a success in delivering to LHC …
  - but the pressure will increase over 2007
- Keep up the good work!