INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org EGEE Infrastructure, Services, & Operations Ian Bird, CERN IT SA1 Activity Leader 1 st EGEE.


1 INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org
EGEE Infrastructure, Services, & Operations
Ian Bird, CERN IT, SA1 Activity Leader
1st EGEE User Forum, 2nd March 2006

2 Outline
Introduction - history
Middleware and Services
Middleware distributions
Operations
User Support
Access to resources & Introducing new VOs
What can you get from EGEE?
- And what does it cost?
From EGEE to EGEE-II
Outlook
SA1 - Operations & Management: 97%; SA2 - Network Services: 3%

3 Introduction

4 History
EGEE infrastructure (middleware distribution and operations) was built up during the 18 months prior to the start of EGEE by the LCG project
- The LCG work formed the basic infrastructure of EGEE
- The middleware distribution retained this name (LCG-2.x) as it was expected to be replaced by gLite
- Now the middleware distribution will evolve with additional or replacement services coming from gLite or elsewhere
EGEE started in April 2004 with a running grid infrastructure
- 40 sites, 3000 CPU
- Basic operations
- Developed certification and deployment process
Now expanded to:
- 200 sites, >20 000 CPU, 40 countries
- Managed operations: stability of sites
- >10 000 jobs / day sustained over the last year
[Chart: sites, CPU, jobs/day]

5

6

7 Middleware & Services

8 Grid middleware
Middleware is software and services that sit between the user application and the underlying computing and storage resources, to provide uniform access to those resources.
The Grid middleware services should:
- Find convenient places for the application to be run
- Optimise use of resources
- Organise efficient access to data
- Deal with authentication to the different sites that are used
- Run the job & monitor progress
- Recover from problems
- Transfer the result back to the scientist

9 Middleware Distributions and Stacks
Terminology:
- EGEE deploys a middleware distribution
  - Drawn from various middleware products, stacks, etc.
  - Do not confuse the distribution with development projects or with software packages
  - Count on 6 months from software developer “release” to production deployment
- The EGEE distribution:
  - Current production version labelled: LCG-2.7.0
  - Next version labelled: gLite-3.0
  - Name change to hopefully reduce confusion
EGEE distribution contents:
LCG-2.7.0:
- VDT: packaging Globus 2.4, Condor, MyProxy
- EDG workload management
- LCG components:
  - BDII (info sys)
  - catalogue (LFC)
  - DPM, data management libraries and CLI tools
  - monitoring tools
- gLite: R-GMA, VOMS, FTS
gLite-3.0:
- Based on LCG-2.7.0, and
- gLite workload management
- Other gLite components (not in the distribution but provided as services):
  - AMGA, Hydra, Fireman
  - gLite-IO evolution

10 CAs, Authentication, Authorization
Authentication
Use of GSI, X.509 certificates
- Generally issued by national certification authorities
Agreed network of trust:
- International Grid Trust Federation (IGTF)
  - EUGridPMA (European Grid PMA)
  - APGridPMA (Asia-Pacific Grid PMA)
  - TAGPMA (The Americas Grid PMA)
- All EGEE sites will usually trust all IGTF root CAs
Authorization
Until LCG-2.7.0 via grid-map files only
From LCG-2.7.0 using VOMS extended proxies
- Call-outs to local authorization services
- Integration with grid services under way: compute elements, storage systems
- For some time the authorization will be a mixture of call-outs and grid-map files until all services understand extended proxies
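The mixed authorization period described on this slide, where some services understand VOMS extended proxies and others still rely on grid-map files, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Credential` class, the `authorize` function, the two policy dictionaries and the example DN and attribute strings are all hypothetical, and real sites take this decision through local authorization call-outs rather than through code like this.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Credential:
    """Hypothetical, simplified view of a user's proxy certificate."""
    subject_dn: str
    # VOMS attributes carried by an extended proxy; empty for a plain proxy.
    fqans: List[str] = field(default_factory=list)


def authorize(cred: Credential,
              grid_map: Dict[str, str],
              voms_policy: Dict[str, str],
              service_understands_voms: bool) -> Optional[str]:
    """Return the local account for this credential, or None if access is denied.

    Assumption: a VOMS-aware service tries the VOMS attributes first and
    falls back to the grid-map file; an older service only sees the DN.
    """
    if service_understands_voms:
        for fqan in cred.fqans:
            if fqan in voms_policy:
                return voms_policy[fqan]
    # Fallback: plain DN-to-account mapping, as in a grid-map file.
    return grid_map.get(cred.subject_dn)


if __name__ == "__main__":
    grid_map = {"/C=CH/O=Example/CN=Alice": "vo001"}          # hypothetical entries
    voms_policy = {"/examplevo/Role=production": "voprod"}   # hypothetical entries
    alice = Credential("/C=CH/O=Example/CN=Alice", ["/examplevo/Role=production"])
    print(authorize(alice, grid_map, voms_policy, service_understands_voms=True))
    print(authorize(alice, grid_map, voms_policy, service_understands_voms=False))
```

The same user can therefore be mapped differently depending on whether the service she contacts already understands extended proxies, which is the reason the slide expects a mixture of mechanisms for some time.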

11 Basic Services
Job Management:
Workload Management
- Resource Broker
- DLI/SI interface to catalogues for data-based scheduling
- Bulk job submission (gLite-3.0)
- DAGs (gLite-3.0)
- Push/pull mode (pull untested; gLite-3.0)
Compute Element (CE):
- Globus/EDG/LCG
  - Condor_C (VO-based scheduling) in gLite-3.0
Logging & Bookkeeping
Local batch systems:
- LSF, PBS, Condor, (Sun Grid Engine)
Additional tools:
- Ability to “peek” at stdout/stderr of running jobs
- User job monitoring: look at the status (state, CPU time, etc.) of running jobs
Data Management:
File and replica catalogues (LFC)
- Central or local (not distributed)
- Replication via Oracle, or squid caches tested by LCG
- Secure
File Transfer Service (FTS)
- Reliable data transfer
- Uses gridftp or srmcopy as transport
Storage Elements based on SRM interface
- DPM: implements Posix ACLs, VOMS roles/groups (gLite-3.0)
- Other available SEs: dCache, Castor
- Deprecated: “Classic SE”, basically just gridftp
Metadata catalogue:
- AMGA (gLite-3.0, partial support)
Secure Keystore:
- Hydra (gLite-3.0, partial support)
Utilities and IO libraries:
- Lcg-utils
- GFAL: this is the SRM client library
- gLiteIO: expect functionality to be replaced

12 Other services
Information system
BDII (implementation of Globus MDS)
GLUE schema
Several tools to access information
FCR site selection tool (see next slide)
Monitoring & Accounting
R-GMA used as monitoring framework
Aggregation for various sources of monitoring data
Accounting: APEL package:
- After-the-fact accounting
- Uses GGF Usage Record as schema
- Does not provide user-level data, but this is a legal/privacy issue, not a technical one!

13 Selecting resources
Selecting resources:
- Tool that uses dynamically updated data about sites
  - Site functional tests
- VO can:
  - Select critical tests
  - White/black list sites
- VO gets a customised set of “good” sites: a view in the information system
- VO can add VO-specific tests
Can be used by RB or other workload management system to run on good/stable sites
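The selection logic on this slide can be sketched as a small filter over site functional test results. This is a minimal sketch, not the actual selection tool: the function name `good_sites`, the test names and the site names are hypothetical, and the rule that a blacklisted site is always dropped while a whitelisted site is kept even when a critical test fails is an assumption about how the lists interact.

```python
from typing import Dict, FrozenSet, Iterable, List


def good_sites(test_results: Dict[str, Dict[str, bool]],
               critical_tests: Iterable[str],
               whitelist: FrozenSet[str] = frozenset(),
               blacklist: FrozenSet[str] = frozenset()) -> List[str]:
    """Return the VO's customised list of usable sites, sorted by name.

    test_results maps site name -> {test name -> passed?}.
    A test with no recorded result counts as failed.
    """
    critical = list(critical_tests)
    selected = []
    for site in sorted(test_results):
        if site in blacklist:
            continue  # assumption: the blacklist always wins
        results = test_results[site]
        passed_all = all(results.get(test, False) for test in critical)
        if passed_all or site in whitelist:
            selected.append(site)
    return selected


if __name__ == "__main__":
    results = {
        "site-a": {"job-submit": True, "replica-mgmt": True},
        "site-b": {"job-submit": True, "replica-mgmt": False},
        "site-c": {"job-submit": False},
    }
    # A VO that only cares about job submission sees more sites than
    # one that also treats replica management as critical.
    print(good_sites(results, ["job-submit"]))
    print(good_sites(results, ["job-submit", "replica-mgmt"]))
```

A workload management system could then restrict matchmaking to the returned list, which is the role the slide gives to the VO's view in the information system.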

14 Selecting resources

15 Selecting resources

16 Middleware distributions → Deployment

17 Process to deployment
[Diagram labels: Middleware providers (VDT/OSG, OMII-Europe, JRA1, …); Integration; Testing & Certification; Certification activities; Pre-production service; Production service; Support, analysis, debugging; SA3; SA3+SA1; SA1]

18 Release Process (simplified)
[Flow diagram, steps numbered 1 to 7. Labels in order: Bugs/Patches/Task Savannah (inputs from C&T, EIS, GIS, GDB, Applications, RC, CICs, Developers); Head of Deployment: prioritization & selection; List for next release (can be empty); integration & first tests (C&T); Internal Releases; User Level install of client tools (EIS); full deployment on test clusters, functional/stress tests ~1 week (C&T); assign and update cost (Bugs/Patches/Task Savannah); components ready at cutoff; Internal Client Release; Client Release; Service Release; Updates Release; Core Service Release]

19 Deployment process
[Flow diagram. Labels in order: Release(s); Certification is run daily; Update User Guides (EIS); Update Release Notes (GIS); Release Notes, Installation Guides, User Guides; Re-Certify (CIC), every month; Release; Client Release; Deploy Client Releases (User Space) (GIS); Deploy Service Releases (Optional) (CICs, RCs); Deploy Major Releases (Mandatory) (ROCs, RCs); YAIM; every month; every 3 months on fixed dates!; at own pace]

20 Certification test bed

21 Time to upgrade
Time to upgrade ~constant (~2.5 sites/day)
Takes a long time to upgrade entire infrastructure
Better now than it was: site functional tests and operational oversight
Need to move away from the need to do full upgrades more than 1-2 times / year
- But need to be able to deploy updates, new tools, security patches, etc.
[Chart label: LCG-2.6.0]
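A rough back-of-the-envelope calculation shows why full upgrades cannot happen often. The sketch below combines the ~2.5 sites/day rate from this slide with the figure of 200 sites from the History slide; it assumes the rate stays constant over the whole rollout and that rollouts do not overlap, and the function names are illustrative only.

```python
def days_to_upgrade(n_sites: int, sites_per_day: float = 2.5) -> float:
    """Days needed to upgrade every site at a constant rollout rate."""
    return n_sites / sites_per_day


def fraction_of_year_upgrading(n_sites: int,
                               upgrades_per_year: int,
                               sites_per_day: float = 2.5) -> float:
    """Share of a 365-day year during which the grid runs mixed versions."""
    return upgrades_per_year * days_to_upgrade(n_sites, sites_per_day) / 365


if __name__ == "__main__":
    print(days_to_upgrade(200))                            # 80.0
    print(f"{fraction_of_year_upgrading(200, 2):.0%}")     # 44%
```

Under these assumptions a single full upgrade takes about 80 days, so two per year already leave the infrastructure in a mixed-version state for roughly 44% of the time.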

22 Desired scenario
Steady-state with:
- Components delivered (as far as possible) independent of each other
- Developed according to realistic schedules, not constrained by artificial release deadlines
- Production service running stable, tested (certified) versions of services and tools
  - Major upgrades only 1 or 2 times per year
  - Potential for upgrading individual services
  - Client tools: new versions deployed as needed
  - Emphasis on reliability, stability, performance, backward compatibility, …
- Pre-production service running new, but certified versions of services
  - Anticipated as upgrades to production services (beta releases of next versions or new services)
  - Allowing reasonable scale application testing and integration with new versions
- Certification testbed running full regression, stress, and functional tests
  - Pre-requisite before moving to pre-production and production
Software can be rejected (not working, not ready, …)
- During testing/certification
- During pre-production
Net result must be that the production service is stable and as reliable as possible, and evolves incrementally and in a controlled way

23 Checklist for a new service
User support procedures (GGUS)
- Troubleshooting guides + FAQs
- User guides
Operations Team Training
- Site admins
- CIC personnel
- GGUS personnel
Monitoring
- Service status reporting
- Performance data
Accounting
- Usage data
Service Parameters
- Scope: Global/Local/Regional
- SLAs
- Impact of service outage
- Security implications
Contact Info
- Developers
- Support Contact
- Escalation procedure to developers
Interoperation
- Documented issues
First level support procedures
- How to start/stop/restart service
- How to check it’s up
- Which logs are useful to send to CIC/Developers
  - and where they are
SFT Tests
- Client validation
- Server validation
- Procedure to analyse these
  - error messages and likely causes
Tools for CIC to spot problems
- GIIS monitor validation rules (e.g. only one “global” component)
- Definition of normal behaviour
  - Metrics
CIC Dashboard
- Alarms
Deployment Info
- RPM list
- Configuration details
- Security audit
This is what it takes to make a reliable production service from a middleware component
Not much middleware is delivered with all this … yet

24 Operations

25 Grid Operations
Services:
- Production service
- Pre-production service
- Operational security: incident response
Operation process, includes:
- Problem detection
- Reporting
- Problem solving
- Escalation procedures

26 EGEE Operations Structure
Operations Management Centre (OMC)
Core Infrastructure Centres (CIC)
- Manage daily grid operations: oversight, troubleshooting
  - “Operator on Duty”
- Run infrastructure services
- UK/I, Fr, It, CERN, Ru, Taipei
Regional Operations Centres (ROC)
- Front-line support for user and operations issues
- Provide local knowledge and adaptations
- One in each region; many are distributed
User Support Centre (GGUS)
- In FZK: provide single point of contact (service desk) + portal.

27 EGEE Operations Process
Grid operator on duty
- 6 teams working in weekly rotation
  - CERN, IN2P3, INFN, UK/I, Ru, Taipei
- Crucial in improving site stability and management
Operations coordination
- Weekly operations meetings
- Regular ROC, CIC managers meetings
- Series of EGEE Operations Workshops
  - Nov 04, May 05, Sep 05, (June 06?)
Geographically distributed responsibility for operations:
- There is no “central” operation
- Tools are developed/hosted at different sites:
  - GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
Procedures described in Operations Manual
- Introducing new sites
- Site downtime scheduling
- Suspending a site
- Escalation procedures
- etc.

28 Operations tools: Dashboard
Dashboard provides top level view of problems:
- Integrated view of monitoring tools (SFT, GStat) shows only failures and assigned tickets
- Single tool for ticket creation and notification emails with detailed problem categorisation and templates
- Detailed site view with table of open tickets and links to monitoring results
- Ticket browser highlighting expired tickets
[Screenshot callouts: test summary (SFT, GStat); GGUS ticket status; problem categories; sites list (reporting new problems)]
Developed and operated by CC-IN2P3: http://cic.in2p3.fr/

29 Operations/deployment support
[Diagram: Operations Coordination Centre, Regional Operations Centres and Resource Centres, with OSCT and JSPG. Link labels: coordination, middleware deployment; operational security coordination; 1st level support; 2nd level support]
JSPG: Joint Security Policy Group
OSCT: Operational Security Coordination Team

30 Operations support workflows
[Diagram: Grid Operator on-duty, OSCT, Regional Operations Centres and Resource Centres; 1st level support; 2nd level support]
Monitoring shows a problem
Operator submits a GGUS ticket against the ROC and cc’s the site. The ticket is followed until it is solved
ROC and Site work to resolve the problem

31 Evolution of SFT metric
[Chart: available sites and available CPU, daily, July to November; one interval marked “Missing log data”]

32 Security Policy
Joint Security Policy Group
- EGEE with strong input from OSG
- Policy Set: Policy Revisions
- Grid Acceptable Use Policy (AUP)
  - https://edms.cern.ch/document/428036/
  - common, general and simple AUP
  - for all VO members using many Grid infrastructures: EGEE, OSG, SEE-GRID, DEISA, national Grids…
- VO Security
  - https://edms.cern.ch/document/573348/
  - responsibilities for VO managers and members
  - VO AUP to tie members to Grid AUP accepted at registration
- Incident Handling and Response
  - https://edms.cern.ch/document/428035/
  - defines basic communications paths
  - defines requirements (MUSTs) for IR: reporting, response, protection of data, analysis
  - not to replace or interfere with local response plans
[Diagram of the policy set: Security & Availability Policy; Usage Rules; Certification Authorities; Audit Requirements; Incident Response; User Registration & VO Management; Application Development & Network Admin Guide; VO Security]

33 Operational Security Coordination Team (OSCT)
- What it is not:
  - Not focused on middleware security architecture
  - Not focused on vulnerabilities (see Vulnerabilities Group)
- Focus on Incident Response Coordination
  - Assume it’s broken, how do we respond?
  - Planning and Tracking
- Focus on ‘Best Practice’
  - Advice
  - Monitoring
  - Analysis
- Coordinators for each EGEE ROC
  - plus OSG LCG Tier 1 + Taipei
[Diagram labels: 3 strategies; SSC1 - Job Trace; SSC2 - Storage Audit; Infrastructure; HANDBOOK; Incident Response; Policy; Procedures; Resources; Reference; Playbook; Security Service Challenge; Infrastructure; Agents; Deployment; Monitoring Tools]
OSCT membership → ROC security contacts

34 Vulnerability Group
Was set up last summer (CCLRC lead)
Purpose: inform developers, operations, site managers of vulnerabilities as they are identified and encourage them to produce fixes or to reduce their impact
Set up (private!) database of vulnerabilities
- To inform sites and developers
Urgent action → OSCT to manage
After reaction time (45 days)
- Vulnerability and risk analysis given to OSCT to define action: publication?
- Will not publish vulnerabilities with no solution
Intend to report progress and statistics on vulnerabilities by middleware component and response of developers
Balance between open responsible public disclosure and creating security issues with precipitous publication
Following first experience in implementing this process, review of procedures under way, including need for appropriate risk analyses
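The handling rules on this slide can be read as a small decision procedure. The sketch below is one possible interpretation only: the function name, its parameters and the wording of the returned stages are hypothetical, and the only figure taken from the slide is the 45-day reaction time.

```python
from datetime import date, timedelta

REACTION_TIME = timedelta(days=45)  # reaction time quoted on the slide


def handling_stage(reported: date, today: date,
                   urgent: bool, fix_available: bool) -> str:
    """Return a description of where a reported vulnerability stands.

    Assumption: urgent cases go straight to the OSCT; all others stay
    private until the reaction time has elapsed.
    """
    if urgent:
        return "urgent: OSCT manages the response"
    if today - reported < REACTION_TIME:
        return "private: sites and developers informed, fix in progress"
    if not fix_available:
        return "OSCT decides action; not published (no solution)"
    return "OSCT decides action; publication possible"


if __name__ == "__main__":
    reported = date(2006, 1, 1)
    print(handling_stage(reported, date(2006, 2, 14), urgent=False, fix_available=True))
    print(handling_stage(reported, date(2006, 2, 15), urgent=False, fix_available=True))
```

Keeping the unpublished branch separate reflects the slide's stated balance between responsible disclosure and avoiding precipitous publication.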

35 User Support

36 Goals
A single access point for support
A portal with well-structured information and updated documentation
Knowledgeable experts
Correct, complete and responsive support
Tools to help resolve problems
- search engines
- monitoring applications
- resources status
Examples, templates, specific distributions for software of interest
Interface with other Grid support systems
Connection with developers, deployment, operation teams
Assistance during production use of the grid infrastructure

37 The Support Model
“Regional Support with Central Coordination"
The ROCs, VOs and other project-wide groups such as the Core Infrastructure Center (CIC), middleware groups (JRA), network groups (NA), service groups (SA) are connected via a central integration platform provided by GGUS.
[Diagram: Central Application (GGUS) with web portal and interface; TPM; regional support units (ROC 1 … ROC 10); user support units (VO Support); technical support units (Deployment Support, Middleware Support, Network Support, Operations Support)]

38 The GGUS System

39 GGUS Portal: user services
Browseable tickets
Search through solved tickets
Useful links (Wiki FAQ)
Broadcast tools
Latest News
GGUS Search Engine
Updated documentation (Wiki FAQ)

40 [Diagram: User; first line support (TPM: Grid experts, GGUS Supporters; VO-TPM: VO experts); second line support (VO Support Units, Middleware Support Units, Deployment Support Units, Operations Support, ROC Support Units, Network Support)]

41 Performance statistics
[Charts: September, October]
A peak of 80 tickets per day has been reached.
November 2005: 315 tickets

42 New VOs; Access to Resources; Benefits & Costs

43 How new VOs find resources
Various possibilities:
1. Pilot applications:
- Expectation that they have access to resources provided by many partners
  - For EGEE-II this is specified in TA
2. Applications reviewed and approved by EGAAP:
- Negotiation via OAG to understand which ROCs/sites are willing to
  - Run services on behalf of the VO
  - Provide compute and/or storage resources
3. Other (self supporting) applications
  - Own their own resources
  - Use EGEE infrastructure, operations, support
  - Many successful examples of such VOs
1 & 2:
- Formal agreements (TA or MoU)
- Should expect support via NA4, but should also build up internal support teams
- Expected to collaborate on improving the service, not just “users”
1, 2 & 3:
- Full user and operations support
- VOs need to provide support teams: some problems are application problems!

44 Negotiation
Operations Advisory Group (OAG)
Brings together VOs and resource providers (ROCs)
Negotiate for services and resources
Should not always be an expectation of “free” resources
- In future applications should bring some resources with them
- Computational and storage resources are not funded (!) by the project

45 EGEE: What can it deliver?
A managed operation, providing a service:
- A large number of sites of different sizes and capabilities
- Developed operational procedures
  - Monitoring of the grid services providing access to resources
- Operational security support; incident response coordination
- Support services: user support, training, etc.
- Building up considerable experience in grid-enabling a variety of different applications
- Tools for monitoring of resources at a site … if required
A new VO joining EGEE with a few sites:
- Benefits from the operations and support: the VO sites can be monitored and supported as part of the infrastructure
- Potentially access to other resources
- It is a significant effort to set up a grid infrastructure from scratch

46 … and what does it cost?
“The application VO buys into the EGEE model”
- Actually not so restrictive now: supports many Linux flavours, IA64 (other teams have worked on AIX, SGI ports)
- Simple installation of client software now (can be done on the fly)
- Basic grid services are quite general, nothing really application-specific
Some unresolved issues:
- Commercial licensed software used by an application
- Levels of privacy/security needed in some life-science applications
- True interactivity
… and of course, this is all new and rapidly evolving, with many problems still to be overcome
VOs should:
- Provide application support effort to help other VO users
- Invest effort into helping improve the infrastructure and services: this should not be a simple “client-server” relationship, but rather a collaboration

47 Future

48 From EGEE to EGEE-II
Simplify operations structure
- ROCs absorb CIC roles: spread of expertise
Introduce SA3
- Integration, certification, distribution preparation
- Emphasises focus on stability, reliability, performance rather than new features
- Mechanism for integrating non-EGEE software, according to need
Increased emphasis on
- Platform support (OS, 64-bit, etc.)
- Interoperability with other grids (international, regional, national, local, campus) and other middleware stacks (Unicore, ARC, …)
SA: 54% of total; SA1 (operations): 86%; SA2 (network): 3%; SA3 (certification): 11%
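The percentages at the end of this slide can be combined into shares of the whole project. The sketch below assumes that the three per-activity figures are shares of the SA total (they sum to 100%) and that "54% of total" refers to the project as a whole; the slide does not say whether the quantity measured is effort or budget, and the names used here are illustrative.

```python
SA_SHARE_OF_TOTAL = 0.54  # "SA: 54% of total" from the slide

# Shares within the SA activities, as listed on the slide.
SHARE_WITHIN_SA = {"SA1": 0.86, "SA2": 0.03, "SA3": 0.11}


def share_of_total(activity: str) -> float:
    """Share of the whole project attributed to one SA activity."""
    return SA_SHARE_OF_TOTAL * SHARE_WITHIN_SA[activity]


if __name__ == "__main__":
    for name in SHARE_WITHIN_SA:
        print(f"{name}: {share_of_total(name):.1%} of total")
```

Under that reading, operations (SA1) alone accounts for a little under half of the whole project, with certification (SA3) at about 6% and network services (SA2) below 2%.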

49 Outlook
LHC VOs must achieve reliable production and analysis in 2006
- Will be making significant use of resources
Consolidate and improve existing services: Focus on
- Reliability, robustness
- Manageability
- Performance, scalability
- Evolution or replacement of services driven by needs of application (or security/manageability)
Expand grid operations
- Spread expertise to ROCs
- Collaboration with OSG, A-P
- Start to negotiate SLAs
New applications
- Must bring resources: show commitment
- Resource sharing and negotiation must become streamlined
  - Will need a mechanism for cost/credit for use of resources

50 Summary
EGEE Infrastructure: world’s largest multi-science production grid service
- But does not exist in isolation: interoperability and interoperation are essential
Significant improvements in reliability and stability over the last year
Is in constant use for significant production work
- Many VOs now use it as their primary resource
Middleware distribution is
- Consolidating existing and new services
- Basis for evolution according to needs
Shift from EGEE to EGEE-II
- No major changes, but adjustments based on experience and anticipated evolution
- Refine and improve processes

