The EGEE Production Grid
Ian Bird, EGEE Operations Manager
HEPiX, Jefferson Lab, 12th October 2006
Enabling Grids for E-sciencE
EGEE-II INFSO-RI
EGEE and gLite are registered trademarks

JLab; 9th-13th October

Outline
Some history
  – What led up to where we are now?
  – The EGEE project
What is the EGEE grid infrastructure today?
  – What has been achieved?
  – How is it used?
  – How does it compare and relate to other production grids?
Outlook

Some history …
1999 – Monarc Project
  – Early discussions on how to organise distributed computing for LHC
2000 – growing interest in grid technology
  – HEP community was the driver in launching the DataGrid project
EU DataGrid project
  – Middleware & testbed for an operational grid
LHC Computing Grid (LCG)
  – Deploying the results of DataGrid to provide a production facility for LHC experiments
EU EGEE project phase 1
  – Starts from the LCG grid
  – Shared production infrastructure
  – Expanding to other communities and sciences
EU EGEE-II
  – Building on phase 1
  – Expanding applications and communities …
… and in the future
  – Worldwide grid infrastructure??
  – Interoperating and co-operating infrastructures?

The EGEE project
EGEE: €32 M
  – 1 April 2004 – 31 March 2006
  – 71 partners in 27 countries, federated in regional Grids
EGEE-II: €35 M
  – 1 April 2006 – 31 March 2008
  – 91 partners in 32 countries
  – 13 Federations
Objectives
  – Large-scale, production-quality infrastructure for e-Science
  – Attracting new resources and users from industry as well as science
  – Improving and maintaining “gLite” Grid middleware

The EGEE Infrastructure
Infrastructure:
  – Physical test-beds & services
  – Support organisations & procedures
  – Policy groups
Test-beds & Services
  – Certification testbeds (SA3)
  – Pre-production service
  – Production service
Support Structures
  – Operations Coordination Centre
  – Regional Operations Centres
  – Global Grid User Support
  – EGEE Network Operations Centre (SA2)
  – Operational Security Coordination Team
Security & Policy Groups
  – Operations Advisory Group (+NA4)
  – Joint Security Policy Group
  – EuGridPMA (& IGTF)
  – Grid Security Vulnerability Group

Certification & release preparation
The goal is to produce a middleware distribution that can be deployed widely
  – Not the same as middleware releases from development projects
  – More like a Linux distribution: bringing together many pieces from several sources
Extensive certification test-bed:
  – Close to 100 machines involved, CERN + partners
  – Emulates the main deployment environments
Certification testing:
  – Installation and configuration
  – Component (service) functionality
  – System testing (trying to emulate real workloads and stress testing)
  – Beginning to use virtualization to simplify the testing environment
Deployment into the pre-production system
  – Final step of certification: validation by real sites
  – Validation by applications, which also lets application teams prepare for new versions

Pre-production service
Pre-production service (P-PS) is now ~20 sites
Provides access to some 500 CPUs
  – Some sites allow access to their full production batch systems for scale tests
Sites install and test different configurations and sets of services
  – Try to get a good feeling for the quality of a release or update before general release to production
  – Feedback to: certification, integration, developers, etc.
P-PS is now used in the way it was intended
  – For some time it was acting as a second certification test-bed for the gLite-1.x branch
  – Some services may be demonstrated in this environment before going to production (or they may need more work)

Production service sites
Size of the infrastructure today:
  – 196 sites in 42 countries
  – ~ [number missing in source] CPU
  – ~3 PB disk, + tape MSS

Usage of the infrastructure
  – >50k jobs/day
  – ~7000 CPU-months/month
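
As a rough consistency check on the two headline figures above, the sketch below converts them into an average number of continuously busy CPUs and an average CPU time per job. The 30-day month and all function names are assumptions made for illustration; the slide gives only the two approximate totals.

```python
# Back-of-envelope check of the usage figures quoted on the slide.
JOBS_PER_DAY = 50_000          # slide: ">50k jobs/day" (a lower bound)
CPU_MONTHS_PER_MONTH = 7_000   # slide: "~7000 CPU-months/month"
DAYS_PER_MONTH = 30            # assumption: nominal 30-day month


def average_busy_cpus(cpu_months_per_month: float) -> float:
    """One CPU-month delivered per month equals one CPU busy all the time."""
    return cpu_months_per_month


def cpu_hours_per_job(cpu_months_per_month: float,
                      jobs_per_day: float,
                      days_per_month: int = DAYS_PER_MONTH) -> float:
    """Average CPU-hours consumed by a single job."""
    cpu_hours = cpu_months_per_month * days_per_month * 24
    jobs = jobs_per_day * days_per_month
    return cpu_hours / jobs


if __name__ == "__main__":
    print("busy CPUs on average:", average_busy_cpus(CPU_MONTHS_PER_MONTH))
    print("CPU-hours per job:",
          round(cpu_hours_per_job(CPU_MONTHS_PER_MONTH, JOBS_PER_DAY), 2))
```

Because the job count is quoted as a lower bound, the per-job figure is best read as an upper estimate of the average.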

Non-LHC VOs
Workloads of the “other VOs” are starting to be significant: approaching 8-10K jobs per day and 1000 CPU-months/month
  – One year ago this was the overall scale of work for all VOs

Use of the infrastructure
  – 20k jobs running simultaneously

CPU Usage
[Chart: CPU usage by Virtual Organization, Jan. ’06 to Sep. ’06]

Use for massive data transfer
  – Large LHC experiments now transferring ~1 PB/month each
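
To put the monthly volume in terms of a sustained rate, the following sketch converts ~1 PB/month into an average throughput. It assumes a decimal petabyte (10^15 bytes) and a 30-day month; neither convention is stated on the slide, and the function name is illustrative.

```python
# Convert a monthly transfer volume into an average sustained rate.
PETABYTE = 10**15          # assumption: decimal petabyte
SECONDS_PER_DAY = 86_400


def average_rate_mb_per_s(bytes_per_month: float,
                          days_per_month: int = 30) -> float:
    """Average rate in MB/s (10^6 bytes per second) over the month."""
    seconds = days_per_month * SECONDS_PER_DAY
    return bytes_per_month / seconds / 10**6


if __name__ == "__main__":
    rate = average_rate_mb_per_s(1 * PETABYTE)
    print(f"~{rate:.0f} MB/s, i.e. ~{rate * 8 / 1000:.1f} Gbit/s per experiment")
```

This is an average only; real transfers are bursty, so peak rates would need to be higher.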

Applications on EGEE
More than 25 applications from an increasing number of domains
  – Astrophysics
  – Computational Chemistry
  – Earth Sciences
  – Financial Simulation
  – Fusion
  – Geophysics
  – High Energy Physics
  – Life Sciences
  – Multimedia
  – Material Sciences
  – …
Application types:
  – Simulation
  – Bulk Processing
  – Responsive Apps.
  – Workflow
  – Parallel Jobs
  – Legacy Applications

Simulation
Examples
  – LHC Monte Carlo simulation
  – Fusion
  – WISDOM (malaria, avian flu)
Characteristics
  – Jobs are CPU-intensive
  – Large number of independent jobs
  – Run by few (expert) users
  – Small input; large output
Needs
  – Batch-system services
  – Minimal data management for storage of results
[Images: ATLAS, ITER]

Drug Discovery
WISDOM focuses on in silico drug discovery for neglected and emerging diseases.
Malaria, Summer 2005
  – 46 million ligands docked
  – 1 million selected
  – 1 TB of data produced; 80 CPU-years used in 6 weeks
Avian Flu, Spring 2006
  – H5N1 neuraminidase
  – Impact of selected point mutations on the efficiency of existing drugs
  – Identification of new potential drugs acting on mutated N1
Fall 2006
  – Extension to other neglected diseases
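
The malaria figures can be turned into two rough scale indicators: how many CPUs were busy on average during the six weeks, and how much CPU time each ligand consumed on average. The sketch below assumes a 365.25-day year and treats the 80 CPU-years as spread evenly over all 46 million ligands; both are simplifying assumptions, and the function names are illustrative.

```python
# Scale of the WISDOM malaria run, derived from the slide's figures.
CPU_YEARS = 80
WEEKS = 6
LIGANDS = 46_000_000
DAYS_PER_YEAR = 365.25     # assumption: Julian year
SECONDS_PER_DAY = 86_400


def equivalent_cpus(cpu_years: float, weeks: float) -> float:
    """CPUs that must run continuously to deliver cpu_years in the given weeks."""
    return cpu_years * DAYS_PER_YEAR / (weeks * 7)


def cpu_seconds_per_ligand(cpu_years: float, ligands: int) -> float:
    """Average CPU-seconds per ligand, assuming an even split of the work."""
    return cpu_years * DAYS_PER_YEAR * SECONDS_PER_DAY / ligands


if __name__ == "__main__":
    print(f"~{equivalent_cpus(CPU_YEARS, WEEKS):.0f} CPUs busy on average")
    print(f"~{cpu_seconds_per_ligand(CPU_YEARS, LIGANDS):.0f} CPU-seconds per ligand")
```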

Bulk Processing
Examples
  – HEP processing of raw data, analysis
  – Earth observation data processing
Characteristics
  – Widely-distributed input data
  – Significant amount of input and output data
Needs
  – Job management tools (workload management)
  – Meta-data services
  – More sophisticated data management

Responsive Apps. (I)
Examples
  – Prototyping new applications
  – Monitoring grid operations
  – Direct interactivity
Characteristics
  – Small amounts of input and output data
  – Not CPU-intensive
  – Short response time (few minutes)
Needs
  – Configuration which allows “immediate” execution (QoS)
  – Services must treat jobs with minimum latency

Responsive Apps. (II)
Grid as a backend infrastructure:
  – gPTM3D: interactive analysis of medical images
  – Bioinformatics via web portal
  – GATE: radiotherapy planning
  – DILIGENT: digital libraries
  – Volcano sonification
Characteristics
  – Rapid response: a human waiting for the result!
  – Many small but CPU-intensive tasks
  – User is not aware of “grid”!
Needs
  – Interfacing (data & computing) with non-grid application or portal
  – User and rights management between front-end and grid

Workflow
Examples
  – “Bronze Standard”: image registration
  – Flood prediction
Characteristics
  – Use of grid and non-grid services
  – Complex set of algorithms for the analysis
  – Complex dependencies between individual tasks
Needs
  – Tools for managing the workflow itself
  – Standard interfaces for services (i.e. web services)

Parallel Jobs
Examples
  – Climate modeling
  – Earthquake analysis
  – Computational chemistry
Characteristics
  – Many interdependent, communicating tasks
  – Many CPUs needed simultaneously
  – Use of MPI libraries
Needs
  – Configuration of resources for flexible use of MPI
  – Pre-installation of optimized MPI libraries

Legacy Applications
Examples
  – Commercial or closed source binaries
  – Geocluster: geophysical analysis software
  – FlexX: molecular docking software
  – Matlab, Mathematica, …
Characteristics
  – Licenses: control access to software on the grid
  – No recompilation, hence no direct use of grid APIs!
Needs
  – License server and grid deployment model
  – Transparent access to data on the grid

Grid management: structure
Operations Coordination Centre (OCC)
  – Management and oversight of all operational and support activities
Regional Operations Centres (ROC)
  – Provide the core of the support infrastructure, each supporting a number of resource centres within its region
  – Grid Operator on Duty
Resource centres
  – Provide resources (computing, storage, network, etc.)
Global Grid User Support (GGUS)
  – At FZK; coordination and management of user support, single point of contact for users

Grid Monitoring
Goal:
  – Proactively monitor operational state & performance of the grid
  – Trigger corrective actions at sites, ROCs, service managers
Many tools used:
  – Distributed responsibility for tools maintenance and operation
  – Operator portal, Info sys monitor, SFT/SAM, job monitors, etc.
Site Functional Tests (SFT) / Site Availability Monitor (SAM)
  – Framework to sample/test services at sites and publish results
  – Can include ad-hoc tests (e.g. VO-specific) in the framework or externally
  – Allows dynamic look-up by VO of sites that are currently OK for them
  – SAM: extends the concept to measure service availability
  – Web service access to the data
  – Intend to use this to generate trouble tickets and alarms
Primary tools of the operator on duty are
  – Information system monitoring and SFT/SAM
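
The two ideas on this slide, measuring service availability from sampled test results and letting a VO look up the sites that are currently OK for it, can be illustrated with a small sketch. This is not the SFT/SAM implementation or its data model: the `Sample` record, the site, VO and service names, and the "all sampled tests passed" rule are all hypothetical, chosen only to show the shape of the computation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One test outcome (hypothetical record layout, not the SAM schema)."""
    site: str
    vo: str
    service: str
    passed: bool


def availability(samples):
    """Fraction of passed samples per (site, service)."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for s in samples:
        key = (s.site, s.service)
        totals[key] += 1
        passes[key] += int(s.passed)
    return {key: passes[key] / totals[key] for key in totals}


def sites_ok_for_vo(samples, vo):
    """Sites where every sampled test run for this VO passed (assumed rule)."""
    seen, failed = set(), set()
    for s in samples:
        if s.vo != vo:
            continue
        seen.add(s.site)
        if not s.passed:
            failed.add(s.site)
    return sorted(seen - failed)


# Hypothetical sample data for illustration only.
SAMPLES = [
    Sample("site-a", "vo-x", "CE", True),
    Sample("site-a", "vo-x", "CE", False),
    Sample("site-a", "vo-y", "SE", True),
    Sample("site-b", "vo-x", "CE", True),
    Sample("site-b", "vo-x", "SE", True),
]

if __name__ == "__main__":
    print(availability(SAMPLES))
    print(sites_ok_for_vo(SAMPLES, "vo-x"))
```

A VO-specific ad-hoc test would simply appear as extra samples tagged with that VO, so the same look-up covers both standard and VO-defined tests.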

Site metrics: availability

Support: GGUS

The EGEE Network Operations Centre
Creating a “Network Support unit” (ENOC) in the EGEE operational model
Tasks:
  – Receive tickets from NRENs, and forward to GGUS if there is an impact on the grid
  – Receive tickets from GGUS if there is a network issue
  – Troubleshoot & follow up with sites or NRENs
[Diagram: users, GGUS, support units, ENOC, NRENs, GÉANT2, EGEE network]
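
The ticket flow described in the task list can be written down as a small routing rule. This is a sketch of the logic as stated on the slide, not ENOC or GGUS software: the `Ticket` fields and the returned action strings are hypothetical, and the two fall-through actions ("no grid action", "return to GGUS") are assumptions, since the slide does not say what happens in those cases.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    NREN = "nren"
    GGUS = "ggus"


@dataclass
class Ticket:
    """Hypothetical ticket record used only for this illustration."""
    source: Source
    summary: str
    affects_grid: bool = False
    is_network_issue: bool = False


def route(ticket: Ticket) -> str:
    """Return the ENOC action for a ticket, following the slide's task list."""
    if ticket.source is Source.NREN:
        # NREN tickets go on to GGUS only if the grid is impacted.
        if ticket.affects_grid:
            return "forward to GGUS"
        return "no grid action"  # assumption: not specified on the slide
    if ticket.source is Source.GGUS:
        # GGUS hands over tickets that turn out to be network issues.
        if ticket.is_network_issue:
            return "troubleshoot and follow up with sites or NRENs"
        return "return to GGUS"  # assumption: not specified on the slide
    raise ValueError(f"unknown ticket source: {ticket.source!r}")


if __name__ == "__main__":
    print(route(Ticket(Source.NREN, "planned link maintenance", affects_grid=True)))
    print(route(Ticket(Source.GGUS, "slow transfers to a site", is_network_issue=True)))
```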

Interoperation
Interoperability and interoperation (or co-operation)
EGEE has interoperability activities with: (enabling the middlewares to work together)
  – Open Science Grid (U.S.): quite far advanced
  – NorduGrid (ARC): task in EGEE-II, 4 workshops and ongoing activity
  – UNICORE: task in EGEE-II
  – NAREGI (Japan): 1 workshop, continued activity
  – GIN (OGF): active in several areas
EGEE has interoperation activities with: (enabling the infrastructures to co-operate)
  – Open Science Grid: actually in use
  – Anticipated with NorduGrid (NDGF) for WLCG

Interoperating information systems
[Diagram: EGEE, OSG, NAREGI, TeraGrid, PRAGMA, NorduGrid]

Related infrastructure projects
  – DEISA
  – TeraGrid
  – Coordination in SA1 for: EELA, BalticGrid, EUMedGrid, EUChinaGrid, SEE-GRID
  – Interoperation with OSG, NAREGI
  – SA3: DEISA, ARC, NAREGI

Sustainability: Beyond EGEE-II
Need to prepare for a permanent Grid infrastructure
  – Maintain Europe’s leading position in global science Grids
  – Ensure reliable and adaptive support for all sciences
  – Independent of short project funding cycles
  – Modelled on the success of GÉANT
Infrastructure managed in collaboration with national grid initiatives

Summary of status
Today we have an operating production infrastructure
  – Probably the largest in the world, supporting many science domains
  – Relied upon by several of them as their primary source of computing
We have a managed operations process addressing most areas
  – Constantly evolving
Inter/Co-operation is a fact and is becoming more important very quickly
  – Several applications need to work across grids, and they need support for that
A large fraction of the value of the operations activity is in the intangibles: processes, structures, expertise, etc.
We recognise that there are many outstanding problems with the current state of things: reliability and robustness are the focus for the next year