INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org SA1 Ian Bird SA1 Activity Leader CERN IT Department EGEE Final Review 23rd – 24th May 2006.


1 SA1
Ian Bird, SA1 Activity Leader, CERN IT Department
EGEE Final Review, 23rd – 24th May 2006
INFSO-RI-508833, Enabling Grids for E-sciencE, www.eu-egee.org

2 Outline
Recommendations from intermediate focused review
Highlights of last 3 months of the project
Summary of SA1 achievements and open issues

3 SA1 Achievements
Scale of the infrastructure
  – Has grown steadily during the project
  – Now slowed – expansion with related projects
Sustained real production use of the infrastructure
  – Which is supported by the operations teams
Maturing but evolving operations procedures
  – Dealing with all aspects of operations
User support
  – GGUS is becoming the central coordination point, use is growing
Middleware distribution
  – Now clear how to evolve the production service
  – Convergence between existing LCG-2.x and gLite-1.x
Progress on interoperability and interoperation
  – With OSG significant progress, progress with ARC
  – Related projects

4 Recommendation 16 – i
“Plan the migration procedure of service support for gLite in full production service more clearly with precise dates and mandates for each site, and advertise to the users well in advance.”
& comment: “Pre-production service must not take on a life of its own…”
Early set up of TCG
  – Forum for agreeing schedules across the technical and application activities
  – Schedule proposed and agreed for 2006 – see next slide

5 Recommendation 16 – ii
Deployment schedule for 2006
Deliver and deploy LCG-2.7.0 end January 2006
  – Bug fixes, patches, etc. accumulated since last major release in August
    ▪ Delivered on time and deployed
Prepare gLite-3.0 for initial deployment in May 2006
  – Convergence of LCG-2.x and gLite-1.x
  – Evolutionary from deployment point of view – will not be a big-bang change of production service
  – Schedule driven by LCG service challenges
Foresee second major “release” on October/November timescale
  – Added functionality – driven by apps via TCG
Quickfixes, security patches
  – May be produced at any time, deployed with agreement of TCG
Client tools
  – May be updated more frequently, and can be deployed rapidly without need for major upgrades
Other stand-alone services may be deployed centrally or at a few sites
  – To demonstrate functionality or provide new facilities
  – Usually need by-hand installation
In general we try to move away from big-bang releases:
  – Focus on service/component upgrades where possible
  – Check-point releases to consolidate changes and to provide new sites a starting point
  – See this more like a Linux distribution – major releases with continual component updates, security patches, etc.
Pre-production service – now integral part of the release process – should demonstrate new releases
  – Continuous process of integration, certification, pre-production testing → eventual deployment
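The continuous process named on this slide (integration, certification, pre-production testing, then deployment) can be pictured as a small state machine. The following is only a sketch: the stage names come from the slide, but the `ComponentUpdate` class, the component name, and the rule that a failed stage sends an update back to integration are illustrative assumptions, not a description of the actual EGEE tooling.

```python
from dataclasses import dataclass, field

# Stage names taken from the slide; the ordering logic below is an assumption.
STAGES = ["integration", "certification", "pre-production", "production"]


@dataclass
class ComponentUpdate:
    """Hypothetical record for one service/component update moving toward deployment."""
    name: str
    stage: str = STAGES[0]
    history: list = field(default_factory=list)

    def advance(self, passed: bool) -> str:
        """Move to the next stage if the current one passed.

        Assumption: a failure at any stage sends the update back to integration.
        """
        self.history.append((self.stage, passed))
        if not passed:
            self.stage = STAGES[0]
        elif self.stage != STAGES[-1]:
            self.stage = STAGES[STAGES.index(self.stage) + 1]
        return self.stage


update = ComponentUpdate("example-client-tools")  # hypothetical component name
for outcome in (True, True, True):
    update.advance(outcome)
print(update.stage)  # production
```

Tracking each component separately, rather than one release as a whole, mirrors the stated aim of moving away from big-bang releases toward per-component upgrades.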

6 Recommendation 17 – i
“Help to establish exemplary procedures for interoperations of more divergent infrastructures and take the lead in such activities.”
Several avenues
  – Collaborative activities – security and operational policy
  – Interoperability
  – Interoperation / shared operation – workshops
  – Other projects
Joint collaborative activities:
  – Security – JSPG, MWSG, GridPMAs
  – Grid Interoperability Now (GIN) group – many projects
    ▪ Very active in GGF17

7 Recommendation 17 – ii
Interoperability
Several initiatives at various stages
With OSG
  – Most advanced – cross job submission has been put in place for WLCG
    ▪ Used in production by US-CMS for several months
  – EGEE Generic Info Provider installed on OSG site (now in VDT)
    ▪ Allows all sites to be seen in info system
  – GStat and SFT can run on OSG sites
  – EGEE clients installed on OSG-LCG sites
  – Inversely – EGEE sites can run OSG jobs
  – All use SRM SEs
  – File catalogues are application choice – LFC widely used

8 Interoperability – cont.
With ARC/NorduGrid
  – Strategies:
    1. Agree standard interfaces at site level & evolve services for these interfaces
    2. Present these interfaces at Grid boundary
       ▪ Portal to forward and translate
    3. Deploy EGEE and ARC CE in parallel
       ▪ Large sites for LCG
  – 1 is long-term goal; 2 is medium term solution
  – Several workshops to follow progress
  – Work on information system (GLUE)
  – EGEE → ARC submission works
With NAREGI
  – First workshop in March
  – Several joint activities agreed; work just starting
    ▪ Information system translators (GLUE ↔ CIM)
    ▪ Data management tools – NAREGI will test EGEE LFC, FTS, DPM
    ▪ Job management JDL ↔ JSDL etc.
    ▪ Security

9 Recommendation 17 – iii
Operations (Interoperations)
Joint operations:
  – WLCG is a strong driver – bring together EGEE and OSG grid operations
  – Extend ROC concept
    ▪ Structures for routing tickets – prototype to be demonstrated in June
    ▪ Use of GOC-DB for OSG sites
    ▪ OSG sites join weekly operations meeting
    ▪ Run SFTs on LCG production sites in OSG
    ▪ Agreed ops VO for joint operations
  – Accounting – for LCG – use GGF usage record
Related projects
  – EUMedGrid, BalticGrid, EELA, EUChinaGrid, SEE-Grid: implement EGEE operational concepts and procedures
Operations workshops
  – Explicitly joint with OSG, ensure related projects attend

10 Recommendation 17 – iv
Future:
  – Shared operations will be a reality – required for LCG
    ▪ EGEE, OSG, ARC, NAREGI
  – EGEE-II
    ▪ Explicit tasks on interoperability
    ▪ ARC and UNICORE
  – Expectation is for coexisting campus, local, regional, national, international grid infrastructure
    ▪ Coexistence, interoperability, interoperations, common policies will be a way of life
  – Long term sustainable infrastructure after EGEE-II will be built on this work

11 Recommendation 18
“Move away from present primary dependence on particular flavours of both processors and Linux and provide support for more heterogeneous resources, including supercomputers, to allow increased collaborative adoption at major computing centres.”
Current porting status:
  – Several ports to other architectures: IA64, several Linux flavours. Available a few months after main release
  – Done by partners; outside of main build and integration system
Future:
  – Important to have several important ports close to or part of main integration and testing
  – Include 64-bit cleanliness as part of build test – will flag as failure
  – Move to ETICS to provide distributed build system to support many platforms; helps tie porting partners into central process
    ▪ Partner interested in a particular port can provide build and test hardware and ETICS can help integrate this into the process
  – TCG should agree a reasonable/realistic set of standard primary platforms to be provided as part of base release
    ▪ E.g. SL4 + Debian on 32 and 64 bit
    ▪ Other ports can be asynchronous and should be certified by partners providing resources
  – Supercomputers – should be supported by ports to relevant OS, MPI
    ▪ Collaboration with DEISA in EGEE-II

12 SA1 Highlights

13 EGEE Grid Sites: Q1 2006
EGEE: > 180 sites, 40 countries
> 24,000 processors, ~ 5 PB storage
[Chart: sites and CPU]
EGEE: Steady growth over the lifetime of the project

14 A global, federated e-Infrastructure
EGEE infrastructure
  ~ 200 sites in 39 countries
  ~ 20 000 CPUs
  > 5 PB storage
  > 20 000 concurrent jobs per day
  > 60 Virtual Organisations
[Figure labels: EUIndiaGrid, EUMedGrid, SEE-GRID, EELA, BalticGrid, EUChinaGrid, OSG, NAREGI]
Related projects & collaborations are where the future expansion of resources will come from

Project        Anticipated resources (initial estimates)
Related infrastructure projects
SEE-grid       6 countries, 17 sites, 150 cpu
EELA           5 countries, 8 sites, 300 cpu
EUMedGrid      6 countries
BalticGrid     3 countries, few x 100 cpu
EUChinaGrid    TBC
Collaborations
OSG            30 sites, 10000 cpu
ARC            15 sites, 5000 cpu
DEISA          Supercomputing resources

15 Use of the infrastructure
Sustained & regular workloads of >30K jobs/day
  – Spread across full infrastructure
  – Doubling/tripling in last 6 months – no effect on operations

16 Use of the infrastructure
Massive data transfers > 1.5 GB/s
Several applications now depend on EGEE as their primary computing resource
Sustainability: Usage can (and does) grow without need for additional operational effort

17 EGEE Operations Process
Grid operator on duty
  – 6 teams working in weekly rotation
    ▪ CERN, IN2P3, INFN, UK/I, Ru, Taipei
  – Crucial in improving site stability and management
  – Expanding to all ROCs in EGEE-II
Operations coordination
  – Weekly operations meetings
  – Regular ROC managers meetings
  – Series of EGEE Operations Workshops
    ▪ Nov 04, May 05, Sep 05, June 06
Geographically distributed responsibility for operations:
  – There is no “central” operation
  – Tools are developed/hosted at different sites:
    ▪ GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
Procedures described in Operations Manual
  – Introducing new sites
  – Site downtime scheduling
  – Suspending a site
  – Escalation procedures
  – etc.
Highlights:
  – Distributed operation
  – Evolving and maturing procedures
  – Procedures being introduced into and shared with the related infrastructure projects
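As a rough illustration of the "6 teams working in weekly rotation" arrangement, the sketch below computes which team is on duty for a given date. The team names are those listed on the slide; their order in the rotation and the start date are assumptions made purely for the example.

```python
from datetime import date

# Team names as listed on the slide; the rotation order is an assumption.
TEAMS = ["CERN", "IN2P3", "INFN", "UK/I", "Ru", "Taipei"]

# Hypothetical Monday on which the first team in the list starts its week.
ROTATION_START = date(2006, 1, 2)


def team_on_duty(day: date) -> str:
    """Return the team on duty for the week containing `day`."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return TEAMS[weeks_elapsed % len(TEAMS)]


print(team_on_duty(date(2006, 1, 9)))  # IN2P3
```

Extending the rotation to all ROCs, as planned for EGEE-II, would in this sketch only mean lengthening the `TEAMS` list.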

18 Site Functional Tests
Site Functional Tests (SFT)
  – Framework to test (sample) services at all sites
  – Shows results matrix
  – Detailed test log available for troubleshooting and debugging
  – History of individual tests is kept
  – Can include VO-specific tests (e.g. sw environment)
  – Normally >80% of sites pass SFTs
    ▪ NB of 180 sites, some are not well managed
Very important in stabilising sites:
  – Apps use only good sites
  – Bad sites are automatically excluded
  – Sites work hard to fix problems
Extending to service availability:
  – Measure availability by service, site, VO
  – Each service has associated service class defining required availability (Critical, highly available, etc.)
  – First approach to SLA
Use to generate alarms
  – Generate trouble tickets
  – Call out support staff
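The ideas on this slide (a results matrix, automatic exclusion of bad sites, and availability checked against a service class) can be sketched in a few lines. Everything concrete here is assumed for illustration: the site names, test names, the set of critical tests, and the numeric availability targets are invented, since the slide names the service classes but gives no thresholds, and this is not the actual SFT implementation.

```python
# Hypothetical results matrix: site -> {test name -> passed?}. All names are invented.
results = {
    "site-a": {"job-submit": True, "replica-mgmt": True, "sw-env": True},
    "site-b": {"job-submit": True, "replica-mgmt": False, "sw-env": True},
    "site-c": {"job-submit": True, "replica-mgmt": True, "sw-env": False},
}

# Assumed set of tests a site must pass to count as "good".
CRITICAL_TESTS = {"job-submit", "replica-mgmt"}

# Assumed targets per service class; the slide gives class names only.
SERVICE_CLASS_TARGET = {"critical": 0.99, "highly available": 0.95}


def good_sites(matrix, critical):
    """Sites passing every critical test; a missing result counts as a failure."""
    return sorted(
        site for site, tests in matrix.items()
        if all(tests.get(name, False) for name in critical)
    )


def pass_rate(matrix, critical):
    """Fraction of sites passing all critical tests."""
    return len(good_sites(matrix, critical)) / len(matrix)


def availability(samples):
    """Fraction of successful probes in a list of booleans."""
    return sum(samples) / len(samples)


def needs_alarm(samples, service_class):
    """True when measured availability falls below the class target."""
    return availability(samples) < SERVICE_CLASS_TARGET[service_class]


print(good_sites(results, CRITICAL_TESTS))            # ['site-a', 'site-c']
print(needs_alarm([True] * 9 + [False], "critical"))  # True
```

In this sketch "site-c" still counts as good because the failing `sw-env` test is treated as VO-specific rather than critical; which tests are critical is exactly the kind of policy choice the sketch leaves open.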

19 Middleware Distributions and Stacks
Terminology:
  – EGEE deploys a middleware distribution
    ▪ Drawn from various middleware products, stacks, etc.
    ▪ Do not confuse the distribution with development projects or with software packages
    ▪ Count on 6 months from software developer “release” to production deployment
  – The EGEE distribution:
    ▪ Current production version labelled: LCG-2.7.0
    ▪ New production version labelled: gLite-3.0
    ▪ Name change to hopefully reduce confusion
EGEE distribution contents:
LCG-2.7.0:
  – VDT – packaging Globus 2.4, Condor, MyProxy
  – EDG workload management
  – LCG components:
    ▪ BDII (info sys)
    ▪ Catalogue (LFC)
    ▪ DPM, data management libraries and CLI tools
    ▪ Monitoring tools
  – gLite: R-GMA, VOMS, FTS
gLite-3.0:
  – Based on LCG-2.7.0, and
  – gLite workload management
  – Other gLite components (not in the distribution but provided as services):
    ▪ AMGA, Hydra, Fireman
    ▪ gLite-IO evolution

20 Process to deployment
[Diagram labels: Middleware providers (VDT/OSG, OMII-Europe, JRA1, …); Integration; Testing & Certification; Certification activities; Pre-production service; Production service; Support, analysis, debugging; SA3; SA3+SA1; SA1]

21 The Support Model
“Regional Support with Central Coordination”
The ROCs, VOs and other project-wide groups such as the middleware groups (JRA), network groups (NA), service groups (SA) are connected via a central integration platform provided by GGUS.
[Diagram labels: Central Application (GGUS); Interface; Webportal; TPM; Regional Support units (ROC 1 … ROC 10); User Support units (VO Support); Technical Support units (Deployment Support, Middleware Support, Network Support, Operations Support)]
GGUS is now being used for all problem reporting:
  – Operational, deployment and user support
  – VOs are using it for their support system
  – The use is growing steadily
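To make "regional support with central coordination" more concrete, here is a minimal sketch of how a central application might route a ticket to one of the support units named on the slide. The `Ticket` fields, the order in which the rules are tried, and the fallback to TPM are all assumptions for illustration; the slide does not describe the actual GGUS routing logic.

```python
from dataclasses import dataclass
from typing import Optional

# Unit names follow the labels on the slide; the category keys are hypothetical.
TECHNICAL_UNITS = {
    "deployment": "Deployment Support",
    "middleware": "Middleware Support",
    "network": "Network Support",
    "operations": "Operations Support",
}


@dataclass
class Ticket:
    """Hypothetical ticket record; field names are invented for this sketch."""
    summary: str
    category: Optional[str] = None  # technical area, if the submitter identified one
    vo: Optional[str] = None        # virtual organisation of the submitter
    region: Optional[str] = None    # ROC identifier, e.g. "1"


def route(ticket: Ticket) -> str:
    """Pick a support unit. Assumed precedence: technical, then VO, then regional."""
    if ticket.category in TECHNICAL_UNITS:
        return TECHNICAL_UNITS[ticket.category]
    if ticket.vo:
        return f"VO Support ({ticket.vo})"
    if ticket.region:
        return f"ROC {ticket.region}"
    # Assumption: anything unclassified is left with TPM for manual assignment.
    return "TPM"


print(route(Ticket("job submission fails", category="middleware")))  # Middleware Support
print(route(Ticket("site unreachable", region="1")))                 # ROC 1
```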

22 Security & Policy
Collaborative policy development
  – Many policy aspects are collaborative works; e.g.:
Joint Security Policy Group
Certification Authorities
  – EUGridPMA
    ▪ IGTF, etc.
Grid Acceptable Use Policy (AUP)
  – Common, general and simple AUP
  – For all VO members using many Grid infrastructures
    ▪ EGEE, OSG, SEE-GRID, DEISA, national Grids…
Incident Handling and Response
  – Defines basic communications paths
  – Defines requirements (MUSTs) for IR
  – Not to replace or interfere with local response plans
[Diagram labels: Security & Availability Policy; Usage Rules; Certification Authorities; Audit Requirements; Incident Response; User Registration & VO Management; Application Development & Network Admin Guide; VO Security]

23 SA1 goals for EGEE-II
Key goal:
  – We have a large running production infrastructure; but EGEE-II MUST take what we have now and make it:
    ▪ Reliable
        – Middleware components fail, error reporting is missing, …
        – There is an application responsibility here too – needs effort
        – … but! The service has been running non-stop for > 2 years
    ▪ Robust
        – Must continue to address service aspects – move away from prototypes
    ▪ Usable
        – It is still hard to use for many users; still too slow to introduce new VOs
    ▪ Acceptable
        – It must be easy to deploy in a wide variety of environments and coexist with other grid infrastructures
    ▪ Sustainable
        – The infrastructure must become sustainable for the long term

24 SA1 Outlook
LHC VOs must achieve reliable production and analysis in 2006
  – Will be making significant use of resources
  – Applications must bring resources
    ▪ Show commitment
Consolidate and improve existing services: Focus on
  – Reliability, robustness, manageability, performance, scalability, etc.
  – Evolution or replacement of services driven by needs of application (or operations/security/manageability)
    ▪ TCG has key role here
Expand grid operations
  – Spread expertise to ROCs
  – Collaboration with OSG, A-P, etc. and related projects
  – Start to negotiate SLAs
  – Sustainability: processes evolving, spread of expertise and tasks
  – Resource sharing and negotiation – must become streamlined
    ▪ Will need a mechanism for cost/credit for use of resources

25 Summary
SA1 has built a large production grid infrastructure
In constant and extensive daily production use
  – Several applications depend on it for resources
Tools and processes are maturing and evolving
Security and usage policies also evolving
We have a basic set of middleware that addresses most requirements
Production middleware is converged now
  – LCG-2 + gLite → gLite 3
EGEE-II will focus on making this sustainable and really usable

