EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks Dr. Ian Bird CERN SA1 Activity Manager EGEE’07.

1 The EGEE infrastructure
Dr. Ian Bird, CERN, SA1 Activity Manager
EGEE’07 Conference, Budapest, 2nd October 2007
EGEE-II INFSO-RI-031688, Enabling Grids for E-sciencE, www.eu-egee.org
EGEE and gLite are registered trademarks

2 Outline
Overall status and usage
Progress in the past year
  – Growth in resources and use
  – Security activities
  – Operations
  – Pre-production service
  – Certification and testing
  – Network support
  – SLAs
Monitoring advances
Expectations for the next year
  – New services
  – EGEE-III
Summary

3 The EGEE Infrastructure
Test-beds & Services
  – Production Service
  – Pre-production service
  – Certification test-beds (SA3)
Support Structures & Processes
  – Operations Coordination Centre
  – Regional Operations Centres
  – Global Grid User Support
  – EGEE Network Operations Centre (SA2)
  – Operational Security Coordination Team
  – Training infrastructure (NA4)
  – Training activities (NA3)
Security & Policy Groups
  – Operations Advisory Group (+NA4)
  – Joint Security Policy Group
  – EuGridPMA (& IGTF)
  – Grid Security Vulnerability Group

4 Resources
Region   Countries  Sites    CPU
CERN             5     12   6400
UK/I             2     25   8384
Fr               2     12   7238
De/CH            2     15   4413
It               1     34   4341
NE               9     30   3289
SEE              8     38   2727
CE               7     24   2588
SWE              2     18   1938
A-P              8     20   1884
Ru               2     15    738
Totals          48    243  44040
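The resource figures on this slide can be cross-checked against the stated totals. The sketch below is only a consistency check: the split of each flattened row into (countries, sites, CPU) is an interpretation of the transcript, not something the slide states explicitly, and the names `RESOURCES`, `STATED_TOTALS` and `column_sums` are made up for the example.

```python
# Per-region figures as read from the slide: (countries, sites, CPUs).
# The column split is an assumption made when un-flattening the transcript.
RESOURCES = {
    "CERN": (5, 12, 6400),
    "UK/I": (2, 25, 8384),
    "Fr": (2, 12, 7238),
    "De/CH": (2, 15, 4413),
    "It": (1, 34, 4341),
    "NE": (9, 30, 3289),
    "SEE": (8, 38, 2727),
    "CE": (7, 24, 2588),
    "SWE": (2, 18, 1938),
    "A-P": (8, 20, 1884),
    "Ru": (2, 15, 738),
}

# Totals row exactly as it appears on the slide.
STATED_TOTALS = (48, 243, 44040)


def column_sums(table):
    """Sum each column (countries, sites, CPUs) across all regions."""
    return tuple(sum(column) for column in zip(*table.values()))


computed = column_sums(RESOURCES)
for name, got, stated in zip(("countries", "sites", "CPU"), computed, STATED_TOTALS):
    print(f"{name}: computed {got}, stated {stated}, difference {stated - got}")
```

Under this reading the country and site columns add up to the stated totals exactly, while the CPU column sums to 43940, which is 100 below the stated 44040; the difference may simply come from the slide itself.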

5 Increasing workloads
Chart annotation: 32%
Still expect factor 5 increase for LHC experiments over next year

6 Use of the infrastructure
EGEE: ~250 sites, >45000 CPU
24% of the resources are contributed by groups external to the project
More than ~20k simultaneous jobs

7 Progress since EGEE’06

8 Operations progress
Progress/success: production service, Oct ’06 to Sep ’07:
  – Number of sites: ~190 => ~240 (x1.25 increase)
  – Average number of jobs/month for preceding 12 months: 0.97 million => 2.46 million (x2.5 increase)
  – Peak number of jobs in preceding 12 months: 1.45 million (June 06) => 3.11 million (May 07) (x2.14 increase)
  – Number of CPUs: ~32,000 => ~46,000 (x1.44 increase)
  – Increase in number of teams involved in grid operations (CODs):
    • The work is now shared by the two teams who are on duty (it used to be a primary/backup set-up where the backup only came on-line as needed). This is a better way to work, as the teams no longer have such long breaks between shifts (used to be ~10 weeks)
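The growth factors quoted on this slide are plain ratios of the "after" figure to the "before" figure. As a minimal sketch (the dictionary keys and the `growth_factor` helper are illustrative names, and the site and CPU counts are the slide's approximate values), they can be recomputed as follows:

```python
def growth_factor(before, after):
    """Return the multiplicative increase from `before` to `after`."""
    return after / before


# (before, after) pairs taken from the slide; site and CPU counts are approximate there.
METRICS = {
    "sites": (190, 240),
    "average jobs per month (millions)": (0.97, 2.46),
    "peak jobs (millions)": (1.45, 3.11),
    "CPUs": (32_000, 46_000),
}

factors = {name: growth_factor(before, after) for name, (before, after) in METRICS.items()}
for name, factor in factors.items():
    print(f"{name}: x{factor:.2f}")
```

Because the site and CPU counts are given only approximately, the recomputed ratios can differ from the slide's rounded factors in the last digit.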

9 SFT → SAM
Migrated from SFT to SAM
  – Massive improvements in standardizing the framework
    • Anyone can now easily contribute tests
    • Now easier for people to run their own instance of the service
  – SAM now used in one way or another by all the LHC experiments
  – Started generating site availability reports

10 Operations progress
Successful releases of major updates to many central operations services (GOCDB, CIC Portal, GGUS)
  – CIC Portal new features include raising of alarms and masking of unnecessary alarms (leading to less time wasted by CODs)
  – RSS feed for CIC Portal alarms so that site administrators can monitor their own sites
  – Major update to GOCDB which included many new, useful features
    • Still a few bugs to fix
Implementation of failover for most central operations services
  – Still needed for GOC database
  – Improvements still needed for other operations services (for example CIC Portal)

11 Operations progress
Implementation of a formalized grid middleware release process
  – Moved from “big bang” releases to incremental updates
  – Formal, documented process now in place, handled by teams rather than single-point-of-failure individuals
  – For details: http://egee-pre-production-service.web.cern.ch/egee-pre-production-service/index.php?dir=./release/
Release of WMS: better performance and reliability compared with the RB
Full deployment of FTS service
Process implemented to track most urgent/important grid issues by the ROCs
  – These are passed to the TCG where appropriate and have resulted in significant improvements, for example standardization and improvement of middleware logging
Interoperability with OSG in production
  – CMS now submit jobs to both grids (EGEE and OSG) through a single WMS
Moved to SL4 version of WN; other services coming soon

12 Issues for Operations
Need to improve reliability of ‘user’ services (users care about successful jobs, and this involves many grid middleware services)
  – Resilience to glitches
  – Identification and treatment of SPoFs
Not clear how far the current COD structure can scale
Central operations services (Gstat, GOCDB, CIC portal, etc.) are all now interdependent and heavily used for day-to-day operations
  – The failover and upgrade mechanisms need to be improved to keep down-time to a minimum
Still need to keep improving the release notes; they remain a major cause of deployment issues
Need dedicated interoperability testing
VOViews and Job Priorities are confusing for many sites

13 Operational Security
Operational Security Coordination Team (OSCT)
Successes:
  – The OSCT will provide its first security training event during EGEE’07. All service managers and site administrators are welcome
Issues:
  – The OSCT is looking for additional experts to contribute to its activities; people with an interest in security should contact the team
Progress:
  – The OSCT is gradually introducing SAM security tests to check for known security issues at the sites
    • Note: it uses special tests in SAM, securely transported and visible only to the OSCT

14 JSPG
The successes during the last year have been:
Updating the top-level Security Policy to make it simpler and more general
  – Generalisation and simplification of the policies has been needed to achieve interoperable (identical) policies between EGEE, OSG, NDGF and others
New policies: Site Operations, VO Operations, Pilot Jobs
Issues:
Need to review and update several older policy documents (in EGEE-II) to remove duplications and ambiguities
Work on the next revision of the full policy set to make it even more general and applicable to more Grids in the world of EGI and NGIs (in EGEE-III)

15 GSVG
The EGEE-II Grid Security Vulnerability issue handling process is now approved and in use
Deliverable DSA1.3, which includes a summary of the GSVG strategy, has been approved by the PEB and accepted by the EU
  – This allows the disclosure of issues concerning EGEE middleware when they reach the Target Date for resolution
The Risk Assessment team is handling security vulnerability issues and carrying out risk assessments
Since GSVG started (end 2005):
  – 122 issues analysed (1–2 per week)
    • 62 open (42 are software bugs); 60 closed (25 bug fixes, 7 operational)
    • 1 extremely critical, 9 high risk (2 open)
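The issue counts and the quoted rate on this slide can be sanity-checked with a little date arithmetic. This is only a sketch: the slide says "end 2005" without an exact date, so 31 December 2005 is assumed as the start, the talk date is used as the end, and `issues_per_week` is an illustrative helper.

```python
from datetime import date

ISSUES_ANALYSED = 122
OPEN_ISSUES, CLOSED_ISSUES = 62, 60

# "End 2005" is not an exact date on the slide; 31 December 2005 is assumed here.
ASSUMED_START = date(2005, 12, 31)
TALK_DATE = date(2007, 10, 2)


def issues_per_week(total, start, end):
    """Average number of issues per week between two dates."""
    weeks = (end - start).days / 7
    return total / weeks


rate = issues_per_week(ISSUES_ANALYSED, ASSUMED_START, TALK_DATE)
print(f"open + closed = {OPEN_ISSUES + CLOSED_ISSUES}")
print(f"average rate: {rate:.2f} issues per week")
```

With that assumed start date the open and closed counts add up to the 122 issues analysed, and the average rate falls inside the slide's range of 1–2 per week.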

16 User Support – 1
Technical and procedural efforts:
Lots of technical improvements:
  – New search engine
  – Ticket linking
  – Subscription to tickets
  – Local helpdesks
  – Reporting tools
Bidirectional interface with OSG user support
TPM first-line support works smoothly now
Clear distinction between Services and Software Support Units
Still responsiveness issues when problems leave SA1's sphere of influence

17 User Support – 2
Documentation:
Transparent development through the ESC shopping list: https://savannah.cern.ch/projects/esc/
A prototype for better quality ticket submission is available at https://iwrgustrain.fzk.de/pages/ticket1.php
  – Put comments in shopping list ticket #102127 or send them to ggus-info@cern.ch
Rigorous ticket progress reporting and monitoring is now possible at:
  – https://gus.fzk.de/pages/download_escalation_reports.php
  – http://goc.grid.sinica.edu.tw/gocwiki/TPM_monitoring_reports
A collection of information sources is assembled at https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport

18 User Support – 3
Communication:
With all Grid sites, including OSG, weekly at the Operations meeting
With ROCs, VOs and GGUS developers monthly at the ESC
With GGUS developers fortnightly in the Shopping List review that defines the content of the (monthly) GGUS releases
VOs begin to realise the importance of strong user support (CHEP'07)
Workshop to establish and improve the connection between grid and VO user support
    • See SA1 session on Thursday afternoon in "VO managers and ROC managers issues"

19 Pre-production service
Pre-production service is now ~27 sites in 16 countries
Provides access to some 3000 CPU
  – Some sites allow access to their full production batch systems for scale tests
Sites install and test different configurations and sets of services
Weekly update cycle
Try to get a good feeling for the quality of the release or updates before general release to production
Larger sites gain experience on PPS before going to production
Services may be initially demonstrated in this environment before further development
New VOs: adapt their applications and gain experience (e.g. DILIGENT)

20 Pre-production service
Issue: the service is not used at the level that was intended
Many issues for LHC experiments
  – Lack of effort
  – Difficult to test complex software stacks outside the production environment
Slows down the deployment process, but good for sites to get a pre-deployment look at changes, new services, etc.
Cannot justify the cost?
Discussion on future of PPS during this conference

21 Progress in Certification
Handling change with the established process
  – 1 release per week
  – Process evolving, based on experience
  – 289 patches in 2007
    • Corresponding to 820 closed bugs
    • ~10 patches are in work in parallel
Limited by resources
  – Patch certification sees more partner participation
    • 18 patches certified by external partners
    • We have to increase this
Extensive use of the “Experimental Services” process
  – Only way to address scalability and stability of core services
  – For the WMS the service moved outside CERN
    • Service run by INFN
    • Verification of checkpoint releases by Imperial College

22 Certification & Testing
Improved test coverage
  – Especially Data Management
  – Pre-certification release to interested user communities
    • Very early feedback to developers
Extensive use of virtual test-beds
Changes:
YAIM-4 configuration support tool
  – Independent releasable modules per component
  – Opened YAIM for developers and site admins
  – Major refactoring of the tool
  – Removed almost all legacy Python configuration
Move to ETICS
  – Difficult transition
    • Sometimes 3 build systems involved in one release

23 Porting
Move to SL4 and VDT-1.6 (32 and 64 bit)
  – Much delayed
    • Revised plan and plan for restructuring gLite
  – Still in progress (but getting close)
  – WN and UI (32-bit) are in production
  – LCG-CE has been ported to SL4 + VDT-1.6
    • Will reach PPS in 2 weeks (including DGAS support)
  – WMS/LB gLite 3.1 / SL4 version
    • Certification in about 2 months
  – BDII released to PPS
  – DPM and LFC have been tested internally on SL4 (32- and 64-bit)
    • Just waiting for the YAIM component to complete certification
  – FTS-2 SL4 pilot service is planned for October
    • Release and deployment at T1s in January
  – VOBOX prototype has been set up during summer
    • 1–2 months
  – glite-PX
    • Finalising configuration (1 month)

24 Porting – cont.
  – glite-MON
    • Need config for Tomcat 5.5
  – glite-SE_classic
    • Just started working, but simple
  – glite-VOMS
    • Being processed as patch #1322
    • ~2 months
Strategy for 64-bit is prioritised:
  – WN + Torque client
  – DPM-disk
  – UI
  – Other services depending on 64-bit advantage
Currently the 64-bit WN + Torque is undergoing runtime testing
  – Management scripts need to be updated to accommodate packages which must be installed 32/64

25 Plan (May)
As shown at the EGEE review
Problems to move to gLite-3.1 (including ETICS)
  – Addressed by the PMB endorsed “gLite restructuring plan”
[Timeline chart, Nov to Aug, showing the plan and the revised plan. Recoverable labels:]
  – WMS, gLite-CE: SL4 & VDT1.6
  – WN fall back: SL3 code on VDT1.2 on SL4, on PPS
  – Build with 3.1 build system
  – UI/WN: SL4 & VDT1.6
  – UI fall back: SL3 code on VDT1.2 on SL4
  – Fall back solutions delivered on time, minimal impact on sites
  – FTS, DPM, LFC, …: move to SL4 & VDT1.6 independently when they are ready
Status of the “gLite restructuring plan”:
  – WN-3.1 SL4 released
  – UI very close
  – 90+% of all components build

26 Progress around networking (SA2)
The EGEE Network Operations Centre (ENOC):
  – 65% of EGEE certified sites covered (72% in Europe)
    • Receiving incident and maintenance notices from NRENs
    • Linking them with detected troubles on the EGEE infrastructure
  – Monitoring of the sites’ network availability
    • Web interface presenting the results
    • Data available via HTTP/XML for other usages (Nagios, COD)
    • https://ccenoc.in2p3.fr/DownCollector/
    • Improves the network monitoring within EGEE, but performance data still missing!
  – LHC Optical Private Network operational model
    • Critical for the reliability of LCG
    • Ongoing formalization of the roles, functions and processes
    • In collaboration with LCG and NRENs
More info and details:
    • Dedicated network session on Wed. morning (11:00-12:30)
    • https://ccenoc.in2p3.fr/

27 Progress in SLAs
SLA working group put in place
A draft SLA document has been produced for discussion
  – Based on experience in other projects and ROCs
  – Some outstanding issues still to be addressed; see discussion at this conference
  – Metrics have not yet been agreed
So far mainly addresses agreements between sites and ROCs
Covers:
  – Responsibilities (ROCs and sites)
  – Hardware and connectivity
  – Services to be provided
  – Service hours
  – Availability
  – Support: general and for VOs
  – Service continuity and security
  – Service reporting and reviewing

28 Monitoring

29 Monitoring landscape
Domain             Monitoring                 Tools in use
Local resources    Local monitoring           Lemon/SLS, Nagios, Ganglia, ...
Grid middleware    Grid services monitoring   Gstat, SAM/GridView, GridICE, GridPP Real Time Monitor, ...
Grid applications  Application monitoring     Experiment Dashboards, ...
Diagram labels: central services, site services, site
3 Monitoring Working Groups

30 Monitoring working groups
Goal: improve overall reliability of sites and services
System Management
  – Fabric management
  – Best practices
  – Security
  – …
Grid Services
  – Grid sensors
  – Transport
  – Repositories
  – Views
  – …
System Analysis
  – Application monitoring
  – …

31 High Level Model

32 Prototype site implementation

33 Nagios display

34 Treemap visualization

37 Evolution

38 Changes to services
SL4
  – SL4 deployment in progress
  – Will have 64-bit versions of many components
  – Need to ensure next ports are not showstoppers (SL5, …)
WMS
  – gLite WMS is replacing the old RBs; RBs will not be supported
CE
  – Strategy for the CE has been agreed:
  – LCG-CE has been ported to SL4
  – CREAM and gLite-CE were both shown to provide basic performance levels
    • Cannot afford to bring both to production: focus on CREAM, expect a deployable version in early 2008

39 Service evolution
Pilot jobs & gLExec
  – Pilot jobs are a reality (and have been for some time)
  – Need to ensure correct audit and/or identity management
  – gLExec can be used to authorize users via LCAS/LCMAPS
Job priorities
  – Has been an ongoing issue: desire to base priorities on VOMS roles/groups
  – Short-term “simple” solution caused many problems and has now been fixed; this seems to be sufficient for about the next year
  – Longer term: re-look at end-to-end authN/authZ with real use cases

40 Grid Operations in EGEE-3
No major changes: consolidation of existing activities
  – Overall level of effort is reduced by ~25% from EGEE-II
Will continue with the 5 major tasks that we currently have:
  – Grid management
  – Grid operations & support
  – User support
  – Operational security
  – General and admin tasks
Emphasis on improving reliability, robustness, usability, support
All ROCs will do all key operational tasks
  – Operator on Duty
  – TPMs and GGUS support effort
  – Security coordination (OSCT)
Suppression of some sub-tasks
  – Have seen no justified case for regional certification
  – Porting tasks are in SA3

41 Specific areas to address
Monitoring and oversight should evolve towards automation
  – Between EGEE-3 and EGI must reduce operations effort (by factor 2?)
  – Need to have a plan for automation, alarms, etc.
Service Level Agreements
  – Part of the overall effort of QA
  – Categorization of sites; different deployment scenarios
Integration of operations with existing and embryonic National Grid Infrastructures
  – Transition plan to EGI/NGI; need to understand what NGIs will do
Integrating new VOs into the infrastructure

42 New VO support
Based on discussion in Stockholm workshop:
Catch-all/regional VOs
Regions agree to support any VO with users in the region
All JRUs/NGIs commit a certain fraction of their resources
Pool of additional “seed resources”
  – 75 k€ requested for CPU and disk
  – To be installed at a maximum of 3 sites that guarantee access and a high level of service to new VOs
VO managers group will identify new VOs eligible for project support
Core services for new VOs assigned to a set of sites that have agreed to provide this; round-robin if no existing relationship
VOs must provide an “ID card”: the full set of information needed by sites
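The assignment rule for core services on this slide (use an existing relationship where there is one, otherwise go round-robin over the volunteering sites) can be sketched in a few lines. This is an illustration of the rule only, not a description of any tool used by the project; the function name, the VO names and the site names are all hypothetical.

```python
from itertools import cycle


def assign_core_sites(new_vos, volunteer_sites, existing_relationships=None):
    """Map each new VO to a site hosting its core services.

    A VO with an existing relationship to a volunteering site keeps that site;
    every other VO takes the next site in round-robin order.
    """
    if not volunteer_sites:
        raise ValueError("at least one volunteering site is required")
    existing_relationships = existing_relationships or {}
    rotation = cycle(volunteer_sites)  # endless round-robin over the sites
    assignment = {}
    for vo in new_vos:
        preferred = existing_relationships.get(vo)
        if preferred in volunteer_sites:
            assignment[vo] = preferred
        else:
            assignment[vo] = next(rotation)
    return assignment


# Hypothetical VO and site names, purely for illustration.
example = assign_core_sites(
    ["vo-a", "vo-b", "vo-c", "vo-d"],
    ["site-1", "site-2", "site-3"],
    existing_relationships={"vo-b": "site-3"},
)
print(example)
```

In this sketch a VO with an existing relationship does not consume a turn in the rotation, so the remaining VOs are still spread evenly over the sites; whether the real procedure balances load this way is not stated on the slide.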

43 Challenges
Usage-related
  – Scale
  – Reliability
  – Usability
Organizational
  – Many grids (campus, national, …)
  – NGI/EGI
  – Devolution: central model → fully distributed
  – How will a VO get dependability of services in this scenario?

44 Challenges
Middleware
  – Complexity of the full distribution
  – Time for porting, etc.
  – Time to go to production: unrealistic expectations
    • E.g. gLite WMS
  – The biggest technical challenge for EGEE?

45 Summary
EGEE infrastructure has continued to grow
  – Sites, resources, usage
Still need to scale workloads by at least x5 in the next year for LHC
Major challenges: reliability, usability and manageability have improved, but not enough
Significant efforts in monitoring to try and help
But must focus on stabilizing what we have and not trying to add too much
Significant progress in the past year
The start-up of LHC will be a major test of the infrastructure

