
1 The LHC Computing Grid – February 2008: CERN's Integration and Certification Services for a Multinational Computing Infrastructure with Independent Developers and Demanding User Communities. Dr. Andreas Unterkircher, Dr. Markus Schulz, EGEE SA3 & LCG Deployment, April 2009, CERN IT Department

2 Outline
– CERN and the LHC computing challenge: data rates, computing, community
– Grid projects @ CERN: WLCG, EGEE
– gLite middleware: code base
– Experience: integration, certification
– Lessons learned

3 CERN stands for over 50 years of fundamental research and discoveries, technological innovation, training and education – bringing the world together.
– 1954, Rebuilding Europe: first meeting of the CERN Council
– 1980, East meets West: visit of a delegation from Beijing
– 2004, Global collaboration: the Large Hadron Collider involves over 80 countries

4 CERN's mission in science
– Understand the fundamental laws of nature: we accelerate elementary particles, make them collide, and compare the results with theory.
– Provide a world-class laboratory to researchers in Europe and beyond.
A few numbers…
– 2500 employees: physicists, engineers, technicians, craftsmen, administrators, secretaries, … (shrinking)
– 6500 visiting scientists (half of the world's particle physicists), representing 500 universities and over 80 nationalities (increasing)
– Budget: ~1 billion Swiss Francs per year, with additional contributions by participating institutes

5 View of the LHC tunnel. CERN built the Large Hadron Collider (LHC), the world's largest particle accelerator (27 km long, 100 m underground). First beam in 2008; start of the physics programme in autumn 2009.

6 View of the ATLAS detector (2005). 150 million sensors deliver data … 40 million times per second.

7 View of the ATLAS detector (almost ready)

8 The LHC Computing Challenge
– Signal/noise < 10⁻⁹
– Data volume: high rate × large number of channels × 4 experiments → 15 PetaBytes of new data each year (~20 million CDs); see the sketch below
– Compute power: event complexity × number of events × thousands of users → >100k of today's fastest CPUs
– Worldwide analysis & funding: computing funded locally in major regions and countries, efficient analysis everywhere → GRID technology
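A back-of-envelope check of the data-volume bullet above, as a minimal Python sketch. The trigger rate, event size, and accelerator live time are illustrative assumptions, not official experiment parameters; the point is only that the product lands in the same order of magnitude as the 15 PB/year quoted on the slide.

```python
# Rough check of the "15 PB of new data each year" figure.
# All inputs below are assumed, order-of-magnitude values.
recorded_rate_hz = 200   # events/s kept after triggering (assumed)
event_size_mb = 1.5      # average raw event size in MB (assumed)
live_seconds = 1e7       # ~ one year of accelerator operation (assumed)
experiments = 4          # ALICE, ATLAS, CMS, LHCb

pb_per_year = recorded_rate_hz * event_size_mb * live_seconds * experiments / 1e9
print(f"~{pb_per_year:.0f} PB of new data per year")  # ~12 PB, same order as 15 PB
```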

9 LHC user community
– Europe: 267 institutes, 4603 users
– Other: 208 institutes, 1632 users
– Over 6000 LHC scientists worldwide
(world map with per-region user counts omitted)

10 Data flow to the CERN Computer Centre (detector data arrives over 10 Gbit links)

11 LHC Computing Grid project (LCG) – Tier-1 centres
– Canada: TRIUMF (Vancouver)
– France: IN2P3 (Lyon)
– Germany: Forschungszentrum Karlsruhe
– Italy: CNAF (Bologna)
– Netherlands: NIKHEF/SARA (Amsterdam)
– Nordic countries: distributed Tier-1
– Spain: PIC (Barcelona)
– Taiwan: Academia Sinica (Taipei)
– UK: CLRC (Oxford)
– US: FermiLab (Illinois), Brookhaven (NY)
10 Gbit links connect CERN to each of the 10 Tier-1 centres – large facilities with mass storage capability. Tier-2s: ~150 centres in ~35 countries, with 50–5000 CPUs each.

12 LHC computing → multi-science
– 1999, MONARC project: first LHC computing architecture – hierarchical distributed model
– 2000, growing interest in grid technology: the HEP community was the main driver in launching the DataGrid project
– 2001–2004, EU DataGrid project: middleware & testbed for an operational grid
– 2002–2005, LHC Computing Grid (LCG): deploying the results of DataGrid to provide a production facility for the LHC experiments
– 2004–2006, EU EGEE project phase 1: starts from the LCG grid; shared production infrastructure; expanding to other communities and sciences
– 2006–2008, EU EGEE project phase 2: expanding to other communities and sciences; scale and stability; interoperation/interoperability
– 2008–2010, EU EGEE project phase 3: more communities, efficient operations, less central coordination

13 The EGEE project
– Started in April 2004; now in its third phase (2008–2010) with 91 partners in 32 countries
– From 2010: egi.org
Objectives:
– Large-scale, production-quality grid infrastructure for e-Science
– Attracting new resources and users from industry as well as science
– Maintain and further improve the gLite grid middleware

14 Enabling Grids for E-sciencE (EGEE-II INFSO-RI-031688)
Application domains: archeology, astronomy, astrophysics, civil protection, computational chemistry, earth sciences, finance, fusion, geophysics, high energy physics, life sciences, multimedia, material sciences, …
– >250 sites in 48 countries
– >100,000 CPUs
– >20 PetaBytes
– >10,000 users, >200 communities
– >350,000 jobs/day
A global multi-science infrastructure, mission critical for many communities. (Chart: number of jobs from 2004 to 2009 – rapid growth of the infrastructure.)

15 gLite middleware – www.glite.org

16 gLite middleware
Service groups (architecture diagram):
– Access services: User Interface, API
– Security services: Authentication, Authorization
– Information & monitoring services: Information System, Job Monitoring, Accounting
– Data services: Storage Element, File and Replica Catalog, Metadata Catalog
– Job management services: Computing Element, Worker Node, Workload Management, Job Provenance
Development effort from different projects: Condor, Globus, Virtual Data Toolkit (VDT), EGEE, LCG, and others.
The project relies on a collaborative, consensus-based process:
– No single architect; a Technical Director and a Technical Management Board
– Agree with stakeholders on next steps and on priorities
– Bi-weekly phone conference to coordinate short-term priorities and incidents (bugs)
– 2–3 all-hands meetings per year
– Mail, mail and mail …

17 gLite code base (chart)

18 gLite code details (chart)

19 gLite code details (chart; scale markers 1K, 2K, 5K, 10K)

20 gLite code details. Complex external and internal cross dependencies – integration and configuration management was always a challenge. The components are grouped into ~30 services (see the sketch below).
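To make the integration challenge concrete, here is a minimal Python sketch of dependency-ordered updates. The component graph below is invented for illustration – it is not the real gLite dependency tree – but it shows why one broken low-level component blocks everything that depends on it.

```python
# Components must be built/updated in dependency order; a change low in
# the graph ripples upward. Names and edges are illustrative only.
from graphlib import TopologicalSorter  # Python 3.9+

deps = {  # component -> components it depends on (invented)
    "workload-management": {"security", "information-system"},
    "computing-element":   {"security", "information-system"},
    "storage-element":     {"security"},
    "file-catalog":        {"storage-element"},
    "information-system":  {"security"},
    "security":            set(),
}

# One valid build/update order: 'security' first, catalogs and job
# management last - exactly the ordering problem integration must solve.
print(list(TopologicalSorter(deps).static_order()))
```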

21 Complex Dependencies

22 Example: Data Management

23 Stability of the software
All components still see frequent changes. Many developments started in 2002 – why do we still need changes?
– The scale of the system increased rapidly (exponential growth)
– The number of users and use cases increased: deeper code coverage, new functional requirements
– Less tolerance to failures: implementation of fail-over
– Emerging standards: the project started when no standards were available; incremental introduction

24 Software stability: defects
Most changes (81%) are triggered by defects, and ~40% of defects are found by users. There are ~2000 open bugs at any time. Contributing factors: increased production use, and developers use the same system.

25 Software process (since 2006)
Component based, frequent releases:
– Components are updated independently; no big-bang releases
– Updates (patches) are delivered on a weekly basis to the PPS and move after 2 weeks to production
– Clear prioritization by stakeholders
– Clear definition of roles and responsibilities
– Use of a common build system (ETICS)
Release model: pull (see the sketch below) – sites pick up updates when convenient, multiple versions are in production, and retirement of old versions takes > 1 year.
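A minimal sketch of the pull release model described above: each site compares its installed component versions against the repository and decides when to update. The component names and versions are invented, and the naive string comparison stands in for real rpm version logic.

```python
# Sites pull updates when convenient, so multiple versions coexist in
# production. Names/versions below are invented for illustration.
installed  = {"lcg-CE": "3.1.2", "glite-WN": "3.1.5"}
repository = {"lcg-CE": "3.1.4", "glite-WN": "3.1.5", "glite-UI": "3.1.1"}

def pending_updates(installed, repository):
    """Components for which the repository offers a newer version."""
    return {name: ver for name, ver in repository.items()
            if name in installed and installed[name] < ver}

print(pending_updates(installed, repository))  # {'lcg-CE': '3.1.4'}
```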

26 Component-based process

27 Patch and bug lifecycle
State changes are tracked in Savannah; progress is monitored by dashboards.
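A minimal sketch of such a tracked lifecycle. The state names and allowed transitions below are assumptions chosen to match the process described on these slides (certification, then PPS, then production); the authoritative workflow lives in Savannah.

```python
# Toy patch-lifecycle tracker; states/transitions are assumed.
ALLOWED = {
    "open":           {"with-certifier"},
    "with-certifier": {"certified", "rejected"},
    "certified":      {"in-pps"},
    "in-pps":         {"in-production", "rejected"},
    "in-production":  set(),
    "rejected":       set(),
}

class Patch:
    def __init__(self, patch_id):
        self.patch_id = patch_id
        self.state = "open"
        self.history = ["open"]  # what a dashboard would display

    def move_to(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)

p = Patch(1234)
p.move_to("with-certifier")
p.move_to("certified")
p.move_to("in-pps")
print(p.history)  # ['open', 'with-certifier', 'certified', 'in-pps']
```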

28 Effort
Work areas: integration, configuration, testing & certification, release management. Coordinated by CERN, with 10 partner institutes and ~30 FTEs.

29 Integration testing: deployment tests
– Developers sometimes produce rpms that conflict with existing rpms (gLite or system)
– An update affects production node types, which are updated with the produced rpms
– Deployment tests are available and can be launched by the developer before handing the rpms over to certification (see the sketch below)
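A minimal sketch of a developer-launched deployment check, assuming the rpms are available locally. It uses rpm's real `--test` flag, which dry-runs the transaction and reports conflicts and unmet dependencies without installing anything; the file name in the example is hypothetical.

```python
# Dry-run an rpm upgrade to catch conflicts with installed packages
# before the rpms go to certification. Must run on a node whose rpm
# database reflects the target node type.
import subprocess, sys

def deployment_test(rpm_paths):
    """Return True if the rpms would install cleanly (no conflicts)."""
    result = subprocess.run(["rpm", "-U", "--test"] + list(rpm_paths),
                            capture_output=True, text=True)
    if result.returncode != 0:
        print("conflict/dependency problems:\n" + result.stderr, file=sys.stderr)
    return result.returncode == 0

# Example (hypothetical file name):
# deployment_test(["glite-new-component-1.0-1.noarch.rpm"])
```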

30 Integration testing: deployment test issues
– We provide a repository, rpm lists and tarballs (for certain services)
– Sites install/update the middleware differently: yum, fabric management tools, …
– It is difficult to "test" all deployment scenarios: sites and regions customize their installation and configuration procedures
– The base OS version is frequently updated, independently of the middleware

31 Integration testing: configuration tests
Grid services are configured with YAIM (YAIM Ain't an Installation Manager):
– A modular bash shell script: >37,000 lines, >30 modules
– The configuration is tested after changes to either the middleware or YAIM (see the sketch below)
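A minimal sketch of such a configuration test: rerun YAIM for a node type after a middleware or YAIM change and check the exit status. The yaim path, flags, and node-type name follow common gLite documentation but are stated here as assumptions; adapt them to the local installation.

```python
# Re-run YAIM configuration for one node type and report success/failure.
import subprocess

def configure_node(node_type, site_info="site-info.def",
                   yaim="/opt/glite/yaim/bin/yaim"):  # assumed install path
    """Return True if YAIM configured the node type without errors."""
    result = subprocess.run([yaim, "-c", "-s", site_info, "-n", node_type])
    return result.returncode == 0

# Example (node-type name assumed): configure_node("glite-WN")
```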

32 System testing
Services have to be tested against a grid – but what version should we test against? The production service is not homogeneous, and one patch may affect several node types. For every node type we have a list of tests that have to be run (see the sketch below). Regression tests are available and evolving.
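A minimal sketch of the per-node-type test lists: a patch declares which node types it touches, and certification collects every registered test for those types. The node types and test names are invented for illustration.

```python
# Map node types to their mandatory certification tests (names invented).
TESTS_BY_NODE_TYPE = {
    "WN": ["test-wn-install", "test-job-environment"],
    "CE": ["test-ce-install", "test-job-submission"],
    "SE": ["test-se-install", "test-file-transfer"],
}

def tests_for_patch(affected_node_types):
    """Every test that must pass before the patch can be certified."""
    return [t for nt in affected_node_types for t in TESTS_BY_NODE_TYPE[nt]]

print(tests_for_patch(["CE", "WN"]))
```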

33 Acceptance testing: Pre-Production Service (PPS)
– ~20 sites, several hundred nodes
– Provides interested users with preview access to grid services
– Evaluates deployment procedures, interoperability, and basic functionality of the software against operational scenarios reflecting real production conditions
– After certification, patches go to the PPS before being released to production; time spent in the PPS: 1–2 weeks

34 Acceptance testing
It is difficult to convince users to try out services before they are released to production, and production grid conditions cannot be fully replicated (the size of the grid; file catalogs with millions of entries).
– Early life support: dedicated sites install certain services immediately after release to production, with a well-defined rollback procedure in case of problems (see the sketch below)
– Pilot services: previews of a new (version of a) service that users can (stress) test with typical production workloads, giving quick feedback to developers
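A minimal sketch of a rollback step for early life support, assuming the previously released rpms are kept on the node. It relies on rpm's real `--oldpackage` flag, which permits downgrading; the file name in the example is hypothetical.

```python
# Downgrade to the rpms that were in production before the update.
import subprocess

def rollback(previous_rpms):
    """Return True if the downgrade transaction succeeded."""
    result = subprocess.run(["rpm", "-U", "--oldpackage"] + list(previous_rpms))
    return result.returncode == 0

# Example (hypothetical file name):
# rollback(["glite-component-1.0-1.noarch.rpm"])
```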

35 Test process
Tailored to our environment: people in different locations are involved, independent in their work habits and infrastructure. We use open source tools and stick to the lowest common denominator.

36 Test writing
The biggest challenge is to get tests written at all. The learning curve for grid services is steep, so we maintain lists of expertise; it is also difficult to get realistic use cases. We keep things simple so authors can focus on test writing (a skeleton follows below):
– Each test belongs to one defined test category: installation, functionality, etc.
– Test scripts may use Bash, Python or Perl
– Tests can be executed as a command – this ensures integration into different frameworks
– Tests must be fully configurable
– Focus on the test script, not on its integration into a framework
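A skeleton following the conventions above: the test is a plain command, fully configurable (here via environment variables), and reports its verdict through the exit code so any framework – or a human at a shell – can drive it. The variable names are examples, not a mandated interface.

```python
#!/usr/bin/env python
# Minimal test-script skeleton: configurable, runnable as a command,
# exit code 0 = PASS, non-zero = FAIL.
import os, sys

def main():
    # All knobs come from the environment so no framework is required.
    ce_host = os.environ.get("TEST_CE_HOST", "ce.example.org")  # assumed knob
    ok = True  # replace with the real check against ce_host
    print(("PASS" if ok else "FAIL") + f": basic check against {ce_host}")
    return 0 if ok else 1

if __name__ == "__main__":
    sys.exit(main())
```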

37 Available tests and checklists are documented.

38 Test framework
Testing requires a grid. The ideal would be to bring up a complete grid with one click, with well defined versions of the nodes – but installing grid nodes is non-trivial. Our pragmatic approach:
– CERN provides a certification testbed: a complete, self-contained grid providing all services
– Certifiers install the nodes they need to test and integrate them into the testbed
– Heavy use of virtualization: we developed our own tools to create customized images and a VM management framework (Xen based); see the sketch below
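A minimal sketch of instantiating certification nodes from prepared images, in the spirit of the home-grown Xen-based tools mentioned above. The config directory layout is hypothetical and Xen's `xm create` is assumed as the underlying command; the real framework was custom-built.

```python
# Boot a pre-built VM image for one grid node type on a Xen host.
import subprocess

def start_test_node(node_type, cfg_dir="/etc/xen/certification"):  # assumed layout
    """Return True if the VM for the given node type started."""
    cfg = f"{cfg_dir}/{node_type}.cfg"
    return subprocess.run(["xm", "create", cfg]).returncode == 0

# Example (node-type name assumed): start_test_node("glite-CE")
```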

39 Test framework
Don't let the framework distract you from doing tests! We tried complex test frameworks that execute tests, store and display results, and record information about the test setup. Our pragmatic approach:
– Test data and results are stored with the patch in the patch & bug tracking tool (Savannah)
– Tests are simple scripts that can be used by anybody

40 Experience
We are victims of our own success: we moved prototypes into production very early, and with production users we can only evolve slowly (standards). Software life cycle management has to change with the project's maturity:
– Before 2006, focus on functionality: big-bang releases, large dedicated testbeds, a central team
– 2006–2008, managing diversity and scale, reactive: fast release cycles, deployment scenarios via the PPS, pilot services using production, a strong central team & distributed teams

41 Future
Components will be developed more independently, and the process has to reflect this:
– Decentralized approach: tests follow an agreed process and can be run everywhere
– More problems are found at full scale in production: focus on pilots and staged rollout, improved "undo" (rollback)
– Deployment tests move to the sites: there are too many different setups to handle in one place

42 If we could start again…
– Expectation management: software developers and users have to better understand the limitations of testing
– Enforce unit and basic tests to be provided by the software producers: software is often rejected for trivial reasons, which is very inefficient
– Avoid an overambitious Pre-Production Service: limited gain
– Enforce control over dependencies from the start
– Add process monitoring earlier in the project

