EGEE is a project funded by the European Union under contract IST-2003-508833 Experiment Software Installation in the LHC Computing Grid www.eu-egee.org.


1 EGEE is a project funded by the European Union under contract IST-2003-508833. Experiment Software Installation in the LHC Computing Grid, www.eu-egee.org. e-Science 2005, December 8th, Roberto Santinelli

2 Experiment Software Installation: Outline
- The problem
 - Bird's-eye view
 - Requirements from the site administrators
 - Requirements from the experiments
- The general mechanism (since LCG phase 1)
 - Limits of the LCG-1 mechanism
- Tank&Spark
- Experimental results from INFN activity

3 Experiment Software Installation The Problem

4 Experiment Software Installation How can I run my private application in a world-wide Grid? How is a user guaranteed that the software he/she needs is in fact installed on a remote resource where the user might not even have an account?

5 Experiment Software Installation: The problem. The middleware software is often installed and configured by system administrators at sites via customized tools such as YAIM (in the past LCFGng) or Quattor (but also Rocks), which also provide centralized management of the entire computing facility (root access required), while gigabytes of Virtual Organization (VO) specific software need to be installed and managed at each site too.

6 Experiment Software Installation: The problem. The difficulty here is mainly connected to the existence of disparate ways of installing, configuring and validating experiment software. Each VO (LHC experiments and not only) wants to maintain its own proprietary method for distributing software on LCG (scram, pacman, rpm, tarball and so on). Furthermore, the system for accomplishing this task might change with time even inside the same VO. Site admins might not be familiar with all VO software requirements and release procedures. HEP application software is continuously upgraded (HF).

7 Experiment Software Installation Concerns from site administrators  No daemon running on the WNs: the WN belongs to the site and not to the GRID  No root privileges for VO software installation on WN.  Traceable access to the site.  In/Outbound connectivity to be avoided.  Site policies applied for external action triggering.  Strong authentication required  No shared file system area should be assumed to serve the experiment software area at the site.

8 Experiment Software Installation: Requirements from the Experiment
- Only special users are allowed to install software at a site.
- The experiment should be free to install software at a site whenever it wants.
- Management of simultaneous and concurrent installations should be in place.
- Installation/validation/removal process failure recovery.
- Publication of relevant information for the software.
- Installation/validation/removal of software should be possible not only through a Grid job but also via a direct service request, in order to assure the right precedence with respect to "normal" jobs.
- Validation of the installed software could be done in different steps and at different times with respect to the installation.
- The software must be "relocatable".
- It is up to the experiment to provide scripts for installation/validation/removal of software (freedom).

9 Experiment Software Installation "In order to comply with these requirements and concerns, since LCG-1 a general mechanism relying upon a set of key features has been adopted." General Mechanism

10 Experiment Software Installation: General LCG-1 Mechanism: the VO_<VO>_SW_AREA. Sites can choose between providing VO space in a shared file system or allowing local software installation on a WN. This is indicated through the local environment variable VO_<VO>_SW_DIR. If it is equal to "." (meaning, in Linux, the current path), there is no file system shared among the WNs and everything is installed locally, in the working directory, and later scratched by the local batch system (to avoid exhausting disk space on the node). Vice versa, the software can be installed permanently in the shared area and made always accessible by all WNs.
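The convention above can be sketched as a small decision helper; the function name and return shape are hypothetical, only the meaning of VO_<VO>_SW_DIR follows the text:

```python
import os

def software_area(vo):
    """Decide where VO software lives on this worker node (illustrative sketch).

    Per the convention described above: VO_<VO>_SW_DIR equal to "." means no
    shared file system, so software goes into the job's working directory and
    is scratched by the batch system afterwards; any other value points at a
    permanent shared area visible from all WNs.
    """
    sw_dir = os.environ.get("VO_%s_SW_DIR" % vo.upper(), ".")
    if sw_dir == ".":
        return os.getcwd(), False   # local, temporary installation
    return sw_dir, True             # permanent installation on the shared area
```

A job wrapper would call this once and either unpack into the returned path or fall back to a scratch install.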

11 Experiment Software Installation: General LCG-1 Mechanism: the ESM. Only a few people in a given collaboration are eligible to install software on behalf of the VO. These special users have write access to the VO space (in the case of a shared file system) and privileges to publish the tag in the Information System. The VO manager creates the software administrator group; people belonging to this group are mapped to a special administrative VO account in the local grid-mapfile at all sites.

12 Experiment Software Installation: General LCG-1 Mechanism: tag publication. For sites on which the installation/certification has run successfully, the ESM adds to the site's InfoSys a tag indicating that this version of the VO software has passed the tests. This information allows real user jobs requiring a given version of the VO software to be steered to validated sites.
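As an illustration of what such a tag looks like, a minimal sketch assuming the common LCG convention of prefixing published software tags with "VO-<vo>-"; the helper name and the exact regex are assumptions, not the real tool:

```python
import re

# Assumed tag convention: "VO-<vo>-<software>-<version>", lowercase VO name,
# so that user jobs can match on the tag in their JDL requirements.
TAG_RE = re.compile(r"^VO-[a-z0-9]+-[A-Za-z0-9][A-Za-z0-9._-]*$")

def make_tag(vo, software, version):
    """Build a software tag for publication in the site Information System."""
    tag = "VO-%s-%s-%s" % (vo.lower(), software, version)
    if not TAG_RE.match(tag):
        raise ValueError("malformed software tag: " + tag)
    return tag
```

In the real system the tag string is what lcg-ManageVOTag adds to or removes from the InfoSys.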

13 Experiment Software Installation: General LCG-1 Mechanism: the workflow.
1. The ESM checks the available resources and decides where the new release must be installed.
2. The ESM creates one or more tarballs containing the scripts to install (and later validate) the software, plus bundles of the experiment software to be installed.
3. The ESM runs a "special lcg-tool" to install software on a per-site basis by submitting "normal grid jobs" explicitly to these sites.
4. The ESM runs further verification steps by submitting other grid jobs that certify the previously installed software.
5. The ESM finally publishes a tag on the site's Information System. These tags indicate that the software is available to subsequent jobs from "normal users".
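The ordering logic of steps 3-5 can be sketched as a driver loop. Grid job submission and tag publication are injected as callables, since the real tools are out of scope here; every name in this sketch is hypothetical:

```python
def release_software(sites, install, validate, publish):
    """Drive the per-site workflow: install everywhere, validate what
    installed, publish the tag only where validation succeeded.

    install/validate/publish stand in for grid-job submission and the
    tag-publication tool; each returns/performs per-site actions.
    """
    installed = [s for s in sites if install(s)]       # step 3: install jobs
    validated = [s for s in installed if validate(s)]  # step 4: validation jobs
    for site in validated:                             # step 5: publish tags
        publish(site)
    return validated
```

The key property shown is that a tag is never published for a site where installation or validation failed.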

14 Experiment Software Installation How does this general mechanism behave with respect to the initial list of requirements and concerns?

15 Experiment Software Installation Limits: from site administrators concerns point of view  No daemon running on the WNs: the WN belongs to the site and not to the GRID  No root privileges for VO software installation on WN.  In/Outbound connectivity to be avoided.  Strong authentication required  Site policies applied for external action triggering.  No shared file system area should be assumed to serve the experiment software area at the site.  Traceable access to the site.

16 Experiment Software Installation Limits: from the Experiments' requirements point of view  Only special users are allowed to install software at a site  Installation/validation/removal process failure recovery  Publication of relevant information for the software.  It is up to the experiment to provide scripts for installation/validation/removal of software (freedom)  The experiment needs to install its own software very frequently: it should be free to install software at a site whenever it wants.  Management of simultaneous and concurrent installations should be in place  Installation/validation/removal of software should be possible not only through a Grid job but also via a direct service request, in order to assure the right precedence with respect to "normal" jobs (special priority)  Validation of the installed software could be done in different steps and at different times with respect to the installation. THE SOFTWARE MUST BE PROPAGATED TO ALL WNs WHENEVER THERE IS NO SHARED FILE SYSTEM

17 Experiment Software Installation "In order to FULLY comply with these requirements, LCG moved from a simple set of scripts to a structured service: Tank and Spark, integrated into this general framework through a multilayered toolkit." Tank & Spark

18 Experiment Software Installation: Structure of the Experiment Software Installation toolkit. Components: lcg-asis (UI), lcg-ManageSoftware and lcg-ManageVOTag (WN), Tank&Spark (CE/WN), gssklog.
lcg-asis: a friendly user interface (on the UI) that simplifies the ESM's life (remember the wizard steering you at the beginning of this talk?). It uploads (if specified) the sources of the software to the grid (rpms, archives and so on). It loops over all sites available to the VO (matching requirements provided by the ESM in terms of CPU, memory, disk space and CPU time) and for each site:
 - Checks through the Information System whether another software management process is running on the site
 - Automatically creates the JDL for that site
 - Submits the jobs and stores job information
lcg-ManageSoftware:
1. The steering script invoked on the WN for installing/removing/validating application software.
2. It checks the local environment and decides the workflow to be performed:
 1. Is some other process running for the same software and VO?
 2. Is there a shared file system or not?
 3. Is it an AFS file system or not (conversion of GSI credentials to AFS tokens)?
 4. Is Tank&Spark installed?
3. It downloads the sources of the software to be installed (if specified) to the local WN (bypassing the outbound connectivity requirement through the local SE).
4. It invokes the experiment-specific script (provided by the ESM) and checks its result.
5. It creates a temporary directory used to install/validate/remove the software and later cleans it up.
6. It publishes TAGs on the Information System with a flavor that depends on the action being performed and its result (monitoring of the process: remember the scroll bar at the beginning of this talk?).
lcg-ManageVOTag: the module used by lcg-ManageSoftware to add/list/remove tags published on the Information System.
Tank&Spark: mainly used to propagate software to other WNs.
- Installation bypassing grid job submission
- High prioritization of software management
- Tracks who is installing what
- Complies with external policies set by the site administrators
- Can manage all possible file system topologies (shared, non-shared, AFS, or a mix of them!)
- Asynchronous (currently) and synchronous installation
- Failure recovery capabilities (retry of the installation on the node)
- Possible roll-back of a given installation (not in place yet)
- Exhaustive notification about the installation process (node by node) to the ESM and (automatically) to the site admin (remember the final wizard at the beginning of this talk?)
- Keeps information about a given software: internally identified through GUIDs; date, size, owner, path, status, dependencies and so on
- Automatic farm management: Is a node out? Is a new node there? Is the file system up and running?
- Rough monitoring of the installation progress (using the Information System)
- SSL-embedded synchronization, in case the SE is outside the site firewall
gssklog: allows the conversion of GSI credentials into valid KRB5 AFS tokens in case the site provides the shared area through AFS. For more information visit http://grid-deployment.web.cern.ch/grid-deployment/eis/docs/configuration_of_tankspark

19 Experiment Software Installation: Tank & Spark. It is a client-server application written entirely in C++, consisting of three components:
- Tank: a multithreaded (gSOAP-based) service (running on the CE) listening for GSI-authenticated (and non-authenticated) connections, interfaced to a MySQL back-end.
- Spark: a client application running on each WN, which talks to Tank for retrieving/inserting/deleting software information and triggering synchronization.
- An rsync server running on another machine (an SE, for instance) and acting as central repository of the VO software.
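The per-software information Tank keeps (GUID, date, size, owner, path, status, dependencies, as listed on the previous slide) can be sketched as a record; all field names and defaults here are assumptions for illustration, not the actual MySQL schema:

```python
import uuid
import datetime
from dataclasses import dataclass, field

@dataclass
class SoftwareRecord:
    """Illustrative shape of a Tank database entry for one installation."""
    name: str
    version: str
    owner: str                    # e.g. DN of the ESM who installed it
    path: str                     # location under the VO software area
    size_bytes: int
    status: str = "installing"    # assumed states: installing/installed/validated
    dependencies: list = field(default_factory=list)
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))
    date: str = field(default_factory=lambda: datetime.date.today().isoformat())
```

The GUID gives each installation a stable identity across retries and roll-backs, independent of its tag or path.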

20 Tank & Spark workflow. (Diagram: ESM, CE with TANK, WNs, SE behind the site firewall; a software package labeled "c" is propagated to nodes already holding "a" and "b".)
1. A JDL installation job from the ESM arrives on the CE.
2. The ESM request ends up on a WN, which becomes the SPARK node. The software (here labeled "c") is installed locally through the middle layer lcg-ManageSoftware. A pre-validation is highly recommended before triggering the propagation. The Information System is updated.
3. The Spark client program is called. The delegated credentials of the ESM are checked by TANK. SPARK asks for a software tag registration in the TANK central DB.
4. TANK registers the new tag and synchronizes, through rsync, the new directory created on the SPARK node into a central repository.
5. TANK is contacted by all WNs, one at a time. External conditions are checked; special site policies can be taken into account.
6. Local installation on the WNs is triggered. No authentication is required: each WN trusts TANK.
7. At the end of the whole process TANK e-mails the ESM with the result of the installation; the Information System is updated according to the result of the process.
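The registration-and-propagation part of the sequence above can be modeled as a toy in-memory simulation; the classes and method names are illustrative, not the real Tank/Spark API, and dictionaries stand in for rsync'd directories:

```python
class WorkerNode:
    """A WN whose 'disk' is just a dict of installed files."""
    def __init__(self):
        self.disk = {}

class Tank:
    """Toy Tank: registers the tag, syncs Spark -> SE repository -> other WNs."""
    def __init__(self, worker_nodes):
        self.worker_nodes = worker_nodes
        self.repository = {}   # stands in for the rsync server on the SE
        self.tags = set()

    def register(self, tag, spark_node):
        self.tags.add(tag)                            # tag registered in central DB
        self.repository[tag] = dict(spark_node.disk)  # rsync Spark node -> SE
        for wn in self.worker_nodes:                  # contact WNs one at a time
            if wn is not spark_node:
                wn.disk.update(self.repository[tag])  # rsync SE -> WN
```

The point of the model is the fan-out: one local installation on the Spark node, one sync to the central repository, then sequential distribution to every other WN.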

21 Experiment Software Installation Experimental results

22 Experiment Software Installation: Experimental Results: Goals
- Checking all functionalities of the service (Pisa test)
- Installing and synchronizing ~3 GB of real application software (ATLAS) among 50 machines (Padova test)
 - Reproducing the same installation test several times, proving T&S suitable for the typical HEP use case (reliability test)
 - Testing the removal capability of Tank and Spark
- Testing the system in a production environment
 - Compatibility with the existing production infrastructure (effect on CE load, data safety and integrity)
 - Non-expert users
- Proving the scalability of the system up to O(100) machines and its stability over time (Legnaro test)

23 Experiment Software Installation: Experimental results: Pisa test. Installing a (fake) package tagged CMSIM-145.1.2-0 (50% of machines with a shared area, 50% with local disk). (Screenshots: the Spark WN at the beginning of the process; the Tank DB at the end; another WN at the end; the CE.)

24 Experiment Software Installation: Experimental Results: Padova test. …after ~3 hours… (Screenshot: the Storage Element / rsync server; CPU = 3 GHz, Mem = 2 GB.)

25 Experiment Software Installation: Experimental results: Padova tests. 50 WNs, 3 GHz dual Xeon, 2 GB of memory, Fast Ethernet connected. (Screenshots: a WN of Padova; CE load: Tank doesn't affect the CPU, what is shown is normal CE activity!)

26 Experiment Software Installation: Experimental results: Legnaro Tests. O(100) dual Xeon machines (2.4-3.0 GHz) with 2 GB of RAM, configured with a shared file system; similar hardware for the CE and the SE. The farm is connected to a Gigabit Ethernet switch. The Tank server ran for more than 30 days under the same load without any watchdog intervention: very stable. The Tank server showed no memory problems, its consumption staying stable at 9 MB with only 0.1% of CPU used (smaller than an emacs process!). We tested the CPU load of the rsync server in the critical checksumming operation for the Geant4 software (1.3 GB): rsync scales. The system scaled up to O(100) machines without any problem!

27 Experiment Software Installation: Conclusion. Tank & Spark has been deployed on LCG since last year. It offers a versatile system for propagating gigabytes of experiment-specific software to LCG computing centres and tackles all the limits of the previous mechanism. It copes with the overwhelming majority of site admin and end-user requirements. It adds extra features whose need only emerged from preliminary tests and close work with the HEP experiments. It has proved reliable and robust, and its scalability has been demonstrated up to farms of O(100) machines with 3 GB of software successfully synchronized. The next step will consist of performing the same test against O(1000)-machine computing centres.

28 Experiment Software Installation: Acknowledgements. Special thanks to the following Italian site administrators (within the INFN collaboration) who made these tests possible: Tommaso Boccali (PISA), Sergio Fantinel and Massimo Biasotto (LEGNARO), Rosario Turrisi (PADOVA).

