GRID workload management system and CMS fall production Massimo Sgaravatto INFN Padova.
What do we want to implement (simplified design)
[Diagram: a Master submits jobs via Condor-G (condor_submit, Globus Universe) and globusrun to Globus GRAM gatekeepers at Site1, Site2, Site3, each fronting a local resource management system (Condor, LSF, ...); a Grid Information Service (GIS) feeds resource discovery]
- Jobs are submitted (described using ClassAds) to the Master
- The Master performs resource discovery: it queries the Grid Information Service (GIS) for information on the characteristics and status of the local resources, and chooses the Globus resources (farms) where the jobs must be submitted
- Globus GRAM acts as a uniform interface to the different local resource management systems on the farms
- Condor-G provides reliability
- Condor tools are used for job monitoring, logging, ...
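The Condor-G leg of this design can be sketched as a Globus-universe submit description; the gatekeeper contact string, executable, and file names below are illustrative assumptions, not the actual CMS configuration:

```
# Hypothetical Condor-G submit description (Globus universe).
# Host names and paths are assumptions for illustration only.
universe        = globus
globusscheduler = gatekeeper.site1.example.org/jobmanager-lsf
executable      = /home/cms/bin/cmsim
arguments       = run01.input
output          = cmsim_run01.out
error           = cmsim_run01.err
log             = cmsim_run01.log
queue
```

Submitted with condor_submit, this is forwarded by Condor-G to the Globus GRAM gatekeeper, which in turn hands the job to the site's local resource management system (here LSF); the Condor log file is what the usual Condor monitoring tools read.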
What can be implemented now
[Diagram: the same architecture, but without the Master; jobs go directly from condor_submit (Globus Universe) / Condor-G to the Globus GRAM gatekeepers at Site1, Site2, Site3]
- Globus GRAM as a uniform interface to the different local resource management systems (Condor, LSF, ...) on the farms
- Condor-G provides reliability
- Condor tools are used for job monitoring, logging, ...
- The Grid Information Service (GIS) publishes information on the characteristics and status of the local resources, but it is not very useful in this model, since the user must choose the target resource by hand
Status
- Tests on the basic capabilities and functionality have been performed
- Problems with scalability and fault tolerance were found
- The CMS production is a useful exercise to test everything with real applications in a real environment
CMS production
- Application: Pythia + Cmsim ("traditional" applications)
- Overview:
  - Job management (submission, monitoring) from a single machine, using Condor tools
  - The user must explicitly define the Globus resource (farm) where the jobs must be submitted
  - The applications and the input files must be stored in the file system of the executing machine
  - The output files are created in the file system of the executing machine
  - We can try to have just the standard output/error files (useful to check the "status" of the production) created on the submitting machine, using Bypass and/or Globus GASS
  - CMS wants to test Bypass as a second step
Bypass vs. GASS
- Bypass
  - Written by Douglas Thain (Condor team)
  - Redirects the standard input/output/error of a program to a remote machine while the program is running
  - Can be used with dynamically linked programs
  - Successfully tested with Pythia
  - Uses the Globus Security Infrastructure
- Globus GASS
  - Can copy the input file to the remote machine before the execution, and bring the output file back after the execution (otherwise it is necessary to modify the source code)
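The GASS option can also be expressed directly in Globus RSL, staging stdout/stderr back through a GASS server on the submitting machine. A sketch, in which the gatekeeper contact, GASS URL, host names, and file names are all assumptions:

```
# Hypothetical RSL fragment for globusrun; the contact string,
# GASS server URL, and paths are illustrative assumptions.
&(executable=/home/cms/bin/pythia)
 (arguments=run01.card)
 (stdout=https://submit.pd.infn.it:20000/home/cms/logs/pythia_run01.out)
 (stderr=https://submit.pd.infn.it:20000/home/cms/logs/pythia_run01.err)
```

In the Condor-G route the same effect is obtained simply by setting the output and error attributes in the submit description, with Condor-G handling the GASS transfer.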
What is necessary
- Local farms with a shared file system between the various nodes
  - Done using the CMS installation toolkit
  - Installation and support: responsibility of CMS/local administrators
- Installation of the CMS environment on these farms
  - Done using the CMS installation toolkit
  - Support: responsibility of CMS
What is necessary
- A local resource management system to manage the local farm:
  - LSF
    - Installation and support: responsibility of CMS/local administrators
    - We should define in a "common" way how to configure the queue(s) where the jobs run
  - Local Condor pool
    - Installation and configuration (for "dedicated" machines) using the CMS toolkit
    - Support: ???
  - PBS
    - Are there sites where PBS will be used???
    - Tests on Condor-G + Globus + PBS not performed yet
  - Fork
    - Strongly discouraged (even for a single machine)
    - Globus would have to be installed on each machine
    - Job queuing would be up to the production manager
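For the LSF case, the "common" queue configuration could be agreed on as a small lsb.queues fragment shared across sites; the queue name and parameters below are assumptions, not an agreed CMS convention:

```
# Hypothetical lsb.queues entry for a common CMS production queue.
# Queue name, priority, and limits are illustrative assumptions.
Begin Queue
QUEUE_NAME   = cms_prod
PRIORITY     = 30
DESCRIPTION  = CMS fall production queue (jobs arrive via jobmanager-lsf)
End Queue
```

Agreeing on one queue name would also let the Globus GRAM jobmanager-lsf configuration be identical on every farm.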
What is necessary
- Globus
  - One installation per farm (on a "visible" node)
  - Use of personal certificates and host certificates signed by the INFN CA
    - User certificates signed by the Globus CA are accepted as well
    - By default it is not possible to "use" Globus resources outside INFN with personal certificates signed by the INFN CA
      - Workaround 1: users also hold personal certificates signed by the Globus CA
      - Workaround 2: a "small" modification in the Globus configuration of the resources outside INFN so that they accept "our" certificates too
  - Installation
    - Done by CMS/local administrators/a WP1 member (if present), using the distribution and procedures provided by the INFN GRID release team (http://www.pi.infn.it/GRID/GRID_INST_1.1.html)
    - In case of problems: firstname.lastname@example.org@infn.it
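Workaround 2 amounts to two small changes in the remote site's grid-security configuration: trusting the INFN CA and mapping the INFN-signed subjects to local accounts. A sketch, in which the distinguished name, local account, and hash values are assumptions:

```
# Hypothetical /etc/grid-security/grid-mapfile entry:
# map an INFN-signed certificate subject to a local account.
"/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Some User" cms001

# The INFN CA certificate and its signing policy must also be installed
# in the trusted-CA directory (hashed filenames are illustrative):
#   /etc/grid-security/certificates/<hash>.0
#   /etc/grid-security/certificates/<hash>.signing_policy
```

With these in place the remote gatekeeper authorizes INFN-signed credentials without requiring users to obtain a second certificate from the Globus CA.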
What is necessary
- Condor-G
  - Just one installation, used by the production manager (Ivano Lippi?)
  - Installation and maintenance: Massimo Sgaravatto???
- Scripts to run the CMS production in this GRID environment
  - Responsibility of CMS
- Tools to "monitor" the production
  - condor_q
  - Condor Job Viewer (Java GUI)
- Running the production
  - Responsibility of the production manager
- Some items/actors missing???
- When???
- Relations with other activities???
- Data Management (GDMP, ...)???