GRID Workload Management System. Massimo Sgaravatto, INFN Padova.
What do we want to implement (simplified design). [Diagram: jobs are submitted (using ClassAds) to a Master; Master, condor_submit (Globus Universe), Condor-G; Condor-G, globusrun, Globus GRAM at Site1, Site2, Site3; each Globus GRAM in front of a local resource management system (Condor, LSF, …) and its farm; Grid Information Service (GIS) used for resource discovery.] The GIS holds information on the characteristics and status of local resources. Globus GRAM is the uniform interface to the different local resource management systems. Condor-G is able to provide reliability. Condor tools are used for job monitoring, logging, … The Master chooses the Globus resources to which the jobs must be submitted.
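The role of the Master in the design above can be illustrated with a minimal sketch. It is not the Master design, which is still to be defined; the `Resource` and `Job` fields, the selection rule (most free CPUs among matching resources), and the contact strings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    """Hypothetical view of one Globus resource as published in the GIS."""
    contact: str    # GRAM contact string (hypothetical values below)
    lrms: str       # local resource management system behind GRAM
    os: str
    free_cpus: int


@dataclass
class Job:
    """Hypothetical job requirements, as they might come from a ClassAd."""
    required_os: str
    min_free_cpus: int


def choose_resource(job: Job, resources: List[Resource]) -> Optional[Resource]:
    """Return the matching resource with the most free CPUs, or None."""
    candidates = [
        r for r in resources
        if r.os == job.required_os and r.free_cpus >= job.min_free_cpus
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.free_cpus)


if __name__ == "__main__":
    farms = [
        Resource("site1.example.org/jobmanager-condor", "Condor", "LINUX", 4),
        Resource("site2.example.org/jobmanager-lsf", "LSF", "LINUX", 20),
    ]
    print(choose_resource(Job("LINUX", 8), farms))
```

In release 0 there is no Master, so this choice is made by hand by the user.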
What can be implemented now (GWMS release 0). [Diagram: same layout as the simplified design but without the Master: jobs are submitted with condor_submit (Globus Universe) to Condor-G; Condor-G, globusrun, Globus GRAM at Site1, Site2, Site3; each Globus GRAM in front of a local resource management system (Condor, LSF, …) and its farm; Grid Information Service (GIS). One element of the diagram carries the note "Not very useful in this model".] The GIS holds information on the characteristics and status of local resources. Globus GRAM is the uniform interface to the different local resource management systems. Condor-G is able to provide reliability. Condor tools are used for job monitoring, logging, …
Overview. Job management (submission, monitoring) is done from a single machine using Condor tools. The user must explicitly define the Globus resource (the farm) to which the jobs are submitted. The applications and the input files must be stored in the file system of the executing machine. The output files will be created in the file system of the executing machine. We can try to have just the standard input and/or output and/or error files (useful to check the "status" of the production) on the submitting machine, using Bypass and/or Globus GASS.
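As a minimal sketch of what explicitly defining the Globus resource looks like, the helper below builds a Condor-G submit description file. The helper name, host name and file paths are hypothetical. The submit commands (`universe = globus`, `globusscheduler`) are the ones used by Condor-G releases of this period; later Condor versions use different commands, so check the manual of the installed version.

```python
def condor_g_submit_description(gram_contact: str, executable: str,
                                stdout: str, stderr: str, log: str) -> str:
    """Build the text of a Condor-G submit description file.

    gram_contact names the Globus resource (farm) chosen by the user,
    e.g. "<gatekeeper host>/jobmanager-lsf". The executable path refers to
    the executing machine, since the application is assumed to be installed
    there already (hence transfer_executable = false).
    """
    lines = [
        "universe = globus",
        f"globusscheduler = {gram_contact}",
        f"executable = {executable}",
        "transfer_executable = false",
        f"output = {stdout}",
        f"error = {stderr}",
        f"log = {log}",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical gatekeeper and paths, for illustration only.
    print(condor_g_submit_description(
        "gatekeeper.example.org/jobmanager-lsf",
        "/opt/exp/bin/pythia", "pythia.out", "pythia.err", "pythia.log"))
```

The resulting file would be passed to condor_submit, and the job could then be followed with condor_q.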
Bypass vs. GASS. Bypass: written by Douglas Thain (Condor team). Redirects the standard input/output/error of a program to a remote machine while the program is running. Can be used for dynamically linked programs. Successfully tested with Pythia. Uses the Globus Security Infrastructure. Globus GASS: possibility to copy the input file to the remote machine before the execution, and to get the output file back after the execution (otherwise it is necessary to modify the source code).
Status of GWMS release 0. Tests on basic capabilities and functionalities have been performed. Some tests with real applications (Pythia, CMSIM) have been performed. No "stress" tests have been performed to evaluate scalability, reliability, … Problems with scalability and fault tolerance have been found (Globus jobmanager).
What is necessary for GWMS rel. 0. Local farms with a file system shared between the various nodes. Installation of the proper experiment environment and applications on these farms. A local resource management system to manage the local farm: Fork (strongly discouraged, even for a single machine: Globus must be installed on each machine, and job queuing is left to the production manager); LSF; a local Condor pool; PBS (tests on Globus-PBS interaction must be completed, i.e. farm environment; tests on Condor-G – Globus – PBS not performed yet). Globus: one installation per farm (on a "visible" node), installed using the INFNGRID distribution.
INFNGRID distribution. Done by the INFN GRID release team (F. Donno, A. Sciabà, Z. Xie). Version 1.1 released! Precompiled version for Linux Red Hat 6.1. Scripts that make installation and deployment simpler and more "automatic". Supported local resource management systems: LSF, Condor. Possibility to implement INFN customizations: certificates, "test" GIS architecture. Installation instructions: http://www.pi.infn.it/GRID/GRID_INST_1.1.html
Certificates. Personal certificates and host certificates signed by the INFN CA are used. User certificates signed by the Globus CA are accepted as well. By default it is not possible to "use" Globus resources outside INFN with personal certificates signed by the INFN CA. Is this a problem? Workaround 1: users also have personal certificates signed by the Globus CA. Workaround 2: a "small" modification in the Globus configuration of these resources outside INFN, so that they accept "our" certificates too.
GIS Architecture (test phase). [Diagram: Top Level INFN GIIS (dc=infn, dc=it, o=grid); Bologna GIIS (dc=bo, dc=infn, dc=it, o=grid); Milano GIIS (dc=mi, dc=infn, dc=it, o=grid); INFN ATLAS GIIS (exp=atlas, o=grid); GRIS. Legend: implemented; implemented using INFNGRID distribution; to be implemented.]
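The naming in this hierarchy can be illustrated with a small sketch that checks whether one LDAP-style suffix lies under another. It is a simplified comparison written for this illustration (it ignores escaped commas and other details of distinguished-name syntax), and the function names are hypothetical.

```python
from typing import List, Tuple


def parse_dn(dn: str) -> List[Tuple[str, str]]:
    """Split a DN such as 'dc=bo, dc=infn, dc=it, o=grid' into lowercase
    (attribute, value) pairs. Simplified: no support for escaped commas."""
    pairs = []
    for rdn in dn.split(","):
        attr, value = rdn.split("=", 1)
        pairs.append((attr.strip().lower(), value.strip().lower()))
    return pairs


def is_under(child_dn: str, parent_dn: str) -> bool:
    """True if child_dn is strictly below parent_dn in the naming tree,
    i.e. parent_dn is a proper trailing part of child_dn."""
    child, parent = parse_dn(child_dn), parse_dn(parent_dn)
    return len(child) > len(parent) and child[-len(parent):] == parent


if __name__ == "__main__":
    top = "dc=infn, dc=it, o=grid"
    for dn in ("dc=bo, dc=infn, dc=it, o=grid",
               "dc=mi, dc=infn, dc=it, o=grid",
               "exp=atlas, o=grid"):
        print(dn, "->", is_under(dn, top))
```

With the suffixes shown in the diagram, the Bologna and Milano suffixes fall under the top-level INFN suffix, while the ATLAS suffix sits directly under o=grid.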
INFNGRID distribution: next release. Solaris 2.6. Support of PBS as local resource management system. GDMP. Other work, changes, and bug fixes "triggered" by users/administrators. The relationship with DataGrid needs to be defined!
What is necessary. Condor-G: used by the production manager to submit jobs. Scripts to run productions using this GRID environment. Tools to "monitor" the production: condor_q, Condor Job Viewer (Java GUI).
(Some) next steps. Tests with real applications and real environments: CMS fall production. Fix the problems: Globus jobmanager (who, how, relations with the Globus team, relations with the Condor team?), … GIS – ClassAds converter: Globus team? Master implementation: who, how, …? The default GIS schema must be integrated with other information (the information on characteristics and status of local resources and on jobs is not enough). We need to identify which other information is necessary; this will become much clearer during the Master design. Packaging?
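To make the idea of a GIS – ClassAds converter concrete, here is a minimal sketch. It is not an existing tool: the function name and the attribute names (`GlobusContact`, `OpSys`, `FreeCpus`) are hypothetical, and the input is assumed to be a plain dictionary already read from the GIS. The output uses the simple one-line-per-attribute `Name = value` ClassAd form.

```python
from typing import Dict, Union

Value = Union[str, int, float, bool]


def to_classad(attrs: Dict[str, Value]) -> str:
    """Render a dictionary of resource attributes as ClassAd text,
    one 'Name = value' expression per line.

    Strings are double-quoted (with backslashes and quotes escaped),
    booleans become true/false, numbers are written as they are.
    """
    parts = []
    for name, value in attrs.items():
        if isinstance(value, bool):      # test bool before int: bool is an int
            rendered = "true" if value else "false"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        parts.append(f"{name} = {rendered}")
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical attributes for one farm, for illustration only.
    print(to_classad({
        "GlobusContact": "gatekeeper.example.org/jobmanager-lsf",
        "OpSys": "LINUX",
        "FreeCpus": 12,
    }))
```

Which attributes such a converter should carry depends on the extra information still to be added to the default GIS schema.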