JSS Job Submission Service Massimo Sgaravatto INFN Padova.

1 JSS Job Submission Service Massimo Sgaravatto INFN Padova

2 JSS
- A wrapper of Condor-G has been identified as the JSS for Testbed 1
- Condor-G is a Personal Condor enhanced with Globus services
- Used to submit jobs from the user workstation to remote Globus resources
- Condor-G keeps track of the progress of these jobs

3 Condor-G Architecture
[Diagram: condor_submit, condor_q, and condor_rm; Condor Master; Condor Schedd; Condor GridManager; three Globus resources]
- One GridManager per user

4 Condor-G commands
- condor_submit CondorSubmitFile: submits jobs to a Globus resource
- condor_q {id}: monitors the status of the job(s)
- condor_rm id: removes the job from the queue
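A wrapper such as JSS could drive these three commands from a script. The following is a minimal sketch in Python, assuming the commands are available on the PATH. The helper names (`submit_cmd`, `query_cmd`, `remove_cmd`, `run`) are hypothetical and are not part of Condor-G or JSS.

```python
import subprocess
from typing import List, Optional


def submit_cmd(submit_file: str) -> List[str]:
    """Command line that submits the jobs described in a submit file."""
    return ["condor_submit", submit_file]


def query_cmd(job_id: Optional[str] = None) -> List[str]:
    """Command line that lists the whole queue, or one job if an id is given."""
    return ["condor_q"] + ([job_id] if job_id else [])


def remove_cmd(job_id: str) -> List[str]:
    """Command line that removes a job from the queue."""
    return ["condor_rm", job_id]


def run(cmd: List[str]) -> str:
    """Run a command and return its standard output; raises on a non-zero exit."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command lines; call run(...) on a machine with Condor-G installed.
    print(submit_cmd("myfile"))
    print(query_cmd("12.0"))
    print(remove_cmd("12.0"))
```

Building the command as a list and keeping execution in a separate `run` helper makes the wrapper easy to exercise on a machine without Condor-G.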

5 Example
condor_submit myfile

myfile:
    Universe = globus
    TransferExecutable = True
    Executable = /home/userx/startsim.sh
    TransferInput = True
    Input = /home/userx/inp.$(Process)
    TransferOutput = False
    Output = /data/out.$(Process)
    TransferError = True
    Error = /home/userx/error.$(Process)
    Environment = CMSVER=118
    Log = /home/userx/log.$(Process)
    Arguments = 123
    GlobusRSL = (queue=cmsprod)
    GlobusScheduler = pcmsfarm01.pi.infn.it/jobmanager-lsf
    Queue 10
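A submit file like this one can also be generated by a program, which is one way a wrapper might prepare jobs. The sketch below is illustrative only: `render_submit_file` and `SETTINGS` are hypothetical names, and only a subset of the slide's attributes is included.

```python
from typing import List, Tuple


def render_submit_file(settings: List[Tuple[str, str]], count: int) -> str:
    """Render 'Key = value' lines followed by a 'Queue <count>' statement."""
    lines = [f"{key} = {value}" for key, value in settings]
    lines.append(f"Queue {count}")
    return "\n".join(lines) + "\n"


# Values copied from the slide. Condor expands $(Process) to 0..count-1,
# so each of the queued jobs gets its own input, output, and log file.
SETTINGS = [
    ("Universe", "globus"),
    ("Executable", "/home/userx/startsim.sh"),
    ("Input", "/home/userx/inp.$(Process)"),
    ("Output", "/data/out.$(Process)"),
    ("Log", "/home/userx/log.$(Process)"),
    ("GlobusScheduler", "pcmsfarm01.pi.infn.it/jobmanager-lsf"),
]

if __name__ == "__main__":
    print(render_submit_file(SETTINGS, 10), end="")
```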

6 Condor-G job log file
Info reported:
- When the job has been inserted in the Condor-G queue
  - The IP address of the submitting machine (Condor-G machine)
- When the job has started its execution
  - The host name of the gatekeeper machine to which the job has been submitted (could be different from the actual executing machine)
- When the job has completed its execution
Condor-G relies on both callbacks and polling to create this log file
A library is already available to “parse” this job log file
- Not tested yet
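To make the idea of parsing the job log concrete, here is a small sketch. It is not the library mentioned on the slide. The sample text in `SAMPLE_LOG` is invented for illustration and only loosely modeled on the layout of a Condor user log (a three-digit event code, a job id in parentheses, a date, a time, and a message, with `...` between events); real log contents may differ.

```python
import re
from typing import Dict, List

# Hypothetical sample, written for this sketch; not taken from a real run.
SAMPLE_LOG = """\
000 (012.000.000) 05/21 10:15:02 Job submitted from host: <192.0.2.10:32771>
...
001 (012.000.000) 05/21 10:17:40 Job executing on host: gatekeeper.example.org
...
005 (012.000.000) 05/21 10:42:11 Job terminated.
...
"""

# One event header: code, (cluster.proc.subproc), date, time, message.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\S+) (\S+) (.*)$")


def parse_job_log(text: str) -> List[Dict[str, object]]:
    """Return one dict per event header line; other lines are ignored."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line)
        if match is None:
            continue  # separator ("...") or a continuation line
        code, cluster, proc, _subproc, date, time_, message = match.groups()
        events.append({
            "code": code,
            "cluster": int(cluster),
            "proc": int(proc),
            "when": f"{date} {time_}",
            "message": message,
        })
    return events


if __name__ == "__main__":
    for event in parse_job_log(SAMPLE_LOG):
        print(event["code"], event["when"], event["message"])
```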

7 “Abnormal” events
- The submission to Globus fails
  - Condor-G tries again after 5 minutes
  - This event is reported in the GridManager log file (not in the job log file)
- The gatekeeper can’t be contacted (for an already submitted job)
  - The job remains in the Condor-G queue, and Condor-G tries again later
- The gatekeeper can be contacted, but the job manager can’t be contacted
  - Now: the job is reported as completed with exit status 1
  - Exit status 0 for the “normal” jobs
  - To be enhanced when the new persistent job manager is released (see next slides)
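The retry behaviour for a failed submission can be sketched as a simple loop. This illustrates the idea only and is not Condor-G's implementation. The 5-minute delay comes from the slide; the `max_attempts` limit and the function name `submit_with_retry` are assumptions added for the example.

```python
import time
from typing import Callable

RETRY_DELAY_SECONDS = 5 * 60  # "tries again after 5 minutes" (from the slide)


def submit_with_retry(
    submit: Callable[[], bool],
    max_attempts: int,  # assumption: the slide does not state any limit
    delay: float = RETRY_DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call submit() until it returns True and return the number of attempts.

    Raises RuntimeError if every one of max_attempts attempts fails.
    """
    for attempt in range(1, max_attempts + 1):
        if submit():
            return attempt
        if attempt < max_attempts:
            sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"submission failed after {max_attempts} attempts")


if __name__ == "__main__":
    outcomes = iter([False, False, True])  # fail twice, then succeed
    waits = []
    attempts = submit_with_retry(lambda: next(outcomes), 5, sleep=waits.append)
    print(attempts, waits)
```

Passing `sleep` as a parameter lets the loop be tried out without actually waiting.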

8 Condor-G problems
- Failures in submitting jobs to Globus resources, and the reasons for these failures, are reported in the GridManager log file instead of the job log file
- The log file doesn’t report when the job “arrives” at the Globus resource (i.e. when the job manager is created)
  - It reports only when the job is inserted in the Condor-G queue and when it starts its execution on the Globus resource
- API missing
  - Not possible to be asynchronously notified about job status transitions (i.e. callbacks)

9 Issues not addressed by Condor-G
- Condor-G is not able to discover if a job “disappears” without any exit status and the underlying LRMS is not able to manage the problem
  - In this case Globus reports a “done” callback
  - Do we really have to manage this problem?
- Exit status of jobs
  - Globus doesn’t report the exit status of jobs
- The job status transitions running -> suspended (job transition #5 wrt Cesnet doc) -> running can’t be detected
  - Globus doesn’t detect these transitions
- Expiration of proxy
  - Just a parameter in the Condor-G conf file defining the minimum lifetime of the proxy
- Not possible to move files other than executable/standard input/output/error to or from the executing machines

10 Other issues Proxy

11 Future developments
- Near future (1 month?)
  - Two-phase commit submission protocol
  - Persistent Globus job manager
    - (save_state=yes) when submitting a job
    - (recover=ContactStringOfJobManager) to restart a job manager and “reattach” it to a running job
  - Condor GridManager able to automatically exploit the new job manager
    - Used when Condor-G loses track of a job
- Long term
  - GRAM-2
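The slide does not describe the two-phase commit protocol itself. As a rough illustration of the general idea, the toy sketch below separates "register the job" from "start the job", so that a client that loses a reply can safely repeat the second step. `ToyJobManager` and `submit_two_phase` are hypothetical, in-memory stand-ins and do not use or reproduce the Globus API.

```python
from typing import Dict


class ToyJobManager:
    """In-memory stand-in for a job manager; not the Globus API."""

    def __init__(self) -> None:
        self.jobs: Dict[str, str] = {}  # job id -> state
        self._counter = 0

    def prepare(self, description: str) -> str:
        """Phase 1: record the job but do not start it yet."""
        self._counter += 1
        job_id = f"job-{self._counter}"
        self.jobs[job_id] = "PENDING_COMMIT"
        return job_id

    def commit(self, job_id: str) -> str:
        """Phase 2: start the job; repeating the commit changes nothing."""
        if self.jobs[job_id] == "PENDING_COMMIT":
            self.jobs[job_id] = "RUNNING"
        return self.jobs[job_id]


def submit_two_phase(manager: ToyJobManager, description: str) -> str:
    """Submit a job in two steps and return its id."""
    job_id = manager.prepare(description)
    # A real client would store job_id persistently here, before committing,
    # so that it can resend the commit after a crash or a lost reply.
    manager.commit(job_id)
    return job_id


if __name__ == "__main__":
    manager = ToyJobManager()
    job = submit_two_phase(manager, "startsim.sh")
    print(job, manager.jobs[job])
```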

