JSS Job Submission Service Massimo Sgaravatto INFN Padova.

JSS
- A wrapper of Condor-G has been identified as the JSS for Testbed 1
- Condor-G is a Personal Condor enhanced with Globus services
  - Used to submit jobs from the user workstation to remote Globus resources
  - Condor-G keeps track of the progress of these jobs

Condor-G Architecture
[Diagram: the commands condor_submit, condor_q and condor_rm; the Condor Master, Condor Schedd and Condor GridManager; three Globus resources. One GridManager per user.]

Condor-G commands
- condor_submit CondorSubmitFile
  - To submit jobs to a Globus resource
- condor_q {id}
  - To monitor the status of the job(s)
- condor_rm id
  - To remove the job from the queue

Example
condor_submit myfile

myfile:
    Universe = globus
    TransferExecutable = True
    Executable = /home/userx/startsim.sh
    TransferInput = True
    Input = /home/userx/inp.$(Process)
    TransferOutput = False
    Output = /data/out.$(Process)
    TransferError = True
    Error = /home/userx/error.$(Process)
    Environment = CMSVER=118
    Log = /home/userx/log.$(Process)
    Arguments = 123
    GlobusRSL = (queue=cmsprod)
    GlobusScheduler = pcmsfarm01.pi.infn.it/jobmanager-lsf
    Queue 10
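
The effect of "Queue 10" together with the $(Process) macro can be illustrated with a small Python sketch. It only mimics the substitution (Process takes the values 0 to 9, one per queued job); the function name expand_process is a hypothetical helper for illustration, not part of Condor.

```python
# File-related entries taken from the submit file above.
SUBMIT_TEMPLATE = {
    "Input": "/home/userx/inp.$(Process)",
    "Output": "/data/out.$(Process)",
    "Error": "/home/userx/error.$(Process)",
    "Log": "/home/userx/log.$(Process)",
}


def expand_process(template, count):
    """Return one dict per queued job, with $(Process) replaced by 0..count-1.

    Hypothetical helper: it imitates the macro substitution, nothing more.
    """
    jobs = []
    for process in range(count):
        jobs.append(
            {key: value.replace("$(Process)", str(process))
             for key, value in template.items()}
        )
    return jobs


jobs = expand_process(SUBMIT_TEMPLATE, 10)  # corresponds to "Queue 10"
print(jobs[0]["Input"])   # /home/userx/inp.0
print(jobs[9]["Output"])  # /data/out.9
```

Each of the 10 jobs therefore reads its own input file and writes its own output, error and log files.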

Condor-G job log file
- Information reported:
  - When the job was inserted in the Condor-G queue
  - The IP address of the submitting machine (the Condor-G machine)
  - When the job started its execution
  - The host name of the gatekeeper machine to which the job was submitted (this can differ from the machine that actually executes the job)
  - When the job completed its execution
- Condor-G relies on both callbacks and polling to create this log file
- A library to “parse” this job log file is already available
  - Not tested yet

“Abnormal” events
- The submission to Globus fails
  - Condor-G tries again after 5 minutes
  - This event is reported in the GridManager log file (not in the job log file)
- The gatekeeper can’t be contacted (for an already submitted job)
  - The job remains in the Condor-G queue, and Condor-G tries again later
- The gatekeeper can be contacted, but the job manager can’t be contacted
  - Now: the job is reported as completed with exit status 1
    - Exit status 0 for the “normal” jobs
  - To be enhanced when the new persistent job manager is released (see next slides)
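
The "try again after 5 minutes" behaviour for a failed submission can be sketched as a fixed-delay retry loop. This is a minimal illustration, not Condor-G code: the function submit_with_retry, the use of ConnectionError to signal a failed submission, and the limit of 3 attempts are all assumptions; only the 5-minute delay comes from the slide.

```python
import time

RETRY_DELAY_SECONDS = 5 * 60  # "tries again after 5 minutes"


def submit_with_retry(submit, max_attempts=3, delay=RETRY_DELAY_SECONDS,
                      sleep=time.sleep, log=print):
    """Call submit() until it succeeds, waiting `delay` seconds between tries.

    Hypothetical sketch: failures are written through `log` (standing in for
    the GridManager log file), and max_attempts is an assumed limit.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except ConnectionError as exc:
            log(f"submission attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise
            sleep(delay)


# Demonstration with a fake submit function that fails twice, then succeeds.
# The sleep function is replaced so the demo does not really wait.
calls = {"n": 0}
waits = []
messages = []


def flaky_submit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("gatekeeper unreachable")
    return "submitted"


result = submit_with_retry(flaky_submit, sleep=waits.append, log=messages.append)
print(result, waits)
```

Passing the sleep and log functions as parameters keeps the sketch testable without waiting for real time to pass.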

Condor-G problems
- Failures in submitting jobs to Globus resources, and the reasons for these failures, are reported in the GridManager log file instead of the job log file
- The log file doesn’t report when the job “arrives” at the Globus resource (i.e. when the job manager is created)
  - It reports only when the job is inserted in the Condor-G queue and when it starts its execution on the Globus resource
- API missing
  - Not possible to be asynchronously notified about job status transitions (i.e. callbacks)

Issues not addressed by Condor-G
- Condor-G is not able to discover if a job “disappears” without any exit status and the underlying LRMS is not able to manage the problem
  - In this case Globus reports a “done” callback
  - Do we really have to manage this problem?
- Exit status of jobs
  - Globus doesn’t report the exit status of jobs
- The job status transitions running → suspended (job transition #5 with respect to the Cesnet document) → running can’t be detected
  - Globus doesn’t detect these transitions
- Expiration of proxy
  - Just a parameter in the Condor-G configuration file defining the minimum lifetime of the proxy
- Not possible to move files to or from the executing machines other than the executable and standard input/output/error

Other issues
- Proxy

Future developments
- Near future (1 month?)
  - Two-phase commit submission protocol
  - Persistent Globus job manager
    - (save_state=yes) when submitting a job
    - (recover=ContactStringOfJobManager) to restart a job manager and “reattach” it to a running job
  - Condor GridManager able to automatically exploit the new job manager
    - Used when Condor-G loses track of a job
- Long term
  - GRAM-2
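
The idea behind a two-phase commit submission can be sketched with a toy in-memory model: the job is first registered and a contact string returned, and it only starts once the submitter confirms that it received that contact string. Everything here (the class ToyJobManager, its methods and the contact string format) is hypothetical and is not the Globus protocol or API.

```python
import itertools


class ToyJobManager:
    """In-memory stand-in for a remote job manager (hypothetical, not GRAM)."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.jobs = {}  # contact string -> "pending" or "running"

    def prepare(self, description):
        """Phase 1: register the job but do not start it yet."""
        contact = f"toy-jm-{next(self._counter)}"
        self.jobs[contact] = "pending"
        return contact

    def commit(self, contact):
        """Phase 2: start the job. Repeating a commit is harmless."""
        if contact not in self.jobs:
            raise KeyError(f"unknown job manager contact: {contact}")
        self.jobs[contact] = "running"
        return self.jobs[contact]


def submit(manager, description):
    """Submit in two phases: the job runs only after the commit."""
    contact = manager.prepare(description)
    manager.commit(contact)
    return contact


manager = ToyJobManager()
contact = submit(manager, "executable=/home/userx/startsim.sh")
print(contact, manager.jobs[contact])  # toy-jm-1 running
```

In this model a job whose contact string never reaches the submitter stays "pending" and never runs, and a repeated commit does not start a second copy of the job.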