Basic Grid Projects – Condor Part II Sathish Vadhiyar Sources/Credits: Condor Project web pages
Checkpointing: Checkpointing is used to vacate a job from one workstation and migrate it to another, idle workstation. A Condor checkpoint library is linked with the program's code. The checkpoint library installs a signal handler for the SIGTSTP signal. Checkpoints are stored either on the local disk of the submitting machine or on checkpoint servers. A checkpoint stores the Unix process's state, including the text, stack, and data segments, open files, pointers, etc. Condor also provides periodic checkpointing.
Checkpointing Overview: When the startd daemon detects a policy violation, it sends a signal to the process. The signal handler in the process is invoked and the process state is checkpointed. The checkpoint is sent to the shadow process, which stores it. When a new machine is chosen, the executable and the checkpoint are sent to the remote machine. When the job starts on the remote machine, it detects that it is a restart and reads the checkpoint; some manipulations are done so that the process state at the time of the checkpoint is restored. To the user code, it appears that the process has just returned from the signal handler.
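The flow above can be imitated at a much smaller scale in Python. This is only an analogy, not Condor's checkpoint library: Condor saves the whole process image from inside the signal handler, whereas this sketch saves a small dictionary with pickle at a safe point in the loop. The class name CheckpointedSum, the file name job.ckpt, and the evict_at parameter (which stands in for the startd sending the signal) are all made up for illustration. It assumes a Unix system, Python 3.8 or later, and that it runs in the main thread.

```python
import os
import pickle
import signal
import tempfile


class CheckpointedSum:
    """Toy job that sums 0..n-1 and can checkpoint its progress to a file."""

    def __init__(self, ckpt_path):
        self.ckpt_path = ckpt_path
        self.state = {"next_i": 0, "total": 0}
        self.vacate_requested = False
        # On start-up, detect whether this is a restart and reload the state.
        if os.path.exists(ckpt_path):
            with open(ckpt_path, "rb") as f:
                self.state = pickle.load(f)

    def _on_vacate(self, signum, frame):
        # Only record the request; the state is written at a safe point.
        self.vacate_requested = True

    def run(self, n, evict_at=None):
        """Return the sum, or None if the job checkpointed and vacated."""
        previous = signal.signal(signal.SIGTSTP, self._on_vacate)
        try:
            while self.state["next_i"] < n:
                if self.vacate_requested:
                    with open(self.ckpt_path, "wb") as f:
                        pickle.dump(self.state, f)
                    return None
                i = self.state["next_i"]
                self.state["total"] += i
                self.state["next_i"] = i + 1
                if evict_at is not None and i == evict_at:
                    # Hypothetical stand-in for the startd signalling the job.
                    signal.raise_signal(signal.SIGTSTP)
            return self.state["total"]
        finally:
            if previous is not None:
                signal.signal(signal.SIGTSTP, previous)


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "job.ckpt")
        first = CheckpointedSum(path)
        vacated = first.run(10, evict_at=3)   # interrupted part-way through
        second = CheckpointedSum(path)        # "new machine": sees the checkpoint
        return vacated, second.run(10)
```

Calling demo() runs the job once with an eviction and once as a restart; the second run resumes from the saved counter instead of starting over.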
Checkpointing Details (refer to postscript file): Preserving and restoring the text area (same executable), the data area (using sbrk(0)), and the stack. Preserving the stack state consists of storing and restoring two parts: the stack context and the stack space. The stack context is stored by setjmp and restored by longjmp. Replacing the stack space is tricky; it is performed while running on a temporary stack in a safe region of the data segment. Open files: state is saved by augmenting open calls, and lseek is performed during checkpointing to obtain offset information. Signals: sigaction, sigpending.
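The open-file part of this can be sketched with Python's os module, which exposes the same open and lseek calls. This is a simplified illustration under assumptions: the helper names snapshot_open_file and restore_open_file and the file input.txt are hypothetical, and the real library intercepts open calls inside the process rather than being passed the path explicitly.

```python
import os
import tempfile


def snapshot_open_file(fd, path, flags):
    """Record what is needed to reopen fd later: path, flags, and offset."""
    # lseek with an offset of 0 from the current position moves nothing
    # and returns the current file offset.
    offset = os.lseek(fd, 0, os.SEEK_CUR)
    return {"path": path, "flags": flags, "offset": offset}


def restore_open_file(record):
    """Reopen the file and seek back to the recorded offset."""
    fd = os.open(record["path"], record["flags"])
    os.lseek(fd, record["offset"], os.SEEK_SET)
    return fd


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "input.txt")
        with open(path, "wb") as f:
            f.write(b"abcdefghij")

        fd = os.open(path, os.O_RDONLY)
        os.read(fd, 4)                       # the job has consumed b"abcd"
        record = snapshot_open_file(fd, path, os.O_RDONLY)
        os.close(fd)                         # the process goes away

        fd2 = restore_open_file(record)      # restart: reopen and reposition
        rest = os.read(fd2, 100)
        os.close(fd2)
        return record["offset"], rest
```

After the restore, reading continues from the byte where the original descriptor left off.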
Checkpoint summary: The checkpoint library installs a signal handler called checkpoint() and then calls main(). At the time of a checkpoint, the SIGTSTP signal is sent and checkpoint() is invoked. checkpoint(): writes open files, signals, and stack context to the data area; stores the data and stack segments.
Restart Summary: restore(): overwrites the data segment with that in the checkpoint; restores file and signal information; switches to a temporary location in the data segment and replaces its stack space; performs a longjmp() pointing to the checkpoint() signal handler. The checkpoint routine then returns and restores the CPU registers.
Limitations: Cannot checkpoint programs that use fork()/exec() or multiple processes. Can checkpoint only on homogeneous systems. Cannot checkpoint communicating multi-processes.
Condor Universes: The universe is specified during job submission. Types: Standard: system calls are transferred to the submit machine; provides for checkpointing and migration; relink the program with condor_compile. Vanilla: for programs that cannot be relinked; does not provide for checkpointing and migration (why?); for access to files, use the Condor File Transfer mechanism. Scheduler: for a job that should act as a metascheduler. Others: MPI, PVM, Java, Globus.
Condor Commands: condor_compile: relinks source or object files with the Condor libraries; the Condor library provides checkpointing, migration, and remote system calls. condor_submit: takes a submit description file as input and produces a job ClassAd for further processing by the central manager. condor_status: shows information about the machines in the Condor pool. condor_q: shows job status.
DAGMan: Meta-scheduler for Condor. Manages dependencies between jobs at a higher level. Sits on top of Condor. Used when the input of one program depends on the output of another. Invoked as: condor_submit_dag DAGInputFileName. A DAG within a DAG is supported.
Example input file for DAGMan:
# Filename: diamond.dag
#
Job A A.condor
Job B B.condor
Job C C.condor
Job D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
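To make the dependency structure of this file concrete, the following Python sketch parses the three kinds of lines used in the example and derives one valid run order with a topological sort. It is not how DAGMan is implemented, and it handles only the Job, PARENT ... CHILD, and Retry lines shown here; the function names parse_dag and execution_order are hypothetical. DAGMan would be free to run B and C at the same time, so the list returned is just one order that respects the dependencies.

```python
DIAMOND = """\
# Filename: diamond.dag
#
Job A A.condor
Job B B.condor
Job C C.condor
Job D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
"""


def parse_dag(text):
    """Return (jobs, children, retries) for the subset of syntax shown above."""
    jobs, children, retries = {}, {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]          # job name -> submit file
            children.setdefault(words[1], [])
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            for parent in words[1:split]:
                children.setdefault(parent, []).extend(words[split + 1:])
        elif keyword == "RETRY":
            retries[words[1]] = int(words[2])
    return jobs, children, retries


def execution_order(jobs, children):
    """Kahn's algorithm: repeatedly emit jobs whose parents have all run."""
    indegree = {name: 0 for name in jobs}
    for kids in children.values():
        for kid in kids:
            indegree[kid] += 1
    ready = [name for name in jobs if indegree[name] == 0]
    order = []
    while ready:
        name = ready.pop(0)
        order.append(name)
        for kid in children.get(name, []):
            indegree[kid] -= 1
            if indegree[kid] == 0:
                ready.append(kid)
    if len(order) != len(jobs):
        raise ValueError("dependency cycle detected")
    return order
```

For the diamond above, A has no parents, B and C both wait on A, and D waits on both B and C.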
Condor File System and File Transfer Mechanism: Applicable only to vanilla jobs. By default, a shared file system is assumed between the submitting machine and the executing machine. The relevant machine ClassAd attributes are FileSystemDomain and UidDomain. To bypass the default, say something like:
Requirements = UidDomain == "cs.wisc.edu" && \
               FileSystemDomain == "cs.wisc.edu"
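The effect of such a Requirements expression can be pictured as a filter over machine ClassAds. The Python sketch below models each machine ad as a plain dictionary and keeps only machines whose UidDomain and FileSystemDomain both match. The function name, the machine names, and the second domain are invented for illustration; the case-insensitive comparison is an assumption about how == treats strings in ClassAd expressions, and real matchmaking is done by the Condor negotiator, not by code like this.

```python
def machines_sharing_domains(machine_ads, uid_domain, fs_domain):
    """Return the names of machines whose ad satisfies both domain tests."""

    def same(a, b):
        # Assumption: string equality in the expression ignores case.
        return a is not None and a.lower() == b.lower()

    return [
        ad["Name"]
        for ad in machine_ads
        if same(ad.get("UidDomain"), uid_domain)
        and same(ad.get("FileSystemDomain"), fs_domain)
    ]


# Hypothetical machine ads for illustration only.
POOL = [
    {"Name": "node1", "UidDomain": "cs.wisc.edu", "FileSystemDomain": "cs.wisc.edu"},
    {"Name": "node2", "UidDomain": "cs.wisc.edu", "FileSystemDomain": "other.example"},
    {"Name": "node3", "UidDomain": "cs.wisc.edu"},
]
```

With this pool, only a machine that advertises both attributes with the requested values is eligible; a machine missing either attribute is excluded.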
Condor File System and File Transfer Mechanism: If the machines do not share file systems, or the file systems are not explicitly specified, enable the Condor File Transfer Mechanism:
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Any files that are generated or modified in the remote working directory are transferred back to the submit machine.
References / Sources / Credits:
Condor manual
Condor web pages
Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System", University of Wisconsin-Madison Computer Sciences Technical Report #1346, April
James Frey, Todd Tannenbaum, Ian Foster, Miron Livny, and Steven Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids", Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), San Francisco, California, August 7-9,
Rajesh Raman, Miron Livny, and Marvin Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing", Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, July 28-31, 1998, Chicago, IL.
Michael Litzkow, Miron Livny, and Matt Mutka, "Condor - A Hunter of Idle Workstations", Proceedings of the 8th International Conference of Distributed Computing Systems, pages , June, 1988.
Submit description files: Direct the queuing of jobs. Contain: the executable location; command-line arguments to the job; stdin, stderr, stdout; the initial working directory; should_transfer_files (NO disables the Condor file transfer mechanism); when_to_transfer_output.
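As an illustration of what such a file looks like, this Python sketch renders a submit description for a vanilla job from a dictionary. The command names (universe, executable, arguments, input, output, error, initialdir, should_transfer_files, when_to_transfer_output, queue) are standard submit commands; every value, including the program name analyze and the directory /home/user/run1, is a made-up placeholder, and the helper render_submit_description is hypothetical rather than part of Condor.

```python
def render_submit_description(settings, queue_count=1):
    """Render 'command = value' lines followed by a queue statement."""
    lines = [f"{command} = {value}" for command, value in settings.items()]
    lines.append("queue" if queue_count == 1 else f"queue {queue_count}")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
EXAMPLE_JOB = {
    "universe": "vanilla",
    "executable": "analyze",
    "arguments": "-n 10",
    "input": "analyze.in",
    "output": "analyze.out",
    "error": "analyze.err",
    "initialdir": "/home/user/run1",
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
}

SUBMIT_TEXT = render_submit_description(EXAMPLE_JOB)
```

The resulting text could be written to a file and handed to condor_submit; setting should_transfer_files to NO in the dictionary would disable the file transfer mechanism for the job.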
Submit description file:
requirements =
By default, Arch, OpSys, Disk, VirtualMemory, and (for vanilla) FileSystemDomain are set.
requirements =
+ =
Heterogeneous job submission: Works well with the vanilla universe, since no checkpoint is taken. For the standard universe:
# Added by Condor
CkptRequirements = ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && \
                   ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))
Requirements = ( ) && $(CkptRequirements)
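The logic of CkptRequirements can be read as follows: before the first checkpoint, CkptArch and CkptOpSys are undefined and any machine passes; once a checkpoint exists, only machines with the same architecture and operating system pass. The Python sketch below mirrors that reading with dictionaries standing in for ClassAds, where a missing key plays the role of UNDEFINED. The function name and the example attribute values are illustrative assumptions, not Condor code.

```python
def ckpt_requirements(job_ad, machine_ad):
    """Mirror of the CkptRequirements expression for dictionary 'ads'."""
    ckpt_arch = job_ad.get("CkptArch")    # None stands in for UNDEFINED
    ckpt_opsys = job_ad.get("CkptOpSys")
    arch_ok = ckpt_arch is None or ckpt_arch == machine_ad["Arch"]
    opsys_ok = ckpt_opsys is None or ckpt_opsys == machine_ad["OpSys"]
    return arch_ok and opsys_ok


# Illustrative values only.
LINUX_BOX = {"Arch": "INTEL", "OpSys": "LINUX"}
SOLARIS_BOX = {"Arch": "SUN4u", "OpSys": "SOLARIS28"}

FRESH_JOB = {}                                           # no checkpoint yet
RESUMED_JOB = {"CkptArch": "INTEL", "CkptOpSys": "LINUX"}  # checkpointed on Linux
```

A fresh job may start on either machine, but once it has checkpointed on one platform it can only resume on a matching one, which is why heterogeneous submission is restricted in the standard universe.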
Job Migration: SIGTSTP and the signal handler in the standard universe; SIGTERM in the vanilla universe.
Condor Security: The schedd starts the shadow with the effective UID of the job owner. Different authentication methods (such as Kerberos and GSI), different encryption mechanisms, and authorization are supported between clients and daemons. Sockets and ports: the Condor collector and negotiator start on well-known ports; other daemons start on ephemeral ports.