Basic Grid Projects – Condor (Part II)
Sathish Vadhiyar
Sources/Credits: Condor Project web pages

Checkpointing
- Checkpointing is used to vacate a job from one workstation and resume it on another
- A Condor checkpoint library is linked with the program's code
- The checkpoint library installs a signal handler for the SIGTSTP signal
- Checkpoints are stored either on the local disk of the submitting machine or on checkpoint servers
- A checkpoint stores the Unix process state, including the text, stack, and data segments, open files, pointers, etc.
- Condor also provides periodic checkpointing
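Checkpoints can also be requested by hand with the condor_checkpoint command; a brief usage sketch (the hostname is hypothetical):

    # Ask every machine in the pool to checkpoint its running jobs
    condor_checkpoint -all

    # Ask one specific machine (hypothetical hostname) to checkpoint
    condor_checkpoint node01.cs.wisc.edu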

Checkpointing Overview
- When the startd daemon detects a policy violation (e.g., the machine's owner returns), it sends a signal to the job process
- The signal handler in the process is invoked and the process state is checkpointed
- The checkpoint is sent to the shadow process, which stores it
- When a new machine is chosen, the executable and the checkpoint are sent to the remote machine
- When the job is started on the remote machine, it detects that this is a restart, reads the checkpoint, and performs some manipulations so that the process state at the time of the checkpoint is restored
- To the user code, it appears that the process has just returned from the signal handler

Checkpointing Details
(Refer to the checkpointing paper in the references)
- Preserving and restoring the text area (same executable), the data area (sized using sbrk(0)), and the stack
- Preserving the stack state consists of storing and restoring two parts: the stack context and the stack space
- The stack context is stored by setjmp and restored by longjmp
- Replacing the stack space is tricky; it is performed by using a reserved data region as a temporary stack
- Open-file state is saved by augmenting the open calls; an lseek is performed during checkpointing to obtain the current file offsets
- Signal state is saved using sigaction and sigpending
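To make the stack-context mechanism concrete, here is a minimal C sketch, not Condor's actual code, of saving an execution context with setjmp and resuming it with longjmp; sbrk(0) is shown only to illustrate how the current end of the data segment can be found:

    #include <setjmp.h>
    #include <stdio.h>
    #include <unistd.h>

    static jmp_buf ckpt_context;   /* the saved stack context */

    int main(void)
    {
        /* sbrk(0) returns the current end of the data segment; a
           checkpoint of the data area needs to know this size. */
        void *data_end = sbrk(0);
        printf("data segment ends at %p\n", data_end);

        /* setjmp returns 0 when the context is first saved, and a
           nonzero value when control comes back via longjmp. */
        if (setjmp(ckpt_context) == 0) {
            printf("context saved; simulating a restart...\n");
            /* A real checkpoint library would also copy the stack
               space out here; on restart it copies the stack back
               before performing the longjmp below. */
            longjmp(ckpt_context, 1);
        } else {
            printf("resumed from the saved context\n");
        }
        return 0;
    }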

Checkpoint Summary
- The checkpoint library installs a signal handler called checkpoint(), then calls main()
- At checkpoint time, a SIGTSTP signal is sent and checkpoint() is invoked
- checkpoint():
  - Writes open-file, signal, and stack-context information to the data area
  - Stores the data and stack segments

Restart Summary
- restore():
  - Overwrites the data segment with the one in the checkpoint
  - Restores file and signal information
  - Switches to a temporary location in the data segment and replaces the original stack space
  - Performs a longjmp() back into the checkpoint() signal handler
- The checkpoint routine then returns, restoring the CPU registers

Limitations
- Cannot checkpoint jobs that use fork()/exec() or consist of multiple processes
- Cannot checkpoint communicating processes
- Checkpoints can be resumed only on homogeneous systems

Condor Universes
- The universe is specified during job submission (a submit-file sketch follows this list). Types:
- Standard
  - System calls are transferred to the submit machine
  - Provides checkpointing and migration
  - The program must be relinked with condor_compile
- Vanilla
  - For programs that cannot be relinked
  - Does not provide checkpointing or migration. Why? Because the job is not linked with the checkpoint library
  - For access to files, use the Condor File Transfer mechanism
- Scheduler
  - For a job that should act as a metascheduler
- Others: MPI, PVM, Java, Globus
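As a minimal sketch (file and program names are hypothetical), the universe is just one line in the submit description file:

    # job.sub -- hypothetical example
    # universe = standard requires relinking with condor_compile
    universe   = standard
    executable = myprog
    queue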

Condor Commands
- condor_compile
  - Relinks source or object files with the Condor libraries
  - The Condor library provides checkpointing, migration, and remote system calls
- condor_submit: takes a submit description file as input and produces a job ClassAd for further processing by the central manager
- condor_status: for viewing information about the various machines in the Condor pool
- condor_q: for viewing job status
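A typical session with these commands might look as follows (program and file names are hypothetical):

    # Relink the program with the Condor libraries (standard universe)
    condor_compile gcc -o myprog myprog.c

    # Submit the job described in job.sub
    condor_submit job.sub

    # Inspect the job queue and the machines in the pool
    condor_q
    condor_status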

DAGMan
- A meta-scheduler for Condor that sits on top of Condor
- Manages dependencies between jobs at a higher level, e.g., when the input of one program depends on the output of another
- Invoked as: condor_submit_dag DAGInputFileName
- A DAG within a DAG is supported

Example input file for DAGMan

    # Filename: diamond.dag
    Job A A.condor
    Job B B.condor
    Job C C.condor
    Job D D.condor
    PARENT A CHILD B C
    PARENT B C CHILD D
    Retry C 3
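Given this file, the entire workflow is submitted with one command; DAGMan runs A first, then B and C once A succeeds, then D once both B and C succeed, retrying C up to three times if it fails:

    condor_submit_dag diamond.dag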

Condor File System and File Transfer Mechanism
- Applicable only to vanilla jobs
- By default, a shared file system is assumed between the submitting machine and the executing machine
- Relevant machine ClassAd attributes: FileSystemDomain and UidDomain
- To bypass the default, say something like:

    Requirements = UidDomain == "cs.wisc.edu" && \
                   FileSystemDomain == "cs.wisc.edu"

Condor File System and File Transfer Mechanism
- If the machines do not share a file system, or the file systems are not explicitly specified, enable the Condor File Transfer mechanism:

    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT

- Any files that are generated or modified in the remote working directory are transferred back to the submit machine
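Putting these pieces together, a vanilla-universe submit file that uses file transfer might look as follows (file names are hypothetical; transfer_input_files names the input files to ship to the execute machine):

    universe                = vanilla
    executable              = myprog
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = input.dat
    queue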

References / Sources / Credits
- Condor manual
- Condor web pages
- Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System", University of Wisconsin-Madison Computer Sciences Technical Report #1346, April.
- James Frey, Todd Tannenbaum, Ian Foster, Miron Livny, and Steven Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids", Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), San Francisco, California, August 7-9.
- Rajesh Raman, Miron Livny, and Marvin Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing", Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, July 28-31, 1998, Chicago, IL.
- Michael Litzkow, Miron Livny, and Matt Mutka, "Condor - A Hunter of Idle Workstations", Proceedings of the 8th International Conference of Distributed Computing Systems, June 1988.

Submit Description Files
- Direct the queuing of jobs
- Contain:
  - Executable location
  - Command-line arguments to the job
  - stdin, stderr, stdout
  - Initial working directory
  - should_transfer_files (setting it to NO disables the Condor file transfer mechanism)
  - when_to_transfer_output
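As a sketch, a complete minimal submit description file combining these elements (all names are hypothetical):

    # job.sub -- hypothetical example
    universe   = vanilla
    executable = myprog
    arguments  = -n 10
    input      = job.in
    output     = job.out
    error      = job.err
    initialdir = /home/user/run1
    queue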

Submit Description File: Requirements
- requirements = <ClassAd expression that the target machine must satisfy>
- By default, Arch, OpSys, Disk, VirtualMemory, and (for vanilla) FileSystemDomain are set
- +<AttributeName> = <value> adds a custom attribute to the job ClassAd
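A hypothetical sketch of such lines in a submit file (the +Department attribute and its value are made up for illustration):

    requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Memory >= 1024)
    +Department  = "Physics"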

Machine ClassAd Attributes
- Activity, Arch
- CondorLoadAvg, ConsoleIdle, Disk, Cpus, KeyboardIdle, LoadAvg, KFlops, Mips, Memory, OpSys, FileSystemDomain, Requirements, StartdIpAddr
- ClientMachine, CurrentRank, RemoteOwner, LastPeriodicCheckpoint

Job ClassAd Attributes
- CompletionDate, RemoteIwd

Heterogeneous Job Submission
- Works well with the vanilla universe, since no checkpoint is taken
- For the standard universe, Condor adds:

    # Added by Condor
    CkptRequirements = ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && \
                       ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))
    Requirements = ( ) && $(CkptRequirements)

Submission Steps
1. Job preparation
2. Choosing a universe
3. Writing the submit description file
4. Running condor_submit

Job Migration
- Standard universe: a SIGTSTP is sent and the checkpoint signal handler runs
- Vanilla universe: a SIGTERM is sent

Condor Security
- The schedd starts the shadow with the effective UID of the job owner
- Different authentication methods (such as Kerberos and GSI), different encryption mechanisms, and authorization are supported between clients and daemons
- Sockets and ports: the condor_collector and condor_negotiator start on well-known ports; other daemons start on ephemeral ports

Checkpointing
- CkptArch, CkptOpSys, LastCkptServer, LastCkptTime, NumCkpts are ClassAd attributes generated automatically for the job
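These attributes can be inspected in a job's full ClassAd; a usage sketch (the job ID 12.0 is hypothetical):

    # Dump the full ClassAd of job 12.0 and filter for checkpoint attributes
    condor_q -l 12.0 | grep -i ckpt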