Intermediate Condor
Monday morning, 10:45am
Alain Roy, OSG Software Coordinator
University of Wisconsin-Madison

Before we begin…
Any questions on the lectures or exercises up to this point?

How can my jobs access their data?

Access to Data in Condor
- Use a shared filesystem if available (not available for today's exercises)
- No shared filesystem?
  - Condor can transfer files:
    - Can automatically send back changed files
    - Atomic transfer of multiple files
    - Can be encrypted over the wire
    - This is what we'll do in the exercises
  - Remote I/O socket
  - The Standard Universe can use remote system calls (more on this later)

Condor File Transfer
- ShouldTransferFiles = YES: always transfer files to the execution site
- ShouldTransferFiles = NO: rely on a shared filesystem
- ShouldTransferFiles = IF_NEEDED: transfer the files automatically if the submit and execute machines are not in the same FileSystemDomain

    Universe = vanilla
    Executable = my_job
    Log = my_job.log
    ShouldTransferFiles = IF_NEEDED
    Transfer_input_files = dataset$(Process), common.data
    Queue 600

Some of the machines in the pool do not have enough memory or scratch disk space to run my job!

Specify Requirements
- An expression (syntax similar to C or Java)
- Must evaluate to True for a match to be made
- (Memory is in Mbytes, Disk in Kbytes)

    Universe = vanilla
    Executable = my_job
    Log = my_job.log
    InitialDir = run_$(Process)
    Requirements = Memory >= 256 && Disk > 10000
    Queue 600

Specify Rank
- All matches which meet the requirements can be sorted by preference with a Rank expression
- The higher the Rank, the better the match

    Universe = vanilla
    Executable = my_job
    Log = my_job.log
    Arguments = -arg1 -arg2
    InitialDir = run_$(Process)
    Requirements = Memory >= 256 && Disk > 10000
    Rank = (KFLOPS*10000) + Memory
    Queue 600

My jobs run for 20 days… What happens when they get preempted? How can I add fault tolerance to my jobs?

Condor's Standard Universe to the rescue!
- Condor can support various combinations of features/environments in different "Universes"
- Different Universes provide different functionality for your job:
  - Vanilla: run any serial job
  - Standard: support for transparent process checkpoint and restart

Process Checkpointing
- Condor's process checkpointing mechanism saves the entire state of a process into a checkpoint file (memory, CPU, I/O, etc.)
- The process can then be restarted from right where it left off
- Typically no changes to your job's source code are needed; however, your job must be relinked with Condor's Standard Universe support library

Relinking Your Job for Standard Universe
To relink, just place "condor_compile" in front of the command you normally use to link your job:

    % condor_compile gcc -o myjob myjob.c

or:

    % condor_compile f77 -o myjob filea.f fileb.f
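Once relinked, the job is submitted like any other job, except that the submit file selects the standard universe. A minimal sketch (the file names follow the relinking example above):

    # Submit the relinked binary to the Standard Universe
    Universe   = standard
    Executable = myjob
    Log        = myjob.log
    Queue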

Limitations of the Standard Universe
Condor's checkpointing is not at the kernel level. Thus, in the Standard Universe the job may not:
- fork()
- use kernel threads
- use some forms of IPC, such as pipes and shared memory
Many typical scientific jobs are OK. The job must be linked with the same version of gcc that Condor was built with.

When will Condor checkpoint your job?
- Periodically, if desired (for fault tolerance)
- When your job is preempted by a higher-priority job
- When your job is vacated because the execution machine becomes busy
- When you explicitly run one of these commands (see the sketch below):
  - condor_checkpoint
  - condor_vacate
  - condor_off
  - condor_restart
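For the explicit commands, a quick sketch of typical usage from the submit host (the hostname here is hypothetical):

    # Checkpoint all jobs running on a given execute machine
    % condor_checkpoint execute-node.example.edu

    # Checkpoint and then evict the jobs on that machine
    % condor_vacate execute-node.example.edu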

Remote System Calls
- I/O system calls are trapped and sent back to the submit machine
- Allows transparent migration across administrative domains (checkpoint on machine A, restart on B)
- No source code changes required
- Language independent
- Opportunities for application steering

Remote I/O (diagram): on the execute machine, the job's I/O library (running under condor_starter and condor_startd) forwards I/O requests to the condor_shadow (running under condor_schedd) on the submit machine, which performs them against the file there.

Clusters and Processes
- If your submit file describes multiple jobs, we call this a "cluster"
- Each cluster has a unique "cluster number"
- Each job in a cluster is called a "process"; process numbers always start at zero
- A Condor "Job ID" is the cluster number, a period, and the process number (e.g., "20.1")
- A cluster is allowed to have one or more processes; there is always a cluster for every job

Example Submit Description File for a Cluster

    # Example submit description file that defines a
    # cluster of 2 jobs with separate working directories
    Universe   = vanilla
    Executable = my_job
    Log        = my_job.log
    Arguments  = -arg1 -arg2
    Input      = my_job.stdin
    Output     = my_job.stdout
    Error      = my_job.stderr
    InitialDir = run_0
    Queue               # becomes job 2.0
    InitialDir = run_1
    Queue               # becomes job 2.1

OSG Summer School 2010 % condor_submit my_job.submit-file Submitting job(s). 2 job(s) submitted to cluster 2. % condor_q -- Submitter: perdita.cs.wisc.edu : : ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 2.0 frieda 4/15 06: :00:00 I my_job 2.1 frieda 4/15 06: :00:00 I my_job 2 jobs; 2 idle, 0 running, 0 held Submitting The Job

Submit Description File for a BIG Cluster of Jobs
- The initial directory for each job can be specified as run_$(Process), and instead of submitting a single job, we use "Queue 600" to submit 600 jobs at once
- The $(Process) macro will be expanded to the process number for each job in the cluster (0–599), so we'll have "run_0", "run_1", … "run_599" directories
- All the input/output files will be in different directories!

Submit Description File for a BIG Cluster of Jobs

    # Example condor_submit input file that defines
    # a cluster of 600 jobs with different directories
    Universe   = vanilla
    Executable = my_job
    Log        = my_job.log
    Arguments  = -arg1 -arg2
    Input      = my_job.stdin
    Output     = my_job.stdout
    Error      = my_job.stderr
    InitialDir = run_$(Process)   # run_0 … run_599
    Queue 600                     # creates jobs 3.0 … 3.599

More $(Process)
You can use $(Process) anywhere.

    Universe   = vanilla
    Executable = my_job
    Log        = my_job.$(Process).log
    Arguments  = -randomseed $(Process)
    Input      = my_job.stdin
    Output     = my_job.stdout
    Error      = my_job.stderr
    InitialDir = run_$(Process)   # run_0 … run_599
    Queue 600                     # creates jobs 3.0 … 3.599

Sharing a Directory
You don't have to use separate directories; $(Cluster) will help distinguish runs.

    Universe   = vanilla
    Executable = my_job
    Arguments  = -randomseed $(Process)
    Input      = my_job.input.$(Process)
    Output     = my_job.stdout.$(Cluster).$(Process)
    Error      = my_job.stderr.$(Cluster).$(Process)
    Log        = my_job.$(Cluster).$(Process).log
    Queue 600

Job Priorities
- Are some of the jobs in your sweep more interesting than others?
- condor_prio lets you set the job priority
  - Priority is relative to your own jobs, not other people's
  - Priority can be any integer
- Can also be set in the submit file: Priority = 14
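For example, to raise the priority of a queued job from the command line (job ID 3.0 here just refers back to the example cluster from the previous slides):

    # Set the priority of job 3.0 to 14
    % condor_prio -p 14 3.0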

What if you have LOTS of jobs?
- Set system limits to be high:
  - Each running job requires a shadow process
  - Each shadow requires file descriptors, ports, and sockets
- Each condor_schedd limits the maximum number of running jobs
  - Default is 200
  - Configurable: can be quite high (2000+); see the sketch below
- Consider multiple submit hosts
  - You can submit jobs from multiple computers
  - Immediate increase in scalability (and complexity)
- We constantly strive to improve scalability
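The per-schedd cap is a configuration knob on the submit host. A sketch, assuming you (or your administrator) can edit the local Condor configuration; the value 2000 is just an illustration:

    # condor_config.local on the submit host:
    # raise the cap on simultaneously running jobs for this schedd
    MAX_JOBS_RUNNING = 2000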

Advanced Trickery
- You submit 10 parameter sweeps
- You have five classes of parameter sweeps; call them A, B, C, D, E
- How can you look at the status of jobs that are part of Type B parameter sweeps?

Advanced Trickery, cont.
- In your submit file: +SweepType = "B"
- You can see this attribute in your job ClassAd: condor_q -l
- You can show jobs of a certain type: condor_q -constraint 'SweepType == "B"'
- Very useful when you have a complex variety of jobs
- Try this during the exercises!
- Be careful with the quoting…
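Putting it together, a minimal sketch of a type-B sweep submit file and the matching query (the executable name is hypothetical):

    # Submit file for one type-B parameter sweep
    Universe   = vanilla
    Executable = sweep_b_job
    +SweepType = "B"
    Queue 100

    # Later, show only the type-B jobs in the queue
    % condor_q -constraint 'SweepType == "B"'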

Time for more exercises!

Questions?
- Questions? Comments?
- Feel free to ask me questions later: Alain Roy
- Upcoming sessions:
  - Now – 12:15: Hands-on exercises
  - 12:15 – 1:15: Lunch