SIE’s favourite pet: Condor (or how to easily run your programs on dozens of machines at a time) Adrián Santos Marrero E.T.S.I. Informática - ULL.

2 Tutorial Outline The Story of Frieda, the Scientist  Using Condor to manage jobs  Using Condor to manage resources  Condor Architecture and Mechanisms Stop me if you have any questions!

3 Meet Frieda. She is a scientist. But she has a big problem.

4 Frieda’s Application … Run a Parameter Sweep of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z (20*10*3 = 600 combinations)  F takes, on average, 6 hours to compute on a “typical” workstation (total = 3600 hours)  F requires a “moderate” (128 MB) amount of memory  F performs “moderate” I/O: (x,y,z) is 5 MB and F(x,y,z) is 50 MB

5 I have 600 simulations to run. Where can I get help?

6 As if by magic, a genie appears from a lamp, and says, “Use Condor!”

7 What is Condor? › A batch system › It takes advantage of free machines at IAC. › Result: it saves us time and makes better use of the available resources.

9 Condor will... › … keep an eye on your jobs and will keep you posted on their progress › … implement your policy on the execution order of the jobs › … keep a log of your job activities › … add fault tolerance to your jobs › … implement your policy on when the jobs can run on your workstation

10 What machines will Condor use? › Condor will only use machines that:  nobody is currently using (neither locally nor remotely)  are not running any application with a high CPU load › Condor leaves a machine as soon as it becomes busy.

11 Testimonial › Daniel Bramich has been using Condor at IAC for intensive calculations. › Without Condor, his calculations would have taken about 458 days to complete. With Condor they take only a couple of weeks, using 30 of the fastest computers at IAC (limited by the number of IDL licences).

12 Getting Started: Submitting Jobs to Condor › Creating a submit description file › Choosing a “Universe” for your job  Just use VANILLA for now › Make your job “batch-ready” › Run condor_submit on your submit description file

13 Making your job batch-ready › Must be able to run in the background: no interactive input, windows, GUI, etc. › Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices › Organize data files

14 Creating a Submit Description File › A plain ASCII text file › Condor does not care about file extensions › Tells Condor about your job:  Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later) › Can describe many jobs at once (a “cluster”), each with different input, arguments, output, etc.

15 Simple Submit Description File

# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#   case sensitive, but filenames are!
Universe   = vanilla
Executable = my_job
Queue

16 Running condor_submit › You give condor_submit the name of the submit file you have created:  condor_submit my_job.submit › condor_submit parses the submit file, checks it for errors, and creates a “ClassAd” that describes your job(s)  ClassAds: Condor’s internal data representation Similar to classified ads (as the name implies) Represent an object & its attributes Can also describe what an object matches with
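
To make the ClassAd idea concrete, here is a hedged, heavily abridged sketch of the kind of attribute = value pairs a job ClassAd contains. The attribute names are standard, but the values shown are purely illustrative (they are not taken from the slides); you can inspect the real thing with condor_q -l:

MyType       = "Job"
TargetType   = "Machine"
Owner        = "frieda"
Cmd          = "/home/frieda/my_job"
Iwd          = "/home/frieda/condor/run_1"
JobUniverse  = 5                 ·5 means the vanilla universe
Requirements = (Arch == "INTEL") && (OpSys == "LINUX")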

17 The Job Queue › condor_submit sends your job’s ClassAd(s) to the schedd  Manages the local job queue  Stores the job in the job queue › View the queue with condor_q

18 Running condor_submit

% condor_submit my_job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.

% condor_q
-- Submitter: perdita.cs.wisc.edu : :
 ID      OWNER    SUBMITTED   RUN_TIME  ST PRI SIZE CMD
 1.0     frieda   6/16 06:    :00:00    I           my_job

1 jobs; 1 idle, 0 running, 0 held
%

19 More information about jobs › Controlled by submit file settings › Condor sends you email about events  Turn it off: Notification = Never  Only on errors: Notification = Error › Condor creates a log file (user log)  “The Life Story of a Job”  Shows all events in the life of a job  Always have a log file  To turn it on: Log = filename
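
Putting those two settings together, a minimal sketch of a submit file that always keeps a user log but only sends email on problems might look like this (file names are placeholders):

# Hedged sketch: keep a log, only email on errors
Universe     = vanilla
Executable   = my_job
Log          = my_job.log
Notification = Error
Queue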

20 Sample Condor User Log

000 ( ) 05/25 19:10:03 Job submitted from host:
( ) 05/25 19:12:17 Job executing on host:
( ) 05/25 19:13:06 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:37, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:05 - Run Local Usage
        Usr 0 00:00:37, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:05 - Total Local Usage
    Run Bytes Sent By Job
    Run Bytes Received By Job
    Total Bytes Sent By Job
    Total Bytes Received By Job
...

21 Another Submit Description File

# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#   case sensitive, but filenames are!
Universe   = vanilla
Executable = /home/frieda/condor/my_job.condor
Log        = my_job.log
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
Arguments  = -arg1 -arg2
InitialDir = /home/frieda/condor/run_1
Queue

22 “Clusters” and “Processes” › If your submit file describes multiple jobs, we call this a “cluster” › Each cluster has a unique “cluster number” › Each job in a cluster is called a “process”  Process numbers always start at zero › A Condor “Job ID” is the cluster number, a period, and the process number (“20.1”)  A cluster can also consist of a single process (“21.0”)

23 Example Submit Description File for a Cluster

# Example submit description file that defines a
# cluster of 2 jobs with separate working directories
Universe   = vanilla
Executable = my_job
log        = my_job.log
Arguments  = -arg1 -arg2
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
InitialDir = run_0
Queue            ·Becomes job 2.0
InitialDir = run_1
Queue            ·Becomes job 2.1

24 Submitting The Job

% condor_submit my_job.submit-file
Submitting job(s).
2 job(s) submitted to cluster 2.

% condor_q
-- Submitter: perdita.cs.wisc.edu : :
 ID      OWNER    SUBMITTED   RUN_TIME  ST PRI SIZE CMD
 1.0     frieda   4/15 06:    :02:11    R           my_job
 2.0     frieda   4/15 06:    :00:00    I           my_job
 2.1     frieda   4/15 06:    :00:00    I           my_job

3 jobs; 2 idle, 1 running, 0 held
%

25 Submit Description File for a BIG Cluster of Jobs › The initial directory for each job can be specified as run_$(Process), and instead of submitting a single job, we use “Queue 600” to submit 600 jobs at once › The $(Process) macro will be expanded to the process number for each job in the cluster (0 – 599), so we’ll have “run_0”, “run_1”, … “run_599” directories › All the input/output files will be in different directories!

26 Submit Description File for a BIG Cluster of Jobs

# Example condor_submit input file that defines
# a cluster of 600 jobs with different directories
Universe   = vanilla
Executable = my_job
Log        = my_job.log
Arguments  = -arg1 -arg2
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
InitialDir = run_$(Process)   ·run_0 … run_599
Queue 600                     ·Becomes job 3.0 … 3.599

27 Using condor_rm › If you want to remove a job from the Condor queue, you use condor_rm › You can only remove jobs that you own (you can’t run condor_rm on someone else’s jobs unless you are root) › You can give specific job IDs (cluster or cluster.proc), or you can remove all of your jobs with the “-a” option.
 condor_rm 21.1   ·Removes a single job
 condor_rm 21     ·Removes a whole cluster

28 condor_status

% condor_status
Name          OpSys   Arch   State      Activity   LoadAv  Mem  ActvtyTime
haha.cs.wisc. IRIX65  SGI    Unclaimed  Idle                    :00:04
antipholus.cs LINUX   INTEL  Unclaimed  Idle                    :28:42
coral.cs.wisc LINUX   INTEL  Claimed    Busy                    :27:21
doc.cs.wisc.e LINUX   INTEL  Unclaimed  Idle                    :20:04
dsonokwa.cs.w LINUX   INTEL  Claimed    Busy                    :01:45
ferdinand.cs. LINUX   INTEL  Claimed    Suspended               :00:55
              LINUX   INTEL  Unclaimed  Idle                    :03:28
              LINUX   INTEL  Unclaimed  Idle                    :03:29

29 How can my jobs access their data files?

30 Access to Data in Condor › Use Shared Filesystem › Put your files in a location shared between all machines, e.g., your home directory (/home/ /) or /net/ /scratch/
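
As a sketch only (the exact paths are hypothetical and depend on how your site exports its disks), a submit file that relies on the shared filesystem simply uses absolute paths that every machine in the pool can see:

# Hedged sketch: all files live on a shared filesystem,
# so no explicit file transfer is needed; paths are made up.
Universe   = vanilla
Executable = /home/frieda/bin/my_job
InitialDir = /home/frieda/condor/run_$(Process)
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
Log        = my_job.log
Queue 600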

31 Some of the machines in the Pool do not have enough memory or scratch disk space to run my job!

32 Specify Requirements! › An expression (syntax similar to C or Java) › Must evaluate to True for a match to be made

Universe     = vanilla
Executable   = my_job
Log          = my_job.log
InitialDir   = run_$(Process)
Requirements = Memory >= 256 && Disk >
Queue 600

33 Specify Rank! › All matches which meet the requirements can be sorted by preference with a Rank expression. › The higher the Rank, the better the match

Universe     = vanilla
Executable   = my_job
Log          = my_job.log
Arguments    = -arg1 -arg2
InitialDir   = run_$(Process)
Requirements = Memory >= 256 && Disk >
Rank         = (KFLOPS*10000) + Memory
Queue 600

34 We’ve seen how Condor can: … keep an eye on your jobs and keep you posted on their progress … keep a log of your job activities … implement your policy on when jobs can run on your workstation

35 My jobs run for 20 days… › What happens when they get pre-empted? › How can I add fault tolerance to my jobs?

36 Condor’s Standard Universe to the rescue! › Condor can support various combinations of features/environments in different “Universes” › Different Universes provide different functionality for your job:  Vanilla – Run any Serial Job  Standard – Support for transparent process checkpoint and restart

37 Process Checkpointing › Condor’s Process Checkpointing mechanism saves the entire state of a process into a checkpoint file  Memory, CPU, I/O, etc. › The process can then be restarted from right where it left off › Typically no changes to your job’s source code needed – however, your job must be relinked with Condor’s Standard Universe support library

38 Relinking Your Job for Standard Universe To do this, just place “condor_compile” in front of the command you normally use to link your job:

% condor_compile gcc -o myjob myjob.c

- OR -

% condor_compile f77 -o myjob filea.f fileb.f
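
Once the executable has been relinked, only the Universe line changes compared with the earlier vanilla examples; a minimal, hedged sketch of a Standard Universe submit file would be:

# Sketch only: myjob is the binary produced by condor_compile above
Universe   = standard
Executable = myjob
Log        = myjob.log
Queue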

39 Limitations of the Standard Universe › Condor’s checkpointing is not at the kernel level. Thus in the Standard Universe the job may not:  fork()  Use kernel threads  Use some forms of IPC, such as pipes and shared memory › Many typical scientific jobs are OK

40 When will Condor checkpoint your job? › Periodically, if desired  For fault tolerance › When your job is preempted by a higher-priority job › When your job is vacated because the execution machine becomes busy › When you explicitly run the condor_checkpoint command
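
For the last case, forcing a checkpoint is just a matter of running the command, with or without a machine name. A short sketch (the host name is made up):

% condor_checkpoint                     ·checkpoint jobs running on the local machine
% condor_checkpoint node42.example.org  ·checkpoint jobs on one specific (hypothetical) machine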

41 General User Commands

 condor_status      View Pool Status
 condor_q           View Job Queue
 condor_submit      Submit new Jobs
 condor_rm          Remove Jobs
 condor_prio        Intra-User Prios
 condor_checkpoint  Force a checkpoint
 condor_compile     Link Condor library

PRACTICAL EXAMPLE
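
A plausible end-to-end session tying these commands together (a sketch only; the submit file and cluster number are the ones used earlier in the talk) could look like:

% condor_status                 ·any free machines in the pool?
% condor_submit my_job.submit   ·submit the job(s) described in my_job.submit
% condor_q                      ·watch their state (I = idle, R = running, H = held)
% condor_rm 2                   ·remove the whole cluster 2 if something went wrong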

42 Condor Job Universes › Serial Jobs  Vanilla Universe  Standard Universe › Java Universe

43 Java Universe Job

universe   = java
executable = Main.class
jar_files  = MyLibrary.jar
input      = infile
output     = outfile
arguments  = Main
queue

condor_submit

44 Why not use Vanilla Universe for Java jobs? › Java Universe provides more than just inserting “java” at the start of the execute line  Knows which machines have a JVM installed  Knows the location, version, and performance of JVM on each machine  Provides more information about Java job completion than just JVM exit code Program runs in a Java wrapper, allowing Condor to report Java exceptions, etc.

45 Java support, cont.

% condor_status -java
Name          JavaVendor   Ver  State    Activity  LoadAv  Mem
aish.cs.wisc. Sun Microsy        Owner    Idle
anfrom.cs.wis Sun Microsy        Owner    Idle
babe.cs.wisc. Sun Microsy        Claimed  Busy

46 Thank you! Check us out on the Web: