Alain Roy Computer Sciences Department University of Wisconsin-Madison 24-June-2002 Using and Administering Condor


1 Alain Roy Computer Sciences Department University of Wisconsin-Madison roy@cs.wisc.edu http://www.cs.wisc.edu/condor 24-June-2002 Using and Administering Condor

2 www.cs.wisc.edu/condor Good evening! › Thank you for having me! › I am:  Alain Roy  Computer Science Ph.D. in Quality of Service, with Globus Project  Working with the Condor Project

3 www.cs.wisc.edu/condor Condor Tutorials Remaining › Monday (Today), 17:00-19:00  Using and administering Condor › Tuesday, 17:00-19:00  Using Condor on the Grid

4 www.cs.wisc.edu/condor Review: What is Condor? › Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing facility.  Run lots of jobs over a long period of time,  Not a short burst of “high-performance” › Condor manages both machines and jobs with ClassAd Matchmaking to keep everyone happy

5 www.cs.wisc.edu/condor Condor Takes Care of You › Condor does whatever it takes to run your jobs, even if some machines…  Crash (or are disconnected)  Run out of disk space  Don’t have your software installed  Are frequently needed by others  Are far away & managed by someone else

6 www.cs.wisc.edu/condor What is Unique about Condor? › ClassAds › Transparent checkpoint/restart › Remote system calls › Works in heterogeneous clusters › Clusters can be:  Dedicated  Opportunistic

7 www.cs.wisc.edu/condor What’s Condor Good For? › Managing a large number of jobs  You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete  Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc.  Condor can handle inter-job dependencies (DAGMan)

8 www.cs.wisc.edu/condor What’s Condor Good For? (cont’d) › Robustness  Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion  If an execute machine crashes, you only lose work done since the last checkpoint  Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover  (Story)

9 www.cs.wisc.edu/condor What’s Condor Good For? (cont’d) › Giving your job the agility to access more computing resources  Checkpointing allows your job to run on “opportunistic resources” (not dedicated)  Checkpointing also provides “migration” - if a machine is no longer available, move!  With remote system calls, run on systems which do not share a filesystem - You don’t even need an account on a machine where your job executes

10 www.cs.wisc.edu/condor Other Condor features › Implement your policy on when the jobs can run on your workstation › Implement your policy on the execution order of the jobs › Keep a log of your job activities

11 www.cs.wisc.edu/condor A Condor Pool In Action

12 www.cs.wisc.edu/condor A Bit of Condor Philosophy › Condor brings more computing to everyone  A small-time scientist can make an opportunistic pool with 10 machines, and get 10 times as much computing done.  A large collaboration can use Condor to control its dedicated pool with hundreds of machines.

13 www.cs.wisc.edu/condor The Idea Computing power is everywhere; we try to make it usable by anyone.

14 www.cs.wisc.edu/condor Remember Frieda? Today we’ll revisit Frieda’s Condor explorations in more depth

15 www.cs.wisc.edu/condor I have 600 simulations to run. Where can I get help?

16 www.cs.wisc.edu/condor Install a Personal Condor!

17 www.cs.wisc.edu/condor Installing Condor › Download Condor for your operating system › Available as a free download from http://www.cs.wisc.edu/condor › Available for most Unix platforms and Windows NT

18 www.cs.wisc.edu/condor So Frieda Installs Personal Condor on her machine… › What do we mean by a “Personal” Condor?  Condor on your own workstation, no root access required, no system administrator intervention needed—easy to set up.

19 www.cs.wisc.edu/condor Personal Condor?! What’s the benefit of a Condor “Pool” with just one user and one machine?

20 www.cs.wisc.edu/condor Your Personal Condor will... › Keep an eye on your jobs and will keep you posted on their progress › Keep a log of your job activities › Add fault tolerance to your jobs › Implement your policy on when the jobs can run on your workstation

21 www.cs.wisc.edu/condor What’s in a Personal Condor? › Everything that is in Condor, just on one machine. › Condor daemons:  condor_master  condor_collector: stores ClassAds for jobs and machines  condor_negotiator: matchmaking  condor_schedd: submits and monitors jobs  condor_startd: starts jobs  condor_starter: launches a job  condor_shadow: monitors a remote job

22 www.cs.wisc.edu/condor A Condor Pool of One [Diagram: a single machine running condor_master, condor_schedd, condor_collector, condor_negotiator, condor_startd, condor_starter, condor_shadow, and the Condor job]

23 www.cs.wisc.edu/condor condor_master › Starts up all other Condor daemons › If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator › Checks the time stamps on the binaries of the other Condor daemons, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version

24 www.cs.wisc.edu/condor condor_master (cont’d) › Acts as the server for many Condor remote administration commands:  condor_reconfig, condor_restart, condor_off, condor_on, condor_config_val, etc.

25 www.cs.wisc.edu/condor condor_startd › Represents a machine to the Condor system › Responsible for starting, suspending, and stopping jobs › Enforces the wishes of the machine owner (the owner’s “policy”… more on this soon)

26 www.cs.wisc.edu/condor condor_schedd › Represents users to the Condor system › Maintains the persistent queue of jobs › Responsible for contacting available machines and sending them jobs › Services user commands which manipulate the job queue:  condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio, …

27 www.cs.wisc.edu/condor condor_collector › Collects information from all other Condor daemons in the pool  “Directory Service” / Database for a Condor pool › Each daemon sends a periodic update called a “ClassAd” to the collector › Services queries for information:  Queries from other Condor daemons  Queries from users (condor_status)

28 www.cs.wisc.edu/condor condor_negotiator › Performs “matchmaking” in Condor › Gets information from the collector about all available machines and all idle jobs › Tries to match jobs with machines that will serve them › Both the job and the machine must satisfy each other’s requirements
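The two-sided check described on this slide can be illustrated with a small Python toy. This is only a sketch of the idea, not Condor's ClassAd engine: the dictionary layout, the `matches` and `negotiate` helpers, and the attribute names are all assumptions made up for illustration.

```python
# Toy two-sided matchmaking. NOT Condor's ClassAd implementation;
# the data layout and helper names here are hypothetical.

def matches(job, machine):
    """A match requires BOTH sides' requirements to be satisfied."""
    return job["requirements"](machine) and machine["requirements"](job)


def negotiate(idle_jobs, machines):
    """Greedily pair each idle job with the first free machine that matches."""
    free = list(machines)
    pairs = []
    for job in idle_jobs:
        for machine in free:
            if matches(job, machine):
                pairs.append((job["id"], machine["name"]))
                free.remove(machine)  # a machine serves one job at a time here
                break
    return pairs


machines = [
    {"name": "small", "memory": 256, "requirements": lambda job: True},
    {"name": "big", "memory": 1024,
     "requirements": lambda job: job["owner"] != "guest"},
]
jobs = [
    {"id": "5.0", "owner": "roy", "requirements": lambda m: m["memory"] > 512},
    {"id": "5.1", "owner": "guest", "requirements": lambda m: m["memory"] > 512},
]

print(negotiate(jobs, machines))
```

In this toy run, job 5.1 stays unmatched: the only machine with enough memory refuses jobs owned by "guest", so the job's requirement and the machine's requirement are never satisfied together.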

29 www.cs.wisc.edu/condor Frieda wants more… › She decides to use the graduate students’ computers when they aren’t in use, so her jobs get done sooner. › In exchange, they can use the Condor pool too.

30 www.cs.wisc.edu/condor Frieda’s Condor pool… Frieda’s Computer: Central Manager Graduate Student’s Desktop Computers

31 www.cs.wisc.edu/condor A larger Condor pool
[Diagram: four machine roles and the daemons each one runs]
    Submitter: condor_master, condor_schedd, condor_shadow
    Collector: condor_master, condor_negotiator, condor_collector
    Submitter/Executor: condor_master, condor_schedd, condor_startd, condor_shadow, condor_starter, Condor job
    Executor: condor_master, condor_startd, condor_starter, Condor job

32 www.cs.wisc.edu/condor Happy Day! Frieda’s organization purchased a Beowulf Cluster! › Other scientists in her department have realized the power of Condor and want to share it. › The Beowulf cluster and the graduate student computers can be part of a single Condor pool.

33 www.cs.wisc.edu/condor Frieda’s Condor pool… Central Manager Graduate Student’s Desktop Computers Beowulf Cluster

34 www.cs.wisc.edu/condor How would you set it up? › Grad student machines:  Submitters  Executors › Beowulf cluster machines:  Executors only › Independent machine for collector/negotiator  Big job: take it away from Frieda’s computer  Could split collector and negotiator

35 www.cs.wisc.edu/condor Frieda collaborates… › She wants to share her Condor pool with scientists from another lab.

36 www.cs.wisc.edu/condor Condor Flocking › Condor pools can work cooperatively

37 www.cs.wisc.edu/condor How would you set it up? › Two independent pools  Each has its own collector/negotiator › Set up flocking from one pool to another: by machine, or by pool.  FLOCK_TO  FLOCK_FROM › Can be uni- or bi-directional

38 www.cs.wisc.edu/condor Questions So Far?

39 www.cs.wisc.edu/condor How do you run a job? › It doesn’t matter if you have:  Personal Condor  Large Condor pool  Condor pool with flocking › Four steps 1. Write program 2. Write submit file 3. Give it to Condor 4. Condor gives you the results

40 www.cs.wisc.edu/condor Step 1: Writing a program › Condor has universes  Vanilla Universe: runs anything, but is less capable  Java Universe: works better for Java  Standard Universe: checkpointing and remote I/O, but can’t work with all programs

41 www.cs.wisc.edu/condor Step 1: Vanilla Universe › You can run any program  C/C++/Perl/Python/Fortran/Java/Lisp…  No checkpointing: if your job is interrupted or the machine crashes, Condor has to restart it from the beginning.  Can do anything you could do if you were logged in.

42 www.cs.wisc.edu/condor Step 1: Java Universe › Works better for Java programs › Checks for valid Java environment › Distinguishes Java environment exceptions from program exceptions (wrapper program) › No checkpointing (it could happen though) › Remote I/O

43 www.cs.wisc.edu/condor Step 1: Standard Universe › Requires re-linking your program  condor_compile gcc -o simple simple.o › Allows checkpointing and remote I/O › Restrictions on behavior  No threading  Limited networking  Restrictions on compiler used

44 www.cs.wisc.edu/condor Step 2: Write submit file
    Executable = simple
    Universe = vanilla
    Arguments = First
    Log = simple.log
    Output = simple.output
    Error = simple.error
    Requirements = Memory > 512
    Queue
Note: This assumes a shared filesystem

45 www.cs.wisc.edu/condor Step 2: Write submit file
    Executable = simple
    Universe = vanilla
    Arguments = First
    Log = simple.log
    Output = simple.output
    Error = simple.error
    Transfer_input_files = data.in
    Transfer_output_files = data.out
    Requirements = Memory > 512
    Queue
Note: This does not assume a shared filesystem

46 www.cs.wisc.edu/condor Step 2: Write submit file
    Executable = simple
    Universe = standard
    Arguments = First
    Log = simple.log
    Output = simple.output
    Error = simple.error
    Requirements = Memory > 512
    Queue
Note: This does not assume a shared filesystem; it relies on remote I/O instead
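When many similar submit files are needed, the "Key = value" layout shown on these slides is easy to generate. The following Python sketch shows one way to do that. The `render_submit` helper is hypothetical (it is not part of Condor), and it only formats text; it does not validate the settings or call condor_submit.

```python
# Hypothetical helper that renders a submit description as plain text.
# It mirrors the "Key = value ... Queue" layout from the slides and nothing more.

def render_submit(settings, queue_count=1):
    """Render key/value pairs as 'Key = value' lines followed by a Queue line."""
    lines = [f"{key} = {value}" for key, value in settings.items()]
    lines.append("Queue" if queue_count == 1 else f"Queue {queue_count}")
    return "\n".join(lines) + "\n"


submit_text = render_submit({
    "Executable": "simple",
    "Universe": "vanilla",
    "Arguments": "First",
    "Log": "simple.log",
    "Output": "simple.output",
    "Error": "simple.error",
    "Requirements": "Memory > 512",
})
print(submit_text)
```

The resulting text could be written to a file such as submit.desc and handed to condor_submit as in Step 3.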

47 www.cs.wisc.edu/condor Step 2: Submit Files › Condor is helpful: it expands your Requirements into a complete expression:  Requirements = memory > 512 becomes…  Requirements = (OpSys == "Linux") && (memory > 512) && … › Queue can take a parameter (more later) › A single file can submit many jobs

48 www.cs.wisc.edu/condor Step 3: Give it to Condor
› condor_submit submit.desc
› condor_q
    -- Submitter: dsonokwa.cs.wisc.edu : : dsonokwa.cs.wisc.edu
     ID    OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
     5.0   roy    6/15 20:51   0+00:00:02  R   0    0.0   simple First
    1 jobs; 0 idle, 1 running, 0 held

49 www.cs.wisc.edu/condor Step 4: Condor gives it back › The program’s output is where you asked it to be. › Condor left a log file documenting what it did. › Condor optionally sends you an email telling you it’s done.

50 www.cs.wisc.edu/condor Step 4: Condor gives it back
    000 (34364.000.000) 06/15 21:00:01 Job submitted from host:
    001 (34364.000.000) 06/15 21:00:01 Job executing on host:
    005 (34364.000.000) 06/15 21:00:06 Job terminated.
        (1) Normal termination (return value 0)
            Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage

51 www.cs.wisc.edu/condor Step 4: Condor gives it back
    Date: Sat, 15 Jun 2002 21:00:06 -0500 (CDT)
    From: Condor Project
    Message-Id:
    To: roy@cs.wisc.edu
    Subject: [Condor] Condor Job 34364.0
    This is an automated email from the Condor system on machine "beak.cs.wisc.edu". Do not reply.
    Your condor job exited with status 0.
    Job: /scratch/roy/condor/simple/simple First

52 www.cs.wisc.edu/condor Clusters and Processes › If your submit file describes multiple jobs, we call this a “cluster”. › Each job within a cluster is called a “process” or “proc”. › If you only specify one job, you still get a cluster, but it has only one process. › A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”) › Process numbers always start at 0.
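The job ID convention above (cluster number, a period, process number, with processes counted from 0) can be sketched in a few lines of Python. The function names `parse_job_id` and `job_ids` are made up for this illustration and are not Condor tools.

```python
# Illustration of the "cluster.proc" job ID convention; helper names are hypothetical.

def parse_job_id(job_id):
    """Split a job ID such as '23.5' into (cluster, process) integers."""
    cluster, proc = job_id.split(".")
    return int(cluster), int(proc)


def job_ids(cluster, count):
    """List the IDs of a cluster with `count` processes; numbering starts at 0."""
    return [f"{cluster}.{proc}" for proc in range(count)]


print(parse_job_id("23.5"))
print(job_ids(6, 3))
```

A single-job submission is just the special case `job_ids(cluster, 1)`, which yields one ID ending in ".0".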

53 www.cs.wisc.edu/condor Example Submit Description File for a Cluster
    # Example condor_submit input file that defines
    # a whole cluster of jobs at once
    Universe = standard
    Executable = simple
    Output = my_job.stdout
    Error = my_job.stderr
    Log = my_job.log
    Arguments = -arg1 -arg2
    InitialDir = /home/roy/condor/run.$(Process)
    Queue 500
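To see what `Queue 500` combined with `$(Process)` implies, here is a Python sketch that mimics the substitution for the InitialDir line. It is only a model of the expansion, under the assumption that each process number from 0 to 499 replaces the macro once; `expand_initial_dirs` is a hypothetical name.

```python
# Models how $(Process) yields one directory per job in the cluster.
# The helper is hypothetical; Condor performs this expansion itself.

def expand_initial_dirs(template, count):
    """Substitute process numbers 0..count-1 for $(Process) in the template."""
    return [template.replace("$(Process)", str(proc)) for proc in range(count)]


dirs = expand_initial_dirs("/home/roy/condor/run.$(Process)", 500)
print(dirs[0], dirs[-1], len(dirs))
```

Because Output, Error, and Log are relative names, each job would find and write its files inside its own run.N directory, so the 500 jobs do not overwrite one another.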

54 www.cs.wisc.edu/condor Questions So Far?

55 www.cs.wisc.edu/condor condor_q › Find out status of your jobs, from your condor_schedd. › condor_q cluster: all jobs in a cluster › condor_q cluster.proc: particular job › condor_q -sub name: jobs for a particular user

56 www.cs.wisc.edu/condor Temporarily halt a Job › Use condor_hold to place a job on hold  Kills job if currently running  Will not attempt to restart job until released › Use condor_release to remove a hold and permit job to be scheduled again

57 www.cs.wisc.edu/condor condor_rm › You submitted a job, but you want to cancel it › condor_rm clusterid  condor_rm 6: all jobs in cluster › condor_rm clusterid.procid  condor_rm 6.3: specific job › condor_rm -all: all of your jobs › Can only remove your jobs › Reflected in job log

58 www.cs.wisc.edu/condor condor_status
› Find status of pool from condor_collector (simplified view here)
    Name           OpSys  Arch   State      Activity
    carmi.cs.wisc  LINUX  INTEL  Unclaimed  Idle
    coral.cs.wisc  LINUX  INTEL  Unclaimed  Idle
    doc.cs.wisc.e  LINUX  INTEL  Unclaimed  Idle
    dsonokwa.cs.w  LINUX  INTEL  Unclaimed  Idle
    ...
               Machines  Owner  Claimed  Unclaimed
    LINUX            12      2        0         10
    SOLARIS28         5      0        0          5
    Total            17      2        0         15

59 www.cs.wisc.edu/condor condor_status › condor_status -run: which machines are running jobs › condor_status -sub: whose jobs are running? › condor_status -constraint: restrict to showing subset as defined by user

60 www.cs.wisc.edu/condor DAGMan › Directed Acyclic Graph Manager › DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. › (e.g., “Don’t run job “B” until job “A” has completed successfully.”)

61 www.cs.wisc.edu/condor What is a DAG? › A DAG is the data structure used by DAGMan to represent these dependencies. › Each job is a “node” in the DAG. › Each node can have any number of “parent” or “children” nodes, as long as there are no loops! [Diagram: DAG with nodes Job A, Job B, Job C, Job D]

62 www.cs.wisc.edu/condor Defining a DAG
› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
› Each node will run the Condor job specified by its accompanying Condor submit file
[Diagram: DAG with nodes Job A, Job B, Job C, Job D]
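The dependency information in diamond.dag can be read mechanically. The Python sketch below parses only the two line types shown on the slide (Job and Parent ... Child) and derives one valid execution order with the standard library's `graphlib` (Python 3.9 or later). It is an illustration of the data structure, not DAGMan's parser; `parse_dag` is a hypothetical helper and ignores every other DAGMan keyword.

```python
# Sketch: parse the diamond.dag text from the slide and order its nodes.
# Handles only "Job" and "Parent ... Child ..." lines; not DAGMan's real parser.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Return (submit_files, parents_of) dictionaries for the given .dag text."""
    submit_files = {}
    parents_of = {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].lower()
        if keyword == "job":
            name, submit_file = words[1], words[2]
            submit_files[name] = submit_file
            parents_of.setdefault(name, set())
        elif keyword == "parent":
            split = [word.lower() for word in words].index("child")
            parents, children = words[1:split], words[split + 1:]
            for child in children:
                parents_of.setdefault(child, set()).update(parents)
    return submit_files, parents_of


submit_files, parents_of = parse_dag(DIAMOND)
# TopologicalSorter takes a mapping of node -> predecessors and raises
# graphlib.CycleError if the "no loops" rule is violated.
order = list(TopologicalSorter(parents_of).static_order())
print(order)
```

Any order it prints starts with A and ends with D; B and C may appear in either order because neither depends on the other.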

63 www.cs.wisc.edu/condor Submitting a DAG › To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs: % condor_submit_dag diamond.dag › condor_submit_dag submits a Scheduler Universe Job with DAGMan as the executable. › Thus the DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.

64 www.cs.wisc.edu/condor DAGMan Running a DAG › DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies. [Diagram: DAGMan with the .dag file and DAG nodes A, B, C, D; job A in the Condor job queue]

65 www.cs.wisc.edu/condor DAGMan Running a DAG (cont’d) › DAGMan holds & submits jobs to the Condor queue at the appropriate times. [Diagram: DAG nodes A, B, C, D; jobs B and C in the Condor job queue]

66 www.cs.wisc.edu/condor DAGMan Running a DAG (cont’d) › In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG. [Diagram: DAG nodes A, B, D with the failed node marked X; Condor job queue; rescue file]

67 www.cs.wisc.edu/condor DAGMan Recovering a DAG › Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG. [Diagram: DAG nodes A, B, C, D; rescue file; job C in the Condor job queue]

68 www.cs.wisc.edu/condor DAGMan Recovering a DAG (cont’d) › Once that job completes, DAGMan will continue the DAG as if the failure never happened. [Diagram: DAG nodes A, B, C, D; job D in the Condor job queue]

69 www.cs.wisc.edu/condor DAGMan Finishing a DAG › Once the DAG is complete, the DAGMan job itself is finished, and exits. [Diagram: DAG nodes A, B, C, D; Condor job queue]
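The run, fail, rescue, and resume cycle on the preceding slides can be modeled with a short Python simulation. This is a sketch of the idea only: `run_dag` is a hypothetical function, jobs are plain callbacks rather than Condor submissions, and the "rescue file" is represented simply as the set of nodes that already finished.

```python
# Toy model of DAGMan's behavior on failure and recovery; not DAGMan itself.

def run_dag(parents_of, run_job, done=None):
    """Run every node whose parents are all done; return (done, failed) sets.

    `done` plays the role of the rescue file: nodes listed there are skipped.
    """
    done = set(done or ())
    failed = set()
    progress = True
    while progress:  # keep going until no further node can be started
        progress = False
        for node in sorted(parents_of):
            if node in done or node in failed:
                continue
            if parents_of[node] <= done:  # all parents completed successfully
                if run_job(node):
                    done.add(node)
                else:
                    failed.add(node)
                progress = True
    return done, failed


diamond = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

# First attempt: job C fails, so D can never start.
done, failed = run_dag(diamond, lambda node: node != "C")
print(sorted(done), sorted(failed))

# Recovery: restart from the remembered state; only C and D still need to run.
rerun = []
done2, failed2 = run_dag(diamond, lambda node: rerun.append(node) or True, done=done)
print(sorted(done2), rerun)
```

The second call never re-runs A or B, which is the point of keeping the state: work that already completed is not repeated.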

70 www.cs.wisc.edu/condor Additional DAGMan Features › Provides other handy features for job management…  nodes can have PRE & POST scripts  failed nodes can be automatically re- tried a configurable number of times  job submission can be “throttled”

71 www.cs.wisc.edu/condor Questions So Far?

72 www.cs.wisc.edu/condor What if each job needed to run for 20 days? What if I wanted to interrupt a job with a higher priority job?

73 www.cs.wisc.edu/condor Condor’s Standard Universe to the rescue! › Condor can support various combinations of features/environments in different “Universes” › Different Universes provide different functionality for your job:  Vanilla—runs any Serial Job  Java—well suited for Java programs  Standard – Support for transparent process checkpoint and restart

74 www.cs.wisc.edu/condor Process Checkpointing › Condor’s Process Checkpointing mechanism saves all the state of a process into a checkpoint file  Memory, CPU, I/O, etc. › The process can then be restarted from right where it left off › Typically no changes to your job’s source code needed – however, your job must be relinked with Condor’s Standard Universe support library
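The effect of checkpointing, restarting from the last saved state rather than from the beginning, can be demonstrated with an application-level toy in Python. Note the important difference: Condor's Standard Universe checkpoints the whole process transparently, without source changes, whereas this sketch saves its own state explicitly to a JSON file. The function name, the file name job.ckpt, and the simulated crash are all assumptions for illustration.

```python
# Application-level checkpoint/restart toy. Condor's Standard Universe does this
# transparently at the process level; here the program saves its own state.
import json
import os
import tempfile


def run(total_steps, checkpoint_path, crash_at=None):
    """Sum 0..total_steps-1, writing a checkpoint after every step."""
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            state = json.load(handle)  # resume right where we left off
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated machine crash")
        state["total"] += state["step"]
        state["step"] += 1
        with open(checkpoint_path, "w") as handle:
            json.dump(state, handle)
    return state["total"]


checkpoint = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run(10, checkpoint, crash_at=6)  # first attempt dies partway through
except RuntimeError:
    pass
result = run(10, checkpoint)  # second attempt resumes from step 6
print(result)
```

Only the work after the last checkpoint is repeated, which is what allows long-running jobs to make forward progress on machines that may disappear at any time.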

75 www.cs.wisc.edu/condor Linking for Standard Universe
To do this, just place “condor_compile” in front of the command you normally use to link your job:
    condor_compile gcc -o myjob myjob.c
OR
    condor_compile f77 -o myjob filea.f fileb.f

76 www.cs.wisc.edu/condor Limitations in the Standard Universe › Condor’s checkpointing is not at the kernel level. Thus in the Standard Universe the job may not  Call fork()  Use kernel threads  Use some forms of IPC, such as pipes and shared memory › Many typical scientific jobs are OK

77 www.cs.wisc.edu/condor When will Condor checkpoint your job? › Periodically, if desired  For fault tolerance › To free the machine to do a higher priority task (higher priority job, or a job from a user with higher priority)  Preemptive-resume scheduling › When you explicitly run the condor_checkpoint, condor_vacate, condor_off, or condor_restart command

78 www.cs.wisc.edu/condor Administering Condor › Condor provides extensive configuration files  One per pool, one per machine, or anything in between › Extensive documentation  Online manual  Heavily commented sample configuration file

79 www.cs.wisc.edu/condor I am adding nodes to the Cluster… but the Chemistry Department has priority on these nodes. (Boss Fat Cat) Policy Configuration

80 www.cs.wisc.edu/condor The Machine (Startd) Policy Expressions
    START: When is this machine willing to start a job
    RANK: Job preferences
    SUSPEND: When to suspend a job
    CONTINUE: When to continue a suspended job
    PREEMPT: When to nicely stop running a job
    KILL: When to immediately kill a preempting job

81 www.cs.wisc.edu/condor Frieda’s Current Settings
    START = True
    RANK =
    SUSPEND = False
    CONTINUE =
    PREEMPT = False
    KILL = False

82 www.cs.wisc.edu/condor Frieda’s New Settings for the Chemistry nodes
    START = True
    RANK = Department == "Chemistry"
    SUSPEND = False
    CONTINUE =
    PREEMPT = False
    KILL = False

83 www.cs.wisc.edu/condor Submit file with Custom Attribute
    Executable = chem-job
    Universe = standard
    +Department = "Chemistry"
    queue

84 www.cs.wisc.edu/condor What if “Department” is not specified?
    START = True
    RANK = Department =!= UNDEFINED && Department == "Chemistry"
    SUSPEND = False
    CONTINUE =
    PREEMPT = False
    KILL = False

85 www.cs.wisc.edu/condor Another example
    START = True
    RANK = Department =!= UNDEFINED && ((Department == "Chemistry")*2 + (Department == "Physics"))
    SUSPEND = False
    CONTINUE =
    PREEMPT = False
    KILL = False
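The intent of this RANK expression, preferring Chemistry jobs over Physics jobs and both over everyone else, can be spelled out in Python. This is a sketch of the intended ordering only. ClassAd evaluation rules (in particular how `&&` combines a boolean with a number) may differ from Python's, so the function below is an assumption about what the slide means, and `rank` and `UNDEFINED` are illustrative names.

```python
# Sketch of the ranking the slide is aiming for; ClassAd semantics may differ.
UNDEFINED = None  # stand-in for a missing job attribute


def rank(job):
    """Return 2 for Chemistry jobs, 1 for Physics jobs, 0 otherwise."""
    department = job.get("Department", UNDEFINED)
    if department is UNDEFINED:
        return 0  # jobs without the custom attribute get no preference
    # In Python, True counts as 1 and False as 0 in arithmetic.
    return (department == "Chemistry") * 2 + (department == "Physics")


for job in ({"Department": "Chemistry"}, {"Department": "Physics"},
            {"Department": "Biology"}, {}):
    print(job, rank(job))
```

A machine using such a ranking would prefer the job with the highest value, so a Chemistry job would win over a Physics job when both are waiting.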

86 www.cs.wisc.edu/condor The Cluster is fine. But not the desktop machines. Condor can only use the desktops when they would otherwise be idle. (Boss Fat Cat) Policy Configuration, cont

87 www.cs.wisc.edu/condor So Frieda decides she wants the desktops to: › START jobs when there has been no activity on the keyboard/mouse for 5 minutes and the load average is low › SUSPEND jobs as soon as activity is detected › PREEMPT jobs if the activity continues for 5 minutes or more › KILL jobs if they take more than 5 minutes to preempt

88 www.cs.wisc.edu/condor Macros in the Config File
    NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
    BackgroundLoad = 0.3
    HighLoad = 0.5
    KeyboardBusy = (KeyboardIdle < 10)
    CPU_Idle = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
    MachineBusy = ($(CPU_Busy) || $(KeyboardBusy))
    ActivityTimer = (CurrentTime - EnteredCurrentActivity)

89 www.cs.wisc.edu/condor Desktop Machine Policy
    START = $(CPU_Idle) && KeyboardIdle > 300
    SUSPEND = $(MachineBusy)
    CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
    PREEMPT = (Activity == "Suspended") && $(ActivityTimer) > 300
    KILL = $(ActivityTimer) > 300
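To check how these expressions behave for a given machine state, the macros and policy can be transcribed into Python. This is a sketch for reasoning about the policy, not how condor_startd evaluates it. One assumption is labeled in the code: the slides use CPU_Busy without defining it, so the sketch assumes it means the non-Condor load is at or above HighLoad.

```python
# Transcription of the desktop policy into Python for experimentation.
# Attribute names follow the slides; the cpu_busy definition is an ASSUMPTION.
BACKGROUND_LOAD = 0.3
HIGH_LOAD = 0.5


def policy(machine):
    """Evaluate the five policy expressions for one machine-state dictionary."""
    non_condor_load = machine["LoadAvg"] - machine["CondorLoadAvg"]
    keyboard_busy = machine["KeyboardIdle"] < 10
    cpu_idle = non_condor_load <= BACKGROUND_LOAD
    cpu_busy = non_condor_load >= HIGH_LOAD  # assumed; not defined on the slide
    machine_busy = cpu_busy or keyboard_busy
    activity_timer = machine["CurrentTime"] - machine["EnteredCurrentActivity"]
    return {
        "START": cpu_idle and machine["KeyboardIdle"] > 300,
        "SUSPEND": machine_busy,
        "CONTINUE": cpu_idle and machine["KeyboardIdle"] > 120,
        "PREEMPT": machine["Activity"] == "Suspended" and activity_timer > 300,
        "KILL": activity_timer > 300,
    }


idle_desktop = {"LoadAvg": 0.1, "CondorLoadAvg": 0.0, "KeyboardIdle": 600,
                "CurrentTime": 1000, "EnteredCurrentActivity": 900,
                "Activity": "Idle"}
owner_is_back = {"LoadAvg": 1.0, "CondorLoadAvg": 0.9, "KeyboardIdle": 2,
                 "CurrentTime": 1400, "EnteredCurrentActivity": 1000,
                 "Activity": "Suspended"}

print(policy(idle_desktop))
print(policy(owner_is_back))
```

The second state shows the keyboard check at work: the non-Condor load is low, but recent typing alone makes the machine busy, and a suspension lasting more than 300 seconds makes PREEMPT true.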

90 www.cs.wisc.edu/condor Policy Review › Users submitting jobs can specify Requirements and Rank expressions › Administrators can specify Startd Policy expressions individually for each machine (Start,Suspend,etc) › Expressions can use any job or machine ClassAd attribute › Custom attributes easily added › Bottom Line: Enforce almost any policy!

91 www.cs.wisc.edu/condor Administrator Commands
› condor_vacate: Leave a machine now
› condor_on: Start Condor
› condor_off: Stop Condor
› condor_reconfig: Reconfig on-the-fly
› condor_config_val: View/set config
› condor_userprio: User priorities
› condor_stats: View detailed usage accounting stats

92 www.cs.wisc.edu/condor Questions So Far?

93 www.cs.wisc.edu/condor Security in Condor › Since version 6.3.3, Condor has greatly improved security › Multiple authentication methods:  X509 (Using GSI)  Kerberos  Filesystem (shared filesystem, known user) › Encryption:  3DES  Blowfish

94 www.cs.wisc.edu/condor Security in Condor › Authentication  Based on users, with optional wildcards: roy@cs.wisc.edu, *@cs.wisc.edu  Users can be given different permissions: Read, Write, Administrator, Config
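The wildcard idea on this slide can be illustrated with Python's standard `fnmatch` module. This is only an analogy: Condor's actual matching rules and configuration syntax are not shown on the slide, and the `is_allowed` helper and the permission table below are hypothetical.

```python
# Analogy for user patterns with wildcards; not Condor's matching implementation.
from fnmatch import fnmatchcase


def is_allowed(user, patterns):
    """Return True if the user matches any pattern such as '*@cs.wisc.edu'."""
    return any(fnmatchcase(user, pattern) for pattern in patterns)


# Hypothetical permission table using the permission names from the slide.
permissions = {
    "Read": ["*@cs.wisc.edu"],
    "Administrator": ["roy@cs.wisc.edu"],
}

print(is_allowed("roy@cs.wisc.edu", permissions["Administrator"]))
print(is_allowed("frieda@cs.wisc.edu", permissions["Read"]))
print(is_allowed("frieda@cs.wisc.edu", permissions["Administrator"]))
```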

95 www.cs.wisc.edu/condor Version Numbers in Condor › Odd minor numbers are development releases:  6.3.1, 6.3.2, 6.5.0…  Compatibility not guaranteed within a series, like 6.3.x. › Even minor numbers are stable releases  6.2.2, 6.4.0, 6.4.1…  Compatibility guaranteed within a series, like 6.4.x.
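The odd/even rule above is simple enough to encode directly. The Python sketch below applies it to version strings of the form major.minor.patch; the function names are invented for this example and the rule is taken from the slide as stated.

```python
# Encodes the slide's versioning rule: even minor = stable, odd minor = development.

def is_stable(version):
    """Return True if the minor number of 'major.minor.patch' is even."""
    major, minor, patch = (int(part) for part in version.split("."))
    return minor % 2 == 0


def same_series(first, second):
    """Two versions are in the same series if major and minor both agree."""
    return first.split(".")[:2] == second.split(".")[:2]


def compatibility_guaranteed(first, second):
    """Compatibility is guaranteed only within a stable series, e.g. 6.4.x."""
    return same_series(first, second) and is_stable(first)


for version in ("6.3.1", "6.4.0", "6.5.0"):
    print(version, "stable" if is_stable(version) else "development")
print(compatibility_guaranteed("6.4.0", "6.4.1"))
print(compatibility_guaranteed("6.3.1", "6.3.2"))
```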

96 www.cs.wisc.edu/condor Questions? Comments? › Web: www.cs.wisc.edu/condor › Email: condor-admin@cs.wisc.edu

