
1 Using New Features in Condor 7.2
Condor Project, Computer Sciences Department, University of Wisconsin-Madison

2 Outline (www.cs.wisc.edu/Condor)
› Startd Hooks
› Job Router
› Job Router Hooks
› Power Management
› Dynamic Slot Partitioning
› Concurrency Limits
› Variable Substitution
› Preemption Attributes

3 Startd Job Hooks
› Users wanted to take advantage of Condor's resource management daemon (condor_startd) to run jobs, but they had their own scheduling system
  • Specialized scheduling needs
  • Jobs live in their own database or other storage rather than a Condor job queue

4 Our solution
› Make a system of generic "hooks" that you can plug into:
  • A hook is a point during the life-cycle of a job where the Condor daemons will invoke an external program
  • Hook Condor into your existing job management system without modifying the Condor code

5 How does Condor communicate with hooks?
› ASCII ClassAds are passed via standard input and standard output
› Some hooks also get control data via a command-line argument (argv)
› Hooks can be written in any language (scripts, binaries, whatever you want), as long as they can read and write stdin/stdout
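Because the protocol is just attribute/value lines on stdin and stdout, the hook side needs very little machinery. A minimal sketch in Python; the helpers below are illustrative (they handle only flat "Attr = value" lines, not full ClassAd expressions):

```python
# Minimal sketch of the hook side of the protocol: parse an ASCII ClassAd
# arriving on stdin and format one for stdout.  These helpers are ours,
# not a Condor API, and they ignore nested ClassAd expressions.

def parse_classad(text):
    """Turn 'Attr = value' lines into a dict, stripping surrounding quotes."""
    ad = {}
    for line in text.splitlines():
        attr, sep, value = line.partition("=")
        if sep:
            ad[attr.strip()] = value.strip().strip('"')
    return ad

def format_classad(ad):
    """Render a dict back into 'Attr = "value"' lines for stdout."""
    return "\n".join('%s = "%s"' % (k, v) for k, v in sorted(ad.items()))
```

A real hook would call parse_classad(sys.stdin.read()) on startup and print format_classad(...) before exiting.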

6 What hooks are available?
› Hooks for fetching work (startd):
  • FETCH_WORK
  • REPLY_FETCH
  • EVICT_CLAIM
› Hooks for running jobs (starter):
  • PREPARE_JOB
  • UPDATE_JOB_INFO
  • JOB_EXIT

7 HOOK_FETCH_WORK
› Invoked by the startd whenever it wants to try to fetch new work
  • The FetchWorkDelay expression controls how often
› Stdin: slot ClassAd
› Stdout: job ClassAd
› If stdout is empty, there is no work
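A fetch hook could be sketched like this; next_job() is a stand-in for your own database or queue lookup, and the job attributes are illustrative:

```python
# Sketch of a fetch-work hook: consume the slot ClassAd on stdin and,
# if the external queue has work, print a job ClassAd on stdout.
# next_job() stands in for your own database/queue query.

def next_job(slot_ad_text):
    """Return a job ad as 'Attr = value' lines, or None if no work."""
    # Assumption: your external store decides based on the slot ad.
    pending = [{"Cmd": "/bin/sleep", "Arguments": "60", "Owner": "nobody"}]
    if not pending:
        return None
    job = pending[0]
    return "\n".join('%s = "%s"' % (k, v) for k, v in sorted(job.items()))

def run_hook(stdin_text):
    """Entry point: return exactly what the hook should write to stdout."""
    ad = next_job(stdin_text)
    return "" if ad is None else ad + "\n"   # empty stdout = no work
```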

8 HOOK_REPLY_FETCH
› Invoked by the startd once it decides what to do with the job ClassAd returned by HOOK_FETCH_WORK
› Gives your external system a chance to know what happened
› argv[1]: "accept" or "reject"
› Stdin: slot and job ClassAds
› Stdout: ignored

9 HOOK_EVICT_CLAIM
› Invoked if the startd has to evict a claim that is running fetched work
› Informational only: you cannot stop or delay this train once it has left the station
› Stdin: both slot and job ClassAds
› Stdout: ignored

10 HOOK_PREPARE_JOB
› Invoked by the condor_starter when it first starts up (only if defined)
› An opportunity to prepare the job execution environment
  • Transfer input files, executables, etc.
› Stdin: both slot and job ClassAds
› Stdout: ignored, but the starter will not continue until this hook exits
› Not specific to fetched work

11 HOOK_UPDATE_JOB_INFO
› Periodically invoked by the starter to let you know what is happening with the job
› Stdin: slot and job ClassAds
  • The job ClassAd is updated with additional attributes computed by the starter: ImageSize, JobState, RemoteUserCpu, etc.
› Stdout: ignored

12 HOOK_JOB_EXIT
› Invoked by the starter whenever the job exits for any reason
› argv[1] indicates what happened:
  • "exit": died a natural death
  • "evict": booted off prematurely by the startd (PREEMPT == TRUE, condor_off, etc.)
  • "remove": removed by condor_rm
  • "hold": held by condor_hold
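The exit reasons above arrive as argv[1], so the hook is essentially a dispatch on that string. A sketch; the returned actions are stand-ins for whatever your external system does:

```python
# Sketch of a JOB_EXIT hook dispatching on argv[1].  The action strings
# stand in for however your external system records the job's fate.

EXIT_REASONS = ("exit", "evict", "remove", "hold")

def handle_job_exit(reason, job_ad_text):
    """Decide what the external system should do; returns the action taken."""
    if reason not in EXIT_REASONS:
        raise ValueError("unexpected argv[1]: %r" % reason)
    if reason == "exit":
        return "mark job finished"      # natural death
    if reason == "evict":
        return "requeue job"            # booted off by the startd
    return "drop job"                   # remove/hold: the user took it away
```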

13 HOOK_JOB_EXIT, continued
› "Huh?! condor_rm? What are you talking about?"
  • The starter hooks can be defined even for regular Condor jobs, local universe, etc.
› Stdin: a copy of the job ClassAd with extra attributes about what happened: ExitCode, JobDuration, etc.
› Stdout: ignored

14 Defining hooks
› Each slot can have its own hook "keyword"
  • A prefix for config file parameters
  • Can use different sets of hooks to talk to different external systems on each slot
  • A global keyword is used when the per-slot keyword is not defined
› The keyword is inserted by the startd into its copy of the job ClassAd and given to the starter

15 Defining hooks: example

# Most slots fetch work from the database system
STARTD_JOB_HOOK_KEYWORD = DATABASE

# Slot4 fetches and runs work from a web service
SLOT4_JOB_HOOK_KEYWORD = WEB

# The database system needs to both provide work and
# know the reply for each attempted claim
DB_DIR = /usr/local/condor/fetch/db
DATABASE_HOOK_FETCH_WORK = $(DB_DIR)/fetch_work.php
DATABASE_HOOK_REPLY_FETCH = $(DB_DIR)/reply_fetch.php

# The web system only needs to fetch work
WEB_DIR = /usr/local/condor/fetch/web
WEB_HOOK_FETCH_WORK = $(WEB_DIR)/fetch_work.php

16 Semantics of fetched jobs
› The condor_startd treats them just like any other kind of job:
  • All the standard resource policy expressions apply (START, SUSPEND, PREEMPT, RANK, etc.)
  • Fetched jobs can coexist in the same pool with jobs pushed by Condor, COD, etc.
  • Fetched work != backfill

17 Semantics, continued
› If the startd is unclaimed and fetches a job, a claim is created
› If that job completes, the claim is reused and the startd fetches again
› Fetching continues until either:
  • The claim is evicted by Condor
  • The fetch hook returns no more work
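The claim life-cycle above can be sketched as a simple loop; fetch_hook and run_job here are stand-ins, not Condor calls:

```python
# Sketch of the claim life-cycle: fetch, run, and reuse the claim until
# Condor evicts it or the fetch hook reports no more work.
# fetch_hook() returning None models the hook's empty stdout.

def serve_claim(fetch_hook, run_job, evicted=lambda: False):
    """Return how many fetched jobs ran on one claim."""
    ran = 0
    while not evicted():
        job = fetch_hook()
        if job is None:          # empty stdout from the hook: no more work
            break
        run_job(job)
        ran += 1                 # the claim is reused for the next fetch
    return ran
```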

18 Limitations of the hooks
› If the starter cannot run your fetched job because your ClassAd is bogus, no hook is invoked to tell you about it
  • We need a HOOK_STARTER_FAILURE
› There is no hook when the starter is about to evict you (so you can checkpoint)
  • You can implement this yourself with a wrapper script and the SoftKillSig attribute

19 Job Router
› An automated way to let jobs run on a wider array of resources
  • Transform jobs into different forms
  • Reroute jobs to different destinations

20 What is "job routing"?

Original (vanilla) job:
  Universe = "vanilla"
  Executable = "sim"
  Arguments = "seed=345"
  Output = "stdout.345"
  Error = "stderr.345"
  ShouldTransferFiles = True
  WhenToTransferOutput = "ON_EXIT"

Routed (grid) job:
  Universe = "grid"
  GridType = "gt2"
  GridResource = "cmsgrid01.hep.wisc.edu/jobmanager-condor"
  Executable = "sim"
  Arguments = "seed=345"
  Output = "stdout"
  Error = "stderr"
  ShouldTransferFiles = True
  WhenToTransferOutput = "ON_EXIT"

The JobRouter consults a routing table (Site 1, Site 2, ...) to transform the original job into the routed job, and feeds the final status back to the original.

21 Routing is just site-level matchmaking
› With feedback from the job queue:
  • The number of jobs currently routed to site X
  • The number of idle jobs routed to site X
  • The rate of recent success/failure at site X
› And with the power to modify the job ad:
  • Change attribute values (e.g. Universe)
  • Insert new attributes (e.g. GridResource)
  • Add a "portal" grid proxy if desired
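That feedback loop can be sketched as a route-selection function. The dict keys mirror the routing-table attributes shown on the syntax slides (MaxIdleJobs, FailureRateThreshold); the scoring rule itself (prefer the least-loaded viable site) is our illustrative choice, not the JobRouter's actual algorithm:

```python
# Sketch of feedback-driven route choice: skip sites that are at their
# idle cap or failing too often, then prefer the least-loaded site.
# Illustrative only -- the real JobRouter's policy is richer.

def choose_route(routes):
    """routes: dicts with name, idle, routed, max_idle, fail_rate, threshold."""
    viable = [r for r in routes
              if r["idle"] < r["max_idle"]
              and r["fail_rate"] <= r["threshold"]]
    if not viable:
        return None
    return min(viable, key=lambda r: r["routed"])["name"]
```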

22 Configuring the Routing Table
› JOB_ROUTER_ENTRIES
  • List site ClassAds in the configuration file
› JOB_ROUTER_ENTRIES_FILE
  • Read site ClassAds periodically from a file
› JOB_ROUTER_ENTRIES_CMD
  • Read periodically from a script
  • Example: query a collector such as the Open Science Grid Resource Selection Service

23 Syntax
› A list of sites in new ClassAd format:
  [ Name = "Grid Site 1"; … ]
  [ Name = "Grid Site 2"; … ]
  [ Name = "Grid Site 3"; … ]
  …

24 Syntax, continued
[
  Name = "Site 1";
  GridResource = "gt2 gk.foo.edu";
  MaxIdleJobs = 10;
  MaxJobs = 200;
  FailureRateThreshold = 0.01;
  JobFailureTest = other.RemoteWallClockTime < 1800;
  Requirements = target.WantJobRouter is True;
  delete_WantJobRouter = true;
  set_PeriodicRemove = JobStatus == 5;
]

25 What types of input jobs?
› Vanilla universe
› Self-contained (everything needed is in the file transfer list)
› High-throughput (many more jobs than CPUs)

26 Grid Gotchas
› Globus gt2
  • No exit status from the job (reported as 0)
› Most grid universe types
  • Must explicitly list desired output files

27 JobRouter vs. Glidein
› Glidein: Condor overlays the grid
  • Jobs never wait in a remote queue
  • Jobs run in their normal universe
  • Private networks are doable, but add to the complexity
  • Needs something to submit glideins on demand
› JobRouter
  • Some jobs wait in the remote queue (MaxIdleJobs)
  • Jobs must be compatible with the target grid's semantics
  • Simple to set up, fully automatic to run

28 Job Router Hooks
› Truly transform jobs, not just reroute them
  • E.g. stuff a job into a virtual machine (either VM universe or Amazon EC2)
› Hooks are invoked like the startd ones

29 HOOK_TRANSLATE
› Invoked when a job is matched to a route
› Stdin: route name and job ad
› Stdout: transformed job ad
› The transformed job is submitted to Condor
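The core of a translate hook is the ad transformation itself, like the vanilla-to-grid rewrite on the "What is job routing?" slide. A sketch; the attribute names come from the earlier slides, the function itself is ours:

```python
# Sketch of the transformation a translate hook performs: turn a
# vanilla-universe job ad into a grid-universe ad bound to one route.
# Parsing/printing the ads around this call is omitted.

def translate(job_ad, grid_resource):
    """Return a routed copy of job_ad; the original is left untouched."""
    routed = dict(job_ad)                  # never mutate the original ad
    routed["Universe"] = "grid"
    routed["GridResource"] = grid_resource
    return routed
```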

30 HOOK_UPDATE_JOB_INFO
› Invoked periodically to obtain extra information about the routed job
› Stdin: routed job ad
› Stdout: attributes to update in the routed job ad

31 HOOK_JOB_FINALIZE
› Invoked when the routed job has completed
› Stdin: ads of the original and routed jobs
› Stdout: the modified original job ad, or nothing (no updates)

32 HOOK_JOB_CLEANUP
› Invoked when the original job is returned to the schedd (both success and failure)
› Stdin: original job ad
› Use for cleanup of external resources

33 Power Management
› Hibernate execute machines when they are not needed
› Condor does not handle waking machines up yet
› The information needed to wake machines is available in the machine ads

34 Configuring Power Management
› HIBERNATE
  • An expression evaluated periodically by all slots to decide when to hibernate
  • All slots must agree to hibernate
› HIBERNATE_CHECK_INTERVAL
  • The number of seconds between hibernation checks

35 Setting HIBERNATE
› HIBERNATE must evaluate to one of these strings:
  • "NONE", "0"
  • "S1", "1", "STANDBY", "SLEEP"
  • "S2", "2"
  • "S3", "3", "RAM", "MEM"
  • "S4", "4", "DISK", "HIBERNATE"
  • "S5", "5", "SHUTDOWN"
› The S-numbers are ACPI power states

36 Power Management on Linux
› On Linux, these methods are tried in order to set the power level:
  • The pm-utils tools
  • /sys/power
  • /proc/acpi
› LINUX_HIBERNATION_METHOD can be set to pick a preferred method

37 Sample Configuration

ShouldHibernate = \
  ((KeyboardIdle > $(StartIdleTime)) \
  && $(CPUIdle) \
  && ($(StateTimer) > (2 * $(HOUR))))
HIBERNATE = ifThenElse( \
  $(ShouldHibernate), "RAM", "NONE" )
HIBERNATE_CHECK_INTERVAL = 300
LINUX_HIBERNATION_METHOD = "/proc"

38 Dynamic Slot Partitioning
› Divide slots into chunks sized for the matched jobs
› Readvertise the remaining resources
› The partitionable resources are cpus, memory, and disk

39 How It Works
› When a match is made:
  • A new sub-slot is created for the job and advertised
  • The slot is readvertised with its remaining resources
› A slot can be partitioned multiple times
› The original slot ad never enters the Claimed state
  • But it may eventually have too few resources to be matched
› When the claim on a sub-slot is released, its resources are added back to the original slot
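The split step above amounts to subtracting the request from the partitionable slot. A sketch, using the three partitionable resources; the function and its return convention are ours:

```python
# Sketch of the slot split: carve a sub-slot sized for the request out
# of the partitionable slot, leaving the remainder to be readvertised.
# Resource names follow the request_* submit commands; logic is ours.

def partition(slot, request):
    """Return (sub_slot, remaining_slot), or None if the slot is too small."""
    if any(slot[r] < request[r] for r in request):
        return None                      # too few resources to be matched
    sub = dict(request)
    rest = {r: slot[r] - request.get(r, 0) for r in slot}
    return sub, rest
```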

40 Configuration
› Resources are still statically partitioned between slots
› SLOT_TYPE_<N>_PARTITIONABLE
  • Set to True to enable dynamic partitioning within the indicated slot type

41 New Machine Attributes
› In the original slot's machine ad:
  • PartitionableSlot = True
› In the ads for dynamically created slots:
  • DynamicSlot = True
› You can reference these in startd policy expressions

42 Job Submit File
› Jobs can request how much of each partitionable resource they need:
  • request_cpus = 3
  • request_memory = 1024
  • request_disk = 10240
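Put together, a submit file for a partitionable pool might look like the sketch below. The executable and arguments are illustrative; the units (memory in MB, disk in KB) follow Condor's usual conventions but should be checked against your version's manual:

```
# Hypothetical submit file for a pool with partitionable slots
universe   = vanilla
executable = sim
arguments  = seed=345

# Ask for a slice of a partitionable slot
request_cpus   = 3
request_memory = 1024    # MB
request_disk   = 10240   # KB

queue
```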

43 Dynamic Partitioning Caveats
› Cannot preempt the original slot or a group of sub-slots
  • Potential starvation of jobs with large resource requirements
› Partitioning happens once per slot in each negotiation cycle
  • Scheduling of large slots may be slow

44 Concurrency Limits
› Limit job execution based on admin-defined consumable resources
  • E.g. licenses
› Can have many different limits
› Jobs say what resources they need
› The negotiator enforces the limits pool-wide

45 Concurrency Example
› Negotiator config file:
  MATLAB_LIMIT = 5
  NFS_LIMIT = 20
› Job submit file:
  concurrency_limits = matlab,nfs:3
  • This requests 1 Matlab token and 3 NFS tokens
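The concurrency_limits value decomposes mechanically: each comma-separated entry is "name" or "name:count", defaulting to one token. A sketch of that parsing (the function is ours; the syntax is from the example above):

```python
# Sketch of how a 'concurrency_limits' submit value decomposes into
# per-resource token counts, e.g. "matlab,nfs:3" -> 1 matlab + 3 nfs.

def parse_concurrency_limits(value):
    """Return a dict mapping limit name to requested token count."""
    tokens = {}
    for entry in value.split(","):
        name, sep, count = entry.strip().partition(":")
        tokens[name] = int(count) if sep else 1   # default is one token
    return tokens
```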

46 New Variable Substitution
› $$(Foo) in a submit file
  • Existing feature
  • The attribute Foo from the machine ad is substituted
› $$([Memory * 0.9]) in a submit file
  • New feature
  • The expression is evaluated against the machine ad and the result is substituted

47 More Info For Preemption
› New attributes are available for these preemption expressions in the negotiator:
  • PREEMPTION_REQUIREMENTS
  • PREEMPTION_RANK
› These expressions control preemption due to user priorities

48 Preemption Attributes
› Submitter/RemoteUserPrio
  • User priority of the candidate and running jobs
› Submitter/RemoteUserResourcesInUse
  • Number of slots in use by the user of each job
› Submitter/RemoteGroupResourcesInUse
  • Number of slots in use by each user's group
› Submitter/RemoteGroupQuota
  • Slot quota for each user's group

49 Thank You!
› Any questions?

