Upgrading Condor Best Practices


1 Upgrading Condor Best Practices

2 The problem
More frequent releases of Condor
Every six to nine months?
Understand this is a problem for users
We’re willing to help out

3 Overview
Config file management
Condor testing strategies
Standard Universe issues

4 Config files
LOCAL_CONFIG_FILE
Used for #include-like behaviour:
$(HOSTS), $(GLOBAL), $(POLICY)…

5 Typical Config file
## Try to save this much swap space by not starting new shadows.
## Specified in megabytes.
#RESERVED_SWAP = 5
The commented-out line lists the default value

6 Config file editing
Never edit base condor_config file
Except to specify the local file
Put all edits in a local file
One local file per config type
E.g. for schedds, CMs, types of execute machines
Can mix and match
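The "one local file per config type, mix and match" idea can be sketched in Python as follows. This is only an illustration, not a Condor tool: the role names, the file names, and the /etc/condor/local directory are all hypothetical. The one Condor behaviour assumed is that LOCAL_CONFIG_FILE accepts a comma-separated list of files.

```python
# Hypothetical mapping from machine role to its local config file.
# None of these file names come from Condor itself.
ROLE_FILES = {
    "central_manager": "condor_config.cm",
    "schedd": "condor_config.schedd",
    "execute": "condor_config.execute",
}


def local_config_line(roles, config_dir="/etc/condor/local"):
    """Return a LOCAL_CONFIG_FILE setting that lists one file per role."""
    unknown = [role for role in roles if role not in ROLE_FILES]
    if unknown:
        raise ValueError(f"unknown roles: {unknown}")
    paths = [f"{config_dir}/{ROLE_FILES[role]}" for role in roles]
    return "LOCAL_CONFIG_FILE = " + ", ".join(paths)


if __name__ == "__main__":
    # A machine that both submits and executes jobs mixes two role files.
    print(local_config_line(["schedd", "execute"]))
```

The generated line is the single edit that would go into the base condor_config; everything else stays in the per-role files.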

7 Dealing with a new config
Diff base config with your config
Understand new items
Documented in manual version-history
Existing ones rarely change
Usually capacity changes
Almost always, overwriting base file works
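A plain text diff of two base config files is dominated by comment changes, so it can help to compare only the active settings. The sketch below is one possible way to do that in Python. It assumes simple `NAME = value` lines, ignores backslash line continuations, and compares names case-insensitively; the sample settings and values in the usage example are made up for illustration.

```python
def parse_config(text):
    """Parse 'NAME = value' lines, skipping blank lines and '#' comments.

    Simplification: backslash line continuations are not handled.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first '=' only, so values like 'A == B' stay intact.
        name, sep, value = line.partition("=")
        if sep:
            # Assumption: setting names are compared case-insensitively.
            settings[name.strip().upper()] = value.strip()
    return settings


def diff_configs(old_text, new_text):
    """Report settings added, removed, or changed between two configs."""
    old, new = parse_config(old_text), parse_config(new_text)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }


if __name__ == "__main__":
    # Illustrative content only; not taken from a real Condor release.
    old = "#RESERVED_SWAP = 5\nMAX_JOBS_RUNNING = 200\n"
    new = "#RESERVED_SWAP = 5\nMAX_JOBS_RUNNING = 500\nNEW_KNOB = True\n"
    print(diff_configs(old, new))
```

Items reported as "added" are the ones to look up in the manual's version history.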

8 Managing config files
Centralized management is key
Cfengine, rsync, NFS (!), etc.

9 Testing new versions

10 Compatibility Guarantees
No guarantees…
But we try very hard!
Both forward and backward
Especially within one machine
Federation techniques require this

11 Incremental testing!
Three basic components of Condor:
Central Manager
Submit points
Execute machines
Test each independently

12 Testing Central Manager
Take advantage of statelessness
Condor HAD can help out here
If it breaks, existing jobs keep running

13 Testing schedds
Adding a new test schedd is easy
Test jobs useful too, not just sleep
Schedd can be bottleneck
Probably the only place you need to check CPU performance

14 Testing startds
Easy to test a few at once
Be careful when running std uni
Glide in can be very helpful
But beware of root specific issues
Admin slots helpful

15 Now that we’ve tested…
Always be undo-able! (never overwrite files)
Rely on master restart on stat change

16 Big bang approach
What we do at CS
Just change a symlink to the binaries
Master does the rest…
Can be a big hit on shared filesystems
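Switching a symlink keeps the upgrade undo-able, because the old release directory is never overwritten. A minimal Python sketch of such a switch on a POSIX system is shown below. The function name and the paths in the usage example are hypothetical; the technique is simply to create a temporary symlink and rename it over the live one, which replaces the link in a single step.

```python
import os


def switch_release(link_path, new_target):
    """Repoint link_path at new_target and return the previous target.

    The old release directory is left untouched, so rolling back is
    just another call with the returned path.
    """
    previous = os.readlink(link_path) if os.path.islink(link_path) else None
    tmp_link = f"{link_path}.tmp.{os.getpid()}"
    os.symlink(new_target, tmp_link)
    # Renaming over the existing link swaps it in one step on POSIX,
    # so readers never see a missing link.
    os.replace(tmp_link, link_path)
    return previous


if __name__ == "__main__":
    import tempfile

    # Demonstration in a scratch directory; directory names are made up.
    root = tempfile.mkdtemp()
    old_dir = os.path.join(root, "condor-old")
    new_dir = os.path.join(root, "condor-new")
    os.mkdir(old_dir)
    os.mkdir(new_dir)
    link = os.path.join(root, "condor")
    switch_release(link, old_dir)
    rollback_to = switch_release(link, new_dir)
    print("now:", os.readlink(link), "rollback target:", rollback_to)
```

On a shared filesystem every master notices the change at roughly the same time, which is the "big hit" the slide warns about.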

17 Incremental restart
First, restart CM
No jobs lost
Second, reboot schedd
If restart happens within 20 minutes, jobs keep running
What about the startds?
Might be OK for standard uni
Work on this coming soon…

18 Standard Universe
More sensitive to backward compatibility
CheckpointPlatform clarifications
condor_qedit -constraint 'LastCheckpointPlatform =?= "LINUX INTEL 2.6.x normal"' LastCheckpointPlatform '"LINUX INTEL 2.6.x normal 0xffffe000"'
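The condor_qedit command above nests double quotes inside single quotes, which is easy to get wrong when scripted. One way to avoid shell quoting mistakes is to build the argument list directly, as in this Python sketch. The helper name is hypothetical, and the code only constructs and prints the command; running it against a real queue (for example with subprocess.run) is left to the reader.

```python
import shlex


def qedit_checkpoint_platform_args(old_platform, new_platform):
    """Build the condor_qedit argument list for retagging checkpoints.

    The platform strings are embedded as quoted ClassAd string literals,
    mirroring the command shown on the slide.
    """
    constraint = f'LastCheckpointPlatform =?= "{old_platform}"'
    return [
        "condor_qedit",
        "-constraint",
        constraint,
        "LastCheckpointPlatform",
        f'"{new_platform}"',
    ]


if __name__ == "__main__":
    args = qedit_checkpoint_platform_args(
        "LINUX INTEL 2.6.x normal",
        "LINUX INTEL 2.6.x normal 0xffffe000",
    )
    # shlex.join renders the list as a shell-safe command line for review.
    print(shlex.join(args))
```

Passing the list to a process launcher without a shell keeps the inner double quotes intact, so the ClassAd values arrive exactly as written.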

19 Draining old Std Uni
Keep a few old startds around
To finish old standard uni jobs
Set START to “JobUniverse == 1”
Or maybe RANK…
Only on the old platforms

20 When to upgrade?
Zeroth law of software engineering
Development series actually pretty stable
We’ll let you know about security issues
Probably don’t need every minor version
Don’t be more than one major stable version behind

21 In summary…
Keep config files under control
Test each component in isolation
Be aware of standard universe issues

22 Any questions? Thank you!

