Condor Week 2004 The use of Condor at the CDF Analysis Farm Presented by Sfiligoi Igor on behalf of the CAF group.

1 Condor Week 2004 The use of Condor at the CDF Analysis Farm Presented by Sfiligoi Igor on behalf of the CAF group

2 The CAF
● Develop, debug and submit on the same machine
● Output to any place the user wants
➔ No need to stay connected
[Diagram labels: submit, CAF]

3 The user side
● User groups executable, data and libraries in a directory
● The directory is tarred up
● Tar-ball sent via kerberized socket connection
● Job split into several sections
  ○ Same executable, different parameters
mymachine> ls mydir
myexe  mydata.conf  lib/libfirst.so  lib/libsecond.so
[Diagram labels: mydir, mytar.tgz, CAF, 1000-3534, JobID]
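
The user-side steps above (pack the directory, then split the job into sections) can be sketched as follows. This is a minimal illustration, not the CAF client: the function names make_tarball and section_parameters are hypothetical, the kerberized socket transfer is left out, and the range 1000-3534 is taken from the slide as an example section range.

```python
import tarfile
from pathlib import Path


def make_tarball(src_dir: str, tar_path: str) -> None:
    """Pack everything under src_dir into a gzipped tar-ball.

    tar_path should live outside src_dir so the archive does not
    try to include itself.
    """
    src = Path(src_dir)
    with tarfile.open(tar_path, "w:gz") as tar:
        # arcname="." stores relative paths, so the archive can be
        # unpacked into any directory on the server or worker node.
        tar.add(src, arcname=".")


def section_parameters(first: int, last: int) -> list[int]:
    """Return one parameter per section, inclusive of both ends.

    Every section runs the same executable; only this number differs.
    """
    return list(range(first, last + 1))
```

With the range shown on the slide, section_parameters(1000, 3534) yields 2535 sections.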

4 The server side
● Authenticate user using kerberos
● Receive tar-ball and put it in a local dir
● Create submit description files
● Submit to Condor
Server> ls submit_dir
data.tgz  CafExe  job.dag  dagman.ClassAd  section_1000.ClassAd ... section_3534.ClassAd
[Diagram labels: User, Submitter, Condor, mytar.tgz, data, 1000-3534, condor_submit, JobID]

5 Condor submission
● Every job has its own staging directory
● Using dagman
  ○ Script creates dagman submit description file
  ○ Plus one description file per section
  ○ Final cleanup script removes tar-ball
● Flat DAG, with only the cleanup script as child
● Using kerberos service principals for authentication
  ○ Don't want to have a Unix uid for every user
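
A flat DAG of this shape can be written with DAGMan's JOB and PARENT ... CHILD statements. The sketch below generates such a file under some assumptions: the per-section file names follow the section_N.ClassAd pattern from the server-side listing, while the node name "cleanup" and the file cleanup.submit are hypothetical. The CAF may have wired in its cleanup step differently.

```python
def flat_dag(sections: list[int], cleanup_submit: str = "cleanup.submit") -> str:
    """Build the text of a flat DAG: N independent sections, one cleanup child."""
    lines = []
    node_names = []
    for number in sections:
        name = f"section_{number}"
        node_names.append(name)
        # One DAG node per section, each with its own submit description file.
        lines.append(f"JOB {name} {name}.ClassAd")
    # Hypothetical cleanup node; it runs only after every section has finished.
    lines.append(f"JOB cleanup {cleanup_submit}")
    lines.append(f"PARENT {' '.join(node_names)} CHILD cleanup")
    return "\n".join(lines) + "\n"
```

Because no section depends on another, the only dependency line is the one that makes the cleanup node a child of all sections.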

6 Job specifics: ''Transfer In''
● Using Condor transfer mechanism to transfer
  ○ Tar-ball
  ○ The startup wrapper
  ○ A kerberos keytab
● Encryption needed for the keytab file
● Using VMx_USER
● Kerberos used for outside authentication
  ○ User specific service principal extracted from the keytab
  ○ Keytab removed before user executable starts

7 Job specifics: ''Transfer Out''
● Queued rcp used to copy output to user specified location
  ○ Section output too big for the head node
  ○ Original submission machine may be down
● Backup file server tried if first rcp fails
● Condor transfer mechanism used only to get the section log and summary files
● If all rcp attempts fail, data are transferred to the head node as the last resort
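
The fallback order described above (user location first, then the backup file server, then the head node) can be sketched as a loop over destinations. This is an illustration only: ship_output and rcp_copy are hypothetical names, the queueing of rcp requests is not modelled, and the sketch assumes an rcp binary is available on the worker node.

```python
import subprocess
from typing import Callable, Optional, Sequence


def rcp_copy(src: str, dest: str) -> bool:
    """Copy src to dest with rcp; True only if rcp exits with status 0."""
    try:
        return subprocess.run(["rcp", src, dest]).returncode == 0
    except OSError:
        # rcp is missing or could not be started on this machine.
        return False


def ship_output(
    src: str,
    destinations: Sequence[str],
    copy: Callable[[str, str], bool] = rcp_copy,
) -> Optional[str]:
    """Try each destination in order and return the first that succeeds.

    None means every copy failed, so the caller should fall back to
    returning the data to the head node as the last resort.
    """
    for dest in destinations:
        if copy(src, dest):
            return dest
    return None
```

Passing the copy function as a parameter keeps the ordering logic testable without a network.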

8 Mailer
● Implemented as a separate process
  ○ Only one mail for the whole DAG
  ○ A mail must be generated even if the job is removed
● CAF specific information included
● Has a list of dagmans to watch
● Generates a mail when a dagman ends
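
One polling pass of such a mailer might look like the sketch below. It is an assumption-laden illustration: poll_dagmans, has_ended and send_mail are hypothetical names, and how the real mailer detects that a dagman has ended (or was removed) is not described on the slide.

```python
from typing import Callable


def poll_dagmans(
    watched: set[str],
    has_ended: Callable[[str], bool],
    send_mail: Callable[[str], None],
) -> set[str]:
    """Mail once for each watched dagman that has ended.

    Returns the dagmans that are still running, which become the
    watch list for the next pass. A dagman that ended because the
    job was removed is treated like any other ended dagman.
    """
    still_running = set()
    for dag_id in sorted(watched):
        if has_ended(dag_id):
            # One mail per DAG, never one per section.
            send_mail(dag_id)
        else:
            still_running.add(dag_id)
    return still_running
```

Dropping ended dagmans from the returned set is what guarantees a single mail per DAG across repeated passes.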

9 Monitoring data: job information
● condor_q too expensive
● Parsing log files
  ○ One global log of all dagman submits
  ○ One submit log for every job
  ○ CAF specific log files
[Diagram: log file layout]
dagman.log
dagman 1 (Section 1 ... Section k): job_1/job.log, job_1/section_1.out, job_1/section_k.out
dagman 2 (Section 1 ... Section h): job_2/job.log, job_2/section_1.out, job_2/section_h.out
dagman n (Section 1 ... Section j): job_n/job.log, job_n/section_1.out, job_n/section_j.out
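
Deriving job state from a submit log instead of calling condor_q can be sketched as below. The parser assumes the classic Condor user log layout, in which each event starts with a header line such as "001 (cluster.proc.subproc) date time text" and ends with a line of three dots. The function name latest_states, the subset of event codes handled, and the sample log are illustrative assumptions, not CAF code or real CAF data.

```python
import re

# Header line of one event: code, (cluster.proc.subproc), date, time, text.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\S+) (\S+) (.*)$")

# Only the event codes this sketch cares about.
STATES = {
    "000": "submitted",
    "001": "running",
    "005": "terminated",
    "009": "removed",
}

# Invented sample, for illustration only.
SAMPLE_LOG = """\
000 (101.000.000) 03/10 09:15:32 Job submitted from host: <10.0.0.1:1234>
...
001 (101.000.000) 03/10 09:20:01 Job executing on host: <10.0.0.2:5678>
...
000 (102.000.000) 03/10 09:16:00 Job submitted from host: <10.0.0.1:1234>
...
005 (101.000.000) 03/10 10:00:00 Job terminated.
\t(1) Normal termination (return value 0)
...
"""


def latest_states(log_text: str) -> dict[str, str]:
    """Map 'cluster.proc' to the state implied by its most recent event."""
    states: dict[str, str] = {}
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match and match.group(1) in STATES:
            job_id = f"{int(match.group(2))}.{int(match.group(3))}"
            # Later events overwrite earlier ones for the same job.
            states[job_id] = STATES[match.group(1)]
    return states
```

Reading an append-only log this way costs a file scan on the monitoring side rather than a query against the schedd.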

10 Monitoring information: system
VM information
● condor_status cheap enough
● Used to map back which section runs where
  ○ Not enough information in the log files
Priorities
● condor_userprio used for user priorities
● Section priorities maintained in submit description files

11 Monitoring: command line
● Logically mimics a unix shell
  ○ jobs
  ○ ls, tail, cat
  ○ top
  ○ gdb
● COD used to send requests to the worker node
[Diagram labels: User, Monitor, CafExe, CafRout, top, COD, Write pipe]

12 Monitoring: web interface
● Polling method used
● Web pages dynamically generated based on snapshot
● History data maintained using RRD (Round Robin Database)
● See demo: http://cdfcaf.fnal.gov/condorcaf/user.html

13 User intervention
● Command line tools for user administration
  ○ Kill a job
  ○ Kill one or more sections
  ○ Change relative priority
  ○ Change timeout
● Unix-like

14 CDF CAF in numbers
● At present
  ○ 180 nodes
  ○ 6 VMs per node
  ○ 5 used, 1 for test
  ○ Total 900 in use
● By month end
  ○ Additional 160 nodes
  ○ Total 1700 in use (goal: 5000 VMs by year end)
● 100s of users
  ○ At least 50 active at any time
● 100s of CAF jobs in queue typical
  ○ Gives 10k-100k sections
  ○ -maxidle keeps Condor jobs below 10k
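
The VM totals above follow from the node counts, with 5 of the 6 VMs on each node available to users. A short check of that arithmetic, using only the figures from the slide (the constant and function names are illustrative):

```python
VMS_PER_NODE = 6
TEST_VMS_PER_NODE = 1
USABLE_VMS_PER_NODE = VMS_PER_NODE - TEST_VMS_PER_NODE  # 5 VMs per node for users


def vms_in_use(nodes: int) -> int:
    """Number of user-facing VMs for a given node count."""
    return nodes * USABLE_VMS_PER_NODE


current_nodes = 180
additional_nodes = 160

current_total = vms_in_use(current_nodes)                       # 180 * 5
month_end_total = vms_in_use(current_nodes + additional_nodes)  # 340 * 5
```

This gives 900 VMs at present and 1700 once the additional 160 nodes are in, matching the totals quoted on the slide.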

15 Condor configuration
● Single schedd
  ○ 10k jobs, 100 dagmans
  ○ 1k VMs, 200 nodes
➔ Single most demanding piece of the system
● Kerberos authentication
● Vanilla universe jobs
  ○ Preemption in the first minutes only
● Condor tuning
  ○ Relaxed timeouts
  ○ Delay between submissions in dagman
  ○ Optimized kerberos authentication
  ○ Schedd autoclustering
  ○ Per-file encryption
● Using a pre-release version

16 Missing: Group accounting
● Several institutions in the collaboration
  ○ Common pool financed by all
  ○ Several pools financed by a single institution (15)
● Different contributions
  ○ Some more than 100 nodes
  ○ Some only a few
● Users can run in different pools
  ○ Owners must get priority treatment
  ○ CPU used by owners in the private pool must not influence the priority in the common pool

17 Possible solution: flocking?
Proposal
● One pool for common use
● One pool for every institution
  ○ Owners preferred
● Flocking between the common pool and any of the other pools
Problems
● A management nightmare
● Small pools penalized
  ○ No preemption
● Unfair accounting for stolen CPU

18 New feature: Hierarchical priorities
● Hierarchical priorities
  ○ A tree of policies
● Each node can have a different policy
  ○ Current fair share
  ○ Ranking
  ○ Belong-to
  ○ Quotas (up to x VMs)
[Diagram labels: Job; Fair share; Common; MIT; Quota 12 VMs; Allow only MIT users; Quota 900 VMs; Allow only CDF users]
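
The idea of a tree in which each node carries its own policy can be illustrated with a small data structure. This is a sketch of the concept only, not a Condor API: PolicyNode, admits and eligible_leaves are hypothetical names, only the belong-to and quota policies are modelled, and the pairing of quotas to nodes (12 VMs for MIT, 900 VMs for Common) is an assumption read from the diagram labels.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PolicyNode:
    name: str
    quota_vms: Optional[int] = None            # "Quotas (up to x VMs)"; None = no quota
    allowed_groups: Optional[set[str]] = None  # "Belong-to"; None = anyone
    in_use: int = 0                            # VMs currently claimed under this node
    children: list["PolicyNode"] = field(default_factory=list)

    def admits(self, user_groups: set[str]) -> bool:
        """Check this node's own policy, ignoring its children."""
        if self.allowed_groups is not None and not (self.allowed_groups & user_groups):
            return False
        if self.quota_vms is not None and self.in_use >= self.quota_vms:
            return False
        return True


def eligible_leaves(node: PolicyNode, user_groups: set[str]) -> list[str]:
    """Names of the leaf nodes a job from this user could be placed under."""
    if not node.admits(user_groups):
        return []
    if not node.children:
        return [node.name]
    found: list[str] = []
    for child in node.children:
        found.extend(eligible_leaves(child, user_groups))
    return found


# Assumed reading of the diagram: a fair-share root with two quota'd children.
common = PolicyNode("Common", quota_vms=900, allowed_groups={"CDF"})
mit = PolicyNode("MIT", quota_vms=12, allowed_groups={"MIT"})
root = PolicyNode("Fair share", children=[common, mit])
```

Under these assumptions a user who belongs to both CDF and MIT is eligible under both children, while a CDF-only user is eligible only under Common. Choosing among the eligible nodes by fair share or ranking is left out of the sketch.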

19 Hierarchical priorities: Advantages
● Easy to manage
● Very flexible
● Allows for use of roles
[Diagram labels: Our Pool; CDF common; CDF MIT; CMS; ATLAS; CDF; Grid3; john#CDF/MITCDF/CMSGRID; john#CDF/MITCDF; john#CMSGRID; john#CDF; john#MITCDF; john#CMSGRID; igor#CDF/INFNCDF; igor#CDF]

20 Future
● Better use of dagman
  ○ Wait for data to be staged
  ○ Merge section output
  ○ Expose DAG to users
● Use COD for interactive use
  ○ PEAC prototype at Supercomputing03
● Use glide-in on remote sites that don't want to use Condor
● Opportunistic use of other pools
  ○ Flocking with D0 and Grid3 pools

