Presentation is loading. Please wait.

Presentation is loading. Please wait.

Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel OSG Campus Grids University of Nebraska.

Similar presentations


Presentation on theme: "Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel OSG Campus Grids University of Nebraska."— Presentation transcript:

1 Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel dweitzel@cse.unl.edudweitzel@cse.unl.edu OSG Campus Grids University of Nebraska – Lincoln

2 2013 OSG User School What have we seen? If you do a condor_status on submit: 2 glidein_18928@red- LINUX X86_64 Unclaimed Benchmar 0.880 7933 0+00:00:04 monitor_15002@red- LINUX X86_64 Owner Idle 0.880 793 0+00:01:06 monitor_18449@red- LINUX X86_64 Owner Idle 0.880 793 0+00:00:04 monitor_18928@red- LINUX X86_64 Owner Idle 0.880 793 0+00:00:04 Total Owner Claimed Unclaimed Matched Preempting Backfill INTEL/LINUX 15 8 0 7 0 0 0 X86_64/LINUX 27 14 2 11 0 0 0 Total 42 22 2 18 0 0 0

3 2013 OSG User School What have we seen? If you do a condor_status on submit: 3 glidein_18928@red- LINUX X86_64 Unclaimed Benchmar 0.880 7933 0+00:00:04 monitor_15002@red- LINUX X86_64 Owner Idle 0.880 793 0+00:01:06 monitor_18449@red- LINUX X86_64 Owner Idle 0.880 793 0+00:00:04 monitor_18928@red- LINUX X86_64 Owner Idle 0.880 793 0+00:00:04 Total Owner Claimed Unclaimed Matched Preempting Backfill INTEL/LINUX 15 8 0 7 0 0 0 X86_64/LINUX 27 14 2 11 0 0 0 Total 42 22 2 18 0 0 0

4 2013 OSG User School What have we seen? What does this mean? 15 nodes what are 32bit 27 nodes that are 64bit 4 INTEL/LINUX 15 8 0 7 0 0 0 X86_64/LINUX 27 14 2 11 0 0 0

5 2013 OSG User School Different Architectures OSG computers come in 2 major architectures:  X86_64 – Dominant, 64 bit platform  32bit – Very few, but Executables have problems on the different architectures. 5

6 2013 OSG User School Different Architectures 32bit application -> 32 bit architecture 32bit application -> 64 bit architecture 64bit application -> 64 bit architecture 64bit application -> 32 bit architecture Be smart when you compile and run executables (more in exercise) 6

7 2013 OSG User School Sites that preempt Remember we had all these sites 7

8 2013 OSG User School Sites that preempt What happens if 1 kills your job? 8

9 2013 OSG User School Sites that preempt What if a site goes away? 9 !

10 2013 OSG User School Sites that preempt What if a site goes away? 10 !

11 2013 OSG User School What happens in GlideinWMS? With GlideinWMS, the jobs stick around. Condor will send the jobs to other remaining sites. GGC (Good Guy Condor?) 11

12 Troubleshooting Resources Wednesday Afternoon, 4:00 pm Derek Weitzel dweitzel@cse.unl.edudweitzel@cse.unl.edu OSG Campus Grids University of Nebraska – Lincoln

13 2013 OSG User School From Previous Did your jobs run? 13

14 2013 OSG User School From Previous Did your jobs run? 14

15 2013 OSG User School From Previous Did your jobs run? 15 [dweitzel@osg-ss-submit ~]$ condor_q -hold 4703 -- Submitter: osg-ss-submit.chtc.wisc.edu : : osg-ss-submit.chtc.wisc.edu ID OWNER HELD_SINCE HOLD_REASON 4703.0 armbrust 6/26 15:16 Error from slot1@glow-c015.cs.wisc.edu: Failed to execute '/var/lib/condor/execute/slot1/dir_1698/condor_exec.exe' with arguments 4 10: (errno=8: 'Exec format error')

16 2013 OSG User School From Previous Did your jobs run? 16 [dweitzel@osg-ss-submit ~]$ condor_q -hold 4703 -- Submitter: osg-ss-submit.chtc.wisc.edu : : osg-ss-submit.chtc.wisc.edu ID OWNER HELD_SINCE HOLD_REASON 4703.0 armbrust 6/26 15:16 Error from slot1@glow-c015.cs.wisc.edu: Failed to execute '/var/lib/condor/execute/slot1/dir_1698/condor_exec.exe' with arguments 4 10: (errno=8: 'Exec format error')

17 2013 OSG User School Goals For this section, I want to cover some common troubleshooting techniques These techniques are widely used by grid users and administrators. 17

18 2013 OSG User School What has happened? Jobs stay idle? Jobs go on hold? Jobs fail on worker nodes? 18

19 2013 OSG User School Jobs on Idle There are some tools to help with finding why jobs are not running. First, check if any available resources are available: 19 $ condor_status

20 2013 OSG User School Jobs on Idle There are some tools to help with finding why jobs are not running. Next, check if the condor knows why your job isn’t running 20 $ condor_q –better-analyze 10.0

21 2013 OSG User School Jobs on Idle There are some tools to help with finding why jobs are not running. Hum… so your jobs should run, ok now what? Look in the job’s log file, has it ran already? Failing? 21

22 2013 OSG User School Jobs on Hold You see your job on hold in the queue 22 $ condor_q -- Submitter: osg-ss-glidein.chtc.wisc.edu : : osg-ss-glidein.chtc.wisc.edu ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 347.0 mhaytmyr 6/26 11:01 0+00:00:06 H 0 0.0 run-blast.sh yeast 404.0 wliu 6/26 13:32 0+00:00:36 I 0 0.0 blast.sh 2 jobs; 1 idle, 0 running, 1 held

23 2013 OSG User School Jobs on Hold What is the hold reason? 23 $ condor_q 347 -format '%s\n' 'HoldReason' Error from glidein_8812@iut2-c159.iu.edu: STARTER at 149.165.225.159 failed to receive file /var/lib/condor/execute/dir_7087/glide_fZ7141/execute/dir_18801/query1: FILETRANSFER:1:No plugin table defined (request was https://twiki.grid.iu.edu/twiki/bin/viewfile/Education/OSGSS2012CondorBLAST/query1)

24 2013 OSG User School Jobs on Hold Each case is different In this case, the user put in their submit file: The Glidein at IU cannot download from http 24 $ condor_q 347 -format '%s\n' 'HoldReason' Error from glidein_8812@iut2-c159.iu.edu: STARTER at 149.165.225.159 failed to receive file /var/lib/condor/execute/dir_7087/glide_fZ7141/execute/dir_18801/query1: FILETRANSFER:1:No plugin table defined (request was https://twiki.grid.iu.edu/twiki/bin/viewfile/Education/OSGSS2012CondorBLAST/query1) transfer_input_files = https://twiki.grid.iu.edu/twiki/bin/viewfile/Education/OSGSS2012CondorBLAST/query1

25 2013 OSG User School Jobs failing on Worker Nodes How to find jobs are failing on worker nodes?  If the output does not match what you expect.  If the jobs seem to be running ‘too fast’ 25

26 2013 OSG User School Jobs failing on Worker Nodes First, can you see anything useful in the output/error: Next, we have to try some further debugging 26 universe = vanilla... output = out error = err... queue

27 2013 OSG User School Jobs failing on Worker Nodes If you are running a wrapper script, can force output on every step It then outputs every step to the stderr, or ‘error’ in your submit file. 27 #!/bin/sh #!/bin/sh -x

28 2013 OSG User School Jobs failing on Worker Nodes Condor can also send you to the worker node using condor_ssh_to_job HUGE!!!! Will see in exercises 28

29 2013 OSG User School Questions? Questions? Comments?  Feel free to ask me questions later: Derek Weitzel Upcoming sessions  4:30 – 5:00  Hands-on exercises  5:00 – 7:00  Dinner  7:00 – 9:00  Optional Evening Session 29


Download ppt "Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel OSG Campus Grids University of Nebraska."

Similar presentations


Ads by Google