Troubleshooting Your Jobs

Troubleshooting Your Jobs Brian Lin OSG Software Team University of Wisconsin - Madison

The grid is your oyster! ...if your workflow isn’t broken.
Photo by brentleimenstoll, CC BY-SA

General Troubleshooting
[Diagram: job life cycle. A job moves from the submit file to Idle, Running, and Complete, with possible Held and Suspended states along the way, and finally into the history file.]

General Troubleshooting
[Figure: example of a LIGO Inspiral DAG]

Keys to Troubleshooting
- Identify the problem and break it into sub-problems
- Know where to find more information:
  - condor_q and its options
  - Output, error, log, and rescue files
  - Google
  - Local support
  - User school list
  - htcondor-users@cs.wisc.edu
  - user-support@opensciencegrid.org

Common Issues

Why can’t I submit my job?
$ condor_submit sleep.sh
Submitting job(s)
ERROR: on Line 2 of submit file:
ERROR: Failed to parse command file (line 2).
Huh? You’ve tried to submit something that wasn’t your submit file.
Solution: Submit your .sub file!
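For reference, a minimal submit file along the lines this example assumes might look like the sketch below (file names and resource requests are illustrative; adjust them for your own job):

```
universe    = vanilla
executable  = sleep.sh
log         = sleep.log
output      = sleep.out
error       = sleep.err

request_cpus   = 1
request_memory = 1GB
request_disk   = 1GB

queue
```

Submitting this file (not the executable) with `condor_submit sleep.sub` avoids the parse error above.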

Why can’t I submit my job?
$ condor_submit sleep.sub
Submitting job(s)
ERROR: Can't open "/cloud/login/blin/school/inptu_data" with flags 00 (No such file or directory)
Submitting job(s)
No 'executable' parameter was provided
ERROR: I don't know about the 'vanila' universe.
ERROR: Executable file /bin/slep does not exist
Huh? There are typos in your submit file.
Solution: Fix your typos! Condor can only catch a select few of them.

Why can’t I submit my DAG?
Errors appear in *.dagman.out files instead of STDOUT or STDERR:
07/25/16 17:14:42 From submit: Submitting job(s)
07/25/16 17:14:42 From submit: ERROR: Invalid log file: "/home/blin/sleep/sleep.log" (No such file or directory)
07/25/16 17:14:42 failed while reading from pipe.
07/25/16 17:14:42 Read so far: Submitting job(s)ERROR: Invalid log file: "/home/blin/sleep/sleep.log" (No such file or directory)
07/25/16 17:14:42 ERROR: submit attempt failed
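A quick way to surface these buried errors is to search the *.dagman.out file directly. The sketch below writes a sample excerpt (hypothetical content, modeled on the log above) to a scratch file and scans it:

```shell
# Write a sample *.dagman.out excerpt to a scratch file.
cat > sleep.dagman.out <<'EOF'
07/25/16 17:14:42 From submit: ERROR: Invalid log file: "/home/blin/sleep/sleep.log"
07/25/16 17:14:42 ERROR: submit attempt failed
EOF

# Count the error lines; for a real DAG, read them in context instead.
grep -c 'ERROR' sleep.dagman.out
```

On a real submit node you would run the grep against your own DAG's *.dagman.out file.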

What are my jobs up to?
$ condor_q -help status
JobStatus codes:
 1 I IDLE
 2 R RUNNING
 3 X REMOVED
 4 C COMPLETED
 5 H HELD
 6 > TRANSFERRING_OUTPUT
 7 S SUSPENDED
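When scripting around `condor_q -af JobStatus` output, it can help to translate the numeric codes back into names. This hypothetical helper (not part of HTCondor) encodes the table above:

```shell
# Map a numeric JobStatus value to its human-readable name.
job_status_name() {
  case "$1" in
    1) echo IDLE ;;
    2) echo RUNNING ;;
    3) echo REMOVED ;;
    4) echo COMPLETED ;;
    5) echo HELD ;;
    6) echo TRANSFERRING_OUTPUT ;;
    7) echo SUSPENDED ;;
    *) echo UNKNOWN ;;
  esac
}

job_status_name 5   # HELD
```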

Why are my jobs idle?
$ condor_q -better-analyze 29486
[...]
The Requirements expression for job 29486.000 reduces to these conditions:
         Slots
Step    Matched  Condition
-----  --------  ---------
[1]        8348  Target.OpSysMajorVer == 6
[7]       10841  TARGET.Disk >= RequestDisk
[9]           1  TARGET.Memory >= RequestMemory
[11]      10852  TARGET.HasFileTransfer

Why are my jobs still running?
$ condor_q -nobatch
-- Schedd: learn.chtc.wisc.edu : <128.104.100.43:9618?...
 ID      OWNER  SUBMITTED    RUN_TIME   ST PRI SIZE CMD
14665.0  blin   7/25 18:19  0+23:00:03  R  0   0.3  sleep.sh
Solution: Use `condor_ssh_to_job <job ID>` to open an SSH session to the worker node running your job. Non-OSG jobs only!

Why are my jobs held?
$ condor_q -held
-- Schedd: learn.chtc.wisc.edu : <128.104.100.43:9618?... @ 07/18/17 15:13:42
 ID      OWNER  HELD_SINCE  HOLD_REASON
19.0     blin   7/14 15:07  Error from fermicloud113.fnal.gov: Failed to execute '/cloud/login/blin/school/sleep.sh': (errno=13: 'Permission denied')
14662.0  blin   7/25 18:05  Error from slot1_12@e163.chtc.wisc.edu: Failed to execute '/var/lib/condor/execute/slot1/dir_3090825/condor_exec.exe': (errno=8: 'Exec format error')
Huh? Condor couldn’t run your executable.
Solution: Set the executable bit on your executable (`chmod +x <filename>`) and/or add the missing shebang line at the top of the executable, e.g. '#!/bin/bash'.
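Both fixes can be demonstrated on a throwaway script (the file name here is illustrative, not from the slides):

```shell
# Reproduce both hold reasons: a script with no shebang and no execute bit.
printf 'sleep 1\n' > myjob.sh

# Fix the 'Exec format error': add a shebang as the first line.
printf '#!/bin/bash\n' | cat - myjob.sh > myjob.tmp && mv myjob.tmp myjob.sh

# Fix the 'Permission denied': set the executable bit.
chmod +x myjob.sh

head -n 1 myjob.sh        # now the shebang line
test -x myjob.sh && echo OK
```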

Why are my jobs held?
$ condor_q -held
-- Schedd: learn.chtc.wisc.edu : <128.104.100.43:9618?... @ 07/18/17 15:13:42
 ID      OWNER  HELD_SINCE  HOLD_REASON
29494.0  blin   7/16 17:31  Failed to initialize user log to /home/blin/foo/bar/test-000.log or /home/blin/foo/./test.dag.nodes.log
Huh? Condor couldn’t read or create the job or DAG log files.
Solution: Ensure that every directory in the specified path exists and is writable, or choose a new location for your files!
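The fix boils down to creating the missing directories before submitting. A minimal sketch, using a scratch directory as a stand-in for your real log path:

```shell
# Stand-in for a log path like /home/blin/foo/bar/test-000.log.
base=$(mktemp -d)

# Create every missing directory on the path, then confirm the log
# file can actually be created and written.
mkdir -p "$base/foo/bar"
touch "$base/foo/bar/test-000.log"
test -w "$base/foo/bar/test-000.log" && echo "log path OK"
```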

Why are my jobs held?
$ condor_q -held
-- Schedd: learn.chtc.wisc.edu : <128.104.100.43:9618?... @ 07/18/17 15:13:42
 ID      OWNER  HELD_SINCE  HOLD_REASON
19.0     blin   7/14 15:07  Error from fermicloud113.fnal.gov: Failed to execute '/cloud/login/blin/school/sleep.sh': invalid interpreter (/bin/bash) specified on first line of script (errno=2: 'No such file or directory')
Huh? There are carriage returns (^M) in your executable.
Solution: Use `dos2unix` or `vi -b <filename>` to see and delete the carriage returns (use 'x' to delete).
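If `dos2unix` is not installed, `tr` does the same job. This sketch reproduces the problem and fixes it (the script content is illustrative):

```shell
# CRLF line endings make the kernel look for an interpreter literally
# named "/bin/bash\r", which does not exist.
printf '#!/bin/bash\r\necho hello\r\n' > crlf.sh

# Fix: strip the carriage returns (equivalent to dos2unix).
tr -d '\r' < crlf.sh > crlf.fixed && mv crlf.fixed crlf.sh

bash crlf.sh
```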

Why are my jobs held?
$ condor_q -held
-- Schedd: learn.chtc.wisc.edu : <128.104.100.43:9618?... @ 07/18/17 15:13:42
 ID      OWNER  HELD_SINCE  HOLD_REASON
19.0     blin   7/14 15:07  Error from slot1_1@e026.chtc.wisc.edu: Job has gone over memory limit of 1 megabytes. Peak usage: 1 megabytes.
Huh? You’ve used too many resources.
Solution: Request more!

Why are my jobs held?
Edit your jobs on the fly with condor_qedit and condor_release:
condor_qedit <job ID> <resource> <value>
condor_qedit <job ID> RequestMemory <mem in MB>
condor_qedit -const 'JobStatus =?= 5' RequestDisk <disk in KiB>
condor_qedit -const 'Owner =?= "blin"' RequestCpus <CPUs>
condor_release <job ID>
Or remove your jobs, fix the submit file, and resubmit:
condor_rm <job ID>
Add request_disk, request_memory, or request_cpus to your submit file
condor_submit <submit file>

Why are my jobs held?
[Diagram: file transfer between the submit node and the compute node. The SHADOW on the submit node sends input files and receives output files; the STARTER on the compute node receives input and sends output.]

Why are my jobs held?
$ condor_q -held
-- Schedd: learn.chtc.wisc.edu : <128.104.100.43:9618?... @ 07/18/17 15:13:42
24.0  blin  7/14 16:12  Error from learn.chtc.wisc.edu: STARTER at 128.104.100.52 failed to send file(s) to <128.104.100.43:64130>: error reading from /var/lib/condor/execute/dir_24823/bar: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <128.104.100.52:10507>
Huh? Your job did not create the files that you specified in transfer_output_files.
Solution: Check transfer_output_files for typos, check the job’s runtime behavior, and add debugging information to your code.
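As a sketch, every name in transfer_output_files must exactly match a file the job actually creates in its sandbox (`bar` here is the hypothetical file from the error above):

```
# Every file listed here must exist when the job exits,
# or HTCondor puts the job on hold.
transfer_output_files = bar
```

A one-character typo in this list, or a job that exits before writing the file, produces exactly the STARTER read error shown above.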

Why are my jobs held?
$ condor_q -af HoldReason
Error from slot1_1@glidein_90499_320684397@node022.local: STARTER at 192.168.1.22 failed to send file(s) to <128.104.100.31:9618>; SHADOW at 128.104.100.31 failed to write to file /home/blin/location/output/location.5164.0.out: (errno 2) No such file or directory
Huh? HTCondor couldn’t write your output files.
Solution: On your submit server, make sure that the directories exist and can be written to.

Why are my jobs held?
Reading transfer errors by daemon and direction:
- SHADOW failed to READ: input files don’t exist or cannot be read on the submit server.
- SHADOW failed to WRITE: HTCondor could not write output files to your submit server (missing directories or incorrect permissions).
- STARTER failed to READ: files listed in transfer_output_files were not created on the execute server.
- STARTER failed to WRITE: something’s wrong with the execute server. Contact your HTCondor admin! [1]
[1] (Advanced) Avoid problematic execute servers on subsequent job executions with max_retries, periodic_release, and job_machine_attrs.

My jobs completed but...
The output is wrong:
- Check *.log files for return codes or unexpected behavior: short runtimes, using too many or too few resources.
- Check *.err and *.out files for error messages.
- Submit an interactive job: `condor_submit -i <submit file>` and run the executable manually.
  - If it succeeds, does your submit file have the correct arguments? If yes, try adding `getenv = true` to your submit file.
  - If it fails, there is an issue with your code or your invocation!
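Scanning the job event log for return codes is a quick first check. The sketch below writes a sample excerpt (hypothetical content, patterned on HTCondor's termination events) and greps it:

```shell
# Write a sample job event log excerpt to a scratch file.
cat > sleep.log <<'EOF'
005 (14665.000.000) 07/25 18:20:01 Job terminated.
	(1) Normal termination (return value 1)
EOF

# A non-zero return value is a strong hint the job itself failed.
grep 'return value' sleep.log
```

On a real submit node, point the grep at the log file named in your submit file.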

Troubleshooting DAGs
- Check *.rescue* files (which DAG nodes failed).
- Check *.dagman.out files (errors with job submission).
- Check *.nodes.log files (return codes, PRE/POST script failures).
- If PRE/POST scripts failed, run them manually to see where they failed.

Troubleshooting Exercise
Source: @direlog, https://twitter.com/direlog/status/886721271102410752

Questions?