Troubleshooting Your Jobs


Troubleshooting Your Jobs
Brian Lin, OSG Software Team, University of Wisconsin - Madison

The grid is your oyster! ...if your jobs aren’t broken.
(Photo by brentleimenstoll, CC BY-SA)

Keys to Troubleshooting

Identify the problem and break it into smaller ones.
[Diagram: job states (Idle, Running, Completed, Held, Suspended) and where to look for each (history file, submit file)]

Keys to Troubleshooting

Know where to find more information:
- condor_q and its options
- Output, error, log, and rescue files
- Google
- https://support.opensciencegrid.org knowledge base
- Local support: user-support@opensciencegrid.org

Common Issues

Why can’t I submit my job?

$ condor_submit sleep.sh
Submitting job(s)
ERROR: on Line 2 of submit file:
ERROR: Failed to parse command file (line 2).

Huh? You’ve tried to submit something that wasn’t your submit file.
Solution: Submit your .sub file!
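For contrast, a minimal working submit file might look like this (the file names and resource values are illustrative, not taken from the slides):

```
# sleep.sub -- a minimal, hypothetical submit file
universe   = vanilla
executable = sleep.sh
log        = sleep.log
output     = sleep.out
error      = sleep.err
request_cpus   = 1
request_memory = 1GB
request_disk   = 1GB
queue
```

Submitting it with condor_submit sleep.sub (not sleep.sh) avoids the parse error shown above.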

Why can’t I submit my job?

$ condor_submit sleep.sub
Submitting job(s)
ERROR: Can't open "/home/blin/school/inptu_data" with flags 00 (No such file or directory)
Submitting job(s)
No 'executable' parameter was provided
ERROR: I don't know about the 'vanila' universe.
ERROR: Executable file /bin/slep does not exist

Huh? There are typos in your submit file.
Solution: Fix your typos! Condor can only catch a select few of them.

Why can’t I submit my job?

$ condor_submit sleep.sh
Submitting job(s)
ERROR: Executable file sleep.sh is a script with CRLF (DOS/Win) line endings
Run with -allow-crlf-script if this is really what you want

Huh? There are carriage-return line endings (^M) in your executable.
Solution: Convert your executable with dos2unix, or use vi -b <FILENAME> to see and delete the carriage returns (use ‘x’ to delete).
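When dos2unix isn’t available, the same fix can be sketched with GNU sed (the file name and contents here are illustrative):

```shell
# Create a script with Windows-style (CRLF) line endings to reproduce the error
printf '#!/bin/bash\r\necho hello\r\n' > sleep.sh

# Strip the trailing carriage return from every line, in place (GNU sed)
sed -i 's/\r$//' sleep.sh
```

Opening the file afterwards with vi -b sleep.sh should show no ^M characters.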

What are my jobs up to?

$ condor_q -help status
JobStatus codes:
 1 I IDLE
 2 R RUNNING
 3 X REMOVED
 4 C COMPLETED
 5 H HELD
 6 > TRANSFERRING_OUTPUT
 7 S SUSPENDED

Why are my jobs idle?

$ condor_q -better 29486
[...]
The Requirements expression for job 29486.000 reduces to these conditions:

          Slots
Step    Matched  Condition
-----  --------  ---------
[1]        8348  Target.OpSysMajorVer == 6
[7]       10841  TARGET.Disk >= RequestDisk
[9]           1  TARGET.Memory >= RequestMemory
[11]      10852  TARGET.HasFileTransfer

Why are my jobs still running?

$ condor_q -nobatch
-- Schedd: learn.chtc.wisc.edu : <128.104.100.43:9618?...
   ID      OWNER  SUBMITTED     RUN_TIME ST PRI SIZE CMD
14665.0    blin   7/25 18:19  0+23:00:03 R  0   0.3  sleep.sh

Solution: Use condor_ssh_to_job <JOB ID> to open an SSH session on the worker node running your job. Non-OSG jobs only!

Why are my jobs held?

$ condor_q -held
-- Schedd: learn.chtc.wisc.edu : <128.104.100.43:9618?... @ 07/18/17 15:13:42
 ID      OWNER  HELD_SINCE HOLD_REASON
14662.0  blin   7/25 18:05 Error from slot1_12@e163.chtc.wisc.edu: Failed to execute '/var/lib/condor/execute/slot1/dir_3090825/condor_exec.exe': (errno=8: 'Exec format error')

Huh? Condor couldn’t run your executable.
Solution: Add the missing ‘shebang’ line at the top of the executable, e.g. #!/bin/bash.
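A quick way to check for, and add, the missing shebang line, sketched with an illustrative script name:

```shell
# A script that lacks the interpreter ("shebang") line
printf 'echo "job ran"\n' > job.sh

# Prepend "#!/bin/bash" only if the first line is not already a shebang
if ! head -n 1 job.sh | grep -q '^#!'; then
    sed -i '1i #!/bin/bash' job.sh
fi
chmod +x job.sh
```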

Why are my jobs held?

$ condor_q -held
-- Schedd: learn.chtc.wisc.edu : <128.104.100.43:9618?... @ 07/18/17 15:13:42
 ID      OWNER  HELD_SINCE HOLD_REASON
29494.0  blin   7/16 17:31 Failed to initialize user log to /home/blin/foo/bar/test-000.log or /dev/null

Huh? Condor couldn’t write the job log file.
Solution: On the submit server, ensure that each folder in the specified path exists and is writable, or choose a new location for your files!
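The fix can be sketched like this on the submit server; the path here is a relative stand-in for the directory named in the hold reason:

```shell
# Create every missing directory in the log path; -p also creates
# parent directories and is a no-op if they already exist
mkdir -p foo/bar

# Confirm the directory is writable before releasing the job
test -w foo/bar && echo "log directory is writable"
```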

Why are my jobs held?

$ condor_q -held
-- Schedd: learn.chtc.wisc.edu : <128.104.100.43:9618?... @ 07/18/17 15:13:42
 ID    OWNER  HELD_SINCE HOLD_REASON
19.0   blin   7/14 15:07 Error from slot1_1@e026.chtc.wisc.edu: Job has gone over memory limit of 100 megabytes. Peak usage: 500 megabytes.

Huh? You’ve used too many resources.
Solution: Request more! 1 GB of memory is a good place to start. If your maximum memory footprint per job approaches 10 GB, consult OSG staff.

Fixing Held Jobs

If the problem was with your submit file, remove your jobs, fix the submit file, and resubmit:
1. condor_rm <job ID>
2. Add request_disk, request_memory, or request_cpus to your submit file
3. condor_submit <submit file>

If the problem was outside of your submit file, fix the issue and release the job:
condor_release <job ID>
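The resource-request lines mentioned in step 2 look like this in a submit file (the values are illustrative starting points, not recommendations from the slides):

```
request_cpus   = 1
request_memory = 2GB
request_disk   = 4GB
```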

Why are my jobs held?
[Diagram: the SHADOW runs on the submit node and the STARTER on the execute node; between them they transfer the job’s input and output files]

Why are my jobs held?

$ condor_q -held
-- Schedd: learn.chtc.wisc.edu : <128.104.100.43:9618?... @ 07/18/17 15:13:42
24.0   blin   7/14 16:12 Error from learn.chtc.wisc.edu: STARTER at 128.104.100.52 failed to send file(s) to <128.104.100.43:64130>: error reading from /var/lib/condor/execute/dir_24823/bar: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <128.104.100.52:10507>

Huh? Your job did not create the files that you specified in transfer_output_files.
Solution: Check for typos in transfer_output_files, watch for suspiciously quick jobs, and add debugging information to your code.

Why are my jobs held?

$ condor_q -af holdreason
Error from slot1_1@glidein_90499_320684397@node022.local: STARTER at 192.168.1.22 failed to send file(s) to <128.104.100.31:9618>; SHADOW at 128.104.100.31 failed to write to file /home/blin/location/output/location.5164.0.out: (errno 2) No such file or directory

Huh? HTCondor couldn’t write the output files to the submit server.
Solution: On the submit server, make sure that the directories exist and can be written to.

Why are my jobs held?

- SHADOW failed to READ: input files don’t exist or cannot be read on the submit server.
- SHADOW failed to WRITE: could not write output files to your submit server (missing or incorrect permissions on directories).
- STARTER failed to READ: files in transfer_output_files were not created on the execute server.
- STARTER failed to WRITE: something’s wrong with the execute server. Contact your HTCondor admin!

My jobs completed, but...

The output is wrong:
- Check *.log files for return codes or unexpected behavior: short runtimes, using too many or too few resources.
- Check *.err and *.out for error messages from your executable.
- Submit an interactive job: condor_submit -i <SUBMIT FILE> and run the executable manually.
  - If it succeeds, does your submit file have the correct args? If yes, try adding getenv = true to your submit file.
  - If it fails, there is an issue with your code or your invocation!
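Return codes end up in the job log; here is a sketch of pulling them out, using an illustrative log snippet modeled on HTCondor’s event format:

```shell
# Write an illustrative job-log snippet (modeled on HTCondor's event format)
cat > sleep.log <<'EOF'
005 (14665.000.000) 07/25 19:02:31 Job terminated.
	(1) Normal termination (return value 1)
EOF

# A nonzero return value means the executable itself exited with an error
grep 'return value' sleep.log
```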

Troubleshooting Exercise
Source: @direlog, https://twitter.com/direlog/status/886721271102410752

Hands-On

Questions?
Exercise 1.1: Troubleshooting
Remember, if you don’t finish, that’s OK! You can make up work later or during evenings, if you’d like.

Coming next:
10:30 - 10:45  Break
10:45 - 12:00  HTC in action (interactive)
12:00 - 12:15  Security in the OSG