Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel OSG Campus Grids University of Nebraska.

Presentation transcript:

Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel OSG Campus Grids University of Nebraska – Lincoln

2013 OSG User School
What have we seen?
If you run condor_status on the submit host:
    LINUX X86_64 Unclaimed Benchmar :00:04
    LINUX X86_64 Owner Idle :01:06
    LINUX X86_64 Owner Idle :00:04
    LINUX X86_64 Owner Idle :00:04
    Total Owner Claimed Unclaimed Matched Preempting Backfill
    INTEL/LINUX
    X86_64/LINUX
    Total

What have we seen?
What does this mean?
- INTEL/LINUX: 15 nodes that are 32bit
- X86_64/LINUX: 27 nodes that are 64bit

Different Architectures
OSG computers come in 2 major architectures:
- X86_64: dominant, 64 bit platform
- 32bit: very few, but
Executables have problems on the different architectures.

Different Architectures
- 32bit application -> 32 bit architecture
- 32bit application -> 64 bit architecture
- 64bit application -> 64 bit architecture
- 64bit application -> 32 bit architecture
Be smart when you compile and run executables (more in the exercise).
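Of the four combinations above, the one that can never work is a 64-bit binary on a 32-bit machine. A quick check before submitting is to look at which kind of binary you actually built. The sketch below is not from the slides: `elf_bits` is a hypothetical helper that reads the class byte of a Linux ELF header (byte offset 4, where 1 means 32-bit and 2 means 64-bit) using only `od` and `tr`.

```shell
#!/bin/sh
# elf_bits FILE
# Hypothetical helper (not part of HTCondor): prints 32 or 64 for an ELF
# binary, or "unknown" for anything else (scripts, other formats).
elf_bits() {
    # The first four bytes of an ELF file are 0x7f 'E' 'L' 'F'.
    magic=$(od -An -c -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "177ELF" ]; then
        echo "unknown"
        return 0
    fi
    # Byte 4 is the ELF class: 1 = 32-bit, 2 = 64-bit.
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case "$class" in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo "unknown" ;;
    esac
}

# Example use on the submit host (my_program is a placeholder name):
#   elf_bits ./my_program
#   uname -m    # architecture of the machine you compiled on
```

Comparing the result with the architectures reported by condor_status tells you which slots can run the binary at all.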

Sites that preempt
Remember we had all these sites.

Sites that preempt
What happens if one of them kills your job?

Sites that preempt
What if a site goes away?

What happens in GlideinWMS?
With GlideinWMS, the jobs stick around.
Condor will send the jobs to other remaining sites.
GGC (Good Guy Condor?)

Troubleshooting Resources Wednesday Afternoon, 4:00 pm Derek Weitzel OSG Campus Grids University of Nebraska – Lincoln

From Previous
Did your jobs run?
    ~]$ condor_q -hold
    Submitter: osg-ss-submit.chtc.wisc.edu : : osg-ss-submit.chtc.wisc.edu
    ID OWNER HELD_SINCE HOLD_REASON
    armbrust 6/26 15:16 Error from Failed to execute '/var/lib/condor/execute/slot1/dir_1698/condor_exec.exe' with arguments 4 10: (errno=8: 'Exec format error')
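An 'Exec format error' (errno 8) means the worker node could not execute the file it was given, which is what happens when a binary built for one architecture lands on another. One common way to avoid it is to tell Condor which architecture the job needs. The sketch below is an assumption-laden illustration, not the exercise solution: `write_submit_file`, `my_program` and `job.log` are placeholder names, and only the `requirements` line matters. `Arch` is the machine attribute whose values (`X86_64`, `INTEL`) appear in the condor_status totals.

```shell
#!/bin/sh
# write_submit_file ARCH OUTFILE
# Hypothetical helper: writes a minimal vanilla-universe submit file that
# only matches machines advertising the given Arch (X86_64 or INTEL).
write_submit_file() {
    cat > "$2" <<EOF
universe     = vanilla
executable   = my_program
requirements = (Arch == "$1")
output       = out
error        = err
log          = job.log
queue
EOF
}

# Example: a 64-bit binary should only go to 64-bit slots.
#   write_submit_file X86_64 job.submit
#   condor_submit job.submit
```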

Goals
For this section, I want to cover some common troubleshooting techniques.
These techniques are widely used by grid users and administrators.

What has happened?
- Jobs stay idle?
- Jobs go on hold?
- Jobs fail on worker nodes?

Jobs on Idle
There are some tools to help find out why jobs are not running.
First, check whether any resources are available:
    $ condor_status

Jobs on Idle
There are some tools to help find out why jobs are not running.
Next, check whether Condor knows why your job isn't running:
    $ condor_q -better-analyze 10.0

Jobs on Idle
There are some tools to help find out why jobs are not running.
Hmm... so your jobs should run. OK, now what?
Look in the job's log file: has it run already? Is it failing?
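The three idle-job checks above can be strung together into one pass. This is a minimal sketch, assuming you know the job ID and the path given as `log` in the submit file. `triage_idle_job` is a hypothetical name. `condor_status -total` prints only the summary table, and `condor_q -better-analyze` is the command from the slide.

```shell
#!/bin/sh
# triage_idle_job JOBID LOGFILE
# Hypothetical helper that runs the three checks in order:
#   1. are there any slots at all?
#   2. does the matchmaker explain why the job is not running?
#   3. what does the job's own log say?
triage_idle_job() {
    job_id=$1
    log_file=$2

    echo "== Pool summary =="
    condor_status -total

    echo "== Matchmaking analysis for job $job_id =="
    condor_q -better-analyze "$job_id"

    echo "== Last events in $log_file =="
    if [ -r "$log_file" ]; then
        tail -n 20 "$log_file"
    else
        echo "log file not found: $log_file"
    fi
}

# Example (job 10.0 is the ID used on the slide; job.log is a placeholder):
#   triage_idle_job 10.0 job.log
```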

Jobs on Hold
You see your job on hold in the queue:
    $ condor_q
    -- Submitter: osg-ss-glidein.chtc.wisc.edu : : osg-ss-glidein.chtc.wisc.edu
    ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
    mhaytmyr 6/26 11: :00:06 H run-blast.sh yeast
    wliu 6/26 13: :00:36 I blast.sh
    2 jobs; 1 idle, 0 running, 1 held

Jobs on Hold
What is the hold reason?
    $ condor_q 347 -format '%s\n' 'HoldReason'
    Error from STARTER at failed to receive file /var/lib/condor/execute/dir_7087/glide_fZ7141/execute/dir_18801/query1: FILETRANSFER:1:No plugin table defined (request was

Jobs on Hold
    $ condor_q 347 -format '%s\n' 'HoldReason'
    Error from STARTER at failed to receive file /var/lib/condor/execute/dir_7087/glide_fZ7141/execute/dir_18801/query1: FILETRANSFER:1:No plugin table defined (request was
Each case is different.
In this case, the user put in their submit file:
    transfer_input_files =
The Glidein at IU cannot download from http.
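Since each hold reason is different, it can help to keep a small lookup of the ones you have already met. The sketch below covers only the two messages that appear in this session. `explain_hold` is a hypothetical helper, and the explanations are paraphrases of the slides rather than official HTCondor wording.

```shell
#!/bin/sh
# explain_hold "HOLD REASON TEXT"
# Hypothetical helper: maps the hold reasons seen in this session to a
# one-line hint. Anything else falls through to a generic message.
explain_hold() {
    case "$1" in
        *"Exec format error"*)
            echo "architecture mismatch: the executable cannot run on that worker node"
            ;;
        *"No plugin table defined"*)
            echo "file transfer: the execute side cannot fetch this kind of URL"
            ;;
        *)
            echo "unrecognized: read the full HoldReason text"
            ;;
    esac
}

# Example against a live queue (347 is the job ID from the slide):
#   explain_hold "$(condor_q 347 -format '%s\n' HoldReason)"
```

Once the cause is fixed, `condor_release` with the job ID puts a held job back into the idle state so it can be matched again.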

Jobs failing on Worker Nodes
How do you find out that jobs are failing on worker nodes?
- The output does not match what you expect.
- The jobs seem to be running 'too fast'.

Jobs failing on Worker Nodes
First, can you see anything useful in the output/error files?
    universe = vanilla
    ...
    output = out
    error = err
    ...
    queue
Next, we have to try some further debugging.

Jobs failing on Worker Nodes
If you are running a wrapper script, you can force output on every step by changing its first line from
    #!/bin/sh
to
    #!/bin/sh -x
The script then prints every step to stderr, which is the 'error' file in your submit file.
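The same tracing can be switched on from inside the script with `set -x`, which is handy when you only want to trace the payload and not the whole wrapper. The sketch below is one possible wrapper layout. `run_traced` and `my_program` are hypothetical names. It also records which machine the job landed on, which is useful when only some sites fail.

```shell
#!/bin/sh
# run_traced COMMAND [ARGS...]
# Hypothetical wrapper helper: notes the worker node and its architecture on
# stderr, then runs the command with shell tracing enabled. The subshell
# keeps "set -x" from leaking into the rest of the wrapper.
run_traced() {
    echo "host: $(uname -n)  arch: $(uname -m)" >&2
    ( set -x; "$@" )
}

# In a real wrapper the last line would hand over to the payload, e.g.:
#   run_traced ./my_program "$@"
```

Both the host line and the trace end up in the file named by `error` in the submit file, while the program's normal output still goes to `output`.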

Jobs failing on Worker Nodes
Condor can also send you to the worker node using condor_ssh_to_job.
HUGE!!!!
You will see this in the exercises.

Questions?
Questions? Comments?
- Feel free to ask me questions later: Derek Weitzel
Upcoming sessions
- 4:30 - 5:00: Hands-on exercises
- 5:00 - 7:00: Dinner
- 7:00 - 9:00: Optional Evening Session