Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel OSG Campus Grids University of Nebraska.

Slides:



Advertisements
Similar presentations
Building Campus HTC Sharing Infrastructures Derek Weitzel University of Nebraska – Lincoln (Open Science Grid Hat)
Advertisements

Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel OSG Campus Grids University of Nebraska.
Setting up of condor scheduler on computing cluster Raman Sehgal NPD-BARC.
More HTCondor 2014 OSG User School, Monday, Lecture 2 Greg Thain University of Wisconsin-Madison.
Condor and GridShell How to Execute 1 Million Jobs on the Teragrid Jeffrey P. Gardner - PSC Edward Walker - TACC Miron Livney - U. Wisconsin Todd Tannenbaum.
Condor DAGMan Warren Smith. 12/11/2009 TeraGrid Science Gateways Telecon2 Basics Condor provides workflow support with DAGMan Directed Acyclic Graph –Each.
Intermediate HTCondor: More Workflows Monday pm Greg Thain Center For High Throughput Computing University of Wisconsin-Madison.
Intermediate Condor: DAGMan Monday, 1:15pm Alain Roy OSG Software Coordinator University of Wisconsin-Madison.
1 Workshop 20: Teaching a Hands-on Undergraduate Grid Computing Course SIGCSE The 41st ACM Technical Symposium on Computer Science Education Friday.
Intermediate HTCondor: Workflows Monday pm Greg Thain Center For High Throughput Computing University of Wisconsin-Madison.
Derek Wright Computer Sciences Department, UW-Madison Lawrence Berkeley National Labs (LBNL)
Zach Miller Condor Project Computer Sciences Department University of Wisconsin-Madison Flexible Data Placement Mechanisms in Condor.
DIRAC API DIRAC Project. Overview  DIRAC API  Why APIs are important?  Why advanced users prefer APIs?  How it is done?  What is local mode what.
April Open Science Grid Campus Condor Pools Mats Rynge – Renaissance Computing Institute University of North Carolina, Chapel Hill.
Building a Real Workflow Thursday morning, 9:00 am Lauren Michael Research Computing Facilitator University of Wisconsin - Madison.
IntroductiontotoHTCHTC 2015 OSG User School, Monday, Lecture1 Greg Thain University of Wisconsin– Madison Center For High Throughput Computing.
Rsv-control Marco Mambelli – Site Coordination meeting October 1, 2009.
An Introduction to High-Throughput Computing Monday morning, 9:15am Alain Roy OSG Software Coordinator University of Wisconsin-Madison.
The Glidein Service Gideon Juve What are glideins? A technique for creating temporary, user- controlled Condor pools using resources from.
Condor Tugba Taskaya-Temizel 6 March What is Condor Technology? Condor is a high-throughput distributed batch computing system that provides facilities.
Progress Report Barnett Chiu Glidein Code Updates and Tests (1) Major modifications to condor_glidein code are as follows: 1. Command Options:
1 Evolution of OSG to support virtualization and multi-core applications (Perspective of a Condor Guy) Dan Bradley University of Wisconsin Workshop on.
Campus Grids Report OSG Area Coordinator’s Meeting Dec 15, 2010 Dan Fraser (Derek Weitzel, Brian Bockelman)
Grid Computing I CONDOR.
GridShell + Condor How to Execute 1 Million Jobs on the Teragrid Jeffrey P. Gardner Edward Walker Miron Livney Todd Tannenbaum The Condor Development Team.
Part 6: (Local) Condor A: What is Condor? B: Using (Local) Condor C: Laboratory: Condor.
Intermediate Condor Rob Quick Open Science Grid HTC - Indiana University.
Condor Project Computer Sciences Department University of Wisconsin-Madison Condor-G Operations.
Grid job submission using HTCondor Andrew Lahiff.
Building a Real Workflow Thursday morning, 9:00 am Greg Thain University of Wisconsin - Madison.
Workflows: from development to Production Thursday morning, 10:00 am Greg Thain University of Wisconsin - Madison.
Turning science problems into HTC jobs Wednesday, July 29, 2011 Zach Miller Condor Team University of Wisconsin-Madison.
Condor: BLAST Monday, July 19 th, 3:15pm Alain Roy OSG Software Coordinator University of Wisconsin-Madison.
TeraGrid Advanced Scheduling Tools Warren Smith Texas Advanced Computing Center wsmith at tacc.utexas.edu.
Building a Real Workflow Thursday morning, 9:00 am Lauren Michael Research Computing Facilitator University of Wisconsin - Madison.
July 11-15, 2005Lecture3: Grid Job Management1 Grid Compute Resources and Job Management.
HTCondor and Workflows: An Introduction HTCondor Week 2015 Kent Wenger.
Condor Week 2004 The use of Condor at the CDF Analysis Farm Presented by Sfiligoi Igor on behalf of the CAF group.
Running persistent jobs in Condor Derek Weitzel & Brian Bockelman Holland Computing Center.
Intermediate Condor: Workflows Monday, 1:15pm Alain Roy OSG Software Coordinator University of Wisconsin-Madison.
Condor: BLAST Monday, 3:30pm Alain Roy OSG Software Coordinator University of Wisconsin-Madison.
An Introduction to High-Throughput Computing With Condor Tuesday morning, 9am Zach Miller University of Wisconsin-Madison.
Weekly Work Dates:2010 8/20~8/25 Subject:Condor C.Y Hsieh.
Grid Compute Resources and Job Management. 2 Grid middleware - “glues” all pieces together Offers services that couple users with remote resources through.
December 07, 2006Parag Mhashilkar, Fermilab1 Samgrid – OSG Interoperability Parag Mhashilkar, Fermi National Accelerator Laboratory.
Open Science Grid Build a Grid Session Siddhartha E.S University of Florida.
An Introduction to High-Throughput Computing Monday morning, 9:15am Alain Roy OSG Software Coordinator University of Wisconsin-Madison.
Condor Project Computer Sciences Department University of Wisconsin-Madison Running Interpreted Jobs.
TANYA LEVSHINA Monitoring, Diagnostics and Accounting.
HTCondor’s Grid Universe Jaime Frey Center for High Throughput Computing Department of Computer Sciences University of Wisconsin-Madison.
Campus Grid Technology Derek Weitzel University of Nebraska – Lincoln Holland Computing Center (HCC) Home of the 2012 OSG AHM!
An Introduction to Using
Job submission overview Marco Mambelli – August OSG Summer Workshop TTU - Lubbock, TX THE UNIVERSITY OF CHICAGO.
Parag Mhashilkar (Fermi National Accelerator Laboratory)
UCS D OSG Summer School 2011 Life of an OSG job OSG Summer School A peek behind the scenes The life of an OSG job by Igor Sfiligoi University of.
Introduction to Distributed HTC and overlay systems Tuesday morning, 9:00am Igor Sfiligoi Leader of the OSG Glidein Factory Operations University of California.
Introduction to the Grid and the glideinWMS architecture Tuesday morning, 11:15am Igor Sfiligoi Leader of the OSG Glidein Factory Operations University.
Intermediate Condor Monday morning, 10:45am Alain Roy OSG Software Coordinator University of Wisconsin-Madison.
Honolulu - Oct 31st, 2007 Using Glideins to Maximize Scientific Output 1 IEEE NSS 2007 Making Science in the Grid World - Using Glideins to Maximize Scientific.
Condor DAGMan: Managing Job Dependencies with Condor
Intermediate HTCondor: Workflows Monday pm
Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)
Primer for Site Debugging
Workload Management System
Troubleshooting Your Jobs
Job Matching, Handling, and Other HTCondor Features
Introduction to High Throughput Computing and HTCondor
HTCondor Training Florentia Protopsalti IT-CM-IS 1/16/2019.
Troubleshooting Your Jobs
PU. Setting up parallel universe in your pool and when (not
Presentation transcript:

Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel OSG Campus Grids University of Nebraska – Lincoln

2012 OSG User School What have we seen? If you do a condor_status on glidein: 2 LINUX X86_64 Unclaimed Benchmar :00:04 LINUX X86_64 Owner Idle :01:06 LINUX X86_64 Owner Idle :00:04 LINUX X86_64 Owner Idle :00:04 Total Owner Claimed Unclaimed Matched Preempting Backfill INTEL/LINUX X86_64/LINUX Total

2012 OSG User School What have we seen? If you do a condor_status on glidein: 3 LINUX X86_64 Unclaimed Benchmar :00:04 LINUX X86_64 Owner Idle :01:06 LINUX X86_64 Owner Idle :00:04 LINUX X86_64 Owner Idle :00:04 Total Owner Claimed Unclaimed Matched Preempting Backfill INTEL/LINUX X86_64/LINUX Total

2012 OSG User School What have we seen? What does this mean? 15 nodes what are 32bit 27 nodes that are 64bit 4 INTEL/LINUX X86_64/LINUX

2012 OSG User School Different Architectures OSG computers come in 2 major architectures:  X86_64 – Dominant, 64 bit platform  32bit – Very few, but Executables have problems on the different architectures. 5

2012 OSG User School Different Architectures 32bit application -> 32 bit architecture 32bit application -> 64 bit architecture 64bit application -> 64 bit architecture 64bit application -> 32 bit architecture Be smart when you compile and run executables (more in exercise) 6

2012 OSG User School Sites that preempt Remember we had all these sites 7

2012 OSG User School Sites that preempt What happens if 1 kills your job? 8

2012 OSG User School Sites that preempt What if a site goes away? 9 !

2012 OSG User School Sites that preempt What if a site goes away? 10 !

2012 OSG User School What happens in GlideinWMS? With GlideinWMS, the jobs stick around. Condor will send the jobs to other remaining sites. GGC (Good Guy Condor?) 11

Troubleshooting Resources Wednesday Afternoon, 4:00 pm Derek Weitzel OSG Campus Grids University of Nebraska – Lincoln

2012 OSG User School Goals For this section, I want to cover some common troubleshooting techniques These techniques are widely used by grid users and administrators. 13

2012 OSG User School What has happened? Jobs stay idle? Jobs go on hold? Jobs fail on worker nodes? 14

2012 OSG User School Jobs on Idle There are some tools to help with finding why jobs are not running. First, check if any available resources are available: 15 $ condor_status

2012 OSG User School Jobs on Idle There are some tools to help with finding why jobs are not running. Next, check if the condor knows why your job isn’t running 16 $ condor_q –better-analyze 10.0

2012 OSG User School Jobs on Idle There are some tools to help with finding why jobs are not running. Hum… so your jobs should run, ok now what? Look in the job’s log file, has it ran already? Failing? 17

2012 OSG User School Jobs on Hold You see your job on hold in the queue 18 $ condor_q -- Submitter: osg-ss-glidein.chtc.wisc.edu : : osg-ss-glidein.chtc.wisc.edu ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD mhaytmyr 6/26 11: :00:06 H run-blast.sh yeast wliu 6/26 13: :00:36 I blast.sh 2 jobs; 1 idle, 0 running, 1 held

2012 OSG User School Jobs on Hold What is the hold reason? 19 $ condor_q 347 -format '%s\n' 'HoldReason' Error from STARTER at failed to receive file /var/lib/condor/execute/dir_7087/glide_fZ7141/execute/dir_18801/query1: FILETRANSFER:1:No plugin table defined (request was

2012 OSG User School Jobs on Hold Each case is different In this case, the user put in their submit file: The Glidein at IU cannot download from http 20 $ condor_q 347 -format '%s\n' 'HoldReason' Error from STARTER at failed to receive file /var/lib/condor/execute/dir_7087/glide_fZ7141/execute/dir_18801/query1: FILETRANSFER:1:No plugin table defined (request was transfer_input_files =

2012 OSG User School Jobs failing on Worker Nodes How to find jobs are failing on worker nodes?  If the output does not match what you expect.  If the jobs seem to be running ‘too fast’ 21

2012 OSG User School Jobs failing on Worker Nodes First, can you see anything useful in the output/error: Next, we have to try some further debugging 22 universe = vanilla... output = out error = err... queue

2012 OSG User School Jobs failing on Worker Nodes If you are running a wrapper script, can force output on every step It then outputs every step to the stderr, or ‘error’ in your submit file. 23 #!/bin/sh #!/bin/sh -x

2012 OSG User School Jobs failing on Worker Nodes Condor can also send you to the worker node using condor_ssh_to_job HUGE!!!! Will see in exercises 24

2012 OSG User School Questions? Questions? Comments?  Feel free to ask me questions later: Alain Roy Upcoming sessions  9:45-10:30  Hands-on exercises  10:30 – 10:45  Break  10:45 – 12:15  More! 25