Condor-G Operations
Condor Project, Computer Sciences Department, University of Wisconsin-Madison

Previously Covered
› Some topics from the vanilla Condor operations talk apply to Condor-G
    • Configuration files
    • Log files
    • Command-line tools
    • Job policy expressions
    • Where to get more help

HELD Status
› Jobs will be held when Condor-G needs help with an error
› On release, Condor-G will retry
› The reason for the hold will be saved in the job ad and user log

Hold Reason
› condor_q -held
    jfrey 2/13 13:58 CREAM_Delegate Error: Received NULL fault;
› cat job.log
    012 ( ) 02/13 13:58:38 Job was held.
    CREAM_Delegate Error: Received NULL fault; the error is due to another cause…
› condor_q -format '%s\n' HoldReason
    CREAM_Delegate Error: Received NULL fault; the error is due to another cause…
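
The hold events in the user log can also be collected with a short script. The following is a minimal Python sketch, not part of the original slides. It assumes that each hold event starts with a header line of the form `012 (cluster.proc.subproc) date time Job was held.`, that the reason is the first non-empty line after the header, and that events end with a `...` line. The job ID `123.000.000` in the sample and the names `hold_reasons` and `SAMPLE_LOG` are hypothetical.

```python
import re

# Header of a "Job was held" event (event code 012) in a user log.
# Assumed layout: code, (cluster.proc.subproc), date, time, message.
HELD_HEADER = re.compile(
    r"^012 \((\d+)\.(\d+)\.(\d+)\) (\S+) (\S+) Job was held\.\s*$"
)


def hold_reasons(log_text):
    """Return a list of (job_id, timestamp, reason) for each hold event."""
    results = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        match = HELD_HEADER.match(line)
        if not match:
            continue
        cluster, proc = int(match.group(1)), int(match.group(2))
        # Assumption: the reason is the first non-empty line after the
        # header; stop looking if the event terminator "..." comes first.
        reason = ""
        for follow in lines[i + 1:]:
            text = follow.strip()
            if text == "...":
                break
            if text:
                reason = text
                break
        timestamp = f"{match.group(4)} {match.group(5)}"
        results.append((f"{cluster}.{proc}", timestamp, reason))
    return results


# Hypothetical sample: the job ID is invented for illustration.
SAMPLE_LOG = """\
012 (123.000.000) 02/13 13:58:38 Job was held.
\tCREAM_Delegate Error: Received NULL fault;
...
"""

for job_id, when, reason in hold_reasons(SAMPLE_LOG):
    print(job_id, when, reason)
```

Running it over a real `job.log` would only require reading the file into a string first; the regular expression may need adjusting if the local log layout differs from the one assumed here.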

Common Errors
› Authentication
    • Hold reason may be misleading
    • User may not be authorized by CE
    • Condor-G may not have access to all Certificate Authority files
    • User’s proxy may have expired

Common Errors
› CE no longer knows about job
    • CE admin may forcibly remove job files
    • Condor-G is obsessive about not leaving orphaned jobs
    • May need to take extra steps to convince Condor-G that remote job is gone

Nonessential Jobs
› Jobs can be marked nonessential in the submit file
    • +nonessential = true
› This makes Condor-G more willing to leave orphaned jobs and files on the CE
› Use with caution

More Detail on Errors
› More details on errors can be found in the gridmanager log
› You’ll probably want to increase the debug level and log file size
    • GRIDMANAGER_DEBUG = D_FULLDEBUG
    • MAX_GRIDMANAGER_LOG =

Machines Down
› If a remote server is down, Condor-G will wait for it to come back up
› The time it went down is kept in the job ad
    • GridResourceUnavailableTime =
› And in the user log
    026 ( ) 02/13 14:20:39 Detected Down Grid Resource
    GridResource: gt2 chopin.cs.wisc.edu/jobmanager-fork
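
To judge how long a resource has been unreachable, the value of `GridResourceUnavailableTime` can be compared with the current time. The sketch below is an illustration rather than anything from the slides; it assumes the attribute holds seconds since the Unix epoch, and the function names `resource_downtime_seconds` and `format_downtime` are hypothetical.

```python
from datetime import datetime, timezone


def resource_downtime_seconds(unavailable_time, now=None):
    """Seconds elapsed since the grid resource was marked unavailable.

    Assumes unavailable_time (GridResourceUnavailableTime) is expressed
    in seconds since the Unix epoch. Never returns a negative value.
    """
    if now is None:
        now = datetime.now(timezone.utc).timestamp()
    return max(0, int(now) - int(unavailable_time))


def format_downtime(seconds):
    """Render a duration in seconds as a compact hours/minutes/seconds string."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# Example with made-up timestamps: the resource went down 3725 seconds ago.
print(format_downtime(resource_downtime_seconds(1000, now=4725)))
```

The attribute value itself could be read with a `condor_q -format` query like the one used for `HoldReason`.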

Throttles and Timeouts
› Limits that prevent Condor-G or CEs from being overwhelmed by large numbers of jobs
› Defaults are fairly conservative

Throttles and Timeouts
› GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 1000
    • You can increase to 10,000 or more
› GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 10
    • GRAM2 only
    • Default is conservative
    • Can increase to ~100 if this is the only client

Throttles and Timeouts
› GRIDMANAGER_MAX_PENDING_REQUESTS = 50
    • Number of commands sent to a GAHP in parallel
    • Can increase to a couple hundred
› GRIDMANAGER_GAHP_CALL_TIMEOUT = 300
    • Time after which a GAHP command is considered failed
    • May need to be lengthened if the pending-request limit is increased
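
When raising these limits, it can help to generate the configuration fragment from one place so that only the changed values are written out. The following Python sketch is an illustration under stated assumptions: the defaults are the ones listed on the slides, while the helper name `render_overrides` and the example values of 200 and 600 are hypothetical choices, not recommendations from the original talk.

```python
# Throttle and timeout defaults as listed on the slides.
SLIDE_DEFAULTS = {
    "GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE": 1000,
    "GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE": 10,
    "GRIDMANAGER_MAX_PENDING_REQUESTS": 50,
    "GRIDMANAGER_GAHP_CALL_TIMEOUT": 300,
}


def render_overrides(overrides):
    """Return config lines for settings that differ from the slide defaults.

    Raises KeyError for names that are not among the four settings above,
    which catches typos such as a stray space inside a macro name.
    """
    unknown = sorted(set(overrides) - set(SLIDE_DEFAULTS))
    if unknown:
        raise KeyError(f"unknown setting(s): {', '.join(unknown)}")
    lines = []
    for name in SLIDE_DEFAULTS:  # iterate in a stable, documented order
        if name in overrides and overrides[name] != SLIDE_DEFAULTS[name]:
            lines.append(f"{name} = {overrides[name]}")
    return "\n".join(lines)


# Hypothetical example: more parallel GAHP commands and a longer timeout.
print(render_overrides({
    "GRIDMANAGER_MAX_PENDING_REQUESTS": 200,
    "GRIDMANAGER_GAHP_CALL_TIMEOUT": 600,
}))
```

Raising the timeout together with the pending-request limit follows the advice on the slide; the right values depend on the site and are not something this sketch can determine.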

Network Connectivity
› Outbound connections only for most job types
› GRAM requires incoming connections
    • Need 2 open ports per pair