HTCondor’s Grid Universe Jaime Frey Center for High Throughput Computing Department of Computer Sciences University of Wisconsin-Madison

HTCondor-G(rid) › HTCondor for the grid  Same job management capabilities as a local HTCondor pool  Use other scheduling systems’ resources

Job Management Interface › Local, persistent job queue › Job policy expressions › Job activity logs › Workflows (DAGMan) › Fault tolerance
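
For illustration only, a minimal submit-file sketch showing a job activity log and a hold policy expression (the file names and the 6-hour threshold are made up, not from the slides):

  executable    = my_app
  log           = my_app.log                                      # job activity log
  periodic_hold = (JobStatus == 1) && (time() - QDate > 6*3600)   # example policy: hold if idle more than 6 hours
  queue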

“Grid” Universe › All handled in your submit file › Supports a number of “back end” types:  Globus GRAM  CREAM  NorduGrid ARC  HTCondor  PBS  LSF  SLURM  UNICORE  EC2
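
As a sketch (host names borrowed from the grid_resource examples later in these slides), a grid-universe submit file looks like an ordinary one plus a grid_resource line naming the back end:

  universe      = grid
  grid_resource = gt5 portal.foo.edu/jobmanager-pbs   # Globus GRAM 5 back end
  executable    = analyze
  output        = analyze.out
  error         = analyze.err
  log           = analyze.log
  queue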

“Grid” is a Misnomer › Grid universe is used any time a job is submitted to a different job management/queuing system  Grid services: GRAM, CREAM, ARC  A different HTCondor schedd  Local batch system: PBS, LSF, SLURM, SGE  Cloud service: EC2, Google Compute Engine

Gridmanager Daemon › Runs under the schedd › Similar to the shadow › Handles all management of grid jobs › Single instance manages all grid jobs for a user

Grid ASCII Helper Protocol (GAHP) › Runs under the gridmanager › Encapsulates grid client libraries in a separate process › Simple ASCII protocol › Lets the gridmanager use client libraries that can’t be linked with it directly

How It Works (diagram sequence): a Condor-G schedd on the submit machine holds 600 grid jobs; it starts a Gridmanager, which starts a GAHP; the GAHP submits the jobs to the grid resource (here a CREAM CE in front of an LSF cluster), where the user job eventually runs.

Differences from Normal HTCondor Jobs › Vanilla universe  Resource acquisition  Job sent directly to a machine that can start running it immediately › Grid universe  Job delegation  Job sent to an alternate scheduling system  Job may sit idle while resources are available elsewhere

Differences from Normal HTCondor Jobs › No matchmaking  Specify destination with GridResource  No Requirements, Rank  Resource requests often ignored › Run-time features unavailable  condor_ssh_to_job  condor_tail  condor_chirp

Differences from Normal HTCondor Jobs › Information about job execution often lacking  Job exit code  Runtime  Resource usage › Usually must name output files

Grid Job Attributes › GridResource = “ …” › Specify service type and server location › Examples  condor submit.foo.edu cm.foo.edu  gt5 portal.foo.edu/jobmanager-pbs  cream creamce.foo.edu/cream-pbs-glow  nordugrid arc.foo.edu  batch pbs
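
In a submit file these values go on the grid_resource command, which the schedd stores in the job ad as GridResource; a sketch using the examples above:

  grid_resource = condor submit.foo.edu cm.foo.edu   # remote HTCondor schedd and its central manager
  # or
  grid_resource = batch pbs                          # local PBS via the batch GAHP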

Grid Job Attributes › GridJobId  Job ID from remote server › GridJobStatus  Job status from remote server › LastRemoteStatusUpdate  Time job status last checked › GridResourceUnavailableTime  Time when remote server went down
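
These attributes can be inspected with condor_q; for example (the job ID 1234.0 is hypothetical):

  condor_q 1234.0 -af GridJobId GridJobStatus LastRemoteStatusUpdate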

Network Connectivity › Outbound connections only for most job types › GRAM requires incoming connections  Need 2 open ports per pair
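
One way to accommodate GRAM’s incoming connections through a firewall is to pin HTCondor to a known port range in condor_config; a sketch (the range itself is arbitrary):

  # condor_config on the Condor-G submit host
  LOWPORT  = 20000
  HIGHPORT = 20050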

Authentication › GAHP acts as the user with the remote service › Destination service is local  UID-based authentication  E.g. PBS, LSF, SLURM, Condor › Destination service is remote  X.509 proxy (possibly with VOMS attributes)  Automatically forward refreshed proxy  E.g. Condor, GRAM, CREAM, ARC
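
For remote services, the proxy is named in the submit file; a sketch (the proxy path is illustrative):

  x509userproxy = /tmp/x509up_u1234   # X.509 proxy, possibly with VOMS attributes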

HELD Status › Jobs will be held when HTCondor needs help with an error › On release, HTCondor will retry › The reason for the hold will be saved in the job ad and user log

Hold Reason › condor_q -held jfrey 2/13 13:58 CREAM_Delegate Error: Received NULL fault; › cat job.log 012 ( ) 02/13 13:58:38 Job was held. CREAM_Delegate Error: Received NULL fault; the error is due to another cause… › condor_q -format '%s\n' HoldReason CREAM_Delegate Error: Received NULL fault; the error is due to another cause…
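
Once the underlying problem is fixed (e.g. the proxy is renewed), held jobs can be retried by hand or automatically from the submit file; a sketch (the job ID and retry limit are illustrative):

  condor_release 1234.0                     # retry one held job
  condor_release jfrey                      # or release all of a user's held jobs
  # automatic retry, up to 5 system holds:
  periodic_release = (NumSystemHolds < 5)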

Common Errors › Authentication  Hold reason may be misleading  User may not be authorized by CE  Condor-G may not have access to all Certificate Authority files  User’s proxy may have expired

Common Errors › CE no longer knows about job  CE admin may forcibly remove job files  Condor-G is obsessive about not leaving orphaned jobs  May need to take extra steps to convince Condor-G that remote job is gone

Thank You!

Throttles and Timeouts › Limits that prevent Condor-G or CEs from being overwhelmed by large numbers of jobs › Defaults are fairly conservative

Throttles and Timeouts › GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 1000  You can increase to 10,000 or more › GRIDMANAGER_JOB_PROBE_INTERVAL  Default is 60 seconds  Can decrease, but not recommended

Throttles and Timeouts › GRIDMANAGER_MAX_PENDING_REQUESTS = 50  Number of commands sent to a GAHP in parallel  Can increase to a couple hundred › GRIDMANAGER_GAHP_CALL_TIMEOUT = 300  Time after which a GAHP command is considered failed  May need to lengthen if pending requests is increased
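
Put together, a condor_config fragment on the Condor-G submit host might look like this (the values are just examples within the ranges discussed above):

  GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10000
  GRIDMANAGER_JOB_PROBE_INTERVAL = 60          # keep the default probe interval
  GRIDMANAGER_MAX_PENDING_REQUESTS = 200
  GRIDMANAGER_GAHP_CALL_TIMEOUT = 600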