Jaime Frey, Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison OGF 19 Condor Software Forum Condor-G
What Is It? Condor-G is a specialization of Condor. It is also known as the grid universe. Condor-G speaks many different job management protocols. Condor-G benefits from all the wonderful Condor features, like a real job queue.
Grid Fault-Tolerance Condor-G does whatever it takes to run your jobs, even if … Your local machine crashes The grid service is temporarily unavailable The network goes down
Remote Resource Access: Globus [Diagram: globusrun in Organization A sends myjob … over the Globus GRAM protocol to a Globus JobManager in Organization B, which starts the job with fork().]
Globus + Condor [Diagram: globusrun in Organization A sends myjob … over the Globus GRAM protocol to a Globus JobManager in Organization B, which submits the job to a Condor pool.]
Condor-G + Globus + Condor [Diagram: Condor-G in Organization A queues myjob1, myjob2, myjob3, myjob4, myjob5, … and sends them over the Globus GRAM protocol to a Globus JobManager in Organization B, which submits them to a Condor pool.]
Condor-G Fault-Tolerance: Lost Contact with Remote Jobmanager Can we reconnect to jobmanager? Yes – network was down. No – machine crashed or job completed. Can we contact gatekeeper? Yes – jobmanager crashed: restart jobmanager. No – retry until we can talk to gatekeeper again… Has job completed? Yes – update queue. No – is job still running?
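The recovery flowchart on this slide can be read as a small decision function. The sketch below is one plausible reading of the flattened diagram, not Condor-G's actual code; the names `Step`, `on_lost_contact`, and `after_restart` are hypothetical, and the two reachability checks are passed in as plain booleans.

```python
from enum import Enum


class Step(Enum):
    """Outcomes named on the slide (wording lightly paraphrased)."""
    RESUME_MONITORING = "network was down; keep monitoring the job"
    RESTART_JOBMANAGER = "jobmanager crashed; restart it"
    RETRY_GATEKEEPER = "retry until the gatekeeper answers again"
    UPDATE_QUEUE = "job completed; update the local queue"
    CHECK_STILL_RUNNING = "job not completed; check whether it is still running"


def on_lost_contact(jobmanager_reachable: bool, gatekeeper_reachable: bool) -> Step:
    """Decide what to do after contact with the remote jobmanager is lost."""
    if jobmanager_reachable:
        # Reconnecting works, so the outage was only a network problem.
        return Step.RESUME_MONITORING
    if gatekeeper_reachable:
        # The machine is up but the jobmanager is gone: start a new one.
        return Step.RESTART_JOBMANAGER
    # Nothing answers: the machine or network is down, so keep retrying.
    return Step.RETRY_GATEKEEPER


def after_restart(job_completed: bool) -> Step:
    """Once a jobmanager is running again, ask it about the job."""
    return Step.UPDATE_QUEUE if job_completed else Step.CHECK_STILL_RUNNING
```

In this reading the job itself is never resubmitted during recovery; only the jobmanager that watches it is restarted.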
Just to be fair… The gatekeeper doesn't have to submit to a Condor pool. It could be PBS, LSF, Sun Grid Engine… Condor-G will work fine whatever the remote batch system is.
Other Condor-G Features Other Grid Protocols: works with WS-GRAM, NorduGrid, Unicore. Credential Management: pull refreshed credentials from MyProxy; push refreshed credentials to remote systems. Job Scheduling: use Matchmaking to select resources for jobs. GlideIn: allows late binding of resources and job checkpoint/migration.
Grid Universe Fault-Tolerance: Credential Management Authentication in many grid protocols is done with limited-lifetime X.509 proxies. A proxy may expire before jobs finish executing. Condor can put jobs on hold and ask the user to refresh the proxy. Condor can automatically retrieve new proxies from MyProxy. When the proxy is refreshed, Condor forwards it to the jobs.
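The credential handling described on this slide amounts to a threshold check on the proxy's remaining lifetime. The following is a minimal sketch of that idea only; the function name `credential_action`, its parameters, and the returned labels are hypothetical and do not correspond to Condor internals.

```python
def credential_action(seconds_left: int, refresh_threshold: int, have_myproxy: bool) -> str:
    """Pick an action for a proxy with `seconds_left` of lifetime remaining.

    `refresh_threshold` is how close to expiry (in seconds) we let the proxy
    get before acting; both values are illustrative inputs.
    """
    if seconds_left > refresh_threshold:
        # Plenty of lifetime left: nothing to do yet.
        return "ok"
    if have_myproxy:
        # A MyProxy server is configured: fetch a fresh proxy automatically.
        return "refresh_from_myproxy"
    # No automatic source: hold the jobs and ask the user for a new proxy.
    return "hold_and_ask_user"
```

Whichever path produces a new proxy, the last step on the slide still applies: the refreshed proxy is forwarded to the jobs that are already running.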
MyProxy Submit file
  MyProxyHost = foo.edu:12345
  MyProxyServerDN = /DC=org/DC=doegrids…
  MyProxyCredentialName = proxy_file
  MyProxyRefreshThreshold = 240 #mins
  MyProxyNewProxyLifetime = 12 #hrs
  MyProxyPassword = password
Or give password on command line
  condor_submit -p password submit.desc
Condor-G Matchmaking Use Condor-G matchmaking with grid universe jobs Allows Condor-G to dynamically assign computing jobs to grid sites An example of lazy planning
Condor-G Matchmaking, cont. Normally a grid universe job must specify the site in the submit description file via the grid_resource attribute, like so:
  Executable = foo
  Universe = grid
  Grid_Resource = gt2 beak.cs.wisc.edu/jobmanager-pbs
  Queue
Condor-G Matchmaking, cont. With matchmaking, grid universe jobs can use requirements and rank:
  Executable = foo
  Universe = grid
  Grid_Resource = $$(ResourceName)
  Requirements = OpSys == "LINUX"
  Rank = NumberOfNodes * random()
  Queue
The $$(x) syntax inserts information from the target ClassAd when a match is made.
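To make the requirements, rank, and $$(x) steps concrete, here is a toy model of a single match in Python. It is a sketch under simplifying assumptions, not the Condor matchmaker: resource ads are plain dictionaries, the requirements and rank expressions are ordinary Python callables, and the helper names `match` and `expand` are hypothetical.

```python
import re


def match(job, resource_ads):
    """Return the resource ad that satisfies the job and ranks highest, or None."""
    candidates = [ad for ad in resource_ads if job["requirements"](ad)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


def expand(template, target_ad):
    """Replace each $$(Name) in `template` with the value of Name in the target ad."""
    return re.sub(r"\$\$\((\w+)\)", lambda m: str(target_ad[m.group(1)]), template)


# Illustrative ads; the host names are made up.
ads = [
    {"ResourceName": "gt2 small.example.org/jobmanager-pbs", "OpSys": "LINUX", "NumberOfNodes": 4},
    {"ResourceName": "gt2 big.example.org/jobmanager-pbs", "OpSys": "LINUX", "NumberOfNodes": 64},
    {"ResourceName": "gt2 other.example.org/jobmanager-lsf", "OpSys": "SOLARIS", "NumberOfNodes": 128},
]
job = {
    "grid_resource": "$$(ResourceName)",
    "requirements": lambda ad: ad.get("OpSys") == "LINUX",
    # The slide multiplies by random() to spread jobs out; omitted here so the
    # result is deterministic.
    "rank": lambda ad: ad["NumberOfNodes"],
}

best = match(job, ads)
resolved = expand(job["grid_resource"], best)
```

Here `resolved` is the grid_resource string of the largest Linux site, which is the value a job would carry once the match has been made.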
Condor-G Matchmaking, cont. Where do these target ClassAds representing Globus gatekeepers come from? Several options: Simple script on gatekeeper publishes an ad via condor_advertise command-line utility (method used by D0 JIM, USCMS) Program to query Globus MDS and convert information into ClassAd (method used by EDG) Run HawkEye with appropriate plugins on the gatekeeper For explanation of Condor-G matchmaking setup for USCMS, see
Condor-G Matchmaking: Creating the Resource Ad Advertising a resource:
  condor_advertise UPDATE_STARTD_AD ad-file
Call periodically. Use Unix time for UpdateSequenceNumber.
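A script that regenerates the ad file before each condor_advertise call might look like the sketch below. The attribute set is an assumption chosen to line up with the matchmaking slides (ResourceName, NumberOfNodes); a real resource ad will likely need more attributes, and the function name `build_resource_ad` and the example host are hypothetical.

```python
import time


def build_resource_ad(name, resource, nodes, now=None):
    """Return ClassAd text for one gatekeeper, stamped with the current Unix time."""
    # Unix time always increases, so each new ad supersedes the previous one.
    sequence = int(time.time() if now is None else now)
    lines = [
        'MyType = "Machine"',
        f'Name = "{name}"',
        f'ResourceName = "{resource}"',
        f"NumberOfNodes = {nodes}",
        f"UpdateSequenceNumber = {sequence}",
    ]
    return "\n".join(lines) + "\n"


ad_text = build_resource_ad("site-a", "gt2 gk.example.org/jobmanager-pbs", 16)
```

Writing `ad_text` to `ad-file` and running the condor_advertise command from the slide on a schedule (for example from cron) would cover the "call periodically" step.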
But Wait, There's More… What if you want to run standard universe jobs on grid resources? For matchmaking and dynamic scheduling of jobs For job checkpointing and migration For remote system calls What if you don't want to send a job to a site until the moment the job will start running (late binding)?
One Solution: Condor-G GlideIn You can use the Grid Universe to run Condor daemons on grid resources When the resources run these GlideIn jobs, they will temporarily join your Condor Pool You can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the grid resources
[Diagram: a personal Condor on your workstation holds 600 Condor jobs. Some run on a friendly Condor pool; glide-in jobs go through the Globus grid to PBS, LSF, and Condor sites, which then join your Condor pool.]
GlideIn Concerns What if a grid resource kills my GlideIn job? That resource will disappear from your pool and your jobs will be rescheduled on other machines. Standard universe jobs will resume from their last checkpoint as usual. What if all my jobs are completed before a GlideIn job runs? If a GlideIn Condor daemon is not matched with a job within 10 minutes, it terminates, freeing the resource.
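The idle shutdown rule can be pictured as a simple timer check. This is only an illustration of the 10-minute rule stated on the slide; the constant and function names are hypothetical, and the real daemon's bookkeeping is not shown in the source.

```python
IDLE_LIMIT_SECONDS = 10 * 60  # the 10-minute limit from the slide


def glidein_should_exit(now, last_matched_at, idle_limit=IDLE_LIMIT_SECONDS):
    """True once the glided-in daemon has gone `idle_limit` seconds without a match.

    `last_matched_at` is the time of the most recent match, or the daemon's
    start time if it has never been matched.
    """
    return now - last_matched_at >= idle_limit
```

A daemon that starts at time 0 and never receives a job would therefore give the resource back at the 600-second mark.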