Jaime Frey, Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison OGF.


1 Jaime Frey, Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison OGF 19 Condor Software Forum Condor-G

2 What Is It? Condor-G is a specialization of Condor. It is also known as the grid universe. Condor-G speaks many different job management protocols. Condor-G benefits from all the wonderful Condor features, like a real job queue.

3 Grid Fault-Tolerance
    Condor-G does whatever it takes to run your jobs, even if …
        Your local machine crashes
        The grid service is temporarily unavailable
        The network goes down

4 Remote Resource Access: Globus globusrun myjob … Globus GRAM Protocol Globus JobManager fork() Organization A Organization B

5 Globus Globus GRAM Protocol Globus JobManager fork() Organization A Organization B globusrun myjob …

6 Globus + Condor Globus GRAM Protocol Globus JobManager Submit to Condor Condor Pool Organization A Organization B globusrun myjob …

7 Globus + Condor globusrun … Globus GRAM Protocol Globus JobManager Submit to Condor Condor Pool Organization A Organization B

8 Condor-G + Globus + Condor Globus GRAM Protocol Globus JobManager Submit to Condor Condor Pool Organization A Organization B Condor-G myjob1 myjob2 myjob3 myjob4 myjob5 …

9 Condor-G Fault-Tolerance: Lost Contact with Remote Jobmanager
    Can we contact gatekeeper?
        Yes: jobmanager crashed
        No: retry until we can talk to gatekeeper again…
    Can we reconnect to jobmanager?
        Yes: network was down
        No: machine crashed or job completed
    Restart jobmanager
    Has job completed?
        Yes: update queue
        No: is job still running?
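The recovery flow on this slide can be sketched as a small decision function. This is only one plausible reading of the flattened diagram, not Condor-G's actual gridmanager code; the function name, its three boolean inputs, and the step strings are hypothetical.

```python
from typing import List


def recovery_steps(gatekeeper_up: bool,
                   jobmanager_reconnects: bool,
                   job_completed: bool) -> List[str]:
    """Return the steps to take after losing contact with a remote jobmanager.

    Assumed reading of the slide: the three flags stand for the answers to
    its three questions, asked in order.
    """
    if not gatekeeper_up:
        # Gatekeeper unreachable: nothing else can be learned yet.
        return ["retry until gatekeeper answers"]
    if jobmanager_reconnects:
        # The jobmanager is still there, so only the network was down.
        return ["resume monitoring"]
    # The jobmanager is gone: it crashed, the machine crashed,
    # or the job completed. Restart it and ask about the job.
    steps = ["restart jobmanager"]
    if job_completed:
        steps.append("update queue")
    else:
        steps.append("keep monitoring running job")
    return steps


if __name__ == "__main__":
    print(recovery_steps(True, False, True))
```

The function is pure on purpose: the probing of the gatekeeper and jobmanager is left to the caller, so the decision logic can be checked without a grid site.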

10 Just to be fair… The gatekeeper doesn't have to submit to a Condor pool. It could be PBS, LSF, Sun Grid Engine… Condor-G will work fine whatever the remote batch system is.

11 Other Condor-G Features
    Other Grid Protocols
        Works with WS-GRAM, NorduGrid, Unicore
    Credential Management
        Pull refreshed credentials from MyProxy
        Push refreshed credentials to remote systems
    Job Scheduling
        Use Matchmaking to select resources for jobs
    GlideIn
        Allows late binding of resources and job checkpoint/migration

12 Condor-G Condor-G Job Description (Job ClassAd) GT2 [.1|2|4] HTTPS Condor PBS/LSF NorduGrid GT4 WSRF Unicore

13 Pre-WS GRAM Submit file
    grid_resource = gt2 \
        foo.edu/jobmanager-pbs
    globus_rsl = (queue=long)\
        (condor_submit=(universe java))

14 OGSA GRAM Submit file
    grid_resource = gt3 ogsa/services/base/gram/\
        PBSManagedJobFactoryService
    globus_rsl = (queue=long)\
        (condor_submit=(universe java))
    Museum mode

15 WS GRAM Submit file
    grid_resource = gt4 foo.edu PBS
    globus_xml = long

16 NorduGrid Submit file
    grid_resource = nordugrid foo.edu
    nordugrid_rsl = (queue=long)

17 Unicore Submit file
    grid_resource = unicore usite.org vsite
    keystore_file = keystore
    keystore_passphrase_file = keystore.pw
    keystore_alias = my cert

18 Condor Submit file
    grid_resource = condor schedd.foo.edu \
        cm.foo.edu
    remote_universe = java

19 PBS Submit file grid_resource = pbs

20 LSF Submit file grid_resource = lsf

21 Grid Universe Fault-Tolerance: Credential Management
    Authentication in many grid protocols is done with limited-lifetime X.509 proxies
    Proxy may expire before jobs finish executing
    Condor can put jobs on hold and ask the user to refresh the proxy
    Condor can automatically retrieve new proxies from MyProxy
    When the proxy is refreshed, Condor forwards it to the jobs

22 MyProxy Submit file
    MyProxyHost = foo.edu:12345
    MyProxyServerDN = /DC=org/DC=doegrids…
    MyProxyCredentialName = proxy_file
    MyProxyRefreshThreshold = 240 #mins
    MyProxyNewProxyLifetime = 12 #hrs
    MyProxyPassword = password
    Or give password on command line:
        condor_submit -p password submit.desc
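The credential handling described on the two slides above comes down to a threshold check on the proxy's remaining lifetime. The sketch below is a hypothetical illustration of that decision, not Condor's implementation; the function name, the returned labels, and the rule that a job is held only when no MyProxy server is configured are all assumptions. The 240-minute default follows the slide's own annotation.

```python
def proxy_action(minutes_left: float,
                 refresh_threshold: float = 240,
                 myproxy_configured: bool = True) -> str:
    """Decide what to do about a job's X.509 proxy (illustrative only)."""
    if minutes_left > refresh_threshold:
        # Plenty of lifetime left: nothing to do yet.
        return "ok"
    if myproxy_configured:
        # Fetch a fresh proxy, then forward it to the remote jobs.
        return "refresh_from_myproxy"
    # Assumption: without MyProxy, the job is held until the user
    # refreshes the proxy by hand.
    return "hold_job"


if __name__ == "__main__":
    for left in (600, 240, 30):
        print(left, proxy_action(left))
```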

23 Condor-G Matchmaking Use Condor-G matchmaking with grid universe jobs Allows Condor-G to dynamically assign computing jobs to grid sites An example of lazy planning

24 Condor-G Matchmaking, cont.
    Normally a grid universe job must specify the site in the submit description file via the grid_resource attribute, like so:
        Executable = foo
        Universe = grid
        Grid_Resource = gt2 \
            beak.cs.wisc.edu/jobmanager-pbs
        queue

25 Condor-G Matchmaking, cont.
    With matchmaking, grid universe jobs can use requirements and rank:
        Executable = foo
        Universe = grid
        Grid_Resource = $$(ResourceName)
        Requirements = arch == "LINUX"
        Rank = NumberOfNodes * random()
        Queue
    The $$(x) syntax inserts information from the target ClassAd when a match is made.
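As a rough illustration of what requirements, rank, and $$(x) do together, the following sketch filters a list of resource ads, picks the highest-ranked one, and substitutes its attributes into a template. It is a toy model, not the Condor matchmaker: ads are plain dictionaries, the helper names are hypothetical, the rank omits random() so the result is deterministic, and only the job's side of the match is evaluated (real matchmaking also checks the resource's own Requirements).

```python
import re
from typing import Callable, Dict, List, Optional

Ad = Dict[str, object]


def best_match(requirements: Callable[[Ad], bool],
               rank: Callable[[Ad], float],
               resource_ads: List[Ad]) -> Optional[Ad]:
    """Return the acceptable resource ad with the highest rank, if any."""
    candidates = [ad for ad in resource_ads if requirements(ad)]
    return max(candidates, key=rank) if candidates else None


def expand(template: str, target: Ad) -> str:
    """Replace each $$(Name) with the value of Name in the target ad."""
    return re.sub(r"\$\$\((\w+)\)",
                  lambda m: str(target[m.group(1)]),
                  template)


if __name__ == "__main__":
    ads = [
        {"Arch": "LINUX", "NumberOfNodes": 300,
         "ResourceName": "gt4 foo.edu PBS"},
        {"Arch": "LINUX", "NumberOfNodes": 50,
         "ResourceName": "gt2 bar.example/jobmanager-pbs"},  # made-up site
    ]
    chosen = best_match(lambda ad: ad["Arch"] == "LINUX",
                        lambda ad: ad["NumberOfNodes"], ads)
    print(expand("$$(ResourceName)", chosen))
```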

26 Condor-G Matchmaking, cont.
    Where do these target ClassAds representing Globus gatekeepers come from? Several options:
        Simple script on gatekeeper publishes an ad via condor_advertise command-line utility (method used by D0 JIM, USCMS)
        Program to query Globus MDS and convert information into ClassAd (method used by EDG)
        Run HawkEye with appropriate plugins on the gatekeeper
    For explanation of Condor-G matchmaking setup for USCMS, see

27 Condor-G Matchmaking: Creating the Resource Ad
    Machine Ad
        MyType = "Machine"
        TargetType = "Job"
        Name = "foo.edu"
        Machine = "foo.edu"
        ResourceName = "gt4 foo.edu PBS"
        UpdateSequenceNumber = 4
        Requirements = TARGET.JobUniverse == 9 && \
            CurMatches < 10
        CurMatches = 0
        NumberOfNodes = 300
        Rank = 0.0
        CurrentRank = 0.0
        WantAdRevaluate = True

28 Condor-G Matchmaking: Creating the Resource Ad
    Advertising a resource
        condor_advertise UPDATE_STARTD_AD \
            ad-file
    Call periodically
    Use unix time for UpdateSequenceNumber
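A script that publishes such an ad periodically might look roughly like the sketch below, which renders the ad text with the current unix time as UpdateSequenceNumber and builds the condor_advertise argument list shown on the slide. The helper names and the split into string-valued and expression-valued attributes are assumptions made for illustration; the command is only constructed here, not executed.

```python
import time
from typing import Dict, List, Optional


def render_ad(strings: Dict[str, str],
              expressions: Dict[str, str],
              now: Optional[float] = None) -> str:
    """Render a resource ad as text, one 'Name = value' pair per line."""
    seq = int(time.time() if now is None else now)
    # String attributes are quoted; expressions and numbers are not.
    lines = [f'{name} = "{value}"' for name, value in strings.items()]
    lines += [f"{name} = {value}" for name, value in expressions.items()]
    # Unix time grows on every call, so each update supersedes the last.
    lines.append(f"UpdateSequenceNumber = {seq}")
    return "\n".join(lines) + "\n"


def advertise_command(ad_path: str) -> List[str]:
    """Argument list for the command shown on the slide."""
    return ["condor_advertise", "UPDATE_STARTD_AD", ad_path]


if __name__ == "__main__":
    ad = render_ad(
        {"MyType": "Machine", "TargetType": "Job", "Name": "foo.edu",
         "ResourceName": "gt4 foo.edu PBS"},
        {"Requirements": "TARGET.JobUniverse == 9 && CurMatches < 10",
         "CurMatches": "0", "NumberOfNodes": "300"},
    )
    print(ad)
    print(" ".join(advertise_command("ad-file")))
```

In practice the rendered text would be written to the ad file and the command run from a scheduler such as cron, which covers the "call periodically" point.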

29 But Wait, There's More…
    What if you want to run standard universe jobs on grid resources?
        For matchmaking and dynamic scheduling of jobs
        For job checkpointing and migration
        For remote system calls
    What if you don't want to send a job to a site until the moment the job will start running (late binding)?

30 One Solution: Condor-G GlideIn You can use the Grid Universe to run Condor daemons on grid resources When the resources run these GlideIn jobs, they will temporarily join your Condor Pool You can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the grid resources

31 your workstation Friendly Condor Pool personal Condor 600 Condor jobs Globus Grid PBS LSF Condor Condor Pool glide-in jobs

32 GlideIn Concerns
    What if a grid resource kills my GlideIn job?
        That resource will disappear from your pool and your jobs will be rescheduled on other machines
        Standard universe jobs will resume from their last checkpoint like usual
    What if all my jobs are completed before a GlideIn job runs?
        If a GlideIn Condor daemon is not matched with a job in 10 minutes, it terminates, freeing the resource
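The idle rule in the last point amounts to a simple timeout check. The sketch below is a hypothetical illustration of that rule only; the names are made up, and the actual GlideIn daemons are configured through Condor rather than written this way.

```python
IDLE_LIMIT_SECONDS = 10 * 60  # the 10-minute limit quoted on the slide


def should_terminate(last_matched_at: float,
                     now: float,
                     idle_limit: float = IDLE_LIMIT_SECONDS) -> bool:
    """True once the daemon has gone idle_limit seconds without a match.

    Both timestamps are in seconds on the same clock. For a daemon that
    has never been matched, last_matched_at is its start time.
    """
    return now - last_matched_at >= idle_limit


if __name__ == "__main__":
    started = 0.0
    for t in (60.0, 599.0, 600.0):
        print(t, should_terminate(started, t))
```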

33 Condor schedd (Job caretaker) condor_submit matchmaker Startd (Runs job)

34 Condor-G schedd (Job caretaker) condor_submit gridmanager gahp Globus gatekeeper PBS or LSF

35 Condor-C schedd (Job caretaker) condor_submit gridmanager condor-gahp schedd matchmaker startd

36 Condor-C to non-Condor schedd (Job caretaker) condor_submit gridmanager condor-gahp schedd gridmanager pbs/lsf-gahp PBS or LSF

37 Gliding in Condor-C schedd (Job caretaker) condor_submit gridmanager pbs/lsf-gahp PBS or LSF condor-gahp gahp Globus gatekeeper schedd 1. Glide-in 2. Submit jobs

38 Matchmaking with Condor-C In all of these examples, Condor-C went to a specific remote schedd This is not required: you can do matchmaking

39 Matchmaking with Condor-C schedd (Job caretaker) condor_submit gridmanager condor-gahp matchmaker schedd … submit job
