Alain Roy Computer Sciences Department University of Wisconsin-Madison 25-June-2002 Using Condor on the Grid.

1 Alain Roy Computer Sciences Department University of Wisconsin-Madison roy@cs.wisc.edu http://www.cs.wisc.edu/condor 25-June-2002 Using Condor on the Grid

2 Good evening!
› Thank you for having me!
› I am:
  • Alain Roy
  • Computer Science Ph.D. in Quality of Service, with Globus Project
  • Working with the Condor Project
› This is the last of three Condor tutorials

3 Review: What is Condor?
› Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing facility.
  • Run lots of jobs over a long period of time
  • Not a short burst of “high-performance” computing
› Condor manages both machines and jobs with ClassAd Matchmaking to keep everyone happy

4 Condor Takes Care of You
› Condor does whatever it takes to run your jobs, even if some machines…
  • Crash (or are disconnected)
  • Run out of disk space
  • Don’t have your software installed
  • Are frequently needed by others
  • Are far away & managed by someone else

5 What is Unique about Condor?
› ClassAds
› Transparent checkpoint/restart
› Remote system calls
› Works in heterogeneous clusters
› Clusters can be:
  • Dedicated
  • Opportunistic

6 What’s Condor Good For?
› Managing a large number of jobs
› Robustness
  • Checkpointing
  • Persistent Job Queue
› Ability to access more resources
› Flexible policies to control usage on your pool

7 A Bit of Condor Philosophy
› Condor brings more computing to everyone
  • A small-time scientist can make an opportunistic pool with 10 machines, and get 10 times as much computing done.
  • A large collaboration can use Condor to control its dedicated pool with hundreds of machines.

8 Condor’s Idea
Computing power is everywhere; we try to make it usable by anyone.

9 Condor and the Grid
› The Grid provides:
  • Uniform, dependable, consistent, pervasive, and inexpensive computing. Hopefully.
› Condor wants to make computing power usable by everyone

10 This Must Be a Match Made in Heaven!
[Figure: Condor + the Grid]

11 Remember Frieda?
Today we’ll revisit Frieda’s Condor/Grid explorations in more depth

12 First, A Review of Globus
› Globus isn’t “The Grid”, but it provides a lot of commonly used technologies for building Grids.
› Globus is a toolkit: pick the pieces you wish to use
› Globus implements standard Grid protocols and APIs

13 Globus Toolkit Pieces
› Security: Grid Security Infrastructure
› Resource Management: GRAM
  • Submit and monitor jobs
› Information services
› Data Transfer: GridFTP

14 Grid Security Infrastructure
› Authentication and authorization
› Certificate authorities
› Single sign-on
› Usually public-key authentication
› Can work with Kerberos

15 Resource Management
› Single method for submitting jobs
› Multiple backends for running jobs
  • Fork
  • Condor
  • PBS/LSF/…

16 Information Services
› LDAP-based
  • Easy to access with standard clients
› Implements standard schemas for representing resources

17 Data Transfer
› GridFTP
  • Uses GSI authentication
  • High-performance through parallel and striped transfers
  • Quickly becoming widely used

18 Where does Condor Fit In?
› Condor back-end for GRAM
  • Submit Globus jobs
  • They run in your Condor pool
› Condor-G submits to Globus resources
  • Provides reliability and monitoring beyond standard Globus mechanisms
› Can be used together!
› We’ll describe both of these.

19 Condor back-end for GRAM
› GRAM uses a job manager to control jobs
  • Globus comes with a Condor job manager
  • Easy to configure with setup-globus-gram-jobmanager
› Users can configure Condor behavior with RSL when submitting jobs:
  • jobtype: configures universe (vanilla/standard)
  • Constructs Condor submit file and submits to Condor pool
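As a rough illustration of the last step on this slide (building a Condor submit file from RSL attributes), here is a minimal Python sketch. It is not the Globus job manager's code. The function name, the `executable`, `arguments`, and `count` keys, the vanilla default, and the idea that the `jobtype` value names the universe directly are all assumptions made for illustration; the real job manager may map values differently.

```python
def build_submit_description(rsl):
    """Render a Condor submit description from RSL-style attributes.

    Hypothetical mapping: the jobtype value is used as the universe name.
    """
    universe = rsl.get("jobtype", "vanilla")  # assumed default
    if universe not in ("vanilla", "standard"):
        raise ValueError(f"unsupported jobtype: {universe!r}")

    lines = [
        f"universe = {universe}",
        f"executable = {rsl['executable']}",
    ]
    if "arguments" in rsl:
        lines.append(f"arguments = {rsl['arguments']}")
    # "queue N" asks Condor to queue N instances of the job.
    lines.append(f"queue {rsl.get('count', 1)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_submit_description(
        {"executable": "/bin/hostname", "jobtype": "standard"}
    ))
```

The text this produces is the kind of file a job manager would then hand to the Condor pool on the user's behalf.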

20 I have 600 simulations to run. Where can I get help?

21 Frieda…
› Installed personal Condor
› Made a larger Condor pool
› Added dedicated nodes
› Added Grid resources
› We talked about the first three steps in detail earlier.

22 Frieda Goes to the Grid!
› First Frieda takes advantage of her Condor friends!
› She knows people with their own Condor pools, and gets permission to access their resources
› She then configures her Condor pool to “flock” to these pools

23 [Figure: your workstation running a personal Condor with 600 Condor jobs, a Condor Pool, and a Friendly Condor Pool]

24 How Flocking Works
› Add a line to your condor_config:
  FLOCK_TO = Friendly-Pool
  FLOCK_FROM = Friedas-Pool
[Figure: the Schedd on the Submit Machine, the Collector and Negotiator on its Central Manager (CONDOR_HOST), and the Collector and Negotiator on the Friendly-Pool Central Manager]

25 Condor Flocking
› Remote pools are contacted in the order specified until jobs are satisfied
› The list of remote pools is a property of the Schedd, not the Central Manager
  • Different users can Flock to different pools
  • Remote pools can allow specific users
› User-priority system is “flocking-aware”
  • A pool’s local users can have priority over remote users “flocking” in.
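The "contacted in the order specified" rule can be pictured with a small Python sketch. This is a deliberately simplified model that only counts free slots: real Condor placement goes through ClassAd matchmaking and user priorities. The function name, the pool name "Other-Pool", and all slot counts are hypothetical.

```python
def assign_jobs(idle_jobs, local_free, flock_to, remote_free):
    """Place jobs locally first, then in each remote pool in FLOCK_TO order.

    Returns (placement, unplaced): a dict of pool name -> jobs placed there,
    and the number of jobs still waiting.
    """
    placement = {}
    remaining = idle_jobs
    for pool in ["local"] + list(flock_to):
        if remaining == 0:
            break  # all jobs satisfied, later pools are never contacted
        free = local_free if pool == "local" else remote_free.get(pool, 0)
        used = min(remaining, free)
        if used:
            placement[pool] = used
        remaining -= used
    return placement, remaining


if __name__ == "__main__":
    print(assign_jobs(
        600, 100,
        ["Friendly-Pool", "Other-Pool"],
        {"Friendly-Pool": 300, "Other-Pool": 500},
    ))
```

Because the list belongs to the Schedd, two users submitting from different machines could pass different `flock_to` lists to the same model.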

26 Condor Flocking, cont.
› Flocking is Condor-specific technology…
› Frieda also has access to Globus resources she wants to use
  • She has certificates and access to Globus gatekeepers at remote institutions
› But Frieda wants Condor’s queue management features for her Globus jobs!
› She installs Condor-G so she can submit “Globus Universe” jobs to Condor

27 Condor-G Installation: Tell it what you need…

28 … and watch it go!

29 Frieda Submits a Globus Universe Job
› In her submit description file, she specifies:
  • Universe = Globus
  • Which Globus Gatekeeper to use
  • Optional: Location of file containing your Globus certificate
  universe = globus
  globusscheduler = beak.cs.wisc.edu/jobmanager
  executable = progname
  queue
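For a batch like Frieda's 600 simulations, the submit description on this slide could be generated rather than typed. The Python sketch below only renders the text, using the same four keywords the slide shows; the helper name and the `count` parameter are assumptions for illustration.

```python
def globus_submit_file(gatekeeper, executable, count=1):
    """Return the text of a Globus universe submit description.

    gatekeeper is the value for globusscheduler, e.g. "host/jobmanager".
    """
    lines = [
        "universe = globus",
        f"globusscheduler = {gatekeeper}",
        f"executable = {executable}",
        # A bare "queue" queues one job; "queue N" queues N of them.
        "queue" if count == 1 else f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(globus_submit_file("beak.cs.wisc.edu/jobmanager", "progname", 600))
```

The resulting text could be saved to a file and passed to condor_submit.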

30-34 How It Works
[Figure, built up over five slides: a Personal Condor (Schedd) and a Globus Resource (LSF); then 600 Globus jobs are added at the Schedd, then the GridManager, then the JobManager on the Globus Resource, and finally the User Job]

35 Condor Globus Universe

36 Globus Universe Concerns
› What about Fault Tolerance?
  • Local Crashes: What if the submit machine goes down?
  • Network Outages: What if the connection to the remote Globus jobmanager is lost?
  • Remote Crashes: What if the remote Globus jobmanager crashes? What if the remote machine goes down?

37 New Fault Tolerance
› Ability to restart a JobManager
› Enhanced two-phase commit submit protocol
› Donated by Condor project to Globus 2.0

38 Globus Universe Fault-Tolerance: Submit-side Failures
› All relevant state for each submitted job is stored persistently in the Condor job queue.
› This persistent information allows the Condor GridManager upon restart to read the state information and reconnect to JobManagers that were running at the time of the crash.
› If a JobManager fails to respond…

39 Globus Universe Fault-Tolerance: Lost Contact with Remote Jobmanager
› Can we reconnect to jobmanager?
  • Yes – network was down
  • No – machine crashed or job completed
› Can we contact gatekeeper?
  • Yes – jobmanager crashed
  • No – retry until we can talk to gatekeeper again…
› Restart jobmanager
› Has job completed?
  • Yes – update queue
  • No – is job still running?
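One plausible reading of this slide's flowchart can be written as a small decision function. The Python sketch below is an interpretation of the flattened diagram, not the GridManager's actual logic: the order of the checks, the function name, the boolean inputs, and the wording of the returned actions are all assumptions.

```python
def recover(can_reconnect_jobmanager, can_contact_gatekeeper, job_completed):
    """Return the recovery actions after losing contact with a jobmanager."""
    if can_reconnect_jobmanager:
        # The jobmanager is still there, so only the network was down.
        return ["network was down: resume monitoring"]
    if not can_contact_gatekeeper:
        # Remote machine unreachable: nothing to do but wait and retry.
        return ["retry until the gatekeeper answers"]
    # Gatekeeper answers but the jobmanager does not: restart it and
    # ask the new jobmanager what happened to the job.
    actions = ["jobmanager crashed: restart jobmanager"]
    if job_completed:
        actions.append("update queue")
    else:
        actions.append("check whether job is still running")
    return actions


if __name__ == "__main__":
    print(recover(False, True, True))
```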

40 Globus Universe Fault-Tolerance: Credential Management
› Authentication in Globus is done with limited-lifetime X.509 proxies
› Proxy may expire before jobs finish executing
› Condor can put jobs on hold and email user to refresh proxy
› Todo: Interface with MyProxy…
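The hold-and-notify behaviour described here boils down to comparing the proxy's expiry time with the current time. A minimal Python sketch follows; the function name, the returned strings, and the `margin` safety window are hypothetical and do not correspond to documented Condor settings.

```python
from datetime import datetime, timedelta


def proxy_action(expires_at, now, margin=timedelta(0)):
    """Decide what to do with a user's jobs given the proxy expiry time.

    margin is a hypothetical safety window before the actual expiry.
    """
    if expires_at - now <= margin:
        return "hold jobs and email user to refresh proxy"
    return "keep running"


if __name__ == "__main__":
    now = datetime(2002, 6, 25, 12, 0)
    print(proxy_action(now + timedelta(hours=12), now))
    print(proxy_action(now - timedelta(minutes=1), now))
```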

41 But Frieda Wants More…
› She wants to run standard universe jobs on Globus-managed resources that aren’t running Condor
  • For matchmaking and dynamic scheduling of jobs
  • For job checkpointing and migration
  • For remote system calls

42 Solution: Condor GlideIn
› Frieda can use the Globus Universe to run Condor daemons on Globus resources
› When the resources run these GlideIn jobs, they will temporarily join her Condor Pool
› She can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the Globus resources

43-49 How It Works
[Figure, built up over seven slides: a Personal Condor (Schedd, Collector) with 600 Condor jobs and a Globus Resource (LSF); then GlideIn jobs are added at the Schedd, then the GridManager, then the JobManager on the Globus Resource, then a Startd on the Globus Resource, and finally the User Job]

51 GlideIn Concerns
› What if a Globus resource kills my GlideIn job?
  • That resource will disappear from your pool and your jobs will be rescheduled on other machines
  • Standard universe jobs will resume from their last checkpoint as usual
› What if all my jobs are completed before a GlideIn job runs?
  • If a GlideIn Condor daemon is not matched with a job in 10 minutes, it terminates, freeing the resource
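The 10-minute rule in the last bullet amounts to a simple idle-timeout check. The Python sketch below illustrates that check only and is not taken from the Condor daemons; the function name, the use of plain seconds for timestamps, and the choice to restart the clock after each job are assumptions.

```python
IDLE_LIMIT_SECONDS = 10 * 60  # the 10 minutes quoted on the slide


def glidein_should_exit(started_at, last_job_at, now,
                        idle_limit=IDLE_LIMIT_SECONDS):
    """Return True once the daemon has gone unmatched for idle_limit seconds.

    Times are seconds on any common clock. last_job_at is None if the
    daemon has never been matched with a job.
    """
    idle_since = started_at if last_job_at is None else last_job_at
    return now - idle_since >= idle_limit


if __name__ == "__main__":
    print(glidein_should_exit(0, None, 599))  # still within the window
    print(glidein_should_exit(0, None, 600))  # unmatched for 10 minutes
```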

52 What Have We Done on the Grid Already?
› NUG30
› USCMS Testbed

53 NUG30
› quadratic assignment problem
› 30 facilities, 30 locations
  • minimize cost of transferring materials between them
› posed in 1968 as challenge, long unsolved
› but with a good pruning algorithm & high-throughput computing...

54 NUG30 Solved on the Grid with Condor + Globus
Resources simultaneously utilized:
› the Origin 2000 (through LSF) at NCSA.
› the Chiba City Linux cluster at Argonne
› the SGI Origin 2000 at Argonne.
› the main Condor pool at Wisconsin (600 processors)
› the Condor pool at Georgia Tech (190 Linux boxes)
› the Condor pool at UNM (40 processors)
› the Condor pool at Columbia (16 processors)
› the Condor pool at Northwestern (12 processors)
› the Condor pool at NCSA (65 processors)
› the Condor pool at INFN (200 processors)

55 NUG30: Number of Workers
[Figure: plot of the number of workers]

56 NUG30 - Solved!!!
Sender: goux@dantec.ece.nwu.edu
Subject: Re: Let the festivities begin.
Hi dear Condor Team, you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days ! More stats tomorrow !!! We are off celebrating ! condor rules ! cheers, JP.
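The two figures in this email give a feel for the scale of the run. Assuming "Condor Time" means compute time summed over all workers, and taking a 365-day year and exactly seven days of wall-clock time (all three are assumptions, not statements from the email), the average number of busy workers works out as follows in Python:

```python
def average_workers(cpu_years, wall_days, days_per_year=365):
    """Average number of concurrently busy workers over the run."""
    return cpu_years * days_per_year / wall_days


if __name__ == "__main__":
    # 10.9 years of aggregate time squeezed into seven days
    print(round(average_workers(10.9, 7)))
```

Under these assumptions the result is in the high hundreds of workers on average.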

57 USCMS Testbed
› Production of CMS data
› Testbed has five sites across the US
› Condor, Condor-G, Globus, GDMP…
› A fantastic test environment for the Grid: the buck stops here!
  • Errors between systems, logging
  • Inetd confuses
  • Globus GASS cache tester

58 Questions? Comments?
› Web: www.cs.wisc.edu/condor
› Email: condor-admin@cs.wisc.edu

