Jaime Frey Computer Sciences Department University of Wisconsin-Madison What’s New in Condor-G.


1 Jaime Frey Computer Sciences Department University of Wisconsin-Madison jfrey@cs.wisc.edu http://www.cs.wisc.edu/condor What’s New in Condor-G

2 Outline
› What is Condor-G
› Released New Features
› In Development

3 What Is Condor-G
› Use Condor to run jobs on the Grid
› Uses Globus Toolkit
  - GRAM (submit a remote job)
  - GASS (transfer job’s files)
› Two components
  - Globus Universe
  - GlideIn

4 Globus Universe
› Run a job on a Grid resource
› Features
  - Job management
  - Fault tolerance
  - Credential management
› Roughly equivalent to the vanilla universe

5-9 How It Works
[Diagram built up over five slides, with a Condor-G side and a Grid Resource side. Slide 5 shows the Schedd and LSF. Slide 6 adds 600 Globus jobs, slide 7 adds the GridManager, slide 8 adds the JobManager, and slide 9 adds the User Job.]

10 GlideIn
› Run the Condor daemons on Grid resources as user jobs
› Create your own personal Condor pool from temporarily-acquired Grid resources
› Brings the full power of Condor to the Grid

11-17 [Diagram built up over seven slides. Labels: Condor-G, Globus Grid, PBS, LSF, Condor. Slide 12 adds 600 Condor jobs and slide 14 adds glide-ins. Slides 15-17 carry the same labels as slide 14.]

18 Released New Features
› Stuff we’ve added in the past year
› Released and ready for use in Condor 6.6

19 Globus ASCII Helper Protocol (GAHP)
› Encapsulates Globus libraries in separate process
› Simple ASCII protocol
› Easy for legacy applications to use Globus when they can’t link directly with the libraries
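
The idea of putting a simple ASCII protocol in front of a library can be sketched in a few lines of Python. This is a toy stand-in, not the real GAHP command set: the command names (`PING`, `SUBMIT`, `STATUS`), the `S`/`E` reply prefixes, and the `ToyHelper` class are all invented here for illustration, and the "library" behind it is a plain dictionary rather than Globus.

```python
import shlex


class ToyHelper:
    """Stand-in for a helper process that owns the heavyweight library.

    A real helper would read request lines on stdin and write replies on
    stdout. Here handle() takes and returns one line of text so the
    protocol logic can be shown without spawning a process.
    """

    def __init__(self):
        self._jobs = {}      # job id -> status string
        self._next_id = 1

    def handle(self, line: str) -> str:
        """Parse one ASCII request line and return one ASCII reply line."""
        try:
            parts = shlex.split(line)
        except ValueError:
            return "E malformed-request"
        if not parts:
            return "E empty-request"
        command, args = parts[0].upper(), parts[1:]

        if command == "PING" and not args:
            return "S"
        if command == "SUBMIT" and len(args) == 2:
            # args: resource contact string, executable path (both hypothetical)
            job_id = f"job{self._next_id}"
            self._next_id += 1
            self._jobs[job_id] = "PENDING"
            return f"S {job_id}"
        if command == "STATUS" and len(args) == 1:
            status = self._jobs.get(args[0])
            return f"S {status}" if status else "E unknown-job"
        return "E bad-command"


def submit_job(helper: ToyHelper, resource: str, executable: str) -> str:
    """Client side: build a request line, send it, parse the reply."""
    request = f"SUBMIT {shlex.quote(resource)} {shlex.quote(executable)}"
    reply = helper.handle(request)
    status, _, payload = reply.partition(" ")
    if status != "S":
        raise RuntimeError(f"helper rejected request: {reply}")
    return payload
```

The client never imports the library. It only formats and parses text lines, which is what makes this approach workable for applications that cannot link against the library directly.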

20 How It Works - GAHP
[Diagram with a Condor-G side and a Grid Resources side. Labels: Schedd, GridManager, GAHP Client, GAHP Server, and two JobManagers.]

21 File Staging
› Arbitrary input and output files can be staged to and from execution site
› Same syntax as other universes
› Limitation
  - Output files must be explicitly named

22 File Staging (cont)
› Input, Output, and Error can be URLs
  - Files will be transferred directly to and from execution site
› Output and Error can be staged or streamed
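
The distinction between URL and non-URL file names can be illustrated with a small Python sketch. It is not Condor-G code: the `plan_transfer` and `plan_job_files` helpers and the set of schemes treated as URLs are assumptions made for this example only.

```python
from urllib.parse import urlsplit

# Assumption: the schemes treated as URLs here are a guess for this sketch,
# not a list taken from the talk or from Condor-G itself.
REMOTE_SCHEMES = {"http", "https", "ftp", "gsiftp"}


def plan_transfer(spec: str) -> str:
    """Return 'direct' for a URL, 'stage' for a file on the submit machine."""
    scheme = urlsplit(spec).scheme.lower()
    return "direct" if scheme in REMOTE_SCHEMES else "stage"


def plan_job_files(job: dict) -> dict:
    """Map each of Input/Output/Error that the job sets to a transfer plan."""
    return {
        key: plan_transfer(job[key])
        for key in ("Input", "Output", "Error")
        if key in job
    }
```

A job with `Input` set to a `gsiftp://` URL and `Output` set to a plain file name would get `direct` for the former and `stage` for the latter.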

23 Credential Refresh
› Renewed credentials are used by Condor-G and forwarded to the execution site automatically
› No processes need to be restarted

24 Better Credential Management
› One GridManager process can handle multiple credential files with same subject
› More efficient when you want to have different credential lifetimes for different jobs

25 Grid Match-Making
› Globus jobs matched with Globus resources by the Condor match-maker using ClassAds
› Current limitation
  - User/admin must create resource ads
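
Matching a job against hand-written resource ads can be sketched in Python as below. This is a simplified model, not the Condor match-maker: ads are plain dictionaries, `Requirements` and `Rank` are modelled as Python callables, and every host name and attribute value is made up. The attribute names `gatekeeper_url`, `OpSys`, and `Mflops` are borrowed from the job ad example in this talk.

```python
def symmetric_match(job: dict, resource: dict) -> bool:
    """True when each side's requirements accept the other side's ad."""
    return job["requirements"](resource) and resource["requirements"](job)


def best_match(job: dict, resources: list):
    """Return the acceptable resource ad the job ranks highest, or None."""
    candidates = [r for r in resources if symmetric_match(job, r)]
    return max(candidates, key=job["rank"], default=None)


# Resource ads that a user or admin would have to write by hand.
resource_ads = [
    {"gatekeeper_url": "grid1.example.edu/jobmanager-pbs",
     "OpSys": "LINUX", "Mflops": 400,
     "requirements": lambda job: True},
    {"gatekeeper_url": "grid2.example.edu/jobmanager-lsf",
     "OpSys": "LINUX", "Mflops": 900,
     "requirements": lambda job: True},
    {"gatekeeper_url": "grid3.example.edu/jobmanager-condor",
     "OpSys": "SOLARIS", "Mflops": 2000,
     "requirements": lambda job: True},
    {"gatekeeper_url": "grid4.example.edu/jobmanager-pbs",
     "OpSys": "LINUX", "Mflops": 1500,
     "requirements": lambda job: job["owner"] == "bob"},
]

job_ad = {
    "owner": "alice",
    "requirements": lambda r: r["OpSys"] == "LINUX",
    "rank": lambda r: r["Mflops"],
}

chosen = best_match(job_ad, resource_ads)
```

Here `chosen` is the `grid2` ad: `grid3` fails the job's requirements, `grid4` rejects the job's owner, and `grid2` outranks `grid1` on `Mflops`.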

26 Fault Tolerance
› Condor-G does its best to automatically recover from failures
› User can guide decisions with job policy expressions
  - PeriodicRelease
  - GlobusResubmit
  - Rematch

27 PeriodicRelease Expression
› Condor-G puts problematic jobs on hold
› This expression tells Condor-G when to release and retry such jobs

28 GlobusResubmit Expression
› Tells Condor-G when a problematic job submission should be abandoned
› When this expression becomes true
  - Best effort is made to clean up current job submission
  - New job submission is attempted

29 Rematch Expression
› Tells Condor-G when a problematic resource should be abandoned
› Evaluated when GlobusResubmit evaluates to true
› When this expression becomes true
  - Best effort is made to clean up current job submission
  - Job is rematched
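
How the three expressions interact can be modelled with a short Python sketch. It follows the slides in evaluating Rematch only once GlobusResubmit is true. Checking GlobusResubmit before PeriodicRelease is an assumption of this sketch, as are the `next_action` function, the action names it returns, and the sample release condition.

```python
def next_action(ad: dict, held: bool, periodic_release, globus_resubmit, rematch) -> str:
    """Pick the next step for a job from its ad and its policy expressions.

    Each policy argument is a callable that takes the job ad and returns
    a bool, standing in for a ClassAd expression.
    """
    if globus_resubmit(ad):
        # The current submission is abandoned (after best-effort cleanup).
        # Rematch is only consulted at this point.
        return "rematch" if rematch(ad) else "resubmit"
    if held:
        return "release" if periodic_release(ad) else "stay-held"
    return "continue"


# Hypothetical policy. The resubmit and rematch rules mirror the form used
# in the talk's job ad example; the release rule is invented for this sketch.
policy = dict(
    periodic_release=lambda ad: ad["NumSystemHolds"] < 3,
    globus_resubmit=lambda ad: ad["NumSystemHolds"] >= ad["NumMatches"],
    rematch=lambda ad: True,
)

healthy = next_action({"NumSystemHolds": 0, "NumMatches": 1}, held=False, **policy)
retry = next_action({"NumSystemHolds": 1, "NumMatches": 2}, held=True, **policy)
give_up = next_action({"NumSystemHolds": 2, "NumMatches": 2}, held=True, **policy)
```

With this policy a job that has been held fewer times than it has been matched is released and retried in place, and once the hold count catches up with the match count the job is sent back to the match-maker.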

30 Job Ad Example
GlobusContactString = TARGET.gatekeeper_url
Requirements = TARGET.Arch == "LINUX" && TARGET.OpSys == "LINUX"
Rank = TARGET.Mflops
PeriodicRelease = ((NumMatches [...] 600))
GlobusResubmit = NumSystemHolds >= NumMatches
Rematch = True

31 Hardening
› Regular testing on the CMS testbed with real applications
› Many bugs and integration issues found and fixed
  - Hostile Environment

32 Hostile Environment
› Full disks
› Machine crashes
› File server lock-ups
› Network outages
› Power outages

33 One CMS Dataset Run
› 300 jobs
› Last fall
  - ~50 (16%) of the jobs stalled and required human recovery
  - Multiple service restarts (20 daemon crashes over 6 hours)
› Now
  - 0 jobs stalled
  - 0 service restarts

34 Integration Work
› Dozens of Condor-G improvements and bug fixes
› Over 40 Globus “bugzilla” incidents, many with patches
  - Globus 2.2.4 has 21 “Advisories” as of 4/11/04
› Use latest version of both

35 Scalability
› Submitting several hundred jobs produced high load on server
  - Machine became unresponsive
  - We saw a load average of 1000 at one point
› Caused by Globus JobManager processes

36 Grid Manager Monitor Agent
› New tool Condor-G can use to reduce this load
› Efficient job status polling program
› Allows Condor-G to shut down JobManager processes when they’re not needed
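
Why one polling program can be cheaper than one JobManager per job can be shown with a toy Python model. It rests on an assumption not stated on the slide: that the saving comes from replacing one status query per job with a single query covering all jobs. `FakeScheduler` and both polling functions are hypothetical.

```python
class FakeScheduler:
    """Stand-in for the site's batch system; counts how often it is asked."""

    def __init__(self, statuses: dict):
        self.statuses = dict(statuses)
        self.queries = 0

    def query_one(self, job_id: str) -> str:
        self.queries += 1
        return self.statuses[job_id]

    def query_all(self) -> dict:
        self.queries += 1
        return dict(self.statuses)


def poll_per_job(scheduler: FakeScheduler, job_ids: list) -> dict:
    """One query per job, as if every job had its own polling process."""
    return {job_id: scheduler.query_one(job_id) for job_id in job_ids}


def poll_with_monitor(scheduler: FakeScheduler, job_ids: list) -> dict:
    """One query per polling cycle, shared by all jobs."""
    snapshot = scheduler.query_all()
    return {job_id: snapshot[job_id] for job_id in job_ids}
```

Both functions return the same status table. Only the number of queries the scheduler has to answer per cycle differs, growing with the job count in the first case and staying at one in the second.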

37 Load Reduced
› 400 jobs (/bin/sleep 900)
› Without Grid Monitor
  - 42 hours to complete
  - Peak load average of 610
› With Grid Monitor
  - 40 minutes
  - Peak load average of 104
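
The size of the improvement follows directly from the slide's figures. The Python below only does that arithmetic. The variable names are arbitrary, and the 15-minute floor is simply the length of one `/bin/sleep 900` run, not a measured value.

```python
# Figures taken from the slide.
sleep_seconds = 900
without_monitor_minutes = 42 * 60   # 42 hours
with_monitor_minutes = 40
peak_load_without = 610
peak_load_with = 104

# How many times faster the run finished, and how much lower the peak load was.
speedup = without_monitor_minutes / with_monitor_minutes
load_ratio = peak_load_without / peak_load_with

# No run can finish faster than a single job's sleep time.
floor_minutes = sleep_seconds / 60
overhead_minutes = with_monitor_minutes - floor_minutes

print(f"completion time ratio: {speedup:.0f}x")
print(f"peak load ratio: {load_ratio:.2f}x")
print(f"time beyond one sleep period with Grid Monitor: {overhead_minutes:.0f} minutes")
```

The run finished 63 times faster with roughly one sixth of the peak load, and the 40-minute result sits 25 minutes above the 15 minutes that a single job needs.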

38 Miscellaneous Stuff
› Email notification on job completion
› Port range restrictions
› Problem jobs put on hold

39 In Development
› Stuff we’re currently working on
› Will be released sometime in the next year

40 Job Policy Expressions
› PeriodicHold
› PeriodicRemove
› OnExitHold
› OnExitRemove

41 Improved GlideIn
› MDS use optional
  - User specifies necessary information
› Automatic setup
  - GlideIn job transfers and installs binaries if needed
  - Binaries can come from submit machine

42 New Job Types
› Submit jobs directly to other schedulers (not through Globus)
› Why?
  - Richer interface semantics
  - Not supported by Globus

43 NorduGrid
› Grid batch system designed by Nordic countries
› Globus GRAM didn’t offer necessary semantics
  - Client control of file staging
  - Automatic cleanup of abandoned jobs

44 Oracle
› Oracle DBMS supports a job queue
  - Run this query in 5 hours
  - Run this query every Monday
› Condor can add more management features

45 Generic Job Interface
› Re-arrange GridManager to allow easy addition of new job types
› Define appropriate interface
› Plug-ins for new job types?
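
One possible shape for such an interface is sketched below in Python. The slide leaves the design open, so nothing here reflects the actual GridManager: the `JobTypeHandler` methods, the `register` decorator, and the `inmemory` job type are all hypothetical.

```python
from abc import ABC, abstractmethod


class JobTypeHandler(ABC):
    """Operations a manager would need from every job type (assumed set)."""

    @abstractmethod
    def submit(self, job: dict) -> str:
        """Start the job remotely and return a remote job id."""

    @abstractmethod
    def status(self, remote_id: str) -> str:
        """Return the current state of a remote job."""

    @abstractmethod
    def cancel(self, remote_id: str) -> None:
        """Remove a remote job."""


HANDLERS = {}   # job type name -> handler class


def register(type_name: str):
    """Class decorator that makes a handler available under a type name."""
    def decorator(cls):
        HANDLERS[type_name] = cls
        return cls
    return decorator


@register("inmemory")
class InMemoryHandler(JobTypeHandler):
    """Trivial handler that only records jobs, used to exercise the interface."""

    def __init__(self):
        self._jobs = {}

    def submit(self, job: dict) -> str:
        remote_id = f"inmemory-{len(self._jobs) + 1}"
        self._jobs[remote_id] = "RUNNING"
        return remote_id

    def status(self, remote_id: str) -> str:
        return self._jobs[remote_id]

    def cancel(self, remote_id: str) -> None:
        self._jobs[remote_id] = "REMOVED"


def handler_for(job: dict) -> JobTypeHandler:
    """Look up and instantiate the handler for a job's declared type."""
    try:
        return HANDLERS[job["type"]]()
    except KeyError:
        raise ValueError(f"no handler registered for job type {job.get('type')!r}") from None
```

With a registry like this, adding a new job type means writing one more decorated class, and the code that manages jobs does not change.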

46 Globus Toolkit 3.0
› OGSA (Open Grid Services Architecture)
› Submit jobs to GT3 sites
› Grid Service client interface to Condor-G

47 Miscellaneous
› Condor-G for Windows
› MyProxy credential management
› URLs for executable, staged files

48 Thank You!
› Questions?
› Also…
  - Condor-G & Globus Q/A session Wednesday, 9am-12pm, room TBA
  - E-mail condor-admin@cs.wisc.edu

