Jaime Frey Computer Sciences Department University of Wisconsin-Madison OGF 19 Condor Software Forum Routing.


1 Routing Jobs to the Grid
Jaime Frey, Computer Sciences Department, University of Wisconsin-Madison
OGF 19 Condor Software Forum

2 What's a Job Router?
o A specialized scheduler operating on the schedd's jobs, a.k.a. the Schedd On The Side.
[Diagram labels: Schedd; Job Router; job queue holding Job 1, Job 2, Job 3, Job 4, Job 5, …, Job 4*]

3 Adapted Quill Technology
Uses the Quill library to mirror the job queue in memory:
o Efficient: it just tails the log.
o Independent: it mirrors without clogging the schedd command queue.
Modifying the job queue is another matter: that requires interacting with the schedd.
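
The "just tails the log" idea can be sketched as follows. This is a minimal illustration, not Quill's code: the record format (NEW, SET, DEL) and the function names are hypothetical and are not Condor's actual job queue log format. The point is only that a reader can rebuild the queue in memory by applying appended records, without sending any commands to the schedd.

```python
import io

def apply_log_line(mirror, line):
    """Apply one record of a hypothetical, simplified job log to the mirror.

    Assumed record shapes (NOT Condor's real log format):
      NEW <job-id>
      SET <job-id> <attribute> <value>
      DEL <job-id>
    """
    parts = line.split(None, 3)
    if not parts:
        return
    op = parts[0]
    if op == "NEW" and len(parts) >= 2:
        mirror.setdefault(parts[1], {})
    elif op == "SET" and len(parts) == 4:
        mirror.setdefault(parts[1], {})[parts[2]] = parts[3]
    elif op == "DEL" and len(parts) >= 2:
        mirror.pop(parts[1], None)

def tail_into(mirror, log, offset=0):
    """Apply records appended since `offset`; return the new offset.

    A trailing record without a newline is treated as still being written
    and is left for the next call.
    """
    log.seek(offset)
    while True:
        line = log.readline()
        if not line.endswith("\n"):  # end of file, or a partial record
            break
        apply_log_line(mirror, line.rstrip("\n"))
        offset = log.tell()
    return offset
```

A caller would keep the returned offset and call tail_into again whenever the log grows, so each record is read once.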

4 Usage Case Routing: Vanilla -> Grid

5 Condor Farm Story
[Diagram labels: Application, condor_submit, Schedd, job queue, Startd, Resources; jobs labeled "Random Seed"]
Now that this is working, how can I use my collaborators' resources too?

6 Option #1: Merge Farms
Combine your machines with your collaborator's into one Condor resource pool.
o Everything works just like it did before.
o Excellent option for small to medium clusters.
o Requires bidirectional connectivity to all startds, or equivalent via GCB.
o Requires some administrative coordination (e.g. upgrades, negotiator policy, security, etc.).

7 Option #1b: Submit to Multiple Pools
condor_submit -remote …
o Works OK at small scale.
o Jobs have to be partitioned manually.

8 Option #2: Flocking Together
[Diagram labels: Schedd, Local Startds, Remote Startds; jobs labeled "Random Seed"]
o Full featured (std universe etc.).
o Automatic matchmaking.
o Easy to configure.
o Requires bidirectional connectivity.
o Both sites must run Condor.

9 Option #3: Grid Universe
[Diagram labels: Schedd, Startds, Gatekeeper X, vanilla, site X; jobs labeled "Random Seed"]
o Easier to live with private networks.
o May use non-Condor resources.
o Restricted Condor feature set (e.g. no std universe over grid).
o Must pre-allocate jobs between the vanilla and grid universes.

10 Option #4: Routing Jobs
[Diagram labels: Schedd, Local Startds, Schedd On The Side, Gatekeeper X, Y, Z, vanilla, site X, site Y, site Z; jobs labeled "Random Seed"]
o Dynamic allocation of jobs between the vanilla and grid universes.
o Not every job is appropriate for transformation into a grid job.

11 Example Routing Table
    [ GridResource = "gt2 gatekeeper.site1/jobmanager-pbs";
      MaxJobs = 500;
      MaxIdle = 50;
      set_GlobusRSL = "(…)" ]
    [ GridResource = "condor schedd.site2 collector.site2";
      MaxJobs = 700;
      MaxIdle = 100;
      Requirements = other.ImageSize < 500 ]
    …
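
One way to read the routing table is as a list of candidate routes, each with capacity limits and an optional requirements test. The sketch below illustrates that reading only. The Route class, the pick_route function, and the first-match policy are assumptions made for illustration; the slides do not say how the Job Router chooses among several eligible routes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Route:
    """Hypothetical in-memory form of one routing table entry."""
    grid_resource: str
    max_jobs: int            # cap on jobs routed to this site (MaxJobs)
    max_idle: int            # cap on routed jobs still idle there (MaxIdle)
    requirements: Callable[[dict], bool] = lambda job: True
    routed: int = 0          # jobs currently routed to this site
    idle: int = 0            # routed jobs currently idle at this site

def pick_route(job: dict, routes: List[Route]) -> Optional[Route]:
    """Return the first route with spare capacity whose requirements accept the job.

    First-match order is an assumption of this sketch.
    """
    for route in routes:
        if route.routed >= route.max_jobs or route.idle >= route.max_idle:
            continue  # site is full, or already has enough idle jobs queued
        if route.requirements(job):
            return route
    return None  # leave the job where it is
```

With the two entries from the slide, a job whose ImageSize is 900 could only go to the first route, since the second requires ImageSize < 500. Once the first route reaches its MaxIdle limit, that job stays unrouted.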

12 What About I/O?
o Jobs must be sandboxable (i.e. they specify input/output via the transfer-files mechanism).
o Routing of standard universe jobs is not supported.
o The site must have enough storage space for the input/output files!
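
The three conditions above can be expressed as a simple eligibility check. This is a sketch under assumptions: the job is represented as a plain dict of ClassAd attributes (JobUniverse, ShouldTransferFiles, DiskUsage), treating only ShouldTransferFiles = "YES" as sandboxable is a deliberately strict choice, and site_free_kb is a hypothetical number the router would have to obtain somehow.

```python
VANILLA_UNIVERSE = 5  # JobUniverse value for vanilla jobs (standard is 1)

def is_routable(job, site_free_kb):
    """Decide whether a job looks safe to route, per the I/O rules on the slide.

    job          -- dict of job ClassAd attributes (assumed representation)
    site_free_kb -- hypothetical free storage at the target site, in KiB
    """
    # Only vanilla jobs are candidates; standard universe is not supported.
    if job.get("JobUniverse") != VANILLA_UNIVERSE:
        return False
    # Sandboxable: the job ships its files instead of relying on a shared
    # file system. Requiring "YES" is an assumption of this sketch.
    if job.get("ShouldTransferFiles") != "YES":
        return False
    # The site must have room for the job's files.
    return job.get("DiskUsage", 0) <= site_free_kb
```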

13 What Types of Grids?
The routing table may contain any combination of grid types supported by Condor's grid universe.
Example: Condor-C
[Diagram labels: Schedd On The Side, Schedd X, site X; jobs labeled "Random Seed"]
o For two Condor sites, schedd-to-schedd submission requires no additional software.
o However, it is still not as trivial to use as flocking.

14 Source Routing
Routing the old-fashioned way:
    universe = Grid
    GridResource = condor site1 …
    remote_universe = Grid
    remote_GridResource = condor site2 …
    remote_remote_universe = Grid
    remote_remote_GridResource = pbs
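
The pattern in the submit lines above is that each further hop adds one more remote_ prefix. A small helper makes this explicit. The function name source_route is hypothetical, and the hop strings are passed through untouched, including the elided parts shown on the slide.

```python
def source_route(hops):
    """Build submit-file lines for a fixed chain of grid hops.

    hops -- list of GridResource strings, nearest hop first.
    Hop number k (starting at 0) gets k copies of the "remote_" prefix.
    """
    lines = []
    for depth, resource in enumerate(hops):
        prefix = "remote_" * depth
        lines.append(f"{prefix}universe = Grid")
        lines.append(f"{prefix}GridResource = {resource}")
    return lines

# Reproduces the six lines from the slide, ellipses included.
for line in source_route(["condor site1 …", "condor site2 …", "pbs"]):
    print(line)
```

Writing it this way also shows the drawback the slide hints at: the whole path is fixed by the submitter at submit time, rather than chosen dynamically by a router.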

15 Routing At the Site
[Diagram labels: Gatekeeper X, Schedd On The Side, Schedd, X2, X3]
o Navigate internal firewalls.
o Provide custom routes for special users.
o Improve scalability.
However, keep in mind I/O requirements etc.

16 Multicast in Future?
o Currently: route one job to one site.
o Multicast: route one job to many sites.
o Thin out all but the first to germinate … or all but the first to yield fruit.
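
The "first to germinate" variant could look roughly like the sketch below. Multicast is described on the slide as a possible future feature, so everything here is hypothetical: the function names, the status strings, and the idea of tracking copies in a dict are assumptions for illustration only.

```python
def multicast(sites):
    """Start with one idle copy of the job at every candidate site."""
    return {site: "idle" for site in sites}

def on_first_start(copies, site):
    """Record that the copy at `site` started running.

    The first copy to start wins. The function returns the sites whose
    copies should now be removed, and an empty list if a winner was
    already chosen earlier.
    """
    if "running" in copies.values():
        return []  # a winner already exists; ignore late starters
    copies[site] = "running"
    losers = [s for s in copies if s != site]
    for s in losers:
        copies[s] = "removed"
    return losers
```

The "first to yield fruit" variant would keep all copies running and thin them out only when one of them completes.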

17 Future Glidein Factory
[Diagram labels: Schedd, Startds, Schedd On The Side, glidein jobs, Gatekeeper X, home, site X; jobs labeled "Random Seed"]
o True late binding of jobs to resources.
o May run on top of non-Condor sites.
o Supports the full feature set of Condor (e.g. standard universe).
o Requires GCB for private networks.

18 Gliding in the Factory
[Diagram labels: Schedd On The Side, glidein factory, site X, schedd-to-schedd, schedd-to-gatekeeper; jobs labeled "Random Seed"]
o Hierarchical strategy for scalability and reliability.
o Better match for private networks.
o May require some additional horsepower from the gatekeeper machine, perhaps a dedicated element for edge services.

19 Pluggable Router
o Goes beyond simple ClassAd transforms.
o Plugins would fire when a job matches an entry in the routing table.
o We don't yet understand the semantics.
o There is work to do!

20 Thanks
Interested? Let us know: Jaime Frey
o We are currently using job routing for specific users at UW.
o Future development will focus on more use-cases.

