Condor Project Computer Sciences Department University of Wisconsin-Madison Eager, Lazy, and Just-in-Time.

Eager, Lazy, and Just-in-Time Planning
Edinburgh Workshop, Oct 2003
Condor Project, Computer Sciences Department, University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor

2 Planning vs. Scheduling
Can you control the resources? Yes? Scheduling. No? Planning.
Planning is a client operation.

3 The Question of When
Lots of open questions in planning.
An important consideration: when the planning occurs.
[Figure: a time axis with the labels Eager, Lazy, and Just-in-Time]

4 Eager Example
First Pass of EDG Resource Broker
[Diagram labels: RB, DAGMan, Condor-G, Globus, Fabric, Site Scheduler]

5 Eager Condor-G Submit File
  universe = globus
  globussite = beak.cs.wisc.edu/jobmanager-lsf
  executable = find_particle
  arguments = ….
  output = ….
  log = …

6 EDG Resource Broker Gets Lazy…
Addition of DAGMan callouts.
DAGMan is given a command (script) to run immediately before it submits a job to Condor-G (this is different from a PRE script on a node).
The helper command is passed a copy of the job submit file when DAGMan is about to submit that node in the graph.
This allows changes to be made to the submit file (e.g. changing the globussite attribute) at the last minute.

7 Eager Example
First Pass of EDG Resource Broker
[Diagram labels: RB, DAGMan, Condor-G, Globus, Fabric, Site Scheduler, callout]

8 Moving Condor-G to Just-In-Time
Delay the binding of the task (job) to the resource until the resource is ready.
Need to know when the resource is ready.
One way: the unimplemented Globus 1.1 queue wait time estimate.
  Not really just-in-time, because of lies, lies, lies…
Another way… the Condor-G GlideIn mechanism.

9 How It Works
[Diagram labels: Condor-G, Globus Resource, Schedd, Collector, LSF, 600 Condor jobs]

10 How It Works
[Diagram adds: GlideIn jobs]

11 How It Works
[Diagram adds: GridManager]

12 How It Works
[Diagram adds: JobManager]

13 How It Works
[Diagram adds: Startd]

14 How It Works
[Diagram labels unchanged from slide 13]

15 How It Works
[Diagram adds: User Job]
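The late binding that the GlideIn build-up illustrates can be shown with a toy model. This is a sketch under simplifying assumptions, not Condor's implementation: jobs wait in one queue, and a job is bound to a site only at the moment a glidein at that site actually starts and reports in. The function and argument names are made up for the illustration.

```python
from collections import deque


def run_glideins(jobs, pilot_start_order):
    """Bind each queued job to whichever glidein reports in next.

    jobs: job names in queue order.
    pilot_start_order: site names, in the order their glideins really start.
    Returns a list of (job, site) bindings.
    """
    queue = deque(jobs)
    bindings = []
    for site in pilot_start_order:
        if not queue:
            break  # leftover glideins find no work
        bindings.append((queue.popleft(), site))
    return bindings


bindings = run_glideins(
    ["job1", "job2", "job3"],
    ["siteB", "siteA", "siteB", "siteC"],
)
```

No job was assigned to a site in advance; a slow site simply receives fewer jobs because its glideins report in later.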

16 A Just-in-time Submit
  executable = find_particle
  requirements = TARGET.Arch == Intel/Linux || TARGET.Arch == Sparc/Solaris
  # job describes the power
  rank = MFlops * 10000 + Memory
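How requirements and rank pick a resource can be sketched in a few lines. This is a simplified illustration, not the Condor matchmaker: machine ads are plain dictionaries, the expressions are ordinary Python functions, and the machine names and numbers are invented. The Arch strings reuse the slide's shorthand.

```python
def match(machines, requirements, rank):
    """Return the machine that satisfies requirements and has the highest rank."""
    candidates = [m for m in machines if requirements(m)]
    if not candidates:
        return None  # nothing suitable has shown up yet; the job keeps waiting
    return max(candidates, key=rank)


# Invented machine ads for the example.
machines = [
    {"Name": "a", "Arch": "Intel/Linux", "MFlops": 50, "Memory": 512},
    {"Name": "b", "Arch": "Sparc/Solaris", "MFlops": 80, "Memory": 256},
    {"Name": "c", "Arch": "Alpha/OSF1", "MFlops": 200, "Memory": 1024},
]


def requirements(target):
    return target["Arch"] in ("Intel/Linux", "Sparc/Solaris")


def rank(target):
    return target["MFlops"] * 10000 + target["Memory"]


best = match(machines, requirements, rank)
```

Machine c is the fastest but fails the requirements, so the choice falls to b, whose rank (800256) beats a's (500512).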

17 Another Just-in-time Submit
  executable = find_particle
  requirements = TARGET.Arch == Intel/Linux || TARGET.Arch == Sparc/Solaris
  rank = sam_data_overlap(MY.dataset, TARGET.sam_site_name) + (TARGET.Mflops / 100000)
  +dataset = search_space_id_0133313

18 Lots of Tradeoffs…
Just-in-Time
  Pro: Dynamic. Resources can come and go. Can take advantage of changing circumstances.
  Con: Coordination of multiple resources is harder.
Eager
  Pro: Easier to coordinate multiple resources.
  Con: Hard to scale… how to know about all the resources in advance?
  Con: Plan falls apart if assumptions change.

19 Some Observations
A complete separation of task from resource is difficult.
  Lots and lots of structured data required.
But this separation is required in order to achieve Just-In-Time planning.
Grid protocols that do not separate task from resource cannot realistically live on the grid.
Virtualization can help.

20 Plan for Failure
Much effort on how to create a plan.
How about a plan for when things fail?

21 Job Failure Policy Expressions
Condor/Condor-G augmented so users can supply job failure policy expressions in the submit file.
Can be used to describe a successful run, or what to do in the face of failure.
  on_exit_remove =
  on_exit_hold =
  periodic_remove =
  periodic_hold =

22 Job Failure Policy Examples
Do not remove from queue (i.e. reschedule) if exits with a signal:
  on_exit_remove = ExitBySignal == False
Place on hold if exits with nonzero status or ran for less than an hour:
  on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
Place on hold if job has spent more than 50% of its time suspended:
  periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
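The effect of these three policies can be traced with a small Python model. It is a sketch, not Condor's evaluator, and rests on stated assumptions: "nonzero status" is read as the job's ExitCode attribute, the `now` argument stands in for the time reference the slide calls ServerStartTime, and a hold is taken to win over a remove when both apply.

```python
def on_exit_remove(job):
    # Leave the queue only if the job was not killed by a signal.
    return not job["ExitBySignal"]


def on_exit_hold(job, now):
    # `now` is an assumed stand-in for the slide's ServerStartTime.
    nonzero_exit = (not job["ExitBySignal"]) and job.get("ExitCode", 0) != 0
    too_short = (now - job["JobStartDate"]) < 3600
    return nonzero_exit or too_short


def periodic_hold(job):
    return job["CumulativeSuspensionTime"] > job["RemoteWallClockTime"] / 2.0


def decide_on_exit(job, now):
    """Return 'hold', 'remove', or 'requeue' for a job that has just exited."""
    if on_exit_hold(job, now):  # assumption: hold takes precedence
        return "hold"
    if on_exit_remove(job):
        return "remove"
    return "requeue"


clean = {"ExitBySignal": False, "ExitCode": 0, "JobStartDate": 0}
crashed = {"ExitBySignal": True, "JobStartDate": 0}
failed = {"ExitBySignal": False, "ExitCode": 1, "JobStartDate": 0}
```

With `now=7200`, the clean run is removed, the job killed by a signal goes back in the queue, and the job that exited with status 1 is held; with `now=100`, even the clean run is held because it ran for less than an hour.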

23 Thank you!
http://www.cs.wisc.edu/condor
tannenba@cs.wisc.edu
