Schedd On The Side
Dan Bradley
Computer Sciences Department, University of Wisconsin-Madison

Schedd On The Side: What is it?
› A specialized scheduler that operates on a schedd's jobs.
[Diagram: the job queue holds Job 1 … Job 5; the schedd-on-the-side transforms Job 4 into a routed copy, Job 4*.]

Condor Farm Story
[Diagram: a user runs condor_submit; the application becomes many jobs ("Random Seed" instances) in the schedd's job queue, which run on the farm's startd resources.]
Now that this is working, how can I use my collaborator's resources too?

Option #1: Merge Farms
› Combine machines with your collaborator's into one Condor resource pool.
o Everything works just like it did before.
o An excellent option for small to medium clusters.
o Requires bidirectional connectivity to all startds, or the equivalent via GCB.
o Requires some administrative coordination (e.g. upgrades, negotiator policy, security).
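As a minimal sketch of what merging means in configuration terms (hostnames hypothetical), every machine at both sites points at one shared central manager and agrees on a security policy:

```
# condor_config.local on every machine at both sites
CONDOR_HOST = central-manager.wisc.edu

# Both sites' admins must coordinate on who may join the pool, e.g.:
ALLOW_WRITE = *.wisc.edu, *.collaborator.edu
```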

Option #2: Flocking Together
[Diagram: the schedd matches jobs to its local startds and flocks overflow jobs to remote startds.]
› Full featured (std universe, etc.)
› Automatic matchmaking
› Easy to configure
› Requires bidirectional connectivity
› Both sites must run Condor
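Flocking is configured on both sides; a hedged sketch with hypothetical hostnames:

```
# On the submitting site's schedd:
FLOCK_TO = central-manager.collaborator.edu

# On the remote site's central manager and startds:
FLOCK_FROM = submit.wisc.edu
ALLOW_WRITE = $(ALLOW_WRITE), submit.wisc.edu
```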

Option #3: Grid Universe
[Diagram: the schedd runs vanilla jobs on local startds and submits grid-universe jobs through site X's gatekeeper to its startds.]
› Easier to live with private networks
› May use non-Condor resources
› Restricted Condor feature set (e.g. no std universe over grid)
› Must pre-allocate jobs between vanilla and grid universe
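A grid-universe submit file names the target gatekeeper explicitly, which is what forces the pre-allocation between universes; a sketch with hypothetical hostnames and job names:

```
universe      = grid
grid_resource = gt2 gatekeeper.site-x.edu/jobmanager-condor
executable    = random_seed
arguments     = --seed 42
queue
```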

Option #4: Routing Jobs
[Diagram: a schedd-on-the-side watches the main schedd's queue and routes copies of jobs through the gatekeepers at sites X, Y, and Z, while the remaining jobs run vanilla on local startds.]
› Dynamic allocation of jobs between vanilla and grid universes.
› Not every job is appropriate for transformation into a grid job.
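The routing table lives in the schedd-on-the-side's configuration; a minimal sketch of one route, with a hypothetical site:

```
# condor_config for the schedd-on-the-side
JOB_ROUTER_ENTRIES = \
  [ GridResource = "gt2 gatekeeper.site-x.edu/jobmanager-condor"; \
    Name = "Site X"; ]
```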

What About Flow Control? › May restrict routing to jobs which have been rejected by the negotiator. › May limit the maximum number of actively routed jobs on a per-site basis. › May limit the maximum number of idle routed jobs per site. › Periodic removal of idle routed jobs is possible, but there is no guarantee of optimal rescheduling. › The routing table may be reconfigured dynamically. › Multicast? Might be interesting to try.
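The per-site limits above map onto attributes of a route entry; a sketch, assuming the same hypothetical Site X route:

```
# Limit Site X to at most 200 routed jobs, and stop routing
# new jobs there while 10 or more routed jobs sit idle.
JOB_ROUTER_ENTRIES = \
  [ GridResource = "gt2 gatekeeper.site-x.edu/jobmanager-condor"; \
    Name = "Site X"; \
    MaxJobs = 200; \
    MaxIdleJobs = 10; ]
```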

What About I/O? › Jobs must be sandboxable (i.e. specifying input/output via the transfer-files mechanism). › Routing of standard universe is not supported. › Additional restrictions may apply, depending on site network and disk.
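A sandboxable job declares all of its I/O through the file-transfer mechanism; a sketch with hypothetical file names:

```
universe                = vanilla
executable              = random_seed
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = seeds.dat
output                  = run.out
error                   = run.err
queue
```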

What Types of Grids?
› Routing table may contain any combination of grid types supported by the grid universe.
› Example: Condor-C
[Diagram: the schedd-on-the-side routes jobs directly to a remote schedd at site X.]
o For two Condor sites, schedd-to-schedd submission requires no additional software.
o However, still not as trivial to use as flocking.
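A Condor-C route points the grid universe at a remote schedd rather than a gatekeeper; a sketch with hypothetical schedd and central-manager hostnames:

```
# Route entry for schedd-to-schedd (Condor-C) submission:
JOB_ROUTER_ENTRIES = \
  [ GridResource = "condor schedd.site-x.edu cm.site-x.edu"; \
    Name = "Site X Condor-C"; ]
```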

Routing Behind the Scenes
[Diagram: the schedd-on-the-side routes jobs through site X's gatekeeper to a chain of internal schedds (X2, X3).]
› Navigate internal firewalls
› Provide custom routes for special users
› Improve scalability
› However, keep in mind I/O requirements, etc.
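A custom route for a special user can be expressed with a Requirements expression on the route entry; a sketch with a hypothetical username:

```
# Only alice's jobs are eligible for this route:
JOB_ROUTER_ENTRIES = \
  [ GridResource = "gt2 gatekeeper.site-x.edu/jobmanager-condor"; \
    Name = "Site X (alice only)"; \
    Requirements = (Owner == "alice"); ]
```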

Future Step: Glidein Factory
[Diagram: the schedd-on-the-side submits glidein jobs through site X's gatekeeper; the resulting startds report back to the home schedd, which then runs user jobs on them.]
› True late binding of jobs to resources
› May run on top of non-Condor sites
› Supports full feature set of Condor (e.g. standard universe)
› Requires GCB on network boundary (initiated by schedd-on-the-side?)

Glideins in the Works
[Diagram: the schedd-on-the-side drives a glidein factory at site X, combining schedd-to-schedd and schedd-to-gatekeeper submission.]
› Hierarchical strategy for scalability and reliability
› Better match for private networks
› May require some additional horsepower from the gatekeeper machine, perhaps a dedicated element for “edge services”.

Thanks
Interested? Let us know. We are currently using job routing for specific users at UW.
Dan Bradley
Future development will focus on more use-cases.