Routing Jobs to the Grid
Jaime Frey, Computer Sciences Department, University of Wisconsin-Madison
OGF 19 Condor Software Forum

Schedd Job Router, a.k.a. "Schedd On The Side"
What's a Job Router? A specialized scheduler operating on a schedd's jobs.
(Diagram: the schedd's job queue, Job 1 through Job 5 ..., with Job 4 transformed into a routed copy, Job 4*.)

Adapted Quill Technology
Uses the Quill library to mirror the job queue in memory:
o Efficient: just tails the job queue log
o Independent: mirrors without clogging the schedd's command queue
Modifying the job queue is another matter: the router must interact with the schedd.

Use Case: Routing Vanilla -> Grid

Condor Farm Story
(Diagram: a user runs condor_submit to place application jobs in the schedd's job queue; the schedd dispatches them to a pool of startd resources.)
Now that this is working, how can I use my collaborators' resources too?
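For reference, a minimal sketch of the kind of vanilla universe submit file this farm story assumes; the executable, arguments, and file names are hypothetical:

    # minimal vanilla universe submit file (names are placeholders)
    universe   = vanilla
    executable = my_application
    arguments  = --seed 42
    output     = job.$(Process).out
    error      = job.$(Process).err
    log        = jobs.log
    queue 100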

Option #1: Merge Farms
Combine machines with your collaborator's into one Condor resource pool.
o Everything works just like it did before.
o Excellent option for small to medium clusters.
o Requires bidirectional connectivity to all startds, or the equivalent via GCB.
o Requires some administrative coordination (e.g. upgrades, negotiator policy, security, etc.)
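As a hedged configuration sketch of merging pools: point each machine's condor_config at one shared central manager. The host names are placeholders, and the exact security knobs vary by Condor version:

    # condor_config fragment on every machine joining the merged pool
    CONDOR_HOST = cm.site1.example.edu
    # let machines from both sites write to the pool (adjust to local security policy)
    ALLOW_WRITE = *.site1.example.edu, *.site2.example.edu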

Option #1b: Submit to Multiple Pools
condor_submit -remote …
Works OK at small scale, but you have to manually partition jobs between the pools.
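A hedged usage sketch; the pool and schedd names are placeholders:

    # submit the same job description to a schedd in the collaborator's pool
    condor_submit -pool cm.site2.example.edu -remote schedd.site2.example.edu job.submit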

Option #2: Flocking Together
(Diagram: one schedd sending jobs to both local startds and remote startds.)
o Full featured (standard universe, etc.)
o Automatic matchmaking
o Easy to configure
o Requires bidirectional connectivity
o Both sites must run Condor
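A minimal flocking configuration sketch, assuming hypothetical host names:

    # on the local schedd machine: where to flock jobs when local resources are busy
    FLOCK_TO = cm.site2.example.edu
    # on the remote pool: which foreign schedds are allowed to flock in
    FLOCK_FROM = schedd.site1.example.edu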

Option #3: Grid Universe
(Diagram: the local schedd runs vanilla jobs on its own startds and sends grid universe jobs through site X's gatekeeper.)
o Easier to live with private networks
o May use non-Condor resources
o Restricted Condor feature set (e.g. no standard universe over the grid)
o Must pre-allocate jobs between the vanilla and grid universes
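A hedged sketch of a grid universe submit file targeting a GT2 gatekeeper; the gatekeeper address is a placeholder:

    # grid universe job sent through a (hypothetical) GT2 gatekeeper to its PBS jobmanager
    universe      = grid
    grid_resource = gt2 gatekeeper.site1.example.edu/jobmanager-pbs
    executable    = my_application
    output        = job.out
    error         = job.err
    log           = job.log
    queue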

Option #4: Routing Jobs
(Diagram: the schedd runs vanilla jobs on its local startds while the Schedd On The Side routes other jobs to the gatekeepers at sites X, Y, and Z.)
o Dynamic allocation of jobs between the vanilla and grid universes.
o Not every job is appropriate for transformation into a grid job.

Example Routing Table
[ GridResource = "gt2 gatekeeper.site1/jobmanager-pbs";
  MaxJobs = 500;
  MaxIdle = 50;
  set_GlobusRSL = "(…)" ]
[ GridResource = "condor schedd.site2 collector.site2";
  MaxJobs = 700;
  MaxIdle = 100;
  Requirements = other.ImageSize < 500 ]
…
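As a hedged sketch of how such a table might be wired into the configuration (host names are placeholders, and the attribute names simply mirror the table above):

    # condor_config fragment enabling the job router
    DAEMON_LIST = $(DAEMON_LIST) JOB_ROUTER
    JOB_ROUTER_ENTRIES = \
      [ GridResource = "gt2 gatekeeper.site1.example.edu/jobmanager-pbs"; \
        MaxJobs = 500; MaxIdle = 50; ] \
      [ GridResource = "condor schedd.site2.example.edu cm.site2.example.edu"; \
        MaxJobs = 700; MaxIdle = 100; ]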

What About I/O?
Jobs must be sandboxable (i.e. they specify input/output via the file-transfer mechanism).
Routing of the standard universe is not supported.
Must have enough storage space at the target site for input/output files!
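A hedged sketch of the file-transfer submit commands that make a job sandboxable, and hence routable; the file names are placeholders:

    # explicit file transfer: the job carries its own input/output sandbox
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = input.dat, params.cfg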

What Types of Grids?
The routing table may contain any combination of grid types supported by Condor's grid universe.
Example: Condor-C
(Diagram: the Schedd On The Side forwards jobs from the local schedd to the schedd at site X.)
o For two Condor sites, schedd-to-schedd submission requires no additional software.
o However, it is still not as trivial to use as flocking.
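A hedged sketch of a Condor-C (grid-type condor) submission; host names are placeholders:

    # grid universe job handed directly to a remote schedd (Condor-C)
    universe      = grid
    grid_resource = condor schedd.siteX.example.edu cm.siteX.example.edu
    executable    = my_application
    queue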

Source Routing
Routing the old-fashioned way:
universe = Grid
GridResource = condor site1 …
remote_universe = Grid
remote_GridResource = condor site2 …
remote_remote_universe = Grid
remote_remote_GridResource = pbs

Routing At the Site
(Diagram: jobs arriving at site X's gatekeeper are picked up by a Schedd On The Side and routed on to the site's internal schedds.)
o Navigate internal firewalls
o Provide custom routes for special users
o Improve scalability
However, keep in mind I/O requirements, etc.

Multicast in the Future?
Currently: route one job to one site.
Multicast: route one job to many sites.
Thin out all but the first to germinate … or all but the first to yield fruit.

Future: Glidein Factory
(Diagram: the Schedd On The Side at the home site submits glidein jobs through site X's gatekeeper; the glideins start Condor startds that report back and run the user's jobs.)
o True late binding of jobs to resources
o May run on top of non-Condor sites
o Supports the full feature set of Condor (e.g. standard universe)
o Requires GCB for private networks

Gliding in the Factory
(Diagram: the Schedd On The Side routes schedd-to-schedd to a glidein factory at site X, which in turn submits schedd-to-gatekeeper.)
o Hierarchical strategy for scalability and reliability
o Better match for private networks
o May require some additional horsepower on the gatekeeper machine, perhaps a dedicated element for edge services.

Pluggable Router
Beyond simple ClassAd transforms: plugins would fire when a job matches an entry in the routing table.
We don't yet understand the semantics. There is work to do!

Thanks
Interested? Let us know. We are currently using job routing for specific users at UW.
Contact: Jaime Frey
Future development will focus on more use cases.