Workload Management System

Presentation transcript:

Workload Management System CHEP 2007 glideinWMS - A generic pilot-based Workload Management System by Igor Sfiligoi (FNAL) CHEP'07 - Sep 4th, 2007 glideinWMS - A generic pilot-based WMS

Outline
  What is glideinWMS?
  How does it work?
  How does it perform?
  Monitoring
  Conclusions

What is glideinWMS?
  A Condor glidein-based Workload Management System
  Developed by CMS for CMS, but generic enough to be used by other groups, too
  A generalization of the CDF glidekeeper
  Available at: http://home.fnal.gov/~sfiligoi/glideinWMS/

Why do we need a WMS?
  “The Grid” is really a sum of hundreds of independent Grid sites.
  Choosing where to try to run the jobs is not a trivial task.

What is Condor? (1)
  A widely used batch system
  Based on a fully distributed architecture
  [Diagram: a Collector/Negotiator with Schedds ("Have jobs, need workers") and Starters ("Have workers, need jobs").]

What is Condor? (2)
  A widely used batch system
  Based on a fully distributed architecture
  [Diagram: Collector/Negotiator, Schedds, Starters and Jobs. Labels toward the Schedds: "Send jobs to w2 and w3", "Send job to w1". Labels toward the Starters: "Expect a job from s1" (twice), "Expect a job from s2".]
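
The matchmaking shown on the two Condor slides can be illustrated with a toy sketch. This is not Condor code: the `Job` and `Worker` classes, the single memory requirement, and the first-fit policy are all assumptions made for illustration. Only the names s1, s2 and w1 to w3 come from the slide labels.

```python
from dataclasses import dataclass


@dataclass
class Job:
    owner_schedd: str    # schedd holding the job, e.g. "s1" (label from the slide)
    min_memory_mb: int   # hypothetical requirement used for matching


@dataclass
class Worker:
    name: str            # starter name, e.g. "w1" (label from the slide)
    memory_mb: int       # hypothetical advertised attribute


def negotiate(jobs, workers):
    """Pair each waiting job with the first free worker that satisfies it.

    Returns (schedd, worker) pairs, i.e. "send job to wN" for the schedd
    and "expect a job from sN" for the starter.
    """
    free = list(workers)
    matches = []
    for job in jobs:
        for worker in free:
            if worker.memory_mb >= job.min_memory_mb:
                matches.append((job.owner_schedd, worker.name))
                free.remove(worker)
                break
    return matches


jobs = [Job("s1", 1000), Job("s1", 4000), Job("s2", 500)]
workers = [Worker("w1", 2000), Worker("w2", 8000), Worker("w3", 1000)]
matches = negotiate(jobs, workers)
```

Each pair in `matches` stands for one negotiation result; the job itself then travels directly from the schedd to the starter.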

What is a glidein? (1)
  Just a regular starter
  Submitted as a Grid job
  [Diagram: Collector/Negotiator and Schedds ("Have jobs, need workers") outside the Grid; inside the Grid, one Grid batch slot runs a Starter ("Have worker, need job") and another runs some other Grid job.]

What is a glidein? (2)
  Just a regular starter
  Submitted as a Grid job
  [Diagram: same layout as the previous slide. Labels: "Send job to wg" toward a Schedd, "Expect a job from s2" toward the Starter in the Grid batch slot, which receives the Job.]

What is a glidein? (3)
  Just a regular starter
  Submitted as a Grid job
  A virtual private Condor pool!
  [Diagram: same layout as the first glidein slide, with labels "Have jobs, need workers" and "Have worker, need job".]

What else can a glidein do?
  Make sanity checks before fetching any job
  Discover and publish batch slot characteristics:
    OS version
    CPU model, available RAM and disk
    Availability of certain software
  Import VO-specific software
  Prepare the environment for the user jobs
    Possibly putting the VO software in the path, etc.
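
As a rough illustration of the "discover and publish" step, the sketch below collects a few slot characteristics using only the Python standard library. It is not the glideinWMS implementation: the function names, the attribute names, and the `required_tools` and `min_free_disk_gb` parameters are hypothetical, and the RAM probe only works where `os.sysconf` exposes the relevant values.

```python
import os
import platform
import shutil


def total_ram_mb():
    """Best-effort physical RAM in MB; returns None where it cannot be read."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
        return pages * page_size // (1024 * 1024)
    except (AttributeError, ValueError, OSError):
        return None


def discover_slot(required_tools=("python3",), min_free_disk_gb=1.0, workdir="."):
    """Collect slot attributes and run a simple sanity check.

    All attribute names here are made up for the example.
    """
    free_disk_gb = shutil.disk_usage(workdir).free / 1e9
    attrs = {
        "os_version": platform.platform(),
        "cpu_model": platform.processor() or platform.machine(),
        "ram_mb": total_ram_mb(),
        "free_disk_gb": round(free_disk_gb, 2),
        # Availability of certain software: is each tool on the PATH?
        "software": {tool: shutil.which(tool) is not None for tool in required_tools},
    }
    # Sanity check: enough scratch space and every required tool present.
    attrs["sane"] = free_disk_gb >= min_free_disk_gb and all(attrs["software"].values())
    return attrs


slot = discover_slot(required_tools=(), min_free_disk_gb=0.0)
```

A glidein that fails such a check would simply not ask for a user job, which is how a broken worker node is kept away from user jobs.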

Why use glideins? (1)
  For people already using Condor
    An easy way to extend the pool
      Or to create one from scratch
    Can hide all the grid stuff from user jobs
      Can even run standard universe jobs on the Grid!
  For people just wanting to use the Grid (even if not Condor fans)
    Protect user jobs from many obvious errors
      A dead glidein will not pull a user job
    Simplifies resource selection
      A glidein can detect what is available on the worker, and user jobs get sent only to complying workers
      No guessing involved: a job is sent only after the resource has been acquired

Why use glideins? (2)
  Get all the advantages of a local batch system
    Locally set priorities between different users
      Including group quotas
      Or even priorities between jobs of the same user
    Reliable, real-time monitoring
    Reliable file transfer
      Full file encryption supported, too
  While still running on the Grid!
  Any weak points will be presented at the end

How do I submit a glidein?
  Condor provides condor_glidein
    Simple command line tool
    Useful when you have just a few jobs
    Will submit a single glidein per invocation
  Install a glideinWMS instance
    Needs more resources and some initial effort to set it up
    Set up once; glideins will be launched as needed
      Will look for jobs that need resources
      Submits glideins as needed to sites that seem to match at least one idle job

How does it work?

glideinWMS overview (1)
  A thin layer on top of Condor
  VO frontend does the matches
  [Diagram: two Grid sites, a Glidein Factory, a VO Frontend, a Collector/Negotiator, a second Collector and two Schedds. Labels: "I know of two sites", "Get list of sites", "Get list of jobs" (twice).]

glideinWMS overview (2)
  A thin layer on top of Condor
  VO frontend does the matches
  [Diagram: same components as the previous slide. Labels: "Need 3 glideins from site 1", "Get requests", "Submit glideins".]
  More details at http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=2048

glideinWMS overview (3)
  A thin layer on top of Condor
  [Diagram: same components as the previous slides, now with a Starter and a Job running at a Grid site. Labels: "Submit glideins", "Have worker, need job", "Have jobs, need workers", "Send job to wg".]

glideinWMS details (1)
  Matchmaking done on two levels
    The VO frontend matches glideins to sites that claim to support at least one job waiting in the queue
    The Condor negotiator matches glidein starters to the jobs waiting in the queue
  The Condor negotiator has the final word
    If a site was lying about its capabilities, the starter will not be matched and will exit within minutes
    The job that is sent to a starter might not be the one for which the glidein was submitted
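
The two matchmaking levels can be sketched as two small functions. This is only an illustration under simplifying assumptions: requirements and capabilities are modelled as plain sets of strings, and the function names, job ids, site names and capability names are all hypothetical. Real matching is done on Condor ClassAd expressions.

```python
def frontend_sites_to_request(idle_jobs, site_claims):
    """Level 1 (VO frontend): keep sites whose *claimed* capabilities
    satisfy at least one waiting job."""
    return [
        site
        for site, claimed in site_claims.items()
        if any(job["requires"] <= claimed for job in idle_jobs)
    ]


def negotiator_match(idle_jobs, starter_attrs):
    """Level 2 (negotiator): match a running starter using what the
    glidein actually found on the worker."""
    for job in idle_jobs:
        if job["requires"] <= starter_attrs:
            return job["id"]
    return None  # unmatched: the starter would exit after a short while


idle_jobs = [
    {"id": "j1", "requires": {"cmssw"}},
    {"id": "j2", "requires": {"root"}},
]
site_claims = {"site1": {"cmssw", "root"}, "site2": {"root"}, "site3": set()}

sites = frontend_sites_to_request(idle_jobs, site_claims)
# A glidein lands on site1, but only "root" is really installed there.
matched = negotiator_match(idle_jobs, {"root"})
# A worker with nothing usable gets no job at all.
unmatched = negotiator_match(idle_jobs, set())
```

In this example the glidein may have been requested because of `j1`, yet it ends up running `j2`, and a worker that matches nothing never receives a job.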

glideinWMS details (2)
  The WMS logic is to keep constant pressure on the Grid sites
    As long as there are waiting jobs that could be run on a site, it tries to keep a steady number of idle glideins in the site queues
  The VO frontend drives the WMS
    Deciding how much pressure to put on different sites
  The glidein factories will submit the glideins, following the orders from the VO frontend
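
The "constant pressure" rule amounts to a small top-up calculation per site. The sketch below is one possible reading of it; the function name and the `target_idle` parameter are hypothetical, and the real frontend may weigh additional factors.

```python
def glideins_to_submit(matching_idle_jobs, idle_glideins_in_queue, target_idle=10):
    """How many glideins to add to a site queue in this cycle.

    matching_idle_jobs: waiting user jobs that could run at the site
    idle_glideins_in_queue: glideins already queued there but not yet running
    target_idle: steady number of idle glideins to maintain (assumed knob)
    """
    if matching_idle_jobs == 0:
        return 0  # no demand, so no pressure on this site
    return max(0, target_idle - idle_glideins_in_queue)


top_up = glideins_to_submit(matching_idle_jobs=5, idle_glideins_in_queue=3)
```

Because the calculation only looks at idle glideins, glideins that start running are replaced in the queue for as long as matching jobs remain.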

glideinWMS details (3)
  Communication between processes is based on Condor ClassAds
    For each site, a Glidein Factory publishes:
      CE attributes
      The list of parameters it accepts
    Each VO Frontend replies with a ClassAd containing:
      The target site
      VO parameters (a subset of the above)
      Number of idle glideins to keep in the queue
  Using a standard Condor collector
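
The exchange can be pictured with plain dictionaries standing in for the two ClassAds. Every key and value below is made up for illustration; these are not the actual glideinWMS attribute names. The sketch only shows the "subset of the above" constraint on the VO parameters.

```python
def build_frontend_request(factory_ad, vo_params, idle_glideins):
    """Build the frontend's reply to a factory ad.

    Rejects any VO parameter the factory did not advertise as accepted.
    """
    unknown = set(vo_params) - set(factory_ad["accepted_params"])
    if unknown:
        raise ValueError(f"factory does not accept: {sorted(unknown)}")
    return {
        "target_site": factory_ad["site"],
        "vo_params": dict(vo_params),
        "idle_glideins": idle_glideins,
    }


# What a factory might publish for one site (hypothetical content).
factory_ad = {
    "site": "site1",
    "ce_attributes": {"gatekeeper": "ce.example.org", "batch_system": "pbs"},
    "accepted_params": ["vo_software_dir", "max_walltime_hours"],
}

request = build_frontend_request(factory_ad, {"max_walltime_hours": 24}, idle_glideins=3)
```

In the real system both ads are published to and read from a standard Condor collector rather than passed around as Python objects.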

glideinWMS details (4)
  Condor-G is used for glidein submission
  The list of sites a factory serves is a configuration parameter
    Can be set manually, fine-tuning the characteristics of each and every site
    Easy to script
      Just a standard XML file
      The installation script can use the CRONUS information system; support for the ReSS information system will be added soon
    Or can be paired with a Condor-G matchmaker, like ReSS and CRONUS
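
Since the site list is "just a standard XML file", scripting against it takes only a few lines. The sketch below uses an invented schema: the `factory` and `site` elements and the `name`, `gatekeeper` and `max_idle` attributes are assumptions for the example, not the real glideinWMS configuration format.

```python
import xml.etree.ElementTree as ET

# Hypothetical site list; the real file layout may look quite different.
SITES_XML = """
<factory>
  <site name="site1" gatekeeper="ce1.example.org/jobmanager-condor" max_idle="10"/>
  <site name="site2" gatekeeper="ce2.example.org/jobmanager-pbs" max_idle="5"/>
</factory>
"""


def load_sites(xml_text):
    """Return {site name: settings} parsed from the XML text."""
    root = ET.fromstring(xml_text.strip())
    return {
        site.get("name"): {
            "gatekeeper": site.get("gatekeeper"),
            "max_idle": int(site.get("max_idle", "0")),
        }
        for site in root.iter("site")
    }


sites = load_sites(SITES_XML)
```

A script could generate or update such a file from an information system and leave per-site tuning to manual edits.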

How does it perform?

Do glideins scale? (1)
  Synthetic tests, using a single submit machine, scaled well with ~4000 running jobs
    Memory a major limiting factor
  Further scalability can be obtained by using multiple submission machines
  See also Talk #216
  [Plot: annotated "Ignore the colors"; axis labels 5k and 8 G.]

Do glideins scale? (2)
  glideinWMS-based CMS MC production: up to 1k jobs in parallel
  CDF GlideCAF: up to 2.5k
  ATLAS Cronus: up to 5k
  [Plots: axis labels 1k, 5.2k and 3k.]

Does glideinWMS scale?
  Synthetic tests with a single VO frontend and a single glidein factory
    6 submission points with 100k queued jobs
    50 grid sites
  Further scalability by using multiple VO frontends and multiple glidein factories
  [Plot: axis label 100k.]

Do glideins really work over WAN? (1)
  Yes, but it needs GCB to work
    A Condor proxy server
    The only requirement is that there is outgoing connectivity
  [Diagram: Collector/Negotiator, Schedds and GCB on one side of a Firewall/NAT; a Starter in a Grid batch slot on the other. Labels: "Open long lived outgoing connection", "Have worker, need job, contact me via GCB", "Have jobs, need workers".]

Do glideins really work over WAN? (2)
  Yes, but it needs GCB to work
    A Condor proxy server
    The only requirement is that there is outgoing connectivity
  [Diagram: same layout as the previous slide, now with a Job at the Starter. Labels: "Send job to wgcb", "Expect job from s2", "Use connection to relay traffic".]

What about security? (1)
  glideinWMS glideins use GSI for authentication
  All daemon-to-daemon communication is fully authenticated and message integrity is checked
  [Diagram: Collector/Negotiator and Schedds connected to Starters running in Grid batch slots.]

What about security? (2)
  However, the starter does not run as root!
  Without help, the user job and the starter have to run under the same account!
    A malicious user job can use the starter's privileges
    In this scenario the glidein should only run jobs from the factory user
  OSG is starting to deploy gLExec on WNs
    Allows the starter to start the user job under the appropriate UID
    See gLExec talks #43 and #94
  [Diagram: two Grid batch slots, one without and one with gLExec. Labels: Starter Credentials, User Credentials, Starter, gLExec, User job.]

Any other drawbacks? (1)
  Condor uses a lot of resources
    Be prepared to budget 1.5 MB of RAM per running process on the submit node
    Possibly distribute job submission over multiple nodes
  GCB is still in an active development phase
    Production version stable only up to ~600 running jobs per GCB/schedd pair
      Need to deploy many of them to scale
    Development team is promising much higher scalability
  See also Talk #216
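
The two budget figures on this slide can be turned into a quick sizing estimate. The sketch below reads the RAM figure as 1.5 megabytes and assumes one such submit-side process per running job; both are interpretations, not statements from the slide, and the function and parameter names are hypothetical.

```python
import math


def submit_side_budget(running_jobs, mb_per_job=1.5, jobs_per_gcb_pair=600):
    """Rough submit-side sizing from the slide's rules of thumb.

    mb_per_job: RAM per running process on the submit node (assumed: one per job)
    jobs_per_gcb_pair: stable running jobs per GCB/schedd pair
    """
    return {
        "ram_mb": running_jobs * mb_per_job,
        "gcb_schedd_pairs": math.ceil(running_jobs / jobs_per_gcb_pair),
    }


budget = submit_side_budget(4000)
```

For the ~4000 running jobs quoted in the scalability tests, this gives about 6 GB of submit-node RAM and 7 GCB/schedd pairs.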

Any other drawbacks? (2)
  Glidein factory load scales with the number of Grid sites
    Budget one node every few dozen sites for the current version, or pair it with Condor-G matchmakers, like ReSS
    Work in progress to reduce the load
  If anything goes wrong with the setup, the debugging can be challenging
    Glidein log files are returned only when the job finishes
      May not get them back, if it never ends
    This is a fundamental Grid limitation, not much that can be done about it

Monitoring

Condor collector monitoring

Status Web monitoring

Status XML monitoring

CondorView Monitoring
  Standard Condor tool, not glideinWMS specific
  This one is actually from the CRONUS site

Conclusions
  Condor Glideins
    Can shield user jobs from the Grid
    Give you total control over your jobs
    Allow you to have more control over job scheduling
  GlideinWMS
    An automatic way to create glidein pools on the fly
    Needs some initial effort, but then it operates on its own

glideinWMS home page
  http://home.fnal.gov/~sfiligoi/glideinWMS/
  sfiligoi@fnal.gov