Condor Week 2007: Glidein Factories (and in particular, the glideinWMS), by Igor Sfiligoi.

Presentation transcript:

Slide 1: Condor Week 2007, Glidein Factories (and in particular, the glideinWMS), by Igor Sfiligoi

Slide 2: Anybody heard of “The Grid”?
● “The Grid” is the current way forward in most sciences
  – Certainly in High Energy Physics (and in particular CMS)
● “The Grid” is the sum of “Grid Sites”, each offering a moderate amount of (mostly) computing resources
  – Each site has a standard “Gatekeeper”, responsible for regulating access to the site (how the “Gatekeeper” handles the computing resources is anyone's guess)
(As in Open Science Grid and European Grid for E-Science)

Slide 3: Dear public, “The Grid” is not an easy place to live in!
[Figure labels: “The Grid” and “The User”]

Slide 4: Compare this to Condor
● A single system from the user point of view
  – User submits to a local scheduler
  – Condor does all the magic
Legend: Central manager, Execute node, Submit node and user(s)

Slide 5: So let Condor manage “The Grid”! Life is good again!

Slide 6: So let Condor manage “The Grid”! Life is good again! But how do we get here?

Slide 7: The answer: Condor glide-ins
Legend: Central manager, Execute daemon, Submit node and user(s), Gatekeeper, Worker node


Slide 9: What exactly is “a glidein”?
● “A glidein” is just a regular condor_startd daemon, submitted as a Grid job
● The glidein Grid job needs to:
  – validate the worker node (for example against memory and disk problems)
  – discover or fetch the Condor binaries
  – configure the Condor daemons
  – start the Condor daemons
● For simple use-cases, you can use condor_glidein

Slide 10: The glidein factory
● Needs to know how to submit to the “Grid Sites”
  – How to obtain the list of sites
  – For each site:
    ● how to talk to the “Gatekeeper”
    ● what the configuration of the site is (network, security, software, etc.)
● Needs to know when to submit new glideins
  – Slots are not free
  – Resources not used by my pool could be used by others
    ● Submit only if users need more resources (modulo speculative submissions)
    ● Submit only to sites that declare they can run at least a subset of user jobs

Slide 11: The glideinWMS
● A glidein-based Workload Management System (WMS) developed for USCMS
  – Derived from the CDF GlideCAF (presented at Condor Week 2006)
  – But meant to be generic enough to support different communities
● Uses the divide et impera (divide and conquer) approach
  – Glidein Factories know how to submit to the Grid Sites
  – VO* Frontends monitor jobs and direct the factories
● Condor Collector used for message passing
* VO = Virtual Organization ~ Condor Pool

Slide 12: The glideinWMS
Legend: Central manager, Execute daemon, Submit node and user(s), Gatekeeper, Worker node
WMS legend: Collector, Glidein factory, VO frontend


Slide 15: glideinWMS internals
[Diagram] The factory publishes: Factory Name, Attributes. The VO frontend counts jobs that match the factory attributes and publishes: Factory Name, Requested idle glideins. The factory keeps the requested idle glideins in the queue.
Legend: G = Condor-G scheduler; everything else as on the previous slide
More details in the backup slides

Slide 16: glideinWMS internals
● Glidein startup script simply loads other scripts; the downloaded scripts do all the real work
  – HTTP used for network transfers (cacheable, works when there are no privacy issues)
  – All files signed
[Diagram] The Worker Node runs glidein_startup (startup script + arguments), which fetches from the Web Server through a Web Cache: signature.sha1, file.lst, condor_bin.tgz, configs.cfg, validate.sh, myscript.sh, start_condor.sh
Flow: load file list, load files, execute scripts. Errors? No: start the Startd. Yes: this batch slot would not be able to run a user job.
See backup slides for text description.
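The startup flow on this slide (check every downloaded file against its SHA1 digest, run the scripts, start the Startd only when nothing failed) can be sketched as follows. This is a minimal illustration under stated assumptions, not the glideinWMS startup script: the signature list layout (one `<sha1 hex> <file name>` pair per line) and all function names are hypothetical, and how the signature list itself is authenticated is not covered here.

```python
import hashlib


def parse_signatures(text):
    """Parse lines of '<sha1-hex> <file name>' (an assumed layout)."""
    signatures = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(None, 1)
        signatures[name] = digest.lower()
    return signatures


def verify_files(files, signatures):
    """Return the names whose SHA1 digest is missing or does not match.

    `files` maps file name -> downloaded bytes.
    """
    bad = []
    for name, content in files.items():
        expected = signatures.get(name)
        actual = hashlib.sha1(content).hexdigest()
        if expected != actual:
            bad.append(name)
    return sorted(bad)


def may_start_startd(files, signatures, script_results):
    """Allow the Startd only if every file verifies and every script succeeded.

    `script_results` maps script name -> exit code (0 means success).
    """
    if verify_files(files, signatures):
        return False
    return all(code == 0 for code in script_results.values())
```

A file that is absent from the signature list is treated the same as a tampered one, so an unexpected download blocks the Startd rather than being run unchecked.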

Slide 17: Network security concerns
● Traffic on WAN insecure by definition
● Using x509 (GSI) service proxies for authentication
● Condor tools securing communication between
  ● VO Frontend and Glidein Factory
  ● Startd and Collector/Schedd
● Condor supports integrity checks to prevent data tampering and encryption for privacy
● HTTP-accessed data checked via SHA1 checksums (no privacy possible here)

Slide 18: Security on the Worker Nodes
● Glide-in Condor not running as a privileged user
  – Cannot change UID without help from the system
  – Condor daemons not protected from user jobs
● Open Science Grid (OSG) starting to deploy gLExec on its worker nodes
  – An x509-based Apache suexec derivative
  – Condor can use the service proxy to run the user job under a different UID
  – Same security as if Condor were running as root

Slide 19: Working over Firewalls
● Condor is based on the peer-to-peer principle
  – Needs two-way network traffic
● Most Grid Sites behind firewalls
  – Most have only outgoing connectivity
  – Some only proxied traffic
● Condor GCB can help at such sites
  – See GCB talks for more details
● VPNs could be another option, but are less trivial to use in user space

Slide 20: Conclusion
● “The Grid” has a lot of resources (even for free)
  – Why not use them?
● Glideins allow you to use those resources without a single change in your jobs
  – You can even submit standard universe jobs!
● glideinWMS can help you automate the maintenance of a glidein pool
  – Let me know if you are interested

Slide 21: Glidein Factories, backup slides

Slide 22: VO Frontend ClassAd
Published ClassAd:
  MyType="glideclient" (due to Condor limitations, define also GlideinMyType)
  ClientName="client"
  ReqName="reqX" (target a specific Entry Point)
  ReqIdleGlideins=nr, ReqMaxRun=nr, ReqMaxSubmitXHour=nr (request a steady stream of glideins starting)
  GlideinParamWWW="val1" ... GlideinParamZZZ="valY" (customize the submitted glideins; GlideinParamXXX must match the names published by the factory)
  GlideinMonitorNNN="valN" ... GlideinMonitorMMM="valM" (monitoring data like: Idle="546", Running="222")

Slide 23: Glidein Factory ClassAd
Published ClassAd:
  MyType="glidefactory" (due to Condor limitations, define also GlideinMyType)
  FactoryName="factory"
  GlideinName="entry"
  Attribute1="..." ... AttributeN="..." (attributes that describe the glidein, like: ARCH="INTEL", MaxHours=72, Site="Florida")
  GlideinParamXXX="val1" ... GlideinParamYYY="valZ" (parameters set glidein parameter defaults, like: CONDOR_HOST="UNDEFINED", SEC_DEFAULT_ENCRYPTION=OPTIONAL, MinDisk=16G, CheckFilesExist="/tmp/CMS,$DATA/OSG")
  GlideinMonitorNNN="valN" ... GlideinMonitorMMM="valM" (monitoring data like: TotalStatusIdle="234", TotalStatusRunning="1356", TotalRequestedIdle="50")
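The two ClassAd slides state one concrete rule: the parameter names a frontend sets must match the names the factory publishes, and the factory values act as defaults. A minimal sketch of that rule, using plain Python dicts as stand-ins for ClassAds, might look like this. The function names are hypothetical, and treating the frontend value as an override of the factory default is an assumption drawn from the words "defaults" and "customize" on the slides.

```python
PARAM_PREFIX = "GlideinParam"


def param_names(classad):
    """Return the set of GlideinParam* suffixes found in a ClassAd-like dict."""
    return {key[len(PARAM_PREFIX):] for key in classad if key.startswith(PARAM_PREFIX)}


def unknown_params(request_ad, factory_ad):
    """Names the frontend request sets that the factory entry does not publish."""
    return sorted(param_names(request_ad) - param_names(factory_ad))


def effective_params(request_ad, factory_ad):
    """Factory defaults, overridden by the values in the frontend request."""
    unknown = unknown_params(request_ad, factory_ad)
    if unknown:
        raise ValueError("parameters not published by the factory: " + ", ".join(unknown))
    merged = {}
    # Apply the factory defaults first so the request values win.
    for ad in (factory_ad, request_ad):
        for key, value in ad.items():
            if key.startswith(PARAM_PREFIX):
                merged[key[len(PARAM_PREFIX):]] = value
    return merged
```

Rejecting unknown names up front, instead of silently ignoring them, makes a misspelled parameter visible before any glidein is submitted.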

Slide 24: glideinWMS internals
● Glidein Factory essentially a publish-read-submit loop
  – Publish entry point (to the WMS Collector)
  – Query WMS Collector (Frontend attributes)
  – Query Factory Schedd (count idle glideins)
  – Submit glideins (to the Factory Schedd-G)
Diagram labels: Factory Collector, Factory Schedd-G, WMS Collector
Details about ClassAd content in the backup slides
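One pass of the publish-read-submit loop can be sketched as below. This is an illustration only: the four callables are hypothetical stand-ins for the collector and schedd interactions named on the slide, and the sketch honours only the requested idle count, ignoring limits such as ReqMaxRun and ReqMaxSubmitXHour.

```python
def glideins_to_submit(requested_idle, idle_in_queue):
    """Top the queue up to the requested number of idle glideins, never below zero."""
    return max(0, requested_idle - idle_in_queue)


def factory_iteration(entry_name, publish, read_requests, count_idle, submit):
    """Run one publish-read-submit pass for a single entry point.

    All four callables are hypothetical stand-ins:
      publish(entry_name)             advertise the entry point
      read_requests(entry_name)       -> {client name: requested idle glideins}
      count_idle(entry_name, client)  -> idle glideins already queued
      submit(entry_name, client, n)   queue n more glideins
    Returns {client name: number submitted}.
    """
    publish(entry_name)
    submitted = {}
    for client, requested_idle in read_requests(entry_name).items():
        n = glideins_to_submit(requested_idle, count_idle(entry_name, client))
        if n > 0:
            submit(entry_name, client, n)
        submitted[client] = n
    return submitted
```

Passing the interactions in as callables keeps the loop logic testable without a running pool.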

Slide 25: glideinWMS internals
● VO Frontend acts as a matchmaker
  – Query Schedd(s) (job attributes)
  – Query WMS Collector (factory attributes)
  – Match and count (number of jobs per factory)
  – Publish requests (to the WMS Collector)
Diagram labels: VO Collector, VO Schedd, WMS Collector
Details about ClassAd content in the backup slides
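The match-and-count step can be sketched as follows. Real ClassAd matchmaking evaluates expressions; this sketch assumes a much simpler model in which a job carries a dict of required attribute values, and it sets the requested idle count equal to the number of matching idle jobs, which is an assumption. The request keys are the attribute names from the VO Frontend ClassAd backup slide; the job layout and function names are hypothetical.

```python
def matches(job, factory_attrs):
    """A job matches when every requirement it states equals the factory attribute."""
    return all(factory_attrs.get(k) == v for k, v in job.get("requirements", {}).items())


def count_idle_jobs_per_factory(jobs, factories):
    """Return {factory name: number of idle jobs that could run there}."""
    counts = {name: 0 for name in factories}
    for job in jobs:
        if job.get("status") != "idle":
            continue
        for name, attrs in factories.items():
            if matches(job, attrs):
                counts[name] += 1
    return counts


def build_requests(jobs, factories, client_name="client"):
    """Build one request per factory that has at least one matching idle job."""
    requests = []
    for name, n in sorted(count_idle_jobs_per_factory(jobs, factories).items()):
        if n > 0:
            requests.append({
                "MyType": "glideclient",
                "ClientName": client_name,
                "ReqName": name,
                "ReqIdleGlideins": n,
            })
    return requests
```

A job that matches several factories is counted at each of them, so this sketch can request more glideins than there are jobs; a production frontend would need some way to split or cap that demand.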

Slide 26: Glidein details
● Dummy startup script
  – Just loads other files and executes the ones marked as executable
● File transfer implemented using HTTP
  – Easily cacheable, standard tools available (Squid)
  – Proven to scale, widely used in industry
● All sensitive file transfers signed (SHA1)
  – Prevents tampering, as HTTP travels in the clear

Slide 27: Glidein details
● Standard sanity checks provided
  – Disk space constraints
  – Node blacklisting
● Generic Condor configure and startup script provided, too
● Factory admins can easily add their own customization scripts (both for checks and configs)
  – Allowing Frontends to add custom scripts envisioned, but not yet implemented
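The two standard sanity checks named here, a disk space constraint and node blacklisting, could look roughly like the sketch below. It is not the script shipped with glideinWMS: the function names, the exact-match blacklist, and the way thresholds are passed in are all assumptions made for illustration.

```python
import shutil
import socket


def has_enough_disk(path, min_free_bytes, disk_usage=shutil.disk_usage):
    """True when the file system holding `path` has at least `min_free_bytes` free."""
    return disk_usage(path).free >= min_free_bytes


def is_blacklisted(hostname, blacklist):
    """True when the node name appears in the blacklist (exact match, case-insensitive)."""
    return hostname.lower() in {name.lower() for name in blacklist}


def validate_node(path, min_free_bytes, blacklist, hostname=None,
                  disk_usage=shutil.disk_usage):
    """Return a list of problems; an empty list means the node passed both checks."""
    hostname = hostname or socket.gethostname()
    problems = []
    if is_blacklisted(hostname, blacklist):
        problems.append("node %s is blacklisted" % hostname)
    if not has_enough_disk(path, min_free_bytes, disk_usage):
        problems.append("less than %d bytes free under %s" % (min_free_bytes, path))
    return problems
```

A validation script built on this would exit non-zero when the list is non-empty, which under the startup flow described earlier means no Startd is started on that batch slot.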

Slide 28: Condor, one-way firewall
[Diagram] Open a permanent connection; reuse the permanent connection.

Slide 29: glideinWMS support
● glideinWMS developed by and for the CMS collaboration
  – No funding to support other users
● However:
  – Having other users would bring in new ideas
● Best-effort support will always be there for everybody
  – Collaboration with other groups welcome
    ● both for development and support