Pilot Factory using Schedd Glidein Barnett Chiu BNL 10.04.07.



Problem to Solve (1)
Pilot
  - Probes the resource (http, environment, interpreter, other executables, etc.)
  - Pulls jobs from a remote server (e.g. Panda server)
  - Matchmaking
      - Groups jobs into different categories, e.g. production jobs, analysis jobs (CHARMM …), test jobs …
      - Other criteria: number of CPUs, RAM, etc.
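The matchmaking step can be illustrated with a minimal Python sketch. It is not the actual pilot or Panda server code: the `Job` and `Resource` records, their field names, and the `match` function are all hypothetical, and the only ideas taken from the slide are grouping jobs by category and filtering on CPU count and RAM.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; field names are assumptions."""
    job_id: int
    category: str      # e.g. "production", "analysis", "test"
    min_cpus: int
    min_ram_mb: int


@dataclass(frozen=True)
class Resource:
    """What a pilot might report after probing its host (hypothetical)."""
    cpus: int
    ram_mb: int


def group_by_category(jobs):
    """Group jobs by category, preserving their original order."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job.category].append(job)
    return dict(groups)


def match(jobs, resource, category):
    """Return the first job of `category` that fits `resource`, or None."""
    for job in group_by_category(jobs).get(category, []):
        if job.min_cpus <= resource.cpus and job.min_ram_mb <= resource.ram_mb:
            return job
    return None


jobs = [
    Job(1, "production", min_cpus=4, min_ram_mb=8192),
    Job(2, "analysis", min_cpus=1, min_ram_mb=1024),
    Job(3, "production", min_cpus=1, min_ram_mb=2048),
]
probed = Resource(cpus=2, ram_mb=4096)
print(match(jobs, probed, "production"))
```

Here job 1 is skipped because it needs more CPUs and RAM than the probed resource offers, so the pilot would be handed job 3.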

Problem to Solve (2)
Current approach to pilot submission
  - Local pool: vanilla universe
  - Remote pool: Condor-G
Large numbers of user jobs (production + analysis) mean large numbers of Condor-G pilot jobs, and therefore computational overhead on the gatekeepers (e.g. high memory consumption)

Solution
Is there a way to bypass GRAM when submitting jobs to remote machines? Local submission, but how?
  - We need something that continuously submits local pilot jobs on the gatekeeper
  - Solution: Pilot Factory

Pilot Factory Overview
Pilot Factory is an application that combines the following ideas:
  - schedd glidein
  - pilot submission program (or pilot generator)
What is a glidein?
  - A mini Condor pool on a remote machine
      - A complete Condor pool has at least 5 components: master, startd, schedd, collector, negotiator
      - Glidein: {master, startd}, {master, schedd}, etc.
  - Properly configured Condor daemons submitted as a batch job
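As a rough illustration of "glidein = a subset of the pool's daemons", the sketch below maps a glidein type to the daemons it runs and renders the matching `DAEMON_LIST` line for a Condor configuration file. `DAEMON_LIST` is a standard Condor configuration macro; the type names, the dictionary, and the helper functions are assumptions made for this example, not part of condor_glidein.

```python
# Daemons of a complete Condor pool, as listed on the slide.
FULL_POOL = {"MASTER", "STARTD", "SCHEDD", "COLLECTOR", "NEGOTIATOR"}

# Hypothetical mapping from glidein type to the daemons it starts.
GLIDEIN_DAEMONS = {
    "startd": ("MASTER", "STARTD"),
    "schedd": ("MASTER", "SCHEDD"),
}


def daemon_list_line(glidein_type):
    """Render the DAEMON_LIST config line for a glidein type."""
    return "DAEMON_LIST = " + ", ".join(GLIDEIN_DAEMONS[glidein_type])


def missing_daemons(glidein_type):
    """Daemons of a full pool that this glidein type does not run."""
    return sorted(FULL_POOL - set(GLIDEIN_DAEMONS[glidein_type]))


print(daemon_list_line("schedd"))   # DAEMON_LIST = MASTER, SCHEDD
print(missing_daemons("schedd"))    # ['COLLECTOR', 'NEGOTIATOR', 'STARTD']
```

The daemons a glidein does not run itself (for a schedd glidein: collector, negotiator, startd) have to be provided elsewhere, for example by the central manager of the pool the glidein joins.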

Glidein (1)
Two major steps
  - Condor-G #1: installation
      - glidein setup script
      - Condor configuration file
      - glidein startup script
      - download Condor binaries (http, gsiftp, etc.)
  - Condor-G #2: execution
      - exec the glidein startup script, which starts condor_master

Glidein (2)
[Diagram labels: Central Manager (Collector); Submit Host (master, schedd); Tarball server; Execute hosts running glideins (master + startd, master + schedd); Glidein types; ~/Condor_glidein; Startup script; Glidein config {master, schedd …} ?]

Schedd Glidein
Logic based on the startd glidein (two Condor-G jobs to set up)
Usage: by running a glidein schedd on the gatekeeper, the schedd serves as a gateway between the submit host and grid sites
Mini Condor pool with schedd functionality:
  - Submit host
  - Maintains a persistent queue of jobs
  - Communicates with the native batch system and forwards user jobs (Condor, PBS, LSF, etc.)
  - Manipulates job queues through the following commands: condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio
  - Security features (GSI)
      - Who is authorized to set up a Pilot Factory?

Schedd Glidein Example (1)
Command (schedd glidein #1):
  condor_glidein -count 1 -arch i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk01.racf.bnl.gov/jobmanager-fork -type schedd -forcesetup
Command (schedd glidein #2):
  condor_glidein -count 1 -arch i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk02.racf.bnl.gov/jobmanager-fork -type schedd -forcesetup
Command (schedd glideins #3, #4, #5):
  condor_glidein -count 3 -arch i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork nostos.cs.wisc.edu/jobmanager-fork -type schedd -forcesetup
Use fork, since we want the schedd to be on the gatekeeper!
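When the same request has to be repeated for several gatekeepers, the command lines can be generated rather than typed. The sketch below only assembles and prints them. The flags are copied verbatim from the commands on this slide and are not checked against any particular Condor release (the `-type` option in particular is assumed to come from the condor_glidein version used in the talk). The `glidein_command` helper is hypothetical.

```python
import shlex


def glidein_command(gatekeeper, count=1, arch="i686-pc-Linux-2.4",
                    glidein_type="schedd"):
    """Build the argv list for one condor_glidein request.

    Flags mirror the slide's commands; they are not validated here.
    """
    return [
        "condor_glidein",
        "-count", str(count),
        "-arch", arch,
        "-setup_jobmanager=jobmanager-fork",
        f"{gatekeeper}/jobmanager-fork",   # fork keeps the schedd on the gatekeeper
        "-type", glidein_type,
        "-forcesetup",
    ]


# (gatekeeper, number of schedd glideins), as in the slide's three commands.
requests = [
    ("gridgk01.racf.bnl.gov", 1),
    ("gridgk02.racf.bnl.gov", 1),
    ("nostos.cs.wisc.edu", 3),
]

for host, count in requests:
    print(shlex.join(glidein_command(host, count)))
```

Keeping the command as an argv list means it could later be handed to `subprocess.run` without shell quoting issues; printing it first makes it easy to review what would be executed.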

Schedd Glidein Example (2)
Command: condor_status -schedd

  Name         Machine      TotalRunningJobs TotalIdleJobs TotalHeldJobs
  gridgk01.r
  gridgk02.r
  gridui01.u
  ribera.cs
  ron.cs.wis
  vail.cs.wi

                            TotalRunningJobs TotalIdleJobs TotalHeldJobs
                     Total                 0             0             0

Pilot Submission Program (Generator)
  - Communicates with a DB server that maintains information about pilot jobs
      - e.g. pilot_type, pilot_queue
  - Pulls the desired pilot script from an external server
  - Periodically submits pilot jobs (with the pilot script as executable)
      - condor_submit
      - qsub? No, not necessary, since …
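The generator's periodic loop might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the actual pilot generator: the shape of the record returned by the DB lookup, the pilot script's `--type` and `--queue` arguments, and all function names are hypothetical. The submit step is passed in as a function so the loop can be exercised without a Condor installation; `condor_submit_stdin` shows one way to hand the description to `condor_submit` on its standard input.

```python
import subprocess
import time


def pilot_submit_description(pilot_script, pilot_type, pilot_queue, count):
    """Render a submit description that queues `count` pilots."""
    lines = [
        f"executable = {pilot_script}",
        # Hypothetical pilot arguments; the real script's options are unknown.
        f"arguments = --type {pilot_type} --queue {pilot_queue}",
        "output = pilot.$(Cluster).$(Process).out",
        "error = pilot.$(Cluster).$(Process).err",
        "log = pilot.log",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


def condor_submit_stdin(description):
    """Pipe a submit description into condor_submit (needs Condor installed)."""
    subprocess.run(["condor_submit"], input=description, text=True, check=True)


def run_generator(fetch_pilot_info, submit, cycles, interval_s=0.0,
                  sleep=time.sleep):
    """Every `interval_s` seconds, look up pilot info and submit pilots."""
    for i in range(cycles):
        info = fetch_pilot_info()          # stands in for the DB server query
        submit(pilot_submit_description(**info))
        if i < cycles - 1:
            sleep(interval_s)


# Dry run: collect the descriptions instead of calling condor_submit.
submitted = []
run_generator(
    lambda: {"pilot_script": "pilot.sh", "pilot_type": "production",
             "pilot_queue": "BNL", "count": 5},
    submitted.append,
    cycles=2,
)
print(submitted[0])
```

On a gatekeeper running the schedd glidein, `submit=condor_submit_stdin` would replace the dry-run collector.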

Build Pilot Factory with Glidein
  - Schedd glidein installed and executed on the gatekeeper
  - The user submits a Condor-C job with the pilot generator as the executable
      - The generator runs on the gatekeeper as a local universe job supervised by the glidein schedd
  - The generator submits pilots
      - Types and frequency adjustable by users
      - Depending on the native batch system, pilots can be submitted as grid universe jobs
      - Along with GAHP and related binaries, the schedd can communicate with different batch systems
[Diagram labels: master, schedd, JobManager, LSF, PBS, schedd, Grid Resource, Pilot generator]
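A Condor-C submission of the generator could be described along the following lines. This is a sketch, assuming the usual Condor-C submit syntax (`universe = grid` with a `grid_resource = condor <schedd> <pool>` line) and the `+remote_jobuniverse` attribute, with 12 as Condor's numeric code for the local universe. The schedd name, collector name, and executable name are placeholders, not values from the talk.

```python
LOCAL_UNIVERSE = 12  # Condor's numeric job-universe code for "local"


def condor_c_description(generator, remote_schedd, remote_pool):
    """Render a Condor-C submit description for the pilot generator.

    remote_schedd: name of the glidein schedd on the gatekeeper (placeholder)
    remote_pool:   collector that the glidein schedd reports to (placeholder)
    """
    lines = [
        "universe = grid",
        f"grid_resource = condor {remote_schedd} {remote_pool}",
        f"executable = {generator}",
        # Ask the remote schedd to run the job in its local universe.
        f"+remote_jobuniverse = {LOCAL_UNIVERSE}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "log = generator.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


description = condor_c_description(
    "pilot_generator.py",        # hypothetical executable name
    "schedd.example.org",        # placeholder glidein schedd name
    "collector.example.org",     # placeholder central manager
)
print(description)
```

Running the generator in the local universe keeps it on the gatekeeper under the glidein schedd's supervision, which is what lets it call `condor_submit` against that same schedd.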

Pilot Factory
[Diagram labels: Submit Node (Collector, Master, Negotiator, Schedd); Glidein request; Gatekeeper with {Globus, Condor|PBS|…}; Pilot Factory (master, schedd), connected to Collector; Submit Pilots; Cluster Worker Nodes]

Future Work
Integrating the pilot with the Condor startd to implement a startd-based pilot
  - The startd-based pilot retrieves the payload of a user job in the same way as the generic pilot does, but it also inherits the functionality of the Condor startd
  - The original intention was to run PFs with the startd-pilots on worker nodes (too greedy, unacceptable?)
  - Using the Condor startd makes it easier to integrate with gLexec
Transform Generic PF (GPF) to Startd PF (SPF)

Reference
[1] Schedd Glidein
[2] Pilot Factory
[3] glideinWMS: An advanced application on glideins