Condor Week 2004
The use of Condor at the CDF Analysis Farm
Presented by Igor Sfiligoi on behalf of the CAF group

The CAF
● Develop, debug and submit on the same machine
● Output to any place the user wants
➔ No need to stay connected
[Diagram: user machines submitting to the CAF]

The user side
● User groups executable, data and libraries in a directory
● The directory is tarred up
● Tar-ball sent via kerberized socket connection
● Job split in several sections
  – Same executable, different parameters

  mymachine> ls mydir
  myexe  mydata.conf  lib/libfirst.so  lib/libsecond.so

[Diagram: mytar.tgz goes from mydir to the CAF, which returns a JobID]
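
As a minimal illustration of the packaging step (names follow the listing above; the real CAF client also splits the job into sections and sends the tar-ball over the kerberized connection):

    import tarfile

    def make_tarball(job_dir="mydir", tarball="mytar.tgz"):
        # Pack the user's executable, data and libraries into one tar-ball,
        # keeping the directory layout so it unpacks cleanly on the worker node.
        with tarfile.open(tarball, "w:gz") as tar:
            tar.add(job_dir, arcname=".")
        return tarball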

The server side
● Authenticate user using kerberos
● Receive tar-ball and put it in a local dir
● Create submit description files
● Submit to Condor

  Server> ls submit_dir
  data.tgz  CafExe  job.dag  dagman.ClassAd
  section_1000.ClassAd ... section_3534.ClassAd

[Diagram: the user's mytar.tgz reaches the Submitter, which runs condor_submit and returns a JobID]
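
Purely to illustrate the flow, a rough Python sketch (file names follow the listing above; the real CAF writes its own dagman.ClassAd and runs condor_submit on it, while this sketch simply calls condor_submit_dag):

    import os, subprocess
    from textwrap import dedent

    SECTION_SUBMIT = dedent("""\
        universe   = vanilla
        executable = CafExe
        arguments  = {sec}
        log        = job.log
        output     = section_{sec}.out
        error      = section_{sec}.err
        queue
    """)

    def write_sections(submit_dir, first=1000, last=3534):
        # One submit description file per section, as in the listing above.
        for sec in range(first, last + 1):
            path = os.path.join(submit_dir, "section_%d.ClassAd" % sec)
            with open(path, "w") as f:
                f.write(SECTION_SUBMIT.format(sec=sec))

    def submit_job(submit_dir):
        # Hand the whole job to DAGMan; job.dag is sketched after the next slide.
        # -maxidle bounds the number of queued Condor jobs (see "CDF CAF in numbers").
        subprocess.call(["condor_submit_dag", "-maxidle", "1000",
                         os.path.join(submit_dir, "job.dag")])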

Condor submission
● Every job has its own staging directory
● Using dagman
  – Script creates dagman submit description file
  – Plus one description file per section
  – Final cleanup script removes tar-ball
● Flat DAG, with only the cleanup script as child
● Using kerberos service principals for authentication
  – Don't want to have a Unix uid for every user
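
For illustration, such a flat DAG file could look roughly like this (node and file names follow the earlier listing; "..." stands for the intermediate sections, and the cleanup could equally be implemented as a POST script):

    JOB section_1000 section_1000.ClassAd
    JOB section_1001 section_1001.ClassAd
    ...
    JOB section_3534 section_3534.ClassAd
    JOB cleanup      cleanup.ClassAd
    PARENT section_1000 section_1001 ... section_3534 CHILD cleanup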

Job specifics: "Transfer In"
● Using Condor transfer mechanism to transfer
  – Tar-ball
  – The startup wrapper
  – A kerberos keytab
● Encryption needed for the keytab file
● Using VMx_USER
● Kerberos used for outside authentication
  – User-specific service principal extracted from the keytab
  – Keytab removed before user executable starts
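
For illustration, the corresponding lines of a section's submit description file might read as follows (file names, including the wrapper name, are assumed; encrypt_input_files is the present-day name of the per-file encryption knob):

    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = mytar.tgz, CafWrapper, user.keytab
    encrypt_input_files     = user.keytab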

Job specifics: "Transfer Out"
● Queued rcp used to copy output to user-specified location
  – Section output too big for the head node
  – Original submission machine may be down
● Backup file server tried if first rcp fails
● Condor transfer mechanism used only to get the section log and summary files
● If all rcp attempts fail, data are transferred to the head node as the last resort
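
A much-simplified sketch of that fallback chain (host names are made up, and the real wrapper queues the rcp copies instead of running them inline):

    import subprocess

    def copy_out(output_file, user_dest,
                 backup="backup-fs.example.org:/caf/output",
                 headnode="caf-head.example.org:/caf/spool"):
        # Try the user's destination, then the backup file server,
        # and only as a last resort dump the output on the head node.
        for dest in (user_dest, backup, headnode):
            if subprocess.call(["rcp", output_file, dest]) == 0:
                return dest
        raise RuntimeError("all output copies failed for %s" % output_file)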

Mailer
● Implemented as a separate process
  – Only one mail for the whole DAG
  – A mail must be generated even if the job is removed
● CAF-specific information included
● Has a list of dagmans to watch
● Generates a mail when a dagman ends
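
A toy sketch of the idea (polling condor_q for the watched dagman clusters is just one possible way to detect that a DAG has ended; addresses and cluster IDs are made up):

    import smtplib, subprocess, time
    from email.message import EmailMessage

    def dagman_gone(cluster):
        # True once the dagman job has left the queue (finished or removed).
        out = subprocess.run(["condor_q", str(cluster), "-af", "ClusterId"],
                             capture_output=True, text=True).stdout
        return out.strip() == ""

    def watch(clusters, user_mail="user@example.org"):
        # One mail per DAG, sent only when its dagman ends.
        while clusters:
            for cluster in list(clusters):
                if dagman_gone(cluster):
                    msg = EmailMessage()
                    msg["Subject"] = "CAF job %d finished" % cluster
                    msg["From"] = "caf@example.org"
                    msg["To"] = user_mail
                    msg.set_content("Your CAF job %d has ended." % cluster)
                    with smtplib.SMTP("localhost") as smtp:
                        smtp.send_message(msg)
                    clusters.remove(cluster)
            time.sleep(60)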

Monitoring data: job information
● condor_q too expensive
● Parsing log files
  – One global log of all dagman submits (dagman.log)
  – One submit log for every job (job_n/job.log)
  – CAF-specific log files
[Diagram: dagman.log lists dagman 1 ... dagman n; dagman n has Section 1 ... Section j, with job_n/job.log and job_n/section_1.out ... job_n/section_j.out]
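
To give a flavour of the log parsing, a minimal sketch that tallies Condor user-log events by their numeric code (000 = submitted, 001 = executing, 005 = terminated); the real CAF parser of course extracts much more:

    import re
    from collections import Counter

    # Condor user-log events start with a three-digit code, e.g.
    # "005 (1234.000.000) ... Job terminated."
    EVENT = re.compile(r"^(\d{3}) \(\d+\.\d+\.\d+\)")

    def count_events(logfile):
        # Tally events per code for one log, e.g. job_1/job.log.
        counts = Counter()
        with open(logfile) as log:
            for line in log:
                match = EVENT.match(line)
                if match:
                    counts[match.group(1)] += 1
        return counts

    # count_events("job_1/job.log").get("005", 0) -> sections terminated so far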

Monitoring information: system
VM information
● condor_status cheap enough
● Used to map back which section runs where
  – Not enough information in the log files
Priorities
● condor_userprio used for user priorities
● Section priorities maintained in submit description files

Monitoring: command line
● Logically mimics a unix shell
  – jobs
  – ls, tail, cat
  – top
  – gdb
● COD used to send requests to the worker node
[Diagram: user → Monitor → worker node via COD; CafRout sits next to CafExe, connected by a write pipe, and runs requests such as top]

Monitoring: web interface
● Polling method used
● Web pages dynamically generated based on snapshot
● History data maintained using RRD (Round Robin Database)
● See demo
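
As a hedged illustration of keeping such history with RRD (database layout and metric names are made up):

    import subprocess, time

    def create_rrd(path="caf_usage.rrd"):
        # One 5-minute GAUGE for the number of running sections, ~2 days of history.
        subprocess.call(["rrdtool", "create", path, "--step", "300",
                         "DS:running:GAUGE:600:0:U",
                         "RRA:AVERAGE:0.5:1:600"])

    def update_rrd(running, path="caf_usage.rrd"):
        # Called from the polling loop after each snapshot.
        subprocess.call(["rrdtool", "update", path,
                         "%d:%d" % (time.time(), running)])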

User intervention
● Command line tools for user administration
  – Kill a job
  – Kill one or more sections
  – Change relative priority
  – Change timeout
● Unix-like
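
Underneath, these operations map onto standard Condor tools; a hedged illustration (cluster and section numbers are examples):

    condor_rm 1234          # kill a whole CAF job (remove the dagman cluster)
    condor_rm 1240.5        # kill a single section (one proc of a cluster)
    condor_prio -p 10 1240  # change the relative priority of a job's sections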

CDF CAF in numbers
● At present
  – 180 nodes
  – 6 VMs per node (5 used, 1 for test)
  – Total 900 VMs in use
● By month end
  – Additional 160 nodes
  – Total 1700 VMs in use (goal: 5000 VMs by year end)
● 100s of users
  – At least 50 active at any time
● 100s of CAF jobs in queue typical
  – Gives 10k-100k sections
  – -maxidle keeps the number of queued Condor jobs below 10k

Condor configuration
● Single schedd
  – 10k jobs, 100 dagmans
  – 1k VMs, 200 nodes
➔ Single most demanding piece of the system
● Kerberos authentication
● Vanilla universe jobs
  – Preemption in the first minutes only
● Condor tuning
  – Relaxed timeouts
  – Delay between submissions in dagman
  – Optimized kerberos authentication
  – Schedd autoclustering
  – Per-file encryption
● Using a pre-released version
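
A few of these knobs, written with present-day configuration macro names purely as an illustration (the exact names and values in the 2004 pre-release may have differed):

    # Kerberos authentication, with encryption available per file
    SEC_DEFAULT_AUTHENTICATION_METHODS = KERBEROS
    SEC_DEFAULT_ENCRYPTION = OPTIONAL

    # Pause between consecutive DAGMan submissions, to protect the single schedd
    DAGMAN_SUBMIT_DELAY = 1

    # Limit the attributes considered for schedd autoclustering
    SIGNIFICANT_ATTRIBUTES = Requirements, Rank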

Missing: Group accounting
● Several institutions in the collaboration
  – Common pool financed by all
  – Several pools financed by a single institution (15)
● Different contributions
  – Some more than 100 nodes
  – Some only a few
● Users can run in different pools
  – Owners must get priority treatment
  – CPU used by owners in the private pool must not influence the priority in the common pool

Possible solution: flocking?
Proposal
● One pool for common use
● One pool for every institution
  – Owners preferred
● Flocking between the common pool and any of the other pools
Problems
● A management nightmare
● Small pools penalized
  – No preemption
● Unfair accounting for stolen CPU

New feature: Hierarchical priorities
● Hierarchical priorities
  – A tree of policies
● Each node can have a different policy
  – Current fair share
  – Ranking
  – Belong-to
  – Quotas (up to x VMs)
[Diagram: a fair-share root with two children: Common (quota 900 VMs, allow only CDF users) and MIT (quota 12 VMs, allow only MIT users); a job enters at the root]

Hierarchical priorities: Advantages
● Easy to manage
● Very flexible
● Allows for use of roles
[Diagram: the same user appears under different roles in different branches of our pool (CDF common, CDF MIT, CMS, ATLAS, Grid3), e.g. john#CDF/MIT, john#CMS, igor#CDF/INFN]

Future
● Better use of dagman
  – Wait for data to be staged
  – Merge section output
  – Expose DAG to users
● Use COD for interactive use
  – PEAC prototype at Supercomputing 2003
● Use glide-in on remote sites that don't want to use Condor
● Opportunistic use of other pools
  – Flocking with D0 and Grid3 pools