OSG Consortium Meeting - March 6th 2007
Evaluation of Workload Management Systems for OSG - by I. Sfiligoi

Presentation transcript:

Slide 1: Evaluation of Workload Management Systems for OSG
by Igor Sfiligoi & Burt Holzman – Fermilab
with contributions by Balamurali Ananthan
[Diagram: multiple Grid Compute Elements (CEs)]

Slide 2: What did we test
● Scalability and reliability
  – in a single-user environment
  – using several Grid sites, set up on top of production ones (Caltech, Fermilab, Madison, UCSD)
  – running simple sleep jobs (0.4h-5h), using small I/O files
● Tested WMSes
  – Plain Condor-G
  – ReSS
  – gLite WMS
  – glideinWMS
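The workload described above can be sketched as a batch of sleep jobs with durations drawn from the stated 0.4h-5h range. This is a minimal illustration, not the actual test harness; the function name and the uniform distribution are assumptions.

```python
import random

def make_sleep_workload(n_jobs, min_hours=0.4, max_hours=5.0, seed=42):
    """Generate sleep-job durations in seconds, mimicking the talk's
    0.4h-5h sleep jobs. Uniform sampling is an assumption."""
    rng = random.Random(seed)
    return [round(rng.uniform(min_hours, max_hours) * 3600)
            for _ in range(n_jobs)]

# One 10k-job batch, as in the 4x10k scalability tests
durations = make_sleep_workload(10_000)
print(min(durations), max(durations))
```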

Slide 3: Plain Condor-G
● Manual selection of the site
  – Base test to verify CE scalability and reliability
[Diagram: schedd submitting directly to multiple Grid CEs]

Slide 4: Condor-G scalability
● Scales nicely; no problems found up to 4x10k
[Plot: queued/running jobs over time; ~2k batch slots at the Grid site; one monitoring problem visible]

Slide 5: Condor-G reliability
● Works fine when the Grid site is stable
● But lots of jobs fail when the Grid site misbehaves
  – Nothing can be done on the client side
[Plot annotation: "This site worked perfectly 24h ago"]

Slide 6: Condor-G reliability (2)
● Another example
[Plot: Running / Succeeded / Failed jobs; problems around 4PM and 4AM]

Slide 7: Condor-G reliability
● Condor-G does not handle Grid CE crashes well
  – If jobs are removed from the Grid queue before the CE comes back, Condor-G still thinks all the jobs are there
  – If the GridMonitor process gets killed on the CE, Condor-G loses all control over the jobs it was managing
● I have several times observed substantial differences between what Condor-G thinks is queued and what was actually queued

Slide 8: ReSS
● A Condor-G based system
  – ReSS selects the Grid site for the user
  – Needs information from the Grid sites (CEMon in OSG v0.6)
[Diagram: Schedd, ReSS and CEMon-publishing Grid CEs]
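The site selection described above can be illustrated with a toy matchmaking function: job requirements are compared against site attributes published by an information service. This is only a sketch of the idea; real ReSS uses Condor ClassAd matchmaking, and all names and attributes below are hypothetical.

```python
def select_site(job_req, site_ads):
    """Pick the first site whose advertised attributes satisfy the job's
    requirements. A toy stand-in for ClassAd matchmaking."""
    for site in site_ads:
        if all(site.get(k, 0) >= v for k, v in job_req.items()):
            return site["name"]
    return None  # no match: the job stays idle

# Hypothetical site ads, of the kind CEMon might publish
sites = [
    {"name": "fnal_ce", "free_slots": 120, "memory_mb": 2048},
    {"name": "ucsd_ce", "free_slots": 0,   "memory_mb": 4096},
]
print(select_site({"free_slots": 1, "memory_mb": 1024}, sites))  # fnal_ce
```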

Slide 9: ReSS scalability
● No problem up to 4x10k queued
  – Had to test on a single Grid pool (the only one with CEMon)
[Plot: Queued and Running jobs; 2k slots on the Grid site]

Slide 10: ReSS reliability
● Similar to Condor-G
● Potentially, a misconfigured CEMon can send jobs to the wrong Grid site
  – At least on paper... unfortunately, tested with just one site
● Certain failures could potentially be recovered automatically
  – Not out of the box, not tested
[Plot: Succeeded vs Failed jobs]

Slide 11: gLite WMS
● A black-box solution, needs a dedicated client
● Needs support from Grid sites
  – BDII for site information (available on OSG)
  – gLite tools for job execution (not available on standard OSG)
[Diagram: BDII and Grid CEs]

Slide 12: gLite WMS scalability (1)
● Normal submission impractical past 4x500
  – Took 2 hours to submit (4x10k would take at least 40h!)

Slide 13: gLite WMS scalability (2)
● Bulk mode much faster: 4x4k submitted in 20 mins
[Plot: 2 Grid sites, ~2.5k Grid slots]
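A back-of-the-envelope check of the two submission modes, using only the numbers quoted on these slides (4x500 in 2 hours for normal submission, 4x4k in 20 minutes for bulk mode):

```python
def rate_per_min(jobs, minutes):
    """Average submission rate in jobs per minute."""
    return jobs / minutes

normal = rate_per_min(4 * 500, 120)   # 4x500 jobs in 2 hours
bulk   = rate_per_min(4 * 4000, 20)   # 4x4k jobs in 20 minutes
print(round(normal, 1), round(bulk, 1), round(bulk / normal))
```

At ~16.7 jobs/min, 4x10k jobs would indeed take 40000 / 16.7 ≈ 2400 minutes, i.e. the 40 hours quoted on the previous slide; bulk mode is roughly 48 times faster.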

Slide 14: gLite WMS scalability (3)
● The system was quite loaded at 4x4k
● Were not able to run 4x10k
  – All four clients reported errors on submission
● Similarly, 15x2k was disappointing
  – 12 out of 15 clients reported errors on submission (and each client tries 3 times)

Slide 15: gLite WMS reliability
● Internally uses Condor-G, so most problems are the same
  – But it does retry a job several times if the Condor-G submission fails
● Still, several jobs failed at every try
● Potentially, a misconfigured BDII can send jobs to the wrong Grid site
  – At least on paper... it did not happen during the test
[Plot: Succeeded vs Failed jobs]

Slide 16: glideinWMS
● Essentially a standard Condor pool, with startds started in a dynamic way
[Diagram: Schedd, Collector, Negotiator and GCB, with glidein startds running on many Grid CEs]
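The dynamic-startd idea can be sketched as a small simulation: pilots (glideins) are submitted to Grid sites, each one registers a startd with the central collector on startup, and a negotiator then matches idle jobs to those startds. Everything here is a toy illustration; all class and function names are hypothetical, not glideinWMS APIs.

```python
class Collector:
    """Central pool collector; holds dynamically registered execute resources."""
    def __init__(self):
        self.startds = []

class Glidein:
    """A pilot job: once started on a Grid worker node, it launches a
    startd that reports back to the central collector."""
    def __init__(self, site):
        self.site = site
    def start(self, collector):
        collector.startds.append({"site": self.site, "busy": False})

def negotiate(jobs, collector):
    """Toy negotiator: match each idle job to a free startd."""
    matched = []
    free = (s for s in collector.startds if not s["busy"])
    for job, startd in zip(jobs, free):
        startd["busy"] = True
        matched.append((job, startd["site"]))
    return matched

pool = Collector()
for site in ["caltech", "fnal", "ucsd"]:  # factory submits pilots via Condor-G
    Glidein(site).start(pool)             # pilot starts, registers a startd
result = negotiate(["job1", "job2"], pool)
print(result)
```

The key point the sketch captures: user jobs never touch the Grid CE directly; they only ever run inside a startd that a pilot has already validated and registered.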

Slide 17: glideinWMS scalability
● Tested up to 6x20k jobs without finding a problem
[Plot: Queued and Running jobs, running over 3 Grid sites]

Slide 18: Condor scalability
● glideinWMS is just a small layer on top of Condor
  – Condor does most of the work
● Tested both Condor v6.8.x and v6.9.x branches
  – Only the latest releases of both branches scale reasonably well in the WAN environment
  – Most tests done with pre-releases, after the Condor team fixed (most) observed bugs
[Diagram: Schedd, Collector, Negotiator, GCB and Grid startd]

Slide 19: Condor Collector scalability
● Collector found scalable to at least 6k VMs
  – Collector was quite loaded, but jobs ran fine
  – Did not test higher, for lack of enough Grid cycles
[Plot note: only half the VMs were used by jobs in this setup]

Slide 20: Condor Schedd scalability
● The main scalability issue found was memory consumption
  – 4 MB per running job!
  – Need to use multiple nodes
● May be a configuration issue (using strong authentication)
  – Regular Condor pools in OSG use less than 1 MB per running job
[Plots: Condor running jobs; Ganglia memory usage]
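To see why 4 MB per running job forces multiple schedd nodes, a quick worked calculation, assuming a hypothetical 20k running jobs and 16 GB of RAM per node (both assumptions; only the per-job footprints come from the slide):

```python
import math

def schedd_nodes_needed(running_jobs, mb_per_job, node_ram_gb=16):
    """Number of schedd nodes implied by a given per-job memory footprint.
    node_ram_gb is an assumed machine size, not a figure from the talk."""
    total_gb = running_jobs * mb_per_job / 1024
    return math.ceil(total_gb / node_ram_gb)

print(schedd_nodes_needed(20_000, 4))  # at the observed 4 MB/job
print(schedd_nodes_needed(20_000, 1))  # at the <1 MB/job of regular OSG pools
```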

Slide 21: Condor GCB scalability
● Tested up to ~1500 glideins (3k VMs) per GCB
  – up to ~3k glideins with 2 GCBs
● GCB seems to scale reasonably well
  – Test jobs were running fine (with the latest version)
  – However, lots of error messages seen in the GCB Condor logs
● One critical problem fixed, another still under investigation
● GCB libraries are sensitive to malformed packets
  – FNAL security scans occasionally crash some daemons
  – Condor team working on fixes, some in v6.9.2

Slide 22: glideinWMS reliability (1)
● User jobs almost never fail
  – Problematic Grid sites/nodes kill glideins, not user jobs
[Plot: Succeeded vs Failed jobs]

Slide 23: glideinWMS reliability (2)
● If a glidein dies after the job has started, Condor will restart the user job in another glidein
  – Just wasted CPU (checkpointing could eliminate it)
[Plot: Wasted vs Useful CPU]

Slide 24: Conclusions (1)
● ReSS is very lightweight
  – One node can serve a large number of jobs and batch slots
● However:
  – Failures are only partially handled
  – No global fair share
● glideinWMS is the most powerful
  – Virtually no job failures
  – Global fair share across Grid sites (not tested here)
● However:
  – Heavyweight: needs approximately two nodes for every 2k batch slots
  – PULL model disliked by some Grid sites
  – Needs gLExec on the worker nodes for proper security (not in OSG 0.6)
● ReSS and glideinWMS both performed very well; gLite WMS does not scale

Slide 25: Conclusions (2)
● For automated tasks involving just a few entities, ReSS may be preferable
  – Lightweight; failures can be recovered by the submitter
● For multi-user environments with real users, glideinWMS is definitely the way to go if you can afford the needed hardware
  – Virtually no user job failures, and a real global fair share is a must for the average user

Slide 26: Official selection
● This work was sponsored by USCMS to select promising WMS candidates
● Our secondary goal was also to help OSG itself select an official WMS

Slide 27: Next steps
● Additional tests of ReSS and glideinWMS
  – Bigger I/O files
  – Non-trivial applications
  – More Grid sites
  – Multiple users
● Integrate ReSS and glideinWMS into CMS MC and analysis tools
  – Performance there will be the real test