Operating a glideinWMS factory, by Igor Sfiligoi (UCSD). Condor CERN, Feb 14th. Factory operations.


Overview
● Refresher
● Setup and configuration
● Monitoring and troubleshooting
● Other activities

Refresher – Glidein factory
● The glidein factory knows about the sites and does the submission
● Driven by a frontend
[Diagram: factory node (Condor, factory), frontend node (frontend), Globus and CREAM sites, submit node, central manager, and execution/worker nodes running glideins. Arrow labels: monitor Condor, request glideins, submit glideins, match. The glidein hosts a startd that runs the job.]

Refresher – Factory arch
● Condor(-G) handles most of the work
● Each site has its own process
    – Called entry
[Diagram of the factory node: collector, factory, entry processes, schedds, Web server. Labels: spawn, advertise entry, retrieve orders, submit, monitor, glidein.]

Refresher – Cardinality
● A single factory may serve multiple frontends
  ● The OSG factory at UCSD serves O(10)
● Different frontends could be treated differently
  ● Nr. of sites, arguments, …
  ● Security
[Diagram: one glidein factory serving two VO frontends, each with its own collector, negotiator, schedd, user jobs and startds.]

Setup and configuration

Setup and configuration
● glideinWMS comes with an installer
  ● It will help you install all the needed software and do most of the configuration
● The factory config you get out of the installer is typically just a rough template
  ● You will have to finish it by hand
  ● It is in XML format

The Condor part
● The Condor components use standard config
  ● Just using multiple schedds for scalability
● The collector uses GSI security for WAN
  ● Will need an x509 cert and proper security config to talk to the frontends
  ● Local auth. is still FS based
● The factory will use condor_root_switchboard
  ● For UID switching (similar to sudo)
  ● Must be owned by root and setuid
  ● Config will need to be maintained by hand (more details later)

Privilege separation
● Frontends will be delegating proxies to the factory
  ● Each frontend's files are owned by its own UID
● The factory uses condor_root_switchboard to
  ● Create dirs owned by the right UID
  ● Write files
  ● Submit jobs
UID switching config:
    valid-caller-uids = gfactory
    valid-caller-gids = gfactory
    valid-target-uids = fe1 : fe2 : fecms
    valid-target-gids = fe1 : fe2 : fecms
    valid-dirs = /var/gfactory/clientlogs
    valid-dirs = /var/gfactory/clientproxies
    procd-executable = /etc/condor/privsep_config
● valid-caller-*: auth. source user
● valid-target-*: auth. target users
● valid-dirs: base dirs for dir creation
● Must be updated every time a new frontend is added
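Since the UID-switching config has to be updated by hand for every new frontend, a small check can catch a forgotten entry. The following is a minimal sketch, assuming the file consists of "key = value" lines with ":"-separated lists and repeatable keys, as the slide suggests. The function names and the "fenew" frontend are hypothetical, and this is not part of glideinWMS or Condor.

```python
from collections import defaultdict

# Settings copied from the slide; the parsing rules below are inferred from it.
SLIDE_CONFIG = """\
valid-caller-uids = gfactory
valid-caller-gids = gfactory
valid-target-uids = fe1 : fe2 : fecms
valid-target-gids = fe1 : fe2 : fecms
valid-dirs = /var/gfactory/clientlogs
valid-dirs = /var/gfactory/clientproxies
"""


def parse_privsep_config(text):
    """Parse 'key = value' lines; split values on ':'; repeated keys accumulate."""
    settings = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition("=")
        items = [item.strip() for item in value.split(":") if item.strip()]
        settings[key.strip()].extend(items)
    return dict(settings)


def missing_target_uids(settings, frontend_uids):
    """Return the frontend UIDs that are not yet listed as valid targets."""
    allowed = set(settings.get("valid-target-uids", []))
    return [uid for uid in frontend_uids if uid not in allowed]


settings = parse_privsep_config(SLIDE_CONFIG)
# "fenew" is a made-up frontend that has not been added to the config yet.
print(missing_target_uids(settings, ["fe1", "fecms", "fenew"]))
```

Running such a check before a reconfig would show which frontends still need to be added to the target lists.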

Frontend authorization
● The factory must whitelist all the frontends it supports, both in
  ● The collector (where X509 authentication happens)
  ● The factory
[Config examples not captured in the transcript. Callouts: frontend security name, UID.]

Adding the sites
● The factory is supposed to know which Grid sites are out there and can be used
● Unfortunately, we don't have good tools for site discovery (work in progress to improve this)
  ● The installer can query BDII and RESS, but it is very primitive
  ● You are better off using other tools (like ldapsearch)
● The admin needs to manually add sites to the XML file

Site attributes
● Each site will have some attributes associated with it
  ● Contact info: gatekeeper, jobmanager, RSL
  ● Site config: work dir, platform, firewalls, glexec
  ● Site limits: mostly wallclock time
  ● Site properties: supported VOs, nearby SEs
  ● User requested attributes: can be anything
● Next slides explain the most often used ones

Site contact info
● How to submit to the site
● If you can reference an InfoSys, do it (helps with monitoring)
● RSL is often optional
[XML entry example not captured in the transcript.]

Site config
● Tells the glidein how to behave
● Note that one can host many condor binaries
[XML example not captured in the transcript. Callout: special keyword.]

Site limits
● Grid sites typically have wallclock limits
● Pressure and sanity limits

Matchmaking attributes
● Proper site attributes
● Arbitrary other attributes

Final note on sites
● VO Frontend admins trust you to forward their proxies only to trusted parties
  ● You should always think twice before adding a new site
● The same site may support multiple VOs
  ● If possible, use the same entry (economy of scale)
  ● May not always be possible, though (RSL, attributes, etc.)

A note about reconfigs
● The factory has an init.d-like maintenance script
  ● ./factory_startup start|stop|reconfig
● Editing the config file is unconventional (it takes some time to get used to)
  ● Cannot edit the master config file (glideinWMS.xml)
  ● Must edit a copy of it, typically in ../glidein_bla.cfg/glideinWMS.xml
  ● Then tell reconfig where the copy is
    ./factory_startup reconfig ../glidein_bla.cfg/glideinWMS.xml
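Because reconfig needs the path of the edited copy while start and stop take none, a thin wrapper can guard against the easy mistake of calling reconfig without it. This is only a sketch built from the three actions named on the slide. The helper name is hypothetical, and the script is assumed to sit in the current directory.

```python
import shlex

# Only the actions shown on the slide are accepted here.
ALLOWED_ACTIONS = ("start", "stop", "reconfig")


def factory_startup_argv(action, config_copy=None, script="./factory_startup"):
    """Build the argument list for the factory maintenance script."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    if action == "reconfig":
        if not config_copy:
            raise ValueError("reconfig needs the path of the edited config copy")
        return [script, "reconfig", config_copy]
    return [script, action]


# Print the command instead of running it, so the sketch has no side effects.
argv = factory_startup_argv("reconfig", "../glidein_bla.cfg/glideinWMS.xml")
print(shlex.join(argv))
```

The resulting list could be handed to subprocess.run(argv, check=True) on a real factory node.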

Factory operations: Monitoring and troubleshooting

Basic system monitoring
● Basic system monitoring is just regular Condor(-G) monitoring
  ● condor_q -global [-globus]
  ● logs
● The factory also has extensive Web monitoring
  ● Historical graphs
  ● Current snapshot (both raw and nice-to-look-at)

Historical graphs
● Can be zoomed in
● Total or single entry

Table view
● Easier to get a detailed view
● Sortable

Troubleshooting views
● More clutter, but easier to focus on problems
● Sortable

Drill down to details
● Single entry: more space to show details

Basic system troubleshooting
● Typical error
  ● Glideins don't get submitted (are held)
● Just look for the hold reason (standard Condor-G tasks)
  ● Authorization problems
  ● Expired proxy (from frontend)
  ● Site “does not exist”
    – Could be in maintenance
    – Or really decommissioned

Site status
● The factory publishes an XML file with the current status as advertised by the Info Systems
[XML listing not captured in the transcript. Callouts: no infosys configured; likely decommissioned; expect Condor-G problems.]

Removing sites
● One should never remove an entry from the XML file
  ● Impossible to add back an entry with the same name (memory effect)
● For long-term removal, just disable it
  ● Requires a reconfig (heavy)
● For short-term, put it in downtime
  ● ./factory_startup down -entry CMS_T2_IT_Rome_ce01
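Disabling an entry instead of deleting it amounts to flipping a flag on the entry in the edited copy of the XML config. The slides do not show the XML, so the sketch below rests on assumptions: the element name "entry", the attributes "name" and "enabled", and the surrounding layout are all illustrative, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified layout; the real glideinWMS.xml has many more attributes.
SAMPLE_CONFIG = """<glidein>
  <entries>
    <entry name="CMS_T2_IT_Rome_ce01" enabled="True"/>
    <entry name="CMS_T2_US_UCSD_gw2" enabled="True"/>
  </entries>
</glidein>"""


def disable_entry(xml_text, entry_name):
    """Mark one entry as disabled while keeping it in the file."""
    root = ET.fromstring(xml_text)
    matches = [e for e in root.iter("entry") if e.get("name") == entry_name]
    if not matches:
        raise KeyError(f"no entry named {entry_name!r}")
    for entry in matches:
        entry.set("enabled", "False")  # the entry stays, so its name is never reused
    return ET.tostring(root, encoding="unicode")


updated_xml = disable_entry(SAMPLE_CONFIG, "CMS_T2_IT_Rome_ce01")
print(updated_xml)
```

The change would be made in the config copy and take effect only after a reconfig, which is why the short-term downtime command is the lighter option.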

Glidein monitoring
● Factory admins are supposed to monitor the health of the glideins they are submitting
  ● They will have to solve any problems anyhow
● Most info is in the Web interfaces shown before
● The factory also has info the frontends don't
  ● Glidein exit logs
  ● Contain troves of useful debug info

Reminder – Web interfaces
● Includes glidein health

Completion stats
● Historical graphs color coded to give an idea of the health of the system

Text reports
● Quick health status overview
● Ordered by waste

Past 24.0 hours:
Total Glideins:
Total Jobs: (Average jobs/glidein: 1.52)
Total time: Ms (71083 hours slots)
Total time used: Ms (66182 hours slots)
Total time validating: Ks (206 hours slots)
Total time idle: 17.1 Ms (4759 hours slots)
Total time wasted: 17.6 Ms (4900 hours slots)
Time used/time wasted: 13.5
Time efficiency: 0.93

Per Entry (all frontends) stats for the past 24 hours.
                                    strt fval 0job |  val idle  wst badp | waste time total
CMS_T2_US_MIT_ce01                    0%   1%   3% |   2%   3%   6%  30% | | 952
CMS_T2_ES_IFCA_ce02                   0%   0%  10% |   0%   8%   7%  10% | | 641
CMS_T2_ES_IFCA_ce01                   0%   0%  11% |   0%   8%   7%  10% | | 614
CMS_T2_US_UCSD_gw2                    0%   0%   8% |   0%   7%   7%   7% | | 906
CMS_T2_US_UCSD_gw4                    0%   0%  11% |   0%   8%   7%   8% | | 689
CMS_T2_US_Purdue_osg                  2%   0%  71% |   0%  22%  22%  38% | | 828
CMS_T2_US_Wisconsin_cms01             0%   0%   3% |   1%   7%   8%  14% | | 531
CMS_T2_US_Wisconsin_cms02             0%   0%   5% |   1%   7%   8%  16% | | 475
CMS_T2_FR_CCIN2P3_cclcgceli09_long    0%   0%   4% |   0%  10%  13%  17% | | 508
CMS_T2_AT_Vienna_lcgce                0%   0%   4% |   0%   8%   8%   8% | | 451
CMS_T3_PT_Ingrid_ce02                 0%   0%   5% |   0%  10%  11%  11% | | 374
CMS_T2_FR_CCIN2P3_cclcgceli06_long    0%   0%  16% |   0%  11%  13%  16% | | 291
CMS_T2_IT_Bari_ce2                    0%   0%   0% |   0%   3%   4%  10% | | 295
CMS_T2_CN_Beijing_lcg002              0%   0%   9% |   0%  24%  27%  29% | | 315
CMS_T2_US_Florida_iogw1               0%   0%   5% |   0%   7%   6%   7% | |

LEGEND:
Ks - kiloseconds (*1,000 seconds)
Ms - megaseconds (*1,000,000 seconds)
strt - % of jobs where condor failed to start
fval - % of glideins that failed to validate (hit 1000s limit)
0job - % 0 jobs/glidein
val - % of time used for validation
idle - % of time spend idle
wst - % of time wasted (Lasted - JobsLasted)
badp - % of badput (Lasted - JobsGoodput)
waste - wallclock time wasted (hours) (Lasted - JobsLasted)
time - total wallclock time (hours) (Lasted)
total - total number of glideins
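The two summary ratios in the report can be reproduced from the slot-hour figures it prints. The formulas below are inferred from the numbers on the slide (used divided by wasted, and used divided by total), not taken from the glideinWMS source, so treat this as a consistency check rather than the tool's definition.

```python
# Slot-hour figures copied from the report on the slide.
HOURS_TOTAL = 71083
HOURS_USED = 66182
HOURS_IDLE = 4759
HOURS_WASTED = 4900


def hours_to_megaseconds(hours):
    """Convert hours to megaseconds (1 Ms = 1,000,000 s)."""
    return hours * 3600 / 1_000_000


# Assumed definitions, chosen because they match the printed values.
used_over_wasted = HOURS_USED / HOURS_WASTED
time_efficiency = HOURS_USED / HOURS_TOTAL

print(f"Time used/time wasted: {used_over_wasted:.1f}")
print(f"Time efficiency: {time_efficiency:.2f}")
print(f"Total time idle: {hours_to_megaseconds(HOURS_IDLE):.1f} Ms")
print(f"Total time wasted: {hours_to_megaseconds(HOURS_WASTED):.1f} Ms")
```

With these inputs the script prints 13.5, 0.93, 17.1 Ms and 17.6 Ms, matching the report. Used plus wasted hours (66182 + 4900) also add up to the total within one hour of rounding.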

Troubleshooting glideins
● Logs are your friends!
● Each glidein returns a stdout and a stderr file
  ● /var/gfactory/clientlogs/user_fe1/entry_XX/job.X.Y.*
● On top of the glidein_startup logs, stderr also contains:
  ● Complete config used by the startd
  ● Startd (and starter) logs
    – Compressed to save space (O(100k) glideins per day!)
    – ./glideinWMS/factory/tools/cat_StartdLog.py job*.err
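Collecting the stderr files of one entry is usually the first step before feeding them to cat_StartdLog.py. The sketch below assumes the directory layout follows the example path on the slide (user_<frontend>/entry_<name>/job.X.Y.err). The helper name is hypothetical, and the demo builds a throwaway directory tree instead of touching /var/gfactory.

```python
import tempfile
from pathlib import Path


def glidein_stderr_files(base_dir, frontend_user, entry_name):
    """List the glidein stderr files of one entry, sorted by path."""
    entry_dir = Path(base_dir) / f"user_{frontend_user}" / f"entry_{entry_name}"
    return sorted(entry_dir.glob("job.*.err"))


# Demo on a temporary tree that mimics the layout from the slide.
with tempfile.TemporaryDirectory() as base:
    demo_dir = Path(base) / "user_fe1" / "entry_XX"
    demo_dir.mkdir(parents=True)
    for file_name in ("job.1.0.out", "job.1.0.err", "job.2.0.err"):
        (demo_dir / file_name).touch()
    found_names = [p.name for p in glidein_stderr_files(base, "fe1", "XX")]

print(found_names)
```

On a real factory the base directory would be /var/gfactory/clientlogs, and the returned paths could be passed one by one to the log extraction tool.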

Typical errors
● Download problems (failed to download files)
● Validation problems
  ● glidein_startup fails before launching the startd
  ● Test that failed (usually) prints out what went wrong
  ● Many reasons (disk full, missing SW, VO problems)
● Condor launched but fails immediately
  ● Typically due to missing libraries
  ● Or wrong architecture

Typical errors 2
● Condor fails to register with the VO collector
  ● Credential or firewall issues
  ● Most of the time the error is clear in the startd logs (but not always!)
● Startd never gets any user job
  ● Could be just lack of (proper) jobs
  ● But firewalls can play dirty tricks! (Often without any error in the logs!)
● Startd gets jobs but fails to start them
  ● glexec failures
  ● Startd misconfiguration

Doing troubleshooting
● No good automated tools
  ● Not past what gets reported in the text report
● Look at the report to discover troublesome entries
  ● Then extract from the completion logs a few glideins that misbehave
  ● Analyze (by hand) one-by-one
  ● It can be quite time consuming
● Work in progress to automatically tag the most common failure modes (will hopefully have something soon)
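The first step, picking troublesome entries out of the text report, can be approximated with a simple threshold filter. The sketch below uses the 0job and badp percentages of four entries from the report shown earlier in the talk. The thresholds (50% and 25%) and all names are arbitrary assumptions for illustration, not values recommended by the author.

```python
from typing import NamedTuple


class EntryStats(NamedTuple):
    name: str
    zero_job_pct: int  # "0job" column: % of glideins that ran no jobs
    badput_pct: int    # "badp" column: % of badput


# Values copied from four rows of the per-entry report.
REPORT_ROWS = [
    EntryStats("CMS_T2_US_MIT_ce01", 3, 30),
    EntryStats("CMS_T2_US_Purdue_osg", 71, 38),
    EntryStats("CMS_T2_IT_Bari_ce2", 0, 10),
    EntryStats("CMS_T2_CN_Beijing_lcg002", 9, 29),
]


def troublesome_entries(rows, max_zero_job_pct=50, max_badput_pct=25):
    """Return entries above either threshold, worst badput first."""
    flagged = [
        row for row in rows
        if row.zero_job_pct > max_zero_job_pct or row.badput_pct > max_badput_pct
    ]
    return sorted(flagged, key=lambda row: row.badput_pct, reverse=True)


flagged_names = [row.name for row in troublesome_entries(REPORT_ROWS)]
print(flagged_names)
```

A filter like this only narrows down where to look. The glidein logs of the flagged entries still have to be read by hand.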

Troubleshooting new frontends
● When adding a new frontend, getting the security right is usually the biggest hurdle
  ● Wrong DN
  ● Wrong mapping
  ● Wrong security name
● Need to look both in Condor and factory entry logs
  ● Usually pretty obvious (just remember to check!)

Factory operations: Other tasks

Grid interface
● The factory tries to hide the Grid from the VOs
  ● So they will consult you for anything Grid related
  ● Budget for this support
● All glideins are submitted via the factory
  ● The Grid sites will contact you for any problems (directly or indirectly)
  ● You may need to (help) solve problems that are mostly (or even 100%) VO related (because you have the Grid knowledge)

Factory operations: And the summary

Summary
● Factory operations is a time-consuming job
  ● You are trying to shield the users from the Grid
  ● So you do most of the troubleshooting for them
  ● Most errors are beyond your (direct) control
● Plenty of monitoring tools are available
  ● But detailed troubleshooting is still manual (tools to help you out should be coming soon)
● Initial setup is relatively easy
  ● But keeping up with the changes in the Grid will keep you busy

Pointers
● The official project Web page is
● glideinWMS development team is reachable at
● OSG glidein factory at UCSD