Primer for Site Debugging

This talk introduces key concepts and tools used in the following talk on site debugging.
By Jeff Dost (UCSD)
glideinWMS training

Overview
  Monitoring and Reports
  Logs
  Tools

factoryStatusNow

factoryStatusNow
Waiting and Pending are two categories of Idle:
  Idle = Waiting + Pending

factoryStatusNow
Waiting: the glidein never left the factory (it is only in our local queue).

factoryStatusNow
Pending: the glidein made it to the site batch system (the site queue), but it has not been assigned to a worker node yet.

factoryStatusNow
If either Waiting or Pending* is high but Running ~= 0, we should investigate.
* High Pending with 0 Running is not necessarily a problem unless no jobs start for a significant period of time (~24 hrs or more).

factoryStatusNow
Requested Idle: the number of idle glideins the frontend is requesting (pressure value).

factoryStatusNow
A well-behaved entry should have* Req Idle ~= Idle.
* An exception is when we significantly limit max idle in the factory config.
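The idle-side rules of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name, its arguments, and the numeric thresholds (what counts as "high", "~= 0", or "roughly equal") are assumptions made for the sketch, not values taken from the factory monitoring.

```python
def idle_warnings(waiting, pending, running, req_idle,
                  high=50, low=0, tolerance=0.5):
    """Return warnings for one entry, following the rules of thumb above.

    `high`, `low` and `tolerance` are hypothetical thresholds.
    """
    warnings = []
    idle = waiting + pending  # Idle = Waiting + Pending

    if running <= low and waiting >= high:
        warnings.append("high Waiting with Running ~= 0")
    if running <= low and pending >= high:
        # Only a real problem if no jobs start for ~24 hrs or more.
        warnings.append("high Pending with Running ~= 0")
    # A well-behaved entry has Req Idle ~= Idle.
    if req_idle > 0 and abs(req_idle - idle) > tolerance * req_idle:
        warnings.append("Idle differs noticeably from Requested Idle")
    return warnings


if __name__ == "__main__":
    print(idle_warnings(waiting=0, pending=100, running=0, req_idle=100))
```

An entry whose max idle is deliberately capped in the factory config would trip the last check without being broken, so that warning is a prompt to look, not a verdict.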

factoryStatusNow
The Frontend reports back stats about its User Collector.
Registered is the number of glideins actually connected to the Collector.
Registered should be roughly equal to Running.

factoryStatusNow
We define Rundiff as:
  Rundiff = Running - Registered
Rundiff >> 0 should be investigated.

factoryStatusNow
Frontend subcategories of Registered:
  Claimed: glideins running user jobs
  Unmatched: glideins available, but 0 jobs match their requirements
Registered = Claimed + Unmatched
Unmatched >> Claimed should also be investigated.
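The two collector-side checks (Rundiff and Unmatched versus Claimed) can be written down the same way. As before, this is a sketch under stated assumptions: `rundiff_limit` and `unmatched_factor` are hypothetical stand-ins for ">> 0" and ">> Claimed", and the function name is invented for the example.

```python
def registration_warnings(running, claimed, unmatched,
                          rundiff_limit=20, unmatched_factor=2.0):
    """Compute Rundiff and flag the two conditions worth investigating.

    The limits are illustrative assumptions, not factory defaults.
    """
    registered = claimed + unmatched   # Registered = Claimed + Unmatched
    rundiff = running - registered     # Rundiff = Running - Registered

    warnings = []
    if rundiff > rundiff_limit:
        warnings.append(f"Rundiff is {rundiff}: glideins run but do not register")
    if unmatched > unmatched_factor * claimed:
        warnings.append("Unmatched >> Claimed: glideins register but match no jobs")
    return rundiff, warnings


if __name__ == "__main__":
    print(registration_warnings(running=100, claimed=40, unmatched=10))
```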

factoryStatus
Same stats as factoryStatusNow, but plotted over time.

analyze_entries
Report excerpt:

frontend_UCSDCMS_cmspilot:
                                   strt fval 0job |  val idle  wst badp | waste   time | total
Total/Average                        8%   0%  16% |   0%   1%   2%  37% |  6930 280120 | 23370
----------                          ---  ---  --- |  ---  ---  ---  --- |   ---    --- |   ---
CMS_T3_US_PuertoRico_grid0           4%   0%  45% |  16%  22%  25%  52% |   210    811 |   165
CMS_T2_US_Purdue_hadoop             64%  23%  72% |  23%  10%  34%  70% |   199    575 |   672
CMS_T3_UK_SGrid_Oxford_ce06_medium   1%   1%  86% |   2%  57%  60%  69% |    30     50 |    73

Legend:
  strt - % of jobs where condor failed to start
  fval - % of glideins that failed to validate (hit 1000s limit)
  0job - % 0 jobs/glidein
  ----------
  val - % of time used for validation
  idle - % of time spent idle
  wst - % of time wasted (Lasted - JobsLasted)
  badp - % of badput (Lasted - JobsGoodput)
  waste - wallclock time wasted (hours) (Lasted - JobsLasted)
  time - total wallclock time (hours) (Lasted)
  total - total number of glideins
  -------------------------------------
  Lasted - total wallclock time
  JobsLasted - wallclock time used to run jobs
  JobsGoodput - wallclock time used by jobs terminating with exit code 0
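The legend defines wst and badp in terms of Lasted, JobsLasted and JobsGoodput. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the slides do not say how analyze_entries rounds, so the sketch returns unrounded percentages. The example input uses the CMS_T3_US_PuertoRico_grid0 row (time 811 hours, waste 210 hours, hence JobsLasted = 601); the JobsGoodput value is made up for illustration because the excerpt does not list it.

```python
def waste_percentages(lasted, jobs_lasted, jobs_goodput):
    """Return (wst, badp) as percentages of total wallclock time.

    wst  = (Lasted - JobsLasted)  / Lasted
    badp = (Lasted - JobsGoodput) / Lasted
    """
    if lasted <= 0:
        raise ValueError("Lasted must be positive")
    wst = 100.0 * (lasted - jobs_lasted) / lasted
    badp = 100.0 * (lasted - jobs_goodput) / lasted
    return wst, badp


if __name__ == "__main__":
    # JobsGoodput=400 is an invented value; only Lasted and waste come
    # from the report excerpt.
    wst, badp = waste_percentages(lasted=811, jobs_lasted=811 - 210,
                                  jobs_goodput=400)
    print(f"wst={wst:.1f}% badp={badp:.1f}%")
```

Since goodput can never exceed the time spent running jobs, badp is always at least as large as wst, which matches every row of the excerpt.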

analyze_entries
All of the following are counted as Waste:
  Condor failing startup
  Failing validation
  0job
  Idle
NOTE: in this report, idle refers to time a glidein spent running but not running user jobs, e.g. Unmatched.
We want to investigate whenever waste is high for an entry.
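To pick out entries with high waste from a saved report, something like the following could be used. It assumes each data row has the shape shown in the report excerpt (entry name, then ten numeric columns separated by spaces and `|`); the column list, the function names, and the 30% cut-off are assumptions for the sketch, not part of analyze_entries.

```python
COLUMNS = ["strt", "fval", "0job", "val", "idle", "wst", "badp",
           "waste", "time", "total"]


def parse_row(line):
    """Split one report row into (entry_name, {column: int})."""
    name, *values = line.replace("|", " ").split()
    if len(values) != len(COLUMNS):
        raise ValueError(f"unexpected row layout: {line!r}")
    numbers = [int(v.rstrip("%")) for v in values]
    return name, dict(zip(COLUMNS, numbers))


def high_waste(lines, wst_limit=30):
    """Return entry names whose wst column is at or above wst_limit."""
    rows = [parse_row(line) for line in lines]
    return [name for name, stats in rows if stats["wst"] >= wst_limit]


ROWS = [
    "CMS_T3_US_PuertoRico_grid0 4% 0% 45% | 16% 22% 25% 52% | 210 811 | 165",
    "CMS_T2_US_Purdue_hadoop 64% 23% 72% | 23% 10% 34% 70% | 199 575 | 672",
    "CMS_T3_UK_SGrid_Oxford_ce06_medium 1% 1% 86% | 2% 57% 60% 69% | 30 50 | 73",
]

if __name__ == "__main__":
    print(high_waste(ROWS))
```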

factoryCompletedStats
Useful to see validation over time and short-running glideins:
CMS_T2_US_Purdue_hadoop (has problems!)

Logs
Glideins have three logs associated with them: job.*.out, job.*.err, and condor_activity.
*.err logs contain compressed condor daemon logs, as well as an XML report containing statistics.
Tools provided to extract the compressed logs:
  cat_MasterLog.py
  cat_StartdLog.py
  cat_StartdHistoryLog.py
  cat_StarterLog.py
  cat_XMLResult.py

Logs
job.*.out and job.*.err logs contain lots of diagnostic info, and also include any stdout or stderr written by validation scripts.
If a validation script provides an XML report, it is often enough to read the summary to discover validation errors.

Logs
$ cat_XMLResult.py job.2106335.3.out
<?xml version="1.0"?>
<OSGTestResult logname="job.2106335.3.out" id="glidein_startup.sh" version="4.3.1">
  <operatingenvironment>
    <env name="client_name">UCSD-o1_0.MIT</env>
    <env name="client_group">MIT</env>
    <env name="user">cuser13</env>
    <env name="arch">x86_64</env>
    <env name="os">CentOS release 6.4 (Final)</env>
    <env name="hostname">cabinet-8-8-11.t2.ucsd.edu</env>
    <env name="cwd">/data1/condor_local/execute/dir_16146</env>
  </operatingenvironment>
  <test>
    <tStart>2014-06-19T23:48:08-07:00</tStart>
    <tEnd>2014-06-19T23:49:22-07:00</tEnd>
  </test>
  <result>
    <status>ERROR</status>
    <metric name="TestID" ts="2014-06-19T23:49:21-07:00" uri="local">main/validate_node.sh</metric>
    <metric name="failure" ts="2014-06-19T23:49:21-07:00" uri="local">WN_Resource</metric>
    <metric name="CwdFreeKb" ts="2014-06-19T23:49:21-07:00" uri="local">751952</metric>
    <metric name="CwdMinKb" ts="2014-06-19T23:49:21-07:00" uri="local">1048576</metric>
  </result>
  <detail>
    Validation failed in main/validate_node.sh.
    Space on '.' not enough. At least 1024 MBs required, found 751952 KBs
  </detail>
</OSGTestResult>
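Once the XML report has been extracted, the interesting parts (status, failing test, metrics, detail text) can be pulled out with Python's standard xml.etree.ElementTree module. The sketch below assumes the report has the element layout shown in the example above; `summarize` and `REPORT` are names invented for the illustration, and the embedded report is a shortened copy of that example.

```python
import xml.etree.ElementTree as ET

REPORT = """<?xml version="1.0"?>
<OSGTestResult logname="job.2106335.3.out" id="glidein_startup.sh" version="4.3.1">
  <result>
    <status>ERROR</status>
    <metric name="TestID" uri="local">main/validate_node.sh</metric>
    <metric name="failure" uri="local">WN_Resource</metric>
    <metric name="CwdFreeKb" uri="local">751952</metric>
    <metric name="CwdMinKb" uri="local">1048576</metric>
  </result>
  <detail>
    Validation failed in main/validate_node.sh.
    Space on '.' not enough. At least 1024 MBs required, found 751952 KBs
  </detail>
</OSGTestResult>
"""


def summarize(xml_text):
    """Return status, metrics and detail text from one XML report."""
    root = ET.fromstring(xml_text)
    metrics = {m.get("name"): (m.text or "").strip()
               for m in root.findall("result/metric")}
    detail = " ".join(root.findtext("detail", default="").split())
    return {
        "status": root.findtext("result/status"),
        "metrics": metrics,
        "detail": detail,
    }


if __name__ == "__main__":
    summary = summarize(REPORT)
    print(summary["status"], summary["metrics"].get("TestID"))
    print(summary["detail"])
```

Here the metrics alone explain the failure: CwdFreeKb (751952) is below CwdMinKb (1048576), so the worker node did not have enough free space in the working directory.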

Logs
The condor_activity log contains state transitions for each glidein during its lifetime:

000 (2118175.001.000) 06/24 13:16:44 Job submitted from host: <169.228.38.36:46438>
...
017 (2118175.001.000) 06/24 13:16:57 Job submitted to Globus
    RM-Contact: osg-gw-6.t2.ucsd.edu:2119/jobmanager-condor
    JM-Contact: osg-gw-6.t2.ucsd.edu:2119/jobmanager-condor
    Can-Restart-JM: 1
027 (2118175.001.000) 06/24 13:16:57 Job submitted to grid resource
    GridResource: gt5 osg-gw-6.t2.ucsd.edu:2119/jobmanager-condor
    GridJobId: gt5 osg-gw-6.t2.ucsd.edu:2119/jobmanager-condor https://osg-gw-6.t2.ucsd.edu:46762/16362101895480626811/7675377755575312265/
001 (2118175.001.000) 06/24 13:26:57 Job executing on host: gt5 osg-gw-6.t2.ucsd.edu:2119/jobmanager-condor
005 (2118175.001.000) 06/25 01:32:12 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
    0 - Total Bytes Sent By Job
    0 - Total Bytes Received By Job
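The event header lines in this log (three-digit event code, job id, timestamp, message) are regular enough to parse with a short script, which makes it easy to see how long a glidein sat in the queue and how long it ran. The sketch below is an illustration only: the regular expression is written against the lines shown above, the function names are invented, and because the log timestamps carry no year, the `year` argument is an assumed value.

```python
import re
from datetime import datetime

EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_events(log_text, year=2014):
    """Return (code, job_id, timestamp, message) for each event header.

    `year` is an assumption; the log itself does not record it.
    """
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            code, job_id, stamp, message = match.groups()
            when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
            events.append((code, job_id, when, message))
    return events


def wait_and_run_seconds(events):
    """Seconds from submit (000) to execute (001), and execute to end (005)."""
    times = {code: when for code, _, when, _ in events}
    wait = (times["001"] - times["000"]).total_seconds()
    run = (times["005"] - times["001"]).total_seconds()
    return wait, run


SAMPLE = """\
000 (2118175.001.000) 06/24 13:16:44 Job submitted from host: <169.228.38.36:46438>
017 (2118175.001.000) 06/24 13:16:57 Job submitted to Globus
    Can-Restart-JM: 1
001 (2118175.001.000) 06/24 13:26:57 Job executing on host: gt5 osg-gw-6.t2.ucsd.edu:2119/jobmanager-condor
005 (2118175.001.000) 06/25 01:32:12 Job terminated.
    (1) Normal termination (return value 0)
"""

if __name__ == "__main__":
    wait, run = wait_and_run_seconds(parse_events(SAMPLE))
    print(f"queued for {wait:.0f} s, ran for {run:.0f} s")
```

For the glidein shown above this gives roughly ten minutes in the queue and about twelve hours of running.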

Logs
condor_activity for a site with problems:

000 (2212350.000.000) 06/30 16:27:16 Job submitted from host: <169.228.38.36:51626>
...
017 (2212350.000.000) 06/30 16:27:29 Job submitted to Globus
    RM-Contact: mwt2-gk.campuscluster.illinois.edu:2119/jobmanager-condor
    JM-Contact: mwt2-gk.campuscluster.illinois.edu:2119/jobmanager-condor
    Can-Restart-JM: 1
027 (2212350.000.000) 06/30 16:27:29 Job submitted to grid resource
    GridResource: gt5 mwt2-gk.campuscluster.illinois.edu:2119/jobmanager-condor
    GridJobId: gt5 mwt2-gk.campuscluster.illinois.edu:2119/jobmanager-condor https://mwt2-gk.campuscluster.illinois.edu:22191/16434107832055453976/10682067833090772881/
029 (2212350.000.000) 07/03 14:15:32 The job's remote status is unknown
…
030 (2212350.000.000) 07/03 14:26:54 The job's remote status is known again
012 (2212350.000.000) 07/07 16:29:25 Job was held.
    Globus error 31: the job manager failed to cancel the job as requested
    Code 2 Subcode 31
026 (2212350.000.000) 07/09 06:43:03 Detected Down Grid Resource
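The events that mark this log as problematic (029 remote status unknown, 012 job held, 026 grid resource down) can be picked out mechanically. The sketch below is one possible way to do so; the code-to-meaning table covers only the event codes that appear in the excerpt above, and the names `PROBLEM_CODES` and `problem_events` are invented for the example.

```python
import re

# Only the codes seen in the excerpt; other event codes exist.
PROBLEM_CODES = {
    "012": "job was held",
    "026": "grid resource detected down",
    "029": "remote status unknown",
}

HEADER_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def problem_events(log_text):
    """Return (code, timestamp, message) for events listed in PROBLEM_CODES."""
    found = []
    for line in log_text.splitlines():
        match = HEADER_RE.match(line.strip())
        if match and match.group(1) in PROBLEM_CODES:
            found.append((match.group(1), match.group(3), match.group(4)))
    return found


SAMPLE = """\
000 (2212350.000.000) 06/30 16:27:16 Job submitted from host: <169.228.38.36:51626>
029 (2212350.000.000) 07/03 14:15:32 The job's remote status is unknown
030 (2212350.000.000) 07/03 14:26:54 The job's remote status is known again
012 (2212350.000.000) 07/07 16:29:25 Job was held.
    Globus error 31: the job manager failed to cancel the job as requested
026 (2212350.000.000) 07/09 06:43:03 Detected Down Grid Resource
"""

if __name__ == "__main__":
    for code, stamp, message in problem_events(SAMPLE):
        print(code, stamp, PROBLEM_CODES[code], "|", message)
```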

Tools
Summary of tools cited in the next talk:
  entry_q: convenience wrapper for condor_q to filter by entry name
  entry_ls: list all .err or .out logs for a particular entry, FE, date combination
  get_wns: extracts worker node hostnames from glidein XML reports
  proxy_info: obtain information about a given glidein pilot proxy

OSG Status Website
All OSG sites report to RSV. It is useful to check here when a CE is unreachable, to see if the site is down for maintenance:
http://myosg.grid.iu.edu/about