Operating a glideinWMS factory, by Igor Sfiligoi (UCSD). Condor Training @ CERN, Feb 14th.


1 Condor Training @ CERN: Operating a glideinWMS factory, by Igor Sfiligoi (UCSD)
[Slide footer, repeated on every slide: CERN Feb 14th, Factory operations, slide number]

2 Overview
● Refresher
● Setup and configuration
● Monitoring and Troubleshooting
● Other activities

3 Refresher – Glidein factory
● The glidein factory knows about the sites and does the submission
● Driven by a frontend
[Diagram. Nodes: factory node (Condor, Factory), frontend node (Frontend), Globus, CREAM, submit node, central manager, worker/execution nodes running glideins. Arrows: Monitor Condor, Request glideins, Submit glideins, Match. Also shown: Startd, Job.]

4 Refresher – Factory arch
● Condor(-G) handles most of the work
● Each site has its own process
  – Called entry
[Diagram. Factory node with Collector, Factory, Entry processes, Schedds and Web Server. Labels: Spawn, Advertise entry, Retrieve orders, Submit, Monitor, glidein.]

5 Refresher - Cardinality
● A single factory may serve multiple frontends
● The OSG factory @ UCSD serves O(10)
● Different frontends could be treated differently
● Nr. sites, arguments, …
● Security
[Diagram. Labels: Glidein Factory; two VO Frontends, each shown with Collector, Negotiator, Schedd, Startd and User job.]

6 Setup and configuration

7 Setup and configuration
● glideinWMS comes with an installer
● Will help you install all the needed software and do most config
● The factory config you get out of the installer is typically just a rough template
● You will have to finish it by hand
● It is in XML format

8 The Condor part
● The Condor components use standard config
● Just using multiple schedds for scalability
● The collector uses GSI security for WAN
● Will need x509 cert and proper security config. to talk to the frontends
● Local auth. still FS based
● Factory will use condor_root_switchboard
● For UID switching
● Must be owned by root and setuid
● Config will need to be maintained by hand
[Callouts on condor_root_switchboard: Similar to sudo. (More details later)]

9 Privilege separation
● Frontends will be delegating proxies to the factory
● Each frontend's files owned by its own UID
  http://www.cs.wisc.edu/condor/CondorWeek2010/condor-presentations/sfiligoi-condorg.odp
● The factory uses condor_root_switchboard (UID switching) to
● Create dirs owned by right UID
● Write files
● Submit jobs
Configuration shown on the slide:
valid-caller-uids = gfactory
valid-caller-gids = gfactory
valid-target-uids = fe1 : fe2 : fecms
valid-target-gids = fe1 : fe2 : fecms
valid-dirs = /var/gfactory/clientlogs
valid-dirs = /var/gfactory/clientproxies
procd-executable = /etc/condor/privsep_config
[Callouts: valid-caller-* is the auth. source user; valid-target-* are the auth. target users; valid-dirs are the base dirs for dir creation. Must update every time a new frontend is added.]
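
Since the target lists must be updated whenever a frontend is added, a quick check before a reconfig can catch a forgotten user. The following is a minimal sketch, not part of glideinWMS or Condor: the helper names `parse_privsep` and `missing_targets` are hypothetical, the parsing rules ("key = v1 : v2", repeated keys accumulate, "#" starts a comment) are assumptions inferred from the lines on the slide, and it assumes each frontend's group has the same name as its user.

```python
from collections import defaultdict

# Settings copied from the slide; how the real tool parses them is an assumption.
PRIVSEP_CONFIG = """\
valid-caller-uids = gfactory
valid-caller-gids = gfactory
valid-target-uids = fe1 : fe2 : fecms
valid-target-gids = fe1 : fe2 : fecms
valid-dirs = /var/gfactory/clientlogs
valid-dirs = /var/gfactory/clientproxies
"""


def parse_privsep(text):
    """Collect 'key = v1 : v2' lines into {key: [values]}; repeated keys accumulate."""
    settings = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()].extend(v.strip() for v in value.split(":") if v.strip())
    return dict(settings)


def missing_targets(settings, frontend_users):
    """Return frontend users absent from the target UID or GID lists."""
    uids = set(settings.get("valid-target-uids", []))
    gids = set(settings.get("valid-target-gids", []))
    return sorted(u for u in frontend_users if u not in uids or u not in gids)


settings = parse_privsep(PRIVSEP_CONFIG)
# "fenew" stands in for a newly added frontend user (hypothetical name).
print(missing_targets(settings, ["fe1", "fecms", "fenew"]))
```

Any name the check prints would still need to be added by hand to both target lists.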

10 Frontend authorization
● The factory must whitelist all the frontends it supports, both in
● The collector (where X509 authentication happens)
● The factory
[Config snippets from the slide are not preserved in this transcript. Callouts: Frontend security name; UID.]

11 Adding the sites
● The factory is supposed to know what Grid sites are out there and can be used
● Unfortunately, we don't have good tools for site discovery
● The installer can query BDII and RESS, but it is very primitive
● You are better off using other tools (like ldapsearch)
● Admin needs to manually add sites in the XML file
[Callout: Work in progress to improve this]

12 Site attributes
● Each site will have some attributes associated with it
● Contact info – gatekeeper, jobmanager, RSL
● Site config – work dir, platform, firewalls, glexec
● Site limits – mostly wallclock time
● Site properties – supported VOs, nearby SEs
● User requested attributes – Can be anything
● Next slides explain the most often used ones

13 Site contact info
● How to submit to the site
● If you can reference an InfoSys, do it
[XML entry example from the slide is not preserved in this transcript. Callouts: RSL often optional; Helps with monitoring.]

14 Site config
● Tells the glidein how to behave
● Note that one can host many condor binaries
[Config example from the slide is not preserved in this transcript. Callout: Special keyword.]

15 Site limits
● Grid sites typically have wallclock limits
● Pressure and sanity limits

16 Matchmaking attributes
● Proper site attributes
● Arbitrary other attributes

17 Final note on sites
● VO Frontend admins trust you to forward their proxies only to trusted parties
● You should always think twice before adding a new site
● The same site may support multiple VOs
● If possible, use the same entry (economy of scale)
● May not be always possible, though (RSL, attributes, etc.)

18 A note about reconfigs
● Factory has init.d-like maintenance script
● ./factory_startup start|stop|reconfig
● Config file editing is unconventional
● Cannot edit the master config file (glideinWMS.xml)
● Must edit a copy of it, typically in ../glidein_bla.cfg/glideinWMS.xml
● Then tell reconfig where the copy is
  ./factory_startup reconfig ../glidein_bla.cfg/glideinWMS.xml
[Callout: Takes some time to get used to it]

19 Factory operations: Monitoring and troubleshooting

20 Basic system monitoring
● Basic system monitoring is just regular Condor(-G) monitoring
● condor_q -global [-globus]
● logs
● The factory also has extensive Web monitoring
● Historical graphs
● Current snapshot (both raw and nice-to-look-at)

21 Historical graphs
[Screenshot. Callouts: Can be zoomed in; Total or single entry.]

22 Table view
[Screenshot. Callouts: Easier to get a detailed view; Sortable.]

23 Troubleshooting views
[Screenshot. Callouts: More clutter, but easier to focus on problems; Sortable.]

24 Drill down to details
[Screenshot. Callout: Single entry – more space to show details.]

25 Basic system troubleshooting
● Typical error
● Glideins don't get submitted (are held)
● Just look for the hold reason
● Authorization problems
● Expired proxy (from frontend)
● Site “does not exist”
  – Could be in maintenance
  – Or really decommissioned
[Callout: Standard Condor-G tasks]

26 Site status
● Factory publishes an XML file with current status as advertised by Info Systems
[Screenshot. Callouts: No infosys configured; Likely decommissioned; Expect Condor-G problems.]

27 Removing sites
● One should never remove an entry from the XML file
● Impossible to add back an entry with the same name (memory effect)
● For long-term removal, just disable it
● Requires a reconfig (heavy)
● For short-term, put in downtime
● ./factory_startup down -entry CMS_T2_IT_Rome_ce01

28 Glidein monitoring
● Factory admins are supposed to monitor the health of the glideins they are submitting
● They will have to solve any problems anyhow
● Most info in the Web interfaces shown before
● Factory also has info the frontends don't
● Glidein exit logs
● Contain troves of useful debug info

29 Reminder – Web interfaces
[Screenshot. Callout: Includes glidein health.]

30 Completion stats
[Screenshot. Callout: Historical graphs color coded to give an idea of the health of the system.]

31 Text reports
Past 24.0 hours:
Total Glideins: 11788
Total Jobs: 17875 (Average jobs/glidein: 1.52)
Total time:            255.9 Ms (71083 hours - 2961.8 slots)
Total time used:       238.3 Ms (66182 hours - 2757.6 slots)
Total time validating: 743.5 Ks (206 hours - 8.6 slots)
Total time idle:       17.1 Ms (4759 hours - 198.3 slots)
Total time wasted:     17.6 Ms (4900 hours - 204.2 slots)
Time used/time wasted: 13.5
Time efficiency: 0.93

Per Entry (all frontends) stats for the past 24 hours.
                                   strt fval 0job |  val idle  wst badp | waste  time | total
CMS_T2_US_MIT_ce01                   0%   1%   3% |   2%   3%   6%  30% |   408  6807 |   952
CMS_T2_ES_IFCA_ce02                  0%   0%  10% |   0%   8%   7%  10% |   378  4810 |   641
CMS_T2_ES_IFCA_ce01                  0%   0%  11% |   0%   8%   7%  10% |   354  4643 |   614
CMS_T2_US_UCSD_gw2                   0%   0%   8% |   0%   7%   7%   7% |   341  4494 |   906
CMS_T2_US_UCSD_gw4                   0%   0%  11% |   0%   8%   7%   8% |   257  3300 |   689
CMS_T2_US_Purdue_osg                 2%   0%  71% |   0%  22%  22%  38% |   324  1439 |   828
CMS_T2_US_Wisconsin_cms01            0%   0%   3% |   1%   7%   8%  14% |   254  2892 |   531
CMS_T2_US_Wisconsin_cms02            0%   0%   5% |   1%   7%   8%  16% |   231  2598 |   475
CMS_T2_FR_CCIN2P3_cclcgceli09_long   0%   0%   4% |   0%  10%  13%  17% |   216  1607 |   508
CMS_T2_AT_Vienna_lcgce               0%   0%   4% |   0%   8%   8%   8% |   178  2129 |   451
CMS_T3_PT_Ingrid_ce02                0%   0%   5% |   0%  10%  11%  11% |   153  1293 |   374
CMS_T2_FR_CCIN2P3_cclcgceli06_long   0%   0%  16% |   0%  11%  13%  16% |   124   888 |   291
CMS_T2_IT_Bari_ce2                   0%   0%   0% |   0%   3%   4%  10% |   123  2831 |   295
CMS_T2_CN_Beijing_lcg002             0%   0%   9% |   0%  24%  27%  29% |   120   441 |   315
CMS_T2_US_Florida_iogw1              0%   0%   5% |   0%   7%   6%   7% |    97  1515 |   243
...

LEGEND:
Ks - kiloseconds (*1,000 seconds)
Ms - megaseconds (*1,000,000 seconds)
strt - % of jobs where condor failed to start
fval - % of glideins that failed to validate (hit 1000s limit)
0job - % 0 jobs/glidein
----------
val - % of time used for validation
idle - % of time spent idle
wst - % of time wasted (Lasted - JobsLasted)
badp - % of badput (Lasted - JobsGoodput)
----------
waste - wallclock time wasted (hours) (Lasted - JobsLasted)
time - total wallclock time (hours) (Lasted)
total - total number of glideins
[Callouts: Quick health status overview; Ordered by waste.]
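
The derived figures in the report header follow from the raw totals by simple arithmetic. The sketch below recomputes them from the numbers in the sample report; the function and variable names are illustrative, and the formulas are inferred from the report's own labels rather than taken from the glideinWMS source.

```python
# Totals copied from the sample report (past 24.0 hours).
HOURS_IN_WINDOW = 24.0
TOTAL_GLIDEINS = 11788
TOTAL_JOBS = 17875
TOTAL_HOURS = 71083
USED_HOURS = 66182
WASTED_HOURS = 4900


def summarize(glideins, jobs, total_h, used_h, wasted_h, window_h):
    """Recompute the derived header figures from the raw totals."""
    return {
        "jobs_per_glidein": jobs / glideins,
        # Average number of slots occupied over the reporting window.
        "average_slots": total_h / window_h,
        "time_efficiency": used_h / total_h,
        "used_over_wasted": used_h / wasted_h,
    }


summary = summarize(TOTAL_GLIDEINS, TOTAL_JOBS, TOTAL_HOURS,
                    USED_HOURS, WASTED_HOURS, HOURS_IN_WINDOW)
print(f"Average jobs/glidein:  {summary['jobs_per_glidein']:.2f}")
print(f"Average slots:         {summary['average_slots']:.1f}")
print(f"Time efficiency:       {summary['time_efficiency']:.2f}")
print(f"Time used/time wasted: {summary['used_over_wasted']:.1f}")
```

Rounded as printed, these match the report's 1.52 jobs per glidein, 2961.8 slots, 0.93 efficiency and 13.5 used-to-wasted ratio.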

32 Troubleshooting glideins
● Logs are your friends!
● Each glidein returns a stdout and stderr file
● /var/gfactory/clientlogs/user_fe1/entry_XX/job.X.Y.*
● On top of glidein_startup logs, stderr also contains:
● Complete config used by the startd
● Startd (and starter) logs
  – Compressed to save space (O(100k) glideins * day!)
  – ./glideinWMS/factory/tools/cat_StartdLog.py job*.err
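
To collect the stderr files for one frontend and entry before feeding them to cat_StartdLog.py, a small glob helper is enough. This is a sketch under stated assumptions: the function name `glidein_stderr_files` is hypothetical, the directory layout is the one shown on the slide, and the `.err` suffix is inferred from the `job*.err` argument in the cat_StartdLog.py example. The demo runs against a temporary directory, not a real factory.

```python
import glob
import os
import tempfile


def glidein_stderr_files(base_dir, frontend_user, entry_name):
    """Return the sorted job.*.err paths for one frontend user and one entry."""
    pattern = os.path.join(
        base_dir, f"user_{frontend_user}", f"entry_{entry_name}", "job.*.err"
    )
    return sorted(glob.glob(pattern))


# Demo on a throwaway tree that mimics <base>/user_fe1/entry_XX/job.X.Y.*
with tempfile.TemporaryDirectory() as base:
    entry_dir = os.path.join(base, "user_fe1", "entry_XX")
    os.makedirs(entry_dir)
    for name in ("job.1.0.out", "job.1.0.err", "job.2.0.err"):
        open(os.path.join(entry_dir, name), "w").close()
    found = [os.path.basename(p) for p in glidein_stderr_files(base, "fe1", "XX")]

print(found)
```

On a factory laid out as on the slide, the base directory would be /var/gfactory/clientlogs.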

33 Typical errors
● Download problems (failed to download files)
● Validation problems
● glidein_startup fails before launching the startd
● Test that failed (usually) prints out what went wrong
● Many reasons (disk full, missing SW, VO problems)
● Condor launched but fails immediately
● Typically due to missing libraries
● Or wrong architecture

34 Typical errors 2
● Condor fails to register with VO collector
● Credential or Firewall issues
● Most of the time error clear in the startd logs
● Startd never gets any user job
● Could be just lack of (proper) jobs
● But firewalls can play dirty tricks!
● Startd gets jobs but fails to start them
● glexec failures
● Startd misconfiguration
[Callouts: but not always!; Often without any error in the logs!]

35 Doing troubleshooting
● No good automated tools
● Not past what gets reported in the text report
● Look at report to discover troublesome entries
● Then extract from completion logs a few glideins that misbehave
● Analyze (by hand) one-by-one
● It can be quite time consuming
● Work in progress to automatically tag the most common failure modes
[Callout: Will hopefully have something soon]
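
The first step, scanning the text report for troublesome entries, can be sketched as follows. The rows are copied from the sample report on slide 31; the parser, the function names and the two thresholds (more than 20% zero-job glideins, more than 25% badput) are illustrative assumptions, not values recommended in the talk.

```python
# A few per-entry rows copied from the sample text report.
REPORT_ROWS = """\
CMS_T2_US_MIT_ce01 0% 1% 3% | 2% 3% 6% 30% | 408 6807 | 952
CMS_T2_US_Purdue_osg 2% 0% 71% | 0% 22% 22% 38% | 324 1439 | 828
CMS_T2_IT_Bari_ce2 0% 0% 0% | 0% 3% 4% 10% | 123 2831 | 295
CMS_T2_CN_Beijing_lcg002 0% 0% 9% | 0% 24% 27% 29% | 120 441 | 315
"""

# Column names as given in the report legend.
COLUMNS = ("strt", "fval", "0job", "val", "idle", "wst", "badp",
           "waste", "time", "total")


def parse_row(line):
    """Split one report row into (entry_name, {column: integer value})."""
    fields = line.replace("|", " ").replace("%", "").split()
    return fields[0], dict(zip(COLUMNS, map(int, fields[1:])))


def troublesome(rows_text, max_zero_job=20, max_badput=25):
    """Return entries whose 0job or badp percentage exceeds the thresholds."""
    flagged = []
    for line in rows_text.splitlines():
        if not line.strip():
            continue
        name, stats = parse_row(line)
        if stats["0job"] > max_zero_job or stats["badp"] > max_badput:
            flagged.append(name)
    return flagged


print(troublesome(REPORT_ROWS))
```

With these thresholds the MIT, Purdue and Beijing entries are flagged; the glideins behind them would then be pulled from the completion logs and analyzed by hand, as described above.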

36 Troubleshooting new frontends
● When adding a new frontend, getting the security right is usually the biggest hurdle
● Wrong DN
● Wrong mapping
● Wrong security name
● Need to look both in Condor and factory entry logs
● Usually pretty obvious
[Callout: Just remember to check!]

37 Factory operations: Other tasks

38 Grid interface
● Factory tries to hide the Grid from the VOs
● So they will consult you for anything Grid related
● Budget for this support
● All glideins submitted via the factory
● The Grid sites will contact you for any problems (directly or indirectly)
● You may need to (help) solve problems that are mostly (or even 100%) VO related (because you have the Grid knowledge)

39 Factory operations: And the summary

40 Summary
● Factory operations is a time consuming job
● You are trying to shield the users from the Grid
● So you do most of the troubleshooting for them
● Most errors are beyond your (direct) control
● Plenty of monitoring tools available
● But detailed troubleshooting still manual
● Initial setup relatively easy
● But keeping up with the changes in the Grid will keep you busy
[Callout: Tools to help you out should be coming soon]

41 Pointers
● The official project Web page is http://tinyurl.com/glideinWMS
● glideinWMS development team is reachable at glideinwms-support@fnal.gov
● OSG glidein factory at UCSD
  http://hepuser.ucsd.edu/twiki2/bin/view/UCSDTier2/OSGgfactory
  http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_Production_v3_1/factoryStatus.html

