Christmas running post-mortem (Part III), ALICE TF Meeting, 15/01/09


1 Christmas running post-mortem (Part III)
ALICE TF Meeting 15/01/09

2 Presented during the GDB
On the 13th of January we had a post-mortem meeting with Maarten and the WMS experts to evaluate the WMS problems faced by ALICE over the Christmas period.

3 The Service: WMS
- Where was this happening?
  - Among other sites, mainly two T2 sites were attracting a huge number of jobs: MEPHI in Russia and the T2 in Prague
- Why was this happening?
  - Normally several causes can lead to this situation:
    - The destination queue is not available
      - The submitted jobs are then kept for a further retry (up to 2 retries; unmatched requests are discarded after 2 hours)
      - But ALICE has set shallow resubmission to zero and explicitly asked the WMS experts to configure the nodes to avoid any possible resubmission
    - Any configuration problem at the site keeps jobs being submitted
      - Since these jobs are visible nowhere, they do not exist for ALICE, and therefore the system keeps submitting
  - In any case, the ALICE submission rate is not high enough to provoke such huge backlogs on nodes such as wms204
  - The previous reasons can be ingredients of the problem, but cannot be the only cause of such a load
  - On wms204 the matchmaking became very slow due to unknown causes; the developers have been involved

4 Effects of the high load
- ALICE was seeing jobs in status READY and WAITING for a long time
- The experiment still does not treat READY and WAITING as problematic statuses, so it keeps on submitting. SNOWBALL: huge backlogs are created
- Request: could the WMS be configured to refuse new submissions once it gets into such a state?
  - Proposed during the post-mortem meeting with the WMS experts; it could be in place by the end of February 2009 (at the earliest)
  - As soon as the node gets overloaded, the sensors can automatically put the service in draining mode (thereby preventing any submission by the client)
  - This procedure rules out defining an alias for the ALICE WMS
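The snowball effect above comes from the client not treating long-lived READY and WAITING jobs as a warning sign. A minimal client-side sketch of such a check follows. The thresholds, the `JobRecord` type and the function name are assumptions made for illustration; the slides give no numeric limits and do not describe how ALICE or the WMS sensors would implement this.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the slides do not specify any numeric limits.
MAX_STUCK_JOBS = 200
MAX_STUCK_SECONDS = 2 * 3600


@dataclass
class JobRecord:
    status: str         # WMS status string, e.g. "READY", "WAITING", "RUNNING"
    age_seconds: float  # time the job has spent in its current status


def wms_looks_overloaded(jobs, max_stuck=MAX_STUCK_JOBS, max_age=MAX_STUCK_SECONDS):
    """Return True when too many jobs have sat in READY or WAITING for too long."""
    stuck = sum(
        1
        for job in jobs
        if job.status in ("READY", "WAITING") and job.age_seconds > max_age
    )
    return stuck >= max_stuck
```

A submitter could call `wms_looks_overloaded` before each submission round and skip the WMS while it returns True, instead of adding to the backlog.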

5 ALICE procedures
- ALICE immediately stopped submission through wms204 at all sites, putting the highest weight on wms103 and wms109
  - The situation was solved on wms204 but then appeared on wms103 and wms109
  - wms103 and wms109 (gLite 3.0) had a different problem that could not be explained satisfactorily either
- In addition, access to wms117 was also granted to ALICE for this period
  - The node developed the same symptoms as wms204
- As a result, the WMS has been monitored continuously during this period, changing the WMS in production when needed
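The re-weighting described above can be pictured as a simple preference ordering over the available WMS nodes. The sketch below is only an illustration: the numeric weights and the function name are hypothetical, and the slides do not say how ALICE actually stores or applies these weights.

```python
def order_wms_by_weight(weights, excluded=()):
    """Return WMS host names sorted by descending weight, skipping excluded nodes.

    weights  -- mapping of host name to a numeric weight (illustrative values)
    excluded -- host names that must not receive submissions
    """
    usable = {host: w for host, w in weights.items() if host not in excluded and w > 0}
    # Sort by weight (highest first); ties are broken by host name for determinism.
    return sorted(usable, key=lambda host: (-usable[host], host))


# Example with made-up weights: wms204 is taken out, wms103 and wms109 are preferred.
preferred = order_wms_by_weight(
    {"wms204": 1, "wms103": 10, "wms109": 10, "wms117": 5},
    excluded=("wms204",),
)
```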

6 Possible sources of problems
- ALICE JDL construction?
  - The experiment has always defined simple JDL files for its agents
- BDII overloaded?
  - It should then affect all VOs during matchmaking
  - In addition, several tests querying the BDII were made, with positive results
- Network problems?
  - For several days? And affecting ALICE only?
- Overloading of the myproxy server
  - A high load on myproxy caused by ALICE was indeed found
  - However, this seems to be uncorrelated with the WMS issue
  - Although an overloaded myproxy server can slow down WMS processing, this should then be visible on all WMS of all VOs

7 How to solve the myproxy server issue
- Faster machines have already been requested to replace the current myproxy server nodes
  - The request, proposed during the Christmas period, has already been made
- In addition, ALICE is currently changing the submission procedure to ensure a proxy delegation request only once per hour
  - In case of any problem at a VOBOX, this procedure can ensure a 'frugal' use of the myproxy server
  - The new submission procedure will have a beta version this week at Subatech (France)

8 Beta implementation at Subatech and Torino (I)
- Presented during the last ALICE TF Meeting, it basically consists of the following:
  - We will stop refreshing the delegated user proxy before each agent submission
    - We will now do it once per hour only
  - We stop using the -a option for agent submission, which performs an automatic delegation kept by WMProxy
    - We use the -d option instead, which explicitly creates a named delegated credential on the WMProxy and refers to this delegated proxy at each job submission
- This new procedure forces an explicit proxy delegation onto WMProxy BEFORE the job submission (to be performed just once per hour)

9 Beta implementation at Subatech and Torino (II)
- In detail, this is what we are doing:
  - Refresh the user proxy on the VOBOX for the first time
  - Make the WMProxy aware of the delegated proxy
    - glite-wms-job-delegate-proxy -d
  - Perform the usual agent submissions with the -d option
    - glite-wms-job-submit -d <jdl file>
  - After one hour the user proxy is refreshed again and the WMProxy is made aware of the delegated proxy again
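The hourly cycle above can be sketched as a small scheduler that decides, for each agent submission, whether a fresh delegation is due. This is only a sketch under stated assumptions: the class name, the delegation identifier and the JDL path are hypothetical, the user-proxy refresh on the VOBOX is site-specific and not shown, and the commands are only built as argument lists rather than executed.

```python
import time

DELEGATION_PERIOD = 3600  # seconds; "once per hour" in the slides


class DelegationScheduler:
    """Builds the command lines for one agent submission, delegating at most once per period."""

    def __init__(self, delegation_id, period=DELEGATION_PERIOD, clock=time.time):
        self.delegation_id = delegation_id  # hypothetical name for the delegated credential
        self.period = period
        self.clock = clock                  # injectable clock, handy for testing
        self.last_delegation = None

    def commands_for_submission(self, jdl_path):
        """Return the commands (as argument lists) to run for one submission."""
        commands = []
        now = self.clock()
        if self.last_delegation is None or now - self.last_delegation >= self.period:
            # The user proxy refresh on the VOBOX would happen here (not shown).
            commands.append(["glite-wms-job-delegate-proxy", "-d", self.delegation_id])
            self.last_delegation = now
        # Every submission refers to the named delegated credential.
        commands.append(["glite-wms-job-submit", "-d", self.delegation_id, jdl_path])
        return commands
```

Passing each returned list to `subprocess.run` would execute it; keeping command construction separate from execution makes the once-per-hour logic easy to check without a WMS at hand.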

10 Beta implementation at Subatech and Torino (III)
- Some effects of this procedure:
  - We change the LDAP configuration to include all WMS specified in RBLIST in the same config file
  - glite-wms-job-delegate-proxy must be run individually for every WMS used at each VOBOX
    - Individual config files, one per WMS, are therefore needed (placed in alien-logs)
    - These files are not used for submission, only for WMProxy delegation
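Since the delegation has to be repeated for every WMS in RBLIST, each with its own config file, the per-VOBOX step amounts to a loop over the list. The sketch below assumes the `-c` (config file) option of glite-wms-job-delegate-proxy is used to select each WMS; the file naming scheme, the delegation identifier and the function name are hypothetical, and the commands are built but not executed.

```python
def delegation_commands(rblist, config_dir, delegation_id):
    """Return one delegate-proxy command (as an argument list) per WMS in rblist."""
    commands = []
    for host in rblist:
        # Hypothetical naming: one config file per WMS, kept in config_dir.
        config_file = f"{config_dir}/{host}.conf"
        commands.append(
            ["glite-wms-job-delegate-proxy", "-c", config_file, "-d", delegation_id]
        )
    return commands


# Example: two WMS nodes, config files kept under alien-logs.
per_wms = delegation_commands(["wms204", "wms214"], "alien-logs", "alice-agent")
```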

11 Conclusions
- The WMS issue is still pending:
  - We still cannot conclude why such big backlogs were created during this vacation period
- Two new WMS@CERN have already been announced: wms214 and wms215, in addition to wms204
  - All of them with an independent LB
  - 8-core machines
  - gLite 3.1
  - wms103 and wms109 will be fully deprecated at the end of February
- At the moment, due to an AliRoot update, ALICE is not in full production
  - As soon as the experiment restarts production, we will carefully follow the evolution of the 3 nodes, reporting any further issue to the developers

12 Final Remarks
- ALICE lacks WMS
  - France is still not providing any WMS that can be put in production
  - WMS are provided at RDIG, Italy, NL-T1, FZK and RAL
  - The CERN WMS play a central role for many ALICE sites and are always a failover for the sites, even if a local WMS is available
- ALICE wishes to thank the IT/GS (Maarten and Patricia in particular) for the efficient support during the Christmas running

