
1 Status of the ALICE Grid. Patricia Méndez Lorenzo (IT). ALICE OFFLINE WEEK, CERN, 18 October 2010

2 Outline
- General results in the last three months
- List of general issues
- News about services
- HI CMS+ALICE exercise
- Nagios and monitoring
- Summary and conclusions

3 General results in the last three months

4 List of general issues
- T0 site
  - Instabilities this summer with the local CREAM-CE
  - Instabilities with the AFS software area
  - CAF nodes quite stable
  - Security patches applied to all ALICE VOBOXes at CERN
  - Migration of out-of-warranty VOBOXes (voalice07 to voalice15 and voalice09 to voalice16)
  - HI combined exercise
- T1 sites
  - CREAM-CE issues, including instabilities observed in the resource BDII
  - SE problems found at CNAF and CC-IN2P3 related to lack of disk space
- T2 sites
  - Usual operations; in general quite stable behaviour
  - Challenge: new sites entering production and upgrades of T2 sites to T1 sites (from the ALICE perspective)

5 T2 sites to T1 sites
- Korean and USA sites are willing to become ALICE T1 sites
  - Assuming T1 responsibilities in terms of service provision and management
- Challenge: bandwidth
  - A poor network was found between these sites and CERN
  - Show-stopper for these sites and also for newcomers
  - 1st approach: is the bottleneck at the entry to CERN (firewall stopping traffic)?
    - It has been found that this is not the issue
  - Current situation: not fully clear (Jeff is in contact this week with Edoardo Martelli to report the Supercomputing results)
  - (A bandwidth-check sketch follows after this slide)
- "Proposal for Next Generation Architecture interconnecting LHC computing sites" (Nov 2010)
  - Moving towards more dynamically configured links between sites, with a few static connections
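The slide does not name a measurement tool; purely as an illustration, a standard point-to-point throughput check between a remote site and a CERN-side host could look like the sketch below (iperf is my choice here, and both hostnames are placeholders, not the machines actually involved):

    # On the CERN-side test host (placeholder), start an iperf server
    iperf -s -p 5001

    # From the remote T1-candidate site, run a one-minute, multi-stream test towards it
    iperf -c iperf-test.cern.example -p 5001 -t 60 -P 4
    # -t 60 : measure for 60 seconds
    # -P 4  : four parallel TCP streams, to separate per-flow limits from aggregate capacity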

6 CREAM and AliEn v2.19
1. Easy management of the OSB (OutputSandbox)
2. Removal of any reference to the CREAM DB
3. Check of the CREAM-CE status in the BDII

7 CREAM and AliEn v2.19
- Easy management of the OSB
  - OSB required by ALICE for debugging purposes only
  - Direct submission of jobs via the CREAM-CE requires the specification of a gridftp server to save the OSB
    - Server specified at the level of the JDL file
    - ALICE solved this by requiring a gridftp server at the local VOBOX
  - OSB cannot be retrieved from the CREAM disk via any client command
    - Well... not fully true. The functionality was possible but not exposed before CREAM 1.6
    - Requirements to expose this feature:
      - Automatic purge procedures (from CREAM 1.5)
      - Limiters blocking new submissions in case of low free disk space (from CREAM 1.6)
  - CREAM 1.6 exposes the possibility to leave the OSB on the CREAM-CE (see the JDL sketch after this slide)
    - outputsandboxbasedesturi="gsiftp://localhost"; (agent JDL level)
    - A gridftp server at the VOBOX is no longer needed
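For illustration, a minimal direct CREAM submission that leaves the OSB on the CE could look like the sketch below; the CE endpoint, queue and executable are placeholders rather than ALICE's actual configuration, and only the outputsandboxbasedesturi attribute comes from the slide above:

    # Minimal JDL keeping the output sandbox on the CREAM-CE itself
    cat > test.jdl <<'EOF'
    [
      Executable = "/bin/hostname";
      StdOutput  = "std.out";
      StdError   = "std.err";
      OutputSandbox = { "std.out", "std.err" };
      outputsandboxbasedesturi = "gsiftp://localhost";
    ]
    EOF

    # Direct submission to a CREAM-CE (placeholder endpoint/queue), with automatic proxy delegation
    glite-ce-job-submit -a -r cream-ce.example.org:8443/cream-pbs-alice test.jdl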

8 CREAM and AliEn v2.19
- Removal of any reference to the CREAM DB
  - Used for reporting the number of running/waiting jobs, in parallel to the BDII information
  - AliEn v2.18 enabled both information sources
    - Selectable on a site-by-site basis through an environment variable (CE_USE_BDII) included in LDAP
  - AliEn v2.19 keeps the environment variable but removes the CREAM DB as an information source
    - Too heavy a query and not always reliable
    - If not reliable, we could saturate the sites or the opposite: simply not run
  - The CREAM-CE developers have proposed the creation of a tool able to provide the number of waiting/running jobs by querying the batch system
    - Hence the environment variable CE_USE_BDII is kept
  - WARNING: THE ONLY INFO SYSTEM WE HAVE NOW IS THE RESOURCE BDII (see the query sketch after this slide)
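As an illustration of what relying on the resource BDII means in practice, the running/waiting job counts can be read with a plain LDAP query; a minimal sketch, with a placeholder CE hostname and the usual Glue 1.3 schema (the exact base DN may differ per deployment):

    # Query the resource BDII (port 2170) of a CREAM-CE for per-VO job counts
    ldapsearch -x -LLL -h cream-ce.example.org -p 2170 \
      -b "mds-vo-name=resource,o=grid" \
      '(objectClass=GlueVOView)' \
      GlueVOViewLocalID GlueCEStateRunningJobs GlueCEStateWaitingJobs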

9 CREAM and AliEn v2.19
- Check of the CREAM-CE status
  - "Economic" reasons... why keep submitting to CREAM-CEs that are in draining or maintenance mode?
  - Until AliEn v2.19: manual approach
    - Non-operational CEs were manually removed from LDAP
  - With AliEn v2.19: automatic approach
    - Before any CREAM-CE operation, the status of the CREAM-CE is queried from the resource BDII
    - If the CE is in maintenance or draining mode, no operation is performed with this CE
    - If there is a list of CREAM-CEs, only those in production will be used
    - No need to restart services when the CE comes back into production
  - Procedure implemented and tested at Subatech with good results (a status query sketch follows after this slide)
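A sketch of the corresponding status check, again against a placeholder resource BDII; the GlueCEStateStatus attribute is what distinguishes Production queues from Draining, Closed or Queueing ones:

    # Ask the resource BDII whether the CE queues are in Production
    ldapsearch -x -LLL -h cream-ce.example.org -p 2170 \
      -b "mds-vo-name=resource,o=grid" \
      '(objectClass=GlueCE)' \
      GlueCEUniqueID GlueCEStateStatus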

10 CREAM status
- Current CREAM-CE production version:
  - CREAM 1.6.3 (gLite 3.2/sl5_x86_64), patch #4415
  - A gLite 3.1 version is arriving (patch #4387 in staged rollout), BUT this will be the last CREAM-CE deployment in gLite 3.1
- Next CREAM-CE version:
  - CREAM 1.7 (gLite 3.2/sl5_x86_64 ONLY!)
  - Foreseen for the end of the year / beginning of 2011
- A brief parenthesis...
  - Since the last offline week, I have submitted 27 GGUS tickets
    - 17 associated with CREAM
    - 4 associated with wrong information provided by the BDII
    - 6 associated with SE issues
- Let's look at the issues associated with CREAM (and observed by ALICE) in these last three months

11 CREAM issues
- Last offline week's advice for sites:
  - Migrate to CREAM 1.6 as soon as possible
  - Lots of bug fixes reported by ALICE and new features were included in this version
- However, several instabilities were observed after the migration to CREAM 1.6:
  - Connection timeout messages observed at submission time
  - Error messages reporting problems with the blparser service (blparser service not alive)
- Issues reported to the CREAM-CE developers
- We created a page for site admins describing the problems and the solutions:
  - http://alien2.cern.ch/index.php?option=com_content&view=article&id=46&Itemid=103

12 CREAM issues
- Connection timeout error message observed at submission time
  - CREAM service is down
  - Bug #69554: memory leak in util-java if CA loading throws an exception. SOLVED IN CREAM 1.6.2
  - Workaround provided by the developers, very easy to apply:
    - http://grid.pd.infn.it/cream/field.php?n=Main.KnownIssues
- blparser service is not alive (glite-ce-job-status)
  - Well-documented issue associated with the status of the BLAH blparser service
  - http://grid.pd.infn.it/cream/field.php?n=Main.ErrorMessagesReportedByCREAMToClient#blparsernotalive
- Further problem(s):
  - Bug #69545: CREAM digests asynchronous commands very slowly. SOLVED IN CREAM 1.6.2
  - Workaround provided by the developers, very easy to apply:
    - http://grid.pd.infn.it/cream/field.php?n=Main.KnownIssues
- (A quick client-side check of a CE showing these symptoms is sketched after this slide)
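From the client side, a first look at a CE showing either symptom might be as simple as the following sketch (the endpoint and job ID are placeholders):

    # Ask the CREAM-CE for its service information (version, status, ...)
    glite-ce-service-info cream-ce.example.org:8443

    # Query the status of a previously submitted job; the "blparser not alive"
    # error is returned by this kind of call when the BLAH parser is down
    glite-ce-job-status https://cream-ce.example.org:8443/CREAM123456789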

13 Other issues
- Reported by GridKa
  - User proxy delegation problems
  - At delegation time the user gets "not authorized for operation" messages
  - Documentation available in:
    - http://grid.pd.infn.it/cream/field.php?n=Main.ErrorMessagesReportedByCREAMToClient
- Reported by LPSC
  - /tmp area of the CREAM-CE full of glexec "proxy files" (Bug #73961)
  - Not directly a CREAM issue, although the service was affected
  - With CREAM 1.6.3 the problem is solved
    - No workaround will have to be applied once sites migrate to this version (an interim clean-up sketch follows after this slide)
  - Migration to CREAM 1.6.3 is highly recommended
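Until a site has migrated, an interim clean-up along the following lines could keep /tmp from filling up; the file-name pattern and the one-day age threshold are assumptions for illustration, not taken from the bug report:

    # Remove stale glexec proxy files older than one day from /tmp (pattern is hypothetical)
    find /tmp -maxdepth 1 -type f -name 'glexec*' -mtime +1 -delete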

14 Other issues
- Found at CERN
  - Lots of timeouts while querying the CREAM DB during the summer
    1. Increase of the timeout window to 3 minutes
    2. Deprecation of the CREAM DB usage
- Reported by Subatech
  - glite-ce-job-status fails with the message: "EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred"
  - Issue associated with too little memory on the CREAM-CE (~2 GB when the issue was found)
  - CREAM-CE advice: CREAM-CE nodes should have a minimum of 8 GB of memory (quick check sketched after this slide)
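A trivial check a site admin might run on the CE node to see whether it meets that recommendation:

    # Show total/used/free memory on the CREAM-CE node, in gigabytes
    free -g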

15 More about CREAM
- CREAM 1.6.3 includes important bug fixes
  - See Massimo Sgaravatto's presentation at the latest GDB meeting:
    - http://indico.cern.ch/conferenceDisplay.py?confId=83604
- The CREAM 1.7 client will include glite-ce-job-output
  - This does not require changes in our CREAM.pm module
  - And the possibility to leave the OSB on the CREAM-CE (and retrieve it on demand) is of course available (usage sketched after this slide)
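Once that client is available, retrieving an OSB left on the CE should reduce to something like the following sketch (the job ID is a placeholder):

    # Fetch the output sandbox of a finished job directly from the CREAM-CE
    glite-ce-job-output https://cream-ce.example.org:8443/CREAM123456789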

16 gLite-VOBOX
- Current production version:
  - VOBOX 3.2.9 (gLite 3.2/sl5_x86_64), patch #4257 (5 Oct 2010)
- New features:
  - New Glue 2.0 service publisher
  - New version of the LB clients
- A gridftp server is still included in this version
  - ... included, but not configured via YAIM
  - The startup of the service has to be handled outside YAIM
  - The removal of this server will be requested

17 HI CMS+ALICE exercise
- Combined ALICE+CMS exercise (21 October 14:00 to 22 October 14:00) to check the ability of the IT infrastructure (network and tapes) to cope with the expected rates
  - P2 to CASTOR (2.5 GB/s max) and transfer to tape
    - 2.3 PB available on t0alice and 2.3 PB available on alicedisk
  - Reconstruct ~10% of the data
  - Simultaneous copy of the RAW data to the disk pool (via xrd3cp, 2.5 GB/s max)
    - 2100 TB of extra space on the disk pools provided beforehand by IT
  - (A quick rate arithmetic check follows after this slide)
- Asynchronous start-up of the test
  - ALICE exported directly to CASTOR, while CMS was performing a repacking step before its export
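As a rough sanity check of those numbers, 24 hours at the quoted 2.5 GB/s peak rate bounds the exported volume well below the available buffer space:

    # Upper bound on data exported in 24 h at 2.5 GB/s, in TB
    echo "scale=1; 2.5 * 86400 / 1024" | bc    # ~210.9 TB, well below the 2.3 PB t0alice pool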

18 P2  CASTOR transfers  Average rate – 2GB/sec with a max rate of 2.5 GB/sec  160 TB transferred (1-% of the expected HI volume), 60587 files (2.7 GB/file)  Several interruptions for detector reconfiguration and follow up on data transfer to tapes (realistic scenario) 18/15/10ALICE OFFLINE WEEK -- ALICE GRID STATUS 18 (Plot provided by L. Betev)

19 CASTOR disk buffer to tape transfers
- Average rate 2.4 GB/s
- Data reaches tape about one hour after being written to the CASTOR buffer (Δt = 1 h between "data in from P2" and "to tape")
- The 3rd-party copy is delayed by 1 h
(Plot provided by L. Betev)

20 Copy from t0alice to alicedisk + reconstruction
- Copy from t0alice to alicedisk: average copy rate 2.5-2.6 GB/s
- RAW data reconstruction reading and writing in parallel
  - Average reco "in" rate: ~200 MB/s
  - Average reco "out" rate: ~20 MB/s
(Plot provided by L. Betev)

21 Monitoring: Nagios
- Nagios
  - Monitoring of the ALICE VOBOXes, in production since summer 2010
  - Visualization of the results via SAM is obsolete
  - Nagios implementation in ML (MonALISA) still pending
  - Site availability calculation: the calculations through SAM and through Nagios are currently being compared
    - The next MB meeting will show these results
- Pending developments
  - Implementation of the CREAM-CE standard test suite
  - Redefinition of the site availability algorithm based on CREAM (currently based on the LCG-CE)

22 Monitoring: Gridview
- The transfer rate reported by Gridview is smaller than the real rate
  - Example from the plots: a daily average of 20 MB/s reported by Gridview versus 32 MB/s actually measured
- The issue was found in August 2010 but is still pending
- Tracked in GGUS ticket #61724

23 Summary and conclusions
- Very smooth production in these last three months
  - Raw data transfer to CASTOR, registration in the AliEn file catalogue and transfers to T1 sites are already routine
  - Site inefficiencies immediately handled together with the site admins
- Some changes concerning the CREAM-CE service have been included in AliEn v2.19
  - Based on the experience gained this summer with the first version of CREAM 1.6
  - Some new improvements can be expected for the next CREAM 1.7 version
- The agile approach foreseen by ALICE, with emphasis on the use of T2 sites (even becoming ALICE T1 sites), will be one of the topics to work on in the following months

