WLCG Service Report ~~~ WLCG Management Board, 31st March 2009.

1 WLCG Service Report Jamie.Shiers@cern.ch ~~~ WLCG Management Board, 31st March 2009

2 Introduction
This report covers the two weeks since the last WLCG MB.

Site | Date | Issue
RAL | 24/3 | 2 power glitches resulted in major site outage. CASTOR up at 14:30 on 25/3, other services soon after. ATLAS replication had to be restarted due to DB corruption. Some other knock-on effects… Move to new machine room & STEP’09?
CNAF | 27/3 – 03/04 | Scheduled downtime (power supply and air conditioning) due to the interconnection of the existing services to the new infrastructure system. LFC has been replicated in Roma (CHEP).
PIC | 06/04 12h – 08/04 15h | Annual power supply maintenance – scheduled downtime [ Impact on ATLAS ES(?) cloud? ]
ASGC | 27/3 (?) | Have now hired a full-time DBA (who?). Still in process of relocating services to IDC. Communication is still an issue here – (for example) many days delay in response to problems with Oracle DB & streaming for ATLAS – this is not compatible with the response times we have discussed here (MB) nor with a reliable service. Commissioning for STEP’09 needs to be understood!

3 More on ASGC
- ASGC T1 and Taiwan Federated T2 services will be collocated at IDC from Mar. 19. All the T1 and T2 services are planned to be up and running before Mar. 23.
- We hope to make 2,500 cores and 1.3 Petabyte of disk space available next week. The tape library will be available about one week later, with clean tapes added gradually.
- During the transition, all ASGC T1 and T2 services will be shut down and restarted on the last day (Sunday, Mar. 22).
- It will be difficult to bring the tape system online within one week, because cleaning the existing tape system does not fit the proposed schedule. Besides, the tape drives cannot be cleaned, due to the complex interior components of the LTO3/4 drives.
- The MES procurement is expected to finish in the middle of next week, while the actual delivery term will extend another 45 days. The vendor has promised to speed up the internal bidding as well as the shipment, but there might still be a delay of another two weeks or more.
- Local IBM has promised to loan another LTO4 drive, but we might have limited bandwidth if two VOs request migration/staging around the same time. We are still negotiating with the local vendor to see whether there is a chance of adding two or three more drives.
- Migration of the data from existing cartridges will be another concern; we need to confirm other technical details and procedures before taking any action.
- Merging/splitting of the TS3500 tape system will add a delay of another 3-4 days, depending on the labor order from the MES case.
- We hope for your understanding. We also hope to relocate the tape system into the collocation data center area, but this needs better coordination due to the limited floor plan serving the tape library.

Alessandro - is there any foreseen time when they will be ready? How could ATLAS handle the situation with a T1 down so long? Trying to move the Australian T2 to a different cloud. Maybe use LFC and FTS in TRIUMF?

4 GGUS Summaries
Alarm testing scheduled for this week – alarms should be issued and analysis complete well in advance of next week’s F2F meetings!

VO concerned | USER | TEAM | ALARM | TOTAL
ALICE | 1 | 0 | 0 | 1
ATLAS | 14 | 15 | 0 | 29
CMS | 2 | 0 | 0 | 2
LHCb | 11 | 2 | 0 | 13
Totals | 28 | 17 | 0 | 45

VO concerned | USER | TEAM | ALARM | TOTAL
ALICE | 2 | 0 | 0 | 2
ATLAS | 12 | 11 | 0 | 23
CMS | 4 | 0 | 0 | 4
LHCb | 8 | 2 | 0 | 10
Totals | 26 | 13 | 0 | 39
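The two ticket summaries above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes each VO's figures read as USER, TEAM, ALARM, TOTAL, and the names `first_summary`, `second_summary`, `column_totals` and `rows_consistent` are hypothetical, not part of any GGUS tooling.

```python
# Ticket counts per VO as (USER, TEAM, ALARM, TOTAL), transcribed from the
# two GGUS summaries in this slide. The variable names are hypothetical.
first_summary = {
    "ALICE": (1, 0, 0, 1),
    "ATLAS": (14, 15, 0, 29),
    "CMS": (2, 0, 0, 2),
    "LHCb": (11, 2, 0, 13),
}
second_summary = {
    "ALICE": (2, 0, 0, 2),
    "ATLAS": (12, 11, 0, 23),
    "CMS": (4, 0, 0, 4),
    "LHCb": (8, 2, 0, 10),
}


def column_totals(table):
    """Sum each of the four columns over all VOs."""
    return tuple(sum(row[i] for row in table.values()) for i in range(4))


def rows_consistent(table):
    """Check that USER + TEAM + ALARM equals TOTAL for every VO."""
    return all(user + team + alarm == total
               for user, team, alarm, total in table.values())


print(column_totals(first_summary))   # (28, 17, 0, 45)
print(column_totals(second_summary))  # (26, 13, 0, 39)
print(rows_consistent(first_summary), rows_consistent(second_summary))
```

Both column sums agree with the "Totals" rows quoted in the slide, and no ALARM tickets appear in either summary.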

5 [Figure annotations: VO boxes – voalice03/06 (see slide notes for some discussion); CE & SRM; SRM – chimera etc; CE; CE & SRM; Mainly SRM; CE & SRM; Mainly CE; SRM]

6 [Figure annotations: CE; CE & SRM; Mainly SRM; Mainly CE; SRM]

7 Summary
- “Transitory” problems (as in “here today and gone tomorrow”) continue – possibly at a higher rate than in recent weeks
- Masking out scheduled or understood issues leaves a relatively good service view for this period – is this representative?
- Scale Testing for the Experiment Program – STEP’09: [ Only ] possible schedule: May/June, to finish by end June!
- “The priority to improve the site readiness is there NOW and not only in May/June when ATLAS and the other VOs are actively scale-testing at the same time”

8 ATLAS metrics for STEP09
- All T1s must be fully operational – no downtimes of >4 hours are allowed during the STEP09 week
- T2s are supposed to be fully operational – T2s may sign off until June 1
- We will size the tests such that they all fit within 1 week – we will measure what has finished and what has not
- Need to define precise tape metrics (rate, efficiency, losses, ...)
- Need to define goals for production and reconstruction rates and efficiencies
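The T1 downtime metric above lends itself to a simple check. The following is a minimal sketch, assuming ">4 hours" means a single contiguous outage strictly longer than four hours; the function name `violates_t1_metric` and the constant `MAX_T1_DOWNTIME` are hypothetical. The PIC maintenance window quoted in the introduction serves only as an illustrative input, since it does not fall in the STEP09 week.

```python
from datetime import datetime, timedelta

# Assumed reading of the metric: one contiguous outage may not exceed 4 hours.
MAX_T1_DOWNTIME = timedelta(hours=4)


def violates_t1_metric(start, end, limit=MAX_T1_DOWNTIME):
    """Return True if the outage from start to end is strictly longer than limit."""
    return (end - start) > limit


# Illustrative input: PIC annual power maintenance, 06/04 12h to 08/04 15h (2009).
pic_start = datetime(2009, 4, 6, 12)
pic_end = datetime(2009, 4, 8, 15)

print(pic_end - pic_start)                     # 2 days, 3:00:00
print(violates_t1_metric(pic_start, pic_end))  # True
```

A window of that length (51 hours) would be far outside the limit if it occurred during the test week, which is why scheduled maintenance has to be planned around STEP09.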

