
1 ASGC Site Report
Jason Shih, ASGC Grid Ops
CASTOR External Operation Face to Face Meeting

2 Outline
- Current infrastructure
- The incident
- Release and architecture
- Resource level
- Monitoring services, alarms and automation
- Operation overview
- Challenges and issues
- Plans for Q4 2009 and 2010

3 The incident ...
- All facilities moved out for cleaning
- Containers used for storage and humidity control
- Racks protected from dust
- Ceiling removed

4 Lost all tape drives
- Snapshots of the decommissioned tape drives after the incident

5 IDC colocation
- Facility installation completed on Mar 27th
- Tape system delayed until after Apr 9th
  - Realignment
  - RMA for faulty parts

6 Current infrastructure (I)
- Shared core services for Atlas and CMS
  - Stager, ns, dlf, repack, and LSF
  - Same DB cluster, 3 RAC nodes: SRM + stager/ns/dlf
- Disk pools & servers
  - 80 disk servers (6 more online by the end of the 3rd week of Oct)
  - Total capacity: 1.67 PB (0.3 PB allocated dynamically)
  - Current usage: 0.79 PB (~58% usage; see the sketch below)
  - 14 disk pools (8 for Atlas, 3 for CMS, and another three for bio, SAM, and dynamic allocation)
  - Total capacity: 0.63 PB for CMS and 0.7 PB for Atlas
  - Current usage: 63% for CMS and 44% for Atlas
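The quoted ~58% is consistent with measuring usage against the statically allocated capacity (total minus the 0.3 PB dynamic pool) rather than the full 1.67 PB; that reading is an inference, and the minimal Python sketch below only reproduces the arithmetic.

    # Usage arithmetic for the figures on this slide. Measuring usage against
    # the statically allocated capacity (total minus the 0.3 PB dynamic pool)
    # reproduces the quoted ~58%; that interpretation is an assumption.
    total_pb, dynamic_pb, used_pb = 1.67, 0.30, 0.79

    print("vs. total capacity:  %.0f%%" % (100 * used_pb / total_pb))               # ~47%
    print("vs. static capacity: %.0f%%" % (100 * used_pb / (total_pb - dynamic_pb)))  # ~58%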

7 Current infrastructure (II)
- Shared tape drives
  - 12 before the incident, all decommissioned
  - 7 during STEP (2 loaned LTO3 + 5 LTO4)
  - 18 drives added in the 1st week of Oct
  - 24 drives in total

8 Monitoring services
- Standard Nagios probes: NRPE + customized plugins (see the sketch below)
- SMS to OSE/SM for all types of critical alarms
- Availability metrics
- Tape metrics (SLS)
- Throughput, capacity & scheduler metrics per VO and disk pool
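The slide only names NRPE with customized plugins; below is a minimal, hypothetical sketch of what one such check could look like in Python. The check name, thresholds, and command-line convention are illustrative assumptions, not the actual ASGC plugin; the generic part is the Nagios exit-code contract (0/1/2/3) that NRPE relays back to the server.

    #!/usr/bin/env python
    # Hypothetical NRPE-style check: alarm on disk-pool usage.
    # Usage: check_diskpool_usage.py <used_TB> <total_TB> [warn_frac] [crit_frac]
    import sys

    # Standard Nagios plugin exit codes, relayed by NRPE to the Nagios server.
    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    def check_usage(used_tb, total_tb, warn=0.80, crit=0.90):
        """Classify pool usage against warning/critical thresholds."""
        frac = used_tb / total_tb
        msg = "usage %.1f%% (%.2f/%.2f TB)" % (100 * frac, used_tb, total_tb)
        if frac >= crit:
            return CRITICAL, "CRITICAL - " + msg
        if frac >= warn:
            return WARNING, "WARNING - " + msg
        return OK, "OK - " + msg

    if __name__ == "__main__":
        try:
            used, total = float(sys.argv[1]), float(sys.argv[2])
            warn = float(sys.argv[3]) if len(sys.argv) > 3 else 0.80
            crit = float(sys.argv[4]) if len(sys.argv) > 4 else 0.90
        except (IndexError, ValueError):
            print("UNKNOWN - usage: check_diskpool_usage.py <used_TB> <total_TB> [warn] [crit]")
            sys.exit(UNKNOWN)
        code, message = check_usage(used, total, warn, crit)
        print(message)   # first line of output is what Nagios displays
        sys.exit(code)

On the NRPE side such a script would typically be registered as a command in nrpe.cfg and invoked from the Nagios server via check_nrpe; critical results could then trigger the SMS escalation mentioned above through Nagios notification commands.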

9 Tape system - during STEP
- Before the incident: 8 LTO3 + 4 LTO4 drives
  - 720 TB on LTO3, 530 TB on LTO4
- May 2009: two loaned LTO3 drives
- MES: 6 LTO4 drives at the end of May
- Capacity: 1.3 PB (old, mixed LTO3/LTO4) + 0.8 PB (LTO4)
- New S54 frame model introduced: 2K slots with tiered model
  - Required: ALMS upgrade, enhanced gripper

10 Current resource level (I) - Atlas space tokens

  SpaceToken        Cap./JobLimit   DiskServers   TapePool/Cap.
  atlasMCDISK       163TB/790       8             -
  atlasMCTAPE       38TB/80         2             atlasMCtp/39TB
  atlasPrdD1T0      278TB/810       15            -
  atlasPrdD0T1      61TB/210        3             atlasPrdtp/105TB
  atlasGROUPDISK    19TB/40         1             -
  atlasScratchDisk  28TB/80         1             -
  atlasHotDisk      2/40TB          2             -
  Total             950TB/1835      46            -

11 Current resource level (II) - CMS

Disk pools:

  Disk Pool     Cap./JobLimit   DiskServers   TapePool/Cap.
  cmsLTD0T1     278T/488        9             *
  cmsPrdD1T0    284T/1560       13
  cmsWanOut     72T/220         4

  * Dep. on tape family.

Tape pools:

  Tape Pool        Cap(TB)/Usage   Drive dedication   LTO3/4 mixed
  atlasMCtp        8.98/40%        N                  Y
  atlasPrdtp       101/65%         N                  Y
  cmsCSA08cruzet   15.6/46%        N                  N
  cmsCSA08reco     5/0%            N                  N
  cmsCSAtp         639/99%         N                  Y
  cmsLTtp          34.4/44%        N                  N
  dteamTest        3.5/1%          N                  N

12 CASTOR release overview

  Service type   OS level          Release    Remark
  Core           SLC 4.7/x86-64    2.1.7-19   Stager/ns/dlf
  SRM            SLC 4.7/x86-64    2.7-18     3 headnodes
  Disk Svr.      SLC 4.7/x86-64    2.1.7-19   80 in Q3 2009 (20+ in Q4)
  Tape Svr.      SLC 4.7/32 + 64   2.1.8-8    x86-64 OS deployed for new tape servers

13 Storage performance
- Environment: sequential I/O, dual channels, different cache sizes
- Results (IOPS)
  - With 0.5 kB IO size: 76.4k read and 54k write
  - Both read and write decrease slightly (around 9%) when the IO size is increased to 4 kB (see the sketch below)
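For scale, the IOPS figures translate into modest sequential throughput at these small IO sizes; the sketch below is plain arithmetic on the quoted numbers (the 4 kB values simply apply the stated ~9% drop), not a new measurement.

    # Convert the quoted IOPS figures to approximate throughput in MB/s.
    # The 4 kB numbers just apply the ~9% drop stated on the slide.

    def throughput_mb_s(iops, io_size_kb):
        """Throughput in MB/s for a given IOPS rate and IO size in kB."""
        return iops * io_size_kb / 1024.0

    read_05k, write_05k = 76.4e3, 54.0e3                    # IOPS at 0.5 kB
    read_4k, write_4k = 0.91 * read_05k, 0.91 * write_05k   # ~9% lower at 4 kB

    for label, iops, size_kb in [("read @0.5kB", read_05k, 0.5),
                                 ("write @0.5kB", write_05k, 0.5),
                                 ("read @4kB", read_4k, 4.0),
                                 ("write @4kB", write_4k, 4.0)]:
        print("%-13s %7.0f IOPS  ~%6.1f MB/s" % (label, iops, throughput_mb_s(iops, size_kb)))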

14 Roadmap - Host I/F (2009, Q1-Q4)
- 4G FC (≈ 400 MB/s)
- 8G FC (≈ 800 MB/s)
- SAS 3G (4-lane ≈ 1200 MB/s)
- SAS 6G (4-lane ≈ 2400 MB/s)
- U320 SCSI (≈ 320 MB/s)
- iSCSI 1 Gb
- iSCSI 10 Gb
- 3U/16-bay FC-SAS in May; 2U/12-bay and 4U/24-bay in June

15 Roadmap - Drive I/F (2009, Q1-Q4)
- 4G FC
- SAS 3G
- SAS 6G
- U320 SCSI
- SATA-II
- 2.5" SSD (B12F series)

16 Est. density (see the sketch below)
- 2009 H1: 1 TB drives, 1 rack (42U) = 240 TB
- 2009 H2: 2 TB drives, 1 rack (42U) = 480 TB
- 2010 H1: 2 TB drives, 1 rack (42U) = 480 TB
- 2010 H2: 3 TB drives, 1 rack (42U) = 720 TB
- 2012: 5 TB drives ...
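The per-rack totals correspond to 240 drive bays per 42U rack, which matches, for example, ten of the 4U/24-bay enclosures from the host-interface roadmap; that enclosure layout is an assumption, while the per-drive capacities are the slide's own numbers.

    # Reproduce the estimated rack density. Ten 4U/24-bay enclosures per
    # 42U rack (240 bays) is an assumed layout consistent with the roadmap;
    # the drive capacities per half-year are taken from the slide.

    RACK_U, ENCLOSURE_U, BAYS_PER_ENCLOSURE = 42, 4, 24
    bays_per_rack = (RACK_U // ENCLOSURE_U) * BAYS_PER_ENCLOSURE  # 10 * 24 = 240

    drive_tb = [("2009 H1", 1), ("2009 H2", 2), ("2010 H1", 2),
                ("2010 H2", 3), ("2012", 5)]

    for period, tb in drive_tb:
        print("%s: %d TB drives -> %d TB per rack" % (period, tb, tb * bays_per_rack))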

17 Pledged and future expansion
- 2008
  - 0.5 PB expansion of the tape system in Q2
  - Meet MOU target by mid Nov
  - 1.3 MSI2k per rack based on the recent E5450 processor
- 2009 Q1
  - 150 QC blade servers
  - 2 TB drives for the RAID subsystem: 42 TB net capacity per chassis, 0.75 PB in total (sketch below)
- 2009 Q3-4
  - 18 LTO4 drives: mid Oct
  - 330 Xeon QC (SMP, Intel 5450) blade servers
  - 2nd phase tape MES: 5 LTO4 drives + HA
  - 3rd phase tape MES: 6 LTO4 drives
  - 0.8 PB expansion delivery ETA: mid Nov
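A short sketch of the Q1 2009 disk arithmetic: a 24-bay chassis with 21 data drives (a RAID-6-style layout) is an assumption that happens to reproduce the quoted 42 TB net per chassis with 2 TB drives, and the chassis count just divides the 0.75 PB total by that figure.

    # Q1 2009 disk-expansion arithmetic. The 24-bay chassis with 21 data
    # drives is an assumption chosen to reproduce the quoted 42 TB net per
    # chassis; the 2 TB drive size and 0.75 PB total are the slide's figures.

    DRIVE_TB = 2
    DATA_DRIVES_PER_CHASSIS = 21                   # assumption: 24 bays minus parity/spares
    net_tb = DRIVE_TB * DATA_DRIVES_PER_CHASSIS    # 42 TB, matches the slide

    total_pb = 0.75
    chassis = total_pb * 1000 / net_tb
    print("net capacity per chassis: %d TB" % net_tb)
    print("chassis for %.2f PB: ~%.0f" % (total_pb, chassis))   # ~18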

18 Tape HA considered - Q4
- If an accessor (including its associated controllers) becomes faulty, the remaining active accessor takes over all work requests, including any in progress when the fault occurred
- [Diagram: dual-accessor library layout - active frames 1-4 with medium changer, accessor and XY controllers, operator panel, and service bays A/B]

19 Network Infrastructure

20 Issues
- Network infrastructure
  - Split edge level serving T1 MSS and T2 DPM disk servers
- Rotation shift: +1 FTE for 24x7 operation
  - Review instruction manuals
  - Regular meetings between OSE (shifters) and SM
  - Evaluate performance (eLog and escalated tickets)
- Release upgrade
  - Consider a gradual core service upgrade
  - Recommended components to start with?
  - Mimic on CTB

21 Incidents
- Power surges caused critical services to crash twice in the last 4 months
  - Disk and tape servers, RAC, and all core services
- Tape migration problem (CMS)
  - Wrong label type (0.2k cartridges); relabel empty tapes
- Controller failures
  - No regular pattern; kernel dump at firmware level
- Archiver log errors
  - 4 TB for SRM and CASTOR backups (1.6 TB/week)
  - New backup scratch space attached to the SAN

22 Upcoming expansion
- 1K LTO4 cartridges + 11 LTO4 drives
  - Move to a different datacenter area
  - Tape system HA setup
- vdqm2
  - Priority and better group reservation
- Evaluate platform for RAC
  - NFS based on NAS
- 2nd tier backup
  - Regular restore exercises (2nd backup on disk cache)
  - TSM setup complete; PoC continues until the end of Nov

