Database Services at CERN: Status Update
Maria Girone, CERN IT-PSS
Database Service Evolution
- Until summer 2005
  - Solaris based shared Physics DB cluster (2 nodes for HA)
    - Low CPU power, hard to extend, shared by all experiments
  - (Many) Linux disk servers as DB servers
    - High maintenance load, no resource sharing, no redundancy
- Now consolidation on extensible database clusters
  - No sharing across experiments
  - Higher quality building blocks
    - Midrange PCs (RedHat ES)
    - FibreChannel attached disk arrays
- As of last month: all LHC services moved
LCG Database Workshop, Maria Girone
Service Architecture - Oracle Database Clusters
- The Physics Database Production and Validation services are deployed on 2-node RAC/Linux, in failover mode
Experience with RAC availability
- Managed to apply Oracle security patches in rolling fashion
  - Big step to decrease planned downtime
  - Need timely patch information from Oracle
- Most RAC based services stayed up during last power cut; service is now on critical power
  - Investigating some glitches on ATLAS RAC nodes
- Startup after a service problem is significantly faster than with the old disk-server based services
DB Storage Configuration (in production)
[Diagram: Oracle RAC nodes (DB N.1, DB N.2) attached to storage arrays; ASM disk groups Data DG-1, Recovery DG-1, Data DG-2, Recovery DG-2]
- Disk groups created with ‘horizontal’ slicing
- Benefits: more effective use of available storage
  - High availability - allows backups to be kept on disk
  - Higher performance (30%-50%) - allows clusterware mirroring
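The slide does not spell out how the slices map to disk groups, so the following is only one possible reading of the diagram, sketched in Python. It assumes every disk is split into two partitions (here called `outer` and `inner`), that the arrays alternate between two sets, and that each database keeps its data and recovery disk groups on different sets of arrays. The function name, the array and disk labels, and the `DATA_DG`/`RECO_DG` names are hypothetical.

```python
from collections import defaultdict


def horizontal_slices(arrays, disks_per_array):
    """Assign the two partitions of every disk to ASM-style disk groups.

    Assumptions (not stated on the slide): each disk has an 'outer' and an
    'inner' partition, arrays alternate between set 1 and set 2, and the
    recovery group of a database lives on the other set than its data group.
    """
    groups = defaultdict(list)
    for a in range(arrays):
        set_no = a % 2 + 1      # this array belongs to set 1 or set 2
        other = 3 - set_no      # the database whose recovery area lands here
        for d in range(disks_per_array):
            disk = f"array{a + 1}/disk{d + 1}"
            groups[f"DATA_DG{set_no}"].append(f"{disk}:outer")
            groups[f"RECO_DG{other}"].append(f"{disk}:inner")
    return dict(groups)


layout = horizontal_slices(arrays=4, disks_per_array=2)
for name in sorted(layout):
    print(name, layout[name])
```

Under these assumptions every disk contributes to both a data and a recovery group, and losing one set of arrays never removes both the data and the on-disk backups of the same database.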
Service Throttling - Resource Usage Reports
- Ran into degraded service after a single remote user submitted many (idle) jobs
  - Defined account profile for larger apps
    - DB accounts are shared among many users
  - Switched on idle session “sniping” (default = 3h idle time)
- Proposing (e.g. weekly) resource overview to experiment database coordinator
  - Allow experiment to prioritize resources and identify unexpected usage patterns
  - Which jobs/users got affected by what limit?
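As an illustration of the account profile and idle-session limit mentioned above, here is a minimal Python sketch that builds the corresponding Oracle statements as strings. It does not reproduce the service's actual configuration: the profile name, account name and session limit are hypothetical. Oracle's `IDLE_TIME` limit is expressed in minutes, so the 3h default becomes 180, and profile resource limits are only enforced when the `RESOURCE_LIMIT` initialization parameter is set to `TRUE`.

```python
def profile_ddl(profile, idle_minutes=180, sessions_per_user=None):
    """Build a CREATE PROFILE statement with an idle-time limit.

    idle_minutes defaults to 180, i.e. the 3h idle time from the slide.
    sessions_per_user is an optional extra limit (illustrative only).
    """
    if not profile.isidentifier():
        raise ValueError(f"unsafe profile name: {profile!r}")
    limits = [f"IDLE_TIME {int(idle_minutes)}"]
    if sessions_per_user is not None:
        limits.append(f"SESSIONS_PER_USER {int(sessions_per_user)}")
    return f"CREATE PROFILE {profile} LIMIT " + " ".join(limits)


def assign_ddl(account, profile):
    """Build the statement that attaches a profile to a database account."""
    for name in (account, profile):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"ALTER USER {account} PROFILE {profile}"


# Hypothetical names, for illustration only.
print(profile_ddl("large_app_profile", sessions_per_user=50))
print(assign_ddl("expt_app", "large_app_profile"))
```

Because DB accounts are shared among many users, a limit of this kind applies to the account as a whole, which is one reason a per-limit usage report helps show who was affected.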
RAC Hardware evolution for 2006
- Linear ramp-up budgeted for hardware resources in 2006-2008
- Planning next major service extension for Q3 this year
Table columns: ALICE, ATLAS, CMS, LHCb, Grid, 3D, Non-LHC, Validation
Current State: - 2-node offline 2-node 2x2-node 2-node online test Pilot on disk server
Proposed structure in Q2 2006: 4-node 4-node 2-node (PDB replacement) 2-node valid/test 2-node pilot Compass?? Online?
RAC Expansion for Q2
- New mid-range servers received and installed
  - Passed acceptance tests by IT-FIO
- Waiting for additional disk arrays and fibre channel switches
  - Expect delivery end of February
- Planning the setup in collaboration with IT-FIO
- Proceed in two steps
  - February: extension of existing RACs with additional CPUs
    - Cabling work for fibre channel and IP networks has started
  - March: creation of new RACs, e.g. dedicated experiment validation servers, after disk arrays and switches have arrived
Moving to 10gR2
- Proceed with move to 10gR2 as main production platform for 2006
- Planning with IT-DES to migrate development service for experiments to 10gR2 this month
- Plan to set up new RAC servers with 10gR2
  - Will start with validation setups
- Plan to migrate production service to new release as soon as experiments have validated their apps on dev or validation service
  - Target: complete move by end of March
Backup Strategy - Review with Experiments
- Default backup retention policy and frequency need review by experiments
  - Backup schedule: is the default of two full backups sufficient?
  - Is the latency of a partial or full recovery acceptable?
- Can we reduce the fraction of active writeable data?
  - And thereby backup volume and latency
  - Impact on physical data organisation and applications
- Database backup/recovery at Tier 1’s
  - Any experiment requirements on latency to recover?
  - Impact on Tier 0 services for replicated data
- Propose to set up meetings with experiment database coordinators, document an agreed strategy and present it at the next workshop (summer)
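To make the link between the writeable fraction and backup volume concrete, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions that are not in the slides: "two full backups" is read as two retained full copies, read-only data is backed up once, and writeable data is included in every retained copy. The function name and the 10 TB figure are purely illustrative.

```python
def backup_volume_tb(db_size_tb, writable_fraction, full_backups=2):
    """Estimate on-disk backup volume in TB.

    Assumption: read-only data needs a single backup copy, while writeable
    data is copied into each of the `full_backups` retained full backups.
    """
    if not 0.0 <= writable_fraction <= 1.0:
        raise ValueError("writable_fraction must be between 0 and 1")
    read_only_tb = db_size_tb * (1.0 - writable_fraction)
    writable_tb = db_size_tb * writable_fraction
    return read_only_tb + full_backups * writable_tb


# Illustrative numbers only: a 10 TB database at different writeable fractions.
for fraction in (1.0, 0.5, 0.25):
    print(fraction, backup_volume_tb(10.0, fraction))
```

In this simple model the backup volume falls linearly as data is moved to read-only storage, which is why the physical data organisation question matters for both volume and recovery latency.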
Summary
- LCG database services now fully based on RAC
  - Benefits of consolidation and additional flexibility obtained
- Q2 database extension proceeding as planned
  - Dedicated experiment database clusters will double in CPU power
  - Dedicated validation resources will simplify planning
  - Second hardware extension (Q3) will need to go out soon
- Need to regularly plan evolution with experiment database coordinators
  - Regular resource usage reports could be a good basis
  - Get started with backup and recovery strategy discussions