Review of Recent CASTOR Database Problems at RAL. Gordon D. Brown, Rutherford Appleton Laboratory. 3D/WLCG Workshop, CERN, Geneva, 11th-14th November 2008.



Overview: current setup, issues, lessons learnt, monitoring, future.

RAL CASTOR Architecture
Our setup is for:
– Atlas (Stager, SRM)
– CMS (Stager, SRM)
– LHCb (Stager, SRM)
– General (SRM)
– Name Server
– DLF
– Gen Stager
– Repack

RAL CASTOR Architecture
12 nodes to use
– Need production and test
Options included:
– Single instance (or small cluster) for each schema
– One huge RAC
– Combination of the above
Constraints
– Licenses
– Single points of failure (we did lose all paths at one point)
– Resources

RAL CASTOR Architecture
Outcome
– 2 x 5-node production clusters
– 1 x 2-node test cluster

Node      Schemas
neptune1  Atlas DLF, LHCb DLF
neptune2  Atlas SRM
neptune3  LHCb Stager
neptune4  LHCb SRM
neptune5  Atlas Stager
pluto1    Name server, CMS Stager
pluto2    CMS SRM
pluto3    Gen Stager
pluto4    Gen SRM, Repack
pluto5    CMS DLF, Gen DLF
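The node layout above can be written down as a small lookup table, which makes it easy to answer "which node hosts this schema?" when reading logs. This is only a sketch: the node and schema names come from the slide, while the dictionary and the helper `node_for` are hypothetical names introduced here for illustration.

```python
# Node-to-schema layout as listed on the slide.
# NODE_SCHEMAS and node_for are hypothetical names, not part of CASTOR.
NODE_SCHEMAS = {
    "neptune1": ["Atlas DLF", "LHCb DLF"],
    "neptune2": ["Atlas SRM"],
    "neptune3": ["LHCb Stager"],
    "neptune4": ["LHCb SRM"],
    "neptune5": ["Atlas Stager"],
    "pluto1": ["Name server", "CMS Stager"],
    "pluto2": ["CMS SRM"],
    "pluto3": ["Gen Stager"],
    "pluto4": ["Gen SRM", "Repack"],
    "pluto5": ["CMS DLF", "Gen DLF"],
}


def node_for(schema: str) -> str:
    """Return the node that hosts the given schema, or raise KeyError."""
    for node, schemas in NODE_SCHEMAS.items():
        if schema in schemas:
            return node
    raise KeyError(schema)


print(node_for("LHCb Stager"))  # prints "neptune3"
```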

RAL CASTOR Architecture
Oracle Enterprise RAC
– Production
– Test
– All clusters patched with the July CPU (Critical Patch Update)
Backups
– RMAN to disk
– Tape to Atlas Data Store
Monitoring
– Oracle Enterprise Manager
– Nagios and ganglia on machines

Village of CASTOR, Cambridgeshire, UK

Issues – “crosstalk”
Terminology
– SQL executing in wrong schema
Issue
– 14000 files lost on LHCb
Evidence
– Garbage collection on CASTOR
– “Deleting local file which is no longer in the stager catalog”
– Also in LHCb stager log: “No object found for id : ”
– This is in the Atlas files2delete table

Issues – “crosstalk”
Suspicion
– Not seen by Oracle in
– Redo logs inconclusive
– Lots of areas with possible wrong config
    – Disk server tnsnames entries
    – IP address for VIPs on database servers
    – Puppet config (on disk servers and central servers)
    – Connection to wrong schema
Outcome
– Synchronisation is suspended
– Haven’t recreated the problem
– Difficult for Oracle to analyse
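The cause of the crosstalk was never established, so the following is only a sketch of the kind of consistency check the suspected areas suggest: compare the connection target each disk server actually has for a schema alias against the target it is supposed to have. Every alias, host and service name below is a hypothetical example, as is the helper `find_mismatches`; real tnsnames files would first need parsing into this form.

```python
# All aliases, hosts and service names are hypothetical examples.
EXPECTED = {
    "lhcb_stager": ("lhcb-vip.example", "lhcb_stager_svc"),
    "atlas_stager": ("atlas-vip.example", "atlas_stager_svc"),
}


def find_mismatches(expected, actual):
    """Return the sorted aliases whose (host, service) is missing or differs."""
    return sorted(
        alias for alias, target in expected.items() if actual.get(alias) != target
    )


# Entries as they might be found on one disk server (illustrative data).
disk_server_entries = {
    "lhcb_stager": ("atlas-vip.example", "atlas_stager_svc"),  # points at Atlas
    "atlas_stager": ("atlas-vip.example", "atlas_stager_svc"),
}

print(find_mismatches(EXPECTED, disk_server_entries))  # prints ['lhcb_stager']
```

A check of this shape could run from configuration management on every disk server, so that an alias pointing at the wrong schema is caught before any SQL is executed against it.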

Issues – core dumping
Issue
– ORA-600 sometimes raised on deletes from the id2type table
– Happens twice a week on average
Evidence
– Seen on at least two stager schemas (and nodes)
– Application and Oracle logs
Outcome
– Application recovers
– SR open and RDA being performed

Issues – cursor invalidation
Issue
– Cursor invalidation detected after getting DML partition lock (ORA-14403)
Strangeness
– Oracle say resolved in (which we’re on!)
– Action from Oracle: “nothing to be done, error should never be returned to user”
– Cannot recreate at will
Outcome
– SR open
– Parameter to implement (needs instance restart)

Issues – constraint violations
Issue
– Violation of primary key constraint (ORA-00001)
– Seen on Atlas Stager id2type table
– Complicated
Outcome
– Implemented Eric’s code to trap the error and log it to the alert log (will be effective when existing Stager processes are restarted)
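Eric's code is not shown in the slides, so the following is not that code. It is a minimal sketch of the general trap-and-log pattern, using Python's built-in sqlite3 as a stand-in for Oracle: `sqlite3.IntegrityError` plays the role of ORA-00001, a plain list plays the role of the alert log, and the function name `insert_id2type` and the ID value are hypothetical.

```python
import sqlite3


def insert_id2type(conn, row_id, type_id, log):
    """Insert a row; on a duplicate key, record the error instead of raising."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO id2type (id, type) VALUES (?, ?)", (row_id, type_id)
            )
        return True
    except sqlite3.IntegrityError as exc:
        log.append(f"duplicate id {row_id}: {exc}")
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id2type (id INTEGER PRIMARY KEY, type INTEGER)")

alert_log = []  # stand-in for the database alert log
first = insert_id2type(conn, 1, 1002, alert_log)   # succeeds
second = insert_id2type(conn, 1, 1008, alert_log)  # same id: trapped and logged
print(first, second, len(alert_log))  # prints "True False 1"
```

Trapping the error keeps the application running while still leaving a record that can be correlated with other logs later.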

Issues – Big IDs
Issue
– Huge numbers appearing in INSERT statements
– Not from any sequences on the database
– Complicated
Example:
insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values (' ','1002');
insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values (' ','1008');
insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values (' ','1005');
insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values (' ','1002');
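One way to spot such rows, sketched under assumptions: compare each inserted ID against the highest value any database sequence has issued so far, and flag anything above it. The actual IDs are missing from the transcript, so the numbers below are purely illustrative, and `suspicious_ids` is a hypothetical helper.

```python
def suspicious_ids(ids, sequence_high_water_mark):
    """Return the IDs larger than the highest value any sequence has issued."""
    return [i for i in ids if i > sequence_high_water_mark]


# Illustrative values only; the real IDs are not available in the transcript.
observed = [1001, 1002, 10**18 + 7]
print(suspicious_ids(observed, sequence_high_water_mark=5000))
# prints [1000000000000000007]
```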

Issues – performance
Issue 1
– Stale statistics appeared even though gathered
– Noticed because of poor performance
– Re-gathered, pool flushed and all fine
Issue 2
– Well-used SQL query time degraded on Stager (by 300%)
– New SQL Profile improved performance again
– Due to stats on fluctuating tables?
– Cluster waits on Atlas, high network I/O in Atlas/LHCb

Issues – performance
Issue 3
– CPU load increasing over 3-4 days
– Bonny cleared up subrequest table
– Shrank table and it was solved

CASTOR Oil Plant

Monitoring
DB Load
– Difficult to know if linked to requests/files
– Tools showing CASTOR “load” would be useful
– Is the application “good” at being on RAC?
Oracle Services
– Currently one “preferred” node and one “available” node for each schema
– Stagers fail over to the SRM node, for example
– Are two nodes per Stager better?
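The preferred/available arrangement can be modelled in a few lines to reason about where a schema's service runs when nodes go down. This is a sketch only: the pairing of the Atlas Stager with the Atlas SRM node is an assumption based on "Stagers fail over to SRM", and `SERVICES` and `active_node` are hypothetical names, not Oracle APIs.

```python
# schema: (preferred node, available node).
# The pairing is an assumption for illustration, not taken from the slides.
SERVICES = {
    "Atlas Stager": ("neptune5", "neptune2"),
}


def active_node(schema, nodes_up):
    """Return the node serving the schema, or None if both nodes are down."""
    preferred, available = SERVICES[schema]
    if preferred in nodes_up:
        return preferred
    if available in nodes_up:
        return available
    return None


print(active_node("Atlas Stager", {"neptune2", "neptune5"}))  # prints "neptune5"
print(active_node("Atlas Stager", {"neptune2"}))              # prints "neptune2"
```

The model also makes the open question concrete: after a failover the Stager shares a node with the SRM schema, so the load on that node is the sum of both.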

Lessons Learnt 1
Machine configuration
– Be careful with tnsnames
– IP and VIP addresses need care
– Hardware should be similar
– Schema names are similar
Database Administration
– We can add/remove a cluster node without downtime
– Tuning, shrinking and profiles experience
– Log miner skills

Lessons Learnt 2
Volume
– Very high number of transactions
– 200GB of archive redo logs per day (DB on 80GB)
– Recovery would be an issue? Image copies?
– Need lots of space for log miner
Space
– Space needed for analysis (e.g. log miner)
– More space needed for redo logs/backups
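The volume figures translate directly into disk requirements. The sketch below uses the two numbers from the slide (200GB of archived redo per day, 80GB database); the 7-day retention period is an assumption chosen only to show the arithmetic.

```python
REDO_GB_PER_DAY = 200  # from the slide
DB_SIZE_GB = 80        # from the slide


def redo_space_gb(days_retained):
    """Disk space needed to keep the given number of days of archived redo."""
    return REDO_GB_PER_DAY * days_retained


ratio = REDO_GB_PER_DAY / DB_SIZE_GB
print(ratio)             # prints 2.5: daily redo is 2.5 times the database size
print(redo_space_gb(7))  # prints 1400: GB for an assumed 7-day retention
```

With daily redo at 2.5 times the database size, a recovery that has to apply many days of archived logs reads far more data than the database itself holds, which is the concern behind the image-copy question.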

CASTOR River, Ontario, Canada

People
DBAs
– Team of four
– Good to share skills and experience
– Not enough knowledge of application
– Pressure
CASTOR team at RAL
– Excellent communication with DBAs
– Gained knowledge of databases
– Difficult to know if database or application at fault

People
CERN and other Tier-1s
– Invaluable support
– Good communication via lists
– Thanks!
– More work together for future architecture
– Wiki page appreciated
Oracle
– Metalink support has been very good

Next Steps
Set-up
– Moving to single instance for 2-3 weeks
– Don’t change too much at once!
– Difficult to rule out DB issues
– Hardware resilience
– Auditing? Overhead.
Performance
– Any more data to clean out?
– Tune more SQL
– More tests on failover
– Backup/recovery
– Proactivity

CASTOR star in Gemini (second brightest)

Questions for CERN/Tier-1s
CASTOR Reporting Tools
– Shaun has produced stats on SRM showing transactions
– What do others use?
– What would be useful?
Monitoring
– What do you monitor (DB and application)?
– What’s important in the logs?
– Any custom threshold alerts in OEM/lemon?

Questions for CERN/Tier-1s
Database
– Do you gather stats every night? Full?
– Any other regular DB jobs? Shrinking?
– Amount of transactions/redo logs?
– CPU levels?
– Plans for 11g?
– Backups: full? Level 1? Validate every night?
People
– How many DBAs (working on CASTOR)?
– DBAs’ knowledge of application?
– 3D/CASTOR Collaboration

Questions and (hopefully) Answers