Active Data Guard at CERN


Active Data Guard at CERN Luca Canali – CERN Marcin Blaszczyk – CERN UKOUG Conference, Birmingham, 3rd-5th December 2012

Outline: CERN and Oracle; Architecture; Use Cases for ADG@CERN; Our experience with ADG and lessons learned; Conclusions.

CERN: European Organization for Nuclear Research, founded in 1954. 20 Member States, 7 Observer States plus UNESCO and the EU. 60 non-member states collaborate with CERN. 2400 staff members work at CERN as personnel, plus 10 000 more researchers from institutes world-wide. 5 Nobel Laureates.

LHC and Experiments: The Large Hadron Collider (LHC) is a particle accelerator used to collide beams at very high energy. It sits in a 27 km long circular tunnel located ~100 m underground. Protons currently travel at 99.9999972% of the speed of light. Collisions are analysed with special detectors and software in the experiments dedicated to the LHC. New particle discovered, consistent with the Higgs boson!
Note: the speed of protons at 4 TeV (4000 GeV) relative to the speed of light follows from E(v) = m(0)*c^2/sqrt(1-(v/c)^2) with m(0) = 0.938272029 GeV for protons:
select sqrt(1-1/power(4000/0.938272029,2)) from dual;
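The relation quoted on the slide can be checked outside the database as well. The following is a minimal Python sketch of the same arithmetic; the function and constant names are illustrative and not part of the original talk.

```python
import math

# Proton rest energy m(0)*c^2 in GeV, the value quoted on the slide.
PROTON_REST_ENERGY_GEV = 0.938272029


def beta_from_energy(total_energy_gev: float,
                     rest_energy_gev: float = PROTON_REST_ENERGY_GEV) -> float:
    """Return v/c for a particle with the given total energy.

    Rearranging E = m(0)*c^2 / sqrt(1 - (v/c)^2) gives
    v/c = sqrt(1 - (m(0)*c^2 / E)^2).
    """
    gamma = total_energy_gev / rest_energy_gev  # Lorentz factor
    return math.sqrt(1.0 - 1.0 / gamma ** 2)


beta = beta_from_energy(4000.0)  # 4 TeV per proton
print(f"{beta * 100:.7f}% of the speed of light")
```

Rounded to seven decimal places, the result agrees with the 99.9999972% figure given on the slide.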

WLCG: The world’s largest computing grid. More than 20 petabytes of data stored and analysed every year. Over 68 000 physical CPUs and over 305 000 logical CPUs. 157 computer centres in 36 countries. More than 8000 physicists with real-time access to LHC data.

Oracle at CERN: Relational DBs play a key role in the LHC production chains. Accelerator logging and monitoring systems. Online acquisition; offline data (re)processing, data distribution and analysis. Grid infrastructure and operation services: monitoring, dashboards, etc. Data management services: file catalogues, file transfers, etc. Metadata and transaction processing for the tape storage system.

CERN’s Databases: ~100 Oracle databases, most of them RAC. Mostly NAS storage plus some SAN with ASM. ~300 TB of data files for production DBs in total. Examples of critical production DBs: the LHC logging database (~140 TB, expected growth up to ~70 TB / year) and 13 production experiments’ databases (~120 TB in total). 15 Data Guard RAC clusters in production, run as Active Data Guard since the upgrade to 11g.

Active Data Guard – the Basics: ADG enables read-only access to the physical standby.

Active Data Guard @CERN: Replication: inside CERN from online (data acquisition) to offline (analysis); replication to remote sites (being considered). Load balancing: offloading queries; offloading backups. Features available also with Data Guard: disaster recovery; duplication of databases; others.

Architecture We Use (diagram): two configurations, both in Maximum Performance mode, with redo transport from the primary database. 1. Low load ADG: one Active Data Guard for users’ access and for disaster recovery. 2. Busy & critical ADG: one Active Data Guard for users’ access and a second Active Data Guard for disaster recovery. Redo transport setting:
LOG_ARCHIVE_DEST_X='SERVICE=<tns_alias> OPTIONAL ASYNC NOAFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=<standby_db_unique_name>'

Deployment Model: Production RAC systems: number of nodes 2-5, NAS storage with SSD cache. (Active) Data Guard RAC: number of nodes 2-4, sized for average production load, ASM on commodity HW, recycling ‘old production’ hardware.

Replication Use Cases: A number of replication use cases, inside CERN and across the Grid, so far handled by Oracle Streams: logical replication with messages called LCRs (Logical Change Records). Service started with 10gR1, now on 11gR2. (Diagram: online database, offline database, downstream database and remote sites’ databases, connected by redo transport and propagation.)

ADG vs. Streams for Our Use Cases. Active Data Guard: block level replication; more robust and lower latency; full database replica; less maintenance effort; more use cases than just replication; complex replication cases not covered; read-only replica. Streams: SQL based replication; more flexible; data selectivity; replica is accessible for read-write load; unsupported data types; throughput limitations; can be complicated to recover.

Offloading Queries to ADG: Workload distribution: transactional workload runs on the primary; read-only workload can be moved to ADG; for read-mostly workload, DMLs can be redirected to a remote database with a dblink. Examples of workload on our ADGs: ad-hoc queries, analytics and long-running reports, parallel queries, unpredictable workload and test queries.
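The workload split described above can be illustrated with a small routing helper. This is only a sketch under stated assumptions: the service names `prod_rw` and `prod_adg_ro` are hypothetical, and the talk does not say how applications at CERN pick their connection. In practice the split is usually made per application, through separate database services, rather than by inspecting SQL text.

```python
# Statements starting with these keywords are treated as queries.
QUERY_KEYWORDS = ("SELECT", "WITH")


def choose_service(sql: str,
                   primary: str = "prod_rw",          # hypothetical service name
                   standby: str = "prod_adg_ro") -> str:  # hypothetical service name
    """Pick a target service for a statement.

    Plain queries go to the read-only ADG service; anything that
    modifies data or locks rows (SELECT ... FOR UPDATE) goes to the primary.
    """
    text = " ".join(sql.split()).upper()  # normalise whitespace and case
    is_query = text.startswith(QUERY_KEYWORDS)
    locks_rows = "FOR UPDATE" in text
    return standby if is_query and not locks_rows else primary


print(choose_service("select count(*) from events"))
print(choose_service("update events set status = 'done' where id = 1"))
```

The keyword check is deliberately crude; it is meant to show the routing decision, not to be a SQL parser.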

Offloading Backups to ADG: Significantly reduces load on the primary by removing the sequential I/O of full backups. ADG is a great improvement for VLDBs: it allows the use of block change tracking for fast incremental backups.

Disaster Recovery: We have been using it for a few years. Switchover/failover is our first line of defense and has already saved the day for production services. Disaster recovery site at ~10 km from our datacenter; in the future, a remote site in Hungary. (* Active Data Guard option not required.)

Duplicate for Upgrade (diagram): the primary RAC database (Clusterware 10g + RDBMS 10g, RW access) ships redo to a Data Guard RAC database (Clusterware 11g + RDBMS 10g). After a short database downtime for the RDBMS upgrade, RW access moves to the new system (Clusterware 11g + RDBMS 11g). Upgrade complete!

Duplicate for Upgrade - Comments: Risk is limited: fresh installation of the new clusterware; the old system stays untouched; allows a full upgrade test; allows stress testing of the new system. Downtime is limited: ~1 h for the RDBMS upgrade. Additional hardware is required, but only for the time of the upgrade. (* Active Data Guard option not required.)

More Use Cases: Load testing: Real Application Testing (RAT). Recovering objects against logical corruption: human errors; leverage flashback logs; the additional writes may have a negative impact on the production database. Data lifecycle activities.

Snapshot Standby: Facilitates opening the standby in read-write mode. The number of steps is limited to two. A restore point is created internally by Oracle. The primary database is always protected: redo is still received while the standby is open.
SQL> ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;

Data Lifecycle Activities: Transport big data volumes from busy production DBs using Transportable Tablespaces (TTS) or Data Pump. Challenge: TTS requires read-only tablespaces. (Diagram: primary database, physical standby / snapshot standby and archival database, linked by redo transport, Data Pump and Transportable Tablespaces.)
SQL> ALTER TABLESPACE data_2011q4 READ ONLY;
SQL> ALTER TABLESPACE data_2011q3 READ ONLY;
SQL> ALTER TABLESPACE data_2011q2 READ ONLY;
SQL> ALTER TABLESPACE data_2011q1 READ ONLY;

Custom Monitoring (screenshots): Racmon monitors availability and latency, plus automatic MRP restart.

Custom Monitoring Strmmon

Sharing Production Experience: ADG in production since Q1 2012. Oracle 11.2.0.3, OS: RHEL5 64 bit. In the following: a few thoughts on how ADG is working in our environment, and experience from operations.

ADG for Replication: More robust than Streams: fewer incidents, less configuration, less manpower. Users like the low latency. More complex replication setups are still on Streams.

Offloading Backups to ADG: Problem: OLTP production is slow during backup. Latency for random I/O is critical to this system and is degraded by the sequential I/O of the backup. It’s a storage issue: midrange NAS storage.

Details of Storage Issue on Primary (chart of number of waits): very slow reads appear, and reads from the SSD cache go to zero.

Custom Backup Implementation: We have modified our backup implementation and adapted it for ADG. It was worth the effort for us, with many details to deal with. One example: archive log backup and deletion policies. Google “Szymon Skorupinski CERN openworld 2012” for details.

Backups with RMAN: We find backups with RMAN still a very good idea. Data Guard is not a backup solution. Automatic restore and recovery system: periodic checks that backups can be recovered. Block change tracking on ADG: incremental backups only read changed blocks, which is highly beneficial for backup performance.

Redo Apply Throughput: ADG redo apply is done by MRP. Noticeable improvement in speed in 11g vs. 10g. We see up to ~100 MB/s of redo processed, much higher than the typical redo generation in our DBs. Note on our configuration: standby logfiles, real-time apply and force logging; archiving to ADG using ASYNC NOAFFIRM.

Redo Apply Latency: Latency to ADG is normally no more than 1 sec. Apply latency spikes: ~20 sec, depending on workload; redo apply slaves wait for checkpoint completed (11g). Latency builds up when ADG is creating big datafiles: one redo apply slave creates the file, the rest of the slaves wait (11g). MRP (recovery) can stop with errors or crashes. Possible workarounds: minimize redo log switches by using large redo log files; create datafiles with a small initial size and then (auto)extend.
Note: MRP slave processes (prx) wait for 'checkpoint completed' at each redo log switch. This is shown by ASH:
select * from gv$active_session_history where event='checkpoint completed';
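The first workaround is simple arithmetic: the larger the redo log files, the fewer log switches per hour, and so the fewer checkpoint waits for the apply slaves. A minimal sketch follows; the 5 MB/s redo rate and the file sizes are made-up illustration values, not figures from CERN's systems.

```python
def log_switches_per_hour(redo_rate_mb_per_s: float, redo_log_size_mb: float) -> float:
    """Estimate redo log switches per hour for a steady redo generation rate."""
    if redo_rate_mb_per_s < 0 or redo_log_size_mb <= 0:
        raise ValueError("rate must be >= 0 and log size must be > 0")
    return redo_rate_mb_per_s * 3600.0 / redo_log_size_mb


# Hypothetical steady redo rate of 5 MB/s with three candidate log file sizes.
for size_mb in (512, 4096, 16384):
    switches = log_switches_per_hour(5.0, size_mb)
    print(f"{size_mb:>6} MB logs: {switches:.1f} switches/hour")
```

Real redo generation is bursty, so this gives an average rather than a worst case.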

Latency and Applications: Currently, alarms go to the on-call DBAs when the primary-to-ADG latency is greater than a programmable threshold. What if we had to guarantee latency? Applications would perhaps need to be modified to make them latency-aware. Useful techniques to know:
ALTER SESSION SET STANDBY_MAX_DATA_DELAY=<n>;
ALTER SESSION SYNC WITH PRIMARY; -- SYNC only!
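A threshold alarm of this kind can be sketched in a few lines. The example assumes the lag is read as an interval string of the form `+DD HH:MI:SS`, which is how the `apply lag` row of `V$DATAGUARD_STATS` reports its value as far as we are aware; the function names and the 60-second threshold are hypothetical, and this is not the Racmon implementation.

```python
import re

# Interval string such as "+00 00:00:20" (days, hours, minutes, seconds).
_LAG_PATTERN = re.compile(r"^\+(\d+) (\d{2}):(\d{2}):(\d{2})$")


def lag_seconds(value: str) -> int:
    """Convert a '+DD HH:MI:SS' lag string into a number of seconds."""
    match = _LAG_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised lag value: {value!r}")
    days, hours, minutes, seconds = (int(group) for group in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def should_alarm(value: str, threshold_s: int) -> bool:
    """Return True when the reported lag exceeds the configured threshold."""
    return lag_seconds(value) > threshold_s


# Hypothetical threshold of 60 seconds.
print(should_alarm("+00 00:00:20", 60))
print(should_alarm("+00 00:02:00", 60))
```

Fetching the value from the standby and notifying the on-call DBA are left out, since both depend on the local monitoring setup.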

Administration: In most cases ADG doesn’t require much DBA time after the initial setup. It is not much different from handling production DBs, although we see bugs and issues specific to ADG. Two examples follow in the next slides.

Failing Queries on ADG: Sporadic ORA-1555 (snapshot too old) on ADG. This is normal behaviour if queries need read-consistent images outside the undo retention threshold, but this case is anomalous: short queries. Under investigation; it seems to be a bug. Example: ORA-01555 caused by SQL statement below (SQL ID: …, Query Duration=1 sec, SCN: …).
Note: data blocks in ADG don’t receive (commit time) cleanout by default in 11g, as redo for that is not logged unless the following is set:
alter system set "_log_committime_block_cleanout"=true scope=memory sid='*';
So cleanout is done on ADG each time a block is read. 11gR2 has the concept of minimum active SCN to reduce the work needed for block cleanout.

Stuck Recovery and Lost Write: Issue: ORA-600 [3020] “Redo is inconsistent with data block”. Critical production system with 2 ADGs. Both ADGs stop applying redo at the same time... on a Friday night! The primary keeps working. Analysis: the possible cause is a lost write on the primary. Corruption in one index on the primary, rebuilt online by the on-call DBA. Is the root cause a hardware issue or an Oracle bug? Impact: redo apply on ADG is stuck while the issue is being fixed by the DBA.

Some Thoughts on Lost Write Issue: A complex recovery case. See also note: Resolving ORA-752 or ORA-600 [3020] During Standby Recovery [ID 1265884.1]. How to be proactive? Set the parameter DB_LOST_WRITE_PROTECT=TYPICAL. It is a warning mechanism, not a fix for lost writes. Impact: just a few percent increase in redo log writes. Should I worry that my primary DB and ADG are not fully synchronized? It is not easy to check. Note that ADG and primary are not exact binary copies. Example: commit cleanout is not logged on the primary by default, unless "_log_committime_block_cleanout"=true is set (default is false).

Plans for the Future: Deploy ADG for replication to remote sites: source at CERN, ADG at another grid site, with ADGs under administration by the remote sites’ DBAs. Challenges: managing a heterogeneous environment; redo transport over WAN; maintaining a partial standby.

Features Under Investigation: Data Guard Broker; fast-start failover; partial standby configuration; cascading standby; synchronous redo transport; ADG in 12c.

Conclusions: Active Data Guard provides added value to our database services, with almost 1 year of production experience. In particular, ADG has helped us strengthen our replication deployments and offload production workload, including backups. We find ADG has low maintenance overhead, although it requires some additional effort in monitoring, and we found a few bugs on the way. We find ADG mature, although we also look forward to additional improvements in the next releases. We are planning to extend our usage of ADG.

Acknowledgements: CERN Database Group. Experiments community at CERN. Oracle contacts: Monica Marinucci, Greg Doherty, Larry Carpenter, for discussions.

Thank you! Luca.Canali@cern.ch Marcin.Blaszczyk@cern.ch