Schrödinger’s Backup: Will your recovery work?


Schrödinger’s Backup: Will your recovery work?
Patrick Flynn
Brisbane | 27 May 2017
#630 | Brisbane 2017


Who am I
Patrick Flynn
MCM: SQL Server 2008
MCSM: Data Platform
Production DBA for 10+ years
Twitter: @sqllensman
Email: sqllensman@outlook.com

Agenda
- Schrödinger’s Backup
- Restore Strategy Requirements
- Demos

Schrödinger’s cat
A thought experiment devised by Austrian physicist Erwin Schrödinger in 1935. A cat, a flask of poison, and a radioactive source are placed in a sealed box. If an internal monitor detects radioactivity, the flask is shattered, releasing the poison that kills the cat. While the box is closed, the cat can be thought of as both alive and dead. Only when the box is opened can its actual state be determined.

Schrödinger’s Backup
- Not testing your recovery plan means unknowingly running a Schrödinger’s backup experiment.
- Until it is tested, a backup can be either good or bad.
- Only by completing a restore can you be assured that your backup was valid.
- A failed Schrödinger’s backup experiment will often become an RGE (Resume Generating Event).

Example: GitLab.com (used by 100,000+ organisations), January 31 2017
A tired sysadmin, working late at night in the Netherlands, accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated.

Problems encountered:
- LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage.
- Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
- Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
- The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented.
- Our backups to S3 apparently don’t work either: the bucket is empty.
- We don’t have solid alerting/paging for when backups fail; we are seeing this in the dev host too now.

So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place. => we're now restoring a backup from 6 hours ago that worked
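
Two of the problems listed above (backup files only a few bytes in size, and an empty backup location) are the kind of thing a simple automated check can surface. The following is a minimal sketch in Python, not anything GitLab or the presenter used: the function name, the 1024-byte threshold, and the message wording are all assumptions chosen for illustration.

```python
from pathlib import Path

# Hypothetical threshold: any backup file smaller than this is treated as suspect.
MIN_BACKUP_BYTES = 1024


def find_suspect_backups(backup_dir, min_bytes=MIN_BACKUP_BYTES):
    """Return a list of problems found in backup_dir (an empty list means none found)."""
    directory = Path(backup_dir)
    if not directory.is_dir():
        return [f"{directory}: backup directory does not exist"]

    files = sorted(p for p in directory.iterdir() if p.is_file())
    if not files:
        # Mirrors the "bucket is empty" failure: no backups at all.
        return [f"{directory}: no backup files found"]

    problems = []
    for path in files:
        size = path.stat().st_size
        if size < min_bytes:
            # Mirrors the "files only a few bytes in size" failure.
            problems.append(f"{path.name}: only {size} bytes")
    return problems
```

A check like this only proves that files exist and are not trivially small. It says nothing about whether they can be restored, which is the point of the rest of the talk.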

Restore Strategy Requires
- A defined RPO and RTO
- Documented processes and procedures
- Regular and correct backups
- A tested restore process (automated and manual)

RPO: Recovery Point Objective (how much data loss is acceptable)
RTO: Recovery Time Objective (how much downtime is acceptable)

A Defined RPO and RTO
RPO (Recovery Point Objective): in terms of time, how much data are we willing to lose?
RTO (Recovery Time Objective): our goal for how quickly we can restore the database.
These are a business decision!

You need to know how long a restore will take if everything but the backup files is gone. This includes:
- Time to get the backup files from offsite / tape / Data Domain / NAS etc.
- Time to copy them to the restore location
- Time to bring up and configure a server, if you don’t have one to restore to
- Time the restore itself will take
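
The restore-time components listed above add up to a single number that can be compared against the RTO. The sketch below shows that arithmetic in Python. The class name, field names, and the figures in the usage example are hypothetical; real values would come from timing an actual restore.

```python
from dataclasses import dataclass


@dataclass
class RestoreEstimate:
    """Estimated minutes for each stage of a restore from backup files only."""
    retrieve_backups_min: float   # fetch files from offsite / tape / NAS
    copy_to_target_min: float     # copy files to the restore location
    provision_server_min: float   # build and configure a server (0 if one is ready)
    restore_min: float            # run the restore itself

    def total_minutes(self):
        return (self.retrieve_backups_min + self.copy_to_target_min
                + self.provision_server_min + self.restore_min)


def meets_rto(estimate, rto_minutes):
    """True if the estimated end-to-end restore time fits within the RTO."""
    return estimate.total_minutes() <= rto_minutes
```

For example, `RestoreEstimate(60, 30, 120, 90)` totals 300 minutes, so it would miss a 4-hour (240-minute) RTO even though the restore step alone takes only 90 minutes.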

Documented Processes and Procedures
The restore strategy must be documented:
- How and where to restore the data
- Currently used versions of relevant software
- Contact information
- Must be kept up to date
- Must be tested

Having the right data selected for backup, and backups running regularly and correctly, is still only part of knowing that you have a good backup that can get you back up and running. You must document the procedure for how and where to restore the data and/or application that you are backing up.

Restore documentation shouldn’t just contain information about the restore. It should also include hard copies of currently used versions of relevant software, serial numbers for any software, contact numbers for support, and support contract reference numbers. Any documentation should carry the date it was last updated and tested, because wrong documentation can be worse than no documentation at all. All documentation should have a glossary page explaining the acronyms, so that the person reading it understands what the person who wrote it was trying to say.

The restore procedure should also be tested periodically. During a test, follow the documentation to the letter, note any changes it needs, and update it.

Think of it like a smoke alarm: you are supposed to test your smoke alarms when you change your clocks for daylight saving time. Making sure that you are alerted that the building is burning down needs to be at the very least as important as knowing that, if the building did burn down, you could restore your data.

Regular and Correct Backups
- Back up your databases
- Use CHECKSUM on backups
- Run RESTORE VERIFYONLY

How do you know when your backups aren’t successful?
- Alerts on failure
- Run reports to check for missing backups

Trace flag 3023 (makes CHECKSUM the default for backups)
Backup reports: sp_Blitz, SQL script, DBA Reports
Backup solutions: Maintenance Plans, Ola Hallengren, Minion Backup

Related session: Building a Centralised Database Maintenance and Monitoring Solution, Manohar Punna, 3:45pm
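
A report that checks for missing backups comes down to comparing each database’s last successful backup time against an allowed age. The sketch below illustrates that logic in Python and is not taken from any of the tools named above. The function name and the input dictionary are hypothetical; with SQL Server the data might be read from backup history such as `msdb.dbo.backupset`.

```python
from datetime import datetime, timedelta


def find_stale_backups(last_backup_by_db, now, max_age):
    """Return sorted names of databases with no backup, or none within max_age.

    last_backup_by_db maps database name -> datetime of the last successful
    backup, or None if the database has never been backed up.
    """
    stale = []
    for name, finished_at in last_backup_by_db.items():
        if finished_at is None or now - finished_at > max_age:
            stale.append(name)
    return sorted(stale)
```

A scheduled job could call this with `max_age` set from the RPO and raise an alert whenever the returned list is not empty.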

Regular and Correct Backups: the 3-2-1 Backup Rule
How much do you lose if even just one backup file goes bad?

The accepted rule for backup best practices is the three-two-one rule. It can be summarized as: if you’re backing something up, you should have
- at least three copies,
- in two different formats,
- with one of those copies off-site.
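
The three conditions of the rule are easy to check mechanically once the copies of a backup are inventoried. The following Python sketch shows one way to express that. The `BackupCopy` type, its fields, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str   # where the copy lives, e.g. "local-disk", "nas", "cloud"
    fmt: str        # storage format, e.g. "disk", "tape", "object-storage"
    offsite: bool   # True if the copy is stored away from the primary site


def check_3_2_1(copies):
    """Return a list of rule violations; an empty list means the rule is met."""
    violations = []
    if len(copies) < 3:
        violations.append("fewer than three copies")
    if len({c.fmt for c in copies}) < 2:
        violations.append("fewer than two different formats")
    if not any(c.offsite for c in copies):
        violations.append("no off-site copy")
    return violations
```

Two disk copies in the same building fail all three conditions. Adding an off-site tape copy satisfies the rule.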

Demos of Backup Scripts
- Maintenance Plans
- Ola Hallengren
- Minion Backup

Tested Restore Process
You must test your restores.

Automated restore testing
- Can include CHECKDB
- Restore to test environments

Manual testing
- Fire drills
- Page-level restores, filegroup restores etc.
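
An automated restore test follows a fixed pattern: restore each backup somewhere safe, run an integrity check on the result, and record the outcome. The Python sketch below shows that control flow only. `run_restore_tests` and the two callables it takes are hypothetical; with SQL Server they might wrap `RESTORE DATABASE` and `DBCC CHECKDB`.

```python
def run_restore_tests(backups, restore, check_integrity):
    """Restore each backup and check the result; return {backup: outcome}.

    restore(backup) returns a handle to the restored copy or raises on failure.
    check_integrity(restored) returns True if the restored copy is sound.
    """
    results = {}
    for backup in backups:
        try:
            restored = restore(backup)
            ok = check_integrity(restored)
            results[backup] = "ok" if ok else "integrity check failed"
        except Exception as exc:
            # A backup that cannot be restored is exactly what we want to find early.
            results[backup] = f"restore failed: {exc}"
    return results
```

Recording a result for every backup, rather than stopping at the first failure, gives a complete picture of which backups can actually be relied on.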

dbatools
dbatools is a free PowerShell module with over 200 SQL Server best practice, administration and migration commands included (an earlier description of the project listed 79 commands). It is a community project that started out as Start-SqlMigration and has grown into a DBA’s best friend.

How many of you have heard of dbatools? How many of you have used it? How many knew it was born in Belgium? :D


In Summary
- Build a restore strategy.
- Document it.
- Test it!

Questions

Resources

Backup Solutions
- http://minionware.net/backup/
- https://ola.hallengren.com/
- https://www.pluralsight.com/courses/sqlserver-database-maintenance-plans

Backup Monitoring
- https://github.com/BrentOzarULTD/SQL-Server-First-Responder-Kit
- https://dbareports.io/

Testing Restores
- https://dbatools.io/presentations/
- https://sqldbawithabeard.com/2017/03/20/testing-your-sql-server-backups-the-easy-way-with-powershell-dbatools/
- https://thomaslarock.com/2010/05/statistical-sampling-for-verifying-database-backups/

Resources (continued)

Scary DBA
- http://www.scarydba.com/2014/01/20/time-for-a-quick-rant/
- http://www.scarydba.com/2016/03/07/backups-are-a-business-decision/
- https://www.simple-talk.com/sql/backup-and-recovery/backup-verification-tips-for-database-backup-testing/

SQL Skills
- The Accidental DBA (Day 6 of 30): Backups: Understanding RTO and RPO
- The Accidental DBA (Day 7 of 30): Backups: Recovery Models and Backup Types
- The Accidental DBA (Day 8 of 30): Backups: Planning a Recovery Strategy
- The Accidental DBA (Day 9 of 30): Backups: Essential BACKUP Options
- The Accidental DBA (Day 10 of 30): Backups: Backup Testing for Validation
- The Accidental DBA (Day 11 of 30): Backups: Backup Storage and Retention