Distributed Backup And Disaster Recovery for AFS A work in progress Steve Simmons Dan Hyde University.

Slides:



Advertisements
Similar presentations
© 2006 DataCore Software Corp DataCore Traveller Travel in Time : Do More with Time The Continuous Protection and Recovery (CPR) Solution Time Optimized.
Advertisements

STANFORD UNIVERSITY INFORMATION TECHNOLOGY SERVICES IT Services Storage And Backup Low Cost Central Storage (LCCS) January 9,
Consistency and Replication Chapter 7 Part II Replica Management & Consistency Protocols.
 Management has become a multi-faceted complex task involving:  Storage Management  Content Management  Document Management  Quota Management.
Backing Up a Hard Disk CGS2564. Why Backup Programs? Faster Optimized to copy files Can specify only files that have changed Safer Can verify backed up.
7.1 Advanced Operating Systems Versioning File Systems Someone has typed: rm -r * However, he has been in the wrong directory. What can be done? Typical.
11 BACKING UP AND RESTORING DATA Chapter 4. Chapter 4: BACKING UP AND RESTORING DATA2 CHAPTER OVERVIEW Describe the various types of hardware used to.
1 Disk Based Disaster Recovery & Data Replication Solutions Gavin Cole Storage Consultant SEE.
FlareCo Ltd ALTER DATABASE AdventureWorks SET PARTNER FORCE_SERVICE_ALLOW_DATA_LOSS Slide 1.
Cse Feb-001 CSE 451 Section February 24, 2000 Project 3 – VM.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment Chapter 12: Managing and Implementing Backups and Disaster Recovery.
Chapter 7: Configuring Disks. 2/24 Objectives Learn about disk and file system configuration in Vista Learn how to manage storage Learn about the additional.
Implementing ISA Server Caching. Caching Overview ISA Server supports caching as a way to improve the speed of retrieving information from the Internet.
MIS 431 Chapter 71 Ch. 7: Advanced File Management System MIS 431 Created Spring 2006.
Efficiently Sharing Common Data HTCondor Week 2015 Zach Miller Center for High Throughput Computing Department of Computer Sciences.
Definition of terms Definition of terms Explain business conditions driving distributed databases Explain business conditions driving distributed databases.
Do MUCH More with Less Presented by: Jon Farley 2W Technologies.
MCTS Guide to Microsoft Windows Server 2008 Network Infrastructure Configuration Chapter 11 Managing and Monitoring a Windows Server 2008 Network.
®® Microsoft Windows 7 for Power Users Tutorial 10 Backing Up and Restoring Files.
CHAPTER 17 Configuring RMAN. Introduction to RMAN RMAN was introduced in Oracle 8.0. RMAN is Oracle’s tool for backup and recovery. RMAN is much more.
Module 8 Implementing Backup and Recovery. Module Overview Planning Backup and Recovery Backing Up Exchange Server 2010 Restoring Exchange Server 2010.
TNT Microsoft Exchange Server 2003 Disaster Recovery Michael J. Murphy TechNet Presenter
Online Data Backup Services from
NovaBACKUP 10 xSP Technical Training By: Nathan Fouarge
IBM TotalStorage ® IBM logo must not be moved, added to, or altered in any way. © 2007 IBM Corporation Break through with IBM TotalStorage Business Continuity.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 12: Managing and Implementing Backups and Disaster Recovery.
70-293: MCSE Guide to Planning a Microsoft Windows Server 2003 Network, Enhanced Chapter 14: Problem Recovery.
1 Objectives Discuss the Windows Printer Model and how it is implemented in Windows Server 2008 Install the Print Services components of Windows Server.
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 6 – RAID ©Manuel Rodriguez.
PPOUG, 05-OCT-01 Agenda RMAN Architecture Why Use RMAN? Implementation Decisions RMAN Oracle9i New Features.
Course 6425A Module 9: Implementing an Active Directory Domain Services Maintenance Plan Presentation: 55 minutes Lab: 75 minutes This module helps students.
Day 10 Hardware Fault Tolerance RAID. High availability All servers should be on UPSs –2 Types Smart UPS –Serial cable connects from UPS to computer.
Module 13: Configuring Availability of Network Resources and Content.
Business Continuity and Disaster Recovery Chapter 8 Part 2 Pages 914 to 945.
Backup and Restore CPTE 433 John Beckett. Why Back Up? So you can restore later! SLA Restore Policy Backup Policy Backup Schedule.
Gorman, Stubbs, & CEP Inc. 1 Introduction to Operating Systems Lesson 12 Windows 2000 Server.
Chapter 7 Making Backups with RMAN. Objectives Explain backup sets and image copies RMAN Backup modes’ Types of files backed up Backup destinations Specifying.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 12: Managing and Implementing Backups and Disaster Recovery.
Chapter 18: Windows Server 2008 R2 and Active Directory Backup and Maintenance BAI617.
Chapter 8 Implementing Disaster Recovery and High Availability Hands-On Virtual Computing.
Maintaining File Services. Shadow Copies of Shared Folders Automatically retains copies of files on a server from specific points in time Prevents administrators.
©2006 Merge eMed. All Rights Reserved. Energize Your Workflow 2006 User Group Meeting May 7-9, 2006 Disaster Recovery Michael Leonard.
| nectar.org.au NECTAR TRAINING Module 9 Backing up & Packing up.
Stephen Dart LaRDS Service Manager Monash e-Research Centre LaRDS Staging Post Enhancing Workgroup Productivity.
Module 9 Planning a Disaster Recovery Solution. Module Overview Planning for Disaster Mitigation Planning Exchange Server Backup Planning Exchange Server.
IT253: Computer Organization
11 DISASTER RECOVERY Chapter 13. Chapter 13: DISASTER RECOVERY2 OVERVIEW  Back up server data using the Backup utility and the Ntbackup command  Restore.
1 Week #10Business Continuity Backing Up Data Configuring Shadow Copies Providing Server and Service Availability.
Diagnostic Pathfinder for Instructors. Diagnostic Pathfinder Local File vs. Database Normal operations Expert operations Admin operations.
Component 8/Unit 9bHealth IT Workforce Curriculum Version 1.0 Fall Installation and Maintenance of Health IT Systems Unit 9b Creating Fault Tolerant.
Chapter 10 Chapter 10: Managing the Distributed File System, Disk Quotas, and Software Installation.
“Building Journaling Databases with PostgreSQL” Cybertec Geschwinde & Schoenig Hans-Juergen Schoenig
Introduction to AFS IMSA Intersession 2003 An Overview of AFS Brian Sebby, IMSA ’96 Copyright 2003 by Brian Sebby, Copies of these slides.
Hands-On Microsoft Windows Server 2008 Chapter 7 Configuring and Managing Data Storage.
© 2009 IBM Corporation Statements of IBM future plans and directions are provided for information purposes only. Plans and direction are subject to change.
Log Shipping, Mirroring, Replication and Clustering Which should I use? That depends on a few questions we must ask the user. We will go over these questions.
CommVault Architecture
Unit 8: Database and Storage Pool Backup and Recovery.
/usr/afs/bin/backup ● Always talks to the budb ● If given a -port, talks to a butc ● Run 'backup help' for commands.
A Solution for Maintaining File Integrity within an Online Data Archive Dan Scholes PDS Geosciences Node Washington University 1.
An introduction to JDM Pro Release 2.0
Database recovery contd…
Planning for Application Recovery
Integrating Disk into Backup for Faster Restores
Backup and Recovery for Hadoop: Plans, Survey and User Inputs
The Nitty-Gritty of Database Backups
ICOM 6005 – Database Management Systems Design
Page Replacement.
data backup & system report
Using the Cloud for Backup, Archiving & Disaster Recovery
Presentation transcript:

Distributed Backup And Disaster Recovery for AFS A work in progress Steve Simmons Dan Hyde University of Michigan

Background Assigned to work on backup and disaster recovery end of 2005 Disk-based backup desired to replace TSM Funding and political support were strong Policy was vague

Some Useless Stats 13 dataful servers, plus others Two vice partitions per server 9.8TB as of today, about 33% full Critical university services rely on AFS 245,686 volumes (not counting backups, clones, etc) Volumes up to 110GB actual usage - and users want much bigger

Process Confusion Problem: What is backup, and why do it? Solution: there are three separate needs –Restore of lost/deleted files - “file restore” –Recovery from hardware/software disaster –Long-term storage of critical data sets - “archive” Optimal solution for one is suboptimal for others

File Restore Save users from their errors Recover data when old user returns FOIA/HIPPA complicate the picture Predictability of backup and restore process A very slow process (days) is usually acceptable.

Disaster Recovery When a service is going to be out for a long time, it needs to be restored and quickly Involves files PLUS hardware, power, cooling, network, dogs, cats, professors… Always unplanned Only the most recent version is of interest As close to real-time data as possible

Archival Officially, not our issue Unofficially, we expect it to come up Timing of archive selected manually by user Retention time arbitrarily long (but FOIA, HIPPA, etc complicate things)

Backup To Disk For File Recovery Basic design done in Jan 06 afsdump.pl posted by Matthew Hoskins Heavily modified for our use Yes, we will post Some auxiliary tools as well

File Restore Basic Method For every afs server, a bfs server Incremental dumps arbitrarily deep n days to do a cycle Only 1/n full dumps per day Per-volume dump schedules that override default

Getting Started Initial backup is level 0 About 20x faster than ancient TSM (1.2 TB in 5 hour vs. 3+ days) After first backup, rewrite the db to spread future level 0s across the volumes ( reseq_dump ) Daily/full backup mix, all servers, under 1 hour

Actual Status Running in pilot today Expected to be in production in 30 days Lots more interesting tools to build, but core is done: –We can back up –We can recover Declare victory, start on disaster recovery

Quick Questions

vos shadows (history) vos move –same volumeID –deletes source volume and.backup vos copy –new volumeID; must not exist –doesn’t delete source volume nor.backup vos shadow –new or old (default) volumeID –doesn’t delete source volume nor.backup –not in VLDB –-incremental

vos shadows (changes) volIsShadow (controlled recovery) –new bit in volume header vos shadow –sets volIsShadow vos syncvldb –error if volIsShadow –override with vos syncvldb -forceshadows

vos shadows (future) Volume groups/families –rw, ro, backup, clone, parentID –where do shadows fit in? restoredFromID –“….to make sure that an incremental dump is not restored on top of something inappropriate” –dangers vos shadow a vos shadow b -incremental Database vs. listvldb/listvol/paper notes

Quick Questions

Disaster Recovery & Shadows One to one relationship of afs server to recovery server Spread afs and bfs servers across sites? Same host used for disaster recovery and disk-based backup (“file restore”) RAID configured for space, not speed Shadows don’t appear in volume db Incremental refresh of shadows is fast

When Disaster Hits Identify which server is down Promote all volumes on shadow server to production You’re back in service, with loss of only data since last refresh of shadow Incrementally update shadows as frequently as load allows - maybe 3-4 times a day

Gotchas I Must be careful to delete/create shadows when doing vos move vos moves don’t invalidate old backups, but shadows have problem (see later) slocate/updatedb is your friend Periodic reruns of reseq_dumps needed

Gotchas II Need a real db (will have soon) When 13 afs servers shadow/dump to 13 bfs servers simultaneously, bandwidth suffers Distributing level 0s speeds disk dumps, but makes tracking (and restore?) complex Per-volume dump schedules not implemented, but only a few days work

Into The Future We want clones of shadows Do the backups from the shadows, not the live volume Don’t use.backup, use clones - so 24-hour snapshots can be available even if a full backup takes longer than 24 hours

Use Clones Instead of Backups Clones should appear in db Let users access the multiple clones like.backup Have multiple.backups (clones) available

Allow many clones 30 at a minimum, hundreds is better Avoid performance issue by making many clones from the shadow, not the production When you move a shadow (it will happen), preserve the clones

Naming issues for many clones Must be comprehensible to users May have to be dynamic How to make available without training 400,000 people to do fs mkm