Recovery Management in QuickSilver
Roger Haskin, Yoni Malachi, Wayne Sawdon, and Gregory Chan
IBM Almaden Research Center



Introduction: Problem Domain
- Recovery management in distributed operating systems
- Trends in contemporary research: extensibility and distribution

Contemporary Recovery Techniques
- Timeouts
  - How to distinguish slow from dead?
- Connectionless protocols / stateless servers
  - Some actions can’t be made idempotent
  - Retries can cause problems
- Virtual circuits
  - Can’t handle multiple servers
- Replication
  - Too expensive for some uses
  - How to detect failures?

QuickSilver: What’s So Special?
- Fundamental trade-off: generality and efficiency (QuickSilver) vs. ease of use (Camelot, Argus, etc.)
- Transparency isn’t always best!

QuickSilver: Specs and Features
- Client-server model
- System services are processes
- IPC message passing
- More complicated set of failure modes (to handle more specific cases)
- Atomic transactions

Server Classes
- Common server classes:
  - Volatile (window manager)
  - Replicated + volatile (name server)
  - Recoverable (file server)
  - Long-running transactions need log support

Design Goals
- Programs should be resilient to external process and machine failure
- Server processes should contain their own recovery code
- Uniform system-wide architecture for recovery management
- Logically related activities must execute atomically

Transaction Structure
- Everything belongs to a transaction
- Globally unique transaction identifiers (tid)
- Each transaction has one owner and multiple participants
  - Owner can commit or abort
  - Participants can only abort
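
The ownership rule on this slide can be illustrated with a minimal Python sketch. It is not QuickSilver's actual interface: the `Transaction` class, the `new_tid` helper, and the tid layout (birth node plus a local sequence number) are assumptions made for illustration only.

```python
import itertools

# Hypothetical tid scheme: (birth node id, per-node sequence number).
# The slide only says tids are globally unique; this layout is an assumption.
_local_seq = itertools.count(1)


def new_tid(node_id):
    """Return a tid that is unique as long as node ids are unique."""
    return (node_id, next(_local_seq))


class Transaction:
    """One owner, any number of participants (illustrative model only)."""

    def __init__(self, tid, owner):
        self.tid = tid
        self.owner = owner
        self.participants = set()
        self.state = "active"

    def join(self, process):
        # A server that does work on behalf of the transaction becomes a participant.
        self.participants.add(process)

    def abort(self, caller):
        # Both the owner and any participant may abort.
        if caller != self.owner and caller not in self.participants:
            raise PermissionError(f"{caller} is not part of transaction {self.tid}")
        if self.state == "committed":
            raise RuntimeError("transaction already committed")
        self.state = "aborted"

    def commit(self, caller):
        # Only the owner may commit.
        if caller != self.owner:
            raise PermissionError(f"{caller} may abort but not commit")
        if self.state != "active":
            raise RuntimeError(f"transaction is {self.state}")
        self.state = "committed"
```

In this model a participant such as a file server can veto the transaction by calling `abort`, but a `commit` call from anyone other than the owner is rejected.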

Recovery Manager: Components
- Transaction Manager (TM): manages commit coordination by communicating with servers at its own node and with transaction managers at other nodes
- Log Manager (LM): serves as a common recovery log for both the TM’s commit log and the servers’ recovery data
- Deadlock Detector: detects and resolves global deadlocks (not implemented)

[Figure: QuickSilver system structure]

Transaction Manager
- Tracks transactions for processes on host
- Manages distributed commit protocol
- Distributed transaction is a tree
  - Only need to know your superior and your immediate subordinates
- Several alternative commit protocols available to servers
  - 1-phase: used by volatile servers
  - 2-phase: used by recoverable servers
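
The tree-shaped commit can be sketched as follows. This is a simplified model under stated assumptions, not the QuickSilver protocol itself: the `TMNode` class and `run_commit` function are hypothetical names, votes are reduced to yes/no, and messages are modeled as direct method calls. It shows only that each node talks to its immediate subordinates and reports a combined vote to its superior.

```python
class TMNode:
    """A transaction manager in the commit tree (illustrative model only)."""

    def __init__(self, name, local_vote=True):
        self.name = name
        self.local_vote = local_vote   # True = willing to commit
        self.superior = None           # the only upstream node this TM knows
        self.subordinates = []         # the only downstream nodes this TM knows
        self.outcome = None

    def add_subordinate(self, child):
        child.superior = self
        self.subordinates.append(child)
        return child

    def prepare(self):
        # Phase 1: ask every immediate subordinate, then combine with the local vote.
        votes = [s.prepare() for s in self.subordinates]
        return self.local_vote and all(votes)

    def decide(self, outcome):
        # Phase 2: record the outcome and pass it down one level at a time.
        self.outcome = outcome
        for s in self.subordinates:
            s.decide(outcome)


def run_commit(root):
    """Run both phases from the coordinator at the root of the tree."""
    outcome = "commit" if root.prepare() else "abort"
    root.decide(outcome)
    return outcome
```

A single "no" vote anywhere in the tree propagates upward through the combined votes, so the root decides to abort without ever contacting the distant node directly.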

2-Phase Commit
- Voting options
  - abort: undo my action, announce abort to others in 2nd phase
  - commit-read-only: no recoverable resources modified, don’t include me in 2nd phase
  - commit-volatile: same as read-only, but notify me of results of 2nd phase
  - commit-recoverable: recoverable state modified, notify me of results of 2nd phase
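
One way to read the four voting options is as a rule for who takes part in the second phase. The sketch below encodes that reading in Python; the `Vote` enum and `second_phase` function are hypothetical names, and the choice to notify only commit-volatile and commit-recoverable voters on abort is an interpretation of the slide rather than a confirmed detail of the protocol.

```python
from enum import Enum


class Vote(Enum):
    ABORT = "abort"
    COMMIT_READ_ONLY = "commit-read-only"
    COMMIT_VOLATILE = "commit-volatile"
    COMMIT_RECOVERABLE = "commit-recoverable"


def second_phase(votes):
    """Given {participant: Vote}, return (outcome, participants to notify).

    Assumption: any ABORT vote aborts the transaction; read-only voters drop
    out after phase 1; voters that aborted already know the outcome.
    """
    outcome = "abort" if any(v is Vote.ABORT for v in votes.values()) else "commit"
    wants_result = (Vote.COMMIT_VOLATILE, Vote.COMMIT_RECOVERABLE)
    notify = sorted(name for name, v in votes.items() if v in wants_result)
    return outcome, notify
```

With votes from a file server (commit-recoverable), a window manager (commit-volatile), and a name server (commit-read-only), only the first two appear in the notification list, which is what lets read-only participants skip the second round of messages.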

Transaction Coordination
- Transaction coordinator at transaction birth-site
  - Usually a user workstation, likely to fail
  - Migrate or replicate coordinator for reliability

Log Manager
- Log manager provides optional services
  - Backpointers for log replay
  - Block I/O access
  - Log replication
  - Log archival
- Servers tell LM what they need
  - Not penalized for services they don’t use
- LM does not interpret data; servers determine recovery strategy
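
The "pay only for what you use" idea can be sketched with a toy in-memory log. Everything here is an assumption for illustration: the `LogManager` class, its `register`, `append`, and `replay_backward` methods, and the use of a Python list in place of stable storage. Only the backpointer service is modeled, and payloads are opaque bytes that the log never inspects.

```python
class LogManager:
    """Shared log with per-server optional services (illustrative model only)."""

    def __init__(self):
        self._records = []   # each entry: (server, payload, backpointer or None)
        self._options = {}   # server -> services it asked for
        self._last = {}      # server -> index of its most recent record

    def register(self, server, backpointers=False):
        # Servers declare up front which optional services they need.
        self._options[server] = {"backpointers": backpointers}

    def append(self, server, payload):
        # The payload is never interpreted; only the server understands it.
        wants_bp = self._options[server]["backpointers"]
        back = self._last.get(server) if wants_bp else None
        self._records.append((server, payload, back))
        lsn = len(self._records) - 1
        if wants_bp:
            # Bookkeeping is done only for servers that requested the service.
            self._last[server] = lsn
        return lsn

    def replay_backward(self, server):
        """Follow backpointers from the server's newest record to its oldest."""
        if not self._options[server]["backpointers"]:
            raise ValueError(f"{server} did not request backpointers")
        out = []
        lsn = self._last.get(server)
        while lsn is not None:
            _, payload, back = self._records[lsn]
            out.append(payload)
            lsn = back
        return out
```

A recoverable server that registered for backpointers can walk its own records newest-first even when other servers' records are interleaved in the same log, while a server that did not register incurs no extra bookkeeping.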

[Figure: QuickSilver distributed IPC]

[Figure: Structure of a distributed transaction]

Open Questions
- Efficiency vs. transparency?
- Still relevant for today’s hardware?
- …