Megastore: Providing Scalable, Highly Available Storage for Interactive Services
Presented by: Hanan Hamdan
Supervised by: Dr. Amer Badarneh

Introduction
Availability and scalability
Data model
Replication
Brief Summary of Paxos
Read algorithm
Write algorithm
Production Metrics
Conclusion

Megastore is a storage system developed to meet the storage requirements of today's interactive online services. It is novel in that it blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS. Megastore has been widely deployed within Google for several years.

Replication: For availability, we implemented a synchronous, fault-tolerant log replicator optimized for long-distance links. We use Paxos, a proven, optimal, fault-tolerant consensus algorithm that requires no distinguished master: any node can initiate reads and writes. We replicate a write-ahead log over a group of symmetric peers.

Partitioning and Locality: For scale, we partitioned data into a vast space of small databases, each with its own replicated log stored in a per-replica NoSQL datastore. To scale throughput and localize outages, we partition our data into a collection of entity groups, each independently and synchronously replicated over a wide area. For low latency and high throughput, the data for an entity group are held in contiguous ranges of Bigtable rows.


Megastore tables are either entity group root tables or child tables. Each child table must declare a single distinguished foreign key referencing a root table, illustrated by the ENTITY GROUP KEY annotation. An entity group consists of a root entity along with all entities in child tables that reference it.
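To make this concrete, the paper's running photo-sharing example declares a root table and a child table roughly as follows (a sketch of the schema language from the Megastore paper; treat field details as illustrative):

    CREATE TABLE User {
      required int64 user_id;
      required string name;
    } PRIMARY KEY(user_id), ENTITY GROUP ROOT;

    CREATE TABLE Photo {
      required int64 user_id;
      required int32 photo_id;
      required int64 time;
      required string full_url;
      optional string thumbnail_url;
      repeated string tag;
    } PRIMARY KEY(user_id, photo_id),
      IN TABLE User,
      ENTITY GROUP KEY(user_id) REFERENCES User;

Here each User row is a root entity, and that user's Photo rows belong to its entity group via the ENTITY GROUP KEY(user_id) reference.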


While traditional relational modeling recommends that all primary keys take surrogate values, Megastore keys are chosen to cluster entities that will be read together. Each entity is mapped into a single Bigtable row; the primary key values are concatenated to form the Bigtable row key, and each remaining property occupies its own Bigtable column.
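A minimal Python sketch of that mapping (a hypothetical helper, not Megastore's actual code; the real row-key encoding must also preserve sort order):

    def to_bigtable_row(table_name, primary_key_values, properties):
        """Map one Megastore entity onto a Bigtable row."""
        # Concatenate the primary key values to form the Bigtable row key.
        row_key = "/".join(str(v) for v in primary_key_values)
        # Each remaining property occupies its own Bigtable column, named
        # "<table>.<property>" (see the Mapping to Bigtable slide below).
        columns = {table_name + "." + name: value
                   for name, value in properties.items()}
        return row_key, columns

    # A Photo keyed by (user_id, photo_id) = (101, 42):
    row_key, cols = to_bigtable_row("Photo", (101, 42),
                                    {"time": 1309, "tag": ["dinner"]})
    # row_key == "101/42", cols == {"Photo.time": 1309, "Photo.tag": [...]}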

The IN TABLE User directive instructs Megastore to collocate these two tables into the same Bigtable, and the key ordering ensures that Photo entities are stored adjacent to the corresponding User. To support fast local reads, we designed a service called the Coordinator, with servers in each replica's datacenter. A coordinator server tracks a set of entity groups for which its replica has observed all Paxos writes.
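Conceptually, a coordinator is little more than an in-memory set of up-to-date entity groups. A minimal sketch, assuming the validate and invalidate messages used by the read and write algorithms later in this presentation (hypothetical interface, not Megastore's actual code):

    class Coordinator:
        """Per-replica tracker of entity groups known to be up to date."""
        def __init__(self):
            self.up_to_date = set()

        def is_up_to_date(self, entity_group):
            # True only if this replica has observed all Paxos writes
            # for the group, so a fast local read is safe.
            return entity_group in self.up_to_date

        def validate(self, entity_group):
            # Sent by a reader after catching the local replica up.
            self.up_to_date.add(entity_group)

        def invalidate(self, entity_group):
            # Sent during a write when this replica misses the value.
            self.up_to_date.discard(entity_group)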

Indexes:
1. Local index: treated as a separate index for each entity group; used to find data within an entity group. The index entries are stored in the entity group and are updated atomically and consistently with the primary entity data.
2. Global index: used to find entities without knowing in advance which entity groups contain them.
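Continuing the photo example, the paper declares one index of each kind, roughly like this (a sketch; the STORING clause on the global index is explained on a later slide):

    CREATE LOCAL INDEX PhotosByTime
      ON Photo(user_id, time);

    CREATE GLOBAL INDEX PhotosByTag
      ON Photo(tag) STORING (thumbnail_url);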


Storing Clause: To access entity data through an index, we first read the index to find matching primary keys, then use those keys to fetch the entities themselves. Adding the STORING clause to an index lets applications store additional properties from the primary table directly in the index for faster access at read time. For example, the PhotosByTag index stores the photo thumbnail URL for faster retrieval without an additional lookup.

Mapping to Bigtable: The Bigtable column name is a concatenation of the Megastore table name and the property name, allowing entities from different Megastore tables to be mapped into the same Bigtable row without collision. Each index entry is represented as a single Bigtable row. For example, each PhotosByTime index row key is the tuple (user_id, time, primary key) of the corresponding photo.
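A hypothetical Python sketch of that index-entry layout: the indexed property values come first, followed by the entity's full primary key, so each index row both sorts by the indexed values and fully identifies its target entity.

    def index_row_key(indexed_values, entity_primary_key):
        # e.g. PhotosByTime over Photo(user_id, time): a photo with
        # primary key (101, 42) taken at time 1309 yields the row key
        # (101, 1309, 101, 42).
        return tuple(indexed_values) + tuple(entity_primary_key)

    key = index_row_key((101, 1309), (101, 42))   # (101, 1309, 101, 42)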


Replication is done per entity group by synchronously replicating the group's transaction log to a quorum of replicas. A majority of replicas must be active and reachable for the algorithm to make progress; that is, it tolerates up to F faults with 2F + 1 replicas (for example, a five-replica group survives two failures). Databases typically use Paxos to replicate a transaction log, where a separate instance of Paxos is used for each position in the log. New values are written to the log at the position following the last chosen position.



Read algorithm:
1. Query Local: Query the local replica's coordinator to determine whether the entity group is up to date locally.
2. Find Position: Determine the highest possibly committed log position, and select a replica that has applied through that log position.
(a) Local read: If step 1 indicates that the local replica is up to date, read the highest accepted log position and timestamp from the local replica.
(b) Majority read: If the local replica is not up to date, read from a majority of replicas to find the maximum log position that any replica has seen, and pick a replica to read from. We select the most responsive or up-to-date replica, not always the local one.

3. Catchup: As soon as a replica is selected, catch it up to the maximum known log position.
4. Validate: If the local replica was selected and was not previously up to date, send the coordinator a validate message.
5. Query Data: Read the selected replica using the timestamp of the selected log position. If the selected replica becomes unavailable, pick an alternate replica, perform catchup, and read from it instead.
A sketch of these five steps in code follows.
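The promised Python sketch of a current read (pseudocode-level: the replica objects and the majority() and pick_replica() helpers are hypothetical, and error handling is omitted):

    def current_read(entity_group, local, replicas, coordinator):
        # 1. Query Local: is the group up to date at the local replica?
        local_ok = coordinator.is_up_to_date(entity_group)

        # 2. Find Position.
        if local_ok:
            # (a) Local read.
            replica = local
            position = local.highest_accepted(entity_group)
        else:
            # (b) Majority read: highest position any replica in a
            # majority has seen; pick the most responsive or most
            # up-to-date replica, not necessarily the local one.
            position = max(r.highest_seen(entity_group)
                           for r in majority(replicas))
            replica = pick_replica(replicas)

        # 3. Catchup: bring the selected replica up through `position`.
        replica.catchup(entity_group, position)

        # 4. Validate: a stale local replica that just caught up is clean.
        if replica is local and not local_ok:
            coordinator.validate(entity_group)

        # 5. Query Data: read at the timestamp of the selected position.
        return replica.read(entity_group, position)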


Write algorithm:
1. Accept Leader: Ask the leader to accept the value as proposal number zero. If successful, skip to step 3.
2. Prepare: Run the Paxos Prepare phase at all replicas with a higher proposal number than any seen so far at this log position. Replace the value being written with the highest-numbered proposal discovered, if any.
3. Accept: Ask the remaining replicas to accept the value. If this fails on a majority of replicas, return to step 2 after a randomized backoff.
4. Invalidate: Invalidate the coordinator at all full replicas that did not accept the value.
5. Apply: Apply the value's mutations at as many replicas as possible. If the chosen value differs from the one originally proposed, return a conflict error.
A sketch of this write loop in code follows.
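A matching Python sketch of the write path (the prepare/accept helpers, replica attributes, and backoff constants are hypothetical; this is a sketch of the steps above, not Megastore's implementation):

    import random, time

    class Conflict(Exception):
        """Another write won this log position."""

    def paxos_write(entity_group, position, value, leader, replicas):
        original, proposal = value, 0

        # 1. Accept Leader: proposal number zero. On success, the
        #    prepare phase is skipped and we go straight to Accept.
        fast = leader.accept(position, proposal, value)
        while True:
            if not fast:
                # 2. Prepare: use a proposal number higher than any seen
                # at this position; adopt the highest-numbered value found.
                proposal = highest_seen_proposal(replicas, position) + 1
                found = prepare(replicas, position, proposal)
                if found is not None:
                    value = found
            # 3. Accept: ask the remaining replicas to accept the value.
            if accepted_by_majority(replicas, position, proposal, value):
                break
            fast = False                           # fall back to full Paxos
            time.sleep(random.uniform(0.05, 0.5))  # randomized backoff

        # 4. Invalidate the coordinator at full replicas that missed it.
        for r in replicas:
            if r.is_full and not r.has_accepted(position, value):
                r.coordinator.invalidate(entity_group)

        # 5. Apply the winning value's mutations everywhere possible.
        for r in replicas:
            r.apply(position, value)
        if value is not original:
            raise Conflict()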


Megastore has been deployed within Google for several years; more than 100 production applications use it as their storage service. Most of our customers see extremely high levels of availability. Observed average read latencies are tens of milliseconds, depending on the amount of data read, with most reads served locally. Observed average write latencies are on the order of hundreds of milliseconds, depending on the distance between datacenters, the size of the data being written, and the number of full replicas.


In this paper we present Megastore, a scalable, highly available datastore designed to meet the storage requirements of interactive Internet services. Its more than 100 running applications, and its role as an infrastructure service for higher layers, are evidence of its ease of use, generality, and power.
