
Bigtable: A Distributed Storage System for Structured Data
Fay Chang et al. (Google, Inc.)
Presenter: Kyungho Jeon
10/22/2012, Fall 2012: CSE 704 Web-scale Data Management

Motivation and Design Goals
A distributed storage system for structured data:
– Scalability: petabytes of data across thousands of (commodity) machines
– Wide applicability: both throughput-oriented batch jobs and latency-sensitive serving of data to end users
– High performance
– High availability

Data Model

Data Model
Not a full relational data model; instead, Bigtable provides a simple data model that:
– supports dynamic control over data layout and format
– allows clients to reason about the locality properties of their data

Data Model – A Big Table
A table in Bigtable is a:
– sparse
– distributed
– persistent
– multidimensional
– sorted map
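To make the shape of that map concrete, here is a toy sketch in C++. The use of std::map and the literal values are illustrative assumptions only; the real system stores tablets of SSTables and orders the versions of a cell newest-first.

```cpp
// A Bigtable conceptually maps (row:string, column:string, time:int64) -> string.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>

using Key = std::tuple<std::string, std::string, int64_t>;  // row, column, timestamp

int main() {
  // Sorted by row key, then column key, then timestamp.
  std::map<Key, std::string> webtable;
  webtable[{"com.cnn.www", "anchor:cnnsi.com", 9}] = "CNN";
  webtable[{"com.cnn.www", "contents:", 6}] = "<html>...";
  for (const auto& [key, value] : webtable)
    std::cout << std::get<0>(key) << " / " << std::get<1>(key) << " @ t"
              << std::get<2>(key) << " -> " << value << "\n";
}
```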

Data Model
Rows
– Data is maintained in lexicographic order by row key
– Tablet: a range of rows with consecutive keys; the unit of distribution and load balancing
Columns
– Grouped into column families; a column key is written as family:qualifier
Cells
– The storage referenced by a (row key, column key, timestamp) triple
Timestamps
– Each cell can hold multiple versions of the same data, indexed by timestamp

Data Model – WebTable Example
WebTable: a large collection of web pages and related information

Data Model – WebTable Example: Row Key
Bigtable maintains data in lexicographic order by row key
Tablet: a group of rows with consecutive keys; the unit of distribution

Data Model – WebTable Example: Column Family
The column family is the unit of access control

Data Model – WebTable Example: Column
A column key is specified as “column family:qualifier”

Data Model – WebTable Example: Column
New columns can be added to a column family at any time once the family has been created

Data Model – WebTable Example: Cell
Cell: the storage referenced by a particular row key, column key, and timestamp

Data Model – WebTable Example: Timestamps
Each cell in a table can contain multiple versions of the same data, indexed by timestamp

API

API
– Write or delete values in Bigtable
– Look up values from individual rows
– Iterate over a subset of the data in a table

API – Update a Row

API – Update a Row
Open the table

API – Update a Row
Declare that we are going to mutate the row

API – Update a Row
Store a new item under the column key “anchor:www.c-span.org”

API – Update a Row
Delete an item under the column key “anchor:www.abc.com”

API – Update a Row
The mutation is applied atomically
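The code these captions annotate is Figure 2 of the Bigtable paper, reproduced here from the paper; the types (Table, RowMutation, Operation) belong to Google's internal client library, so treat this as the paper's example rather than runnable public code.

```cpp
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);  // applied as a single atomic mutation to the row
```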

API – Iterate over a Table
Create a Scanner instance

API – Iterate over a Table
Access the “anchor” column family

API – Iterate over a Table
Specify “return all versions”

API – Iterate over a Table
Specify a row key

API – Iterate over a Table
Iterate over the rows
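Likewise, this scanner walkthrough annotates Figure 3 of the paper, reproduced here from the paper with the same caveat about internal types:

```cpp
Scanner scanner(T);                              // T is the table opened above
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");    // access the "anchor" family
stream->SetReturnAllVersions();                  // return all stored versions
scanner.Lookup("com.cnn.www");                   // restrict the scan to one row
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}
```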

API – Other Features
– Single-row transactions (e.g., an atomic read-modify-write on a single row key)
– Execution of client-supplied scripts in the address space of the servers
– Can serve as an input source and an output target for MapReduce jobs

A Typical Google Machine

A Google Cluster

Building Blocks
Chubby
– A highly available and persistent distributed lock service
GFS
– Stores the commit logs and the data files
SSTable
– Google's immutable file format: a persistent, ordered, immutable map from keys to values

SSTable
For more info: log-structured-storage-leveldb/
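A minimal sketch of why SSTable lookups are cheap: an in-memory index of each block's first key narrows a lookup to a single block read. The layout and names below are illustrative assumptions; the real format is internal to Google.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// One decoded data block (64 KB in the paper); blocks hold sorted entries.
struct Block {
  std::map<std::string, std::string> entries;
};

class SSTableSketch {
 public:
  // Assumes every block is non-empty; the index maps first keys to blocks.
  explicit SSTableSketch(std::vector<Block> blocks) : blocks_(std::move(blocks)) {
    for (size_t i = 0; i < blocks_.size(); ++i)
      index_[blocks_[i].entries.begin()->first] = i;
  }

  std::optional<std::string> Get(const std::string& key) const {
    auto it = index_.upper_bound(key);   // first block starting after 'key'
    if (it == index_.begin()) return std::nullopt;
    --it;                                // candidate block: one disk read
    const auto& entries = blocks_[it->second].entries;
    auto e = entries.find(key);
    if (e == entries.end()) return std::nullopt;
    return e->second;
  }

 private:
  std::vector<Block> blocks_;            // on disk (in GFS) in the real system
  std::map<std::string, size_t> index_;  // loaded into memory when opened
};
```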

Chubby
A highly available and persistent distributed lock service
– Five replicas, one of which is elected master
– Uses the Paxos algorithm to keep its replicas consistent
– Provides a namespace that consists of directories and small files

Implementation
Three major components:
– Client library
– Master: one and only one!
– Tablet servers: many

Implementation – Master
– Assigns tablets to tablet servers
– Detects the addition and removal of tablet servers
– Balances tablet-server load
– Garbage collects files in GFS
– Handles schema changes
A single-master system (as GFS is)

Tablet Server
– Manages a set of tablets
– Handles read and write requests to its tablets
– Splits tablets that have grown too large

How Does a Client Find a Tablet?
A three-level hierarchy, analogous to a B+-tree: a file in Chubby points to the root tablet, the root tablet holds the location of every METADATA tablet, and each METADATA tablet holds the locations of a set of user tablets. Clients cache tablet locations.
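A small sketch of the client side of this lookup; the Location type, the resolver stub, and keying the cache by each tablet's end row are illustrative assumptions, not the real client library.

```cpp
#include <map>
#include <string>

struct Location {
  std::string tablet_server;
  std::string start_row, end_row;  // the tablet serves rows in [start_row, end_row]
};

class LocationCache {
 public:
  Location Find(const std::string& row) {
    // The tablet covering 'row' is the one with the smallest end_row >= row.
    auto it = cache_.lower_bound(row);
    if (it != cache_.end() && it->second.start_row <= row)
      return it->second;                     // cache hit: no network round trips
    Location loc = LookupViaHierarchy(row);  // miss: up to three round trips
    cache_[loc.end_row] = loc;
    return loc;
  }

 private:
  // Stub standing in for the Chubby file -> root tablet -> METADATA walk.
  Location LookupViaHierarchy(const std::string& /*row*/) {
    return Location{"tabletserver-0", "", "\xff"};
  }
  std::map<std::string, Location> cache_;  // end_row -> tablet location
};
```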

Tablet Assignment
– Each tablet is assigned to at most one tablet server at a time
– When a tablet is unassigned and a tablet server is available, the master assigns the tablet by sending the server a tablet load request
– Bigtable uses Chubby to keep track of tablet servers

Tablet Assignment
Detecting a tablet server that is no longer serving its tablets:
– The master periodically asks each tablet server for the status of its Chubby lock
– If a tablet server reports that it has lost its lock, or if the master cannot reach the server, the master attempts to acquire an exclusive lock on the server's file in Chubby
– If the lock acquisition succeeds, Chubby is alive, so the tablet server itself must have the problem
– The master deletes the server's file in Chubby to ensure the tablet server can never serve again
– The master then moves all tablets previously assigned to that server into the set of unassigned tablets

Tablet Assignment
When a master starts, it:
– Grabs a unique master lock in Chubby
– Scans the servers directory in Chubby to find the live tablet servers
– Communicates with every live tablet server to discover the current tablet assignments
– Scans the METADATA table and adds each tablet not already assigned to the set of unassigned tablets

Tablet Serving

Tablet Serving – Memtable
– A sorted in-memory buffer holding the recently committed updates
– Maintains the updates on a row-by-row basis
– Each row is copy-on-write, providing row-level consistency while reads and writes proceed in parallel
– Older updates are stored in a sequence of SSTables

Tablet Serving – Write
– The server checks that the request is well-formed and that the sender is authorized
– A valid mutation is written to the commit log
– After the write has been committed to the log, its contents are inserted into the memtable
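A minimal sketch of this write path, assuming a simplified Mutation type and a local file standing in for the GFS-resident commit log:

```cpp
#include <cstdint>
#include <fstream>
#include <map>
#include <string>
#include <tuple>

struct Mutation {
  std::string row, column, value;
  int64_t timestamp;
};

class TabletWriter {
 public:
  explicit TabletWriter(const std::string& log_path)
      : log_(log_path, std::ios::app) {}

  void Apply(const Mutation& m) {
    // 1. Append the mutation to the commit log and force it out
    //    (the real system batches this via group commit).
    log_ << m.row << '\t' << m.column << '\t' << m.timestamp << '\t'
         << m.value << '\n';
    log_.flush();
    // 2. Only after the write is durable, insert it into the sorted memtable.
    memtable_[{m.row, m.column, m.timestamp}] = m.value;
  }

 private:
  std::ofstream log_;
  std::map<std::tuple<std::string, std::string, int64_t>, std::string> memtable_;
};
```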

Tablet Serving – Read
– The server checks that the request is well-formed and that the sender is authorized
– A valid read is executed on a merged view of the sequence of SSTables and the memtable
– The merged view can be formed efficiently because the SSTables and the memtable are lexicographically sorted data structures
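A sketch of the merged view for point lookups, assuming all sources are plain sorted maps and that newer sources shadow older ones:

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

using SortedMap = std::map<std::string, std::string>;

// Probe the memtable first, then the SSTables from newest to oldest; the
// first hit wins because newer writes shadow older ones.
std::optional<std::string> MergedGet(
    const SortedMap& memtable,
    const std::vector<SortedMap>& sstables_newest_first,
    const std::string& key) {
  if (auto it = memtable.find(key); it != memtable.end()) return it->second;
  for (const auto& sst : sstables_newest_first)
    if (auto it = sst.find(key); it != sst.end()) return it->second;
  return std::nullopt;  // not present in any source
}
```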

Tablet Serving – Recovery
To recover a tablet:
– The tablet server reads the tablet's metadata from the METADATA table
– The metadata contains the list of SSTables that comprise the tablet and a set of redo points, which are pointers into the commit logs
– The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates committed since the redo points

Compaction
Minor compaction
– When the memtable reaches a threshold size, it is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS
Major compaction
– Rewrites all SSTables into exactly one SSTable (a sketch follows the diagram walkthrough below)

Compaction – Diagram Walkthrough
[Diagram sequence: write ops flow into the commit log and the memtable; the memtable lives in memory, while the commit log and SSTables live in GFS]
– The memtable fills until its size threshold is reached
– The frozen memtable is written out to GFS as a new SSTable
– A new, empty memtable takes over incoming writes
– A major compaction rewrites the accumulated SSTables into a single SSTable
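The compaction cycle above, as a small sketch. The sorted-map stand-ins mirror the earlier sketches and ignore deletion tombstones, which a real major compaction also discards:

```cpp
#include <map>
#include <string>
#include <vector>

using SortedMap = std::map<std::string, std::string>;

struct Tablet {
  SortedMap memtable;               // in memory
  std::vector<SortedMap> sstables;  // newest first; in GFS in the real system
};

// Minor compaction: freeze the memtable, persist it as a new SSTable, and
// start a fresh memtable for incoming writes.
void MinorCompaction(Tablet& t) {
  if (t.memtable.empty()) return;
  t.sstables.insert(t.sstables.begin(), t.memtable);
  t.memtable.clear();
}

// Major compaction: rewrite all SSTables into exactly one SSTable, so reads
// no longer have to consult multiple files.
void MajorCompaction(Tablet& t) {
  MinorCompaction(t);
  SortedMap merged;
  // Walk from oldest to newest so newer values overwrite older ones.
  for (auto it = t.sstables.rbegin(); it != t.sstables.rend(); ++it)
    for (const auto& [key, value] : *it) merged[key] = value;
  t.sstables.assign(1, std::move(merged));
}
```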

Schema Management
– Bigtable schemas are stored in Chubby
– The master updates a schema by rewriting the corresponding schema file in Chubby

Optimization – Locality Groups
– Client-defined groups of column families
– An abstraction that enables clients to control their data's storage layout
– A separate SSTable is generated for each locality group in each tablet during compaction
– A locality group can be declared to be in-memory

Optimization – Compression
– Clients can control whether the SSTables for a locality group are compressed

Optimization – Two-Level Caching for Read Performance
– Scan cache (higher level): caches the key-value pairs returned by the SSTable interface to the tablet server code
– Block cache (lower level): caches SSTable blocks read from GFS

Optimization – Bloom Filters
– A Bloom filter can be created for the SSTables in a locality group
– It lets the server ask whether an SSTable might contain any data for a given row/column pair, so most lookups for nonexistent rows or columns never touch disk
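A minimal Bloom-filter sketch showing the check a read makes before touching an SSTable; the filter size and the salted std::hash are illustrative assumptions, not the production hashing scheme:

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

class BloomSketch {
 public:
  void Add(const std::string& key) {
    for (size_t i = 0; i < kHashes; ++i) bits_.set(Slot(key, i));
  }

  // False means "definitely absent": the disk read can be skipped.
  // True means "possibly present": the SSTable must still be consulted.
  bool MayContain(const std::string& key) const {
    for (size_t i = 0; i < kHashes; ++i)
      if (!bits_.test(Slot(key, i))) return false;
    return true;
  }

 private:
  static constexpr size_t kBits = 1 << 16;
  static constexpr size_t kHashes = 4;
  static size_t Slot(const std::string& key, size_t i) {
    // Derive the i-th hash by salting the key (illustrative, not production).
    return std::hash<std::string>{}(key + static_cast<char>('0' + i)) % kBits;
  }
  std::bitset<kBits> bits_;
};
```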

Optimization – Commit-Log Implementation
– One commit log per tablet server, rather than one log per tablet
– Recovery problem: if a tablet server hosting 100 tablets fails and its tablets are reassigned to 100 other machines, each machine must scan the full commit log, so the log is read 100 times
– Solution: first sort the commit log by the key ⟨table, row name, log sequence number⟩, so each recovering server needs only one contiguous read
– Writing commit logs: two log-writer threads, each with its own log file, so writing can switch files when GFS is slow

Performance Evaluation
Sequential writes/reads
– Row keys with names 0 to R-1, partitioned into 10N equal-sized ranges (N = number of tablet servers)
– Wrote a single string under each row key
– Roughly 1 GB of data per tablet server
Scan
– Uses Bigtable's Scan API
Random writes/reads
– Like the sequential benchmarks, but the row key is hashed
Random reads (mem)
– Roughly 100 MB of data per tablet server, with the locality group marked as in-memory

Single Tablet Server Performance

Aggregate Throughput

Real Applications

Lessons Learned
– Large distributed systems are vulnerable to many kinds of failures
– Delay adding new features until it is clear how they will be used
– The importance of proper system-level monitoring
– The value of simple designs

Acknowledgement
Jeff Dean, “Handling Large Datasets at Google: Current Systems and Future Directions”