Bigtable: A Distributed Storage System for Structured Data


Bigtable: A Distributed Storage System for Structured Data

Credit: Based on the paper by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, and on the following presentations: a talk by Jeffrey Dean at the University of Washington in September 2005; "Big Table: A Distributed Storage System for Structured Data" by Pouria Pirzadeh and Vandana Ayyalasomayajula, University of California, Irvine (BigtableLacture); and "Google Bigtable" by S. Sudarshan, CSE, IIT Bombay (bigtable-uw-presentaion).

What we will cover today: Motivation, Overview, Data Model, Client API, Building Blocks, Implementation.

Motivation: Google's scale is huge: petabytes of data, many incoming requests, and very different demands in data size, workloads, and configurations. No commercial system is big enough, and even if one were, using it would be expensive and it might have made design choices that don't fit Google's requirements. An example of the huge scale is Personalized Search, which records user queries and clicks across a variety of Google properties such as web search, images, and news; users can browse their search histories and ask for personalized search results based on their historical Google usage patterns. The demands really do vary widely: data sizes range from URLs to web pages to satellite imagery, and workloads range from throughput-oriented batch processing (Google Earth uses a table to store raw imagery, which is processed later) to latency-sensitive serving of data to end users. Based on "Big Table: A Distributed Storage System for Structured Data".

Overview: Bigtable is widely applicable: it is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. It is scalable, handling petabytes of data on thousands of servers. It is built from components that already exist at Google: commodity servers and Google technologies. It provides only a simple data model rather than a full relational model with SQL; instead it supports dynamic control over data layout and format, meaning clients can control the locality of data through their choice of schema and the placement of data (disk or memory) through schema parameters.

Data model: Bigtable is divided into clusters, each containing a set of tables. A table is a map (row, column, timestamp) -> data. Row names and column names are arbitrary strings, and the data is an uninterpreted array of bytes. Figure taken from the paper. Based on "Big Table: A Distributed Storage System for Structured Data".
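To make the model concrete, here is a minimal Python sketch of the logical map described above, using the Webtable example from the paper (row "com.cnn.www" with contents: and anchor: columns). The nested-dictionary layout and the lookup helper are purely illustrative; they are not how Bigtable stores or serves data.

```python
# Illustrative only: the logical (row, column, timestamp) -> value map.
webtable = {
    "com.cnn.www": {                          # row key (a reversed URL)
        "contents:": {                        # family "contents", empty qualifier
            6: b"<html>... version at t6 ...",
            5: b"<html>... version at t5 ...",
            3: b"<html>... version at t3 ...",
        },
        "anchor:cnnsi.com": {9: b"CNN"},      # family "anchor", qualifier = referring site
        "anchor:my.look.ca": {8: b"CNN.com"},
    },
}

def lookup(table, row, column, timestamp=None):
    """Return the newest version at or before `timestamp` (the latest if None)."""
    versions = table[row][column]
    ts = max(t for t in versions if timestamp is None or t <= timestamp)
    return versions[ts]

print(lookup(webtable, "com.cnn.www", "contents:", timestamp=5))  # the version at t5
```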

Rows and Tablets: Rows are kept in lexicographic order and are partitioned into tablets. Each tablet consists of a specific row range and is used as the basic unit for distribution and load balancing (more on this later). Operations on a single row are atomic. Figure taken from the paper.
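As a rough illustration of row-range partitioning, a toy lookup can find the tablet for a row key with a binary search over sorted end keys. The end keys below are invented for the example; real tablet boundaries are kept in the METADATA table, described later.

```python
# Illustrative sketch: tablets cover contiguous, lexicographically sorted row ranges.
import bisect

# Hypothetical end keys of four tablets; tablet i holds the rows after
# tablet_end_keys[i-1] up to and including tablet_end_keys[i].
tablet_end_keys = ["com.cnn.www", "com.google.www", "org.apache.www", "\xff\xff"]

def tablet_for_row(row_key):
    """Index of the tablet whose row range contains row_key."""
    return bisect.bisect_left(tablet_end_keys, row_key)

print(tablet_for_row("com.example.www"))   # -> 1 (between com.cnn.www and com.google.www)
print(tablet_for_row("net.slashdot.www"))  # -> 2
```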

Columns and Column Families: Columns are grouped into sets called column families, which are the basic unit of access control. Column families rarely change, and a family can contain one column or many. A column name has the form family:qualifier. Access control and both disk and memory accounting are performed at the column-family level. In the Webtable example, the language family contains only one column key and stores each page's language ID, while in the anchor family each column key represents a single anchor: the qualifier is the name of the referring site and the cell contents are the link text. Figure taken from the paper.

Timestamps: Each cell in a Bigtable can contain multiple versions of the same data, i.e. different versions taken at different times, indexed by timestamp. Bigtable provides automatic garbage collection for this data: keep only the last n versions of a cell, or keep only new-enough versions. In the Webtable example, the timestamps of the contents: column are the times at which the page versions were actually crawled, and the garbage-collection mechanism lets us keep only the most recent three versions of every page. Figure taken from the paper.
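A small sketch of the two garbage-collection settings mentioned above; the function and parameter names are invented for illustration and are not the Bigtable API.

```python
def gc_cell(versions, keep_last_n=None, min_timestamp=None):
    """versions: dict of timestamp -> value. Returns the versions that survive GC."""
    newest_first = sorted(versions.items(), reverse=True)
    if keep_last_n is not None:                    # "keep only the last n versions"
        newest_first = newest_first[:keep_last_n]
    if min_timestamp is not None:                  # "keep only new-enough versions"
        newest_first = [(t, v) for t, v in newest_first if t >= min_timestamp]
    return dict(newest_first)

cell = {3: "crawl-3", 5: "crawl-5", 6: "crawl-6", 9: "crawl-9"}
print(gc_cell(cell, keep_last_n=3))       # keeps timestamps 9, 6, 5
print(gc_cell(cell, min_timestamp=6))     # keeps timestamps 9, 6
```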

Client API: The API lets clients manage the schema: create and delete tables and column families, and change cluster, table, and column-family metadata. Basic commands let clients write or delete values, look up values from individual rows, and scan a subset of the data in a table. A Scanner iterates over multiple column families, with several mechanisms for limiting the rows, columns, and timestamps produced by a scan.
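The paper's API examples are written in C++ (RowMutation, Apply, Scanner). The sketch below is a rough Python rendering of the same call pattern against a toy in-memory table; the ToyTable class and its method names are stand-ins, not the real client library.

```python
import time

class ToyTable:
    """In-memory stand-in for a Bigtable table, for illustration only."""
    def __init__(self):
        self.cells = {}                              # (row, column) -> {timestamp: value}

    def set(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time_ns()
        self.cells.setdefault((row, column), {})[ts] = value

    def delete(self, row, column):
        self.cells.pop((row, column), None)

    def scan_row(self, row):
        """Yield (column, latest value) pairs for one row, in column order."""
        for (r, c), versions in sorted(self.cells.items()):
            if r == row:
                yield c, versions[max(versions)]

# Mirrors the paper's example mutations on the Webtable row "com.cnn.www".
t = ToyTable()
t.set("com.cnn.www", "anchor:www.c-span.org", b"CNN")
t.delete("com.cnn.www", "anchor:www.abc.com")
for column, value in t.scan_row("com.cnn.www"):
    print(column, value)
```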

Client API: More advanced features include single-row transactions, using cells as integer counters, and execution of client-supplied scripts in the address spaces of the servers. Single-row transactions perform atomic read-modify-write sequences on data stored under a single row key. Client scripts are written in a language developed at Google; they are not allowed to write back into Bigtable, but they do support various forms of data transformation, filtering based on arbitrary expressions, and summarization via a variety of operators.

Building Blocks: The Google File System (GFS) is a large-scale distributed file system used to store Bigtable's on-disk files. Bigtable also relies on a cluster management system: a Bigtable cluster runs in a pool of machines that usually runs other applications as well, so Bigtable depends on the cluster management system to schedule jobs, deal with machine failures, monitor machine status, and more.

Building Blocks: Chubby is a highly available and persistent distributed lock service that can store directories and small files. Bigtable uses it for storing the Bigtable schema information, tracking the master and the tablet servers, and more. In Chubby, each directory or file can be used as a lock, and reads and writes to a file are atomic.

Implementation - Master: Bigtable has three major components: a client library, one master server, and many tablet servers. There is one master per cluster. It manages the tablet servers, garbage-collects files in GFS, and handles schema changes. Managing the tablet servers means assigning tablets to them, balancing tablet-server load, and detecting the addition and expiration of tablet servers; schema changes include table and column-family creations. Based on "Big Table: A Distributed Storage System for Structured Data".

Implementation - Tablet Server: Each tablet server manages a set of tablets: it handles read and write requests to the tablets it has loaded and splits tablets that have grown too large. Clients communicate with tablet servers directly, so the master is lightly loaded. Tablet servers can be dynamically added to or removed from a cluster, and tablets move between servers. Based on "Big Table: A Distributed Storage System for Structured Data".

Tablet Location: Given a row key, how can a client find the tablet that holds it? One approach is to ask the master server, but the master would then become a bottleneck in a large system. Instead, Bigtable uses a special table containing tablet location information. But how can we find this special table? Since tablets move around from server to server, the question is really: given a row, how do clients find the right server? Based on Jeff Dean's lecture.

Tablet Location - Cont.: A 3-level hierarchy stores the locations: the METADATA table contains the locations of the user tablets, the root tablet contains the locations of the METADATA tablets, and one file in Chubby holds the location of the root tablet. A METADATA row key encodes the tablet's table ID and its end row. The client library caches tablet locations; it moves up the hierarchy if a cached location turns out to be stale, and it also prefetches tablet locations for range queries. If the client's cache is stale, the location algorithm could take up to six round trips, because stale cache entries are only discovered upon misses (assuming that METADATA tablets do not move very frequently). Tablet locations are served from memory. Figure taken from the paper. Based on Jeff Dean's lecture.
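A hedged sketch of the three-level lookup with client-side caching. The dictionaries below are stubs standing in for Chubby, the root tablet, and the METADATA table, and the key format is simplified; the real client reads actual tablet rows.

```python
# Stub location data, for illustration only.
CHUBBY_ROOT_FILE = "root tablet @ tabletserver-A"                # level 1: Chubby file
ROOT_TABLET = {"metadata-tablet-7": "tabletserver-B"}            # level 2: root tablet rows
METADATA_TABLETS = {                                             # level 3: METADATA rows
    "metadata-tablet-7": {"usertable,endrow42": "tabletserver-C"},
}

location_cache = {}                                              # client-side cache

def locate(user_tablet_key):
    """Resolve a user tablet (keyed by table ID + end row) to its tablet server."""
    if user_tablet_key in location_cache:                        # no network round trips
        return location_cache[user_tablet_key]
    # Cache miss: walk down the hierarchy (Chubby file -> root tablet -> METADATA).
    _root_location = CHUBBY_ROOT_FILE                            # step 1: read Chubby
    metadata_tablet = next(iter(ROOT_TABLET))                    # step 2: METADATA tablet covering the key
    server = METADATA_TABLETS[metadata_tablet][user_tablet_key]  # step 3: read the location row
    location_cache[user_tablet_key] = server
    return server

print(locate("usertable,endrow42"))   # walks the hierarchy, then caches
print(locate("usertable,endrow42"))   # answered from the client cache
```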

Editing a table: A tablet's state is stored in several pieces. The memtable is a sorted buffer held in memory, mapping (row, column, timestamp) -> data. SSTables, stored in GFS, are immutable, ordered maps from (row, column, timestamp) to data. The tablet log, also in GFS, is an append-only commit log. (Figure: a tablet with its memtable in memory and its SSTables and tablet log in GFS.) In an SSTable, keys and values are arbitrary byte strings. An SSTable provides lookup of a specified key and iteration over all key/value pairs in a specified key range. A lookup requires only one read from disk, since the index is kept in memory; alternatively, an SSTable can be loaded entirely into memory.
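To keep the following write, read, and compaction slides concrete, here is one possible toy representation of these pieces. The classes are invented for illustration; real SSTables are block-based files in GFS with an index that can be mapped into memory.

```python
class Memtable:
    """Sorted in-memory buffer: (row, column, timestamp) -> value."""
    def __init__(self):
        self.entries = {}

    def insert(self, row, column, timestamp, value):
        self.entries[(row, column, timestamp)] = value

    def sorted_items(self):
        return sorted(self.entries.items())

class SSTable:
    """Immutable, ordered map of (row, column, timestamp) -> value."""
    def __init__(self, items):
        self._items = tuple(sorted(items))                              # frozen at creation
        self._index = {k: i for i, (k, _) in enumerate(self._items)}    # index kept in memory

    def lookup(self, key):
        i = self._index.get(key)
        return None if i is None else self._items[i][1]

m = Memtable()
m.insert("com.cnn.www", "contents:", 6, b"<html>...")
sst = SSTable(m.sorted_items())
print(sst.lookup(("com.cnn.www", "contents:", 6)))
```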

Write: (Figure: a write request arrives at a tablet; new entries are added to the memtable in memory and to the tablet log in GFS.) When a write operation arrives at a tablet server, the server checks that it is well-formed and that the sender is authorized to perform the mutation. A valid mutation is written to the commit log, and after the write has been committed, its contents are inserted into the memtable. Deletes are represented by special deletion entries; the data itself is deleted later.
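A minimal sketch of this write path, assuming a toy tablet made of a commit-log list and a memtable dict; the checks and the deletion sentinel are simplified stand-ins for the real mutation format.

```python
DELETION = object()     # sentinel standing in for Bigtable's special deletion entries

tablet_log = []         # append-only commit log (a file in GFS in the real system)
memtable = {}           # (row, column, timestamp) -> value

def apply_mutation(row, column, timestamp, value=DELETION, authorized=True):
    """Check, log, then apply a write or delete to the memtable."""
    if not isinstance(row, str) or not authorized:            # well-formedness + auth check
        raise ValueError("rejected mutation")
    tablet_log.append((row, column, timestamp, value))        # 1. commit to the log
    memtable[(row, column, timestamp)] = value                # 2. insert into the memtable

apply_mutation("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
apply_mutation("com.cnn.www", "anchor:www.abc.com", 10)       # a delete: entry now, data removed later
print(len(tablet_log), len(memtable))                         # -> 2 2
```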

Read: (Figure: a read request arrives at a tablet; the row's data is assembled from the memtable in memory and the SSTables in GFS.) When a read operation arrives at a tablet server, the server checks it for well-formedness and proper authorization. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently. Incoming read and write operations can continue while tablets are split.
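A sketch of the merged view, assuming the toy representation used above (memtable and SSTables as mappings of (row, column, timestamp) -> value, plus a deletion sentinel). The real merge iterates over sorted files rather than building dictionaries.

```python
DELETION = object()                      # stands in for deletion entries

def read_row(row, memtable, sstables):
    """Merged view of one row: the newest timestamp wins, deletions hide older data."""
    merged = {}                          # column -> (timestamp, value)
    for source in list(sstables) + [memtable]:
        for (r, col, ts), value in sorted(source.items()):
            if r != row:
                continue
            if col not in merged or ts > merged[col][0]:
                merged[col] = (ts, value)
    return {col: v for col, (ts, v) in merged.items() if v is not DELETION}

sstables = [{("r1", "contents:", 3): b"old"}]
memtable = {("r1", "contents:", 6): b"new", ("r1", "anchor:a.com", 7): DELETION}
print(read_row("r1", memtable, sstables))   # -> {'contents:': b'new'}
```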

What is the Tablet Log for? Tablet recovery. When a tablet server crashes, its tablets are moved to other tablet servers. The SSTables are on disk in GFS, so they survive, but what about the memtable? The new tablet server needs to reconstruct it, which is done by applying all of the updates in the tablet log that have not yet been written to an SSTable. The tablet's list of SSTables and the location of its tablet log are kept in the METADATA table.
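A sketch of memtable reconstruction during recovery, assuming the log entries are numbered and the METADATA entry records a redo point, i.e. the last log position already reflected in SSTables; the names are illustrative.

```python
def recover_memtable(tablet_log, redo_point):
    """Rebuild the memtable by replaying log entries newer than the redo point."""
    memtable = {}
    for seqno, (row, column, timestamp, value) in enumerate(tablet_log):
        if seqno > redo_point:                        # older entries already live in SSTables
            memtable[(row, column, timestamp)] = value
    return memtable

tablet_log = [
    ("r1", "contents:", 1, b"v1"),     # seqno 0: already written to an SSTable
    ("r1", "contents:", 2, b"v2"),     # seqno 1: only in the lost memtable
    ("r2", "anchor:a.com", 3, b"A"),   # seqno 2: only in the lost memtable
]
print(recover_memtable(tablet_log, redo_point=0))
```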

Minor Compaction: Convert the memtable into an SSTable. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This reduces memory usage and reduces the amount of data that has to be read from the commit log during recovery. Incoming read and write operations can continue while compactions occur. (Figure: the memtable in memory is written out as a new SSTable in GFS.)
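A sketch of a minor compaction on the same toy structures; the threshold and function names are invented. The full memtable is frozen and written out as a new immutable SSTable, and an empty memtable takes its place.

```python
MEMTABLE_THRESHOLD = 2                        # illustrative; the real threshold is a size in bytes

def maybe_minor_compact(memtable, sstables):
    """Freeze the memtable into a new SSTable once it grows past the threshold."""
    if len(memtable) < MEMTABLE_THRESHOLD:
        return memtable, sstables
    frozen = tuple(sorted(memtable.items()))   # becomes an immutable SSTable written to GFS
    return {}, sstables + [frozen]             # fresh memtable, one more SSTable

memtable = {("r1", "contents:", 1): b"v1", ("r1", "contents:", 2): b"v2"}
memtable, sstables = maybe_minor_compact(memtable, [])
print(len(memtable), len(sstables))            # -> 0 1
```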

Merging Compaction: Reduce the number of SSTables by converting the memtable and some of the SSTables into a single new SSTable. Without merging compactions, read operations might need to merge updates from an arbitrary number of SSTables. They are executed periodically, and the input SSTables and memtable are discarded when the compaction has finished. (Figure: the memtable and several SSTables are merged into one new SSTable in GFS.)
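A sketch of a merging compaction on the same toy structures: the memtable and a few SSTables are merged into one new sorted SSTable so that later reads consult fewer files. Deletion entries are kept at this stage; only a major compaction drops them.

```python
import heapq

def merging_compaction(memtable, old_sstables):
    """Merge the memtable and the given SSTables into one new sorted SSTable."""
    runs = [sorted(memtable.items())] + [list(t) for t in old_sstables]
    merged = {}
    for key, value in heapq.merge(*runs):      # every run is sorted by (row, column, timestamp)
        merged.setdefault(key, value)          # drop duplicates that appear in more than one input
    return tuple(sorted(merged.items()))       # the replacement SSTable; the inputs are discarded

memtable = {("r1", "contents:", 6): b"new"}
sst_a = ((("r1", "contents:", 3), b"old"),)
sst_b = ((("r2", "anchor:a.com", 4), b"A"),)
print(len(merging_compaction(memtable, [sst_a, sst_b])))   # -> 3 entries in one SSTable
```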

Major Compaction: Convert the memtable and all of the SSTables into a single new SSTable. A major compaction rewrites all SSTables into exactly one SSTable that contains no deletion records, only live data. This allows Bigtable to reclaim the resources used by deleted data and ensures that deleted data disappears from the system in a timely manner. (Figure: the memtable and all SSTables are rewritten as one SSTable in GFS.)
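A sketch of a major compaction on the same toy structures: everything is rewritten into exactly one SSTable, and deletion entries, together with the older data they shadow, are dropped. The shadowing rule is simplified here (a deletion hides every older version of its cell).

```python
DELETION = object()   # stands in for Bigtable's special deletion entries

def major_compaction(memtable, sstables):
    """Rewrite the memtable and all SSTables into one SSTable holding only live data."""
    live, deleted_cells = {}, set()
    # Walk the sources from newest (memtable) to oldest so deletions shadow older writes.
    for source in [sorted(memtable.items())] + [list(t) for t in reversed(sstables)]:
        for (row, column, timestamp), value in source:
            if value is DELETION:
                deleted_cells.add((row, column))
            elif (row, column) not in deleted_cells:
                live[(row, column, timestamp)] = value
    return tuple(sorted(live.items()))         # exactly one SSTable, no deletion records

memtable = {("r1", "anchor:a.com", 7): DELETION}
sst = ((("r1", "anchor:a.com", 4), b"A"), (("r1", "contents:", 3), b"<html>"))
print(major_compaction(memtable, [sst]))       # only the contents: cell survives
```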

Questions?

Refinement - Locality Groups: Remember column families? A locality group groups together several column families, and each locality group is kept in a separate SSTable. This allows more efficient reads, since each read accesses smaller SSTables. A locality group can also be declared in-memory instead of on-disk, which is useful for small locality groups. For example, page metadata in Webtable (such as language and checksums) can be in one locality group and the contents of the page in a different group: an application that wants to read the metadata does not need to read through all of the page contents. SSTables for in-memory locality groups are loaded lazily into the memory of the tablet server; once loaded, column families that belong to such locality groups can be read without accessing the disk. This feature is useful for small pieces of data that are accessed frequently, and it is used internally for the location column family in the METADATA table. Based on "Big Table: A Distributed Storage System for Structured Data".
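A sketch of the idea, assuming a simple mapping from column family to locality group; the group names and the split function are invented for illustration. Each group's cells end up in a separate SSTable, so a metadata-only read never touches the page-content SSTable.

```python
# Hypothetical locality-group assignment for a Webtable-like schema.
LOCALITY_GROUPS = {
    "language": "metadata",      # small families that are read together
    "checksum": "metadata",
    "contents": "page_content",  # large page bodies kept in their own group
}

def split_by_locality_group(cells):
    """cells: dict of (row, 'family:qualifier', timestamp) -> value.
    Returns one sorted, SSTable-like run per locality group."""
    groups = {}
    for (row, column, ts), value in cells.items():
        family = column.split(":", 1)[0]
        group = LOCALITY_GROUPS.get(family, "default")
        groups.setdefault(group, []).append(((row, column, ts), value))
    return {group: sorted(run) for group, run in groups.items()}

cells = {
    ("com.cnn.www", "language:", 1): b"EN",
    ("com.cnn.www", "contents:", 6): b"<html>...",
}
print(list(split_by_locality_group(cells)))   # -> ['metadata', 'page_content']
```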

Refinement - Bloom Filters: A read operation has to read from all the SSTables that make up a tablet, which can result in many disk accesses. To avoid this, clients can create Bloom filters for the SSTables of a specific locality group; the filters are kept in memory. A Bloom filter is a probabilistic data structure used to test whether an element is a member of a set: it can return false positives but never false negatives, and it does not support deletions. Figure taken from Wikipedia.
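A compact Bloom-filter sketch; the hash construction and sizes below are arbitrary choices for the example, while production filters are sized for a target false-positive rate. The filter answers either "definitely not in this SSTable" or "possibly in this SSTable", so most lookups for absent rows or columns skip the disk read entirely.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=5):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)            # one byte per bit, for simplicity

    def _positions(self, key):
        for i in range(self.num_hashes):           # derive k hash functions by salting
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))   # True: no false negatives
print(bf.might_contain("org.example/contents:"))           # almost certainly False
```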

Refinement - Caching: Tablet servers use two levels of caching: the Scan Cache and the Block Cache. The Scan Cache is the higher-level cache; it caches the key-value pairs read from the SSTables and is useful when the same data is read over and over. The Block Cache is the lower-level cache; it caches SSTable blocks read from GFS and is useful when scanning the data, or when reading different columns in the same locality group within the same row.
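A rough sketch of the two levels; the cache keys and capacities are invented for illustration. The Scan Cache is keyed by cell, the Block Cache by (SSTable, block), which is why they help different access patterns.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache used for both levels in this sketch."""
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)                 # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)          # evict the least recently used entry

scan_cache = LRUCache(capacity=10_000)    # (row, column, timestamp) -> value
block_cache = LRUCache(capacity=1_000)    # (sstable_name, block_index) -> raw block bytes

scan_cache.put(("com.cnn.www", "contents:", 6), b"<html>...")
block_cache.put(("sstable-0042", 17), b"...a 64 KB block read from GFS...")
print(scan_cache.get(("com.cnn.www", "contents:", 6)) is not None)   # repeated reads hit here
```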

Refinement - Immutability: SSTables are immutable, which simplifies caching and means no synchronization of accesses is needed when reading from SSTables. Garbage collection of obsolete SSTables is done by the master. On a tablet split, the child tablets share the SSTables of the parent. Only the memtable is accessed by both reads and writes, but it is designed to allow concurrent reads and writes. Based on "Google Bigtable - CSE, IIT Bombay".

Questions?