1 Bigtable: A Distributed Storage System for Structured Data

2 Credit
Based on the paper by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, and on the following presentations: a talk by Jeffrey Dean at UW in September 2005; "Big Table: A Distributed Storage System for Structured Data" by Pouria Pirzadeh and Vandana Ayyalasomayajula of the University of California, Irvine (BigtableLacture); and "Google Bigtable - CSE, IIT Bombay" by S. Sudarshan (bigtable-uw-presentaion).

3 What we do today
Motivation, Overview, Data Model, Client API, Building Blocks, Implementation.

4 Motivation
Google's scale is huge: petabytes of data and many incoming requests.
Very different demands in data size, workloads, and configurations.
No commercial system is big enough; even if there were one, it would be expensive to use and might have made design choices that don't fit Google's requirements.
Huge scale example: Personalized Search records user queries and clicks across a variety of Google properties such as web search, images, and news; users can browse their search histories and ask for personalized search results based on their historical Google usage patterns. Very different demands: data sizes range from URLs to web pages to satellite imagery, and workloads range from throughput-oriented batch processing (Google Earth uses a table to store raw imagery, which is processed later) to latency-sensitive serving of data to end users. Based on "Big Table: A Distributed Storage System for Structured Data".

5 Overview
Bigtable is widely applicable: it is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth.
Scalable: handles petabytes of data across thousands of servers.
Built from components already existing in Google: commodity servers and Google technologies.
Provides only a simple data model (not a full relational/SQL model); instead it supports dynamic control over data layout and format: clients can control the locality of data through their choice of schema, and the data location (disk or memory) through schema parameters.

6 Data model
Bigtable is divided into clusters, each containing a set of tables.
A table is a map (row, column, timestamp) -> data.
Row names and column names are arbitrary strings; the data is an uninterpreted array of bytes.
Figure taken from the paper. Based on "Big Table: A Distributed Storage System for Structured Data".
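As a rough illustration only (not the real Bigtable API), the data model can be pictured as a sorted map keyed by (row, column, timestamp). The row and column names below follow the paper's Webtable example; the dict and helper are a toy sketch.

```python
# Conceptual sketch of the Bigtable data model: a sparse, sorted,
# multi-dimensional map  (row, column, timestamp) -> uninterpreted bytes.
webtable = {
    ("com.cnn.www", "contents:", 3): b"<html>...",   # page contents at crawl time 3
    ("com.cnn.www", "contents:", 5): b"<html>...",   # newer version of the same cell
    ("com.cnn.www", "anchor:cnnsi.com", 9): b"CNN",  # anchor text from a referring site
}

def read_latest(table, row, column):
    """Return the most recent version of a cell, or None if the cell is absent."""
    versions = [(ts, value) for (r, c, ts), value in table.items()
                if (r, c) == (row, column)]
    return max(versions)[1] if versions else None

print(read_latest(webtable, "com.cnn.www", "contents:"))  # newest contents version
```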

7 Rows and Tablets
Rows are kept in lexicographic order.
The rows are partitioned into tablets; each tablet consists of a specific row range and is used as the basic unit for distribution and load balancing (more on this later), as sketched below.
Operations on a single row are atomic.
Figure taken from the paper.
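A minimal sketch of row-range partitioning, assuming hypothetical tablet boundaries: because rows are sorted, finding the tablet for a row key is a binary search over the tablets' end rows.

```python
import bisect

# Hypothetical end rows of four tablets of one table (rows sorted lexicographically).
tablet_end_rows = ["com.cnn.www", "com.google.www", "org.wikipedia.www", "\xff"]

def tablet_for_row(row_key):
    """Return the index of the tablet whose row range contains row_key."""
    return bisect.bisect_left(tablet_end_rows, row_key)

print(tablet_for_row("com.example.www"))  # -> 1 (sorts after com.cnn.www)
```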

8 Columns and Column Families
Columns are grouped into sets called column families, which are the basic unit of access control. Families rarely change and can contain one column or many.
A column name has the form family:qualifier.
Access control and both disk and memory accounting are performed at the column-family level (see the sketch below).
In the Webtable example, the language family contains only one column key, the page's language ID; in the anchor family, each column key represents a single anchor: the qualifier is the name of the referring site and the cell contents is the link text.
Figure taken from the paper.
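A small sketch of family:qualifier naming and per-family access control. The families follow the Webtable example, but the ACL table and helper functions are purely hypothetical.

```python
# Sketch: a column key is "family:qualifier"; access control is enforced per family.
def column_family(column_key):
    family, _, _qualifier = column_key.partition(":")
    return family

family_acl = {"language": {"read"}, "anchor": {"read", "write"}}  # hypothetical ACLs

def may_write(column_key):
    return "write" in family_acl.get(column_family(column_key), set())

print(column_family("anchor:cnnsi.com"))  # -> anchor
print(may_write("language:"))             # -> False (read-only family in this toy)
```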

9 Timestamps
Each cell in a Bigtable can contain multiple versions of the same data: different versions, taken at different times, indexed by timestamp.
Bigtable provides automatic garbage collection for this data, with two per-column-family settings (sketched below): keep only the last n versions of a cell, or keep only new-enough versions.
In the Webtable example, the timestamps of the contents: column are the times at which these page versions were actually crawled, and the garbage-collection mechanism lets us keep only the most recent three versions of every page.
Figure taken from the paper.
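A toy sketch of the two garbage-collection settings described above (the function and its defaults are illustrative, not Bigtable's implementation).

```python
# Keep only the last N versions of a cell, and optionally only new-enough ones.
def gc_versions(versions, keep_last_n=3, min_timestamp=None):
    """versions: list of (timestamp, value); returns the versions to retain."""
    kept = sorted(versions, key=lambda tv: tv[0], reverse=True)[:keep_last_n]
    if min_timestamp is not None:
        kept = [(ts, v) for ts, v in kept if ts >= min_timestamp]
    return kept

versions = [(1, b"v1"), (2, b"v2"), (3, b"v3"), (4, b"v4")]
print(gc_versions(versions))  # keeps only the three most recent versions
```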

10 Client API
Manage the schema: create and delete tables and column families; change cluster, table, and column family metadata.
Basic commands: write or delete values, look up values from individual rows, and scan a subset of the data in a table (a write sketch follows this slide).
The Scanner abstraction iterates over multiple column families and offers several mechanisms for limiting the rows, columns, and timestamps produced by a scan.
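The real client API is a C++ library; the Python-style sketch below only mirrors the shape of the paper's row-mutation example. RowMutation and apply_mutation here are hypothetical stand-ins, not the actual API.

```python
# Hedged sketch of a row mutation: set and delete columns of a single row,
# then apply them atomically (single-row atomicity, per the paper).
class RowMutation:
    def __init__(self, table, row_key):
        self.table, self.row_key, self.ops = table, row_key, []
    def set(self, column, value):
        self.ops.append(("set", column, value))
    def delete(self, column):
        self.ops.append(("delete", column))

def apply_mutation(mutation):
    # All ops in one RowMutation target one row, so the update is atomic.
    print(f"atomically applying {mutation.ops} to row {mutation.row_key}")

r1 = RowMutation("webtable", "com.cnn.www")
r1.set("anchor:www.c-span.org", b"CNN")
r1.delete("anchor:www.abc.com")
apply_mutation(r1)
```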

11 Client API
More advanced commands:
Single-row transactions, which perform atomic read-modify-write sequences on data stored under a single row key.
Using cells as integer counters (sketched below).
Execution of client-supplied scripts in the address spaces of the servers, written in a language developed at Google; scripts cannot write back into Bigtable, but they do allow various forms of data transformation, filtering based on arbitrary expressions, and summarization via a variety of operators.
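A toy sketch of using a cell as an integer counter via a single-row read-modify-write. The per-row lock is only a stand-in for Bigtable's single-row atomicity guarantee; the table and column names are hypothetical.

```python
import threading

row_locks = {"com.cnn.www": threading.Lock()}
table = {("com.cnn.www", "stats:hits"): 0}

def increment(row, column, delta=1):
    """Atomically read, modify, and write one cell of one row."""
    with row_locks[row]:                       # all-or-nothing within a single row
        table[(row, column)] = table.get((row, column), 0) + delta
        return table[(row, column)]

print(increment("com.cnn.www", "stats:hits"))  # -> 1
```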

12 Building Blocks
Google File System (GFS): a large-scale distributed file system, used to store the on-disk files.
A cluster management system: a Bigtable cluster runs in a pool of machines, and such a pool usually runs other applications as well, so Bigtable depends on the cluster management system to schedule jobs, deal with machine failures, monitor machine status, and more.

13 Building Blocks
Chubby: a highly available and persistent distributed lock service. It can store directories and small files; each directory or file can be used as a lock, and reads and writes to a file are atomic.
Used for storing the Bigtable schema information, tracking the master and the tablet servers, and more.

14 Implementation - Master
Three major components: a client library, one master server, and many tablet servers.
Master server: one per cluster. It manages the tablet servers (assigns tablets to tablet servers, balances tablet-server load, detects the addition and expiration of tablet servers), garbage-collects files in GFS, and handles schema changes such as table and column family creations.
Based on "Big Table: A Distributed Storage System for Structured Data".

15 Implementation - Tablet Server
A tablet server manages a set of tablets: it handles read and write requests to the tablets that it has loaded, and splits tablets that have grown too large.
Clients communicate with tablet servers directly, so the master is lightly loaded.
Tablet servers can be dynamically added to or removed from a cluster, and tablets move between servers.
Based on "Big Table: A Distributed Storage System for Structured Data".

16 Tablet Location
Given a row key, how can clients find its tablet?
One approach: ask the master server. Problem: the master becomes a bottleneck in a large system.
Instead: use a special table containing tablet location information. But how can we find this special table?
Since tablets move around from server to server, given a row, how do clients find the right server? Based on Jeff Dean's lecture.

17 Tablet Location – Cont.
A 3-level hierarchy stores tablet locations: the METADATA table contains the locations of user tablets, the root tablet contains the locations of the METADATA tablets, and one file in Chubby holds the location of the root tablet. A METADATA row key encodes the tablet's table ID and its end row.
The client library caches tablet locations, moves up the hierarchy when a cached location turns out to be stale, and also prefetches tablet locations for range queries (see the sketch below).
If the client's cache is stale, the location algorithm can take up to six round-trips, because stale cache entries are only discovered upon misses (assuming that METADATA tablets do not move very frequently). Tablet locations are stored in memory.
Figure taken from the paper. Based on Jeff Dean's lecture.
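A minimal sketch of the three-level lookup with a client-side cache. The toy dictionaries below stand in for the Chubby file, the root tablet, and a METADATA tablet; names like "meta1" and "user-tablet@server-7" are invented for illustration.

```python
CHUBBY_FILE = "root"                                   # Chubby file names the root tablet
TABLETS = {
    "root":  {"METADATA:com.cnn.www": "meta1"},        # root tablet -> METADATA tablets
    "meta1": {"webtable:com.cnn.www": "user-tablet@server-7"},  # METADATA -> user tablets
}

location_cache = {}

def locate_tablet(table_id, row_key):
    """At most three lookups: Chubby file, root tablet, METADATA tablet."""
    key = f"{table_id}:{row_key}"
    if key not in location_cache:                      # walk the hierarchy on a miss
        root = TABLETS[CHUBBY_FILE]                    # 1. Chubby -> root tablet
        meta = TABLETS[root[f"METADATA:{row_key}"]]    # 2. root -> METADATA tablet
        location_cache[key] = meta[key]                # 3. METADATA row -> tablet location
    return location_cache[key]

print(locate_tablet("webtable", "com.cnn.www"))        # -> user-tablet@server-7
```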

18 Editing a table
A tablet is made of three pieces (see the sketch below):
Memtable: a sorted buffer in memory, mapping (row, column, timestamp) -> data.
SSTables in GFS: immutable, ordered maps from (row, column, timestamp) to data; keys and values are arbitrary byte strings. An SSTable provides lookup of a specified key and iteration over all key/value pairs in a specified key range; a lookup requires only one read from disk, since the index is kept in memory, and an SSTable can also be loaded entirely into memory.
Tablet log in GFS: an append-only commit log.
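A toy, in-memory sketch of the SSTable interface described above: an immutable sorted map with point lookup and range iteration. The real SSTable is an on-disk file format with a block index; this class only illustrates the behavior.

```python
import bisect

class SSTable:
    """Immutable, sorted map from byte-string keys to byte-string values (toy)."""
    def __init__(self, items):
        self._items = sorted(items)                    # list of (key, value), sorted by key
        self._keys = [k for k, _ in self._items]       # stands in for the in-memory index

    def lookup(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._items[i][1]
        return None

    def scan(self, start, end):
        """Iterate over all key/value pairs with start <= key < end."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._items[i]
            i += 1

sst = SSTable([(b"com.cnn.www", b"<html>..."), (b"org.apache.www", b"<html>...")])
print(sst.lookup(b"com.cnn.www"))
```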

19 Write
When a write operation arrives at a tablet server, the server checks that it is well-formed and that the sender is authorized to perform the mutation.
A valid mutation is written to the commit log (the tablet log in GFS); after the write has been committed, its contents are inserted into the memtable.
Deletes are represented by special deletion entries; the data itself is deleted later, during compaction.
(Figure: the write request flows into the memtable in memory, with the tablet log and SSTables in GFS.) A sketch of this path follows.
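A minimal sketch of the write path, under the simplifying assumption that the commit log and memtable are plain Python structures; DELETED is a toy deletion marker.

```python
# Write path: log the mutation first, then insert it into the memtable.
DELETED = object()   # special deletion entry; data is only removed at compaction time

tablet_log = []      # append-only commit log (stored in GFS in the real system)
memtable = {}        # in-memory buffer keyed by (row, column, timestamp)

def write(row, column, timestamp, value):
    tablet_log.append(("write", row, column, timestamp, value))   # 1. commit log
    memtable[(row, column, timestamp)] = value                    # 2. memtable

def delete(row, column, timestamp):
    tablet_log.append(("delete", row, column, timestamp))
    memtable[(row, column, timestamp)] = DELETED                  # marker, not removal

write("com.cnn.www", "anchor:cnnsi.com", 1, b"CNN")
delete("com.cnn.www", "anchor:cnnsi.com", 2)
```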

20 Read
When a read operation arrives at a tablet server, the server checks for well-formedness and proper authorization.
A valid read operation is executed on a merged view of the sequence of SSTables and the memtable; since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently.
Incoming read and write operations can continue while tablets are split.
(Figure: the read request is answered from the memtable in memory and the SSTables in GFS.) A sketch of this merged read follows.
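A self-contained sketch of the merged read: the sorted memtable and sorted SSTables are merged into one stream, and the newest non-deleted version of the requested cell is returned. Data layout and the DELETED marker follow the write-path sketch above.

```python
import heapq

DELETED = object()   # deletion marker, as in the write-path sketch

memtable = {("com.cnn.www", "anchor:cnnsi.com", 2): DELETED}
sstables = [[(("com.cnn.www", "anchor:cnnsi.com", 1), b"CNN")]]   # each sorted by key

def read_cell(row, column, memtable, sstables):
    merged = heapq.merge(sorted(memtable.items()), *sstables,
                         key=lambda entry: entry[0])              # one sorted stream
    versions = [(ts, v) for (r, c, ts), v in merged if (r, c) == (row, column)]
    if not versions:
        return None
    latest = max(versions, key=lambda tv: tv[0])[1]               # newest version wins
    return None if latest is DELETED else latest

print(read_cell("com.cnn.www", "anchor:cnnsi.com", memtable, sstables))  # -> None
```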

21 What is the Tablet Log for?
Tablet recovery: when a tablet server crashes, its tablets are moved to other tablet servers. The SSTables are on disk, so they survive, but what about the memtable?
The new tablet server needs to reconstruct the memtable. This is done by applying all of the updates in the tablet log that have not been written to an SSTable yet, as in the sketch below.
The locations of the tablet's list of SSTables and of its tablet log are kept in the METADATA table.
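A toy sketch of memtable reconstruction: replay the commit-log entries that were not yet flushed to an SSTable. The redo point is simplified to a log index, and log entries follow the write-path sketch above.

```python
DELETED = object()

def recover_memtable(tablet_log, redo_point=0):
    """Rebuild the memtable from un-flushed commit-log entries."""
    memtable = {}
    for entry in tablet_log[redo_point:]:
        if entry[0] == "write":
            _, row, col, ts, value = entry
            memtable[(row, col, ts)] = value
        else:                                   # deletion entry
            _, row, col, ts = entry
            memtable[(row, col, ts)] = DELETED
    return memtable

log = [("write", "com.cnn.www", "contents:", 1, b"<html>..."),
       ("delete", "com.cnn.www", "contents:", 1)]
print(recover_memtable(log))
```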

22 Minor Compaction
Convert the memtable into an SSTable: when the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS (sketched below).
Goals: reduce memory usage, and reduce the amount of data that has to be read from the commit log during recovery.
Incoming read and write operations can continue while compactions occur.
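A minimal sketch of a minor compaction, assuming a toy size threshold and using a sorted list as a stand-in for an SSTable file in GFS.

```python
MEMTABLE_THRESHOLD = 4   # hypothetical limit, counted in entries for simplicity

def maybe_minor_compact(memtable, sstables):
    """Freeze a full memtable into a new SSTable and start a fresh memtable."""
    if len(memtable) < MEMTABLE_THRESHOLD:
        return memtable                          # nothing to do yet
    frozen = sorted(memtable.items())            # frozen memtable -> new SSTable
    sstables.append(frozen)
    return {}                                    # fresh, empty memtable

memtable = {("r", "contents:", ts): b"v" for ts in range(4)}
sstables = []
memtable = maybe_minor_compact(memtable, sstables)
print(len(sstables), len(memtable))              # -> 1 0
```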

23 Merging Compaction
Convert the memtable and some of the SSTables into a single new SSTable.
Purpose: reduce the number of SSTables; without it, read operations might need to merge updates from an arbitrary number of SSTables.
Executed periodically. The input SSTables and memtable are discarded when the compaction has finished.

24 Major Compaction
Convert the memtable and all the SSTables into exactly one new SSTable that contains no deletion records, only live data (see the sketch below).
This allows Bigtable to reclaim resources used by deleted data, and ensures that deleted data disappears from the system in a timely manner.
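A toy sketch covering both merging and major compaction: several sorted inputs are merged into one output SSTable, and only a major compaction drops the deletion entries. The inputs here are simple sorted lists, not real SSTable files.

```python
DELETED = object()

def compact(sstables, major=False):
    """Merge sorted inputs (oldest first) into one SSTable; drop deletes if major."""
    merged = {}
    for sstable in sstables:                    # newer tables come last
        for key, value in sstable:
            merged[key] = value                 # the newest write for a key wins
    if major:
        merged = {k: v for k, v in merged.items() if v is not DELETED}
    return sorted(merged.items())               # the single output SSTable

old = [(("r", "c:", 1), b"v1")]
new = [(("r", "c:", 1), DELETED)]
print(compact([old, new], major=True))          # -> []  (deleted data is gone)
```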

25 Questions?

26 Refinement – Locality Groups
Remember column families? Locality groups group together several column families, and each locality group is kept in a separate SSTable.
This allows more efficient reads: each read accesses smaller SSTables.
A locality group can be declared in-memory instead of on-disk, which is useful for small locality groups (see the sketch below).
For example, page metadata in Webtable (such as language and checksums) can be in one locality group, and the contents of the page can be in a different group: an application that wants to read the metadata does not need to read through all of the page contents. SSTables for in-memory locality groups are loaded lazily into the memory of the tablet server; once loaded, column families that belong to such locality groups can be read without accessing the disk. This feature is useful for small pieces of data that are accessed frequently: it is used internally for the location column family in the METADATA table.
Based on "Big Table: A Distributed Storage System for Structured Data".
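A small sketch of the idea, assuming hypothetical group names and an in_memory flag: column families are assigned to locality groups, and each group gets its own SSTable, so reading metadata never touches the (much larger) contents SSTable.

```python
# Hypothetical locality-group configuration, loosely following the Webtable example.
locality_groups = {
    "metadata": {"families": ["language", "checksum"], "in_memory": True},
    "contents": {"families": ["contents"], "in_memory": False},
}

def group_for_family(family):
    """Return the locality group (and hence the SSTable set) for a column family."""
    for name, group in locality_groups.items():
        if family in group["families"]:
            return name
    return "default"

print(group_for_family("language"))   # -> metadata: read without touching contents
```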

27 Refinement – Bloom Filters
A read operation has to read from all the SSTables of the tablet, which can result in many disk accesses.
Clients can create Bloom filters for the SSTables of a specific locality group; the filters are kept in memory.
A Bloom filter is a probabilistic data structure used to test whether an element is a member of a set: it can produce false positives but never false negatives, and it does not allow deletions. A toy implementation is sketched below.
Figure taken from Wikipedia.
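A self-contained toy Bloom filter (sizes and hash scheme are arbitrary choices, not Bigtable's): it answers "definitely not present" or "possibly present", so most lookups for absent keys can skip the SSTable on disk entirely.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: false positives possible, false negatives impossible."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits)            # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))   # -> True
print(bf.might_contain("com.example.www/contents:"))      # -> almost certainly False
```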

28 Refinement - Caching
Two levels of caching on the tablet server:
Scan cache: a higher-level cache that holds the key-value pairs read from the SSTables; useful when the same data is read over and over.
Block cache: a lower-level cache that holds SSTable blocks read from GFS; useful when scanning the data, or when reading different columns in the same locality group within the same row.

29 Refinement - Immutability
SSTables are immutable. This simplifies caching, and no synchronization of accesses is needed when reading from SSTables.
Garbage collection of obsolete SSTables is done by the master.
On a tablet split, the child tablets share the SSTables of the parent.
Only the memtable is accessed by both reads and writes, and it allows concurrent read/write access.
Based on "Google Bigtable - CSE, IIT Bombay".

30 Questions?

