Big Table Alon pluda.

(Big) Table of contents
• Introduction
• Data model
• API
• Building blocks
• Implementation
• Refinements
• Performance
• Applications
• Conclusions

Introduction
• Bigtable is a distributed storage system for managing structured data. It is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
• Many projects at Google store data in Bigtable: web indexing, Google Earth, Google Finance…
• These applications place very different demands on Bigtable:
– in terms of data size (from URLs to web pages to satellite imagery)
– in terms of latency requirements (from backend bulk processing to real-time data serving)
• It is not a relational database. It is a sparse, distributed, persistent, multi-dimensional sorted map (a key/value store).

Data model
[Figure: example table. Row keys are reversed hostnames: com.cnn.www, com.bbc.www, il.co.ynet.www, org.apache.hbase, org.apache.hadoop. Column families: "content:" and "language:" (column families with one column key) and "anchor:" (a column family with multiple column keys: anchor:my.look.ca, anchor:cnnsi.com). Cell values shown: <html> at timestamps t1, t2, t3; EN; anchor1; anchor2.]
• Every cell is an uninterpreted array of bytes.
• Two per-column-family settings control automatic garbage collection:
– keep only the last n versions of a cell
– keep only new-enough versions
• The row range of a table is dynamically partitioned into tablets.
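The data model above can be pictured as a sorted map from (row, column, timestamp) to bytes. The following is a minimal in-memory sketch of that idea, not Bigtable's implementation: the class name ToyTable and all of its methods are hypothetical, and KeepLastN only illustrates the "last n versions" garbage-collection setting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical in-memory model of one table: (row, column) -> versions,
// where the versions of a cell are sorted by timestamp, newest first.
class ToyTable {
 public:
  using Versions =
      std::map<std::int64_t, std::string, std::greater<std::int64_t>>;

  // Store an uninterpreted byte string under (row, column, timestamp).
  void Put(const std::string& row, const std::string& column,
           std::int64_t timestamp, std::string value) {
    assert(timestamp >= 0);  // assumption of this sketch only
    cells_[{row, column}][timestamp] = std::move(value);
  }

  // Return the newest version of a cell, or nullptr if the cell is empty.
  const std::string* GetLatest(const std::string& row,
                               const std::string& column) const {
    auto it = cells_.find({row, column});
    if (it == cells_.end() || it->second.empty()) return nullptr;
    return &it->second.begin()->second;
  }

  // Garbage collection: keep only the newest n versions of every cell.
  void KeepLastN(std::size_t n) {
    for (auto& entry : cells_) {
      Versions& versions = entry.second;
      auto it = versions.begin();
      for (std::size_t kept = 0; kept < n && it != versions.end(); ++kept) {
        ++it;
      }
      versions.erase(it, versions.end());
    }
  }

  // Number of stored versions of one cell.
  std::size_t VersionCount(const std::string& row,
                           const std::string& column) const {
    auto it = cells_.find({row, column});
    return it == cells_.end() ? 0 : it->second.size();
  }

 private:
  // std::map keeps its keys sorted, so rows are held in lexicographic order.
  std::map<std::pair<std::string, std::string>, Versions> cells_;
};
```

Because the keys are sorted, rows with a common prefix (for example all rows starting with com.cnn) sit next to each other, which is why the slide's row keys are reversed hostnames.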

API
The Bigtable API provides functions for:
• Creating and deleting tables and column families
• Changing cluster, table, and column family metadata
It also:
• Supports single-row transactions
• Allows cells to be used as integer counters
• Allows client-supplied scripts to be executed in the address space of the servers

API
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:…", "CNN");
r1.Delete("anchor:…");
Operation op;
Apply(&op, &r1);

API
// Print every version of every anchor column of one row
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}

Building blocks
Bigtable is built on several other pieces of Google infrastructure:
• Google File System (GFS): Bigtable uses the distributed Google File System to store metadata, data, and log files.
• Google SSTable (sorted string table) file format: Bigtable data is stored internally in SSTable files. An SSTable provides a persistent, ordered, immutable map from keys to values.
• Google Chubby: Bigtable relies on a highly available and persistent distributed lock service called Chubby.

Building blocks
Google File System (GFS):
• A Bigtable cluster typically runs in a shared pool of machines that also run a wide variety of other distributed applications.
• Bigtable depends on a cluster management system for scheduling jobs, dealing with machine failures, and monitoring machine status.
• Three kinds of Bigtable files are stored in GFS:
– Log files: commit logs that record tablet mutations (one commit log per tablet server, shared by all of its tablets)
– Data: the contents of each tablet (stored in SSTable files)
– METADATA: tablet locations (also stored in SSTable files)

Building blocks
Google SSTable file format:
• An SSTable contains a sequence of blocks (typically 64 KB each)
• A block index, stored at the end of the file, is used to locate blocks
• The index is loaded into memory when the SSTable is opened
• A lookup is performed with a single disk seek:
– find the appropriate block by a binary search in the in-memory index
– read that block from disk
• Optionally, an SSTable can be completely mapped into memory
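The single-seek lookup described above comes down to a binary search over the in-memory block index. The sketch below assumes a hypothetical index layout (one entry per block, holding that block's last key and its file offset); the real SSTable index format is not given in these slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical index entry: the last key stored in a block and the
// block's byte offset within the SSTable file.
struct BlockIndexEntry {
  std::string last_key;
  std::size_t offset;
};

// Returns the position in `index` of the only block that can contain `key`,
// or index.size() if `key` is greater than every key in the file.
// `index` must be sorted by last_key, which holds for a sorted file.
std::size_t FindBlock(const std::vector<BlockIndexEntry>& index,
                      const std::string& key) {
  assert(std::is_sorted(index.begin(), index.end(),
                        [](const BlockIndexEntry& a, const BlockIndexEntry& b) {
                          return a.last_key < b.last_key;
                        }));
  // First block whose last key is >= key: no earlier block can hold it.
  auto it = std::lower_bound(
      index.begin(), index.end(), key,
      [](const BlockIndexEntry& entry, const std::string& k) {
        return entry.last_key < k;
      });
  return static_cast<std::size_t>(it - index.begin());
}
```

Once the block position is known, the reader seeks to that entry's offset and reads one block, which is the single disk seek mentioned on the slide.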

Building blocks
Google Chubby:
• A Chubby service consists of five active replicas, one of which is elected to be the master and actively serves requests
• Chubby provides a namespace that contains directories and small files (less than 256 KB)
– Each directory or file can be used as a lock
– Reads and writes to a file are atomic
– The Chubby client library provides consistent caching of Chubby files
– Each Chubby client maintains a session with a Chubby service
• Bigtable uses Chubby files for many tasks:
– To ensure there is at most one active master at any time
– To store the bootstrap location of Bigtable data (the root tablet)
– To discover tablet servers and finalize tablet server deaths
– To store Bigtable schema information (column family information for each table)
– To store access control lists (ACLs)

Implementation
1. Master server
– Assigns tablets to tablet servers
– Detects the addition and expiration of tablet servers
– Balances tablet server load
– Garbage-collects files in GFS
– Handles schema changes (table creation, column family creation/deletion)
2. Tablet server
– Manages a set of tablets
– Handles read and write requests to its tablets
– Splits tablets that have grown too large (100-200 MB)
3. Client
– Does not rely on the master for tablet location information
– Communicates directly with tablet servers for reads and writes

Implementation: tablet assignment
• Each tablet is assigned to one tablet server at a time
• The master keeps track of
– the set of live tablet servers (tracked via Chubby, under the “servers” directory)
– the current assignment of tablets to tablet servers
– the currently unassigned tablets (found by scanning the B+ tree)
• When a tablet is unassigned, the master assigns it to an available tablet server by sending a tablet load request to that server

Implementation: master startup
• When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them:
1. The master grabs a unique master lock in Chubby to prevent concurrent master instantiations
2. The master scans the “servers” directory in Chubby to find the live tablet servers
3. The master communicates with every live tablet server to discover which tablets are already assigned to each server
4. The master adds the root tablet to the set of unassigned tablets if an assignment for the root tablet was not discovered in step 3
5. The master scans the METADATA table to learn the set of tablets (and to detect unassigned tablets)
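The outcome of the startup steps above is a set difference: every tablet listed in METADATA minus every tablet some live server already reports. A minimal sketch, assuming the results of steps 3 and 5 are already available as plain containers; the function name FindUnassigned and its parameter types are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// loaded_by_server: live tablet server -> tablets it reports (step 3).
// all_tablets: every tablet found by scanning METADATA (step 5).
// Returns the tablets that no live server currently holds.
std::set<std::string> FindUnassigned(
    const std::map<std::string, std::vector<std::string>>& loaded_by_server,
    const std::set<std::string>& all_tablets) {
  // Collect everything that is already assigned somewhere.
  std::set<std::string> assigned;
  for (const auto& server : loaded_by_server) {
    assigned.insert(server.second.begin(), server.second.end());
  }
  // Whatever METADATA lists but nobody reports is unassigned.
  std::set<std::string> unassigned;
  for (const std::string& tablet : all_tablets) {
    if (assigned.count(tablet) == 0) unassigned.insert(tablet);
  }
  return unassigned;
}
```

Each tablet in the returned set would then receive a tablet load request, as described on the tablet assignment slide.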

Implementation: tablet serving
Writing:
• The server checks that the request is well-formed
• The server checks that the sender is authorized (the list of permitted writers is kept in a Chubby file)
• A valid mutation is written to the commit log
• After the write has been committed, its contents are inserted into the memtable

Implementation: tablet serving
Reading:
• The server checks that the request is well-formed
• The server checks that the sender is authorized (the list of permitted readers is kept in a Chubby file)
• A valid read operation is executed on a merged view of the sequence of SSTables and the memtable
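The write and read paths can be sketched with ordinary sorted maps. This is a simplified model under stated assumptions: ToyTablet, Write, and Read are hypothetical names, authorization and well-formedness checks are left out, and the commit log is just an in-memory vector rather than a file in GFS.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using SortedRun = std::map<std::string, std::string>;  // key -> value

// Hypothetical tablet state: a commit log, one mutable memtable, and a
// sequence of immutable SSTables ordered oldest first.
struct ToyTablet {
  std::vector<std::pair<std::string, std::string>> commit_log;
  SortedRun memtable;
  std::vector<SortedRun> sstables;
};

// Write path: record the mutation in the commit log first, then apply it
// to the memtable.
void Write(ToyTablet* tablet, const std::string& key,
           const std::string& value) {
  assert(tablet != nullptr);
  tablet->commit_log.emplace_back(key, value);
  tablet->memtable[key] = value;
}

// Read path: the memtable holds the most recent writes, so it is checked
// first, then the SSTables from newest to oldest.
// Returns true and fills *value if the key is found.
bool Read(const ToyTablet& tablet, const std::string& key,
          std::string* value) {
  assert(value != nullptr);
  auto hit = tablet.memtable.find(key);
  if (hit != tablet.memtable.end()) {
    *value = hit->second;
    return true;
  }
  for (auto run = tablet.sstables.rbegin(); run != tablet.sstables.rend();
       ++run) {
    auto it = run->find(key);
    if (it != run->end()) {
      *value = it->second;
      return true;
    }
  }
  return false;
}
```

Checking newer sources first gives the same answer as a full merge for point lookups, because a newer entry for a key always overrides an older one.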

Implementation: tablet serving
Recovery:
• The tablet server finds the commit log entries for this tablet by reading the METADATA table (searching the B+ tree)
• The tablet server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points

Implementation: compaction
Minor compaction:
• When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS
• It shrinks the memory usage of the tablet server
• It reduces the amount of data that has to be read from the commit log during recovery

Implementation: compaction
Merging compaction:
• Every minor compaction creates a new SSTable. If this continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables
• When the number of SSTables reaches its bound, the server reads the contents of a few SSTables and the memtable, and writes out a new SSTable

Implementation: compaction
Major compaction:
• It is a merging compaction that rewrites all SSTables into exactly one SSTable that contains no deletion information or deleted data
• Bigtable cycles through all of its tablets and regularly applies major compactions to them, so that resources used by deleted data are reclaimed in a timely fashion
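The compaction types differ mainly in what they merge. The sketch below models minor and major compaction on in-memory maps; the names Cell, ToyTablet, MinorCompaction, and MajorCompaction are hypothetical, deletions are modeled as a boolean tombstone flag, and having the major compaction flush the memtable first is a simplification of this sketch.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cell: either a live value or a deletion marker (tombstone).
struct Cell {
  bool deleted;
  std::string value;
};
using SortedRun = std::map<std::string, Cell>;

struct ToyTablet {
  SortedRun memtable;
  std::vector<SortedRun> sstables;  // oldest first
};

// Minor compaction: freeze the memtable and store it as a new SSTable.
void MinorCompaction(ToyTablet* tablet) {
  assert(tablet != nullptr);
  if (tablet->memtable.empty()) return;
  tablet->sstables.push_back(std::move(tablet->memtable));
  tablet->memtable.clear();  // start again with an empty memtable
}

// Major compaction: merge everything into exactly one SSTable, letting
// newer entries override older ones, then drop the deletion markers.
void MajorCompaction(ToyTablet* tablet) {
  assert(tablet != nullptr);
  MinorCompaction(tablet);  // simplification: flush the memtable first
  SortedRun merged;
  for (const SortedRun& run : tablet->sstables) {  // oldest to newest
    for (const auto& entry : run) merged[entry.first] = entry.second;
  }
  for (auto it = merged.begin(); it != merged.end();) {
    if (it->second.deleted) {
      it = merged.erase(it);  // deleted data is not carried forward
    } else {
      ++it;
    }
  }
  tablet->sstables.clear();
  tablet->sstables.push_back(std::move(merged));
}
```

A merging compaction would follow the same merge loop but over only a few SSTables, and it would have to keep the tombstones, since older SSTables outside the merge may still hold the deleted data.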

Refinements: caching for read performance
To improve read performance, tablet servers use two levels of caching:
• Scan Cache: a higher-level cache that caches key-value pairs returned by the SSTable interface
• Block Cache: a lower-level cache that caches SSTable blocks read from the file system
[Figure: reading K13 from an SSTable. Is K13 in the Scan Cache? If not, K13 is in block 1: is it in the Block Cache? If not, the block is read from the SSTable. Then get the K13 value.]
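The two cache levels can be sketched as two hash maps consulted in order. This is an illustrative model only: CachedReader is a hypothetical class, the "disk" is an in-memory map of block number to block, the caller supplies the block number that the SSTable index would normally provide, and cache eviction is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>

// A block is modeled as a small sorted map of key -> value.
using Block = std::map<std::string, std::string>;

// Hypothetical reader with a Scan Cache (key -> value) in front of a
// Block Cache (block number -> block). Eviction is omitted for brevity.
class CachedReader {
 public:
  explicit CachedReader(std::map<std::size_t, Block> disk)
      : disk_(std::move(disk)) {}

  // `block_no` is the block that the SSTable index says may hold `key`.
  // Returns true and fills *value if the key exists.
  bool Get(const std::string& key, std::size_t block_no, std::string* value) {
    assert(value != nullptr);
    auto cached = scan_cache_.find(key);
    if (cached != scan_cache_.end()) {  // Scan Cache hit
      *value = cached->second;
      return true;
    }
    auto block = block_cache_.find(block_no);
    if (block == block_cache_.end()) {  // Block Cache miss: go to disk
      auto on_disk = disk_.find(block_no);
      if (on_disk == disk_.end()) return false;
      ++disk_reads_;
      block = block_cache_.emplace(block_no, on_disk->second).first;
    }
    auto it = block->second.find(key);
    if (it == block->second.end()) return false;
    scan_cache_[key] = it->second;  // remember the pair for repeated reads
    *value = it->second;
    return true;
  }

  std::size_t disk_reads() const { return disk_reads_; }

 private:
  std::map<std::size_t, Block> disk_;
  std::unordered_map<std::string, std::string> scan_cache_;
  std::unordered_map<std::size_t, Block> block_cache_;
  std::size_t disk_reads_ = 0;
};
```

The Scan Cache helps when the same keys are read repeatedly, while the Block Cache helps when nearby keys are read (for example K13 followed by K14 from the same block).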

Refinements: Bloom filters
• A read operation has to read from all of the SSTables that make up the state of a tablet
• If these SSTables are not in memory, we may end up doing many disk accesses
• We reduce the number of accesses by using Bloom filters
• A Bloom filter tells us whether an SSTable might contain data for a specified row/column pair: a "no" is always correct (no false negatives), and a "yes" is correct with high probability
• Bloom filters use only a small amount of memory
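A small Bloom filter illustrates why a negative answer can be trusted while a positive one cannot. This is a generic textbook sketch, not Bigtable's filter: the class name, the double-hashing scheme built on std::hash, and the "#salt" suffix are all choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class BloomFilter {
 public:
  BloomFilter(std::size_t num_bits, std::size_t num_hashes)
      : bits_(num_bits, false), num_hashes_(num_hashes) {
    assert(num_bits > 0 && num_hashes > 0);
  }

  // Set the bit at every hash position of the key.
  void Add(const std::string& key) {
    for (std::size_t i = 0; i < num_hashes_; ++i) {
      bits_[Position(key, i)] = true;
    }
  }

  // false: the key was definitely never added (no false negatives).
  // true: the key was probably added (false positives are possible).
  bool MayContain(const std::string& key) const {
    for (std::size_t i = 0; i < num_hashes_; ++i) {
      if (!bits_[Position(key, i)]) return false;
    }
    return true;
  }

 private:
  // Derives the i-th bit position from two base hashes (double hashing).
  std::size_t Position(const std::string& key, std::size_t i) const {
    const std::size_t h1 = std::hash<std::string>{}(key);
    const std::size_t h2 = std::hash<std::string>{}(key + "#salt");
    return (h1 + i * h2) % bits_.size();
  }

  std::vector<bool> bits_;
  std::size_t num_hashes_;
};
```

With one such filter per SSTable, a read for a row/column pair can skip every SSTable whose filter answers false, so most lookups for absent data never touch the disk.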

Performance
Test cluster:
• 500 tablet servers
– Configured to use 1 GB RAM
– Dual-core Opteron 2 GHz, Gigabit Ethernet NIC
– Write to a GFS cell (1786 machines with 2 x 400 GB IDE drives)
• 500 clients
• Network round-trip time between any pair of machines < 1 millisecond
[Figure: 500 clients, 500 tablet servers, and 1786 GFS servers]

Performance
Benchmarks:
• Sequential writes: R row keys are partitioned and assigned to N clients
• Sequential reads: R row keys are partitioned and assigned to N clients
• Random writes: similar to sequential writes, except that the row key is hashed modulo R
• Random reads: similar to sequential reads, except that the row key is hashed modulo R
• Random reads (memory): similar to the random reads benchmark, except that the locality group containing the data is marked as in-memory
• Scans: similar to random reads, but uses the support provided by the Bigtable API for scanning over all values in a row range (which reduces the number of RPCs)
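The difference between the sequential and random benchmarks is only in how each client picks its row keys. The sketch below assumes row keys are the integers 0..R-1 and that each of the N clients gets one contiguous range; the function names are hypothetical, and the mixing function (in the style of a MurmurHash3 finalizer) is an arbitrary choice, since the slides do not say which hash was used.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Sequential benchmarks: split row keys 0..R-1 into N contiguous ranges and
// return the half-open range [first, second) owned by `client`.
// Note: R * (client + 1) must fit in 64 bits.
std::pair<std::uint64_t, std::uint64_t> SequentialRange(std::uint64_t R,
                                                        std::uint64_t N,
                                                        std::uint64_t client) {
  assert(N > 0 && client < N);
  return {R * client / N, R * (client + 1) / N};
}

// Random benchmarks: hash the sequential key modulo R, so that consecutive
// operations land on unrelated rows spread over the whole row space.
std::uint64_t RandomRowKey(std::uint64_t sequential_key, std::uint64_t R) {
  assert(R > 0);
  std::uint64_t x = sequential_key;
  x ^= x >> 33;
  x *= 0xff51afd7ed558ccdULL;
  x ^= x >> 33;
  x *= 0xc4ceb9fe1a85ec53ULL;
  x ^= x >> 33;
  return x % R;
}
```

A client running the random-write benchmark would iterate over its SequentialRange and write to RandomRowKey(k, R) instead of k, which spreads the load across all tablets.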

Experience Characteristics of a few tables in production use:

Experience: Google Earth
• Google operates a collection of services that provide users with access to high-resolution satellite imagery of the world's surface
• Google Earth relies heavily on Bigtable to store its data
• It requires both low-latency responses and storage of very large amounts of data
