Bigtable: A Distributed Storage System for Structured Data Authors: Chang et al., Google Inc. Presenter: Victoria Cooper
Introduction Goal: build a distributed storage system for structured data with 1. Wide applicability 2. Scalability 3. High performance 4. High availability
Outline Data Model API Construction of Bigtable Implementation and refinements Evaluation Applications Conclusions
Data Model Dynamic control over data layout and format Clients can reason about locality properties Names used for indexing can be arbitrary strings Clients can dynamically control whether data is served out of memory or from disk
Data Model Map: (row:string, column:string, time:int64) -> string Row key: uninterpreted array of bytes Column key: uninterpreted array of bytes Timestamp: int64
Data Model: Figure 1 (a slice of an example table that stores web pages)
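The following is a minimal Python sketch (not Bigtable code) of the data model: an in-memory map keyed by (row, column, timestamp), loosely following the paper's Webtable example; the exact timestamps and values here are illustrative.

```python
# Toy illustration of the Bigtable map: (row, column, timestamp) -> value.
# Rows are reversed URLs, columns use the family:qualifier syntax.
webtable = {
    ("com.cnn.www", "contents:",          6): "<html>... newest crawl ...",
    ("com.cnn.www", "contents:",          5): "<html>... older crawl ...",
    ("com.cnn.www", "contents:",          3): "<html>... oldest crawl ...",
    ("com.cnn.www", "anchor:cnnsi.com",   9): "CNN",
    ("com.cnn.www", "anchor:my.look.ca",  8): "CNN.com",
}

def lookup(table, row, column):
    """Return the most recent value stored under (row, column)."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None

print(lookup(webtable, "com.cnn.www", "anchor:cnnsi.com"))  # -> "CNN"
```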
Rows: Tablets A tablet is a row range of a table Row ranges are dynamically partitioned Tablets are the unit of distribution and load balancing
Rows Reads of short row ranges are efficient and require communication with only a few machines Clients can choose their row keys so that reads get good locality
Columns Column families are the unit of access control A family must be created before data can be stored under a column key in it Tables have a small number of column families but may have an unbounded number of columns
Columns Column key syntax: family:qualifier Family names must be printable Qualifiers may be arbitrary strings
Columns: Example 1 Column family: language, the language a web page is written in Column key: stores each web page's language ID
Columns: Example 2 Family: anchor Each column key names a single anchor (referring link)
Timestamps Multiple versions of the same data, indexed by 64-bit integer timestamps Can be assigned by Bigtable (real time) or by the client Client-assigned timestamps must be unique to avoid collisions Stored in decreasing timestamp order, so the most recent version is read first
Timestamps Two per-column-family settings tell Bigtable to garbage-collect old cell versions automatically 1) Keep only the last n versions 2) Keep only versions that are new enough
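A small sketch (not Bigtable code) of the version behaviour described above: versions are kept in decreasing timestamp order so the most recent is read first, and garbage collection keeps only the last n versions.

```python
def add_version(versions, timestamp, value, keep_last_n=3):
    """versions: list of (timestamp, value) pairs, newest first."""
    versions.append((timestamp, value))
    versions.sort(key=lambda tv: tv[0], reverse=True)   # decreasing timestamps
    del versions[keep_last_n:]                          # GC: keep only the last n versions
    return versions

cell = []
for ts in (1, 4, 2, 7, 5):
    add_version(cell, ts, "value@%d" % ts)
print(cell[0])      # most recent version, read first: (7, 'value@7')
print(len(cell))    # only the last 3 versions survive garbage collection
```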
API Create and delete tables and column families Change cluster, table, and column family metadata (e.g., access control rights)
API: Figure 2 (writing to Bigtable)
API: Figure 3 (reading from Bigtable)
API Single-row transactions (atomic read-modify-write on a single row) Allows cells to be used as integer counters Execution of client-supplied scripts written in Sawzall Can be used as an input source and output target for MapReduce
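A hypothetical Python-flavored sketch in the spirit of the paper's C++ examples (Figures 2 and 3): a single-row mutation applied atomically, then a scan over one column family. The class and method names here are invented for illustration; the real Bigtable client library is C++.

```python
class RowMutation:
    """Collects set/delete operations to be applied atomically to one row."""
    def __init__(self, row):
        self.row, self.ops = row, []
    def set(self, column, value):
        self.ops.append(("set", column, value))
    def delete(self, column):
        self.ops.append(("delete", column, None))

class Table:
    def __init__(self):
        self.cells = {}            # (row, column) -> value
    def apply(self, mutation):     # all ops in one mutation are atomic for that row
        for op, column, value in mutation.ops:
            key = (mutation.row, column)
            if op == "set":
                self.cells[key] = value
            else:
                self.cells.pop(key, None)
    def scan(self, row, family):
        for (r, c), v in sorted(self.cells.items()):
            if r == row and c.startswith(family + ":"):
                yield c, v

t = Table()
m = RowMutation("com.cnn.www")
m.set("anchor:www.c-span.org", "CNN")
m.delete("anchor:www.abc.com")
t.apply(m)                                    # atomic single-row write
for column, value in t.scan("com.cnn.www", "anchor"):
    print(column, value)
```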
Building Blocks Google File System (GFS) Google SSTable (file format) Chubby (lock service)
Google File System (GFS) Stores Bigtable log and data files A Bigtable cluster runs in a shared pool of machines that also run many other distributed applications A cluster management system schedules jobs, manages resources, deals with machine failures, and monitors machine status
SSTable Stores Bigtable data A persistent, ordered, immutable map from keys to values Both keys and values are arbitrary byte strings
SSTable A sequence of blocks (typically 64 KB each) plus a block index used to locate blocks The index is loaded into memory when the SSTable is opened, so a lookup needs only a single disk seek
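A toy sketch of the lookup path just described (simplified, not the real file format): keys are sorted, data is split into fixed-size blocks, and an in-memory block index picks the one block to read from disk.

```python
import bisect

class SSTable:
    def __init__(self, sorted_items, block_size=4):      # 4 items stands in for a "64 KB block"
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # Block index: the first key of every block, kept in memory once the SSTable is opened.
        self.index = [block[0][0] for block in self.blocks]

    def lookup(self, key):
        i = bisect.bisect_right(self.index, key) - 1      # binary search in the in-memory index
        if i < 0:
            return None
        block = self.blocks[i]                            # one "disk seek" to fetch this block
        for k, v in block:
            if k == key:
                return v
        return None

sst = SSTable(sorted({"a": 1, "d": 2, "k": 3, "q": 4, "z": 5}.items()))
print(sst.lookup("k"))   # -> 3
```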
Chubby A highly available, persistent distributed lock service A Chubby cell consists of five replicas; one is elected master and serves requests
Chubby Namespace of directories and small files; each directory or file can be used as a lock, and reads and writes to a file are atomic
Bigtable and Chubby Bigtable uses Chubby to: Ensure there is at most one active master at a time Store the bootstrap location of Bigtable data Discover tablet servers and finalize tablet server deaths Store column family (schema) information Store access control lists
Implementation Three major components: A library linked into every client One master server, which assigns tablets to tablet servers Many tablet servers, each managing a set of tablets
Implementation Clients communicate directly with tablet servers for reads and writes Client data does not move through the master, so the master is lightly loaded
Hierarchy: a cluster stores a number of tables; each table consists of a set of tablets; each tablet contains all the data for a row range
Tablet Location: Figure 4 (the three-level tablet location hierarchy)
METADATA table Stores the location of each tablet under a row key Each METADATA row stores roughly 1 KB of data The client library caches tablet locations If a cached location is incorrect, the client moves up the hierarchy If the cache is empty, a lookup takes three network round-trips If the cache is stale, it can take up to six round-trips
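A sketch of the location lookup (hypothetical names, heavily simplified): a file in Chubby points at the root tablet, the root tablet points at METADATA tablets, and METADATA tablets point at user tablets; an empty client cache costs three round-trips.

```python
chubby_root_location = "tabletserver-1"                                          # level 0: Chubby file
root_tablet = {"metadata-tablet-7": "tabletserver-2"}                            # level 1: root tablet
metadata_tablets = {"metadata-tablet-7": {"user-tablet-42": "tabletserver-3"}}   # level 2: METADATA tablets

location_cache = {}   # the client library caches tablet locations it has seen

def locate_user_tablet(user_tablet):
    if user_tablet in location_cache:
        return location_cache[user_tablet]               # cache hit: no round-trips
    _ = chubby_root_location                             # round-trip 1: read the Chubby file
    meta_tablet = next(iter(root_tablet))                # round-trip 2: read the root tablet
    server = metadata_tablets[meta_tablet][user_tablet]  # round-trip 3: read a METADATA tablet
    location_cache[user_tablet] = server
    return server

print(locate_user_tablet("user-tablet-42"))   # "tabletserver-3", after three round-trips
print(locate_user_tablet("user-tablet-42"))   # served from the client-side cache
```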
Tablet Assignment When a tablet server starts, it creates and acquires an exclusive lock on a uniquely named file in a Chubby directory; the master monitors this directory to discover tablet servers A tablet server stops serving its tablets if it loses its exclusive lock It tries to reacquire the lock as long as its file still exists; if the file is gone, the server kills itself When a tablet server dies, it releases its lock so its tablets can be reassigned
Tablet Assignment It is the master's job to detect tablet servers that are no longer serving their tablets The master periodically asks each tablet server for the status of its lock If the server is unreachable or reports that it has lost its lock, the master tries to acquire an exclusive lock on the server's Chubby file If the master succeeds, the tablet server is dead or cut off from Chubby, so the master deletes the server's file and moves its tablets to the set of unassigned tablets
Tablet Assignment The set of existing tablets changes when: A table is created or deleted Two tablets are merged A tablet is split Tablet splits are initiated by a tablet server, which records the split in the METADATA table and notifies the master
Tablet Serving: Figure 5 (tablet representation)
Tablet Serving Write operation: the server checks that the request is well-formed and that the sender is authorized (against a list of permitted writers in a Chubby file); a valid mutation is written to the commit log and its contents are then inserted into the memtable Read operation: the server checks that the request is well-formed and authorized; a valid read is executed on a merged view of the SSTables and the memtable
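A toy sketch (heavily simplified assumption, not Bigtable code) of the serving path above: writes append to the commit log and then update the memtable; reads see a merged view of the memtable and the SSTables, with newer values shadowing older ones.

```python
class Tablet:
    def __init__(self, sstables):
        self.commit_log = []      # a redo log on GFS in the real system
        self.memtable = {}        # recent writes kept in memory
        self.sstables = sstables  # older, immutable key->value maps, newest first

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. append the mutation to the commit log
        self.memtable[key] = value             # 2. insert its contents into the memtable

    def read(self, key):
        # Merged view: memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in self.sstables:
            if key in sstable:
                return sstable[key]
        return None

t = Tablet(sstables=[{"a": "old-a"}, {"b": "old-b"}])
t.write("a", "new-a")
print(t.read("a"), t.read("b"))   # new-a old-b
```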
Minor Compaction When the memtable reaches a threshold size, it is frozen and converted into a new SSTable written to GFS This shrinks memory usage on the tablet server and reduces the amount of data that must be read from the commit log during recovery
Merging Compaction Reads a few SSTables and the memtable and writes out a single new SSTable The input SSTables and memtable are discarded once the compaction finishes
Major Compaction A merging compaction that rewrites all SSTables into exactly one SSTable The output contains no deleted data, so the resources used by deleted data are reclaimed Bigtable periodically applies major compactions to all of its tablets
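A sketch of the three compaction kinds (simplified assumption: tablets modeled as plain dicts, SSTables ordered oldest to newest): a minor compaction freezes the memtable into a new SSTable, a merging compaction folds a few SSTables plus the memtable into one, and a major compaction rewrites everything into a single SSTable and drops deletion markers for good.

```python
DELETED = object()   # stand-in for a deletion entry

def minor_compaction(memtable, sstables):
    sstables.append(dict(memtable))   # write the frozen memtable out as a new SSTable
    memtable.clear()

def merging_compaction(memtable, sstables, how_many):
    # Merge the newest `how_many` SSTables and the memtable into one SSTable.
    inputs = sstables[-how_many:] + [dict(memtable)]
    merged = {}
    for table in inputs:              # oldest to newest, so newer values win
        merged.update(table)
    del sstables[-how_many:]
    memtable.clear()
    sstables.append(merged)

def major_compaction(memtable, sstables):
    # A merging compaction over *all* SSTables; the single output contains no
    # deletion entries, so the space used by deleted data is reclaimed.
    merging_compaction(memtable, sstables, len(sstables))
    survivor = sstables[-1]
    for key in [k for k, v in survivor.items() if v is DELETED]:
        del survivor[key]

memtable, sstables = {"k1": "v1", "k2": DELETED}, []
minor_compaction(memtable, sstables)
memtable.update({"k3": "v3"})
major_compaction(memtable, sstables)
print(sstables)    # [{'k1': 'v1', 'k3': 'v3'}]  (k2's deletion entry is gone for good)
```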
Refinements Locality groups Compression Caching for read performance Bloom filters Commit-log implementation Speeding up tablet recovery Exploiting immutability
Refinements Locality groups: clients can group multiple column families together; each locality group is stored in a separate SSTable, which makes reads more efficient Compression: clients choose whether and in what user-specified format the SSTables for a locality group are compressed; many clients use a two-pass compression scheme that is both fast and space-efficient
Refinements Caching for read performance: two levels of caching, a scan cache (key-value pairs) and a block cache (SSTable blocks), reduce the number of disk accesses Bloom filters: a filter created for the SSTables in a locality group answers whether an SSTable might contain data for a given row/column pair, so most lookups for non-existent rows or columns never touch disk
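A minimal Bloom filter sketch (assumption: toy parameters, not Bigtable's implementation). The idea is that a tablet server consults the filter before reading an SSTable: a "definitely not present" answer avoids the disk access, while a "maybe present" answer falls through to the real SSTable lookup.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

# Built over the row/column pairs stored in one SSTable of a locality group.
bf = BloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))  # True: do the SSTable lookup
print(bf.might_contain("com.example/contents:"))          # almost surely False: skip the disk read
```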
Refinements Commit-log implementation: mutations are appended to a single commit log per tablet server, with mutations for different tablets co-mingled in one file; one log gives significant performance benefits but complicates recovery, so log entries are sorted during recovery to avoid duplicated log reads Speeding up tablet recovery: before the master moves a tablet to a different server, the source server performs a minor compaction on it, so when Tablet Server 1 stops serving the tablet and it is loaded onto Tablet Server 2, no recovery of commit-log entries is required
Refinements Exploiting immutability: SSTables are immutable, so access to them does not need to be synchronized, removing deleted data becomes a garbage-collection problem on obsolete SSTables, and tablets can be split quickly because the children can share the parent's SSTables The memtable is the only mutable structure; each memtable row is copy-on-write so reads and writes can proceed in parallel
Performance Evaluation Setup: a Bigtable cluster of N tablet servers Each tablet server was configured to use 1 GB of memory and to write to a shared GFS cell The client machines had sufficient physical memory for the benchmarks
Performance Evaluation The machines were arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth available at the root Tablet servers, the master, test clients, and GFS servers all ran on the same set of machines Each machine ran either a tablet server, a client process, or processes from other jobs
Performance Evaluation Sequential write Random writes Sequential read Random reads Random reads from memory Scan
Write Benchmarks Sequential write: used row keys with names 0 to R-1, where R is the number of distinct Bigtable row keys involved in the test; the row-key space was partitioned into 10N equal-sized ranges, assigned dynamically to the N clients; a single string was written under each row key, and all row keys were distinct Random write: similar to sequential write, except that the row key was hashed modulo R immediately before writing, so the write load was spread evenly across the row space
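A sketch (assumption: simplified, illustrative key widths and hash) of how the two write benchmarks choose row keys: the sequential benchmark walks keys 0..R-1 in order within its assigned range, while the random benchmark hashes the key modulo R so writes spread evenly over the row space.

```python
import hashlib

R = 1_000_000   # number of distinct Bigtable row keys involved in the test

def sequential_keys(start, end):
    for i in range(start, end):               # one of the 10N ranges handed to a client
        yield f"{i:09d}"

def random_keys(start, end):
    for i in range(start, end):
        h = int(hashlib.md5(str(i).encode()).hexdigest(), 16) % R
        yield f"{h:09d}"

print(list(sequential_keys(0, 3)))   # ['000000000', '000000001', '000000002']
print(list(random_keys(0, 3)))       # keys spread evenly across the whole row space
```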
Read Benchmarks Sequential read: generated row keys exactly as the sequential write benchmark did, and read the string stored under each row key Random read: similar, except that the row key was hashed before reading the string stored under it
Read/Scan Benchmarks Random read from memory: like the random read benchmark, but the locality group holding the data is marked in-memory, so reads are served from the tablet server's memory rather than from GFS Scan: like sequential read, but uses the Bigtable API's support for scanning over all values in a row range, which reduces the number of RPCs
Performance Evaluation: Figure 6 (number of 1000-byte values read/written per second)
Single Tablet-Server Performance Slowest: random reads Random reads from memory are much faster than random reads from disk Random writes and sequential writes perform similarly Sequential reads outperform random reads Scans are faster still
Scaling Increased the number of tablet servers from 1 to 500 Aggregate throughput increases, but not linearly Per-server throughput drops significantly going from 1 to 50 servers Random reads scale worst
Real Applications Google Analytics Personalized Search Google Earth Google Finance Orkut Writely (became Google Docs)
Real Applications: Table 1 (distribution of the number of tablet servers in Bigtable clusters)
Real Applications: Table 2 (characteristics of a few tables in production use)
Personalized Search Each user's data is stored in Bigtable in a row named by the user's unique userid All user actions are stored, with a separate column family for each type of action The data is replicated across several Bigtable clusters
Google Earth A preprocessing table stores raw imagery; the data is cleaned and consolidated into a final serving table Rows are named after geographic segments Column families track the sources of data for each segment
Google Analytics A raw click table has a row for each end-user session; the row name is the tuple (website's name, session creation time) A summary table contains various predefined summaries for each website
Lessons Large distributed systems are vulnerable to many kinds of failure: memory and network corruption, problems in the systems Bigtable depends on, planned and unplanned maintenance Delay adding new features until it is clear how they will be used Proper system-level monitoring is essential Value simplicity in design
Conclusion Bigtable is Google's distributed storage system for structured data, used by many other Google applications Bigtable is scalable and efficient Google users have found Bigtable easy to use and helpful
Future Work Support for secondary indices Infrastructure for building cross-data-center replicated Bigtables with multiple master replicas Continue to keep Bigtable working well and fix bugs as they arise
Thanks/Questions?