Bigtable: A Distributed Storage System for Structured Data


1 Bigtable: A Distributed Storage System for Structured Data
Authors: Chang et al., Google, Inc. Presenter: Victoria Cooper

2 Introduction Goal: create a distributed storage system for structured data that provides 1. Wide applicability 2. Scalability 3. High performance 4. High availability

3 Outline Data Model API Construction of Bigtable
Implementation and refinements Evaluation Applications Conclusions

4 Data Model Dynamic control over data layout, format, and locality properties
Names used for indexing can be arbitrary strings Clients can dynamically control whether data is served out of memory or from disk

5 Data Model A map: (row:string, column:string, time:int64) -> string
Row key: an uninterpreted array of bytes Column key Timestamp
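The three-level map above can be sketched in a few lines of Python. This is a minimal in-memory illustration of the model, not Bigtable's API; the class and method names (`SparseTable`, `put`, `get`) are invented for the example.

```python
class SparseTable:
    """Sketch of Bigtable's data model:
    (row:string, column:string, time:int64) -> string."""

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self.rows = {}

    def put(self, row, column, timestamp, value):
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp):
        return self.rows[row][column][timestamp]

# Cell values from the paper's Figure 1, used purely as sample data
t = SparseTable()
t.put("com.cnn.www", "contents:", 3, "<html>...v3")
t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
print(t.get("com.cnn.www", "contents:", 3))  # <html>...v3
```

Note that the table is sparse: a row stores only the columns actually written to it, which is why a nested dictionary is a natural fit here.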

6 Data Model: Figure 1

7 Rows: Tablets A row range of a table Dynamically partitioned
The unit of distribution and load balancing

8 Rows Reads of short row ranges are efficient and typically require communication with only a small number of machines
Clients can choose row keys to get good locality for their data accesses

9 Columns Column family: the unit of access control The family name comes before the qualifier in the key
Small number of column families Large number of columns

10 Columns Column keys use family:qualifier syntax
Family names must be printable Qualifiers can be arbitrary strings

11 Columns: Example 1 Column family: the language a web page is written in
Column key: stores each web page's language ID

12 Columns: Example 2 Family: anchor Each column key names a single anchor (the qualifier is the referring site)

13 Timestamps Multiple versions of the same data 64-bit integers
Can be assigned by Bigtable (real time) Can be assigned by the client Must be unique to avoid collisions Stored in decreasing timestamp order The most recent version is read first

14 Timestamps Two per-column-family settings for garbage collecting old versions
1) Bigtable garbage collects automatically 2) The client specifies that only the last n versions be kept
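The "keep the last n versions" policy above can be sketched as follows. The function name `keep_last_n` is invented for the example; the only property taken from the slides is that versions are ordered by decreasing timestamp, so the most recent versions come first.

```python
def keep_last_n(versions, n):
    """versions: {timestamp: value}. Returns the n newest versions,
    newest-first, mirroring Bigtable's decreasing-timestamp order."""
    newest_first = sorted(versions.items(), key=lambda kv: kv[0], reverse=True)
    return newest_first[:n]

# Four versions of one cell; GC keeps only the two most recent
cell = {1: "a", 5: "b", 3: "c", 9: "d"}
print(keep_last_n(cell, 2))  # [(9, 'd'), (5, 'b')]
```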

15 API Create tables/column families Delete tables/column families Alter: Cluster Table
Column family metadata (e.g., access control rights)

16 API: Figure 2

17 API: Figure 3

18 API Single-row transactions Allows cells to be used as integer counters
Execution of client-supplied scripts Written in Sawzall Can be used with MapReduce

19 Building Blocks Google File System (GFS) Google SSTable (file format)
Chubby (lock service)

20 Google File System (GFS)
Stores log and data files A Bigtable cluster runs on a shared pool of machines that run many other distributed applications A cluster management system: schedules jobs Manages resources Deals with machine failures Monitors machine status

21 SSTable Stores Bigtable data A map from keys to values
Persistent Immutable Ordered Both keys and values are arbitrary byte strings
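An SSTable's interface (a persistent, immutable, ordered map supporting point lookups and range scans) can be sketched in memory like this. The class is illustrative only: it builds the sorted key list once and never mutates it, and it omits persistence, blocks, and the block index.

```python
import bisect

class SSTable:
    """Sketch of the SSTable interface: immutable ordered map
    from byte-string keys to byte-string values."""

    def __init__(self, items):
        pairs = sorted(items)                # sorted once, never mutated
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def get(self, key):
        """Point lookup via binary search."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start, end):
        """Iterate over all key/value pairs in [start, end)."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))

sst = SSTable([(b"b", b"2"), (b"a", b"1"), (b"c", b"3")])
print(sst.get(b"b"))         # b'2'
print(sst.scan(b"a", b"c"))  # [(b'a', b'1'), (b'b', b'2')]
```

Keeping keys sorted is what makes both the binary-search lookup and the cheap range scan possible; the real format gets the same effect with a block index over sorted 64 KB blocks.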

22 SSTable Internally: a sequence of blocks (typically 64 KB) plus a block index used to locate blocks

23 Chubby Five active replicas; one replica is elected master and serves requests

24 Chubby Provides a namespace of directories and small files

25 Bigtable and Chubby Ensure there is at most one active master at a time
Store the bootstrap location of Bigtable data Discover tablet servers Finalize tablet server deaths Store schema (column family) information Store access control lists

26 Implementation Three major components: A library linked into every client One master server
Assigns tablets to tablet servers Many tablet servers Each manages a set of tablets

27 Implementation Clients communicate directly with tablet servers for reads and writes
Client data does not move through the master

28 Cluster Hierarchy: Cluster → Table → Tablet → Row Range → Data

29 Tablet Location: Figure 4

30 METADATA table Stores the location of a tablet under its row key
Each row typically stores ~1 KB of data The client library caches tablet locations If a cached location is incorrect, the client moves up the location hierarchy If the cache is empty, a lookup can take 3 network round-trips If the cache is stale, it can take up to 6 round-trips

31 Tablet Assignment Tablet server start-up:
Acquires an exclusive lock on a unique Chubby file The master monitors the servers' directory to discover tablet servers A tablet server stops serving its tablets if it loses its lock It tries to reacquire the lock If its file no longer exists, the server kills itself When a server dies, it releases its lock

32 Tablet Assignment It is the master's job to detect tablet servers that stop serving
The master periodically asks each tablet server whether it still holds its lock If a server can't be reached or has lost its lock, the master tries to acquire an exclusive lock on the server's file If it succeeds, the server is dead or having trouble contacting Chubby The master then deletes the server's file and reassigns its tablets

33 Tablet Assignment The set of existing tablets changes when:
A table is created A table is deleted Two tablets are merged A tablet is split Tablet splits are initiated by a tablet server The server updates the METADATA table and notifies the master

34 Tablet Serving: Figure 5

35 Tablet Serving Write operation: Checks that the request is well-formed
Checks authorization (against a list of permitted writers in a Chubby file) A valid mutation is written to the commit log Its contents are then inserted into the memtable Read operation: Checks well-formedness and authorization A valid read is executed on a merged view of the SSTables and the memtable
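The "merged view" read path above can be sketched as a lookup that consults the memtable first, then older SSTables in order; newer data shadows older data. The tables here are plain dicts and `merged_read` is an invented name; the real implementation merges sorted iterators rather than probing dicts.

```python
def merged_read(key, memtable, sstables):
    """Sketch of a tablet read over a merged view.
    sstables are ordered newest-first; the memtable is newest of all."""
    if key in memtable:
        return memtable[key]
    for sst in sstables:
        if key in sst:
            return sst[key]
    return None  # key not present in this tablet

memtable = {"row1": "new"}
sstables = [{"row2": "mid"}, {"row1": "old", "row3": "oldest"}]
print(merged_read("row1", memtable, sstables))  # new  (memtable shadows SSTable)
print(merged_read("row3", memtable, sstables))  # oldest
```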

36 Minor Compaction Converts the memtable into an SSTable; shrinks memory usage on the tablet server
Reduces the amount of data that must be read from the commit log during recovery

37 Merging Compaction Merges a few SSTables and the memtable into one new SSTable Discards the inputs once the compaction finishes

38 Major Compaction A merging compaction that rewrites all SSTables into exactly one SSTable Produces an SSTable containing no deleted data Reclaims resources used by deleted data Bigtable periodically applies major compactions to its tablets
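The major compaction described above can be sketched as merging newest-first tables into one, then dropping deletion markers along with the data they shadow. The `DELETED` sentinel and the function name are inventions for the sketch; real SSTables use special deletion entries and merge sorted streams.

```python
DELETED = object()  # illustrative stand-in for a deletion marker

def major_compact(tables):
    """tables: dicts ordered newest-first (memtable, then SSTables).
    Returns the single merged table with no deleted data."""
    merged = {}
    for table in tables:
        for key, value in table.items():
            merged.setdefault(key, value)  # newest version of each key wins
    # drop deletion markers: the data they shadowed is gone for good
    return {k: v for k, v in merged.items() if v is not DELETED}

memtable = {"a": DELETED, "d": "4"}     # "a" was deleted recently
older = [{"a": "1", "b": "2"}, {"c": "3"}]
print(major_compact([memtable] + older))  # "a" and its old value both vanish
```

This is also why only major compactions reclaim deleted data: a merging compaction over a subset of SSTables must keep deletion markers, since an even older SSTable outside the merge may still hold the shadowed value.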

39 Refinements Locality groups Compression Caching for read performance
Bloom filters Commit-log implementation Speeding up tablet recovery Exploiting immutability

40 Refinements Locality groups: Multiple column families grouped together
Each locality group gets a separate SSTable Makes reads more efficient Compression: Clients can specify a compression format Applied when compressing an SSTable A two-pass compression scheme: fast and space-efficient

41 Refinements Caching for read performance: Two levels of caching
Scan cache (key-value pairs) Block cache (SSTable blocks) Reduces the number of disk accesses Bloom filters: One per SSTable in a locality group Checked to see whether an SSTable might contain data for a given row/column pair
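A Bloom filter answers "might this SSTable contain this key?" with no false negatives and a small false-positive rate, so most lookups for absent rows skip the disk entirely. The sketch below is a generic textbook Bloom filter, not Bigtable's implementation; the parameters (`m` bits, `k` hashes) and the SHA-256-based hashing are assumptions made for the example.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions in an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0  # m-bit array packed into one int

    def _positions(self, item):
        # derive k positions by salting a cryptographic hash (an
        # arbitrary choice here; any k independent hashes would do)
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # False -> definitely absent; True -> possibly present
        return all(self.bits & (1 << p) for p in self._positions(item))

bf = BloomFilter()
bf.add("row1/anchor:cnnsi.com")
print(bf.might_contain("row1/anchor:cnnsi.com"))  # True
```

A negative answer is authoritative, which is exactly the property that lets a tablet server skip an SSTable without reading it.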

42 Refinements Commit-log implementation: Mutations are appended to a single commit log per tablet server One log has performance benefits But it complicates recovery: the log interleaves entries for many tablets To avoid duplicate log reads, recovery sorts the log entries by key Speeding up tablet recovery: Before the master moves a tablet to a different server, the source server performs a minor compaction Tablet server 1 stops serving the tablet The tablet is loaded onto tablet server 2 No recovery of log entries is required

43 Refinements Exploiting immutability: SSTables are immutable
No need to synchronize access when reading them Deleting data becomes garbage-collecting obsolete SSTables Tablets can be split quickly (children share the parent's SSTables) The memtable is the only mutable structure Each memtable row is copy-on-write, so reads and writes can proceed in parallel

44 Performance Evaluation
A Bigtable cluster of N tablet servers Each benchmark reads or writes 1 GB of data per tablet server to/from GFS Client machines had sufficient physical memory

45 Performance Evaluation
The machines were in a two-level tree-shaped switched network with Gbps of aggregate bandwidth available at the root Tablet servers, the master, test clients, and GFS servers all ran on the same set of machines Each machine ran a GFS server plus either a tablet server or a client, along with processes from other jobs

46 Performance Evaluation
Sequential write Random writes Sequential read Random reads Random reads from memory Scan

47 Write Benchmarks Sequential writes: Used row keys with names 0 to R-1, where R is the number of distinct Bigtable row keys in the test
The key space was partitioned into 10N equal ranges, assigned dynamically to the N clients Wrote a single string under each row key All row keys were distinct Random writes: Similar to sequential writes, but each row key was hashed modulo R before writing This spread the write load evenly across the row space
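The random-write key choice above (hash the sequential key modulo R so writes land uniformly across the row space) can be sketched as follows. The SHA-1-based hash is an assumption; the benchmark description says only that the row key was hashed modulo R, not which hash was used.

```python
import hashlib

def random_benchmark_key(i, R):
    """Map sequential key i to a pseudo-random key in [0, R),
    spreading write load evenly across the row space."""
    digest = hashlib.sha1(str(i).encode()).digest()
    return int.from_bytes(digest[:8], "big") % R

R = 1000
keys = [random_benchmark_key(i, R) for i in range(5)]
print(keys)  # five keys scattered across [0, R), not 0..4
```

Because the mapping is deterministic, the random-read benchmark can later regenerate exactly the same keys to find the written strings.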

48 Read Benchmarks Sequential reads: Row keys generated exactly as in the sequential write benchmark
Reads the string stored under each row key Random reads: Similar, but hashes each row key before reading the string stored under it

49 Read/Scan Benchmarks Random reads from memory: Similar to random reads, but the locality group is marked in-memory
Reads are served from tablet server memory Scan: Similar to sequential reads, but scans over all values in a row range Uses Bigtable's scan API, which reduces the number of RPCs

50 Performance Evaluation: Figure 6

51 Single Tablet-Server Slowest benchmark: random reads (from disk)
Random reads from memory are much faster than random reads from disk Random writes perform about the same as sequential writes Sequential reads outperform random reads Scans outperform sequential reads

52 Scaling The number of tablet servers was increased from 1 to 500
Aggregate performance does not increase linearly Per-server throughput drops significantly from 1 to 50 servers Random reads scaled worst

53 Real Applications Google Analytics Personalized Search Google Earth
Google Finance Orkut Writely (later Google Docs)

54 Real Applications: Table 1

55 Real Applications: Table 2

56 Personalized Search The user's data goes in Bigtable Row name: a unique userid
All user actions are stored A separate column family for each type of action Replicated over several Bigtable clusters

57 Google Earth A preprocessing table stores raw imagery The data is cleaned
and entered into a final serving table Rows are named by geographic segment Column families track the sources of data for each segment

58 Google Analytics Raw click table: One row per end-user session
Row name: the tuple (website's name, session creation time) Summary table: Various predefined summaries for each website

59 Lessons A system of this type has many vulnerabilities:
Memory/network corruption Problems in the systems it depends on Planned/unplanned maintenance Understand how features will actually be used Have proper system-level monitoring Value simplicity in design

60 Conclusion Bigtable is Google's distributed storage system for structured data
Used by many other Google applications Bigtable is scalable and efficient Google users found Bigtable easy to use and helpful

61 Future Work Support for secondary indices
Infrastructure for building cross-data-center replicated Bigtables with multiple master replicas Keep Bigtable working well and fix bugs as they arise

62 Thanks/Questions?

