
1 Lecture 7 Bigtable
COSC6376 Cloud Computing
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

2 Outline
Hadoop
Bigtable
HBase

3 Projects

4 Sample Projects
Support video processing using HDFS and MapReduce
Image processing using cloud
Security services using cloud
Web analytics using cloud
Cloud-based MPI
Novel applications of cloud-based storage
New pricing models
Cyber-physical systems with cloud as the backend
Bioinformatics using MapReduce

5 Hadoop DFS (HDFS)
Mimics GFS: same assumptions, highly similar design
Different names:
Master -> NameNode
Chunkserver -> DataNode
Chunk -> Block
Operation log -> EditLog


7 Working with HDFS
Installed under /usr/local/hadoop/
bin/ : scripts for starting/stopping the system
conf/ : configuration files
log/ : system log files
Installation: single node or cluster

8 In-Memory Accelerator for Hadoop

9 HDFS on different storage devices


11 PCM
Emerging NVM technology that can replace Flash and DRAM
Much higher density; much better scalability; supports multi-level cells
Non-volatile; fast reads (~50 ns); slow and energy-hungry writes; limited lifetime (~10^8 writes per cell); no leakage

12 Bigtable
Fay Chang et al., "Bigtable: A Distributed Storage System for Structured Data", OSDI 2006

13 Global Picture

14 Why Bigtable?
RDBMS performance is good for transaction processing, but for very large scale analytic processing the solutions are commercial, expensive, and specialized
Very large scale analytic processing:
Big queries – typically range or table scans
Big databases (100s of TB)

15 Why Bigtable? (2)
MapReduce on Bigtable, optionally with Cascading on top to support some relational-algebra operations, may be a cost-effective solution
Sharding is not a solution for scaling open-source RDBMS platforms:
Application-specific
Labor-intensive (re)partitioning

16 Bigtable
BigTable is a distributed storage system for managing structured data
Designed to scale to a very large size: petabytes of data across thousands of servers
Used for many Google projects: web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, …
Flexible, high-performance solution for all of Google's products

17 BigTable
Distributed multi-level map
Fault-tolerant, persistent
Scalable:
Thousands of servers
Terabytes of in-memory data
Petabytes of disk-based data
Millions of reads/writes per second, efficient scans
Self-managing:
Servers can be added/removed dynamically
Servers adjust to load imbalance
Often want to examine data changes over time, e.g. the contents of a web page over multiple crawls

18 Building Blocks
Building blocks:
Google File System (GFS): raw storage
Scheduler: schedules jobs onto machines
Lock service: distributed lock manager
MapReduce: simplified large-scale data processing
BigTable's use of the building blocks:
GFS: stores persistent data (SSTable file format)
Scheduler: schedules jobs involved in BigTable serving
Lock service: master election
MapReduce: often used to read/write BigTable data


20 Google File System
Large-scale distributed "filesystem"
Master: responsible for metadata
Chunk servers: responsible for reading and writing large chunks of data
Chunks replicated on 3 machines; the master is responsible for ensuring replicas exist

21 Basic Data Model
A BigTable is a sparse, distributed, persistent multi-dimensional sorted map
(row, column, timestamp) -> cell contents
Good match for most Google applications

22 WebTable Example
Want to keep a copy of a large collection of web pages and related information
Use URLs as row keys
Various aspects of the web page as column names
Store the contents of web pages in the contents: column under the timestamps when they were fetched
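A toy sketch of this model in C++ (not Bigtable's real API; the types, URL, and timestamp below are made up for illustration):

    #include <cstdint>
    #include <functional>
    #include <map>
    #include <string>

    // row key -> column name -> (timestamp, newest first) -> value
    using VersionedCell = std::map<uint64_t, std::string, std::greater<uint64_t>>;
    using Row   = std::map<std::string, VersionedCell>;
    using Table = std::map<std::string, Row>;  // sorted by row key

    int main() {
      Table webtable;
      // WebTable: URL as row key; each crawl of the page is stored in the
      // contents: column under its fetch timestamp.
      webtable["com.cnn.www"]["contents:"][1136242588] = "<html>...</html>";
      webtable["com.cnn.www"]["anchor:cnnsi.com"][1136242588] = "CNN";
      return 0;
    }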

23 Rows
Name is an arbitrary string
Access to data in a row is atomic
Row creation is implicit upon storing data
Rows ordered lexicographically
Rows close together lexicographically usually reside on one or a small number of machines

24 Rows (cont.)
Reads of short row ranges are efficient and typically require communication with only a small number of machines
Can exploit this property by selecting row keys that give good locality for data access
Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu
vs. edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys (reversed domains keep pages from the same domain adjacent)
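One way to get this locality in practice is to reverse the domain components of each hostname before using it as a row key. A minimal sketch (ReverseDomain is a hypothetical helper, not part of any Bigtable API):

    #include <sstream>
    #include <string>
    #include <vector>

    // math.gatech.edu -> edu.gatech.math, so pages from the same
    // domain sort next to each other and share range scans.
    std::string ReverseDomain(const std::string& host) {
      std::vector<std::string> parts;
      std::stringstream ss(host);
      std::string part;
      while (std::getline(ss, part, '.')) parts.push_back(part);
      std::string out;
      for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
        if (!out.empty()) out += '.';
        out += *it;
      }
      return out;
    }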

25 Columns
Columns have a two-level name structure: family:optional_qualifier
Column family:
Unit of access control
Has associated type information
Qualifier gives unbounded columns:
Additional levels of indexing, if desired

26 Timestamps
Used to store different versions of data in a cell
New writes default to the current time, but timestamps for writes can also be set explicitly by clients
Lookup options:
"Return most recent K values"
"Return all values in timestamp range (or all values)"
Column families can be marked with attributes:
"Only retain most recent K values in a cell"
"Keep values until they are older than K seconds"
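The two garbage-collection attributes can be pictured as pruning rules over one cell's version map. A minimal sketch under assumed types (a newest-first map from timestamp to value; none of this is Bigtable's real code):

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <iterator>
    #include <map>
    #include <string>

    using Versions = std::map<uint64_t, std::string, std::greater<uint64_t>>;

    // "Only retain most recent K values in a cell"
    void RetainNewestK(Versions& v, std::size_t k) {
      while (v.size() > k) v.erase(std::prev(v.end()));  // drop the oldest
    }

    // "Keep values until they are older than K seconds"
    void DropOlderThan(Versions& v, uint64_t now, uint64_t k_seconds) {
      for (auto it = v.begin(); it != v.end();) {
        if (now - it->first > k_seconds) it = v.erase(it);
        else ++it;
      }
    }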

27 SSTable
Immutable, sorted file of key-value pairs
Chunks of data plus an index
Index is of block ranges, not values
(Figure: an SSTable laid out as a sequence of 64K blocks plus a block index.)
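Because the index maps block ranges rather than individual values, a lookup first binary-searches the index for the candidate block, then reads that block. A sketch with hypothetical types:

    #include <cstdint>
    #include <iterator>
    #include <map>
    #include <string>

    struct SSTableIndex {
      // first key of each 64K block -> file offset of that block
      std::map<std::string, int64_t> first_key_to_offset;

      // The greatest indexed first-key <= key identifies the only block that
      // could contain `key`; the block itself must still be searched.
      int64_t BlockFor(const std::string& key) const {
        auto it = first_key_to_offset.upper_bound(key);
        if (it == first_key_to_offset.begin()) return -1;  // precedes table
        return std::prev(it)->second;
      }
    };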

28 Tablet
Contains some range of rows of the table
Built out of multiple SSTables
(Figure: a tablet spanning rows aardvark..apple, backed by two SSTables, each made of 64K blocks plus an index.)

29 Table
Multiple tablets make up the table
Tablets do not overlap; SSTables can overlap and be shared between tablets
(Figure: two adjacent tablets, aardvark..apple and apple_two_E..boat, sharing some of their SSTables.)

30 Architecture
Client library
Single master server
Tablet servers

31 Bigtable Master
Assigns tablets to tablet servers
Detects the addition and expiration of tablet servers
Balances tablet-server load; tablets are distributed randomly across the nodes of the cluster for load balancing
Handles garbage collection
Handles schema changes

32 Bigtable Tablet Servers
Each tablet server manages a set of tablets (typically between ten and a thousand tablets)
Each tablet is 100-200 MB by default
Handles read and write requests to its tablets
Splits tablets that have grown too large
The master is responsible for load balancing and fault tolerance
Uses Chubby to monitor the health of tablet servers, restarting failed servers

33 A 3-level Hierarchy
1st level: a file stored in Chubby contains the location of the root tablet, i.e., a directory of ranges (tablets) and associated metadata. The root tablet never splits.
2nd level: each metadata tablet contains the locations of a set of user tablets.
3rd level: a set of SSTable identifiers for each tablet.

34 A 3-level Hierarchy
Each metadata row stores ~1 KB of data
With 128 MB metadata tablets, the three-level scheme addresses 2^34 tablets (2^61 bytes in 128 MB tablets) — about 2 exabytes
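The arithmetic behind those figures, as a quick check (using the paper's numbers: ~1 KB ≈ 2^10 B per metadata row, 128 MB = 2^27 B tablets):

    \[
    \frac{2^{27}\,\mathrm{B}}{2^{10}\,\mathrm{B/row}} = 2^{17}\ \text{rows per metadata tablet},
    \qquad 2^{17}\cdot 2^{17} = 2^{34}\ \text{user tablets},
    \qquad 2^{34}\cdot 2^{27}\,\mathrm{B} = 2^{61}\,\mathrm{B}\approx 2\,\mathrm{EB}.
    \]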

35 Editing a Table
Mutations are logged, then applied to an in-memory version (the memtable)
Logfile stored in GFS
(Figure: inserts and deletes flow into the memtable; earlier data lives in the tablet's SSTables.)

36 Chubby
A persistent and distributed lock service
Consists of 5 active replicas; one replica is the master and serves requests
The service is functional when a majority of the replicas are running and in communication with one another – when there is a quorum
Implements a name service that consists of directories and files

37 Bigtable and Chubby
Bigtable uses Chubby to:
Ensure there is at most one active master at a time
Store the bootstrap location of Bigtable data (the root tablet)
Discover tablet servers and finalize tablet server deaths
Store Bigtable schema information (column family information)
Store access control lists
If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable

38 Tablet Assignment
Each tablet is assigned to one tablet server at a time
The master keeps track of the set of live tablet servers and the current assignment of tablets to servers, including which tablets are unassigned
When a tablet is unassigned, the master assigns it to a tablet server with sufficient room

39 API
Metadata operations:
Create/delete tables and column families; change metadata
Writes (atomic):
Set(): write cells in a row
DeleteCells(): delete cells in a row
DeleteRow(): delete all cells in a row
Reads:
Scanner: read arbitrary cells in a bigtable
Each row read is atomic
Can restrict returned rows to a particular range
Can ask for data from just one row, all rows, etc.
Can ask for all columns, just certain column families, or specific columns

40 API Examples: Write/Modify
Atomic row modification
No support for (RDBMS-style) multi-row transactions
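The slide's code screenshot did not survive the transcript; the corresponding write example from the Bigtable paper (Chang et al., OSDI 2006, Figure 2) looks roughly like this C++ (reproduced from the paper, so treat exact names as approximate):

    // Open the table.
    Table *T = OpenOrDie("/bigtable/web/webtable");

    // Atomically write a new anchor and delete an old anchor in one row.
    RowMutation r1(T, "com.cnn.www");
    r1.Set("anchor:www.c-span.org", "CNN");
    r1.Delete("anchor:www.abc.com");
    Operation op;
    Apply(&op, &r1);  // the whole RowMutation is applied atomically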

41 API Examples: Read
Return sets can be filtered using regular expressions, e.g. anchor:com.cnn.*
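Again the screenshot is missing; the paper's read example (Figure 3 in Chang et al., reproduced approximately) scans the anchor family of one row:

    Scanner scanner(T);
    ScanStream *stream;
    stream = scanner.FetchColumnFamily("anchor");
    stream->SetReturnAllVersions();
    scanner.Lookup("com.cnn.www");
    for (; !stream->Done(); stream->Next()) {
      printf("%s %s %lld %s\n",
             scanner.RowName(),
             stream->ColumnName(),
             stream->MicroTimestamp(),
             stream->Value());
    }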

42 Tablet Serving “Log Structured Merge Trees”
Image Source: Chang et al., OSDI 2006

43 Tablet Representation
(Figure: a tablet combines an append-only commit log on GFS, an in-memory write buffer (the memtable, random-access), and SSTables on GFS; writes go to the log and memtable, reads consult both.)
SSTable: immutable on-disk ordered map from string -> string
String keys: <row, column, timestamp> triples

44 Client Write & Read Operations
Write operation arrives at a tablet server:
The server checks that the client has sufficient privileges for the write operation (via Chubby)
A log record is written to the commit log file
Once the write commits, its contents are inserted into the memtable
Read operation arrives at a tablet server:
The server checks that the client has sufficient privileges for the read operation (via Chubby)
The read is performed on a merged view of (a) the SSTables that constitute the tablet and (b) the memtable
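A minimal sketch of these two paths, with hypothetical types standing in for the commit log, memtable, and SSTables (privilege checks omitted):

    #include <map>
    #include <string>
    #include <vector>

    struct Tablet {
      std::vector<std::string> commit_log;          // stands in for the GFS log
      std::map<std::string, std::string> memtable;  // recent writes, sorted
      // older immutable data, assumed ordered newest-first:
      std::vector<std::map<std::string, std::string>> sstables;

      void Write(const std::string& key, const std::string& value) {
        commit_log.push_back(key + "=" + value);  // 1. log the mutation
        memtable[key] = value;                    // 2. then apply to memtable
      }

      // Merged view: memtable wins over SSTables, newer SSTables over older.
      bool Read(const std::string& key, std::string* value) const {
        auto it = memtable.find(key);
        if (it != memtable.end()) { *value = it->second; return true; }
        for (const auto& sst : sstables) {
          auto sit = sst.find(key);
          if (sit != sst.end()) { *value = sit->second; return true; }
        }
        return false;
      }
    };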

45 Write Operations
As writes execute, the size of the memtable increases.
Once the memtable reaches a threshold:
The memtable is frozen
A new memtable is created
The frozen memtable is converted to an SSTable and written to GFS

46 Compactions
Minor compaction:
Converts the memtable into an SSTable
Reduces memory usage and log traffic on restart
Merging compaction:
Reads the contents of a few SSTables and the memtable, and writes out a new SSTable
Reduces the number of SSTables
Major compaction:
A merging compaction that results in only one SSTable
No deletion records, only live data
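A sketch of a merging compaction over the structure above: fold several sorted runs (memtable plus SSTables, newest run first) into one new table, and in a major compaction also drop deletion markers. The tombstone encoding and types are assumptions, not Bigtable's real format:

    #include <iterator>
    #include <map>
    #include <string>
    #include <vector>

    const std::string kTombstone = "<deleted>";

    std::map<std::string, std::string> Compact(
        const std::vector<std::map<std::string, std::string>>& runs,
        bool major) {
      std::map<std::string, std::string> out;
      for (const auto& run : runs)     // newest run first, so insert()
        for (const auto& kv : run)     // keeps the newest value per key
          out.insert(kv);
      if (major)                       // major compaction: live data only
        for (auto it = out.begin(); it != out.end();)
          it = (it->second == kTombstone) ? out.erase(it) : std::next(it);
      return out;
    }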

47 Refinements: Locality Groups
Multiple column families can be grouped into a locality group
A separate SSTable is created for each locality group in each tablet
Segregating column families that are not typically accessed together enables more efficient reads
In WebTable, page metadata can be in one group and the contents of the page in another

48 Refinements: Compression
Many opportunities for compression:
Similar values in the same row/column at different timestamps
Similar values in different columns
Similar values across adjacent rows
Two-pass custom compression scheme:
First pass: compress long common strings across a large window
Second pass: look for repetitions in a small window
Speed emphasized, but good space reduction (10-to-1)

