

1 CSC590 Selected Topics
Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
by Haifa Alyahya 432920323

2 Outline
Introduction
Data Model
APIs
Building Blocks
Implementation
Refinements
Performance
Real Applications
Conclusion

3 Discussion
Bigtable (Bt) is a distributed storage system for managing structured data, designed to scale to a very large size.
Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance.

4 Introduction
Bigtable is designed to reliably scale to petabytes of data and thousands of machines.
Bigtable has achieved several goals:
– Wide applicability
– Scalability
– High performance
– High availability

5 Motivation
Scale problem
– Lots of data
– Millions of machines
– Different projects/applications
– Hundreds of millions of users
Storage for (semi-)structured data
No commercial system is big enough
– Google could not afford one even if it existed
Low-level storage optimizations help performance significantly
– Much harder to do when running on top of a database layer

6 Data Model
A sparse, distributed, persistent, multi-dimensional sorted map
(row, column, timestamp) -> cell contents
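The map on this slide can be pictured with a small in-memory toy. This is only a sketch: the class name `SparseSortedMap` and its methods are hypothetical, the row and column strings are illustrative, and nothing here is distributed or persistent. It assumes keys are kept sorted by row, then column, then newest timestamp first.

```python
from bisect import insort


class SparseSortedMap:
    """Toy (row, column, timestamp) -> value map; names are hypothetical."""

    def __init__(self):
        # Keys are stored as (row, column, -timestamp) so that, within one
        # cell, newer versions sort before older ones.
        self._keys = []
        self._cells = {}

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._cells:
            insort(self._keys, key)  # keep the key list sorted
        self._cells[key] = value

    def get(self, row, column, timestamp):
        # Sparse: cells that were never written simply do not exist.
        return self._cells.get((row, column, -timestamp))

    def items(self):
        # Iterate in sorted order: row, then column, then newest first.
        for row, column, neg_ts in self._keys:
            yield (row, column, -neg_ts), self._cells[(row, column, neg_ts)]


m = SparseSortedMap()
m.put("com.cnn.www", "contents:", 3, "v3")
m.put("com.cnn.www", "contents:", 5, "v5")
m.put("com.abc.www", "anchor:x", 1, "a")
for key, value in m.items():
    print(key, value)
```

Only written cells occupy space, which is what "sparse" means here; a lookup for any other (row, column, timestamp) returns nothing.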

7 Data Model
Rows
– Arbitrary string
– Access to data in a row is atomic
– Ordered lexicographically
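Because rows are ordered lexicographically, the choice of row key decides which rows end up next to each other. A common illustration is storing web pages under reversed hostnames so that pages from the same domain sort together. The helper name `reverse_hostname` and the sample hosts below are assumptions made for this sketch.

```python
def reverse_hostname(host):
    """Turn 'maps.google.com' into 'com.google.maps' (hypothetical helper)."""
    return ".".join(reversed(host.split(".")))


hosts = ["maps.google.com", "www.cnn.com", "mail.google.com"]
row_keys = sorted(reverse_hostname(h) for h in hosts)
print(row_keys)
```

With reversed keys, the two google.com hosts become adjacent rows, so a range scan over one domain touches a contiguous run of rows.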

8 Data Model
Columns
– Two-level name structure: family:qualifier
– The column family is the unit of access control

9 Data Model
Timestamps
– Store different versions of data in a cell
– Lookup options:
  Return most recent K values
  Return all values
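The two lookup options can be sketched over a single cell whose versions are held in a plain dictionary. The function name `most_recent` and the dictionary representation are assumptions for illustration, not Bigtable's interface.

```python
def most_recent(versions, k=None):
    """Return (timestamp, value) pairs newest first.

    versions: dict mapping timestamp -> value for one cell.
    k: how many versions to return; None means all of them.
    """
    ordered = sorted(versions.items(), key=lambda tv: tv[0], reverse=True)
    return ordered if k is None else ordered[:k]


cell = {1: "a", 3: "c", 2: "b"}
print(most_recent(cell, k=2))  # the two newest versions
print(most_recent(cell))       # every stored version
```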

10 Data Model
The row range for a table is dynamically partitioned
Each row range is called a tablet
The tablet is the unit of distribution and load balancing
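Since rows are sorted, finding the tablet that holds a row reduces to a binary search over tablet boundaries. The sketch below assumes each tablet is identified by its inclusive last row key and that a final tablet covers everything after the last boundary; `TabletMap`, the boundary keys, and the server names are all hypothetical.

```python
from bisect import bisect_left


class TabletMap:
    """Toy row-range partitioning; all names here are hypothetical."""

    def __init__(self, end_keys, servers):
        # end_keys: sorted inclusive last row of every tablet except the
        # final one, which covers all remaining rows.
        assert len(servers) == len(end_keys) + 1
        self.end_keys = end_keys
        self.servers = servers

    def locate(self, row):
        # Index of the first tablet whose end key is >= row.
        return self.servers[bisect_left(self.end_keys, row)]


tablets = TabletMap(end_keys=["g", "p"], servers=["ts1", "ts2", "ts3"])
for row in ["apple", "g", "golf", "zebra"]:
    print(row, "->", tablets.locate(row))
```

Splitting a tablet in this model only means inserting one more boundary key and one more server entry, which is why row ranges are a convenient unit for load balancing.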

11 APIs
Metadata operations
– Create/delete tables and column families, change metadata
Writes
– Set(): write cells in a row
– DeleteCells(): delete cells in a row
– DeleteRow(): delete all cells in a row
Reads
– Scanner: read arbitrary cells in a Bigtable
– Each row read is atomic
– Can restrict returned rows to a particular range
– Can ask for data from just one row, all rows, etc.
– Can ask for all columns, just certain column families, or specific columns
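The operations listed on this slide can be mimicked with a small in-memory table. This is a sketch under stated assumptions, not the Bigtable client library: the Python names `Table`, `set`, `delete_cells`, `delete_row`, and `scan` are hypothetical stand-ins for Set(), DeleteCells(), DeleteRow(), and the Scanner, and timestamps are left out for brevity.

```python
class Table:
    """Toy single-machine table; method names are hypothetical."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        self.rows = {}                 # row key -> {column -> value}

    def set(self, row, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {})[column] = value

    def delete_cells(self, row, column):
        cells = self.rows.get(row, {})
        cells.pop(column, None)
        if not cells:
            self.rows.pop(row, None)  # drop rows that became empty

    def delete_row(self, row):
        self.rows.pop(row, None)

    def scan(self, start_row="", end_row=None, families=None):
        """Yield (row, cells) in row order for start_row <= row < end_row."""
        for row in sorted(self.rows):
            if row < start_row or (end_row is not None and row >= end_row):
                continue
            cells = {
                column: value
                for column, value in sorted(self.rows[row].items())
                if families is None or column.split(":", 1)[0] in families
            }
            if cells:
                yield row, cells


t = Table(["anchor", "contents"])
t.set("com.cnn.www", "anchor:cnnsi.com", "CNN")
t.set("com.cnn.www", "contents:", "<html>")
t.set("com.abc.www", "contents:", "<p>")
print(list(t.scan(families={"anchor"})))
```

The `scan` arguments mirror the read options on the slide: a row range, and a restriction to certain column families.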

12 APIs

13 Building Blocks
Google File System (GFS)
– Stores persistent data (SSTable file format)
Scheduler
– Schedules jobs onto machines
Chubby
– Lock service: distributed lock manager
– Master election, location bootstrapping
MapReduce (optional)
– Data processing
– Read/write Bigtable data

14 Chubby
{lock/file/name} service
Coarse-grained locks
Each client has a session with Chubby
– The session expires if the client cannot renew its session lease within the lease expiration time
5 replicas; a majority must agree for the service to be active
Also an OSDI ’06 paper
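The session-lease and majority rules can be sketched as follows. This is not Chubby's API: `Session`, `renew`, `is_expired`, and `has_quorum` are hypothetical names, time is passed in explicitly as a number of seconds, and the lease length is an arbitrary value chosen for the example.

```python
class Session:
    """Toy client session with a renewable lease (hypothetical names)."""

    def __init__(self, lease_seconds, now):
        self.lease_seconds = lease_seconds
        self.expires_at = now + lease_seconds

    def is_expired(self, now):
        return now >= self.expires_at

    def renew(self, now):
        # A lease can only be extended while it is still valid.
        if self.is_expired(now):
            return False
        self.expires_at = now + self.lease_seconds
        return True


def has_quorum(live_replicas, total_replicas=5):
    """The service is active only if a strict majority of replicas is up."""
    return live_replicas > total_replicas // 2


s = Session(lease_seconds=10, now=0)
print(s.renew(now=5))        # renewed in time, lease now runs to t=15
print(s.is_expired(now=15))  # no renewal arrived before the deadline
print(has_quorum(3), has_quorum(2))
```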

15 Implementation
The Bigtable implementation has three major components:
– A library that is linked into every client
– One master server
– Many tablet servers

16 Tablet Location Management

17 Refinements
Locality groups
– Clients can group multiple column families together into a locality group
Compression
– A two-pass scheme: Bentley and McIlroy's scheme, followed by a fast compression algorithm
Caching for read performance
– Uses a Scan Cache and a Block Cache
Bloom filters
– Reduce the number of disk accesses
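A Bloom filter reduces accesses because a negative answer is definite: if the filter says a key is absent, the file it guards need not be read at all. The sketch below is a generic Bloom filter, not Bigtable's implementation; the class name, the bit and hash counts, the SHA-256-based hashing, and the key format are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch; sizes and hashing are assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means definitely absent; True means "possibly present".
        return all(
            self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key)
        )


bf = BloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))
```

False positives are possible, so a positive answer still requires reading the underlying data; only the negative answers save work.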

18 Performance Evaluation

19 Real Applications
Google Analytics
– http://analytics.google.com
Google Earth
– http://earth.google.com
Personalized search
– www.google.com/psearch

20 Conclusions
Users like…
– the performance and high availability provided by the Bigtable implementation
– that they can scale the capacity of their clusters by simply adding more machines as their resource demands change over time
There are significant advantages to building a custom storage solution
Challenges…
– User adoption and acceptance of a new interface
– Implementation issues

