Bigtable: A Distributed Storage System for Structured Data Fay Chang et al. (Google, Inc.) Presenter: Kyungho Jeon 10/22/2012 Fall.


1 Bigtable: A Distributed Storage System for Structured Data
Fay Chang et al. (Google, Inc.)
Presenter: Kyungho Jeon
10/22/2012, Fall 2012: CSE 704 Web-scale Data Management

2 Motivation and Design Goal
Distributed Storage System for Structured Data
– Scalability: petabytes of data on thousands of (commodity) machines
– Wide applicability: throughput-oriented and latency-sensitive workloads
– High performance
– High availability

3 Data Model

4 Data Model
Not a full relational data model
Provides a simple data model
– Supports dynamic control over data layout
– Allows clients to reason about the locality properties

5 Data Model – A Big Table
A table in Bigtable is a:
– Sparse
– Distributed
– Persistent
– Multidimensional
– Sorted map

6 Data Model

7 Data Model
Rows
– Data maintained in lexicographic order by row key
– Tablet: rows with consecutive keys; the unit of distribution and load balancing
Columns
– Column families
– Column key: family:qualifier
Cells
Timestamps
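
The data model above can be pictured as a sorted map from (row key, column key, timestamp) to a value. The following is a minimal in-memory sketch of that idea, not Bigtable's implementation; the class name `SparseTable`, its methods, and the example keys in the usage note are all hypothetical.

```python
from bisect import insort


class SparseTable:
    """Toy stand-in for a (row, column, timestamp) -> value sorted map."""

    def __init__(self):
        # Keys are (row, column, -timestamp) so that, within one cell,
        # newer versions sort before older ones.
        self._keys = []     # sorted list of keys
        self._values = {}   # key -> value; cells never written take no space

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def versions(self, row, column):
        """Return [(timestamp, value), ...] for one cell, newest first."""
        return [(-k[2], self._values[k]) for k in self._keys
                if k[0] == row and k[1] == column]

    def get(self, row, column):
        """Return the newest version of a cell, or None if the cell is empty."""
        found = self.versions(row, column)
        return found[0][1] if found else None

    def row_keys(self):
        """Return the distinct row keys in lexicographic order."""
        return sorted({k[0] for k in self._keys})
```

For example, after `put("com.cnn.www", "contents:", 3, ...)` and `put("com.cnn.www", "contents:", 5, ...)`, `get` returns the value written at timestamp 5, and `row_keys()` lists rows in the same lexicographic order that determines which rows fall into the same tablet.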

8 Data Model – WebTable Example
A large collection of web pages and related information

9 Data Model – WebTable Example
Row Key
– Bigtable maintains data in lexicographic order by row key
– Tablet: a group of rows with consecutive keys; the unit of distribution

10 Data Model – WebTable Example
Column Family
– The column family is the unit of access control

11 Data Model – WebTable Example
Column
– A column key is specified as “family:qualifier”

12 Data Model – WebTable Example
Column
– A column can be added to a column family once the family has been created

13 Data Model – WebTable Example
Cell
– The storage referenced by a particular row key, column key, and timestamp

14 Data Model – WebTable Example
– Each cell in a table can contain multiple versions of the same data, indexed by timestamp

15 API

16 API
– Write or delete values in Bigtable
– Look up values from individual rows
– Iterate over a subset of the data in a table

17 API – Update a Row

18 API – Update a Row
Opens a table

19 API – Update a Row
Declares a mutation on the row

20 API – Update a Row
Stores a new item under the column key “anchor:www.c-span.org”

21 API – Update a Row
Deletes the item under the column key “anchor:www.abc.com”

22 API – Update a Row
Applies the mutation atomically
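
The code shown on these slides did not survive in the transcript. As a rough illustration of the same steps (open a table, declare a row mutation, set one column, delete another, apply atomically), here is a small Python sketch. `Table`, `RowMutation`, and their methods are hypothetical names for this sketch and are not the Bigtable client API; only the two column keys come from the slides.

```python
import threading


class RowMutation:
    """Buffers set/delete operations that target a single row."""

    def __init__(self, row_key):
        self.row_key = row_key
        self.ops = []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column, None))


class Table:
    def __init__(self):
        self._rows = {}                 # row key -> {column key: value}
        self._lock = threading.Lock()

    def apply(self, mutation):
        """Apply all buffered operations of one row mutation as a unit."""
        with self._lock:
            # Build the new row on a copy and swap it in at the end, so the
            # stored row is never left half-updated.
            row = dict(self._rows.get(mutation.row_key, {}))
            for op, column, value in mutation.ops:
                if op == "set":
                    row[column] = value
                else:
                    row.pop(column, None)
            self._rows[mutation.row_key] = row

    def read_row(self, row_key):
        return dict(self._rows.get(row_key, {}))
```

A caller would create `RowMutation(row_key)`, call `set("anchor:www.c-span.org", ...)` and `delete("anchor:www.abc.com")`, and then pass the mutation to `Table.apply`; either both changes become visible or neither does.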

23 API – Iterate over a Table
Create a Scanner instance

24 API – Iterate over a Table
Access the “anchor” column family

25 API – Iterate over a Table
Specify “return all versions”

26 API – Iterate over a Table
Specify a row key

27 API – Iterate over a Table
Iterate over rows
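
The scanner steps on these slides (create a scanner, restrict it to one column family, request all versions, pick a row, iterate) can be sketched as follows. This is an illustrative Python stand-in over a nested dictionary; the `Scanner` class and its method names are assumptions made for this sketch, not the Bigtable client API.

```python
class Scanner:
    """Iterates over a toy table: {row: {column: [(timestamp, value), ...]}}."""

    def __init__(self, table):
        self._table = table
        self._family = None          # None means "all column families"
        self._all_versions = False   # default: newest version only
        self._row = None             # None means "all rows"

    def fetch_column_family(self, family):
        self._family = family

    def set_return_all_versions(self):
        self._all_versions = True

    def lookup(self, row_key):
        self._row = row_key

    def __iter__(self):
        rows = [self._row] if self._row is not None else sorted(self._table)
        for row in rows:
            columns = self._table.get(row, {})
            for column in sorted(columns):
                family = column.split(":", 1)[0]
                if self._family is not None and family != self._family:
                    continue
                versions = sorted(columns[column], reverse=True)  # newest first
                if not self._all_versions:
                    versions = versions[:1]
                for timestamp, value in versions:
                    yield row, column, timestamp, value
```

Iterating the scanner yields `(row, column, timestamp, value)` tuples in row and column order, with the versions of each cell ordered newest first.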

28 API – Other Features
– Single-row transactions
– Execution of client-supplied scripts in the address space of the servers
– Can serve as an input source and output target for MapReduce jobs

29 A Typical Google Machine

30 A Google Cluster

31 A Google Cluster

32 Building Blocks
Chubby
– Highly available and persistent distributed lock service
GFS
– Stores logs and data files
SSTable
– Google’s immutable file format
– A persistent, ordered, immutable map from keys to values
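
To make "ordered, immutable map from keys to values" concrete, here is a minimal in-memory sketch. It illustrates the interface only (point lookup and range scan over sorted keys) and says nothing about the on-disk format; the class `SSTable` and its methods are hypothetical names for this sketch.

```python
from bisect import bisect_left


class SSTable:
    """Immutable sorted map, built once from key/value pairs."""

    def __init__(self, items):
        pairs = sorted(dict(items).items())
        # Tuples cannot be modified after construction.
        self._keys = tuple(k for k, _ in pairs)
        self._values = tuple(v for _, v in pairs)

    def get(self, key, default=None):
        """Binary-search the sorted keys for an exact match."""
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i], self._values[i]
            i += 1
```

Because the table never changes after it is built, updates have to go somewhere else first and be written out later as new tables.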

33 SSTable
For more info: log-structured-storage-leveldb/

34 Chubby
Highly available and persistent distributed lock service
– Five replicas, one of which is elected master
– Uses Paxos to keep the replicas consistent
– Provides a namespace that consists of directories and small files

35 Implementation
– Client library
– Master: one and only one!
– Tablet servers: many

36 Implementation - Master
Responsible for assigning tablets to tablet servers
– Addition and removal of tablet servers
– Tablet-server load balancing
– Garbage collection of files in GFS
Handles schema changes
Single-master system (as in GFS)

37 Tablet Server
– Manages a set of tablets
– Handles read and write requests to its tablets
– Splits tablets that have grown too large

38 How Does a Client Find a Tablet?

39 Tablet Assignment
– Each tablet is assigned to at most one tablet server at a time
– When a tablet is unassigned and a tablet server is available, the master assigns the tablet by sending a tablet load request
– Bigtable uses Chubby to keep track of tablet servers

40 Tablet Assignment
Detecting a tablet server that is no longer serving its tablets
– The master periodically asks each tablet server for the status of its lock
– If a tablet server reports that it has lost its lock, or if the master cannot reach a tablet server, the master attempts to acquire an exclusive lock on the server’s file
– If the lock acquisition succeeds, Chubby is alive, so the tablet server must have a problem
– The master deletes the server’s file in Chubby to ensure the tablet server can never serve again
– The master then moves all the tablets that were previously assigned to that server into the set of unassigned tablets
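
The detection steps above can be sketched as one function. This is a simplified illustration with a fake lock service; `FakeChubby`, `check_server`, and the boolean parameters that model the server's reply are all hypothetical and do not correspond to a real Chubby interface.

```python
class FakeChubby:
    """Stand-in lock service: one file per tablet server, each with a holder."""

    def __init__(self):
        self.files = {}   # file name -> current lock holder, or None if free

    def try_acquire(self, name, who):
        if name in self.files and self.files[name] is None:
            self.files[name] = who
            return True
        return False

    def delete(self, name):
        self.files.pop(name, None)


def check_server(chubby, server, holds_lock, reachable, assignments, unassigned):
    """Run one round of the master's check; return True if the server was retired.

    holds_lock and reachable model the outcome of the master's status request.
    """
    if reachable and holds_lock:
        return False                           # server looks healthy
    if not chubby.try_acquire(server, "master"):
        return False                           # lock not available; try again later
    # The lock was acquired, so the lock service is alive and the server is at fault.
    chubby.delete(server)                      # the server can never serve again
    unassigned.update(assignments.pop(server, set()))
    return True
```

After a server is retired, its tablets sit in `unassigned` until the master sends load requests for them to other tablet servers.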

41 Tablet Assignment
When a master is started, it:
– Grabs a unique master lock in Chubby
– Scans the servers directory in Chubby to find the live servers
– Communicates with every live tablet server to discover the current tablet assignment
– Scans the METADATA table and adds any tablets that are not already assigned to the set of unassigned tablets

42 Tablet Serving

43 Tablet Serving
Memtable
– A sorted buffer
– Maintains the updates on a row-by-row basis
– Each row is updated copy-on-write to maintain row-level consistency
– Older updates are stored in a sequence of SSTables

44 Tablet Serving

45 Tablet Serving - Write
Write operation
– The server checks that the operation is valid
– A valid mutation is written to the commit log
– After the write has been committed, its contents are inserted into the memtable
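
The write path above (validate, append to the commit log, then insert into the memtable) can be sketched as follows. The `Tablet` class is hypothetical, the validity check shown (the column family must exist) is only one assumed example of validation, and a Python list stands in for the commit log stored in GFS.

```python
class Tablet:
    def __init__(self, column_families):
        self.column_families = set(column_families)
        self.commit_log = []   # append-only; stands in for the log file in GFS
        self.memtable = {}     # (row, column) -> value

    def write(self, row, column, value):
        # 1. Validity check (assumed here: the column family must already exist).
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise ValueError(f"unknown column family: {family}")
        # 2. Append to the commit log first, so the update can be replayed
        #    if the server fails before the memtable is persisted.
        self.commit_log.append((row, column, value))
        # 3. Only then make the update visible in the memtable.
        self.memtable[(row, column)] = value
```

Rejected writes never reach the log, and every memtable entry has a matching log record written before it.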

46 Tablet Serving

47 Tablet Serving - Read
Read operation
– The server checks that the operation is valid
– A valid operation is executed on a merged view of the sequence of SSTables and the memtable
– The merged view can be formed efficiently because the SSTables and the memtable are lexicographically sorted data structures
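
Because every source is already sorted, the merged view can be produced with a streaming k-way merge instead of re-sorting. The sketch below uses Python's `heapq.merge`; the function name `merged_view` and the data layout are assumptions for illustration, and deletion markers are not modelled.

```python
import heapq


def merged_view(memtable, sstables):
    """Yield (key, value) in key order; the newest source wins on duplicates.

    memtable: dict holding the most recent updates.
    sstables: list of sorted (key, value) lists, newest first.
    """
    sources = [sorted(memtable.items())] + list(sstables)
    # Tag each entry with the age of its source (0 = newest) so that, for
    # equal keys, the newest entry comes out of the merge first.
    tagged = [[(key, age, value) for key, value in source]
              for age, source in enumerate(sources)]
    last_key = object()   # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:
            yield key, value
            last_key = key
```

Each source is consumed front to back exactly once, which is why keeping the memtable and SSTables sorted makes reads over the combined data cheap.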

48 Tablet Serving - Recover

49 Tablet Serving - Recover
Recover a tablet
– A tablet server reads the tablet’s metadata from the METADATA table
– The metadata contains the list of SSTables that comprise the tablet and a set of redo points
– The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points
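
The memtable reconstruction step can be sketched as a log replay. This is a simplification: it assumes a single redo point expressed as a log sequence number, whereas the slide mentions a set of redo points, and the function name and record layout are hypothetical.

```python
def recover_memtable(commit_log, redo_point):
    """Rebuild a memtable by replaying log entries at or after the redo point.

    commit_log: list of (sequence number, row, column, value) in log order.
    redo_point: assumed to be the first sequence number not yet in an SSTable.
    """
    memtable = {}
    for seq, row, column, value in commit_log:
        if seq >= redo_point:
            # Later log entries overwrite earlier ones for the same cell.
            memtable[(row, column)] = value
    return memtable
```

Entries before the redo point are skipped because their effects are already captured in the SSTables listed in the tablet's metadata.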

50 Compaction
Minor compaction
– When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable
Major compaction
– Rewrites multiple SSTables into one SSTable
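
Both compaction kinds can be sketched on a toy tablet. The class `CompactingTablet` is hypothetical, the threshold is counted in entries rather than bytes for simplicity, and sorted tuples stand in for SSTable files.

```python
class CompactingTablet:
    def __init__(self, threshold):
        self.threshold = threshold   # memtable entries that trigger a minor compaction
        self.memtable = {}
        self.sstables = []           # newest first; each is a sorted tuple of pairs

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self.minor_compaction()

    def minor_compaction(self):
        # Freeze the current memtable and start a new, empty one.
        frozen, self.memtable = self.memtable, {}
        # Convert the frozen memtable to an immutable, sorted table.
        self.sstables.insert(0, tuple(sorted(frozen.items())))

    def major_compaction(self):
        merged = {}
        for sstable in reversed(self.sstables):   # oldest first
            merged.update(sstable)                # newer values overwrite older ones
        self.sstables = [tuple(sorted(merged.items()))]
```

Minor compactions keep memory use bounded, while a major compaction reduces the number of tables a read has to consult.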

51 Compaction
[Diagram labels: Write Op, Memory, memtable, GFS, Commit Log, SSTable, SSTable]

52 Compaction
[Diagram, as above: threshold reached]

53 Compaction
[Diagram, as above: threshold reached]

54 Compaction
[Diagram, as above: a new memtable]

55 Compaction
[Diagram labels: Write Op, Memory, memtable, GFS, Commit Log, SSTable; major compaction]

56 Schema Management
– Bigtable schemas are stored in Chubby
– The master updates the schema by rewriting the corresponding schema file in Chubby

57 Optimization
Locality Group
– Client-defined
– An abstraction that enables clients to control their data’s storage layout
– A separate SSTable is generated for each locality group in each tablet during compaction
– A locality group can be declared to be in-memory

58 Optimization
Compression
– Clients can control whether the SSTables for a locality group are compressed

59 Optimization
Two-level Caching for Read Performance
– Scan cache (higher level): caches the key-value pairs returned by the SSTable interface to the tablet server code
– Block cache (lower level): caches SSTable blocks

60 Optimization
Bloom Filters
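
A Bloom filter answers "might this key be present?" with no false negatives, which lets a read skip tables that definitely do not hold the requested key. The sketch below is a generic textbook Bloom filter, not Bloom filter code from Bigtable; the sizes, the SHA-256-based hashing, and the class name are assumptions chosen for clarity.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))
```

A reader would keep one filter per table, populate it with that table's keys, and consult the table itself only when `might_contain` returns True.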

61 Optimization
Commit-Log Implementation
– Uses one log per tablet server
– Recovery?
  A tablet server that hosted 100 tablets failed
  100 other machines were each assigned a single tablet
  100 reads?
  Sort the commit log by
– Writing commit logs
  Two log-writer threads

62 Performance Evaluation
Sequential writes/reads
– Row keys with names 0 to R-1, partitioned into 10N equal-sized ranges
– Wrote a single string under each row key
– 1 GB per tablet server
Scan
– Uses the Bigtable Scan API
Random writes/reads
– Similar to sequential writes/reads, but the row key was hashed
Random reads (mem)
– 100 MB per tablet server; the locality group is marked as in-memory

63 Single Tablet Server Performance

64 Aggregate Throughput

65 Real Applications

66 Lessons Learned
– Failures!
– Delay new features until it is clear how they will be used
– Monitoring
– Simple design!

67 Acknowledgement
Jeff Dean, “Handling Large Datasets at Google: Current Systems and Future Directions”

