Homework 2 What is the role of the secondary database that we have to create? What is the role of the secondary database that we have to create?  A relational.

Homework 2 What is the role of the secondary database that we have to create? What is the role of the secondary database that we have to create?  A relational DBMS supports multiple index structures on a table:  Create B-tree index on Salary attribute of Emp  Create Hash index on SS# of Emp  When an application deletes a record from the Emp table (say using the hash index given the SS# of the employee to be fired), all index structures are updated.  The same concept exists with a relational storage manager (Berkeley DB).  Create primary and secondary databases and associate them with one another:  Create a B+-tree primary database on Salary  Create a Hash secondary database on SS#  Associate the two indexes together.

Homework 2 When we verify that all of our records have been stored correctly, is it sufficient to just count the number of retrieved records, or do we have to keep our own copy of all the records and verify each row in the database against our "shadow" copy? When we verify that all of our records have been stored correctly, is it sufficient to just count the number of retrieved records, or do we have to keep our own copy of all the records and verify each row in the database against our "shadow" copy?  Either approach is acceptable. Hint 1: Start by defining very small amount of memory for your main memory database (say 10 MB) to minimize time required to debug your program. Once your program is stable, scale to a large amount of memory. Hint 1: Start by defining very small amount of memory for your main memory database (say 10 MB) to minimize time required to debug your program. Once your program is stable, scale to a large amount of memory. Hint 2: Do not be surprised if Berkeley DB stores 12 MB of data into 10 MB – read the documentation carefully! Hint 2: Do not be surprised if Berkeley DB stores 12 MB of data into 10 MB – read the documentation carefully!

Google’s Bigtable Shahram Ghandeharizadeh Computer Science Department University of Southern California

Overall Architecture Shared-nothing architecture consisting of thousands of nodes! Shared-nothing architecture consisting of thousands of nodes!  A node is an off-the-shelf, commodity PC. Google File System Google’s Bigtable Data Model Google’s Map/Reduce Framework Yahoo’s Pig Latin …….

Bigtable A data model (a schema). A data model (a schema). A sparse, distributed persistent multi-dimensional sorted map. A sparse, distributed persistent multi-dimensional sorted map. Data is partitioned across the nodes seamlessly. Data is partitioned across the nodes seamlessly. The map is indexed by a row key, column key, and a timestamp. The map is indexed by a row key, column key, and a timestamp. Output value in the map is an un-interpreted array of bytes. Output value in the map is an un-interpreted array of bytes.  (row: byte[ ], column: byte[ ], time: int64)  byte[ ]

Rows A row key is an arbitrary string. A row key is an arbitrary string.  Typically 10-100 bytes in size, up to 64 KB. Every read or write of data under a single row is atomic. Every read or write of data under a single row is atomic. Data is maintained in lexicographic order by row key. Data is maintained in lexicographic order by row key. The row range for a table is dynamically partitioned. The row range for a table is dynamically partitioned. Each partition (row range) is named a tablet. Each partition (row range) is named a tablet.  Unit of distribution and load-balancing. Objective: make read operations single-sited! Objective: make read operations single-sited!  E.g., In Webtable, pages in the same domain are grouped together by reversing the hostname components of the URLs: com.google.maps instead of maps.google.com.

Column Families Column keys are grouped into sets called column families. Column keys are grouped into sets called column families. A column family must be created before data can be stored in a column key. A column family must be created before data can be stored in a column key. Hundreds of static column families. Hundreds of static column families. Syntax is family:key, e.g., Language:English, Language:German, etc. Syntax is family:key, e.g., Language:English, Language:German, etc.

Timestamps 64 bit integers 64 bit integers Assigned by: Assigned by:  Bigtable: real-time in microseconds,  Client application: when unique timestamps are a necessity. Items in a cell are stored in decreasing timestamp order. Items in a cell are stored in decreasing timestamp order. Application specifies how many versions (n) of data items are maintained in a cell. Application specifies how many versions (n) of data items are maintained in a cell.  Bigtable garbage collects obsolete versions.

Bigtable Used in different applications supported by Google. Used in different applications supported by Google.

Application 1: Google Analytics Enables webmasters to analyze traffic pattern at their web sites. Statistics such as: Enables webmasters to analyze traffic pattern at their web sites. Statistics such as:  Number of unique visitors per day and the page views per URL per day,  Percentage of users that made a purchase given that they earlier viewed a specific page. How? How?  A small JavaScript program that the webmaster embeds in their web pages.  Every time the page is visited, the program is executed.  Program records the following information about each request:  User identifier  The page being fetched

Application 1: Google Analytics (Cont…) Two of the Bigtables Two of the Bigtables  Raw click table (~ 200 TB)  A row for each end-user session.  Row name include website’s name and the time at which the session was created.  Clustering of sessions that visit the same web site. And a sorted chronological order.  Compression factor of 6-7.  Summary table (~ 20 TB)  Stores predefined summaries for each web site.  Generated from the raw click table by periodically scheduled MapReduce jobs.  Each MapReduce job extracts recent session data from the raw click table.  Row name includes website’s name and the column family is the aggregate summaries.  Compression factor is 2-3. Single-sited

Application 2: Google Earth & Maps Functionality: Pan, view, and annotate satellite imagery at different resolution levels. Functionality: Pan, view, and annotate satellite imagery at different resolution levels. One Bigtable stores raw imagery (~ 70 TB): One Bigtable stores raw imagery (~ 70 TB):  Row name is a geographic segments. Names are chosen to ensure adjacent geographic segments are clustered together.  Column family maintains sources of data for each segment. There are different sets of tables for serving client data, e.g., index table. There are different sets of tables for serving client data, e.g., index table. Single-sited

Application 3: Personalized Search Records user queries and clicks across Google properties. Records user queries and clicks across Google properties. Users browse their search histories and request for personalized search results based on their historical usage patterns. Users browse their search histories and request for personalized search results based on their historical usage patterns. One Bigtable: One Bigtable:  Row name is userid  A column family is reserved for each action type, e.g., web queries, clicks.  User profiles are generated using MapReduce.  These profiles personalize live search results.  Replicated geographically to reduce latency and increase availability. Single-sited

Bigtable API Implements interfaces to Implements interfaces to  create and delete tables and column families,  modify cluster, table, and column family metadata such as access control rights,  Write or delete values in Bigtable,  Look up values from individual rows,  Iterate over a subset of the data in a table,  Atomic R-M-W sequences on data stored in a single row key (No support for Xacts across multiple rows).

Function Shipping Similar to Gamma, Bigtable is based on function shipping. Similar to Gamma, Bigtable is based on function shipping. Yay! Very smart!

Assumptions Uses GFS to store log and data files. Uses GFS to store log and data files. Bigtable processes share the same machines with processes from other applications. Bigtable processes share the same machines with processes from other applications.  A shared cluster of commodity PCs. A cluster management system: A cluster management system:  Schedules jobs,  Manages resources on shared machines,  Monitors PC status and handles failures.

Building Blocks Google File System Google File System  High availability. SSTable SSTable  A key/value database. Chubby Chubby  Name space.

SSTable A database similar to a BDB database: A database similar to a BDB database:  Stores and retrieves key/data pairs.  Key and data are arbitrary byte arrays.  Cursors to iterate key/value pairs given a selection predicate (exact and range).  Configurable to use either persistent store (disk) or main-memory based. A SSTable is stored in GFS. A SSTable is stored in GFS.

Bigtable: Hybrid Range Partitioning [VLDB’90] To minimize the impact of load imbalance, construct more (HN) ranges than (N) nodes, e.g., 10 ranges for a 5 node system; H = 2. To minimize the impact of load imbalance, construct more (HN) ranges than (N) nodes, e.g., 10 ranges for a 5 node system; H = 2. H is higher in practice; 10 in the experimental section of the paper. H is higher in practice; 10 in the experimental section of the paper. A range is named a tablet. A tablet is represented as: A range is named a tablet. A tablet is represented as:  A set of SSTable files.  A set of redo points which are pointers into any commit logs that main contain data for the tablet. 0-1051-60 11-2061-70 21-3071-80 31-4081-9041-5091-100

Chubby A persistent and distributed lock service. A persistent and distributed lock service. Consists of 5 active replicas, one replica is the master and serves requests. Consists of 5 active replicas, one replica is the master and serves requests. Service is functional when majority of the replicas are running and in communication with one another – when there is a quorum. Service is functional when majority of the replicas are running and in communication with one another – when there is a quorum. Implements a nameservice that consists of directories and files. Implements a nameservice that consists of directories and files.

Software Infrastructure 1. A Bigtable library linked to every client. 2. Many tablet servers.  Tablet servers are added and removed dynamically.  Ten to a thousand tablets assigned to a tablet server.  Each tablet is typically 100-200 MB in size. 3. One master server responsible for:  Assigning tablets to tablet servers,  Detecting the addition and deletion of tablet servers,  Balancing tablet-server load,  Garbage collection of files in GFS. Client communicates directly with tablet server for reads/writes. Client communicates directly with tablet server for reads/writes.

Location of Tablets (Ranges) A 3-level hierarchy: A 3-level hierarchy: 1 st Level: A file stored in chubby contains location of the root tablet, i.e., a directory of ranges (tablets) and associated meta-data. 1 st Level: A file stored in chubby contains location of the root tablet, i.e., a directory of ranges (tablets) and associated meta-data.  The root tablet never splits. 2 nd Level: Each meta-data tablet contains the location of a set of user tablets. 2 nd Level: Each meta-data tablet contains the location of a set of user tablets. 3 rd Level: A set of SSTable identifiers for each tablet. 3 rd Level: A set of SSTable identifiers for each tablet. Analysis: Analysis:  Each meta-data row stores ~ 1KB of data,  With 128 MB tablets, the three level store addresses 2 34 tablets (2 61 bytes in 128 MB tablets).  Approaches a Zetabyte (million Petabytes).

Client/Master Client caches tablet locations. Client caches tablet locations.

Bigtable and Chubby Bigtable uses Chubby to: Bigtable uses Chubby to:  Ensure there is at most one active master at a time,  Store the bootstrap location of Bigtable data (Root tablet),  Discover tablet servers and finalize tablet server deaths,  Store Bigtable schema information (column family information),  Store access control list. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.

Placement of Tablets A tablet is assigned to one tablet server at a time. A tablet is assigned to one tablet server at a time. Master maintains: Master maintains:  The set of live tablet servers,  Current assignment of tablets to tablet servers (including the unassigned ones) Chubby maintains tablet servers: Chubby maintains tablet servers:  A tablet server creates and acquires an eXclusive lock on a uniquely named file in a specific chubby directory (named server directory),  Master monitors server directory to discover tablet server,  A tablet server stops processing requests if it loses its X lock (network partitioning).  Tablet server will try to obtain an X lock on its uniqely named file as long as it exists.  If the uniquely named file of a tablet server no longer exists then the tablet server kills itself. Goes back to a free pool to be assigned tablets by the master.

Placement of Tablets Master detects when a tablet server is in the free pool. Master detects when a tablet server is in the free pool.  How? Master periodically probes each tablet server for the status of its lock.

Master Should the Master die, a new Master is initiated. The master executes the following steps: Should the Master die, a new Master is initiated. The master executes the following steps:

Client Write & Read Operations Write operation arrives at a tablet server: Write operation arrives at a tablet server:  Server ensures the client has sufficient privileges for the write operation (Chubby),  A log record is generated to the commit log file,  Once the write commits, its contents are inserted into the memtable. Read operation arrives at a tablet server: Read operation arrives at a tablet server:  Server ensures client has sufficient privileges for the read operation (Chubby),  Read is performed on a merged view of (a) the SSTables that constitute the tablet, and (b) the memtable.

Write Operations As writes execute, size of memtable increases. As writes execute, size of memtable increases. Once memtable reaches a threshold: Once memtable reaches a threshold:  Memtable is frozen,  A new memtable is created,  Frozen metable is converted to an SSTable and written to GFS. This minor compaction minimizes memory usage of tablet server, and reduces recovery time in the presence of crashes (checkpoints). This minor compaction minimizes memory usage of tablet server, and reduces recovery time in the presence of crashes (checkpoints). Merging compaction (in the background) reads a few SSTables and memtable to produce one SSTable. (Input SSTables and memtable are discareded.) Merging compaction (in the background) reads a few SSTables and memtable to produce one SSTable. (Input SSTables and memtable are discareded.) Major compaction rewrites all SSTables into exactly one SSTable (containing no deletion entries). Major compaction rewrites all SSTables into exactly one SSTable (containing no deletion entries).

System Performance Experiments involving random reads (from GFS and main memory) and writes, sequential reads and writes, and scans. Experiments involving random reads (from GFS and main memory) and writes, sequential reads and writes, and scans.  Scan: A single RPC fetches a large sequence of values from the tablet server.

Random Reads

Sequential Reads A read request is for 1000 bytes. A read request is for 1000 bytes. Sequential reads perform better because a tablet server caches the 64 KB SSTable block (from GFS) and uses it to serve the next 64 read requests.

Random Reads from Memory Random reads from memory avoid the overhead of fetching a 64 KB block from GFS. Data is mapped onto the memory of the tablet server

Writes Tablet server appends all incoming writes to a single commit log and uses group commit to stream these writes to GFS efficiently. Tablet server appends all incoming writes to a single commit log and uses group commit to stream these writes to GFS efficiently.

Scale-up As the number of tablet servers is increased by a factor of 500: As the number of tablet servers is increased by a factor of 500:  Performance of random reads from memory increases by a factor of 300.  Performance of scans increases by a factor of 260. Why? Why?

3 R1R2R3 R1R3R2 R1R3 R2R3R1 R3R1R2 R3R2R1 {R1, R2, R3} {R1, R3}R2 {R1, R3}R2 {R1, R3}R2 {R1, R3}R2 {R1, R3}R2 6 Ideal cases {R1, R3}R2 {R2, R3}R1 {R2, R3}R1 {R2, R3}R1 {R2, R3}R1 {R2, R3}R1 {R2, R3}R1 {R2, R1}R3 {R2, R1}R3 {R2, R1}R3 {R2, R1}R3 {R2, R1}R3 {R2, R1}R3 21 27 ways to assign 3 requests to the 3 nodes!

Brain Teaser Given N servers and M requests, Given N servers and M requests,  compute the probability of:  M/N requests per node.  Number of ways M requests may map onto N servers and the probability of each scenario.  Reward for correct answer:

Data Shipping? Data shipping will saturate resources. Data shipping will saturate resources. Do not be fooled by this discussion because Bigtable has “function shipping” (not reported in the evaluation section). Do not be fooled by this discussion because Bigtable has “function shipping” (not reported in the evaluation section).

Lessons Many types of errors in a real system. Many types of errors in a real system. Delay adding new features until it is clear how the new feature will be used. Delay adding new features until it is clear how the new feature will be used. Very important to have eyes that can see: Very important to have eyes that can see:

Conclusion From the very first lecture of this semester: From the very first lecture of this semester:

Homework 2 What is the role of the secondary database that we have to create? What is the role of the secondary database that we have to create?  A relational.

Similar presentations

Presentation on theme: "Homework 2 What is the role of the secondary database that we have to create? What is the role of the secondary database that we have to create?  A relational."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Homework 2 What is the role of the secondary database that we have to create? What is the role of the secondary database that we have to create?  A relational.

Similar presentations

Presentation on theme: "Homework 2 What is the role of the secondary database that we have to create? What is the role of the secondary database that we have to create?  A relational."— Presentation transcript:

Similar presentations

About project

Feedback