BigTable and Accumulo CMSC 461 Michael Wilson. BigTable  This was Google’s original distributed data concept  Key value store  Meant to be scaled up.

Slides:



Advertisements
Similar presentations
Inner Architecture of a Social Networking System Petr Kunc, Jaroslav Škrabálek, Tomáš Pitner.
Advertisements

Introduction to cloud computing Jiaheng Lu Department of Computer Science Renmin University of China
Tomcy Thankachan  Introduction  Data model  Building Blocks  Implementation  Refinements  Performance Evaluation  Real applications  Conclusion.
Homework 2 What is the role of the secondary database that we have to create? What is the role of the secondary database that we have to create?  A relational.
Based on the text by Jimmy Lin and Chris Dryer; and on the yahoo tutorial on mapreduce at index.html
CS525: Special Topics in DBs Large-Scale Data Management HBase Spring 2013 WPI, Mohamed Eltabakh 1.
Data Management in the Cloud Paul Szerlip. The rise of data Think about this o For the past two decades, the largest generator of data was humans -- now.
Store RDF Triples In A Scalable Way Liu Long & Liu Chunqiu.
Map/Reduce in Practice Hadoop, Hbase, MongoDB, Accumulo, and related Map/Reduce- enabled data stores.
Bigtable: A Distributed Storage System for Structured Data Presenter: Guangdong Liu Jan 24 th, 2012.
Lecture 7 – Bigtable CSE 490h – Introduction to Distributed Computing, Winter 2008 Except as otherwise noted, the content of this presentation is licensed.
Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
Homework 2 In the docs folder of your Berkeley DB, have a careful look at documentation on how to configure BDB in main memory. In the docs folder of your.
7/2/2015EECS 584, Fall Bigtable: A Distributed Storage System for Structured Data Jing Zhang Reference: Handling Large Datasets at Google: Current.
 Pouria Pirzadeh  3 rd year student in CS  PhD  Vandana Ayyalasomayajula  1 st year student in CS  Masters.
BigTable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows,
Distributed storage for structured data
Bigtable: A Distributed Storage System for Structured Data
BigTable CSE 490h, Autumn What is BigTable? z “A BigTable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by.
Inexpensive Scalable Information Access Many Internet applications need to access data for millions of concurrent users Relational DBMS technology cannot.
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Gowtham Rajappan. HDFS – Hadoop Distributed File System modeled on Google GFS. Hadoop MapReduce – Similar to Google MapReduce Hbase – Similar to Google.
Hadoop: The Definitive Guide Chap. 8 MapReduce Features
Author: Murray Stokely Presenter: Pim van Pelt Distributed Computing at Google.
Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E.
Google Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber.
SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.
SaaS 傅汝緯 李碩元 林子驥 1. What is SaaS?  Definition :Software as a service  a software delivery model in which software and associated data are centrally.
Zois Vasileios Α. Μ :4183 University of Patras Department of Computer Engineering & Informatics Diploma Thesis.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Google’s Big Table 1 Source: Chang et al., 2006: Bigtable: A Distributed Storage System for Structured Data.
Bigtable: A Distributed Storage System for Structured Data Google’s NoSQL Solution 2013/4/1Title1 Chao Wang Fay Chang, Jeffrey Dean, Sanjay.
Google Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber.
1 Dennis Kafura – CS5204 – Operating Systems Big Table: Distributed Storage System For Structured Data Sergejs Melderis 1.
Bigtable: A Distributed Storage System for Structured Data 1.
Google Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber.
Big Table - Slides by Jatin. Goals wide applicability Scalability high performance and high availability.
Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows,
Large-scale Incremental Processing Using Distributed Transactions and Notifications Daniel Peng and Frank Dabek Google, Inc. OSDI Feb 2012 Presentation.
Key/Value Stores CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation MongoDB Architecture.
DATABASE SYSTEMS. DATABASE u A filing system for holding data u Contains a set of similar files –Each file contains similar records Each record contains.
MapReduce and NoSQL CMSC 461 Michael Wilson. Big data  The term big data has become fairly popular as of late  There is a need to store vast quantities.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
CSC590 Selected Topics Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A.
Session 1 Module 1: Introduction to Data Integrity
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
MapReduce Joins Shalish.V.J. A Refresher on Joins A join is an operation that combines records from two or more data sets based on a field or set of fields,
Bigtable : A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows,
Bigtable: A Distributed Storage System for Structured Data
1 HBASE – THE SCALABLE DATA STORE An Introduction to HBase XLDB Europe Workshop 2013: CERN, Geneva James Kinley EMEA Solutions Architect, Cloudera.
Data Model and Storage in NoSQL Systems (Bigtable, HBase) 1 Slides from Mohamed Eltabakh.
Bigtable: A Distributed Storage System for Structured Data Google Inc. OSDI 2006.
Introduction to Core Database Concepts Getting started with Databases and Structure Query Language (SQL)
Department of Computer Science, Johns Hopkins University EN Instructor: Randal Burns 24 September 2013 NoSQL Data Models and Systems.
Apache Accumulo CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook.
SQL Basics Review Reviewing what we’ve learned so far…….
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
Bigtable A Distributed Storage System for Structured Data.
1 Gaurav Kohli Xebia Breaking with DBMS and Dating with Relational Hbase.
Lecture 7 Bigtable Instructor: Weidong Shi (Larry), PhD
Column-Based.
Software Systems Development
Bigtable: A Distributed Storage System for Structured Data
CSE-291 (Cloud Computing) Fall 2016
A Distributed Storage System for Structured Data
Presentation transcript:

BigTable and Accumulo CMSC 461 Michael Wilson

BigTable  This was Google’s original distributed data concept  Key value store  Meant to be scaled up into the petabyte range  Used by many of Google’s online applications  Google Maps, Youtube,

BigTable storage  There are many “tables” in a single deployment of BigTable  Each “cell” in a table consists of three values  Row key (string)  Column key (string)  Byte array  Timestamp  It takes nothing more than this

Tablets  Tables are split into tablets  Tablets are subdivisions of the parent table  Spread across the machines present in a BigTable cluster  When tablets become too large ~200 MB, they “split”  The bigger the table becomes, the more tablets you have

Metadata tables  Spread amongst the regular tablets are META1 tablets  META1 tablets have information on where specific data is located  Ranges of data (row keys, column keys)  META0 tablet  Special tablet that maintains the locations of the various META1 tablets

Scanning for data  Queries in BigTable are done by row key and column key  These are pure strings – really just linearly searching through each tablet in order to find the appropriate strings  Metadata tables help you figure out which tablets are worth searching  To impose some sort of structure or order on the various cells, you need to cleverly construct your row/column keys

Pre-splitting a table  Tables can be set up with existing “split points”  If a developer knows that the data being inserted into the table will cause (or likely cause) a number of splits, split in advance  Splits can be costly  Tables not big enough to trigger splits  If you have a lot of small data, it can be useful to split it yourself

Split sizes  You can also control the split sizes of a table  Data is not very large, but you want it to grow in an intelligent way  Set the split sizes to a more reasonable number  Typically determined in bytes

Accumulo  Accumulo is based on the Google’s BigTable paper  Accumulo does things slightly differently

Server roles  Master  This is the equivalent of META0 and META1 combined  Tablet server  This houses various tablets spread across a table  A tablet server can host many tablets  Logger  Maintains writeahead logs

Server roles  gc  Garbage collects across the various tablets  monitor  Web page one can access to monitor various data about Accumulo

Cell break down  Cells in Accumulo are similar, but there are differences  Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value

Cell level security  Accumulo adds something that BigTable doesn’t have by default  Column visibility field  Allows you to assign “authorizations” to various cells in a table  You can issue the same query with different authorizations and get different results  Allows you to compartmentalize sensitive data from users who may not have access  RDBMSes have label security for a similar purpose

Duplicate behavior  Accumulo is capable of maintaining duplicates due to keeping track of the timestamp  It can be configured to behave in particular ways  By default, rows are overwritten with the most recent timestamp version

Iterators  Accumulo also has “iterators” which can be applied to tables  Allow for arbitrary code to be executed on tables exhibiting certain properties  Example: SummingCombiner  Can establish that the values of all rows with the same row ID and column family should be summed together  Can be applied to whole tables or can be issued as part of a standalone scan

MapReduce and Accumulo  One can pass an Accumulo table as input to a MapReduce job, or use an Accumulo table as output from a MapReduce job  The number of tablets you have = the number of mappers you have  One tablet processed by one mapper