Key/Value Stores CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.

Agenda
HBase
Accumulo
Maybe Redis

APACHE HBASE

Overview
Distributed, scalable, column-oriented key/value store
Implementation of Google’s Bigtable for Hadoop
Provides random, real-time read/write access to tables
Billions of rows by millions of columns on HDFS
Three core components
  – HBase Master
  – HBase RegionServer
  – ZooKeeper

How is data stored?
Namespace
  – Table
    » Region
      – Store (one Store per ColumnFamily)
        » MemStore
        » StoreFile
          – Block

HBase Architecture
[Diagram: a Client, the Master, and ZooKeeper alongside RegionServers running on HDFS. Each RegionServer hosts Regions, each Region contains Stores, and each Store holds a MemStore and HFiles.]

Data Model
Column families defined at table creation
Key: Row ID, Column Family, Column Qualifier, Timestamp
Value: byte[]
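To make the key layout concrete, here is a minimal Python sketch of how such keys are ordered: ascending by row ID, column family, and column qualifier, with newer timestamps first. The `Key` tuple, the sample cells, and the string comparison are illustrative assumptions; HBase itself compares raw bytes.

```python
from typing import NamedTuple

class Key(NamedTuple):
    row: str
    family: str
    qualifier: str
    timestamp: int

def sort_order(key: Key):
    # Ascending by row, family, qualifier; descending by timestamp
    # so the newest version of a cell sorts first.
    return (key.row, key.family, key.qualifier, -key.timestamp)

# Hypothetical cells; values are opaque bytes, as in the data model above.
cells = [
    (Key("com.cnn", "link", "yahoo.com", 1), b"cnn.com"),
    (Key("com.abc", "content", "", 1), b"<!DOCTYPE ..."),
    (Key("com.cnn", "link", "google.com", 1), b"cnn.com"),
    (Key("com.cnn", "link", "google.com", 2), b"CNN"),
]
ordered = sorted(cells, key=lambda kv: sort_order(kv[0]))
```

Because everything sharing a row ID is adjacent in this order, a fetch for one row touches one contiguous range.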

Locality Groups
Locality groups are a means to define different sets of columns that have different access patterns
  – Done via Column Families
  – Store metadata in one family, and images in another family
  – Set the proper column family based on what you need
Physically separated in HDFS to provide faster access times

Locality Groups
Row ID | Column Family | Column Qualifier | Value
com.abc | content | - | <!DOCTYPE …
com.cnbc | content | - | <!DOCTYPE …
com.cnbc | link | google.com | to cnbc
com.cnn | content | - | <!DOCTYPE …
com.cnn | link | google.com | cnn.com
com.cnn | link | yahoo.com | cnn.com
com.nbc | content | - | <!DOCTYPE …
com.nbc | link | yahoo.com | NBC

Locality Groups
Query: link data for CNBC and CNN
Row ID | Column Family | Column Qualifier | Value
com.abc | content | - | <!DOCTYPE …
com.cnbc | content | - | <!DOCTYPE …
com.cnbc | link | google.com | to cnbc
com.cnn | content | - | <!DOCTYPE …
com.cnn | link | google.com | cnn.com
com.cnn | link | yahoo.com | cnn.com
com.nbc | content | - | <!DOCTYPE …
com.nbc | link | yahoo.com | NBC

Locality Groups
Query: link data for CNBC and CNN
…
Row ID | Column Family | Column Qualifier | Value
com.abc | content | - | <!DOCTYPE …
com.cnbc | content | - | <!DOCTYPE …
com.cnn | content | - | <!DOCTYPE …
com.nbc | content | - | <!DOCTYPE …
com.cnbc | link | google.com | to cnbc
com.cnn | link | google.com | cnn.com
com.cnn | link | yahoo.com | cnn.com
com.nbc | link | yahoo.com | NBC
…
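The effect of grouping by column family can be sketched in a few lines of Python. This is only an in-memory illustration of the idea, not how HBase lays out files; `group_by_family` and `links_for` are hypothetical helper names.

```python
from collections import defaultdict

# (row ID, column family, column qualifier, value), as in the table above.
entries = [
    ("com.abc", "content", "-", "<!DOCTYPE ..."),
    ("com.cnbc", "content", "-", "<!DOCTYPE ..."),
    ("com.cnbc", "link", "google.com", "to cnbc"),
    ("com.cnn", "content", "-", "<!DOCTYPE ..."),
    ("com.cnn", "link", "google.com", "cnn.com"),
    ("com.cnn", "link", "yahoo.com", "cnn.com"),
    ("com.nbc", "content", "-", "<!DOCTYPE ..."),
    ("com.nbc", "link", "yahoo.com", "NBC"),
]

def group_by_family(rows):
    # Keep each column family in its own list, standing in for separate storage.
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return groups

def links_for(groups, wanted_rows):
    # Only the 'link' group is read; the large 'content' values are never touched.
    return [e for e in groups["link"] if e[0] in wanted_rows]

groups = group_by_family(entries)
result = links_for(groups, {"com.cnbc", "com.cnn"})
```

The query for CNBC and CNN link data reads four small entries instead of all eight, which is the access-time benefit the slides describe.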

How is Data Stored?
Table View
  – Columns: ID, Name, Created, Num Followers
  – Names: FastCoDesign, CorazoonBipolar, Telkomsel, WorIdComedy
Actual View ('profile' column family)
  – profile:created, profile:followers, profile:name = FastCoDesign
  – profile:created, profile:followers, profile:name = CorazoonBipolar
  – profile:created, profile:followers, profile:name = Telkomsel
  – profile:created, profile:followers, profile:name = WorIdComedy
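The step from the table view to the actual view can be sketched as flattening one logical row into one key/value pair per cell. The row ID and the `created` and `followers` values below are made up for illustration, since the slide's cell values are not available.

```python
def to_cells(row_id, columns, family="profile"):
    # One (row ID, family:qualifier, value) triple per column, sorted by key.
    return sorted(
        (row_id, f"{family}:{name}", value) for name, value in columns.items()
    )

# Hypothetical row: only the name comes from the slide.
cells = to_cells(
    "row1",
    {"name": "FastCoDesign", "created": "2015-01-01", "followers": 42},
)
```

Each returned triple corresponds to one stored key/value pair, so a row with three columns occupies three entries.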

Regions
Regions are split on row ID
  – i.e. key/value pairs with the same row ID are never spread across two regions
Regions are indexed and Bloom filtered to give HBase RegionServers the ability to quickly seek into an HDFS block and get the data
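Since regions are contiguous row ID ranges, finding the region for a row is a binary search over the split keys. The sketch below assumes made-up split keys and a start-inclusive, end-exclusive range per region.

```python
import bisect

# Hypothetical split keys: three splits give four regions.
#   region 0: rows < "g", region 1: "g" <= rows < "n",
#   region 2: "n" <= rows < "t", region 3: rows >= "t"
split_keys = ["g", "n", "t"]

def region_for(row_id, splits):
    # bisect_right makes a row equal to a split key land in the region it starts.
    return bisect.bisect_right(splits, row_id)
```

For example, `region_for("com.cnn", split_keys)` falls in the first region, and every key/value pair for that row ID lands in the same place.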

Regions

Bloom Filters and Block Caching
Use these for optimal fetch performance!
Bloom Filters
  – Stored in memory on each RegionServer
  – Used as a preliminary test prior to opening a region on HDFS
  – Very effective for fetches that are likely to have a null value
Block Caching
  – Configurable number of key/value pairs to read into memory when a RegionServer fetches data
  – Very effective for multiple fetches with similar keys
Can configure HBase to store all regions in-memory
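A Bloom filter answers "definitely absent" or "possibly present", which is why it works as a cheap test before going to disk. The following is a generic sketch of the data structure with arbitrary sizes and SHA-256 as the hash; it is not HBase's implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key was never added; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )

bf = BloomFilter()
for row_id in ["com.abc", "com.cnbc", "com.cnn"]:
    bf.add(row_id)
```

A fetch for a row ID where `might_contain` returns `False` can skip the disk read entirely, which matches the slide's point about fetches that find nothing.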

Compactions
Minor
  – Picks up a few StoreFiles and merges them together
  – Can sometimes pick up all the files in the Store and promote itself to a Major compaction
Major
  – Single StoreFile per Store
  – All expired cells will be dropped
    » Does not occur in minor compactions
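Following the description above, a compaction can be sketched as a merge of already-sorted files, with expired cells dropped only in the major case. The cell layout `(row ID, timestamp, value)`, the `ttl` parameter, and the sample data are assumptions made for illustration.

```python
import heapq

def compact(store_files, now=None, ttl=None):
    """Merge sorted StoreFiles into one sorted run.

    With no ttl this behaves like the minor case above (merge only).
    With now and ttl it also drops cells older than ttl, as in the major case.
    """
    merged = heapq.merge(*store_files)
    if ttl is None or now is None:
        return list(merged)
    return [cell for cell in merged if now - cell[1] <= ttl]

# Hypothetical StoreFiles, each already sorted by row ID.
f1 = [("a", 100, "x"), ("c", 900, "y")]
f2 = [("b", 950, "z")]
f3 = [("d", 50, "w")]

minor = compact([f1, f2])                       # merge a few files
major = compact([minor, f3], now=1000, ttl=500)  # merge all, drop expired
```

After the major pass only a single sorted run remains, and the cells with timestamps 100 and 50 are gone because they are more than 500 time units old.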

Creating and Managing Tables
Tables contain Column Families
You can (and should) pre-define your table split keys
  – Defines the regions of a table
  – Allows for better data distribution, especially when doing a bulk-load of data
HBase will split regions automatically as needed
  – Master has no part in this
Lower number of regions preferred, in the range of 20 to low-hundreds per RegionServer
Can split manually

Bulk Importing
Create table
Use MapReduce to generate HFiles in batch
Tell HBase where the table files are
Drastically reduces run-time for table ingestion

What can I do with it?
HBase is designed for fast fetches (~10ms) of your big data sets
Random Inserts/Updates/Deletes of data
Versioning
Changing schemas

What shouldn’t I do with it?
Full-table scans
  – Slow
  – Use MapReduce instead (still slow)
High-throughput transactions
  – Use Redis or another in-memory solution for data sets that can fit in-memory
Monotonically Increasing Row IDs
  – There are workarounds!
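The slide does not say which workarounds it has in mind. One commonly used option is salting: prefixing each row ID with a small hash-derived bucket so that consecutive IDs spread across regions instead of piling onto the last one. The bucket count, the `NN-` prefix format, and the helper names below are all assumptions.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical; would normally match the number of pre-split regions

def salted_row_id(row_id, buckets=NUM_BUCKETS):
    # Deterministic salt, so the same row ID always maps to the same bucket.
    salt = zlib.crc32(row_id.encode("utf-8")) % buckets
    return f"{salt:02d}-{row_id}"

def split_keys(buckets=NUM_BUCKETS):
    # Split points that give each salt prefix its own region.
    return [f"{b:02d}-" for b in range(1, buckets)]

salted = salted_row_id("20150101120000")
```

The trade-off is that a range scan over the original row IDs now has to be issued once per bucket and the results merged.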

Types of Operations
Three Java objects to work with a table
  – Put
  – Get
  – Delete
Scanning can be done with the 'Scan' object
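The semantics of these four operations can be illustrated with a toy in-memory table. This is a Python stand-in written for this transcript, not the HBase Java API; `ToyTable` and its method signatures are hypothetical.

```python
class ToyTable:
    """In-memory stand-in for a table; keys are (row ID, 'family:qualifier')."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, value):
        # Insert or overwrite a single cell.
        self._cells[(row, column)] = value

    def get(self, row):
        # Fetch every cell of one row.
        return {col: val for (r, col), val in self._cells.items() if r == row}

    def delete(self, row):
        # Remove every cell of one row.
        for key in [k for k in self._cells if k[0] == row]:
            del self._cells[key]

    def scan(self, start_row, stop_row):
        # Yield cells in sorted key order for start_row <= row < stop_row.
        for key in sorted(self._cells):
            if start_row <= key[0] < stop_row:
                yield key[0], key[1], self._cells[key]

table = ToyTable()
table.put("com.cnn", "link:google.com", "cnn.com")
table.put("com.nbc", "link:yahoo.com", "NBC")
table.put("com.abc", "content:", "<!DOCTYPE ...")
```

Put, Get, and Delete address a single row, while a scan walks a sorted range of rows.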

Table Manipulation
HBaseAdmin
  – Management commands of creating tables, enabling/disabling tables, deleting tables, etc.
HTable
  – Actually putting/fetching/deleting/scanning data

Simple Example
A basic HBase application that demonstrates:
  – Creating a Table
  – Deleting a Table
  – Putting data
  – Getting data
  – Scanning data
    » With a simple Column Family filter

APACHE ACCUMULO

Overview
Google's Bigtable for Hadoop with security
Similar to HBase
Generally, Accumulo is faster at writes, HBase is faster at reads

Accumulo Architecture
[Diagram: a Client, the Master, and ZooKeeper alongside TabletServers running on HDFS. Each TabletServer hosts Tablets, each Tablet contains CFs, and each CF holds a MemStore and TFiles.]

Data Model
Identical to HBase, with an additional 'visibility' label
Column families defined dynamically
Key: Row ID, Column Family, Column Qualifier, Visibility, Timestamp
Value: byte[]

Features Include
Creating/Deleting Tables
Major/Minor Compactions
Bloom Filters/Block Caching
Bulk Importing
Transactions via Mutations
Two Types of Range Scans
  – Scanner vs Batch Scanner
Iterators

Real-time processing framework
Provide "Reduce-like" functionality, but at very low latency
Iterators are configured to run at:
  – Scan time
  – Minor Compaction
  – Major Compaction
AgeOffIterator – automatically age off key/value pairs during scans and compactions

Scan Time Iterator

Minor Compaction Iterator

Major Compaction Iterator

Iterator Types
Versioning
  – Configure the number of identical key/value pairs to store
Filtering
  – Apply arbitrary filtering to key/value pairs
Combiners
  – Aggregate values from keys that share a Row ID, Column Family, and Column Qualifier

Versioning Iterator
Given multiple versions of the same row, what operations can we perform?
Row ID | Column Family | Column Qualifier | Column Visibility | Timestamp | Value
bob | attribute | height | public | 1005 | 5’11”
bob | attribute | height | public | 1004 | 5’5”
bob | attribute | height | public | 1003 | 5’
bob | attribute | height | public | 1002 | 4’10”
bob | attribute | height | public | 1001 | 4’9”
bob | attribute | height | public | 1000 | 4’3”

Versioning Iterator
Row ID | Column Family | Column Qualifier | Column Visibility | Timestamp | Value
bob | attribute | height | public | 1005 | 5’11”
bob | attribute | height | public | 1004 | 5’5”
bob | attribute | height | public | 1003 | 5’
bob | attribute | height | public | 1002 | 4’10”
bob | attribute | height | public | 1001 | 4’9”
bob | attribute | height | public | 1000 | 4’3”
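A versioning iterator's effect on this data can be sketched as "keep the newest N entries per key". The function below is a plain Python illustration, not Accumulo's iterator API; `keep_latest` and `max_versions` are hypothetical names.

```python
from itertools import groupby

# (row ID, column family, column qualifier, visibility, timestamp, value)
entries = [
    ("bob", "attribute", "height", "public", 1005, "5’11”"),
    ("bob", "attribute", "height", "public", 1004, "5’5”"),
    ("bob", "attribute", "height", "public", 1003, "5’"),
    ("bob", "attribute", "height", "public", 1002, "4’10”"),
    ("bob", "attribute", "height", "public", 1001, "4’9”"),
    ("bob", "attribute", "height", "public", 1000, "4’3”"),
]

def keep_latest(rows, max_versions):
    # Sort so that, within each key, the newest timestamp comes first.
    ordered = sorted(rows, key=lambda e: (e[0], e[1], e[2], e[3], -e[4]))
    kept = []
    for _, versions in groupby(ordered, key=lambda e: e[:4]):
        kept.extend(list(versions)[:max_versions])
    return kept

latest_two = keep_latest(entries, 2)
```

With `max_versions=2`, only the entries at timestamps 1005 and 1004 survive.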

Age-Off Iterator
Current Time: 1102
Row ID | Column Family | Column Qualifier | Column Visibility | Timestamp | Value
bob | attribute | height | public | 1005 | 5’11”
bob | attribute | height | public | 1004 | 5’5”
bob | attribute | height | public | 1003 | 5’
bob | attribute | height | public | 1002 | 4’10”
bob | attribute | height | public | 1001 | 4’9”
bob | attribute | height | public | 1000 | 4’3”
Entries <= 100s old: timestamps 1002 to 1005
Entries > 100s old: timestamps 1000 and 1001

Age-Off Iterator
Current Time: 1103
Row ID | Column Family | Column Qualifier | Column Visibility | Timestamp | Value
bob | attribute | height | public | 1005 | 5’11”
bob | attribute | height | public | 1004 | 5’5”
bob | attribute | height | public | 1003 | 5’
bob | attribute | height | public | 1002 | 4’10”
bob | attribute | height | public | 1001 | 4’9”
bob | attribute | height | public | 1000 | 4’3”
Entries <= 100s old: timestamps 1003 to 1005
Entries > 100s old: timestamps 1000 to 1002

Age-Off Iterator
Current Time: 1104
Row ID | Column Family | Column Qualifier | Column Visibility | Timestamp | Value
bob | attribute | height | public | 1005 | 5’11”
bob | attribute | height | public | 1004 | 5’5”
bob | attribute | height | public | 1003 | 5’
bob | attribute | height | public | 1002 | 4’10”
bob | attribute | height | public | 1001 | 4’9”
bob | attribute | height | public | 1000 | 4’3”
Entries <= 100s old: timestamps 1004 and 1005
Entries > 100s old: timestamps 1000 to 1003
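The age-off rule in these slides reduces to one comparison per entry: keep it if the current time minus its timestamp is at most the threshold. A minimal Python sketch, with `age_off` as a hypothetical function name and the 100s threshold taken from the slide legend:

```python
# (row ID, column family, column qualifier, visibility, timestamp, value)
entries = [
    ("bob", "attribute", "height", "public", 1005, "5’11”"),
    ("bob", "attribute", "height", "public", 1004, "5’5”"),
    ("bob", "attribute", "height", "public", 1003, "5’"),
    ("bob", "attribute", "height", "public", 1002, "4’10”"),
    ("bob", "attribute", "height", "public", 1001, "4’9”"),
    ("bob", "attribute", "height", "public", 1000, "4’3”"),
]

def age_off(rows, current_time, max_age=100):
    # Keep only entries that are at most max_age old at current_time.
    return [e for e in rows if current_time - e[4] <= max_age]

at_1102 = age_off(entries, 1102)
at_1104 = age_off(entries, 1104)
```

As the current time advances from 1102 to 1104, the surviving set shrinks from four entries to two.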

Combiner Iterators
Apply a function to all available versions of a particular key
Row ID | Column Family | Column Qualifier | Column Visibility | Timestamp | Value
bob | attribute | height | public | 1005 | 5’11”
bob | attribute | height | public | 1004 | 5’5”
bob | attribute | height | public | 1003 | 5’
bob | attribute | height | public | 1002 | 4’10”
bob | attribute | height | public | 1001 | 4’9”
bob | attribute | height | public | 1000 | 4’3”
MIN: 4’3”
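The MIN example can be reproduced with a small Python sketch that groups values by Row ID, Column Family, and Column Qualifier and applies a function to each group. `combine` and `to_inches` are hypothetical helpers; the height parsing exists only so the sample strings compare numerically.

```python
import re

# (row ID, column family, column qualifier, visibility, timestamp, value)
entries = [
    ("bob", "attribute", "height", "public", 1005, "5’11”"),
    ("bob", "attribute", "height", "public", 1004, "5’5”"),
    ("bob", "attribute", "height", "public", 1003, "5’"),
    ("bob", "attribute", "height", "public", 1002, "4’10”"),
    ("bob", "attribute", "height", "public", 1001, "4’9”"),
    ("bob", "attribute", "height", "public", 1000, "4’3”"),
]

def to_inches(height):
    # Parse strings such as 5’11” or 5’ into a total number of inches.
    feet, inches = re.fullmatch(r"(\d+)’(?:(\d+)”)?", height).groups()
    return int(feet) * 12 + int(inches or 0)

def combine(rows, func):
    # Collect all versions per (row ID, family, qualifier), then aggregate.
    groups = {}
    for row, family, qualifier, _visibility, _timestamp, value in rows:
        groups.setdefault((row, family, qualifier), []).append(value)
    return {key: func(values) for key, values in groups.items()}

result = combine(entries, lambda values: min(values, key=to_inches))
```

Swapping the lambda for a max or a sum gives other aggregates over the same set of versions.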
