The Hadoop Stack, Part 2: Introduction to HBase. CSE 40822 – Cloud Computing – Fall 2014. Prof. Douglas Thain, University of Notre Dame

Three Case Studies
Workflow: Pig Latin – A dataflow language and execution system that provides an SQL-like way of composing workflows of multiple Map-Reduce jobs.
Storage: HBase – A NoSQL storage system that brings a higher degree of structure to the flat-file nature of HDFS.
Execution: Spark – An in-memory data analysis system that can use Hadoop as a persistence layer, enabling algorithms that are not easily expressed in Map-Reduce.

References
Fay Chang et al., "Bigtable: A Distributed Storage System for Structured Data", OSDI 2006.
Amandeep Khurana, "Introduction to HBase Schema Design", ;login: magazine, vol. 37, no. 5.
Apache HBase Documentation: https://hbase.apache.org/

From HDFS to HBase. HDFS provides us with a filesystem consisting of arbitrarily large files that can only be written once and should be read sequentially, end to end. This works fine for sequential OLAP queries like Pig. OLTP workloads want to read and write individual cells in a large table. (e.g. update inventory and price as orders come in.) HBase implements OLTP interactions on top of HDFS by using additional storage and memory to organize the tables, and writing them back to HDFS as needed. HBase is an open source implementation of the BigTable design published by Google.

HBase Data Model. A database consists of multiple tables. Each table consists of multiple rows, sorted by row key. Each row contains a row key and one or more column families. Each column family is defined when the table is created. Column families can contain multiple columns, addressed as family:column. A cell is uniquely identified by (table, row, family:column). A cell contains an uninterpreted array of bytes and a timestamp.
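As a concrete illustration, here is a minimal sketch of creating such a table through the classic (pre-1.0) HBase Java client API. The table and family names match the People example on the following slides; the cluster configuration is assumed to be available from hbase-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreatePeopleTable {
    public static void main(String[] args) throws Exception {
        // Reads cluster settings (e.g. the ZooKeeper quorum) from hbase-site.xml.
        Configuration config = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(config);

        // Column families must be declared up front, when the table is created.
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("People"));
        desc.addFamily(new HColumnDescriptor("Name"));
        desc.addFamily(new HColumnDescriptor("Home"));
        desc.addFamily(new HColumnDescriptor("Office"));
        admin.createTable(desc);

        // Individual columns (Name:First, Home:Phone, ...) are NOT declared here:
        // a column comes into existence the first time a cell is written to it.
        admin.close();
    }
}
```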

Data in Tabular Form

Key | Name: First | Name: Last | Home: Phone | Home: Email   | Office: Phone | Office: Email
101 | Florian     | Krepsbach  | …           | …@wobegon.org | …             | …
102 | Marilyn     | Tollerud   | …           | …             | …             | …
103 | Pastor      | Inqvist    | …           | …             | …             | …@wels.org

Data in Tabular Form

Key | Name: First | Name: Middle | Name: Last | Home: Phone | Home: Email   | Office: Phone | Office: Email | Social: FacebookID
101 | Florian     | Garfield     | Krepsbach  | …           | …@wobegon.org | …             | …             | …
102 | Marilyn     |              | Tollerud   | …           | …             | …             | …             | …
103 | Pastor      |              | Inqvist    | …           | …             | …             | …@wels.org    | …

New columns can be added at runtime. Column families cannot be added at runtime.

Don’t be fooled by the picture: HBase is really a sparse table.

Nested Data Representation

Table People ( Name, Home, Office ) {
   101: {
      Timestamp: T403;
      Name: { First="Florian", Middle="Garfield", Last="Krepsbach" },
      Home: { Phone="…", Email="…@wobegon.org" },
      Office: { Phone="…", Email="…" }
   },
   102: {
      Timestamp: T593;
      Name: { First="Marilyn", Last="Tollerud" },
      Home: { Phone="…" },
      Office: { Phone="…" }
   },
   …
}

Nested Data Representation

GET People:101
   101: {
      Timestamp: T403;
      Name: { First="Florian", Middle="Garfield", Last="Krepsbach" },
      Home: { Phone="…", Email="…@wobegon.org" },
      Office: { Phone="…", Email="…" }
   }

GET People:101:Name
   People:101:Name: { First="Florian", Middle="Garfield", Last="Krepsbach" }

GET People:101:Name:First
   People:101:Name:First = "Florian"
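A minimal sketch of the same three granularities through the classic (pre-1.0) HBase Java client API, assuming the People table above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExamples {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "People");

        // GET People:101 -- fetch the whole row.
        Result row = table.get(new Get(Bytes.toBytes("101")));
        System.out.println(row.size() + " cells in the row");

        // GET People:101:Name -- fetch just one column family.
        Get byFamily = new Get(Bytes.toBytes("101"));
        byFamily.addFamily(Bytes.toBytes("Name"));
        Result nameOnly = table.get(byFamily);
        System.out.println(nameOnly.size() + " cells in family Name");

        // GET People:101:Name:First -- fetch a single cell.
        Get byColumn = new Get(Bytes.toBytes("101"));
        byColumn.addColumn(Bytes.toBytes("Name"), Bytes.toBytes("First"));
        byte[] first = table.get(byColumn).getValue(
            Bytes.toBytes("Name"), Bytes.toBytes("First"));
        System.out.println(Bytes.toString(first));  // "Florian" in this example

        table.close();
    }
}
```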

Fundamental Operations
CREATE table, families
PUT table, rowid, family:column, value
PUT table, rowid, whole-row
GET table, rowid
SCAN table (WITH filters)
DROP table
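A minimal sketch of PUT, SCAN, and DROP in the same classic Java client API. The phone number is a made-up placeholder (the real values did not survive transcription), and PrefixFilter stands in for "WITH filters":

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PutScanDrop {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "People");

        // PUT People, 101, Home:Phone, value  (placeholder value).
        Put put = new Put(Bytes.toBytes("101"));
        put.add(Bytes.toBytes("Home"), Bytes.toBytes("Phone"), Bytes.toBytes("555-0101"));
        table.put(put);

        // SCAN People WITH a filter: only rows whose key starts with "10".
        Scan scan = new Scan();
        scan.setFilter(new PrefixFilter(Bytes.toBytes("10")));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner)
            System.out.println(Bytes.toString(r.getRow()));
        scanner.close();
        table.close();

        // DROP People: a table must be disabled before it can be deleted.
        HBaseAdmin admin = new HBaseAdmin(config);
        admin.disableTable("People");
        admin.deleteTable("People");
        admin.close();
    }
}
```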

Consistency Model
Atomicity: Entire rows are updated atomically or not at all.
Consistency: A GET is guaranteed to return a complete row that existed at some point in the table’s history. (Check the timestamp to be sure!) A SCAN must include all data written prior to the scan, and may include updates since it started.
Isolation: Not guaranteed outside a single row.
Durability: All successful writes have been made durable on disk.
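Since every cell carries a timestamp, a client can inspect when each returned cell was last written, which is how "check the timestamp" looks in practice. A minimal sketch using the classic Java client API:

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckTimestamps {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "People");
        Result row = table.get(new Get(Bytes.toBytes("101")));

        // Each cell records the time at which it was written, so the client
        // can see which version of the row this GET actually returned.
        for (Cell cell : row.rawCells()) {
            System.out.printf("%s:%s written at %d%n",
                Bytes.toString(CellUtil.cloneFamily(cell)),
                Bytes.toString(CellUtil.cloneQualifier(cell)),
                cell.getTimestamp());
        }
        table.close();
    }
}
```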

The Implementation
I’ll give you the idea in several steps:
Idea 1: Put an entire table in one file. (It doesn’t work.)
Idea 2: Log + one file. (Better, but doesn’t scale to large data.)
Idea 3: Log + one file per column family. (Getting better!)
Idea 4: Partition table into regions by key.
Keep in mind that “one file” means one gigantic file stored in HDFS. We don’t have to worry about the details of how the file is split into blocks.

Idea 1: Put the Table in a Single File
How do we do the following operations? CREATE, DELETE, SCAN, GET, PUT

File "People":
   101: { Timestamp: T403; Name: { First="Florian", Middle="Garfield", Last="Krepsbach" }, Home: { Phone="…", Email="…" }, Office: { Phone="…", Email="…" } },
   102: { Timestamp: T593; Name: { First="Marilyn", Last="Tollerud" }, Home: { Phone="…" }, Office: { Phone="…" } },
   …

Variable-Length Data is Fundamentally Hard!

SQL Table: People( ID: Integer, FirstName: CHAR[20], LastName: CHAR[20], Phone: CHAR[8] )

ID  | FirstName | LastName  | Phone
101 | Florian   | Krepsbach | …
102 | Marilyn   | Tollerud  | …
103 | Pastor    | Inqvist   | …

Each row is exactly 52 bytes long. To move to the next row, just fseek(file, +52). To get to row 401, fseek(file, 401*52). To run UPDATE People SET Phone="…" WHERE ID=403, just overwrite the data in place.

HBase Table: People( ID, Name, Home, Office ), stored as variable-length nested records:
   101: { Timestamp: T403; Name: { First="Florian", Middle="Garfield", Last="Krepsbach" }, Home: { Phone="…", Email="…" }, Office: { Phone="…", Email="…" } },
   102: { Timestamp: T593; Name: { First="Marilyn", Last="Tollerud" }, Home: { Phone="…" }, Office: { Phone="…" } },
   …

Now consider PUT People, 403, Home:Phone, "…". Where does the new value go? We cannot simply overwrite in place.
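The fixed-width arithmetic is easy to see in code. Here is a toy sketch (not SQL-engine or HBase code) of the 52-byte record layout (4-byte ID + 20 + 20 + 8), with java.io.RandomAccessFile standing in for fseek(); the file name and phone value are placeholders:

```java
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedWidthRecords {
    // ID (4) + FirstName (20) + LastName (20) + Phone (8) = 52 bytes per row.
    static final int RECORD = 4 + 20 + 20 + 8;

    public static void main(String[] args) throws Exception {
        RandomAccessFile file = new RandomAccessFile("people.dat", "rw");

        // To get to row n, just seek(n * 52): no index needed.
        long rowNumber = 401;
        file.seek(rowNumber * RECORD);

        // UPDATE ... SET Phone=... : overwrite the 8-byte field in place.
        file.seek(rowNumber * RECORD + 4 + 20 + 20);
        byte[] phone = Arrays.copyOf(
            "555-0199".getBytes(StandardCharsets.US_ASCII), 8);
        file.write(phone);

        // None of this works once fields are variable-length: writing a
        // longer value would require shifting every byte that follows it.
        file.close();
    }
}
```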

Idea 2: One Tablet + Transaction Log

Table for People on disk:
   101: { Timestamp: T403; … }, 102: { Timestamp: T593; … }, …

Transaction Log for Table People:
   PUT 101:Office:Phone = "…"
   PUT 102:Home:Email = "…"
   …

Changes are applied only to the log file, then the resulting record is cached in memory. Reads must consult both memory and disk.

Memory Cache for Table People, serving client operations: GET People:101, GET People:103, PUT People:101:Office:Phone = "…"
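A toy sketch of this design (not actual HBase code): PUT appends to the log for durability and updates the memory cache, while GET checks memory first and falls back to the base table loaded from disk.

```java
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

// Toy model of Idea 2: one base table file, plus a log and a memory cache.
public class LoggedTable {
    private final Map<String, String> memoryCache = new HashMap<>(); // cell -> value
    private final Map<String, String> baseTable;                     // loaded from disk
    private final PrintWriter log;

    public LoggedTable(Map<String, String> baseTable) throws Exception {
        this.baseTable = baseTable;
        // Append mode with auto-flush: each PUT hits the disk before returning.
        this.log = new PrintWriter(new FileWriter("people.log", true), true);
    }

    // PUT: append to the log for durability, then cache the value in memory.
    // The base table file on disk is NOT touched.
    public void put(String cell, String value) {
        log.println("PUT " + cell + " = " + value);
        memoryCache.put(cell, value);
    }

    // GET: recent writes live in memory; everything else is in the base table.
    public String get(String cell) {
        String v = memoryCache.get(cell);
        return (v != null) ? v : baseTable.get(cell);
    }
}
```

A call like put("101:Office:Phone", "…") is durable as soon as the log append completes, without rewriting the (gigantic) table file; the price is that every read must consult two places.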

Idea 2 Requires Periodic Log Compression

Table for People on disk (old):
   101: { Timestamp: T403; … }, 102: { Timestamp: T593; … }, …

Transaction Log for Table People:
   PUT 101:Office:Phone = "…"
   PUT 102:Home:Email = "…"
   …

Write out a new copy of the table, with all of the changes applied. Then delete the log and memory cache, and start over.

Table for People on disk (new):
   101: { Timestamp: T403; …, Office: { Phone="…" } }, 102: { Timestamp: T593; …, Home: { Email="…" } }, …
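Continuing the toy model, compaction writes a new copy of the table with the cached changes applied; afterwards the old file, the log, and the memory cache can all be discarded:

```java
import java.util.HashMap;
import java.util.Map;

public class Compaction {
    // Produce a new table in which every logged/cached change is applied,
    // so the log and memory cache can be deleted and started over.
    public static Map<String, String> compact(Map<String, String> oldTable,
                                              Map<String, String> memoryCache) {
        Map<String, String> newTable = new HashMap<>(oldTable); // old values...
        newTable.putAll(memoryCache);                           // ...overridden by new ones
        return newTable;  // in a real system this is written out as a new file
    }
}
```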

Idea 3: Partition by Column Family

Tablets for People on disk (old): Data for Column Family Name; Data for Column Family Home; Data for Column Family Office.

Transaction Log for Table People:
   PUT 101:Office:Phone = "…"
   PUT 102:Home:Email = "…"
   …

Write out a new copy of each changed tablet, with all of the changes applied. Delete the log and memory cache, and start over.

Tablets for People on disk (new): Data for Column Family Name (unchanged); Data for Column Family Home (changed); Data for Column Family Office (changed).

Idea 4: Split Into Regions

The table is partitioned by row key into regions: Region 1 (keys …), Region 2 (keys …), Region 3 (keys …), Region 4 (keys …), each served by a Region Server.

The Region Master tells the HBase client where the servers for table People are; the client then accesses data in the tables directly at the Region Servers.

Detail of one Region Server: a Transaction Log, a Memory Cache, and one tablet per column family (Name, Home, Office).
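Finding the right region is essentially a sorted-map lookup on the row key. A toy sketch of what the master's directory does (the key ranges and server names are invented, since the original diagram's ranges did not survive):

```java
import java.util.TreeMap;

public class RegionDirectory {
    // Maps each region's start key to the server that hosts it.
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    public void addRegion(String startKey, String server) {
        regionsByStartKey.put(startKey, server);
    }

    // The region holding a row is the one with the greatest start key <= row key.
    public String serverFor(String rowKey) {
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        RegionDirectory master = new RegionDirectory();
        master.addRegion("000", "regionserver1");  // region 1: keys 000 and up
        master.addRegion("200", "regionserver2");  // region 2: keys 200 and up
        master.addRegion("400", "regionserver3");  // region 3: keys 400 and up
        System.out.println(master.serverFor("101"));  // prints regionserver1
    }
}
```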

Table People, drawn as a grid: the rows are split by key range into Region 1, Region 2, and Region 3, and the columns are split into the column families Name, Home, and Office.

The consistency model is tightly coupled to the scalability of the implementation!

Consistency Model
Atomicity: Entire rows are updated atomically or not at all.
Consistency: A GET is guaranteed to return a complete row that existed at some point in the table’s history. (Check the timestamp to be sure!) A SCAN must include all data written prior to the scan, and may include updates since it started.
Isolation: Not guaranteed outside a single row.
Durability: All successful writes have been made durable on disk.

Questions for Discussion
What are the tradeoffs between having a very tall vertical table versus having a very wide horizontal table?
Suppose a table starts off small, and then a large number of rows are added to the table. What happens, and what should the system do?
Using the People example, come up with some examples of simple queries that are very fast, and some that are very slow. Is there anything that can be done about the slow queries?
Would it be possible to have SCAN return a consistent snapshot of the system? How would that work?