Data storing and data access

Adding a row with the Java API

import org.apache.hadoop.hbase.*

1. Configuration creation:
   Configuration config = HBaseConfiguration.create();
2. Establishing a connection:
   Connection connection = ConnectionFactory.createConnection(config);
3. Table opening:
   Table table = connection.getTable(TableName.valueOf("users"));
4. Create a Put object:
   Put p = new Put(key);
5. Set values for columns:
   p.addColumn(family, col_name, value); ...
6. Push the data to the table:
   table.put(p);
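Put together, a minimal runnable sketch of these six steps (the "users" table and "main" family follow the hands-on below; the key and values are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AddRow {
        public static void main(String[] args) throws Exception {
            Configuration config = HBaseConfiguration.create();
            // try-with-resources closes the table and connection for us
            try (Connection connection = ConnectionFactory.createConnection(config);
                 Table table = connection.getTable(TableName.valueOf("users"))) {
                Put p = new Put(Bytes.toBytes("id_0001"));       // row key
                p.addColumn(Bytes.toBytes("main"),               // column family
                            Bytes.toBytes("first_name"),         // column qualifier
                            Bytes.toBytes("Ada"));               // cell value
                p.addColumn(Bytes.toBytes("main"),
                            Bytes.toBytes("last_name"),
                            Bytes.toBytes("Lovelace"));
                table.put(p);                                    // push the row
            }
        }
    }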

Hands-on (1)

Let's store all people in the galaxy in a single table. Person description:
– id
– first_name
– last_name
– date of birth
– profession

Hands-on (1)

Get the scripts:
wget cern.ch/zbaranow/hbase.zip
unzip hbase.zip
cd hbase_tut

Preview: UsersLoader.java

Compile and run:
javac -cp `hbase classpath` UsersLoader.java
java -cp `hbase classpath` UsersLoader ../data/users.csv 2> /dev/null

Bulk loading

When we need to load a large data set in an optimal way:
1. Load the data into HDFS.
2. Generate HFiles from the data using MapReduce: write your own job or use importtsv (it has some limitations).
3. Load the generated files into HBase.

Bulk load – hands-on (2)

1. Load the data into HDFS.
2. Run ImportTsv to generate the HFiles.
3. Change permissions on the created directory (HBase has to be able to modify it).
4. Run LoadIncrementalHFiles.

cd bulk; cat importtsv.txt
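The steps map to commands along these lines (paths, separator and column mapping are assumptions; see importtsv.txt for the exact ones used in the exercise):

hdfs dfs -put users.csv /tmp/bulk_input
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns=HBASE_ROW_KEY,main:first_name,main:last_name \
  -Dimporttsv.bulk.output=/tmp/bulk_hfiles users /tmp/bulk_input
hdfs dfs -chmod -R 777 /tmp/bulk_hfiles
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/bulk_hfiles users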

Hands-on (3): counter

Generate a new id when inserting a person:
– incremental
– use the HBase counter feature (it is atomic)

1. Create a table for the counter (single row, single column).
2. Modify the code to use the counter (increment and read the value).

Do it yourself: vi UsersLoaderAutoKey.java
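The increment is a single atomic call that returns the new value, so it can be used directly as the next id. A sketch, reusing the connection from earlier (table, family and qualifier names are assumptions):

    // assumes a 'counters' table with family 'c' already exists
    Table counters = connection.getTable(TableName.valueOf("counters"));
    long newId = counters.incrementColumnValue(
        Bytes.toBytes("users_id"),   // the single counter row
        Bytes.toBytes("c"),          // column family
        Bytes.toBytes("value"),      // column qualifier
        1L);                         // amount to add, atomically
    // use newId as the key of the person being inserted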

Getting data with the Java API

1. Open a connection and instantiate a table object.
2. Create a Get object:
   Get g = new Get(key);
3. (optional) Specify a certain family or column only:
   g.addColumn(family_name, col_name); // or: g.addFamily(family_name);
4. Get the data from the table:
   Result result = table.get(g);
5. Get values from the result object:
   byte[] value = result.getValue(family, col_name); // ...
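Spelled out, reusing the table handle opened as in the put example (key and column names are illustrative):

    Get g = new Get(Bytes.toBytes("id_0001"));
    g.addColumn(Bytes.toBytes("main"), Bytes.toBytes("first_name")); // optional projection
    Result result = table.get(g);
    byte[] value = result.getValue(Bytes.toBytes("main"), Bytes.toBytes("first_name"));
    System.out.println(Bytes.toString(value));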

Scanning the data with the Java API

1. Open a connection and instantiate a table object.
2. Create a Scan object:
   Scan s = new Scan(start_row, stop_row);
3. (optional) Set the columns to be retrieved:
   s.addColumn(family, col_name);
4. Get a result scanner object:
   ResultScanner scanner = table.getScanner(s);
5. Go through the results:
   for (Result row : scanner) { /* do something with the row */ }
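A sketch using the 1.x-style constructor, again reusing the table handle from before (the row-key range is illustrative; the stop row is exclusive):

    Scan s = new Scan(Bytes.toBytes("id_0001"), Bytes.toBytes("id_0100"));
    s.addColumn(Bytes.toBytes("main"), Bytes.toBytes("last_name"));
    try (ResultScanner scanner = table.getScanner(s)) {   // scanner must be closed
        for (Result row : scanner) {
            System.out.println(Bytes.toString(row.getRow()) + " -> " +
                Bytes.toString(row.getValue(Bytes.toBytes("main"),
                                            Bytes.toBytes("last_name"))));
        }
    }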

Scanning the data with the Java API

Filtering the data
– does not prevent the data set from being read on the server side; it only reduces network utilization!
– many filters are available (for column names, values, etc.)
– ...and they can be combined:
   Filter valFilter = new ValueFilter(GREATER_OR_EQUAL, 1500);
   scan.setFilter(valFilter);

Optimizations
– block caching: scan.setCacheBlocks(true)
– limit the maximum number of values returned per call: scan.setBatch(int batch)
– limit the max result size: scan.setMaxResultSize(long maxResultSize)
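The filter above, spelled out against the 1.x client API (note that ValueFilter compares raw cell bytes, so numbers must be stored in a fixed-width, order-preserving encoding such as Bytes.toBytes(long)):

    import org.apache.hadoop.hbase.filter.BinaryComparator;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.Filter;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.ValueFilter;

    Scan scan = new Scan();
    Filter valFilter = new ValueFilter(CompareOp.GREATER_OR_EQUAL,
                                       new BinaryComparator(Bytes.toBytes(1500L)));
    scan.setFilter(valFilter);
    // several filters can be combined into one:
    // scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL,
    //                               valFilter, otherFilter));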

Data access vs. data pruning

Hands-on (4): distributed storage

Let's imagine we need to provide a backend storage system for a large-scale application, e.g. a mail service or a cloud drive. We want the storage to be:
– distributed
– content-addressed

In the following hands-on we'll see how HBase can do this.

Distributed storage: insert client

The application will be able to upload a file from a local file system and save a reference to it in the 'users' table. A file will be referenced by its SHA-1 fingerprint.

General steps:
– read the file and calculate its fingerprint
– check for file existence
– save it in the 'files' table if it does not exist yet
– add a reference in the 'users' table, in the 'media' column family
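A condensed sketch of these steps; the variable values and the 'files' table layout (family 'data', qualifier 'content') are assumptions, reusing the connection from earlier:

    String localPath = "photo.jpg", userId = "id_0001", mediaName = "photo.jpg";
    byte[] data = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(localPath));
    byte[] fingerprint = java.security.MessageDigest.getInstance("SHA-1").digest(data);

    Table files = connection.getTable(TableName.valueOf("files"));
    Table users = connection.getTable(TableName.valueOf("users"));
    if (!files.exists(new Get(fingerprint))) {          // check for file existence
        Put pf = new Put(fingerprint);                  // key = content fingerprint
        pf.addColumn(Bytes.toBytes("data"), Bytes.toBytes("content"), data);
        files.put(pf);                                  // save if not there yet
    }
    Put ref = new Put(Bytes.toBytes(userId));           // reference in 'users'
    ref.addColumn(Bytes.toBytes("media"), Bytes.toBytes(mediaName), fingerprint);
    users.put(ref);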

Distributed storage: download client

The application will be able to download a file given a user ID and a file (media) name.

General steps:
– retrieve the fingerprint from the 'users' table
– get the file data from the 'files' table
– save the data to the local file system
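And the reverse direction, under the same assumptions as the insert client:

    Result userRow = users.get(new Get(Bytes.toBytes(userId)));
    byte[] fp = userRow.getValue(Bytes.toBytes("media"), Bytes.toBytes(mediaName));
    Result fileRow = files.get(new Get(fp));
    byte[] content = fileRow.getValue(Bytes.toBytes("data"), Bytes.toBytes("content"));
    java.nio.file.Files.write(java.nio.file.Paths.get(localPath), content);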

Distributed storage: exercise location

Download the source files and fill in the TODOs – use the docs and the previous examples for support.

Compile with:
javac -cp `hbase classpath` InsertFile.java
javac -cp `hbase classpath` GetMedia.java

Schema design considerations

Tables

Two options:
– wide: large number of columns
– tall: large number of rows

Flat-wide (a single region):

Key  F1:COL1  F1:COL2  F2:COL3  F2:COL4
r1   r1v1     r1v2     r1v3     r1v4
r2   r2v1     r2v2     r2v3     r2v4
r3   r3v1     r3v2     r3v3     r3v4

Tall-narrow (rows split across Regions 1-4):

Key      F1:V
r1_col1  r1v1
r1_col2  r1v2
r1_col3  r1v3
r1_col4  r1v4
r2_col1  r2v1
r2_col2  r2v2
r2_col3  r2v3
r2_col4  r2v4
r3_col1  r3v1

Flat-wide table (see the flat-wide layout above: one region holding families F1 and F2)

– Consistent operations on multiple cells: operations on a single row are atomic.
– Poor data distribution across the cluster: the column families of a given region are stored on the same server, so writes can be slow.
– Range scanning only on the main id.

Tall-narrow table (see the tall-narrow layout above: rows spread over Regions 1-4)

– Recommended for most cases.
– Fast data access by key (row_id): horizontal partitioning, local index.
– Automatic sharding of all the data.

Key values

The key is the most important aspect of the design.

Fast data access (range scans):
– the key is very often compound (concatenated)
– keep in mind the right order of the key parts: "username+timestamp" vs. "timestamp+username"
– for fast retrieval of recent data it is better to insert new rows into the first region of the table (example: key = an inverted timestamp, so the newest rows sort first)

Fast data storing:
– distribute rows across regions, by salting or hashing the key (a sketch follows below)
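A sketch of a salted, compound key combining both goals (the bucket count and key format are arbitrary choices):

    int BUCKETS = 16;                                        // arbitrary bucket count
    String username = "jsmith";
    long ts = System.currentTimeMillis();
    int salt = Math.floorMod(username.hashCode(), BUCKETS);  // spreads writes over regions
    long invTs = Long.MAX_VALUE - ts;                        // newest rows sort first
    byte[] key = Bytes.toBytes(String.format("%02d-%s-%019d", salt, username, invTs));

Within one salt bucket, all rows of a user stay adjacent and are ordered newest-first, while inserts from many users spread across regions.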

Indexing of non-key columns

Two approaches:
– an additional table points to the pk of the main one
– an additional table holds a full copy of the data

Updates of the indexes can be done:
– in real time, by the client code (see the sketch below)
– in real time, by HBase coprocessors
– periodically, by a MapReduce job

A couple of custom solutions are already available.
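For the client-code approach, maintaining the index is simply a second put; a sketch with an index on last_name (the index-table name and layout are assumptions):

    String userId = "id_0001", lastName = "Lovelace";
    Table indexTable = connection.getTable(TableName.valueOf("users_by_last_name"));

    // write to the main table
    Put p = new Put(Bytes.toBytes(userId));
    p.addColumn(Bytes.toBytes("main"), Bytes.toBytes("last_name"), Bytes.toBytes(lastName));
    users.put(p);

    // index table: key = indexed value + pk, so a prefix scan on last_name finds the pk
    Put idx = new Put(Bytes.toBytes(lastName + ":" + userId));
    idx.addColumn(Bytes.toBytes("i"), Bytes.toBytes("pk"), Bytes.toBytes(userId));
    indexTable.put(idx);
    // note: the two puts are not atomic; a failure in between leaves them out of sync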

Other aspects

– Isolation of access patterns: with separated column families.
– Column qualifiers can be used to store data themselves: ['person1', 'likes:person3'] = "Person3 Name".
– Data denormalization can avoid a lot of reading, but introduces complex writing.
– It is easier to start with HBase without an RDBMS background.

SQL on HBase

Running SQL on HBase

From Hive or Impala:
– an HTable is mapped to an external table; the type of the key should be a 'string'
– some DML statements are supported:
   – insert (but not overwrite)
   – updates can be emulated by duplicating a row with an insert statement

Use cases for SQL on HBase

Data warehouses:
– facts table: big data scanning -> Impala + Parquet
– dimension tables: random lookups -> HBase

Read-write storage:
– metadata
– counters

Some extreme cases:
– very wide but sparse tables

How to?

Create an external table with Hive:
– provide column names and types (the key column should always be a string)
– STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
– WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,main:first_name,main:last_name....")
– TBLPROPERTIES ("hbase.table.name" = "users");

Try it out (hands-on)! cat sql/user_table.txt
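Assembled into a complete statement for the 'users' table from the earlier hands-on (the Hive table name and the exact column list are assumptions):

    CREATE EXTERNAL TABLE users_hbase (
      key string,
      first_name string,
      last_name string,
      date_of_birth string,
      profession string
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      "hbase.columns.mapping" =
        ":key,main:first_name,main:last_name,main:date_of_birth,main:profession"
    )
    TBLPROPERTIES ("hbase.table.name" = "users");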

Summary

What was not covered

– writing coprocessors (stored procedures)
– HBase table permissions
– filtering of data scanner results
– using MapReduce for storing and retrieving data
– bulk data loading with custom MapReduce jobs
– using different APIs (e.g., Thrift)

When to use

In general:
– for data too big to store on some central storage
– for random data access: quick lookups of individual records
– when the data set can be represented as key-value

As a database of objects (photos, texts, videos), e.g. for storing and accessing heterogeneous metrics.

When the data set:
– has to be updated
– is sparse
– has records with a variable number of attributes
– has custom data types (serialization)

When NOT to use

– For massive data processing: for data analytics use MapReduce, Spark, Hive, Impala...
– For data sets with high-frequency insertion rates: stability concerns (from our own experience).
– When the schema of the data is not simple.
– If "I do not know which solution to use".