A NoSQL Database - Hive Dania Abed Rabbou.

Hive
A data-warehousing framework built on top of Hadoop by Facebook.
Grew from a need to analyze the huge volumes of daily data traffic (~10 TB) generated by Facebook.
Facebook owns the second-largest Hadoop cluster in the world (~2 PB).

Hadoop & Hive Usage at Facebook
To produce daily and hourly summaries, such as reports on the growth of users, page views, average time spent on different pages, etc.
To perform backend processing for site features such as people you may like and applications you may like.
To quantify the success of advertisement campaigns and products.
To maintain the integrity of the website and detect suspicious activity.

Hive vs. RDBMSs
1. Schema on read: Traditionally, a table's schema is enforced at data load time (schema on write). Hive enforces it at query time, so a load operation is simply a quick file move.
2. Updates: Table updates are only possible by transforming all the data into a new table (i.e., no appends).
3. Transactions: Hive does not support concurrent access to tables, so application-level concurrency and locking mechanisms are needed.
4. Indexes: Support is provided but relatively immature.
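
The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration, not how Hive is implemented: the sample rows, the column names, and the helpers `load`, `to_int`, and `query` are all hypothetical. The load step stores the text untouched, and the schema is applied only while reading; a field that cannot be cast comes back as None, much as Hive returns NULL for a value that does not fit the declared column type.

```python
import csv
import io

# Hypothetical raw file contents; in Hive this would be a file in HDFS.
RAW = "5551001,4060,500\n5551002,4070,not-a-number\n"

def load(raw_text):
    """'Load' step: keep the text untouched, with no parsing or validation."""
    return raw_text

def to_int(value):
    """Cast a field at read time; unparseable values become None."""
    try:
        return int(value)
    except ValueError:
        return None

def query(stored, schema):
    """'Query' step: the schema is applied only now, row by row."""
    rows = []
    for fields in csv.reader(io.StringIO(stored)):
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, fields)})
    return rows

# Hypothetical schema: (column name, cast function) pairs.
SCHEMA = [("phone_num", str), ("plan", str), ("amount", to_int)]

stored = load(RAW)           # succeeds even though one row is malformed
rows = query(stored, SCHEMA)  # the bad field only surfaces here, as None
```

The malformed second row does not make the load fail; the problem shows up at query time, which is the trade-off schema on read accepts in exchange for fast loads.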

HiveQL: Hive’s SQL Dialect
HiveQL adopts a SQL-like syntax.
HiveQL supports the following datatypes:
Primitive: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes), DOUBLE, BOOLEAN, STRING
Complex: ARRAY, MAP, STRUCT
E.g.:
    CREATE TABLE tbl (
        col1 ARRAY<INT>,
        col2 MAP<STRING, INT>,
        col3 STRUCT<a:STRING, b:INT, c:DOUBLE>
    );

Hue: Hadoop’s Web Interface
Hue is an open-source, user-friendly web interface for Hadoop components (including HDFS, Hive, Pig, etc.).
Browse to your Hue interface located at: <andrew_id>-hdp.qatar.cmu.local:8000
username: hue
password: SummerYet

Loading Data into HDFS
Any datasets needed for loading into tables must first be moved to HDFS.
Load some test datasets into HDFS:
Navigate to the File Browser.
Create a new directory, say DatasetsSource.
Move into DatasetsSource and upload three CSV files, namely customer_details, recharge_details, and customer_details_with_addresses.

Creating Databases
To create a new Hive database:
Browse to Beeswax (Hive’s UI).
Click on the Databases tab.
Create a new database, say Customers.

Creating Tables
Create two tables under the database Customers:
In Beeswax, click on the Query Editor tab.
Create the tables customer_details and recharge_details:
    CREATE TABLE IF NOT EXISTS customer_details
        (phone_num STRING, plan STRING, date STRING, status STRING, balance STRING, region STRING)
    COMMENT "Customer Details"
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
    STORED AS TEXTFILE;

    CREATE TABLE IF NOT EXISTS recharge_details
        (phone_num STRING, date STRING, channel STRING, plan STRING, amount STRING)
    COMMENT "Recharge Details"
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
    STORED AS TEXTFILE;

Loading Data from HDFS to Tables
Load data into the two tables:
In the Query Editor tab, load each dataset previously uploaded to HDFS into its respective table:
    LOAD DATA INPATH "/user/hue/DatasetsSource/customer_details.csv"
    OVERWRITE INTO TABLE customer_details;

    LOAD DATA INPATH "/user/hue/DatasetsSource/recharge_details.csv"
    OVERWRITE INTO TABLE recharge_details;
Browse back to /user/hue/DatasetsSource: the datasets loaded into tables have disappeared! Hive moves the datasets to a default warehouse folder.

Deleting Tables
To delete a table:
In the Tables tab, choose a table to delete and click Drop.
The table, including its metadata and data, is deleted! In other words, the loaded data no longer exists anywhere.

Creating External Tables
To control the creation and deletion of data, use external tables:
    CREATE EXTERNAL TABLE IF NOT EXISTS customer_details
        (phone_num STRING, plan STRING, date STRING, status STRING, balance STRING, region STRING)
    COMMENT "Customer Details"
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
    STORED AS TEXTFILE
    LOCATION "/user/hue/LoadedDatasets";
LOCATION gives the path where the loaded dataset will be stored. If the table is deleted, the data stays around.
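
The difference between dropping a regular (managed) table and dropping an external one can be mimicked on a local filesystem. The sketch below is a toy model only: `load_data` and `drop_table` are hypothetical helpers, and temporary local directories stand in for HDFS paths. It shows the two behaviors described above: loading moves the source file, and only a managed table takes its data with it when dropped.

```python
import os
import shutil
import tempfile

def load_data(src_file, table_dir):
    """Like LOAD DATA INPATH: the file is moved into the table's directory."""
    os.makedirs(table_dir, exist_ok=True)
    shutil.move(src_file, os.path.join(table_dir, os.path.basename(src_file)))

def drop_table(table_dir, external):
    """A managed table's data is deleted on drop; an external table's is kept."""
    if not external:
        shutil.rmtree(table_dir)

root = tempfile.mkdtemp()  # stand-in for an HDFS home directory
results = {}
for kind, external in (("managed", False), ("external", True)):
    src = os.path.join(root, kind + "_source.csv")
    with open(src, "w") as f:
        f.write("5551001,4060\n")  # hypothetical sample record
    table_dir = os.path.join(root, kind + "_table")
    load_data(src, table_dir)
    source_gone = not os.path.exists(src)
    drop_table(table_dir, external)
    # Record: (did the source file disappear?, does the data survive the drop?)
    results[kind] = (source_gone, os.path.exists(table_dir))
shutil.rmtree(root)
```

In both cases the source file is gone after the load; after the drop, only the external table's directory is still there.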

Displaying Data in Tables
Consider the schemas of our two tables:
    customer_details(phone_num, plan, date, status, balance, region)
    recharge_details(phone_num, date, channel, plan, amount)
Display the records in customer_details:
SQL:
    SELECT * FROM customer_details;
HiveQL:
    SELECT * FROM customer_details;

Updating Tables
Consider the schemas of our two tables:
    customer_details(phone_num, plan, date, status, balance, region)
    recharge_details(phone_num, date, channel, plan, amount)
Let’s update plan 4060 to a recharge amount of 500.
SQL:
    UPDATE recharge_details SET amount=500 WHERE plan=4060;
HiveQL (the entire table's contents are rewritten):
    INSERT OVERWRITE TABLE recharge_details
    SELECT phone_num, date, channel, plan,
           CASE WHEN plan=4060 THEN 500 ELSE amount END AS amount
    FROM recharge_details;
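
To check that rewriting every row through a CASE expression yields the same table as an in-place UPDATE, the two statements can be compared on a few rows. The sketch below uses Python's built-in sqlite3 module as a stand-in, not Hive; the sample rows and the `fresh_table` helper are hypothetical, and values are quoted because every column is declared as text, mirroring the all-STRING schema in the slides.

```python
import sqlite3

# Hypothetical sample rows for recharge_details.
ROWS = [("5551001", "2014-01-01", "web", "4060", "100"),
        ("5551002", "2014-01-02", "sms", "4070", "200")]

def fresh_table():
    """Build an in-memory copy of recharge_details with the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE recharge_details "
                 "(phone_num TEXT, date TEXT, channel TEXT, plan TEXT, amount TEXT)")
    conn.executemany("INSERT INTO recharge_details VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn

# Variant 1: in-place UPDATE, as a conventional RDBMS allows.
a = fresh_table()
a.execute("UPDATE recharge_details SET amount = '500' WHERE plan = '4060'")
updated = a.execute(
    "SELECT * FROM recharge_details ORDER BY phone_num").fetchall()

# Variant 2: the SELECT that feeds INSERT OVERWRITE in the HiveQL version.
# Every row is emitted again; only matching rows get the new amount.
b = fresh_table()
rewritten = b.execute(
    "SELECT phone_num, date, channel, plan, "
    "CASE WHEN plan = '4060' THEN '500' ELSE amount END AS amount "
    "FROM recharge_details ORDER BY phone_num").fetchall()
```

Both variants produce the same rows; the difference in Hive is cost, since the rewrite touches the whole table even when a single row changes.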

Joining Tables
Consider the schemas of our two tables:
    customer_details(phone_num, plan, date, status, balance, region)
    recharge_details(phone_num, date, channel, plan, amount)
Let’s display the recharge amount per customer.
SQL:
    SELECT c.phone_num, r.amount
    FROM customer_details c, recharge_details r
    WHERE c.phone_num = r.phone_num;
HiveQL:
    SELECT customer_details.phone_num, recharge_details.amount
    FROM customer_details JOIN recharge_details
    ON (customer_details.phone_num = recharge_details.phone_num);
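
The implicit (comma plus WHERE) join and the explicit JOIN ... ON form express the same inner join. A quick way to convince yourself is to run both on sample rows; the sketch below again uses Python's sqlite3 as a stand-in for a real engine, with hypothetical data and the tables trimmed to the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_details (phone_num TEXT, plan TEXT)")
conn.execute("CREATE TABLE recharge_details (phone_num TEXT, amount TEXT)")
# Hypothetical rows: customer 5551003 never recharged.
conn.executemany("INSERT INTO customer_details VALUES (?, ?)",
                 [("5551001", "4060"), ("5551002", "4070"), ("5551003", "4080")])
conn.executemany("INSERT INTO recharge_details VALUES (?, ?)",
                 [("5551001", "100"), ("5551002", "200"), ("5551001", "50")])

# Implicit join: cross product filtered by the WHERE clause.
implicit = conn.execute(
    "SELECT c.phone_num, r.amount FROM customer_details c, recharge_details r "
    "WHERE c.phone_num = r.phone_num ORDER BY c.phone_num, r.amount").fetchall()

# Explicit join: the form used in the HiveQL version.
explicit = conn.execute(
    "SELECT c.phone_num, r.amount FROM customer_details c "
    "JOIN recharge_details r ON (c.phone_num = r.phone_num) "
    "ORDER BY c.phone_num, r.amount").fetchall()
```

Both queries return one row per matching recharge, and the customer without recharges is absent from either result because this is an inner join.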

Tables with Complex Datatypes
Let's add a new field called "addresses" to the customer_details table. This field holds a list of addresses per customer.
    CREATE TABLE IF NOT EXISTS customer_details_2
        (phone_num STRING, plan STRING, date STRING, status STRING, balance STRING, region STRING,
         addresses ARRAY<STRING>)
    COMMENT "Customer Details"
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
    COLLECTION ITEMS TERMINATED BY ";"
    STORED AS TEXTFILE;

    LOAD DATA INPATH "/user/hue/DatasetsSource/customer_details_with_addresses.csv"
    OVERWRITE INTO TABLE customer_details_2;

    SELECT * FROM customer_details_2;
    SELECT addresses[0] FROM customer_details_2;
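
The two delimiters in the table definition work at different levels: "," separates the columns, and ";" separates the items inside the ARRAY column. A minimal Python sketch of that parsing, assuming a made-up sample line (the real contents of customer_details_with_addresses.csv are not shown in the slides) and a hypothetical `parse_row` helper:

```python
FIELD_DELIM = ","  # FIELDS TERMINATED BY ","
ITEM_DELIM = ";"   # COLLECTION ITEMS TERMINATED BY ";"
COLUMNS = ["phone_num", "plan", "date", "status", "balance", "region",
           "addresses"]

def parse_row(line):
    """Split a text line into columns, then split the array column into items."""
    fields = line.rstrip("\n").split(FIELD_DELIM)
    row = dict(zip(COLUMNS, fields))
    row["addresses"] = row["addresses"].split(ITEM_DELIM)
    return row

# Hypothetical sample record with two addresses.
line = "5551001,4060,2014-01-01,active,25,north,Doha;Muscat\n"
row = parse_row(line)

first_address = row["addresses"][0]         # like SELECT addresses[0]
has_muscat = "Muscat" in row["addresses"]   # like array_contains(addresses, ...)
```

Because the item delimiter differs from the field delimiter, a list of any length fits inside a single column of the text file.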

Built-In Functions
Hive provides many built-in functions.
To list them all:
    SHOW FUNCTIONS;
To understand the functionality of a function:
    DESCRIBE FUNCTION array_contains;
Let’s display those customer records whose addresses include "Qatar":
    SELECT * FROM customer_details_2 WHERE array_contains(addresses, "Qatar");

Hive’s Additional Features
Allows User-Defined Functions (UDFs). UDFs can be written in Java and integrated with Hive.
Supports a new construct (TRANSFORM .. USING ..) to invoke an external script or program. Hive invokes the specified program, feeds it data, and reads data back.
Useful for pre-processing datasets before loading them into tables, etc.
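
A script used with TRANSFORM is an ordinary program that reads rows from standard input and writes rows to standard output; by default Hive passes the columns as tab-separated text. The sketch below shows what such a script might look like in Python. The cleaning rule (trim whitespace, upper-case the region), the two-column layout, and the function name `transform` are assumptions made for illustration.

```python
import io

def transform(lines, out):
    """Read tab-separated (phone_num, region) rows and write cleaned rows."""
    for line in lines:
        phone_num, region = line.rstrip("\n").split("\t")
        out.write(phone_num.strip() + "\t" + region.strip().upper() + "\n")

# Local demonstration with in-memory streams. As a standalone script run by
# Hive, the call would instead be transform(sys.stdin, sys.stdout).
sample = io.StringIO(" 5551001\tnorth\n5551002\t south \n")
result = io.StringIO()
transform(sample, result)
output = result.getvalue()
```

Saved under a hypothetical name such as clean.py and registered with ADD FILE, a script like this could be invoked along the lines of SELECT TRANSFORM(phone_num, region) USING 'python clean.py' AS (phone_num, region) FROM customer_details; the exact invocation depends on the cluster setup.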