Hive - A Warehousing Solution Over a Map-Reduce Framework.

Slides:



Advertisements
Similar presentations
Yukon – What is New Rajesh Gala. Yukon – What is new.NET Framework Programming Data Types Exception Handling Batches Databases Database Engine Administration.
Advertisements

CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.
HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.
Introduction to Hive Liyin Tang
Hive: A data warehouse on Hadoop
©Silberschatz, Korth and Sudarshan1.1Database System Concepts Chapter 1: Introduction Purpose of Database Systems View of Data Data Models Data Definition.
HIVE Data Warehousing & Analytics on Hadoop Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team.
CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.
Session-01. Hibernate Framework ? Why we use Hibernate ?
A warehouse solution over map-reduce framework Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff.
Raghav Ayyamani. Copyright Ellis Horowitz, Why Another Data Warehousing System? Problem : Data, data and more data Several TBs of data everyday.
Hive – A Warehousing Solution Over a Map-Reduce Framework Presented by: Atul Bohara Feb 18, 2014.
SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL.
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
HADOOP ADMIN: Session -2
DLRL Cluster Matt Bollinger, Joseph Pontani, Adam Lech Client: Sunshin Lee CS4624 Capstone Project March 3, 2014 Virginia Tech, Blacksburg, VA.
Analytics Map Reduce Query Insight Hive Pig Hadoop SQL Map Reduce Business Intelligence Predictive Operational Interactive Visualization Exploratory.
Database System Concepts and Architecture Lecture # 3 22 June 2012 National University of Computer and Emerging Sciences.
Data Formats CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.
Chapter 2 CIS Sungchul Hong
Hive : A Petabyte Scale Data Warehouse Using Hadoop
Cloud Computing Other High-level parallel processing languages Keke Chen.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute.
NoSQL continued CMSC 461 Michael Wilson. MongoDB  MongoDB is another NoSQL solution  Provides a bit more structure than a solution like Accumulo  Data.
Introduction to Hadoop and HDFS
Distributed Systems Fall 2014 Zubair Amjad. Outline Motivation What is Sqoop? How Sqoop works? Sqoop Architecture Import Export Sqoop Connectors Sqoop.
Hive Facebook 2009.
MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.
An Introduction to HDInsight June 27 th,
A NoSQL Database - Hive Dania Abed Rabbou.
Hive – SQL on top of Hadoop
Management Information Systems, 4 th Edition 1 Chapter 8 Data and Knowledge Management.
Hive. What is Hive? Data warehousing layer on top of Hadoop – table abstractions SQL-like language (HiveQL) for “batch” data processing SQL is translated.
IBM Research ® © 2007 IBM Corporation A Brief Overview of Hadoop Eco-System.
Mr.Prasad Sawant, MIT Pune India Introduction to DBMS.
Nov 2006 Google released the paper on BigTable.
Impala. Impala: Goals General-purpose SQL query engine for Hadoop High performance – C++ implementation – runtime code generation (using LLVM) – direct.
Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.
Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.
Physical Layer of a Repository. March 6, 2009 Agenda – What is a Repository? –What is meant by Physical Layer? –Data Source, Connection Pool, Tables and.
Learn. Hadoop Online training course is designed to enhance your knowledge and skills to become a successful Hadoop developer and In-depth knowledge of.
1 Copyright © 2008, Oracle. All rights reserved. Repository Basics.
Databases and DBMSs Todd S. Bacastow January 2005.
HIVE A Warehousing Solution Over a MapReduce Framework
Mail call Us: / / Hadoop Training Sathya technologies is one of the best Software Training Institute.
Pig, Making Hadoop Easy Alan F. Gates Yahoo!.
Hadoop.
Hive - A Warehousing Solution Over a Map-Reduce Framework
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
Spark Presentation.
A Warehousing Solution Over a Map-Reduce Framework
Getting Data into Hadoop
Hive Mr. Sriram
Hadoop EcoSystem B.Ramamurthy.
Rekha Singhal, Amol Khanapurkar, TCS Mumbai.
Data, Databases, and DBMSs
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Server & Tools Business
MANAGING DATA RESOURCES
Overview of big data tools
Pig - Hive - HBase - Zookeeper
Setup Sqoop.
Data Warehousing in the age of Big Data (1)
CSE 491/891 Lecture 24 (Hive).
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Pig Hive HBase Zookeeper
Presentation transcript:

Hive - A Warehousing Solution Over a Map-Reduce Framework

Agenda Why Hive??? What is Hive? Hive Data Model Hive Architecture HiveQL Hive SerDe’s Pros and Cons Hive v/s Pig Graphs

Data Analysts with Hadoop

Challenges that Data Analysts faced Data Explosion - TBs of data generated everyday Solution – HDFS to store data and Hadoop Map- Reduce framework to parallelize processing of Data What is the catch? -Hadoop Map Reduce is Java intensive -Thinking in Map Reduce paradigm can get tricky

… Enter Hive!

Hive Key Principles

HiveQL to MapReduce Data Analyst Hive Framework SELECT COUNT(1) FROM Sales; rowcount,1 MR JOB Instance rowcount, N Sales: Hive table rowcount,1 N

Hive Data Model Data in Hive organized into : Tables Partitions Buckets

Hive Data Model Contd. Tables - Analogous to relational tables -Each table has a corresponding directory in HDFS -Data serialized and stored as files within that directory - Hive has default serialization built in which supports compression and lazy deserialization - Users can specify custom serialization –deserialization schemes ( SerDe’s )

Hive Data Model Contd. Partitions -Each table can be broken into partitions -Partitions determine distribution of data within subdirectories Example - CREATE_TABLE Sales (sale_id INT, amount FLOAT) PARTITIONED BY (country STRING, year INT, month INT) So each partition will be split out into different folders like Sales/country=US/year=2012/month=12

Hierarchy of Hive Partitions /hivebase/Sales /country=US /country=CANADA /year=2012 /year=2015 /year=2012 /year=2014 /month=12 /month=11

Hive Data Model Contd. Buckets -Data in each partition divided into buckets -Based on a hash function of the column - H(column) mod NumBuckets = bucket number -Each bucket is stored as a file in partition directory

Architecture Externel Interfaces - CLI, WebUI, JDBC, ODBC programming interfaces Thrift Server – Cross Language service framework. Metastore - Meta data about the Hive tables, partitions Driver - Brain of Hive! Compiler, Optimizer and Execution engine

Hive Thrift Server Framework for cross language services Server written in Java Support for clients written in different languages - JDBC(java), ODBC(c++), php, perl, python scripts

Metastore System catalog which contains metadata about the Hive tables Stored in RDBMS/local fs. HDFS too slow(not optimized for random access) Objects of Metastore  Database - Namespace of tables  Table - list of columns, types, owner, storage, SerDes  Partition – Partition specific column, Serdes and storage

Hive Driver Driver - Maintains the lifecycle of HiveQL statement Query Compiler – Compiles HiveQL in a DAG of map reduce tasks Executor - Executes the tasks plan generated by the compiler in proper dependency order. Interacts with the underlying Hadoop instance

Compiler Converts the HiveQL into a plan for execution Plans can - Metadata operations for DDL statements e.g. CREATE - HDFS operations e.g. LOAD Semantic Analyzer – checks schema information, type checking, implicit type conversion, column verification Optimizer – Finding the best logical plan e.g. Combines multiple joins in a way to reduce the number of map reduce jobs, Prune columns early to minimize data transfer Physical plan generator – creates the DAG of map-reduce jobs

HiveQL DDL : CREATE DATABASE CREATE TABLE ALTER TABLE SHOW TABLE DESCRIBE DML: LOAD TABLE INSERT QUERY: SELECT GROUP BY JOIN MULTI TABLE INSERT

Hive SerDe SELECT Query Record Reader Deserialize Hive Row Object Object Inspector Map Fields Hive Table End User  Hive built in Serde: Avro, ORC, Regex etc  Can use Custom SerDe’s (e.g. for unstructured data like audio/video data, semistructured XML data)

Good Things Boon for Data Analysts Easy Learning curve Completely transparent to underlying Map-Reduce Partitions(speed!) Flexibility to load data from localFS/HDFS into Hive Tables

Cons and Possible Improvements Extending the SQL queries support(Updates, Deletes) Parallelize firing independent jobs from the work DAG Table Statistics in Metastore Explore methods for multi query optimization Perform N- way generic joins in a single map reduce job Better debug support in shell

Hive v/s Pig Similarities:  Both High level Languages which work on top of map reduce framework  Can coexist since both use the under lying HDFS and map reduce Differences:  Language  Pig is a procedural ; (A = load ‘mydata’; dump A)  Hive is Declarative (select * from A)  Work Type  Pig more suited for adhoc analysis (on demand analysis of click stream search logs)  Hive a reporting tool (e.g. weekly BI reporting)

Hive v/s Pig  Users  Pig – Researchers, Programmers (build complex data pipelines, machine learning)  Hive – Business Analysts  Integration  Pig - Doesn’t have a thrift server(i.e no/limited cross language support)  Hive - Thrift server  User’s need  Pig – Better dev environments, debuggers expected  Hive - Better integration with technologies expected(e.g JDBC, ODBC) Differences:

Head-to-Head (the bee, the pig, the elephant) Version: Hadoop – 0.18x, Pig:786346, Hive:786346

REFERENCES ons ons latin-sql-constructing-data-processing-pipelines-444.html latin-sql-constructing-data-processing-pipelines-444.html Hortonworks tutorials (youtube) Graph : ive_benchmark_ pdf