Hive - A Warehousing Solution Over a Map-Reduce Framework

Slides:

Advertisements

Similar presentations

Shark:SQL and Rich Analytics at Scale

Advertisements

CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.

Hive - A Warehousing Solution Over a Map-Reduce Framework.

HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.

Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.

Introduction to Hive Liyin Tang

Hive: A data warehouse on Hadoop

Cloud Computing Other Mapreduce issues Keke Chen.

CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.

A warehouse solution over map-reduce framework Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff.

Raghav Ayyamani. Copyright Ellis Horowitz, Why Another Data Warehousing System? Problem : Data, data and more data Several TBs of data everyday.

Hive – A Warehousing Solution Over a Map-Reduce Framework Presented by: Atul Bohara Feb 18, 2014.

1 A Comparison of Approaches to Large-Scale Data Analysis Pavlo, Paulson, Rasin, Abadi, DeWitt, Madden, Stonebraker, SIGMOD’09 Shimin Chen Big data reading.

Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.

HADOOP ADMIN: Session -2

Database System Concepts and Architecture Lecture # 3 22 June 2012 National University of Computer and Emerging Sciences.

Data Formats CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.

Hive : A Petabyte Scale Data Warehouse Using Hadoop

Cloud Computing Other High-level parallel processing languages Keke Chen.

Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.

MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute.

Introduction to Hadoop and HDFS

Distributed Systems Fall 2014 Zubair Amjad. Outline Motivation What is Sqoop? How Sqoop works? Sqoop Architecture Import Export Sqoop Connectors Sqoop.

Hive Facebook 2009.

MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.

An Introduction to HDInsight June 27 th,

A NoSQL Database - Hive Dania Abed Rabbou.

Hive. What is Hive? Data warehousing layer on top of Hadoop – table abstractions SQL-like language (HiveQL) for “batch” data processing SQL is translated.

Impala. Impala: Goals General-purpose SQL query engine for Hadoop High performance – C++ implementation – runtime code generation (using LLVM) – direct.

Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.

Oracle Business Intelligence Foundation – Testing and Deploying OBI Repository.

Learn. Hadoop Online training course is designed to enhance your knowledge and skills to become a successful Hadoop developer and In-depth knowledge of.

1 Copyright © 2008, Oracle. All rights reserved. Repository Basics.

Data Integrity & Indexes / Session 1/ 1 of 37 Session 1 Module 1: Introduction to Data Integrity Module 2: Introduction to Indexes.

Image taken from: slideshare

Databases (CS507) CHAPTER 2.

Databases and DBMSs Todd S. Bacastow January 2005.

HIVE A Warehousing Solution Over a MapReduce Framework

Mail call Us: / / Hadoop Training Sathya technologies is one of the best Software Training Institute.

Pig, Making Hadoop Easy Alan F. Gates Yahoo!.

Chapter 2: Database System Concepts and Architecture - Outline

Distributed Programming in “Big Data” Systems Pramod Bhatotia wp

INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER

Spark Presentation.

A Warehousing Solution Over a Map-Reduce Framework

Getting Data into Hadoop

Hive Mr. Sriram

Presented by: Warren Sifre

Overview of Hadoop MapReduce MapReduce is a soft work framework for easily writing applications which process vast amounts of.

Hadoop EcoSystem B.Ramamurthy.

Rekha Singhal, Amol Khanapurkar, TCS Mumbai.

Data, Databases, and DBMSs

Introduction to PIG, HIVE, HBASE & ZOOKEEPER

Server & Tools Business

MANAGING DATA RESOURCES

Overview of big data tools

The Idea of Pig Or Pig Concepts

Pig - Hive - HBase - Zookeeper

Database Systems Instructor Name: Lecture-3.

Data processing with Hadoop

Data Warehousing in the age of Big Data (1)

CSE 491/891 Lecture 24 (Hive).

Charles Tappert Seidenberg School of CSIS, Pace University

Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper

Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.

Map Reduce, Types, Formats and Features

Pig Hive HBase Zookeeper

Presentation transcript:

Hive - A Warehousing Solution Over a Map-Reduce Framework

Overview Why Hive? What is Hive? Hive Data Model Hive Architecture HiveQL Hive SerDe’s Pros and Cons Hive v/s Pig Graphs

Challenges that Data Analysts faced Data Explosion - TBs of data generated everyday Solution – HDFS to store data and Hadoop Map- Reduce framework to parallelize processing of Data What is the catch? Hadoop Map Reduce is Java intensive Thinking in Map Reduce paradigm not trivial

… Enter Hive!

Hive Key Principles Data Warehouse - A system used for reporting and data analysis. DWs are central repositories of integrated data from one or more disparate sources ETL – process of extracting data from source and bringing it into data warehouse. Extract Transform and Load.

HiveQL to MapReduce Hive Framework N Data Analyst SELECT COUNT(1) FROM Sales; rowcount, N rowcount,1 rowcount,1 Sales: Hive table MR JOB Instance

Hive Data Model Data in Hive organized into : Tables Partitions Buckets

Hive Data Model Contd. Tables - Analogous to relational tables Each table has a corresponding directory in HDFS Data serialized and stored as files within that directory - Hive has default serialization built in which supports compression and lazy deserialization - Users can specify custom serialization –deserialization schemes (SerDe’s)

Hive Data Model Contd. Partitions Each table can be broken into partitions Partitions determine distribution of data within subdirectories Example - CREATE_TABLE Sales (sale_id INT, amount FLOAT) PARTITIONED BY (country STRING, year INT, month INT) So each partition will be split out into different folders like Sales/country=US/year=2012/month=12

Hierarchy of Hive Partitions /hivebase/Sales /country=US /country=CANADA /year=2012 /year=2012 Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries. /year=2015 /year=2014 /month=12 /month=11 /month=11 File File File

Hive Data Model Contd. Buckets Data in each partition divided into buckets Based on a hash function of the column H(column) mod NumBuckets = bucket number Each bucket is stored as a file in partition directory

Architecture Externel Interfaces- CLI, WebUI, JDBC, ODBC programming interfaces Thrift Server – Cross Language service framework . Metastore - Meta data about the Hive tables, partitions Driver - Brain of Hive! Compiler, Optimizer and Execution engine

Hive Thrift Server Framework for cross language services Server written in Java Support for clients written in different languages - JDBC(java), ODBC(c++), php, perl, python scripts

Metastore System catalog which contains metadata about the Hive tables Stored in RDBMS/local fs. HDFS too slow(not optimized for random access) Objects of Metastore Database - Namespace of tables Table - list of columns, types, owner, storage, SerDes Partition – Partition specific column, Serdes and storage Metadata for table contains list of columns and their types, owner, storage and SerDe information. It can also contain any user supplied key and value data; this facility can be used to store table statistics in the future. Storage information includes location of the table’s data in the underlying file system, data formats and bucketing information. SerDe metadata includes the implementation class of serializer and deserializer methods and any supporting information required by that implementation. Hive now records the schema version in the metastore database and verifies that the metastore schema version is compatible with Hive binaries that are going to accesss the metastore. Note that the Hive properties to implicitly create or alter the existing schema are disabled by default. Hive will not attempt to change the metastore schema implicitly. When you execute a Hive query against an old schema, it will fail to access the metastore.

Hive Driver Driver - Maintains the lifecycle of HiveQL statement Query Compiler – Compiles HiveQL in a DAG of map reduce tasks Executor - Executes the tasks plan generated by the compiler in proper dependency order. Interacts with the underlying Hadoop instance

Compiler Converts the HiveQL into a plan for execution Plans can - Metadata operations for DDL statements e.g. CREATE - HDFS operations e.g. LOAD Semantic Analyzer – checks schema information, type checking, implicit type conversion, column verification Optimizer – Finding the best logical plan e.g. Combines multiple joins in a way to reduce the number of map reduce jobs, Prune columns early to minimize data transfer Physical plan generator – creates the DAG of map-reduce jobs These repartition operators mark the boundary between the map phase and a reduce phase during physical plan generation.

HiveQL DDL : CREATE DATABASE CREATE TABLE ALTER TABLE SHOW TABLE DESCRIBE DML: LOAD TABLE INSERT QUERY: SELECT GROUP BY JOIN MULTI TABLE INSERT The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.

Object Inspector Map Fields Hive SerDe SELECT Query Hive built in Serde: Avro, ORC, Regex etc Can use Custom SerDe’s (e.g. for unstructured data like audio/video data, semistructured XML data) Record Reader Hive Table Deserialize Hive Row Object Object Inspector Map Fields End User

Good Things Boon for Data Analysts Easy Learning curve Completely transparent to underlying Map-Reduce Partitions(speed!) Flexibility to load data from localFS/HDFS into Hive Tables

Cons and Possible Improvements Extending the SQL queries support(Updates, Deletes) Parallelize firing independent jobs from the work DAG Table Statistics in Metastore Explore methods for multi query optimization Perform N- way generic joins in a single map reduce job Better debug support in shell

Hive v/s Pig Similarities: Both High level Languages which work on top of map reduce framework Can coexist since both use the under lying HDFS and map reduce Differences: Language Pig is a procedural ; (A = load ‘mydata’; dump A) Hive is Declarative (select * from A) Declarative programming is where you say what you want without having to say how to do it. e.g SQL, yacc, lex With procedural programming, you have to specify exact steps to get the result. e.g. C Work Type Pig more suited for adhoc analysis (on demand analysis of click stream search logs) Hive a reporting tool (e.g. weekly BI reporting)

Hive v/s Pig Differences: Users Pig – Researchers, Programmers (build complex data pipelines, machine learning) Hive – Business Analysts Integration Pig - Doesn’t have a thrift server(i.e no/limited cross language support) Hive - Thrift server Declarative programming is where you say what you want without having to say how to do it. e.g SQL, yacc, lex With procedural programming, you have to specify exact steps to get the result. e.g. C User’s need Pig – Better dev environments, debuggers expected Hive - Better integration with technologies expected(e.g JDBC, ODBC)

Head-to-Head (the bee, the pig, the elephant) SELECT * FROM grep WHERE field like ‘%XYZ%’; [The table “grep” has two columns: (key STRING, field STRING). It has 500,000,000 rows and takes 50 GB disk space. ~ ] SELECT pageRank, pageURL FROM rankings WHERE pageRank > 10; [ The table “rankings” has three columns: (pageRank INT, pageURL STRING, avgDuration INT). It has 56289700 rows and takes 3.3 GB disk space. ] SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP; [The table “uservisits” has 9 columns: (sourceIP STRING,destURL STRING,visitDate STRING,adRevenue DOUBLE,userAgent STRING,countryCode STRING,languageCode STRING,searchWord STRING,duration INT ). It has 465000000 rows and takes 60GB disk space. ] SELECT sourceIP, AVG(pageRank), SUM(adRevenue) FROM rankings R JOIN (SELECT * FROM uservisits UV WHERE UV.visitDate > '1999-01-01' AND UV.visitDate < '2000-01-01') NUV ON (R.pageURL = NUV.destURL) GROUP BY sourceIP; s finished in two jobs on Hive and PIG and three jobs on Hadoop. The first job, which takes the majority of the time, has 480 mappers and 60 reducers on all Hive, PIG and Hadoop. The second job on PIG has 120 mappers and 60 reducers. The second job on both Hive and Hadoop has 60 mappers and 60 reducers. The third job on Hadoop has 60 mappers and 1 reducer. Version: Hadoop – 0.18x, Pig:786346, Hive:786346

REFERENCES https://hive.apache.org/ https://cwiki.apache.org/confluence/display/Hive/Presentatio ns https://developer.yahoo.com/blogs/hadoop/comparing-pig- latin-sql-constructing-data-processing-pipelines-444.html http://www.qubole.com/blog/big-data/hive-best-practices/ Hortonworks tutorials (youtube) Graph : https://issues.apache.org/jira/secure/attachment/12411185/hi ve_benchmark_2009-06-18.pdf Hive has to explicitly maintain consistency between metadata and data – not much details provided in paper In memory metastore?