Introduction to Apache HIVE


Agenda Background. HIVE. HiveQL. Extension mechanisms. Performance comparison.

Motivation Analysis of data is done by both engineering and non-engineering people. The data is growing fast: in 2007 the volume was 15TB and it grew to 200TB by 2010. Current RDBMSs cannot handle it. Existing solutions are either unavailable, not scalable, expensive or proprietary. *Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook – both engineering and non-engineering. *The entire data processing infrastructure at Facebook prior to 2008 was built around a data warehouse using a commercial RDBMS. The data they were generating was growing very fast – as an example, they grew from a 15TB data set in 2007 to a 700TB data set today. *The infrastructure at that time was so inadequate that some daily data processing jobs were taking more than a day to process, and the situation was getting worse with every passing day. They had an urgent need for infrastructure that could scale along with their data.

Map/Reduce - Apache Hadoop MapReduce is a programming model and an associated implementation introduced by Google in 2004. Apache Hadoop is a software framework inspired by Google's MapReduce. *MapReduce is a programming model and an associated implementation introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. *Basically there are two steps, the "Map step" and the "Reduce step". During the first one, a master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node. During the second one, the reduce step, the master node collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. *Apache Hadoop is a software framework inspired by Google's MapReduce. Apache Hadoop supports data-intensive distributed applications. It includes a distributed file system, HDFS, that provides high-throughput access to application data. It also includes a Map/Reduce software framework for distributed processing of large data sets on compute clusters. So all you have to do is extend a few classes and implement a few interfaces in order to develop a map/reduce-oriented solution.

Motivation (cont.) Hadoop supports data-intensive distributed applications. However... Map-reduce is hard to program (users know SQL/bash/Python). No schema. *As a result of the data issue, they started exploring Hadoop as a technology to address their scaling needs. The fact that Hadoop was already an open source project being used at petabyte scale and providing scalability on commodity hardware was a very compelling proposition. The same jobs that had taken more than a day to complete could now be completed within a few hours using Hadoop. *However, using Hadoop was not easy for end users, especially for those who were not familiar with map-reduce. End users had to write map-reduce programs for simple tasks like getting raw counts or averages. Hadoop lacked the expressiveness of popular query languages like SQL, and as a result users ended up spending hours writing programs for even simple analyses. Also, since map/reduce is a different programming model, users had to get used to thinking in that model. *There is no structure imposed on the data, and no schema.
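To make the gap concrete, here is a minimal sketch of the kind of "raw counts and averages" task mentioned above, written as a single HiveQL query instead of a hand-written map-reduce program. The table and column names (page_views, country, duration, dt) are hypothetical, not from the presentation.

-- Counts and averages per country for one day, in one statement.
SELECT country, COUNT(*) AS visits, AVG(duration) AS avg_duration
FROM page_views
WHERE dt = '2010-01-01'
GROUP BY country;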

Agenda Background. HIVE. HiveQL. Extension mechanisms. Performance comparison.

What is HIVE? A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. ETL. Structure. Access to different storage. Query execution via MapReduce. Key building principles: SQL is a familiar language. Extensibility – types, functions, formats, scripts. Performance. *Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis, initially developed by Facebook. *Hive structures data into well-understood database concepts like tables, columns, rows, and partitions. It supports all the major primitive types – integers, floats, doubles and strings – as well as complex types such as maps, lists and structs. *Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language. *Hadoop is a batch processing system, and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. As a result, latency for Hive queries is generally very high (minutes) even when the data sets involved are very small (say, a few hundred megabytes). So it cannot be compared with systems such as Oracle, where analyses are conducted on a significantly smaller amount of data but proceed much more iteratively, with response times between iterations of less than a few minutes. *Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (like web logs). *As we will see later, Hive has good performance compared with hand-written Hadoop jobs and Pig.

Data Units Databases. Tables. Partitions. Buckets (or Clusters). In order of granularity, Hive data is organized into: *Databases: namespaces that separate tables and other data units from naming conflicts. *Tables: homogeneous units of data which have the same schema. *Partitions: each table can have one or more partition keys which determine how the data is stored. Partitions – apart from being storage units – also allow the user to efficiently identify the rows that satisfy certain criteria. For example, a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the table. You can run a query only on the relevant partition of the table, thereby speeding up the analysis significantly. Partition columns are virtual columns; they are not part of the data itself but are derived on load. *Buckets (or Clusters): data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table. These can be used to efficiently sample the data.
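A minimal sketch of a table declaration that uses both of these units; the table and column names are hypothetical. The table is partitioned by date and country, and each partition is bucketed into 32 files by a hash of userid.

CREATE TABLE page_views (viewTime INT, userid BIGINT, page_url STRING)
PARTITIONED BY (dt STRING, country STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;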

Type System Primitive types: Integers: TINYINT, SMALLINT, INT, BIGINT. Boolean: BOOLEAN. Floating point numbers: FLOAT, DOUBLE. String: STRING. Complex types: Structs: {a INT; b INT}. Maps: M['group']. Arrays: ['a', 'b', 'c'], A[1] returns 'b'.
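A short sketch, with hypothetical table and column names, of how the complex types are declared and then accessed with the syntax shown above.

CREATE TABLE complex_demo (
  id       INT,
  props    MAP<STRING, STRING>,
  tags     ARRAY<STRING>,
  location STRUCT<city: STRING, zip: INT>
);

SELECT props['group'], tags[1], location.city FROM complex_demo;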

Agenda Background. HIVE. HiveQL. Extension mechanisms. Performance comparison.

Examples – DDL Operations CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING); SHOW TABLES '.*s'; DESCRIBE sample; ALTER TABLE sample ADD COLUMNS (new_col INT); DROP TABLE sample; *The CREATE TABLE statement creates a table called sample with two columns and a partition column called ds. The partition column is a virtual column: it is not part of the data itself but is derived from the partition that a particular dataset is loaded into. By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a). *SHOW TABLES '.*s' lists all tables whose names end with 's'. The pattern matching follows Java regular expressions.

Examples – DML Operations LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24'); LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24'); *The first command loads a file that contains two columns separated by ctrl-a into the sample table. 'LOCAL' signifies that the input file is on the local file system; if 'LOCAL' is omitted, the file is looked for in HDFS. *The keyword 'OVERWRITE' signifies that existing data in the table is deleted. If the 'OVERWRITE' keyword is omitted, data files are appended to existing data sets. *The second command loads data from an HDFS file/directory into the table. Note that loading data from HDFS results in moving the file/directory, so the operation is almost instantaneous.
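One way to check the result of a load, sketched here with the same sample table and ds partition used above: list the table's partitions and look at a few rows of the newly loaded one.

SHOW PARTITIONS sample;
SELECT * FROM sample WHERE ds = '2012-02-24' LIMIT 10;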

SELECTS and FILTERS SELECT foo FROM sample WHERE ds='2012-02-24'; INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24'; INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample; *The first query selects column 'foo' from all rows of partition ds='2012-02-24' of the sample table. The results are not stored anywhere but are displayed on the console. *The second query selects all rows from partition ds='2012-02-24' of the sample table into an HDFS directory. The result data ends up in one or more files in that directory (depending on the number of mappers). NOTE: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses. *The last query stores the result in a local directory.

Aggregations and Groups SELECT MAX(foo) FROM sample; SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds; FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar; *The first query gets the maximum value of foo. *The second query groups the rows by ds, counting the rows and summing the foo values for each ds. *The last one shows how we can insert the output into a table.

Join CREATE TABLE customer (id INT, name STRING, address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#'; CREATE TABLE order_cust (id INT, cus_id INT, prod_id INT, price INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; SELECT * FROM customer c JOIN order_cust o ON (c.id=o.cus_id); SELECT c.id, c.name, c.address, ce.exp FROM customer c JOIN (SELECT cus_id, sum(price) AS exp FROM order_cust GROUP BY cus_id) ce ON (c.id=ce.cus_id); *The first SELECT joins two different tables. Only one statement is needed to do the join; if we used plain Hadoop for this, the code would be considerably more complex. *The file format of each table does not matter: the customer file is '#'-delimited while the orders file is tab-delimited, and there is no need to write a different mapper for each file – it is managed by Hive. *In addition, as you can see, the last query contains a subquery in the FROM clause, which is another powerful feature.

Multi table insert - Dynamic partition insert FROM page_view_stg pvs INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US') SELECT pvs.viewTime, … WHERE pvs.country = 'US' INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA') SELECT pvs.viewTime, ... WHERE pvs.country = 'CA' INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK') SELECT pvs.viewTime, ... WHERE pvs.country = 'UK'; INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country) SELECT pvs.viewTime, ... *The first statement shows that the output of aggregations or simple selects can be sent to multiple tables or even to HDFS files. *In order to load data into all country partitions for a particular day, you would have to add an insert statement for each country in the input data. This is very inconvenient, since you must have prior knowledge of the list of countries in the input data and create the partitions beforehand. If the list changes on another day, you have to modify your insert DML as well as the partition creation DDLs. It is also inefficient, since each insert statement may be turned into a MapReduce job. *Dynamic-partition insert (or multi-partition insert), shown in the second statement, is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table.
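A complete dynamic-partition insert might look like the following sketch. The SET commands are the standard Hive flags for enabling dynamic partitions; the non-partition columns in the SELECT list (userid, page_url) are hypothetical, since the full page_view schema is not shown in the slides. Note that the dynamic partition column (country) must come last in the SELECT list.

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.country;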

Agenda Background. HIVE. HiveQL. Extension mechanisms. Performance comparison.

User-defined function Java code: package com.example.hive.udf; import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.io.Text; public final class Lower extends UDF { public Text evaluate(final Text s) { if (s == null) { return null; } return new Text(s.toString().toLowerCase()); } } Registering the class: CREATE FUNCTION my_lower AS 'com.example.hive.udf.Lower'; Using the function: SELECT my_lower(title), sum(freq) FROM titles GROUP BY my_lower(title); *Hive has the ability to define custom functions. To create a new UDF class, it needs to inherit from the UDF class. All UDF classes need to implement one or more methods named "evaluate", which will be called by Hive. "evaluate" should never be a void method; however, it can return "null" if needed. *After compiling it, you have to include the jar in the Hive classpath and then, once Hive is started up, register the function. *Finally you use the function in a statement.
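The registration steps the notes describe can be sketched in HiveQL as follows; the jar path is hypothetical. ADD JAR puts the compiled class on Hive's classpath, and CREATE TEMPORARY FUNCTION registers the function for the current session (the session-scoped variant of the CREATE FUNCTION shown on the slide).

ADD JAR /tmp/hive-udf-lower.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';
SELECT my_lower(title), sum(freq) FROM titles GROUP BY my_lower(title);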

Built-in Functions Mathematical: round, floor, ceil, rand, exp... Collection: size, map_keys, map_values, array_contains. Type Conversion: cast. Date: from_unixtime, to_date, year, datediff... Conditional: if, case, coalesce. String: length, reverse, upper, trim... *Hive provides a large set of built-in functions in these categories: mathematical, collection, type conversion, date, conditional and string functions.
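A small illustrative query, reusing the sample table (foo INT, bar STRING, ds STRING) from the earlier examples, combining a few of the built-ins listed above:

SELECT upper(bar),
       round(foo / 10.0),
       year(ds),
       if(foo > 0, 'positive', 'not positive')
FROM sample;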

More Functions Aggregate: count, sum, variance... Table-Generating functions and lateral views: consider a base table named pageAds with two columns, pageid (STRING, the name of the page) and adid_list (ARRAY<INT>, the ads appearing on the page), containing the rows ("front_page", [1, 2, 3]) and ("contact_page", [3, 4, 5]). SELECT pageid, adid FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid; *Aggregate functions produce their output from the full set of data. Their implementation is slightly more complex than a UDF – the user has to implement a few more methods – but the idea is similar, and Hive provides many built-in UDAFs. *Normal user-defined functions, such as concat(), take in a single input row and output a single output row. In contrast, table-generating functions transform a single input row into multiple output rows. *A lateral view with explode() converts adid_list into separate rows using the query above, producing result rows (pageid STRING, adid INT) such as ("front_page", 1), ("front_page", 2), ...

Map/Reduce Scripts The script my_append.py:

import sys

i = 0
for line in sys.stdin:
    line = line.strip()
    key = line.split('\t')[0]
    value = line.split('\t')[1]
    print key + str(i) + '\t' + value + str(i)
    i = i + 1

Using the script: SELECT TRANSFORM (foo, bar) USING 'python ./my_append.py' FROM sample; *Users can also plug their own custom mappers and reducers into the data stream. In order to run a custom mapper script and a custom reducer script, the user issues a command which uses the TRANSFORM clause to embed the mapper and reducer scripts. *By default, columns will be transformed to STRING and delimited by TAB before being fed to the user script; similarly, all NULL values will be converted to the literal string \N in order to differentiate NULL values from empty strings. The standard output of the user script will be treated as TAB-separated STRING columns, any cell containing only \N will be re-interpreted as a NULL, and the resulting STRING column will be cast to the data type specified in the table declaration in the usual way. User scripts can output debug information to standard error, which will be shown on the task detail page on Hadoop. These defaults can be overridden with ROW FORMAT ....
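As a sketch of how the script can be shipped to the cluster and how its output columns can be named: ADD FILE distributes my_append.py to the task nodes, and the AS clause maps the script's tab-separated output back to named columns (the output column names here are made up).

ADD FILE ./my_append.py;
SELECT TRANSFORM (foo, bar)
       USING 'python my_append.py'
       AS (foo_out STRING, bar_out STRING)
FROM sample;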

Comparison of UDF/UDAF vs. M/R scripts Language: Java for UDF/UDAF, any language for M/R scripts. 1/1 input/output: supported via UDF; supported by M/R scripts. n/1 input/output: supported via UDAF; supported by M/R scripts. 1/n input/output: supported via UDTF; supported by M/R scripts. Speed: UDF/UDAF are faster (run in the same process); M/R scripts are slower (spawn a new process).

Agenda Background. HIVE. HiveQL. Extension mechanisms. Performance comparison.

Performance - Dataset structure grep(key VARCHAR(10), field VARCHAR(90)): 2 columns, 500 million rows, 50GB. rankings(pageRank INT, pageURL VARCHAR(100), avgDuration INT): 3 columns, 56.3 million rows, 3.3GB. uservisits(sourceIP VARCHAR(16), destURL VARCHAR(100), visitDate DATE, adRevenue FLOAT, userAgent VARCHAR(64), countryCode VARCHAR(3), languageCode VARCHAR(6), searchWord VARCHAR(32), duration INT): 9 columns, 465 million rows, 60GB (scaled down from 200GB). *These are the three datasets that were used for the benchmark.
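To give a sense of how such a dataset is declared on the Hive side, here is a sketch for the rankings table; Hive of this era has no bounded VARCHAR type, so STRING is used, and the field delimiter is an assumption rather than something stated in the slides.

CREATE TABLE rankings (pageRank INT, pageURL STRING, avgDuration INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';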

Performance - Test queries Select query 1: SELECT * FROM grep WHERE field LIKE '%XYZ%'; Select query 2: SELECT pageRank, pageURL FROM rankings WHERE pageRank > 10; Aggregation query: SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP; Join query: SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue FROM rankings AS R, userVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('1999-01-01') AND Date('2000-01-01') GROUP BY UV.sourceIP; *The test includes two select queries, one aggregation query and one join query.

Performance - Result

Conclusion An easy way to process large scale data. Supports SQL-based queries. Provides user-defined interfaces to extend programmability. Files in HDFS are immutable. Typical uses: Log processing: daily reports, user activity measurement. Data/text mining: machine learning (training data). Business intelligence: advertising delivery, spam detection. To sum up: *Hive is built on Hadoop. It provides an easy way to process large scale data. Because it uses Hadoop, it is not appropriate for online data or real-time processing; remember that Hadoop does not process a job immediately. *Hive provides HiveQL, an SQL-based language, which makes it easy to learn for both engineering and non-engineering people. *Hive also provides several extension mechanisms: UDF, UDAF, UDTF and scripting. *Hive is aimed at processing immutable data, typically offline.