Introduction to Apache HIVE

Presentation on theme: "Introduction to Apache HIVE" — Presentation transcript:

1

2 Introduction to Apache HIVE

3 Agenda Background. HIVE. HiveQL. Extension mechanisms.
Performance comparison.

4 Motivation Analysis of data is done by both engineering and non-engineering people. The data is growing fast: in 2007 the volume was 15TB, and it has since grown to 200TB. Current RDBMSs can NOT handle it. Existing solutions are not scalable, and commercial alternatives are expensive and proprietary. *Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook – both engineering and non-engineering. *The entire data processing infrastructure at Facebook prior to 2008 was built around a data warehouse using a commercial RDBMS. The data being generated was growing very fast – as an example, they grew from a 15TB data set in 2007 to a 700TB data set today. *The infrastructure at that time was so inadequate that some daily data processing jobs were taking more than a day to complete, and the situation was getting worse with every passing day. They had an urgent need for infrastructure that could scale along with their data.

5 Map/Reduce - Apache Hadoop
MapReduce is a programming model and an associated implementation introduced by Google in 2004. Apache Hadoop is a software framework inspired by Google's MapReduce. *MapReduce is a programming model and an associated implementation introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. *Basically there are two steps, the "Map step" and the "Reduce step". During the first one, a master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node. During the second one, the reduce step, the master node collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. *Apache Hadoop is a software framework inspired by Google's MapReduce. Apache Hadoop supports data-intensive distributed applications. It includes a distributed file system, HDFS, that provides high-throughput access to application data. It also includes a Map/Reduce software framework for distributed processing of large data sets on compute clusters. So all you have to do is extend a few classes and implement some interfaces in order to develop a map/reduce oriented solution.

6 Motivation (cont.) Hadoop supports data-intensive distributed applications. However... Map-reduce is hard to program (users know SQL/bash/Python). No schema. *As a result of the data issue, they started exploring Hadoop as a technology to address their scaling needs. The fact that Hadoop was already an open source project that was being used at petabyte scale and provided scalability using commodity hardware was a very compelling proposition. The same jobs that had taken more than a day to complete could now be completed within a few hours using Hadoop. *However, using Hadoop was not easy for end users, especially for those who were not familiar with map-reduce. End users had to write map-reduce programs for simple tasks like getting raw counts or averages. Hadoop lacked the expressiveness of popular query languages like SQL, and as a result users ended up spending hours writing programs for even simple analyses. Also, since map/reduce is a different programming model, users had to get used to thinking in that model. *There is no schema: the data stored in HDFS has no structure that query tools can rely on.

7 Agenda Background. HIVE. HiveQL. Extension mechanisms.
Performance comparison.

8 What is HIVE? A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. ETL. Structure. Access to different storage. Query execution via MapReduce. Key Building Principles: SQL is a familiar language Extensibility – Types, Functions, Formats, Scripts Performance *Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis, initially developed by Facebook. *Hive structures data into well-understood database concepts like tables, columns, rows, and partitions. It supports all the major primitive types – integers, floats, doubles and strings – as well as complex types such as maps, lists and structs. *Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, HiveQL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language. *Hadoop is a batch processing system, and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. As a result, latency for Hive queries is generally very high (minutes) even when the data sets involved are very small (say a few hundred megabytes). It therefore cannot be compared with systems such as Oracle, where analyses are conducted on a significantly smaller amount of data but proceed much more iteratively, with response times between iterations of less than a few minutes. *Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (like web logs). *As we will see later, Hive has good performance compared with plain Hadoop and Pig.

9 Data Units Databases. Tables. Partitions. Buckets (or Clusters).
In the order of granularity, Hive data is organized into: *Databases: Namespaces that separate tables and other data units from naming conflicts. *Tables: Homogeneous units of data which have the same schema. *Partitions: Each table can have one or more partition keys which determine how the data is stored. Partitions – apart from being storage units – also allow the user to efficiently identify the rows that satisfy certain criteria. For example, a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the table. You can run a query only on the relevant partition of the table, thereby speeding up the analysis significantly. Partition columns are virtual columns; they are not part of the data itself but are derived on load. *Buckets (or Clusters): Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table. These can be used to efficiently sample the data.
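A minimal HiveQL sketch of partitions and buckets, assuming an illustrative page_view table (the columns and the bucket count are examples, not taken from the slides):
CREATE TABLE page_view (viewTime INT, userid BIGINT, ip STRING)
PARTITIONED BY (dt STRING, country STRING)   -- one partition per (dt, country) value
CLUSTERED BY (userid) INTO 32 BUCKETS;       -- bucket by a hash of userid
-- Efficient sampling: read roughly 1/32 of the data by scanning a single bucket
SELECT COUNT(*) FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid) pv;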

10 Type System Primitive types Complex types
Integers: TINYINT, SMALLINT, INT, BIGINT. Boolean: BOOLEAN. Floating point numbers: FLOAT, DOUBLE. String: STRING. Complex types Structs: {a INT; b INT}. Maps: M['group']. Arrays: ['a', 'b', 'c'], A[1] returns 'b'.
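A small HiveQL sketch of a table using these types (the table and column names are illustrative assumptions):
CREATE TABLE type_demo (
  id INT,                          -- primitive
  pair STRUCT<a:INT, b:INT>,       -- struct
  props MAP<STRING, STRING>,       -- map
  letters ARRAY<STRING>            -- array
);
-- Field access mirrors the slide: pair.a, props['group'], letters[1]
SELECT pair.a, props['group'], letters[1] FROM type_demo;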

11 Agenda Background. HIVE. HiveQL. Extension mechanisms.
Performance comparison.

12 Examples – DDL Operations
CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING); SHOW TABLES '.*s'; DESCRIBE sample; ALTER TABLE sample ADD COLUMNS (new_col INT); DROP TABLE sample; *The first statement creates a table called sample with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into. By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a). *SHOW TABLES '.*s' lists all tables whose names end in 's'. The pattern matching follows Java regular expressions.
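The default ^A delimiter can be overridden at table creation time; a hedged sketch (the table name and the comma delimiter are just examples):
CREATE TABLE sample_csv (foo INT, bar STRING)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';   -- override the default ctrl-A field delimiter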

13 Examples – DML Operations
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds=' '); LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds=' '); *Loads a file that contains two columns separated by ctrl-a into the sample table. 'LOCAL' signifies that the input file is on the local file system; if 'LOCAL' is omitted, the file is looked up in HDFS. *The keyword 'OVERWRITE' signifies that existing data in the table is deleted. If the 'OVERWRITE' keyword is omitted, data files are appended to the existing data sets. *The second command loads data from an HDFS file/directory into the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.
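For comparison, a load without the OVERWRITE keyword appends to whatever is already in the partition; a minimal sketch reusing the sample table (the ds value here is illustrative):
LOAD DATA LOCAL INPATH './sample.txt' INTO TABLE sample PARTITION (ds='2012-02-24');   -- appends instead of replacing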

14 SELECTS and FILTERS SELECT foo FROM sample WHERE ds='2012-02-24';
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds=' '; INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample; *The first query selects column 'foo' from all rows of the ds='2012-02-24' partition of the sample table. The results are not stored anywhere, but are displayed on the console. *The second query selects all rows from the ds=' ' partition of the sample table into an HDFS directory. The result data is in files (the number depends on the number of mappers) in that directory. NOTE: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses. *The last query stores the result in a local directory.

15 Aggregations and Groups
SELECT MAX(foo) FROM sample; SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds; FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar; *The first query gets the max value of foo. *The second query groups by ds, sums the foo values for each ds and counts the number of rows for each ds. *The last one shows how we can insert the output into a table.

16 Join SELECT * FROM customer c JOIN order_cust o ON (c.id=o.cus_id);
CREATE TABLE customer (id INT, name STRING, address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#'; CREATE TABLE order_cust (id INT, cus_id INT, prod_id INT, price INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; SELECT * FROM customer c JOIN order_cust o ON (c.id=o.cus_id); SELECT c.id, c.name, c.address, ce.exp FROM customer c JOIN (SELECT cus_id, sum(price) AS exp FROM order_cust GROUP BY cus_id) ce ON (c.id=ce.cus_id); *This join query joins two different tables. Only one statement is needed to do the join operation; if we used plain Hadoop to do this, the code would be pretty complex. *The format of each file does not matter: the customer file is '#'-delimited while the orders file is tab-delimited. There is no need to write a different mapper for each file; it is managed by Hive. *Apart from this, as you can see, the last query contains a subquery in the FROM clause, which is another powerful supported feature.

17 Multi table insert - Dynamic partition insert
FROM page_view_stg pvs INSERT OVERWRITE TABLE page_view PARTITION(dt=' ', country='US') SELECT pvs.viewTime, … WHERE pvs.country = 'US' INSERT OVERWRITE TABLE page_view PARTITION(dt=' ', country='CA') SELECT pvs.viewTime, ... WHERE pvs.country = 'CA' INSERT OVERWRITE TABLE page_view PARTITION(dt=' ', country='UK') SELECT pvs.viewTime, ... WHERE pvs.country = 'UK'; INSERT OVERWRITE TABLE page_view PARTITION(dt=' ', country) SELECT pvs.viewTime, ... *In the first query we see that the output of aggregations or simple selects can be sent further into multiple tables or even to Hadoop DFS files. *In order to load data into all country partitions for a particular day, you have to add an insert statement for each country in the input data. This is very inconvenient, since you have to have prior knowledge of the list of countries present in the input data and create the partitions beforehand. If the list changes for another day, you have to modify your insert DML as well as the partition creation DDLs. It is also inefficient, since each insert statement may be turned into a MapReduce job. *Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table.
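A hedged sketch of the fully dynamic form (the two SET properties are standard Hive settings that enable dynamic partitioning; the column list is abbreviated with "..." exactly as on the slide):
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;   -- allow every partition column to be dynamic
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION (dt, country)
SELECT pvs.viewTime, ..., pvs.dt, pvs.country;       -- dynamic partition columns go last in the SELECT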

18 Agenda Background. HIVE. HiveQL. Extension mechanisms.
Performance comparison.

19 User-defined function
Java code
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(s.toString().toLowerCase());
  }
}
*Hive has the ability to define functions. In order to create a new UDF class, it needs to inherit from this UDF class. All UDF classes need to implement one or more methods named "evaluate", which will be called by Hive. "evaluate" should never be a void method; however, it can return "null" if needed. *After compiling it, you have to include it in the Hive classpath and then, once Hive is started up, register the function. *Finally you use the function in a statement.
Registering the class
CREATE FUNCTION my_lower AS 'com.example.hive.udf.Lower';
Using the function
SELECT my_lower(title), sum(freq) FROM titles GROUP BY my_lower(title);
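A hedged sketch of the classpath step mentioned in the notes (the jar path is a hypothetical example; ADD JAR and CREATE TEMPORARY FUNCTION are the session-scoped form of registration):
ADD JAR /path/to/my_udfs.jar;                                        -- puts the compiled UDF on Hive's classpath
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';  -- registers it for the current session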

20 Built-in Functions Mathematical: round, floor, ceil, rand, exp...
Collection: size, map_keys, map_values, array_contains. Type Conversion: cast. Date: from_unixtime, to_date, year, datediff... Conditional: if, case, coalesce. String: length, reverse, upper, trim... Hive provides a lot of built-in functions, such as the mathematical, collection, date, conditional and string functions listed above.
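A short sketch exercising a few of these built-ins on the sample table defined earlier (the particular expressions are illustrative):
SELECT upper(bar),                              -- string
       length(bar),                             -- string
       round(foo * 1.5),                        -- mathematical
       if(foo > 0, 'positive', 'not positive')  -- conditional
FROM sample;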

21 More Functions Aggregate: count, sum, variance... Table-Generating:
Lateral view. The base table pageAds has two columns:
pageid (string)      adid_list (Array<int>)
"front_page"         [1, 2, 3]
"contact_page"       [3, 4, 5]
SELECT pageid, adid FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;
The query produces one row per (pageid, adid) pair:
pageid (string)      adid (int)
"front_page"         1
"front_page"         2
...
*Aggregation functions create the output given the full set of data. Their implementation is slightly more complex than that of a UDF: the user has to implement a few more methods, but the idea is similar. Hive also provides a lot of built-in UDAFs. *Normal user-defined functions, such as concat(), take in a single input row and output a single output row. In contrast, table-generating functions transform a single input row into multiple output rows. *Consider the base table named pageAds above. It has two columns: pageid (the name of the page) and adid_list (an array of ads appearing on the page). A lateral view with explode() can be used to convert adid_list into separate rows using the query shown.
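For completeness, a hedged sketch of how such a base table could be declared (the choice of delimiters is an assumption, not from the slides):
CREATE TABLE pageAds (pageid STRING, adid_list ARRAY<INT>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';   -- separates the elements of adid_list in the data files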

22 Map/Reduce Scripts
my_append.py:
import sys
# appends a running counter to each tab-separated key/value pair read from stdin
i = 0
for line in sys.stdin:
    line = line.strip()
    key = line.split('\t')[0]
    value = line.split('\t')[1]
    print key + str(i) + '\t' + value + str(i)
    i = i + 1
Using the script:
SELECT TRANSFORM (foo, bar) USING 'python ./my_append.py' FROM sample;
*Users can also plug their own custom mappers and reducers into the data stream. In order to run a custom mapper script and a custom reducer script, the user can issue a command like the one above, which uses the TRANSFORM clause to embed the mapper and reducer scripts. *By default, columns will be transformed to STRING and delimited by TAB before being fed to the user script; similarly, all NULL values will be converted to the literal string \N in order to differentiate NULL values from empty strings. The standard output of the user script will be treated as TAB-separated STRING columns, any cell containing only \N will be re-interpreted as a NULL, and the resulting STRING columns will then be cast to the data types specified in the table declaration in the usual way. User scripts can output debug information to standard error, which will be shown on the task detail page on Hadoop. These defaults can be overridden with ROW FORMAT ....

23 Comparison of UDF/UDAF v.s. M/R scripts
Language: UDF/UDAF/UDTF are written in Java; M/R scripts can use any language.
1/1 input/output: supported via UDF; supported by M/R scripts.
n/1 input/output: supported via UDAF.
1/n input/output: supported via UDTF.
Speed: UDF/UDAF/UDTF are faster (run in the same process); M/R scripts are slower (spawn a new process).

24 Agenda Background. HIVE. HiveQL. Extension mechanisms.
Performance comparison.

25 Performance - Dataset structure
grep(key VARCHAR(10), field VARCHAR(90)): 2 columns, 500 million rows, 50GB. rankings(pageRank INT, pageURL VARCHAR(100), avgDuration INT): 3 columns, 56.3 million rows, 3.3GB. uservisits(sourceIP VARCHAR(16), destURL VARCHAR(100), visitDate DATE, adRevenue FLOAT, userAgent VARCHAR(64), countryCode VARCHAR(3), languageCode VARCHAR(6), searchWord VARCHAR(32), duration INT): 9 columns, 465 million rows, 60GB (scaled down from 200GB). *These are the three datasets that were used for the benchmark.

26 Performance - Test query
Select query 1: SELECT * FROM grep WHERE field LIKE '%XYZ%'; Select query 2: SELECT pageRank, pageURL FROM rankings WHERE pageRank > 10; Aggregation query: SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP; Join query: SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue FROM rankings AS R, userVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(' ') AND Date(' ') GROUP BY UV.sourceIP; *The test includes two select queries, one aggregation query and one join query.

27 Performance - Result

28 Conclusion An easy way to process large scale data.
Supports SQL-based queries. Provides user-defined interfaces to extend programmability. Files in HDFS are immutable. Typical uses: Log processing: daily reports, user activity measurement. Data/Text mining: machine learning (training data). Business intelligence: advertising delivery, spam detection. To sum up: *Hive is built on Hadoop. It provides an easy way to process large scale data. Because it uses Hadoop, it is not appropriate for online or real-time processing; remember that Hadoop does not process a job immediately. *Hive provides HiveQL, which is an SQL-based language. This means that it is easy to learn for both engineering and non-engineering people. *Hive also provides several extension mechanisms: UDF, UDAF, UDTF and scripting. *Hive is aimed at processing immutable data, typically for offline processing.

29

