Hive – SQL on top of Hadoop


1 Hive – SQL on top of Hadoop

2 Content Background Concepts Hive Architecture Examples

3 Background


8 Version
2008: Apache Hive
2009/4/29: stable version 0.3.0
2013/1/11: 0.10.0
2014/11/12: 0.14.0
2015/2/4: 1.0.0 (0.14.1)
2015/3/18: 1.2.0

9 Concepts

10 Map-Reduce and SQL
Map-Reduce is scalable. SQL has a huge user base and is easy to code.
Solution: combine SQL and Map-Reduce:
Hive on top of Hadoop (open source)
Aster Data (proprietary)
Greenplum (proprietary)

11 What is Hive
A database/data warehouse on top of Hadoop.
Rich data types.
Efficient implementations of SQL on top of Map-Reduce.
Allows users to access Hive data without using Hive.
Supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem.
Provides an SQL-like language called HiveQL with schemas.
Converts queries into Map-Reduce, Apache Tez, or Spark jobs.

12 What Hive Is NOT Hive aims to provide acceptable (but not optimal) latency for interactive data browsing. Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs).

13 Data Units
Databases: namespaces that separate tables and other data units from naming conflicts.
Tables: homogeneous units of data which share the same schema. An example is a page_views table, where each row comprises the following columns (schema):
timestamp - INT; the unix timestamp of when the page was viewed.
userid - BIGINT; identifies the user who viewed the page.
page_url - STRING; the location of the page.
referrer_url - STRING; the location of the page from which the user arrived at the current page.
IP - STRING; the IP address from which the page request was made.

14 Data Units
Partitions: each table can have one or more partition keys, which determine how the data is stored. Apart from being storage units, partitions also allow the user to efficiently identify the rows that satisfy a certain criterion. For example, a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the table.
CREATE TABLE partition_test (userid INT, page_url STRING, refer_url STRING, IP STRING)
PARTITIONED BY (timestamp INT, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
ALTER TABLE partition_test ADD PARTITION (timestamp=' ', country='US');
HDFS: /user/hive/warehouse/partition_test/date= /country=US

15 Data Units
Buckets: data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table. For example, the page_views table may be bucketed by userid, which is one of the columns (other than the partition columns) of the page_view table. Buckets can be used to efficiently sample the data.
HDFS: /user/hive/warehouse/partition_test/date= /country=US/part_00001
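The bucketing idea above can be sketched in Python: Hive assigns a row to a bucket by hashing the bucketing column modulo the number of buckets (this is a simplified model; Hive's actual hash function differs, and the helper name here is ours):

```python
def bucket_for(userid: int, num_buckets: int = 32) -> int:
    # Hive-style bucketing sketch: hash the bucketing column and take
    # it modulo the number of buckets. For integer columns the hash is
    # effectively the value itself.
    return userid % num_buckets

# Rows with the same userid always land in the same bucket file
# (part_00001, part_00002, ...), which is what makes sampling a
# bucket representative of that slice of users.
print(bucket_for(111))  # -> 15
print(bucket_for(222))  # -> 30
```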


17 Type System
Primitive Types
Integers: TINYINT, SMALLINT, INT, BIGINT
Boolean: BOOLEAN
Floating-point numbers: FLOAT, DOUBLE
String type: STRING
Complex Types
Structs, Maps, Arrays

18 Built-in Operators and Functions
Built-in Operators
Relational operators: =, !=, <, <=, >, >=, IS NULL, IS NOT NULL, LIKE, RLIKE/REGEXP
Arithmetic operators: +, -, *, /, %, |, ^, ~
Logical operators: AND/&&, OR/||, NOT/!
Operators on complex types: A[n], M[key], S.x

19 Built-in Operators and Functions
Built-in Functions
Math: round, floor, ceil, rand
String: concat, substr, upper/ucase, lower/lcase, trim/ltrim/rtrim, regexp_replace
Date/conversion: cast, from_unixtime, to_date, year, month, day, get_json_object
Aggregation: count, sum, avg, min, max

20 Usage and Examples
Creating Tables
CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(date STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '1'
STORED AS SEQUENCEFILE;
Browsing Tables and Partitions
SHOW TABLES;
SHOW TABLES 'page.*';
SHOW PARTITIONS page_view;
DESCRIBE (EXTENDED) page_view;

21 Usage and Examples
Altering Tables
ALTER TABLE old RENAME TO new;
ALTER TABLE old REPLACE COLUMNS (c1 TYPE, …);
ALTER TABLE old ADD COLUMNS (c1 INT COMMENT 'a new int column',
    c2 STRING DEFAULT 'def val');
Dropping Tables and Partitions
DROP TABLE pv_users;
ALTER TABLE pv_users DROP PARTITION (ds=' ');

22 Loading Data
Method 1
CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING, ip STRING,
    country STRING)
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

hadoop dfs -put /tmp/pv_ txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt=' ', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';

23 Loading Data
Method 2
LOAD DATA LOCAL INPATH '/tmp/pv_ _us.txt' INTO TABLE page_view
PARTITION (date=' ', country='US');
Method 3
LOAD DATA INPATH '/user/data/pv_ _us.txt' INTO TABLE page_view
PARTITION (date=' ', country='US');

24 Querying and Inserting Data
Simple Query / Insert / Select
INSERT OVERWRITE TABLE user_active
SELECT user.* FROM user
WHERE user.active = 1;

25 Querying and Inserting Data
Partition-Based Query
INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.* FROM page_views
WHERE page_views.date >= ' ' AND page_views.date <= ' '
    AND page_views.referrer_url LIKE '%xyz.com';
Joins
INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id)
    JOIN friend_list f ON (u.id = f.uid)
WHERE pv.date = ' ';

26 Querying and Inserting Data
Aggregations
Allowed:
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid),
    count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
Not allowed (multiple DISTINCT expressions on different columns):
SELECT pv_users.gender, count(DISTINCT pv_users.userid),
    count(DISTINCT pv_users.ip)

27 Querying and Inserting Data
Multi Table/File Inserts
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count_distinct(pv_users.userid)
    GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count_distinct(pv_users.userid)
    GROUP BY pv_users.age;
Dynamic-Partition Insert
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt=' ', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country;

28 Querying and Inserting Data
Sampling
Choose the 3rd bucket out of 32 buckets:
INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.* FROM pv_gender_sum
TABLESAMPLE(BUCKET 3 OUT OF 32);
Choose the 3rd and 19th buckets out of 32 buckets:
TABLESAMPLE(BUCKET 3 OUT OF 16)
Choose half of the 3rd bucket:
TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)
The buckets are numbered starting from 0.
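The bucket arithmetic on this slide can be checked with a short Python sketch: on a table clustered into 32 buckets, TABLESAMPLE(BUCKET x OUT OF y) selects every y-th bucket starting from x (a simplified model of Hive's behavior; the helper name is ours, and the y > 32 case, which samples a fraction of one bucket, is out of scope for the sketch):

```python
def sampled_buckets(x: int, y: int, total_buckets: int = 32) -> list:
    # Simplified model of TABLESAMPLE(BUCKET x OUT OF y): pick the
    # buckets whose index is congruent to x modulo y (buckets numbered
    # from 0, matching the slide).
    return [b for b in range(total_buckets) if b % y == x % y]

print(sampled_buckets(3, 32))  # -> [3]      just the 3rd bucket
print(sampled_buckets(3, 16))  # -> [3, 19]  the 3rd and 19th buckets
```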

29 Hive Architecture

30 Hive Architecture
1. User issues SQL query
2. Hive parses and plans the query (Compiler, Optimizer)
3. Query is converted to Map-Reduce
4. Map-Reduce is run by Hadoop (Executor)

31 Services
CLI: Command Line Interface.
HiveServer: allows a remote client to submit requests to Hive; exports a Thrift interface. (Thrift: for scalable cross-language services development; combines a software stack with a code-generation engine to build services that work efficiently among different languages.) HiveServer cannot handle concurrent requests from more than one client and was removed starting in Hive 1.0.0. HiveServer2 is a rewrite of HiveServer that addresses these problems.
HWI: Hive Web Interface.

32 Driver
Compiler: Parser → Semantic Analyzer → Logical Plan Generator → Logical Optimizer → Physical Plan Generator → Physical Optimizer
Optimizer
Executor

33 Hive Architecture

34 SerDe
Built-in SerDes: Avro, ORC, RegEx, Thrift, Parquet, CSV
Third-party SerDes: e.g., jsonserde.jar

35 Examples – SQL to Map-Reduce

36 Join
SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);

page_view:            user:                 pv_users:
pageid userid time    userid age gender     pageid age
1      111    09:08:01  111  25  female  X  = 1    25
2      111    09:08:13  222  32  male       2      25
1      222    09:08:14                      1      32

37 Join – in Map-Reduce
Map: emit (userid, <table_tag, value>) pairs (tag 1 = page_view, tag 2 = user).
  From page_view: 111 → <1, 1>, 111 → <1, 2>, 222 → <1, 1>
  From user:      111 → <2, 25>, 222 → <2, 32>
Shuffle/Sort: group by userid.
  111: <1, 1>, <1, 2>, <2, 25>
  222: <1, 1>, <2, 32>
Reduce: cross-product the two tables' values per key.
  111 → (1, 25), (2, 25)
  222 → (1, 32)
pv_users: (pageid, age) = (1, 25), (2, 25), (1, 32)
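The reduce-side join sketched on this slide can be simulated in a few lines of Python (table rows follow the slide's example; tag 1 = page_view, tag 2 = user, as in the <tag, value> pairs):

```python
from collections import defaultdict

page_view = [(1, 111), (2, 111), (1, 222)]   # (pageid, userid)
user = [(111, 25), (222, 32)]                # (userid, age)

# Map phase: emit (join key, (table tag, value)).
mapped = [(uid, (1, pageid)) for pageid, uid in page_view]
mapped += [(uid, (2, age)) for uid, age in user]

# Shuffle/sort: group tagged values by join key (userid).
groups = defaultdict(list)
for key, tagged in mapped:
    groups[key].append(tagged)

# Reduce phase: cross-product the two tables' values within each key.
pv_users = []
for key, tagged in sorted(groups.items()):
    pages = [v for t, v in tagged if t == 1]
    ages = [v for t, v in tagged if t == 2]
    pv_users += [(p, a) for p in pages for a in ages]

print(pv_users)  # [(1, 25), (2, 25), (1, 32)]
```

The tag is what lets a single reducer tell the two input tables apart after the shuffle has mixed their rows together.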

38 Group By
SQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

pv_users:      pageid_age_sum:
pageid age     pageid age count
1      25      1      25  1
2      25      1      32  1
1      32      2      25  2
2      25

39 Group By – in Map-Reduce
Map: each mapper emits (<pageid, age>, 1) for its rows.
  Mapper 1 (rows (1, 25), (2, 25)): <1, 25> → 1, <2, 25> → 1
  Mapper 2 (rows (1, 32), (2, 25)): <1, 32> → 1, <2, 25> → 1
Shuffle/Sort: group by <pageid, age>.
  Reducer 1: <1, 25> → {1}, <1, 32> → {1}
  Reducer 2: <2, 25> → {1, 1}
Reduce: sum the values per key.
  Reducer 1 → (1, 25, 1), (1, 32, 1)
  Reducer 2 → (2, 25, 2)
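The group-by pipeline above collapses to a few lines of Python (the rows are our reconstruction of the slide's example; Counter plays the role of map, shuffle, and the summing reduce in one step):

```python
from collections import Counter

pv_users = [(1, 25), (2, 25), (1, 32), (2, 25)]   # (pageid, age)

# Map: emit ((pageid, age), 1). Shuffle: group equal keys together.
# Reduce: sum the 1s per key. Counter does all three at once.
pageid_age_sum = Counter(pv_users)

print(sorted(pageid_age_sum.items()))
# [((1, 25), 1), ((1, 32), 1), ((2, 25), 2)]
```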

40 Group By with Distinct
SQL:
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view
GROUP BY pageid;

page_view:              result:
pageid userid time      pageid count_distinct_userid
1      111    09:08:01  1      2
2      111    09:08:13  2      1
1      222    09:08:14
2      111    09:08:20

41 Group By with Distinct – in Map-Reduce
Map: emit the compound key <pageid, userid>.
  Mapper 1 (rows (1, 111, 09:08:01), (2, 111, 09:08:13)): <1, 111>, <2, 111>
  Mapper 2 (rows (1, 222, 09:08:14), (2, 111, 09:08:20)): <1, 222>, <2, 111>
Shuffle/Sort: partition by pageid; sort by <pageid, userid> so duplicate userids arrive adjacent.
  Reducer 1: <1, 111>, <1, 222>
  Reducer 2: <2, 111>, <2, 111>
Reduce: count distinct userids per pageid.
  Reducer 1 → (1, 2)
  Reducer 2 → (2, 1)
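The distinct-count pattern above can be simulated in Python (rows are our reconstruction of the slide's example; a set per pageid stands in for the sorted reducer input where duplicates are adjacent):

```python
from collections import defaultdict

page_view = [(1, 111), (2, 111), (1, 222), (2, 111)]  # (pageid, userid)

# Map: emit the compound key <pageid, userid>. Shuffle partitions by
# pageid; here a set per pageid deduplicates, just as a reducer
# skipping adjacent duplicate keys would.
by_page = defaultdict(set)
for pageid, userid in page_view:
    by_page[pageid].add(userid)

# Reduce: the distinct count is the number of unique userids per page.
result = {pageid: len(users) for pageid, users in by_page.items()}
print(sorted(result.items()))  # [(1, 2), (2, 1)]
```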

42 Order By
SQL: SELECT * FROM page_view ORDER BY time;

page_view (input):      page_view (output):
pageid userid time      pageid userid time
2      111    09:08:13  1      111    09:08:01
1      111    09:08:01  2      111    09:08:13
2      111    09:08:20  1      222    09:08:14
1      222    09:08:14  2      111    09:08:20

43 Order By – in Map-Reduce
Map: emit (time, <pageid, userid>) pairs.
  Mapper 1 (rows (2, 111, 09:08:13), (1, 111, 09:08:01)): 09:08:01 → <1, 111>, 09:08:13 → <2, 111>
  Mapper 2 (rows (2, 111, 09:08:20), (1, 222, 09:08:14)): 09:08:14 → <1, 222>, 09:08:20 → <2, 111>
Shuffle/Sort: keys are range-partitioned so later reducers receive larger times; within each reducer the keys arrive sorted.
Reduce: output rows in key order; concatenating the reducers' outputs yields the table totally ordered by time.

44 Sort By – in Map-Reduce
Map: emit (time, <pageid, userid>) pairs, but rows are hash-partitioned across reducers instead of range-partitioned.
  Reducer 1 receives 09:08:01 → <1, 111>, 09:08:14 → <1, 222> and outputs
    (1, 111, 09:08:01), (1, 222, 09:08:14)
  Reducer 2 receives 09:08:13 → <2, 111>, 09:08:20 → <2, 111> and outputs
    (2, 111, 09:08:13), (2, 111, 09:08:20)
Each reducer's output is sorted by time, but there is no total order across reducers.
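The contrast between ORDER BY and SORT BY comes down to the partitioner, which a short Python sketch can show: every reducer sorts its own input, but only range partitioning makes the concatenated outputs globally sorted (rows follow the slides' example; the partitioning functions are our simplified models):

```python
# Rows as (pageid, userid, time), from the slides' example.
rows = [(2, 111, "09:08:13"), (1, 111, "09:08:01"),
        (2, 111, "09:08:20"), (1, 222, "09:08:14")]

def run(partition):
    # Shuffle rows to 2 reducers with the given partitioner, then
    # sort within each reducer by time (what every reducer does).
    reducers = [[], []]
    for r in rows:
        reducers[partition(r)].append(r)
    return [sorted(part, key=lambda r: r[2]) for part in reducers]

# ORDER BY: range-partition on time -> concatenation is globally sorted.
order_by = run(lambda r: 0 if r[2] < "09:08:14" else 1)
# SORT BY: hash-partition (here: on pageid) -> sorted per reducer only.
sort_by = run(lambda r: r[0] % 2)

print(order_by[0] + order_by[1])  # total order by time
print(sort_by)                    # two sorted runs, no global order
```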

45 Merge Sequential Map-Reduce Jobs
SQL:
SELECT …… FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key;
Both joins use the same key (a.key), so instead of two sequential Map-Reduce jobs (a join b, then join c), Hive can evaluate the whole query in a single Map-Reduce job.

a:           b:           c:           result:
key av       key bv       key cv       key av  bv  cv
1   111      1   222      1   333      1   111 222 333

46 Share Common Read Operations
Extended SQL:
FROM pv_users
INSERT INTO TABLE pv_pageid_sum
    SELECT pageid, count(1)
    GROUP BY pageid
INSERT INTO TABLE pv_age_sum
    SELECT age, count(1)
    GROUP BY age;

pv_users:      pv_pageid_sum:    pv_age_sum:
pageid age     pageid count      age count
1      25      1      1          25  1
2      32      2      1          32  1

Both inserts are computed from a single scan of pv_users.
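The point of the multi-insert form is that pv_users is read once while both aggregations are computed; a one-pass Python sketch (rows follow the slide's example, as we read it):

```python
from collections import Counter

pv_users = [(1, 25), (2, 32)]   # (pageid, age)

# A single scan of the input feeds both aggregations, mirroring how
# the multi-insert FROM clause shares one read of pv_users.
pv_pageid_sum, pv_age_sum = Counter(), Counter()
for pageid, age in pv_users:
    pv_pageid_sum[pageid] += 1
    pv_age_sum[age] += 1

print(sorted(pv_pageid_sum.items()))  # [(1, 1), (2, 1)]
print(sorted(pv_age_sum.items()))     # [(25, 1), (32, 1)]
```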

