1 CSE 491/891 Lecture 24 (Hive)

2 Outline
Previous lecture:
- How to run Hive
- How to create, drop, and alter tables
- Difference between managed and external tables
Today's lecture:
- Note on launching the Beeline interactive shell
- Queries in Hive
- Partitions
- Mappers and reducers in Hive
- How to do word counting in Hive

3 Note on Launching Beeline on EMR
For simple queries, it is sufficient to log in to Hive without any username or password. For more complicated queries that require launching MapReduce jobs, you should log in with the username "hadoop" (the password can be empty).

4 Note on Launching Beeline on EMR
hadoop> beeline -u <url_for_hive_server> -n <username> -p <password>
Use this option to connect to Beeline on AWS if you need to perform aggregate queries or other queries that involve creating MapReduce jobs.
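For example, on an EMR master node HiveServer2 typically listens on port 10000, so a connection as the hadoop user would look like the following (the exact URL is an assumption; check your cluster's configuration):

  hadoop> beeline -u jdbc:hive2://localhost:10000 -n hadoop

The -p option can be omitted here, since the password can be empty.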

5 Example 1: Wikipedia Edits
This creates a managed table: the LOAD DATA command copies the data into Hive's warehouse directory, /user/hive/warehouse.
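The slide's commands are not reproduced in the transcript. A minimal sketch of what they plausibly look like, assuming hypothetical column names for the wiki_edit table that the later queries reference:

  -- Hypothetical schema; the real slide may use different columns
  CREATE TABLE wiki_edit (
    users STRING,     -- user who made the edit
    page STRING,      -- page title, e.g. 'Anarchism'
    edit_time STRING  -- timestamp of the revision
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

  -- Copies the file (input path hypothetical) into /user/hive/warehouse/wiki_edit
  LOAD DATA INPATH 'wiki-edits.txt' INTO TABLE wiki_edit;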

6 Retrieval Queries (Examples)
Find users who have edited the page ‘Anarchism’
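The query itself is not shown in the transcript; against the hypothetical schema sketched above, it could be written as:

  SELECT DISTINCT users
  FROM wiki_edit
  WHERE page = 'Anarchism';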

7 Retrieval Queries (Examples)
Count number of revisions for the Anarchism page
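A hedged version of this query, using the same assumed schema:

  SELECT COUNT(*)
  FROM wiki_edit
  WHERE page = 'Anarchism';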

8 Retrieval Queries (Examples)
Count number of revisions made by 5 of the users
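Which five users the slide picks is not recoverable from the transcript; one plausible sketch that returns per-user revision counts for five users:

  SELECT users, COUNT(*) AS revisions
  FROM wiki_edit
  GROUP BY users
  LIMIT 5;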

9 Retrieval Queries
SELECT [ALL | DISTINCT] select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]
For the complete reference, see the Hive Language Manual.

10 ORDER, CLUSTER, DISTRIBUTE, SORT
- SORT BY: specifies the columns used to sort the data within each reducer. Unlike ORDER BY, which guarantees total order in the output, SORT BY guarantees only the ordering of rows within a reducer.
- DISTRIBUTE BY: specifies the columns used to distribute the rows among reducers; mainly used with map-reduce scripts (see later slides).
- CLUSTER BY: equivalent to DISTRIBUTE BY followed by SORT BY on the same columns.
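To make the distinction concrete, here is a sketch using the wiki_edit table assumed earlier; the first query produces one globally ordered result, while the second orders rows only within each reducer's output:

  SELECT users, page FROM wiki_edit ORDER BY users;  -- total order
  SELECT users, page FROM wiki_edit SORT BY users;   -- per-reducer order only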

11 Store Query Result in a Table
It is often convenient to store the query result in a temporary table because:
- the output is far too large to be dumped to the console, or
- you need further processing on the output
CREATE TABLE tableName AS SELECT ... FROM ...
INSERT OVERWRITE TABLE tableName SELECT ... FROM ...
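As an illustration (table and column names assumed from the earlier Wikipedia example):

  CREATE TABLE anarchism_editors AS
  SELECT DISTINCT users FROM wiki_edit WHERE page = 'Anarchism';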

12 Example 2: Bestbuy Search Queries
This creates an external table: the data stays in /user/hadoop/bestbuy and is not copied into Hive's warehouse directory.
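The CREATE statement itself is missing from the transcript; a plausible sketch, with hypothetical column names that the queries below assume:

  -- Hypothetical schema; the data stays at the LOCATION path
  CREATE EXTERNAL TABLE bestbuy (
    user_id STRING,   -- ID of the user issuing the search
    query STRING      -- the search query text
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/user/hadoop/bestbuy';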

13 Example 2: Bestbuy Search Queries
Find all the unique query terms about ipad:
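A hedged version of the query; lower() guards against mixed-case query text, which is an assumption about the data:

  SELECT DISTINCT query
  FROM bestbuy
  WHERE lower(query) LIKE '%ipad%';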

14 Example 2: Bestbuy Search Queries
List the IDs of users who searched for ipad:
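A sketch under the same assumed schema:

  SELECT DISTINCT user_id
  FROM bestbuy
  WHERE lower(query) LIKE '%ipad%';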

15 Example 2: Bestbuy Search Queries
Find the IDs and queries of ipad users who searched for playstation:
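One plausible reading of this query is a self-join on user_id, returning the ipad queries of users who also searched for playstation:

  SELECT DISTINCT a.user_id, a.query
  FROM bestbuy a JOIN bestbuy b ON (a.user_id = b.user_id)
  WHERE lower(a.query) LIKE '%ipad%'
    AND lower(b.query) LIKE '%playstation%';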

16 Multi-Table Inserts In HiveQL, a query can start with the FROM clause, which allows several insert queries to be performed over a single scan of the source table (see the sketch on the next slide)

17 Multi-Table Inserts
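The slide's example is not in the transcript; a hedged sketch over the bestbuy table, with hypothetical target tables, that fills two tables from a single scan of the source:

  FROM bestbuy
  INSERT OVERWRITE TABLE ipad_searches
    SELECT user_id, query WHERE lower(query) LIKE '%ipad%'
  INSERT OVERWRITE TABLE playstation_searches
    SELECT user_id, query WHERE lower(query) LIKE '%playstation%';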

18 Join Query HiveQL supports inner and outer joins
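The slide's query is missing; a minimal inner-join sketch, assuming the hypothetical users table (id, name, state) that appears in the partitioning example later in the lecture:

  SELECT b.user_id, b.query, u.state
  FROM bestbuy b JOIN users u ON (b.user_id = u.id);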

19 Join Query HiveQL supports outer joins, which allow you to find non-matching rows in the tables being joined
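A hedged outer-join sketch over the same assumed tables; the IS NULL filter keeps exactly the non-matching rows:

  -- Users who never issued a search
  SELECT u.id, u.name
  FROM users u LEFT OUTER JOIN bestbuy b ON (u.id = b.user_id)
  WHERE b.user_id IS NULL;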

20 Partitions
Relational databases use indexes on columns to speed up queries that filter on those columns. Hive instead organizes tables into partitions, a way of dividing a table into coarse-grained parts based on the value of a partition column. For example, a state column will partition the table into 50 partitions (one per state). Hive physically stores different partitions in different directories, so using partitions can make it faster to answer queries on slices of the data.

21 Partitions
Partitioned tables are created using the PARTITIONED BY clause. A table can have one or more partition columns, and a separate data directory is created for each distinct value combination of the partition columns. Note: a partition column must not also be declared among the regular (non-partition) columns when defining the table.

22 Partitioning Example Suppose you have user data from several states
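The slide's DDL is not in the transcript; a plausible sketch (column names hypothetical), with the partition column declared only in PARTITIONED BY:

  CREATE TABLE users (
    id INT,
    name STRING
  )
  PARTITIONED BY (state STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';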

23 Partitioning Example
Now you can load data for each partition. Note: LOAD DATA LOCAL INPATH lets you upload data from your local filesystem directly to HDFS.
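A sketch of the per-partition loads (file names hypothetical):

  LOAD DATA LOCAL INPATH 'users-mi.txt'
  INTO TABLE users PARTITION (state = 'MI');

  LOAD DATA LOCAL INPATH 'users-oh.txt'
  INTO TABLE users PARTITION (state = 'OH');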

24 Partitioning Example

25 Partitioning Example

26 Partitioning Example You can use the partitioning column in a query
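A hedged example against the sketched users table; because each partition lives in its own directory, Hive reads only the state=MI data:

  SELECT id, name
  FROM users
  WHERE state = 'MI';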

27 Mappers and Reducers in Hive
Hive allows users to define their own mappers and reducers for processing data streams (similar to Hadoop Streaming, lecture 10):
hadoop jar <streaming_jar_file>
  -input <input_directory>
  -output <output_directory>
  -mapper <mapper_script>
  -reducer <reducer_script>
  -file <mapper_script_filename>
  -file <reducer_script_filename>
  -numReduceTasks <number_of_reducers>

28 Mappers and Reducers in Hive
You can write your own mappers and reducers in Python, Perl, or any other language and use them in Hive. Why would you need your own mappers and reducers? So you can do more complex processing of the input data and write the output into a table. For example, how would you do word count in Hive? Can you write an SQL query to do this? In the next few slides, we will show how to write a HiveQL script that does it.

29 Mappers and Reducers in Hive
Steps:
1. Write the mapper script
2. Write the reducer script
3. Write the HiveQL script that does the following:
   - Create an input data table
   - Create a table for storing the reducer output
   - Import the mapper and reducer scripts into Hive
   - Load the output of the reducer into the reducer output table

30 Hive Syntax for loading data from one source table to a destination table:
FROM source_table
INSERT OVERWRITE TABLE destination_table
<SQL query to be applied to source_table>

Example:
FROM wiki_edit
INSERT OVERWRITE TABLE editors
SELECT DISTINCT users;

31 Mappers and Reducers in Hive
FROM mapper_output
INSERT OVERWRITE TABLE output_table
( apply reducer to the mapper_output )

Apply the reducer to each (key, value) pair output by the mapper and store the reducer output in output_table.

32 Mappers and Reducers in Hive
FROM (
  FROM input_table
  MAP columns_to_apply_the_mapper
  USING 'mapper-script-filename'
  AS map_key, map_value   -- output of mapper
  CLUSTER BY map_key      -- column for partitioning the mapper output
) mapper_output
INSERT OVERWRITE TABLE output_table
( apply reducer to the mapper_output )

mapper_output is an alias referring to the output of the mapper. The CLUSTER BY clause specifies how to partition (shuffle) the mapper output to each reducer (it works the same way as the Partitioner class in Hadoop).

33 Mappers and Reducers in Hive
FROM (
  FROM input_data_table
  MAP columns_to_apply_the_mapper
  USING 'mapper-script-filename'
  AS map_key, map_value   -- output of mapper
  CLUSTER BY map_key      -- column for partitioning the mapper output
) mapper_output
INSERT OVERWRITE TABLE output_table
REDUCE mapper_output.map_key, mapper_output.map_value
USING 'reducer-script-filename'
AS reducer_key, reducer_value;   -- output of reducer

34 Example: WordCount in Hive
Steps:
1. Write the mapper script
2. Write the reducer script
3. Write the HiveQL script that does the following:
   - Create an input table called document; the table has only 1 column (called sentences), and each row in the table is a line of the input document file
   - Create a table called output; the table has 2 columns (word, count), and each row will store one line of the reducer output
   - Add the mapper and reducer scripts into Hive
   - Insert into the output table the word counts obtained after applying the mapper and reducer

35 WordCount in Hive
Mapper.pl (written in Perl): reads each line, tokenizes it into a set of words, and outputs each word with a count of 1 (tab-separated).
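The script itself is not reproduced in the transcript; a minimal Perl sketch that behaves as described:

  #!/usr/bin/perl
  # Read each input line, split it into words, and emit
  # "word<TAB>1" for every word found.
  use strict;
  use warnings;

  while (my $line = <STDIN>) {
      chomp $line;
      foreach my $word (split /\s+/, $line) {
          print "$word\t1\n" if length $word;
      }
  }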

36 WordCount in Hive
Reducer.pl: takes the mapper output and adds up the counts for each word; outputs each word with its total frequency.
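Again a hedged sketch; it relies on the rows arriving grouped by word, which the CLUSTER BY clause in the HiveQL script guarantees:

  #!/usr/bin/perl
  # Sum the counts for each word and emit "word<TAB>total"
  # whenever the word changes; input arrives grouped by word.
  use strict;
  use warnings;

  my $current;
  my $count = 0;
  while (my $line = <STDIN>) {
      chomp $line;
      my ($word, $n) = split /\t/, $line;
      if (defined $current && $word ne $current) {
          print "$current\t$count\n";
          $count = 0;
      }
      $current = $word;
      $count += $n;
  }
  print "$current\t$count\n" if defined $current;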

37 WordCount in Hive (wordcount.sql)
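The script is not in the transcript; its first half plausibly sets up the tables and registers the scripts (paths hypothetical):

  -- Input table: one line of the document per row
  CREATE TABLE document (sentences STRING);
  LOAD DATA INPATH '/user/hadoop/wordcount/document.txt'
  INTO TABLE document;

  -- Output table for the reducer results
  CREATE TABLE output (word STRING, count INT);

  -- Ship the scripts to the cluster
  ADD FILE mapper.pl;
  ADD FILE reducer.pl;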

38 WordCount in Hive (wordcount.sql)
CLUSTER BY: the mapper output will be partitioned based on the word and shuffled to the respective reducers.
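The second half of wordcount.sql then plausibly follows the skeleton from slide 33; cnt is used as the alias to avoid clashing with the COUNT keyword, and the REDUCE output is inserted positionally into output(word, count):

  FROM (
    FROM document
    MAP sentences
    USING 'mapper.pl'
    AS word, cnt        -- mapper emits word<TAB>1
    CLUSTER BY word     -- shuffle all rows for a word to one reducer
  ) mapper_output
  INSERT OVERWRITE TABLE output
  REDUCE mapper_output.word, mapper_output.cnt
  USING 'reducer.pl'
  AS word, cnt;         -- reducer emits word<TAB>total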

39 Executing WordCount in Hive
Make the input directories in HDFS, then upload the input data (document.txt) to HDFS.
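A hedged pair of commands (the directory path is an assumption):

  hadoop fs -mkdir -p /user/hadoop/wordcount
  hadoop fs -put document.txt /user/hadoop/wordcount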

40 Executing WordCount in Hive
Connect to the Hive server using hadoop as the username, then run the script by typing: source <scriptname>

41 Executing WordCount in Hive
View the results
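The result listing is not in the transcript; a simple query such as the following would display it:

  SELECT * FROM output LIMIT 10;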

