1 CSE 491/891 Lecture 24 (Hive)

2 Outline
Previous lecture:
- How to run Hive
- How to create, drop, and alter tables
- Difference between managed and external tables
Today's lecture:
- Note on launching the Beeline interactive shell
- Queries in Hive
- Partitions
- Mappers and reducers in Hive
- How to do word counting in Hive

3 Note on Launching Beeline on EMR
For simple queries, it is sufficient to log in to Hive without any username or password. For more complicated queries that require launching MapReduce jobs, you should log in with the username "hadoop" (the password can be empty).

4 Note on Launching Beeline on EMR
hadoop> beeline -u <url_for_hive_server> -n <username> -p <password>
Use this option to connect to Beeline on AWS if you need to perform aggregate queries or other queries that involve creating MapReduce jobs.
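For example, on an EMR master node HiveServer2 typically listens on port 10000, so a connection as the hadoop user would look like the following (the exact URL is an assumption; check your cluster's configuration):

  hadoop> beeline -u jdbc:hive2://localhost:10000 -n hadoop

The -p option can be omitted here, since the password can be empty.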

5 Example 1: Wikipedia Edits
This creates a managed table: the LOAD DATA command copies the data into Hive's warehouse directory, /user/hive/warehouse.
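The slide's commands are not reproduced in the transcript. A minimal sketch of what they plausibly look like, assuming hypothetical column names for the wiki_edit table that the later queries reference:

  -- Hypothetical schema; the real slide may use different columns
  CREATE TABLE wiki_edit (
    users STRING,     -- user who made the edit
    page STRING,      -- page title, e.g. 'Anarchism'
    edit_time STRING  -- timestamp of the revision
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

  -- Copies the file (input path hypothetical) into /user/hive/warehouse/wiki_edit
  LOAD DATA INPATH 'wiki-edits.txt' INTO TABLE wiki_edit;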

6 Retrieval Queries (Examples)
Find users who have edited the page ‘Anarchism’
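The query itself is not shown in the transcript; against the hypothetical schema sketched above, it could be written as:

  SELECT DISTINCT users
  FROM wiki_edit
  WHERE page = 'Anarchism';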

7 Retrieval Queries (Examples)
Count number of revisions for the Anarchism page
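A hedged version of this query, using the same assumed schema:

  SELECT COUNT(*)
  FROM wiki_edit
  WHERE page = 'Anarchism';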

8 Retrieval Queries (Examples)
Count number of revisions made by 5 of the users
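Which five users the slide picks is not recoverable from the transcript; one plausible sketch that returns per-user revision counts for five users:

  SELECT users, COUNT(*) AS revisions
  FROM wiki_edit
  GROUP BY users
  LIMIT 5;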

9 Retrieval Queries
SELECT [ALL | DISTINCT] select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]
For the complete reference, see the Hive Language Manual.

10 ORDER, CLUSTER, DISTRIBUTE, SORT
- SORT BY: specifies the columns used to sort the data within each reducer. Unlike ORDER BY, which guarantees total order in the output, SORT BY guarantees only the ordering of rows within a reducer.
- DISTRIBUTE BY: specifies the columns used to distribute the rows among reducers; mainly used with map-reduce scripts (see later slides).
- CLUSTER BY: equivalent to DISTRIBUTE BY followed by SORT BY on the same columns.
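To make the distinction concrete, here is a sketch using the wiki_edit table assumed earlier; the first query produces one globally ordered result, while the second orders rows only within each reducer's output:

  SELECT users, page FROM wiki_edit ORDER BY users;  -- total order
  SELECT users, page FROM wiki_edit SORT BY users;   -- per-reducer order only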

11 Store Query Result in a Table
It is often convenient to store the query result in a temporary table because:
- the output is far too large to be dumped to the console, or
- you need further processing on the output
CREATE TABLE tableName AS SELECT ... FROM ...
INSERT OVERWRITE TABLE tableName SELECT ... FROM ...
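As an illustration (table and column names assumed from the earlier Wikipedia example):

  CREATE TABLE anarchism_editors AS
  SELECT DISTINCT users FROM wiki_edit WHERE page = 'Anarchism';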

12 Example 2: Bestbuy Search Queries
This creates an external table: the data stays in /user/hadoop/bestbuy and is not copied into Hive's warehouse directory.
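The CREATE statement itself is missing from the transcript; a plausible sketch, with hypothetical column names that the queries below assume:

  -- Hypothetical schema; the data stays at the LOCATION path
  CREATE EXTERNAL TABLE bestbuy (
    user_id STRING,   -- ID of the user issuing the search
    query STRING      -- the search query text
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/user/hadoop/bestbuy';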

13 Example 2: Bestbuy Search Queries
Find all the unique query terms about ipad:
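A hedged version of the query; lower() guards against mixed-case query text, which is an assumption about the data:

  SELECT DISTINCT query
  FROM bestbuy
  WHERE lower(query) LIKE '%ipad%';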

14 Example 2: Bestbuy Search Queries
List the IDs of users who searched for ipad:
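A sketch under the same assumed schema:

  SELECT DISTINCT user_id
  FROM bestbuy
  WHERE lower(query) LIKE '%ipad%';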

15 Example 2: Bestbuy Search Queries
Find the IDs and queries of ipad users who searched for playstation:
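One plausible reading of this query is a self-join on user_id, returning the ipad queries of users who also searched for playstation:

  SELECT DISTINCT a.user_id, a.query
  FROM bestbuy a JOIN bestbuy b ON (a.user_id = b.user_id)
  WHERE lower(a.query) LIKE '%ipad%'
    AND lower(b.query) LIKE '%playstation%';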

16 Multi-Table Inserts In HiveQL, a query can start with the FROM clause, which allows several insert queries to be performed over a single scan of the source table (see the sketch on the next slide)

17 Multi-Table Inserts
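The slide's example is not in the transcript; a hedged sketch over the bestbuy table, with hypothetical target tables, that fills two tables from a single scan of the source:

  FROM bestbuy
  INSERT OVERWRITE TABLE ipad_searches
    SELECT user_id, query WHERE lower(query) LIKE '%ipad%'
  INSERT OVERWRITE TABLE playstation_searches
    SELECT user_id, query WHERE lower(query) LIKE '%playstation%';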

18 Join Query HiveQL supports inner and outer joins
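The slide's query is missing; a minimal inner-join sketch, assuming the hypothetical users table (id, name, state) that appears in the partitioning example later in the lecture:

  SELECT b.user_id, b.query, u.state
  FROM bestbuy b JOIN users u ON (b.user_id = u.id);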

19 Join Query HiveQL supports outer joins, which allow you to find non-matching rows in the tables being joined
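A hedged outer-join sketch over the same assumed tables; the IS NULL filter keeps exactly the non-matching rows:

  -- Users who never issued a search
  SELECT u.id, u.name
  FROM users u LEFT OUTER JOIN bestbuy b ON (u.id = b.user_id)
  WHERE b.user_id IS NULL;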

20 Partitions
Relational databases use indexes on columns to speed up queries that filter on those columns. Hive instead organizes tables into partitions, a way of dividing a table into coarse-grained parts based on the value of a partition column. For example, a state column will partition the table into 50 partitions (one per state). Hive physically stores different partitions in different directories, so using partitions can make it faster to answer queries on slices of the data.

21 Partitions
Partitioned tables are created using the PARTITIONED BY clause. A table can have one or more partition columns, and a separate data directory is created for each distinct value combination of the partition columns. Note: a partition column must not also be declared among the regular (non-partition) columns when defining the table.

22 Partitioning Example Suppose you have user data from several states
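The slide's DDL is not in the transcript; a plausible sketch (column names hypothetical), with the partition column declared only in PARTITIONED BY:

  CREATE TABLE users (
    id INT,
    name STRING
  )
  PARTITIONED BY (state STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';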

23 Partitioning Example
Now you can load data for each partition. Note: LOAD DATA LOCAL INPATH lets you upload data from your local filesystem directly to HDFS.
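A sketch of the per-partition loads (file names hypothetical):

  LOAD DATA LOCAL INPATH 'users-mi.txt'
  INTO TABLE users PARTITION (state = 'MI');

  LOAD DATA LOCAL INPATH 'users-oh.txt'
  INTO TABLE users PARTITION (state = 'OH');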

24 Partitioning Example

25 Partitioning Example

26 Partitioning Example You can use the partitioning column in a query
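A hedged example against the sketched users table; because each partition lives in its own directory, Hive reads only the state=MI data:

  SELECT id, name
  FROM users
  WHERE state = 'MI';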

27 Mappers and Reducers in Hive
Hive allows users to define their own mappers and reducers for processing data streams (similar to Hadoop Streaming, lecture 10):
hadoop jar <streaming_jar_file>
  -input <input_directory>
  -output <output_directory>
  -mapper <mapper_script>
  -reducer <reducer_script>
  -file <mapper_script_filename>
  -file <reducer_script_filename>
  -numReduceTasks <number_of_reducers>

28 Mappers and Reducers in Hive
You can write your own mappers and reducers in Python, Perl, or any other language and use them in Hive. Why would you need your own mappers and reducers? So you can do more complex processing of the input data and write the output into a table. For example, how would you do word count in Hive? Can you write an SQL query to do this? In the next few slides, we will show how to write a HiveQL script that does it.

29 Mappers and Reducers in Hive
Steps:
1. Write the mapper script
2. Write the reducer script
3. Write the HiveQL script that does the following:
   - Create an input data table
   - Create a table for storing the reducer output
   - Import the mapper and reducer scripts into Hive
   - Load the output of the reducer into the reducer output table

30 Hive Syntax for loading data from one source table to a destination table:
FROM source_table
INSERT OVERWRITE TABLE destination_table
<SQL query to be applied to source_table>

Example:
FROM wiki_edit
INSERT OVERWRITE TABLE editors
SELECT DISTINCT users;

31 Mappers and Reducers in Hive
FROM mapper_output
INSERT OVERWRITE TABLE output_table
( apply reducer to the mapper_output )

Apply the reducer to each (key, value) pair output by the mapper and store the reducer output in output_table.

32 Mappers and Reducers in Hive
FROM (
  FROM input_table
  MAP columns_to_apply_the_mapper
  USING 'mapper-script-filename'
  AS map_key, map_value   -- output of mapper
  CLUSTER BY map_key      -- column for partitioning the mapper output
) mapper_output
INSERT OVERWRITE TABLE output_table
( apply reducer to the mapper_output )

mapper_output is an alias referring to the output of the mapper. The CLUSTER BY clause specifies how to partition (shuffle) the mapper output to each reducer (it works the same way as the Partitioner class in Hadoop).

33 Mappers and Reducers in Hive
FROM (
  FROM input_data_table
  MAP columns_to_apply_the_mapper
  USING 'mapper-script-filename'
  AS map_key, map_value   -- output of mapper
  CLUSTER BY map_key      -- column for partitioning the mapper output
) mapper_output
INSERT OVERWRITE TABLE output_table
REDUCE mapper_output.map_key, mapper_output.map_value
USING 'reducer-script-filename'
AS reducer_key, reducer_value;   -- output of reducer

34 Example: WordCount in Hive
Steps:
1. Write the mapper script
2. Write the reducer script
3. Write the HiveQL script that does the following:
   - Create an input table called document; the table has only 1 column (called sentences), and each row in the table is a line of the input document file
   - Create a table called output; the table has 2 columns (word, count), and each row will store one line of the reducer output
   - Add the mapper and reducer scripts into Hive
   - Insert into the output table the word counts obtained after applying the mapper and reducer

35 WordCount in Hive
Mapper.pl (written in Perl): reads each line, tokenizes it into a set of words, and outputs each word with a count of 1 (tab-separated).
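The script itself is not reproduced in the transcript; a minimal Perl sketch that behaves as described:

  #!/usr/bin/perl
  # Read each input line, split it into words, and emit
  # "word<TAB>1" for every word found.
  use strict;
  use warnings;

  while (my $line = <STDIN>) {
      chomp $line;
      foreach my $word (split /\s+/, $line) {
          print "$word\t1\n" if length $word;
      }
  }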

36 WordCount in Hive
Reducer.pl: takes the mapper output and adds up the counts for each word; outputs each word with its total frequency.
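Again a hedged sketch; it relies on the rows arriving grouped by word, which the CLUSTER BY clause in the HiveQL script guarantees:

  #!/usr/bin/perl
  # Sum the counts for each word and emit "word<TAB>total"
  # whenever the word changes; input arrives grouped by word.
  use strict;
  use warnings;

  my $current;
  my $count = 0;
  while (my $line = <STDIN>) {
      chomp $line;
      my ($word, $n) = split /\t/, $line;
      if (defined $current && $word ne $current) {
          print "$current\t$count\n";
          $count = 0;
      }
      $current = $word;
      $count += $n;
  }
  print "$current\t$count\n" if defined $current;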

37 WordCount in Hive (wordcount.sql)
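The script is not in the transcript; its first half plausibly sets up the tables and registers the scripts (paths hypothetical):

  -- Input table: one line of the document per row
  CREATE TABLE document (sentences STRING);
  LOAD DATA INPATH '/user/hadoop/wordcount/document.txt'
  INTO TABLE document;

  -- Output table for the reducer results
  CREATE TABLE output (word STRING, count INT);

  -- Ship the scripts to the cluster
  ADD FILE mapper.pl;
  ADD FILE reducer.pl;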

38 WordCount in Hive (wordcount.sql)
CLUSTER BY: the mapper output will be partitioned based on the word and shuffled to the respective reducers.
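The second half of wordcount.sql then plausibly follows the skeleton from slide 33; cnt is used as the alias to avoid clashing with the COUNT keyword, and the REDUCE output is inserted positionally into output(word, count):

  FROM (
    FROM document
    MAP sentences
    USING 'mapper.pl'
    AS word, cnt        -- mapper emits word<TAB>1
    CLUSTER BY word     -- shuffle all rows for a word to one reducer
  ) mapper_output
  INSERT OVERWRITE TABLE output
  REDUCE mapper_output.word, mapper_output.cnt
  USING 'reducer.pl'
  AS word, cnt;         -- reducer emits word<TAB>total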

39 Executing WordCount in Hive
Make the input directories in HDFS, then upload the input data (document.txt) to HDFS.
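A hedged pair of commands (the directory path is an assumption):

  hadoop fs -mkdir -p /user/hadoop/wordcount
  hadoop fs -put document.txt /user/hadoop/wordcount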

40 Executing WordCount in Hive
Connect to the Hive server using hadoop as the username, then run the script by typing: source <scriptname>

41 Executing WordCount in Hive
View the results
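The result listing is not in the transcript; a simple query such as the following would display it:

  SELECT * FROM output LIMIT 10;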

