Download presentation
Presentation is loading. Please wait.
Published byFrank Powell Modified over 9 years ago
1
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151
2
Motivation Yahoo worked on Pig to facilitate application deployment on Hadoop. – Their need mainly was focused on unstructured data Simultaneously Facebook started working on deploying warehouse solutions on Hadoop that resulted in Hive. – The size of data being collected and analyzed in industry for business intelligence (BI) is growing rapidly making traditional warehousing solution prohibitively expensive. – Hive attempts to the leverage the extensive SQL skilled workforce to do big data analytics 8/18/20152
3
Hadoop MR MR is very low level and requires customers to write custom programs. HIVE supports queries expressed in SQL-like language called HiveQL which are compiled into MR jobs that are executed on Hadoop. Hive also allows MR scripts It also includes MetaStore that contains schemas and statistics that are useful for data explorations, query optimization and query compilation. At Facebook Hive warehouse contains tens of thousands of tables, stores over 700TB and is used for reporting and ad- hoc analyses by 200 Fb users (numbers at the time the paper was written) 8/18/20153
4
Data model Hive structures data into well-understood database concepts such as: tables, rows, cols, partitions It supports primitive types: integers, floats, doubles, and strings Hive also supports: – associative arrays: map – Lists: list – Structs: struct SerDe: serialize and deserialized API is used to move data in and out of tables 8/18/20154
5
Query Language (HiveQL) Subset of SQL Meta-data queries Limited equality and join predicates No inserts on existing tables (to preserve worm property) – Can overwrite an entire table 8/18/20155
6
Wordcount in Hive FROM ( MAP doctext USING 'python wc_mapper.py' AS (word, cnt) FROM docs CLUSTER BY word ) a REDUCE word, cnt USING 'pythonwc_reduce.py'; 8/18/20156
7
Session/tmstamp example FROM ( FROM session_table SELECT sessionid, tstamp, data DISTRIBUTE BY sessionid SORT BY tstamp ) a REDUCE sessionid, tstamp, data USING 'session_reducer.sh'; 8/18/20157
8
Data Storage Tables are logical data units; table metadata associates the data in the table to hdfs directories. Hdfs namespace: tables (hdfs directory), partition (hdfs subdirectory), buckets (subdirectories within partition) /user/hive/warehouse/test_table is a hdfs directory 8/18/20158
9
Hive architecture (from the paper) 8/18/20159
10
Architecture Metastore: stores system catalog Driver: manages life cycle of HiveQL query as it moves thru’ HIVE; also manages session handle and session statistics Query compiler: Compiles HiveQL into a directed acyclic graph of map/reduce tasks Execution engines: The component executes the tasks in proper dependency order; interacts with Hadoop HiveServer: provides Thrift interface and JDBC/ODBC for integrating other applications. Client components: CLI, web interface, jdbc/odbc inteface Extensibility interface include SerDe, User Defined Functions and User Defined Aggregate Function. 8/18/201510
11
Sample Query Plan 8/18/201511
12
Hive Usage in Facebook Hive and Hadoop are extensively used in Facbook for different kinds of operations. 700 TB = 2.1Petabyte after replication! Think of other application models that can leverage Hadoop MR. 8/18/201512
13
General approach to learning Hive Similar to RDBMS DDL: Data definition language – Schema definition, create table, meta data defn. DML: Data Manipulation Language – Load table HiveQL: Query language – Select, insert etc. Underlying structure is either HDFS/hadoop or HBase 8/18/201513
14
Demo on Amazon aws Goal: – Understand amazon aws – Essential services among the many services provided – Hive setup and query Demo URL: http://docs.aws.amazon.com/gettingstarted/latest/e mr/getting-started-emr-tutorial.html 8/18/201514
15
Discussion of the Demo Problem Data: Apache collected web logs with IP address of the client (want to know who accessed your website? ) Goal: web log is unstructured; big data; collect and clean the data and load using Hive; query the data using Hive – Deserialize and Serialize into and from Hive.. So the data is called SerDe – Count the line – See the first line of data – Count the number of accesses from a particular IP 8/18/201515
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.