A Warehousing Solution Over a Map-Reduce Framework


Hive: A Warehousing Solution Over a Map-Reduce Framework
CS848, Lei Yao, 19/09/2016

Outline
1. What is Hive? Why Hive? What can Hive do?
2. Hive Database
3. Hive architecture
4. Interesting research questions

What is Hive?
A data warehouse infrastructure built on top of Hadoop for querying and managing large data sets. An Apache Software Foundation project.

Why Hive?
Map-reduce is very low level and requires developers to write custom programs, which are hard to maintain and reuse. Higher-level data processing languages are needed.

What can Hive do?
Hive supports queries expressed in a SQL-like declarative language, and also allows custom map-reduce scripts to be plugged into queries. In short, Hive provides:
- A data warehouse infrastructure over Hadoop
- A SQL-like query language (HiveQL)
- A way for developers to plug in custom mappers and reducers
- A familiar, fast, scalable, and extensible system

Outline
1. What is Hive? Why Hive? What can Hive do?
2. Hive Database (this section)
3. Hive architecture
4. Interesting research questions

Data Model
Data in Hive is organized into a three-level hierarchy:
- Tables: each table maps to an HDFS directory.
- Partitions: each partition maps to a subdirectory under its table's directory.
- Buckets: each bucket maps to a file under its partition's directory.
Hive supports primitive column types and nestable collection types.
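The mapping from this three-level hierarchy to HDFS paths can be sketched in a few lines of Python. The warehouse root, file-naming scheme, and bucket hash below are simplified assumptions for illustration, not Hive's exact internals:

```python
# Illustrative sketch of Hive's table/partition/bucket layout on HDFS.
# Path conventions and the bucket hash are assumptions for demonstration.

WAREHOUSE = "/user/hive/warehouse"  # assumed warehouse root

def table_dir(table):
    # Each table maps to an HDFS directory
    return f"{WAREHOUSE}/{table}"

def partition_dir(table, **partition_cols):
    # Each partition maps to a subdirectory under the table's directory,
    # named key=value for each partition column
    parts = "/".join(f"{k}={v}" for k, v in partition_cols.items())
    return f"{table_dir(table)}/{parts}"

def bucket_file(table, bucket_value, num_buckets, **partition_cols):
    # Each bucket maps to a file under the partition; rows are assigned
    # by hashing the bucketing column modulo the bucket count
    bucket = hash(bucket_value) % num_buckets
    return f"{partition_dir(table, **partition_cols)}/{bucket:06d}_0"

print(partition_dir("status_updates", ds="2009-03-20"))
# /user/hive/warehouse/status_updates/ds=2009-03-20
```

Partition pruning follows directly from this layout: a query filtering on `ds` only has to list the matching subdirectories instead of scanning the whole table.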

HiveQL commands
Table commands:
CREATE TABLE mytable (userid int, name string) PARTITIONED BY (date string);
SHOW TABLES '.*my';
ALTER TABLE mytable ADD COLUMNS (new_col int);
DROP TABLE mytable;

Loading data from HDFS:
LOAD DATA INPATH 'mybigdata' [OVERWRITE] INTO TABLE mypeople;

Loading data from the local file system:
LOAD DATA LOCAL INPATH 'mybigdata' INTO TABLE mypeople;

Hive commands
JOIN:
SELECT a.name, b.school
FROM information1 a JOIN information2 b ON (a.userid = b.userid);

INSERTION:
INSERT OVERWRITE TABLE t1 SELECT * FROM t2;

Hive Query Language (example)
Goal: generate the daily counts of status updates by school and gender.

CREATE TABLE status_updates (userid int, status string)
PARTITIONED BY (ds string);

LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates PARTITION (ds='2009-03-20');

Multi-table insert statement:
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND a.ds='2009-03-20')
     ) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION (ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school;
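The semantics of this multi-table insert — one shared join feeding two independent aggregations — can be illustrated with a plain-Python equivalent. The table contents below are made-up sample data, not from the paper:

```python
from collections import Counter

# Hypothetical sample rows mirroring the query's inputs
status_updates = [  # (userid, status, ds)
    (1, "hello", "2009-03-20"),
    (2, "world", "2009-03-20"),
    (3, "hi",    "2009-03-20"),
]
profiles = [  # (userid, school, gender)
    (1, "MIT", 0),
    (2, "MIT", 1),
    (3, "CMU", 1),
]

# subq1: join status_updates and profiles on userid for one day
subq1 = [
    (status, school, gender)
    for (uid, status, ds) in status_updates if ds == "2009-03-20"
    for (pid, school, gender) in profiles if pid == uid
]

# The two INSERTs consume the same subquery result, computed once
gender_summary = Counter(gender for (_, _, gender) in subq1)
school_summary = Counter(school for (_, school, _) in subq1)

print(gender_summary)  # status-update counts by gender
print(school_summary)  # status-update counts by school
```

The point of the multi-table insert form is exactly this sharing: the join runs once, and both summaries are derived from its output.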

Use custom mapper and reducer (example)
Goal: display the 10 most popular memes per school.
Mapper: meme-extractor.py; Reducer: top10.py

REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
      FROM (MAP b.school, a.status
            USING 'meme-extractor.py' AS (school, meme)
            FROM status_updates a JOIN profiles b
            ON (a.userid = b.userid)
           ) subq1
      GROUP BY subq1.school, subq1.meme
      DISTRIBUTE BY school, meme
      SORT BY school, meme, cnt desc
     ) subq2;
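Scripts plugged in through MAP/REDUCE read tab-separated rows on stdin and emit tab-separated rows on stdout. The actual top10.py is not shown in the deck, so the following is an assumed sketch of what such a reducer could look like: it groups the contiguous rows of each school, ranks memes by count, and keeps the top 10.

```python
import sys
from itertools import groupby

def top10(lines):
    # Rows arrive as "school \t meme \t cnt"; the DISTRIBUTE BY /
    # SORT BY clauses make each school's rows contiguous
    rows = (line.rstrip("\n").split("\t") for line in lines)
    out = []
    for school, group in groupby(rows, key=lambda r: r[0]):
        # Rank this school's memes by count, descending; keep top 10
        ranked = sorted(group, key=lambda r: int(r[2]), reverse=True)
        out.extend("\t".join(r) for r in ranked[:10])
    return out

if __name__ == "__main__":
    for line in top10(sys.stdin):
        print(line)
```

Because the script only sees a byte stream, any language that can read stdin and write stdout works here; this is what makes the mechanism a general extensibility point rather than a Hive-specific API.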

Outline
1. What is Hive? Why Hive? What can Hive do?
2. Hive Database
3. Hive architecture (this section)
4. Interesting research questions

Hive architecture
- External interfaces: CLI, web UI, and APIs such as JDBC/ODBC.
- Thrift Server: exposes a client API to execute HiveQL statements; Thrift is a framework for cross-language services.
- Metastore: the system catalog; contains metadata about the tables stored in Hive.
- Driver: manages the life cycle of a HiveQL statement (compiler, optimizer, executor).
- Compiler: translates HiveQL statements into a plan; insert statements and queries become a DAG of map-reduce jobs.

(Architecture diagram: the interfaces — CLI, JDBC/ODBC, web GUI — and the Thrift Server sit on top of the Driver and Metastore, which submit jobs to Hadoop: JobTracker, NameNode, and DataNodes with TaskTrackers.)

Execution process
1. The interface sends the query to the driver, which forwards it to the compiler.
2. The compiler parses the query and checks its syntax.
3. The compiler sends a metadata request to the Metastore.
4. The Metastore sends the metadata back to the compiler.
5. The compiler uses this information to create a plan, which is sent back to the driver.
6. The driver sends the plan to the execution engine.
7. The execution engine processes the job as a map-reduce job by interacting with the underlying Hadoop system (JobTracker, NameNode, DataNodes with TaskTrackers) and HDFS.
8. The final results are returned to the user interface through the execution engine and then the driver.

(The original slide shows these steps as a numbered flow between the interface, driver, compiler, Metastore, execution engine, and Hadoop.)

The plan of a multi-table insert query (a DAG of map-reduce jobs), part 1: the join

FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND a.ds='2009-03-20')) subq1

TableScanOperator   Table: status_updates  [userid int, status string, ds string]
FilterOperator      Predicate: col[ds]='2009-03-20'  [0:int, 1:string, 2:string]
ReduceSinkOperator  Partition cols: col[0]  [0:int, 1:string, 2:string]

TableScanOperator   Table: profiles  [userid int, school string, gender int]
ReduceSinkOperator  Partition cols: col[0]  [0:int, 1:string, 2:int]

JoinOperator        Predicate: col[0.0] = col[1.0]  [0:int, 1:string, 2:string, 3:int, 4:string, 5:int]
SelectOperator      Expressions: [col[1], col[4], col[5]]  [0:string, 1:string, 2:int]

The plan of a multi-table insert query, part 2: map-side partial aggregation

INSERT OVERWRITE TABLE gender_summary PARTITION (ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school

SelectOperator      Expressions: [col[1], col[4], col[5]]  [0:string, 1:string, 2:int]

GroupByOperator     Aggregations: [count(1)]  Keys: [col(1)]  Mode: hash  [0:string, 1:bigint]
FileOutputOperator  Table: tmp1  [0:string, 1:bigint]

GroupByOperator     Aggregations: [count(1)]  Keys: [col(2)]  Mode: hash  [0:int, 1:bigint]
FileOutputOperator  Table: tmp2  [0:int, 1:bigint]

Continued, part 3: the two final aggregation jobs

TableScanOperator   Table: tmp1  [0:string, 1:bigint]
ReduceSinkOperator  Partition cols: col[0]  [0:string, 1:bigint]
GroupByOperator     Aggregations: [count(1)]  Keys: [col(0)]  Mode: mergepartial  [0:string, 1:bigint]
SelectOperator      Expressions: [col(0), col(1)]  [0:string, 1:bigint]
FileOutputOperator  Table: school_summary  [0:string, 1:bigint]

TableScanOperator   Table: tmp2  [0:int, 1:bigint]
ReduceSinkOperator  Partition cols: col[0]  [0:int, 1:bigint]
GroupByOperator     Aggregations: [count(1)]  Keys: [col(0)]  Mode: mergepartial  [0:int, 1:bigint]
SelectOperator      Expressions: [col(0), col(1)]  [0:int, 1:bigint]
FileOutputOperator  Table: gender_summary  [0:int, 1:bigint]

Outline
1. What is Hive? Why Hive? What can Hive do?
2. Hive Database
3. Hive architecture
4. Interesting research questions (this section)

Interesting research questions
- The current optimizer of Hive is rule-based; a cost-based optimizer and adaptive optimization techniques are needed.
- To improve scan performance, Hive needs columnar storage and intelligent data placement.
- Multi-query optimization techniques and generic n-way joins in a single map-reduce job are worth exploring.
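The columnar-storage point can be made concrete with a toy comparison: a row store must touch every field of every row to scan one column, while a column store reads only that column's contiguous array. The table sizes below are illustrative:

```python
# Toy row store vs. column store: counting values in one column.
rows = [(i, f"user{i}", i % 2) for i in range(1000)]  # (userid, name, gender)

def scan_gender_rowstore(rows):
    # Row layout: scanning 'gender' still walks every full row
    fields_touched = 0
    count = 0
    for row in rows:
        fields_touched += len(row)   # the whole row is read
        count += row[2]
    return count, fields_touched

# Columnar layout: each column is stored contiguously
columns = {
    "userid": [r[0] for r in rows],
    "name":   [r[1] for r in rows],
    "gender": [r[2] for r in rows],
}

def scan_gender_columnstore(columns):
    col = columns["gender"]          # only one column is read
    return sum(col), len(col)

print(scan_gender_rowstore(rows)[1])     # 3000 fields touched
print(scan_gender_columnstore(columns)[1])  # 1000 fields touched
```

Both scans return the same count, but the columnar scan touches a third of the fields here; on wide tables the gap grows with the number of columns, which is why columnar formats matter for Hive's scan-heavy workloads.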

Thank you