MSBIC Hadoop Series Querying Data with Hive Bryan Smith

MSBIC Hadoop Series
Learn the basics of Hadoop through a combination of demonstration and lecture. Session participants are invited to follow along leveraging emulation environments and Azure-based clusters, the setting up of which we will address in our first session.
March – Getting Started
April – Understanding the File System
May – Implementing MapReduce Jobs
June – Querying the Data with Hive
July – On Vacation
August – Processing the Data with Pig
September – Hadoop & MS BI
October – To Be Announced
November – Loading Social Media Data
December – DW Integration

Today’s Session Objectives:
1. Understand the basics of Hive
2. Demonstrate use of Hive with sample data set

The Hive Data Warehouse
SELECT MyCol, COUNT(*) FROM MyTable GROUP BY MyCol;
[Diagram: the query above is executed as a MapReduce job]
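One way to see the translation from query to job is Hive's EXPLAIN statement, which prints the execution plan instead of running the query. The sketch below reuses the slide's placeholder names (MyTable, MyCol); it assumes such a table exists and that Hive is running on the MapReduce execution engine.

```sql
-- Sketch only: MyTable and MyCol are the slide's placeholder names.
-- EXPLAIN prints the plan without running the query.
EXPLAIN
SELECT MyCol, COUNT(*)
FROM MyTable
GROUP BY MyCol;
-- On the MapReduce engine, the plan should list a map-reduce stage:
-- the map side emits MyCol as the key, the reduce side counts rows per key.
```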

Demonstration

Demo Script 1: Create Database
show databases;
create database ufo location '/ufo.db';
dfs -ls /;

Demo Script 2: Create & Load Table
use ufo;
create table sightings (
  dateobs string,
  daterpt string,
  `location` string,
  shape string,
  duration string,
  `description` string)
row format delimited fields terminated by '\t';
load data inpath '/demo/ufo/in/ufo_awesome.tsv' overwrite into table sightings;
dfs -ls /ufo.db;
dfs -ls /ufo.db/sightings;
dfs -ls /demo/ufo/in;

Demo Script 3: Query Table
select * from sightings limit 10;
select substring(dateobs, 0, 4) as year, shape, count(*)
from sightings
group by year, shape;
create table SightingsSummary as
select substring(dateobs, 0, 4) as year, shape, count(*)
from sightings
group by year, shape;

Managed vs. External Tables
Managed Tables
  Table definition & associated data files managed by Hive
  Loaded data files moved to table-associated folders
  Dropping table drops data files
  Use for transformed data only needed by Hive
External Tables
  Table definition only managed by Hive
  Loaded data files remain in original location
  Dropping table does not drop data files
  Use for initial staging or when data needs to be accessible across wide range of applications
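As a rough illustration of the external-table behavior, the sketch below defines an external table over the demo's input folder. The table name sightings_staged is hypothetical, and the example assumes the tab-delimited file is still in /demo/ufo/in (that is, before `load data inpath` has moved it into the managed table's folder).

```sql
use ufo;

-- Hypothetical table name; the columns mirror the demo's sightings table.
-- Assumes /demo/ufo/in still holds the tab-delimited source file.
create external table sightings_staged (
  dateobs string,
  daterpt string,
  `location` string,
  shape string,
  duration string,
  `description` string)
row format delimited fields terminated by '\t'
location '/demo/ufo/in';

-- The detailed description should report the table type as external.
describe formatted sightings_staged;

-- Dropping an external table removes only the definition ...
drop table sightings_staged;

-- ... so the data files should still be listed here afterwards.
dfs -ls /demo/ufo/in;
```

Running the same drop against the managed sightings table would instead delete the files under /ufo.db/sightings.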

File Formats
  Default input format is row-delimited input & output
  Default format is tab-delimited input and Ctrl-A delimited field output
  File access controlled by LazySimpleSerDe (default SerDe)
  Default data types include:
    int, bigint, tinyint, smallint, float, double, boolean, string, binary, timestamp
  Complex structures supported with array, map & struct types
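The complex types can be combined with explicit delimiters in a delimited text table. The following is a minimal sketch; the table name, column names and delimiter choices are all hypothetical and not part of the demo data set.

```sql
-- Hypothetical table showing array, map and struct columns
-- in a tab-delimited text file.
create table sighting_details (
  id int,
  tags array<string>,
  attributes map<string, string>,
  place struct<city:string, state:string>)
row format delimited
  fields terminated by '\t'            -- separates the four columns
  collection items terminated by ','   -- separates array items, map entries, struct fields
  map keys terminated by ':'           -- separates each map key from its value
stored as textfile;

-- Array elements are addressed by index, map values by key,
-- and struct fields with dot notation.
select tags[0], attributes['color'], place.city
from sighting_details;
```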

HCatalog
  Table & storage management layer for Hadoop
  Database defs, table defs, etc. presented through accessible interface
  Integrated with Hive but accessible via Hive, Pig & MapReduce
  Stored by default in Apache Derby database
  Other databases can be substituted for better performance, HA, etc.

A Few Key Points
  Object definitions are not case-sensitive…
  But string comparisons and HDFS references are
  Names conflicting with reserved keywords can be employed using the `grave accent`
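These points can be tried against the demo's sightings table. In the sketch below, the shape values 'triangle' and 'Triangle' are assumed for illustration; the actual values in the data set may differ.

```sql
use ufo;

-- Object names are not case-sensitive: these two statements are equivalent.
SELECT Shape FROM Sightings LIMIT 10;
select shape from sightings limit 10;

-- String comparisons are case-sensitive, so these counts can differ.
-- ('triangle' / 'Triangle' are assumed values, not taken from the data.)
select count(*) from sightings where shape = 'triangle';
select count(*) from sightings where shape = 'Triangle';

-- Column names that collide with reserved keywords are wrapped in grave accents.
select `location`, `description` from sightings limit 10;
```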

Reserved Keywords
Add         Comment     Float        Lines         Partitions       String
All         Create      Format       Load          Rename           Table
Alter       Data        From         Local         Reduce           Tables
And         Date        Full         Location      Regexp           Tablesample
Array       Datetime    Function     Map           Replace          Tblproperties
As          Delimited   Group        Msck          Right            Tblproperties
Asc         Desc        Inpath       Not           Rlike            Temporary
Bigint      Describe    Inputformat  Null          Row              Terminated
Binary      Directory   Insert       Of            Select           Textfile
Boolean     Distinct    Int          On            Sequencefile     Timestamp
Bucket      Distribute  Into         Or            Serde            Tinyint
Buckets     Double      Is           Order         Serdeproperties  To
By          Drop        Items        Out           Set              Transform
Cast        Explain     Join         Outer         Show             True
Cluster     Extended    Keys         Outputformat  Smallint         Union
Clustered   External    Left         Overwrite     Sort             Using
Collection  False       Like         Partition     Sorted           Where
Columns     Fields      Limit        Partitioned   Stored           With

Resources

Today’s Session Objectives:
1. Understand the basics of Hive
2. Demonstrate use of Hive with sample data set

For Next Session
Topic:
  Processing Data with Pig
Requested Action(s):
  Come with working HDInsight Emulator
  Load sample data sets into HDFS on Emulator