MSBIC Hadoop Series Querying Data with Hive Bryan Smith

MSBIC Hadoop Series Querying Data with Hive Bryan Smith email: bryan.smith@microsoft.com twitter: @smithbryanc

MSBIC Hadoop Series http://msbic.sqlpass.org/
Learn the basics of Hadoop through a combination of demonstration and lecture. Session participants are invited to follow along using emulation environments and Azure-based clusters; we will cover how to set these up in our first session.
March – Getting Started
April – Understanding the File System
May – Implementing MapReduce Jobs
June – Querying the Data with Hive
July – On Vacation
August – Processing the Data with Pig
September – Hadoop & MS BI
October – To Be Announced
November – Loading Social Media Data
December – DW Integration

Today’s Session Objectives:
1. Understand the basics of Hive
2. Demonstrate use of Hive with sample data set

The Hive Data Warehouse
Hive compiles a query such as the following into a MapReduce job:
SELECT MyCol, COUNT(*) FROM MyTable GROUP BY MyCol;
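
One way to see the translation from query to job is Hive's EXPLAIN statement, which prints the execution plan without running the query. The sketch below assumes the ufo database and sightings table from the demo scripts already exist; the exact plan text varies by Hive version and execution engine.

```sql
-- Assumes the ufo database and sightings table from the demo scripts.
use ufo;

-- Print the plan Hive would execute for a GROUP BY query.
-- The query itself is not run; only the stage plan is shown.
explain
select shape, count(*)
from sightings
group by shape;
```

On a MapReduce-backed installation the plan typically shows a map phase that reads the table and a reduce phase that performs the aggregation.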

Demonstration

Demo Script 1: Create Database
show databases;
create database ufo location '/ufo.db';
dfs -ls /;

Demo Script 2: Create & Load Table
use ufo;
create table sightings (
  dateobs string,
  daterpt string,
  `location` string,
  shape string,
  duration string,
  `description` string)
row format delimited fields terminated by '\t';
load data inpath '/demo/ufo/in/ufo_awesome.tsv' overwrite into table sightings;
dfs -ls /ufo.db;
dfs -ls /ufo.db/sightings;
dfs -ls /demo/ufo/in;

Demo Script 3: Query Table
select * from sightings limit 10;
select substring(dateobs, 0, 4) as year, shape, count(*)
from sightings
group by year, shape;
create table SightingsSummary as
select substring(dateobs, 0, 4) as year, shape, count(*)
from sightings
group by year, shape;
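
Some Hive versions do not resolve a select-list alias inside GROUP BY. If the grouped query above is rejected with an invalid column reference, a variant that repeats the expression should behave the same way. This is a sketch, not part of the original demo; the aliases obs_year and sightings_count are hypothetical names chosen here for illustration.

```sql
use ufo;

-- Same aggregation as the demo query, but grouping on the expression
-- itself rather than on its alias.
-- obs_year and sightings_count are illustrative aliases.
select substring(dateobs, 0, 4) as obs_year,
       shape,
       count(*) as sightings_count
from sightings
group by substring(dateobs, 0, 4), shape;
```

Naming the count column also gives the resulting table a readable column name when the query is used in a create table ... as select statement.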

Managed vs. External Tables
Managed Tables
- Table definition & associated data files managed by Hive
- Loaded data files moved to table-associated folders
- Dropping table drops data files
- Use for transformed data only needed by Hive
External Tables
- Table definition only managed by Hive
- Loaded data files remain in original location
- Dropping table does not drop data files
- Use for initial staging or when data needs to be accessible across wide range of applications
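
To make the contrast concrete, here is a minimal sketch of an external table defined over the demo's input folder. The table name sightings_ext is hypothetical, and the example assumes tab-delimited data files are present in /demo/ufo/in (the managed-table load in Demo Script 2 moves its file out of that folder, so a fresh copy may be needed).

```sql
use ufo;

-- External table: Hive stores only the definition; the files stay
-- where they are. sightings_ext is an illustrative name.
create external table sightings_ext (
  dateobs string,
  daterpt string,
  `location` string,
  shape string,
  duration string,
  `description` string)
row format delimited fields terminated by '\t'
location '/demo/ufo/in';

-- Query the files in place.
select count(*) from sightings_ext;

-- Dropping an external table removes the definition only.
drop table sightings_ext;

-- The data files should still be listed in the original folder.
dfs -ls /demo/ufo/in;
```

Dropping the managed sightings table instead would also delete the files under /ufo.db/sightings.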

File Formats
- Default format is row-delimited for both input & output
- Default format is tab-delimited input and Ctrl-A delimited field output
- File access controlled by LazySimpleSerDe (default SerDe)
- Default data types include: int, bigint, tinyint, smallint, float, double, boolean, string, binary, timestamp
- Complex structures supported with array, map & struct types
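
The delimiters and complex types described above can be spelled out explicitly in a table definition. The following is a sketch only: the table sightings_complex and its columns are hypothetical and do not correspond to any file in the demo data.

```sql
use ufo;

-- Hypothetical table showing explicit delimiters and complex types.
create table sightings_complex (
  id int,
  shapes array<string>,               -- e.g. circle,disk
  attributes map<string, string>,     -- e.g. color:red,size:large
  coords struct<lat:double, lon:double>)
row format delimited
  fields terminated by '\t'
  collection items terminated by ','
  map keys terminated by ':'
  lines terminated by '\n'
stored as textfile;

-- Array elements are read by index, map values by key,
-- and struct members with dot notation.
select shapes[0], attributes['color'], coords.lat
from sightings_complex;
```

When the row format clause is omitted entirely, Hive falls back to the defaults of its LazySimpleSerDe.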

HCatalog
- Table & storage management layer for Hadoop
- Database defs, table defs, etc. presented through accessible interface
- Integrated with Hive but accessible via Hive, Pig & MapReduce
- Stored by default in Apache Derby database
- Other databases can be substituted for better performance, HA, etc.
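
From the Hive side, the stored definitions can be inspected with ordinary metadata statements. This short sketch assumes the ufo database and sightings table from the demo scripts.

```sql
use ufo;

-- List the tables registered in the current database.
show tables;

-- Show the stored definition of one table: columns, storage
-- location, table type (managed or external), and SerDe details.
describe formatted sightings;
```

The same definitions are what other tools read when they reach the tables through HCatalog.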

A Few Key Points
- Object definitions are not case-sensitive…
- But string comparisons and HDFS references are
- Names conflicting with reserved keywords can be employed using the `grave accent`
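
These points can be checked against the demo table. The sketch below assumes the sightings table from Demo Script 2; the literal values 'circle' and 'Circle' are hypothetical and may not match how shapes are actually recorded in the data.

```sql
use ufo;

-- Object names are not case-sensitive: this resolves to the same
-- table and column as the lowercase form.
SELECT Shape FROM Sightings LIMIT 10;

-- String comparisons are case-sensitive: these two counts can differ.
select count(*) from sightings where shape = 'circle';
select count(*) from sightings where shape = 'Circle';

-- Normalizing case makes the comparison match both spellings.
select count(*) from sightings where lower(shape) = 'circle';

-- Backticks (grave accents) allow reserved words as column names.
select `location`, `description` from sightings limit 10;
```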

Reserved Keywords
Add, All, Alter, And, Array, As, Asc, Bigint, Binary, Boolean, Bucket, Buckets, By, Cast, Cluster, Clustered, Collection, Columns,
Comment, Create, Data, Date, Datetime, Delimited, Desc, Describe, Directory, Distinct, Distribute, Double, Drop, Explain, Extended, External, False, Fields,
Float, Format, From, Full, Function, Group, Inpath, Inputformat, Insert, Int, Into, Is, Items, Join, Keys, Left, Like, Limit,
Lines, Load, Local, Location, Map, Msck, Not, Null, Of, On, Or, Order, Out, Outer, Outputformat, Overwrite, Partition, Partitioned,
Partitions, Reduce, Regexp, Rename, Replace, Right, Rlike, Row, Select, Sequencefile, Serde, Serdeproperties, Set, Show, Smallint, Sort, Sorted, Stored,
String, Table, Tables, Tablesample, Tblproperties, Temporary, Terminated, Textfile, Timestamp, Tinyint, To, Transform, True, Union, Using, Where, With

Resources

Today’s Session Objectives:
1. Understand the basics of Hive
2. Demonstrate use of Hive with sample data set

For Next Session
Topic: Processing Data with Pig
Requested Action(s):
- Come with working HDInsight Emulator
- Load sample data sets into HDFS on Emulator

