
MSBIC Hadoop Series Querying Data with Hive Bryan Smith

1 MSBIC Hadoop Series Querying Data with Hive Bryan Smith email: bryan.smith@microsoft.com twitter: @smithbryanc

2 MSBIC Hadoop Series http://msbic.sqlpass.org/
  Learn the basics of Hadoop through a combination of demonstration and lecture. Session participants are invited to follow along leveraging emulation environments and Azure-based clusters, the setting up of which we will address in our first session.
  March – Getting Started
  April – Understanding the File System
  May – Implementing MapReduce Jobs
  June – Querying the Data with Hive
  July – On Vacation
  August – Processing the Data with Pig
  September – Hadoop & MS BI
  October – To Be Announced
  November – Loading Social Media Data
  December – DW Integration

3 Today’s Session Objectives:
  1. Understand the basics of Hive
  2. Demonstrate use of Hive with sample data set

4 The Hive Data Warehouse
  Hive compiles a SQL-like query into a MapReduce job:
  SELECT MyCol, COUNT(*) FROM MyTable GROUP BY MyCol;
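To inspect how Hive intends to execute the query on this slide, the statement can be prefixed with EXPLAIN. This is a minimal sketch; MyTable and MyCol are the placeholder names from the slide, so the statement only runs against a table that actually has that name and column.

```sql
-- Print the execution plan instead of running the query.
-- On a MapReduce-backed Hive installation, the plan lists the
-- map and reduce stages that the GROUP BY is translated into.
EXPLAIN
SELECT MyCol, COUNT(*)
FROM MyTable
GROUP BY MyCol;
```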

5 Demonstration

6 Demo Script 1: Create Database
  show databases;
  create database ufo location '/ufo.db';
  dfs -ls /;

7 Demo Script 2: Create & Load Table
  use ufo;
  create table sightings (
    dateobs string,
    daterpt string,
    `location` string,
    shape string,
    duration string,
    `description` string)
  row format delimited fields terminated by '\t';
  load data inpath '/demo/ufo/in/ufo_awesome.tsv' overwrite into table sightings;
  dfs -ls /ufo.db;
  dfs -ls /ufo.db/sightings;
  dfs -ls /demo/ufo/in;

8 Demo Script 3: Query Table
  select * from sightings limit 10;
  select substring(dateobs, 0, 4) as year, shape, count(*) from sightings group by year, shape;
  create table SightingsSummary as select substring(dateobs, 0, 4) as year, shape, count(*) from sightings group by year, shape;
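Some Hive versions may not resolve a select-list alias inside GROUP BY. If the summary query above is rejected with an invalid column reference, a variant that repeats the expression is a reasonable fallback. The aliases obs_year and sightings_count are hypothetical names chosen for this sketch; they are not part of the original demo.

```sql
-- Same aggregation as the demo, but grouping on the full expression
-- rather than on its alias, and naming the count column explicitly
-- so the result does not get an auto-generated column name.
SELECT substring(dateobs, 0, 4) AS obs_year,
       shape,
       COUNT(*) AS sightings_count
FROM sightings
GROUP BY substring(dateobs, 0, 4), shape;
```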

9 Managed vs. External Tables
  Managed Tables
    Table definition & associated data files managed by Hive
    Loaded data files moved to table-associated folders
    Dropping table drops data files
    Use for transformed data only needed by Hive
  External Tables
    Table definition only managed by Hive
    Loaded data files remain in original location
    Dropping table does not drop data files
    Use for initial staging or when data needs to be accessible across wide range of applications
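The external-table behavior described on this slide can be sketched with the UFO data from the demo. The table name sightings_raw is hypothetical, and the sketch assumes a copy of the tab-delimited file is still present in /demo/ufo/in (the load statement in Demo Script 2 moves the file out of that folder, so it would need to be copied back first).

```sql
-- Define an external table over the folder holding the raw file.
-- Hive records only the table definition; the file stays where it is.
CREATE EXTERNAL TABLE sightings_raw (
  dateobs string,
  daterpt string,
  `location` string,
  shape string,
  duration string,
  `description` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/demo/ufo/in';

-- Dropping an external table removes the definition only.
DROP TABLE sightings_raw;

-- The data file should still be listed after the drop.
dfs -ls /demo/ufo/in;
```

Running the same DROP TABLE against the managed sightings table from the demo would remove its files under /ufo.db/sightings as well.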

10 File Formats
  Default format is row-delimited input & output
  Default format is tab-delimited input and Ctrl-A delimited field output
  File access controlled by LazySimpleSerDe (default SerDe)
  Default data types include: int, bigint, tinyint, smallint, float, double, boolean, string, binary, timestamp
  Complex structures supported with array, map & struct types
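As a rough illustration of the delimiters and complex types listed on this slide, the sketch below spells out the row format explicitly instead of relying on defaults. The table sightings_tagged and all of its columns are hypothetical. '\001', '\002' and '\003' are the octal escapes for Ctrl-A, Ctrl-B and Ctrl-C, which are the delimiters Hive applies to a text table when no ROW FORMAT clause is given.

```sql
-- A text table with one column of each complex type.
CREATE TABLE sightings_tagged (
  id int,
  tags array<string>,
  attributes map<string, string>,
  observer struct<name:string, city:string>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Array elements are read by index, map values by key,
-- and struct members with dot notation.
SELECT id,
       tags[0],
       attributes['color'],
       observer.city
FROM sightings_tagged;
```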

11 HCatalog
  Table & storage management layer for Hadoop
  Database defs, table defs, etc. presented through accessible interface
  Integrated with Hive but accessible via Hive, Pig & MapReduce
  Stored by default in Apache Derby database
  Other databases can be substituted for better performance, HA, etc.

12 A Few Key Points
  Object definitions are not case-sensitive…
  But string comparisons and HDFS references are
  Names conflicting with reserved keywords can be used by enclosing them in grave accents (backticks), e.g. `location`
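These points can be tried against the sightings table from the demo. This is a sketch only: the value 'triangle' is an assumed example of what the shape column might contain and is not taken from the slides.

```sql
-- Object names are not case-sensitive:
-- both statements address the same table.
SELECT COUNT(*) FROM sightings;
SELECT COUNT(*) FROM SIGHTINGS;

-- String comparisons are case-sensitive: the first predicate matches
-- only the exact lowercase value, the second matches any casing.
SELECT COUNT(*) FROM sightings WHERE shape = 'triangle';
SELECT COUNT(*) FROM sightings WHERE lower(shape) = 'triangle';

-- Column names that collide with keywords are wrapped in backticks,
-- as in the table definition from Demo Script 2.
SELECT `location`, `description` FROM sightings LIMIT 5;
```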

13 Reserved Keywords
  Add         Comment     Float        Lines         Partitions       String
  All         Create      Format       Load          Rename           Table
  Alter       Data        From         Local         Reduce           Tables
  And         Date        Full         Location      Regexp           Tablesample
  Array       Datetime    Function     Map           Replace          Tblproperties
  As          Delimited   Group        Msck          Right            Tblproperties
  Asc         Desc        Inpath       Not           Rlike            Temporary
  Bigint      Describe    Inputformat  Null          Row              Terminated
  Binary      Directory   Insert       Of            Select           Textfile
  Boolean     Distinct    Int          On            Sequencefile     Timestamp
  Bucket      Distribute  Into         Or            Serde            Tinyint
  Buckets     Double      Is           Order         Serdeproperties  To
  By          Drop        Items        Out           Set              Transform
  Cast        Explain     Join         Outer         Show             True
  Cluster     Extended    Keys         Outputformat  Smallint         Union
  Clustered   External    Left         Overwrite     Sort             Using
  Collection  False       Like         Partition     Sorted           Where
  Columns     Fields      Limit        Partitioned   Stored           With

14 Resources

15 Today’s Session Objectives:
  1. Understand the basics of Hive
  2. Demonstrate use of Hive with sample data set

16 For Next Session
  Topic: Processing Data with Pig
  Requested Action(s):
    Come with working HDInsight Emulator
    Load sample data sets into HDFS on Emulator

