MSBIC Hadoop Series Querying Data with Hive Bryan Smith

MSBIC Hadoop Series
Learn the basics of Hadoop through a combination of demonstration and lecture. Session participants are invited to follow along using emulation environments and Azure-based clusters, the setup of which we will address in our first session.
March – Getting Started
April – Understanding the File System
May – Implementing MapReduce Jobs
June – Querying the Data with Hive
July – On Vacation
August – Processing the Data with Pig
September – Hadoop & MS BI
October – To Be Announced
November – Loading Social Media Data
December – DW Integration

Today's Session
Objectives:
1. Understand the basics of Hive
2. Demonstrate use of Hive with sample data set

The Hive Data Warehouse
Hive compiles a HiveQL query such as SELECT MyCol, COUNT(*) FROM MyTable GROUP BY MyCol; into a MapReduce job.
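One way to see the query-to-MapReduce relationship on this slide is to ask Hive for the execution plan instead of running the query. The sketch below uses Hive's EXPLAIN statement with the slide's own placeholder names (MyTable, MyCol); it assumes such a table exists in the current database.

```sql
-- Print the execution plan (stage dependencies and per-stage operators)
-- without running the job. MyTable and MyCol are the slide's placeholders.
EXPLAIN
SELECT MyCol, COUNT(*)
FROM MyTable
GROUP BY MyCol;
```

The plan lists the stages Hive intends to run; the exact text varies by Hive version and execution engine.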

Demonstration

Demo Script 1: Create Database
show databases;
create database ufo location '/ufo.db';
dfs -ls /;

Demo Script 2: Create & Load Table
use ufo;
create table sightings (
  dateobs string,
  daterpt string,
  `location` string,
  shape string,
  duration string,
  `description` string)
row format delimited fields terminated by '\t';
load data inpath '/demo/ufo/in/ufo_awesome.tsv' overwrite into table sightings;
dfs -ls /ufo.db;
dfs -ls /ufo.db/sightings;
dfs -ls /demo/ufo/in;
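After the load, a quick check of the table's metadata and row count can confirm that the file landed where expected. This is a minimal sketch that assumes the ufo database and sightings table from the script above; it is not part of the original demo.

```sql
use ufo;

-- Show column definitions plus storage details such as the table's
-- HDFS location and whether it is a managed or external table.
describe formatted sightings;

-- Count the rows Hive can read from the loaded file.
select count(*) from sightings;
```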

Demo Script 3: Query Table
select * from sightings limit 10;
select substring(dateobs, 0, 4) as year, shape, count(*) from sightings group by year, shape;
create table SightingsSummary as select substring(dateobs, 0, 4) as year, shape, count(*) from sightings group by year, shape;
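If the Hive version in use does not accept a select-list alias in GROUP BY, a variant of the summary query that repeats the expression may be needed. The sketch below is such a variant, not the demo's own script; the aliases obs_year and sighting_count are hypothetical names chosen to avoid clashing with built-in function names and to give the count column a readable name.

```sql
-- Same aggregation as the demo query, grouped by the expression itself
-- rather than by its alias. obs_year and sighting_count are illustrative.
select substring(dateobs, 0, 4) as obs_year,
       shape,
       count(*) as sighting_count
from sightings
group by substring(dateobs, 0, 4), shape;
```

Naming the count column also matters for create table ... as select, since an unnamed expression otherwise receives a generated column name.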

Managed vs. External Tables
Managed Tables
- Table definition and associated data files managed by Hive
- Loaded data files moved to table-associated folders
- Dropping the table drops the data files
- Use for transformed data needed only by Hive
External Tables
- Only the table definition is managed by Hive
- Loaded data files remain in their original location
- Dropping the table does not drop the data files
- Use for initial staging or when data needs to be accessible across a wide range of applications
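The external-table case can be sketched with the same UFO columns used in the demo. The table name sightings_ext is hypothetical, and the example assumes the tab-delimited source file is still sitting in /demo/ufo/in (the demo's load data statement moves it, so this would be run instead of, not after, that load).

```sql
-- External table: Hive records only the definition and reads the files
-- in place. sightings_ext is an illustrative name.
create external table sightings_ext (
  dateobs string,
  daterpt string,
  `location` string,
  shape string,
  duration string,
  `description` string)
row format delimited fields terminated by '\t'
location '/demo/ufo/in';

-- Dropping it removes the definition only; the files under
-- /demo/ufo/in are left untouched.
drop table sightings_ext;
```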

File Formats
- Default format is row-delimited for both input and output
- Default format is tab-delimited input and Ctrl-A delimited field output
- File access controlled by LazySimpleSerDe (the default SerDe)
- Default data types include int, bigint, tinyint, smallint, float, double, boolean, string, binary, timestamp
- Complex structures supported with array, map & struct types
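As an illustration of the complex types and of spelling out delimiters rather than relying on defaults, the sketch below defines a table with array, map and struct columns. The table and column names are hypothetical and not part of the UFO data set.

```sql
-- Illustrative table mixing primitive and complex types, with every
-- delimiter stated explicitly.
create table sighting_details (
  id int,
  tags array<string>,
  attributes map<string, string>,
  observer struct<name:string, age:int>)
row format delimited
  fields terminated by '\t'
  collection items terminated by ','
  map keys terminated by ':';

-- Array elements are addressed by index, map values by key,
-- and struct fields with dot notation.
select tags[0], attributes['color'], observer.name
from sighting_details
limit 10;
```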

HCatalog
- Table & storage management layer for Hadoop
- Database definitions, table definitions, etc. presented through an accessible interface
- Integrated with Hive but accessible via Hive, Pig & MapReduce
- Stored by default in an Apache Derby database
- Other databases can be substituted for better performance, high availability, etc.

A Few Key Points
- Object definitions are not case-sensitive…
- But string comparisons and HDFS references are
- Names conflicting with reserved keywords can be used by enclosing them in the `grave accent`
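These points can be tried against the sightings table from the demo. The sketch assumes that table exists; the value 'disk' is only an assumed example of a shape in the data.

```sql
-- Object names are not case-sensitive: this resolves to the same
-- table and column as the lowercase form.
SELECT Shape FROM SIGHTINGS LIMIT 10;

-- String comparisons are case-sensitive: 'disk' does not match 'Disk'.
select count(*) from sightings where shape = 'disk';

-- Normalizing case first makes the comparison case-insensitive.
select count(*) from sightings where lower(shape) = 'disk';

-- Column names that collide with keywords are quoted with backticks,
-- as in the demo's create table statement.
select `location`, `description` from sightings limit 10;
```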

Reserved Keywords
Add, All, Alter, And, Array, As, Asc, Bigint, Binary, Boolean, Bucket, Buckets, By, Cast, Cluster, Clustered, Collection, Columns,
Comment, Create, Data, Date, Datetime, Delimited, Desc, Describe, Directory, Distinct, Distribute, Double, Drop, Explain, Extended, External, False, Fields,
Float, Format, From, Full, Function, Group, Inpath, Inputformat, Insert, Int, Into, Is, Items, Join, Keys, Left, Like, Limit,
Lines, Load, Local, Location, Map, Msck, Not, Null, Of, On, Or, Order, Out, Outer, Outputformat, Overwrite, Partition, Partitioned,
Partitions, Rename, Reduce, Regexp, Replace, Right, Rlike, Row, Select, Sequencefile, Serde, Serdeproperties, Set, Show, Smallint, Sort, Sorted, Stored,
String, Table, Tables, Tablesample, Tblproperties, Temporary, Terminated, Textfile, Timestamp, Tinyint, To, Transform, True, Union, Using, Where, With

Resources

Today's Session
Objectives:
1. Understand the basics of Hive
2. Demonstrate use of Hive with sample data set

For Next Session
Topic:
- Processing Data with Pig
Requested Action(s):
- Come with working HDInsight Emulator
- Load sample data sets into HDFS on Emulator