SQLCAT: Big Data – All Abuzz About Hive — Presentation transcript

1 SQLCAT: Big Data – All Abuzz About Hive
Dipti Sangani, SQL Big Data PM, Microsoft. Cindy Gross, SQLCAT BI/Big Data PM, Microsoft, @SQLCindy. Ed Katibah, SQLCAT Spatial PM, Microsoft, @Spatial_Ed. Speaker notes: phones silenced; thank the remote audience (video recording); thank everyone; ask the audience to use the mic for questions; evals via website/guidebook. Who knows something about Big Data / Hadoop / Hive?

2 Hive Hadoop Big Data Analytics to Insights Big Agenda
There are other talks that will go into Big Data and Hadoop, so we'll only do a quick overview of those right now. We'll spend most of our time on Hive.

3 A NEW SET OF QUESTIONS LIVE DATA FEEDS SOCIAL & Web ANALYTICS
What's the social sentiment for my brand or products? How do I better predict future outcomes? SOCIAL & WEB ANALYTICS. ADVANCED ANALYTICS. How do I optimize my fleet based on weather and traffic patterns?

Unknown unknowns: the question may be asked for the first time; you're exploring (Avinash Kaushik at Strata 2012). Is there a correlation between the brightness of the North Star last night and sales of pet fish? Does the amount of gas left in my tank impact what I buy at the grocery store? Company impact / live data feeds / advanced analytics.

Today new types of questions are being asked to drive the business. These questions include:

Questions on social and web analytics, e.g. What is my brand and product sentiment? How effective is my online campaign? Who am I reaching? How can I optimize or target the correct audience?

Questions that require connecting to live data feeds, e.g. a large shipping company uses live weather feeds and traffic patterns to fine-tune its ship and truck routes, leading to improved delivery times and cost savings. Retailers analyze sales, pricing, and economic, demographic, and live weather data to tailor product selections at particular stores and to determine the timing of price markdowns.

Questions that require advanced analytics, e.g. financial firms use machine learning to build better fraud-detection algorithms that go beyond simple business rules involving charge frequency and location to also include an individual's customized buying patterns, ultimately leading to a better customer experience.

Organizations that are able to take advantage of Big Data to ask and answer these new types of questions will be able to more effectively differentiate and derive new value for the business, whether in the form of revenue growth, cost savings, or entirely new business models. Gartner asserts that "By 2015 businesses that build a modern information management system will outperform their peers financially by 20 percent." McKinsey agrees, confirming that organizations that use data and business analytics to drive decision making are more productive and deliver higher return on equity than those that don't.

4 NEW OPPORTUNITIES GE Revenue Growth Massive Volumes
Revenue growth: increases ad revenue by processing 3.5 billion events per day. Massive volumes: processes 464 billion rows per quarter, with average query time under 10 secs. Business innovation: measures and ranks online user influence by processing 3 billion signals per day. Cloud connectivity: connects across 15 social networks via the cloud for data and API access. Operational efficiencies: uses sentiment analysis and web analytics for its internal cloud. GE real-time insight: improves operational decision making for IT managers and users.

Some examples of organizations that are delivering new value in the form of revenue growth, cost savings, or entirely new business models: Yahoo! – AS with Hive; Klout – AS with Hive (white paper); GE – Hive analytics.

Yahoo! (Gartner BI Excellence Award winner) is driving growth for existing revenue streams. Yahoo! manages a powerful, scalable advertising exchange that includes publishers and advertisers. Advertisers want to get the most out of their investment by reaching their targeted audiences effectively and efficiently. Yahoo! needs visibility into how consumers are responding to ads along many dimensions (websites, creative, time of day, gender, age, location) to make the exchange work as efficiently and effectively as possible. Yahoo! doubled its revenue by allowing campaign managers to "tune" campaign targeting and creative, and drove an increase in spending from advertisers since they got better performance by advertising through Yahoo!. Yahoo! TAO exposed customer segment performance to campaign managers and advertisers for the first time.

Klout is creating new businesses and revenue streams. Klout's mission is to help everyone understand and leverage their influence. Klout uses Big Data to unify the social web (consumers, brands, and partners) with social networking and activity, along with data to generate a Klout score and enable analysis, targeting, and social graphs. It helps consumers manage their "social brand," helps brands reach influencers at scale, and helps data partners enhance their services (customer loyalty, CRM, media and identity, and marketing). For example, the Palms uses Klout scores in addition to its normal customer rewards program to determine whether or not to upgrade customers to a better room during their stay. The Huffington Post uses Klout to help serve the best curated Twitter content. Klout Case Study: Enterprise/Klout/Data-Services-Firm-Uses-Microsoft-BI-and-Hadoop-to-Boost-Insight-into-Big-Data/ Case Study on Thailand's Department of Special Investigations:

GE is driving operational efficiencies. GE is running several use cases on its Hadoop cluster while incorporating several disparate sources to produce results. Along with sentiment analysis, GE is running web analytics on its internal cloud structure and looking at load usage, user analytics, and failure mode analytics. GE built a recommendation engine for its intranet, suggesting press releases users might be interested in based on their function, user profiles, and prior visits to the site. GE is working with several types of remote monitoring and diagnostic data from its energy and wind businesses.

5 MANAGE any data, any size, anywhere
Unified monitoring, management & security. Relational, non-relational, streaming. Data movement. Complete set of capabilities, modern platform, all types of data, insights, monitor/manage/scale/secure/HA, working together.

Data management needs have evolved from traditional relational storage to both relational and non-relational storage, and a modern information management platform needs to support all types of data. To deliver insight on any data, you need a platform that provides a complete set of capabilities for data management across relational, non-relational, and streaming data, that can seamlessly move data from one type to another, and that can monitor and manage all your data regardless of its type or structure, all without the application having to worry about scale, performance, security, and availability.

In addition to supporting all types of data, moving data between a non-relational store such as Hadoop and a relational data warehouse is one of the key Big Data customer usage patterns. To support this common usage pattern, we provide connectors for high-speed data movement between data stored in Hadoop and existing SQL Server data warehousing environments, including SQL Server Parallel Data Warehouse.

There is a lot of debate in the market today on relational vs. non-relational technologies. Asking whether you should use relational or non-relational technologies for your application is asking the wrong question: both are storage mechanisms, designed to meet very different needs. Relational stores are good for structured data where the schema is known; programming against them requires an understanding of declarative query languages like SQL, and in return you get high consistency and transaction isolation. In contrast, non-relational stores are good for unstructured data where no schema exists and querying is more programmatic; in return you get greater scalability, trading off the ability to execute transactions. As the requirements for both types of stores evolve, the key point to remember is that a modern data platform must support both types of data equally well, provide unified monitoring and management across both, and be able to easily move and transform data across all types of stores. As an example, Yahoo! evolved from its traditional EDW-only strategy to include EDW, Hadoop, and OLAP (Online Analytical Processing).

6 VVVVroom! Volume – beyond what environment can handle
Velocity – need decisions fast. Variety – many formats. Variability – multiple interpretations.

Big data is often described as problems that have one or more of the 3 (or 4) Vs: volume, velocity, variety, variability. Think about big data when you describe a problem with terms like tame the chaos, reduce the complexity, explore, I don't know what I don't know, unknown unknowns, unstructured, changing quickly, too much for what my environment can handle now, unused data.

Volume = more data than the current environment can handle with vertical scaling; the need to make use of data that is currently too expensive to use.
Velocity = a small decision window compared to the data change rate; ask how quickly you need to analyze and how quickly data arrives.
Variety = many different formats that are expensive to integrate, probably from many data sources/feeds.
Variability = many possible interpretations of the data.

7 NoSQL SQL Better Together Structured Un/Multi/Semi-Structured
SQL: structured; schema on write; modifications expected; ACID; scale up; maturity.
NoSQL: un/multi/semi-structured; schema on read; write once, read many; BASE; scale out; incubation.

Hadoop is part of NoSQL (Not Only SQL), and it's a bit wild. You explore in/with Hadoop. You learn new things. You test hypotheses on unstructured jungle data. You eliminate noise. Then you take the best learnings and share them with the world via a relational or multidimensional database. Atomicity, consistency, isolation, durability (ACID) is used in relational databases to ensure immediate consistency. But what if eventual consistency is good enough? In stomps BASE: basically available, soft state, eventual consistency. Scale up or scale out? Pay up front or pay as you go? Which IT skills do you utilize?

8 Big Data
MapReduce, streaming, machine learning, massively parallel processing. Scale out for pay as you go. Schema on read, not write. BASE, not ACID. Append-only, bulk data operations. Too big, complex, or expensive for the current environment.

9 BIG DATA REQUIRES AN END-TO-END APPROACH
INSIGHT: self-service, collaboration, corporate apps, devices. DATA ENRICHMENT: discover, combine, refine. DATA MANAGEMENT: relational, non-relational, analytical, streaming.

10 Why Use Big Data – Use Cases
Financial Services: risk modeling, threat analysis, fraud detection, credit scoring.
IT Management: SLA monitoring, cyber security, forensic analysis.
Telemetry Management: clickstream and application log analysis, sensor data.
Online Commerce: sentiment analysis, recommendation engines, search indexing/quality.

11 Hadoop Architecture
Components: Active Directory (security); pipeline/workflow (Oozie); metadata (HCatalog); graph (Pegasus); stats processing (RHadoop); Business Intelligence (Excel, PowerView…); NoSQL database (HBase); scripting (Pig); query (Hive); machine learning (Mahout); data integration (ODBC / SQOOP / REST); System Center; log file aggregation (Flume); distributed processing (MapReduce); distributed storage (HDFS).

The biggest buzzword in Big Data right now is Hadoop. It can mean many things, but it always includes HDFS and MapReduce.

12 Hive Architecture
Components: HiveQL, Metastore, ODBC, JDBC, Hive Web Interface (HWI), Command Line Interface (CLI), Thrift Server, and the compiler, optimizer, and executor, running against the Hadoop head node, name node, and data/task nodes.

The Thrift server has client and server components for translation between ODBC and Hadoop MapReduce; it provides an interface definition language for RPC. Hive/HiveQL is simple, and easy for SQL pros to learn. Hive design principles: scalable, extensible (via UDF, UDAF), fault tolerant, and loosely coupled with file formats. What Hive is not: low-latency response times on queries.

Hive is a data warehousing framework on Hadoop. It imposes metadata and a familiar-looking query language (HiveQL), acts as a simple translation layer to MapReduce, is extensible via custom mappers/reducers, is loosely coupled with input formats, and enables analytics from high-level BI tools via ODBC.

13 Hive Flow
Create metadata, write HiveQL, load data, run MapReduce, gain insight and advanced analytics. You can create the metadata first OR load the data first (it usually has some general known shape); the process is usually very iterative, data flows around, and you can use existing analytics tools with ODBC/Hive. This is one of many ways Hive may fit into your enterprise solution. Load the results to SQL or AS. A minimal sketch of the flow follows.
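The weblog table, columns, and paths below are hypothetical, not from the demo; schema on read means nothing is validated at load time:

    -- Create metadata (schema on read; nothing is validated yet)
    CREATE TABLE weblog (ip STRING, ts STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Load data (a bulk file move into the table's directory)
    LOAD DATA INPATH '/user/staging/weblog.tsv' INTO TABLE weblog;

    -- Query; Hive compiles this into one or more MapReduce jobs
    SELECT url, COUNT(*) AS hits
    FROM weblog
    GROUP BY url;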

14 DEMO: Analyzing a Frankenstorm
Why we would choose to put data in Hive: new insights, machine learning, statistical analysis, feeds to other data sources including public data, multiple ways to view the data, metadata/ODBC/tables, large volume over time (you don't have to archive). Census data is variable across time/geography, can be large volumes, and is used with many data sources. Census data has a huge market for analyzing/manipulating/enriching data between the ten-year gatherings. You can quickly add new data sets (CDC, housing, jobs, pets, infrastructure such as levees, elevation) to Hive for additional mashups/analysis. The business proposition for this demo: first responders have more info, businesses can sell data as a service, news organizations, etc., plus all sorts of "what if" analysis. Show our BI stack on top, then drill down into specific Hive steps. Spatial portion of demo. Look for remaining steps soon on

15 Behind the Scenes

16 Get HDInsight
Sign up for the Windows Azure HDInsight Service (Cloud CTP). Download Microsoft HDInsight Server (On-Prem CTP). Yes, you can install it right now on your Windows 7+ laptop; you'll get a single-node install.

17 Create Table
CREATE EXTERNAL TABLE censusP (State_FIPS int, County_FIPS int, Population bigint, Pop_Age_Over_69 bigint, Total_Households bigint, Median_Household_Income bigint, KeyID string)
COMMENT 'US Census Data'
PARTITIONED BY (Year string)
ROW FORMAT DELIMITED FIELDS TERMINATED by '\t'
STORED AS TEXTFILE;

ALTER TABLE censusP ADD PARTITION (Year = '2010') LOCATION '/user/demo/census/2010';

If you have 10 delimited values and only specify three columns (for example), then the other 7 "columns" are ignored.

Hadoop command prompt:
bcp dbo.ACS2010 out c:\data\dbo_ACS2010.dat -b d NOAA -T -c -S CGROSSBOISE\SQL2012
hadoop fs -put C:\data\dbo_ACS2010.dat /user/demo/census/census.dat
hadoop fs -put C:\data\dbo_ACS2010.dat /user/demo/censusp/2010/census.dat
hadoop fs -lsr /user/demo/

Hive (the partitioned censusP DDL above, plus a non-partitioned variant):
CREATE EXTERNAL TABLE Census (State_FIPS int, County_FIPS int, Population bigint, Pop_Age_Over_69 bigint, Total_Households bigint, Median_Household_Income bigint, KeyID string)
COMMENT 'US Census Data 2010'
ROW FORMAT DELIMITED FIELDS TERMINATED by '\t'
STORED AS TEXTFILE
LOCATION '/user/demo/census';

DESCRIBE census;
DESCRIBE EXTENDED census;
DESCRIBE FORMATTED census;
EXIT;

Hive command prompt:
hive -e "select * from census limit 10;" -S

18 Inside a Hive Table
DATA TYPES; EXTERNAL / INTERNAL; PARTITIONED BY | CLUSTERED BY | SKEWED BY; terminators (ROW FORMAT DELIMITED | SERDE; FIELDS / COLLECTION ITEMS / MAP KEYS TERMINATED BY); STORED AS; LOCATION.
Other valid syntax: IF NOT EXISTS, COMMENT (column and table), TBLPROPERTIES, LINES TERMINATED BY (the only valid value is '\n').
If you have a file with a header row (such as column names), it is easiest to remove it from the source file; otherwise create a custom SerDe.
You can have multiple tables pointing to one data set.
You can rename a table or a column, and you can change the data type of a column (a simple metadata change; the data isn't touched).
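A sketch pulling those clauses together in one DDL statement; the readings table, its columns, and the location are hypothetical:

    CREATE EXTERNAL TABLE IF NOT EXISTS readings (
      sensor_id INT COMMENT 'device identifier',
      tags ARRAY<STRING>,
      attrs MAP<STRING, STRING>)
    COMMENT 'illustrates the terminator and table clauses'
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      COLLECTION ITEMS TERMINATED BY '\002'
      MAP KEYS TERMINATED BY '\003'
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/user/demo/readings'
    TBLPROPERTIES ('creator' = 'demo');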

19 Metadata
Metadata is stored in a MetaStore database such as Derby, SQL Azure, or SQL Server. View it with:
SHOW TABLES 'ce.*';
DESCRIBE census;
DESCRIBE census.population;
DESCRIBE EXTENDED census;
DESCRIBE FORMATTED census;
SHOW FUNCTIONS "x.*";
SHOW FORMATTED INDEXES ON census;

20 Data Types
Primitives – Numbers: INT, SMALLINT, TINYINT, BIGINT, FLOAT, DOUBLE. Characters: STRING. Special: BINARY, TIMESTAMP.
Collections:
STRUCT<City:String, State:String> | struct('Boise', 'Idaho')
ARRAY<String> | array('Boise', 'Idaho')
MAP<String, String> | map('City', 'Boise', 'State', 'Idaho')
UNIONTYPE<BigInt, String, Float>
Properties: there are no fixed lengths, and NULL handling depends on the SerDe. TIMESTAMP is very new (Hive 0.8.0): UTC, seconds since 1970 OR "YYYY-MM-DD HH:MM:SS.fffffffff". BINARY is also new in Hive 0.8.0; it is not a blob, just arbitrary unparsed bytes. Structs are accessed via dot notation (.City). A map is key/value pairs accessed via key name. An array is all the same type, referenced by zero-based position (name[1]).
CREATE TABLE union_test(foo UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>);
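As a sketch of that access syntax, assuming a hypothetical people table with one of each collection type:

    -- Assumed table: people(name STRING, address STRUCT<City:STRING, State:STRING>,
    --                       tags ARRAY<STRING>, attrs MAP<STRING, STRING>)
    SELECT name,
           address.City,       -- STRUCT member via dot notation
           tags[0],            -- ARRAY element via zero-based index
           attrs['language']   -- MAP value via key
    FROM people;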

21 Storage – External and Internal
CREATE EXTERNAL TABLE census(…)
LOCATION '/user/demo/census';
LOCATION 'hdfs:///user/demo/census';
LOCATION 'asv://user/demo/census';

Use EXTERNAL when: the data is also used outside of Hive; the data needs to remain even after a DROP TABLE; you use a custom location such as ASV; Hive should not own the data or control settings, directories, etc.; and you are not creating the table based on an existing table (AS SELECT). ASV = Azure Storage Vault (blob store). INTERNAL is NOT a keyword; just leave off EXTERNAL.

Use an external table when the data also needs to be accessed outside of Hive, such as when you are programmatically generating data outside of Hive; Hive does not "own" the data. Internal tables are fully managed by Hive, including automatic data deletion and automatic partition creation; Hive controls the data lifecycle. Internal tables = managed tables. You can store data on the Azure blob store with the ASV abstraction.

ROW FORMAT DELIMITED must appear before other clauses (except STORED AS). STORED AS values: TEXTFILE (easy, shareable), SEQUENCEFILE (compressed/faster), RCFILE, INPUTFORMAT/OUTPUTFORMAT (custom). Location: performance is better with fewer, larger files; each file should be much bigger than the block size (often 64 MB or 128 MB); more files = more memory needed by the head node. Hive data load is about BULK load, not individual row manipulation. A sketch contrasting the two table kinds follows.
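Here stage_census and ext_census are hypothetical names; the behavioral difference shows up at DROP TABLE:

    -- Managed (internal): Hive owns the directory
    CREATE TABLE stage_census (State_FIPS INT, Population BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- External: Hive tracks only the metadata
    CREATE EXTERNAL TABLE ext_census (State_FIPS INT, Population BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/user/demo/census';

    DROP TABLE stage_census;  -- removes metadata AND deletes the data
    DROP TABLE ext_census;    -- removes metadata only; files remain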

22 Storage – Partition and Bucket
CREATE EXTERNAL TABLE census (…) PARTITIONED BY (Year string) CLUSTERED BY (population) INTO 256 BUCKETS

Partitions: a directory is created for each distinct combination of partition values. The partition key cannot be defined as a column in the table itself. Partitioning allows partition elimination and is useful in range searches, but it can slow performance if the partition key is not referenced in the query.

Buckets: split the data based on a hash of a column, with one HDFS file per bucket within each partition sub-directory. Performance may improve for aggregates, join queries, and sampling.

CLUSTERED BY generally helps performance by reducing network I/O for some aggregate-type queries, since blocks in the same hash bucket are co-located. SKEWED BY helps with elimination if the data is skewed for some values (e.g. most people in the US and some in Europe; filters like country = US can skip entire files). A separate directory is created for each distinct value combination in the partition columns. Within a table or partition you can bucket with CLUSTERED BY and sort with SORT BY. You cannot include the partition key in the table itself (trick: give the column a different name). You can add partitions only if the key is a string. Dynamic partitions are available in Hive 0.6 and later if you set hive.exec.dynamic.partition=true. Partition keys are positional, based on the last column(s). SHOW PARTITIONS census; partition columns appear like other columns in the table definition.

Internal table: Hive creates the directories when you issue the LOAD statement: LOAD DATA LOCAL INPATH {path} INTO TABLE {table} PARTITION {key/value}. LOCAL is optional and indicates a copy from a non-distributed file system vs. a move from a distributed file system (it still has to be the local cluster). INPATH says where to get the data (~FROM). Use one LOAD statement per partition, though they can be combined. Don't use ALTER TABLE ADD PARTITION for internal partitions (it gives you a mix of managed/external data).

External table: the user defines the directories and loads the data. CREATE TABLE does not specify LOCATION; instead, issue one ALTER TABLE ADD PARTITION per partition key/value, which is where the location is specified (be logical and include the key name and key value in the directory name). There is no LOAD statement; data is put into the directories via other methods (see above). If a directory doesn't exist or is empty there is no error, just no results. Favor fewer, balanced partitions in most cases. A short sketch of both load paths follows.
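censusM is a hypothetical managed twin of the demo's external censusP table; paths and values are illustrative:

    -- Managed table: Hive creates the partition directory at LOAD time
    LOAD DATA LOCAL INPATH '/tmp/census2011.tsv'
    INTO TABLE censusM PARTITION (Year = '2011');

    -- External table: attach a user-managed directory per partition value
    ALTER TABLE censusP ADD PARTITION (Year = '2012')
    LOCATION '/user/demo/census/2012';

    -- Dynamic partitions (Hive 0.6+) let INSERT route rows by partition value
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    SHOW PARTITIONS censusP;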

23 Storage – File Formats and SerDes
CREATE EXTERNAL TABLE census (…) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' STORED AS TEXTFILE | RCFILE | SEQUENCEFILE | AVRO

TEXTFILE is common, and useful when data is shared and all alphanumeric. Storage formats are extensible via custom input and output formats; the on-disk/in-memory representation is extensible via custom SerDes.

Supported file types (on-disk file formats): TEXTFILE, SEQUENCEFILE (key/value pairs), RCFILE (hybrid row-group and column-store format), AVRO (schema present with the data). TEXTFILE is all (international) alphanumeric plus delimiters; a line = a record. Hive is flexible/extensible and not limited to specific file formats; you can add new file formats when your formats are different or for a more efficient on-disk format. Documentation on how to add new SerDes and file formats is outside the scope of this talk. The Lazy SerDe doesn't materialize values until needed. A SerDe is a serializer/deserializer, a custom way of interpreting the data: ROW FORMAT SERDE 'org.apache….'

Common delimiters:
\t = tab
\001 = Control-A (default column separator)
\002 = Control-B (default ARRAY or STRUCT separator)
\003 = Control-C (default MAP separator)

STORED AS INPUTFORMAT 'org.apache…' OUTPUTFORMAT 'org.apache…'
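As a sketch of the SerDe clause, here is a table read through the RegexSerDe that ships in Hive's contrib jar; the apache_log table, its columns, and the pattern are illustrative:

    CREATE EXTERNAL TABLE apache_log (host STRING, request STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "([^ ]*) \"([^\"]*)\""  -- one capture group per column
    )
    STORED AS TEXTFILE
    LOCATION '/user/demo/logs';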

24 CREATE INDEX
CREATE INDEX census_population
ON TABLE census (population)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IN TABLE census_population_index;

ALTER INDEX census_population ON census REBUILD;

Key points: there are no keys; the index data is just another table; a REBUILD is required to include new data. SHOW FORMATTED INDEXES ON MyTable;

Indexing may help: it avoids many small partitions; it can help GROUP BY; you can index tables (including partitioning to a different granularity or a subset of partitions) or views. Indexing is very new, with few options, but extendable. EXPLAIN can help determine whether an index helps a query. DROP TABLE drops the associated index(es). The cost is disk space and processing, and it benefits only some queries. Deferred rebuild: the index data is not populated on creation. Index data is not updated as files are added or removed.

25 Create View
Sample code:
CREATE VIEW censusBigPop (state_fips, county_fips, population) AS
SELECT state_fips, county_fips, population
FROM census
WHERE population >
ORDER BY population;

SELECT * FROM censusBigPop;
DESCRIBE FORMATTED censusBigPop;

Key points: views are not materialized; a view can have an ORDER BY or a LIMIT.

26 Query
SELECT c.state_fips, c.county_fips, c.population
FROM census c
WHERE c.median_household_income >
GROUP BY c.state_fips, c.county_fips, c.population
ORDER BY county_fips
LIMIT 100;

Key points: Hive has minimal caching, statistics, or optimizer support, and it generally reads the entire data set for every query.

Performance: the order of columns and tables can make a difference to performance. Use partition elimination for range filtering. Some shortcuts are available to avoid spinning up the MapReduce infrastructure when data can be directly streamed, for example SELECT * or partition elimination.

Examples of common syntax supported by Hive (not a complete list): UNION, ROUND(), FLOOR(), RAND(), COUNT, SUM | AVG | MIN | MAX | VARIANCE, UPPER | LOWER, LIKE, RLIKE (regular expressions), CASE WHEN THEN. A few of these appear in the sketch below.
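The threshold and the regular expression here are made up for illustration:

    SELECT c.state_fips,
           ROUND(AVG(c.median_household_income)) AS avg_income,
           CASE WHEN SUM(c.population) > 1000000 THEN 'large'
                ELSE 'small' END AS size_class
    FROM census c
    WHERE c.keyid RLIKE '^[0-9]+'   -- regular-expression match
    GROUP BY c.state_fips
    LIMIT 100;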

27 Sorting
ORDER BY: one reducer does the final sort; this can be a big bottleneck.
SORT BY: sorted only within each reducer; much faster.
DISTRIBUTE BY: determines how map output is distributed to reducers.
SORT BY + DISTRIBUTE BY = CLUSTER BY: can mimic ORDER BY, with better performance if the distribution is even.
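A minimal sketch of those variants against the census table:

    -- Total order, one reducer (potential bottleneck):
    SELECT state_fips, population FROM census ORDER BY population;

    -- Parallel: route rows to reducers by state, sort within each reducer:
    SELECT state_fips, population FROM census
    DISTRIBUTE BY state_fips SORT BY population;

    -- Shorthand when distributing and sorting on the same column:
    SELECT state_fips, population FROM census CLUSTER BY state_fips;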

28 Joins
Supported Hive join types: equality joins; OUTER (LEFT, RIGHT, FULL); LEFT SEMI.
Not supported: non-equality joins; IN/EXISTS subqueries (rewrite them as a LEFT SEMI JOIN).

Hive does not support join conditions that are not equality conditions, as it is very difficult to express such conditions as a map/reduce job. More than two tables can be joined in Hive. You can rewrite IN/EXISTS queries using LEFT SEMI JOIN, as sketched below.
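storm_counties is a hypothetical lookup table:

    -- Not supported in this era of Hive:
    --   SELECT c.* FROM census c
    --   WHERE c.county_fips IN (SELECT county_fips FROM storm_counties);

    -- Equivalent LEFT SEMI JOIN (only left-table columns may be selected):
    SELECT c.*
    FROM census c
    LEFT SEMI JOIN storm_counties s ON (c.county_fips = s.county_fips);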

29 Joins – Characteristics
Multiple MapReduce jobs result unless the same join columns are used in all tables. Put the largest table last in the query to save memory. Joins are done left to right in query order. JOIN ON is completely evaluated before WHERE starts.

In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers, whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest table appears last in the sequence.
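As a sketch of that ordering, plus Hive's STREAMTABLE hint for when the query cannot be reordered; the states table is hypothetical:

    -- Default: the last table in the query (census, the larger one here) is
    -- streamed; the smaller states table is buffered in the reducers.
    SELECT s.state_name, c.population
    FROM states s
    JOIN census c ON (s.state_fips = c.state_fips);

    -- If the largest table cannot be placed last, hint which table to stream:
    SELECT /*+ STREAMTABLE(c) */ s.state_name, c.population
    FROM census c
    JOIN states s ON (s.state_fips = c.state_fips);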

30 EXPLAIN
EXPLAIN SELECT * FROM census;
EXPLAIN SELECT * FROM census WHERE population > ;
EXPLAIN EXTENDED SELECT * FROM census;

Characteristics: EXPLAIN does not execute the query. It shows the parsing and lists stages, temp files, dependencies, modes, output operators, etc.

EXPLAIN SELECT * FROM census:
ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME census))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF))))
STAGE DEPENDENCIES: Stage-0 is a root stage
STAGE PLANS: Stage: Stage-0 / Fetch Operator / limit: -1

EXPLAIN SELECT * FROM census WHERE population > :
ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME census))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (> (TOK_TABLE_OR_COL population) ))))
STAGE DEPENDENCIES: Stage-1 is a root stage; Stage-0 is a root stage
STAGE PLANS: Stage: Stage-1 / Map Reduce / Alias -> Map Operator Tree: census / TableScan (alias: census) / Filter Operator (predicate: expr: (population > ) type: boolean) / Select Operator (expressions: state_fips int, county_fips int, population bigint, pop_age_over_69 bigint, total_households bigint, median_household_income bigint, keyid string; outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6) / File Output Operator (compressed: false, GlobalTableId: 0, input format: org.apache.hadoop.mapred.TextInputFormat, output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat). Stage: Stage-0 / Fetch Operator / limit: -1

EXPLAIN EXTENDED SELECT * FROM census:
ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME census))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF))))
STAGE DEPENDENCIES: Stage-0 is a root stage
STAGE PLANS: Stage: Stage-0 / Fetch Operator / limit: -1

31 Configure Hive
Hive default configuration: <install-dir>/conf/hive-default.xml
Configuration variables: <install-dir>/conf/hive-site.xml
Hive configuration directory: set via the HIVE_CONF_DIR environment variable
Log4j configuration: <install-dir>/conf/hive-log4j.properties
Typical log location: c:\Hadoop\hive-0.9.0\logs\hive.log
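Settings from hive-site.xml can also be inspected and overridden per session; a small sketch, with an example property:

    -- In the Hive CLI:
    SET hive.exec.dynamic.partition;        -- show the current value
    SET hive.exec.dynamic.partition=true;   -- override for this session only
    SET -v;                                 -- dump all configuration variables

From the OS shell, an override can be passed at startup:
hive -hiveconf hive.root.logger=INFO,console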

32 Why Use Hive? BUZZ! Cross-pollinate your existing SQL skills!
Hive makes Hadoop cross-correlations, joins, and filters easier, and it allows storage of intermediate results for faster/easier querying. It is batch-based processing: individual queries are still often slower than a relational database, but end-to-end insight may come much faster.

33 BI on Big Data Gain Insights
Mash up Hive + other data in Excel. Use Hive as a data source to PowerPivot for in-memory analytics. Put Power View on top of PowerPivot for spectacular visualizations leading to insights. Securely share on SharePoint for collaboration, re-use, and centralized data.

Microsoft on top of Hadoop/Hive includes PowerPivot, Power View, Analysis Services, PDW, StreamInsight, SQL Server, SQL Azure, and Excel. CTPs of HDInsight Services (Azure) and HDInsight Server (on-prem) are available. Currently we have components, frameworks, or examples for: Hadoop, Hive, Hive ODBC Driver, Sqoop, JavaScript, Mahout, Pegasus, Pig, C#, ASV, Azure, .NET, HCatalog.

34 Big Deal Hive Hadoop Big Data Analytics to Insights

35 Next Steps Get Involved Read a bit
Read the Programming Hive book. Sign up for the Windows Azure HDInsight Service (Cloud CTP). Download Microsoft HDInsight Server (On-Prem CTP). Think about how you can fit Big Data into your company's data strategy. Suggest uses, and be prepared to combat misuses.

36 Big Data References Hadoop: The Definitive Guide by Tom White
SQL Server Sqoop; JavaScript; Twitter; Hive; Excel to Hadoop via Hive ODBC; Hadoop On Azure videos; Klout; Microsoft Big Data; Denny Lee; Carl Nolan; Cindy Gross

37 Microsoft Big Data at PASS Summit
Manage:
BIA-305-A SQLCAT: Big Data – All Abuzz About Hive. Wednesday 10:15am | Cindy Gross, Dipti Sangani, Ed Katibah
BIA-204-M MAD About Data: Solve Problems and Develop a "Data Driven Mindset". Wednesday 10:15am | Darwin Schweitzer
AD-300-M Bootstrapping Data Warehousing in Azure for Use with Hadoop. Thursday 10:15am | Steve Howard, James Podgorski, Olivier Matrat, Rafael Fernandez
Enrich:
BIA-306-M How Klout Changed the Landscape of Social Media with Hadoop and BI. Thursday 1:30pm | Denny Lee, Dave Mariani
AD-316-M Harnessing Big Data with Hadoop. Friday 8am | Mike Flasko
Insight:
DBA-410-S Big Data Meets SQL Server. Friday 9:45am | David DeWitt
AD-315-M NoSQL and Big Data Programmability. Friday 4:15pm | Michael Rys

38 Don’t Miss! Win prizes with new online evaluations
Build experience with Hands On Labs (NEW: TCC 304). Attend David DeWitt's spotlight session, Big Data Meets SQL Server (DBA-410-S, Room 6E, Friday 9:45 AM). Be SQL Server 2012 certified with onsite testing. Find hidden session announcements by following @sqlserver and #sqlpass. Visit the SQL Clinic and the new "I MADE THAT!" developer chalk talks (NEW: 4C-3 & 4C-4).

39 PASS Resources Free SQL Server and BI training
Free 1-day training events, regional events, local and virtual user groups, free online technical training, and the Community Learning Center.

40 Thank you for attending this session and the 2012 PASS Summit in Seattle

41 SQLCAT: Big Data – All Abuzz About Hive
Please fill out evaluations! SQLCAT: Big Data – All Abuzz About Hive. Cindy Gross, SQLCAT BI/Big Data PM, Microsoft, @SQLCindy. Dipti Sangani, SQL Big Data PM, Microsoft. Ed Katibah, SQLCAT Spatial PM, Microsoft, @Spatial_Ed. Speaker notes: phones silenced; thank the remote audience (video recording); ask the audience to use the mic for questions; evals via website/guidebook.

