© Hortonworks Inc. 2013 Stinger Initiative: Deep Dive Interactive Query on Hadoop Page 1 Chris Harris Twitter : cj_harris5.

Slides:



Advertisements
Similar presentations
Shark:SQL and Rich Analytics at Scale
Advertisements

MapReduce Online Tyson Condie UC Berkeley Slides by Kaixiang MO
CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
© Hortonworks Inc Daniel Dai Thejas Nair Page 1 Making Pig Fly Optimizing Data Processing on Hadoop.
A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses George Candea (EPFL & Aster Data) Neoklis Polyzotis (UC Santa Cruz) Radek Vingralek.
Technical BI Project Lifecycle
Spark: Cluster Computing with Working Sets
Reynold Xin Shark: Hive (SQL) on Spark. Stage 0: Map-Shuffle-Reduce Mapper(row) { fields = row.split("\t") emit(fields[0], fields[1]); } Reducer(key,
Clydesdale: Structured Data Processing on MapReduce Jackie.
Hive: A data warehouse on Hadoop
Putting the Sting in Hive Page 1 Alan F.
Mihai Pintea. 2 Agenda Hadoop and MongoDB DataDirect driver What is Big Data.
Hadoop tutorials. Todays agenda Hadoop Introduction and Architecture Hadoop Distributed File System MapReduce Spark 2.
CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.
Raghav Ayyamani. Copyright Ellis Horowitz, Why Another Data Warehousing System? Problem : Data, data and more data Several TBs of data everyday.
Hive – A Warehousing Solution Over a Map-Reduce Framework Presented by: Atul Bohara Feb 18, 2014.
SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL.
Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
HADOOP ADMIN: Session -2
Hadoop & Cheetah. Key words Cluster  data center – Lots of machines thousands Node  a server in a data center – Commodity device fails very easily Slot.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Analytics Map Reduce Query Insight Hive Pig Hadoop SQL Map Reduce Business Intelligence Predictive Operational Interactive Visualization Exploratory.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
Committed to Deliver….  We are Leaders in Hadoop Ecosystem.  We support, maintain, monitor and provide services over Hadoop whether you run apache Hadoop,
1.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Database Performance Tuning and Query Optimization.
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
MapReduce With a SQL-MapReduce focus by Curt A. Monash, Ph.D. President, Monash Research Editor, DBMS2
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
Introduction to Hadoop and HDFS
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
Satish Ramanan April 16, AGENDA Context Why - Integrate Search with BI? How - do we get there? - Tool Strategy What - is in it for me ? - Outcomes.
An Introduction to HDInsight June 27 th,
Frontiers in Massive Data Analysis Chapter 3.  Difficult to include data from multiple sources  Each organization develops a unique way of representing.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
© Hortonworks Inc Putting the Sting in Hive Alan Gates
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Matthew Winter and Ned Shawa
© Hortonworks Inc Hadoop: Beyond MapReduce Steve Loughran, Big Data workshop, June 2013.
1 Database Systems, 8 th Edition 1 Chapter 13 Business Intelligence and Data Warehouses Objectives In this chapter, you will learn: –How business intelligence.
Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.
Page 1 © Hortonworks Inc – All Rights Reserved Hive: Data Organization for Performance Gopal Vijayaraghavan.
Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.
Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.
Please note that the session topic has changed
Big Data Yuan Xue CS 292 Special topics on.
Moscow, November 16th, 2011 The Hadoop Ecosystem Kai Voigt, Cloudera Inc.
1 Copyright © 2008, Oracle. All rights reserved. Repository Basics.
Apache Tez : Accelerating Hadoop Query Processing Page 1.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Image taken from: slideshare
HIVE A Warehousing Solution Over a MapReduce Framework
Hadoop.
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
Spark Presentation.
Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise OCJUG, 2014.
SQOOP.
Hive on steroid Project stinger.
Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.
MapReduce Simplied Data Processing on Large Clusters
SQL 2014 In-Memory OLTP What, Why, and How
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Server & Tools Business
Overview of big data tools
Big-Data Analytics with Azure HDInsight
Presentation transcript:

© Hortonworks Inc Stinger Initiative: Deep Dive Interactive Query on Hadoop Page 1 Chris Harris Twitter : cj_harris5

© Hortonworks Inc Agenda Key Hive Use Cases Brief Refresher on Hive The Stinger Initiative: Interactive Query for Hive Page 2

© Hortonworks Inc Key Hive Use Cases RDBMS / MPP Offload –More data under query. –Database unable to keep up with SLAs. Analysis of semi-structured data. ETL / Data Refinement +++ Increasingly: Business Intelligence and interactive query Page 3

© Hortonworks Inc BI Use Cases Page 4 Enterprise Reports Dashboard / Scorecard Parameterized Reports Visualization Data Mining

© Hortonworks Inc Organize Tiers and Process with Metadata Page 5 Work Tier Work Tier Standardize, Cleanse, Transform MapReduce Pig Raw Tier Raw Tier Extract & Load WebHDF S Flume Sqoop Gold Tier Gold Tier Transform, Integrate, Storage MapReduce Pig Conform, Summarize, Access HiveQL Pig Access Tier HCat Provides unified metadata access to Pig, Hive & MapReduce Organize data based on source/derived relationships Allows for fault and rebuild process

© Hortonworks Inc Hive Current Focus Area Page 6 Online systems R-T analytics CEP Real-TimeInteractiveBatch Parameterized Reports Drilldown Visualization Exploration Operational batch processing Enterprise Reports Data Mining Data Size 0-5s5s – 1m 1m – 1h 1h+ Non- Interactive Data preparation Incremental batch processing Dashboards / Scorecards

© Hortonworks Inc Stinger: Extending Hive’s Sweetspot Page 7 Online systems R-T analytics CEP Real-TimeInteractiveBatch Parameterized Reports Drilldown Visualization Exploration Operational batch processing Enterprise Reports Data Mining Data Size 0-5s5s – 1m 1m – 1h 1h+ Non- Interactive Data preparation Incremental batch processing Dashboards / Scorecards Improve Latency & Throughput Query engine improvements New “Optimized RCFile” column store Next-gen runtime (elim’s M/R latency) Extend Deep Analytical Ability Analytics functions Improved SQL coverage Continued focus on core Hive use cases

© Hortonworks Inc The top BI vendors support Hive today Page 8

© Hortonworks Inc Agenda Key Hive Use Cases Brief Refresher on Hive The Stinger Initiative: Interactive Query for Hive Page 9

© Hortonworks Inc Brief Refresher on Hive The State of Hive Today (0.10) Page 10

© Hortonworks Inc Hive’s Origins Page 11 Hive was originally developed at Facebook. More data than existing RDBMS could handle. 60,000+ Hive queries per day. More than 1,000 users per day PB of data. 15+ TB of data loaded daily. Hive is a proven solution at extreme scale.

© Hortonworks Inc Hive 0.10 Capabilities De-facto SQL Interface for Hadoop Multiple persistence options: –Flat text for simple data imports. –Columnar format (RCFile) for high performance processing. Secure and concurrent remote access ODBC/JDBC connectivity Highly extensible: –Supports User Defined Functions and User Defined Aggregation Functions. –Ships with more than 150 UDF/UDAF. –Extensible readers/writers can process any persisted data. Support from 10+ BI vendors Page 12

© Hortonworks Inc HDP 1.2: ODBC Access for Popular BI Tools Page 13 Seamless integration with BI tools such as Excel, PowerPivot, MicroStrategy, and Tableau Efficiently maps advanced SQL functionality into HiveQL –With configurable pass-through of HiveQL for Hive-aware apps ODBC 3.52 standard compliant Supports Linux & Windows High quality ODBC driver developed in partnership with Simba. Free to download & use with Hortonworks Data Platform. Applications & Spreadsheets Visualization & Intelligence ODBC Hortonworks Data Platform

© Hortonworks Inc to Big Data in 15 Minutes Page 14 Hands on tutorials integrated into Sandbox HDP environment for evaluation

© Hortonworks Inc Agenda Brief Refresher on Hive Key Hive Use Cases The Stinger Initiative: Interactive Query for Hive Page 15

© Hortonworks Inc The Stinger Initiative Interactive Query on Hadoop Page 16

© Hortonworks Inc Stinger Initiative: 2-Pronged Approach Page 17 Tez New primitives move beyond map-reduce and beyond batch Avoid unnecessary persistence of temporary data Hive, Pig and others generate Tez plans for high perf Query Engine Improvements Cost-based optimizer In-memory joins Caching hot tables Vector processing State-of-the-art Column Store “Optimized RCFile” or ORCFile Minimizes disk IO and deserialization Tez Service Always-on service for query interactivity Improve Latency and Throughput Analytics Functions SQL:2003 Compliant OVER with PARTITION BY and ORDER BY Wide variety of windowing functions: RANK LEAD/LAG ROW_NUMBER FIRST_VALUE LAST_VALUE Many more Aligns well with BI ecosystem Improved SQL Coverage Non-correlated Subqueries using IN in WHERE Expanded SQL types including DATETIME, VARCHAR, etc. Extend Deep Analytical Ability Making Hive Best for Interactive Query

© Hortonworks Inc Hive: Performance Improvements Page 18

© Hortonworks Inc Stinger Initiative At A Glance Page 19

© Hortonworks Inc Base Optimizations: Intelligent Optimizer Introduction of In-Memory Hash Join: –For joins where one side fits in memory: –New in-memory-hash-join algorithm. –Hive reads the small table into a hash table. –Scans through the big file to produce the output. Introduction of Sort-Merge-Bucket Join: –Applies when tables are bucketed on the same key. –Dramatic speed improvements seen in benchmarks. Other Improvements: –Lower the footprint of the fact tables in memory. –Enable the optimizer to automatically pick map joins. Page 20

© Hortonworks Inc Dimensionally Structured Data Extremely common pattern in EDW. Results in large “fact tables” and small “dimension tables”. Dimension tables often small enough to fit in RAM. Sometimes called Star Schema. Page 21

© Hortonworks Inc A Query on Dimensional Data Derived from TPC-DS Query 27 Dramatic speedup on Hive 0.11 Page 22 SELECT col5, avg(col6) FROM fact_table join dim1 on (fact_table.col1 = dim1.col1) join dim2 on (fact_table.col2 = dim2.col1) join dim3 on (fact_table.col3 = dim3.col1) join dim4 on (fact_table.col4 = dim4.col1) GROUP BY col5 ORDER BY col5 LIMIT 100;

© Hortonworks Inc Star Schema Join Improvements in 0.11 Page 23

© Hortonworks Inc Hive: Bucketing Bucketing causes Hive to physically co-locate rows within files. Buckets can be sorted or unsorted. Page 24 CREATE EXTERNAL TABLE IF NOT EXISTS test_table ( Id INT, name String ) PARTITIONED BY (dt STRING, hour STRING) CLUSTERED BY(country,continent) SORTED BY(country,continent) INTO n BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/home/test_dir';

© Hortonworks Inc ORCFile - Optimized Column Storage Make a better columnar storage file –Tightly aligned to Hive data model Decompose complex row types into primitive fields –Better compression and projection Only read bytes from HDFS for the required columns. Store column level aggregates in the files –Only need to read the file meta information for common queries –Stored both for file and each section of a file –Aggregates: min, max, sum, average, count –Allows fast access by sorted columns Ability to add bloom filters for columns –Enables quick checks for whether a value is present Page 25

© Hortonworks Inc Performance Futures - Vectorization Operates on blocks of 1K or more records, rather than one record at a time Each block contains an array of Java scalars, one for each column Avoids many function calls, virtual dispatch, CPU pipeline stalls Size to fit in L1 cache, avoid cache misses Generate code for operators on the fly to avoid branches in code, maximize deep pipelines of modern processers Up to 30x faster processing of records Beta possible in 2H 2013 Page 26

© Hortonworks Inc Performance Futures – Cost-Based Optimizer Generate more intelligent DAGs based on properties of data being queried, e.g. table size, statistics, histograms, etc. Page 27

© Hortonworks Inc Performance Futures - Buffering Query workloads always have hotspots: –Metadata –Small dimension tables Build into YARN or Tez Service ways of buffering frequently used data into memory so it is not always read from disk. Part of the “last mile” of latency efforts. Page 28

© Hortonworks Inc Yarn Moving Hive and Hadoop beyond MapReduce Page 29

© Hortonworks Inc Hadoop 2.0 Innovations - YARN Focus on scale and innovation –Support 10,000+ computer clusters –Extensible to encourage innovation Next generation execution –Improves MapReduce performance Supports new frameworks beyond MapReduce –Low latency, Streaming, Services –Do more with a single Hadoop cluster HDFS MapReduce Redundant, Reliable Storage YARN: Cluster Resource Management TezGraph ProcessingOther

© Hortonworks Inc Tez Moving Hive and Hadoop beyond MapReduce Page 31

© Hortonworks Inc Tez Low level data-processing execution engine Use it for the base of MapReduce, Hive, Pig, Cascading etc. Enables pipelining of jobs Removes task and job launch times Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline Does not write intermediate output to HDFS –Much lighter disk and network usage Built on YARN Page 32

© Hortonworks Inc Tez - Core Idea Task with pluggable Input, Processor & Output Page 33 YARN ApplicationMaster to run DAG of Tez Tasks Input Processor Task Output Tez Task -

© Hortonworks Inc Tez – Blocks for building tasks MapReduce ‘Map’ Page 34 MapReduce ‘Reduce’ HDFS Input Map Processor MapReduce ‘Map’ Task Sorted Output Shuffle Input Reduce Processor HDFS Output Intermediate ‘Reduce’ for Map-Reduce-Reduce Shuffle Input Reduce Processor Intermediate ‘Reduce’ for Map- Reduce-Reduce Sorted Output MapReduce ‘Reduce’ Task

© Hortonworks Inc Tez – More tasks Special Pig/Hive ‘Map’ Page 35 In-memory Map HDFS Input Map Processor Tez Task Pipeline Sorter Output HDFSIn put Map Processor Tez Task In- memory Sorted Output Special Pig/Hive ‘Reduce’ Shuffle Skip- merge Input Reduce Processor Tez Task Sorted Output

© Hortonworks Inc Pig/Hive-MR versus Pig/Hive-Tez Page 36 SELECT a.state, COUNT(*), AVERAGE(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state Pig/Hive - MR Pig/Hive - Tez I/O Synchronization Barrier I/O Synchronization Barrier Job 1 Job 2 Job 3 Single Job

© Hortonworks Inc FastQuery: Beyond Batch with YARN Page 37 Tez Generalizes Map-Reduce Simplified execution plans process data more efficiently Always-On Tez Service Low latency processing for all Hadoop data processing

© Hortonworks Inc Tez Service MR Query Startup Expensive –Job launch & task-launch latencies are fatal for short queries (in order of 5s to 30s) Solution –Tez Service –Removes task-launch overhead –Removes job-launch overhead –Hive/Pig –Submit query-plan to Tez Service –Native Hadoop service, not ad-hoc Page 38

© Hortonworks Inc Tez Service Delivers Low Latency Page 39 SELECT a.state, COUNT(*), AVERAGE(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state Existing Hive Parse Query0.5s Create Plan0.5s Launch Map-Reduce20s Process Map- Reduce 10s Total31s Hive/Tez Parse Query0.5s Create Plan0.5s Launch Map-Reduce20s Process Map- Reduce 2s Total23s Tez and Tez Service Parse Query0.5s Create Plan0.5s Submit to Tez Service0.5s Process Map-Reduce2s Total3.5s * Numbers for illustration only

© Hortonworks Inc Recap and Questions: Hive Performance Page 40

© Hortonworks Inc Improving Hive’s SQL Support Page 41

© Hortonworks Inc Stinger: Deep Analytical Capabilities SQL:2003 Window Functions –OVER clauses –Multiple PARTITION BY and ORDER BY supported –Windowing supported (ROWS PRECEDING/FOLLOWING) –Large variety of aggregates –RANK –FIRST_VALUE –LAST_VALUE –LEAD / LAG –Distrubutions Page 42

© Hortonworks Inc Hive Data Type Conformance Data Types: –Add fixed point NUMERIC and DECIMAL type (in progress) –Add VARCHAR and CHAR types with limited field size –Add DATETIME –Add size ranges from 1 to 53 for FLOAT –Add synonyms for compatibility –BLOB for BINARY –TEXT for STRING –REAL for FLOAT SQL Semantics: –Sub-queries in IN, NOT IN, HAVING. –EXISTS and NOT EXISTS Page 43

© Hortonworks Inc Questions? Page 44

© Hortonworks Inc Thank You! Questions & Answers Page 45