1 Putting the Sting in Hive
Alan F. Gates @alanfgates

2 Stinger Overview
An initiative, not a project or product
Includes changes to Hive and a new project, Tez
Two main goals
  – Improve Hive performance 100x over Hive 0.10
  – Extend Hive SQL to include features needed for analytics
Hive will support:
  – BI tools connecting to Hadoop
  – Analysts performing ad-hoc, interactive queries
  – Still excellent at the large batch jobs it is used for today
© 2013 Hortonworks

3 Stinger Mileposts
Stinger Phase 1 (released in Hive 0.11)
  – Base Optimizations
  – SQL Analytics
  – ORCFile Format
Stinger Phase 2 (current work)
  – YARN Resource Mgmnt
  – Hive on Apache Tez
  – Query Service
  – Vectorized Operators
Stinger Phase 3 (roadmap)
  – Buffer Cache
  – Cost Based Optimizer
1 Improve existing tools & preserve investments
2 Enable Hive to support interactive workloads

4 Hive Performance Gains in 0.11
Enable star joins by improving Hive’s map join (aka broadcast join)
  – Where possible, do in a single map-only task
  – When not possible, push larger tables to separate tasks
Collapse adjacent jobs where possible
  – Hive has lots of M->MR type plans; collapse these to MR
  – Collapse adjacent jobs on sufficiently similar keys when feasible
      – join followed by group
      – join followed by order
      – group followed by order
Improvements in sort merge bucket (SMB) joins
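The map (broadcast) join idea above can be sketched in a few lines. This is a minimal illustration in plain Python, not Hive's implementation: the function name `map_join` and the sample tables are hypothetical. It assumes the small dimension table fits in memory, so each map task can build a hash table from it and stream the large table past it with no shuffle or reduce step.

```python
from collections import defaultdict

def map_join(fact_rows, dim_rows, fact_key, dim_key):
    """Inner-join a large table to a small one without a shuffle (sketch)."""
    # Build phase: load the small table into an in-memory hash table.
    # On a cluster this table would be broadcast to every map task.
    lookup = defaultdict(list)
    for dim in dim_rows:
        lookup[dim[dim_key]].append(dim)

    # Probe phase: stream the large table once; no reduce step is needed.
    for fact in fact_rows:
        for dim in lookup.get(fact[fact_key], []):
            yield {**dim, **fact}

# Hypothetical star-schema data: one fact table, one dimension table.
stores = [{"store_id": 1, "region": "west"}, {"store_id": 2, "region": "east"}]
sales = [
    {"store_id": 1, "amount": 10},
    {"store_id": 2, "amount": 5},
    {"store_id": 1, "amount": 7},
    {"store_id": 3, "amount": 99},  # no matching store: dropped by the inner join
]

joined = list(map_join(sales, stores, "store_id", "store_id"))
```

With several small dimension tables, the same probe loop can consult one hash table per dimension, which is what lets a star join run in a single map-only pass.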

5 Improvements in Map Joins
TPC-DS Query 27, Scale=200, 10 EC2 nodes (40 disks)

6 Before

7 After

8 Improvements in SMB Joins
TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks)

9 New Technologies in Hive
All covered in depth in other talks
  – See Owen’s, Eric’s, and Jitendra’s talk "ORC File & Vectorization" at 4:25 today
Tez: a new execution engine for relational tools such as Hive
  – No need to use MapReduce; instead provides general DAG execution
  – Data moved between tasks via socket, disk, or HDFS based on the performance / restartability trade-off
  – Provides a standing service to greatly reduce query start time
ORCFile: a rewrite of RCFile
  – Columnar
  – Tightly integrated with Hive’s type model, including support for nested types
  – Much better compression
  – Supports projection and filter push down
Vectorization: rewriting operators to take advantage of modern processors
  – Based on work done in MonetDB
  – Rewrite operators to radically reduce the number of function calls, branch prediction misses, and cache misses
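The function-call reduction that vectorization aims for can be shown with a small sketch. This is an illustration in plain Python, not Hive's operator code: the names `add_row` and `add_batch` and the batch size are assumptions. Python cannot show the branch-prediction or cache effects, so the sketch only counts how many times each style of operator is invoked.

```python
BATCH_SIZE = 1024  # assumed batch size, chosen only for illustration

def add_row(price, tax):
    """Row-at-a-time operator: invoked once per row."""
    return price + tax

def add_batch(prices, taxes):
    """Vectorized operator: invoked once per batch, loops over plain arrays."""
    return [p + t for p, t in zip(prices, taxes)]

prices = list(range(5000))
taxes = [1] * 5000

# Row-at-a-time: one operator call for every row.
row_calls = 0
row_result = []
for p, t in zip(prices, taxes):
    row_result.append(add_row(p, t))
    row_calls += 1

# Vectorized: one operator call per batch of column values.
batch_calls = 0
batch_result = []
for start in range(0, len(prices), BATCH_SIZE):
    batch_result.extend(
        add_batch(prices[start:start + BATCH_SIZE], taxes[start:start + BATCH_SIZE])
    )
    batch_calls += 1
```

Both loops produce the same values; the batched version makes one call per 1024 rows instead of one per row, and its inner loop touches only contiguous column data.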

10 Standard Queries

11 Performance Trajectory

12 Query 12: Demonstrating MRR

13 Hive Performance Up Next
Push down start up time: even for queries that spend less than a second running on the cluster, there is ~15 seconds of start up time
  – Tez service will remove Hadoop startup issues
  – Need to reduce time for metadata access
  – Need intelligent file caching so that hot tables can be kept in memory
Keep working on the optimizer
  – YSmart work from Ohio State University
  – Start using statistics to make intelligent decisions about how many mappers and reducers to spawn (maybe in Hive, maybe in Tez)
  – Start using statistics to choose between competing plan options
Buffer Cache
  – Coordinate with HDFS team to determine caching strategy

14 Extending Hive SQL in 0.11
DECIMAL data type, for fixed precision calculation (e.g. currency)
OVER clause
  – PARTITION BY, ORDER BY, ROWS BETWEEN/FOLLOWING/PRECEDING
  – Works with existing aggregate functions
  – New analytic and window functions added
  – ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, FIRST_VALUE, LAST_VALUE, NTILE, CUME_DIST, PERCENT_RANK
SELECT salesperson, AVG(salesprice) OVER (PARTITION BY region ORDER BY date
    ROWS BETWEEN 10 PRECEDING AND 10 FOLLOWING)
FROM sales;
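To make the frame semantics of the query above concrete, here is a minimal Python sketch of what AVG over a ROWS BETWEEN n PRECEDING AND m FOLLOWING frame computes. The function name `windowed_avg` and the sample rows are hypothetical, and the example uses a frame of 1 row on each side rather than 10 to keep the data short. It illustrates the semantics only and says nothing about how Hive executes the query.

```python
from collections import defaultdict

def windowed_avg(rows, preceding, following):
    """AVG(salesprice) OVER (PARTITION BY region ORDER BY date
    ROWS BETWEEN preceding PRECEDING AND following FOLLOWING), as a sketch."""
    # PARTITION BY region
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["region"]].append(row)

    out = []
    for region_rows in partitions.values():
        # ORDER BY date within each partition
        region_rows.sort(key=lambda r: r["date"])
        for i, row in enumerate(region_rows):
            # The frame is clipped at the partition boundaries.
            lo = max(0, i - preceding)
            hi = min(len(region_rows), i + following + 1)
            frame = [r["salesprice"] for r in region_rows[lo:hi]]
            out.append((row["salesperson"], sum(frame) / len(frame)))
    return out

# Hypothetical sample data.
sales = [
    {"salesperson": "ann", "region": "west", "date": "2013-01-01", "salesprice": 10.0},
    {"salesperson": "bob", "region": "west", "date": "2013-01-02", "salesprice": 20.0},
    {"salesperson": "cal", "region": "west", "date": "2013-01-03", "salesprice": 60.0},
    {"salesperson": "dee", "region": "east", "date": "2013-01-01", "salesprice": 5.0},
]

result = windowed_avg(sales, preceding=1, following=1)
```

Unlike GROUP BY, every input row produces one output row; the aggregate is recomputed over a sliding frame that never crosses a partition boundary.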

15 Extending Hive SQL Post 0.11
Subqueries in WHERE
  – Non-correlated first
  – [NOT] IN first, then extend to (in)equalities and EXISTS
Datatype conformance: Hive has a Java type model; add support for SQL types:
  – DATE
  – CHAR() and VARCHAR()
  – add precision and scale to decimal and float
  – aliases for standard SQL types (BLOB = binary, CLOB = string, integer = int, real/number = decimal)
Security
  – Add security checks to views, indices, functions, etc.
  – Secure GRANT and REVOKE
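As an illustration of what adding precision and scale to decimal means, the sketch below models a SQL-style DECIMAL(precision, scale) with Python's `decimal` module. The helper `to_decimal` is hypothetical, and both its rounding mode and its choice to return None on overflow are assumptions made for the example, not a statement of Hive's behavior. It also shows why a fixed-precision type suits currency better than binary floating point.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_decimal(value, precision, scale):
    """Coerce value to a DECIMAL(precision, scale)-like number (sketch).

    precision = total number of digits, scale = digits after the decimal point.
    Returns None when the value does not fit (an assumed overflow policy).
    """
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives Decimal('0.01')
    d = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    digits = len(d.as_tuple().digits)
    return d if digits <= precision else None

fits = to_decimal("123.456", precision=5, scale=2)      # rounded to 2 decimal places
too_wide = to_decimal("1234.5", precision=5, scale=2)   # needs 6 digits, does not fit

# Binary floats cannot represent 0.1 exactly; decimals can.
float_sum_is_exact = (0.1 + 0.2 == 0.3)
decimal_sum_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
```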

16 Questions

