Apache Hive in 2017: Faster, Friendlier, Futuristic
Thejas Nair, Hortonworks
Apache Hive in 2017
- Faster: lower latency and higher throughput
- Friendlier: enterprise and ecosystem friendly features
- Futuristic: working with the new norm of the future
Hive 2 with LLAP: Architecture Overview
[Diagram: SQL queries arrive over ODBC/JDBC at HiveServer2, the query endpoint. Query coordinators dispatch work to LLAP daemons running in the YARN cluster; each daemon hosts query executors and an in-memory cache shared across all users. Deep storage is HDFS or a compatible store: S3, WASB, Isilon.]
Efficiency for individual queries
- Eliminates container startup costs
- The JIT optimizer has a chance to work (especially for vectorization)
- Data sharing (hash join tables, etc.)
LLAP – 2017
- Expand your cache with SSD (a configuration sketch follows below)
- Cache support added for text and Parquet formats
- Scheduling improvements
- Monitoring: LLAP UI, per-query log file
- Faster startup time
- Now GA and battle tested: stability improvements, bug fixes, debuggability
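A minimal configuration sketch for the SSD-backed cache, assuming Hive 2.x LLAP property names; these are daemon-level settings (normally placed in the LLAP service configuration rather than set per session), and the size and path values are hypothetical:

    -- LLAP I/O cache settings (daemon-level; shown as SET statements for brevity)
    SET hive.llap.io.enabled=true;                   -- turn on the LLAP I/O and cache layer
    SET hive.llap.io.memory.size=64Gb;               -- hypothetical in-memory cache size
    SET hive.llap.io.allocator.mmap=true;            -- back the cache with memory-mapped files (SSD)
    SET hive.llap.io.allocator.mmap.path=/ssd/llap;  -- hypothetical SSD mount point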
Monitoring
Vectorization & SIMD Operations on vector of columns from block of rows More vectorized code operators and data type (decimal) SIMD – Single instruction multiple data Subtle changes to expression evaluations to make it Java SIMD friendly
Better query plans: improvements to the Cost Based Optimizer
- Improvements to statistics collection
- Statistics for columns: number of distinct values, min, max
- Faster, automatic gathering
- Modern algorithms to estimate distinct value counts (HyperLogLog); see the example below
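A short example of gathering the statistics the optimizer relies on, using a hypothetical sales table; the distinct-value counts behind FOR COLUMNS are what the HyperLogLog-style estimation speeds up:

    -- Basic table statistics (row count, size):
    ANALYZE TABLE sales COMPUTE STATISTICS;
    -- Column statistics: number of distinct values, min, max, null count:
    ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
    -- Inspect the collected statistics for a column:
    DESCRIBE FORMATTED sales total_amount;
    -- Automatic gathering on insert can be toggled with:
    SET hive.stats.column.autogather = true;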
Query plan optimization example
Join reordering
- Capable of generating bushy join operator trees: both inputs of a join operator can receive results from other join operators
- Well suited for parallel execution engines: increases the degree of parallelism, performance, and cluster utilization
- Combines exhaustive search with a greedy algorithm:
  - Exhaustive search finds every possible plan using rewriting rules, but is not practical for a large number of joins
  - The greedy algorithm builds the plan iteratively, using a heuristic to choose the best join to add next
[Diagram: a left-deep join tree vs. a bushy join tree]
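A hedged sketch of exercising this: enable the cost-based optimizer (column statistics from the previous slide must already exist) and inspect the join order it picks. The three-way join and table names (store_sales, customer, item) are hypothetical:

    SET hive.cbo.enable = true;
    -- EXPLAIN shows the join order the optimizer chose (possibly a bushy tree):
    EXPLAIN
    SELECT c.name, i.brand, SUM(s.amount) AS total
    FROM store_sales s
    JOIN customer c ON s.customer_id = c.id
    JOIN item i ON s.item_id = i.id
    GROUP BY c.name, i.brand;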
New smarts – Dynamic runtime filtering
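The slide itself is a diagram; as a rough sketch, dynamic runtime filtering in Hive on Tez is exposed through dynamic partition pruning and semijoin reduction (property names per Hive 2.x; verify against your version):

    SET hive.tez.dynamic.partition.pruning = true;   -- prune partitions using join-key values seen at runtime
    SET hive.tez.dynamic.semijoin.reduction = true;  -- build min/max and Bloom filters from the small join side
    -- At runtime, values from the small side of a join are used to skip
    -- non-matching rows and partitions on the large side before the join runs.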
Faster compilation: a few hundred milliseconds is now too slow!
Metastore cache:
- Strongly consistent for the single-metastore case
- Background refresh of the cache
Hive 2 + LLAP vs. Hive 1: 26x faster
Hive 2 LLAP vs. Impala: TPC-DS 10 TB on 9 nodes
[Chart: Total Runtime in seconds (lower is better) and TPC-DS Queries Supported (higher is better) — 60 vs. 99 of 99 queries supported.]
What Do You Expect in a Data Warehouse?
- High performance
- SQL:2011 support
- Support for BI, cubes, data science
- Monitoring & management
- High storage capacity
- Governance
- Security
- Replication & disaster recovery
- Data lifecycle management
- Workload management
- Data ingestion
Enterprise friendly features in 2017
- Replication & disaster recovery
- Advanced security
- Advanced security with Spark
- Handling slowly changing dimension tables
Hive replication
- Event-based replication built on metastore events
- Replicates metadata and data changes
- Replication can be at the database or table level
- Master–slave, point-in-time replication
- Parallel data transfer using distcp
- New commands in Hive expose the capabilities (see the sketch below)
- Replicates views and permanent functions (including jars) as well
- Seamless automation using the Hortonworks Data Plane service
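A brief sketch of the new commands, assuming a hypothetical sales_db and dump path; REPL DUMP runs on the master and returns a dump location and last event id, which REPL LOAD consumes on the slave:

    -- On the master: dump metadata and data events for the database.
    REPL DUMP sales_db;
    -- On the slave: apply the dump from the location REPL DUMP returned.
    REPL LOAD sales_db FROM '/apps/hive/repl/dump_1';
    -- On the slave: show the last event id applied, for point-in-time tracking.
    REPL STATUS sales_db;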
Apache Ranger: Per-User Row Filtering by Region in Hive
Original query: SELECT * FROM CUSTOMERS WHERE total_spend > 10000
Query rewrites based on dynamic Ranger policies:
- User 1 (West Region): SELECT * FROM CUSTOMERS WHERE total_spend > 10000 AND region = 'west'
- User 2 (East Region): SELECT * FROM CUSTOMERS WHERE total_spend > 10000 AND region = 'east'
Data access goes through LLAP.
Sample CUSTOMERS data:
User ID | Region | Total Spend
1       | East   | 5,131
2       |        | 27,828
3       | West   | 55,493
4       |        | 7,193
5       |        | 18,193
Apache Ranger: Dynamic Data Masking of Hive Columns
Protect sensitive data in real time with dynamic data masking/obfuscation.
Goal: mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) in Hive query output.
Benefits:
- Sensitive information never leaves the database
- No changes required at the application or Hive layer
- No need to produce additional protected duplicate versions of datasets
- Simple, easy-to-set-up masking policies
Core technologies: Ranger, Hive (an illustrative sketch follows below)
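An illustrative-only sketch of the user experience, assuming a Ranger masking policy of type "partial mask: show last 4" on a hypothetical customers.ssn column; the rewrite happens server-side, so the query itself is unchanged:

    SELECT id, ssn FROM customers LIMIT 1;
    -- With the masking policy in place, the output resembles:
    --   1    xxx-xx-6789
    -- Without the policy, the raw SSN would be returned.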
Spark fine-grained security with LLAP
Fine-grained column-level access control for SparkSQL: fully dynamic per-user policies, no views required. Standard Ranger policies and tools control access and masking.
Flow:
1. SparkSQL gets data locations known as "splits" from HiveServer2 and plans the query.
2. HiveServer2 authorizes access using Ranger; per-user policies such as row filtering are applied.
3. Spark gets a modified query plan based on the dynamic security policy.
4. Spark reads data from LLAP; filtering and masking are guaranteed by the LLAP server.
[Diagram: Spark client, HiveServer2 (authorization; data locations and view definitions via the Hive Metastore), Ranger server (dynamic policies), LLAP (data read with filter pushdown).]
Updating existing data using ACID merge
Target is the table inside the warehouse; the source table contains the changes to apply.

Target (before):
ID  | State | County  | Value
1   | CA    | LA      | 19.0
2   | MA    | Norfolk | 15.0
7   |       | Suffolk | 50.15
16  |       | Orange  | 9.1

Source:
ID  | State | Value
1   |       | 20.0
7   |       | 80.0
100 | NH    | 6.0

MERGE INTO TARGET T USING SOURCE S ON T.ID = S.ID
WHEN MATCHED THEN UPDATE SET T.Value = S.Value
WHEN NOT MATCHED THEN INSERT (ID, State, Value) VALUES (S.ID, S.State, S.Value);

Target (after):
ID  | State | County  | Value
1   | CA    | LA      | 20.0
2   | MA    | Norfolk | 15.0
7   |       | Suffolk | 80.0
16  |       | Orange  | 9.1
100 | NH    | null    | 6.0
Without MERGE, the same change takes one statement per row: much less efficient.

UPDATE Target SET Value = 20.0 WHERE ID = 1;
UPDATE Target SET Value = 80.0 WHERE ID = 7;
INSERT INTO Target (ID, State, Value) VALUES (100, 'NH', 6.0);
Futuristic trends => the new normal
- Big data in the cloud
- Real-time ingest (IoT)
Hive in the cloud
- Reads from cloud filesystems are slower than HDFS => LLAP cache (with SSD)
- Move operations on S3 are expensive => insert-only transactional tables that work with any file format (WIP; see the sketch below)
- A cloud-hosted RDBMS can also be slow or throttled => metastore caching
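A hedged example of an insert-only transactional table; the slide marks this work-in-progress, and the syntax below is from later Hive releases, so treat it as a sketch (events is a hypothetical table):

    CREATE TABLE events (id BIGINT, payload STRING)
    STORED AS PARQUET                    -- insert-only ACID works with any file format
    TBLPROPERTIES (
      'transactional' = 'true',
      'transactional_properties' = 'insert_only'
    );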
Real-time ingest using Druid
- Column-oriented distributed data store
- Batch and real-time ingestion
- Scalable to petabytes of data
- Sub-second response for arbitrary time-based slice-and-dice
- Data partitioned by the time dimension
- Automatic data summarization
- Approximate algorithms (HyperLogLog, theta sketches)
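A short sketch of exposing an existing Druid datasource to Hive through the Druid storage handler (Hive 2.2+); the datasource name "wikipedia" is hypothetical, and the Druid broker address must be configured separately:

    CREATE EXTERNAL TABLE druid_table_1
    STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
    TBLPROPERTIES ("druid.datasource" = "wikipedia");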
Druid query recognition (powered by Apache Calcite)
Apache Hive SQL query (top 10 users that added the most characters from the beginning of 2010 until the end of 2011):

SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;

Query logical plan (top to bottom): Sink → Sort Limit → Aggregate → Filter → Project → Druid Scan
Druid query recognition (powered by Apache Calcite)
Rewriting rules push computation into Druid; Hive must check that an operator meets certain preconditions before pushing it down.
[Diagram: the Sort Limit, Aggregate, Filter, and Project operators over the Druid Scan are rewritten into a single Druid groupBy query.]
Hive in 2018
Keep pushing the boundaries on Faster, Friendlier & Futuristic!