1 © Cloudera, Inc. All rights reserved. Simplifying Analytic Workloads via Complex Schemas Josh Wills, Alex Behm, and Marcel Kornacker Data Modeling for.

Slides:

Advertisements

Similar presentations

Phoenix We put the SQL back in NoSQL James Taylor Demos:

Advertisements

Advanced SQL (part 1) CS263 Lecture 7.

© 2007 by Prentice Hall (Hoffer, Prescott & McFadden) 1 Joins and Sub-queries in SQL.

Michael Pizzo Software Architect Data Programmability Microsoft Corporation.

Ingres/Vectorwise Implementation Details XXV Ingres Benutzerkonferenz 2012 Confidential © 2011 Actian Corporation Doug Inkster 1 of 9.

1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © Leland Stanford Junior University. All rights reserved. Warning: This.

‹#› © Cloudera, Inc. All rights reserved. Marcel Kornacker // Cloudera, Inc. Friction-Free ELT: Automating Data Transformation with Cloudera Impala.

From Raw Data to Analytics with No ETL Marcel Kornacker // Cloudera, Inc.

Instructor: Craig Duckett CASE, ORDER BY, GROUP BY, HAVING, Subqueries

An Information Architecture for Hadoop Mark Samson – Systems Engineer, Cloudera.

Hive: A data warehouse on Hadoop

Database Systems: Design, Implementation, and Management Eighth Edition Chapter 8 Advanced SQL.

Chapter 14 The Second Component: The Database.

Chapter 8 Physical Database Design. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Outline Overview of Physical Database.

Advanced SQL SMSU Computer Services Short Course.

Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.

Sarah Sproehnle Cloudera, Inc

Relational Database Performance CSCI 6442 Copyright 2013, David C. Roberts, all rights reserved.

Introduction to SQL Structured Query Language Martin Egerhill.

Introduction to Databases Chapter 8: Improving Data Access.

Sofia, Bulgaria | 9-10 October Using XQuery to Query and Manipulate XML Data Stephen Forte CTO, Corzen Inc Microsoft Regional Director NY/NJ (USA) Stephen.

1 © 2012 OpenLink Software, All rights reserved. Virtuoso - Column Store, Adaptive Techniques for RDF Orri Erling Program Manager, Virtuoso Openlink Software.

PostgreSQL and relational databases As well as assignment 4…

Ashwani Roy Understanding Graphical Execution Plans Level 200.

5/24/01 Leveraging SQL Server 2000 in ColdFusion Applications December 9, 2003 Chris Lomvardias SRA International

1 Lab 2 and Merging Data (with SQL) HRP223 – 2009 October 19, 2009 Copyright © Leland Stanford Junior University. All rights reserved. Warning:

CS1100: Microsoft Access Managing Data in Relational Databases Created By Martin Schedlbauer CS11001Microsoft Access - Introduction.

Data and SQL on Hadoop. Cloudera Image for hands-on Installation instruction – 2.

6 1 Lecture 8: Introduction to Structured Query Language (SQL) J. S. Chou, P.E., Ph.D.

SQL Performance and Optimization l SQL Overview l Performance Tuning Process l SQL-Tuning –EXPLAIN PLANs –Tuning Tools –Optimizing Table Scans –Optimizing.

1 Chapter 10 Joins and Subqueries. 2 Joins & Subqueries Joins – Methods to combine data from multiple tables – Optimizer information can be limited based.

Database Systems Design, Implementation, and Management Coronel | Morris 11e ©2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or.

Module 4 Database SQL Tuning Section 3 Application Performance.

CpSc 462/662: Database Management Systems (DBMS) (TEXNH Approach) Relational Schema and SQL Queries James Wang.

Database Basics BCIS 3680 Enterprise Programming.

Session 1 Module 1: Introduction to Data Integrity

1 © Cloudera, Inc. All rights reserved. Engines, Algorithms, and Data Models Josh Wills | Senior Director of Data Science From Dimensional Modeling to.

Page 1 © Hortonworks Inc – All Rights Reserved Hive: Data Organization for Performance Gopal Vijayaraghavan.

MapReduce Joins Shalish.V.J. A Refresher on Joins A join is an operation that combines records from two or more data sets based on a field or set of fields,

Random Query Generator for Hive November 2015 Hive Contributor Meetup Szehon Ho.

Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.

QUERY CONSTRUCTION CS1100: Data, Databases, and Queries CS1100Microsoft Access1.

Introduction to MongoDB. Database compared.

In this session, you will learn to: Create and manage views Implement a full-text search Implement batches Objectives.

1 Copyright © 2008, Oracle. All rights reserved. Repository Basics.

Select Complex Queries Database Management Fundamentals LESSON 3.1b.

Extending and Creating Dynamics AX OLAP Cubes

More SQL: Complex Queries, Triggers, Views, and Schema Modification

MySQL Subquery Source: Dev.MySql.com

MongoDB Er. Shiva K. Shrestha ME Computer, NCIT

02 | Advanced SELECT Statements

UFC #1433 In-Memory tables 2014 vs 2016

A Warehousing Solution Over a Map-Reduce Framework

Scaling SQL with different approaches

Database Systems: Design, Implementation, and Management Tenth Edition

Oracle Analytic Views Enhance BI Applications and Simplify Development

Cse 344 May 7th – Exam Review.

1 Demand of your DB is changing Presented By: Ashwani Kumar

The Killing Cursors Cyndi Johnson

Enhance BI Applications and Simplify Development

Chapter 4 Summary Query.

Cyndi Johnson Senior Software Engineer at AdvancedMD Killing Cursors.

Lab 2 and Merging Data (with SQL)

Lab 2 HRP223 – 2010 October 18, 2010 Copyright © Leland Stanford Junior University. All rights reserved. Warning: This presentation is protected.

Contents Preface I Introduction Lesson Objectives I-2

Chapter 8 Advanced SQL.

Database Systems: Design, Implementation, and Management Tenth Edition

Cyndi Johnson Senior Software Engineer at AdvancedMD Killing Cursors.

All about Indexes Gail Shaw.

Presentation transcript:

1 © Cloudera, Inc. All rights reserved. Simplifying Analytic Workloads via Complex Schemas Josh Wills, Alex Behm, and Marcel Kornacker Data Modeling for Data Science

2 © Cloudera, Inc. All rights reserved. Relational databases are optimized to work with flat schemas Much of the useful information in the world is stored using complex schemas (a.k.a., the “document model”) Log files NoSQL data stores REST APIs On Complex Schemas

3 © Cloudera, Inc. All rights reserved. Handling Complex Schemas in SQL

4 © Cloudera, Inc. All rights reserved. Complex Schemas and Data Science

5 © Cloudera, Inc. All rights reserved. Think Like A Data Scientist

6 © Cloudera, Inc. All rights reserved. Building Data Science Teams: A Moneyball Approach

7 © Cloudera, Inc. All rights reserved. Leveraging The Power of SQL Engines…

8 © Cloudera, Inc. All rights reserved. …While Managing the Cost of SQL Engines

9 © Cloudera, Inc. All rights reserved. Intentional Complex Schemas

10 © Cloudera, Inc. All rights reserved. Better Resource Utilization

11 © Cloudera, Inc. All rights reserved. Higher Data Quality

12 © Cloudera, Inc. All rights reserved. Simplifying Analytic Queries

13 © Cloudera, Inc. All rights reserved. Data Science For Everyone

14 © Cloudera, Inc. All rights reserved. Nested Types in Impala Support for Nested Data Types: STRUCT, MAP, ARRAY Natural Extensions to SQL Full Expressiveness of SQL with Nested Types

15 © Cloudera, Inc. All rights reserved. Example TPCH-like Schema CREATE TABLE Customers { id BIGINT, address STRUCT< city: STRING, zip: INT > orders ARRAY<STRUCT< date: TIMESTAMP, price: DECIMAL(12,2), cc: BIGINT, items: ARRAY<STRUCT< item_no: STRING, price: DECIMAL(9,2) >> }

16 © Cloudera, Inc. All rights reserved. Impala Syntax Extensions Path expressions extend column references to scalars (nested structs) Can appear anywhere a conventional column reference is used Collections (maps and arrays) are exposed like sub-tables Use FROM clause to specify which collections to read like conventional tables Can use JOIN conditions to express join relationship (default is INNER JOIN) Find the ids and city of customers who live in the zip code 94305: SELECT id, address.city FROM customers WHERE address.zip = Find all orders that were paid for with a customer’s preferred credit card: SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc

17 © Cloudera, Inc. All rights reserved. Referencing Arrays & Maps Basic idea: Flatten nested collections referenced in the FROM clause Can be thought of as an implicit join on the parent/child relationship SELECT c.id, o.date FROM customers c, c.c_orders o c.ido.date …… …… id of a customer repeated for every order

18 © Cloudera, Inc. All rights reserved. Referencing Arrays & Maps SELECT c.c_custkey, o.o_orderkey FROM customers c, c.c_orders o Returns customer/order data for customers that have at least one order SELECT c.c_custkey, o.o_orderkey FROM customers c LEFT OUTER JOIN c.c_orders o Also returns customers with no orders (with order fields NULL) SELECT c.c_custkey, o.o_orderkey FROM customers c LEFT ANTI JOIN c.orders o Find all customers that have no orders SELECT c.c_custkey, o.o_orderkey FROM customers c INNER JOIN c.c_orders o

19 © Cloudera, Inc. All rights reserved. Motivation for Advanced Querying Capabilities Count the number of orders per customer Count the number of items per order Impractical  Requires unique id at every nesting level Information is already expressed in nesting relationship! What about even more interesting queries? Get the number of orders and the average item price per customer? “Group by” multiple nesting levels at the same time SELECT COUNT(*) FROM customers c, c.orders o GROUP BY c.id SELECT COUNT(*) FROM customers.orders o, o.items GROUP BY ??? Must be unique

20 © Cloudera, Inc. All rights reserved. Advanced Querying: Aggregates over Nested Collections Count the number of orders per customer Count the number of items per order Get the number of orders and the average item price per customer SELECT id, COUNT(orders) FROM customers SELECT date, COUNT(items) FROM customers.orders SELECT id, COUNT(orders), AVG(orders.items.price) FROM customers

21 © Cloudera, Inc. All rights reserved. Advanced Querying: Relative Table References Full expressibility of SQL with nested collections Arbitrary SQL allowed in inline views and subqueries with relative table refs Exploits nesting relationship No need for stored unique ids at all interesting levels SELECT id FROM customers c WHERE (SELECT AVG(price) FROM c.orders.items) > 1000 SELECT id FROM customers c JOIN (SELECT price FROM c.orders ORDER BY date DESC LIMIT 2) v

22 © Cloudera, Inc. All rights reserved. Impala Nested Types in Action Josh Wills’ Blog post on analyzing misspelled queries Goal: Rudimentary spell checker based on counting query/click events Problem: Cross referencing items in multiple nested collections Representative of many machine learning tasks (Josh tells me) Goal was not naturally achievable with Hive Josh implemented a custom Hive extension “WITHIN” How can Impala serve this use case?

23 © Cloudera, Inc. All rights reserved. Impala Nested Types in Action account_id: bigint search_events: array<struct< event_id: bigint query: string tstamp_sec: bigint... >> install_events: array<struct< event_id: bigint search_event_id: bigint app_id: bigint... >> SELECT bad.query, good.query, count(*) as cnt FROM sessions s, s.search_events bad, s.search_events good, WHERE bad.tstamp_sec < good.tstamp_sec AND good.tstamp_sec - bad.tstamp_sec < 30 AND bad.event_id NOT IN (select search_event_id FROM s.install_events) AND good.event_id IN (select search_event_id FROM s.install_events), GROUP BY bad.query, good.query Schema

24 © Cloudera, Inc. All rights reserved. Design Considerations for Complex Schemas Turn parent-child hierarchies into nested collections logical hierarchy becomes physical hierarchy nested structure = join index physical clustering of child with parent For distributed, big data systems, this matter: distributed join turns into local join

25 © Cloudera, Inc. All rights reserved. Flat TPCH vs. Nested TPCH PartPartSupp SupplierLineitem CustomerOrders NationRegion Part PartSupp Suppliers Lineitems Customer Orders Nation Region 1 N 1 N 1 N 1 N N 1 N 1 N 1 N 1 N 1 1 N N 1

26 © Cloudera, Inc. All rights reserved. Columnar Storage for Complex Schemas Columnar storage: a necessity for processing nested data at speed complex schemas = really wide tables row-wise storage: read lots of data you don’t care about Columnar formats effective store a join index {id: 17 orders: [{date: , price: 10.2}, {date: ….} …] } {id: 18 orders: [{date: , price: }, {date: ….} …] } id orders.price … … …

27 © Cloudera, Inc. All rights reserved. Columnar Storage for Complex Schemas A “join” between parent and child is essentially free: coordinated scan of parent and child columns data effectively pre-sorted on parent PK merge join beats hash join! Aggregating child data is cheap: the data is already pre-grouped by the parent Amenable to vectorized execution Local non-grouping aggregation <<< distributed grouping aggregation

28 © Cloudera, Inc. All rights reserved. Analytics on Nested Data: Cheap & Easy! SELECT c.id, AVG(orders.items.price) avg_price FROM customer ORDER BY avg_price DESC LIMIT 10 Query: Find the 10 customers that buy the most expensive items on average SELECT c.id, AVG(i.price) avg_price FROM customer c, order o, item i WHERE c.id = o.cid and o.id = i.oid GROUP BY c.id ORDER BY avg_price DESC LIMIT 10 FlatNested

29 © Cloudera, Inc. All rights reserved. Scan Xch Join Scan Xch Join Agg Xch Agg Scan Unnest Agg Top-n Xch Top-n Xch Top-n Analytics on Nested Data: Cheap & Easy! $$$

30 © Cloudera, Inc. All rights reserved. A Note to BI Tool Vendors Please support nested types, it’s not that hard! you already take advantage of declared PK/FK relationships a nested collection isn’t really that different except you don’t need a join predicate

31 © Cloudera, Inc. All rights reserved. Thank you