PolyBase: Query Hadoop with ease
Sahaj Saini, SQL Server, Microsoft

Presentation transcript:


Please silence cell phones

Agenda
o What is PolyBase?
o Why do customers need it?
o How does it work?
o Demo
o Q&A

Why?

“Our mission is to empower every person and every organization on the planet to achieve more” - Satya Nadella

All the interest in Big Data
o Increased number and variety of data sources that generate large quantities of data.
o Realization that data is “too valuable” to delete.
o Dramatic decline in the cost of hardware, especially storage.

The Hadoop Ecosystem

Hadoop Evolution
o Initially: MapReduce for insights from HDFS-resident data
o Recently: SQL-like data warehouse technologies on HDFS, e.g. Hive, Impala, HAWQ, Spark/Shark

What if you use RDBMS and Hadoop?

What is PolyBase?

Big Picture
Provides a T-SQL language extension for combining data from both worlds

PolyBase in SQL Server 2016

PolyBase journey
(Timeline figure, 2014 to 2016. Milestones shown: PolyBase in SQL Server PDW V2 (Analytics Platform System); PolyBase in SQL Server 2016; PolyBase in Azure SQL Data Warehouse. Release labels on the timeline: CTP2, CTP3, RTM.)

Example 1: Auto Insurance (Usage-based Insurance)
o Combining non-relational sensor data from cars (kept in Hadoop) with structured customer data (kept in APS)
o Ability to adjust policies based on driver behavior
o 'Pay-as-you-drive': driver discount and policy adjustment
o Status: in production

PolyBase Demo SQL Server 2016

Example 2: Wind Turbine Manufacturer (Turbine Monitoring)
o Analyzing sensor data from wind turbines (kept in Hadoop) combined with relational turbine data (kept in SQL Server)
o Ability to do change detection, proactive maintenance and reporting
o Status: in development

PolyBase Use Cases

How does PolyBase work?

Step 1: Set up a Hadoop Cluster
o Hortonworks or Cloudera distributions
o Hadoop 2.0 or above
o Linux or Windows
o On premises or in Azure
(Diagram labels: Hadoop cluster, Namenode (HDFS), File System.)

Or Azure Storage Account
o Azure Storage Blob (ASB) exposes an HDFS layer
o PolyBase reads and writes from ASB using Hadoop APIs
o No compute push-down support for ASB

Step 2: Install SQL Server
o Select the PolyBase feature
o Adds two new services
  - PolyBase Engine
  - PolyBase Data Movement Service
o Prerequisite: download and install JRE
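One way to confirm that the PolyBase feature was selected during setup is to query a server property. This check is not part of the original slides; it is a minimal sketch that assumes a SQL Server 2016 or later instance.

```sql
-- Returns 1 if the PolyBase feature is installed on this instance, 0 otherwise.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;
```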

Step 3: Scale-out
1. Install multiple SQL Server instances with PolyBase.
2. Choose one as Head Node.
3. Configure remaining as Compute Nodes:
   a. run stored procedure
   b. shut down PolyBase Engine
   c. restart PolyBase DMS
(Diagram labels: Head Node, PolyBase Engine, PolyBase DMS.)
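The slide does not name the stored procedure. In SQL Server 2016 the documented procedure for joining a scale-out group is sp_polybase_join_group, so the compute-node configuration might look roughly like the sketch below. The machine name HEADNODE01 and the instance name MSSQLSERVER are placeholders, and 16450 is assumed to be the head node's DMS control channel port (its documented default).

```sql
-- Run on each instance that should become a compute node.
-- Arguments: head node machine name, head node DMS control channel port,
-- head node SQL Server instance name (all placeholder values here).
EXEC sp_polybase_join_group N'HEADNODE01', 16450, N'MSSQLSERVER';

-- After restarting the PolyBase services on the compute node,
-- run this on the head node to list the nodes in the group.
SELECT * FROM sys.dm_exec_compute_nodes;
```

The service shutdown and restart steps from the slide are done outside T-SQL, for example through the Windows services console.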

After Step 3: PolyBase Group for Scale-out Computation
o Head node contains the SQL Server instance to which PolyBase queries are submitted
o Compute nodes are used for scale-out query processing on external data

Step 4: Choose Hadoop flavor
Supported distributions in CTP3:
o Cloudera CDH 5.1 on Linux
o Hortonworks 2.0, 2.1 & 2.2 on Linux
o Hortonworks 2.0, 2.2 on Windows Server
o Azure blob storage (ASB)
What happens under the covers? PolyBase loads the right client jars to connect to Hadoop. Different numbers map to different Hadoop flavors, for example:
o value 4: HDP 2.0 on Windows, or ASB
o value 5: HDP 2.0 on Linux
o value 6: CDH 5.1 on Linux
o value 7: HDP 2.1/2.2 on Linux and Windows, or ASB
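The numeric values above are set through the 'hadoop connectivity' server configuration option. A minimal sketch, assuming the target is HDP 2.1/2.2 (value 7 in the CTP3 mapping on the slide; the mapping may differ in later releases):

```sql
-- Select the Hadoop flavor; 7 is taken from the slide's CTP3 mapping.
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;

-- Show the configured and running values for the option.
EXEC sp_configure @configname = 'hadoop connectivity';
```

The new value takes effect only after SQL Server and the two PolyBase services are restarted.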

After Step 4
(Diagram labels: Namenode (HDFS), File System.)

PolyBase Design

Under-the-hood
Exploiting compute resources of Hadoop clusters with push-down computation

HDFS bridge in DMS
Uses Hadoop RecordReaders/RecordWriters to read/write standard HDFS file types

Data moves between clusters in parallel
(Diagram labels: SQL16, Hadoop Cluster, Namenode (HDFS), File System.)

Creating External Tables
o External data source: once per Hadoop cluster
o External file format: once per file format
o External table: points to an HDFS file path
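The code on this slide did not survive in the transcript. The three objects can be sketched with the SQL Server 2016 DDL below. Everything specific here is an assumption for illustration: the object names, the IP address and ports, the HDFS path, the pipe delimiter, and the column types (only the Customer table and the columns c_nationkey and c_acctbal appear in the slides' queries).

```sql
-- Once per Hadoop cluster (address and ports are placeholders).
CREATE EXTERNAL DATA SOURCE MyHadoopCluster WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.10.10.10:8020',
    RESOURCE_MANAGER_LOCATION = '10.10.10.10:8050'  -- needed for push-down
);

-- Once per file format (assumes pipe-delimited text files).
CREATE EXTERNAL FILE FORMAT PipeDelimitedText WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- The external table: schema plus the HDFS file path (hypothetical path;
-- assumes the files contain exactly these four columns).
CREATE EXTERNAL TABLE dbo.Customer (
    c_custkey   BIGINT         NOT NULL,
    c_name      VARCHAR(25)    NOT NULL,
    c_nationkey INT            NOT NULL,
    c_acctbal   DECIMAL(15, 2) NOT NULL
) WITH (
    LOCATION = '/tpch/customer/',
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = PipeDelimitedText
);
```

Only the table definition is stored in SQL Server; the data stays in HDFS until a query reads it.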

Creating External Tables (secure Hadoop)
o Once per Hadoop user
o Once per file format
o HDFS file path
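For a secured cluster, the per-user piece is typically a database scoped credential that the external data source references. The sketch below uses the documented SQL Server 2016 syntax; the credential name, data source name, addresses, and the bracketed secrets are placeholders, not values from the talk.

```sql
-- A database master key is required before credentials can be stored.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- Once per Hadoop user (identity and secret are placeholders).
CREATE DATABASE SCOPED CREDENTIAL HadoopUser1
WITH IDENTITY = '<hadoop_user_name>', SECRET = '<hadoop_password>';

-- Data source that connects to the cluster as that user.
CREATE EXTERNAL DATA SOURCE MySecureHadoopCluster WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.10.10.10:8020',
    RESOURCE_MANAGER_LOCATION = '10.10.10.10:8050',
    CREDENTIAL = HadoopUser1
);
```

The external file format and external table are then created as for an unsecured cluster, with the table pointing at this data source.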

PolyBase Query Example #1
-- select on external table (data in HDFS)
SELECT * FROM Customer WHERE c_nationkey = 3 and c_acctbal < 0;
A possible execution plan:
1. CREATE temp table T: execute on SQL compute nodes
2. IMPORT FROM HDFS: HDFS Customer file read into T in parallel
3. EXECUTE QUERY: Select * from T where T.c_nationkey = 3 and T.c_acctbal < 0

Big Picture Takeaway
o SQL operations on HDFS data pushed into Hadoop as MapReduce jobs
o Cost-based decision on how much computation to push
(Diagram labels: PolyBase Query, DB, Hadoop, HDFS blocks, MapReduce, Map job.)

Cost-based Decision (for split-based query execution)
o Major factor for decision is data volume reduction
o Hadoop takes seconds to spin up a Map job
  - Spin-up time varies depending on distribution and OS
o Cardinality of predicate matters
  - No push-down for scenarios where SQL can execute in seconds without push-down
  - Creating statistics on external table (not auto-created)
o Queries can have “pushable” and “non-pushable” expressions and predicates
  - Pushable ones will be evaluated on Hadoop side
  - Processing of non-pushable ones will be done on SQL side
  - Aggregate functions (sum, count, …) partially pushed
  - JOINs never pushed, always executed on SQL side
(Architecture diagram labels: Your Apps, PowerPivot, PowerView, PDW Engine Service, External Table, External Data Source, External File Format, PolyBase Storage Layer (PPAX), HDFS Bridge (as part of DMS), Job Submitter.)
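Because statistics on external tables are not created automatically, the optimizer has no cardinality estimates for a predicate until they are created by hand. A minimal sketch, assuming an external table dbo.Customer with the columns used in the query examples (the statistics names are made up):

```sql
-- Statistics on the columns used in predicates, so the optimizer can
-- estimate how much a pushed-down filter would reduce the data volume.
CREATE STATISTICS CustomerNationKeyStats
ON dbo.Customer (c_nationkey) WITH FULLSCAN;

CREATE STATISTICS CustomerAcctBalStats
ON dbo.Customer (c_acctbal) WITH FULLSCAN;
```

Creating statistics reads the external data, so it can take a while on large HDFS files.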

PolyBase Query Example #2
-- select and aggregate on external table (data in HDFS)
SELECT AVG(c_acctbal) FROM Customer WHERE c_acctbal < 0 GROUP BY c_nationkey;
Execution plan:
1. Run MR Job on Hadoop: apply filter and compute aggregate on Customer.
What happens here?
Step 1: QO compiles predicate into Java.
Step 2: Engine submits Map job to Hadoop cluster. Output left in hdfsTemp.
(Diagram: hdfsTemp holding intermediate rows, labeled FRA and UK.)

PolyBase Query Example #2 (complete plan)
-- select and aggregate on external table (data in HDFS)
SELECT AVG(c_acctbal) FROM Customer WHERE c_acctbal < 0 GROUP BY c_nationkey;
Execution plan:
1. Run MR Job on Hadoop: apply filter and compute aggregate on Customer. Output left in hdfsTemp.
2. CREATE temp table T: on SQL compute nodes
3. IMPORT hdfsTEMP: read hdfsTemp into T
4. RETURN OPERATION: read from T, do final aggregation
Notes:
1. Query optimizer made a cost-based decision on what operators to push.
2. Predicate and aggregate pushed into Hadoop cluster as a Map job.
(Diagram: hdfsTemp holding intermediate rows, labeled FRA and UK.)
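The "final aggregation" step exists because an average cannot simply be averaged again across partial results. One common way to split AVG is to carry a partial sum and a partial count per group and divide at the end. The sketch below illustrates that idea only; the slides do not show the actual intermediate format PolyBase uses, and the temp table layout and sample rows are invented.

```sql
-- Hypothetical stand-in for temp table T after the import step:
-- one row per (map task, nation) with a partial sum and a partial count.
CREATE TABLE #T (
    c_nationkey   INT            NOT NULL,
    partial_sum   DECIMAL(18, 2) NOT NULL,
    partial_count BIGINT         NOT NULL
);

-- Made-up sample rows, only to make the sketch runnable.
INSERT INTO #T (c_nationkey, partial_sum, partial_count) VALUES
    (6,  -1200.50, 3),
    (6,   -300.25, 1),
    (23,  -950.00, 2);

-- Final aggregation on the SQL side: combine the partial results per group.
-- c_nationkey is included in the output here only for readability.
SELECT c_nationkey,
       SUM(partial_sum) / SUM(partial_count) AS avg_acctbal
FROM #T
GROUP BY c_nationkey;

DROP TABLE #T;
```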

Query Capabilities

Query Capabilities (1): Combine relational and external data (SELECT FROM)
1. Querying external tables
2. Joining external with regular SQL tables
3. Pushing compute for basic expressions and aggregates
(Diagram: external tables referring to data in two HDP Hadoop clusters, joined with a SQL table.)
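A join between external and regular tables is written like any other T-SQL join. The sketch below assumes an external table dbo.Customer over HDFS data, as in the query examples, and a hypothetical local table dbo.Nation with columns n_nationkey and n_name.

```sql
SELECT n.n_name,
       COUNT(*) AS customers_in_debt
FROM dbo.Customer AS c              -- external table (data in HDFS)
JOIN dbo.Nation   AS n              -- hypothetical regular SQL Server table
    ON n.n_nationkey = c.c_nationkey
WHERE c.c_acctbal < 0               -- simple predicate, candidate for push-down
GROUP BY n.n_name;
```

Per the cost-based decision slide, the join itself always runs on the SQL side; only the filter on the external table may be pushed to Hadoop.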

Query Capabilities (2): Push-Down Computation
Pushing compute
o Either on data source level, or
o On a per-query basis using query hints
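The per-query control is exposed as two query hints in SQL Server 2016. A minimal sketch, assuming the external table dbo.Customer from the query examples and a data source created with RESOURCE_MANAGER_LOCATION (which enables push-down at the data source level):

```sql
-- Force the filter to be pushed to Hadoop as a MapReduce job.
SELECT c_nationkey, c_acctbal
FROM dbo.Customer
WHERE c_acctbal < 0
OPTION (FORCE EXTERNALPUSHDOWN);

-- Disable push-down: rows are imported and filtered on the SQL side.
SELECT c_nationkey, c_acctbal
FROM dbo.Customer
WHERE c_acctbal < 0
OPTION (DISABLE EXTERNALPUSHDOWN);
```

Without a hint, the optimizer makes the cost-based decision described earlier.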

Query Capabilities (3): Multiple User IDs
o Credential support for multiple user IDs associated with an external data source

Query Capabilities (4) Seamless BI integration

Import Scenario: Persistent Storage (SELECT INTO)
1. Importing data from Hadoop or Azure storage for persistent storage
2. ‘ETL’ type of processing possible via T-SQL
(Diagram: external table referring to data in HDP Hadoop clusters; new SQL table created.)
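An import of this kind can be sketched with SELECT INTO. The external table dbo.Customer follows the earlier query examples; the target name dbo.Customer_Local is hypothetical.

```sql
-- Read the external data once and persist it as a regular SQL Server table.
-- The WHERE clause shows that filtering (light 'ETL') can happen on the way in.
SELECT c_custkey, c_name, c_nationkey, c_acctbal
INTO dbo.Customer_Local
FROM dbo.Customer
WHERE c_acctbal < 0;
```

Later queries against dbo.Customer_Local no longer touch Hadoop.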

Export Scenario: Data aging to Hadoop/Azure (INSERT INTO)
1. Exporting SQL data into Hadoop or Azure Storage
2. ‘ETL’ type of processing possible via T-SQL
(Diagram: external table for aging data into Hadoop; export data to Hadoop.)
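Export works by inserting into an external table, after the export option has been enabled on the instance. In the sketch below, the dbo.Orders table, its columns, the cutoff date, the HDFS path, and the data source and file format names (MyHadoopCluster, PipeDelimitedText) are all assumptions for illustration.

```sql
-- Export is off by default and must be enabled once per instance.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

-- External table that defines where aged rows land in HDFS.
CREATE EXTERNAL TABLE dbo.Orders_Archive (
    o_orderkey   BIGINT         NOT NULL,
    o_custkey    BIGINT         NOT NULL,
    o_totalprice DECIMAL(15, 2) NOT NULL,
    o_orderdate  DATE           NOT NULL
) WITH (
    LOCATION = '/archive/orders/',
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = PipeDelimitedText
);

-- Age out old rows: copy them from the local table into Hadoop.
INSERT INTO dbo.Orders_Archive
SELECT o_orderkey, o_custkey, o_totalprice, o_orderdate
FROM dbo.Orders
WHERE o_orderdate < '2010-01-01';
```

Removing the exported rows from the local table would be a separate DELETE, run once the export has been verified.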

Acknowledgments
o Dr. David DeWitt: for letting me use his material to explain the PolyBase technology.
o Dr. Artin Avanes: for selecting and building the demo use case.
o Our team in Gray Systems Lab, Madison and Aliso Viejo

Thank You