
1 Data Virtualization SQL Server 2019 Enhanced Polybase
Mike - Introduction. Speakers: Kimberly St. Jacques, Michael Grayson

2 Agenda
About us
Enhanced PolyBase overview
Enhanced PolyBase Installation
Connecting to Data Sources
Querying
Troubleshooting
Questions?
Mike

3 About Paychex Paychex is a leading provider of integrated human capital management solutions for payroll, HR, retirement, and insurance services. Backed by 47 years of industry expertise, Paychex serves approximately 605,000 payroll clients as of May 31, 2017, across more than 100 locations, and pays one out of every 12 American private-sector employees. Mike

4 About the Speakers: Kim
DBA since 2009
Currently working at Paychex
Have worked at Xerox, FujiFilm, and Wegmans
Have administered SQL Server, Oracle, MongoDB, and Netezza; have also touched Informix, MariaDB, and Postgres
Graduated from RIT
Married, 2 kids (both in college)
Member of PASS; PASS Summit Program Committee - served 4 years
Twitter: @kimstjacques LinkedIn: Blog:
Kim

5 About the Speakers: Mike
DBA since 2010
Currently working at Paychex
Have worked at Paetec, Windstream, and Thomson Reuters previously
Have administered MongoDB, Oracle, MySQL, MariaDB, Cassandra, DB2, and Kafka
MongoDB Master 2016-present
Graduated from Drexel University in Philadelphia
Married, 4 kids
Twitter: @mikegray831 LinkedIn: Blog:
Mike

6 Data Virtualization

7 Data Virtualization Defined
“Data virtualization offers techniques to abstract the way we handle and access data. It allows you to manage and work with data across heterogeneous streams and systems, regardless of their physical location or format. Data virtualization can be defined as a set of tools, techniques and methods that let you access and interact with data without worrying about its physical location and what compute is done on it.” Kim

8 “For instance, say you have tons of data spread across disparate systems and want to query it all in a unified manner, but without moving the data around. That’s when you would want to leverage data virtualization techniques.” - TechNet Kim

9 PolyBase

10 PolyBase - Introduced in SQL Server 2016
Supported HDFS-compatible Hadoop distributions and file systems such as Hortonworks, Cloudera, and Azure Blob Storage. Mike

11 What is Enhanced PolyBase?
Combines many disparate data sources for reporting and analysis inside SQL Server, without the need to develop and run ETL processes.
Use T-SQL to query data located on a wide array of data sources: HDFS-compatible Hadoop distributions and file systems such as Hortonworks, Cloudera, and Azure Blob Storage; SQL Server, Oracle, Teradata, MongoDB, or any data source with an ODBC driver.
Scalable with compute instances.
Secured with Active Directory authentication.
Microsoft SQL Server 2019 (CTP 2.2)
Mike
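To make that concrete before the walkthrough, here is a hedged sketch of the end state: once external tables are defined (as shown later in this deck), remote data is queried with plain T-SQL. The table and column names below are hypothetical.

-- Hypothetical example: join a local table to an external (virtualized) one.
-- No ETL: rows are fetched from the remote source at query time.
SELECT i.IncidentID, i.OpenedDate, c.E_CustomerName
FROM dbo.Incidents AS i          -- local SQL Server table (made up for illustration)
JOIN dbo.CustomerImpact AS c     -- external table backed by Oracle (defined later)
    ON c.E_BugID = i.BugID;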

12 Mike - Sample Data Sources

13 Mike - Enhanced Polybase Architecture

14 Head Node Kim

15 What is the Head Node?
SQL Server instance where queries are submitted.
Can only have one Head node; it must be Enterprise edition.
Running services include: SQL Database Engine, PolyBase Engine, PolyBase Data Movement Service.
Parses submitted queries and distributes the query plan and work to the data movement service on the compute nodes.
After the work is completed, the compute nodes return their results to the head node, which returns them to the client.
Kim

16 Head Node Installation
Kim: highlight pre-reqs; comment on instance features selected

17 Head Node Installation
Kim: highlight pre-reqs; comment on instance features selected

18 Head Node Installation
Kim: highlight pre-reqs; comment on instance features selected

19 Head Node Installation (cont..)
Kim - If someone asks: When querying external SQL Server, Oracle, or Teradata instances, partitioned tables will benefit from scale-out reads. Each node in a PolyBase scale-out group can spin up to 8 readers to read external data, and each reader is assigned one partition of the external table to read. For example, with an external SQL Server table that has 12 monthly partitions and a 3-node PolyBase scale-out group, each node will use 4 PolyBase readers, one per partition, to cover the 12 partitions. This is illustrated on the slide.

20 Head Node Installation (cont..)
Kim

21 Head Node - Post Installation
Enable PolyBase:
EXEC sp_configure 'polybase enabled', 1;
GO
RECONFIGURE;
To Verify:
SELECT * FROM sys.configurations WHERE name LIKE '%poly%';
Kim
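A quick way to confirm PolyBase was actually installed with the instance (a small check added here, not on the original slide):

-- Returns 1 when the PolyBase feature is installed.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;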

22 Compute Nodes Mike

23 What is the Compute Node?
Assists with scale-out query processing on external data.
Running services include: PolyBase Data Movement service, SQL Server service. ** Notice the PolyBase Engine is not running here **
Can be Enterprise or Standard edition.
Sends results back to the Head node.
Mike

24 Compute Node Installation
Mike: highlight pre-reqs; comment on instance features selected

25 Compute Node Installation (cont..)
Mike

26 Compute Node Installation (cont..)
Mike: generate a new screenshot that does not have Paychex accounts

27 Compute Node - Post Installation
Enable PolyBase:
EXEC sp_configure 'polybase enabled', 1;
GO
RECONFIGURE;
To Verify:
SELECT * FROM sys.configurations WHERE name LIKE '%poly%';
Mike

28 Compute Node - Post Installation
Register the Compute node with the Head node:
EXEC sp_polybase_join_group 'PLBDHWN2AWV1', 16450, 'MSSQLSERVER';
What this does:
Identifies which node will be the Head (when executed the first time)
Disables the PolyBase Engine service
Mike
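As a side note (not on the original slide), there is a companion procedure for taking a compute node back out of the scale-out group; a minimal sketch, run on the compute node itself:

-- Removes this node from the PolyBase scale-out group.
EXEC sp_polybase_leave_group;
-- Then restart the SQL Server PolyBase Data Movement service for the change to take effect.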

29 Lastly, Recycle the Data Movement Service
In SQL Server Configuration Manager:

30 PolyBase Engine is Disabled
In SQL Server Configuration Manager:

31 SSMS – Scale-out Group Mike

32 Create Database

33 Our Story….
We would like to do incident analysis using data from multiple systems in our environment. Each system uses a different type of data source:
MongoDB - Jira Board
Cosmos DB - Bug Tracking
MariaDB - Document Revision History
Oracle - Customer Impact Data
Kim

34 Create Database Create a database on the Head node
This database will hold: credentials, data source definitions, and external tables.
-- Create Database to store the external tables for PolyBase
CREATE DATABASE [IncidentAnalysis];
-- Create Master Encryption Key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Enhanced#Polybase123';
Kim - Normal create database statement… nothing special for PolyBase here. Showing what it will look like when all is said and done. The database master key is a symmetric key that is used to protect the private keys of certificates and asymmetric keys that are present in the database. It can also be used to encrypt data, but it has length limitations that make it less practical for data than using a symmetric key.
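As a quick sanity check (our addition, not on the slide), the master key can be verified through the sys.symmetric_keys catalog view:

USE IncidentAnalysis;
GO
-- One row back means the database master key exists in this database.
SELECT name, create_date
FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';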

35 Connecting to Oracle Kim

36 Configuring Oracle for Polybase
On the Oracle source: Create Oracle user for PolyBase to authenticate with. Grant the account access to the tables PolyBase will need to extract data from. On the PolyBase Head node: Install the appropriate Oracle client and ODBC drivers. Kim

37 Create Database Scoped Credential - Oracle
IDENTITY - the username created in the Oracle target with access to the tables being exposed to SQL Server PolyBase.
SECRET - the Oracle password for the username specified by IDENTITY.
USE IncidentAnalysis;
GO
CREATE DATABASE SCOPED CREDENTIAL oracle_poly WITH IDENTITY = 'poly', SECRET = 'cracker';
Kim
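Once created, credentials can be listed through a catalog view (a verification step we are adding; secrets are never exposed):

SELECT name, credential_identity, create_date
FROM sys.database_scoped_credentials;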

38 Add External Data Source - Oracle
USE IncidentAnalysis;
GO
CREATE EXTERNAL DATA SOURCE IA_Oracle WITH (
    LOCATION = 'oracle://myoracleservername:1521',
    PUSHDOWN = OFF,
    CREDENTIAL = oracle_poly );
Kim - PUSHDOWN: when ON, computation such as column projections can be pushed down to the external source instead of being done by the PolyBase compute nodes.

39 Create External Table - Oracle
USE IncidentAnalysis;
GO
CREATE EXTERNAL TABLE CustomerImpact(
    E_CustomerID FLOAT(53),
    E_CustomerName VARCHAR(30) COLLATE Latin1_General_BIN,
    E_DateofIncident DATE,
    E_Cost FLOAT(53),
    E_BugID FLOAT(53))
WITH ( LOCATION='Customer.POLY.CUSTOMERIMPACT', DATA_SOURCE=IA_Oracle );
CREATE STATISTICS ora_cust_impact_cust_id_stats ON CustomerImpact (E_CustomerID) WITH FULLSCAN;
** Remember: Oracle is case sensitive **
Kim ** Mention customer impact data is stored here **
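A hedged sanity check once the external table exists (column names come from the definition above; the TOP 10 cut is arbitrary):

-- Rows are read live from Oracle at query time; nothing is copied into SQL Server.
SELECT TOP 10 E_CustomerID, E_CustomerName, E_Cost
FROM CustomerImpact
ORDER BY E_Cost DESC;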

40 Findings
Data type mismatches… SQL Server provides the needed data types in its error messages:
user defined column type: ([Worker_ID] INT) vs. detected external table column type: ([WORKER_ID] FLOAT(53))
user defined column type: ([org] NVARCHAR(20)) vs. detected external table column type: ([ORG] VARCHAR(20) COLLATE Latin1_General_BIN)
Kim

41 Connecting to MongoDB

42 Configuring MongoDB for Polybase
On the MongoDB source: Create a user for PolyBase to authenticate with. Grant the account access to the collections that PolyBase will need to extract data from. On the PolyBase Head node: Do nothing! The MongoDB driver is included in the PolyBase install. Mike

43 Create Database Scoped Credential - MongoDB
IDENTITY - the username created in the MongoDB target with access to the collections being exposed to SQL Server PolyBase.
SECRET - the MongoDB password for the username specified by IDENTITY.
USE IncidentAnalysis;
GO
CREATE DATABASE SCOPED CREDENTIAL mongodb_poly WITH IDENTITY = 'superuser', SECRET = 'test123';
Mike

44 Add External Data Source - MongoDB
CREATE EXTERNAL DATA SOURCE source_mongodb_poly WITH (
    LOCATION = 'mongodb://mongoserver:27017',
    CONNECTION_OPTIONS = 'SSL=0',
    PUSHDOWN = OFF,
    CREDENTIAL = mongodb_poly );
Mike

45 Create External Table - MongoDB
USE IncidentAnalysis;
GO
CREATE EXTERNAL TABLE jiraBoard(
    [_id] NVARCHAR(24) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [SprintID] NVARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS,
    [AssigneeID] NVARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS,
    [TaskID] INT,
    [TaskDesc] NVARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS,
    [Procedure Used] NVARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS,
    [DateCompleted] NVARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS)
WITH ( LOCATION='jira.jiraBoard', DATA_SOURCE= source_mongodb_poly );
CREATE STATISTICS mongo_jiraBoard_sprintid ON jiraBoard (SprintID) WITH FULLSCAN;
Mike ** Mention Mongo holds the JIRA Board data (maybe even explain JIRA is a tool for agile workflow management) **
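As a hedged example, the collection can now be filtered with ordinary T-SQL (the SprintID value below is made up):

-- PolyBase translates this into a query against the MongoDB collection.
SELECT TaskID, TaskDesc, DateCompleted
FROM jiraBoard
WHERE SprintID = 'Sprint-42';   -- hypothetical sprint identifier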

46 Findings Had to add “CONNECTION_OPTIONS='SSL=0',” to turn off SSL (turned on by default) Mike

47 Connecting to MariaDB Kim

48 Configuring MariaDB for Polybase
On the MariaDB source: Create MariaDB user for PolyBase to authenticate with. Grant the account access to the tables PolyBase will need to extract data from. On the PolyBase Head node: Install the appropriate MariaDB ODBC drivers. Kim

49 Create Database Scoped Credential - MariaDB
IDENTITY - the username created in the MariaDB target with access to the tables being exposed to SQL Server PolyBase.
SECRET - the MariaDB password for the username specified by IDENTITY.
USE IncidentAnalysis;
GO
CREATE DATABASE SCOPED CREDENTIAL mariadb_poly WITH IDENTITY = 'poly', SECRET = 'cracker';
Kim

50 Add External Data Source - MariaDB
CREATE EXTERNAL DATA SOURCE IA_MariaDB WITH (
    LOCATION = 'odbc://mymariadbservername:3306',
    PUSHDOWN = OFF,
    CREDENTIAL = mariadb_poly );
Kim
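Generic ODBC sources sometimes need the driver named explicitly; a hedged variant of the statement above, assuming a MariaDB ODBC driver is installed on the head node (match the driver name to what is actually installed):

CREATE EXTERNAL DATA SOURCE IA_MariaDB WITH (
    LOCATION = 'odbc://mymariadbservername:3306',
    CONNECTION_OPTIONS = 'Driver={MariaDB ODBC 3.1 Driver}',  -- assumption: name of the installed driver
    PUSHDOWN = OFF,
    CREDENTIAL = mariadb_poly );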

51 Create External Table - MariaDB
USE IncidentAnalysis;
GO
CREATE EXTERNAL TABLE documentation(
    E_DocumentID VARCHAR(8000) COLLATE SQL_Latin1_General_CP1_CI_AS,
    E_DocumentName VARCHAR(8000) COLLATE SQL_Latin1_General_CP1_CI_AS,
    E_DateCreated DATE)
WITH ( LOCATION='documents.documentation', DATA_SOURCE=IA_MariaDB );
CREATE STATISTICS maria_documentation_docID ON documentation (E_DocumentID) WITH FULLSCAN;
** Remember: MariaDB is case sensitive **
Kim

52 Connecting to Cosmos DB

53 Cosmos DB – Well Documented ????

54 Configuring CosmosDB for Polybase
On the CosmosDB source: Create a user for PolyBase to authenticate with. Grant the account access to the collections that PolyBase will need to extract data from. On the PolyBase Head node: Do nothing! The MongoDB driver included in the PolyBase install connects to CosmosDB through its MongoDB API. Mike

55 Create Database Scoped Credential - Cosmos DB
IDENTITY - the username created in the CosmosDB target with access to the collections being exposed to SQL Server PolyBase.
SECRET - the CosmosDB password for the username specified by IDENTITY.
USE IncidentAnalysis;
GO
CREATE DATABASE SCOPED CREDENTIAL cosmos_poly WITH IDENTITY = 'cosmosdb-poly', SECRET = 'password';
Mike

56 Add External Data Source - CosmosDB
CREATE EXTERNAL DATA SOURCE source_cosmos_poly WITH ( LOCATION = 'mongodb://cosmosdb-poly.documents.azure.com:10255', PUSHDOWN = OFF, CREDENTIAL = cosmos_poly); Mike

57 Create External Table - CosmosDB
USE IncidentAnalysis;
GO
CREATE EXTERNAL TABLE bugTracker(
    [_id] NVARCHAR(24) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [BugID] INT,
    [Assignee] NVARCHAR(4000) COLLATE SQL_Latin1_General_CP1_CI_AS,
    [IntroDate] NVARCHAR(4000) COLLATE SQL_Latin1_General_CP1_CI_AS,
    [Description] NVARCHAR(4000) COLLATE SQL_Latin1_General_CP1_CI_AS)
WITH ( LOCATION='Bugs.bugTracker', DATA_SOURCE= source_cosmos_poly );
CREATE STATISTICS cosmos_bugtracker_bugid ON bugTracker (BugID) WITH FULLSCAN;
Mike

58 Findings
Needed to use the CosmosDB emulator on-premises.
Needed to import documents via the CosmosDB SQL API.
Mike

59 Query the Data

60 What it Looks Like in SSMS
Kim the DW* databases are used by PolyBase and should not be modified.

61 Querying (continued…)
External tables can be queried just like a normal table: Kim

62 Querying (cont…) Functions and clauses work as well: Kim - pulling from MariaDB.

63 And Finally, the JOIN!
BugTracker data is in Cosmos DB
JiraBoard data is in MongoDB
Customer impact data is in Oracle
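A hedged sketch of what the cross-source join might look like, using the external tables defined earlier (the jiraBoard join key is an assumption; the deck does not spell it out):

SELECT ci.E_CustomerName,
       ci.E_Cost,
       bt.[Description] AS BugDescription,
       jb.TaskDesc
FROM CustomerImpact AS ci                             -- external: Oracle
JOIN bugTracker     AS bt ON bt.BugID  = ci.E_BugID   -- external: Cosmos DB
JOIN jiraBoard      AS jb ON jb.TaskID = bt.BugID;    -- external: MongoDB (hypothetical key)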

64 Results…. And a Confession….
Kim - BugTracker data is in Cosmos DB, JiraBoard data is in MongoDB… and the customer impact data is actually in CosmosDB, not Oracle.

65 PowerBI Develop insights with Power BI. A single connection to the SQL Server PolyBase head node provides access to all the data sources needed. Kim

66 Import/Export Data

67 Import - Example from Microsoft
-- PolyBase scenario - import external data into SQL Server
-- Import data for fast drivers into SQL Server to do more in-depth analysis
-- Leverage columnstore technology
SELECT DISTINCT Insured_Customers.FirstName, Insured_Customers.LastName,
       Insured_Customers.YearlyIncome, Insured_Customers.MaritalStatus
INTO Fast_Customers
FROM Insured_Customers
INNER JOIN ( SELECT * FROM CarSensor_Data WHERE Speed > 35 ) AS SensorD
    ON Insured_Customers.CustomerKey = SensorD.CustomerKey
ORDER BY YearlyIncome;
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FastCustomers ON Fast_Customers;
Mike

68 Export - Example from Microsoft
-- PolyBase scenario - export data from SQL Server to Hadoop
-- Create an external table
CREATE EXTERNAL TABLE [dbo].[FastCustomers2009] (
    [FirstName] char(25) NOT NULL,
    [LastName] char(25) NOT NULL,
    [YearlyIncome] float NULL,
    [MaritalStatus] char(1) NOT NULL )
WITH (
    LOCATION='/old_data/2009/customerdata',
    DATA_SOURCE = HadoopHDP2,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0 );
-- Export data: move old data to Hadoop while keeping it query-able via an external table.
INSERT INTO dbo.FastCustomers2009
SELECT T1.*
FROM Insured_Customers T1 JOIN CarSensor_Data T2
    ON (T1.CustomerKey = T2.CustomerKey)
WHERE T2.YearMeasured = 2009 AND T2.Speed > 40;
Mike
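The export example leans on two things not shown on the slide: the instance must allow PolyBase export, and the TextFileFormat object must already exist. A minimal sketch of both (the file format options are assumptions, not taken from the Microsoft example):

EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
GO
-- A delimited-text format matching the FILE_FORMAT name used above.
CREATE EXTERNAL FILE FORMAT TextFileFormat WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE) );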

69 Troubleshooting

70 Troubleshooting with DMVs
SELECT * FROM sys.dm_exec_compute_nodes;
SELECT * FROM sys.dm_exec_compute_node_errors;
SELECT * FROM sys.dm_exec_compute_node_status;
Kim

71 Troubleshooting - Examples from Microsoft
-- PolyBase troubleshooting scenarios
-- Pick the query that took the longest time
SELECT execution_id, st.text, dr.total_elapsed_time
FROM sys.dm_exec_distributed_requests dr
CROSS APPLY sys.dm_exec_sql_text(sql_handle) st
ORDER BY total_elapsed_time DESC;

-- Get the execution steps for the query based on the DSQL plan
SELECT execution_id, step_index, operation_type, distribution_type,
       location_type, status, total_elapsed_time, command
FROM sys.dm_exec_distributed_request_steps
WHERE execution_id = 'QIDXX'
ORDER BY total_elapsed_time DESC;

-- Get the DMS steps for the DMS move
SELECT execution_id, step_index, dms_step_index, status, type,
       bytes_processed, total_elapsed_time
FROM sys.dm_exec_dms_workers
WHERE execution_id = 'QIDXX'
ORDER BY total_elapsed_time DESC;

-- Get the information about the external DMS operations
SELECT * FROM sys.dm_exec_external_work
WHERE execution_id = 'QIDXX'
ORDER BY total_elapsed_time DESC;

-- Get the information about MR jobs executed during the Hadoop push-down. It contains a row for each
-- map-reduce job that is pushed down to Hadoop as part of running a PolyBase query against an external table
SELECT * FROM sys.dm_exec_external_operations;

-- Get information about the scale-out cluster
SELECT * FROM sys.dm_exec_compute_nodes;

-- is_external is the only way to tell that a table is an external table
SELECT name, type, is_external FROM sys.tables WHERE name = 'bands';
Kim

72 Find Longest Running Query
Kim

73 Get Execution Steps for QID
Kim

74 What’s Next?

75 What’s Next
Today was a 101-level intro to Enhanced PolyBase. In the future we hope to bring you more information on:
Performance / stress testing
More data source examples
More complex use cases
Testing out new preview versions as released, and eventually the GA
Live demos
Import/Export in action

76 Wrapping Up... Today we covered:
Benefits of virtualizing your data
What SQL Server 2019 Enhanced PolyBase is
Installation/configuration of Head and Compute nodes
Adding Oracle, MongoDB, Cosmos DB, and MariaDB as data sources
Pulling the data all together through a single interface
Tools for troubleshooting
Mike

77 Useful Links
More info on PolyBase:
List of DMVs:
Mike

78 Mind Blown?!?!?! ** courtesy of James Livingston **

79 Ways to Connect with the User Community
SQL Server: PASS, SQL Saturday, PASS Summit, Microsoft Ignite, SQLIntersection, GroupBy. Local group: PASS Ohio North Chapter.
Oracle: IOUG Collaborate, Oracle Open World. Local group: NEOOUG.
MongoDB: MongoDB World, Percona Database Conference, MongoDB sponsored groups. Local group: Cleveland MongoDB User Group.
Mike

80 Questions?
Kimberly St. Jacques - Twitter: @kimstjacques LinkedIn: Blog:
Michael Grayson - Twitter: @mikegray831 LinkedIn: Blog:
Mike

81 Thank You! Mike

