
1 Please Visit Sponsors and Enter Raffles

2 PolyBase in SQL Server – Big Data Queried with T-SQL
Hubert Kobierzewski, Codec-dss
SQL Saturday #501 – Dublin, 18/06/2016

3 Hubert K. Kobierzewski
BI Consultant at Codec-dss – over 9 years
Specialized in: data warehousing, ETL processes, and business intelligence
Ex-developer
MS SQL Server certified (MCDBA, MCTS, MCITP, MCSE – BI, ex-MCT)
Member of Azure Advisors (internal MS group)
Leader of the Warsaw PLSSUG chapter

4 What is Big Data and why is it valuable to the business?
Evolution in the nature and use of data in the enterprise
Data complexity: variety and velocity
[Slide chart: data volume growing from megabytes to petabytes; value to the business growing from historical analysis and insight toward predictive analytics and forecasting]

KEY POINT
Communicate what Big Data is.

TALK TRACK
ERP, SCM, CRM, and transactional web applications are classic examples of systems processing transactions. Highly structured data in these systems is typically stored in SQL databases. Web 2.0 is about how people and things interact with each other or with your business; web logs, user clickstreams, social interactions and feeds, and user-generated content are classic places to find interaction data. Big Data is the explosion of data volume and types, inside and outside the business, too large for traditional systems to manage. There are multiple types of data, including personal, organizational, public, and private. More important, Big Data is changing how the business uses data, from historical analysis to predictive analytics. Enterprises are using data in more progressive and higher-value applications. These uses and applications are changing how data must be stored, managed, analyzed, and accessed in order to provide not just the historical and insight analysis of the current data warehouse, but the predictive analytics and forecasting needed to stay competitive in the current marketplace.

5 Hadoop (some elements relevant in this presentation)
HDFS – a distributed, scalable, fault-tolerant file system
MapReduce – a framework for writing fault-tolerant, scalable, distributed applications
Hive – a relational DBMS that stores its tables in HDFS and uses MapReduce as its target execution language
Sqoop – a library and framework for moving data between HDFS and a relational DBMS

Key goal of slide: communicate what Hadoop is.

Slide talk track:
Everyone has heard of Hadoop. But what is it? And do I need it? Apache Hadoop is an open-source solution framework that supports data-intensive distributed applications on large clusters of commodity hardware. Hadoop is composed of a few parts:
HDFS – the Hadoop Distributed File System is Hadoop's file system, which stores large files (from gigabytes to terabytes) across multiple machines.
MapReduce – a programming model that performs filtering, sorting, and other data-retrieval commands across a parallel, distributed algorithm.
Other parts of the Hadoop ecosystem include HBase, R, Pig, Hive, Flume, Mahout, Avro, and ZooKeeper, which all perform supplementary functions.

6 Move HDFS into the warehouse before analysis
Hadoop alone is not the answer to all Big Data challenges: the Hadoop ecosystem has a steep learning curve and can be slow and inefficient.
[Slide diagram: new data sources (devices, web, sensor, social) either land in HDFS (Hadoop), forcing teams to learn new skills beyond T-SQL, or are moved into the warehouse by an ETL layer that must be built, integrated, managed, maintained, and supported]

KEY POINT
Communicate conceptually how companies are managing Big Data in current data warehouse environments. This shows both setting up a side-by-side Hadoop cluster and ETL-ing data into the existing data warehouse.

TALK TRACK
Many companies have responded to the explosion of Big Data by setting up side-by-side Hadoop ecosystems. However, these companies are learning the limitations of this approach, including:
Steep learning curve of MapReduce and other Hadoop ecosystem tools
Cost of installing, maintaining, and tooling side-by-side ecosystems to support two separate query models
Many Hadoop solutions do not integrate into enterprise or other data warehouse systems, creating complexity and cost and slowing time to insight
Some Hadoop solutions feature vendor lock-in, creating long-term obligations
Other companies set up costly extract, transform, and load (ETL) operations to move non-relational data directly into the data warehouse. This requires IT to modify or create new data schemas for all new data, which is also time-consuming and costly. As a result, performance is degraded, and it is often more expensive to integrate new data, build new applications, or access key BI insights.

7 PolyBase in the Modern Data Warehouse
Background: research done by the Gray Systems Lab, led by Technical Fellow David DeWitt
High-level goals for PolyBase:
Seamless integration with Hadoop via regular T-SQL
Enhancing the MPP query engine to process data coming from the Hadoop Distributed File System (HDFS)
Fully parallelized query processing for high-performance data import and export from HDFS
Integration with various Hadoop implementations: Hadoop on Windows Server, Hortonworks, and Cloudera

8 Prerequisites for installing PolyBase
64-bit SQL Server 2016 Enterprise, Developer, or Evaluation edition
Microsoft .NET Framework 4.0
Oracle Java SE Runtime Environment (JRE) version 7.51 or higher
Minimum memory: 4 GB
Minimum hard disk space: 2 GB
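After installation, PolyBase must be told which Hadoop distribution it will talk to. A minimal configuration sketch, assuming a Hortonworks-style cluster (the option value varies by distribution and SQL Server version, so check the documentation for yours):

EXEC sp_configure 'hadoop connectivity', 7;  -- e.g. 7 = Hortonworks 2.x on Linux in SQL Server 2016
RECONFIGURE;
-- The SQL Server and PolyBase services must be restarted for the change to take effect.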

9 External tables
Internal representation of data residing outside of the appliance
Supports a wide array of data types, excluding text, ntext, and similar, but including binary and varbinary
SQL permissions: CREATE TABLE, ALTER ANY SCHEMA, and ALTER ANY EXTERNAL DATA SOURCE

CREATE EXTERNAL TABLE table_name
    ({<column_definition>} [ ,...n ])
    WITH (
        DATA_SOURCE = <data_source>,   -- (1) referencing the external data source
        FILE_FORMAT = <file_format>,   -- (2) referencing the external file format
        LOCATION = '<file_path>',      -- (3) path of the Hadoop file/folder
        [REJECT_VALUE = <value>], ...  -- (4) (optional) reject parameters
    ) [;]
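To make the syntax concrete, here is a hedged sketch of an external table over the Twitter data used later in the predicate-pushdown slide. The data source (ds_hdp), file format (ff_textfile), and HDFS path are hypothetical names, defined in the sketches on the next slides:

CREATE EXTERNAL TABLE dbo.Twitter_Table (
    [User]    NVARCHAR(100),
    Location  NVARCHAR(10),
    Product   NVARCHAR(50),
    Sentiment INT,
    Rtwt      INT,
    [Hour]    INT,
    [Date]    DATE
)
WITH (
    LOCATION     = '/social_media/twitter/',  -- HDFS folder (hypothetical)
    DATA_SOURCE  = ds_hdp,
    FILE_FORMAT  = ff_textfile,
    REJECT_TYPE  = VALUE,   -- reject threshold counted as an absolute number of rows
    REJECT_VALUE = 100      -- queries fail once more than 100 rows fail to parse
);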

10 External data sources
Internal representation of an external data source
Supports Hadoop as a data source and Windows Azure Blob Storage (WASB, formerly known as ASV)
Enables or disables split-based query processing
Generates MapReduce jobs on the fly (fully transparent to the end user)
ALTER ANY EXTERNAL DATA SOURCE permission required

CREATE EXTERNAL DATA SOURCE datasource_name
    WITH (
        TYPE = <data_source>,                          -- (1) type of external data source
        LOCATION = '<location>',                       -- (2) location of the external data source
        [RESOURCE_MANAGER_LOCATION = '<rm_location>']  -- (3) enables or disables MapReduce job generation
    ) [;]
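A hedged, concrete sketch of the statement above; the name, host, and ports are placeholders for your cluster (8020 and 8032 are common defaults for the NameNode and the YARN resource manager):

CREATE EXTERNAL DATA SOURCE ds_hdp
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.10.10.10:8020',
    RESOURCE_MANAGER_LOCATION = '10.10.10.10:8032'  -- optional; supplying it enables MapReduce push-down
);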

11 External file formats
Internal representation of an external file format
Supports delimited text files, Hive RCFiles, and Hive ORC
Enables or disables split-based query processing
Generates MapReduce jobs on the fly
ALTER ANY EXTERNAL FILE FORMAT permission required

CREATE EXTERNAL FILE FORMAT fileformat_name
    WITH (
        FORMAT_TYPE = <type>,                   -- (1) type of external file format
        [SERDE_METHOD = '<serde_method>'],      -- (2) (de)serialization method [Hive RCFile]
        [DATA_COMPRESSION = '<compr_method>'],  -- (3) compression method
        [FORMAT_OPTIONS (<format_options>)]     -- (4) (optional) format options [text files]
    ) [;]
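As a concrete sketch, an ORC file format with Snappy compression (the name is illustrative; DATA_COMPRESSION applies to ORC and RCFiles, while SERDE_METHOD applies to RCFiles only):

CREATE EXTERNAL FILE FORMAT ff_orc
WITH (
    FORMAT_TYPE = ORC,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);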

12 Format options for delimited text files
<format_options> ::=
    [ FIELD_TERMINATOR = 'Value' ]
    [ , STRING_DELIMITER = 'Value' ]
    [ , DATE_FORMAT = 'Value' ]
    [ , USE_TYPE_DEFAULT = 'Value' ]

FIELD_TERMINATOR – indicates the column delimiter
STRING_DELIMITER – specifies the delimiter for string data type fields
DATE_FORMAT – specifies a particular date format
USE_TYPE_DEFAULT – specifies how missing entries in text files are treated
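Putting the options together, a hedged sketch of the delimited-text format referenced by the earlier external-table example (ff_textfile is a hypothetical name; adjust the terminators to match your files):

CREATE EXTERNAL FILE FORMAT ff_textfile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',           -- column delimiter
        STRING_DELIMITER = '"',           -- wraps string fields
        DATE_FORMAT      = 'yyyy-MM-dd',  -- how dates are written in the files
        USE_TYPE_DEFAULT = TRUE           -- replace missing values with the column type's default
    )
);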

13 PolyBase – Predicate pushdown
HDFS file/directory: //hdfs/social_media/twitter, //hdfs/social_media/twitter/Daily.log
Dynamic binding, column filtering, and row filtering against a sample Twitter table in Hadoop (columns: User, Location, Product, Sentiment, Rtwt, Hour, Date):

SELECT [User], Product, Sentiment
FROM Twitter_Table
WHERE Hour = Current - 1
  AND Date = Today
  AND Sentiment <= 0

[Slide illustration: sample rows of the Twitter table; the query's column list drives column filtering and its WHERE clause drives row filtering, both pushed down to Hadoop]

14 Query Capabilities – Push-Down Computation
Pushing compute happens either at the data source level or on a per-query basis using new query hints.

At the data source level (angle-bracketed hosts are placeholders):

CREATE EXTERNAL DATA SOURCE ds_hdp
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://<host>:8020',
    RESOURCE_MANAGER_LOCATION = '<host>:8032');

Per query:

SELECT DISTINCT C.FirstName, C.LastName, C.MaritalStatus
FROM Insurance_Customer_SQL AS C  -- table in SQL Server
OPTION (FORCE EXTERNALPUSHDOWN);  -- push-down computation hint
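The opposite hint is also available for cases where the startup cost of a MapReduce job would outweigh its benefit; a sketch against the hypothetical Twitter table from the earlier slides:

SELECT [User], Product, Sentiment
FROM dbo.Twitter_Table
WHERE Sentiment <= 0
OPTION (DISABLE EXTERNALPUSHDOWN);  -- keep all processing inside SQL Server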

15 [image-only slide; no transcript text]

16 [image-only slide; no transcript text]

17 PolyBase Demo
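The transcript does not include the demo script, but once the hypothetical objects from the earlier slides exist, Hadoop data joins local tables in plain T-SQL. A minimal sketch, assuming dbo.Customers is an ordinary local table:

SELECT t.[User], c.Region, t.Sentiment
FROM dbo.Twitter_Table AS t   -- external table over HDFS
JOIN dbo.Customers     AS c   -- ordinary SQL Server table
    ON c.UserName = t.[User]
WHERE t.Sentiment <= 0;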

18 Use cases where PolyBase simplifies using Hadoop data
Bringing islands of Hadoop data together
Running queries against Hadoop data
Importing Hadoop data into a data warehouse (copy)
Archiving data warehouse data to Hadoop (move)
Exporting relational data to Hadoop (copy)
(Import and export examples are sketched below.)

KEY POINT
Highlight the four main use cases for PolyBase.

TALK TRACK
There are four key scenarios for using PolyBase with the lake of data normally locked up in Hadoop. PolyBase leverages the PDW MPP architecture along with optimizations like push-down computation to query data using Transact-SQL faster than other Hadoop technologies like Hive. More importantly, you can use Transact-SQL join syntax between Hadoop data and PDW data without having to import the data into PDW first. PolyBase is a great tool for archiving older or unused data in PDW to less expensive storage on a Hadoop cluster; when you do need to access the data for historical purposes, you can easily join it back up with your PDW data using Transact-SQL. There are times when you need to share your PDW data with Hadoop users, and PolyBase makes it easy to copy data to a Hadoop cluster. Using a simple SELECT INTO statement, PolyBase makes it easy to import valuable Hadoop data into PDW without having to use external ETL processes.
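Hedged sketches of the import and export paths mentioned above, using the hypothetical Twitter objects (the export setting is a one-time configuration in SQL Server 2016):

-- Import (copy) Hadoop data into a local table:
SELECT *
INTO dbo.Twitter_Local
FROM dbo.Twitter_Table;

-- Export (copy) relational data to Hadoop via an external table:
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
INSERT INTO dbo.Twitter_Archive  -- an external table over an HDFS folder
SELECT * FROM dbo.Twitter_Local
WHERE [Date] < '2014-01-01';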

19 The MPP Engine's Integration Method – without PolyBase
[Slide diagram: the MPP DWH engine (control node plus compute nodes) exchanges data with the Hadoop cluster or HDInsight (name node plus data nodes) through a SQOOP-based connector]

20 The MPP Engine's Integration Method – with PolyBase
[Slide diagram: the MPP DWH engine's compute nodes read from and write to the Hadoop cluster's data nodes directly through DMS (the Data Movement Service), with no SQOOP hop; the control node and name node coordinate]

21 Major Competitors
Oracle – since version 9i (ca. 2003)
IBM PureData System
Pivotal Greenplum
Oracle BDA (Big Data Appliance)

22 Read and watch more…
MSDN documentation
Brief introduction on Channel 9
Andrew Peterson's blog

23 Questions

24 Coming Up Next
Awesome raffle prizes
Free beer and BBQ from 17:00

