1
Big data architectures and the data lake
Big data architectures and the data lake. James Serra, Big Data Evangelist, Microsoft. With so many new technologies it can get confusing to pick the best approach to building a big data architecture. The data lake is a great new concept, usually built in Hadoop, but what exactly is it and how does it fit in? In this presentation I’ll discuss the four most common patterns in big data production implementations, the top-down vs bottoms-up approach to analytics, and how you can use a data lake and an RDBMS data warehouse together. We will go into detail on the characteristics of a data lake and its benefits, and how you still need to perform the same data governance tasks in a data lake as you do in a data warehouse. Come to this presentation to make sure your data lake does not turn into a data swamp!
2
About Me Microsoft, Big Data Evangelist
In IT for 30 years, worked on many BI and DW projects Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer Been perm employee, contractor, consultant, business owner Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions Blog at JamesSerra.com Former SQL Server MVP Author of book “Reporting with Microsoft SQL Server 2012” Fluff, but point is I bring real work experience to the session
3
Agenda Big Data Architectures Why data lakes? Top-down vs Bottom-up
Data lake defined Hadoop as the data lake Modern Data Warehouse Federated Querying Solution in the cloud SMP vs MPP
4
Big Data Architectures
5
Enterprise data warehouse augmentation
Seen when the EDW has been in existence a while and can’t handle new data. Data hub, not data lake. Cons: not offloading EDW work, can’t use existing tools, difficulty understanding data in the data hub. This scenario uses an enterprise data warehouse (EDW) built on an RDBMS, but extracts data from the EDW and loads it into a big data hub along with data from other sources that are deemed not cost-effective to move into the EDW (usually high-volume or cold data). Some data enrichment is usually done in the data hub. This data hub can then be queried, but primary analytics remain with the EDW. The data hub is usually built on Hadoop or NoSQL. This can save costs, since storage using Hadoop or NoSQL is much cheaper than an EDW. Plus, it can speed up the development of reports, since the data in Hadoop or NoSQL can be used right away instead of waiting for an IT person to write the ETL and create the schemas to ingest the data into the EDW. Another benefit is that it can support data growth faster, as it is easy to expand storage on a Hadoop/NoSQL solution compared with a SAN behind an EDW solution. Finally, it can help by reducing the number of queries on the EDW. This scenario is most common when an EDW has been in existence for a while and users are requesting data that the EDW cannot handle because of space, performance, and data loading times. The challenges of this approach are that you might not be able to use your existing tools to query the data hub, and that the data in the hub may be difficult to understand and join and may not be completely clean. Note that this pattern does not use the big data repository as a cleaning area or offload work from the EDW.
6
Data hub plus EDW Data hub is used as temporary staging and refining, no reporting Cons: data hub is temporary, no reporting/analyzing done with the data hub The data hub is used as a data staging and extreme-scale data transformation platform, but long-term persistence and analytics are performed in the EDW. Hadoop or NoSQL is used to refine the data in the data hub. Once refined, the data is copied to the EDW and then deleted from the data hub. This lowers the cost of data capture, provides scalable data refinement, and provides fast queries via the EDW. It also offloads the data refinement from the EDW.
7
All-in-one Data hub is total solution, no EDW
Cons: queries are slower, new training for reporting tools, difficulty understanding data, security limitations A distributed data system is implemented for long-term, high-detail big data persistence in the data hub and analytics without employing an EDW. Low-level code is written or big data packages are added that integrate directly with the distributed data store for extreme-scale operations and analytics. The distributed data hub is usually created with Hadoop, HBase, Cassandra, or MongoDB. BI tools specifically integrated with or designed for distributed data access and manipulation are needed. Data operations either use BI tools that provide NoSQL capability or require low-level code (e.g., MapReduce or Pig scripts). The disadvantages of this scenario are that reports and queries can have longer latency, that new reporting tools require training (which could lead to lower adoption), and the difficulty of providing governance and structure on top of a non-RDBMS solution.
8
Modern Data Warehouse Evolution of three previous scenarios
Ultimate goal Supports future data needs Data harmonized and analyzed in the data lake or moved to EDW for more quality and performance An evolution of the three previous scenarios that provides multiple options for the various technologies. Data may be harmonized and analyzed in the data lake or moved out to an EDW when more quality and performance is needed, or when users simply want control. ELT is usually used instead of ETL (see Difference between ETL and ELT). The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data. Hub-and-spoke should be your ultimate goal. See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.
9
Why data lakes?
10
Traditional business analytics process
Start with end-user requirements to identify desired reports and analysis. Define the corresponding database schema and queries. Identify the required data sources. Create an Extract-Transform-Load (ETL) pipeline to extract the required data (curation) and transform it to the target schema (“schema-on-write”). Create reports and analyze the data; new requirements loop back through the same cycle. (Diagram: relational and LOB application sources feed dedicated ETL tools such as SSIS, which load a defined schema that serves queries and results. All data not immediately required is discarded or archived.)
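To make “schema-on-write” concrete, here is a minimal T-SQL sketch of the traditional pattern (table and column names are purely illustrative, and in practice a dedicated ETL tool such as SSIS would do the load): the target schema is designed first, and the transform happens before the data ever lands in the warehouse.

```sql
-- Schema is defined up front, driven by the report requirements.
CREATE TABLE dw.FactSales
(
    SaleDateKey  int           NOT NULL,
    ProductKey   int           NOT NULL,
    CustomerKey  int           NOT NULL,
    SalesAmount  decimal(18,2) NOT NULL
);

-- ETL: transform source rows to fit the schema BEFORE loading.
-- Columns the requirements did not ask for are simply never loaded.
INSERT INTO dw.FactSales (SaleDateKey, ProductKey, CustomerKey, SalesAmount)
SELECT CONVERT(int, CONVERT(char(8), o.OrderDate, 112)),
       p.ProductKey,
       c.CustomerKey,
       o.Quantity * o.UnitPrice
FROM   staging.Orders  AS o
JOIN   dw.DimProduct   AS p ON p.ProductNaturalKey  = o.ProductId
JOIN   dw.DimCustomer  AS c ON c.CustomerNaturalKey = o.CustomerId;
```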
11
Need to collect any data
Harness the growing and changing nature of data Structured Unstructured “ ” Streaming Key Points: Businesses can use new data streams to gain a competitive advantage. Microsoft is uniquely equipped to help you manage the growing volume and variety of data: structured, unstructured, and streaming. Talk Track: Does it not seem like every day there is a new kind of data that we need to understand? New data types continue to expand—we need to be prepared to collect that data so that the organization can then go do something with it. Structured data, the type of data we have been working with for years, continues to accelerate. Think how many transactions are occurring across your business. Unstructured data, the typical source of all our big data, takes many forms and originates from various places across the web including social. Streaming data is the data at the heart of the Internet of Things revolution. Just think about how many things in your organization are smart or instrumented and generating data every second. All of this means that data volumes are growing and bringing new capacity challenges. You are also dealing with an enormous opportunity, taking all of this data and putting it to work. In order to take advantage of all this data, you first need a platform that enables you to collect any data—no matter the size or type. The Microsoft data platform is uniquely complete and can help you collect any data using a flexible approach: Collecting data on-premises with SQL Server SQL Server can help you collect and manage structured, unstructured, and streaming data to power all your workloads: OLTP, BI, and Data Warehousing With new in-memory capabilities that are built into SQL Server 2014, you get the benefit of breakthrough speed with your existing hardware and without having to rewrite your apps. If you’ve been considering the cloud, SQL Server provides an on-ramp to help you get started. Using the wizards built into SQL Server Management Studio, extending to the cloud by combining SQL and Microsoft Azure is simple. Capture new data types using the power and flexibility of the Microsoft Azure Cloud Azure is well equipped to provide the flexibility you need to collect and manage any data in the cloud in a way that meets the needs of your business. Big data in Azure: HDInsight: an Apache Hadoop-based analytics solution that allows cluster deployment in minutes, scale up or down as needed, and insights through familiar BI tools. SQL Databases: managed relational SQL Database-as-a-service that offers business-ready capabilities built on SQL Server technology. Blobs: a cloud storage solution offering the simplest way to store large amounts of unstructured text or binary data, such as video, audio, and images. Tables: a NoSQL key/value storage solution that provides simple access to data at a lower cost for applications that do not need robust querying capabilities. Intelligent Systems Service: cloud service that helps enterprises embrace the Internet of Things by securely connecting, managing, and capturing machine-generated data from a variety of sensors and devices to drive improvements in operations and tap into new business opportunities. Machine Learning: if you’re looking to anticipate business challenges or opportunities, or perhaps expand your data practice into data science, Azure’s new Machine Learning service—cloud-based predictive analytics— can help. 
ML Studio is a fully-managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning. Document DB: a fully managed, highly scalable, NoSQL document database service Azure Stream Analytics: real-time event processing engine that helps uncover insights from devices, sensors, infrastructure, applications, and data Azure Data Factory: enables information production by orchestrating and managing diverse data Azure Event Hubs: a scalable service for collecting data from millions of “things” in seconds Microsoft Analytics Platform System: In the past, to provide users with reliable, trustworthy information, enterprises gathered relational and transactional data in a single data warehouse. But this traditional data warehouse is under pressure, hitting limits amidst massive change. Data volumes are projected to grow tenfold over the next five years. End users want real-time responses and insights. They want to use non-relational data, which now constitutes 85 percent of data growth. They want access to “cloud-born” data, data that was created from growing cloud IT investments. Your enterprise can only cope with these shifts with a modern data warehouse—the Microsoft Analytics Platform System is the answer. The Analytics Platform System brings Microsoft’s massively parallel processing (MPP) data warehouse technology—the SQL Server Parallel Data Warehouse (PDW), together with HDInsight, Microsoft’s 100 percent Apache Hadoop distribution—and delivers it as a turnkey appliance. Now you can collect relational and non-relational data in one appliance. You can have seamless integration of the relational data warehouse and Hadoop with PolyBase. All of these options give you the flexibility to get the most out of your existing data capture investments while providing a path to a more efficient and optimized data environment that is ready to support new data types. Challenge is combining transactional data stored in relational databases with less structured data Big Data = All Data Get the right information to the right people at the right time in the right format
12
The three V’s Social and web analytics Live data feeds Advanced analytics
13
New big data thinking: All data has value
All data has potential value Data hoarding No defined schema—stored in native format Schema is imposed and transformations are done at query time (schema-on-read). Apps and users interpret the data as they see fit Iterate All data has immediate or potential value This leads to data hoarding—all data is stored indefinitely With an unknown future, there is no defined schema. Data is prepared and stored in native format; No upfront transformation or aggregation Schema is imposed and transformations are done at query time (schema-on-read). Applications and users interpret the data as they see fit. Gather data from all sources Store indefinitely Analyze See results
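As a contrast with the old way, here is a hedged schema-on-read sketch using PolyBase external tables (available in SQL Server 2016+, APS, and Azure SQL DW; the server address, folder path, and object names are assumptions for illustration). The files sit in the lake exactly as they arrived; a schema is only declared when someone wants to query them, and a different team could project a different schema over the same folder.

```sql
-- Pointer to the lake (Hadoop or Azure storage); location is illustrative.
CREATE EXTERNAL DATA SOURCE LakeStore
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', USE_TYPE_DEFAULT = TRUE));

-- The schema lives in this definition, not in the files themselves.
CREATE EXTERNAL TABLE dbo.Clickstream
(
    EventTime datetime2(0),
    UserId    int,
    Url       varchar(400)
)
WITH (LOCATION = '/raw/clickstream/', DATA_SOURCE = LakeStore, FILE_FORMAT = CsvFormat);

SELECT TOP (100) * FROM dbo.Clickstream;   -- schema applied at read time
```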
14
Top-down vs Bottom-up
15
Two Approaches to getting value out of data: Top-Down + Bottoms-Up
Top-Down (deductive) vs. Bottoms-Up (inductive). (Diagram: the four analytics types plotted by information difficulty against value — descriptive analytics “what happened?”, diagnostic analytics “why did it happen?”, predictive analytics “what will happen?”, prescriptive analytics “how can we make it happen?”; the deductive path runs theory → hypothesis → observation → confirmation, the inductive path runs observation → pattern → hypothesis → theory.) Top-down starts with descriptive analytics and progresses to prescriptive analytics: you know the questions to ask, there is lots of upfront work to get the data to where you can use it, and you model first. Bottoms-up starts with predictive analytics: you don’t know the questions to ask, little work needs to be done to start using the data, and you model later. There are two approaches to doing information management for analytics: Top-down (deductive approach). This is where analytics is done starting with a clear understanding of corporate strategy, where theories and hypotheses are made up front. The right data model is then designed and implemented prior to any data collection. Oftentimes, the top-down approach is good for descriptive and diagnostic analytics: what happened in the past and why did it happen? Bottom-up (inductive approach). This is the approach where data is collected up front before any theories and hypotheses are made. All data is kept so that patterns and conclusions can be derived from the data itself. This type of analysis allows for more advanced analytics, such as predictive or prescriptive analytics: what will happen and/or how can we make it happen? In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottom-up approach becomes part of the top-down approach.
16
Data Warehousing Uses A Top-Down Approach
(Diagram: data sources — OLTP, ERP, CRM, LOB — flow through ETL into the data warehouse, which feeds BI and analytics: dashboards and reporting. The top-down process runs: understand corporate strategy → gather business and technical requirements → design (dimension modelling, ETL design, reporting & analytics design) → set up infrastructure → implement the data warehouse (physical design, ETL development, reporting & analytics development, install and tune).)
17
The “data lake” Uses A Bottoms-Up Approach
Ingest all data regardless of requirements. Store all data in native format without schema definition. Do analysis, using analytic engines like Hadoop. (Diagram: sources — devices, sensors, social, web, video, clickstream, relational, LOB applications — land in the data lake, which serves batch queries, interactive queries, real-time analytics, machine learning, and the data warehouse.)
18
Data Lake + Data Warehouse Better Together
(Diagram: the data warehouse — fed by OLTP, ERP, CRM, and LOB sources through ETL — serves descriptive analytics (“what happened?”) and diagnostic analytics (“why did it happen?”) through dashboards and reporting, while the data lake — fed by devices, sensors, social, web, video, clickstream, relational, and LOB application data — serves predictive analytics (“what will happen?”) and prescriptive analytics (“how can we make it happen?”).)
19
Data lake defined
20
Exactly what is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed. Inexpensively store unlimited data. Collect all data “just in case”. Store data with no modeling – “schema on read”. Complements the EDW and frees up expensive EDW resources. Quick user access to data. ETL with Hadoop tools. Easily scalable. A place to back up data and to move older data. Also called a bit bucket, staging area, landing zone, or enterprise data hub (Cloudera).
21
Traditional Approaches
Current state of a data warehouse: Well-manicured, often relational sources. Known and expected data volumes and formats. Little to no change. Flat, canned, or multi-dimensional access to historical data. Many reports, multiple versions of the truth. 24-to-48-hour delay. Complex, rigid transformations. Required extensive monitoring. Transformed historical data into read structures.
22
Traditional Approaches
Current state of a data warehouse: Complex, rigid transformations can no longer keep pace. Monitoring is abandoned. Delay in data, inability to transform volumes, or to react to new sources. Repair, adjust, and redesign ETL. Increase in variety of data sources. Increase in data volume. Increase in types of data. Pressure on the ingestion engine. Reports become invalid or unusable. Delay in preserved reports increases. Users begin to “innovate” to relieve starvation.
23
New Approaches Data Lake Transformation (ELT not ETL)
Why move relational data to the data lake? Offload processing to refine data to free up the EDW, use low-cost storage for raw data to save space on the EDW, and help if ETL jobs on the EDW are taking too long. So you can actually use a data lake for small data: move EDW data to Hadoop, refine it, move it back to the EDW. Cons: rewriting all current ETL for Hadoop, re-training. I believe APS should be used for staging (i.e. “ELT”) in most cases, but there are some good use cases for using a Hadoop data lake: wanting to offload the data refinement to Hadoop, so the processing and space on the EDW are reduced; wanting to use some Hadoop technologies/tools to refine/filter data that are not available for APS; a landing zone for unstructured data, as it can ingest large files quickly and provide data redundancy; ELT jobs on the EDW are taking too long, so offload some of them to the Hadoop data lake; there may be cases when you want to move EDW data to Hadoop, refine it, and move it back to the EDW (offload processing, need to use Hadoop tools); the data lake is a good place for data that you “might” use down the road — you can land it in the data lake and have users use SQL via PolyBase to look at the data and determine if it has value. Extract and load, no/minimal transform. Storage of data in near-native format. Orchestration becomes possible. Streaming data accommodation becomes possible. All data sources are considered. Leverages the power of on-prem technologies and the cloud for storage and capture. Native formats, streaming data, big data. Refineries transform data on read. Produce curated data sets to integrate with traditional warehouses. Users discover published data sets/services using familiar tools.
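A minimal ELT sketch in T-SQL, assuming an external table over the raw files such as the dbo.Clickstream example shown earlier (all names illustrative): the raw data is loaded and exposed as-is, and the “T” runs inside the MPP engine with CTAS (CREATE TABLE AS SELECT, available in APS and Azure SQL DW) rather than in an upstream ETL tool.

```sql
-- Extract & load already happened: the raw files sit in the lake and are
-- exposed through an external table. The transform runs here, in the engine.
CREATE TABLE dbo.CleanClickstream
WITH (DISTRIBUTION = HASH(UserId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT UserId,
       CAST(EventTime AS date) AS EventDate,
       LOWER(Url)              AS Url
FROM   dbo.Clickstream                 -- external table over raw files
WHERE  EventTime >= '2019-01-01'
  AND  Url IS NOT NULL;
```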
24
Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze. NEW WAY: Ingest -> Analyze -> Structure.
25
Data Lake layers. Raw data layer – raw events are stored for historical reference; also called the staging layer or landing area. Cleansed data layer – raw events are transformed (cleaned and mastered) into directly consumable data sets; the aim is to standardize the way files are stored in terms of encoding, format, data types, and content (e.g. strings); also called the conformed layer. Application data layer – business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. a DW application, an advanced analysis process); this layer goes by a lot of other names: workspace, trusted, gold, secure, production ready, governed, presentation. Sandbox data layer – optional layer to “play” in; also called the exploration layer or data science workspace. You still need data governance so your data lake does not turn into a data swamp! Question: Do you see many companies building data lakes?
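One way to picture the layers, as a hedged PolyBase/CETAS sketch (folder names are illustrative, the data source and file format objects are assumed from the earlier schema-on-read example, and CETAS is available in APS and Azure SQL DW): raw files land untouched in a /raw folder, and a query promotes a cleaned, typed copy into a /cleansed folder.

```sql
-- Raw layer: files exactly as they arrived, every column still text.
CREATE EXTERNAL TABLE dbo.RawSales
(
    OrderId   varchar(50),
    OrderDate varchar(50),
    Amount    varchar(50)
)
WITH (LOCATION = '/raw/sales/', DATA_SOURCE = LakeStore, FILE_FORMAT = CsvFormat);

-- Cleansed layer: CETAS writes new files back to the lake with proper types.
CREATE EXTERNAL TABLE dbo.CleansedSales
WITH (LOCATION = '/cleansed/sales/', DATA_SOURCE = LakeStore, FILE_FORMAT = CsvFormat)
AS
SELECT CAST(OrderId   AS int)           AS OrderId,
       CAST(OrderDate AS date)          AS OrderDate,
       CAST(Amount    AS decimal(18,2)) AS Amount
FROM   dbo.RawSales
WHERE  ISNUMERIC(Amount) = 1;
```

The application and sandbox layers follow the same idea: further folders (or warehouse tables) produced from the cleansed layer.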
27
Should I use Hadoop or NoSQL for the data lake?
Most implementations use Hadoop as the data lake because of these benefits: an open-source software ecosystem that allows for massively parallel computing; no inherent structure (no conversion to JSON needed); good for batch processing, large files, volume writes, parallel scans, and sequential access (NoSQL is designed for large-scale OLTP); a large ecosystem of products; low cost. Con: performance.
28
Hadoop as the data lake
29
What is Hadoop? Distributed, scalable system on commodity HW
Composed of a few parts: HDFS – distributed file system; MapReduce – programming model. Other tools: Hive, Pig, Sqoop, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm. Main players are Hortonworks, Cloudera, and MapR. WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead). (Diagram: a Hadoop cluster with core services — HDFS, YARN, MapReduce, NFS, WebHDFS — data services such as Pig, Hive & HCatalog, HBase, Flume, Sqoop, and Falcon, and operational services such as Ambari and Oozie.) Key goal of slide: communicate what Hadoop is. Slide talk track: Everyone has heard of Hadoop. But what is it? And do I need it? Apache Hadoop is an open-source framework that supports data-intensive distributed applications on large clusters of commodity hardware. Hadoop is composed of a few parts: HDFS – the Hadoop Distributed File System is Hadoop’s file system, which stores large files (from gigabytes to terabytes) across multiple machines. MapReduce – a programming model that performs filtering, sorting, and other data retrieval commands as a parallel, distributed algorithm. Other parts of the ecosystem include HBase, R, Pig, Hive, Flume, Mahout, Avro, and ZooKeeper, which perform other functions to supplement compute and storage. Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware.
30
Hortonworks Data Platform (HDP) 2.6
(Under the covers of HDInsight.) Simply put, Hortonworks ties all the open source products together (22).
31
The real cost of Hadoop Total solution cost (5 years)
Hadoop 3.2x cheaper / RDBMS 3.6x cheaper (chart from “Big Data – What Does It Really Cost?”, Winter Corporation, 2013).
32
Use cases using Hadoop and a DW in combination: Bringing islands of Hadoop data together. Archiving data warehouse data to Hadoop (move) (Hadoop as cold storage). Exporting relational data to Hadoop (copy) (Hadoop as backup/DR, analysis, cloud use). Importing Hadoop data into the data warehouse (copy) (Hadoop as staging area, sandbox, data lake). HDInsight benefits: cheap, quick to provision. Key goal of slide: highlight the four main use cases for PolyBase. Slide talk track: There are four key scenarios for using PolyBase with the data lake of data normally locked up in Hadoop. PolyBase leverages the APS MPP architecture along with optimizations like push-down computation to query data using Transact-SQL faster than using other Hadoop technologies like Hive. More importantly, you can use the Transact-SQL join syntax between Hadoop data and PDW data without having to import the data into PDW first. PolyBase is a great tool for archiving older or unused data in APS to less expensive storage on a Hadoop cluster. When you do need to access the data for historical purposes, you can easily join it back up with your PDW data using Transact-SQL. There are times when you need to share your PDW data with Hadoop users, and PolyBase makes it easy to copy data to a Hadoop cluster. Using a simple SELECT INTO statement, PolyBase makes it easy to import valuable Hadoop data into PDW without having to use external ETL processes.
33
Modern Data Warehouse
34
Modern Data Warehouse Think about future needs:
Increasing data volumes Real-time performance New data sources and types Cloud-born data Multi-platform solution Hybrid architecture Dell Microsoft Analytics Platform System (v2, SQL 2012, 15TB-6PB) HP AppSystem for SQL 2012 Parallel Data Warehouse (v2, SQL 2012, 15TB-6PB) Quanta Microsoft Analytics Platform System (v2, SQL 2012, 15TB-6PB)
35
The Dream Modern Data Warehouse Enterprise Data Warehouse All Sources
36
The Reality 1) Copy source data into the Azure Data Lake Store (twitter data example) 2) Massage/filter the data using Hadoop (or skip using Hadoop and use stored procedures in SQL DW/DB to massage data after step #5) 3) Pass data into Azure ML to build models using Hive query (or pass in directly from Azure Data Lake Store) 4) Azure ML feeds prediction results into the data warehouse 5) Non-relational data in Azure Data Lake Store copied to data warehouse in relational format (optionally use PolyBase with external tables to avoid copying data) 6) Power BI pulls data from data warehouse to build dashboards and reports 7) Azure Data Catalog captures metadata from Azure Data Lake Store and SQL DW/DB 8) Power BI and Excel can pull data from the Azure Data Lake Store via HDInsight 9) To support high concurrency if using SQL DW, or for easier end-user data layer, create an SSAS cube
37
Base Architecture : Big Data Advanced Analytics Pipeline
(Diagram: a pipeline of stages — data sources, ingest, prepare (normalize, clean, etc.), analyze (statistical analysis, ML, etc.), publish (for programmatic consumption, BI/visualization), consume (alerts, operational stats, insights) — spanning on-prem data and Azure services, data in motion and data at rest. Three pipelines are shown: a near real-time analytics pipeline using Azure Stream Analytics (telemetry data stream → Event Hub → Stream Analytics → Machine Learning anomaly detection → Power BI dashboard of live stats, anomalies, and aggregates); an interactive analytics and predictive pipeline using Azure Data Factory (real-time readings and operational data, sensor readings and logs from local DBs → Azure Blob storage via scheduled hourly transfer → HDInsight custom ETL, aggregate/partition → Machine Learning → Azure SQL predictions → customer MIS dashboard of predictions and alerts, replacing a legacy system; historic laser data and fault/maintenance data as one-time drops); and a big data analytics pipeline using Azure Data Lake (sensor readings and operational logs → Azure Data Lake Store → Azure Data Lake Analytics → Azure SQL → device health dashboard of operational stats).)
38
Data Lake with DW use cases
Data lake: staging & preparation; data scientists/power users; batch processing; data refinement/cleaning; ETL workloads; storing older/backup data; a sandbox for data exploration; one-time reports; quick access to data; when you don’t know the questions. Data warehouse: serving, security & compliance; business people; low latency; complex joins; interactive ad-hoc query; a high number of users; additional security; large support for tools; dashboards; easily create reports (self-service BI); when you know the questions.
39
Microsoft data platform solutions
SQL Server 2017 (RDBMS): earned the top spot in Gartner’s Operational Database Magic Quadrant; JSON support; Linux support.
SQL Database (RDBMS/DBaaS): cloud-based service that is provisioned and scaled quickly; built-in high availability and disaster recovery; JSON support; Managed Instance option soon.
SQL Data Warehouse (MPP RDBMS/DBaaS): cloud-based service that handles relational big data; provision and scale quickly; can pause the service to reduce cost.
Azure Data Lake Store (Hadoop storage): removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics.
HDInsight (PaaS Hadoop compute/Hadoop clusters-as-a-service): a managed Apache Hadoop, Spark, R Server, HBase, Kafka, Interactive Query (Hive LLAP), and Storm cloud service made easy.
Azure Databricks (PaaS Spark clusters): a fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure.
Azure Data Lake Analytics (on-demand analytics job service/Big Data-as-a-service): cloud-based service that dynamically provisions resources so you can run queries on exabytes of data; includes U-SQL, a new big data query language.
Azure Cosmos DB (PaaS NoSQL: key-value, column-family, document, graph): globally distributed, massively scalable, multi-model, multi-API, low latency data service, which can be used as an operational database or a hot data lake.
Azure Database for PostgreSQL, MySQL, and MariaDB: a fully managed database service for app developers.
“If you need a specialized JSON database in order to take advantage of automatic indexing of JSON fields, tunable consistency levels for globally distributed data, and JavaScript integration, you may want to choose Azure DocumentDB as a storage engine.” “If you have pure JSON workloads where you want to use some query language that is customized and dedicated for processing of JSON documents, you might consider Microsoft Azure DocumentDB.”
40
Cortana Intelligence Suite Integrated as part of an end-to-end suite
(Diagram: data sources — apps, sensors and devices, data — flow through information management (Event Hubs, Data Catalog, Data Factory) into big data stores (Data Lake Store, SQL Data Warehouse), then machine learning and analytics (Machine Learning, Data Lake Analytics, HDInsight with Hadoop and Spark, Stream Analytics), then intelligence (Cortana, Bot Framework, Cognitive Services) and dashboards & visualizations (Power BI), driving action by people and automated systems through apps, web, mobile, and bots.) Offer structures: à la carte, Data Intensive, Analytics Intensive, Stream Intensive, All-inclusive.
41
Federated Querying
42
Federated Querying Other names: data virtualization, logical data warehouse, data federation, virtual database, and decentralized data warehouse. A model that allows a single query to retrieve and combine data from multiple data sources where it sits, so you don’t need to use ETL or learn more than one retrieval technology.
43
SQL Server and PolyBase: query relational and non-relational data with T-SQL. Capability: T-SQL for querying relational and non-relational data across SQL Server, Hadoop, Azure Data Lake Store, and Azure Blob storage. Benefits: new business insights across your data lake; leverage existing skill sets and BI tools; faster time to insights and a simplified ETL process. When it comes to key BI investments, we are making it much easier to manage relational and non-relational data with PolyBase technology, which allows you to query Hadoop data and SQL Server relational data through a single T-SQL query. One of the challenges we see with Hadoop is that there are not enough people out there with the Hadoop and MapReduce skill set, and this technology simplifies the skills needed to work with Hadoop data. This can also work across your on-premises environment or SQL Server running in Azure. PolyBase will add support for Teradata, Oracle, SQL Server, MongoDB, and generic ODBC (Spark, Hive, Impala, DB2).
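Here is a hedged sketch of what that looks like in practice (table names are illustrative and assume an external table such as the dbo.Clickstream sketch from earlier): one T-SQL statement joins warehouse rows to data that still lives in Hadoop or Azure storage, and PolyBase can push computation down to the cluster.

```sql
-- One query, two worlds: dw.DimCustomer is a local relational table,
-- dbo.Clickstream is a PolyBase external table over files in the lake.
SELECT   c.Region,
         COUNT_BIG(*) AS Clicks
FROM     dbo.Clickstream AS k                              -- external (lake)
JOIN     dw.DimCustomer  AS c ON c.CustomerKey = k.UserId  -- local (warehouse)
GROUP BY c.Region;
```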
44
Solution in the cloud
45
Benefits of the cloud Agility Innovation Risk Unlimited elastic scale
Agility: unlimited elastic scale, pay for what you need. Innovation: quick time to market, fail fast. Risk: availability, reliability, security. Four reasons to migrate your SQL Server databases to the cloud: security, agility, availability, and reliability. Reasons not to move to the cloud: security concerns (potential for compromised information, issues of privacy when data is stored in a public facility, might be more prone to outside security threats because it is high-profile, some providers might not implement the same layers of protection you can achieve in-house); lack of operational control: lack of access to servers (e.g. say you are hacked and want to get to security and system log files; if something goes wrong you have no way of controlling how and when a response is carried out; the provider can update software, change configuration settings, and allocate resources without your input or your blessing; you must conform to the environment and standards implemented by the provider); lack of ownership (an outside agency can get to data more easily in a cloud data center that you don’t own vs. data in your onsite location that you own, or a concern that you share a cloud data center with other companies and someone from another company can be onsite near your servers); compliance restrictions: regulations (health, financial), legal restrictions (e.g. data can’t leave your country), company policies; you may be sharing resources on your server, as well as competing for system and network resources; data getting stolen in flight (i.e. from the cloud data center to the on-prem user). Total cost of ownership calculator:
46
Constraints of on-premise data
Scale constrained to on-premises procurement. CapEx up-front costs; most companies instead prefer a yearly operating expense (OpEx). A staff of employees or consultants must be retained to administer and support the hardware and software in place. Expertise needed for tuning and deployment.
47
Talking points when using the cloud for DW
Public and private cloud Cloud-born data vs on-prem born data Transfer cost from/to cloud and on-prem Sensitive data on-prem, non-sensitive in cloud Look at hybrid solutions
48
SMP vs MPP
49
SMP vs MPP SMP - Symmetric Multiprocessing
SMP – Symmetric Multiprocessing: multiple CPUs used to complete individual processes simultaneously; all CPUs share the same memory, disks, and network controllers (scale-up); all SQL Server implementations up until now have been SMP; mostly, the solution is housed on a shared SAN. MPP – Massively Parallel Processing: uses many separate CPUs running in parallel to execute a single program; shared nothing — each CPU has its own memory and disk (scale-out); segments communicate using a high-speed network between nodes. SMP is one server where each CPU in the server shares the same memory, disk, and network controllers (scale-up). MPP means data is distributed among many independent servers running in parallel and is a shared-nothing architecture, where each server operates self-sufficiently and controls its own memory and disk (scale-out).
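To make the MPP side concrete, here is a hedged sketch of how a table is declared in an MPP system such as APS or Azure SQL Data Warehouse (names are illustrative): the engine hash-distributes the rows across independent nodes, so a query fans out and each node scans only its own slice, whereas on an SMP SQL Server box the same CREATE TABLE has no distribution clause because there is only one server sharing memory and disk.

```sql
-- MPP (APS / Azure SQL DW): rows are spread across compute nodes by UserId.
CREATE TABLE dbo.FactPageViews
(
    PageViewId bigint NOT NULL,
    UserId     int    NOT NULL,
    ViewDate   date   NOT NULL,
    DurationMs int    NOT NULL
)
WITH (DISTRIBUTION = HASH(UserId), CLUSTERED COLUMNSTORE INDEX);

-- SMP (plain SQL Server): same table, one server, shared memory/disk/network.
-- CREATE TABLE dbo.FactPageViews (PageViewId bigint, UserId int, ViewDate date, DurationMs int);
```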
50
DW Scalability Spider Chart
(Spider chart: MPP offers multidimensional scalability, while SMP is tunable in one dimension at the cost of the others. Axes include data volume (10 TB to 5 PB), query concurrency (100 to 10,000), query complexity (3–5 way joins up to joins plus OLAP operations, aggregation, complex WHERE constraints, views, parallelism, 5–10 way joins), schema sophistication (simple star to multiple integrated stars and normalized), data freshness (weekly or daily loads to near-real-time data feeds), mixed workload (batch reporting and repetitive queries to ad hoc queries and data analysis/mining, strategic and tactical queries, loads, SLAs), query data volume (MBs to TBs), and query freedom. The spiderweb depicts important attributes to consider when evaluating data warehousing options; big data support is the newest dimension.) Data volume (raw, user data) – raw data stored in the warehouse. This is the user data stored in the warehouse; it does not include generated data that also takes space within the warehouse, such as indexes, summarizations, aggregations, duplicated data, and system overhead. Query concurrency – the volume of work that can be done at the same time, most commonly the number of queries that the database can process at the same time. It can also include load and in-database transformation work and stored procedure processing activity. Logged-on users not currently executing a query do not add to the concurrency workload. Query complexity – the degree to which queries are complex in areas that make a query difficult or resource intensive for a database system. These areas include the number of tables involved in joins, complex WHERE constraints in the SQL, aggregations and statistical functions, and the use of views in addition to base tables. Business intelligence query tools often generate very complex queries. Schema sophistication – the ability to choose the schema that meets your business requirements versus limiting the complexity of the schema due to technology performance limitations of the database. It’s the ability to deploy a denormalized star schema, a sophisticated and complex normalized 3NF schema, a combination of the two, or anything in between to meet the requirements of the business. Query data volume – refers to how much data must be touched to satisfy a query. Teradata features that can be cited as reducing the amount of data touched include its compression capabilities, efficient row storage, strong indexing capabilities, and lack of a storage requirement for the primary index. Query freedom – the ability for users to ask any question of the data at the time best for the business. This is an indication of how free the users are to ask exploratory, broad, or complex questions as well as expected and tuned queries, and to ask new types of questions associated with new applications. Data freshness – the ability to load data into the warehouse and to update data in the warehouse at the speed the business operates. This is an indication of whether the data in the warehouse can be kept current and in sync with business processes and operations to the degree necessary to respond to events and business activities as well as to provide meaningful analyses. Mixed workload – the ability of the database to handle the broad mix of tasks for which a data warehouse is used today without impacting effectiveness in any area. For example, data warehouses must answer complex strategic questions as well as brief tactical questions or customer inquiries, and at the same time data must be loaded and updated. Can the database handle the various workloads concurrently while meeting the very different service level agreement attributes (e.g., response time, performance consistency) of the various types of work? Does the database require separation of work (e.g., batch windows)?
51
Summary We live in an increasingly data-intensive world
Much of the data stored online and analyzed today is more varied than the data stored in recent years. More of our data arrives in near-real time. “Data is the new currency!” This presents a large business opportunity. Are you ready for it?
52
Other Related Presentations
Building a Big Data Solution Choosing technologies for a big data solution in the cloud How does Microsoft solve Big Data? Benefits of the Azure cloud Should I move my database to the cloud? Implement SQL Server on a Azure VM Relational databases vs Non-relational databases Introduction to Microsoft’s Hadoop solution (HDInsight) Introducing Azure SQL Database Introducing Azure SQL Data Warehouse Visit my blog at: JamesSerra.com (where these slide decks are posted under the “Presentation” tab)
53
Resources Why use a data lake? http://bit.ly/1WDy848
Big Data Architectures The Modern Data Warehouse: Hadoop and Data Warehouses:
54
Q & A James Serra, Big Data Evangelist
Email me at: Follow me at: Link to me at: Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)