Scalable Data Pipelines


Scalable Data Pipelines: Azure Data Factory and Data Flows
Ted Malone – Solution Architect – Microsoft (@tedmalone)

Thank you to our Sponsors (www.Meetup.com/EdiDpMeetup)
Sponsor tiers: Global Partner, Gold, Silver, Bronze.
A special thank-you to our volunteers (light blue #SQLSat927 polo shirts).

Make Plans for PASS Summit 2020
Join us in Houston, November 10 – 13, 2020, for the largest gathering of Microsoft data professionals: over 200 sessions by industry experts and 3 days of networking with people just like you. Don’t miss out; future-proof your career at PASSsummit.com.

About me
Microsoft Solution Architect – Data and AI, Customer Success Unit
AI, Machine Learning, Big Data, Advanced Analytics, Data Warehousing, etc.
Long-time SQL Server geek (1st version installed on OS/2 1.1)
Previous Visual Studio Team System MVP
http://aka.ms/tedmalone
http://blog.sqltrainer.com

Azure Data Factory

The world is changing
Let’s start by talking about how the world is changing. We all aspire to create disruption that constructs new realities for customers and builds a distinct advantage for our organizations. To make that leap, we have to look across trends and decide which variables and trends we need to care about, which ones will prevail, and which ones we invest in.

Data will grow to 44 ZB in 2020. Today, 80% of organizations adopt cloud-first strategies. AI investment increased by 300% in 2017.
We’ve identified 3 major trends that we believe will heavily shape disruption in the future. The 1st is the astronomical explosion of data: by 2020, data will reach 44 ZB, quadruple what it is today. The 2nd trend is cloud adoption: 4 out of every 5 companies invest in public cloud technologies. The 3rd major trend is artificial intelligence: AI gives life to all that data we’re creating, and this year alone, investment in AI has increased 300%.

Organizations that harness data, cloud, and AI outperform

There are barriers to getting value from data
Barriers: data silos; incongruent data types; complexity of solutions; multi-cloud environment; rising costs.
Capabilities listed on the slide: data integration; support for diverse data models; unlimited scale; fully managed infrastructure; lower TCO; on-premises, in the cloud, or hybrid.
Through a Keystone research study, we learned that companies in the top quartile for “investing in their data platform” vastly outperformed companies in the bottom quartile. With double the operating margin, they are crushing the competition. Investments alone don’t make you money, and they don’t give you a competitive edge; investments enable innovation. This leads us to our second point that….

Derive real value from your data
Barriers: data silos; incongruent data types; performance constraints; complexity of solutions; rising costs.
Capabilities listed on the slide: one hub for all data; support for diverse types of data; unlimited data scale; familiar tools and ecosystem; lower TCO.
Deployment: on-premises, hybrid, Azure.

Organizations that harness data, cloud, and AI outperform
Nearly double operating margin; $100M in additional operating income.

Azure Data Factory: a fully managed data integration service in the cloud
Productive: drag-and-drop UI; codeless data movement.
Hybrid: orchestrate where your data lives; lift SSIS packages to Azure.
Scalable: serverless scalability with no infrastructure to manage.
Trusted: certified compliant data movement.

Azure Data Factory: modernize your enterprise data warehouse at scale
[Diagram: Integrate via Azure Data Factory. Source labels: Social, LOB, Graph, IoT, Image, CRM; Cloud, VNet, On-premises. Stages: Ingest, Store, Prep & Transform, Model & Serve. Component labels: data orchestration, scheduling and monitoring; Azure Data Lake; Azure Storage; data transformations; machine learning; Azure SQL DW, HDInsight, Data Lakes; Azure Analysis Services; apps and insights.]

Orchestrate with Azure Data Factory: modernize your enterprise data warehouse at scale
[Diagram: Stages: Ingest, Store, Prep & Train, Model & Serve. Sources: on-premises data (Oracle, SQL, Teradata, fileshares, SAP); cloud data (Azure, AWS, GCP); SaaS data (Salesforce, Workday, Dynamics). Components: Azure Data Factory, Azure Blob Storage, Azure Databricks, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI.]
Microsoft Azure also supports other Big Data services like Azure HDInsight, Azure SQL Database and Azure Data Lake to allow customers to tailor the above architecture to meet their unique needs.

SSIS Integration Runtime
Lift your SQL Server Integration Services (SSIS) packages to Azure.
[Diagram: Cloud: Azure Data Factory, SSIS Integration Runtime, SSIS Cloud ETL, cloud data sources, SQL DB, Managed Instance. VNet. On-premises: Microsoft SQL Server Integration Services, on-premises data sources, SQL Server.]

Hybrid and Multi-Cloud Data Integration
Azure Data Factory: PaaS data integration. Author, orchestrate and monitor with Azure Data Factory.
Sources: on-prem, SaaS apps, public cloud.
Outputs: data-driven applications; data science and machine learning models; analytical dashboards using Power BI.

Access all your data
65+ connectors and growing. Azure IR available in 20 regions. Hybrid connectivity using self-hosted IR: on-prem & VNet.
Azure: Azure Blob Storage, Azure Data Lake Store, Azure SQL DB, Azure SQL DW, Azure Cosmos DB, Azure DB for MySQL, Azure DB for PostgreSQL, Azure Search, Azure Table Storage, Azure File Storage
Database: Amazon Redshift, SQL Server, Oracle, MySQL, Netezza, PostgreSQL, SAP BW, SAP HANA, Google BigQuery, Informix, Sybase, DB2, Greenplum, MariaDB, Microsoft Access, Drill, Hive, Phoenix, HBase, Presto, Impala, Spark, Vertica, GE Historian
File Storage: Amazon S3, File System, FTP, SFTP, HDFS
NoSQL: Couchbase, Cassandra, MongoDB
Services and Apps: Dynamics 365, Salesforce, Dynamics CRM, Salesforce Service Cloud, SAP C4C, ServiceNow, Oracle CRM, Hubspot, Oracle Service Cloud, Marketo, SAP ECC, Oracle Responsys, Zendesk, Oracle Eloqua, Zoho CRM, Salesforce ExactTarget, Amazon Marketplace, Atlassian Jira, Magento, Concur, PayPal, QuickBooks Online, Shopify, Xero, Square, Web table
Generic: HTTP, OData, ODBC
* Supported file formats: CSV, AVRO, ORC, Parquet, JSON

Control Flow Introduced in Azure Data Factory
Coordinate pipeline activities into finite execution steps to enable looping, conditionals, and chaining, while separating data transformations into individual data flows.
[Diagram labels: Trigger (Event, Wall Clock, On Demand); My Pipeline 1 (Activity 1 to Activity 4); For Each…; My Pipeline 2 (Activity 1, Activity 2); "Success, params" and "Error, params" links between activities; “On Error” Activity 1.]
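
The chaining and error-path idea on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `Activity` class, the `run_pipeline` function, and the status strings are hypothetical names invented here, not the Data Factory API, and the sketch covers success/failure dependencies only (no looping or parameters).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Activity:
    """Hypothetical stand-in for a pipeline activity (not an ADF type)."""
    name: str
    run: Callable[[], None]
    # (upstream activity name, required outcome) pairs;
    # the outcome is "Succeeded" or "Failed".
    depends_on: List[Tuple[str, str]] = field(default_factory=list)


def run_pipeline(activities: List[Activity]) -> Dict[str, str]:
    """Run activities in list order and record Succeeded, Failed, or Skipped."""
    status: Dict[str, str] = {}
    for act in activities:
        # An activity runs only if every dependency ended with the required outcome.
        if any(status.get(up) != outcome for up, outcome in act.depends_on):
            status[act.name] = "Skipped"
            continue
        try:
            act.run()
            status[act.name] = "Succeeded"
        except Exception:
            status[act.name] = "Failed"
    return status


def failing_copy() -> None:
    raise RuntimeError("copy failed")


log: List[str] = []
result = run_pipeline([
    Activity("Copy", failing_copy),
    Activity("Transform", lambda: log.append("transform"), [("Copy", "Succeeded")]),
    Activity("OnError", lambda: log.append("alert"), [("Copy", "Failed")]),
])
print(result)
```

Because the copy step fails here, the transform step is skipped and only the "on error" branch runs, which mirrors the "Error, params" path in the diagram.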

Azure Data Factory Updated Flexible Application Model (2/13/2020 1:38 AM)
[Diagram elements: Triggers, Trigger Runs, Pipeline, Pipeline Runs, Activity, Activity Runs, Data Movement, Data Transformation, Dispatch, Dataset, Linked Service, Integration Runtime.]
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. The activities in a pipeline define actions to perform on your data. For example, you might use a copy activity to copy data from an on-premises SQL Server to Azure Blob storage. Then, you might use a Hive activity that runs a Hive script on an Azure HDInsight cluster to process data from Blob storage to produce output data. Finally, you might use a second copy activity to copy the output data to Azure SQL Data Warehouse, on top of which business intelligence (BI) reporting solutions are built.
A dataset is a named view of data that references the data and the structure of the data you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Blob storage from which the activity should read the data.
Before you create a dataset, you must create a linked service to link your data store to the data factory. Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources. Think of it this way: the dataset represents the structure of the data within the linked data stores, and the linked service defines the connection to the data source. For example, an Azure Storage linked service links a storage account to the data factory. An Azure Blob dataset represents the blob container and the folder within that Azure storage account that contains the input blobs to be processed.
© Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
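
The reference chain described above (activity to dataset to linked service) can be modeled with a few plain Python dataclasses. This is a rough sketch, not the Data Factory authoring format: the class names, field names, and the example names such as `OnPremSql` and `RawSalesBlob` are all assumptions made for illustration, and the connection values are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LinkedService:
    name: str
    connection: str  # placeholder text, not a real connection string


@dataclass(frozen=True)
class Dataset:
    name: str
    linked_service: str  # name of the linked service that hosts the data
    location: str        # table, folder, or container within that store


@dataclass(frozen=True)
class CopyActivity:
    name: str
    source: str  # input dataset name
    sink: str    # output dataset name


def unresolved_references(services: List[LinkedService],
                          datasets: List[Dataset],
                          activities: List[CopyActivity]) -> List[str]:
    """List every reference that points at something that was never defined."""
    service_names = {s.name for s in services}
    dataset_names = {d.name for d in datasets}
    problems = []
    for d in datasets:
        if d.linked_service not in service_names:
            problems.append(f"dataset {d.name} -> linked service {d.linked_service}")
    for a in activities:
        for ref in (a.source, a.sink):
            if ref not in dataset_names:
                problems.append(f"activity {a.name} -> dataset {ref}")
    return problems


services = [LinkedService("OnPremSql", "<placeholder>"),
            LinkedService("BlobStore", "<placeholder>")]
datasets = [Dataset("SalesTable", "OnPremSql", "dbo.Sales"),
            Dataset("RawSalesBlob", "BlobStore", "raw/sales/")]
pipeline = [CopyActivity("CopySalesToBlob", "SalesTable", "RawSalesBlob"),
            CopyActivity("CopyToWarehouse", "CuratedSalesBlob", "WarehouseTable")]

problems = unresolved_references(services, datasets, pipeline)
print(problems)
```

The second copy activity deliberately refers to datasets that were never declared, which shows why a linked service has to exist before a dataset and a dataset before the activity that uses it.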

ADF: Cloud-First Data Integration Objectives
Consume hybrid disparate data: on-prem + cloud; grow the ADF ecosystem of structured, unstructured, and semi-structured data connectors.
Calculate and format data for analytics: transform, aggregate, join, normalize; separate data flow (transformation) from control flow (orchestration).
Address large-scale Big Data requirements: scale up or scale out data movement and transformation; support multiple processing engines.
Operationalize: support flexible scheduling and triggering mechanisms for a broad range of use cases; manage and monitor multiple pipelines (via Azure Monitor & OMS); support secure VNet environments.
Lift and shift SSIS to the cloud: execute SSIS packages in the ADF Integration Runtime.

ADF: Cloud-First Data Integration Scenarios
Lift and Shift to the Cloud: migrate on-prem DW to Azure; lift and shift existing on-prem SSIS packages to the cloud; no changes needed to migrate SSIS packages to the cloud service.
DW Modernization: modernize the DW architecture to reduce cost and scale to the needs of big data (volume, variety, etc.); flexible wall-clock and triggered event scheduling; incremental data load.
Build Data-Driven, Intelligent SaaS Applications: C#, Python, PowerShell, ARM support.
Big Data Analytics: customer profiling, product recommendations, sentiment analysis, churn analysis, customized offers, customer usage tracking, customized marketing; on-demand Spark cluster support.
Load your Data Lake: separate control flow to orchestrate complex patterns with branching, looping, and conditional processing.

Azure Data Factory Mapping Data Flows        

What are Mapping Data Flows?
Data Flow is a new feature of Azure Data Factory that allows you to build data transformations in a visual user interface.
Transform at scale, in the cloud.
Code-free pipelines do NOT require understanding of Spark / Scala / Python / Java.
Serverless scale-out transformation execution engine.
Resilient data transformation flows built for big data scenarios with unstructured data requirements.
Operationalized with Data Factory scheduling, control flow and monitoring.

Code-free Data Transformation At Scale
Does not require understanding of Spark, big data execution engines, clusters, Scala, Python, etc.
Focus on building business logic and data transformation: data cleansing, aggregation, data conversions, data prep, data exploration.
… not …

Modern Data Warehouse Pattern Today
[Diagram labels: Data loading: Azure Data Factory. Ingest storage: Azure Storage / Data Lake Store. Data processing: Azure Databricks. Serving storage: Azure SQL DW. Orchestration: Azure Data Factory. Sources and consumers: logs, files, and media (unstructured); business/custom apps (structured); databases; applications; dashboards.]
Steps shown: load flat files into the data lake on a schedule; extract and transform relational data; read data from files using DBFS; clean and join with stored data; load processed data into tables optimized for analytics; load to SQL DW.

Modern Data Warehouse Pattern with Mapping Data Flows
[Diagram labels: Data loading: Azure Data Factory. Ingest storage: Azure Storage / Data Lake Store. Data transformation: Data Flow (Azure Data Factory), Azure Databricks. Serving storage: Azure SQL DW. Scheduled and orchestrated by ADF. Sources and consumers: logs, files, and media (unstructured); business/custom apps (structured); databases; applications; dashboards.]
Steps shown: load files into the data lake on a schedule; extract and transform relational data; clean and join disparate data; load processed data into tables optimized for analytics.

Pipeline execution of a Data Flow Activity
Design code-free ETL workflows, all within ADF:
Copy data from on-prem, other clouds and Azure
Stage data for transformation
Build visual data transformations
Schedule triggers for your pipeline execution
Monitor processes and configure alerts

Mapping Data Flow common scenarios  

Slowly Changing Dimension Scenario
A common DW pattern to manage changing attributes of dimension members.
Graphically build a code-free SCD ETL pattern to load your data warehouse.
Connect directly to Azure SQL DB and Azure SQL DW.
Use Lookup, Surrogate Key, Derived Column and Select transforms.
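
As a rough illustration of what such a pattern does, here is a plain-Python sketch of one common variant, a Type 2 slowly changing dimension, in which a changed member gets a new row and the old row is expired. The slide does not say which SCD type is meant, so Type 2 is an assumption, and `DimRow`, `apply_scd2`, and the sample customers are hypothetical names, not Data Flow transforms.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class DimRow:
    surrogate_key: int
    business_key: str
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(dim: List[DimRow], incoming: Dict[str, str], load_date: date) -> None:
    """Expire changed rows and append new versions, modifying dim in place."""
    # Lookup step: index the current version of each dimension member.
    current = {r.business_key: r for r in dim if r.valid_to is None}
    # Surrogate key step: continue from the highest key already issued.
    next_key = max((r.surrogate_key for r in dim), default=0) + 1
    for business_key, city in incoming.items():
        existing = current.get(business_key)
        if existing is not None and existing.city == city:
            continue  # attribute unchanged, nothing to do
        if existing is not None:
            existing.valid_to = load_date  # derived column: close the old version
        dim.append(DimRow(next_key, business_key, city, load_date))
        next_key += 1


dim = [DimRow(1, "C001", "Austin", date(2019, 1, 1))]
apply_scd2(dim, {"C001": "Houston", "C002": "Dallas"}, date(2020, 2, 1))
for row in dim:
    print(row)
```

After the load, the Austin row is closed out, and customer C001 has a new current row alongside the brand-new member C002.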

Data De-Duplication
Use this pattern to eliminate duplicate rows from your data.
You pick a heuristic to use during duplicate matching.
You can tag rows and/or remove duplicate rows.
Use exact matching and/or fuzzy matching.
Available as a pipeline template (Dedupe Pipeline).
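
The "pick a heuristic, then tag or remove" idea can be sketched in plain Python. This is a minimal example of exact matching on a normalized key; the function names and the choice of a lower-cased, trimmed email as the matching heuristic are assumptions for illustration and are not taken from the pipeline template.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Row = Dict[str, str]


def tag_duplicates(rows: Iterable[Row],
                   key: Callable[[Row], Hashable]) -> List[Tuple[Row, bool]]:
    """Pair each row with True if an earlier row produced the same key."""
    seen = set()
    tagged = []
    for row in rows:
        k = key(row)
        tagged.append((row, k in seen))
        seen.add(k)
    return tagged


def normalized_email(row: Row) -> str:
    """Matching heuristic (assumed): same email after trimming and lower-casing."""
    return row["email"].strip().lower()


rows = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "A. Lee", "email": " Ann@Example.com "},
    {"name": "Bo Chen", "email": "bo@example.com"},
]

tagged = tag_duplicates(rows, normalized_email)          # tag rows
unique = [row for row, is_dup in tagged if not is_dup]   # or remove duplicates
print([row["name"] for row in unique])
```

Swapping in a different `key` function changes the heuristic without touching the tagging logic, which is the same separation the slide describes.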

Load Fact Table in DW Scenario
The classic ETL pattern is easy to build in ADF’s code-free Data Flow visual data transformation environment.
Add Aggregate transforms to produce calculations that you store in your analytical database schema.
Use the Join transform to combine data from multiple data sources and data streams inside your data flow.
Land your data in your Lake folders or directly in Azure SQL DW.
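
A small plain-Python sketch of the aggregate-then-join shape of a fact load follows. The sample sales rows, the `product_dim` mapping, and the `build_fact_rows` function are hypothetical and only illustrate the pattern; they are not Data Flow transforms.

```python
from collections import defaultdict
from typing import Dict, List

sales = [
    {"product_code": "P1", "date": "2020-02-01", "amount": 10.0},
    {"product_code": "P1", "date": "2020-02-01", "amount": 5.0},
    {"product_code": "P2", "date": "2020-02-01", "amount": 7.5},
]
# Dimension lookup: business key -> surrogate key (assumed to be loaded already).
product_dim = {"P1": 101, "P2": 102}


def build_fact_rows(sales: List[dict], product_dim: Dict[str, int]) -> List[dict]:
    """Aggregate sales per product and day, then join to the product dimension."""
    totals = defaultdict(float)
    for s in sales:
        totals[(s["product_code"], s["date"])] += s["amount"]  # aggregate step
    facts = []
    for (code, day), total in sorted(totals.items()):
        facts.append({
            "product_key": product_dim[code],  # join step; raises KeyError if unmatched
            "date": day,
            "total_amount": total,
        })
    return facts


facts = build_fact_rows(sales, product_dim)
print(facts)
```

In this sketch an unmatched product code raises an error instead of being dropped silently; a real load would need an explicit rule for such rows.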

Fuzzy Lookups
Sometimes when performing inline lookups, you don’t have exact matches when looking for references.
Fuzzy Lookups with Soundex help find matches based on phonetic algorithms.
Very useful in data lake scenarios where joins and lookups are against data that is not normalized or cleaned.
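
To make the phonetic-matching idea concrete, here is a plain-Python sketch of the classic American Soundex code and a lookup built on it. It illustrates the algorithm only and makes no claim about how Data Flows implement it; `soundex` and `fuzzy_lookup` are names chosen here, and the reference list is made up.

```python
from typing import List

# Letter groups of the classic American Soundex scheme.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or '' if name has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":
            prev = code  # vowels break a run of equal codes; h and w do not
    return (result + "000")[:4]


def fuzzy_lookup(value: str, reference: List[str]) -> List[str]:
    """Return reference entries that sound like value."""
    code = soundex(value)
    return [r for r in reference if soundex(r) == code]


reference = ["Robert", "Smith", "Jackson"]
print(soundex("Robert"), soundex("Rupert"))
print(fuzzy_lookup("Smyth", reference))
```

Soundex is coarse: it groups names by the first letter plus up to three consonant codes, so it catches spelling variants such as Smith and Smyth but can also produce false matches.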

Data Lake Data Science Scenario
ADF supports building visual data transformations against your data directly in Data Lake locations (i.e. Azure Blob Store, Azure Data Lake Store).
Built-in handling of schema drift for frequent changes in data lake file formats, columns, and data types.
Perform data exploration and data profiling across your data lake in ADF Data Flow with interactive debug data preview and quick actions.
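
One way to picture schema drift handling is a transformation that is written against column types, not a fixed column list, so new or changed columns flow through untouched. The sketch below is plain Python under that assumption; `trim_strings`, `observed_schema`, and the sample batch are hypothetical and do not show Data Flow's own mechanism.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def trim_strings(row: Row) -> Row:
    """Apply one rule to every string column, whatever the columns are named."""
    return {col: (val.strip() if isinstance(val, str) else val)
            for col, val in row.items()}


def observed_schema(rows: List[Row]) -> Dict[str, Set[str]]:
    """Profile a batch: which columns appeared, and with which value types."""
    schema: Dict[str, Set[str]] = {}
    for row in rows:
        for col, val in row.items():
            schema.setdefault(col, set()).add(type(val).__name__)
    return schema


batch = [
    {"id": 1, "name": " Ann "},
    {"id": "2", "name": "Bo", "region": " West "},  # new column; id arrives as text
]

cleaned = [trim_strings(r) for r in batch]
schema = observed_schema(cleaned)
print(cleaned)
print(schema)
```

The second record adds a `region` column and changes the type of `id`, yet the same rule still applies and the profile reports both changes.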

Resources
Tutorial Videos: http://aka.ms/dataflowvideos
Patterns: http://aka.ms/dataflowpatterns
Documentation: https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview
Expression Language: http://aka.ms/dataflowexpressions
Data Flow Performance guide: https://aka.ms/dfperf
Combined Links: https://aka.ms/dflinks


Own your career with interactive learning built by community and guided by data experts.
1. Attend an event
2. Join a community
3. Explore more: newsletters, recorded training, giving back
Formats: in-person, online.
Get involved. Get ahead. .org