“… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing.”
– Gartner, “The State of Data Warehousing in 2012”

Data sources:
1. Increasing data volumes
2. Real-time data
3. Non-relational data: new data sources & types
4. Cloud-born data
[Diagram, built up over four slides: from ETL into an EDW to ingest-first processing]
Traditional path: ETL Tool (SSIS, etc.) extracts the original data, transforms it, and loads the transformed data into the EDW (SQL Server, Teradata, etc.)
Added path: Ingest (EL) the original data into scale-out storage & compute (HDFS, Blob Storage, etc.), then transform & load
Added input: Streaming data
Also shown: Data marts, data lake(s), BI tools, dashboards, apps
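The difference between the two paths in the diagram can be sketched in a few lines of Python. This is only an illustration: the record layout, the `transform` rule, and the in-memory lists standing in for the EDW and the scale-out store are all assumptions made for the example, not part of the deck.

```python
# Hypothetical raw records standing in for "original data" from a source system.
RAW_ROWS = [
    {"user_id": "2277", "state": " oregon "},
    {"user_id": "8853", "state": "California"},
]


def transform(row):
    """Clean one record: cast the id to int and normalize the state name."""
    return {"user_id": int(row["user_id"]), "state": row["state"].strip().title()}


def etl(source_rows, warehouse):
    """Classic path: transform inside the ETL tool, load only transformed rows."""
    warehouse.extend(transform(r) for r in source_rows)


def ingest(source_rows, lake):
    """EL step: land the original rows unchanged in scale-out storage."""
    lake.extend(dict(r) for r in source_rows)


def transform_and_load(lake, warehouse):
    """T step, run later where the data sits; the raw copy stays in the lake."""
    warehouse.extend(transform(r) for r in lake)


edw_classic, data_lake, edw_modern = [], [], []
etl(RAW_ROWS, edw_classic)
ingest(RAW_ROWS, data_lake)
transform_and_load(data_lake, edw_modern)
```

Both paths end with the same cleaned rows in the warehouse. The ingest-first path additionally keeps the untouched originals, so a different transformation can be run over them later.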
[Diagram, two slides: Data Factory concepts]
Data Hub (Storage & Compute): ingests from data sources (import from); data moves among hubs and on to data marts, etc.
Data Connectors: import from source to hub, import/export among hubs, export from hub to data store
Information production stages: Connect & Collect, Transform & Enrich, Publish
Information production services: Coordination & Scheduling, Monitoring & Mgmt, Data Lineage
Also shown: BI tools, data marts, data lake(s), dashboards, apps
Example Scenario: Customer Profiling (game usage analytics)
Log Files Snippet (10s of TBs per day in cloud storage):
2277, :26: ,111, , ,true,8,1,
, :26: ,111, , ,true,8,1,
, :22: ,111, , ,true,8,1,
2277, :43: ,111, , ,true,8,1,
, :11: ,111, , ,true,8,1,
, :37: ,111, , ,true,8,1,
2277, :12: ,111, , ,true,8,1,
…

User Table:
UserID   FirstName   LastName    State        …
2277     Pratik      Patel       Oregon
         Dave        Nettleton   Washington
8853     Mike        Flasko      California

New User Activity Per Week By Region:
Columns: profileid, day, state, duration, rank, weaponsused, interactedwith
11486/2/2013 Oregon
/2/2013 Missouri
/1/2013 Georgia
/2/2013 Oregon
/2/2013 California
/3/2013 Nebraska 219552
Data Factory Walkthrough
New-AzureDataFactory -Name "GameTelemetry" -Location "West-US"
New-AzureDataFactoryLinkedService -Name "MyHDInsightCluster" -DataFactory "GameTelemetry" -File HDIResource.json
New-AzureDataFactoryLinkedService -Name "MyStorageAccount" -DataFactory "GameTelemetry" -File BlobResource.json
[Diagram, built up over six slides: the New User Activity pipeline in Azure Data Factory]
Sources: On-premises SQL Server (New User View); Azure Blob Storage (1000's of log files)
Data Factory tables: New Users (view of the New User View), Game Usage (view of the log files), Cloud New Users, Geo Dictionary, Geo Coded Game Usage, New User Activity
Pipeline activities: Copy “NewUsers” to Blob Storage; Mask & Geo-Code; Join & Aggregate
Runs on: HDInsight
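The two transformation activities in the pipeline diagram (Mask & Geo-Code, then Join & Aggregate) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the Hive logic the deck runs on HDInsight: the event layout, the IP-prefix geo dictionary, and the interpretation of "mask" as dropping the network identifier are all hypothetical. The two user rows are the ones whose IDs are legible in the User Table.

```python
from collections import defaultdict

# Rows from the slide's user table (only those whose IDs are legible).
new_users = {2277: "Oregon", 8853: "California"}

# Hypothetical game-usage events: (user_id, ip_prefix, minutes_played).
game_usage = [
    (2277, "10.1", 30),
    (8853, "10.2", 45),
    (2277, "10.1", 15),
    (9999, "10.3", 60),  # not a new user, so the join drops it
]

# Hypothetical geo dictionary mapping an IP prefix to a region.
geo_dictionary = {"10.1": "Oregon", "10.2": "California", "10.3": "Nebraska"}


def mask_and_geocode(events, geo):
    """Replace the network identifier with a region so no raw address is kept."""
    return [(user_id, geo.get(prefix, "Unknown"), minutes)
            for user_id, prefix, minutes in events]


def join_and_aggregate(geo_events, users):
    """Keep events from new users only and total the minutes per region."""
    totals = defaultdict(int)
    for user_id, region, minutes in geo_events:
        if user_id in users:
            totals[region] += minutes
    return dict(totals)


geo_coded_usage = mask_and_geocode(game_usage, geo_dictionary)
new_user_activity = join_and_aggregate(geo_coded_usage, new_users)
```

Each function corresponds to one activity box, and each intermediate list corresponds to one table in the diagram, which is why the output of the first step is named after the Geo Coded Game Usage table.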
“GeoCoded Game Usage” Table:
Pipeline Definition:
# Deploy Table
New-AzureDataFactoryTable -DataFactory "GameTelemetry" -File NewUserActivityPerRegion.json
# Deploy Pipeline
New-AzureDataFactoryPipeline -DataFactory "GameTelemetry" -File NewUserTelemetryPipeline.json
# Start Pipeline
Set-AzureDataFactoryPipelineActivePeriod -Name "NewUserTelemetryPipeline" -DataFactory "GameTelemetry" -StartTime 10/29/ :00:00

"availability": { "frequency": "Day", "interval": 1 }

[Diagram: data slices and activity scheduling]
Activity (e.g. Hive): Hive Activity
Datasets: GameUsage (hourly), GeoCodeDictionary, Geo-Coded GameUsage; generic labels Dataset2 and Dataset3
Slice timelines shown: Hourly; Daily (Monday, Tuesday, Wednesday)
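The relationship between hourly and daily slices can be illustrated with a small Python sketch. It assumes that a daily output slice depends on the hourly input slices falling inside the same window; that is a reading of the diagram, not a statement of the service's exact scheduling rules. The function names and the dates are hypothetical, and only the `frequency` and `interval` values mirror the availability setting.

```python
from datetime import datetime, timedelta


def slices(start, end, frequency, interval=1):
    """Return (slice_start, slice_end) windows covering [start, end)."""
    step = {
        "Hour": timedelta(hours=interval),
        "Day": timedelta(days=interval),
    }[frequency]
    windows = []
    cursor = start
    while cursor < end:
        windows.append((cursor, min(cursor + step, end)))
        cursor += step
    return windows


def input_slices_for(output_slice, input_frequency, input_interval=1):
    """List the input slices that fall inside one output slice's window."""
    out_start, out_end = output_slice
    return slices(out_start, out_end, input_frequency, input_interval)


# Arbitrary example dates: three daily slices starting on a Monday.
start = datetime(2024, 1, 1)
daily = slices(start, start + timedelta(days=3), "Day")
hourly_for_monday = input_slices_for(daily[0], "Hour")
```

Under this assumption, Monday's daily slice of Geo-Coded GameUsage would wait on 24 hourly GameUsage slices, and a failed hour would only require rerunning the daily slice that contains it.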
Is my data successfully getting produced? Is it produced on time? Am I alerted quickly of failures? What about troubleshooting information? Are there any policy warnings or errors?
Easily move data to my existing data marts for consumption by my existing BI tools.
Targets shown: Azure DB, SQL Server on premises
ADF Pricing Per Month

Automation & Management: Automation/Coordination Layer (Coordination, Scheduling, Management)

                  Cloud                                On Premises
                  0-100 activities   100+ activities   0-100 activities   100+ activities
Low Frequency     $0.60              $0.48             $1.50              $1.20
High Frequency    $1.00              $0.80             $2.50              $

Data Transformation & Movement: Execution Layer (Data Storage & Processing)
Resources Used to Execute Activities in a Pipeline: HDInsight (hrs), Compute/VM (hrs), Data Transfer (GB)

Note: public preview = 50% discount on the rates shown above
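A rough worked example of the coordination-layer charge, using only the cloud rates from the slide. Two assumptions are made here that the slide does not state: that each rate is per activity per month, and that the 100+ rate applies only to activities beyond the first 100. Execution-layer resources (HDInsight, compute, data transfer) are not included, and the function name is hypothetical.

```python
# Cloud rates from the slide, in cents per activity (assumed: per month).
CLOUD_RATES_CENTS = {
    "low": {"first_100": 60, "over_100": 48},
    "high": {"first_100": 100, "over_100": 80},
}


def monthly_cost_cents(activities, frequency, preview_discount=True):
    """Estimate the monthly coordination charge for cloud activities, in cents.

    Assumes a graduated tier: the first 100 activities are billed at the
    0-100 rate and any further activities at the 100+ rate.
    """
    rates = CLOUD_RATES_CENTS[frequency]
    first = min(activities, 100)
    rest = max(activities - 100, 0)
    total = first * rates["first_100"] + rest * rates["over_100"]
    # Public preview: 50% discount on the listed rates.
    return total // 2 if preview_discount else total
```

For instance, under these assumptions 120 low-frequency cloud activities would come to 100 × $0.60 + 20 × $0.48 = $69.60 at list price, or $34.80 with the preview discount. Amounts are kept in integer cents to avoid floating-point rounding.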
Coordination: Rich scheduling, complex dependencies, incremental rerun
Authoring: JSON & PowerShell/C#
Management: Lineage, data production policies (late data, rerun, latency, etc.)
Hub: Azure Hub (HDInsight + Blob storage)
Activities: Hive, Pig, C#
Data Connectors: Blobs, Tables, Azure DB, On Prem SQL Server, MDS [internal]