What is Azure Data Lake? How to position it in the Big Data business?
Umit Sunar Cloud Solution Architect Microsoft – MEA HQ @umitsunar
SQLSat Kyiv Team Olena Smoliak Oksana Borysenko Vitaliy Popovych
Yevhen Nedashkivskyi Mykola Pobyivovk
Umit Sunar – @umitsunar
Umit Sunar has worked in the IT industry since 2000 and is currently a Cloud Solution Architect (Azure) at Microsoft MEA HQ. Based in Dubai, he works with enterprises, ISVs and startups on Windows Azure and cloud computing across the Middle East and Africa region. He has worked with cloud computing since 2007, more than 8 years now. Over his professional career he has served as an architect, evangelist and trusted advisor on a wide variety of projects, ranging from scalable web apps, IoT, IT security, ERP and CRM to real-time transactional systems.
Data Lake vs Data Warehouse
Some of us have been hearing more about the data lake, especially over the last year. Some say the data lake is just a reincarnation of the data warehouse, in the spirit of "been there, done that." Others focus on how much better this "shiny, new" data lake is, while still others stand on the shoreline screaming, "Don't go in! It's not a lake, it's a swamp!"
Data Lake vs Data Warehouse
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
Two Approaches to Information Management for Analytics: Top-Down + Bottoms-Up
[Diagram] Two ladders climb from INFORMATION to VALUE as DIFFICULTY increases:
- Top-down (deductive): Theory → Hypothesis → Observation → Confirmation
- Bottom-up (inductive): Observation → Pattern → Hypothesis → Theory
- Analytics maturity: Descriptive (What happened?) → Diagnostic (Why did it happen?) → Predictive (What will happen?) → Prescriptive (How can we make it happen?), with OPTIMIZATION at the top of the ladder
Data Warehousing Uses A Top-Down Approach
[Diagram] Data sources (OLTP, ERP, CRM, LOB) flow through ETL into the data warehouse, which feeds BI and analytics (dashboards, reporting). The top-down sequence: understand corporate strategy → gather business and technical requirements → design the data warehouse (dimension modelling, ETL design, physical design) → set up infrastructure → implement (ETL development, reporting and analytics development) → install and tune.
The “data lake” Uses A Bottoms-Up Approach
- Ingest all data regardless of requirements
- Store all data in native format without schema definition
- Do analysis using analytic engines like Hadoop
[Diagram] Devices, sensors, social, web, video, clickstream and relational sources plus LOB applications feed batch queries, interactive queries, real-time analytics, machine learning and the data warehouse.
Data Lake + Data Warehouse Better Together
[Diagram] Data sources (OLTP, ERP, CRM, LOB) flow through ETL into the data warehouse, which feeds BI and analytics (dashboards, reporting) and answers: What happened? What is happening? Why did it happen? What are key relationships? The data lake, fed by LOB applications, devices, social, relational, video, web, sensor and clickstream data, extends this to: What will happen? What if? How risky is it? What should happen? What is the best option? How can I optimize?
4/28/2017 Azure Data Lake © 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Azure Data Lake consists of an analytics layer and a storage layer. Analytics comes in two form factors: HDInsight ("managed clusters") and the Azure Data Lake Analytics service; both run over Azure Data Lake Storage.
Microsoft Azure Data Lake Azure Services for big data analytics
[Diagram] HDInsight, the Analytics service (U-SQL) and partner engines run on YARN over the Store (HDFS-compatible), ingesting clickstream, sensor, video, social, web, device, relational and application data.
- Integrated analytics and storage
- Fully managed
- Easy to use: "dial for scale"
- Proven at scale
- Analyze data of any size, shape or speed
- Open-standards based
Azure Data Lake Analytics Azure Services for big data analytics
- Distributed, parallel analytics framework
- U-SQL (based on C# and SQL)
- Dial for scale: instant scale on demand
- Hides infrastructure complexity
- Visual Studio integration
- Reduced learning curve
Azure Data Lake: Store
- Distributed, parallel file system in the cloud
- Performance-tuned and optimized for analytics
- No fixed size limits; stores all data types
- Highly available, with locally and geo-redundant storage
- WebHDFS REST API, supported by leading Hadoop distros
- Role-based security
- Low-latency and high-throughput workloads
Azure Data Lake as part of Cortana Analytics Suite
[Diagram] The suite spans DATA → INTELLIGENCE → ACTION:
- Information management: Azure Data Factory, Data Catalog, Event Hub
- Big data stores: Azure SQL Data Warehouse, Azure Data Lake Store
- Machine learning and analytics: Azure Machine Learning, Azure Stream Analytics, Azure HDInsight (Hadoop), Azure Data Lake Analytics
- Perceptual intelligence: face, vision, speech, text
- Personal digital assistant: Cortana
- Dashboards and visualizations: Power BI
- Inputs: sensors and devices; consumers: people, business apps, custom apps, automated systems
- Business scenarios: recommendations, customer churn, forecasting, etc.
Why data lakes?
Traditional business analytics process
1. Start with end-user requirements to identify the desired reports and analysis
2. Define the corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract the required data (curation) and transform it to the target schema ("schema-on-write")
5. Create reports; analyze the data
[Diagram] New requirements feed back into the cycle: identify data sources → identify data schema and queries → create the ETL pipeline (dedicated ETL tools, e.g. SSIS) → create reports → do analytics. Relational and LOB application data flows through the defined schema into queries and results. All data not immediately required is discarded or archived.
New big data thinking: All data has value
- All data has potential value ("data hoarding")
- No defined schema: data is stored in its native format
- Schema is imposed and transformations are done at query time (schema-on-read); apps and users interpret the data as they see fit
- Iterate: gather data from all sources → store indefinitely → analyze → see results
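The schema-on-read idea can be sketched in a few lines of Python. The log format and field names below are made up for illustration: raw lines are stored untouched, and a schema is imposed only when a consumer reads them.

```python
# Raw records are stored as-is; no schema is enforced at write time.
RAW_LOG = [
    "u1,en-us,120",
    "u2,en-gb,300",
]

def read_with_schema(lines):
    """Impose a (user_id, region, duration) schema at read/query time."""
    for line in lines:
        user_id, region, duration = line.split(",")
        yield {"user_id": user_id, "region": region, "duration": int(duration)}

rows = list(read_with_schema(RAW_LOG))
# Another consumer could read the very same lines with a different schema.
```

Contrast with schema-on-write, where the `int(duration)` conversion and the field layout would have been fixed by the ETL pipeline before the data was stored.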
Data Lake Store: Technical Requirements
- Secure: must be highly secure to prevent unauthorized access (especially as all data is in one place)
- Scalable: must be highly scalable; when storing all data indefinitely, data volumes can quickly add up
- Reliable: must be highly available and reliable (no permanent loss of data)
- Throughput: must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark
- Low latency: must have low latency for high-frequency operations
- Details: must be able to store data with all details; aggregation may lead to loss of details
- Native format: must permit data to be stored in its native format to track lineage and for data provenance
- All sources: must be able to ingest data from a variety of sources: LOB/ERP, logs, devices, social networks, etc.
- Multiple analytic frameworks: must support batch, real-time, streaming, ML, etc.; no one analytic framework can work for all data and all types of analysis
Azure Data Lake Analytics Service
- A new distributed analytics service, built on Apache YARN
- Scales dynamically with the turn of a dial; pay by the query
- Supports Azure AD for access control, roles, and integration with on-premises identity systems
- Built with U-SQL to unify the benefits of SQL with the power of C#
- Processes data across Azure
YARN?
Apache Hadoop YARN joins Hadoop Common (core libraries), Hadoop HDFS (storage) and Hadoop MapReduce (the MapReduce implementation) as a sub-project of Apache Hadoop, which is itself a Top-Level Project of the Apache Software Foundation. Until this milestone YARN was part of the Hadoop MapReduce project, and it is now poised to stand on its own as a sub-project of Hadoop. In a nutshell, Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.
ADL Analytics features:
- Developing big data apps
- Works across cloud data
- Simplified management and administration
Developing big data apps
- Author, debug and optimize big data apps in Visual Studio
- Multiple languages: U-SQL, Hive and Pig
- Seamless .NET integration
Work across all cloud data
Azure Data Lake Analytics can query:
- Azure Data Lake Store
- Azure Storage Blobs
- SQL DB in an Azure VM
- Azure SQL DW
- Azure SQL DB
Analytics: Two form factors
- ADLA (analytics service): a U-SQL/Hive/Pig job is submitted to an ADLA account, which runs it in many containers on a YARN layer; input and output live in Storage Blobs or ADLS.
- HDInsight (managed Hadoop clusters): a Hive/Pig/etc. job runs on a provisioned cluster (nodes n1..n4).
ADLA complements HDInsight: they target the same scenarios, tools, and customers.
- HDInsight: for developers familiar with open source (Java, Eclipse, Hive, etc.); clusters offer customization, control, and flexibility in a managed Hadoop cluster.
- ADLA: enables customers to leverage existing experience with C#, SQL and PowerShell; offers convenience, efficiency, automatic scale, and management in a "job service" form factor.
Azure Data Lake: U-SQL
What is U-SQL?
- A hyper-scalable, highly extensible language for preparing, transforming and analyzing all data
- Allows users to focus on the what, not the how, of business problems
- Built on familiar languages (SQL and C#) and supported by a fully integrated development environment
- Built for data developers and scientists
The Origins of U-SQL
A next-generation large-scale data-processing language that combines:
- the declarative, optimizable and parallelizable nature of SQL
- the extensibility, expressiveness and familiarity of C#
U-SQL draws on T-SQL, Hive and SCOPE, aiming to be high-performance, scalable, affordable, easy to program and secure.
Usage scenarios (the same programming experience in batch or interactive):
- Schematizing unstructured data (Load-Extract-Transform-Store) for analysis
- Cooking data for other users (LETS & Share), as unstructured or as structured data
- Large-scale custom processing with custom code
- Augmenting big data with high-value data from where it lives
U-SQL language philosophy
- Declarative query and transformation language: uses SQL's SELECT FROM WHERE with GROUP BY/aggregation, joins and SQL analytics functions; optimizable and scalable.
- Operates on unstructured and structured data: schema-on-read over files; relational metadata objects (e.g. database, table).
- Extensible from the ground up: the type system is based on C# and the expression language is C#; user-defined functions (U-SQL and C#), user-defined types (U-SQL/C#, future), user-defined aggregators (C#), user-defined operators (UDOs, C#).
- U-SQL provides the parallelization and scale-out framework for user code: EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER.
- Expression-flow programming style: easy-to-use functional lambda composition; composable, globally optimizable.
- Federated query across distributed data sources (soon).

Example (reconstructed from the slide; the second EXTRACT's extractor, the FROM/JOIN clauses and the OUTPUT source were lost and are restored from context):

    REFERENCE MyDB.MyAssembly;

    CREATE TABLE T(cid int, first_order DateTime, last_order DateTime,
                   order_count int, order_amount float);

    @o = EXTRACT oid int, cid int, odate DateTime, amount float
         FROM "/input/orders.txt"
         USING Extractors.Csv();

    @c = EXTRACT cid int, name string, city string
         FROM "/input/customers.txt"
         USING Extractors.Csv();

    @j = SELECT c.cid, MIN(o.odate) AS firstorder, MAX(o.odate) AS lastorder,
                COUNT(o.oid) AS ordercnt, SUM(o.amount) AS totalamount
         FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
         WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10
         GROUP BY c.cid;

    OUTPUT @j TO "/output/result.txt" USING new MyData.Write();

    INSERT INTO T SELECT * FROM @j;
U-SQL overview
E-commerce scenario: Before
[Diagram] A web server farm produces server logs in a custom file format; a custom parser turns them into CSV. Transactional database servers hold customer purchase records, also exported to CSV. The two are combined, custom aggregations run over the combined data, and reports are delivered to users.
E-commerce scenario: Sample web log data
Fields (angle brackets mark the field separators): encrypted user ID, start time, end time, region, a semicolon-separated list of sites visited, and a comma-separated list of page IDs visited. Sample rows (some user IDs and parts of the timestamps were lost in transcription):
<A23XŞ%28J8> <2/15/ :53:16 AM> <2/15/ :58:16 AM> <en-us> <…> <6,8,22,33,34,66,37>
<B4332&8*ŞR> <2/15/ :53:16 AM> <2/15/ :58:32 AM> <EN-US> <en.wikipedia.org/wiki/Weight_loss;webmd.com/diet;exercise.about.com> <82,5,6,7,34,56,56,78>
<…> <2/15/ :54:16 AM> <2/15/ :56:17 AM> <en-gb> <microsoft.com;Wikipedia.org/wiki/Microsoft;xbox.com;msn.com> <87,92,45,5,33,45,4,28>
<OSD934#*HH> <2/15/ :54:16 AM> <2/15/ :56:27 AM> <en-gb> <dominos.com;Wikipedia.org/wiki/Domino’s_Pizza;facebook.com/dominos> <1,3,5,81,18,35,3,5,56>
<…> <2/15/ :55:16 AM> <2/15/ :58:36 AM> <en-us> <skiresorts.com;ski-europe.com> <2,45,56,4,6,9,65,98,24>
<OPO*&BSD%S> <2/15/ :56:16 AM> <2/15/ :59:45 AM> <en-fr> <running.about.com;ehow.com;go.com;nike.com;nfl.com> <1,8,72,34,89,34,27,48,67>
E-commerce scenario: Sample transactional data
Columns: user ID (unique, not encrypted), time, product IDs (a comma-separated list of products in the order), total amount ($). Sample rows (user IDs and parts of the timestamps were lost in transcription):
2/15/ :53:16 AM | SR27821, CO98241, HG4214 | 214.50
… | SRT242421, VFG3243, TR3253, BET353, OPB236, FE4365, KL5634, HI4634, MI4634 | 4,213.78
2/15/ :54:16 AM | A3256 | 58.67
… | A8427 | 44.42
2/15/ :55:16 AM | B242V421, GH324342, YT325352, RT35325, RE235235 | 1,241.50
2/15/ :56:16 AM | LW04682, MJ54655 | 305.75
E-commerce scenario: Challenges before
- Needed a future-proof storage solution to hold many PBs
- Open-source big data analytics tools have a steep learning curve
- Even with scale-out, reporting time increases as data volume increases
- Home-grown scale-out frameworks are difficult to develop and maintain
[Chart] Query time grows with data volume.
E-Commerce scenario: After
[Diagram] Server logs (custom file format) from the web server farm now land in Azure Data Lake, and customer purchase records from the transactional database servers land in Azure SQL DB. U-SQL analytic apps running in Azure Data Lake Analytics replace the custom parsing and aggregation code and produce the reports for users.
Anatomy of a U-SQL query
Query 1: return the top 10 log records by Duration (End time minus Start time), sorted in descending order of Duration. (The FROM and OUTPUT clauses, lost on the slide, are restored from context.)

    REFERENCE ASSEMBLY WebLogExtASM;

    @rs = EXTRACT UserID string, Start DateTime, End DateTime,
                  Region string, SitesVisited string, PagesVisited string
          FROM "swebhdfs://Logs/WebLogRecords.txt"
          USING WebLogExtractor();

    @result = SELECT UserID, (End.Subtract(Start)).TotalSeconds AS Duration
              FROM @rs
              ORDER BY Duration DESC
              FETCH 10;

    OUTPUT @result
    TO "swebhdfs://Logs/Results/top10.txt"
    USING Outputters.Tsv();

Notes from the slide:
- U-SQL types are the same as C# types.
- The structure (schema) is first imposed when the data is extracted/read from the file (schema-on-read).
- A rowset (such as @rs) is conceptually like an intermediate table; it is how U-SQL passes data between statements.
- WebLogExtractor() is a custom function that reads the input file in ADL; (End.Subtract(Start)).TotalSeconds is a C# expression.
- Outputters.Tsv() is a built-in outputter that writes the result to a file in ADL in TSV format.
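Query 1's logic, compute a per-record duration, sort descending, keep the top 10, can be mirrored in plain Python (the sample records below are made up for illustration):

```python
from datetime import datetime

# Hypothetical parsed log records: (user_id, start, end).
records = [
    ("u1", datetime(2017, 4, 28, 9, 0, 0), datetime(2017, 4, 28, 9, 5, 0)),
    ("u2", datetime(2017, 4, 28, 9, 0, 0), datetime(2017, 4, 28, 9, 1, 30)),
    ("u3", datetime(2017, 4, 28, 9, 0, 0), datetime(2017, 4, 28, 9, 10, 0)),
]

# Like (End.Subtract(Start)).TotalSeconds in the U-SQL query.
durations = [(uid, (end - start).total_seconds()) for uid, start, end in records]

# Like ORDER BY Duration DESC FETCH 10.
top10 = sorted(durations, key=lambda r: r[1], reverse=True)[:10]
```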
U-SQL data types

Category | Types
Numeric  | byte, byte?; sbyte, sbyte?; int, int?; uint, uint?; long, long?; ulong, ulong?; short, short?; ushort, ushort?; float, float?; double, double?; decimal, decimal?
Text     | char, char?; string
Complex  | MAP<K,V>; ARRAY<T>
Temporal | DateTime, DateTime?
Other    | bool, bool?; Guid, Guid?; byte[]

Note: nullable types have to be declared with a question mark '?'.
U-SQL Complex Type: Array
Query 2: select the list of users who have visited more than 10 pages, using the ARRAY type to hold the list of pages visited. (@rs is the rowset extracted in Query 1; the FROM clauses, lost on the slide, are restored from context, and the slide does not show the OUTPUT path.)

    @rs1 = SELECT UserId,
                  new ARRAY<string>(PagesVisited.Split(new [] { ';' })) AS VisitedPagesArray
           FROM @rs;

    @rs2 = SELECT UserId AS Users, VisitedPagesArray.Count AS VisitedPages
           FROM @rs1
           WHERE VisitedPagesArray.Count > 10;

    OUTPUT @rs2
    USING Outputters.Tsv();

Desired output (Users, VisitedPages): A$A892 12; HG54#A 29; JADI899 45; YCPB(%U 30; HADS46$ 18; rows for MVDRY79% and TYUSPS67 lost their counts in transcription.
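The same split-and-filter logic, sketched in Python with made-up data:

```python
# Each user maps to a semicolon-separated list of visited pages.
visits = {
    "A$A892": "1;2;3;4;5;6;7;8;9;10;11;12",  # 12 pages
    "HG54#A": "6;8;22",                      # 3 pages
}

# Like new ARRAY<string>(PagesVisited.Split(...)) followed by the
# VisitedPagesArray.Count > 10 filter.
result = {
    user: len(pages.split(";"))
    for user, pages in visits.items()
    if len(pages.split(";")) > 10
}
```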
Windowing functions: 1
Query 5: list user IDs and the total duration of time spent on the website by all users, summing Duration over the window of all rows. (The FROM clause is restored from context.)

    @result = SELECT UserID, SUM(Duration) OVER() AS TotalDuration
              FROM @irs;

Input @irs (UserId, Region, Duration; one user ID and the merged region cells are reconstructed from the later by-region slide): A$A892 en-us 10500; HG54#A en-us 22270; … en-us 38790; JADI899 en-gb 18780; YCPB(%U en-gb 17000; BHPY687 en-gb 16700; BGFSWQ en-bs 57750; BSD805 en-fr 15675; BSDYTH7 en-fr 10250.

Output @result (UserId, TotalDuration): every user gets the same grand total, 207715.
Aggregation functions
Query 3: count users per region and compute the AVG, MAX, MIN and total duration. (@rs is the rowset from Query 1; the FROM clauses are restored from context.)

    @tmp1 = SELECT Region, (End.Subtract(Start)).TotalSeconds AS Duration
            FROM @rs;

    @rs1 = SELECT COUNT(*) AS NumUsers, Region,
                  SUM(Duration) AS TotalDuration, AVG(Duration) AS AvgDuration,
                  MAX(Duration) AS MaxDuration, MIN(Duration) AS MinDuration
           FROM @tmp1
           GROUP BY Region;

Built-in aggregation functions: AVG, ARRAY_AGG, COUNT, FIRST, LAST, MAP_AGG, MAX, MIN, STDEV, SUM, VAR. These can be extended with custom aggregation functions.

Sample output (NumUsers, Region, TotalDuration, AvgDuration, MaxDuration, MinDuration; many cells lost in transcription):
"en-gb": TotalDuration 688, AvgDuration 344, MaxDuration 614, MinDuration 74
"en-us": NumUsers 16, TotalDuration 8291, AvgDuration 518, MaxDuration 1270, MinDuration 30
Rows for "en-ca", "en-ch", "en-fr", "en-gr" and "en-mx" survive only partially.
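The GROUP BY aggregation can be mirrored in Python. The sample rows below are made up, with en-gb durations chosen so the aggregates agree with the slide's en-gb output row:

```python
from collections import defaultdict

# (Region, Duration) rows.
rows = [("en-gb", 614), ("en-gb", 74), ("en-us", 1270), ("en-us", 30)]

groups = defaultdict(list)
for region, duration in rows:
    groups[region].append(duration)

# One result row per region, like GROUP BY Region.
stats = {
    region: {
        "NumUsers": len(ds),
        "TotalDuration": sum(ds),
        "AvgDuration": sum(ds) // len(ds),
        "MaxDuration": max(ds),
        "MinDuration": min(ds),
    }
    for region, ds in groups.items()
}
```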
Improve performance with TABLEs
Query 4: improve the performance of Query 1 by running it against a table. (@rs is the rowset extracted from WebLogRecords.txt in Query 1.)

    CREATE TABLE LogRecordsTable(UserId int, Start DateTime, End DateTime, Region string,
        INDEX idx CLUSTERED (Region ASC) PARTITIONED BY HASH (Region));

    // Populate the table with only the required fields
    INSERT INTO LogRecordsTable
    SELECT UserId, Start, End, Region
    FROM @rs;

    // Run the query directly against the table
    @result = SELECT UserId, (End.Subtract(Start)).TotalSeconds AS Duration
              FROM LogRecordsTable
              ORDER BY Duration DESC
              FETCH 10;

    OUTPUT @result
    TO "swebhdfs://Logs/Results/Top10.Tsv"
    USING Outputters.Tsv();
Windowing functions: 2
Query 6: list user IDs, region and the total duration of time spent on the website by region, summing Duration over a window partitioned by Region. (The FROM clause is restored from context.)

    @total2 = SELECT UserId, Region,
                     SUM(Duration) OVER(PARTITION BY Region) AS RegionTotal
              FROM @irs;

Input @irs is as in Query 5. Output @total2 (UserId, Region, RegionTotal): the en-us rows get 71560, the en-gb rows get 52480, the en-bs row gets 57750, and the en-fr rows get 25925.
Windowing functions: aggregations
Query 7: list the count of users by region. (The FROM clause is restored from context.)

    @result = SELECT UserId, Region,
                     COUNT(*) OVER(PARTITION BY Region) AS CountByRegion
              FROM @irs;

Input @irs is as in Query 5. Output @result (UserId, Region, CountByRegion): each en-us row gets 3, each en-gb row gets 3, the en-bs row gets 1, and each en-fr row gets 2.
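COUNT(*) OVER(PARTITION BY Region) is likewise a per-group count attached to each row; a Python sketch with the user/region pairs from the slide ("U3" stands in for a lost user ID):

```python
from collections import Counter

# (UserId, Region) rows.
rows = [
    ("A$A892", "en-us"), ("HG54#A", "en-us"), ("U3", "en-us"),
    ("JADI899", "en-gb"), ("YCPB(%U", "en-gb"), ("BHPY687", "en-gb"),
    ("BGFSWQ", "en-bs"),
    ("BSD805", "en-fr"), ("BSDYTH7", "en-fr"),
]

# COUNT(*) OVER(PARTITION BY Region): count rows per region...
counts = Counter(region for _, region in rows)

# ...and attach the count to every row, without collapsing the rowset.
with_counts = [(uid, region, counts[region]) for uid, region in rows]
```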
Windowing functions: ranking
Query 8: find the top 2 users with the longest duration in each region. (The slide's one-statement form with "PARTITION BY Vertical" and a GROUP BY/HAVING filter is not valid for window functions; the two-step form below expresses the intent, ranking by Duration descending and keeping ranks 1 and 2.)

    @ranked = SELECT UserId, Region,
                     ROW_NUMBER() OVER(PARTITION BY Region ORDER BY Duration DESC) AS Rank
              FROM @irs;

    @result = SELECT UserId, Region, Rank
              FROM @ranked
              WHERE Rank <= 2;

Input @irs is as in Query 5. Output @result (UserId, Region, Rank): en-us: … 1, HG54#A 2; en-gb: JADI899 1, YCPB(%U 2; en-bs: BGFSWQ 1; en-fr: BSD805 1, BSDYTH7 2.
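The ranking pattern, number rows within each partition and keep the first two, sketched in Python over the slide's durations ("U3" stands in for the lost en-us user ID):

```python
from collections import defaultdict

# (UserId, Region, Duration) rows from the slide (en-us and en-fr shown).
rows = [
    ("A$A892", "en-us", 10500), ("HG54#A", "en-us", 22270), ("U3", "en-us", 38790),
    ("BSD805", "en-fr", 15675), ("BSDYTH7", "en-fr", 10250),
]

by_region = defaultdict(list)
for uid, region, duration in rows:
    by_region[region].append((uid, duration))

# ROW_NUMBER() OVER(PARTITION BY Region ORDER BY Duration DESC),
# then WHERE Rank <= 2: the two longest sessions per region.
top2 = {
    region: [uid for uid, _ in sorted(users, key=lambda u: u[1], reverse=True)[:2]]
    for region, users in by_region.items()
}
```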