Introduction to Azure Data Lake

Oskari Heikkinen

Sponsors

Machine Learning & Data Science Conference
9/8/2019 1:53 AM
Oskari Heikkinen
Director, Microsoft Azure at CGI
Microsoft P-TSP Cloud Analytics
© 2015 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Compute Storage

Azure Data Lake Background: Cosmos at Microsoft

Azure Data Lake Storage (Gen 1)
A hyper-scale repository for big data analytics workloads
Store any data in its native format
Hadoop file system (HDFS) for the cloud
Enterprise grade
No limits to scale
Optimized for analytic workload performance

Data Lake Storage (Gen 1): Basics
Unlimited storage: unlimited account size; individual files can be petabytes in size.
Optimized for analytics: built for running analytics systems that require massive throughput; optimized for parallel computation over petabytes of data.
High availability: data is replicated automatically, with three copies within a single region; 99.9% SLA.

Data Lake Storage (Gen 1): Data Security
Encryption: TLS for data in transit; transparent server-side encryption for data at rest, using service-managed keys or customer-managed keys in Azure Key Vault.
Authentication and authorization: Azure Active Directory; POSIX-style access control lists (ACLs) on folders and files.
Auditing: audit logs for all operations; audit logs can be analyzed with U-SQL.
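To make "POSIX-style ACLs" concrete, here is a minimal sketch of how a POSIX-style ACL check is usually evaluated: owner first, then named users, then groups, then everyone else, with the mask limiting named-user and group entries. The `Acl` class and `is_allowed` function are hypothetical names for illustration. This is a simplified model, not the service's implementation; super-users and permissions on parent folders are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Acl:
    """Hypothetical, simplified POSIX-style ACL for one file or folder."""
    owner: str
    owning_group: str
    owner_perms: str   # permission strings such as "rwx", "r-x", "---"
    group_perms: str
    other_perms: str
    mask: str = "rwx"  # upper bound for named-user and group entries
    named_users: Dict[str, str] = field(default_factory=dict)
    named_groups: Dict[str, str] = field(default_factory=dict)


def _allows(perms: str, wanted: str) -> bool:
    """True if every wanted permission letter is present in perms."""
    return all(p in perms for p in wanted)


def _masked(perms: str, mask: str) -> str:
    """Keep only the permission letters that the mask also grants."""
    return "".join(p for p in perms if p in mask)


def is_allowed(acl: Acl, user: str, groups: Set[str], wanted: str) -> bool:
    """Evaluate the ACL in POSIX order; the first matching class decides."""
    if user == acl.owner:
        return _allows(acl.owner_perms, wanted)
    if user in acl.named_users:
        return _allows(_masked(acl.named_users[user], acl.mask), wanted)
    group_entries = []
    if acl.owning_group in groups:
        group_entries.append(acl.group_perms)
    group_entries += [p for g, p in acl.named_groups.items() if g in groups]
    if group_entries:
        # Any matching group entry may grant access, subject to the mask.
        return any(_allows(_masked(p, acl.mask), wanted) for p in group_entries)
    return _allows(acl.other_perms, wanted)
```

Note that the mask can silently reduce what a named user is granted: an entry of "rwx" with a mask of "r-x" yields read and execute only.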

Data Lake Storage (Gen 1)
Files are split into extents.
Extents can be up to 2 GB in size.
For availability and reliability, extents are replicated (3 copies).
This enables parallelized reads.
(Diagram: a large file split into extents 1, 2, 3, 4.)
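The extent arithmetic on this slide can be sketched in a few lines. This is illustrative only: the function name `split_into_extents` is hypothetical, the 2 GB limit is read here as 2 GiB, and the service's real placement logic is not described in the slides.

```python
from typing import List, Tuple

# Assumption: the slide's "up to 2GB" extent size, interpreted as 2 GiB.
MAX_EXTENT_BYTES = 2 * 1024**3
REPLICAS = 3  # the slide states three copies of each extent


def split_into_extents(file_size: int,
                       max_extent: int = MAX_EXTENT_BYTES) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that together cover file_size bytes."""
    if file_size < 0 or max_extent <= 0:
        raise ValueError("file_size must be >= 0 and max_extent must be > 0")
    extents = []
    offset = 0
    while offset < file_size:
        length = min(max_extent, file_size - offset)
        extents.append((offset, length))
        offset += length
    return extents


if __name__ == "__main__":
    extents = split_into_extents(5 * 1024**3)  # a 5 GiB file
    print(f"{len(extents)} extents, {len(extents) * REPLICAS} stored copies")
```

Each (offset, length) pair can be read independently, which is what makes the parallel read possible.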

Large files provide parallelism opportunities
(Diagram: six extents, each processed by its own vertex.)
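As a rough illustration of one vertex per extent, the sketch below counts lines in a byte buffer by giving each extent-sized range to its own worker and summing the partial results. It uses a local thread pool and an in-memory buffer as stand-ins; the names `extent_ranges`, `count_newlines` and `parallel_line_count` are hypothetical, and nothing here calls the actual service.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def extent_ranges(size: int, extent_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) byte ranges, each at most extent_size long."""
    return [(start, min(start + extent_size, size))
            for start in range(0, size, extent_size)]


def count_newlines(data: bytes, byte_range: Tuple[int, int]) -> int:
    """The work of one 'vertex': count newline bytes inside its extent."""
    start, end = byte_range
    return data.count(b"\n", start, end)


def parallel_line_count(data: bytes, extent_size: int) -> int:
    """Process every extent concurrently and combine the partial counts."""
    ranges = extent_ranges(len(data), extent_size)
    with ThreadPoolExecutor(max_workers=max(1, len(ranges))) as pool:
        partials = pool.map(lambda r: count_newlines(data, r), ranges)
        return sum(partials)


if __name__ == "__main__":
    log = b"GET /index.html\nGET /about.html\n" * 1000
    print(parallel_line_count(log, extent_size=4096))
```

The larger the file, the more extents there are to hand out, which is why large files offer more parallelism than many small ones.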

Parallel writing
Front-end machines for a web service upload their log files simultaneously to Azure Data Lake.

Azure Data Lake Storage (Gen 1) Architecture

Key takeaway?

                 Data Lake Storage Gen1     Azure Blob Storage
Scenarios        Optimized for analytics    General-purpose bulk storage
Structure        Hierarchical file system   Flat namespace object store
Size limits      No*                        ~4.77 TB per file, TB per storage account
Geo-redundancy   LRS                        LRS, ZRS, GRS, RA-GRS
HDFS client      Yes                        Yes

                      Data Lake Storage Gen1              Azure Blob Storage
Authentication        Azure Active Directory              Access keys / SAS tokens
Authorization         POSIX-style ACLs                    Access keys / SAS tokens
Data encryption       Transparent server-side encryption  Storage Service Encryption
Connection protocols  HTTPS                               HTTP / HTTPS
Firewall              Yes                                 Yes

Data Lake Storage Gen 2

Data Lake Gen2: Combining the best of both?

                        Blob Storage                       Data Lake Gen1            Data Lake Gen2
Authentication          Access keys / SAS tokens           Azure AD                  Azure AD
Structure               Flat namespace                     Hierarchical file system  Both
Size limits             ~4.77 TB per file, TB per account  No*                       ~4.77 TB per file
Geo-redundancy          LRS, ZRS, GRS, RA-GRS              LRS                       LRS, ZRS, GRS, RA-GRS
Hot/cold storage tiers  Yes                                No                        Yes
Price*                  16.6 € / TB                        32.9 € / TB               16.6 € / TB
*Prices per month in West Europe for LRS on

Storage Best Practices
Design the folder hierarchy structure
Split into several services
Service-level limits
Gen2: disaster recovery
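One way to approach the folder hierarchy is to build every path from the same small set of fields. The layout below (zone, source system, dataset, then year/month/day) is an assumed example, not a layout prescribed by the slides, and `partition_path` is a hypothetical helper.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(zone: str, source: str, dataset: str, day: date) -> PurePosixPath:
    """Build /<zone>/<source>/<dataset>/<yyyy>/<mm>/<dd> for one day of data."""
    for part in (zone, source, dataset):
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return PurePosixPath("/", zone, source, dataset,
                         f"{day:%Y}", f"{day:%m}", f"{day:%d}")


if __name__ == "__main__":
    print(partition_path("raw", "webshop", "clickstream", date(2019, 9, 8)))
```

Keeping the date parts as separate folders lets a job read a single day, month or year by pointing at the matching folder.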

Services for processing Big Data

HDInsight

Azure HDInsight: Hadoop as a Service on Azure
Fully managed Hadoop and Spark for the cloud
100% open source Hortonworks Data Platform
Cluster up and running in 20 minutes
Supported by Microsoft with 99.9% SLA
Familiar BI tools for analysis
Open source notebooks for interactive data science
63% lower TCO than deploying Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”

History
Why do we have Big Data technologies today?

               MapReduce               RDBMS
Data volume    Petabyte scale          Gigabyte scale
Access mode    Batch                   Interactive, batch
Updates        Write once, read many   Write many, read many
Structure      Schema-on-read          Schema-on-write
Integrity      Low                     High
Scaling        Linear                  Nonlinear
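The schema-on-read row is easiest to see in code: raw records are stored as they arrive, and a schema is applied only when they are read. The sketch below is a minimal plain-Python illustration; the sample records, the `SCHEMA` mapping and `read_with_schema` are made up for this example.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator

# Raw data is stored untouched, including records that are malformed.
RAW_LINES = [
    '{"user": "a", "amount": "12.50", "country": "FI"}',
    '{"user": "b", "amount": "3"}',
    'not valid json',
]

# The schema lives with the reader, not with the storage.
SCHEMA: Dict[str, Callable[[Any], Any]] = {"user": str, "amount": float}


def read_with_schema(lines: Iterable[str],
                     schema: Dict[str, Callable[[Any], Any]]) -> Iterator[Dict[str, Any]]:
    """Parse each line and project it onto the schema, skipping bad records."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable at read time, but still kept in storage
        if not isinstance(record, dict):
            continue
        try:
            yield {name: cast(record[name]) for name, cast in schema.items()}
        except (KeyError, TypeError, ValueError):
            continue


if __name__ == "__main__":
    for row in read_with_schema(RAW_LINES, SCHEMA):
        print(row)
```

A schema-on-write system would instead reject the malformed line at load time, which is where the higher integrity in the RDBMS column comes from.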

Apache Hive: Enterprise Data Warehousing
2006: Hive incubated at Facebook
2010: Top-level Apache project
2012: ODBC/JDBC drivers released
2013: Hive introduces Tez, vectorization, ORC
2015: Hive introduces ACID
2016: In-memory through LLAP

Execution engines and LLAP

Azure Databricks

Azure Databricks: Spark as a Service on Azure
Azure Databricks is a first-party service on Azure. Unlike on other clouds, it is not an Azure Marketplace offering or a third-party hosted service.
Azure Databricks is integrated with other Azure services:
Azure Portal: the service can be launched directly from the Azure Portal.
Azure Storage services: directly access data in Azure Blob Storage and Azure Data Lake Store.
Azure Active Directory: user authentication, eliminating the need to maintain two separate sets of users in Databricks and Azure.
Azure SQL DW and Azure Cosmos DB: combine structured and unstructured data for analytics.
Apache Kafka for HDInsight: use Kafka as a streaming data source or sink.
Azure Billing: you get a single bill from Azure.
Power BI: rich data visualization.
There is no need to create a separate account with Databricks.

Spark Structured Streaming
Apache Spark
A unified, open source, parallel data processing framework for Big Data analytics.
Spark unifies: batch processing, interactive SQL, real-time processing, machine learning, deep learning, graph processing.
Cluster managers: YARN, Mesos, Standalone Scheduler.
Libraries: Spark MLlib (machine learning), Spark Structured Streaming (stream processing).

General Spark Cluster Architecture
Diagram components: Driver Program (SparkContext), Cluster Manager, Worker Nodes (Cache, Task), Data Sources (HDFS, SQL, NoSQL, …).
The driver runs the user's main function and executes the various parallel operations on the worker nodes.
The results of the operations are collected by the driver.
The worker nodes read and write data from and to data sources, including HDFS.
Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets).
Worker nodes and the driver node execute as VMs in public clouds (AWS, Azure).
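The driver and worker roles can be mimicked with a toy model in plain Python. This is not Spark and uses no Spark API: a thread pool stands in for the worker nodes, the calling code plays the driver, and caching and the cluster manager are left out. The names `partition` and `run_on_workers` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def partition(data: Sequence[T], num_partitions: int) -> List[Sequence[T]]:
    """Split data into at most num_partitions contiguous chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_on_workers(data: Sequence[T], task: Callable[[T], U],
                   num_workers: int = 4) -> List[U]:
    """'Driver' logic: ship one partition to each worker, then collect results."""
    parts = partition(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each worker applies the task to every element of its partition.
        per_partition = pool.map(lambda part: [task(x) for x in part], parts)
        # The driver gathers the partial results in partition order.
        return [result for part in per_partition for result in part]


if __name__ == "__main__":
    print(run_on_workers(list(range(10)), lambda x: x * x, num_workers=3))
```

In a real cluster the partitions live on different machines and the driver only receives the final results, which is why collecting a very large result set at the driver is something to avoid.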

Catalyst query optimizer

DEMO: HDInsight & Databricks

External Metastore

Call to Action
Read how these work: HDFS, YARN, Spark
Learning by doing: start playing around with the services

Thank you! 
