Databricks: the new kid on the block


1 Databricks: the new kid on the block
Antonio Abalos Castillo

2 A big thanks to all of our sponsors!

3 … the new kid on the block
[informal] Someone who is new in a place or organization and has many things to learn about it.
Well, actually we are the ones who really need to learn about it!

4 Ok, this is about BI and data science…
…and failure rates for analytics, BI, and big data projects = 85%!!

5 Who is already “in the block”?
AZURE DATA FACTORY, AZURE IMPORT EXPORT SERVICE, AZURE SQL DB, AZURE COSMOS DB, AZURE SQL DATA WAREHOUSE, AZURE ANALYSIS SERVICES, POWER BI, AZURE STORAGE BLOBS, AZURE DATA LAKE STORE, AZURE ML, ML SERVER, AZURE DATABRICKS, AZURE DATA LAKE ANALYTICS, AZURE HDINSIGHT, AZURE DATABRICKS, AZURE IOT HUB, AZURE EVENT HUBS, KAFKA ON AZURE HDINSIGHT, AZURE SEARCH, AZURE DATA CATALOG, AZURE STREAM ANALYTICS, HDINSIGHT, DATABRICKS, COGNITIVE SERVICES, BOT SERVICE, AZURE ACTIVE DIRECTORY, AZURE NETWORK SECURITY GROUPS, AZURE KEY MANAGEMENT SERVICE, AZURE EXPRESSROUTE, OPERATIONS MANAGEMENT SUITE, AZURE FUNCTIONS, VISUAL STUDIO

6 More precisely, on big data, HDInsight
Includes Jupyter and Zeppelin notebooks
Remote API for job management
Integrated with Blob storage, Event Hubs for streaming and Power BI for analytics
Quick to deploy and scale

7 HDInsight, other considerations
Provisioning (template: 101-hdinsight-spark-linux): spark-jupyter-spark-sql
Clusters have to be created (20 minutes) and deleted after use
Admins have to decide what to do with the disks and files
Data Factory can be used to automate the process (on-demand)

8 Azure Machine Learning Studio
Serverless
Web based
Active Directory integrated
Notebooks
Limited regions (West Europe)

9 Azure Machine Learning Studio

10 Azure Machine Learning Workbench
(aka Machine Learning Services) (In preview as of August 2018)
Desktop application
Python based and Git compatible
Built-in Jupyter notebooks
Integrated in Azure AD
Deploys and runs models via Docker containers (Azure Machine Learning Experimentation service)

11 Azure Machine Learning Workbench and Jupyter notebooks

12 Microsoft Machine Learning Server
Previously known as “R Server”
Extends R with parallel tools for big data processing
Available in HDInsight
Runs models via Hadoop or Spark
Can publish models via web service
Can run Python too

13 What is the point with notebooks?

14 Isn’t everything about Jupyter?
Azure Machine Learning Studio
Azure Machine Learning Workbench
Data Science VM
HDInsight
Databricks

15 What does the technology framework look like?

16 Some tools in the Azure technology framework for data science
Data preparation: Azure Notebooks; Azure Machine Learning Workbench; Azure Machine Learning Studio; other tools (R Studio, Visual Studio Code, …); Data Factory/Data Lake Analytics
Model execution: Spark on HDInsight; Docker; Machine Learning Server; SQL Server (yes!); Azure Machine Learning web service

17 Big data architectures

18 Big data architectures

19 Big data and advanced analytics scenarios
Modern Data Warehousing: “We want to integrate all our data including ‘big data’ with our data warehouse”
Advanced Analytics: “We are trying to predict when our customers churn”
Real-time Analytics: “We are trying to get insights from our devices in real-time”

20 Databricks
Fast, easy, and collaborative Apache Spark-based analytics platform

21 Ok, but what is Databricks?
Best of Databricks, best of Microsoft
The leading Apache Spark analytics platform
“It is not so often in the software industry that the most widely used tool is also the best available platform to choose from” (Dr. Veljko Krunic)

22 Databricks foundations
What is Apache Spark?
Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets on top of an existing Hadoop Distributed File System (HDFS) infrastructure.
What is Hadoop?
Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.

23 What do we get with Spark?
Allows programmers to develop complex, multi-step data pipelines
In-memory data sharing across different jobs (unlike Hadoop, which is HDFS file-based)
More than just Map and Reduce functions
Optimizes arbitrary operator graphs
Lazy evaluation of big data queries
Provides concise and consistent APIs in Scala, Java and Python
Interactive shell for Scala and Python
Support for SQL and R
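As a rough illustration of lazy evaluation in a multi-step pipeline, the sketch below uses plain Python generators instead of the Spark API, so that it runs without a cluster. The `LazyPipeline` class and its method names are hypothetical; they only mimic the idea that transformations are recorded first and executed when an action is called.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, source: Iterable,
                 steps: Optional[List[Callable[[Iterator], Iterator]]] = None):
        self._source = source
        self._steps = list(steps or [])

    def map(self, fn):
        # Record the transformation; nothing is computed yet.
        return LazyPipeline(self._source,
                            self._steps + [lambda it: (fn(x) for x in it)])

    def filter(self, pred):
        # Also only recorded, not executed.
        return LazyPipeline(self._source,
                            self._steps + [lambda it: (x for x in it if pred(x))])

    def collect(self):
        # The "action": only now is the recorded chain of steps executed.
        it = iter(self._source)
        for step in self._steps:
            it = step(it)
        return list(it)


calls = []


def square(x):
    calls.append(x)  # track when the work actually happens
    return x * x


pipeline = LazyPipeline(range(6)).filter(lambda x: x % 2 == 0).map(square)
work_before_action = len(calls)  # no element has been squared yet
result = pipeline.collect()      # the whole chain runs here
```

Building the pipeline does no work; `square` is only called once `collect()` runs, and only on the elements that pass the filter.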

24 Ok wait, I like Spark but… I don’t want Databricks
Azure still has HDInsight with Spark on top, but:
Cluster management is up to you
Notebook integration has to be configured (Jupyter or Zeppelin)
Lacks memory and performance enhancements
Some good things still remain:
Anaconda comes preloaded by default
Azure integration with other services (Data Lake, Machine Learning, Power BI)
REST APIs for service deployment and job management (Livy)
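To give an idea of what job management through Livy looks like, here is a minimal sketch that builds (but does not send) a batch submission request with the Python standard library. Livy accepts batch jobs as JSON posted to `/batches`. The cluster name, the `/livy` URL prefix, the jar path and the class name are placeholders assumed for illustration, and authentication is left out.

```python
import json
import urllib.request


def build_livy_batch_request(base_url, jar_path, class_name, args):
    """Build (but do not send) a POST /batches request for Apache Livy."""
    payload = {"file": jar_path, "className": class_name, "args": list(args)}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Requested-By": "admin"},
        method="POST",
    )


req = build_livy_batch_request(
    "https://example-cluster.azurehdinsight.net/livy",  # hypothetical cluster endpoint
    "wasbs:///example/jars/app.jar",                     # hypothetical jar location
    "com.example.App",                                   # hypothetical main class
    ["2018-08-01"],
)
body = json.loads(req.data)
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen` with the cluster credentials configured.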

25 Why Databricks then?
Unified platform for data science and data engineering
Easy to promote experiments to “products”
Unified security model, encryption and auditing
Optimized version of Spark, running 10 to 40x faster

26 Azure Databricks
[Diagram labels: Azure Databricks collaborative workspace (data engineer, data scientist, business analyst); deploy production jobs & workflows (multi-stage pipelines, job scheduler, notification & logs); optimized Databricks Runtime Engine (Databricks I/O, Apache Spark, serverless, REST APIs); IoT / streaming data; cloud storage; data warehouses; Hadoop storage; machine learning models; BI tools; data exports; enhance productivity; build on secure & trusted cloud; scale without limits]
Databricks was founded by the team that created Apache Spark. It is a unified analytics platform that accelerates innovation by unifying data science, engineering and business. 75% of the code committed to Apache Spark comes from Databricks.
Unified runtime: create clusters in seconds and scale them up and down dynamically. They have made enhancements to the Spark engine to make it 10x faster than open source Spark. Serverless: auto-configured multi-user cluster, reliable sharing with fault isolation.
Unified collaboration: overall, a simple and collaborative environment that enables your entire team to use Spark and interact with your data simultaneously.
Data engineers: improve ETL performance, zero-management clusters, execute production code from within notebooks.
Data scientists: easy data exploration in notebooks.
Business SMEs: interactive dashboards empower teams to create dynamic reports.
Enterprise security: encryption; fine-grained role-based access control (files, clusters, code, application, dashboard); compliance; REST APIs.
DE: DBIO, Spark, APIs, jobs. DS: Spark and serverless, interactive data science. Data products: everything.
Creators of Spark; training; people; number of customers.
Workflow: ingest, schedule / run / monitor, execute, troubleshoot, debug.
Production jobs: ingest, ETL, scheduling, monitoring.

27 Databricks in Azure
Control plane managed by Databricks
Data plane controlled by Azure
Deployed as IaaS using as many nodes as required

28 Control plane
Notebooks, jobs, clusters, users and ACLs are managed from the control plane
These services store data in dedicated Databricks databases (not accessible to external users)
The control plane is accessible from the Databricks UX and the Databricks API
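As a small illustration of reaching the control plane through the Databricks API, the sketch below builds (without sending) a request to the Clusters API `clusters/list` endpoint, authenticated with a personal access token. The workspace URL and the token value are placeholders assumed for illustration.

```python
import urllib.request


def build_clusters_list_request(workspace_url, token):
    """Build (but do not send) a GET request to the Databricks Clusters API."""
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/clusters/list",
        headers={"Authorization": "Bearer " + token},
        method="GET",
    )


req = build_clusters_list_request(
    "https://westeurope.azuredatabricks.net",  # hypothetical workspace URL
    "dapi-EXAMPLE-TOKEN",                      # placeholder, not a real token
)
```

Passing `req` to `urllib.request.urlopen` would return a JSON document describing the clusters in the workspace.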

29 Data plane
The Spark clusters are deployed to the customer’s Azure subscription
Each workspace and associated clusters are created in dedicated VNETs
Access to VNETs is restricted by network security groups (NSG)

30 How to provision Databricks from Azure
Databricks setup

31 Databricks setup – Creating workspace

32 Databricks setup – Creating workspace
Control plane provisioned

33 Databricks setup – Creating workspace
Control plane: so far, nothing to worry about

34 Databricks setup – Creating clusters

35 Databricks setup – Creating clusters
Provisioning time is approx. 8 minutes
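A cluster like the one created through the UI can also be described as a JSON document for the Clusters API `clusters/create` endpoint. The sketch below only assembles such a document; the helper name `make_cluster_spec`, the runtime version string and the VM size are assumptions chosen for illustration and should be replaced with values offered by your workspace.

```python
import json


def make_cluster_spec(name, workers_min, workers_max, idle_minutes):
    """Assemble a cluster definition for POST /api/2.0/clusters/create."""
    if workers_min < 1 or workers_max < workers_min:
        raise ValueError("need 1 <= workers_min <= workers_max")
    return {
        "cluster_name": name,
        "spark_version": "4.2.x-scala2.11",  # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",   # placeholder Azure VM size
        "autoscale": {"min_workers": workers_min, "max_workers": workers_max},
        # Terminate the cluster after this many idle minutes to limit cost.
        "autotermination_minutes": idle_minutes,
    }


spec = make_cluster_spec("demo-cluster", 2, 8, 30)
payload = json.dumps(spec, indent=2)
```

Setting `autotermination_minutes` means an idle cluster is shut down automatically, so the virtual machines stop generating cost.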

36 Databricks setup – Testing setup
Impressive results!! ;)

37 Databricks setup – Behind the scenes
Here are the cost drivers
Separate resource group, managed from the control plane
Network, VMs, storage, disks

38 Databricks setup – Behind the scenes
Cluster terminated
Virtual machines and networks removed
Storage account remains

39 Other resources
https://azure.microsoft.com/en-us/services/databricks/
warehouse-in-practice/
patterns-and-strategies/
patterns-and-anti-patterns/
practices

40 Help deciding which Machine Learning tool to use
Help deciding what Machine Learning technology to use:
guide/technology-choices/data-science-and-machine-learning
learning/service/overview-what-is-azure-ml
learning/service/overview-more-machine-learning

41 Thank you!!

