Databricks: the new kid on the block


Databricks: the new kid on the block
Antonio Abalos Castillo
antonioa@avanade.com
http://www.sqlsaturday.com/746/Sessions/Details.aspx?sid=78633

A big thanks to all of our sponsors!

… the new kid on the block
[informal] Someone who is new in a place or organization and has many things to learn about it.
Well, actually we are the ones who really need to learn about it!
https://dictionary.cambridge.org/es/diccionario/ingles/new-kid-on-the-block?q=the-new-kid-on-the-block
https://www.phrases.org.uk/meanings/255875.html

Ok, this is about BI and data science…
…and failure rates for analytics, BI, and big data projects = 85%!!
https://designingforanalytics.com/resources/failure-rates-for-analytics-bi-iot-and-big-data-projects-85-yikes/
http://www.digitaljournal.com/tech-and-science/technology/big-data-strategies-disappoint-with-85-percent-failure-rate/article/508325
https://twitter.com/nheudecker/status/928720268662530048

Who is already “on the block”?
Azure Data Factory, Azure Import/Export Service, Azure SQL DB, Azure Cosmos DB, Azure SQL Data Warehouse, Azure Analysis Services, Power BI, Azure Storage Blobs, Azure Data Lake Store, Azure ML, ML Server, Azure Databricks, Azure Data Lake Analytics, Azure HDInsight, Azure IoT Hub, Azure Event Hubs, Kafka on Azure HDInsight, Azure Search, Azure Data Catalog, Azure Stream Analytics, Cognitive Services, Bot Service, Azure Active Directory, Azure Network Security Groups, Azure Key Management Service, Azure ExpressRoute, Operations Management Suite, Azure Functions, Visual Studio

More precisely, on big data: HDInsight
- Includes Jupyter and Zeppelin notebooks
- Remote API for job management
- Integrated with Blob storage, Event Hubs for streaming and Power BI for analytics
- Quick to deploy and scale
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-overview

HDInsight, other considerations
- Provisioning (template: 101-hdinsight-spark-linux): https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql
- Clusters have to be created (20 minutes) and deleted after use
- Admins have to decide what to do with the disks and files
- Data Factory can be used to automate the process (on-demand)
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-load-data-run-query

Azure Machine Learning Studio
- Serverless
- Web based
- Active Directory integrated
- Notebooks
- Limited regions (West Europe)

Azure Machine Learning Studio https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml

Azure Machine Learning Workbench (aka Machine Learning Services)
(In preview as of August 2018)
- Desktop application
- Python based and Git compatible
- Built-in Jupyter notebooks
- Integrated with Azure AD
- Deploys and runs models via Docker containers (Azure Machine Learning Experimentation service)
https://docs.microsoft.com/en-us/azure/machine-learning/service/
https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/experimentation-service-configuration
https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml

Azure Machine Learning Workbench and Jupyter notebooks https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-use-jupyter-notebooks

Microsoft Machine Learning Server
- Previously known as “R Server”
- Extends R with parallel tools for big data processing
- Available in HDInsight
- Runs models via Hadoop or Spark
- Can publish models via web service
- Can run Python too
http://blog.revolutionanalytics.com/2016/01/microsoft-r-open.html
https://docs.microsoft.com/en-us/machine-learning-server/what-is-machine-learning-server
https://docs.microsoft.com/en-us/machine-learning-server/
https://docs.microsoft.com/en-us/machine-learning-server/operationalize/quickstart-publish-r-web-service#b-publish-model-as-a-web-service

What is the point of notebooks?
https://www.svds.com/why-notebooks-are-super-charging-data-science/

Isn’t everything about Jupyter?
- Azure Machine Learning Studio
- Azure Machine Learning Workbench
- Data Science VM
- HDInsight
- Databricks
https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-use-jupyter-notebooks
https://notebooks.azure.com/

What does the technology framework look like?

Some tools in the Azure technology framework for data science
Data preparation
- Azure Notebooks
- Azure Machine Learning Workbench
- Azure Machine Learning Studio
- Other tools (R Studio, Visual Studio Code, …)
- Data Factory/Data Lake Analytics
Model execution
- Spark on HDInsight
- Docker
- Machine Learning Server
- SQL Server (yes!)
- Azure Machine Learning web service
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/data-science-and-machine-learning

Big data architectures

Big data architectures https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/

Big data and advanced analytics scenarios
- Modern Data Warehousing: “We want to integrate all our data including ‘big data’ with our data warehouse”
- Advanced Analytics: “We are trying to predict when our customers churn”
- Real-time Analytics: “We are trying to get insights from our devices in real-time”

Databricks
Fast, easy, and collaborative Apache Spark-based analytics platform

Ok, but what is Databricks?
Best of Databricks + Best of Microsoft
The leading Apache Spark analytics platform
“It is not so often in the software industry that the most widely used tool is also the best available platform to choose from” (Dr. Veljko Krunic)

Databricks foundations
What is Apache Spark?
Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets on top of an existing Hadoop Distributed File System (HDFS) infrastructure.
What is Hadoop?
Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.

What do we get with Spark?
- Allows programmers to develop complex, multi-step data pipelines
- In-memory data sharing across different jobs (unlike Hadoop, which is HDFS file-based)
- More than just Map and Reduce functions
- Optimizes arbitrary operator graphs
- Lazy evaluation of big data queries
- Provides concise and consistent APIs in Scala, Java and Python
- Interactive shell for Scala and Python
- Support for SQL and R
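To make "lazy evaluation" concrete, here is a minimal pure-Python sketch of the idea. It is an analogy only: the `LazyPipeline` class and its `map`, `filter` and `collect` methods are hypothetical names chosen to resemble Spark's transformation/action split, and none of this is Spark's actual API. Transformations only record a plan; the work happens when an action is called.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (illustrative, not Spark)."""

    def __init__(self, source: Iterable,
                 steps: Optional[List[Callable[[Iterator], Iterator]]] = None):
        self._source = source
        self._steps = steps or []

    def map(self, fn: Callable) -> "LazyPipeline":
        # Transformation: record the step, compute nothing yet.
        return LazyPipeline(self._source,
                            self._steps + [lambda it: (fn(x) for x in it)])

    def filter(self, pred: Callable) -> "LazyPipeline":
        # Transformation: also only recorded.
        return LazyPipeline(self._source,
                            self._steps + [lambda it: (x for x in it if pred(x))])

    def collect(self) -> list:
        # Action: the recorded plan is executed now, in a single pass.
        it = iter(self._source)
        for step in self._steps:
            it = step(it)
        return list(it)


calls = []  # records every element the mapped function actually touches


def traced_double(x):
    calls.append(x)
    return x * 2


plan = LazyPipeline(range(5)).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: building the plan ran nothing
result = plan.collect()           # [6, 8]: the work happens here
```

Because the whole chain is known before anything runs, an engine is free to reorder or fuse steps, which is what "optimizes arbitrary operator graphs" refers to.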

Ok wait, I like Spark but… I don’t want Databricks
Azure still has HDInsight with Spark on top, but:
- Cluster management is up to you
- Notebook integration has to be configured (Jupyter or Zeppelin)
- Lacks memory and performance enhancements
Some good things still remain:
- Anaconda comes preloaded by default
- Azure integration with other services (Data Lake, Machine Learning, Power BI)
- REST APIs for service deployment and job management (Livy)
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-overview
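As a rough illustration of job management through Livy, the sketch below builds (but does not send) a request that would submit a Spark batch job to an HDInsight cluster. The cluster URL, credentials, jar path and class name are placeholders invented for this example; the `/livy/batches` path and the `file`, `className` and `args` fields follow the Livy batch API as I understand it, so check them against the current documentation before relying on them.

```python
import base64
import json
import urllib.request


def build_livy_batch_request(cluster_url, user, password,
                             file_path, class_name, args=None):
    """Build a POST request for Livy's batch endpoint without sending it."""
    payload = {
        "file": file_path,        # application jar or script in cluster storage
        "className": class_name,  # main class of the Spark application
        "args": args or [],       # command-line arguments for the job
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{cluster_url.rstrip('/')}/livy/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",  # HDInsight uses basic auth
            "X-Requested-By": user,
        },
        method="POST",
    )


# Placeholder values: none of these point at a real cluster or application.
request = build_livy_batch_request(
    "https://example-cluster.azurehdinsight.net",
    "admin",
    "not-a-real-password",
    "wasbs:///example/jars/app.jar",
    "com.example.App",
    ["10"],
)
# urllib.request.urlopen(request) would submit the job; it is left out on purpose.
```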

Why Databricks then?
- Unified platform for data science and data engineering
- Easy to promote experiments to “products”
- Unified security model, encryption and auditing
- Optimized version of Spark, running 10 to 40x faster

Azure Databricks
[Diagram labels: IoT / streaming data, cloud storage, data warehouses, Hadoop storage; Collaborative Workspace (data engineer, data scientist, business analyst); Deploy Production Jobs & Workflows (multi-stage pipelines, job scheduler, notification & logs); Optimized Databricks Runtime Engine (Databricks I/O, Apache Spark, serverless, REST APIs); machine learning models, BI tools, data exports; Enhance productivity, Build on secure & trusted cloud, Scale without limits]
Notes:
- Databricks was founded by the team that created Apache Spark. It is a unified analytics platform that accelerates innovation by unifying data science, engineering and business.
- 75% of the code committed to Apache Spark comes from Databricks.
Unified runtime
- Create clusters in seconds and dynamically scale them up and down.
- Databricks has enhanced the Spark engine to make it 10x faster than open source Spark.
- Serverless: auto-configured multi-user cluster, reliable sharing with fault isolation.
Unified collaboration
- Overall: a simple and collaborative environment that enables your entire team to use Spark and interact with your data simultaneously.
- Data engineers (DE): improved ETL performance, zero-management clusters, production code executed from within notebooks.
- Data scientists (DS): easy data exploration in notebooks.
- Business SMEs: interactive dashboards empower teams to create dynamic reports.
Enterprise security
- Encryption
- Fine-grained role-based access control (files, clusters, code, application, dashboard)
- Compliance
REST APIs
- DE: DBIO, Spark, APIs, jobs
- DS: Spark and serverless, interactive data science
- Data products: everything
Production jobs
- Ingest, ETL, scheduling, monitoring
- Workflow: schedule / run / monitor, execute, troubleshoot, debug
Other talking points: creators of Spark, training, people, number of customers

Databricks in Azure
- Control plane managed by Databricks
- Data plane controlled by Azure
- Deployed as IaaS using as many nodes as required

Control plane
- Notebooks, jobs, clusters, users and ACLs are managed from the control plane
- These services store data in dedicated Databricks databases (not accessible to external users)
- The control plane is accessible from the Databricks UX and the Databricks API
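Since clusters are managed from the control plane and the control plane is reachable through the Databricks API, creating a cluster can be scripted. The sketch below only builds the request and does not send it. The workspace URL, token, cluster name, runtime version and VM size are placeholders; the `/api/2.0/clusters/create` path and the field names reflect the Databricks Clusters API as I understand it and should be verified against the current reference.

```python
import json
import urllib.request


def build_create_cluster_request(workspace_url, token, cluster_name,
                                 spark_version, node_type_id, num_workers,
                                 autotermination_minutes=30):
    """Build a POST request for the Clusters API without sending it."""
    payload = {
        "cluster_name": cluster_name,
        "spark_version": spark_version,  # Databricks runtime identifier
        "node_type_id": node_type_id,    # Azure VM size for the nodes
        "num_workers": num_workers,
        # Terminate idle clusters so VMs stop accruing cost.
        "autotermination_minutes": autotermination_minutes,
    }
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # personal access token
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: the token is fake and the versions are examples only.
cluster_request = build_create_cluster_request(
    "https://westeurope.azuredatabricks.net",
    "dapi-example-token",
    "demo-cluster",
    "4.2.x-scala2.11",
    "Standard_DS3_v2",
    2,
)
# urllib.request.urlopen(cluster_request) would create the cluster; omitted here.
```

Setting an auto-termination timeout is a design choice worth keeping: the VMs in the data plane are the main cost driver, and an idle cluster that terminates itself stops paying for them.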

Data plane
- The Spark clusters are deployed to the customer’s Azure subscription
- Each workspace and its associated clusters are created in dedicated VNETs
- Access to VNETs is restricted by network security groups (NSG)

Databricks setup
How to provision Databricks from Azure

Databricks setup – Creating workspace https://azure.microsoft.com/en-us/pricing/details/databricks/

Databricks setup – Creating workspace Control plane provisioned

Databricks setup – Creating workspace Control plane So far, nothing to worry about

Databricks setup – Creating clusters

Databricks setup – Creating clusters
Provisioning time is approx. 8 minutes

Databricks setup – Testing setup Impressive results!! ;)

Databricks setup – Behind the scenes
Here are the cost drivers:
- Separate resource group, managed from the control plane
- Network, VMs, storage, disks

Databricks setup – Behind the scenes
- Cluster terminated
- Virtual machines and networks removed
- Storage account remains

Other resources
https://azure.microsoft.com/en-us/services/databricks/
https://blogs.msdn.microsoft.com/sqlcat/2016/08/18/migrating-data-to-azure-sql-data-warehouse-in-practice/
https://blogs.msdn.microsoft.com/sqlcat/2017/05/17/azure-sql-data-warehouse-loading-patterns-and-strategies/
https://blogs.msdn.microsoft.com/sqlcat/2017/09/05/azure-sql-data-warehouse-workload-patterns-and-anti-patterns/
https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orlando-2017/BRK3377
https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orlando-2017/BRK4016
https://databricks.com/product/azure
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices

Help deciding which Machine Learning tool to use
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/data-science-and-machine-learning
https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml
https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-more-machine-learning

Thank you!!