Breeding Data Scientists. Danielle Dean, PhD, Senior Data Scientist Lead, Microsoft; Amy O’Connor, Business Value Enablement, Cloudera.
Five changes in the world of the Data Scientist: More Data, Insights, Results; Organization & Culture; Data Engineering; Productivity Tools; Cloud Enabled.
More Data, More Insights: Data is abundant, diverse and shared freely, as are the ways we store, process and analyze it (streaming, machine learning, BI, ETL, modeling).
More Results: Working to Cure Cancer (Top Cancer Research Institutions); Rocket Science; Destroying Human Trafficking Networks (Thorn).
Organization & Culture: Sobering Statistics. “Only 27% of the big data projects are regarded as successful.” Only 13% of organizations have achieved full-scale production for their Big Data implementations. “Only 8% of the big data projects are regarded as VERY successful.” “Only 17% of survey respondents said they had a well-developed Predictive/Prescriptive Analytics program in place, while 80% said they planned on implementing such a program within five years.” Sources: Dataversity 2015 Survey; CapGemini 2014.
The Data Scientist is not one person. [Figure: Venn diagram of Hacking Skills, Math and Statistical Knowledge, and Substantive Expertise; the overlaps are labeled Machine Learning, Traditional Research, and Danger Zone, with Data Science at the center. Additional label: Curiosity.] Source: Drew Conway.
The Data Scientist does not stand alone: Executive Sponsor; Data Engineer/ETL Engineer; Data Scientist; Subject Matter Expert; Data Steward/SME; plus product owner, app developer, program manager, DevOps, etc.
The Data Scientist does not sit in a centralized org. Source: Gartner 2016.
“How do I become a Data Scientist?”
Machine Learning & Data Science Conference, 4/14/2018 9:37 AM. Importance of Process: Data Science != Software Engineering. But we can learn a lot, especially on processes. After all, failing to plan is planning to fail. Data Science: 1. Data Problem Formulation; 2. Acquire Data Sources; 3. Data exploration; 4. Create analytics dataset; 5. Modeling & Descriptive Analysis; 6. Model evaluation and tuning; 7. Model Deployment. Data Acquisition: 1. Data Flow Architecture; 2. Data Schema Architecture; 2. Feature Extraction; 3. Data Flow Implementation; 4. Data Flow Validation. © 2015 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Four Pillars of the Team Data Science Process: 1. Standard Project Lifecycle; 2. Standardized Document Templates, Project Structure; 3. Shared, Distributed Resources; 4. Productivity Tools, Shared Utilities.
Team Data Science Process at Microsoft: Data science virtual machines (DSVMs) as the fundamental development platform on cloud. Use Visual Studio Team Services (VSTS): work item tracking and scrum planning; Git repositories. Shared data science utilities in Git repository. Use cloud-based Azure resources as needed.
Data Engineering – ready for ML? The better the raw materials, the better the product. Question is sharp (e.g. predict whether component X will fail in the next Y days; clear path of action with answer). Data measures what they care about (e.g. identifiers at the level they are predicting). Data is accurate (e.g. failures are really failures, human labels on root causes; domain knowledge translated into process). Data is connected (e.g. machine information linkable to usage information). A lot of data (e.g. it will be difficult to predict failure accurately with few examples).
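The "sharp question" on this slide (will component X fail in the next Y days?) can be turned into a training label once observations and failure records share an identifier. The following is a minimal sketch, not the presenters' method: the function name `label_failures`, the record layout, and the toy data are all hypothetical, chosen only to illustrate the idea. Counting the positive labels afterwards gives a quick check on the "a lot of data" criterion.

```python
from datetime import date, timedelta


def label_failures(observations, failures, horizon_days):
    """Label each (machine_id, day) observation with 1 if that machine has a
    recorded failure strictly after `day` and within `horizon_days` days,
    otherwise 0. `failures` maps machine_id -> list of failure dates."""
    horizon = timedelta(days=horizon_days)
    labeled = []
    for machine_id, day in observations:
        fails_soon = any(
            day < failed_on <= day + horizon
            for failed_on in failures.get(machine_id, [])
        )
        labeled.append((machine_id, day, int(fails_soon)))
    return labeled


# Hypothetical toy data: machine "A" fails on 10 Jan; machine "B" never fails.
observations = [
    ("A", date(2018, 1, 1)),
    ("A", date(2018, 1, 5)),
    ("B", date(2018, 1, 5)),
]
failures = {"A": [date(2018, 1, 10)]}

labels = label_failures(observations, failures, horizon_days=7)
positives = sum(label for _, _, label in labels)
print(labels)
print(f"{positives} positive example(s) out of {len(labels)}")
```

Joining on `machine_id` only works if the data is connected in the slide's sense, and a very small count of positive labels is an early warning that accurate failure prediction will be difficult.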
A Bit more on Data Engineering. How do Data Scientists spend their time? Source: CrowdFlower. Gartner estimates that poor quality of data costs an average organization $13.5 million per year, and yet data governance problems, which all organizations suffer from, are worsening.
A Bit more on Data Engineering. Data Ingestion (Kafka, Navigator, Search): Cloudera enables users to build real-time, end-to-end data pipelines to power their business. Leadership in Apache Spark and Kafka has made Cloudera a trusted resource for users who want to capture real-time, streaming, and time series data without gaps in security. Data Processing (Spark, Hive): Cloudera is helping users accelerate their data pipelines with leadership in technologies like Apache Spark. Data processing in Cloudera Enterprise can help cut processing windows from hours to minutes and gives faster access to data for a variety of users and skill sets.
Data Engineering/Science/Analyst Tools: Data Science/Analytics; Data Analyst/BI; Cloudera Certified Partners.
Flexible deployments: Cloud enabled (Cloudera Director). Easy Administration: dynamic cluster lifecycle management; single pane of glass (multi-cluster view); consumption-based billing and metering. Enterprise-grade: integration across Cloudera Enterprise; management of CDH deployments at scale. Flexible Deployments: no cloud vendor lock-in (open plugin framework for IaaS platforms); scaling of provisioned clusters; spot instance provisioning.
Cortana Intelligence Suite on Azure cloud platform (from Data to Intelligence to Action). Data Sources: apps; sensors and devices; data. Information Management: Data Factory; Data Catalog; Event Hubs. Big Data Stores: Data Lake Store; SQL Data Warehouse. Machine Learning and Analytics: Machine Learning; Data Lake Analytics; HDInsight (Hadoop and Spark); Stream Analytics. Intelligence: Cognitive Services; Bot Framework; Cortana. Dashboards & Visualizations: Power BI. Action: People; Apps (Web, Mobile, Bots); Automated Systems.
More Data = More results! Create a data-driven culture and DS processes. Careful checking and cleaning of data. Use the right tool for the job. Leverage the power of the cloud.
Resources
Microsoft’s “Team Data Science Process” GitHub: http://aka.ms/tdsp
Productivity utilities repository: https://github.com/Azure/Azure-TDSP-Utilities
Sign up for a free VSTS account: http://www.visualstudio.com
Complete Cloudera resource library: https://www.cloudera.com/resources.html
Coursera Data Science: http://www.coursera.org