Breeding Data Scientists

1 Breeding Data Scientists
Danielle Dean, PhD, Senior Data Scientist Lead, Microsoft
Amy O’Connor, Business Value Enablement, Cloudera

2 Five changes in the world of the Data Scientist
More Data, Insights, Results; Organization & Culture; Data Engineering; Productivity Tools; Cloud Enabled

3 More Data, More Insights
Data is abundant, diverse & shared freely. As is how we store, process and analyze it: Streaming, Machine Learning, BI, ETL, Modeling.

4 More Results
Destroying Human Trafficking Networks (Thorn); Working to Cure Cancer (Top Cancer Research Institutions); Rocket Science

5 Organization & Culture: Sobering Statistics
“Only 27% of the big data projects are regarded as successful”
“Only 8% of the big data projects are regarded as VERY successful”
Only 13% of organizations have achieved full-scale production for their Big Data implementations
Source: CapGemini 2014
“Only 17% of survey respondents said they had a well-developed Predictive/Prescriptive Analytics program in place, while 80% said they planned on implementing such a program within five years”
Source: Dataversity 2015 Survey

6 The Data Scientist is not one person
Venn diagram (Source: Drew Conway). Circles: Hacking Skills; Math and Statistical Knowledge; Substantive Expertise. Overlaps: Machine Learning; Traditional Research; Danger Zone; Data Science. Also shown: Curiosity.

7 The Data Scientist does not stand alone
Executive Sponsor; Data Engineer/ETL Engineer; Data Scientist; Subject Matter Expert; Data Steward/SME; plus Product Owner, app developer, program manager, DevOps, etc.

8 The Data Scientist does not sit in a centralized org
Source: Gartner 2016

9 “How do I become a Data Scientist?”

10 “How do I become a Data Scientist?”

11 Importance of Process
Machine Learning & Data Science Conference
Data Science != Software Engineering. But we can learn a lot, especially on processes. After all, failing to plan is planning to fail.
Data Science
1. Data Problem Formulation
2. Acquire Data Sources
3. Data exploration
4. Create analytics dataset
5. Modeling & Descriptive Analysis
6. Model evaluation and tuning
7. Model Deployment
Data Acquisition
1. Data Flow Architecture
2. Data Schema Architecture
2. Feature Extraction
3. Data Flow Implementation
4. Data Flow Validation
© 2015 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

12 Four Pillars of the Team Data Science Process
1 Standard Project Lifecycle
2 Standardized Document Templates, Project Structure
3 Shared, Distributed Resources
4 Productivity Tools, Shared Utilities

13 Team Data Science Process at Microsoft
Data science virtual machines (DSVMs) as the fundamental development platform on cloud
Use Visual Studio Team Services (VSTS): work item tracking and scrum planning, Git repositories
Shared data science utilities in Git repository
Use cloud-based Azure resources as needed

14 Data Engineering – ready for ML?
The better the raw materials, the better the product.
Question is sharp. E.g. Predict whether component X will fail in the next Y days; clear path of action with answer.
Data measures what they care about. E.g. Identifiers at the level they are predicting.
Data is accurate. E.g. Failures are really failures, human labels on root causes; domain knowledge translated into process.
Data is connected. E.g. Machine information linkable to usage information.
A lot of data. E.g. Will be difficult to predict failure accurately with few examples.
© 2014 Microsoft Corporation. All rights reserved.
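A minimal sketch of how some of the readiness criteria on this slide could be checked in code, not the presenters' own method. It turns the sharp question ("will component X fail in the next Y days?") into a label, checks that machine records are linkable to usage records, and counts positive examples. The function names, the record layout, and the `min_positives` threshold are all hypothetical choices made for illustration.

```python
from datetime import date, timedelta


def label_failures(snapshots, failures, horizon_days):
    """Label each (machine_id, snapshot_date) with 1 if that machine has a
    recorded failure within the next `horizon_days` days, else 0.

    snapshots: list of (machine_id, snapshot_date) tuples (hypothetical layout)
    failures:  dict mapping machine_id -> list of failure dates
    """
    labels = []
    for machine_id, snapshot_date in snapshots:
        window_end = snapshot_date + timedelta(days=horizon_days)
        # A failure counts only if it falls after the snapshot and
        # no later than the end of the prediction window.
        failed = any(
            snapshot_date < failure_date <= window_end
            for failure_date in failures.get(machine_id, [])
        )
        labels.append((machine_id, snapshot_date, int(failed)))
    return labels


def readiness_report(labels, usage_machine_ids, min_positives):
    """Summarize two readiness checks: linkability and number of positives.

    usage_machine_ids: machine ids present in the usage data
    min_positives:     assumed minimum number of failure examples; the right
                       value depends on the problem and is not given in the talk
    """
    machine_ids = {machine_id for machine_id, _, _ in labels}
    positives = sum(label for _, _, label in labels)
    return {
        "rows": len(labels),
        "positives": positives,
        "enough_positives": positives >= min_positives,
        # Machines with no matching usage records ("data is connected").
        "unlinked_machines": sorted(machine_ids - set(usage_machine_ids)),
    }


if __name__ == "__main__":
    snapshots = [("m1", date(2018, 1, 1)), ("m2", date(2018, 1, 1))]
    failures = {"m1": [date(2018, 1, 5)]}
    labels = label_failures(snapshots, failures, horizon_days=7)
    print(readiness_report(labels, usage_machine_ids={"m1"}, min_positives=30))
```

A report like this only flags obvious gaps; whether failures are "really failures" still needs domain experts to review the labels.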

15 A Bit more on Data Engineering
How do Data Scientists spend their time? (Source: CrowdFlower)
Gartner estimates that poor quality of data costs an average organization $13.5 million per year, and yet data governance problems, which all organizations suffer from, are worsening.

16 A Bit more on Data Engineering
Data Ingestion (Kafka, Navigator, Search): Cloudera enables users to build real-time, end-to-end data pipelines to power their business. Leadership in Apache Spark and Kafka has made Cloudera a trusted resource for users who want to capture real-time, streaming, and time series data without gaps in security.
Data Processing (Spark, Hive): Cloudera is helping users accelerate their data pipelines with leadership in technologies like Apache Spark. Data processing in Cloudera Enterprise can help take processing windows from hours to minutes and enables faster access to data for a variety of users and skill sets.

17 Data Engineering/Science/Analyst Tools
Data Science/Analytics Data Analyst / BI Cloudera Certified Partners

18 Flexible deployments: Cloud enabled
Cloudera Director
Easy Administration: Dynamic cluster lifecycle management; Single pane of glass (multi-cluster view); Consumption based billing and metering
Enterprise-grade: Integration across Cloudera Enterprise; Management of CDH deployments at scale
Flexible Deployments: No cloud vendor lock-in (open plugin framework for IaaS platforms); Scaling of provisioned clusters; Spot instance provisioning

19 Cortana Intelligence Suite on Azure cloud platform
Data Sources: Apps, Sensors and devices, Data
Information Management: Data Factory, Data Catalog, Event Hubs
Big Data Stores: Data Lake Store, SQL Data Warehouse
Machine Learning and Analytics: Machine Learning, Data Lake Analytics, HDInsight (Hadoop and Spark), Stream Analytics
Intelligence: Cognitive Services, Bot Framework, Cortana
Dashboards & Visualizations: Power BI
Consumers: People; Apps (Web, Mobile, Bots); Automated Systems
Flow: Data to Intelligence to Action

20 More Data = More results! Create a data-driven culture & DS processes. Careful checking and cleaning of data. Use the right tool for the job. Leverage the power of the cloud.

21 Resources
Microsoft’s “Team Data Science Process” GitHub:
Productive utilities repository:
Sign up for a free VSTS account:
Complete Cloudera resource library:
Coursera Data Science:

