Published by Παρθενιά Κοτζιάς. Modified over 6 years ago.
1
BRK3316
Operationalizing Microsoft Cognitive Toolkit and TensorFlow models with HDInsight Spark
Mary Wahl
Data Scientist, AI Enablement
Artificial Intelligence & Microsoft
© Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
2
Common customer request:
Train a DNN at scale on a huge pool of collected images…
…and apply in real time to new images.
3
On further investigation: Very few of those images are labeled…
…and the customer would like the model’s predictions on the rest.
4
Machine Learning, Analytics, & Data Science Conference
Session Goals
- Introduce an example use case
- Explain methods for DNN operationalization with PySpark
  - Using Cognitive Toolkit (CNTK) and TensorFlow (TF) APIs
  - Using MMLSpark
- Highlight common and insidious errors
- Enable attendees to adapt the methods
5
Example use case: aerial image classification
6
Land use classification of aerial imagery
- Large, freely available, labeled datasets
  - Imagery: National Agriculture Imagery Program, every two years
  - Labels: National Land Cover Database, every five years (w/ delay)
- Common need in industry and government
  - Enforce regulations, collect taxes, geopolitical surveillance
  - Monitor crop performance, property value estimation, marketing
[Figure: example images labeled Barren, Forested, Shrub, Cultivated, Grassland, Developed]
7
Selecting training and validation data
8
Training method: transfer learning
- Adapts pretrained models for new tasks
  - Used AlexNet and 52-layer ResNet pretrained on ImageNet classification task
- Accommodates smaller training datasets
  - Avoids overfitting by retraining only part of the model
  - Used a balanced training set of 44k labeled images
- Lower computation burden
  - Performed retraining in under one hour on a single-GPU Windows Data Science Virtual Machine
9
Data readers offer huge benefits during training
- Minibatching
  - Makes efficient use of multiple cores
  - Improves gradient estimation
  - Faster convergence (potentially)
- Queuing
  - Pre-loads the next minibatch while the GPU processes the current one
- Distributed training
  - Partitions data between workers
- Transformations
  - Add diversity through random cropping/scaling/colorization
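The minibatching and queuing ideas above can be sketched in plain Python. This is only an illustration of the concept, not how the CNTK or TensorFlow readers are implemented; the helper names `minibatches` and `prefetch` and the queue depth are assumptions made for this example.

```python
import queue
import threading


def minibatches(items, batch_size):
    """Group an iterable into lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def prefetch(iterable, depth=1):
    """Yield items from iterable while a background thread loads ahead.

    Up to `depth` items are buffered, so the consumer (for example a
    GPU training step) does not wait for the next batch to be read.
    Errors raised inside the worker thread are not forwarded here;
    a production reader would need to handle that.
    """
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def worker():
        for item in iterable:
            buffer.put(item)
        buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


# Usage: five items in batches of two, loaded one batch ahead.
batches = list(prefetch(minibatches(range(5), 2), depth=1))
```

The order of batches is preserved because a single worker thread feeds a FIFO queue.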
10
Most commonly-used data readers
- Cognitive Toolkit (CNTK): “MAP file” lists filename and label for each image in the training set
  - Read by MinibatchSource
- TensorFlow: “TFRecords” are binary files containing images and labels
  - Read by TFRecordReader
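As a minimal sketch of the MAP file idea: CNTK's image reader accepts a text file with one image per line, the file path and an integer class label separated by a tab. The helper name `write_map_file`, the class ordering, and the example paths below are assumptions for illustration and are not taken from the talk.

```python
import os
import tempfile

# Hypothetical class order; the integer label is the index in this list.
CLASSES = ["Barren", "Cultivated", "Developed", "Forested", "Grassland", "Shrub"]


def write_map_file(map_path, labeled_files, classes=CLASSES):
    """Write one 'image_path<TAB>label_index' line per labeled image."""
    with open(map_path, "w", encoding="utf-8") as f:
        for image_path, class_name in labeled_files:
            f.write(f"{image_path}\t{classes.index(class_name)}\n")


# Usage with two made-up training images.
example_files = [
    ("/data/train/a.png", "Forested"),
    ("/data/train/b.png", "Developed"),
]
map_path = os.path.join(tempfile.mkdtemp(), "train_map.txt")
write_map_file(map_path, example_files)

with open(map_path, encoding="utf-8") as f:
    map_lines = f.read().splitlines()
```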
11
Quick look: data preparation and use in training
12
Batch scoring with CNTK and TF models on HDInsight Spark
13
Motivation for operationalizing DNNs on Spark
- Reduces image data transfer latency
  - Cluster and images can be located on the same Azure Data Lake Store (HDFS)
- Even scoring with DNNs is a time-intensive task
  - Often 100s of milliseconds per image on CPU
- Split scoring task over arbitrarily many worker nodes
  - No interdependency -> “embarrassingly parallel” scoring is possible
- Familiar Python interface to Cognitive Toolkit/TensorFlow
14
Operationalization architecture on Azure
[Diagram components: Azure HDInsight Spark; Azure Data Lake Store (HDFS) - or - Azure storage account]
15
Replicating image loading steps on Spark
Can’t use the data readers that we used during training:
- Cognitive Toolkit: MinibatchSource expects local file access to images listed in MAP files
- TensorFlow: Can’t realistically write TFRecords for all files
Alternative: match the data loading steps that each reader performed during training with custom code
16
Image pre-processing with OpenCV
- Color channels loaded in “BGR” order
  - Many other packages load images in RGB order
- Image data dimensions: “# color channels x width x height”
  - Many other packages load images with dimensions “width x height x # channels”
- Data type (float vs. int, precision) may also differ
NB: some mistakes have a surprisingly small effect on prediction accuracy!
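The channel-order and layout differences listed above can be handled with a small NumPy conversion. This sketch assumes the image arrives as a channels-last RGB array (as many Python imaging packages produce) and that the model expects channels-first BGR data as 32-bit floats; whether that matches a given training reader has to be checked against the reader itself. The function name `rgb_hwc_to_bgr_chw` is made up for this example.

```python
import numpy as np


def rgb_hwc_to_bgr_chw(image_rgb):
    """Convert a (rows, cols, 3) RGB array to a (3, rows, cols) BGR float32 array."""
    if image_rgb.ndim != 3 or image_rgb.shape[2] != 3:
        raise ValueError("expected an array shaped (rows, cols, 3)")
    bgr = image_rgb[:, :, ::-1]          # reverse the channel axis: RGB -> BGR
    chw = np.transpose(bgr, (2, 0, 1))   # move channels to the first axis
    # Copy into a contiguous float32 buffer (assumed model input type).
    return np.ascontiguousarray(chw, dtype=np.float32)


# Usage on a tiny synthetic 2x3 "image".
rgb = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
converted = rgb_hwc_to_bgr_chw(rgb)
```

Skipping either step does not raise an error when the shapes happen to be compatible, which is one reason such mistakes can go unnoticed.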
17
Split scoring task appropriately among workers
Divide images into n partitions:

    image_rdd = sc.binaryFiles(
        'adl://account_name.azuredatalakestore.net/images/*.png',
        minPartitions=num_workers).coalesce(num_workers)

Map partitions to workers:

    labeled_images = image_rdd.mapPartitions(image_scoring_func).collect()

Workers access data through a tuple generator:

    def image_scoring_func(file_generator):
        for file in file_generator:
            # file is a two-tuple: [0] filename, [1] byte data
            ...
        return predicted_labels
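One way to structure the per-partition function is sketched below, under stated assumptions: the model is loaded once per partition rather than once per image, and results are yielded as (filename, label) pairs. `make_scoring_func`, `load_model`, and `predict` are hypothetical names; the stand-in model and predictor exist only so the sketch runs without Spark, CNTK, or TensorFlow.

```python
def make_scoring_func(load_model, predict):
    """Build a function suitable for passing to RDD.mapPartitions."""

    def score_partition(file_iter):
        model = load_model()  # load once per partition, not per image
        for filename, data in file_iter:
            # each element is a two-tuple: filename, raw image bytes
            yield filename, predict(model, data)

    return score_partition


# Stand-ins so the example runs locally (assumptions, not a real model).
def fake_load_model():
    return {"classes": ["Developed", "Forested"]}


def fake_predict(model, data):
    # Picks a class from the byte length; a real predictor would decode
    # and preprocess the image, then evaluate the network.
    return model["classes"][len(data) % 2]


score = make_scoring_func(fake_load_model, fake_predict)
partition = iter([("a.png", b"12"), ("b.png", b"123")])
results = list(score(partition))
```

On a cluster, the same function would be passed as `image_rdd.mapPartitions(score)`; because it is a generator, predictions are produced lazily as Spark consumes them.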
18
Demo: Batch scoring on Azure HDInsight Spark
19
Results: Parallelization and processing time
Measured time required to score an entire balanced test set of 11,760 images.
From 38 minutes to <1 minute through parallelization (using CPU-only workers)
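A quick back-of-the-envelope check of these figures, assuming the 38-minute measurement reflects images being scored one after another (the slide does not state the baseline configuration):

```python
num_images = 11760
baseline_minutes = 38          # slowest measured run
parallel_minutes_upper = 1     # "<1 minute", used as an upper bound

# Average time per image if the baseline scored images sequentially.
per_image_ms = baseline_minutes * 60 * 1000 / num_images

# Lower bound on the speedup, since the parallel run took under a minute.
min_speedup = baseline_minutes / parallel_minutes_upper

print(f"about {per_image_ms:.0f} ms per image, speedup of at least {min_speedup:.0f}x")
```

Under that assumption the per-image time comes out just under 200 ms, which is in line with the "100s of milliseconds per image on CPU" estimate given earlier in the talk.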
20
Results: overall classification accuracy
~80% for both CNTK and TensorFlow
21
Operationalizing CNTK models with Microsoft Machine Learning for Apache Spark
22
Microsoft Machine Learning for Apache Spark (MMLSpark)
- Easily ingest and preprocess images from HDFS
  - Seamless integration with CNTK and OpenCV
- Featurize images and other inputs with pretrained DNNs
  - Bring your own model (BYOM) or use one of many pretrained CNTK models
  - Can use a GPU edge node to accelerate this process
- Train classifiers on featurized images
  - Fast form of transfer learning that does not require GPU compute
23
Demo: Training and scoring with DNNs using MMLSpark
24
Results: Identifying newly-developed regions
[Figure: image panels labeled 2010 and 2016]
25
Results: Predicting land use in Middlesex County, MA in 2016
- Most recent ground-truth labels are from 2011
- Red: developed; white: cultivated; green: all others (undeveloped)
Come visit us at Microsoft’s NERD Center!
26
Where to learn more:
- End-to-end tutorial covering the aerial image classification use case, with sample data/code/models: aka.ms/aerialimageclassification
- Download MMLSpark and examples from:
- You can reach me (Mary Wahl) at
27
Please evaluate this session
From your
- PC or tablet: visit MyIgnite
- Phone: download and use the Microsoft Ignite mobile app
Your input is important!