Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models.


1 How to Work with Large Datasets to Build Predictive Models
Girish Nathan, Misha Bilenko
Microsoft Azure Machine Learning

2 Agenda
1. How to Work with Large Datasets
   Sample Dataset: NYC Taxi
   HDInsight (Hadoop on Azure)
   IPython notebook and HDInsight
2. Building Predictive Models
   Azure ML Studio
   Learning with Counts
3. Putting it all together: Learning with Counts and HDInsight

3 Sample Data: NYC Taxi
One-year log of NYC taxi rides
60 GB, publicly available at http://www.andresmh.com/nyctaxitrips/
Trip (driver id, times, locations) and fare (fare, tip, tolls)
Rest of tutorial: data wrangling and tip prediction
Tools: AzCopy, HDInsight, IPython, Azure ML Studio

4 HDInsight: Hadoop on Azure
100% Apache Hadoop as an Azure service
Can deploy on Windows or Linux
Provides Map-Reduce capability over big data in Azure blobs
Head node: job and cluster monitoring
Hive: SQL-like queries as an alternative to writing code
    SELECT Col1, COUNT(*) AS Count_Col1
    FROM Your_Table
    GROUP BY Col1
    ORDER BY Count_Col1 DESC
    LIMIT 10;
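To make the semantics of the Hive query on this slide concrete, here is a minimal in-memory Python sketch of the same group, count, sort, and limit steps. The rows and the column values ("cash", "card", "voucher") are hypothetical toy data, not part of the NYC Taxi dataset, and the sketch says nothing about how Hive executes the query on a cluster.

```python
from collections import Counter


def top_values(rows, column, limit=10):
    """Count rows per distinct value of `column`, most frequent first."""
    counts = Counter(row[column] for row in rows)
    # most_common() sorts by descending count, mirroring
    # ORDER BY Count_Col1 DESC LIMIT <limit>.
    return counts.most_common(limit)


# Hypothetical toy rows standing in for Your_Table.
rows = [
    {"Col1": "cash"}, {"Col1": "card"}, {"Col1": "card"},
    {"Col1": "card"}, {"Col1": "cash"}, {"Col1": "voucher"},
]

top = top_values(rows, "Col1", limit=2)
print(top)
```

On a dataset of tens of gigabytes the same aggregation would be left to Hive, which distributes the grouping across the cluster instead of holding every row in one process.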

5 IPython Notebook
Web-based Python REPL environment
Combines authoring, execution, visualization
Can author and execute HDInsight Hive queries
Sample query (Python code snippet):
    import json
    import urllib2

    def submit_hive_query(self):
        # POST the prepared parameters and remember the returned job id
        response = urllib2.urlopen(self.url, self.hiveParams)
        data = json.load(response)
        self.hiveJobID = data['id']

    def query(self, queryString):
        self.submit_hive_query()
Example query string: SELECT * FROM sample_table LIMIT 10;
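The slide's snippet targets Python 2 (urllib2) and does not show how hiveParams is built. Below is a hedged Python 3 sketch of the same submit-and-store-job-id pattern. The class name HiveClient, the form fields (user.name, execute, statusdir), the status directory, and the example URL are all assumptions, loosely modeled on a WebHCat-style REST interface, and authentication is omitted. The opener argument is a hypothetical hook so the example can run against a fake response instead of a real cluster.

```python
import io
import json
import urllib.parse
import urllib.request


class HiveClient:
    """Sketch of a Hive job submitter; field names are assumptions."""

    def __init__(self, url, user, status_dir="hive-status",
                 opener=urllib.request.urlopen):
        self.url = url
        self.user = user
        self.status_dir = status_dir
        self.opener = opener          # injectable for offline testing
        self.hive_job_id = None

    def build_params(self, query_string):
        # URL-encode the form body that will be POSTed.
        fields = {
            "user.name": self.user,
            "execute": query_string,
            "statusdir": self.status_dir,
        }
        return urllib.parse.urlencode(fields).encode("utf-8")

    def query(self, query_string):
        # Passing a data argument makes urlopen issue a POST.
        response = self.opener(self.url, self.build_params(query_string))
        data = json.load(response)
        self.hive_job_id = data["id"]
        return self.hive_job_id


# Offline usage: a fake opener returns a canned JSON response.
sent = {}

def fake_opener(url, data):
    sent["url"], sent["data"] = url, data
    return io.BytesIO(b'{"id": "job_42"}')

client = HiveClient("https://example.invalid/hive", "admin", opener=fake_opener)
job_id = client.query("SELECT * FROM sample_table LIMIT 10;")
print(job_id)
```

Unlike the slide's query method, this version actually passes the query string into the request, which is presumably what the elided parts of the original class did.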

6 What is Azure ML Studio
Fully managed cloud service
Browser-based authoring of dataflow
Best-in-class machine learning algorithms
Support for R/Python/SQL
Collaborative data science
Quickly deploy models as web services/REST APIs
Publish to a gallery for collaboration with community

7 Learning with Counts, a.k.a. Dracula
(Distributed Robust Algorithm for CoUnt-based LeArning)
Misha Bilenko
Microsoft Azure Machine Learning
Microsoft Research

8 Large-scale learning in multi-entity domains
adid = 1010054353
adText = K2 ski sale!
adURL = www.k2.com/sale
Userid = 0xb49129827048dd9b
IP = 131.107.65.14
Query = powder skis
QCategories = {skiing, outdoor gear}
Information retrieval
Advertising, recommending, search: item, page/query, user
Transaction classification
Payment fraud: transaction, product, user
Email spam: message, sender, recipient
Intrusion detection: session, system, user
IoT: device, location

9 Large-scale learning in multi-entity domains
adid: 1010054353
adText: Fall ski sale!
adURL: www.k2.com/sale
userid: 0xb49129827048dd9b
IP: 131.107.65.14
query: powder skis
qCategories: {skiing, outdoor gear}

10 Learning with Counts
[Figure: count table keyed by IP address, with a REST row that aggregates all remaining IP values]

11 Learning with Counts
[Figure: the same IP count table as on slide 10]

12 Learning with Counts: aggregation
[Figure: timeline marked T and now, with a Counting stage; count tables keyed by IP, query, Query × AdId (e.g. "facebook, ad1"; "dozen roses, ad3"), and IP[2] (IP addresses aggregated to their first two octets, e.g. 173.194.*.*), each with a REST row that aggregates all remaining values]
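The aggregation shown on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming that each table stores per-class counts (positives and negatives) for each key and that keys seen fewer than min_total times are folded into the REST row; the threshold, the function names, and the toy examples are hypothetical and not taken from the talk.

```python
from collections import defaultdict


def ip_prefix(ip, octets=2):
    """Aggregate an IPv4 address to its leading octets: a.b.c.d -> a.b.*.*"""
    parts = ip.split(".")
    return ".".join(parts[:octets] + ["*"] * (len(parts) - octets))


def build_count_table(examples, key_fn, min_total=2):
    """Count positives/negatives per key; rare keys are merged into REST."""
    raw = defaultdict(lambda: [0, 0])      # key -> [positives, negatives]
    for features, label in examples:
        raw[key_fn(features)][0 if label else 1] += 1

    table = {"REST": [0, 0]}
    for key, (pos, neg) in raw.items():
        if pos + neg >= min_total:
            table[key] = [pos, neg]
        else:                               # too rare: back off to REST
            table["REST"][0] += pos
            table["REST"][1] += neg
    return table


# Hypothetical toy log: (features, label) pairs.
examples = [
    ({"ip": "173.194.33.9"}, 1),
    ({"ip": "173.194.33.9"}, 0),
    ({"ip": "173.194.20.1"}, 0),
    ({"ip": "87.250.251.11"}, 0),
    ({"ip": "131.107.65.14"}, 1),
]

by_ip = build_count_table(examples, lambda f: f["ip"])
by_prefix = build_count_table(examples, lambda f: ip_prefix(f["ip"]))
print(by_ip)
print(by_prefix)
```

Counting on the two-octet prefix pools evidence from addresses that are individually too rare to keep, which is why the prefix table retains a 173.194.*.* row that absorbs both 173.194 addresses.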

13 Learning with Counts: combiner training
[Figure: timeline marked T and now, with stages Counting and Train predictor; count tables keyed by IP, query, and Query × AdId, each with a REST row, feed the predictor as aggregated features and IsBackoff flags alongside the original numeric features]
Train non-linear model on count-based features
Counts, transforms, lookup properties
Additional features can be injected
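As a rough illustration of "counts, transforms, lookup properties", the sketch below turns count-table lookups into a feature vector for the combiner. It assumes each table maps a key to (positives, negatives) and contains a REST row; the smoothed log-odds transform, the prior value, and the table contents are assumptions chosen for the example, and the training of the non-linear model itself is left out.

```python
import math


def count_features(table, key, prior=1.0):
    """Return [positives, negatives, smoothed log-odds, IsBackoff] for a key."""
    is_backoff = key not in table
    pos, neg = table["REST"] if is_backoff else table[key]
    log_odds = math.log((pos + prior) / (neg + prior))   # a transform
    return [float(pos), float(neg), log_odds, 1.0 if is_backoff else 0.0]


def featurize(row, tables, numeric_keys):
    """Original numeric features followed by count features per table."""
    features = [float(row[k]) for k in numeric_keys]
    for name, table in tables.items():
        features.extend(count_features(table, row[name]))
    return features


# Hypothetical toy count tables: key -> (positives, negatives).
tables = {
    "ip": {"131.107.65.14": (3, 1), "REST": (5, 95)},
    "query": {"REST": (2, 8)},
}
row = {"ip": "131.107.65.14", "query": "powder skis", "hour": 14}

vector = featurize(row, tables, numeric_keys=["hour"])
print(vector)
```

The IsBackoff flag lets the downstream model treat a value that fell back to the REST row differently from one with its own counts. Further injected features would simply be appended to the same vector.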

14 Prediction with counts
[Figure: timeline marked T train, T, and now; count tables keyed by IP, query, and URL × Country (e.g. "url 1, US"; "url 2, CA"; "url 3, FR"), each with a REST row, supply aggregated features and IsBackoff flags alongside the original numeric features]
Counts are updated continuously
Combiner re-training infrequent
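The split between continuously updated counts and an infrequently retrained combiner can be illustrated as follows. This is only a sketch: CountStore is a hypothetical in-memory stand-in for the count tables, and frozen_combiner is a fixed logistic function standing in for the trained non-linear model, with made-up weight and bias values.

```python
import math


class CountStore:
    """In-memory per-key (positives, negatives) counts, updated online."""

    def __init__(self):
        self.counts = {}

    def update(self, key, label):
        pos, neg = self.counts.get(key, (0, 0))
        self.counts[key] = (pos + 1, neg) if label else (pos, neg + 1)

    def lookup(self, key):
        return self.counts.get(key, (0, 0))


def frozen_combiner(pos, neg, weight=1.0, bias=0.0):
    """Stand-in for the trained combiner; its parameters never change here."""
    log_odds = math.log((pos + 1.0) / (neg + 1.0))
    return 1.0 / (1.0 + math.exp(-(weight * log_odds + bias)))


store = CountStore()
key = "131.107.65.14"

before = frozen_combiner(*store.lookup(key))   # no evidence yet
for _ in range(3):                             # three new positive events
    store.update(key, label=1)
after = frozen_combiner(*store.lookup(key))    # same model, fresher counts

print(before, after)
```

The prediction moves as new events arrive even though the combiner is untouched, which is the property that makes frequent retraining unnecessary.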

15 What is great about learning with Counts?
State-of-the-art accuracy
Good fit for map-reduce
Modular (vs. monolithic)
    Learner can be tuned/monitored/replaced in isolation
Monitorable, debuggable (this is HUGE in practice!)
    Temporal changes easy to monitor
    Easy emergency recovery (remove bot attacks, etc.)
Decomposable predictions
    Error debugging (which feature can we blame…)

16 Learning with Counts: in Azure ML

17 Putting it all together
HDInsight: large data storage and map-reduce processing
Azure ML: cloud ML and analytics accessible anywhere
Learning with Counts: intuitive, flexible large-scale ML solution

18 Thanks for your time
Useful links:
http://azure.microsoft.com/ml - Sign up for your free Azure ML trial
http://bit.ly/datasc_ebook - Free tutorial on how to use Azure ML
Need Azure ML for teaching in the classroom? Contact the speakers
Other questions? Contact the speakers
Speakers:
Misha Bilenko: mbilenko@Microsoft.com
Girish Nathan: ginathan@Microsoft.com

