Machine Learning and Hadoop Present and Future Josh Wills Cloudera Data Science Team February 7th, 2012.


1 Machine Learning and Hadoop Present and Future Josh Wills Cloudera Data Science Team February 7th, 2012

2 Today’s Speaker – Josh Wills (jwills@cloudera.com). Formerly of Google (2008 – 2011): worked on the ad auction; led the team that built the data infrastructure for Google+. Before that: a bunch of startups, sometimes as a software engineer, sometimes as a statistician. Math degree from Duke and a half-finished PhD from The University of Texas at Austin. Now: Director of Data Science at Cloudera. Copyright 2012 Cloudera Inc. All rights reserved

3 High Availability for Data Scientists NIPS

4 Outline. Part 1: Industrial Machine Learning. Part 2: Machine Learning and Hadoop (State of the World; Where Things Are Headed). Part 3: Offline/Online, Batch/Real-Time.

5 Industrial Machine Learning

6 Delta One: Model Evaluation. Machine Learning is One Piece of a Complex System: well-defined objective functions are the exception; multiple, often conflicting goals; weights are fuzzy and shift with business priorities; Pareto optimization is the safest play. Predictive Accuracy Is Only Useful Up to a Point. Examples: computational advertising; friend recommendations on social networks.
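The point about Pareto optimization under multiple, conflicting goals can be illustrated with a small sketch. The candidate names, the two goals, and every score below are made up for illustration; the only assumption is that each goal is "higher is better". The sketch keeps every candidate that no other candidate beats on all goals at once, without committing to fixed weights.

```python
from typing import Dict, List, Tuple

# Hypothetical candidates scored on two goals (higher is better on both).
# Tuple layout: (revenue_score, user_satisfaction_score); values are invented.
candidates: Dict[str, Tuple[float, float]] = {
    "model_a": (0.80, 0.60),
    "model_b": (0.75, 0.70),
    "model_c": (0.70, 0.65),  # worse than model_b on both goals
    "model_d": (0.85, 0.50),
}


def dominates(x: Tuple[float, ...], y: Tuple[float, ...]) -> bool:
    """True if x is at least as good as y on every goal and strictly better on one."""
    at_least_as_good = all(a >= b for a, b in zip(x, y))
    strictly_better = any(a > b for a, b in zip(x, y))
    return at_least_as_good and strictly_better


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the sorted names of candidates that no other candidate dominates."""
    front = []
    for name, score in scores.items():
        dominated = any(
            dominates(other, score)
            for other_name, other in scores.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


front = pareto_front(candidates)
print(front)
```

Choosing among the surviving candidates is still a business decision; the sketch only removes options that are worse on every goal.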

7 Delta Two: Systems Precede Algorithms. Greenfield Projects Hardly Ever Happen (and don’t usually launch). Industrial Computational Infrastructure: general-purpose, cheap, shared. Constraints Drive Innovation: Vowpal Wabbit, Hashing Trick, SETI @ Google.
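As a rough illustration of the hashing trick mentioned on this slide, the sketch below maps an arbitrary set of string features into a fixed-size vector, so memory stays bounded no matter how many distinct features appear. It is not Vowpal Wabbit's implementation: MD5 is used here only as a convenient deterministic stand-in hash, and the function name, bucket count, and signed-update choice are assumptions made for the example.

```python
import hashlib
from typing import Iterable, List


def hashed_features(tokens: Iterable[str], num_buckets: int = 16) -> List[float]:
    """Map string features into a fixed-length vector via hashing.

    Assumption: MD5 is a stand-in for whatever fast hash a real system uses.
    """
    vec = [0.0] * num_buckets
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit integer from the digest
        index = h % num_buckets                # bucket chosen from the hash value
        # Use the top bit as a sign so that colliding features tend to cancel
        # rather than always add up.
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0
        vec[index] += sign
    return vec


example = hashed_features(["ad", "ad", "query"], num_buckets=8)
print(example)
```

The vector length is fixed up front, which is the constraint-driven part: no feature dictionary has to be built or shared between machines.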

8 Delta Three: Workflow. Practice Over Theory Blog

9 Delta Three: Workflow. Optimize the Overall Process: model fitting is a small piece of the overall flow time; parallelize everything; Better Features > Better Models. Fast Model Deployment: common feature extraction logic; servable models. Validation as Sanity Checking: deploy to a small subset of real data and evaluate.
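One way to read "common feature extraction logic" is that the batch training path and the serving path call the same function, so the two can never drift apart. The sketch below assumes a hypothetical record layout with `clicks` and `device` fields; the function names are illustrative and not taken from the talk.

```python
import math
from typing import Dict, Iterable, List


def extract_features(record: Dict[str, object]) -> Dict[str, float]:
    """Single source of truth for feature logic (field names are hypothetical)."""
    clicks = float(record.get("clicks", 0))
    return {
        "log_clicks": math.log1p(clicks),
        "is_mobile": 1.0 if record.get("device") == "mobile" else 0.0,
    }


def batch_features(records: Iterable[Dict[str, object]]) -> List[Dict[str, float]]:
    """Offline path: applied to logged records, e.g. inside a batch job."""
    return [extract_features(r) for r in records]


def serve_features(request: Dict[str, object]) -> Dict[str, float]:
    """Online path: applied to a single request at serving time."""
    return extract_features(request)


logged = {"clicks": 3, "device": "mobile"}
print(batch_features([logged])[0] == serve_features(logged))
```

Because both paths delegate to `extract_features`, a change to the feature definition reaches training and serving at the same time.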

10 Outline. Part 1: Industrial Machine Learning. Part 2: Machine Learning and Hadoop (State of the World; Where Things Are Headed). Part 3: Offline/Online, Batch/Real-Time.

11 “Hadoop. It’s Where The Data Is.”

12 Hadoop Platform: Substrate. Commodity servers: Open Compute. Open source operating system: Linux. Open source configuration management: Puppet, Chef. Coordination service: ZooKeeper.

13 Hadoop Platform: Storage. Distributed schema-less storage: HDFS, Ceph. Append-only storage formats and metadata: Avro, RCFile, HCatalog. Mutable key-value storage and metadata: HBase.

14 Hadoop Platform: Integration. Tool Access: FUSE, JDBC, ODBC. Data Ingestion: Flume, Sqoop.

15 ML and Hadoop: The State of the World

16 Computation: Plain Old MapReduce. Great for: Data Preparation, Feature Engineering, Model Validation/Evaluation. Works For Certain Model Fitting Problems: Recommendation Systems, Expectation Maximization, Decision Trees (PLANET; Gradient Boosted Decision Trees). Not A Practical Option for Online Learning. Way More Detail from the KDD 2011 Talk.
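To make the "great for feature engineering" claim concrete, here is a single-process sketch of the MapReduce pattern: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group. It does not use Hadoop at all; the log format of (user, event) pairs and the function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Key = Tuple[str, str]


def map_phase(record: Tuple[str, str]) -> Iterator[Tuple[Key, int]]:
    """Emit ((user, event), 1) for each log record."""
    user, event = record
    yield (user, event), 1


def reduce_phase(key: Key, values: List[int]) -> Tuple[Key, int]:
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_mapreduce(records: Iterable[Tuple[str, str]]) -> Dict[Key, int]:
    """Run map, shuffle (group by key), and reduce in one process."""
    groups: Dict[Key, List[int]] = defaultdict(list)
    for rec in records:
        for key, value in map_phase(rec):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reduce_phase(k, vs) for k, vs in sorted(groups.items()))


log = [("u1", "click"), ("u1", "view"), ("u1", "click"), ("u2", "view")]
counts = run_mapreduce(log)
print(counts)
```

Per-user event counts like these are a typical input feature; on a cluster the same map and reduce functions would run in parallel over partitions of the log.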

17 Tools for Data Preparation/Feature Engineering. Languages/Environments: PigLatin, HiveQL. Need to deal with mismatch between offline/online feature generation. Java/Scala APIs: Crunch (Cloudera), Scoobi (NICTA), Cascading (Concurrent), Jaql (IBM).

18 Apache Mahout. The starting place for MapReduce-based machine learning algorithms. Not machine-learning-in-a-box: custom tweaks/modifications are the rule. A disparate collection of algorithms for: Recommendations, Clustering, Classification, Frequent Itemset Mining.

19 Apache Mahout (cont.). Best Library: Taste Recommender. Oldest project, most widely-deployed in production; SVD implementation is particularly active. Good Libraries: Online SGD. Does not use MapReduce; Vowpal Wabbit is faster, has L-BFGS option. Roll Your Own Instead: Naïve Bayes. Challenges: “Secret sauce” effect; delta between Mahout and the cutting edge in ML.
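For readers unfamiliar with online SGD, the sketch below shows the basic idea for logistic regression: the model sees one example at a time and nudges its weights after each one, which is why it needs no MapReduce pass over the whole dataset. This is a plain-Python illustration, not Mahout's or Vowpal Wabbit's code; the class name, learning rate, and toy data stream are assumptions.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogisticRegression:
    """Logistic regression trained one example at a time (illustrative only)."""

    def __init__(self, num_features: int, learning_rate: float = 0.1) -> None:
        self.w: List[float] = [0.0] * num_features
        self.lr = learning_rate

    def predict(self, x: Sequence[float]) -> float:
        """Predicted probability that the label is 1."""
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def update(self, x: Sequence[float], y: int) -> None:
        """One SGD step on the log loss for a single (x, y) example."""
        error = self.predict(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]


# Toy stream: first feature is a constant bias term, second decides the label.
model = OnlineLogisticRegression(num_features=2)
stream = [([1.0, 1.0], 1), ([1.0, -1.0], 0)] * 200
for x, y in stream:
    model.update(x, y)
print(model.predict([1.0, 1.0]), model.predict([1.0, -1.0]))
```

Each update touches only the current example, so the same loop works on a live stream as well as on a file read once from disk.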

20 More Machine Learning Interfaces for Hadoop. Based on MapReduce: SystemML (IBM). R-Based Systems (Augment MapReduce with R): Segue, RHIPE, RHadoop, Ricardo (IBM).

21 ML and Hadoop: Where Things are Headed

22 MRv2 and YARN. Eliminates JobTracker bottleneck: separate Resource Manager/Scheduler; individual jobs have their own task masters; no more map slots and reduce slots. Moves MapReduce into user-land: Hadoop clusters can run all sorts of jobs. Will also allow fine-grained resource allocation: CPU, Memory, Disk.

23 YARN Job Flows

24 The Contenders

25 AllReduce. Developed at Yahoo! Research. Defines the allreduce operation: N machines each have a number => each machine has the sum of the numbers. At the heart of Vowpal Wabbit’s performance. Implemented in C++. Can be patched into Apache Hadoop and used today.
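The allreduce operation described on this slide can be simulated in a single process. The sketch below arranges N workers as a binary tree, sums values up toward the root, then hands the total back to every worker. It is only a model of the data flow, not the C++ implementation the slide refers to; the tree layout and function name are assumptions.

```python
from typing import List


def allreduce_sum(values: List[float]) -> List[float]:
    """Simulate allreduce: every worker ends up holding the sum of all values.

    Workers are indexed 0..N-1 and arranged as a binary tree in which worker i
    has parent (i - 1) // 2. This layout is an assumption for illustration.
    """
    n = len(values)
    if n == 0:
        return []
    partial = list(values)
    # Reduce phase: walk from the last worker toward the root. Children have
    # larger indices than their parent, so each worker's partial sum is
    # complete before it is passed up.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] += partial[i]
    total = partial[0]
    # Broadcast phase: the root's total is sent back down to every worker.
    return [total] * n


result = allreduce_sum([1, 2, 3, 4])
print(result)
```

In a learning setting the "number" on each machine would typically be a locally computed quantity such as a gradient component, so that after the call every machine can take the same step.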

26 Spark. Developed at Berkeley’s AMP Lab. Defines operations on distributed in-memory collections. Written in Scala. Supports reading from and writing to HDFS.

27 GraphLab. Developed at CMU. Lower-level primitives (but higher than MPI). Map/Reduce => Update/Sort. Flexible, allows for asynchronous computations*. C++/Java/Python/Matlab.

28 Outline. Part 1: Industrial Machine Learning. Part 2: Machine Learning and Hadoop (State of the World; Where Things Are Headed). Part 3: Offline/Online, Batch/Real-Time.

29 Offline vs. Online Learning

30 Batch vs. Real-Time: The CAP Theorem. Impossible for a distributed computer system to simultaneously provide: Consistency, Availability, Partition Tolerance. Instead, we end up with BASE: Basically Available, Soft State, Eventual consistency. High availability. Cleanup mechanism for providing consistency (eventually).

31 Nathan Marz: Beating the CAP Theorem

32 Models as Queries

33 Collapsing Distinctions

34 Systems Drive Algorithms, Redux

35 Questions? Want A Job? jwills@cloudera.com

