Machine Learning at Scale using h2o

1 Machine Learning at Scale using h2o
Giri Tatavarty Data Science Manager – R&D dunnhumby inc

2 What is Machine Learning?
Machine learning is a field of computer science that draws on concepts from statistics, pattern recognition, artificial intelligence and computational learning theory. It is about algorithms that can teach themselves to recognize patterns in data without being explicitly programmed.
Supervised learning: The algorithm is presented with example inputs and their desired outputs, and the goal is to learn a general rule that maps inputs to outputs.
Examples: image tagging, voice recognition, fraud detection, time series forecasting, predictive analytics, spam detection, face recognition, fingerprint recognition, handwriting recognition, Netflix or Amazon recommendations

3 What is Machine Learning?
Unsupervised learning: No labels or examples are given to the learning algorithm, leaving the algorithm on its own to find structure in the data.
Examples: clustering, segmentation

4 What is Machine Learning?
Reinforcement learning: A computer program interacts with a dynamic environment in which it must achieve a certain goal (such as driving a vehicle or playing a game of chess), without a teacher explicitly telling it whether it has come close to its goal.

5 Machine Learning in R

6 Machine Learning with R
Different packages for different ML algorithms: xgboost, rpart, randomForest, bigrf, glmnet, gbm
The programmatic interface and parameter inputs are very different across these packages
Data needs to be standardized:
  Convert categorical text data to multiple binary columns
  Standardize / scale values
A hyperparameter optimization framework is missing from the standard packages
Working with big data is difficult, as these packages are limited by the size of a single working machine
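
The manual preparation steps listed above can be sketched in base R. This is only an illustration on a made-up data frame (the column names `store_type` and `spend` are hypothetical), using `model.matrix()` for the binary columns and `scale()` for standardization:

```r
# Toy data frame; the column names are invented for this sketch.
df <- data.frame(
  store_type = factor(c("small", "large", "medium", "large")),
  spend      = c(10, 250, 40, 300)
)

# Expand the factor into one 0/1 column per level. The "- 1" drops the
# intercept so that every level gets its own column.
dummies <- model.matrix(~ store_type - 1, data = df)

# Centre the numeric column to mean 0 and scale it to standard deviation 1.
spend_scaled <- as.numeric(scale(df$spend))

prepared <- cbind(as.data.frame(dummies), spend_scaled = spend_scaled)
print(prepared)
```

Each modelling package may expect a slightly different variant of this preparation, which is part of the friction the slide describes.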

7 h2o solves some of these problems
Standard interface for all the different algorithms
No need to standardize or convert categorical variables
Runs on big data. The same code can run on your laptop as well as on a multi-node cluster.
Generates production-ready, high-performance Java code for scoring
Can be used from other languages such as Python, via the REST API, or via the web interface for data exploration and analysis
Cons
Limited by the implementation of h2o (to be fair, h2o is open source)
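
A minimal sketch of what the standard interface looks like in practice, assuming a local h2o engine can be started and using the built-in iris data rather than anything from the deck. The three training functions take the same `x`, `y` and `training_frame` arguments; the tuning values are arbitrary choices for illustration:

```r
library(h2o)
h2o.init(nthreads = -1)

# Factor columns are passed through as categoricals; no manual encoding.
iris_hex   <- as.h2o(iris)
predictors <- setdiff(names(iris), "Species")
response   <- "Species"

# Same calling convention for three different algorithms.
glm_model <- h2o.glm(x = predictors, y = response,
                     training_frame = iris_hex, family = "multinomial")
gbm_model <- h2o.gbm(x = predictors, y = response,
                     training_frame = iris_hex, ntrees = 20, seed = 1)
dl_model  <- h2o.deeplearning(x = predictors, y = response,
                              training_frame = iris_hex,
                              hidden = c(10, 10), epochs = 10, seed = 1)

# Scoring is also uniform across model types.
for (m in list(glm_model, gbm_model, dl_model)) {
  print(head(h2o.predict(m, iris_hex)))
}
```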

8 H2o architecture

9 H2o architecture - II
Data is compressed, chunked and distributed across the nodes
Processing is done in a tree-based topology to minimize inter-node communication and to summarize the data locally as much as possible.
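
The idea of summarizing chunks locally and merging the summaries along a tree can be mimicked in base R. This is a conceptual sketch only, not H2O's actual internals; the helper `tree_reduce` and the choice of 8 chunks are hypothetical:

```r
set.seed(1)
x <- runif(1000)

# "Distribute": split the column into 8 chunks, as if held on different nodes.
chunks <- split(x, rep(1:8, length.out = length(x)))

# Local step: each chunk is summarised into a tiny (sum, count) pair,
# so only two numbers per chunk ever need to travel between nodes.
partials <- lapply(chunks, function(ch) c(sum = sum(ch), n = length(ch)))

# Merge partial summaries pairwise, level by level, like a tree.
tree_reduce <- function(parts, combine) {
  while (length(parts) > 1) {
    idx <- seq(1, length(parts), by = 2)
    parts <- lapply(idx, function(i) {
      if (i + 1 <= length(parts)) combine(parts[[i]], parts[[i + 1]]) else parts[[i]]
    })
  }
  parts[[1]]
}

total <- tree_reduce(partials, function(a, b) a + b)
distributed_mean <- unname(total["sum"] / total["n"])
print(distributed_mean)
```

With 8 chunks the merge takes 3 levels instead of 7 sequential steps, and the raw data never leaves its chunk.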

10 H2o Installation in R
# install the latest version from CRAN, or a specific version from Amazon
install.packages("h2o", type="source", repos=(c(" release.s3.amazonaws.com/h2o/rel-tibshirani/8/R")))
# or
install.packages("h2o")
# load the library and start up a local h2o engine
library(h2o)
localH2O = h2o.init(nthreads=-1)
# Run the demo
demo(h2o.kmeans)

11

12 Baby Steps - Iris dataset
150 examples
4 attributes
3 classes

13 Task : Train the model to predict species (class) of the flower based on attributes

14 Model Statistics – Open browser http://localhost:54321 ( Free Plots)

15 Step 2: Split data into training and test datasets; Report performance on unseen data
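
A minimal sketch of this step on the iris data, assuming a local h2o engine is available. The algorithm (a random forest), the 80/20 ratio and the seed are choices made for this illustration; the deck does not show which settings were used:

```r
library(h2o)
h2o.init(nthreads = -1)

iris_hex <- as.h2o(iris)

# Roughly 80% of rows for training, the rest held out.
# The split is probabilistic, so the sizes are approximate.
parts <- h2o.splitFrame(iris_hex, ratios = 0.8, seed = 42)
train <- parts[[1]]
test  <- parts[[2]]

model <- h2o.randomForest(
  x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  y = "Species",
  training_frame = train,
  ntrees = 50,
  seed = 42
)

# Performance on rows the model has never seen.
perf <- h2o.performance(model, newdata = test)
print(h2o.confusionMatrix(perf))
```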

16

17 Big Data Test Case – http://www.dunnhumby.com/sourcefiles.aspx
~50 GB
~300M rows
22 columns

18 DataSet 1 - Transactions

19 Predict whether a customer is going to visit the store next week, based on previous visits
If yes, then predict what they are likely to buy and activate the necessary promotion channels.

20 How do we go about predictions
Create a model which takes your spend over the previous 12 weeks (or n weeks) to predict a visit in the current week
Create a training data set
Create a test data set
Train the model on the training data set
Test the predictions on the test data set
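
One way to build such training and test sets is sketched below in base R. The weekly spend matrix is randomly generated, and the helper `make_examples` and the column names `spend_lag*` and `visited` are hypothetical; the deck's real feature pipeline is not shown in the transcript:

```r
# Made-up weekly spend: one row per customer, one column per week.
set.seed(7)
n_customers <- 5
n_weeks <- 14
spend <- matrix(rpois(n_customers * n_weeks, lambda = 1) * 10,
                nrow = n_customers,
                dimnames = list(paste0("cust", 1:n_customers),
                                paste0("wk", 1:n_weeks)))

# The n_prev weeks before target_week become the features; the label is
# whether the customer spent anything in target_week.
make_examples <- function(spend, target_week, n_prev = 12) {
  prev <- spend[, (target_week - n_prev):(target_week - 1), drop = FALSE]
  colnames(prev) <- paste0("spend_lag", n_prev:1)
  data.frame(prev, visited = as.integer(spend[, target_week] > 0))
}

train_set <- make_examples(spend, target_week = 13)  # features: weeks 1 to 12
test_set  <- make_examples(spend, target_week = 14)  # features: weeks 2 to 13
print(train_set)
```

Shifting the window by one week for the test set keeps the target week of the test data out of training.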

21 Large dataset ingestion on laptop

22 A bit of data munging

23 Plots from R using h2o.hist
h2o.hist(subset.hex$SPEND[subset.hex$SPEND < 30])

24 Summarize Data and create features - h2o.group_by
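
A small sketch of how `h2o.group_by` can roll transactions up into per-customer, per-week features, assuming a local h2o engine is available. The tiny table is invented; `SPEND` appears in the deck's `h2o.hist` call, while `CUST_CODE` and `SHOP_WEEK` are assumed column names:

```r
library(h2o)
h2o.init(nthreads = -1)

# Tiny stand-in for the transactions table.
# CUST_CODE is a factor so that h2o treats it as a categorical column.
tx <- data.frame(
  CUST_CODE = factor(c("A", "A", "B", "B", "B", "C")),
  SHOP_WEEK = c(200601, 200601, 200601, 200602, 200602, 200602),
  SPEND     = c(5.0, 2.5, 10.0, 1.0, 4.0, 7.5)
)
tx_hex <- as.h2o(tx)

# Total spend and number of transaction rows per customer and week.
weekly <- h2o.group_by(
  data = tx_hex,
  by = c("CUST_CODE", "SHOP_WEEK"),
  sum("SPEND"),
  nrow("SPEND")
)
print(weekly)
```

The aggregation runs inside the h2o engine, so the full transaction table does not have to fit in the R session.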

25 Final Dataset before ML Model

26 Creating a Random Forest Model

27 Using Flow to explore the Model

28 Change and explore thresholds
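
What changing the threshold does can be illustrated in base R with made-up scores and labels (the numbers and the helper `metrics_at` are hypothetical, not output from the deck's model). Raising the threshold trades recall for precision:

```r
# Invented predicted probabilities and true labels for ten customers.
prob   <- c(0.95, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05)
actual <- c(1,    1,    0,    1,    1,    0,    1,    0,    0,    0)

# Precision and recall when everything scoring at or above the
# threshold is predicted as a visit.
metrics_at <- function(threshold) {
  predicted <- as.integer(prob >= threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  c(threshold = threshold,
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

results <- t(sapply(c(0.25, 0.50, 0.75), metrics_at))
print(results)
```

On this toy data, a threshold of 0.25 catches every visitor at a precision of 5/7, while 0.75 gives a precision of 1 but a recall of only 0.4.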

29 Logistic Regression – h2o.glm

30 Gradient Boost Machines

31 Deep Learning and Neural Networks

32 Machine Learning Algorithms supported
K-Means
GLM (Generalized Linear Models)
DRF (Distributed Random Forest)
Naïve Bayes
PCA (Principal Component Analysis)
GBM (Gradient Boosting)
Deep Learning

33 Meta Learning
Grid Search & non-negative least squares (NNLS)
Ensemble Models
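
A minimal grid search sketch with `h2o.grid`, assuming a local h2o engine is available. The iris data, the GBM algorithm, the grid id and the hyperparameter values are all choices made for this illustration:

```r
library(h2o)
h2o.init(nthreads = -1)

iris_hex <- as.h2o(iris)
parts <- h2o.splitFrame(iris_hex, ratios = 0.8, seed = 1)

# Train one GBM per combination of max_depth and ntrees (2 x 2 = 4 models).
grid <- h2o.grid(
  algorithm = "gbm",
  grid_id = "iris_gbm_grid",   # arbitrary identifier chosen for this sketch
  x = setdiff(names(iris), "Species"),
  y = "Species",
  training_frame = parts[[1]],
  validation_frame = parts[[2]],
  hyper_params = list(max_depth = c(2, 4), ntrees = c(10, 30)),
  seed = 1
)

# Rank the models by validation logloss, lowest first, and fetch the best.
ranked <- h2o.getGrid("iris_gbm_grid", sort_by = "logloss", decreasing = FALSE)
print(ranked)
best <- h2o.getModel(ranked@model_ids[[1]])
```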

34 Resources
http://h2o.ai
http://bit.ly/1Qh79Xr h2o booklet on R
mby_-_Let_s_Get_Sort-of-Real_User_Guide.pdf

