1
Machine Learning at Scale using h2o
Giri Tatavarty, Data Science Manager – R&D, dunnhumby inc
2
What is Machine Learning?
It is a field of computer science that draws on concepts from statistics, pattern recognition, artificial intelligence and computational learning theory. It is about algorithms that can teach themselves to recognize patterns in data without being explicitly programmed.
Supervised learning: The algorithm is presented with example inputs and their desired outputs, and the goal is to learn a general rule that maps inputs to outputs.
Examples: image tagging, voice recognition, fraud detection, time series forecasting, predictive analytics, spam detection, face recognition, fingerprint recognition, handwriting recognition, Netflix or Amazon recommendations
3
What is Machine Learning?
Unsupervised learning: No labels or examples are given to the learning algorithm, leaving the algorithm on its own to find structure in the data. Examples: clustering, segmentation
4
What is Machine Learning?
Reinforcement learning: A computer program interacts with a dynamic environment in which it must achieve a certain goal (such as driving a vehicle or playing a game of chess), without a teacher explicitly telling it whether it has come close to its goal.
5
Machine Learning in R
6
Machine Learning with R
Different ML algorithms live in different packages: xgboost, rpart, randomForest, bigrf, glmnet, gbm.
The programmatic interfaces and parameter inputs are very different across these packages.
Data needs to be standardized: categorical text data must be converted to multiple binary columns, and numeric values must be standardized or scaled.
A hyperparameter optimization framework is missing from the standard packages.
Working with big data is difficult, as these packages are limited by the size of a single working machine.
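As a rough illustration of the manual preparation these packages typically expect, here is a minimal base R sketch. The toy data frame and its column names (`colour`, `spend`) are made up for this example and do not come from the presentation.

```r
# Hypothetical toy data; the column names are illustrative only.
df <- data.frame(
  colour = factor(c("red", "green", "blue", "green")),
  spend  = c(10, 20, 30, 40)
)

# One binary column per category level ("- 1" drops the intercept so
# that every level gets its own column).
dummies <- model.matrix(~ colour - 1, data = df)

# Centre to mean 0 and scale to standard deviation 1.
spend_scaled <- as.numeric(scale(df$spend))

prepared <- cbind(as.data.frame(dummies), spend_scaled = spend_scaled)
print(prepared)
```

Each package may want a slightly different variant of this step, which is the friction the slide is pointing at.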
7
h2o solves some of these problems
Standard interface for all the different algorithms.
No need to standardize or convert categorical variables.
Runs on big data: the same code can run on your laptop as well as on a cluster of nodes.
Generates production-ready, high-performance Java code for scoring.
Can work with other languages such as Python, with the REST API, or with the web interface for data exploration and analysis.
Cons: limited by what h2o implements (to be fair, h2o is open source).
8
H2o architecture
9
H2o architecture - II
Data is compressed, chunked and distributed across the nodes. Processing is done in a tree-based topology to minimize inter-node communication and to summarize the data locally as much as possible.
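The idea of summarizing locally and then combining results up a tree can be sketched in plain R. This is only a conceptual illustration under simple assumptions (four in-memory chunks standing in for nodes, a hypothetical `tree_reduce` helper); it is not h2o's actual implementation.

```r
# Hypothetical data split into chunks, one per "node".
values <- 1:100
chunks <- split(values, rep(1:4, each = 25))

# Step 1: each node summarizes its own chunk locally.
local_summaries <- lapply(chunks, function(x) c(sum = sum(x), n = length(x)))

# Step 2: merge summaries pairwise, level by level, like a binary tree.
tree_reduce <- function(parts, combine) {
  while (length(parts) > 1) {
    merged <- list()
    for (i in seq(1, length(parts), by = 2)) {
      if (i < length(parts)) {
        merged[[length(merged) + 1]] <- combine(parts[[i]], parts[[i + 1]])
      } else {
        merged[[length(merged) + 1]] <- parts[[i]]  # odd one out moves up
      }
    }
    parts <- merged
  }
  parts[[1]]
}

total <- tree_reduce(local_summaries, function(a, b) a + b)
global_mean <- total[["sum"]] / total[["n"]]
print(global_mean)
```

Only the small `sum` and `n` summaries travel between nodes, never the raw rows, which is why this pattern keeps communication low.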
10
H2o Installation in R
# install the latest version from CRAN
install.packages("h2o")
# or install a specific version from Amazon
install.packages("h2o", type="source", repos=(c(" release.s3.amazonaws.com/h2o/rel-tibshirani/8/R")))
# load the library and start up a local h2o engine
library(h2o)
localH2O = h2o.init(nthreads=-1)
# Run the demo
demo(h2o.kmeans)
12
Baby Steps - Iris dataset
150 examples, 4 attributes, 3 classes
13
Task : Train the model to predict species (class) of the flower based on attributes
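The slide's own code is not part of this transcript. A minimal sketch of the task could look like the following, assuming the `h2o` package is installed and a local engine can be started. The choice of random forest and the `ntrees` and `seed` values are assumptions made for illustration.

```r
library(h2o)
h2o.init(nthreads = -1)

# iris ships with R; as.h2o copies it into the h2o cluster.
iris.hex <- as.h2o(iris)

# Predict Species from the four measurement columns.
predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
model <- h2o.randomForest(
  x = predictors,
  y = "Species",
  training_frame = iris.hex,
  ntrees = 50,
  seed = 1
)

print(model)
```

Note that `Species` is passed as a factor column without any manual encoding.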
14
Model Statistics – Open browser http://localhost:54321 ( Free Plots)
15
Step 2: Split data into training and test datasets; Report performance on unseen data
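One way to sketch this step, again on iris and with an assumed 80/20 split and an assumed random forest model (the slide's own code is not in the transcript):

```r
library(h2o)
h2o.init(nthreads = -1)

iris.hex <- as.h2o(iris)

# Roughly 80% of rows for training, the rest held out for testing.
splits <- h2o.splitFrame(iris.hex, ratios = 0.8, seed = 1)
train <- splits[[1]]
test  <- splits[[2]]

model <- h2o.randomForest(
  x = setdiff(names(iris.hex), "Species"),
  y = "Species",
  training_frame = train,
  seed = 1
)

# Performance on rows the model has not seen.
perf <- h2o.performance(model, newdata = test)
print(h2o.confusionMatrix(perf))
```

`h2o.splitFrame` splits probabilistically, so the two parts are only approximately 80% and 20% of the rows.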
17
Big Data Test Case – http://www.dunnhumby.com/sourcefiles.aspx
~50 GB, ~300M rows, 22 columns
18
DataSet 1 - Transactions
19
Predict whether a customer is going to visit the store next week, based on previous visits
If yes, then predict what they are likely to buy and activate the necessary promotion channels.
20
How do we go about predictions
Create a model which takes your spend over the previous 12 weeks (or n weeks) to predict a visit in the current week.
Create a training data set.
Create a test data set.
Train the model on the training data set.
Test the predictions on the test data set.
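The steps above can be sketched in base R on made-up data. Everything here is hypothetical: the random spend matrix, the `make_dataset` helper, and the choice of weeks 13 and 14 as targets are assumptions for illustration, not the presenter's method.

```r
# Hypothetical weekly spend per customer (rows = customers, columns = weeks).
# A spend of 0 means the customer did not visit that week.
set.seed(1)
n_customers <- 6
n_weeks <- 14
spend <- matrix(
  round(runif(n_customers * n_weeks, 0, 50)) *
    rbinom(n_customers * n_weeks, 1, 0.6),
  nrow = n_customers,
  dimnames = list(paste0("cust", 1:n_customers), paste0("week", 1:n_weeks))
)

# Features: spend in the n weeks before `target_week`.
# Label: whether the customer visited (spent anything) in `target_week`.
make_dataset <- function(spend, target_week, n = 12) {
  stopifnot(target_week > n, target_week <= ncol(spend))
  features <- spend[, (target_week - n):(target_week - 1), drop = FALSE]
  colnames(features) <- paste0("spend_lag", n:1)
  data.frame(
    features,
    visited = factor(spend[, target_week] > 0, levels = c(FALSE, TRUE))
  )
}

train <- make_dataset(spend, target_week = 13)  # learn from week 13
test  <- make_dataset(spend, target_week = 14)  # evaluate on week 14
print(head(train))
```

Splitting by target week rather than by random rows keeps the test label strictly later than anything the model was trained to predict.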
21
Large dataset ingestion on laptop
22
A bit of data munging
23
Plots from R using h2o.hist
h2o.hist(subset.hex$SPEND[subset.hex$SPEND < 30])
24
Summarize Data and create features - h2o.group_by
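The slide's own code is not part of this transcript. A minimal sketch of `h2o.group_by` on a tiny stand-in frame might look like this; the `CUST_CODE` and `WEEK` column names and all values are hypothetical, and only `SPEND` appears elsewhere in the presentation.

```r
library(h2o)
h2o.init(nthreads = -1)

# Hypothetical stand-in for the transactions frame.
tx <- data.frame(
  CUST_CODE = factor(c("a", "a", "b", "b", "b", "c")),
  WEEK      = c(1, 2, 1, 1, 2, 2),
  SPEND     = c(5.0, 7.5, 2.0, 3.0, 10.0, 1.5)
)
tx.hex <- as.h2o(tx)

# One row per customer and week: total spend and number of transactions.
weekly.hex <- h2o.group_by(
  data = tx.hex,
  by = c("CUST_CODE", "WEEK"),
  sum("SPEND"),
  nrow("SPEND")
)
print(weekly.hex)
```

The aggregation runs inside the h2o cluster, so the raw transactions do not have to fit in R's memory.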
25
Final Dataset before ML Model
26
Creating a Random Forest Model
27
Using Flow to explore the Model
28
Change and explore thresholds
29
Logistic Regression – h2o.glm
30
Gradient Boosting Machines
31
Deep Learning and Neural Networks
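The code for the logistic regression, gradient boosting and deep learning slides is not part of this transcript. As a hedged sketch of how the three calls share one interface, the example below uses `mtcars` (which ships with R) as a small binary problem. The dataset and all parameter values are assumptions made for illustration.

```r
library(h2o)
h2o.init(nthreads = -1)

# Small binary problem from a dataset that ships with R:
# predict transmission type (am) from the other mtcars columns.
cars <- mtcars
cars$am <- factor(cars$am)
cars.hex <- as.h2o(cars)

y <- "am"
x <- setdiff(names(cars.hex), y)

# The same x / y / training_frame arguments drive each algorithm.
glm_model <- h2o.glm(x = x, y = y, training_frame = cars.hex,
                     family = "binomial")
gbm_model <- h2o.gbm(x = x, y = y, training_frame = cars.hex,
                     ntrees = 20, seed = 1)
dl_model  <- h2o.deeplearning(x = x, y = y, training_frame = cars.hex,
                              hidden = c(10, 10), epochs = 10, seed = 1)

# Training AUC for each model (optimistic; use held-out data in practice).
for (m in list(glm_model, gbm_model, dl_model)) {
  print(h2o.auc(m))
}
```

Only the function name and the algorithm-specific parameters change between the three calls.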
32
Machine Learning Algorithms supported
K-Means
GLM (Generalized Linear Models)
DRF (Distributed Random Forest)
Naïve Bayes
PCA (Principal Component Analysis)
GBM (Gradient Boosting)
Deep Learning
33
Meta Learning Grid Search & non-negative least squares (NNLS)
Ensemble Models
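For the grid search part, a minimal sketch with `h2o.grid` could look like the following. The dataset (iris), the grid id `iris_gbm_grid`, and the hyperparameter values are all assumptions chosen for illustration; ensembles with NNLS are not shown here.

```r
library(h2o)
h2o.init(nthreads = -1)

iris.hex <- as.h2o(iris)
splits <- h2o.splitFrame(iris.hex, ratios = 0.8, seed = 1)

# Try every combination of these (arbitrarily chosen) values.
hyper_params <- list(max_depth = c(2, 4), ntrees = c(10, 30))

grid <- h2o.grid(
  algorithm = "gbm",
  grid_id = "iris_gbm_grid",
  x = setdiff(names(iris.hex), "Species"),
  y = "Species",
  training_frame = splits[[1]],
  validation_frame = splits[[2]],
  hyper_params = hyper_params,
  seed = 1
)

# Rank the four models by validation log loss, best first.
ranked <- h2o.getGrid("iris_gbm_grid", sort_by = "logloss", decreasing = FALSE)
print(ranked)
```

Two values for each of two parameters gives four models; larger grids grow multiplicatively, so the ranges are worth keeping small at first.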
34
Resources
http://h2o.ai
http://bit.ly/1Qh79Xr h2o booklet on R
mby_-_Let_s_Get_Sort-of-Real_User_Guide.pdf