Machine Learning at Scale using h2o

Giri Tatavarty Data Science Manager – R&D dunnhumby inc

What is Machine Learning?
Machine learning is a field of computer science that draws on statistics, pattern recognition, artificial intelligence, and computational learning theory. It is about algorithms that learn to recognize patterns in data without being explicitly programmed.
Supervised learning: The algorithm is presented with example inputs and their desired outputs, and the goal is to learn a general rule that maps inputs to outputs.
Examples: image tagging, voice recognition, fraud detection, time series forecasting, predictive analytics, spam detection, face recognition, fingerprint recognition, handwriting recognition, Netflix or Amazon recommendations.

What is Machine Learning?
Unsupervised learning: No labels or examples are given to the learning algorithm, leaving the algorithm on its own to find structure in the data.
Examples: clustering, segmentation.

What is Machine Learning?
Reinforcement learning: A computer program interacts with a dynamic environment in which it must achieve a certain goal (such as driving a vehicle or playing a game of chess), without a teacher explicitly telling it whether it has come close to its goal.

Machine Learning in R

Machine Learning with R
Different packages for different ML algorithms: xgboost, rpart, randomForest, bigrf, glmnet, gbm
The programmatic interfaces and parameter inputs are very different across these packages
Data needs to be prepared by hand:
  Convert categorical text data to multiple binary columns
  Standardize or scale numeric values
The standard packages lack a hyperparameter optimization framework
Working with big data is difficult, as these packages are limited by the resources of a single machine
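The two manual preparation steps named above (expanding a categorical column into binary columns, and standardizing numeric values) can be sketched in a few lines. This is an illustration in plain Python rather than R, and the column names and values are hypothetical, not taken from the talk.

```python
from statistics import mean, pstdev

def one_hot(values):
    """Expand a categorical column into one 0/1 column per distinct level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}

def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical toy columns, used only for illustration
store_type = ["metro", "super", "metro", "express"]
spend = [10.0, 20.0, 30.0, 40.0]

encoded = one_hot(store_type)   # one binary column per store type
scaled = standardize(spend)     # zero mean, unit standard deviation
```

Each R package tends to expect its own variant of this preparation, which is the friction the slide describes.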

h2o solves some of these problems
Standard interface for all the different algorithms
No need to standardize or convert categorical variables
Runs on big data: the same code can run on your laptop as well as on a multi-node cluster
Generates production-ready, high-performance Java code for scoring
Can be used from other languages such as Python, through the REST API, or through the web interface for data exploration and analysis
Cons
Limited by the implementations h2o provides (to be fair, h2o is open source)

H2o architecture

H2o architecture - II
Data is compressed, chunked, and distributed across the nodes
Processing is done in a tree-based topology to minimize inter-node communication and to summarize the data locally as much as possible
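The idea of summarizing locally and then combining along a tree can be sketched as follows. This is a conceptual illustration in plain Python, not h2o's actual implementation: each inner list stands in for a chunk held on some node, and the function names are hypothetical.

```python
def local_summary(chunk):
    """Summarize one chunk where it lives: (row count, sum of values)."""
    return (len(chunk), sum(chunk))

def combine(a, b):
    """Merge two partial summaries into one."""
    return (a[0] + b[0], a[1] + b[1])

def tree_reduce(summaries):
    """Combine summaries pairwise, level by level, until one remains.

    Assumes at least one summary is passed in.
    """
    level = list(summaries)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(combine(level[i], level[i + 1]))
        if len(level) % 2 == 1:
            # An unpaired summary moves up to the next level unchanged.
            next_level.append(level[-1])
        level = next_level
    return level[0]

# Hypothetical chunks of one numeric column, spread over five "nodes"
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0], [7.0, 8.0, 9.0, 10.0], [11.0]]
count, total = tree_reduce([local_summary(c) for c in chunks])
global_mean = total / count
```

Only the small (count, sum) pairs travel between nodes, never the raw rows, which is why this pattern keeps inter-node traffic low.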

H2o Installation in R
# Install a specific version from Amazon S3
install.packages("h2o", type="source", repos=(c(" release.s3.amazonaws.com/h2o/rel-tibshirani/8/R")))
# or install the latest version from CRAN
install.packages("h2o")
# Load the library and start up a local h2o engine
library(h2o)
localH2O = h2o.init(nthreads=-1)
# Run the demo
demo(h2o.kmeans)

Baby Steps - Iris dataset
150 examples, 4 attributes, 3 classes (setosa, versicolor, virginica)

Task : Train the model to predict species (class) of the flower based on attributes

Model Statistics – Open a browser at http://localhost:54321 (free plots)

Step 2: Split data into training and test datasets; Report performance on unseen data
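A minimal sketch of this step, assuming a simple random split and plain accuracy as the performance measure. It is written in plain Python for illustration; in h2o's R API the split itself would normally be done with h2o.splitFrame. The 75/25 ratio, the seed, and the function names are assumptions, not values from the talk.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

# 150 row indices stand in for the 150 Iris examples
rows = list(range(150))
train, test = train_test_split(rows)
```

The model is fitted on train only, and accuracy is then computed on test, so the reported number reflects data the model has not seen.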

Big Data Test Case – http://www.dunnhumby.com/sourcefiles.aspx
~50 GB, ~300M rows, 22 columns

DataSet 1 - Transactions

Predict whether a customer is going to visit the store next week, based on previous visits.
If yes, then predict what they are likely to buy and activate the necessary promotion channels.

How do we go about predictions
Create a model that takes a customer's spend over the previous 12 weeks (or n weeks) to predict a visit in the current week
Create a training data set
Create a test data set
Train the model on the training data set
Test the predictions on the test data set
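One way to turn a spend history into training examples for this setup is a sliding window: the previous n weeks of spend become the features and the following week supplies the label. The sketch below is plain Python with hypothetical data, and it assumes that any positive spend in a week counts as a visit, which the slides do not state.

```python
def make_examples(weekly_spend, n_weeks=12):
    """Slide a window over one customer's weekly spend.

    Each example uses n_weeks of past spend as features and labels the
    following week 1 if the customer spent anything (assumed to mean a
    visit), else 0.
    """
    examples = []
    for t in range(n_weeks, len(weekly_spend)):
        features = weekly_spend[t - n_weeks:t]
        label = 1 if weekly_spend[t] > 0 else 0
        examples.append((features, label))
    return examples

# Hypothetical spend history for one customer, one value per week
history = [5.0, 0.0, 12.5, 0.0, 0.0, 8.0]
examples = make_examples(history, n_weeks=3)
```

A customer with fewer than n_weeks + 1 weeks of history yields no examples, so short histories drop out of the training set under this scheme.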

Large dataset ingestion on laptop

A bit of data munging

Plots from R using h2o.hist
h2o.hist(subset.hex$SPEND[subset.hex$SPEND < 30])

Summarize Data and create features - h2o.group_by
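The kind of aggregation a group-by performs here can be sketched in plain Python: raw transaction rows are collapsed to one row per customer and week. The tuple layout and field names are hypothetical and do not reflect the dataset's actual columns.

```python
from collections import defaultdict

def group_spend(transactions):
    """Aggregate transaction rows to one summary per (customer, week)."""
    totals = defaultdict(lambda: {"rows": 0, "spend": 0.0})
    for customer, week, spend in transactions:
        summary = totals[(customer, week)]
        summary["rows"] += 1        # number of transaction rows in the group
        summary["spend"] += spend   # total spend in the group
    return dict(totals)

# Hypothetical (customer, week, spend) rows
transactions = [
    ("C1", 1, 4.0),
    ("C1", 1, 6.0),
    ("C1", 2, 3.5),
    ("C2", 1, 10.0),
]
features = group_spend(transactions)
```

The resulting per-customer, per-week totals are the raw material for the weekly spend features used by the model.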

Final Dataset before ML Model

Creating a Random Forest Model

Using Flow to explore the Model

Change and explore thresholds
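What changing the threshold does can be sketched without any tooling: a binary classifier outputs a probability, and the threshold decides which probabilities count as a positive prediction. The plain Python below uses hypothetical scores and labels and assumes precision and recall as the metrics of interest.

```python
def confusion_at(threshold, probabilities, labels):
    """Count (TP, FP, FN, TN) when predicting 1 for probability >= threshold."""
    tp = fp = fn = tn = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    """Precision and recall, defined as 0.0 when the denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical predicted probabilities and true labels
probabilities = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

strict = confusion_at(0.5, probabilities, labels)
lenient = confusion_at(0.35, probabilities, labels)
```

Lowering the threshold from 0.5 to 0.35 catches the one visit that was missed, raising recall without adding false positives in this toy data; on real data the trade-off usually costs some precision.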

Logistic Regression – h2o.glm

Gradient Boosting Machines

Deep Learning and Neural Networks

Machine Learning Algorithms supported
K-Means
GLM (Generalized Linear Models)
DRF (Distributed Random Forest)
Naïve Bayes
PCA (Principal Component Analysis)
GBM (Gradient Boosting)
Deep Learning

Meta Learning Grid Search & non-negative least squares (NNLS)
Ensemble Models

Resources
http://h2o.ai
http://bit.ly/1Qh79Xr h2o booklet on R
mby_-_Let_s_Get_Sort-of-Real_User_Guide.pdf
