1
Distributed Machine Learning with H2O
Joint Statistical Meetings 2018, Vancouver, British Columbia, Canada. Navdeep Gill, @Navdeep_Gill_
2
Agenda
- H2O Introduction
- H2O Core Overview
- H2O API
- Demo
3
H2O Introduction
4
Company Overview
H2O: Open Source In-Memory AI Prediction Engine
Founded: 2011; venture-backed, debuted in 2012
Products: H2O Open Source In-Memory AI Prediction Engine; Sparkling Water (H2O + Spark); H2O4GPU (H2O on GPUs); Enterprise Steam; Driverless AI
Mission: Operationalize Data Science & Provide a Platform to Build Beautiful Data Products
Team: 75+ employees; distributed systems engineers doing machine learning; world-class visualization designers
Headquarters: Mountain View, CA
5
Scientific Advisory Council
Dr. Trevor Hastie: PhD in Statistics, Stanford University; John A. Overdeck Professor of Mathematics, Stanford University; co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining; co-author, Generalized Additive Models; 108,404 citations (via Google Scholar)
Dr. Robert Tibshirani: PhD in Statistics, Stanford University; Professor of Statistics and Health Research and Policy, Stanford University; COPSS Presidents' Award recipient; co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining; author, Regression Shrinkage and Selection via the Lasso; co-author, An Introduction to the Bootstrap
Dr. Stephen Boyd: PhD in Electrical Engineering and Computer Science, UC Berkeley; Professor of Electrical Engineering and Computer Science, Stanford University; co-author, Convex Optimization; co-author, Linear Matrix Inequalities in System and Control Theory; co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
6
Smart and Fast Algorithms
What is H2O? Java-based software for in-memory data modeling. H2O is an open-source-first platform, well integrated with big data frameworks such as Hadoop and Spark, for doing large-scale machine learning. The platform provides flexible interfaces, including R, Python, Scala, and Java, plus a custom-built web-based interface called H2O Flow. All of its algorithms are designed to work at scale, distributed and in memory.
Open Source | H2O Flow | Big Data Ecosystem | Flexible Interface | Smart and Fast Algorithms
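As a minimal sketch of the Python interface (the R, Scala, and Flow interfaces are analogous), assuming the h2o Python package and a Java runtime are installed:

```python
import h2o

# Start (or attach to) a local H2O instance. H2O itself runs as a
# Java process, so a Java runtime must be available on this machine.
h2o.init()

# Inspect the cluster: nodes, total memory, and cores.
h2o.cluster().show_status()
```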
7
Rapid Model Deployment Scalability and Performance
What is H2O? Java-based software for in-memory data modeling. All of the algorithms create portable model artifacts for deployment, including plain old Java objects (POJOs) as well as binary artifacts called MOJOs. These deployment artifacts are very flexible and let you integrate with multiple types of deployment environments, including batch, streaming, or REST-based. The H2O platform has extended beyond CPUs and in-memory processing to also include GPUs, and the platform is easily deployed on any of the cloud platforms.
Highly portable models deployed in Java (POJO) | Automated and streamlined scoring-service deployment with REST API* | Distributed in-memory computing platform | Distributed algorithms | Fine-grain MapReduce | Rapid model deployment | GPU enablement* | Scalability and performance | Cloud integration
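A hedged sketch of exporting one of these artifacts from Python; the tiny frame and model settings here are illustrative only, and download_mojo is the standard Python call for MOJO export:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# A tiny illustrative frame; a real model would use real data.
frame = h2o.H2OFrame({"x1": [1, 2, 3, 4, 5, 6],
                      "x2": [0.5, 0.1, 0.9, 0.3, 0.7, 0.2],
                      "y":  [0, 1, 0, 1, 0, 1]})
frame["y"] = frame["y"].asfactor()  # classification target

# min_rows=1 only because the demo frame is tiny.
model = H2OGradientBoostingEstimator(ntrees=3, min_rows=1)
model.train(x=["x1", "x2"], y="y", training_frame=frame)

# Export a MOJO: a portable binary scoring artifact that can be
# embedded in batch, streaming, or REST-based deployments.
mojo_path = model.download_mojo(path=".")
print(mojo_path)
```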
8
The Machine Learning Pipeline
Data Integration & Quality → Feature Preprocessing → Feature Engineering → Model Training
9
Where H2O Fits: H2O covers the pipeline stages of Data Integration & Quality, Feature Preprocessing, Feature Engineering, and Model Training.
10
Current Algorithm Overview
Statistical Analysis: Linear Models (GLM); Naïve Bayes
Ensembles: Random Forest; Gradient Boosting Machine; Stacking / Super Learner
Deep Neural Networks: MLP; Autoencoder; Anomaly Detection; Deep Features
H2O AutoML: Automatic Machine Learning in H2O
Clustering: K-Means (Auto-K)
Dimension Reduction: Principal Component Analysis; Generalized Low Rank Models
Word Embedding: Word2Vec
Time Series: iSAX
Machine Learning Tuning: Hyperparameter Search; Early Stopping
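Since AutoML appears on this list, here is a short hedged Python sketch of it; the file path and target column name are hypothetical placeholders:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Hypothetical dataset and target column.
train = h2o.import_file("path/to/train.csv")
train["label"] = train["label"].asfactor()

# Train a bounded set of models, cross-validate, and rank them.
aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="label", training_frame=train)
print(aml.leaderboard)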
11
H2O Core Overview: Behind the Scenes
12
How H2O Core Works: Since H2O is just a Java application, all you need to run it is some compute resource, such as your laptop.
13
How H2O Core Works (just a Java application): Since H2O is just a Java application, all you need to run it is some compute resource, whether that is your laptop or a cluster of CPUs.
14
How H2O Core Works: H2O is then installed on that compute resource.
15
How H2O Core Works: The compute resource could be a laptop, a distributed physical cluster, or a distributed virtual cluster, so H2O can run on a laptop, on a virtual machine, or in a container.
16
H2O Core: Once you have H2O running, it has access to your data on disk and to your CPU resources.
17
H2O Core: Once H2O is running, it can leverage the compute resources (CPUs) to lift data stored on disk into memory.
18
H2O Core: Model Building. Once your data is read into memory, you can begin building your models, which also takes place in memory.
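A minimal Python sketch of this in-memory model building; the CSV path is hypothetical and the GBM settings are illustrative:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Hypothetical path; the resulting frame lives in the cluster's memory.
data = h2o.import_file("path/to/data.csv")
train, valid = data.split_frame(ratios=[0.8], seed=42)

# Training happens where the data already sits: in memory.
gbm = H2OGradientBoostingEstimator(ntrees=50, max_depth=5)
gbm.train(x=data.columns[:-1], y=data.columns[-1],
          training_frame=train, validation_frame=valid)
print(gbm.model_performance(valid=True))
```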
19
H2O Core: H2O's distinguishing feature is that it can scale, meaning it can run across multiple machines. Since H2O was designed to scale, you can install it on multiple machines.
20
H2O Core: If you install it on a distributed cluster, it will leverage the compute resources of the cluster (the cores and memory on each box).
21
H2O Core: Model Building, distributed and in memory. H2O can read in data, manipulate it, and build models in a distributed manner. Up to this point I have described how the compute resources read in data; people who want to work with big data at scale typically use Hadoop, but I want to emphasize that Hadoop is not a requirement.
22
H2O Core with Hadoop: When H2O runs on a Hadoop cluster, YARN manages the cluster resources that back its distributed in-memory model building.
23
H2O Core with Other Data Sources
But again, H2O does not rely on Hadoop: it can also read from SQL databases, NFS, S3, local file systems, and so on.
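A hedged sketch of reading from several of these sources; all paths are hypothetical, and S3/HDFS access assumes the appropriate credentials and cluster-side configuration:

```python
import h2o

h2o.init()

# All paths below are hypothetical examples.
local = h2o.import_file("/data/events.csv")             # local FS or NFS
cloud = h2o.import_file("s3://my-bucket/events.csv")    # Amazon S3
hdfs  = h2o.import_file("hdfs://namenode/events.csv")   # HDFS on Hadoop
```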
24
H2O Core on the Cloud: H2O is incredibly flexible; it does not require physical machines and can run inside VMs or containers on the cloud. All clouds are VMs: when you go to Amazon and spin something up, that is a virtual machine.
25
H2O Distributed Environments
H2O's environment could be a laptop, a distributed physical cluster, or a distributed virtual cluster: it can be distributed across multiple machines or distributed on the cloud.
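A minimal sketch of attaching a Python client to an already-running (possibly multi-node) cluster; the address is a hypothetical example:

```python
import h2o

# Attach this client to an existing H2O cluster instead of starting
# a local one; replace the address with your cluster's.
h2o.connect(ip="10.0.0.5", port=54321)
h2o.cluster().show_status()  # shows every node's memory and cores
```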
26
H2O API How the client & cluster communicate
27
H2O API: rapidly turning over models, doing data munging, and building applications in a fast, scalable environment (all great things).
28
H2O API: parallelism & distribution of work.
29
H2O API: clients such as RStudio or Python talk to H2O through a REST layer.
31
H2O API: RStudio/Python communicate with H2O via JSON over HTTP.
32
H2O API: the data is held in the JVM on the cluster; the data is NOT in the R process.
33
Client & Cluster Communication
REST/JSON: The R interpreter on the left uses the H2O REST API to talk to the H2O cluster on the right (the HTTP REST request to H2O carries the path argument). Whenever H2O is launched, it exposes a REST API, which means that even though H2O is written in Java, you as a user do not have to know Java to use it; instead you can use the R API, which is written to look like R so the learning curve is very small. The H2O REST API gives you access to all of H2O's capabilities from an external program or script, via JSON over HTTP.
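A hedged sketch of calling that REST layer directly with Python's requests library, assuming a local H2O instance on the default port 54321; the import path is hypothetical:

```python
import requests

# The same REST layer used by the R and Python bindings and by Flow.
base = "http://localhost:54321"

# Cluster status as JSON over HTTP.
cloud = requests.get(f"{base}/3/Cloud").json()
print(cloud["cloud_size"])

# Ask the cluster to import a file; the (hypothetical) path is
# resolved on the cluster, not on the client.
resp = requests.get(f"{base}/3/ImportFiles",
                    params={"path": "/data/events.csv"})
print(resp.json())
```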
34
Communication Layers: Interface
The RStudio interface is a standard R interface, acting as the client for H2O.
35
Communication Layers: Code Script
RStudio, using the H2O package, is the client for H2O.
36
Communication Layers: A Command
Importing big data with R code: from the local machine, the client issues the R command h2o.importFile(…) to H2O.
37
Communication Layers: A Command
Importing big data with R code: h2o.importFile(…) takes the path to your dataset as its argument.
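The Python equivalent is h2o.import_file; a minimal sketch, with a hypothetical path:

```python
import h2o

h2o.init()

# Python counterpart of R's h2o.importFile(); the path is hypothetical
# and must be visible to the cluster, not necessarily to the client.
df = h2o.import_file("path/to/your_dataset.csv")
print(df.dim)  # [rows, cols], computed on the cluster
```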
38
Fourth: Communicate. h2o.importFile(…) sends a file-import request to the H2O cluster.
REST/JSON: The R interpreter on the left uses the H2O REST API to talk to the H2O cluster on the right (the HTTP REST request to H2O carries the path argument). Whenever H2O is launched, it exposes a REST API, so even though H2O is written in Java, you do not have to know Java to use it. The API is separate from the runtime that does the computation (Java and Scala); it extends H2O on top of the platform as a data-science tool. The H2O REST API gives you access to all of H2O's capabilities from an external program or script, via JSON over HTTP. The REST API is used by the Flow UI as well as by both the R and Python bindings: everything you can do with those clients can be done through the REST API, including data import, model building, and generating predictions. You can call the REST API from your browser, with browser tools such as Postman in Chrome, with curl, or from the language of your choice. Generated payload POJOs for Java ship with the release in a separate bindings JAR file and are simple to generate for other languages if desired.
39
Fifth: Cluster Does Heavy Lifting
REST/JSON: I've sent a command telling the distributed backend where my data is stored, asking it to import the data and, in this case, create a data frame; the same process applies to manipulating a data frame or building a model with it. Whether it is a frame or a model, once the cluster has done its work, the user receives a reference, or handle, back to that object.
40
Fifth: Cluster Does Heavy Lifting
REST/JSON: Each machine has its own JVM and gets a piece of the original big dataset (JVM 1, JVM 2, …, JVM N).
41
Fifth: Cluster Does Heavy Lifting
REST/JSON: Each JVM receives chunks of the data (which consist of rows of the data frame). Bigger dataset? Just add more machines.
42
Fifth: Cluster Does Heavy Lifting
REST/JSON: Together these chunks form vecs (the columns x1, x2, x3, …, xp, y), which in turn form an H2OFrame.
43
H2O Client: the cluster returns the frame key (columns x1, x2, x3, …, xp, y spread across JVM 1, JVM 2, …, JVM N).
All objects stored in H2O, data frames and models alike, are associated with a key, and a handle to that key is provided to the client. When you call h2o.importFile and assign the result (with <- in R or = in Python), the variable you assign holds a handle to the key: the R package knows that this variable is a pointer to the frame sitting in the cluster. With x = h2o.import_file(), x does not contain the data; it contains only a reference to the data sitting in the cluster.
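A minimal sketch of working with handles and keys from Python; the path is hypothetical:

```python
import h2o

h2o.init()

x = h2o.import_file("path/to/data.csv")  # hypothetical path

# x is only a handle; the data stays in the cluster's JVMs.
print(x.frame_id)  # the key the cluster assigned to this frame
print(h2o.ls())    # every key currently held in the cluster

# The same frame can be re-fetched later by its key.
same = h2o.get_frame(x.frame_id)
```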
44
Clients Only Tell the Cluster What to Do
What does the client do? It only tells the cluster what to do: "Hi Cluster, can you please get me …"
45
Client Only Passes Requests
Big data never flows through the client unless explicitly requested: "No, you take care of the heavy lifting."
46
Pulling Big Data into R Can Overwhelm Your Session
What if? Pulling big data into R can overwhelm your session: AH! as.data.frame(my_big_dataframe). It's fine to pull data into your R session if you know it will be small. Raw data is typically very large, but after you do some aggregate calculations or explicitly subset your original data, you can return a frame that is small, say 5 or 1,000 rows, and fits on your machine.
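A hedged sketch of the safe pattern, aggregate on the cluster first, then pull only the small result; the path and column name are hypothetical:

```python
import h2o

h2o.init()

big = h2o.import_file("path/to/big_data.csv")  # hypothetical path

# Reduce on the cluster first so only a small result crosses the wire.
small = big.group_by("some_column").mean().get_frame()  # hypothetical column

# Safe: the aggregate is tiny. Calling as_data_frame() on `big` itself
# would pull the entire dataset into the local client session.
local_df = small.as_data_frame()
print(local_df)
```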
47
H2O Resources Documentation: http://docs.h2o.ai
Tutorials | Slide decks | Videos | Stack Overflow | Google Group | Gitter | Events & Meetups
48
Contribute to H2O! Get in touch over email, Gitter, or JIRA.
49
Demo
50
Thank you! @navdeep-G on GitHub | @Navdeep_Gill_ on Twitter