Building a Recommendation Engine with Spark

Yongkai Wu

Hello everyone. I am glad to share my experience of building a recommendation system with Spark.

Outline
Spark MLlib
Collaborative filtering recommendation system
MovieLens dataset
Build a movie recommender
Make predictions

Here is my outline. I will provide some background on Spark MLlib, then show how to build a movie recommendation system with Spark.

Spark MLlib
Classification: decision trees, naive Bayes, random forests, gradient-boosted trees…
Regression: logistic regression, generalized linear regression, isotonic regression…
Clustering: k-means, …
Recommendation: alternating least squares (ALS)

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. In this library you can find the most widely used algorithms in the machine learning field. It provides classification, regression, and clustering algorithms. I think you are familiar with decision trees… In the next ten minutes, I will introduce ALS and show how to implement a recommender with it.

Recommendation systems
Content-based system
Focuses on the properties of items and the preferences of users.
E.g.: if John likes Avengers: Age of Ultron, the system will recommend Doctor Strange, because both are fictional and produced by Marvel.

Collaborative-filtering system
Focuses on the relationship between users and items.
Key point: people who liked similar items in the past will like similar items in the future.
E.g.: Alice and Bob both like House of Cards. The system knows Alice likes Arrival, so it guesses Bob will like this movie too.

What is a recommendation system? Here is an Amazon example. Every time you go to Amazon.com, Amazon shows several items based on your purchase history or browsing history. Amazon tries to guess what you are interested in. Generally speaking, there are two types of recommendation systems.

Collaborative-Filtering: a movie example
[Figure: rating matrix for the users Alice, Bob, and Carol]

BIG and SPARSE matrix
Amazon has 300M users and 200M products.
The matrix has 60 million billion cells (3 × 10^8 users × 2 × 10^8 products = 6 × 10^16).
Each user rates only a very small subset of the items.
Most cells are empty.
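The cell count follows directly from the two figures on the slide. The short sketch below reproduces the arithmetic; the assumption of 100 ratings per user is purely hypothetical and is only there to illustrate how small the filled fraction would be.

```python
# Figures quoted on the slide (approximate user and product counts).
num_users = 300_000_000
num_products = 200_000_000

# Every (user, product) pair is one cell of the rating matrix.
num_cells = num_users * num_products
print(f"{num_cells:.1e} cells")  # 6.0e+16, i.e. 60 million billion

# Hypothetical assumption for illustration only: each user rates 100 products.
ratings_per_user = 100
filled_fraction = (num_users * ratings_per_user) / num_cells
print(f"fraction of filled cells: {filled_fraction:.1e}")
```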

Matrix Factorization
$R_{m \times n} \approx U_{m \times k} \times P^{T}_{k \times n}$

One of the most popular approaches to collaborative filtering is matrix factorization. In its simplest form, it assumes a matrix of ratings R given by m users to n items. Applying matrix factorization to R factorizes it into two matrices, U and P, whose product $U \times P^{T}$ approximates R. Note the new quantity k, which is the rank of the factorization. The larger k is, the better the approximation.
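To make the shapes concrete, here is a minimal sketch in plain Python that multiplies a user-factor matrix U (m × k) by the transpose of an item-factor matrix P (n × k). The factor values are hypothetical and chosen only for illustration; each predicted rating is the dot product of one user row and one item row.

```python
def transpose(matrix):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*matrix)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
            for row in a]

# Hypothetical factors: m = 3 users, n = 4 items, k = 2 latent factors.
U = [[1.0, 0.5], [0.2, 1.5], [1.2, 1.0]]              # m x k
P = [[2.0, 1.0], [1.0, 0.0], [0.5, 2.0], [1.5, 1.5]]  # n x k

# R_hat is the m x n approximation of the rating matrix R.
R_hat = matmul(U, transpose(P))
for row in R_hat:
    print(row)
```

Storing U and P takes k × (m + n) numbers instead of m × n, which is what makes the factorized form practical for a very large, sparse R.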

Alternating Least Squares (ALS)
Cost function: $J = \lVert R - U \times P^{T} \rVert_{2} + \lambda \left( \lVert U \rVert_{2} + \lVert P \rVert_{2} \right)$
First term: mean squared error. Second term: regularization.
Optimization problem: gradient descent
Iterations: number of iterations
Lambda: regularization parameter

Matrix factorization is an optimization process that aims to approximate the original matrix R with the two matrices U and P by minimizing the cost function J. The first term in this cost function is the mean squared error (MSE) distance between the original rating matrix R and its approximation $U \times P^{T}$. The second term is called a "regularization term" and is added to favor a solution that generalizes (to prevent overfitting to local noise in the ratings). Gradient descent is a first-order optimization algorithm that is widely used in machine learning. It decreases the cost iteratively.
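The following is a minimal sketch of minimizing a cost of this kind with plain full-batch gradient descent, the optimizer named on the slide. It is not Spark's implementation: ALS instead alternates between holding P fixed to solve for U and holding U fixed to solve for P. The sketch assumes squared norms, sums the error over observed ratings only, and uses a hypothetical learning rate `lr` and hypothetical toy ratings.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cost(ratings, U, P, lam):
    """Squared error over observed ratings plus an L2 penalty on U and P."""
    squared_error = sum((r - dot(U[u], P[i])) ** 2 for u, i, r in ratings)
    penalty = sum(x * x for row in U + P for x in row)
    return squared_error + lam * penalty

def factorize(ratings, m, n, k=2, iterations=20, lam=0.2, lr=0.01, seed=0):
    """Fit U (m x k) and P (n x k) by full-batch gradient descent."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(m)]
    P = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n)]
    for _ in range(iterations):
        # Start each gradient with the regularization part: d(lam * x^2)/dx.
        grad_U = [[2 * lam * x for x in row] for row in U]
        grad_P = [[2 * lam * x for x in row] for row in P]
        # Add the squared-error part for every observed rating.
        for u, i, r in ratings:
            err = r - dot(U[u], P[i])
            for f in range(k):
                grad_U[u][f] -= 2 * err * P[i][f]
                grad_P[i][f] -= 2 * err * U[u][f]
        U = [[x - lr * g for x, g in zip(row, g_row)] for row, g_row in zip(U, grad_U)]
        P = [[x - lr * g for x, g in zip(row, g_row)] for row, g_row in zip(P, grad_P)]
    return U, P

# Hypothetical (user index, movie index, rating) triples for 4 users and 4 movies.
toy_ratings = [(0, 0, 4), (0, 2, 3), (1, 1, 2), (2, 0, 5), (2, 3, 1), (3, 2, 4)]

U0, P0 = factorize(toy_ratings, m=4, n=4, iterations=0)   # initial factors only
U20, P20 = factorize(toy_ratings, m=4, n=4, iterations=20)
print("cost before:", cost(toy_ratings, U0, P0, 0.2))
print("cost after: ", cost(toy_ratings, U20, P20, 0.2))
```

Because both calls use the same seed, they start from identical factors, so the two printed costs show the effect of the 20 update steps.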

Example
[Figure: a sparse rating matrix for the users Alice, Bob, Carol, and Daniel over the movies M1 to M4 (ratings shown: Alice 4 and 3, Bob 2, Carol 5) is factorized into a user-factor matrix and a movie-factor matrix]
k = 2, iterations = 20, lambda = 0.2

[Figure: multiplying the user-factor matrix by the movie-factor matrix gives a matrix of predicted ratings for Alice, Bob, Carol, and Daniel over M1 to M4, which is compared with the known ratings]
$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$
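The root-mean-square error between known and predicted ratings can be computed with a few lines of Python. This is a minimal sketch; the rating values below are hypothetical and are not the ones from the slide.

```python
import math

def rmse(true_ratings, predicted_ratings):
    """Root-mean-square error between two equally long rating sequences."""
    if not true_ratings or len(true_ratings) != len(predicted_ratings):
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(y - y_hat) ** 2
                      for y, y_hat in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical known ratings and predictions; every prediction is off by 0.5.
print(rmse([4.0, 3.0, 2.0, 5.0], [4.5, 2.5, 2.5, 4.5]))  # 0.5
```

Only cells with a known rating enter the sum, so n is the number of observed ratings, not the size of the full matrix.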

MovieLens dataset
10 million ratings from 72,000 users on 10,000 movies
Format: UserID::MovieID::Rating::Timestamp
Split: 60% training, 20% validation, 20% testing
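As a sketch of how records in this format might be parsed and split 60/20/20, the plain-Python example below uses a seeded shuffle followed by slicing. The sample lines are hypothetical, and a Spark job would more likely split an RDD or DataFrame directly; this only illustrates the idea.

```python
import random

def parse_rating(line):
    """Parse 'UserID::MovieID::Rating::Timestamp' into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating), int(timestamp)

def split_ratings(ratings, seed=42):
    """Shuffle, then slice into 60% training, 20% validation, 20% testing."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_validation = len(shuffled) * 2 // 10
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    testing = shuffled[n_train + n_validation:]
    return training, validation, testing

# Hypothetical sample lines in the dataset's format.
sample_lines = [f"{user}::{100 + user}::{(user % 5) + 1}::{838985000 + user}"
                for user in range(1, 11)]
parsed = [parse_rating(line) for line in sample_lines]
training, validation, testing = split_ratings(parsed)
print(len(training), len(validation), len(testing))  # 6 2 2
```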

Build a recommender
Train a set of models on the training set and evaluate them on the validation set.
The implementation in Spark MLlib has these parameters:
rank: number of latent factors in the model
iterations: number of iterations of ALS
lambda: the regularization parameter
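Training a set of models and keeping the one with the lowest validation error amounts to a grid search over rank, iterations, and lambda. The sketch below keeps the training and evaluation steps abstract so it runs without a Spark cluster: `train_fn` and `eval_fn` are hypothetical callables, and the candidate values are illustrative only.

```python
from itertools import product

def grid_search(train_fn, eval_fn, ranks, iteration_counts, lambdas):
    """Try every parameter combination and keep the lowest validation score.

    train_fn(rank, iterations, lam) returns a model; eval_fn(model) returns
    its validation RMSE. With PySpark, train_fn might wrap a call such as
    pyspark.mllib.recommendation.ALS.train(training, rank, iterations, lam)
    (an assumption about usage, not something shown on the slides).
    """
    best = None
    for rank, iterations, lam in product(ranks, iteration_counts, lambdas):
        model = train_fn(rank, iterations, lam)
        score = eval_fn(model)
        if best is None or score < best[0]:
            best = (score, rank, iterations, lam, model)
    return best

# Stand-in functions so the sketch is runnable: the "model" is just its
# parameters, and the fake score is lowest at rank 8, lambda 0.1, more iterations.
def fake_train(rank, iterations, lam):
    return (rank, iterations, lam)

def fake_eval(model):
    rank, iterations, lam = model
    return abs(rank - 8) + abs(lam - 0.1) + 1.0 / iterations

best_score, best_rank, best_iterations, best_lambda, _ = grid_search(
    fake_train, fake_eval, ranks=[8, 12], iteration_counts=[10, 20], lambdas=[0.1, 10.0])
print(best_rank, best_iterations, best_lambda)
```

The test set stays out of this loop; it is used once at the end to report the error of the selected model.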

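Evaluation steps:
Validation data: (UserID, MovieID, TrueRating)
Model output: (UserID, MovieID, PredictRating)
Keyed predictions: ((UserID, MovieID), PredictRating)
Keyed true ratings: ((UserID, MovieID), TrueRating)
Joined: ((UserID, MovieID), (PredictRating, TrueRating))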
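The keying and joining steps can be sketched in plain Python with dictionaries standing in for Spark's key-value pairs. The function name and the sample triples are hypothetical; in Spark the same pattern would typically be a map to (key, value) pairs followed by a join.

```python
def join_predictions(true_triples, predicted_triples):
    """Pair each predicted rating with the true rating for the same key.

    Both inputs are iterables of (user_id, movie_id, rating). The result maps
    (user_id, movie_id) to (predicted_rating, true_rating) and keeps only keys
    present on both sides, like an inner join.
    """
    true_by_key = {(user, movie): rating for user, movie, rating in true_triples}
    predicted_by_key = {(user, movie): rating
                        for user, movie, rating in predicted_triples}
    return {key: (predicted_by_key[key], true_by_key[key])
            for key in predicted_by_key if key in true_by_key}

# Hypothetical validation ratings and model predictions.
true_triples = [(1, 10, 4.0), (1, 11, 3.0), (2, 10, 5.0)]
predicted_triples = [(1, 10, 3.6), (2, 10, 4.7), (3, 12, 2.9)]

joined = join_predictions(true_triples, predicted_triples)
for key, (predicted, actual) in sorted(joined.items()):
    print(key, predicted, actual)
```

The joined pairs are exactly what an RMSE computation needs: one predicted value and one true value per (UserID, MovieID) key.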

Make prediction

Compare with the mean rating
RMSE of the ALS model is 0.87.
RMSE of the naive model, which always predicts the mean rating, is 1.11.
The best model improves on the baseline by 21.97%.
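A mean-rating baseline and the relative improvement over it can be sketched as follows. The function names and toy ratings are hypothetical. Note that the rounded RMSE values above give roughly 21.6%, so the 21.97% figure was presumably computed from unrounded values.

```python
import math

def mean_baseline_rmse(training_ratings, test_ratings):
    """RMSE of a naive model that predicts the training mean for every rating."""
    mean_rating = sum(training_ratings) / len(training_ratings)
    squared_errors = [(rating - mean_rating) ** 2 for rating in test_ratings]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def improvement_pct(baseline_rmse, model_rmse):
    """Relative reduction in RMSE compared with the baseline, in percent."""
    return (baseline_rmse - model_rmse) / baseline_rmse * 100

# Hypothetical toy ratings: the training mean is 3.0, test ratings are 2 and 4.
print(mean_baseline_rmse([3.0, 3.0, 3.0], [2.0, 4.0]))  # 1.0

# Improvement computed from the rounded RMSE values quoted on the slide.
print(round(improvement_pct(1.11, 0.87), 2))
```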

Reference:
Spark Summit 2014: Movie Recommendation with MLlib (recommendation-with-mllib.htm)
An online movie recommending service using Spark & Flask (lens/blob/master/notebooks/building-recommender.ipynb)
