Google News Personalization: Scalable Online Collaborative Filtering

Google News Personalization: Scalable Online Collaborative Filtering
Abhinandan Das, Mayur Datar, Ashutosh Garg WWW2007, May 8-12, 2007, Banff, Alberta, Canada Presented By, Naveenkumar Selvaraj

Google News A news aggregator website Personalized news-
Records search queries and news clicks Makes them accessible online Recommended for you – top stories based on the user’s past click history

Collaborative Filtering
Learn User Preferences Make recommendations based on User Data Community Data Ex : Amazon.com

Contribution of this paper
Aims to present recommendations to signed-in users for Google News based on User’s click history Click history of the community Generate recommendations using three approaches- collaborative filtering using MinHash Clustering Probabilistic Latent Semantic Indexing (PLSI) Covisitation counts

Problem Statement Given click history for N users and M items and given a specific user u with click history Cu consisting of stories, recommend K stories to the user, which he/she may be interested in reading.

Challenges Scalability – has several million users visiting it and several million items Large number of item churns Most systems including Amazon assume amount of churn – static or minimal For Google News – large amount of churn. Stories of interest are the ones that appeared in the last couple of hours Strict Timing Requirements Strict Response time requirement – less than a second Must also take into account – time for Generating news story clusters by accessing various indexes storing the content Generating the HTML content as a response for HTTP request Hence only a few milliseconds left

Challenges ( Contd..) Treating click as positive vote
More noisy – than giving 1-5 ratings (like Amazon.com ) Clicks do not say anything about the user’s negative interest

Related Work Recommender Systems
Content-based Filtering – Items similar to the ones rated highly by the user recommended Collaborative Filtering – Two types Memory Based – Rating predictions based on users’ past rating Weighted average of the ratings given by other users ( weight proportional to similarity between users) Challenge – Make it more scalable

Related Work (Contd..) Model-based
Model the users based on past ratings Use the models to predict ratings on unseen items Shortcomings of earlier work in Model-based Categorizes each user into only one class User may have different tastes for different topics

Work Done Mix of Memory Based and Model Based algorithms used.
Model Based – MinHash and PLSI ( Probabilistic Latent Semantic Indexing) Memory Based – Covisitation Each algorithm - assigns a numeric score to a story Scores combined – obtain a ranked list of stories

MinHash Probabilistic Clustering technique.
Assigns pair of user to the same cluster with probability proportional to the overlap between the set of items users voted for Similarity between users ui and uj represented as Doing this in real-time – not scalable

MinHash LSI ( Locality Sensitive Hashing)
Randomly permute set of items S. For each user ui, compute hash value h(ui) – index of the first item under the permutation that belongs to user’s item set. Probability that two users will have same hash function is equal to similarity between users. Each hash bucket corresponds to a cluster , which puts two users in the same cluster with similarity S(ui,uj) Can concatenate more hash keys- underlying clusters more refined and average similarity of users within clusters more. Has high precision but low recall

MinHash Improve recall by repeating above step in parallel multiple times Generating random permutations over millions of items not feasible Generate set of independent random seed values, one for each MinHash function- map each news story to a hash-value computed using Id of news story, seed value MapReduce used Handles large amounts of data in short period of time and makes it more scalable

PLSI (Probabilistic Latent Semantic Indexing)
Models users and items as random variables –taking values from space of all possible users and items Relationship learned by modeling as joint distribution of users and items Hidden variable Z captures this relationship where |Z|=L

PLSI Mapreducing EM algorithm
EM algorithm – used to learn maximum likelihood parameters of the model. When the number of users and items are very large of the order millions, memory requirements for the Conditional Probability Distributions is huge. Solution : MapReduce

PLSI Map-reduce Framework

PLSI R x K grid Users and items sharded into R and K groups respectively Click data corresponding to the (u,s) pair sent to (i,j) th machine – u belongs to ith shard and s belongs to jth shard Each machine has to load only (1/R)th of user CPDs and (1/K)th of item CPDs Reducer shard receives key-value pair and computes the values for the next iteration Problems with PLSI : If new user/items added – whole model needs to be retrained Even after parallelizing, it is not real time

Covisitation Event in which two stories are clicked by the same user within certain time interval Graph Nodes represent items Weighted edges – number of Covisitation instances Graph maintained as an adjacency list in Bigtable keyed by item-id (Bigtable maintains data in lexicographic order by row key)

Covisitation Fetch every item in users click history, limited to past few hours For every item si, Lookup for the pair si , s in the adjacency list for si stored in the Bigtable. Add the value stored in the entry to the recommendation score normalized by the sum of all entries. Covisitation scores normalized between 0 and 1 by linear scaling

System setup

System setup (contd..) Two types of requests handled Recommend Request
Update Statistics Request

Evaluation Precision-recall curves for the Movie Lens Dataset

Evaluation Precision-recall curves for the GoogleNews Dataset

Evaluation Live traffic click ratios for different algorithms with baseline as Popular algorithm

Evaluation Live traffic click ratios for comparing PLSI and MinHash algorithms

Conclusion The paper present the algorithms behind scalable real time recommendation engines Novel approaches to clustering over dynamic datasets using MinHash and PLSI with MapReduce Framework for scalability implemented Evaluation on live traffic was done over a large fraction of Google News traffic over a period of several days

Future Work Suitable learning techniques to determine how to combine the scores from different algorithms Exploring cost-benefit tradeoffs of using higher order covisitation statistics

Questions?

Google News Personalization: Scalable Online Collaborative Filtering

Similar presentations

Presentation on theme: "Google News Personalization: Scalable Online Collaborative Filtering"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Google News Personalization: Scalable Online Collaborative Filtering

Similar presentations

Presentation on theme: "Google News Personalization: Scalable Online Collaborative Filtering"— Presentation transcript:

Similar presentations

About project

Feedback