Download presentation

Presentation is loading. Please wait.

Published byDevante Teesdale Modified over 2 years ago

1
Abhinandan Das, Mayur Datar, Ashutosh Garg WWW2007, May 8-12, 2007, Banff, Alberta, Canada Presented By, Naveenkumar Selvaraj Google News Personalization: Scalable Online Collaborative Filtering

2
Google News A news aggregator website Personalized news- Records search queries and news clicks Makes them accessible online Recommended for you – top stories based on the users past click history

3

4
Collaborative Filtering Learn User Preferences Make recommendations based on User Data Community Data Ex : Amazon.com

5
Contribution of this paper Aims to present recommendations to signed-in users for Google News based on Users click history Click history of the community Generate recommendations using three approaches- collaborative filtering using MinHash Clustering Probabilistic Latent Semantic Indexing (PLSI) Covisitation counts

6
Problem Statement Given click history for N users and M items and given a specific user u with click history C u consisting of stories, recommend K stories to the user, which he/she may be interested in reading.

7
Challenges Scalability – has several million users visiting it and several million items Large number of item churns Most systems including Amazon assume amount of churn – static or minimal For Google News – large amount of churn. Stories of interest are the ones that appeared in the last couple of hours Strict Timing Requirements Strict Response time requirement – less than a second Must also take into account – time for Generating news story clusters by accessing various indexes storing the content Generating the HTML content as a response for HTTP request Hence only a few milliseconds left

8
Challenges ( Contd..) Treating click as positive vote More noisy – than giving 1-5 ratings (like Amazon.com ) Clicks do not say anything about the users negative interest

9
Related Work Recommender Systems Content-based Filtering – Items similar to the ones rated highly by the user recommended Collaborative Filtering – Two types Memory Based – Rating predictions based on users past rating Weighted average of the ratings given by other users ( weight proportional to similarity between users) Challenge – Make it more scalable

10
Related Work (Contd..) Model-based Model the users based on past ratings Use the models to predict ratings on unseen items Shortcomings of earlier work in Model-based oCategorizes each user into only one class oUser may have different tastes for different topics

11
Work Done Mix of Memory Based and Model Based algorithms used. Model Based – MinHash and PLSI ( Probabilistic Latent Semantic Indexing) Memory Based – Covisitation Each algorithm - assigns a numeric score to a story Scores combined – obtain a ranked list of stories

12
MinHash Probabilistic Clustering technique. Assigns pair of user to the same cluster with probability proportional to the overlap between the set of items users voted for Similarity between users u i and u j represented as Doing this in real-time – not scalable

13
MinHash LSI ( Locality Sensitive Hashing) Randomly permute set of items S. For each user u i, compute hash value h( u i ) – index of the first item under the permutation that belongs to users item set. Probability that two users will have same hash function is equal to similarity between users. Each hash bucket corresponds to a cluster, which puts two users in the same cluster with similarity S(u i, u j ) Can concatenate more hash keys- underlying clusters more refined and average similarity of users within clusters more. Has high precision but low recall

14
MinHash Improve recall by repeating above step in parallel multiple times Generating random permutations over millions of items not feasible Generate set of independent random seed values, one for each MinHash function- map each news story to a hash-value computed using Id of news story, seed value MapReduce used Handles large amounts of data in short period of time and makes it more scalable

15
PLSI (Probabilistic Latent Semantic Indexing) Models users and items as random variables –taking values from space of all possible users and items Relationship learned by modeling as joint distribution of users and items Hidden variable Z captures this relationship where |Z|=L

16
PLSI Mapreducing EM algorithm EM algorithm – used to learn maximum likelihood parameters of the model. When the number of users and items are very large of the order millions, memory requirements for the Conditional Probability Distributions is huge. Solution : MapReduce

17
PLSI Map-reduce Framework

18
PLSI R x K grid Users and items sharded into R and K groups respectively Click data corresponding to the (u,s) pair sent to (i,j) th machine – u belongs to ith shard and s belongs to jth shard Each machine has to load only (1/R)th of user CPDs and (1/K)th of item CPDs Reducer shard receives key-value pair and computes the values for the next iteration Problems with PLSI : If new user/items added – whole model needs to be retrained Even after parallelizing, it is not real time

19
Covisitation Event in which two stories are clicked by the same user within certain time interval Graph Nodes represent items Weighted edges – number of Covisitation instances Graph maintained as an adjacency list in Bigtable keyed by item-id (Bigtable maintains data in lexicographic order by row key)

20
Covisitation Fetch every item in users click history, limited to past few hours For every item s i, Lookup for the pair s i, s in the adjacency list for s i stored in the Bigtable. Add the value stored in the entry to the recommendation score normalized by the sum of all entries. Covisitation scores normalized between 0 and 1 by linear scaling

21
System setup

22
System setup (contd..) Two types of requests handled Recommend Request Update Statistics Request

23
Evaluation Precision-recall curves for the Movie Lens Dataset

24
Evaluation Precision-recall curves for the GoogleNews Dataset

25
Evaluation Live traffic click ratios for different algorithms with baseline as Popular algorithm

26
Evaluation Live traffic click ratios for comparing PLSI and MinHash algorithms

27
Conclusion The paper present the algorithms behind scalable real time recommendation engines Novel approaches to clustering over dynamic datasets using MinHash and PLSI with MapReduce Framework for scalability implemented Evaluation on live traffic was done over a large fraction of Google News traffic over a period of several days

28
Future Work Suitable learning techniques to determine how to combine the scores from different algorithms Exploring cost-benefit tradeoffs of using higher order covisitation statistics

29
Questions?

Similar presentations

© 2016 SlidePlayer.com Inc.

All rights reserved.

Ads by Google