Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai.



Problem: finding stuff on the Internet
Know what you want:
– content-based filtering
– search
Don’t know:
– browse
How to handle: don’t know, but show me something interesting!

Google News
Top Stories
Recommendations for registered users
Based on user click history, community clicks

Problem Scale
Lots of users (more is good)
– Millions of clicks from millions of users
Problem: high churn in item set
– Several million items (clusters of news articles about the same story, as identified by GN) per month
– Continuous addition, deletion
Strict timing (few hundred ms)
Existing systems not suitable

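Memory-based Ratings
General form: $r_{u_a,s_k} = \sum_{i \neq a} I_{(u_i,s_k)}\, w(u_a,u_i)$, where $r_{u_a,s_k}$ is the rating of item $s_k$ for user $u_a$, $I_{(u_i,s_k)}$ is 1 if user $u_i$ clicked on $s_k$ and 0 otherwise, and $w(u_a,u_i)$ is the similarity between users $u_a$ and $u_i$
Problem: scalability, even when similarity is computed offline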
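The memory-based form on this slide can be sketched in a few lines. This is only an illustration: the slide does not say which similarity w is used, so the Jaccard overlap of click sets is an assumption here, and the names `jaccard`, `memory_based_rating`, and the toy `clicks` data are hypothetical.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Assumed similarity w(u_a, u_i): overlap of two click sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def memory_based_rating(clicks: Dict[str, Set[str]], user: str, item: str) -> float:
    """Sum the similarity to every other user who clicked on `item`."""
    return sum(
        jaccard(clicks[user], other_clicks)
        for other, other_clicks in clicks.items()
        if other != user and item in other_clicks
    )


# Toy click histories (hypothetical data).
clicks = {
    "alice": {"s1", "s2"},
    "bob": {"s1", "s2", "s3"},
    "carol": {"s4"},
}
score = memory_based_rating(clicks, "alice", "s3")
```

Every rating touches every other user, which is the scalability problem the slide points out.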

Model-based techniques
Clustering / segmentation, e.g. based on interests
Bayesian models, Markov decision processes, …
– All are computationally expensive

What’s in this paper?
Investigate 2 different ways to cluster users: MinHash and PLSI
Implement both on MapReduce

Google News Rating Model
1 click = 1 positive vote
Noisier than a 1-5 rating (Netflix)
No explicit negatives
Why might it work? Partly due to the fairly significant article clips provided, so a user who clicks is likely genuinely interested

Design guidelines for a scalable rating system
Group users into clusters of similar users (based on prior clicks, offline)
Users can belong to multiple clusters
Generate the rating using much smaller sets of user clusters, rather than all users

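Technique 1: MinHash
Probabilistically assign users to clusters based on click history
Use the Jaccard coefficient of two users’ click sets, $S(u_i,u_j) = |C_{u_i} \cap C_{u_j}| / |C_{u_i} \cup C_{u_j}|$, as similarity; the corresponding distance $1 - S$ is a metric
Computing this similarity against all other users is computationally expensive, not feasible even offline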

MinHash as a form of Locality Sensitive Hashing
Basic idea: assign a hash value to each user based on click history
How: randomly permute the set of all items; the hash value for a user is the id of the first item in this order that appears in the user’s click history
The probability that 2 users have the same hash is equal to the Jaccard coefficient of their click histories
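A small simulation can make the collision property concrete. The sketch below uses explicit random permutations over a toy item universe and compares the observed collision rate with the Jaccard coefficient. The function names, the two click sets, and the trial count are illustrative assumptions, not taken from the paper.

```python
import random
from typing import Iterable, List, Set


def minhash(clicks: Set[int], permutation: List[int]) -> int:
    """Return the first item in `permutation` order that the user clicked."""
    for item in permutation:
        if item in clicks:
            return item
    raise ValueError("user has an empty click history")


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def collision_rate(a: Set[int], b: Set[int], universe: Iterable[int],
                   trials: int, seed: int = 0) -> float:
    """Fraction of random permutations under which both users get the same hash."""
    rng = random.Random(seed)
    order = list(universe)
    hits = 0
    for _ in range(trials):
        rng.shuffle(order)  # a fresh random permutation of all items
        if minhash(a, order) == minhash(b, order):
            hits += 1
    return hits / trials


user_a = set(range(0, 10))   # clicked items 0..9
user_b = set(range(5, 15))   # clicked items 5..14
expected = jaccard(user_a, user_b)  # 5 shared items out of 15 distinct
observed = collision_rate(user_a, user_b, range(20), trials=2000)
```

With enough trials, `observed` should land close to `expected`, since the first clicked item under a random order is equally likely to be any item in the union of the two histories.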

Using MinHash for clusters
Concatenate p>1 such hashes as the cluster id for increased precision
Apply q>1 such groups in parallel (users belong to q clusters) to improve recall
Don’t actually maintain p*q permutations: hash each item id with a random seed to get a proxy for its permutation index, using p*q different seeds
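The seeded-hash trick might look roughly like the following. This is a sketch under stated assumptions: SHA-256 stands in for whatever hash function the real system uses, seeds are simply the integers 0 to p*q-1, and `seeded_hash`, `minhash_seeded`, and `cluster_ids` are hypothetical names.

```python
import hashlib
from typing import List, Set


def seeded_hash(item_id: str, seed: int) -> int:
    """Proxy for the item's position in the permutation selected by `seed`."""
    digest = hashlib.sha256(f"{seed}:{item_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_seeded(clicks: Set[str], seed: int) -> str:
    """The clicked item with the smallest seeded hash plays the role of 'first item'."""
    return min(clicks, key=lambda item: seeded_hash(item, seed))


def cluster_ids(clicks: Set[str], p: int, q: int) -> List[str]:
    """Build q cluster ids, each the concatenation of p MinHash values."""
    ids = []
    for group in range(q):
        seeds = range(group * p, (group + 1) * p)  # p distinct seeds per group
        hashes = [minhash_seeded(clicks, seed) for seed in seeds]
        ids.append(f"{group}:" + "|".join(hashes))
    return ids


# Hypothetical click history; a non-empty set is assumed.
ids = cluster_ids({"s1", "s2", "s3"}, p=2, q=3)
```

Larger p makes two users share a cluster id only when more of their hashes agree (precision), while larger q gives them more chances to share at least one id (recall).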

MinHash on MapReduce
Generate p*q hashes for each user based on click history; generate q p-long cluster ids by concatenation
Map using cluster ids as keys
Reduce to form membership lists for each cluster id
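The map and reduce steps can be imitated in a single process to show the data flow. This sketch assumes the cluster ids for each user have already been computed; it is plain Python, not a real MapReduce API, and `map_phase`, `reduce_phase`, and the sample ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(user_clusters: Dict[str, List[str]]) -> Iterator[Tuple[str, str]]:
    """Emit one (cluster_id, user_id) pair per cluster the user hashes into."""
    for user, ids in user_clusters.items():
        for cluster_id in ids:
            yield cluster_id, user


def reduce_phase(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group users by cluster id, forming the membership list of each cluster."""
    members: Dict[str, List[str]] = defaultdict(list)
    for cluster_id, user in pairs:
        members[cluster_id].append(user)
    return {cid: sorted(users) for cid, users in members.items()}


# Precomputed cluster ids per user (hypothetical values, q = 2).
user_clusters = {
    "alice": ["0:s1|s2", "1:s3|s1"],
    "bob": ["0:s1|s2", "1:s4|s4"],
}
membership = reduce_phase(map_phase(user_clusters))
```

In a real deployment the shuffle between the two phases is what brings all users with the same cluster id to one reducer.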

Technique 2: PLSI clustering
Probabilistic Latent Semantic Indexing
Main idea: hidden state z that correlates users and items
Generate this clustering from a training set using the EM algorithm given by Hofmann04
– Iterative technique, generates new probability estimates based on previous estimates

PLSI as MapReduce
Q* can be independently computed for each (u,s), given prior N(z,s), N(z), p(z|u): map to RxK machines (R, K partitions for u, s respectively)
Reduce is simply addition
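A single-machine EM iteration helps show why the computation splits so cleanly. In this sketch the per-click loop body is the part that could run independently per (u,s), and the accumulations are the additions a reducer would perform. It assumes the standard PLSI updates with p(s|z) = N(z,s)/N(z); the function names, the random initialization, and the toy clicks are illustrative only.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Click = Tuple[str, str]  # (user, item)


def init_params(clicks: List[Click], k: int, seed: int = 0):
    """Random positive, normalized starting values for p(z|u) and p(s|z)."""
    rng = random.Random(seed)

    def random_dist(keys):
        raw = {key: rng.random() + 0.1 for key in keys}
        norm = sum(raw.values())
        return {key: value / norm for key, value in raw.items()}

    users = sorted({u for u, _ in clicks})
    items = sorted({s for _, s in clicks})
    p_z_u = {u: random_dist(range(k)) for u in users}
    p_s_z = [random_dist(items) for _ in range(k)]
    return p_z_u, p_s_z


def em_step(clicks: List[Click], p_z_u: Dict[str, Dict[int, float]],
            p_s_z: List[Dict[str, float]]):
    """One EM iteration: per-click posteriors Q*, then summed counts."""
    k = len(p_s_z)
    n_zs = [defaultdict(float) for _ in range(k)]   # N(z, s)
    n_uz = defaultdict(lambda: [0.0] * k)           # per-user sums of Q*
    for u, s in clicks:                             # independent per (u, s)
        joint = [p_s_z[z][s] * p_z_u[u][z] for z in range(k)]
        total = sum(joint)
        for z in range(k):
            q = joint[z] / total                    # Q*(z; u, s)
            n_zs[z][s] += q                         # "reduce is addition"
            n_uz[u][z] += q
    new_p_s_z = []
    for z in range(k):
        n_z = sum(n_zs[z].values())                 # N(z)
        new_p_s_z.append({s: n_zs[z][s] / n_z for s in p_s_z[z]})
    new_p_z_u = {u: {z: c[z] / sum(c) for z in range(k)} for u, c in n_uz.items()}
    return new_p_z_u, new_p_s_z


def log_likelihood(clicks, p_z_u, p_s_z) -> float:
    """Sum over clicks of log p(s|u), with p(s|u) = sum_z p(z|u) p(s|z)."""
    return sum(
        math.log(sum(p_z_u[u][z] * p_s_z[z][s] for z in range(len(p_s_z))))
        for u, s in clicks
    )


clicks = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"),
          ("u3", "c"), ("u3", "d"), ("u4", "c"), ("u4", "d")]
p_z_u, p_s_z = init_params(clicks, k=2)
history = [log_likelihood(clicks, p_z_u, p_s_z)]
for _ in range(20):
    p_z_u, p_s_z = em_step(clicks, p_z_u, p_s_z)
    history.append(log_likelihood(clicks, p_z_u, p_s_z))
```

The `history` list records the training log-likelihood, which EM should never decrease from one iteration to the next.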

PLSI in a dynamic environment
Treat Z as user clusters
On each click, update p(s|z) for all clusters the user belongs to
This approximates PLSI, but is updated dynamically as additional items are added
Does not allow additions of users

Cluster-based recommendation
For each cluster, maintain the number of clicks, decayed by time, for each item visited by a member
For a candidate item, look up the user’s clusters and add up the age-discounted visitation counts, normalized by total clicks
Do this using both MinHash and PLSI clustering
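One way to picture the per-cluster statistics is a table of decayed counters. The slide does not give the decay function or the exact normalization, so this sketch assumes exponential decay with a configurable half-life and normalizes by the cluster's total decayed clicks. `ClusterClickStats` and its methods are hypothetical names.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


class ClusterClickStats:
    """Per-cluster, per-item click counts with an assumed exponential time decay."""

    def __init__(self, half_life_s: float) -> None:
        self.rate = math.log(2.0) / half_life_s
        # cluster id -> item id -> (decayed count, timestamp of last update)
        self.counts: Dict[str, Dict[str, Tuple[float, float]]] = defaultdict(dict)

    def _decayed(self, count: float, last_ts: float, now: float) -> float:
        return count * math.exp(-self.rate * (now - last_ts))

    def record_click(self, clusters: Iterable[str], item: str, now: float) -> None:
        """Credit the click to every cluster the clicking user belongs to."""
        for cluster in clusters:
            count, last_ts = self.counts[cluster].get(item, (0.0, now))
            self.counts[cluster][item] = (self._decayed(count, last_ts, now) + 1.0, now)

    def score(self, clusters: Iterable[str], item: str, now: float) -> float:
        """Sum, over the user's clusters, the item's share of decayed clicks."""
        total = 0.0
        for cluster in clusters:
            items = self.counts.get(cluster, {})
            cluster_total = sum(self._decayed(c, ts, now) for c, ts in items.values())
            if cluster_total > 0.0 and item in items:
                count, last_ts = items[item]
                total += self._decayed(count, last_ts, now) / cluster_total
        return total


stats = ClusterClickStats(half_life_s=3600.0)
stats.record_click(["c1"], "story_a", now=0.0)
stats.record_click(["c1"], "story_b", now=3600.0)
fresh = stats.score(["c1"], "story_b", now=3600.0)
stale = stats.score(["c1"], "story_a", now=3600.0)
```

Here the older click on `story_a` has lost half its weight by the time of scoring, so the fresher story gets the larger share.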

One more technique: Covisitation
Memory-based technique
Create adjacency matrix between all pairs of items (can be directed)
Increment the corresponding count if one item is visited soon after another
Recommendation: for candidate item j, sum all counts from i to j over all items i in the user’s recent click history, normalized appropriately
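The covisitation counts can be sketched as a sparse directed adjacency map. The slide leaves "soon after" and "normalized appropriately" unspecified, so two assumptions are made here: a fixed time window decides what counts as soon, and each source item's outgoing counts are normalized by their sum. The class and parameter names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class Covisitation:
    """Directed item-to-item counts built from clicks that follow each other closely."""

    def __init__(self, window_s: float, history_len: int = 10) -> None:
        self.window_s = window_s
        self.history_len = history_len
        # user id -> most recent (item, timestamp) clicks
        self.recent: Dict[str, Deque[Tuple[str, float]]] = {}
        # source item -> destination item -> covisitation count
        self.counts: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))

    def record_click(self, user: str, item: str, ts: float) -> None:
        history = self.recent.setdefault(user, deque(maxlen=self.history_len))
        for prev_item, prev_ts in history:
            if prev_item != item and ts - prev_ts <= self.window_s:
                self.counts[prev_item][item] += 1.0  # edge prev_item -> item
        history.append((item, ts))

    def score(self, user: str, candidate: str) -> float:
        """Sum normalized counts from each recently clicked item to `candidate`."""
        total = 0.0
        for prev_item, _ in self.recent.get(user, ()):
            out_edges = self.counts.get(prev_item)
            if out_edges:
                total += out_edges.get(candidate, 0.0) / sum(out_edges.values())
        return total


cov = Covisitation(window_s=60.0)
cov.record_click("u1", "a", 0.0)
cov.record_click("u1", "b", 10.0)   # a -> b
cov.record_click("u2", "a", 0.0)
cov.record_click("u2", "c", 5.0)    # a -> c
cov.record_click("u3", "a", 100.0)  # u3 has only seen a so far
score_b = cov.score("u3", "b")
score_c = cov.score("u3", "c")
```

Because both updates and lookups only touch the rows for a user's recent items, this fits the online part of the system without any offline clustering.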

Whole System
Offline clustering
Online click history update, cluster item stats update, covisitation update

Results
Generally around 30-50% better than popularity-based recommendations

Techniques don’t work well together, though

Discussion
Covisitation appears to work as well as clustering
Operational details missing: how big are cluster memberships, etc.
All of the clustering is done offline