Proceedings of the 28th VLDB Conference Hong Kong, China, 2002

Slides:



Advertisements
Similar presentations
Chapter 13: Query Processing
Advertisements

1 IDX. 2 What you will learn: What IDX is Why its important How to use it Tips and tricks Introduction Q & A.
Chapter 7 System Models.
Fatma Y. ELDRESI Fatma Y. ELDRESI ( MPhil ) Systems Analysis / Programming Specialist, AGOCO Part time lecturer in University of Garyounis,
OvidSP Flexible. Innovative. Precise. Introducing OvidSP Resources.
Effective Change Detection Using Sampling Junghoo John Cho Alexandros Ntoulas UCLA.
28 April 2004Second Nordic Conference on Scholarly Communication 1 Citation Analysis for the Free, Online Literature Tim Brody Intelligence, Agents, Multimedia.
A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.
DCV: A Causality Detection Approach for Large- scale Dynamic Collaboration Environments Jiang-Ming Yang Microsoft Research Asia Ning Gu, Qi-Wei Zhang,
1 State Wildlife Action Plans Wiki: Business Transformation Tutorial Brand Niemann July 5, 2008
Designing Services for Grid-based Knowledge Discovery A. Congiusta, A. Pugliese, Domenico Talia, P. Trunfio DEIS University of Calabria ITALY
Click to edit Master title style Page - 1 OneSky Teams Step-by-Step Online Corporate Communication Support 2006.
State of New Jersey Department of Health and Senior Services Patient Safety Reporting System Module 2 – New Event Entry.
Recommender Systems & Collaborative Filtering
Fawaz Ghali Web 2.0 for the Adaptive Web.
- A Powerful Computing Technology Department of Computer Science Wayne State University 1.
Website Design What is Involved?. Web Design ConsiderationsSlide 2Bsc Web Design Stage 1 Website Design Involves Interface Design Site Design –Organising.
Introduction Lesson 1 Microsoft Office 2010 and the Internet
Google News Personalization: Scalable Online Collaborative Filtering
Google News Personalization Scalable Online Collaborative Filtering
Week 2 The Object-Oriented Approach to Requirements
Configuration management
Chapter 18 Methodology – Monitoring and Tuning the Operational System Transparencies © Pearson Education Limited 1995, 2005.
Megastore: Providing Scalable, Highly Available Storage for Interactive Services. Presented by: Hanan Hamdan Supervised by: Dr. Amer Badarneh 1.
Yong Choi School of Business CSU, Bakersfield
Chapter Information Systems Database Management.
Cache and Virtual Memory Replacement Algorithms
1 Contract Inactivation & Replacement Fly-in Action ( Continue to Page Down/Click on each page…) Electronic Document Access (EDA)
By Waqas Over the many years the people have studied software-development approaches to figure out which approaches are quickest, cheapest, most.
Location-Based Social Networks Yu Zheng and Xing Xie Microsoft Research Asia Chapter 8 and 9 of the book Computing with Spatial Trajectories.
Text Categorization.
Memory vs. Model-based Approaches SVD & MF Based on the Rajaraman and Ullman book and the RS Handbook. See the Adomavicius and Tuzhilin, TKDE 2005 paper.
Machine Learning: Intro and Supervised Classification
Executional Architecture
Who are the Experts?Simon KampaSlide 1 Who are the Experts? Simon Kampa IAM Group University of Southampton
Music Recommendation by Unified Hypergraph: Music Recommendation by Unified Hypergraph: Combining Social Media Information and Music Content Jiajun Bu,
Determining How Costs Behave
DIKLA GRUTMAN 2014 Databases- presentation and training.
Page 1 of 36 The Public Offering functionality in Posting allows users to submit requests for public offerings of Petroleum and Natural Gas(PNG) and Oil.
Local Search Jim Little UBC CS 322 – CSP October 3, 2014 Textbook §4.8
Choosing an Order for Joins
RefWorks: The Basics October 12, What is RefWorks? A personal bibliographic software manager –Manages citations –Creates bibliogaphies Accessible.
INFORMATION SOLUTIONS Citation Analysis Reports. Copyright 2005 Thomson Scientific 2 INFORMATION SOLUTIONS Provide highly customized datasets based on.
CO-AUTHOR RELATIONSHIP PREDICTION IN HETEROGENEOUS BIBLIOGRAPHIC NETWORKS Yizhou Sun, Rick Barber, Manish Gupta, Charu C. Aggarwal, Jiawei Han 1.
1 Distributed Agents for User-Friendly Access of Digital Libraries DAFFODIL Effective Support for Using Digital Libraries Norbert Fuhr University of Duisburg-Essen,
South Dakota Library Network MetaLib User Interface South Dakota Library Network 1200 University, Unit 9672 Spearfish, SD © South Dakota.
© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public ITE PC v4.0 Chapter 1 1 Link-State Routing Protocols Routing Protocols and Concepts – Chapter.
Vendor Guide to Mandatory Pre-Population in WAWF 5.4
© Copyright 2011 John Wiley & Sons, Inc.
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
COLLABORATIVE FILTERING Mustafa Cavdar Neslihan Bulut.
Recommender Systems Aalap Kohojkar Yang Liu Zhan Shi March 31, 2008.
Yusuf Simonson Title Suggesting Friends Using the Implicit Social Graph.
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
J. Chen, O. R. Zaiane and R. Goebel An Unsupervised Approach to Cluster Web Search Results based on Word Sense Communities.
Recommender systems Ram Akella November 26 th 2008.
CONTENT-BASED BOOK RECOMMENDING USING LEARNING FOR TEXT CATEGORIZATION TRIVIKRAM BHAT UNIVERSITY OF TEXAS AT ARLINGTON DATA MINING CSE6362 BASED ON PAPER.
Distributed Networks & Systems Lab. Introduction Collaborative filtering Characteristics and challenges Memory-based CF Model-based CF Hybrid CF Recent.
1 Applying Collaborative Filtering Techniques to Movie Search for Better Ranking and Browsing Seung-Taek Park and David M. Pennock (ACM SIGKDD 2007)
Ontology-Driven Automatic Entity Disambiguation in Unstructured Text Jed Hassell.
CS 533 Information Retrieval Systems.  Introduction  Connectivity Analysis  Kleinberg’s Algorithm  Problems Encountered  Improved Connectivity Analysis.
1 Computing Relevance, Similarity: The Vector Space Model.
Badrul M. Sarwar, George Karypis, Joseph A. Konstan, and John T. Riedl
Web Information Retrieval Prof. Alessandro Agostini 1 Context in Web Search Steve Lawrence Speaker: Antonella Delmestri IEEE Data Engineering Bulletin.
A System for Automatic Personalized Tracking of Scientific Literature on the Web Tzachi Perlstein Yael Nir.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li.
Collaborative Filtering - Pooja Hegde. The Problem : OVERLOAD Too much stuff!!!! Too many books! Too many journals! Too many movies! Too much content!
Recommender Systems & Collaborative Filtering
Presentation transcript:

Proceedings of the 28th VLDB Conference Hong Kong, China, 2002 REFEREE: An open framework for practical testing of recommender systems using ResearchIndex Proceedings of the 28th VLDB Conference Hong Kong, China, 2002 Dan Cosley, Steve Lawrence, David M Pennock Presenter: Raghav Karumur Date: 4/11/2011 Course: [CSCI 8735] Advanced Database Systems Department of Computer Science and Engineering University of Minnesota, Twin Cities Spring 2011

Background What is REFEREE? What is ResearchIndex? REcommender Framework and Evaluator for REsEarchIndex An autonomous citation indexing system that indexes a large fraction of the computer science literature available on the web. It locates papers, downloads them, and automatically extracts information such as abstracts, authors, and citations. It provides citation links both to and from a document. It computes similarity: “users who viewed x also viewed y”. Advanced Database Systems Raghav Karumur Spring 2011

Currently we have lot of automated recommender systems!! Motivation Currently we have lot of automated recommender systems!! Advanced Database Systems Raghav Karumur Spring 2011

Motivation Problem: How this paper addresses this issue? No standard, public, real-world test-bed for evaluating the recommendation algorithms. How this paper addresses this issue? This paper built one such test-bed REFEREE using ResearchIndex database so that anyone can develop , deploy or evaluate recommender systems relatively easily and quickly. Why did they choose ResearchIndex? Because it is : (author, title, abstract, fulltext), collaborative filtering information (user behaviors), as well as linkage info(citations). content rich Advanced Database Systems Raghav Karumur Spring 2011

Rest Of the Paper: Types of Recommenders A brief history of CF Hybrid systems Why use Research Index? REFEREE PD-Live: An example recommender Results and Discussion Conclusion References Advanced Database Systems Raghav Karumur Spring 2011

Types of Recommenders alone uses several recommender systems: Non-personalized: Ex: “Users who bought Men Are From Mars, Women Are From Venus also bought Space Invaders From Pluto!”. Users post comments about an item. Customers read the comments before they buy. Personalized recommendations based on user’s ratings (explicit or implicit). (An example of a pure collaborative filtering (CF) system that computes recommendations based only on the ratings (explicit or implicit) given by users, ignoring product attributes.) As a result, CF systems are completely domain independent. Advanced Database Systems Raghav Karumur Spring 2011

Collaborative Filtering Earlier collaborative filtering systems used to be manual. User had to create filters, know the product, etc. Automated collaborative filtering addresses this problem, users assign ratings to items. System uses this rating to suggest users with similar interests. Original algorithms use similarity metrics such as correlation, MS distance, vector similarity, etc or use ML algorithms such as clustering, neural networks, Bayesian networks, etc. Advanced Database Systems Raghav Karumur Spring 2011

Collaborative Filtering However, CF systems can fail when Data is too sparse, When recommending new items, cold start problem. or when recommending to new users. More sophisticated recommender systems combine user ratings with domain-specific descriptive information about the items. Advanced Database Systems Raghav Karumur Spring 2011

Hybrid Systems content-based filtering + collaborative filtering Standard databases such as EachMovie dataset of movie ratings, exist for testing pure CF algorithms. This is not the case with hybrid systems! You have to populate it with content information from databases such as IMDB. Ex: filterbot, P-Tango, PD-Live etc Advanced Database Systems Raghav Karumur Spring 2011

Hybrid Systems PD-Live is a hybrid recommender they have built in order to evaluate the performance of this testbed. It uses content based filtering to select the documents and collaborative filtering to rank them. They claim that PD-live performs reasonably well compared to other recommenders in ResearchIndex. “Makes recommendations from all 1,000,000 documents in <200msec” Advanced Database Systems Raghav Karumur Spring 2011

Why use ResearchIndex as a testbed?   Research papers have both objective ("this work is related to mine") and taste ("I think this work is interesting") components, thus can help for both content-based and collaborative filtering. Recommender system developers can install a local copy of ResearchIndex, and use a rich API to access the database. ResearchIndex also has a large, active system, which will allow recommender systems developers to ensure that their algorithms scale well to systems which demand quick response time while servicing thousands of requests per hour, over a database of hundreds of thousands of users and items. The algorithm designers of this paper have also taken performance into consideration as >1000 users access this database of >1,000,000 papers per hour. Advanced Database Systems Raghav Karumur Spring 2011

ResearchIndex: A peek inside !   The basic entities in ResearchIndex: users, documents, citations, and citation groups. Users : People who make queries to access documents and citations. ResearchIndex makes it easy to observe a user’s behavior, automatically generating & assigning user IDs and stores them in cookies. Documents: (research papers) each is assigned a document ID (DID). Citations: The actual text of citations extracted from the documents. Each citation receives a unique citation ID (CID). Citation groups: relate individual citations that cite the same document, with each citation group receiving a group ID (GID). User actions (interaction with a document or citation/issue a query) are called “events”. There are about 30 kinds of events; some of the events most likely to interest recommender systems developers are shown in Table 1. Advanced Database Systems Raghav Karumur Spring 2011

ResearchIndex: A peek inside ! Event Name Happens When Parameters documentdetails User visits a document’s details page UID, DID, related doc info download User downloads a document UID, DID rate User assigns an explicit rating UID, DID, rating cachedpage User is viewing document pages UID, DID, pagenumber documentquery User searches documents for a query string UID, query Table1: A sample of user actions in ResearchIndex. Most actions involve a user interacting with a document or citation in the database. Advanced Database Systems Raghav Karumur Spring 2011

ResearchIndex: A peek inside !   REFEREE has no built-in notion of how important or interesting an event is. Recommender system developers have to infer what an event signifies about a user’s preferences. This is also known as using implicit ratings, as opposed to explicit ratings, where the user effectively says “I like item x this much”. Implicit ratings are useful in ecommerce recommenders, as it is a burden on users to ask for explicit ratings. ResearchIndex does provide a way for users to give a rating on a 1 to 5 scale! Advanced Database Systems Raghav Karumur Spring 2011

ResearchIndex: Built in recommenders   Sentence Overlap: recommends docs with significant text identical to current document. Cited By: documents that site current document. Active Bibliography: documents that make similar citations. Users who viewed: documents that userbase has seen. Co-citation: documents sited together with current document. On Same site: documents found on same webpage as current document. Advanced Database Systems Raghav Karumur Spring 2011

click-through rates and download rates, purchase rates REFEREE Purposes: To help recommender systems communicate with ResearchIndex. To minimize the amount of work recommender system developers must do in order to implement and test working systems. REFEREE allows anyone to quickly implement, and evaluate recommenders for documents in ResearchIndex. Want to evaluate recommenders using: user behavior A Recommender Framework Daemon (RFD) runs on the ResearchIndex site and fetches user events to the recommenders. It also requests recommendations of the form “user x is looking at document y: what three documents might x want to look at next?” The system generally issues between one and five recommendation requests per second, and usually requires recommenders to respond within 200 milliseconds. Figure 2 shows the REFEREE architecture. click-through rates and download rates, purchase rates Advanced Database Systems Raghav Karumur Spring 2011

REFEREE: Architecture Logs recs Requests to join, recs upon request Engine RFD events RFD Client Broadcast events, request recs from specific engines recs Events, rec requests Research Index Figure 2: An overview of the REFEREE architecture. Recommender engine writers need only write the Engine portion, making use of the shaded components which REFEREE provides. Advanced Database Systems Raghav Karumur Spring 2011

The Process In order to develop a REFEREE-based recommender, a researcher first contacts ResearchIndex for access to the framework, codebase, etc. Then he develops a recommender & tests it locally against his local instance. Once the recommender is ready, the researcher changes the client to point at the actual ResearchIndex site, and it then goes live. The recommender can run on any machine, as long as the network latency plus computation time required to make recommendations is within the deadline which ResearchIndex imposes. Advanced Database Systems Raghav Karumur Spring 2011

Metrics and Accuracy How accurate are the metrics? MAE : user ratings Vs system ratings User ratings are often not precise: depend on user’s mood. Also often they are not explicit; system must infer from user behavior. So, do we need absolute prediction accuracy? - In most cases, we don’t! Metrics are there to measure whether user accepted a recommendation. This paper distinguishes between user looking at a document’s detail page and a user actually downloading it. Instead of Click-Through and Buy-Now, they have Click-Soon and Buy Soon. Advanced Database Systems Raghav Karumur Spring 2011

What should(can) a recommender system do? Help the user make a decision about what item to choose. Actually change user behavior. By suggesting a set of items, a recommender provides a user with a set of choices which the user might not otherwise have made. influence which items users consume (and thus can rate) and might even influence the actual ratings users give (“the system recommended it, so it has to be good”). Neither of these important cases can be captured by evaluating recommenders using accuracy metrics on a static dataset!!

Static Datasets: Of what use are they? Provide a fixed and unchanging world. Then what use???? . Valuable tool for evaluating recommender algorithms Helpful in removing unsuitable algorithms Increase effectiveness of online experiments. In any case, ResearchIndex captures raw data: it logs user events, recommendation requests, and documents recommended. Advanced Database Systems Raghav Karumur Spring 2011

PD-Live : An example hybrid recommender PD-Live : variant of Personality Diagnosis algorithm of Pennock. Maintains an in-memory sparse-matrix representation of users, items and ratings Remembers its state between runs Computes most recommendations in under 200 milliseconds The authors of this paper chose PD because they had a reasonably fast working offline version, and the algorithm has been shown to be more accurate than standard correlation methods on EachMovie and ResearchIndex data. Advanced Database Systems Raghav Karumur Spring 2011

Considering context in hybrid recommenders A user viewing recipes should not be suggested jokes. Most CF recommenders do not consider context explicitly, they draw recommendations from all the items they know about. They get some implicit context from the fact that people tend to like similar types of items (e.g., if Dan likes mostly science fiction movies, a recommender will tend to match him up with other sci-fi fans and recommend science fiction rather than romances). Issues with ResearchIndex: Scalability: Computing a recommendation from among 500,000 items Second, researchers typically have several areas of interest. Documents with the highest predicted scores may not be what the user is currently interested in. Advanced Database Systems Raghav Karumur Spring 2011

Two Phase Approach for determining context similarity Citation links and other similarity measures form a directed graph with documents as nodes and similarity relationships as edges. PD-Live does a BFS from the document a user is currently looking at, to select a candidate set of documents. It then uses the PD algorithm to generate predictions for these documents and conducts a lottery biased toward higher-scoring documents to select which ones to recommend. If the user is new, or no neighbors can be found who have rated a document, the system falls back to non-personalized average ratings for the documents. If the document is new and no similarity information is available, the system chooses the set of recommendable documents randomly. Advanced Database Systems Raghav Karumur Spring 2011

Two Phase Approach: Advantages Easy and fast to access and use. Using content to select a set of plausible documents and then collaborative information to rank them is also a reasonable model of how people often make decisions: eliminate obviously wrong alternatives and then choose the best among them. This approach fits well with the notion of relevance adopted by the IR community. Their recommender starts with the impersonal, binary notion of relevance, then uses CF to include personal notions of relevance. Advanced Database Systems Raghav Karumur Spring 2011

Two Phase Approach: Drawbacks Consumes a lot of memory. It also reduces the number of documents considered by only recommending documents that have at least five ratings, since documents rated by only a few users would seldom be recommended anyway. User’s session is updated only every 15 minutes. PD-Live also makes use of only some user actions: those where the user is interacting with a document already stored in the system. Advanced Database Systems Raghav Karumur Spring 2011

Results and Discussion Four metrics produce the same rank ordering of the recommenders. The best PD-Live-based recommender places in the middle of the pack, outperforming three of ResearchIndex’s built-in recommenders (On Same Site, Co-citation, and Users Who Viewed) while lagging behind three others (Sentence Overlap, Cited By, and Active Bibliography). The percentages for click-thru do not add up to 100% because there are many other actions that users can take besides following a recommendation link. The Sentence Overlap and Cited By recommenders do much better than the rest. These recommenders often produce documents related to the current document and that are more recent (and often, most detailed). Cited By helps researchers find newer/more papers related to area of interest. Advanced Database Systems Raghav Karumur Spring 2011

Results and Discussion REFEREE uses metrics that evaluate how the recommender system affects user behavior. PD-Live exploits both content-based (to limit the population of documents to recommend)and collaborative similarity measures (to order them). PD-Live has great flexibility and corresponds well to the problem of recommending items to a user in context. The prototype version makes reasonably good recommendations compared to several content-based recommenders, has good performance, and appears to have much potential for improvement. Advanced Database Systems Raghav Karumur Spring 2011

Future Research REFEREE has the potential to stimulate research in at least three areas: Recommender Systems that merge content and collaborative data Systems that must make tradeoffs to sacrifice accuracy and elegance for speed and memory savings, and Systems that recommend based on current user actions (context) Advanced Database Systems Raghav Karumur Spring 2011

Advanced Database Systems Raghav Karumur Spring 2011

Thank You! Questions? Advanced Database Systems Raghav Karumur Spring 2011