Information Filtering LBSC 796/INFM 718R Douglas W. Oard Session 10, April 13, 2011.

Slides:



Advertisements
Similar presentations
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Advertisements

Text Categorization.
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Chapter 5: Introduction to Information Retrieval
Supervised Learning Techniques over Twitter Data Kleisarchaki Sofia.
CSCI 347 / CS 4206: Data Mining Module 07: Implementations Topic 03: Linear Models.
Ping-Tsun Chang Intelligent Systems Laboratory Computer Science and Information Engineering National Taiwan University Text Mining with Machine Learning.
Introduction to Information Retrieval (Part 2) By Evren Ermis.
Intelligent Information Retrieval 1 Vector Space Model for IR: Implementation Notes CSC 575 Intelligent Information Retrieval These notes are based, in.
Search Engines and Information Retrieval
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
Information Retrieval Review
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
1 CS 430 / INFO 430 Information Retrieval Lecture 8 Query Refinement: Relevance Feedback Information Filtering.
Carnegie Mellon 1 Maximum Likelihood Estimation for Information Thresholding Yi Zhang & Jamie Callan Carnegie Mellon University
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
Evidence from Behavior LBSC 796/CMSC 828o Douglas W. Oard Session 5, February 23, 2004.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Information Retrieval in Practice
SIMS 202 Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1.
Vector Space Model CS 652 Information Extraction and Integration.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Multimedia Databases Text II. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
Parallel and Distributed IR
Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)
Information Filtering LBSC 796/INFM 718R Douglas W. Oard Session 10, November 12, 2007.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
1 Vector Space Model Rong Jin. 2 Basic Issues in A Retrieval Model How to represent text objects What similarity function should be used? How to refine.
Search Engines and Information Retrieval Chapter 1.
Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.
1 Information Filtering & Recommender Systems (Lecture for CS410 Text Info Systems) ChengXiang Zhai Department of Computer Science University of Illinois,
Processing of large document collections Part 2 (Text categorization, term selection) Helena Ahonen-Myka Spring 2005.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.
Filtering and Recommendation INST 734 Module 9 Doug Oard.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Information Filtering LBSC 878 Douglas W. Oard and Dagobert Soergel Week 10: April 12, 1999.
Machine Learning in Ad-hoc IR. Machine Learning for ad hoc IR We’ve looked at methods for ranking documents in IR using factors like –Cosine similarity,
Chapter 6: Information Retrieval and Web Search
1 Computing Relevance, Similarity: The Vector Space Model.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
CPSC 404 Laks V.S. Lakshmanan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides at UC-Berkeley.
June 5, 2006University of Trento1 Latent Semantic Indexing for the Routing Problem Doctorate course “Web Information Retrieval” PhD Student Irina Veredina.
Relevance Feedback Hongning Wang What we have learned so far Information Retrieval User results Query Rep Doc Rep (Index) Ranker.
SINGULAR VALUE DECOMPOSITION (SVD)
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Filtering and Recommendation INST 734 Module 9 Doug Oard.
A Novel Visualization Model for Web Search Results Nguyen T, and Zhang J IEEE Transactions on Visualization and Computer Graphics PAWS Meeting Presented.
Web Search and Text Mining Lecture 5. Outline Review of VSM More on LSI through SVD Term relatedness Probabilistic LSI.
Relevance Feedback Hongning Wang
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Query Refinement and Relevance Feedback.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Information Retrieval Inverted Files.. Document Vectors as Points on a Surface Normalize all document vectors to be of length 1 Define d' = Then the ends.
Roughly overview of Support vector machines Reference: 1.Support vector machines and machine learning on documents. Christopher D. Manning, Prabhakar Raghavan.
Information Retrieval in Practice
Information Retrieval (in Practice)
Multimedia Information Retrieval
Relevance Feedback Hongning Wang
Text Categorization Assigning documents to a fixed set of categories
Chapter 5: Information Retrieval and Web Search
CS 4501: Information Retrieval
Presentation transcript:

Information Filtering LBSC 796/INFM 718R Douglas W. Oard Session 10, April 13, 2011

Information Access Problems Collection Information Need Stable Different Each Time Data Mining Retrieval Filtering Different Each Time

Information Filtering An abstract problem in which: –The information need is stable Characterized by a “profile” –A stream of documents is arriving Each must either be presented to the user or not Introduced by Luhn in 1958 –As “Selective Dissemination of Information” Named “Filtering” by Denning in 1975

Information Filtering User Profile Matching New Documents Recommendation Rating

Standing Queries Use any information retrieval system –Boolean, vector space, probabilistic, … Have the user specify a “standing query” –This will be the profile Limit the standing query by date –Each use, show what arrived since the last use

What’s Wrong With That? Unnecessary indexing overhead –Indexing only speeds up retrospective searches Every profile is treated separately –The same work might be done repeatedly Forming effective queries by hand is hard –The computer might be able to help It is OK for text, but what about audio, video, … –Are words the only possible basis for filtering?

Stream Search: “Fast Data Finder” Boolean filtering using custom hardware –Up to 10,000 documents per second (in 1996!) Words pass through a pipeline architecture –Each element looks for one word goodpartygreataid OR AND NOT

Profile Indexing (SIFT) Build an inverted file of profiles –Postings are profiles that contain each term RAM can hold 5 million profiles/GB –And several machines could run in parallel Both Boolean and vector space matching –User-selected threshold for each ranked profile Hand-tuned on a web page using today’s news

Profile Indexing Limitations Privacy –Central profile registry, associated with known users Usability –Manual profile creation is time consuming May not be kept up to date –Threshold values vary by topic and lack “meaning”

Vector space example: query “canine” (1) Source: Fernando D íaz

Similarity of docs to query “canine” Source: Fernando D íaz

User feedback: Select relevant documents Source: Fernando D íaz

Results after relevance feedback Source: Fernando D íaz

Rocchio’ illustrated : centroid of relevant documents

Rocchio’ illustrated does not separate relevant / nonrelevant.

Rocchio’ illustrated centroid of nonrelevant documents.

Rocchio’ illustrated - difference vector

Rocchio’ illustrated Add difference vector to …

Rocchio’ illustrated … to get

Rocchio’ illustrated separates relevant / nonrelevant perfectly.

Rocchio’ illustrated separates relevant / nonrelevant perfectly.

Adaptive Content-Based Filtering Make Profile Vector Compute Similarity Select and Examine (user) Assign Ratings (user) Update User Model New Documents Vector Documents, Vectors, Rank Order Document, Vector Rating, Vector Vector(s) Make Document Vectors Initial Profile Terms Vectors

Latent Semantic Indexing SVD Reduce Dimensions Make Profile Vector Make Document Vectors Compute Similarity Select and Examine (user) Assign Ratings (user) Update User Model Representative Documents Sparse Vectors Matrix New Documents Sparse Vector Dense Vector Documents, Dense Vectors, Rank Order Document, Dense Vector Rating, Dense Vector Dense Vector(s) Reduce Dimensions Make Document Vectors Matrix Initial Profile Terms Sparse Vectors Dense Vectors

Content-Based Filtering Challenges IDF estimation –Unseen profile terms would have infinite IDF! –Incremental updates, side collection Interaction design –Score threshold, batch updates Evaluation –Residual measures

Machine Learning for User Modeling All learning systems share two problems –They need some basis for making predictions This is called an “inductive bias” –They must balance adaptation with generalization

Machine Learning Techniques Hill climbing (Rocchio) Instance-based learning (kNN) Rule induction Statistical classification Regression Neural networks Genetic algorithms

Rule Induction Automatically derived Boolean profiles –(Hopefully) effective and easily explained Specificity from the “perfect query” –AND terms in a document, OR the documents Generality from a bias favoring short profiles –e.g., penalize rules with more Boolean operators –Balanced by rewards for precision, recall, …

Statistical Classification Represent documents as vectors –Usual approach based on TF, IDF, Length Build a statistical models of rel and non-rel –e.g., (mixture of) Gaussian distributions Find a surface separating the distributions –e.g., a hyperplane Rank documents by distance from that surface

Linear Separators Which of the linear separators is optimal? Original from Ray Mooney

Maximum Margin Classification Implies that only support vectors matter; other training examples are ignorable. Original from Ray Mooney

Soft Margin SVM ξiξi ξiξi Original from Ray Mooney

Non-linear SVMs Φ: x → φ(x) Original from Ray Mooney

Training Strategies Overtraining can hurt performance –Performance on training data rises and plateaus –Performance on new data rises, then falls One strategy is to learn less each time –But it is hard to guess the right learning rate Usual approach: Split the training set –Training, DevTest for finding “new data” peak

NetFlix Challenge

Effect of Inventory Costs

Spam Filtering Adversarial IR –Targeting, probing, spam traps, adaptation cycle Compression-based techniques Blacklists and whitelists –Members-only mailing lists, zombies Identity authentication –Sender ID, DKIM, key management