Novelty and Redundancy Detection in Adaptive Filtering
Yi Zhang, Jamie Callan, Thomas Minka
Carnegie Mellon University

Outline
Introduction: task definition and related work
Building a filtering system
–Filtering system structure
–Redundancy measures
Experimental methodology
–Creating testing datasets
–Evaluation measures
Experimental results
Conclusion and future work

Task Definition
What users want in adaptive filtering: relevant and novel information, as soon as the document arrives.
Current filtering systems are relevance oriented.
–Optimization: deliver as much relevant information as possible
–Evaluation: relevance recall/precision; the system gets credit for relevant but redundant information

Related Work: First Story Detection (FSD) in TDT
There is no prior work on novelty detection in adaptive filtering.
Current research on FSD in TDT:
–Goal: identify the first story of an event
–Current performance: far from solved
FSD in TDT != novelty detection while filtering:
–Different assumptions about the definition of redundancy
–Unsupervised learning vs. supervised learning
–Novelty detection in filtering operates in a user-specified domain, and user information is available

Outline
Introduction: task definition and related work
Building a filtering system
–Filtering system structure
–Redundancy measures
Experimental methodology
–Creating testing datasets
–Evaluation measures
Experimental results
Conclusion and future work

Relevancy vs. Novelty
The user wants relevant and novel information. A contradiction?
–Relevant: deliver documents similar to previously delivered relevant documents
–Novel: deliver documents not similar to previously delivered relevant documents
Solution: a two-stage system
–Use different similarity measures to model relevancy and novelty

Two-Stage Filtering System
[Diagram: Stream of Documents → Relevance Filtering → Redundancy Filtering → documents labeled Novel or Redundant]
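A minimal sketch of this two-stage pipeline, assuming hypothetical `is_relevant` and `redundancy` callables and a fixed redundancy threshold; profile learning and threshold updating are omitted here.

```python
def filter_stream(documents, is_relevant, redundancy, r_threshold):
    """Two-stage adaptive filter: relevance filtering, then redundancy filtering."""
    delivered = []  # relevant, novel documents already shown to the user
    for d in documents:
        if not is_relevant(d):  # stage 1: relevance filtering
            continue
        # stage 2: score d against previously delivered documents
        scores = [redundancy(d, d_i) for d_i in delivered]
        if not scores or max(scores) < r_threshold:
            delivered.append(d)
            yield d  # below threshold on every comparison -> novel, deliver
        # otherwise d is judged redundant and suppressed
```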

Two Problems for Novelty Detection
Input:
–The sequence of documents the user reads
–User feedback
Redundancy measure (our current focus):
–Measures the redundancy of the current document against previously seen documents
–Profile-specific, any-time updating of the redundancy/novelty measure
Thresholding:
–Only a document with a redundancy score below the threshold is considered novel

Redundancy Measures
Use the similarity/distance/difference between two documents to measure redundancy.
Three types of document representation:
–Set difference
–Geometric distance (cosine similarity)
–Distributional similarity (language models)

Set Difference
Main idea:
–Boolean bag-of-words representation
–Use smoothing to add frequent words to the document representation
Algorithm:
–w_j ∈ Set(d) iff Count(w_j, d) > k, where Count(w_j, d) = λ1·tf(w_j, d) + λ3·rdf(w_j) + λ2·df(w_j)
–Use the number of new words in d_t to measure novelty: R(d_t | d_i) = −|Set(d_t) \ Set(d_i)|
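A literal Python reading of the set-difference measure, with the smoothed count reduced to a plain frequency cutoff; the λ-weighted combination of tf, df, and rdf is not reproduced, so this is an illustration rather than the exact algorithm.

```python
from collections import Counter

def word_set(doc_tokens, k=1):
    """Boolean bag of words: keep words occurring more than k times."""
    counts = Counter(doc_tokens)
    return {w for w, c in counts.items() if c > k}

def set_redundancy(d_t, d_i, k=1):
    """R(d_t | d_i) = -|Set(d_t) \\ Set(d_i)|: fewer new words means more redundant."""
    new_words = word_set(d_t, k) - word_set(d_i, k)
    return -len(new_words)
```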

Geometric Distance
Main idea:
–Basic vector-space approach
Algorithm:
–Represent a document as a vector whose weight in each dimension is the tf·idf score of the corresponding word
–Use cosine similarity to measure redundancy: R(d_t | d_i) = cosine(d_t, d_i)
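A small sketch of the cosine measure, assuming a precomputed word-to-idf mapping; the helper names are illustrative.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, idf):
    """tf*idf weight for each word in one document (idf: word -> idf score)."""
    return {w: c * idf.get(w, 0.0) for w, c in Counter(doc_tokens).items()}

def cosine_redundancy(d_t, d_i, idf):
    """R(d_t | d_i) = cosine(d_t, d_i) over tf*idf vectors."""
    v, u = tfidf_vector(d_t, idf), tfidf_vector(d_i, idf)
    dot = sum(w_v * u.get(w, 0.0) for w, w_v in v.items())
    norm = math.sqrt(sum(x * x for x in v.values())) * math.sqrt(sum(x * x for x in u.values()))
    return dot / norm if norm else 0.0
```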

Distributional Similarity (1)
Main idea:
–Unigram language models
Algorithm:
–Represent a document d as a word distribution θ_d
–Measure the redundancy/novelty between two documents by the Kullback-Leibler (KL) divergence of the corresponding distributions: R(d_t | d_i) = −KL(θ_dt || θ_di)
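A sketch of the KL-based measure; a small epsilon keeps the divergence finite, standing in for the proper smoothing schemes described on the next slide.

```python
import math
from collections import Counter

def unigram_lm(doc_tokens, vocab, eps=1e-6):
    """Near-MLE unigram model; eps is a placeholder for real smoothing."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def kl_redundancy(d_t, d_i, vocab):
    """R(d_t | d_i) = -KL(theta_dt || theta_di)."""
    p, q = unigram_lm(d_t, vocab), unigram_lm(d_i, vocab)
    return -sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
```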

Distributional Similarity (2): Smoothing
Why smoothing:
–A maximum likelihood estimate of θ_d makes KL(θ_dt || θ_di) infinite because of unseen words
–Smoothing makes the language model estimate more accurate
Smoothing algorithms for θ_d:
–Bayesian smoothing using Dirichlet priors (Zhai & Lafferty, SIGIR '01)
–Smoothing using shrinkage (McCallum, ICML '98)
–A mixture-model-based smoothing
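Of these, Dirichlet-prior smoothing has a standard closed form, p(w|d) = (c(w,d) + μ·p(w|C)) / (|d| + μ). A sketch, assuming a precomputed collection language model:

```python
from collections import Counter

def dirichlet_smoothed_lm(doc_tokens, collection_lm, mu=1000.0):
    """Bayesian smoothing with a Dirichlet prior (Zhai & Lafferty, SIGIR '01):
    p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    return {w: (counts.get(w, 0) + mu * p_c) / (n + mu)
            for w, p_c in collection_lm.items()}
```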

A Mixture Model: Relevancy vs. Novelty
Three component models generate a document:
–M_T: θ_T, the topic model
–M_E: θ_E, general English
–M_I: θ_d,core, the new information in the document
[Diagram: a document is generated as a weighted mixture of θ_E, θ_T, and θ_d,core]
Relevancy detection focuses on learning θ_T; redundancy detection focuses on learning θ_d,core.
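One way θ_d,core could be estimated is EM with the general-English and topic models held fixed; the mixing weights and iteration count below are illustrative assumptions, not values from the paper.

```python
from collections import Counter

def em_core_model(doc_tokens, theta_E, theta_T, lambdas=(0.6, 0.3, 0.1), iters=20):
    """Estimate theta_d_core by EM, holding theta_E and theta_T fixed.
    lambdas = (lam_E, lam_T, lam_core) are assumed mixing weights."""
    counts = Counter(doc_tokens)
    vocab = list(counts)
    core = {w: 1.0 / len(vocab) for w in vocab}  # uniform initialization
    lam_E, lam_T, lam_C = lambdas
    for _ in range(iters):
        # E-step: posterior that each occurrence of w came from the core model
        post = {w: lam_C * core[w] / (lam_C * core[w]
                                      + lam_E * theta_E.get(w, 1e-9)
                                      + lam_T * theta_T.get(w, 1e-9))
                for w in vocab}
        # M-step: re-estimate theta_d_core from the expected counts
        total = sum(counts[w] * post[w] for w in vocab)
        core = {w: counts[w] * post[w] / total for w in vocab}
    return core
```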

Outline
Introduction: task definition and related work
Building a filtering system
–Filtering system structure
–Redundancy measures
Experimental methodology
–Creating testing datasets
–Evaluation measures
Experimental results
Conclusion and future work

A New Evaluation Dataset: APWSJ
Combine AP and WSJ to get a corpus likely to contain redundant documents.
Hired undergraduates read all relevant documents, chronologically sorted, and judged:
–Whether each document is redundant
–If yes, which previously seen documents make it redundant
Two degrees of redundancy: absolutely redundant vs. somewhat redundant.
Judgments were adjudicated by two assessors.

Another Evaluation Dataset: TREC Interactive Data
Combine the TREC-6, TREC-7, and TREC-8 interactive datasets (20 TREC topics).
Each topic contains several aspects; NIST assessors identified the aspects covered by each document.
Assume d_t is redundant if all aspects related to d_t have already been covered by documents the user has seen.
–A strong assumption about what is novel/redundant
–Can still provide useful information
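A minimal sketch of that labeling rule, assuming each document is reduced to the set of aspects assessors assigned it:

```python
def label_stream(docs_aspects):
    """Label a chronologically sorted stream: a document is redundant iff
    every aspect it covers was already covered by earlier documents."""
    covered, labels = set(), []
    for aspects in docs_aspects:
        labels.append("redundant" if aspects and aspects <= covered else "novel")
        covered |= aspects
    return labels
```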

Evaluation Methodology (1)
Four components of an adaptive filtering system:
–Relevancy measure
–Relevance threshold
–Redundancy measure
–Redundancy threshold
Goal: focus on the redundancy measures and avoid the influence of the other parts of the filtering system.
–Assume a perfect relevancy detection stage, to remove the influence of that stage
–Use 11-point average recall-precision graphs, to remove the influence of the thresholding module

Evaluation Methodology (2)

                 Redundant   Non-redundant
Delivered           R+            N+
Not delivered       R-            N-
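Reading the table, a natural pair of measures (an interpretation of the table, not stated on the slide) is redundancy recall RR = R- / (R- + R+), the fraction of truly redundant documents the system suppressed, and redundancy precision RP = R- / (R- + N-), the fraction of suppressed documents that were truly redundant.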

Outline
Introduction: task definition and related work
Building a filtering system
–Filtering system structure
–Redundancy measures
Experimental methodology
–Creating testing datasets
–Evaluation measures
Experimental results
Conclusion and future work

Comparing Different Redundancy Measures on Two Datasets
–The cosine measure is consistently good
–The mixture language model works much better than the other LM approaches

Mistakes After Thresholding

Measure                Absolutely or somewhat redundant   Absolutely redundant only
Set Distance                        43.5%                          28%
Cosine Distance                     28.1%                          18.7%
Shrinkage (LM)                      44.3%                          21%
Dirichlet Prior (LM)                42.4%                          21%
Mixture Model (LM)                  27.4%                          16.7%

A simple thresholding algorithm makes the system complete.
Learning the user's preference is important.
Similar results for the interactive track data appear in the paper.

Outline
Introduction: task definition and related work
Building a filtering system
–Filtering system structure
–Redundancy measures
Experimental methodology
–Creating testing datasets
–Evaluation measures
Experimental results
Conclusion and future work

Conclusion: Our Contributions
Novelty/redundancy detection in an adaptive filtering system:
–A two-stage approach
Reasonably good at identifying redundant documents:
–Cosine similarity
–Mixture language model
Factors affecting accuracy:
–Accuracy at finding relevant documents
–The redundancy measure
–The redundancy threshold

Future Work
Cosine similarity is far from optimal (symmetric vs. asymmetric measures).
Feature engineering: time, source, author, named entities, …
Better novelty measures:
–Document-document distance vs. document-cluster distance (?)
–Depend on the user: what is novel/redundant for this user?
Learning the user's redundancy preferences:
–Thresholding: the sparse training data problem

Appendix: Threshold Algorithm
Initialize Rthreshold so that only near-duplicates count as redundant.
For each delivered document d_t:
    If the user marked d_t redundant and R(d_t) > max R(d_i) over all delivered relevant documents d_i:
        Rthreshold = R(d_t)
    Else:
        Rthreshold = Rthreshold − (Rthreshold − R(d_t)) / 10
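The same rule as a runnable Python step, reading the slide's "argmax" as the maximum redundancy score among delivered relevant documents (an interpretation; the pseudocode is terse):

```python
def update_threshold(r_threshold, r_dt, user_marked_redundant, delivered_scores):
    """One update step. delivered_scores holds R(d_i) for all previously
    delivered relevant documents d_i; r_dt is R(d_t) for the current one."""
    if user_marked_redundant and delivered_scores and r_dt > max(delivered_scores):
        return r_dt  # a user-confirmed redundant doc scored this high: tighten
    return r_threshold - (r_threshold - r_dt) / 10.0  # otherwise drift toward R(d_t)
```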