1 LM Approaches to Filtering Richard Schwartz, BBN LM/IR ARDA 2002 September 11-12, 2002 UMASS

2 Topics
 LM approach
 – What is it?
 – Why is it preferred?
 Controlling the filtering decision

3 What is the LM Approach?
 We distinguish all 'statistical' approaches from 'probabilistic' approaches.
 The tf-idf metric computes various statistics of words and documents.
 By 'probabilistic' approaches, we (I) mean methods that compute the probability of a document being relevant to a user's need, given the query, the document, and the rest of the world, using a formula that arguably computes P(Doc is Relevant | Query, Document, Collection, etc.)
 If we use Bayes' rule, we end up with a prior for each document, P(Doc is Relevant | Everything except Query), and the likelihood of the query, P(Query | Doc is Relevant)
 The LM approach is a solution to the second part (the query likelihood).
 The prior probability component is also important.
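The Bayes' rule step on this slide can be written out explicitly (a sketch of the decomposition; here Rel abbreviates "Doc is Relevant", Q the query, and D the document):

```latex
P(\mathrm{Rel} \mid Q, D)
  \;=\; \frac{P(Q \mid \mathrm{Rel}, D)\, P(\mathrm{Rel} \mid D)}{P(Q \mid D)}
  \;\propto\;
  \underbrace{P(Q \mid \mathrm{Rel}, D)}_{\text{query likelihood (the LM part)}}
  \;\times\;
  \underbrace{P(\mathrm{Rel} \mid D)}_{\text{document prior}}
```

The denominator P(Q | D) does not depend on relevance, so it can be dropped for ranking purposes.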

4 What it is not
 If we compute a LM for the query and a LM for a document, and ask the probability that the two underlying LMs are the same, I would NOT call this a posterior probability model.
 The LMs would not be expected to be the same, even with long queries.

5 Issues in LM Approaches for Filtering
 We (ideally) have three sets of documents:
 – Positive documents
 – Negative documents
 – A large corpus of unknown (mostly negative) documents
 We can estimate a model for both positive and negative documents
 – We can find more positive documents in the large corpus
 – We use the large corpus to smooth the models estimated from the positive and negative documents
 We compute the probability of each new document given each of the models
 The log of the ratio of these two likelihoods is a score that indicates whether the document is positive or negative.
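The scoring step on this slide can be sketched in a few lines. This is a minimal illustration, not BBN's implementation: the smoothing weight `LAMBDA` and the floor probability are hypothetical choices, and `unigram_model` is an illustrative helper name.

```python
# Sketch: log-likelihood-ratio scoring with corpus-smoothed unigram models.
import math
from collections import Counter

LAMBDA = 0.5  # hypothetical interpolation weight toward the topic model

def unigram_model(docs):
    """Maximum-likelihood unigram distribution over a list of token lists."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}, total

def smoothed_prob(w, model, background):
    """Interpolate a topic model with the large-corpus background model."""
    return LAMBDA * model.get(w, 0.0) + (1 - LAMBDA) * background.get(w, 1e-6)

def llr_score(doc, pos_model, neg_model, background):
    """log P(doc | positive) - log P(doc | negative); > 0 favors 'positive'."""
    score = 0.0
    for w in doc:
        score += math.log(smoothed_prob(w, pos_model, background))
        score -= math.log(smoothed_prob(w, neg_model, background))
    return score
```

Smoothing with the background corpus is what keeps the log of a zero probability from occurring when a word appears in only one of the two training sets.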

6 Language Modeling Choices
 We can model the probability of the document given the topic in many ways.
 A simple unigram mixture works surprisingly well.
 – Weighted mixture of distributions from the topic training data and the full corpus
 We improve significantly over the 'naïve Bayes' model by using the Expectation-Maximization (EM) technique
 We can extend the model in many ways:
 – N-gram models of words
 – Phrases: proper names, collocations
 Because we use a formal generative model, we know how to incorporate any effect we want.
 – E.g., the probability of features of the top-5 documents given that some document is relevant
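One concrete instance of the EM improvement mentioned above is fitting the topic/corpus mixture weight with EM instead of fixing it by hand. The sketch below assumes a two-component unigram mixture; the function name, floor probabilities, and iteration count are illustrative, not from the talk.

```python
# Sketch: EM estimation of the mixture weight lambda in
#   p(w) = lam * topic(w) + (1 - lam) * corpus(w)
def em_mixture_weight(doc, topic, corpus, iters=20, lam=0.5):
    """Fit the topic-mixture weight on one tokenized document."""
    for _ in range(iters):
        # E-step: posterior that each token was generated by the topic model
        resp = []
        for w in doc:
            t = lam * topic.get(w, 1e-9)
            c = (1 - lam) * corpus.get(w, 1e-9)
            resp.append(t / (t + c))
        # M-step: the new weight is the mean responsibility
        lam = sum(resp) / len(resp)
    return lam
```

A document dominated by topical vocabulary should receive a larger fitted weight than one dominated by common corpus words.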

7 How to Set the Threshold
 For filtering, we are required to make a hard decision about whether to accept each document, rather than just rank the documents.
 Problems:
 – The score for a particular document depends on many factors that are not important for the decision:
   Length of the document
   Percentage of low-likelihood words
 – The range of scores depends on the particular topic.
 We would like to map the score for any document and topic into a real posterior probability
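The desired mapping from score to posterior probability can be sketched directly from Bayes' rule: combine the log-likelihood ratio with a prior probability of relevance. The function name and the default prior below are hypothetical illustrations, not values from the talk.

```python
# Sketch: turning a log-likelihood-ratio score into a posterior probability.
import math

def posterior_from_llr(llr, prior=0.01):
    """P(relevant | doc) from llr = log[P(doc|pos)/P(doc|neg)] and a prior."""
    prior_odds = prior / (1 - prior)
    log_odds = llr + math.log(prior_odds)   # posterior log-odds
    return 1 / (1 + math.exp(-log_odds))    # logistic squash to [0, 1]
```

With a calibrated posterior, the accept/reject threshold becomes a fixed probability cutoff rather than a per-topic score tuned by hand.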

8 Score Normalization Techniques
 By using the relative score for two models, we remove some of the variance due to the particular document.
 We can normalize for the peculiarities of the topic by computing the distribution of scores for Off-Topic documents.
 Advantages of using Off-Topic documents:
 – We have a very large number of them
 – We can fix the probability of false alarms
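Both uses of the Off-Topic score distribution can be sketched as follows: standardizing a raw score against that distribution, and placing the threshold so that only a chosen fraction of off-topic documents exceed it. The function names and the false-alarm target are hypothetical.

```python
# Sketch: per-topic normalization against the Off-Topic score distribution.
import statistics

def znorm(score, off_topic_scores):
    """Map a raw score to standard deviations above the off-topic mean."""
    mu = statistics.mean(off_topic_scores)
    sigma = statistics.stdev(off_topic_scores)
    return (score - mu) / sigma

def threshold_at_false_alarm(off_topic_scores, fa_rate=0.01):
    """Pick the score that roughly fa_rate of off-topic documents exceed."""
    ranked = sorted(off_topic_scores)
    idx = int((1 - fa_rate) * len(ranked))
    return ranked[min(idx, len(ranked) - 1)]
```

Because off-topic documents are plentiful, both the mean/variance estimate and the quantile estimate are stable, which is what makes fixing the false-alarm probability practical.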

9 The Bottom Line
 For TDT tracking, the probabilistic approach to modeling the document and to score normalization results in better performance, whether for monolingual text, cross-language retrieval, speech recognition output, etc.
 Large improvements will come after multiple sites start using similar techniques.

10 Grand Challenges
 Tested in TDT:
 – Operating with small amounts of training data for each category (1 to 4 documents per event)
 – Robustness to changes over time (adaptation)
 – Multi-lingual domains
 – How to set the threshold for filtering
 – Using a model of 'eventness'
 Large hierarchical category sets
 – How to use the structure
 Effective use of prior knowledge
 Predicting performance and characterizing classes
 We need a task where both the discriminative and the LM approaches will be tested.

11 What do you really want?
 If a user provides a document about the 9/11 World Trade Center crash and says they want "more like this", do they want documents about:
 – Airplane crashes
 – Terrorism
 – Building fires
 – Injuries and death
 – Some combination of the above?
 In general, we need a way to clarify which combination of topics the user wants
 In TDT, we predefine the task to mean that we want more about this specific event (and not about some other terrorist airplane crash into a building).