Slide 1: LM Approaches to Filtering
Richard Schwartz, BBN
LM/IR ARDA 2002, September 11-12, 2002, UMASS

Slide 2: Topics
• The LM approach
  – What is it?
  – Why is it preferred?
• Controlling the filtering decision

Slide 3: What is the LM Approach?
• We distinguish 'statistical' approaches in general from 'probabilistic' approaches.
• The tf-idf metric computes various statistics of words and documents.
• By 'probabilistic' approaches, we (I) mean methods in which we compute the probability that a document is relevant to the user's need, given the query, the document, and the rest of the world, using a formula that arguably computes P(Doc is Relevant | Query, Document, Collection, etc.).
• If we use Bayes' rule, we end up with a prior for each document, P(Doc is Relevant | everything except the Query), and the likelihood of the query, P(Q | Doc is Relevant), as written out below.
• The LM approach is a solution to the second part of this (the query likelihood).
• The prior probability component is also important.
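Written out, the decomposition referred to on this slide looks like the following; the notation (R for "Doc is Relevant") is mine, and for brevity the conditioning on the collection and "everything else" is folded into D:

```latex
% Bayes' rule applied to the relevance posterior (notation assumed, not from the slides)
P(R \mid Q, D) \;=\; \frac{P(Q \mid R, D)\, P(R \mid D)}{P(Q \mid D)}
\;\propto\; \underbrace{P(Q \mid R, D)}_{\text{query likelihood (the LM part)}}
\;\times\; \underbrace{P(R \mid D)}_{\text{document prior}}
```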

Slide 4: What It Is Not
• If we compute an LM for the query and an LM for the document and ask for the probability that the two underlying LMs are the same, I would NOT call this a posterior probability model.
• The two LMs would not be expected to be the same even with long queries.

Slide 5: Issues in LM Approaches to Filtering
• We (ideally) have three sets of documents:
  – Positive documents
  – Negative documents
  – A large corpus of unknown (mostly negative) documents
• We can estimate a model for both the positive and the negative documents.
  – We can find more positive documents in the large corpus.
  – We use the large corpus to smooth the models estimated from the positive and negative documents.
• We compute the probability of each new document given each of the two models.
• The log of the ratio of these two likelihoods is a score that indicates whether the document is positive or negative (see the sketch below).
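A minimal sketch of the log-likelihood-ratio scoring described on this slide, assuming smoothed unigram models over tokenized documents. The function names, the interpolation weight lam, and the toy data are illustrative assumptions, not details of the BBN system:

```python
# Hypothetical sketch of log-likelihood-ratio filtering with smoothed unigram LMs.
import math
from collections import Counter

def background_model(docs):
    """Unigram distribution over the large corpus, with add-one smoothing."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return lambda w: (counts[w] + 1) / (total + len(counts))

def unigram_model(docs, background, lam=0.5):
    """Topic unigram model, interpolated with the background distribution:
    p(w) = lam * p_topic(w) + (1 - lam) * p_background(w)."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return lambda w: lam * (counts[w] / total if total else 0.0) + (1 - lam) * background(w)

def llr_score(doc, p_pos, p_neg):
    """log P(doc | positive model) - log P(doc | negative model)."""
    return sum(math.log(p_pos(w)) - math.log(p_neg(w)) for w in doc)

# Toy usage: documents are lists of tokens.
corpus   = [["the", "market", "fell"], ["the", "plane", "landed", "safely"]]
positive = [["plane", "crash", "investigation"], ["crash", "site", "search"]]
negative = [["market", "rally", "continues"]]

p_bg  = background_model(corpus + positive + negative)
p_pos = unigram_model(positive, p_bg)
p_neg = unigram_model(negative, p_bg)
print(llr_score(["plane", "crash", "reported"], p_pos, p_neg))  # > 0 suggests on-topic
```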

Slide 6: Language Modeling Choices
• We can model the probability of the document given the topic in many ways.
• A simple unigram mixture works surprisingly well.
  – A weighted mixture of the distributions from the topic training data and the full corpus
• We improve significantly over the 'naïve Bayes' model by using the Estimate-Maximize (EM) technique (see the sketch below).
• We can extend the model in many ways:
  – N-gram models of words
  – Phrases: proper names, collocations
• Because we use a formal generative model, we know how to incorporate any effect we want.
  – E.g., the probability of features of the top-5 documents given that some document is relevant
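A minimal sketch of estimating the topic/corpus mixture weight with EM, as suggested on this slide. The component distributions p_topic and p_bg are assumed to be given callables returning word probabilities (for example, the unsmoothed topic and background models from the previous sketch); the names and defaults are illustrative rather than the actual BBN formulation:

```python
# Hypothetical EM sketch for the weight of a two-component unigram mixture:
# p(w) = lam * p_topic(w) + (1 - lam) * p_bg(w)
def em_mixture_weight(docs, p_topic, p_bg, lam=0.5, iters=20):
    """Fit the mixture weight lam by EM on a set of tokenized documents."""
    words = [w for d in docs for w in d]
    for _ in range(iters):
        # E-step: posterior probability that each token came from the topic component
        post = [lam * p_topic(w) / (lam * p_topic(w) + (1 - lam) * p_bg(w)) for w in words]
        # M-step: the new weight is the average topic responsibility
        lam = sum(post) / len(post)
    return lam
```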

Slide 7: How to Set the Threshold
• For filtering, we are required to make a hard decision about whether to accept each document, rather than just rank the documents.
• Problems:
  – The score for a particular document depends on many factors that are not important for the decision:
    - Length of the document
    - Percentage of low-likelihood words
  – The range of scores depends on the particular topic.
• We would like to map the score for any document and topic into a real posterior probability (one illustrative calibration is sketched below).
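One common way to do this mapping, offered only as an illustration and not as the method actually used in this work, is to fit a logistic calibration from scores to relevance labels on held-out documents:

```python
# Hypothetical logistic calibration: P(relevant | score) = sigmoid(a*score + b),
# fit by simple gradient ascent on labeled held-out scores. Names are illustrative.
import math

def fit_logistic(scores, labels, lr=0.1, epochs=500):
    """Return a function mapping a raw/normalized score to a posterior probability."""
    a, b = 0.0, 0.0
    for _ in range(epochs):
        for s, y in zip(scores, labels):       # y is 1 for relevant, 0 for not
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            a += lr * (y - p) * s
            b += lr * (y - p)
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))
```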

Slide 8: Score Normalization Techniques
• By using the relative score of the two models, we remove some of the variance due to the particular document.
• We can normalize for the peculiarities of the topic by computing the distribution of scores for off-topic documents (see the sketch below).
• Advantages of using off-topic documents:
  – We have a very large number of them.
  – We can fix the probability of false alarms.
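A minimal sketch of normalizing against the off-topic score distribution and fixing the false-alarm probability, as described on this slide. The z-scoring and the percentile-based threshold are illustrative choices, not necessarily what the BBN system does:

```python
# Hypothetical topic-dependent normalization against off-topic scores.
import statistics

def normalize_score(off_topic_scores, score):
    """Map a raw score to a z-score relative to off-topic documents for this topic."""
    mu = statistics.mean(off_topic_scores)
    sigma = statistics.stdev(off_topic_scores)
    return (score - mu) / sigma

def threshold_for_false_alarm_rate(off_topic_scores, fa_rate=0.01):
    """Pick the threshold so that only fa_rate of off-topic documents exceed it."""
    ordered = sorted(off_topic_scores)
    idx = int((1.0 - fa_rate) * (len(ordered) - 1))
    return ordered[idx]
```

Because the pool of off-topic documents is large, the threshold can be read directly off the empirical score distribution, which is what fixes the false-alarm probability.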

Slide 9: The Bottom Line
• For TDT tracking, the probabilistic approach to modeling the document and to score normalization results in better performance, whether for mono-language, cross-language, or speech-recognition output, etc.
• A large improvement will come after multiple sites start using similar techniques.

Slide 10: Grand Challenges
• Tested in TDT:
  – Operating with small amounts of training data for each category
    - 1 to 4 documents per event
  – Robustness to changes over time
    - Adaptation
  – Multi-lingual domains
  – How to set the threshold for filtering
  – Using a model of 'eventness'
• Large hierarchical category sets
  – How to use the structure
• Effective use of prior knowledge
• Predicting performance and characterizing classes
• We need a task where both the discriminative and the LM approaches will be tested.

Slide 11: What Do You Really Want?
• If a user provides a document about the 9/11 World Trade Center crash and says they want "more like this", do they want documents about:
  – Airplane crashes
  – Terrorism
  – Building fires
  – Injuries and death
  – Some combination of the above?
• In general, we need a way to clarify which combination of topics the user wants.
• In TDT, we predefine the task to mean that we want more about this specific event (and not about some other terrorist airplane crash into a building).

