Reading Notes Wang Ning Lab of Database and Information Systems

Modeling Score Distributions for Combining the Outputs of Search Engines
Reading Notes Wang Ning Lab of Database and Information Systems Dec 3rd, 2003

Revision History
- Nov. 30th, 2003: Draft
- Dec. 1st, 2003: Add all pictures
- Dec. 2nd, 2003: Add references

Literature Information
Title: Modeling Score Distributions for Combining the Outputs of Search Engines
Authors: R. Manmatha, T. Rath, F. Feng
Institution: Center for Intelligent Information Retrieval, University of Massachusetts
Conference: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Basic Idea
Meta search: combining results from search engines
Difficulties
- No architecture and algorithm information
- No score information
Previous work
- Linear combination of document ranks
- CombMIN, CombMAX, CombSUM, CombMNZ
The authors' idea
- Model the score distributions
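The score-combination rules listed under previous work (usually written CombMIN, CombMAX, CombSUM and CombMNZ) can be sketched as below. This is only an illustration: the function name `comb`, the dict-based input and the assumption that scores are already normalized per engine are not from the slides.

```python
from typing import Dict, List


def comb(scores_per_engine: List[Dict[str, float]], method: str) -> Dict[str, float]:
    """Combine per-engine scores for each document with one of the Comb* rules.

    scores_per_engine holds one dict per engine, mapping doc id -> score.
    Scores are assumed to be normalized to a common range beforehand.
    """
    docs = set().union(*scores_per_engine)
    combined = {}
    for doc in docs:
        # Scores from the engines that actually returned this document.
        found = [s[doc] for s in scores_per_engine if doc in s]
        if method == "CombMIN":
            combined[doc] = min(found)
        elif method == "CombMAX":
            combined[doc] = max(found)
        elif method == "CombSUM":
            combined[doc] = sum(found)
        elif method == "CombMNZ":
            # Sum of scores times the number of non-zero scores.
            nonzero = sum(1 for x in found if x != 0)
            combined[doc] = sum(found) * nonzero
        else:
            raise ValueError(f"unknown method: {method}")
    return combined


engine_a = {"d1": 0.9, "d2": 0.4}
engine_b = {"d1": 0.5, "d3": 0.8}
for name in ("CombMIN", "CombMAX", "CombSUM", "CombMNZ"):
    print(name, comb([engine_a, engine_b], name))
```

CombMNZ rewards documents that several engines retrieve, which is why `d1` gains the most under that rule in this toy input.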

Test Data
TREC (Text REtrieval Conference)
- TREC 3, TREC 4
- TREC 6 for Chinese documents
Search engines
- INQUERY (probabilistic model)
- CITY (probabilistic model)
- SMART (vector space model)
- Bellcore (LSI engine)

Model Assumptions
- The scores of non-relevant documents can be modeled with an exponential distribution
- The scores of relevant documents can be modeled with a Gaussian distribution
- Explanations and arguments come later
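Written out, the two assumptions correspond to the standard densities below. The symbols (x for a document score, λ for the exponential rate, μ and σ for the Gaussian mean and standard deviation) are notation chosen here, not taken from the slides.

```latex
p(x \mid \text{non-relevant}) = \lambda e^{-\lambda x}, \qquad x \ge 0

p(x \mid \text{relevant}) = \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```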

Non-relevant Documents: Exponential Distribution

Relevant Documents: Gaussian Distribution

Likelihood Function
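The slide body is not preserved in this transcript. As a reminder of the standard definition only (an assumption about what the slide showed), for independent samples x_1, ..., x_n drawn from a density p(x; Θ) the likelihood and log-likelihood are:

```latex
L(\Theta) = \prod_{i=1}^{n} p(x_i; \Theta)

\ell(\Theta) = \log L(\Theta) = \sum_{i=1}^{n} \log p(x_i; \Theta)
```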

MLE: Maximum Likelihood Estimate

Basic Idea of MLE
Intuition: the outcome that actually occurs is taken to be the one with the highest probability. The MLE of Θ is therefore the value of Θ under which the observed sample is most likely to occur.
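For the two distributions used in these notes, the maximum likelihood estimates have closed forms. The sketch below assumes the relevant and non-relevant scores are already separated (as with judged TREC data); the function names are illustrative.

```python
import math
from typing import List, Tuple


def fit_exponential(scores: List[float]) -> float:
    """MLE of the rate for p(x) = lam * exp(-lam * x): one over the sample mean."""
    return len(scores) / sum(scores)


def fit_gaussian(scores: List[float]) -> Tuple[float, float]:
    """MLE of (mu, sigma): the sample mean and the 1/n standard deviation."""
    n = len(scores)
    mu = sum(scores) / n
    var = sum((x - mu) ** 2 for x in scores) / n
    return mu, math.sqrt(var)


non_relevant = [0.05, 0.10, 0.15, 0.30]
relevant = [0.6, 0.7, 0.8]
print(fit_exponential(non_relevant))
print(fit_gaussian(relevant))
```

Note that the Gaussian MLE divides by n rather than n - 1, so it is slightly biased on small samples, which matters when a query has few relevant documents.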

Limitations of Gaussian Fit
- Fits well: sufficient relevant documents (>= 60)
- Fits badly: fewer relevant documents (usually)
Why?
- Model fault
- Lack of samples (the authors' point)
Solutions
- Bayesian analysis may work here

Mixture Model Fit

Mixture Model Fit (cont.)

EM: Expectation Maximization
Important parameter estimation method

EM Steps
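The slide body is not preserved here. A minimal sketch of EM for a two-component mixture of an exponential (non-relevant) and a Gaussian (relevant) follows. The initialization from the top-scoring fraction, the parameter names and the synthetic data are all assumptions made for illustration, not the authors' procedure.

```python
import math
import random
from typing import List, Tuple


def exp_pdf(x: float, lam: float) -> float:
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


def gauss_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_mixture(scores: List[float], iters: int = 200,
                top_frac: float = 0.1) -> Tuple[float, float, float, float]:
    """Return (w, lam, mu, sigma); w is the weight of the Gaussian component.

    Assumes positive scores. Initialization (an assumption): the top
    `top_frac` of scores seeds the Gaussian, the rest seeds the exponential.
    """
    xs = sorted(scores)
    n = len(xs)
    k = max(2, int(top_frac * n))
    low, high = xs[:n - k], xs[n - k:]
    lam = len(low) / sum(low)
    mu = sum(high) / k
    sigma = max(math.sqrt(sum((x - mu) ** 2 for x in high) / k), 1e-6)
    w = k / n
    for _ in range(iters):
        # E-step: responsibility of the Gaussian component for each score.
        resp = []
        for x in xs:
            g = w * gauss_pdf(x, mu, sigma)
            e = (1.0 - w) * exp_pdf(x, lam)
            resp.append(g / (g + e) if g + e > 0 else 0.0)
        r_sum = sum(resp)
        if r_sum < 1e-12 or n - r_sum < 1e-12:
            break  # one component has vanished; stop rather than divide by zero
        # M-step: responsibility-weighted versions of the closed-form MLEs.
        w = r_sum / n
        mu = sum(r * x for r, x in zip(resp, xs)) / r_sum
        var = sum(r * (x - mu) ** 2 for r, x in zip(resp, xs)) / r_sum
        sigma = max(math.sqrt(var), 1e-6)
        lam = (n - r_sum) / sum((1.0 - r) * x for r, x in zip(resp, xs))
    return w, lam, mu, sigma


# Synthetic scores: 900 exponential (rate 10) and 100 Gaussian (0.8, 0.1).
random.seed(0)
data = [random.expovariate(10.0) for _ in range(900)]
data += [random.gauss(0.8, 0.1) for _ in range(100)]
w, lam, mu, sigma = fit_mixture(data)
print(w, lam, mu, sigma)
```

Because EM only climbs to a local maximum of the likelihood, the result depends on the starting point; a different `top_frac` can be tried when the fit looks poor.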

Mixture Model Fit: INQUERY

Mixture Model Fit: SMART

Posterior Probabilities
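Given a fitted mixture, Bayes' rule turns a raw score into a probability of relevance. The sketch below assumes the mixture parameters are already known; the names `w`, `lam`, `mu` and `sigma` are chosen here for illustration.

```python
import math


def posterior_relevant(score: float, w: float, lam: float,
                       mu: float, sigma: float) -> float:
    """P(relevant | score) for a Gaussian/exponential mixture.

    w is the prior weight of the relevant (Gaussian) component and
    1 - w the prior weight of the non-relevant (exponential) component.
    """
    z = (score - mu) / sigma
    g = w * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    e = (1.0 - w) * lam * math.exp(-lam * score) if score >= 0 else 0.0
    return g / (g + e) if g + e > 0 else 0.0


for s in (0.05, 0.4, 0.8):
    print(s, posterior_relevant(s, w=0.1, lam=10.0, mu=0.8, sigma=0.1))
```

The posterior lives on a common 0 to 1 scale for every engine, which is what makes scores from different engines comparable.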

Posterior Probabilities: SMART

Limitations of Posterior Probabilities

Problem I: Mixture Model
Model selection: exponential and Gaussian?
- Fit the data well
- Can be recovered with the EM algorithm
EM algorithm: limitations and solutions
- Limitation: local maxima
- Solutions:
  - Arbitrary initial condition
  - Fit the exponential distribution first, then remove the documents that do not fit it well and use them to fit the Gaussian

Problem II: Shapes of Distributions

Shapes of Poisson Distributions

Applications
Combining outputs of search engines
- Using posterior probabilities
Automatic engine selection
- Distinction: larger distance between the mean and the intersection point of the two distributions
- Relevance: higher maximum of posterior probabilities
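One simple way to combine engines with posterior probabilities is to average each document's posterior across engines and rank by the result. The sketch below assumes a plain average and assumes that a document an engine did not return gets posterior 0 from that engine; both choices are illustrative assumptions rather than details stated on the slide.

```python
from typing import Dict, List, Tuple


def combine_by_posterior(
    posteriors_per_engine: List[Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Rank documents by their mean posterior probability of relevance.

    Each dict maps doc id -> P(relevant | score) for one engine.
    Missing documents contribute 0 (an assumption made for this sketch).
    """
    n_engines = len(posteriors_per_engine)
    docs = set().union(*posteriors_per_engine)
    averaged = {
        doc: sum(p.get(doc, 0.0) for p in posteriors_per_engine) / n_engines
        for doc in docs
    }
    # Highest averaged posterior first.
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


engine_a = {"d1": 0.9, "d2": 0.2}
engine_b = {"d1": 0.7, "d3": 0.6}
for doc, prob in combine_by_posterior([engine_a, engine_b]):
    print(doc, prob)
```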

Comparative Study: Combining

Comparative Study: Selecting

What Can I Learn from this Paper?
Scientific methodology
- Clear and simple models
- Theoretical reasoning and experimental support
- Natural and simple mathematical methods
- Standard test data and comparative study

Alternative Method
Bayes Optimal Metasearch: A Probabilistic Model for Combining the Results of Multiple Retrieval Systems
J. A. Aslam & M. Montague, Dartmouth College, SIGIR’01

Probabilistic Model

Comparisons
manmatha01modeling
- Pros: clear and simple models
- Cons: strong model assumptions; some inherent limitations of the EM algorithm
aslam01Bayes
- Training prior probabilities
- Naive Bayes independence assumptions

My Thoughts
- Training prior probabilities may yield more accurate output models.
- The small sample space limits the use of traditional statistics. Bayesian analysis may be a way around this.

References
R. Manmatha, T. Rath, and Fangfang Feng. Modeling Score Distributions for Combining the Outputs of Search Engines. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
J. A. Aslam and M. Montague. Bayes Optimal Metasearch: A Probabilistic Model for Combining the Results of Multiple Retrieval Systems. In Proceedings of the 23rd ACM SIGIR Conference on Research and Development in Information Retrieval, 2000.
Jiangsheng Yu. Expectation Maximization: An Approach to Parameter Estimation. Lecture of Machine Learning Seminar, 2003.
