Reading Notes Wang Ning Lab of Database and Information Systems


1 Modeling Score Distributions for Combining the Outputs of Search Engines
Reading Notes Wang Ning Lab of Database and Information Systems Dec 3rd, 2003

2 Revision History
Nov. 30th, 2003: Draft
Dec. 1st, 2003: Add all pictures
Dec. 2nd, 2003: Add references

3 Literature Information
Title: Modeling Score Distributions for Combining the Outputs of Search Engines
Authors: R. Manmatha, T. Rath, F. Feng
Institution: Center for Intelligent Information Retrieval, University of Massachusetts
Conference: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval

4 Basic Idea
Meta Search: combining results from search engines
Difficulties: no architecture and algorithm information; no score information
Previous Work: linear combination of document ranks; CombMIN, CombMAX, CombSUM, CombMNZ
The Authors' Idea: model the score distributions
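The combination rules listed under Previous Work are usually written CombMIN, CombMAX, CombSUM, and CombMNZ. A minimal sketch of them follows, assuming each engine returns a dict from document id to an already normalized score and that a document missing from an engine simply contributes nothing. The function name `comb_fuse` and the toy scores are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def comb_fuse(runs, method="sum"):
    """Fuse per-engine score dicts {doc_id: score} with a Comb* rule.

    Assumes scores are already normalized to a comparable range.
    """
    scores = defaultdict(list)
    for run in runs:
        for doc_id, score in run.items():
            scores[doc_id].append(score)

    fused = {}
    for doc_id, vals in scores.items():
        if method == "min":
            fused[doc_id] = min(vals)
        elif method == "max":
            fused[doc_id] = max(vals)
        elif method == "sum":
            fused[doc_id] = sum(vals)
        elif method == "mnz":
            # CombMNZ: CombSUM times the number of engines that returned the doc
            fused[doc_id] = sum(vals) * len(vals)
        else:
            raise ValueError(f"unknown method: {method}")
    return fused

# Hypothetical toy scores from two engines
engine_a = {"d1": 0.9, "d2": 0.4}
engine_b = {"d1": 0.5, "d3": 0.7}
fused_mnz = comb_fuse([engine_a, engine_b], method="mnz")
```

CombMNZ rewards documents that several engines agree on, which is why `d1` ends up ahead of documents returned by only one engine.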

5 Test Data
TREC (Text REtrieval Conference): TREC 3, TREC 4, and TREC 6 for Chinese documents
Search Engines: INQUERY (Probabilistic Model), CITY (Probabilistic Model), SMART (Vector Space Model), Bellcore (LSI Engine)

6 Model Assumptions
The scores of the non-relevant documents can be modeled with an exponential distribution
The scores of the relevant documents can be modeled with a Gaussian distribution
Explanations and arguments come later

7 Non-relevant Documents: Exponential Distribution

8 Relevant Documents: Gaussian Distribution
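For reference, the two densities named in the model assumptions have the standard forms below. The symbols $x$ (a document score), $\lambda$, $\mu$, and $\sigma$ are notation assumed for these notes and may differ from the paper's.

```latex
p(x \mid \text{non-relevant}) = \lambda e^{-\lambda x}, \qquad x \ge 0

p(x \mid \text{relevant}) = \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```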

9 Likelihood Function

10 MLE: Maximum Likelihood Estimate

11 Basic Idea of MLE
God always lets the most probable event happen first. The MLE of Θ is therefore the parameter value under which the observed sample is most likely.
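For the exponential and Gaussian distributions, maximizing the likelihood has well-known closed-form solutions. The sketch below illustrates them on hypothetical scores; the function names and the data are assumptions made for this example, not taken from the paper.

```python
import math

def mle_exponential(scores):
    """MLE of the exponential rate: lambda = 1 / sample mean."""
    return len(scores) / sum(scores)

def mle_gaussian(scores):
    """MLE of the Gaussian mean and variance (variance divides by n, not n - 1)."""
    n = len(scores)
    mu = sum(scores) / n
    var = sum((x - mu) ** 2 for x in scores) / n
    return mu, var

def exp_log_likelihood(scores, lam):
    """Log-likelihood of the scores under an exponential with rate lam."""
    return sum(math.log(lam) - lam * x for x in scores)

# Hypothetical scores, for illustration only
nonrel_scores = [1.0, 2.0, 3.0]
rel_scores = [2.0, 4.0, 6.0]

lam_hat = mle_exponential(nonrel_scores)
mu_hat, var_hat = mle_gaussian(rel_scores)
```

Evaluating `exp_log_likelihood` at `lam_hat` and at nearby rates shows that the estimate does sit at the maximum of the likelihood.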

12 Limitations of Gaussian Fit
Fits well: sufficient relevant documents (>= 60)
Fits badly: few relevant documents (the usual case)
Why? Model fault, or lack of samples (the authors' point)
Possible solution: Bayesian analysis might work here

13 Mixture Model Fit

14 Mixture Model Fit (cont.)

15 EM: Expectation Maximization
Important parameter estimation method

16 EM Steps
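A minimal sketch of the E and M steps for a two-component mixture of an exponential (non-relevant) and a Gaussian (relevant) follows. It uses the textbook EM updates on synthetic data; the function name `em_mixture`, the initial values, and the synthetic parameters are all assumptions for illustration, not the authors' implementation.

```python
import math
import random

def exp_pdf(x, lam):
    return lam * math.exp(-lam * x)

def gauss_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def em_mixture(scores, lam, mu, sigma, p_rel, n_iter=100):
    """EM for p(x) = (1 - p_rel) * Exp(x; lam) + p_rel * N(x; mu, sigma)."""
    n = len(scores)
    for _ in range(n_iter):
        # E-step: responsibility of the Gaussian (relevant) component
        resp = []
        for x in scores:
            g = p_rel * gauss_pdf(x, mu, sigma)
            e = (1 - p_rel) * exp_pdf(x, lam)
            resp.append(g / (g + e))
        # M-step: responsibility-weighted maximum likelihood updates
        r_sum = sum(resp)
        mu = sum(r * x for r, x in zip(resp, scores)) / r_sum
        var = sum(r * (x - mu) ** 2 for r, x in zip(resp, scores)) / r_sum
        sigma = max(math.sqrt(var), 1e-6)  # guard against a collapsing variance
        lam = (n - r_sum) / sum((1 - r) * x for r, x in zip(resp, scores))
        p_rel = r_sum / n
    return lam, mu, sigma, p_rel

# Synthetic scores: 900 "non-relevant" (rate 2.0) and 100 "relevant" (mean 5.0)
random.seed(0)
scores = [random.expovariate(2.0) for _ in range(900)]
scores += [random.gauss(5.0, 0.5) for _ in range(100)]

lam, mu, sigma, p_rel = em_mixture(scores, lam=1.0, mu=3.0, sigma=1.0, p_rel=0.5)
```

EM only guarantees a local maximum of the likelihood, so the result depends on the starting values passed in.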

17 Mixture Model Fit: INQUERY

18 Mixture Model Fit: SMART

19 Posterior Probabilities
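Given the two fitted densities and a mixing weight, the probability of relevance for a score follows from Bayes' rule. The notation below, with $x$ for the score and $P(\text{rel})$ for the mixing weight of the relevant component, is assumed for these notes.

```latex
P(\text{rel} \mid x) =
  \frac{P(\text{rel})\, p(x \mid \text{rel})}
       {P(\text{rel})\, p(x \mid \text{rel})
        + \bigl(1 - P(\text{rel})\bigr)\, p(x \mid \text{non-rel})}
```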

20 Posterior Probabilities: SMART

21 Limitations of Posterior Probabilities

22 Problem I: Mixture Model
Model Selection: Exponential and Gaussian?
Fit the data well
Can be recovered with the EM algorithm
EM Algorithm: Limitations and Solutions
Limitation: local maxima
Solutions: arbitrary initial conditions; or fit the exponential distribution first, then fit the Gaussian to the documents that the exponential does not fit well

23 Problem II: Shapes of Distributions

24 Shapes of Poisson Distributions

25 Applications
Combining Outputs of Search Engines: using posterior probabilities
Automatic Engine Selection
Distinction: larger distance between the mean and the intersection point of the two distributions
Relevance: higher maximum of the posterior probabilities
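One way to combine engines with posterior probabilities is sketched below: each engine's raw score is mapped to a probability of relevance under that engine's fitted mixture, and the probabilities are averaged per document. Averaging, the treatment of missing documents, the function names, and the parameter values are all assumptions made for this illustration.

```python
import math

def posterior_relevant(x, lam, mu, sigma, p_rel):
    """P(relevant | score x) under an exponential + Gaussian mixture."""
    z = (x - mu) / sigma
    g = p_rel * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
    e = (1 - p_rel) * lam * math.exp(-lam * x)
    return g / (g + e)

def combine_by_posterior(runs, params):
    """Average per-engine posteriors for each document.

    runs:   {engine: {doc_id: raw_score}}
    params: {engine: (lam, mu, sigma, p_rel)}, the fitted mixture per engine
    Assumption: a document missing from an engine contributes 0 for that engine.
    """
    doc_ids = set().union(*(run.keys() for run in runs.values()))
    combined = {}
    for doc_id in doc_ids:
        total = 0.0
        for engine, run in runs.items():
            if doc_id in run:
                total += posterior_relevant(run[doc_id], *params[engine])
        combined[doc_id] = total / len(runs)
    # Highest combined probability first
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical engines with different score scales
runs = {
    "A": {"d1": 5.0, "d2": 0.5},
    "B": {"d1": 0.8, "d3": 0.1},
}
params = {
    "A": (2.0, 5.0, 0.5, 0.1),
    "B": (10.0, 0.8, 0.1, 0.1),
}
ranking = combine_by_posterior(runs, params)
```

Because every engine's score is first turned into a probability, engines with very different score ranges can be combined without a separate normalization step.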

26 Comparative Study: Combining

27 Comparative Study: Selecting

28 What Can I Learn from this Paper?
Scientific Methodology
Clear and simple models
Theoretical reasoning & experimental support
Natural and simple mathematical methods
Standard test data and comparative study

29 Alternative Method
Bayes Optimal Metasearch: A Probabilistic Model for Combining the Results of Multiple Retrieval Systems
J. A. Aslam & M. Montague, Dartmouth College
SIGIR'00

30 Probabilistic Model
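Under a naive Bayes independence assumption, the posterior odds of relevance factorize over the retrieval systems. The sketch below uses notation assumed for these notes: $r_i$ is the evidence (for example, a rank) that system $i$ gives for a document $d$, and "irr" stands for irrelevant.

```latex
\frac{\Pr[\text{rel} \mid r_1, \ldots, r_n]}{\Pr[\text{irr} \mid r_1, \ldots, r_n]}
  = \frac{\Pr[\text{rel}]}{\Pr[\text{irr}]}
    \prod_{i=1}^{n} \frac{\Pr[r_i \mid \text{rel}]}{\Pr[r_i \mid \text{irr}]}

\text{score}(d) = \sum_{i=1}^{n}
    \log \frac{\Pr[r_i(d) \mid \text{rel}]}{\Pr[r_i(d) \mid \text{irr}]}
```

The prior odds are the same for every document, so ranking by the sum of log-likelihood ratios gives the same order as ranking by the full posterior odds.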

31 Comparisons
manmatha01modeling
Pros: clear and simple models
Cons: strong model assumptions; some inherent limitations of the EM algorithm
aslam01Bayes
Pros: training of prior probabilities
Cons: naive Bayes independence assumptions

32 My Thoughts
Train prior probabilities to obtain more accurate output models.
The small sample size limits the use of traditional statistics. Bayesian analysis may be a way around this.

33 References
R. Manmatha, T. Rath, and Fangfang Feng. Modeling Score Distributions for Combining the Outputs of Search Engines. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval.
J. A. Aslam and M. Montague. Bayes optimal metasearch: A probabilistic model for combining the results of multiple retrieval systems. In Proceedings of the 23rd ACM SIGIR Conference on Research and Development in Information Retrieval, pages , 2000.
Jiangsheng Yu. Expectation Maximization: An Approach to Parameter Estimation. Lecture of Machine Learning Seminar, 2003.

