A Hidden Markov Model Information Retrieval System
Mahboob Alam Khalid
Language Model based Information Retrieval, University of Saarland


Slide 1: A Hidden Markov Model Information Retrieval System
Mahboob Alam Khalid

Slide 2: Overview
- Motivation
- Hidden Markov Model (introduction)
- HMM for an information retrieval system
- Probability model
- Baseline system experiments
- HMM refinements:
  - Blind feedback
  - Bigrams
  - Document priors
- Conclusion

Slide 3: Motivation
Hidden Markov models have been applied successfully to:
- Speech recognition
- Named entity finding
- Optical character recognition
- Topic identification
- Ad hoc information retrieval (now)

Slide 4: Hidden Markov Model (Introduction)
- You have seen a sequence of observations (words).
- You do not know the sequence of generating states.
- An HMM is a model for exactly this problem.
- Two kinds of probabilities are involved in an HMM:
  - Transition probabilities (jumps from one state to the others), which sum to 1 for each state.
  - Emission probabilities of observations from each state, which also sum to 1.

Slide 5: A discrete HMM
- A set of output symbols
- A set of states
- A set of transitions between states
- A probability distribution on output symbols for each state
The observed sampling process:
- Start from some initial state.
- Transition from it to another state.
- Sample from the output distribution at that state.
- Repeat the steps.
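The sampling process above can be sketched directly. The state names, transition table, and emission table below are invented toy values for illustration only.

```python
import random

def sample_hmm(transitions, emissions, start, steps, rng):
    """Sample a sequence of output symbols from a discrete HMM.

    transitions: dict state -> {next_state: probability}
    emissions:   dict state -> {symbol: probability}
    """
    def draw(dist):
        # Draw one key from a {key: probability} distribution.
        r, acc = rng.random(), 0.0
        for key, p in dist.items():
            acc += p
            if r < acc:
                return key
        return key  # numerical safety net for rounding

    state, outputs = start, []
    for _ in range(steps):
        state = draw(transitions[state])        # transition to another state
        outputs.append(draw(emissions[state]))  # sample from that state's output distribution
    return outputs

# Toy two-state example (illustrative numbers only).
transitions = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emissions = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
seq = sample_hmm(transitions, emissions, "A", 5, random.Random(0))
```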

Slide 6: HMM for an Information Retrieval System
- Observed data: the query Q
- Unknown key: a relevant document D
- Noisy channel: the mind of the user, which transforms an imagined notion of relevance into the text of Q
- The quantity of interest is P(D is R | Q): the probability that D is relevant in the user's mind, given that Q was the query produced.

Slide 7: Probability Model
By Bayes' rule:
P(D is R | Q) = P(Q | D is R) * P(D is R) / P(Q)
where P(D is R) is the prior probability of relevance.
- Output symbols: the union of all words in the corpus
- States: the mechanisms of query-word generation:
  - Document
  - General English (identical for all documents)
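Since P(Q) is the same for every document and, absent other information, the prior P(D is R) can be taken as uniform, ranking documents by the posterior reduces to ranking by the query likelihood:

```latex
P(D \text{ is } R \mid Q)
  = \frac{P(Q \mid D \text{ is } R)\, P(D \text{ is } R)}{P(Q)}
  \;\propto\; P(Q \mid D \text{ is } R)
```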

Slide 8: A simple two-state HMM
(Diagram: from "query start", transitions with weights a_0 and a_1 lead to the "General English" and "Document" states, which emit with P(q | GE) and P(q | D) respectively, then proceed to "query end".)
The choice of which kind of word to generate next is independent of the previous such choice.

Slide 9: Why simplify the parameters?
- In principle there is one HMM per document, and EM could compute its parameters.
- But EM needs training samples: each document paired with training queries, which are not available.
- Instead, the emission distributions are estimated directly:
  P(q | D_k) = (# times q appears in D_k) / (length of D_k)
  P(q | GE) = (sum over k of # times q appears in D_k) / (sum over k of length of D_k)
- The query likelihood is then
  P(Q | D_k is R) = product over q in Q of (a_0 P(q | GE) + a_1 P(q | D_k))
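The estimates above can be sketched as follows. The mixture weights a0 and a1 and the toy corpus are illustrative stand-ins; in the paper the weights are tuned on training data.

```python
import math
from collections import Counter

def hmm_score(query, doc, corpus, a0=0.3, a1=0.7):
    """log P(Q | D is R) under the two-state mixture:
    each query word is generated by General English (weight a0)
    or by the document itself (weight a1)."""
    doc_counts = Counter(doc)
    ge_counts = Counter(w for d in corpus for w in d)
    ge_len = sum(len(d) for d in corpus)
    total = 0.0
    for q in query:
        p_doc = doc_counts[q] / len(doc)  # P(q | D_k): relative frequency in D_k
        p_ge = ge_counts[q] / ge_len      # P(q | GE): relative frequency in the whole corpus
        total += math.log(a0 * p_ge + a1 * p_doc)
    return total

docs = [["nixon", "visits", "china"], ["the", "weather", "today"]]
scores = [hmm_score(["nixon"], d, docs) for d in docs]
# The document containing "nixon" scores higher; General English smoothing
# keeps the other document's probability nonzero.
```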

Slide 10: Baseline System Performance
- Number of queries: 50
- An inverted index is created:
  - tf values (term frequency)
  - case ignored
  - Porter stemmer
- 397 stop words are replaced with the special token *STOP*
- Similarly, 4-digit strings are replaced by *YEAR*, other digit strings by *NUMBER*
- TREC-6 and TREC-7 test collections:
  - TREC-6: 556,077 documents, average of 26.5 unique terms; news and government agencies
  - TREC-7: 528,155 documents, average of 17.6 unique terms
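The preprocessing steps can be sketched like this. The stop-word set below is a tiny stand-in for the full 397-word list, and stemming is omitted; the paper additionally applies a Porter stemmer.

```python
import re

STOP_WORDS = {"the", "of", "and", "a"}  # stand-in for the full 397-word list

def preprocess(text):
    tokens = []
    for tok in text.lower().split():        # ignore case
        if re.fullmatch(r"\d{4}", tok):     # 4-digit strings -> *YEAR*
            tokens.append("*YEAR*")
        elif re.fullmatch(r"\d+", tok):     # other digit strings -> *NUMBER*
            tokens.append("*NUMBER*")
        elif tok in STOP_WORDS:
            tokens.append("*STOP*")         # stop words -> *STOP*
        else:
            tokens.append(tok)              # a real stemmer would go here
    return tokens

out = preprocess("The budget of 1997 was 42")
```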

Slide 11: TF.IDF model
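The slide's tf.idf formula did not survive extraction; the sketch below uses one standard tf.idf variant, not necessarily the exact formulation compared against in the paper.

```python
import math
from collections import Counter

def tfidf_score(query, doc, corpus):
    """Score a document by summing tf * idf over query terms (one common variant)."""
    tf = Counter(doc)
    n_docs = len(corpus)
    score = 0.0
    for q in query:
        df = sum(1 for d in corpus if q in d)  # document frequency of q
        if df == 0:
            continue                            # term absent from the collection
        idf = math.log(n_docs / df)             # inverse document frequency
        score += tf[q] * idf
    return score

corpus = [["nixon", "visits", "china"], ["china", "trade", "news"], ["local", "weather"]]
s = tfidf_score(["nixon", "china"], corpus[0], corpus)
```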

Slide 12: Non-interpolated average precision

Slide 13: HMM Refinements
- Blind feedback: a well-known technique for enhancing performance.
- Bigrams: a word can take on a distinctive meaning in the context of another word, e.g. "white house", "Pope John Paul II".
- Query section weighting: some portions of the query are more important than others.
- Document priors: longer documents tend to be more informative than short ones.

Slide 14: Blind Feedback
- Construct a new query based on the top-ranked documents (Rocchio algorithm).
- A word appearing in 90% of the top N retrieved documents can be either uninformative (e.g. "very") or highly informative (e.g. "Nixon").
- a_0 and a_1 can be estimated by the EM algorithm from training queries.
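A minimal Rocchio-style expansion in its usual vector form; the weights alpha and beta and the bag-of-words representation are illustrative, not taken from the paper.

```python
from collections import Counter

def rocchio(query, top_docs, alpha=1.0, beta=0.75):
    """New query vector = alpha * original query + beta * centroid of top-ranked docs."""
    new_q = Counter({t: alpha * w for t, w in query.items()})
    for doc in top_docs:
        for term, count in Counter(doc).items():
            new_q[term] += beta * count / len(top_docs)  # centroid contribution
    return new_q

query = Counter({"germany": 1.0})
top_docs = [["berlin", "germany"], ["berlin", "wall"]]
expanded = rocchio(query, top_docs)
# "berlin" enters the expanded query because it dominates the top documents.
```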

Slide 15: Estimating a_1
In equation (5) of the paper:
- Q' = a generic query, q' = a generic query word (the presenter's reading)
- Q_i = one training query; Q = the set of available training queries
- I_{m,Q_i} = the top m documents retrieved for Q_i
- df(w) = the document frequency of w
- Negative values are avoided by taking the floor of the estimate.
Example: Q_i = "Germany", where "Berlin" appears among the top retrieved documents.

Slide 16: Performance gained

Slide 17: Bigrams
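The paper extends the HMM with a bigram state. A minimal sketch of mixing a document bigram probability into the unigram mixture follows; the interpolation weights a0, a1, a2 and the toy data are illustrative assumptions, not the paper's values.

```python
import math
from collections import Counter

def bigram_score(query, doc, corpus, a0=0.2, a1=0.5, a2=0.3):
    """log-likelihood of a query where each word after the first may also be
    generated as a bigram continuation of the previous query word."""
    uni = Counter(doc)
    big = Counter(zip(doc, doc[1:]))
    ge = Counter(w for d in corpus for w in d)
    ge_len = sum(len(d) for d in corpus)
    total, prev = 0.0, None
    for q in query:
        p_ge = ge[q] / ge_len                # General English unigram
        p_uni = uni[q] / len(doc)            # document unigram
        # document bigram continuation of the previous query word, if any
        p_big = big[(prev, q)] / uni[prev] if prev and uni[prev] else 0.0
        total += math.log(a0 * p_ge + a1 * p_uni + a2 * p_big)
        prev = q
    return total

docs = [["white", "house", "statement"], ["house", "prices", "rise"]]
scores = [bigram_score(["white", "house"], d, docs) for d in docs]
# The document containing the phrase "white house" is rewarded by the bigram term.
```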

Slide 18: Query Section Weighting
- In the TREC evaluation, the title section is more important than the others.
- v_s(q) = weight for the section of the query containing q:
  v_desc = 1.2, v_narr = 1.9, v_title = 5.7
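One natural way to apply the section weights, which in log space becomes a per-term multiplier on each word's log-probability; the exact form in the paper may differ, and the query words and probabilities below are invented.

```python
import math

SECTION_WEIGHTS = {"title": 5.7, "desc": 1.2, "narr": 1.9}
query = [("title", "german"), ("narr", "economy")]  # (section, word) pairs

def doc_score(doc_probs):
    """Sum of section-weighted log term probabilities for one document.
    doc_probs maps each query word to P(word | D) under the mixture model."""
    return sum(SECTION_WEIGHTS[sec] * math.log(doc_probs[w]) for sec, w in query)

doc_a = {"german": 0.5, "economy": 0.01}  # strong match on the title term
doc_b = {"german": 0.01, "economy": 0.5}  # strong match only on the narrative term
# With v_title > v_narr, doc_a outranks doc_b despite the symmetric probabilities.
```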

Slide 19: Document Priors
- A refereed journal may be more informative than a supermarket tabloid.
- The most predictive features:
  - Source
  - Length
  - Average word length
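A non-uniform prior simply re-enters the ranking as an additive log term. The length-based prior below is an invented illustration of using one such feature, not the paper's trained prior.

```python
import math

def log_prior(doc_len, avg_len=500.0):
    """Illustrative document prior favoring longer documents:
    a simple log-length feature (assumed form, not the paper's)."""
    return math.log(doc_len / avg_len)

def posterior_score(query_loglik, doc_len):
    # Rank-equivalent log P(D is R | Q) = query log-likelihood + log prior.
    return query_loglik + log_prior(doc_len)

# Two documents with equal query likelihood: the longer one is preferred.
short = posterior_score(-12.0, 200)
long_ = posterior_score(-12.0, 800)
```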

Slide 20: Conclusion
- A novel method in IR using HMMs.
- HMMs offer a rich setting that can incorporate new and familiar techniques.
- Experiments with a system that implements:
  - Blind feedback
  - Bigram modeling
  - Query section weighting
  - Document priors
- Future work: the HMM can be extended to accommodate
  - Passage retrieval
  - Explicit synonym modeling
  - Concept modeling

Slide 21: Resources
- D. Miller, T. Leek, R. Schwartz. "A Hidden Markov Model Information Retrieval System". SIGIR '99, Berkeley, CA, USA.
- L. Rabiner. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Proc. IEEE 77(2), pp. 257-286.

Slide 22: Thank you very much! Questions?

