A Hidden Markov Model Information Retrieval System
Mahboob Alam Khalid
Language Model based Information Retrieval, University of Saarland



Overview
- Motivation
- Hidden Markov Model (Introduction)
- HMM for Information Retrieval System
- Probability Model
- Baseline System Experiments
- HMM Refinements
  - Blind Feedback
  - Bigrams
  - Document Priors
- Conclusion

Motivation
Hidden Markov models have been applied successfully to:
- Speech recognition
- Named entity finding
- Optical character recognition
- Topic identification
- Ad hoc information retrieval (now)

Hidden Markov Model (Introduction)
- You observe a sequence of outputs (words), but you do not know the sequence of states that generated them.
- An HMM is a model for exactly this situation.
- Two kinds of probabilities are involved in an HMM:
  - Transition probabilities: the probability of jumping from one state to another; for each state these sum to 1.
  - Emission probabilities: the probability of producing each observation from a state; for each state these sum to 1.

A discrete HMM
- A set of output symbols
- A set of states
- A set of transitions between states
- A probability distribution over output symbols for each state
Observed sampling process:
- Start from some initial state
- Sample from the output distribution at that state
- Transition from it to another state
- Repeat these steps
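The sampling process above can be sketched in a few lines of Python; the two-state transition and emission tables here are made up purely for illustration:

```python
import random

# Illustrative (hypothetical) discrete HMM: two states, two output symbols.
# Each row sums to 1, as required of transition and emission distributions.
transitions = {"s1": {"s1": 0.7, "s2": 0.3},
               "s2": {"s1": 0.4, "s2": 0.6}}
emissions = {"s1": {"a": 0.9, "b": 0.1},
             "s2": {"a": 0.2, "b": 0.8}}

def draw(dist):
    """Sample a key from a {outcome: probability} dict."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point rounding

def sample(n, start="s1"):
    """The observed sampling process: emit a symbol, then transition."""
    state, symbols = start, []
    for _ in range(n):
        symbols.append(draw(emissions[state]))
        state = draw(transitions[state])
    return symbols
```

Only the output symbols are visible to an observer; the state sequence stays hidden, which is the situation the model is built for.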

HMM for Information Retrieval System
- Observed data: the query Q
- Unknown key: a relevant document D
- Noisy channel: the mind of the user, which transforms an imagined notion of relevance into the text of Q
- We want P(D is R | Q):
  - the probability that D is relevant in the user's mind,
  - given that Q was the query produced

Probability Model
By Bayes' rule:

  P(D is R | Q) = P(Q | D is R) · P(D is R) / P(Q)

where P(D is R) is the prior probability of relevance.
- Output symbols: the union of all words in the corpus
- States: mechanisms of query word generation
  - Document
  - General English (identical for all documents)

A simple two-state HMM
Query generation runs from "query start" through one of two states to "query end": the General English state, entered with transition weight a_0 and emitting q with probability P(q|GE), and the Document state, entered with weight a_1 and emitting q with probability P(q|D). The choice of which kind of word to generate next is independent of the previous such choice.

Why simplify the parameters?
- In principle there is one HMM per document, with parameters computed by EM
- But EM needs training samples: documents paired with training queries, which are not available
- Instead, use simple maximum-likelihood estimates:

  P(q | D_k) = (number of times q appears in D_k) / (length of D_k)

  P(q | GE) = (Σ_k number of times q appears in D_k) / (Σ_k length of D_k)

  P(Q | D_k is R) = Π_{q ∈ Q} ( a_0 · P(q | GE) + a_1 · P(q | D_k) )
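Under these estimates, ranking reduces to scoring each document by the mixture product above. A minimal sketch (the two-document corpus and the weights a_0, a_1 below are illustrative, not values from the paper):

```python
from math import log

# Toy corpus: document name -> token list (purely illustrative).
corpus = {
    "d1": "the president visited berlin".split(),
    "d2": "the parliament met in berlin yesterday".split(),
}
total_len = sum(len(doc) for doc in corpus.values())

def p_q_given_doc(q, name):
    # P(q | D_k) = count of q in D_k / length of D_k
    return corpus[name].count(q) / len(corpus[name])

def p_q_given_ge(q):
    # P(q | GE) = corpus-wide count of q / total corpus length
    return sum(doc.count(q) for doc in corpus.values()) / total_len

def log_p_query(query, name, a0=0.7, a1=0.3):
    # log P(Q | D_k is R) = sum over q of log(a0 P(q|GE) + a1 P(q|D_k))
    return sum(log(a0 * p_q_given_ge(q) + a1 * p_q_given_doc(q, name))
               for q in query)

# Rank documents for a query by descending log-likelihood.
ranking = sorted(corpus, reverse=True,
                 key=lambda d: log_p_query(["berlin", "president"], d))
```

Note that smoothing with P(q|GE) keeps the product from collapsing to zero when a query word is absent from the document.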

Baseline System Performance
- Number of queries: 50
- An inverted index is created:
  - tf values (term frequency)
  - Case is ignored
  - Porter stemmer
- 397 stop words replaced with the special token *STOP*
- Similarly, 4-digit strings replaced by *YEAR*, other digit strings by *NUMBER*
- Test collections: TREC-6 and TREC-7
  - TREC-6: 556,077 documents, average of 26.5 unique terms; news and government agencies
  - TREC-7: 528,155 documents, average of 17.6 unique terms
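The token normalization described above can be sketched as follows; the stop-word list here is a tiny stand-in for the 397-word list, and Porter stemming is omitted because it needs an external implementation:

```python
import re

STOP_WORDS = {"the", "of", "a", "in"}  # stand-in for the 397-word list

def normalize(token):
    token = token.lower()                 # ignore case
    if token in STOP_WORDS:
        return "*STOP*"                   # stop words -> special token
    if re.fullmatch(r"\d{4}", token):
        return "*YEAR*"                   # 4-digit strings
    if re.fullmatch(r"\d+", token):
        return "*NUMBER*"                 # other digit strings
    return token                          # (Porter stemming omitted here)

tokens = [normalize(t) for t in "The summit of 1998 drew 500 people".split()]
```

The 4-digit check must precede the general digit check, otherwise years would be mapped to *NUMBER*.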

TF.IDF model

Non-interpolated average precision

HMM Refinements
- Blind feedback: a well-known technique for enhancing performance
- Bigrams: words can have a distinctive meaning in the context of another word, e.g. "white house", "Pope John Paul II"
- Query section weighting: some portions of the query are more important than others
- Document priors: longer documents tend to be more informative than short ones

Blind Feedback
- Construct a new query from the top-ranked documents (Rocchio algorithm)
- Even if both appear in 90% of the top N retrieved documents:
  - a common word like "very" is less informative
  - a word like "Nixon" is highly informative
- a_0 and a_1 can be estimated with the EM algorithm from training queries
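Rocchio-style expansion can be sketched roughly as below. This is a hedged approximation only: the beta weight and the idf form are illustrative, not equation (5) of the paper, and the term names are invented.

```python
from collections import Counter

def expand_query(query, top_docs, doc_freq, n_docs, beta=0.5, k=2):
    """Add the k terms that are frequent in the top-ranked documents
    but rare in the collection overall (illustrative Rocchio variant)."""
    scores = Counter()
    for doc in top_docs:
        for term, tf in Counter(doc).items():
            # Down-weight terms like "very" that occur in most documents;
            # keep discriminative ones like "Nixon".
            idf = n_docs / (1 + doc_freq.get(term, 0))
            scores[term] += beta * tf * idf / len(top_docs)
    expansion = [t for t, _ in scores.most_common() if t not in query][:k]
    return list(query) + expansion
```

A usage example with made-up document frequencies: expanding the query ["watergate"] over two feedback documents pulls in "nixon" ahead of function words like "the".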

Estimating a_1
Notation in equation (5) of the paper:
- Q' = a generic query; q' = a generic query word
- Q_i = one training query; Q = the set of available training queries
- I_{m,Q_i} = the top m documents retrieved for Q_i
- df(w) = document frequency of w
Negative values are avoided by taking the floor of the estimate.
Example: for Q_i = "Germany", the word "Berlin" is frequent in the top-ranked documents.

Performance gained

Bigrams
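The bigram equation from this slide did not survive the transcript. As a hedged sketch, the mixture can be extended with a document bigram term P(q_i | q_{i-1}, D_k), estimated from adjacent-pair counts in the document; the weights a_0, a_1, a_2 below are illustrative, not values from the paper:

```python
def p_bigram(prev, q, doc_tokens):
    """ML estimate of P(q | prev, D): how often prev is followed by q."""
    pairs = zip(doc_tokens, doc_tokens[1:])
    follows = sum(1 for a, b in pairs if a == prev and b == q)
    prev_count = doc_tokens.count(prev)
    return follows / prev_count if prev_count else 0.0

def mixture(q, prev, doc_tokens, p_ge, a0=0.3, a1=0.5, a2=0.2):
    """Unigram mixture plus an illustrative bigram state (a0+a1+a2 = 1)."""
    p_doc = doc_tokens.count(q) / len(doc_tokens)
    return a0 * p_ge(q) + a1 * p_doc + a2 * p_bigram(prev, q, doc_tokens)
```

For a phrase like "white house", the bigram term rewards documents where the two words actually occur adjacently, not merely anywhere in the text.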

Query Section Weighting
- In TREC evaluation, the title section is more important than the others
- v_s(q) = weight for the section of the query containing q
  - v_desc = 1.2, v_narr = 1.9, v_title = 5.7
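Section weighting can be sketched as scaling each query term's log-probability contribution by the weight of the topic section it came from; the per-term probabilities in the test below are invented:

```python
from math import log

# Section weights from slide 18 (TREC topic sections).
SECTION_WEIGHTS = {"title": 5.7, "desc": 1.2, "narr": 1.9}

def weighted_log_score(query_terms, p_term):
    """query_terms: list of (term, section) pairs.
    p_term: term -> mixture probability, e.g. a0 P(q|GE) + a1 P(q|D)."""
    return sum(SECTION_WEIGHTS[section] * log(p_term[term])
               for term, section in query_terms)
```

With v_title = 5.7, a single title word influences the score more than four description words combined.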

Document Priors
- A refereed journal may be more informative than a supermarket tabloid
- Most predictive features:
  - Source
  - Length
  - Average word length
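Folding a prior into the ranking follows directly from the Bayes numerator P(Q | D is R) · P(D is R): in log space the prior is an additive term. The length-proportional prior below is an illustrative stand-in, not the paper's actual prior model:

```python
from math import log

def rank_with_prior(doc_lengths, log_likelihood):
    """doc_lengths: {name: length}; log_likelihood: {name: log P(Q|D is R)}.
    Illustrative prior: P(D is R) proportional to document length."""
    total = sum(doc_lengths.values())
    def score(name):
        prior = doc_lengths[name] / total   # longer docs get a higher prior
        return log_likelihood[name] + log(prior)
    return sorted(doc_lengths, key=score, reverse=True)
```

With equal likelihoods, the prior alone breaks the tie in favor of the longer document.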

Conclusion
- A novel method in IR using HMMs
- Offers a rich setting in which to incorporate new and familiar techniques
- Experiments with a system that implements:
  - Blind feedback
  - Bigram modeling
  - Query section weighting
  - Document priors
- Future work: the HMM can be extended to accommodate
  - Passage retrieval
  - Explicit synonym modeling
  - Concept modeling

Resources
- D. Miller, T. Leek, R. Schwartz. A Hidden Markov Model Information Retrieval System. SIGIR '99, Berkeley, CA, USA.
- L. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, vol. 77.

Thank you very much! Questions?