Term-Specific Smoothing
On a paper by D. Hiemstra
Alexandru A. Chitea, chitea@mpi-inf.mpg.de
Universität des Saarlandes, March 10, 2006
Seminar CS 555 – Language Model based Information Retrieval

March 10, 2006 Term-Specific Smoothing

Introduction
Experimental approach to Information Retrieval
– A formal model specifies an exact formula, which is tried empirically
– Formulae are empirically tried because they seem plausible
Modeling approach to Information Retrieval
– A formal model specifies an exact formula that is used to prove some simple mathematical properties of the model

Information Retrieval – Overview
A query to the system returns a ranked result list
– Statistical ranking on term frequencies is still standard practice
Search engines provide means to override the default ranking mechanisms
– Users can specify mandatory query terms (e.g. +term or "term" in Google)

Information Retrieval – Practice (1)
Query: Star Wars Episode I (I is not treated as a mandatory term)

Information Retrieval – Practice (2)
Query: Star Wars Episode +I (I is treated as a mandatory term)

Motivation
– Statistical ranking has performance limitations
– Statistics-based IR models do not capture user-specified term importance
– The user or the system should be able to override the default ranking mechanism
Objective
– A mathematical model that supports the concept of query term importance

Language Models
A statistical model for generating text
– Probability distribution over strings in a given language M
Consider the Unigram Language Model (LM)

Example – Language Models
IR sample → build model:
  text 0.2
  search 0.1
  …
  mining 0.1
  food 0.0001
  …
Health sample → build model:
  food 0.25
  nutrition 0.1
  …
  healthy 0.05
  diet 0.02
  …

Language Models in IR
Estimate a LM for each document D
Estimate the probability of generating a query Q with terms $(t_1, \dots, t_n)$ using a given model: $P(Q \mid D) = \prod_{i=1}^{n} P(t_i \mid D)$
Rank documents by the probability of generating Q

Insufficient Data
If a term is not in the document, the query cannot be generated: $P(t_i \mid D) = 0 \Rightarrow P(Q \mid D) = 0$
Smooth probabilities
– Probabilities of observed events are decreased by a certain amount, which is credited to unobserved events
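The zero-probability problem can be seen in a minimal sketch of unsmoothed query likelihood. The two toy documents and the function names `unigram_mle` and `query_likelihood` are illustrative assumptions, not part of the slides.

```python
from collections import Counter


def unigram_mle(text):
    """Maximum-likelihood unigram model: relative term frequencies."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def query_likelihood(query, model):
    """P(Q | D) as a product of per-term probabilities (unigram assumption)."""
    prob = 1.0
    for term in query.lower().split():
        prob *= model.get(term, 0.0)  # an unseen term has probability 0
    return prob


# Hypothetical toy documents, chosen only for illustration.
docs = {
    "doc_a": "language models rank documents",
    "doc_b": "smoothing helps language models",
}
models = {name: unigram_mle(text) for name, text in docs.items()}
scores = {
    name: query_likelihood("language smoothing", model)
    for name, model in models.items()
}
print(scores)
```

Because "smoothing" never occurs in `doc_a`, its whole product collapses to zero even though it matches "language"; this is the behaviour that smoothing is meant to repair.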

Smoothing
Roles
– Estimation → reevaluation of probabilities
– Query modeling → to explain the common and non-informative terms in a query
Linear interpolation smoothing
– Defines a smoothing parameter necessary for query modeling
– Can be defined as a two-state Hidden Markov Model

Smoothing Models
Mixture Model smoothing
– Define a hidden event for all query terms
Term-specific smoothing
– Define a hidden event for each query term

Smoothing – Mixture Model
Mixes the probability from the document with the general collection probability of the term
$\lambda$ can be tuned to adjust performance:
– High value → conjunctive-like search, i.e., suitable for short queries
– Low value → suitable for long queries
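One common way to write this mixture is linear interpolation with a single weight $\lambda$ on the document model. The symbols $P_{ml}$ (maximum-likelihood document estimate) and $C$ (the collection) are notational assumptions made here, and the slide's own formula may have been typeset differently.

```latex
P(t_i \mid D) \;=\; \lambda \, P_{ml}(t_i \mid D) \;+\; (1 - \lambda) \, P(t_i \mid C),
\qquad 0 \le \lambda \le 1
```

With $\lambda$ close to 1, a document that lacks a query term receives almost no probability for it, which is why high values behave like a conjunctive search.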

Bayesian Networks (1)
A Bayesian Network (BN) is a directed, acyclic graph G(V, E) where:
– Nodes → Random variables (RVs)
– Edges → Dependencies
Properties:

Bayesian Networks (2)
From the properties it holds that:
By the chain rule:
By conditional independence:
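The textbook form of these two steps is sketched below. It assumes the variables $X_1, \dots, X_n$ are numbered so that parents precede children, and $\mathrm{pa}(X_i)$ (the parents of $X_i$ in the graph) is notation introduced here rather than taken from the slide.

```latex
\begin{align*}
P(X_1, \dots, X_n)
  &= \prod_{i=1}^{n} P(X_i \mid X_1, \dots, X_{i-1})
  && \text{(chain rule)} \\
  &= \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{pa}(X_i)\bigr)
  && \text{(conditional independence)}
\end{align*}
```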

LM as a Bayesian Network
Nodes → random variables
Edges → model conditional dependencies
Clear nodes → hidden random variables
Shaded nodes → observed random variables
[Figure 1: The language modeling approach as a Bayesian network, with nodes D and $t_1, \dots, t_n$]

Example – Mixture Model (1)
Collection (2 documents)
– $d_1$: IBM reports a profit but revenue is down
– $d_2$: Siemens narrows quarter loss but revenue decreases further
Model: MLE unigram from documents
Query: revenue down
Ranking: $d_1 > d_2$

Example – Mixture Model (2)
[Figure 2: Bayesian network for the $C(d_1, d_2)$ language model, with nodes D, C, $t_1$: revenue and $t_3$: down]
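The two-document example can be worked through in a short sketch. The interpolation weight `LAMBDA = 0.5`, the whitespace tokenization, and the helper names are assumptions made for illustration; the slides do not state which weight was used.

```python
from collections import Counter

# The two documents of the example collection.
docs = {
    "d1": "IBM reports a profit but revenue is down",
    "d2": "Siemens narrows quarter loss but revenue decreases further",
}

doc_counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
collection_counts = sum(doc_counts.values(), Counter())


def mle(term, counts):
    """Relative frequency of a term; a missing term counts as 0."""
    return counts[term] / sum(counts.values())


def mixture_score(query, counts, lam):
    """Product over query terms of lam * P(t | D) + (1 - lam) * P(t | C)."""
    score = 1.0
    for term in query.lower().split():
        score *= lam * mle(term, counts) + (1 - lam) * mle(term, collection_counts)
    return score


LAMBDA = 0.5  # assumed weight on the document model

scores = {
    name: mixture_score("revenue down", counts, LAMBDA)
    for name, counts in doc_counts.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Under these assumptions both documents get a non-zero score, and $d_1$ ranks above $d_2$ because only $d_1$ contains "down".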

Term-Specific Smoothing
[Figure: two Bayesian networks over the nodes D, $t_1$, $t_2$, $t_3$]

Term-Specific Smoothing – Derivation
Step 1: Assume query term independence
Step 2: For each $t_i$ introduce a binary RV $I_i$ (i.e. the importance of a query term)

Term-Specific Smoothing – Derivation
Step 3: Assume query term importance does not depend on D
Step 4: Writing the full sum over the importance values yields:

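Term-Specific Smoothing – Derivation
Step 4 (contd.):
– Let $\lambda_i = P(I_i = 1)$ be the probability that query term $t_i$ is important
– Assume important terms are generated by the document model and unimportant terms by the collection model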
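Steps 1 to 4 can be written out as a single chain of equalities. This is a reconstruction under stated assumptions: $\lambda_i = P(I_i = 1)$, unimportant terms are drawn from the collection model $P(t_i \mid C)$, and important terms from the maximum-likelihood document model $P_{ml}(t_i \mid D)$. The symbols $C$ and $P_{ml}$ are notation chosen here, not necessarily the slide's.

```latex
\begin{align*}
P(t_1, \dots, t_n \mid D)
  &= \prod_{i=1}^{n} P(t_i \mid D)
  && \text{(Step 1)} \\
  &= \prod_{i=1}^{n} \sum_{k=0}^{1} P(I_i = k,\, t_i \mid D)
  && \text{(Step 2)} \\
  &= \prod_{i=1}^{n} \sum_{k=0}^{1} P(I_i = k)\, P(t_i \mid I_i = k, D)
  && \text{(Step 3)} \\
  &= \prod_{i=1}^{n} \Bigl( P(I_i = 0)\, P(t_i \mid I_i = 0, D)
       + P(I_i = 1)\, P(t_i \mid I_i = 1, D) \Bigr)
  && \text{(Step 4)} \\
  &= \prod_{i=1}^{n} \Bigl( (1 - \lambda_i)\, P(t_i \mid C)
       + \lambda_i\, P_{ml}(t_i \mid D) \Bigr)
\end{align*}
```

Setting every $\lambda_i$ to the same value $\lambda$ recovers the single-parameter mixture model.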

Term-Specific Smoothing – Properties
Case 1: Stop Words (–), $\lambda_i = 0$
– Query term is not important
– Ignore query term $t_i$
Case 2: Mandatory Terms (+), $\lambda_i = 1$
– Relevant documents contain the query term
– No smoothing by collection model performed
Case 3: Coordination level ranking, $\lambda_i \to 1$ for all $i$

Stop Words
Query terms that are ignored during the search
Reasons:
– Frequent words (e.g. the, it, a, …) might not contribute significantly to the final document score, but they do require processing power
– Words are stopped if they carry little meaning (e.g. hereupon, whereafter)

Mandatory Terms
A query term that should occur in every retrieved document
The collection model can be dropped from the calculation of the document score
Documents that do not match the query term are assigned zero probability
Users specify mandatory terms (e.g. by +)

Coordination Level Ranking
A document containing n query terms will always rank higher than one with n-1 query terms
Most tf.idf-ranking methods do not behave like coordination level ranking
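The three special cases (stop words, mandatory terms, coordination level ranking) can be illustrated with a small sketch of per-term smoothing. All probabilities below are made-up toy values, and `term_specific_score` is a hypothetical helper that assumes each term's weight $\lambda_i$ multiplies the document model.

```python
def term_specific_score(query_terms, lambdas, doc_model, collection_model):
    """Product over terms of (1 - lambda_i) * P(t_i | C) + lambda_i * P(t_i | D)."""
    score = 1.0
    for term, lam in zip(query_terms, lambdas):
        p_collection = collection_model.get(term, 0.0)
        p_document = doc_model.get(term, 0.0)
        score *= (1.0 - lam) * p_collection + lam * p_document
    return score


# Hypothetical probabilities; only the terms needed for the demo are listed.
collection = {"the": 0.05, "star": 0.01, "wars": 0.01}
doc_x = {"star": 0.5}                   # short document, contains only "star"
doc_y = {"star": 0.001, "wars": 0.001}  # long document, contains both terms

# Stop word: lambda = 0 contributes the same factor P(the | C) to every document.
stop_x = term_specific_score(["the", "star"], [0.0, 0.5], doc_x, collection)
base_x = term_specific_score(["star"], [0.5], doc_x, collection)

# Mandatory term: lambda = 1 gives a zero score to documents lacking the term.
mandatory_x = term_specific_score(["star", "wars"], [0.5, 1.0], doc_x, collection)
mandatory_y = term_specific_score(["star", "wars"], [0.5, 1.0], doc_y, collection)

# Coordination level: as all lambdas approach 1, matching more terms wins.
query = ["star", "wars"]
mid_x = term_specific_score(query, [0.5, 0.5], doc_x, collection)
mid_y = term_specific_score(query, [0.5, 0.5], doc_y, collection)
high_x = term_specific_score(query, [0.99999, 0.99999], doc_x, collection)
high_y = term_specific_score(query, [0.99999, 0.99999], doc_y, collection)

print(mid_x > mid_y, high_y > high_x)
```

With moderate weights the one-term document `doc_x` outranks `doc_y` because of its high term frequency; with weights near 1 the order flips in favour of the document that matches both query terms.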

Term-Specific Smoothing – Review
Term importance probabilities make it possible to:
– Ignore query terms in cases that statistics alone cannot account for
– Restrict the retrieved list of documents to documents that match specific terms, regardless of their frequency distributions
– Enforce a coordination level ranking of the documents, regardless of the terms' frequency distributions

Relevance Feedback
Predict optimal values for $\lambda$
Train on relevant documents and predict the probability of term importance for each term that maximizes retrieval performance
Use the Expectation Maximization (EM) algorithm
– Maximize the probability of the observed data given some training data

EM Algorithm
The algorithm iteratively maximizes the probability of the query $t_1, \dots, t_n$ given r relevant documents $D_1, \dots, D_r$
E-step
M-step
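A minimal sketch of such an iteration for a single query term follows. It is the standard EM update for the weight of a two-component mixture whose components are held fixed, which is assumed here to correspond to the E-step and M-step named on the slide. The function name, the starting value 0.5, and the toy probabilities are illustrative assumptions.

```python
def em_term_importance(doc_probs, collection_prob, lam=0.5, iterations=50):
    """Estimate lambda_i for one term from r relevant documents.

    doc_probs holds P(t_i | D_j) for each relevant document D_j and
    collection_prob is P(t_i | C), which must be positive.
    """
    if collection_prob <= 0.0:
        raise ValueError("collection_prob must be positive")
    r = len(doc_probs)
    for _ in range(iterations):
        # E-step: expected number of relevant documents in which the
        # term was generated by the document model (i.e. was important).
        expected_important = 0.0
        for p_doc in doc_probs:
            numerator = lam * p_doc
            expected_important += numerator / ((1.0 - lam) * collection_prob + numerator)
        # M-step: the new lambda is that expected count divided by r.
        lam = expected_important / r
    return lam


# Hypothetical inputs: three situations for one query term.
frequent_in_relevant = em_term_importance([0.05, 0.04, 0.06], collection_prob=0.001)
absent_in_relevant = em_term_importance([0.0, 0.0, 0.0], collection_prob=0.01)
same_as_collection = em_term_importance([0.01, 0.01], collection_prob=0.01)

print(frequent_in_relevant, absent_in_relevant, same_as_collection)
```

A term that is much more frequent in the relevant documents than in the collection is pushed toward $\lambda_i = 1$ (mandatory-like), while a term that never occurs in them drops to $\lambda_i = 0$ (stop-word-like).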

Generalization of Term Importance
Allow the RV $I_i$ to have more than 2 realizations:
– Combine the unigram document model with the bigram document model
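One way such a three-valued importance variable could be written is sketched below, assuming $\lambda_{i,k} = P(I_i = k)$ for $k \in \{0, 1, 2\}$ and a bigram document model conditioned on the previous query term. The indexing and the symbols $C$ and $P_{ml}$ are assumptions made here, not taken from the slide.

```latex
P(t_i \mid t_{i-1}, D) \;=\;
    \lambda_{i,0}\, P(t_i \mid C)
  + \lambda_{i,1}\, P_{ml}(t_i \mid D)
  + \lambda_{i,2}\, P_{ml}(t_i \mid t_{i-1}, D),
\qquad \sum_{k=0}^{2} \lambda_{i,k} = 1
```

Setting $\lambda_{i,2} = 0$ reduces this to the two-state unigram case.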

Example – General Model
last will of Alfred Nobel
+last will of Alfred Nobel
[Figure 3: Graphical model of dependence relations between query terms, with nodes D, $t_1$, $t_2$, $t_3$]

Future Research
Define a unigram LM for a topic-specific space
Extend beyond term-matching
– Use syntax (bag of words vs. structured text) and semantics (exact terms vs. equivalent terms)

Conclusions
Extension to the LM approach to IR: model the importance of a query term
– Stop Words/Phrases: trade-off between search quality and search speed
– Mandatory Terms: the user overrides the default ranking algorithm
Statistical ranking algorithms motivated by the LM approach perform well in an empirical setting

Discussion
Is this a valid approach?
How does it differ from term weighting?
Why do we want coordination level ranking?
Is the bi-gram generalization valid and/or useful?

References
D. Hiemstra. Term-Specific Smoothing for the Language Modeling Approach to Information Retrieval: The Importance of a Query Term. SIGIR'02, August 11-15, 2002.
G. Weikum. Information Retrieval and Data Mining. Course Slides. Universität des Saarlandes (Retrieved on: February 15, 2006).
