Introduction to Statistical Modeling


1 Introduction to Statistical Modeling
Rong Jin

2 Why Statistical Modeling?
Vector space model for information retrieval:
- Both documents and queries are vectors in the term space
- Relevance is measured by the similarity between document vectors and the query vector
Many problems with the vector space model:
- Ad-hoc term weighting schemes
- Ad-hoc basis vectors
- Ad-hoc similarity measurement
We need something that is much more principled!

3 A Simple Example (I)
Consider three coins: C1, C2, C3.
Alex picked one of the coins and flipped it six times. You did not see which coin he picked, but you observed the results of the flips: t, h, t, h, t, t.
Question: how can you guess which coin Alex chose?

4 A Simple Example (II)
You experimented with the three coins, flipping each of them 6 times:
C1: h, h, h, t, h, h
C2: t, t, h, t, t, t
C3: t, h, t, t, t, h
Given the observed sequence t, h, t, h, t, t, which coin do you now think Alex chose?

5 A Simple Example (III)
q: t, h, t, h, t, t → bias bq = 1/3
C1: h, h, h, t, h, h → bias b1 = 5/6
C2: t, t, h, t, t, t → bias b2 = 1/6
C3: t, h, t, t, t, h → bias b3 = 1/3
So, which coin do you think Alex selected?
A more principled approach: compute the likelihood p(q|Ci) for each coin.

6 A Simple Example (IV)
p(q|C1) = p(t, h, t, h, t, t | C1)
        = p(t|C1) * p(h|C1) * p(t|C1) * p(h|C1) * p(t|C1) * p(t|C1)
        = 1/6 * 5/6 * 1/6 * 5/6 * 1/6 * 1/6 ≈ 5.3 * 10^-4
Compute p(q|C2) and p(q|C3).
Which coin has the largest likelihood?

7 A Simple Example (IV)
p(q|C1) = p(t, h, t, h, t, t | C1)
        = p(t|C1) * p(h|C1) * p(t|C1) * p(h|C1) * p(t|C1) * p(t|C1)
        = 1/6 * 5/6 * 1/6 * 5/6 * 1/6 * 1/6 ≈ 5.3 * 10^-4
Compute p(q|C2) and p(q|C3): p(q|C2) ≈ 0.013, p(q|C3) ≈ 0.02.
Which coin has the largest likelihood?
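
To make the computation concrete, here is a minimal sketch in Python of the likelihood calculation on this slide; the head biases 5/6, 1/6, and 1/3 are the estimates from the practice flips, and the function simply multiplies per-flip probabilities under an independence assumption.

```python
# Minimal sketch of the coin-likelihood computation (plain Python, no libraries).

def sequence_likelihood(seq, p_head):
    """p(seq | coin), assuming independent flips with the given head probability."""
    likelihood = 1.0
    for flip in seq:
        likelihood *= p_head if flip == 'h' else (1.0 - p_head)
    return likelihood

q = ['t', 'h', 't', 'h', 't', 't']
biases = {'C1': 5/6, 'C2': 1/6, 'C3': 1/3}

for coin, p_head in biases.items():
    print(coin, sequence_likelihood(q, p_head))
# C1 ~ 5.4e-4, C2 ~ 0.013, C3 ~ 0.02, so C3 has the largest likelihood.
```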

8 An Information Retrieval View
Query (q): t, h, t, h, t, t
Doc1 (C1): h, h, h, t, h, h
Doc2 (C2): t, t, h, t, t, t
Doc3 (C3): t, h, t, t, t, h
Which document is ranked first if we use the vector space model?

10 An Information Retrieval View
Query (q): t, h, t, h, t, t
Doc1 (C1): h, h, h, t, h, h → sim(D1) = 1/3 * 5/6 + 2/3 * 1/6 = 0.39
Doc2 (C2): t, t, h, t, t, t → sim(D2) = 1/3 * 1/6 + 2/3 * 5/6 = 0.61
Doc3 (C3): t, h, t, t, t, h → sim(D3) = 1/3 * 1/3 + 2/3 * 2/3 = 0.56
Which document is ranked first if we use the vector space model?
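
A small sketch of the dot-product similarity used on this slide, assuming each document and the query are represented by their normalized term-frequency vectors over the two "terms" h and t (one of several possible vector space instantiations):

```python
from collections import Counter

def tf_vector(seq, vocab=('h', 't')):
    """Normalized term-frequency vector over the given vocabulary."""
    counts = Counter(seq)
    total = len(seq)
    return [counts[term] / total for term in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q  = list('ththtt')   # t, h, t, h, t, t
d1 = list('hhhthh')   # Doc1
d2 = list('tthttt')   # Doc2
d3 = list('thttth')   # Doc3

for name, doc in [('D1', d1), ('D2', d2), ('D3', d3)]:
    print(name, round(dot(tf_vector(q), tf_vector(doc)), 2))
# D1 = 0.39, D2 = 0.61, D3 = 0.56: Doc2 is ranked first by this similarity.
```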

11 A Simple Example: Summary
q: t, h, t, h, t, t
? ? ? (unknown: which coin generated q)
C1: h, h, h, t, h, h
C2: t, t, h, t, h, h
C3: t, h, t, t, t, h

12 A Simple Example: Summary
q: t, h, t, h, t, t
Estimating the likelihood p(q|bias) for each coin:
b1 = 5/6, b2 = 1/2, b3 = 1/3
Estimating the bias of each coin by counting:
C1: h, h, h, t, h, h
C2: t, t, h, t, h, h
C3: t, h, t, t, t, h

13 A Probabilistic Framework for Information Retrieval
q: ‘Bush Kerry’
Estimating the likelihood p(q|θ)
? ? ?
Estimating some statistics θ for each document: d1, …, d1000

14 A Probabilistic Framework for Information Retrieval
Three fundamental questions:
- What statistics θ should be chosen to describe the characteristics of documents?
- How do we estimate these statistics?
- How do we compute the likelihood of generating queries given the statistics?

15 Unigram Language Model
Probabilities for single words: θ = {p(w) for every word w in the vocabulary V}
Estimating a unigram language model by simple counting:
Given a document d, count the term frequency c(w,d) for each word w; then p(w) = c(w,d)/|d|.
How do we estimate the likelihood p(q|θ)?
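
A minimal sketch of this counting estimate, p(w) = c(w,d)/|d|; the example document below is made up purely for illustration:

```python
from collections import Counter

def unigram_lm(document_tokens):
    """Maximum-likelihood unigram model of a single document: p(w) = c(w,d) / |d|."""
    counts = Counter(document_tokens)
    total = len(document_tokens)
    return {word: count / total for word, count in counts.items()}

doc = "bush met kerry bush spoke".split()
theta = unigram_lm(doc)
print(theta)  # {'bush': 0.4, 'met': 0.2, 'kerry': 0.2, 'spoke': 0.2}
```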

16 Estimate p(q|) q={w1, w2, …, wk}
Similar to the example of flipping coins E.g.: q={‘bush’, ‘kerry’} d={p(‘bush’)=0.001, p(‘kerry’)=0.02} p(q|d)=0.001 * 0.02 = 2 * 10-5 What if the document didn’t mention word ‘bush’, instead it used phrase ‘president of united states’ ?
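
A sketch of the likelihood computation p(q|θ) described above; note how a query word absent from the document model drives the whole product to zero, which is exactly the issue raised by the last question:

```python
def query_likelihood(query_tokens, theta):
    """p(q | theta): product of unigram probabilities of the query words."""
    likelihood = 1.0
    for word in query_tokens:
        likelihood *= theta.get(word, 0.0)  # unseen words get probability 0
    return likelihood

theta_d = {'bush': 0.001, 'kerry': 0.02}
print(query_likelihood(['bush', 'kerry'], theta_d))       # 2e-05
print(query_likelihood(['president', 'kerry'], theta_d))  # 0.0 (unseen word)
```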

17 Estimate p(q|) q={w1, w2, …, wk}
Similar to the example of flipping coins E.g.: q={‘bush’, ‘kerry’} d={p(‘bush’)=0.001, p(‘kerry’)=0.02} p(q|d)=0.001 * 0.02 = 2 * 10-5 What if the document didn’t mention word ‘bush’, instead it used phrase ‘president of united states’ ?

18 Illustration of Language Models for Information Retrieval
q: t, h, t, h, t, t Estimating likelihood p(q|)=[p(h)]2[p(t)]4 2 = {p(h)=1/2, p(t)=1/2} 1 = {p(h)=1/3, p(t)=2/3} 1 = {p(h)=5/6, p(t)=1/6} Estimating language models by counting d1:h, h, h, t, h,h d2:t, t, h, t, h, h d3: t, h, t, t, t, h

20 A Simple Example: Summary
q: t, h, t, h, t, t
Estimating the likelihood p(q|θ) = [p(h)]^2 [p(t)]^4
θ1 = {p(h)=5/6, p(t)=1/6}
θ2 = {p(h)=1/2, p(t)=1/2}
θ3 = {p(h)=1/3, p(t)=2/3}
Estimating language models by counting:
d1: h, h, h, t, h, h
d2: t, t, h, t, h, h
d3: t, h, t, t, t, h
Problems?

21 Problems With Unigram LM
Unigram probabilities are insufficient for representing true documents.
Simple counting for estimating unigram probabilities:
- It does not account for variance in documents (if you ask the same person to write the same story twice, the two versions will differ).
- Most words will have zero probability: the sparse data problem.

22 Sparse Data Problems
- Shrinkage
- Maximum a posteriori (MAP) estimation
- Bayesian approach

23 Shrinkage: Jelinek-Mercer Smoothing
Linearly interpolate between the document language model and the collection language model:
p_smooth(w|d) = (1 - λ) * p(w|d) + λ * p(w|C)
where p(w|d) is the estimate based on the individual document, p(w|C) is the estimate based on the corpus, and 0 < λ < 1 is a smoothing parameter.
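
A minimal sketch of Jelinek-Mercer smoothing as stated above; conventions differ on whether λ weights the document or the collection model, and the collection probabilities below are invented purely for illustration:

```python
def jelinek_mercer(word, theta_doc, theta_collection, lam=0.5):
    """p_smooth(w | d) = (1 - lam) * p(w | d) + lam * p(w | collection)."""
    return (1 - lam) * theta_doc.get(word, 0.0) + lam * theta_collection.get(word, 0.0)

theta_d = {'kerry': 0.02}                     # document never mentions 'bush'
theta_c = {'bush': 0.005, 'kerry': 0.004}     # made-up collection-wide estimates
print(jelinek_mercer('bush', theta_d, theta_c))  # 0.0025: nonzero despite c('bush', d) = 0
```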

24 Smoothing & TF-IDF Weighting
Are smoothing and TF-IDF weighting completely unrelated?

25 Smoothing & TF-IDF Weighting
One part of the smoothed query likelihood is similar to TF-IDF weighting; the other part is the same for every document and is therefore irrelevant to ranking.
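
One standard way to see the connection (a sketch of the usual query-likelihood derivation, not taken verbatim from the slide): with Jelinek-Mercer smoothing, the log query likelihood splits into a document-dependent sum that behaves like TF-IDF and a sum that is identical for all documents.

```latex
\log p(q \mid d)
  = \sum_{w \in q} \log\bigl[(1-\lambda)\,p(w \mid d) + \lambda\,p(w \mid C)\bigr]
  = \sum_{w \in q \cap d} \log\!\left(1 + \frac{(1-\lambda)\,p(w \mid d)}{\lambda\,p(w \mid C)}\right)
    + \sum_{w \in q} \log\bigl(\lambda\,p(w \mid C)\bigr)
```

The second sum does not depend on the document, so ranking is driven by the first sum, in which p(w|d) plays the role of term frequency and 1/p(w|C) plays the role of an inverse document (collection) frequency.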

