
1 Modern information retrieval: Week 3 Probabilistic Model

2 Last Time…
- Boolean model: based on the notion of sets. Documents are retrieved only if they satisfy the Boolean conditions specified in the query; no ranking is imposed on retrieved documents. Exact match.
- Vector space model: based on geometry, the notion of vectors in high-dimensional space. Documents are ranked by their similarity to the query (ranked retrieval). Best/partial match.

3 Probabilistic Model
Views retrieval as an attempt to answer a basic question: “What is the probability that this document is relevant to this query?”
Expressed as P(REL|D), read “probability of relevance given a particular document D” (P(x|y) is the probability of x given y).

4 Assumptions
- “Document” here means the content representation or description, i.e. a surrogate.
- Relevance is binary.
- Relevance of a document is independent of the relevance of other documents.
- Terms are independent of one another.

5 Statistical Independence
A and B are independent if and only if: P(A and B) = P(A) × P(B)
Simplest example: a series of coin flips.
Independence formalizes “unrelated”:
P(“being brown eyed”) = 6/10
P(“being a doctor”) = 1/1000
P(“being a brown-eyed doctor”) = P(“being brown eyed”) × P(“being a doctor”) = 6/10,000
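As a quick aside (not from the original slides), a short Python simulation makes the independence definition concrete: for two fair coin flips, the empirical joint probability matches the product of the marginals up to sampling noise. The coin-flip setup and variable names are illustrative assumptions.

```python
import random

# Minimal sketch: check P(A and B) ≈ P(A) × P(B) for two independent coin flips.
random.seed(42)
N = 100_000
flips = [(random.random() < 0.5, random.random() < 0.5) for _ in range(N)]

p_a = sum(a for a, _ in flips) / N         # P(A): first flip is heads
p_b = sum(b for _, b in flips) / N         # P(B): second flip is heads
p_ab = sum(a and b for a, b in flips) / N  # P(A and B): both heads

print(f"P(A) × P(B) = {p_a * p_b:.4f}, P(A and B) = {p_ab:.4f}")
# The two values agree (up to sampling noise) because the flips are independent.
```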

6 Dependent Events
Suppose:
P(“having a B.S. degree”) = 3/10
P(“being a doctor”) = 1/1000
Would you expect:
P(“having a B.S. degree and being a doctor”) = P(“having a B.S. degree”) × P(“being a doctor”) = 3/10,000?
Another example:
P(“being a doctor”) = 1/1000
P(“having studied anatomy”) = 12/1000
P(“having studied anatomy” | “being a doctor”) = ??

7 Conditional Probability
P(A | B) ≡ P(A and B) / P(B)
[Venn diagram: regions A and B overlapping in “A and B” within the event space]
P(A) = probability of A relative to the entire event space
P(A|B) = probability of A given that we know B is true

8 Doctors and Anatomy
P(A | B) ≡ P(A and B) / P(B)
A = having studied anatomy, B = being a doctor
What is P(“having studied anatomy” | “being a doctor”)?
P(“being a doctor”) = 1/1000
P(“having studied anatomy”) = 12/1000
P(“being a doctor who studied anatomy”) = 1/1000
P(“having studied anatomy” | “being a doctor”) = (1/1000) / (1/1000) = 1

9 More on Conditional Probability
What if P(A|B) = P(A)? Then A and B must be statistically independent!
Is P(A|B) = P(B|A)? Not in general:
A = having studied anatomy, B = being a doctor
P(“having studied anatomy” | “being a doctor”) = 1
P(“being a doctor”) = 1/1000
P(“having studied anatomy”) = 12/1000
P(“being a doctor who studied anatomy”) = 1/1000
P(“being a doctor” | “having studied anatomy”) = (1/1000) / (12/1000) = 1/12
If you’re a doctor, you must have studied anatomy; if you’ve studied anatomy, you’re more likely to be a doctor, but you could also be a biologist, for example.
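A few lines of Python (a sketch added here, not part of the deck) reproduce the doctors-and-anatomy numbers and show that the two conditioning directions give different answers:

```python
# Conditional probability: P(A | B) = P(A and B) / P(B), with the slides' numbers.
p_doctor = 1 / 1000    # P(B): being a doctor
p_anatomy = 12 / 1000  # P(A): having studied anatomy
p_both = 1 / 1000      # P(A and B): being a doctor who studied anatomy

p_anatomy_given_doctor = p_both / p_doctor   # = 1.0: every doctor studied anatomy
p_doctor_given_anatomy = p_both / p_anatomy  # = 1/12: most who studied anatomy aren't doctors

print(p_anatomy_given_doctor, p_doctor_given_anatomy)  # 1.0 0.0833...
```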

10 Applying Bayes’ Theorem
P(“have disease”) = 0.0001 (0.01%)
P(“test positive” | “have disease”) = 0.99 (99%)
P(“test positive” | “don’t have disease”) = 0.01 (1% false positive rate)
P(“test positive”) = (0.99)(0.0001) + (0.01)(0.9999) = 0.010098
Two cases:
1. You have the disease, and you tested positive.
2. You don’t have the disease, but you tested positive (error).
Bayes’ theorem: P(A|B) = P(A and B)/P(B) = P(B|A) × P(A)/P(B)
P(“have disease” | “test positive”) = (0.99)(0.0001) / 0.010098 = 0.009804 = 0.9804%
Don’t worry!
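The slide’s arithmetic can be checked with a short script (a sketch, not from the deck); the 1% false-positive rate is inferred from the stated P(“test positive”) = 0.010098:

```python
# Bayes' theorem on the disease-test example from slide 10.
p_disease = 0.0001          # prior: P(have disease)
p_pos_given_disease = 0.99  # sensitivity: P(test positive | have disease)
p_pos_given_healthy = 0.01  # false-positive rate (inferred from the slide's total)

# Total probability over the two cases on the slide:
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
print(f"{p_pos:.6f}")  # 0.010098

# Posterior: P(have disease | test positive) = P(pos | disease) × P(disease) / P(pos)
posterior = p_pos_given_disease * p_disease / p_pos
print(f"{posterior:.6f}")  # 0.009804, i.e. about 0.98%
```

Even with a 99%-sensitive test, a positive result still means less than a 1% chance of disease, because the disease is so rare; this is why the slide says “Don’t worry!”.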

11 Bayes’ Formula
Let D1, D2, …, Dn be a partition of the sample space S, where P(Di) denotes the probability of event Di and P(Di) > 0 (i = 1, 2, …, n). Then for any event x with P(x) > 0:
P(Di | x) = P(x | Di) P(Di) / Σj P(x | Dj) P(Dj)

12 Probabilistic Model
Objective: to capture the IR problem using a probabilistic framework.
Given a user query, there is an ideal answer set.
Querying is a specification of the properties of this ideal answer set (clustering).
But what are these properties?
Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set).
Improve by iteration.

13 Probabilistic Ranking Principle
Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find document dj interesting (i.e., relevant).
The model assumes that this probability of relevance depends only on the query and document representations.
The ideal answer set is referred to as R and should maximize the probability of relevance; documents in the set R are predicted to be relevant.
But how do we compute these probabilities? What is the sample space?

14 The Ranking
The probabilistic ranking is computed as:
sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)
This is the odds of document dj being relevant.
Taking the odds minimizes the probability of an erroneous judgement.
[Figure: document d within collection C, with relevant subset R]

15 The Ranking
Definition: wij ∈ {0,1}
P(R | vec(dj)): probability that the given document is relevant
P(¬R | vec(dj)): probability that the given document is not relevant
sim(dj,q) = P(R | vec(dj)) / P(¬R | vec(dj))
= [P(vec(dj) | R) × P(R)] / [P(vec(dj) | ¬R) × P(¬R)]    (Bayes’ rule)
~ P(vec(dj) | R) / P(vec(dj) | ¬R)    (P(R)/P(¬R) is the same for every document, so it can be dropped for ranking)
P(vec(dj) | R): probability of randomly selecting the document dj from the set R of relevant documents

16 The Ranking
sim(dj,q) ~ P(vec(dj) | R) / P(vec(dj) | ¬R)
~ [ ∏{ki ∈ dj} P(ki | R) × ∏{ki ∉ dj} P(¬ki | R) ] / [ ∏{ki ∈ dj} P(ki | ¬R) × ∏{ki ∉ dj} P(¬ki | ¬R) ]
where P(¬ki | R) = 1 − P(ki | R) and P(¬ki | ¬R) = 1 − P(ki | ¬R)
P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents

17 The Ranking: Example
Initial query: t1 t2 t4. P(ti|R) and P(ti|¬R) are listed below.

Term      t1    t2    t3    t4    t5
P(ti|R)   0.8   0.9   0.3   0.32  0.15
P(ti|¬R)  0.3   0.1   0.35  0.33  0.10

Document D1 contains t2 and t5, so:
P(D1|R) = (1−0.8) × 0.9 × (1−0.3) × (1−0.32) × 0.15 = 0.012852
P(D1|¬R) = (1−0.3) × 0.1 × (1−0.35) × (1−0.33) × 0.10 = 0.0030485
P(D1|R) / P(D1|¬R) = 4.216
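The slide-16 formula and this worked example translate directly into a short script (a sketch; the variable names are mine, not the deck’s):

```python
# Odds ranking from slide 16, checked against the slide-17 example.
p_rel = [0.8, 0.9, 0.3, 0.32, 0.15]      # P(ti | R)  for t1..t5
p_nonrel = [0.3, 0.1, 0.35, 0.33, 0.10]  # P(ti | ¬R) for t1..t5
doc = {1, 4}                             # D1 contains t2 and t5 (0-based indices)

def likelihood(probs, doc):
    """Product over all index terms: p if present in the doc, (1 - p) if absent."""
    result = 1.0
    for i, p in enumerate(probs):
        result *= p if i in doc else (1 - p)
    return result

odds = likelihood(p_rel, doc) / likelihood(p_nonrel, doc)
print(f"{odds:.3f}")  # 4.216, matching the slide
```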

18 The Initial Ranking
How do we get the probabilities P(ki | R) and P(ki | ¬R)? Estimates based on assumptions:
P(ki | R) = 0.5
P(ki | ¬R) = ni/N, where ni is the number of documents that contain ki and N is the total number of documents
Use this initial guess to retrieve an initial ranking, then improve upon it.

19 Improving the Initial Ranking
Let V be the set of documents initially retrieved, and Vi the subset of retrieved documents that contain ki.
Re-evaluate the estimates:
P(ki | R) = Vi / V
P(ki | ¬R) = (ni − Vi) / (N − V)
Repeat recursively.
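Put together, the initial estimates (slide 18) and one re-estimation step (slide 19) look roughly like this; the toy corpus of term-index sets and the choice of retrieved subset are illustrative assumptions:

```python
# Documents are represented as sets of term indices.
def initial_estimates(docs, num_terms):
    N = len(docs)
    p_rel = [0.5] * num_terms                        # P(ki | R) = 0.5
    n = [sum(i in d for d in docs) for i in range(num_terms)]
    p_nonrel = [n[i] / N for i in range(num_terms)]  # P(ki | ¬R) = ni / N
    return p_rel, p_nonrel

def reestimate(docs, retrieved, num_terms):
    N, V = len(docs), len(retrieved)
    n = [sum(i in d for d in docs) for i in range(num_terms)]
    v = [sum(i in d for d in retrieved) for i in range(num_terms)]
    p_rel = [v[i] / V for i in range(num_terms)]                    # Vi / V
    p_nonrel = [(n[i] - v[i]) / (N - V) for i in range(num_terms)]  # (ni - Vi) / (N - V)
    return p_rel, p_nonrel

docs = [{0, 1}, {1, 2}, {0, 2}, {2}]  # toy corpus over terms k0, k1, k2
retrieved = docs[:2]                  # pretend these were initially retrieved
print(initial_estimates(docs, 3))
print(reestimate(docs, retrieved, 3))
```

In practice a small smoothing constant (e.g. adding 0.5 to the counts) is used so that no estimate is exactly 0 or 1; the slides omit this detail.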

20 Probabilistic Model
An initial set of documents is retrieved somehow.
The user inspects these documents looking for the relevant ones (in practice, only the top 10–20 need to be inspected).
The IR system uses this information to refine the description of the ideal answer set.
By repeating this process, it is expected that the description of the ideal answer set will improve.
Keep in mind the need to guess, at the very beginning, the description of the ideal answer set.
The description of the ideal answer set is modeled in probabilistic terms.

21 Pluses and Minuses
Advantages:
- Documents are ranked in decreasing order of probability of relevance.
Disadvantages:
- Need to guess the initial estimates for P(ki | R).
- The method does not take tf and idf factors into account.

22 Brief Comparison of Classic Models
- The Boolean model does not provide for partial matches and is considered the weakest classic model.
- Salton and Buckley ran a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections.
- This also seems to be the view of the research community.

23 Comparison With Vector Space
Similar in some ways:
- Terms are treated as if they were independent (unigram language model).
Different in others:
- Based on probability rather than similarity.
- The intuitions are probabilistic (processes for generating text) rather than geometric.
- The details of how document length and term, document, and collection frequencies are used differ.

24 What’s the Point?
Probabilistic models formalize assumptions:
- Binary relevance
- Document independence
- Term independence
None of which are actually true:
- Relevance isn’t binary.
- Documents are often not independent.
- Terms are clearly not independent.
But it works!

25 Extended Boolean Model
Disadvantages of the Boolean model:
- No term weights are used. Counterexample: for the query q = Kx AND Ky, a document containing just one of the terms, e.g. Kx, is considered as irrelevant as a document containing neither.
- The size of the output might be too large or too small.

26 Extended Boolean Model
The Extended Boolean model was introduced in 1983 by Salton, Fox, and Wu.
The idea is to make use of term weights, as in the vector space model.
Strategy: combine Boolean queries with the vector space model.
Why not just use the vector space model? Boolean queries have the advantage that they are easy for the user to provide.

27 [Figure: extended Boolean logic in the space composed of the two terms kx and ky only, with kx and ky as the axes]

28 Extended Boolean Model
For the query q = Kx OR Ky, (0,0) is the point we try to avoid. Thus, we can rank documents by their distance from (0,0):
sim(q_or, d) = sqrt( (x² + y²) / 2 )
where x and y are the document’s weights for Kx and Ky. The bigger the better.

29 Extended Boolean Model
For the query q = Kx AND Ky, (1,1) is the most desirable point. We rank documents by how close they are to (1,1):
sim(q_and, d) = 1 − sqrt( ((1−x)² + (1−y)²) / 2 )
The bigger the better.
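Assuming the standard Euclidean (p = 2) form of these measures from Salton, Fox, and Wu, both similarities fit in a few lines (a sketch, with x and y being the document’s weights for Kx and Ky):

```python
from math import sqrt

def sim_or(x, y):
    """q = Kx OR Ky: normalized distance from (0, 0); bigger is better."""
    return sqrt((x**2 + y**2) / 2)

def sim_and(x, y):
    """q = Kx AND Ky: one minus the normalized distance from (1, 1)."""
    return 1 - sqrt(((1 - x)**2 + (1 - y)**2) / 2)

print(sim_or(0.5, 0.5), sim_and(0.5, 0.5))  # 0.5 0.5
print(sim_or(1.0, 0.0), sim_and(1.0, 0.0))  # ≈0.707  ≈0.293
```

A document with full weight on one term scores well under OR but poorly under AND, which matches the geometric intuition on slide 27.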

30 Exercise
Term–document matrix (terms × documents D1–D4):

Term           D1 D2 D3 D4
adventure       0  0  1  1
agriculture     0  0  0  0
bridge          1  1  0  0
cathedrals      0  0  1  1
disasters       0  0  0  0
flags           0  0  0  1
horticulture    1  0  1  0
leprosy         1  0  1  0
Mediterranean   0  1  0  0
recipes         0  0  1  1
scholarships    1  0  0  0
tennis          1  1  0  0
Venus           0  0  1  0

Query: bridge tennis

31 Exercise
Given the document corpus:
D1: 北京安立文高新技术公司
D2: 新一代的网络访问技术
D3: 北京卫星网络有限公司
D4: 是最先进的总线技术。。。
D5: 北京升平卫星技术有限公司的新技术有。。。
Running Chinese word-segmentation software gives the following tokens, separated by “/”:
D1: 北京 / 安 / 立 / 文 / 高新 / 技术 / 公司
D2: 新 / 一 / 代 / 的 / 网络 / 访问 / 技术
D3: 北京 / 卫星 / 网络 / 有限 / 公司
D4: 是 / 最 / 先进 / 的 / 总线 / 技术 。。。
D5: 北京 / 升 / 平 / 卫星 / 技术 / 有限 / 公司 / 的 / 新 / 技术 / 有 。。。
Your task is to design an information retrieval system for these documents. Specifically:
(1) Give the system’s effective vocabulary (explain what you keep and discard, and why).
(2) Write the VSM representations of D1 and D2.
(3) Using the cosine formula for the angle between vectors, give the top 3 results for the query “技术的公司” (“technology company”).

