1
Information Retrieval Modeling CS 652 Information Extraction and Integration
2
Introduction
- IR systems usually adopt index terms to index documents and process queries.
- Index term:
  - a keyword or group of selected/related words
  - any word (in general)
- Problems of IR based on index terms:
  - oversimplification and loss of semantics
  - badly formed queries
- Stemming might be used:
  - connect: connecting, connection, connections
- An inverted file is built for the chosen index terms.
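To make the last two bullets concrete, here is a minimal sketch of stemming followed by inverted-file construction. The suffix list, the `toy_stem` helper, and the three sample documents are assumptions made for illustration; the slides do not name a stemming algorithm, and a real system would use an established one.

```python
from collections import defaultdict

# Toy suffix list (an assumption for illustration, not a real stemming algorithm).
SUFFIXES = ("ions", "ion", "ing", "s")

def toy_stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_file(docs):
    """Map each stemmed index term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[toy_stem(word)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical three-document collection.
docs = {
    1: "connecting users",
    2: "network connections",
    3: "a connection failed",
}
inverted = build_inverted_file(docs)
```

Here `inverted["connect"]` lists all three documents, because the three surface forms collapse to one index term.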
3
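Introduction
[Figure: the documents (Docs) and the user's Information Need are each mapped to Index Terms; the resulting doc and query representations are matched to produce a Ranking.]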
4
Introduction
- Matching at the index term level is quite imprecise.
  - Consequence: it is no surprise that users are frequently dissatisfied.
- Since most users have no training in query formulation, the problem is even worse.
  - Result: frequent dissatisfaction among Web users.
- Deciding relevance is the critical issue for IR systems: ranking.
5
Introduction
- A ranking is an ordering of the retrieved documents that (hopefully) reflects the relevance of the documents to the user query.
- A ranking is based on fundamental premises regarding the notion of relevance, such as:
  - common sets of index terms
  - sharing of weighted terms
  - likelihood of relevance
- Each set of premises leads to a distinct IR model.
6
IR Models
[Figure: taxonomy of IR models, organized by user task.]
- User task: Retrieval (Ad hoc, Filtering)
  - Classic Models: Boolean, Vector (Space), Probabilistic
  - Set Theoretic: Fuzzy, Extended Boolean
  - Algebraic: Generalized Vector (Space), Latent Semantic Index, Neural Networks
  - Probabilistic: Inference Network, Belief Network
  - Structured Models: Non-Overlapping Lists, Proximal Nodes
- User task: Browsing
  - Browsing: Flat, Structure Guided, Hypertext
7
IR Models
- The IR model, the logical view of the documents, and the retrieval task are distinct aspects of the system.
8
Retrieval: Ad Hoc vs. Filtering
- Ad hoc retrieval: static set of documents and dynamic queries.
[Figure: a "Fixed Size" collection queried by Q1, Q2, Q3, Q4, Q5.]
9
Retrieval: Ad Hoc vs. Filtering
- Filtering: dynamic set of documents and static queries.
[Figure: an incoming Documents Stream is matched against the User 1 Profile and the User 2 Profile; the outgoing results are the docs for User 1 and the docs filtered for User 2.]
10
Classic IR Models - Basic Concepts
- Each document is described by a set of representative keywords, or index terms.
- Index terms are document words (mainly nouns) that carry meaning by themselves and help recall the main themes of a document.
- However, search engines assume that all words are index terms (full-text representation).
- User profile construction process:
  1) user-provided keywords
  2) relevance feedback cycle
  3) system-adjusted user profile description
  4) repeat step (2) until stabilized
11
Classic IR Models - Basic Concepts
- Not all terms are equally useful for representing the document contents: less frequent terms identify a narrower set of documents.
- The importance of the index terms is represented by weights associated with them.
- Let
  - $k_i$ be an index term
  - $d_j$ be a document
  - $w_{ij}$ be a weight associated with $(k_i, d_j)$, which quantifies the importance of $k_i$ for describing the contents of $d_j$
12
Classic IR Models - Basic Concepts
  - $k_i$ is a generic index term
  - $d_j$ is a document
  - $t$ is the total number of index terms in an IR system
  - $K = \{k_1, k_2, \ldots, k_t\}$ is the set of all index terms
  - $w_{ij} \ge 0$ is a weight associated with $(k_i, d_j)$
  - $w_{ij} = 0$ indicates that $k_i$ does not belong to $d_j$
  - $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$ is a weighted vector associated with the document $d_j$
  - $g_i(\vec{d}_j) = w_{ij}$ is a function which returns the weight associated with the pair $(k_i, d_j)$
13
The Boolean Model
- Simple model based on set theory and Boolean algebra.
- Queries are specified as Boolean expressions:
  - precise semantics
  - neat formalism
  - $q = k_a \wedge (k_b \vee \neg k_c)$
- Terms are either present or absent. Thus, $w_{ij} \in \{0, 1\}$.
- Consider
  - $q = k_a \wedge (k_b \vee \neg k_c)$
  - $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$, the disjunctive normal form of $q$ over $(k_a, k_b, k_c)$
  - $\vec{q}_{cc} = (1,1,0)$ is a conjunctive component
14
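The Boolean Model
- $q = k_a \wedge (k_b \vee \neg k_c)$

$$
sim(q, d_j) =
\begin{cases}
1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\; g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\
0 & \text{otherwise}
\end{cases}
$$

[Figure: Venn diagram of $K_a$, $K_b$, $K_c$ with the regions (1,1,1), (1,0,0), and (1,1,0) marked.]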
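The matching rule can be sketched in a few lines for the query $q = k_a \wedge (k_b \vee \neg k_c)$. The term names `"ka"`, `"kb"`, `"kc"` and the function `boolean_sim` are illustrative assumptions; the sketch also assumes that only the three query terms take part in the comparison.

```python
# Disjunctive normal form of q = ka AND (kb OR NOT kc),
# written over the term order (ka, kb, kc).
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def boolean_sim(doc_terms, q_dnf=Q_DNF, vocabulary=("ka", "kb", "kc")):
    """Return 1 if the document's binary pattern equals a conjunctive component, else 0."""
    pattern = tuple(1 if term in doc_terms else 0 for term in vocabulary)
    return 1 if pattern in q_dnf else 0

# A document containing ka and kb matches the component (1, 1, 0).
match = boolean_sim({"ka", "kb"})
# A document containing ka and kc has pattern (1, 0, 1), which is not in the DNF.
no_match = boolean_sim({"ka", "kc"})
```

The result is always 0 or 1, which is exactly the absence of partial matching discussed next.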
15
Drawbacks of the Boolean Model
- Retrieval is based on a binary decision criterion with no notion of partial matching.
- No ranking of the documents is provided (absence of a grading scale).
- The information need has to be translated into a Boolean expression, which most users find awkward.
- The Boolean queries formulated by users are most often too simplistic.
- Because of exact matching, the model frequently returns either too few or too many documents in response to a user query.
16
The Vector (Space) Model (VSM)
- The use of binary weights is too limiting.
- Non-binary weights allow for partial matching.
- These term weights are used to compute a degree of similarity between a query and a document.
- A ranked set of documents provides better matching, i.e., partial matching vs. exact matching.
- The answer set of the VSM is a lot more precise than the answer set retrieved by the Boolean model.
17
The Vector (Space) Model
- Define:
  - $w_{ij} > 0$ whenever $k_i \in d_j$
  - $w_{iq} \ge 0$, associated with the pair $(k_i, q)$
  - $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$, the document vector of $d_j$
  - $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$, the query vector of $q$
  - To each term $k_i$ is associated a unit vector $\vec{i}$
  - The unit vectors $\vec{i}$ and $\vec{j}$ are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
- The $t$ unit vectors $\vec{i}$ form an orthonormal basis for a $t$-dimensional space.
- In this space, queries and documents are represented as weighted vectors.
18
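The Vector (Space) Model

$$
sim(q, d_j) = \cos(\theta) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j|\,|\vec{q}|}
= \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\;\sqrt{\sum_{i=1}^{t} w_{iq}^2}}
$$

where $\cdot$ is the inner product operator and $|\vec{q}|$ is the length of $\vec{q}$.
- Since $w_{ij} \ge 0$ and $w_{iq} \ge 0$, $0 \le sim(q, d_j) \le 1$.
- A document is retrieved even if it matches the query terms only partially.
[Figure: the vectors $\vec{d}_j$ and $\vec{q}$ drawn against the axes $i$ and $j$, separated by the angle $\theta$.]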
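A minimal sketch of the cosine formula, assuming documents and queries are given as equal-length lists of non-negative weights. The function name `cosine_sim` and the convention of returning 0 for an all-zero vector are choices made here, not part of the slides.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an empty vector is treated as matching nothing
    return dot / (norm_d * norm_q)

# Partial match: the document shares one of its two terms with the query.
partial = cosine_sim([1.0, 1.0, 0.0], [1.0, 0.0, 0.0])
```

Unlike the Boolean model, a document that shares only some terms with the query still receives a score strictly between 0 and 1.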
19
The Vector (Space) Model
- $sim(q, d_j) = \left[ \sum_{i=1}^{t} w_{ij}\, w_{iq} \right] / \left( |\vec{d}_j|\,|\vec{q}| \right)$
- How to compute the weights $w_{ij}$ and $w_{iq}$?
- A good weight must take into account two effects:
  1. quantification of intra-document contents (similarity): the tf factor, the term frequency within a document
  2. quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
  - $w_{ij} = tf(i, j) \times idf(i)$
20
The Vector (Space) Model
- Let
  - $N$ be the total number of documents in the collection
  - $n_i$ be the number of documents which contain $k_i$
  - $freq(i, j)$ be the raw frequency of $k_i$ within $d_j$
- A normalized tf factor is given by
  - $f(i, j) = freq(i, j) / \max_l freq(l, j)$, where the maximum is computed over all terms which occur within the document $d_j$
- The inverse document frequency (idf) factor is
  - $idf(i) = \log(N / n_i)$
  - The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term $k_i$.
21
The Vector (Space) Model
- The best term-weighting schemes use weights which are given by
  - $w_{ij} = f(i, j) \times \log(N / n_i)$
  - This strategy is called a tf-idf weighting scheme.
- For the query term weights, a suggestion is
  - $w_{iq} = \left( 0.5 + \frac{0.5\, freq(i, q)}{\max_l freq(l, q)} \right) \times \log(N / n_i)$
- The vector model with tf-idf weights is a good ranking strategy for general collections.
- The VSM is usually as good as the known ranking alternatives. It is also simple and fast to compute.
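The two weighting formulas can be sketched as follows. The function names, the tiny three-document collection, and the use of the natural logarithm are assumptions for illustration (the slides do not fix the log base); query terms that occur in no document are simply skipped in this sketch.

```python
import math
from collections import Counter

def doc_frequencies(docs):
    """n_i: the number of documents that contain each term."""
    return Counter(term for doc in docs for term in set(doc))

def document_weights(docs):
    """w_ij = f(i, j) * log(N / n_i), with f(i, j) = freq(i, j) / max_l freq(l, j)."""
    N = len(docs)
    n = doc_frequencies(docs)
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values(), default=1)
        weights.append({term: (f / max_freq) * math.log(N / n[term])
                        for term, f in freq.items()})
    return weights

def query_weights(query, docs):
    """w_iq = (0.5 + 0.5 * freq(i, q) / max_l freq(l, q)) * log(N / n_i)."""
    N = len(docs)
    n = doc_frequencies(docs)
    freq = Counter(query)
    max_freq = max(freq.values(), default=1)
    return {term: (0.5 + 0.5 * f / max_freq) * math.log(N / n[term])
            for term, f in freq.items() if n[term] > 0}

# Hypothetical collection of tokenized documents.
docs = [["a", "a", "b"], ["a", "c"], ["a", "b", "b"]]
W = document_weights(docs)
Q = query_weights(["b", "c", "z"], docs)
```

In this collection the term `"a"` occurs in every document, so its idf is $\log(3/3) = 0$ and it contributes nothing to any document weight, however frequent it is within a document.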
22
The Vector (Space) Model
- Advantages:
  - term weighting improves the quality of the answer set
  - partial matching allows retrieval of documents that approximate the query conditions
  - the cosine ranking formula sorts documents according to their degree of similarity to the query
  - a popular IR model because of its simplicity and speed
- Disadvantages:
  - assumes mutual independence of index terms (??); it is not clear that this is bad, though
23
The Vector (Space) Model: Example I
[Figure: documents d1 to d7 and index terms k1, k2, k3.]
24
The Vector (Space) Model: Example II
[Figure: documents d1 to d7 and index terms k1, k2, k3.]
25
The Vector (Space) Model: Example III
[Figure: documents d1 to d7 and index terms k1, k2, k3.]
26
Probabilistic Model
- Objective: to capture the IR problem using a probabilistic framework.
- Given a user query, there is an ideal answer set.
- Querying is the specification of the properties of this ideal answer set (clustering).
- But what are these properties?
- Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set).
- Improve by iteration.
27
Probabilistic Model
- An initial set of documents is retrieved somehow.
- The user inspects these docs looking for the relevant ones (in practice, only the top 10 to 20 need to be inspected).
- The IR system uses this information to refine the description of the ideal answer set.
- By repeating this process, the description of the ideal answer set is expected to improve.
- Always keep in mind the need to guess the description of the ideal answer set at the very beginning.
- The description of the ideal answer set is modeled in probabilistic terms.
28
Probabilistic Ranking Principle
- Given a user query $q$ and a document $d$ in a collection, the probabilistic model tries to estimate the probability that the user will find the document $d$ interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. The ideal answer set $R$ should maximize the probability of relevance. Documents in $R$ are predicted to be relevant to $q$.
- Problems:
  - How to compute the probabilities of relevance?
  - What is the sample space?
29
The Ranking
- Probabilistic ranking is computed as:
  - $sim(q, d_j) = P(d_j \text{ relevant to } q) / P(d_j \text{ non-relevant to } q)$
  - This is the odds of the document $d_j$ being relevant to $q$.
  - Taking the odds minimizes the probability of an erroneous judgement.
- Definitions:
  - $w_{ij}, w_{iq} \in \{0, 1\}$, i.e., all index term weights are binary
  - $P(R \mid \vec{d}_j)$: probability that doc $d_j$ is relevant to $q$, where $R$ is the set of docs known to be relevant
  - $P(\bar{R} \mid \vec{d}_j)$: probability that doc $d_j$ is not relevant to $q$, where $\bar{R}$ is the complement of $R$
30
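The Ranking

$$
sim(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)}
= \frac{P(\vec{d}_j \mid R)\, P(R)}{P(\vec{d}_j \mid \bar{R})\, P(\bar{R})}
\sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})}
$$

where the second equality follows from Bayes' rule, and
- $P(\vec{d}_j \mid R)$: probability of randomly selecting $d_j$ from the set $R$ of relevant documents
- $P(R)$: probability that a document randomly selected from the collection is relevant

$P(R)$ and $P(\bar{R})$ are the same for every document, so dropping them does not change the ranking.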
31
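The Ranking

$$
sim(d_j, q) \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})}
\sim \frac{\left[ \prod_{g_i(\vec{d}_j)=1} P(k_i \mid R) \right] \left[ \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R) \right]}
{\left[ \prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R}) \right] \left[ \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R}) \right]}
$$

where
- $P(k_i \mid R)$: probability that the index term $k_i$ is present in a document randomly selected from the set $R$ of relevant documents
- $P(\bar{k}_i \mid R)$: probability that the index term $k_i$ is absent from such a document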
32
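The Ranking

$$
sim(d_j, q) \sim \log \frac{\left[ \prod_{g_i(\vec{d}_j)=1} P(k_i \mid R) \right] \left[ \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R) \right]}
{\left[ \prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R}) \right] \left[ \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R}) \right]}
$$

$$
\sim K + \sum_{i:\, g_i(\vec{d}_j)=1} \left[ \log \frac{P(k_i \mid R)}{P(\bar{k}_i \mid R)} + \log \frac{P(\bar{k}_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right]
$$

$$
\sim \sum_{i=1}^{t} w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
$$

where $K$ is a constant that is the same for all documents (so it does not affect the ranking), $P(\bar{k}_i \mid R) = 1 - P(k_i \mid R)$, and the last step keeps only the terms that occur in the query ($w_{iq} = 1$).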
33
The Initial Ranking

$$
sim(d_j, q) \sim \sum_{i=1}^{t} w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
$$

- How to obtain the probabilities $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$?
- Estimates based on assumptions:
  - $P(k_i \mid R) = 0.5$
  - $P(k_i \mid \bar{R}) = n_i / N$, where $n_i$ is the number of docs that contain $k_i$ and $N$ is the number of docs in the collection
  - Use this initial guess to retrieve an initial ranking.
  - Improve upon this initial ranking.
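A minimal sketch of the initial ranking under these two estimates, assuming binary weights so that only terms present in both the query and the document contribute. The function name `initial_score`, the document-frequency table `n`, and the decision to skip terms with $n_i = N$ (for which the log-odds are undefined) are assumptions made for illustration.

```python
import math

def initial_score(doc_terms, query_terms, n, N):
    """Score a document with P(k_i|R) = 0.5 and P(k_i|not R) = n_i / N."""
    score = 0.0
    for term in query_terms & doc_terms:   # w_iq * w_ij = 1 only for shared terms
        if not 0 < n[term] < N:
            continue  # assumption: skip terms whose log-odds are undefined
        p_rel = 0.5
        p_non = n[term] / N
        score += math.log(p_rel / (1 - p_rel)) + math.log((1 - p_non) / p_non)
    return score

# Hypothetical document frequencies in a collection of N = 10 documents.
n = {"x": 1, "y": 5}
rare_and_common = initial_score({"x", "y"}, {"x", "y"}, n, 10)
common_only = initial_score({"y"}, {"x", "y"}, n, 10)
```

With $P(k_i \mid R) = 0.5$ the first logarithm is zero, so the initial score reduces to a sum of $\log((N - n_i)/n_i)$ over the shared terms, which behaves much like an idf weight.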
34
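Improving the Initial Ranking

$$
sim(d_j, q) \sim \sum_{i=1}^{t} w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
$$

- Let
  - $V$: set of docs initially retrieved and ranked
  - $V_i$: subset of $V$ whose docs contain $k_i$
  - (in the formulas below, $V$ and $V_i$ also denote the sizes of these sets)
- Reevaluate the estimates:
  - $P(k_i \mid R) = V_i / V$ (approximate $P(k_i \mid R)$ by the distribution of $k_i$ among the docs in $V$)
  - $P(k_i \mid \bar{R}) = (n_i - V_i) / (N - V)$ (assume all the non-retrieved docs are irrelevant)
- Repeat recursively.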
35
Improving the Initial Ranking

$$
sim(d_j, q) \sim \sum_{i=1}^{t} w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
$$

- To avoid problems with small values such as $V = 1$ and $V_i = 0$:
  - $P(k_i \mid R) = (V_i + 0.5) / (V + 1)$
  - $P(k_i \mid \bar{R}) = (n_i - V_i + 0.5) / (N - V + 1)$
- Or (alternatively):
  - $P(k_i \mid R) = (V_i + n_i/N) / (V + 1)$
  - $P(k_i \mid \bar{R}) = (n_i - V_i + n_i/N) / (N - V + 1)$
  - ($n_i / N$ is an adjustment factor)
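The re-estimation step with the 0.5 adjustment can be sketched as below. Here `V` and `V_i` are the sizes of the retrieved set and of its subset containing $k_i$; the function names and the sample counts are assumptions for illustration.

```python
import math

def reestimate(V, V_i, n_i, N):
    """Smoothed estimates (P(k_i|R), P(k_i|not R)) from the top-ranked documents."""
    p_rel = (V_i + 0.5) / (V + 1)
    p_non = (n_i - V_i + 0.5) / (N - V + 1)
    return p_rel, p_non

def term_weight(p_rel, p_non):
    """Contribution of one shared term to sim(d_j, q)."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_non) / p_non)

# Hypothetical counts: 10 docs retrieved, 8 of them contain k_i,
# and k_i occurs in 20 of the 100 docs in the collection.
p_rel, p_non = reestimate(V=10, V_i=8, n_i=20, N=100)
weight = term_weight(p_rel, p_non)

# Even with V_i = 0 the smoothed estimate stays strictly between 0 and 1,
# so the logarithms remain defined.
p_rel_zero, _ = reestimate(V=1, V_i=0, n_i=20, N=100)
```

Each iteration would recompute the ranking with the new term weights, take the new top-ranked set as $V$, and call `reestimate` again.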
36
Pluses and Minuses
- Advantages:
  - Documents are ranked in decreasing order of their probability of relevance.
- Disadvantages:
  - Need to guess the initial estimates for $P(k_i \mid R)$.
  - The method does not take the tf and idf factors into account.
  - Assumes mutual independence of index terms (??); it is not clear that this is bad, though.
37
Brief Comparison of Classic Models
- The Boolean model does not provide for partial matches and is considered to be the weakest classic model.
- Salton and Buckley did a series of experiments that indicate that, in general, the vector (space) model outperforms the probabilistic model with general collections.
- This also seems to be the view of the research community.