1 VECTOR SPACE INFORMATION RETRIEVAL (Adrienn Skrop)

2 VIR
- The Vector (or vector space) model of IR (VIR) is a classical model of IR.
- It is traditionally called so because both the documents and the queries are conceived, for retrieval purposes, as strings of numbers, as though they were (mathematical) vectors.
- Retrieval is based on whether the 'query vector' and the 'document vector' are 'close enough'.

3 Notations
- Given a finite set D of elements called documents: D = {D_1, ..., D_j, ..., D_m}
- and a finite set T of elements called index terms: T = {t_1, ..., t_i, ..., t_n}.
- Any document D_j is assigned a vector v_j of finite real numbers, called weights, of length n, as follows: v_j = (w_ij)_{i=1,...,n} = (w_1j, ..., w_ij, ..., w_nj), where 0 ≤ w_ij ≤ 1 (i.e. w_ij is normalised, e.g. by division by the largest weight).
- The weight w_ij is interpreted as the extent to which the term t_i 'characterises' document D_j.
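As a quick illustration (not from the slides), a minimal sketch of this representation in Python, using the 'division by the largest' normalisation mentioned above; the raw counts for terms t_1..t_4 are made up:

```python
# Illustrative sketch: one document D_j as a weight vector v_j, normalised
# into [0, 1] by dividing each raw count by the largest one.
raw_counts = [3, 1, 0, 2]                 # f_ij for terms t_1..t_4 in D_j (made up)
largest = max(raw_counts)
v_j = [f / largest for f in raw_counts]   # each weight satisfies 0 <= w_ij <= 1
print(v_j)                                # [1.0, 0.333..., 0.0, 0.666...]
```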

4 Identifiers
- The choice of terms and weights is a difficult theoretical (e.g. linguistic, semantic) and practical problem, and several techniques can be used to cope with it.
- The VIR model assumes that the most obvious place where appropriate content identifiers might be found is the documents themselves.
- It is assumed that the frequencies of words in documents can give a meaningful indication of their content; hence they can be taken as identifiers.

5 Word significance factors
Steps of a simple automatic method to identify word significance factors:
1. Identify lexical units. Write a computer program to recognise words (word = any sequence of characters preceded and followed by 'space', 'dot' or 'comma').
2. Exclude stopwords from the documents, i.e. those words that are unlikely to bear any significance. For example: a, about, and, etc.
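A minimal sketch of steps 1-2 in Python. The tiny stoplist is an illustrative sample, and the tokeniser splits on all non-alphanumeric characters (not only space, dot and comma) so that apostrophes and colons are also removed, matching the worked example later in the slides:

```python
import re

# A tiny illustrative sample of a stoplist; a real one is much longer.
STOPWORDS = {"a", "about", "and", "the", "that", "in", "one", "should", "each", "has"}

def tokenize(text: str) -> list[str]:
    # The slides define words as runs of characters delimited by space, dot
    # or comma; splitting on every non-alphanumeric character also strips
    # apostrophes, colons and parentheses.
    return [w for w in re.split(r"\W+", text.lower()) if w]

def remove_stopwords(words: list[str]) -> list[str]:
    return [w for w in words if w not in STOPWORDS]

tokens = remove_stopwords(tokenize(
    "Bayes' Principle: The principle that, in estimating a parameter, ..."))
print(tokens)   # ['bayes', 'principle', 'principle', 'estimating', 'parameter']
```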

6 Word significance factors
3. Apply stemming to the remaining words, i.e. reduce or transform them to their linguistic roots. A widely used stemming algorithm is the well-known Porter's algorithm. Practically, the index terms are the stems.
4. Compute for each document D_j the number of occurrences f_ij of each term t_i in that document.
5. Calculate the total tf_i for each term t_i as follows: tf_i = Σ_{j=1}^{m} f_ij.
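A sketch of steps 3-5. Porter's algorithm itself is not reproduced here; the crude suffix-stripper below only stands in for it (a real system might use, say, NLTK's PorterStemmer):

```python
from collections import Counter

def crude_stem(word: str) -> str:
    # Stand-in for Porter's algorithm: strip a few common suffixes.
    for suffix in ("ating", "ing", "ation", "es", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_counts(doc_tokens: list[str]) -> Counter:
    # Step 4: f_ij, the number of occurrences of each stem t_i in document D_j.
    return Counter(crude_stem(w) for w in doc_tokens)

def total_tf(per_document_counts: list[Counter]) -> Counter:
    # Step 5: tf_i = sum over all documents j of f_ij.
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)
    return total
```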

7 Word significance factors
6. Rank the terms in decreasing order according to tf_i, and exclude the very high frequency terms, e.g. over some threshold (on the ground that they are almost always insignificant), as well as the very low frequency terms, e.g. below some threshold (on the ground that they are not much on the writer's mind).
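Step 6 as a sketch; both cut-off values are arbitrary illustrative choices, not from the slides:

```python
LOW, HIGH = 2, 50   # illustrative frequency thresholds

def select_index_terms(tf: dict[str, int]) -> list[str]:
    # Keep the mid-frequency terms and rank them in decreasing order of tf_i.
    kept = [(term, f) for term, f in tf.items() if LOW <= f <= HIGH]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in kept]
```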

8 Word significance factors
7. The remaining terms can be taken as document identifiers; they will be called index terms (or simply terms).
8. Compute for the index terms t_i a significance factor (or weight) w_ij with respect to each document D_j. Several methods are known.

9 Query object
An object Q_k, called a query, coming from a user, is also conceived as a (much smaller) document, and a vector v_k can be computed for it, too, in a similar way.

10 VIR
VIR is defined as follows: a document D_j is retrieved in response to a query Q_k if the document and the query are "similar enough", i.e. a similarity measure s_jk between the document (identified by v_j) and the query (identified by v_k) is over some threshold K:
s_jk = s(v_j, v_k) > K

11 Steps of VIR
- Define the query Q_k.
- Compute for the index terms t_i a significance factor (or weight) w_ik with respect to the query Q_k, in the same way as for the documents D_j.
- Compute the similarity values for the documents D_j.
- Give the hit list of the retrieved documents, in decreasing order of the similarity values.
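The retrieval loop of slides 10-11 as a sketch; retrieve() and its threshold parameter K are hypothetical names, and `similarity` can be any of the measures defined on the following slides:

```python
def retrieve(doc_vectors: dict[str, list[float]],
             query: list[float],
             similarity,
             K: float = 0.0) -> list[tuple[str, float]]:
    # Score every document against the query vector, keep those whose
    # similarity exceeds the threshold K, and return the hit list in
    # decreasing order of similarity.
    hits = [(name, similarity(v, query)) for name, v in doc_vectors.items()]
    hits = [(name, s) for name, s in hits if s > K]
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits
```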

12 Similarity measures
- Dot product (simple matching coefficient; inner product):
  s_jk = (v_j, v_k) = Σ_{i=1}^{n} w_ij · w_ik
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the simple matching coefficient is s_jk = |D_j ∩ Q_k|.

13 Similarity measures
- Cosine measure: c_jk
  s_jk = c_jk = (v_j, v_k) / (||v_j|| · ||v_k||)
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the Cosine measure is c_jk = |D_j ∩ Q_k| / (|D_j| · |Q_k|)^{1/2}.

14 Similarity measures
- Dice's coefficient: d_jk
  s_jk = d_jk = 2 · (v_j, v_k) / Σ_{i=1}^{n} (w_ij + w_ik)
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of Dice's coefficient is d_jk = 2 · |D_j ∩ Q_k| / (|D_j| + |Q_k|).

15 Similarity measures
- Jaccard's coefficient: J_jk
  s_jk = J_jk = (v_j, v_k) / (Σ_{i=1}^{n} (w_ij + w_ik) − (v_j, v_k))
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of Jaccard's coefficient is J_jk = |D_j ∩ Q_k| / |D_j ∪ Q_k|.

16 Similarity measures
- Overlap coefficient: O_jk
  s_jk = O_jk = (v_j, v_k) / min(Σ_{i=1}^{n} w_ij, Σ_{i=1}^{n} w_ik)
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the overlap coefficient is O_jk = |D_j ∩ Q_k| / min(|D_j|, |Q_k|).
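The five measures of slides 12-16 as one sketch, assuming non-negative weight vectors of equal length; the Jaccard form follows the reconstruction above:

```python
from math import sqrt

def dot(v, w):
    # Slide 12: simple matching coefficient / inner product.
    return sum(a * b for a, b in zip(v, w))

def cosine(v, w):
    # Slide 13: inner product divided by the product of vector lengths.
    return dot(v, w) / (sqrt(dot(v, v)) * sqrt(dot(w, w)))

def dice(v, w):
    # Slide 14: Dice's coefficient.
    return 2 * dot(v, w) / (sum(v) + sum(w))

def jaccard(v, w):
    # Slide 15: Jaccard's coefficient (reconstructed form).
    return dot(v, w) / (sum(v) + sum(w) - dot(v, w))

def overlap(v, w):
    # Slide 16: overlap coefficient.
    return dot(v, w) / min(sum(v), sum(w))
```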

17 Example
Let the set of documents be O = {O_1, O_2, O_3}, where
- O_1 = Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution).
- O_2 = Bayesian Decision Theory: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest Subjective Expected Utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision.

18 Example
- O_3 = Bayesian Epistemology: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability, and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power.

19 Example
Step 1. Identify lexical units.
Original text: Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution).
Output: Bayes principle the principle that in estimating a parameter one should initially assume that each possible value has equal probability a uniform prior distribution

20 Example
Step 2. Exclude stopwords from the document.
Stoplist: a aboard about above accordingly across actually add added after afterwards again against ago all allows almost alone along alongside …

21 Example
Step 3. Apply stemming.
Text before and after stemming:
Before: Bayes principle estimating parameter one initially assume possible value
After: Bayes principl estim paramet on initi assum possibl valu …

22 Example
Step 4. Compute for each document D_j the number of occurrences f_ij of each term t_i in that document.
E.g., for t_1 = Bayes: f_11 = 1, f_12 = 2, f_13 = 3.

23 Example
Step 5. Calculate the total tf_i for each term t_i.
E.g., the number of occurrences of term t_1 = Bayes in all the documents: tf_1 = 1 + 2 + 3 = 6.
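The same check in code, using the counts from step 4:

```python
# f_1j for t_1 = Bayes in the three documents.
f_1 = {"D1": 1, "D2": 2, "D3": 3}
tf_1 = sum(f_1.values())
print(tf_1)   # 6
```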

24 Example
Step 6. Rank the terms in decreasing order according to tf_i, and exclude the very high and very low frequency terms.

25 Example
Step 7. The remaining terms can be taken as index terms. Let the set T of index terms be (not stemmed here):
T = {t_1, t_2, t_3} = {t_1 = Bayes, t_2 = probability, t_3 = epistemology}.
Conceive the documents as sets of terms: D = {D_1, D_2, D_3}, where
D_1 = {(t_1 = Bayes); (t_2 = probability)}
D_2 = {(t_1 = Bayes); (t_2 = probability)}
D_3 = {(t_1 = Bayes); (t_2 = probability); (t_3 = epistemology)}

26 Example
Conceive the documents as sets of terms (together with their frequencies): D = {D_1, D_2, D_3}, where
D_1 = {(Bayes, 1); (probability, 1); (epistemology, 0)}
D_2 = {(Bayes, 2); (probability, 1); (epistemology, 0)}
D_3 = {(Bayes, 3); (probability, 3); (epistemology, 3)}

27 Documents in vector space (figure)

28 Example
Step 8. Compute for the index terms t_i a significance factor (or weight) w_ij with respect to each document D_j. Create the term-document matrix TD:
TD_{3×3} = (w_ij), where w_ij denotes the weight of term t_i in document D_j.

29 Example
Using the Frequency Weighting Method (w_ij = f_ij):

TD =
              D_1  D_2  D_3
Bayes          1    2    3
probability    1    1    3
epistemology   0    0    3

Using the Term Frequency Normalized Weighting Method (each document column divided by its Euclidean length):

TD ≈
              D_1    D_2    D_3
Bayes         0.707  0.894  0.577
probability   0.707  0.447  0.577
epistemology  0      0      0.577
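The two matrices in code, starting from the counts of slide 26. Reading 'term frequency normalised' as division of each document column by its Euclidean length is an assumption, but it reproduces the similarity values reported on slide 35:

```python
from math import sqrt

counts = {            # columns of TD: f_ij per document (slide 26)
    "D1": [1, 1, 0],  # rows: Bayes, probability, epistemology
    "D2": [2, 1, 0],
    "D3": [3, 3, 3],
}

def tfn(column: list[float]) -> list[float]:
    # Divide a document column by its Euclidean length.
    length = sqrt(sum(f * f for f in column))
    return [f / length for f in column]

TD_freq = counts
TD_tfn = {d: tfn(col) for d, col in counts.items()}
# TD_tfn: D1 -> [0.707, 0.707, 0], D2 -> [0.894, 0.447, 0],
#         D3 -> [0.577, 0.577, 0.577]
```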

30 Example
Step 9. Define a query. Let the query Q be:
Q = {What is Bayesian epistemology?}
As a set of terms: Q = {(t_1 = Bayes); (t_3 = epistemology)}
Notice that, in VIR, the query is no longer a Boolean expression (as it is in BIR).

31 Example
Step 10. Compute for the index terms t_i a significance factor (or weight) w_ik with respect to the query Q_k. Using the Frequency or the Binary Weighting Method:
Q_k = (1, 0, 1)

32 Query in vector space (figure)

33 Example
Step 11. Compute the similarity values, and give the ranking order.
Using the Term Frequency Normalized (tfn) Weighting Method, with the query normalised the same way:
Q ≈ (0.707, 0, 0.707)
TD ≈
              D_1    D_2    D_3
Bayes         0.707  0.894  0.577
probability   0.707  0.447  0.577
epistemology  0      0      0.577


35 Example
- Dot product: Dot_1 ≈ 0.5, Dot_2 ≈ 0.632, Dot_3 ≈ 0.816
- Cosine measure: Cosine_1 ≈ 0.5, Cosine_2 ≈ 0.632, Cosine_3 ≈ 0.816
- Dice measure: Dice_1 ≈ 0.354, Dice_2 ≈ 0.459, Dice_3 ≈ 0.519
- Jaccard measure: Jaccard_1 ≈ 0.21, Jaccard_2 ≈ 0.29, Jaccard_3 ≈ 0.32
- The ranking order: D_3, D_2, D_1.
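Putting the example together, reusing dot/cosine/dice/jaccard from the sketch after slide 16 and tfn/TD_tfn from the one after slide 29. Note that the reconstructed Jaccard gives 0.35 for D_3 rather than the 0.32 printed above, which suggests the original slide used a slightly different variant; the ranking D_3, D_2, D_1 is unchanged:

```python
q = tfn([1, 0, 1])                 # the query vector (1, 0, 1), tfn-normalised
for name in ("D1", "D2", "D3"):
    v = TD_tfn[name]
    print(name,
          round(dot(v, q), 3), round(cosine(v, q), 3),
          round(dice(v, q), 3), round(jaccard(v, q), 3))
# D1 0.5   0.5   0.354 0.215
# D2 0.632 0.632 0.459 0.298
# D3 0.816 0.816 0.519 0.35   -> ranking: D3, D2, D1
```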

36 Conclusion
- Both the query and the document are represented as vectors of real numbers.
- A similarity measure is meant to express a likeness between a query and a document.
- Those documents are said to be retrieved in response to a query for which the similarity measure exceeds some threshold.

