1 VECTOR SPACE INFORMATION RETRIEVAL (Adrienn Skrop)

2 VIR
- The Vector (or vector space) model of IR (VIR) is a classical model of IR.
- It is traditionally called so because both the documents and the queries are conceived, for retrieval purposes, as strings of numbers, as though they were (mathematical) vectors.
- Retrieval is based on whether the 'query vector' and the 'document vector' are 'close enough'.

3 Notations
- Given a finite set D of elements called documents: D = {D_1, ..., D_j, ..., D_m}
- and a finite set T of elements called index terms: T = {t_1, ..., t_i, ..., t_n}.
- Any document D_j is assigned a vector v_j of finite real numbers, called weights, of length n, as follows: v_j = (w_ij)_{i=1,...,n} = (w_1j, ..., w_ij, ..., w_nj), where 0 ≤ w_ij ≤ 1 (i.e. w_ij is normalised, e.g. by division by the largest weight).
- The weight w_ij is interpreted as the extent to which the term t_i 'characterises' document D_j.
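As a quick illustration (not from the slides), a minimal sketch of this representation in Python, using the 'division by the largest' normalisation mentioned above; the raw counts for terms t_1..t_4 are made up:

```python
# Illustrative sketch: one document D_j as a weight vector v_j, normalised
# into [0, 1] by dividing each raw count by the largest one.
raw_counts = [3, 1, 0, 2]                 # f_ij for terms t_1..t_4 in D_j (made up)
largest = max(raw_counts)
v_j = [f / largest for f in raw_counts]   # each weight satisfies 0 <= w_ij <= 1
print(v_j)                                # [1.0, 0.333..., 0.0, 0.666...]
```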

4 Identifiers
- The choice of terms and weights is a difficult theoretical (e.g. linguistic, semantic) and practical problem, and several techniques can be used to cope with it.
- The VIR model assumes that the most obvious place where appropriate content identifiers might be found is the documents themselves.
- It is assumed that the frequencies of words in documents can give a meaningful indication of their content; hence they can be taken as identifiers.

5 Word significance factors
Steps of a simple automatic method to identify word significance factors:
1. Identify lexical units. Write a computer program to recognise words (word = any sequence of characters preceded and followed by 'space', 'dot' or 'comma').
2. Exclude stopwords from the documents, i.e. those words that are unlikely to bear any significance. For example: a, about, and, etc.
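A minimal sketch of steps 1-2 in Python. The tiny stoplist is an illustrative sample, and the tokeniser splits on all non-alphanumeric characters (not only space, dot and comma) so that apostrophes and colons are also removed, matching the worked example later in the slides:

```python
import re

# A tiny illustrative sample of a stoplist; a real one is much longer.
STOPWORDS = {"a", "about", "and", "the", "that", "in", "one", "should", "each", "has"}

def tokenize(text: str) -> list[str]:
    # The slides define words as runs of characters delimited by space, dot
    # or comma; splitting on every non-alphanumeric character also strips
    # apostrophes, colons and parentheses.
    return [w for w in re.split(r"\W+", text.lower()) if w]

def remove_stopwords(words: list[str]) -> list[str]:
    return [w for w in words if w not in STOPWORDS]

tokens = remove_stopwords(tokenize(
    "Bayes' Principle: The principle that, in estimating a parameter, ..."))
print(tokens)   # ['bayes', 'principle', 'principle', 'estimating', 'parameter']
```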

6 Word significance factors
3. Apply stemming to the remaining words, i.e. reduce or transform them to their linguistic roots. A widely used stemming algorithm is the well-known Porter's algorithm. Practically, the index terms are the stems.
4. Compute for each document D_j the number of occurrences f_ij of each term t_i in that document.
5. Calculate the total tf_i for each term t_i as follows: tf_i = Σ_{j=1}^{m} f_ij.
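A sketch of steps 3-5. Porter's algorithm itself is not reproduced here; the crude suffix-stripper below only stands in for it (a real system might use, say, NLTK's PorterStemmer):

```python
from collections import Counter

def crude_stem(word: str) -> str:
    # Stand-in for Porter's algorithm: strip a few common suffixes.
    for suffix in ("ating", "ing", "ation", "es", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_counts(doc_tokens: list[str]) -> Counter:
    # Step 4: f_ij, the number of occurrences of each stem t_i in document D_j.
    return Counter(crude_stem(w) for w in doc_tokens)

def total_tf(per_document_counts: list[Counter]) -> Counter:
    # Step 5: tf_i = sum over all documents j of f_ij.
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)
    return total
```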

7 Word significance factors
6. Rank the terms in decreasing order according to tf_i, and exclude the very high frequency terms, e.g. over some threshold (on the ground that they are almost always insignificant), as well as the very low frequency terms, e.g. below some threshold (on the ground that they are not much on the writer's mind).
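Step 6 as a sketch; both cut-off values are arbitrary illustrative choices, not from the slides:

```python
LOW, HIGH = 2, 50   # illustrative frequency thresholds

def select_index_terms(tf: dict[str, int]) -> list[str]:
    # Keep the mid-frequency terms and rank them in decreasing order of tf_i.
    kept = [(term, f) for term, f in tf.items() if LOW <= f <= HIGH]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in kept]
```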

8 Word significance factors
7. The remaining terms can be taken as document identifiers; they will be called index terms (or simply terms).
8. Compute for the index terms t_i a significance factor (or weight) w_ij with respect to each document D_j. Several methods are known.

9 Query object
An object Q_k, called a query, coming from a user, is also conceived as a (much smaller) document, and a vector v_k can be computed for it, too, in a similar way.

10 VIR
VIR is defined as follows: a document D_j is retrieved in response to a query Q_k if the document and the query are "similar enough", i.e. a similarity measure s_jk between the document (identified by v_j) and the query (identified by v_k) is over some threshold K:
s_jk = s(v_j, v_k) > K

11 Steps of VIR
- Define the query Q_k.
- Compute for the index terms t_i a significance factor (or weight) w_ik with respect to the query Q_k, in the same way as for the documents D_j.
- Compute the similarity values for the documents D_j.
- Give the hit list of the retrieved documents, in decreasing order of the similarity values.
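The retrieval loop of slides 10-11 as a sketch; retrieve() and its threshold parameter K are hypothetical names, and `similarity` can be any of the measures defined on the following slides:

```python
def retrieve(doc_vectors: dict[str, list[float]],
             query: list[float],
             similarity,
             K: float = 0.0) -> list[tuple[str, float]]:
    # Score every document against the query vector, keep those whose
    # similarity exceeds the threshold K, and return the hit list in
    # decreasing order of similarity.
    hits = [(name, similarity(v, query)) for name, v in doc_vectors.items()]
    hits = [(name, s) for name, s in hits if s > K]
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits
```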

12 Similarity measures
- Dot product (simple matching coefficient; inner product):
  s_jk = (v_j, v_k) = Σ_{i=1}^{n} w_ij · w_ik
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the simple matching coefficient is s_jk = |D_j ∩ Q_k|.

13 Similarity measures
- Cosine measure: c_jk
  s_jk = c_jk = (v_j, v_k) / (||v_j|| · ||v_k||)
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the Cosine measure is c_jk = |D_j ∩ Q_k| / (|D_j| · |Q_k|)^{1/2}.

14 Similarity measures
- Dice's coefficient: d_jk
  s_jk = d_jk = 2 · (v_j, v_k) / Σ_{i=1}^{n} (w_ij + w_ik)
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of Dice's coefficient is d_jk = 2 · |D_j ∩ Q_k| / (|D_j| + |Q_k|).

15 Similarity measures
- Jaccard's coefficient: J_jk
  s_jk = J_jk = (v_j, v_k) / (Σ_{i=1}^{n} (w_ij + w_ik) − (v_j, v_k))
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of Jaccard's coefficient is J_jk = |D_j ∩ Q_k| / |D_j ∪ Q_k|.

16 Similarity measures
- Overlap coefficient: O_jk
  s_jk = O_jk = (v_j, v_k) / min(Σ_{i=1}^{n} w_ij, Σ_{i=1}^{n} w_ik)
- If D_j and Q_k are conceived as sets of terms, the set-theoretic counterpart of the overlap coefficient is O_jk = |D_j ∩ Q_k| / min(|D_j|, |Q_k|).
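The five measures of slides 12-16 as one sketch, assuming non-negative weight vectors of equal length; the Jaccard form follows the reconstruction above:

```python
from math import sqrt

def dot(v, w):
    # Slide 12: simple matching coefficient / inner product.
    return sum(a * b for a, b in zip(v, w))

def cosine(v, w):
    # Slide 13: inner product divided by the product of vector lengths.
    return dot(v, w) / (sqrt(dot(v, v)) * sqrt(dot(w, w)))

def dice(v, w):
    # Slide 14: Dice's coefficient.
    return 2 * dot(v, w) / (sum(v) + sum(w))

def jaccard(v, w):
    # Slide 15: Jaccard's coefficient (reconstructed form).
    return dot(v, w) / (sum(v) + sum(w) - dot(v, w))

def overlap(v, w):
    # Slide 16: overlap coefficient.
    return dot(v, w) / min(sum(v), sum(w))
```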

17 Example
Let the set of documents be O = {O_1, O_2, O_3}, where
- O_1 = Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution).
- O_2 = Bayesian Decision Theory: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest Subjective Expected Utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision.

18 Example
- O_3 = Bayesian Epistemology: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability, and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power.

19 Example
Step 1. Identify lexical units.
Original text: Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution).
Output: Bayes principle the principle that in estimating a parameter one should initially assume that each possible value has equal probability a uniform prior distribution

20 Example
Step 2. Exclude stopwords from the document.
Stoplist: a aboard about above accordingly across actually add added after afterwards again against ago all allows almost alone along alongside …

21 Example
Step 3. Apply stemming.
Text before and after stemming:
Before: Bayes principle estimating parameter one initially assume possible value
After: Bayes principl estim paramet on initi assum possibl valu …

22 Example
Step 4. Compute for each document D_j the number of occurrences f_ij of each term t_i in that document.
E.g., for t_1 = Bayes: f_11 = 1, f_12 = 2, f_13 = 3.

23 Example
Step 5. Calculate the total tf_i for each term t_i.
E.g., the number of occurrences of term t_1 = Bayes in all the documents: tf_1 = 1 + 2 + 3 = 6.
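The same check in code, using the counts from step 4:

```python
# f_1j for t_1 = Bayes in the three documents.
f_1 = {"D1": 1, "D2": 2, "D3": 3}
tf_1 = sum(f_1.values())
print(tf_1)   # 6
```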

24 Example
Step 6. Rank the terms in decreasing order according to tf_i, and exclude the very high and very low frequency terms.

25 Example
Step 7. The remaining terms can be taken as index terms. Let the set T of index terms be (not stemmed here):
T = {t_1, t_2, t_3} = {t_1 = Bayes, t_2 = probability, t_3 = epistemology}.
Conceive the documents as sets of terms: D = {D_1, D_2, D_3}, where
D_1 = {(t_1 = Bayes); (t_2 = probability)}
D_2 = {(t_1 = Bayes); (t_2 = probability)}
D_3 = {(t_1 = Bayes); (t_2 = probability); (t_3 = epistemology)}

26 Example
Conceive the documents as sets of terms (together with their frequencies): D = {D_1, D_2, D_3}, where
D_1 = {(Bayes, 1); (probability, 1); (epistemology, 0)}
D_2 = {(Bayes, 2); (probability, 1); (epistemology, 0)}
D_3 = {(Bayes, 3); (probability, 3); (epistemology, 3)}

27 Documents in vector space (figure)

28 Example
Step 8. Compute for the index terms t_i a significance factor (or weight) w_ij with respect to each document D_j. Create the term-document matrix TD:
TD_{3×3} = (w_ij), where w_ij denotes the weight of term t_i in document D_j.

29 Example
Using the Frequency Weighting Method (w_ij = f_ij):

TD =
              D_1  D_2  D_3
Bayes          1    2    3
probability    1    1    3
epistemology   0    0    3

Using the Term Frequency Normalized Weighting Method (each document column divided by its Euclidean length):

TD ≈
              D_1    D_2    D_3
Bayes         0.707  0.894  0.577
probability   0.707  0.447  0.577
epistemology  0      0      0.577
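The two matrices in code, starting from the counts of slide 26. Reading 'term frequency normalised' as division of each document column by its Euclidean length is an assumption, but it reproduces the similarity values reported on slide 35:

```python
from math import sqrt

counts = {            # columns of TD: f_ij per document (slide 26)
    "D1": [1, 1, 0],  # rows: Bayes, probability, epistemology
    "D2": [2, 1, 0],
    "D3": [3, 3, 3],
}

def tfn(column: list[float]) -> list[float]:
    # Divide a document column by its Euclidean length.
    length = sqrt(sum(f * f for f in column))
    return [f / length for f in column]

TD_freq = counts
TD_tfn = {d: tfn(col) for d, col in counts.items()}
# TD_tfn: D1 -> [0.707, 0.707, 0], D2 -> [0.894, 0.447, 0],
#         D3 -> [0.577, 0.577, 0.577]
```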

30 Example
Step 9. Define a query. Let the query Q be:
Q = {What is Bayesian epistemology?}
As a set of terms: Q = {(t_1 = Bayes); (t_3 = epistemology)}
Notice that, in VIR, the query is no longer a Boolean expression (as it is in BIR).

31 Example
Step 10. Compute for the index terms t_i a significance factor (or weight) w_ik with respect to the query Q_k. Using the Frequency or the Binary Weighting Method:
Q_k = (1, 0, 1)

32 Query in vector space (figure)

33 Example
Step 11. Compute the similarity values, and give the ranking order.
Using the Term Frequency Normalized (tfn) Weighting Method, with the query normalised the same way:
Q ≈ (0.707, 0, 0.707)
TD ≈
              D_1    D_2    D_3
Bayes         0.707  0.894  0.577
probability   0.707  0.447  0.577
epistemology  0      0      0.577


35 Example
- Dot product: Dot_1 ≈ 0.5, Dot_2 ≈ 0.632, Dot_3 ≈ 0.816
- Cosine measure: Cosine_1 ≈ 0.5, Cosine_2 ≈ 0.632, Cosine_3 ≈ 0.816
- Dice measure: Dice_1 ≈ 0.354, Dice_2 ≈ 0.459, Dice_3 ≈ 0.519
- Jaccard measure: Jaccard_1 ≈ 0.21, Jaccard_2 ≈ 0.29, Jaccard_3 ≈ 0.32
- The ranking order: D_3, D_2, D_1.
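Putting the example together, reusing dot/cosine/dice/jaccard from the sketch after slide 16 and tfn/TD_tfn from the one after slide 29. Note that the reconstructed Jaccard gives 0.35 for D_3 rather than the 0.32 printed above, which suggests the original slide used a slightly different variant; the ranking D_3, D_2, D_1 is unchanged:

```python
q = tfn([1, 0, 1])                 # the query vector (1, 0, 1), tfn-normalised
for name in ("D1", "D2", "D3"):
    v = TD_tfn[name]
    print(name,
          round(dot(v, q), 3), round(cosine(v, q), 3),
          round(dice(v, q), 3), round(jaccard(v, q), 3))
# D1 0.5   0.5   0.354 0.215
# D2 0.632 0.632 0.459 0.298
# D3 0.816 0.816 0.519 0.35   -> ranking: D3, D2, D1
```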

36 Conclusion
- Both the query and the document are represented as vectors of real numbers.
- A similarity measure is meant to express a likeness between a query and a document.
- Those documents are said to be retrieved in response to a query for which the similarity measure exceeds some threshold.

