IR 6 Scoring, term weighting and the vector space model.

Presentation on theme: "IR 6 Scoring, term weighting and the vector space model."— Presentation transcript:

1 IR 6 Scoring, term weighting and the vector space model

2

3 Term frequency and weighting ● Assign to each term in a document a weight that depends on the number of occurrences of the term in the document. ● TERM FREQUENCY (tf_{t,d}): the number of occurrences of term t in document d. ● BAG OF WORDS: the order of terms in the document is ignored; only the number of occurrences of each term matters.
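
As a minimal sketch of term frequency under the bag-of-words view, the snippet below counts terms with Python's standard `collections.Counter`. The function name `term_frequencies` and the lowercase, whitespace-based tokenization are assumptions made for illustration; the slides do not specify a tokenizer.

```python
from collections import Counter


def term_frequencies(text):
    """Return a bag of words: a mapping from term t to tf_{t,d}."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return Counter(text.lower().split())


# Word order is discarded, so these two documents get the same representation.
bag_a = term_frequencies("Mary is quicker than John")
bag_b = term_frequencies("John is quicker than Mary")
print(bag_a == bag_b)
```

Because only the counts survive, two documents with the same words in a different order are indistinguishable in this representation.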

4 Inverse document frequency ● Collection frequency (cf_t): the total number of occurrences of term t in the collection. ● Document frequency (df_t): the number of documents in the collection that contain term t. ● N: the total number of documents in the collection. ● Inverse document frequency: idf_t = log(N / df_t), so a rare term has a high idf and a frequent term has a low one.
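
The difference between collection frequency and document frequency can be sketched as follows, using the usual definition idf_t = log(N / df_t). The helper names `collection_stats` and `idf`, the base-10 logarithm, and the toy documents are assumptions for illustration only.

```python
import math
from collections import Counter


def collection_stats(docs):
    """docs: list of token lists. Returns (cf, df, N)."""
    cf, df = Counter(), Counter()
    for tokens in docs:
        cf.update(tokens)       # cf counts every occurrence of a term
        df.update(set(tokens))  # df counts each document at most once per term
    return cf, df, len(docs)


def idf(term, df, n_docs):
    # Assumption: base-10 log, and the term occurs in at least one document.
    return math.log10(n_docs / df[term])


docs = [["the", "try", "try"], ["the", "try"], ["the", "insurance"], ["the"]]
cf, df, n_docs = collection_stats(docs)
print(cf["try"], df["try"])         # occurrences vs. documents containing "try"
print(idf("insurance", df, n_docs)) # rare term: high idf
print(idf("the", df, n_docs))       # term in every document: idf of zero
```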

5 Tf-idf weighting ● tf-idf_{t,d} = tf_{t,d} × idf_t ● This assigns to term t a weight in document d that is 1. highest when t occurs many times within a small number of documents; 2. lower when the term occurs fewer times in a document, or occurs in many documents; 3. lowest when the term occurs in virtually all documents.
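
The three cases above can be checked numerically with a short sketch. The function name `tf_idf`, the base-10 logarithm, and the counts used (a collection of 1000 documents) are hypothetical values chosen for illustration.

```python
import math


def tf_idf(tf, df, n_docs):
    """tf-idf weight: term frequency times log10(N / df)."""
    # Assumption: df >= 1, i.e. the term occurs somewhere in the collection.
    return tf * math.log10(n_docs / df)


n_docs = 1000  # hypothetical collection size
many_times_in_few_docs = tf_idf(tf=10, df=2, n_docs=n_docs)
few_times_in_few_docs = tf_idf(tf=1, df=2, n_docs=n_docs)
many_times_in_many_docs = tf_idf(tf=10, df=500, n_docs=n_docs)
in_all_docs = tf_idf(tf=10, df=1000, n_docs=n_docs)

print(many_times_in_few_docs, few_times_in_few_docs,
      many_times_in_many_docs, in_all_docs)
```

The first weight is the largest, the second and third are smaller, and the last is zero because log10(N / N) = 0.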

6 DOCUMENT VECTOR ● View each document as a vector with one component for each term in the dictionary, where the weight of each component is given by tf-idf. ● For dictionary terms that do not occur in a document, this weight is zero.
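
A minimal sketch of turning a small collection into document vectors is shown below. The function name `build_vectors`, the alphabetical component order, the base-10 logarithm, and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def build_vectors(docs):
    """docs: list of token lists. Returns (dictionary, tf-idf vectors)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    dictionary = sorted(df)  # fixes one vector component per dictionary term
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # tf[t] is 0 for terms absent from this document
        vectors.append([tf[t] * math.log10(n_docs / df[t]) for t in dictionary])
    return dictionary, vectors


docs = [["car", "car", "auto"], ["auto", "best"], ["car", "insurance"]]
dictionary, vectors = build_vectors(docs)
print(dictionary)
print(vectors[0])  # components for "best" and "insurance" are zero
```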

7 Overlap score measure ● The score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d.
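
The overlap score can be sketched in a few lines. The function name `overlap_score` and the example query and document are hypothetical.

```python
from collections import Counter


def overlap_score(query_terms, doc_tokens):
    """Sum, over all query terms, of each term's number of occurrences in d."""
    tf = Counter(doc_tokens)
    return sum(tf[t] for t in query_terms)  # absent terms contribute 0


doc = ["car", "car", "insurance", "best"]
print(overlap_score(["best", "car"], doc))
```

Every occurrence counts equally here, regardless of how informative the term is, which is the limitation that tf-idf weights address.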

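8 VECTOR SPACE MODEL ● The representation of a set of documents as vectors in a common vector space. ● Fundamental to operations ranging from – scoring documents on a query – document classification – document clustering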

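9 COSINE SIMILARITY ● sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|) ● Dot product: x · y = Σ_i x_i y_i ● Euclidean length: |V(d)| = sqrt(Σ_i V_i(d)^2) ● Normalizing length: v(d) = V(d) / |V(d)| is a unit vector, so sim(d1, d2) = v(d1) · v(d2)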
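
The three ingredients named on this slide (dot product, Euclidean length, length normalization) combine into cosine similarity as in the sketch below. The function names are illustrative, and the code assumes non-zero vectors of equal length.

```python
import math


def dot(x, y):
    """Dot product: sum of component-wise products."""
    return sum(a * b for a, b in zip(x, y))


def euclidean_length(x):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(dot(x, x))


def normalize(x):
    """Scale x to unit length. Assumes x is not the zero vector."""
    length = euclidean_length(x)
    return [a / length for a in x]


def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean lengths."""
    return dot(x, y) / (euclidean_length(x) * euclidean_length(y))


print(cosine_similarity([1, 2], [2, 4]))  # same direction, different length
print(cosine_similarity([1, 0], [0, 1]))  # no terms in common
```

Dividing by the lengths removes the effect of document length, so a document and a copy of it repeated twice point in the same direction and are treated as maximally similar.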

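10 Queries as vectors ● View the query as a vector in the same space as the documents. ● Assign to each document d a score equal to the dot product v(q) · v(d) of the unit query vector and the unit document vector.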
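
Scoring documents against a query can be sketched as below: both are length-normalized and each document's score is the dot product of the two unit vectors. The names `unit` and `rank`, the document identifiers, and the toy vectors are hypothetical; leaving an all-zero vector unchanged is a choice made here to avoid division by zero.

```python
import math


def unit(vec):
    """Return vec scaled to unit length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(a * a for a in vec))
    return [a / length for a in vec] if length else list(vec)


def rank(query_vec, doc_vecs):
    """doc_vecs: mapping doc id -> vector. Returns (doc id, score), best first."""
    q = unit(query_vec)
    scores = {
        doc_id: sum(a * b for a, b in zip(q, unit(vec)))
        for doc_id, vec in doc_vecs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


doc_vecs = {"d1": [1, 0, 0], "d2": [1, 1, 0], "d3": [0, 0, 1]}
for doc_id, score in rank([1, 0, 0], doc_vecs):
    print(doc_id, round(score, 3))
```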

