TF/IDF Ranking

Vector space model
Documents are also treated as a "bag" of words or terms.
– Each document is represented as a vector.
Term Frequency (TF) scheme:
– The weight of a term $t_i$ in document $d_j$ is the number of times that $t_i$ appears in $d_j$, denoted by $f_{ij}$.
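The raw TF scheme can be sketched in a few lines of Python. This is only an illustration: the tokenization (lowercasing and splitting on whitespace) and the function name `term_frequencies` are assumptions, not something the slides specify.

```python
from collections import Counter

def term_frequencies(text):
    """Return {term: count} for one document (the raw TF weights)."""
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    return Counter(text.lower().split())

f_j = term_frequencies("the flight left after the other flight")
print(f_j["flight"])  # 2
print(f_j["pilot"])   # 0 (Counter returns 0 for absent terms)
```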

Why not just frequency?
A shortcoming of the TF scheme is that it does not consider the situation where a term appears in many documents of the collection.
– E.g. "flight" in a document collection about airplanes.
– Such a term may not be discriminative.

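TF-IDF term weighting scheme
The most well-known weighting scheme.
– TF: (normalized) term frequency
  $tf_{ij} = \frac{f_{ij}}{maxf_j}$, where $maxf_j$ is the largest count of any single term in $d_j$.
– IDF: inverse document frequency. Penalizes terms (words) that occur too often in the document collection.
  $idf_i = \log \frac{N}{df_i}$, where $N$ is the total number of docs and $df_i$ is the number of docs in which $t_i$ appears.
The final TF-IDF term weight is:
  $w_{ij} = tf_{ij} \times idf_i$
Each document will be a vector of such numbers.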
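A minimal sketch of the weighting scheme, assuming tf is the term count divided by the largest count in the document and idf is the base-10 logarithm of N/df (the base is an assumption chosen to match the IDF example later in the deck). The function name `tf_idf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_f = max(counts.values())  # largest term count in this document
        vectors.append({
            t: (f / max_f) * math.log10(n_docs / df[t])
            for t, f in counts.items()
        })
    return vectors

docs = [["flight", "delay", "flight"], ["flight", "airport"], ["flight", "pilot"]]
vecs = tf_idf_vectors(docs)
print(vecs[0]["flight"])  # 0.0: "flight" occurs in every document
print(vecs[0]["delay"])   # positive: "delay" occurs in only one document
```

A term that appears in every document gets weight zero regardless of how often it occurs, which is exactly the penalty the IDF factor is meant to apply.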

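Retrieval in the vector space model
Query $q$ is represented in the same way as a document.
The weight $w_{iq}$ of each term $t_i$ in $q$ can also be computed in the same way as in a document.
Relevance of $d_j$ to $q$: compare the similarity of query $q$ and document $d_j$.
For this, use cosine similarity (the cosine of the angle between the two vectors).
– The bigger the cosine, the smaller the angle and the higher the similarity.
  $\cos(q, d_j) = \frac{q \cdot d_j}{|q|\,|d_j|} = \frac{q}{|q|} \cdot \frac{d_j}{|d_j|}$
– Dot product: $q \cdot d_j = \sum_i w_{iq}\, w_{ij}$
– Unit vectors: $q / |q|$ and $d_j / |d_j|$, with $|q| = \sqrt{\sum_{i=1}^{h} w_{iq}^2}$ and $|d_j| = \sqrt{\sum_{i=1}^{k} w_{ij}^2}$
where $h$ is the number of words (terms) in $q$, and $k$ is the number of words (terms) in $d_j$.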
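Cosine similarity over sparse vectors can be sketched as follows. Representing each vector as a `{term: weight}` dict and returning 0.0 for an all-zero vector are choices made for this illustration, not something the slides prescribe.

```python
import math

def cosine(q, d):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_q * norm_d)

print(cosine({"a": 1.0}, {"a": 3.0}))  # 1.0: same direction, different length
print(cosine({"a": 1.0}, {"b": 2.0}))  # 0.0: no shared terms
```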

Cosine similarity - Example in 2d

Document Frequency
Suppose the query is: calpurnia animal
term        df_t
calpurnia   1
animal      100
sunday      1,000
fly         10,000
under       100,000
the         1,000,000

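IDF
term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0
These values are consistent with $idf_t = \log_{10}(N / df_t)$ for $N = 1{,}000{,}000$.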
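The idf column can be reproduced with a short check. The collection size N = 1,000,000 and the base-10 logarithm are assumptions inferred from the table values (the rarest term scores 6 and the term with df = 1,000,000 scores 0).

```python
import math

N = 1_000_000  # assumed collection size, inferred from the table
df = {
    "calpurnia": 1,
    "animal": 100,
    "sunday": 1_000,
    "fly": 10_000,
    "under": 100_000,
    "the": 1_000_000,
}

# idf_t = log10(N / df_t): rare terms score high, ubiquitous terms score 0.
idf = {term: math.log10(N / n) for term, n in df.items()}

for term, value in idf.items():
    print(f"{term:<10} {df[term]:>9,} {value:.0f}")
```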

Computing the Cosine Similarity in Practice
Only the terms mentioned by the query matter in $q \cdot d$.
$|d|$ can be computed offline for each document and stored in a document table (docid, $|d|$).
– E.g. d = "I like the red apple." Suppose the idf's are: I: 1, like: 2, red: 5, apple: 10. Since the tf's in this example are 1 for each word, $|d| = \sqrt{1^2 + 2^2 + 5^2 + 10^2} = 11.4$
$|q|$ is easily computed online.
– E.g. if q = "red apple", $|q| = \sqrt{5^2 + 10^2} = 11.18$
Score for this document is: $q \cdot d / (|q|\,|d|) = (5 \cdot 5 + 10 \cdot 10) / (11.4 \cdot 11.18) = 0.98$
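The arithmetic in this example can be checked directly. The sketch uses only the four idf values given on the slide; "the" is left out because the slide assigns it no idf, and every tf is taken to be 1 as stated.

```python
import math

# idf values from the slide; "the" is omitted because no idf is given for it.
idf = {"i": 1, "like": 2, "red": 5, "apple": 10}

d = {t: 1 * w for t, w in idf.items()}  # tf = 1 for every word in d
q = {"red": idf["red"], "apple": idf["apple"]}

norm_d = math.sqrt(sum(w * w for w in d.values()))  # computed offline in practice
norm_q = math.sqrt(sum(w * w for w in q.values()))  # computed at query time
dot = sum(w * d.get(t, 0) for t, w in q.items())    # only query terms contribute

score = dot / (norm_q * norm_d)
print(round(norm_d, 2), round(norm_q, 2), round(score, 2))  # 11.4 11.18 0.98
```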

Computing the Cosine Similarity in Practice (2)
We store the $maxf_j$'s in the document table, which will now have (docid, $|d_j|$, $maxf_j$) for each document $d_j$.
We store the idf's in a word table, which will have ($t_i$, $idf_i$) for each word $t_i$.
– This can be implemented by using a HashMap of word-doc_counter pairs.
– After building this HashMap, iterate over it and insert pairs (word, log(N/doc_counter)) into the word table.
– Note: idf's are retrieved for each word mentioned in the query. These are the only words that matter in computing the dot product in the numerator of the cosine similarity formula.
To compute $tf_{ij} = f_{ij} / maxf_j$, we retrieve
– $f_{ij}$ from the inverted index
– $maxf_j$ from the document table
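The three structures described above (inverted index, document table, word table) might be built and queried roughly as follows. This is a sketch under several assumptions: in-memory dicts stand in for the tables, idf uses a base-10 logarithm, every document is non-empty, and each query term has tf = 1 so its weight is just its idf. The names `build_tables` and `score` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_tables(docs):
    """docs: {docid: non-empty token list}. Returns (inverted_index, doc_table, word_table)."""
    n_docs = len(docs)
    inverted_index = defaultdict(dict)  # term -> {docid: f_ij}
    doc_counter = Counter()             # term -> number of docs containing it
    counts_by_doc = {}
    for docid, tokens in docs.items():
        counts = Counter(tokens)
        counts_by_doc[docid] = counts
        for term, f in counts.items():
            inverted_index[term][docid] = f
            doc_counter[term] += 1
    # Word table: (t_i, idf_i), built by iterating over the doc counters.
    word_table = {t: math.log10(n_docs / c) for t, c in doc_counter.items()}
    # Document table: docid -> (|d_j|, maxf_j), computed offline.
    doc_table = {}
    for docid, counts in counts_by_doc.items():
        max_f = max(counts.values())
        norm = math.sqrt(sum(((f / max_f) * word_table[t]) ** 2
                             for t, f in counts.items()))
        doc_table[docid] = (norm, max_f)
    return inverted_index, doc_table, word_table

def score(query_terms, docid, inverted_index, doc_table, word_table):
    """Cosine score of one document; only query terms enter the dot product."""
    norm_d, max_f = doc_table[docid]
    # Assumption: tf = 1 for each query term, so w_iq = idf_i.
    q = {t: word_table[t] for t in set(query_terms) if t in word_table}
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    dot = sum(w_q * (inverted_index[t].get(docid, 0) / max_f) * word_table[t]
              for t, w_q in q.items())
    return dot / (norm_q * norm_d)

docs = {1: ["red", "apple", "red"], 2: ["green", "apple"], 3: ["blue", "sky"]}
index, doc_table, word_table = build_tables(docs)
ranked = sorted(docs, key=lambda d: score(["red"], d, index, doc_table, word_table),
                reverse=True)
print(ranked[0])  # 1: the only document that contains "red"
```

Keeping $|d_j|$ and $maxf_j$ in the document table means a query only touches the postings of its own terms plus one table lookup per candidate document.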
