Term Weighting in Information Retrieval
Polettini Nicola, Monday, June 5, 2006
Web Information Retrieval


Slide 1: Term Weighting in Information Retrieval. Polettini Nicola, Monday, June 5, 2006. Web Information Retrieval.

Slide 2: Contents
1. Introduction to the Vector Space Model: vocabulary and terms; documents and queries; similarity measures.
2. Term weighting: binary weights; the SMART Retrieval System.
3. Salton's "Term Precision Model": paper analysis.
4. New weighting schemas: Web documents.
5. Conclusions.
6. References.

Slide 3: The Vector Space Model
1. Vocabulary.
2. Terms.
3. Documents and queries.
4. Vector representation.
5. Similarity measures.
6. Cosine similarity.

Slide 4: Vocabulary
Documents are represented as vectors in term space, where the set of all terms is the vocabulary. Queries are represented in the same way as documents. Query and document weights are based on the length and direction of their vectors. A vector distance measure between the query and each document is used to rank the retrieved documents.

Slide 5: Terms
Documents are represented by binary or weighted vectors of terms. Terms are usually stems, but they can also be n-grams: "Computer Science" is a bigram, "World Wide Web" a trigram.

Slide 6: Documents & Queries as Vectors
Documents and queries are represented as "bags of words" (BOW) and stored as vectors:
- A vector is like an array of floating-point numbers.
- It has direction and magnitude.
- Each vector holds a place for every term in the collection.
- Therefore, most vectors are sparse.

Slide 7: Vector Representation
Documents and queries are represented as vectors over a vocabulary of n terms: position 1 corresponds to term 1, position 2 to term 2, and so on up to position n for term n.
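As an illustration (not from the original slides), a minimal Python sketch of this representation over a toy vocabulary:

```python
from collections import Counter

# Toy vocabulary: position i of every vector corresponds to vocabulary[i].
vocabulary = ["nuclear", "fallout", "information", "retrieval"]

def to_vector(text: str) -> list[float]:
    """Map a document (or query) to a term-frequency vector over the vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]

doc = "information retrieval information systems"
print(to_vector(doc))  # [0.0, 0.0, 2.0, 1.0]
```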

Slide 8: Similarity Measures
- Simple matching
- Dice's coefficient
- Jaccard's coefficient
- Cosine coefficient
- Overlap coefficient
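The formulas on this slide were images and did not survive the transcript; the standard set-based definitions of these coefficients, for term sets X and Y, are presumably what was shown:

```latex
\begin{align*}
\text{Simple matching:} \quad & |X \cap Y| \\
\text{Dice:} \quad & \frac{2\,|X \cap Y|}{|X| + |Y|} \\
\text{Jaccard:} \quad & \frac{|X \cap Y|}{|X \cup Y|} \\
\text{Cosine:} \quad & \frac{|X \cap Y|}{\sqrt{|X|}\,\sqrt{|Y|}} \\
\text{Overlap:} \quad & \frac{|X \cap Y|}{\min(|X|, |Y|)}
\end{align*}
```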

Slide 9: Cosine Similarity
The similarity of two documents is the cosine of the angle between their vectors; this is called the cosine similarity. Normalization can be done once, when weighting the terms; otherwise, normalization and similarity can be combined in a single computation. Cosine similarity sorts documents according to their degree of similarity.
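The formula itself was an image in the original; in the standard notation for weighted vectors:

```latex
\mathrm{sim}(d_i, d_j) \;=\; \frac{\sum_{k=1}^{n} w_{ik}\, w_{jk}}
{\sqrt{\sum_{k=1}^{n} w_{ik}^{2}}\;\sqrt{\sum_{k=1}^{n} w_{jk}^{2}}}
```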

Slide 10: Example: Computing Cosine Similarity
[Figure: document and query vectors plotted in two-dimensional term space; only the axis ticks (0.2 to 1.0) survive in the transcript.]

Slide 11: Example: Computing Cosine Similarity (2)
[Worked example shown as an image; not recoverable from the transcript.]

Slide 12: Term Weighting
1. Binary weights.
2. SMART Retrieval System: local formulas; global formulas; normalization formulas.
3. TF·IDF.

Slide 13: Binary Weights
Only the presence (1) or absence (0) of a term is recorded in the vector.

Slide 14: Binary Weights Formula
The binary formula gives every word that appears in a document equal relevance. It can be useful when term frequency is not important.
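The formula was an image in the original; the usual binary weight for term j in document i is:

```latex
w_{ij} \;=\; \begin{cases} 1 & \text{if term } j \text{ occurs in document } i \\ 0 & \text{otherwise} \end{cases}
```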

Slide 15: Why Use Term Weighting?
- Binary weights are too limiting: terms are either present or absent.
- Non-binary weights allow us to model partial matching: partial matching allows retrieval of documents that approximate the query.
- Ranking of retrieved documents, best match first: term weighting improves the quality of the answer set.

Slide 16: SMART Retrieval System
SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell. It was designed for laboratory experiments in IR, making it easy to mix and match different weighting methods. Paper: Salton, "The SMART Retrieval System: Experiments in Automatic Document Processing", 1971.

Slide 17: SMART Retrieval System (2)
In SMART, weights are decomposed into three factors: a local factor, a global factor, and a normalization factor.
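The decomposition was shown as a formula image; in a common notation (mine, not the slide's), with L depending on the term and the document, G on the term alone, and N on the document alone:

```latex
w_{ij} \;=\; L_{ij} \times G_{j} \times N_{i}
```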

Slide 18: Local Term-Weighting Formulas
- Binary
- Frequency
- Maxnorm
- Augmented normalized
- Alternate log

Slide 19: Term Frequency
TF (term frequency): the count of times a term occurs in a document.

Slide 20: Term Frequency (2)
The more times a term t occurs in a document d, the more likely it is that t is relevant to the document. Used alone, it favors common words and long documents, and it gives too much credit to words that appear frequently. It is typically used for query weighting.

Slide 21: Augmented Normalized Term Frequency
This formula was proposed by Croft. Usually K = 0.5; K < 0.5 suits large documents, while K = 0.5 suits shorter ones. The output varies between 0.5 and 1 for terms that appear in the document. It is a "weak" form of normalization.
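The formula image is missing; the usual augmented normalized form, consistent with the behavior of K described above, is:

```latex
w_{ij} \;=\; K + (1 - K)\,\frac{tf_{ij}}{\max_{k}\, tf_{ik}}
```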

Slide 22: Logarithmic Term Frequency
Logarithms are a way to de-emphasize the effect of frequency: the logarithmic formula reduces the effect of large differences in term frequencies.
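The slide's exact variant is not recoverable; a common instance of the logarithmic local weight is:

```latex
w_{ij} \;=\; \begin{cases} 1 + \log(tf_{ij}) & \text{if } tf_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}
```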

Slide 23: Global Term-Weighting Formulas
- Inverse
- Squared
- Probabilistic
- Frequency

Slide 24: Document Frequency
DF (document frequency): the count of documents in the whole collection in which a term appears. The less frequently a term appears across the collection, the more discriminating it is.

Slide 25: Inverse Document Frequency
IDF measures the rarity of a term in the collection by inverting its document frequency. It is the most widely used global formula. Its value is higher when the term occurs in fewer documents:
- It gives full weight to terms that occur in one document only.
- It gives the lowest weight to terms that occur in all documents.
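For a collection of N documents and a term k with document frequency df_k, the standard formula is:

```latex
idf_k \;=\; \log\!\left(\frac{N}{df_k}\right)
```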

Slide 26: Inverse Document Frequency (2)
IDF provides high values for rare words and low values for common words. Examples for a collection of 10000 documents (N = 10000):
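The example values were lost in the transcript; recomputing them under the assumption of a base-10 logarithm:

```latex
\begin{aligned}
df_k = 1 &\;\Rightarrow\; idf_k = \log_{10}(10000/1) = 4 \\
df_k = 100 &\;\Rightarrow\; idf_k = \log_{10}(10000/100) = 2 \\
df_k = 10000 &\;\Rightarrow\; idf_k = \log_{10}(10000/10000) = 0
\end{aligned}
```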

Slide 27: Other IDF Schemes
- Squared IDF: rarely used, as a variant of IDF.
- Probabilistic IDF: assigns weights ranging from −∞ for a term that appears in every document up to log(N − 1) for a term that appears in only one document. Terms appearing in more than half of the documents receive negative weights.
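The probabilistic IDF that matches this range of values is:

```latex
idf_k \;=\; \log\!\left(\frac{N - df_k}{df_k}\right)
```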

Slide 28: Normalization Formulas
- Sum of weights
- Cosine
- Fourth
- Max

Slide 29: Document Normalization
Long documents have an unfair advantage:
- They use a lot of terms, so they get more matches than short documents.
- They use the same words repeatedly, so they have much higher term frequencies.
Normalization seeks to remove these effects:
- It is related somehow to maximum term frequency,
- but it is also sensitive to the number of terms.
If we don't normalize, short documents may not be recognized as relevant.

Slide 30: Cosine Normalization
It is the most popular and widely used normalization. It normalizes the term weights so that longer documents are not unfairly given more weight. If the weights are normalized in advance, the cosine similarity reduces to a plain inner product.
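Both formulas were images in the original; in standard form, the normalization and the resulting similarity are:

```latex
w'_{ij} \;=\; \frac{w_{ij}}{\sqrt{\sum_{k=1}^{n} w_{ik}^{2}}},
\qquad
\mathrm{sim}(d_i, d_j) \;=\; \sum_{k=1}^{n} w'_{ik}\, w'_{jk}
```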

Slide 31: Other Normalizations
Sum-of-weights and fourth normalization are rarely used variants of cosine normalization. Max weight normalization assigns weights between 0 and 1, but it does not take into account the distribution of terms over documents; it gives high importance to the most heavily weighted terms within a document (used in CiteSeer).

Slide 32: TF·IDF Term Weighting
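The slide consisted of the formula itself, shown as an image; the standard TF·IDF weight combines the local and global factors defined above:

```latex
w_{ij} \;=\; tf_{ij} \times \log\!\left(\frac{N}{df_j}\right)
```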

Slide 33: TF·IDF Example
It is the most widely used term weighting. [Table: raw term frequencies (tf), idf values, and resulting weights W_i,j for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, and retrieval across four documents. The idf column included the values 0.301, 0.125, 0.602, and 0.000, but the table's cell alignment was lost in the transcript.]

Slide 34: Normalization Example
[Table: the same tf, idf, and W_i,j values together with the cosine-normalized weights W'_i,j; the document lengths used for normalization were 1.70, 0.97, 2.67, and 0.87 for documents 1 to 4. Cell alignment was lost in the transcript.]

Slide 35: Retrieval Example
Query: contaminated retrieval, with weight 1 on each query term. Cosine similarity scores against the normalized weights W'_i,j: Doc 1 = 0.29, Doc 2 = 0.90, Doc 3 = 0.19, Doc 4 = 0.57. Ranked list: Doc 2, Doc 4, Doc 1, Doc 3.
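A compact sketch of the whole pipeline on a toy collection (TF·IDF weighting, cosine normalization, query scoring); the documents and resulting scores are illustrative, not the slide's:

```python
import math
from collections import Counter

docs = {
    "doc1": "nuclear fallout siberia",
    "doc2": "contaminated retrieval information retrieval",
    "doc3": "interesting complicated information",
    "doc4": "information retrieval contaminated",
}
N = len(docs)

# Document frequency of each term across the collection.
df = Counter(term for text in docs.values() for term in set(text.split()))

def tfidf_vector(text: str) -> dict[str, float]:
    """tf * idf weights, cosine-normalized so that scoring is a dot product.
    Terms absent from the collection or present in every document get weight 0."""
    tf = Counter(text.split())
    w = {t: tf[t] * math.log10(N / df[t]) for t in tf if 0 < df[t] < N}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

vectors = {name: tfidf_vector(text) for name, text in docs.items()}
query = tfidf_vector("contaminated retrieval")

# Cosine similarity is just the inner product of the normalized vectors.
scores = {name: sum(v.get(t, 0.0) * qw for t, qw in query.items())
          for name, v in vectors.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 2))
```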

Slide 36: Gerard Salton paper: "The Term Precision Model"
1. Weighting schema proposed.
2. Cosine similarity.
3. Density formula.
4. Discrimination value formulas.
5. Term precision formulas.
6. Conclusions.

Slide 37: Gerard Salton paper: Weighting Schema Proposed
1. Use of tf·idf formulas.
2. Underlines the importance of term weighting.
3. Use of cosine similarity.

Slide 38: Gerard Salton paper: Density Formula
Density is the average pairwise cosine similarity between distinct document pairs, where N is the total number of documents.
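The formula image is missing; the average pairwise density described in the text would be written (up to the exact constant Salton uses):

```latex
Q \;=\; \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} \mathrm{sim}(d_i, d_j)
```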

Slide 39: Gerard Salton paper: Discrimination Value Formulas
DV (discrimination value) is the difference between two average densities, where s_k is the density of the document pairs after term k has been removed. If k is a useful discriminator, DV is positive.
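In symbols, with s the baseline density of the full collection and s_k as defined above:

```latex
DV_k \;=\; s_k - s
```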

Slide 40: Gerard Salton paper: Discrimination Value Formulas (2)
- Terms with a high document frequency increase the total density, so their DV is negative.
- Terms with a low document frequency leave the density nearly unchanged, so their DV is close to zero.
- Terms with a medium document frequency decrease the total density, so their DV is positive.

Slide 41: Gerard Salton paper: Term Precision Formulas
N = total number of documents. R = relevant documents with respect to a query. I = N − R non-relevant documents. r = relevant documents to which the term is assigned. s = non-relevant documents to which the term is assigned (so df = r + s).
The weight w increases over 0 < df < R and decreases over R < df < N; its maximum value is reached at df = R.

Slide 42: Gerard Salton paper: Conclusions
Precision weights are difficult to compute in practice, because the relevance assessments of documents with respect to queries that they require are not normally available in real retrieval situations.

Slide 43: New Weighting Schemas
1. Web problems.
2. Document structure.
3. Hyperlinks.
4. Different weighting schemas.

Slide 44: New Weighting Schemas (2)
Weight tokens under particular HTML tags more heavily:
- <title> tokens (Google seems to like title matches).
- <h1>, <h2>, ... tokens.
- <meta> keyword tokens.
Parse the page into conceptual sections (e.g. navigation links vs. page content) and weight tokens differently based on section.
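A hedged sketch of tag-sensitive weighting; the tag multipliers are illustrative assumptions, not values from the slides:

```python
from html.parser import HTMLParser
from collections import Counter

# Illustrative multipliers; the slides give no concrete values.
TAG_BOOST = {"title": 5.0, "h1": 3.0, "h2": 2.0}

class WeightedTokenizer(HTMLParser):
    """Accumulate token weights, boosting tokens that appear inside favored tags."""
    def __init__(self):
        super().__init__()
        self.stack = []          # open tags (simplified tracking, fine for a sketch)
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            self.stack.remove(tag)

    def handle_data(self, data):
        # A token gets the largest boost among the tags currently enclosing it.
        boost = max((TAG_BOOST.get(t, 1.0) for t in self.stack), default=1.0)
        for token in data.lower().split():
            self.weights[token] += boost

parser = WeightedTokenizer()
parser.feed("<html><title>term weighting</title><body>weighting in IR</body></html>")
print(parser.weights)  # 'weighting' counted once at 5.0 (title) and once at 1.0 (body)
```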

Slide 45: References
- Gerard Salton and Chris Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
- Gerard Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
- Gerard Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, November 1975.
- Gerard Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ, 1971.

Slide 46: References (2)
- Erica Chisholm and Tamara G. Kolda. New term weighting formulas for the vector space method in information retrieval. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 1999.
- W. B. Croft. Experiments with representation in a document retrieval system. Information Technology: Research and Development, 2:1-21, 1983.
- Ray Larson and Marc Davis. SIMS 202: Information Organization and Retrieval. UC Berkeley SIMS, Lecture 18: Vector Representation, 2002.
- Kishore Papineni. Why Inverse Document Frequency? IBM T. J. Watson Research Center, Yorktown Heights, New York, USA, 2001.
- Chris Buckley. The importance of proper weighting methods. In M. Bates, editor, Human Language Technology. Morgan Kaufmann, 1993.

Slide 47: Questions?

