
1 CS 430: Information Discovery, Lecture 9: Term Weighting and Ranking

2 Course Administration

3 Building an Index

Pipeline (from Frakes, page 7):

documents → assign document IDs → text → break into words → words → stoplist → non-stoplist words → stemming* → stemmed words → term weighting* → terms with weights → index database

The document numbers and *field numbers are also stored in the index database.

*Indicates optional operation.
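
A compressed sketch of this pipeline in Python (the stoplist, the crude stemmer, and the corpus are toy assumptions; a real system would use a proper stemmer such as Porter's):

```python
from collections import Counter

STOPLIST = {"the", "a", "an", "of", "and", "to", "in", "on"}

def crude_stem(word):
    # Stand-in for a real stemmer: strip a plural 's'.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def build_index(documents):
    """documents -> assign IDs -> break into words -> stoplist -> stem -> index."""
    index = {}                                   # term -> {doc_id: frequency}
    for doc_id, text in enumerate(documents):    # assign document IDs
        words = text.lower().split()             # break into words
        terms = [crude_stem(w) for w in words if w not in STOPLIST]
        for term, freq in Counter(terms).items():
            index.setdefault(term, {})[doc_id] = freq
    return index

print(build_index(["The cats sat on the mats", "A dog chases cats"]))
```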

4 Term weighting

Zipf's Law: if the words w in a collection are ranked by their frequency, with rank r(w) and frequency f(w), they roughly fit the relation:

r(w) * f(w) = c

This suggests that some terms are more effective than others in retrieval. In particular, relative frequency is a useful measure: it identifies terms that occur with substantial frequency in some documents but with relatively low overall collection frequency. Term weights are functions used to quantify these concepts.
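
As a quick illustration (not part of the original slide), the sketch below ranks the words of a text by frequency and prints r(w) * f(w), which Zipf's Law predicts should stay roughly constant; the sample sentence is far too small for the law to hold well, so treat it only as a demonstration of the computation:

```python
from collections import Counter

def zipf_check(text):
    """Rank words by frequency and print r(w) * f(w)."""
    counts = Counter(text.lower().split())
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        print(f"{rank:>3}  {word:<10} f={freq}  r*f={rank * freq}")

zipf_check("the cat sat on the mat and the dog sat on the log")
```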

5 Inverse Document Frequency Weighting

Principle:
(a) Weight is proportional to the number of times that the term appears in the document.
(b) Weight is inversely proportional to the number of documents that contain the term.

Very simple approach: w_ij = f_ij / d_j

where:
w_ij is the weight given to term j in document i
f_ij is the frequency with which term j appears in document i
d_j is the number of documents that contain term j
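
A minimal sketch of this very simple weight over a toy corpus (the documents and names are illustrative assumptions):

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

f = [Counter(doc.split()) for doc in docs]            # f[i][j]: frequency of term j in document i
d = Counter(term for counts in f for term in counts)  # d[j]: number of documents containing term j

def simple_weight(i, term):
    """w_ij = f_ij / d_j"""
    return f[i][term] / d[term]

print(simple_weight(0, "cat"))   # 1 occurrence / 2 documents = 0.5
```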

6 Term Frequency Concept

A term that appears many times within a document is likely to be more important than a term that appears only once.

7 Term Frequency

Suppose term j appears f_ij times in document i.

Let m_i = max_j(f_ij), i.e., m_i is the maximum frequency of any term in document i.

Term frequency (TF): tf_ij = f_ij / m_i
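
A small sketch of this TF definition, assuming a document is given as a list of tokens (names are illustrative):

```python
from collections import Counter

def term_frequency(doc_tokens):
    """tf_ij = f_ij / m_i, where m_i is the highest term count in document i."""
    counts = Counter(doc_tokens)
    m_i = max(counts.values())
    return {term: freq / m_i for term, freq in counts.items()}

print(term_frequency("the cat sat on the mat".split()))
# {'the': 1.0, 'cat': 0.5, 'sat': 0.5, 'on': 0.5, 'mat': 0.5}
```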

8 Inverse Document Frequency Concept

A term that occurs in only a few documents is likely to be a better discriminator than a term that appears in most or all documents.

9 Inverse Document Frequency

For term j:
number of documents = n
document frequency (number of documents in which term j occurs) = d_j

One possible measure is n/d_j, but this over-emphasizes small differences. Therefore a more useful definition is:

Inverse document frequency (IDF): idf_j = log2(n/d_j) + 1

10 Example of Inverse Document Frequency

Example: n = 1,000 documents

term k    d_k      idf_k
A           100    4.32
B           500    2.00
C           900    1.15
D         1,000    1.00

From: Salton and McGill
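
The table follows directly from idf_j = log2(n/d_j) + 1, as a quick check shows:

```python
import math

n = 1000
for term, d in [("A", 100), ("B", 500), ("C", 900), ("D", 1000)]:
    idf = math.log2(n / d) + 1
    print(f"{term}: d = {d:>5}, idf = {idf:.2f}")
# A: d =   100, idf = 4.32
# B: d =   500, idf = 2.00
# C: d =   900, idf = 1.15
# D: d =  1000, idf = 1.00
```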

11 Weighting

Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:

(Weight of term j in document i) = (Term frequency) * (Inverse document frequency)

The standard weighting scheme is:

w_ij = tf_ij * idf_j = (f_ij / m_i) * (log2(n/d_j) + 1)

Frakes, Chapter 14 discusses many variations on this basic scheme.
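
Putting the two pieces together, a minimal sketch of the standard tf.idf scheme over a toy corpus (the corpus and names are illustrative assumptions, not from the lecture):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs chase cats for fun".split(),
]
n = len(docs)

counts = [Counter(doc) for doc in docs]          # f_ij
d = Counter(term for c in counts for term in c)  # d_j

def tf_idf(i, term):
    """w_ij = (f_ij / m_i) * (log2(n / d_j) + 1)"""
    m_i = max(counts[i].values())
    return (counts[i][term] / m_i) * (math.log2(n / d[term]) + 1)

print(tf_idf(0, "cat"))   # term occurs in 2 of 3 documents
print(tf_idf(0, "mat"))   # rarer term, higher weight
```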

12 Page Rank Algorithm (Google)

Concept: The rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.

13 Intuitive Model

A user:
1. Starts at a random page on the web
2. Selects a random hyperlink from the current page and jumps to the corresponding page
3. Repeats Step 2 a very large number of times

Pages are ranked according to the relative frequency with which they are visited.
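
A small simulation of this random-surfer model (the 4-page link structure is a made-up example, not the 6-page example on the following slides):

```python
import random

# links[p] lists the pages that page p links to.
links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}

def random_surfer(links, steps=100_000):
    """Follow random hyperlinks and count the relative visit frequency of each page."""
    visits = {p: 0 for p in links}
    page = random.choice(list(links))
    for _ in range(steps):
        visits[page] += 1
        page = random.choice(links[page])
    return {p: v / steps for p, v in visits.items()}

print(random_surfer(links))   # higher visit frequency = higher rank
```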

14 Page Ranks

The slide shows a 6 x 6 link matrix for pages P1 to P6: rows are the cited pages, columns are the citing pages, and each entry is 1 where the citing page links to the cited page. In this example P1 is cited by 3 pages, P2 by 1, P3 by 1, P4 by 4, P5 by 1, and P6 by 2. The bottom row, Number, gives the number of links out of each page:

Page:        P1  P2  P3  P4  P5  P6
Links out:    2   1   4   1   2   2

15 Normalize by Number of Links from Page

Each entry of the link matrix is divided by the number of links out of the citing page (its column total: 2, 1, 4, 1, 2, 2 for P1 to P6). Call the resulting matrix B. Its rows hold the normalized weights:

P1: 1, 0.25, 0.5
P2: 0.25
P3: 0.5
P4: 0.5, 0.25, 1, 0.5
P5: 0.5
P6: 0.25, 0.5

16 Weighting of Pages

Initially all pages have weight 1:

w_1 = (1, 1, 1, 1, 1, 1)^T

Recalculate the weights:

w_2 = B w_1 = (1.75, 0.25, 0.50, 2.25, 0.50, 0.75)^T

17 Page Ranks (Basic Algorithm)

Iterate w_k = B w_{k-1} until the weights converge (w_k ≈ w_{k-1}).

This w is the principal eigenvector of B. It ranks the pages by the links to them, normalized by the number of citations from each page and weighted by the ranking of the citing pages.

Basic algorithm:
- calculates the ranks for all pages
- lists hits in page rank order
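
A minimal power-iteration sketch of this basic algorithm; the 4-page matrix below is an illustrative stand-in for the slide's 6-page example:

```python
def page_rank_basic(B, iterations=50):
    """Iterate w_k = B w_{k-1} from an all-ones start; w converges to the
    principal eigenvector of B (rescaled each step to keep it bounded)."""
    n = len(B)
    w = [1.0] * n
    for _ in range(iterations):
        w = [sum(B[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        w = [x / total for x in w]
    return w

# B[i][j] = 1 / (links out of page j) if page j links to page i, else 0.
B = [
    [0.0, 0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
]
print(page_rank_basic(B))   # weights for P1..P4; highest weight = highest rank
```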

18 Google PageRank Model

A user:
1. Starts at a random page on the web
2a. With probability p, selects any random page and jumps to it
2b. With probability 1-p, selects a random hyperlink from the current page and jumps to the corresponding page
3. Repeats Steps 2a and 2b a very large number of times

Pages are ranked according to the relative frequency with which they are visited.

19 Google: PageRank

The Google PageRank algorithm is usually written with the following notation.

If page A has pages T_1, ..., T_n pointing to it:
- d: damping factor
- C(A): number of links out of A

Iterate until the ranks converge:

PR(A) = (1 - d) + d * (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))
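
A sketch of this iteration in Python; the link structure and the damping factor d = 0.85 (the value Brin and Page suggested) are assumptions for illustration:

```python
def page_rank(links, d=0.85, iterations=50):
    """PR(A) = (1 - d) + d * sum of PR(T)/C(T) over pages T that link to A."""
    pr = {p: 1.0 for p in links}
    for _ in range(iterations):
        pr = {
            a: (1 - d) + d * sum(pr[t] / len(links[t]) for t in links if a in links[t])
            for a in links
        }
    return pr

# Hypothetical link structure: links[p] = pages that page p points to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(page_rank(links))
```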

20 Compare TF.IDF to PageRank

With TF.IDF, documents are ranked depending on how well they match a specific query.

With PageRank, the pages are ranked in order of importance, with no reference to a specific query.

