1. CS 430: Information Discovery
Lecture 5: Ranking

2. Course Administration
Optional course readings are optional; read them if you wish. Some may require a visit to a library!
Teaching assistants do not have office hours. If your question cannot be addressed by email, ask to meet with them or come to my office hours.
Assignment 1 is an individual assignment. Discuss the concepts and the choice of methods with your colleagues, but the actual programs and report must be individual work.

3. Course Administration: Hints on Assignment 1
You are not building a production system! The volume of test data is quite small. Therefore:
- Choose data structures, etc. that illustrate the concepts but are straightforward to implement (e.g., do not implement B-trees).
- Consider batch loading of data (e.g., there is no need to provide for incremental update).
- The user interface can be minimal (e.g., single-letter commands).
To save typing, we will provide the arrays char_class and convert_class from Frakes, Chapter 7.

4. Term Frequency: Concept
A term that appears many times within a document is likely to be more important than a term that appears only once.

5. Term Frequency
Suppose term j appears f_ij times in document i.
Simple method (as illustrated in Lecture 4): use f_ij as the term frequency.
Standard method: scale f_ij relative to the other terms in the document. This partially corrects for variations in the length of the documents.
Let m_i = max_j (f_ij), i.e., m_i is the maximum frequency of any term in document i.
Term frequency (tf): tf_ij = f_ij / m_i
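As a concrete illustration, here is a minimal Python sketch of this scaling (the function name and the tokenized-document representation are illustrative, not from the lecture):

```python
from collections import Counter

def term_frequencies(tokens):
    """Scaled term frequency: tf_ij = f_ij / m_i for each term in one document."""
    counts = Counter(tokens)        # f_ij: raw count of term j in document i
    m_i = max(counts.values())      # m_i: count of the most frequent term
    return {term: f / m_i for term, f in counts.items()}

# The most frequent term gets tf = 1.0; everything else is scaled relative to it.
print(term_frequencies(["cat", "cat", "dog", "fish"]))
# -> {'cat': 1.0, 'dog': 0.5, 'fish': 0.5}
```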

6. Inverse Document Frequency: Concept
A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

7. Inverse Document Frequency
Suppose there are n documents and that the number of documents in which term j occurs is d_j.
Simple method: use n / d_j as the inverse document frequency.
Standard method: the simple method over-emphasizes small differences, therefore use a logarithm.
Inverse document frequency (idf): idf_j = log2(n / d_j) + 1, for d_j > 0

8. Example of Inverse Document Frequency
Example with n = 1,000 documents:

    term j    d_j      idf_j
    A         100      4.32
    B         500      2.00
    C         900      1.15
    D         1,000    1.00

From: Salton and McGill
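A few lines of Python reproduce the table (the helper name is ours):

```python
from math import log2

def idf(n, d_j):
    """Inverse document frequency: idf_j = log2(n / d_j) + 1, defined for d_j > 0."""
    return log2(n / d_j) + 1

n = 1000
for term, d_j in [("A", 100), ("B", 500), ("C", 900), ("D", 1000)]:
    print(term, round(idf(n, d_j), 2))   # A 4.32, B 2.0, C 1.15, D 1.0
```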

9. Standard Version of tf.idf Weighting
Combining tf and idf:
(a) Weight is proportional to the number of times that the term appears in the document.
(b) Weight is proportional to the logarithm of the reciprocal of the number of documents that contain the term.
Notation:
- w_ij: the weight given to term j in document i
- f_ij: the frequency with which term j appears in document i
- d_j: the number of documents that contain term j
- m_i: the maximum frequency of any term in document i
- n: the total number of documents

10. Standard Form of tf.idf
Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:
(weight of term j in document i) = (term frequency) * (inverse document frequency)
The standard tf.idf weighting scheme is:
w_ij = tf_ij * idf_j = (f_ij / m_i) * (log2(n / d_j) + 1)
Frakes, Chapter 14, discusses many variations on this basic scheme.
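Putting the two pieces together, a small sketch of the standard scheme over a toy collection (function name and sample documents are illustrative):

```python
from collections import Counter
from math import log2

def tfidf(docs):
    """w_ij = (f_ij / m_i) * (log2(n / d_j) + 1) for every document i and term j."""
    n = len(docs)
    d = Counter(term for doc in docs for term in set(doc))   # d_j: document frequency
    weights = []
    for doc in docs:
        f = Counter(doc)                                     # f_ij: raw term counts
        m_i = max(f.values())                                # m_i: max count in doc i
        weights.append({t: (c / m_i) * (log2(n / d[t]) + 1) for t, c in f.items()})
    return weights

docs = [["cat", "cat", "dog"], ["dog", "fish"], ["cat", "fish", "fish"]]
for w in tfidf(docs):
    print({t: round(v, 2) for t, v in w.items()})
```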

11. Ranking Based on Reference Patterns
With term weighting (e.g., tf.idf), documents are ranked by how well they match a specific query. With ranking by reference patterns, documents are ranked based on the references among them; the ranking of a set of documents is independent of any specific query.
In the journal literature, references are called citations. On the web, references are called links or hyperlinks.

13. Citation Graph
[Figure: a paper in a citation graph, with outgoing arrows labeled "cites" and incoming arrows labeled "is cited by".]
Note that journal citations always refer to earlier work.

14. Bibliometrics
Techniques that use citation analysis to measure the similarity of journal articles or their importance:
- Bibliographic coupling: two papers that cite many of the same papers.
- Co-citation: two papers that were cited by many of the same papers.
- Impact factor (of a journal): the frequency with which the average article in a journal has been cited in a particular year or period.
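The first two measures have a convenient matrix form: if A is a 0/1 citation matrix with A[i, j] = 1 when paper i cites paper j, then A A^T counts shared references (bibliographic coupling) and A^T A counts shared citers (co-citation). A sketch with made-up data, assuming numpy:

```python
import numpy as np

# A[i, j] = 1 if paper i cites paper j (toy data, purely illustrative)
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

coupling   = A @ A.T    # [i, k]: number of references shared by papers i and k
cocitation = A.T @ A    # [j, l]: number of papers that cite both j and l

print(coupling[0, 1])   # papers 0 and 1 both cite paper 2 -> 1
print(cocitation[2, 3]) # paper 1 cites both papers 2 and 3 -> 1
```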

15. Graphical Analysis of Hyperlinks on the Web
[Figure: a graph of six web pages, numbered 1 to 6, joined by hyperlinks. One label marks a page that links to many other pages; another marks a page that many pages link to.]

16. Matrix Representation
Rows are the cited pages (to); columns are the citing pages (from). An entry of 1 in row Pi, column Pj means that page Pj links to page Pi. The right-hand column counts each page's in-links; the bottom row counts each page's out-links.

            P1   P2   P3   P4   P5   P6 | Number
    P1       .    .    .    .    1    . |   1
    P2       1    .    1    .    .    . |   2
    P3       1    1    .    1    .    . |   3
    P4       1    1    .    .    1    1 |   4
    P5       1    .    .    .    .    . |   1
    P6       .    .    .    .    1    . |   1
    Number   4    2    1    1    3    1
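The graph behind this matrix (reconstructed from this slide together with slides 19-21) can be written as an adjacency list and assembled programmatically; a sketch assuming numpy:

```python
import numpy as np

# Out-links of each page, 0-based (page 0 is P1)
links = {0: [1, 2, 3, 4],   # P1 -> P2, P3, P4, P5
         1: [2, 3],         # P2 -> P3, P4
         2: [1],            # P3 -> P2
         3: [2],            # P4 -> P3
         4: [0, 3, 5],      # P5 -> P1, P4, P6
         5: [3]}            # P6 -> P4

n = 6
A = np.zeros((n, n))
for src, targets in links.items():
    for dst in targets:
        A[dst, src] = 1     # row = cited page (to), column = citing page (from)

print(A.sum(axis=1))        # in-links per page:  [1. 2. 3. 4. 1. 1.]
print(A.sum(axis=0))        # out-links per page: [4. 2. 1. 1. 3. 1.]
```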

17. PageRank Algorithm (Google)
Concept: the rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.

18. Intuitive Model
A user:
1. Starts at a random page on the web.
2. Selects a random hyperlink from the current page and jumps to the corresponding page.
3. Repeats Step 2 a very large number of times.
Pages are ranked according to the relative frequency with which they are visited.

19. Basic Algorithm: Normalize by Number of Links from Page
Divide each column of the link matrix by the number of links from that page (4, 2, 1, 1, 3, 1) to obtain the normalized link matrix B:

            P1     P2    P3   P4   P5     P6
    P1       .      .     .    .   0.33    .
    P2      0.25    .     1    .    .      .
    P3      0.25   0.5    .    1    .      .
    P4      0.25   0.5    .    .   0.33    1
    P5      0.25    .     .    .    .      .
    P6       .      .     .    .   0.33    .

Rows = cited page; columns = citing page.
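In code, the normalization is a single division; a sketch assuming numpy, with the link matrix from slide 16:

```python
import numpy as np

# Link matrix: rows = cited page, columns = citing page
A = np.array([[0, 0, 0, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [1, 1, 0, 0, 1, 1],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 0]], dtype=float)

B = A / A.sum(axis=0)   # divide each column by that page's number of out-links
print(B.round(2))       # each column of B now sums to 1
```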

20. Basic Algorithm: Weighting of Pages
Initially all pages have weight 1:
w_1 = (1, 1, 1, 1, 1, 1)
Recalculate the weights:
w_2 = B w_1 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)

21. Basic Algorithm: Iterate
Iterate: w_k = B w_{k-1}

w_1 = (1, 1, 1, 1, 1, 1)
w_2 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)
w_3 = (0.08, 1.83, 2.79, 1.12, 0.08, 0.08)
w_4 = (0.03, 2.80, 2.06, 1.05, 0.02, 0.03)
...
converges to w = (0.00, 2.39, 2.39, 1.19, 0.00, 0.00)
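The iteration is a few lines of numpy. Running it reproduces the sequence above; the exact limit is (0, 2.4, 2.4, 1.2, 0, 0), which the slide's rounded intermediate arithmetic shows as 2.39 and 1.19:

```python
import numpy as np

# Normalized link matrix B from slide 19
B = np.array([[0,    0,   0, 0, 1/3, 0],
              [0.25, 0,   1, 0, 0,   0],
              [0.25, 0.5, 0, 1, 0,   0],
              [0.25, 0.5, 0, 0, 1/3, 1],
              [0.25, 0,   0, 0, 0,   0],
              [0,    0,   0, 0, 1/3, 0]])

w = np.ones(6)            # w_1: all pages start with weight 1
for _ in range(50):
    w = B @ w             # w_k = B w_{k-1}
print(w.round(2))         # -> [0.  2.4 2.4 1.2 0.  0. ]
```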

22. Google PageRank with Damping
A user:
1. Starts at a random page on the web.
2a. With probability p, selects any random page and jumps to it.
2b. With probability 1 - p, selects a random hyperlink from the current page and jumps to the corresponding page.
3. Repeats Steps 2a and 2b a very large number of times.
Pages are ranked according to the relative frequency with which they are visited.

23. The PageRank Iteration
The basic method iterates using the normalized link matrix B:
w_k = B w_{k-1}
The limit w is the principal eigenvector of B (the eigenvector belonging to the largest eigenvalue).
Google iterates using a damping factor. The method iterates with a matrix B', where:
B' = pN + (1 - p)B
N is the matrix with every element equal to 1/n, and p is a constant found by experiment.
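A sketch of the damped iteration; p = 0.15 is an illustrative value, not a figure from the lecture:

```python
import numpy as np

B = np.array([[0,    0,   0, 0, 1/3, 0],    # same normalized link matrix as above
              [0.25, 0,   1, 0, 0,   0],
              [0.25, 0.5, 0, 1, 0,   0],
              [0.25, 0.5, 0, 0, 1/3, 1],
              [0.25, 0,   0, 0, 0,   0],
              [0,    0,   0, 0, 1/3, 0]])

n = B.shape[0]
p = 0.15                                    # jump probability (illustrative)
N = np.full((n, n), 1.0 / n)                # N: every element equal to 1/n
B_damped = p * N + (1 - p) * B              # B' = pN + (1 - p)B

w = np.ones(n)
for _ in range(50):
    w = B_damped @ w
print(w.round(2))   # with damping, every page keeps a small positive weight
```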

24. Google: PageRank
The Google PageRank algorithm is usually written with the following notation. If page A has pages T_1, ..., T_n pointing to it:
- d: damping factor
- C(A): number of links out of A
Iterate until the ranks converge:
PR(A) = (1 - d) + d (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))
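The per-page update can be written directly from this notation. A minimal sketch over a toy adjacency list; the helper names are ours, and d = 0.85 is the value suggested in Brin and Page's paper:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    out_count = {p: len(ts) for p, ts in links.items()}               # C(T)
    incoming = {p: [q for q, ts in links.items() if p in ts] for p in links}
    pr = {p: 1.0 for p in links}                                      # initial ranks
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[t] / out_count[t] for t in incoming[p])
              for p in links}
    return pr

links = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}   # toy graph
print(pagerank(links))
```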

25. Information Retrieval Using PageRank
Simple method: consider all hits (i.e., all document vectors that share at least one term with the query vector) as equal, and display the hits ranked by PageRank.
The disadvantage of this method is that it pays no attention to how closely a document matches the query.

26. Reference Pattern Ranking Using Dynamic Document Sets
PageRank calculates document ranks for the entire (fixed) set of documents. The calculations are made periodically (e.g., monthly), and the document ranks are the same for all queries.
Concept: reference patterns among documents that are related to a specific query convey more information than patterns calculated across an entire document collection. With dynamic document sets, reference patterns are calculated for a set of documents that is selected based on each individual query.

27. Reference Pattern Ranking Using Dynamic Document Sets
Teoma dynamic ranking algorithm (used in Ask Jeeves), sketched in code after this list:
1. Search using conventional term weighting. Rank the hits by the similarity between query and documents.
2. Select the highest-ranking hits (e.g., the top 5,000 hits).
3. Carry out PageRank or a similar algorithm on this set of hits. This creates a set of document ranks that are specific to this query.
4. Display the results ranked in the order calculated from the reference patterns.
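A sketch of the four steps as a pipeline; `search` and `pagerank_subset` are assumed stand-ins for real components, not part of the lecture:

```python
def dynamic_rank(query, search, pagerank_subset, top_k=5000):
    """Sketch of query-specific reference-pattern ranking.

    `search(query)` is assumed to return (doc_id, similarity) pairs from
    conventional term weighting; `pagerank_subset(doc_ids)` is assumed to
    rank just those documents by the links among them.
    """
    hits = sorted(search(query), key=lambda h: h[1], reverse=True)  # step 1
    subset = [doc_id for doc_id, _ in hits[:top_k]]                 # step 2
    ranks = pagerank_subset(subset)                                 # step 3
    return sorted(subset, key=lambda d: ranks[d], reverse=True)     # step 4
```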

28. Combining Term Weighting with Reference Pattern Ranking
Combined method (Step 4 is sketched in code below):
1. Find all documents that share a term with the query vector.
2. Let s_j be the similarity between the query and document j, using conventional term weighting.
3. Let p_j be the rank of document j using PageRank or another reference pattern ranking.
4. Calculate a combined rank c_j = λs_j + (1 - λ)p_j, where λ is a constant.
5. Display the hits ranked by c_j.
This method is used in several commercial systems, but the details have not been published.
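Step 4 in code; λ = 0.7 is an arbitrary choice, and in practice s_j and p_j would first be normalized to comparable scales:

```python
def combined_rank(sim, pr, lam=0.7):
    """c_j = lam * s_j + (1 - lam) * p_j for every document j."""
    return {j: round(lam * sim[j] + (1 - lam) * pr[j], 2) for j in sim}

sim = {"doc1": 0.9, "doc2": 0.4}   # s_j: query-document similarity
pr  = {"doc1": 0.2, "doc2": 0.8}   # p_j: reference-pattern rank
print(combined_rank(sim, pr))      # -> {'doc1': 0.69, 'doc2': 0.52}
```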

29. Cornell Note
Jon Kleinberg of Cornell Computer Science has carried out extensive research in this area, both theoretical work and the practical development of new algorithms. In particular, he has studied hubs (documents that refer to many others) and authorities (documents that are referenced by many others).
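The slide does not spell out an algorithm, but Kleinberg's published HITS method computes hub and authority scores by mutual reinforcement; a minimal sketch on a toy graph:

```python
import numpy as np

# A[i, j] = 1 if page i links to page j (toy graph, illustrative only)
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

hubs = np.ones(3)
auth = np.ones(3)
for _ in range(50):
    auth = A.T @ hubs                 # good authorities are cited by good hubs
    hubs = A @ auth                   # good hubs point at good authorities
    auth /= np.linalg.norm(auth)      # normalize so the scores stay bounded
    hubs /= np.linalg.norm(hubs)

print(auth.round(2), hubs.round(2))   # page 2 is the strongest authority here
```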

