1 CS 430: Information Discovery Lecture 5 Ranking

2 Course Administration Optional course readings are optional. Read them if you wish. Some may require a visit to a library! Teaching assistants do not have office hours. If your query cannot be addressed by email, ask to meet with them or come to my office hours. Assignment 1 is an individual assignment. Discuss the concepts and the choice of methods with your colleagues, but the actual programs and report must be individual work.

3 Course Administration Hints on Assignment 1 You are not building a production system! The volume of test data is quite small. Therefore:
- Choose data structures, etc. that illustrate the concepts but are straightforward to implement (e.g., do not implement B-trees).
- Consider batch loading of data (e.g., no need to provide for incremental update).
- The user interface can be minimal (e.g., single-letter commands).
To save typing, we will provide the arrays char_class and convert_class from Frakes, Chapter 7.

4 Term Frequency Concept A term that appears many times within a document is likely to be more important than a term that appears only once.

5 Term Frequency Suppose term j appears f_ij times in document i. Simple method (as illustrated in Lecture 4): use f_ij as the term frequency. Standard method: scale f_ij relative to the other terms in the document. This partially corrects for variations in the length of the documents. Let m_i = max_j (f_ij), i.e., m_i is the maximum frequency of any term in document i. Term frequency (tf): tf_ij = f_ij / m_i
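As a minimal sketch (not from the lecture), the scaled term frequency might be computed along these lines; the toy document and tokenization are assumptions for illustration:

from collections import Counter

def term_frequencies(doc_tokens):
    # Raw counts f_ij for each term j in this document i
    counts = Counter(doc_tokens)
    # m_i: maximum frequency of any term in the document
    m_i = max(counts.values())
    # tf_ij = f_ij / m_i
    return {term: f / m_i for term, f in counts.items()}

# Hypothetical document: "web" appears twice, the other terms once
print(term_frequencies(["web", "search", "ranks", "web", "pages"]))
# {'web': 1.0, 'search': 0.5, 'ranks': 0.5, 'pages': 0.5}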

6 Inverse Document Frequency Concept A term that occurs in only a few documents is likely to be a better discriminator than a term that appears in most or all documents.

7 Inverse Document Frequency Suppose there are n documents and that the number of documents in which term j occurs is d_j. Simple method: use n/d_j as the inverse document frequency. Standard method: the simple method over-emphasizes small differences, therefore use a logarithm. Inverse document frequency (idf): idf_j = log2(n/d_j) + 1, for d_j > 0
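As a worked illustration (a hypothetical term, not from the slide): with n = 1,000 documents, a term that occurs in d_j = 100 of them gets

idf_j = log2(1000/100) + 1 = log2(10) + 1 ≈ 3.32 + 1 = 4.32

while a term that occurs in every document gets idf_j = log2(1) + 1 = 1, the minimum weight.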

8 Example of Inverse Document Frequency Example: n = 1,000 documents. [Table: terms A, B, C, and D with their document frequencies d_j and the corresponding idf_j values.] From: Salton and McGill

9 Standard Version of tf.idf Weighting Combining tf and idf: (a) Weight is proportional to the number of times that the term appears in the document. (b) Weight is proportional to the logarithm of the reciprocal of the number of documents that contain the term. Notation:
w_ij is the weight given to term j in document i
f_ij is the frequency with which term j appears in document i
d_j is the number of documents that contain term j
m_i is the maximum frequency of any term in document i
n is the total number of documents

10 Standard Form of tf.idf Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances: (Weight of term j in document i) = (Term frequency) * (Inverse document frequency) The standard tf.idf weighting scheme is: w_ij = tf_ij * idf_j = (f_ij / m_i) * (log2(n/d_j) + 1) Frakes, Chapter 14 discusses many variations on this basic scheme.
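A minimal end-to-end sketch of this weighting scheme (the corpus and tokenization are assumed for illustration, not taken from the lecture):

import math
from collections import Counter

def tfidf_weights(docs):
    # docs: list of token lists; returns one {term: w_ij} dict per document
    n = len(docs)
    # d_j: number of documents containing term j
    d = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        m_i = max(counts.values())  # maximum frequency in this document
        weights.append({
            term: (f / m_i) * (math.log2(n / d[term]) + 1)  # tf_ij * idf_j
            for term, f in counts.items()
        })
    return weights

# Three hypothetical one-line documents
docs = [["web", "search", "ranking"],
        ["web", "crawler", "web"],
        ["vector", "space", "ranking"]]
for w in tfidf_weights(docs):
    print(w)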

11 Ranking Based on Reference Patterns With term weighting (e.g., tf.idf) documents are ranked depending on how well they match a specific query. With ranking by reference patterns, documents are ranked based on the references among them. The ranking of a set of documents is independent of any specific query. In journal literature, references are called citations. On the web, references are called links or hyperlinks.

13 Citation Graph [Figure: a graph of papers in which a directed edge runs from each paper to the papers it cites; following an edge forward gives "cites", backward gives "is cited by".] Note that journal citations always refer to earlier work.

14 Bibliometrics Techniques that use citation analysis to measure the similarity of journal articles or their importance:
Bibliographic coupling: two papers are related if they cite many of the same papers.
Co-citation: two papers are related if they are cited by many of the same papers.
Impact factor (of a journal): the frequency with which the average article in the journal has been cited in a particular year or period.
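As an illustrative sketch (not from the lecture), both similarity measures can be read off a citation matrix C, where C[i, j] = 1 if paper i cites paper j; the example matrix is hypothetical:

import numpy as np

# Hypothetical citation matrix: C[i, j] = 1 if paper i cites paper j
C = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

# Bibliographic coupling: entry (i, k) counts references shared by papers i and k
coupling = C @ C.T

# Co-citation: entry (j, l) counts papers that cite both j and l
cocitation = C.T @ C

print(coupling[0, 1])    # papers 0 and 1 share 1 reference (paper 2)
print(cocitation[2, 3])  # papers 2 and 3 are co-cited by 1 paper (paper 1)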

15 Graphical Analysis of Hyperlinks on the Web [Figure: a web graph contrasting a page that links to many other pages with a page that many pages link to.]

16 Matrix Representation [Table: a 6 x 6 link matrix for pages P1 to P6, with cited pages (to) along one axis and citing pages (from) along the other. An entry of 1 records a link; the Number column gives the total number of links from each citing page.]
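A small sketch of such a representation (the link structure and the orientation of the matrix are assumptions for illustration):

import numpy as np

# Hypothetical links among pages P1..P4: L[to, from] = 1 if "from" links to "to"
L = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

# Number of links out of each citing page (column sums)
number = L.sum(axis=0)
print(number)  # [2 2 2 1]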

17 PageRank Algorithm (Google) Concept: The rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.

18 Intuitive Model A user:
1. Starts at a random page on the web.
2. Selects a random hyperlink from the current page and jumps to the corresponding page.
3. Repeats Step 2 a very large number of times.
Pages are ranked according to the relative frequency with which they are visited.
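A toy simulation of this random surfer (the graph is hypothetical, and it assumes every page has at least one outgoing link, an assumption the damped model below relaxes):

import random
from collections import Counter

# Hypothetical web graph: page -> pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": ["C"]}

random.seed(0)
page = random.choice(list(links))       # start at a random page
visits = Counter()
for _ in range(100_000):
    page = random.choice(links[page])   # follow a random hyperlink
    visits[page] += 1

# Relative visit frequencies approximate the pages' ranks
total = sum(visits.values())
print({p: round(v / total, 3) for p, v in sorted(visits.items())})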

19 Basic Algorithm: Normalize by Number of Links from Page [Table: the link matrix of slide 16 with each entry divided by the number of links from its citing page, so that each citing page's links sum to 1. The result is B, the normalized link matrix.]

20 Basic Algorithm: Weighting of Pages Initially all pages have weight 1: w_1 = (1, 1, ..., 1). Recalculate the weights as w_2 = B w_1. [The slide shows the resulting vectors for the example matrix.]

21 Basic Algorithm: Iterate Iterate: w_k = B w_{k-1}. The sequence w_1, w_2, w_3, w_4, ... converges to a fixed vector w.
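A compact sketch of this power iteration (the normalized link matrix here is hypothetical, standing in for the slide's B):

import numpy as np

# Hypothetical normalized link matrix B: column j spreads page j's
# weight equally over the pages it links to (each column sums to 1)
B = np.array([[0.0, 0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.5, 0.0]])

w = np.ones(4)          # w_1: all pages start with weight 1
for _ in range(50):
    w = B @ w           # w_k = B w_{k-1}
print(np.round(w, 3))   # the converged weights w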

22 Google PageRank with Damping A user:
1. Starts at a random page on the web.
2a. With probability p, selects any random page and jumps to it.
2b. With probability 1 - p, selects a random hyperlink from the current page and jumps to the corresponding page.
3. Repeats Steps 2a and 2b a very large number of times.
Pages are ranked according to the relative frequency with which they are visited.

23 The PageRank Iteration The basic method iterates using the normalized link matrix B: w_k = B w_{k-1}. The limiting w is the principal (highest-order) eigenvector of B. Google iterates with a damping factor, using a matrix B' in place of B, where: B' = pN + (1 - p)B N is the matrix with every element equal to 1/n, and p is a constant found by experiment.
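The damped iteration differs from the sketch above only in the matrix; a minimal continuation, using the same hypothetical B and an assumed value of p:

import numpy as np

B = np.array([[0.0, 0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.5, 0.0]])
n = B.shape[0]

p = 0.15                          # assumed value; the slide says p is found by experiment
N = np.full((n, n), 1.0 / n)      # uniform random-jump matrix
B_damped = p * N + (1 - p) * B    # B' = pN + (1 - p)B

w = np.ones(n)
for _ in range(50):
    w = B_damped @ w              # w_k = B' w_{k-1}
print(np.round(w, 3))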

24 Google: PageRank The Google PageRank algorithm is usually written with the following notation. If page A has pages T_1, ..., T_n pointing to it:
– d: damping factor
– C(A): number of links out of A
Then: PR(A) = (1 - d) + d (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))
Iterate until the values of PR converge.

25 Information Retrieval Using PageRank Simple Method Consider all hits (i.e., all document vectors that share at least one term with the query vector) as equal. Display the hits ranked by PageRank. The disadvantage of this method is that it gives no attention to how closely a document matches the query.

26 Reference Pattern Ranking using Dynamic Document Sets PageRank calculates document ranks for the entire (fixed) set of documents. The calculations are made periodically (e.g., monthly) and the document ranks are the same for all queries. Concept: reference patterns among documents that are related to a specific query convey more information than patterns calculated across entire document collections. With dynamic document sets, reference patterns are calculated for a set of documents that are selected based on each individual query.

27 Reference Pattern Ranking using Dynamic Document Sets Teoma Dynamic Ranking Algorithm (used in Ask Jeeves):
1. Search using conventional term weighting. Rank the hits by the similarity between the query and the documents.
2. Select the highest-ranking hits (e.g., the top 5,000 hits).
3. Carry out PageRank or a similar algorithm on this set of hits. This creates a set of document ranks that are specific to this query.
4. Display the results ranked in the order of the reference patterns calculated.
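A schematic of this query-time pipeline (all function names here are hypothetical placeholders, not Teoma's actual interfaces):

import numpy as np

def dynamic_rank(query, index, k=5000, iterations=50):
    # 1. Conventional term-weighted search (e.g., tf.idf similarity);
    #    index.search is a hypothetical helper returning (doc_id, score) pairs
    hits = index.search(query)

    # 2. Keep only the highest-ranking hits
    top = [doc for doc, _ in sorted(hits, key=lambda h: -h[1])[:k]]

    # 3. Run PageRank on the link subgraph among just these hits;
    #    index.link_matrix is assumed to return the normalized matrix B
    B = index.link_matrix(top)
    w = np.ones(len(top))
    for _ in range(iterations):
        w = B @ w

    # 4. Display in order of the query-specific reference-pattern ranks
    return [doc for _, doc in sorted(zip(w, top), reverse=True)]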

28 Combining Term Weighting with Reference Pattern Ranking Combined Method:
1. Find all documents that share a term with the query vector.
2. The similarity, using conventional term weighting, between the query and document j is s_j.
3. The rank of document j using PageRank or another reference-pattern ranking is p_j.
4. Calculate a combined rank c_j = λs_j + (1 - λ)p_j, where λ is a constant.
5. Display the hits ranked by c_j.
This method is used in several commercial systems, but the details have not been published.
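The combination is a one-liner in code; the value of λ below is an assumption for illustration, since the commercial systems have not published theirs:

def combined_rank(s_j, p_j, lam=0.7):
    # c_j = λ s_j + (1 - λ) p_j, with both scores assumed normalized to [0, 1]
    return lam * s_j + (1 - lam) * p_j

# A document with high query similarity but a modest reference-pattern rank
print(combined_rank(0.9, 0.3))  # 0.72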

29 Cornell Note Jon Kleinberg of Cornell Computer Science has carried out extensive research in this area, both theoretical work and the practical development of new algorithms. In particular, he has studied hubs (documents that refer to many others) and authorities (documents that are referenced by many others).