1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.

Presentation transcript:

1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2

2 CS 430 / INFO 430: Information Retrieval Completion of Lecture 16

3 Mercator/Heritrix: Domain Name Lookup Resolving domain names to IP addresses is a major bottleneck of web crawlers. Approach: Separate DNS resolver and cache on each crawling computer. Create multi-threaded version of DNS code (BIND). In Mercator, these changes reduced DNS look-up from 70% to 14% of each thread's elapsed time.

4

5 Research Topics in Web Crawling How frequently to crawl and what strategies to use. Identification of anomalies and crawling traps. Strategies for crawling based on the content of web pages (focused and selective crawling). Duplicate detection.

6 Further Reading Heritrix Allan Heydon and Marc Najork, Mercator: A Scalable, Extensible Web Crawler. World Wide Web 2(4): , December

7 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2

8 Course Administration

9 Indexing the Web Goals: Precision. Short queries applied to very large numbers of items lead to large numbers of hits. The goal is that the first hits presented should satisfy the user's information need, which requires ranking hits in an order that fits the user's requirements. Recall is not an important criterion. Completeness of the index is not an important factor. Comprehensive crawling is unnecessary.

10 Graphical Methods Document A refers to document B. Document A provides information about document B.

11 Anchor Text The source of Document A contains marked-up text (a hyperlink) whose anchor text is: The Faculty of Computing and Information Science. The anchor text can be considered descriptive metadata about the document that the link points to.
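As a rough illustration of how anchor text might be harvested as metadata about the linked document, here is a minimal sketch using Python's standard `html.parser`. The markup and the URL `http://example.org/b.html` are placeholders invented for this example, since the slide's original markup is not preserved.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_anchor_text(html):
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical source of Document A; the URL is a placeholder.
doc_a = ('<p>See <a href="http://example.org/b.html">The Faculty of '
         'Computing and Information Science</a>.</p>')
links = extract_anchor_text(doc_a)
```

An indexer could then store each anchor text as an extra field of the target document rather than of Document A.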

12 Concept of Relevance and Importance Document measures Relevance, as conventionally defined, is binary (relevant or not relevant). It is usually estimated by the similarity between the terms in the query and each document. Importance measures documents by their likelihood of being useful to a variety of users. It is usually estimated by some measure of popularity. Web search engines rank documents by a combination of estimates of relevance and importance.

13 Ranking Options 1. Paid advertisers 2. Manually created classification 3. Vector space ranking with corrections for document length, and extra weighting for specific fields, e.g., title, anchors, etc. 4. Popularity, e.g., PageRank. The details of 3 and the balance between 3 and 4 are not made public.

14 Citation Graph [Figure: a paper cites other papers and is cited by other papers.] Note that journal citations always refer to earlier work.

15 Bibliometrics Techniques that use citation analysis to measure the similarity of journal articles or their importance. Bibliographic coupling: two papers that cite many of the same papers. Co-citation: two papers that were cited by many of the same papers. Impact factor (of a journal): frequency with which the average article in a journal has been cited in a particular year or period.
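The two similarity measures can be made concrete with a small sketch. The citation data below is invented for illustration, and counting shared papers is only one simple way to quantify "many of the same papers".

```python
# Hypothetical citation graph: paper -> set of papers it cites.
cites = {
    "A": {"C", "D", "E"},
    "B": {"C", "D"},
    "C": {"E"},
    "D": {"E"},
    "E": set(),
}


def bibliographic_coupling(cites, p, q):
    """Number of papers cited by both p and q."""
    return len(cites[p] & cites[q])


def co_citation(cites, p, q):
    """Number of papers that cite both p and q."""
    return sum(1 for refs in cites.values() if p in refs and q in refs)


coupling_ab = bibliographic_coupling(cites, "A", "B")  # A and B both cite C and D
cocitation_cd = co_citation(cites, "C", "D")           # C and D are both cited by A and B
```

Bibliographic coupling looks at out-links that two papers share, while co-citation looks at in-links that they share.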

16 Bibliometrics: Impact Factor Impact Factor (Garfield, 1972): Set of journals in Journal Citation Reports of the Institute for Scientific Information. The impact factor of a journal j in a given year is the average number of citations received by papers published in the previous two years of journal j. Impact factor counts in-degrees of nodes in the network. Influence Weight (Pinski and Narin, 1976): A journal is influential if, recursively, it is heavily cited by other influential journals.
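A minimal sketch of the impact factor calculation follows. It assumes that only citations made during the given year are counted, and the paper and citation records are invented for illustration.

```python
def impact_factor(papers, citations, journal, year):
    """Average citations received in `year` by the journal's papers
    published in the two preceding years.

    papers:    paper id -> (journal, publication year)
    citations: list of (year the citation was made, cited paper id)
    """
    window = {pid for pid, (j, y) in papers.items()
              if j == journal and year - 2 <= y <= year - 1}
    if not window:
        return 0.0
    # Each citation is one in-link to a paper node in the citation graph.
    received = sum(1 for cited_in, pid in citations
                   if cited_in == year and pid in window)
    return received / len(window)


# Hypothetical data.
papers = {
    "p1": ("J", 2003),
    "p2": ("J", 2004),
    "p3": ("J", 2002),   # too old to count for 2005
    "p4": ("K", 2004),
}
citations = [
    (2005, "p1"), (2005, "p1"), (2005, "p2"),
    (2005, "p3"), (2005, "p4"), (2004, "p1"),
]
factor_j = impact_factor(papers, citations, "J", 2005)
```

Unlike influence weights, this count treats every citation equally, whatever the standing of the citing journal.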

17 Graphical Analysis of Hyperlinks on the Web Hub: this page links to many other pages. Authority: many pages link to this page.

18 Graphical Methods on Web Links Choices: graph of the full Web or a subgraph; in-links to a node or all links. Algorithms: Hubs and Authorities -- subgraph, all links (Kleinberg, 1997); PageRank -- full graph, in-links only (Brin and Page, 1998).

19 PageRank Algorithm Used to estimate popularity of documents Concept: The rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages. PageRank is essentially a modified version of Pinski and Narin's influence weights applied to the Web graph.

20 Intuitive Model (Basic Concept) Basic (no damping). A user: 1. Starts at a random page on the web. 2. Selects a random hyperlink from the current page and jumps to the corresponding page. 3. Repeats Step 2 a very large number of times. Pages are ranked according to the relative frequency with which they are visited.
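The random-surfer model can be simulated directly. This is a sketch on an invented four-page graph in which every page has at least one out-link; the function name `random_surfer` and the seed are illustrative choices.

```python
import random


def random_surfer(links, steps, seed=0):
    """Simulate the undamped random surfer and return visit frequencies."""
    rng = random.Random(seed)
    page = rng.choice(sorted(links))          # Step 1: start at a random page
    visits = {p: 0 for p in links}
    for _ in range(steps):                    # Step 3: repeat many times
        page = rng.choice(links[page])        # Step 2: follow a random hyperlink
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}


# Hypothetical graph: page -> pages it links to.
links = {1: [2, 3], 2: [3], 3: [1], 4: [3]}
freq = random_surfer(links, 100_000)
ranking = sorted(freq, key=freq.get, reverse=True)
```

Page 4 has no in-links, so the surfer never arrives there by following a link and its frequency stays at zero.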

21 Basic Algorithm: Matrix Representation [Table: link matrix for pages P1 to P6. Columns are the citing pages (from), rows are the cited pages (to), and a final row gives the number of links from each citing page.]

22 Basic Algorithm: Normalize by Number of Links from Page [Table: the link matrix for pages P1 to P6 with each column (citing page) divided by the number of links from that page. The result is B, the normalized link matrix.]
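The normalization step can be sketched as follows. The three-page graph is invented, since the entries of the slide's six-page matrix are not preserved, and pages are numbered from 0 for convenience.

```python
def normalized_link_matrix(links, n):
    """Return B as a list of rows, where B[i][j] is the fraction of
    page j's out-links that point to page i (pages are numbered 0..n-1)."""
    B = [[0.0] * n for _ in range(n)]
    for source, targets in links.items():
        for target in targets:
            B[target][source] += 1.0 / len(targets)
    return B


# Hypothetical graph: citing page -> cited pages.
links = {0: [1, 2], 1: [2], 2: [0]}
B = normalized_link_matrix(links, 3)

# Every column of B sums to 1 when each page has at least one out-link.
column_sums = [sum(B[i][j] for i in range(3)) for j in range(3)]
```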

23 Basic Algorithm: Weighting of Pages Initially all pages have weight 1/n, so with n = 6 every element of $w_0$ is approximately 0.17. Recalculate weights: $w_1 = B w_0$. If the user starts at a random page, the j-th element of $w_1$ is the probability of reaching page j after one step.

24 Basic Algorithm: Iterate Iterate: $w_k = B w_{k-1}$. The sequence $w_0, w_1, w_2, w_3, \ldots$ converges to $w$. At each iteration, the sum of the weights is 1.
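A minimal sketch of the basic iteration, using a small invented matrix B rather than the slide's six-page example. For this B, the weights settle close to (0.4, 0.2, 0.4).

```python
def multiply(B, w):
    """Matrix-vector product Bw, with B stored as a list of rows."""
    return [sum(b * x for b, x in zip(row, w)) for row in B]


def iterate(B, steps):
    """Return [w_0, w_1, ..., w_steps] for the iteration w_k = B w_{k-1}."""
    n = len(B)
    w = [1.0 / n] * n            # w_0: every page starts with weight 1/n
    history = [w]
    for _ in range(steps):
        w = multiply(B, w)
        history.append(w)
    return history


# Hypothetical normalized link matrix (each column sums to 1).
B = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
history = iterate(B, 100)
w1 = history[1]      # probabilities after one step from a random start
w = history[-1]      # approximate limit of the iteration
```

Because each column of B sums to 1, every vector in `history` sums to 1 as well.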

25 Special Cases of Hyperlinks on the Web There is no link out of {2, 3, 4}

26 Special Cases of Hyperlinks on the Web Node 6 is a dangling node, with no outlink. Possible solution: set each element of column 6 of B to 1/n, but this ruins the sparsity of matrix B.
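One way to see the effect of a dangling node, and to apply the 1/n column fix without storing a dense column, is sketched below. The graph and the function names are invented for illustration. Spreading the dangling page's weight evenly over all pages gives the same result as replacing its column of B with 1/n.

```python
def step(links, w):
    """One step of w = Bw using adjacency lists; weight on dangling pages is lost."""
    new_w = [0.0] * len(w)
    for source, targets in links.items():
        for target in targets:
            new_w[target] += w[source] / len(targets)
    return new_w


def step_with_dangling_fix(links, w):
    """Same step, but weight on dangling pages is shared equally by all pages."""
    n = len(w)
    dangling = sum(w[p] for p, targets in links.items() if not targets)
    return [x + dangling / n for x in step(links, w)]


# Hypothetical graph in which page 2 has no out-links.
links = {0: [1], 1: [2], 2: []}
w0 = [1 / 3] * 3
leaky = step(links, w0)                    # the weights no longer sum to 1
fixed = step_with_dangling_fix(links, w0)  # the weights sum to 1 again
```

B itself stays sparse because only the single number `dangling` is added to every element.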

27 Google PageRank with Damping A user: 1. Starts at a random page on the web. 2a. With probability 1-d, selects any random page and jumps to it. 2b. With probability d, selects a random hyperlink from the current page and jumps to the corresponding page. 3. Repeats Steps 2a and 2b a very large number of times. Pages are ranked according to the relative frequency with which they are visited. [For dangling nodes, always follow 2a.]

28 The PageRank Iteration The basic method iterates using the normalized link matrix, B: $w_k = B w_{k-1}$. The limit $w$ is an eigenvector of $B$. PageRank iterates using a damping factor, $d$: $w_k = (1 - d) w_0 + d B w_{k-1}$, where $w_0$ is a vector with every element equal to 1/n.
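The damped iteration can be sketched with adjacency lists, so that B is never stored as a dense matrix. The graph, the tolerance, and the stopping rule are assumptions made for this example. Dangling pages are handled by always taking the random jump, which is one reading of the rule for dangling nodes in the damped model.

```python
def pagerank(links, n, d=0.85, tol=1e-10, max_iter=500):
    """Iterate w_k = (1 - d) w_0 + d B w_{k-1} over adjacency lists."""
    w0 = [1.0 / n] * n
    w = w0[:]
    for _ in range(max_iter):
        new_w = [(1.0 - d) * x for x in w0]      # random-jump term
        dangling = 0.0
        for source in range(n):
            targets = links.get(source, [])
            if targets:
                share = d * w[source] / len(targets)
                for target in targets:           # link-following term
                    new_w[target] += share
            else:
                dangling += w[source]
        # A surfer on a dangling page always jumps to a random page.
        new_w = [x + d * dangling / n for x in new_w]
        change = sum(abs(a - b) for a, b in zip(new_w, w))
        w = new_w
        if change < tol:
            break
    return w


# Hypothetical graph: page 4 is dangling, pages 3 and 4 have no in-links.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2], 4: []}
ranks = pagerank(links, 5)
best = max(range(5), key=lambda p: ranks[p])
```

Each pass touches every link once, so the work per iteration grows with the number of links rather than with the square of the number of pages.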

29 The PageRank Iteration The iteration expression with damping can be rewritten. Let $R$ be a matrix with every element equal to 1/n. Then $R w_{k-1} = w_0$ (because the sum of the elements of $w_{k-1}$ equals 1). Let $G = dB + (1-d)R$ ($G$ is called the Google matrix). The iteration formula $w_k = (1-d) w_0 + d B w_{k-1}$ is equivalent to $w_k = G w_{k-1}$, so that $w$ is an eigenvector of $G$.
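The equivalence can be checked in two short steps, using only the definitions of $R$, $G$, and $w_0$ given on this slide. First, every element of $R w_{k-1}$ is $1/n$ times the sum of the elements of $w_{k-1}$, which is $1/n$. Then substitute into $G w_{k-1}$:

```latex
\begin{aligned}
(R\,w_{k-1})_i &= \sum_{j=1}^{n} \frac{1}{n}\,(w_{k-1})_j
                = \frac{1}{n}
                \quad\Longrightarrow\quad R\,w_{k-1} = w_0, \\[4pt]
G\,w_{k-1} &= \bigl(dB + (1-d)R\bigr)\,w_{k-1} \\
           &= dB\,w_{k-1} + (1-d)\,R\,w_{k-1} \\
           &= dB\,w_{k-1} + (1-d)\,w_0 \;=\; w_k .
\end{aligned}
```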

30 Iterate with Damping Iterate: $w_k = G w_{k-1}$ (d = 0.7). [Table: successive vectors $w_0, w_1, w_2, w_3, \ldots$ converging to $w$, starting from elements of 0.17 in $w_0$.]

31 Convergence of the Iteration The following results can be proved for the Google matrix, $G$. (See, for example, Langville and Meyer.) The iteration always converges. The largest eigenvalue is $\lambda_1 = 1$. The value of $\lambda_2$, the second largest eigenvalue, depends on $d$. As $d$ approaches 1, $\lambda_2$ also approaches 1. The rate of convergence depends on $(\lambda_2 / \lambda_1)^k$, where $k$ is the number of iterations.

32 Computational Efficiency B is a very sparse matrix. Let the average number of outlinks per page = p. Each iteration of $w_k = B w_{k-1}$ requires O(np) = O(n) multiplications. G is a dense matrix. Each iteration of $w_k = G w_{k-1}$ requires $O(n^2)$ multiplications. But each iteration of $w_k = (1-d) w_0 + d B w_{k-1}$ requires O(n) multiplications. Therefore this is the form used in practical computations.

33 Choice of d Conceptually, values of d that are close to 1 are desirable as they emphasize the link structure of the Web graph, but... The rate of convergence of the iteration decreases as d approaches 1. The sensitivity of PageRank to small variations in data increases as d approaches 1. It is reported that Google uses a value of d = 0.85 and that the computation converges in about 50 iterations
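The trade-off between d and convergence speed can be made concrete with a rough estimate. The sketch assumes the commonly cited bound that the second eigenvalue of the Google matrix is at most d in magnitude, so the error after k iterations shrinks roughly like d^k. Neither that bound nor the tolerance of 0.001 comes from the slides.

```python
import math


def iterations_for_tolerance(d, tol):
    """Smallest k with d**k <= tol, assuming the error shrinks like d**k."""
    return math.ceil(math.log(tol) / math.log(d))


# Estimated iteration counts for several damping factors at tolerance 0.001.
estimates = {d: iterations_for_tolerance(d, 1e-3) for d in (0.5, 0.7, 0.85, 0.99)}
```

For d = 0.85 this estimate is of the same order as the reported figure of about 50 iterations, and it grows steeply as d approaches 1.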

34 Suggested Reading See: Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46, 1999, for descriptions of all these methods. Book: Amy Langville and Carl Meyer, Google's PageRank and Beyond: the Science of Search Engine Rankings. Princeton University Press, Or take: CS/Info 685, The Structure of Information Networks