CS 502: Computing Methods for Digital Libraries, Lecture 16: Web search engines


1 CS 502: Computing Methods for Digital Libraries, Lecture 16: Web search engines

2 Administration
- Modem cards for laptops: collect from Upson 311
- Assignment 3: due April 4 at 10 p.m.

3 Web Crawlers
A web crawler builds an index of web pages by repeating a few basic steps:
1. Maintain a list of known URLs, recording whether or not the corresponding pages have yet been indexed.
2. Select the URL of an HTML page that has not been indexed.
3. Retrieve the page and bring it back to a central computer.
4. An automatic indexing program creates an index record, which is added to the overall index.
5. Add the hyperlinks from the page to other pages to the list of URLs for future exploration.
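The cycle above can be sketched in Python. Here `fetch(url)` and `index(url, html)` are hypothetical callbacks standing in for the page-retrieval and automatic-indexing components; only the URL bookkeeping is spelled out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_urls, fetch, index, limit=100):
    """Repeat the basic crawler cycle: pick an unindexed URL, fetch the
    page, index it, and queue its hyperlinks for future exploration."""
    frontier = deque(seed_urls)   # known URLs not yet indexed
    seen = set(frontier)          # all URLs ever added to the list
    while frontier and limit > 0:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:          # page could not be retrieved
            continue
        index(url, html)          # add an index record to the overall index
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:  # hyperlinks drive future exploration
                seen.add(link)
                frontier.append(link)
        limit -= 1
```

A real crawler would add politeness delays, robots.txt checks, and error handling around `fetch`; the loop structure is the same.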

4 Web Crawlers
Design questions:
- What to collect: complex web sites, dynamic pages
- How fast to collect: frequency of sweep, how often to try
- How to manage parallel crawlers

5 Robots Exclusion
Example file: /robots.txt

# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/      # these will soon disappear
Disallow: /foo.html

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
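A crawler can test URLs against a file like this with Python's standard `urllib.robotparser`. Here the example file is supplied as an inline string rather than fetched, so the sketch runs offline:

```python
import urllib.robotparser

# The robots.txt example from the slide, as an inline string.
robots_txt = """\
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ordinary robots are excluded from /cyberworld/map/ ...
print(rp.can_fetch("*", "http://www.example.com/cyberworld/map/index.html"))  # False
# ... but cybermapper's empty Disallow line permits everything.
print(rp.can_fetch("cybermapper", "http://www.example.com/cyberworld/map/index.html"))  # True
# Paths not mentioned in the file are allowed to everyone.
print(rp.can_fetch("*", "http://www.example.com/about.html"))  # True
```

In a live crawler one would call `rp.set_url(".../robots.txt")` and `rp.read()` instead of parsing an inline string.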

6 Automatic Indexing
Automatic indexing of the web at its most basic: millions of pages created by thousands of people, each with a different concept of how information should be structured. Typical web pages provide meager clues for automatic indexing. Some creators and publishers are even deliberately misleading: they fill their pages with terms that are likely to be requested by users.

7 An Example: AltaVista 1997
Results for the query "digital library concepts":

Key Concepts in the Architecture of the Digital Library
William Y. Arms, Corporation for National Research Initiatives, Reston, Virginia...
http://www.dlib.org/dlib/July95/07arms.html - size 16K - 7-Oct-96 - English

Repository References
Notice: HyperNews at union.ncsa.uiuc.edu will be moving to a new machine and domain very soon. Expect interruptions. Repository References. This is a page.
http://union.ncsa.uiuc.edu/HyperNews/get/www/repo/references.html - size 5K - 12-May-95 - English

8 Meta Tags
Elements within the HTML header that describe a page to indexing programs, for example:
<meta name="description" content="A short summary of the page">
<meta name="keywords" content="digital libraries, web search">

9 Searching the Web Index
Web search programs use standard methods of information retrieval:
- Index records are of low quality.
- Users are untrained, so search programs identify all records that vaguely match the query and supply them to the user in ranked order.
- Indexes are organized for efficient searching by large numbers of simultaneous users.
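The "identify everything that vaguely matches, then rank" behavior can be sketched with a minimal term-overlap scorer. The records and query here are invented for illustration; real engines use far richer scoring.

```python
# Toy document collection: record id -> text.
records = {
    "d1": "web search engines build an index of web pages",
    "d2": "digital libraries and information retrieval",
    "d3": "search the index and rank the results for the user",
}

def rank(query, records):
    """Return ids of all records sharing any term with the query,
    highest term overlap first."""
    q_terms = set(query.lower().split())
    scored = []
    for doc_id, text in records.items():
        overlap = len(q_terms & set(text.lower().split()))
        if overlap:                      # keep anything that vaguely matches
            scored.append((overlap, doc_id))
    return [doc_id for overlap, doc_id in sorted(scored, reverse=True)]

print(rank("web search engine index", records))  # → ['d1', 'd3']
```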

10 Searching the Web Index
Difficulties:
- User interface
- Duplicate elimination
- Ranking algorithms

11 Page Ranks (Google)
Build a citation matrix for pages P1 to P6: rows are the cited pages, columns are the citing pages, and entry (i, j) is 1 when citing page Pj links to cited page Pi. The number of links from each citing page (the column totals) is 2, 1, 4, 1, 2, 2.

12 Normalize by Number of Links from Page
Divide each column of the citation matrix by the number of links from that citing page (2, 1, 4, 1, 2, 2), so each 1 in column Pj becomes 1/2, 1, 1/4, 1, 1/2, or 1/2 respectively. Call the resulting matrix B; each of its columns sums to 1.

13 Weighting of Pages
Initially all pages have weight 1: w1 = (1, 1, 1, 1, 1, 1). Recalculate the weights as w2 = B w1, and keep iterating until w = Bw.

14 Google Ranks
w is the dominant eigenvector of B. It ranks the pages by the links to them, normalized by the number of links from each citing page and weighted by the rank of the citing pages. Google:
- calculates the ranks for all pages (about 450 million)
- lists hits in rank order
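The normalize-and-iterate procedure of slides 11 to 13 can be sketched in Python. The slides' own six-page matrix is not fully recoverable from this transcript, so the link matrix below is an invented four-page example; the column normalization and the w = Bw iteration follow the slides.

```python
# Hypothetical four-page web: links[i][j] == 1 means page j links to page i.
links = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
n = len(links)

# Normalize each column by the number of links from the citing page,
# giving the column-stochastic matrix B (each column sums to 1).
col_sums = [sum(links[i][j] for i in range(n)) for j in range(n)]
B = [[links[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

# Initially all pages have weight 1; iterate w <- Bw until w = Bw.
w = [1.0] * n
for _ in range(200):
    w_next = [sum(B[i][j] * w[j] for j in range(n)) for i in range(n)]
    if max(abs(a - b) for a, b in zip(w_next, w)) < 1e-12:
        break
    w = w_next

print(w)  # the dominant eigenvector of B: page 0, with the most links, ranks highest
```

Because B is column-stochastic the iteration preserves the total weight, so the converged w still sums to the number of pages.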

15 Computer Science Research
- Academic research
- Industrial R&D
- Entrepreneurs

16 Example: Web Search Engines
Lycos (Mauldin, Carnegie Mellon)
Technical basis:
- Research in text skimming (Ph.D. thesis)
- Pursuit free-text retrieval engine (TREC)
- Robot exclusion research (private interest)
Organizational basis:
- Center for Machine Translation
- Grant flexibility (DARPA)

17 Example: Web Search Engines
Google (Page and Brin, Stanford)
Technical basis:
- Research in ranking hyperlinks (Ph.D. research)
Organizational basis:
- Grant flexibility (NSF Digital Libraries Initiative)
- Equipment grant (Hewlett-Packard)

18 The Internet Graph
Theoretical research:
- Graph theory: six degrees of separation, Pareto distributions
- Algorithms: hubs and authorities (Kleinberg, Cornell)
Empirical data:
- Commercial (Yahoo!, Google, Alexa, AltaVista, Lycos)
- Not-for-profit (Internet Archive)
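Kleinberg's hubs-and-authorities algorithm (HITS), mentioned above, scores pages by mutual reinforcement: good hubs point at good authorities, and good authorities are pointed at by good hubs. The four-page graph here is invented for illustration.

```python
# Tiny hypothetical web graph: edges[j] lists the pages that page j links to.
edges = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
nodes = sorted(edges)

hub = {v: 1.0 for v in nodes}
auth = {v: 1.0 for v in nodes}
for _ in range(50):
    # A page's authority score sums the hub scores of pages linking to it.
    auth = {v: sum(hub[u] for u in nodes if v in edges[u]) for v in nodes}
    # A page's hub score sums the authority scores of pages it links to.
    hub = {v: sum(auth[t] for t in edges[v]) for v in nodes}
    # Normalize so the scores stay bounded across iterations.
    a_norm = sum(x * x for x in auth.values()) ** 0.5
    h_norm = sum(x * x for x in hub.values()) ** 0.5
    auth = {v: x / a_norm for v, x in auth.items()}
    hub = {v: x / h_norm for v, x in hub.items()}
```

Page 2, which three pages link to, emerges as the strongest authority, while page 0, which links to two strong pages, emerges as the strongest hub.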

19 Google Statistics
- The central system handles 5.5 million searches daily, increasing 20% per month.
- 2,500 PCs running Linux; 80 terabytes of spinning disk; an average of 30 new machines added per day.
- The cache holds about 200 million HTML pages.
- The aim is to crawl the web once per month.
- 85 people; half are technical; 14 have a Ph.D. in computer science.
- Comparison: Yahoo! has 100,000,000 registered users and dispatches half a billion pages to users per day.

