Searching the World Wide Web S. Lawrence and C.L. Giles Presented by Robert Cadwgan-Evans, Simon Munday
Introduction
Analyse the paper
  Coverage of search engines
  Size of the Indexable Web
Consider search and Internet development from 1998 to today
The future of searching
Paper Outline
Published April 1998; data collected in 1997
Investigates the comparative coverage of the internet by the major search engines of the time
Attempts to put a figure on the size of the web
Important because it provides a way to measure the size of the web
Introduction Slide: What is the paper about, and why we picked it...
Search Engine Coverage: The Test
575 queries submitted to each of six engines: AltaVista, Excite, HotBot, Infoseek, Lycos, Northern Light
The results from every engine are merged into one list of unique results from all queries
Coverage: percentage of the unique list that an individual engine returns in its queries
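The coverage measure described on this slide can be sketched in a few lines. This is only an illustration of the definition, not the paper's actual tooling: the function name `coverage`, the engine labels and the toy URLs are all made up for the example.

```python
def coverage(results_by_engine):
    """Return each engine's share (in percent) of the combined unique result list.

    results_by_engine maps an engine name to the set of result URLs it
    returned across all queries (hypothetical input format).
    """
    # Merge every engine's results into one list of unique results.
    all_unique = set().union(*results_by_engine.values())
    if not all_unique:
        # No results at all: report zero coverage rather than divide by zero.
        return {engine: 0.0 for engine in results_by_engine}
    return {
        engine: 100.0 * len(set(urls)) / len(all_unique)
        for engine, urls in results_by_engine.items()
    }


# Toy data, invented for illustration only.
toy_results = {
    "EngineA": {"u1", "u2", "u3"},
    "EngineB": {"u3", "u4"},
}
shares = coverage(toy_results)
for engine, percent in shares.items():
    print(f"{engine}: {percent:.1f}%")
```

With four unique results in total, the toy EngineA returns three of them and EngineB two, so the shares do not need to sum to 100%: engines can overlap.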
Search Engine Coverage: Results
Results of search engine coverage using this test:

Search Engine     Coverage (%)
HotBot            57.5
AltaVista         46.5
Northern Light    32.9
Excite            23.1
Infoseek          16.5
Lycos             4.41

Even the most successful of the engines, HotBot, doesn’t manage to cover two thirds of the result set from all engines
Size of the Indexable Web: Method
Estimated from an analysis of the overlap between search engines
N     Set of indexable web pages
Na    Set of results returned by search engine A
Nb    Set of results returned by search engine B
N0    Set of results returned by both A and B (the overlap)
An estimate of the fraction of the indexable web covered by engine A can be calculated from the sizes of these sets:
Pa = N0 / Nb
From this fraction, an estimate of the overall size of the indexable web, N, can be calculated, where Sa is the total number of pages indexed by engine A:
N = Sa / Pa
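The two formulas can be worked through with a small sketch. It assumes that N0 and Nb are used as counts of results and that Sa is the number of pages engine A reports indexing; the function name `estimate_web_size` and all numbers below are invented for illustration and are not figures from the paper.

```python
def estimate_web_size(results_a, results_b, indexed_by_a):
    """Estimate the size of the indexable web from the overlap of two engines.

    results_a, results_b: sets of result URLs returned by engines A and B.
    indexed_by_a: Sa, the number of pages engine A indexes (assumed known).
    """
    if not results_b:
        raise ValueError("engine B returned no results")
    overlap = len(results_a & results_b)   # N0: results returned by both engines
    if overlap == 0:
        raise ValueError("no overlap, so the estimate is undefined")
    p_a = overlap / len(results_b)         # Pa = N0 / Nb
    return indexed_by_a / p_a              # N = Sa / Pa


# Toy data, invented for illustration only.
results_a = {"u1", "u2", "u3", "u4"}
results_b = {"u3", "u4", "u5", "u6", "u7", "u8", "u9", "u10"}
estimate = estimate_web_size(results_a, results_b, indexed_by_a=100)
print(estimate)
```

Here engine A finds 2 of B's 8 results, so Pa is 0.25; if A indexes 100 pages, the estimated web size is 100 / 0.25 = 400 pages. A smaller overlap would push the estimate up, a larger one would pull it down.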
Size of the Indexable Web: Examples
Little overlap suggests each engine is missing many results, so neither covers much of the web
Large overlap suggests both result sets are almost complete, so each must contain most of the web
The method works on the assumption that the engines' results are random and independent
Size of the Indexable Web: Results
Comparison between pairs of search engines

Search Engines                    Indexable Web (millions of pages)
Lycos and Infoseek                90
Infoseek and Excite               220
Excite and Northern Light         230
Northern Light and AltaVista
AltaVista and HotBot              320

Paper selects the largest of these, 320 million pages, as an estimate for the size of the indexable web
Paper Summary
Paper admits the size is an estimate; the actual figure is probably larger
Query terms are based upon scientists' searching habits, not those of the general public
This estimate suggests that previous estimates of as little as 75 million pages are incorrect
Results of the test from the paper
Current Technology
Newcomers: Google, Yahoo, MSN and Ask Jeeves
Size of the web has exploded in the last 5 years [1]
Dot com boom…
Size of the Web Today
Up-to-date and accurate measurement is difficult, but current figures put the size of the web at around 11.5 billion pages [2]
Of these, 9.4 billion pages are currently indexed [2]
Google indexes 8 billion pages, but also takes searching further, indexing 880 million images [3]
Does a bigger index mean better quality results?
Larger index could hamper performance [4]
Where we got the figures from
Specialized Search Engines
With such big search engines providing general results, more specialized search engines have emerged:
The Future
The Deep Web – refers to the databases from which dynamic pages are created
Over 200,000 deep websites exist [5]
Examples include eBay and Amazon
Deep Web is 400 to 550 times larger than the “surface web” [5]
Conclusion
Estimating the size of the web is difficult, and an exact measurement is not yet possible
Paper does a good job of showing previous estimates are far too low (even if its own is low)
The inclusion of the deep web will only make the problem harder
References
1. Search Engine Sizes, D. Sullivan, January 2005, http://searchenginewatch.com/reports/article.php/2156481
2. The Indexable Web is More than 11.5 Billion Pages, A. Gulli and A. Signorini, 2005, http://citeseer.ist.psu.edu/gulli05indexable.html
3. Google Product Descriptions, http://www.google.co.uk/press/descriptions.html
4. Accessibility of Information on the Web, S. Lawrence and C. Giles, Nature, 400:107-109, 1999
5. The Deep Web: Surfacing Hidden Value, Michael K. Bergman, 2001, http://beta.brightplanet.com/deepcontent/turtorials/DeepWeb/index.asp