Searching the World Wide Web

1 Searching the World Wide Web
S. Lawrence and C.L. Giles
Presented by Robert Cadwgan-Evans, Simon Munday

2 Introduction
Analyse the paper
Coverage of search engines
Size of the indexable web
Consider search and Internet development from 1998 to today
The future of searching

3 Paper Outline
Published April 1998, data collected in 1997
Investigates the comparative coverage of the Internet by the major search engines of the time
Attempts to put a figure on the size of the web
Important because it provides a way to measure the size of the web
Introduction slide: what the paper is about, and why we picked it

4 Search Engine Coverage: The Test
575 queries were submitted to each of six engines: AltaVista, Excite, HotBot, Infoseek, Lycos and Northern Light
The results from every engine are merged into a single list of unique results from all queries
Coverage: the percentage of the unique list that an individual engine returns in its queries
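The coverage measure described on this slide can be sketched in a few lines of Python. This is only an illustration of the definition, not the paper's code: the function name `coverage`, the engine names and the URLs are all made up for the example.

```python
def coverage(results_by_engine):
    """Return each engine's share (in %) of the merged unique result list."""
    # Merge every engine's results into one set of unique URLs.
    unique_results = set()
    for urls in results_by_engine.values():
        unique_results |= set(urls)
    if not unique_results:
        # No results at all: report zero coverage for every engine.
        return {engine: 0.0 for engine in results_by_engine}
    return {
        engine: 100.0 * len(set(urls)) / len(unique_results)
        for engine, urls in results_by_engine.items()
    }


# Toy data: hypothetical engines and URLs, not the paper's results.
toy_results = {
    "EngineA": {"u1", "u2", "u3"},
    "EngineB": {"u2", "u4"},
}
toy_coverage = coverage(toy_results)
print(toy_coverage)
```

With four unique URLs in total, the toy engine that returns three of them gets 75% coverage and the one that returns two gets 50%.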

5 Search Engine Coverage: Results
Results of search engine coverage using this test:
Search Engine     Coverage (%)
HotBot            57.5
AltaVista         46.5
Northern Light    32.9
Excite            23.1
Infoseek          16.5
Lycos             4.41
Even the most successful of the engines, HotBot, does not manage to cover two thirds of the result set from all engines

6 Size of the Indexable Web: Method
Estimated from an analysis of the overlap between search engines
N: size of the indexable web
N_a: number of results returned by search engine A
N_b: number of results returned by search engine B
N_0: number of results returned by both A and B (the overlap)
Assuming B's results are an independent sample of the web, the fraction of them that A also returns estimates the fraction of the indexable web covered by engine A:
P_a = N_0 / N_b
From this fraction an estimate for the overall size of the indexable web, N, can be calculated, where S_a is the number of pages indexed by engine A:
N = S_a / P_a
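A minimal sketch of the overlap estimate above, assuming that the undefined symbol Sa stands for the total number of pages engine A has indexed. The function name `estimate_web_size` and all the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def estimate_web_size(results_a, results_b, pages_indexed_by_a):
    """Estimate the size of the indexable web from the overlap of two engines."""
    results_a, results_b = set(results_a), set(results_b)
    overlap = len(results_a & results_b)
    if overlap == 0:
        raise ValueError("no overlap: the estimate is undefined")
    # Fraction of B's results that A also returned; used as A's coverage.
    p_a = overlap / len(results_b)
    # Scale A's index size up by its estimated coverage.
    return pages_indexed_by_a / p_a


# Made-up result sets: 60 results each, 20 of them shared.
results_a = set(range(0, 60))
results_b = set(range(40, 100))
estimate = estimate_web_size(results_a, results_b, pages_indexed_by_a=100_000_000)
print(round(estimate))
```

Here A returns one third of B's results, so an assumed index of 100 million pages scales up to an estimate of about 300 million.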

7 Size of the Indexable Web: Examples
Little overlap suggests that each engine is missing many results, so not much of the web is covered
Large overlap suggests that the result sets are almost complete, so they must contain most of the web
Works on the assumption that each engine samples the web randomly and independently
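The role of the independence assumption can be illustrated with a small simulation. This is a sketch under invented parameters (a "web" of 10,000 pages and made-up indexing probabilities), not a reproduction of the paper's experiment; `simulate_estimate` is a hypothetical helper.

```python
import random


def simulate_estimate(n_pages, p_a, p_b_if_in_a, p_b_if_not_in_a, seed=0):
    """Index a synthetic web with two engines and return the overlap estimate."""
    rng = random.Random(seed)
    size_a = size_b = overlap = 0
    for _ in range(n_pages):
        in_a = rng.random() < p_a
        # B's chance of indexing a page may depend on whether A indexed it.
        in_b = rng.random() < (p_b_if_in_a if in_a else p_b_if_not_in_a)
        size_a += in_a
        size_b += in_b
        overlap += in_a and in_b
    if overlap == 0:
        raise ValueError("no overlap: the estimate is undefined")
    # size_a / (overlap / size_b), i.e. index size divided by estimated coverage.
    return size_a * size_b / overlap


true_size = 10_000
# Independent engines: B indexes a page with the same chance either way.
independent = simulate_estimate(true_size, 0.5, 0.3, 0.3)
# Correlated engines: B favours pages that A has also indexed.
correlated = simulate_estimate(true_size, 0.5, 0.5, 0.1)
print(independent, correlated)
```

When the two engines are independent the estimate lands close to the true size. When both engines tend to index the same pages, the overlap is inflated and the estimate falls well below the true size, so such an estimate is better read as a lower bound.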

8 Size of the Indexable Web: Results
Comparison between pairs of search engines:
Search Engines                  Indexable Web (millions of pages)
Lycos and Infoseek              90
Infoseek and Excite             220
Excite and Northern Light       230
Northern Light and AltaVista
AltaVista and HotBot            320
Paper selects the largest of these, 320 million pages, as its estimate for the size of the indexable web

9 Paper Summary
Paper admits the size is an estimate; the actual figure is probably larger
Query terms are based on scientists' searching habits, not those of the general public
This estimate suggests that previous estimates of as little as 75 million pages are incorrect
Results of the test from the paper

10 Current Technology
Newcomers: Google, Yahoo, MSN and Ask Jeeves
Size of the web has exploded in the last 5 years [1]
Dot com boom…

11 Size of the Web Today
Up-to-date and accurate measurement is difficult
But current figures put the size of the web at around 11.5 billion pages [2]
9.4 billion pages are currently indexed [2]
Google indexes 8 billion pages, but also takes searching further, indexing 880 million images [3]
Does a bigger index mean better quality results?
Larger index could hamper performance [4]
Where we got the figures from

12 Specialized Search Engines
With such large search engines providing general results, more specialized search engines have emerged:

13 The Future
The Deep Web: refers to the databases from which dynamic pages are created
Over 200,000 deep websites exist [5]
Examples include eBay and Amazon
Deep Web is 400 to 550 times larger than the “surface web” [5]

14 Conclusion
Estimating the size of the web is difficult and, as yet, not possible
Paper does a good job of showing previous estimates are far too low (even if its own is low)
The inclusion of the deep web will only make the problem harder

15 References
1. Search Engine Sizes, D. Sullivan, January 2005
2. The Indexable Web is More than 11.5 Billion Pages, A. Gulli and A. Signorini, 2005
3. Google Product Descriptions
4. Accessibility of Information on the Web, S. Lawrence and C. Giles, Nature, 400, 1999
5. The Deep Web: Surfacing Hidden Value, Michael K. Bergman, 2001

