1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit Brussel, and Universitaire Instelling Antwerpen, Belgium Hanneke Smulders Infomare Consultancy, The Netherlands http://www.cwi.nl/cwi/projects/IRT Presented at Internet Librarian International 2000 in London, England, March 2000
2 Fluctuations in document accessibility - summary
Search engines are often compared on the basis of their size, i.e. the number of documents indexed in their databases. However, searchers should be aware that documents cannot be retrieved reliably: unexpected and annoying fluctuations exist in the result sets returned by most search engines.
Ideally, fluctuations are caused only by alterations in the Web (documents come and go). However, in some cases they are caused by changes in indexing policy (indexing fluctuations), and in some cases the origin is more obscure: documents are expected but not retrieved.
We have investigated these obscure fluctuations by searching repeatedly, over the course of a year, for several identical test documents. The documents were placed on different sites and remained unchanged. The influence of changes in the engines' indexing policy was excluded. We consider two kinds of obscure fluctuations:
1. Document fluctuations appear when test documents disappear from the database of indexed documents (for whatever reason).
2. Element fluctuations appear when test documents that still exist in the database do not show up in result sets where they should.
This presentation reports the results of our tests from October 1998 until December 1999. We have evaluated 13 engines: AltaVista, EuroFerret, Excite, HotBot, InfoSeek, Lycos, MSN, NorthernLight, Snap, WebCrawler and 3 national Dutch engines: Ilse, Search.nl and Vindex. The outcome of our investigation is particularly important for known-item searches.
4 Internet-based information sources: how many? how much? In 2000: »about 1 billion (= 1000 million) unique URLs on the whole Internet »about 10 terabytes (= 10 000 gigabytes) of text data
5 Internet information retrieval systems in 2000 Several types of systems exist to retrieve information: »Directories of selected sources, categorised by subject and made by humans, mainly for browsing. »Search systems, based on databases with machine-made indexes, for word-based searching! »Meta-search or multi-threaded search systems. We have studied and compared several well-known international (and a few national) word-based Internet search engines.
6 Internet information retrieval systems: evaluation criteria Many aspects/criteria can be considered in the evaluation of an Internet search engine, including »coverage of documents present on the WWW (studies exist) »number of elements of a document that are indexed to make them usable for retrieval »fluctuations over time in the result sets offered by a search engine We started by studying the depth of indexing and were soon confronted with fluctuations in performance.
7 Internet information retrieval systems: our research group The following persons have been involved in the research: »Louise Beijer (Hogeschool van Amsterdam, The Netherlands) »Hans de Bruin (Unilever Research Laboratorium, Vlaardingen, The Netherlands) »Hans de Man (JdM Documentaire Informatie, Vlaardingen, The Netherlands) »Rudy Dokter (PNO Consultants, Hengelo, The Netherlands) »Marten Hofstede (Rijksuniversiteit Leiden, The Netherlands) »Wouter Mettrop (CWI, Amsterdam, The Netherlands) »Paul Nieuwenhuysen (Vrije Universiteit Brussel, Belgium) »Eric Sieverts (Hogeschool van Amsterdam, and RUU, The Netherlands) »Hanneke Smulders (Infomare, Terneuzen, The Netherlands) »Hans van der Laan (Consultant, Leiderdorp, The Netherlands) »Ditmer Weertman (ADLIB, Utrecht, The Netherlands)
8 Internet search engines: research on indexing functionality assessing the indexing functionality »test document »test method conclusions concerning indexing functionality
9 Number of our test documents that were retrieved
10 Internet search engines: elements of test document studied »title tag »META-tags: keywords, description and author »comment tag »ALT tag »text/URL of a link to a document »H3 tag »table header »text of: an internal link, a reference anchor, a link to a sound file »name of a sound file (au/wav/aiff/ra) »text of a link to an image »name of an image file (gif or jpg; inline or linked to) »name of a Java applet (with or without extension class) »terms after the first 100 lines in a document (200/…/700) »the URL of a document
11 Internet search engines: part of the test document source code Test pagina
12 Number of the studied document elements that were indexed
13 Internet search engines: reachability »14 528 queries were sent to 13 search engines. »721 times an engine was unreachable. »The percentage of unreachability per engine varies from nearly 0% to nearly 15%. »Overall, the studied search engines were reachable for 95% of the queries.
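The overall reachability figure follows directly from the two counts on this slide. A minimal arithmetic check, using only those two numbers (the function name `reachability` is our own, chosen for illustration):

```python
# Figures taken from the slide: total queries sent and unreachable responses.
TOTAL_QUERIES = 14_528
UNREACHABLE = 721


def reachability(total: int, unreachable: int) -> float:
    """Return the fraction of queries for which the engine could be reached."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - unreachable) / total


overall = reachability(TOTAL_QUERIES, UNREACHABLE)
print(f"unreachable: {1 - overall:.1%}, reachable: {overall:.1%}")
# prints "unreachable: 5.0%, reachable: 95.0%"
```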
14 Search engine indexing functionality: conclusions Not all of the web is indexed. »Not all of our test documents. »Not all HTML elements of our test document. Some of the studied search engines showed changes in the indexing policy. No relation between the number of indexed test documents or HTML elements and the size of a search engine was found during our study.
15 Internet search engines: fluctuations - definition A fluctuation appears when the result set of an observation, i.e. » one query or » a set of queries, misses documents with respect to a frame of reference, i.e. » other observations and » knowledge about Web reality
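Read as set arithmetic, this definition says that a fluctuation is a non-empty difference between the frame of reference and the result set. The sketch below illustrates that reading only; the function names and the example URLs are hypothetical and do not come from the study.

```python
from __future__ import annotations


def missed_documents(result_set: set[str], frame_of_reference: set[str]) -> set[str]:
    """Documents the frame of reference says should be there but are not."""
    return frame_of_reference - result_set


def has_fluctuation(result_set: set[str], frame_of_reference: set[str]) -> bool:
    """A fluctuation appears when at least one expected document is missed."""
    return bool(missed_documents(result_set, frame_of_reference))


# Hypothetical example: two test documents are known to exist and were
# retrieved in other observations, but this observation returns only one.
frame = {"http://site-a.example/test.html", "http://site-b.example/test.html"}
observed = {"http://site-a.example/test.html"}

print(missed_documents(observed, frame))
print(has_fluctuation(observed, frame))
```

Documents in the result set that are not in the frame of reference are deliberately ignored here, since the definition speaks only of missed documents.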
16 Internet search engines: detecting fluctuations Through time: comparing result sets of one observation, repeatedly performed » Observation = one query or set of queries » Frame of reference = other observations & web-knowledge One moment: consistency of result sets » Observation = one query in set of queries » Frame of reference = other observations
17 Internet search engines: types of fluctuations Through time: comparing result sets of one observation repeatedly performed » Document fluctuations » Indexing fluctuations One moment: consistency of result sets » Element fluctuations
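One way to picture the difference between the two obscure kinds of fluctuation is the rough sketch below. It is not the authors' procedure: the rule used (a document seen earlier but retrieved by no query in the round counts as a document fluctuation; a document retrieved by some query in the round but missing from another query that should return it counts as an element fluctuation) is our simplified assumption, and all names and identifiers are hypothetical.

```python
from __future__ import annotations


def classify_round(
    expected: dict[str, set[str]],
    retrieved: dict[str, set[str]],
    seen_before: set[str],
) -> tuple[set[str], dict[str, set[str]]]:
    """Classify the misses of one round of queries (simplified assumption).

    expected:    query -> documents that query should return
    retrieved:   query -> documents that query actually returned
    seen_before: documents retrieved in an earlier round
    """
    # Documents that at least one query found are evidently in the database.
    found_this_round = set().union(*retrieved.values())

    # Through time: known documents that no query retrieves any more.
    document_fluctuations = seen_before - found_this_round

    # One moment: documents in the database that a specific query misses.
    element_fluctuations: dict[str, set[str]] = {}
    for query, docs in expected.items():
        missing = (docs & found_this_round) - retrieved.get(query, set())
        if missing:
            element_fluctuations[query] = missing
    return document_fluctuations, element_fluctuations


# Hypothetical round with two queries and three known test documents.
expected = {"q_title": {"d1", "d2"}, "q_alt": {"d1", "d2"}}
retrieved = {"q_title": {"d1", "d2"}, "q_alt": {"d1"}}
doc_fl, elem_fl = classify_round(expected, retrieved, seen_before={"d1", "d2", "d3"})
print(doc_fl)   # {'d3'}
print(elem_fl)  # {'q_alt': {'d2'}}
```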
27 Percentage of documents missed due to fluctuations
28 Internet search engines: fluctuations - quantitative conclusions »Many element fluctuations go together with many document and indexing fluctuations and with many indexed document elements. »Many document fluctuations do not always go together with many element fluctuations. »Few indexed document elements go together with few element fluctuations.
29 Fluctuations: remarks on correctness Fluctuations can be seen as correct if they reflect alterations in the frame of reference. Which fluctuations are incorrect depends on that frame: »(Web) reality: document, indexing and element fluctuations are all incorrect. »The indexed database of a search engine: only element fluctuations are incorrect. Users do not care about this distinction; they simply miss documents.
30 Fluctuations: remarks on size »No relation was found between document or element fluctuations and the size of an engine. »The percentage of missed documents determines (together with other reducing effects, such as depth of indexing) the effective size of an engine.
31 Internet search engines: conclusions of our research Search engines differ in depth of indexing. Search engines show fluctuations in their result sets: »They are subject to changes in indexing policy (indexing fluctuations). »They forget documents completely (document fluctuations). »They miss documents in their result sets (element fluctuations).
32 Internet search engines: recommendations related to fluctuations Fluctuations are normal; do not be surprised; do not worry. Do not try to find a simple explanation that fully accounts for what happens. Known-item searchers should repeat the search: »when using an engine with many element fluctuations: use other search terms; »when using an engine with many document fluctuations: repeat later. Further research is needed on effective size.
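The two recommendations for known-item searchers can be combined into a simple retry loop: try other search terms within one round, and repeat the whole round later if nothing is found. The sketch below assumes a caller-supplied `search` function that returns the set of URLs for a query; that function, the stand-in engine and all query terms are hypothetical.

```python
from __future__ import annotations

import time
from typing import Callable, Iterable, Optional


def find_known_item(
    target_url: str,
    queries: Iterable[str],
    search: Callable[[str], set[str]],
    rounds: int = 1,
    wait_seconds: float = 0.0,
) -> Optional[str]:
    """Return the first query that retrieves target_url, or None.

    Within a round, alternative search terms are tried (element fluctuations).
    Between rounds, the search is repeated later (document fluctuations).
    """
    query_list = list(queries)  # allow several passes over the terms
    for round_no in range(rounds):
        for query in query_list:
            if target_url in search(query):
                return query
        if round_no < rounds - 1:
            time.sleep(wait_seconds)
    return None


def flaky_engine(query: str) -> set[str]:
    """Hypothetical stand-in engine that misses the document for one term."""
    index = {
        "term_in_title": set(),
        "term_in_body": {"http://site-a.example/test.html"},
    }
    return index.get(query, set())


hit = find_known_item(
    "http://site-a.example/test.html",
    ["term_in_title", "term_in_body"],
    flaky_engine,
)
print(hit)  # term_in_body
```

A realistic `wait_seconds` would be hours or days rather than seconds; the default of zero only keeps the example quick to run.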
34 Mutual influences (part 1 of 2) »Measuring indexing fluctuations is not influenced by document fluctuations, because of the high number of test documents involved (16), and not influenced by element fluctuations, because of the condition we imposed that the query did not retrieve any of the test documents during 4 successive rounds. »Measuring document fluctuations cannot be influenced by element or indexing fluctuations, because of the high number of test queries. »Measuring element fluctuations cannot be influenced by indexing fluctuations or document fluctuations, because all queries are submitted at the same time.
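The condition of 4 successive rounds can be expressed as a search for runs of empty results. The sketch below is an illustration under assumptions: the per-round hit counts are invented, the function name is ours, and any sufficiently long run of zeros is flagged without checking what the query returned before or after.

```python
from __future__ import annotations


def zero_runs(hits_per_round: list[int], min_run: int = 4) -> list[tuple[int, int]]:
    """Return (first, last) round indexes of runs of at least min_run
    successive rounds in which the query retrieved no test document."""
    runs: list[tuple[int, int]] = []
    start = None
    for i, hits in enumerate(hits_per_round):
        if hits == 0:
            if start is None:
                start = i  # a new run of empty rounds begins
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # A run may still be open when the observations end.
    if start is not None and len(hits_per_round) - start >= min_run:
        runs.append((start, len(hits_per_round) - 1))
    return runs


# Invented counts of retrieved test documents (out of 16) for one query.
hits = [16, 16, 0, 0, 0, 0, 0, 16, 0, 16]
print(zero_runs(hits))  # [(2, 6)]
```

The single empty round at index 8 is not flagged, which mirrors the purpose of the condition: a short gap may be an element fluctuation, while a long one points to a change in indexing policy.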
35 Mutual influences (part 2 of 2) On element fluctuations being influenced by document fluctuations: in our test the queries were in fact submitted one after the other. In theory, a document fluctuation can therefore cause element fluctuations when it occurs within the 15 hours that one round of 32 queries takes. It is possible that this happened. But... 1. There is only a negligible chance that the results over 43 rounds contain significant errors in the number of occurrences and the sizes of both kinds of fluctuation. 2. Moreover, analysis of the data shows that this could have happened only when measuring AltaVista, HotBot, MSN, Snap and Vindex. »In the case of AltaVista there is one fluctuation (round 23) that could be either an element or a document fluctuation. In our results it is counted as a series of element fluctuations. If we count it as one document fluctuation instead, the results change slightly: Element fluctuations: number: 7% -> 5%; size: 5% -> 2%. Document fluctuations: number: 7% -> 9%; size: 5% -> 7%. »In the case of HotBot, MSN, Snap and Vindex many rounds show the same fluctuations at the same places. This indicates structural indexing mistakes, so the observed fluctuations are element fluctuations.