Information persistence on the Web Judit Bar-Ilan The Hebrew University and Bar-Ilan University and Bluma Peritz The Hebrew University.


1 Information persistence on the Web Judit Bar-Ilan The Hebrew University and Bar-Ilan University and Bluma Peritz The Hebrew University

2 Web documents
- They are not like printed/written material
  - If preserved, they last “forever”, e.g. the Code of Hammurabi
- They are not like unrecorded phone calls that disappear in the air

3 Web documents
- Can exist only for a limited amount of time
  - Can be removed, or moved to a different location
- Can undergo changes
  - CNN’s main page is updated approx. every 15 minutes
  - The program page for this conference
- Can be temporarily inaccessible
  - Communication/server problems
- The Web is dynamic

4 The Web
- On the one hand, it grows continuously
- On the other hand, it changes constantly: not only are new documents added to it, but
  - Existing documents are removed
  - Existing documents undergo changes in
    - content
    - format
    - linkage

5 Question: How do documents on the Web evolve?
- News pages change very frequently
- How about more “academic” topics?
- As a case study, we analyzed the changes occurring to a set of pages containing the search terms informetric or informetrics over a period of five years
- Almost no other such long-range studies exist
  - Koehler (JASIST, 2002): a “random”, fixed set of Web pages monitored weekly for a period of four years

6 Data collection
- First data collection point (June 1998)
- Data discovery through submission of the query to the largest search engines existing at the time
  - AltaVista, Excite, HotBot, InfoSeek, Lycos and NorthernLight (for exhaustiveness)
- All results retrieved, collated list of URLs created (941 URLs)
- URLs downloaded as soon as possible after searching
- Content of URLs checked for presence of search terms (866 URLs, 91.9%)
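The collation and filtering steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' actual tooling: the engine names, URLs and page texts are hypothetical placeholders, and matching the terms with a case-insensitive regular expression is an assumption.

```python
import re

# Hypothetical result lists, one per search engine (not real study data).
results_by_engine = {
    "engine_a": ["http://example.org/a", "http://example.org/b"],
    "engine_b": ["http://example.org/b", "http://example.org/c"],
}

# Hypothetical page texts standing in for the downloaded documents.
downloaded_pages = {
    "http://example.org/a": "An introduction to informetrics.",
    "http://example.org/b": "Informetric laws and distributions.",
    "http://example.org/c": "A page about something else.",
}

# Assumed matching rule: "informetric" or "informetrics" as a whole word.
TERM_PATTERN = re.compile(r"\binformetrics?\b", re.IGNORECASE)


def collate(results):
    """Merge per-engine result lists into one sorted, duplicate-free URL list."""
    return sorted({url for urls in results.values() for url in urls})


def satisfies_query(text):
    """Return True if the page text contains one of the search terms."""
    return TERM_PATTERN.search(text) is not None


collated = collate(results_by_engine)
matching = [u for u in collated if satisfies_query(downloaded_pages.get(u, ""))]
print(len(collated), len(matching))
```

Checking the downloaded content, rather than trusting the search engine result lists, is what separates the collated list from the smaller set of pages that actually contain the terms.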

7 Data collection (cont.)
- Subsequent data collection points: June 1999, 2002 and 2003
- Search engines were queried as before
  - Set of search engines in 2002 & 2003: AltaVista, Fast, Google, HotBot, Teoma and Wisenut
- List of URLs created, pages downloaded
- Previously identified URLs that were not retrieved by the search engines in the current round were revisited and their contents downloaded
- This method allowed us to monitor previously discovered URLs, while adding new (or newly discovered) URLs to the set
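The bookkeeping for each later collection round can be sketched with plain set operations. The function name and the example URLs are hypothetical; the sketch only assumes that each round has a set of previously known URLs and a set of URLs returned by the search engines.

```python
def plan_collection_round(known_urls, retrieved_urls):
    """Split URLs for one collection round.

    Returns three sorted lists:
    - URLs known from earlier rounds that the engines did not return now
      (these are revisited directly),
    - URLs returned now that were not known before (new or newly discovered),
    - the full monitored set after this round.
    """
    known = set(known_urls)
    retrieved = set(retrieved_urls)
    to_revisit = known - retrieved
    newly_discovered = retrieved - known
    monitored = known | retrieved
    return sorted(to_revisit), sorted(newly_discovered), sorted(monitored)


# Hypothetical example data.
revisit, new, monitored = plan_collection_round(
    known_urls=["http://example.org/a", "http://example.org/b"],
    retrieved_urls=["http://example.org/b", "http://example.org/c"],
)
print(revisit, new, len(monitored))
```

Revisiting the URLs that the engines no longer return keeps a page from being counted as lost merely because it dropped out of a search index.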

8 The observed growth rate during the study period

9 Not only growth …
- Until and including 2002, 5034 URLs were discovered
- In June 2003 only 2850 were still available and satisfied the query
- Thus 37.5% of the URLs (1890 URLs)
  - disappeared or
  - ceased to satisfy the query (topic shift)
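The two kinds of loss named on this slide can be told apart with a small classification helper. This is a sketch under assumptions: the function names and labels are illustrative, and the inputs are taken to be the set of URLs that could still be downloaded and the set whose content still contains the search terms.

```python
def classify(url, available_now, matches_now):
    """Label a previously discovered URL at a later data collection point."""
    if url not in available_now:
        return "disappeared"
    if url not in matches_now:
        return "topic shift"  # still online, but no longer satisfies the query
    return "persisting"


def share(part, total):
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100 * part / total, 1)


# Hypothetical example data.
available = {"http://example.org/a", "http://example.org/b"}
matching = {"http://example.org/a"}
for url in ["http://example.org/a", "http://example.org/b", "http://example.org/c"]:
    print(url, classify(url, available, matching))

# The slide's figure: 1890 of the 5034 discovered URLs.
print(share(1890, 5034))
```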

10 … also modifications
- Out of the URLs satisfying the query at two consecutive data points, about 50% have undergone some kind of modification
  - Text of the source files compared
- Stable dataset compared to random sets
  - e.g. in Koehler’s random set of 361 URLs, no changes were observed for only 3%
- Unstable set compared to digital libraries
  - e.g. PubMedCentral, arXiv, CiteSeer: only 3% of the sample disappeared during the one-year period of observation
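Comparing the text of the source files at two data points can be sketched as below. The slide does not say how the comparison was carried out, so the exact string comparison here is an assumption, and the similarity ratio from Python's standard `difflib` module is only one possible way to gauge how large a change is. The function names and page texts are hypothetical.

```python
import difflib


def is_modified(old_source, new_source):
    """Return True if the two versions of a source file differ at all."""
    return old_source != new_source


def similarity(old_source, new_source):
    """Similarity ratio in [0, 1]; 1.0 means the versions are identical."""
    return difflib.SequenceMatcher(None, old_source, new_source).ratio()


def modified_share(old_snapshots, new_snapshots):
    """Fraction of URLs present at both data points whose source changed."""
    common = old_snapshots.keys() & new_snapshots.keys()
    if not common:
        return 0.0
    changed = sum(is_modified(old_snapshots[u], new_snapshots[u]) for u in common)
    return changed / len(common)


# Hypothetical snapshots keyed by URL.
old = {"http://example.org/a": "<p>informetrics</p>",
       "http://example.org/b": "<p>informetric laws</p>"}
new = {"http://example.org/a": "<p>informetrics</p>",
       "http://example.org/b": "<p>informetric laws, updated</p>"}
print(modified_share(old, new))
```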

11 What is the value of such elusive information???
- Dellavalle et al., Science 2003: Going, going, gone: Lost Internet References
- Bar-Ilan & Peritz, JASIST (to appear): Evolution, Continuity and Disappearance of Documents on a Specific Topic on the Web - A Longitudinal Study of “Informetrics”
  - http://shum.huji.ac.il/~judit/evolution/barilan_peritz_JASIST_notice.pdf
- Internet Archive: saves “snapshots” of the Web at different points in time (Wayback Machine)
  - http://archive.org
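A snapshot of a page can be requested from the Wayback Machine by putting a timestamp and the original address into one URL. The sketch below only builds such an address and makes no network request. The helper name, the example page and the date are illustrative assumptions; the Wayback Machine typically redirects to the capture closest to the requested timestamp, if one exists.

```python
from datetime import date


def wayback_url(original_url, snapshot_date):
    """Build a Wayback Machine address for a page near the given date."""
    timestamp = snapshot_date.strftime("%Y%m%d")
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


# Hypothetical page and date, chosen only for illustration.
print(wayback_url("http://example.org/", date(2000, 8, 26)))
```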

12

13 Aug 26, 2000

14 March 31, 2001

15 May 28, 2002

16 April 25, 2003

