1 1 Minerva The Web Preservation Project

2 2 Team Members
Library of Congress: Roger Adkins, Cassy Ammen, Allene Hayes, Melissa Levine, Diane Kresh, Jane Mandelbaum, Barbara Tillett
Cornell University: William Arms
Internet Archive: Brewster Kahle, Scott Kirkpatrick
Main Reading Room

3 3 1. Open Access Materials on the Web

4 4

5 5 Approaches to Collecting and Preservation of the Web
CLOSED ACCESS
  Partnership with publishers: publishers and libraries as partners
OPEN ACCESS
  Selective collection of open access web: librarianship in a new domain
  Bulk collection of open access web: automated processes

6 6 Web Preservation Project Pilot
Small number of web sites nominated by selection officers.
Three chosen for close study:
  http://www.whitehouse.gov/
  http://www.algore2000.com/
  http://www.georgewbush.com/
Copies downloaded using HTTrack mirroring program.
Inspected for errors, anomalies, etc.
Catalog records created using OCLC's CORC software.
Loaded into Library of Congress's ILS.
Trial web site developed to evaluate user access.
Discussions with Copyright Office on legal issues.

7 7 Example: The Internet Archive

8 8 Example: National Library of Australia

9 9 Example: National Library of Sweden

10 10 2. Selection and Collection

11 11 Collecting: Making a Snapshot
(Diagram labels: Web site, Snapshot, Download, Archive)
A web site is downloaded, using a mirroring program.
A snapshot is stored in an archive.

12 12 Collecting: Periodic Snapshots
(Diagram labels: Web site, Archive, Snapshot 1, Snapshot 2, Snapshot 3)
At selected time intervals additional snapshots are made.
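
The idea of periodic snapshots can be sketched in a few lines of Python. This is only an illustration, not the pilot's actual procedure: the directory layout, the function names `store_snapshot` and `list_snapshots`, and the timestamp format are all assumptions, and the downloaded files are passed in as an in-memory mapping rather than fetched by a mirroring program.

```python
from datetime import datetime, timezone
from pathlib import Path


def store_snapshot(archive: Path, site_id: str, files: dict[str, bytes],
                   taken_at: datetime) -> Path:
    """Write one snapshot of a site into its own timestamped directory."""
    # Hypothetical layout: <archive>/<site_id>/<UTC timestamp>/<site files>
    stamp = taken_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = archive / site_id / stamp
    target.mkdir(parents=True, exist_ok=False)  # never overwrite an earlier snapshot
    for relative_path, content in files.items():
        out = target / relative_path
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_bytes(content)
    return target


def list_snapshots(archive: Path, site_id: str) -> list[str]:
    """Return the snapshot timestamps held for a site, oldest first."""
    site_dir = archive / site_id
    if not site_dir.exists():
        return []
    return sorted(p.name for p in site_dir.iterdir() if p.is_dir())
```

Keeping each snapshot in a separate directory means later snapshots never alter earlier ones, which matches the slide's picture of Snapshot 1, 2, and 3 sitting side by side in the archive.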

13 13 Very Rough Estimates
There are no good estimates of how many Web sites the Library of Congress would wish to collect and preserve.
OCLC's Web Characterization Project (February 2000)
  Public web sites: 2,900,000
  Annual increase: 700,000
If the Library of Congress collects 1%
  Total number of sites: 30,000
  Annual number new and changed: 15,000
But these numbers are very rough estimates (guesses)!
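
The arithmetic behind these figures can be checked with a short calculation. The slide does not say how the 15,000 figure was derived, so the split between new and changed sites below is an inference, not something stated in the presentation.

```python
PUBLIC_SITES = 2_900_000      # OCLC Web Characterization Project, February 2000
ANNUAL_INCREASE = 700_000     # new public sites per year, same source
PERCENT_COLLECTED = 1         # share the slide supposes the Library collects
SLIDE_NEW_AND_CHANGED = 15_000


def share(count: int, percent: int) -> int:
    """Integer percentage of a count (avoids floating-point rounding)."""
    return count * percent // 100


collected_sites = share(PUBLIC_SITES, PERCENT_COLLECTED)   # 29,000; the slide rounds to 30,000
new_sites = share(ANNUAL_INCREASE, PERCENT_COLLECTED)      # 7,000 new sites per year
# Inference only: the remainder of the slide's 15,000 would be existing sites that changed.
implied_changed_sites = SLIDE_NEW_AND_CHANGED - new_sites  # 8,000

print(collected_sites, new_sites, implied_changed_sites)
```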

14 14 Selection Decisions
Which sites to collect?
  Bulk -- collect all within a certain category
  Selective -- collect sites selected by a librarian
How often to make snapshots?
  Monthly, weekly, or depending on circumstances
Which content to collect?
  HTML pages only
  Text and images only
  Everything

15 15 Examples of Selection Decisions
                  Selection   Frequency   Content
Internet Archive  bulk        monthly     HTML + images
Pandora           selective   varies      all
Kulturarw3        bulk        sweeps      all
Minerva           selective   irregular   all

16 16 Selection Decisions: Recommendations
The Library needs a mixed strategy:
1. Selective selection, for known important sites
2. Bulk selection for selected categories (e.g., .gov sites)
3. Bulk collection without selection for other materials

17 17 3. Use of the Collections for Scholarship and Research

18 18 Analysis by Computer
(Diagram labels: Archive, Snapshot 1, Snapshot 2, Snapshot 3, Analysis by computer)
Computer programs can be used to analyze the snapshot files.
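
As one possible example of such a program, the sketch below compares two snapshots of a site and reports which files were added, removed, or changed. The function names and the representation of a snapshot as a mapping from file path to file content are assumptions made for illustration; the presentation does not describe any particular analysis.

```python
import hashlib


def fingerprint(files: dict[str, bytes]) -> dict[str, str]:
    """Map each file path in a snapshot to the SHA-256 digest of its content."""
    return {path: hashlib.sha256(content).hexdigest()
            for path, content in files.items()}


def compare_snapshots(old: dict[str, bytes],
                      new: dict[str, bytes]) -> dict[str, list[str]]:
    """Report which paths were added, removed, or changed between two snapshots."""
    old_fp, new_fp = fingerprint(old), fingerprint(new)
    shared = old_fp.keys() & new_fp.keys()
    return {
        "added": sorted(new_fp.keys() - old_fp.keys()),
        "removed": sorted(old_fp.keys() - new_fp.keys()),
        "changed": sorted(p for p in shared if old_fp[p] != new_fp[p]),
    }
```

Run over a series of snapshots, a comparison like this would give a simple measure of how quickly a site changes, which could in turn inform how often to collect it.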

19 19 Analysis by Patron
(Diagram labels: Web site, Archive, Snapshot 1, Snapshot 2, Snapshot 3, Access 1, Access 2, Access 3, Analysis by patron)
People can study an access version of a site.

20 20 Access Decisions
Style of access
  Analysis of snapshot files by computer
  Analysis of access version by patron
Editing
  No editing (use snapshot files)
  Minimal editing to make access version
  Fuller editing to maintain experience
  Automatic or by hand
Policy
  Who has access to the collections?

21 21 Examples of Access Decisions
                  Style       Editing
Internet Archive  computer    no
Pandora           researcher  yes
Minerva           researcher  yes

22 22 Recommendations about the Use of the Collections for Scholarship and Research
The Library should support the use of the collection in a variety of ways.
1. Computer analysis of snapshot files.
2. Automated editing to create access versions of all selected sites, without human checking.
3. Human editing of a few, very important sites.

23 23 4. Information Discovery

24 24 Options for Information Discovery
Very large numbers of Web sites will be collected and preserved. Some form of index or catalog is required.
Options
  List of sites (e.g., Internet Archive)
    Access by URL + date
  Automatic index (e.g., Web search engines)
  Catalog (e.g., MARC or Dublin Core)
    Catalog record for individual site or group of sites
    Access through Library catalog
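
The "access by URL + date" option can be illustrated with a small lookup structure. This is a sketch under assumptions: the class name `SnapshotIndex`, and the rule that a lookup returns the most recent snapshot taken on or before the requested date, are illustrative choices and are not described in the presentation.

```python
from bisect import bisect_right, insort
from datetime import date
from typing import Optional


class SnapshotIndex:
    """Hypothetical index from a site URL to the dates on which it was collected."""

    def __init__(self) -> None:
        self._dates: dict[str, list[date]] = {}

    def add(self, url: str, taken: date) -> None:
        """Record that a snapshot of `url` was taken on `taken`."""
        insort(self._dates.setdefault(url, []), taken)  # keep the list sorted

    def lookup(self, url: str, wanted: date) -> Optional[date]:
        """Return the latest snapshot date on or before `wanted`, if any."""
        dates = self._dates.get(url, [])
        position = bisect_right(dates, wanted)
        return dates[position - 1] if position else None
```

For example, with snapshots of a site recorded for 1 September and 1 November 2000, a lookup for 15 October 2000 would return the September snapshot, and a lookup for any date before September would return nothing.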

25 25 Information Discovery: Web Preservation Project
Procedure
  MARC catalog records created using OCLC's CORC system.
  Loaded into Library of Congress's ILS.
Observations about procedure
  Cataloguing effort similar to other electronic files.
  Some similarities to serials.
  No significant workflow difficulties.

26 26 Cataloguing Observations
Detailed information is continually changing.
Difficulty in selecting title (HTML is often poor).
Problems with identifiers (multiple, changing URLs).
Collection level records suitable for special events.
It is difficult to evaluate cataloguing strategy because of lack of knowledge of user needs.

27 27 Recommendations about Information Discovery
1. The Library should experiment with various approaches to indexing and cataloguing Web sites, including automated indexing, Dublin Core and MARC cataloguing.
2. The Library will probably not be able to afford individual catalog records for all Web sites that are collected.

28 28 5. Storage and Preservation

29 29 Workflow
(Diagram labels: Archive, Accession Control, Web Crawler, Process, Catalog, External Access, snapshot, Analysis by patron, Analysis by computer, Web site)

30 30 Preservation Objective
Objective is to preserve the digital collections in a manner that makes them usable for scholarship and research in the future.
What is preserved?
  Preservation of bits
  Preservation of content
  Preservation of experience
How is it used?
  Analysis by computer program
  Viewed by human researcher

31 31 Process of Preservation
(Diagram labels: Version 1, Version 2, Version 3; Time 0, Time 1, Time 2)
This process may be applied to either the snapshot or the access version.

32 32 Storage Decisions: Identification
Identification of Web site
  URL, but Web sites may change their URL
  URN (e.g., Handle or PURL)
Identification and provenance of versions
  Web site identifier
  Collection information (date, time, etc.)
  History of changes
Recommendations
1. Assign URN (e.g., Handle) to each Web site.
2. Store provenance metadata with every file.
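
A provenance record along these lines might look like the following sketch. The field names, the JSON serialization, and the class name `ProvenanceRecord` are assumptions made for illustration; the presentation names the kinds of information to keep (site identifier, collection date and time, history of changes) but not a format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance metadata stored alongside one archived file."""
    site_urn: str        # persistent identifier for the site, e.g. a Handle
    original_url: str    # URL the file was collected from (may later change)
    collected_at: str    # collection date and time, ISO 8601
    history: list[str] = field(default_factory=list)  # changes made since collection

    def log_change(self, note: str) -> None:
        """Append one entry to the history of changes."""
        self.history.append(note)

    def to_json(self) -> str:
        """Serialize the record so it can be stored next to the file."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Keeping the URN separate from the original URL reflects the slide's point that a site's URL may change while its identifier in the archive should not.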

33 33 Preservation Recommendations
1. Keep the unedited snapshot files by repeated refreshing.
2. Use automated migration of individual files as the basic technique for keeping Web sites (more or less) functional at moderate cost.
3. Use manual editing for a small number of particularly important sites.
In general, it is not possible to maintain the experience of using Web sites as technology changes, even with expensive editing.
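
Refreshing, in the sense of copying the unedited bits to new storage, is usually paired with a check that the copy is identical to the original. The sketch below shows one way to do that with a checksum; the function names and the choice of SHA-256 are assumptions, not part of the presentation.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy a snapshot file to new storage and verify the bits are unchanged."""
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"checksum mismatch after refreshing {source}")
    return after  # digest could be stored with the file's provenance metadata
```

This covers only preservation of bits. Migration of file formats and manual editing, the other two recommendations, change the content and would need their own checks.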

34 34 6. General Recommendations

35 35 General Recommendations
1. Collection and preservation of Web materials should be seen as a single program.
2. The program needs a full-time team of librarians and technical staff.
3. Some aspects can be subcontracted to specialists (e.g., the Web crawler), but the leadership must come from the Library.
4. The Library should seek partnerships with other libraries and archives.
5. Most processes will be automatic, with skilled attention given to a small number of particularly important sites.

36 36 Demonstration of Pilot System

