

1 Minerva: The Web Preservation Project

2 Team Members
Library of Congress: Roger Adkins, Cassy Ammen, Allene Hayes, Melissa Levine, Diane Kresh, Jane Mandelbaum, Barbara Tillett
Cornell University: William Arms
Internet Archive: Brewster Kahle, Scott Kirkpatrick
Main Reading Room

3 1. Open Access Materials on the Web

4

5 Approaches to Collecting and Preservation of the Web
Partnership with publishers: publishers and libraries as partners
Selective collection of open access web: librarianship in a new domain
Bulk collection of open access web: automated processes
Diagram labels: OPEN ACCESS, CLOSED ACCESS

6 Web Preservation Project Pilot
Small number of web sites nominated by selection officers. Three chosen for close study.
Copies downloaded using the HTTrack mirroring program.
Inspected for errors, anomalies, etc.
Catalog records created using OCLC's CORC software.
Loaded into the Library of Congress's ILS.
Trial web site developed to evaluate user access.
Discussions with Copyright Office on legal issues.

7 Example: The Internet Archive

8 Example: National Library of Australia

9 Example: National Library of Sweden

10 2. Selection and Collection

11 Collecting: Making a Snapshot
Diagram labels: Web site, Snapshot, Download, Archive
A web site is downloaded, using a mirroring program.
A snapshot is stored in an archive.

12 Collecting: Periodic Snapshots
Diagram labels: Web site, Archive, Snapshot 1, Snapshot 2, Snapshot 3
At selected time intervals additional snapshots are made.
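The snapshot model on the last two slides can be sketched as a small archive layout. This is only an illustration, not the pilot's actual storage scheme: the function names, the directory layout (one directory per site, one subdirectory per snapshot time) and the timestamp format are all assumptions. Downloading is left out; the sketch takes files that a mirroring program has already fetched.

```python
from datetime import datetime, timezone
from pathlib import Path


def store_snapshot(archive_root: Path, site_id: str,
                   files: dict[str, bytes], taken_at: datetime) -> Path:
    """Write one snapshot of a site into its own timestamped directory.

    Hypothetical layout: <archive_root>/<site_id>/<UTC timestamp>/<file path>.
    `taken_at` is assumed to be a UTC datetime.
    """
    stamp = taken_at.strftime("%Y%m%dT%H%M%SZ")
    snapshot_dir = archive_root / site_id / stamp
    # exist_ok=False: an existing snapshot is never silently overwritten.
    snapshot_dir.mkdir(parents=True, exist_ok=False)
    for relative_path, content in files.items():
        target = snapshot_dir / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
    return snapshot_dir


def list_snapshots(archive_root: Path, site_id: str) -> list[str]:
    """Return the snapshot timestamps held for a site, oldest first."""
    site_dir = archive_root / site_id
    if not site_dir.is_dir():
        return []
    # The timestamp format sorts chronologically as plain text.
    return sorted(p.name for p in site_dir.iterdir() if p.is_dir())
```

Calling `store_snapshot` again at a later time adds Snapshot 2, Snapshot 3 and so on beside the first, so earlier snapshots stay untouched.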

13 Very Rough Estimates
There are no good estimates of how many Web sites the Library of Congress would wish to collect and preserve.
OCLC's Web Characterization Project (February 2000)
  Public web sites: 2,900,000
  Annual increase: 700,000
If the Library of Congress collects 1%
  Total number of sites: 30,000
  Annual number new and changed: 15,000
But these numbers are very rough estimates (guesses)!
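The arithmetic behind the 1% figure can be checked in a few lines. The two input numbers are the OCLC figures quoted on the slide; the variable names are illustrative. The slide's 15,000 covers new and changed sites together, and the OCLC figures give no count of changed sites, so only the new-site share is computed here.

```python
PUBLIC_WEB_SITES = 2_900_000   # OCLC Web Characterization Project, February 2000
ANNUAL_INCREASE = 700_000      # annual increase in public web sites, same source
COLLECTED_PERCENT = 1          # "If the Library of Congress collects 1%"

collected_sites = PUBLIC_WEB_SITES * COLLECTED_PERCENT // 100
new_sites_per_year = ANNUAL_INCREASE * COLLECTED_PERCENT // 100

print(collected_sites)      # 29000, which the slide rounds to 30,000
print(new_sites_per_year)   # 7000 new sites; changed sites are not derivable here
```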

14 Selection Decisions
Which sites to collect?
  Bulk: collect all within a certain category
  Selective: collect sites selected by a librarian
How often to make snapshots?
  Monthly, weekly, or depending on circumstances
Which content to collect?
  HTML pages only
  Text and images only
  Everything

15 Examples of Selection Decisions
                   Selection   Frequency   Content
Internet Archive   bulk        monthly     HTML + images
Pandora            selective   varies      all
Kulturarw3         bulk        sweeps      all
Minerva            selective   irregular   all

16 Selection Decisions: Recommendations
The Library needs a mixed strategy:
1. Selective selection, for known important sites
2. Bulk selection for selected categories (e.g., .gov sites)
3. Bulk collection without selection for other materials

17 3. Use of the Collections for Scholarship and Research

18 Analysis by Computer
Diagram labels: Archive, Snapshot 1, Snapshot 2, Snapshot 3, Analysis by computer
Computer programs can be used to analyze the snapshot files.
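One simple kind of computer analysis is to compare two snapshots of the same site and report what was added, removed or changed. The sketch below is an assumed example of such a program, not something described in the presentation; it represents each snapshot as a mapping from file path to file content and compares SHA-256 digests.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(content).hexdigest()


def compare_snapshots(older: dict[str, bytes],
                      newer: dict[str, bytes]) -> dict[str, list[str]]:
    """List the paths added, removed, or changed between two snapshots."""
    old_hashes = {path: digest(data) for path, data in older.items()}
    new_hashes = {path: digest(data) for path, data in newer.items()}
    return {
        "added": sorted(new_hashes.keys() - old_hashes.keys()),
        "removed": sorted(old_hashes.keys() - new_hashes.keys()),
        "changed": sorted(
            path for path in old_hashes.keys() & new_hashes.keys()
            if old_hashes[path] != new_hashes[path]
        ),
    }
```

Because the comparison works on the unedited snapshot files, it needs no access version of the site.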

19 Analysis by Patron
Diagram labels: Web site, Archive, Snapshot 1, Snapshot 2, Snapshot 3, Access 1, Access 2, Access 3, Analysis by patron
People can study an access version of a site.

20 Access Decisions
Style of access
  Analysis of snapshot files by computer
  Analysis of access version by patron
Editing
  No editing (use snapshot files)
  Minimal editing to make access version
  Fuller editing to maintain experience
  Automatic or by hand
Policy
  Who has access to the collections?

21 Examples of Access Decisions
                   Style        Editing
Internet Archive   computer     no
Pandora            researcher   yes
Minerva            researcher   yes

22 Recommendations about the Use of the Collections for Scholarship and Research
The Library should support the use of the collection in a variety of ways.
1. Computer analysis of snapshot files.
2. Automated editing to create access versions of all selected sites, without human checking.
3. Human editing of a few, very important sites.

23 4. Information Discovery

24 Options for Information Discovery
Very large numbers of Web sites will be collected and preserved. Some form of index or catalog is required.
Options
  List of sites (e.g., Internet Archive)
    Access by URL + date
  Automatic index (e.g., Web search engines)
  Catalog (e.g., MARC or Dublin Core)
    Catalog record for individual site or group of sites
    Access through Library catalog
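Of the options above, access by URL + date is the easiest to sketch: given a URL and a date, return the most recent snapshot taken on or before that date. The class and method names below are hypothetical, and this is not a description of how the Internet Archive implements its lookup.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional


class SnapshotIndex:
    """Maps each URL to the sorted list of dates on which it was collected."""

    def __init__(self) -> None:
        self._dates: dict[str, list[date]] = {}

    def add(self, url: str, snapshot_date: date) -> None:
        """Record a snapshot date, keeping each URL's list sorted."""
        dates = self._dates.setdefault(url, [])
        dates.insert(bisect_right(dates, snapshot_date), snapshot_date)

    def latest_on_or_before(self, url: str, wanted: date) -> Optional[date]:
        """Return the newest snapshot date not later than `wanted`, if any."""
        dates = self._dates.get(url, [])
        position = bisect_right(dates, wanted)
        return dates[position - 1] if position else None
```

A lookup that returns `None` means the archive holds no snapshot of that URL from on or before the requested date.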

25 Information Discovery: Web Preservation Project
Procedure
  MARC catalog records created using OCLC's CORC system.
  Loaded into Library of Congress's ILS.
Observations about procedure
  Cataloguing effort similar to other electronic files.
  Some similarities to serials.
  No significant workflow difficulties.

26 Cataloguing Observations
Detailed information is continually changing.
Difficulty in selecting title (HTML is often poor).
Problems with identifiers (multiple, changing URLs).
Collection level records suitable for special events.
It is difficult to evaluate cataloguing strategy because of lack of knowledge of user needs.

27 Recommendations about Information Discovery
1. The Library should experiment with various approaches to indexing and cataloguing Web sites, including automated indexing, Dublin Core and MARC cataloguing.
2. The Library will probably not be able to afford individual catalog records for all Web sites that are collected.

28 5. Storage and Preservation

29 Workflow
Diagram labels: Archive; Accession Control; Web Crawler; Process; Catalog; External Access; snapshot; Analysis by patron; Analysis by computer; Web site

30 Preservation Objective
Objective is to preserve the digital collections in a manner that makes them usable for scholarship and research in the future.
What is preserved?
  Preservation of bits
  Preservation of content
  Preservation of experience
How is it used?
  Analysis by computer program
  Viewed by human researcher

31 Process of Preservation
Diagram labels: Version 1, Version 2, Version 3; Time 0, Time 1, Time 2
This process may be applied to either the snapshot or the access version.

32 Storage Decisions: Identification
Identification of Web site
  URL, but Web sites may change their URL
  URN (e.g., Handle or PURL)
Identification and provenance of versions
  Web site identifier
  Collection information (date, time, etc.)
  History of changes
Recommendations
  1. Assign URN (e.g., Handle) to each Web site.
  2. Store provenance metadata with every file.
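A provenance record of the kind recommended above might look like the following sketch. The field names, the JSON serialization and the example identifier are assumptions made for illustration; the slide names only the kinds of information to keep (Web site identifier, collection date and time, history of changes).

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    """Provenance metadata stored alongside one collected file (hypothetical schema)."""
    site_urn: str        # persistent identifier of the Web site, e.g. a Handle
    source_url: str      # URL the file was collected from; may change over time
    collected_at: str    # collection date and time, ISO 8601
    history: list[str] = field(default_factory=list)  # history of changes

    def record_change(self, note: str) -> None:
        """Append one entry to the history of changes."""
        self.history.append(note)

    def to_json(self) -> str:
        """Serialize the record so it can be written next to the file."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# "hdl:0000/example-site" is a made-up identifier, not a real Handle.
record = ProvenanceRecord(
    site_urn="hdl:0000/example-site",
    source_url="http://www.example.org/index.html",
    collected_at="2000-06-01T12:00:00Z",
)
record.record_change("2000-06-01: downloaded by mirroring program")
```

Keeping the URN separate from the source URL means the record still identifies the site after its URL changes.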

33 Preservation Recommendations
1. Keep the unedited snapshot files by repeated refreshing.
2. Use automated migration of individual files as the basic technique for keeping Web sites (more or less) functional at moderate cost.
3. Use manual editing for a small number of particularly important sites.
In general, it is not possible to maintain the experience of using Web sites as technology changes, even with expensive editing.
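Refreshing, in the first recommendation, means copying the stored bits to new media without changing them. A minimal sketch of one refresh step follows, assuming a checksum comparison is used to confirm the copy; the presentation does not specify how refreshed copies are verified, and the function names are illustrative.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def refresh_file(source: Path, destination: Path) -> str:
    """Copy a snapshot file to new storage and verify the bits are unchanged."""
    expected = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    if sha256_of(destination) != expected:
        raise IOError(f"checksum mismatch after copying {source} to {destination}")
    return expected
```

The returned digest could be stored with the file's provenance metadata so that the next refresh can be checked against it.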

34 6. General Recommendations

35 General Recommendations
1. Collection and preservation of Web materials should be seen as a single program.
2. The program needs a full-time team of librarians and technical staff.
3. Some aspects can be subcontracted to specialists (e.g., the Web crawler), but the leadership must come from the Library.
4. The Library should seek partnerships with other libraries and archives.
5. Most processes will be automatic, with skilled attention given to a small number of particularly important sites.

36 Demonstration of Pilot System