Strategies for Collecting and Preserving Open Access Materials on the Web. William Y. Arms, Cornell University. Federal Library and Information Center Committee.

Presentation transcript:

1 Strategies for Collecting and Preserving Open Access Materials on the Web William Y. Arms Cornell University Federal Library and Information Center Committee

2 Open Access Materials on the Web

3 The Library of Congress: the Web Preservation Project
The Library of Congress collects the cultural and intellectual output of today for the benefit of future generations. An ever-increasing amount of this material is born digital.
The library has:
- a privileged legal position
- generous public funding
... but cannot do everything!
Step 1: Open Access Materials on the Web


8 Approaches to Preservation of the Web
CLOSED ACCESS
- Partnership with publishers: publishers and libraries as partners
OPEN ACCESS
- Selective collection of open access web: librarianship in a new domain
- Bulk collection of open access web: automated processes

9 Example: Web Preservation Project Pilot
- Small number of web sites nominated by selection officers. Three chosen for close study.
- Copies downloaded using the HTTrack mirroring program.
- Inspected for errors, anomalies, etc.
- Catalog records created using OCLC's CORC software.
- Loaded into the Library of Congress's ILS.
- Trial web site developed to evaluate user interfaces.

10 Example: The Internet Archive

11 Example: National Library of Australia

12 Example: National Library of Sweden

13 Selection and Collection

14 Collecting: Making a Snapshot
[Diagram: Web site, Download, Snapshot, Archive]
A web site is downloaded using a mirroring program. A snapshot is stored in an archive.

15 Collecting: Periodic Snapshots
[Diagram: Web site; Archive holding Snapshot 1, Snapshot 2, Snapshot 3]
At scheduled time intervals additional snapshots are made.
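
The collection cycle on slides 14 and 15 can be sketched roughly as follows. This is only an illustration, not the project's software: the `Snapshot` and `Archive` classes, the `fake_download` stand-in for a mirroring program, and the example URL are all assumptions, and the schedule is simulated with dates rather than run on a real timer.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Callable, Dict, List


@dataclass
class Snapshot:
    site: str                # URL of the web site that was copied
    taken_on: date           # when the copy was made
    files: Dict[str, bytes]  # path within the site -> bytes as downloaded


@dataclass
class Archive:
    snapshots: List[Snapshot] = field(default_factory=list)

    def collect(self, site: str, first: date, interval_days: int, count: int,
                download: Callable[[str], Dict[str, bytes]]) -> None:
        """Store `count` snapshots of `site`, one every `interval_days` days."""
        for i in range(count):
            taken_on = first + timedelta(days=i * interval_days)
            self.snapshots.append(Snapshot(site, taken_on, download(site)))


def fake_download(site: str) -> Dict[str, bytes]:
    # Hypothetical stand-in for a mirroring program; no network access.
    return {"index.html": b"<html>example</html>"}


archive = Archive()
archive.collect("http://example.org/", date(2000, 2, 1), 7, 3, fake_download)
dates = [s.taken_on for s in archive.snapshots]
print(dates)
```

Each snapshot is kept whole and unedited here; nothing is merged between runs, which matches the picture of Snapshot 1, 2 and 3 sitting side by side in the archive.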

16 Selection Decisions
Which sites to collect
- Bulk: collect all within a certain category
- Selective: collect sites selected by a librarian
How often to make snapshots
- Monthly, weekly, or depending on circumstances
Which content to collect
- HTML pages only
- Text and images only
- Everything

17 Examples of Selection Decisions
                   Selection   Frequency   Content
Internet Archive   bulk        monthly     HTML + images
Pandora            selective   varies      all
Kulturarw³         bulk        sweeps      all
Web Preservation   selective   irregular   all

18 Legal Issues
- Legal position of archives that download open access materials is unclear
- Preservation is in the national interest
- See the discussion in The Digital Dilemma
- Crucial factor is economic impact on copyright owners
- Library of Congress has no special position except via copyright deposit

19 Legal Issues: Thoughts and Actions
- Presumption is that downloading open access materials is permitted by the publisher unless another indication is given, e.g., robot exclusion using a robots.txt file
- Different parties to consider
  => Library of Congress
  => other national libraries
  => partners of the Library of Congress and national libraries
  => independent archives
- U.S. Copyright Office has offered to help with clarification
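
As a small illustration of the robot-exclusion check mentioned on slide 19, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the user-agent name `example-archiver`, and the URLs are made up for the example; a real crawler would fetch the file from the site being collected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt as a publisher might serve it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_download(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit this agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = may_download(ROBOTS_TXT, "example-archiver",
                       "http://example.org/index.html")
blocked = may_download(ROBOTS_TXT, "example-archiver",
                       "http://example.org/private/report.html")
print(allowed, blocked)
```

The check only reports what the publisher has indicated. Whether the absence of an exclusion amounts to permission is the legal question the slide raises, and the code does not settle it.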

20 Access to Collections

21 Access: Analysis by Computer
[Diagram: Archive holding Snapshot 1, Snapshot 2, Snapshot 3; Analysis by computer]

22 Access: Analysis by Patron
[Diagram: Web site; Archive holding Snapshot 1, Snapshot 2, Snapshot 3 with Access 1, Access 2, Access 3; Analysis by patron; Analysis by computer]

23 Access Decisions
Style of access
- Analysis of snapshot files by computer
- Analysis of Web access version by patron
Editing
- Minimal editing to make access version
- Fuller editing to maintain experience
- Automatic or by hand
Policy
- Who has access to the collections?

24 Examples of Access Decisions
                   Style        Editing
Internet Archive   computer     none
Pandora            researcher   some
Kulturarw³         ?            ?
Web Preservation   researcher   some

25 Information Discovery

26 Options for Information Discovery
Very large numbers of Web sites will be collected and preserved. Some form of index or catalog is required.
Options
- List of sites (e.g., Internet Archive)
  => Access by URL + date
- Automatic index (e.g., Web search engines)
- Catalog (e.g., Web Preservation Project)
  => Record for individual site or group of sites
  => Access through library catalog
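
The "access by URL + date" option can be pictured as a lookup that returns the most recent snapshot taken on or before a requested date. The sketch below is an assumption about how such a list might be queried, not a description of any archive's actual index; the `snapshot_index` data and the `snapshot_for` function are invented for illustration.

```python
from bisect import bisect_right
from datetime import date
from typing import Dict, List, Optional

# Hypothetical list of sites: URL -> snapshot dates in ascending order.
snapshot_index: Dict[str, List[date]] = {
    "http://example.org/": [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
}


def snapshot_for(index: Dict[str, List[date]], url: str,
                 wanted: date) -> Optional[date]:
    """Return the date of the latest snapshot of `url` taken on or before `wanted`."""
    dates = index.get(url, [])
    pos = bisect_right(dates, wanted)  # number of snapshots dated <= wanted
    return dates[pos - 1] if pos else None


mid_february = snapshot_for(snapshot_index, "http://example.org/", date(2000, 2, 15))
too_early = snapshot_for(snapshot_index, "http://example.org/", date(1999, 12, 31))
print(mid_february, too_early)
```

A list like this needs no cataloging effort, but the patron must already know the URL, which is the trade-off against the catalog option on the same slide.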

27 Information Discovery: Web Preservation Project
Procedure
- MARC catalog records created using OCLC's CORC system.
- Loaded into Library of Congress's ILS.
Observations
- Catalog effort similar to other electronic files
- Continual changes between snapshots
- Some similarities to serials
- No significant workflow difficulties

28 Storage

29 Storage: Preservation Versions
[Diagram: Snapshot 1 and Access 1, shown three times as successive preservation versions]
Over time, other versions of a snapshot will be made for preservation.

30 Storage Decisions: Size
Each Web site will be stored many times
- Repeated snapshots
- Access versions
- Preservation versions
Saving space
- Many files are repeated (e.g., video clips)
- Storing a single copy saves space, but leads to more complex computer systems
- Compressing files saves space, but leads to more complex computer systems
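
One way to picture the "single copy" and compression trade-offs on slide 30 is a store that keeps each distinct file once, keyed by a hash of its content, and compresses it on the way in. This is a sketch under assumptions, not the design of any system in the talk: the `SingleCopyStore` class, the choice of SHA-256 and zlib, and the sample data are all illustrative.

```python
import hashlib
import zlib
from typing import Dict


class SingleCopyStore:
    """Keeps one compressed copy of each distinct file across all snapshots."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}                # content hash -> compressed bytes
        self.manifests: Dict[str, Dict[str, str]] = {}   # snapshot id -> {path: hash}

    def add_snapshot(self, snapshot_id: str, files: Dict[str, bytes]) -> None:
        manifest = {}
        for path, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            if digest not in self.blobs:                 # store repeated files only once
                self.blobs[digest] = zlib.compress(content)
            manifest[path] = digest
        self.manifests[snapshot_id] = manifest

    def read(self, snapshot_id: str, path: str) -> bytes:
        digest = self.manifests[snapshot_id][path]
        return zlib.decompress(self.blobs[digest])


store = SingleCopyStore()
clip = b"video-bytes" * 1000  # stands in for a large file that does not change
store.add_snapshot("2000-01", {"index.html": b"<p>January</p>", "clip.mpg": clip})
store.add_snapshot("2000-02", {"index.html": b"<p>February</p>", "clip.mpg": clip})
print(len(store.blobs))  # distinct files actually stored
```

The extra complexity the slide warns about is visible even at this size: reading a file now needs a manifest lookup and a decompression step, and losing one stored blob damages every snapshot that refers to it.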

31 Very Rough Estimates of Size and Cost
Public web sites (OCLC, February 2000)          2,900,000
Library of Congress collects 1%                 30,000
Average size of site                            60 Mbytes
Size of 30,000 sites                            1.8 terabytes
Storage requirements/year (monthly snapshot)    21.6 terabytes
Storage requirements (no duplicates)            5.0 terabytes
Cost per year ($25,000 per terabyte)            $125,000
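
The arithmetic behind slide 31 can be reproduced in a few lines. The input figures are the slide's own; the only assumptions added here are decimal units (1 terabyte = 1,000,000 Mbytes) and twelve snapshots per year for "monthly". Note that 1% of 2,900,000 is 29,000, which the slide rounds to 30,000, and that the 5.0 terabyte figure after removing duplicates is an estimate taken from the slide, not something the code derives.

```python
PUBLIC_SITES = 2_900_000      # OCLC, February 2000 (from the slide)
SITES_COLLECTED = 30_000      # the slide's rounded 1% of public sites
AVG_SITE_MBYTES = 60
SNAPSHOTS_PER_YEAR = 12       # assumption: "monthly" means 12 per year
NO_DUPLICATES_TB = 5.0        # the slide's estimate, taken as given
COST_PER_TB = 25_000          # dollars
MBYTES_PER_TB = 1_000_000     # assumption: decimal units

exact_one_percent = PUBLIC_SITES // 100
one_pass_tb = SITES_COLLECTED * AVG_SITE_MBYTES / MBYTES_PER_TB
yearly_tb = SITES_COLLECTED * AVG_SITE_MBYTES * SNAPSHOTS_PER_YEAR / MBYTES_PER_TB
yearly_cost = NO_DUPLICATES_TB * COST_PER_TB

print(f"1% of public sites:      {exact_one_percent:,}")
print(f"One pass over all sites: {one_pass_tb:.1f} TB")
print(f"Twelve monthly passes:   {yearly_tb:.1f} TB")
print(f"Yearly cost at 5.0 TB:   ${yearly_cost:,.0f}")
```

The cost line is driven entirely by the de-duplicated figure; priced at the full 21.6 terabytes, the same rate would give $540,000 per year.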

32 Storage Decisions: Identification
Identification of Web site
- URL, but Web sites may change their URL
- URN (e.g., Handle or PURL)
Identification and provenance of versions
- Web site identifier
- Collection information (date, time, etc.)
- History of changes
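
The identification and provenance items on slide 32 could be held in a record along the following lines. The field names, the `VersionRecord` class, and the identifier value are hypothetical; the slide lists what should be recorded, not how.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class VersionRecord:
    site_id: str             # persistent identifier for the site (made-up value below)
    url_at_collection: str   # the URL may change later, so record it per version
    collected_at: datetime   # collection information: date and time
    history: List[str] = field(default_factory=list)  # changes made to this version

    def note_change(self, description: str) -> None:
        """Append one entry to the history of changes."""
        self.history.append(description)


record = VersionRecord(
    site_id="urn:example:site-0001",  # hypothetical, not a real Handle or PURL
    url_at_collection="http://example.org/",
    collected_at=datetime(2000, 2, 1, 3, 0),
)
record.note_change("Access version created with minimal editing")
print(record)
```

Keeping the persistent identifier separate from the URL is what lets later snapshots be tied to the same site after it moves.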

33 [Diagram labels: Archive, Accession Control, Web Crawler, Process, Catalog, External Access, Workflow, snapshot, Analysis by patron, Analysis by computer, Web site]

34 Preservation

35 Objective
Objective is to preserve the digital collections in a manner that makes them usable for scholarship and research in the future.
What is preserved?
- Preservation of bits
- Preservation of content
- Preservation of experience
How is it used?
- Analysis by computer program
- Analysis by human researcher
- Viewed by human researcher

36 Process of Preservation
[Diagram: Version 1 at Time 0, Version 2 at Time 1, Version 3 at Time 2]
This process may be applied to either the snapshot or the access version.

37 Preservation: Refreshing
Each version is created from the previous by exactly copying the bits.
- Keeps the exact files for all time
- Preserves bits and content, but not always in an accessible form
- Later computers and software are unlikely to support today's protocols, formats, languages, etc.
Keeping the unedited snapshot files by repeated refreshing should be a basic part of any preservation strategy.
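
Refreshing, as described on slide 37, amounts to making an exact copy and confirming that no bits changed. A minimal sketch follows, assuming a SHA-256 checksum is an acceptable test of equality; the `refresh` function and the file names are illustrative, and temporary files stand in for old and new storage media.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def refresh(old: Path, new: Path) -> str:
    """Copy `old` to `new` bit for bit and verify the copy; return its checksum."""
    before = sha256_of(old)
    shutil.copyfile(old, new)
    after = sha256_of(new)
    if before != after:
        raise IOError(f"refresh of {old} produced a different file")
    return after


with tempfile.TemporaryDirectory() as tmp:
    version_1 = Path(tmp) / "snapshot_v1.bin"
    version_2 = Path(tmp) / "snapshot_v2.bin"
    version_1.write_bytes(b"unedited snapshot bytes")
    digest = refresh(version_1, version_2)
    identical = version_1.read_bytes() == version_2.read_bytes()

print(identical, digest)
```

The check says nothing about whether the file can still be opened by current software, which is the limitation the slide points out.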

38 Preservation: Automatic Migration of Individual Files
As protocols, formats, languages, etc. become obsolete, convert individual files to new standards.
- Can be carried out automatically
- Preserves content and helps toward preservation of experience
- Effectiveness depends on availability of conversion tools and the complexity and quality of original source
- Migrated versions will steadily diverge from original
- Web sites will eventually cease to function
Automated migration of individual files is the basic technique for keeping web sites functional at moderate cost.
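
File-by-file migration can be sketched as a table of converters applied to each file whose format has one. Everything named here is an assumption made for illustration: the format labels are invented strings rather than registered types, and the Latin-1 to UTF-8 re-encoding merely stands in for whatever conversion tools happen to be available.

```python
from typing import Callable, Dict, List, Tuple

FileEntry = Tuple[str, bytes]  # (format label, content)


def latin1_to_utf8(data: bytes) -> bytes:
    """Stand-in conversion tool: re-encode Latin-1 text as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


# Hypothetical registry: obsolete format -> (replacement format, converter).
CONVERTERS: Dict[str, Tuple[str, Callable[[bytes], bytes]]] = {
    "text/latin-1": ("text/utf-8", latin1_to_utf8),
}


def migrate(files: Dict[str, FileEntry]) -> Tuple[Dict[str, FileEntry], List[str]]:
    """Build the next version; also list paths that had no converter."""
    migrated: Dict[str, FileEntry] = {}
    untouched: List[str] = []
    for path, (fmt, data) in files.items():
        if fmt in CONVERTERS:
            new_fmt, convert = CONVERTERS[fmt]
            migrated[path] = (new_fmt, convert(data))
        else:
            migrated[path] = (fmt, data)  # carried forward unchanged
            untouched.append(path)
    return migrated, untouched


version_1 = {
    "page.txt": ("text/latin-1", "café".encode("latin-1")),
    "logo.gif": ("image/gif", b"GIF89a"),
}
version_2, untouched = migrate(version_1)
print(version_2["page.txt"], untouched)
```

The `untouched` list is where the slide's caveat shows up: files with no available converter are carried forward as they are, and over repeated migrations those are the ones most likely to stop working.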

39 Preservation: Automatic Migration with Manual Editing
In conjunction with automatic migration, web sites are reviewed by a librarian and edited as necessary to preserve functionality.
- The only method that can be expected to preserve the experience of using web sites
- Migrated versions will steadily diverge from original
- Some web sites will be impossible to edit without changing the experience
Manual editing is very expensive and is therefore suitable for only a small number of particularly important sites.

40 Acknowledgements
The members of the Web Preservation Project are: Roger Adkin, Cassy Ammen, William Arms, Allene Hayes, Melissa Levine, Diane Kresh, Barbara Tillett