Preserving Digital Culture: Tools and Strategies for Building Web Archives
Internet Librarian 2009
Tracy Seneca – California Digital Library
Session Outline
- Web content – why we’re building archives
- Web crawling – under the hood
- Tools available
- Web Archiving Service demo
- Solutions for tough times – collaboration
Quick Tour
Web Content – Why Build Archives

Subject Area         Author     Sample Dates  Sample Size  Half-Life
Information Science  Goh & Ng   –             …,516        5 years
Computer Science     Spinellis  –             …,375        4 years
Law                  Rumsey     –             …,406        4 years
Medicine             Veronin    –             –            … years

2003: International Internet Preservation Consortium
2005: NDIIPP funds Web-at-Risk, Web Archives Workbench
Threats to Web Content
- Delivery model (many people access one copy)
- Site redesigns
- Normal maintenance
- Political change – change of administration, policy changes
- Format obsolescence
Researcher’s Perspective
- Study the topic / event
- Study site change or web-based communication
- Create stable citations for publications
- Locate archived documents via catalog
- Treat archive as a data set
What Makes an Archive
- Collection development – site selection
- Capture – harvesting content
- Curation – description and QA
- Publication – end user access
Archive Types
- Topical
- Event
- Domain
- Document
- Personal
Under the Hood
Open source tools from the Internet Archive:
- Heritrix crawler
- NutchWAX indexer
- Open Source Wayback viewer
The Crawler
1. Where do I start?
2. Can I find that URL?
3. Is there a robots.txt?
4. What do I need to render that page? (CSS, graphics)
5. What links can I find?
6. Do those links fit the rules I was given?
7. Do I have a Flash / PDF / JavaScript file?
8. Does that file have any links?
9. For every link that fits the rules, start over!
10. Keep going until I can’t find any more links or I hit my time limit.
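A minimal Python sketch of that loop may help make it concrete. Heritrix itself is a much larger Java application; the seed, scope rule, and page limit below are illustrative assumptions, not Heritrix settings.

    # A toy crawl loop following the steps above, under illustrative assumptions.
    import re
    import urllib.robotparser
    from collections import deque
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    SEEDS = ["http://www.example.gov/"]                 # 1. where do I start?
    SCOPE = re.compile(r"^http://www\.example\.gov/")   # 6. the rules I was given
    MAX_PAGES = 100                                     # 10. stand-in for a time limit

    def allowed_by_robots(url):
        # 3. Is there a robots.txt that forbids this URL?
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("{0.scheme}://{0.netloc}/robots.txt".format(urlparse(url)))
        try:
            rp.read()
        except OSError:
            return True                                 # host unreachable: assume allowed
        return rp.can_fetch("*", url)

    frontier, seen = deque(SEEDS), set(SEEDS)
    while frontier and len(seen) <= MAX_PAGES:
        url = frontier.popleft()
        if not allowed_by_robots(url):
            continue
        try:
            page = urlopen(url).read().decode("utf-8", "replace")  # 2. fetch it
        except OSError:
            continue
        # 4, 7, 8: a real crawler also fetches CSS and images to render the
        # page, and parses Flash, PDF, and JavaScript files for further links;
        # this sketch only scans HTML href attributes.
        for link in re.findall(r'href="([^"]+)"', page):          # 5. what links?
            link = urljoin(url, link)
            if SCOPE.match(link) and link not in seen:            # 6. in scope?
                seen.add(link)
                frontier.append(link)                             # 9. start over!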
What the Crawler Spits Out
- ARC / WARC files
  – All of the content lumped together in large files
  – Keeps the archive simple and manageable
  – Need special tools to search and display: NutchWAX, Open Source Wayback
- Massive amounts of content!
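For a concrete sense of what one of those files holds, here is a short Python sketch that lists the captured responses in a WARC file. It uses warcio, a later open source library that is not among the tools named in this talk, and the filename is hypothetical:

    from warcio.archiveiterator import ArchiveIterator

    with open("crawl-00001.warc.gz", "rb") as stream:    # hypothetical filename
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":            # one captured HTTP response
                url = record.rec_headers.get_header("WARC-Target-URI")
                date = record.rec_headers.get_header("WARC-Date")
                body = record.content_stream().read()    # the raw page bytes
                print(date, url, len(body), "bytes")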
Why Should I Care?
- When you navigate a web archive, you’re interacting with a very different file structure
- These tools are constantly improving
  – Crawler gets better at capturing
  – Indexer gets better at ranking & scaling
- The Web is constantly changing
  – New technologies, new obstacles
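That “very different file structure” is exactly why the indexer matters: pages sit at byte offsets inside large ARC/WARC files, so replaying one means resolving a URL and timestamp to a file and offset. A toy illustration follows; it is not NutchWAX’s actual data model, and the filenames, offsets, and lengths are invented:

    # Maps (url, timestamp) -> (warc file, byte offset, record length).
    index = {
        ("http://www.example.gov/", "20091026120000"):
            ("crawl-00001.warc.gz", 0, 48213),
        ("http://www.example.gov/about", "20091026120005"):
            ("crawl-00001.warc.gz", 48213, 9067),
    }

    def replay(url, timestamp):
        warc_file, offset, length = index[(url, timestamp)]
        with open(warc_file, "rb") as f:
            f.seek(offset)           # jump straight to the stored record
            return f.read(length)    # a real viewer decompresses and rewrites links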
Tools Available: Considerations
- Hosted vs. local
- Cost
- Public access
- Discovery / search options
- Capture configuration
- QA / analysis tools
- Metadata options
- Training & support
- Ease of use
- Limits to:
  – Users
  – Archives
  – Sites
  – Storage
- Data transfer
- Data configuration
- Collaboration
- Rights management
Tools Available
Hosted:
- Archive-It
- Web Archiving Service
- OCLC Web Harvester / CONTENTdm
- Hanzo Web
Local installation:
- Web Curator Tool
- CONTENTdm
- NetArchive Suite
Archive-It
- Hosted by the Internet Archive
- User-friendly interface, documentation, training
- Capture target = entire collection
- Public access automatic
- Dublin Core metadata at seed level
- Limits = storage, # collections, # seeds
- Search full text, not metadata
- Highlight: “Scope It”
Web Curator Tool
- Developed by the National Library of New Zealand with input from the British Library and other IIPC members
- User-friendly interface; strong documentation for both technical staff and curators
- Rights management module
- Basic capture settings offered, with access to all settings if needed
- Assumes a strong division of labor / specific order of events
- Capture target is flexible (sites or groups of sites)
- Dublin Core metadata
- Highlight: “Prune” tool
Web Archiving Service
- Hosted by the California Digital Library
- User-friendly interface, documentation, training
- Capture target = site (flexible capture settings)
- Public access (optional)
- Some rights management features
- Limits = storage
- Search full text, not metadata
- Highlight: “show me all the new PDF files”
- Web-based demos
- User guides
Web Harvester / CONTENTdm
- Harvester hosted by OCLC
- Access either hosted or local
- Flexible metadata
- Search metadata, not full text (except PDF)
- Same public access interface as CONTENTdm
NetArchive Suite
- In use at the Danish Royal Library since 2004; open source release 2007
- Tools developed for large-scale and comprehensive domain capture
- High degree of control over crawlers
- High degree of in-house expertise required
- Documentation targets technical staff, not curators
- Highlight: QA tool that lets you click to grab missing images and files
Why have curatorial tools?
Web Archiving Service Demo
Rights Issues: Section 108 Study Group
- No advance permission needed to capture freely available web content
- “Freely available” = no login / fee
- Content owners can prevent capture via robots.txt and may request takedown
  – Except government agencies
- Embargo period observed before archives are published
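As an illustration of that robots.txt mechanism, a site owner could block a specific archiving crawler with rules like the following (ia_archiver is the Internet Archive’s crawler token; the rules shown are an example, not a recommendation):

    # Block the Internet Archive's crawler from the whole site
    User-agent: ia_archiver
    Disallow: /

    # All other crawlers remain unrestricted
    User-agent: *
    Disallow: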
Large Scale Collaboration
- International Internet Preservation Consortium
  – Improving capture & display tools
  – Beginning a registry of archives
- APIs to allow searches against different archives, no matter which archiving tool was used
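One later realization of that API idea is the Memento protocol (RFC 7089), which lets a client ask many archives at once for captures of a URL. A minimal Python sketch, assuming the Time Travel aggregator URL below (an assumption, not something named in this talk):

    from urllib.request import urlopen

    target = "http://www.cdlib.org/"
    timemap = "http://timetravel.mementoweb.org/timemap/link/" + target

    # The TimeMap lists captures of the target URL held across many web
    # archives, regardless of which archiving tool produced them.
    for line in urlopen(timemap).read().decode("utf-8").splitlines():
        if 'rel="memento"' in line:
            print(line.split(";")[0].strip(" <>,"))   # an archived snapshot URL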
End-of-Term Harvest
- Library of Congress, Internet Archive, California Digital Library, University of North Texas, GPO
- Nomination tool for managing URLs for government agency sites
- Captures run at 4 institutions
- Content replicated by partner institutions
- Public access via Internet Archive
State of California Government Web Archive
Collaboration
- Between State agencies / site owners and libraries
- Across libraries
- Librarians and faculty
- Individual researchers
Questions?