Preserving Digital Culture: Tools & Strategies for Building Web Archives : Tools and Strategies for Building Web Archives Internet Librarian 2009 Tracy.

1 Preserving Digital Culture: Tools & Strategies for Building Web Archives
Internet Librarian 2009
Tracy Seneca – California Digital Library

2 Session Outline
Web content – why we’re building archives
Web crawling – under the hood
Tools available
Web Archiving Service Demo
Solutions for tough times – collaboration

3 Quick Tour

4 Web Content – Why Build Archives
Subject Area          Author      Sample Dates   Sample Size   Half-Life
Information Science   Goh & Ng    1997-2003      2,516         5 years
Computer Science      Spinellis   1995-1999      4,375         4 years
Law                   Rumsey      1997-2001      3,406         4 years
Medicine              Veronin     1998-1999      184           3 years
2003: International Internet Preservation Consortium
2005: NDIIPP funds Web-at-Risk, Web Archives Workbench
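One way to read the Half-Life column is as the time after which half of the sampled web citations no longer resolve. The sketch below assumes, purely for illustration, that links disappear at a constant rate (exponential decay); the slide does not state which model the cited studies used, and the function name is hypothetical.

```python
def fraction_surviving(years_elapsed, half_life_years):
    """Share of links still reachable after `years_elapsed`.

    Assumption (not from the slide): constant-rate, exponential decay.
    """
    return 0.5 ** (years_elapsed / half_life_years)


if __name__ == "__main__":
    # With a 4-year half-life, half the links remain after 4 years
    # and a quarter after 8 years under this assumed model.
    for years in (4, 8):
        print(years, fraction_surviving(years, half_life_years=4))
```

Under that assumption, a shorter half-life means a collection has to be captured sooner if it is to be captured at all.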

5 Threats to Web Content
Delivery model (many people access one copy)
Site redesigns
Normal maintenance
Political change
  – change of administration
  – policy changes
Format

6 Researcher’s Perspective
Study the topic / event
Study site change or web-based communication
Create stable citations for publications
Locate archived documents via catalog
Treat archive as a data set

7 What Makes an Archive
Collection development – site selection
Capture – harvesting content
Curation – description and QA
Publication – end user access

8 Archive Types
Topical
Event
Domain
Document
Personal

9 Under the Hood
Heritrix crawler
NutchWAX indexer
Open Source Wayback viewer
Open source tools from the Internet Archive

10 The Crawler
1. Where do I start?
2. Can I find that URL?
3. Is there a robots.txt?
4. What do I need to render that page? (CSS, graphics)
5. What links can I find?
6. Do those links fit the rules I was given?
7. Do I have a flash / PDF / javascript file?
8. Does that file have any links?
9. For every link that fits the rules, start over!
10. Keep going until I can’t find any more links or I hit my time limit.
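The crawler's questions above amount to a queue-driven loop. The following is a minimal sketch of that loop, not how Heritrix itself is implemented. Everything in it is an assumption made for illustration: `FAKE_WEB` is an in-memory stand-in for the live web so the example runs offline, the "rules" are reduced to "stay on the seed's host", robots.txt is reduced to a list of disallowed path prefixes, and a page limit stands in for the time limit. Link extraction from Flash, PDF, or JavaScript files (steps 7 and 8) is left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.org/": (
        '<a href="/about">About</a> '
        '<a href="http://other.example/">Elsewhere</a> '
        '<img src="/logo.png">'
    ),
    "http://example.org/about": (
        '<a href="/">Home</a> <a href="/private/x">Private</a>'
    ),
    "http://example.org/logo.png": "",
    "http://example.org/private/x": "not for crawlers",
    "http://other.example/": "out of scope",
}


class LinkExtractor(HTMLParser):
    """Collects href and src values: page links plus what the page
    needs in order to render (steps 4 and 5)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(seed, disallowed_prefixes=("/private/",), max_pages=100):
    """Breadth-first crawl from `seed`; returns URLs in capture order."""
    scope_host = urlparse(seed).netloc
    queue = deque([seed])           # step 1: where do I start?
    seen = {seed}
    captured = []
    while queue and len(captured) < max_pages:   # step 10: stop condition
        url = queue.popleft()
        body = FAKE_WEB.get(url)    # step 2: can I find that URL?
        if body is None:
            continue
        captured.append(url)
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            absolute = urljoin(url, link)
            parts = urlparse(absolute)
            in_scope = parts.netloc == scope_host            # step 6
            allowed = not parts.path.startswith(
                tuple(disallowed_prefixes))                  # step 3, simplified
            if in_scope and allowed and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)                       # step 9
    return captured


if __name__ == "__main__":
    for captured_url in crawl("http://example.org/"):
        print(captured_url)
```

In this toy run the off-site link and the disallowed path are never queued, which is the same scoping decision a curator makes when configuring capture settings.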

11 What the Crawler Spits Out
ARC / WARC files
  – All of the content lumped together in large files
  – Keeps the archive simple and manageable
  – Need special tools to search and display (NutchWAX, Open Source Wayback)
Massive amounts of content!
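To see why content "lumped together in large files" needs special tools to search and display, consider a deliberately simplified container. This is not the real ARC or WARC record layout; the header line, function names, and index structure below are all hypothetical. The point is only that many captured files live in one blob, and a separate index of byte offsets is what lets a viewer pull one page back out.

```python
import io


def write_container(records):
    """Append (url, payload_bytes) pairs to a single blob.

    Returns (container_bytes, index), where index maps each URL to the
    (offset, length) of its payload. Toy format, not ARC/WARC.
    """
    buffer = io.BytesIO()
    index = {}
    for url, payload in records:
        # Toy header: the URL and payload length on one line.
        buffer.write(f"{url} {len(payload)}\n".encode("utf-8"))
        offset = buffer.tell()
        buffer.write(payload)
        buffer.write(b"\n")
        index[url] = (offset, len(payload))
    return buffer.getvalue(), index


def read_record(container, index, url):
    """Use the index to slice one payload back out of the blob."""
    offset, length = index[url]
    return container[offset:offset + length]


if __name__ == "__main__":
    container, index = write_container([
        ("http://example.org/", b"<html>home</html>"),
        ("http://example.org/logo.png", b"\x89PNG"),
    ])
    print(read_record(container, index, "http://example.org/"))
```

Without the index, finding one page would mean scanning the whole file, which is roughly the job the indexing and viewing tools take on at scale.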

12 Why Should I Care?
When you navigate a web archive, you’re interacting with a very different file structure
These tools are constantly improving
  – Crawler gets better at capturing
  – Indexer gets better at ranking & scaling
The Web is constantly changing
  – New technologies, new obstacles

13 Tools Available: Considerations
Hosted vs. local
Cost
Public access
Discovery / search options
Capture configuration
QA / Analysis Tools
Metadata options
Training & Support
Ease of use
Limits to:
  – Users
  – Archives
  – Sites
  – Storage
Data Transfer
Data Configuration
Collaboration
Rights management

14 Tools Available
Hosted
  – Archive-It
  – Web Archiving Service
  – OCLC Web Harvester / CONTENTdm
  – Hanzo Web
Local Installation
  – Web Curator Tool
  – CONTENTdm
  – NetArchive Suite

15 Archive-It
Hosted by the Internet Archive
User-friendly interface, documentation, training
Capture target = entire collection
Public access automatic
Dublin Core metadata at seed level
Limits = storage, # collections, # seeds
Search full text, not metadata
Highlight: “Scope It”

16 http://webarchive.jira.com/wiki/display/ARIH/ Welcome

17 Web Curator Tool
Developed by National Library of New Zealand with input from the British Library and other IIPC members
User-friendly interface, strong user documentation for both technical staff and curators
Rights management module
Basic capture settings offered with access to all settings if needed
Assumes a strong division of labor / specific order of events
Capture target is flexible (sites or groups of sites)
Dublin Core metadata
Highlight: “Prune” tool

18 http://webcurator.sourceforge.net/

19 Web Archiving Service
Hosted by the California Digital Library
User-friendly interface, documentation, training
Capture target = site (flexible capture settings)
Public access (optional)
Some rights management features
Limits = storage
Search full text, not metadata
Highlight: “show me all the new PDF files”

20 http://was.cdlib.org
Web-based demos
User guides

21 Web Harvester / CONTENTdm
Harvester hosted by OCLC
Access either hosted or local
Flexible metadata
Search metadata, not full text (except PDF)
Same public access interface as CONTENTdm

22 NetArchive Suite
In use at Danish Royal Library 2004
OS release 2007
Tools developed for large scale and comprehensive domain capture
High degree of control over crawlers
High degree of in-house expertise required
Documentation targets technical staff, not curators
Highlight: QA tool that lets you click to grab missing images, files

23 Why have curatorial tools?

24 Web Archiving Service Demo

25 Rights Issues: Section 108 Study Group
No advance permission needed to capture freely available web content
“Freely available” = no login / fee
Content owners can prevent capture via robots.txt and may request take down
  – Except government agencies
Embargo period observed before archives are published
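The robots.txt opt-out mentioned above can be checked programmatically. Here is a small sketch using Python's standard `urllib.robotparser` module; the robots.txt text, the `example-archiver` user-agent string, and the `may_capture` helper are made up for illustration and do not describe how any of the tools in this talk perform the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish to keep
# one directory out of every crawler's reach.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
"""


def may_capture(robots_txt, url, user_agent="example-archiver"):
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/index.html",
                "http://example.org/drafts/memo.html"):
        print(url, may_capture(ROBOTS_TXT, url))
```

Whether a given archive honors or overrides such rules (for example, for government agency sites) is a policy decision outside the scope of this check.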

26 Large Scale Collaboration
International Internet Preservation Consortium
  – Improving capture & display tools
  – Beginning registry of archives
APIs to allow searches against different archives, no matter which archiving tool was used

27 End-of-Term Harvest
Library of Congress, Internet Archive, California Digital Library, University of North Texas, GPO
Nomination tool for managing 3000+ URLs for government agency sites
Captures run at 4 institutions
Content replicated by partner institutions
Public access via Internet Archive

28 State of California Government Web Archive

29 Collaboration between
State agencies/site owners and libraries
Across libraries
Librarians and faculty
Individual researchers

30 Questions?
tracy.seneca@ucop.edu
http://was.cdlib.org

