Preserving Digital Culture: Tools & Strategies for Building Web Archives. Internet Librarian 2009. Tracy Seneca.

Preserving Digital Culture: Tools & Strategies for Building Web Archives
Internet Librarian 2009
Tracy Seneca – California Digital Library

Session Outline
Web content – why we’re building archives
Web crawling – under the hood
Tools available
Web Archiving Service demo
Solutions for tough times – collaboration

Quick Tour

Web Content – Why Build Archives
Subject Area         Author     Sample Dates   Sample Size   Half-Life
Information Science  Goh & Ng   …              …,516         5 years
Computer Science     Spinellis  …              …,375         4 years
Law                  Rumsey     …              …,406         4 years
Medicine             Veronin    …              …             … years
2003: International Internet Preservation Consortium
2005: NDIIPP funds Web-at-Risk, Web Archives Workbench
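To give a feel for what a link half-life of 4 or 5 years means, here is a minimal sketch that assumes cited URLs disappear by simple exponential decay. That model is an assumption made for illustration only; the studies on the slide measured link survival empirically, and the function name is hypothetical.

```python
def surviving_fraction(years: float, half_life_years: float) -> float:
    """Fraction of cited URLs expected to still resolve after `years`.

    Assumes simple exponential decay, which is an illustrative
    simplification and not the method used by the cited studies.
    """
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 0.5 ** (years / half_life_years)


if __name__ == "__main__":
    # Half-lives of 4 and 5 years are the values legible on the slide.
    for half_life in (4, 5):
        for years in (1, 5, 10):
            share = surviving_fraction(years, half_life)
            print(f"half-life {half_life}y, after {years}y: {share:.0%} still reachable")
```

Under this assumption, a collection of citations with a 5-year half-life would keep only about a quarter of its working links after a decade, which is the practical argument for capturing content while it is still live.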

Threats to Web Content
Delivery model (many people access one copy)
Site redesigns
Normal maintenance
Political change
  – change of administration
  – policy changes
Format

Researcher’s Perspective
Study the topic / event
Study site change or web-based communication
Create stable citations for publications
Locate archived documents via catalog
Treat archive as a data set

What Makes an Archive
Collection development – site selection
Capture – harvesting content
Curation – description and QA
Publication – end user access

Archive Types
Topical
Event
Domain
Document
Personal

Under the Hood
Heritrix crawler
NutchWAX indexer
Open Source Wayback viewer
Open source tools from the Internet Archive

The Crawler
1. Where do I start?
2. Can I find that URL?
3. Is there a robots.txt?
4. What do I need to render that page? (CSS, graphics)
5. What links can I find?
6. Do those links fit the rules I was given?
7. Do I have a Flash / PDF / JavaScript file?
8. Does that file have any links?
9. For every link that fits the rules, start over!
10. Keep going until I can’t find any more links or I hit my time limit.
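The crawler's questions above amount to a queue-driven loop. The sketch below is a toy illustration of that loop, not Heritrix itself: the "web" is a hypothetical in-memory dictionary, the robots.txt check is a simple prefix list, and every name (FAKE_WEB, DISALLOWED_PREFIXES, crawl) is made up for the example.

```python
from collections import deque
import time

# Hypothetical in-memory "web": each URL maps to the links found in it.
FAKE_WEB = {
    "http://example.org/": [
        "http://example.org/a",
        "http://example.org/style.css",
        "http://other.example.com/",
    ],
    "http://example.org/a": [
        "http://example.org/b.pdf",
        "http://example.org/private/x",
    ],
    "http://example.org/style.css": [],
    "http://example.org/b.pdf": ["http://example.org/"],
    "http://example.org/private/x": [],
    "http://other.example.com/": [],
}

# Stand-in for robots.txt rules (step 3); a real crawler parses the file.
DISALLOWED_PREFIXES = ["http://example.org/private/"]


def allowed_by_robots(url):
    return not any(url.startswith(prefix) for prefix in DISALLOWED_PREFIXES)


def crawl(seed, scope_prefix, time_limit_s=5.0):
    """Return the URLs captured, in the order they were visited."""
    deadline = time.monotonic() + time_limit_s
    queue = deque([seed])          # step 1: where do I start?
    seen = {seed}
    captured = []
    while queue and time.monotonic() < deadline:   # step 10: stop conditions
        url = queue.popleft()
        if url not in FAKE_WEB:                    # step 2: can I find it?
            continue
        if not allowed_by_robots(url):             # step 3: robots.txt
            continue
        captured.append(url)
        for link in FAKE_WEB[url]:                 # steps 4-5, 7-8: embedded files and links
            in_scope = link.startswith(scope_prefix)   # step 6: fits the rules?
            if in_scope and link not in seen:
                seen.add(link)
                queue.append(link)                 # step 9: start over
    return captured


if __name__ == "__main__":
    for captured_url in crawl("http://example.org/", "http://example.org/"):
        print(captured_url)
```

The scope rule here is a bare URL prefix; real crawl scoping is much richer (hop limits, file-type rules, per-host budgets), which is why the capture settings differ so much between the tools discussed later.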

What the Crawler Spits Out
ARC / WARC files
  – All of the content lumped together in large files
  – Keeps the archive simple and manageable
  – Need special tools to search and display
      NutchWAX
      Open Source Wayback
Massive amounts of content!
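Why "special tools" are needed becomes clearer with a toy container. The sketch below is a deliberately simplified stand-in and does not follow the real ARC or WARC record layout: it appends every captured payload to one byte stream and keeps a separate index of offsets, which is roughly the job an indexer and viewer do for a real archive. All names are hypothetical.

```python
import io


def append_record(container, index, url, payload):
    """Append one capture to the container and remember where it starts."""
    offset = container.seek(0, io.SEEK_END)
    # Simplified header: "<url> <payload length>\n" (not the WARC format).
    header = f"{url} {len(payload)}\n".encode("utf-8")
    container.write(header)
    container.write(payload)
    index[url] = offset
    return offset


def read_record(container, index, url):
    """Jump straight to a capture using the index, without scanning the file."""
    container.seek(index[url])
    header = container.readline().decode("utf-8")
    stored_url, length = header.rsplit(" ", 1)
    payload = container.read(int(length))
    return stored_url, payload


if __name__ == "__main__":
    container = io.BytesIO()   # stands in for one large archive file
    index = {}
    append_record(container, index, "http://example.org/", b"<html>home</html>")
    append_record(container, index, "http://example.org/report.pdf", b"%PDF-1.4 ...")
    print(read_record(container, index, "http://example.org/report.pdf"))
```

Without the index, finding one page means reading the container from the start; with many large files that is impractical, so navigation and search depend on the indexing and replay tools.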

Why Should I Care?
When you navigate a web archive, you’re interacting with a very different file structure
These tools are constantly improving
  – Crawler gets better at capturing
  – Indexer gets better at ranking & scaling
The Web is constantly changing
  – New technologies, new obstacles

Tools Available: Considerations
Hosted vs. local
Cost
Public access
Discovery / search options
Capture configuration
QA / Analysis Tools
Metadata options
Training & Support
Ease of use
Limits to:
  – Users
  – Archives
  – Sites
  – Storage
Data Transfer
Data Configuration
Collaboration
Rights management

Tools Available
Hosted
  – Archive-It
  – Web Archiving Service
  – OCLC Web Harvester / CONTENTdm
  – Hanzo Web
Local Installation
  – Web Curator Tool
  – CONTENTdm
  – NetArchive Suite

Archive-It
Hosted by the Internet Archive
User-friendly interface, documentation, training
Capture target = entire collection
Public access automatic
Dublin Core metadata at seed level
Limits = storage, # collections, # seeds
Search full text, not metadata
Highlight: “Scope It”

Welcome

Web Curator Tool
Developed by National Library of New Zealand with input from the British Library and other IIPC members
User-friendly interface, strong user documentation for both technical staff and curators
Rights management module
Basic capture settings offered with access to all settings if needed
Assumes a strong division of labor / specific order of events
Capture target is flexible (sites or groups of sites)
Dublin Core metadata
Highlight: “Prune” tool

Web Archiving Service
Hosted by the California Digital Library
User-friendly interface, documentation, training
Capture target = site (flexible capture settings)
Public access (optional)
Some rights management features
Limits = storage
Search full text, not metadata
Highlight: “show me all the new PDF files”
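A question like “show me all the new PDF files” boils down to comparing two captures of the same site. The sketch below is only a guess at the idea and is not how the Web Archiving Service implements it: each capture is modeled as a hypothetical dictionary mapping URL to MIME type, and the function name is invented for the example.

```python
def new_pdfs(previous_capture, current_capture):
    """URLs that are PDFs in the current capture and absent from the previous one.

    Both arguments are hypothetical {url: mime_type} listings of a capture.
    """
    return sorted(
        url
        for url, mime_type in current_capture.items()
        if mime_type == "application/pdf" and url not in previous_capture
    )


if __name__ == "__main__":
    january = {
        "http://example.gov/": "text/html",
        "http://example.gov/budget-2008.pdf": "application/pdf",
    }
    february = {
        "http://example.gov/": "text/html",
        "http://example.gov/budget-2008.pdf": "application/pdf",
        "http://example.gov/budget-2009.pdf": "application/pdf",
        "http://example.gov/news.html": "text/html",
    }
    print(new_pdfs(january, february))
```

Comparing by URL alone misses documents that were replaced in place under the same address; catching those would need a content checksum per capture, which is left out here to keep the sketch short.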

Web-based demos User guides

Web Harvester / CONTENTdm
Harvester hosted by OCLC
Access either hosted or local
Flexible metadata
Search metadata, not full text (except PDF)
Same public access interface as CONTENTdm

NetArchive Suite
In use at Danish Royal Library: 2004
Open source release: 2007
Tools developed for large scale and comprehensive domain capture
High degree of control over crawlers
High degree of in-house expertise required
Documentation targets technical staff, not curators
Highlight: QA tool that lets you click to grab missing images, files

Why have curatorial tools?

Web Archiving Service Demo

Rights Issues: Section 108 Study Group
No advance permission needed to capture freely available web content
“Freely available” = no login / fee
Content owners can prevent capture via robots.txt and may request take down
  – Except government agencies
Embargo period observed before archives are published
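Honoring a site owner's robots.txt can be checked mechanically before capture. The following minimal sketch uses Python's standard urllib.robotparser module; the robots.txt content, the user-agent string, and the may_capture helper are all made up for the example, and the study group's exception for government agencies is a policy decision that the code does not model.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a site owner.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_capture(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    agent = "example-archive-bot"   # hypothetical crawler name
    for url in (
        "http://example.org/index.html",
        "http://example.org/private/report.pdf",
    ):
        print(url, "->", "capture" if may_capture(ROBOTS_TXT, agent, url) else "skip")
```

A take-down request or an embargo period cannot be expressed in robots.txt, so those remain curatorial workflows handled after capture.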

Large Scale Collaboration
International Internet Preservation Consortium
  – Improving capture & display tools
  – Beginning registry of archives
APIs to allow searches against different archives, no matter which archiving tool was used

End-of-Term Harvest
Library of Congress, Internet Archive, California Digital Library, University of North Texas, GPO
Nomination tool for managing URLs for government agency sites
Captures run at 4 institutions
Content replicated by partner institutions
Public access via Internet Archive

State of California Government Web Archive

Collaboration
Between State agencies / site owners and libraries
Across libraries
Librarians and faculty
Individual researchers

Questions?