Archiving and Preserving the Web. Dan Avery, Kristine Hanna, Merrilee Proffitt. Internet Archive, RLG. April 2006.

Presentation transcript:

1 Archiving and Preserving the Web Dan Avery Kristine Hanna Merrilee Proffitt Internet Archive RLG April 2006

2 Agenda
– RLG
– Internet Archive
– Archive-It
– Challenges
– The Future
– Q&A

3 The importance of archiving the web
The web contains much of what will be the basis of scholarship in the future:
– record of events
– official publications
– personal viewpoints
– ephemeral material

4 RLG’s interest
– RLG’s mission includes working with its member organizations to enhance their ability to provide research resources.
– RLG members have long been participating in web archiving, but so far this activity has been restricted to large organizations.

5 Members active in web archiving
– Bibliothèque Nationale de France
– British National Library
– California Digital Library
– Library of Congress
– National Library of Australia
– National Library of New Zealand

6 Archive-It pilot partners
– Indiana University
– International Institute of Social History
– University of Toronto
– Swarthmore/Haverford College

7 About Internet Archive
– Founded in 1996
– Largest public web archive
– 60 billion pages, 55 million sites
– Have expanded to include texts, audio, moving images, and software: 2.6 million downloads a day

8 What do we collect? Web Archive
– Take a broad snapshot of the web every 2 months
– 2 billion pages a month
– Websites from every domain (.org, .com, .edu, etc.)
– Content in 21 languages

9 Policy
– We follow the Oakland Archive Policy, 2002
– Founded by commercial and non-commercial organizations
– Opt-out policy
– We collect it all, and make it inaccessible if requested by the site owner
– Site owner directly blocks the harvester on their website
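As a rough illustration of the opt-out step, where a site owner blocks the harvester directly on the website: this is conventionally done with a robots.txt file, which is assumed here since the slide does not name the mechanism. The sketch below uses Python's standard `urllib.robotparser`. The user-agent string `ia_archiver` and the example URLs are illustrative assumptions, not taken from the slides.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt in which the site owner opts out of one crawler.
# "ia_archiver" is an assumed user-agent name used only for illustration.
EXAMPLE_ROBOTS = "User-agent: ia_archiver\nDisallow: /\n"

if __name__ == "__main__":
    print(is_allowed(EXAMPLE_ROBOTS, "ia_archiver", "http://example.org/page"))
    print(is_allowed(EXAMPLE_ROBOTS, "otherbot", "http://example.org/page"))
```

A crawler that honors the file would skip the site in the first case and proceed in the second, since no rule applies to the other user agent.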

10 Access to Web Archive
– Entire archive accessible for free to the public via the website at
– Receive 100 hits/second
– 60k unique users per day
– Evolving/Fluid: through public use we hope to find out what is important and to continuously improve

11 Why try to collect and preserve it all?
– Web has no boundaries, no limits
– What will be important?
– What is there today may be gone tomorrow
  – “Capture now, ask why later”
  – “Grab it while you can, work it out later”
  – “Lose as little as possible”

12 How do we collect it?
– Open source technology, primarily developed by Internet Archive and IIPC
– Heritrix: web crawler
– Wayback Machine: access tool for rendering and viewing files
– NutchWAX: search engine
– ARC file: archival record format (ISO work item)
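To make the ARC archival record format a little more concrete, here is a minimal sketch that parses the header line of a single record. It assumes the version 1 layout of five space-separated fields (URL, IP address, 14-digit archive date, content type, record length). The class name, function name, and sample line are hypothetical, and real ARC files also carry a file-level version block that is not handled here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # number of bytes of content following the header line


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one header line, assuming the five-field version 1 layout."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip_address, date, content_type, length = fields
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


if __name__ == "__main__":
    # Made-up sample header line, for illustration only.
    sample = "http://www.example.org/ 192.0.2.1 20060401120000 text/html 1234"
    print(parse_arc_header(sample))
```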

13 Wayback Machine
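For readers unfamiliar with how the Wayback Machine addresses captures, the sketch below builds a replay URL from an original URL and a capture time, using the public `web.archive.org/web/<timestamp>/<url>` pattern. The function name and the example URL are illustrative. The service resolves a requested timestamp to the closest available capture, so the sketch only constructs the address and does not check that a capture exists.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a Wayback Machine replay URL for original_url near `when`."""
    # Wayback timestamps are 14 digits: YYYYMMDDhhmmss.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(wayback_url("http://www.example.org/", datetime(2006, 4, 1)))
```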

14 Preservation
– Store multiple copies of each archive
– 1300 machines/servers
– Multiple copies at different geographical locations (U.S., Alexandria, Amsterdam)
– Standard storage boxes, open source design

15 Next Steps
Institutions:
– need to create collections around primary source web material
– want to do more than broad crawling, with specific and complete web archives
– want a technology partner that could harvest, index, access, store and preserve their collections for them

16 1. Partner Contract Crawls
In 2002, began to form partnerships with Library of Congress, NARA and other national libraries, including Australia and France.
– Library of Congress collections:
  – Iraq War: 450,000,000 documents and growing
  – U.S. National Elections
    – 2000: 131,331,973 documents
    – 2004: 87,481,265 documents
  – Supreme Court Nomination 2005: 100 million documents

17 2. Archive-It
– Last year, in early 2005, we had requests from state archivists, university librarians and other memory institutions to expand our archiving services and develop an application that acknowledges resource constraints
– Developed Archive-It, a web-based service that allows partners to create, manage, search and store their web archives through an easy-to-use web interface
– Does not require technical expertise or infrastructure
– Pilot launched in September
– Release in February
– 1.5 Release in April
– 2.0 Release in July

18 Pilot Partners Center for Research Libraries Research Libraries Group (U of Toronto, U of Indiana, Haverford and Swarthmore Colleges, IISH) University of Texas Library of Virginia State Archives South Dakota State Archives North Carolina State Archives Alabama Minnesota Historical Society Institut d'Etude Politique de Grenoble

19 Archive-It
– Release in February
– 1.5 Release in April
– 2.0 Release in July

20 Archive-It Collections
Some samples:
– Virginia’s political landscape, 2005 (Gov. Mark Warner)
– Hurricane Katrina
– Jamestown 2007 Commemoration

21 Archive-It Access
All collections are accessible for free to the general public, with text search, at:
– org
– Partners’ websites with links
Plus, member web application with login

22 Demo

23 Dan’s slides Tech

24 Challenges we face
– Making the collections useful for a variety of end users (i.e. general public, researchers)
– Making sure we capture the best and most relevant content
– Continuing to develop our tools for access and harvesting (crawler.archive.org)

25 Internet Archive’s priorities
Collaboration and Partnerships
– Continue to act as a technology partner in providing web archiving services to government and memory institutions
– Continue to develop open source software
– Develop common tools, storage formats and standards through the IIPC (International Internet Preservation Consortium)
– Open Content Alliance (OCA) digital books project
Multiple copies across the world
– Within IA’s own facilities and with partners such as LC, BnF, Library of Alexandria

26 RLG’s web archiving program
– Collaborative collection development
– Descriptive metadata for web archives
– Usability/user studies
– Intellectual property concerns
– Web Archiving 101
– Web archiving services and software