Web Capture team Office of strategic initiatives February 27, 2006 Selecting Content from the Web: Challenges and Experiences of the Library of Congress.

Presentation transcript:

Selecting Content from the Web: Challenges and Experiences of the Library of Congress
Abbie Grotke, Web Capture Team, Office of Strategic Initiatives
CRL Workshop, February 27, 2006

Agenda
Why Collect the Web?
Web Collections at the Library of Congress
Policy Issues and Technical Activities
Project: Selecting and Managing Content Captured from the Web
Our Partnerships and International Collaborations

Why Collect the Web?
Digital Preservation Goals of the Library of Congress
–Preserve our nation’s history and culture
–Identify and preserve at-risk digital content
–Support development of tools, models, and methods for digital preservation

The Early Days
Feb 2000: “how do we collect the Web?” led to MINERVA prototype
Special project team initially formed: cataloging, legal, public services, technology services staff
Early partnerships:
–Internet Archive
–WebArchivist.org
From project to program…
–2003: Web Capture team formed
–2004: Some of MINERVA team joined Web Capture

Web Collections
*Election 2000: 767 seed URLs
*September 11th: 30,
Winter Olympics: 70
Sept 11 Remembrance: 1,800
*Election 2002: 3,
th-109th Congress: 588
Iraq War: 300
Election 2004: 2000
Papal Transition: 200
Katrina: 818
Supreme Court: 285
*public access available through

Current Library Collecting Efforts
Iraq War (ongoing)
109th Congress (ongoing)
Darfur
Over 40 TB of data collected to date!

The Web Capture Process at LC
Collection Planning
Selection
Notification/Permissions
Technical Review
Crawl & QA
Cataloging
Interface Development
Legal Review
Access
Store & Manage

Technical Activities
Current activity in areas of:
–Selection and permission gathering
    Web Collection Management System
–Acquisition: crawling and collection
    Heritrix
–Access and display
    Full text searching, Wayback replacement
–Collection analysis and preservation

Policy Issues
Need to seek clear and consistent intellectual property protocols for crawling
–Section 108 Study Group may provide hope
What content should we now be collecting?
How long should we collect it?
Once we collect it, how do we make it available to our staff and public users?
Do we share collecting efforts (costs, time) with partners? If so, how?

Various Web Collection Strategies
Entire Web domain -- Internet Archive
National domain (.se) -- Sweden, France, others
Selective (individual URLs) and thematic -- Australia
Thematic or event based -- Library of Congress
Other strategies LC is exploring
–Acquire collections gathered by others
–Establish relationships with producers to acquire their content

Selection
LC’s Collection Policy Statements
Collection planning defines:
–Collection scope
    Description
    Types of sites
    Frequency
–Categories of sites
    X category of site gets Y type permission
    Reporting
    Possible other uses: cataloging, access points
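The rule "X category of site gets Y type permission" can be pictured as a simple lookup table. The sketch below is illustrative only: the category names, the two permission types, and the `permission_for` helper are all hypothetical, since the slides do not say which categories or permission types LC actually uses.

```python
from enum import Enum


class Permission(Enum):
    """Hypothetical permission types; the slides do not name the real ones."""
    NOTIFY_ONLY = "notify the site owner, then crawl"
    REQUEST_PERMISSION = "ask for explicit permission before crawling"


# Hypothetical category names, chosen only for illustration.
PERMISSION_BY_CATEGORY = {
    "government": Permission.NOTIFY_ONLY,
    "news": Permission.REQUEST_PERMISSION,
    "advocacy": Permission.REQUEST_PERMISSION,
}


def permission_for(category: str) -> Permission:
    """Look up the permission type for a site category.

    Unknown categories fall back to the stricter option, which is an
    assumption of this sketch rather than a documented LC policy.
    """
    return PERMISSION_BY_CATEGORY.get(category.lower(), Permission.REQUEST_PERMISSION)
```

Deciding the mapping once, during collection planning, means each nominated site only needs a category assigned before it can move on to the permissions step.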

Other considerations
What does the recommender want?
–complete site
–single document, page, or section
Can we get it and provide access to it?
–crawler and access tool limitations
–deep web
–scoping
–permission

Selecting and Managing Content Captured from the Web
One-year project to address:
–Roles and responsibilities for lifecycle management of archived Web content
–Single-site collecting vs. thematic collecting
–Copyright permissions and notifications
–Exploring how technical aspects of Web sites affect selection criteria
–Expanding staff participation

Additional Objectives
Learn by doing
–Practical experience is key
–Collection planning
–Permissions planning
–Content collection
–Quality review: did we get what was wanted?
Further document resource requirements and workflow (staff/time)
Inform and educate other Library staff

Project Participants
Four Content Groups
–Darfur
–Visual Image
–Manuscript Organizations
–Single Site
Bibliographic and Lifecycle Subgroups
Management Oversight Committee

Training
Workshops
–Selection
–Technology of Web Capture
–Copyright and Permissions
–Access tools overview
Tools training
–For Recommenders: How to nominate a URL for archiving
–For Selection Coordinators: How to use the tool to move through selection and permissions process
Ongoing support, refreshers as needed

Some Big Challenges
Defining new roles and responsibilities (and actually doing them)
Resource limitations: everyone is busy and selection and permissions take a lot of time
Finding the geek balance: too much vs. too little technical information
Do LC’s traditional selection policies fit Web content?

Crisis in Darfur, Sudan

Crisis in Darfur, Sudan
Approximately 200 seed URLs selected
–Sampling of news reports
–Scholarly reports and studies
–Responses of
    Government
    Public (Web logs, etc.)
    Key organizations and their Web sites, some formed in response to crisis
–About 25 sites in other languages, mostly Arabic
Started crawling February 20, 2006
–Weekly, Monthly, One time
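The three crawl frequencies named on this slide (weekly, monthly, one time) suggest a per-seed schedule. The following is a minimal sketch of how such a schedule might be checked, not a description of LC's tooling or of Heritrix: the `is_due` function, the frequency labels, and the 30-day approximation of "monthly" are all assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed intervals; "monthly" is approximated as 30 days in this sketch.
INTERVALS = {
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=30),
}


def is_due(frequency: str, last_crawled: Optional[date], today: date) -> bool:
    """Return True if a seed with this frequency should be crawled today."""
    if last_crawled is None:
        # Never crawled yet: every seed is due at least once.
        return True
    if frequency == "one-time":
        # One-time seeds are not revisited after the first capture.
        return False
    return today - last_crawled >= INTERVALS[frequency]
```

For example, a weekly seed first captured on February 20, 2006 would be due again on February 27, 2006, while a monthly seed captured the same day would not.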

Upcoming tasks
Review results of crawl
–Technical Team Quality Review
–Curator QA Quality Review
Initiate permissions and collecting of Manuscript, Visual Image, and Single Site collections
Full-text indexing search testing
Further explore lifecycle management issues

A Web of Archiving Initiatives
[Diagram. Labels shown: CDL, UNT, LC, IA, UIUC, IIPC, BL, OCLC, RLG, Archive-It Partners, NLA, BNF, UKWAC, NYU, Norway, Finland, Denmark, Sweden, NARA; several groups are labeled "Collecting Partners".]

National Partnerships and Collaborations
University of California Digital Library
–The Web at Risk: A Distributed Approach to Preserving our Nation’s Political Cultural Heritage
Internet Archive
–Testing the storage, data maintenance and access of collected Web content
Information sharing with other US government agencies
–Government Printing Office
–National Archives and Records Administration

International Collaborations
International Internet Preservation Consortium (IIPC)
–Collect and preserve a rich body of Internet content from around the world
–To foster the development and use of common tools, techniques and standards
–To encourage and support national libraries everywhere to address Internet collecting and preservation
–Share experience and best practices

IIPC Members
France (lead)
Italy
Denmark
Finland
Iceland
Canada
Norway
Australia
Sweden
United Kingdom
Internet Archive, USA
Library of Congress, USA

Upcoming Directions
Better tools for supporting selection
Improving access tools
Better crawl management
Large-scale collection storage approach: Repository

Questions?
Abbie Grotke