Web Capture team Office of strategic initiatives February 27, 2006 Selecting Content from the Web: Challenges and Experiences of the Library of Congress.

1 Selecting Content from the Web: Challenges and Experiences of the Library of Congress
Abbie Grotke
Web Capture Team, Office of Strategic Initiatives
CRL Workshop, February 27, 2006

2 Agenda
Why Collect the Web?
Web Collections at the Library of Congress
Policy Issues and Technical Activities
Project: Selecting and Managing Content Captured from the Web
Our Partnerships and International Collaborations

3 Why Collect the Web?
Digital Preservation Goals of the Library of Congress
Preserve our nation’s history and culture
Identify and preserve at-risk digital content
Support development of tools, models, and methods for digital preservation

4 The Early Days
Feb 2000: “how do we collect the Web?” led to MINERVA prototype (www.loc.gov/minerva)
Special project team initially formed: cataloging, legal, public services, technology services staff
Early partnerships:
  –Internet Archive (www.archive.org)
  –WebArchivist.org
From project to program…
  –2003: Web Capture team formed
  –2004: Some of MINERVA team joined Web Capture

5 Web Collections 2000-2006
*Election 2000: 767 seed URLs
*September 11th: 30,000+
2002 Winter Olympics: 70
Sept 11 Remembrance: 1,800
*Election 2002: 3,000
107th-109th Congress: 588
Iraq War: 300
Election 2004: 2,000
Papal Transition: 200
Katrina: 818
Supreme Court: 285
*public access available through www.loc.gov/minerva

6 Current Library Collecting Efforts
Iraq War (ongoing)
109th Congress (ongoing)
Darfur
Over 40 TB of data collected to date!

7 The Web Capture Process at LC
Collection Planning
Selection
Notification/Permissions
Technical Review
Crawl & QA
Cataloging
Interface Development
Legal Review
Access
Store & Manage

8 Technical Activities
Current activity in areas of:
  –Selection and permission gathering
      Web Collection Management System
  –Acquisition: crawling and collection
      Heritrix
  –Access and display
      Full text searching, Wayback replacement
  –Collection analysis and preservation

9 Policy Issues
Need to seek clear and consistent intellectual property protocols for crawling
  –Section 108 Study Group may provide hope: http://www.loc.gov/section108/
What content should we now be collecting?
How long should we collect it?
Once we collect it, how do we make it available to our staff and public users?
Do we share collecting efforts (costs, time) with partners? If so, how?

10 Various Web Collection Strategies
Entire Web domain -- Internet Archive
National domain (.se) -- Sweden, France, others
Selective (individual URLs) and thematic -- Australia
Thematic or event based -- Library of Congress
Other strategies LC is exploring
  Acquire collections gathered by others
  Establish relationships with producers to acquire their content

11 Selection
LC’s Collection Policy Statements
Collection planning defines:
  –Collection scope
      Description
      Types of sites
      Frequency
  –Categories of sites
      X category of site gets Y type permission
      Reporting
      Possible other uses – cataloging, access points
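
The rule that "X category of site gets Y type permission" can be pictured as a simple lookup table. The sketch below is illustrative only: the category names, the permission types, the default value, and the `permission_for` helper are all hypothetical and are not taken from LC's collection plans.

```python
# Hypothetical category -> permission mapping. The names below are
# illustrative assumptions, not LC's actual categories or permission types.
PERMISSION_BY_CATEGORY = {
    "government": "notify_only",
    "news": "request_permission",
    "advocacy": "request_permission",
}

# Assumed fallback for categories the collection plan does not cover.
DEFAULT_PERMISSION = "needs_review"


def permission_for(category: str) -> str:
    """Return the permission type planned for a category of site."""
    key = category.strip().lower()
    return PERMISSION_BY_CATEGORY.get(key, DEFAULT_PERMISSION)


if __name__ == "__main__":
    for cat in ("Government", "news", "blog"):
        print(cat, "->", permission_for(cat))
```

Keeping the rule in one table means a collection plan can change a category's permission type without touching the rest of the workflow.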

12 Other considerations
What does the recommender want?
  –complete site
  –single document, page, or section
Can we get it and provide access to it?
  –crawler and access tool limitations
  –deep web
  –scoping
  –permission

13 Selecting and Managing Content Captured from the Web
One-year project to address:
  –Roles and responsibilities for lifecycle management of archived Web content
  –Single-site collecting vs. thematic collecting
  –Copyright permissions and notifications
  –Exploring how technical aspects of Web sites affect selection criteria
  –Expanding staff participation

14 Additional Objectives
Learn by doing
  –Practical experience is key
  –Collection planning
  –Permissions planning
  –Content collection
  –Quality review: did we get what was wanted?
Further document resource requirements and workflow (staff/time)
Inform and educate other Library staff

15 Project Participants
Four Content Groups
  –Darfur
  –Visual Image
  –Manuscript Organizations
  –Single Site
Bibliographic and Lifecycle Subgroups
Management Oversight Committee

16 Training
Workshops
  –Selection
  –Technology of Web Capture
  –Copyright and Permissions
  –Access tools overview
Tools training
  –For Recommenders: How to nominate a URL for archiving
  –For Selection Coordinators: How to use the tool to move through selection and permissions process
Ongoing support, refreshers as needed

17 Some Big Challenges
Defining new roles and responsibilities (and actually doing them)
Resource limitations: everyone is busy and selection and permissions take a lot of time
Finding the geek balance: too much vs. too little technical information
Do LC’s traditional selection policies fit Web content?

18 Crisis in Darfur, Sudan

19 Crisis in Darfur, Sudan
Approximately 200 seed URLs selected
  –Sampling of news reports
  –Scholarly reports and studies
  –Responses of
      Government
      Public (Web logs, etc.)
      Key organizations and their Web sites, some formed in response to crisis
  –About 25 sites in other languages, mostly Arabic
Started crawling February 20, 2006
  –Weekly, Monthly, One time
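
The three capture frequencies listed above (weekly, monthly, one time) could be turned into crawl dates along the following lines. This is a minimal sketch under stated assumptions: the `next_crawl` function and the frequency labels are hypothetical, and it does not describe how LC's crawler is actually scheduled.

```python
import calendar
from datetime import date, timedelta
from typing import Optional


def next_crawl(last: date, frequency: str) -> Optional[date]:
    """Return the date of the next crawl, or None for a one-time capture.

    The frequency labels ("weekly", "monthly", "one_time") are assumed
    names for the three options on the slide.
    """
    if frequency == "weekly":
        return last + timedelta(weeks=1)
    if frequency == "monthly":
        # Move to the same day of the following month, rolling the year over.
        year = last.year + (1 if last.month == 12 else 0)
        month = 1 if last.month == 12 else last.month + 1
        # Clamp the day so that, e.g., Jan 31 maps to the last day of February.
        day = min(last.day, calendar.monthrange(year, month)[1])
        return date(year, month, day)
    if frequency == "one_time":
        return None
    raise ValueError(f"unknown frequency: {frequency!r}")


if __name__ == "__main__":
    start = date(2006, 2, 20)  # crawl start date given on the slide
    for freq in ("weekly", "monthly", "one_time"):
        print(freq, next_crawl(start, freq))
```

Returning `None` for a one-time capture lets a scheduler drop the seed from its queue after the first crawl.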

20 Upcoming tasks
Review results of crawl
  –Technical Team Quality Review
  –Curator QA Quality Review
Initiate permissions and collecting of Manuscript, Visual Image, and Single Site collections
Full-text indexing search testing
Further explore lifecycle management issues

21 A Web of Archiving Initiatives
[Diagram of organizations: CDL, UNT, LC, IA, UIUC, IIPC, BL, OCLC, RLG, NLA, BNF, UKWAC, NYU, NARA, Norway, Finland, Denmark, Sweden, Archive-it Partners, Collecting Partners]

22 National Partnerships and Collaborations
University of California Digital Library
  –The Web at Risk: A Distributed Approach to Preserving our Nation’s Political Cultural Heritage
Internet Archive
  –Testing the storage, data maintenance and access of collected Web content
Information sharing with other US government agencies
  –Government Printing Office
  –National Archives and Records Administration

23 International Collaborations
International Internet Preservation Consortium (IIPC)
  –Collect and preserve a rich body of Internet content from around the world
  –To foster the development and use of common tools, techniques and standards
  –To encourage and support national libraries everywhere to address Internet collecting and preservation
  –Share experience and best practices

24 IIPC Members
France (lead)
Italy
Denmark
Finland
Iceland
Canada
Norway
Australia
Sweden
United Kingdom
Internet Archive, USA
Library of Congress, USA
http://netpreserve.org/

25 Upcoming Directions
Better tools for supporting selection
Improving access tools
Better crawl management
Large-scale collection storage approach: Repository

26 Questions?
Abbie Grotke
abgr@loc.gov

