Use cases for BnF broad crawls Annick Lorthios

2 Step by step, the first in-house broad crawl
The 2010 broad crawl was performed in-house at the BnF (instead of by the Internet Archive).
Workflow:
- Planning
- Crawl design
- Crawl monitoring
- Quality control
- Lessons learnt
Goals:
- Number of harvested URLs: 530 to 800 million
- Compressed weight: 20 to 30 TB
- Duration: 5 to 10 weeks

3 Planning
- Goal: avoid technical, content or legal risks
- Start of the project to use NAS (NetarchiveSuite) as our main tool
- Contract (SLA) between the IT team and the digital curators
- Responses to producers' requests and complaints

4 Crawl design
- Constitution of the seed list
- Use of a special pre-load database
- Choice of general settings
- Configuration of the architecture
- Tests of NAS

5 Crawl monitoring
- Objective: make it possible to finish the crawl within the defined time
- Consultation of the NAS monitoring interface
- Intervention in the Heritrix monitoring interface to check certain domains
- Overview of the Web media

6 Quality control
- Use of an external tool for statistics and metrics
- 15% of the 2010 broad crawl collection is video and audio files
- Use of the Wayback Machine to browse the visual result (samples!)
- Number of harvested URLs: 832 million
- Compressed weight: 23.6 TB
- Duration: 12 weeks
- ARC files: 240,000
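
The figures on slides 2 and 6 can be compared with a little arithmetic. The sketch below is not part of the presentation: the function names are hypothetical, and it assumes decimal units (1 TB = 1,000,000 MB). It checks each observed value against the planned range and derives the average compressed size of an ARC file.

```python
# Planned ranges (slide 2) and observed results (slide 6) of the 2010 broad crawl.
PLANNED = {
    "urls_millions": (530, 800),
    "compressed_tb": (20, 30),
    "duration_weeks": (5, 10),
}
ACTUAL = {"urls_millions": 832, "compressed_tb": 23.6, "duration_weeks": 12}
ARC_FILES = 240_000


def compare(planned, actual):
    """Return 'below', 'within' or 'above' the planned range for each metric."""
    verdicts = {}
    for name, (low, high) in planned.items():
        value = actual[name]
        if value < low:
            verdicts[name] = "below"
        elif value > high:
            verdicts[name] = "above"
        else:
            verdicts[name] = "within"
    return verdicts


def mean_arc_size_mb(total_tb, n_files):
    """Average compressed ARC size in MB (assumption: 1 TB = 1,000,000 MB)."""
    return total_tb * 1_000_000 / n_files


if __name__ == "__main__":
    print(compare(PLANNED, ACTUAL))
    print(round(mean_arc_size_mb(ACTUAL["compressed_tb"], ARC_FILES), 1))
```

Under these assumptions the crawl exceeded the planned URL count and duration while staying within the planned compressed weight, and the average ARC file comes out at a little under 100 MB.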

7 Lessons learnt
- NAS is retained for the organisation of future crawls
- NAS led the IT team and the digital curators to invent new forms of working relationship
- NAS is good for configuring harvest definitions, but we must be careful not to create too many seed lists
YES

Selective crawls at the BnF Peter Stirling

9 Current organisation of selective crawls
- Selective crawls complement broad crawls (sites outside .fr, sites to be collected in depth…)
- Three main kinds:
  - Thematic (BnF departments)
  - Project (elections, blogs…)
  - Cross-departmental (news, large sites)
- Nomination of sites by a network of curators at the BnF and external partners
- Currently handled directly with Heritrix; the workflow is being transferred to NAS

10 A community of librarians
- Network of correspondents: c. 80 librarians across the different thematic departments of the BnF
- One coordinator for each department; 1 to 15 correspondents, depending on the size of the department
- Capitalise on subject knowledge
- Engage librarians in Web archiving activity, make it business as usual at the BnF
- In practice: meetings with coordinators to define collection policy, training sessions, workshops…

11 Tools
- Previous curator tool (GDLWeb) allowed curators to nominate sites
- For election crawls, a special nomination tool allows remote access and classification of nominated sites
- Curators define seed, depth, frequency and budget
- Validation by the web archiving team and transfer to the IT team for crawling with Heritrix
- A new curator tool to work directly with NAS will be developed in the first half of 2011
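
The nomination workflow on this slide (curators define seed, depth, frequency and budget; the web archiving team validates; the IT team crawls) can be pictured as a small record passing through two steps. This is only an illustrative sketch: the class, its field units and the function names are hypothetical and do not describe GDLWeb, NAS or any BnF tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Nomination:
    """A site nominated by a curator (hypothetical structure)."""
    seed: str                 # starting URL
    depth: int                # assumption: number of link hops from the seed
    frequency: str            # e.g. "yearly", "twice a year", "daily"
    budget: int               # assumption: maximum number of URLs to harvest
    validated: bool = False   # set by the web archiving team


def validate(nomination):
    """Web archiving team step: return a validated copy of the nomination."""
    return replace(nomination, validated=True)


def ready_for_crawl(nominations):
    """IT team step: keep only the nominations that have been validated."""
    return [n for n in nominations if n.validated]


if __name__ == "__main__":
    pending = Nomination("http://www.example.fr/", depth=2,
                         frequency="yearly", budget=10_000)
    print(ready_for_crawl([pending, validate(pending)]))
```

Keeping the record immutable and returning a validated copy makes the hand-over between curators, the web archiving team and the IT team explicit.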

12 Size/frequency of selective crawls
- Thematic crawls generally performed once a year; other frequencies to be put in place with NAS
- Project crawls can be more frequent (twice a year, multiple crawls during elections…)
- c. 20,000 seed URLs across all selective crawls
- Ranges from c. 50 (Maps Department) to almost 6,000 (Elections 2007)
- Estimate for 2010: 185 million harvested URLs, 12 TB

13 Cross-departmental selections
- Tests currently underway on crawls of news sites and large sites, to be launched in October
- Sites that are of interest to the whole library and that have specific technical needs (daily crawls, crawl in depth)
- c. 80 news sites, 10 large sites (up to several million URLs)
- Developments to NAS: monitoring of jobs

14 Monitoring a job with NAS
- History of job
- Graphs to show progress
- List of longest queues (also as CSV file)
- Access to Heritrix console

15 Tests on news/large sites
Positive:
- Use of NAS to automatically launch daily crawls
- Monitoring made easier
- Access to information on jobs within the NAS interface
Negative:
- Queues by domain, not by host
- Budget management
- Use of general or domain-specific filters only; not possible to filter differently by Harvest Definition
- Still need for external tools (quality control…)
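
The "queues by domain, not by host" point can be illustrated with a toy grouping of URLs. The sketch below is not NAS or Heritrix code: the function names and example URLs are hypothetical, and the domain is approximated naively as the last two labels of the host name, which ignores multi-label suffixes such as co.uk.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Full host name of the URL, e.g. 'blog.example.fr'."""
    return urlsplit(url).hostname or ""


def naive_domain_of(url):
    """Last two labels of the host name (naive assumption, see lead-in)."""
    return ".".join(host_of(url).split(".")[-2:])


def group_queues(urls, key):
    """Group URLs into queues according to the given key function."""
    queues = defaultdict(list)
    for url in urls:
        queues[key(url)].append(url)
    return dict(queues)


URLS = [
    "http://blog.example.fr/post/1",
    "http://www.example.fr/",
    "http://video.example.fr/clip/42",
    "http://www.example.org/",
]

by_domain = group_queues(URLS, naive_domain_of)
by_host = group_queues(URLS, host_of)

if __name__ == "__main__":
    print(len(by_domain), "queues by domain")
    print(len(by_host), "queues by host")
```

With grouping by domain, all three example.fr hosts share a single queue, which is one way a large site with many subdomains can end up as one very long queue.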

16 Transfer of existing workflow to NAS
- Use of other frequencies (demand expressed by curators): continuous crawling
- How do we keep our current organisation while adapting it to NAS? Use of Harvest Definitions…
- Further developments to NAS: budgets, filters, quality control…
- Development of curator tool:
  - Management of selections by curators
  - Validation by web archiving team
  - Interaction with NAS