Annick Le Follic, Bibliothèque nationale de France. Tallinn, 2015-01-29


BnF needs

Objectives
- Characterize BnF web collections
- Manage the activity of the digital legal deposit team
- Describe BnF web data to be preserved

Two main kinds of metrics: harvesting and preservation

From experimentation…
- Scripts and Heritrix reports from Internet Archive engineers
- A dedicated application developed by BnF engineers

International environment

… to the standardization of the metrics
- Definition of concepts and standards with an ISO working group
- Dedicated statistics have to be included in the Library general performance statistics
- Experience sharing within the IIPC
- Many libraries have changed from ARC to WARC
- BnF has developed a specific tool (NAS_qual) for its first internal broad crawl in

Benefits of ISO report for BnF

- Adoption of strict definitions of terms
- Main metrics chosen for collection development

Statistic | Purpose | Example
Number of targets | Objectives of the collection | 8,000 targets
Total number of URLs | Quantity of information in web archive | 14 billion URLs
Total compressed size stored | Overall size of web archive | 200 terabytes
Number of container files | Number of conservation units in archive | 18,000 WARC files

- At a more refined level, collection characterisation

Statistic | Purpose | Example
Distribution by top level domain | Geographic distribution | 70 % of collection in .fr TLD
Distribution by format types | Document type characterisation | 60 % of collection in text/html

BnF method

A table lists and characterizes all possible metrics
- A code and a name for each one
- The source report
- The calculation method
- The needed tool (scripts, NAS_qual…)

Main difficulties
- Difference of scale between broad and selective crawl
- Compressed or uncompressed size
- Collected or processed URLs
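As an illustration only, one row of such a metrics table could be modelled as a small record. The sketch below is not the BnF table itself: the class name, the codes `M01` and `M02`, and the `tool` values are hypothetical placeholders; only the four kinds of fields (code and name, source report, calculation method, tool) come from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MetricDefinition:
    """One row of a metrics table (field names are assumptions)."""
    code: str            # hypothetical identifier, e.g. "M01"
    name: str            # human-readable metric name
    source_report: str   # report the metric is computed from
    calculation: str     # short description of the calculation method
    tool: str            # tool needed to compute it (placeholder values below)


# Two example rows; the codes and tool assignments are invented for illustration.
CATALOGUE = [
    MetricDefinition(
        code="M01",
        name="Number of TLD",
        source_report="hosts-report.txt",
        calculation="Count distinct TLDs with at least one URL harvested",
        tool="script",
    ),
    MetricDefinition(
        code="M02",
        name="Number of MIME types",
        source_report="mimetype-report.txt",
        calculation="Count distinct MIME types over all reports",
        tool="script",
    ),
]


def metrics_for_source(catalogue: List[MetricDefinition], source_report: str) -> List[str]:
    """Return the names of all metrics computed from a given report."""
    return [m.name for m in catalogue if m.source_report == source_report]
```

Keeping the source report and the calculation method next to each metric name makes it easier to see which figures depend on the same input file.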

 | Production | Preservation
Sources | NetarchiveSuite (and Heritrix) | SPAR
Statistics tools | NAS_qual | SPAR indicators
Exploitation | Excel files

Description of top level domain

Metrics description
  Name: Number of TLD
  Description: Number of unique top level domains harvested
Metrics elaboration
  Data source: hosts-report.txt (N files)
  Calculation: Extract the TLD from each host name. Count the distinct TLDs with at least one URL harvested. Be careful: the same TLD can occur in several hosts reports.
Purpose
- To characterize a collection in terms of geographic distribution (e.g. France)
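The calculation described for the number of TLD can be sketched as follows. This is a minimal sketch, not the BnF script or NAS_qual: it assumes each report line is whitespace-separated with the URL count in the first column and the host name in the third, and that the header line starts with "[". Function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def tld_url_counts(report_lines: Iterable[str]) -> Counter:
    """Sum harvested URL counts per TLD over lines taken from one or more hosts reports.

    Assumed line layout: "<#urls> <#bytes> <host> ..." (an assumption, see lead-in).
    """
    counts: Counter = Counter()
    for line in report_lines:
        fields = line.split()
        # Skip blank or short lines and the bracketed header line.
        if len(fields) < 3 or fields[0].startswith("["):
            continue
        try:
            urls = int(fields[0])
        except ValueError:
            continue
        # Drop any port suffix and trailing dot, normalise case.
        host = fields[2].split(":")[0].rstrip(".").lower()
        # Ignore hosts with no harvested URL and pseudo-hosts without a dot.
        if urls < 1 or "." not in host:
            continue
        tld = host.rsplit(".", 1)[1]
        # Ignore numeric suffixes, which come from IP addresses rather than TLDs.
        if tld.isdigit():
            continue
        # Adding into one Counter merges a TLD that occurs in several reports.
        counts[tld] += urls
    return counts


def number_of_tld(report_lines: Iterable[str]) -> int:
    """Number of distinct TLDs with at least one URL harvested."""
    return len(tld_url_counts(report_lines))
```

Because all reports feed a single counter, a TLD that appears in several reports is counted once in the number of TLD, while its URL counts are added together.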

Statistics on top level domains

Starting with a seed list of .fr domains, we can see that the French scope also includes a large share of .com domains, and also European domains.

TLD | Number of URL | %
fr | 1,050,488, | %
com | 952,199, | %
net | 105,871, | %
org | 104,451, | %
eu | 29,396, | %

Description of MIME types

Metrics description
  Name: Number of MIME types
  Description: Number of unique MIME types harvested
Metrics elaboration
  Data source: mimetype-report.txt (N files)
  Calculation: Count the distinct MIME types. Be careful: the same MIME type can occur in several MIME type reports.
Purpose
- Get a distribution by content types comparable to other documents in a library
- Help preservation tasks
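A possible way to compute this metric, and to group MIME types into broad categories (text, image, application, video, audio), is sketched below. It is an illustration under assumptions: report lines are taken to be whitespace-separated with the URL count first and the MIME type third, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def mime_url_counts(report_lines: Iterable[str]) -> Counter:
    """Sum URL counts per MIME type over lines from one or more MIME type reports.

    Assumed line layout: "<#urls> <#bytes> <mime-type>" (an assumption, see lead-in).
    """
    counts: Counter = Counter()
    for line in report_lines:
        fields = line.split()
        # Skip blank or short lines and the bracketed header line.
        if len(fields) < 3 or fields[0].startswith("["):
            continue
        try:
            urls = int(fields[0])
        except ValueError:
            continue
        # Drop parameters such as "; charset=utf-8" and normalise case.
        mime = fields[2].split(";")[0].strip().lower()
        # One shared Counter merges a MIME type that occurs in several reports.
        counts[mime] += urls
    return counts


def category_distribution(counts: Counter) -> Dict[str, Tuple[int, float]]:
    """Group MIME types by top-level category; return (URL count, percentage) per category."""
    by_category: Counter = Counter()
    for mime, urls in counts.items():
        by_category[mime.split("/", 1)[0]] += urls
    total = sum(by_category.values())
    if total == 0:
        return {}
    return {cat: (urls, 100.0 * urls / total) for cat, urls in by_category.items()}
```

The number of MIME types is then `len(mime_url_counts(lines))`, and `category_distribution` gives the kind of per-category breakdown used to spot formats that need special preservation attention.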

Statistics on MIME types

We can note that around 1 million audio and video files will need special attention to be preserved.

MIME type (by categories) | Number of URL | %
text | 1,34, 647, | %
image | 947,101, | %
application | 114,668, | %
video | 2,120, | %
audio | 1,837, | %

Description of WARC files volume

Metrics description
  Name: WARC files volume (compressed, including metadata WARC files)
  Description: Size in bytes (and in Go, i.e. gigabytes) of all the conservation units produced / data harvested
Metrics elaboration
  Data source: Manifest of storage servers
  Calculation: Add up the sizes of all the WARC files of one or several harvests (configurable).
Purpose
- Manage the storage space
- Help preservation tasks
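The summation can be sketched as below. The manifest layout is not given on the slide, so this example assumes a hypothetical one: each entry is a tuple of harvest identifier, file name and size in bytes. The function name and the choice to accept ARC files alongside WARC files are also assumptions.

```python
from typing import Iterable, Optional, Set, Tuple

# Hypothetical manifest entry: (harvest id, file name, size in bytes).
ManifestEntry = Tuple[str, str, int]

# File name suffixes treated as container files (assumption).
CONTAINER_SUFFIXES = (".warc", ".warc.gz", ".arc", ".arc.gz")


def warc_volume_bytes(entries: Iterable[ManifestEntry],
                      harvests: Optional[Set[str]] = None) -> int:
    """Add up the sizes of container files, optionally restricted to some harvests.

    harvests=None means "all harvests"; otherwise only the given ids are counted.
    """
    total = 0
    for harvest_id, name, size in entries:
        if harvests is not None and harvest_id not in harvests:
            continue
        # Ignore anything in the manifest that is not a container file.
        if not name.lower().endswith(CONTAINER_SUFFIXES):
            continue
        total += size
    return total
```

Passing a set of harvest ids gives the volume of one or several harvests, and passing nothing gives the volume of everything listed in the manifest.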

Statistics on WARC files volume

- BnF counts ARC and WARC files in a similar way
- Tio in Tio for the entire BnF web collections
- Question: BnF still hesitates over how to convert bytes: to Go or Gio (GB or GiB), To or Tio (TB or TiB)?
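The difference between the two conventions is a matter of arithmetic: Go and To (GB, TB) are powers of 10, while Gio and Tio (GiB, TiB) are powers of 2. A minimal sketch, with hypothetical function names, shows how far the two readings drift apart at archive scale:

```python
# Bytes per unit. Decimal units (Go, To in French) versus binary units (Gio, Tio).
UNIT_FACTORS = {
    "GB": 10**9,    # Go
    "TB": 10**12,   # To
    "GiB": 2**30,   # Gio
    "TiB": 2**40,   # Tio
}


def convert_bytes(n_bytes: int, unit: str) -> float:
    """Express a byte count in the given unit."""
    return n_bytes / UNIT_FACTORS[unit]


def describe(n_bytes: int) -> str:
    """Show the same volume in decimal and binary terabytes."""
    return "{:.2f} TB = {:.2f} TiB".format(
        convert_bytes(n_bytes, "TB"), convert_bytes(n_bytes, "TiB")
    )
```

For example, `describe(200 * 10**12)` returns `"200.00 TB = 181.90 TiB"`: the same volume differs by about 9 % depending on the unit chosen, which is why the convention has to be stated when figures are compared between institutions.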

Communication to users

Comments by the digital legal deposit team
- Describe the web archive

Discussion with the IT team
- Define annual storage volume
- Define number of crawlers

Content librarians network
- Cooperate on selective crawls

BnF managers and readers
- Disseminate figures on the annual report, the BnF website, the legal deposit observatory

metrics 5 metrics 4 metrics 2 metrics or more

Conclusion

Some usage limitations
- Unused metrics
- Bugs and errors
- Lack of analysis

Some changes in the environment
- Adaptation to Heritrix 3
- Options for new tools
- Evolution of standards

Even so, we are able to compare our production and our collections with those of other institutions.