Annick Le Follic, Bibliothèque nationale de France. Tallinn, 2015-01-29.

Presentation transcript:

BnF needs
Objectives:
  - Characterize BnF web collections
  - Manage the activity of the digital legal deposit team
  - Describe BnF web data to be preserved
Two main kinds of metrics: harvesting and preservation
From experimentation…
  - Scripts and Heritrix reports from Internet Archive engineers
  - A dedicated application developed by BnF engineers

International environment
… to the standardization of the metrics
  - Definition of concepts and standards with an ISO working group
  - Dedicated statistics have to be included in the Library general performance statistics
  - Experience sharing within the IIPC
  - Many libraries have changed from ARC to WARC
  - BnF has developed a specific tool (NAS_qual) for its first internal broad crawl in

Benefits of ISO report for BnF
Adoption of strict definitions of terms

Main metrics chosen for collection development:
  Statistic | Purpose | Example
  Number of targets | Objectives of the collection | 8,000 targets
  Total number of URL | Quantity of information in web archive | 14 billion URL
  Total compressed size stored | Overall size of web archive | 200 terabytes
  Number of container files | Number of conservation units in archive | 18,000 WARC files

At a more refined level, collection characterisation:
  Statistic | Purpose | Example
  Distribution by top level domain | Geographic distribution | 70 % of collection in .fr TLD
  Distribution by format types | Document type characterisation | 60 % of collection in text/html

BnF method
A table lists and characterizes all possible metrics:
  - A code and a name for each one
  - The source report
  - The calculation method
  - The needed tool (scripts, NAS_qual…)
Main difficulties:
  - Difference of scale between broad and selective crawl
  - Compressed or uncompressed size
  - Collected or processed URLs

                   | Production                     | Preservation
  Sources          | NetarchiveSuite (and Heritrix) | SPAR
  Statistics tools | NAS_qual                       | SPAR indicators
  Exploitation     | Excel files

Description of top level domain
Metrics description
  Name: Number of TLD
  Description: Number of unique first-level top level domains harvested
Metrics elaboration
  Data source: hosts-report.txt (N files)
  Calculation: Extract the TLD from each host name, then count the distinct TLDs with at least one URL harvested. Be careful: the same TLD can occur several times, across several hosts reports.
To characterize a collection in terms of geographic distribution (e.g. France)
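The calculation described on this slide can be sketched in Python. This is a minimal sketch, not the NAS_qual implementation: it assumes that each hosts-report.txt line carries a URL count, a byte count and a host name in its first three whitespace-separated columns (any further columns are ignored). The function names and the sample lines are hypothetical.

```python
from collections import Counter


def tld_of(host):
    """Return the last label of a host name, or None for IPs and pseudo-hosts."""
    name = host.split(":")[0].rstrip(".").lower()
    if "." not in name:
        return None  # e.g. a "dns:" pseudo-host line
    last = name.rsplit(".", 1)[1]
    return None if last.isdigit() else last  # numeric last label: IPv4 address


def urls_per_tld(reports):
    """Merge several hosts reports (each an iterable of lines) into URL totals per TLD."""
    totals = Counter()
    for lines in reports:
        for line in lines:
            fields = line.split()
            # Skip the header line and anything that is not "<urls> <bytes> <host> ..."
            if len(fields) < 3 or not fields[0].isdigit():
                continue
            urls = int(fields[0])
            tld = tld_of(fields[2])
            # Only TLDs with at least one harvested URL are counted.
            if tld is not None and urls > 0:
                totals[tld] += urls
    return totals


# Hypothetical sample data: the same TLD appears in both reports.
report_a = ["[#urls] [#bytes] [host]", "120 90000 www.example.fr", "30 1000 dns:"]
report_b = [
    "[#urls] [#bytes] [host]",
    "80 50000 blog.example.fr",
    "50 20000 example.com",
    "0 0 empty.example.org",
]
totals = urls_per_tld([report_a, report_b])
number_of_tld = len(totals)  # the "Number of TLD" metric
```

Merging into a single counter before counting keys is what avoids the pitfall noted on the slide: a TLD present in several reports is counted once.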

Statistics on top level domains
Starting from a seed list of .fr domains, we can see that the French scope also includes a large share of .com domains, as well as European domains.

  TLD | Number of URL | %
  fr  | 1,050,488,    | %
  com | 952,199,      | %
  net | 105,871,      | %
  org | 104,451,      | %
  eu  | 29,396,       | %

Description of MIME types
Metrics description
  Name: Number of MIME types
  Description: Number of unique MIME types harvested
Metrics elaboration
  Data source: mimetype-report.txt (N files)
  Calculation: Count the unique MIME types. Be careful: the same MIME type can occur several times, across several MIME type reports.
Get a distribution by content types comparable to other documents in a library
Help preservation tasks
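A possible way to compute this metric, and the distribution by category that goes with it, is sketched below in Python. It is an illustration only: it assumes that each mimetype-report.txt line starts with a URL count, a byte count and a MIME type, and the function names and sample figures are hypothetical.

```python
from collections import Counter


def merge_mime_reports(reports):
    """Merge several MIME type reports (each an iterable of lines) into URL totals per type."""
    totals = Counter()
    for lines in reports:
        for line in lines:
            fields = line.split()
            # Skip the header line and malformed lines.
            if len(fields) < 3 or not fields[0].isdigit():
                continue
            # Drop any parameter such as ";charset=utf-8" and normalize case.
            mime = fields[2].split(";")[0].strip().lower()
            totals[mime] += int(fields[0])
    return totals


def share_by_category(totals):
    """Return the percentage of URLs per top-level category (text, image, ...)."""
    by_category = Counter()
    for mime, urls in totals.items():
        by_category[mime.split("/")[0]] += urls
    grand_total = sum(by_category.values())
    if grand_total == 0:
        return {}
    return {cat: 100.0 * n / grand_total for cat, n in by_category.items()}


# Hypothetical sample data: text/html appears in both reports.
report_a = ["[#urls] [#bytes] [mime-types]", "600 1000 text/html", "300 5000 image/jpeg"]
report_b = ["[#urls] [#bytes] [mime-types]", "50 200 text/html", "50 9000 image/png"]
totals = merge_mime_reports([report_a, report_b])
number_of_mime_types = len(totals)
shares = share_by_category(totals)
```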

Statistics on MIME types
We can note that around 1 million audio and video files will need special attention to be preserved.

  MIME type (by categories) | Number of URL | %
  text        | 1,34, 647, | %
  image       | 947,101,   | %
  application | 114,668,   | %
  video       | 2,120,     | %
  audio       | 1,837,     | %

Description of WARC files volume
Metrics description
  Name: WARC files volume (compressed, with metadata WARC)
  Description: Size in bytes (and in Go) of all the conservation units produced / data harvested
Metrics elaboration
  Data source: Manifest of storage servers
  Calculation: Sum the sizes of all the WARC files of one or several harvests (configurable).
Manage the storage space
Help preservation tasks
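The summation can be sketched as follows in Python. The layout of the BnF storage manifest is not given in the slides, so this sketch assumes a hypothetical one: one line per file, with the file name followed by its size in bytes. The `keep` filter stands in for the "one or several harvests" selection, and the file names are invented.

```python
def total_container_bytes(manifest_lines, keep=lambda name: True):
    """Sum the sizes of the ARC/WARC files listed in a manifest.

    Assumed (hypothetical) line layout: "<file name> <size in bytes>".
    `keep` selects which files belong to the harvests of interest.
    """
    suffixes = (".arc", ".arc.gz", ".warc", ".warc.gz")
    total = 0
    for line in manifest_lines:
        parts = line.rsplit(None, 1)  # split on the last run of whitespace
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        name, size = parts
        if name.endswith(suffixes) and keep(name):
            total += int(size)
    return total


# Hypothetical manifest covering two harvests, A and B.
manifest = [
    "harvest-A-0001.warc.gz 1000000000",
    "harvest-A-0002.warc.gz 950000000",
    "harvest-B-0001.warc.gz 500000000",
    "harvest-A-crawl.log 1234",
]
all_harvests = total_container_bytes(manifest)
harvest_a = total_container_bytes(manifest, keep=lambda n: n.startswith("harvest-A-"))
```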

Statistics on WARC files volume
  - BnF counts ARC and WARC files in a similar way
  - Tio in Tio for the entire BnF web collections
  - Open question: should bytes be converted to Go or Gio, To or Tio (decimal or binary multiples)?
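The difference between the two families of units can be made concrete with a short Python sketch. Go and To are the decimal multiples (10^9 and 10^12 bytes), while Gio and Tio are the binary ones (2^30 and 2^40 bytes). The 200-terabyte figure below is only an illustrative input, read here as a decimal quantity, and the `convert` helper is hypothetical.

```python
# Bytes per unit: decimal (Go, To) versus binary (Gio, Tio) multiples.
UNITS = {
    "Go": 10**9,
    "To": 10**12,
    "Gio": 2**30,
    "Tio": 2**40,
}


def convert(n_bytes, unit):
    """Express a byte count in the given unit."""
    return n_bytes / UNITS[unit]


size_in_bytes = 200 * 10**12  # illustrative value, not a BnF measurement
in_to = convert(size_in_bytes, "To")
in_tio = convert(size_in_bytes, "Tio")
print(f"{in_to:.1f} To = {in_tio:.1f} Tio")
```

At this scale the two conventions differ by roughly 9 %, which is why the choice of unit needs to be stated alongside any published figure.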

Communication to users
  - Comments by the digital legal deposit team: describe the web archive
  - Discussion with the IT team: define annual storage volume; define number of crawlers
  - Content librarians network: cooperate on selective crawls
  - BnF managers and readers: disseminate figures in the annual report, on the BnF website, in the legal deposit observatory
metrics 5 metrics 4 metrics 2 metrics or more

Conclusion
Some usage limitations:
  - Unused metrics
  - Bugs and errors
  - Lack of analysis
Some changes in the environment:
  - Adaptation to Heritrix 3
  - Options for new tools
  - Evolution of standards
Even so, we are able to compare our production and our collections with other institutions.

BACKUP

External tools: QA tool - 1
Java batch based on the Heritrix reports stored in metadata W/ARC files.
Compiles a large set of figures and lists and stores them in text files.
15 figures:
  - processed URI
  - harvested URI
  - harvested hosts
  - harvested domains
  - non-harvested domains
  - TLD
  - MIME types
  - harvest duration
  - average URL/s
  - average Kb/s
  - average job size in URI
  - average seed number per job
  - average job size in bytes
  - non-harvested URL because of robots exclusion
  - total size

External tools: QA tool
  - codehttp_url.txt: URL distribution per HTTP response code.
  - 02-typemime_url_octets.txt: URL and bytes distribution per MIME type.
  - 03-tld_url_octets.txt: URL and bytes distribution per TLD.
  - 04-tld-hotes.txt: hosts distribution per TLD.
  - 05-tld-domaines.txt: domains distribution per TLD.
  - 06-tranches_hotes_url.txt: number of hosts in a given slice of harvested URL. = =100001;
  - 07-tranches_domaines_url.txt: same with domains.
  - 08-tranches_domaines_hotes.txt: same with hosts on domains.
  - 09-tld2ndniveau_url_octets.txt: URL and bytes distribution per second level TLD.
  - 10-tld2ndniveau_hotes.txt: host distribution per second level TLD.
  - 11-top_domaines_url_octets.txt: URL and bytes distribution for the N largest domains.
  - 12-top_hotes_url_octets.txt: URL and bytes distribution for the N largest hosts.
  - 13-top_domaines_hotes.txt: list of domains having the largest number of hosts.
  - 14-codereponse_seeds.txt: distribution of seeds per response code.
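The "slice" lists (hosts or domains per range of harvested URLs) can be illustrated with a small Python sketch. The slice boundaries used by NAS_qual did not survive in this transcript, so the powers of ten below are an assumption chosen for illustration, as are the function name and the sample hosts.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical slice boundaries (inclusive upper bounds); the real ones are not known.
UPPER_BOUNDS = [10, 100, 1000, 10000, 100000]
LABELS = ["1-10", "11-100", "101-1000", "1001-10000", "10001-100000", ">=100001"]


def hosts_per_slice(urls_by_host):
    """Count how many hosts fall into each slice of harvested URL counts."""
    slices = Counter()
    for urls in urls_by_host.values():
        if urls <= 0:
            continue  # hosts with nothing harvested are left out
        # bisect_left finds the first bound that is >= urls, i.e. the slice index.
        slices[LABELS[bisect_left(UPPER_BOUNDS, urls)]] += 1
    return slices


# Hypothetical per-host URL totals.
sample = {"a.fr": 5, "b.fr": 10, "c.com": 11, "d.org": 250000, "e.net": 0}
slices = hosts_per_slice(sample)
```

The same function applies to domains by passing URL totals keyed by domain instead of by host.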