WARC standard revision workshop, Clément Oury, IIPC General Assembly open workshops, Stanford, April 28th, 2015


1 WARC standard revision workshop
Clément Oury
IIPC General Assembly open workshops, Stanford, April 28th, 2015
IIPC General Assembly – Stanford – April 28th, 2015

2 Summary of the presentation
• Current status of the WARC standard
• The revision process
• Identify, discuss and prioritize revision needs
• Set up an organization and agenda for further work


4 The WARC format
• A container format designed to store any kind of digital content
  - Along with relevant metadata
  - Extension of the ARC format designed in 1996
• WARC improvements
  - Assigns a unique identifier to each record
  - New record types:
    - To describe the harvesting process: warcinfo, request, response, metadata records
    - To store information on deduplication: revisit records
    - To store segmented files: continuation records
    - To record outputs of a file format migration: conversion records
    - To record non-web material: resource records
  - New named fields for each record
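
To make the record layout on this slide concrete, here is a minimal Python sketch of how a single record could be serialized: a version line, a set of named fields (including the unique record identifier), a blank line, then the content block. The function name `build_warc_record` and the example URI are assumptions made for illustration; the field names follow WARC 1.0, but this is a sketch and not a conformant writer.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type: str, target_uri: str, payload: bytes,
                      content_type: str = "application/octet-stream") -> bytes:
    """Serialize one record: version line, named fields, blank line, block.

    Hypothetical helper for illustration only; it does not validate
    record types or compute digests.
    """
    named_fields = [
        ("WARC-Type", record_type),
        # Each record gets its own unique identifier (here a random UUID URN).
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    header = "WARC/1.0\r\n"
    header += "".join(f"{name}: {value}\r\n" for name, value in named_fields)
    header += "\r\n"  # blank line separates named fields from the block
    # A record ends with two CRLF pairs after the content block.
    return header.encode("utf-8") + payload + b"\r\n\r\n"


if __name__ == "__main__":
    record = build_warc_record("resource", "http://example.org/report.txt",
                               b"hello", "text/plain")
    print(record.decode("utf-8"))
```

A `resource` record is used in the usage lines because it is the type the slide associates with non-web material; the other record types would differ only in their `WARC-Type` value and in what the block contains.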

5 Usage of the WARC format
• Widely adopted by the web archiving community
  - Most institutions have switched from the ARC to the WARC format
  - Harvesting: Heritrix, Wget, WARCreate
  - Data management/preservation: JWAT, JHOVE2
  - Indexing and access: Solr, OpenWayback
• But also adopted beyond the web archiving community
  - To store e-periodicals and e-books: LOCKSS project
  - To store all files ingested in a long-term repository: Danish Bit Repository
• Some usage issues are discussed in the WARC implementation guidelines

6 The WARC standard
• Published as ISO 28500 on May 15th, 2009
  - The standardization process started in 2006
  - Mainly carried out by IIPC members under the ISO umbrella
• ISO group: TC 46 / SC 4 / WG 12
  - TC 46: Information and documentation
  - SC 4: Technical interoperability
  - WG 12: WARC file format
• ISO standards are generally reviewed after 5 years
  - ISO members voted in 2014 in favor of the revision


8 The revision process
• A maximum period of 36 months
• A two-step approach
  - IIPC draft / IIPC WG
  - ISO validated standard / ISO WG
• Proposed agenda in 2015
  - WARC revision workshop: now!
  - June: presentation of the revision process during the TC 46 meeting
  - May-September: first IIPC draft
  - October (?): ISO WG meeting

9 The revision process – why?
• Amend or improve the current standard on several topics
  - Clarify potential ambiguities or inconsistencies in the standard
  - Offer better solutions to record some information, e.g. by adding new named fields or even new record types
  - Take into account needs not identified when the original standard was designed (e.g. use of WARC for documents other than web archives)
  - Perform minor editorial revisions
• Afterwards, no change is possible until the next revision!



12 Revision needs – active discussions
• Clarification
  - Is it allowed to add new named fields?
    - New record types are allowed…
    - But nothing is indicated on new named fields
• Two new named fields for deduplication
  - WARC-Refers-To-Target-URI
  - WARC-Refers-To-Date
• A proposal to record screenshots?
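
As an illustration of the two proposed deduplication fields, the Python sketch below assembles the header of a revisit record that points back to the earlier capture holding the identical payload. The helper names, URIs, dates and digest value are made up for the example. The `WARC-Profile` URI is the identical-payload-digest profile as I understand it from WARC 1.0 and should be checked against the standard. An empty content block is a simplification.

```python
import uuid


def revisit_named_fields(target_uri, capture_date, original_uri,
                         original_date, payload_digest):
    """Return the named fields of a revisit record (hypothetical helper)."""
    return [
        ("WARC-Type", "revisit"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", capture_date),
        ("WARC-Target-URI", target_uri),
        # Assumption: profile URI for "same payload digest" revisits in WARC 1.0.
        ("WARC-Profile",
         "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"),
        ("WARC-Payload-Digest", payload_digest),
        # The two fields under discussion: URI and date of the original capture.
        ("WARC-Refers-To-Target-URI", original_uri),
        ("WARC-Refers-To-Date", original_date),
        # Simplification: no content block is stored in this sketch.
        ("Content-Length", "0"),
    ]


def format_header(named_fields):
    """Render the version line and named fields as header text."""
    lines = ["WARC/1.0"] + [f"{name}: {value}" for name, value in named_fields]
    return "\r\n".join(lines) + "\r\n\r\n"


if __name__ == "__main__":
    fields = revisit_named_fields(
        target_uri="http://example.org/logo.png",
        capture_date="2015-04-28T10:00:00Z",
        original_uri="http://example.org/logo.png",
        original_date="2015-01-15T08:30:00Z",
        payload_digest="sha1:EXAMPLEDIGESTVALUE",  # placeholder, not a real digest
    )
    print(format_header(fields))
```

With both fields present, an access tool can locate the original capture by URI and date alone, without resolving a record identifier first.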

13 Revision needs – WARC for data mining
• WAT: Web Archive Transformation
  - Specified by Internet Archive to store metadata extracted from WARC files
  - Metadata (HTML headers, HTML metadata, links…) recorded in metadata records with a JSON structure
• WET: WARC Encapsulated Text
  - Designed by Common Crawl
  - Contains only text content extracted from WARC files
• Official recommendation as informative appendix?
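
The two derivative formats can be pictured with a small Python sketch that takes the HTML payload of one captured page and produces a JSON metadata document (in the spirit of WAT) and a plain-text extract (in the spirit of WET). The JSON key names and the class and function names are hypothetical and do not reproduce the actual WAT schema or Common Crawl's extraction rules; only the Python standard library is used.

```python
import json
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Collect the title, link targets and visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def derive_metadata_and_text(html: str, target_uri: str):
    """Return (WAT-like JSON string, WET-like plain text) for one page."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    parser.close()
    # Hypothetical key names; the real WAT envelope is more detailed.
    metadata = {
        "target_uri": target_uri,
        "title": parser.title.strip(),
        "links": parser.links,
    }
    return json.dumps(metadata, sort_keys=True), "\n".join(parser.text_parts)


if __name__ == "__main__":
    page = ('<html><head><title>Demo</title><style>p{}</style></head>'
            '<body><p>Hello <a href="/about">about us</a></p>'
            '<script>var x=1;</script></body></html>')
    wat_like, wet_like = derive_metadata_and_text(page, "http://example.org/")
    print(wat_like)
    print(wet_like)
```

In an actual pipeline each output would be wrapped in its own WARC record referring back to the source record; that wrapping is left out here to keep the sketch short.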

14 Revision needs – open questions
• Is the WARC format suited for non-web material?
• Is the WARC format suited for server-side archiving?
• How to improve the use of unique IDs?


16 Next steps
• Set up a working group: who’s in?
  - Should we share the work?
• What tools?
  - Using IIPC GitHub?
• Agenda?
  - Phone calls?

