WebArchiv Czech Web Archive IIPC 2007, Paris.

1 http://www.webarchiv.cz WebArchiv Czech Web Archive IIPC 2007, Paris

2 WebArchiv – overview
The Czech WebArchiv was originally funded by the Ministry of Culture and launched in 2000. Since then, the project has been run by the National Library in cooperation with the Moravian Library and the Institute of Computer Science of Masaryk University. Using Heritrix, it carries out both large-scale automated harvesting of the entire Czech national web and selective archiving, including thematic and event-based collections. Due to copyright law, the full archive is accessible only on site, from within the library (using Wayback). Archived resources covered by a written agreement with their publisher are accessible online through WERA.

3 WebArchiv – Workflows
Prague:
• Resource selection
• Cataloguing for the National Bibliography (MARC21)
• Providing Dublin Core metadata for interested publishers
• Making archive access agreements with publishers
Brno:
• Running WebArchiv hardware
• Software localization, maintenance and development
• Pre-harvesting resource analysis
• Harvesting, indexing, access
Results so far:
• 4 harvesting rounds of the .cz domain (2001, 2002, 2004, 2006)
• 5 event-oriented harvests
• Harvests of sites under agreements several times per year
• 5.4 TB archive with 136 million files
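As a rough sanity check on the last figure, the average archived file size can be estimated from the two totals. This is only a back-of-the-envelope sketch; it assumes decimal units (1 TB = 10^12 bytes, 1 kB = 1000 bytes), which the slide does not specify, and the function name is illustrative.

```python
# Rough average file size implied by the totals on the slide.
# Assumption: decimal units (1 TB = 10**12 bytes); the slide does not say.
ARCHIVE_BYTES = 5.4 * 10**12   # 5.4 TB archive
FILE_COUNT = 136 * 10**6       # 136 million files


def average_file_size_kb(total_bytes: float, files: int) -> float:
    """Return the mean file size in kilobytes (1 kB = 1000 bytes)."""
    return total_bytes / files / 1000


avg_kb = average_file_size_kb(ARCHIVE_BYTES, FILE_COUNT)
print(f"about {avg_kb:.1f} kB per file")
```

Under those assumptions the result is on the order of a few tens of kilobytes per file.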

4 WebArchiv – Tools
Software tools:
• Web-based Dublin Core metadata creator
• National Bibliography Number (NBN) generator
• Heritrix crawler
• NutchWAX, WERA – full-text indexing and public archive access
• wa-cz – locally developed infrastructure
• WayBack – Wayback Machine-like interface for the whole archive, limited access
Hardware:
• 3 HP ProLiant servers, 5.8 TB SATA disc array
• Awaiting transfer of the archive files to the National Library’s central storage facility (25+ TB, mirrored, FC+SATA) later this year

5 WebArchiv – Infrastructure
A1 new crawl
A2 end crawl -> index
A3 update fulltext
A4 update host list
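The four labelled actions above could be modelled as a small dispatch table. The sketch below is purely illustrative: only the labels A1 to A4 come from the slide, while the handlers, the `run_actions` helper, and the assumption that the actions run in numbering order are hypothetical and say nothing about how wa-cz is actually built.

```python
from typing import Callable, Dict, List

# Labels copied from the slide; everything else in this sketch is hypothetical.
ACTIONS: Dict[str, str] = {
    "A1": "new crawl",
    "A2": "end crawl -> index",
    "A3": "update fulltext",
    "A4": "update host list",
}


def run_actions(order: List[str],
                handlers: Dict[str, Callable[[], None]]) -> List[str]:
    """Call each handler in the given order and return a log of what ran."""
    log: List[str] = []
    for code in order:
        if code not in ACTIONS:
            raise KeyError(f"unknown action {code}")
        handlers[code]()  # placeholder for the real work
        log.append(f"{code}: {ACTIONS[code]}")
    return log


# Placeholder handlers that do nothing; real ones would start a crawl,
# trigger indexing, and so on.
handlers: Dict[str, Callable[[], None]] = {code: (lambda: None) for code in ACTIONS}

# Assumed order: the slide's own numbering.
log = run_actions(["A1", "A2", "A3", "A4"], handlers)
for entry in log:
    print(entry)
```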

7 WebArchiv – Future Work
Workflow management application
Harvesting of bohemical resources outside the .cz domain
  • language analysis
  • feedback from Heritrix about dropped URLs from the .cz crawl
Adaptive incremental harvesting, incremental indexing
Selective harvesting on demand
Fulltext indexing of the whole archive
Identification of similar documents
Permanent linking into the archive (permanent ID)
Integration of the archive into the planned National Digital Library (selection of software 2008)
Long-term preservation (via NDL system)
Implementation of digital library standards: OAI-PMH, METS, SRU/SRW
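To give a sense of what OAI-PMH support would mean for a client, here is a minimal sketch of building a `ListRecords` request. The verb and the `oai_dc` metadata prefix are defined by the OAI-PMH protocol itself; the endpoint URL and the helper name are placeholders, since the slides mention OAI-PMH only as planned work and name no endpoint.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not name one.
BASE_URL = "http://example.org/oai"


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL.

    metadataPrefix is required for ListRecords; set is optional.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


url = list_records_url(BASE_URL)
print(url)
```

A harvester would fetch this URL and parse the XML response; that step is left out here because it depends on the repository.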

8 Archive daily ingest
[Chart: archive daily ingest; series: NEDLIB harvester, Heritrix; axis: number of files]

9 People
Librarians, project management:
• National Library: 3.5 FTE
IT management:
• Moravian Library – 1 part-time
IT:
• Masaryk University – 6 part-time

