BUILDING DIGITAL WEB ARCHIVES FOR FUTURE SCHOLARS
Jani Stenvall, 1.10.2004


2 Themes
Web archiving in general: what it is + international perspectives
Archiving Finnish web content: legal framework, plans and techniques

3 Web archiving?
Storing and preserving content made available on the Internet/WWW
Relates to the general idea of preserving any cultural heritage
For the use of future scholars, researchers and ordinary citizens

4 Who is archiving the Web?
Usually national libraries: in the context of legal deposit law or some other mandate
  – E.g. Nordic countries, Australia, France, UK, USA, Italy...
Internet Archive (globally)
Other specialized organisations
Cooperation: International Internet Preservation Consortium (IIPC)

5 Basics of web archiving
Two approaches
  – Selection-based: identifying web sites that are to be archived
  – Harvesting-based (crawler-based): using specialised software to collect large amounts of data (e.g. country domain-level)
Various challenges in
  – Presentation and preservation: understanding web publishing technologies
  – Harvesting, link extraction, version control
  – Deep web, web databases
  – Cooperation with web publishers
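Link extraction, named above as one of the harvesting challenges, can be illustrated with a minimal sketch. It uses only the Python standard library and is not how Heritrix or any other crawler actually does it; the class and function names and the example URLs are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect absolute link targets found in href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the de-duplicated absolute links of one HTML page, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A real harvester also has to follow links hidden in scripts, stylesheets and forms, which is part of why the deep web appears in the list of challenges.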

6 Finnish digital content in the National Library: digital legal deposit
Reforming the legislation
  – A draft Government Bill for a new legal deposit act introduced to the Ministry of Education in June 2003
  – Still waiting for the parliamentary process
  – Copyright legislation also being reformed according to the EU directive
New responsibilities for the National Library
  – Collect and store web material: "a representative and versatile sample of publicly available materials on the Internet"
  – Collect and store off-line media

7 Collecting and Storing the Web: Case Finland
Two methods:
  – Crawling/harvesting the national web space by the National Library
  – If the material cannot be harvested and/or the National Library considers it especially valuable: the library notifies the publisher => the publisher deposits or provides the "means for storing"
    e.g. commercial web publications, ministry reports, some deep web publications
Different workflow and data storage for these two methods
  – a web archive (harvests)
  – a document archive (deposits)
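The choice between the two methods can be written as a small decision rule. The sketch below only restates the slide's criteria; the `Publication` fields and the function name are hypothetical, and in practice the "especially valuable" judgement is a curatorial decision rather than a flag.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    url: str
    harvestable: bool                  # can a crawler reach and fetch it?
    especially_valuable: bool = False  # judged so by the library (assumed flag)


def choose_archive(pub):
    """Return the archive a publication would go to under the slide's rule."""
    if not pub.harvestable or pub.especially_valuable:
        # The library notifies the publisher, who deposits the material.
        return "document archive"
    return "web archive"
```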

8 The Two Methods
[Diagram: The Finnish Web Space feeds two stores. Crawling/harvesting leads to the Web Archive (full text index); deposits provided by the publisher lead to the Document Archive (DOMS: ENCompass, metadata).]

9 The Two Methods
Web Archive
  – Web sites, web pages
  – National domain (.fi) + other domains with national content
  – html, gif, jpeg...
  – Some sites with id + password
  – Harvesting sets the limits
  – Full text indexing, little metadata
Document Archive
  – Digital documents that we also want to catalogue in the national bibliography
  – Documents that cannot be harvested (some restricted documents, deep web materials)
  – Documents that are deemed quality publications (e.g. research series, e-books)
  – Rich metadata used for the management and indexing, currently no full text indexing available
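The harvesting scope for the web archive (the .fi domain plus other domains with national content) could be expressed as a URL filter along the following lines. This is a sketch under assumptions: the extra-host list is a hypothetical hand-maintained allow-list, and the host names are placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list of non-.fi hosts judged to carry national content.
EXTRA_HOSTS = {"www.example.com"}


def in_scope(url, extra_hosts=EXTRA_HOSTS):
    """True if the URL is under .fi or on an explicitly listed host."""
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith(".fi") or host in extra_hosts
```

For example, `in_scope("http://www.example.fi/page")` is true, while a URL on an unlisted .org host is rejected.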

10 Web archive
Harvesting: Heritrix
Indexing: FAST Search
User Interface: Nordic Web Archive Toolset
Data storage
  – Currently only around 1.5 TB of data stored in a tape robot
  – Currently storing the web data in the ARC format (Heritrix)
  – Negotiations for large-scale storage, hopefully in production in 2005
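As a rough illustration of the ARC format: in ARC version 1, each stored document is preceded by a one-line header of five space-separated fields (URL, IP address, archive date as YYYYMMDDhhmmss, content type, length in bytes). The parser below is a simplified sketch with assumed names, not a complete ARC reader; it ignores the file-level version block and compression.

```python
from collections import namedtuple
from datetime import datetime

ArcHeader = namedtuple("ArcHeader", "url ip archive_date content_type length")


def parse_arc_header(line):
    """Parse one ARC v1 URL-record header line into its five fields."""
    url, ip, date, content_type, length = line.strip().split(" ")
    return ArcHeader(
        url,
        ip,
        datetime.strptime(date, "%Y%m%d%H%M%S"),  # 14-digit timestamp
        content_type,
        int(length),  # number of bytes of content that follow the header
    )
```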

11 Web Archive 2
Archiving policies are being formed in Finland
Will be part of the overall collection development in the National Library
Current thinking:
  – 1-2 times a year: a wide sweep (all that we can find)
  – More frequent harvests for certain sites (e.g. news & media)
  – Theme harvests (e.g. elections)
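The periodic part of this thinking could be captured in a simple schedule check. The intervals below are assumptions chosen for illustration (roughly twice a year for the wide sweep, daily for news sites); the slides give no exact figures, and theme harvests are event-driven, so they are left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class HarvestPolicy:
    name: str
    interval_days: int  # assumed value, not taken from the presentation


POLICIES = [
    HarvestPolicy("wide sweep", 182),     # about twice a year
    HarvestPolicy("news and media", 1),   # frequent harvest, assumed daily
]


def is_due(policy, last_run, today):
    """True if at least interval_days have passed since the last harvest."""
    return today - last_run >= timedelta(days=policy.interval_days)
```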

12 Document archive and the DOMS
Digital Object Management System (DOMS)
  – Purchased from Endeavor Inc. = ENCompass for Digital Collections, not yet in production
  – Will be a centralised system for universities
  – Allows the description of digital objects and building of collections
  – Metadata customisable
  – Access restrictions customisable
  – Search UI customisable for each collection
Digital objects stored in a document archive or anywhere else
  – E.g. central digital repository or a web server
  – Linking from an ENCompass metadata record to the object (URI, path)

13 Legal deposit web content in DOMS/ENCompass
How to deal with incoming web deposits?
  – Workflows
  – Formats
  – Collection building
Metadata schema to support the management and preservation of objects
  – Technical metadata
  – Administrative metadata
Utilise existing metadata
  – E.g. MARC records, data from publisher, OAI
Convert ENCompass metadata for other use
  – E.g. to MARC records in the national bibliography
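Converting descriptive metadata towards MARC could start from a field mapping like the one sketched here. The input keys are hypothetical (the actual ENCompass schema is not shown in the slides); the targets are standard MARC 21 tags and subfields (245 $a title, 100 $a personal name, 260 $b publisher, 856 $u electronic location). The output is a list of tuples, not a valid MARC record.

```python
# Hypothetical source field -> (MARC 21 tag, subfield code)
FIELD_MAP = {
    "title": ("245", "a"),
    "creator": ("100", "a"),
    "publisher": ("260", "b"),
    "identifier": ("856", "u"),
}


def to_marc_like(record):
    """Map known fields of a metadata dict to (tag, subfield, value) tuples."""
    fields = []
    for key, (tag, code) in FIELD_MAP.items():
        value = record.get(key)
        if value:
            fields.append((tag, code, value))
    return sorted(fields)  # ordered by tag; unmapped keys are ignored
```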

14 Off-line content and ENCompass
Finnish music publications
  – Digitised + digital
CD-ROMs and games?
  – For preservation purposes, need an emulation environment
Video => Finnish Film Archive
Other digitised collections regarded as legal deposit content

15 Access to the legal deposit collections (web archive and document archive)
Based on the draft bill
  – For researchers "and other users"
  – On-site only: legal deposit libraries (currently 7) + The Finnish Film Archive
  – Researcher workstations
User interfaces
  – Web archive UI
  – DOMS UI

16 Anything else?
Questions?
Thank You!

