Archiving the Web: the PANDORA archive at the National Library of Australia Preserving the Present for the Future Copenhagen, 18-19 June 2001 Warwick Cathro,

1 Archiving the Web: the PANDORA archive at the National Library of Australia Preserving the Present for the Future Copenhagen, 18-19 June 2001 Warwick Cathro, Colin Webb and Julie Whiting National Library of Australia

2 Introduction The Web: an essential global mechanism for information access and delivery The “stewardship role” of libraries, museums and archives The challenge of digital preservation

3 The need for action We simply do not have the luxury that our predecessors in their curation of printed materials enjoyed, of being able to rely on ‘benign neglect’ as a plank of our digital preservation strategy. Action has to be taken at the outset to ensure viable and sustainable access to digital content –Lynne Brindley, December 2000

4 The PANDORA archive (1) The first step: ensure the resource survives “Preserving and Accessing Networked DOcumentary Resources of Australia” A selective archive Significant web sites of Australian academic, government, commercial and community organisations Several of these have already disappeared from the live Internet

5 The PANDORA archive (2) In June 2001: –1250 web sites –a third have been captured multiple times –now growing at 500 new titles and 400 re-gathers per year –5 million files –670,000 directories –134 GB of storage

6 The PANDORA archive (3) A routine part of the Library’s collection management processes Gathering is undertaken by the Library and three partner institutions Permission is sought from the publishers Every archived site is checked for quality and completeness Every archived site is catalogued

7 The context for PANDORA PANDORA is part of an overall framework for digital collection management at NLA The framework: –Covers the processes of selection, acquisition, storage, resource discovery, delivery, access control and preservation –Covers digitised and born digital resources –Is set out in the Digital Services Project Information Paper –Includes projects to acquire new infrastructure and software

8 Online demonstration

9 Selection issues Selective acquisition: not that different to the situation with printed resources Influenced by resource constraints Based on a set of guidelines In general, we don’t collect if there is a print equivalent We recognise that there are valid arguments for the alternative (non-selective) approach

10 Acquisition issues (1) PANDAS developed during 2000-2001 Functions: –Initiates the gathers –Manages the administrative metadata –Manages the quality checking –Prepares item for public access –Manages access restrictions –Produces reports Built using WebObjects as an ADE Requires IE 5.5

11 Acquisition issues (2) Administrative metadata: –Descriptive information –Permission status (granted, denied) –Type of resource –Status (selected, rejected, monitoring) –Subjects –Collection name –Restriction information –Gathering schedule (once only, or a frequency, specified dates)
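
The administrative metadata listed on this slide could be modelled as a single record per title. The following is only a sketch: the class and field names are hypothetical and are not the PANDAS schema, and the rule that a title is gathered only when it is selected and permission has been granted is an assumption drawn from the statement that permission is sought from publishers.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Permission(Enum):
    GRANTED = "granted"
    DENIED = "denied"


class SelectionStatus(Enum):
    SELECTED = "selected"
    REJECTED = "rejected"
    MONITORING = "monitoring"


@dataclass
class TitleRecord:
    """Hypothetical administrative record for one title (names are illustrative)."""
    title: str                               # descriptive information
    url: str                                 # descriptive information
    resource_type: str                       # type of resource
    status: SelectionStatus                  # selected, rejected, monitoring
    permission: Optional[Permission] = None  # None = no answer yet (assumption)
    subjects: List[str] = field(default_factory=list)
    collection: Optional[str] = None         # collection name
    restriction: Optional[str] = None        # restriction information
    schedule: str = "once"                   # once only, a frequency, or dates

    def may_gather(self) -> bool:
        # Assumed rule: gather only selected titles with permission granted.
        return (self.status is SelectionStatus.SELECTED
                and self.permission is Permission.GRANTED)
```

A record with `status=SelectionStatus.MONITORING`, for example, would be kept in the system but never passed to the gatherer under this assumed rule.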

12 Acquisition issues (3) Gathering process: –Pull rather than push –Current gathering tool is HTTrack, but others can be plugged in –Wide range of options for the scope and timing of the gathering process –Technical metadata is also gathered
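
A "pull" gather with HTTrack amounts to running the tool against the publisher's URL and writing the copy into a directory reserved for that gather. The sketch below only builds the command line and does not run it. The directory layout (archive root, title identifier, gather date) and the function name are assumptions for illustration, not a description of how PANDAS calls HTTrack; `-O` (output path) and `-rN` (mirror depth) are standard HTTrack options.

```python
from datetime import date
from pathlib import Path
from typing import List, Optional


def build_gather_command(url: str, archive_root: str, title_id: str,
                         gather_date: date,
                         depth: Optional[int] = None) -> List[str]:
    """Return an HTTrack command line for one gather (hypothetical layout)."""
    # One directory per archived instance: <root>/<title>/<YYYYMMDD>
    out_dir = Path(archive_root) / title_id / gather_date.strftime("%Y%m%d")
    cmd = ["httrack", url, "-O", str(out_dir)]
    if depth is not None:
        cmd.append(f"-r{depth}")  # limit how many links deep the mirror goes
    return cmd
```

Keeping each gather in its own dated directory is one simple way to let a title page list every archived instance separately.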

13 Acquisition issues (4) Quality checking process: –Linkbot used to check for missing files –Manual check of the site by Electronic Unit staff –Some problems are referred to IT specialists for fixing
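
The automated part of this check, looking for files that a gathered page refers to but that were not captured, can be approximated in a few lines. This is a minimal sketch and not Linkbot's method: it assumes the gathered copy is a directory of `.html` files on disk, and it only looks at `href` and `src` attributes.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import List
from urllib.parse import unquote, urlparse


class _LinkCollector(HTMLParser):
    """Collects the values of href and src attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.targets: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.targets.append(value)


def missing_local_files(site_root: str) -> List[str]:
    """List 'page -> target' pairs whose local target file does not exist."""
    root = Path(site_root)
    missing: List[str] = []
    for page in sorted(root.rglob("*.html")):
        collector = _LinkCollector()
        collector.feed(page.read_text(encoding="utf-8", errors="replace"))
        for target in collector.targets:
            parsed = urlparse(target)
            if parsed.scheme or parsed.netloc or not parsed.path:
                continue  # external link, mailto:, or same-page fragment
            path = unquote(parsed.path)
            if path.startswith("/"):
                candidate = root / path.lstrip("/")  # site-absolute link
            else:
                candidate = page.parent / path       # page-relative link
            if not candidate.exists():
                missing.append(f"{page.relative_to(root)} -> {target}")
    return missing
```

A report like this only narrows the search; the manual check described on the slide is still needed for problems a link scan cannot see, such as pages that render incorrectly.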

14 Acquisition issues (5) Preparation for public access: –External links are disabled –Title entry page is created which: shows gathering schedule links to publisher’s site lists each archived instance informs user of required plug-ins shows access restrictions links to copyright statements –Subject and title listings are updated

15 Resource discovery Resources in PANDORA can be discovered through: –The Library’s catalogue –The National Bibliographic Database (Kinetica service) –Title and subject lists at PANDORA web site –Full text index is being planned

16 Persistent identifiers (1) Files are cited: –In other web documents –In scholarly articles –In catalogues and databases –In bookmark files These links break when the resource is moved NLA is implementing persistent identifiers

17 Persistent identifiers (2) PANDORA will use the form: – – – – This standard will ensure: –A persistent unique identifier for each file –Sufficient granularity to support preservation actions –Grouping and relating of versions Still problem of relating archival Persistent Identifier to original resource
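
The idea behind a persistent identifier is that citations hold the identifier, and a resolver maps it to wherever the file currently lives, so moving the file means updating one table entry rather than every citation. The sketch below illustrates only that indirection. The class is hypothetical, and the identifier strings used with it are made-up opaque tokens, not PANDORA's identifier format.

```python
from typing import Dict


class Resolver:
    """Hypothetical lookup table from persistent identifier to location."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, pid: str, location: str) -> None:
        # An identifier is assigned once and never reused for another file.
        if pid in self._locations:
            raise ValueError(f"identifier already assigned: {pid}")
        self._locations[pid] = location

    def move(self, pid: str, new_location: str) -> None:
        # The file moved; the identifier that citations use stays the same.
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_location

    def resolve(self, pid: str) -> str:
        return self._locations[pid]
```

For example, after `register("example-id-1", "/store/a/file.html")` and a later `move("example-id-1", "/store/b/file.html")`, a citation of `example-id-1` still resolves, now to the new location.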

18 Access control Commercial titles: –Prefer onsite use for specified period, then open access Other restricted categories: –Some indigenous material –Opinions in closed lists –Potentially libellous material –Adult material Some sites are accessible by password only, or by place (IP address)
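
Restriction "by place" together with "onsite use for a specified period, then open access" can be expressed as a small decision function. This is a sketch under stated assumptions: the network range is a documentation placeholder rather than a real Library address block, and the function and parameter names are hypothetical.

```python
from datetime import date
from ipaddress import ip_address, ip_network
from typing import Optional

# Placeholder range reserved for documentation; stands in for onsite terminals.
ONSITE_NETWORK = ip_network("192.0.2.0/24")


def may_view(client_ip: str, today: date, open_from: Optional[date]) -> bool:
    """Decide access for one request to a title with an onsite-only period.

    open_from is the assumed date on which the title becomes openly
    accessible; None means the title carries no restriction.
    """
    if open_from is None or today >= open_from:
        return True  # unrestricted, or the onsite-only period has ended
    return ip_address(client_ip) in ONSITE_NETWORK  # onsite users only
```

Password-based access, the other mechanism the slide mentions, would be a separate check and is not covered here.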

19 Preservation (1) The ultimate purpose Combination of strategies: –Technology preservation –Migration, where there are compatible new formats –Emulators, where practicable –Refreshing files and waiting for a strategy

20 Preservation (2) Analysis of PANDORA in mid 2000 revealed: –127 tags dead in HTML 4.0 –7 million tags which will be non-standard in later HTML versions –14 million tags with deprecated attributes Project plans include a file migration project
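
A survey of this kind can be approximated by parsing each archived HTML file and tallying the elements of interest. The sketch below counts occurrences of the elements that HTML 4.01 marks as deprecated. It is not the method used in the mid-2000 analysis, and it does not cover deprecated attributes or the other categories reported on the slide.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements marked "deprecated" in the HTML 4.01 specification.
DEPRECATED_ELEMENTS = {
    "applet", "basefont", "center", "dir", "font",
    "isindex", "menu", "s", "strike", "u",
}


class _TagCounter(HTMLParser):
    """Counts start tags whose element name is in DEPRECATED_ELEMENTS."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        if tag in DEPRECATED_ELEMENTS:
            self.counts[tag] += 1


def count_deprecated(html: str) -> Counter:
    """Return a per-element count of deprecated start tags in one document."""
    parser = _TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

Summing the counters across every file in the archive would give totals per element, which is the kind of figure a migration project needs in order to prioritise its work.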

21 Preservation metadata Metadata is essential for a successful preservation process NLA developed a draft set of digital preservation metadata (1999) NLA is participating in the RLG/OCLC Working Group on Preservation Metadata White Paper (January 2001) has taken account of approaches of NEDLIB, CEDARS, NLA, Harvard University

22 Need for cooperative action The LC21 report called for greater cooperation: –in digital collection building –between collecting institutions, electronic publishers, research community Need to share information about best way to migrate specific obsolete file types Need for more scientific research and for better information exchange between researchers and archiving institutions

23 Conclusions PANDORA is: –A selective approach to web archiving –A routine part of NLA collection management –A strongly representative sample of Australian web publishing –Already keeping several sites that have disappeared –Based on quality control and metadata –An important first step in preserving access by future Australians to today’s web resources
