PANDORA and Beyond: Managing Web Archiving at the National Library of Australia Digital Preservation Seminar National Library of Australia, 21 November.

1 PANDORA and Beyond: Managing Web Archiving at the National Library of Australia Digital Preservation Seminar National Library of Australia, 21 November 2006 Paul Koerbin Manager Digital Archiving National Library of Australia pkoerbin@nla.gov.au

2 PANDORA and Beyond Context and background PANDORA – selective archiving PANDAS – a web archiving system Domain harvesting Now and beyond

3 PANDORA and Beyond – Context - Legislation National Library Act, 1960 Functions of the National Library –Maintain and develop a national collection of library material, including a comprehensive collection of library material relating to Australia and the Australian people –To make library material in the national collection available … in the national interest –‘Library material’ ~ books, periodicals, newspapers, manuscripts, films, sound recordings, musical scores, maps, plans, pictures, photographs, prints and other recorded material …

4 PANDORA and Beyond – Context - Legislation Copyright Act, 1968 – Sect 201 Delivery of library materials to the National Library –‘Library material’ ~ book, periodical, newspaper, pamphlet, sheet of letter-press, sheet of music, map, plan, chart or table, being a literary, dramatic, musical or artistic work or an edition of such a work … Enabling and supportive legislation does not address the collection of digital content Copyright Amendment (Digital Agenda) Act, 2000 –some support for digital preservation actions

5 PANDORA and Beyond – Context – Web Publishing World Wide Web: a new publishing medium, 1995→ Defining a publication for our purpose: A publication is information, regardless of its format or method of delivery, that is made available to the general public, or to an identified public, either free of charge or for a fee. Definition from: PANDORA Selection Guidelines http://pandora.nla.gov.au/selectionguidelines.html#pubdefinition Content rendered through a web browser Email – only as delivery mechanism (e.g. PDF) Databases – yes, but more problematic

6 PANDORA and Beyond – Context – Web Publishing Enormous growth and volume of material Everyone can be creators and publishers Virtually instantaneous publication Dynamic content and format Multiplicity of formats Technology dependent Hyperlinked and interconnected Highly accessible but hard to identify Ephemeral Interactivity, re-use, personalisation (web 2.0)

7 PANDORA and Beyond – Context – Some Objectives Fulfil the functions of the National Library Identify published content to collect Manage content for long term preservation –Integrity of the data streams –Maintain access to authentic content Provide persistent access to the content Incorporate collection and preservation of web content into routine Library processes Efficient and sustainable

8 PANDORA and Beyond – The PANDORA Archive PANDORA Archive 1996→ Began as proof-of-concept project Now a routine process within NLA Currently 10 participants – NLA, state libraries (not Tas), NFSA, AWM, AIATSIS Selective, content focused (bibliocentric) –simple documents to whole websites PANDAS workflow management system, 2001→

9 PANDORA and Beyond – PANDORA – Web Archiving What is web archiving? Identifying and selecting Seeking permission to collect and make accessible Recording metadata Crawling/harvesting (including scheduling) Processing for quality assurance (best effort) Storing and maintaining the data Preparing and rendering for public display Creating resource discovery metadata
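
The activities listed on this slide can be read as stages that an archived title passes through. The following is a minimal sketch of that reading, not the PANDAS data model: the `Stage` names and the `next_stage` helper are hypothetical, and the strictly linear ordering is an assumption (the slide lists the activities but does not say they always occur in this sequence).

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """Hypothetical stage names, one per activity listed on the slide."""
    SELECTION = 1            # identifying and selecting
    PERMISSION = 2           # seeking permission to collect and make accessible
    METADATA = 3             # recording metadata
    HARVEST = 4              # crawling/harvesting (including scheduling)
    QUALITY_ASSURANCE = 5    # processing for quality assurance (best effort)
    STORAGE = 6              # storing and maintaining the data
    DISPLAY = 7              # preparing and rendering for public display
    DISCOVERY_METADATA = 8   # creating resource discovery metadata


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the last one."""
    if stage is Stage.DISCOVERY_METADATA:
        return None
    return Stage(stage + 1)


if __name__ == "__main__":
    current: Optional[Stage] = Stage.SELECTION
    while current is not None:
        print(current.value, current.name)
        current = next_stage(current)
```

An integer-backed enum keeps the ordering explicit, so a real system could store the stage as a single column and compare stages directly.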

10 PANDORA and Beyond – PANDAS PANDAS – PANDORA Digital Archiving System Web based workflow management system Developed specifically to manage the web archiving processes at the National Library of Australia Used by PANDORA’s participants located throughout Australia (mainland state libraries, AWM, NFSA, AIATSIS) Also used by UKWAC

11 PANDORA and Beyond – PANDAS Developed in-house at the NLA Replaced multiple non-integrated systems used between 1996 and 2001 Written in Java on Apple WebObjects application development platform Presentation, application, business and data layers Version 1 released June 2001 Version 2 released August 2002 Version 3 due early 2007

12 PANDORA and Beyond – PANDAS

14 PANDORA and Beyond – PANDAS Record administrative metadata about titles selected (or considered) for archiving Schedule and initiate harvesting –but not a crawler; currently use HTTrack Manage quality assurance checking and problem fixing workflow Prepare and deliver archived copies for public display through the PANDORA home page –dynamically from PANDAS database Manage access restrictions Facilitates management reporting

15 PANDORA and Beyond – Persistent URIs Running number generated by PANDAS Persistent URL applied to title entry page http://nla.gov.au/nla.arc-21220 Logically extended to any resource in the Archive http://nla.gov.au/nla.arc-21220-20030822-www.ipjp.org/september2002/schweitzer-ed.html Citation generator on public interface
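
The two example URLs on this slide suggest a simple composition rule, sketched below. This illustrates the pattern visible in the examples and is not the PANDAS implementation. The function names are hypothetical, and two points are assumptions: that the middle segment (`20030822`) is a date in YYYYMMDD form, and that the original URL is appended with its scheme removed.

```python
from urllib.parse import urlsplit

# Prefix taken from the slide's example URLs.
BASE = "http://nla.gov.au/nla.arc-"


def title_uri(title_id: int) -> str:
    """Persistent URL for a title entry page, built from the running number."""
    return f"{BASE}{title_id}"


def resource_uri(title_id: int, date_yyyymmdd: str, original_url: str) -> str:
    """Persistent URL for one resource inside an archived instance.

    Assumption: the date segment is YYYYMMDD and the original URL is
    appended without its scheme, as the slide's example appears to show.
    """
    parts = urlsplit(original_url)
    target = parts.netloc + parts.path
    if parts.query:
        target += "?" + parts.query
    return f"{title_uri(title_id)}-{date_yyyymmdd}-{target}"


if __name__ == "__main__":
    print(title_uri(21220))
    print(resource_uri(
        21220,
        "20030822",
        "http://www.ipjp.org/september2002/schweitzer-ed.html",
    ))
```

Because the resource form is the title form plus a suffix, any citation built this way still identifies the title it belongs to.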

16 PANDORA and Beyond – PANDORA Statistics Indicative statistics as at October 2006 13,000+ titles 26,000+ archived instances 33.5+ million files* 1.2+ Terabytes data* * These figures are for the display copy only. Three preservation copies are actually maintained: a preservation master, an access master and a metadata master.

17 PANDORA and Beyond – Domain Harvesting Crawl conducted by the Internet Archive for the NLA 1st harvest June/July 2005 –4 weeks, 185m files, 6.69 TBs 2nd harvest Aug/Sept 2006 –5 weeks, 516m files, 19.04 TBs Harvest of the .au top level domain –plus, non-.au hosts identified through geoIP lookup as being hosted in Australia Domain harvesting – obvious choice?

18 Comparative statistics
PANDORA (c. 6% of 2006 DH)
Files: 33 million
Size: 1.2 TB
HTML: 67%
Image files: 28.5%
PDF files: 1.6%
MS Word files: 0.2%
DH MIME types
Domain Harvest   2005          2006
Unique files     185,549,662   516,280,205
Hosts crawled    811,523       1,046,038
Size             6.69 TB       19.04 TB
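
The "c. 6%" comparison and the growth between the two harvests can be checked from the figures on this slide. The sketch below uses only those figures; the variable and function names are illustrative.

```python
# Figures as given on the slide (PANDORA display copy vs. domain harvests).
pandora = {"files": 33_000_000, "size_tb": 1.2}
domain_harvest = {
    2005: {"files": 185_549_662, "hosts": 811_523, "size_tb": 6.69},
    2006: {"files": 516_280_205, "hosts": 1_046_038, "size_tb": 19.04},
}


def percent_of(part: float, whole: float) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


def growth_factor(earlier: float, later: float) -> float:
    """Return how many times larger `later` is than `earlier`."""
    return later / earlier


dh2005, dh2006 = domain_harvest[2005], domain_harvest[2006]

# PANDORA as a share of the 2006 domain harvest, by file count and by size.
file_share = percent_of(pandora["files"], dh2006["files"])
size_share = percent_of(pandora["size_tb"], dh2006["size_tb"])

# Growth from the 2005 harvest to the 2006 harvest.
file_growth = growth_factor(dh2005["files"], dh2006["files"])
host_growth = growth_factor(dh2005["hosts"], dh2006["hosts"])
size_growth = growth_factor(dh2005["size_tb"], dh2006["size_tb"])

if __name__ == "__main__":
    print(f"PANDORA share of 2006 DH: {file_share:.1f}% of files, {size_share:.1f}% of size")
    print(f"2005 to 2006 growth: files x{file_growth:.2f}, hosts x{host_growth:.2f}, size x{size_growth:.2f}")
```

Both shares fall between 6% and 7%, consistent with the slide's "c. 6%". The number of files and the size both grew by more than 2.5 times between harvests, while the number of hosts crawled grew by less than 1.5 times.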

19 PANDORA and Beyond – Domain Harvesting – Pros and Cons Convergence of resources, technology, collaborations, and purpose in 2005 Some pros – –Retains linkages and context –Large scale – more bytes for the buck –Less selectively discriminate Some cons – –High dependence on the crawler technology –Domain and geo-location bias (.au, geoIP) –Limitations in timeliness, quality assurance, scoping, site complexity, deep web –Legal and access issues to resolve

20 PANDORA and Beyond – Now 10 years selective web archiving for PANDORA –publicly accessible web archive 2 years domain harvesting –large scale archival content PANDAS –production workflow system Tangible outcomes from pragmatic approach Doing (what we can) with limited resources Developing experience, knowledge and skill through practical engagement in the tasks

21 PANDORA and Beyond – Future Strategies Renewed focus on strategic thinking Collaborations, relationships, partnerships –International Internet Preservation Consortium Internet Archive –Open source tools, standards (IIPC) –Institutional and trusted repositories (universities and e-presses) –Government & academic sectors (APSR, ARROW) –‘research information infrastructure’ services that support the discovery and management of research resources and research outputs by and for the current and future research community

22 PANDORA and Beyond – Future Strategies Preservation planning and infrastructure Sustainable resourcing and workflows Push for legislation for collecting in the digital age Understanding the territory –Personal web archiving (HanzoWeb); archive crawlers (Warrick); advanced bookmarking (spurl.net) Strategic use of selective and domain harvesting Architecture, systems and workflows for efficient management of and access to web archive collections

23 PANDORA Australia’s Web Archive http://pandora.nla.gov.au/

