Web Archiving at the National Library of Australia National Library of Indonesia Staff 5 October 2010 Paul Koerbin Manager, Web Archiving National Library.

1 Web Archiving at the National Library of Australia
National Library of Indonesia Staff, 5 October 2010
Paul Koerbin, Manager, Web Archiving, National Library of Australia

2 Web Archiving at the NLA
• Background
• History
• Organisation
• Participants
• Approaches to web archiving
• PANDORA selective archiving
• Whole domain harvesting
• Skills and operational tasks
• Workflows and systems
• PANDAS

3 History: web archiving at the NLA
• April 1996: ‘Electronic Unit’ established
• Part of Acquisitions Branch
• 3 staff, 6 months to develop selection (scope) guidelines and identify resources
• September 1996: ‘Australian Serials and Electronic Unit’ established
• Technical services restructure, multi-tasking, matrix management
• October 1996: first titles harvested
• November 1996: PANDORA born as ‘proof of concept project’
• As at June 1997, 30 titles harvested

4 History: web archiving at the NLA
• May 1998: public access to PANDORA titles
• July 1998: first PANDORA ‘partner’ began participation
• 11th participant joined in 2010
• October 1998: first ‘Certified Agreement’ commenced in the Library
• Change to staffing classifications; professional librarian streams abolished
• June 2001: PANDAS v.1 released
• Web archiving workflow system developed by NLA
• 2002: Digital Archiving Branch
• Our own identity at last!
• Began first trial of ‘mainstreaming’ web archiving in Serials and Govt Deposit sections

5 History: web archiving at the NLA
• August 2002: PANDAS v.2 released
• July 2003: joined IIPC
• 2004: PANDORA added to UNESCO Australian Memory of the World Register
• July 2005: first .au domain harvest
• Subsequent harvests in 2006, 2007, 2008 & 2009
• December 2006: ‘Web Archiving and Digital Preservation Branch’
• July 2007: PANDAS v.3 released
• 2010: PANDORA search moved to Trove
• May 2010: Whole-of-govt ‘opt-out’ arrangements endorsed by SIGB

6 Organisation chart: Web Archiving Section
• Manager, Web Archiving (base level executive)
• Team Leader (senior librarian), Web Archiving Section
• Team Member (APS5), Web Archiving Section
• Team Member (APS4), Web Archiving Section
• Team Member (APS4), Web Archiving Section

7 Organisation chart: Division 1 – Collections Management (Assistant Director-General)
• Branches and units: Web Archiving and Digital Preservation Branch; Serials Branch; Monographs Branch; Digital Collections Management Branch; Asian Collections Branch; Preservation Branch; Bibliographic Standards and Strategies Branch; Division Support Unit
• Sections and projects: Digitisation; Australian Collection Development; Special Materials; Cataloguing, Standards & Training; Web Archiving; Imaging Services; Jakarta Office; Overseas Collection Development Section; ILMS Section; Serials Section; Preservation Standards; Collections Preservation; Digital Preservation; Newspaper Digitisation Project; Australian Newspaper Plan; Acquisition & Access; RDA

8 PANDORA Participants
• 11 participants including the NLA
• All state and territory libraries (except Tasmania and ACT)
• Major heritage institutions
• National Film and Sound Archive
• Australian War Memorial
• Australian Institute of Aboriginal and Torres Strait Islander Studies
• National Gallery of Australia

9 PANDORA Participants
• Memorandum of Understanding
• Respective obligations (NLA and Agencies)
• Adherence to policy and procedures
• Curatorial and collection management (operational staff)
• Selection – participant guidelines
• Permissions
• Harvesting – scoping and quality checking
• Cataloguing
• Publishing – access through PANDORA

10 What is web archiving?
• A web archive is not the same as the live web
• Brings a different value to web content
• Creating artefacts from the web
• Preserved snapshots, slices, gobbets of time
• Challenge of timeliness
• At certain times some things are more interesting and valuable
• Focus on the future and long term access (preservation objective)

11 Approaches to web archiving?
• Selective (specific targets)
• Websites
• Single publications
• Domain
• Country domains (e.g. .au or .id)
• Sub-domains (e.g. .gov.au)
• Thematic
• Scoped around topics, events, forms of publishing
• Seed lists
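As a rough illustration of domain scoping, the sketch below tests whether a URL's host falls under a country domain or sub-domain such as .au or .gov.au. The function name `in_domain_scope` and its parameters are hypothetical and chosen for this example; this is not how the NLA or the Internet Archive define crawl scope.

```python
from urllib.parse import urlsplit


def in_domain_scope(url: str, suffixes: tuple[str, ...] = (".au",)) -> bool:
    """Return True if the URL's host sits under one of the domain suffixes.

    Hypothetical helper: suffixes are written with a leading dot, e.g. ".gov.au".
    """
    host = (urlsplit(url).hostname or "").lower()
    return any(host.endswith(suffix) for suffix in suffixes)


# The PANDORA address from the final slide is inside both scopes.
print(in_domain_scope("http://pandora.nla.gov.au/"))                # True
print(in_domain_scope("http://pandora.nla.gov.au/", (".gov.au",)))  # True
# A host outside the country domain is rejected.
print(in_domain_scope("http://example.com/"))                       # False
```

A real domain harvest "covers a bit more" than the suffix alone, so a suffix test like this would only be a starting point.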

12 PANDORA - Australia’s Web Archive
• Selective approach – Australian content
• Collaboration with participating agencies
• No legal deposit
• Permissions based collecting
• Timely and scheduled collecting
• Quality checked
• Described and indexed (searchable)
• Accessible to the public
• Modest in size

13 Australian web domain harvests
• Annual domain harvests 2005-2009
• Working with the Internet Archive
• Covers the .au top-level domain and a bit more …
• No legal deposit
• Permissions not sought
• No public access (yet)
• Quantity over quality (no QA action)
• Full text indexed (searchable), not catalogued
• Opportunistic rather than timely

14 Comparative statistics
PANDORA: Files: 94 million; Size: 4.23 TB
Domain Harvests: Files: 3 billion; Size: 103 TB
Domain Harvest   2005         2006         2007         2008         2009
Unique files     185 million  596 million  516 million  1 billion    765 million
Hosts crawled    811,523      1,046,038    1,247,614    3,038,658    1,074,645
Size (TB)        6.69         19.04        18.47        34.55        24.29
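The domain harvest totals on this slide can be cross-checked against the per-year figures. The sketch below simply re-adds the numbers from the table; the variable names are illustrative, and the 2008 file count is entered as 1,000 million because the slide gives it only as "1 billion".

```python
# Per-year domain harvest figures copied from the slide.
harvests = {
    2005: {"files_millions": 185, "size_tb": 6.69},
    2006: {"files_millions": 596, "size_tb": 19.04},
    2007: {"files_millions": 516, "size_tb": 18.47},
    2008: {"files_millions": 1000, "size_tb": 34.55},  # slide says "1 billion"
    2009: {"files_millions": 765, "size_tb": 24.29},
}

total_files_millions = sum(h["files_millions"] for h in harvests.values())
total_size_tb = round(sum(h["size_tb"] for h in harvests.values()), 2)

print(total_files_millions)  # 3062, i.e. roughly the 3 billion files quoted
print(total_size_tb)         # 103.04, matching the 103 TB quoted

# Size of the five domain harvests relative to PANDORA (4.23 TB).
pandora_size_tb = 4.23
ratio = total_size_tb / pandora_size_tb
print(f"Domain harvests are about {ratio:.0f} times the size of PANDORA")
```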

15 Skills and tasks
• Operational, Library’s ‘core business’ staff:
• Librarians, web curators, web archivists, cataloguers … by any other name
• Perform all associated tasks:
• Selection, permissions, acquisition (harvesting) processes, quality checking, cataloguing, publishing (resource discovery)

16 Operational skills and tasks
• Collection development
• Selection expertise in ‘new media’
• Corporate objectives, priorities, resources
• Collection management
• Cataloguing: MARC, LCSH, Dewey
• PANDORA subjects
• Technical skills
• Scoping gather filters and settings
• Harvesting and code problem analysis and resolution (HTML, JavaScript, stylesheets)
• Understanding web technologies
• Experience and self-learning
• New technologies, Web 2.0, timely collecting, always new challenges

17 IT commitment and support
• All infrastructure maintained at NLA
• Systems and applications
• Storage of archival content
• Continuous development of systems from 1997-2007
• 3 version releases of PANDAS
• Technical support for applications and systems
• Expertise to assist with harvesting problems
• Support for domain harvests

18 Overview of PANDORA procedures
• PANDAS (PANDORA Digital Archiving System)
• Workflow management system
• HTTrack harvesting software
• Agencies (PANDORA participants)
• Users
• Administrators (PANDAS and Agency)
• Standard user
• Informational user
• ‘Worktrays’ manage individual and agency workflow

19 Overview of PANDORA workflows
• Some concepts:
• Titles
• The target entity: a single document, a website, and everything in-between
• Publishers
• Permissions
• Instances
• Each instance of an archived title
• Users (‘owners’)
• Belong to Agencies and own titles
• Manage workflow among different agencies/people
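The relationships between these concepts can be pictured as a small data model. The sketch below is only an illustration of the slide's wording (users belong to agencies and own titles; a title accumulates instances); the class and field names are assumptions made for this example and do not reflect the actual PANDAS schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Agency:
    name: str


@dataclass
class User:
    name: str
    agency: Agency  # users belong to agencies


@dataclass
class Instance:
    gathered_on: date  # one archived snapshot of a title


@dataclass
class Title:
    name: str
    seed_url: str
    publisher: str
    owner: User  # users ('owners') own titles
    instances: list[Instance] = field(default_factory=list)

    def add_instance(self, gathered_on: date) -> Instance:
        """Record a new archived instance of this title."""
        instance = Instance(gathered_on)
        self.instances.append(instance)
        return instance


# Hypothetical usage with placeholder names.
agency = Agency("Example Agency")
owner = User("example.user", agency)
title = Title("Example Title", "http://example.org/", "Example Publisher", owner)
title.add_instance(date(2010, 10, 5))
print(len(title.instances), title.owner.agency.name)
```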

20

21 Worktrays - Selection
• Nominating titles
• Shared agency worktray
• Before selection decision is made
• Selection statuses:
• Selected
• Rejected
• Monitored

22

23 Worktrays - Permission
• Requesting publisher permission
• Licence under Copyright Act
• Copy, preserve and make accessible
• Manage and record publisher contact
• Record permission status
• Title level permission
• Publisher level permission (‘blanket’)
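One way to picture the two permission levels is a lookup that prefers a title level record and otherwise falls back to a publisher level 'blanket' permission. The sketch below assumes that precedence and uses invented status strings ("granted", "denied", "unknown"); the slide does not say how PANDAS actually resolves or names permission statuses.

```python
from typing import Optional


def effective_permission(title_status: Optional[str], publisher_blanket: bool) -> str:
    """Resolve a permission status for one title.

    Assumed rule (not confirmed by the slides): an explicit title level
    status wins; otherwise a publisher level blanket permission applies.
    """
    if title_status is not None:
        return title_status
    return "granted" if publisher_blanket else "unknown"


print(effective_permission(None, publisher_blanket=True))      # granted
print(effective_permission("denied", publisher_blanket=True))  # denied
print(effective_permission(None, publisher_blanket=False))     # unknown
```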

24

25 Worktrays – Gather (Harvest)
• Set harvesting schedules
• Regular, specific days, gather now
• Define harvesting parameters
• Seed URLs, filters, gather settings
• View gathering titles
• Pause, view, modify, stop
• Statistics
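To make "seed URLs and filters" concrete, the sketch below models a gather definition as a set of seed URL prefixes plus exclusion patterns and checks whether a candidate URL would be collected. The `GatherSettings` class, its fields and the shell-style wildcard patterns are assumptions for illustration only; they are not the PANDAS settings screen and not HTTrack's own filter syntax.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class GatherSettings:
    """Hypothetical gather definition: seeds to start from, patterns to skip."""
    seed_urls: list[str]
    exclude_patterns: list[str] = field(default_factory=list)

    def allows(self, url: str) -> bool:
        """Return True if the URL is under a seed and matches no exclusion."""
        under_seed = any(url.startswith(seed) for seed in self.seed_urls)
        excluded = any(fnmatchcase(url, pattern) for pattern in self.exclude_patterns)
        return under_seed and not excluded


settings = GatherSettings(
    seed_urls=["http://example.org/report/"],
    exclude_patterns=["*.zip", "*/calendar/*"],
)

print(settings.allows("http://example.org/report/index.html"))  # True
print(settings.allows("http://example.org/report/data.zip"))    # False
print(settings.allows("http://example.org/other/page.html"))    # False
```

Tightening or loosening patterns like these is the kind of scoping work the skills slide describes as "scoping gather filters and settings".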

26

27 Worktrays - Preserve
• Manage quality checking process
• Not yet archived – working area
• Analyse harvested instance:
• Completeness
• No unwanted content
• Functionality
• Fix problems (or ‘refer to IT’)
• WebDAV, FTP and Samba access to files
• Decision on instance: Archive or Delete

28

29 Worktrays - Publish
• Manages the public access to archived instances
• Set up Title Entry Pages
• Add notes
• Issues
• Copyright statements
• Browse listings

30

31 Worktrays - Catalogue
• Add ANBD number
• Automatically creates AGLS metadata for Title Entry Page

32 Administration
• Manage Agency information
• Add users
• Manage user access
• Run reports
• Agency statistics and totals
• Titles and instances selected, processed and archived for specified period
• New title instances archived
• Scheduled gathers

33 http://pandora.nla.gov.au

