PANDORA Australia’s Web Archive Library Science Talks SNL/CERN, September 2004 Paul Koerbin Digital Archiving Branch National Library of Australia


PANDORA Australia’s Web Archive
1. Background and approach to web archiving
2. The management system (PANDAS)
3. Workflows and procedures
4. Issues and future directions

1. Background and approach to web archiving in Australia: PANDORA

Beginnings
- Name originally an acronym for ‘Preserving and Accessing Networked Documentary Resources of Australia’
- Now: ‘Australia’s Web Archive’
- Selecting began in mid-1996
- Archiving began in late 1996 to early 1997

Approach
- Practical and pragmatic
- Began as: proof-of-concept project
- Now: routine National Library activity
- Achieving outcomes while continuing to develop and extend processes and systems
- Best use of available resources and infrastructure

Resources
- Existing technical services staff (librarians)
- Digital Archiving Branch has the business responsibility
- Information technology staff from within the Library for development and support
- PANDORA partner institutions (10 including the NLA)

Mandate and responsibilities
- National Library of Australia’s statutory responsibilities
- National Library Act, 1960
- Maintain and develop a national collection of ‘library material’
- Comprehensive collection relating to Australia and the Australian people

Mandate and responsibilities
- National Library has a leadership role for the Australian library community
- Legal deposit
- Legal deposit in the federal jurisdiction in Australia does not cover electronic resources

Some key characteristics
- Selective approach to archiving online resources
- Scalable to available resources and do-able
- Negotiate permission to archive
- Apply manual quality assurance processes to harvested resources
- Provide access to the archived resources

Shortcomings of selective approach
- Can’t collect everything that future researchers may want
- Labour intensive tasks
- Does not retain the full complexity of the linking structure of the Internet

Indicative statistics as at August
- ,500+ titles
- 13,000+ archived instances
- 21 million files*
- 680 gigabytes*
*These figures are for the display copy only. Two more preservation copies plus preservation metadata are maintained.

2. The management system: PANDAS

PANDAS: PANDORA Digital Archiving System
- Integrated web-based system
- Workflow management system
- Developed specifically to manage the web archiving processes at the National Library of Australia
- Used by PANDORA’s partners located throughout Australia

- Developed in-house at the NLA
- Replaced multiple non-integrated systems used between 1996 and 2001
- Written in Java on the Apple WebObjects application development platform
- First version released in June 2001
- Second version released in August 2002
- Ongoing enhancement and development program

PANDAS system architecture consists of 4 layers
1) Presentation layer: client applications for visual presentation to the end user
2) Application layer: the core application functionality such as PANDAS and PANDORA

PANDAS system architecture consists of 4 layers
3) Business layer: application access to the data storage and communication infrastructure
4) Data layer: third party infrastructure products, e.g. Oracle database and WebDAV-accessible file servers

Nomenclature
- PANDORA: the whole enterprise
- PANDAS: the whole management system
- PANDAS: the system component providing a web-based user application to manage workflows
- PANDORA: the system component that creates the public interface

PANDAS is used to:
- Record administrative metadata about titles selected (or rejected or monitored) for archiving
- Schedule and initiate harvesting
- Manage quality assurance checking and problem fixing

PANDAS is used to:
- Prepare items for public display through the PANDORA home page
- Manage access restrictions
- Generate management reports

PANDAS is a workflow system that:
- Connects with and utilises other software and protocols for specific functions
- Provides an interface to the harvesting software; currently this is HTTrack (

PANDAS is a workflow system that:
- Uses the WebDAV protocol to provide content managers with remote access to the harvested files
- Uses the Z39.50 protocol to access the National Bibliographic Database to extract metadata from the MARC record

PANDORA public interface component
- Title and subject listings and title entry pages are generated ‘on-the-fly’ from PANDAS metadata
- Some static web pages (documents, information)
- Search engine

Persistent identifiers and URLs
- Running number generated by PANDAS
- Persistent URL applied to title entry page
- Logically extended to any resource in the Archive
- Citation generator on public interface
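One way to picture how a running number can be "logically extended" to any archived resource is the sketch below. The resolver base `https://archive.example/id/`, the path layout, and the function name are all assumptions made for illustration; the slides do not give the actual URL scheme.

```python
# Hypothetical resolver base; the real persistent URL scheme is not stated in the slides.
RESOLVER = "https://archive.example/id/"


def persistent_url(title_number: int, instance_date: str = "", path: str = "") -> str:
    """Build a persistent URL from a title's running number.

    With only the number, the URL names the title entry page. Adding an
    instance date and a path extends the same identifier to one file in
    one archived instance.
    """
    url = f"{RESOLVER}{title_number}"
    if instance_date:
        url += f"/{instance_date}"
    if path:
        url += "/" + path.lstrip("/")
    return url


print(persistent_url(10421))
print(persistent_url(10421, "20040901", "/reports/annual.pdf"))
```

Because every deeper URL is derived from the same running number, the identifier stays stable even if the underlying storage location changes.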

3. Workflows and procedures

- Identifying and selecting
- Recording administrative metadata
- Harvesting
- Quality assurance processing
- Archiving
- Preparing for public display
- Creating resource discovery metadata
- Reporting

Identifying and selecting
- Selection guidelines: each partner has its own guidelines
- Just guidelines, not rules nor ideology
- Selection priorities in guidelines (NLA)
- Notification networks: indexing agencies, staff, publishers, public
- NLA selection guidelines available at:

Selection: what sort of publications?
- Titles: the entities to be archived
- Defined during the selection process
- Document-like publications, e.g. PDF
- Whole web sites
- Parts of web sites

Selection: what sort of publications?
- Focus on content: substantial, unique
- Special events or issues
- Format or potential technical problems are not, in principle, a selection consideration
- One-off archiving
- Scheduled archiving: whole entity, not an update

Recording administrative metadata
- Four types of records
  – Title
  – Publisher
  – Indexer
  – Collection
- Selection status
- Additional details associated with status (standing)

Administrative metadata
- Publisher details
- Archiving permission status
- Access restrictions
- Notes
- Assigning ownership of titles
- Transfer titles between agencies

Harvesting
- Mostly harvesting from the Web
- Also able to upload from local drives (WebDAV protocol)
- Third party software: HTTrack
- PANDAS interface to set up harvesting rules

Harvesting
- Define extent of selected resource to be archived
- Set gather filters and gather settings
- Set gather schedule
- Initiate harvesting

Scheduling harvesting
- Significant function of PANDAS
- Regular schedules, e.g. weekly, monthly, annual
- Specific dates
- Harvest now
- Combination of scheduling options
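The scheduling options listed here can be combined into one list of upcoming gather dates. The following is a minimal sketch of that idea, not the PANDAS implementation; the function names and the rule for clamping month-end dates are assumptions.

```python
import calendar
from datetime import date, timedelta


def next_regular(last: date, schedule: str) -> date:
    """Next gather date after `last` for a regular schedule."""
    if schedule == "weekly":
        return last + timedelta(weeks=1)
    if schedule == "monthly":
        year = last.year + last.month // 12
        month = last.month % 12 + 1
    elif schedule == "annual":
        year, month = last.year + 1, last.month
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    # Clamp the day so that 31 January plus one month lands on the end of February.
    day = min(last.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def upcoming(last, schedule, specific_dates, harvest_now, today):
    """Combine a regular schedule, specific dates and 'harvest now'."""
    due = {d for d in specific_dates if d >= today}  # drop dates already past
    if schedule is not None:
        due.add(next_regular(last, schedule))
    if harvest_now:
        due.add(today)
    return sorted(due)


print(upcoming(date(2004, 9, 1), "weekly",
               [date(2004, 9, 3)], True, date(2004, 9, 2)))
```

Using a set means that a specific date which coincides with the regular schedule produces one gather, not two.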

Harvesting: filters and settings
- Default settings
  – Ignore robots.txt rules, because permission to archive has been obtained from the publisher
  – Gather sub-directories
  – Gather ‘near files’, e.g. linked images
  – Limit on depth: sufficient for any web site, but set to prevent abuse of the host server

Harvesting: filters and settings
- Gather filters are critical
- Selection based on specific content
- Archiving permission for specific content
- Efficient use of resources (bandwidth, storage)
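The gather settings and filters described on these two slides amount to a scope decision for each URL the harvester encounters. The sketch below illustrates that decision in plain Python; it is not HTTrack's filter syntax, and the function name, the suffix list, and the default depth of 10 are assumptions.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Assumed set of 'near file' types: resources a page needs in order to display.
NEAR_FILE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")


def in_scope(url, seed, link_depth, is_embedded=False, max_depth=10, exclude=()):
    """Decide whether `url` should be gathered for a title starting at `seed`."""
    if link_depth > max_depth:
        return False  # depth limit protects the host server
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False  # gather filter: content not selected or not permitted
    u, s = urlsplit(url), urlsplit(seed)
    seed_dir = s.path.rsplit("/", 1)[0] + "/"
    if u.netloc == s.netloc and u.path.startswith(seed_dir):
        return True  # same directory or a sub-directory of the seed
    # 'Near files' may live elsewhere but are gathered when a page embeds them.
    return is_embedded and u.path.lower().endswith(NEAR_FILE_SUFFIXES)


seed = "http://example.org/pubs/index.html"
print(in_scope("http://example.org/pubs/2004/report.pdf", seed, 2))
print(in_scope("http://example.org/other/page.html", seed, 1))
```

Exclusion patterns are checked before anything else, which matches the slide's point that filters carry both the selection decision and the scope of the publisher's permission.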

Quality assurance
- Important process for PANDORA
- Owner of title notified when harvest is complete
- Visual, manual checking process
- Check for completeness and functionality
- Check that content is new (if previously archived)
- Check that there is no extraneous material

Quality assurance
- Harvested files are in a working area, not ‘archived’ at this stage
- WebDAV (protocol) access to the working area
- Problem analysis and fixing
- Missing files, broken links
- Complex problems referred to IT support through PANDAS error reporting module

Quality assurance
- Problems due to limitations of harvesting software
- Excessive use of JavaScript
- Deep web resources
- Traps such as metafiles, absolute links
- Other methods of acquisition (CD, FTP)
- Business decision whether or not to accept the harvested instance

Archiving
- Harvested instance is accepted
- One-click process for PANDAS user
- Transfers instance from working area to Digital Object Storage System
- Creates preservation and display copies
- Perl scripts, e.g. to re-write external links
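The slide mentions Perl scripts that re-write external links but does not say what the links are re-written to. As an illustration only, the Python sketch below redirects links that leave the archived host through a notice page; the notice URL `https://archive.example/external?link=` and the regular expression are assumptions, and a production script would more likely use an HTML parser.

```python
import re
from urllib.parse import quote, urlsplit

# Hypothetical notice page that tells the reader they are leaving the archive.
NOTICE_URL = "https://archive.example/external?link="

# Matches double-quoted href attributes only; a deliberately simple pattern.
HREF = re.compile(r'(href\s*=\s*")([^"]+)(")', re.IGNORECASE)


def rewrite_external_links(html: str, archived_host: str) -> str:
    """Point links to other hosts at the notice page; leave local links alone."""
    def repl(match):
        target = match.group(2)
        host = urlsplit(target).netloc
        if host and host != archived_host:
            return match.group(1) + NOTICE_URL + quote(target, safe="") + match.group(3)
        return match.group(0)  # relative link or same host: unchanged
    return HREF.sub(repl, html)


page = '<a href="http://other.example/a?b=1">x</a> <a href="page2.html">y</a>'
print(rewrite_external_links(page, "www.example.org"))
```

Relative links have no host component, so they pass through untouched and continue to resolve inside the archived copy.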

Archiving: preservation master copies
- Preservation master, incl. harvest log files
- Display master, which includes changes made to the harvested instance (manual and scripts)
- Metadata master: HTTP header responses
- Gzip-compressed tarball (tape archive format) on Digital Object Storage System (DOSS)
- Access (display) copy on web server
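Packing a harvested instance into a gzip-compressed tarball can be sketched with Python's standard `tarfile` module. The directory and file names below are invented for the example, and this is not the script used by PANDAS.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_instance(instance_dir: Path, out_path: Path) -> list:
    """Write `instance_dir` to a .tar.gz file and return the member names."""
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname keeps paths relative to the instance directory itself.
        tar.add(instance_dir, arcname=instance_dir.name)
    # Re-open the archive to confirm it is readable before trusting it.
    with tarfile.open(out_path, "r:gz") as tar:
        return sorted(tar.getnames())


with tempfile.TemporaryDirectory() as tmp:
    inst = Path(tmp) / "instance-001"  # hypothetical working-area directory
    inst.mkdir()
    (inst / "index.html").write_text("<html></html>")
    (inst / "harvest.log").write_text("gather complete")
    print(pack_instance(inst, Path(tmp) / "instance-001.tar.gz"))
```

Reading the archive back immediately after writing is a cheap integrity check before the working-area copy is removed.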

Preparing for public access: title entry pages
- Generated ‘on-the-fly’ from content of PANDAS database
- Partner branding
- Link to publisher’s site
- Links to dated archived instances
- Manual additions: notes, links to serial issues, copyright statement

Preparing for public access: listings and collections
- Subject listings
- Title listings
- Partner views
- Collections: events, sampling over specific time period

Public access: restrictions
- Period
- Date
- Authentication
- IP addresses/subnet mask (i.e. physical locations such as a single PC in the NLA main reading room)
- PANDAS manages these automatically; they can be manually enabled or disabled
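A date-based restriction combined with an IP address or subnet exemption could be checked as in the sketch below. The rule that exempt networks may view a title during the restricted period, and all names used, are assumptions for illustration; the slides do not describe how PANDAS evaluates restrictions.

```python
from dataclasses import dataclass
from datetime import date
from ipaddress import ip_address, ip_network
from typing import Optional, Tuple


@dataclass
class Restriction:
    restricted_until: Optional[date] = None   # None means no date restriction
    allowed_networks: Tuple[str, ...] = ()    # e.g. one reading-room PC as a /32
    enabled: bool = True                      # manual enable/disable switch


def may_view(r: Restriction, client_ip: str, today: date) -> bool:
    """Return True if the client may see the archived instance today."""
    if not r.enabled:
        return True
    if r.restricted_until is None or today >= r.restricted_until:
        return True  # the restriction has lapsed automatically
    addr = ip_address(client_ip)
    return any(addr in ip_network(net) for net in r.allowed_networks)


rule = Restriction(date(2005, 1, 1), ("192.0.2.17/32",))
print(may_view(rule, "192.0.2.17", date(2004, 9, 1)))
print(may_view(rule, "192.0.2.18", date(2004, 9, 1)))
```

A subnet mask can be given in the same string form, for example `"192.0.2.0/255.255.255.0"`, which `ip_network` also accepts.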

Creating resource discovery metadata
- MARC record for each title
- National Library of Australia OPAC
- National Bibliographic Database
- Metadata derived from the catalogue record is embedded in the title entry pages
- Indexing/abstracting services’ citations

Reporting
- Pre-defined reports from PANDAS UI
- Statistical and data reports
- SQL query on Oracle database (not through PANDAS interface)
- ProClarity for user-defined data cube reporting and analysis
- LinkScan for broken publisher URL links

4. Issues and future directions

Current issues
- Commitment to selective, quality assessed, accessible web archiving
- Efficient identification: automated selection
- Legal deposit (when?)
- Blanket permission: government agencies

Current issues
- Ongoing development and enhancement of PANDAS
- Improve robustness of system
- Re-engineer PANDAS software
- Need to achieve greater efficiencies and increase scale of web archiving activity

Future directions
- Automatically ingest and process larger volumes of online publications and associated metadata (batches)
- Comply with international standards and adopt standard tools (IIPC)
- Incorporate other collection methods: domain harvesting, deep web, deposit

Future directions
- Automate collection of more preservation metadata and develop metadata management interface
- Improve access and discovery paths to the Archive’s resources as it continues to grow

More information
- PANDORA home page
- Key documents (background, technical, PIs)
- PANDAS manual
- Papers and presentations

Questions?