1 Australian Newspapers Digitisation Program
Development of the Newspapers Content Management System
Rose Holley – ANDP Manager
ANPlan/ANDP Workshop, 28 November 2008

2 Requirements
- Manage, store and organise millions of digital newspaper pages behind the scenes.
- Manage the entire digitisation workflow from scanning to public delivery.

3 How?
- Current NLA Digital Content Management System cannot cope with the volume of digital newspapers or the complex structure of newspapers
- No ‘off the shelf’ product available that meets requirements
- Need the system now (March 2007)

4 Solution
- NLA team to develop a software solution
- Ensure the system uses open source software
- System to be standalone and not bolted into other systems
- Possibility of sharing system in future/providing as open source to other libraries

5 Software Development
- Agile method of development used
- Modules designed in stages as required
- Stage 1 – Receipt and checking of scanned images
- Stage 2 – Quality Assurance Modules
- Stage 3 – Sending/receiving items from OCR
- Stage 4 – System Administration and Statistics
- Stage 5 – Interface Design and Usability of System

6 Progress
- Software development March 2007 – June 2008
- First module in use May 2007
- CMS in use for 18 months
- CMS in final stages of completion (Jan – June 2009)
- Further development required to enable acceptance of contributors' content
- Simple user interface yet to be designed

8 Australian Newspapers CMS
- Screenshots of the system and an explanation of the workflows follow.

9 Workflow Summary
- Preparing for Digitisation
- Creation of digital images
- Adding metadata and Quality Assurance
- Optical Character Recognition
- Quality Assurance
- Statistics and Admin

10 Preparing for Digitisation
- Identify title to be digitised
- Source master microfilm from owner
- Send master microfilm to scanning contractors
- Add title to Content Management System

11 CMS - Add Title

12 Microfilm converted to digital images

13 Image Reception
- Images received from scanning contractor on LTO2 Tape
- Tapes added to tape robot and extracted
- Reels automatically added to Content Management System
- Reel details are checked
- Images ingested into Content Management System

14 CMS - Check Reel Details

15 CMS - Ingest Reels

16 CMS - Tasks 1 and 2
- Task 1 – Add metadata (dates and page numbers)
- Supervisor reviews marked pages
- Task 2 – Define batches
- Task 2 – Resolve duplicates
- Task 2 – Create missing page targets

17 Identify title to be worked on

18 Identify reel

19 CMS - Adding Metadata
- Date and Page Sequence number added

20 Supervisor Review
- Supervisor reviews pages marked for attention

21 CMS - Define Batches
- Batches defined by date
- Each batch contains 2,000 to 3,000 images
- Batches are automatically assigned a number
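
The batching rules on this slide can be sketched in a few lines of Python. This is only an illustration, not the CMS's actual code: the `Page` class, the `define_batches` function and the `max_images` parameter are hypothetical names, and the rule that an issue date is never split across batches is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby


@dataclass(frozen=True)
class Page:
    issue_date: date  # date added as metadata in Task 1
    sequence: int     # page sequence number within the issue


def define_batches(pages, max_images=3000):
    """Group pages into date-ordered batches, numbering them automatically.

    max_images is an assumed upper bound based on the slide's upper figure.
    Assumption: all pages sharing an issue date stay in the same batch.
    Returns a dict mapping batch number to its list of pages.
    """
    ordered = sorted(pages, key=lambda p: (p.issue_date, p.sequence))
    batches = {}
    current = []
    number = 1
    for _, issue in groupby(ordered, key=lambda p: p.issue_date):
        issue_pages = list(issue)
        # Close the current batch if this issue would push it over the limit.
        if current and len(current) + len(issue_pages) > max_images:
            batches[number] = current
            number += 1
            current = []
        current.extend(issue_pages)
    if current:
        batches[number] = current
    return batches
```

Calling `define_batches(pages)` on a reel's pages would return consecutive, numbered batches in date order; a single issue larger than `max_images` would still form one oversized batch in this sketch.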

22 CMS - Resolve Duplicates
- Duplicate pages compared and the best copy is selected

23 Missing Pages
- Missing page targets are generated
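
As a rough illustration of how duplicate and missing pages might be flagged from the page sequence numbers added in Task 1, consider the sketch below. The function name is hypothetical, and it assumes pages in an issue are numbered consecutively from 1; it only flags candidates, since the slides describe choosing the best copy as a comparison step.

```python
from collections import Counter


def find_duplicates_and_gaps(sequence_numbers):
    """Return (duplicates, missing) for one issue's page sequence numbers.

    Assumption: pages run consecutively from 1 to the highest number seen.
    Real issues (supplements, unnumbered pages) may need extra rules.
    """
    counts = Counter(sequence_numbers)
    # A sequence number seen more than once is a duplicate candidate.
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    highest = max(counts) if counts else 0
    # Any number in 1..highest that never appears needs a missing page target.
    missing = [n for n in range(1, highest + 1) if n not in counts]
    return duplicates, missing
```

For example, sequence numbers `[1, 2, 2, 4, 5]` would flag page 2 as a duplicate to resolve and page 3 as needing a missing page target.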

24 Optical Character Recognition (OCR)
- Complete batches are added to a tape
- Tapes are generated and written
- Tapes sent to OCR contractor
- Contractor completes OCR processes
- OCR data (not images) is returned via FTP

25 CMS - Tapes Created
- Completed batches added to a tape

26 Optical Character Recognition (OCR) of pages and article zoning

27 OCR Data Reception (Automated process)
- OCR contractor advises NLA server that a batch has been completed
- NLA server downloads the batch
- Batch is ingested into Content Management System
- Checks are performed on data validity
- QA Derivatives are generated
- Articles may now be searched, but are not yet publicly accessible

28 CMS - Batch information

29 Quality Assurance (QA)
- A random sample of Issues and Articles is checked
- Volume and Issue number are checked for accuracy
- Sample articles are checked against agreed Quality Acceptance Criteria (QAC)
- Error rates calculated against QAC on the fly
- Supervisor checks final results
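
The sampling and error-rate steps above could look something like the following Python sketch. It is illustrative only: the function names, the criterion names and the `max_error_rate` threshold are hypothetical, because the slides do not give the actual Quality Acceptance Criteria or their limits.

```python
import random


def sample_articles(article_ids, sample_size, seed=None):
    """Draw a random sample of articles to check, without replacement."""
    rng = random.Random(seed)
    ids = list(article_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def error_rate(checked, errors):
    """Fraction of checked items that failed a criterion."""
    if checked == 0:
        return 0.0
    return errors / checked


def assess_batch(results, max_error_rate):
    """Assess a batch from per-criterion QA counts.

    results maps a criterion name to (items_checked, items_in_error).
    max_error_rate is a hypothetical threshold applied to every criterion.
    Returns (accepted, rates); the batch passes only if all rates are within it.
    """
    rates = {name: error_rate(c, e) for name, (c, e) in results.items()}
    accepted = all(r <= max_error_rate for r in rates.values())
    return accepted, rates
```

Because the rates are recomputed from simple counts, they can be refreshed after every checked article, which is one way to read "on the fly". The supervisor's final check would still sit on top of the computed result.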

30 CMS - Selecting the batch

31 Volume & Issue Number Check

32 Article checked against QAC

33 Re-keyed fields checked for accuracy

34 Supervisor checks results (auto or manual accept/reject)

35 QA Results
- Automated email sent to supplier advising the result
- Emails for rejected batches include a summary of errors
- Summary of errors saved for all batches
- Accepted batches are immediately accessible in public search system

36 Batch History and details retained

38 Search or Browse articles within CMS

39 Statistics
- Stats for content received, QA’d and delivered to the public generated by the Content Management System
- (Stats for usage of public search system collected using Google Analytics)

40 CMS - Content Statistics

41 CMS - Work Statistics

42 Access
- Public access to digital newspapers is provided through Australian Newspapers Search and Delivery System
- Users can search or browse newspapers
- Search results can be refined using filters
- Users can browse by Newspaper title or Date.

43 http://ndpbeta.nla.gov.au/ndp/del/home

