1
Getting ALL Your Newspaper Data into CONTENTdm: The New Flex Loader
CONTENTdm Western Users Group
June 3, 2010
2
digitalnewspapers.org
3
UDN Overview
Run by J. Willard Marriott Library
Entirely a “soft money” program
  – Raised $3.5 million in local, state, federal funds
Launched in December 2002
  – 3 titles, 30K pages
  – On CONTENTdm (version 2)
Current holdings
  – 60 titles / 89K issues / 960K pages / 10.6 million articles
    o One-millionth page next month
  – 27 of 29 Utah counties represented
    o Can’t find a long newspaper run from either Wayne or Daggett County
  – Covering 1850 (Deseret News) – 1982 (Vernal Express)
Participant in NEH’s National Digital Newspaper Program
  – Charter member since 2005
4
Constants For 8 Years
We have always had
  – Article-level metadata
    o Headlines, article type, full text
        Headlines and article classification captured manually (overseas)
        Full text generated by OCR software
  – A 3-tiered compound object structure
    o Issue / Page / Article
        In the early days, we also had text in the page items, but we removed it because searches returned “double” hits
  – Images of both full pages and individual articles
    o Very nice for viewing but bad for the database
        Millions of PDF files
  – CONTENTdm
    o UDN is the largest Cdm server
        11.6 million items
5
Article-Level
Question: Is article-level metadata worth the time and expense to create?
  – We (UDN) believe it delivers a higher-quality user experience
    o Pay extra ($0.30/pg.) to have it
  – NDNP (NEH / LC) believes cost outweighs the benefit
Headlines, sub-headings
  – Keyed manually, nearly 100% accurate
    o Double-keyed and reconciled
  – Contain important keywords
  – Search accuracy is more critical as newspaper databases grow
Article types
  – Mastheads, advertisements, news
  – Birth, marriage, death announcements
    o Genealogical info is especially important to our users
        62% visit for genealogy
Full text – generated from OCR
6
Indexing Newspaper Content
In the early days, our newspapers wouldn’t ingest using the Acquisition Station (now the Project Client)
  – Complex metadata
  – 3-tier compound objects
In 2002 DiMeMa (now OCLC) developed specialized software to import our newspaper content
  – The “Indexer”
  – Allowed us to use Cdm platform for rapid expansion of newspapers
    o Eventually purchased 2nd license (unlimited) just for UDN
7
Some History
2002-2004: DiMeMa indexed UDN content with the Indexer
  – Delivered Cdm-ready files to us
    o $0.15 per page
  – We loaded into Version 3
Over time became problematic
  – DiMeMa was a software company, not in the “production” business
  – We wanted to run it ourselves and reduce our costs
2005: DiMeMa gave us the Indexer software
  – One caveat – it was never made “production-ready”
    o Only an internal operation-type software
    o “Rough around the edges”
Continued running the Indexer for newspaper content ever since
  – It’s a workhorse
    o Processed 1 million pages
    o But the old, gray mare…
8
V3 Indexer – Today
Runs on Windows 2000 server
  – Microsoft is dropping support next month
OCLC no longer provides support for it
  – Cannot enhance the functionality
  – Cannot install a 2nd server; only running one instance
Slow and complex
  – Major bottleneck in the process
  – Error messages are difficult to understand
Command line indexing
  – Web indexing times out
V3 indexing fails when collection gets too big
  – Requires entire collection to be re-indexed
Afterwards
  – Correct some metadata and add other metadata
  – Convert to V4
9
Data Formats
Receive data in 2 distinctly different formats
  – In-state projects
    o Cdm V3 format for the Indexer
    o Ingest into Cdm V4
        Very long, complex process
    o Migrating to Cdm V5 will add additional steps
  – NDNP
    o NDNP METS/ALTO format
    o Send batches to LC as required
    o Cannot ingest into UDN
        Indexer cannot process NDNP batches
  – iArchives has to support both formats
10
Recent Developments
NEH’s National Digital Newspaper Program (NDNP)
  – Newspaper programs are launching all over the country
    o More than 22 states now participate at the national level
  – NDNP standards are rapidly becoming “the” standard
  – Does not provide funding for article-level processing
    o Major barrier for some states to implement article-level
  – NDNP spec requires article coordinates, however
    o For highlighting in the viewer
JPEG2000
  – Tiles enable online viewers to make a smaller “clip” from a larger image
    o e.g., a newspaper article can be clipped out of a newspaper page
  – We no longer need separate article images
    o Although we still need article metadata
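As a rough illustration of why tiling makes clipping cheap, the sketch below works out which tiles of a page image overlap an article's bounding box, so that only those tiles need to be decoded. The function name, the 1024-pixel tile size, and the example coordinates are assumptions made for illustration; they are not taken from the NDNP spec or from any particular viewer.

```python
def tiles_for_region(x, y, width, height, tile_size=1024):
    """Return the (column, row) index of every tile an article box overlaps.

    x, y          -- top-left corner of the article on the page, in pixels
    width, height -- size of the article box, in pixels
    tile_size     -- edge length of a square tile (assumed value)
    """
    first_col = x // tile_size
    last_col = (x + width - 1) // tile_size
    first_row = y // tile_size
    last_row = (y + height - 1) // tile_size
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# A hypothetical article box, 1500 x 600 pixels, starting at (900, 2000).
needed = tiles_for_region(900, 2000, 1500, 600)
print(len(needed), "tiles to decode:", needed)
```

A real JPEG2000 decoder handles region extraction internally; the point here is only that the clip touches a handful of tiles rather than the whole page.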
11
Dilemma
How do we:
Get new, fully supported ingestion software for newspapers
Migrate to Cdm V5
Continue “full” article-level metadata
Receive only one file format (NDNP)
  – Move away from the Cdm V3 file deliveries
12
Bridging the Gap
Idea – extend the NDNP xml spec to include article data
  – Create a new, separate xml file for articles
We, OCLC, and iArchives developed spec for article xml files
  – Similar xml formatting as NDNP
  – Included “on the side” with NDNP batch files
  – Deliverables now are
    o Standard NDNP batch
    o Article.xml file for each issue in the batch containing the article metadata
  – iArchives has provided a script that collates each article xml file into its respective issue folder
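The collation step can be pictured with a minimal sketch like the one below. This is not the iArchives script, which the slides do not show. The file naming convention (`<issue folder name>_articles.xml`), the destination name `article.xml`, and the folder names in the demo are all hypothetical, chosen only to make the example runnable.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical naming convention: "<issue folder name>_articles.xml".
SUFFIX = "_articles.xml"


def collate_article_xml(article_dir, batch_dir):
    """Copy each article xml file into the issue folder whose name it carries."""
    article_dir, batch_dir = Path(article_dir), Path(batch_dir)
    # Index every folder in the batch by name (assumes names are unique).
    folders = {p.name: p for p in batch_dir.rglob("*") if p.is_dir()}
    placed, unmatched = [], []
    for xml_file in sorted(article_dir.glob("*" + SUFFIX)):
        issue_id = xml_file.name[: -len(SUFFIX)]
        target = folders.get(issue_id)
        if target is None:
            unmatched.append(xml_file.name)  # report it, do not guess a destination
            continue
        shutil.copy2(xml_file, target / "article.xml")
        placed.append(issue_id)
    return placed, unmatched


def demo():
    """Build a tiny throwaway batch and collate two article files into it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        issue = root / "batch" / "title_a" / "1900010101"
        issue.mkdir(parents=True)
        articles = root / "articles"
        articles.mkdir()
        (articles / "1900010101_articles.xml").write_text("<articles/>")
        (articles / "1900010201_articles.xml").write_text("<articles/>")
        placed, unmatched = collate_article_xml(articles, root / "batch")
        return placed, unmatched, (issue / "article.xml").exists()


if __name__ == "__main__":
    print(demo())
```

Returning the unmatched files rather than silently skipping them makes it easy to spot an issue that was delivered without a matching folder.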
13
Old Article Metadata Now at Page-Level
Full text
  – Per NDNP, searchable text is stored at the page level
    o A part of page metadata
    o Each word and article has its own coordinates
        Can be highlighted by the viewer in the page image
Article images no longer required
  – 12 PDFs per page (on average) replaced by single, higher-res jp2 of full page
  – Reduces the number of image files by 90%
    o Although actual file space is increasing
        12 PDFs ~ 2 MB / 1 jp2 ~ 4 MB
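The trade-off between file count and disk space can be worked through with the slide's own per-page figures. The sketch below assumes that "12 PDFs ~ 2 MB" means about 2 MB for all twelve PDFs of one page, applies the figures to the 960K pages quoted in the overview purely as an illustration, and uses decimal units (1 TB = 1,000,000 MB). The function name is made up for this example.

```python
def compare_storage(pages, pdfs_per_page=12, pdf_mb_per_page=2.0, jp2_mb_per_page=4.0):
    """Compare file counts and disk space for article PDFs versus one jp2 per page."""
    old_files = pages * pdfs_per_page   # one PDF per article image
    new_files = pages                   # one jp2 per page
    return {
        "old_files": old_files,
        "new_files": new_files,
        "file_reduction_pct": round(100 * (1 - new_files / old_files), 1),
        "old_tb": pages * pdf_mb_per_page / 1_000_000,
        "new_tb": pages * jp2_mb_per_page / 1_000_000,
    }


for key, value in compare_storage(960_000).items():
    print(f"{key}: {value}")
```

Under these assumptions the file count falls from about 11.5 million to 960 thousand, in line with the roughly 90% reduction on the slide, while the space needed about doubles.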
14
New Cdm Flex Loader
Combines the benefits of article-level metadata with page-level processing and file structure
Processes standard NDNP batch with article xml files in each issue folder
  – Supports article xml created by either iArchives or CCS
Loads directly into Cdm V5
  – Approve and index like any other collection
Compound object contains
  – Issue / compound object metadata
  – Page images and metadata within each issue
  – Article metadata is stored internally in article xml files
    o In the collection’s “supp” folder
Will have article xml search-ability in a later release
  – Scheduled for later this year
15
New Features
Will be a standard Cdm release with full product support
  – No-cost extension of the software
“Extendable” beyond newspapers
  – Can support any content with similar xml structure
Very small client application
  – Most processing done by connecting to a “web service”
Nice user interface
  – Tabs for entry and mapping of metadata
    o Highly configurable
Can process TIFFs or jp2s
  – jp2s recommended for speed
    o Remember: newspaper processing is voluminous and speed can be very important
16
Features - 2
Loads into Approval Queue
  – Per normal for any collection
Pretty fast (in beta testing)
  – 45 issues in an hour on my (slow) desktop
  – Should be able to speed this up
Can continue to load into existing collection
  – Eliminates need to create many separate collections and merge them together later
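To put the beta-test rate in perspective, here is a back-of-the-envelope estimate. It assumes the 45 issues per hour stays constant and applies it to the 89K issues quoted in the overview purely as an illustration; the slides do not say that the whole collection would be reloaded, and the function name is made up for this example.

```python
def hours_to_load(issues, issues_per_hour=45):
    """Estimate loading time in hours at a constant rate (a simplifying assumption)."""
    return issues / issues_per_hour


hours = hours_to_load(89_000)
print(f"{hours:.1f} hours, or about {hours / 24:.1f} days of continuous loading")
```

Since loading time scales inversely with the rate, doubling the rate on faster hardware would halve the estimate.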
17
(demo of loader)
18
Questions?
John Herbert
Head, Digital Technologies
J. Willard Marriott Library
john.herbert@utah.edu
(801) 585-6019