IFLA Newspapers pre-conference Geneva, Arturs Zogla


Presentation transcript:

Next steps in newspaper digitization: making use of digitized texts at NLL IFLA Newspapers pre-conference Geneva, 13.08.2014. Arturs Zogla Head of Digital Library National Library of Latvia arturs.zogla@lnb.lv

Newspaper digitization projects at NLL. 2000-2006: Image-only PDFs (no OCR). 2007-2009: OCR with article-level segmentation, 40 titles/350 000 pages. 2010-2012: Mass digitization with OCR and segmentation, ~3 million pages (1760-2011).

Heritage-1 http://data.lnb.lv/digitala_biblioteka/laikraksti/

OCR pilot project [1] Content: 40 titles/350 000 pages; periodicals published 1920-1940; Antiqua script/out of copyright. Olive Software: OCR (without manual corrections) and article-level segmentation. Website with search, browse and advanced search.

OCR pilot project [2] Website no longer exists

Mass-digitization. Funded by European Regional Development Fund. Time frame: 2010-2012. Scope of the project: Periodicals: ~3 million pages (~1000 titles); Books: ~1.4 million pages (~7000 books). Portal of searchable historical texts. User interactivity.

Digitization process. National Library of Latvia: Selection of materials, Scanning, Segmenting, Publishing on Web, with two QA steps. National Library of Australia, National Library of Netherlands: Selection of materials, Scanning & Segmenting, Publishing on Web, with one QA step.

Scanning Outsourced to 2 shipments per month: ~46 000 pages

Scanning output. JPEG 2000 file for each page. Books, magazines: RGB. Newspapers: Greyscale. Resolution: 400 dpi. File size: 3-100 MB.

Segmentation. Outsourced to LETA (CCS docWorks). Zoning: Articles/titles, Images/captions, Authors, Tables, Advertisements. OCR: ABBYY FineReader engine; manual correction of titles and image captions; for long titles/captions only Named Entities corrected.

Segmentation facts. Output: 1 METS file per issue; 1 PDF file per issue; 1 ALTO file per page; 1 JPEG file per page; 1 OCR file per article. Required work force: up to 60 operators, three shifts.
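The per-page ALTO files listed above hold the recognized words as `String` elements with a `CONTENT` attribute, grouped into `TextLine` elements. The sketch below shows one way to pull plain text lines out of such a file with the Python standard library. The sample XML is invented for illustration and is far smaller than a real docWorks export; the helper names `local` and `alto_lines` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, minimal ALTO fragment; real files carry coordinates, styles, etc.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine><String CONTENT="Latvijas"/><SP/><String CONTENT="Avīze"/></TextLine>
      <TextLine><String CONTENT="Rīga"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local(tag: str) -> str:
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> list[str]:
    """Return one string per TextLine, joining its String/@CONTENT values."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local(element.tag) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines
```

Matching on the local tag name rather than a fixed namespace is a deliberate choice here, since the ALTO namespace URI differs between schema versions.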

OCR quality [1] Antiqua script: most cases 98-100%; 108% case where OCR engine corrected a typo in original text. Gothic script: ~80-90% characters recognized correctly.

OCR quality [2] OCR vs. original: Correct characters: 685/739 (~92.7%). Correct words: 49/90 (only ~54.4%).
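The gap between the two figures above follows from how the metrics are counted: a single wrong character spoils a whole word. The sketch below reproduces the slide's arithmetic and shows one possible way to measure both accuracies against a ground-truth text using `difflib` alignment. The slides do not say how NLL computed its figures, so the alignment method and the function names are assumptions.

```python
from difflib import SequenceMatcher


def percent(correct: int, total: int) -> float:
    """Share of correct items, in percent, rounded to one decimal."""
    return round(100.0 * correct / total, 1)


def char_accuracy(ocr: str, truth: str) -> float:
    """Percent of ground-truth characters matched by the OCR text."""
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(truth)


def word_accuracy(ocr: str, truth: str) -> float:
    """Percent of ground-truth words reproduced exactly by the OCR text."""
    truth_words = truth.split()
    matcher = SequenceMatcher(None, truth_words, ocr.split(), autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(truth_words)


# Figures from the slide.
slide_chars = percent(685, 739)   # 92.7
slide_words = percent(49, 90)     # 54.4

# Toy example: two wrong characters ruin two of three words.
toy_chars = char_accuracy("tbe old bouse", "the old house")
toy_words = word_accuracy("tbe old bouse", "the old house")
```

In the toy example most characters are right, yet only one word in three survives intact, which is the same effect the slide reports for Gothic script.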

Portal www.periodika.lv

Portal Viewer based on a tool developed at National Library of Luxembourg

What next?

Interactivity: OCR corrections; Comments; Favourite articles; Sharing on Twitter, Facebook, etc.; Private collections.

OCR corrections [1]

OCR corrections [2] TOP 20 OCR error correctors

OCR corrections [3] Reasons for low interest in OCR correction: Relatively high levels of OCR quality. Lack of publicity of OCR correction features. Requires registration.

Use of digitized newspapers by third parties

“Baricadopedia” [1] Project that collects materials on Latvia’s struggle for independence in the 1980s. NLL: provided newspapers. “Baricadopedia” team: manually corrected all OCR errors; tagged each article.

“Baricadopedia” [2] http://www.barikadopedija.lv/

Twitter. The correct spelling of “pineapple” in Latvian. Is it: ANANASS or ANANĀSS? 137 mentions in periodika.lv vs. 32 mentions in periodika.lv.

Old word “modernization” service

A problem. Gothic texts: contain a lot of OCR errors; are written in obsolete orthography; contain obsolete words. Low hopes for crowd-sourcing.

Solution. A faulty OCRed word (“becllaws”) goes to a Web service that performs: correction of OCR errors (“bedlams”); “modernization” of orthography (“bedlam”); finding contemporary synonyms (“asylum”). Result: a contemporary word with no errors.

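Transliteration rules
Rule          Type
m → w         OCR error / optional
f → s         OCR error / optional
w → v         Orthography / mandatory
ah → ā        Orthography / mandatory
tsch → č      Orthography / mandatory
ee → ē        Orthography / context-sensitive
ee → ie       Orthography / context-sensitive
ee → ee       Orthography / context-sensitive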
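A minimal sketch of how rules like these could generate candidate spellings and filter them against a dictionary. This is not NLL's web service: the grouping of rules into optional, mandatory and context-sensitive is one reading of the slide, the order of application (OCR fixes first, then orthography) is an assumption, and the tiny `DICTIONARY` is a stand-in for a real word list.

```python
from itertools import product

# Rule groups as read from the slide (grouping is an assumption).
OPTIONAL_OCR = [("m", "w"), ("f", "s")]                 # may or may not apply
MANDATORY = [("tsch", "č"), ("ah", "ā"), ("w", "v")]    # always apply
CONTEXT = {"ee": ["ē", "ie", "ee"]}                     # try every option

# Hypothetical stand-in for a contemporary dictionary.
DICTIONARY = {"vārds", "miers"}


def expand(word: str, pattern: str, options: list[str]) -> set[str]:
    """Replace each occurrence of pattern independently with each option."""
    parts = word.split(pattern)
    results = set()
    for choice in product(options, repeat=len(parts) - 1):
        out = parts[0]
        for replacement, part in zip(choice, parts[1:]):
            out += replacement + part
        results.add(out)
    return results


def variants(word: str) -> set[str]:
    """Generate all candidate modern spellings for an old or misread word."""
    candidates = {word}
    for src, dst in OPTIONAL_OCR:
        candidates = {v for c in candidates for v in expand(c, src, [src, dst])}
    fixed = set()
    for candidate in candidates:
        for src, dst in MANDATORY:
            candidate = candidate.replace(src, dst)
        fixed.add(candidate)
    for src, options in CONTEXT.items():
        fixed = {v for c in fixed for v in expand(c, src, options)}
    return fixed


def modernize(word: str, dictionary: set[str] = DICTIONARY) -> list[str]:
    """Keep only the variants that exist in the dictionary."""
    return sorted(v for v in variants(word) if v in dictionary)
```

With these rules `modernize("wahrds")` keeps only "vārds", while "meers" expands to six candidates of which the dictionary lookup keeps "miers". The dictionary filter is what holds the number of surviving variants down.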

Results. Generated word variants are checked against a dictionary (word is found in a dictionary). On average: 2.89 word variants for an input word. 92.45% probability that the correct one is among them.

Three computer linguistics experiments

Three experiments: Newspaper text corpus; Named Entity Recognition; Time-sensitive dictionaries.

Newspaper text corpus Advanced indexing of newspaper text Advanced full-text search query language 4.5 billion tokens = largest public text corpus for Latvian language

Named Entity Recognition [1] 7 types of tags: Person, Location, Organization, Facility, Event, Product, Time. 21 subtypes.

Named Entity Recognition [2] Stanford CRF classifier. Ground truth: manually tagged sample documents with ~150 000 words. Thesauri used: Persons, Places, Institutions (e.g. UN = United Nations).

Named Entity Recognition [3] How to tag? “President of Republic of Latvia Guntis Ulmanis met with Minister of Culture today.” “President of Republic of Latvia Guntis Ulmanis met with Minister of Culture <of Republic of Latvia> today.”

Named Entity Recognition [4] Result: 26 000 entities that were found at least 10 times. >80% for good quality OCR texts.
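The "found at least 10 times" cut-off above amounts to a frequency filter over all tagged mentions. A minimal sketch, assuming the tagger's output is available as a flat list of entity strings (the function name and sample data are hypothetical):

```python
from collections import Counter


def frequent_entities(mentions: list[str], min_count: int = 10) -> dict[str, int]:
    """Keep entities whose number of mentions reaches min_count."""
    counts = Counter(mentions)
    return {entity: n for entity, n in counts.items() if n >= min_count}


# Invented sample: only the first two entities reach the threshold.
sample_mentions = ["Rīga"] * 12 + ["Guntis Ulmanis"] * 10 + ["UN"] * 3
kept = frequent_entities(sample_mentions)
```

A threshold like this trades coverage for reliability, since rare strings are more likely to be OCR noise than real entities.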

Time-sensitive dictionaries [1] «President of USA»: J. F. Kennedy in 1959 article, Barack Obama in 2014 article. «Bombay» in the 1970s = «Mumbai» today. 1 dollar in 1933 ≠ 1 dollar in 2014. Street names in a city change over time.

Time-sensitive dictionaries [2] What address from a 1930s advertisement is this today?

Time-sensitive dictionaries [3] Web service that: converts old NE to current versions; converts modern NE to a version for a particular year; interprets keywords (like «president of USA») for a particular year.
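The three conversions above can be sketched as lookups in a table of names with validity periods. This is an illustrative data structure only, not the NLL web service: the `Period` class, the concept keys and the few sample rows (with year-level granularity) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Period:
    concept: str               # stable key for the thing being named
    name: str                  # name used during this period
    first_year: Optional[int]  # None = no known start
    last_year: Optional[int]   # None = still in use


# Illustrative sample rows, not a complete dictionary.
PERIODS = [
    Period("president of usa", "J. F. Kennedy", 1961, 1963),
    Period("president of usa", "Barack Obama", 2009, 2017),
    Period("city:mumbai", "Bombay", None, 1995),
    Period("city:mumbai", "Mumbai", 1995, None),
]


def covers(period: Period, year: int) -> bool:
    """True if the year falls inside the period (bounds inclusive)."""
    after_start = period.first_year is None or period.first_year <= year
    before_end = period.last_year is None or year <= period.last_year
    return after_start and before_end


def name_in_year(concept: str, year: int) -> Optional[str]:
    """Interpret a keyword or concept for a particular year."""
    for period in PERIODS:
        if period.concept == concept and covers(period, year):
            return period.name
    return None


def current_name(old_name: str) -> Optional[str]:
    """Convert an old named entity to its current version, if one exists."""
    concepts = {p.concept for p in PERIODS if p.name == old_name}
    for period in PERIODS:
        if period.concept in concepts and period.last_year is None:
            return period.name
    return None
```

With the sample rows, `name_in_year("president of usa", 2014)` gives "Barack Obama" and `current_name("Bombay")` gives "Mumbai". Changeover years belong to two periods at once, so a real service would need dates finer than a year.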

Thank you! Arturs Zogla arturs.zogla@lnb.lv