Beyond the Early Modern OCR Project or, What are We Going to do With Ourselves Now? Matthew Christy, Laura Mandell, Elizabeth Grumbach.


eMOP Info

eMOP Website
• emop.tamu.edu/
• TCDL 2015 - Beyond eMOP: esentations/#TCDL15
• eMOP Workflows: emop.tamu.edu/workflows
• Mellon Grant Proposal: idhmc.tamu.edu/projects/Mellon/eMOPPublic.pdf

More eMOP
• Facebook: The Early Modern OCR Project
• Twitter: #emop

TCDL15 - Beyond eMOP, April 27, 2015

Goals

• The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project, run out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques for applying Optical Character Recognition (OCR) to early modern English documents from the hand press period.
• eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is insufficient for scholarly research.

The Numbers

Page Images
• Early English Books Online (EEBO, ProQuest): ~125,000 documents, ~13 million page images
• Eighteenth Century Collections Online (ECCO, Gale Cengage): ~182,000 documents, ~32 million page images
• Total: >300,000 documents & 45 million page images

Ground Truth
• Text Creation Partnership (TCP): ~46,000 double-keyed, hand-transcribed documents
  • 44,000 EEBO
  • 2,200 ECCO


Our Partners

• PRImA (Pattern Recognition & Image Analysis Research) Lab at the University of Salford, Manchester, UK
• SEASR (Software Environment for the Advancement of Scholarly Research) at the University of Illinois, Urbana-Champaign
• PSI (Perception, Sensing, and Instrumentation) Lab at Texas A&M University
• The Academy for Advanced Telecommunications and Learning Technologies at Texas A&M University
• The Brazos High Performance Computing Cluster (HPCC)

The Problems

Early Modern Printing
• Individual, hand-made typefaces
• Worn and broken type
• Poor quality equipment/paper
• Inconsistent line bases
• Unusual page layouts, decorative page elements
• Special characters & ligatures
• Spelling variations
• Mixed typefaces and languages
• Over/under-inking, bleedthrough
• Old, low-quality, small TIFF files
• Noise, skew, warp

Page Images (image-only slide)

Results

Using Google’s Tesseract open-source OCR engine with eMOP-created training.

ECCO
• Avg. scores
  • 309,328 pages
  • 86% correct
  • 57% correct on 1st pages
• Comparison to Prime Recognition’s OCR
  • 93% claim
  • (still running analysis)

EEBO
• Avg. scores
  • 1,475,026 pages
  • 68% correct
• No previous OCR to compare to

eMOP Outcomes - Github

• Code Repo
• Tesseract Training
• ImprintDB
• TCP EEBO Phase 1
• EEBO OCR
• Collection Evaluation
• Outreach

Outcomes - Github: Franken+
early-modern-ocr.github.io/FrankenPlus/

1. Windows-based tool that uses a MySQL DB.
2. Developed for eMOP by IDHMC graduate student worker Bryan Tarpley.
3. Designed to be easily used by eMOP undergraduate student workers.
4. Takes Aletheia's output files as input.
5. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.

Outcomes - Github: emop-dashboard
early-modern-ocr.github.io/emop-dashboard/

Outcomes - Github: hOCR deNoising
early-modern-ocr.github.io/hOCR-De-Noising/

Outcomes - Github: hOCR deNoising (before and after images)

Outcomes - Github: page evaluator
early-modern-ocr.github.io/page-evaluator/

• Determines how correctable a page’s OCR results are by examining the text.
• The score is based on the ratio of words that fit the correctable profile to the total number of words.

Correctable Profile
1. Clean tokens:
   • remove leading and trailing punctuation
   • remaining token must have at least 3 letters
2. Spell check tokens of more than 1 character
3. Check token profile:
   • contains at most 2 non-alpha characters, and
   • at least 1 alpha character,
   • has a length of at least 3,
   • and does not contain 4 or more repeated characters in a run
4. Also consider the length of tokens compared to the average for the page
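The correctable profile can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the page-evaluator's own code: the function names are hypothetical, the spell check (step 2) and the page-average length comparison (step 4) are left out, and a token is counted as correctable when it passes steps 1 and 3.

```python
import re
import string


def clean_token(token):
    """Step 1: strip leading and trailing punctuation."""
    return token.strip(string.punctuation)


def fits_profile(token):
    """Step 3: at most 2 non-alpha chars, at least 1 alpha char,
    length of at least 3, and no run of 4 or more repeated characters."""
    alpha = sum(ch.isalpha() for ch in token)
    non_alpha = len(token) - alpha
    has_long_run = re.search(r"(.)\1{3,}", token) is not None
    return non_alpha <= 2 and alpha >= 1 and len(token) >= 3 and not has_long_run


def correctability_score(text):
    """Ratio of tokens that fit the profile to all tokens on the page.
    Assumption: whitespace tokenization; spell check is not modeled."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = 0
    for raw in tokens:
        token = clean_token(raw)
        letters = sum(ch.isalpha() for ch in token)
        if letters >= 3 and fits_profile(token):
            good += 1
    return good / len(tokens)
```

For the sample line used on the page corrector slides, `correctability_score("tbat I thoughc Ihe Was")` counts four of the five tokens (the single-letter token fails the length rule).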

Outcomes - Github: page corrector
early-modern-ocr.github.io/page-corrector/

1. Preliminary cleanup
   • remove punctuation from the beginning/end of tokens & empty lines
   • combine hyphenated tokens at end of lines
   • retain cleaned & original tokens as “suggestions”
2. Apply common transformations and period-specific dictionary lookups to gather suggestions for words.
   • transformation rules: rn->m; c->e; 1->l; e
3. Use context checking on a sliding window of 3 words, and their suggested changes, to find the best context matches in our (sanitized, period-specific) Google 3-gram dataset

Example OCR input: tbat I thoughc Ihe Was
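Step 2 of the page corrector (gathering suggestions from transformation rules) might look roughly like the sketch below. The three rules are the ones legible on the slide; the function names, the one-rule-at-one-position strategy, and the small word set standing in for the period-specific dictionaries are assumptions made for illustration.

```python
# Rules taken from the slide: rn->m, c->e, 1->l
RULES = [("rn", "m"), ("c", "e"), ("1", "l")]


def suggestions(token, rules=RULES):
    """Return the token plus every variant made by applying
    one rule at one position (hypothetical strategy)."""
    found = {token}
    for old, new in rules:
        start = token.find(old)
        while start != -1:
            found.add(token[:start] + new + token[start + len(old):])
            start = token.find(old, start + 1)
    return found


def dictionary_suggestions(token, words):
    """Keep only the variants found in a dictionary.
    `words` is a stand-in for the period-specific dictionaries."""
    return {s for s in suggestions(token) if s.lower() in words}
```

For example, `suggestions("1ike")` yields the original token and `like`, and `dictionary_suggestions("rnodern", {"modern"})` keeps only `modern`.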

Outcomes - Github: page corrector (worked example)

Input: tbat I thoughc Ihe Was

window: tbat l thoughc
Candidates used for context matching:
• tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial, abat, tbat, teat)
• l -> Set(l)
• thoughc -> Set(thoughc, thought, though)
ContextMatch: that l thought (matchCount: 1844, volCount: 1474)

window: l thoughc Ihe
Candidates used for context matching:
• l -> Set(l)
• thoughc -> Set(thoughc, thought, though)
• Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)
ContextMatch: l though the (matchCount: 497, volCount: 486)
ContextMatch: l thought she (matchCount: 1538, volCount: 997)
ContextMatch: l thought the (matchCount: 2496, volCount: 1905)

window: thoughc Ihe Was
Candidates used for context matching:
• thoughc -> Set(thoughc, thought, though)
• Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)
• Was -> Set(Was)
ContextMatch: though ice was (matchCount: 121, volCount: 120)
ContextMatch: though ike was (matchCount: 65, volCount: 59)
ContextMatch: though she was (matchCount: 556,763, volCount: 364,965)
ContextMatch: though the was (matchCount: 197, volCount: 196)
ContextMatch: thought ice was (matchCount: 45, volCount: 45)
ContextMatch: thought ike was (matchCount: 112, volCount: 108)
ContextMatch: thought she was (matchCount: 549,531, volCount: 325,822)
ContextMatch: thought the was (matchCount: 91, volCount: 91)

Result: that I thought she was
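The per-window lookup in this example can be approximated as follows. The counts are copied from the slide; the dictionary layout, the function name, and the choice of the highest matchCount as "best" are assumptions. The slide does not say how evidence from overlapping windows is reconciled, and a per-window maximum alone would favor "though she was" in the last window, so the real tool evidently does more than this sketch.

```python
from itertools import product

# A few 3-gram match counts copied from the slide (lower-cased keys).
TRIGRAM_COUNTS = {
    ("that", "l", "thought"): 1844,
    ("l", "though", "the"): 497,
    ("l", "thought", "she"): 1538,
    ("l", "thought", "the"): 2496,
    ("though", "she", "was"): 556763,
    ("thought", "she", "was"): 549531,
    ("thought", "the", "was"): 91,
}


def best_context_match(candidate_sets, counts=TRIGRAM_COUNTS):
    """Try every combination of candidates for a 3-word window and
    return the one with the highest match count (hypothetical rule)."""
    best, best_count = None, 0
    for combo in product(*candidate_sets):
        key = tuple(word.lower() for word in combo)
        count = counts.get(key, 0)
        if count > best_count:
            best, best_count = key, count
    return best, best_count
```

Calling it on the first window, `best_context_match([{"tbat", "that", "tat"}, {"l"}, {"thoughc", "thought", "though"}])`, selects `("that", "l", "thought")` with a count of 1844, matching the ContextMatch line on the slide.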

Outcomes - Github: juxta-CL
early-modern-ocr.github.io/Juxta-cl/

• Juxta-CL (command line)
  • created for eMOP
  • based on the JuxtaCommons tool (juxtacommons.org/)
• several different comparison algorithms to choose from, and other options
  • Levenshtein / Jaro-Winkler / Juxta
  • ignore: punctuation, caps, hyphens
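To make the comparison options concrete, here is a small Python sketch of a Levenshtein comparison with "ignore" switches. It is not Juxta-CL's interface or implementation: the function names and the similarity formula (1 minus distance over the longer length) are assumptions, and hyphens are simply treated as punctuation here rather than as a separate option.

```python
import string


def normalize(text, ignore_punctuation=True, ignore_caps=True):
    """Apply the 'ignore' options before comparing (hyphens are
    covered by the punctuation switch in this sketch)."""
    if ignore_caps:
        text = text.lower()
    if ignore_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(ocr, truth, **options):
    """Similarity in [0, 1] between OCR output and ground truth."""
    a, b = normalize(ocr, **options), normalize(truth, **options)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```

With the defaults, `similarity("That, I thought!", "that I thought")` treats the two strings as identical, while `levenshtein("tbat", "that")` reports a single substitution.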

Outcomes - Github: emop-controller

• Powered by the eMOP DB
• Collection processing is managed via the online Dashboard: emop-dashboard.tamu.edu
• Run by emop-controller.py

Outcomes - Tesseract Training
early-modern-ocr.github.io/TesseractTraining/

• Font Training
  • 3 different types: Roman, Italic, Blackletter
  • 12 different printers
  • 40 different typeface combinations
  • Even more combined-typeface training files
• Dictionaries
  • from multiple sources, with alternate spellings
• More
  • Franken+ training files
  • common OCR error file

Outcomes - DB of EM Printers

• Parsing the imprint lines of all EEBO & ECCO docs to create the ImprintDB
• Gathering:
  • Printed By
  • Printed For
  • Sold by
  • Location (“at the Rose and Crown in St. Paul's Church-Yard”, “at the signe of the Traytors head”)
  • Place (London, Cambridge…)
  • (Dates)
• Those docs with ESTC numbers will be available via a public database
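As a rough idea of what parsing an imprint line into these fields involves, the sketch below splits a line on marker phrases. It is not the ImprintDB parser: the marker list, field names, and function name are hypothetical, the personal names in the sample line are invented, and real imprint lines vary far more than this handles.

```python
import re

# Marker phrase -> field name (hypothetical mapping based on the slide's list).
MARKERS = {
    "printed by": "printed_by",
    "printed for": "printed_for",
    "sold by": "sold_by",
    "at the": "location",
}
MARKER_RE = re.compile("|".join(re.escape(m) for m in MARKERS), re.IGNORECASE)


def parse_imprint(line):
    """Split an imprint line into place, people, location, and date."""
    record = {}
    # Assumption: the date is a four-digit year between 1400 and 1999.
    date = re.search(r"\b(1[4-9]\d{2})\b", line)
    if date:
        record["date"] = date.group(1)
        line = line[:date.start()]
    hits = list(MARKER_RE.finditer(line))
    if hits:
        # Text before the first marker is taken to be the place.
        place = line[:hits[0].start()].strip(" ,.:;")
        if place:
            record["place"] = place
    for hit, following in zip(hits, hits[1:] + [None]):
        end = following.start() if following else len(line)
        field = MARKERS[hit.group(0).lower()]
        value = line[hit.end():end].strip(" ,.:;")
        if field == "location":
            value = "at the " + value
        record[field] = value
    return record
```

Given the invented line "London, printed by John Smith, printed for Thomas Jones, sold by Ann Brown at the Rose and Crown in St. Paul's Church-Yard, 1685.", the function returns a place of `London`, a date of `1685`, the three names, and the location phrase quoted on the slide.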

Outcomes - Fulltext Search of TCP Phase 1
18thconnect.org/search?a=EEBO&o=fulltext

• Phase 1 of TCP EEBO hand transcriptions will be available for fulltext searching in 18thConnect
  • over 25,000 documents

Outcomes - Editable EEBO OCR

• Can already use TypeWright in 18thConnect to edit ECCO docs (without a subscription)
• Will soon be able to edit EEBO docs in TypeWright
  • TCP hand transcriptions
  • eMOP OCR transcriptions
• When a doc is fully corrected, users can request text and/or XML versions of corrected transcriptions for scholarly use.
• First time EEBO transcriptions will be available for over 80,000 docs.

Outcomes - Collection Evaluation

• We’ve collected over 6 TB of data
• Our post-processing algorithms collect data:
  • amount of skew
  • amount of noise
  • number and location of text columns
  • page quality
  • correctability
  • corrections made
• Saved to the eMOP DB
• First time this has been done on these collections
• Pre-process and re-OCR pages
  • several iterations will identify bad page images

Outcomes - Outreach

• 2013
  • ASECS (American Society for Eighteenth-Century Studies)
  • KIAS (Kule Institute for Advanced Studies): Around the World Symposium
  • DocEng (ACM Symposium on Document Engineering)
• 2014
  • TxDHC
  • TCDL
  • DH (5 papers & a poster)
  • SAA (pre-conference workshop on OCR)
  • DHCS
• 2015
  • AAAI (Association for the Advancement of Artificial Intelligence)
  • TAMU Big Data Workshop
  • Downstream from the Digital Humanities
  • Penn St. OCR Workshop
  • IDHMC CE Classes
  • TxDHC
  • TCDL
  • DHSI (July)

The Future of eMOP

• We want eMOP to be viable long-term
• Continue to be developed/supported
• open-source repos
• outreach
• Looking for partners

(Slide image from Kill Bill 2, 2004, Miramax)

Future - Partners

• Hathi Trust
  • Talking about possible “next-step” grant application
• Notre Dame Libraries
  • Helping to recreate eMOP workflow on their systems to OCR Cobbett's Complete Collection of State Trials
  • Swapping labor
  • Opportunity to test the robustness of our workflow and tools as a whole unit
• Penn St. Libraries
  • Held a workshop on OCR’ing with open source tools
  • Opportunity to test the robustness of our workflow and tools as discrete parts

Future - Partners

• UT Austin
  • Sub-awardee on an NEH grant to OCR the Primeros Libros collection (primeroslibros.org/)
  • $$ for continued development of workflow and improved hardware
  • Opportunity to add another OCR engine and further test the robustness of the workflow on other document sets
  • step towards a voting algorithm
• Adam Matthew Digital
  • Going to OCR some of their EM collection to see how we do; maybe more if good results
  • Opportunity to test robustness on a similar collection; establish a relationship with another industry leader
  • possibly acquire more data to ingest into ARC nodes and TypeWright

Future - Partners

• Austin Fanzine Project
  • Small, local project; personally interesting
  • Opportunity to test tools and workflow on a different collection and a whole other type of documents (not that dissimilar)
• Hathi Trust Research Center
  • Talking about the possibility of using our algorithms on their imprint data to expand and share Imprint DB
  • Opportunity to acquire more EM publishing data and make it available in one place

Call to Libraries

• Use the Code
  • Give us feedback
• Develop the Code
  • Create branches in Git
  • Commit improvements/changes
• Contact Us
  • Consultation
  • Partnership
    • for Labor or Funds

The end

For eMOP questions please contact us at: