Presentation transcript:

Objective
Enhance the document production workflow at US Government Printing Office (GPO)
- Extract images from PDF
- OCR the extracted images/PDF
- Produce HTML files with content extracted from PDF

Current GPO Workflow

Enhanced GPO Workflow

System Diagram

Input and Output
Input
- PDF file (text and images)
- HTML file (text only)
Output
- HTML file
  - Original text
  - Images extracted from PDF
  - Text OCRed from the images
- Image Over Text PDF (IOT)

Extractor
- Extract images from PDF
- Extract the text before and after the images

Image Extraction
- CCITTFaxDecode filter: extract to TIFF images directly
- DCTDecode filter: extract to JPG images directly
- Other filters: decode first, then re-encode to PNG images
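
The filter-based choice above can be sketched as a small lookup. This is only an illustration of the decision, not the presenters' code: the names `DIRECT_FORMATS` and `extraction_plan` are hypothetical, and the sketch does not read or write any image data. In practice a CCITTFaxDecode stream would typically still need a TIFF header written around it.

```python
# Hypothetical mapping from a PDF stream filter name to the image format
# that can be written without decoding the pixel data first.
DIRECT_FORMATS = {
    "ccittfaxdecode": "tiff",
    "dctdecode": "jpg",
}


def extraction_plan(filter_name):
    """Return (file_extension, needs_reencode) for an image stream filter.

    The name is compared case-insensitively and a leading '/' (PDF name
    syntax) is ignored. Any filter not listed above is assumed to require
    decoding followed by re-encoding to PNG.
    """
    key = filter_name.lstrip("/").lower()
    if key in DIRECT_FORMATS:
        return DIRECT_FORMATS[key], False
    return "png", True
```

For example, `extraction_plan("/DCTDecode")` selects direct JPG output, while `extraction_plan("/FlateDecode")` falls through to the decode-then-PNG path.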

Recognizer
- OCR the text in images and store it in text files (img2txt)
- OCR the images in a PDF and produce an Image Over Text PDF (pdf2iot)

OCR Product Selection
- 40 OCR products on the market
- 4 finalists selected after extensive accuracy testing
- Winner: OmniPage SDK
  - Among the top two in terms of accuracy
  - Best in terms of preserving original layout
  - Capable of producing IOT
  - Supports Linux platform

Problem of IOT

Improve IOT Quality
OCR only the pages that contain images
- Split the pages of the PDF into image pages and text pages
- OCR the image pages
- Combine the text pages and the IOT image pages
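
A minimal sketch of the split, OCR, and combine steps above, under stated assumptions: pages are represented by plain Python values, `has_image` is a per-page flag supplied by the caller, and `ocr_page` is a stand-in callable for whatever OCR engine produces the IOT page. None of these names come from the presentation.

```python
def split_pages(has_image):
    """Partition page indices by whether the page contains an image."""
    image_pages = [i for i, flag in enumerate(has_image) if flag]
    text_pages = [i for i, flag in enumerate(has_image) if not flag]
    return image_pages, text_pages


def rebuild(pages, has_image, ocr_page):
    """Return the pages in their original order, OCRing only image pages.

    Text pages are passed through untouched, so their original text layer
    is preserved; only image pages are replaced by the OCR result.
    """
    image_pages, _ = split_pages(has_image)
    to_ocr = set(image_pages)
    return [ocr_page(page) if i in to_ocr else page
            for i, page in enumerate(pages)]
```

Keeping the original page order in `rebuild` is what lets the text pages and the IOT image pages be recombined into a single document.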

Inserter
Insert the extracted images and the OCRed text into an HTML file
- Insert by Marker
- Insert by Text

Insert by Marker

Insert by Text
Locate the insertion point by text matching
- Text extracted from the PDF file
- Text contained in the HTML file
Text in HTML: UNITED STATES Government Printing Office
Text extracted from PDF: U N I T E D S T A T E S Government Print- ing Office

Text Matching
- Tokenize the text extracted from the PDF into words and store them in WordSet1
- Remove invalid words
- Tokenize the HTML file line by line
- Store the words in WordSet2
- At each line, check the percentage of WordSet1 contained in WordSet2
- Insert the image if the percentage is greater than a threshold
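
The matching steps above might look roughly like the following. Several details are assumptions, since the slides do not specify them: WordSet2 is taken to accumulate across the HTML lines read so far, a word is treated as invalid when it is a single character, HTML tags are stripped with a crude regular expression, and the default threshold of 0.8 is arbitrary. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_valid(word):
    # Assumed validity rule: drop single characters, such as the pieces
    # of a letter-spaced heading ("U N I T E D").
    return len(word) > 1


def find_insertion_line(pdf_text, html_lines, threshold=0.8):
    """Return the index of the first HTML line at which the share of
    WordSet1 found in WordSet2 exceeds the threshold, or None."""
    word_set1 = {w for w in tokenize(pdf_text) if is_valid(w)}
    if not word_set1:
        return None
    word_set2 = set()
    for index, line in enumerate(html_lines):
        # Strip tags crudely so markup names are not counted as words.
        word_set2.update(tokenize(re.sub(r"<[^>]*>", " ", line)))
        coverage = len(word_set1 & word_set2) / len(word_set1)
        if coverage > threshold:
            return index
    return None
```

A percentage threshold rather than an exact match tolerates the extraction noise shown on the previous slide, where letter spacing and end-of-line hyphenation keep some PDF words from ever appearing in the HTML.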

Insert OCRed Text

Other Functionality
- Configurability
  - Each component can be enabled/disabled
  - Image markers can be added using a text file
  - Configuration parameters can be changed during run time
- SOAP Web Service interface

Current Status
- System is fully functional
- Extractor is completed and tested
- Recognizer is completed and tested
- Inserter is undergoing optimization
- Web service interface implementation is in progress