Objective
- Enhance the document production workflow at the US Government Printing Office (GPO)
- Extract images from PDF
- OCR the extracted images/PDF
- Produce HTML files with content extracted from PDF

Current GPO Workflow

Enhanced GPO Workflow

System Diagram

Input and Output
- Input
  - PDF file (text and images)
  - HTML file (text only)
- Output
  - HTML file
    - Original text
    - Images extracted from PDF
    - Text OCRed from the images
  - Image Over Text PDF (IOT)

Extractor
- Extract images from PDF
- Extract the text before and after each image

Image Extraction
- CCITTFaxDecode filter
  - Extract to TIFF images directly
- DCTDecode filter
  - Extract to JPG images directly
- Other filters
  - Decode first, then re-encode to PNG images
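The filter-based choice on this slide can be sketched as a small dispatch function. This is only an illustration of the decision, not the project's code: the names `DIRECT_FORMATS` and `plan_extraction` are hypothetical, and actually reading the image streams (including any TIFF header needed around raw CCITT data) is left out.

```python
from typing import Tuple

# Filters whose image data can be written out without re-encoding.
# The filter names are the ones used in PDF files; the mapping itself
# is an assumption based on the slide.
DIRECT_FORMATS = {
    "CCITTFaxDecode": "tiff",
    "DCTDecode": "jpg",
}


def plan_extraction(filter_name: str) -> Tuple[str, bool]:
    """Return (output_extension, needs_reencode) for an image stream filter."""
    name = filter_name.lstrip("/")  # PDF names are often written as /DCTDecode
    if name in DIRECT_FORMATS:
        return DIRECT_FORMATS[name], False
    # Any other filter: decode the pixels first, then re-encode as PNG.
    return "png", True
```

For example, `plan_extraction("/DCTDecode")` selects a direct JPG extraction, while `plan_extraction("FlateDecode")` falls through to the decode-and-re-encode path.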

Recognizer
- OCR the text in images and store it in text files (img2txt)
- OCR the images in a PDF and produce an Image Over Text PDF (pdf2iot)

OCR Product Selection
- 40 OCR products on the market
- 4 finalists selected after extensive accuracy testing
- Winner: OmniPage SDK
  - Among the top two in terms of accuracy
  - Best in terms of preserving original layout
  - Capable of producing IOT
  - Supports Linux platform

Problem of IOT

Improve IOT Quality
- OCR only the pages that contain images
  - Split the pages of the PDF into image pages and text pages
  - OCR the image pages
  - Combine the text pages and the IOT image pages
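The split, OCR, and combine steps can be sketched as follows. The sketch is library-agnostic and makes assumptions: `has_image` and `ocr_page` are hypothetical callables standing in for the image check and the OCR engine, and pages are treated as opaque objects.

```python
from typing import Callable, List, Sequence, TypeVar

Page = TypeVar("Page")


def build_improved_iot(
    pages: Sequence[Page],
    has_image: Callable[[Page], bool],
    ocr_page: Callable[[Page], Page],
) -> List[Page]:
    """OCR only the image pages and keep the text pages untouched.

    The original page order is preserved, so the result can be written
    out as a single combined document.
    """
    combined: List[Page] = []
    for page in pages:
        if has_image(page):
            combined.append(ocr_page(page))  # image page -> IOT page
        else:
            combined.append(page)            # text page passes through as-is
    return combined
```

Because text pages never go through the OCR engine, their original text layer is not replaced by recognized text.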

Inserter
- Insert the extracted images and the OCRed text into an HTML file
  - Insert by Marker
  - Insert by Text

Insert by Marker

Insert by Text
- Locate the insertion point by text matching
  - Text extracted from the PDF file
  - Text contained in the HTML file
- Example
  - Text in HTML: UNITED STATES Government Printing Office
  - Text extracted from PDF: U N I T E D S T A T E S Government Print- ing Office
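The example shows why the two texts cannot be compared character by character: the PDF extraction introduces letter spacing and end-of-line hyphenation. One possible way to make the two strings comparable, shown here only as a sketch and not as the method used in the project, is to reduce both to a whitespace-free, lowercase form. The helper name `squash` is hypothetical.

```python
import re


def squash(text: str) -> str:
    """Reduce text to lowercase letters and digits only.

    A hyphen followed by whitespace is treated as a line-break hyphen and
    dropped, so "Print- ing" becomes "printing". This would also merge a
    genuinely hyphenated phrase that happens to be followed by a space.
    """
    text = re.sub(r"-\s+", "", text)
    return re.sub(r"[^0-9a-z]", "", text.lower())


html_text = "UNITED STATES Government Printing Office"
pdf_text = "U N I T E D S T A T E S Government Print- ing Office"
same = squash(html_text) == squash(pdf_text)
```

With this normalization, both strings from the slide reduce to the same value, at the cost of losing word boundaries.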

Text Matching
- Tokenize the text extracted from PDF into words and store them in WordSet1
  - Remove invalid words
- Tokenize the HTML file line by line
  - Store the words in WordSet2
- At each line, check the percentage of WordSet1 contained in WordSet2
- Insert the image if the percentage is greater than a threshold
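A minimal sketch of the matching steps above, under several assumptions that the slides do not state: "invalid words" are taken to be single-character tokens, WordSet2 is rebuilt for each HTML line rather than accumulated, and the threshold value of 0.6 is arbitrary. The function names are hypothetical.

```python
import re
from typing import Optional, Sequence, Set


def tokenize(text: str) -> Set[str]:
    """Split text into lowercase words, dropping single-character tokens."""
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    return {w for w in words if len(w) > 1}


def find_insertion_line(
    pdf_text: str, html_lines: Sequence[str], threshold: float = 0.6
) -> Optional[int]:
    """Return the index of the first HTML line that covers enough of WordSet1."""
    word_set1 = tokenize(pdf_text)
    if not word_set1:
        return None
    for index, line in enumerate(html_lines):
        word_set2 = tokenize(line)
        # Fraction of WordSet1 that also appears in this line's WordSet2.
        coverage = len(word_set1 & word_set2) / len(word_set1)
        if coverage > threshold:
            return index
    return None
```

The function returns `None` when no line reaches the threshold, which lets a caller fall back to another strategy such as marker-based insertion.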

Insert OCRed Text

Other Functionality
- Configurability
  - Each component can be enabled/disabled
  - Image markers can be added using a text file
  - Configuration parameters can be changed during run time
- SOAP Web Service interface

Current Status
- System is fully functional
- Extractor is completed and tested
- Recognizer is completed and tested
- Inserter is undergoing optimization
- Web service interface implementation is in progress
