Objective Enhance the document production workflow at US Government Printing Office (GPO) Extract images from PDF OCR the extracted images/PDF Produce.


Presentation transcript:

1

2 Objective
Enhance the document production workflow at the US Government Printing Office (GPO)
- Extract images from PDF
- OCR the extracted images/PDF
- Produce HTML files with content extracted from PDF

3 Current GPO Workflow

4 Enhanced GPO Workflow

5 System Diagram

6 Input and Output
Input
- PDF file (text and images)
- HTML file (text only)
Output
- HTML file
  - Original text
  - Images extracted from PDF
  - Text OCRed from the images
- Image Over Text PDF (IOT)

7 Extractor
- Extract images from PDF
- Extract the text before and after each image

8 Image Extraction
- CCITTFaxDecode filter: extract to TIFF images directly
- DCTDecode filter: extract to JPG images directly
- Other filters: decode first, then re-encode to PNG images
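
The filter-to-format rule on this slide can be sketched as a small lookup. This is only an illustration of the decision, not the Extractor's actual code: the function name `plan_extraction` and its return shape are hypothetical, and reading or decoding the image stream itself is left out.

```python
# Hypothetical helper illustrating the slide's rule; not the GPO Extractor itself.
DIRECT_FORMATS = {
    "CCITTFaxDecode": "tiff",  # per the slide: extracted to TIFF directly
    "DCTDecode": "jpg",        # per the slide: extracted to JPG directly
}


def plan_extraction(filter_name):
    """Return (output_format, needs_reencode) for a PDF image stream filter.

    Accepts the name with or without the leading "/" used in PDF syntax,
    and compares case-insensitively.
    """
    key = filter_name.lstrip("/").lower()
    for name, fmt in DIRECT_FORMATS.items():
        if key == name.lower():
            return fmt, False
    # Any other filter: decode the stream first, then re-encode as PNG.
    return "png", True
```

For example, `plan_extraction("/DCTDecode")` yields `("jpg", False)`, while a filter such as `FlateDecode` falls through to the decode-then-PNG path.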

9 Recognizer
- OCR the text in images and store it in text files (img2txt)
- OCR the images in a PDF and produce an Image Over Text PDF (pdf2iot)

10 OCR Product Selection
- 40 OCR products on the market
- 4 finalists selected after extensive accuracy testing
- Winner: OmniPage SDK
  - Among the top two in terms of accuracy
  - Best in terms of preserving original layout
  - Capable of producing IOT
  - Supports Linux platform

11 Problem of IOT

12 Improve IOT Quality
OCR only the pages that contain images
- Split the pages of the PDF into image pages and text pages
- OCR the image pages
- Combine the text pages and the IOT image pages
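
A minimal sketch of the split, OCR, and combine steps above, using plain Python data in place of real PDF pages. The page dictionaries, the `has_images` flag, and the `ocr_page` callable are all assumptions made for illustration; in the real system the OCR step would be performed by the Recognizer.

```python
def build_iot(pages, ocr_page):
    """Run OCR only on pages that contain images and keep the original page order.

    pages: list of dicts, each with a boolean "has_images" key (hypothetical shape).
    ocr_page: callable that takes an image page and returns its IOT version.
    """
    combined = []
    for page in pages:
        if page["has_images"]:
            # Image page: replace it with its Image Over Text version.
            combined.append(ocr_page(page))
        else:
            # Text page: pass through untouched, so its original text is not degraded.
            combined.append(page)
    return combined
```

Walking the pages in order and replacing only the image pages performs the split and the recombination in a single pass.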

13 Inserter
Insert the extracted images and the OCRed text into an HTML file
- Insert by Marker
- Insert by Text

14 Insert by Marker

15 Insert by Text
Locate the insertion point by text matching
- Text extracted from the PDF file
- Text contained in the HTML file
Example
- Text in HTML: UNITED STATES Government Printing Office
- Text extracted from PDF: U N I T E D S T A T E S Government Print- ing Office

16 Text Matching
- Tokenize the text extracted from the PDF into words and store them in WordSet1
- Remove invalid words
- Tokenize the HTML file line by line
- Store the words in WordSet2
- At each line, check the percentage of WordSet1 contained in WordSet2
- Insert the image if the percentage is greater than a threshold
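
The matching steps above can be sketched as follows. Several details are assumptions, because the slide does not specify them: the tokenizer, the rule for an "invalid" word (here, any token shorter than two characters, which discards the letter-spaced fragments seen in PDF text), the default threshold of 0.6, and the choice to let WordSet2 accumulate across lines rather than reset on each line. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_valid(word):
    # Assumption: single-character tokens are treated as invalid.
    return len(word) >= 2


def find_insertion_line(pdf_text, html_lines, threshold=0.6):
    """Return the index of the first HTML line at which the share of WordSet1
    found in WordSet2 exceeds the threshold, or None if that never happens."""
    word_set1 = {w for w in tokenize(pdf_text) if is_valid(w)}
    if not word_set1:
        return None
    word_set2 = set()
    for index, line in enumerate(html_lines):
        # Assumption: WordSet2 grows line by line. Tag names are tokenized too,
        # which is harmless unless they also appear in WordSet1.
        word_set2.update(tokenize(line))
        coverage = len(word_set1 & word_set2) / len(word_set1)
        if coverage > threshold:
            return index
    return None
```

A per-line WordSet2 is an equally plausible reading of the slide; it would only require rebuilding `word_set2` inside the loop instead of updating it.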

17 Insert OCRed Text

18 Other Functionality
- Configurability
  - Each component can be enabled/disabled
  - Image markers can be added using a text file
  - Configuration parameters can be changed at run time
- SOAP Web Service interface

19 Current Status
System is fully functional
- Extractor is completed and tested
- Recognizer is completed and tested
- Inserter is undergoing optimization
- Web service interface implementation is in progress

