Autonomous Cleaning of Corrupted Scanned Documents: A Generative Modeling Approach. Zhenwen Dai, Jörg Lücke. Frankfurt Institute for Advanced Studies, Dept. of Physics, Goethe-University Frankfurt
A document cleaning problem
What method can save us? Optical Character Recognition (OCR)
OCR Software: Character Segmentation, then Character Classification. [Figure: input vs. OCR output (FineReader 11)]
What method can save us? Optical Character Recognition (OCR) Automatic Image Inpainting
Automatic Image Inpainting
Automatic Image Inpainting: Unable to identify the defects, because corruption and characters consist of the same features. A solution requires knowledge of explicit character representations.
What else? Optical Character Recognition (OCR), Automatic Image Inpainting, Image Denoising? … The problem requires a new solution!
Our Approach: The training data is only the page of the corrupted document. No label information. A limited alphabet (currently). [Figure: input vs. our approach]
How does it work without supervision? Characters are salient self-repeating patterns. Corruptions are more irregular. Related to Sparse Coding. [Figure: input vs. our approach]
The Flow of Our Approach: (1) Cut into Image Patches. (2) Learning a Character Model on Image Patches. (3) Character Detection & Recognition. [Figure: characters b, a, y, s, e]
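A minimal sketch of the patch-cutting step, assuming a grayscale page stored as a 2-D NumPy array. The function name `cut_into_patches`, the overlapping-window layout, and the `stride` parameter are illustrative assumptions; the slides do not specify how patches are laid out.

```python
import numpy as np

def cut_into_patches(image, patch_size, stride):
    """Cut a 2-D image into (possibly overlapping) patches.

    Hypothetical helper: returns the stacked patches and the
    top-left (row, col) position of each patch.
    """
    ph, pw = patch_size
    patches, positions = [], []
    for r in range(0, image.shape[0] - ph + 1, stride):
        for c in range(0, image.shape[1] - pw + 1, stride):
            patches.append(image[r:r + ph, c:c + pw])
            positions.append((r, c))
    return np.stack(patches), positions
```

A stride smaller than the patch size makes neighbouring patches overlap, so a character that straddles one patch boundary can still appear whole in another patch.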
A Probabilistic Generative Model: Show a character generation process. A character representation (parameters): feature vectors (RGB color) and mask param.
A Tour of Generation: (1) Select a character (prior prob. 0.2 for each of five characters). (2) Translate to the position (e.g., translation by [12,10]^T). (3) Generate a background (pixel-wise background distribution). (4) Overlap character with background according to mask. [Figure: masks and features of the characters]
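The four generation steps can be sketched as a small sampling routine. This is a toy version under stated assumptions, not the authors' implementation: features are grayscale rather than RGB, translation is a cyclic shift, the background and the character pixels are Gaussian, and the mask is read as a per-pixel probability that the character (rather than the background) is visible. The name `generate_patch` and all of its parameters are hypothetical.

```python
import numpy as np

def generate_patch(rng, priors, masks, features, stds, bg_mean, bg_std, shift):
    """Sample one grayscale patch from a toy character-generation process."""
    # 1. Select a character according to the prior probabilities.
    c = rng.choice(len(priors), p=priors)
    # 2. Translate mask and features to the position (cyclic shift: an assumption).
    mask = np.roll(masks[c], shift, axis=(0, 1))
    feat = np.roll(features[c], shift, axis=(0, 1))
    # 3. Generate a pixel-wise background (Gaussian: an assumption).
    background = rng.normal(bg_mean, bg_std, size=mask.shape)
    # 4. Overlap character with background according to the mask:
    #    each pixel shows the character with probability mask[pixel].
    character = rng.normal(feat, stds[c])
    from_char = rng.random(mask.shape) < mask
    return c, np.where(from_char, character, background)
```

For example, with `rng = np.random.default_rng(0)`, five characters with `priors = [0.2] * 5`, and `shift = (12, 10)`, each call returns the index of the sampled character together with one synthetic patch.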
Maximum Likelihood: Iterative parameter update rules from EM. [Equations labeled: prior prob., posterior, std; parameter set iterated over t0, t1, t2, ..., tn] A posterior distribution is needed for every image patch in the update rules.
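As an illustration of why the posterior is needed for every patch, the standard EM update for the prior probabilities of a mixture-type model can be written as follows. The notation is assumed here rather than taken from the slides: y^(n) is the n-th of N patches, c the character index, x the position, and Θ the parameter set. The updates for mask, feature, and std are not reproduced.

```latex
% Assumed notation, not from the slides.
\pi_c^{\text{new}} = \frac{1}{N} \sum_{n=1}^{N} p\big(c \mid y^{(n)}, \Theta^{\text{old}}\big),
\qquad
p\big(c \mid y^{(n)}, \Theta\big) = \sum_{x} p\big(c, x \mid y^{(n)}, \Theta\big)
```

Each term in the sum is a posterior over the hidden variables of one patch, so the cost of one EM iteration grows with the number of patches times the size of the hidden space.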
Posterior Computation Problem: A posterior distribution is needed for every image patch in the update rules. Inference over the hidden space: Which character (A, B, C, D, E)? Where? Similar to template matching. A pre-selection approximation (truncated variational EM) (Lücke & Eggert, JMLR 2010) (Yuille & Kersten, TiCS 2006)
An Intuitive Illustration of Pre-selection: Select some local features according to parameters. Very few features give a number of good guesses. [Figure: characters A, B, C, D, E; features in image patches; candidates B, D] (Lücke & Eggert, JMLR 2010) (Yuille & Kersten, TiCS 2006)
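The idea of pre-selection can be sketched as a two-stage computation: score every hidden state (character, shift) cheaply on a few features, keep only the best candidates, and compute the posterior on those alone. The sketch below is a simplified stand-in under stated assumptions (grayscale templates, cyclic shifts, a flat prior, an isotropic Gaussian likelihood, raw pixels instead of Gabor features); the function `preselect_posterior` and its parameters are hypothetical and do not reproduce the procedure of the cited papers.

```python
import numpy as np

def preselect_posterior(patch, templates, shifts, n_features=4, n_keep=3, std=0.1):
    """Approximate posterior over (character, shift) via pre-selection."""
    hidden = [(c, s) for c in range(len(templates)) for s in shifts]

    # Stage 1: cheap score using only the few strongest pixels of each template.
    cheap = []
    for c, s in hidden:
        t = np.roll(templates[c], s, axis=(0, 1)).ravel()
        idx = np.argsort(t)[-n_features:]
        cheap.append(-np.sum((patch.ravel()[idx] - t[idx]) ** 2))
    keep = np.argsort(cheap)[-n_keep:]  # indices of the best guesses

    # Stage 2: full Gaussian log-likelihood, evaluated on the survivors only.
    loglik = np.array([
        -0.5 * np.sum(
            (patch - np.roll(templates[hidden[k][0]], hidden[k][1], axis=(0, 1))) ** 2
        ) / std ** 2
        for k in keep
    ])
    post = np.exp(loglik - loglik.max())  # subtract the max for numerical stability
    post /= post.sum()                    # normalize over the kept states only
    return {hidden[k]: p for k, p in zip(keep, post)}
```

All hidden states outside the kept set receive zero posterior mass, which is what makes the approximation truncated.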
Learn the Character Representations. Input: image patches (Gabor wavelets). A learning course (about 25 mins). [Figure: learned mask, feature, and std (heat map) of the characters at stages 1 to 6]
Document Cleaning: How to recognize characters against noise? Character segmentation fails. Our model assumes one character per patch. It is a non-trivial task. Try to exploit the model as much as possible.
Document Cleaning Procedure: Clean characters from the corrupted document. (1) Inference of every patch with the learned model. (2) Paint a clean character at the detected position. (3) Erase the character from the original document. [Figure: original vs. reconstructed; patch accepted with Fully visible=1]
Document Cleaning Procedure: Inference of every patch with the learned model. Iterate until no more reconstruction (about 1 min per iteration). [Figure: iteration 1 and iteration 2, original vs. reconstructed; patches with Fully visible=1 are accepted, patches with Fully visible=0 (more than one character per patch) are rejected]
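The iterate-until-nothing-is-reconstructed loop can be sketched with a toy detector. In this sketch, binary template matching stands in for inference with the learned model, and the acceptance test (all ink pixels of the best template are present) stands in for the "Fully visible" decision; both are assumptions for illustration, as are the names `clean_document` and `min_match`.

```python
import numpy as np

def clean_document(doc, templates, min_match=1.0, max_iters=10):
    """Toy cleaning loop: detect, paint onto a clean page, erase, repeat."""
    work = np.array(doc, dtype=float)  # working copy of the corrupted page
    clean = np.zeros_like(work)        # reconstructed clean page
    h, w = templates[0].shape
    n_iters = 0
    for _ in range(max_iters):
        n_iters += 1
        accepted = 0
        for r in range(work.shape[0] - h + 1):
            for c in range(work.shape[1] - w + 1):
                patch = work[r:r + h, c:c + w]  # view into `work`
                # Fraction of each template's ink pixels found in the patch.
                scores = [np.sum(patch * t) / np.sum(t) for t in templates]
                k = int(np.argmax(scores))
                if scores[k] >= min_match:  # accept
                    # Paint a clean character at the detected position.
                    clean[r:r + h, c:c + w] = np.maximum(
                        clean[r:r + h, c:c + w], templates[k])
                    # Erase the character from the working document.
                    patch[templates[k] > 0] = 0.0
                    accepted += 1
        if accepted == 0:  # no more reconstruction: stop iterating
            break
    return clean, work, n_iters
```

The function returns the reconstructed page, what is left of the original (ideally only the corruption), and the number of passes made.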
Before Cleaning
After Iteration 1
After Iteration 2
After Iteration 3
More Experiments: More characters (9 chars). Unusual character set (Klingon). Irregular placement (randomly placed, rotated). Occluded by spilled ink. [Figure: original vs. reconstructed for 9 chars, Klingon, and Occluded]
Recognition Rates
False Positives
Not only a Character Model: Detect and count cells on microscopic image data, in collaboration with Thilo Figge and Carl Svensson.
Summary: Addressed the corrupted document cleaning problem. Followed a probabilistic generative approach. Autonomous cleaning of a document is possible. Demonstrated efficiency and robustness. The dataset will be available online soon. Future directions: Extend to a large alphabet by incorporating prior knowledge of documents. Extend to various other applications.
Acknowledgement http://fias.uni-frankfurt.de/cnml
Thanks for your attention!
Learned Character Representations: Cut the document into small patches. Run the learning algorithm.
Performance. Datasets: “bayes”, 9 chars, Klingon, Randomly placed, Occluded. Recognition Rates: OCR 56.5%, 75.4%, 0.8%, 41.6%; Our algorithm 100%, 97.4%. False Positives: 297, 285, 231, 86, 413, 3, 6.
Document Cleaning Procedure: Character vs. Noise? MAP inference can only choose among learned characters. Define a novel quality measure. [Figure: for example characters y and a, columns MAP, mask param., mask posterior, difference] Threshold: 0.5
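One possible reading of the quality measure, sketched under explicit assumptions: the mask parameter of a pixel is taken as the prior probability that the character is visible there, the mask posterior follows from Bayes' rule with Gaussian character and background pixels, and a patch counts as fully visible when the mask-weighted mean absolute difference between the two stays below the threshold. The slides give only the column labels and the threshold 0.5, so the exact form of the difference and the direction of the comparison are guesses; all function names here are hypothetical.

```python
import numpy as np

def gaussian_pdf(y, mean, std):
    """Density of a univariate Gaussian, evaluated element-wise."""
    return np.exp(-0.5 * ((y - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mask_posterior(patch, mask, feature, std, bg_mean, bg_std):
    """Per-pixel posterior probability that the character (not background) is seen."""
    p_char = mask * gaussian_pdf(patch, feature, std)
    p_bg = (1.0 - mask) * gaussian_pdf(patch, bg_mean, bg_std)
    return p_char / (p_char + p_bg)

def fully_visible(patch, mask, feature, std, bg_mean, bg_std, threshold=0.5):
    """Assumed decision rule: accept if mask param. and mask posterior agree."""
    post = mask_posterior(patch, mask, feature, std, bg_mean, bg_std)
    # Mask-weighted mean absolute difference (the weighting is an assumption).
    diff = np.sum(mask * np.abs(mask - post)) / np.sum(mask)
    return bool(diff < threshold)
```

Under this reading, a patch in which the MAP character is actually present yields a mask posterior close to the mask parameters, while noise or a missing character pushes the posterior away from them.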