
1 Generation of Synthetic Datasets for Performance Evaluation of Text/Graphics Document OCR
Mathieu Delalandre, CVC, Barcelona, Spain
DAG Meeting, Wednesday 19th of November 2008

2 Introduction: Text/graphics documents
Text/graphics documents are used in a variety of fields such as geography, engineering, and the social sciences. Typical examples are architectural drawings, utility maps and geographic maps. A huge amount of data exists, from two main sources: digitized documents (modern and old) and web images.

3 Introduction: OCR of text/graphics documents
Character recognition systems working on text/graphics documents: first related work in [Brown’1979]; more than 50 references on the topic today [Fletcher’1988] [Zenzo’1992] [Goto’1999] [Adam’2000] …
Processing pipeline (a sketch follows below):
- text/graphics separation: full image → image of the text-lines
- text-line detection (general to any document): → images of single text-lines
- character segmentation (specific to text/graphics documents): → images of single characters
- character recognition: → ASCII
Problematics: letter segmentation, multi-font recognition, scale variation, text/graphics separation, rotation variation, text-line detection, no reading order, no dictionary.
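The four stages above chain into a single recognition pipeline. The following is a minimal illustrative skeleton of that chaining; every stage function is a hypothetical placeholder name (not from the talk) standing in for a real separator, detector, segmenter and classifier.

```python
# Illustrative skeleton of the text/graphics OCR pipeline described above.
# All stage functions are hypothetical placeholders.
from typing import List


def separate_text_graphics(page):
    """Full page image -> image containing only the text layer."""
    raise NotImplementedError


def detect_text_lines(text_layer) -> List:
    """Text layer -> list of single text-line images."""
    raise NotImplementedError


def segment_characters(line) -> List:
    """Single text-line image -> list of single character images."""
    raise NotImplementedError


def recognize_character(char_img) -> str:
    """Single character image -> ASCII label."""
    raise NotImplementedError


def ocr_text_graphics_page(page) -> List[str]:
    """Chain the four stages: page -> text layer -> lines -> characters -> ASCII."""
    text_layer = separate_text_graphics(page)
    return [recognize_character(char)
            for line in detect_text_lines(text_layer)
            for char in segment_characters(line)]
```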

4 Introduction: About performance evaluation
The case of general OCR [Kanungo’1999]: more than 40 references on the topic, several standard databases (NIST, MARS, CD-ROM English, …), annual evaluation reports [Rice’1992] [Rice’1993].
Black-box evaluation: the evaluation considers the OCR system as an indivisible unit and evaluates it from its final results, i.e. OCR output vs. ASCII transcription of the text, using string edit distances (see the sketch below).
White-box evaluation: the evaluation aims to characterize the performance of the individual sub-modules of the OCR system (skewing, letter segmentation, block identification, character recognition, etc.).
Evaluation loop (diagram on the slide): the documents are processed by the system to produce results; groundtruthing of the documents produces the groundtruth; characterisation compares results and groundtruth to give the performance evaluation.
The case of text/graphics document OCR [Wenyin’1997]: only 1 reference on the topic, no standard databases, and no complete evaluation carried out in 20 years of research.
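Black-box evaluation hinges on comparing the OCR output with the ASCII transcription through a string edit distance. Below is a minimal sketch of the standard Levenshtein distance; it is the generic metric, not necessarily the exact variant used in the cited evaluation reports.

```python
def edit_distance(ocr_output: str, groundtruth: str) -> int:
    """Levenshtein distance between the OCR output and the ASCII groundtruth.

    Standard dynamic programming with unit costs for insertion, deletion
    and substitution.
    """
    m, n = len(ocr_output), len(groundtruth)
    prev = list(range(n + 1))          # distances for ocr_output[:0]
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ocr_output[i - 1] == groundtruth[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]


# Example of a character accuracy in the spirit of black-box evaluation:
# accuracy = 1 - edit_distance(output, truth) / max(len(truth), 1)
```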

5 Introduction: Scope of the proposed work
Pipeline and evaluation steps concerned: text/graphics separation, text-line detection, character segmentation, character recognition; groundtruthing, characterization.
Performance evaluation of text/graphics document OCR:
# white-box evaluation
# groundtruthing step
# datasets for text-line detection and character recognition
# the generation algorithms are “simple”; the main purpose of the talk concerns the setting contributions

6 Plan
1. Groundtruth definition
2. Datasets for character recognition
3. Datasets for text-line detection
4. In-progress datasets

7 Groundtruth definition
Character level: ASCII code; font (name, size, style); location point; oriented bounding box; orientation (θ); scale.
Text level: first location point; groundtruth of the characters; character/word positions (see the data-structure sketch below).
Example table on the slide: the characters of “Hello World” indexed by their character position (p-char) and word position (p-word).
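The two groundtruth levels map naturally onto a small data structure. A minimal sketch using Python dataclasses; the class and field names are illustrative choices, not a published schema from the talk.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CharacterGT:
    """Character-level groundtruth, as listed on the slide."""
    ascii_code: str                        # e.g. 'H'
    font_name: str                         # e.g. 'times'
    font_size: int                         # in points
    font_style: str                        # e.g. 'regular', 'bold'
    location: Tuple[float, float]          # location point (x, y)
    bbox: Tuple[Tuple[float, float], ...]  # oriented bounding box corners
    orientation: float                     # theta, in radians
    scale: float                           # scale factor


@dataclass
class TextLineGT:
    """Text-level groundtruth: first location point plus per-character GT."""
    first_location: Tuple[float, float]
    characters: List[CharacterGT] = field(default_factory=list)
    # character/word positions, e.g. [("Hello", 0), ("World", 6)]
    word_positions: List[Tuple[str, int]] = field(default_factory=list)
```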

8 Datasets for character recognition (1/2)
Problematics: how to generate single character images? How many classes? Which image resolution? Which size for the datasets? Which fonts? Etc.
Survey of published experiments (table on the slide; columns: image size, class size, learning, font(s), rotation, scaling) covering Brown’1981, Zenzo’92, Takahashi’1992, Adam’2000, Chen’2003, Choisy’2004, Hase’2004, Pal’2006 and Roy’2008: image sizes range from 24² to 68² pixels, class sizes go up to 62, dataset sizes from 1 000 to 72 000 images, with 1 to many fonts and varying use of rotation and scaling.
Main conclusions:
- The real sizes of the characters can only be estimated.
- The confusion problem (e.g. 6 vs. 9) is still not well defined; the 62-class problem (a-z, A-Z, 0-9) is the main goal.
- It is not possible to fix a standard size for the training/test sets; this information is not well defined, but several thousands of images are mandatory for training.
- The impact of fonts is little studied and should be taken into account in the evaluation.
- Invariance to rotation and scaling is the final goal; they are rarely studied independently.

9 Datasets for character recognition (2/2)
Generation setting:
- letter classes: 62 (a-z; A-Z; 0-9)
- font classes: 30 fonts with lower and upper case, no cursive; 3 basic fonts (times, courier, arial)
- character size: 32² pixels max (dx, dy of the font symbols)
- dataset size: 5 000 images per font; 62 classes; 40 samples/class; 50%/50% training; free ranked files allow a training specification (e.g. 20% training on [file-4001 – file-5000])
- character scaling: 1.0 to 2.0 with a step of 1/1000
- character rotation: 0 to 2π with a step of π/500
Datasets (each test is defined by its use of scaling and rotation, font(s) per test, number of fonts and number of images):
- geometry invariance: 3 tests, with and without scaling/rotation, 1 font per test, 15 000 images
- font adequacy: tests over the 30 fonts, with scaling and rotation, 1 font per test
- font scalability: 4 tests, with scaling and rotation, 3; 6; 9; 12 fonts per test, 12 fonts
Generation algorithm: font manager, centering, scale and rotation processes (a hedged sketch follows below).
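The slide only names the generation algorithm (font manager, centering, scale and rotation processes). Below is a rough sketch of what such a generator could look like with Pillow; the font path, base glyph size and clipping behaviour are my own assumptions, and only the parameter ranges (32² canvas, scaling 1.0-2.0, rotation 0-2π) come from the setting table.

```python
# Hedged sketch of a single-character image generator (centering + scale +
# rotation) using Pillow. The real generator is not published.
import math
import random
from PIL import Image, ImageDraw, ImageFont

CANVAS = 32      # 32x32 pixel character images (setting table)
BASE_SIZE = 10   # assumed base glyph size so that scale 2.0 still fits the canvas


def render_character(ch: str, font_path: str,
                     scale: float = 1.0, rotation: float = 0.0) -> Image.Image:
    """Render one character, scale and rotate it, and centre it on the canvas."""
    font = ImageFont.truetype(font_path, int(BASE_SIZE * scale))
    # Draw the glyph tightly on a temporary white canvas.
    tmp = Image.new("L", (4 * CANVAS, 4 * CANVAS), 255)
    draw = ImageDraw.Draw(tmp)
    left, top, right, bottom = draw.textbbox((0, 0), ch, font=font)
    draw.text((-left, -top), ch, font=font, fill=0)
    glyph = tmp.crop((0, 0, right - left, bottom - top))
    # Rotate around the glyph centre (Pillow rotates counter-clockwise, in degrees).
    glyph = glyph.rotate(math.degrees(rotation), expand=True, fillcolor=255)
    # Centre the (possibly enlarged) glyph on the final 32x32 canvas.
    out = Image.new("L", (CANVAS, CANVAS), 255)
    out.paste(glyph, ((CANVAS - glyph.width) // 2, (CANVAS - glyph.height) // 2))
    return out


if __name__ == "__main__":
    # One random sample within the ranges of the setting table; the font path
    # is a hypothetical .ttf file handled by the "font manager".
    img = render_character("A", "fonts/times.ttf",
                           scale=random.uniform(1.0, 2.0),
                           rotation=random.uniform(0.0, 2 * math.pi))
    img.save("sample.png")
```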

10 Datasets for text-line detection (1/2)
Problematics: how to generate the images? How many words per image? Which image size? Which size for the datasets? How many fonts? Etc.
Survey of published experiments (table on the slide; columns: use-case, images, text-lines, curved, fonts/image, scaling): geographic maps [Roy’2008], artistic documents [Pal’2004], posters and newspapers [Loo’2002], posters and publicity [Park’2001], Japanese forms [Goto’1999], maps [Tan’1998] [Deseilligny’1995], drawings [He’1996] and cadastral maps [Burgue’1995]; the datasets range from a single image to a few hundred images, and from tens to several thousands of text-lines.
Main conclusions:
- The use-cases are heterogeneous; the sizes and resolutions of the images are rarely provided, so the text density is difficult to estimate; images with significant text content are preferred.
- Depending on the use-case, not all the methods work on curved text; a combination of curved and straight text is necessary.
- All the methods use context to extract the text-lines (i.e. font type, character size, line model).
- The size of the characters can vary a lot; the number of fonts is generally small (fewer than ten).

11 Datasets for text-line detection (2/2)
Setting:
- dictionary: 422 text-lines (countries and capitals)
- font classes: 30 fonts with lower and upper case, no cursive
- character size: 32² pixels max (dx, dy of the font symbols)
- image size: 640², with 10-50 text-lines per image
- dataset size: 100 images
- text scaling: 1.0 to 1.5 with a step of 1/1000
- text rotation: -π/2 to +π/2 with a step of π/500
Datasets (each test is defined by its text-lines per image, scaling, curved text, font(s) per test and words):
- text-line density: low, medium and high density tests, with scaling, no curved text, 3 fonts per test (words in progress)
- font context: medium density, no curved text, 3; 6; 9 fonts per test (words in progress)
- size context: medium density, with scaling, no curved text (words in progress)
Generation algorithm: the insert algorithm, in two steps; when a new bounding box B2 collides with an already placed box B1, B1 ejects B2 by (dx, dy) (diagram on the slide; a sketch follows below).
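The insert algorithm is only sketched on the slide (step 1 places a text-line, step 2 lets a placed box B1 eject a colliding box B2 by dx, dy). Below is a rough, axis-aligned guess at that collision/ejection logic; the box representation, retry policy and ejection direction are my own assumptions, and the real boxes are oriented.

```python
# Hedged, axis-aligned guess at the slide's "insert algorithm": place one
# bounding box per text-line in a 640x640 image, and let an already placed
# box eject a colliding candidate by the overlap (dx, dy).
import random
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1)
IMG = 640                         # 640x640 images (setting table)


def overlap(a: Box, b: Box) -> Optional[Tuple[int, int]]:
    """Overlap (dx, dy) of two boxes, or None if they are disjoint."""
    dx = min(a[2], b[2]) - max(a[0], b[0])
    dy = min(a[3], b[3]) - max(a[1], b[1])
    return (dx, dy) if dx > 0 and dy > 0 else None


def insert_text_lines(sizes: List[Tuple[int, int]], tries: int = 200) -> List[Box]:
    """Place one box of size (w, h) per text-line; skip a line if no room is found."""
    placed: List[Box] = []
    for w, h in sizes:                      # each (w, h) must fit inside the image
        for _ in range(tries):
            x = random.randint(0, IMG - w)
            y = random.randint(0, IMG - h)
            cand = (x, y, x + w, y + h)
            hit = None
            for b in placed:
                hit = overlap(b, cand)
                if hit:
                    break
            if hit is None:
                placed.append(cand)
                break
            # Ejection step: shift along the axis of smallest overlap, clamped
            # to the image; if the shifted box still collides, redraw a position.
            dx, dy = hit
            if dx <= dy:
                x = min(max(x + dx, 0), IMG - w)
            else:
                y = min(max(y + dy, 0), IMG - h)
            cand = (x, y, x + w, y + h)
            if not any(overlap(b, cand) for b in placed):
                placed.append(cand)
                break
    return placed


# Example: boxes for 30 text-lines, within the 10-50 lines/image range of the setting.
# boxes = insert_text_lines([(random.randint(60, 200), 16) for _ in range(30)])
```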

12 In-progress datasets

13 Conclusions
Conclusions
# work in progress …
# the character recognition datasets are ready
# the bags of words are still being packaged, but will be ready soon
Perspectives
# middle term: experiments with standard feature extraction methods [Roy’2008] [Valveny’2007]
# long term: experiments with bags of words and text/graphics documents [Delalandre’2007] [Wenyin’1997]

14 References (1/2)
R. Brown, M. Lybanon and L.K. Gronmeyer. Recognition of Handprinted Characters for Automated Cartography: A Progress Report. Proceedings of the SPIE, vol. 205, 1979.
L.A. Fletcher and R. Kasturi. A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images. Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 10, 1988.
S.D. Zenzo, M.D. Buno, M. Meucci and A. Spirito. Optical recognition of hand-printed characters of any size, position, and orientation. IBM Journal of Research and Development, vol. 36, 1992.
H. Goto and H. Aso. Extracting curved text lines using local linearity of the text line. International Journal on Document Analysis and Recognition (IJDAR), vol. 2, 1999.
S. Adam, J.M. Ogier, C. Cariou, R. Mullot, J. Labiche and J. Gardes. Symbol and Character Recognition: Application to Engineering Drawings. International Journal on Document Analysis and Recognition (IJDAR), vol. 3, 2000.
T. Kanungo, G.A. Marton and O. Bulbu. Performance evaluation of two Arabic OCR products. Workshop on Advances in Computer-Assisted Recognition (AIPR), SPIE Proceedings, vol. 3584, 1999.
S.V. Rice, J. Kanai and T.A. Nartker. A Report on the Accuracy of OCR Devices. Information Science Research Institute, University of Nevada, USA, 1992.
S.V. Rice, J. Kanai and T.A. Nartker. An Evaluation of OCR Accuracy. Information Science Research Institute, University of Nevada, USA, 1993.
L. Wenyin and D. Dori. A Protocol for Performance Evaluation of Line Detection Algorithms. Machine Vision and Applications, vol. 9, 1997.
R.M. Brown. Handprinted Symbol Recognition System: A Very High Performance Approach To Pattern Analysis Of Free-form Symbols. Conference Southeastcon, pp. 5-8, 1981.
H. Takahashi. Neural network architectures for rotated character recognition. International Conference on Pattern Recognition (ICPR), 1992.
Q. Chen. Evaluation of OCR algorithms for images with different spatial resolutions and noises. School of Information Technology and Engineering, University of Ottawa, Canada, 2003.
C. Choisy, H. Cecotti and A. Belaid. Character Rotation Absorption Using a Dynamic Neural Network Topology: Comparison With Invariant Features. International Conference on Enterprise Information Systems (ICEIS), 2004.

15 References (2/2)
H. Hase, T. Shinokawa, S. Tokai and C.Y. Suen. A robust method of recognizing multi-font rotated characters. International Conference on Pattern Recognition (ICPR), vol. 2, 2004.
U. Pal, F. Kimura, K. Roy and T. Pal. Recognition of English Multi-oriented Characters. International Conference on Pattern Recognition (ICPR), vol. 2, 2006.
P.P. Roy, U. Pal and J. Llados. Multi-oriented character recognition from graphical documents. International Conference on Cognition and Recognition (ICCR), 2008.
U. Pal and P.P. Roy. Multi-oriented and curved text lines extraction from Indian documents. IEEE Transactions on Systems, Man and Cybernetics, Part B, vol. 34, 2004.
P.K. Loo and C.L. Tan. Word and Sentence Extraction Using Irregular Pyramid. Workshop on Document Analysis Systems (DAS), Lecture Notes in Computer Science (LNCS), vol. 2423, 2002.
H.C. Park, S.Y. Ok, Y.J. Yu and H.G. Cho. Word Extraction in Text/Graphic Mixed Image Using 3-Dimensional Graph Model. International Journal on Document Analysis and Recognition (IJDAR), vol. 4, 2001.
C.L. Tan and P.O. Ng. Text extraction using pyramid. Pattern Recognition (PR), vol. 31, 1998.
S. He, N. Abe and C.L. Tan. A clustering-based approach to the separation of text strings from mixed text/graphics documents. International Conference on Pattern Recognition (ICPR), 1996.
M. Burge and G. Monagan. Extracting Words and Multi Part Symbols in Graphics Rich Documents. International Conference on Image Analysis and Processing (ICIAP), 1995.
M. Deseilligny, H. Le Men and G. Stamon. Characters string recognition on maps, a method for high level reconstruction. International Conference on Document Analysis and Recognition (ICDAR), 1995.
E. Valveny, S. Tabbone, O. Ramos and E. Philippot. Performance Characterization of Shape Descriptors for Symbol Representation. Workshop on Graphics Recognition (GREC), 2007.
M. Delalandre, T. Pridmore, E. Valveny, E. Trupin and H. Locteau. Building Synthetic Graphical Documents for Performance Evaluation. Workshop on Graphics Recognition (GREC), Lecture Notes in Computer Science (LNCS), vol. 5046, 2008.

