Published by Wendy Ryan. Modified over 6 years ago.
1
CDIA-DS: A Framework for Efficient Reconstruction of Compound Document Image using Data Structure
Anand Gupta, Devendra Tiwari, Priyanshi Gupta, Ankit Kulshreshtha
Hello everyone. We are here to present CDIA-DS, a framework proposed for the efficient reconstruction of a compound document image using a data structure. We will first introduce the basic concept behind our research and the need for such a system, and then elaborate on each step taken to achieve it.
2
EXTRACTING SEGMENT WISE FEATURES FROM A SCANNED DOCUMENT IMAGE
Our research aims to extract and structure the information in the image of a compound document so that its reconstruction into an editable format is facilitated. For instance, consider this document, available in an editable Word format or saved as a PDF. We also have a hard copy of the same document in print. From the perspective of document processing tasks such as storage, retrieval, modification and transfer, paper-based documents are far less efficient than electronic ones. Hence, most of our sensitive documents are saved in electronic formats rather than in print.
PDF FILE OF DOCUMENT
Swine flu tightens its grip over India
Swine flu has tightened its grip over India, with the death toll reaching close to [...]. Fresh cases have been reported from across the country, including Delhi, Rajasthan, Gujarat, Uttar Pradesh, Jammu and Kashmir, West Bengal, Nagaland and Bihar.
Consecutive Western Disturbances and induced cyclonic circulations will give rainy spell across north and Northwest India, commencing from Saturday evening till March 3. This has raised serious concerns over H1N1 influenza, as weather has played major role in intensifying the flu this season. In a bid to curtail spread of deadly virus, it is very necessary for temperatures to rise and drop in humidity levels, which is not expected to happen anytime soon. Weather is a key factor in letting the virus sustain and spread.
The virus survives comfortably in the winter season and even during the spring, since the temperature does not shoot up much. Low temperatures and high humidity is making the environment conducive for the H1N1 virus to proliferate. Back to back weather systems this season have kept the humidity levels high and have also influenced the wind pattern across the plains of northwest India and adjoining areas, resulting in swine flu virus to sustain for a longer period.
             Gujarat  Rajasthan  Madhya Pradesh  Maharashtra  Delhi
Casualties       265        261             153          143     10
Affected        4368       5528            1010         1735   2891
3
WHAT IF THE ORIGINAL PDF OR DOCX FILE GETS DAMAGED OR LOST!!??
Now consider a case in which the document gets damaged or lost. A huge loss? What if we had a mechanism to convert the image of the printed copy into an editable electronic format? Methods for transforming such images into a computer-revisable e-document format are based on either manual data entry or an automated conversion mechanism such as document image analysis.
4
CDIA-DS framework
C: COMPOUND
D: DOCUMENT
I: IMAGE
A: ANALYSIS
D: DATA
S: STRUCTURE
Organisation of the presentation:
Essential Steps in reconstruction
Our motivation
CDIA-DS Framework
Experiments
Conclusion
References
The work presented here is based on the latter approach, automated document image analysis.
5
STEPS REQUIRED:
1. EXTRACTION OF POSSIBLE REGIONS CONTAINING INFORMATION
2. SEGMENTING INFORMATION AS TEXT/TABLE/IMAGE
3. STRUCTURING THE EXTRACTED INFORMATION
The essential steps required for complete compound document image analysis are as shown. The next animation describes each of the steps as performed on a test image.
6
Steps for reconstruction
SCANNED DOCUMENT IMAGE FILE
IDENTIFY TEXT
IDENTIFY IMAGE
IDENTIFY TABLE
[Animation: the scanned sample document, the swine flu news article with its state-wise table of casualties and affected cases, with its regions labelled TEXT, IMAGE and TABLE.]
7
Our motivation to work:
Diagram labels: Text; Tables; Figures; Text + Table; Text + Figures; Text + Table + Images
Our work draws its motivation from the fact that each of these areas has been studied to some extent, but extraction, identification and segmentation have not been covered together within a document image that is compound in nature, that is, one containing a combination of one or more sets of text, table and image regions. For more information on the independent research done in each constituent area, you can refer to the extensive research section of our paper.
8
CDIA-DS FRAMEWORK
Scanned Document Image
Stage I – Document Image Analysis
Stage II – Content Management Data Structure
Document Reconstruction
The entire system consists of two stages. Stage I, Document Image Analysis (DIA), addresses the identification and segmentation issues. Stage II, Content Management Data Structure (CMDS), addresses the representation issues using the proposed data structure.
9
Stage I – Document image analysis
Scanning, Pre-processing, Connected Component Analysis, OCR Analysis
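As a rough illustration of the connected component analysis step, the sketch below labels 4-connected foreground regions in a small binary image and returns one bounding box per region. It is a minimal pure-Python sketch, not the authors' implementation: the function name `connected_component_boxes`, the choice of 4-connectivity and the toy `page` image are all assumptions made for illustration.

```python
from collections import deque


def connected_component_boxes(binary):
    """Return (top, left, bottom, right) boxes of 4-connected foreground regions.

    `binary` is a list of rows holding 0 (background) or 1 (foreground).
    """
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy "scanned page" (assumed data): three separate ink regions.
page = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
]
boxes = connected_component_boxes(page)
```

Each returned box is a candidate region that later steps could classify as text, table or figure.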
10
Stage I – Document image analysis
Text Chunking, Image Text Filtering, Hierarchical Contour Detection, Table and Figure Region Identification
11
Stage II – Content management data structure
Unstructured Extracted Dataset
Construction of the Data Structure (CDIA-DS)
Reconstruction of Document from Data Structure
The information extracted by Document Image Analysis shares a common set of properties (Height, Width and Size). Therefore an abstract class called View is introduced as the super-class of the specific Views. Views that appear in the same orientation group (horizontal or vertical) are grouped together under a Layout class, following a parent-child node relationship in a tree.
12
TERMS AND NOTATIONS
View – The blocks of extracted information identified in a document image can be considered as a linear (vertical/horizontal) arrangement of block(s). These blocks, grouped into text/table/image entities as identified through their feature extraction, are referred to as the abstract atomic class View.
Layout – A linear combination of two or more such Views arranged in a single row/column pattern is termed a Layout. A Layout is the parent node of the Views/Layouts it contains.
TextView – A derived class of View containing Text-component specific properties.
FigureView – A derived class of View containing Figure-component specific properties.
TableView – A derived class of View containing Table-component specific properties.
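The terms above map naturally onto a small class hierarchy. The following is a minimal sketch under stated assumptions, not the authors' code: the field names (`x`, `y`, `text`, `image_path`, `cells`, `orientation`, `children`), the reading of "Size" as area, and the choice to model Layout as a subclass of View are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class View:
    """Abstract block carrying the properties shared by all extracted regions."""
    x: int
    y: int
    width: int
    height: int

    @property
    def size(self) -> int:
        # Assumption: "Size" is interpreted as the area of the block.
        return self.width * self.height


@dataclass
class TextView(View):
    text: str = ""  # hypothetical text-specific property


@dataclass
class FigureView(View):
    image_path: str = ""  # hypothetical figure-specific property


@dataclass
class TableView(View):
    cells: List[List[str]] = field(default_factory=list)  # hypothetical


@dataclass
class Layout(View):
    """Parent node grouping Views/Layouts arranged in one row or one column."""
    orientation: str = "vertical"  # "vertical" or "horizontal"
    children: List[View] = field(default_factory=list)


heading = TextView(0, 0, 400, 40, text="Swine flu tightens its grip over India")
table = TableView(0, 40, 400, 120, cells=[["Gujarat", "265", "4368"]])
page = Layout(0, 0, 400, 160, orientation="vertical", children=[heading, table])
```

Treating Layout as a kind of View lets a Layout contain other Layouts, which is what the tree structure described on the slide requires.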
13
Construction of the data structure
Finding Nearest Neighbor: A Layout/View can be combined with any other neighboring Layout/View from the complete set of Layouts and Views (L.V.).
Checking Validity of Combination: A Layout/View can be combined with another Layout/View if and only if it is possible to enclose them in a rectangle such that both combining regions are completely enclosed and no other View/Layout is partially or completely present inside.
Checking Combined Orientation: We compare the orientation (Horizontal/Vertical) of the combined Layout with that of the combining pair of Layouts/Views.
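The validity rule can be sketched as a rectangle test. This is a minimal illustration under assumptions, not the authors' implementation: regions are modelled as `(left, top, right, bottom)` tuples, the enclosing rectangle is taken to be the smallest axis-aligned one, and the helper names and the orientation rule in `combined_orientation` are hypothetical.

```python
def enclosing(a, b):
    """Smallest axis-aligned rectangle containing rectangles a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def overlaps(a, b):
    """True if rectangles a and b share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def can_combine(a, b, others):
    """Valid only if no other region lies partly or fully inside the enclosure."""
    box = enclosing(a, b)
    return not any(overlaps(box, other) for other in others)


def combined_orientation(a, b):
    # Assumption: regions stacked without vertical overlap form a vertical
    # arrangement; anything else is treated as horizontal.
    if a[3] <= b[1] or b[3] <= a[1]:
        return "vertical"
    return "horizontal"


# Assumed example page: a title above a body column and a figure column.
title = (0, 0, 400, 40)
body = (0, 40, 200, 200)
figure = (200, 40, 400, 200)

side_by_side_ok = can_combine(body, figure, [title])   # enclosure misses title
title_body_ok = can_combine(title, body, [figure])     # enclosure cuts figure
```

In this example the body and figure can be merged first into a horizontal group, after which the title can be stacked on top of that group.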
14
Reconstruction of document from data structure
A depth-first traversal of the Layout-View tree structure provides the resulting structure of the document, facilitating the recreation process. At each level we check the orientation of the Layout/View and create rectangular boxes as described by the height/width values of the class. Inherited features and extracted content information are then supplied to recreate the document in the desired editable format.
DEPTH FIRST TRAVERSAL OF N-ARY TREE
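The traversal described above can be sketched as a recursive walk that emits one rectangle per node. This is a minimal sketch assuming a simplified `Node` type; the field names, the pre-order visiting order and the rule for placing children (left to right for horizontal Layouts, top to bottom otherwise) are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str               # "Layout", "TextView", "TableView" or "FigureView"
    width: int
    height: int
    orientation: str = ""   # only meaningful for Layout nodes
    children: List["Node"] = field(default_factory=list)


def depth_first_boxes(node, x=0, y=0, out=None):
    """Visit the tree depth first, recording (kind, x, y, width, height)."""
    if out is None:
        out = []
    out.append((node.kind, x, y, node.width, node.height))
    cx, cy = x, y
    for child in node.children:
        depth_first_boxes(child, cx, cy, out)
        # Advance the cursor along the parent's orientation.
        if node.orientation == "horizontal":
            cx += child.width
        else:
            cy += child.height
    return out


# Assumed example: a title stacked above a row holding text and a table.
row = Node("Layout", 400, 160, "horizontal",
           [Node("TextView", 200, 160), Node("TableView", 200, 160)])
page = Node("Layout", 400, 200, "vertical", [Node("TextView", 400, 40), row])
boxes = depth_first_boxes(page)
```

Each emitted box could then be filled with the content extracted for that View when writing the editable output.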
15
Experimental results: STAGE 1
The data in Figure 7(a) shows a proportional relationship between the area covered by A+C (TextView and TableView) and the time consumed in their feature extraction. This trend can be attributed to the time consumed by the text-chunking step of the process flow, which involves OCR analysis. It is safe to deduce that the time taken in this step is highly dependent on the amount of text content in the image. Moreover, as seen in Figure 7(b), as the percentage of the image area occupied by TableView increases, step 2, involving hierarchical contour detection, takes more time. This indicates that the processing time is directly dependent on the area occupied by the document image content. Experiments reveal that the execution time of the framework depends on:
I. The number of Views detected in the scanned/photographed image.
II. The amount of text content and the number of table edges present in the image.
16
Experimental results: STAGE 2
Each layout/view needs to find its pair in the NEAREST NEIGHBOUR STEP. If we have n layouts/views, in the worst case the complexity becomes: O(n^2)
However, by maintaining a preferential list of closely centered views/layouts, the best-case time complexity can be reduced to: O(n/2)
Hence the actual complexity lies in this range for an average number of views/layouts found in the image, and can be approximated to: O(n*log(n))
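To make the worst-case figure concrete, the sketch below runs a brute-force nearest-neighbour search and counts the distance comparisons it performs. It only illustrates the quadratic worst case; the use of Euclidean distance between region centres and the function name `nearest_neighbours` are assumptions, and the preferential list mentioned on the slide is not modelled here.

```python
import math


def nearest_neighbours(centres):
    """Return (nearest index per centre, number of distance comparisons)."""
    comparisons = 0
    nearest = []
    for i, (xi, yi) in enumerate(centres):
        best, best_distance = None, math.inf
        for j, (xj, yj) in enumerate(centres):
            if i == j:
                continue
            comparisons += 1
            distance = math.hypot(xi - xj, yi - yj)
            if distance < best_distance:
                best, best_distance = j, distance
        nearest.append(best)
    return nearest, comparisons


# Assumed centres of four layouts/views.
centres = [(0, 0), (10, 0), (11, 0), (50, 50)]
nearest, comparisons = nearest_neighbours(centres)
```

With n centres the inner loop runs n - 1 times for each of the n outer iterations, giving n(n - 1) comparisons, which is the O(n^2) worst case.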
17
Conclusion
1. In this paper, we have analysed and organised the information contained in a compound document image through the two-stage CDIA-DS framework.
2. The CDIA-DS framework can be adopted to analyse and organise compound document images, enabling their exact recreation in electronic format.
3. As a future course of action, techniques can be developed to work on an even more comprehensive dataset.
18
REFERENCES
[1] T. Bayer, J. Franke, U. Kressel, E. Mandler, M. Oberländer, and J. Schürmann, “Towards the understanding of printed documents,” in Structured Document Image Analysis. Springer, 1992, pp. 3–35.
[2] S. R. Choudhury, P. Mitra, A. Kirk, S. Szep, D. Pellegrino, S. Jones, and C. L. Giles, “Figure metadata extraction from digital documents,” in Document Analysis and Recognition (ICDAR), th International Conference on. IEEE, 2013, pp. 135–139.
[3] C. Sumathi, T. Santhanam, and G. G. Devi, “A survey on various approaches of text extraction in images,” International Journal of Computer Science and Engineering Survey, vol. 3, no. 4, p. 27, 2012.
[4] Y. Liu, K. Bai, P. Mitra, and C. L. Giles, “Improving the table boundary detection in pdfs by fixing the sequence error of the sparse lines,” in Document Analysis and Recognition, ICDAR’09. 10th International Conference on. IEEE, 2009, pp –1010.
[5] G. Babu, P. Srimaiyee, and A. Srikrishna, “Text extraction from hetrogenous images using mathematical morphology,” Journal of Theoretical & Applied Information Technology, vol. 16, 2010.
[6] F. Liu, X. Peng, T. Wang, and S. Lu, “A density-based approach for text extraction in images,” in Pattern Recognition, ICPR th International Conference on. IEEE, 2008, pp. 1–4.
[7] S. Tupaj, Z. Shi, C. H. Chang, and H. Alam, “Extracting tabular information from text files,” EECS Department, Tufts University, Medford, USA, 1996.
19
REFERENCES
[8] T. Kasar, P. Barlas, S. Adam, C. Chatelain, and T. Paquet, “Learning to detect tables in scanned document images using line information,” in Document Analysis and Recognition (ICDAR), th International Conference on. IEEE, 2013, pp. 1185–
[9] S. Simske and X. Lin, “Creating digital libraries: content generation and re-mastering,” in Document Image Analysis for Libraries, Proceedings. First International Workshop on. IEEE, 2004, pp. 33–45.
[10] L. Cinque, S. Levialdi, and A. Malizia, “A system for the automatic layout segmentation and classification of digital documents,” in Image Analysis and Processing, Proceedings. 12th International Conference on. IEEE, 2003, pp. 201–206.
[11] K.-H. Lee, Y.-C. Choy, and S.-B. Cho, “Logical structure analysis and generation for structured documents: a syntactic approach,” Knowledge and Data Engineering, IEEE Transactions on, vol. 15, no. 5, pp. 1277–1294, 2003.
[12] G. Nagy, S. Seth, and M. Viswanathan, “A prototype document image analysis system for technical journals,” Computer, vol. 25, no. 7, pp. 10–22, 1992.
[13] W. Zhang and T. L. Andersen, “Using artificial neural networks to identify headings in newspaper documents,” in Neural Networks, Proceedings of the International Joint Conference on, vol. 3. IEEE, 2003, pp. 2283–2287.
[14] Intel Corporation, “Opencv,” June 2000, [Online; Accessed ].
[15] R. Smith, “Tesseract-ocr,” , [Online; Accessed ].
20
Thank you!