Aniko T. Valko, Keymodule Ltd.


Introducing CLiDE Pro: A chemical OCR tool
Aniko T. Valko, Keymodule Ltd.

Chemical structure diagrams
Chemical structure diagrams are a form of representation of chemical compounds. Information contained in a structure diagram can be divided into three areas:
- Atom information: chemical elements, functional groups, generic elements; vertex label, charge, atomic weight, hybridization, etc.
- Bond information: bond orders, bond styles, bond labels
- Structural information: atom information, bond information, overall charge, structure label

What is chemical OCR for?
In the publication process, chemical structure diagrams are converted to images. All chemical information is lost!
[Figure: a chemical structure diagram next to the start of its MDL MOL (V2000) connection table, whose counts line lists 29 atoms and 31 bonds.]
- Manual reproduction: slow and prone to errors
- Chemical OCR: automatic extraction of chemical information from chemical structure depictions, 20-90 seconds per page
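To make concrete what "chemical information" means here, the following is a minimal sketch of reading atom lines from a V2000 connection table such as the one on this slide. The `Atom` class and `parse_atom_lines` function are hypothetical names introduced for illustration, and the whitespace-based splitting is a simplifying assumption: real MOL files use fixed-width columns.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    x: float
    y: float
    z: float
    symbol: str


def parse_atom_lines(lines):
    """Parse V2000 atom-block lines by whitespace: x, y, z, element symbol."""
    atoms = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        x, y, z = (float(value) for value in fields[:3])
        atoms.append(Atom(x, y, z, fields[3]))
    return atoms


# Three atom lines taken from the connection table shown on the slide.
sample = [
    "-0.7042 1.6794 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0",
    "0.1208 1.6794 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0",
    "0.7042 1.0961 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0",
]
atoms = parse_atom_lines(sample)
print([atom.symbol for atom in atoms])
```

None of this structure (element symbols, coordinates, and the bonds that follow in a full file) survives once the diagram is stored as a bitmap.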

CLiDE Pro
A chemical OCR software tool. The latest incarnation of software to emerge from the long-term CLiDE (Chemical Literature Data Extraction) project [1-3].
[1] P. Ibison, M. Jacquot, F. Kam, A. Neville, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Literature Data Extraction: The CLiDE Project. J. Chem. Inf. Comput. Sci. 1993, 33(3), 338-344.
[2] P. Ibison, F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Structure Recognition and Generic Text in the CLiDE Project. In Proceedings on Online Information 92. 1992, London, England.
[3] A. Simon and A.P. Johnson. Recent Advances in the CLiDE Project: Logical Layout Analysis of Chemical Documents. J. Chem. Inf. Comput. Sci. 1997, 37(1), 109-116.

Features
- Converts chemical images into connection tables
- Loads PDF documents, as well as TIFF and BMP image files
- Exports chemical information into MDL MOL files
- Supports document-oriented processing as opposed to page-oriented processing: the whole document is loaded and processed at once rather than as individual pages
- Handles various difficult drawing features
- Interprets generic structures
- Operates in interactive or batch mode
- Provides tools for structure and text editing

Three main problems involved in chemical OCR
1. Identification of chemical images within a document.
2. Compilation of chemical graphs of individual molecules from chemical images.
3. Interpretation of complex objects such as generic structures using the retrieved chemical graphs.

CLiDE Pro’s solutions to Problem 1
Problem 1: Identification of chemical images within a document
Document image segmentation:
- Identification of connected components
- Bottom-up layout analysis by building the tree structure of the page
[Figure: digitized image of a document page of a patent, and the segmented document highlighting recognized text blocks and graphic blocks.]
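The first step named on this slide, identification of connected components, can be sketched as a flood fill over a binary image. This is a generic illustration rather than CLiDE Pro's implementation; the function name `connected_components`, the choice of 8-connectivity, and the bounding-box output are assumptions made for the example.

```python
from collections import deque


def connected_components(image):
    """Label 8-connected components of foreground (1) pixels.

    Returns one bounding box per component as
    (min_row, min_col, max_row, max_col), in scan order.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny in range(max(0, y - 1), min(rows, y + 2)):
                    for nx in range(max(0, x - 1), min(cols, x + 2)):
                        if image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


page = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
]
boxes = connected_components(page)
print(boxes)
```

A bottom-up layout analysis would then group neighbouring boxes into larger blocks (characters into words, words into lines, and so on) to build the page tree; that grouping step is not shown here.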

CLiDE Pro’s solutions to Problem 2
Problem 2: Extraction of connection tables from chemical images
1. Chemical image
2. Classification of connected components into basic groups: characters, lines, dashes, graphics
3. Construction of dashed bonds, based on the Hough transform method [4]
4. Vectorization, based on a polygon approximation method [5]
5. Construction of atom labels: OCR, grouping characters into atom labels, recognition of superatoms
6. Construction of connection table: connecting lines to atoms, joining lines to form implicit carbon atoms
[Figure: 3D molecular structure obtained by exporting the constructed connection table (CT) to an SDF file in 2D and converting the structure from 2D to 3D.]
[4] R.O. Duda and P.E. Hart. Use of the Hough Transform to Detect Lines and Curves in Pictures. Graphics Image Process. 1972, 1.
[5] J. Sklansky and V. Gonzalez. Fast Polygonal Approximation of Digitized Curves. Pattern Recognit. 1980, 12, 327-331.
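The last step, joining lines to form implicit carbon atoms, can be illustrated by merging segment endpoints that lie close together. The sketch below is an assumption about how such a step might look, not the method used in CLiDE Pro: the function `build_connection_table`, the distance tolerance `tol`, and the input format are all hypothetical, and attaching recognized atom labels is left out.

```python
import math


def build_connection_table(segments, tol=2.0):
    """Merge nearby segment endpoints into atoms; each segment becomes a bond.

    segments: list of ((x1, y1), (x2, y2)) line segments from vectorization.
    Returns (atoms, bonds), where atoms is a list of (x, y) positions and
    bonds is a list of (atom_index, atom_index) pairs. Every atom is treated
    as an implicit carbon.
    """
    atoms = []

    def atom_index(point):
        # Reuse an existing atom if the endpoint falls within the tolerance.
        for i, existing in enumerate(atoms):
            if math.dist(existing, point) <= tol:
                return i
        atoms.append(point)
        return len(atoms) - 1

    bonds = []
    for start, end in segments:
        bonds.append((atom_index(start), atom_index(end)))
    return atoms, bonds


# A three-membered ring whose corners do not meet exactly after vectorization.
segments = [
    ((0.0, 0.0), (10.0, 0.0)),
    ((10.5, 0.4), (5.0, 8.0)),
    ((5.2, 8.3), (0.3, -0.2)),
]
atoms, bonds = build_connection_table(segments)
print(len(atoms), bonds)
```

Because only endpoints are merged, two segments that merely cross each other do not produce an atom at the crossing point.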

CLiDE Pro’s solutions to Problem 3
Problem 3: Interpretation of generic structures
1. Generic text interpretation (GTI): R-groups, substitution values, labels. Currently, GTI is limited to text in which an ‘=’ sign separates the R-groups from the substituents. However, combined assignments to R-groups are handled successfully.
2. Association of the generic text block with the structure by matching R-groups present in both the text and the structure.
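As a rough illustration of the two steps, the sketch below parses '='-separated generic text and then checks which R-groups also occur in a structure. The text syntax it accepts (statements separated by ';', substituents separated by ',', and a combined assignment written as "R1 = R2 = Me") is an assumption made for the example, as are the function names; the slides do not specify CLiDE Pro's actual rules.

```python
def parse_generic_text(text):
    """Map each R-group label to its list of substituent values."""
    assignments = {}
    for statement in text.split(";"):
        parts = [part.strip() for part in statement.split("=")]
        if len(parts) < 2 or not all(parts):
            continue  # no '=' sign or an empty side: not interpretable here
        *labels, values = parts
        substituents = [v.strip() for v in values.split(",") if v.strip()]
        for label in labels:
            assignments.setdefault(label, []).extend(substituents)
    return assignments


def shared_r_groups(assignments, structure_labels):
    """Return the R-groups present in both the text and the structure."""
    return sorted(set(assignments) & set(structure_labels))


result = parse_generic_text("R1 = R2 = Me; R3 = H, Cl")
print(result)
print(shared_r_groups(result, ["R1", "R3", "R4"]))
```

A text block would be associated with a structure only when `shared_r_groups` is non-empty; how ties between several candidate structures are broken is outside this sketch.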

Alignment of atom labels
Two types of alignment of atom labels with more than one character:
- Horizontal atom labels
- Vertical atom labels
[Figure: examples]

Alignment of atom labels
[Figure: the interpreted structure in CLiDE Pro’s GUI, showing the input image and the constructed molecule.]

Ambiguity in interpretation
- Horizontal lines representing dashes of a dashed wedged bond
- A horizontal line representing a negative charge
Resolved by contextual analysis.

Ambiguity in interpretation
[Figure: the interpreted structure in CLiDE Pro’s GUI, showing the input image and the constructed molecule.]

Ambiguity in interpretation
- A vertical line that is part of a double bond
- Vertical lines representing iodine atoms
Resolved by contextual analysis.

Ambiguity in interpretation
[Figure: the interpreted structure in CLiDE Pro’s GUI, showing the input image and the constructed molecule.]

Ambiguity in interpretation
Circles represent:
- Oxygen atoms
- Aromatic rings
Resolved by contextual analysis.

Ambiguity in interpretation
[Figure: input image and constructed molecule.]

Crossing bonds in a bridged molecule
- No extra carbon atom is generated at the point where bonds cross each other
- Functional groups are expanded in the exported structure
[Figure: input image and constructed molecule.]

A generic structure
R = H R = Me
[Figure: input image and constructed molecule.]

Bad image quality
- Isolated black spots (noise from scanning)
- Black spots touching one connected component (CC)
- Black spots merging two or more CCs
[Figure: input image and constructed molecule.]
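The simplest of these three cases, isolated black spots, can be handled by discarding very small connected components. The sketch below is a generic despeckling filter under assumed parameters (8-connectivity and a `min_pixels` threshold); it is not taken from CLiDE Pro, and it does nothing for spots that touch or merge genuine components.

```python
def remove_small_spots(image, min_pixels=3):
    """Return a copy of a binary image with tiny 8-connected components erased."""
    rows, cols = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Collect all pixels of the component containing (r, c).
            stack, pixels = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny in range(max(0, y - 1), min(rows, y + 2)):
                    for nx in range(max(0, x - 1), min(cols, x + 2)):
                        if image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            if len(pixels) < min_pixels:
                for y, x in pixels:
                    cleaned[y][x] = 0
    return cleaned


scan = [
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
]
cleaned = remove_small_spots(scan)
print(cleaned)
```

The threshold has to stay well below the size of the smallest meaningful mark, since charges and bond dashes are themselves small components.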

Bad image quality
[Figure: input image and constructed molecule.]

Conclusions and Outlook
- CLiDE Pro, a chemical OCR tool
- 3 main problems in chemical OCR and CLiDE Pro’s solutions
- The quality of interpretation depends on the ability to deal with difficult situations such as:
  - ambiguous drawing features
  - distortions resulting from bad image quality
- Goal: to extend CLiDE Pro to further chemical drawing features such as:
  - Reaction schemes (partly implemented)
  - Improved generic text interpretation (dealing with tables of R-groups)
  - Frequency variation in Markush structures
  - Positional variation in Markush structures
  - Other difficult situations (e.g. missing bonds between ring atoms)

Palytoxin: a complex structure
[Figure: input image and constructed molecule.]

Further Information
CLiDE Pro is licensed with Keymodule Ltd. and SimBioSys Inc.
http://www.keymodule.co.uk
http://www.simbiosys.ca
Live demo at Booth #817
Acknowledgments
People who previously worked on CLiDE