Tunable Machine Vision-Based Strategy for Automated Annotation of Chemical Database ChemReader Jungkap Park, Gus R. Rosania, and Kazuhiro Saitou University.


Tunable Machine Vision-Based Strategy for Automated Annotation of Chemical Database: ChemReader. Jungkap Park, Gus R. Rosania, and Kazuhiro Saitou, University of Michigan, Ann Arbor. Workshop on Data, Text, Web, and Social Network Mining, Apr. 23, 2010, University of Michigan, Ann Arbor.

2 Why ChemReader? Chemical databases: PubChem, ChemBank, ChemDB, ChemMine, DrugBank, GLIDA, QueryChem, … Corpus of scientific literature: journals, patents, books, papers, project reports, websites, theses, … ChemReader sits between the two.

3 Chemical structure in scientific literature. Chemical information appears as a generic name, systematic nomenclature, or index number, or as a 2D chemical structure diagram.

4 Chemical OCR: ChemReader. General chemical OCR strategy: extract 2D chemical structure diagrams from the literature and convert them to a standard chemical file format. Input: image of a chemical structure diagram. Output: SMILES string, e.g. CN1CCCC1C2=CN=CC=C2.

5 Searching for chemical information. Compounds have many synonyms, and related compounds need to be identified. Many chemical structures in journals are referenced by chemical structure diagrams. Image-based annotation: chemical database annotation using chemical OCR.

6 General recognition process. General chemical OCR process: original digital image -> connected components -> character separation -> character recognition -> bond detection -> graph compilation -> standard chemical file format (e.g. CN1CCCC1C2=CN=CC=C2).
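
The connected-components step of the process on slide 6 can be illustrated with a small, generic sketch. This is not ChemReader's implementation: the 8-connectivity choice, the function name `connected_components`, and the toy image are all assumptions made for illustration.

```python
from collections import deque


def connected_components(image):
    """Group 8-connected foreground pixels (value 1) of a binary image.

    Returns a list of components, each a list of (row, col) pixels.
    Generic breadth-first labeling; connectivity is an assumption.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a new component and flood outward from this pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


# Hypothetical 4x5 binary image containing three separate blobs.
toy_image = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
components = connected_components(toy_image)
```

In a chemical OCR setting, small components would plausibly be candidate characters (atom labels) and large ones the bond skeleton, but how ChemReader separates them is not stated on the slide.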

7 Novel features of ChemReader. Robust line and ring structure detection algorithm based on the Hough transform; chemical dictionary and chemical spell checking; pre-processing and post-processing filters to discard non-annotatable images. Figure panels: original image, analyzing image, result. Reference: Park, J.; Rosania, G. R.; Shedden, K. A.; Nguyen, M.; Lyu, N.; Saitou, K. Automated Extraction of Chemical Structure Information from Digital Raster Images. Chem. Cent. J. 2009, 3, Article 4.
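
For readers unfamiliar with the Hough transform named on slide 7, the following is a textbook line-voting sketch, not ChemReader's algorithm (which also handles ring structures and is described in the cited paper). The function names, the 1-degree angle step, and the sample pixels are assumptions.

```python
import math


def hough_accumulator(points, n_theta=180):
    """Vote in (rho, theta) space for every foreground pixel.

    Each pixel (x, y) votes for all lines rho = x*cos(theta) + y*sin(theta)
    passing through it, with theta sampled in n_theta steps over [0, pi).
    """
    votes = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(rho, t)] = votes.get((rho, t), 0) + 1
    return votes


def strongest_line(points, n_theta=180):
    """Return (rho, theta in degrees, vote count) of the accumulator peak."""
    votes = hough_accumulator(points, n_theta)
    (rho, t), count = max(votes.items(), key=lambda item: item[1])
    return rho, 180.0 * t / n_theta, count


# Hypothetical input: a horizontal "bond" at y = 5 plus two noise pixels.
bond_pixels = [(x, 5) for x in range(60)]
noise_pixels = [(3, 20), (41, 33)]
rho, theta_deg, count = strongest_line(bond_pixels + noise_pixels)
```

Because every pixel on the line votes for the same (rho, theta) cell while noise pixels scatter their votes, the peak is robust to stray marks, which is the usual motivation for using this transform on scanned diagrams.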

8 Recognition performance. The fraction of correct outputs for three image sets: Google Image Search, GLIDA images, and journal images.

9 Annotation strategy. Automated annotation by linking published journal articles to entries in a chemical database: ChemReader extracts the chemical structure diagrams; a chemical expert system screens the converted structures; similarity-based linking maximizes the number of useful links. Reference: Park, J.; Rosania, G. R.; Saitou, K. Tunable Machine Vision-Based Strategy for Automated Annotation of Chemical Databases. J. Chem. Inf. Model. 2009, Article ASAP.

10 Annotation test. Test setting: a total of 609 structure diagrams from 121 journal articles, with manually generated original connection tables. Target database: PubChem database. Two test cases demonstrate how the chemical expert system can be utilized:

                                Test I           Test II
Filtering condition             Tolerant level   Strict level
Number of survived structures

11 Chemical expert system test. Result (panels: Test I, Test II).

12 Chemical expert system test. Percentages of structures rejected, correct, and wrong (panels: Test I, Test II).

13 Chemical expert system test. Percentages of articles containing rejected, wrong, or correct structures (panels: Test I, Test II).

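14 PubChem annotation test. The filtered output structure and the original connection table are each used for a 90% Tanimoto similarity search of the PubChem database (19 million structures). Hits for the filtered output structure are the linked entries; hits for the original connection table are the relevant entries.

                Relevant: Yes          Relevant: No
Linked: Yes     True Positive (TP)     False Positive (FP)
Linked: No      False Negative (FN)    True Negative (TN)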
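
The linked/relevant bookkeeping of slide 14 can be sketched as follows, under stated assumptions. The slide does not say which fingerprint is used, so plain Python sets of integer features stand in for fingerprints here; the database entries, the identifiers `cid_1` to `cid_5`, and all function names are hypothetical. Only the 0.9 threshold comes from the slide.

```python
def tanimoto(a, b):
    """Tanimoto coefficient |A & B| / |A | B| of two feature sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def similar_entries(query, database, threshold=0.9):
    """IDs of database entries at or above the similarity threshold."""
    return {cid for cid, features in database.items()
            if tanimoto(query, features) >= threshold}


def link_counts(linked, relevant):
    """Count true positive, false positive, and false negative links."""
    tp = len(linked & relevant)
    fp = len(linked - relevant)
    fn = len(relevant - linked)
    return tp, fp, fn


# Hypothetical feature sets: the manually drawn structure and a recognized
# structure that misses one feature and adds a spurious one.
original = set(range(20))
recognized = set(range(1, 21))

# Hypothetical database standing in for PubChem.
database = {
    "cid_1": set(range(20)),
    "cid_2": set(range(1, 21)),
    "cid_3": set(range(2, 22)),
    "cid_4": set(range(19)) | {30},
    "cid_5": set(range(50, 70)),
}

relevant = similar_entries(original, database)   # from the connection table
linked = similar_entries(recognized, database)   # from the OCR output
tp, fp, fn = link_counts(linked, relevant)
```

In this toy case a small recognition error still yields two correct links, but it also adds one spurious link and misses one relevant entry, which is the kind of trade-off the confusion matrix above is meant to capture.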

15 PubChem annotation test: result. Total number of TP, FP, and FN links:

            TP        FP        FN
Test I      29,540    34,386    28,642
Test II     23,277     6,845     7,874

Averaged recall and precision rates over structures:

            Avg. Recall    Avg. Precision
Test I
Test II
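
As a rough cross-check on the link totals of slide 15, precision and recall can be computed from the pooled counts. Note the assumption: these are pooled (micro-averaged) rates over all links, which are not the same quantity as the per-structure averages the slide reports. The function and variable names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Link totals (TP, FP, FN) as listed on the slide.
link_totals = {
    "Test I": (29540, 34386, 28642),
    "Test II": (23277, 6845, 7874),
}

# Pooled over all links, NOT averaged per structure as on the slide.
pooled = {name: precision_recall(*counts)
          for name, counts in link_totals.items()}

for name, (precision, recall) in pooled.items():
    print(f"{name}: precision={precision:.3f} recall={recall:.3f}")
```

With these totals, pooled precision is about 0.46 for Test I and about 0.77 for Test II, and pooled recall is about 0.51 and 0.75 respectively, so the strict filtering level removes far more false positive links than true positive ones.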

16 PubChem annotation error analysis. Result: distribution of recall and precision rates (panels: Test I, Test II). The size of each sphere is proportional to the number of structures with the corresponding recall and precision rates.

17 Summary & Conclusion. ChemReader is a developer's tool for chemical image-based annotation of databases. We developed a tunable database annotation strategy based on user-defined relevance of hits. In the annotation test, as many as 45% of articles have true positive links to PubChem entries. Precision and recall rates can be improved with further enhancement of the recognition algorithm in ChemReader. Annotation error analysis allows rational prioritization of future development efforts.

18 Thank you!