OCR Errors in Patent Full-Text Documents
Wolfgang Thielemann
IRF Symposium 2007, 8th and 9th November, Vienna

Presentation transcript:

Slide 1: OCR Errors in Patent Full-Text Documents
Perspective of an information professional
IRF Symposium 2007, 8th and 9th November, Vienna

Slide 2: Searching in full-text patents
Many requests related to pharmaceutical R&D include one or more of the following topics:
- Compounds / Drugs
- Drug actions
- Indications
- Formulations
What kinds of errors do we have to deal with when searching or mining for these aspects?

Slide 3: Searching in full-text patents
MicroPatent PatSearch was used for all of the following searches. Years: now
Other full-text patent sources such as Espacenet, STN or PatBase have the same types of OCR error problems!

Slide 4: Text Mining approach
In a typical workflow of a thesaurus-based text mining approach, OCR errors can lead to losses twice.
Workflow diagram labels: Generation of synonyms for search; Search in various DBs; Download of results; Text mining (thesauri & rules, co-occurrence & semantic); Post-processing; Analysis & Visualization; Highlighting; Hyperlinking; Optimization of search strategy; Standardization of extracted terms; OCR

Slide 5: Typical OCR Errors
Examples: “l” or “i” or “1”?
Variations of alkyl groups:
- methyi or ethyi or propyi or butyi: patents!
- methy1 or ethy1 or propy1 or buty1: patents!
Variations of emulsion:
- emuision: 780 patents
- emulslon: 47 patents
- emuislon: 3 patents

Slide 6: Typical OCR Errors
Examples: “rn” or “m”? “l” or “1” or “i”?
Variations of micro*:
- rnicro* in 5398 patents
- mlcro* in 1004 patents
- m1cro* in 344 patents
Two OCR errors in such a short word are rare:
- rnlcro* in 12 patents
- rn1cro* in 4 patents
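
The character confusions shown on these slides ("l" / "i" / "1" and "m" / "rn") can be folded into a single search pattern instead of being typed out one variant at a time. The following is a minimal Python sketch under stated assumptions: the confusion groups are taken from the slide examples, and the function name `ocr_tolerant_pattern` is hypothetical. It is not the method used to obtain the patent counts above.

```python
import re

# Assumed confusion groups, based only on the examples in the slides.
CONFUSION_GROUPS = [
    ("l", "i", "1", "I"),
    ("m", "rn"),
]

def ocr_tolerant_pattern(term):
    """Build a regex that matches `term` and its simple OCR look-alikes."""
    lookup = {}
    for group in CONFUSION_GROUPS:
        alternation = "(?:" + "|".join(re.escape(g) for g in group) + ")"
        for member in group:
            # Only single characters of the search term are expanded.
            if len(member) == 1:
                lookup[member] = alternation
    parts = [lookup.get(ch, re.escape(ch)) for ch in term]
    return re.compile("".join(parts))

pattern = ocr_tolerant_pattern("micro")
for candidate in ["micro", "rnicro", "mlcro", "m1cro", "rnlcro", "rn1cro", "macro"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
```

A pattern like this only covers the systematic substitutions; it does nothing for missing or inserted characters.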

Slide 7: OCR Errors: Formulations
Some variations of “microemulsion”:
- ijcroemulsion
- licroemulsion
- micro emulsion
- microemuision
- microemulsion
- micro-emulsion
- microémulsion
- micro-émulsion
- microemulsionbased
- microenulsion
- miroemulsion
- miucroemulsion
- ormicroemulsion
- rnicroemulsion
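
Variants like these are not all covered by fixed character substitutions, so a similarity score is one way to collect them from an index of terms. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the threshold of 0.8 and the function names are assumptions chosen for illustration, and a real term index would need the threshold tuned to avoid false hits.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity between two strings, from 0.0 (unrelated) to 1.0 (equal)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def close_variants(target, candidates, threshold=0.8):
    """Return the candidates whose similarity to `target` reaches the threshold."""
    return [c for c in candidates if similarity(c, target) >= threshold]

# Variants taken from the slide, plus one unrelated term for contrast.
terms = [
    "ijcroemulsion", "licroemulsion", "microemuision", "microenulsion",
    "miroemulsion", "miucroemulsion", "rnicroemulsion", "antagonist",
]
print(close_variants("microemulsion", terms))
```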

Slide 8: OCR Errors: Compound Names
Searching full-text patents (WO, EP, US, FR, GB, DE, JP) for the term “Simvastatin” yields 9030 patents (3666 INPADOC families).
But there are 392 more patents which are not found because of typos and OCR errors.

Slide 9: OCR Errors: Chemical Names
If you think that was bad... look at the IUPAC names:
- WO (R)-[2-(8'(S)-2",2"-dimethylbutyryloxy-2'(S),6'(R)-dimethyl- l',2',6',7,'8',8a'(R)- hexahydronapthyl-l'(S))-ethyl]-4(R)-hydroxy -3,4-5,6-tetrahydro- 2H-pyran-2-one
- WO (R)-[2-[8(5)-(2,2-dimethyl.butyyloxy)-2 (S), 6 (R)-dimethyl-1, 2, 6, 7, 8, 8a(R)- hexahydro-l (S)-napthylelhyl/-4(R)-hydroxy-3, 4, 5, 6-tetrahydro-2H-pyran-2 one
- WO (R)-[2-[8(S)-(2, 2-dimethylbulyryloxy)-2 (S), 6 (R)-dimethyl-1, 2, 6, 7, 8, 8a(R)- hexabydro-l (S)-napthylethyl/-4(R)-hydroxy-3, 4, 5, 6-tetrahydro-2H-pyran-2 one
- WO (R)-[2-[8(S)-(2,2 10 dimethylbutylyloxy)-2(S),6(R)-dimethyl-1,2, 6,7,8,8a(R) hexahydronaphthyl]-l(S)ethyl]-4(R)-hydroxy-3,4,5,6 tetra hydro-2H-pyrane-2-one
- WO (R)-[2-[8(S)-(2,2- dimethylbutylyloxy)-2(S),6(R)-dimethyl-1,2,6,7,8,8a(R)- hexahydronaphthyl]-l(S)ethyl]-4(R)-hydroxy-3,4,5,6 20 tetrahydro-2H-pyrane-2-on
- WO (R)-[2-[8(S)-(2,2-dimethylbutylyloxy)-2(S),6(R)-dimeth yl-1,2,6,7,8,8a(R)- hexahydronaphthyl]-l(S) ethyl]-hydrox y-3,4,5,6-tetrahydro-2H-pyrane-2-one
- WO (R)-[2-[8(S)-(2,2-dimethylbutyrylaxy)-2 (S),6 (R)-dimethyAl, 2, 6, 7, 8, 8a(R)- hexahydro-l (S)-napthylJethyl)-4(R)-hydroxy-3, 4, 5, 6-tetrahydro-2H-pyran-2 one
- WO (R)-{2[8(S)-(2,2dimethylbutyryloxy)2(5),6(R).. dimethyI.. 1,2,6,7,8,8a(R)- hexahydro-1 (S)-naphthylJethy1J-4(R)hydroxy3,4,5, 6 tetrahydro-2H-pyran-2-one

Slide 10: OCR Errors: Chemical Names
In 141 patents containing the IUPAC name of Simvastatin, not one (!) contained the correct name:
6(R)-[2-[8(S)-(2,2-dimethylbutyryloxy)-2(S),6(R)-dimethyl-1,2,6,7,8,8a(R)-hexahydronaphthyl]-1(S)ethyl]-4(R)-hydroxy-3,4,5,6-tetrahydro-2H-pyran-2-one
After removing all characters which are not a letter or number:
6R28S22dimethylbutyryloxy2S6Rdimethyl126788aRhexahydronaphthyl1Sethyl4Rhydroxy3456tetrahydro2Hpyran2one
13 out of 141 patents were found...
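
The normalization described on this slide, removing every character that is not a letter or number, can be sketched in a few lines of Python. The function name `strip_non_alphanumeric` and the spacing variant below are hypothetical; the slides do not say how the step was actually implemented.

```python
import re

def strip_non_alphanumeric(name):
    """Remove every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", name)

correct = (
    "6(R)-[2-[8(S)-(2,2-dimethylbutyryloxy)-2(S),6(R)-dimethyl-"
    "1,2,6,7,8,8a(R)-hexahydronaphthyl]-1(S)ethyl]-4(R)-hydroxy-"
    "3,4,5,6-tetrahydro-2H-pyran-2-one"
)

# Hypothetical variant that differs only in spacing and punctuation.
spaced = correct.replace(",", ", ").replace("-2-one", "-2 one")

print(strip_non_alphanumeric(correct))
print(strip_non_alphanumeric(spaced) == strip_non_alphanumeric(correct))
```

This kind of normalization absorbs differences in spacing, brackets and hyphens, but a misread letter or digit (for example "1" read as "l") still changes the normalized string, which fits the low hit rate reported on the slide.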

Slide 11: OCR Errors: Chemical Names
Searching for (long) IUPAC names in full-text patents will miss most hits.
This is very relevant for all applications which convert IUPAC names into chemical structures!
Nevertheless, searching for brand names or generic names will certainly find additional relevant hits, especially as these names are often mentioned several times in a document.

Slide 12: OCR Errors: Drug Action
Found variations of Angiotensin II antagonists:
- angiotensin 11 antagonist
- angiotensin 1I antagonist
- angiotensin I1 antagonist
- angiotensin I1: antagonist
- angiotensin 1:I antagonists
- angiotensin H hypertension antagonist
- angiotensin I! antagonist
- angiotensin Id antagonist
- angiotensin IEI antagonist
- angiotensin If antagonists
- Angiotensin li antagonist
- Anniotensin I I Antanonist
- agiotensin II antagonist
- angiotensen-il receptor antagonists
- angiotensin 1:[ antagonist
- angiotensin I[ antagonists
- angiotensin I[[ antagonist
- angiotensin IJ antagonists
- angiotensin In antagonist
- angiot ensin IT antagonist
- angiotensin (ff) antagonist
- angiotensin (II) antagonists
- angiotensin 1[ receptor antagonists
- angiotensin 1E[ antagonist
- angiotensin fI antagonistic
- angiotensin I[[ receptor antagonism
- angiotensin J7 antagonists
- angiotensin JI hypertension antagonist
- angiotensin JJ hypertension antagonist
- angiotensin li I antagonists
- angiotensin!l antagonist
- angiotensin:[I antagonists
- angiotensin][I antagonist
- Angioten-sin-il Antagonisten
- Angiotensin-JI Antagonisten
Even very short fragments like the Roman numeral “II” can cause a lot of trouble!
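
Because the garbling of “II” is so unpredictable, one pragmatic option is to stop matching the numeral itself and accept any short token in its place. The following Python sketch illustrates the idea; the pattern, the limit of four characters and the function name are assumptions, not something taken from the slides. It is deliberately over-permissive (it would also accept "angiotensin I antagonist") and it does not catch variants where "angiotensin" itself is misread.

```python
import re

# "angiotensin", optional punctuation or space, a short token of 1 to 4
# non-space characters standing in for "II", then "antagoni..." with an
# optional "receptor" in between.
ANGIOTENSIN_ANTAGONIST = re.compile(
    r"angiotensin\W*\S{1,4}\s+(?:receptor\s+)?antagoni",
    re.IGNORECASE,
)

def mentions_angiotensin_antagonist(text):
    """Return True if the text contains an angiotensin <short token> antagonist phrase."""
    return ANGIOTENSIN_ANTAGONIST.search(text) is not None

examples = [
    "angiotensin 11 antagonist",
    "angiotensin (II) antagonists",
    "angiotensin 1[ receptor antagonists",
    "Angiotensin-JI Antagonisten",
    "angiotensin converting enzyme inhibitor",
]
for phrase in examples:
    print(phrase, mentions_angiotensin_antagonist(phrase))
```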

Slide 13: Transposed Characters
Some errors cannot originate from an erroneous OCR process. Accidentally transposed characters are another source of variations:
- ehtyl: 1565 patents
- mehtyl: 840 patents
- compuond: 231 patents
- relaese: 44 patents
- formual: 1689 patents
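
Transposition variants are predictable enough to generate automatically for a search term. As a minimal sketch (the function name `adjacent_transpositions` is hypothetical), the following Python code swaps each pair of neighbouring characters once:

```python
def adjacent_transpositions(word):
    """Return all variants of `word` with one pair of adjacent characters swapped."""
    variants = set()
    for i in range(len(word) - 1):
        # Swapping two identical characters would just reproduce the word.
        if word[i] != word[i + 1]:
            swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            variants.add(swapped)
    return variants

for term in ["ethyl", "methyl", "compound", "release", "formula"]:
    print(term, sorted(adjacent_transpositions(term)))
```

Each of the five misspellings on the slide is among the variants generated for its correct spelling. Many of the other generated variants will never occur in practice, so they add search terms without adding hits.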

Slide 14: Wrong Names / Orthography
Many errors are the result of bad spelling or lack of knowledge of the correct name / orthography. Sometimes foreign terms slip into patents.
Only US and GB patents were searched!
- natrium: 687 patents
- kalium: 431 patents
- adenosin: 382 patents
- naphtyl: 11206 patents
- napthyl: 11276 patents
- esther: 1387 patents

Slide 15: Wrong Names / Orthography
Sepracor Inc. used the name “Sildenophil” 64 times (in 18 patents) without once mentioning the correct name “Sildenafil”:
US B2 “Compositions comprising sibutramine metabolites in combination with phosphodiesterase inhibitors”, SEPRACOR INC.
“...Particular phosphodiesterase inhibitors include, but are not limited to, sildenophil (Viagra®), desmethylsildenophil, vinopocetine, milrinone...”
Wrong names (used accidentally or on purpose) are an additional source of variations.

Slide 16: Missing Space Characters
Missing space characters can easily cause losses.
Example: drug action analyses of pharmaceutical patents. An extraction based on rules like:
- target1 with agonist
- target2 with agonist
- target3 with agonist
- etc.
... will miss those hits which have no space character between the target name and the term “agonist”:
- PDE 4agonist
- Adenosin A2agonist
Left truncation is not very helpful: “*agonist” would also yield the antagonists!
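
Where the search system supports regular expressions, the antagonist problem with left truncation can be avoided with a negative lookbehind. This is a small Python sketch of that idea, not a feature of any of the patent databases named in this talk; the pattern and function name are assumptions.

```python
import re

# Match "agonist" or "agonists" unless it is directly preceded by "ant",
# so "4agonist" is found but "antagonist" is not.
AGONIST = re.compile(r"(?<!ant)agonists?\b", re.IGNORECASE)

def mentions_agonist(text):
    """Return True if the text contains "agonist(s)" that is not part of "antagonist(s)"."""
    return AGONIST.search(text) is not None

for phrase in ["PDE 4agonist", "Adenosin A2agonist", "PDE 4 agonist",
               "angiotensin II antagonist"]:
    print(phrase, mentions_agonist(phrase))
```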

Slide 17: Scrambled Tables
Original: (table image)
OCR result:
YE R--'n': J (1a) Table 1 Cpd. R. R2- R bis-trifluoro- CH3 H methyl-phenyl.. O 1-2 2,4- dichlorophenyl CH3 \ O H trifluoromethyl4- CH3 H carboxyamino- \ O phenyl 14 3-(2-(1-ethyl- CH3 \1 H propoxy)-6-trifluoro- \/<O methyl)- pyridine l ' cyano-4-trifluoro- CH3 H methyl-phenyl \ I dichlol he yl CH3 ·1/ H 1-7 2,4-dichlorophenyl CH3 J XO H

Slide 18: Error Types
Common error types interfering with searching / text mining:
- OCR letter misinterpretation: I-1-l (methy1), m-rn (rnicro)
- Typos: mehtyl or relaese or compuond
- Intentional errors or lack of knowledge: Sildenophil
- Spacing errors: ...agonists
- OCR misinterpretation of text areas:
  - inclusion of line numbers into phrases
  - scrambled table structures
  - inclusion of characters from chemical structures into phrases

Slide 19: Conclusions
What have we learned?
- All patent full-text databases contain (lots of) OCR errors
- Only some of the errors are common/systematic enough to be included in searches or text mining approaches
- Numerous errors are so severe and unpredictable that they can only be corrected manually
- Even documents not created via OCR regularly contain errors
The quality of future OCR documents will improve, but re-scanning the huge backfile is unrealistic.
Smart error correction algorithms and reference lists can help, but good solutions for efficient manual scanning are very important too!

Slide 20: Thank you!