OCR Errors in Patent Full-Text Documents. Wolfgang Thielemann, IRF Symposium 2007, 8th and 9th November, Vienna.

1 OCR Errors in Patent Full-Text Documents: Perspective of an information professional (IRF Symposium 2007, 8th and 9th November, Vienna)

2 Searching in full-text patents
Many requests related to pharmaceutical R&D include one or more of the following topics:
- Compounds / Drugs
- Drug actions
- Indications
- Formulations
What kind of errors do we have to deal with when searching or mining for these aspects?

3 Searching in full-text patents
MicroPatent PatSearch was used for all of the following searches. Years: 1981 to now.
Other full-text patent sources such as Espacenet, STN or PatBase have the same type of OCR error problems!

4 Text Mining approach
In a typical workflow of a thesaurus-based text mining approach, OCR errors can lead to losses twice.
[Workflow diagram. Labels: Generation of synonyms for search; Search in various DBs; Download of results; Text mining (thesauri & rules, co-occurrence & semantic); Post-processing; Standardization of extracted terms; Analysis & visualization; Highlighting; Hyperlinking; Optimization of search strategy; OCR]

5 Typical OCR Errors
Examples: “l” or “i” or “1”?
Variations of alkyl groups:
methyi or ethyi or propyi or butyi: 17388 patents!
methy1 or ethy1 or propy1 or buty1: 13118 patents!
Variations of emulsion:
emuision: 780 patents
emulslon: 47 patents
emuislon: 3 patents

6 Typical OCR Errors
Examples: “rn” or “m”? “l” or “1” or “i”?
Variations of micro*:
rnicro*: in 5398 patents
mlcro*: in 1004 patents
m1cro*: in 344 patents
Two OCR errors in such a short word are rare:
rnlcro*: in 12 patents
rn1cro*: in 4 patents
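The variant searches above can be generated mechanically. The following is a minimal sketch, assuming only the confusion pairs shown on the slides (l/i/1 and m/rn); the function names, the `max_errors` parameter and the generic "or" query syntax are illustrative assumptions, not the syntax of MicroPatent PatSearch or any other database.

```python
from itertools import product

# Confusion sets taken from the slide examples; illustrative, not exhaustive.
CONFUSIONS = {
    "l": ["l", "i", "1"],
    "i": ["i", "l", "1"],
    "m": ["m", "rn"],
}


def ocr_variants(term, max_errors=1):
    """Return spellings of `term` with 1..max_errors confusable characters replaced."""
    options = [CONFUSIONS.get(ch, [ch]) for ch in term]
    variants = set()
    for combo in product(*options):
        # Count the positions where a substitution was made.
        errors = sum(1 for orig, new in zip(term, combo) if orig != new)
        if 1 <= errors <= max_errors:
            variants.add("".join(combo))
    return sorted(variants)


def or_query(terms):
    """Join terms into a generic OR query string (hypothetical syntax)."""
    return " or ".join(terms)


print(or_query(ocr_variants("methyl")))
print(or_query(ocr_variants("micro", max_errors=2)))
```

Limiting `max_errors` to 1 or 2 follows the observation that two OCR errors in a short word are already rare. The number of candidates grows quickly with word length, so the sketch is only practical for short terms.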

7 OCR Errors: Formulations
Some variations of “microemulsion”:
ijcroemulsion, licroemulsion, micro emulsion, microemuision, microemulsion, micro-emulsion, microémulsion, micro-émulsion, microemulsionbased, microenulsion, miroemulsion, miucroemulsion, ormicroemulsion, rnicroemulsion

8 OCR Errors: Compound Names
Searching full-text patents (WO, EP, US, FR, GB, DE, JP) for the term “Simvastatin” yields 9030 patents (3666 INPADOC families). But there are 392 more patents which are not found due to typos and OCR errors.

9 OCR Errors: Chemical Names
If you think that was bad... look at the IUPAC names:
WO2007096753: 6(R)-[2-(8'(S)-2",2"-dimethylbutyryloxy-2'(S),6'(R)-dimethyl- l',2',6',7,'8',8a'(R)- hexahydronapthyl-l'(S))-ethyl]-4(R)-hydroxy -3,4-5,6-tetrahydro- 2H-pyran-2-one
WO2005095374: 6(R)-[2-[8(5)-(2,2-dimethyl.butyyloxy)-2 (S), 6 (R)-dimethyl-1, 2, 6, 7, 8, 8a(R)- hexahydro-l (S)-napthylelhyl/-4(R)-hydroxy-3, 4, 5, 6-tetrahydro-2H-pyran-2 one
WO2005095374: 6(R)-[2-[8(S)-(2, 2-dimethylbulyryloxy)-2 (S), 6 (R)-dimethyl-1, 2, 6, 7, 8, 8a(R)- hexabydro-l (S)-napthylethyl/-4(R)-hydroxy-3, 4, 5, 6-tetrahydro-2H-pyran-2 one
WO2003018570: 6(R)-[2-[8(S)-(2,2 10 dimethylbutylyloxy)-2(S),6(R)-dimethyl-1,2, 6,7,8,8a(R) hexahydronaphthyl]-l(S)ethyl]-4(R)-hydroxy-3,4,5,6 tetra hydro-2H-pyrane-2-one
WO2003048149: 6(R)-[2-[8(S)-(2,2- dimethylbutylyloxy)-2(S),6(R)-dimethyl-1,2,6,7,8,8a(R)- hexahydronaphthyl]-l(S)ethyl]-4(R)-hydroxy-3,4,5,6 20 tetrahydro-2H-pyrane-2-on
WO2003018570: 6(R)-[2-[8(S)-(2,2-dimethylbutylyloxy)-2(S),6(R)-dimeth yl-1,2,6,7,8,8a(R)- hexahydronaphthyl]-l(S) ethyl]-hydrox y-3,4,5,6-tetrahydro-2H-pyrane-2-one
WO2005095374: 6(R)-[2-[8(S)-(2,2-dimethylbutyrylaxy)-2 (S),6 (R)-dimethyAl, 2, 6, 7, 8, 8a(R)- hexahydro-l (S)-napthylJethyl)-4(R)-hydroxy-3, 4, 5, 6-tetrahydro-2H-pyran-2 one
WO2006072963: 6(R)-{2[8(S)-(2,2dimethylbutyryloxy)2(5),6(R).. dimethyI.. 1,2,6,7,8,8a(R)- hexahydro-1 (S)-naphthylJethy1J-4(R)hydroxy3,4,5, 6 tetrahydro-2H-pyran-2-one

10 OCR Errors: Chemical Names
In 141 patents containing the IUPAC name of Simvastatin, not one (!) contained the correct name:
6(R)-[2-[8(S)-(2,2-dimethylbutyryloxy)-2(S),6(R)-dimethyl-1,2,6,7,8,8a(R)-hexahydronaphthyl]-1(S)ethyl]-4(R)-hydroxy-3,4,5,6-tetrahydro-2H-pyran-2-one
After removing all characters which are not a letter or number:
6R28S22dimethylbutyryloxy2S6Rdimethyl126788aRhexahydronaphthyl1Sethyl4Rhydroxy3456tetrahydro2Hpyran2one
13 out of 141 patents were found...
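The normalization step described above (removing every character that is not a letter or number) can be sketched as follows. The function names are illustrative assumptions; the sketch only shows the string comparison, not how the 141 patents were retrieved.

```python
import re

# Correct IUPAC name of Simvastatin as given on the slide.
CORRECT_NAME = (
    "6(R)-[2-[8(S)-(2,2-dimethylbutyryloxy)-2(S),6(R)-dimethyl-1,2,6,7,8,8a(R)-"
    "hexahydronaphthyl]-1(S)ethyl]-4(R)-hydroxy-3,4,5,6-tetrahydro-2H-pyran-2-one"
)


def normalize(name):
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", name)


def contains_name(text, name=CORRECT_NAME):
    """True if the normalized name occurs in the normalized text."""
    return normalize(name) in normalize(text)


print(normalize(CORRECT_NAME))
```

Normalization repairs spacing and punctuation damage only. A name in which OCR has turned "1" into "l" or "butyryl" into "butylyl" still differs after normalization, which is consistent with only 13 of 141 patents being found.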

11 OCR Errors: Chemical Names
Searching for (long) IUPAC names in full-text patents will miss most hits.
This is very relevant for all applications which convert IUPAC names into chemical structures!
Nevertheless, searching for brand names or generic names will certainly find additional relevant hits, especially as these names are often mentioned several times in a document.

12 OCR Errors: Drug Action
Found variations of Angiotensin II antagonists:
angiotensin 11 antagonist
angiotensin 1I antagonist
angiotensin I1 antagonist
angiotensin I1: antagonist
angiotensin 1:I antagonists
angiotensin H hypertension antagonist
angiotensin I! antagonist
angiotensin Id antagonist
angiotensin IEI antagonist
angiotensin If antagonists
Angiotensin li antagonist
Anniotensin I I Antanonist
agiotensin II antagonist
angiotensen-il receptor antagonists
angiotensin 1:[ antagonist
angiotensin I[ antagonists
angiotensin I[[ antagonist
angiotensin IJ antagonists
angiotensin In antagonist
angiot ensin IT antagonist
angiotensin (ff) antagonist
angiotensin (II) antagonists
angiotensin 1[ receptor antagonists
angiotensin 1E[ antagonist
angiotensin fI antagonistic
angiotensin I[[ receptor antagonism
angiotensin J7 antagonists
angiotensin JI hypertension antagonist
angiotensin JJ hypertension antagonist
angiotensin li I antagonists
angiotensin!l antagonist
angiotensin:[I antagonists
angiotensin][I antagonist
Angioten-sin-il Antagonisten
Angiotensin-JI Antagonisten
Even very short fragments like the roman numeral “II” can cause a lot of trouble!
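One way to catch many of these variants is a gap-tolerant regular expression that accepts a few arbitrary characters where the numeral should be. This is a rough sketch under stated assumptions: the gap limit `MAX_GAP`, the optional "receptor" and the function name are illustrative choices, not the method used to collect the list above.

```python
import re

# Allow 1-8 arbitrary characters where the numeral "II" should be, and an
# optional "receptor" before the "antagoni..." stem. MAX_GAP is an assumed value.
MAX_GAP = 8
PATTERN = re.compile(
    r"angiotensin(?P<gap>.{1,%d}?)(?:receptor\s+)?antagoni" % MAX_GAP,
    re.IGNORECASE,
)


def find_garbled_numerals(text):
    """Return the raw text found in the numeral position for every match."""
    return [m.group("gap").strip() for m in PATTERN.finditer(text)]


sample = (
    "angiotensin 11 antagonist; angiotensin!l antagonist; "
    "angiotensin 1[ receptor antagonists"
)
print(find_garbled_numerals(sample))
```

The pattern trades precision for recall: it would also match "angiotensin I antagonist", and it misses variants where the word "angiotensin" itself is damaged (agiotensin, Anniotensin) or where extra words push the gap beyond the limit. Hits still need manual review.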

13 Transposed Characters
Some errors cannot originate from an erroneous OCR process. Accidentally transposed characters are another source of variations:
ehtyl: 1565 patents
mehtyl: 840 patents
compuond: 231 patents
relaese: 44 patents
formual: 1689 patents
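Transposition variants like those above can be enumerated by swapping each pair of neighbouring characters. A minimal sketch, with an illustrative function name:

```python
def adjacent_transpositions(term):
    """Return every distinct spelling obtained by swapping two neighbouring characters."""
    variants = set()
    for i in range(len(term) - 1):
        # Swapping two identical characters would reproduce the original term.
        if term[i] != term[i + 1]:
            swapped = term[:i] + term[i + 1] + term[i] + term[i + 2:]
            variants.add(swapped)
    return sorted(variants)


print(adjacent_transpositions("ethyl"))
```

Most generated spellings will never occur in a database, so the list is best used as candidate search terms whose hit counts are then checked.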

14 Wrong Names / Orthography
Many errors are the result of bad spelling or lack of knowledge of the correct name / orthography. Sometimes foreign terms slip into patents.
natrium: 687 patents
kalium: 431 patents
adenosin: 382 patents
naphtyl: 11206 patents
napthyl: 11276 patents
esther: 1387 patents
Only US and GB patents were searched!

15 Wrong Names / Orthography
Sepracor Inc. used the name “Sildenophil” 64 times (in 18 patents) without once mentioning the correct name “Sildenafil”:
US6974837 B2, “Compositions comprising sibutramine metabolites in combination with phosphodiesterase inhibitors”, SEPRACOR INC
“...Particular phosphodiesterase inhibitors include, but are not limited to, sildenophil (Viagra®), desmethylsildenophil, vinopocetine, milrinone...”
Wrong names (used accidentally or on purpose) are an additional source of variations.

16 Missing Space Characters
Missing space characters can easily cause losses.
Example: drug action analyses of pharmaceutical patents. An extraction based on rules like:
target1 with agonist
target2 with agonist
target3 with agonist
etc....
will miss those hits which have no space character between the target name and the term “agonist”:
PDE 4agonist
Adenosin A2agonist
Left truncation is not very helpful: “*agonist” would also yield the antagonists!
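A rule that tolerates the missing space without resorting to left truncation can be sketched with a regular expression that makes the whitespace optional. The target list and function names below are hypothetical placeholders; a real extraction would take the targets from a thesaurus.

```python
import re

# Hypothetical target names for illustration; a thesaurus would supply these.
TARGETS = ["PDE 4", "Adenosin A2"]


def agonist_pattern(target):
    """Match `target` followed by 'agonist', with or without whitespace in between."""
    return re.compile(re.escape(target) + r"\s*agonist", re.IGNORECASE)


def find_agonist_targets(text, targets=TARGETS):
    """Return the targets that appear directly before 'agonist' in the text."""
    return [t for t in targets if agonist_pattern(t).search(text)]


print(find_agonist_targets("a PDE 4agonist and an Adenosin A2 antagonist"))
```

Because "agonist" must follow the target directly (apart from optional whitespace), "PDE 4 antagonist" is not matched, which avoids the problem of “*agonist” also returning the antagonists.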

17 Scrambled Tables
OCR result of a compound table (the slide showed it next to the original):
YE R--'n': J (1a) Table 1 Cpd. R. R2- R3 1-1 2 4-bis-trifluoro- CH3 H methyl-phenyl.. O 1-2 2,4- dichlorophenyl CH3 \ O H 1-3 2-trifluoromethyl4- CH3 H carboxyamino- \ O phenyl 14 3-(2-(1-ethyl- CH3 \1 H propoxy)-6-trifluoro- \/<O methyl)- pyridine l ' 1-5 2-cyano-4-trifluoro- CH3 H methyl-phenyl \ I 1-6 2 4 dichlol he yl CH3 ·1/ H 1-7 2,4-dichlorophenyl CH3 J XO H

18 Error Types
Common error types interfering with searching / text mining:
- OCR letter misinterpretation: I-1-l (methy1), m – rn (rnicro)
- Typos: mehtyl, relaese or compuond
- Intentional errors or lack of knowledge: Sildenophil
- Spacing errors: ...agonists
- OCR misinterpretation of text areas: inclusion of line numbers into phrases, scrambled table structures, inclusion of characters from chemical structures into phrases

19 Conclusions
What have we learned?
- All patent full-text databases contain (lots of) OCR errors
- Only some of the errors are common/systematic enough to be included in searches or text mining approaches
- Numerous errors are so severe and unpredictable that they can only be corrected manually
- Even documents not created via OCR regularly contain errors
The quality of future OCR documents will improve, but re-scanning the huge backfile is unrealistic.
Smart error correction algorithms and reference lists can help, but good solutions for efficient manual scanning are very important too!

20 Thank you!
