1 InfoChem / ETH Zürich. Copyright © 2009 Brändle, Eigner-Pitto. Fraunhofer Symposium on Text Mining, Bonn, October 5-6, 2009
Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt: Unrivaled Historical Information Meets Modern Technology
M. Brändle (ETH Zürich), V. Eigner-Pitto (InfoChem GmbH)

2 Historical Importance of Chemisches Zentralblatt
Timeline: Gmelin Handbook (1817 …), Chemisches Zentralblatt (1830-1969), Beilstein Handbook (1881 …), Chemical Abstracts (1907 …, content from 1840). Further year labels on the timeline: 1771, 1772.
Chemisches Zentralblatt:
  First and oldest abstracts journal in chemistry
  Covers chemical literature from 1830 to 1969
  Describes the birth of chemistry as a science (vs. alchemy)
Chemical Abstracts:
  Biggest and single abstracts source in chemistry
  Currently >31 million papers and patents
  Content 1840-1906 added retrospectively

3 Chemisches Zentralblatt: Content
Covers 140 years of chemistry
About 3.6 million abstracts (journal articles, patents)
900,000 pages (115,000 for the period 1830-1906)
  700,000 pages with abstracts
  200,000 pages of indexes (Register)
Indexes and year of introduction:
  Author: 1830
  Subject, alphabetic: 1830
  Subject, systematic: 1863
  Patent: 1897
  Formula: 1925
  General indexes: 1883

4 History of Chemisches Zentralblatt: Rise
1830: Pharmaceutisches Central-Blatt, 403 abstracts / 544 pages / 10 journals, weekly after 8 months
1850: Title changes to Chemisch-Pharmaceutisches Central-Blatt
1856: Chemisches Central-Blatt
1864: Introduction of a systematic table of contents; classification of chemistry
1879: First patent abstracts in "kleinen Mittheilungen"
1883: 1st edition of General Index
1884: In-text images
1888: 273 journals excerpted

5 History of Chemisches Zentralblatt: Prosperity
1897: Ownership passes to the Deutsche Chemische Gesellschaft for DM 15,000. Introduction of patent index.
1901: Editorial office moves from Leipzig to Berlin.
1919: Takes over abstracts from Angew. Chem. Split into scientific (I/III) and technical part (II/IV).
1921: Begins to cover foreign patents.
1924: CZ is reunified into one journal of abstracts.
1925: Introduction of formula index.
1929: Centennial: Richard Willstätter emphasizes timeliness, exactness and completeness as attributes and requirements for the quality of CZ.

6 History of Chemisches Zentralblatt: Decline
1940-1945: WW II: difficulties in collecting information. 1944 bombing of editorial office.
1947-1949: Double production of CZ in East and West Germany.
1950: Reunification of CZ under East and West German organisations.
1954: Trying to fill the gap with supplement volumes.
1961: Berlin Wall does not hinder production.
1967: Introduction of SRD (Schnellreferatedienst, quick abstract service) for organic chemistry.
1969: GDR office declares itself unable to afford production of SRD and of the journal. CZ ceases publication. SRD continued as Chemischer Informationsdienst (ChemInform).
Editorial offices: East Berlin and West Berlin.
[Chart: pages over time, with CA for comparison]

7 Chemisches Zentralblatt vs. CA: Quantity
[Two charts comparing CZ and CA over time: pages (annotated WW I, WW II, CA format change) and abstracts (annotated WW I, WW II)]

8 Chemisches Zentralblatt vs. CA: Quality
Many textbooks on chemical literature claim better quality of Chemisches Zentralblatt than CA for the pre-WW II period:
H. Skolnik, The literature matrix of chemistry, 1982: outstanding A/I service
R.E. Maizell, How to find chemical information, 3rd ed. 1998, citing E.J. Crane: [..] has value because of [..] good abstracts
M. Mücke, Die chemische Literatur, 1982 (translated from German): Although CA was numerically [..] superior to Chemisches Zentralblatt, the reverse was true as regards the quality of the abstracts.
R.T. Bottle, J.F. Rowland, Information Sources in Chemistry, 4th ed. 1993: Before WW II, many chemists regarded CZ as superior in coverage to CA; its abstracts were longer and more informative [...]
A.S.K. Atsu, Comparative coverage of chemical abstracting services in the period 1906-1940, M.Sc. Thesis, City University, London (1976)

9 Chemisches Zentralblatt vs. CA: Quality
Example: Hans Fischer, Georg Stangler, Synthese des Mesoporphyrins, Mesohämins und über die Konstitution des Hämins, Justus Liebigs Ann. Chem. 459 (1927), 53-98.

                      CZ I (1928), 528    CA 22:11339 (1928), 1363
Length (pages)        7.5                 1
Length (words)        3,882               690
Length (chars)        24,308              4,695
Compounds             ~120                ~70
Structure formulas    (values not preserved in the transcript)

10 Chemisches Zentralblatt: Digitalization
Relevant for documentation of prior art
Continuous and growing demand for the information
FIZ Chemie Berlin has scanned the whole work and offers a full-text searchable database for the web and the dataset for integration in intranets
ETH Zurich has bought the digitalized raw material (PDFs with OCRed text in the background) from FIZ and is creating a database offering full-text search
  900,000 PDF pages, 1.3 TB
  Raw text content incl. search index about 10 GB
CAS has performed automatic translation (German → English) of the 1897-1907 volumes and included them in CAplus

11 Reasons for buying digitalized Chem. Zentralblatt
www.infochembio.ethz.ch/en/holdings.html

12 Reasons for buying digitalized Chem. Zentralblatt
Space
  Loss of compact shelving space in basement (432 m → 194 m, -55%)
  Disposal of printed Beilstein, CA, Chem. Zentralblatt
Access
  E-books, e-journals, end-user databases at the chemist's workbench
  Chemists are trained on electronic sources; print and microfilm are cumbersome
Restoration costs due to deterioration of acid-containing paper
  17K/t for deacidification: Chem. Zentralblatt 1.6 t → 27K
  Digitalization and operation costs much higher (10x), but can be divided
Ease of use: Search / Browse / Print

13 Quality of Obtained Raw Data
Errors upon conversion
Visual inspection of pages: Cover Flow / Quick Look technology

14 Quality of Raw Data Observed: Page Errors
File errors (conversion)
  Unreadable directories (missing content)
  Defective PDF files (missing content)
Errors during scanning (visual inspection)
  Duplicate pages (shifting page index)
  Missing pages (shifting page index, missing content)
  Issues scanned in wrong order (minor)
  Two pages on one (shifting page index)
  Wrong volume (missing content)
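The "shifting page index" errors on the slide above can be illustrated with a minimal sketch. It assumes that a printed page number is available for every scan in scan order (for example from the OCR text); that input, and the function name `find_index_shifts`, are hypothetical and not taken from the talk.

```python
def find_index_shifts(printed_pages):
    """Return (position, step) wherever a printed page number does not
    follow its predecessor by exactly one.

    printed_pages: printed page numbers in scan order (hypothetical input).
    """
    shifts = []
    for pos in range(1, len(printed_pages)):
        step = printed_pages[pos] - printed_pages[pos - 1]
        if step != 1:
            # step == 0: duplicate scan; step > 1: missing page(s) or
            # two pages on one scan; step < 0: issues in the wrong order
            shifts.append((pos, step))
    return shifts


# Toy sequence: page 3 scanned twice, page 6 missing.
scan = [1, 2, 3, 3, 4, 5, 7, 8]
print(find_index_shifts(scan))
```

Every reported position marks a point after which the scan index and the printed page number drift apart, which is what breaks page-based citations such as "1904, 2, 1145".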

15 Quality of Raw Data Observed: OCR
ETH works with OCR from FIZ Chemie
Page word index: 346 million words, 8.8% with only 1 character
  Slightly expanded fonts, e.g. for author names, sum formulas
  Abbreviations (journal names, Zentralblatt = C), numbers
  Element names in structure formulas
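The share of one-character words quoted above (8.8% of 346 million, roughly 30 million entries) is a simple ratio over the word index. A minimal sketch of how such a figure could be computed for a piece of OCR text, assuming plain whitespace tokenization; the function name and the sample string are illustrative only.

```python
def single_char_share(words):
    """Fraction of tokens that consist of exactly one character."""
    words = list(words)
    if not words:
        return 0.0
    singles = sum(1 for w in words if len(w) == 1)
    return singles / len(words)


# Letter-spaced text inflates the count: the spaced word below is
# split into 11 one-character tokens by whitespace tokenization.
sample = "Die D a r s t e l l u n g von C6H6"
print(round(single_char_share(sample.split()), 3))
```

A high share on a given page is a hint that letter-spaced text, abbreviations or formula fragments dominate the OCR output there.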

16 Planned Tasks ETH Zürich
Adding navigation structure, provide DB search and browse for ETH members (Q4/09)
Mining and markup (Q1/10)
  Bibliographic references
  Authors
  General subject headings
Reference linking to journal articles and patents (Q1/10)

17 Chemisches Zentralblatt: Conclusion
Covers chemical literature from 1830 to 1969
Very good abstract quality
  Better quality (length, details) than CA for the pre-WW II period 1907-1940
  Also contains important patent information
  Invaluable information in indexes (e.g. synonyms of ancient chemical names)
Only comprehensive abstract journal on the market up to 1907
  More comprehensive than CA for 19th century literature
  Complements Beilstein and Gmelin handbooks for 19th century literature

18 Importance of Chemisches Zentralblatt: Example
Org. Lett., 2006, 8 (19), pp 4279–4281
Chemisches Zentralblatt, 1904, 2, 1145
The authors retracted this paper on November 15, 2007 (Org. Lett. 2007, 24, 5139)

19 InfoChem Motivation
Text search in Chemisches Zentralblatt:
  Abstracts in German language
  High number of old German chemical names
Chemists think in structures!
Language-independent structure search would help ALL scientists to access this historical source and to use the relevant information of this art
Required technology for structure search projects:
  Optimized German-English dictionaries
  30 million SPRESI names

20 Overview of Approach and Applied Technology
[Diagram. Elements shown: .tiff documents; OCR; PDF documents (text under image); NER and N2S (IC ANNOTATOR, SPRESI, dictionaries); database; link to original literature; combined search on federated search system (IC FEDSEARCH); manual abstraction of sample set for evaluation, with quantitative comparison]

21 Challenges OCR (1)
[Figure: samples labelled 1830, 1870, 1910, 1930 and 1969]

22 Challenges OCR (2)
Bad quality of original source:
  Dirty (blotted, stained) pages
  Print from back page

23 Challenges OCR (3)
Tables: extremely small fonts; beginning and end of columns not recognizable

24 Challenges OCR (4)
Ambiguous old fonts (h=b; c=e; ligatures)
Spaced text
Specific rules, large German dictionaries and extensive training are applied to correct systematic mistakes of the standard OCR process
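The slide names two systematic problems (the confusable letter pairs h/b and c/e, and letter-spaced text) and says that rules plus large dictionaries are used to correct them. The following is only a toy sketch of that general idea, not InfoChem's actual procedure: the three-word lexicon, the cap `MAX_AMBIGUOUS` and both function names are assumptions made for illustration, and input is expected in lowercase.

```python
import re
from itertools import product

# Confusion pairs named on the slide: h/b and c/e.
CONFUSIONS = {"h": "hb", "b": "bh", "c": "ce", "e": "ec"}
MAX_AMBIGUOUS = 10  # hypothetical cap that keeps the candidate set small

# Three or more single characters separated by single spaces.
SPACED = re.compile(r"\b(?:\w ){2,}\w\b")


def collapse_spaced(text):
    """Join letter-spaced runs such as 'F i s c h e r' into one word."""
    return SPACED.sub(lambda m: m.group(0).replace(" ", ""), text)


def correct(word, lexicon):
    """Return the unique lexicon word reachable by swapping confusable
    letters, or the word unchanged if there is none or more than one."""
    if word in lexicon:
        return word
    if sum(ch in CONFUSIONS for ch in word) > MAX_AMBIGUOUS:
        return word
    options = [CONFUSIONS.get(ch, ch) for ch in word]
    matches = {"".join(combo) for combo in product(*options)} & lexicon
    return matches.pop() if len(matches) == 1 else word


LEXICON = {"schwefel", "benzol", "chlor"}  # toy stand-in for a large dictionary
print(correct("sehwefel", LEXICON))
print(correct("henzol", LEXICON))
print(collapse_spaced("von F i s c h e r und"))
```

Refusing to correct when several dictionary words match is a deliberate choice in this sketch: a wrong chemical name is worse for later annotation than an uncorrected one. Note that `collapse_spaced` would also merge two adjacent letter-spaced words into one.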

25 Challenges Annotation (1)
Names lack position, valence or stoichiometric information
  Pimarsäure: is it the R or L form?
  Platinchlorid: in which oxidation state, II, III or IV?
Chemical names that indicate a chemical class
  Nitrolsäure (nitrolic acid)
  Lactonsäure (lactonic acid): any of several acids with a lactone ring bearing the carboxylic group
Mixed compounds
  Eunole: Naphthole + Eucalyptusöl
  Pikrotoxin: Pikrotoxinin + Pikrotin
NO solution: correct structure information is not available in the original source

26 Challenges Annotation (2)
Obsolete German language
  Schwefelsaures Natrium, Chlorür, Bromür
Historical names
  Pelopeum, Columbium → Niobium
Different spelling for the same name:
  Dibrom… / Bibrom…
  Ätzkali / Aetzkali
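Spelling variants and historical names like those above are typically mapped to one lookup key before a dictionary search. A minimal sketch using only the examples from the slide; the specific rules (folding umlauts to ae/oe/ue, rewriting a leading "Bibrom" to "Dibrom") and the tiny mapping table are assumptions for illustration, not the rules actually used in the project.

```python
import re

# Fold umlauts so that 'Ätzkali' and 'Aetzkali' share one key.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

# Historical names from the slide, mapped to the modern element name.
HISTORICAL = {"columbium": "niobium", "pelopeum": "niobium"}


def normalize_name(name):
    """Return a lowercase lookup key for a (historical) German name."""
    key = name.strip().lower().translate(UMLAUTS)
    # Hypothetical rule for the 'Bibrom…' / 'Dibrom…' variant on the slide.
    key = re.sub(r"^bibrom", "dibrom", key)
    return HISTORICAL.get(key, key)


print(normalize_name("Ätzkali"), normalize_name("Aetzkali"))
print(normalize_name("Bibrombenzol"))
print(normalize_name("Columbium"))
```

Folding umlauts to digraphs rather than the other way round avoids corrupting names that legitimately contain "ue" or "oe".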

27 Solutions in Annotation Process
Correction of German-specific grammar
Translation into English of chemical names that are not available
Research in old sources:
  Beilstein
  Brockhaus Encyclopedia
  German-English dictionaries of chemistry
  Meyers Encyclopedia
  Pierer Encyclopedia
References to very old books, journals, articles
  Naturwissenschaftliche Exzerpte und Notizen Mitte 1877 bis Anfang 1883 by Karl Marx

28 Results Annotation Chemisches Zentralblatt
120,000 pages covering the period 1830-1907
2.4 million chemical names with associated structure
  98,000 unique names
  47,000 unique structures
Quantitative comparison with manually abstracted sample set
  Recall: 51%
  Precision: 87%
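Recall and precision in a comparison like the one above are computed from the overlap between the automatically annotated names and the manually abstracted ones. A minimal sketch of the two definitions; the name sets below are made up for illustration and are not data from the evaluation.

```python
def precision_recall(found, expected):
    """Precision and recall of an automatic annotation against a
    manually abstracted reference set."""
    found, expected = set(found), set(expected)
    true_positives = len(found & expected)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Made-up example: 4 names in the manual set, 3 names found, 2 correct.
manual = {"benzol", "aetzkali", "niobium", "schwefelsaeure"}
automatic = {"benzol", "niobium", "zentralblatt"}
precision, recall = precision_recall(automatic, manual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Read this way, the reported figures mean that about 87% of the automatically assigned names were correct, while about half of the manually identified names were found.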

29 Federated Search Prototype

30 Federated Search Prototype

31 Federated Search Prototype

32 Summary
Described history, content and present-day importance of Chemisches Zentralblatt
Illustrated how the challenges of the OCR and annotation processes have been addressed
Time period 1830-1907 contains 98,000 unique names and 47,000 unique structures
Quantitative comparison shows over 50% recall and nearly 90% precision
Generated structure-searchable Chemisches Zentralblatt database is integrated in IC FEDSEARCH

33 Outlook

Chemisches Zentralblatt    Phase 1, Q2 2009    Phase 2, Q4 2009
Pages                      120,000             900,000
Time period                1830-1907           1830-1969
Unique names               98,000              ca. 1 million
Unique structures          47,000              ca. 500,000
Recall                     50%                 ?

34 Acknowledgements
Prof. Dr. Deplanque, Mr. Heineke and FIZ Chemie Team Berlin
Ms. Langanke
InfoChem Team
Chemistry Biology Pharmacy Information Center (ETH Zürich)
Thank you!
InfoChem GmbH: www.infochem.de, www.spresi.com, info@infochem.de
ETH Zürich: www.infochembio.ethz.ch, braendle@chem.ethz.ch

35 Chemical nomenclature and orthography
Chemical nomenclature
  <1889: no rules
  1889: World congress, international committee formed
  1892: Geneva congress, first nomenclature rules (hydrocarbons, benzene derivatives)
Orthography
  1903: New orthography rules inconsistently used (c → z, ph → f)
  CZ resilient to new rules because of indexing
  1907: CZ adopts uniform orthography for scientific and technical terms (1906)
Molecular formula systems
  1884: Richter System (Cn + order for frequent elements)
  1900: Hill System (CnHm + alphabetic order), used by USPTO and CA
  1925: Richter System used by CZ, changes to Hill in 1956
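The Hill System mentioned above orders a molecular formula as carbon first, hydrogen second, then all other elements alphabetically; formulas without carbon are ordered purely alphabetically. A minimal sketch, assuming the composition is already available as a mapping from element symbol to atom count (the function name is illustrative):

```python
def hill_formula(counts):
    """Format an element-count mapping in Hill order.

    With carbon present: C, then H, then the rest alphabetically.
    Without carbon: all elements alphabetically, H included.
    """
    counts = {el: n for el, n in counts.items() if n}
    if "C" in counts:
        rest = sorted(el for el in counts if el not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    # A count of 1 is left implicit, as is usual in molecular formulas.
    return "".join(el + (str(counts[el]) if counts[el] != 1 else "") for el in order)


print(hill_formula({"H": 5, "O": 1, "C": 2, "Br": 1}))  # bromoethanol
print(hill_formula({"H": 2, "S": 1, "O": 4}))           # sulfuric acid
```

Because the order is fully determined by the composition, a formula index built this way can be searched without knowing any compound name, which is why the choice of system matters for the CZ formula index.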

