ChEBI, text mining and ontological best practice Colin Batchelor Royal Society of Chemistry 2008-05-19

1 ChEBI, text mining and ontological best practice Colin Batchelor Royal Society of Chemistry 2008-05-19 batchelorc@rsc.org

2 What is text mining?
Marti Hearst, Berkeley: “Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.”
Can ChEBI help?

3 Overview
- Reasoning
- ChEBI as dictionary
- Regular polysemy in chemistry
- Some possible solutions

4 Reasoning

5 Reasoning is using the logical structure of an ontology to automatically infer facts about the world which have not been explicitly added by a human being. Computers have no real-world knowledge beyond what we tell them.

6 Logical structure: properties of relations
We only have time to look at transitivity and is_a.
Smith et al., “Relations in Biomedical Ontologies”, Genome Biol., 2005, 6, R46.
Relation  Transitive  Symmetric  Reflexive  Antisymmetric
is_a      Yes         No         Yes
part_of   Yes         No         Yes

7 ChEBI’s is_a is not transitive (1)
If a relation R is transitive, then: if a R b and b R c, then a R c.
- glutathione is_a cofactor
- cofactor is_a biological role
therefore glutathione is_a biological role

8 ChEBI’s is_a is not transitive (2)
- water is_a amphiprotic solvent
- amphiprotic solvent is_a protophilic solvent (*)
- protophilic solvent is_a Bronsted base (*)
- Bronsted base is_a base
- base is_a biological role
therefore water is_a base
therefore water is_a biological role
* How come “protophilic solvent” and “Bronsted base” only have one child each?

9 ChEBI’s is_a is not transitive (3)
- N-hydroxy-L-aspartic acid is_a hydroxamic acids
- hydroxamic acids is_a organic functional classes
therefore N-hydroxy-L-aspartic acid is_a organic functional classes
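
The chains on slides 7 and 8 are what a reasoner produces if it treats every is_a edge as transitive. A minimal sketch in Python, assuming the edges are held as plain (child, parent) pairs; the function name `ancestors` and the edge list are illustrative and not part of ChEBI or of any reasoner's API:

```python
from collections import defaultdict

def ancestors(is_a_edges, term):
    """Return every term reachable from `term` by following is_a edges."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Edges copied from the slides; nothing else is assumed about ChEBI.
edges = [
    ("glutathione", "cofactor"),
    ("cofactor", "biological role"),
    ("water", "amphiprotic solvent"),
    ("amphiprotic solvent", "protophilic solvent"),
    ("protophilic solvent", "Bronsted base"),
    ("Bronsted base", "base"),
    ("base", "biological role"),
]

for term in ("glutathione", "water"):
    print(term, "is_a", sorted(ancestors(edges, term)))
```

Run over these edges, the closure places "biological role" among the ancestors of both glutathione and water, which is the unwanted inference the slides point out.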

10 is_a has many meanings!
1. An amount of a compound has a biological role: tris is_a buffer.*
2. An amount of a compound has an application: sodium dodecyl sulfate is_a detergent.*
3. A less abstract type is an example of a more abstract type: propane is_a alkanes.
4. ?!: metals is_a atoms.*
* Not a property of a lone atom or molecule!

11 Computers need facts about the world, not about ChEBI curation

12 ChEBI as dictionary

13 Evaluating name–structure conversion with ChEBI
ChEBI release 37 (26 September 2007) contains 12688 annotated entities, of which 8486 have InChI strings.
We use OSCAR3 (oscar3-chem.sourceforge.net) for name–structure conversion.
We convert chebi.obo to an XML file, each paragraph containing either a ChEBI name or an IUPAC name.
The layered structure of the InChI lets us give partial credit for incomplete matches.
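
One way to picture partial credit from InChI layers is to compare two strings after discarding selected layers. The sketch below is an assumption about how such a comparison could be done, not OSCAR3's or the author's actual scoring. It relies only on the InChI convention that layers are separated by "/" and prefixed by a letter (/f for fixed hydrogen; /b, /t, /m, /s for stereo). The function names and the alanine strings are illustrative.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")

def split_layers(inchi):
    """Split an InChI string into its '/'-separated layers."""
    return inchi.strip().split("/")

def drop_fixed_h(inchi):
    """Discard the fixed-hydrogen layer (/f) and everything after it."""
    layers = split_layers(inchi)
    # Layers 0 and 1 are the version prefix and the formula; skip them.
    for i, layer in enumerate(layers):
        if i >= 2 and layer.startswith("f"):
            return "/".join(layers[:i])
    return "/".join(layers)

def drop_stereo(inchi):
    """Discard every stereo layer, wherever it occurs after the formula."""
    layers = split_layers(inchi)
    kept = layers[:2] + [
        layer for layer in layers[2:] if not layer.startswith(STEREO_PREFIXES)
    ]
    return "/".join(kept)

def compare(predicted, reference):
    """Report which of three independent comparisons succeed."""
    return {
        "exact": predicted.strip() == reference.strip(),
        "ignoring_fixed_h": drop_fixed_h(predicted) == drop_fixed_h(reference),
        "ignoring_stereo": drop_stereo(predicted) == drop_stereo(reference),
    }

# Illustrative strings: alanine with and without stereo layers.
with_stereo = "InChI=1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
without_stereo = "InChI=1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
print(compare(with_stereo, without_stereo))
```

The three checks are kept independent here because the slides do not say how the "disregarding" categories nest.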

14 Results: IUPAC names
Total                                              8447
Identified as chemical                             8255 (97.73%)
With InChI (upper bound)                           1810 (21.43%)
Matching InChI, disregarding fixed hydrogen layer  1734 (20.53%)
Matching InChI, disregarding stereo                1176
Matching InChI, exact (lower bound)                1174 (13.90%)
Not all of name matched                            1024
Name identified as two or more separate names       974 (11.53%)

15 Results: ChEBI names
Total                                              8146
Identified as chemical                             7173 (88.06%)
With InChI (upper bound)                           1036 (12.72%)
Matching InChI, disregarding fixed hydrogen layer   953 (11.70%)
Matching InChI, disregarding stereo                 637
Matching InChI, exact (lower bound)                 628 (7.71%)
Not all of name matched                             764
Name identified as two or more separate names       373 (4.58%)

16 Regular polysemy

17 Regular polysemy
… where words stand for multiple things in a consistent way.
Examples:
- Brand names
- Grinding
- Figure–ground
- Exact–class–part polysemy in chemistry
Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.

18 Regular polysemy
Brand names: “Learning to buy a Renault and talk to BMW”
Grinding: “The squirrel scampered down the path and kept stopping and looking at the officers to check they were behind” vs. “[…] the trick was to serve squirrel fresh and not to leave it hanging like other game”

19 Regular polysemy
Figure–ground
- Audrey Hepburn painted the door (figure)
- Audrey Hepburn walked through the door (ground)
- The Incredible Hulk walked through the door (ambiguous)

20 Methyl, the radical (exact)

21 Methyl, the group (part)

22 Can ChEBI handle methyl?
methyl group    (CHEBI:32875)  YES
methyl radical  (CHEBI:29309)  YES

23 Imidazole (exact)

24 An imidazole (class)

25 Imidazole side-chain/group/ring (part)

26 Can ChEBI handle imidazole?
imidazoles        (CHEBI:24780)  YES
imidazole         (CHEBI:16069)  YES
imidazole ring                   not yet
imidazolyl group                 not yet

27 Mapping exact, class and part to entries in ChEBI
Tests:
1. Has InChI: exact
2. Name is plural: class
3. Ends in –yl, “group” or “residue”: part
Test 2 doesn’t work for applications or roles. Test 3 is brittle.
I would much rather use the logical structure of the ontology.
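
The three tests can be written down as a small string-based classifier. This is a sketch under stated assumptions: the slide does not give an order of precedence, so the tests are applied in the order listed; "plural" is approximated by a trailing "s"; and the `has_inchi` flag is supplied by the caller. The function name `classify` is illustrative.

```python
def classify(name, has_inchi):
    """Guess whether a ChEBI entry denotes an exact compound, a class or a part."""
    lowered = name.lower().strip()
    # Test 1: an entry with an InChI is taken to be an exact compound.
    if has_inchi:
        return "exact"
    # Test 2: a plural name is taken to be a class (naive trailing-"s" check).
    if lowered.endswith("s"):
        return "class"
    # Test 3: names ending in -yl, "group" or "residue" are taken to be parts.
    if lowered.endswith(("yl", "group", "residue")):
        return "part"
    return "unknown"

# The has_inchi values below are assumptions made for illustration.
for name, has_inchi in [
    ("imidazole", True),
    ("imidazoles", False),
    ("methyl group", False),
    ("cofactor", False),
]:
    print(name, "->", classify(name, has_inchi))
```

The sketch shows the brittleness noted on the slide: a role such as "cofactor" falls through every test, and any singular name that happens to end in "s" would be misread as a class.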

28 Some possible solutions

29 Some possible solutions (1)
- ChEBI must represent facts about the world rather than about itself.
Examples:
- If unclassified compounds have a structure, they should be in the molecular structure tree rather than the unclassifieds tree.
- “organic functional classes” is a tool for assigning nomenclature. No chemical compound is an “organic functional class”.

30 Some possible solutions (2)
ChEBI must distinguish between what is always true and what is only sometimes true.
Example:
- Replace some is_a relationships with has_biological_role and has_application.
We need ChEBI to represent parts of molecules that aren’t substituents. They should all be descendants of molecular part (a new term), as should amino acid residues and nucleoside residues.
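
A rough sketch of how typed relations would change what a reasoner infers, assuming the ontology is held as (subject, relation, object) triples. The triples reuse examples from earlier slides, recast with the proposed relation names. The helper names are hypothetical, and letting roles be inherited along is_a is an assumption of this sketch, not something the slides specify.

```python
def is_a_ancestors(triples, term):
    """Follow only is_a edges upward from `term`."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        for subject, relation, obj in triples:
            if subject == current and relation == "is_a" and obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen

def related(triples, term, relation):
    """Targets of `relation` asserted on `term` or on any of its is_a ancestors."""
    sources = {term} | is_a_ancestors(triples, term)
    return {o for s, r, o in triples if r == relation and s in sources}

# Illustrative triples: roles and applications are no longer is_a edges.
triples = [
    ("glutathione", "has_biological_role", "cofactor"),
    ("cofactor", "is_a", "biological role"),
    ("sodium dodecyl sulfate", "has_application", "detergent"),
    ("propane", "is_a", "alkanes"),
]

print(is_a_ancestors(triples, "glutathione"))
print(related(triples, "glutathione", "has_biological_role"))
print(is_a_ancestors(triples, "propane"))
```

With the role moved to its own relation, glutathione has no is_a ancestors in this toy graph, so it is no longer inferred to be a biological role, while propane is still inferred to be an alkane.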

31 Questions?

