Three Approaches to GO-Tagging Biomedical Abstracts Neil Davis Henk Harkema Rob Gaizauskas Yikun Guo University of Sheffield Jon Ratcliffe InforSense Moustafa.


1 Three Approaches to GO-Tagging Biomedical Abstracts. Neil Davis, Henk Harkema, Rob Gaizauskas, Yikun Guo (University of Sheffield); Jon Ratcliffe (InforSense); Moustafa Ghanem, Tom Barnwell, Yike Guo (Imperial College London). Symposium on Semantic Mining in Biomedicine /4/6

2 SMBM Introduction On-going explosive growth of biomedical literature Text Mining techniques can help through: Extractive processes: extracting terms or facts from papers for searching and linking Structuring processes: grouping papers based on content for conceptual navigation of large document collections GO-tag project: Annotating biomedical papers with terms from the Gene Ontology

3 SMBM Gene Ontology Provides common descriptive framework for genes and gene products across species Consists of three structured, controlled vocabularies (ontologies) that describe genes and gene products in terms of: Biological processes Cellular components Molecular functions

4 SMBM Gene Ontology. Contains almost 20,000 terms. GO Slim (87 terms): subset of all GO terms; aims to give a broad overview of ontology content; can be species-specific. Typical GO term:
  Term name: isotropic cell growth
  Accession: GO:
  Ontology: biological_process
  Synonyms: related: uniform cell growth
  Definition: The process by which a cell irreversibly increases in size uniformly in all directions. In general, a rounded cell morphology reflects isotropic cell growth.

5 SMBM Common Use of GO. Associations of genes and gene products with GO terms in model organism and protein databases: FlyBase, SGD, MGD. For example (from SGD):
  Gene | GO Annotation | References | Evidence Code
  ACT1 | Structural constituent of cytoskeleton | Botstein D, et al. (1997) The yeast cytoskeleton | Traceable Author Statement
  ACT1 | Exocytosis | Pruyne D and Bretscher (2000) Polarization of ... in yeast | Traceable Author Statement
  ACT1 | Exocytosis | Botstein D, et al. (1997) The yeast cytoskeleton | Traceable Author Statement
  ACT1 | Histone acetyltransferase complex | Galarneau L, et al. (2000) Multiple links between the NuA4 … | Inferred from Direct Assay

6 SMBM GO-Tagging Task: given a text (PubMed abstract) and GO/GO Slim, assign 0 or more GO terms to the text if the text is about the process/component/function identified by the GO term Only most specific terms are assigned No association of GO term with specific genes or gene products User scenarios: Research scientists: clustering of PubMed search results Database curators: identifying texts that may support Gene-GO term associations
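The "only most specific terms are assigned" rule on this slide can be illustrated with a minimal sketch. The tiny parent map below is a hypothetical toy fragment, not the real GO graph, and the function names are made up for illustration; the talk does not say how the rule was implemented.

```python
# Hypothetical toy ontology fragment: term -> set of direct parent terms.
PARENTS = {
    "isotropic cell growth": {"cell growth"},
    "cell growth": {"growth"},
    "growth": set(),
    "exocytosis": {"secretion"},
    "secretion": set(),
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def most_specific(terms, parents=PARENTS):
    """Drop any candidate term that is an ancestor of another candidate."""
    terms = set(terms)
    covered = set()
    for term in terms:
        covered |= ancestors(term, parents)
    return terms - covered


candidates = {"growth", "cell growth", "isotropic cell growth", "secretion"}
print(sorted(most_specific(candidates)))
```

Here "growth" and "cell growth" are dropped because a more specific descendant is also a candidate, while "secretion" is kept because none of its descendants was proposed.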

7 SMBM Outline of Rest of Talk Data sets / Gold standards SGD Gold Standard IC Gold Standard Three approaches to GO-tagging Lexical look-up Information retrieval approach Machine learning Evaluation results Conclusions

8 SMBM SGD Gold Standard. Derive Gold Standard from SGD model organism database (yeast). Given the annotated genes in SGD, assign a GO term T to a paper P if the paper P is referenced in support of a Gene-GO term association involving T. SGD Gold Standard: 4922 PMIDs, 2455 GO terms, PMID-GO term pairs
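The derivation rule on this slide amounts to inverting gene annotations into a paper-to-terms map. A minimal sketch follows, assuming annotations are available as (gene, GO term, PMID) rows; the row layout, the PMIDs, and the gene name GENE2 are hypothetical placeholders, not SGD's actual export format.

```python
from collections import defaultdict

# Hypothetical annotation rows: (gene, GO term, PMID of supporting reference).
ANNOTATIONS = [
    ("ACT1", "structural constituent of cytoskeleton", "1001"),
    ("ACT1", "exocytosis", "1002"),
    ("ACT1", "exocytosis", "1001"),
    ("GENE2", "exocytosis", "1002"),
]


def derive_gold_standard(annotations):
    """Assign GO term T to paper P if P supports some Gene-GO association involving T."""
    gold = defaultdict(set)
    for _gene, go_term, pmid in annotations:
        # The gene itself is discarded: the task has no gene-level association.
        gold[pmid].add(go_term)
    return dict(gold)


def count_pairs(gold):
    """Number of distinct PMID-GO term pairs in the gold standard."""
    return sum(len(terms) for terms in gold.values())


gold = derive_gold_standard(ANNOTATIONS)
print(len(gold), "PMIDs,", count_pairs(gold), "PMID-GO term pairs")
```

Two genes supporting the same term from the same paper collapse into a single PMID-GO term pair, which is why the pair count can be smaller than the number of annotation rows.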

9 SMBM SGD Gold Standard. Advantages: SGD data already exists, so no further annotation work is required; more Gold Standard data is available from other model organism databases. Disadvantage: the list of Gene-GO term assignments in SGD is incomplete for our task. Each paper is associated with GO terms whose assignment to specific genes it supports, but the paper may be missing other GO terms which can also be legitimately attached to it; the list does not contain all papers supporting a given assignment. Consequence: SGD Gold Standard is GO-term incomplete; weak measure of Recall; Precision figures difficult to interpret.

10 SMBM SGD Gold Standard Further issue: SGD Gene-GO term assignments are based on full papers, whereas system only has access to abstracts Consequence: Limit on maximum Recall obtainable by system

11 SMBM IC Gold Standard. Manually extend SGD Gold Standard to obtain GO-term complete annotation. Select SGD papers for which all GO term assignments are supported by abstract or title. Semi-automatically add further GO terms by fuzzy term matching + post-editing. IC Gold Standard: 785 PMIDs, 1006 GO terms, 5170 PMID-GO term pairs
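The slide does not say which fuzzy matching method was used. As one possible illustration only, the sketch below proposes candidate GO terms by comparing each term against same-length word windows of the abstract with Python's `difflib.SequenceMatcher`; the 0.8 threshold and the function name are assumptions, and the proposed candidates would still need the manual post-editing the slide mentions.

```python
from difflib import SequenceMatcher


def fuzzy_candidates(abstract, go_terms, threshold=0.8):
    """Propose GO terms whose text is similar to some word window in the abstract."""
    words = abstract.lower().split()
    candidates = set()
    for term in go_terms:
        width = len(term.split())
        for start in range(len(words) - width + 1):
            window = " ".join(words[start:start + width])
            similarity = SequenceMatcher(None, term.lower(), window).ratio()
            if similarity >= threshold:
                candidates.add(term)
                break  # one sufficiently similar window is enough
    return candidates


abstract = "Mitochondrial morphology in yeast cells"
print(fuzzy_candidates(abstract, ["mitochondrion", "exocytosis"]))
```

In this example "mitochondrion" is proposed because it is close to "mitochondrial", a variant that exact lexical look-up would miss.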

12 SMBM IC Gold Standard Advantage Closer to GO-term complete Gold Standard Disadvantages Still not GO-term complete Direct mentions of GO terms vs. semantically inferred GO terms Gold Standard creation method favors lexical look-up approach to GO-tagging Data set is small

13 SMBM Outline of Rest of Talk Data sets / Gold standards SGD Gold Standard IC Gold Standard Three approaches to GO-tagging Lexical look-up Information retrieval approach Machine learning Evaluation results Conclusions

14 SMBM Lexical Look-Up (Task: given a text (PubMed abstract) and GO/GO Slim, assign 0 or more GO terms to the text if the text is about the process/component/function identified by the GO term) GO term T is assigned to a paper if term T occurs in the abstract of the paper Simple & fast baseline GO terms recognized in text can be used as features in Machine Learning approach

15 SMBM Lexical Look-Up. Web service calls to Termino term tagger. Term classes in Termino: GO terms, GO term synonyms, SGD yeast gene names. Lexical look-up method: case-insensitive; simple morphological analysis ("cells" is mapped onto "cell"; "mitochondrial" and "mitochondria" are not mapped onto "mitochondrion").
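The look-up behaviour described on this slide can be approximated in a few lines. This is a minimal sketch and not the Termino tagger: the plural-stripping rule is a deliberately crude stand-in for "simple morphological analysis", and all function names are hypothetical.

```python
import re


def normalise(text):
    """Lower-case, tokenise, and strip a plain plural 's' (cells -> cell)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


def contains(tokens, phrase):
    """True if the token list `phrase` occurs contiguously inside `tokens`."""
    width = len(phrase)
    return width > 0 and any(
        tokens[i:i + width] == phrase for i in range(len(tokens) - width + 1)
    )


def lexical_lookup(abstract, go_terms):
    """Assign GO term T if T occurs in the abstract after normalisation."""
    tokens = normalise(abstract)
    return {term for term in go_terms if contains(tokens, normalise(term))}


abstract = "Yeast Cells show isotropic cell growth before budding; mitochondria were also imaged."
terms = ["cell", "isotropic cell growth", "mitochondrion", "exocytosis"]
print(sorted(lexical_lookup(abstract, terms)))
```

As on the slide, "Cells" matches "cell", while "mitochondria" does not match "mitochondrion" because the irregular plural is outside what this simple rule covers.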

16 SMBM Lexical Look-Up Results. Recall: full text (SGD) vs. abstracts only (IC); inherent drawbacks of lexical look-up (term variation, literal mentions); effects of Gold Standard creation method (IC). Precision: effects of Gold Standard creation method (IC). GO vs. GO Slim: recognizing GO Slim terms is easier than recognizing GO terms.

17 SMBM Lexical Look-Up Extensions. GO term T is assigned to a paper if a synonym of term T occurs in the abstract of the paper. GO term T is assigned to a paper if a yeast gene name associated with term T occurs in the abstract of the paper. Effects on performance: adding synonyms gives a slight decrease in Precision and a substantial increase in Recall; adding yeast terms gives a substantial decrease in Precision and a substantial increase in Recall.
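Both extensions can be sketched as extra surface forms that trigger a GO term. The synonym and the ACT1 associations below are taken from the earlier slides; the dictionary layout, the whole-word regular-expression matching, and the function names are assumptions made for illustration.

```python
import re

# Synonym from the "Typical GO term" slide; ACT1 associations from the SGD example slide.
SYNONYMS = {"isotropic cell growth": ["uniform cell growth"]}
GENE_TO_TERMS = {"ACT1": ["structural constituent of cytoskeleton", "exocytosis"]}


def mentions(text, phrase):
    """Case-insensitive whole-word search for `phrase` in `text`."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def extended_lookup(abstract, go_terms, synonyms=None, gene_to_terms=None):
    """Assign a term if the term, one of its synonyms, or an associated gene name occurs."""
    synonyms = synonyms or {}
    gene_to_terms = gene_to_terms or {}
    assigned = set()
    for term in go_terms:
        surface_forms = [term] + synonyms.get(term, [])
        if any(mentions(abstract, form) for form in surface_forms):
            assigned.add(term)
    for gene, terms in gene_to_terms.items():
        if mentions(abstract, gene):
            assigned.update(terms)  # every term linked to the gene is added
    return assigned


abstract = "Loss of ACT1 disrupts uniform cell growth."
go_terms = ["isotropic cell growth", "exocytosis"]
print(extended_lookup(abstract, go_terms))
print(extended_lookup(abstract, go_terms, SYNONYMS))
print(extended_lookup(abstract, go_terms, SYNONYMS, GENE_TO_TERMS))
```

The gene-name extension adds every term linked to ACT1 whether or not the abstract is about that process, which is consistent with the substantial drop in Precision reported on the slide.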

18 SMBM IR-Based Approach Document collection For each GO term, create a document consisting of the GO term, its synonyms, and its definition Query For each paper, create a query consisting of the words in the abstract of the paper Given a query (i.e., abstract), retrieve relevant documents (i.e., GO terms) from the document collection Assign top-ranked 5, 10, … GO terms to abstract
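The role reversal on this slide (GO terms as documents, abstracts as queries) can be sketched with a small TF-IDF and cosine-similarity ranker. This is a stand-in for illustration only: it is not Lucene's scoring function, it omits stemming, the stop-word list is a hypothetical short sample, and the definitions given for the second and third toy documents are paraphrased placeholders.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "by", "in", "is", "of", "the", "to", "which"}


def tokenize(text):
    """Lower-case, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """docs: name -> text. Returns (name -> sparse TF-IDF vector, word -> idf)."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    doc_freq = Counter()
    for bag in counts.values():
        doc_freq.update(bag.keys())
    idf = {w: math.log(len(docs) / df) + 1.0 for w, df in doc_freq.items()}
    vectors = {n: {w: tf * idf[w] for w, tf in bag.items()} for n, bag in counts.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_go_terms(abstract, go_docs, top_k=5):
    """Treat the abstract as a query and return the top_k GO terms with non-zero score."""
    vectors, idf = tfidf_vectors(go_docs)
    query = {w: tf * idf[w] for w, tf in Counter(tokenize(abstract)).items() if w in idf}
    scored = sorted(((cosine(query, vec), name) for name, vec in vectors.items()),
                    key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_k] if score > 0]


GO_DOCS = {
    "isotropic cell growth": "isotropic cell growth uniform cell growth the process by which "
                             "a cell irreversibly increases in size uniformly in all directions",
    "exocytosis": "exocytosis release of vesicle contents from a cell by fusion of the "
                  "vesicle with the plasma membrane",
    "histone acetyltransferase activity": "histone acetyltransferase activity catalysis of "
                                          "the transfer of an acetyl group to a histone",
}
print(rank_go_terms("Mutant cells lost isotropic growth and increased in size unevenly.", GO_DOCS))
```

Unlike lexical look-up, the abstract never has to contain the full term: partial overlap with the term's name, synonyms, or definition is enough to rank it.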

19 SMBM IR-Based Approach. Index documents using the Lucene search engine. Standard IR preprocessing: tokenization, stop word removal, case normalization, stemming. Similarity measure: vector space model. Two kinds of document: flat document = GO term + synonyms + definition; hierarchical document = GO term + synonyms + definition + terms, synonyms, and definitions of parent GO nodes.
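The two document types can be built as plain strings before indexing. In the sketch below the record layout is hypothetical, the definitions of "growth" and "cell growth" are placeholders, and "parent GO nodes" is interpreted as all ancestors; the slide does not say whether only direct parents were included.

```python
# Hypothetical toy records: term -> synonyms, definition, direct parents.
GO = {
    "growth": {"synonyms": [], "definition": "increase in size or mass", "parents": []},
    "cell growth": {
        "synonyms": ["cellular growth"],
        "definition": "growth of a cell",
        "parents": ["growth"],
    },
    "isotropic cell growth": {
        "synonyms": ["uniform cell growth"],
        "definition": "The process by which a cell irreversibly increases in size "
                      "uniformly in all directions.",
        "parents": ["cell growth"],
    },
}


def flat_document(term, go=GO):
    """Flat document = GO term + synonyms + definition."""
    record = go[term]
    return " ".join([term, *record["synonyms"], record["definition"]])


def hierarchical_document(term, go=GO):
    """Flat document plus the flat documents of every ancestor (an assumption)."""
    parts = [flat_document(term, go)]
    seen = set()
    stack = list(go[term]["parents"])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            parts.append(flat_document(parent, go))
            stack.extend(go[parent]["parents"])
    return " ".join(parts)


print(flat_document("isotropic cell growth"))
print(hierarchical_document("isotropic cell growth"))
```

The hierarchical variant repeats general words such as "growth" and "cell" across many documents, which is one way to picture the loss of discrimination reported on the results slide.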

20 SMBM IR-Based Results Better performance on IC abstracts than on SGD abstracts Hierarchical documents do slightly worse than flat documents Discriminatory effect of specific GO terms may be reduced by occurrence of general terms such as cell and protein

21 SMBM Machine Learning. Variety of text classification algorithms: Naïve Bayes, Decision Tree, SVM classifier, … Naïve Bayes predicts only one GO term per abstract (SGD GS: 2.1 GO terms/abstract; IC GS: 6.6 GO terms/abstract). Features: words, frequent phrases. Preprocessing steps: tokenization, removal of stop words, stemming. Training on 66% of annotated data, evaluation on remainder of data. GO term assignments vis-à-vis generic GO Slim to mitigate data sparsity problems.
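To make the single-label limitation concrete, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing and a 66% training split. It is not the toolkit used in the talk; the token lists and the two category labels are hypothetical toy data, and the split is a simple ordered cut with no shuffling.

```python
import math
from collections import Counter, defaultdict


def split(examples, train_fraction=0.66):
    """Use the first 66% of the examples for training and the rest for evaluation."""
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]


def train_naive_bayes(examples):
    """examples: list of (tokens, label). Returns count tables and the vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, tokens):
    """Return the single most probable label (one GO term per abstract)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denominator = sum(word_counts[label].values()) + len(vocab)
        for token in tokens:
            if token in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


EXAMPLES = [
    (["actin", "cytoskeleton"], "cytoskeleton organization"),
    (["vesicle", "secretion"], "transport"),
    (["actin", "filament"], "cytoskeleton organization"),
    (["vesicle", "membrane"], "transport"),
    (["golgi", "vesicle"], "transport"),
    (["actin", "cable"], "cytoskeleton organization"),
]
train, test = split(EXAMPLES)
model = train_naive_bayes(train)
print([(predict(model, tokens), label) for tokens, label in test])
```

Because `predict` returns exactly one label, Recall is capped whenever the gold standard lists several terms per abstract, as it does on average in both gold standards. A label that never appears in the training portion can never be predicted.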

22 SMBM Machine Learning Results. One GO term vs. multiple GO terms per abstract makes a difference. Higher precision scores than lexical look-up (SGD): GO terms directly mentioned in the text cannot be assigned if they are not present in the training set. Oracle Text Decision Tree (IC): classifier learns a systematic, strong correlation between words in the text and words in GO terms.

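23 SMBM Best F scores for GO Slim. Comparison of Approaches: one table for the SGD Gold Standard and one for the IC Gold Standard, each with columns R, P, F and rows LLU, IR, ML (the values are missing from this transcript).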
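For reference, the R, P and F columns can be computed from predicted and gold PMID-GO term pairs as sketched below. Micro-averaging over pairs and the balanced F score are assumptions; the slides do not state how the scores were averaged.

```python
def precision_recall_f(predicted, gold):
    """Micro-averaged P, R, F over (PMID, GO term) pairs.

    predicted, gold: dict mapping PMID -> set of GO terms.
    """
    predicted_pairs = {(pmid, term) for pmid, terms in predicted.items() for term in terms}
    gold_pairs = {(pmid, term) for pmid, terms in gold.items() for term in terms}
    true_positives = len(predicted_pairs & gold_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical toy data: two papers, placeholder term names.
gold = {"1": {"a", "b"}, "2": {"c"}}
predicted = {"1": {"a"}, "2": {"c", "d"}}
print(precision_recall_f(predicted, gold))
```

With a GO-term incomplete gold standard, a correct prediction that is absent from the gold set counts as a false positive here, which is why the earlier slide calls the Precision figures difficult to interpret.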

24 SMBM Conclusions GO-tagging is an interesting task NLP challenges Benefits of functional GO-tagger for researchers and curators Creating valid Gold Standard Completeness of annotation

25 SMBM Conclusions. Methods for GO-tagging. Lexical look-up: fast, simple; term variation, relevant GO terms inferred from text. Information retrieval approach: novel perspective; noise from general biomedical terms. Machine Learning: able to capture generalizations; feature selection.

26 SMBM Future Work Enhancements to each of the three simple approaches Combining three approaches into a hybrid system Improving resources and methodology for evaluating the technology Building and evaluating end-user applications employing this technology Look at other tasks: Extracting GO term-gene/gene product pairs Assigning evidence codes

27 SMBM Navigating GO-Tagged Document Collections. Panels: GO Hierarchy; Abstract Titles; Abstract Bodies; GO Terms / Gene Names

