GO Tag: Assigning Gene Ontology Labels to Medline Abstracts Natural Language Processing Group Department of Computer Science Robert Gaizauskas.


2 GO Tag: Assigning Gene Ontology Labels to Medline Abstracts
Robert Gaizauskas, N. Davis, Y.K. Guo, H. Harkema (Natural Language Processing Group, Department of Computer Science)
M. Ghanem, Tom Barnwell, Y. Guo (Department of Computing)
J. Ratcliffe

3 April 21, 2006 NaCTeM Seminar
Outline
  Context
    Project Background
    The Gene Ontology
    GO Annotation in Model Organism Databases
    Medline
  GO Tagging Tasks
    User types/scenarios
    Possible tasks
  Related Work
  Data sets/Gold Standards
  Approaches and Results to Date
    Lexical lookup
    Vector Space Similarity
    Machine Learning
  Exploiting the Results in Search Tools

4 Project Background
Work is funded by the EPSRC as a Best Practice Project for collaboration between Discovery Net and myGrid, both E-Science Pilot Projects (2001-5).
Both projects have developed text mining and data analysis components (complementary approaches: NLP vs. data mining/statistical analysis) and workflow models for co-ordinating distributed services, and both work on life science applications.
Aim: to develop a unified real-time e-Science text-mining infrastructure that builds upon and extends the technologies and methods developed by both Discovery Net and myGrid.
  Software engineering challenge: integrate complementary service-based text mining capabilities with different metadata models into a single framework
  Application challenge: annotate biomedical abstracts with semantic categories from the Gene Ontology

5 The Gene Ontology
The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism (http://www.geneontology.org/).
It consists of three structured, controlled vocabularies (ontologies) that describe gene products, in a species-independent manner, in terms of their associated:
  biological processes
  cellular components
  molecular functions
E.g. the gene product cytochrome c can be described by:
  the molecular function term electron transporter activity
  the biological process terms oxidative phosphorylation and induction of cell death
  the cellular component terms mitochondrial matrix and mitochondrial inner membrane

6 The Gene Ontology (cont)
[Figure] From: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium (2000) Nature Genet. 25: 25-29.

7 The Gene Ontology (cont)
Started as a joint effort between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD).
GO now (08/11/05) contains 19022 terms.
GO Slim(s) are reduced versions of the GO ontologies containing a subset of GO terms.
  They aim to give a broad overview of ontology content
  GO Slim Generic currently contains 127 terms
A typical GO term:
  Term name: isotropic cell growth
  Accession: GO:0051210
  Ontology: biological_process
  Synonyms: related: uniform cell growth
  Definition: The process by which a cell irreversibly increases in size uniformly in all directions. In general, a rounded cell morphology reflects isotropic cell growth. …
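The fields of the GO term record above map naturally onto a small data structure. The following is a minimal sketch only: the class name `GOTerm`, its field names and the `surface_forms` helper are assumptions made for illustration and do not come from any GO software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOTerm:
    """One GO term. Class and field names are hypothetical, chosen for this sketch."""
    accession: str
    name: str
    ontology: str  # biological_process, cellular_component or molecular_function
    synonyms: tuple = ()
    definition: str = ""

    def surface_forms(self):
        """Return the term name followed by its synonyms, lower-cased."""
        return [s.lower() for s in (self.name, *self.synonyms)]


# Values copied from the example term on the slide.
term = GOTerm(
    accession="GO:0051210",
    name="isotropic cell growth",
    ontology="biological_process",
    synonyms=("uniform cell growth",),
    definition=("The process by which a cell irreversibly increases in size "
                "uniformly in all directions."),
)
```

Keeping name and synonyms together in one helper is a design convenience: any matcher that needs the strings by which a term may appear in text can call `surface_forms()` without caring which field they came from.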

8 GO Annotation in Model Organism DBs
Model organism DBs typically record for each entry (gene) one or more GO codes plus links to the literature supporting the assignment of each GO code.
E.g. from the Saccharomyces Genome Database (SGD), for gene ACT1:
  GO Annotation | Reference | Evidence code
  structural constituent of cytoskeleton | Botstein D, et al. (1997) The yeast cytoskeleton. | TAS: Traceable Author Statement
  Further GO annotations shown: histone acetyltransferase complex; exocytosis
  Further references shown: Pruyne D and Bretscher A (2000) Polarization of cell growth in yeast. I. Establishment and maintenance; Galarneau L, et al. (2000) Multiple links between the NuA4 histone acetyltransferase complex and epigenetic control of transcription
  Further evidence codes shown: TAS: Traceable Author Statement; IDA: Inferred from Direct Assay
Evidence codes:
  IC: Inferred by Curator
  IDA: Inferred from Direct Assay
  IEA: Inferred from Electronic Annotation
  IEP: Inferred from Expression Pattern
  IGI: Inferred from Genetic Interaction
  IMP: Inferred from Mutant Phenotype
  IPI: Inferred from Physical Interaction
  ISS: Inferred from Sequence or Structural Similarity
  NAS: Non-traceable Author Statement
  ND: No biological Data available
  RCA: inferred from Reviewed Computational Analysis
  TAS: Traceable Author Statement
  NR: Not Recorded

9 PubMed
PubMed
  On-line bibliographic database designed to provide access to citations from the biomedical literature
  Developed by the US NCBI at the NLM
  Contains Medline, OldMedline and various other sources
Medline
  Over 12 million citations dating back to the 1960s
  Author abstracts and citations from > 4800 biomedical journals

10 PubMed
Entrez is NCBI's integrated, text-based search and retrieval system for the major databases it maintains.

12 User Types
Research Geneticists
  Narrow information interest
    Particular gene
    Particular activity/functionality
Model Organism Genome DB Curators
  Broader information interest
  Typically track a number of publications, seeking to enhance information stored in the model organism genome DB at the locus level

13 User Scenarios: Research Geneticist
Possible scenarios using GO tagging to support a research geneticist include:
Search result presentation
  Tag abstracts returned from a PubMed search with GO codes
  Use GO codes to cluster/structure search results to support more effective information access
Structuring of related literature as a workflow side-effect
  Many typical researcher workflows involve BLAST searches yielding BLAST/Swissprot reports
  A workflow can automatically assemble a set of related papers by extracting PMIDs of homologous genes/proteins from the reports and collecting these abstracts plus, optionally, others closely related by text similarity
  The resulting abstract set can be clustered/structured by GO terms and presented to the researcher
(Integrating Text Mining Services into Distributed Bioinformatics Workflows: A Web Services Implementation. Gaizauskas, Davis, Demetriou, Guo and Roberts. In Proceedings of the IEEE International Conference on Services Computing (SCC 2004), 2004.)
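Structuring a result set by GO code amounts to inverting the abstract-to-codes mapping. A minimal sketch follows, assuming abstracts have already been tagged; the function name `group_by_go` and the PMIDs are hypothetical, and the GO codes are borrowed from the LIM Kinase example in this talk.

```python
from collections import defaultdict


def group_by_go(tagged_abstracts):
    """Map each GO code to the sorted list of PMIDs tagged with it.

    tagged_abstracts: dict of PMID -> iterable of GO codes (assumed input shape).
    """
    groups = defaultdict(list)
    for pmid, go_codes in tagged_abstracts.items():
        for code in set(go_codes):  # ignore repeated codes on one abstract
            groups[code].append(pmid)
    return {code: sorted(pmids) for code, pmids in groups.items()}


# Hypothetical PMIDs, for illustration only.
tagged = {
    "pmid_a": ["GO:0004672", "GO:0005634"],
    "pmid_b": ["GO:0004672"],
    "pmid_c": ["GO:0007283"],
}
clusters = group_by_go(tagged)
```

An abstract with several codes appears in several groups, which matches the idea of browsing results by GO code rather than partitioning them.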

15 Search Result Presentation: Motivating Example
One of the genes involved in the cognitive/social elements of Williams Beuren syndrome is LIM Kinase 1 (LIMK1/LIMK-1).
Putting LIM Kinase into Entrez gives 146 possible papers of interest.
However, a search in the model organism corpus for LIM Kinase yields only 5 papers but a high number of associated GO codes (and this is from only partially annotated papers):
  GO:0006468 : biological_process : protein amino acid phosphorylation
  GO:0004674 : molecular_function : protein serine/threonine kinase activity
  GO:0004672 : molecular_function : protein kinase activity
  GO:0007283 : biological_process : spermatogenesis
  GO:0008064 : biological_process : regulation of actin polymerization and/or depolymerization
  GO:0005515 : molecular_function : protein binding
  GO:0005634 : cellular_component : nucleus
  GO:0005925 : cellular_component : focal adhesion
This suggests that even a single gene may be involved in numerous roles, and that clustering according to GO codes may give a more focused method of searching than simply supplying more and more keywords, which may remove useful and important papers from the result set.

17 User Scenarios: Model Organism DB Curator
Possible scenarios using GO tagging/text mining to support DB curators include:
Help assemble texts that may support GO code assignment
  GO tag texts in the curator's watching brief
  Automated tagging could act as a prompt for/check on the curator's judgement
Help to determine gene-GO term pairs for annotation
  Perform GO tagging/gene name identification at text level and suggest all pairs as candidates
  Perform GO tagging/gene name identification at sentence level and suggest candidates
Attempt to assign GO evidence codes
  To text segments providing evidence for GO code assignment, without identifying the GO code/gene pair to which the evidence pertains
  To text segments providing evidence, plus the GO code/gene pair to which the evidence pertains
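The sentence-level variant of pair suggestion can be illustrated with a simple co-occurrence rule. This is a sketch under stated assumptions, not the system described in the talk: the function name `candidate_pairs`, the naive sentence splitter and the plain substring matching are all simplifications introduced here.

```python
import re
from itertools import product


def candidate_pairs(text, gene_names, go_terms):
    """Suggest (gene, GO code) pairs whose mentions share a sentence.

    gene_names: list of gene names; go_terms: dict of GO code -> term name.
    Matching is case-insensitive substring search (a crude assumption).
    """
    pairs = set()
    # Naive split after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        genes = [g for g in gene_names if g.lower() in lowered]
        codes = [c for c, name in go_terms.items() if name.lower() in lowered]
        pairs.update(product(genes, codes))
    return pairs


# GO codes and names as listed in the LIM Kinase example; the text is invented.
go_terms = {"GO:0004672": "protein kinase activity", "GO:0005634": "nucleus"}
text = "LIMK1 shows protein kinase activity. Staining was seen in the nucleus."
pairs = candidate_pairs(text, ["LIMK1"], go_terms)
```

At text level the same input would also propose LIMK1 with nucleus; restricting to sentences trades recall for fewer spurious candidates.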

18 Possible Tasks (1)
Assigning GO codes to abstracts/full papers
  Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology
  Task: assign 0 or more GO codes to a text iff the text is about the function/process/component identified by the code (assume only the most specific code is assigned)
  Note that in this task there is no association of a GO code with any specific gene/gene product

19 Possible Tasks (2)
Assigning GO codes to genes/gene products in abstracts/full papers
  Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology
  Task: if the text supports the assignment of one or more GO codes to a gene/gene product, identify the gene/gene product-GO code pairs and the text supporting the assignments
This capability would support additional tasks:
  Given a particular gene/gene product and a text collection, find all GO codes for the gene/gene product across the collection
  Given a GO code and a text collection, find all genes/gene products tagged with the code across the collection
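Once gene-GO code pairs have been extracted, the two additional lookups are just two indexes over the same triples. A minimal sketch, assuming extraction output in the form of (gene, GO code, PMID) triples; the function name `build_indexes`, the gene `GENE_X` and the PMIDs are hypothetical.

```python
from collections import defaultdict


def build_indexes(triples):
    """Build gene -> GO codes and GO code -> genes indexes.

    triples: iterable of (gene, go_code, pmid) tuples (assumed input shape).
    """
    go_by_gene = defaultdict(set)
    genes_by_go = defaultdict(set)
    for gene, go_code, _pmid in triples:
        go_by_gene[gene].add(go_code)
        genes_by_go[go_code].add(gene)
    return go_by_gene, genes_by_go


# Invented extraction output, for illustration only.
triples = [
    ("LIMK1", "GO:0004672", "pmid_a"),
    ("LIMK1", "GO:0005634", "pmid_b"),
    ("GENE_X", "GO:0005634", "pmid_c"),
]
go_by_gene, genes_by_go = build_indexes(triples)
```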

20 Possible Tasks (3)
Assigning evidence codes to gene/gene product-GO code pairings in abstracts/full papers
  Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology
  Task: as in Task 2, but additionally supply the evidence codes
  A weaker variant of this is just to suggest evidence text that may assist in the assignment of a GO code

21 Related Work
Raychaudhuri, Chang, Sutphin & Altman (2002)
  Task: associate GO codes with genes by
    1. Associating GO codes with papers
    2. Associating a specific GO code with a gene if a sufficient number of papers mentioning the gene have the GO code associated with them
  Method: treat 1. as a document classification task and evaluate maximum entropy, Naïve Bayes and Nearest Neighbours approaches
  Evaluation: corpus of 20,000 Medline abstracts assigned one or more of 21 GO terms/categories
  Results: maximum entropy best, with 72.8% classification accuracy over 21 categories
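Step 2 of this approach, propagating paper-level GO codes to a gene, can be illustrated with a simple vote count. The slide does not say how "sufficient" was defined, so the fixed threshold `min_papers` below is an assumption, as are the function name and the toy data.

```python
from collections import Counter


def go_codes_for_gene(gene_papers, paper_codes, min_papers=2):
    """Return GO codes attached to at least min_papers of a gene's papers.

    gene_papers: PMIDs of papers mentioning the gene.
    paper_codes: dict of PMID -> GO codes assigned to that paper (step 1 output).
    min_papers: hypothetical threshold standing in for "sufficient number".
    """
    counts = Counter()
    for pmid in gene_papers:
        counts.update(set(paper_codes.get(pmid, ())))  # one vote per paper
    return {code for code, n in counts.items() if n >= min_papers}


# Invented paper-level classifications, for illustration only.
paper_codes = {
    "pmid_a": ["GO:0004672", "GO:0005634"],
    "pmid_b": ["GO:0004672"],
    "pmid_c": ["GO:0007283"],
}
codes = go_codes_for_gene(["pmid_a", "pmid_b", "pmid_c"], paper_codes)
```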

22 Related Work
Go-KDS (Smith & Cleary, 2003)
  Product of Reel Two
  Task: assign arbitrary GO terms to PubMed articles
  Method: proprietary Weighted Confidence learner (similar to Naïve Bayes), using only words as features, trained on gene/protein DBs which use GO codes AND have links to Medline
  Evaluated on approximately the same data/task as Raychaudhuri et al.: 70.5% accuracy

23 Related Work
GoPubMed
  On-going work at Dresden University (www.gopubmed.org)
  Task: annotate PubMed abstracts with GO terms
  Method: use a local sequence alignment algorithm with weighted term matching (to overcome the limits of strict matching) between GO terms and strings in texts
  Evaluation: none reported
Kiritchenko et al. (U. of Ottawa)
  Task: assign arbitrary GO terms to biomedical texts
  Method: treat the task as hierarchical text classification; use AdaBoost.MH
  Evaluation: introduce a hierarchical evaluation measure
  Results unclear

24 Related Work (cont)
Biocreative challenge: task 2 contained three related subtasks
  1. Given an article, a protein and a GO code, where the article justifies the assignment of the GO code to the protein, find evidence text in the article supporting the assignment
  2. Given protein-article pairs plus the number of GO code assignments supported by the article, find the GO code(s) that should be assigned to the protein based on the article
  3. Given a set of proteins, retrieve a set of papers relevant to assigning codes to the proteins, plus the GO code annotations and the supporting passages (not evaluated)
Results indicated no systems ready for practical use
Issues: lack of training data; complexity of tasks

25 Related Work (cont)
TREC Genomics Track 2004: three tasks related to GO code assignment
  1. Triage: given a set of articles, find those that contain some evidence for the assignment of a GO code, i.e. warrant being curated
  2. Given an article and the names of genes occurring in the article, assign one or more of the top three GO ontologies from which human curators had assigned codes
  3. Task 2 plus provide the evidence code supporting each gene-GO hierarchy label association
Results for all three tasks poor

27 Data Sets and Evaluation
In order to assess the performance of GO tag assignment, a manually annotated/verified gold standard corpus is needed.
However, no such corpus exists …

28 Data Sets and Evaluation
Solution 1: SGD Gold Standard
Derive a corpus from the SGD model organism database (yeast)
  Assemble all Medline abstracts cited as evidence supporting the assignment of GO terms
  Associate with each abstract the GO term whose assignment it is cited as supporting
  I.e. given the annotated genes in SGD, assign a GO term T to a paper P if P is referenced in support of a gene-GO term association involving T
SGD Gold Standard
  4922 PMIDs
  2455 GO terms
  10485 PMID-GO term pairs
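The derivation rule above (assign T to P if P is cited for a gene-GO association involving T) can be written as a projection that drops the gene. A minimal sketch, assuming the annotations are available as (gene, GO term, PMID) triples; the function name, the PMIDs and the pairing of genes with terms in the toy data are invented and are not SGD records.

```python
def derive_gold_standard(annotations):
    """Project (gene, go_term, pmid) annotations onto (pmid, go_term) pairs."""
    return {(pmid, go_term) for _gene, go_term, pmid in annotations}


def summarise(pairs):
    """Return (number of PMIDs, number of GO terms, number of pairs)."""
    pmids = {pmid for pmid, _ in pairs}
    terms = {term for _, term in pairs}
    return len(pmids), len(terms), len(pairs)


# Invented annotations, for illustration only.
annotations = [
    ("ACT1", "structural constituent of cytoskeleton", "pmid_a"),
    ("ACT1", "exocytosis", "pmid_b"),
    ("GENE_X", "exocytosis", "pmid_b"),  # same paper and term via another gene
]
gold = derive_gold_standard(annotations)
stats = summarise(gold)
```

Because the result is a set, a paper cited for the same term through several genes contributes a single PMID-GO term pair.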

29 Data Sets and Evaluation: SGD Gold Standard
Advantages:
  Data already exists, so no extra annotation work is required
  Similar corpora can be assembled for each model organism DB
Disadvantages:
  Each abstract has associated with it the GO terms whose assignment to specific genes it supports, but may be missing other GO terms which can also legitimately be attached to it
  Not every paper supporting a GO term assignment will be cited
Consequence: the SGD gold standard is GO term incomplete
  Weak measure of recall
  Precision figures difficult to interpret

30 Data Sets and Evaluation: SGD Gold Standard
Further issue: SGD gene-GO term assignments are based on full papers, whereas the system only has access to abstracts
Consequence: a limit on the maximum Recall obtainable by the system

31 Data Sets and Evaluation (cont)
Solution 2: IC Gold Standard
Manually extend the GO annotation of abstracts derivable from the SGD
  Goal: a GO term complete gold standard
  Selected a subset (~800) for which support for all the assigned GO codes is found in the abstract (rather than the full paper)
  Manually added additional GO annotations using a combination of fuzzy matching against GO and some manual addition of synonyms during checking
  For included terms, include the lowest within each ontology, e.g. cell wall biosynthesis => cell wall biosynthesis; cell wall
  Also applied the same methodology to evidence paragraphs: brief summaries written by curators deliberately using GO vocabulary
IC Gold Standard
  785 PMIDs
  1006 GO terms
  5170 PMID-GO term pairs

32 Data Sets and Evaluation (cont)
Advantages:
  Much closer to a GO-term complete gold standard
Disadvantages:
  Still not GO-term complete
  Method of creation suggests there may still be many unannotated GO terms that ought to be marked (direct mentions of GO terms vs. semantically entailed GO terms)
  Gold Standard creation method favors the lexical look-up approach to GO-tagging
  Dataset is small

34 The GO Tagging Task Addressed
The approaches we investigated all considered Task 1, as defined earlier:
  Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology
  Task: assign 0 or more GO codes to a text iff the text is about the function/process/component identified by the code (assume only the most specific code is assigned)

35 Approach 1: Lexical Lookup Using Termino
Termino: a large-scale terminological resource to support term processing for information extraction, retrieval, and navigation
  Termino contains a database holding large numbers of terms imported from various existing terminological resources, e.g. UMLS, GO
  Efficient recognition of terms in text is achieved through the use of finite state recognizers compiled from the contents of the database
  The results of lexical look-up in Termino can feed into further term processing components, e.g. a term parser
  Available as a Web Service (see http://nlp.shef.ac.uk)

36 Termino
[Architecture diagram]
  Inputs: raw texts (Medline abstracts) via Term Induction; existing terminological resources (GO, UMLS, …) via Source-Specific Loaders
  Terminology Engine: TermDB and Finite State Look-Up, with a Term Parser
  Flow: Text in, Termino, Text out
Example entries:
  Neurofibromin: GO annotations: 0008181: tumor suppressor; 0005737: cytoplasm; …
  Peptidyl-prolyl isomerase: type: protein term; source: induced from Medline; …
  Mastectomy: UMLS data: CUI: C0024881; semantic type: therapeutic or preventive procedure; synonyms: mammectomy; …

37 Lexical Look-Up for GO Tag
Termino
  Imported names of all terms in GO, plus their GO ids and namespace attributes (18270 names in total)
  GO term synonyms
  SGD yeast gene names
Recognition of terms in text
  Case-insensitive
  Simple morphological variants are recognized
    Cells mapped onto cell
    Mitochondrial, mitochondria not mapped onto mitochondrion

38 Lexical Look-Up for GO Tag (cont)
GO code assignment
  GO term T is assigned to a text iff the name of T is recognized in the text
Extensions:
  GO term T is assigned to a paper if a synonym of term T occurs in the abstract of the paper
  GO term T is assigned to a paper if a yeast gene name associated with term T occurs in the abstract of the paper
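The assignment rule and its two extensions can be sketched with a plain lexicon look-up. This is not Termino: where Termino uses compiled finite state recognizers, the sketch below substitutes token-sequence substring matching, and the plural-stripping rule, function names and example sentence are assumptions made for illustration.

```python
import re


def normalise(text):
    """Lower-case, tokenise, and strip a trailing plural 's' (crude assumption)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


def lexical_lookup(text, lexicon):
    """Return GO codes whose surface form occurs in the text.

    lexicon: dict of surface form -> GO code. Surface forms may be term names,
    synonyms or gene names, which covers the base rule and both extensions.
    """
    padded = " " + " ".join(normalise(text)) + " "
    assigned = set()
    for surface, code in lexicon.items():
        key = " " + " ".join(normalise(surface)) + " "
        if key in padded:  # whole-token match thanks to the padding spaces
            assigned.add(code)
    return assigned


# Names and codes from the LIM Kinase example; the sentence is invented.
lexicon = {
    "nucleus": "GO:0005634",
    "focal adhesion": "GO:0005925",
    "spermatogenesis": "GO:0007283",
}
found = lexical_lookup("LIMK1 localises to focal adhesions and the Nucleus.", lexicon)
```

As on the slide, only simple variants are handled: "adhesions" matches "adhesion", but a form such as "mitochondria" would not be mapped onto "mitochondrion".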

39 Lexical Lookup Results for GO Slim

SGD Dataset              P        R        F
GO Term                  35.27%   52.36%   42.15%
Yeast Term               22.87%   91.76%   36.62%
GO Synonyms              37.86%   34.20%   35.94%
GO + Yeast               21.15%   93.59%   34.50%
GO + Synonyms            32.94%   64.65%   43.65%
GO + Synonyms + Yeast    20.53%   94.17%   33.72%

IC Dataset               P        R        F
GO Term                  98.62%   79.33%   87.93%
Yeast Term               37.94%   75.13%   50.42%
GO Synonyms              70.49%   33.31%   45.24%
GO + Yeast               43.36%   94.76%   59.50%
GO + Synonyms            85.52%   88.35%   86.91%
GO + Synonyms + Yeast    42.42%   95.95%   58.83%
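The P, R and F figures above are consistent with precision, recall and their balanced harmonic mean. The slides do not state how scores were aggregated over abstracts, so the set-based definition below is an assumption; the function names are illustrative.

```python
def f_measure(p, r):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def precision_recall_f(predicted, gold):
    """Score a set of predicted (pmid, go_code) pairs against a gold set."""
    true_positives = len(predicted & gold)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    return p, r, f_measure(p, r)


# Re-deriving F from the reported P and R of the "GO Term" row (SGD, GO Slim).
f_go_term_sgd = f_measure(35.27, 52.36)

# Toy sets: 2 of 3 predictions are correct, 2 of 4 gold pairs are found.
p, r, f = precision_recall_f({"a", "b", "c"}, {"b", "c", "d", "e"})
```

Rounded to two decimals, `f_go_term_sgd` reproduces the 42.15% reported in the SGD table, which supports reading the columns in the order P, R, F.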

40 Lexical Lookup Results for GO Full

SGD Dataset              P        R        F
GO Term                  7.33%    15.95%   10.05%
Yeast Term               7.97%    84.42%   14.57%
GO Synonyms              6.46%    7.63%    7.00%
GO + Yeast               6.93%    85.66%   12.82%
GO + Synonyms            6.87%    22.55%   10.54%
GO + Synonyms + Yeast    6.49%    86.14%   12.08%

IC Dataset               P        R        F
GO Term                  90.52%   71.30%   79.77%
Yeast Term               9.26%    31.43%   14.30%
GO Synonyms              29.65%   11.53%   16.60%
GO + Yeast               21.00%   83.73%   33.58%
GO + Synonyms            69.93%   80.04%   74.65%
GO + Synonyms + Yeast    20.70%   88.38%   33.54%

41 Lexical Lookup Approach: Discussion
Recall
  Effect of curators using full text (SGD) vs. abstracts only (IC)
  Inherent drawbacks of lexical look-up: term variation, literal mentions
  Effects of Gold Standard creation method (IC)
Precision
  Effects of Gold Standard creation method (IC)
GO vs. GO Slim
  Recognizing GO Slim terms is easier than recognizing GO terms
Effects of extensions (synonyms/gene names) on performance
  Adding synonyms: variable decrease in Precision, substantial increase in Recall
  Adding yeast terms: substantial decrease in Precision, substantial increase in Recall

42 Error Analysis
False negatives for abstracts:
  Abbreviation: mismatch repair (GO name) vs. MMR (in text)
  Permutation, derivation: regulation of translation vs. regulated translation; sporulation vs. sporulate
  Truncation: galactokinase activity vs. galactokinase
  Alternative descriptions: protein catabolism vs. proteins for degradation; autophagic vacuole vs. autophagosomal

43 Approach 2: IR-based Vector Space Similarity
Document Collection
  Build a collection of GO documents, where each GO document consists of a GO term, its synonyms and its definition sentence
Query
  Treat each abstract to which GO codes are to be assigned as a query against the GO document collection
Retrieval
  Given a query (i.e. an abstract), retrieve relevant GO documents (i.e. GO terms)
  Assign to the abstract the top 1, 5, 10 … GO terms that are most similar to it as measured by the Vector Space Model (VSM)
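The retrieval step can be sketched with TF-IDF vectors and cosine similarity. This is a stand-in for illustration only: it does not reproduce Lucene's scoring, it skips stop word removal and stemming, and the smoothed IDF formula, function names and example abstract are assumptions.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case alphabetic tokens (no stop word removal or stemming here)."""
    return re.findall(r"[a-z]+", text.lower())


def rank_go_documents(abstract, go_docs, top_k=5):
    """Rank GO documents against an abstract by TF-IDF cosine similarity.

    go_docs: dict of GO code -> document text (name, synonyms, definition).
    Returns up to top_k GO codes with non-zero similarity, best first.
    """
    doc_tf = {code: Counter(tokens(text)) for code, text in go_docs.items()}
    n_docs = len(doc_tf)
    df = Counter(term for tf in doc_tf.values() for term in tf)
    # Smoothed inverse document frequency (one common variant, an assumption).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    def vector(tf):
        return {t: c * idf[t] for t, c in tf.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = vector(Counter(tokens(abstract)))
    scored = [(cosine(query, vector(tf)), code) for code, tf in doc_tf.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [code for score, code in scored[:top_k] if score > 0]


# GO documents built from names given in this talk; the abstract is invented.
go_docs = {
    "GO:0051210": "isotropic cell growth uniform cell growth The process by which "
                  "a cell irreversibly increases in size uniformly in all directions",
    "GO:0007283": "spermatogenesis",
    "GO:0005634": "nucleus",
}
ranked = rank_go_documents("The mutant cell shows uniform growth in all directions.",
                           go_docs, top_k=1)
```

The `top_k` parameter plays the role of the top 1, 5, 10 … cut-off on the slide.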

44 IR-based Approach
Indexed the GO documents using the Lucene search engine
  Standard IR preprocessing: tokenization, stop word removal, case normalization, stemming
  4 indices were built, varying according to whether they used:
    Standard GO or GO Slim
    A GO document consisting of the GO term text (name + definition) alone, or that text plus its ancestor GO terms
  Used the standard weighting scheme included in Lucene
Postprocessing:
  Re-weighting: give credit to duplicated GO documents (found on more than one path back to the root)
  Threshold: the number of relevant GOIDs to return

45 IR-Based Results
Better performance on IC abstracts than on SGD abstracts
Hierarchical documents do slightly worse than flat documents
  The discriminatory effect of specific GO terms may be reduced by the occurrence of general terms such as cell and protein

46 Approach 3: Machine Learning
Variety of text classification algorithms: Naïve Bayes, Decision Tree, SVM classifier, …
  Naïve Bayes predicts only one GO term per abstract
  SGD GS: 2.1 GO terms/abstract; IC GS: 6.6 GO terms/abstract
Features: words, frequent phrases
Preprocessing steps: tokenization, removal of stop words, stemming
Training on 66% of the annotated data, evaluation on the remainder of the data
GO term assignments vis-à-vis generic GO Slim to mitigate data sparsity problems
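The single-label limitation of Naïve Bayes noted above is easy to see in a small implementation. The sketch below is a generic multinomial Naïve Bayes with add-one smoothing and a 66% training split; it is not the classifier software used in the talk, and the class name, the `split` helper and the toy token lists are assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word tokens; predicts exactly one label."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(self.class_counts[label] / total_docs)
            for w in words:
                if w in self.vocab:  # words unseen in training are ignored
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def split(examples, train_fraction=0.66):
    """Use the first train_fraction of the examples for training."""
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]


# Invented, already-tokenised toy abstracts with one GO code each.
examples = [
    ("kinase phosphorylation activity".split(), "GO:0004672"),
    ("nucleus chromatin nuclear".split(), "GO:0005634"),
    ("kinase activity serine".split(), "GO:0004672"),
    ("nuclear envelope nucleus".split(), "GO:0005634"),
    ("protein kinase activity".split(), "GO:0004672"),
    ("nucleus chromatin".split(), "GO:0005634"),
]
train, test = split(examples)
model = NaiveBayes().fit([d for d, _ in train], [l for _, l in train])
predictions = [model.predict(d) for d, _ in test]
```

Since `predict` returns the single best-scoring label, an abstract that carries several GO terms in the gold standard can have at most one of them recovered, which bounds recall on both gold standards.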

47 Machine Learning Results
One GO term vs. multiple GO terms per abstract makes a difference
Higher precision scores than lexical look-up (SGD): GO terms directly mentioned in the text will not be assigned if they are not present in the training set
Oracle Text Decision Tree (IC): the classifier learns a systematic, strong correlation between words in the text and words in GO terms

48 Comparison of Approaches
Best F scores for GO Slim (LLU: lexical look-up; IR: IR-based vector space similarity; ML: machine learning)

IC Gold Standard     R      P      F
LLU                  79.3   98.6   87.9
IR                   59.5   37.6   46.1
ML                   76.5   83.0   79.6

SGD Gold Standard    R      P      F
LLU                  64.6   32.9   43.6
IR                   51.5   26.2   34.7
ML                   36.8   51.6   43.0

50 [Search tool screenshot] Callouts: Input keywords here; Upload a file containing a list of Medline abstracts; Type/paste free texts to get results

51 [Search tool screenshot] Callouts: Click for the abstract details; Click for the GO definition; Search for the abstracts with similar GO annotations

57 Exploiting the Results in Search Tools
[Search tool screenshot] Panels: GO Hierarchy; Abstract Titles; Abstract Bodies; GO Labels/Gene Names

62 Conclusions
GO tagging is an interesting task that offers significant potential benefits to research biologists and bioinformaticians
Several increasingly complex/valuable variants of the task can be identified
Simple techniques
  Direct term matching
  IR-type text matching
  Machine Learning text classification methods
have been assessed for their level of performance on the simplest task: assigning GO terms to texts at the whole text level
Evaluation methods/resources are critical issues
Effectively utilising imperfect text mining results in end user applications is challenging

63 Future Work
Enhancements to each of the 3 simple approaches
Combining the 3 simple approaches into a hybrid system
Look at other tasks:
  Extracting GO term-gene/gene product pairs
  Assigning evidence codes
Improving resources and methodology for evaluating the technology
End-user evaluation of search tools employing this technology

64 END
Reference: Davis, N., Harkema, H., Gaizauskas, R., Guo, Y.K., Ghanem, M., Barnwell, T., Guo, Y. and Ratcliffe, J. (2006) Three Approaches to GO-Tagging Biomedical Abstracts. In Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM06), Jena, April 2006. Available from http://www.dcs.shef.ac.uk/~robertg/publications/

