UniProt to MeSH: mapping proteins to disease terminologies
Yum L. Yip, Anaïs Mottaz, Patrick Ruch, Anne-Lise Veuthey
ISMB/ECCB 2007 – Bio-Ontologies – Vienna, July 20

July 20 Bio-Ontologies – ISMB 2007
The role of bioinformatics in biomedical research and future clinical patient care
Health problem in a patient
Bioinformatics:
- Data storage and representation
- Large-scale data generation
- Large-scale data analysis
Basic research:
- What is the mechanism?
- Epidemiological studies
Basic research results stored in databases
Up-to-date knowledge and large-scale results:
- Research direction
- New hypothesis
Drug development
Clinical trials
Clinical patient care: doctor prescribes an individualized treatment plan.
Molecular-level decision-support tools:
- Structured knowledge representations
- Filtered information on fundamental biological mechanisms and significant Treatment outcome

Biomedical knowledge: a protein-centric view
High quality manual annotation: protein name, sequence, function, domain, features and references. 16,702 human proteins.
Proteins: sequence, function, structure, modifications
Disease: pathology, diagnosis/prognosis, treatment, risk factor
Disease annotation:
- Link to 12,603 OMIM entries
- Link to other specialized databases
- 32,921 variants (or polymorphisms)
- >3000 associated diseases
Biological processes: biological pathway/network, protein-protein interaction
Biological process/proteomic:
- Pathway annotation
- Protein-protein interaction (DIP, INTACT)
- Protein 2D gel (Swiss-2DPAGE)
References:
- Links to >100 other databases
- Over 82420 journal references
Genes: sequence, chromosomal location, regulation, expression
Genomic data:
- Genew, GeneCards, GenAtlas
- Expression data (e.g. CleanEx)
- Genome details: Ensembl

Objective
Increase the accessibility of molecular biology resources to clinical researchers by indexing UniProtKB/Swiss-Prot with the MeSH terminology.

Why UniProtKB/Swiss-Prot?
- Most comprehensive warehouse of protein sequences, with a high level of annotation and highly cross-linked with other biological databases
- Includes data on more than 30000 variants, mostly c-SNPs (coding SNPs) or SAPs (Single Amino-acid Polymorphisms)
- More than 3000 diseases associated with a protein are also described (mostly genetic diseases associated with SAPs)
http://beta.uniprot.org/

Disease annotation: UniProtKB/Swiss-Prot entry P35240

Why MeSH?
- Controlled vocabulary thesaurus structured in a hierarchy of concepts
- Each concept includes a set of terms (synonyms and lexical variants)
- MeSH is part of the UMLS and thus linked to other medical terminologies
- MeSH is used to index the biomedical literature

The structure of MeSH

Mapping procedure
Diagram labels: UniProtKB/Swiss-Prot entry; disease comment line; extracted disease name; OMIM: title/alternative titles; exact match; partial match; same descriptor; MeSH

Disease extraction
Extraction using regular expressions built on trigger phrases: "are the cause of", "involved in", etc.
MeSH: Neurofibromatosis 2

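The extraction step can be illustrated with a minimal Python sketch. It assumes that a disease name is the text following a trigger phrase, up to the next bracket, semicolon or full stop. Only the trigger phrases "are the cause of" and "involved in" come from the slide; the pattern, the function name `extract_disease_names` and the sample comment line are illustrative assumptions, not the authors' actual expressions.

```python
import re

# Trigger phrases taken from the slide; "is the cause of" is an assumed variant.
TRIGGERS = r"(?:(?:is|are) the cause of|involved in)"

# Assumption: the disease name runs from the trigger phrase (skipping an
# optional article) up to the next "(", "[", ";" or ".".
PATTERN = re.compile(
    TRIGGERS + r"\s+(?:an?\s+|the\s+)?([^(\[;.]+)",
    re.IGNORECASE,
)


def extract_disease_names(comment):
    """Return lowercased candidate disease names found in a comment line."""
    return [m.group(1).strip().lower() for m in PATTERN.finditer(comment)]


# Hypothetical comment line written in the style of a disease annotation.
comment = "Defects in NF2 are the cause of neurofibromatosis 2 (NF2) [MIM:101000]."
print(extract_disease_names(comment))
```

A pattern this simple keeps everything after the trigger phrase, so a comment such as "may be involved in hematopoietic tumors such as B-cell lymphomas" yields the whole phrase rather than a clean disease name.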
Term matching procedure
- Exact matches: same length, same word order, case insensitive
- Partial matches: a similarity score between terms is calculated, based on the IDF (inverse document frequency) used in information retrieval (formula shown on the slide). The term with the highest score was chosen.

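The two matching modes can be sketched as follows. The scoring formula itself is not reproduced in this transcript, so the sketch substitutes an assumed IDF-weighted word overlap (shared IDF weight divided by total IDF weight). The smoothing, the default threshold of 0.4, the function names and the toy vocabulary are all hypothetical choices for illustration.

```python
import math
from collections import Counter


def tokenize(term):
    """Lowercase a term and split it into words, ignoring commas."""
    return term.lower().replace(",", " ").split()


def build_idf(vocabulary):
    """Smoothed IDF per word, treating each vocabulary term as a document."""
    n = len(vocabulary)
    df = Counter(tok for term in vocabulary for tok in set(tokenize(term)))
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    unseen = math.log(1 + n) + 1.0  # weight for words absent from the vocabulary
    return idf, unseen


def similarity(query, term, idf, unseen):
    """Assumed score: IDF weight of shared words over IDF weight of all words."""
    q, t = set(tokenize(query)), set(tokenize(term))
    total = sum(idf.get(tok, unseen) for tok in q | t)
    shared = sum(idf.get(tok, unseen) for tok in q & t)
    return shared / total if total else 0.0


def best_match(query, vocabulary, threshold=0.4):
    """Return (term, score, kind) for the best match, or None below threshold."""
    if not vocabulary:
        return None
    # Exact match: same words, same order, case insensitive.
    for term in vocabulary:
        if term.lower().split() == query.lower().split():
            return term, 1.0, "exact"
    # Partial match: keep the highest-scoring term.
    idf, unseen = build_idf(vocabulary)
    score, term = max((similarity(query, t, idf, unseen), t) for t in vocabulary)
    return (term, score, "partial") if score >= threshold else None


# Toy vocabulary standing in for MeSH terms.
mesh_terms = [
    "Neurofibromatosis 2",
    "Neurofibromatosis 1",
    "Epidermolysis Bullosa Simplex",
    "Epidermolysis Bullosa Dystrophica",
    "Lymphoma, B-Cell",
]
print(best_match("neurofibromatosis 2", mesh_terms))
print(best_match("epidermolysis bullosa simplex, Weber-Cockayne type", mesh_terms))
```

Weighting by IDF means that rare, discriminating words such as "simplex" count for more than words shared by many terms, which is the usual motivation for IDF in information retrieval.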
Benchmark
- Used to evaluate the procedure in terms of recall and precision
- Used to set up a score threshold
- 92 disease names from 43 Swiss-Prot entries manually mapped to MeSH terms

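How a benchmark of this kind can be used to measure precision and recall and to choose a score threshold is sketched below. The data are invented placeholders, not the 92 benchmark disease names, and the function names are hypothetical. Precision is taken as correct mappings over proposed mappings, recall as correct mappings over all benchmark entries.

```python
def apply_threshold(scored, threshold):
    """Keep only automatic mappings whose score reaches the threshold."""
    return {d: term for d, (term, score) in scored.items() if score >= threshold}


def precision_recall(predicted, gold):
    """Compare automatic mappings with the manual (gold) mappings."""
    correct = sum(1 for d, term in predicted.items() if gold.get(d) == term)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Placeholder benchmark: disease name -> manually assigned term.
gold = {"d1": "A", "d2": "B", "d3": "C", "d4": "D"}
# Placeholder automatic output: disease name -> (proposed term, score).
scored = {"d1": ("A", 0.9), "d2": ("B", 0.7), "d3": ("X", 0.45), "d4": ("D", 0.2)}

for threshold in (0.0, 0.5):
    p, r = precision_recall(apply_threshold(scored, threshold), gold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

In this toy data, raising the threshold removes the wrong mapping for d3 but also the correct low-scoring mapping for d4, so precision rises while recall falls.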
Analysis of the results (1/3)
Problems in granularity difference
- Disease: muscle liver brain eye nanism
- MeSH terms (manual mapping, automatic mapping): abnormalities, multiple; muscle-eye-brain disease

Analysis of the results (2/3)
Problems in disease name extraction
- Disease (extracted): hematopoietic tumors such as b-cell lymphomas
- MeSH terms (manual mapping, automatic mapping): b-cell lymphoma; hematologic neoplasms

Analysis of the results (3/3)
Problems inherent to the resources
- Disease SP: epidermolysis bullosa simplex, Weber-Cockayne type
- Disease (OMIM alternative title): epidermolysis bullosa dystrophica, Cockayne-Touraine type
- MeSH terms (manual mapping, automatic mapping): epidermolysis bullosa dystrophica; epidermolysis bullosa simplex

Results on all Swiss-Prot
3197 disease comment lines 2398 OMIM SPOMIM SP OMIM
Exact match     577 (18%)    655 (20%)    354 (11%)   866 (27%)
Partial match   691 (22%)    600 (19%)    317 (10%)   751 (23%)
Total           1268 (40%)   1225 (39%)   844 (26%)   1617 (51%)

Discussion
The mapping system was tuned for high precision to provide a fully automated procedure, but we need to improve the recall by:
- Including NLP techniques in the disease extraction and matching procedures
- Refining the score with other parameters (e.g. information from the hierarchical structure of MeSH)
- Permitting a mapping to several MeSH terms
- Trying to map to other terminologies such as ICD-10 and SNOMED CT
- Using information from the literature, which is indexed with MeSH terms

Work in progress
- Extract MeSH terms using full text from disease comment lines + references in Swiss-Prot + references in OMIM, and calculate their frequency
- This frequency is used to refine the score for partial matches
- Preliminary results: the recall was successfully increased to 62% without losing precision

Conclusion
- We developed a generic terminology mapping procedure which can be used to link various biomedical resources.
- Indexing UniProtKB with medical terms opens new possibilities for searching and mining data relevant to clinical research.
- These results will help improve the interoperability between medical informatics and bioinformatics.