UniProt to MeSH: mapping proteins to disease terminologies. Yum L. Yip, Anaïs Mottaz, Patrick Ruch, Anne-Lise Veuthey. ISMB/ECCB 2007 – Bio-Ontologies – Vienna, July 20

Presentation transcript:

UniProt to MeSH: mapping proteins to disease terminologies
Yum L. Yip, Anaïs Mottaz, Patrick Ruch, Anne-Lise Veuthey
ISMB/ECCB 2007 – Bio-Ontologies – Vienna, July 20

The role of bioinformatics in biomedical research and future clinical patient care
- Health problem in a patient
- Basic research:
  - What is the mechanism?
  - Epidemiological studies
- Bioinformatics:
  - Data storage and representation
  - Large-scale data generation
  - Large-scale data analysis
- Basic research results stored in databases
- Up-to-date knowledge and large-scale results:
  - Research direction
  - New hypothesis
- Drug development
- Clinical trials
- Clinical patient care: the doctor prescribes an individualized treatment plan.
- Molecular-level decision-support tools:
  - Structured knowledge representations
  - Filtered information on fundamental biological mechanisms and significant treatment outcome

Biomedical knowledge: a protein-centric view
- Disease: pathology, diagnosis/prognosis, treatment, risk factor
- Biological processes: biological pathway/network, protein-protein interaction
- Proteins: sequence, function, structure, modifications
- Genes: sequence, chromosomal location, regulation, expression

Biomedical knowledge: a protein-centric view
- Proteins: sequence, function, structure, modifications
  - High-quality manual annotation
  - Protein name, sequence, function, domain, features and references
  - 16,702 human proteins
- Disease: pathology, diagnosis/prognosis, treatment, risk factor
  - Disease annotation:
    - Link to 12,603 OMIM entries
    - Link to other specialized databases
    - 32,921 variants (or polymorphisms)
    - >3000 associated diseases
- Biological processes: biological pathway/network, protein-protein interaction
  - Biological process/proteomic:
    - Pathway annotation
    - Protein-protein interaction (DIP, IntAct)
    - Protein 2D gel (SWISS-2DPAGE)
- References
  - Links to >100 other databases
  - Over [...] journal references
- Genes: sequence, chromosomal location, regulation, expression
  - Genomic data:
    - Genew, GeneCards, GenAtlas
    - Expression data (e.g. CleanEx)
    - Genome details: Ensembl

Objective
Increase the accessibility of molecular biology resources to clinical researchers by indexing UniProtKB/Swiss-Prot with the MeSH terminology.

Why UniProtKB/Swiss-Prot?
- Most comprehensive warehouse of protein sequences, with a high level of annotation and highly cross-linked with other biological databases.
- Includes data on more than [...] variants, mostly c-SNPs (coding SNPs) or SAPs (single amino-acid polymorphisms).
- More than 3000 diseases associated with a protein are also described (mostly genetic diseases associated with SAPs).

Disease annotation
UniProtKB/Swiss-Prot entry P35240

Why MeSH?
- Controlled vocabulary thesaurus structured in a hierarchy of concepts.
- Each concept includes a set of terms (synonyms and lexical variants).
- MeSH is part of the UMLS and is thus linked to other medical terminologies.
- MeSH is used to index the biomedical literature.

The structure of MeSH

Mapping procedure
- UniProtKB/Swiss-Prot entry
- Disease comment line
- Extracted disease name
- OMIM: title/alternative titles
- Exact match
- Partial match
- Same descriptor
- MeSH

Disease extraction
- Extraction using regular expressions: "are the cause of", "involved in", etc.
- MeSH: Neurofibromatosis 2
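The extraction step can be illustrated with a minimal sketch. The slide only names the trigger phrases ("are the cause of", "involved in"), so the full pattern, the stop characters and the function name `extract_disease` are assumptions. The sample comment line is written in the style of a Swiss-Prot disease comment and is illustrative only.

```python
import re
from typing import Optional

# Trigger phrases come from the slide; the rest of the pattern is an assumption.
TRIGGERS = r"(?:are|is) (?:the|a) cause of|(?:are|is|may be) involved in"

# Capture the text after a trigger phrase, stopping at a bracket,
# a semicolon, a sentence-final period, or the end of the string.
DISEASE_RE = re.compile(
    rf"(?:{TRIGGERS})\s+(?P<disease>.+?)(?=\s*(?:\(|\[|;|\.(?:\s|$)|$))",
    re.IGNORECASE,
)

def extract_disease(comment: str) -> Optional[str]:
    """Return the lower-cased disease name found in a comment line, or None."""
    match = DISEASE_RE.search(comment)
    return match.group("disease").strip().lower() if match else None

# Illustrative comment line (hypothetical wording, not copied from an entry).
comment = "Defects in NF2 are the cause of neurofibromatosis 2 (NF2) [MIM:101000]."
print(extract_disease(comment))  # neurofibromatosis 2
```

A pattern of this kind is brittle by design: it favours precision, and a phrase such as "hematopoietic tumors such as b-cell lymphomas" would be captured whole.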

Term matching procedure
- Exact matches: same length, same word order, case insensitive.
- Partial matches: a similarity score between terms is calculated, based on the IDF (inverse document frequency) used in information retrieval. The term with the highest score is chosen.
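The two matching modes can be sketched as follows. The scoring formula from the slide is not preserved in this transcript, so the IDF-weighted overlap below is a stand-in, not the authors' score. The tokenisation, the smoothing, the default threshold and all function names are likewise assumptions.

```python
import math
import re
from collections import Counter

def tokens(term):
    """Lower-case a term and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", term.lower())

def is_exact_match(a, b):
    """Same number of words, same word order, case insensitive."""
    return tokens(a) == tokens(b)

def build_idf(vocabulary):
    """Smoothed inverse document frequency of each word over the vocabulary terms."""
    n = len(vocabulary)
    df = Counter(w for term in vocabulary for w in set(tokens(term)))
    return {w: math.log(n / d) + 1.0 for w, d in df.items()}

def similarity(query, candidate, idf, unseen):
    """IDF-weighted word overlap (assumed formula): shared weight / total weight."""
    q, c = set(tokens(query)), set(tokens(candidate))
    shared = sum(idf.get(w, unseen) for w in q & c)
    total = sum(idf.get(w, unseen) for w in q | c)
    return shared / total if total else 0.0

def best_partial_match(query, vocabulary, threshold=0.3):
    """Return (term, score) for the highest-scoring term, or None below the threshold."""
    idf = build_idf(vocabulary)
    unseen = math.log(len(vocabulary)) + 1.0  # weight for words absent from the vocabulary
    score, term = max((similarity(query, t, idf, unseen), t) for t in vocabulary)
    return (term, score) if score >= threshold else None

# Tiny illustrative vocabulary; a real run would use the MeSH terms.
vocabulary = [
    "Neurofibromatosis 1",
    "Neurofibromatosis 2",
    "Epidermolysis Bullosa Simplex",
    "Epidermolysis Bullosa Dystrophica",
    "Lymphoma, B-Cell",
]
print(is_exact_match("neurofibromatosis 2", "Neurofibromatosis 2"))
print(best_partial_match("epidermolysis bullosa simplex, Weber-Cockayne type", vocabulary))
```

Rare words carry more weight than common ones, so "simplex" separates the two epidermolysis bullosa terms even though both share two words with the query.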

Benchmark
- Used to evaluate the procedure in terms of recall and precision.
- Used to set up a score threshold.
- 92 disease names from 43 Swiss-Prot entries, manually mapped to MeSH terms.

Results on the Benchmark
92 disease comment lines (82 OMIM)
           Exact match                      Partial match                    Total
           Retrieval  Recall     Precision  Retrieval  Recall     Precision  Retrieval  Recall     Precision
SP         16 (17%)   16 (17%)   100%       20 (22%)   16 (17%)   80%        36 (39%)   32 (35%)   89%
OMIM       21 (23%)   21 (23%)   100%       21 (23%)   19 (21%)   90%        42 (46%)   40 (43%)   95%
SP ∩ OMIM  10 (11%)   10 (11%)   100%       8 (9%)     8 (9%)     100%       18 (20%)   18 (20%)   100%
SP ∪ OMIM  27 (29%)   27 (29%)   100%       23 (25%)   19 (21%)   83%        50 (54%)   46 (50%)   92%
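The benchmark figures can be checked with a short worked example. It assumes that the "Retrieval" column counts the mappings returned and the "Recall" column counts the correct ones, which is consistent with the reported percentages. The function name is hypothetical.

```python
def precision_recall(retrieved, correct, benchmark_size):
    """Precision = correct / retrieved; recall = correct / benchmark size."""
    precision = correct / retrieved if retrieved else 0.0
    recall = correct / benchmark_size
    return precision, recall

# SP row, "Total" columns: 36 mappings retrieved, 32 correct, 92 benchmark diseases.
p, r = precision_recall(retrieved=36, correct=32, benchmark_size=92)
print(f"precision={p:.0%} recall={r:.0%}")  # precision=89% recall=35%
```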

Analysis of the results (1/3)
Problems in granularity difference
- Disease: muscle liver brain eye nanism
- MeSH terms from the manual and automatic mappings: abnormalities, multiple; muscle-eye-brain disease

Analysis of the results (2/3)
Problems in disease name extraction
- Disease (extracted): hematopoietic tumors such as b-cell lymphomas
- MeSH terms from the manual and automatic mappings: b-cell lymphoma; hematologic neoplasms

Analysis of the results (3/3)
Problems inherent to the resources
- Disease (SP): epidermolysis bullosa simplex, Weber-Cockayne type
- Disease (OMIM alternative title): epidermolysis bullosa dystrophica, Cockayne-Touraine type
- MeSH terms from the manual and automatic mappings: epidermolysis bullosa dystrophica; epidermolysis bullosa simplex

Results on all Swiss-Prot
3197 disease comment lines (2398 OMIM)
               SP           OMIM         SP ∩ OMIM    SP ∪ OMIM
Exact match    577 (18%)    655 (20%)    354 (11%)    866 (27%)
Partial match  691 (22%)    600 (19%)    317 (10%)    751 (23%)
Total          1268 (40%)   1225 (39%)   844 (26%)    1617 (51%)

Discussion
The mapping system was tuned for high precision to provide a fully automated procedure. Recall still needs to be improved by:
- Including NLP techniques in the disease extraction and matching procedures;
- Refining the score with other parameters (e.g. information from the hierarchical structure of MeSH);
- Permitting a mapping to several MeSH terms;
- Trying to map to other terminologies such as ICD-10 and SNOMED CT;
- Using information from the literature, which is indexed with MeSH terms.

Work in progress: benchmark extended to 200 diseases
200 disease comment lines (173 OMIM)
           Exact match                      Partial match                    Total
           Retrieval  Recall     Precision  Retrieval  Recall     Precision  Retrieval  Recall     Precision
SP         35 (18%)   35 (18%)   100%       54 (27%)   47 (24%)   87%        89 (45%)   82 (41%)   92%
OMIM       40 (20%)   38 (19%)   95%        56 (28%)   48 (24%)   86%        96 (48%)   86 (43%)   90%
SP ∩ OMIM  22 (11%)   22 (11%)   100%       28 (14%)   26 (13%)   93%        62 (31%)   60 (30%)   97%
SP ∪ OMIM  52 (26%)   51 (26%)   98%        65 (33%)   56 (28%)   86%        117 (59%)  107 (54%)  91%

Work in progress
- Extract MeSH terms using full text from disease comment lines, references in Swiss-Prot and references in OMIM, then calculate their frequency.
- This frequency is used to refine the score for partial matches.
- Preliminary results: the recall was successfully increased to 62% without losing precision.
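One way to picture the frequency-based refinement is the sketch below. The slide does not say how the frequency enters the score, so the multiplicative boost, the `weight` parameter and both function names are assumptions. The texts and partial-match scores are made-up illustrative values.

```python
from collections import Counter

def mesh_term_frequencies(texts, mesh_terms):
    """Relative frequency of each candidate MeSH term across the associated texts."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for term in mesh_terms:
            counts[term] += lowered.count(term.lower())
    total = sum(counts.values())
    return {t: (counts[t] / total if total else 0.0) for t in mesh_terms}

def refine_scores(partial_scores, frequencies, weight=0.5):
    """Boost each partial-match score by its term frequency (assumed combination rule)."""
    return {
        term: score * (1.0 + weight * frequencies.get(term, 0.0))
        for term, score in partial_scores.items()
    }

# Illustrative inputs: comment text and abstracts linked to one entry (made up).
texts = [
    "Epidermolysis bullosa simplex is a skin disorder.",
    "Unlike epidermolysis bullosa dystrophica, epidermolysis bullosa simplex is intraepidermal.",
]
candidates = ["Epidermolysis Bullosa Simplex", "Epidermolysis Bullosa Dystrophica"]
partial_scores = {candidates[0]: 0.40, candidates[1]: 0.42}  # hypothetical scores

refined = refine_scores(partial_scores, mesh_term_frequencies(texts, candidates))
print(max(refined, key=refined.get))  # Epidermolysis Bullosa Simplex
```

Here the term mentioned more often in the linked texts overtakes a candidate whose lexical score was marginally higher, which is the kind of tie-breaking the slide describes.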

Conclusion
- We developed a generic terminology mapping procedure which can be used to link various biomedical resources.
- Indexing UniProtKB with medical terms opens new possibilities for searching and mining data relevant to clinical research.
- These results will help improve the interoperability between medical informatics and bioinformatics.