Facilitating the development of controlled vocabularies for metabolomics with text mining I. Spasić, 1 D. Schober, 2 S. Sansone, 2 D. Rebholz-Schuhmann,

1 Facilitating the development of controlled vocabularies for metabolomics with text mining
I. Spasić (1), D. Schober (2), S. Sansone (2), D. Rebholz-Schuhmann (2), D. Kell (1), N. Paton (1) and the MSI Ontology Working Group Members (3)
(1) MCISB, (2) EBI, (3) MSI

2 Motivation
experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology & bioinformatics
controlled vocabularies and ontologies play a crucial role in consistent interpretation and seamless integration of information scattered across public resources
the pressing need for vocabularies and ontologies for metabolomics

3 Metabolomics Society
the most recent community-wide initiative to coordinate the efforts in standardising reporting structures of metabolomics experiments
five working groups:
– biological sample context
– chemical analysis
– data analysis
– ontology
– data exchange

4 MSI OWG
Metabolomics Standardisation Initiative Ontology WG
coordinated by Dr Susanna-Assunta Sansone
develop a common semantic framework for metabolomics studies by means of
– controlled vocabularies
– ontologies
so as to be able to:
– describe the experimental process consistently
– ensure meaningful and unambiguous data exchange

5 Scope
the coverage of the domain reflects the typical structure of metabolomics investigations:
– general components (investigation design; sample source, characteristics, treatments and collection; computational analysis)
– technology-specific components (sample preparation; instrumental analysis; data pre-processing)
analytical technologies: mass spectrometry (MS), gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), nuclear magnetic resonance (NMR) spectroscopy…

6 Terms
terms:
– linguistic representations of domain-specific concepts
– means of conveying scientific and technical information
CV terms:
– used to tag units of information so that they can be more easily retrieved by a search
– improve technical communication by ensuring that everyone is using the same term to mean the same thing

7 Term acquisition
CV terms are chosen and organised by trained professionals who possess expertise in the subject area
in the rapidly developing domain of metabolomics, new analytical techniques emerge regularly, thus often compelling domain experts to use non-standardised terms
problem: manual term acquisition approaches are time-consuming, labour-intensive and error-prone
solution: a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a CV with terms already in use in the scientific literature

8 Strategy
each CV is compiled in an iterative process consisting of the following steps:
1. create an initial CV by re-using the existing terminologies from database models, glossaries, etc. and normalise the terms according to the common naming conventions
2. expand the CV with other frequently co-occurring terms identified automatically using text mining over a relevant corpus of scientific publications
3. circulate the proposed CV to the practitioners in the relevant area of metabolomics for validation in order to ensure its quality and completeness

9 A text mining workflow
1. information retrieval: gather a technology-specific corpus of documents
– search terms: MeSH terms & CV terms
– documents: abstracts & full papers
– resources: Entrez MEDLINE & PubMed Central (PMC)
2. term recognition: extract terms as lexical units frequently occurring in a domain-specific corpus
– method: C-value provided by NaCTeM
3. term filtering: filter out terms not directly related to a given technology, such as those denoting substances, organisms, organs, diseases, etc.
– resources: UMLS MetaThesaurus & Semantic Network

10 Information retrieval using MeSH terms
MeSH = Medical Subject Headings
MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed
MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts

11 IR using MeSH terms
finding the relevant MeSH terms using the MeSH browser
look up: NMR
resulting MeSH term(s): Magnetic Resonance Spectroscopy
PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms]
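
The PubMed query on this slide could also be issued programmatically. The following is a minimal sketch, assuming the NCBI E-utilities `esearch` endpoint is used for retrieval; the slides do not say how the query was actually submitted, and the function name `build_esearch_url` and the `retmax` default are illustrative choices. The code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service (assumed retrieval route).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(mesh_term, db="pubmed", retmax=100):
    """Build an esearch URL restricted to a MeSH heading (hypothetical helper)."""
    # Same field tag as in the PubMed query shown on the slide.
    query = f"{mesh_term}[MeSH Terms]"
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_esearch_url("Magnetic Resonance Spectroscopy")
print(url)
```

Passing `db="pmc"` would target PubMed Central instead of MEDLINE/PubMed, which matters for the full-text retrieval discussed on the next slide.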

12 Beyond MeSH terms
NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study
an abstract is expected to report only the results discovered, not the experimental conditions leading to these results
the experimental conditions are typically reported within Materials & Methods sections or as part of the supplementary material
it is important to process full-text articles as opposed to abstracts only
as a consequence, an IR approach based on MeSH terms or search terms limited to abstracts will result in low recall (i.e. many of the relevant articles will be overlooked)
[Figure labels: NMR; MEDLINE (abstracts); PubMed Central (full papers); biomedical literature]
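
To make the recall argument concrete, here is a small sketch that compares abstract-only and full-text retrieval. The document IDs and hit sets are invented purely for illustration and are not results from the presentation.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually retrieved."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(retrieved & relevant) / len(relevant)


# Toy data (assumption): four articles really use NMR in their methods.
relevant = {"a", "b", "c", "d"}
abstract_hits = {"a"}            # NMR is mentioned in only one abstract
full_text_hits = {"a", "b", "c"}  # the Methods sections reveal two more

print(recall(abstract_hits, relevant))   # abstract-only search
print(recall(full_text_hits, relevant))  # full-text search
```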

13 Selecting search terms 2400

14 Selecting documents
[Figure labels: doc ID; number of matching terms > threshold; local corpus; = 3]
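
The selection rule on this slide (keep a document when its number of matching terms exceeds a threshold) can be sketched as follows. This is an illustration under assumptions: matching is done by case-insensitive substring search, and the documents, search terms and threshold value are invented for the example; the slides do not specify how matching was implemented.

```python
def count_matching_terms(text, search_terms):
    """Count how many distinct search terms occur in the text (substring match)."""
    lowered = text.lower()
    return sum(1 for term in search_terms if term.lower() in lowered)


def select_documents(documents, search_terms, threshold):
    """Return the IDs of documents whose matching-term count exceeds the threshold."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if count_matching_terms(text, search_terms) > threshold
    ]


# Toy inputs (assumption, not data from the presentation).
documents = {
    "doc1": "Samples were analysed by NMR spectroscopy; the chemical shift "
            "and pulse sequence were recorded.",
    "doc2": "Patients were recruited at two sites; NMR is mentioned once.",
}
search_terms = ["NMR", "chemical shift", "pulse sequence", "free induction decay"]

local_corpus = select_documents(documents, search_terms, threshold=2)
print(local_corpus)
```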

15 Term recognition: C-value

16 C-value
syntactic pattern matching used to select term candidates:
(ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*) N
termhood of each candidate term t is calculated using:
– |t|: its length as the number of words
– f(t): its frequency of occurrence
– S(t): the set of other candidate terms containing it as a subphrase
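
The formula itself did not survive in the transcript. The sketch below uses the commonly cited C-value formulation of Frantzi et al., expressed with the three quantities listed above: a candidate t scores log2|t| · f(t) when S(t) is empty, and log2|t| · (f(t) − the mean of f(s) over s in S(t)) otherwise. Whether the presentation used exactly this variant is an assumption, and the frequencies are invented for illustration.

```python
import math


def is_subphrase(short, long):
    """True if `short` occurs as a contiguous word sequence inside a longer `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_values(frequencies):
    """Compute C-value scores from a {candidate term: frequency} mapping."""
    tokenised = {term: tuple(term.split()) for term in frequencies}
    scores = {}
    for term, words in tokenised.items():
        # S(t): the other candidates that contain this term as a subphrase.
        containing = [
            other for other, other_words in tokenised.items()
            if is_subphrase(words, other_words)
        ]
        adjusted = frequencies[term]
        if containing:
            # Discount occurrences explained by the longer, nesting candidates.
            adjusted -= sum(frequencies[s] for s in containing) / len(containing)
        # log2(1) = 0, so single-word candidates always score 0 in this variant.
        scores[term] = math.log2(len(words)) * adjusted
    return scores


# Toy frequencies (assumption, not results from the presentation).
scores = c_values({
    "magnetic resonance": 10,
    "nuclear magnetic resonance": 6,
    "nuclear magnetic resonance spectroscopy": 4,
})
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:6.2f}  {term}")
```

Candidates that appear mostly inside longer candidates are penalised, which is why the nested "magnetic resonance" ranks below the four-word term despite its higher raw frequency.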

17 C-value results

18 Unified Medical Language System (UMLS)
UMLS = an ontology which merges information from over 100 biomedical source vocabularies
UMLS contains the following semantic classes relevant to our problem:
– Organism (A.1.1)
– Anatomical Structure (A.1.2)
– Substance (A.1.4)
– Biological Function (B)
– Injury or Poisoning (B.2.3)
we used these classes to automatically extract the corresponding terms from the UMLS thesaurus
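
The filtering step can be sketched as a lookup against terms extracted for the classes above. The dictionary `semantic_class_of` below is a hand-made stand-in for terms pulled from the UMLS thesaurus; its entries and class assignments are assumptions for illustration, and real UMLS access (licensed data files or an API) is not shown.

```python
# Semantic classes listed on the slide; terms in these classes are filtered out.
EXCLUDED_CLASSES = {
    "Organism",
    "Anatomical Structure",
    "Substance",
    "Biological Function",
    "Injury or Poisoning",
}


def filter_terms(candidates, semantic_class_of):
    """Keep candidates that are not known members of an excluded semantic class."""
    return [
        term for term in candidates
        if semantic_class_of.get(term.lower()) not in EXCLUDED_CLASSES
    ]


# Toy lookup table (assumption): term -> semantic class.
semantic_class_of = {
    "glucose": "Substance",
    "liver": "Anatomical Structure",
    "mouse": "Organism",
}

candidates = ["pulse sequence", "glucose", "chemical shift", "Liver"]
print(filter_terms(candidates, semantic_class_of))
```

Terms absent from the lookup are kept, so the filter errs on the side of leaving candidates for the expert validation step described in the strategy.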

19 Summary UMLS


21 Results
input: 243 NMR terms & 152 GC terms
output: 5,699 NMR terms & 2,612 GC terms
2%

22 The End
