1 Facilitating the development of controlled vocabularies for metabolomics with text mining
I. Spasić,1 D. Schober,2 S. Sansone,2 D. Rebholz-Schuhmann,2 D. Kell,1 N. Paton1 and the MSI Ontology Working Group Members3
1 MCISB 2 EBI 3 MSI

2 Motivation
- experimental data sets from metabolomics studies need to be integrated with one another, and also with data produced by other types of omics studies, in the spirit of systems biology & bioinformatics
- controlled vocabularies and ontologies play a crucial role in the consistent interpretation and seamless integration of information scattered across public resources
- hence the pressing need for vocabularies and ontologies for metabolomics

3 Metabolomics Society
the most recent community-wide initiative to coordinate the efforts in standardising the reporting structures of metabolomics experiments
five working groups (WGs), founded to cover the key areas for describing metabolomics experiments:
- biological sample context
- chemical analysis
- data analysis
- ontology
- data exchange
The minimal reporting requirements identified by the first three WGs will inform the development of the data exchange standards and the ontology, in order to provide a common mode of transporting information between systems.

4 MSI OWG Metabolomics Standardisation Initiative Ontology WG
coordinated by Dr Susanna-Assunta Sansone
develop a common semantic framework for metabolomics studies by means of:
- controlled vocabularies
- ontologies
so as to be able to:
- describe the experimental process consistently
- ensure meaningful and unambiguous data exchange

5 Scope
the coverage of the domain reflects the typical structure of metabolomics investigations:
- general components (investigation design; sample source, characteristics, treatments and collection; computational analysis)
- technology-specific components (sample preparation; instrumental analysis; data pre-processing)
- analytical technologies: mass spectrometry (MS), gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), nuclear magnetic resonance (NMR) spectroscopy…
The first step in developing an ontology is to identify its purpose and scope, followed by knowledge acquisition, manual and/or automatic, from sources such as domain specialists, literature, databases, existing ontologies…
pragmatic approach: re-use the existing efforts in related areas rather than re-invent the wheel. The general aspects of metabolomics investigations largely overlap with those in other omics domains, where standardisation efforts are already underway (e.g. HUPO-PSI and MGED). For MS, the MSI OWG will build on previous work by the PSI MS Ontology WG. The technologies in the current focus of the MSI OWG are NMR and GC.

6 Terms
terms: linguistic representations of domain-specific concepts; a means of conveying scientific and technical information
CV terms (from Wikipedia):
- used to tag units of information so that they can be more easily retrieved by a search
- improve technical communication by ensuring that everyone is using the same term to mean the same thing

7 Term acquisition
CV terms are chosen and organised by trained professionals who possess expertise in the subject area
in the rapidly developing domain of metabolomics, new analytical techniques emerge regularly, often compelling domain experts to use non-standardised terms
problem: manual term acquisition approaches are time-consuming, labour-intensive and error-prone
solution: a text mining (TM) method for efficient corpus-based term acquisition, as a way of rapidly expanding a CV with terms already in use in the scientific literature
TM reduces the time and cost of compiling a CV and sustains its completeness

8 Strategy
each CV is compiled in an iterative process consisting of the following steps:
1. create an initial CV by re-using the existing terminologies from database models, glossaries, etc., and normalise the terms according to the common naming conventions
2. expand the CV with other frequently co-occurring terms identified automatically using text mining over a relevant corpus of scientific publications
3. circulate the proposed CV to the practitioners in the relevant area of metabolomics for validation, in order to ensure its quality and completeness
Metabolomics is a highly dynamic domain, so experts often use non-standardised terms.
Requirements: reduce the time and cost of compiling a CV; sustain the completeness of a CV
Solution: acquire terms automatically from the scientific literature using a text mining (TM) approach

9 A text mining workflow
information retrieval: gather a technology-specific corpus of documents
- search terms: MeSH terms & CV terms
- documents: abstracts & full papers
- resources: Entrez (MEDLINE & PubMed Central (PMC))
term recognition: extract terms as lexical units frequently occurring in a domain-specific corpus
- method: C-value, provided by NaCTeM
term filtering: filter out terms not directly related to a given technology, such as those denoting substances, organisms, organs, diseases, etc.
- resources: UMLS (Metathesaurus & Semantic Network)
There is no ready-made text mining solution, and building customised TM applications from scratch is expensive. The alternative is to build customised workflows from existing components that are publicly available via web services.
Terms denoting substances… form typically co-occurring classes which, in contrast to the considered analytical techniques, have more established CVs.

10 Information retrieval using MeSH terms
MeSH = Medical Subject Headings
MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed
MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts

11 IR using MeSH terms
finding the relevant MeSH terms using the MeSH browser:
look up: NMR
resulting MeSH term(s): Magnetic Resonance Spectroscopy
PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms]
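The MeSH query above can also be submitted programmatically. The following is a minimal sketch, assuming the public NCBI E-utilities ESearch endpoint is used; the slides do not show how the query was actually issued, and the helper name build_esearch_url and the retmax value are illustrative only. The sketch only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public NCBI E-utilities ESearch endpoint (assumed here; not named on the slide).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=20):
    """Return an ESearch request URL for `term` in the given Entrez database.

    Hypothetical helper for illustration; `retmax` caps the number of IDs returned.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


# The MeSH-restricted query from the slide.
query = "Magnetic Resonance Spectroscopy[MeSH Terms]"
url = build_esearch_url(query)
print(url)
```

Fetching this URL (e.g. with urllib.request) should return the matching PubMed IDs; passing db="pmc" would target the full-text collection instead.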

12 Beyond MeSH terms
NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study, so an abstract can be expected to report only the results discovered, not the experimental conditions leading to these results
the experimental conditions are typically reported within “Materials & Methods” sections or as part of the supplementary material, so it is important to process the full-text articles as opposed to abstracts only
as a consequence, an IR approach based on MeSH terms, or on search terms limited to abstracts, will result in low recall (i.e. many of the relevant articles will be overlooked)
objective: to increase the recall and obtain papers that describe research that utilises the NMR technology, but which do not deal with NMR per se and therefore are not indexed by NMR-related MeSH terms
approach: use NMR ontology terms as search terms over full-text articles
[Figure: biomedical literature, showing MEDLINE (abstracts) and PubMed Central (full papers), each labelled with NMR]

13 Selecting search terms
1. for each term, obtain the number of papers returned from PMC
2. sort the terms by the number of papers they return; set a cut-off point to remove the terms that return too many papers (non-discriminatory terms), as they are likely to be broad terms not limited to NMR and would therefore introduce a lot of noise
3. for each remaining term, retrieve the PMC IDs of the papers it returns
4. sort the PMC IDs according to the number of times they are retrieved; set a cut-off point to remove the ones that do not contain a sufficient number of known NMR terms
5. retrieve the full papers from PMC
Start from the ontology:
    SELECT * FROM taxonomy;
Extract the terms from the ontology:
    SELECT * FROM terms;
For each term, retrieve the number of papers it returns from PMC:
    SELECT term, doc FROM pmc_terms;
A high number of returned documents reveals noisy terms, i.e. the ones that are NMR-related but not NMR-specific, and which may therefore return irrelevant documents. A brief overview of the terms sorted by the number of documents they return clearly shows that the ones on top are very general:
    SELECT term, doc FROM pmc_terms ORDER BY doc DESC;
The (manually) chosen cut-off point: 2400
Observe how the acronym NMR starts occurring more frequently after the cut-off point.
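The cut-off step can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions: the table name pmc_terms and its columns term and doc come from the queries above, but the rows and their counts are invented, and treating the cut-off as inclusive (doc <= 2400) is also an assumption.

```python
import sqlite3

CUTOFF = 2400  # the manually chosen cut-off point from the slide

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pmc_terms (term TEXT PRIMARY KEY, doc INTEGER)")

# Invented document counts, for illustration only (not the authors' data).
conn.executemany(
    "INSERT INTO pmc_terms VALUES (?, ?)",
    [
        ("sample", 150000),
        ("temperature", 90000),
        ("NMR", 2100),
        ("chemical shift", 1200),
        ("pulse sequence", 800),
    ],
)

# Terms above the cut-off are treated as non-discriminatory and dropped.
dropped = [row[0] for row in conn.execute(
    "SELECT term FROM pmc_terms WHERE doc > ? ORDER BY doc DESC", (CUTOFF,))]

# The remaining terms are kept as search terms, most frequent first.
kept = [row[0] for row in conn.execute(
    "SELECT term FROM pmc_terms WHERE doc <= ? ORDER BY doc DESC", (CUTOFF,))]

print("dropped:", dropped)
print("kept:", kept)
```

The cut-off stays a single constant so that it can be tuned by inspecting the sorted list, as was done manually on the slide.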

14 Selecting documents
select the documents (doc IDs) whose number of matching terms ≥ threshold = 3
Problem: if we set the threshold too high, then we won't retrieve many documents; if we set it too low, the number of noisy documents will increase
    SELECT pmc_id, COUNT(*) FROM pmc_retrieval GROUP BY pmc_id HAVING COUNT(*) >= 3 ORDER BY COUNT(*) DESC;
[Figure labels: doc ID, number of matching terms, local corpus]
NB: How PMC works
Automatic term mapping: untagged terms that are entered in the search box are matched (in this order) against a MeSH (Medical Subject Headings) translation table, a Journals translation table, the Full Author translation table, and an Author index. When a match is found for a term or phrase in a translation table, the mapping process is complete and does not continue to the next translation table.
The PMC MeSH translation table contains:
- MeSH terms
- the See-Reference mappings (also known as entry terms) for MeSH terms
- MeSH Subheadings
- terms derived from the Unified Medical Language System (UMLS) that have equivalent synonyms or lexical variants in English
- supplementary concept (substance) names and their synonyms
If a match is found in this translation table, the term will be searched as MeSH (which includes the MeSH term and any specific terms indented under that term in the MeSH hierarchy), and as a Text Word. For example, if you enter “vitamin c” in the query box, PMC will translate this search to:
    "ascorbic acid"[MeSH Terms] OR vitamin c[Text Word]
If no match is found, PMC breaks apart the phrase and repeats the above automatic term mapping process until a match is found. PMC ignores stopwords in searches. If there is no match, the individual terms will be combined (with AND) and searched in all fields.
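The document selection step (a threshold on the number of matching terms per document) can be tried out on toy data. This is a sketch only: the table pmc_retrieval and the column pmc_id come from the query on the slide, while the term column, the PMC IDs and the rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (search term, retrieved paper) pair; the `term` column is assumed.
conn.execute("CREATE TABLE pmc_retrieval (term TEXT, pmc_id INTEGER)")
conn.executemany(
    "INSERT INTO pmc_retrieval VALUES (?, ?)",
    [
        # Invented retrieval results, for illustration only.
        ("chemical shift", 101), ("pulse sequence", 101),
        ("free induction decay", 101), ("NMR probe", 101),
        ("chemical shift", 102), ("pulse sequence", 102),
        ("chemical shift", 103), ("NMR probe", 103),
        ("free induction decay", 103),
    ],
)


def select_documents(threshold):
    """Return (pmc_id, matching term count) for papers with at least `threshold` matches."""
    return conn.execute(
        "SELECT pmc_id, COUNT(*) FROM pmc_retrieval "
        "GROUP BY pmc_id HAVING COUNT(*) >= ? "
        "ORDER BY COUNT(*) DESC, pmc_id",
        (threshold,),
    ).fetchall()


selected = select_documents(3)
print(selected)
print(select_documents(4))  # a stricter threshold keeps fewer papers
```

Raising the threshold from 3 to 4 drops paper 103 in this toy data, which is the trade-off described on the slide: fewer noisy documents, but also fewer documents overall.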

15 Term recognition: C-value

16 C-value
syntactic pattern matching is used to select term candidates:
    (ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*) N
the termhood of each candidate term t is calculated using:
- |t|: its length as the number of words
- f(t): its frequency of occurrence
- S(t): the set of other candidate terms containing it as a subphrase
part-of-speech (POS) tagging is the process of tagging the words in a text as corresponding to a particular part of speech (e.g. noun, verb, adjective), based on each word's definition and its context (i.e. its relation to adjacent and related words), e.g.:
    beta/ADJ isoforms/N of/PREP glucocorticoid/N receptor/N
the POS information is used later during syntactic pattern matching, which extracts only those word sequences that conform to certain syntactic rules
finally, a stop list is used to eliminate candidate terms extracted via syntactic pattern matching that are not likely to be terms. The stop list is corpus-dependent and contains frequently occurring words such as great, numerous, several, etc.
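The termhood formula itself did not survive in this transcript. The sketch below uses the C-value definition as it is usually stated in the term recognition literature, built from the same three quantities: log2|t| · f(t) for a term that is not nested in any other candidate, and log2|t| · (f(t) − the mean of f(s) over s in S(t)) otherwise. It is not the NaCTeM implementation, the function names are illustrative, and the frequencies are invented.

```python
import math


def is_subphrase(short, long):
    """True if the word tuple `short` occurs contiguously inside the longer tuple `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(freq):
    """Map each candidate term (a tuple of words) to its C-value.

    `freq` maps every candidate term t to its corpus frequency f(t).
    """
    scores = {}
    for t, f_t in freq.items():
        # S(t): the other candidate terms that contain t as a subphrase.
        containing = [s for s in freq if is_subphrase(t, s)]
        if containing:
            # Discount the occurrences explained by longer candidate terms.
            nested = sum(freq[s] for s in containing) / len(containing)
            scores[t] = math.log2(len(t)) * (f_t - nested)
        else:
            scores[t] = math.log2(len(t)) * f_t
    return scores


# Invented frequencies, for illustration only.
freq = {
    ("glucocorticoid", "receptor"): 10,
    ("beta", "isoforms", "of", "glucocorticoid", "receptor"): 2,
    ("chemical", "shift"): 6,
}
scores = c_value(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

Note that with this form of the formula a single-word candidate always scores 0, because log2(1) = 0; some variants use log2(|t| + 1) to avoid this.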

17 C-value results
problem: C-value extracts statistically significant terms, but it does not differentiate between the “NMR terms” and the terms representing the study subjects, for which NMR is used only as an analytical technique and is not itself the focus of the study
we need to reduce the number of terms not directly related to the NMR technique
solution: an initial inspection of the automatically extracted terms revealed the main types of concepts studied using NMR: substances, organisms, organs, conditions/diseases…
a straightforward approach to filtering out such terms is to match existing dictionaries of these terms against the list of automatically extracted terms
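Dictionary-based filtering of this kind can be sketched as a normalised set lookup. This assumes exact matching after lower-casing and whitespace normalisation, which is a simplification; the dictionary entries and candidate terms below are invented examples, not the authors' data.

```python
def normalise(term):
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def filter_terms(candidates, dictionary):
    """Return the candidates that do not match any dictionary entry."""
    known = {normalise(entry) for entry in dictionary}
    return [t for t in candidates if normalise(t) not in known]


# Invented dictionary of study-subject terms (substances, organisms, organs, diseases).
study_subject_terms = {"glucose", "Escherichia coli", "liver", "diabetes mellitus"}

# Invented list of automatically extracted candidate terms.
candidates = ["chemical shift", "Glucose", "pulse sequence", "escherichia  coli", "liver"]

kept = filter_terms(candidates, study_subject_terms)
print(kept)
```

Exact matching will miss inflected or abbreviated variants, so a real filter would likely need the synonyms and lexical variants that the dictionary resource provides.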

18 Unified Medical Language System (UMLS)
UMLS = an “ontology” which merges information from over 100 biomedical source vocabularies
UMLS contains the following semantic classes relevant to our problem:
- Organism A.1.1
- Anatomical Structure A.1.2
- Substance A.1.4
- Biological Function B
- Injury or Poisoning B.2.3
we used these classes to automatically extract the corresponding terms from the UMLS thesaurus
UMLS is aimed at facilitating the development of information systems for text processing in the domain of biomedicine. Its role in such systems is to provide a formal representation of the domain-specific knowledge in order to process, retrieve, integrate and aggregate biomedical data and information contained in the relevant literature. UMLS is multilingual. It currently contains over one million concepts named by 2.8 million terms. Concepts are organised into a hierarchy of 135 classes and interconnected by 54 different relations.

19 Summary
A Web service is any piece of software that makes itself available over the Internet and uses a standardised XML messaging system. The NCBI Web service enables developers to access the Entrez Utilities (E-Utilities) via the Simple Object Access Protocol (SOAP), so programmers may write software applications that access the E-Utilities using any SOAP development tool.
Unfortunately, NaCTeM does not provide a Web service for automatic term recognition (ATR). If we want to implement CV expansion as a workflow (e.g. using Taverna) and make it publicly accessible as a Web service, then we need a Web service for ATR.

20 from Wikipedia

21 Results
input: 243 NMR terms & 152 GC terms
output: 5,699 NMR terms & 2,612 GC terms
term acquisition results (see Table above) for the two case studies and IR approaches using the MeSH terms and the seed CV terms (at least 3 and 7 matching terms for abstracts and full papers respectively)
[Although the articles in PMC are freely available for browsing, for most of them the publisher does not allow downloading the text in XML format, nor does PMC allow bulk downloading in HTML format. Hence, we were able to process only a small portion of the full papers (the numbers in brackets refer to these papers).]
Total number of new terms acquired: 5,699 for NMR; 2,612 for GC
Preliminary results are available at the MSI OWG web site, where the potential CV terms are open to the metabolomics community for comments and curation.
The average ratio between the number of acquired technology-specific terms and the corpus size: 16.25 for full papers; only 0.13 for abstracts
Overlap of terms acquired from abstracts and full papers: only 2% on average.
This comparison confirms that the Materials and Methods sections represent a significant source of technology-specific terms, and also emphasises the need to make full-text articles available to TM applications for the benefit of the overall biomedical community.

22 The End
