Presentation on theme: "I. Spasić,1 D. Schober,2 S. Sansone,2 D. Rebholz-Schuhmann,2 D"— Presentation transcript:
1 Facilitating the development of controlled vocabularies for metabolomics with text mining
I. Spasić,1 D. Schober,2 S. Sansone,2 D. Rebholz-Schuhmann,2 D. Kell,1 N. Paton1 and the MSI Ontology Working Group Members3
1 MCISB, 2 EBI, 3 MSI
2 Motivation
- experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology & bioinformatics
- controlled vocabularies and ontologies play a crucial role in consistent interpretation and seamless integration of information scattered across public resources
- hence the pressing need for vocabularies and ontologies for metabolomics
3 Metabolomics Society
http://www.metabolomicssociety.org
- the most recent community-wide initiative to coordinate the efforts in standardising reporting structures of metabolomics experiments
- five working groups (WGs):
  - biological sample context
  - chemical analysis
  - data analysis
  - ontology
  - data exchange
The five WGs were founded to cover the key areas for describing metabolomics experiments. The minimal reporting requirements identified by the first three WGs will inform the development of data exchange standards and the ontology in order to provide a common mode of transporting information between systems.
4 MSI OWG
Metabolomics Standardisation Initiative Ontology WG, coordinated by Dr Susanna-Assunta Sansone
- develop a common semantic framework for metabolomics studies by means of:
  - controlled vocabularies
  - ontologies
- so as to be able to:
  - describe the experimental process consistently
  - ensure meaningful and unambiguous data exchange
5 Scope
- the coverage of the domain reflects the typical structure of metabolomics investigations:
  - general components (investigation design; sample source, characteristics, treatments and collection; computational analysis)
  - technology-specific components (sample preparation; instrumental analysis; data pre-processing)
- analytical technologies: mass spectrometry (MS), gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), nuclear magnetic resonance (NMR) spectroscopy…
The first step in developing an ontology involves identifying its purpose and scope, followed by knowledge acquisition, manual and/or automatic, from sources such as domain specialists, literature, databases, existing ontologies…
- pragmatic approach based on re-using the existing efforts in related areas rather than re-inventing the wheel:
  - The general aspects of metabolomics investigations largely overlap with those in other omics domains, where standardisation efforts are already underway (e.g. HUPO-PSI and MGED).
  - For MS, the MSI OWG will build on previous work by the PSI MS Ontology WG. The technologies in the current focus of the MSI OWG are NMR and GC.
6 Terms
- terms:
  - linguistic representations of domain-specific concepts
  - means of conveying scientific and technical information
- CV terms:
  - used to tag units of information so that they can be more easily retrieved by a search
  - improve technical communication by ensuring that everyone is using the same term to mean the same thing
(from Wikipedia)
7 Term acquisition
- CV terms are chosen and organised by trained professionals who possess expertise in the subject area
- in the rapidly developing domain of metabolomics, new analytical techniques emerge regularly, thus often compelling domain experts to use non-standardised terms
- problem: manual term acquisition approaches are time-consuming, labour-intensive and error-prone
- solution: a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a CV with terms already in use in the scientific literature
- TM reduces the time and cost of compiling a CV and sustains its completeness
8 Strategy
Each CV is compiled in an iterative process consisting of the following steps:
1. create an initial CV by re-using the existing terminologies from database models, glossaries, etc. and normalise the terms according to the common naming conventions
2. expand the CV with other frequently co-occurring terms identified automatically using text mining over a relevant corpus of scientific publications
3. circulate the proposed CV to the practitioners in the relevant area of metabolomics for validation in order to ensure its quality and completeness
Metabolomics is highly dynamic, so domain experts often use non-standardised terms.
Requirements:
- reduce the time and cost of compiling a CV
- sustain the completeness of a CV
Solution: acquire terms automatically from the scientific literature using a text mining (TM) approach
9 A text mining workflow
1. information retrieval: gather a technology-specific corpus of documents
   - search terms: MeSH terms & CV terms
   - documents: abstracts & full papers
   - resources: Entrez (MEDLINE & PubMed Central (PMC))
2. term recognition: extract terms as lexical units frequently occurring in a domain-specific corpus
   - method: C-value provided by NaCTeM
3. term filtering: filter out terms not directly related to a given technology, such as those denoting substances, organisms, organs, diseases, etc.
   - resources: UMLS (Metathesaurus & Semantic Network)
- no ready-made text mining solution
- building customised TM applications from scratch is expensive
- alternative: building customised workflows from the existing components publicly available via web services
Note on term filtering: the terms filtered out belong to typically co-occurring classes of terms denoting substances…, which, in contrast to the considered analytical techniques, have more established CVs.
10 Information retrieval using MeSH terms
- MeSH = Medical Subject Headings
- MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed
- MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts
11 IR using MeSH terms
- finding the relevant MeSH terms using the MeSH browser
- look up: NMR
- resulting MeSH term(s): Magnetic Resonance Spectroscopy
- PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms]
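The PubMed query above can also be issued programmatically. The following is a minimal sketch, assuming the URL-based NCBI E-utilities ESearch endpoint rather than the SOAP interface mentioned later in the talk; the helper name build_esearch_url and the retmax default are illustrative choices, not part of the original slides. It only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(mesh_term: str, db: str = "pubmed", retmax: int = 100) -> str:
    """Build an ESearch URL that restricts the query to a MeSH heading.

    Hypothetical helper for illustration: `db`, `term` and `retmax` are
    ESearch parameters; the defaults chosen here are assumptions.
    """
    query = f'"{mesh_term}"[MeSH Terms]'
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("Magnetic Resonance Spectroscopy")
print(url)
```

Fetching the URL returns the matching record IDs, which a later step could pass on to retrieve the abstracts themselves.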
12 Beyond MeSH terms
- NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study, so an abstract is expected to report only the results discovered and not the experimental conditions leading to these results
- the experimental conditions are typically reported within “Materials & Methods” sections or as part of the supplementary material, so it is important to process full-text articles as opposed to abstracts only
- as a consequence, an IR approach based on MeSH terms or search terms limited to abstracts will result in low recall (i.e. many of the relevant articles will be overlooked)
- objective: to increase the recall and obtain papers that describe research that utilises the NMR technology, but which do not deal with NMR per se and therefore are not indexed by NMR-related MeSH terms
- approach: use NMR ontology terms as search terms over full-text articles
[Figure: NMR-related articles within MEDLINE (abstracts) and PubMed Central (full papers), both shown as parts of the biomedical literature]
13 Selecting search terms
- for each term, obtain the number of papers returned from PMC
- sort the terms by the number of papers they return; set a cut-off point to remove the terms that return too many papers (non-discriminatory terms), as they are likely to be broad terms not limited to NMR and therefore would introduce a lot of noise
- for each remaining term, retrieve the PMC IDs of the papers it returns
- sort the PMC IDs according to the number of times they are retrieved; set a cut-off point to remove the ones that do not contain a sufficient number of known NMR terms
- retrieve the full papers from PMC
Start from the ontology:
    SELECT * FROM taxonomy;
Extract the terms from the ontology:
    SELECT * FROM terms;
For each term, retrieve the number of papers it returns from PMC:
    SELECT term, doc FROM pmc_terms;
A high number of returned documents reveals noisy terms, i.e. the ones that are NMR-related but not NMR-specific, and therefore may return irrelevant documents.
A brief overview of the terms sorted by the number of documents they return clearly shows that the ones on top are very general:
    SELECT term, doc
    FROM pmc_terms
    ORDER BY doc DESC;
The (manually) chosen cut-off point: 2400
Observe how the acronym NMR starts occurring more frequently after the cut-off point.
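The cut-off step can be sketched in a few lines. This is an illustration only: the term counts below are invented, the function name select_search_terms is hypothetical, and treating the cut-off as inclusive (terms returning exactly 2400 papers are kept) is an assumption, since the slide does not say which side of the boundary is retained.

```python
def select_search_terms(doc_counts, cutoff=2400):
    """Rank terms by the number of papers they return and drop noisy ones.

    `doc_counts` maps a term to the number of PMC papers it returns.
    Terms returning more than `cutoff` papers are treated as
    non-discriminatory and removed (inclusive cut-off is an assumption).
    """
    ranked = sorted(doc_counts.items(), key=lambda item: item[1], reverse=True)
    return [term for term, docs in ranked if docs <= cutoff]


# Hypothetical counts, invented for illustration (not the talk's data).
pmc_terms = {
    "spectrum": 41000,
    "signal": 35000,
    "NMR": 2300,
    "chemical shift": 900,
    "free induction decay": 40,
}

search_terms = select_search_terms(pmc_terms)
print(search_terms)
```

Because the cut-off was chosen manually in the talk, it is kept as a parameter here so that a different technology (e.g. GC) can use a different value.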
14 Selecting documents
[Figure: number of matching terms per doc ID; threshold = 3; the selected documents form the local corpus]
Problem: if we set the threshold too high, then we won’t retrieve many documents; if we set it too low, the number of noisy documents will increase.
    SELECT pmc_id, COUNT(*)
    FROM pmc_retrieval
    GROUP BY pmc_id
    HAVING COUNT(*) >= 3
    ORDER BY COUNT(*) DESC;
NB: How PMC works: automatic term mapping
Untagged terms that are entered in the search box are matched (in this order) against a MeSH (Medical Subject Headings) translation table, a Journals translation table, the Full Author translation table, and an Author index. When a match is found for a term or phrase in a translation table, the mapping process is complete and does not continue to the next translation table.
PMC MeSH translation table
The PMC MeSH translation table contains:
- MeSH terms
- the See-Reference mappings (also known as entry terms) for MeSH terms
- MeSH Subheadings
- terms derived from the Unified Medical Language System (UMLS) that have equivalent synonyms or lexical variants in English
- supplementary concept (substance) names and their synonyms
If a match is found in this translation table, the term will be searched as MeSH (which includes the MeSH term and any specific terms indented under that term in the MeSH hierarchy), and as a Text Word.
For example, if you enter “vitamin c” in the query box, PMC will translate this search to: "ascorbic acid"[MeSH Terms] OR vitamin c[Text Word]
If no match is found?
PMC breaks apart the phrase and repeats the above automatic term mapping process until a match is found. PMC ignores stopwords in searches. If there is no match, the individual terms will be combined (with AND) together and searched in all fields.
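The document selection query can be tried out on a toy table. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the two-column layout of pmc_retrieval (pmc_id, term) and the rows themselves are assumptions made for illustration, as the slides do not show the schema. A secondary sort on pmc_id is added only to make the output order deterministic.

```python
import sqlite3

# In-memory database with an assumed schema: one row per (paper, matching term).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pmc_retrieval (pmc_id TEXT, term TEXT)")

# Hypothetical retrieval results, invented for illustration.
rows = [
    ("PMC1", "NMR"), ("PMC1", "chemical shift"),
    ("PMC1", "pulse sequence"), ("PMC1", "free induction decay"),
    ("PMC2", "NMR"), ("PMC2", "chemical shift"), ("PMC2", "pulse sequence"),
    ("PMC3", "NMR"),
]
conn.executemany("INSERT INTO pmc_retrieval VALUES (?, ?)", rows)

THRESHOLD = 3  # minimum number of matching terms, as on the slide

selected = conn.execute(
    """
    SELECT pmc_id, COUNT(*) AS n_terms
    FROM pmc_retrieval
    GROUP BY pmc_id
    HAVING COUNT(*) >= ?
    ORDER BY COUNT(*) DESC, pmc_id
    """,
    (THRESHOLD,),
).fetchall()

print(selected)
```

Passing the threshold as a query parameter makes it easy to explore the trade-off described above: a higher value retrieves fewer documents, a lower value admits more noisy ones.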
16 C-value
- syntactic pattern matching is used to select term candidates: (ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*) N
- termhood of each candidate term t is calculated using:
  - |t|, its length as the number of words
  - f(t), its frequency of occurrence
  - S(t), the set of other candidate terms containing it as a subphrase
Part-of-speech (POS) tagging is the process of tagging the words in a text as corresponding to a particular part of speech (e.g. noun, verb, adjective) based on each word's definition and its context (i.e. its relation to adjacent and related words), for example:
    beta/ADJ isoforms/N of/PREP glucocorticoid/N receptor/N
The POS information is used later during syntactic pattern matching, which extracts only those word sequences that conform to certain syntactic rules.
Finally, a stop list is used to eliminate candidate terms extracted via syntactic pattern matching that are not likely to be terms. The stop list is corpus-dependent and contains frequently occurring words such as great, numerous, several, etc.
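The transcript does not preserve the termhood formula itself, so the sketch below falls back on the commonly cited C-value definition (Frantzi, Ananiadou and Mima): log2|t| · f(t) for a term that is not nested in any other candidate, and log2|t| · (f(t) − (1/|S(t)|) · Σ f(s) over s in S(t)) otherwise. Whether the talk used exactly this form, or this logarithm base, is an assumption. The regular expression is one reading of the slide's pattern, encoded over one-letter tag codes; the function names and the frequencies are hypothetical.

```python
import math
import re

# One-letter codes for the POS tags used on the slide.
TAG_CODE = {"ADJ": "A", "N": "N", "PREP": "P"}
# Assumed reading of the slide's pattern: optional (ADJ|N)*, optional
# "N PREP", optional (ADJ|N)*, and a final noun.
PATTERN = re.compile(r"[AN]*(?:NP)?[AN]*N")


def candidates(tagged):
    """Return every contiguous word sequence whose tags match PATTERN."""
    codes = "".join(TAG_CODE.get(tag, "x") for _, tag in tagged)
    found = []
    for i in range(len(tagged)):
        for j in range(i + 1, len(tagged) + 1):
            if PATTERN.fullmatch(codes[i:j]):
                found.append(tuple(word for word, _ in tagged[i:j]))
    return found


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous subphrase of a longer term."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[k:k + n] == shorter for k in range(len(longer) - n + 1)
    )


def c_value(freq):
    """Compute C-value for each candidate; `freq` maps word tuples to f(t)."""
    scores = {}
    for term, f_t in freq.items():
        nested_in = [s for s in freq if contains(s, term)]  # S(t)
        if nested_in:
            f_t -= sum(freq[s] for s in nested_in) / len(nested_in)
        scores[term] = math.log2(len(term)) * f_t
    return scores


tagged = [("beta", "ADJ"), ("isoforms", "N"), ("of", "PREP"),
          ("glucocorticoid", "N"), ("receptor", "N")]
cands = candidates(tagged)

# Hypothetical frequencies, invented for illustration.
freq = {
    ("magnetic", "resonance"): 10,
    ("nuclear", "magnetic", "resonance"): 6,
    ("magnetic", "resonance", "spectroscopy"): 2,
}
scores = c_value(freq)
print(scores)
```

Note that with log2|t| single-word candidates always score zero, which is why some variants of the measure adjust the length factor.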
17 C-value results
- problem: C-value extracts statistically significant terms
  - C-value does not differentiate between the “NMR terms” and the terms representing the study subjects in which NMR is used only as an analytical technique, but is not itself the focus of a study
  - we need to reduce the number of terms not directly related to the NMR technique
- solution: the initial inspection of automatically extracted terms revealed the main types of concepts studied using NMR: substances, organisms, organs, conditions/diseases…
  - a straightforward approach to filtering out such terms is to use the existing dictionaries of these terms and match them against the list of automatically extracted terms
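The dictionary-based filtering described above amounts to a set lookup. The following is a minimal sketch under stated assumptions: exact, case-insensitive matching is used for simplicity (the talk does not say how terms were matched), and both term lists are invented for illustration rather than taken from the actual dictionaries.

```python
def filter_terms(extracted, dictionary):
    """Drop extracted terms that appear in the dictionary (case-insensitive).

    `dictionary` stands in for a list of terms denoting substances,
    organisms, organs, diseases, etc.; exact matching is an assumption.
    """
    known = {entry.lower() for entry in dictionary}
    return [term for term in extracted if term.lower() not in known]


# Hypothetical data, invented for illustration.
extracted_terms = ["chemical shift", "Glucose", "pulse sequence", "liver", "Escherichia coli"]
dictionary_terms = ["glucose", "liver", "escherichia coli", "diabetes mellitus"]

nmr_candidates = filter_terms(extracted_terms, dictionary_terms)
print(nmr_candidates)
```

Exact matching will miss lexical variants (plurals, hyphenation), so a real filter would likely need some normalisation before the lookup.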
18 Unified Medical Language System (UMLS)
- UMLS = an “ontology” which merges information from over 100 biomedical source vocabularies
- UMLS contains the following semantic classes relevant to our problem:
  - Organism A.1.1
  - Anatomical Structure A.1.2
  - Substance A.1.4
  - Biological Function B
  - Injury or Poisoning B.2.3
- we used these classes to automatically extract the corresponding terms from the UMLS thesaurus
UMLS is aimed at facilitating the development of information systems for text processing in the domain of biomedicine. The role of UMLS in such systems is to provide a formal representation of the domain-specific knowledge in order to process, retrieve, integrate, and aggregate biomedical data and information contained in the relevant literature.
UMLS is a multilingual ontology, which merges information from over 100 biomedical source vocabularies. It currently contains over one million concepts named by 2.8 million terms. Concepts are organised into a hierarchy of 135 classes and interconnected by 54 different relations.
19 Summary
A Web service is any piece of software that makes itself available over the Internet and uses a standardized XML messaging system.
The NCBI Web service is a web program that enables developers to access Entrez Utilities via the Simple Object Access Protocol (SOAP). Programmers may write software applications that access the E-Utilities using any SOAP development tool.
Unfortunately, NaCTeM does not provide a Web service for automatic term recognition (ATR)!
If we want to implement CV expansion as a workflow (e.g. using Taverna) and make it publicly accessible as a Web service, then we need a Web service for ATR.
21 Results
input: 243 NMR terms & 152 GC terms
output: 5,699 NMR terms & 2,612 GC terms
Term acquisition results (see Table above) for the two case studies and IR approaches using the MeSH terms and the seed CV terms (at least 3 and 7 matching terms for abstracts and full papers respectively).
[Although freely available for browsing, for most articles in PMC the publisher does not allow downloading the text in XML format, nor does PMC allow bulk downloading in HTML format. Hence, we were able to process only a small portion of full papers (the numbers in brackets refer to these papers).]
- Total number of new terms acquired: 5,699 for NMR; 2,612 for GC
- Preliminary results are available at the MSI OWG web site, where the potential CV terms are available to the metabolomics community for comments and curation.
- The average ratio between the number of acquired technology-specific terms and the corpus size: 16.25 for full papers; only 0.13 for abstracts
- Overlap of terms acquired from abstracts and full papers: only 2% on average
This comparison confirms that the Materials and Methods sections represent a significant source of technology-specific terms and also emphasises the need to make full-text articles available to TM applications for the benefit of the overall biomedical community.