Using Domain Ontologies to Improve Information Retrieval in Scientific Publications
Kincho H. Law, Siddharth Taduri, Gloria T. Lau
Engineering Informatics Lab at Stanford

Data 3/29/2012 Engineering Informatics Lab at Stanford University 2

TREC Genomics 2007 Data Set
- Over 162,000 full-text scientific publications from 49 prominent journals in biomedicine
- Metadata available through MEDLINE
- Tasks involve passage, document, and feature retrieval
- Methodologies are evaluated on their response to 36 topics (‘queries’)
- The topics are categorized based on 13 entity types (Proteins, Genes, etc.)

BioPortal
- BioPortal is an integrated resource for biomedical ontologies
- Currently indexes over 300 ontologies, including Medical Subject Headings (MeSH) and Gene Ontology
- Provides a comprehensive web service that abstracts the formats and APIs of all underlying ontologies

Methodology

How Is Domain Knowledge Integrated?
(1) Annotating documents prior to indexing
  - Response time is fast
  - Not flexible: the entire index has to be updated if a new ontology needs to be added
  - Indexes can grow very large
(2) Query expansion
  - Response time is slower
  - Very flexible: ontologies can be dynamically chosen
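The trade-off between the two strategies can be illustrated with a toy inverted index. This is a minimal sketch, not the system described in the slides: the synonym table, the three documents, and all function names are hypothetical stand-ins for an ontology and a real search index.

```python
from collections import defaultdict

# Hypothetical toy synonym table standing in for an ontology.
SYNONYMS = {"tumor": {"tumor", "neoplasm", "cancer"}}

# Hypothetical toy corpus.
DOCS = {
    1: "neoplasm observed in zebrafish liver",
    2: "zebrafish fin regeneration",
    3: "tumor growth in mouse models",
}

def canonical(token):
    """Map a token to its concept label if the toy ontology knows it."""
    for concept, terms in SYNONYMS.items():
        if token in terms:
            return concept
    return token

def build_index(docs, annotate=False):
    """Build an inverted index; with annotate=True, also index each token's concept."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
            if annotate:
                index[canonical(token)].add(doc_id)
    return index

def search_annotated(index, term):
    # Strategy (1): a single lookup, because the ontology is baked into the index.
    return set(index.get(canonical(term), set()))

def search_expanded(index, term):
    # Strategy (2): plain index; the ontology is consulted for every query.
    terms = SYNONYMS.get(canonical(term), {term})
    hits = set()
    for t in terms:
        hits |= index.get(t, set())
    return hits

annotated_index = build_index(DOCS, annotate=True)
plain_index = build_index(DOCS)
print(sorted(search_annotated(annotated_index, "tumor")))
print(sorted(search_expanded(plain_index, "tumor")))
```

Both strategies return the same documents here. The difference is where the cost falls: adding a new ontology to strategy (1) means rebuilding the annotated index, while strategy (2) only changes the synonym table used at query time.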

Query Expansion
- TREC queries are first manually pre-processed
- “What [TUMOR TYPES] are found in zebrafish?” => “[Tumor][MeSH] AND zebrafish”
- [Tumor] indicates the term that has to be expanded
- [MeSH] indicates the ontology that should be used
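The bracket notation above is simple enough to parse mechanically. The following is a minimal sketch of such a parser, assuming clauses are joined by the literal keyword AND; the function name and the returned tuple layout are illustrative choices, not part of the original work.

```python
import re

# Matches a tagged clause of the form [term][ontology].
TAGGED = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")

def parse_query(query):
    """Split a pre-processed query on AND and return (term, ontology) pairs.

    Untagged clauses get None as their ontology, meaning "do not expand".
    """
    clauses = []
    for part in re.split(r"\s+AND\s+", query.strip()):
        part = part.strip()
        match = TAGGED.fullmatch(part)
        if match:
            clauses.append((match.group(1), match.group(2)))
        else:
            clauses.append((part, None))
    return clauses

print(parse_query("[Tumor][MeSH] AND zebrafish"))
```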

Query Expansion
- The pre-processed query is automatically expanded using BioPortal’s API
- [Tumor][MeSH] => {Tumor, Neoplasm, Carcinoma, Leukemia …}
[Figure: MeSH hierarchy under Tumor, with child terms Leukemia, Melanoma, Adenocarcinoma, Nerve Sheath Neo…; synonyms of Tumor: Cancer, Neoplasm, …; synonyms of Leukemia: Leucocythaemias, Leucocythemia]
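The expansion step can be sketched as a walk over the concept hierarchy that collects each concept together with its synonyms. The example below does not call BioPortal; a hypothetical in-memory dictionary, loosely modeled on the MeSH fragment in the slide, stands in for the web service, and the `expand` function is an illustrative name.

```python
# Hypothetical in-memory ontology fragment; a real system would query a service instead.
TOY_ONTOLOGY = {
    "Tumor": {
        "synonyms": ["Cancer", "Neoplasm"],
        "children": ["Leukemia", "Melanoma", "Adenocarcinoma"],
    },
    "Leukemia": {"synonyms": ["Leucocythaemias", "Leucocythemia"], "children": []},
    "Melanoma": {"synonyms": [], "children": []},
    "Adenocarcinoma": {"synonyms": [], "children": []},
}

def expand(term, ontology):
    """Return the term, its synonyms, and all descendants with their synonyms."""
    expanded, stack, seen = [], [term], set()
    while stack:
        concept = stack.pop()
        if concept in seen:
            continue  # guard against cycles or concepts with several parents
        seen.add(concept)
        expanded.append(concept)
        entry = ontology.get(concept, {})
        expanded.extend(entry.get("synonyms", []))
        stack.extend(entry.get("children", []))
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(expanded))

print(expand("Tumor", TOY_ONTOLOGY))
```

A term that is absent from the ontology expands only to itself, which mirrors the failure mode the slides describe: missing terms and missing synonyms cannot contribute to recall.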

Which Domain Knowledge Is Integrated?
- The use of synonymy results in inconsistent performance (2007 TREC Genomics track)
- Common reasons include:
  - Relevant terms may not be classified as expected
  - Some relevant terms may not be classified in a particular ontology
  - Incomplete information (such as synonyms)
- Selection of the appropriate domain ontology is important

Enriching Existing Ontologies
- Existing ontologies must be enriched to complete missing information
- Multiple ontologies can be used to provide different classifications
Example (sources: MeSH, NCI, NDF):
Ontology | NDF
Concept | Pamidronate
Synonyms from NDF | APD, Amidronate, ...
Synonyms from MeSH | pamidronate calcium, pamidronate monosodium, aredia
Synonyms from NCI | Pamidronic acid, pamidronate disodium, …
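One simple reading of enrichment is a union of the synonym lists that different ontologies provide for the same concept. The sketch below uses the Pamidronate synonyms listed in the slide; the data layout, the case-insensitive de-duplication, and the `enrich` function are assumptions made for illustration.

```python
# Synonyms as listed in the slide; the nested-dict layout is a hypothetical choice.
SOURCES = {
    "NDF": {"Pamidronate": ["APD", "Amidronate"]},
    "MeSH": {"Pamidronate": ["pamidronate calcium", "pamidronate monosodium", "aredia"]},
    "NCI": {"Pamidronate": ["Pamidronic acid", "pamidronate disodium"]},
}

def enrich(concept, sources):
    """Merge the synonyms of one concept across several ontologies.

    Duplicates are detected case-insensitively; the first spelling seen is kept.
    """
    merged = {}
    for table in sources.values():
        for synonym in table.get(concept, []):
            merged.setdefault(synonym.lower(), synonym)
    return list(merged.values())

print(enrich("Pamidronate", SOURCES))
```

As the later slides note, a union of synonyms cannot supply terms that every source is missing, which is one reason the measured improvement stays marginal.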

Evaluations
- Baseline
- With query expansion (suggested sources)
- Using enriched ontologies
- Multiple query expansions per query
Summary of 2007 TREC Genomics track: Max, Min, Mean, Median 0.1897

Queries
Topic Number | Query | Domain Knowledge
205 | What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease? | Symptom Ontology
206 | What [TOXICITIES] are associated with zoledronic acid? | NCI Thesaurus
207 | What [TOXICITIES] are associated with etidronate? | NCI Thesaurus
211 | What [ANTIBODIES] have been used to detect protein PSD-95? | MeSH
229 | What [SIGNS OR SYMPTOMS] are caused by human parvovirus infection? | Symptom Ontology
231 | What [TUMOR TYPES] are found in zebrafish? | MeSH

Baseline
- Queries are used without modification, e.g.,
  - “What [ANTIBODIES] have been used to detect protein PSD-95?”
  - “What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease?”
- Document MAP:
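Document MAP (mean average precision) is the metric reported for each configuration. For reference, the standard definition can be sketched as follows; the function names and the toy rankings are illustrative, and the official TREC evaluation may differ in details such as how documents are judged relevant.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant document IDs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs, one pair per topic."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example with two topics.
runs = [
    (["a", "b", "c"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
print(mean_average_precision(runs))
```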

Query Expansion
- Queries are formulated in ‘AND’ clauses:
  “[Tumor][MeSH] AND zebrafish” => (Tumor, Neoplasm, Carcinoma, Leukemia …) AND zebrafish
- Document MAP:
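The expanded query can be assembled as one group of alternatives per original clause, with the groups joined by AND. The sketch below assumes the alternatives within a group are combined with OR and that multi-word terms are quoted as phrases, in the style of common search-engine query syntax; the slide shows only a comma-separated list, so both choices are assumptions.

```python
def or_group(terms):
    """Render one clause: a single term, or a parenthesized OR of alternatives."""
    if not terms:
        raise ValueError("each clause needs at least one term")
    # Quote multi-word terms so they are treated as phrases (assumed syntax).
    rendered = ['"%s"' % t if " " in t else t for t in terms]
    if len(rendered) == 1:
        return rendered[0]
    return "(" + " OR ".join(rendered) + ")"

def build_query(groups):
    """Join the clauses with AND; each group is a list of interchangeable terms."""
    return " AND ".join(or_group(group) for group in groups)

print(build_query([["Tumor", "Neoplasm", "Carcinoma", "Leukemia"], ["zebrafish"]]))
```

The same function covers expansion of several terms in one query: passing two multi-term groups yields two parenthesized OR groups joined by AND.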

Multiple Query Expansion Terms
- Expansion can be performed on multiple terms in the query
- Example: Coronary Artery Disease => {Coronary heart disease, coronary disease, CAD, …}
- [Tumor][MeSH] AND [zebrafish][MeSH] => (tumor, neoplasm, …) AND (zebrafish, danio rerio, …)
- Document MAP:

Enriched Ontology
- Marginal improvement over basic enhanced models
- Document MAP:
- Why is the improvement only marginal?
  - The framework for enrichment based on synonymy is rigid, i.e., relevant terms that are entirely missing in the ontology are still not included
  - Relevant terms that are classified differently are never included in the search

Visualization
- Expert knowledge is valuable
- We extend MINOE, a co-occurrence-based visualization tool originally designed for exploring marine ecosystems
- Users can browse (or search) documents through ontologies and visualize interactions between concepts
- See demo

Summary
- Search methodologies must be based on semantics in order to tackle terminology inconsistency
- Domain ontologies provide these semantics
- Domain ontologies need to be modified (or enriched) in order to fulfill information needs
- User interaction is important

Future Work
- Using multiple enriched ontologies may provide the necessary terms
- MeSH Descriptors are provided for every publication during indexing and can potentially improve results
- Implement Okapi model for scoring documents
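For orientation, the Okapi BM25 scoring function mentioned above can be sketched in a few lines. This is a generic textbook-style version, not the authors' planned implementation: the parameter defaults k1 = 1.2 and b = 0.75 are common conventions, the IDF uses the non-negative log(1 + ...) variant, and the tokenized toy documents are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query terms with BM25."""
    n = len(docs)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n or 1.0  # avoid division by zero on empty docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

toy_docs = [
    ["tumor", "zebrafish"],
    ["zebrafish", "fin"],
    ["mouse", "tumor", "tumor"],
]
print(bm25_scores(["tumor"], toy_docs))
```

With expanded queries, one option is to pass every expanded term in `query_terms`, so that documents matching several synonyms accumulate a higher score; whether that weighting is desirable is a design decision the slides leave open.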

Backup Slides

Motivation
- Scientific literature is an important source of information
- Retrieving relevant information from scientific publications is challenging
- Domain terminology is used inconsistently in scientific publications
- Increasing amounts of information amplify the problem
- Improved methodologies based on semantics are required

Background
- The Text REtrieval Conference (TREC), organized by NIST, has showcased many successful methods
- The Genomics track focused on full-text scientific publications from 49 prominent journals
- Methodologies involved:
  - Use of synonymy from ontologies
  - Language-based models
  - Query expansion and annotations
  - Okapi scoring model

Goals
- Understand how domain ontologies can be leveraged
- Understand which domain ontologies can be leveraged
- Develop a knowledge-based approach to integrate domain knowledge with the search mechanism