1 Schema-Driven Relationship Extraction from Unstructured Text Cartic Ramakrishnan, Krys Kochut and Amit Sheth LSDIS Lab, University of Georgia, Athens,

Presentation transcript:

1 Schema-Driven Relationship Extraction from Unstructured Text
Cartic Ramakrishnan, Krys Kochut and Amit Sheth
LSDIS Lab, University of Georgia, Athens, GA
November 7th, 2006, ISWC 2006

2 Outline Motivation Problem Description & Approach Results Future Work

3 Anecdotal Example Discovering connections hidden in text UNDISCOVERED PUBLIC KNOWLEDGE

4 Motivation 1 – Undiscovered Public Knowledge in biology
[Figure labels: Magnesium, Migraine, PubMed, "?", Stress, Spreading Cortical Depression, Calcium Channel Blockers]
Swanson’s Discoveries
– Associations discovered based on keyword searches followed by manual analysis of text to establish possible relevant relationships
– 11 possible associations found
– These associations were discovered in 1986

5 Motivation 2 - Hypothesis Driven retrieval of Scientific Literature
[Figure: a complex query against PubMed, with supporting document sets retrieved. Node labels: Migraine, Stress, Patient, Magnesium, Calcium Channel Blockers. Edge labels: affects, isa, inhibit]
Keyword query: Migraine[MH] + Magnesium[MH]

6 Motivation 3 -- Growth Rate of Public Knowledge
Data captured per year = 1 exabyte (10^18 bytes) (Eric Neumann, Science, 2005)
How much is that?
– Compare it to the estimate of the total words ever spoken by humans = 12 exabytes
A small but significant portion is text data
– PubMed: 16 million abstracts
– MedlinePlus: health information
– OMIM: catalog of human genes and genetic disorders
Undiscovered public knowledge may have also increased by a large amount

7 Our past work in Connection Discovery
Semantic Associations over RDF graphs
– Discovery and Ranking
[Figure labels: Migraine, Stress, Patient, Magnesium, Calcium Channel Blockers; edge labels: affects, isa, inhibit; caption: Semantically Connected]
Assumption: Rich Semantic Metadata containing entities related by a diverse set of relationships
It is therefore critical to bridge the gap between unstructured and structured data by extracting entities and the relationships between them, resulting in semantic metadata

8 Outline Motivation Problem Description & Approach Results Future Work

9 Problem – Extracting relationships between MeSH terms from PubMed
[Figure: UMLS Semantic Network classes (Biologically active substance, Lipid, Disease or Syndrome) with relationships affects, causes, complicates; MeSH terms Fish Oils and Raynaud’s Disease attached by instance_of, with the relationship between them marked "???????"; PubMed document counts shown: 9284 documents, 4733 documents, 5 documents]

10 Background knowledge used
UMLS – A high-level schema of the biomedical domain
– 136 classes and 49 relationships
– Synonyms of all relationships, obtained using variant lookup (tools from NLM)
– 49 relationships + their synonyms = ~350 terms, mostly verbs
MeSH
– 22,000+ topics organized as a forest of 16 trees
– Used to query PubMed
PubMed
– Over 16 million abstracts
– Abstracts annotated with one or more MeSH terms
Synonym entries shown for T147: effect, induce, etiology, cause, effecting, induced
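
The synonym lookup described on this slide can be pictured as a table from surface forms to relationship IDs. The following is a minimal sketch, assuming a plain dictionary is an adequate stand-in; it contains only the six T147 forms shown on the slide, and the names `SYNONYMS_BY_RELATIONSHIP`, `build_lookup` and `relationship_for` are hypothetical, not part of the authors' tooling or of the NLM tools.

```python
# Surface forms listed on the slide for UMLS relationship T147.
# Only these six forms are included; the full ~350-term table is not reproduced here.
SYNONYMS_BY_RELATIONSHIP = {
    "T147": ["effect", "induce", "etiology", "cause", "effecting", "induced"],
}


def build_lookup(synonyms_by_relationship):
    """Invert the table so that each surface form maps to its relationship ID."""
    lookup = {}
    for relationship_id, forms in synonyms_by_relationship.items():
        for form in forms:
            lookup[form.lower()] = relationship_id
    return lookup


LOOKUP = build_lookup(SYNONYMS_BY_RELATIONSHIP)


def relationship_for(token):
    """Return the relationship ID for a token, or None if the token is not in the table."""
    return LOOKUP.get(token.lower())
```

With only the six listed forms, `relationship_for("induced")` yields `"T147"`, while a form that is not in this small table, such as "induces", yields `None`; a fuller variant table would presumably cover such inflections.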

11 Method – Parse Sentences in PubMed
SS-Tagger (University of Tokyo)
SS-Parser (University of Tokyo)
(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )
Entities (MeSH terms) in sentences occur in modified forms
– “adenomatous” modifies “hyperplasia”
– “An excessive endogenous or exogenous stimulation” modifies “estrogen”
Entities can also occur as composites of 2 or more other entities
– “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”
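
To work with parser output like the bracketed string on this slide, the string first has to be read into a tree. The sketch below is an illustrative reader for that bracket notation written from scratch; it is not the SS-Parser API, and the names `parse_bracketed`, `leaves` and `SENTENCE_PARSE` are assumptions made for this example.

```python
import re

# The bracketed parse shown on the slide.
SENTENCE_PARSE = """
(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) )
(NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces)
(NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )
"""


def parse_bracketed(text):
    """Read a bracketed parse into nested tuples.

    An inner node is (label, [children]); a leaf is (pos_tag, word).
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    position = 0

    def read_node():
        nonlocal position
        position += 1                      # consume "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])   # a word under a POS tag
                position += 1
        position += 1                      # consume ")"
        if len(children) == 1 and isinstance(children[0], str):
            return (label, children[0])
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words of the sentence in order."""
    label, body = node
    if isinstance(body, str):
        return [body]
    words = []
    for child in body:
        words.extend(leaves(child))
    return words


TREE = parse_bracketed(SENTENCE_PARSE)
```

Joining `leaves(TREE)` with spaces reproduces the original sentence, which is a quick check that the tree was read correctly.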

12 Method – Identify entities and Relationships in Parse Tree
[Figure: parse tree of the example sentence (stimulation by estrogen induces adenomatous hyperplasia of the endometrium), with three nodes annotated "MeSHID D", one node annotated "UMLS ID T147", and regions labeled Modifiers, Modified entities and Composite Entities]

13 Entities – The simple, the modified and the composite
To capture the various types of entities we define
– Simple entities as MeSH terms
– Modifiers as siblings of entities that are
  Determiners – “Y induces no X”
  Noun Phrases – “An excessive endogenous or exogenous stimulation”
  Adjective phrases – “adenomatous”
  Prepositional phrases – “M is induced by the X in the Z”
– Modified Entities as any entity that has a sibling which is a modifier
– Composite Entity as any entity that has another entity as a sibling
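
The definitions above can be sketched as a check over the children of a single parse-tree node. This is a literal reading of the slide's definitions under simplifying assumptions: siblings are given as flat `(label, text, is_entity)` tuples, the label set `MODIFIER_LABELS` (including the bare `JJ` tag alongside `ADJP`) is a guess at how the four modifier types map onto parser labels, and `classify_children` is a hypothetical name.

```python
# Assumed mapping of the slide's modifier types onto parser labels:
# determiners (DT), noun phrases (NP), adjectives/adjective phrases (JJ, ADJP),
# prepositional phrases (PP).
MODIFIER_LABELS = {"DT", "NP", "JJ", "ADJP", "PP"}


def classify_children(children):
    """Classify each entity among the children of one parse-tree node.

    children: list of (label, text, is_entity) tuples that share a parent.
    Returns a list of (entity_text, kinds), where kinds is a set drawn from
    {"simple", "modified", "composite"}.
    """
    results = []
    for index, (label, text, is_entity) in enumerate(children):
        if not is_entity:
            continue
        siblings = children[:index] + children[index + 1:]
        kinds = set()
        # Modified entity: has a non-entity sibling that is a modifier.
        if any(not sib_entity and sib_label in MODIFIER_LABELS
               for sib_label, _, sib_entity in siblings):
            kinds.add("modified")
        # Composite entity: has another entity as a sibling.
        if any(sib_entity for _, _, sib_entity in siblings):
            kinds.add("composite")
        if not kinds:
            kinds.add("simple")
        results.append((text, kinds))
    return results
```

For the slide's example, `classify_children([("JJ", "adenomatous", False), ("NN", "hyperplasia", True)])` marks "hyperplasia" as a modified entity.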

14 Resulting RDF
[Figure: the resulting RDF graph, with regions labeled Modifiers, Modified entities and Composite Entities]

15 Outline Motivation Approach Results Future Work

16 Results
Dataset 1 – Swanson’s discoveries
Associations between Migraine and Magnesium [Hearst99]
– stress is associated with migraines
– stress can lead to loss of magnesium
– calcium channel blockers prevent some migraines
– magnesium is a natural calcium channel blocker
– spreading cortical depression (SCD) is implicated in some migraines
– high levels of magnesium inhibit SCD
– migraine patients have high platelet aggregability
– magnesium can suppress platelet aggregability

17 Results – Creation of Dataset 1
Keyword pairs (e.g. stress + migraine) queried against PubMed return the PubMed abstracts that are annotated (by NLM) with both terms
8 pairs of terms in this scenario result in 8 subsets of PubMed
Semantic Metadata
– Represented in RDF
– With complex entities and relationships connecting them
– Pointers to original document and sentence
– Size ~2MB RDF for Migraine Magnesium subset of PubMed

18 Evaluating the Result of Extraction
Ideal method to evaluate the extraction method
– Domain experts read a set of abstracts, given a set of relationship names and entities to look for
– In addition, they are given the extracted triples and entities
– For every abstract, the expert judge counts the correct, incorrect and missed triples
– Measure precision and recall
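
Given per-abstract counts of correct, incorrect and missed triples, precision and recall follow from the standard definitions: precision = correct / (correct + incorrect) and recall = correct / (correct + missed). A minimal sketch, in which the function names and the choice to pool counts across abstracts before dividing are assumptions of this example rather than something the slides specify:

```python
def precision_recall(correct, incorrect, missed):
    """Compute precision and recall from judged triple counts.

    correct:   extracted triples the judge accepted
    incorrect: extracted triples the judge rejected
    missed:    triples the judge found in the text that were not extracted
    """
    extracted = correct + incorrect
    relevant = correct + missed
    precision = correct / extracted if extracted else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


def pooled_precision_recall(per_abstract_counts):
    """Sum (correct, incorrect, missed) over all abstracts, then compute both measures."""
    correct = sum(c for c, _, _ in per_abstract_counts)
    incorrect = sum(i for _, i, _ in per_abstract_counts)
    missed = sum(m for _, _, m in per_abstract_counts)
    return precision_recall(correct, incorrect, missed)
```

For example, hypothetical counts of 3 correct, 1 incorrect and 3 missed triples give a precision of 0.75 and a recall of 0.5.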

19 Evaluating the Result of Extraction
In the absence of a domain expert, we focus on getting a feel for the utility of the extracted data
– We know the associations manually discovered between Migraine and Magnesium
– We locate paths of various lengths between them and manually inspect these paths
– If the paths are indicative of the manually discovered associations, the extracted data is useful

20 Paths between Migraine and Magnesium
Paths are considered interesting if they contain one or more named relationships other than hasPart or hasModifiers
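
The path criterion above can be sketched as a bounded path search over triples followed by a filter on predicate names. Everything in this example is illustrative: the triples are toy data invented to exercise the code (not output of the extraction), edges are followed in either direction by assumption, and `find_paths`, `is_interesting` and `TOY_TRIPLES` are hypothetical names.

```python
# Predicates that only describe entity structure, as named on the slide.
STRUCTURAL = {"hasPart", "hasModifiers"}

# Toy triples for illustration only; they are NOT extracted results.
TOY_TRIPLES = [
    ("stress", "affects", "migraine"),
    ("stress", "affects", "magnesium"),
    ("migraine", "hasPart", "toy_composite"),
    ("toy_composite", "hasPart", "magnesium"),
]


def find_paths(triples, start, goal, max_edges):
    """Enumerate simple paths from start to goal with at most max_edges edges.

    Edge direction is ignored (an assumption of this sketch).
    Each path is a list of (subject, predicate, object) triples.
    """
    adjacency = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((o, (s, p, o)))
        adjacency.setdefault(o, []).append((s, (s, p, o)))

    paths = []

    def walk(node, visited, edges):
        if node == goal and edges:
            paths.append(list(edges))
            return
        if len(edges) == max_edges:
            return
        for neighbor, triple in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                edges.append(triple)
                walk(neighbor, visited, edges)
                edges.pop()
                visited.remove(neighbor)

    walk(start, {start}, [])
    return paths


def is_interesting(path):
    """A path is interesting if at least one predicate is not purely structural."""
    return any(predicate not in STRUCTURAL for _, predicate, _ in path)


PATHS = find_paths(TOY_TRIPLES, "migraine", "magnesium", max_edges=2)
INTERESTING = [path for path in PATHS if is_interesting(path)]
```

On the toy data, two paths connect the endpoints; only the one through "stress" passes the filter, because the other consists solely of hasPart edges.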

21 An example of such a path

22 Results
Dataset 2 – Neoplasm (C04)
For the subtree of MeSH rooted at Neoplasms, all topics under this subtree are used as query terms against PubMed
The resulting dataset contains ~500,000 PubMed abstracts
The extraction process run on this data returns ~150MB of RDF
Processing the tagged and parsed sentences for Dataset 2 (Neoplasm) to generate RDF took approx. 5 minutes
Stats
– 211 different named relationships found
– 500,000 instance-property-instance statements
– 260,000 instance-property-literal statements
Currently setting up to extract RDF from all of PubMed
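
Selecting every topic under a MeSH subtree can be sketched with tree numbers, since a descendant's tree number extends its ancestor's with further dot-separated segments. The sketch below assumes the topics are available as a mapping from topic name to tree numbers; apart from the root C04 named on the slide, the entries are made-up placeholders, and `in_subtree` and `topics_under` are hypothetical helper names.

```python
def in_subtree(tree_number, root):
    """True if tree_number is the root itself or lies below it in the hierarchy."""
    return tree_number == root or tree_number.startswith(root + ".")


def topics_under(root, tree_numbers_by_topic):
    """Return the sorted topic names that have at least one tree number under root."""
    return sorted(
        topic
        for topic, tree_numbers in tree_numbers_by_topic.items()
        if any(in_subtree(number, root) for number in tree_numbers)
    )


# Placeholder data: only "Neoplasms" / C04 comes from the slide.
TOY_TOPICS = {
    "Neoplasms": ["C04"],
    "toy topic A": ["C04.001"],
    "toy topic B": ["C99.003", "C04.001.002"],  # a topic may sit in several trees
    "toy topic C": ["C040.001"],                # shares a string prefix but is not under C04
}

QUERY_TERMS = topics_under("C04", TOY_TOPICS)
```

The check against `root + "."` rather than the bare prefix keeps a number such as the placeholder C040.001 from being mistaken for a descendant of C04.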

23 Outline Motivation Problem Description & Approach Results Future Work

24 Future Extensions to the Extraction process
Short-term goals (1 month)
– MeSH qualifiers (blood pressure, contraindications)
– Curate and release Migraine-Magnesium RDF
Long-term goals
– More complex structures
  Conjunctions
  X causes Y to inhibit Z
– Rule-action language to test new extraction rules
– Finding new terms to enrich existing vocabularies
– Perhaps ontology enrichment

25 The projected future of research in Biology From … Hypothesis driven “wet lab” experiments To … Data-driven reduction/pruning of hypothesis space leading to new insight and possibly discovery What challenges does this transition bring?

26 Use of Generated Semantic Metadata
Semantic Browsing of PubMed based on named relationships between MeSH terms
Path/hypothesis based document retrieval
Knowledge discovery from literature
– Corpus-based complex relationship discovery and ranking
– Corpus-based relevant connection subgraph discovery

27 Support such retrieval and discovery operations across multiple data sources
Extract Semantic Metadata about entities in all of these databases that might occur in PubMed text
Resulting metadata will contain relationships between genes (OMIM), diseases (MeSH), nucleotide anomalies (SNP)
hypothesis validation and knowledge discovery in biology.

28 THANK YOU!