Concept Grounding to Multiple Knowledge Bases via Indirect Supervision

Presentation transcript:

Concept Grounding to Multiple Knowledge Bases via Indirect Supervision Chen-Tse Tsai and Dan Roth University of Illinois at Urbana-Champaign

Concept Grounding
- Grounding concepts and entities
- Wikipedia is not the single ideal resource for some domains
  - Multiple ontologies in the biological domain, e.g., Gene Ontology, Sequence Ontology, Protein Ontology, …
- We study the problem of grounding concepts to multiple knowledge bases (KBs), and use the biomedical domain as our application
- Example text: "Mubarak, wife of Egyptian President Hosni Mubarak and …"

The Task
Example text: "BRCA2 and homologous recombination."
Annotations for BRCA2: PR:000004804, EG:675

Protein Ontology entry
  id: PR:000004804
  name: breast cancer type 2 susceptibility protein
  def: A protein that is a translation product of the human BRCA2 gene or a 1:1 ortholog thereof
  synonyms: BRCA2, FACD,…
  is_a: PR:000000001

Entrez Gene entry
  id: EG:675
  symbol: BRCA2
  description: protein-coding BRCA2 breast cancer 2, early onset
  synonyms: BRCC2, BROVCA2, …

Unlike Wikipedia entries, these KB entries have no full-text article with hyperlink structure.

Challenges
- Ambiguity
  - A phrase can be used to express many different concepts
  - BRCA2 is used by 177 concepts
- Variability
  - A concept may have many synonyms
  - EG:675 has synonyms BRCC2, FACD, FAD, FANCD, …
- Supervision
  - Wikipedia has a rich hyperlink structure, which the other KBs do not have
  - It is difficult to obtain human annotations for scientific domains
- We explore the relationship between KBs to construct training examples without any document. The ranking model trained on these examples outperforms all unsupervised methods in our experiments.

System Overview
[Diagram] Input: Mention. Stages: Concept Candidate Generation, Concept Candidate Ranking, Global Inference with Knowledge. Output: Ranked Candidates. Indirect Supervision is drawn from KB1, KB2, …, KBl.

Example: "BRCA2 and homologous recombination"
Candidates: EG:675, PR:04804, EG:77244, GO:02111

Candidate   Score
PR:04804    0.8
EG:77244    0.6
EG:675      0.2
GO:02111    0.1

Candidate Generation
- Given a mention, produce a small set of possible concepts
- Synonym Matching
  - Construct a dictionary from all synonyms and names across all KBs
  - Phrase → possible concept IDs
- Word Matching
  - Split phrases into words, and combine concepts by words
  - Word → possible concept IDs
  - Only keep the top k concepts
- Words are normalized by the SPECIALIST Lexical Tools
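The two matching steps on the candidate generation slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: `normalize` only lowercases text as a stand-in for the SPECIALIST Lexical Tools, and ranking word-matched concepts by the number of shared words is an assumed way to "combine concepts by words". The function names and the default `k` are hypothetical.

```python
from collections import Counter, defaultdict


def normalize(text):
    # Assumption: lowercasing stands in for SPECIALIST Lexical Tools normalization.
    return text.lower().strip()


def build_indexes(kb_entries):
    """kb_entries: iterable of (concept_id, [names and synonyms]) from all KBs."""
    phrase_index = defaultdict(set)  # phrase -> possible concept IDs
    word_index = defaultdict(set)    # word -> possible concept IDs
    for concept_id, names in kb_entries:
        for name in names:
            phrase = normalize(name)
            phrase_index[phrase].add(concept_id)
            for word in phrase.split():
                word_index[word].add(concept_id)
    return phrase_index, word_index


def generate_candidates(mention, phrase_index, word_index, k=10):
    phrase = normalize(mention)
    # Synonym matching: exact lookup of the whole phrase.
    exact = sorted(phrase_index.get(phrase, ()))
    # Word matching: count how many distinct mention words each concept shares.
    counts = Counter()
    for word in set(phrase.split()):
        for concept_id in word_index.get(word, ()):
            counts[concept_id] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    by_words = [cid for cid, _ in ranked if cid not in exact]
    # Only the top k word-matched concepts are kept.
    return exact + by_words[:k]
```

In this sketch, exact synonym matches are always returned and the top-k cut applies only to word matches; the slides do not say how the two sets are merged.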

Candidate Ranking
- Relevance score between (mention, concept candidate)
- Representations of the mention m
  - context-word(m): neighboring words in the document
  - context-concept(m): concept candidates of other mentions in the document
- Representations of the candidate c
  - def(c): definition of c
  - neighbor(c): concepts that have a relation with c in any of the KBs
- Ranking features
  - Common words in context-word(m) and def(c)
  - Common concepts in context-concept(m) and neighbor(c)
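The two overlap features listed on the candidate ranking slide can be illustrated with a short sketch. The feature names, the use of raw overlap counts, and the linear scoring function are assumptions made for illustration; the slides only state that a ranking model is trained on such features.

```python
def ranking_features(context_words, context_concepts, definition_words, neighbor_concepts):
    """Overlap features between a mention m and a candidate c.

    context_words     ~ context-word(m)
    context_concepts  ~ context-concept(m)
    definition_words  ~ def(c)
    neighbor_concepts ~ neighbor(c)
    """
    return {
        "common_words": len(set(context_words) & set(definition_words)),
        "common_concepts": len(set(context_concepts) & set(neighbor_concepts)),
    }


def relevance_score(features, weights):
    # Assumption: a linear model; in practice the weights would be learned.
    return sum(weights.get(name, 0.0) * value for name, value in features.items())
```

A learned ranker would replace the hand-set weights; the sketch only shows where the mention-side and candidate-side representations meet.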

Indirect Supervision
- We explore the redundancy and relationship between KBs to construct training examples
- Discovering positive examples
  - Cross reference
    - Example (Chromosome): GO:0005649 has xref SO:0000340 and xref Wikipedia:Chromosome
    - Treat one concept as the "mention", annotated by the other (GO:0005649, SO:0000340)
  - Has participant relationship
    - Example: GO:0006000 (fructose metabolic process) has_participant fructose
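The cross-reference idea above can be sketched as a small function that turns xref links into (mention concept, gold concept) training pairs. The data layout (a dict from concept ID to its xref targets) and the rule of dropping targets outside the KBs being grounded to are assumptions made for this illustration.

```python
def positive_pairs_from_xrefs(xrefs, known_concepts):
    """Build positive training pairs from cross references.

    xrefs: dict mapping a concept ID to the IDs it cross-references.
    known_concepts: set of concept IDs in the KBs we ground to.
    The source concept plays the role of the "mention"; the
    cross-referenced concept is its gold annotation.
    """
    pairs = []
    for source_id, targets in xrefs.items():
        for target_id in targets:
            # Assumption: skip targets outside the KBs (e.g., Wikipedia pages).
            if target_id in known_concepts and target_id != source_id:
                pairs.append((source_id, target_id))
    return sorted(pairs)
```

With the slide's example, `GO:0005649` cross-referencing `SO:0000340` yields one pair in which the GO concept acts as the mention and the SO concept as its label.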

Indirect Supervision
- Generating other candidates
  - Apply candidate generation on the name of the concept
  - Uniformly sample 200 concepts from all KBs
  - Take the number of common ancestors between a candidate and the positive candidate as the relevance score
- Extracting features from pairs of concepts (example pair: GO:0005649, SO:0000340)
  - There is no context for GO:0005649
  - Use def(m) instead of context-word(m)
  - Use neighbor(m) instead of context-concept(m)
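The relevance score based on common ancestors can be sketched as below. The `parents` mapping (concept ID to its direct parents, e.g., from is_a links) is a hypothetical data layout, and counting a concept as its own ancestor, so that the positive candidate always scores highest, is an assumption not stated on the slide.

```python
def ancestors(concept_id, parents):
    """Return the concept itself plus everything reachable via parent links."""
    seen = set()
    stack = [concept_id]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def relevance_by_common_ancestors(candidate, positive, parents):
    # Relevance score: number of ancestors shared with the positive candidate.
    return len(ancestors(candidate, parents) & ancestors(positive, parents))
```

Candidates that sit close to the positive concept in the hierarchy receive higher scores, which gives the ranker graded rather than binary training labels.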

Global Inference
- Enforce a coherent global solution for all mentions in a document through constraints
- Hard constraints
  - If a gene is from a species that is not mentioned anywhere in the document, it is removed from the final list.
  - Entries in the Entrez Gene database and the Protein Ontology have relations to the NCBI Taxonomy
- Formula labels on the slide: "Ranking score of the j-th candidate of the i-th mention"; "Constraints"
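The species constraint described above can be illustrated as a simple filter over one mention's ranked candidates. The `species_of` mapping and the taxon identifiers in the usage note are hypothetical; keeping candidates that have no species link (such as Gene Ontology terms) is an assumption.

```python
def filter_by_species(candidates, species_of, mentioned_species):
    """Drop candidates whose species is not mentioned in the document.

    candidates: list of (concept_id, score), already ranked.
    species_of: dict mapping a concept ID to its taxonomy ID, when one exists.
    mentioned_species: set of taxonomy IDs grounded anywhere in the document.
    """
    kept = []
    for concept_id, score in candidates:
        species = species_of.get(concept_id)
        # Assumption: concepts without a species link are never filtered out.
        if species is None or species in mentioned_species:
            kept.append((concept_id, score))
    return kept
```

For example, if `EG:77244` is linked to a made-up taxon `TAXON:B` and the document only mentions `TAXON:A`, that candidate is removed while the remaining candidates keep their relative order.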

Dataset
- Colorado Richly Annotated Full-Text corpus (CRAFT) [Bada et al., 2012]
- 67 full-text biomedical journal articles
- 7 ontologies

Ontology     # Concepts   # Annotations   # Unique annot.
PR               26,879          15,593               889
NCBITaxon       789,509           7,449               149
GO               25,471          29,443             1,235
CHEBI            19,633           8,137               553
EG           17,097,474          12,266             1,021
SO                1,704          21,284               259
CL                  857           5,760               155
Total        17,961,527          99,138             4,261

Evaluation
- Comparing to 5 unsupervised methods
- AUC: area under the precision-recall (PR) curve
- hAUC: hierarchical version, considering common ancestors

Approach                        Mean AUC   Mean hAUC
TF-IDF                             40.44       48.50
PageRank                           42.78       50.04
Zheng et al. (2014)                35.67       42.93
Agirre and Soroa (2009)            43.39       51.88
Agirre and Soroa (2009) w2w        46.51       55.46
Our Approach                       48.58       57.37
Direct Supervision                 58.98       62.59
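For readers unfamiliar with the AUC metric used above, the following sketch computes the area under a precision-recall curve for one ranked candidate list. It is a generic illustration using trapezoidal interpolation and a conventional starting point of (recall 0, precision 1); the exact procedure behind the reported numbers is not given in the slides, and the hierarchical variant (hAUC) is not covered here.

```python
def pr_auc(ranked_labels):
    """Area under the precision-recall curve for a ranked list.

    ranked_labels: 0/1 relevance labels, ordered from the highest-scored
    candidate to the lowest. Returns a value in [0, 1].
    """
    total_positive = sum(ranked_labels)
    if total_positive == 0:
        return 0.0
    area = 0.0
    prev_recall, prev_precision = 0.0, 1.0  # conventional starting point
    true_positive = 0
    for rank, label in enumerate(ranked_labels, start=1):
        true_positive += label
        precision = true_positive / rank
        recall = true_positive / total_positive
        # Trapezoid between the previous and the current curve point.
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area
```

The table reports values on a 0 to 100 scale, so a result from this function would be multiplied by 100 to compare.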

Using KBs Individually vs. Jointly
- Joint: grounding a mention to all KBs simultaneously
- Individual: focusing on a single KB each time

Approach                        Individual   Joint
PageRank                             49.85   55.74
Zheng et al. (2014)                  49.46   55.42
Agirre and Soroa (2009)              52.12   54.88
Agirre and Soroa (2009) w2w          52.23   56.18
Our Approach                         49.93   57.65

Conclusions
- Concept grounding to multiple KBs without hyperlink structure
- We propose an approach to construct training examples without using any document. It enables us to apply well-studied statistical models, and it outperforms unsupervised methods.
- We show that considering multiple KBs together has an advantage over using each KB individually