LREC - 2010 Authors Mithun Balakrishna, Dan Moldovan, Marta Tatu, Marian Olteanu Presented by Chris Irwin Davis Semi-Automatic Domain Ontology Creation.

Presentation transcript:

Semi-Automatic Domain Ontology Creation from Text Resources
LREC 2010
Authors: Mithun Balakrishna, Dan Moldovan, Marta Tatu, Marian Olteanu
Presented by Chris Irwin Davis

Jaguar Overview
Jaguar builds ontologies and knowledge bases from the concepts, and the relationships between those concepts, found in text.
Constituents of a knowledge base
– Concepts/Vocabulary (“weapon”, “WMD”, “launcher”)
– Relations (“anthrax” ISA “biological weapon”, “anthrax” CAU “death”); 26 different semantic relation types are extracted
– Organization of Relations: Hierarchical, Contextual

Types of Knowledge
Universal (or ontological)
– Represented in hierarchies
– Simple binary relations between concepts
– Example: “Chemical weapons such as nerve gas, …” [Diagram: chemical weapon linked to nerve gas]
Contextual
– Represented in individual (semantic) contexts
– Groups of relations centered on a common concept
– Example: “The forces launched a full-scale attack on Monday” [Diagram: launch with AGT forces, THM full-scale attack, TMP monday]
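The distinction between universal and contextual knowledge can be pictured with a small data structure. The following is only a sketch, not Jaguar's actual storage format: the class names `Relation` and `Context`, the argument order (argument first, then label, then the central concept), and the ISA label on the nerve gas example are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Relation:
    """A binary semantic relation: left --label--> right (hypothetical layout)."""
    left: str
    label: str   # e.g. "ISA", "AGT", "THM", "TMP"
    right: str


@dataclass
class Context:
    """A group of relations centered on one common concept."""
    center: str
    relations: List[Relation] = field(default_factory=list)

    def add(self, label: str, argument: str) -> None:
        # Every relation in a context points at the same central concept.
        self.relations.append(Relation(argument, label, self.center))


# Universal knowledge: one standalone binary relation that goes into a hierarchy.
universal = Relation("nerve gas", "ISA", "chemical weapon")

# Contextual knowledge: relations grouped around the concept "launch".
launch = Context("launch")
launch.add("AGT", "forces")
launch.add("THM", "full-scale attack")
launch.add("TMP", "monday")
```

A universal relation stands on its own, while the three relations in `launch` only make sense together, which is why the slides store them as one context.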

KB Constituents
[Diagram: a Knowledge Base made up of a Concept Set (C1, C2, …), a Hierarchy of concepts linked by relations such as isa and pw (example node labels: anthrax, biological weapon), and Contextual Knowledge (example: assassinate with AGT rebel, THM political leader, TMP may 21). The Ontology is marked as a part of the Knowledge Base.]

Jaguar Overview
Input: Documents, Seeds
Output: Ontology (structured knowledge) + pointers to text; Knowledge Base (ontology + contextual knowledge + pointers to text)
Functionality
1. Produce ontologies
2. Link concepts & relations to text
3. Visualize ontology
4. Edit ontology
5. Enhance an existing ontology
6. Merge two ontologies into a consistent ontology
7. Ontological search of documents (search documents using ontology)

Knowledge Bases
Ontology/KB creation overview
– Knowledge Extraction from Text: Pattern recognition; Semantic Parsing
– Knowledge Representation and Storage: Contextual vs. Universal; XML; Relational Database
– Knowledge Base Maintenance: Conflict Resolution; Ontology Merging; User Interaction; Ontology Modification

Jaguar – Process & Modules
Stages: Text Processing, Classification, Hierarchy Creation, Knowledge Base Maintenance
Inputs: Documents; Seeds (keywords-list or Ontology)
Outputs: Ontology + pointers to text; Knowledge Base (ontology + contextual knowledge + pointers to text)
Text Processing modules
– Chopshop: Tokenization
– Post: Part-of-speech Tagging
– Rose: Named Entity Recognition
– Relu: Syntactic Parsing
– Talbot: Word Sense Disambiguation
– Polaris: Semantic Parsing
– PreProcessor: Text extraction from HTML, MS Word & PDF docs
– ConceptTagger: Concept/Temporal Tagging
Text Processing
– Input: Documents, Seeds
– Extract “concepts” of interest
– Extract binary relations (universal)
– Use Semantic Parser to obtain contextual knowledge
– Output: Concepts, Contexts, Binary Relations
– Example: “The rebels had access to chemical weapons, such as nerve gas and other poisonous gases.”

Domain Ontology Creation
Polaris: Extract semantic relations in text
– Pattern matching and machine learning
– Syntactic parse tree broken down into a number of syntactic patterns
– Syntactic patterns include verbs and their arguments, complex nominals, adjective phrases, adjective clauses, and others.
– There are six primary pattern types discovered within noun phrases: N-N and Adj-N (which comprise compound nominals); ’s and of (genitive patterns); Adjective Phrases; Adjective Clauses. The first five are further subdivided into nominalized and non-nominalized (giving a total of 11 patterns discovered within compound nominals).
– There are also five verb-argument-level patterns being discovered: NP verb; verb NP; verb PP; verb ADVP; verb S

Domain Ontology Creation: Classification/Hierarchy Creation
– Input: Concepts, Binary Relations
– Classify each concept against every other using defined procedures, obtaining a set of ISA relations
– Add all ISA and other binary relations to the hierarchy using conflict resolution
– Output: Hierarchy of relations
Examples: “Scud missile” ISA “missile”; “Squadron” PW “Platoon”; “weapons inspection team” ISA “inspection team”

Domain Ontology Creation
Classification Procedures:
– Procedure 1: Classify a concept of the form [word, head] with respect to concept [head]
– Procedure 2: Classify a concept [word1, head1] with respect to another concept [word2, head2]
– Procedure 3: Classify a concept [word1, word2, head]
– Procedure 4: Classify a concept [word1, head] with respect to a concept hierarchy under [head]
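The slides name the four procedures but do not spell out their rules. As a rough illustration of the idea behind Procedure 1 only, the sketch below assumes that the head is the rightmost part of a multi-word concept and that a match against a known concept yields an ISA relation, as in “Scud missile” ISA “missile”. The function name `classify_by_head` and the longest-suffix strategy are assumptions, not the authors' method.

```python
from typing import Optional, Set, Tuple


def classify_by_head(concept: str, known_concepts: Set[str]) -> Optional[Tuple[str, str, str]]:
    """Return (concept, "ISA", head_concept) for the longest known suffix, else None.

    Assumes English-style head-final noun phrases and whitespace tokenization.
    """
    tokens = concept.lower().split()
    # Drop leading modifiers one at a time:
    # "weapons inspection team" -> "inspection team" -> "team"
    for start in range(1, len(tokens)):
        candidate = " ".join(tokens[start:])
        if candidate in known_concepts:
            return (" ".join(tokens), "ISA", candidate)
    return None


known = {"missile", "inspection team", "team"}
scud = classify_by_head("Scud missile", known)
team = classify_by_head("weapons inspection team", known)
```

Trying the longest suffix first keeps “weapons inspection team” directly under “inspection team” rather than under the more general “team”.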

Domain Ontology Creation: Knowledge Base Maintenance
Knowledge Base Merging
Visualization
Knowledge Base Editing
– User Interaction
– Modifications

Domain Ontology/KB Creation - Example [figure]

Domain Ontology/KB Creation - Example, continued [figure]

Conflict Resolution Algorithm
Approach used: Prevention
– Start from an empty hierarchy and an input relation set
– Add a relation from the input set to the hierarchy if it does not form a cycle and is not redundant (does not duplicate a path)
– After the addition of any relation, jump-link removal algorithms are run to ensure that all jump links are removed
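The prevention approach above can be sketched as a guarded insert into a directed graph. This is a minimal illustration under stated assumptions: the `Hierarchy` class and its method names are hypothetical, edges run from child to parent, and a "jump link" is taken to mean a direct edge whose endpoints are also connected by a longer path (the slides do not define the term).

```python
from collections import defaultdict


class Hierarchy:
    def __init__(self):
        self.parents = defaultdict(set)  # child -> set of direct parents

    def _reaches(self, start, goal, skip_edge=None):
        """Depth-first search upward from start; optionally ignore one edge."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            for parent in self.parents.get(node, ()):
                if (node, parent) != skip_edge:
                    stack.append(parent)
        return False

    def add(self, child, parent):
        """Add child -> parent unless it forms a cycle or duplicates a path."""
        if child == parent or self._reaches(parent, child):
            return False  # would form a cycle
        if self._reaches(child, parent):
            return False  # redundant: a path already exists
        self.parents[child].add(parent)
        self._remove_jump_links()
        return True

    def _remove_jump_links(self):
        # Drop any direct edge that is also implied by a longer path.
        for child in list(self.parents):
            for parent in list(self.parents[child]):
                if self._reaches(child, parent, skip_edge=(child, parent)):
                    self.parents[child].discard(parent)


h = Hierarchy()
results = [
    h.add("capital market", "market"),
    h.add("stock_market", "capital market"),
    h.add("capital market", "financial market"),
    h.add("financial market", "market"),  # makes capital market -> market a jump link
]
```

After the fourth insert, the direct edge from “capital market” to “market” is dropped because the path through “financial market” already implies it.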

Knowledge Base Merging
Current Approach
– Label the bigger ontology L1, and the other L2
– Merge concepts (from those in L2 into those of L1)
– Copy all contexts (from L2 to L1)
– Add all relations (from the hierarchy of L2 to the hierarchy of L1) using the conflict resolution algorithm
– Additionally, classify all concepts in L1’s hierarchy against concepts in L2’s hierarchy (form relation set R)
– Add relations from R into L1’s hierarchy (conflict resolution)
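The merging steps can be strung together as in the sketch below. Several things here are assumptions rather than details from the slides: ontologies are plain dictionaries, "bigger" is measured by concept count, the `classify` callback stands in for the classification procedures, the conflict check is a simplified one without jump-link removal, and the sample edges are invented for illustration because the diagrams' edges are not recoverable from the transcript.

```python
def reaches(edges, start, goal):
    """True if goal can be reached from start along child -> parent edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(p for c, p in edges if c == node)
    return False


def add_relation(edges, child, parent):
    """Simplified conflict resolution: reject cycles and duplicate paths."""
    if child == parent or reaches(edges, parent, child) or reaches(edges, child, parent):
        return False
    edges.add((child, parent))
    return True


def merge(a, b, classify):
    # Step 1: the bigger ontology becomes L1 (size = concept count, an assumption).
    l1, l2 = (a, b) if len(a["concepts"]) >= len(b["concepts"]) else (b, a)
    merged = {
        "concepts": l1["concepts"] | l2["concepts"],   # step 2: merge concepts
        "contexts": l1["contexts"] + l2["contexts"],   # step 3: copy contexts
        "edges": set(l1["edges"]),
    }
    for child, parent in sorted(l2["edges"]):          # step 4: add L2's relations
        add_relation(merged["edges"], child, parent)
    for c1 in sorted(l1["concepts"]):                  # step 5: build relation set R
        for c2 in sorted(l2["concepts"]):
            relation = classify(c1, c2)
            if relation is not None:
                add_relation(merged["edges"], *relation)  # step 6: add R
    return merged


def head_classify(c1, c2):
    """Hypothetical stand-in: 'financial market' ISA 'market' by shared head."""
    if c2.endswith(" " + c1):
        return (c2, c1)
    if c1.endswith(" " + c2):
        return (c1, c2)
    return None


left = {"concepts": {"stock_market", "exchange", "work_place", "money_market", "market", "industry"},
        "contexts": [], "edges": {("stock_market", "exchange"), ("money_market", "market")}}
right = {"concepts": {"stock_exchange", "money_market", "capital market", "financial market"},
         "contexts": [], "edges": {("stock_exchange", "capital market"),
                                   ("capital market", "financial market"),
                                   ("money_market", "financial market")}}
merged = merge(left, right, head_classify)
```

The sketch does not merge synonyms such as “stock_market” and “stock_exchange”; that would need an extra step before the relations are added.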

Merging Hierarchies
[Diagram: two hierarchies, L1 and L2, with node labels stock_market, exchange, work_place, money_market, market, industry, stock_exchange, money_market, capital market, financial market]

Merging Hierarchies
[Diagram: merged hierarchy L1 with node labels stock_market/stock_exchange, exchange, work_place, money_market, market, industry, capital market, financial market]
Simulating Classification
– “stock_market” SYN “stock_exchange”
– “stock_market” ISA “capital market”
– “capital market” ISA “financial market”
– “money_market” ISA “financial market”
– “financial market” ISA “market”
– “capital market” ISA “market”

Semantic Relation Evaluation
Training corpus:
– Noun phrase patterns: Wall Street Journal (TreeBank 2), L.A. Times (TREC 9), and XWN 2.0
– Verb argument patterns: FrameNet
Three evaluation corpora to benchmark the Polaris semantic relations:
– TreeBank: we manually annotated 500 random sentences from the Penn Treebank 3 corpus with 5879 semantic relations.
– GlassBox Human: 51 random sentences from the NIMD corpus were manually POS-tagged, syntactically parsed, and semantically annotated with 706 semantic relations.
– GlassBox Machine: the same 51 sentences used in the GlassBox Human evaluation corpus were POS-tagged and syntactically parsed by our NLP tools and then manually annotated with 741 semantic relations.

Semantic Relation Evaluation
For the TreeBank evaluation corpus, Polaris discovered 5245 relations:
– 2212 exact matches to the human annotations
– 630 partial matches
Partial matches mean that while the relation type was correct and the argument bracketing at least overlapped, there were some extra or missing tokens in the generated arguments. Partial matches are scored using precision, recall, and F-measure on the overlapping tokens.
For the GlassBox Human evaluation corpus, Polaris discovered 449 relations:
– 311 were perfect matches to the human annotations
– 56 were partial matches
For the GlassBox Machine evaluation corpus, Polaris discovered 464 relations:
– 249 were perfect matches to the human annotations
– 71 were partial matches
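Token-level scoring of a partial match might look like the following sketch. The slides only say that precision, recall, and F-measure are computed on the overlapping tokens; the whitespace tokenization, the multiset overlap, the balanced F-measure, and the function name `overlap_scores` are assumptions, and the way the authors aggregate scores across relations is not stated.

```python
from collections import Counter


def overlap_scores(predicted_tokens, gold_tokens):
    """Precision, recall, and F-measure over tokens shared by two argument spans."""
    predicted, gold = Counter(predicted_tokens), Counter(gold_tokens)
    overlap = sum((predicted & gold).values())  # multiset intersection size
    precision = overlap / sum(predicted.values()) if predicted else 0.0
    recall = overlap / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# The system bracketed one extra token ("a") around the gold argument.
p, r, f = overlap_scores("a full-scale attack".split(), "full-scale attack".split())
```

Here two of the three predicted tokens appear in the gold argument, so precision is 2/3, recall is 1, and the F-measure is 0.8.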

Semantic Relation Evaluation [figure]

Domain Ontology Library Creation
We use Jaguar to create an ontology library for the 33 topics defined in NIPF and 10 topics from the financial domain
– NIPF is the Director of National Intelligence’s (DNI’s) guidance to the Intelligence Community on the national intelligence priorities approved by the President of the United States of America
– For each topic, we collected 500 documents from the web and manually verified their relevance to the corresponding topic.
– For each topic, Jaguar is provided with an initial seed set containing on average 47 concepts of interest

Domain Ontology/KB Evaluation
We evaluated the quality of 8 Jaguar ontologies by comparing them against manual gold annotations.
Our evaluations are focused on the
– Lexical Level
– Vocabulary, or Data Layer Level
– Other Semantic Relations Level
Viewing an ontology as a set of semantic relations between two concepts, the human annotators:
– Labeled an entry as correct if the concepts and the semantic relation were correctly detected by the system; otherwise they marked the entry as incorrect
– Labeled a correct entry as irrelevant if any of the concepts or the semantic relation were irrelevant to the domain
– Added new entries from the sentences if the concepts and the semantic relation were omitted by Jaguar

NIPF Ontology/KB Evaluation - Metrics [equations]
– N_j(·) gives the counts from Jaguar’s output
– N_g(·) corresponds to counts in the user annotations

Domain Ontology/KB Evaluation - Results [figure]

Domain Ontology/KB Evaluation - Results, continued [figure]

Conclusions
– We presented a generalized and improved procedure to automatically extract deep semantic information from text resources
– We presented a methodology to rapidly create semantically rich domain ontologies while keeping the manual intervention to a minimum
– We defined evaluation metrics to assess the quality of the ontologies and presented evaluation results for a subset of the intelligence and financial ontology libraries, semi-automatically created using freely available textual resources from the Web
– The results show that a decent amount of knowledge can be accurately extracted while keeping the manual intervention in the process to a minimum.