Using WordNet Predicates for Multilingual Named Entity Recognition Matteo Negri and Bernardo Magnini ITC-irst Centro per la Ricerca Scientifica e Tecnologica,

Slides:



Advertisements
Similar presentations
Dealing with Italian Temporal Expressions: the ITA-Chronos System Matteo Negri Fondazione Bruno Kessler - IRST, Trento - Italy EVALITA 2007.
Advertisements

COGEX at the Second RTE Marta Tatu, Brandon Iles, John Slavick, Adrian Novischi, Dan Moldovan Language Computer Corporation April 10 th, 2006.
Specialized models and ranking for coreference resolution Pascal Denis ALPAGE Project Team INRIA Rocquencourt F Le Chesnay, France Jason Baldridge.
COGEX at the Second RTE Marta Tatu, Brandon Iles, John Slavick, Adrian Novischi, Dan Moldovan Language Computer Corporation April 10 th, 2006.
NYU ANLP-00 1 Automatic Discovery of Scenario-Level Patterns for Information Extraction Roman Yangarber Ralph Grishman Pasi Tapanainen Silja Huttunen.
For Friday No reading Homework –Chapter 23, exercises 1, 13, 14, 19 –Not as bad as it sounds –Do them IN ORDER – do not read ahead here.
SYNTAX 1 DAY 30 – NOV 6, 2013 Brain & Language LING NSCI Harry Howard Tulane University.
Named Entity Recognition LING 570 Fei Xia Week 10: 11/30/09.
The Informative Role of WordNet in Open-Domain Question Answering Marius Paşca and Sanda M. Harabagiu (NAACL 2001) Presented by Shauna Eggers CS 620 February.
Basi di dati distribuite Prof. M.T. PAZIENZA a.a
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
Using Information Extraction for Question Answering Done by Rani Qumsiyeh.
Introduction to CL Session 1: 7/08/2011. What is computational linguistics? Processing natural language text by computers  for practical applications.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dörre, Peter Gerstl, and Roland Seiffert Presented By: Jake Happs,
Toward Semantic Web Information Extraction B. Popov, A. Kiryakov, D. Manov, A. Kirilov, D. Ognyanoff, M. Goranov Presenter: Yihong Ding.
The LC-STAR project (IST ) Objectives: Track I (duration 2 years) Specification and creation of large word lists and lexica suited for flexible.
NATURAL LANGUAGE TOOLKIT(NLTK) April Corbet. Overview 1. What is NLTK? 2. NLTK Basic Functionalities 3. Part of Speech Tagging 4. Chunking and Trees 5.
A Statistical and Schema Independent Approach to Identify Equivalent Properties on Linked Data † Kno.e.sis Center Wright State University Dayton OH, USA.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
CSC 9010 Spring Paula Matuszek A Brief Overview of Watson.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
CLEF – Cross Language Evaluation Forum Question Answering at CLEF 2003 ( Bridging Languages for Question Answering: DIOGENE at CLEF-2003.
C++ Code Analysis: an Open Architecture for the Verification of Coding Rules Paolo Tonella ITC-irst, Centro per la Ricerca Scientifica e Tecnologica
An Integrated Approach to Extracting Ontological Structures from Folksonomies Huairen Lin, Joseph Davis, Ying Zhou ESWC 2009 Hyewon Lim October 9 th, 2009.
1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.
Final Review 31 October WP2: Named Entity Recognition and Classification Claire Grover University of Edinburgh.
NERIL: Named Entity Recognition for Indian FIRE 2013.
“How much context do you need?” An experiment about context size in Interactive Cross-language Question Answering B. Navarro, L. Moreno-Monteagudo, E.
Survey of Semantic Annotation Platforms
Spoken dialog for e-learning supported by domain ontologies Dario Bianchi, Monica Mordonini and Agostino Poggi Dipartimento di Ingegneria dell’Informazione.
Extracting metadata for spatially- aware information retrieval on the internet Pual Clough Presented by Ali Khodaei CS 572.
AnswerBus Question Answering System Zhiping Zheng School of Information, University of Michigan HLT 2002.
Scott Duvall, Brett South, Stéphane Meystre A Hands-on Introduction to Natural Language Processing in Healthcare Annotation as a Central Task for Development.
ArchiWordNet Integrating WordNet with Domain-Specific Knowledge Luisa Bentivogli 1, Andrea Bocco 2, Emanuele Pianta 1 1 ITC-irst Trento, Italy 2 Politecnico.
Ling 570 Day 17: Named Entity Recognition Chunking.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
Annotating Words using WordNet Semantic Glosses Julian Szymański Department of Computer Systems Architecture, Faculty of Electronics, Telecommunications.
A semantic based methodology to classify and protect sensitive data in medical records Flora Amato, Valentina Casola, Antonino Mazzeo, Sara Romano Dipartimento.
SYMPOSIUM ON SEMANTICS IN SYSTEMS FOR TEXT PROCESSING September 22-24, Venice, Italy Combining Knowledge-based Methods and Supervised Learning for.
QUALIFIER in TREC-12 QA Main Task Hui Yang, Hang Cui, Min-Yen Kan, Mstislav Maslennikov, Long Qiu, Tat-Seng Chua School of Computing National University.
21/11/2002 The Integration of Lexical Knowledge and External Resources for QA Hui YANG, Tat-Seng Chua Pris, School of Computing.
Recognizing Names in Biomedical Texts: a Machine Learning Approach GuoDong Zhou 1,*, Jie Zhang 1,2, Jian Su 1, Dan Shen 1,2 and ChewLim Tan 2 1 Institute.
Semiautomatic domain model building from text-data Petr Šaloun Petr Klimánek Zdenek Velart Petr Šaloun Petr Klimánek Zdenek Velart SMAP 2011, Vigo, Spain,
Opinion Holders in Opinion Text from Online Newspapers Youngho Kim, Yuchul Jung and Sung-Hyon Myaeng Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Automatic Set Instance Extraction using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University Pittsburgh,
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Using Semantic Relations to Improve Passage Retrieval for Question Answering Tom Morton.
Department of Software and Computing Systems Research Group of Language Processing and Information Systems The DLSIUAES Team’s Participation in the TAC.
For Monday Read chapter 24, sections 1-3 Homework: –Chapter 23, exercise 8.
For Friday Finish chapter 24 No written homework.
For Monday Read chapter 26 Last Homework –Chapter 23, exercise 7.
1 Masters Thesis Presentation By Debotosh Dey AUTOMATIC CONSTRUCTION OF HASHTAGS HIERARCHIES UNIVERSITAT ROVIRA I VIRGILI Tarragona, June 2015 Supervised.
Named Entity Disambiguation on an Ontology Enriched by Wikipedia Hien Thanh Nguyen 1, Tru Hoang Cao 2 1 Ton Duc Thang University, Vietnam 2 Ho Chi Minh.
For Friday Finish chapter 23 Homework –Chapter 23, exercise 15.
Competing Ontologies and Verbal Disputes Jakub Mácha Department of Philosophy Masaryk University, Brno, Czech Republic.
Comparing Document Segmentation for Passage Retrieval in Question Answering Jorg Tiedemann University of Groningen presented by: Moy’awiah Al-Shannaq
Answer Mining by Combining Extraction Techniques with Abductive Reasoning Sanda Harabagiu, Dan Moldovan, Christine Clark, Mitchell Bowden, Jown Williams.
Towards Entailment Based Question Answering: ITC-irst at Clef 2006 Milen Kouylekov, Matteo Negri, Bernardo Magnini & Bonaventura Coppola ITC-irst, Centro.
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
From Words to Senses: A Case Study of Subjectivity Recognition Author: Fangzhong Su & Katja Markert (University of Leeds, UK) Source: COLING 2008 Reporter:
For Monday Read chapter 26 Homework: –Chapter 23, exercises 8 and 9.
Using Semantic Relations to Improve Information Retrieval
Using Human Language Technology for Automatic Annotation and Indexing of Digital Library Content Kalina Bontcheva, Diana Maynard, Hamish Cunningham, Horacio.
Automatic Writing Evaluation
Institute of Informatics & Telecommunications
Information Retrieval
Presentation transcript:

Using WordNet Predicates for Multilingual Named Entity Recognition Matteo Negri and Bernardo Magnini ITC-irst Centro per la Ricerca Scientifica e Tecnologica, Trento - Italy GWC’04 - Brno (Czech Republic), January

January 23, 2004GWC'04 - Brno (Czech Republic)2 Outline Named Entity Recognition (NER) Rule-based approach using WordNet information –WordNet Predicates (language independent) –Internal evidence: Word_Instances –External evidence: Word_Classes System architecture Experiments and results on English and Italian Future work

January 23, 2004GWC'04 - Brno (Czech Republic)3 Named Entity Recognition (NER) Given a written text, identify and categorize: –Entity names (e.g. persons, organizations, location names) –Temporal expressions (e.g. dates and time) –Numerical expressions (e.g. monetary values and percentages) NER is crucial for Information Extraction, Question Answering and Information Retrieval –Up to 10% of a newswire text may consist of proper names, dates, times, etc.

January 23, 2004GWC'04 - Brno (Czech Republic)4 Q1848: What was the name of the plane that dropped the Atomic Bomb on Hiroshima? PERSON DATE LOCATION OTHER Tibbets piloted the Boeing B-29 Superfortress Enola Gay, which dropped the atomic bomb on Hiroshima on Aug. 6, 1945, causing an estimated 66,000 to 240,000 deaths. He named the plane after his mother, Enola Gay Tibbets. NER for Question Answering

January 23, 2004GWC'04 - Brno (Czech Republic)5 Named Entity Hierarchy ENTITY NAMEX PERSON LOCATION ORGANIZATION TIMEX DATE TIME DURATION OTHER MONEY CARDINAL MEASURE PERCENT

January 23, 2004GWC'04 - Brno (Czech Republic)6 Motivations 1.Experiment how far can we go with NER using WordNet as the main source of semantic knowledge for one language 2.Isolate language-independent relevant knowledge for the NER task 3.Experiment a multilingual approach taking advantage of aligned wordnets (e.g. English/Italian)

January 23, 2004GWC'04 - Brno (Czech Republic)7 Knowledge-Based NER Combination of a wide range of knowledge sources –lexical, syntactic, and semantic features of the input text –world knowledge (e.g. gazetteers) –discourse level information (e.g. co-reference resolution)

January 23, 2004GWC'04 - Brno (Czech Republic)8 Rule-Based approach Rome is the capital of Italy PATTERNt1 t2 t3 t4 t1 t2 t3 t4 [pos = NP] [ort = Cap] [lemma = be] [pos = DT] [sense = (location-p t4 English)] OUTPUT t1 Rome is the capital of Italy

January 23, 2004GWC'04 - Brno (Czech Republic)9 WordNet Predicates (1) (WN-preds) WN-preds are defined over a set of WordNet synsets which express a certain concept Mandate#2 location-p Geological_formation#1 Body_of_water#1 Solid_ground#1 Road#1 Location#1 person-p measure-p lake#1 …

January 23, 2004GWC'04 - Brno (Czech Republic)10 WordNet Predicates (2) Input –A word w and a language L Output –A boolean value (TRUE or FALSE) –TRUE if there exist at least one sense of w which is subsumed by at least one of the synsets defining the predicate location-p [, ] TRUE because there exists a sense of “lake”(lake#1) which is subsumed by one of the synset that define the predicate (i.e. body_of_water#1)

January 23, 2004GWC'04 - Brno (Czech Republic)11 WordNet Predicates (3) WN-preds have been created for the following NE categories: –PERSON : person-name-p ( person#1, spiritual-being#1 ) person-class-p ( person#1, spiritual-being#1 ) first-name-p ( person#1, spiritual-being#1 ) person-product-p ( artifact#1 ) –LOCATION : location-name-p ( location#1, road#1, mandate#1, body_of_water#1, solid_ground#1, geological_formation#1 ) location-class-p ( location#1, road#1, mandate#1, body_of_water#1, solid_ground#1, geological_formation#1 ) movement-verb-p ( locomote#1 ) –ORGANIZATION : org-name-p ( organization#1 ) org-class-p ( organization#1 ) org-representative-p ( trainer#1, top_dog, spokesperson#1 ) –MEASURE : measure-unit-p ( measure#1, number-p ( digit#1, large_integer#1, common_fraction#1 ) –MONEY : money-p (monetary_unit#1, coin#1 ) –DATE : date-p (time_period#1)

January 23, 2004GWC'04 - Brno (Czech Republic)12 WordNet Predicates (4) The definition of a wordnet-predicate is language- independent. In case of aligned wordnet w-preds can be easily parametrized with respect to a certain language without changing the predicate definition –E.g. (Location-p lake English) (Location-p lago Italian)

January 23, 2004GWC'04 - Brno (Czech Republic)13 Knowledge-Based NER Two kinds of information are usually distinguished in Named Entity Recognition(McDonald, 1996): –Internal Evidences : provided by the candidate string itself (e.g. Rome) –Drawbacks: Dimension of reliable gazetteers Maintenance (gazetteers are never “exhaustive”) Overlap among the lists (“Washington”: person or location?) –Limited availability for languages other than English –External Evidence: provided by the context into which the string appears (e.g. capital)

January 23, 2004GWC'04 - Brno (Czech Republic)14 Mining Evidence from WordNet Both IE and EE can be mined from WordNet –Low coverage of Internal evidences (e.g. person names) –High coverage of trigger words Approach: distinguishing between Word_Instances (e.g. “Nile#1”) and Word_Classes (e.g. “river#1”) Problem: in WordNet such a distinction is not explicit!

January 23, 2004GWC'04 - Brno (Czech Republic)15 Word Classes and Word Instances I In WordNet, the hyponyms of the synset “person#1” are a mixture of concepts (e.g. “astronomer”, “physicist”, etc.) and individuals (e.g. “Galileo Galilei”, “Kepler”, etc.) person intellectual scientist physicist astronomer Galileo_Galilei... Kepler... Italian... …

January 23, 2004GWC'04 - Brno (Czech Republic)16 Word Classes and Word Instances (1) - NOTE: in WordNet, the hyponyms of the synset “person#1” are a mixture of concepts (e.g. “astronomer”, “physicist”, etc.) and individuals (e.g. “Galileo Galilei”, “Kepler”, etc.) person intellectual scientist physicist astronomer Galileo_Galilei... Kepler... Italian... EE (Word_Classes ) IE (Word_Instances)

January 23, 2004GWC'04 - Brno (Czech Republic)17 Word Classes and Word Instances (2) Semi-automatic procedure to distinguish Word_Instances and Word_Classes in WordNet 3 steps: –1) collect all the hyponyms of several high-level synsets (e.g. “person#1”, “social_group#1”, “location#1”, “measure#1”, etc.) –2) separate capitalized words from lower case words: capitalized words Word_Instances lower case words Word_Classes –3) manual filter is necessary: “Italian” is not an Instance!

January 23, 2004GWC'04 - Brno (Czech Republic)18 Distribution of Word Classes and Word Instances in MultiWordNet #ENG Classes#ENG Instances#ITA Classes#ITA Instances PERSON LOCATION ORGANIZ TOTAL

January 23, 2004GWC'04 - Brno (Czech Republic)19 System Architecture (NERD) 1.Preprocessing –tokenization –POS tagging –multiwords recognition 2.Basic rules application –  400 language-specific basic rules, both for English and Italian, are applied to find and tag all the possible NEs present in the input text 3.Composition rules application –higher level language-independent rules for handling ambiguities between possible multiple tags and for co-reference resolution

January 23, 2004GWC'04 - Brno (Czech Republic)20 Basic Rules I English basic rule for capturing IE –Example: “Galileo invented the telescope” PATTERNt1 [sense = (person-name-p t1 English)] OUTPUT t1 –NOTE: the WN-pred person-name-p is satisfied by any of the 1202 English Instances of the category PERSON

January 23, 2004GWC'04 - Brno (Czech Republic)21 Basic Rules II Italian basic rule for capturing IE –Example: “il telescopio fu inventato da Galileo” PATTERNt1 [sense = (person-name-p t1 Italian)] OUTPUT t1 –NOTE: here, the WN-pred person-name-p is satisfied by any of the 1550 Instances (1202 for English for Italian) of the category PERSON

January 23, 2004GWC'04 - Brno (Czech Republic)22 Basic Rules III Basic rule for capturing EE (via trigger words) –Example: “Roma è la capitale italiana” PATTERNt1 t2 t3 t4 t1 t2 t3 t4 [pos = “NP”] [ort = Cap] [lemma = “essere”] [pos = “DT”] [sense = (location-p t4 Italian)] OUTPUT t1 –NOTE: the WN-pred location-p is satisfied by any of the 979 Italian Classes of the category LOCATION

January 23, 2004GWC'04 - Brno (Czech Republic)23 Basic Rules IV Basic rule for capturing EE (via sentence structure) –Example: “Bowman, who was appointed by Reagan … PATTERNt1 t2 t3 t1 t2 t3 [pos = “NP”] [ort = Cap] [lemma = “,”] [lemma = “who”] OUTPUT t1 –NOTE: External Evidence can be captured from the context also in absence of particular word senses

January 23, 2004GWC'04 - Brno (Czech Republic)24 Composition Rules Input: tagged text with all the possible Named Entities Out: a tagged text, where: –overlaps and inclusions between tags are removed –co-references are resolved

January 23, 2004GWC'04 - Brno (Czech Republic)25 Composition Rules II Composition rule for handling tag inclusions –Example: “ miles from New York...” B = CARDINAL A = MEASURE PATTERNNE1 NE2 NE1 NE2 [start = n] [end = m] [TAG = A ] [start = o (n≤o<m)][end = p (o<p ≤m)] [TAG = B  A ] OUTPUT NE1[start = n] [end = m] [TAG = A ]

January 23, 2004GWC'04 - Brno (Czech Republic)26 Composition Rules III Composition rule for co-reference resolution –Example: “…with Judge Pasco Bowman. Bowman was...”   PATTERNNE1 NE* NE2 NE1 NE2 [entity =  ] [TAG = A ] [entity =  substring of  ] [TAG = “ NAMEX ”] OUTPUT NE2 [entity =  ] [TAG = A ]

January 23, 2004GWC'04 - Brno (Czech Republic)27 Experiment DARPA/NIST HUB4 competition test corpora and scoring software Categories: PERSON,LOCATION,ORGANIZATION Reference tagged corpora –English: 365 Kb of newswire texts –Italian: 77 Kb of transcripts from two Italian broadcast news shows (~7000 words, 322 NEs) F-measure, Precision and Recall computed comparing reference corpora with automatically tagged ones –type, content, and extension of each NE are considered

January 23, 2004GWC'04 - Brno (Czech Republic)28 Results RecallPrecisionF-Measure ITAENGITAENGITAENG PERSON LOCATION ORGANIZATION All categories

January 23, 2004GWC'04 - Brno (Czech Republic)29 Conclusion and Future Work We presented a NE recognition system based on information represented in Wordnet Language independent predicates for NE have been defined Results on two languages show that the approach performs as state of art rule based systems The system has been successfully integrated in a QA system Future work: –move to WN 2.0 –integrate gazetteers –use Sumo concepts