Acquisition of Lexical Knowledge for NLP German Rigau i Claramunt TALP Research Center Departament de Llenguatges i Sistemes.

Presentation transcript:

Acquisition of Lexical Knowledge for NLP German Rigau i Claramunt TALP Research Center Departament de Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya

Outline
• Setting
• Words and Works
• Structured Sources
  – MRDs, thesauri
• Unstructured Sources
  – corpora

Setting
• NLP and the Lexicon
  – Theoretical: WG, GPSG, HPSG.
  – Practical: realistic complexity and coverage
• Lexical Bottleneck (Briscoe 91)
  – Even worse for languages other than English

Setting
• Which lexical knowledge (LK) is needed by a concrete NLP system?
• Where is this LK located?
• Which procedures can be applied?

Setting
• Which LK is needed by a concrete NLP system?
  – Phonology: phonemes, stress, etc.
  – Morphology: POS, etc.
  – Syntactic: category, subcat., etc.
  – Semantic: class, SRs, etc.
  – Pragmatic: usage, registers, TDs, etc.
  – Translations: translation links

Setting
• Where is this LK located?
  – Human brain
  – Structured Lexical Resources:
    • Monolingual and bilingual MRDs
    • Thesauri
  – Unstructured Lexical Resources:
    • Monolingual and bilingual Corpora
  – Mixing resources

Setting
• Which procedures can be applied?
  – Prescriptive approach
    • Machine-aided manual construction
  – Descriptive approach
    • Automatic acquisition from pre-existing Lexical Resources
  – Mixed approach

Outline
• Setting
• Words and Works
• Structured Sources
  – MRDs, thesauri
• Unstructured Sources
  – corpora

Words and Works: Where is this Lexical Knowledge located?
– Human brain:
  • Linguistic String Project (Fox et al. 88)
    – Lexical information for 10,000 entries
  • WordNet (Miller et al. 90)
    – Semantic information; v1.6 with 99,642 synsets
  • Comlex (Grishman et al. 94)
    – Syntactic information for 38,000 English words
  • CYC Ontology (Lenat 95)
    – a person-century of effort to produce 100,000 terms
  • LDOCE3-NLP
    – dictionary with 80,000 senses

Words and Works: Where is this Lexical Knowledge located?
– Structured Lexical Resources
  • Monolingual MRDs:
    – LDOCE
      • learner’s dictionary
      • 35,956 entries and 76,059 definitions
      • 86% semantic and 44% pragmatic codes
      • controlled vocabulary of 2,000 words
      • (Boguraev & Briscoe 89)
      • (Vossen & Serail 90)
      • (Bruce & Guthrie 92), (Wilks et al. 93)
      • (Dolan et al. 93), (Richardson 97)

Words and Works: Where is this Lexical Knowledge located?
– Structured Lexical Resources
  • Other Monolingual MRDs:
    – Webster’s (Jensen & Ravin 87)
    – LPPL (Artola 93)
    – DGILE (Castellón 93), (Taulé 95), (Rigau 98)
    – CIDE (Harley & Glennon 97)
    – AHD (Richardson 97)
    – WordNet (Harabagiu 98)
  • Bilingual MRDs
    – Collins Spanish/English (Knight & Luk 94)
    – Vox/Harrap’s Spanish/English (Rigau 98)

Words and Works: Where is this Lexical Knowledge located?
– Structured Lexical Resources
  • Thesauri:
    – Roget’s Thesaurus
      • 60,071 words in 1,000 categories
      • (Yarowsky 92), (Grefenstette 93), (Resnik 95)
    – Roget’s II and The New Collins Thesaurus
      • (Byrd 89)
    – Macquarie’s thesaurus
      • (Grefenstette 93)
    – Bunrui Goi Hyou Japanese thesaurus
      • (Utsuro et al. 93)

Words and Works: Where is this Lexical Knowledge located?
– Structured Lexical Resources
  • Encyclopaedias
    – Grolier’s Encyclopaedia (Yarowsky 92)
    – Encarta (Richardson et al. 98)
  • Others
    – Telephone directories
  • Mixing structured lexical resources
    – Roget’s Thesaurus and Grolier’s (Yarowsky 92)
    – LDOCE, WN, Collins, ONTOS, UM (Knight & Luk 94)
    – Japanese MRD to WN (Okumura & Hovy 94)
    – LLOCE, LDOCE (Chen & Chang 98)

Words and Works: Where is this Lexical Knowledge located?
– Unstructured Lexical Resources
  • Corpora:
    – WSJ, Brown Corpus (SemCor), Hansard
    – Proper Nouns (Hearst & Schütze 95)
    – Idiosyncratic Collocations (Church et al. 91)
    – Preposition preferences (Resnik and Hearst 93)
    – Subcategorization structures (Briscoe and Carroll 97)
    – Selectional restrictions (Resnik 93), (Ribas 95)
    – Thematic structure (Basili et al. 92)
    – Word semantic classes (Dagan et al. 94)
    – Bilingual Lexicons for MT (Fung 95)

Words and Works: Where is this Lexical Knowledge located?
– Mixing structured and non-structured Lexical Resources
  • MRDs and Corpora
    – (Liddy & Paik 92)
    – (Klavans & Tzoukermann 96)
  • WordNet and Corpora
    – (Resnik 93), (Ribas 95), (Li & Abe 95), (McCarthy 01)
    – (Mihalcea & Moldovan 99)

Words and Works: Lexical Acquisition from MRDs
– Syntactic Disambiguation (Dolan et al. 93)
– Semantic Processing (Vanderwende 95)
– WSD (Lesk 86), (Wilks & Stevenson 97), (Rigau 98)
– IR (Krovetz & Croft 92)
– MT (Knight and Luk 94), (Tanaka & Umemura 94)
– Semantically enriching MRDs
  • (Yarowsky 92), (Knight 93), (Chen & Chang 98)
– Building LKBs
  • (Bruce & Guthrie 92)
  • (Dolan et al. 93)
  • (Artola 93)
  • (Castellón 93), (Taulé 95), (Rigau 98)

Words and Works: International Projects on Lexical Acquisition
– Japanese Projects
  • EDR (Yokoi 95)
    – Nine-year project oriented to MT
    – Bilingual corpora with 250,000 words
    – Monolingual, bilingual and co-occurrence dictionaries
    – 200,000 general vocabulary
    – 100,000 technical terminology
    – 400,000 concepts

Words and Works: International Projects on Lexical Acquisition
– American Projects
  • Comlex (Grishman et al. 94)
    – Syntactic information for 38,000 words
  • WordNet (Miller 90)
    – Semantic information
    – more than 123,000 words organised in 99,000 synsets
    – more than 116,000 relations between synsets
  • Pangloss (Knight & Luk 94)
    – PUM, ONTOS, LDOCE semantic categories, WordNet
  • Cyc (Lenat 95)
    – common-sense knowledge
    – 100,000 concepts and 1,000,000 axioms

Words and Works: International Projects on Lexical Acquisition
– European Projects
  • Acquilex I and II
    – LA from monolingual and bilingual MRDs and corpora
  • LE-Parole
    – Large-scale harmonised set of corpora and lexicons for all the EU languages
  • EuroWordNet
    – To develop a multilingual WordNet for several European languages

Setting
• Acquilex I
• Acquilex II
• EuroWordNet

Words and Works: Acquilex
• Lexical Knowledge Acquisition
• Mixed approach
• Dictionaries (MRD -> MTD -> LDB -> LKB)
• Partners
  – Cambridge University
  – Istituto di Linguistica Computazionale, Pisa
  – Amsterdam University
  – Dublin University
• 30 months
• Theses
  – (Castellón 1993)
  – (Taulé 1995)
  – (Rigau 1998)

Words and Works: Acquilex II
• Lexical Knowledge Acquisition
• Mixed approach
• Corpora
• Partners
  – Cambridge University
  – Istituto di Linguistica Computazionale, Pisa
  – Amsterdam University
• 30 months
• Theses
  – [Ribas 1995] (Acquisition of Selectional Restrictions)
  – [Ageno...] (Robust Parsing)
  – [Padró 1998] (Relaxation Labelling)
  – [Màrquez 1999] (Decision Trees)

Words and Works: EuroWordNet
• Multilingual WordNet
• Partners
  – English, Spanish, Dutch, Italian
  – (and French, German, Czech, Estonian)
• noun synsets and verbal synsets
• 30 months
• Theses
  – [Farreres...] (Mapping of bilingual dictionaries)
  – [Daudé...] (Mapping of hierarchies)

Outline
• Setting
• Words and Works
• Structured Sources
  – MRDs, thesauri
• Unstructured Sources
  – corpora

Structured Sources: Acquisition of LK from MRDs
• Focusing on:
  – the massive acquisition of LK
  – from MRDs (conventional, in any language)
  – using automatic methodologies
• Why MRDs?
  Conventional dictionaries for human use usually “contain spelling, pronunciation, hyphenation, capitalization, usage notes for semantic domains, geographic regions, and propriety; etymological, syntactic and semantic information about the most basic units of the language” (Amsler 81)

Structured Sources: Dictionaries
• LDOCE (Longman Dictionary of Contemporary English)
• DGILE (Diccionario General Ilustrado de la Lengua Española)
• DGLC (Diccionari General de la Llengua Catalana)
• DVHE (Diccionari Vox-Harrap’s Esencial)

Structured Sources: Dictionaries: LDOCE
• Highly coded, restricted vocabulary
• senses in entries
  – LDOCE id, POS, Grammatical Code, Idiom, Pragmatic Code,
  – Semantic Code (subject-preference), object-preference,
  – indirect-object-preference, definition.
    |cheese_0_1| <> <> <>
    |cheese_0_2| <> <> <>
    |cheese_0_3| <> <> <>

Structured Sources: Dictionaries: DGILE
• Poorly coded, no restricted vocabulary
• senses in entries
• 1.4 million words in definitions and examples
    ((queso ) (ETIM l. caseu )
     (Sense 1) (CA m.) (DEF Masa que se obtiene cuajando la leche, exprimiéndola para que deje suero y echándole sal para que se conserve: ~ de Gruyère; ~ de Roquefort; ~ de bola, el de tipo holandés, de forma esférica; ~ de hierba, el que se hace cuajando la leche con hierba a propósito; ~ manchego, el de pasta compacta, algo dura, crudo, de leche de oveja.)
     (Sense 2) (CA m.) (DEF ~ de cerdo, manjar hecho con carne de cerdo o jabalí, picada y prensada.)
     (Sense 3) (CA m.) (DEF ~ helado, helado compacto hecho en molde.)
     (Sense 4) (CA m.) (DEF Medio ~, tablero grueso, semicircular, que usan los sastres para planchar cuellos y solapas y para sentar costuras curvas.)
     (Sense 5) (CA m.) (REG fam.) (DEF Pie.)
     (Sense 6) (CA m.) (GEO Venez.) (DEF ~ frito, estafa.)
     (RELA 1) (TIPOR Rel.) (TXR Del l. caseu derivan numerosos tecn. como caseína, cáseo, caseificar, caseico, caseoso.))

Structured Sources: Dictionaries: DGLC (Fabra)
• Poorly coded, no restricted vocabulary
• senses in entries
    ((formatge) (CC m.)
     (NS 1 > 1 > 1 > 0 > 0 > 0) (CG m.) (DF Massa alimentosa que s’obté coagulant la llet, esprement-ne el xerigot i consolidant la part presa.)
     (NS 2 > 1 > 0 > 1 > 0 > 0) (CG m.) (EX Formatge de Ma.)
     (NS 3 > 1 > 0 > 2 > 0 > 0) (CG m.) (EX Formatge fresc, salat.)
     (NS 4 > 1 > 0 > 3 > 0 > 0) (CG m.) (EX Ratllar formatge.)
     (NS 5 > 1 > 0 > 4 > 0 > 0) (CG m.) (EX Un formatge.)
     (FI f4) )

Structured Sources: Acquilex
– Morphological Information
  • POS (n, v, adj, adv, etc.)
  • Derivative forms
  • Composed forms
  • Derivative Model (verbs)
– Syntactic Information
  • Idioms
  • Implicit Knowledge
    barrer_1_1   limpiar (el suelo) con la escoba.
    freír_1_1    cocer (un manjar) en aceite o grasa hirviendo.
    comprar_1_1  adquirir (una cosa) a cambio de cierta cantidad de dinero.
    cazar_1_1    buscar o perseguir (a las aves, fieras, etc.) para cogerlas o matarlas.

Structured Sources: Semantic Information
• Explicit (LDOCE: Pragmatic Codes, Semantic Code, etc.; DGILE: topic, synonyms, antonyms, figurative senses, etc.)
• Implicit
    jardín_1_1    Terreno donde se cultivan plantas y flores ornamentales.
    florero_1_4   Maceta con flores.
    ramo_1_3      Conjunto natural o artificial de flores, ramas o hierbas.
    pétalo_1_1    Hoja que forma la corola de la flor.
    tálamo_1_3    Receptáculo de la flor.
    miel_1_1      Substancia viscosa y muy dulce que elaboran las abejas, en una distensión del esófago, con el jugo de las flores y luego depositan en las celdillas de sus panales.
    florería_1_1  Floristería; tienda o puesto donde se venden flores.
    florista_1_1  Persona que tiene por oficio hacer o vender flores.
    camelia_1_1   Arbusto cameliáceo de jardín, originario de Oriente, de hojas perennes y lustrosas, y flores grandes, blancas, rojas o rosadas (Camellia japonica).
    camelia_1_2   Flor de este arbusto.
    rosa_1_1      Flor del rosal.

Structured Sources: Main Problems
– Conventional dictionaries are not systematic
– Dictionaries are built for human use
– Implicit Knowledge
  • words are described/translated in terms of words

Structured Sources: SEISD (Rigau 98)
• The System
  – General Frame
  – Methodology
  – SEISD
  – Application of the methodology

SEISD: General Frame
– Characteristics of the Lexical Resources used
– Lexical Knowledge to be extracted
– Lexical Knowledge Representation
– The acquisition process

SEISD: General Frame
– Characteristics of the Lexical Resources used
  • DGILE
  • Spanish/English bilingual Dictionaries
  • WordNet
  • Type System of the LKB

SEISD: General Frame
– Characteristics of the Lexical Resources used
  • DGILE
    – 89,043 entries and 157,842 senses
    – 1.4 million words in definitions and examples
    – neither semantic nor pragmatic codes
    – no restricted vocabulary
    vino (l. vinu) m. Zumo de uvas fermentado;... 2 fig. Bautizar o cristianizar, el ~, echarle agua. 3 fig. Dormir uno el ~, dormir mientras dura la borrachera; tener uno mal ~, ser pendenciero en la embriaguez. 4 p.ext. Zumo. | HOMOF.: vino (v.), bino (v.). REL. Enológico, enólogo, enotecnia, derivados de enología, ciencia de la vinicultura, formada del gr. oinos.

SEISD: General Frame
– Characteristics of the Lexical Resources used
  • Spanish/English bilingual Dictionaries
    – EEI: 16,463 entries with 28,002 translation fields
    – EIE: 15,352 entries with 27,033 translation fields
    vino m wine. ~ de Jerez, sherry; ~ tinto, red wine.
    wine n vino

SEISD: General Frame
– Characteristics of the Lexical Resources used
  • WordNet
    – v1.6 has 123,497 content words and 99,642 synsets
    Sense 1
    wine, vino -- (fermented juice (of grapes especially))
    => sake, saki -- (Japanese beverage from fermented rice...)
    => vintage -- (a season's yield of wine from a vineyard)
    => red wine -- (wine having a red color derived from skins...)
    => Pinot noir -- (dry red California table wine...)
    => claret, red Bordeaux -- (dry red Bordeaux or Bordeaux-like wine)
    => Saint Emilion -- (full-bodied red wine from...)
    => Chianti -- (dry red Italian table wine from the Chianti...)
    => Cabernet, Cabernet Sauvignon -- (superior Bordeaux-type red wine)
    => Rioja -- (dry red table wine from the Rioja...)
    => zinfandel -- (dry fruity red wine from California)

SEISD: General Frame
– Characteristics of the Lexical Resources used
  • Type System of the LKB (Copestake 92)
    – 527 types with 196 features

SEISD: General Frame
– Lexical Knowledge to be extracted
  • Explicit information (POS, TD, uses, etc.)
  • Implicit information
    – Hypernym/hyponym relations (class/subclass)
    – Synonymy/Antonymy relations
    – Meronym/Holonym relation (part/whole,...)
    – Case role relations (agentive, telic,...)
    – Content relations (qualia, form, constitutive,...)
    – Collocational relations (compounds, idioms,...)
    – Selectional restrictions (typical subject, object,...)
    – Translation Equivalences

SEISD: General Frame
– Lexical Knowledge Representation
  • LKB (Copestake 92)
    – represents both syntactic and semantic information
    – Typed Feature Structure formalism (Carpenter 92)
    – default inheritance
    – lexical and phrasal rules
    – multilingual relations

Structured Sources: SEISD
• The System
  – General Frame
  – Methodology
  – SEISD
  – Application of the methodology

SEISD: Methodology
[Diagram labels: MRD1 ... MRDn; LDB1 ... LDBn; Tax1 ... Taxn; LKB1 ... LKBn; MLKB]

SEISD: Methodology
– Problems following a pure descriptive approach
  • Circularity
  • Errors and inconsistencies
  • Definitions with omitted genus
  • Top dictionary senses do not usually represent useful knowledge for the LKB
    – Too general
    – Too specific

SEISD: Methodology
Mixed Methodology
• Prescriptive approach: manual construction of the Type System

SEISD: Methodology
Mixed Methodology
• Descriptive approach: acquiring implicit information from MRDs
• Prescriptive approach: manual construction of the Type System

SEISD: Methodology
– Step 1: Selection of the main top beginners for a semantic primitive
– Step 2: Exploiting genus, construction of taxonomies
– Step 3: Exploiting differentia
– Step 4: Mapping the LK into the LKB
– Step 5: Tlinks Generation
– Step 6: Validation and exploitation of the LKB

Structured Sources: SEISD
• The System
  – General Frame
  – Methodology
  – SEISD
  – Application of the methodology

Structured Sources: SEISD
– SEISD: Sistema d’Extracció d’Informació Semàntica de Diccionaris (System for Extracting Semantic Information from Dictionaries) (Ageno et al. 92)
  • designed to support the main methodology
  • taking into account the characteristics of the lexical resources used
  • reusability of software and lexical resources
  • allowing modular improvements
  • minimal effort

SEISD
[Architecture diagram labels: User; LKB System; PRE, SemBuild, TaxBuild, CRS, TGE; LDB/LKB system; LDB System; Linguistic Knowledge (SegWord and FPar; MACO+, Relax and SinPar); Lexical Knowledge (User LDB; DGILE; English/Spanish and Spanish/English MTDs; WordNet; Taxonomies; LKB Type System; Lexicons); LDB/LKB Lexicons]

Structured Sources: SEISD (Rigau 98)
• The System
  – General Frame
  – Methodology
  – SEISD
  – Application of the methodology

SEISD: Application of the methodology
Step 1: Selection of the main top beginners for a semantic primitive (Rigau et al. 98)
    Word sense:  zumo_1_1
    Attached-to: c_art_subst type.
    Definition:  líquido que se extrae de las flores, hierbas, frutos, etc. (liquid extracted from flowers, herbs, fruits, etc.)

SEISD: Application of the methodology
Step 1: Selection of the main top beginners for a semantic primitive
– A) Attaching DGILE senses to semantic primitives
  • 1) First labelling:
    – Conceptual Distance (Rigau 94)
  • 2) Second labelling:
    – Salient Words (Yarowsky 92)
– B) Filtering Process

SEISD: Application of the methodology
Step 1: Selection of the main top beginners for a semantic primitive
– A.1) First labelling:
  • Conceptual Distance (Agirre et al. 94)
    – length of the shortest path
    – specificity of the concepts
  • using WordNet
  • Bilingual dictionary
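
The two ingredients named above, path length and concept specificity, can be combined in one score. The sketch below is only an illustration under stated assumptions: it sums 1/depth over the concepts on the shortest path between two nodes, so that paths through general (shallow) concepts cost more. The toy hierarchy and the function names are hypothetical and are not taken from WordNet or from SEISD.

```python
# Toy hypernym hierarchy (child -> parent). Hypothetical data, not WordNet.
PARENT = {
    "abbey": "monastery",
    "monastery": "religious_building",
    "church": "religious_building",
    "religious_building": "building",
    "building": "artifact",
    "artifact": "entity",
    "wine": "beverage",
    "beverage": "food",
    "food": "entity",
}


def ancestors(concept):
    """Return the chain from a concept up to the root, concept first."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(concept):
    """Depth of a concept in the hierarchy; the root has depth 1."""
    return len(ancestors(concept))


def conceptual_distance(c1, c2):
    """Sum 1/depth over the concepts on the shortest path between c1 and c2.

    In a tree the shortest path runs through the lowest common ancestor.
    Deep (specific) concepts contribute little, shallow ones contribute a lot.
    """
    up1, up2 = ancestors(c1), ancestors(c2)
    common = next(c for c in up1 if c in up2)  # lowest common ancestor
    path = up1[: up1.index(common) + 1] + up2[: up2.index(common)]
    return sum(1.0 / depth(c) for c in path)


if __name__ == "__main__":
    print(conceptual_distance("abbey", "church"))
    print(conceptual_distance("abbey", "wine"))
```

With this toy data, abbey is closer to church than to wine, because the second path has to climb to the very general root concept.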

SEISD: Application of the methodology
Step 1: Selection of the main top beginners for a semantic primitive
    abadía_1_2  Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess)
    [Diagram nodes: <entity>; <monastery><abbey>; <convent><abbey>]

SEISD: Application of the methodology
Step 1: Selection of the main top beginners for a semantic primitive
    abadía_1_2  Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess)
    [Diagram nodes: <entity>; 06 ARTIFACT; <monastery><abbey>; <convent><abbey>]

SEISD: Application of the methodology
Step 1: Selection of the main top beginners for a semantic primitive
– A.1) First labelling (Results)
  • 29,205 labelled definitions (31%)
  • 61% accuracy at a sense level
  • 64% accuracy at a file level

SEISD: Application of the methodology
Step 1: Selection of the main top beginners for a semantic primitive
– A.2) Second labelling:
  • Salient Words (Yarowsky 92)
  • Importance
    – local frequency
    – a salient word appears significantly more often in the corpus of a semantic category than at other points in the whole corpus
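
As a rough illustration of the salience idea, the sketch below scores each word by log(P(w | category) / P(w)), a weight of the kind used in (Yarowsky 92). The token lists are invented toy data and the function name is hypothetical; smoothing and the exact corpus definition used in SEISD are not covered by the slides and are left out here.

```python
import math
from collections import Counter


def salience(category_tokens, corpus_tokens):
    """Score words by log(P(w | category) / P(w)).

    Assumes every category token also occurs in corpus_tokens
    (the category text is part of the whole corpus).
    """
    cat_counts = Counter(category_tokens)
    all_counts = Counter(corpus_tokens)
    n_cat, n_all = len(category_tokens), len(corpus_tokens)
    return {
        word: math.log((count / n_cat) / (all_counts[word] / n_all))
        for word, count in cat_counts.items()
    }


if __name__ == "__main__":
    # Hypothetical definition words for one category and for the rest.
    food = "leche queso zumo de uvas leche de oveja".split()
    other = "frasco de cristal tablero de madera pieza de metal".split()
    scores = salience(food, food + other)
    for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{word:8s} {score:+.3f}")
```

Words that are frequent everywhere, such as "de", get a score near or below zero, while words concentrated in the category get a positive score.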

SEISD: Application of the methodology
Step 1: Selection of the main top beginners for a semantic primitive
– A.2) Second labelling (Results):
  • 86,759 labelled definitions (93%)
  • 80% accuracy at a file level
    biberón_1_1  ARTIFACT  Frasco de cristal... (glass flask...)
    biberón_1_2  FOOD      Leche que contiene este frasco... (milk contained in that flask...)

SEISD: Application of the methodology
Step 1: Selection of the main top beginners for a semantic primitive
– B) Filtering process (FOODs)
  • removes all genus terms that
    – FILTER 1: are not FOODs according to the bilingual mapping
    – FILTER 2: appear more often as genus in other semantic categories (SC)
    – FILTER 3: have a low frequency
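
The three filters can be read as successive tests on each candidate genus term. The sketch below is one possible reading, not the SEISD implementation: the counts, the bilingual-mapping flags, the `min_freq` threshold and the choice to compare against the single best competing category in FILTER 2 are all assumptions made for illustration.

```python
def filter_genus_terms(genus_counts, maps_to_category, category="FOOD", min_freq=2):
    """Keep only genus terms that pass all three filters for one category.

    genus_counts:     genus term -> {semantic category: times used as genus}
    maps_to_category: genus term -> True if the bilingual mapping places
                      the term under the target category
    """
    kept = []
    for genus, by_category in genus_counts.items():
        in_category = by_category.get(category, 0)
        best_other = max(
            (n for c, n in by_category.items() if c != category), default=0
        )
        if not maps_to_category.get(genus, False):  # FILTER 1
            continue
        if best_other > in_category:                # FILTER 2
            continue
        if in_category < min_freq:                  # FILTER 3
            continue
        kept.append(genus)
    return kept


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    counts = {
        "vino": {"FOOD": 40, "ARTIFACT": 1},
        "frasco": {"FOOD": 3, "ARTIFACT": 25},
        "pieza": {"FOOD": 5, "ARTIFACT": 2},
        "manjar": {"FOOD": 1},
    }
    mapping = {"vino": True, "frasco": True, "pieza": False, "manjar": True}
    print(filter_genus_terms(counts, mapping))
```

In this toy run only "vino" survives: "pieza" fails the bilingual mapping, "frasco" is mostly a genus for another category, and "manjar" is too rare.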

SEISD: Application of the methodology
Step 1: Selection of the main top beginners for a semantic primitive
– B) Filtering process (FOOD Results)

SEISD: Application of the methodology
Step 2 (TaxBuild): Exploiting Genus (Rigau et al. 97)
    Word sense: vino_1_1
    Hypernym:   zumo_1_1.
    Definition: zumo de uvas fermentado. (fermented juice of grapes).
    Word sense: rueda_2_1
    Hypernym:   vino_1_1.
    Definition: vino procedente de la región de Rueda (Valladolid). (wine from the region of Rueda).

SEISD: Application of the methodology
Step 2 (TaxBuild): Exploiting Genus
– Genus Sense Identification
  • 97% accuracy for nouns
– Genus Sense Disambiguation
  • Unsupervised WSD
  • Unrestricted WSD (coverage 100%)
  • Eight Heuristics (McRoy 92)
    – Combining several lexical resources
    – Combining several methods

SEISD: Application of the methodology
Step 2 (TaxBuild): Exploiting Genus
– Results:

SEISD: Application of the methodology
Step 2 (TaxBuild): Exploiting Genus
– Knowledge provided by each heuristic:

SEISD: Application of the methodology
Step 2 (TaxBuild): Exploiting Genus
– F2+F3>9: 35,099 definitions
– F2+F3>4: 40,754 definitions
– No filters: 111,624 definitions

SEISD: Application of the methodology
Step 2 (TaxBuild): Exploiting Genus
    ...
    zumo_1_1 vino_1_1 quianti_1_1
    zumo_1_1 vino_1_1 raya_1_8
    zumo_1_1 vino_1_1 requena_1_1
    zumo_1_1 vino_1_1 reserva_1_12
    zumo_1_1 vino_1_1 ribeiro_1_1
    zumo_1_1 vino_1_1 rioja_1_1
    zumo_1_1 vino_1_1 roete_1_1
    zumo_1_1 vino_1_1 rosado_1_3
    zumo_1_1 vino_1_1 rueda_2_1
    zumo_1_1 vino_1_1 sherry_1_1
    zumo_1_1 vino_1_1 tarragona_1_1
    zumo_1_1 vino_1_1 tintilla_1_1
    zumo_1_1 vino_1_1 tintorro_1_1
    zumo_1_1 vino_1_1 toro_3_1
    ...
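
Once each sense has a disambiguated genus, taxonomy lines like the ones above follow from walking the hypernym links upward. The sketch below shows that walk on three links taken from the slides; the function name and the dictionary layout are hypothetical and do not describe TaxBuild itself. The loop guard reflects the circularity problem that dictionaries are known to have.

```python
# Sense -> hypernym sense, as produced by genus disambiguation.
HYPERNYM = {
    "vino_1_1": "zumo_1_1",
    "rueda_2_1": "vino_1_1",
    "rioja_1_1": "vino_1_1",
}


def chain_to_top(sense, hypernym):
    """Return the path from the top sense down to the given sense."""
    chain = [sense]
    seen = {sense}
    while chain[-1] in hypernym:
        parent = hypernym[chain[-1]]
        if parent in seen:  # circular definitions: stop climbing
            break
        chain.append(parent)
        seen.add(parent)
    return list(reversed(chain))


if __name__ == "__main__":
    for sense in ("rioja_1_1", "rueda_2_1"):
        print(" ".join(chain_to_top(sense, HYPERNYM)))
```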

Acquisition of Lexical Knowledge for NLP69 SEISD: Application of the methodology Step 3 (SemBuild): Exploiting Differentia
Word Sense: rueda_2_1
Definition: vino procedente de la región de Rueda
SinPar: sn:[n:vino] origin:[n:región, sp:[r0d:de, sn:[n:Rueda]]].

Acquisition of Lexical Knowledge for NLP70 SEISD: Application of the methodology Step 3 (SemBuild): Exploiting Differentia –MACO+, Relax (Padró 97), SinPar

Acquisition of Lexical Knowledge for NLP71 SEISD: Application of the methodology Step 4 (CRS): Placing the LK into the LKB
rueda x_1_1 = (“VOX”) = (“rueda”) = (“2”) = (“1”) = = <(“Rueda”).

Acquisition of Lexical Knowledge for NLP72 SEISD: Application of the methodology Step 5 (TGE): Tlinks Generation (Ageno et al. 93)
rueda_x_2_1 linked to wine_l_1_1 (parent)
rueda_x_2_1 linked to drink_l_2_1 (grandparent)

Acquisition of Lexical Knowledge for NLP73 SEISD: Application of the methodology Step 5 (TGE): Tlinks Generation
–Simple tlink:
–Partial tlink
  –Rioja_x_1_1 linked to wine_l_1_1
–Phrasal tlink
  –ahumado_x_1_1 linked to smoked_food_l_1_1

Acquisition of Lexical Knowledge for NLP74 SEISD: Application of the methodology Step 5 (TGE): Tlinks Generation
• First experiment
  –(semi)automatic approach using PRE
  –linking DGILE to LDOCE
  –drink taxonomy (235 definitions)

Acquisition of Lexical Knowledge for NLP75 SEISD: Application of the methodology Step 5 (TGE): Tlinks Generation
• First experiment (results)

Acquisition of Lexical Knowledge for NLP76 SEISD: Application of the methodology Step 5 (TGE): Tlinks Generation
• Second experiment
  –automatic approach using PRE
    • Conceptual Distance
  –linking DGILE to WordNet
  –food taxonomy (140 definitions)

Acquisition of Lexical Knowledge for NLP77 SEISD: Application of the methodology Step 5 (TGE): Tlinks Generation
[Diagram: Spanish taxonomy zumo_1_1, vino_1_1, rueda_2_1 alongside WordNet nodes <juice>, <foodstuff>, <object>, <entity>]

Acquisition of Lexical Knowledge for NLP78 SEISD: Application of the methodology Step 5 (TGE): Tlinks Generation
[Diagram: Spanish taxonomy zumo_1_1, vino_1_1, rueda_2_1 alongside WordNet nodes <juice>, <foodstuff>, <object>, <entity>; link labels: simple-tlink, simple-tlink, partial-tlink]

Acquisition of Lexical Knowledge for NLP79 SEISD: Application of the methodology Step 5 (TGE): Tlinks Generation
• Second experiment (results)

Acquisition of Lexical Knowledge for NLP80 SEISD: Application of the methodology Step 5: Mapping bilingual entries to WordNet (Atserias et al. 97)
• Ten class methods
  –Four monosemic criteria
  –Four polysemic criteria
  –Two hybrid criteria
• Three conceptual distance methods
  –CD1: using pairwise word co-occurrences
  –CD2: using headword and genus
  –CD3: using bilingual Spanish entries with multiple translations
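
A rough sketch of how the monosemic and polysemic class criteria might be organised follows. It rests on assumptions the slide does not state: that the four criteria in each group correspond to 1:1, 1:N, N:1 and N:M translation patterns between Spanish words (SW) and English words (EW), and that "monosemic" means the English word has exactly one WordNet sense. The function name, the toy bilingual pairs and the toy sense inventory are hypothetical, and the hybrid criteria are not covered.

```python
from collections import defaultdict


def classify_pairs(bilingual_pairs, wordnet_senses):
    """Label each (SW, EW) translation pair with a sense class and a shape.

    bilingual_pairs -- iterable of (spanish_word, english_word) translations
    wordnet_senses  -- english_word -> list of sense ids (toy stand-in for WordNet)
    Returns {(sw, ew): (sense_class, shape)}; pairs whose English word has
    no sense in `wordnet_senses` are skipped.
    """
    sw_to_ew = defaultdict(set)
    ew_to_sw = defaultdict(set)
    for sw, ew in bilingual_pairs:
        sw_to_ew[sw].add(ew)
        ew_to_sw[ew].add(sw)

    shapes = {
        (True, True): "1:1",    # one EW for the SW, one SW for the EW
        (False, True): "1:N",   # the SW has several EWs
        (True, False): "N:1",   # several SWs share the EW
        (False, False): "N:M",
    }
    result = {}
    for sw, ew in bilingual_pairs:
        n_senses = len(wordnet_senses.get(ew, ()))
        if n_senses == 0:
            continue
        sense_class = "monosemic" if n_senses == 1 else "polysemic"
        key = (len(sw_to_ew[sw]) == 1, len(ew_to_sw[ew]) == 1)
        result[(sw, ew)] = (sense_class, shapes[key])
    return result


# Toy data, invented for illustration; the sense counts are not real WordNet figures.
pairs = [("vino", "wine"), ("zumo", "juice"), ("jugo", "juice"), ("queso", "cheese")]
senses = {"wine": ["wine#1", "wine#2"], "juice": ["juice#1"], "cheese": ["cheese#1"]}
labels = classify_pairs(pairs, senses)
for pair, label in labels.items():
    print(pair, label)
```

The shape is decided per pair here, which is a simplification: a fuller treatment would look at the whole connected group of Spanish and English words.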

Acquisition of Lexical Knowledge for NLP81 SEISD: Application of the methodology Step 5: Mapping bilingual entries to WordNet
• Ten class methods
  –Four monosemic criteria
[Diagram: SW and EW nodes]

Acquisition of Lexical Knowledge for NLP82 SEISD: Application of the methodology Step 5: Mapping bilingual entries to WordNet
• Ten class methods
  –Four monosemic criteria
[Diagram: SW and EW nodes linked to Synset nodes]

Acquisition of Lexical Knowledge for NLP83 SEISD: Application of the methodology Step 5: Mapping bilingual entries to WordNet
• Ten class methods
  –Four polysemic criteria
[Diagram: SW and EW nodes linked to Synset+ nodes]

Acquisition of Lexical Knowledge for NLP84 SEISD: Application of the methodology Step 5: Mapping bilingual entries to WordNet
• Ten class methods
  –Variant criterion
  –Field criterion
[Diagram: SW nodes]

Acquisition of Lexical Knowledge for NLP85 SEISD: Application of the methodology Step 5: Mapping bilingual entries to WordNet
• Ten class methods (results)

Acquisition of Lexical Knowledge for NLP86 SEISD: Application of the methodology Step 5: Mapping bilingual entries to WordNet
• Three CD methods (results)

Acquisition of Lexical Knowledge for NLP87 SEISD: Application of the methodology Step 5: Mapping bilingual entries to WordNet
• Combining methods (results)

Acquisition of Lexical Knowledge for NLP88 SEISD: Application of the methodology Step 5: Mapping bilingual entries to WordNet
• Resulting Spanish WordNets

Acquisition of Lexical Knowledge for NLP89 Acquisition of Lexical Knowledge for NLP LK from MRDs: Taxonomies
...
zumo_1_1 vino_1_1 quianti_1_1
zumo_1_1 vino_1_1 raya_1_8
zumo_1_1 vino_1_1 requena_1_1
zumo_1_1 vino_1_1 reserva_1_12
zumo_1_1 vino_1_1 ribeiro_1_1
zumo_1_1 vino_1_1 rioja_1_1
zumo_1_1 vino_1_1 roete_1_1
zumo_1_1 vino_1_1 rosado_1_3
zumo_1_1 vino_1_1 rueda_2_1
zumo_1_1 vino_1_1 sherry_1_1
zumo_1_1 vino_1_1 tarragona_1_1
zumo_1_1 vino_1_1 tintilla_1_1
zumo_1_1 vino_1_1 tintorro_1_1
zumo_1_1 vino_1_1 toro_3_1
...

Acquisition of Lexical Knowledge for NLP90 Acquisition of Lexical Knowledge for NLP LK from MRDs: MTDs connections
elaborado queso
pasta queso
leche queso
oveja queso
queso sabor
queso tortilla
queso vaca
maíz queso
queso suero
mantequilla queso
compacta queso
picante queso
manchego queso
cabra queso
pan queso

Acquisition of Lexical Knowledge for NLP91

Acquisition of Lexical Knowledge for NLP92 Acquisition of Lexical Knowledge for NLP LK from MRDs: EuroWordNet

SPANISH    synsets    words    variants
adjs        12,461    8,714      16,713
nouns       43,573   47,813      62,319
verbs        8,298    6,010      13,230
Sum         64,332   62,537      92,262

CATALAN    synsets    words    variants
adjs         1,386    1,507       2,030
nouns       30,701   32,988      42,320
verbs        4,487    4,289      10,297
Sum         36,574   38,784      54,647
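
As a quick consistency check on the figures above, the per-category counts can be summed and compared with the reported Sum rows. The dictionary layout and names below are just one convenient encoding of the slide's numbers, not part of the original material.

```python
# Counts per part of speech as (synsets, words, variants), copied from the slide.
counts = {
    "SPANISH": {
        "adjs": (12461, 8714, 16713),
        "nouns": (43573, 47813, 62319),
        "verbs": (8298, 6010, 13230),
    },
    "CATALAN": {
        "adjs": (1386, 1507, 2030),
        "nouns": (30701, 32988, 42320),
        "verbs": (4487, 4289, 10297),
    },
}

# "Sum" rows as reported on the slide.
reported = {
    "SPANISH": (64332, 62537, 92262),
    "CATALAN": (36574, 38784, 54647),
}


def column_sums(rows):
    """Sum each column (synsets, words, variants) over the part-of-speech rows."""
    return tuple(sum(column) for column in zip(*rows.values()))


totals = {language: column_sums(rows) for language, rows in counts.items()}
for language, total in totals.items():
    status = "matches" if total == reported[language] else "differs from"
    print(language, total, status, "the reported Sum row")
```

Both languages' column totals agree with the Sum rows on the slide.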