Building Wordnets Piek Vossen, Irion Technologies.

Overview
- Starting points
- Semantic framework
- Process overview
- Methodologies in other projects
- Multilinguality

Starting points
- Purpose of the wordnet database:
  - education, science, applications
  - formal ontology or linguistic ontology
  - making inferences or lexical substitution
  - conceptual density or large coverage
- Distributed development
- Reproducibility
- Available resources
- Language-specific features
- (Cross-language) compatibility
- Exploit community resources by projecting conceptual relations on a target wordnet

Semantic framework

Differences in wordnet structures
[Figure: two hierarchies side by side. Dutch wordnet: voorwerp {object} above lepel {spoon}, werktuig {tool}, tas {bag}, bak {box}, blok {block} and lichaam {body}. WordNet 1.5: object, natural object (an object occurring naturally), artifact, artefact (a man-made object), instrumentality, container, device, implement, tool and instrument above bag, spoon, box, block and body.]
- Artificial classes versus lexicalized classes: instrumentality; natural object
- Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch

Linguistic versus conceptual ontologies
Conceptual ontology:
- A particular level or structuring may be required to achieve better control or performance, or a more compact and coherent structure.
- Introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool).
- Neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise).
- What properties can we infer for spoons? spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking
Linguistic ontology:
- Exactly reflects the relations between all the lexicalized words and expressions in a language.
- Valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language.
- What words can be used to name spoons? spoon -> object, tableware, silverware, merchandise, cutlery

Wordnets as Linguistic Ontologies
Classical Substitution Principle:
- Any word that is used to refer to something can be replaced by its synonyms, hyperonyms and hyponyms: horse -> stallion, mare, pony, mammal, animal, being.
- It cannot be replaced by its co-hyponyms or by co-hyponyms of its hyperonyms: horse X cat, dog, camel, fish, plant, person, object.
Conceptual Distance Measurement:
- The number of hierarchical nodes between words is a measurement of closeness, where the level and the local density of nodes are additional factors.
Main purpose is to predict what words can be used as substitutes in language, considering all the lexicalized words in a language.
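
The substitution principle and the distance measure can be sketched on a toy hierarchy. The table below is an illustrative assumption built from the horse example (it is not real WordNet data), synonyms are left out, and the distance counts edges via the lowest shared hyperonym without the level and density factors mentioned on the slide.

```python
# Toy hyperonym table (child -> parent); entries are illustrative only.
HYPERONYM = {
    "stallion": "horse", "mare": "horse", "pony": "horse",
    "horse": "mammal", "cat": "mammal", "dog": "mammal", "camel": "mammal",
    "mammal": "animal", "fish": "animal",
    "animal": "being", "plant": "being", "person": "being",
}


def ancestors(word):
    """Return the chain of hyperonyms from `word` up to the root."""
    chain = []
    while word in HYPERONYM:
        word = HYPERONYM[word]
        chain.append(word)
    return chain


def can_substitute(word, candidate):
    """True if `candidate` is the word itself, one of its hyperonyms, or one of its hyponyms."""
    return (
        candidate == word
        or candidate in ancestors(word)
        or word in ancestors(candidate)
    )


def path_distance(a, b):
    """Count the edges between a and b via their lowest shared hyperonym (None if unrelated)."""
    up_a = [a] + ancestors(a)
    up_b = [b] + ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None
```

With this table, `can_substitute("horse", "animal")` holds while `can_substitute("horse", "cat")` does not, because cat is a co-hyponym rather than a hyperonym or hyponym of horse.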

Define a semantic framework
- Definition of relations
- Diagnostic frames (Cruse 1986)
- Examples and corpus data
- Top-level ontology
- Constraints on relations
- Type consistency
- Large scale validation

Process overview

Techniques
- Manual encoding and verification
- Automatic extraction:
  - definitions
  - synonyms
  - distribution and similarity patterns in corpora
  - defining contexts, e.g. cats and other pets
  - parallel corpora, e.g. bible translations
  - morphological structure
  - bilingual dictionaries
- Encode source and status of data: who, when, based on what algorithm, validated, final
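
Encoding the source and status of each piece of data could look like the record below. This is a minimal sketch: the class names, field names and status values are assumptions chosen to mirror "who, when, based on what algorithm, validated, final", not a format used by any of the projects.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"    # extracted or entered, not yet checked
    VALIDATED = "validated"  # checked by an editor
    FINAL = "final"          # frozen for release


@dataclass
class RelationRecord:
    source_synset: str
    relation: str
    target_synset: str
    editor: str        # who
    created: date      # when
    method: str        # based on what algorithm, or "manual"
    status: Status = Status.PROPOSED


def needs_review(records):
    """Return the records that nobody has validated yet."""
    return [r for r in records if r.status is Status.PROPOSED]
```

Keeping the method next to each relation makes it possible to withdraw or recheck everything produced by one extraction technique if it later turns out to be unreliable.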

Encoding cycle
1. Collecting data
  - Vocabulary: what is the list of words of a language?
  - Concepts: what is the list of concepts related to the vocabulary?
2. Encoding data:
  - Defining synsets
  - Defining language-internal relations: hyponymy, meronymy, roles, causal relations
  - Defining equivalence relations to English
  - Defining other relations, e.g. ontology types, domains
3. Validation
4. Go to 1.

Where to start?
How to get a first selection:
- Words (alphabetic, frequency) -> concepts -> relations
- Concept (hyperonym, domain, semantic feature) -> words -> concepts -> relations
How to get a complete overview of words and expressions that belong to a segment of a wordnet?
- Up to 20 hyperonyms for instrumentality: instrument, instrumentality, means, tool, device, machine, apparatus, ...
- Iterative process: collect, structure, collect, restructure, ...
- Using multiple sources of evidence
- Comparing results, e.g. a tricycle is a toy or a vehicle

Synonymy as a basis?
- Synsets are the core unit of a wordnet database
- Synonymy is only vaguely defined: substitution in a context
- Synonyms are very hard to detect
- Other relations (role relations, causal relations) are:
  - easier to detect and encode
  - easier to validate within a formal framework
  - easier to validate in a corpus
- A rich set of relations per concept helps alignment with other resources

Diagnostic frames and examples
Agent Involvement
(A/an) X is the one/that who/which does the Y, typically intentionally.
Conditions:
- X is a noun
- Y is a verb in the gerundive form
Example: A teacher is the one who does the teaching intentionally
Effect: {to teach} (Y) INVOLVED_AGENT {teacher} (X)
Patient Involvement
(A/an) X is the one/that who/which undergoes the Y
Conditions:
- X is a noun
- Y is a verb in the gerundive form
Example: A learner is the one who undergoes the learning
Effect: {to learn} (Y) INVOLVED_PATIENT {learner} (X)

Diagnostic frames and examples
Result Involvement
(A/an) X comes into existence as a result of Y, where X is a noun and Y is a verb in the gerundive form and a hyponym of make, produce, generate.
Examples:
- A crystal comes into existence as a result of crystalizing
- A crystal is the result of crystalizing
- A crystal is created by crystalizing
Effect: {to crystalize} (Y) INVOLVED_RESULT {crystal} (X)
Comments: Special kind of patient relation. The entity is not just changed or affected; it comes into existence as a result of the event. Only applies to concrete entities (1stOrder) or mental objects such as ideas (3rdOrder). Situations that result from other situations are related by the CAUSE relation.
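
The diagnostic frames lend themselves to a template-based test generator. The sketch below is an assumption about how such a tool might be set up: the frame wording is simplified from the slides, and the function names are hypothetical. A human informant still has to judge whether the generated sentence is acceptable before the relation is recorded.

```python
# Simplified frame templates; {x} is a noun, {y} a verb in the gerundive form.
FRAMES = {
    "INVOLVED_AGENT": "A/an {x} is the one who does the {y}, typically intentionally.",
    "INVOLVED_PATIENT": "A/an {x} is the one who undergoes the {y}.",
    "INVOLVED_RESULT": "A/an {x} comes into existence as a result of {y}.",
}


def frame_sentence(relation, noun, gerund):
    """Fill the diagnostic frame for `relation` so an informant can judge it."""
    return FRAMES[relation].format(x=noun, y=gerund)


def encode(relation, verb, noun):
    """Return the relation triple to store once the test sentence is accepted."""
    return (f"{{to {verb}}}", relation, f"{{{noun}}}")
```

For example, `frame_sentence("INVOLVED_PATIENT", "learner", "learning")` yields the learner sentence from the slide, and `encode("INVOLVED_PATIENT", "learn", "learner")` yields the corresponding effect.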

Hyponymy overloading (Guarino 1998, Vossen and Bloksma 1998)
The vocabulary does not clearly differentiate between orthogonal roles and disjoint types:
- role: passenger, teacher, student
- type: dog; cat
- ?: knife -> weapon, cutlery; spoon -> container, cutlery
- food material <- building material <-?- stone; <-?- water; <- brick
Disjunctive and conjunctive hyperonyms:
- albino -> animal or plant
- spoon -> cutlery & container

Hyponymy restructuring
[Figure: restructured hierarchy over the following synsets]
- ziekte (disease)
- dierenziekte (animal disease)
- infectieziekte (infectious disease)
- ingewandsziekte (bowel disease)
- veeziekte (cattle disease)
- kolder (staggers: brain disease of cattle)
- vuilbroed (infectious disease of bees)
- haringwormziekte (anisakiasis: bowel disease of herrings)

Methodologies in a number of projects
- Princeton Wordnet
- EuroWordNet: English, Dutch, German, French, Spanish, Italian, Czech, Estonian; 10,000 up to 50,000 synsets
- BalkaNet: Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian; 10,000 synsets

Main strategies for building wordnets
Expand approach: translate WordNet synsets to another language and take over the structure
- easier and more efficient method
- compatible structure with WordNet
- vocabulary and structure is close to WordNet but also biased
- can exploit many resources linked to WordNet: SUMO, WordNet Domains, selection restrictions from the BNC, etc.
Merge approach: create an independent wordnet in another language and align it with WordNet by generating the appropriate translations
- more complex and labor intensive
- different structure from WordNet
- language-specific patterns can be maintained, i.e. very precise substitution patterns
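
A first pass of the expand approach can be sketched as a dictionary lookup over the English synsets. Everything here is an illustrative assumption: the synset identifiers, the bilingual dictionary format and the function name are hypothetical, and the candidate translations would still need manual checking because a bilingual dictionary does not say which sense of a word it translates.

```python
def expand(english_synsets, bilingual):
    """Translate the lemmas of each English synset and keep the English synset ID.

    english_synsets: dict mapping a synset ID to its English lemmas
    bilingual: dict mapping an English lemma to target-language translations
    Returns (candidate target synsets, IDs with no translation at all).
    """
    target = {}
    untranslated = []
    for synset_id, lemmas in english_synsets.items():
        translations = sorted(
            {t for lemma in lemmas for t in bilingual.get(lemma, [])}
        )
        if translations:
            target[synset_id] = translations
        else:
            untranslated.append(synset_id)
    return target, untranslated
```

Synsets that end up in the untranslated list are the places where the target language may have no lexicalized equivalent, which is where the bias of the expand approach shows.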

Aligning wordnets
[Figure: Dutch wordnet (muziekinstrument, orgel, hammond orgel, orgaan) set against English wordnet (object, natural object, artifact, instrument, musical instrument, organ, hammond organ); candidate links between orgel, orgaan and organ are marked with question marks.]

General criteria for approach:
- Maximize the overlap with wordnets for other languages
- Maximize semantic consistency within and across wordnets
- Maximally focus the manual effort where needed
- Maximally exploit automatic techniques

Top-down methodology
Develop a core wordnet (5,000 synsets):
- all the semantic building blocks or foundation to define the relations for all other more specific synsets, e.g. building -> house, church, school
- provide a formal and explicit semantics
Validate the core wordnet:
- does it include the most frequent words?
- are semantic constraints violated?
Extend the core wordnet (5,000 synsets or more):
- automatic techniques for more specific concepts with high-confidence results
- add other levels of hyponymy
- add specific domains
- add easy derivational words
- add easy translation equivalence
Validate the complete wordnet
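
The two validation questions for the core wordnet can be turned into simple checks. The sketch below is an assumption about what such checks might look like: the frequency list is any ranked word list for the language, and a loop in the hyperonym chain stands in for "semantic constraints", which in practice would also cover type consistency against the top ontology.

```python
def frequency_coverage(wordnet_lemmas, frequency_list, top_n):
    """Return the share of the top_n most frequent words present in the wordnet,
    together with the words that are still missing."""
    top = frequency_list[:top_n]
    missing = [w for w in top if w not in wordnet_lemmas]
    covered = (len(top) - len(missing)) / len(top) if top else 1.0
    return covered, missing


def hyperonym_cycles(hyperonym_of):
    """Return the synsets whose hyperonym chain runs into a cycle."""
    looping = []
    for start in hyperonym_of:
        seen = {start}
        node = hyperonym_of.get(start)
        while node is not None:
            if node in seen:
                looping.append(start)
                break
            seen.add(node)
            node = hyperonym_of.get(node)
    return looping
```

Both functions return the offending items rather than a bare pass or fail, so that manual effort can go straight to the words and synsets that need it.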

Developing a core wordnet
- Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:
  - high position in the hierarchy & high connectivity
  - represented as English WordNet synsets
- Common base concepts: shared by various wordnets in different languages
- Local base concepts: not shared
- EuroWordNet: 1024 synsets, shared by 2 or more languages
- BalkaNet: 5000 synsets (including the 1024)
- Common semantic framework for all Base Concepts, in the form of a Top-Ontology
- Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (was applied for 13 wordnets)
- Manually build and verify the hypernym relations for the Base Concepts
- All 13 wordnets are developed from a similar semantic core closely related to the English WordNet
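
The two selection criteria for Base Concepts, and the step from local to common base concepts, can be sketched as follows. The thresholds, the relation counts and the function names are assumptions for illustration; the projects' actual selection procedure is not specified on the slide.

```python
from collections import Counter


def depth(synset, hyperonym_of):
    """Number of hyperonym steps from `synset` up to its root."""
    d = 0
    while synset in hyperonym_of:
        synset = hyperonym_of[synset]
        d += 1
    return d


def base_concept_candidates(hyperonym_of, relation_count, max_depth, min_relations):
    """Select synsets that sit high in the hierarchy and have many relations."""
    synsets = set(hyperonym_of) | set(hyperonym_of.values())
    return sorted(
        s for s in synsets
        if depth(s, hyperonym_of) <= max_depth
        and relation_count.get(s, 0) >= min_relations
    )


def common_base_concepts(selections, min_languages=2):
    """Keep the concepts selected in at least `min_languages` wordnets.

    selections: dict mapping a language code to its locally selected concepts
    """
    counts = Counter(c for chosen in selections.values() for c in set(chosen))
    return sorted(c for c, n in counts.items() if n >= min_languages)
```

The default of two languages follows the "shared by 2 or more languages" criterion given for EuroWordNet.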

Top-down methodology
[Figure: the Top-Ontology (63 TCs) and the Inter-Lingual-Index with 1024 CBCs and the remaining WordNet1.5 synsets in the centre; on either side a wordnet built up from hyperonyms, CBC representatives, Local BCs, WMs related via non-hyponymy, first-level hyponyms and remaining hyponyms.]

Top-down methodology
[Figure: English Wordnet and Arabic Wordnet linked through WordNet synsets, e.g. v {teach} and v {darrasa}. Core wordnet of 5000 synsets: EuroWordNet and BalkaNet Base Concepts (CBC, ABC, SBC) with their hypernyms, anchored in the SUMO Ontology. Extension: next-level hyponyms, more hyponyms, easy translations, named entities and domains (WordNet Domains, e.g. domain chemistry). Supporting resources: English-Arabic lexicon (teach - darrasa), Arabic word frequency, Arabic roots & derivation rules.]

Advantages of the approach
- Well-defined semantics that can be inherited down to more specific concepts
- Apply consistency checks
- Automatic techniques can use the semantic basis
- Most frequent concepts and words are covered
- High overlap and compatibility with other wordnets
- Manual effort is focussed on the most difficult concepts and words

Distribution over the top ontology clusters

Wordnet Domains | Concepts | Proportion | Wordnet Domains | Concepts | Proportion
acoustics | | % | linguistics | | %
administration | | % | literature | | %
aeronautic | | % | mathematics | | %
agriculture | | % | mechanics | | %
alimentation | | % | medicine | | %
anatomy | | % | merchant_navy | | %
anthropology | | % | meteorology | | %
applied_science | | % | metrology | | %
archaeology | | % | military | | %
archery | 5 | 0.004% | money | | %
architecture | | % | mountaineering | | %
art | | % | music | | %
artisanship | | % | mythology | | %
astrology | | % | number | | %
astronautics | | % | numismatics | | %
astronomy | | % | occultism | | %
athletics | | % | oceanography | | %

EWN Interlingual Relations
- EQ_SYNONYM: there is a direct match between a synset and an ILI-record.
- EQ_NEAR_SYNONYM: a synset matches multiple ILI-records simultaneously.
- HAS_EQ_HYPERONYM: a synset is more specific than any available ILI-record.
- HAS_EQ_HYPONYM: a synset can only be linked to more specific ILI-records.
- Other relations: CAUSES/IS_CAUSED_BY, EQ_SUBEVENT/EQ_ROLE, EQ_IS_STATE_OF/EQ_BE_IN_STATE

Multilinguality

Complex equivalence relations
eq_near_synonym
1. Multiple targets
One sense for Dutch schoonmaken (to clean) simultaneously matches at least 4 senses of clean in WordNet1.5:
- {make clean by removing dirt, filth, or unwanted substances from}
- {remove unwanted substances from, such as feathers or pits, as of chickens or fruit}
- {remove in making clean; "Clean the spots off the rug"}
- {remove unwanted substances from - (as in chemistry)}
The Dutch synset schoonmaken will thus be linked with an eq_near_synonym relation to all these senses of clean.
2. Multiple source meanings
Synsets inter-linked by a near_synonym relation can be linked to the same target ILI-record(s), either with an eq_synonym or an eq_near_synonym relation:
- Dutch wordnet: toestel near_synonym apparaat
- ILI-records: {machine}; {device}; {apparatus}; {tool}

Complex equivalence relations
has_eq_hyperonym
Typically used for gaps in WordNet1.5 or in English:
- genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin;
- pragmatic gaps, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g. Dutch hoofd only refers to a human head and Dutch kop only refers to an animal head, while English uses head for both.
has_eq_hyponym
Used when WordNet1.5 only provides narrower terms. In this case there can only be a pragmatic difference, not a genuine cultural gap, e.g. Spanish dedo can be used to refer to both finger and toe.

Overview of equivalence relations to the ILI
Relation | POS | Sources : Targets | Example
eq_synonym | same | 1 : 1 | auto : voiture car
eq_near_synonym | any | many : many | apparaat, machine, toestel : apparatus, machine, device
eq_hyperonym | same | many : 1 (usually) | citroenjenever : gin
eq_hyponym | same | (usually) 1 : many | dedo : toe, finger
eq_metonymy | same | many/1 : 1 | universiteit, universiteitsgebouw : university
eq_diathesis | same | many/1 : 1 | raken (cause), raken : hit
eq_generalization | same | many/1 : 1 | schoonmaken : clean
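
The 1 : 1 cardinality of eq_synonym in the overview can be checked mechanically. The sketch below assumes links are stored as (synset, relation, ILI-record) triples for a single wordnet; that storage format and the function name are hypothetical. Links it flags are candidates for recoding, for instance as eq_near_synonym, but that decision remains an editorial one.

```python
from collections import defaultdict


def eq_synonym_violations(links):
    """Find eq_synonym links that break the 1 : 1 pattern within one wordnet.

    links: iterable of (synset, relation, ili_record) triples
    Returns (synsets with several eq_synonym targets,
             ILI-records with several eq_synonym sources).
    """
    targets = defaultdict(set)
    sources = defaultdict(set)
    for synset, relation, ili in links:
        if relation == "eq_synonym":
            targets[synset].add(ili)
            sources[ili].add(synset)
    many_targets = sorted(s for s, t in targets.items() if len(t) > 1)
    many_sources = sorted(i for i, s in sources.items() if len(s) > 1)
    return many_targets, many_sources
```

Links with other relation types are ignored, since the overview allows them to be many : many or many : 1.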

Filling gaps in the ILI
Types of gaps:
1. Genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin
  - Non-productive
  - Non-compositional
2. Pragmatic gaps, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g. container, borrower, cajera (female cashier)
  - Productive
  - Compositional
3. Universality of gaps: concepts occurring in at least 2 languages

Productive and Predictable Lexicalizations exhaustively linked to the ILI
[Figure: language-specific synsets {doodslaan V} NL, {doodschoppen V} NL, {doodstampen V} NL, {tottrampeln V} DE, {totschlagen V} DE, {cajera N} ES, {casière} NL and {alevín N} ES, each linked by hypernym or in_state relations to combinations of the ILI-records beat, stamp, kick, kill, cashier, female, young and fish.]

Resources
- Monolingual dictionaries:
  - definitions
  - synonym relations
  - other relations
- Bilingual dictionaries: L-English, English-L
- Ontologies
- Thesauri
- Corpora:
  - monolingual
  - parallel