Wordnet, EuroWordNet, Global Wordnet


1 Wordnet, EuroWordNet, Global Wordnet
Piek Vossen

2 Overview Princeton WordNet (1980 - ongoing) EuroWordNet (1996 - 1999)
The database design The general building strategy Towards a universal index of meaning Global WordNet Association ( ongoing) Other wordnets BalkaNet ( ) IndoWordnet ( ongoing) Meaning ( )

3 WordNet1.5 Developed at Princeton by George Miller and his team as a model of the mental lexicon. Semantic network in which concepts are defined in terms of relations to other concepts. Structure: organized around the notion of synsets (sets of synonymous words) basic semantic relations between these synsets Initially no glosses Main revision after tagging the Brown corpus with word meanings: SemCor.

4 Structure of WordNet1.5

5 EuroWordNet The development of a multilingual database with wordnets for several European languages. Funded by the European Commission, DG XIII, Luxembourg, as projects LE2-4003 and LE4-8328. March 1996 - September 1999. 2.5 million euro. URL:

6 Objectives of EuroWordNet
Languages covered: EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian. EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian. Size of vocabulary: EuroWordNet-1: 30,000 concepts - 50,000 word meanings. EuroWordNet-2: 15,000 concepts - 25,000 word meanings. Type of vocabulary: the most frequent words of the languages; all concepts needed to relate more specific concepts.

7 Consortium

8 The basic principles of EuroWordNet
the structure of the Princeton WordNet the design of the EuroWordNet database wordnets as language-specific structures the language-internal relations the multilingual relations

9 Specific features of EuroWordNet
It contains semantic lexicons for languages other than English. Each wordnet reflects the relations as a language-internal system, maintaining cultural and linguistic differences in the wordnets. It contains multilingual relations from each wordnet to English meanings, which makes it possible to compare the wordnets, tracking down inconsistencies and cross-linguistic differences. Each wordnet is linked to a language-independent top-ontology and to domain labels.

10 Autonomous & Language-Specific
[Figure: two hierarchies side by side.]
WordNet1.5 nodes: object; natural object (an object occurring naturally); artifact, artefact (a man-made object); instrumentality; block; body; container; device; implement; tool; instrument; bag; spoon; box.
Dutch Wordnet nodes: voorwerp {object}; blok {block}; lichaam {body}; werktuig {tool}; bak {box}; lepel {spoon}; tas {bag}.

11 Differences in structure
Artificial Classes versus Lexicalized Classes: instrumentality; natural object Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch What is the purpose of different hierarchies? Should we include all lexicalized classes from all (8) languages?

12 Linguistic versus Conceptual Ontologies
Conceptual ontology: A particular level or structuring may be required to achieve a better control or performance, or a more compact and coherent structure. introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool), neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise). What properties can we infer for spoons? spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking

13 Linguistic versus Conceptual Ontologies
Linguistic ontology: Exactly reflects the relations between all the lexicalized words and expressions in a language. It therefore captures valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language. What words can be used to name spoons? spoon -> object, tableware, silverware, merchandise, cutlery

14 Separate Wordnets and Ontologies
[Figure: Language-Specific Wordnets versus a Language-Neutral Ontology.]
Dutch Wordnet nodes: voorwerp; doos. WordNet1.5 nodes: object; container; box.
Reference ontology classes for BOX: ContainerProduct; SolidTangibleThing.
EuroWordNet Top-Ontology features for BOX: Form: Cubic; Function: Contain; Origin: Artifact; Composition: Whole.

15 Wordnets versus ontologies
Wordnets: autonomous language-specific lexicalization patterns in a relational network. Usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense-disambiguation. Ontologies: data structures with formally defined concepts. Usage: making semantic inferences.

16 Wordnets as Linguistic Ontologies
Classical Substitution Principle: Any word that is used to refer to something can be replaced by its synonyms, hyperonyms and hyponyms: horse -> stallion, mare, pony, mammal, animal, being. It cannot be referred to by co-hyponyms and co-hyponyms of its hyperonyms: horse X cat, dog, camel, fish, plant, person, object. Conceptual Distance Measurement: The number of hierarchical nodes between words is a measurement of closeness, where the level and the local density of nodes are additional factors.
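The two ideas on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HYPERONYM` table is toy data built from the slide's examples, synonyms are left out, and `path_length` counts hyperonym links rather than applying the level and density factors the slide mentions.

```python
# Toy hyperonym links (word -> its hyperonym); illustrative data, not WordNet content.
HYPERONYM = {
    "stallion": "horse", "mare": "horse", "pony": "horse",
    "horse": "mammal", "cat": "mammal", "dog": "mammal",
    "mammal": "animal", "fish": "animal", "animal": "being",
}

def ancestors(word):
    """Return the hyperonyms of `word`, nearest first."""
    chain = []
    while word in HYPERONYM:
        word = HYPERONYM[word]
        chain.append(word)
    return chain

def can_substitute(word, other):
    """True if `other` is a hyperonym or a hyponym of `word`."""
    return other in ancestors(word) or word in ancestors(other)

def path_length(a, b):
    """Number of hyperonym links on the path between a and b, or None."""
    up_a = [a] + ancestors(a)
    up_b = [b] + ancestors(b)
    for steps, node in enumerate(up_a):
        if node in up_b:
            return steps + up_b.index(node)
    return None
```

With this data, `can_substitute("horse", "animal")` holds while `can_substitute("horse", "cat")` does not, because cat is a co-hyponym.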

17 Linguistic Principles for deriving relations
1. Substitution tests (Cruse 1986):
1 a. It is a fiddle therefore it is a violin.
1 b. It is a violin therefore it is a fiddle.
2 a. It is a dog therefore it is an animal.
2 b. *It is an animal therefore it is a dog.
3 a. to kill (/a murder) causes to die (/death); to kill (/a murder) has to die (/death) as a consequence
3 b. *to die / death causes to kill; *to die / death has to kill as a consequence

18 Linguistic Principles for deriving relations
2. Principle of Economy (Dik 1978): If a word W1 (animal) is the hyperonym of W2 (mammal) and W2 is the hyperonym of W3 (dog) then W3 (dog) should not be linked to W1 (animal) but to W2 (mammal). 3. Principle of Compatibility If a word W1 is related to W2 via relation R1, W1 and W2 cannot be related via relation Rn, where Rn is defined as a distinct relation from R1.
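Read as a graph constraint, the Principle of Economy says that a hyperonym link is redundant when the same hyperonym is already reachable through another, more specific hyperonym. The sketch below is one possible check, assuming an acyclic hierarchy stored as a dictionary from each word to its set of direct hyperonyms. The function name and data layout are hypothetical, not part of EuroWordNet.

```python
def redundant_links(hyperonyms):
    """Return (word, hyperonym) links that are implied by another hyperonym link."""
    def reachable(start):
        # All hyperonyms reachable from `start` by following links upward.
        seen, stack = set(), list(hyperonyms.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(hyperonyms.get(node, ()))
        return seen

    redundant = set()
    for word, parents in hyperonyms.items():
        for parent in parents:
            others = [p for p in parents if p != parent]
            if any(parent in reachable(other) for other in others):
                redundant.add((word, parent))
    return redundant

# The slide's example: dog should link to mammal only, not directly to animal.
links = {"dog": {"mammal", "animal"}, "mammal": {"animal"}, "animal": set()}
```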

19 Architecture of the EuroWordNet Data Base
[Figure: architecture diagram. Four language-specific Lexical Items Tables (English: drive, ride, move, go; Dutch: rijden, berijden, bewegen, gaan; Spanish: conducir, cabalgar, jinetear, mover, transitar; Italian: guidare, cavalcare, andare, muoversi) are connected to the ILI-record {drive} in the Inter-Lingual-Index. The index is connected to Domains (Traffic: Air, Road) and to the Ontology (2OrderEntity: Location, Dynamic).]
Link types: I = Language Independent link; II = Link from Language Specific to Inter-Lingual-Index; III = Language Dependent Link.

20 The mono-lingual design of EuroWordNet

21 Language Internal Relations
WN 1.5 starting point. The ‘synset’ as a weak notion of synonymy: “two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value.” (Miller et al. 1993)
Relations between synsets:
Relation | POS-combination | Example
ANTONYMY | adjective-to-adjective, verb-to-verb | open/close
HYPONYMY | noun-to-noun | car/vehicle
HYPONYMY | verb-to-verb | walk/move
MERONYMY | noun-to-noun | head/nose
ENTAILMENT | verb-to-verb | buy/pay
CAUSE | verb-to-verb | kill/die

22 Differences EuroWordNet/WordNet1.5
Added Features to relations Cross-Part-Of-Speech relations New relations to differentiate shallow hierarchies New interpretations of relations

23 EWN Relationship Labels
Disjunction/Conjunction of multiple relations of the same type WordNet1.5 door1 -- (a swinging or sliding barrier that will close the entrance to a room or building; "he knocked on the door"; "he slammed the door as he left") PART OF: doorway, door, entree, entry, portal, room access door 6 -- (a swinging or sliding barrier that will close off access into a car; "she forgot to lock the doors of her car") PART OF: car, auto, automobile, machine, motorcar.

24 EWN Relationship Labels
{airplane} HAS_MERO_PART: conj1 {door}
{airplane} HAS_MERO_PART: conj2 disj1 {jet engine}
{airplane} HAS_MERO_PART: conj2 disj2 {propeller}
{door} HAS_HOLO_PART: disj1 {car}
{door} HAS_HOLO_PART: disj2 {room}
{door} HAS_HOLO_PART: disj3 {entrance}
{dog} HAS_HYPERONYM: conj1 {mammal}
{dog} HAS_HYPERONYM: conj2 {pet}
{albino} HAS_HYPERONYM: disj1 {plant}
{albino} HAS_HYPERONYM: disj2 {animal}
Default Interpretation: non-exclusive disjunction

25 EWN Relationship Labels
Disjunction/Conjunction of multiple relations of the same type
{dog} HAS_HYPONYM: disj1 {poodle}
{dog} HAS_HYPONYM: disj1 {labrador}
{dog} HAS_HYPONYM: {sheep dog} (Orthogonal)
{dog} HAS_HYPONYM: {watch dog} (Orthogonal)
Default Interpretation: non-exclusive disjunction

26 EWN Relationship Labels
Factive/Non-factive CAUSES (Lyons 1977) factive (default interpretation): “to kill causes to die”: {kill} CAUSES {die} non-factive: E1 probably or likely causes event E2 or E1 is intended to cause some event E2: “to search may cause to find”. {search} CAUSES {find} non-factive

27 EWN Relationship Labels
Reversed: in the database every relation must have a reverse counterpart, but there is a difference between relations which are explicitly coded as reverse and automatically reversed relations:
{finger} HAS_HOLONYM {hand}
{hand} HAS_MERONYM {finger}
{paper-clip} HAS_MER_MADE_OF {metal}
{metal} HAS_HOL_MADE_OF {paper-clip} reversed
Negation:
{monkey} HAS_MERO_PART {tail}
{ape} HAS_MERO_PART {tail} not
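A minimal sketch of how the reversed label could be maintained, assuming relations are stored as simple records. The `Relation` class, the `REVERSE` table and `add_reverses` are hypothetical names for illustration and do not describe the actual EuroWordNet database.

```python
from dataclasses import dataclass

# Relation name -> name of its reverse counterpart (pairs taken from the slides).
REVERSE = {
    "HAS_HOLONYM": "HAS_MERONYM", "HAS_MERONYM": "HAS_HOLONYM",
    "HAS_HYPERONYM": "HAS_HYPONYM", "HAS_HYPONYM": "HAS_HYPERONYM",
    "CAUSES": "IS_CAUSED_BY", "IS_CAUSED_BY": "CAUSES",
}

@dataclass(frozen=True)
class Relation:
    source: str
    name: str
    target: str
    is_reversed: bool = False  # True when the link was generated automatically

def add_reverses(relations):
    """Add a counterpart, labelled as reversed, for every link that lacks one."""
    explicit = {(r.source, r.name, r.target) for r in relations}
    result = list(relations)
    for r in relations:
        back = (r.target, REVERSE[r.name], r.source)
        if back not in explicit:
            result.append(Relation(*back, is_reversed=True))
    return result
```

Links that an editor coded in both directions, such as finger/hand, stay unlabelled. Only the automatically generated counterparts carry the label.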

28 Cross-Part-Of-Speech relations
WordNet1.5: nouns and verbs are not interrelated by basic semantic relations such as hyponymy and synonymy: adornment 2 change of state-- (the act of changing something) adorn 1 change, alter-- (cause to change; make different) EuroWordNet: words of different parts of speech can be inter-linked with explicit xpos-synonymy, xpos-antonymy and xpos-hyponymy relations: {adorn V} XPOS_NEAR_SYNONYM {adornment N}

29 Cross-Part-Of-Speech relations
The advantages of such explicit cross-part-of-speech relations are: similar words with different parts of speech are grouped together; the same information can be coded in an NP or in a sentence. By unifying higher-order nouns and verbs in the same ontology it will be possible to match expressions with very different syntactic structures but comparable content. By merging verbs and abstract nouns we can more easily link mismatches across languages that involve a part-of-speech shift: Dutch nouns such as “afsluiting” and “gehuil” are translated with the English verbs “close” and “cry”, respectively.

30 Entailment in WordNet WordNet1.5: Entailment indicates the direction of the implication or entailment: a. + Temporal Inclusion (the two situations partially or totally overlap) a.1 co-extensiveness (e. g., to limp/to walk) hyponymy/troponymy a.2 proper inclusion (e.g., to snore/to sleep) entailment b. - Temporal Exclusion (the two situations are temporally disjoint) b.1 backward presupposition (e.g., to succeed/to try) entailment b.2 cause (e.g., to give/to have)

31 Subevents in EuroWordNet
Direction of the entailment is expressed by the labels factive and reversed: {to succeed} is_caused_by {to try} factive {to try} causes {to succeed} non-factive Proper inclusion is described by the has_subevent/ is_subevent_of relation in combination with the label reversed: {to snore} is_subevent_of {to sleep} {to sleep} has_subevent {to snore} reversed {to buy} has_subevent {to pay} {to pay} is_subevent_of {to buy} reversed

32 The interpretation of the CAUSE relation
WordNet1.5: The causal relation only holds between verbs and it should only apply to temporally disjoint situations. EuroWordNet: the causal relation will also be applied across different parts of speech: {to kill} v causes {death} n; {death} n is_caused_by {to kill} v reversed; {to kill} v causes {dead} a; {dead} a is_caused_by {to kill} v reversed; {murder} n causes {death} n; {death} n is_caused_by {murder} n reversed

33 The interpretation of the CAUSE relation
Various temporal relationships between the (dynamic/non-dynamic) situations may hold: Temporally disjoint: there is no time point when dS1 takes place and also S2 (which is caused by dS1) (e.g. to shoot/to hit); Temporally overlapping: there is at least one time point when both dS1 and S2 take place, and there is at least one time point when dS1 takes place and S2 (which is caused by dS1) does not yet take place (e.g. to teach/to learn); Temporally co-extensive: whenever dS1 takes place also S2 (which is caused by dS1) takes place and there is no time point when dS1 takes place and S2 does not take place, and vice versa (e.g. to feed/to eat).

34 Role relations In the case of many verbs and nouns the most salient relation is not the hyperonym but the relation between the event and the involved participants. These relations are expressed as follows: {hammer} ROLE_INSTRUMENT {to hammer} {to hammer} INVOLVED_INSTRUMENT {hammer} reversed {school} ROLE_LOCATION {to teach} {to teach} INVOLVED_LOCATION {school} reversed These relations are typically used when other relations, mainly hyponymy, do not clarify the position of the concept network, but the word is still closely related to another word.

35 Co_Role relations
guitar player: HAS_HYPERONYM player; CO_AGENT_INSTRUMENT guitar
player: HAS_HYPERONYM person; ROLE_AGENT to play music; CO_AGENT_INSTRUMENT musical instrument
to play music: HAS_HYPERONYM to make; ROLE_INSTRUMENT musical instrument
guitar: HAS_HYPERONYM musical instrument; CO_INSTRUMENT_AGENT guitar player
ice saw: HAS_HYPERONYM saw; CO_INSTRUMENT_PATIENT ice
saw: HAS_HYPERONYM saw; ROLE_INSTRUMENT to saw
ice: CO_PATIENT_INSTRUMENT ice saw REVERSED

36 Co_Role relations Examples of the other relations are:
criminal CO_AGENT_PATIENT victim; novel writer/poet CO_AGENT_RESULT novel/poem; dough CO_PATIENT_RESULT pastry/bread; photographic camera CO_INSTRUMENT_RESULT photo

37 BE_IN_STATE and STATE_OF
Example: the poor are the ones to whom the state poor applies. Effect: poor N HAS_HYPERONYM person N; poor N BE_IN_STATE poor A; poor A STATE_OF poor N reversed. IN_MANNER and MANNER_OF Example: to slurp is to eat in a noisy manner. Effect: slurp V HAS_HYPERONYM eat V; slurp V IN_MANNER noisily Adverb; noisily Adverb MANNER_OF slurp V reversed

38 Overview of the Language Internal relations in EuroWordnet
Same Part of Speech relations:
NEAR_SYNONYMY: apparatus - machine
HYPERONYMY/HYPONYMY: car - vehicle
ANTONYMY: open - close
HOLONYMY/MERONYMY: head - nose
Cross-Part-of-Speech relations:
XPOS_NEAR_SYNONYMY: dead - death; to adorn - adornment
XPOS_HYPERONYMY/HYPONYMY: to love - emotion
XPOS_ANTONYMY: to live - dead
CAUSE: die - death
SUBEVENT: buy - pay; sleep - snore
ROLE/INVOLVED: write - pencil; hammer - hammer
STATE: the poor - poor
MANNER: to slurp - noisily
BELONG_TO_CLASS: Rome - city

39 Thematic networks
[Figure: thematic network. Nodes: organisme (organism); wezen (being); persoon (person); arts (doctor); zieke (sick person, patient); orgaan (organ); maag (stomach); ziekte (disease); maagaandoening (stomach disease); genezen (to get well); behandelen (treat); opereren (operate); scalpel. Edge labels: Causes; Patient; Agent; Instrument; Part of; Involves.]

40 The multi-lingual design of EuroWordNet

41 The Multilingual Design
Inter-Lingual-Index: unstructured fund of concepts to provide an efficient mapping across the languages; Index-records are mainly based on WordNet1.5 synsets and consist of synonyms, glosses and source references; Various types of complex equivalence relations are distinguished; Equivalence relations from synsets to index records: not on a word-to-word basis; Indirect matching of synsets linked to the same index items;

42 EWN Interlingual Relations
EQ_SYNONYM: there is a direct match between a synset and an ILI-record. EQ_NEAR_SYNONYM: a synset matches multiple ILI-records simultaneously. HAS_EQ_HYPERONYM: a synset is more specific than any available ILI-record. HAS_EQ_HYPONYM: a synset can only be linked to more specific ILI-records. Other relations: CAUSES/IS_CAUSED_BY, EQ_SUBEVENT/EQ_ROLE, EQ_IS_STATE_OF/EQ_BE_IN_STATE
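Because synsets are linked to index records rather than to each other, cross-language equivalents can be found by looking for shared ILI-records. The sketch below illustrates that indirect matching with toy data. The record identifiers, the relation types attached to the drive examples and the `equivalents` helper are assumptions made for illustration.

```python
from collections import defaultdict

# (language, synset, equivalence relation, ILI-record); toy data, identifiers are made up.
LINKS = [
    ("en", "drive", "EQ_SYNONYM", "ili:drive"),
    ("nl", "rijden", "EQ_SYNONYM", "ili:drive"),
    ("es", "conducir", "EQ_SYNONYM", "ili:drive"),
    ("it", "guidare", "EQ_SYNONYM", "ili:drive"),
    ("es", "dedo", "HAS_EQ_HYPONYM", "ili:finger"),
    ("es", "dedo", "HAS_EQ_HYPONYM", "ili:toe"),
    ("en", "finger", "EQ_SYNONYM", "ili:finger"),
    ("en", "toe", "EQ_SYNONYM", "ili:toe"),
]

def equivalents(language, synset, target_language, links=LINKS):
    """Synsets of target_language that share an ILI-record with (language, synset)."""
    by_record = defaultdict(set)
    for lang, syn, _relation, record in links:
        by_record[record].add((lang, syn))
    records = {rec for lang, syn, _relation, rec in links if (lang, syn) == (language, synset)}
    return sorted({
        syn
        for rec in records
        for lang, syn in by_record[rec]
        if lang == target_language
    })
```

Here Dutch rijden reaches Italian guidare through the shared record, and Spanish dedo reaches both English finger and toe through its two HAS_EQ_HYPONYM links.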

43 Equivalent Near Synonym
1. Multiple Targets. One sense for Dutch schoonmaken (to clean) simultaneously matches at least 4 senses of clean in WordNet1.5: {make clean by removing dirt, filth, or unwanted substances from}; {remove unwanted substances from, such as feathers or pits, as of chickens or fruit}; {remove in making clean; "Clean the spots off the rug"}; {remove unwanted substances from - (as in chemistry)}. The Dutch synset schoonmaken will thus be linked with an eq_near_synonym relation to all these senses of clean.

44 Equivalent Near Synonym
2. Multiple Source meanings Synsets inter-linked by a near_synonym relation can be linked to same target ILI-record(s), either with an eq_synonym or an eq_near_synonym relation: Dutch wordnet: toestel near_synonym apparaat ILI-records: {machine}; {device}; {apparatus}; {tool}

45 Equivalent Hyponymy
has_eq_hyperonym: Typically used for gaps in WordNet1.5 or in English. Genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin. Pragmatic gaps, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g. Dutch hoofd only refers to human head and Dutch kop only refers to animal head, whereas English uses head for both. has_eq_hyponym: Used when WordNet1.5 only provides narrower terms. In this case there can only be a pragmatic difference, not a genuine cultural gap, e.g. Spanish dedo can be used to refer to both finger and toe.

46 Complex mappings across languages
[Figure: complex mappings between four wordnets and the index. GB-Net: toe; finger; head. IT-Net: dito. NL-Net: hoofd; kop. ES-Net: dedo. Index records: {toe: part of foot}; {finger: part of hand}; {dedo, dito: finger or toe}; {head: part of body}; {hoofd: human head}; {kop: animal head}. Link legend: normal equivalence; eq_has_hyponym; eq_has_hyperonym.]

47 The methodologies for building wordnets

48 Overall Building Process
[Flowchart: overall building process in phases Ia, Ib, Ic, II and III. Box labels:] Machine Readable Dictionaries; Wordnets, Taxonomies, Corpora; Loaded in local databases; Specification of selection criteria; Subset of word meanings; Encoding of language internal and equivalence relations; Wordnet fragment with links to WordNet1.5 in local database; Improve and extend the wordnet fragments; Adjust coverage, improve encoding; Load wordnet in the EuroWordNet Database; Wordnet fragment in EuroWordNet database; Comparing and restructuring the wordnet; Verification by users; Demonstration in Information Retrieval; Verification Report.

49 Main Methods
Expand approach: translate WordNet1.5 synsets to another language and take over the structure: easier and more efficient method; compatible structure with WordNet1.5; structure is close to WordNet1.5 but also biased by it.
Merge approach: create an independent wordnet in another language and align the separate hierarchies by generating the appropriate translations: more complex and labour intensive; different structure from WordNet1.5; language-specific patterns can be maintained.

50 Methods for extracting language-internal relations
editors and database for manually encoding relations; comparison with WordNet1.5 structure; definition patterns in monolingual dictionaries; co-occurrences in corpora; morphology; bilingual dictionaries; lexical semantic substitution tests

51 Methods for extracting equivalence relations
Extract monosemous translations of English synsets, e.g. a Spanish word has only 1 translation to an English word which has only one sense and vice versa; disambiguation of multiple ambivalent translations by measuring the conceptual distance between the senses of these translations in the WordNet1.5 hierarchy (Rigau and Aguirre, 95); disambiguation of ambivalent translations by measuring the conceptual distance directly in the WordNet1.5 hierarchy between alternative translations and the translations of the direct semantic context in the source wordnet; disambiguation of ambivalent translations by measuring the overlap in top-concepts inherited in the source wordnet and inherited for the different senses of translations in WordNet1.5.
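The distance-based strategies can be illustrated with a small sketch that picks, among the candidate senses of an ambiguous translation, the sense closest to the translated context. It uses a plain link count as the distance, which is a simplification and not the measure from the cited work. The `PARENT` hierarchy, the sense labels and the function names are hypothetical.

```python
# Toy hyperonym hierarchy (sense -> parent); illustrative, not actual WordNet1.5 data.
PARENT = {
    "organ#music": "musical instrument",
    "musical instrument": "instrument",
    "instrument": "artifact",
    "artifact": "object",
    "organ#body": "body part",
    "body part": "natural object",
    "natural object": "object",
}

def chain(sense):
    """The sense followed by all of its hyperonyms up to the top."""
    out = [sense]
    while out[-1] in PARENT:
        out.append(PARENT[out[-1]])
    return out

def distance(a, b):
    """Number of links between a and b through their lowest shared hyperonym."""
    chain_a, chain_b = chain(a), chain(b)
    common = [node for node in chain_a if node in chain_b]
    if not common:
        return float("inf")
    return chain_a.index(common[0]) + chain_b.index(common[0])

def pick_sense(candidates, context):
    """Choose the candidate sense with the smallest total distance to the context."""
    return min(candidates, key=lambda s: sum(distance(s, c) for c in context))
```

For a Dutch word such as orgel, whose hyperonym translates to musical instrument, `pick_sense(["organ#body", "organ#music"], ["musical instrument"])` selects the musical sense.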

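52 Aligning wordnets
[Figure: the Dutch chain hammond orgel, orgel, muziekinstrument is aligned with WordNet1.5, where the placement of organ (marked ?) is open between the chain hammond organ, musical instrument, instrument, artifact, object and the chain natural object, object.]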

53 Inheriting Semantic Features
hart 1; orgaan 1 (Living Part); deel 2 (Part); iets 1; LEAF
heart 1; playing card 1; card 1 (Artifact Function Object); paper 6 (Artifact Solid); material 5 (Substance); matter 1; inanimate object 1; entity 1; LEAF
heart 2; disposition 2 (Dynamic Experience Mental); nature 1; trait 1 (Property); attribute 1 (Property); abstraction 1; LEAF
heart 3; bravery 1; spirit 1; character 1; trait 1 (Property); attribute 1 (Property); abstraction 1; LEAF
heart 4; internal organ 1; organ 4 (Living Part); body part 1 (Living Part); part 10; entity 1; LEAF

54 Reliability of Equivalence Relations

55 Reliability of Equivalence Relations

56 Conflicting Starting points
1. There should be a maximum of flexibility: the wordnets should be able to reflect language-specific relations and patterns; the wordnets should be built relatively independently because each site has different starting points: different tools, databases and resources (Machine Readable Dictionaries); differences in the languages. 2. The wordnets have to be compatible in terms of coverage and relations to be useful for multilingual information retrieval and translation tools and to be able to compare the wordnets.

57 Measures to achieve maximal compatibility
The results are loaded into a common Multilingual Database (Polaris): consistency checks and types of incompatibility; specific comparison options to measure consistency and overlap in coverage. User-guides for building wordnets in each language: the steps to encode the relations for a word meaning; common tests and criteria for all the relations; overview of problems and solutions. A set of common Base-Concepts which are shared by all the sites, having: most relations and the most important positions in the wordnets; most meanings and badly defined. Classification of the common Base Concepts in terms of a Top-Ontology of 63 basic Semantic Distinctions. Top-Down Approach, where first the Base Concepts and their direct context are (manually) encoded and next the wordnets are (semi-automatically) extended top-down to include more specific concepts that depend on these Base Concepts.

58 Top-Ontology and Base Concepts
Top-Ontology with 63 higher-level concepts. Existing Ontologies: WordNet1.5 top-levels; Aktions-Art models (Vendler, Verkuyl); Acquilex and Sift ontologies (EC-projects); Qualia-structure (Pustejovsky); Upper-Model, MikroKosmos, Cyc; Ad Hoc ANSI-Committee on ontologies. The ontology was adapted to represent the variety of concepts in the set of Common Base Concepts, across the 4 languages: homogeneous Base-Concept Clusters; average size of Base Concept Cluster; apply to both nouns and verbs. Set of 1024 common Base Concepts making up the core of the separate wordnets.

59 Base Concepts
Procedure: Each site determined the set of word meanings with most relations (up to 15% of all relations) and high positions in the hierarchy. This set was extended with all meanings used to define the first selection. The local selection was translated to WordNet1.5 equivalences: 4 lists of WordNet1.5 synsets (between 450 and 2000 synsets per selection). These sets of WordNet1.5 translations have been compared. Concepts selected by all sites: 30 synsets (24 noun synsets, 6 verb synsets). Explanations: The individual selections are not representative enough. There are major differences in the way meanings are classified, which have an effect on the frequency of the relations. The translations of the selection to WordNet1.5 synsets are not reliable. The resources cover very different vocabularies.

60 Concepts selected by at least two sites: intersections of pairs
[Table: pairwise intersections of the NL, ES, IT and GB/WN selections, for nouns and verbs, with totals.]
[Table: set of shared Base Concepts (union of intersection pairs): nouns, verbs and total for 1stOrderEntities, 2ndOrderEntities and 3rdOrderEntities, with totals.]
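The two selection criteria, concepts chosen by all sites and concepts chosen by at least two sites, reduce to set operations over the per-site lists of WordNet1.5 synsets. A small sketch, assuming each site's selection is a Python set. The function names and the sample synsets are hypothetical.

```python
from itertools import combinations

def shared_by_all(selections):
    """Synsets that occur in every site's selection."""
    return set.intersection(*selections.values())

def shared_by_at_least_two(selections):
    """Union of the pairwise intersections of the site selections."""
    shared = set()
    for first, second in combinations(selections.values(), 2):
        shared |= first & second
    return shared

# Toy selections; the synset labels are placeholders, not the project's data.
selections = {
    "NL": {"object#1", "cause#2", "place#3"},
    "ES": {"object#1", "cause#2", "animal#1"},
    "IT": {"object#1", "animal#1", "plan#1"},
    "GB/WN": {"object#1", "place#3"},
}
```

With this toy input only object#1 is shared by all sites, while the union of intersection pairs also keeps cause#2, place#3 and animal#1. This mirrors how the small common core on the slides grows once agreement between two sites is enough.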

61 Table 4: Number of Common BCs represented in the local wordnets
Columns: Related to CBCs; Eq_synonym relations; Eq_near_synonym relations; CBCs without direct equivalent. Rows: AMS; FUE; PSA.
Table 5: BC4 Gaps in at least two wordnets (10 synsets)
body covering#1 mental object#1; cognitive content#1; content#2 body substance#1 natural object#1 social control#1 place of business#1; business establishment#1 change of magnitude#1 plant organ#1 contractile organ#1 plant part#1 psychological feature#1 spatial property#1; spatiality#1

62 Table 6: Local senses with complex equivalence relations to CBCs
NL ES IT Eq_has_hyperonym eq_has_hyponym Eq_has_holonym 2 0 Eq_has_meronym 3 2 Eq_involved 3 Eq_is_caused_by 3 Eq_is_state_of 1 Example of complex relation CBC: cause to feel unwell#1, Verb Closest Dutch concept: {onwel#1}, Adjective (sick) Equivalence relation: eq_is_caused_by

63 Adaptation of Base Concepts in EuroWordNet-2
A similar selection of fundamental concepts has been made in EuroWordNet-2. The selected concepts have been compared among German, French, Czech and Estonian and with the EuroWordNet-1 selection. The EuroWordNet-1 set has been extended to 1310 Base Concepts. A distinction has been made between Hard and Soft Base Concepts. Hard: represented by only a single Index-record. Soft: represented by several close Index-records. The final set has been used as starting point in EuroWordNet-2.

64 Comparison of Base Concept Selections

65 Revised Set of Base Concepts

66 Starting points for the Top-Ontology
The ontology should support the building and encoding of semantic networks as linguistic ontologies: networks of lexicalized words and expressions in a language. The classification of the Base Concepts in terms of the Top Ontology should apply to all the involved languages. Enforce uniformity and compatibility of the different wordnets, by providing a common framework. Divide the Base Concepts (BCs) into coherent clusters to enable contrastive-analysis and discussion of closely related word meanings Customize the database by assigning features to the top-concepts, irrespective of language-specific structures. Provide an anchor point for connecting other ontologies to the Inter-Lingual-Index, such as CYC, MikroKosmos, the Upper-Model, by linking them to the corresponding ILI-records.

67 Principles for deciding on the distinctions
Starting point is that the wordnets are linguistic ontologies: Semantic classifications common in linguistic paradigms: Aktionsart models [Vendler 1967, Verkuyl 1972, Verkuyl 1989, Pustejovsky 1991], entity-orders [Lyons 1977], Aristotle’s Qualia-structure [Pustejovsky 1995]. Ontologies developed in previous EC-projects, which had a similar basis and are well-known in the project consortium: Acquilex (BRA 3030, 7315), Sift (LE-62030) [Vossen and Bon 1996]. The ontology should be capable of reflecting the diversity of the set of common BCs, across the 4 languages. In this sense the classification of the common BCs in terms of the top-concepts should result in: Homogeneous Base Concept Clusters: classifications in WordNet1.5 and the other wordnets. Average-sized Base Concept Clusters: not extremely large or small.

68 Other important characteristics:
The distinctions apply to nouns, verbs and adjectives, because these can be related in the language-specific wordnets via an xpos_synonymy relation, and the ILI-records can be related to any part-of-speech. The top-concepts are hierarchically ordered by means of a subsumption relation but there can only be one super-type linked to each top-concept: multiple inheritance between top-concepts is not allowed. In addition to the subsumption relation top-concepts can have an opposition-relation to indicate that certain distinctions are disjoint, whereas others may overlap. There may be multiple relations from ILI-records to top-concepts: the Base Concepts can be cross-classified in terms of multiple top-concepts (as long as these have no opposition-relation between them), i.e. multiple inheritance from Top-Concept to Base Concept is allowed. Result: the TCs function as cross-classifying features rather than conceptual classes. Meanings for bodyparts are not linked to a single class BodyPart but to two features: Living and Part.
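The constraints above can be sketched as a small consistency check: each top-concept has a single super-type, a Base Concept may carry several top-concepts, and a combination is rejected when it inherits from opposed concepts. The fragment of the hierarchy follows the slides. Treating the three entity orders as the only opposition group, and the names `SUPERTYPE`, `DISJOINT` and `is_consistent`, are assumptions for illustration.

```python
# Single super-type per top-concept (a dict cannot hold two parents for one key).
SUPERTYPE = {
    "Origin": "1stOrderEntity", "Natural": "Origin", "Living": "Natural",
    "Composition": "1stOrderEntity", "Part": "Composition",
    "SituationType": "2ndOrderEntity", "Dynamic": "SituationType",
}

# Groups of mutually opposed top-concepts; only the first-level division is modelled here.
DISJOINT = [{"1stOrderEntity", "2ndOrderEntity", "3rdOrderEntity"}]

def inherited(top_concept):
    """The top-concept together with all of its super-types."""
    out = [top_concept]
    while out[-1] in SUPERTYPE:
        out.append(SUPERTYPE[out[-1]])
    return out

def is_consistent(top_concepts):
    """True if the cross-classification inherits at most one member of each opposed group."""
    features = set()
    for top_concept in top_concepts:
        features.update(inherited(top_concept))
    return all(len(features & group) <= 1 for group in DISJOINT)
```

A body-part meaning classified as Living and Part passes the check, whereas combining Living with Dynamic would mix a 1stOrderEntity feature with a 2ndOrderEntity feature and fail.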

69 The EuroWordNet Top-Ontology: 63 concepts (excluding the top)
First Level [Lyons 1977]: 1stOrderEntity (491 BC synsets, all nouns): Any concrete entity (publicly) perceivable by the senses and located at any point in time, in a three-dimensional space. 2ndOrderEntity (500 BC synsets, 272 nouns and 228 verbs): Any Static Situation (property, relation) or Dynamic Situation, which cannot be grasped, heard, seen, felt as an independent physical thing. They can be located in time and occur or take place rather than exist; e.g. continue, occur, apply. 3rdOrderEntity (33 BC synsets, all nouns): An unobservable proposition that exists independently of time and space. They can be true or false rather than real. They can be asserted or denied, remembered or forgotten. E.g. idea, thought, information, theory, plan.

70 Test to distinguish 1st, 2nd and 3rd OrderEntities
Third-order entities cannot occur, have no temporal duration and therefore fail on both tests: a The same person was here again to-day b The same thing happened/occurred again to-day *? The idea, fact, expectation, etc.... was here/occurred/ took place A positive test for a 3rdOrderEntity is based on the properties that can be predicated: ok The idea, fact, expectation, etc.. is true, is denied, forgotten The first division of the ontology is disjoint: BCs cannot be classified as combinations of these TCs. This distinction cuts across the different parts of speech in that: 1stOrderEntities are always (concrete) nouns. 2ndOrderEntities can be nouns, verbs and adjectives, where adjectives are always non-dynamic (refer to states and situations not involving a change of state). 3rdOrderEntities are always (abstract) nouns.

71 Base Concepts classified as 3rdOrderEntities
theory; idea; structure; evidence; procedure; doctrine; policy; data point; content; plan of action; concept; plan; communication; knowledge base; cognitive content; know-how; category; information; abstract; info;

72 1stOrderEntity (1)
Origin (0): the way in which an entity has come about. Natural (21): Living (30): Plant (18), Human (106), Creature (2), Animal (123). Artifact (144).
Function (0): the typical activity or role that is associated with an entity. Vehicle (8); Occupation (23); Covering (8); Garment (3); Software (4); Furniture (6); Place (45); Container (12); Comestible (32); Instrument (18); Building (13); Representation (12): MoneyRepresentation (10), LanguageRepresentation (34), ImageRepresentation (9).
Form (0): amorphous or fixed shape. Substance (32): Solid (63), Liquid (13), Gas (1). Object (62).
Composition (0): group of self-contained wholes or as a part of such a whole. Part (86); Group (63).

73 Conjunctive classes of 1stOrderEntities
Frequent combinations
5 Comestible;Solid;Artifact
5 Container;Part;Solid;Living
5 Furniture;Object;Artifact
5 Instrument;Artifact
5 Living
5 Plant
6 Liquid
6 Object;Artifact
6 Part;Living
6 Place;Part;Solid
7 Building;Object;Artifact
7 Group
7 LanguageRepresentation
7 Vehicle;Object;Artifact
10 Instrument;Object;Artifact
12 Part
14 Place
14 Place;Part
15 Substance
19 LanguageRepresentation;Artifact
20 Occupation;Object;Human
22 Object;Animal;Function
38 Group;Human
42 Object;Human

74 Conjunctive classes of 1stOrderEntities
Low Frequent combinations fruit: Comestible (Function) life: Group (Composition) Object (Form) Living (Natural, Origin) Part (Composition) cell: Part (Composition) Plant (Natural, Origin) Living (Natural, Origin) skin: Covering (Covering) arms: Instrument (Function) Solid (Form) Group (Composition) Part (Composition) Object (Form) Living (Natural, Origin) Artifact (Origin)

75 1stOrderEntities classified as Function only
barrier 1; belonging 2; building material 1; causal agency 1; commodity 1; consumer goods 1; creation 3; curative 1; decoration 2; device 4; fastener 1; force 6; force 7; form 5; impediment 1; medicament 1; piece of work 1; possession 1; protection 4; remains 2; restraint 2; support 6; support 7; supporting structure 1; thing 3

76 2ndOrderEntity 0
SituationType 6 (the event-structure in terms of which a situation can be characterized as a conceptual unit over time; Disjoint features)
  Dynamic 134 (he sat down quickly; a quick meeting)
    BoundedEvent 183
    UnboundedEvent 48
  Static 28 (?he sits quickly)
    Property 61
    Relation 38
SituationComponent 0 (the most salient semantic component(s) that characterize(s) a situation; Conjuncted Features)
  Cause 67; Communication 50; Condition 62; Physical 140; Agentive 170; Existence 27; Experience 43; Possession 23; Phenomenal 17; Location 76; Manner 21; Purpose 137; Stimulating 25; Mental 90; Modal 10; Quantity 39; Social 102; Time 24; Usage 8

77 Conjunctive classes of 2ndOrderEntities
Static
5 Property;Physical;Condition
5 Property;Stimulating;Physical
5 Relation
5 Relation;Social
6 Static;Quantity
7 Property;Condition
8 Relation;Location
9 Property
10 Relation;Physical;Location: adjoin 1; aim 4; blank space 1; course 7; direction 8; distance 1; elbow room 1; path 3; spatial property 1; spatial relation 1

78 Conjunctive classes of 2ndOrderEntities
Dynamic
5 BoundedEvent;Cause;Physical
5 BoundedEvent;Cause;Physical;Location
5 BoundedEvent;Time
5 Dynamic
5 Dynamic;Location
5 Dynamic;Phenomenal
5 Dynamic;Phenomenal;Physical
6 BoundedEvent;Agentive
6 BoundedEvent;Location
6 BoundedEvent;Physical;Location
6 Dynamic;Agentive;Communication
6 Dynamic;Cause
8 BoundedEvent;Agentive;Mental;Purpose
8 BoundedEvent;Quantity;Time
9 BoundedEvent;Cause
9 Dynamic;Experience;Mental
experience 7; find 3; affect 5; arouse 5; excite 2; cognition 1; desire 2; disposition 2; disposition 4; disturbance 7; emotion 1; feeling 1; humor 3; pleasance 1; process 4; look 8; phenomenon 1; cause to appear 1; perception 2; sensation 1; feel 12; experience 8; trouble 3; reality 1

79 Top-Down Building Procedure
1) Construction of a core wordnet from the common set of Base Concepts
   - Find Representatives in the local language for the Common Base Concepts (1310 synsets)
   - Add local Base Concepts that are not selected as Common Base Concepts
   - Specify the hyperonyms of the local and common Base Concepts
2) Extend the Core Wordnets
   - Add the first level of hyponyms to the core wordnets
   - Add other hyponyms which have many sub-hyponyms
   - Add other types of relations: XPOS, roles, meronymy, subevents, causes
3) Verify the Selection
   - Corpus frequency: Parole lexicons and corpora
   - Top-Concept clustering
   - Intersection of ILI-records
   - Overlap in ILI-chains
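Step 2 of the procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hierarchy is a plain dictionary from a synset to its direct hyponyms, the function names are hypothetical, and the threshold `min_descendants` stands in for "many sub-hyponyms", which the slide does not quantify.

```python
def count_descendants(synset, hyponyms):
    """Count all synsets below `synset` in the hyponymy hierarchy."""
    seen = set()
    stack = list(hyponyms.get(synset, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(hyponyms.get(node, []))
    return len(seen)


def extend_core(base_concepts, hyponyms, min_descendants=2):
    """Extend a core wordnet top-down from its Base Concepts.

    Adds the first level of hyponyms unconditionally, then any deeper
    hyponym with at least `min_descendants` synsets below it
    (an assumed threshold, not a figure from the project).
    """
    core = set(base_concepts)
    first_level = {h for bc in base_concepts for h in hyponyms.get(bc, [])}
    core |= first_level

    seen = set(first_level)
    frontier = list(first_level)
    while frontier:
        node = frontier.pop()
        for child in hyponyms.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            if count_descendants(child, hyponyms) >= min_descendants:
                core.add(child)
            frontier.append(child)
    return core


# Toy hierarchy, invented for the example.
toy_hyponyms = {
    "animal": ["dog", "bird"],
    "dog": ["terrier", "poodle"],
    "terrier": ["fox terrier", "bull terrier"],
    "bird": ["sparrow"],
}
core = extend_core(["animal"], toy_hyponyms)
```

In the toy hierarchy, `terrier` is kept because it has two sub-hyponyms, while `poodle` and `sparrow` are left for a later building phase.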

80 Top-Down Building
Diagram elements: Top-Ontology (63 TCs); Inter-Lingual-Index (1310 CBCs, 149 new ILIs, remaining WordNet1.5 synsets); two wordnets, each showing hyperonyms, CBC representatives, local BCs, WMs related via non-hyponymy, first-level hyponyms and remaining hyponyms.

81 The current wordnets

82 Comparison of wordnets
In-depth comparison of major semantic fields
Comparison of the intersection of the associated ILI-records
Distribution of the associated ILI-records over the different top ontology clusters
Comparison of the hyponymy relations in the wordnets, projected on the associated ILI-records

83

84 Intersection of the associated ILI-records
                     Nouns                                                Verbs
                     frequency  % of ∪(WN,IT,NL,ES)  % of ∪(IT,NL,ES)     frequency  % of ∪(WN,IT,NL,ES)  % of ∪(IT,NL,ES)
Total                           62780                32520                           12215                7455
ES                   24596      39.2%                75.6%                4654       38.1%                62.4%
IT                   14272      22.7%                43.9%                4673       38.3%                62.7%
NL                   21259      33.9%                65.4%                6416       52.5%                86.1%
∩ (ES, IT)           10907      17.4%                33.5%                3272       26.8%
∩ (ES, NL)           14773      23.5%                45.4%                3870       31.7%                51.9%
∩ (IT, NL)           9862       15.7%                30.3%                3950       32.3%                53.0%
∩ (ES, IT, NL)       8183       13.0%                25.2%                3051       25.0%                40.9%
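The figures on this slide are set intersections expressed as a share of a union. A minimal Python sketch of that calculation follows, assuming each wordnet is reduced to the set of ILI-record identifiers it links to. The function name and the toy identifiers are hypothetical, not project data.

```python
def intersection_share(ili_sets, languages):
    """Return (size, percentage) of the ILI-records shared by `languages`.

    `ili_sets` maps a language code to the set of ILI-record ids that
    its wordnet links to. The percentage is taken over the union of
    all sets in `ili_sets`.
    """
    union = set().union(*ili_sets.values())
    common = set.intersection(*(ili_sets[lang] for lang in languages))
    return len(common), 100.0 * len(common) / len(union)


# Toy data: integers stand in for ILI-record ids.
toy_sets = {
    "ES": {1, 2, 3, 4},
    "IT": {2, 3, 5},
    "NL": {3, 4, 5, 6},
}
size_es_it, pct_es_it = intersection_share(toy_sets, ["ES", "IT"])
size_all, pct_all = intersection_share(toy_sets, ["ES", "IT", "NL"])
```

Passing a single language gives that wordnet's own coverage of the union, which corresponds to the ES, IT and NL rows of the table.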

85 Distribution over the top ontology clusters

86 Distribution over the top ontology clusters

87 Comparison of the hyponymy relations, projected on the associated ILI-records
To be able to compare hyponymy chains, each word sense in the chain has been replaced by the ILI-records that are linked to these synsets, which gives the following result:
veranderen (change) → bewegen (move intransitive) → bewegen (move reflexive) → voortbewegen (move location) → verplaatsen (move from A to B) → stijgen (move to a higher position) → opstijgen (take off)
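The projection step can be sketched as a lookup followed by a set comparison. In this illustrative Python sketch the function names, the `ili:` identifiers and the Spanish chain are all assumptions made for the example; only the Dutch word senses come from the slide.

```python
def project_chain(chain, ili_links):
    """Replace each word sense in a hyponymy chain by its linked ILI-records."""
    return [frozenset(ili_links.get(sense, ())) for sense in chain]


def shared_records(projected_a, projected_b):
    """ILI-records that occur somewhere in both projected chains."""
    records_a = set().union(*projected_a)
    records_b = set().union(*projected_b)
    return records_a & records_b


# Hypothetical links from word senses to ILI-record ids.
ili_links = {
    "veranderen": {"ili:change"},
    "bewegen": {"ili:move"},
    "opstijgen": {"ili:take-off"},
    "cambiar": {"ili:change"},
    "despegar": {"ili:take-off"},
}
dutch = project_chain(["veranderen", "bewegen", "opstijgen"], ili_links)
spanish = project_chain(["cambiar", "despegar"], ili_links)
overlap = shared_records(dutch, spanish)
```

Once both chains are expressed in ILI-records they can be compared directly, even though the two wordnets share no word forms.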

88 Coverage of complete noun chains projected over WN1.5 structure

89 Partial noun chains projected over WN1.5

90 Partial noun chains with 1 gap projected over WN1.5

91 Towards an efficient, condensed and universal index of sense-distinctions
Independently of the wordnet structures in each language, we can manipulate the mapping across languages via the ILI.
We can use the information of all the languages to correct incompleteness and inconsistencies of the individual resources.
Ultimately, we should try to find a minimal and sufficient set of concepts to provide an efficient mapping.

92 Characteristics of the Inter-Lingual-Index
The Inter-Lingual-Index (ILI) is an unstructured fund of concepts with the sole purpose of providing an efficient mapping of senses across languages.
Requirements:
1. Efficient level of granularity. Example: “He broke the glass”
   ILI: {break}; {break; cause to break}; {break; damage} inflict damage upon
   Wordnets: breken (Dutch), romper (Spanish), rompere (Italian)
2. Superset of concepts that occur across languages
   {cashier}: eq_hyperonym for cassière (Dutch) and cajera (Spanish)
   {female cashier}: eq_synonym for cassière (Dutch) and cajera (Spanish)

93 A Minimal and Efficient set of concepts
Globalizing the sense-differentiation:
   - create metonymic clusters
   - abstract from contextual specialization and grammatical perspectives
   - abstract from part-of-speech realization
   - abstract from productive and predictable meanings
Extending the Inter-Lingual-Index to become the superset of concepts occurring in two or more wordnets, only if:
   - concepts are unpredictable and unproductive
   - concepts cannot be linked exhaustively and uniquely to the ILI

94 Under-specified concepts Metonymic clusters
eq_metonym eq_metonym club metonym# club: organization metonym# club: building {vereniging}NL eq_synonym eq_synonym {club}EN {club; verenigingsgebouw}NL

95 Under-specified concepts Generalization and Diathesis clusters
eq_diathesis eq_diathesis break diathesis# break: inchoative diathesis# break: causative {breken; kapotgaan}NL {rompere}IT {breken; kapotmaken}NL eq_synonym eq_synonym {rompersi}IT

96 Under-specified for POS
eq_xpos_synonym eq_xpos_synonym depart xpos# departure xpos# depart {vertrekkenV}NL {departV}EN eq_synonym eq_synonym {departureN}EN {vertrekN}NL

97 Overview of equivalence relations to the ILI
Relation            POS     Sources : Targets    Example
eq_synonym          same    1 : 1                auto : voiture car
eq_near_synonym     any     many : many          apparaat, machine, toestel : apparatus, machine, device
eq_hyperonym        same    many : 1 (usually)   citroenjenever : gin
eq_hyponym          same    (usually) 1 : many   dedo : toe, finger
eq_metonymy         same    many/1 : 1           universiteit, universiteitsgebouw : university
eq_diathesis        same    many/1 : 1           raken (cause), raken : hit
eq_generalization   same    many/1 : 1           schoonmaken : clean
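As a rough illustration of how typed equivalence links allow lookup across languages via the ILI, here is a small Python sketch. The `EqLink` record and the `equivalents` function are hypothetical names, not part of the EuroWordNet database. The data reuses the cashier example from the slide on ILI characteristics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EqLink:
    """One equivalence relation from a word meaning to an ILI-record."""
    word: str
    language: str
    relation: str  # e.g. "eq_synonym", "eq_hyperonym"
    ili: str


def equivalents(links, word, language, relation="eq_synonym"):
    """Word meanings in other wordnets that share an ILI-record with `word`.

    Only links of the given relation type are followed on both sides.
    """
    targets = {
        link.ili
        for link in links
        if (link.word, link.language, link.relation) == (word, language, relation)
    }
    return sorted(
        (link.language, link.word)
        for link in links
        if link.ili in targets
        and link.relation == relation
        and (link.word, link.language) != (word, language)
    )


links = [
    EqLink("cassière", "NL", "eq_synonym", "female cashier"),
    EqLink("cajera", "ES", "eq_synonym", "female cashier"),
    EqLink("cassière", "NL", "eq_hyperonym", "cashier"),
    EqLink("cajera", "ES", "eq_hyperonym", "cashier"),
]
matches = equivalents(links, "cassière", "NL")
```

Keeping the relation type on each link lets a lookup separate exact matches (eq_synonym) from looser ones (eq_hyperonym) instead of treating every link to an ILI-record as a translation.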

98 Progress on restructuring the ILI
Clusters added manually and automatically based on:
   - structural properties of WN1.5
   - mapping to other sources: Levin’s classes, WN1.6
   - cross-lingual mapping
clusters words word senses synsets Nouns Verbs
New ILIs from other wordnets have not yet been added. We estimated that for verbs hardly any new ILIs are needed, for nouns about 30% of non-translated concepts (2,000 synsets based on Dutch).

99 Effects of ILI-clusters
Intersection of ILI-references for Dutch, Spanish, Italian and English
Nouns
   - 2895 clustered synsets (4.6% of WN1.5 noun synsets)
   - intersection increased from 7736 (23.8%) to 8183 (25.2%) out of the union of synsets
Verbs
   - 3839 clustered synsets (31.4% of WN1.5 verb synsets)
   - intersection increased from 1632 (21.9%) to 3051 (40.9%) out of the union of 7455 synsets
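Why clustering raises the intersection can be shown with a small Python sketch: records in the same cluster are mapped to one representative before the wordnets are intersected. The function name, the `break*` cluster label and the assignment of each language to a particular {break} record are assumptions made for the example.

```python
def intersection_size(wordnets, cluster_of=None):
    """Number of ILI-records shared by all wordnets.

    `wordnets` maps a language code to its set of ILI-record ids.
    `cluster_of` optionally maps a record to its cluster representative;
    records without an entry stand for themselves.
    """
    cluster_of = cluster_of or {}
    normalised = [
        {cluster_of.get(record, record) for record in records}
        for records in wordnets.values()
    ]
    return len(set.intersection(*normalised))


# Each wordnet picked a different, closely related {break} record.
wordnets = {
    "NL": {"break; cause to break", "cashier"},
    "ES": {"break; damage", "cashier"},
    "IT": {"break", "cashier"},
}
break_cluster = {
    "break": "break*",
    "break; cause to break": "break*",
    "break; damage": "break*",
}
before = intersection_size(wordnets)
after = intersection_size(wordnets, break_cluster)
```

Without the cluster the three wordnets only agree on `cashier`. With it they also agree on the clustered break concept, which mirrors the larger jump reported for verbs.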

100 Superset of all concepts.
Procedure:
   - Initially, the ILI will only contain WordNet1.5 synsets.
   - A site that cannot find a proper equivalent among the available ILI-concepts will link the meaning to another ILI-record using a so-called complex-equivalence relation and will generate a potential new ILI-record:
        Dutch Meaning: klunen
        Definition: to walk on skates
        Complex-equivalence: has_eq_hyperonym
        Target concept: walk
   - After a building-phase all potentially-new ILI-records are collected and verified for overlap by one site.
   - A proposal for updating the ILI is distributed to all sites and has to be verified.
   - The ILI is updated and all sites have to reconsider the equivalence relations for all meanings that can potentially be linked to the new ILI-records.

101 Filling gaps in the ILI Types of GAPS
   - genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin (non-productive, non-compositional)
   - pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: container, borrower, cajera (female cashier) (productive, compositional)
Universality of gaps: concepts occurring in at least 2 languages

102 Productive and Predictable Lexicalizations exhaustively linked to the ILI
beat eq_has_hyperonym eq_has_hyperonym {doodslaanV}NL {totschlagenV}DE eq_has_hyperonym kill eq_has_hyperonym {doodstampenV}NL {tottrampelnV}DE eq_has_hyperonym eq_has_hyperonym stamp eq_has_hyperonym {doodschoppenV}NL kick eq_has_hyperonym eq_has_hyperonym eq_has_hyperonym cashier {casière}NL {cajeraN}ES eq_in_state female eq_in_state eq_has_hyperonym fish {alevínN}ES young eq_in_state

103 WordNet gaps across languages

104 Towards an efficient, condensed and universal index of sense-distinctions
Productive derivations and compounds linked exhaustively WordNet1.5 90,000 concepts Metonymy/ Generalization clusters Universal Core meanings POS Independent Non-predictable Universal systematic polysemy and level of granularity Language and domain specific lexicalizations that do not occur in a large variety of languages Language specific realizations in grammatical forms

105 The EuroWordNet database
1. The actual wordnets in Flaim database format: an indexing and compression format of Novell.
2. Polaris (Louw 1997): re-implementation of the Novell ConceptNet toolkit (Díez-Orzas et al 1995) adapted to the EuroWordNet architecture.
   - import and export wordnets or wordnet selections from/to ASCII files.
   - resolve links for imported concepts.
   - edit and add concepts, variants and relations in the wordnets.
   - access the ILI and ontologies and switch between the wordnets and ontologies via the ILI.
   - extract, import and export clusters of senses based on relations.
   - project synsets or clusters from one wordnet to another wordnet.
   - compare clusters of synsets.
   - import new or adapted ILI-records.
   - update ILI-references to updated ILI.
3. Periscope (Cuypers and Adriaens 1997): a graphical interface for viewing the EuroWordNet database.

106 Global Wordnet Association http://www.globalwordnet.org
   - provide a standardized framework to link, compare and build complete wordnets for all the European languages and dialects.
   - initialize the development of wordnets in non-European languages.
   - develop more specific definitions, tests and procedures for evaluating and developing wordnets.
   - extend the specification of EuroWordNet to lexical units which are not yet covered (adjectives/adverbs, lexicalized phrases and multi-words).
   - develop (axiomatized) ontologies for Domains and World-Knowledge that can be shared by all languages via the ILI.
   - develop an efficient ILI for linking, sharing, consistency checking and cross-language technology applications. This ILI could function as a gold-standard of sense-distinctions.
   - organize an (annual/bi-annual) workshop or conference.

107 2nd Global Wordnet Conference
Location: Masaryk University, Brno (Czech Republic), January, , 2004.

108 Other wordnet initiatives
Danish, Norwegian, Swedish, Portuguese, Arabic, Korean, Russian, Welsh, Basque, Catalan, Chinese
BalkaNet
IndoWordnet
Meaning

109 BalkaNet Funded by the European Union as project IST-2000-29388.
3-year project:
Follows a strict EuroWordNet approach:
   - Expanded set of base concepts
   - Top-down building approach
EWN database extended with: Greek, Romanian, Serbian, Turkish, Bulgarian, Czech
Development of new wordnet database system: VisDic

110 IndoWordnet Current Wordnet development in India:
Hindi and Marathi at IIT Bombay, Tamil at Anna University-K.B Chandrashekhar Research Centre (AU-KBC) Chennai and Tamil University Tanjavur, Gujarati at MS University Baroda, Oriya at Utkal University Bhubaneswar and Bengali at IIT Kharagpur. The Hindi WordNet is at an advanced stage of development with about semantically linked synsets and with associated software and user interface.

111 IndoWordnet
By the end of 2003 each Indian language will create a WordNet of 5000 synsets. These will be for about 2000 most frequent content words in each language. Use will be made of the wordlist sorted by frequency, available from the CIIL.
Language-specific WordNets developed by the following institutions:
   - CIIL, Mysore: Kannada, Kashmiri, Punjabi, Urdu, Himachali, Malayalam
   - IIT Bombay: Hindi, Marathi and Konkani
   - AU-KBC Chennai and Tamil University Tanjavur: Tamil and Malayalam
   - University of Hyderabad: Telugu
   - University of Baroda: Gujarati
   - Utkal University Bhubaneswar: Oriya
   - IIT Kharagpur: Bengali
Research groups have to be identified for building the WordNets of Assamese, Nepali and Languages of the North East.

112 Developing Multilingual Web-scale Language Technologies
Meaning Developing Multilingual Web-scale Language Technologies

113 Meaning Objectives
Funded by the European Union as project IST
3-year project: April April 2005
Large-scale (Lexical) Knowledge Bases
Automatic enrichment of EWN
Mixed approach (KB + ML)
Applied to Q/A, CLIR
Problem: structural and lexical ambiguity

114 Meaning Approach
Automatic collection of sense examples (Leacock et al. 98, Mihalcea & Moldovan 99)
Large-scale WSD (Boosting, SVM, transductives)
Large-scale Knowledge Acquisition (McCarthy 01, Agirre & Martinez 02)

115 Architecture Meaning
Diagram elements: a Multilingual Central Repository in the centre; English, Italian, Spanish, Basque and Catalan web corpora, each with a WSD step, an ACQ step and its own EWN wordnet; UPLOAD and PORT links between the wordnets and the repository.

116 WP6: Word Sense Disambiguation
   - A combination of unsupervised Knowledge-based and supervised Machine Learning techniques that will provide a high-precision system that is able to tag running text with word senses.
   - A system that acquires a huge number of examples per word from the web.
   - The use of sophisticated linguistic information, such as syntactic relations, semantic classes, selectional restrictions, subcategorization information, domain, etc.
   - Efficient margin-based Machine Learning algorithms.
   - Novel algorithms that combine tagged examples with huge amounts of untagged examples in order to increase the precision of the system.

117 THE END...

