Wordnet, EuroWordNet, Global Wordnet Piek Vossen

1 Wordnet, EuroWordNet, Global Wordnet Piek Vossen

2 Overview: Princeton WordNet ( ongoing); EuroWordNet ( ): the database design, the general building strategy, towards a universal index of meaning; Global WordNet Association ( ongoing); other wordnets: BalkaNet ( ), IndoWordnet ( ongoing), Meaning ( ).

3 WordNet1.5: Developed at Princeton by George Miller and his team as a model of the mental lexicon. A semantic network in which concepts are defined in terms of relations to other concepts. Structure: organized around the notion of synsets (sets of synonymous words), with basic semantic relations between these synsets. Initially no glosses. Main revision after tagging the Brown corpus with word meanings: SemCor.

4 Structure of WordNet1.5

5 EuroWordNet The development of a multilingual database with wordnets for several European languages Funded by the European Commission, DG XIII, Luxembourg as projects LE and LE March September Million EURO. URL:

6 Objectives of EuroWordNet. Languages covered: EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian; EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian. Size of vocabulary: EuroWordNet-1: 30,000 concepts - 50,000 word meanings; EuroWordNet-2: 15,000 concepts - 25,000 word meanings. Type of vocabulary: the most frequent words of the languages; all concepts needed to relate more specific concepts.

7 Consortium

8 The basic principles of EuroWordNet the structure of the Princeton WordNet the design of the EuroWordNet database wordnets as language-specific structures the language-internal relations the multilingual relations

9 Specific features of EuroWordNet: It contains semantic lexicons for languages other than English. Each wordnet reflects the relations as a language-internal system, maintaining cultural and linguistic differences in the wordnets. It contains multilingual relations from each wordnet to English meanings, which makes it possible to compare the wordnets, tracking down inconsistencies and cross-linguistic differences. Each wordnet is linked to a language-independent top-ontology and to domain labels.

10 Autonomous & Language-Specific [Figure: two hierarchies side by side. Dutch Wordnet nodes: voorwerp {object}, lepel {spoon}, werktuig {tool}, tas {bag}, bak {box}, blok {block}, lichaam {body}. WordNet1.5 nodes: object, natural object (an object occurring naturally), artifact, artefact (a man-made object), instrumentality, container, device, implement, tool, instrument, bag, spoon, box, block, body.]

11 Differences in structure Artificial Classes versus Lexicalized Classes: instrumentality; natural object Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch What is the purpose of different hierarchies? Should we include all lexicalized classes from all (8) languages?

12 Linguistic versus Conceptual Ontologies. Conceptual ontology: a particular level or structuring may be required to achieve better control or performance, or a more compact and coherent structure: introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool); neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise). What properties can we infer for spoons? spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking.

13 Linguistic ontology: exactly reflects the relations between all the lexicalized words and expressions in a language. It therefore captures valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language. What words can be used to name spoons? spoon -> object, tableware, silverware, merchandise, cutlery.

14 Separate Wordnets and Ontologies [Figure: language-specific wordnets linked to a language-neutral ontology. Dutch Wordnet: voorwerp, doos. WordNet1.5: object, container, box. Language-Neutral Ontology: ReferenceOntologyClasses for BOX: ContainerProduct; SolidTangibleThing. EuroWordNet Top-Ontology: Form: Cubic; Function: Contain; Origin: Artifact; Composition: Whole.]

15 Wordnets versus ontologies. Wordnets: autonomous language-specific lexicalization patterns in a relational network. Usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense disambiguation. Ontologies: data structure with formally defined concepts. Usage: making semantic inferences.

16 Wordnets as Linguistic Ontologies. Classical Substitution Principle: any word that is used to refer to something can be replaced by its synonyms, hyperonyms and hyponyms: horse -> stallion, mare, pony, mammal, animal, being. It cannot be referred to by co-hyponyms and co-hyponyms of its hyperonyms: horse -X-> cat, dog, camel, fish, plant, person, object. Conceptual Distance Measurement: the number of hierarchical nodes between words is a measurement of closeness, where the level and the local density of nodes are additional factors.
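
The two ideas on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy hierarchy is built by hand from the words on the slide (it is not real wordnet data), the names `HYPERONYM`, `ancestors`, `can_substitute` and `distance` are hypothetical, and the distance simply counts links, ignoring the level and density factors the slide mentions.

```python
# Toy hyperonym links (child -> parent). The hierarchy is an assumption
# assembled from the slide's example words, not taken from a real wordnet.
HYPERONYM = {
    "stallion": "horse", "mare": "horse", "pony": "horse",
    "horse": "mammal", "cat": "mammal", "dog": "mammal", "camel": "mammal",
    "mammal": "animal", "fish": "animal",
    "animal": "being", "plant": "being", "person": "being",
}

def ancestors(word):
    """Return the hyperonyms of word, from nearest to the root (word excluded)."""
    chain = []
    while word in HYPERONYM:
        word = HYPERONYM[word]
        chain.append(word)
    return chain

def can_substitute(word, other):
    """True if other is the word itself, one of its hyperonyms, or one of its hyponyms."""
    return other == word or other in ancestors(word) or word in ancestors(other)

def distance(a, b):
    """Count the links on the path from a to b through their nearest shared node."""
    path_a = [a] + ancestors(a)
    path_b = [b] + ancestors(b)
    for steps_a, node in enumerate(path_a):
        if node in path_b:
            return steps_a + path_b.index(node)
    return None  # the two words share no node in this hierarchy
```

With this toy data, `can_substitute("horse", "pony")` holds while `can_substitute("horse", "cat")` does not, because cat is only a co-hyponym of horse.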

17 Linguistic Principles for deriving relations 1. Substitution tests (Cruse 1986): 1a. It is a fiddle therefore it is a violin. 1b. It is a violin therefore it is a fiddle. 2a. It is a dog therefore it is an animal. 2b. *It is an animal therefore it is a dog. 3a. to kill (/a murder) causes to die (/death); to kill (/a murder) has to die (/death) as a consequence. 3b. *to die/death causes to kill; *to die/death has to kill as a consequence.

18 Linguistic Principles for deriving relations 2. Principle of Economy (Dik 1978): If a word W1 (animal) is the hyperonym of W2 (mammal) and W2 is the hyperonym of W3 (dog), then W3 (dog) should not be linked to W1 (animal) but to W2 (mammal). 3. Principle of Compatibility: If a word W1 is related to W2 via relation R1, W1 and W2 cannot be related via relation Rn, where Rn is defined as a distinct relation from R1.
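
A minimal sketch of how the Principle of Economy could be checked mechanically: a direct hyperonym link is flagged when the same hyperonym is already reachable through another direct hyperonym. The function name `redundant_links` and the `(child, parent)` link format are assumptions made for this illustration.

```python
def redundant_links(links):
    """Return the direct (child, parent) links already implied by a longer chain."""
    parents = {}
    for child, parent in links:
        parents.setdefault(child, set()).add(parent)

    def ancestors(node):
        # All hyperonyms reachable from node, excluding node itself.
        seen, stack = set(), list(parents.get(node, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(parents.get(current, ()))
        return seen

    redundant = set()
    for child, parent in links:
        others = parents[child] - {parent}
        # Redundant if some other direct hyperonym already leads to parent.
        if any(parent in ancestors(other) for other in others):
            redundant.add((child, parent))
    return redundant
```

For the slide's example, the links dog -> mammal, mammal -> animal and dog -> animal yield `{("dog", "animal")}` as the link to drop.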

19 Architecture of the EuroWordNet Data Base [Figure: four language-specific wordnets, each with a Lexical Items Table, connected through the Inter-Lingual-Index. Link types: I = Language Independent link; II = Link from Language Specific to Inter-Lingual-Index; III = Language Dependent Link. Dutch: bewegen, gaan, rijden, berijden. Italian: muoversi, andare, guidare, cavalcare. English: move, go, drive, ride. Spanish: mover, transitar, conducir, cabalgar, jinetear. ILI-record: {drive}. Ontology: 2ndOrderEntity, Location, Dynamic. Domains: Traffic, Air, Road.]

20 The mono-lingual design of EuroWordNet

21 Language Internal Relations. WN 1.5 starting point. The synset as a weak notion of synonymy: two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value (Miller et al. 1993). Relations between synsets (relation: POS-combination, example): ANTONYMY: adjective-to-adjective, verb-to-verb (open/close); HYPONYMY: noun-to-noun (car/vehicle), verb-to-verb (walk/move); MERONYMY: noun-to-noun (head/nose); ENTAILMENT: verb-to-verb (buy/pay); CAUSE: verb-to-verb (kill/die).

22 Differences EuroWordNet/WordNet1.5 Added Features to relations Cross-Part-Of-Speech relations New relations to differentiate shallow hierarchies New interpretations of relations

23 EWN Relationship Labels Disjunction/Conjunction of multiple relations of the same type WordNet1.5 door1 -- (a swinging or sliding barrier that will close the entrance to a room or building; "he knocked on the door"; "he slammed the door as he left") PART OF: doorway, door, entree, entry, portal, room access door 6 -- (a swinging or sliding barrier that will close off access into a car; "she forgot to lock the doors of her car") PART OF: car, auto, automobile, machine, motorcar.

24 EWN Relationship Labels. {airplane} HAS_MERO_PART: conj1 {door}; HAS_MERO_PART: conj2 disj1 {jet engine}; HAS_MERO_PART: conj2 disj2 {propeller}. {door} HAS_HOLO_PART: disj1 {car}; HAS_HOLO_PART: disj2 {room}; HAS_HOLO_PART: disj3 {entrance}. {dog} HAS_HYPERONYM: conj1 {mammal}; HAS_HYPERONYM: conj2 {pet}. {albino} HAS_HYPERONYM: disj1 {plant}; HAS_HYPERONYM: disj2 {animal}. Default Interpretation: non-exclusive disjunction.
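
One possible reading of the conj/disj labels can be sketched as follows. It assumes that targets sharing a conjunction index are alternatives (non-exclusive disjunction) and that different indexes must all be satisfied; the function name `satisfies` and the `(conj_index, target)` encoding are hypothetical and not the EuroWordNet database format.

```python
from collections import defaultdict

def satisfies(observed, labelled_targets):
    """Check a set of observed parts against labelled meronym targets.

    labelled_targets is a list of (conj_index, target) pairs. Targets with
    the same conj_index are disjunctive alternatives; every conj_index must
    be matched by at least one observed part.
    """
    groups = defaultdict(set)
    for conj_index, target in labelled_targets:
        groups[conj_index].add(target)
    return all(group & observed for group in groups.values())

# {airplane}: conj1 {door}; conj2 disj1 {jet engine}; conj2 disj2 {propeller}
AIRPLANE_PARTS = [(1, "door"), (2, "jet engine"), (2, "propeller")]
```

Under this reading an airplane needs a door and at least one of jet engine or propeller; having both engines is also accepted because the disjunction is non-exclusive.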

25 EWN Relationship Labels. Disjunction/Conjunction of multiple relations of the same type: {dog} HAS_HYPONYM: disj1 {poodle}; HAS_HYPONYM: disj1 {labrador}; HAS_HYPONYM: {sheep dog} (Orthogonal); HAS_HYPONYM: {watch dog} (Orthogonal). Default Interpretation: non-exclusive disjunction.

26 EWN Relationship Labels. Factive/Non-factive CAUSES (Lyons 1977). Factive (default interpretation): to kill causes to die: {kill} CAUSES {die}. Non-factive: E1 probably or likely causes event E2, or E1 is intended to cause some event E2: to search may cause to find: {search} CAUSES {find} (non-factive).

27 Reversed: In the database every relation must have a reverse counterpart, but there is a difference between relations which are explicitly coded as reverse and automatically reversed relations: {finger} HAS_HOLONYM {hand}; {hand} HAS_MERONYM {finger}; {paper-clip} HAS_MER_MADE_OF {metal}; {metal} HAS_HOL_MADE_OF {paper-clip} (reversed). Negation: {monkey} HAS_MERO_PART {tail}; {ape} HAS_MERO_PART {tail} (not).
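
The difference between explicitly coded and automatically reversed relations can be sketched like this. The tuple layout, the `REVERSE` table and the function name `add_reverses` are assumptions for illustration; only the relation names come from the slide.

```python
# Relation name -> name of its reverse counterpart (names as on the slide).
REVERSE = {
    "HAS_HOLONYM": "HAS_MERONYM",
    "HAS_MERONYM": "HAS_HOLONYM",
    "HAS_MER_MADE_OF": "HAS_HOL_MADE_OF",
    "HAS_HOL_MADE_OF": "HAS_MER_MADE_OF",
}

def add_reverses(explicit):
    """Return (source, relation, target, reversed_flag) for all relations.

    explicit holds the hand-coded (source, relation, target) triples. A
    counterpart that was not coded by hand is generated and flagged True.
    """
    coded = set(explicit)
    result = [(s, r, t, False) for s, r, t in explicit]
    for s, r, t in explicit:
        counterpart = (t, REVERSE[r], s)
        if counterpart not in coded:
            result.append(counterpart + (True,))
    return result

EXPLICIT = [
    ("finger", "HAS_HOLONYM", "hand"),
    ("hand", "HAS_MERONYM", "finger"),
    ("paper-clip", "HAS_MER_MADE_OF", "metal"),
]
```

Here finger/hand is coded in both directions, so only the metal/paper-clip counterpart is generated and carries the reversed flag.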

28 Cross-Part-Of-Speech relations WordNet1.5: nouns and verbs are not interrelated by basic semantic relations such as hyponymy and synonymy: adornment 2 change of state-- (the act of changing something) adorn 1 change, alter-- (cause to change; make different) EuroWordNet: words of different parts of speech can be inter-linked with explicit xpos-synonymy, xpos-antonymy and xpos-hyponymy relations: {adorn V}XPOS_NEAR_SYNONYM{adornment N}

29 Cross-Part-Of-Speech relations. The advantages of such explicit cross-part-of-speech relations are: Similar words with different parts of speech are grouped together. The same information can be coded in an NP or in a sentence. By unifying higher-order nouns and verbs in the same ontology it will be possible to match expressions with very different syntactic structures but comparable content. By merging verbs and abstract nouns we can more easily link mismatches across languages that involve a part-of-speech shift: Dutch nouns such as afsluiting and gehuil are translated with the English verbs close and cry, respectively.

30 Entailment in WordNet. WordNet1.5: Entailment indicates the direction of the implication or entailment: a. + Temporal Inclusion (the two situations partially or totally overlap): a.1 co-extensiveness (e.g. to limp/to walk): hyponymy/troponymy; a.2 proper inclusion (e.g. to snore/to sleep): entailment. b. - Temporal Exclusion (the two situations are temporally disjoint): b.1 backward presupposition (e.g. to succeed/to try): entailment; b.2 cause (e.g. to give/to have).

31 Subevents in EuroWordNet. The direction of the entailment is expressed by the labels factive and reversed: {to succeed} is_caused_by {to try} (factive); {to try} causes {to succeed} (non-factive). Proper inclusion is described by the has_subevent/is_subevent_of relation in combination with the label reversed: {to snore} is_subevent_of {to sleep}; {to sleep} has_subevent {to snore} (reversed); {to buy} has_subevent {to pay}; {to pay} is_subevent_of {to buy} (reversed).

32 The interpretation of the CAUSE relation. WordNet1.5: the causal relation only holds between verbs and it should only apply to temporally disjoint situations. EuroWordNet: the causal relation will also be applied across different parts of speech: {to kill} v causes {death} n; {death} n is_caused_by {to kill} v (reversed); {to kill} v causes {dead} a; {dead} a is_caused_by {to kill} v (reversed); {murder} n causes {death} n; {death} n is_caused_by {murder} n (reversed).

33 The interpretation of the CAUSE relation. Various temporal relationships between the (dynamic/non-dynamic) situations may hold. Temporally disjoint: there is no time point when dS1 takes place and also S2 (which is caused by dS1) (e.g. to shoot/to hit). Temporally overlapping: there is at least one time point when both dS1 and S2 take place, and there is at least one time point when dS1 takes place and S2 (which is caused by dS1) does not yet take place (e.g. to teach/to learn). Temporally co-extensive: whenever dS1 takes place also S2 (which is caused by dS1) takes place and there is no time point when dS1 takes place and S2 does not take place, and vice versa (e.g. to feed/to eat).

34 Role relations. In the case of many verbs and nouns the most salient relation is not the hyperonym but the relation between the event and the involved participants. These relations are expressed as follows: {hammer} ROLE_INSTRUMENT {to hammer}; {to hammer} INVOLVED_INSTRUMENT {hammer} (reversed); {school} ROLE_LOCATION {to teach}; {to teach} INVOLVED_LOCATION {school} (reversed). These relations are typically used when other relations, mainly hyponymy, do not clarify the position of the concept in the network, but the word is still closely related to another word.

35 Co_Role relations. guitar player: HAS_HYPERONYM player; CO_AGENT_INSTRUMENT guitar. player: HAS_HYPERONYM person; ROLE_AGENT to play music; CO_AGENT_INSTRUMENT musical instrument. to play music: HAS_HYPERONYM to make; ROLE_INSTRUMENT musical instrument. guitar: HAS_HYPERONYM musical instrument; CO_INSTRUMENT_AGENT guitar player. ice saw: HAS_HYPERONYM saw; CO_INSTRUMENT_PATIENT ice. saw: HAS_HYPERONYM saw; ROLE_INSTRUMENT to saw. ice: CO_PATIENT_INSTRUMENT ice saw (REVERSED).

36 Co_Role relations. Examples of the other relations are: criminal CO_AGENT_PATIENT victim; novel writer/poet CO_AGENT_RESULT novel/poem; dough CO_PATIENT_RESULT pastry/bread; photographic camera CO_INSTRUMENT_RESULT photo.

37 BE_IN_STATE and STATE_OF. Example: the poor are the ones to whom the state poor applies. Effect: poor N HAS_HYPERONYM person N; poor N BE_IN_STATE poor A; poor A STATE_OF poor N (reversed). IN_MANNER and MANNER_OF. Example: to slurp is to eat in a noisy manner. Effect: slurp V HAS_HYPERONYM eat V; slurp V IN_MANNER noisily Adverb; noisily Adverb MANNER_OF slurp V (reversed).

38 Overview of the Language Internal relations in EuroWordNet. Same Part of Speech relations: NEAR_SYNONYMY (apparatus - machine); HYPERONYMY/HYPONYMY (car - vehicle); ANTONYMY (open - close); HOLONYMY/MERONYMY (head - nose). Cross-Part-of-Speech relations: XPOS_NEAR_SYNONYMY (dead - death; to adorn - adornment); XPOS_HYPERONYMY/HYPONYMY (to love - emotion); XPOS_ANTONYMY (to live - dead); CAUSE (die - death); SUBEVENT (buy - pay; sleep - snore); ROLE/INVOLVED (write - pencil; hammer - hammer); STATE (the poor - poor); MANNER (to slurp - noisily); BELONG_TO_CLASS (Rome - city).

39 Thematic networks [Figure: network of Dutch synsets: behandelen (treat), zieke (sick person, patient), genezen (to get well), arts (doctor), scalpel, opereren (operate), persoon (person), wezen (being), organisme (organism), orgaan (organ), maag (stomach), maagaandoening (stomach disease), ziekte (disease). Link labels shown: Agent, Patient, Causes, Patient, Involves, Instrument, Part of, Patient.]

40 The multi-lingual design of EuroWordNet

41 The Multilingual Design. Inter-Lingual-Index: unstructured fund of concepts to provide an efficient mapping across the languages. Index-records are mainly based on WordNet1.5 synsets and consist of synonyms, glosses and source references. Various types of complex equivalence relations are distinguished. Equivalence relations go from synsets to index records, not on a word-to-word basis. Indirect matching of synsets linked to the same index items.

42 EWN Interlingual Relations. EQ_SYNONYM: there is a direct match between a synset and an ILI-record. EQ_NEAR_SYNONYM: a synset matches multiple ILI-records simultaneously. HAS_EQ_HYPERONYM: a synset is more specific than any available ILI-record. HAS_EQ_HYPONYM: a synset can only be linked to more specific ILI-records. Other relations: CAUSES/IS_CAUSED_BY, EQ_SUBEVENT/EQ_ROLE, EQ_IS_STATE_OF/EQ_BE_IN_STATE.
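
Indirect matching through the index can be sketched as a lookup over shared ILI-records. The link tuples below are toy data assembled from words that appear in this presentation; which relation each pair actually carries in EuroWordNet is an assumption, as are the names `LINKS`, `build_index` and `matches`.

```python
from collections import defaultdict

# (language, synset, equivalence relation, ILI-record); illustrative data only.
LINKS = [
    ("nl", "rijden", "EQ_SYNONYM", "drive"),
    ("it", "guidare", "EQ_SYNONYM", "drive"),
    ("es", "conducir", "EQ_SYNONYM", "drive"),
    ("es", "dedo", "HAS_EQ_HYPONYM", "finger"),
    ("es", "dedo", "HAS_EQ_HYPONYM", "toe"),
    ("it", "dito", "HAS_EQ_HYPONYM", "finger"),
    ("it", "dito", "HAS_EQ_HYPONYM", "toe"),
]

def build_index(links):
    """Group the (language, synset, relation) entries by ILI-record."""
    by_ili = defaultdict(set)
    for language, synset, relation, ili in links:
        by_ili[ili].add((language, synset, relation))
    return by_ili

def matches(links, language, synset, target_language):
    """Synsets of target_language that share at least one ILI-record with synset."""
    by_ili = build_index(links)
    own = {ili for lang, syn, _, ili in links if (lang, syn) == (language, synset)}
    found = set()
    for ili in own:
        for lang, syn, _ in by_ili[ili]:
            if lang == target_language and (lang, syn) != (language, synset):
                found.add(syn)
    return sorted(found)
```

No wordnet is linked to another directly: Spanish dedo and Italian dito are matched only because both point to the same ILI-records.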

43 Equivalent Near Synonym 1. Multiple Targets. One sense for Dutch schoonmaken (to clean) simultaneously matches at least 4 senses of clean in WordNet1.5: {make clean by removing dirt, filth, or unwanted substances from}; {remove unwanted substances from, such as feathers or pits, as of chickens or fruit}; {remove in making clean; "Clean the spots off the rug"}; {remove unwanted substances from - (as in chemistry)}. The Dutch synset schoonmaken will thus be linked with an eq_near_synonym relation to all these senses of clean.

44 Equivalent Near Synonym 2. Multiple Source meanings. Synsets inter-linked by a near_synonym relation can be linked to the same target ILI-record(s), either with an eq_synonym or an eq_near_synonym relation. Dutch wordnet: toestel near_synonym apparaat. ILI-records: {machine}; {device}; {apparatus}; {tool}.

45 Equivalent Hyponymy. has_eq_hyperonym: typically used for gaps in WordNet1.5 or in English: genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin; pragmatic gaps, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g. Dutch hoofd only refers to human head and Dutch kop only refers to animal head, while English uses head for both. has_eq_hyponym: used when WordNet1.5 only provides narrower terms. In this case there can only be a pragmatic difference, not a genuine cultural gap, e.g. Spanish dedo can be used to refer to both finger and toe.

46 Complex mappings across languages [Figure: GB-Net (toe, finger, head), IT-Net (dito), ES-Net (dedo) and NL-Net (hoofd, kop) linked to the records { toe : part of foot }, { finger : part of hand }, { dedo, dito : finger or toe }, { head : part of body }, { hoofd : human head }, { kop : animal head }. Legend: normal equivalence; eq_has_hyponym; eq_has_hyperonym.]

47 The methodologies for building wordnets

48 Overall Building Process [Figure: flow chart with stage labels Ia, Ib, Ic, II, III. Boxes: Specification of selection criteria; Machine Readable Dictionaries, Wordnets, Taxonomies, Corpora loaded in local databases; Subset of word meanings; Encoding of language internal and equivalence relations; Wordnet fragment with links to WordNet1.5 in local database; Load wordnet in the EuroWordNet Database; Wordnet fragment in EuroWordNet database; Comparing and restructuring the wordnet; Improve and extend the wordnet fragments; Adjust coverage, improve encoding; Verification by users; Demonstration in Information Retrieval; Verification Report.]

49 Main Methods. Expand approach: translate WordNet1.5 synsets to another language and take over the structure: easier and more efficient method; compatible structure with WordNet1.5; structure is close to WordNet1.5 but also biased by it. Merge approach: create an independent wordnet in another language and align the separate hierarchies by generating the appropriate translations: more complex and labour intensive; different structure from WordNet1.5; language-specific patterns can be maintained.

50 Methods for extracting language-internal relations editors and database for manually encoding relations; comparison with WordNet1.5 structure; definition patterns in monolingual dictionaries; co-occurrences in corpora; morphology; bilingual dictionaries; lexical semantic substitution tests

51 Methods for extracting equivalence relations: extract monosemous translations of English synsets, e.g. a Spanish word has only 1 translation to an English word which has only one sense and vice versa; disambiguation of multiple ambivalent translations by measuring the conceptual distance between the senses of these translations in the WordNet1.5 hierarchy (Rigau and Aguirre, 95); disambiguation of ambivalent translations by measuring the conceptual distance directly in the WordNet1.5 hierarchy between alternative translations and the translations of the direct semantic context in the source wordnet; disambiguation of ambivalent translations by measuring the overlap in top-concepts inherited in the source wordnet and inherited for the different senses of translations in WordNet1.5.
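
The first method, extracting monosemous translations, can be sketched as a filter over a bilingual dictionary. The dictionaries, the sense identifiers such as `thermometer#1` and the function name `monosemous_links` are all hypothetical; real lexical resources would replace the toy data.

```python
def monosemous_links(src_to_en, en_to_src, en_senses):
    """Link a source word to an English synset only when nothing is ambiguous.

    A link is kept when the source word has exactly one English translation,
    that English word translates back to exactly this source word, and the
    English word has exactly one sense.
    """
    links = {}
    for word, translations in src_to_en.items():
        if len(translations) != 1:
            continue
        english = translations[0]
        if en_to_src.get(english, []) != [word]:
            continue
        senses = en_senses.get(english, [])
        if len(senses) == 1:
            links[word] = senses[0]
    return links

# Toy Spanish-English data, invented for this illustration.
SRC_TO_EN = {"termómetro": ["thermometer"], "banco": ["bank", "bench"]}
EN_TO_SRC = {"thermometer": ["termómetro"], "bank": ["banco"], "bench": ["banco"]}
EN_SENSES = {
    "thermometer": ["thermometer#1"],
    "bank": ["bank#1", "bank#2"],
    "bench": ["bench#1"],
}
```

Only termómetro survives the filter; banco is left for the distance-based and overlap-based disambiguation methods listed on the slide.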

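52 Aligning wordnets [Figure: a Dutch hierarchy and a WordNet1.5 hierarchy to be aligned. Dutch nodes: muziekinstrument, orgel, hammond orgel. WordNet1.5 nodes: organ?, hammond organ, musical instrument, instrument, artifact, object, natural object, object.]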

53 Inheriting Semantic Features [Figure: hyperonym chains read from the leaf synset upwards, with top-ontology features in parentheses. hart 1 -> orgaan 1 (Living Part) -> deel 2 (Part) -> iets 1. heart 1 -> playing card 1 -> card 1 (Artifact Function Object) -> paper 6 (Artifact Solid) -> material 5 (Substance) -> matter 1 -> inanimate object 1 -> entity 1. heart 2 -> disposition 2 (Dynamic Experience Mental) -> nature 1 -> trait 1 (Property) -> attribute 1 (Property) -> abstraction 1. heart 3 -> bravery 1 -> spirit 1 -> character 1 -> trait 1 (Property) -> attribute 1 (Property) -> abstraction 1. heart 4 -> internal organ 1 -> organ 4 (Living Part) -> body part 1 (Living Part) -> part 10 -> entity 1.]
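
Feature inheritance along the hyperonym chain, and its use for choosing among candidate translations by top-concept overlap, can be sketched as below. The chains and features are transcribed from this slide under the assumption that each sequence runs from the leaf synset upwards; the names `CHAINS`, `FEATURES`, `inherited` and `best_sense` are hypothetical.

```python
# Hyperonym chain above each leaf synset (assumed reading of the slide).
CHAINS = {
    "hart 1": ["orgaan 1", "deel 2", "iets 1"],
    "heart 1": ["playing card 1", "card 1", "paper 6", "material 5",
                "matter 1", "inanimate object 1", "entity 1"],
    "heart 2": ["disposition 2", "nature 1", "trait 1", "attribute 1", "abstraction 1"],
    "heart 3": ["bravery 1", "spirit 1", "character 1", "trait 1",
                "attribute 1", "abstraction 1"],
    "heart 4": ["internal organ 1", "organ 4", "body part 1", "part 10", "entity 1"],
}

# Top-ontology features attached to individual synsets on the slide.
FEATURES = {
    "orgaan 1": {"Living", "Part"}, "deel 2": {"Part"},
    "card 1": {"Artifact", "Function", "Object"},
    "paper 6": {"Artifact", "Solid"}, "material 5": {"Substance"},
    "disposition 2": {"Dynamic", "Experience", "Mental"},
    "trait 1": {"Property"}, "attribute 1": {"Property"},
    "organ 4": {"Living", "Part"}, "body part 1": {"Living", "Part"},
}

def inherited(synset):
    """Union of the features found on the synset and on its hyperonym chain."""
    features = set(FEATURES.get(synset, ()))
    for hyperonym in CHAINS.get(synset, []):
        features |= FEATURES.get(hyperonym, set())
    return features

def best_sense(source, candidates):
    """Pick the candidate whose inherited features overlap most with the source."""
    target = inherited(source)
    return max(candidates, key=lambda sense: len(inherited(sense) & target))
```

With these chains, Dutch hart 1 inherits Living and Part, and among the four English senses only heart 4 shares those features.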

54 Reliability of Equivalence Relations


56 Conflicting Starting points. 1. There should be a maximum of flexibility: the wordnets should be able to reflect language-specific relations and patterns; the wordnets should be built relatively independently because each site has different starting points: different tools, databases and resources (Machine Readable Dictionaries); differences in the languages. 2. The wordnets have to be compatible in terms of coverage and relations to be useful for multilingual information retrieval and translation tools and to be able to compare the wordnets.

57 Measures to achieve maximal compatibility. The results are loaded into a common Multilingual Database (Polaris): consistency checks and types of incompatibility; specific comparison options to measure consistency and overlap in coverage. User-guides for building wordnets in each language: the steps to encode the relations for a word meaning; common tests and criteria for all the relations; overview of problems and solutions. A set of common Base Concepts which are shared by all the sites, having: most relations and the most important positions in the wordnets; most meanings and badly defined. Classification of the common Base Concepts in terms of a Top-Ontology of 63 basic Semantic Distinctions. Top-Down Approach, where first the Base Concepts and their direct context are (manually) encoded and next the wordnets are (semi-automatically) extended top-down to include more specific concepts that depend on these Base Concepts.

58 Top-Ontology and Base Concepts. Top-Ontology with 63 higher-level concepts. Existing ontologies: WordNet1.5 top-levels; Aktionsart models (Vendler, Verkuyl); Acquilex and Sift ontologies (EC-projects); Qualia-structure (Pustejovsky); Upper-Model, MikroKosmos, Cyc, Ad Hoc ANSI-Committee on ontologies. The ontology was adapted to represent the variety of concepts in the set of Common Base Concepts, across the 4 languages: homogeneous Base Concept Clusters; average size of Base Concept Clusters; apply to both nouns and verbs. Set of 1024 common Base Concepts making up the core of the separate wordnets.

59 Base Concepts. Procedure: Each site determined the set of word meanings with most relations (up to 15% of all relations) and high positions in the hierarchy. This set was extended with all meanings used to define the first selection. The local selection was translated to WordNet1.5 equivalences: 4 lists of WordNet1.5 synsets (between 450 and 2000 synsets per selection). These sets of WordNet1.5 translations have been compared. Concepts selected by all sites: 30 synsets (24 noun synsets, 6 verb synsets). Explanations: The individual selections are not representative enough. There are major differences in the way meanings are classified, which have an effect on the frequency of the relations. The translations of the selection to WordNet1.5 synsets are not reliable. The resources cover very different vocabularies.

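60 Concepts selected by at least two sites: intersections of pairs [Table: pairwise intersections for NOUNS and VERBS, rows and columns NL, ES, IT, GB/WN, with a Total row; cell values not preserved.] Set of shared Base Concepts: union of intersection pairs [Table: columns Nouns, Verbs, Total; rows 1stOrderEntities, 2ndOrderEntities, 3rdOrderEntities (33, 33), Total; remaining cell values not preserved.]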
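
The two selection criteria, concepts chosen by all sites versus concepts chosen by at least two sites, differ only in how the site selections are combined. A minimal sketch, with invented selections and hypothetical function names:

```python
from itertools import combinations

def shared_by_all(selections):
    """Concepts that every site selected."""
    return set.intersection(*selections.values())

def shared_by_at_least_two(selections):
    """Union of all pairwise intersections between site selections."""
    shared = set()
    for first, second in combinations(selections.values(), 2):
        shared |= first & second
    return shared

# Invented selections, one set of WordNet1.5-style concepts per site.
SELECTIONS = {
    "NL": {"entity", "object", "move"},
    "ES": {"entity", "object", "change"},
    "IT": {"entity", "move", "place"},
    "GB/WN": {"entity", "change", "place"},
}
```

In this toy case only one concept is shared by all four sites, while the union of pairwise intersections recovers five, which mirrors why the stricter criterion gave such a small set.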

61 Table 4: Number of Common BCs represented in the local wordnets [Columns: Related to CBCs; Eq_synonym relations; Eq_near_synonym relations; CBCs without direct equivalent. Rows: AMS, FUE, PSA. Cell values not preserved.] Table 5: BC4 Gaps in at least two wordnets (10 synsets): body covering#1; body substance#1; social control#1; change of magnitude#1; contractile organ#1; psychological feature#1; mental object#1, cognitive content#1, content#2; natural object#1; place of business#1, business establishment#1; plant organ#1; plant part#1; spatial property#1, spatiality#1.

62 Table 6: Local senses with complex equivalence relations to CBCs (columns NL, ES, IT): Eq_has_hyperonym 61404; Eq_has_hyponym; Eq_has_holonym 20; Eq_has_meronym 32; Eq_involved 3; Eq_is_caused_by 3; Eq_is_state_of 1. Example of complex relation: CBC: cause to feel unwell#1, Verb. Closest Dutch concept: {onwel#1}, Adjective (sick). Equivalence relation: eq_is_caused_by.

63 Adaptation of Base Concepts in EuroWordNet-2 A similar selection of fundamental concepts has been made in EuroWordNet-2 The selected concepts have been compared among German, French, Czech and Estonian and with the EuroWordNet-1 selection The EuroWordNet-1 set has been extended to 1310 Base Concepts A distinction has been made between Hard and Soft Base Concepts Hard: represented by only a single Index-record Soft: represented by several close Index-records The final set has been used as starting point in EuroWordNet-2

64 Comparison of Base Concept Selections

65 Revised Set of Base Concepts

66 Starting points for the Top-Ontology. The ontology should support the building and encoding of semantic networks as linguistic ontologies: networks of lexicalized words and expressions in a language. The classification of the Base Concepts in terms of the Top Ontology should apply to all the involved languages. Enforce uniformity and compatibility of the different wordnets, by providing a common framework. Divide the Base Concepts (BCs) into coherent clusters to enable contrastive analysis and discussion of closely related word meanings. Customize the database by assigning features to the top-concepts, irrespective of language-specific structures. Provide an anchor point for connecting other ontologies to the Inter-Lingual-Index, such as CYC, MikroKosmos, the Upper-Model, by linking them to the corresponding ILI-records.

67 Principles for deciding on the distinctions. Starting point is that the wordnets are linguistic ontologies. Semantic classifications common in linguistic paradigms: Aktionsart models [Vendler 1967, Verkuyl 1972, Verkuyl 1989, Pustejovsky 1991], entity-orders [Lyons 1977], Aristotle's Qualia-structure [Pustejovsky 1995]. Ontologies developed in previous EC-projects, which had a similar basis and are well-known in the project consortium: Acquilex (BRA 3030, 7315), Sift (LE , [Vossen and Bon 1996]). The ontology should be capable of reflecting the diversity of the set of common BCs, across the 4 languages. In this sense the classification of the common BCs in terms of the top-concepts should result in: Homogeneous Base Concept Clusters: classifications in WordNet1.5 and the other wordnets. Average-sized Base Concept Clusters: not extremely large or small.

68 Other important characteristics: The distinctions apply to nouns, verbs and adjectives, because these can be related in the language-specific wordnets via an xpos_synonymy relation, and the ILI-records can be related to any part of speech. The top-concepts are hierarchically ordered by means of a subsumption relation, but there can only be one super-type linked to each top-concept: multiple inheritance between top-concepts is not allowed. In addition to the subsumption relation, top-concepts can have an opposition relation to indicate that certain distinctions are disjunct, whereas others may overlap. There may be multiple relations from ILI-records to top-concepts: the Base Concepts can be cross-classified in terms of multiple top-concepts (as long as these have no opposition relation between them), i.e. multiple inheritance from Top-Concept to Base Concept is allowed. Result: the TCs function as cross-classifying features rather than conceptual classes. Meanings for body parts are not linked to a single class BodyPart but to two features: Living and Part.
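
The constraints on this slide can be sketched as a small consistency check: each top-concept has a single super-type, and a Base Concept may carry several top-concepts as long as none of them (or their super-types) stand in an opposition relation. The `PARENT` fragment follows the Origin and Composition branches of the ontology; treating Natural and Artifact as opposed is an assumption made for this example, and the function names are hypothetical.

```python
# Single super-type per top-concept (fragment of the ontology).
PARENT = {
    "Natural": "Origin", "Artifact": "Origin", "Living": "Natural",
    "Part": "Composition", "Group": "Composition",
}

# Opposition relations between top-concepts (assumed for illustration).
OPPOSED = {frozenset({"Natural", "Artifact"})}

def with_ancestors(top_concept):
    """The top-concept followed by its chain of super-types."""
    chain = [top_concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def is_consistent(features):
    """True if no two features, including inherited super-types, are opposed."""
    expanded = {node for feature in features for node in with_ancestors(feature)}
    return not any(pair <= expanded for pair in OPPOSED)
```

A body-part meaning classified as Living and Part passes the check, whereas Living combined with Artifact fails because Living inherits Natural.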

69 The EuroWordNet Top-Ontology: 63 concepts (excluding the top). First Level [Lyons 1977]: 1stOrderEntity (491 BC synsets, all nouns): any concrete entity (publicly) perceivable by the senses and located at any point in time, in a three-dimensional space. 2ndOrderEntity (500 BC synsets, 272 nouns and 228 verbs): any Static Situation (property, relation) or Dynamic Situation, which cannot be grasped, heard, seen, felt as an independent physical thing. They can be located in time and occur or take place rather than exist; e.g. continue, occur, apply. 3rdOrderEntity (33 BC synsets, all nouns): an unobservable proposition that exists independently of time and space. They can be true or false rather than real. They can be asserted or denied, remembered or forgotten. E.g. idea, thought, information, theory, plan.

70 Test to distinguish 1st, 2nd and 3rd OrderEntities. Third-order entities cannot occur and have no temporal duration, and therefore fail on both tests: a. The same person was here again to-day. b. The same thing happened/occurred again to-day. *?The idea, fact, expectation, etc. was here/occurred/took place. A positive test for a 3rdOrderEntity is based on the properties that can be predicated: ok: The idea, fact, expectation, etc. is true, is denied, forgotten. The first division of the ontology is disjoint: BCs cannot be classified as combinations of these TCs. This distinction cuts across the different parts of speech in that: 1stOrderEntities are always (concrete) nouns. 2ndOrderEntities can be nouns, verbs and adjectives, where adjectives are always non-dynamic (refer to states and situations not involving a change of state). 3rdOrderEntities are always (abstract) nouns.

71 Base Concepts classified as 3rdOrderEntities theory; idea; structure; evidence; procedure; doctrine; policy; data point; content; plan of action; concept; plan; communication; knowledge base; cognitive content; know-how; category; information; abstract; info;

72 1stOrderEntity 1. Origin 0 (the way in which an entity has come about): Natural 21 (Living 30 (Plant 18, Human 106, Creature 2, Animal 123)), Artifact 144. Function 0 (the typical activity or role that is associated with an entity): Vehicle 8, Occupation 23, Covering 8, Garment 3, Software 4, Furniture 6, Place 45, Container 12, Comestible 32, Instrument 18, Building 13, Representation 12 (MoneyRepresentation 10, LanguageRepresentation 34, ImageRepresentation 9). Form 0 (amorphous or fixed shape): Substance 32 (Solid 63, Liquid 13, Gas 1), Object 62. Composition 0 (group of self-contained wholes or a part of such a whole): Part 86, Group 63.

73 Conjunctive classes of 1stOrderEntities. Frequent combinations (count, combination): 5 Comestible;Solid;Artifact, 5 Container;Part;Solid;Living, 5 Furniture;Object;Artifact, 5 Instrument;Artifact, 5 Living, 5 Plant, 6 Liquid, 6 Object;Artifact, 6 Part;Living, 6 Place;Part;Solid, 7 Building;Object;Artifact, 7 Group, 7 LanguageRepresentation, 7 Vehicle;Object;Artifact, 10 Instrument;Object;Artifact, 12 Part, 14 Place, 14 Place;Part, 15 Substance, 19 LanguageRepresentation;Artifact, 20 Occupation;Object;Human, 22 Object;Animal;Function, 38 Group;Human, 42 Object;Human.

74 Conjunctive classes of 1stOrderEntities. Low Frequent combinations: fruit: Comestible (Function), Object (Form), Part (Composition), Plant (Natural, Origin). life: Group (Composition), Living (Natural, Origin). cell: Part (Composition), Living (Natural, Origin). skin: Covering (Function), Solid (Form), Part (Composition), Living (Natural, Origin). arms: Instrument (Function), Group (Composition), Object (Form), Artifact (Origin).

75 1stOrderEntities classified as Function only: barrier 1; belonging 2; building material 1; causal agency 1; commodity 1; consumer goods 1; creation 3; curative 1; decoration 2; device 4; fastener 1; force 6; force 7; form 5; impediment 1; medicament 1; piece of work 1; possession 1; protection 4; remains 2; restraint 2; support 6; support 7; supporting structure 1; thing 3.

76 2ndOrderEntity 0. SituationType 6 (the event-structure in terms of which a situation can be characterized as a conceptual unit over time; disjoint features): Dynamic 134 (he sat down quickly; a quick meeting) with BoundedEvent 183 and UnboundedEvent 48; Static 28 (?he sits quickly) with Property 61 and Relation 38. SituationComponent 0 (the most salient semantic component(s) that characterize(s) a situation; conjoined features): Cause 67, Communication 50, Condition 62, Physical 140, Agentive 170, Existence 27, Experience 43, Possession 23, Phenomenal 17, Location 76, Manner 21, Purpose 137, Stimulating 25, Mental 90, Modal 10, Quantity 39, Social 102, Time 24, Usage 8.

77 Conjunctive classes of 2ndOrderEntities: Static
5 Property;Physical;Condition
5 Property;Stimulating;Physical
5 Relation
5 Relation;Social
6 Static;Quantity
7 Property;Condition
8 Relation;Location
9 Property
10 Relation;Physical;Location: adjoin 1; aim 4; blank space 1; course 7; direction 8; distance 1; elbow room 1; path 3; spatial property 1; spatial relation 1

78 Conjunctive classes of 2ndOrderEntities: Dynamic
5 BoundedEvent;Cause;Physical
5 BoundedEvent;Cause;Physical;Location
5 BoundedEvent;Time
5 Dynamic
5 Dynamic;Location
5 Dynamic;Phenomenal
5 Dynamic;Phenomenal;Physical
6 BoundedEvent;Agentive
6 BoundedEvent;Location
6 BoundedEvent;Physical;Location
6 Dynamic;Agentive;Communication
6 Dynamic;Cause
8 BoundedEvent;Agentive;Mental;Purpose
8 BoundedEvent;Quantity;Time
9 BoundedEvent;Cause
9 Dynamic;Experience;Mental
Examples: experience 7; find 3; affect 5; arouse 5; excite 2; cognition 1; desire 2; disposition 2; disposition 4; disturbance 7; emotion 1; feeling 1; humor 3; pleasance 1; process 4; look 8; phenomenon 1; cause to appear 1; perception 2; sensation 1; feel 12; experience 8; trouble 3; reality 1

79 Top-Down Building Procedure
1) Construction of a core wordnet from the common set of Base Concepts
   - Find representatives in the local language for the Common Base Concepts (1310 synsets)
   - Add local Base Concepts that are not selected as Common Base Concepts
   - Specify the hyperonyms of the local and common Base Concepts
2) Extend the core wordnets
   - Add the first level of hyponyms to the core wordnets
   - Add other hyponyms which have many sub-hyponyms
   - Add other types of relations: XPOS, roles, meronymy, subevents, causes
3) Verify the selection
   - Corpus frequency: Parole lexicons and corpora
   - Top-Concept clustering
   - Intersection of ILI-records
   - Overlap in ILI-chains
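Step 2 of the procedure can be sketched in a few lines of Python. This is only an illustration under assumptions: the toy graph `HYPONYMS`, the threshold parameter `min_sub_hyponyms` and the function names are invented here, and the slide does not say how "many sub-hyponyms" was actually measured.

```python
# Toy hyponymy graph: concept -> direct hyponyms (invented example data).
HYPONYMS = {
    "animal": ["mammal", "bird"],
    "mammal": ["dog", "cat"],
    "dog": ["poodle", "terrier"],
    "bird": ["sparrow"],
}

def count_descendants(concept):
    """Number of direct and indirect hyponyms below a concept."""
    children = HYPONYMS.get(concept, [])
    return len(children) + sum(count_descendants(child) for child in children)

def extend_core(base_concepts, min_sub_hyponyms):
    """Add first-level hyponyms, then deeper hyponyms with many sub-hyponyms."""
    core = set(base_concepts)
    for concept in base_concepts:
        core.update(HYPONYMS.get(concept, []))  # first level of hyponyms
    frontier = list(core)
    while frontier:
        concept = frontier.pop()
        for child in HYPONYMS.get(concept, []):
            # Deeper hyponyms are kept only if they head a large enough subtree.
            if child not in core and count_descendants(child) >= min_sub_hyponyms:
                core.add(child)
                frontier.append(child)
    return core

print(sorted(extend_core(["animal"], min_sub_hyponyms=2)))
```

With the toy data, "dog" is added because it has two sub-hyponyms, while "cat" and "sparrow" are left for a later building phase.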

80 Top-Down Building [Diagram: the Top-Ontology (63 TCs) and the Inter-Lingual-Index (1310 CBCs, 149 new ILIs, remaining WordNet1.5 synsets) in the centre; on either side a wordnet built up in layers: hyperonyms, CBC representatives, local BCs, WMs related via non-hyponymy, first-level hyponyms, remaining hyponyms.]

81 The current wordnets

82 Comparison of wordnets
In-depth comparison of major semantic fields:
- Comparison of the intersection of the associated ILI-records
- Distribution of the associated ILI-records over the different top ontology clusters
- Comparison of the hyponymy relations in the wordnets, projected on the associated ILI-records
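The first comparison, the intersection of associated ILI-records, comes down to set operations. A minimal Python sketch, assuming each wordnet is reduced to the set of ILI-records it links to; the identifiers and the function `intersection_share` are invented for illustration and are not real EuroWordNet data:

```python
# Invented ILI-record identifiers per wordnet.
ili_refs = {
    "ES": {"ili-1", "ili-2", "ili-3", "ili-4"},
    "IT": {"ili-2", "ili-3", "ili-5"},
    "NL": {"ili-1", "ili-2", "ili-3", "ili-6"},
}

def intersection_share(refs, languages):
    """Return (size of the intersection, percentage of the union of all wordnets)."""
    union = set().union(*refs.values())
    common = set.intersection(*(refs[lang] for lang in languages))
    return len(common), 100.0 * len(common) / len(union)

for combo in (["ES"], ["ES", "IT"], ["ES", "IT", "NL"]):
    size, share = intersection_share(ili_refs, combo)
    print(combo, size, f"{share:.1f}%")
```

The percentage is taken over the union of all wordnets in `refs`, so a different choice of reference set (for example including or excluding the English wordnet) gives a different column of figures.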


84 Intersection of the associated ILI-records
Columns per part of speech: total frequency, % of (WN, IT, NL, ES), % of (IT, NL, ES). Only the % of (IT, NL, ES) values are legible:
Wordnets       Nouns    Verbs
ES             75.6%    62.4%
IT             43.9%    62.7%
NL             65.4%    86.1%
(ES, IT)       33.5%    43.9%
(ES, NL)       45.4%    51.9%
(IT, NL)       30.3%    53.0%
(ES, IT, NL)   25.2%    40.9%

85 Distribution over the top ontology clusters


87 Comparison of the hyponymy relations, projected on the associated ILI-records
To be able to compare hyponymy chains, each word sense in the chain has been replaced by the ILI-records that are linked to these synsets, which gives the following result:
veranderen (change)
  bewegen (move, intransitive)
    bewegen (move, reflexive)
      voortbewegen (move location)
        verplaatsen (move from A to B)
          stijgen (move to a higher position)
            opstijgen (take off)
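The projection step can be sketched as a simple lookup. In this hedged Python example the sense suffixes (`_1`, `_2`), the table `ILI_LINKS` and both function names are invented, and the English glosses from the slide stand in for real ILI-record identifiers:

```python
# Dutch chain from the slide, most general sense first. Sense numbers are invented.
CHAIN_NL = ["veranderen_1", "bewegen_1", "bewegen_2", "voortbewegen_1",
            "verplaatsen_1", "stijgen_1", "opstijgen_1"]

# Hypothetical synset-to-ILI links; the glosses stand in for ILI-records.
ILI_LINKS = {
    "veranderen_1": "change",
    "bewegen_1": "move (intransitive)",
    "bewegen_2": "move (reflexive)",
    "voortbewegen_1": "move (location)",
    "verplaatsen_1": "move from A to B",
    "stijgen_1": "move to a higher position",
    "opstijgen_1": "take off",
}

def project_chain(chain, links):
    """Replace each word sense by its ILI-record; None marks an unlinked sense."""
    return [links.get(sense) for sense in chain]

def count_gaps(projected):
    """Count the senses in a projected chain that have no ILI-record."""
    return sum(1 for record in projected if record is None)

projected = project_chain(CHAIN_NL, ILI_LINKS)
print(" > ".join(projected))
print("gaps:", count_gaps(projected))
```

Once chains from different wordnets are expressed in ILI-records, they can be compared directly, and a chain with unlinked senses shows up as a partial chain with gaps.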

88 Coverage of complete noun chains projected over WN1.5 structure

89 Partial noun chains projected over WN1.5

90 Partial noun chains with 1 gap projected over WN1.5

91 Towards an efficient, condensed and universal index of sense-distinctions
- Independently of the wordnet structures in each language, we can manipulate the mapping across languages via the ILI.
- We can use the information of all the languages to correct incompleteness and inconsistencies of the individual resources.
- Ultimately, we should try to find a minimal and sufficient set of concepts to provide an efficient mapping.

92 Characteristics of the Inter-Lingual-Index
The Inter-Lingual-Index (ILI) is an unstructured fund of concepts with the sole purpose of providing an efficient mapping of senses across languages. Requirements:
1. An efficient level of granularity
   ILI: {break} "He broke the glass"; {break; cause to break}; {break; damage} "inflict damage upon"
   Wordnets: breken (Dutch), romper (Spanish), rompere (Italian)
2. A superset of the concepts that occur across languages
   ILI {cashier}: eq_hyperonym cassière (Dutch); eq_hyperonym cajera (Spanish)
   ILI {female cashier}: eq_synonym cassière (Dutch); eq_synonym cajera (Spanish)
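Because the ILI is an unstructured list, cross-language mapping is a lookup through shared ILI-records. A minimal Python sketch using the cashier example; the flat tuple layout of `LINKS` and the function `translations` are assumptions made for illustration, not the EuroWordNet database format:

```python
# (language, word, equivalence relation, ILI-record), after the cashier example.
LINKS = [
    ("NL", "cassière", "eq_hyperonym", "cashier"),
    ("ES", "cajera", "eq_hyperonym", "cashier"),
    ("NL", "cassière", "eq_synonym", "female cashier"),
    ("ES", "cajera", "eq_synonym", "female cashier"),
]

def translations(word, source, target, links=LINKS):
    """Words in `target` linked to the same ILI-record by the same relation."""
    records = {(rel, ili) for lang, w, rel, ili in links
               if lang == source and w == word}
    return sorted({w for lang, w, rel, ili in links
                   if lang == target and (rel, ili) in records})

print(translations("cassière", "NL", "ES"))
```

No relation between the wordnets themselves is stored; the two words meet only because both point at the same ILI-record.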

93 A Minimal and Efficient set of concepts
Globalizing the sense-differentiation:
- create metonymic clusters
- abstract from contextual specialization and grammatical perspectives
- abstract from part-of-speech realization
- abstract from productive and predictable meanings
Extending the Inter-Lingual-Index to become the superset of concepts occurring in two or more wordnets, only if:
- concepts are unpredictable and unproductive
- concepts cannot be linked exhaustively and uniquely to the ILI

94 Under-specified concepts: Metonymic clusters [Diagram: ILI cluster "club" with the records metonym# club: organization and metonym# club: building; the synsets {vereniging} NL, {club; verenigingsgebouw} NL and {club} EN are linked to it via eq_metonym and eq_synonym.]

95 Under-specified concepts: Generalization and Diathesis clusters [Diagram: ILI cluster "break" with the records diathesis# break: inchoative and diathesis# break: causative; the synsets {rompere} IT, {rompersi} IT, {breken; kapotgaan} NL and {breken; kapotmaken} NL are linked to it via eq_synonym and eq_diathesis.]

96 Under-specified for POS [Diagram: ILI cluster "depart" with the records xpos# departure and xpos# depart; the synsets {vertrekken V} NL, {vertrek N} NL, {depart V} EN and {departure N} EN are linked to it via eq_xpos_synonym and eq_synonym.]

97 Overview of equivalence relations to the ILI
Relation            POS    Sources : Targets     Example
eq_synonym          same   1 : 1                 auto : voiture car
eq_near_synonym     any    many : many           apparaat, machine, toestel : apparatus, machine, device
eq_hyperonym        same   many : 1 (usually)    citroenjenever : gin
eq_hyponym          same   (usually) 1 : many    dedo : toe, finger
eq_metonymy         same   many/1 : 1            universiteit, universiteitsgebouw : university
eq_diathesis        same   many/1 : 1            raken (cause), raken : hit
eq_generalization   same   many/1 : 1            schoonmaken : clean
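The relation overview lends itself to a small lookup table that can be used to check proposed links. A minimal Python sketch; the class `EqRelation`, the dictionary `RELATIONS` and the function `link_allowed` are hypothetical names, and only the POS column of the overview is enforced here:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EqRelation:
    name: str
    same_pos: bool      # True: source and target must share a part of speech
    cardinality: str    # sources : targets, copied from the overview

RELATIONS = {rel.name: rel for rel in [
    EqRelation("eq_synonym", True, "1:1"),
    EqRelation("eq_near_synonym", False, "many:many"),
    EqRelation("eq_hyperonym", True, "many:1 (usually)"),
    EqRelation("eq_hyponym", True, "(usually) 1:many"),
    EqRelation("eq_metonymy", True, "many/1:1"),
    EqRelation("eq_diathesis", True, "many/1:1"),
    EqRelation("eq_generalization", True, "many/1:1"),
]}

def link_allowed(relation, synset_pos, ili_pos):
    """Check the POS constraint of an equivalence relation."""
    rel = RELATIONS[relation]
    return (not rel.same_pos) or synset_pos == ili_pos

print(link_allowed("eq_synonym", "n", "n"))
print(link_allowed("eq_synonym", "n", "v"))
print(link_allowed("eq_near_synonym", "n", "v"))
```

The cardinality is kept as a descriptive string because the overview itself hedges it with "usually"; a stricter checker would need a firmer rule than the slide provides.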

98 Progress on restructuring the ILI
Clusters added manually and automatically, based on:
- structural properties of WN1.5
- mapping to other sources: Levin's classes, WN1.6
- cross-lingual mapping
[Table: clusters, words, word senses and synsets for Nouns and Verbs]
New ILIs from other wordnets have not yet been added. We estimated that hardly any new ILIs are needed for verbs; for nouns, about 30% of non-translated concepts (2,000 synsets, based on Dutch).

99 Effects of ILI-clusters
Intersection of ILI-references for Dutch, Spanish, Italian and English
Nouns: 2895 clustered synsets (4.6% of WN1.5 noun synsets); intersection increased from 7736 (23.8%) to 8183 (25.2%) out of the union of synsets
Verbs: 3839 clustered synsets (31.4% of WN1.5 verb synsets); intersection increased from 1632 (21.9%) to 3051 (40.9%) out of the union of 7455 synsets
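The verb percentages can be checked against the stated union of 7455 synsets. A short Python check; the helper `pct` is a name introduced here, and the noun figures are left out because the slide does not give the size of the noun union:

```python
def pct(part, whole):
    """Percentage of `whole` covered by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)

UNION_VERBS = 7455  # union of verb synsets, as stated on the slide

before = pct(1632, UNION_VERBS)   # intersection before clustering
after = pct(3051, UNION_VERBS)    # intersection after clustering
print(before, after, round(after - before, 1))
```

The result reproduces the 21.9% and 40.9% on the slide, a gain of 19 percentage points for verbs, against 1.4 points for nouns.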

100 Superset of all concepts
Procedure:
- Initially, the ILI will only contain WordNet1.5 synsets.
- A site that cannot find a proper equivalent among the available ILI-concepts will link the meaning to another ILI-record using a so-called complex-equivalence relation and will generate a potential new ILI-record:
  Dutch meaning: klunen | Definition: to walk on skates | Complex-equivalence: has_eq_hyperonym | Target concept: walk
- After a building phase, all potentially new ILI-records are collected and verified for overlap by one site.
- A proposal for updating the ILI is distributed to all sites and has to be verified.
- The ILI is updated and all sites have to reconsider the equivalence relations for all meanings that can potentially be linked to the new ILI-records.
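The procedure can be sketched as a small workflow. This Python example is a loose illustration under assumptions: the function names, the record layout and the duplicate check on definitions are invented, and the real verification was done by people at the sites, not by code.

```python
def link_meaning(meaning, definition, equivalent, hyperonym, ili, proposals):
    """Link a meaning to the ILI; propose a new ILI-record if no equivalent exists."""
    if equivalent in ili:
        return ("eq_synonym", equivalent)
    # No proper equivalent: use a complex-equivalence relation and record a proposal.
    proposals.append({"meaning": meaning, "definition": definition})
    return ("has_eq_hyperonym", hyperonym)

def collect_proposals(proposals):
    """Merge proposals with the same definition (a crude stand-in for the overlap check)."""
    merged = {}
    for proposal in proposals:
        merged.setdefault(proposal["definition"], proposal)
    return list(merged.values())

ili = {"walk", "skate"}   # toy ILI; initially it would hold WordNet1.5 synsets
proposals = []

link = link_meaning("klunen", "to walk on skates", None, "walk", ili, proposals)
print(link)

# After the building phase: verify, update the ILI, then reconsider the links.
for proposal in collect_proposals(proposals):
    ili.add(proposal["definition"])
relinked = link_meaning("klunen", "to walk on skates", "to walk on skates", "walk", ili, [])
print(relinked)
```

After the update the same meaning can be linked with a simple equivalence relation, which is why all sites have to revisit their links once new ILI-records are accepted.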

101 Filling gaps in the ILI
Types of gaps:
1. Genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin (non-productive, non-compositional).
2. Pragmatic gaps, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g. container, borrower, cajera (female cashier) (productive, compositional).
Universality of gaps: concepts occurring in at least 2 languages.

102 Productive and Predictable Lexicalizations exhaustively linked to the ILI
[Diagram]
ILI-records: kill, beat, kick, stamp; cashier, female; fish, young
Linked synsets: {doodslaan V} NL, {doodschoppen V} NL, {doodstampen V} NL, {totschlagen V} DE, {tottrampeln V} DE; {cajera N} ES, {casière} NL; {alevín N} ES
Relations shown: eq_has_hyperonym, eq_in_state

103 WordNet gaps across languages

104 Towards an efficient, condensed and universal index of sense-distinctions [Diagram labels: WordNet1.5, 90,000 concepts; metonymy/generalization clusters; universal core meanings (POS-independent, non-predictable); universal systematic polysemy and level of granularity; productive derivations and compounds linked exhaustively; language- and domain-specific lexicalizations that do not occur in a large variety of languages; language-specific realizations in grammatical forms.]

105 The EuroWordNet database
1. The actual wordnets in Flaim database format: an indexing and compression format of Novell.
2. Polaris (Louw 1997): re-implementation of the Novell ConceptNet toolkit (Díez-Orzas et al 1995), adapted to the EuroWordNet architecture. It can:
   - import and export wordnets or wordnet selections from/to ASCII files
   - resolve links for imported concepts
   - edit and add concepts, variants and relations in the wordnets
   - access the ILI and ontologies, and switch between the wordnets and ontologies via the ILI
   - extract, import and export clusters of senses based on relations
   - project synsets or clusters from one wordnet to another wordnet
   - compare clusters of synsets
   - import new or adapted ILI-records
   - update ILI-references to the updated ILI
3. Periscope (Cuypers and Adriaens 1997): a graphical interface for viewing the EuroWordNet database.

106 Global Wordnet Association
- Provide a standardized framework to link, compare and build complete wordnets for all the European languages and dialects.
- Initiate the development of wordnets in non-European languages.
- Develop more specific definitions, tests and procedures for evaluating and developing wordnets.
- Extend the specification of EuroWordNet to lexical units which are not yet covered (adjectives/adverbs, lexicalized phrases and multi-words).
- Develop (axiomatized) ontologies for Domains and World-Knowledge that can be shared by all languages via the ILI.
- Develop an efficient ILI for linking, sharing, consistency checking and cross-language technology applications. This ILI could function as a gold standard of sense-distinctions.
- Organize an annual or biannual workshop or conference.

107 2nd Global Wordnet Conference Location: Masaryk University, Brno (Czech Republic), January, ,

108 Other wordnet initiatives: Danish, Norwegian, Swedish, Portuguese, Arabic, Korean, Russian, Welsh, Basque, Catalan, Chinese, BalkaNet, IndoWordnet, Meaning

109 BalkaNet
Funded by the European Union as project IST year project:
Follows a strict EuroWordNet approach:
- Expanded set of base concepts
- Top-down building approach
EWN database extended with: Greek, Romanian, Serbian, Turkish, Bulgarian, Czech
Development of a new wordnet database system: VisDic

110 IndoWordnet
Current Wordnet development in India: Hindi and Marathi at IIT Bombay; Tamil at Anna University-K.B Chandrashekhar Research Centre (AU-KBC) Chennai and Tamil University Tanjavur; Gujarati at MS University Baroda; Oriya at Utkal University Bhubaneswar; and Bengali at IIT Kharagpur. The Hindi WordNet is at an advanced stage of development with about semantically linked synsets and with associated software and user interface.

111 IndoWordnet
By the end of 2003, a WordNet of 5000 synsets will be created for each Indian language. These will cover about 2000 of the most frequent content words in each language. Use will be made of the frequency-sorted wordlist available from the CIIL.
Language-specific WordNets are developed by the following institutions:
- CIIL, Mysore: Kannada, Kashmiri, Punjabi, Urdu, Himachali, Malayalam
- IIT Bombay: Hindi, Marathi and Konkani
- AU-KBC Chennai and Tamil University Tanjavur: Tamil and Malayalam
- University of Hyderabad: Telugu
- University of Baroda: Gujarati
- Utkal University Bhubaneswar: Oriya
- IIT Kharagpur: Bengali
Research groups have to be identified for building the WordNets of Assamese, Nepali and the languages of the North East.

112 Meaning: Developing Multilingual Web-scale Language Technologies

113 Meaning Objectives
IST Funded by the European Union as project IST year project: April April 2005
- Large-scale (Lexical) Knowledge Bases
- Automatic enrichment of EWN
- Mixed approach (KB + ML)
- Applied to Q/A, CLIR
- Problem: structural and lexical ambiguity

114 Meaning Approach
- Automatic collection of sense examples (Leacock et al. 98, Mihalcea and Moldovan 99)
- Large-scale WSD (Boosting, SVM, transductive methods)
- Large-scale Knowledge Acquisition (McCarthy 01, Agirre & Martinez 02)

115 Meaning Architecture [Diagram: a Multilingual Central Repository in the centre, connected to the Italian, Basque, Spanish, English and Catalan EWNs; each language has its own web corpus and the modules ACQ, WSD, PORT and UPLOAD.]

116 Meaning WP6: Word Sense Disambiguation
- A combination of unsupervised knowledge-based and supervised Machine Learning techniques that will provide a high-precision system that is able to tag running text with word senses.
- A system that acquires a huge number of examples per word from the web.
- The use of sophisticated linguistic information, such as syntactic relations, semantic classes, selectional restrictions, subcategorization information, domain, etc.
- Efficient margin-based Machine Learning algorithms.
- Novel algorithms that combine tagged examples with huge amounts of untagged examples in order to increase the precision of the system.

117 THE END...
