Guest lecture, Language Engineering Applications, February, 26th 2009, Leuven 1 From WordNet, to EuroWordNet, to the Global Wordnet Grid: anchoring languages.


1 From WordNet, to EuroWordNet, to the Global Wordnet Grid: anchoring languages to universal meaning
Piek Vossen, VU University Amsterdam

2 Overview
Wordnet, EuroWordNet
Global Wordnet Grid
Stevin project Cornetto
7th Framework project KYOTO

3 WordNet
http://wordnet.princeton.edu/
Lexical semantic database for English
Developed by George Miller and his team at Princeton University as the implementation of a mental model of the lexicon
Organized around the notion of a synset: a set of synonyms in a language that represent a single concept
Semantic relations hold between concepts (synsets), not between words
Currently covers over 117,000 concepts (synsets) and over 150,000 English words

4 Relational model of meaning
[Diagram with node labels: man, woman, boy, girl, cat, kitten, dog, puppy, animal]

5 Wordnet: a network of semantically related words
[Diagram centred on the synset {car; auto; automobile; machine; motorcar}]
Hyper(o)nyms: {conveyance; transport}, {vehicle}, {motor vehicle; automotive vehicle}
Hyponyms: {cruiser; squad car; patrol car; police car; prowl car}, {cab; taxi; hack; taxicab}
Meronyms: {bumper}, {car door}, {car window}, {car mirror}, {armrest}, {doorlock}, {hinge; flexible joint}
Hyponymy and meronymy relations are:
– transitive
– directed
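As a minimal sketch of what "transitive" and "directed" mean here, the hypernym links from the slide can be followed upward step by step. The child-to-parent dictionary below is an illustrative assumption (it uses one label per synset and assumes the order conveyance > vehicle > motor vehicle > car); it is not WordNet's actual storage format.

```python
# Toy hypernym links (child -> parent); one label stands in for each synset.
# The link order is an assumption read off the slide.
HYPERNYM = {
    "police car": "car",
    "taxi": "car",
    "car": "motor vehicle",
    "motor vehicle": "vehicle",
    "vehicle": "conveyance",
}


def hypernym_chain(synset):
    """Follow the directed hypernym links upward and return every ancestor."""
    chain = []
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        chain.append(synset)
    return chain


def is_a(specific, general):
    """Transitivity: 'specific' is a kind of 'general' if 'general' is any ancestor."""
    return general in hypernym_chain(specific)


print(hypernym_chain("taxi"))
print(is_a("taxi", "vehicle"), is_a("vehicle", "taxi"))
```

Because the relation is directed, is_a("vehicle", "taxi") is false even though is_a("taxi", "vehicle") is true.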

6 Wordnet Semantic Relations
WN 1.5 starting point
The synset as a weak notion of synonymy: "two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value" (Miller et al. 1993)
Relations between synsets:
Relation     Type                     Example
HYPONYMY     noun-to-noun             car / vehicle
             verb-to-verb             walk / move
MERONYMY     noun-to-noun             head / nose
ANTONYMY     adjective-to-adjective   good / bad
             verb-to-verb             open / close
ENTAILMENT   verb-to-verb             buy / pay
CAUSE        verb-to-verb             kill / die

7 Wordnet Data Model
[Diagram linking the vocabulary of a language to concepts and relations]
Vocabulary of a language: bank, fiddle, violin, violist, fiddler, string
Concepts:
rec: 12345 - financial institute
rec: 54321 - side of a river
rec: 9876 - small string instrument
rec: 65438 - musician playing violin
rec: 42654 - musician
rec: 25876 - string instrument
rec: 35576 - string of instrument
rec: 29551 - underwear
Relations: type-of, part-of
[Link labels: sense numbers 1 and 2; annotations: polysemy & synonymy, polysemy]
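The data model on this slide can be sketched as two small tables: concept records, and links from words to records. The record numbers and glosses come from the slide; which word points to which record is an assumption made for illustration, since the arrows of the diagram are lost in this transcript.

```python
from collections import defaultdict

# Concept records from the slide (record id -> gloss).
CONCEPTS = {
    12345: "financial institute",
    54321: "side of a river",
    9876: "small string instrument",
    65438: "musician playing violin",
    35576: "string of instrument",
    29551: "underwear",
}

# Word -> concept records. These links are assumed, not taken from the slide.
SENSES = {
    "bank": [12345, 54321],
    "fiddle": [9876],
    "violin": [9876],
    "violist": [65438],
    "fiddler": [65438],
    "string": [35576, 29551],
}


def synsets(senses):
    """Group words by concept record: words sharing a record are synonyms."""
    members = defaultdict(set)
    for word, records in senses.items():
        for record in records:
            members[record].add(word)
    return dict(members)


def polysemous_words(senses):
    """Words linked to more than one concept record are polysemous."""
    return sorted(word for word, records in senses.items() if len(records) > 1)


print(synsets(SENSES)[9876])
print(polysemous_words(SENSES))
```

Synonymy and polysemy fall out of the same word-to-record table: one reads it by record, the other by word.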

8 Some observations on Wordnet
Synsets are more compact representations of concepts than word meanings in traditional lexicons
Synonyms and hypernyms are substitutional variants:
– begin – commence
– I once had a canary. The bird got sick. The poor animal died.
Hyponymy and meronymy chains are important transitive relations for predicting properties and explaining textual properties: object -> artifact -> vehicle -> 4-wheeled vehicle -> car
Strict separation of parts of speech, although concepts are closely related (bed – sleep) and similar (dead – death)
Lexicalization patterns reveal important mental structures

9 Lexicalization patterns
25 unique beginners
[Diagram: hierarchy with node labels entity, object, organism, animal, bird, canary, common canary, crocodile, dog, plant, tree, flower, rose, artifact, building, church, abbey, waste, garbage, threat]
Basic level concepts: balance of two principles:
– predict most features
– apply to most subclasses
Basic level:
– where most concepts are created
– amalgamates most parts
– most abstract level at which a picture can be drawn

10 Wordnet top level

11 Meronymy & pictures beak tail leg

12 Meronymy & pictures

13 Wordnet 3.0 statistics
POS         Unique Strings   Synsets   Total Word-Sense Pairs
Noun               117,798    82,115                  146,312
Verb                11,529    13,767                   25,047
Adjective           21,479    18,156                   30,002
Adverb               4,481     3,621                    5,580
Totals             155,287   117,659                  206,941

14 Wordnet 3.0 statistics
POS         Monosemous Words and Senses   Polysemous Words   Polysemous Senses
Noun                            101,863             15,935              44,449
Verb                              6,277              5,252              18,770
Adjective                        16,503              4,976              14,399
Adverb                            3,748                733               1,832
Totals                          128,391             26,896              79,450

15 Wordnet 3.0 statistics
POS         Average Polysemy Including Monosemous Words   Average Polysemy Excluding Monosemous Words
Noun                                               1.24                                          2.79
Verb                                               2.17                                          3.57
Adjective                                          1.40                                          2.71
Adverb                                             1.25                                          2.50
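The "including monosemous words" averages appear to follow directly from the counts on the Wordnet 3.0 statistics slides: word-sense pairs divided by unique strings. A small sketch, assuming that reading of the columns, reproduces the four figures:

```python
# (unique strings, word-sense pairs) per part of speech, from the statistics slides.
WN30_COUNTS = {
    "noun": (117_798, 146_312),
    "verb": (11_529, 25_047),
    "adjective": (21_479, 30_002),
    "adverb": (4_481, 5_580),
}


def average_polysemy(unique_strings, word_sense_pairs):
    """Average number of senses per word form, monosemous words included."""
    return word_sense_pairs / unique_strings


for pos, (strings, pairs) in WN30_COUNTS.items():
    print(f"{pos:10s} {average_polysemy(strings, pairs):.2f}")
```

The "excluding monosemous words" column would use polysemous senses divided by polysemous words in the same way.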

16 http://www.visuwords.com

18 Usage of Wordnet
Most widely used database in language technology
Enormous impact on language technology development
Large
Free and downloadable
English

19 Usage of Wordnet
Improve recall of text-based analysis:
– Query -> Index
– Synonyms: commence – begin
– Hypernyms: taxi -> car
– Hyponyms: car -> taxi
– Meronyms: trunk -> elephant
– Lexical entailments: gun -> shoot
Inferencing:
– what things can burn?
Expression in language generation and translation:
– alternative words and paraphrases
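The query-to-index step can be illustrated with a toy query expander. The relation tables below contain only the word pairs named on the slide; the function name and the dictionary layout are assumptions for illustration, not an existing Wordnet API.

```python
# Tiny relation tables built from the slide's examples.
SYNONYMS = {"commence": {"begin"}, "begin": {"commence"}}
HYPERNYMS = {"taxi": {"car"}}
HYPONYMS = {"car": {"taxi"}}


def expand(query_terms, *relations):
    """Add every term reachable in one step through the given relation tables."""
    expanded = set(query_terms)
    for term in query_terms:
        for relation in relations:
            expanded |= relation.get(term, set())
    return expanded


print(expand(["taxi"], SYNONYMS, HYPERNYMS))
print(expand(["begin"], SYNONYMS))
```

A query for "taxi" then also matches documents indexed under "car", which raises recall at some cost in precision.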

20 Improve recall
Information retrieval:
– effective on small databases without redundancy, e.g. image captions, video text
Text classification:
– expand small training sets
– reduce training effort
Question & Answer systems:
– question classification: who, where, what, when
– match answers to question types

21 Improve recall
Anaphora resolution:
– The girl fell off the table. She ...
– The glass fell off the table. It ...
Coreference resolution:
– When he moved the furniture, the antique table got damaged.
Information extraction (unstructured text to structured databases):
– generic forms or patterns "vehicle" -> text with specific cases "car"

22 Improve recall
Summarizers:
– Sentence selection based on word counts -> concept counts
– Avoid repetition in summary -> language generation, pick out another synonym or hypernym
Limited inferencing: detect locations, people, organisations, etc.

23 Enabling technologies
Semantic similarity: what sentences or expressions are semantically similar?
Semantic relatedness and textual entailment: smoke entails fire, fire entails damage
Word-sense disambiguation
Erwin Marsi, University of Tilburg, http://daeso.uvt.nl/demos/index.html


26 Recall & Precision
Query: cell phone
[Venn diagram: the set of found documents and the set of relevant documents overlap in an intersection; labels: mobile phones, nerve cell, neuron, police cell, jail]
recall = intersection / relevant
precision = intersection / found
Recall < 20% for basic search engines! (Blair & Maron 1985)
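The two formulas on this slide translate directly into set operations. The document ids below are made up for the example; only the formulas come from the slide.

```python
def precision_recall(found, relevant):
    """Return (precision, recall) for a set of found and a set of relevant documents."""
    found, relevant = set(found), set(relevant)
    intersection = found & relevant
    precision = len(intersection) / len(found) if found else 0.0
    recall = len(intersection) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result for the query "cell phone": four documents found,
# five documents relevant, two of them in both sets.
found = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5", "d6", "d7"}
print(precision_recall(found, relevant))
```

Documents about nerve cells or police cells would enlarge the found set without enlarging the intersection, which lowers precision; relevant documents that only say "mobile phone" are missed, which lowers recall.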

27 Many others
Data sparseness for machine learning: hapaxes can be replaced by semantic classes that match classes from the training set
Use redundancy for more robustness: spelling correction and speech recognition can build semantic expectations using Wordnet and make better choices
Sentiment and opinion mining
Natural language learning

28 EuroWordNet
The development of a multilingual database with wordnets for several European languages
Funded by the European Commission, DG XIII, Luxembourg, as projects LE2-4003 and LE4-8328
March 1996 - September 1999
2.5 million euro
http://www.hum.uva.nl/~ewn
http://www.illc.uva.nl/EuroWordNet/finalresults-ewn.html

29 EuroWordNet
Languages covered:
– EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian
– EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian
Size of vocabulary:
– EuroWordNet-1: 30,000 concepts - 50,000 word meanings
– EuroWordNet-2: 15,000 concepts - 25,000 word meanings
Type of vocabulary:
– the most frequent words of the languages
– all concepts needed to relate more specific concepts

30 EuroWordNet Model
I = Language Independent link
II = Link from Language Specific to Inter-Lingual-Index
III = Language Dependent Link
[Diagram: four Lexical Items Tables, each with language-internal links (III), are linked (II) to the ILI-record {drive} in the Inter-Lingual-Index; the index is linked (I) to the Ontology (2ndOrderEntity, Location, Dynamic) and to Domains (Traffic, Air, Road)]
Lexical Items Table (Italian): cavalcare, andare, muoversi, guidare
Lexical Items Table (Dutch): bewegen, gaan, rijden, berijden
Lexical Items Table (English): drive, ride, move, go
Lexical Items Table (Spanish): cabalgar, jinetear, conducir, mover, transitar

31 Differences in relations between EuroWordNet and WordNet
Added features to relations
Cross-Part-Of-Speech relations
New relations to differentiate shallow hierarchies
New interpretations of relations

32 EWN Relationship Labels
{airplane}
  HAS_MERO_PART: conj1 {door}
  HAS_MERO_PART: conj2 disj1 {jet engine}
  HAS_MERO_PART: conj2 disj2 {propeller}
{door}
  HAS_HOLO_PART: disj1 {car}
  HAS_HOLO_PART: disj2 {room}
  HAS_HOLO_PART: disj3 {entrance}
Default Interpretation: non-exclusive disjunction
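One way to read the conj/disj labels is as a small boolean formula: relations with different conj indices must all hold, and relations sharing a conj index but carrying different disj indices are alternatives. The sketch below encodes the airplane example under that reading; the tuple layout and function name are assumptions for illustration, not the EuroWordNet database format.

```python
from collections import defaultdict

# (target synset, conj index, disj index or None), read off the airplane example.
AIRPLANE_PARTS = [
    ("door", 1, None),
    ("jet engine", 2, 1),
    ("propeller", 2, 2),
]


def satisfies(parts_present, labelled_relations):
    """True if every conj group has at least one of its alternatives present."""
    groups = defaultdict(list)
    for target, conj, _disj in labelled_relations:
        groups[conj].append(target)
    return all(
        any(target in parts_present for target in alternatives)
        for alternatives in groups.values()
    )


print(satisfies({"door", "propeller"}, AIRPLANE_PARTS))
print(satisfies({"door"}, AIRPLANE_PARTS))
```

Because the default disjunction is non-exclusive, an airplane with both a jet engine and a propeller also satisfies the labels.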

33 Overview of the Language Internal relations in EuroWordnet
Same Part of Speech relations:
HYPERONYMY/HYPONYMY        car - vehicle
ANTONYMY                   open - close
HOLONYMY/MERONYMY          head - nose
NEAR_SYNONYMY              apparatus - machine
Cross-Part-of-Speech relations:
XPOS_NEAR_SYNONYMY         dead - death; to adorn - adornment
XPOS_HYPERONYMY/HYPONYMY   to love - emotion
XPOS_ANTONYMY              to live - dead
CAUSE                      die - death
SUBEVENT                   buy - pay; sleep - snore
ROLE/INVOLVED              write - pencil; hammer - hammer
STATE                      the poor - poor
MANNER                     to slurp - noisily
BELONG_TO_CLASS            Rome - city

34 Co_Role relations
criminal               CO_AGENT_PATIENT       victim
novel writer / poet    CO_AGENT_RESULT        novel / poem
dough                  CO_PATIENT_RESULT      pastry / bread
photographic camera    CO_INSTRUMENT_RESULT   photo
guitar player          HAS_HYPERONYM          player
                       CO_AGENT_INSTRUMENT    guitar
player                 HAS_HYPERONYM          person
                       ROLE_AGENT             to play music
                       CO_AGENT_INSTRUMENT    musical instrument
to play music          HAS_HYPERONYM          to make
                       ROLE_INSTRUMENT        musical instrument
guitar                 HAS_HYPERONYM          musical instrument
                       CO_INSTRUMENT_AGENT    guitar player

35 Horizontal & vertical semantic relations
[Diagram around the synset "patient". Node labels: patient; chronic patient; mental patient; child; cure; treat; doctor; child doctor; disease, disorder; stomach disease, kidney disorder; physiotherapy, medicine, etc.; hospital, etc. Relation labels: HYPONYM, STATE, ρ-PROCEDURE, ρ-LOCATION, ρ-CAUSE, ρ-PATIENT, ρ-AGENT, co-ρ-AGENT-PATIENT]

36 The Multilingual Design
Inter-Lingual-Index: unstructured fund of concepts to provide an efficient mapping across the languages
Index-records are mainly based on WordNet synsets and consist of synonyms, glosses and source references
Various types of complex equivalence relations are distinguished
Equivalence relations from synsets to index records: not on a word-to-word basis
Indirect matching of synsets linked to the same index items
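Indirect matching through the Inter-Lingual-Index can be sketched as a join on index records: two synsets in different languages match when they are linked to the same record. The link table below reuses words from the EuroWordNet model slide, but the specific links and the "ili:" identifiers are assumptions for illustration.

```python
from collections import defaultdict

# (language, synset label, ILI record). The links are illustrative only.
EQUIVALENCE_LINKS = [
    ("nl", "rijden", "ili:drive"),
    ("it", "guidare", "ili:drive"),
    ("es", "conducir", "ili:drive"),
    ("en", "drive", "ili:drive"),
    ("nl", "bewegen", "ili:move"),
    ("en", "move", "ili:move"),
]


def build_index(links):
    """Map each ILI record to the (language, synset) pairs linked to it."""
    index = defaultdict(set)
    for lang, synset, record in links:
        index[record].add((lang, synset))
    return index


def match(lang, synset, target_lang, links):
    """Find target-language synsets that share an ILI record with the source synset."""
    index = build_index(links)
    results = set()
    for link_lang, link_synset, record in links:
        if (link_lang, link_synset) == (lang, synset):
            results |= {s for l, s in index[record] if l == target_lang}
    return sorted(results)


print(match("nl", "rijden", "es", EQUIVALENCE_LINKS))
```

No wordnet needs a direct link to any other wordnet: adding a language only requires links from its synsets to the index.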

37 Equivalent Near Synonym
1. Multiple Targets (1:many)
Dutch wordnet: schoonmaken (to clean) matches 4 senses of clean in WordNet1.5:
– make clean by removing dirt, filth, or unwanted substances from
– remove unwanted substances from, such as feathers or pits, as of chickens or fruit
– remove in making clean; "Clean the spots off the rug"
– remove unwanted substances from - (as in chemistry)
2. Multiple Sources (many:1)
Dutch wordnet: versiersel near_synonym versiering
ILI-Record: decoration
3. Multiple Targets and Sources (many:many)
Dutch wordnet: toestel near_synonym apparaat
ILI-records: machine; device; apparatus; tool

38 Equivalent Hyperonymy
Typically used for gaps in English WordNet:
Genuine, cultural gaps for things not known in English culture:
– Dutch: klunen, to walk on skates over land from one frozen water to the other
Pragmatic gaps, in the sense that the concept is known but is not expressed by a single lexicalized form in English:
– Dutch: kunststof = artifact substance, as distinct from artifact object

39 EuroWordNet statistics

40 Wordnets as semantic structures
Wordnets are unique language-specific structures:
– same organizational principles: synset structure and same set of semantic relations
– different lexicalizations
– differences in synonymy and homonymy:
  "decoration" in English versus "versiersel/versiering" in Dutch
  "bank" in English (money/river) versus "bank" in Dutch (money/furniture)
BUT also different relations for similar synsets

41 Autonomous & Language-Specific
[Diagram comparing two hierarchies]
Wordnet1.5: object; natural object (an object occurring naturally); artifact, artefact (a man-made object); instrumentality; block; body; container; device; implement; tool; instrument; bag; spoon; box
Dutch Wordnet: voorwerp {object}; lepel {spoon}; werktuig {tool}; tas {bag}; bak {box}; blok {block}; lichaam {body}

42 Linguistic versus Artificial Ontologies
Artificial ontology:
– better control or performance, or a more compact and coherent structure
– introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool)
– neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise)
What properties can we infer for spoons?
spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking

43 Linguistic versus Artificial Ontologies
Linguistic ontology:
– exactly reflects the relations between all the lexicalized words and expressions in a language
– captures valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language
What words can be used to name spoons?
spoon -> object, tableware, silverware, merchandise, cutlery

44 Wordnets versus ontologies
Wordnets:
– autonomous language-specific lexicalization patterns in a relational network
– usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense disambiguation
Ontologies:
– data structure with formally defined concepts
– usage: making semantic inferences

45 From EuroWordNet to Global WordNet
EuroWordNet ended in 1999
Global Wordnet Association was founded in 2000 to maintain the framework: http://www.globalwordnet.org
Currently, wordnets exist for more than 50 languages, including:
– Arabic, Bantu, Basque, Chinese, Bulgarian, Estonian, Hebrew, Icelandic, Japanese, Kannada, Korean, Latvian, Nepali, Persian, Romanian, Sanskrit, Tamil, Thai, Turkish, Zulu ...
Many languages are genetically and typologically unrelated

46 Global Wordnet Association
http://www.globalwordnet.org
[Diagram of wordnet projects; project labels: EuroWordNet, BalkaNet]
Languages shown: Danish, Norwegian, Swedish, Portuguese, Korean, Russian, Basque, Catalan, Thai, Arabic, Polish, Welsh, Chinese, 20 Indian languages, Brazilian Portuguese, Hebrew, Latvian, Persian, Kurdish, Avestan, Baluchi, Hungarian, English, German, Spanish, French, Italian, Dutch, Czech, Estonian, Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian

47 Some downsides of the EuroWordNet model
Construction is not done uniformly
Coverage differs
Not all wordnets can communicate with one another, i.e. they are linked to different versions of the English wordnet
Proprietary rights restrict free access and usage
A lot of semantics is duplicated
Complex and obscure equivalence relations due to linguistic differences between English and other languages

48 Next step: Global WordNet Grid
[Diagram: the word lists of eight languages are each linked to a shared Inter-Lingual Ontology with the concepts Object, Device and TransportDevice]
English Words: vehicle, car, train
Czech Words: dopravní prostředník, auto, vlak
French Words: véhicule, voiture, train
Estonian Words: liiklusvahend, auto, killavoor
German Words: Fahrzeug, Auto, Zug
Spanish Words: vehículo, auto, tren
Italian Words: veicolo, auto, treno
Dutch Words: voertuig, auto, trein

49 GWNG: Main Features
Construct separate wordnets for each Grid language
Contributors from each language encode the same core set of concepts plus culture/language-specific ones
Synsets (concepts) are mapped crosslinguistically via an ontology instead of just the English Wordnet

50 The Ontology: Main Features
List of concepts is not just based on the lexicon of a particular language (unlike in EuroWordNet) but uses ontological observations
Ontology contains only upper and mid-level concepts
Concepts are related in a type hierarchy
Concepts are defined with axioms

51 The Ontology: Main Features
Minimal set of concepts (reductionist view):
– to express equivalence across languages
– to support inferencing
Ontology need not and cannot provide a concept for all concepts found in the Grid languages:
– lexicalization in a language is not sufficient to warrant inclusion in the ontology
– lexicalization in all or many languages may be sufficient
Ontological observations will be used to define the concepts in the ontology
Ontological framework still must be powerful enough to encode all concepts that are lexically expressed in any of the Grid languages
Additional lexicalized concepts are related to the ontology through complex relations

52 Ontological observations
Identity criteria as used in OntoClean (Guarino & Welty 2002):
– rigidity: to what extent are properties true for entities in all worlds? You are always a human, but you can be a student for a short while.
– essence: what properties are essential for an entity? Shape is essential for a statue but not for the clay it is made of.
– unicity: what represents a whole and what entities are parts of these wholes? An ocean is a whole but the water it contains is not.

53 Type-role distinction
Current WordNet treatment, hyponyms of dog:
lapdog:1 # toy dog:1, toy:4 # hunting dog:1 # working dog:1, etc.
dalmatian:2, coach dog:1, carriage dog:1 # Leonberg:1 # Newfoundland:1 # poodle:1, poodle dog:1, etc.
(1) a husky is a kind of dog (type)
(2) a husky is a kind of working dog (role)
What's wrong? (2) is defeasible, (1) is not:
*This husky is not a dog
This husky is not a working dog

54 Ontology and lexicon
Hierarchy of disjunct types: Canine, with subtypes PoodleDog; NewfoundlandDog; GermanShepherdDog; Husky
Lexicon:
– NAMES for TYPES: {poodle}EN, {poedel}NL, {pudoru}JP
  (instance x Poodle)
– LABELS for ROLES: {watchdog}EN, {waakhond}NL, {banken}JP
  ((instance x Canine) and (role x GuardingProcess))

55 Ontology and lexicon
Hierarchy of disjunct types: River; Clay; etc.
Lexicon:
– NAMES for TYPES: {river}EN, {rivier, stroom}NL
  (instance x River)
– LABELS for dependent concepts:
  {rivierwater}NL (water from a river => water is not a unit)
  ((instance x water) and (instance y River) and (portion x y))
  {kleibrok}NL (irregularly shaped piece of clay => non-essential)
  ((instance x Object) and (instance y Clay) and (portion x y) and (shape x Irregular))

56 KIF expression for gender marking
{teacher}EN ((instance x Human) and (agent x TeachingProcess))
{Lehrer}DE ((instance x Man) and (agent x TeachingProcess))
{Lehrerin}DE ((instance x Woman) and (agent x TeachingProcess))

57 KIF expression for perspective
sell: subj(x), direct obj(z), indirect obj(y)
versus
buy: subj(y), direct obj(z), indirect obj(x)
(and (instance x Human) (instance y Human) (instance z Entity) (instance e FinancialTransaction) (source x e) (destination y e) (patient e))
The same process, but a different perspective through subject and object realization: marry has two verbs in Russian; apprendre in French can mean both teach and learn
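The point that sell and buy describe one process from two perspectives can be made concrete by mapping both verbs onto the same event record. The dataclass and the function below are a hypothetical illustration of the argument mapping on this slide, not part of any KIF or SUMO tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    source: str       # x: the party handing over the goods
    destination: str  # y: the party receiving the goods
    patient: str      # z: the thing transferred


def from_clause(verb, subj, direct_obj, indirect_obj):
    """Map the grammatical roles of 'sell' or 'buy' onto one shared event frame."""
    if verb == "sell":
        return Transaction(source=subj, destination=indirect_obj, patient=direct_obj)
    if verb == "buy":
        return Transaction(source=indirect_obj, destination=subj, patient=direct_obj)
    raise ValueError(f"no perspective mapping for verb: {verb}")


sold = from_clause("sell", subj="Ann", direct_obj="a bike", indirect_obj="Bob")
bought = from_clause("buy", subj="Bob", direct_obj="a bike", indirect_obj="Ann")
print(sold == bought)
```

Both clauses yield the same record, so the ontology needs only one process; the difference between the verbs lives in the lexicon's role mapping.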

58 Aspectual variants
Slavic languages: two members of a verb pair for an ongoing event and a completed event
English: can mark perfectivity with particles, as in the phrasal verbs eat up and read through
Romance languages: mark aspect by verb conjugations on the same verb
Dutch: verbs with marked aspect can be created by prefixing a verb with door: doorademen, dooreten, doorfietsen, doorlezen, doorpraten (continue to breathe/eat/bike/read/talk)
These verbs are restrictions on phases of the same process
This does NOT warrant the extension of the ontology with separate processes for each aspectual variant

59 Kinship relations in Arabic
عَم (Eam~): father's brother, paternal uncle
خَال (xaAl): mother's brother, maternal uncle
عَمَّة (Eam~ap): father's sister, paternal aunt
خَالَة (xaAlap): mother's sister, maternal aunt

60 Kinship relations in Arabic
شَقِيقَة ($aqiyqap): full sister, sister on the paternal and maternal side (as distinct from أُخْت (>uxot): 'sister', which may refer to a sister from the paternal or maternal side, or both sides)
ثَكْلان (vakolAna): father bereaved of a child (as opposed to يَتِيم (yatiym), or يَتِيمَة (yatiymap) for feminine: 'orphan', a person whose father or mother died or both father and mother died)
ثَكْلَى (vakolaYa): mother bereaved of a child (as opposed to يَتِيم, or يَتِيمَة for feminine: 'orphan', a person whose father or mother died or both father and mother died)

61 Complex Kinship concepts
father's brother, paternal uncle
WORDNET: paternal uncle => uncle => brother of ....????
ONTOLOGY: (=> (paternalUncle ?P ?UNC) (exists (?F) (and (father ?P ?F) (brother ?F ?UNC))))
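The ontology axiom says that a paternal uncle of ?P is a brother of some ?F who is the father of ?P. As a rough sketch, the same composition of relations can be run over a tiny fact base. The names and dictionaries are invented for the example, and the code treats the axiom as a definition, which is stronger than the one-way implication on the slide.

```python
# Hypothetical facts: person -> father, and person -> set of brothers.
FATHER = {"Omar": "Hassan"}
BROTHERS = {"Hassan": {"Khalid"}}


def paternal_uncles(person, father=FATHER, brothers=BROTHERS):
    """Compose the two relations: find the father, then return his brothers."""
    f = father.get(person)
    if f is None:
        return set()
    return set(brothers.get(f, set()))


print(paternal_uncles("Omar"))
```

A wordnet hierarchy alone cannot express "brother of whom"; the axiom can, because it introduces the intermediate variable for the father.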

62 Universality as evidence
The English verb cut abstracts from the precise process, but there are troponyms that implicate the manner:
– snip and clip imply scissors; chop and hack imply a large knife or an axe
In Dutch there is no general verb, only specific verbs: knippen 'clip, snip, cut with scissors or a scissor-like tool', snijden 'cut with a knife or knife-like tool', hakken 'chop, hack, cut with an axe or similar tool'
If lexicalization of the specific process is more universal, this can be seen as evidence that the specific processes should be listed in the ontology and not the generic verb

63 Open Questions/Challenges
What is a word, i.e., a lexical unit?
What is the status of complex lexemes like English lightning rod, word of mouth, find out, kick the bucket?
What is a semantic unit, i.e. a concept?

64 Open Questions/Challenges
Is there a core inventory of concepts that are universally encoded? If so, what are these concepts?
How can crosslinguistic equivalence be verified?
Is there systematicity to the language-specific extensions?
What are the lexicalization patterns of individual languages?
Are lexical gaps accidental or systematic?

65 Coverage: what belongs in a universal lexical database?
Formal, linguistic criteria for inclusion
Informal, cultural criteria
Both are difficult to define and apply!

66 Advantages of the Global Wordnet Grid
Shared and uniform world knowledge:
– universal inferencing
– uniform text analysis and interpretation
More compact and less redundant databases
Clearer notion of how languages map to the knowledge:
– better criteria for expressing knowledge
– better criteria for understanding variation

67 CORNETTO (STEVIN TENDER) Combinatorial and Relational Network as Toolkit for Dutch Language Technology http://www2.let.vu.nl/oz/cornetto

68 Goals of the Cornetto project
Goal: to develop a lexical semantic database for Dutch:
– 40K entries: generic and central part of the language
– Rich horizontal and vertical semantic relations
– Combinatoric information
– Ontological information
Method: merge data from Dutch Wordnet (DWN) and Referentie Bestand Nederlands (RBN)
April 2006 - March 2008, extended to July 2008
The final results of the Cornetto project are available through the TST-centrale of the Nederlandse Taalunie (free for research)

69 Project overview
[Diagram. Source labels: Dutch Wordnet, Referentie Bestand, English Wordnet, SUMO (KIF), DOLCE (KIF), WN-DOMAINS. Process labels: Align/Merge, Acquisition Toolkit, Corpus, Validation, Editing. Result: Cornetto, with Ontology: Dolce, Sumo]
Cornetto entry: LU/Synset, Pos, DWN data, RBN data, SUMO-pointer, PWN-pointer, Domain

70 Database
Collections:
– Lexical Units (LU): mainly derived from the RBN
– Synsets (SY): mainly derived from DWN
– Terms (TE) and axioms: mainly derived from SUMO and MILO
– Domains (DM): based on Wordnet domains
Mappings:
– LU–SY
– SY–SY (within Dutch and from Dutch to English)
– SY–TE
– SY–DM

71 Data Organization
[Diagram]
Lexical Unit (LU): corresponds to a word-meaning pair; holds form, morphology, syntax, semantics, pragmatics, usage examples
Synset: synonyms; models meaning relations (internal relations)
Linked resources: Princeton Wordnet; Domains; Spanish, Czech, German, French, Korean and Arabic Wordnets; SUMO and MILO (collection of terms and axioms)

72 Database
Implemented in DebVisDic:
– http://deb.fi.muni.cz/index.php
Demo version available: http://www2.let.vu.nl/oz/cornetto/demo.html


75 Overview of results
                       ALL       NOUNS     VERBS     ADJ       ADV     OTHERS
Synsets                70,371    52,847    9,017     7,689     220     598
Lexical Units          119,108   85,449    17,314    15,712    475     158
Lemmas (form+pos)      92,686    70,315    9,051     12,288    1,032   n.a.
Synonyms in synsets    103,762   75,476    14,138    12,914    408     826
CID records            104,556   76,537    14,214    13,132    483     190
Synonyms per synset    1.47      1.43      1.57      1.68      1.85    1.38
Senses per lemma       1.29      1.22      1.91      1.28      0.46    n.a.

76 Mapping relations
Status               Count     Share
No status value      55,976    53.54%
Status value         48,580    46.46%
  manual             10,108     9.67%
  B-95                4,944     4.73%
  BM-90               4,215     4.03%
  D-55 adjectives       171     0.16%
  D-58 verbs            774     0.74%
  D-75 nouns          2,085     1.99%
  M-97               25,236    24.14%
  RESUME-75           1,047     1.00%
TOTAL               104,556

DWN and RBN matches  35,289    37.74%
LUs only in DWN      54,983    58.81%
LUs only in RBN       3,223     3.45%
Total                93,495

77 Overview of synset data
Synsets                     70,371
Synonyms                    103,762
Internal relations          153,370
Equivalence relations       86,830
Definitions                 35,620
WordNet Domains mappings    93,822
SUMO mappings               70,654
Base Level Concepts         8,828

78 English Wordnet to SUMO mapping through two-place relations
=  the synset is equivalent to the SUMO concept: circle (= Circle)
+  the synset is subsumed by the SUMO concept: branch (+ PlantBranch)
@  the synset is an instance of the SUMO concept: Amsterdam (@ City)

79 Cornetto SUMO Mappings through triplets
Equality:
– cirkel: (=, 0, Circle) or (=,, Circle)
Subsumption:
– tak: (+, 0, PlantBranch) or (+,, PlantBranch)
Related:
– blad: (part, 0, PlantBranch) or (part,, PlantBranch)
Axiomatized:
– theewater: (instance, 0, Water) (instance, 1, Making) (instance, 2, Tea) (resource, 0, 1) (result, 2, 1)
  OR (instance,, Water) (instance, 1, Making) (instance, 2, Tea) (resource,, 1) (result, 2, 1)
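The triplet notation can be read as a conjunction of relations over numbered variables, where 0 stands for the referent of the synset itself. The sketch below turns the theewater triplets into a KIF-like string; the rendering and the variable naming (?x0, ?x1, ...) are assumptions for illustration and not the format used by the Cornetto database.

```python
# The theewater example from the slide: integers are variables, strings are SUMO terms.
THEEWATER = [
    ("instance", 0, "Water"),
    ("instance", 1, "Making"),
    ("instance", 2, "Tea"),
    ("resource", 0, 1),
    ("result", 2, 1),
]


def term(value):
    """Render an integer as a variable and anything else as a SUMO term name."""
    return f"?x{value}" if isinstance(value, int) else value


def to_kif(triplets):
    """Join the triplets into one conjunction in a KIF-like syntax."""
    clauses = [f"({relation} {term(a)} {term(b)})" for relation, a, b in triplets]
    return "(and " + " ".join(clauses) + ")"


print(to_kif(THEEWATER))
```

Read aloud: theewater is water (?x0) that is the resource of a making process (?x1) whose result is tea (?x2).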

80 Ontology mapping: female/male variants
teacher (a person whose occupation is teaching)
SUMO: equivalent to Teacher
In Dutch: no neutral form
leraar (male teacher): (+,, Teacher), (+,, Man)
lerares (female teacher): (+,, Teacher), (+,, Woman)

81 KYOTO (ICT-211423) Yielding Ontologies for Transition-Based Organization FP7: Intelligent Content and Semantics http://www.kyoto-project.eu/

82 KYOTO (ICT-211423) Overview
Title: Yielding Ontologies for Transition-Based Organization
Funded:
– 7th Framework Program-ICT of the European Union: Intelligent Content and Semantics
– Taiwan and Japan funded by national grants
Goal:
– Platform for knowledge sharing across languages and cultures
– Enables knowledge transition and information search across different target groups, transgressing linguistic, cultural and geographic boundaries
– Open text mining and deep semantic search
– Wiki environment that allows people in the field to maintain their knowledge and agree on meaning without knowledge engineering skills
URL: http://www.kyoto-project.eu/
Duration:
– March 2008 – March 2011
Effort:
– 364 person months of work

83 KYOTO cycle
[Term hierarchy: frog; endemic frogs; common frog; poison frog; Golden poison frog; gopher frog; Dusky gopher frog; forest frog]
Text: "Garden ponds are havens for wildlife. They provide food and shelter for frogs, newts and aquatic insects, including damselflies and dragonflies,"
Extracted facts:
(garden pond, haven, wild life)
(garden pond, has_food, frog)
(garden pond, has_food, newt)
(garden pond, has_food, aquatic insect)
(garden pond, is_shelter, frog)
(garden pond, is_shelter, newt)
(garden pond, is_shelter, aquatic insect)

84 [Architecture diagram of the KYOTO system]
Ontology levels: Top, Middle, Domain; concept labels: Abstract, Physical, Process, Substance, H2O, CO2, CO2 Emission, H2O Pollution, Greenhouse Gas
Wordnets
Tybot: term yielding robot
Kybot: knowledge yielding robot
Wikyoto: maintain terms & concepts
Sources: environmental organizations; distributed, diverse & dynamic data
Users: citizens, governments, companies
Example cycle:
– Capture text: "Sudden increase of CO2 emissions in 2008 in Europe"
– Term: CO2 emission
– Index facts: Process: Increase; Involves: CO2 emission; When: 2008; Where: Europe
– Text & Fact Index; Semantic Search

85 Kyoto main application
Wikyoto (Wiki platform):
– Connects people with shared interest as a community
– Upload documents and sources
– View and edit terms and concepts learned from these documents
– Combines concepts with other taxonomies
– Discuss and agree with others in the community, across different languages, regions and cultures

86 Kyoto main application
Tybots:
– Learn terms and concepts from a document collection
– Organize terms as a hierarchy
– Connect terms to other hierarchies
– Define:
  definitions
  relations to other terms
  properties and criteria for terms

87 Kyoto main application
Kybot:
– Detects facts of interest in text and combines these in a comprehensive overview
– Uses knowledge represented for terms to detect facts in any document, regardless of language
– Allows you to specify any collection of types of knowledge of your interest

88 Kyoto databases
Database of users that forms the community
Database of sources and documents provided by the users
Database of terms, presented as a domain wordnet in each language
Database of concepts (so-called ontology) that connects the terms of the different languages
Databases of facts derived from various document and source collections provided by the user

89 Thank you for your attention

