Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz G4.19 Research Group Wrocław University of Technology nlp.pwr.wroc.pl plwordnet.pwr.wroc.pl.


1 Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz G4.19 Research Group Wrocław University of Technology nlp.pwr.wroc.pl plwordnet.pwr.wroc.pl Beyond the Transfer and Merge Wordnet Construction: plWordNet and a Comparison with WordNet

2 Wordnet {samochód 1, pojazd samochodowy 1, auto 1, wóz 1 `car, automobile’} – hypernymy/hyponymy: {pogotowie 3, karetka 1, sanitarka 1, karetka pogotowia 1 `ambulance’} – diminutiveness: {samochodzik 2 `small car’} – meronymy: {bagażnik 1 `boot’}

3 plWordNet 2.0

4 Independent vs. Translation-based Wordnet Construction: Transfer and merge. Examples: – EuroWordNet – most component wordnets built by the transfer method (Vossen 2002) – MultiWordNet – semi-automatic acquisition method from the Princeton WordNet (Bentivogli et al. 2000) – IndoWordNet – expansion from Hindi Wordnet (Sinha et al. 2006, Bhattacharyya 2010) – FinWordNet – directly translated from the Princeton WordNet

5 Independent vs. Translation-based Wordnet Construction: From scratch. Examples: – GermaNet – the core built independently – plWordNet – a unique, corpus-based method; largely independent of the Princeton WordNet

6 Synonymy and synsets “A wordnet is a collection of synsets linked by semantic relations.” A synset is a set of synonyms which represent the same lexicalised concept. Synonyms are members of the same synset. Wordnet development deserves better: an operational theory with precise guidelines for wordnet editors.

7 Basic building block: synset vs lexical unit? Synset relations link lexicalised concepts, but are named after linguistic lexico-semantic relations. Substitution tests are defined for lexical units. Synsets group lexical units. Every wordnet includes relations between lexical units (lexical relations), e.g., antonymy. Lexical units can be observed in text, concepts cannot.

8 Constitutive relations Synset = a group of lexical units which share all constitutive relations. Constitutive relation = a lexico-semantic relation which – is frequent enough – and is frequently shared by groups of lexical units. Also, it – is established in linguistics – and is accepted in the wordnet tradition. Examples: hypernymy, meronymy, cause.
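A minimal sketch of the definition on this slide, assuming lexical units are plain strings and relations are (source, relation, target) triples. The relation names, the hypernym `pojazd 1`, the `register` triple and the function `group_into_synsets` are illustrative assumptions, not plWordNet data or tooling. Units whose sets of constitutive relations are identical fall into one group; non-constitutive information is ignored.

```python
from collections import defaultdict

# Assumed set of constitutive relation names (hypothetical labels).
CONSTITUTIVE = {"has_hypernym", "has_holonym"}

# Toy triples (source unit, relation, target); invented for illustration.
TRIPLES = [
    ("samochód 1", "has_hypernym", "pojazd 1"),
    ("auto 1", "has_hypernym", "pojazd 1"),
    ("wóz 1", "has_hypernym", "pojazd 1"),
    ("karetka 1", "has_hypernym", "samochód 1"),
    ("sanitarka 1", "has_hypernym", "samochód 1"),
    ("bagażnik 1", "has_holonym", "samochód 1"),
    ("auto 1", "register", "colloquial"),  # not constitutive, so ignored
]


def group_into_synsets(triples, constitutive):
    """Group units that share exactly the same constitutive relations."""
    signatures = defaultdict(set)
    for source, relation, target in triples:
        signature = signatures[source]  # registers the unit even if nothing is added
        if relation in constitutive:
            signature.add((relation, target))
    groups = defaultdict(list)
    for unit, signature in signatures.items():
        groups[frozenset(signature)].append(unit)
    return sorted(sorted(group) for group in groups.values())


synsets = group_into_synsets(TRIPLES, CONSTITUTIVE)
for synset in synsets:
    print("{" + ", ".join(synset) + "}")
```

In this simplification relation targets are single units rather than synsets, and units with no constitutive relations at all would end up in one group, so a real editor-facing procedure would need more care.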

9 Synset as an abbreviation Synset as a notational convention for a group of lexical units sharing certain relations; it represents synonyms. {afekt 1 `passion’, uczucie 2 `feeling’} is a hypernym of {miłość 1 `love’, umiłowanie 1 `affection’, kochanie 1 `loving’}. This is based on constitutive relations. Additional distinctions: stylistic register and aspect. Minimal commitment principle: make as few assumptions as possible.

10 Relations in plWordNet Starting point: relations in Princeton WordNet, EuroWordNet and GermaNet, e.g., hyponymy, meronymy, antonymy, cause, instance for proper names. Additional constitutive relations, e.g., – verb meronymy, preceding, presupposition – gradation for adjectives

11 Relations in plWordNet Specific: derivationally based lexico-semantic relations, e.g., – inhabitant (góral `highlander’ – góry `highlands’) – inchoativity (zapalić się, perfective, `light, start burning’ – palić się, imperfective, `burn, produce light’) – process (chamieć, imperfective, `to become a boor’ – cham `boor’)

12 Construction process 1. Data collection: a corpus of 1.8 billion words 2. Data selection phase – corpus browsing – WSD-based extraction of word usage examples – WordnetWeaver: semi-automatic expansion 3. Data analysis – questions: Is it a correct Polish lemma? How many lexical units does it have? How to describe them with relations? Other knowledge sources: available Polish dictionaries, thesauri, encyclopaedias, lexicons, the Web, and intuition.

13 The result – size matters. Compared with Princeton WordNet: general statistics, lexical coverage, polysemy, synset size, relation density, hypernymy depth. www.plwordnet.pwr.wroc.pl

14 General statistics Number of synsets, lemmas and LUs in the largest wordnets

15 Lexical coverage Proportion of lemmas from PWN/plWN found among vocabulary with a given corpus frequency

16 Polysemy Proportion of polysemous lemmas with regard to POS
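The chart itself did not survive in this transcript. As a hedged illustration of the measure named in the caption, the sketch below counts a lemma as polysemous for a part of speech when it has more than one lexical unit there. The function name `polysemy_ratio` and all sense counts in the toy data are invented for the example, not taken from plWordNet or PWN.

```python
from collections import Counter

# Toy lexical units as (lemma, part of speech, sense number); invented data.
LEXICAL_UNITS = [
    ("zamek", "noun", 1),
    ("zamek", "noun", 2),
    ("auto", "noun", 1),
    ("karetka", "noun", 1),
    ("sanitarka", "noun", 1),
    ("palić", "verb", 1),
    ("palić", "verb", 2),
    ("chamieć", "verb", 1),
]


def polysemy_ratio(lexical_units):
    """Return, per part of speech, the share of lemmas with more than one unit."""
    senses = Counter((lemma, pos) for lemma, pos, _ in lexical_units)
    totals, polysemous = Counter(), Counter()
    for (_, pos), count in senses.items():
        totals[pos] += 1
        if count > 1:
            polysemous[pos] += 1
    return {pos: polysemous[pos] / totals[pos] for pos in totals}


ratios = polysemy_ratio(LEXICAL_UNITS)
for pos, ratio in sorted(ratios.items()):
    print(f"{pos}: {ratio:.0%} of lemmas are polysemous")
```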

17 Relation density Synset relation density in PWN 3.1 and in plWordNet 2.0

18 Hypernymy depth Hypernymy path length for nouns in PWN 3.1 and plWordNet 2.0

19 Hypernymy depth Polish WordNet Princeton WordNet

20 Hypernymy depth [Figure: panels Polish WordNet, Princeton WordNet, SUMO; example chain Computer – ElectricDevice – Device – Artifact – Object – Physical – Entity]
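A small sketch of how a hypernymy path length can be computed, assuming the labels on slide 20 form a single chain from Computer up to Entity. The dictionary `HYPERNYMS` and the function `hypernymy_depth` are illustrative names; the slides do not say whether the reported depth is the longest or the shortest path to the root, so the choice of the longest path here is an assumption.

```python
# Child -> list of direct hypernyms; the chain is read off the slide labels (assumed).
HYPERNYMS = {
    "Computer": ["ElectricDevice"],
    "ElectricDevice": ["Device"],
    "Device": ["Artifact"],
    "Artifact": ["Object"],
    "Object": ["Physical"],
    "Physical": ["Entity"],
    "Entity": [],
}


def hypernymy_depth(node, hypernyms):
    """Length in edges of the longest hypernymy path from node to a root.

    Assumes the hypernymy graph is acyclic.
    """
    parents = hypernyms.get(node, [])
    if not parents:
        return 0
    return 1 + max(hypernymy_depth(parent, hypernyms) for parent in parents)


depth = hypernymy_depth("Computer", HYPERNYMS)
print(f"Computer is {depth} hypernymy steps below the root")
```

Because a node may have several hypernyms, taking the maximum over parents gives the longest path; replacing `max` with `min` would give the shortest one.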

21 Mapping procedure: plWordNet onto Princeton WordNet 1. Recognise the sense of the source synset: its position in the network structure; existing relations, commentaries; other synsets containing the given lemma 2. Search for candidates for the target synset: intuitions, automatic prompting and dictionaries; verify the candidates by comparing hypernymy and hyponymy structures, existing inter-lingual relations, definitions, commentaries and dictionaries 3. Link the source synset with the target synset

22 Hierarchy of inter-lingual relations – Inter-lingual synonymy (only one per synset) – Inter-lingual inter-register synonymy – I-partial synonymy – I-hyponymy – I-hypernymy – I-meronymy for parts, elements or materials of bigger wholes – I-holonymy for a whole made of smaller parts, elements or materials

23 Results of inter-lingual mapping Mapping direction: plWordNet – Princeton WordNet. Bottom-up – from the lowest levels in the hierarchy up. ~48 300 synsets mapped (~64 400 lexical units/senses) – Synonymy: 15 268 – Partial synonymy: 971 – Inter-register synonymy: 676 – Hyponymy: 23 677 – Hypernymy: 3 526 – Meronymy: 1 898 – Holonymy: 555. Mapped branches – people, artefacts, places, food, time units: all; communication, states and processes, body parts, group names: partially
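The counts on this slide can be tallied to see how the inter-lingual links are distributed. The sketch below uses only the figures listed above; the variable names are illustrative. The sum of the listed relation counts need not equal the approximate number of mapped synsets, and the slide does not explain the difference.

```python
# Inter-lingual relation counts as listed on the slide.
RELATION_COUNTS = {
    "synonymy": 15268,
    "partial synonymy": 971,
    "inter-register synonymy": 676,
    "hyponymy": 23677,
    "hypernymy": 3526,
    "meronymy": 1898,
    "holonymy": 555,
}

total_links = sum(RELATION_COUNTS.values())
shares = {name: count / total_links for name, count in RELATION_COUNTS.items()}

print(f"total listed links: {total_links}")
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<25} {share:6.1%}")
```

Under these figures, hyponymy and synonymy together account for most of the listed links.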

24 Different relations for coding the same conceptual dependencies

25 Applications A free WordNet-type licence facilitates applications. Examples: Semantic annotation in a corpus of referential gestures (Lis, 2012); Lexicon of semantic valency frames (Hajnicz, 2011; Hajnicz, 2012); Features for text mining from Web pages (Maciolek and Dobrowolski, 2013); Mapping between a lexicon and an ontology (Wróblewska et al., 2013); Word-to-word similarity in ontologies (Lula and Paliwoda-Pękosz, 2009); Text similarity for Information Retrieval (Siemiński, 2012); Text classification (Maciołek, 2010); Terminology extraction and clustering (Mykowiecka and Marciniak, 2012); Automated extraction of Opinion Attribute Lexicons (Wawer and Gołuchowski, 2012); Named Entity Recognition; Word Sense Disambiguation (Gołuchowski and Przepiórkowski, 2012); Anaphora resolution. More than 500 registered users, ~70 declared commercial applications.

26 Conclusions plWordNet 2.0 – a national wordnet not adapted from Princeton WordNet. plWordNet 2.0 is comparable to WordNet 3.1 in size, as well as in lexical coverage, hypernymy depth and relation density. Synset membership depends only on constitutive relations between lexical units. A unique mapping strategy and a unique opportunity to compare the two lexical systems. plWordNet 3.0 (2015): – a comprehensive wordnet of Polish – 200k lemmas and 260k LUs, mapped to PWN 3.?

27 Thank you! www.plwordnet.pwr.wroc.pl

28 Differences between plWN and PWN Inter-lingual lexico-grammatical differences: – marked forms (diminutives, augmentatives) – lexicalised gender – lexical gaps. Differences in the definition of synonymy and synset: – 'Mixed' PWN synsets: marked and unmarked forms, feminine and masculine, countable and uncountable, hypernym and hyponym – hypernymy and (plWN) vs. and/or (PWN)

29 Differences between plWN and PWN Other differences: – synset definitions incompatible with relations (PWN) – different relations used for coding the same conceptual dependencies – more fine-grained meaning differentiation – differences boiling down to the content and size of the resource

30 Differences in lexicalisation

31 Relation density Synset relation density in PWN 3.1 and in plWordNet 2.0 in selected semantic domains

