plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz* G4.19 Research.

1 plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz* G4.19 Research Group, Institute of Informatics, Wrocław University of Technology * School of Electrical Engineering and Computer Science, University of Ottawa

2 Wordnet as a Lexical Resource Princeton WordNet defines de facto standard –large size and coverage –open access –thousands of applications Applications: dictionary vs knowledge representation Range of description Ideal size and natural development limits

3 plWordNet model: linguistic resource Wordnet vs ontology –O: a strict knowledge representation –W: concepts expressed entirely in a natural language –W: synonymy is a matter of degree –O: certainty and a rigorous construction –W: shaped by the lexico-semantic dependencies Alternative to formalisation –Corpus analysis and substitution tests –Minimal commitment: defining lexico-semantic relations without committing to any particular theory of lexical semantics or human cognition

4 plWordNet model: corpus-based development Main source of lexical knowledge: a very large monolingual corpus –tools for corpus browsing –semi-automatic knowledge extraction Additional sources: dictionaries and encyclopedias Lexical unit –lemma-sense pair –a linguistically motivated primitive

5 plWordNet model: synset definition Synsets –groups of lexical units sharing certain relations {afekt 1 'passion', uczucie 2 'feeling'} hypernym {miłość 1 'love', umiłowanie 1 'affection', kochanie 1 ~'loving'} Constitutive relations –fairly frequent (to describe many LUs) –shared among LUs (to define groups) –grounded in the linguistic tradition (to facilitate their consistent understanding) –used in other wordnets (to improve compatibility)
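
One way to read the synset definition on this slide is as a grouping of lexical units (lemma-sense pairs) by the relations they share. The following is a minimal Python sketch under that assumption. The LexicalUnit class, the relation tuples, the target labels and the group_into_synsets helper are illustrative names, not part of plWordNet or its tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalUnit:
    """A lemma-sense pair, e.g. ('miłość', 1)."""
    lemma: str
    sense: int


def group_into_synsets(relations):
    """Group LUs whose sets of (relation, target) links are identical.

    `relations` maps each LexicalUnit to a set of (relation_name, target)
    pairs. This is a toy reading of "LUs sharing certain relations".
    """
    groups = defaultdict(list)
    for lu, links in relations.items():
        groups[frozenset(links)].append(lu)
    return [sorted(g, key=lambda u: (u.lemma, u.sense)) for g in groups.values()]


# Hypothetical toy data; relation targets are plain string labels here.
milosc = LexicalUnit("miłość", 1)
umilowanie = LexicalUnit("umiłowanie", 1)
afekt = LexicalUnit("afekt", 1)

relations = {
    milosc: {("hyponym_of", "afekt/uczucie")},
    umilowanie: {("hyponym_of", "afekt/uczucie")},
    afekt: {("hypernym_of", "miłość/umiłowanie")},
}

synsets = group_into_synsets(relations)
for synset in synsets:
    print([f"{lu.lemma} {lu.sense}" for lu in synset])
```

In this toy model the two units with the same hyponymy link fall into one group, while afekt 1 forms a group of its own.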

6 plWordNet model: non-relational aspects Constitutive features –stylistic registers –verb aspect –semantic verb classes Referred to in the relation definitions –e.g. relations limited to verbs of the same aspect and semantic class Glosses help wordnet editors Usage examples: direct links to the corpus
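
The restriction mentioned on this slide (a relation limited to verbs of the same aspect and semantic class) can be pictured as a simple check applied before a link is added. This is a hedged sketch only: the VerbLU class, the relation_allowed function and the "motion" class label are hypothetical and do not reflect the actual plWordNet relation definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbLU:
    lemma: str
    sense: int
    aspect: str          # "perfective" or "imperfective"
    semantic_class: str  # hypothetical class label


def relation_allowed(source: VerbLU, target: VerbLU) -> bool:
    """Allow a link only between verbs of the same aspect and semantic class."""
    return (source.aspect == target.aspect
            and source.semantic_class == target.semantic_class)


# Toy examples; the class label "motion" is an assumption for illustration.
biec = VerbLU("biec", 1, "imperfective", "motion")
isc = VerbLU("iść", 1, "imperfective", "motion")
pobiec = VerbLU("pobiec", 1, "perfective", "motion")

print(relation_allowed(biec, isc))     # same aspect and class
print(relation_allowed(biec, pobiec))  # aspect differs
```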

7 Relation density Synset relation density in PWN 3.1 and in plWordNet 2.0

8 Size matters: lexical coverage Coverage of PWN/plWN for lemmas of different frequency in two similar 1.2G-word corpora (Wikipedia)
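
The slide does not say how coverage was computed. One plausible reading, sketched here as an assumption, is the share of corpus lemmas at or above a frequency threshold that also appear in the wordnet. The function name and the toy frequencies below are hypothetical.

```python
def coverage_by_threshold(corpus_freq, wordnet_lemmas, thresholds):
    """Return {threshold: share of corpus lemmas with frequency >= threshold
    that are present in the wordnet}; None if no lemma reaches the threshold."""
    result = {}
    for t in thresholds:
        frequent = [lemma for lemma, f in corpus_freq.items() if f >= t]
        if not frequent:
            result[t] = None
            continue
        covered = sum(1 for lemma in frequent if lemma in wordnet_lemmas)
        result[t] = covered / len(frequent)
    return result


# Hypothetical toy data, not taken from the corpora on the slide.
corpus_freq = {"dom": 1000, "kot": 120, "pies": 90, "xyzzy": 3}
wordnet_lemmas = {"dom", "kot", "pies"}

coverage = coverage_by_threshold(corpus_freq, wordnet_lemmas, [1, 100, 1000])
print(coverage)
```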

9 Size matters: plWordNet 2.2
POS          Synsets    Lemmas     LUs        Average synset size
Nouns        102 613    105 883    140 701    1.37
Verbs         21 897     17 554     32 180    1.47
Adjectives    15 145     11 677     18 787    1.24
All          139 656    135 115    191 669    1.37
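
The "Average synset" figures on this slide are consistent with the number of LUs divided by the number of synsets. A short check, assuming that reading of the column (the variable names are illustrative; the counts are the ones shown on the slide):

```python
# (synsets, LUs) per part of speech, as listed on the slide.
counts = {
    "Nouns": (102_613, 140_701),
    "Verbs": (21_897, 32_180),
    "Adjectives": (15_145, 18_787),
    "All": (139_656, 191_669),
}

# Average synset size = lexical units per synset, rounded to two decimals.
average_size = {
    pos: round(lus / synsets, 2) for pos, (synsets, lus) in counts.items()
}

for pos, avg in average_size.items():
    print(f"{pos}: {avg:.2f}")
```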

10 plWordNet: ongoing work

11 Size matters: comparison of wordnets

12 How many words are there? - existing dictionaries ● Woordenboek der Nederlandsche Taal 430k lemmas ● dictionary of Grimm brothers 330k lemmas ● Oxford English Dictionary 300k lemmas ● 'Warsaw' Polish Dictionary 280k lemmas ● contemporary Polish dictionaries 130k lemmas unabridged dictionaries

13 How many words are there? - approximation [Figure: COBUILD data, ~174k (10+ lemmas)]

14 How many words are there?
                                             # entries
Polish dictionaries                          100-280k
plWordNet corpus (10+ lemmas) [K]            174k
doubled plWordNet corpus (0+ lemmas) [GT]    +200k
K - Krishnamurthy's data (2002), GT - Good & Toulmin approximation (1956)
plWordNet 3.0: 200k lemmas
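
The Good & Toulmin (1956) approximation estimates how many previously unseen types would turn up if the sample grew by a further t times its size. For doubling (t = 1) it reduces to an alternating sum over the frequency-of-frequencies counts: types seen once, minus types seen twice, plus types seen three times, and so on. The sketch below illustrates the estimator on hypothetical toy tokens. It does not reproduce the slide's calculation, and the function name is an assumption.

```python
from collections import Counter


def good_toulmin_new_types(token_lemmas, t=1.0):
    """Estimate the number of new types expected in t times more data.

    U(t) = -sum_{i>=1} (-t)**i * phi_i, where phi_i is the number of types
    observed exactly i times. The series becomes unstable for t > 1.
    """
    type_counts = Counter(token_lemmas)           # lemma -> frequency
    freq_of_freq = Counter(type_counts.values())  # frequency -> number of types
    return -sum(((-t) ** i) * phi for i, phi in freq_of_freq.items())


# Toy sample: a occurs 3 times, b twice, c, d and e once each.
sample = ["a", "a", "a", "b", "b", "c", "d", "e"]

estimate = good_toulmin_new_types(sample, t=1.0)
print(estimate)  # phi_1 - phi_2 + phi_3 = 3 - 1 + 1
```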

15 Toolkit of Lexico-semantic Resources Lexicon of lexico-syntactic structures of multi-word expressions plWordNet 3.0 (Słowosieć 3.0) plWordNet 3.0 to WordNet 3.1 mapping Semantic lexicon of proper names Mapping to an ontology And a valency lexicon linked to plWordNet

16 Lexicon of multi-word expressions Non-trivial morphology of Polish MWEs –more than 100 nominal structural patterns Description of the lexico-syntactic structures of MWEs Multi-word LUs as semantic atoms –no internal semantic relations Dynamic lexicon –a tool for automatic MWE extraction –60 000 described in the lexicon and plWordNet

17 Lexicon of Proper Names PNs are not a part of the lexicon PN is an instance of a type –characterised by referents –not by their semantic properties Linking PNs via a wordnet –some lexico-syntactic contexts signal instance of –PNs are represented in wordnets PNs as derivational bases for Common Nouns Dynamic lexicon with 2.5 million PNs verified manually

18 plWordNet to WordNet 3.1 mapping plWordNet: built independently to obtain a faithful description Manual mapping –bottom-up order –comparison of the relation structures –a cascading list of interlingual relations plWordNet verification as an important side effect Present state: 72 000 N and Adj synsets mapped Target: complete plWordNet 3.0 mapped

19 Wordnet editor: WordnetLoom

20 WordnetLoom: editing the mapping

21 Mapping to ontology Ontology: unambiguous concepts defined formally Lexical meanings –imprecisely delimited –constrained by usage, stylistic register and sentiment Mapping to ontology –precise, formal description for meanings –association: concepts – their lexical embodiment SUMO selected –Princeton WordNet mapping –Semi-automated mapping of plWordNet

22 Expectations [Diagram: plWordNet 3.0; Valence lexicon; MWE lexicon; WordNet 3.1 + extension; Proper Names; Ontology: SUMO + intermediate level; edge label: describes]

23 Applications Strong universal basis –a comprehensive wordnet >200 000 lemmas resulting in ~285 000 LUs and ~210 000 synsets –one of the largest ever Polish dictionaries Modularly constructed toolkit –a layered architecture of large software systems –separate but linked layers –each layer based on a limited set of notions and principles, and exchangeable The core of the CLARIN-PL language technology infrastructure

24 Thank you!
