1 Wordnet Development Using a Multifunctional Tool
Ivan Obradović, Ranka Stanković
ivano@rgf.bg.ac.yu, ranka@rgf.bg.ac.yu
University of Belgrade, Faculty of Mining and Geology
Đušina 7, 11000 Belgrade, Serbia

2 PWN – the Princeton WordNet
CALP 07 Workshop, Borovets, September 30, 2007
Conceived in 1985 by George Miller and his associates from the Cognitive Science Laboratory
A linguistic database that maps the way the mind stores and uses language
Formalized as a semantic network of concepts: abstract ideas that denote objects in a given category or class
Concepts are represented by synsets: sets of synonymous word-sense pairs accompanied by a definition of the concept
Concepts are interconnected by various semantic relations, such as hypernym/hyponym (kind of, e.g. animal/dog) or holonym/meronym (part of, e.g. hand/finger)
Contains about 150,000 words organized in over 115,000 synsets, for a total of 207,000 word-sense pairs
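The synset structure described on this slide can be sketched as a small data model. This is only an illustration, not PWN's actual storage format: the class name, field names, identifiers and glosses below are all made up for the example.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A concept: synonymous word-sense pairs plus a definition (gloss)."""
    synset_id: str                       # hypothetical identifier
    literals: list[tuple[str, int]]      # (word, sense number) pairs
    gloss: str
    # relation name -> ids of related synsets, e.g. {"hypernym": ["animal"]}
    relations: dict[str, list[str]] = field(default_factory=dict)


def hypernym_chain(wordnet: dict[str, Synset], synset_id: str) -> list[str]:
    """Follow the first hypernym link upwards until no further parent is found."""
    chain = []
    seen = {synset_id}                   # guards against accidental cycles
    parents = wordnet[synset_id].relations.get("hypernym", [])
    while parents and parents[0] in wordnet and parents[0] not in seen:
        parent_id = parents[0]
        chain.append(parent_id)
        seen.add(parent_id)
        parents = wordnet[parent_id].relations.get("hypernym", [])
    return chain


# Toy data built from the slide's animal/dog example; glosses are invented.
WORDNET = {
    "animal": Synset("animal", [("animal", 1)], "a living creature"),
    "dog": Synset("dog", [("dog", 1)], "a domesticated canine",
                  {"hypernym": ["animal"]}),
}
```

Calling `hypernym_chain(WORDNET, "dog")` walks the "kind of" relation from dog up to animal.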

3 The wordnets that followed
Developed for other languages by individual teams or through multilingual projects
EuroWordNet - wordnets for English, Dutch, Italian, Spanish, French, German, Czech, and Estonian, based on PWN and aligned by interconnecting synsets that represent the same concept in different languages via an Inter-Lingual-Index (ILI)
The ILI also gives access to a shared top-ontology that provides a common semantic framework for all the languages, with language-specific properties maintained in the individual wordnets
BalkaNet - wordnets for Bulgarian, Greek, Romanian, Serbian and Turkish, plus an expanded Czech wordnet. It followed an approach similar to EuroWordNet: wordnets were developed on the basis of PWN and the top-ontology adopted in EuroWordNet, and were also aligned using the ILI

4 Wordnet development tools
A number of software tools for wordnets have been developed in the past decades
As could be expected, the first wordnet browser was developed for PWN
Its latest version is freely distributed with version 2.1 of the Princeton WordNet for Windows, and a web application for browsing PWN is also available
Other wordnet tools were initiated within larger projects, such as EuroWordNet and BalkaNet
There are also many other tools developed for individual languages, such as Russian

5 EuroWordNet tools
Polaris - used for creating, editing and exporting wordnets
– import of wordnets, editing and adding relations, and query formulation
– visualization of semantic relations as a tree structure that can be edited directly
– trees and sub-trees can be stored as distinct sets of synsets
– matching sets of synsets across wordnets via the ILI
– licensed from Lernout and Hauspie or from ELRA
Periscope - a graphical database viewer for viewing and exporting wordnets
– a public viewer used to look at wordnets created by Polaris
– cannot be used for importing or changing wordnets
– freely distributed
Other tools, such as WEI (Web EuroWordNet Interface)
Development of all tools ceased with the termination of EuroWordNet

6 VisDic
Developed within the framework of the BalkaNet project and used as the main tool for building all BalkaNet wordnets
Primarily aimed at browsing and editing wordnets, but expanded into a more general tool for viewing and editing various types of dictionary databases stored in XML format
Handles up to 10 dictionaries simultaneously; these can be monolingual or translational dictionaries, but also thesauri or plain corpora
Available for both Linux and Windows platforms
The development of VisDic itself has finished, but a completely new client-server version of this tool, DEBVisDic, is now being developed and can be obtained free of charge, subject to registration

7 The lexicographer's two problems
The concept placement problem:
Where should a new concept be placed and how should links with existing concepts be established?
The synonym selection problem:
How should the concept be lexicalized, namely, how to select the set of word-sense pairs for the synset that represents the concept?
In some cases wordnet development tools can offer support to the user in solving the first problem, but they are of very little use for solving the second

8 On the concept placement problem
Many wordnets approached this problem by relying on the conceptual network of PWN as the basis for development
If this approach is adopted, wordnet development tools can offer support in solving the concept placement problem
Using PWN as a common conceptual network is especially convenient in cases of aligned multilingual wordnets, such as EuroWordNet and BalkaNet
Open questions:
– Are concepts linguistically independent or not?
– Are the lexicalization patterns for concepts universal?
– Is the structure of PWN valid for other languages as well?
– Is the set of semantic relations built into PWN sufficient for all languages?

9 On the synonym selection problem
Once a concept has been accepted and placed within the conceptual framework of a particular language, the lexicographer is confronted with the problem of its lexicalization
Besides selecting the appropriate synonyms, he/she also needs to provide a gloss, and preferably usage examples
As synset elements appear as word-sense pairs, the lexicographer has to assign senses to all chosen words
Linguistic resources such as electronic dictionaries, bilingual word lists and corpora can be of invaluable help to the lexicographer in accomplishing this task

10 WS4LR (WorkStation for Lexical Resources)
A software tool developed within the Human Language Technology group at the University of Belgrade
Enables integrated handling of electronic dictionaries, wordnets, aligned texts and transducers
Where wordnets are concerned, it builds on the features developed by previous tools, especially VisDic
Differs from other wordnet tools in that handling wordnets is only one of its functionalities
Allows exploitation of other resources during wordnet development, giving the lexicographer more support in his/her task

11 Motivation for WS4LR
A variety of lexical resources developed over many years within different projects and different conceptual and technological frameworks
A certain level of heterogeneity, despite efforts to keep the growing pool of resources coherent and standardized
The necessity to develop a tool that would facilitate the maintenance, exploitation and integration of available resources, as well as their further development
A need for an integrated and easily adjustable tool that would enhance the potential of each particular resource
The idea of exploiting the synergy of various resources for different HLT tasks, including wordnet development

12 Structure and characteristics
Composed of several modules which perform the following functions:
– development and refinement of wordnets
– management of a system of morphological, bilingual and multilingual electronic dictionaries
– manipulation of parallel aligned texts
– conversions between different character encodings and resource formats
Developed in C# and operates on the .NET platform
Enables invoking command-line routines and external Perl, Awk, and XSLT scripts

13 Dictionary management
The main task of this module is to enable the manipulation of a system of morphological dictionaries of canonical forms, or lemmas, for both simple and compound words
Morphological dictionaries are of great importance for highly inflective languages, such as the Slavic languages
The absence of morphological information in wordnets has turned out to be a serious flaw in many applications
The possibility offered by WS4LR to exploit both resources simultaneously proved to be a great advantage in wordnet development
WS4LR also manipulates bilingual word lists and a multilingual dictionary of proper names, which can also be used in wordnet development

14 The lemma format
The lemma in a morphological dictionary of simple words has the following format:
lemma.Knnn [+SinSem]*
where lemma is the word form used in traditional dictionaries, K represents the part of speech (noun, verb, adjective, etc.), and nnn is the inflectional class code of the lemma, whose characteristics are described by a corresponding transducer labeled Knnn
+SinSem is a set of optional tags which describe the syntactic, semantic, derivational and other properties of the lemma
The format of the lemmas for compound words is more complex, but it basically relies on the same principles
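A minimal sketch of how an entry in the format lemma.Knnn [+SinSem]* could be split into its parts. It takes the separator and field order literally from the slide; the function name, the assumption that K is one or more uppercase letters and nnn a run of digits, and the sample entry (including its class code and tag) are illustrative and not taken from the actual Serbian dictionaries.

```python
import re
from typing import NamedTuple

LEMMA_RE = re.compile(
    r"^(?P<lemma>[^.\s]+)"             # word form as in a traditional dictionary
    r"\.(?P<pos>[A-Z]+)"               # K: part-of-speech code (assumed uppercase)
    r"(?P<cls>\d+)"                    # nnn: inflectional class number
    r"\s*(?P<tags>(?:\+[^+\s]+)*)$"    # zero or more optional +SinSem markers
)


class Lemma(NamedTuple):
    lemma: str
    pos: str
    inflection_class: str   # the full Knnn code, which also names the transducer
    tags: tuple


def parse_lemma(entry: str) -> Lemma:
    """Split one simple-word dictionary entry into lemma, POS, class and tags."""
    match = LEMMA_RE.match(entry.strip())
    if match is None:
        raise ValueError(f"not a simple-word lemma entry: {entry!r}")
    tags = tuple(t for t in match["tags"].split("+") if t)
    return Lemma(match["lemma"], match["pos"], match["pos"] + match["cls"], tags)


# Hypothetical entry; the class code and tag are invented for the example.
example = parse_lemma("kuća.N600+Conc")
```

Compound-word entries carry more information and would need a richer pattern than this one.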

15 Intex
The format used in the system of morphological dictionaries is based on the LADL format developed in the Laboratoire d'Automatique Documentaire et Linguistique under the direction of Maurice Gross
The first system developed for processing texts using dictionaries in LADL format was a system called Intex
Intex uses dictionaries in combination with regular expressions and inflectional and morphological finite state transducers (FSTs) to locate morphological, lexical and syntactic patterns, remove ambiguities, and tag simple and compound words in texts
The text parsing possibilities offered by regular expressions and FSTs proved very useful in wordnet development

16 NooJ and Unitex
Although Intex has been developed for many years and is used by over 80 HLT laboratories, it does not support the processing of texts in Unicode
As the use of Unicode became more and more frequent, the development of a new tool that could handle text in Unicode became inevitable
Building on the functionalities of Intex, but allowing the processing of texts in Unicode, such a tool has been developed under the name of NooJ
Another system, Unitex, based on the LADL format and supporting resources in Unicode, has been developed in parallel and is also available

17 Integrating the three systems in WS4LR
As each of the three systems has some useful specific features, WS4LR allows the user to activate the functions of the Intex, Unitex and/or NooJ system, and to select a list of dictionaries he/she wants to use
As none of the three systems offers possibilities for managing the content of the dictionaries themselves, WS4LR provides entry, editing and review of lemmas of simple and compound words for all three solutions
Dictionaries are organized in a modular fashion, as several sub-dictionaries in separate files
Smaller files are easier to manipulate, and in text recognition by Intex/Unitex the use of all dictionaries is not always necessary, or even recommended

18 Lemma management
The user can modify or delete all the information attached to a lemma, or the lemma itself, as well as add new entries
A new entry can be generated from scratch or by copying an existing lemma, which in some cases facilitates the work
The regular expression or FST graph describing the inflectional properties of the selected lemma can be inspected, and corrected if found inadequate
Subsets of lemmas can be extracted by matching the lemmas, their part of speech, inflectional class code, syntactic and semantic markers, or a Boolean combination of these
For instance, one can look for all the dictionary entries starting or ending with a search string, which is particularly useful when the inflectional class code of a new lemma is being established, since this code depends on the lemma ending
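The kind of Boolean filtering described on this slide might look as follows. This is a sketch over plain Python dictionaries, not the way WS4LR stores or queries its entries; the entry layout, helper names, class codes and tags are all assumptions made for the example.

```python
from typing import Callable, Iterable

Predicate = Callable[[dict], bool]


def ends_with(suffix: str) -> Predicate:
    return lambda entry: entry["lemma"].endswith(suffix)


def starts_with(prefix: str) -> Predicate:
    return lambda entry: entry["lemma"].startswith(prefix)


def has_pos(pos: str) -> Predicate:
    return lambda entry: entry["pos"] == pos


def has_tag(tag: str) -> Predicate:
    return lambda entry: tag in entry["tags"]


def all_of(*preds: Predicate) -> Predicate:
    """Boolean AND of several conditions."""
    return lambda entry: all(p(entry) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Boolean OR of several conditions."""
    return lambda entry: any(p(entry) for p in preds)


def select(entries: Iterable[dict], pred: Predicate) -> list:
    """Return the subset of entries satisfying the combined condition."""
    return [entry for entry in entries if pred(entry)]


# Invented sample entries; codes and tags do not come from a real dictionary.
ENTRIES = [
    {"lemma": "kuća", "pos": "N", "cls": "N600", "tags": {"Conc"}},
    {"lemma": "ruka", "pos": "N", "cls": "N601", "tags": {"Conc"}},
    {"lemma": "raditi", "pos": "V", "cls": "V1", "tags": set()},
    {"lemma": "prozor", "pos": "N", "cls": "N1", "tags": {"Conc"}},
]

# Nouns ending in "a": the existing entries one might inspect before
# assigning an inflectional class code to a new lemma with that ending.
nouns_in_a = select(ENTRIES, all_of(has_pos("N"), ends_with("a")))
```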

21 Compound words
Dictionaries of compound words can be a valuable resource in the wordnet development task
The form for new entries in these dictionaries is more complex, since more information needs to be supplied:
– information pertaining to the entry as a whole
– information associated with the constituents of the compound lemma
For inflected compound constituents additional information is needed: the lemma, its inflectional class code, and the list of grammatical categories of the form that appears in the compound lemma

23 Bilingual word lists
WS4LR also handles bilingual word lists, as well as multilingual dictionaries such as Prolex, the multilingual dictionary of proper names based on an ontology built around the conceptual proper name and its relations
This adds further functionality to the integration of lexical resources offered by WS4LR in various tasks, including wordnet development

24 Management of aligned parallel texts
Parallel texts, which usually originate from a text in one language and its translation in another, are often aligned at a certain level (paragraph, sentence, etc.) by matching the corresponding segments of the original and its translation
Aligned parallel texts are a valuable lexical resource which can be used for many HLT tasks, including wordnet development
The WS4LR module for management of aligned parallel texts uses texts which have previously been aligned using Xalign as an alignment tool
The module converts these texts to the Translation Memory eXchange (TMX) format, which is becoming the standard format for aligned texts
The module can also use texts that are already in that format
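To give an idea of what aligned text in TMX looks like to a program, here is a small sketch that reads translation units with Python's standard XML parser. It is not the WS4LR module (which is written in C#); the function name and the sample sentences are made up, and only the basic `tu`/`tuv`/`seg` elements of TMX are handled.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The dog barks.</seg></tuv>
      <tuv xml:lang="sr"><seg>Pas laje.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>The house is old.</seg></tuv>
      <tuv xml:lang="sr"><seg>Kuća je stara.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(tmx_text: str) -> list:
    """Return one {language: segment text} dict per translation unit."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        units.append(segments)
    return units


aligned = read_tmx(SAMPLE_TMX)
```

Each dictionary in `aligned` holds one aligned segment pair, keyed by language code.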

26 Conversion
Adds to the flexibility of resource exploitation
Conversion from one character encoding to another enables the exploitation of language resources in both the Cyrillic and the Latin alphabet
The transformation can be applied to only a part of the file, e.g., when a dictionary-type file is transformed, only lemmas and word forms are converted, not the part of speech and grammatical codes
The module also makes switching between resources in Intex and Unitex quick and easy
The user can also choose a Perl or awk conversion script suitable for a specific file type, or even produce his/her own script
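A sketch of partial conversion, assuming dictionary lines of the form lemma.Knnn+tags: only the lemma is transliterated from Latin to Serbian Cyrillic, while the codes after the first full stop are left alone. The function names and the sample entry are invented, and the digraph handling is deliberately naive (it would wrongly merge letters in words such as "injekcija").

```python
# Single letters, paired position by position (27 letters each).
_LATIN = "abvgdđežzijklmnoprstćufhcčš"
_CYRILLIC = "абвгдђежзијклмнопрстћуфхцчш"
SINGLE = dict(zip(_LATIN, _CYRILLIC))

# Latin digraphs that correspond to one Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}


def to_cyrillic(text: str) -> str:
    """Transliterate Serbian Latin text to Cyrillic, digraphs first."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)                      # digits, punctuation, etc.
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)


def convert_entry(line: str) -> str:
    """Convert only the lemma; keep part-of-speech and grammatical codes as-is."""
    lemma, sep, codes = line.partition(".")
    return to_cyrillic(lemma) + sep + codes


converted = convert_entry("kuća.N600+Conc")
```

Running `to_cyrillic` over the whole line instead would also rewrite the codes, which is exactly what partial conversion is meant to avoid.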

27 Wordnet management
The wordnet management module supports searching wordnets, their visualization, as well as their development and refinement
When this module is activated, the main form opens with two wordnet windows, giving the user the possibility to work with one or two wordnets
In the current version of WS4LR these two wordnets are the Serbian and English wordnets, but the tool can easily be adapted for any two wordnets
If the user decides to work with both wordnets in parallel, he/she can always synchronize them via the ILI
The main form for wordnet management also opens a window with a bilingual word list

28 Searching wordnets
The user can choose to search just one wordnet or both of them
Synsets can be retrieved using various methods, from simple string matching to complex XPath expressions
In simple string matching the user can specify whether an exact match is required or not; in the latter case the system will also retrieve synsets containing words of which the specified string(s) are a part
The user can use XPath expressions to retrieve synsets on the basis of various other criteria, such as the domain the synsets belong to
WS4LR offers predefined XPath expressions, but the user can also define these expressions him/herself
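Both search styles can be illustrated on a tiny XML wordnet. The element names below (SYNSET, ID, SYNONYM, LITERAL, SENSE, DOMAIN) are an assumption modelled on VisDic-style files, the identifiers and content are invented, and Python's ElementTree supports only a subset of XPath, so this is a sketch of the idea rather than the expressions WS4LR ships with.

```python
import xml.etree.ElementTree as ET

SAMPLE_WORDNET = """
<WN>
  <SYNSET><ID>S1</ID><POS>n</POS>
    <SYNONYM>
      <LITERAL>dog<SENSE>1</SENSE></LITERAL>
      <LITERAL>domestic dog<SENSE>1</SENSE></LITERAL>
    </SYNONYM>
    <DOMAIN>zoology</DOMAIN>
  </SYNSET>
  <SYNSET><ID>S2</ID><POS>n</POS>
    <SYNONYM><LITERAL>hot dog<SENSE>1</SENSE></LITERAL></SYNONYM>
    <DOMAIN>gastronomy</DOMAIN>
  </SYNSET>
</WN>
"""


def ids_by_domain(root: ET.Element, domain: str) -> list:
    """XPath-style search: synsets whose DOMAIN child equals the given text."""
    return [s.findtext("ID") for s in root.findall(f".//SYNSET[DOMAIN='{domain}']")]


def ids_by_literal(root: ET.Element, query: str, exact: bool = True) -> list:
    """String search over literals, either exact or as a substring."""
    hits = []
    for synset in root.iter("SYNSET"):
        # LITERAL has mixed content; .text is the word before the SENSE child.
        words = [(lit.text or "").strip() for lit in synset.iter("LITERAL")]
        if any(w == query if exact else query in w for w in words):
            hits.append(synset.findtext("ID"))
    return hits


wordnet_root = ET.fromstring(SAMPLE_WORDNET)
```

With `exact=False`, a query for "dog" also returns the synset containing "hot dog", which mirrors the non-exact matching described on the slide.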

30 Adding a new concept using its hypernym
With a particular concept in mind, the lexicographer can inspect the wordnet for the existence of its hypernym
If an appropriate hypernym is found, the new synset can be placed as its hyponym
In order to find such a hypernym, the search possibilities offered by WS4LR can be used
As synsets can be visualized in various forms (as text, as XML or as a hypernym/hyponym tree), navigation through the hypernym/hyponym tree can also be used to locate a hypernym

32 Adding new concepts using PWN
Starting with a word that denotes the concept, the user can locate the available candidate PWN synsets using the bilingual word list
Using the option Match ID, the user can first identify the synsets in the source and target wordnet that already have a match
If the matching PWN synset for the new concept is found, the new synset can be inserted in the appropriate place in the target wordnet using the option Create synset in the other language
If necessary, this option also creates copies of all its missing hypernyms, to prevent the new synset from becoming a dangling synset
The user can then proceed with modifications
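The idea behind copying missing hypernyms can be sketched as a walk up the source hierarchy. This is not the code behind the Create synset in the other language option; it assumes both wordnets share synset identifiers (as they do when aligned via the ILI), that each synset has at most one hypernym, and it uses invented identifiers and a made-up dictionary layout.

```python
def create_synset_in_target(source: dict, target: dict, synset_id: str) -> list:
    """Copy a synset and any missing ancestors from source into target.

    Walks up the hypernym chain in the source wordnet and creates an empty
    placeholder in the target for every synset not yet present, stopping at
    the first ancestor the target already has. Returns the ids created.
    """
    created = []
    current = synset_id
    while current is not None and current not in target:
        hypernym = source[current]["hypernym"]
        target[current] = {
            "literals": [],          # to be filled in by the lexicographer
            "hypernym": hypernym,    # same link as in the source wordnet
        }
        created.append(current)
        current = hypernym
    return created


# Toy source (English) hierarchy: entity <- animal <- canine <- dog
SOURCE = {
    "entity": {"literals": ["entity"], "hypernym": None},
    "animal": {"literals": ["animal"], "hypernym": "entity"},
    "canine": {"literals": ["canine"], "hypernym": "animal"},
    "dog": {"literals": ["dog"], "hypernym": "canine"},
}
# Toy target wordnet that so far covers only the two top concepts.
TARGET = {
    "entity": {"literals": ["entitet"], "hypernym": None},
    "animal": {"literals": ["životinja"], "hypernym": "entity"},
}

created_ids = create_synset_in_target(SOURCE, TARGET, "dog")
```

After the call, the new synset and its missing parent both exist in the target, so the new synset is attached to the existing hierarchy rather than left dangling.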

34 Selection of synonymous words
WS4LR also offers substantial aid in solving the synonym selection problem: the selection of synonymous words for the synset and the assignment of meanings to these words
Although it is reasonable to assume that the wordnet developer has a good idea of the candidate words for the synset of the concept he/she wants to add to the wordnet, it is also possible that he/she might overlook some of them
The bilingual word list can be used as the simplest and most straightforward aid
Words from the source (English) synset can be matched with words in the target language as probable candidates
The multilingual dictionary Prolex could be used in a similar manner

36 Using aligned texts
The synonym selection problem can be approached by combining two wordnets and aligned texts. WS4LR searches both aligned texts in parallel using selected words from both languages. All of the words found in both texts are highlighted. A lexicographer can use this option to extract possible candidate words for a synset by searching aligned texts with words from the original PWN synset and words he/she has already selected for the target synset. If a highlighted word found in the English text does not have a highlighted match in the target-language text, the lexicographer should inspect the target-language sentence for a possible match, which would then be a new candidate for the synset.
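The inspection step can be approximated with the following Python sketch. It assumes the aligned texts are available as a list of (English sentence, target sentence) pairs and matches exact word forms only, ignoring inflection; the function names and sample sentences are hypothetical and do not reflect how WS4LR stores or searches aligned texts.

```python
import re
from typing import Iterable, List, Set, Tuple


def tokens(sentence: str) -> Set[str]:
    """Lower-cased word forms of a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def pairs_to_inspect(aligned: List[Tuple[str, str]],
                     source_words: Iterable[str],
                     target_words: Iterable[str]) -> List[Tuple[str, str]]:
    """Return aligned pairs in which a source word occurs but none of the
    already selected target words does; these may hide a new candidate."""
    src = {w.lower() for w in source_words}
    tgt = {w.lower() for w in target_words}
    return [
        (en, other)
        for en, other in aligned
        if tokens(en) & src and not tokens(other) & tgt
    ]


aligned = [
    ("The house is old.", "Kuća je stara."),
    ("He returned to his home.", "Vratio se u svoj dom."),
]

to_check = pairs_to_inspect(aligned, ["house", "home"], ["kuća"])
print(to_check)
# [('He returned to his home.', 'Vratio se u svoj dom.')]
```

In the toy data the second pair is flagged because "home" occurs in English while "kuća" does not occur in the Serbian sentence, which points the lexicographer to "dom" as a possible new candidate.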

39 Checking synset words in context
Once the user has rounded up all the candidate words for the synset, he/she might be in doubt whether one or more words properly fit into the synset. In that case the user might want to observe these words in context, which can be done by searching a corpus for these words and obtaining concordances. By seeing the occurrences of the words in context, the user can better assess whether they are really appropriate. In WS4LR this can be done by creating a regular expression or FST graph from one or more words and using it to search a text in the target language.
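A rough equivalent of the regular-expression route, written with Python's standard `re` module rather than the tools WS4LR relies on, might look as follows. The function name `concordance`, the context width, and the sample text are assumptions made for illustration; FST graphs are not covered.

```python
import re
from typing import Iterable, List, Tuple


def concordance(text: str, words: Iterable[str],
                width: int = 30) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every occurrence
    of any of the given words, matched as whole words, ignoring case."""
    words = list(words)
    if not words:
        return []
    # Build one alternation from the candidate words, escaping special characters.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


text = "Stara kuća je na brdu. Novi dom je u gradu."
for left, keyword, right in concordance(text, ["kuća", "dom"]):
    print(f"{left:>30} [{keyword}] {right}")
```

This sketch matches only the exact forms supplied, so inflected forms would have to be listed explicitly in `words`.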

42 Wordnet consistency checks
The WS4LR wordnet module also performs various consistency checks on wordnets. For example, when word senses are in question, WS4LR provides information on the senses that have already been used for a word, so the user can assign a sense tag that has not previously been assigned, thus preventing duplicate word-sense pairs. The wordnet module can also detect dangling relations, and the use of the same word in a hypernym/hyponym pair, which is not allowed.
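The three checks named above can be sketched in Python as follows. The data layout (a dictionary from synset ID to its word-sense pairs and a single hypernym ID) and all function names are assumptions made for this illustration, not the WS4LR implementation.

```python
from collections import Counter


def used_senses(synsets, word):
    """Sense numbers already assigned to `word` anywhere in the wordnet."""
    return sorted({sense for s in synsets.values()
                   for w, sense in s["literals"] if w == word})


def duplicate_word_senses(synsets):
    """Word-sense pairs that occur in more than one synset."""
    counts = Counter(pair for s in synsets.values() for pair in s["literals"])
    return sorted(pair for pair, n in counts.items() if n > 1)


def dangling_relations(synsets):
    """IDs of synsets whose hypernym link points to a non-existent synset."""
    return sorted(sid for sid, s in synsets.items()
                  if s["hypernym"] is not None and s["hypernym"] not in synsets)


def shared_hypernym_words(synsets):
    """(hyponym ID, hypernym ID, shared words) for pairs using the same word."""
    problems = []
    for sid, s in synsets.items():
        parent = synsets.get(s["hypernym"])
        if parent is None:
            continue
        common = {w for w, _ in s["literals"]} & {w for w, _ in parent["literals"]}
        if common:
            problems.append((sid, s["hypernym"], sorted(common)))
    return sorted(problems)


# Toy wordnet containing one instance of each problem.
synsets = {
    "s1": {"literals": [("animal", 1)], "hypernym": None},
    "s2": {"literals": [("dog", 1)], "hypernym": "s1"},
    "s3": {"literals": [("dog", 1), ("hound", 1)], "hypernym": "s2"},
    "s4": {"literals": [("cat", 1)], "hypernym": "s9"},
}

print(used_senses(synsets, "dog"))     # [1]
print(duplicate_word_senses(synsets))  # [('dog', 1)]
print(dangling_relations(synsets))     # ['s4']
print(shared_hypernym_words(synsets))  # [('s3', 's2', ['dog'])]
```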

43 Using morphological information
Morphological dictionaries extend the search possibilities by enabling searches with all inflected forms of the words, which is of great importance for highly inflective languages such as Serbian. WS4LR also enables the enrichment of synsets with morphosyntactic information from morphological dictionaries. The tool can search for all synset words in morphological dictionaries of simple or compound lemmas, retrieve their inflectional class codes, and assign them to synset words using the XML tag. If several lemmas of the same form exist, they are all offered to the user to choose the appropriate one. The missing morphosyntactic information can thus be added to wordnets.
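The lookup-and-assign step can be pictured with the short Python sketch below. The morphological dictionary is modelled as a plain mapping from a lemma form to the inflectional class codes of all lemmas sharing that form; the function name and the code values are invented for illustration and do not reproduce the actual dictionary format or codes used with WS4LR.

```python
from typing import Dict, Iterable, List, Tuple


def assign_inflection_codes(
    literals: Iterable[str],
    morph_dict: Dict[str, List[str]],
) -> Tuple[Dict[str, str], Dict[str, List[str]], List[str]]:
    """Split synset words into three groups: words with exactly one code
    (assigned directly), words with several homographic lemmas (left for
    the user to choose), and words missing from the dictionary."""
    assigned, ambiguous, missing = {}, {}, []
    for word in literals:
        codes = morph_dict.get(word, [])
        if len(codes) == 1:
            assigned[word] = codes[0]
        elif codes:
            ambiguous[word] = list(codes)
        else:
            missing.append(word)
    return assigned, ambiguous, missing


# Hypothetical class codes; "kosa" stands for several lemmas of the same form.
morph_dict = {
    "kuća": ["N600"],
    "kosa": ["N600", "N601"],
}

assigned, ambiguous, missing = assign_inflection_codes(
    ["kuća", "kosa", "dom"], morph_dict
)
print(assigned)   # {'kuća': 'N600'}
print(ambiguous)  # {'kosa': ['N600', 'N601']}
print(missing)    # ['dom']
```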

45 Concluding remarks
The desktop version of WS4LR is fully operational and is already being used as the main tool for developing resources in Serbian, including the Serbian wordnet, but its commercial applications have not yet been considered. Although a systematic evaluation of WS4LR has not been performed, the tool has already been enhanced several times on the basis of user feedback. A full-scale web version of this tool is planned, which would enable several lexicographers to use it concurrently in wordnet development, with all the possibilities the desktop version now offers. Presently, some of the WS4LR functions are available on the web for searches based on morphological (using dictionaries), semantic (using wordnets), and multilingual (using aligned multilingual wordnets) expansions of the initial query.

