Published by Sheila Washington. Modified over 9 years ago.
1
1 Electronic dictionaries Duško Vitas University of Belgrade, Faculty of Mathematics
2
2 One definition of a (traditional) dictionary A dictionary is a book in which the words and phrases of a language are listed alphabetically, together with their meanings or their translations in another language. (Collins Cobuild, English Dictionary for Advanced Users)
3
3 From Dictionary.com ... a book, optical disc, mobile device, or online lexical resource containing a selection of the words of a language, giving information about their meanings, pronunciations, etymologies, inflected forms, derived forms, etc., expressed in either the same or another language... Print dictionaries of various sizes, ranging from small pocket dictionaries to multivolume books, usually sort entries alphabetically... All electronic dictionaries, whether online or installed on a device, can provide immediate, direct access to a search term, its meanings...
4
4 Some definitions of e-dictionaries
According to one definition (Schryver, 2003), an e-dictionary is “any dictionary that can be used in an automated environment”.
Another definition (Jacquet-Pfau, 2002) says that “electronic dictionaries intended for automated processing of texts (corpora) differ from machine-readable dictionaries that are intended for human users”.
5
5 Characteristics of e-dictionaries
E-dictionaries have to fulfill two basic criteria:
They have to be formally defined so that computer programs can process them; in addition, e-dictionaries complement grammars, since all exceptions are listed in them.
They have to be exhaustive, covering 100% of the lexicon of the language in question, so that a parser processing a text is not impeded by unknown words. This aim is difficult to achieve.
As opposed to grammars, an e-dictionary tends to describe the lexical properties of lemmas extensively.
6
6 Development of e-dictionaries
Can the development of an e-dictionary rely on some excellent traditional dictionary?
Traditional dictionaries are often limited in size (e.g. for commercial reasons).
Information in them is often implicit: they rely on the belief that a human will easily supply all missing data; for instance, that a human will correctly deduce a whole paradigm if offered one or two inflectional endings.
Information is often partial. E.g. in Serbian the noun otac has two possible plural forms, očevi and oci; for automatic processing it is necessary to know explicitly whether it is possible to say ?očevi nacije ‘national founding fathers’, ?oci dece ‘fathers of the children’, or even Očevi i oci (the title of a novel).
7
7 From a list of words to an e-dictionary
Many computer scientists in the past thought that a list of words taken from a traditional dictionary was a good starting point for the development of an e-dictionary. This attitude was influenced by work done for English, which, because of its modest inflection, is not a typical example of a European language from the point of view of automatic processing.
Before starting to develop an e-dictionary, it should be clear what its basic unit (lemma) is going to be, and then how the other forms can be generated from it.
8
8 Defining a basic unit of an e-dictionary
Automatic text processing usually begins with simple words as the basic units of texts. This is a natural starting point because simple words are formalized for most European languages.
However, simple words are not always a natural unit of processing, because they are:
ambiguous (dictionaries offer several meanings for them);
pointless (many terms have several constituents, none of which contributes directly to the meaning of the term).
Because of that, dictionaries of simple words have to be complemented with other types of dictionaries and grammars that provide natural units of processing.
9
9 Types of e-dictionaries
E-dictionaries of simple words (dictionaries of simple graphemic units – these are usually entries in traditional dictionaries);
E-dictionaries of multi-word units (multi-word units that contain non-letter characters, terminology, collocations, phrases,...);
Phonological e-dictionaries (pronunciation of simple words, with rules of how to pronounce inflected forms, words in contact, etc.);
Semantic e-dictionaries (simple words and multi-word units with encoded senses – network of senses?)
10
10 A prerequisite for the development of an e-dictionary of simple words
A selection of lexical categories and a way to represent them. Traditional categories:
Part-of-Speech: noun, verb, adjective,...
subordinated categories: possessive, indefinite, definite,...
inflectional categories: masculine, feminine, neuter, nominative, genitive,...
syntactic categories: transitive, intransitive,...
semantic categories: human, abstraction, concrete object,...
11
11 The selection of categories is not a straightforward task
A set of tags used to annotate the Brown corpus
A set of tags used to annotate the Penn Treebank
A set of tags used for the Multext-East project
12
12 LADL format of electronic dictionaries
Unitex works with dictionaries that were developed by members of the Relex network. Relex is an informal international network of laboratories that work on computational linguistics. It was established by Maurice Gross and his LADL team (LADL stands for Laboratoire d'Automatique Documentaire et Linguistique).
Members of the Relex network developed exhaustive e-dictionaries of simple words and compounds (http://infolingu.univ-mlv.fr/Relex/Relex.html).
13
13 A selection of canonic forms In the case that a word has several surface forms, one of them is chosen as a canonic representative for other, subordinate forms. What are canonic forms in e-dictionaries of French? For nouns, as a rule that is the singular masculine form; For verbs, that is the infinitive form...
14
14 Is the selection of a canonic form unique?
It is neither simple nor unique. For instance:
In French, the gender of nouns is an inflectional category; that is, lecteur, lecteurs, lectrice, lectrices are four forms of the same word, and its canonic form is lecteur.
In Serbian, the gender of nouns is not an inflectional category; so učitelj and učiteljica (traditionally, as well as in the Serbian e-dictionary) are two canonic forms, each with its own subordinate forms.
Similarly in Bulgarian: учител and учителка.
15
15 Why is the adequate selection of a canonic form so important?
A lot of information about a word is attached to its canonic form, and all subordinate forms share that information: učiteljica has the semantic features +Hum+Prof, and all its inflected forms (učiteljice, učiteljici, učiteljicu,...) have the same features.
Is this a rule that releases us from making further decisions? No: in Serbian, gender and animacy are features of subordinate forms, not of canonic forms. Why?
Nouns can change gender in the plural: vladika (m), vladike (f).
In order to treat the same category always in the same way (for nouns, adjectives, pronouns, numerals, etc.).
16
16 More about categories attached to canonic and subordinate forms
First, there was a mouse. This mouse is alive. Its canonic form is miš,N+Zool
Then came a mouse. This mouse is not alive. Its canonic form is miš,N+Conc
What is the value of the grammatical category “animacy” for this new mouse?
Da biste se prebacili na sledeće poglavlje DVD-a, pomerite miša (‘To skip to the next DVD chapter, move the mouse’)
Da biste kontrolisali reprodukciju televizije uživo, pomerite miš kako bi se prikazale kontrole za reprodukciju (‘To control live TV playback, move the mouse so that the playback controls are displayed’)
Google: 28,300 Google: 19,000
17
17 More on the selection of a canonic form
Passive past participles are not separate entries in Serbian traditional dictionaries; these forms belong to the verb paradigm.
What about passive past participles that are used as adjectives? A program for automatic text processing has to recognize them somehow and tag them appropriately.
For instance, a sample of “Politika” with 582,000 simple word tokens contains, in the feminine accusative alone, 228 adjectives derived from the past participle (they are not all correct).
18
18 And what about...
Present past participle (functioning as an adjective) – “Politika” – occurrences in the feminine accusative singular forms
Present gerund (functioning as an adjective) – “Politika” – occurrences in the feminine accusative singular forms
Derivational forms:
Possessive adjectives – dečakov, partizanov,...
Diminutives – tkaninica, futrolica, telefonče,...
Gender motion – druidica, gutačica, gudačica, guvernerka,...
Šezdesetogodišnjakinja, četvoroipomesečni, dvestopedesetogodišnjica,...
In the Serbian e-dictionary they all have separate canonic forms, each with its own subordinate forms.
19
19 In order to obtain (close to) 100% coverage of a text, it is necessary to include:
colors – skerletnocrven, bledoplav, mlečnožut,...
proper names – personal names, geopolitical names
organizations (Ozna, Gestapo, Metropoliten,...)
objects – trademarks (lajka, spitfajer, mercedes,...)
titles and characters of novels, films, operas… (Dezdemona, Asteriks, Plavobradi...)
events (Anšlus,...)
And then also – donžuanstvo, arsenlupenovski, neotitoizam, nedićevština,...
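Coverage of a text by a dictionary can be measured with a few lines of code. The following is a minimal Python sketch and not part of Unitex: the function name `coverage`, the tokenizing regular expression and the three-form toy dictionary are assumptions made purely for illustration.

```python
import re

def coverage(text, known_forms):
    """Return (share of tokens found in the dictionary, list of unknown tokens).

    Assumption: a token is a maximal run of letters, and a token counts as
    known if it, or its lower-cased variant, is among the listed forms.
    """
    tokens = re.findall(r"[^\W\d_]+", text)
    unknown = [t for t in tokens
               if t not in known_forms and t.lower() not in known_forms]
    if not tokens:
        return 1.0, []
    return 1 - len(unknown) / len(tokens), unknown

# Toy dictionary: three inflected forms of the lemma učiteljica.
forms = {"učiteljica", "učiteljice", "učiteljicu"}
share, missing = coverage("Učiteljice i učiteljicu", forms)
print(round(share, 2), missing)
```

The list of unknown tokens is usually more useful than the percentage itself, since it shows which kinds of words (colors, proper names, derived forms) still have to be added.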
20
20 Details of the LADL format There are two dictionaries (or lists) of simple words in the LADL format: First dictionary – DELAS – is a dictionary of canonic forms (lemmas). This dictionary is used to generate the second dictionary. Second dictionary – DELAF – is a dictionary of subordinate (or inflected) forms. Only this dictionary is used in the automatic text processing.
21
21 An entry in a DELAS dictionary
lemma,Kn+Prop
K: a Part-of-Speech code; usually a code consisting of one or more upper-case letters.
n: a relation with subordinate forms, if they exist; usually an alphanumeric code that, together with the PoS code, enables the generation of all subordinate forms for a DELAF dictionary.
Prop: syntactic, semantic, dialect, usage, domain,… markers; these markers can be freely attached to any canonic form and take the form of alphanumeric codes.
22
22 An example of a DELAS entry from the Serbian e-dictionary
učiteljica,N651+Hum+GM
učiteljica: canonic form (lemma)
N: Part-of-Speech (noun)
(N)651: inflection class code used to generate all inflected forms
+Hum: human
+GM: feminine gender noun derived from the corresponding masculine gender noun učitelj
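The structure lemma,Kn+Prop can be taken apart mechanically. The sketch below is a hedged illustration in Python, not the parser used by Unitex: the names `DelasEntry` and `parse_delas` are invented here, the split of the code into upper-case Latin letters followed by digits is an assumption that fits the Serbian examples only, and escaped commas inside a lemma are ignored.

```python
import re
from typing import List, NamedTuple

class DelasEntry(NamedTuple):
    lemma: str
    pos: str               # K: Part-of-Speech code
    inflection_code: str   # n: empty when the lemma has no subordinate forms
    markers: List[str]     # Prop: markers such as Hum, GM

# Assumption: K is upper-case Latin letters, n is an optional run of digits.
_CODE = re.compile(r"^([A-Z]+)(\d*)$")

def parse_delas(line):
    lemma, _, rest = line.strip().partition(",")
    head, *markers = rest.split("+")
    match = _CODE.match(head)
    if match is None:
        raise ValueError(f"unexpected PoS/inflection code: {head!r}")
    return DelasEntry(lemma, match.group(1), match.group(2), markers)

print(parse_delas("učiteljica,N651+Hum+GM"))
print(parse_delas("ćutke,ADV"))
```

Uninflected entries such as ćutke,ADV simply yield an empty inflection code and an empty marker list.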
23
23 Examples of Serbian DELAS entries for various PoS
učiteljica,N651+Hum+GM
zagasitocrven,A6+Col
smejati,V516+Imperf+It+Ref+Ek
ćutke,ADV
deset,NUM+v5
poneko,PRO+ProN+Indef+Sr
ali,CONJ
od,PREP+p2
jaoj,INT
naime,PAR
24
24 An example of a DELAS entry from the Bulgarian e-dictionary
глава,С600+Ж
глава: canonic form (lemma)
С: Part-of-Speech (noun)
(С)600: inflection class code used to generate all inflected forms
+Ж: feminine
25
25 Examples of Bulgarian DELAS entries for various PoS
глава, С(600)+Ж
червен, ПРИ(3)
абе,абе.МЕЖ
ако,ако.СЮ+П
вместо,вместо.ПРЕД
вредно,вредно.НАР
даже,даже.ЧА
дам.Г+С+Т
…
26
26 An entry in a DELAF dictionary:
word form,lemma.K+Prop(:gc)*
lemma: canonic form;
K: a Part-of-Speech code (inherited from its lemma);
Prop: syntactic, semantic, dialect, usage, domain,… markers (inherited from its lemma);
gc: a set of codes that represent the values of the grammatical categories describing a form. Grammatical categories depend on the PoS; the codes are single alphanumeric characters.
27
27 An example of a DELAF entry from the Serbian e-dictionary
učiteljicu,učiteljica.N+Hum+GM:fs4v
učiteljicu: subordinate form (realization)
učiteljica: canonic form (lemma)
N: PoS (inherited from the canonic form)
+Hum+GM: markers (inherited from the canonic form)
fs4v: values of grammatical categories:
f: category gender (value feminine)
s: category number (value singular)
4: category case (value accusative)
v: category animacy (value animate)
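A DELAF line of the shape word form,lemma.K+Prop(:gc)* can be decomposed in the same spirit. This is a minimal Python sketch under stated assumptions: the function names `parse_delaf` and `decode_noun_code` are hypothetical, an empty lemma field is read as "lemma equals the word form", and the position-to-category mapping (gender, number, case, animacy) is taken from the Serbian noun example above and does not hold for other parts of speech or languages.

```python
def parse_delaf(line):
    """Split one DELAF line into its fields (no handling of escaped characters)."""
    form, _, rest = line.strip().partition(",")
    lemma, _, info = rest.partition(".")
    tags, *codes = info.split(":")      # zero or more :gc groups
    pos, *markers = tags.split("+")
    return {
        "form": form,
        "lemma": lemma or form,          # assumption: empty lemma means same as form
        "pos": pos,
        "markers": markers,
        "codes": codes,
    }

# Assumption: order of one-character codes for Serbian nouns, as in the example.
NOUN_CATEGORIES = ("gender", "number", "case", "animacy")

def decode_noun_code(code):
    return dict(zip(NOUN_CATEGORIES, code))

entry = parse_delaf("učiteljicu,učiteljica.N+Hum+GM:fs4v")
print(entry)
print(decode_noun_code(entry["codes"][0]))
```

Because a form may carry several :gc groups, `codes` is a list; each element describes one grammatical reading of the same word form.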
28
28 The whole paradigm of the lemma učiteljica
učiteljica,učiteljica.N:fp2v
učiteljica,učiteljica.N:fs1v
učiteljice,učiteljica.N:fp5v
učiteljice,učiteljica.N:fp4v
učiteljice,učiteljica.N:fp1v
učiteljice,učiteljica.N:fs5v
učiteljice,učiteljica.N:fw4v
učiteljice,učiteljica.N:fw2v
učiteljice,učiteljica.N:fs2v
učiteljici,učiteljica.N:fs7v
učiteljici,učiteljica.N:fs3v
učiteljicu,učiteljica.N:fs4v
učiteljicom,učiteljica.N:fs6v
učiteljicama,učiteljica.N:fp7v
učiteljicama,učiteljica.N:fp6v
učiteljicama,učiteljica.N:fp3v
The numeric code 651 that connects a canonic form with all of its subordinate forms is deleted because it is of no use anymore.
29
29 The whole paradigm of the lemma глава
глава,глава.С+Ж:s0
главата,глава.С+Ж:sd
глави,глава.С+Ж:p0
главите,глава.С+Ж:pd
30
30 The whole paradigm of the lemma дам
дадете,дам.Г+С+Т:R2p
дадеш,дам.Г+С+Т:R2s
дадеше,дам.Г+С+Т:D2s:D3s
дадох,дам.Г+С+Т:E1s
дадоха,дам.Г+С+Т:E3p
дадохме,дам.Г+С+Т:E1p
дадохте,дам.Г+С+Т:E2p
дадял,дам.Г+С+Т:Wsm
дадяла,дам.Г+С+Т:Wsf
дадяло,дам.Г+С+Т:Wsn
дадях,дам.Г+С+Т:D1s
дадяха,дам.Г+С+Т:D3p
дадяхме,дам.Г+С+Т:D1p
дадяхте,дам.Г+С+Т:D2p
дай,дам.Г+С+Т:I2s
дайте,дам.Г+С+Т:I2p
дал,дам.Г+С+Т:Xsm0
дала,дам.Г+С+Т:Xsf0
далата,дам.Г+С+Т:Xsfd
дадат,дам.Г+С+Т:R3p
даде,дам.Г+С+Т:E2s:E3s
даде,дам.Г+С+Т:R3s
дадели,дам.Г+С+Т:Wp
дадем,дам.Г+С+Т:R1p
даден,дам.Г+С+Т:Qsm0
дадена,дам.Г+С+Т:Qsf0
дадената,дам.Г+С+Т:Qsfd
дадени,дам.Г+С+Т:Qp0
дадените,дам.Г+С+Т:Qpd
дадения,дам.Г+С+Т:Qsmh
даденият,дам.Г+С+Т:Qsml
дадено,дам.Г+С+Т:Qsn0
даденото,дам.Г+С+Т:Qsnd
дали,дам.Г+С+Т:Xp0
далите,дам.Г+С+Т:Xpd
далия,дам.Г+С+Т:Xsmh
далият,дам.Г+С+Т:Xsml
дало,дам.Г+С+Т:Xsn0
далото,дам.Г+С+Т:Xsnd
дам,дам.Г+С+Т:R1s
31
31 How is the relation between a canonic form and its subordinate (inflected) forms established?
In the Unitex system, finite-state transducers (FSTs) are used for this. The inflection class code that follows the PoS code in DELAS (the dictionary of lemmas) is used to generate all inflected forms.
One transducer is usually used to generate the forms of many lemmas. For instance, transducer N2 generates inflected forms for emir, evrofil, dijetetičar, forenzičar, leptir, šegrt, and many other lemmas.
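The step from a DELAS lemma to its DELAF forms can be imitated with a plain suffix table. The Python sketch below is only an approximation of the idea: it is not a Unitex inflection transducer, the function name `inflect` and the `CLASSES` table are hypothetical, and the contents listed for N651 are reconstructed from the učiteljica paradigm shown on slide 28 rather than taken from the actual transducer.

```python
# Each rule: (characters to delete from the end of the lemma,
#             suffix to append, grammatical codes of the resulting form).
# Assumption: contents of N651 inferred from the paradigm of učiteljica.
CLASSES = {
    "N651": [
        (1, "a", ["fs1v", "fp2v"]),
        (1, "e", ["fs2v", "fs5v", "fp1v", "fp4v", "fp5v", "fw2v", "fw4v"]),
        (1, "i", ["fs3v", "fs7v"]),
        (1, "u", ["fs4v"]),
        (1, "om", ["fs6v"]),
        (1, "ama", ["fp3v", "fp6v", "fp7v"]),
    ],
}

def inflect(delas_line):
    """Generate DELAF-style lines for one DELAS entry."""
    lemma, _, rest = delas_line.strip().partition(",")
    head, *markers = rest.split("+")
    pos = head.rstrip("0123456789")       # the numeric code is dropped in DELAF
    tags = "+".join([pos] + markers)
    lines = []
    for cut, suffix, codes in CLASSES[head]:
        form = lemma[:len(lemma) - cut] + suffix
        for gc in codes:
            lines.append(f"{form},{lemma}.{tags}:{gc}")
    return lines

for line in inflect("učiteljica,N651+Hum+GM"):
    print(line)
```

Any other lemma assigned to the same class would reuse the same table, which mirrors the point that one transducer serves many lemmas.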
32
32 FST defines classes (BG)
In most languages the relation between a lemma and its forms is an intuitive relation of equivalence that is formalized, in the case of the LADL format, by FSTs:
син/sg,indef синове/pl,indef сине/sg,voc сина/pl,count
син( /sg,indef+ове/pl,indef+е/sg,voc+а/pl,count)
(...?...)( /sg,indef+ове/pl,indef+е/sg,voc+а/pl,count)
N01: ( /sg,indef+ове/pl,indef+е/sg,voc+а/pl,count)
33
33 Dictionaries for other languages
Russian – developed at CIS, Munich (CISLEX-RU); derived mostly from Zaliznyak, A., Grammaticheskij slovar' russkogo jazyka; contains approximately 44,000 lemmas (930,000 forms)
34
34 разговориться,.V+intr+sv:AI
при,пря.N+anim(j)+gen(F):geF:nm:ajm
при,.PRAEP+gov(q)
при,переть.V+nsv+tr:A2eb
первой,первый.A+Ord:geF:deF:teF:qeF
встрече,встреча.N+anim(j)+gen(F):deF:qeF
объявил,объявить.V+sv+tr:AeMVi
нынешним,нынешний.A:teM:teN:dm
летом,.ADV
летом,лето.N+anim(j)+gen(N):teN
летом,лет.N+anim(j)+gen(M):teM
Капе,Капа.N+PN+VORN+anim(o)+gen(M)+style(colloq):deM:qeM
Капе,Капа.N+PN+VORN+anim(o)+gen(F)+style(colloq):deF:qeF
35
35 Капе,Капа.N+PN+VORN+anim(o)+gen(M)+style(colloq):deM:qeM
Капе – word form
Капа – lemma
N – noun
PN – proper noun
VORN – given name
anim(o) – animate
gen(M) – masculine gender
style(colloq)
deM:qeM – dative or prepositional case, singular (e), masculine
36
36 Dictionaries for other languages
Polish (Z. Vetulani, Adam Mickiewicz University, 1996)
marcu,marzec.N+Gi+Ns+Cl
marcu,marzec.N+Gi+Ns+Cv
marcu,marzec.N+month:L
marynarka,.N+Gf+Ns+Cn
masową,masowy.ADJ+Dp+Ns+Cai+Gf
masowe,masowy.ADJ+Dp+Np+Cnav+Gaifn
37
37 Dictionaries for other languages
Latin dictionary derived from the Perseus project, based on the Lewis & Short dictionary (1879)
abaculus,.N:Nms
abacum,abax.N:Gmp
abacum,abacus.N:Ams
abacum,abacus.N+poet:Gmp
abaddon,ab-addo.V:1siPC
abagmentum,.N:Vns
abagmentum,.N:Nns
abagmentum,.N:Ans
38
38 Comparison between the L&S, Georges and Whitaker dictionaries
All three dictionaries are available in electronic form.
L&S supports processing on the Perseus site.
Georges is available online.
Whitaker’s Words is an application that performs morphological analysis.
But their content is different. E.g. abacinus exists in Georges and Whitaker, but not in L&S.
39
39 What else should be known?
The use of upper-case and lower-case letters in a dictionary:
Canonic forms written with lower-case letters can match both lower-case and upper-case occurrences in a text.
Canonic forms written with (some) upper-case letters can match only occurrences in a text that use upper-case letters in those positions.
For instance:
vlada,N600
Vlada,N1741+NProp+Hum+First
Some results from the corpus “Politika”: vlada and Vlada
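The two matching rules can be written down as a small predicate. This is a sketch of the rule as stated on the slide, in Python, and not the lookup code used by Unitex; the function name `entry_matches` is an assumption.

```python
def entry_matches(entry_form, token):
    """True if a dictionary form can match a text token under the case rule:
    a lower-case letter in the entry matches either case in the text,
    an upper-case letter in the entry requires the same upper-case letter."""
    if len(entry_form) != len(token):
        return False
    for e, t in zip(entry_form, token):
        if e.isupper():
            if t != e:
                return False
        elif t.lower() != e:
            return False
    return True

for entry in ("vlada", "Vlada"):
    for token in ("vlada", "Vlada"):
        print(entry, token, entry_matches(entry, token))
```

With this rule the token Vlada is ambiguous between the common noun and the proper name, while the token vlada can only be the common noun.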
40
40 What else should be known, or not?
A user who will not produce a new dictionary (e.g. for a new language, or for some sub-domain) need not know the format of DELAS dictionaries, nor what inflectional transducers are or what they look like.
A user who wants to use dictionaries for text processing needs to know the content of the DELAF dictionaries he/she plans to use and what the different codes and markers mean.
The dictionaries that he/she uses are compiled dictionaries (two files with the extensions .bin and .inf), and Unitex uses them very efficiently. These dictionaries cannot be “seen”.
41
41 E-dictionary as statistical tagger
Filtering the results of word-form tagging by TnT, TreeTagger, etc. with e-dictionaries transforms the results into “real” lemmas (part of the ambiguity is lost, but the result is >95% correct :-)
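One possible reading of this filtering step is sketched below in Python. The slide does not describe the exact procedure, so everything here is an assumption: the function names `build_index` and `filter_tags`, the (token, lemma, PoS) shape of the tagger output, the sample tagger guesses, and the policy of keeping the tagger's lemma when the dictionary offers no candidate with the same PoS.

```python
def build_index(delaf_lines):
    """Map each word form to the set of (lemma, PoS) pairs listed in DELAF."""
    index = {}
    for line in delaf_lines:
        form, _, rest = line.strip().partition(",")
        lemma, _, info = rest.partition(".")
        pos = info.split(":")[0].split("+")[0]
        index.setdefault(form, set()).add((lemma or form, pos))
    return index

def filter_tags(tagged, index):
    """Replace tagger lemmas by dictionary lemmas that agree on the PoS."""
    result = []
    for token, lemma, pos in tagged:
        candidates = {l for l, p in index.get(token, ()) if p == pos}
        if not candidates or lemma in candidates:
            result.append((token, lemma, pos))      # nothing to correct
        else:
            # several dictionary lemmas are kept together, sorted, as a|b
            result.append((token, "|".join(sorted(candidates)), pos))
    return result

delaf = [
    "učiteljicu,učiteljica.N+Hum+GM:fs4v",
    "učiteljice,učiteljica.N+Hum+GM:fs2v",
]
# Hypothetical tagger output with a guessed, non-existent lemma.
tagged = [("učiteljicu", "učiteljic", "N"), ("i", "i", "CONJ")]
print(filter_tags(tagged, build_index(delaf)))
```

Tokens that the dictionary does not know pass through unchanged, so the quality of the result still depends on the coverage of the dictionary.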
42
42 Numbers that illustrate the content of Serbian e-dictionaries
The number of inflection transducers (April 2010):
for nouns: 369
for verbs: 371
for adjectives: 66