Published by Gracie Craze. Modified over 9 years ago.
1
Russian Module for NooJ: design and implementation
Design and implementation of grammatical & lexical resources for the Russian language for Max Silberztein's NooJ software
Vincent BÉNET, INALCO, CREE (computer-assisted research)
NooJ Conference, INALCO, Paris, June 16th, 2012
2
Russian Module for NooJ: design and implementation
- Designing linguistic resources
- Description of the implementation: dictionaries / paradigms / grammars
- Work left to be done…
3
Writing lexical resources for the Russian language
- Build dictionaries from texts
- Create one « small » dictionary and many grammars for derivational forms:
  раб + а (slave)
  раб + от + а + ть (to work)
  за + раб + от + к + а (salary)
- Complete one « big » existing dictionary and create many grammars
4
Writing lexical resources for the Russian language
ZALIZNIAK's grammatical dictionary: 96,000 entries; a complete dictionary, in inverted alphabetical order, with full grammatical annotation.
Example entry, "to obtain, to reach":
Достигать нсв нп 1a$3 (достигнуть//достичь) имеется страд
Dostigat' ipf nt 1a$3 (dostignut'/dostich') has a passive form
5
Writing lexical resources for the Russian language
Problems encountered:
- The classification is complete, but some tags are absent (V, N…)
- The classification is based on accent markers
- Many informal, unclassified annotations have been added
The problem of accent markers was deferred.
Zalizniak's dictionary was re-sorted; its classification was modified, simplified and completed for computer use.
6
The design of lexical resources for the Russian language consisted of:
1. creating grammatical tags
2. recoding the dictionary with these tags
3. sorting the dictionary (inverted alphabetical order for each word)
4. fixing a paradigm model list (karta instead of zh1a)
5. writing paradigms
6. handling the problem of ё / е
7. allocating models to the words
8. verifying the results
9. testing with texts
10. correcting and proofreading
7
Writing lexical resources for Russian
1. Creating tags and properties
N, A, V, ADV …
A_Forme = fc | fl | adv ;
A_Genre = m | f | n ;
A_SGenr = an | inan ;
A_Nombre = s | p ;
A_Cas = Im | Vi | Ro | Da | Tv | Pr | Zv ;
A_Deg = Comp | Sup ;
ADV_Deg = Comp ;
V_Pers = 1 | 2 | 3 ;
V_Asp = Ipf | Pf ;
V_Type = Mvt ;
V_Morph = Pvb | Simp | Sufx | PvbSufx ;
V_SsAsp = Det | Indet ;
V_Temps = Pre | Pa | Fu ;
V_Mode = Inf | Ind | Imp | Cond | Ger | Prtp ;
V_Voix = Act | Pss ;
V_Genre = m | f | n ;
V_Nombre = s | p ;
V_Constr = intr | tr | sja ;
V_Cas = Im | Vi | Ro | Da | Tv | Pr ;
8
Writing lexical resources for Russian
2. Recoding the dictionary
3. Sorting the dictionary to get inverted alphabetical ordering
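Inverted alphabetical ordering compares words starting from their last letter, which groups together words that share an ending and therefore, often, a paradigm. The following is a minimal Python sketch of that idea, not the procedure actually used for the module; the function name `reverse_key` and the sample word list are chosen here for illustration.

```python
def reverse_key(word: str) -> str:
    """Sort key for inverted ordering: the word read from its last letter."""
    return word[::-1]


# Sample lemmas (illustrative only).
words = ["карта", "книга", "собака", "улица", "линия", "статуя"]

# Note: plain code-point comparison is used here; it places "ё" after "я",
# so a real dictionary sort would need a proper Russian collation.
inverted = sorted(words, key=reverse_key)

for w in inverted:
    print(w)
```

Words ending in -га, -ка, -та and -ца end up next to each other, followed by those in -ия and -уя.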
9
Writing lexical resources for Russian
4. Paradigm model list
#j1a=karta #jo1a=korova
#j2a=nedelja #jo2a=boginja
#j3a=kniga #jo3a=sobaka
#j4a=tuča #jo4a=kassirša
#j5a=ulica #jo5a=volčica
#j6a=statuja #jo6a=feja
#j7a=linija #jo7a=furija
5. Writing paradigms
карта = /Im+f+s + у/Vi+f+s + ы/Ro+f+s + е/Da+f+s + ой/Tv+f+s + е/Pr+f+s +
ы/Im+f+p + ы/Vi+f+p + /Ro+f+p + ам/Da+f+p + ами/Tv+f+p + ах/Pr+f+p ;
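A paradigm such as the one for карта can be read as a list of endings, each paired with a bundle of tags. The sketch below applies such a list to a lemma in Python. It is not NooJ's inflection engine: the slide text does not show how many letters of the lemma each ending replaces, so the `strip` counts are an assumption made for illustration, and the names `Ending`, `KARTA` and `inflect` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Ending(NamedTuple):
    strip: int   # ASSUMPTION: number of final letters removed from the lemma
    suffix: str  # ending taken from the slide
    tags: str    # tags taken from the slide


# Endings and tags copied from the paradigm for "карта"; strip counts assumed.
KARTA = [
    Ending(0, "", "Im+f+s"), Ending(1, "у", "Vi+f+s"), Ending(1, "ы", "Ro+f+s"),
    Ending(1, "е", "Da+f+s"), Ending(1, "ой", "Tv+f+s"), Ending(1, "е", "Pr+f+s"),
    Ending(1, "ы", "Im+f+p"), Ending(1, "ы", "Vi+f+p"), Ending(1, "", "Ro+f+p"),
    Ending(1, "ам", "Da+f+p"), Ending(1, "ами", "Tv+f+p"), Ending(1, "ах", "Pr+f+p"),
]


def inflect(lemma: str, paradigm: List[Ending]) -> List[Tuple[str, str]]:
    """Return (form, tags) pairs obtained by applying each ending to the lemma."""
    forms = []
    for ending in paradigm:
        stem = lemma[: len(lemma) - ending.strip]
        forms.append((stem + ending.suffix, ending.tags))
    return forms


for form, tags in inflect("карта", KARTA):
    print(form, tags)
```

Any other noun allocated to the same model, for example лампа, can be passed to `inflect` with the same list.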
10
Writing lexical resources for Russian
5. Paradigm for verbs
взять = /Inf |
озьму/1+s+Pre | озьмешь/2+s+Pre | озьмет/3+s+Pre | озьмем/1+p+Pre | озьмете/2+p+Pre |
озьмёшь/2+s+Pre | озьмёт/3+s+Pre | озьмём/1+p+Pre | озьмёте/2+p+Pre | озьмут/3+p+Pre |
л/m+s+Pa | ла/f+s+Pa | ло/n+s+Pa | ли/p+Pa |
озьми/2+s+Imp | озьмите/2+p+Imp |
в/Ger | вши/Ger |
вший/Prtp+Pa+Act+m+s+Im | вший/Prtp+Pa+Act+m+s+Vi | вшего/Prtp+Pa+Act+m+an+s+Vi | вшего/Prtp+Pa+Act+m+s+Ro | вшему/Prtp+Pa+Act+m+s+Da | вшим/Prtp+Pa+Act+m+s+Tv | вшем/Prtp+Pa+Act+m+s+Pr |
вшая/Prtp+Pa+Act+f+s+Im | вшую/Prtp+Pa+Act+f+s+Vi | вшей/Prtp+Pa+Act+f+s+Ro | вшей/Prtp+Pa+Act+f+s+Da | вшей/Prtp+Pa+Act+f+s+Tv | вшею/Prtp+Pa+Act+f+s+Tv | вшей/Prtp+Pa+Act+f+s+Pr |
вшее/Prtp+Pa+Act+n+s+Im | вшее/Prtp+Pa+Act+n+s+Vi | вшего/Prtp+Pa+Act+n+s+Vi | вшего/Prtp+Pa+Act+n+s+Ro | вшему/Prtp+Pa+Act+n+s+Da | вшим/Prtp+Pa+Act+n+s+Tv | вшем/Prtp+Pa+Act+n+s+Pr |
вшие/Prtp+Pa+Act+p+Im | вшие/Prtp+Pa+Act+p+Vi | вших/Prtp+Pa+Act+an+p+Vi | вших/Prtp+Pa+Act+p+Ro | вшим/Prtp+Pa+Act+p+Da | вшими/Prtp+Pa+Act+p+Tv | вших/Prtp+Pa+Act+p+Pr |
тый/Prtp+Pa+Pss+m+s+Im | тый/Prtp+Pa+Pss+m+s+Vi | того/Prtp+Pa+Pss+m+an+s+Vi | того/Prtp+Pa+Pss+m+s+Ro | тому/Prtp+Pa+Pss+m+s+Da | тым/Prtp+Pa+Pss+m+s+Tv | том/Prtp+Pa+Pss+m+s+Pr |
тая/Prtp+Pa+Pss+f+s+Im | тую/Prtp+Pa+Pss+f+s+Vi | той/Prtp+Pa+Pss+f+s+Ro | той/Prtp+Pa+Pss+f+s+Da | той/Prtp+Pa+Pss+f+s+Tv | тою/Prtp+Pa+Pss+f+s+Tv | той/Prtp+Pa+Pss+f+s+Pr |
тое/Prtp+Pa+Pss+n+s+Im | тое/Prtp+Pa+Pss+n+s+Vi | того/Prtp+Pa+Pss+n+s+Ro | тому/Prtp+Pa+Pss+n+s+Da | тым/Prtp+Pa+Pss+n+s+Tv | том/Prtp+Pa+Pss+n+s+Pr |
тые/Prtp+Pa+Pss+p+Im | тые/Prtp+Pa+Pss+p+Vi | тых/Prtp+Pa+Pss+an+p+Vi | тых/Prtp+Pa+Pss+p+Ro | тым/Prtp+Pa+Pss+p+Da | тыми/Prtp+Pa+Pss+p+Tv | тых/Prtp+Pa+Pss+p+Pr |
т/Prtp+Pa+Pss+m+s+fc | та/Prtp+Pa+Pss+f+s+fc | то/Prtp+Pa+Pss+n+s+fc | ты/Prtp+Pa+Pss+p+fc ;
11
Writing lexical resources for Russian
6. Problem of the letter ё / е (partially solved: two entries or two paradigms)
ёжик,N+m+an+FLX=бульдог
ёж,N+m+an+FLX=богач
ежик,N+m+an+FLX=бульдог
еж,N+m+an+FLX=богач
жевать = /Inf | ую/1+s+Pre | уёшь/2+s+Pre | уёт/3+s+Pre | уём/1+p+Pre | уёте/2+p+Pre | уешь/2+s+Pre | ует/3+s+Pre | уем/1+p+Pre | уете/2+p+Pre | уют/3+p+Pre
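The module handles ё / е by doubling entries or paradigms. For comparison only, a different technique is to fold ё to е at lookup time, so that a text written without ё still reaches the entry spelled with ё. The Python sketch below shows that alternative; it is not the approach described on the slide, and `fold_yo`, `build_index` and `lookup` are hypothetical names.

```python
from typing import Dict, Iterable, List, Set


def fold_yo(word: str) -> str:
    """Replace ё/Ё by е/Е so both spellings share one lookup key."""
    return word.replace("ё", "е").replace("Ё", "Е")


def build_index(lemmas: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each folded spelling to the set of dictionary lemmas it may stand for."""
    index: Dict[str, Set[str]] = {}
    for lemma in lemmas:
        index.setdefault(fold_yo(lemma), set()).add(lemma)
    return index


def lookup(token: str, index: Dict[str, Set[str]]) -> List[str]:
    """Return the candidate lemmas for a token, whichever spelling it uses."""
    return sorted(index.get(fold_yo(token), ()))


index = build_index(["ёж", "ёжик"])
print(lookup("еж", index))    # token written without ё
print(lookup("ёжик", index))  # token written with ё
```

The trade-off is that folding can merge words that differ only by ё / е, which the two-entry solution keeps apart.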
12
Writing lexical resources for Russian
7. Allocating models to words
abažur,N+m+inan+FLX=zavod
abazinec,N+m+an+FLX=ukrainec
abazin,N+m+an+FLX=artist
abaz,N+m+inan+FLX=zavod
abak,N+m+inan+FLX=čajnik
abbat,N+m+an+FLX=artist
8. Verifying paradigms
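Each dictionary line above pairs a lemma with a category, a few features and the name of its inflectional model after `FLX=`. As a minimal sketch for checking such allocations outside NooJ, the Python below splits a line into those parts; the `Entry` class and `parse_entry` function are hypothetical helpers, and only the line shape shown on the slide is assumed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    lemma: str
    category: str
    features: List[str]
    paradigm: Optional[str]  # model name after "FLX=", if any


def parse_entry(line: str) -> Entry:
    """Parse a line of the form 'lemma,CAT+feat+...+FLX=model'."""
    lemma, _, info = line.strip().partition(",")
    parts = info.split("+")
    category = parts[0]
    features: List[str] = []
    paradigm: Optional[str] = None
    for part in parts[1:]:
        if part.startswith("FLX="):
            paradigm = part[len("FLX="):]
        else:
            features.append(part)
    return Entry(lemma, category, features, paradigm)


lines = [
    "abažur,N+m+inan+FLX=zavod",
    "abazinec,N+m+an+FLX=ukrainec",
    "abbat,N+m+an+FLX=artist",
]
for entry in map(parse_entry, lines):
    print(entry.lemma, entry.category, entry.features, entry.paradigm)
```

With entries in this form, grouping lemmas by `paradigm` gives a quick view of which words share a model.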
13
Writing lexical resources for Russian
9. Testing with Russian texts:
- « The Nose » by Gogol
- « The Gambler » by Dostoevsky
- « The Prisoner of the Caucasus » by Tolstoy
- « The Lady with the Dog » by Chekhov
- « Short stories » by Harms
14
Writing lexical resources for Russian
10. Correcting errors:
- bad encoding (mixed Latin / Cyrillic letters): A B E K M H O P C y X, e.g. MOCKBA
- errors in paradigms
- bad allocation of models to words (mobile vowel / palatalization)
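Mixed-script errors are hard to see because the Latin letters listed above look like Cyrillic ones. A minimal Python sketch of an automatic check follows; it relies on Unicode character names from the standard `unicodedata` module. The helper names `scripts`, `is_mixed` and `latin_lookalike` are hypothetical, and the lookalike set is simply the list of letters given on the slide.

```python
import unicodedata
from typing import Set

# Latin letters from the slide that resemble Cyrillic letters.
LOOKALIKES = set("ABEKMHOPCyX")


def scripts(word: str) -> Set[str]:
    """Return the scripts (Latin, Cyrillic) of the letters found in the word."""
    found = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("CYRILLIC"):
                found.add("Cyrillic")
            elif name.startswith("LATIN"):
                found.add("Latin")
    return found


def is_mixed(word: str) -> bool:
    """True when a single word contains both Latin and Cyrillic letters."""
    return len(scripts(word)) > 1


def latin_lookalike(word: str) -> bool:
    """True when a word is entirely Latin but uses only Cyrillic-looking letters."""
    letters = [ch for ch in word if ch.isalpha()]
    return (bool(letters) and scripts(word) == {"Latin"}
            and all(ch in LOOKALIKES for ch in letters))


mixed = "\u041cO\u0421\u041a\u0412\u0410"  # Cyrillic М, Latin O, Cyrillic СКВА
print(is_mixed(mixed))
print(latin_lookalike("MOCKBA"))  # typed entirely with Latin letters
```

Words flagged by either function are candidates for manual correction; an all-Latin lookalike may also be a genuine foreign word.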
15
Improving lexical resources
- useless words, a source of unnecessary ambiguities: the names of letters (а, б, в, и, к, о, с, у, я); archaic, unused words
- repetitions of the same word in different parts of speech (adjectives / nouns; adjectives / pronouns; interjections / particles / parenthetical words)
Increase the number of different models, to avoid generating unexpected or incongruous forms or failing to recognize existing forms?
Читав? (Čitav?) Пиша? (Piša?) Счастие? (Sčastie?)
Suppress word entries and / or forms?
16
Available lexical resources for Russian
1 COMPILED BASIC DICTIONARY containing:
- 1 dictionary of 45,000 nouns (350 paradigms)
- 1 dictionary of 20,000 adjectives (50 paradigms)
- 1 dictionary of 25,000 verbs (600 paradigms)
- 1 dictionary of 880 prepositions & conjunctions, numerals, pronouns, 1600 adverbs, parenthetical words, etc.
2 COMPILED ADDITIONAL DICTIONARIES (optional use):
- 1 dictionary of proper nouns (cities, countries, rivers…, first names with diminutives)
- 1 dictionary of substantives-adjectives
17
Writing Russian grammars for NooJ
Designing disambiguation grammars for:
- grammatical agreement between adjectives & nouns
- case usage with numerals
- case usage with prepositions
- case usage with verbs
- date and time expressions
- adverbial phrases of time, place…
- idiomatic structures (my name is…, I'm … old)
- verbs of motion
Designing grammars to locate syntagms
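The first item in the list, adjective-noun agreement, can be illustrated outside NooJ. In NooJ this is done with grammars; the Python sketch below only mimics the underlying filter: of all analyses of an adjective and of the following noun, keep the pairs that share case and number, and gender in the singular. The tag names come from the property definitions earlier in the deck; the function names are hypothetical, and the analyses listed for новой and карты are written by hand for the example, not produced by the module.

```python
from typing import Dict, List, Tuple

CASES = {"Im", "Vi", "Ro", "Da", "Tv", "Pr"}
GENDERS = {"m", "f", "n"}
NUMBERS = {"s", "p"}


def parse(tags: str) -> Dict[str, str]:
    """Turn 'Ro+f+s' into {'case': 'Ro', 'gender': 'f', 'number': 's'}."""
    feats: Dict[str, str] = {}
    for tag in tags.split("+"):
        if tag in CASES:
            feats["case"] = tag
        elif tag in GENDERS:
            feats["gender"] = tag
        elif tag in NUMBERS:
            feats["number"] = tag
    return feats


def agrees(adj: Dict[str, str], noun: Dict[str, str]) -> bool:
    """Case and number must match; gender must match in the singular."""
    if adj.get("case") != noun.get("case"):
        return False
    if adj.get("number") != noun.get("number"):
        return False
    if adj.get("number") == "s":
        return adj.get("gender") == noun.get("gender")
    return True


def disambiguate(adj_analyses: List[str], noun_analyses: List[str]) -> List[Tuple[str, str]]:
    """Keep only the (adjective, noun) analysis pairs that agree."""
    return [(a, n) for a in adj_analyses for n in noun_analyses
            if agrees(parse(a), parse(n))]


# Hand-written analyses for "новой карты" (illustrative).
novoj = ["Ro+f+s", "Da+f+s", "Tv+f+s", "Pr+f+s"]
karty = ["Ro+f+s", "Im+f+p", "Vi+f+p"]
print(disambiguate(novoj, karty))
```

Four adjective readings and three noun readings give twelve combinations, of which only the genitive singular pair survives the filter.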
18
Writing Russian grammars for NooJ: syntactic grammar for Russian
19
Writing Russian grammars for NooJ: syntactic grammar for Russian
20
Grammar to locate the verbs of motion
21
Grammar to locate the verbs of motion
22
The prepositions in Russian
23
The disambiguation of « NA » (on, onto)
24
Annotating and disambiguating texts: the text with its ambiguities
25
Verifying grammars: the text disambiguated with the grammar of « NA »
26
The disambiguation of « V » (in, into)
27
Russian grammars for NooJ
All these grammars need improvement.
They are very sensitive to syntactic order:
- they fail to recognize structures when the word order of a Russian sentence is unusual (expressive or non-standard)
There are no grammars (yet):
- to disambiguate adverbs / adjectives
- to disambiguate adjectives / nouns
- to disambiguate conjunctions / interjections
28
To get reliable resources for the Russian language, the job left to be done is to design and implement:
- a data bank of verified and annotated texts
- efficient syntactic grammars
- semantic tagging
- unified or harmonized tags for (Slavic, Romance, Germanic, etc.) languages, to allow further multilingual processing
29
Russian Module for NooJ
http://www.nooj4nlp.net/pages/russian.html
30
Russian Module for NooJ: design and implementation
Thank you for your attention
vincent.benet@inalco.fr