
1 NOOJ Conference Inalco, Paris June 16th, 2012 Vincent BÉNET INALCO CREE Recherche assistée par ordinateur Conception and realization of grammatical &


1 Russian Module for NooJ: design and implementation
NooJ Conference, INALCO, Paris, June 16th, 2012
Vincent BÉNET, INALCO, CREE, Recherche assistée par ordinateur (computer-assisted research)
Design and implementation of grammatical & lexical resources for the Russian language for Max Silberztein’s NooJ software

2 Russian Module for NooJ: design and implementation
- Designing linguistic resources
- Description of the implementation: dictionaries / paradigms / grammars
- Work left to be done…

3 Writing lexical resources for the Russian language
- Build dictionaries from texts
- Create one « small » dictionary and many grammars for derivational forms: раб + а (slave), раб + от + а + ть (to work), за + раб + от + к + а (salary)
- Complete one « big » existing dictionary and create many grammars

4 Writing lexical resources for the Russian language
ZALIZNIAK’s grammatical dictionary: a complete dictionary of 96,000 entries, in inverted alphabetical order, with full grammatical annotation.
Example entry (to obtain, to reach): Достигать нсв нп 1a$3 (достигнуть//достичь) имеется страд
Transliterated and glossed: Dostigat’ ipf nt 1a$3 (dostignut’/dostich’), has a passive form

5 Writing lexical resources for the Russian language
Problems encountered:
- The classification is complete, but some tags are absent (V, N…)
- The classification is based on accent markers
- Many informal, unclassified annotations have been added
Zalizniak’s dictionary was re-sorted, and its classification was modified, simplified and completed for computer use. The problem of accent markers was postponed.

6 The design of lexical resources for the Russian language consisted of:
1. creating grammatical tags
2. recoding the dictionary with these tags
3. sorting the dictionary (inverted alphabetical order for each word)
4. fixing a paradigm model list (karta instead of zh1a)
5. writing paradigms
6. handling the problem with ё / е
7. allocating models to the words
8. verifying the results
9. testing with texts
10. correcting and proofreading

7 Writing lexical resources for Russian
1. Creating tags and properties
N, A, V, ADV…
A_Forme = fc | fl | adv;
A_Genre = m | f | n;
A_SGenr = an | inan;
A_Nombre = s | p;
A_Cas = Im | Vi | Ro | Da | Tv | Pr | Zv;
A_Deg = Comp | Sup;
ADV_Deg = Comp;
V_Pers = 1 | 2 | 3;
V_Asp = Ipf | Pf;
V_Type = Mvt;
V_Morph = Pvb | Simp | Sufx | PvbSufx;
V_SsAsp = Det | Indet;
V_Temps = Pre | Pa | Fu;
V_Mode = Inf | Ind | Imp | Cond | Ger | Prtp;
V_Voix = Act | Pss;
V_Genre = m | f | n;
V_Nombre = s | p;
V_Constr = intr | tr | sja;
V_Cas = Im | Vi | Ro | Da | Tv | Pr;
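The property declarations on this slide can also serve as a checklist when proofreading paradigms. The following is a minimal Python sketch, not part of NooJ: the function names parse_properties and unknown_values are hypothetical, and the sketch assumes that every value used in a paradigm for category V must appear in some V_ property.

```python
def parse_properties(text):
    """Turn 'V_Genre = m | f | n ; ...' declarations into {name: set of values}."""
    props = {}
    for decl in text.split(";"):
        if "=" not in decl:
            continue  # skip empty fragments left over after the final ';'
        name, values = decl.split("=", 1)
        props[name.strip()] = {v.strip() for v in values.split("|")}
    return props


def unknown_values(category, tags, props):
    """Return the tags that no property of the given category declares."""
    prefix = category + "_"
    allowed = set()
    for name, values in props.items():
        if name.startswith(prefix):
            allowed |= values
    return [tag for tag in tags if tag not in allowed]


declarations = "V_Genre = m | f | n ; V_Nombre = s | p ; V_Temps = Pre | Pa | Fu ;"
props = parse_properties(declarations)
# 'mo' is not a declared gender value, so it is reported.
print(unknown_values("V", ["mo", "s", "Pa"], props))
```

A check of this kind only catches undeclared values; it does not tell whether a declared value is the right one for a given form.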

8 Writing lexical resources for Russian
2. Recoding the dictionary
3. Sorting the dictionary to get inverted alphabetical ordering
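As an illustration of step 3, here is a small Python sketch. It assumes that "inverted alphabetical order" means the usual reverse-dictionary order, in which words are compared from their last letter backwards; the function names are hypothetical, and the sample lemmas are the transliterated model words from the paradigm list.

```python
def inverted_key(word):
    """Sort key for reverse-dictionary order: compare words from their last letter."""
    return word[::-1]


def sort_inverted(words):
    """Return the words in inverted alphabetical order."""
    return sorted(words, key=inverted_key)


lemmas = ["karta", "kniga", "zavod", "artist", "korova"]
print(sort_inverted(lemmas))
```

Words with the same ending end up next to each other, which is what makes this ordering convenient for assigning inflection models. Note that the sketch compares raw code points; for Cyrillic text the letter ё would need a proper collation key.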

9 Writing lexical resources for Russian
4. Paradigm model list
#j1a=karta #jo1a=korova
#j2a=nedelja #jo2a=boginja
#j3a=kniga #jo3a=sobaka
#j4a=tuča #jo4a=kassirša
#j5a=ulica #jo5a=volčica
#j6a=statuja #jo6a=feja
#j7a=linija #jo7a=furija
5. Writing paradigms
карта = /Im+f+s + у/Vi+f+s + ы/Ro+f+s + е/Da+f+s + ой/Tv+f+s + е/Pr+f+s + ы/Im+f+p + ы/Vi+f+p + /Ro+f+p + ам/Da+f+p + ами/Tv+f+p + ах/Pr+f+p ;

10 Writing lexical resources for Russian
5. Paradigm for verbs
взять = /Inf |
озьму/1+s+Pre | озьмешь/2+s+Pre | озьмет/3+s+Pre | озьмем/1+p+Pre | озьмете/2+p+Pre |
озьмёшь/2+s+Pre | озьмёт/3+s+Pre | озьмём/1+p+Pre | озьмёте/2+p+Pre | озьмут/3+p+Pre |
л/m+s+Pa | ла/f+s+Pa | ло/n+s+Pa | ли/p+Pa |
озьми/2+s+Imp | озьмите/2+p+Imp |
в/Ger | вши/Ger |
вший/Prtp+Pa+Act+m+s+Im | вший/Prtp+Pa+Act+m+s+Vi | вшего/Prtp+Pa+Act+m+an+s+Vi | вшего/Prtp+Pa+Act+m+s+Ro | вшему/Prtp+Pa+Act+m+s+Da | вшим/Prtp+Pa+Act+m+s+Tv | вшем/Prtp+Pa+Act+m+s+Pr |
вшая/Prtp+Pa+Act+f+s+Im | вшую/Prtp+Pa+Act+f+s+Vi | вшей/Prtp+Pa+Act+f+s+Ro | вшей/Prtp+Pa+Act+f+s+Da | вшей/Prtp+Pa+Act+f+s+Tv | вшею/Prtp+Pa+Act+f+s+Tv | вшей/Prtp+Pa+Act+f+s+Pr |
вшее/Prtp+Pa+Act+n+s+Im | вшее/Prtp+Pa+Act+n+s+Vi | вшего/Prtp+Pa+Act+n+s+Vi | вшего/Prtp+Pa+Act+n+s+Ro | вшему/Prtp+Pa+Act+n+s+Da | вшим/Prtp+Pa+Act+n+s+Tv | вшем/Prtp+Pa+Act+n+s+Pr |
вшие/Prtp+Pa+Act+p+Im | вшие/Prtp+Pa+Act+p+Vi | вших/Prtp+Pa+Act+an+p+Vi | вших/Prtp+Pa+Act+p+Ro | вшим/Prtp+Pa+Act+p+Da | вшими/Prtp+Pa+Act+p+Tv | вших/Prtp+Pa+Act+p+Pr |
тый/Prtp+Pa+Pss+m+s+Im | тый/Prtp+Pa+Pss+m+s+Vi | того/Prtp+Pa+Pss+m+an+s+Vi | того/Prtp+Pa+Pss+m+s+Ro | тому/Prtp+Pa+Pss+m+s+Da | тым/Prtp+Pa+Pss+m+s+Tv | том/Prtp+Pa+Pss+m+s+Pr |
тая/Prtp+Pa+Pss+f+s+Im | тую/Prtp+Pa+Pss+f+s+Vi | той/Prtp+Pa+Pss+f+s+Ro | той/Prtp+Pa+Pss+f+s+Da | той/Prtp+Pa+Pss+f+s+Tv | тою/Prtp+Pa+Pss+f+s+Tv | той/Prtp+Pa+Pss+f+s+Pr |
тое/Prtp+Pa+Pss+n+s+Im | тое/Prtp+Pa+Pss+n+s+Vi | того/Prtp+Pa+Pss+n+s+Ro | тому/Prtp+Pa+Pss+n+s+Da | тым/Prtp+Pa+Pss+n+s+Tv | том/Prtp+Pa+Pss+n+s+Pr |
тые/Prtp+Pa+Pss+p+Im | тые/Prtp+Pa+Pss+p+Vi | тых/Prtp+Pa+Pss+an+p+Vi | тых/Prtp+Pa+Pss+p+Ro | тым/Prtp+Pa+Pss+p+Da | тыми/Prtp+Pa+Pss+p+Tv | тых/Prtp+Pa+Pss+p+Pr |
т/Prtp+Pa+Pss+m+s+fc | та/Prtp+Pa+Pss+f+s+fc | то/Prtp+Pa+Pss+n+s+fc | ты/Prtp+Pa+Pss+p+fc;
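To inspect paradigm rules such as the ones for карта and взять outside NooJ, they can be split into (ending, features) pairs. The Python sketch below is only an illustration and works on the rules as they appear in this transcript: parse_paradigm is a hypothetical helper, alternatives are assumed to be separated by " + " or " | " surrounded by spaces, and the sketch deliberately does not attach endings to a lemma, because the operators that say how much of the lemma to remove are not visible in the transcript.

```python
import re


def parse_paradigm(rule):
    """Split 'name = ending/Tag+Tag + ending/Tag ;' into (name, [(ending, [tags])])."""
    name, body = rule.split("=", 1)
    body = body.strip().rstrip(";").strip()
    entries = []
    # Alternatives are separated by '+' or '|' with spaces around them;
    # a '+' without spaces joins features inside one alternative.
    for alternative in re.split(r"\s+[+|]\s+", body):
        ending, _, tags = alternative.strip().partition("/")
        entries.append((ending, tags.split("+")))
    return name.strip(), entries


name, entries = parse_paradigm("карта = /Im+f+s + у/Vi+f+s + ы/Ro+f+s ;")
for ending, tags in entries:
    print(repr(ending), tags)
```

Once a rule is in this form, simple checks become possible, for example counting how many alternatives carry each case tag.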

11 Writing lexical resources for Russian
6. Problem of the letter ё / е (partially solved: two entries or two paradigms)
ёжик,N+m+an+FLX=бульдог
ёж,N+m+an+FLX=богач
ежик,N+m+an+FLX=бульдог
еж,N+m+an+FLX=богач
жевать = /Inf | ую/1+s+Pre | уёшь/2+s+Pre | уёт/3+s+Pre | уём/1+p+Pre | уёте/2+p+Pre | уешь/2+s+Pre | ует/3+s+Pre | уем/1+p+Pre | уете/2+p+Pre | уют/3+p+Pre
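The "two entries" solution can be produced mechanically. The Python sketch below is an assumption about how one might do it, not the method used for the module: with_e_variants is a hypothetical helper that, for every dictionary line whose lemma contains ё, adds a second line with е in its place.

```python
def with_e_variants(entries):
    """Return the entries, adding an е-spelled copy after each lemma that contains ё."""
    result = []
    for entry in entries:
        lemma, comma, info = entry.partition(",")
        result.append(entry)
        variant = lemma.replace("ё", "е").replace("Ё", "Е")
        if variant != lemma:
            # Same grammatical information and model, alternative spelling.
            result.append(variant + comma + info)
    return result


for line in with_e_variants(["ёж,N+m+an+FLX=богач", "абак,N+m+inan+FLX=чайник"]):
    print(line)
```

Doubling entries in this way keeps the paradigms unchanged, at the cost of a larger dictionary; the alternative shown for жевать lists both spellings inside the paradigm instead.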

12 Writing lexical resources for Russian
7. Allocating models to words
abažur,N+m+inan+FLX=zavod
abazinec,N+m+an+FLX=ukrainec
abazin,N+m+an+FLX=artist
abaz,N+m+inan+FLX=zavod
abak,N+m+inan+FLX=čajnik
abbat,N+m+an+FLX=artist
8. Verifying paradigms
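For the verification step, it can help to read dictionary lines back into a structured form and list which lemmas share a model. The Python sketch below assumes only the line format visible on this slide (lemma, category, features, FLX=model); Entry, parse_entry and lemmas_by_model are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    lemma: str
    category: str
    features: Tuple[str, ...]
    model: Optional[str]


def parse_entry(line):
    """Parse 'lemma,CAT+feat+...+FLX=model' into an Entry."""
    lemma, _, info = line.strip().partition(",")
    fields = info.split("+")
    features = []
    model = None
    for field in fields[1:]:
        if field.startswith("FLX="):
            model = field[len("FLX="):]
        else:
            features.append(field)
    return Entry(lemma, fields[0], tuple(features), model)


def lemmas_by_model(lines):
    """Group lemmas under the inflection model they were allocated."""
    groups = defaultdict(list)
    for line in lines:
        entry = parse_entry(line)
        groups[entry.model].append(entry.lemma)
    return dict(groups)


sample = ["abažur,N+m+inan+FLX=zavod", "abazin,N+m+an+FLX=artist", "abaz,N+m+inan+FLX=zavod"]
print(lemmas_by_model(sample))
```

Listing the lemmas per model makes it easier to spot a word that was given a model of the wrong animacy or gender.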

13 Writing lexical resources for Russian
9. Testing with Russian texts:
- « The nose » by Gogol
- « The gambler » by Dostoievsky
- « The Prisoner of the Caucasus » by Tolstoy
- « The lady with the dog » by Chekhov
- « Short stories » by Harms

14 Writing lexical resources for Russian
10. Correcting errors:
- bad encoding (mixed Latin/Cyrillic letters): A B E K M H O P C y X, MOCKBA
- errors in paradigms
- bad allocation of models to words: mobile vowel / palatalization
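Mixed Latin/Cyrillic spellings are invisible to the eye but easy to detect by program. The Python sketch below is one possible check, not the procedure actually used: scripts and is_mixed are hypothetical helpers that rely on the Unicode character names returned by the standard unicodedata module.

```python
import unicodedata


def scripts(word):
    """Return the set of scripts ('Latin', 'Cyrillic') used by the letters of a word."""
    found = set()
    for char in word:
        if not char.isalpha():
            continue
        name = unicodedata.name(char, "")
        if name.startswith("CYRILLIC"):
            found.add("Cyrillic")
        elif name.startswith("LATIN"):
            found.add("Latin")
    return found


def is_mixed(word):
    """True when a single word contains both Latin and Cyrillic letters."""
    return scripts(word) == {"Latin", "Cyrillic"}


# Cyrillic 'до' + Latin 'c' + Cyrillic 'тигнуть': looks like достигнуть, but is not.
suspect = "\u0434\u043e" + "c" + "\u0442\u0438\u0433\u043d\u0443\u0442\u044c"
print(is_mixed(suspect), scripts("MOCKBA"))
```

A word typed entirely with Latin look-alikes, such as MOCKBA, is not mixed; in a Russian dictionary it can still be flagged because its script set is Latin only.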

15 Improving lexical resources
- useless words, a source of unnecessary ambiguities: the names of letters (а, б, в, и, к, о, с, у, я); archaic, unused words
- repetitions of the same word in different parts of speech (adjectives / nouns; adjectives / pronouns; interjections / particles / parenthetical words)
Increase the number of different models? This would avoid generating unexpected or incongruous forms, or failing to recognize existing forms: Читав? (Čitav?) Пиша? (Piša?) Счастие? (Ŝastie?)
Suppress word entries and / or forms?

16 Available lexical resources for Russian
1 compiled basic dictionary containing:
- 1 dictionary of 45,000 nouns (350 paradigms)
- 1 dictionary of 20,000 adjectives (50 paradigms)
- 1 dictionary of 25,000 verbs (600 paradigms)
- 1 dictionary of 880 prepositions & conjunctions, numerals, pronouns, 1600 adverbs, parenthetical words, etc.
2 compiled additional dictionaries (optional use):
- 1 dictionary of proper nouns (cities, countries, rivers…, first names with diminutives)
- 1 dictionary of substantives-adjectives

17 Writing Russian grammars for NooJ
Designing disambiguation grammars for:
- grammatical agreement between adjectives & nouns
- case usage with numerals
- case usage with prepositions
- case usage with verbs
- date and time expressions
- adverbial phrases of time, place…
- idiomatic structures (my name is…, I’m… old)
- verbs of motion
Designing grammars to locate syntagms
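The idea behind an agreement-based disambiguation grammar can be shown with a toy example. The Python sketch below is not a NooJ grammar; it only illustrates the principle under simplifying assumptions: each word form comes with a list of candidate analyses, and only adjective-noun pairs that agree in gender, number and case are kept. The names agree, keep_agreeing and analysis are hypothetical; the case tags are those of the tag list, and the noun analyses follow the карта paradigm.

```python
from itertools import product

AGREEMENT_KEYS = ("gender", "number", "case")


def agree(adjective, noun):
    """True when the two analyses share gender, number and case."""
    return all(adjective[key] == noun[key] for key in AGREEMENT_KEYS)


def keep_agreeing(adjective_analyses, noun_analyses):
    """Keep only the (adjective, noun) analysis pairs that agree."""
    return [
        (adj, noun)
        for adj, noun in product(adjective_analyses, noun_analyses)
        if agree(adj, noun)
    ]


def analysis(case, gender="f", number="s"):
    return {"case": case, "gender": gender, "number": number}


# 'новой' is ambiguous between four feminine singular cases,
# 'карте' between dative and prepositional singular.
novoj = [analysis(case) for case in ("Ro", "Da", "Tv", "Pr")]
karte = [analysis(case) for case in ("Da", "Pr")]
print([adj["case"] for adj, _ in keep_agreeing(novoj, karte)])
```

For новой карте the four adjective readings drop to two; a preposition or verb grammar would then be needed to choose between dative and prepositional. The sketch ignores the fact that Russian plural adjectives do not mark gender.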

18 Writing Russian grammars for NooJ: syntactic grammar for Russian

19 Writing Russian grammars for NooJ: syntactic grammar for Russian

20 Grammar to locate the verbs of motion

21 Grammar to locate the verbs of motion

22 The prepositions in Russian

23 The disambiguation of « NA » (on, onto)

24 Annotating and disambiguating texts: the text with its ambiguities

25 Verifying grammars: the text disambiguated with the grammar of « NA »

26 The disambiguation of « V » (in, into)

27 Russian grammars for NooJ
All these grammars need improvement. They are very sensitive to word order:
- they fail to recognize structures when the word order of the Russian sentence is unusual (expressive or non-standard)
There are no grammars (yet):
- to disambiguate adverbs / adjectives
- to disambiguate adjectives / nouns
- to disambiguate conjunctions / interjections

28 To get reliable resources for the Russian language, the work left to be done is to:
- design and implement a data bank of verified and annotated texts
- design and implement efficient syntactic grammars
- develop semantic tagging
- unify or harmonize tags across languages (Slavic, Romance, Germanic, etc.) to allow further multilingual processing

29 Russian Module for NooJ: http://www.nooj4nlp.net/pages/russian.html

30 Russian Module for NooJ: design and implementation
NooJ Conference, INALCO, June 16th, 2012
vincent.benet@inalco.fr
Thank you for your attention (Спасибо за внимание / Merci de votre attention)

