SEMANA MORPHOLOGICAL DATA EXPLORATION USING THE SEMANA PLATFORM Feature Granularity Problem in the Definition of Polish Gender Georges SAUVET UTAH - CREAP,


1 SEMANA MORPHOLOGICAL DATA EXPLORATION USING THE SEMANA PLATFORM Feature Granularity Problem in the Definition of Polish Gender Georges SAUVET UTAH - CREAP, CNRS Toulouse André WLODARCZYK & Hélène WLODARCZYK CELTA, Université Paris Sorbonne

2 CASE STUDY Why Polish Adjective Declension? Answer: Polish Adjective Declension is an application domain with a well-defined boundary, i.e. one in which the total function generates all the combinatory possibilities.

3 POLISH DECLENSION
Case = {Nominative, Accusative, Genitive, Dative, Instrumental, Locative}
Number = {singular, plural}
Gender = {masculine, feminine, neuter, X, Y, Z*}
In Polish School Grammar, the Adjective declension consists in the amalgamation of 3 “morphological categories”. In our experiment, we interpreted these categories as attributes of an information system (Rough Set Theory, Pawlak Z., 1982).
* X, Y, Z will be analyzed in the sequel.

4 THE PROBLEM OF GENDER IN POLISH In Slavic languages, Gender is a classificatory category for Nouns, whereas it is an inflectional category for Adjectives. In order to elucidate the problem of Gender in Polish noun morphology, we built a database of usages (not uses) of the proximal deictic adjectives. The root of these adjectives is very short: the single phoneme t-.

5 THE DEICTIC MORPHEMES IN POLISH
The Nominative forms of the Polish morphemes with proximal (with respect to the speaker) deictic meaning are: TEN, TA, TO. They correspond to:
            TEN       TA        TO
English     this (all three forms)
French      ce        cette     ce
German      dieser    diese     dieses
Japanese    kono (all three forms)

6 SAMPLES FROM OUR DATABASE
Some samples from the db (examples only in the Nominative case):
             Polish singular   Polish plural   English translation
Feminine     ta deska          te deski        this/these board(s)
             ta gęś            te gęsi         this/these goose/geese
             ta pani           te panie        this/these lady/ladies
Masculine    ten dom           te domy         this/these house(s)
             ten pies          te psy          this/these dog(s)
             ten pan           ci panowie      this/these sir(s)
Neuter       to pióro          te pióra        this/these feather(s)
             to kurczę         te kurczęta     this/these chicken(s)
             to dziecko        te dzieci       this/these child/children
Our database contains 108 different noun phrases, combining all the categories involved in the declension (Case, Number, Gender and Animacy).

7 Defining Gender in Polish: 7 “Genders”
In Polish Linguistics (cf. SALONI, Z. 1976), Gender is defined as a morpho-syntactic category. It is in the Accusative Case that the Gender forms of Polish Adjectives are most differentiated. Sub-genders are distinguished in the singular and in the plural. As a result, surprisingly, up to 7 gender classes have been proposed:
Singular:
1. feminine (with a specific Accusative form)
2. neuter (with the same form in Accusative as in Nominative)
3. animal* masculine (with the same form in Accusative as in Genitive)
4. non animal masculine (with the same form in Accusative as in Nominative)
Plural:
1. personal** masculine (with the same form in Accusative as in Genitive)
2. non personal masculine (with the same form in Accusative as in Nominative)
3. “pluralia tantum”*** (with the same form in Accusative as in Nominative)
* “Animal” corresponds to the feature “animate” in descriptions of other European languages.
** “Personal” corresponds to the feature “human”.
*** Pluralia tantum are defective nouns with no singular form.

8 Defining Gender in Polish: 5 “Genders”
In fact, Saloni’s theory derives from that of Mańczak, W. (1956), who distinguished only the following five “sub-genders”:
1. personal masculine
2. animal masculine
3. non animal masculine
4. feminine
5. neuter

9 DATABASE WITH 7 GENDERS
Nb of objects : 108
Nb of duplicates : 65
Nb of attributes : 3 (with respectively 2, 7, 6 values)
Nb -> {plur or sing}
Gnd -> {fem or mascAn or mascHum or mascInan or neu or nMasHum or plTant}
Case -> {A or D or G or I or L or N}
Theoretical Combinations : 84
Apparent Saturation Index : 51.19%
Non-Attested Pairs of Values : 10
If all non-attested pairs are inconsistent, the maximum number of combinations is : 54
Corrected Saturation Index : 79.63%
Our knowledge reduction algorithm cannot reduce the different descriptions. Instead, 45 decision rules are proposed.
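The figures on this slide are mutually consistent if the saturation indices are read as the share of distinct descriptions among the possible combinations. The sketch below works under that assumption; the function name `saturation` and its parameters are hypothetical and not part of Semana.

```python
from math import prod

def saturation(n_distinct, value_counts, max_consistent=None):
    """Return (theoretical combinations, apparent index %, corrected index %).

    Assumption: each index is the share of distinct descriptions among the
    possible combinations. The corrected index uses the reduced maximum that
    remains once inconsistent (non-attested) pairs of values are excluded.
    """
    theoretical = prod(value_counts)
    apparent = round(100 * n_distinct / theoretical, 2)
    corrected = None
    if max_consistent is not None:
        corrected = round(100 * n_distinct / max_consistent, 2)
    return theoretical, apparent, corrected

# 7-gender database: 108 objects, 65 of them duplicates
distinct = 108 - 65
result = saturation(distinct, [2, 7, 6], max_consistent=54)
print(result)
```

With 43 distinct descriptions this gives 84 theoretical combinations and the two percentages quoted on the slide.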

10 CRITICAL REMARKS ON SUB-GENDERS We observed that the 5 or 7 “sub-genders” of Polish School Grammars correspond neither (a) to any known semantic or ontological categories nor (b) to any known grammatical sub-gender in other languages. In inflectional languages, the morphological amalgamation of several different categories in one single form may make it difficult to discern properly the semantic categories in question.

11 ANALYSIS of GENDER SUBCATEGORIZATION in POLISH GRAMMAR

12 FIRST TRIAL SPLITTING GENDER
Observing the singular/plural oppositions in Adjective declension, we first split the Gender attribute with its 7 “sub-gender” values into 3 attributes with fewer values each:
gender = {feminine, neuter, masculine}
animacy = {animate, inanimate}
humanity = {human, non human}

13 FIRST TRIAL - RESULTS SPLITTING GENDER
Objects : 108
Duplicates : 0
Duplicate ratio : 0%
Attributes : 5 (with resp. 6, 2, 3, 2, 2 values): case, number, gender, animacy and humanity
Theoretical Combinations : 144
Apparent Saturation Index : 75%
Non-Attested Pairs of Values (1): inanimate, human, 2, 4
If all non-attested pairs were inconsistent, the maximum number of combinations would be : 108
Corrected Saturation Index : 100%
The following pairs of attributes could be merged:
[HUM|INA] Confidence index = 99.9%
[HUM|nHUM] Confidence index = 99.9%
[INA|nHUM] Confidence index = 99.9%
Our knowledge reduction algorithm reduces the 108 different descriptions to 34 decision rules.
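The first-trial figures can be reproduced with a short sketch. It assumes that the 108 descriptions are the full cross product of case, number and gender with the three sensible (animacy, humanity) pairs, which is what the 100% corrected saturation suggests. The value labels and variable names below are illustrative and are not Semana's own.

```python
from itertools import combinations, product

CASES = ["nom", "acc", "gen", "dat", "ins", "loc"]
NUMBERS = ["sin", "plu"]
GENDERS = ["fem", "mas", "neu"]
# Assumed attested (animacy, humanity) pairs; (inanimate, human) never occurs.
ANIMACY_HUMANITY = [("inanimate", "nonhuman"), ("animate", "nonhuman"), ("animate", "human")]

descriptions = [
    {"case": c, "number": n, "gender": g, "animacy": a, "humanity": h}
    for c, n, g, (a, h) in product(CASES, NUMBERS, GENDERS, ANIMACY_HUMANITY)
]

# Collect the observed values of every attribute.
domains = {}
for d in descriptions:
    for attr, val in d.items():
        domains.setdefault(attr, set()).add(val)
attrs = list(domains)

theoretical = 1
for attr in attrs:
    theoretical *= len(domains[attr])

# Every pair of (attribute, value) items that co-occurs in some description.
attested_pairs = {
    ((a1, d[a1]), (a2, d[a2]))
    for d in descriptions
    for a1, a2 in combinations(attrs, 2)
}

non_attested = [
    ((a1, v1), (a2, v2))
    for a1, a2 in combinations(attrs, 2)
    for v1 in sorted(domains[a1])
    for v2 in sorted(domains[a2])
    if ((a1, v1), (a2, v2)) not in attested_pairs
]

# Combinations that remain when every non-attested pair is treated as inconsistent.
consistent = sum(
    all(((a1, row[i]), (a2, row[j])) in attested_pairs
        for (i, a1), (j, a2) in combinations(enumerate(attrs), 2))
    for row in product(*(sorted(domains[a]) for a in attrs))
)

apparent = 100 * len(descriptions) / theoretical
corrected = 100 * len(descriptions) / consistent
print(theoretical, apparent, non_attested, consistent, corrected)
```

Under these assumptions the only non-attested pair is (inanimate, human), and excluding it brings the 144 theoretical combinations down to 108.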

14 SECOND TRIAL MERGING ANIMACY with HUMANITY
Considering the results of the first trial:
- one pair of values (inanimate and human) is not attested in the db (in fact, this pair is clearly contradictory):
Non-Attested Pairs of Values (1): inanimate, human, 2, 4
- the confidence indices were computed as follows:
The following pairs of attributes could be merged:
[HUM|INA] Confidence index = 99.9%
[HUM|nHUM] Confidence index = 99.9%
[INA|nHUM] Confidence index = 99.9%
we decided to merge the two binary attributes ANIMACY and HUMANITY into one three-valued attribute as follows:
ANIMACY-*-{ANY}=[nhuman|inanimate|human]

15 SECOND TRIAL - RESULTS MERGING ANIMACY with HUMANITY
Nb of objects : 108
Nb of duplicates : 0
Duplicate ratio : 0%
Nb of attributes : 4 (with respectively 2, 3, 3 and 6 values)
Nb --> {plur or sing}
Gnd --> {fem or masc or neu}
Anim --> {inanim or anim or animHum}
Case --> {A or D or G or I or L or N}
Theoretical Combinations : 108
Apparent Saturation Index : 100%
Non-Attested Pairs of Values (0)
Corrected Saturation Index : 100%
Again, our knowledge reduction algorithm reduces the 108 different descriptions to 34 decision rules.
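The merge of the two binary attributes into one three-valued attribute can be sketched as a simple lookup. The labels of the binary values and the helper `merge_animacy` are hypothetical; the merged labels follow the slide (inanim, anim, animHum).

```python
from math import prod

# Assumed mapping from (animacy, humanity) to the merged three-valued attribute.
# The contradictory pair (inanimate, human) is deliberately absent.
MERGE = {
    ("inanimate", "nonhuman"): "inanim",
    ("animate", "nonhuman"): "anim",
    ("animate", "human"): "animHum",
}

def merge_animacy(description):
    """Replace the two binary attributes with a single 'anim' attribute."""
    merged = {k: v for k, v in description.items() if k not in ("animacy", "humanity")}
    merged["anim"] = MERGE[(description["animacy"], description["humanity"])]
    return merged

combinations_before = prod([6, 2, 3, 2, 2])  # case, number, gender, animacy, humanity
combinations_after = prod([2, 3, 3, 6])      # number, gender, anim, case

example = merge_animacy(
    {"case": "A", "number": "plur", "gender": "masc",
     "animacy": "animate", "humanity": "human"}
)
print(combinations_before, combinations_after, example)
```

Because the merged attribute has no slot for the contradictory pair, the theoretical space shrinks from 144 to 108 combinations, which matches the number of objects.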

16 Establishing an ANIMACY CATEGORY for Polish Grammar

17 KNOWLEDGE REDUCTION using SEMANA The knowledge reduction algorithm reduces the 108 different descriptions of Polish Proximal Deictic Morphemes to 34 decision rules.

18 34 Morphological Rules
r1 (9) : CASdat,NBRplu --> tym
r2 (3) : CASins,GNDmas,NBRsin --> tym
r3 (3) : CASins,GNDneu,NBRsin --> tym
r4 (3) : CASloc,GNDmas,NBRsin --> tym
r5 (3) : CASloc,GNDneu,NBRsin --> tym
r6 (9) : CASins,NBRplu --> tymi
r7 (1) : CASacc,ANYhum,GNDmas,NBRplu --> tych
r8 (9) : CASgen,NBRplu --> tych
r9 (9) : CASloc,NBRplu --> tych
r10 (3) : CASacc,GNDneu,NBRsin --> to
r11 (3) : CASnom,GNDneu,NBRsin --> to
r12 (3) : CASacc,ANYina,NBRplu --> te
r13 (3) : CASacc,ANYnhu,NBRplu --> te
r14 (3) : CASacc,GNDfem,NBRplu --> te
r15 (3) : CASacc,GNDneu,NBRplu --> te
r16 (3) : CASnom,ANYina,NBRplu --> te
r17 (3) : CASnom,ANYnhu,NBRplu --> te
r18 (3) : CASnom,GNDfem,NBRplu --> te
r19 (3) : CASnom,GNDneu,NBRplu --> te
r20 (1) : CASacc,ANYina,GNDmas,NBRsin --> ten
r21 (3) : CASnom,GNDmas,NBRsin --> ten
r22 (3) : CASdat,GNDmas,NBRsin --> temu
r23 (3) : CASdat,GNDneu,NBRsin --> temu
r24 (3) : CASdat,GNDfem,NBRsin --> tej
r25 (3) : CASgen,GNDfem,NBRsin --> tej
r26 (3) : CASloc,GNDfem,NBRsin --> tej
r27 (1) : CASacc,ANYhum,GNDmas,NBRsin --> tego
r28 (1) : CASacc,ANYnhu,GNDmas,NBRsin --> tego
r29 (3) : CASgen,GNDmas,NBRsin --> tego
r30 (3) : CASgen,GNDneu,NBRsin --> tego
r31 (3) : CASacc,GNDfem,NBRsin --> te*
r32 (3) : CASnom,GNDfem,NBRsin --> ta
r33 (3) : CASins,GNDfem,NBRsin --> ta*
r34 (1) : CASnom,ANYhum,GNDmas,NBRplu --> ci

19 DISCOVERED KNOWLEDGE
1. All the 108 different descriptions can be represented by only 34 rules.
2. 20 rules represent the singular forms and 14 rules represent the plural forms.
3. The Gender attribute is not necessary in 8 rules in plural and in cases other than Nominative. This confirms the generally observed fact that, in Polish grammar, gender is neutralized (no Gender distinction) in the plural oblique cases.
4. The attribute “Animacy” is present in 9/34 rules and 17/108 samples.
3 rules contain the value Human (hum):
r7 (1) : CASacc,ANYhum,GNDmas,NBRplu --> tych
r27 (1) : CASacc,ANYhum,GNDmas,NBRsin --> tego
r34 (1) : CASnom,ANYhum,GNDmas,NBRplu --> ci
3 rules contain the value Inanimate (ina):
r20 (1) : CASacc,ANYina,GNDmas,NBRsin --> ten
r12 (3) : CASacc,ANYina,NBRplu --> te
r16 (3) : CASnom,ANYina,NBRplu --> te
3 rules contain the value non Human (nhu):
r17 (3) : CASnom,ANYnhu,NBRplu --> te
r13 (3) : CASacc,ANYnhu,NBRplu --> te
r28 (1) : CASacc,ANYnhu,GNDmas,NBRsin --> tego
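The counts on this slide can be re-derived from the 34 rules of slide 18. The sketch below transcribes those rules and tallies them. The parsing, the helper `forms_for` and the final coverage check (which assumes the database is the full cross product of the four attributes) are illustrative and do not describe how Semana works internally.

```python
import re
from itertools import product

RULES_TEXT = """
r1 (9) : CASdat,NBRplu --> tym
r2 (3) : CASins,GNDmas,NBRsin --> tym
r3 (3) : CASins,GNDneu,NBRsin --> tym
r4 (3) : CASloc,GNDmas,NBRsin --> tym
r5 (3) : CASloc,GNDneu,NBRsin --> tym
r6 (9) : CASins,NBRplu --> tymi
r7 (1) : CASacc,ANYhum,GNDmas,NBRplu --> tych
r8 (9) : CASgen,NBRplu --> tych
r9 (9) : CASloc,NBRplu --> tych
r10 (3) : CASacc,GNDneu,NBRsin --> to
r11 (3) : CASnom,GNDneu,NBRsin --> to
r12 (3) : CASacc,ANYina,NBRplu --> te
r13 (3) : CASacc,ANYnhu,NBRplu --> te
r14 (3) : CASacc,GNDfem,NBRplu --> te
r15 (3) : CASacc,GNDneu,NBRplu --> te
r16 (3) : CASnom,ANYina,NBRplu --> te
r17 (3) : CASnom,ANYnhu,NBRplu --> te
r18 (3) : CASnom,GNDfem,NBRplu --> te
r19 (3) : CASnom,GNDneu,NBRplu --> te
r20 (1) : CASacc,ANYina,GNDmas,NBRsin --> ten
r21 (3) : CASnom,GNDmas,NBRsin --> ten
r22 (3) : CASdat,GNDmas,NBRsin --> temu
r23 (3) : CASdat,GNDneu,NBRsin --> temu
r24 (3) : CASdat,GNDfem,NBRsin --> tej
r25 (3) : CASgen,GNDfem,NBRsin --> tej
r26 (3) : CASloc,GNDfem,NBRsin --> tej
r27 (1) : CASacc,ANYhum,GNDmas,NBRsin --> tego
r28 (1) : CASacc,ANYnhu,GNDmas,NBRsin --> tego
r29 (3) : CASgen,GNDmas,NBRsin --> tego
r30 (3) : CASgen,GNDneu,NBRsin --> tego
r31 (3) : CASacc,GNDfem,NBRsin --> te*
r32 (3) : CASnom,GNDfem,NBRsin --> ta
r33 (3) : CASins,GNDfem,NBRsin --> ta*
r34 (1) : CASnom,ANYhum,GNDmas,NBRplu --> ci
"""

# Parse each line into (name, support, set of conditions, resulting form).
rules = []
for line in RULES_TEXT.strip().splitlines():
    name, support, conds, form = re.match(
        r"(r\d+) \((\d+)\) : (\S+) --> (\S+)", line
    ).groups()
    rules.append((name, int(support), frozenset(conds.split(",")), form))

singular = [r for r in rules if "NBRsin" in r[2]]
plural = [r for r in rules if "NBRplu" in r[2]]
with_animacy = [r for r in rules if any(c.startswith("ANY") for c in r[2])]
without_gender = [r for r in rules if not any(c.startswith("GND") for c in r[2])]
animacy_samples = sum(r[1] for r in with_animacy)

def forms_for(case, number, gender, animacy):
    """Forms predicted by every rule whose conditions all hold."""
    tokens = {"CAS" + case, "NBR" + number, "GND" + gender, "ANY" + animacy}
    return {form for _, _, conds, form in rules if conds <= tokens}

# Assumed full cross product of the four attributes (6 * 2 * 3 * 3 = 108).
coverage = {
    combo: forms_for(*combo)
    for combo in product(["nom", "acc", "gen", "dat", "ins", "loc"],
                         ["sin", "plu"], ["fem", "mas", "neu"],
                         ["hum", "nhu", "ina"])
}
print(len(singular), len(plural), len(with_animacy), animacy_samples, len(without_gender))
```

Under these assumptions every one of the 108 combinations is matched by at least one rule and all matching rules agree on the form, so the rule set behaves as a total function over the domain.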

20 GENDER and ANIMACY The 7-gender theory proposed too coarse-grained an analysis of the domain, using only one attribute supposed to represent the Gender category. In our “first trial”, in addition to Gender, two binary categories (Human and Animate) were introduced, which in fact resulted in too fine-grained a description of the domain. In our “second trial”, after merging the two binary categories, we obtained one three-valued Animacy category. As a result, the Analyser (1) detects none of the following anomalies: duplicates (of usages, not uses) or non-attested pairs of values, and (2) proposes no attribute merging possibilities. Our theory also takes into account the definition of the Gender category as it is generally used in grammars of other languages.

21 The ONTOLOGICAL STRUCTURE of ANIMACY
Interestingly, we noticed that, since the Feature Structure of the Animacy Attribute is a binary tree, it is normal that its values are mutually exclusive by the law of non-contradiction: nothing can be true and false at the same time.
ANIMACY
  - : non animate
  + : HUMANITY
        - : non human
        + : human

22 RELATIVE WEIGHT OF THE ANIMACY ATTRIBUTE If we consider the relative weight of the ANIMACY attribute (only 5.4%), we can better understand the difficulties that Polish linguists encountered in their work. [Table: relative weight of attributes; columns: N, weight (%); rows: CAS, NBR, GND, ANY.] It becomes clear that ANIMACY is not as important a category as the other three (Case, Number and Gender) which co-occur in the amalgamated adjective paradigm.

23 Step 1: DB building Using our “Dynamic db Builder”… morpheme sample attribute, value (features chosen for each entry)

24 Step 2: Multi-valued Contingency Table The 108 samples are collected into a Multi-valued Contingency Table

25 Step 3: One-valued Contingency Table The Multi-valued Table is unfolded as a One-valued Table...

26 Step 4: Table of co-occurrences (Burt Table) The One-valued Table is transformed into a Burt Table... [Figure: Burt Table with blocks labelled syntactic relators, animacy, gender, number, morphemes.]
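Steps 3 and 4 can be illustrated on a toy sample. The sketch below unfolds a small multi-valued table into a one-valued (0/1) table Z, then builds the co-occurrence table as Z transposed times Z, which is the usual definition of a Burt table. The four sample rows are illustrative, and treating the morpheme as one more column is an assumption based on the labels of this slide.

```python
# Toy multi-valued table: one row per sample, one column per attribute.
samples = [
    {"case": "nom", "number": "sin", "gender": "mas", "animacy": "ina", "morpheme": "ten"},
    {"case": "nom", "number": "sin", "gender": "fem", "animacy": "ina", "morpheme": "ta"},
    {"case": "nom", "number": "plu", "gender": "mas", "animacy": "hum", "morpheme": "ci"},
    {"case": "acc", "number": "sin", "gender": "mas", "animacy": "hum", "morpheme": "tego"},
]

# Step 3: one column per (attribute, value) pair, 1 if the sample has that value.
columns = sorted({(attr, val) for s in samples for attr, val in s.items()})
Z = [[1 if s[attr] == val else 0 for attr, val in columns] for s in samples]

# Step 4: Burt table B = Z^T Z; B[i][j] counts samples having both values i and j.
n = len(columns)
burt = [
    [sum(row[i] * row[j] for row in Z) for j in range(n)]
    for i in range(n)
]

mas = columns.index(("gender", "mas"))
hum = columns.index(("animacy", "hum"))
print(burt[mas][mas], burt[mas][hum])
```

The diagonal of the Burt table holds the frequency of each value, and the off-diagonal cells hold the co-occurrence counts that the factor analysis of step 5 takes as input.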

27 Step 5: Correspondence Factor Analysis (CFA) Numbers in the Table are considered as coordinates of points in an N-dimensional space. CFA calculates the axes of inertia of the cloud of points (F1, F2, F3, …) and displays projections in the planes [F1,F2], [F1,F3], etc. CFA is implemented as “Stat-3” in “Semana”. [Figure: cloud of points in x, y, z space with axes of inertia F1, F2, F3.]

28 Correspondence Factor Analysis Contribution percent of each axis to the overall inertia of the cloud Note that, in this case, the first 4 axes have almost equal contributions. This means that the cloud is strongly multidimensional. Output by “stat-3” “Stat-3” gives useful information about axes of inertia.

29 Correspondence Factor Analysis
“Stat-3” gives useful information about objects/features. Output by “stat-3”:
- Weight of object J / total
- Quality of the description of object J on the first 7 coordinates
- Contribution of object J to the overall inertia of the cloud
- Coordinate of object J on factor 1
- Contribution of factor 1 to the description of object J
- Contribution of object J to the definition of factor 1

30 Correspondence Factor Analysis Output by “stat-3” Note that the number (singular/plural) has the highest contrib. to axis 1 Note that the quality of the description of attribute “ animacy ” is very poor and that these elements have no contribution to the first 4 factors.

31 Projection in plane [1,2]
[Figure: CFA projection in the factorial plane [1,2]. Horizontal: axis 2 (inertia 12.81%); vertical: axis 1 (inertia 13.05%). Points plotted: case, number, gender and animacy values, and morphemes.]
Legend: Qualifiers = animacy, gender; Quantifiers = number; Syntactic relators = cases; morphemes.

32 Quantifiers and syntactic relators
[Figure: CFA projection in the factorial plane [1,2]. Horizontal: axis 2 (inertia 12.81%); vertical: axis 1 (inertia 13.05%). Quantifiers lie along axis 1, syntactic relators along axis 2, “qualifiers” near the origin.]
Axis 1 separates quantifiers => singular vs plural
Axis 2 separates syntactic relators => {nom, acc} vs {gen, loc, dat, ins}
“Qualifiers” (animacy & gender) are not differentiated on axes 1 and 2
Morphemes are spread over plane [1,2]

33 Axis 1 separates quantifiers
[Figure: CFA projection in the factorial plane [1,2]. Horizontal: axis 2 (inertia 12.81%); vertical: axis 1 (inertia 13.05%).]
Morphemes strictly associated with singular: => ta, to, ten, te*, tego, tej, temu, ta*
Morphemes strictly associated with plural: => ci, te, tych, tymi
tym may be either singular or plural

34 Axis 2 separates syntactic relators
[Figure: CFA projection in the factorial plane [1,2]. Horizontal: axis 2 (inertia 12.81%); vertical: axis 1 (inertia 13.05%).]
On one side: ta, to, ten, te*, ci, te are only nominative and/or accusative
On the other side: tej, tych, temu, tymi, ta* are only genitive, locative, dative and/or instrumental
tego may be either accusative or genitive

35 Axis 3 separates {genitive, locative} vs {instrumental}
[Figure: CFA projection on axes 1 and 3.]
Morphemes tych, tego, tej are only associated with genitive or locative
Morphemes tymi, ta* are only associated with instrumental
tym may be either instrumental or locative or dative

36 Axis 4 separates gender: feminine vs {masculine, neuter}
[Figure: CFA projection on axes 1 and 4.]
Morphemes ta*, te*, tej, ta are only associated with feminine
Morphemes tego, to, ten, temu, ci are only associated with masculine or neuter
Again, tym is ambiguous and may be associated with any gender
Note that animacy is still not differentiated on axis 4. Differentiation appears only on axis 9!

37 Differentiation of Animacy does not appear before factor 9
[Table: “stat-3” output with columns FREQ, QLT, INR and, for each of the factors F#1 to F#9, the coordinate with its COR and CTR values. Rows: hum, ina, nhu, acc, dat, gen, ins, loc, nom, fem, mas, neu, plu, sin, ci, ta, ta*, te, te*, tego, tej, temu, ten, to, tych, tym, tymi.]
Animacy first appears on factor 9

38 Axis 9 separates human vs {nonHuman, inanimate}
[Figure: CFA projection in the factorial plane [1,9]. Horizontal: axis 1 (inertia 13.05%); vertical: axis 9 (inertia 4.35%).]
Morpheme ci applies only to human entities

39 Comparing Theories of Polish Noun Categories in Grammar
THEORIES and GRAMMATICALIZED ATTRIBUTES
Mańczak W. (1956), Saloni Z. (1976):
GENDER: feminine, neuter, non animal masculine, animal masculine, personal masculine, non personal masculine, “pluralia tantum”
NUMBER: singular, plural
This proposal:
GENDER: feminine, neuter, masculine
ANIMACY: inanimate, non human animate, human animate
NUMBER: singular, plural

