
Symbolic and statistical Analyses of meta-data using the Semana platform a bundle of tools for the KDD research Georges Sauvet (CNRS, Toulouse) Centre.


1 Symbolic and statistical analyses of meta-data using the Semana platform: a bundle of tools for KDD research
Georges Sauvet (CNRS, Toulouse)
Centre de Recherche et d'Étude de l'Art Préhistorique, UMR 5608: Travaux et Recherches Archéologiques sur les Cultures, les Espaces et les Sociétés
CASK Sorbonne 2008, Paris, June 13th

2 SEMANA and Data Mining
Process diagram (after B. Wüthrich, 1998): data warehouse, sampling, data coding, KDD techniques (Rough Set, FCA, statistical analysis, etc.), interpretation.
SEMANA: a bundle of tools aimed at making these tasks easier.

3 Architecture of the SEMANA platform
A software bundle written in Transcript®, the programming language of Revolution®. Standalone applications for Macintosh and Windows.
Dynamic DB Builder: data sheets, data coding, data storage
Tree Builder Assistant: aid to code structuring
Attribute Editor: discretization, logical scaling, …
Tables (various formats): multi-valued tables, one-valued tables
Formal Concept Analysis (Wille, Ganter): Galois lattice, central concepts
Rough Set Theory (Pawlak) and Decision Logic (Bolc, Cytowski and Stacewicz): upper approximation, lower approximation, reducts, core, discriminating power, minimal rules, attribute strength
Statistical tools (Benzécri): correlation matrix, Correspondence Factor Analysis, hierarchical classifications

4 Working with the SEMANA platform
SEMANA is twofold:
1) Tools for intelligent database design => Dynamic DB Builder, providing statistical information about the use of attribute-values (AV) and suggesting iterative restructuring of AV
2) Tools for KDD research: integration of Rough Set Theory (RST), Formal Concept Analysis (FCA) and statistical data analyses
Three illustrations:
Ten-ta-to: the proximal deictic adjectives in Polish
The category of Aspect in Polish
Representations of women in Palaeolithic art

5 Case 1: the Proximal Deictic Adjectives in Polish

6 The proximal deictic adjectives in Polish
In Polish School Grammar, the adjective declension amalgamates three morphological categories:
Case = {Nominative, Accusative, Genitive, Dative, Instrumental, Locative}
Number = {singular, plural}
Gender: in Polish linguistics (cf. SALONI, Z. 1976), up to 7 gender classes have been proposed.
Singular:
1. feminine
2. neuter
3. animal masculine ("animal" corresponds to the feature "animate" in descriptions of other European languages)
4. non-animal masculine
Plural:
1. personal masculine ("personal" corresponds to the feature "human")
2. non-personal masculine
3. pluralia tantum (defective nouns with no singular form)

7 The proximal deictic adjectives in Polish
The root of these adjectives is a single phoneme t-. 13 forms are used: ten, ta, to, tym, tymi, tych, te, te*, temu, tej, tego, ta*, ci
Examples (Nominative case only):
Gender | Singular | Plural | English translation
Masculine | ten dom | te domy | this/these house(s)
Masculine | ten pies | te psy | this/these dog(s)
Masculine | ten pan | ci panowie | this/these sir(s)
Feminine | ta deska | te deski | this/these board(s)
Feminine | ta gęś | te gęsi | this/these goose/geese
Feminine | ta pani | te panie | this/these lady/ladies
Neuter | to pióro | te pióra | this/these feather(s)
Neuter | to kurczę | te kurczęta | this/these chicken(s)
Neuter | to dziecko | te dzieci | this/these child/children
…

8 The proximal deictic adjectives in Polish
In order to elucidate the problem of Gender in Polish noun morphology, H. and A. Wlodarczyk have built a database of usages of the proximal deictic adjectives. As the 7 sub-genders of Polish School Grammars correspond neither to any known semantic or ontological category nor to any known grammatical sub-gender in other languages, they proposed to split the Gender attribute into three attributes:
gender = {feminine, neuter, masculine}
animacy = {animate, inanimate}
humanity = {human, non_human}

9 TENTATO: database first version (sample)
Each entry shows a morpheme and its (attribute, value) pairs, i.e. the features chosen for that entry.

10 TENTATO: database first version
An AV table is automatically collected:
Objects = 108
Distinct objects = 108
Duplicates = 0
Duplicate ratio = 0
Attributes = 5 (with resp. 6,2,3,2,2 values)
NB: in this calculation, non-used attributes (*) have been replaced by a null value ('nAtt')
==================================================
Theoretical Number of Combinations = 144
Apparent Saturation Index : 75%
==================================================
The following pairs of attributes could be merged:
[hum|ina] Confidence index = 99.9%
[hum|nhu] Confidence index = 99.9%
[ina|nhu] Confidence index = 99.9%
==================================================
STATISTICAL USE OF AV
Attr Value occur
Ani anim 72
Ani inanim 36
Case A 18
Case D 18
Case G 18
Case I 18
Case L 18
Case N 18
Gnd fem 36
Gnd masc 36
Gnd neu 36
Hum hum 36
Hum nhum 72
Nb plur 54
Nb sing 54
==================================================
Non-Attested Pairs of Values = 1
ina,hum,2,4
--------------------------------------------------
Assuming that all non-attested pairs are impossible:
Maximum number of combinations = 108
Corrected Saturation Index : 100%
--------------------------------------------------
The program suggests merging these attributes.
The program indicates that the pair {inanimate-human} does not exist (for obvious reasons).
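The figures in the report above can be reproduced with a few lines of Python. This is only a sketch, not Semana's own code: it assumes the 108 objects are all value combinations except those pairing inanimate with human, and the function names and the way the corrected maximum is counted are choices made here for illustration.

```python
from itertools import combinations, product
from math import prod

# Attribute names and value labels follow the report; the table itself is rebuilt here.
ATTRIBUTES = {
    "Case": ["N", "A", "G", "D", "I", "L"],
    "Nb": ["sing", "plur"],
    "Gnd": ["fem", "masc", "neu"],
    "Ani": ["anim", "inanim"],
    "Hum": ["hum", "nhum"],
}
NAMES = list(ATTRIBUTES)


def all_combinations():
    """Every theoretical combination of one value per attribute."""
    return [dict(zip(NAMES, combo)) for combo in product(*ATTRIBUTES.values())]


def build_objects():
    """Assumed reconstruction: all combinations except inanimate + human."""
    return [row for row in all_combinations()
            if not (row["Ani"] == "inanim" and row["Hum"] == "hum")]


def non_attested_pairs(rows):
    """Pairs of values (from two different attributes) that never co-occur."""
    missing = []
    for a, b in combinations(NAMES, 2):
        seen = {(row[a], row[b]) for row in rows}
        for va, vb in product(ATTRIBUTES[a], ATTRIBUTES[b]):
            if (va, vb) not in seen:
                missing.append(((a, va), (b, vb)))
    return missing


def saturation_report(rows):
    distinct = {tuple(row[name] for name in NAMES) for row in rows}
    theoretical = prod(len(values) for values in ATTRIBUTES.values())
    missing = non_attested_pairs(rows)
    # Combinations that remain possible once non-attested pairs are ruled out.
    possible = sum(
        1 for row in all_combinations()
        if not any(row[a] == va and row[b] == vb for (a, va), (b, vb) in missing)
    )
    return {
        "objects": len(rows),
        "distinct": len(distinct),
        "theoretical": theoretical,
        "apparent_saturation": 100 * len(distinct) / theoretical,
        "non_attested": missing,
        "max_combinations": possible,
        "corrected_saturation": 100 * len(distinct) / possible,
    }


report = saturation_report(build_objects())
print(report)
```

Under these assumptions, the apparent saturation index is the ratio of distinct objects to theoretical combinations, and the corrected index uses only the combinations that contain no non-attested pair.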

11 TENTATO (Version 1): Formal Concept Analysis
Figures: simplified lattice and complete lattice of TENTATO Version 1.
Test of dependence:
Total Dependence: ina => nhu (36/36), hum => an (36/36)
High probability (>90%): none
Inanimate implies non-human; human implies animate.
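A dependence test of this kind can be sketched as a count of how often a premise value is accompanied by a conclusion value. The code below is an illustration under assumptions: the table is rebuilt as all combinations except inanimate + human, and `implication` and `classify` are hypothetical helper names, not Semana functions. The 90% threshold is the one quoted on the slide.

```python
from itertools import product


def build_objects():
    """Assumed reconstruction of the 108 objects of TENTATO version 1."""
    rows = []
    for case, nb, gnd, ani, hum in product(
        "NAGDIL", ["sing", "plur"], ["fem", "masc", "neu"],
        ["anim", "inanim"], ["hum", "nhum"],
    ):
        if ani == "inanim" and hum == "hum":
            continue  # the only non-attested pair of values
        rows.append({"Case": case, "Nb": nb, "Gnd": gnd, "Ani": ani, "Hum": hum})
    return rows


def implication(rows, premise, conclusion):
    """Return (hits, support) for the rule premise => conclusion."""
    p_attr, p_val = premise
    c_attr, c_val = conclusion
    support = [row for row in rows if row[p_attr] == p_val]
    hits = sum(1 for row in support if row[c_attr] == c_val)
    return hits, len(support)


def classify(hits, support, threshold=0.9):
    """Label a dependence as total, high probability, or none."""
    if support == 0:
        return "undefined"
    if hits == support:
        return "total"
    if hits / support > threshold:
        return "high probability"
    return "none"


rows = build_objects()
for premise, conclusion in [
    (("Ani", "inanim"), ("Hum", "nhum")),
    (("Hum", "hum"), ("Ani", "anim")),
    (("Ani", "anim"), ("Hum", "hum")),
]:
    hits, support = implication(rows, premise, conclusion)
    print(premise, "=>", conclusion, f"{hits}/{support}", classify(hits, support))
```

The first two rules come out as total dependences (36/36), while the converse rule anim => hum does not hold.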

12 TENTATO: second version
In a second trial, the attributes ANIMACY ({ANI}=[animate|inanimate]) and HUMANITY ({HUM}=[human|nhuman]) are merged into a three-valued attribute: {ANY}=[nhuman|inanimate|human]
Objects = 108
Distinct objects = 108
Duplicates = 0
Duplicate ratio = 0
Attributes = 4 (with resp. 3,6,3,2 values)
NB: in this calculation, non-used attributes (*) have been replaced by a null value ('nAtt')
==================================================
Theoretical Number of Combinations = 108
Apparent Saturation Index : 100%
==================================================
No attributes could be merged
==================================================
STATISTICAL USE OF AV
Attr Value occur
ANY human 36
ANY inanimate 36
ANY nhuman 36
CAS accusative 18
CAS dative 18
CAS genetive 18
CAS instrumental 18
CAS locative 18
CAS nominative 18
GND feminine 36
GND masculine 36
GND neuter 36
NBR plural 54
NBR singular 54
==================================================
Non-Attested Pairs of Values = 0
-------------------------------------------------
Assuming that all non-attested pairs are impossible:
Maximum number of combinations = 108
Corrected Saturation Index : 100%
No further attribute merging is possible; all pairs of values are attested.
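The merging step can be illustrated with a small sketch. It assumes the version 1 table is every combination except inanimate + human, and `merge_attributes` is a hypothetical helper written for this example; only the three attested (animacy, humanity) pairs need a target value.

```python
from collections import Counter
from itertools import product
from math import prod

# Mapping of attested (Ani, Hum) pairs onto the merged three-valued attribute.
MERGE = {
    ("anim", "hum"): "human",
    ("anim", "nhum"): "nhuman",
    ("inanim", "nhum"): "inanimate",
}


def build_v1():
    """Assumed reconstruction of the 108 objects of the first version."""
    rows = []
    for case, nb, gnd, ani, hum in product(
        "NAGDIL", ["sing", "plur"], ["fem", "masc", "neu"],
        ["anim", "inanim"], ["hum", "nhum"],
    ):
        if (ani, hum) in MERGE:
            rows.append({"Case": case, "Nb": nb, "Gnd": gnd, "Ani": ani, "Hum": hum})
    return rows


def merge_attributes(rows, a, b, new_name, mapping):
    """Replace attributes a and b by one attribute whose value encodes the pair."""
    merged = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in (a, b)}
        new_row[new_name] = mapping[(row[a], row[b])]
        merged.append(new_row)
    return merged


def theoretical_combinations(rows):
    """Product of the number of distinct values observed for each attribute."""
    values = {}
    for row in rows:
        for attr, value in row.items():
            values.setdefault(attr, set()).add(value)
    return prod(len(v) for v in values.values())


v1 = build_v1()
v2 = merge_attributes(v1, "Ani", "Hum", "ANY", MERGE)
any_counts = Counter(row["ANY"] for row in v2)
print(theoretical_combinations(v1), theoretical_combinations(v2), dict(any_counts))
```

After merging, the theoretical number of combinations drops from 144 to 108, which equals the number of distinct objects, so the saturation index reaches 100%.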

13 TENTATO: Formal Concept Analysis
TENTATO Version 1 (simplified lattice, complete lattice). Test of dependence: Total Dependence: ina => nhu (36/36), hum => an (36/36); High probability (>90%): none. Inanimate implies non-human; human implies animate.
TENTATO Version 2 (simplified lattice, complete lattice). Test of dependence: Total Dependence: none; High probability (>90%): none. All the attributes are at the same level: no hierarchy.

14 TENTATO-2: Rough Set Theory and Minimal Rules
A procedure derived from Rough Set Theory allows us to calculate the minimal rules (i.e. the values of the attributes which condition the morpheme to be used).
r1 (9) : CASdat,NBRplu --> tym
r2 (3) : CASins,GNDmas,NBRsin --> tym
r3 (3) : CASins,GNDneu,NBRsin --> tym
r4 (3) : CASloc,GNDmas,NBRsin --> tym
r5 (3) : CASloc,GNDneu,NBRsin --> tym
r6 (9) : CASins,NBRplu --> tymi
r7 (1) : CASacc,ANYhum,GNDmas,NBRplu --> tych
r8 (9) : CASgen,NBRplu --> tych
r9 (9) : CASloc,NBRplu --> tych
r10 (3) : CASacc,GNDneu,NBRsin --> to
r11 (3) : CASnom,GNDneu,NBRsin --> to
r12 (3) : CASacc,ANYina,NBRplu --> te
r13 (3) : CASacc,ANYnhu,NBRplu --> te
r14 (3) : CASacc,GNDfem,NBRplu --> te
r15 (3) : CASacc,GNDneu,NBRplu --> te
r16 (3) : CASnom,ANYina,NBRplu --> te
r17 (3) : CASnom,ANYnhu,NBRplu --> te
r18 (3) : CASnom,GNDfem,NBRplu --> te
r19 (3) : CASnom,GNDneu,NBRplu --> te
r20 (1) : CASacc,ANYina,GNDmas,NBRsin --> ten
r21 (3) : CASnom,GNDmas,NBRsin --> ten
r22 (3) : CASdat,GNDmas,NBRsin --> temu
r23 (3) : CASdat,GNDneu,NBRsin --> temu
r24 (3) : CASdat,GNDfem,NBRsin --> tej
r25 (3) : CASgen,GNDfem,NBRsin --> tej
r26 (3) : CASloc,GNDfem,NBRsin --> tej
r27 (1) : CASacc,ANYhum,GNDmas,NBRsin --> tego
r28 (1) : CASacc,ANYnhu,GNDmas,NBRsin --> tego
r29 (3) : CASgen,GNDmas,NBRsin --> tego
r30 (3) : CASgen,GNDneu,NBRsin --> tego
r31 (3) : CASacc,GNDfem,NBRsin --> te*
r32 (3) : CASnom,GNDfem,NBRsin --> ta
r33 (3) : CASins,GNDfem,NBRsin --> ta*
r34 (1) : CASnom,ANYhum,GNDmas,NBRplu --> ci
The 108 distinct objects of the DB can be described by only 34 morphological rules. Note that CAS and NBR are required in every rule, GND in 26/34 and ANY in only 9/34.
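The attribute counts quoted for the rule list above can be checked mechanically. The sketch below parses the 34 rules as plain text, counts how often each attribute appears, and verifies that every one of the 108 combinations of values is matched by rules agreeing on a single morpheme. The parsing convention (three-letter attribute prefix followed by a three-letter value code) and the helper names are assumptions made for this example; this checks the published rules, it does not derive them.

```python
from collections import Counter
from itertools import product

RULES_TEXT = """
CASdat,NBRplu --> tym
CASins,GNDmas,NBRsin --> tym
CASins,GNDneu,NBRsin --> tym
CASloc,GNDmas,NBRsin --> tym
CASloc,GNDneu,NBRsin --> tym
CASins,NBRplu --> tymi
CASacc,ANYhum,GNDmas,NBRplu --> tych
CASgen,NBRplu --> tych
CASloc,NBRplu --> tych
CASacc,GNDneu,NBRsin --> to
CASnom,GNDneu,NBRsin --> to
CASacc,ANYina,NBRplu --> te
CASacc,ANYnhu,NBRplu --> te
CASacc,GNDfem,NBRplu --> te
CASacc,GNDneu,NBRplu --> te
CASnom,ANYina,NBRplu --> te
CASnom,ANYnhu,NBRplu --> te
CASnom,GNDfem,NBRplu --> te
CASnom,GNDneu,NBRplu --> te
CASacc,ANYina,GNDmas,NBRsin --> ten
CASnom,GNDmas,NBRsin --> ten
CASdat,GNDmas,NBRsin --> temu
CASdat,GNDneu,NBRsin --> temu
CASdat,GNDfem,NBRsin --> tej
CASgen,GNDfem,NBRsin --> tej
CASloc,GNDfem,NBRsin --> tej
CASacc,ANYhum,GNDmas,NBRsin --> tego
CASacc,ANYnhu,GNDmas,NBRsin --> tego
CASgen,GNDmas,NBRsin --> tego
CASgen,GNDneu,NBRsin --> tego
CASacc,GNDfem,NBRsin --> te*
CASnom,GNDfem,NBRsin --> ta
CASins,GNDfem,NBRsin --> ta*
CASnom,ANYhum,GNDmas,NBRplu --> ci
"""

# Value codes as they appear in the rules (assumed to be exhaustive).
VALUES = {
    "CAS": ["nom", "acc", "gen", "dat", "ins", "loc"],
    "GND": ["fem", "mas", "neu"],
    "ANY": ["hum", "nhu", "ina"],
    "NBR": ["sin", "plu"],
}


def parse_rules(text):
    """Turn each 'conditions --> morpheme' line into (dict of conditions, morpheme)."""
    rules = []
    for line in text.strip().splitlines():
        left, morpheme = line.split("-->")
        conditions = {token[:3]: token[3:] for token in left.strip().split(",")}
        rules.append((conditions, morpheme.strip()))
    return rules


def attribute_usage(rules):
    """Number of rules in which each attribute appears."""
    return Counter(attr for conditions, _ in rules for attr in conditions)


def matching_morphemes(rules, obj):
    """Set of morphemes predicted by all rules whose conditions the object satisfies."""
    return {m for cond, m in rules if all(obj[a] == v for a, v in cond.items())}


rules = parse_rules(RULES_TEXT)
usage = attribute_usage(rules)
objects = [dict(zip(VALUES, combo)) for combo in product(*VALUES.values())]
unambiguous = all(len(matching_morphemes(rules, obj)) == 1 for obj in objects)
print(len(rules), dict(usage), len(objects), unambiguous)
```

Several rules may cover the same object (for example an accusative plural feminine inanimate noun matches both r12 and r14), but in that case they all predict the same morpheme.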

15 TENTATO-2: Statistical analysis
The multi-valued table is unfolded into a one-valued table, and the one-valued table is then transformed into a Burt table. A Burt table is a square symmetric table giving the number of co-occurrences of the attribute values.
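The two transformations can be sketched in plain Python. This is an illustration rather than Semana's code: it assumes a table made of the four attributes of TENTATO version 2 with every combination present (the morpheme column is left out), and `unfold` and `burt_table` are hypothetical names.

```python
from itertools import product

VALUES = {
    "ANY": ["human", "inanimate", "nhuman"],
    "CAS": ["accusative", "dative", "genitive", "instrumental", "locative", "nominative"],
    "GND": ["feminine", "masculine", "neuter"],
    "NBR": ["plural", "singular"],
}


def multi_valued_table():
    """One row per object, one multi-valued column per attribute (assumed full table)."""
    names = list(VALUES)
    return [dict(zip(names, combo)) for combo in product(*VALUES.values())]


def unfold(rows):
    """One-valued (0/1) table: one column per (attribute, value) pair."""
    columns = [(attr, value) for attr, values in VALUES.items() for value in values]
    table = [[1 if row[attr] == value else 0 for attr, value in columns] for row in rows]
    return columns, table


def burt_table(table):
    """Square symmetric table of co-occurrence counts between 0/1 columns."""
    n_cols = len(table[0])
    return [
        [sum(row[i] * row[j] for row in table) for j in range(n_cols)]
        for i in range(n_cols)
    ]


columns, table = unfold(multi_valued_table())
burt = burt_table(table)

plural = columns.index(("NBR", "plural"))
human = columns.index(("ANY", "human"))
print(burt[plural][plural], burt[human][plural])
```

The diagonal of the Burt table holds the frequency of each value, and two values of the same attribute never co-occur, so the corresponding off-diagonal cells are zero.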

16 TENTATO-2: Correspondence Factor Analysis (CFA)
Numbers in the table are considered as coordinates of points in an N-dimensional space. CFA calculates the axes of inertia of the cloud of points (F1, F2, F3, …) and displays projections in the planes [F1,F2], [F1,F3], etc. CFA is implemented in Semana.
Figure: cloud of points in axes x, y, z with axes of inertia F1, F2, F3.
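For readers who want the computation behind the axes of inertia, the usual textbook formulation of correspondence analysis can be written as follows. It is given here as background and is not necessarily how Semana or Stat-3 implement it; the symbols N, P, r, c, S and G are notation introduced for this sketch.

```latex
% N: table of counts being analysed, n: its grand total
\begin{align*}
P &= \tfrac{1}{n}\,N, \qquad r = P\,\mathbf{1}, \qquad c = P^{\top}\mathbf{1}
  && \text{(correspondence matrix, row and column masses)} \\
S &= D_r^{-1/2}\,\bigl(P - r\,c^{\top}\bigr)\,D_c^{-1/2}
  && \text{(standardized residuals)} \\
S &= U\,\Sigma\,V^{\top}, \qquad \lambda_k = \sigma_k^{2}
  && \text{(axes of inertia and their inertias)} \\
G &= D_c^{-1/2}\,V\,\Sigma
  && \text{(coordinates of the columns on the factors)} \\
\tau_k &= \frac{\lambda_k}{\sum_{l}\lambda_l}
  && \text{(share of inertia carried by axis } k\text{)}
\end{align*}
```

Here D_r and D_c are the diagonal matrices of the row and column masses, and the percentage of inertia reported for each axis corresponds to the ratio in the last line.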

17 TENTATO-2: Correspondence Factor Analysis (CFA)
Output by Stat-3. For each object J, the table gives: the coordinate of object J on factor 1; the contribution of object J to the definition of factor 1; the contribution of factor 1 to the description of object J.
Note that number (singular/plural) has the highest contribution to axis 1.
Note that the quality of the description of the attribute animacy is very poor: its values make no contribution to the first 4 factors.
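The two kinds of contribution listed above are usually defined as follows in correspondence analysis. This is the standard textbook definition, stated under the assumption that Stat-3 follows it; the notation (mass c_j of object j, coordinate g_{jk} of object j on factor k, inertia lambda_k of factor k) is introduced here.

```latex
\begin{align*}
\mathrm{CTR}_k(j) &= \frac{c_j\, g_{jk}^{2}}{\lambda_k},
  & \sum_{j} \mathrm{CTR}_k(j) &= 1
  && \text{(contribution of object } j \text{ to factor } k\text{)} \\
\mathrm{COR}_k(j) &= \frac{g_{jk}^{2}}{\sum_{l} g_{jl}^{2}},
  & \sum_{k} \mathrm{COR}_k(j) &= 1
  && \text{(contribution of factor } k \text{ to the description of object } j\text{)}
\end{align*}
```

A value whose COR is close to zero on the first factors, as reported for animacy, is poorly represented in the corresponding planes even if it is well separated on a later axis.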

18 TENTATO-2: CFA representation in plane [1,2]
Output by Stat-3 (Axis 1, Axis 2).
Axis 1 separates NUMBER => singular vs plural
Axis 2 separates syntactic relators (CASE) => {nom, acc} vs {gen, loc, dat, ins}
ANIMACY & GENDER are not differentiated on axes 1 and 2
Morphemes are widely spread over plane [1,2]

19 TENTATO-2: Axis 1 separates quantifiers
Output by Stat-3 (Axis 1, Axis 2; labels: singular, plural).
Morphemes strictly associated with singular: => ta, to, ten, te*, tego, tej, temu, ta*
Morphemes strictly associated with plural: => ci, te, tych, tymi
One exception: tym may be either singular or plural

20 TENTATO-2: Axis 2 separates syntactic relators
Output by Stat-3 (Axis 1, Axis 2; labels: nom, acc, gen, loc, dat, ins).
Morphemes strictly associated with nominative and/or accusative: => ta, to, ten, te*, ci, te
Morphemes strictly associated with genitive, locative, dative and/or instrumental: => tej, tych, temu, tymi, ta*
One exception: tego may be either accusative or genitive

21 TENTATO-2: Axis 3 separates {gen, loc} vs {ins}
Output by Stat-3 (Axis 1, Axis 3; labels: nom, acc, gen, loc, dat, ins).
Morphemes tymi, ta* strictly associated with instrumental
Morphemes tych, tego, tej strictly associated with genitive or locative
One exception: tym may be either instrumental or locative

22 TENTATO-2: Axis 4 separates gender {fem} vs {mas, neu}
Output by Stat-3 (Axis 1, Axis 4; labels: fem, mas, neu).
Morphemes tego, to, ten, temu, ci strictly associated with masculine or neuter
Morphemes ta*, te*, tej, ta strictly associated with feminine
One exception: tym may be associated with any gender
Note that the attribute [ANIMACY]={human, nhuman, inanimate} is still not differentiated on axis 4.

23 TENTATO-2: Animacy appears only on axis 9!
Output by Stat-3 (Axis 1, Axis 9; labels: hum, nhu, ina).
Morpheme ci strictly associated with human

24 TENTATO-2: CFA and Minimal Rules (RST)
Axis (% inertia) | Attribute | Presence in minimal rules | Opposition on the axis
Axis 1 (13.05%) | NUMBER | 34/34 rules | singular vs plural
Axis 2 (12.81%) | CASE | 34/34 rules | {nom, acc} vs {gen, loc, dat, inst}
Axis 3 (11.27%) | CASE | | {gen, loc, (dat)} vs {inst}
Axis 4 (10.0%) | GENDER | 26/34 rules | feminine vs masculine
…
Axis 9 (4.35%) | ANIMACY | 9/34 rules | human vs {nhum, ina}
The relative strength of the attributes is revealed both by their contribution to the axes of inertia in Factor Analysis and by their weight in Minimal Rules.

25 Case 2: the category of Aspect in Polish

26 A Database built with Dynamic DB-Builder
A classical data sheet is filled in for each specimen. Attributes and values are chosen from a list, and the resulting AVs appear in a field. The grammatical form of each specimen is used as index.

27 A test of consistency
Each specimen is characterized by a set of AV and by its grammatical form (used as index). It may be written as a rule: if {given set of AV} then index. This allows index inconsistencies to be detected (a test of consistency is provided in Semana).

28 A test of consistency
9 different forms applying to exactly the same situation? This is a warning to the expert: the AV probably do not properly describe the different aspectual situations!
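The idea behind such a consistency test can be sketched as follows: specimens are grouped by their exact set of AV, and any group that carries more than one index is reported. The specimens, AV labels and form names below are invented for illustration, and `find_inconsistencies` is a hypothetical helper, not Semana's procedure.

```python
from collections import defaultdict


def find_inconsistencies(specimens):
    """Map each AV set that leads to several different indexes onto those indexes."""
    forms_by_description = defaultdict(set)
    for av_set, index in specimens:
        forms_by_description[frozenset(av_set)].add(index)
    return {
        description: forms
        for description, forms in forms_by_description.items()
        if len(forms) > 1
    }


# Hypothetical specimens: (set of attribute-values, grammatical form used as index).
specimens = [
    ({"MOD=stop", "TYP=event"}, "form_A"),
    ({"MOD=stop", "TYP=event"}, "form_B"),   # same description, different form
    ({"MOD=keep", "TYP=state"}, "form_C"),
    ({"TYP=state", "MOD=keep"}, "form_C"),   # same description, same form: consistent
]

conflicts = find_inconsistencies(specimens)
for description, forms in conflicts.items():
    print(sorted(description), "->", sorted(forms))
```

Each reported group is a rule "if {set of AV} then index" with several conclusions, which signals that the current attributes do not discriminate between those forms.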

29 Polish Aspect using Dynamic DB Builder
All specimens are automatically collected in a contingency table, and statistics are reported. In this initial version, there were more than 2 million theoretical combinations, and 9 pairs of attributes could be merged!

30 Polish Aspect using Dynamic DB Builder
Improvements by trial and error:
DB version | Distinct objects | Number of attributes | Number of theor. combin. | Number of merging attributes
HW-Aspect-V1 | 61 | 12 | 2,064,384 | 9
HW-Aspect-V2 | 60 | 11 | 1,032,192 | 9
HW-Aspect-V3 | 77 | 11 | 829,000 | 6
HW-Aspect-V4 | 79 | 9 | 408,240 | 1
HW-Aspect-V5 | 79 | 8 | 136,080 | 1
HW-Aspect-V6 | 69 | 8 | 45,360 | 1
HW-Aspect-V7 | 74 | 8 | 61,440 | 0
HW-Aspect-V8 | 78 | 7 | 58,320 | 0

31 From Dynamic DB Builder to STAT-3 The multi-valued table is transformed into a one-valued table for STAT analyses

32 Polish Aspect: Correspondence Factor Analysis
Factor Analysis of the contingency table shows a clear Guttman effect (i.e. a sequential order of the attributes) in the plane of axis 1 and axis 2.

33 Polish Aspect : Correspondence Factor Analysis Ascending Hierarchical Classification shows two well-defined classes

34 Polish Aspect: Correspondence Factor Analysis
A clear partition into two classes according to the attribute [VAL] = {perfective | imperfective}

35 Polish Aspect: Correspondence Factor Analysis
The Guttman effect shows that attribute values are sequentially ordered:
attribute MCMP (morph. comp.): pip > ip > pp > pi > ii
attribute MOD: parallel > sequential > trans > resume > stop > interrupt > keep > OffAndOn

36 Polish Aspect: Correspondence Factor Analysis
Distribution of features along the perfective-to-imperfective path (% association with imperfective):
VAL: perfective … imperfective
MCMP: pip 0, ip 0, pp 0, pi 100, ii 100
CRE: defnb 0, nRe 30, ndefnb 89
MOD: par 0, seq 0, trans 0, resume 0, stop 35, inter 0, keep 60, OaO 100
ANA: after 0, finish 0, enter 0, start 0, end 33, before 44, nan 69, begin 40, run 84
ITS: decr 0, incr 0, strong 28, weak 54
TYP: ordPr 29, event 17, state 75, refPr 67
Features at 0% necessarily require the perfective.
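A percentage of association with the imperfective can be obtained, for each value of an attribute, as the share of specimens carrying that value whose VAL is imperfective. The sketch below assumes this definition, which the slide does not spell out; the specimens are invented and `percent_imperfective` is a hypothetical helper.

```python
from collections import defaultdict


def percent_imperfective(specimens, attribute):
    """For each value of `attribute`, percentage of specimens that are imperfective."""
    totals = defaultdict(int)
    imperfective = defaultdict(int)
    for specimen in specimens:
        value = specimen.get(attribute)
        if value is None:
            continue  # attribute not used for this specimen
        totals[value] += 1
        if specimen["VAL"] == "imperfective":
            imperfective[value] += 1
    return {value: round(100 * imperfective[value] / totals[value]) for value in totals}


# Hypothetical specimens; the counts are made up for illustration.
specimens = (
    [{"VAL": "perfective", "MOD": "par"}] * 2
    + [{"VAL": "imperfective", "MOD": "keep"}] * 3
    + [{"VAL": "perfective", "MOD": "keep"}] * 2
    + [{"VAL": "imperfective", "MOD": "OaO"}] * 2
)

association = percent_imperfective(specimens, "MOD")
print(association)
```

A value scoring 0 never occurs with an imperfective form in the data, while a value scoring 100 occurs only with imperfective forms.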

37 Case 3 : Images of the Woman in Palaeolithic Art

38 Images of the Woman in Palaeolithic Art
Customized DB-builder: for each figure, AV are selected with check boxes.
Raphaëlle Bourrillon, PhD, Univ. Toulouse-Le Mirail

39 Images of the Woman in Palaeolithic Art
CFA and HAC show three classes of representations: realistic and corpulent; realistic and slim; schematic / abstract.

40 Detailed study of the schematic women representations CFA and HAC split the schematic feminine figures into five sub-classes Schematic / abstract

41 Detailed study of the schematic women representations Formal concept analysis

42 SEMANA: a bundle of tools for KDD research, at hand in a single box, with applications in many domains (within and outside Linguistics!)
FROM PREPROCESSING … Building / editing DB:
- Structuring of AV
- Statistics
- AV editing (merging, splitting, etc.)
- Editing / conversion of tables in various formats
… TO MINING: complementary KDD procedures (RST, FCA...), with special emphasis on the powerful tools of statistical data analysis (CFA, HAC)

