Advances and directions of research in Symbolic Data Analysis
E. Diday, CEREMADE, Paris–Dauphine University
June 14, 2014, SDA Workshop, Tutorial, Academia Sinica

OUTLINE
PART 1: BUILDING SYMBOLIC DATA
PART 2: OPEN DIRECTIONS OF RESEARCH
PART 3: AN ILLUSTRATIVE EXAMPLE: TRACHOMA STUDY

PART 1: Building symbolic data. Some principles. Ten kinds of symbolic variables.

Some principles
- Symbolic data are not given or found like standard or complex data.
- They are built from classes of individuals in the case of standard data, or from classes of several kinds of individuals in the case of complex data.
- Symbolic data are not only distributions.
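
The principle that symbolic data are built from classes of individuals can be sketched as follows. This is only a minimal illustration: the records, the class labels and the two kinds of description chosen here (an interval and a table of category frequencies) are hypothetical and do not come from the tutorial.

```python
from collections import Counter, defaultdict

# Hypothetical individual-level records: (class label, numerical value, category).
records = [
    ("A", 21, "left"),
    ("A", 27, "right"),
    ("A", 33, "right"),
    ("B", 19, "left"),
    ("B", 24, "left"),
]


def build_symbolic_table(rows):
    """Aggregate individuals into one symbolic description per class."""
    groups = defaultdict(list)
    for label, value, category in rows:
        groups[label].append((value, category))

    table = {}
    for label, members in groups.items():
        values = [v for v, _ in members]
        counts = Counter(c for _, c in members)
        n = len(members)
        table[label] = {
            # Interval-valued description of the numerical variable.
            "interval": (min(values), max(values)),
            # Frequency-valued description of the categorical variable.
            "frequencies": {c: counts[c] / n for c in sorted(counts)},
            # Class size, kept so that the description is not only a distribution.
            "size": n,
        }
    return table


symbolic = build_symbolic_table(records)
for label in sorted(symbolic):
    print(label, symbolic[label])
```

Each row of the resulting table describes a class rather than an individual, which is the unit of study in SDA.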

Ten examples of Symbolic variables

PART 2: OPEN DIRECTIONS OF RESEARCH
- Building symbolic data.
- Extending methods to symbolic data.
- Four theorems of convergence that need to be proved for any method extended to symbolic data.
- Models of models.
- Law of parameters of laws and laws of vectors of laws.
- Copulas needed.
- Optimisation in unsupervised learning (hierarchical and pyramidal clustering).

BUILDING SYMBOLIC DATA
The discretization of the initial classical variables has to be done so as to optimize at least three kinds of aims:
1) The quality of the obtained distribution.
- It can be measured by model selection criteria such as BIC, MDL, AIC, MML, or other criteria of this kind based on likelihood estimation.
- Flat distributions are not interesting, so an "information" criterion such as Sum_i p_i log(p_i) (the negative of the Shannon entropy, which is lowest for a flat distribution) can be used.
2) The level of discrimination between the obtained symbolic descriptions. It can be measured by the sum of their pairwise dissimilarities.
3) The correlation between the bins associated with the different symbolic variables (metabins).
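
Aims 1) and 2) above can be sketched numerically as follows. This is a minimal sketch under assumptions: each symbolic description is taken to be a histogram over the same bins, and the L1 distance between histograms is used as the dissimilarity, which is an illustrative choice and not one prescribed by the tutorial. All function names are hypothetical.

```python
import math


def information(p):
    """Sum of p_i * log(p_i), with 0 * log(0) taken as 0.

    This is the negative Shannon entropy: it is lowest for a flat distribution.
    """
    return sum(x * math.log(x) for x in p if x > 0)


def l1_dissimilarity(p, q):
    """L1 distance between two histograms defined on the same bins (assumed choice)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def discrimination(descriptions):
    """Sum of the dissimilarities between all pairs of symbolic descriptions."""
    total = 0.0
    for i in range(len(descriptions)):
        for j in range(i + 1, len(descriptions)):
            total += l1_dissimilarity(descriptions[i], descriptions[j])
    return total


flat = [0.25, 0.25, 0.25, 0.25]
peaked = [0.7, 0.1, 0.1, 0.1]
other = [0.1, 0.1, 0.1, 0.7]

print(information(flat), information(peaked))
print(discrimination([flat, peaked, other]))
```

A discretization could then be scored by combining these quantities, for instance by preferring binnings whose histograms are both far from flat and well separated from one another.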

EXTENDING METHODS TO SYMBOLIC DATA: MUCH REMAINS TO BE DONE
- Graphical visualisation of symbolic data
- Correlation, mean, mean square, distribution of a symbolic variable
- Dissimilarities between symbolic descriptions, K-nearest neighbours
- Clustering, spatial hierarchies and pyramids of symbolic descriptions, S-Kohonen mappings
- S-Decision trees
- S-Principal component analysis, discriminant factorial analysis
- S-Canonical analysis, regression
- S-Bayesian trees, multilevel analysis, variance analysis, support vector machines, mixture decomposition, learning machine by groups
- Etc.

FOUR THEOREMS TO BE PROVED ON ANY METHOD EXTENDED TO SYMBOLIC DATA
M(n, k) is supposed to be an SDA method, where k is the number of classes obtained on n initial individuals.
THEOREM 1: If the k classes are fixed and n tends towards infinity, then M(n, k) converges towards a stable position.
THEOREM 2: If k increases until there is a single individual per class, then M(n, k) converges towards a standard method.
THEOREM 3: If k and n increase simultaneously towards infinity, then M(n, k) converges towards a stable position.
THEOREM 4: If the k laws associated with the k classes are considered as a sample of a law of laws, then M(n, k) applied to this sample converges to M(n, k) applied to this law.
Examples:
Theorem 1: this was proved in Diday, Emilion (CRAS, Choquet 1998) for Galois lattices: as the size of the population increases, the classes (described by vectors of distributions) organize themselves into a Galois lattice that converges. Emilion (CRAS, 2002) also gives a theorem for mixtures of laws of laws, using martingales and a Dirichlet model.
Theorem 2: for example, the classical PCA M O is a special case of the PCA, denoted M(n, k), built on vectors of intervals.
Theorem 3: this is the setting of data that arrive sequentially (of "Data Stream" type) and of one-pass algorithms (see for example Diday, Murty (2005)).
Theorem 4: in the case of a hierarchical or pyramidal clustering (2D, 3D, etc.), convergence means that the major levels and their structure stabilize. In the case of a PCA, convergence means that the factorial axes stabilize.
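
The statement of Theorem 2 can be illustrated on a deliberately simple "method", the mean of an interval-valued variable. The sketch below is an illustration, not a proof: it assumes that the mean of an interval variable is taken as the mean of the interval midpoints, and that the classes are formed from consecutive individuals. The data and function names are hypothetical.

```python
def classical_mean(values):
    """Standard mean of a numerical variable."""
    return sum(values) / len(values)


def to_intervals(values, k):
    """Split the individuals, in order, into k classes and describe each by [min, max]."""
    n = len(values)
    classes = [[] for _ in range(k)]
    for i, v in enumerate(values):
        classes[i * k // n].append(v)
    return [(min(c), max(c)) for c in classes]


def interval_mean(intervals):
    """Mean of an interval-valued variable, taken here as the mean of the midpoints."""
    return sum((a + b) / 2 for a, b in intervals) / len(intervals)


values = [1.0, 2.0, 9.0, 10.0, 11.0, 21.0]

for k in (2, len(values)):
    print(k, interval_mean(to_intervals(values, k)))
print("classical", classical_mean(values))
```

With one individual per class, every interval degenerates to a point [x, x], so the symbolic mean coincides with the classical mean; with fewer classes the two generally differ.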

MODELS OF MODELS ARE NEEDED
Table 1: individuals (ind 1, ..., ind n) in rows, variables X_1, ..., X_j in columns. Each cell X_ij is a number (e.g. the age of Messi). X_j is a standard numerical random variable.
Table 2: teams (classes C_1, ..., C_i, ..., C_k) in rows, variables X'_1, ..., X'_j in columns. Each cell is a symbolic data (e.g. the age of Messi's team). X'_j is a random variable with histogram values.
- Question: if the law of X_j is given, what is the law of X'_j? (Dirichlet models are useful.)

LAW OF PARAMETERS OF LAWS
[Table: classes C_1, ..., C_i, ..., C_k in rows, symbolic variables Y_1, ..., Y_j, ..., Y_p in columns; cell (i, j) contains Par_ij; the bottom row contains Law(Par_j), ..., Law(Par_p).]
Example: Par_ij = (μ_ij, σ_ij), the estimated parameters of the law X_ij of the class C_i.
Find the law of the parameters for each symbolic variable Y_j, and the law of the associated vector of parameter laws.
Example: if f is the density of the parameters of the uniform law of intervals and g is the law of intervals, then
g(y) = 6^p f(x) / Prod_{j=1..p} (x_j^max - x_j^min) (Diday, SFC 2011, Orléans).

COPULAS NEEDED IN SYMBOLIC DATA ANALYSIS
In each cell of the symbolic data table, we suppose that there is a density function f(i, j). f(i, j, j') is the joint probability of the variables j and j' for the individual i.
- In case of independence, f(i, j, j') = f(i, j) * f(i, j').
- If there is no independence, f(i, j, j') = Copula(f(i, j), f(i, j')).
Aim of the copula model in SDA:
- To find the copula which minimises the differences with the joint probability.
- To avoid the restriction to independence hypotheses and to reduce the cost of computing f(i, j, j').
In that way we can obtain a copula-based PCA, regression, canonical analysis, etc.
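
The aim of finding the copula that minimises the differences with the joint probability can be sketched as follows. Several assumptions are made here that are not in the tutorial: the copula is applied to rank-transformed observations (empirical marginal distribution functions) rather than to densities, the one-parameter Farlie-Gumbel-Morgenstern (FGM) family is used purely for illustration, and the fit is a least-squares grid search. The sample and all function names are hypothetical.

```python
def pseudo_observations(xs):
    """Rank-transform a sample into (0, 1) using rank / (n + 1)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    u = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (len(xs) + 1)
    return u


def empirical_copula(u, v, a, b):
    """Fraction of observations with u_i <= a and v_i <= b."""
    return sum(1 for ui, vi in zip(u, v) if ui <= a and vi <= b) / len(u)


def fgm_copula(a, b, theta):
    """FGM copula; theta = 0 gives the independence copula a * b."""
    return a * b * (1 + theta * (1 - a) * (1 - b))


def make_grid(size=9):
    """Evaluation points 1/(size+1), ..., size/(size+1) on each axis."""
    return [(i + 1) / (size + 1) for i in range(size)]


def loss(u, v, theta, grid):
    """Sum of squared differences between the empirical and the FGM copula."""
    return sum(
        (empirical_copula(u, v, a, b) - fgm_copula(a, b, theta)) ** 2
        for a in grid
        for b in grid
    )


def fit_theta(u, v, grid, steps=201):
    """Grid search over theta in [-1, 1], the valid range of the FGM family."""
    best_theta, best_loss = 0.0, float("inf")
    for s in range(steps):
        theta = -1 + 2 * s / (steps - 1)
        current = loss(u, v, theta, grid)
        if current < best_loss:
            best_theta, best_loss = theta, current
    return best_theta, best_loss


# Hypothetical values of two variables j and j' observed inside one class.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]

u, v = pseudo_observations(x), pseudo_observations(y)
grid = make_grid()
theta, fitted_loss = fit_theta(u, v, grid)
independence_loss = loss(u, v, 0.0, grid)
print(theta, fitted_loss, independence_loss)
```

Comparing the fitted loss with the loss at theta = 0 shows how much is gained over the independence hypothesis. The FGM family only captures weak dependence, so a richer family would be needed for strongly dependent variables.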

Bi-plot of histogram variables
The joint probability can be inferred by a copula model.
[Figure: bi-plot of the histogram variables Y_1 and Y_2 for the classes C_1, ..., C_i, ..., C_k, with the copula.]

OPTIMISATION IN CLUSTERING
Each class is described by symbolic data. d is the given dissimilarity.
- Hierarchies: ultrametric dissimilarity U, W = |d - U|
- Pyramids: Robinsonian dissimilarity R, W = |d - R|
- 3D spatial pyramids: Yadidean dissimilarity Y, W = |d - Y|
[Figure: a hierarchy and a pyramid on x1, ..., x5, and a 3D spatial pyramid.]
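
The criterion W = |d - U| for hierarchies can be sketched as follows. Two assumptions are made: the norm is read as the sum of absolute differences over all pairs, and the ultrametric U is the subdominant ultrametric (the one produced by single-linkage clustering), computed as a minimax-path closure. This U is only one candidate and is not claimed to minimise W. The dissimilarity matrix is hypothetical.

```python
def subdominant_ultrametric(d):
    """Minimax-path closure of a symmetric dissimilarity matrix (list of lists)."""
    n = len(d)
    u = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Largest step on the best path from i to j going through k.
                u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u


def is_ultrametric(u, tol=1e-12):
    """Check u_ij <= max(u_ik, u_kj) for every triple (i, j, k)."""
    n = len(u)
    return all(
        u[i][j] <= max(u[i][k], u[k][j]) + tol
        for i in range(n)
        for j in range(n)
        for k in range(n)
    )


def fit_cost(d, u):
    """W read as the sum over pairs i < j of |d_ij - u_ij| (assumed norm)."""
    n = len(d)
    return sum(abs(d[i][j] - u[i][j]) for i in range(n) for j in range(i + 1, n))


# Hypothetical dissimilarities between four symbolic descriptions.
d = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]

U = subdominant_ultrametric(d)
print(U)
print(is_ultrametric(U), fit_cost(d, U))
```

An optimisation procedure would search over ultrametrics (or Robinsonian or Yadidean dissimilarities for pyramids and spatial pyramids) to make this cost as small as possible.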

PART 3: ILLUSTRATIVE EXAMPLE ON TRACHOMA
- Trachoma, caused by repeated ocular infections with Chlamydia trachomatis, whose vector is a fly, is an important cause of blindness in the world. This study was conducted in Mali.
- The first aim was to choose, among three antibiotic strategies, those with the best cost-effectiveness ratio.
- The second aim was to find the demographic and environmental parameters on which we could try to intervene.

Symbolic Table of Degradation
The classes 0x0, 0x1, 1x0 and 1x1 of degradation (0 = healthy, 1 = ill, at the beginning x end of the one-year study) are taken directly from the given data and not from a clustering process.
INTERPRETATION: the THIRD STRATEGY is the most frequent in the worst class (0x1). Nevertheless, we cannot conclude that it is the worst strategy, as the degradation can come from the environment of this class 0x1.
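
The construction of the degradation classes and of the distribution of strategies inside each class can be sketched as follows. The records below are invented for illustration and are not the data of the Mali study; the function name is hypothetical.

```python
from collections import Counter, defaultdict

# Invented records: (strategy, status at beginning, status at end), 0 = healthy, 1 = ill.
records = [
    (1, 0, 0), (1, 1, 0), (2, 0, 0), (2, 1, 1),
    (3, 0, 1), (3, 0, 1), (3, 1, 1), (1, 0, 1),
]


def degradation_table(rows):
    """Build the classes 'begin x end' and the frequencies of strategies in each class."""
    by_class = defaultdict(Counter)
    for strategy, begin, end in rows:
        by_class[f"{begin}x{end}"][strategy] += 1

    table = {}
    for label, counts in by_class.items():
        total = sum(counts.values())
        table[label] = {s: counts[s] / total for s in sorted(counts)}
    return table


table = degradation_table(records)
for label in sorted(table):
    print(label, table[label])
```

The classes come directly from the recorded statuses, with no clustering step, and each class is described by a distribution over strategies that can be drawn as a pie chart.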

The third strategy remains the worst in three homogeneous environmental conditions obtained by clustering.

PCA OF THE SYMBOLIC DATA TABLE OF DEGRADATION
A standard PCA is applied to the categories of the symbolic variables (considered as numerical variables) of the "degradation symbolic data table", on which the pie charts of the strategies are projected.

THE PIE CHART OF ANY SYMBOLIC VARIABLE CAN BE SEEN: Borehole, well

CORRELATION CIRCLE OF ALL THE CATEGORIES (i.e. BINS) OF THE SYMBOLIC VARIABLES ON THE FIRST AXIS.

SYMBOLIC VARIABLES PROJECTION IN HYPERCUBE QUADRANT SYMBOLIC VARIABLES PROJECTION

THE SDA STRATEGY
- The classes are generally not obtained from a clustering process. The classes 0x0, 0x1, 1x0 and 1x1 of degradation are taken directly from the given data.
- In SDA, clustering is rarely used to build the classes to be studied. It is mainly used to show dependencies or independencies between groups of symbolic variables (here, the environmental conditions).

CONCLUSION
- Classical, Complex and Big Data are GIVEN. Symbolic data are BUILT.
- Complex and Big Data can be simplified and reduced into Symbolic Data.
- The quality of the obtained Symbolic Data can be improved by optimizing several criteria.
- There are still few papers on building Symbolic Data. Much remains to be done in this direction.
- Symbolic data are not only distributions.
- SYMBOLIC DATA ARE THE NUMBERS OF THE FUTURE.

