# Multidimensional Data Analysis

## Presentation transcript

The growth of large databases calls for substantial data processing, and for methods that can extract information from large data tables.

Three categories of data analysis methods:

- Description: describe a phenomenon without preconceptions.
- Structuring: synthesize information by organizing the population into homogeneous groups.
- Explanation: explain the observed values of one variable by means of those observed for other variables.

One-dimensional descriptive statistics summarize the information for each variable (character) separately. Multidimensional data analysis describes the relations between variables and their effects on the structure of the population. Two methods are presented:

- Principal Component Analysis (PCA)
- Factorial Correspondences Analysis (FCA)

### Principal Component Analysis

PCA is used when the data form a table of measurements: the columns are quantitative variables and the rows are observations. An example of such a table:

| Observation | Height | Weight | Pulmonary capacity |
|---|---|---|---|
| Durand | 1.77 | 72.4 | 2.69 |
| Dupont | 1.52 | 68.0 | 3.90 |
| Dupond | 1.64 | 68.0 | 3.40 |
| Martin | 1.76 | 50.0 | 2.00 |

Objectives of PCA: locate homogeneous groups of observations with respect to the set of variables. A large number of variables can be systematically reduced to a smaller, conceptually more coherent set. From the initial statistical variables, artificial explanatory variables are built: the principal components, each a linear combination of the original variables. The goal is to reduce the dimensionality of the original data set, since a small set of uncorrelated variables is much easier to understand and to use in further analyses than a large set of correlated variables.

Three types of PCA:

- General PCA: apply the PCA method to the initial data table.
- Centered PCA: apply the PCA method to the centered variables.
- Reduced PCA: apply the PCA method to the centered and reduced variables.

- Centered PCA: for a statistical variable $X$ with mean $\bar{X}$, the centered variable is $X - \bar{X}$.
- Reduced PCA: the centered and reduced variable is $(X - \bar{X}) / \sigma_X$, where $\sigma_X$ is the standard deviation of $X$.
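As a rough illustration of centering and reducing, the sketch below applies both operations to the measurement table of the example. The use of NumPy and of the population standard deviation (`ddof=0`) are assumptions; the slides do not say which convention was used.

```python
import numpy as np

# Measurement table from the example: height, weight, pulmonary capacity.
X = np.array([
    [1.77, 72.4, 2.69],   # Durand
    [1.52, 68.0, 3.90],   # Dupont
    [1.64, 68.0, 3.40],   # Dupond
    [1.76, 50.0, 2.00],   # Martin
])

mean = X.mean(axis=0)   # mean of each variable
std = X.std(axis=0)     # population standard deviation (assumed convention)

X_centered = X - mean   # centered variables
Y = X_centered / std    # centered and reduced variables
```

After this step every column of `Y` has mean 0 and standard deviation 1, so variables measured in different units become comparable.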

PCA provides a method for representing a population in order to:

- locate homogeneous groups of observations with respect to the variables;
- reveal differences between observations, or groups of observations, with respect to the set of variables;
- highlight observations with atypical behavior;
- reduce the information needed to describe the position of an observation within the population.

Principle: PCA works on a population and on variables defined on that population.

Two types of analysis are carried out: analysis of the observations and analysis of the variables. The example uses the reduced analysis, in which the initial variables are centered and reduced.

Observations analysis. Each observation is represented by a point in a three-dimensional space. How can a distance between two observations be computed?

The three axes are defined by the variables $Y_1(\cdot)$, $Y_2(\cdot)$ and $Y_3(\cdot)$, calculated from the initial variables. The distance between two observations $\omega_i$ and $\omega_k$ is given by:

$$d(\omega_i, \omega_k) = \sqrt{\sum_{j=1}^{3} \big( Y_j(\omega_i) - Y_j(\omega_k) \big)^2}$$

The distance measures the resemblance between the two observations. The smaller the distance, the closer the two points and the more the two observations resemble each other. Conversely, the larger the distance, the farther apart the points and the less the observations resemble each other.
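A minimal sketch of this distance, assuming the usual Euclidean distance on the coordinates $Y_1$, $Y_2$, $Y_3$. The function name and the two sample points are hypothetical and only serve the illustration.

```python
import numpy as np

def distance(y_i, y_k):
    """Euclidean distance between two observations given by their coordinates."""
    y_i = np.asarray(y_i, dtype=float)
    y_k = np.asarray(y_k, dtype=float)
    return float(np.sqrt(np.sum((y_i - y_k) ** 2)))

# Two hypothetical observations in the space (Y1, Y2, Y3).
a = [0.0, 0.0, 0.0]
b = [1.0, 2.0, 2.0]
d = distance(a, b)
```

A small value of `d` means the two observations resemble each other; a value of zero means they coincide on all three variables.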

Observations cannot be drawn in a space of more than three dimensions. It is therefore necessary to find a good representation of the group of observations in a space of lower dimension (2, for example). How can one pass from a space of dimension 3 or more to a space of smaller dimension? By looking for a "good subspace" of representation with the help of a mathematical operator. Two problems arise:

1. Give a meaning to the expression "good representation".
2. Characterize the subspace.

The aim is to find a subspace F such that the distances between points are preserved as well as possible when the points are projected onto it. The resemblance between observations is then preserved by the projection.

Solution: the subspace F, of dimension q, is determined by the q largest eigenvalues of the matrix Y'Y (the correlation matrix) and their associated eigenvectors.

Let $Z = Y'Y$, with eigenvalues $\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_m$ and associated eigenvectors $u_1, u_2, u_3, \dots, u_m$.

- $Y u_1$: vector of the coordinates of the n observations on the first principal axis
- $Y u_2$: vector of the coordinates of the n observations on the second principal axis
- ...
- $Y u_m$: vector of the coordinates of the n observations on the m-th principal axis
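The eigen-decomposition step can be sketched as follows. The data are randomly generated (hypothetical), and the coordinates of the n observations are computed here as $Y u_k$, the projection of the rows of $Y$ onto each eigenvector of $Z = Y'Y$. This is an illustration under those assumptions, not the program used in the presentation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 3))                   # hypothetical: 9 observations, 3 variables
Y = (X - X.mean(axis=0)) / X.std(axis=0)      # centered and reduced variables

Z = Y.T @ Y                                   # m x m symmetric matrix
eigvals, eigvecs = np.linalg.eigh(Z)          # eigh returns ascending eigenvalues

# Reorder so that eigenvalues are numbered in descending order.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Column k holds the coordinates of the n observations on the k-th principal axis.
coords = Y @ eigvecs
```

With this convention the sum of squared coordinates on axis k equals the eigenvalue $\lambda_k$, which is why the eigenvalues measure the inertia carried by each axis.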

Indicators are needed to assess the quality of the results obtained. These indicators are:

- an indicator of global quality;
- an indicator of the contribution of an observation to the total inertia;
- an indicator of the contribution of an observation to the inertia explained by the subspace F;
- an indicator of the error of perspective.

Global quality. With the eigenvalues of Y'Y numbered in descending order, the global quality of a subspace F of dimension q is

$$\mathrm{IQG}(F) = \frac{\sum_{k=1}^{q} \lambda_k}{\sum_{k=1}^{m} \lambda_k}$$

where q is the subspace dimension and m is the number of variables.

In the example, $\lambda_1 = 0.689$, $\lambda_2 = 0.310$ and $\lambda_3 = 0.00$. For the one-dimensional subspace F, IQG(F) = 0.6896: the first axis of the analysis provides 68.96% of the initial information. For the subspace generated by the first two axes, IQG(F) = 1 (100% of the initial information).
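The global quality can be checked numerically from the eigenvalues quoted in the example. The sketch below assumes IQG(F) is the share of the total of the eigenvalues carried by the q largest ones; the function name is hypothetical.

```python
def global_quality(eigenvalues, q):
    """Share of the total inertia carried by the q largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:q]) / sum(ordered)

# Eigenvalues of Y'Y quoted in the example (rounded to three decimals).
eigenvalues = [0.689, 0.310, 0.00]

iqg_1 = global_quality(eigenvalues, 1)   # first axis only, about 0.69
iqg_2 = global_quality(eigenvalues, 2)   # first two axes
```

Because the quoted eigenvalues are rounded, `iqg_1` matches the 0.6896 of the example only to about three decimals.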

Contribution of the observation to the total inertia. The contributions sum to one, $\sum_{i=1}^{N} \mathrm{CIT}(\omega_i) = 1$, where N is the number of individuals (observations) in the PCA. CIT makes it easy to locate the observations that lie far from the center of gravity.
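The exact formula for CIT did not survive in the transcript. A common definition, assumed here, gives every observation the same weight and takes the squared distance of each observation to the center of gravity divided by the sum of these squared distances. The function name and the sample points are hypothetical.

```python
import numpy as np

def contribution_to_total_inertia(Y):
    """CIT under the equal-weights assumption: share of squared distance to the centroid."""
    Y = np.asarray(Y, dtype=float)
    g = Y.mean(axis=0)                        # center of gravity
    sq_dist = ((Y - g) ** 2).sum(axis=1)      # squared distance of each observation to g
    return sq_dist / sq_dist.sum()

# Hypothetical observations: the last one lies far from the others.
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
cit = contribution_to_total_inertia(Y)
```

The largest entry of `cit` points to the observation farthest from the center of gravity, here the last one.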

Contribution of the observation to the inertia explained by the subspace. The CIE identifies the observations that contribute most to creating the subspace F. In general, this parameter is calculated for every observation and for each axis.

[Table: CIE values for nine observations of the example]

Error of perspective. The quality of representation of an observation on the subspace is measured by $\cos^2(\cdot,\cdot)$, which has the following properties:
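The formulas for CIE and $\cos^2$ are not preserved in the transcript. The sketch below uses common definitions as an assumption: the CIE of an observation on axis k is its squared coordinate divided by $\lambda_k$, and $\cos^2$ is the squared norm of the projection onto F divided by the squared norm of the observation. The data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(9, 3))                   # hypothetical: 9 observations, 3 variables
Y = (X - X.mean(axis=0)) / X.std(axis=0)      # centered and reduced variables

eigvals, eigvecs = np.linalg.eigh(Y.T @ Y)
order = np.argsort(eigvals)[::-1]             # descending eigenvalues
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
coords = Y @ eigvecs                          # coordinates on the principal axes

# Assumed CIE: contribution of each observation to each axis (columns sum to 1).
cie = coords ** 2 / eigvals

# Assumed cos^2: quality of representation on the subspace of the first q axes.
q = 2
cos2 = (coords[:, :q] ** 2).sum(axis=1) / (Y ** 2).sum(axis=1)
```

A value of `cos2` close to 1 means the observation is almost entirely contained in the subspace, so its projection is not misleading; a value close to 0 signals a strong perspective error.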

Variables analysis.

Objective: determine synthetic statistical variables that "explain" the initial variables. Problem: fix the criterion that determines these synthetic variables, then interpret them.

In the example, the problem can be posed mathematically as follows:

$$Y_i(\cdot) = a_{i1} Z_1(\cdot) + a_{i2} Z_2(\cdot) + d_i(\cdot), \quad i = 1, 2, 3$$

$Y_1(\cdot)$, $Y_2(\cdot)$ and $Y_3(\cdot)$ are explained linearly by the synthetic variables $Z_1(\cdot)$ and $Z_2(\cdot)$. $d_1(\cdot)$, $d_2(\cdot)$ and $d_3(\cdot)$ are the residual variables, whose variances are to be minimized. The coefficients $a_{ij}$ are the solutions of the optimization problem

$$\min \big( V(d_1(\cdot)) + V(d_2(\cdot)) + V(d_3(\cdot)) \big)$$

where $V(d_i(\cdot))$ is the variance of $d_i(\cdot)$.

Solution: compute the eigenvectors associated with the q largest eigenvalues of the matrix YY'. Note that YY' has the same nonzero eigenvalues as Y'Y. In the example, the two eigenvectors retained define the two synthetic variables sought.
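The remark about the eigenvalues of YY' and Y'Y can be verified numerically. The sketch below uses a randomly generated matrix (hypothetical data) with 9 rows and 3 columns.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(9, 3))                   # hypothetical 9 x 3 matrix

# eigvalsh returns ascending eigenvalues; reverse to get descending order.
small = np.linalg.eigvalsh(Y.T @ Y)[::-1]     # 3 eigenvalues of Y'Y
large = np.linalg.eigvalsh(Y @ Y.T)[::-1]     # 9 eigenvalues of YY'

nonzero_part = large[:3]                      # should match the eigenvalues of Y'Y
zero_part = large[3:]                         # remaining eigenvalues, numerically zero
```

In practice this means one can always diagonalize the smaller of the two matrices, which matters when the number of observations is much larger than the number of variables.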

The same indicators are used in the variables analysis. An important one is IQG(F), the indicator of quality of the subspace F onto which the variables are projected. It gives the "residual variance", the part not taken into account in the representation by the subspace:

Residual variance = m [1 - IQG(F)]

It can be shown that the coordinate of the projection of a variable on an axis of the subspace is proportional to the linear correlation coefficient between this variable and the "synthetic" variable corresponding to the axis.

Note: given this proportionality, the program rescales the results so that the coordinates of the projected variables on each axis are directly the linear correlation coefficients.
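This property can be illustrated on hypothetical data. The sketch assumes a reduced PCA carried out on the correlation matrix $R = Y'Y / n$, with variable coordinates taken as $\sqrt{\mu_k}\, u_k$ where $\mu_k$ is an eigenvalue of $R$; this scaling is an assumption about how the rescaling mentioned in the note could be done.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))                  # hypothetical: 50 observations, 4 variables
X[:, 1] += 0.8 * X[:, 0]                      # make two variables correlated
Y = (X - X.mean(axis=0)) / X.std(axis=0)      # centered and reduced variables
n, m = Y.shape

R = Y.T @ Y / n                               # correlation matrix
mu, U = np.linalg.eigh(R)
order = np.argsort(mu)[::-1]                  # descending eigenvalues
mu, U = mu[order], U[:, order]

components = Y @ U                            # synthetic variables (principal components)
var_coords = U * np.sqrt(mu)                  # row j: coordinates of variable j on each axis

# Direct computation of the correlation between variable j and component k.
corr = np.array([[np.corrcoef(Y[:, j], components[:, k])[0, 1]
                  for k in range(m)] for j in range(m)])
```

The two arrays `var_coords` and `corr` agree, and each row of `var_coords` has unit norm, which is why the variables of a reduced PCA fall inside the unit circle when only two axes are kept.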

For each variable, the coefficient of multiple correlation with the variables corresponding to the axes of the subspace F onto which it is projected is proportional to the square of the norm of the projected vector. A variable is therefore better explained by the axes of a subspace when the norm of its projected vector is large.

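Simulated example. Number of variables: 8. Number of observations: 300. The first ten observations:

| Num | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1.692 | 3.046 | -7.4612 | -2.0368 | 2.512 | 2.9584 | 0.8168 | -1.2608 |
| 2 | 18.316 | 11.358 | -25.7476 | -8.6864 | 1.952 | 2.5664 | 0.0328 | -0.7568 |
| 3 | 16.377 | 10.3885 | -23.6147 | -7.9108 | 1.82 | 2.474 | -0.152 | -0.638 |
| 4 | 6.688 | 5.544 | -12.9568 | -4.0352 | 1.918 | 2.5426 | -0.0148 | -0.7262 |
| 5 | -2.666 | 0.867 | -2.6674 | -0.2936 | 2.291 | 2.8037 | 0.5074 | -1.0619 |
| 6 | 7.103 | 5.7515 | -13.4133 | -4.2012 | 1.118 | 1.9826 | -1.1348 | -0.0062 |
| 7 | 12.558 | 8.479 | -19.4138 | -6.3832 | 1.664 | 2.3648 | -0.3704 | -0.4976 |
| 8 | 9.064 | 6.732 | -15.5704 | -4.9856 | 2.316 | 2.8212 | 0.5424 | -1.0844 |
| 9 | 10.668 | 7.534 | -17.3348 | -5.6272 | 1.436 | 2.2052 | -0.6896 | -0.2924 |
| 10 | 7.136 | 5.768 | -13.4496 | -4.2144 | 2.514 | 2.9598 | 0.8196 | -1.2626 |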

[Figure: linear correlation between the variables]

[Figure: variables not correlated with X1]

Eigenvalues:

| Axis | Eigenvalue | Cumulated | Cumulated percentage |
|---|---|---|---|
| 1 | 4.09407 | 4.09407 | 0.51176 |
| 2 | 3.90593 | 8.00000 | 1.00000 |
| 3 | 0.00000 | 8.00000 | 1.00000 |
| 4 | 0.00000 | 8.00000 | 1.00000 |
| 5 | 0.00000 | 8.00000 | 1.00000 |
| 6 | 0.00000 | 8.00000 | 1.00000 |
| 7 | 0.00000 | 8.00000 | 1.00000 |
| 8 | 0.00000 | 8.00000 | 1.00000 |

100% of the inertia is obtained with the first two axes.
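The cumulated columns can be recomputed from the eigenvalues alone. A short sketch using the eight values listed for the simulated example:

```python
# Eigenvalues reported for the simulated example (8 variables).
eigenvalues = [4.09407, 3.90593, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

total = sum(eigenvalues)          # equals the number of variables in a reduced PCA

cumulated = []
running = 0.0
for value in eigenvalues:
    running += value
    cumulated.append(running)

# Cumulated share of inertia, the last column of the eigenvalue listing.
cumulated_share = [c / total for c in cumulated]
```

The total of 8 equals the number of variables, and the fact that only two eigenvalues are nonzero reflects the two groups of perfectly correlated variables in the simulated data.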

Variables coordinates (U1: first principal component, U2: second principal component):

| Variable | U1 | U2 |
|---|---|---|
| X1 | 0.715 | 0.699 |
| X2 | 0.715 | 0.699 |
| X3 | -0.715 | -0.699 |
| X4 | -0.715 | -0.699 |
| X5 | -0.715 | 0.699 |
| X6 | -0.715 | 0.699 |
| X7 | -0.715 | 0.699 |
| X8 | 0.715 | -0.699 |

All the variables are located inside a unit circle (reduced PCA).

[Figure: variables coordinates X1 to X8 in the plane of axes #1 and #2; two dimensions are highlighted]

[Figure: observations coordinates]

### Factorial Correspondences Analysis

Factorial correspondences analysis is used to extract information from contingency tables. A contingency table (frequency table) crosses two variables X and Y, where X has m modalities and Y has p modalities.

Objectives of FCA:

- build a map of the modalities of the two variables X and Y;
- determine whether certain modalities of X are associated with certain modalities of Y.

Example: two variables, ward and expenditure, with 5 wards (hospital divisions) and 5 expenditure items.

Analysis of row modalities. A row modality is represented by a point in a p-dimensional space. For example, the second row, (27 18 12 19 8), is a point of $\mathbb{R}^5$. The row modalities of the example are thus 5 points in a 5-dimensional space.

How can a subspace of reduced dimension q (q = 2, for example) be found to represent these points? The distances between the projected points in the subspace must be as close as possible to the distances between the initial points. A distance between the points, that is, between modalities, must therefore be defined. A row modality is represented by a vector $x_i$ whose coordinates are computed from the contingency table.

The distance between two modalities is the chi-square distance. In the example, the distances between the ward modalities are collected in a table.
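The formula on the slide is not preserved. The sketch below assumes the usual chi-square distance between row profiles, $d^2(i, i') = \sum_j \frac{1}{f_{\cdot j}} \big( \frac{f_{ij}}{f_{i\cdot}} - \frac{f_{i'j}}{f_{i'\cdot}} \big)^2$. Only the second row of the table comes from the example; the other rows and the function name are hypothetical.

```python
import numpy as np

def chi_square_distances(table):
    """Matrix of chi-square distances between the row profiles of a contingency table."""
    N = np.asarray(table, dtype=float)
    F = N / N.sum()                           # relative frequencies f_ij
    row_w = F.sum(axis=1)                     # row margins f_i.
    col_w = F.sum(axis=0)                     # column margins f_.j
    profiles = F / row_w[:, None]             # row profiles f_ij / f_i.
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2 / col_w).sum(axis=2))

table = [
    [30, 20, 10, 25, 15],   # hypothetical row
    [27, 18, 12, 19, 8],    # second row quoted in the example
    [60, 40, 20, 50, 30],   # hypothetical row, proportional to the first one
]
D = chi_square_distances(table)
```

The first and third rows have the same profile, so their chi-square distance is zero even though their totals differ: the distance compares distributions, not volumes.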

Problem formulation: find a q-dimensional subspace F in which the dispersion (inertia) of the projected points is maximized.

The center of gravity of the vectors $x_i$, each having weight $f_{i.}$, is computed first. Centering operation: each centered vector $z_i$ has p coordinates, noted $z_{ij}$. A matrix Z is defined whose general term is $z_{ij}$. It can be shown that the q-dimensional subspace F is generated by the eigenvectors of the matrix Z'Z associated with its q largest eigenvalues.
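The exact definition of $z_{ij}$ used in the presentation is not preserved. One common choice in correspondence analysis, assumed here, is $z_{ij} = (f_{ij} - f_{i\cdot} f_{\cdot j}) / \sqrt{f_{i\cdot} f_{\cdot j}}$. The contingency table below is hypothetical except for its second row, which is the one quoted in the example.

```python
import numpy as np

N = np.array([
    [30, 20, 10, 25, 15],   # hypothetical
    [27, 18, 12, 19, 8],    # second row quoted in the example
    [10, 25, 30, 15, 20],   # hypothetical
    [22, 14, 9, 30, 25],    # hypothetical
    [18, 22, 16, 12, 32],   # hypothetical
], dtype=float)

F = N / N.sum()                               # relative frequencies
r = F.sum(axis=1)                             # row weights f_i.
c = F.sum(axis=0)                             # column weights f_.j
expected = np.outer(r, c)                     # frequencies under independence

Z = (F - expected) / np.sqrt(expected)        # assumed centered and scaled matrix

eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)
order = np.argsort(eigvals)[::-1]             # descending eigenvalues
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

total_inertia = eigvals.sum()
chi2_over_n = ((F - expected) ** 2 / expected).sum()

# Coordinates of the row modalities on the first two factorial axes.
row_coords = (Z @ eigvecs[:, :2]) / np.sqrt(r)[:, None]
```

With this choice of Z the sum of the eigenvalues equals the chi-square statistic of the table divided by the grand total, and the smallest eigenvalue is zero because of the centering.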

Example (center of gravity, vectors $x_i$, vectors $y_i$, matrix Z). Eigenvalues: $\lambda_1 = 0.01$, $\lambda_2 = 0.00176$, $\lambda_3 = 0$.

Quality of representation indicators. Quality of the subspace generated:

$$\mathrm{IQG}(F) = \frac{\sum_{k=1}^{q} \lambda_k}{\sum_{k=1}^{p} \lambda_k}$$

where q is the dimension of the subspace and p is the number of column modalities.

Contribution of a row modality i to the construction of axis k: $0 \le \mathrm{CIE}(i, u_k) \le 1$. If CIE is close to 1, the row modality has a significant weight in the determination of the subspace F. Example: contribution of the row modalities.

Quality of representation (perspective effect): measures the degree of deformation introduced by the projection.

Column modalities analysis. Column modalities are analyzed in the same manner as row modalities. The matrices Z'Z and ZZ' have the same nonzero eigenvalues.

The contributions of the column modalities and the quality of representation of the column modalities have the same definitions as the row indicators, adapted to the column modalities.

[Figure: simultaneous representation of the rows and the columns of the example, projected in the first factorial plane (axes 1 and 2)]

Illustrations
