Multidimensional Data Analysis
The growth of large databases calls for substantial data processing, and hence for methods that can extract information from large data tables. Three categories of Data Analysis methods:
Description: describe a phenomenon without prior assumptions.
Structuring: synthesize information by structuring the population into homogeneous groups.
Explanation: explain the observed values of one variable by means of the values observed for other variables.

Multidimensional Data Analysis
One-dimensional descriptive statistics summarize the information for each variable separately. Data Analysis describes the relations between variables and their effects on the structuring of the population. Two such methods are presented: Principal Component Analysis (PCA) and Factorial Correspondence Analysis (FCA).

Principal Component Analysis
PCA is used when we have a table of measurements: the columns are quantitative variables and the rows are observations. Example of a measurement data file: the rows are the observations Durand, Dupont, Dupond and Martin; the columns are the variables height, weight and pulmonary capacity (the numerical values are not reproduced in this transcript).

Principal Component Analysis
Objectives of PCA: locate homogeneous groups of observations with respect to the set of variables. A large number of variables can be systematically reduced to a smaller, conceptually more coherent set: from the initial statistical variables we build explanatory artificial (synthetic) variables. The principal components are linear combinations of the original variables, and the goal is to reduce the dimensionality of the original data set. A small set of uncorrelated variables is much easier to understand and use in further analyses than a large set of correlated variables.

Principal Component Analysis
Three types of PCA:
General PCA: apply the PCA method to the initial data table.
Centered PCA: apply the PCA method to the centered variables.
Reduced PCA: apply the PCA method to the centered and reduced (standardized) variables.
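The three variants differ only in how the data table is preprocessed before the same eigen-analysis. The sketch below is not from the slides; it is a minimal numpy illustration with assumed function and variable names:

```python
import numpy as np

def pca(X, kind="reduced"):
    """Eigenvalues/eigenvectors of Y'Y for one of the three PCA variants.
    X is the n x m data table (rows = observations, columns = variables)."""
    X = np.asarray(X, dtype=float)
    if kind == "general":            # general PCA: raw data table
        Y = X
    elif kind == "centered":         # centered PCA: centered variables
        Y = X - X.mean(axis=0)
    elif kind == "reduced":          # reduced PCA: centered and reduced variables
        Y = (X - X.mean(axis=0)) / X.std(axis=0)
    else:
        raise ValueError("kind must be 'general', 'centered' or 'reduced'")
    eigvals, eigvecs = np.linalg.eigh(Y.T @ Y)   # Y'Y is symmetric
    order = np.argsort(eigvals)[::-1]            # eigenvalues in descending order
    return eigvals[order], eigvecs[:, order], Y
```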

Principal Component Analysis
Centered PCA. X is a statistical variable; its mean is X̄ = (1/n) Σi X(ωi), and the centered variable is Y(ωi) = X(ωi) − X̄.

Principal Component Analysis
Reduced PCA. The centered and reduced variable is Y(ωi) = (X(ωi) − X̄) / σX, where σX is the standard deviation of X.

Principal Component Analysis
PCA provides a method of representing a population in order to: locate homogeneous groups of observations with respect to the variables; reveal differences between observations or groups of observations with respect to the set of variables; highlight observations with atypical behavior; and reduce the amount of information needed to describe the position of an observation within the population.

Principal Component Analysis
Principle. The population is a set of n observations ω1, ..., ωn, and m quantitative variables are defined on this population (see the example table above).

Principal Component Analysis
Two types of analysis are carried out: the analysis of the observations and the analysis of the variables. The reduced analysis works on the centered and reduced variables defined above.

Principal Component Analysis
Observations analysis. Each observation is represented by a point in a three-dimensional space (one dimension per variable in the example). How can a distance between two observations be computed?

Principal Component Analysis
Observations analysis. The three axes are defined by the variables Y1(.), Y2(.) and Y3(.) computed from the initial variables. The distance between two observations ωi and ωk is given by d²(ωi, ωk) = Σj [Yj(ωi) − Yj(ωk)]². This distance measures the resemblance between the two observations: the smaller the distance, the closer the two points are and the more the observations resemble each other; conversely, the larger the distance, the farther apart the points are and the less the observations resemble each other.
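As a small illustration (the helper name is assumed, not from the slides), this is the usual Euclidean distance between the two observation vectors:

```python
import numpy as np

def distance(obs_i, obs_k):
    """Euclidean distance between two observations described by the
    variables Y1(.), ..., Ym(.), passed here as coordinate vectors."""
    obs_i, obs_k = np.asarray(obs_i, float), np.asarray(obs_k, float)
    return float(np.sqrt(np.sum((obs_i - obs_k) ** 2)))

# Illustrative call with two observations described by 3 variables:
# distance([0.5, -1.2, 0.3], [1.1, 0.4, -0.7])
```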

Principal Component Analysis
Observations analysis. It is impossible to represent the observations directly in a space of dimension greater than 3, so a good representation of the group of observations must be found in a space of lower dimension (2, for example). How do we pass from a space of dimension 3 or more to a space of smaller dimension? By looking for a "good subspace" of representation by means of a mathematical operator. Two problems arise: (1) give a meaning to the expression "good representation"; (2) characterize the subspace.

Principal Component Analysis
Observations analysis. Find a subspace F such that the distances between points are preserved as well as possible by the projection onto this subspace; in this way the resemblance between observations is preserved by the projection.
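The slides leave the criterion implicit; one standard way to make "good representation" precise (an assumption here, not a formula taken from the slides) is to ask that the projected cloud keep as much of its total inertia as possible:

```latex
\max_{\dim F = q} \; \sum_{i=1}^{n} \left\| P_F\!\left(y_i\right) \right\|^{2}
\qquad \text{where } y_i \text{ is the (centered) vector of observation } \omega_i
\text{ and } P_F \text{ is the orthogonal projection onto } F .
```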

Principal Component Analysis
Observations analysis. Solution: determine the subspace F, of dimension q, from the first q eigenvalues and the q associated eigenvectors of the matrix Y'Y (in the reduced PCA, Y'Y is proportional to the correlation matrix of the variables).

Principal Component Analysis
Observations analysis. Let Z = Y'Y, with eigenvalues λ1, λ2, ..., λm and eigenvectors u1, u2, ..., um. The vector Y·u1 gives the coordinates of the n observations on the first principal axis, Y·u2 the coordinates on the second principal axis, ..., and Y·um the coordinates on the m-th principal axis.
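Continuing the pca() sketch from earlier (names assumed), the coordinates of the observations on the principal axes are obtained by projecting the table Y on the eigenvectors:

```python
import numpy as np

# X: the n x m data table (assumed to be defined); reduced PCA as in the slides.
eigvals, eigvecs, Y = pca(X, kind="reduced")   # pca() as sketched above

coords = Y @ eigvecs          # n x m matrix of observation coordinates
first_axis = coords[:, 0]     # coordinates of the n observations on axis 1 (Y u1)
second_axis = coords[:, 1]    # coordinates of the n observations on axis 2 (Y u2)
```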

Principal Component Analysis
Observations analysis. Indicators are needed to assess the quality of the results obtained. These indicators are: an indicator of global quality; an indicator of the contribution of an observation to the total inertia; an indicator of the contribution of an observation to the inertia explained by the subspace F; and an indicator of the error of perspective.

Principal Component Analysis
Observations analysis. Global quality: IQG(F) = (λ1 + ... + λq) / (λ1 + ... + λm), where q is the dimension of the subspace, m is the number of variables, and the eigenvalues of Y'Y are numbered in descending order. In the example the third eigenvalue is λ3 = 0.00; for the one-dimensional subspace F we obtain IQG(F) = 0.6896, i.e. the first axis of the analysis provides 68.96% of the initial information, and the subspace generated by the first two axes gives IQG(F) = 1 (100% of the initial information).
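A minimal computation of the global quality indicator from the eigenvalues (the helper name is assumed):

```python
import numpy as np

def iqg(eigvals, q):
    """Global quality of the q-dimensional subspace F: share of the total
    inertia carried by the q largest eigenvalues."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending order
    return lam[:q].sum() / lam.sum()

# In the slides' example, iqg(lam, 1) = 0.6896 (68.96% of the information)
# and iqg(lam, 2) = 1.0, since the third eigenvalue is 0.
```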

Principal Component Analysis
Observations analysis. Contribution of an observation to the total inertia: the CIT values sum to 1 over the N individuals (observations) of the PCA, Σi CIT(ωi) = 1. CIT makes it easy to locate the observations that lie far from the center of gravity.

Principal Component Analysis
Observations analysis. Contribution of an observation to the inertia explained by the subspace: the CIE identifies the observations that contribute most to the construction of the subspace F. In general this parameter is computed for every observation on each axis. Example: CIE values for the nine observations of our example (table not reproduced here).

Principal Component Analysis
Observations analysis. Error of perspective: COS²(ωi, F) measures the quality of representation of an observation on the subspace; it lies between 0 and 1, and values close to 1 indicate that the observation is only slightly deformed by the projection.
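The three observation-level indicators can be sketched as follows, assuming a centered table Y and uniform weights 1/N (the weighting is an assumption on my part; the slides do not state it):

```python
import numpy as np

def observation_indicators(Y, eigvecs, q=2):
    """CIT, CIE and COS2 for each observation of a centered table Y (n x m),
    with uniform observation weights (simplifying assumption)."""
    d2_total = np.sum(Y ** 2, axis=1)        # squared distance to the center of gravity
    cit = d2_total / d2_total.sum()          # contribution to total inertia (sums to 1)

    proj = Y @ eigvecs[:, :q]                # coordinates on the q retained axes
    d2_proj = np.sum(proj ** 2, axis=1)
    cie = d2_proj / d2_proj.sum()            # contribution to the inertia explained by F
    cos2 = d2_proj / d2_total                # quality of representation, between 0 and 1
    return cit, cie, cos2
```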

Principal Component Analysis
Variables analysis. Objective: determine synthetic statistical variables that "explain" the initial variables. Problem: fix the criterion that determines these synthetic variables, then interpret them. In our example the problem can be stated mathematically as follows: Y1(.), Y2(.) and Y3(.) are explained linearly by the synthetic variables Z1(.) and Z2(.); d1(.), d2(.) and d3(.) are the residual variables, whose variances we want to minimize; the coefficients aij are the solutions of the optimization problem Min [ V(d1(.)) + V(d2(.)) + V(d3(.)) ], where V(di(.)) is the variance of di(.).

Principal Component Analysis
Variables analysis. Solution: compute the eigenvectors associated with the q largest eigenvalues of the matrix YY'. Note: the matrix YY' has the same nonzero eigenvalues as the matrix Y'Y. These two eigenvectors define the two sought synthetic variables.
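A quick numerical check of this remark (illustrative code on random toy data, not the slides' example):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 3))                 # toy centered table: 10 observations, 3 variables

small = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]   # 3 eigenvalues of Y'Y
big = np.sort(np.linalg.eigvalsh(Y @ Y.T))[::-1]     # 10 eigenvalues of YY', 7 of them ~ 0
print(np.allclose(small, big[:3]))                   # the nonzero eigenvalues coincide

# Equivalently, the SVD Y = U S V' gives both analyses at once:
# V holds the eigenvectors of Y'Y (variables side), U those of YY' (observations side).
```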

Principal Component Analysis
Variables analysis. The same indicators as before are used in the analysis of the variables. A significant one is IQG(F), the quality indicator of the subspace F onto which the variables are projected. It allows the "residual variance" (the part not taken into account by the representation in the subspace) to be computed: residual variance = m · [1 − IQG(F)].

Principal Component Analysis
Variables analysis. It can be shown that the coordinate of the projection of a variable on an axis of the subspace is proportional to the linear correlation coefficient between this variable and the "synthetic" variable corresponding to that axis. Note: taking this proportionality into account, the program applies a rescaling so that the coordinates of the projected variables on each axis are directly the linear correlation coefficients.
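Under the reduced-PCA convention of the earlier pca() sketch (Y standardized with the population standard deviation), these rescaled variable coordinates can be computed as follows (helper name assumed):

```python
import numpy as np

def variable_coordinates(Y, eigvals, eigvecs):
    """Coordinates of the variables on the principal axes for a reduced PCA.
    Column k equals the linear correlations between the variables and
    principal component k, so every variable lies inside the unit circle."""
    n = Y.shape[0]
    # coordinate of variable j on axis k: sqrt(lambda_k / n) * u_jk
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0) / n)
```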

Principal Component Analysis
Variables analysis. For each variable, the multiple correlation coefficient with the variables corresponding to the axes of the subspace F onto which it is projected is proportional to the square of the norm of the projected vector. A variable is therefore better explained by the axes of a subspace when the norm of the associated projected vector is large.

Principal Component Analysis
Simulated example: 8 variables (X1, ..., X8) and 300 observations (the data table, with columns Num, X1, ..., X8, is not reproduced here).

Linear correlation between the variables (correlation table not reproduced here).

Variables not correlated with X1.

Eigenvalues, cumulated eigenvalues and cumulated percentages (table not reproduced here). 100% of the inertia is obtained with the first two axes.

Variables coordinates on the first two principal axes U1 and U2 (table of the coordinates of X1, ..., X8 not reproduced here). U1 is the first principal component. All the variables are located inside a unit circle (reduced PCA).

Variables coordinates: plot of X1, ..., X8 in the plane of axes #1 and #2; two dimensions are highlighted.

Observations coordinates (plot not reproduced here).

Factorial Correspondence Analysis
Factorial correspondence analysis is used to extract information from contingency tables (frequency tables), obtained by crossing two variables X and Y, where X has m modalities and Y has p modalities. Objectives of FCA: build a map of the modalities of the two variables X and Y, and determine whether certain modalities of X are associated with certain modalities of Y.
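A contingency table of this kind can be built directly from raw paired observations. The sketch below uses pandas with purely hypothetical labels (the ward and expenditure names are illustrative, not the slides' data):

```python
import pandas as pd

# Hypothetical raw records: one row per hospital stay (labels are made up).
records = pd.DataFrame({
    "ward":        ["surgery", "cardiology", "surgery", "pediatrics", "cardiology"],
    "expenditure": ["staff",   "drugs",      "drugs",   "staff",      "equipment"],
})

# Contingency (frequency) table: rows = modalities of X (ward),
# columns = modalities of Y (expenditure).
table = pd.crosstab(records["ward"], records["expenditure"])
print(table)
```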

Factorial Correspondence Analysis
Example: 2 variables, ward and expenditure; 5 wards (divisions of a hospital) and 5 expenditure categories (posts of expenditure).

Factorial Correspondence Analysis
Analysis of the row modalities. A row modality is represented by a point in a p-dimensional space; for example, row 2 is a point of R^5. The 5 row modalities thus give 5 points in a 5-dimensional space.

Factorial Correspondence Analysis
How can a subspace of reduced dimension q (q = 2, for example) be found in which to represent these points? The distances between the points represented in the subspace must be as close as possible to the distances between the initial points, so a distance between the points (between modalities) must first be defined. A row modality i is represented by a vector xi whose coordinates are the row profile xij = fij / fi., where fij is the relative frequency of cell (i, j) and fi. is the row margin.

Factorial Correspondence Analysis
The distance between two row modalities i and i' is given by d²(i, i') = Σj (xij − xi'j)² / f.j, where f.j is the column margin. This distance is called the chi-square distance. Example: the distances between the modalities of the ward variable (table not reproduced here).
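A sketch of the row profiles and chi-square distances for a contingency table N (the helper name is assumed):

```python
import numpy as np

def chi2_row_distances(N):
    """Chi-square distances between the row modalities of a contingency table N."""
    N = np.asarray(N, dtype=float)
    f = N / N.sum()                      # relative frequencies f_ij
    f_row = f.sum(axis=1)                # row margins f_i.
    f_col = f.sum(axis=0)                # column margins f_.j
    profiles = f / f_row[:, None]        # row profiles x_ij = f_ij / f_i.
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt(np.sum(diff ** 2 / f_col, axis=2))   # matrix of d(i, i')
```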

Factorial Correspondence Analysis
Problem formulation: find a q-dimensional subspace F in which the projected inertia of the cloud of row modalities is maximized.

Factorial Correspondence Analysis
Let g be the center of gravity of the points xi, each having the weight fi.. Centering operation: zi = xi − g. Each vector zi has p coordinates, noted zij, and we define a matrix Z whose general term is zij. It can be shown that the q-dimensional subspace F is generated by the eigenvectors of the matrix Z'Z.
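Putting the pieces together, a minimal FCA sketch; folding the weights into zij as below is one common convention and an assumption on my part, since the slides only state that F is spanned by eigenvectors of Z'Z:

```python
import numpy as np

def fca_axes(N, q=2):
    """First q eigenvalues/eigenvectors of Z'Z for a contingency table N,
    with z_ij = (f_ij - f_i. f_.j) / sqrt(f_i. f_.j) (assumed weighting)."""
    N = np.asarray(N, dtype=float)
    f = N / N.sum()
    f_row, f_col = f.sum(axis=1), f.sum(axis=0)
    Z = (f - np.outer(f_row, f_col)) / np.sqrt(np.outer(f_row, f_col))
    eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)
    order = np.argsort(eigvals)[::-1]                 # descending order
    return eigvals[order][:q], eigvecs[:, order][:, :q]
```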

Factorial Correspondence Analysis
Example: center of gravity g, vectors xi and zi, matrix Z, and eigenvalues λ1, λ2, λ3 (numerical values not reproduced here; λ3 = 0).

Factorial Correspondence Analysis
Quality of representation indicators. Quality of the subspace generated: IQG(F) = (λ1 + ... + λq) / (λ1 + ... + λp), where q is the dimension of the subspace and p is the number of column modalities.

Factorial Correspondence Analysis
Contribution of a row modality i to the construction of axis k: 0 ≤ CIE(i, uk) ≤ 1; if CIE is close to 1, the row modality has a significant weight in the determination of the subspace F. Example: contributions of the row modalities (table not reproduced here).

Factorial Correspondence Analysis
Quality of representation (perspective effect): measures the degree of deformation caused by the projection.

Factorial Correspondence Analysis
Column modalities analysis: the column modalities are analyzed in the same manner as the row modalities, their coordinates being given by the column profiles. The matrices Z'Z and ZZ' have the same nonzero eigenvalues.

Factorial Correspondence Analysis
Contributions of the column modalities and quality of representation of the column modalities: these indicators have the same definitions as for the rows, adapted to the column modalities.

Factorial Correspondence Analysis
The simultaneous representation of the rows and the columns projected in the first factorial plane (axes 1 and 2) of our example (plot not reproduced here).

Illustrations