# Lecture 3: A brief background to multivariate statistics

- Univariate versus multivariate statistics
- The material of multivariate analysis
- Displaying multivariate data
- The uses of multivariate statistics
- A refresher of matrix algebra

Bio 8100s Applied Multivariate Biostatistics 2001

Multivariate versus univariate statistics
In univariate statistical analysis, we are concerned with analyzing variation in a single random variable. In multivariate statistical analysis, we are concerned with analyzing variation in several random variables which may or may not be related.

The material of multivariate analysis
Multivariate data consists of a set of measurements (usually related) of P variables X1, X2, …, XP on n sample units. The variables Xj may be ratio, ordinal, or nominal.

Example 1: Bumpus’ sparrow data
5 morphological measurements (in mm) of 49 sparrows recovered from a storm in 1898.

Example 2: Biodiversity of SE Ontario wetlands
Species richness (number of species) of 5 different taxa in 57 wetlands in southeastern Ontario.

The material of multivariate analysis
In some applications, the measured variables comprise both dependent (Y) and independent (X) variables.

Example 1: Pgi frequencies in California Euphydryas editha colonies in relation to environmental factors.

Example 2: Anurans in SE Ontario wetlands in relation to surrounding forest cover and road densities

Multivariate LS estimators
The sample means, variances and covariances are estimates of the true (“population”) means, variances and covariances. Inferences about the population values based on the sample values therefore assume random sampling.

The sample covariance matrix
The sample covariance matrix is a square matrix whose diagonal elements give the sample variances for each measured variable ($s_i^2$), and whose off-diagonal elements are the sample covariances between pairs of variables ($c_{ik}$).

A review of matrix algebra
A matrix of size m x n is an array of numbers (either real or complex) with m rows and n columns. Matrices with one column are column vectors, matrices with one row are row vectors.

Special matrices
A zero matrix 0 has all elements equal to zero. A diagonal matrix T is a square matrix (m = n) with all off-diagonal elements equal to zero. An identity matrix I is a diagonal matrix with all diagonal terms equal to one.

Matrix operations
The transpose of a matrix A ($A^T$) is obtained by interchanging rows and columns. The transpose of a row vector is a column vector, and the transpose of a column vector is a row vector.

The trace of a matrix
The trace of a matrix A, denoted tr(A), is the sum of the diagonal elements. The trace is defined only for square matrices.
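The transpose and trace definitions can be sketched in a few lines of plain Python. Matrices are stored as lists of rows here, which is an assumption made only for illustration; the function names `transpose` and `trace` are likewise illustrative.

```python
def transpose(m):
    """Interchange rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def trace(m):
    """Sum of the diagonal elements; defined only for square matrices."""
    if any(len(row) != len(m) for row in m):
        raise ValueError("trace is defined only for square matrices")
    return sum(m[i][i] for i in range(len(m)))


row_vector = [[1, 2, 3]]              # 1 x 3
column_vector = transpose(row_vector)  # 3 x 1

A = [[2, 7],
     [1, 5]]

print(column_vector)  # [[1], [2], [3]]
print(transpose(A))   # [[2, 1], [7, 5]]
print(trace(A))       # 7
```

Calling `trace` on a non-square matrix such as `[[1, 2, 3], [4, 5, 6]]` raises `ValueError`, mirroring the restriction to square matrices.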

Two matrices A and B are conformable for addition if they are of the same size (same numbers of rows and columns). The resulting matrix A + B (A - B) is obtained by adding (subtracting) individual matrix elements.

Matrix multiplication by a scalar
The multiplication of a matrix A by a scalar k involves multiplying each element of A by k.

Matrix multiplication
Two matrices A (m x n) and B (n x p) are conformable for multiplication (A • B) if the number of columns in A equals the number of rows in B. A • B and B • A are both defined only when A is m x n and B is n x m, and they have the same size only when both A and B are square. Even then, in general A • B ≠ B • A.
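As a minimal sketch of the conformability rule and of non-commutativity, the plain-Python function below multiplies matrices stored as lists of rows. The helper name `matmul` and the example matrices are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows.

    Raises ValueError when the matrices are not conformable, i.e. when
    the number of columns of `a` differs from the number of rows of `b`.
    """
    if len(a[0]) != len(b):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Two square matrices: both products exist but differ.
A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]
AB = matmul(A, B)  # [[2, 1], [4, 3]]
BA = matmul(B, A)  # [[3, 4], [1, 2]]
print(AB == BA)    # False

# A 2 x 3 and a 3 x 2 matrix: both products exist but have different sizes.
C = [[1, 0, 2],
     [0, 1, 3]]
D = [[1, 0],
     [0, 1],
     [1, 1]]
CD = matmul(C, D)  # 2 x 2
DC = matmul(D, C)  # 3 x 3
print(len(CD), len(DC))  # 2 3
```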

Matrix inversion
The inverse of a matrix A, denoted $A^{-1}$, is the matrix solving the matrix equation $A A^{-1} = A^{-1} A = I$, where I is the identity matrix. Only square matrices are invertible, and some square matrices cannot be inverted (“singular” matrices).

The covariance matrix
A multivariate sample is described by a covariance matrix, whose diagonal elements give the sample variances for each measured variable ($s_i^2$), and whose off-diagonal elements are the sample covariances between pairs of variables ($c_{ik}$).

Calculating the sample covariance matrix
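The worked calculation from the original slide is not recoverable, so the following is only a sketch under stated assumptions: it uses the usual unbiased estimator, $c_{ik} = \frac{1}{n-1}\sum_{j=1}^{n}(x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)$, and a small made-up data set with n = 4 sample units and 2 variables. The function name `sample_covariance` is illustrative.

```python
def sample_covariance(rows):
    """Return (means, covariance matrix) for data given as one row per sample unit.

    Uses the n - 1 divisor, so the diagonal holds the sample variances
    and the off-diagonal entries hold the sample covariances.
    """
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[i] - means[i]) * (r[k] - means[k]) for r in rows) / (n - 1)
            for k in range(p)]
           for i in range(p)]
    return means, cov


# Made-up data: 4 sample units, 2 variables (X1, X2).
data = [[1.0, 2.0],
        [2.0, 1.0],
        [3.0, 4.0],
        [4.0, 5.0]]

means, C = sample_covariance(data)
print(means)  # [2.5, 3.0]
print(C)      # variances 5/3 and 10/3 on the diagonal, covariance 2 off it
```

In practice a library routine such as NumPy's `numpy.cov` would normally be used; the loop version is shown only to make each term of the formula visible.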

The determinant of a matrix: 2 X 2 matrices
The determinant of a matrix A, denoted det(A) or |A|, is a unique number associated with every square matrix. In multivariate statistics, the determinant of the sample covariance matrix C plays a crucial role in hypothesis testing.
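For the 2 X 2 case named in the slide title, the determinant has a simple closed form. The element names a, b, c, d below are introduced here only for illustration:

```latex
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix},
\qquad
|A| = ad - bc .
```

For a 2 X 2 sample covariance matrix with variances $s_1^2$, $s_2^2$ and covariance $c_{12}$, this gives $|C| = s_1^2 s_2^2 - c_{12}^2$.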

Matrix inversion and the determinant: 2 X 2 matrices
If a 2 X 2 matrix A is invertible, the elements of its inverse $A^{-1}$ are obtained by dividing modified elements of A by |A|. Hence, if |A| = 0, the division is undefined and the matrix is non-invertible or singular.
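A minimal sketch of this rule, assuming the standard 2 X 2 formula in which the "modified elements" are the diagonal entries swapped and the off-diagonal entries negated. The function name `inverse_2x2` and the example matrices are made up for illustration.

```python
def inverse_2x2(m):
    """Invert a 2 x 2 matrix [[a, b], [c, d]] via its determinant ad - bc."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        # Division by |A| is undefined: the matrix is singular.
        raise ValueError("matrix is singular (determinant is zero)")
    return [[d / det, -b / det],
            [-c / det, a / det]]


A = [[4.0, 7.0],
     [2.0, 6.0]]          # |A| = 4*6 - 7*2 = 10
A_inv = inverse_2x2(A)

# Check that A times its inverse gives the identity (up to rounding).
product = [[sum(A[i][k] * A_inv[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
print(A_inv)
print(product)

singular = [[1.0, 2.0],
            [2.0, 4.0]]   # second row is twice the first, so |A| = 0
try:
    inverse_2x2(singular)
except ValueError as err:
    print(err)
```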

Multivariate variance: a geometric interpretation
Univariate variance is a measure of the “volume” occupied by sample points in one dimension. Multivariate variance involving m variables is the volume occupied by sample points in an m-dimensional space.
[Figure: samples with larger and smaller variance along X; occupied volume in the X1-X2 plane.]

Multivariate variance: effects of correlations among variables
Correlations between pairs of variables reduce the volume occupied by sample points, and hence reduce the multivariate variance.
[Figure: occupied volume in the X1-X2 plane under no correlation, positive correlation, and negative correlation.]

C and the generalized multivariate variance
The determinant of the sample covariance matrix C is a generalized multivariate variance, because the squared area of a parallelogram with sides given by the individual standard deviations and angle determined by the correlation between variables equals the determinant of C.
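The parallelogram claim can be checked numerically for two variables. The sketch below assumes the angle θ between the sides satisfies cos θ = r, where r is the correlation, so that the squared area is $s_1^2 s_2^2 \sin^2\theta = s_1^2 s_2^2 (1 - r^2)$. The function names and the standard deviations 2 and 3 are made up for illustration.

```python
import math


def generalized_variance(s1, s2, r):
    """Determinant of the 2 x 2 covariance matrix built from s1, s2 and r."""
    c12 = r * s1 * s2                      # covariance from the correlation
    C = [[s1 * s1, c12],
         [c12, s2 * s2]]
    return C[0][0] * C[1][1] - C[0][1] * C[1][0]


def parallelogram_area_squared(s1, s2, r):
    """Squared area of a parallelogram with sides s1, s2 and cos(angle) = r."""
    theta = math.acos(r)
    return (s1 * s2 * math.sin(theta)) ** 2


for r in (0.0, 0.5, -0.5, 0.9):
    det_c = generalized_variance(2.0, 3.0, r)
    area2 = parallelogram_area_squared(2.0, 3.0, r)
    print(r, det_c, area2)
```

With r = 0 the determinant is 36, and it shrinks as |r| grows, whatever the sign of r. This matches the statement that correlation reduces the occupied volume.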

The use of determinants in multivariate analysis
For a univariate sample variance $s_a^2$, as used in univariate single-classification ANOVA with k groups, the multivariate analog is the determinant of the corresponding sample covariance matrix $C_a$, i.e., $|C_a|$. These generalized variances are often used in the calculation of multivariate test statistics, e.g., Wilks' Λ in multivariate single-classification ANOVA (MANOVA).
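One common way to write Wilks' Λ is as a ratio of determinants, Λ = |W| / |T|, where W is the pooled within-group sums-of-squares-and-cross-products (SSCP) matrix and T is the total SSCP matrix. The sketch below assumes that form, uses SSCP matrices rather than covariance matrices, and runs on made-up data with two groups and two variables. All function names are illustrative.

```python
def mean_vector(rows):
    """Column means of data given as one row per sample unit."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def sscp(rows, center):
    """Sums of squares and cross-products of deviations from `center`."""
    p = len(center)
    return [[sum((r[i] - center[i]) * (r[k] - center[k]) for r in rows)
             for k in range(p)]
            for i in range(p)]


def det_2x2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def wilks_lambda(groups):
    """Wilks' Lambda = |W| / |T| for two variables measured in several groups."""
    within = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        s = sscp(g, mean_vector(g))        # deviations from the group mean
        within = [[within[i][k] + s[i][k] for k in range(2)] for i in range(2)]
    pooled = [row for g in groups for row in g]
    total = sscp(pooled, mean_vector(pooled))  # deviations from the grand mean
    return det_2x2(within) / det_2x2(total)


# Made-up data: two groups of three sample units each.
group_1 = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
group_2 = [[4.0, 5.0], [5.0, 7.0], [6.0, 6.0]]

lam = wilks_lambda([group_1, group_2])
print(lam)  # values near 0 mean most variation lies between groups
```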

Eigenvalues
The eigenvalues of a p X p matrix A are the p solutions, some of which may be zero, to the equation $|A - \lambda I| = 0$. The trace of a matrix is the sum of its eigenvalues, and the determinant of a matrix is the product of its eigenvalues.
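For a 2 X 2 matrix the characteristic equation is the quadratic λ² − tr(A) λ + |A| = 0, so the trace and determinant identities can be checked directly. The sketch below is restricted to symmetric 2 X 2 matrices (such as covariance matrices), for which the roots are always real. The function name and example matrices are made up for illustration.

```python
import math


def eigenvalues_sym_2x2(m):
    """Eigenvalues of a symmetric 2 x 2 matrix [[a, b], [b, d]], largest first."""
    (a, b), (c, d) = m
    if b != c:
        raise ValueError("this sketch handles symmetric matrices only")
    tr = a + d
    # sqrt(tr^2 - 4 det) rewritten as sqrt((a - d)^2 + 4 b^2), which is never negative
    root = math.hypot(a - d, 2 * b)
    return (tr + root) / 2, (tr - root) / 2


C = [[4.0, 3.0],
     [3.0, 9.0]]
l1, l2 = eigenvalues_sym_2x2(C)
print(l1 + l2)   # equals the trace, 13
print(l1 * l2)   # equals the determinant, 27 (up to rounding)

# A singular matrix has a zero eigenvalue, so its determinant is zero.
S = [[1.0, 2.0],
     [2.0, 4.0]]
print(eigenvalues_sym_2x2(S))  # (5.0, 0.0)
```

For larger matrices a library routine such as NumPy's `numpy.linalg.eigh` would normally be used instead of a closed-form solution.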

Eigenvalues and eigenvectors I
Suppose v is a vector, and L a linear transformation. If L(v) = λv, then v is an eigenvector of L associated with the eigenvalue λ. E.g., if L is the reflection in the line y = mx, then a vector a lying along the line is an eigenvector associated with eigenvalue 1, and a vector b perpendicular to the line is an eigenvector associated with eigenvalue -1. Note that a and b are orthogonal!
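The reflection example can be verified numerically. The sketch assumes the standard matrix for reflection in the line y = mx, namely $\frac{1}{1+m^2}\begin{pmatrix} 1-m^2 & 2m \\ 2m & m^2-1 \end{pmatrix}$, and takes a = (1, m) along the line and b = (-m, 1) perpendicular to it. The slope m = 2 and the function names are made up for illustration.

```python
def reflection_matrix(m):
    """Matrix of the reflection in the line y = m * x."""
    s = 1.0 + m * m
    return [[(1.0 - m * m) / s, 2.0 * m / s],
            [2.0 * m / s, (m * m - 1.0) / s]]


def apply(matrix, v):
    """Apply a matrix (list of rows) to a vector."""
    return [sum(matrix[i][k] * v[k] for k in range(len(v)))
            for i in range(len(matrix))]


m = 2.0
R = reflection_matrix(m)
a = [1.0, m]      # lies along the line y = m x
b = [-m, 1.0]     # perpendicular to the line

Ra = apply(R, a)  # unchanged: eigenvalue 1
Rb = apply(R, b)  # sign flipped: eigenvalue -1
dot_ab = sum(x * y for x, y in zip(a, b))

print(Ra, Rb, dot_ab)
```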

Eigenvalues and eigenvectors of C
Eigenvectors of the covariance matrix C are orthogonal directed line segments that “span” the variation in the data, and the corresponding (unsigned) eigenvalues are the lengths of these segments, so the product of the eigenvalues is the “volume” occupied by the data, i.e. the determinant of the covariance matrix.
[Figure: eigenvectors of C in the X1-X2 plane under no correlation, positive correlation, and negative correlation.]

Displaying multivariate data I: Draftsman’s plots (SPLOM)
Plot pairs of variables against one another. Advantages: need only 2 plotting dimensions, bivariate relationships among variables are clear. Problems: no direct information on relationships in higher than 2 dimensions, relationships between objects unclear.

Displaying multivariate data II: multiple 3-D plots
Plot 3 variables against one another. Advantages: trivariate relationships among variables are clear. Problems: no direct information on relationships in higher than 3 dimensions, relationships between objects unclear.

Displaying multivariate data III: plotting index variables
Generate index variables that combine information from several measured variables, then plot these variables. Advantages: 2-D plots make relationships among variables clear. Disadvantages: relationships among objects unclear, key information may be lost in data reduction.

Displaying multivariate data IV: Icon plots
Used to visualize relationships among objects, e.g. different canine groups. Advantages: All variables displayed simultaneously. Problems: order of display of variables arbitrary, and impressions may depend on order. Relationships among variables may be unclear.
[Figure: icon plots on axes X1 to X6 for Cuon, Dingo, Prehistoric dog, Chinese wolf, Golden jackal, Modern.]

Displaying multivariate data V: profile plots
Represent objects by lines, histograms or Fourier plots. Advantages: All variables displayed simultaneously. Problems: order of display of variables arbitrary, and impressions may depend on order. Relationships among variables may be unclear.
