# Multivariate data and multivariate analysis


Multivariate data and multivariate analysis

Multivariate data

Multivariate data are values recorded for several random variables on a number of units.

| Unit | Variable 1 | … | Variable q |
|------|------------|---|------------|
| 1    | $x_{11}$   | … | $x_{1q}$   |
| ⋮    | ⋮          |   | ⋮          |
| n    | $x_{n1}$   | … | $x_{nq}$   |

- n: number of units
- q: number of variables recorded on each unit
- $x_{ij}$: value of the jth variable for the ith unit
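As a minimal sketch of this layout in R, the data matrix can be stored with one row per unit and one column per variable. The numbers and the names `X`, `var1`, and `var2` below are hypothetical and are not taken from the slides.

```r
# Hypothetical data matrix: n = 3 units (rows), q = 2 variables (columns)
X <- matrix(c(1.2, 3.4,
              0.7, 2.9,
              1.9, 4.1),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("var1", "var2")))

n <- nrow(X)      # number of units
q <- ncol(X)      # number of variables recorded on each unit
x_21 <- X[2, 1]   # x_ij with i = 2, j = 1: value of variable 1 for unit 2
```

Indexing as `X[i, j]` mirrors the notation $x_{ij}$ used in the table.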

Multivariate analysis

Multivariate statistical analysis is the simultaneous statistical analysis of a collection of variables (e.g. handling apples, oranges, and pears at the same time). It improves upon separate univariate analyses of each variable by using information about the relationships between the variables. The general aim of most multivariate analysis is to uncover, display, or extract any "signal" in the data in the presence of noise, and to discover what the data have to tell us.

Multivariate analysis

- Exploratory methods (discussed in this course) allow the detection of possible unanticipated patterns in the data, opening up a wide range of competing explanations. These methods are characterized by an emphasis on graphical display and visualization of the data, and by the lack of any associated probabilistic model that would allow for formal inferences.
- Statistical inference methods are used when the individuals in a multivariate data set have been sampled from some population and the investigator wishes to test a well-defined hypothesis about the parameters of that population's probability density function (often assumed to be multivariate normal).

History of the development of multivariate analysis

- Late 19th century: Francis Galton and Karl Pearson quantified the relationship between offspring and parental characteristics and developed the correlation coefficient.
- Early 20th century: Charles Spearman introduced factor analysis while investigating correlated intelligence quotient (IQ) tests. Over the next two decades, Spearman's work was extended by Hotelling and by Thurstone.
- 1920s: Fisher's introduction of the analysis of variance was followed by its multivariate generalization, multivariate analysis of variance, based on work by Bartlett and Roy.

History of the development of multivariate analysis

- At the beginning, the computational aids available to take on the vast amounts of arithmetic involved in applying multivariate methods were very limited.
- In the early years of the 21st century, the wide availability of cheap and extremely powerful personal computers and laptops, together with flexible statistical software, has meant that all the methods of multivariate analysis can be applied routinely, even to very large data sets.
- The application of multivariate techniques to large data sets was named "data mining" (the nontrivial extraction of implicit, previously unknown, and potentially useful information from data).

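Example of multivariate data

| individual | sex    | age | IQ  | depression | health    | weight |
|------------|--------|-----|-----|------------|-----------|--------|
| 1          | Male   | 21  | 120 | Yes        | Very good | 150    |
| 2          | Male   | 43  | NA  | No         | Very good | 160    |
| 3          | Male   | 22  | 135 | No         | Average   | 135    |
| 4          | Male   | 86  | 150 | No         | Very poor | 140    |
| 5          | Male   | 60  | 92  | Yes        | Good      | 110    |
| 6          | Female | 16  | 130 | Yes        | Good      | 110    |
| 7          | Female | NA  | 150 | Yes        | Very good | 120    |
| 8          | Female | 43  | NA  | Yes        | Average   | 120    |
| 9          | Female | 22  | 84  | No         | Average   | 105    |
| 10         | Female | 80  | 70  | No         | Good      | 100    |

NA: Not Available

Number of units n = 10. Number of variables q = 7.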

Types of measurements

- Nominal: unordered categorical variables (e.g. treatment allocation, the sex of the respondent, hair color).
- Ordinal: there is an ordering but no implication of equal distance between the different points of the scale (e.g. educational level: no schooling, primary, secondary, or tertiary education).
- Interval: there are equal differences between successive points on the scale, but the position of zero is arbitrary (e.g. temperature measured on the Celsius or Fahrenheit scales).
- Ratio: the highest level of measurement, where relative magnitudes of scores are meaningful as well as the differences between them. The position of zero is fixed (e.g. absolute temperature in Kelvin, age, weight, and length).
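The four measurement types map loosely onto R data types. The sketch below uses hypothetical values and variable names (`hair`, `edu`, `temp_c`, `temp_k`) to illustrate the distinction; it is one possible encoding, not a prescription from the slides.

```r
# Nominal: unordered factor, levels have no ranking
hair <- factor(c("brown", "black", "blond"))

# Ordinal: ordered factor, comparisons are meaningful but distances are not
edu <- factor(c("primary", "tertiary", "no schooling"),
              levels = c("no schooling", "primary", "secondary", "tertiary"),
              ordered = TRUE)
edu_cmp <- edu[1] < edu[2]   # primary ranks below tertiary

# Interval vs ratio: differences survive a change of zero, ratios do not
temp_c <- c(10, 20)          # degrees Celsius (interval scale)
temp_k <- temp_c + 273.15    # Kelvin (ratio scale)
ratio_c <- temp_c[2] / temp_c[1]   # 2, but "twice as hot" is not meaningful
ratio_k <- temp_k[2] / temp_k[1]   # about 1.035, the physically meaningful ratio
```

The difference between the two temperatures is 10 on both scales, while the ratio changes with the choice of zero, which is why Celsius is only an interval scale.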

Missing values

Missing values are observations and measurements that should have been recorded but, for one reason or another, were not (e.g. non-response in sample surveys, dropouts in longitudinal data).

How to deal with missing values?

1. Complete-case analysis: omit any case with a missing value on any of the variables (not recommended because it might lead to misleading conclusions and inferences).
2. Available-case analysis: exploit incomplete information by using all the cases available to estimate the quantities of interest (difficulties arise when the data are not missing completely at random).
3. Multiple imputation: a Monte Carlo technique in which the missing values are replaced by m > 1 simulated versions (3 < m < 10) (the most appropriate way).
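The first two strategies can be sketched in base R using the ten-unit example data shown earlier. The data frame name `hypo` is an assumption made for this illustration. Multiple imputation is not shown because it requires an add-on package.

```r
# Rebuild the ten-unit example data; the name 'hypo' is assumed here
hypo <- data.frame(
  individual = 1:10,
  sex        = factor(rep(c("Male", "Female"), each = 5)),
  age        = c(21, 43, 22, 86, 60, 16, NA, 43, 22, 80),
  IQ         = c(120, NA, 135, 150, 92, 130, 150, NA, 84, 70),
  depression = factor(c("Yes", "No", "No", "No", "Yes",
                        "Yes", "Yes", "Yes", "No", "No")),
  health     = factor(c("Very good", "Very good", "Average", "Very poor", "Good",
                        "Good", "Very good", "Average", "Average", "Good")),
  weight     = c(150, 160, 135, 140, 110, 110, 120, 120, 105, 100)
)

# 1. Complete-case analysis: drop every unit with at least one NA
complete   <- hypo[complete.cases(hypo), ]
n_complete <- nrow(complete)

# 2. Available-case analysis: each covariance uses all pairs observed for it
num           <- hypo[, c("age", "IQ", "weight")]
cov_complete  <- cov(num, use = "complete.obs")
cov_available <- cov(num, use = "pairwise.complete.obs")
```

With `use = "pairwise.complete.obs"` the variance of `weight` is computed from all ten units, whereas `use = "complete.obs"` discards the three units that have a missing `age` or `IQ`.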

Covariance

The covariance of two random variables is a measure of their linear dependence:

$$\operatorname{Cov}(X_i, X_j) = E\left[(X_i - \mu_i)(X_j - \mu_j)\right]$$

where $\mu_i = E(X_i)$, $\mu_j = E(X_j)$, and $E$ denotes expectation.

If the two variables are independent of each other, their covariance is equal to zero. Larger absolute values of the covariance indicate a greater degree of linear dependence between the two variables.

Covariance

The covariance of a variable with itself is simply its variance. The variance of variable $X_i$ is:

$$\sigma_i^2 = E\left[(X_i - \mu_i)^2\right]$$

In multivariate data with q observed variables, there are q variances and q(q-1)/2 covariances. Covariance depends on the scales on which the two variables are measured.
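A small worked example may help connect the definition to R's `cov()`. The vectors below are hypothetical. Note that `cov()` and `var()` compute the sample versions, which divide by n - 1 rather than taking a population expectation.

```r
# Hypothetical paired observations
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)
n <- length(x)

# Sample covariance: average cross-product of deviations, with divisor n - 1
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
cov_builtin <- cov(x, y)

# The covariance of a variable with itself is its variance
var_via_cov <- cov(x, x)

# Counting the entries for q variables
q <- 3
n_variances   <- q                 # 3
n_covariances <- q * (q - 1) / 2   # 3
```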

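Covariance matrix

The variances and covariances of the q variables are collected in the symmetric q × q covariance matrix, with the variances on the diagonal:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1q} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{q1} & \sigma_{q2} & \cdots & \sigma_q^2 \end{pmatrix}$$

where $\sigma_{ij}$ is the covariance of $X_i$ and $X_j$.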

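Measure data

Measurements of chest, waist, and hips on a sample of 20 men and women. R data "Measure".

| chest | waist | hips | gender |
|-------|-------|------|--------|
| 34    | 30    | 32   | male   |
| 37    | 32    | 37   | male   |
| 38    | 30    | 36   | male   |
| 36    | 33    | 39   | male   |
| 38    | 29    | 33   | male   |
| 43    | 32    | 38   | male   |
| 40    | 33    | 42   | male   |
| 38    | 30    | 40   | male   |
| 40    | 30    | 37   | male   |
| 41    | 32    | 39   | male   |
| 36    | 24    | 35   | female |
| 36    | 25    | 37   | female |
| 34    | 24    | 37   | female |
| 33    | 22    | 34   | female |
| 36    | 26    | 37   | female |
| 37    | 26    | 37   | female |
| 34    | 25    | 38   | female |
| 36    | 26    | 37   | female |
| 38    | 28    | 40   | female |
| 35    | 23    | 35   | female |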
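The slides do not show how the "Measure" data are loaded. One way to reproduce them, assuming the row order given in the table (ten men followed by ten women), is to type the values in directly:

```r
# Rebuild the "Measure" data from the table values (row order assumed)
Measure <- data.frame(
  chest  = c(34, 37, 38, 36, 38, 43, 40, 38, 40, 41,
             36, 36, 34, 33, 36, 37, 34, 36, 38, 35),
  waist  = c(30, 32, 30, 33, 29, 32, 33, 30, 30, 32,
             24, 25, 24, 22, 26, 26, 25, 26, 28, 23),
  hips   = c(32, 37, 36, 39, 33, 38, 42, 40, 37, 39,
             35, 37, 37, 34, 37, 37, 38, 37, 40, 35),
  gender = factor(rep(c("male", "female"), each = 10))
)

S_all  <- cov(Measure[, 1:3])                                 # all 20 units
S_male <- cov(subset(Measure, gender == "male")[, 1:3])       # men only
```

With the data entered this way, `cov()` and `cor()` can be run as on the slides that follow.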

Covariance

Calculate the covariance matrix for the numerical variables chest, waist, and hips in the "Measure" data. The categorical variable "gender" (column 4) is removed.

    > cov(Measure[, 1:3])
             chest     waist     hips
    chest 6.631579  6.368421 3.052632
    waist 6.368421 12.526316 3.684211
    hips  3.052632  3.684211 5.894737

Covariance

Covariance matrix for gender female in the "Measure" data:

    > cov(subset(Measure, Measure$gender == "female")[, c(1:3)])
             chest    waist     hips
    chest 2.277778 2.166667 1.500000
    waist 2.166667 2.988889 2.633333
    hips  1.500000 2.633333 2.900000

Covariance matrix for gender male in the "Measure" data:

    > cov(subset(Measure, Measure$gender == "male")[, c(1:3)])
              chest     waist     hips
    chest 6.7222222 0.9444444 3.944444
    waist 0.9444444 2.1000000 3.077778
    hips  3.9444444 3.0777778 9.344444

Correlation

Correlation is independent of the scales of the two variables. The correlation coefficient ($\rho_{ij}$) is the covariance divided by the product of the standard deviations of the two variables:

$$\rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}$$

where $\sigma_{ij}$ is the covariance of $X_i$ and $X_j$, and $\sigma_i = \sqrt{\sigma_i^2}$ is the standard deviation of $X_i$.

The correlation coefficient lies between -1 and +1 and gives a measure of the linear relationship of the variables $X_i$ and $X_j$. It is positive if high values of $X_i$ are associated with high values of $X_j$, and negative if high values of $X_i$ are associated with low values of $X_j$.
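To illustrate the relationship, the covariance matrix printed earlier for the "Measure" data can be converted to a correlation matrix by hand and with the base R function `cov2cor()`. The matrix `S` below is typed in from that printed output, so it carries the rounding of the printout.

```r
# Covariance matrix of chest, waist, hips as printed for the "Measure" data
vars <- c("chest", "waist", "hips")
S <- matrix(c(6.631579,  6.368421, 3.052632,
              6.368421, 12.526316, 3.684211,
              3.052632,  3.684211, 5.894737),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# Divide each covariance by the product of the two standard deviations
sds      <- sqrt(diag(S))
R_manual <- S / outer(sds, sds)

# Same result with the built-in helper
R_builtin <- cov2cor(S)
```

The off-diagonal entries agree with the `cor(Measure[, 1:3])` output in the slides up to rounding.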

Correlation

Correlation matrix for the "Measure" data:

    > cor(Measure[, 1:3])
              chest     waist      hips
    chest 1.0000000 0.6987336 0.4882404
    waist 0.6987336 1.0000000 0.4287465
    hips  0.4882404 0.4287465 1.0000000

Distances

The distance between the units in the data is of considerable importance in some multivariate techniques. The most common measure used is the Euclidean distance:

$$d_{ij} = \sqrt{\sum_{k=1}^{q} (x_{ik} - x_{jk})^2}$$

where $x_{ik}$ and $x_{jk}$, $k = 1, \dots, q$, are the variable values for units i and j.

When the variables in a multivariate data set are on different scales, they need to be standardized before the distances are calculated (e.g. by dividing each variable by its standard deviation).
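As a quick check of the Euclidean distance, the sketch below computes it by hand for the first two units of the "Measure" data (chest, waist, hips) and compares the result with `dist()`. No standardization is applied here.

```r
# First two units of the "Measure" data: chest, waist, hips
x1 <- c(34, 30, 32)
x2 <- c(37, 32, 37)

# Euclidean distance: square root of the summed squared differences
d_manual <- sqrt(sum((x1 - x2)^2))   # sqrt(9 + 4 + 25)

# dist() computes the same quantity for the rows of a matrix
d_builtin <- as.numeric(dist(rbind(x1, x2)))
```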

Distances

The distance matrix for the first 12 observations in the "Measure" data after standardization:

    > dist(scale(Measure[, 1:3], center = FALSE))
          1    2    3    4    5    6    7    8    9   10   11
    2  0.17
    3  0.15 0.08
    4  0.22 0.07 0.14
    5  0.11 0.15 0.09 0.22
    6  0.29 0.16 0.16 0.19 0.21
    7  0.32 0.16 0.20 0.13 0.28 0.14
    8  0.23 0.11 0.11 0.12 0.19 0.16 0.13
    9  0.21 0.10 0.06 0.16 0.12 0.11 0.17 0.09
    10 0.27 0.12 0.13 0.14 0.20 0.06 0.09 0.11 0.09
    11 0.23 0.28 0.22 0.33 0.19 0.34 0.38 0.25 0.24 0.32
    12 0.22 0.24 0.18 0.28 0.18 0.30 0.32 0.20 0.20 0.28 0.06
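One detail worth knowing: with `center = FALSE`, R's `scale()` divides each column by its root mean square, `sqrt(sum(x^2) / (n - 1))`, rather than by its standard deviation. The sketch below contrasts the two on the first three units of the "Measure" data. Passing the standard deviations explicitly is shown as one possible alternative, not as the method used on the slide.

```r
# First three units of the "Measure" data: chest, waist, hips
x <- rbind(c(34, 30, 32),
           c(37, 32, 37),
           c(38, 30, 36))

# Default behaviour with center = FALSE: divide by the root mean square
rms        <- sqrt(colSums(x^2) / (nrow(x) - 1))
scaled_rms <- scale(x, center = FALSE)

# Alternative: divide by the column standard deviations explicitly
sds       <- apply(x, 2, sd)
scaled_sd <- scale(x, center = FALSE, scale = sds)

d_rms <- dist(scaled_rms)
d_sd  <- dist(scaled_sd)
```

The two standardizations generally give different distances, so it is worth stating which one was used when reporting results.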

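Multivariate normal density function

Multivariate normal density function for two variables $x_1$ and $x_2$:

$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2\right]\right\}$$

- $\mu_1$ and $\mu_2$: population means of the two variables
- $\sigma_1^2$ and $\sigma_2^2$: population variances
- $\rho$: population correlation between the two variables $X_1$ and $X_2$

Linear combinations of the variables are themselves normally distributed.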
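The standard bivariate normal density can be written as a short R function. The function name `dbvnorm` and its argument names are made up for this sketch and are not part of base R. `s1` and `s2` are the standard deviations (square roots of the variances).

```r
# Bivariate normal density; 'dbvnorm' is a hypothetical helper, not a base R function
dbvnorm <- function(x1, x2, mu1 = 0, mu2 = 0, s1 = 1, s2 = 1, rho = 0) {
  z1 <- (x1 - mu1) / s1   # standardized first variable
  z2 <- (x2 - mu2) / s2   # standardized second variable
  exp(-(z1^2 - 2 * rho * z1 * z2 + z2^2) / (2 * (1 - rho^2))) /
    (2 * pi * s1 * s2 * sqrt(1 - rho^2))
}

# With rho = 0 the density factorizes into two univariate normal densities
f_joint   <- dbvnorm(0.3, -1.2, mu1 = 1, mu2 = 2, s1 = 1.5, s2 = 0.5, rho = 0)
f_product <- dnorm(0.3, mean = 1, sd = 1.5) * dnorm(-1.2, mean = 2, sd = 0.5)

# Height of the density at the mean for correlated standard normals
f_peak <- dbvnorm(0, 0, rho = 0.5)
```

The factorization at `rho = 0` is a convenient sanity check, since uncorrelated jointly normal variables are independent.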

Multivariate normal density function

Methods to test for a multivariate normal distribution:

- normal probability plots for each variable separately
- convert each multivariate observation to a single number before plotting: each q-dimensional observation $x_i$ is converted into a generalized distance $d_i^2$, which measures the distance of that observation from the mean vector $\bar{x}$ of the complete sample

$$d_i^2 = (x_i - \bar{x})^T S^{-1} (x_i - \bar{x})$$

where $S$ is the sample covariance matrix.

If the observations are from a multivariate normal distribution, these distances have approximately a chi-squared distribution with q degrees of freedom, denoted by the symbol $\chi_q^2$.
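The generalized distance is available in base R as `mahalanobis()`, which returns the squared distances $d_i^2$. The sketch below checks it against the explicit matrix formula on simulated data. The simulated matrix `x` is a stand-in assumed for illustration, not one of the data sets from the slides.

```r
set.seed(42)
n <- 200
q <- 3
x <- matrix(rnorm(n * q), ncol = q)   # simulated data, independent normal columns

xbar <- colMeans(x)   # sample mean vector
S    <- cov(x)        # sample covariance matrix

# Explicit formula, one observation (row) at a time
d2_manual <- apply(x, 1, function(r) t(r - xbar) %*% solve(S) %*% (r - xbar))

# Built-in equivalent; returns squared distances
d2_builtin <- mahalanobis(x, center = xbar, cov = S)
```

A useful identity for checking the computation is that the squared distances always sum to (n - 1) q when the sample mean and sample covariance are used.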

Air pollution data

Air pollution in 41 cities in the USA. R data "USairpollution".

Variables:

- SO2: SO2 content of air in micrograms per cubic meter
- temp: average annual temperature in degrees Fahrenheit
- manu: number of manufacturing enterprises employing 20 or more workers
- popul: population size (1970 census) in thousands
- wind: average annual wind speed in miles per hour
- precip: average annual precipitation in inches
- predays: average number of days with precipitation per year

Multivariate normal density function

Read the "USairpollution" data with the first column as row names:

    > USairpollution <- read.csv("E:/Multivariate_analysis/Data/USairpollution.csv", header = TRUE, row.names = 1)

Normal probability plots for the "manu" and "popul" variables in the "USairpollution" data:

    > qqnorm(USairpollution$manu, main = "manu")
    > qqline(USairpollution$manu)
    > qqnorm(USairpollution$popul, main = "popul")
    > qqline(USairpollution$popul)

Multivariate normal density function

Normal probability plots for each variable separately in the "USairpollution" data:

    layout(matrix(1:8, ncol = 2))
    sapply(colnames(USairpollution), function(x) {
      qqnorm(USairpollution[[x]], main = x)
      qqline(USairpollution[[x]])
    })

The plots for SO2 concentration and precipitation both deviate considerably from linearity, and the plots for manufacturing and population show evidence of a number of outliers.

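Multivariate normal density function

Chi-square plot:

    > x <- USairpollution
    > cm <- colMeans(x)                 # sample mean vector
    > S <- cov(x)                       # sample covariance matrix
    > # generalized distance of each city from the mean vector
    > d <- apply(x, 1, function(x) t(x - cm) %*% solve(S) %*% (x - cm))
    > # df = 6 is kept as given on the slide; x has 7 columns, so q = 7 in the definition of the distances
    > plot(qc <- qchisq((1:nrow(x) - 1/2) / nrow(x), df = 6),
    +      sd <- sort(d),
    +      xlab = expression(paste(chi[6]^2, " Quantile")),
    +      ylab = "Ordered distances", xlim = range(qc) * c(1, 1.1))
    > # label the three cities that deviate most from the reference line
    > oups <- which(rank(abs(qc - sd), ties = "random") > nrow(x) - 3)
    > text(qc[oups], sd[oups] - 1.5, names(oups))
    > abline(a = 0, b = 1)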

Chi-square plot

Plotting the ordered distances against the corresponding quantiles of the appropriate chi-square distribution should lead to a straight line through the origin. The chi-square plot is also useful for detecting outliers (e.g. Chicago, Phoenix, Providence).
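To see what the plot looks like when the normality assumption does hold, the same construction can be applied to simulated data. The sketch below uses independent standard normal columns as an assumed stand-in for multivariate normal data. The sample size, seed, and variable names are arbitrary choices for this illustration.

```r
set.seed(1)
n <- 500
q <- 3
x <- matrix(rnorm(n * q), ncol = q)   # simulated multivariate normal data

# Squared generalized distances and matching chi-square quantiles
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
qc <- qchisq((1:n - 1/2) / n, df = q)

# Chi-square plot: points should lie close to the line through the origin
plot(qc, sort(d2),
     xlab = "Chi-square quantile", ylab = "Ordered distances")
abline(a = 0, b = 1)

# Rough numeric summary of how straight the plot is
qq_cor <- cor(qc, sort(d2))
```

For data like these the points should follow the reference line closely, with some scatter expected in the upper tail.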
