# Exploratory Data Analysis and Multivariate Strategies

Andrew Mead (School of Life Sciences)



## Multi-… approaches in statistics

- Multiple comparison tests
- Multiple testing adjustments
  - Methods for adjusting the significance levels when doing a large number of tests (comparisons between treatments) within a single analysis
- Multiple regression analysis
  - Statistical model building with more than one explanatory variable
- Multi-factor analysis of variance
  - Analysis of designed experiments with more than one explanatory factor
- Multivariate analysis
  - Methods to summarise and explore the relationships among multiple response variables, and/or to assess differences among "treatments" based on multiple response variables

## Multivariate data

- Responses by a number of individuals to a range of different (related) questions in a (social studies) survey
- Counts of different species at a range of locations in an ecological survey
- Measurements of a range of traits of individual people, animals, plants, products, …
- Different medical measurements made on a group of patients
- Measurements of gene expression, protein expression or metabolite expression in biological samples
- Counts of sequence consensus matches from microbial samples
- Data on the "distances" between a number of objects

## Multivariate questions

- Identifying groups of objects with similar (and different) responses
  - e.g. UK landscape areas with similar percentage composition of different land covers (woodland, arable, urban, …)
- Identifying the particular measurements that contribute to the variability among a set of objects
  - e.g. which weed species are present under different long-term herbicide strategies?
- Identifying the particular measurements that contribute to differences between groups of objects
  - e.g. which genes discriminate between patients with and without some form of cancer?
- Identifying the particular measurements that explain variation in some over-arching response variable
  - e.g. which traits of predatory insects influence the predation rate on aphids?

## Exploratory data analysis

- Summary statistics / graphical summaries
  - Variability for each variable/measurement
  - Groups of observations: both pre-determined (to find potential differences) and to be identified (based on each individual variable)
- Scatter plots / correlations
  - Associations between pairs of variables

## Univariate analysis

For each individual variable:

- Hypothesis tests
  - Choice depends on the question to be answered
- Analysis of variance
  - For variables measured in designed experiments
- Regression analysis
  - To build statistical models describing how one response variable depends on one (or more) explanatory variables
- Generalised Linear Models (GLMs)
  - For data where standard assumptions do not hold!
- Time series analysis
- …

## Multivariate analysis

- For a set of "correlated" variables:
  - Assess relationships between variables
  - Consider the effects of "treatments" on these relationships
  - Consider how a "response" depends on these relationships
- Multivariate methods concerned with "data reduction":
  - Summarise the correlations between variables
  - Produce a smaller set of (uncorrelated) variables containing the important information
- For a set of "related" objects:
  - Identify groups of similar objects
  - Identify differences between groups of similar objects
  - And what makes the objects similar!

## Simple graphical summaries 1

- For compositional data
  - e.g. numbers of onion bulbs in different marketable size grades
- Present as a stacked bar-chart
  - For raw data
  - For percentages of the total

## Simple graphical summaries 2

- More general data
  - e.g. different measurements on a set of plants
- Scatter plots for each pair of variables
  - Present in a matrix
- Calculate the linear correlation coefficient for each pair of variables

## Two forms of data matrix

- The DATA matrix
  - p variables for each of n samples (observations)
  - Presented in a rectangular matrix: n rows and p columns
- The ASSOCIATION matrix
  - Distance, similarity or dissimilarity between every pair of variables or every pair of samples
  - Symmetric square matrix: n-by-n (between samples) or p-by-p (between variables)
  - Usually just show the lower triangle
- Turns multivariate data into univariate data?

## Analysing Association Data

- Start with associations
  - Distances between locations on a map
  - Psychometric (sensory) similarities between products
- Construct associations from the data
  - Depends on the type of data
  - Binary (presence/absence) data: simple matching coefficient; Jaccard coefficient; …
  - Continuous data: Euclidean distance; Manhattan distance; …
- Similarities or dissimilarities/distances
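The association measures named above are all available in SciPy's distance utilities. A minimal sketch, with invented toy data, showing how each is computed:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Continuous data: 3 samples x 2 variables (invented values)
X = np.array([[1.0, 2.0], [4.0, 6.0], [1.0, 3.0]])
d_euc = squareform(pdist(X, metric="euclidean"))  # straight-line distance
d_man = squareform(pdist(X, metric="cityblock"))  # Manhattan (city-block) distance

# Binary presence/absence data: the Jaccard coefficient ignores joint
# absences, unlike the simple matching coefficient
B = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]], dtype=bool)
d_jac = squareform(pdist(B, metric="jaccard"))
```

`squareform` converts the condensed vector of pairwise associations into the symmetric square ASSOCIATION matrix described above; only the lower triangle carries distinct information.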

## Finding groups of similar objects

- Hierarchical Cluster Analysis (HCA)
  - Aim: to arrange the objects into homogeneous groups
  - Output:
    - Dendrogram showing how objects are joined together
    - Levels of similarity/distance at which groups are formed or divided
  - Primarily a descriptive technique
    - Interpretation includes identifying "how many groups?"
- Agglomerative methods
  - Start with individual objects, group the two most similar together, re-calculate the similarity between the new group and the other objects, and continue until all objects are in one group
  - Different rules (algorithms) for re-calculating similarities result in different dendrograms
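A sketch of how agglomerative clustering is typically run in practice, using SciPy with invented two-dimensional data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Two obvious groups of points (invented data)
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])

# Single link (nearest neighbour) agglomeration on the pairwise distances;
# Z records which objects/groups were joined and at what distance, and
# scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram
Z = linkage(pdist(X), method="single")

# "How many groups?" - here we cut the tree into 2 groups
groups = fcluster(Z, t=2, criterion="maxclust")
```

Swapping `method="single"` for `"complete"`, `"average"`, etc. applies the different re-calculation rules mentioned above, and generally produces different dendrograms.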

## Simple example

- Relative intensity of the fluorescence spectrum at four different wavelengths
- Calculate distances using the Euclidean metric, standardised by the mean absolute deviation
- Illustrate HCA using the Single Link (Nearest Neighbour) algorithm
  - Distance to a new group is the minimum of the distances to the objects being grouped

| Compound | 300 nm | 350 nm | 400 nm | 450 nm |
|---|---|---|---|---|
| A | 16 | 62 | 67 | 27 |
| B | 15 | 60 | 69 | 31 |
| C | 14 | 59 | 68 | 31 |
| D | 15 | 61 | 71 | 31 |
| E | 14 | 60 | 70 | 30 |
| F | 14 | 59 | 69 | 30 |
| G | 17 | 63 | 68 | 29 |
| H | 16 | 62 | 69 | 28 |
| I | 15 | 60 | 72 | 30 |
| J | 17 | 63 | 69 | 27 |
| K | 18 | 62 | 68 | 28 |
| L | 18 | 64 | 67 | 29 |

## Step 1

Distance matrix (lower triangle):

|   | A | B | C | D | E | F | G | H | I | J | K | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0.000 | | | | | | | | | | | |
| B | 4.024 | 0.000 | | | | | | | | | | |
| C | 4.256 | 1.403 | 0.000 | | | | | | | | | |
| D | 4.967 | 1.955 | 3.179 | 0.000 | | | | | | | | |
| E | 4.218 | 1.452 | 2.112 | 1.615 | 0.000 | | | | | | | |
| F | 4.016 | 1.334 | 1.213 | 2.568 | 1.153 | 0.000 | | | | | | |
| G | 2.128 | 3.230 | 4.036 | 3.819 | 3.769 | 3.899 | 0.000 | | | | | |
| H | 1.991 | 2.897 | 3.693 | 3.197 | 2.818 | 3.099 | 1.615 | 0.000 | | | | |
| I | 5.400 | 2.849 | 3.882 | 1.403 | 1.991 | 2.935 | 4.580 | 3.559 | 0.000 | | | |
| J | 2.112 | 4.157 | 4.980 | 4.256 | 4.103 | 4.415 | 1.841 | 1.334 | 4.503 | 0.000 | | |
| K | 2.008 | 3.787 | 4.526 | 4.415 | 4.256 | 4.256 | 1.334 | 1.841 | 4.858 | 1.615 | 0.000 | |
| L | 2.667 | 4.429 | 5.108 | 5.108 | 5.131 | 5.163 | 1.403 | 2.918 | 5.928 | 2.650 | 1.861 | 0.000 |

- Identify the minimum distance: 1.153, between E and F
- Join these objects into a group
- Re-calculate all distances to this group
- And repeat!
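The distance matrix above can be reproduced directly from the compound data: standardise each wavelength by its mean absolute deviation, then compute all pairwise Euclidean distances. A NumPy sketch:

```python
import numpy as np

# Fluorescence intensities (rows: compounds A..L; columns: 300, 350, 400, 450 nm)
X = np.array([
    [16, 62, 67, 27], [15, 60, 69, 31], [14, 59, 68, 31], [15, 61, 71, 31],
    [14, 60, 70, 30], [14, 59, 69, 30], [17, 63, 68, 29], [16, 62, 69, 28],
    [15, 60, 72, 30], [17, 63, 69, 27], [18, 62, 68, 28], [18, 64, 67, 29],
], dtype=float)

# Standardise each variable by its mean absolute deviation
mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)
Z = X / mad

# Full matrix of pairwise Euclidean distances on the standardised data
D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))

# The smallest off-diagonal distance is 1.153, between E (index 4) and F (index 5)
i, j = np.unravel_index(np.argmin(D + np.eye(len(X)) * 1e9), D.shape)
```

Single-link clustering then joins E and F first, and the distance from the new group {E, F} to each other object is the minimum of the E and F rows.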

## Dendrograms

*(Slide shows dendrograms for the fluorescence example.)*

## Non-Hierarchical Clustering

- Aim: to divide units into a number of mutually exclusive groups
- Optimise some suitable criterion directly from the data matrix
  - Does not analyse the similarity matrix
- Criteria include:
  - Maximise the total Euclidean distance between groups
  - Minimise the determinant of the within-group variance-covariance matrix, pooled over groups
- Repeat for different numbers of groups
  - Usually start with a large number of groups, and gradually reduce the number
  - Grouping is not hierarchical, i.e. the best 3-group solution may not be the best 2-group solution with one group divided into two sub-groups
  - Need a rule to determine the "right" number of groups
- Also known as K-means clustering
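A minimal sketch of the K-means idea (Lloyd's algorithm: alternately assign each point to its nearest centre, then move each centre to the mean of its assigned points), written from scratch on invented data for illustration:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm for K-means clustering."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # random initial centres
    for _ in range(n_iter):
        # Assign each point to its nearest centre
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(axis=2), axis=1)
        # Move each centre to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres

# Two obvious groups (invented data); note the result is for this k only -
# the best 3-group solution would be found by a fresh run with k=3
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]])
labels, centres = kmeans(X, k=2)
```

Production code would typically use a library routine (e.g. `scipy.cluster.vq.kmeans2`) with multiple random restarts, since Lloyd's algorithm only finds a local optimum.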

## Analysing Association Data

- Multidimensional Scaling (MDS) and Principal Co-ordinate Analysis (PCO)
  - Analyse the same matrix of similarities or distances to produce a multidimensional picture of the relationships between units
  - Generate an "ordination" or configuration for a set of objects
  - Match the inter-point distances to the dissimilarities or distances
- PCO works with similarities
  - Produces an analytical solution (metric scaling)
  - Matches configuration distances to the observed dissimilarities based on the sum of squared differences
- MDS works with distances or dissimilarities
  - Produces an iterated solution (non-metric scaling)
  - Matches the configuration distances to the observed dissimilarities based on rank orders (monotonic regression)
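A sketch of the metric (classical) scaling calculation, starting from a distance matrix: double-centre the squared distances and take an eigendecomposition. For genuinely Euclidean distances the configuration is recovered exactly; the data points below are invented:

```python
import numpy as np

def pco(D, k=2):
    """Classical (metric) scaling: recover a k-dimensional configuration
    whose inter-point distances approximate the distance matrix D."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n           # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centred Gram matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1]               # largest eigenvalues first
    evals, evecs = evals[order[:k]], evecs[:, order[:k]]
    return evecs * np.sqrt(np.maximum(evals, 0))  # principal co-ordinates

# Build distances from a known 2-D configuration, then recover it
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(axis=2))
Y = pco(D, k=2)
D2 = np.sqrt(((Y[:, None] - Y[None]) ** 2).sum(axis=2))
```

Non-metric MDS instead iterates the configuration to match only the rank order of the dissimilarities (monotonic regression), which has no closed-form solution.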

## MDS output for fluorescence data

*(Slide shows the two-dimensional MDS configuration for compounds A–L.)*

## Exploring patterns

- Principal Component Analysis (PCA)
  - Aim: to identify the (combinations of) variables that explain the variability within a data set
  - Primarily a descriptive technique
  - Usually for quantitative variables
- Starts with the DATA matrix (p variables by n units)
- Transforms the original set of correlated variables into a new set of orthogonal (independent) variables
  - Linear combinations of the original variables
  - The first principal component accounts for as much of the variability in the data as possible
  - The second principal component accounts for as much of the remaining variability as possible, and is orthogonal to the first
  - etc.

## Matrix algebra

- PCA is best described in terms of matrix algebra
  - In common with almost all multivariate analysis methods
- PCA is an eigenvalue decomposition of the matrix of associations between the variables
- Produces two matrices:
  - A diagonal matrix containing the eigenvalues
  - A square (p-by-p) matrix whose columns are the eigenvectors
- Three possible matrices of associations can be used, each constructed from the original data matrix:
  - Sum of Squares and Products Matrix (SSPM)
  - Variance-Covariance Matrix
  - Correlation Matrix
- PCA applied to each gives different results
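A NumPy sketch of this, with invented data, showing two of the three association-matrix choices. The eigenvalues always sum to the trace of the association matrix, which is why they can be expressed as proportions of the total variation:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) * [1.0, 5.0, 0.5]  # variables on different scales

# Association matrix option 1: variance-covariance matrix
# (components dominated by the variables with the largest variances)
C = np.cov(X, rowvar=False)
evals_cov, evecs_cov = np.linalg.eigh(C)

# Association matrix option 2: correlation matrix
# (scale-free; each variable contributes equally to the total)
R = np.corrcoef(X, rowvar=False)
evals_corr, evecs_corr = np.linalg.eigh(R)
```

The two analyses give different eigenvalues and eigenvectors, illustrating why the choice of association matrix matters; the SSPM (the uncorrected sum of squares and products) is the third option.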

## PCA Output

- Roots (eigenvalues)
  - How much of the variation is explained by each component
  - Expressed as a percentage of the total
  - Indicates how many components are necessary
- Loadings (eigenvectors)
  - How each original variable contributes to each principal component
  - Shows which variables are important
- Scores
  - Values of each observation on each principal component

## Fluorescence example

- Relative intensity of the fluorescence spectrum at four different wavelengths

| Compound | 300 nm | 350 nm | 400 nm | 450 nm |
|---|---|---|---|---|
| A | 16 | 62 | 67 | 27 |
| B | 15 | 60 | 69 | 31 |
| C | 14 | 59 | 68 | 31 |
| D | 15 | 61 | 71 | 31 |
| E | 14 | 60 | 70 | 30 |
| F | 14 | 59 | 69 | 30 |
| G | 17 | 63 | 68 | 29 |
| H | 16 | 62 | 69 | 28 |
| I | 15 | 60 | 72 | 30 |
| J | 17 | 63 | 69 | 27 |
| K | 18 | 62 | 68 | 28 |
| L | 18 | 64 | 67 | 29 |

## PCA output

- Analysis based on the variance-covariance matrix
  - Variances of the original variables are similar (~25% of the total each)
- PC1 accounts for almost 73% of the total variability
- The first two PCs account for nearly 89%
- Obtain PC scores for each compound by multiplying the observed values by the coefficients (loadings)
- View groupings against the PCs

Variance-covariance matrix (lower triangle):

|   | 300 | 350 | 400 | 450 |
|---|---|---|---|---|
| 300 | 2.2046 | | | |
| 350 | 2.2500 | 2.7500 | | |
| 400 | -1.1136 | -1.1591 | 2.2652 | |
| 450 | -1.4773 | -1.7046 | 1.0227 | 2.2046 |

Eigenvalues:

|   | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| Eigenvalue | 6.8519 | 1.4863 | 0.8795 | 0.2066 |
| Proportion | 0.727 | 0.158 | 0.093 | 0.022 |
| Cumulative | 0.727 | 0.885 | 0.978 | 1.000 |

Loadings:

| Variable | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| 300 | 0.529 | -0.218 | -0.343 | 0.745 |
| 350 | 0.594 | -0.319 | -0.324 | -0.664 |
| 400 | -0.383 | -0.917 | 0.100 | 0.050 |
| 450 | -0.470 | 0.099 | -0.876 | -0.041 |
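The eigenvalue and proportion tables above can be reproduced from the compound data with a few lines of NumPy (a sketch; the eigenvector signs are arbitrary, so individual loadings may come out negated):

```python
import numpy as np

# Fluorescence data (compounds A..L; wavelengths 300, 350, 400, 450 nm)
X = np.array([
    [16, 62, 67, 27], [15, 60, 69, 31], [14, 59, 68, 31], [15, 61, 71, 31],
    [14, 60, 70, 30], [14, 59, 69, 30], [17, 63, 68, 29], [16, 62, 69, 28],
    [15, 60, 72, 30], [17, 63, 69, 27], [18, 62, 68, 28], [18, 64, 67, 29],
], dtype=float)

C = np.cov(X, rowvar=False)                 # variance-covariance matrix
evals, evecs = np.linalg.eigh(C)
evals, evecs = evals[::-1], evecs[:, ::-1]  # sort largest eigenvalue first

proportion = evals / evals.sum()            # PC1 explains ~72.7% of the total
scores = (X - X.mean(axis=0)) @ evecs       # PC scores for each compound
```

Plotting the first two columns of `scores` against each other gives the PCA plot on the next slide.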

## Example PCA plot

*(Slide shows the twelve compounds, A–L, plotted against the first two principal components.)*

## Biplots

- Graphical approach to presenting results from PCA (and a number of other multivariate methods)
- Plots objects as points
  - As in the example PCA plot
- Plots variables as vectors
- Supports interpretation of the analysis
  - Shows which objects are similar
  - Identifies variables that are highly correlated
  - Identifies variables that are particularly associated with groups of objects

## Correspondence Analysis

- Analogous to Principal Components Analysis
  - Appropriate for categorical variables rather than continuous variables
- Also known as "reciprocal averaging"
  - Finds an ordination (ordering) of each categorical variable that maximises the correlation between the two categorical variables
- Used in the analysis of ecological community data
  - e.g. counts of the numbers of different species in different environments: identifies the species associated with particular environments
- Extension to Canonical Correspondence Analysis
  - Incorporates the influence of one or more explanatory variables (such as environmental variables) in finding the ordination
  - Enables sites to be ranked along each environmental variable, taking account of correlations between species and environmental variables
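One standard formulation of simple correspondence analysis is a singular value decomposition of the standardised residuals of the contingency table. A sketch, with a hypothetical species-by-site table of counts:

```python
import numpy as np

def correspondence_analysis(N):
    """Simple CA of a two-way contingency table N via SVD of the
    matrix of standardised residuals."""
    P = N / N.sum()                                      # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)                  # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardised residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U / np.sqrt(r)[:, None] * sv                  # row principal co-ordinates
    cols = Vt.T / np.sqrt(c)[:, None] * sv               # column principal co-ordinates
    return rows, cols, sv

# Hypothetical species (rows) by environment (columns) counts
N = np.array([[20, 5, 1], [10, 15, 5], [2, 8, 30]], dtype=float)
rows, cols, sv = correspondence_analysis(N)
```

The first column of `rows` and `cols` gives the reciprocal-averaging ordination of species and environments; the total inertia (sum of squared singular values) equals the table's chi-square statistic divided by the grand total.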

## Factor Analysis

- Similar approach to PCA, predominantly used in the social sciences
- Principle: correlations between variables can be explained by a number of common factors, plus a number of specific factors (one for each original variable)
- Focused on explaining the covariance between variables
  - While PCA is focused on explaining the maximum amount of variance in the data
- Observed variables are assumed to be linear combinations of hypothetical underlying (and un-observable) factors
  - Creates an underlying causal model
- Original application was to the derivation of factors underlying intelligence
- Can be used in a hypothesis-testing mode (confirmatory factor analysis)

## Canonical Variate Analysis (CVA)

- Similar to PCA in working on the data matrix
  - Works on the within-group SSPM and between-group SSPM
- Finds combinations of the original variables that maximise the ratio of between-group variance to within-group variance
  - Groups are separated as much as possible whilst keeping each group as compact as possible
- Combinations can be used to discriminate between the groups
  - For g groups: at most g-1 combinations of variables to discriminate between them
  - Need at least g-1 original variables
- For a new observation, use "discriminant" functions to identify which group it is most likely to belong to
  - Discriminant Analysis
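A sketch of the CVA computation: form the pooled within-group and between-group sum-of-squares-and-products (SSP) matrices and take the eigendecomposition of W⁻¹B. The three groups and their data below are invented:

```python
import numpy as np

def cva(X, groups):
    """Canonical variates: eigen-analysis of W^{-1} B, where W and B are
    the pooled within-group and between-group SSP matrices."""
    grand = X.mean(axis=0)
    p = X.shape[1]
    W, B = np.zeros((p, p)), np.zeros((p, p))
    for g in np.unique(groups):
        Xg = X[groups == g]
        d = Xg - Xg.mean(axis=0)
        W += d.T @ d                                  # within-group SSP
        m = Xg.mean(axis=0) - grand
        B += len(Xg) * np.outer(m, m)                 # between-group SSP
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(evals.real)[::-1]              # largest ratio first
    return evals.real[order], evecs.real[:, order]

rng = np.random.default_rng(0)
# Three hypothetical groups of 20 observations on 4 variables
means = np.array([[0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]], dtype=float)
X = np.vstack([rng.normal(m, 1.0, size=(20, 4)) for m in means])
groups = np.repeat([0, 1, 2], 20)
evals, evecs = cva(X, groups)
```

With g = 3 groups there are at most g - 1 = 2 canonical variates, so the remaining eigenvalues come out (numerically) zero.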

## CVA Output

- Latent roots (eigenvalues)
  - How much variation is explained by each component, expressed as a percentage of the total
  - A root greater than 1 indicates that there is discrimination between groups on that canonical variate
  - Explicit test for dimensionality
- Latent vectors (loadings)
  - Contributions of each original variable to the new canonical variates
- Canonical variate means
  - Mean values for each group on the canonical variates
  - Adjustment terms so that the centroid of the group means is at the origin
- Produce plots showing each group mean with a 95% confidence interval
  - Construct confidence intervals for the "population"

## Example: Fisher "iris" data

- Measurements of sepal length, sepal width, petal length and petal width for 50 plants of each of three iris species
- The plot shows the separation between the three species
- Loadings (coefficients) indicate which variables are used to separate the groups

*(Slide shows the canonical variate plot for the iris data.)*

## Multivariate ANOVA and Regression

- Generalisation of univariate analysis of variance
  - Analyse multiple variables of data from a designed experiment
  - Assess the effects of different factors on the whole set of variables
  - Takes account of the covariance between variables
  - Interpretation based on matrices of variances and covariances
- Similar approach for multivariate regression
  - Relates a set of correlated response variables to one or more explanatory variables
  - PC regression
  - PLS regression

## Procrustes Rotation

- Provides a way of comparing two (or more) multidimensional configurations of a set of units
  - Procrustes: the Greek innkeeper who fitted his guests to one size of bed by chopping bits off or stretching them!
- Takes one configuration and fits the second configuration to it
  - Combinations of rotation, reflection and scaling of each axis
- Measures how much manipulation is needed to make the configurations similar
  - i.e. how similar it is possible to make them
- Generalises to more than two configurations
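SciPy provides this comparison directly as `scipy.spatial.procrustes`. A sketch, using an invented configuration and a rotated, scaled and shifted copy of it:

```python
import numpy as np
from scipy.spatial import procrustes

# An invented 2-D configuration of 10 units
rng = np.random.default_rng(2)
A = rng.normal(size=(10, 2))

# A second configuration: the same points rotated, scaled and translated
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = 2.5 * A @ R + [1.0, -3.0]

# Superimpose B onto A; the disparity is the residual sum of squared
# differences after the best translation, scaling and rotation/reflection
mtx1, mtx2, disparity = procrustes(A, B)
```

Because B differs from A only by a similarity transformation, the disparity here is essentially zero; for two genuinely different ordinations (e.g. from MDS and PCA of the same units), the disparity measures how similar it is possible to make them.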
