Multivariate Analysis

Presentation on theme: "Multivariate Analysis"— Presentation transcript:

1 Multivariate Analysis
Pattern Analysis
Finding patterns among objects on which two or more independent variables have been measured
Principal Coordinates Analysis (PCO)
Principal Components Analysis (PCA) (Flury 1988)
Cluster analysis (Everitt 1992)
These methods project multivariate phenotypic or genotypic measurements into lower-dimensional spaces so that the underlying patterns or structures can be described and visually displayed
The ‘genetic’ patterns among a set of entities (genetic materials) are difficult to discern from DNA fingerprints (raw multivariate data)
Patterns among the entities can be ‘extracted’ by PCA, PCO or cluster analyses of pairwise genetic distance matrices

3 Principal Components Analysis (PCA)
Neighbor-Joining Cladogram

4 Similarity and Dissimilarity (Genetic Distance) Measures
Applications include:
Assessment of genetic relationships
Prediction of heterosis
Heterotic group definition
Identification of duplicates in collections
Assessment of genetic diversity
Plant variety protection

5 Similarity and Dissimilarity (Genetic Distance) Measures
Choice of distance measure is affected by:
Properties of marker system
Genealogy of germplasm
Lines or populations
Objectives of study
Subsequent multivariate analysis

6 Genetic distance (Dissimilarity) measures based on allele frequency data
The first step is to build a matrix of pairwise measures of dissimilarity.
Multiple indices can be used to estimate dissimilarity.

7 Genetic distance measures based on allele frequency data
(Reif et al., Crop Science 45 (1), 1-7)

8 Genetic distance measures based on allele frequency data
Euclidean (dE) - No underlying genetic concept. Can be used with multivariate methods that require Euclidean distances
Rogers (1972) (dR) - Linearly related to the coefficient of coancestry
Modified Rogers’ (dW) - dW² is linearly related to panmictic-midparent heterosis
Cavalli-Sforza and Edwards (1967) (dCE) - Based on Kimura’s (1954) model of selective drift

9 Genetic distance measures based on allele frequency data
Reynolds et al. (1983) (dRE) - Based on a model where mutation and selection can be neglected and drift is the major evolutionary force
Nei (1972) (dN72) - Based on the infinite-allele model (Kimura and Crow, 1964)
Nei et al. (1983) (dN83) - For homozygous inbred lines, dN83 = dR and, hence, dN83 is also linearly related to the coancestry coefficient
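The formulas behind these measures did not survive in the transcript. The sketch below uses the commonly cited forms of four of them (dE, dR, dW and dN83) and should be read as an assumption, not as the slide's own notation. The input layout is hypothetical: each entity is a list of loci, and each locus is a list of allele frequencies that sum to 1.

```python
import math

def _squared_diffs(p, q):
    """Per-locus sum of squared allele-frequency differences."""
    return [sum((pa - qa) ** 2 for pa, qa in zip(lp, lq))
            for lp, lq in zip(p, q)]

def euclidean(p, q):
    # dE: square root of the squared differences summed over all loci
    return math.sqrt(sum(_squared_diffs(p, q)))

def rogers(p, q):
    # dR: per-locus distance sqrt(0.5 * sum of squares), averaged over m loci
    m = len(p)
    return sum(math.sqrt(0.5 * s) for s in _squared_diffs(p, q)) / m

def modified_rogers(p, q):
    # dW: Euclidean distance scaled by 1 / sqrt(2m)
    m = len(p)
    return math.sqrt(sum(_squared_diffs(p, q)) / (2 * m))

def nei_1983(p, q):
    # dN83: one minus the per-locus sum of sqrt(p * q), averaged over m loci
    m = len(p)
    return sum(1 - sum(math.sqrt(pa * qa) for pa, qa in zip(lp, lq))
               for lp, lq in zip(p, q)) / m

# Hypothetical homozygous inbred lines: 4 loci, 2 alleles each,
# differing at the second and fourth locus.
line_a = [[1, 0], [1, 0], [0, 1], [1, 0]]
line_b = [[1, 0], [0, 1], [0, 1], [0, 1]]

print(euclidean(line_a, line_b))
print(rogers(line_a, line_b), nei_1983(line_a, line_b))
print(modified_rogers(line_a, line_b))
```

For these two homozygous lines, dR and dN83 both reduce to the fraction of loci at which the lines differ, which is the dN83 = dR relationship stated above.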

10 Similarity Measures for Binary Data
Entity i      Entity j      Count     Condition
Present (1)   Present (1)   a (vij)   Positive match
Present (1)   Absent (0)    b (wij)   Mismatch
Absent (0)    Present (1)   c (xij)   Mismatch
Absent (0)    Absent (0)    d (yij)   Negative match
Coefficients: Simple matching, Jaccard (1908), Dice (1945)
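The three coefficients can be computed directly from the counts a, b, c and d. The sketch below assumes the standard definitions: simple matching = (a + d) / (a + b + c + d), Jaccard = a / (a + b + c), Dice = 2a / (2a + b + c). The two marker profiles are hypothetical, and returning 0.0 when a denominator is zero is a choice made here for illustration.

```python
def match_counts(x, y):
    """Count positive matches (a), mismatches (b, c) and negative matches (d)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def simple_matching(x, y):
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):
    a, b, c, _ = match_counts(x, y)
    # Negative matches are ignored; 0.0 is returned if no band is present at all
    return a / (a + b + c) if (a + b + c) else 0.0

def dice(x, y):
    a, b, c, _ = match_counts(x, y)
    # Positive matches are weighted twice
    return 2 * a / (2 * a + b + c) if (2 * a + b + c) else 0.0

# Hypothetical binary profiles at 10 markers: a = 2, b = 2, c = 1, d = 5
x = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(simple_matching(x, y), jaccard(x, y), round(dice(x, y), 2))
```

Because Jaccard and Dice discard negative matches, they can rank pairs differently from simple matching when shared absences are common.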

11 Shared allele distance
S = number of shared alleles
u = number of loci
(Bowcock et al. 1994)
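The formula itself is missing from the transcript. A minimal sketch, assuming diploid genotypes and the form usually attributed to Bowcock et al. (1994), D = 1 - S / (2u), where S is summed over loci. The genotype layout (one pair of alleles per locus) and the function names are hypothetical.

```python
from collections import Counter

def shared_alleles(g1, g2):
    """Total number of alleles shared between two diploid genotypes (S)."""
    total = 0
    for locus1, locus2 in zip(g1, g2):
        # Multiset intersection: a homozygote AA shares only one allele with AB
        total += sum((Counter(locus1) & Counter(locus2)).values())
    return total

def shared_allele_distance(g1, g2):
    u = len(g1)  # number of loci
    return 1 - shared_alleles(g1, g2) / (2 * u)

# Hypothetical genotypes at three loci
ind1 = [("A", "A"), ("B", "C"), ("D", "E")]
ind2 = [("A", "B"), ("B", "C"), ("F", "F")]

print(shared_alleles(ind1, ind2))          # 1 + 2 + 0 shared alleles
print(shared_allele_distance(ind1, ind2))
```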

12 Similarity Measures for Binary Data
Example: Individuals 1 to 3 scored at Marker 1 to Marker 10 (table of marker scores)
Similarity: Simple matching, Shared alleles, Jaccard’s, Dice’s, each with its rank
s12 0.70 2 0.40 1 0.57
s13 0.50 3 0.00
s23 0.80 0.33

13 PRINCIPAL COORDINATES ANALYSIS (PCO or PCoA)
Distance between Oregon towns (miles)
Genetic distance between barley varieties (Nei et al., 1983 index)

14 Principal Coordinates Analysis is a method to visualize similarities or dissimilarities of data.
It starts with a distance (dissimilarity) matrix and assigns each item a location in a two- or three-dimensional space.
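One common way to obtain those locations is classical scaling: square the distances, double-center the matrix, and use its leading eigenvectors scaled by the square roots of their eigenvalues. The sketch below is a dependency-free illustration under that assumption. It uses power iteration with deflation in place of a full eigensolver, and it assumes the leading eigenvalues are positive, which holds for Euclidean-like distances. The four points are hypothetical.

```python
import math
import random

def double_center(dist):
    """Centered matrix B = -1/2 * J * D^2 * J from a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def leading_eigenpair(mat, iters=500, seed=0):
    """Power iteration: dominant eigenvalue and unit eigenvector."""
    n = len(mat)
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # matrix is numerically zero: no axis left
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    mv = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, mv)), v

def pcoa(dist, n_axes=2):
    """Return one coordinate list per item, with n_axes coordinates each."""
    b = double_center(dist)
    n = len(b)
    axes = []
    for _ in range(n_axes):
        lam, v = leading_eigenpair(b)
        axes.append([math.sqrt(max(lam, 0.0)) * x for x in v])
        # Deflate: remove the axis just found before looking for the next one
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return [[axis[i] for axis in axes] for i in range(n)]

# Hypothetical items: corners of a 4 x 2 rectangle, known only by distances
points = [(0, 0), (4, 0), (4, 2), (0, 2)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = pcoa(dist)
for c in coords:
    print([round(x, 3) for x in c])
```

The recovered coordinates may be rotated or reflected relative to the original points; only the distances between items are preserved.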

15 PRINCIPAL COMPONENTS ANALYSIS (PCA)
Transforms a number of possibly correlated variables (in this case allelic states) into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible

16 The goal of PCA is to reduce the dimensionality of the data while retaining as much as possible of the variance of the observed variables:
Reduces the number of observed variables to a smaller number of principal components that account for most of the variance.
Observed variables are standardized (mean = 0, standard deviation = 1), so the total variance in PCA equals the number of observed variables being analyzed.
The first principal component accounts for the largest share of the variance in the data.
The second component accounts for the second-largest share of the variance and is uncorrelated with the first, and so on.
Components accounting for the most variance are retained; components accounting for a trivial amount of variance are dropped.
Eigenvalues indicate the amount of variance explained by each component.
Eigenvectors are the weights used to calculate component scores.
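These properties can be checked by hand in the smallest case. With two standardized variables the correlation matrix is [[1, r], [r, 1]], whose eigenvalues are 1 + r and 1 - r, with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2). The sketch below covers only this two-variable special case, uses the population standard deviation (divisor n), and runs on hypothetical data.

```python
import math

def standardize(xs):
    """Rescale to mean 0 and standard deviation 1 (divisor n)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]

def pca_two_variables(x, y):
    """Eigenvalues and component scores for two standardized variables."""
    zx, zy = standardize(x), standardize(y)
    n = len(zx)
    r = sum(a * b for a, b in zip(zx, zy)) / n  # correlation coefficient
    w = 1 / math.sqrt(2)
    # (eigenvalue, eigenvector) pairs of [[1, r], [r, 1]], largest first
    components = sorted([(1 + r, (w, w)), (1 - r, (w, -w))],
                        key=lambda c: c[0], reverse=True)
    eigenvalues = [lam for lam, _ in components]
    # Scores: standardized data weighted by each eigenvector
    scores = [[vx * a + vy * b for a, b in zip(zx, zy)]
              for _, (vx, vy) in components]
    return eigenvalues, scores

# Hypothetical measurements of two correlated traits
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
eigenvalues, scores = pca_two_variables(x, y)

print(eigenvalues)                         # sums to 2, the number of variables
print([lam / 2 for lam in eigenvalues])    # proportion of variance explained
```

Each eigenvalue equals the variance of its component's scores, and the two score vectors are uncorrelated.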

17 Distance-based methods (starting from a distance matrix)
Cluster Analysis: Individuals with similar descriptions are mathematically gathered into a cluster.
Distance-based methods (starting from a distance matrix):
UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
Neighbor-Joining
Model-based methods
Figure: Neighbor-Joining cladogram
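As an illustration of a distance-based method, UPGMA repeatedly merges the two closest clusters and sets the distance from the merged cluster to every other cluster as a size-weighted average of the two old distances. The sketch below is a minimal version of that idea. The function name, the nested-tuple labels and the distance matrix are hypothetical.

```python
def upgma(names, dist):
    """Return the merge history [(left, right, distance), ...]."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (label, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Closest pair of current clusters
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        label_a, size_a = clusters.pop(a)
        label_b, size_b = clusters.pop(b)
        del d[(a, b)]
        # Distance from the merged cluster to each remaining cluster:
        # average of the two old distances, weighted by cluster size
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((label_a, label_b), size_a + size_b)
        merges.append((label_a, label_b, height))
        next_id += 1
    return merges

# Hypothetical distance matrix for four entities
names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 8, 10],
    [6, 8, 0, 12],
    [10, 10, 12, 0],
]
merges = upgma(names, dist)
for left, right, height in merges:
    print(left, right, height)
```

Here A and B join first at distance 2, C joins that pair at (6 + 8) / 2 = 7, and D joins last at (12 + 2 × 10) / 3.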

18 POPULATION STRUCTURE
Table: genotypes of Individuals 1 to 20 at Markers 1 to 10
Hypothesis 1: There is one population that has intermediate frequencies at all loci and all individuals are from that population
Hypothesis 2: There are two populations, blue and pink, with high allele frequency at some loci and low allele frequency at other loci

19 POPULATION STRUCTURE It is important to estimate:
How many subpopulations there are
To which subpopulation each individual belongs (%)

