Multivariate Analysis Pattern Analysis Finding patterns among objects on which two or more independent variables have been measured – Principal Coordinates.


1 Multivariate Analysis
Pattern Analysis
Finding patterns among objects on which two or more independent variables have been measured:
- Principal Coordinates Analysis (PCO)
- Principal Components Analysis (PCA) (Flury 1988)
- Cluster analysis (Everitt 1992)
These methods project multivariate phenotypic or genotypic measurements into lower-dimensional spaces so that the underlying patterns or structures can be described and visually displayed.
The ‘genetic’ patterns among a set of entities (genetic materials) are difficult to discern from DNA fingerprints (raw multivariate data).
Patterns among the entities can be ‘extracted’ by PCA, PCO or cluster analyses of pairwise genetic distance matrices.

2

3 [Figures: Principal Components Analysis (PCA); Neighbor-Joining Cladogram]

4 Similarity and Dissimilarity (Genetic Distance) Measures
Applications include:
- Assessment of genetic relationships
- Prediction of heterosis
- Heterotic group definition
- Identification of duplicates in collections
- Assessment of genetic diversity
- Plant variety protection

5 Similarity and Dissimilarity (Genetic Distance) Measures
Choice of distance measure is affected by:
- Properties of marker system
- Genealogy of germplasm
- Lines or populations
- Objectives of study
- Subsequent multivariate analysis

6 Genetic distance (Dissimilarity) measures based on allele frequency data
The first step is to build a matrix of pairwise measures of dissimilarity.
Multiple indexes can be used to estimate dissimilarity.

7 Genetic distance measures based on allele frequency data (Reif et al. 2005. Crop Science 45 (1), 1-7)

8 Genetic distance measures based on allele frequency data
- Euclidean (d_E): no underlying genetic concept. Can be used with multivariate methods that require Euclidean distances.
- Rogers (1972) (d_R): linearly related to the coefficient of coancestry.
- Modified Rogers' (d_W): d_W^2 is linearly related to panmictic-midparent heterosis.
- Cavalli-Sforza and Edwards (1967) (d_CE): based on Kimura's (1954) model of selective drift.

9 Genetic distance measures based on allele frequency data
- Reynolds et al. (1983) (d_RE): based on a model where mutation and selection can be neglected and drift is the major evolutionary force.
- Nei (1972) (d_N72): based on the infinite-allele model (Kimura and Crow, 1964).
- Nei et al. (1983) (d_N83): for homozygous inbred lines, d_N83 = d_R and, hence, d_N83 is also linearly related to the coancestry coefficient.
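The formulas for these distances did not survive in the transcript, so the following is only a sketch that assumes the forms commonly given for them (for example in Reif et al. 2005): d_E as the root of the summed squared frequency differences, d_R as the per-locus average of sqrt(0.5 * sum of squared differences), d_W = d_E / sqrt(2m), and d_N83 as the per-locus average of 1 - sum(sqrt(p * q)), with m the number of loci. The two inbred lines are hypothetical. The example illustrates the statement that d_N83 = d_R for homozygous inbred lines.

```python
import math

# Allele frequencies are stored as one list per locus, e.g. [[p1, p2], [p1, p2], ...].
# The formulas below are assumed forms, not taken from the slides themselves.

def euclidean(p, q):
    """d_E: square root of the summed squared frequency differences."""
    return math.sqrt(sum((pj - qj) ** 2
                         for pl, ql in zip(p, q)
                         for pj, qj in zip(pl, ql)))

def rogers(p, q):
    """d_R: average over loci of sqrt(0.5 * sum of squared differences)."""
    m = len(p)
    return sum(math.sqrt(0.5 * sum((pj - qj) ** 2 for pj, qj in zip(pl, ql)))
               for pl, ql in zip(p, q)) / m

def modified_rogers(p, q):
    """d_W: Euclidean distance scaled by sqrt(2m)."""
    m = len(p)
    return euclidean(p, q) / math.sqrt(2 * m)

def nei83(p, q):
    """d_N83: average over loci of 1 - sum(sqrt(p * q))."""
    m = len(p)
    return sum(1 - sum(math.sqrt(pj * qj) for pj, qj in zip(pl, ql))
               for pl, ql in zip(p, q)) / m

# Two hypothetical homozygous inbred lines scored at 4 biallelic loci.
line_a = [[1, 0], [1, 0], [0, 1], [1, 0]]
line_b = [[1, 0], [0, 1], [0, 1], [0, 1]]

d_r = rogers(line_a, line_b)
d_n83 = nei83(line_a, line_b)
d_w = modified_rogers(line_a, line_b)
print(d_r, d_n83, d_w ** 2)
```

The two lines differ at two of four loci, so both d_R and d_N83 reduce to the proportion of loci at which the lines carry different alleles.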

10 Similarity Measures for Binary Data

Entity i      Entity j      Count      Condition
Present (1)   Present (1)   a (v_ij)   Positive match
Present (1)   Absent (0)    b (w_ij)   Mismatch
Absent (0)    Present (1)   c (x_ij)   Mismatch
Absent (0)    Absent (0)    d (y_ij)   Negative match

Coefficients: Simple matching; Jaccard (1908); Dice (1945)

11 Shared allele distance (Bowcock et al. 1994)
S = number of shared alleles
u = number of loci

12 Similarity Measures for Binary Data

Individual   Marker: 1  2  3  4  5  6  7  8  9  10
1                    1  0  0  0  1  1  0  0  1  0
2                    0  0  0  0  1  0  0  1  1  0
3                    0  0  0  0  0  0  0  1  0  0

Similarity   Simple matching (rank)   Shared alleles (rank)   Jaccard's (rank)   Dice's (rank)
s12          0.70 (2)                 [value missing] (2)     0.40 (1)           0.57 (1)
s13          0.50 (3)                 [value missing] (3)     0.00 (3)           [value missing] (3)
s23          0.80 (1)                 [value missing] (1)     0.33 (2)           0.50 (2)
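The similarity values on this slide can be checked with a short sketch. It assumes the standard definitions of the three coefficients in terms of the counts a, b, c and d from the previous slide: simple matching = (a + d) / (a + b + c + d), Jaccard = a / (a + b + c), Dice = 2a / (2a + b + c). The function names are illustrative, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
from itertools import combinations

# Marker scores copied from the slide (1 = present, 0 = absent).
profiles = {
    1: [1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
    2: [0, 0, 0, 0, 1, 0, 0, 1, 1, 0],
    3: [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
}

def match_counts(x, y):
    """Return (a, b, c, d): positive matches, two kinds of mismatch, negative matches."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d

def simple_matching(x, y):
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):
    a, b, c, _ = match_counts(x, y)
    return a / (a + b + c) if (a + b + c) else 0.0  # convention for all-absent pairs

def dice(x, y):
    a, b, c, _ = match_counts(x, y)
    return 2 * a / (2 * a + b + c) if (2 * a + b + c) else 0.0

for i, j in combinations(sorted(profiles), 2):
    x, y = profiles[i], profiles[j]
    print(f"s{i}{j}: SM={simple_matching(x, y):.2f} "
          f"J={jaccard(x, y):.2f} D={dice(x, y):.2f}")
```

Simple matching counts shared absences (d) as agreement, while Jaccard and Dice ignore them, which is why the rankings of the three pairs differ between coefficients.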

13 PRINCIPAL COORDINATES ANALYSIS (PCO or PCoA)
[Tables: Distance between Oregon towns (miles); Genetic distance between barley varieties (Nei et al., 1983 index)]

14 Principal Coordinates Analysis is a method to visualize similarities or dissimilarities among items. It starts with a distance (dissimilarity) matrix and assigns each item a location in a 2- or 3-dimensional space.
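One common way to carry this out is classical PCoA: square the distances, double-centre them, and take the leading eigenvectors scaled by the square roots of their eigenvalues. The sketch below assumes NumPy is available; the function name `pcoa` and the four example points are hypothetical and are not the Oregon or barley data from the slides.

```python
import numpy as np

def pcoa(dist, n_axes=2):
    """Classical principal coordinates analysis of a square distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances.
    centring = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centring @ (d ** 2) @ centring
    # Eigen-decomposition of the symmetric matrix, largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Coordinates: eigenvectors scaled by sqrt(eigenvalue); negatives clipped to 0.
    scale = np.sqrt(np.clip(eigvals[:n_axes], 0.0, None))
    return eigvecs[:, :n_axes] * scale, eigvals

# Hypothetical items whose true positions lie in a plane.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords, eigvals = pcoa(dist)
# Distances between the recovered coordinates.
rebuilt = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.round(coords, 3))
```

Because the example distances come from points in a plane, two axes reproduce them exactly (up to rotation and reflection). With genetic distances the fit in two or three dimensions is usually only approximate.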

15 PRINCIPAL COMPONENTS ANALYSIS (PCA)
Transforms a number of possibly correlated variables (in this case allelic states) into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.

16 The goal of PCA is to reduce the dimensionality of the data while retaining as much as possible of the variance of the observed variables:
- Reduces the number of observed variables to a smaller number of principal components which account for most of the variance.
- The total amount of variance in PCA is equal to the number of observed variables being analyzed, because the observed variables are standardized (mean = 0, standard deviation = 1).
- The first principal component accounts for the largest share of the variance in the data. The second component accounts for the second largest share and is uncorrelated with the first, and so on.
- Components accounting for maximal variance are retained, while components accounting for a trivial amount of variance are not retained.
- Eigenvalues indicate the amount of variance explained by each component.
- Eigenvectors are the weights used to calculate component scores.
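These properties can be illustrated with a minimal sketch, assuming NumPy and a small hypothetical data matrix (6 objects by 3 variables). It standardizes the variables, takes the eigen-decomposition of their correlation matrix, and computes component scores; the function name `pca_correlation` is illustrative.

```python
import numpy as np

def pca_correlation(data):
    """PCA on standardized variables (eigen-decomposition of the correlation matrix)."""
    x = np.asarray(data, dtype=float)
    # Standardize each variable to mean 0 and standard deviation 1.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    # Order components from largest to smallest eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Component scores: standardized data weighted by the eigenvectors.
    scores = z @ eigvecs
    return eigvals, eigvecs, scores

# Hypothetical measurements: rows are objects, columns are variables.
data = [
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 0.5],
    [4.0, 7.9, 1.5],
    [5.0, 10.1, 0.7],
    [6.0, 12.0, 1.2],
    [7.0, 13.8, 0.9],
]

eigvals, eigvecs, scores = pca_correlation(data)
explained = eigvals / eigvals.sum()  # proportion of variance per component
print(np.round(eigvals, 3), np.round(explained, 3))
```

The eigenvalues sum to the number of variables (3 here), and the covariance matrix of the scores is diagonal with the eigenvalues on the diagonal, so the components are uncorrelated.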

17 Cluster Analysis: individuals with similar descriptions are mathematically gathered into a cluster.
- Distance-based methods (starting from a distance matrix)
  - UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
  - Neighbor-Joining
- Model-based methods
[Figure: Neighbor-Joining Cladogram]
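As an illustration of a distance-based method, here is a minimal UPGMA sketch in plain Python. It repeatedly merges the two clusters with the smallest average pairwise distance between their members. The distances used are 1 minus the simple matching similarities from slide 12 (0.70, 0.50, 0.80); the function name and the nested-parentheses output are illustrative choices.

```python
def upgma(labels, dist):
    """Average-linkage (UPGMA) clustering; returns the merge list and a nested tree string."""
    clusters = {i: [i] for i in range(len(labels))}   # cluster id -> member indices
    names = {i: labels[i] for i in range(len(labels))}
    merges = []
    next_id = len(labels)

    def average_distance(c1, c2):
        # Arithmetic mean of all member-to-member distances between two clusters.
        pairs = [(i, j) for i in clusters[c1] for j in clusters[c2]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        ids = sorted(clusters)
        # Find the closest pair of clusters.
        d_ab, a, b = min((average_distance(a, b), a, b)
                         for k, a in enumerate(ids) for b in ids[k + 1:])
        merged_name = "(" + names[a] + "," + names[b] + ")"
        merges.append((names[a], names[b], d_ab))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        del names[a], names[b]
        names[next_id] = merged_name
        next_id += 1

    return merges, next(iter(names.values()))

labels = ["Ind1", "Ind2", "Ind3"]
# Dissimilarity = 1 - simple matching similarity (values from slide 12).
dist = [
    [0.00, 0.30, 0.50],
    [0.30, 0.00, 0.20],
    [0.50, 0.20, 0.00],
]

merges, tree = upgma(labels, dist)
print(tree)
for left, right, d in merges:
    print(f"merge {left} + {right} at distance {d:.2f}")
```

Individuals 2 and 3 join first (distance 0.20), and individual 1 then joins that pair at the average of 0.30 and 0.50. Neighbor-Joining uses a different merging criterion and is not shown here.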

18 POPULATION STRUCTURE

Individual   Marker: 1  2  3  4  5  6  7  8  9  10
1                    1  1  1  1  1  0  0  0  0  0
2                    1  1  1  0  0  1  0  0  1  0
3                    0  1  1  1  1  1  0  0  0  0
4                    1  1  1  1  0  1  0  1  0  0
5                    1  1  1  1  1  1  0  0  0  0
6                    1  1  1  1  1  1  0  0  0  0
7                    1  1  1  1  1  0  0  0  0  0
8                    1  1  0  1  1  1  0  0  0  0
9                    0  1  1  1  1  1  1  0  0  0
10                   1  1  0  1  1  1  0  0  1  0
11                   1  1  1  1  1  1  0  0  0  0
12                   1  0  1  0  1  1  0  0  0  0
13                   0  0  0  0  0  0  1  1  1  1
14                   0  0  0  0  0  1  1  1  1  0
15                   0  1  0  0  0  0  1  1  1  1
16                   0  0  0  0  0  0  0  1  1  1
17                   1  0  0  0  1  0  1  1  1  1
18                   0  0  0  0  0  0  1  1  1  1
19                   0  0  1  0  0  0  1  1  0  0
20                   0  0  0  0  0  0  1  1  1  1

Hypothesis 1: There is one population that has intermediate frequencies at all loci, and all individuals are from that population.
Hypothesis 2: There are two populations, blue and pink, with high allele frequency at some loci and low allele frequency at other loci.
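The contrast between the two hypotheses can be made concrete by comparing marker frequencies in the pooled sample with frequencies inside two groups. The blue and pink colouring is not recoverable from the transcript, so the split used below (individuals 1 to 12 versus 13 to 20) is an assumption made only for illustration.

```python
# Marker scores for individuals 1-20, copied from the slide (one string per individual).
rows = """1111100000 1110010010 0111110000 1111010100 1111110000
1111110000 1111100000 1101110000 0111111000 1101110010
1111110000 1010110000 0000001111 0000011110 0100001111
0000000111 1000101111 0000001111 0010001100 0000001111""".split()

genotypes = [[int(ch) for ch in row] for row in rows]

def marker_frequencies(individuals):
    """Frequency of the '1' state at each marker among the given individuals."""
    n = len(individuals)
    return [sum(ind[m] for ind in individuals) / n
            for m in range(len(individuals[0]))]

# Hypothesis 1: a single population, so pool everyone.
freq_all = marker_frequencies(genotypes)

# Hypothesis 2: two populations. ASSUMED split: individuals 1-12 and 13-20.
freq_a = marker_frequencies(genotypes[:12])
freq_b = marker_frequencies(genotypes[12:])

print("Marker  pooled  group A  group B")
for m, (p, a, b) in enumerate(zip(freq_all, freq_a, freq_b), start=1):
    print(f"{m:>6}  {p:6.2f}  {a:7.2f}  {b:7.2f}")
```

Under this assumed split, frequencies that look intermediate in the pooled sample become high in one group and low in the other, which is the pattern Hypothesis 2 describes.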

19 POPULATION STRUCTURE
It is important to estimate:
- How many subpopulations there are
- To which subpopulation each individual belongs (%)

