Multivariate Tests Based on Pairwise Distance or Similarity Measures Siegfried Kropf Institute for Biometry and Medical Informatics Otto von Guericke University.


1 Multivariate Tests Based on Pairwise Distance or Similarity Measures
Siegfried Kropf, Institute for Biometry and Medical Informatics, Otto von Guericke University Magdeburg, Germany
The 8th Tartu Conference on Multivariate Statistics / The 6th Conference on Multivariate Distributions with Fixed Marginals, Tartu, Estonia, 26-29 June 2007

2 Contents
- Introduction
- Example data - microbial fingerprints
- “Usual” way as multivariate test based on spherically distributed scores (PC scores)
- Test based on pairwise similarity measures
- Comparison of results in example data
- Application to other data
- Extensions of permutation test
- Parametric “rotation” test for small n
- Simulation studies on robustness
- Summary

3 Introduction
Consider global multivariate comparisons between two or more populations of high-dimensional data:
- Gene expression data (all genes, known groups of genes)
- Neuroimaging
- Genetic fingerprints (e.g. microbial DNA in soil samples)
- …
Formal description: independent sample vectors
$x_{kj} \sim N_p(\mu_k, \Sigma)$, $k = 1, 2, \dots, K$; $j = 1, \dots, n_k$; $p \gg n = n_1 + \dots + n_K$,
or, more generally,
$x_{kj} \sim F_k(x)$, $k = 1, 2, \dots, K$; $j = 1, \dots, n_k$; $p \gg n$.
Wanted: a test for $H_0: \mu_1 = \dots = \mu_K$ or $F_1(x) = \dots = F_K(x) \;\forall x$.

4 Example data - microbial fingerprints
Question: What is the impact of different (natural or genetically modified) plant cultures on the soil microbial population?
- Extraction of bacterial samples; DNA fragments amplified by PCR.
- Several samples investigated together in electrophoresis gels (e.g. denaturing gradient gel electrophoresis, DGGE).
- Gels scanned and analyzed with GelCompar, giving a vector of hundreds or thousands of greyscale values per lane.

5 [Figure: gel image, lanes from left to right M1, B1, B2, B3, B4, S1, S2, S3, M2, P1, P2, P3, P4, R1, R2, R3, R4, M3]
Denaturing gradient gel with fingerprints of bacterial communities from rhizosphere soil (lanes S1 to S3: strawberry, P1 to P4: potato, R1 to R4: oilseed rape) and unplanted soil (lanes B1 to B4). Lanes M1 to M3: standard bacterial mix.

6 “Usual” way as multivariate test based on spherically distributed scores (PC scores)
Exact test based on spherically distributed scores: PC$_q$ test (Läuter et al., 1998)
1. Transformation of the raw data vectors into q-dimensional score vectors: $x_{kj} \mapsto z_{kj} = D^\top x_{kj}$ ($k = 1, \dots, K$; $j = 1, \dots, n_k$), with a $(p \times q)$ matrix $D$ ($q \ll p$) obtained from the eigenvalue problem (EVP) or, better, from the dual EVP.
2. Multivariate test (here Wilks' $\Lambda$) with the q-dimensional scores $z_{kj}$.
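The two steps can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `pc_scores` and `wilks_lambda` are hypothetical, the data are simulated, the scores are taken from the total (overall-mean-centred) sums-of-products matrix, and the conversion of Wilks' Λ into a p-value is omitted.

```python
import numpy as np

def pc_scores(X, q):
    """First q principal-component scores via the dual (n x n) eigenvalue problem."""
    Xc = X - X.mean(axis=0)                  # centre with the overall mean, ignoring groups
    G = Xc @ Xc.T                            # n x n matrix instead of the p x p one
    eigval, eigvec = np.linalg.eigh(G)       # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:q]     # indices of the q largest eigenvalues
    # Equals Xc @ D (up to column signs), D holding the leading eigenvectors of Xc' Xc.
    return eigvec[:, order] * np.sqrt(eigval[order])

def wilks_lambda(Z, labels):
    """Wilks' Lambda = det(within-group SSP) / det(total SSP) of the scores Z."""
    labels = np.asarray(labels)
    Zc = Z - Z.mean(axis=0)
    total = Zc.T @ Zc
    within = np.zeros_like(total)
    for g in np.unique(labels):
        Zg = Z[labels == g] - Z[labels == g].mean(axis=0)
        within += Zg.T @ Zg
    return np.linalg.det(within) / np.linalg.det(total)

rng = np.random.default_rng(0)
labels = np.array([0] * 4 + [1] * 3 + [2] * 4 + [3] * 4)  # group sizes as on the example gel
X = rng.standard_normal((15, 500))                        # simulated stand-in for lane profiles
Z = pc_scores(X, q=2)
lam = wilks_lambda(Z, labels)
```

Working with the n × n matrix keeps the eigenvalue problem small when p is in the hundreds or thousands, which is presumably why the dual form is preferred.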

7 Test based on pairwise similarity measures
1. Calculate pairwise similarity measures between sample elements, e.g. Pearson's r.
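As a minimal sketch of this step, the matrix of pairwise Pearson correlations between sample elements can be obtained with `numpy.corrcoef`, which treats each row as one variable. The helper name `pairwise_pearson` and the simulated lane profiles are assumptions for illustration only.

```python
import numpy as np

def pairwise_pearson(X):
    """Return the n x n matrix of Pearson correlations between the rows of X."""
    X = np.asarray(X, dtype=float)
    return np.corrcoef(X)          # each row (sample element) is correlated with every other row

rng = np.random.default_rng(0)
X = rng.random((6, 200))           # 6 hypothetical lanes with 200 greyscale values each
R = pairwise_pearson(X)
```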

8 (2. Investigate similarities by cluster analyses, supported by GelCompar)

9

10 3. Calculate the test statistic and use it as the basis for a permutation test: in each permutation step, the $n = n_1 + \dots + n_K$ sample elements are randomly reallocated to the K groups of sizes $n_1, \dots, n_K$ (simultaneous exchanges of rows and columns in the correlation matrix).
- The test can be carried out with all groups or pairwise.
- We used random permutations.
- Problems occur with small samples because the number of distinct permutations is restricted, e.g. with two samples of size 4.
- The permutation test in its basic form is a special case of the Mantel test (Mantel, 1967); for a similar application to electrophoresis data see Aittokallio et al. (2000).
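A minimal sketch of such a permutation test is given below. The formula of the test statistic is not preserved in this transcript, so the code assumes one plausible form, the mean within-group similarity minus the mean between-group similarity; the names `d_statistic`, `permutation_test` and `n_allocations` and the simulated data are hypothetical.

```python
import numpy as np
from math import factorial

def d_statistic(R, labels):
    """Assumed statistic: mean within-group minus mean between-group similarity."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    upper = np.triu(np.ones(R.shape, dtype=bool), k=1)   # each pair once, no diagonal
    return R[upper & same].mean() - R[upper & ~same].mean()

def permutation_test(R, labels, n_perm=9999, seed=0):
    """Monte Carlo permutation p-value; large d counts as evidence against H0."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    d_obs = d_statistic(R, labels)
    hits = 0
    for _ in range(n_perm):
        # Reallocating the labels is equivalent to permuting rows and columns of R together.
        if d_statistic(R, rng.permutation(labels)) >= d_obs - 1e-12:
            hits += 1
    return d_obs, (hits + 1) / (n_perm + 1)

def n_allocations(sizes):
    """Number of ways to split sum(sizes) labelled elements into groups of these sizes."""
    out = factorial(sum(sizes))
    for s in sizes:
        out //= factorial(s)
    return out

rng = np.random.default_rng(1)
patterns = rng.random((2, 300))                      # two hypothetical group profiles
labels = np.repeat([0, 1], 4)
X = patterns[labels] + 0.05 * rng.standard_normal((8, 300))
d_obs, p = permutation_test(np.corrcoef(X), labels, n_perm=999)
```

For two groups of size 4, `n_allocations([4, 4])` gives 70 allocations; because a statistic of this form does not change when the two group labels are swapped, only 35 of them lead to distinct values.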

11 Comparison of results in example data
p-values for global test and unadjusted pairwise tests

Groups     d      PC1    PC2    PC3    PC4    PC5
B – S     .029   .116   .221   .454   .260   .129
B – P     .029   .038   .011   .042   .046   .150
B – R     .029   .023   .015   .031   .045   .055
S – P     .029   .264   .035   .100   .174   .360
S – R     .029   .392   .175   .135   .159   .016
P – R     .057   .442   .251   .056   .168   .062

all groups (values as transcribed; fewer than six survive, so the column assignment is not recoverable): <.001  .136  .016  .001  <.001

The d version performs quite well, but is at its limits here: no Bonferroni adjustment is possible.
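The remark about Bonferroni can be checked with a few lines of arithmetic. The sketch assumes a pairwise comparison of two groups of size 4, a statistic that is unchanged when the two group labels are swapped, and a nominal level of 0.05; none of these is stated explicitly on the slide.

```python
from math import comb

n_distinct = comb(8, 4) // 2            # 4 vs 4: swapping the group labels leaves d unchanged
min_p = 1 / n_distinct                  # smallest attainable exact p-value
n_comparisons = comb(4, 2)              # six pairwise comparisons among four groups
bonferroni_level = 0.05 / n_comparisons
print(round(min_p, 3), round(bonferroni_level, 4), min_p <= bonferroni_level)
# prints: 0.029 0.0083 False
```

Under these assumptions the smallest attainable p-value, about .029, agrees with the smallest entries in the d column and stays above the adjusted level for six comparisons. A comparison of 4 against 3 samples gives `comb(7, 3) = 35` allocations and hence the same bound.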

12 The same data with transformation x := ln(1+x)
p-values for global test and unadjusted pairwise tests

Groups     d      PC1    PC2    PC3    PC4    PC5
B – S     .029   .098   .119   .288   .155   .290
B – P     .029   .003   .019   .036   .021   .068
B – R     .029   .058   .001   .006   .002   .017
S – P     .029   .837   .104   .091   .253   .451
S – R     .029   .736   .245   .024   .107   .358
P – R     .029   .773   .151   .051   .156   .166

all groups (values as transcribed; fewer than six survive, so the column assignment is not recoverable): <.001  .116  .011  <.001

13 Application to other data
- Other microbiological fingerprints (DGGE) from soil of four different regions, four samples each
- Gene expression analyses from microarrays
⇒ The permutation test based on pairwise correlation coefficients of sample elements performed very well and outperformed the PC test in these examples (Kropf et al., 2007).

14 Extensions of permutation test
The correlation-based test can be extended in different ways (Kropf et al., 2004):
- Inclusion of block designs (e.g. use of several gels, where lanes may not be compared across different gels).
- Comparison of dependent samples (e.g. the same soil samples analyzed with different types of gels).
- Use of other distance or similarity measures instead of r (e.g. Fisher's z-transformation of r, squared Euclidean distance, other distances for binary or ordinal data, …).
⇒ High flexibility for applications.
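As an illustration of the block-design extension, the sketch below shuffles group labels only within each block (gel) and evaluates the statistic only on pairs of lanes from the same block. This is one possible reading of "lanes may not be compared across different gels", not necessarily the procedure of Kropf et al. (2004); the function names and the within-minus-between form of the statistic are assumptions.

```python
import numpy as np

def permute_within_blocks(labels, blocks, rng):
    """Shuffle group labels separately inside each block (e.g. each gel)."""
    labels = np.asarray(labels)
    blocks = np.asarray(blocks)
    permuted = labels.copy()
    for b in np.unique(blocks):
        idx = np.flatnonzero(blocks == b)
        permuted[idx] = rng.permutation(labels[idx])
    return permuted

def d_within_blocks(R, labels, blocks):
    """Within- minus between-group mean similarity, using same-block pairs only."""
    labels = np.asarray(labels)
    blocks = np.asarray(blocks)
    upper = np.triu(np.ones(R.shape, dtype=bool), k=1)
    comparable = upper & (blocks[:, None] == blocks[None, :])   # drop cross-gel pairs
    same = labels[:, None] == labels[None, :]
    return R[comparable & same].mean() - R[comparable & ~same].mean()

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 0, 0, 1, 1])    # two groups
blocks = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # two hypothetical gels with four lanes each
R = np.corrcoef(rng.random((8, 100)))          # simulated similarity matrix
d_obs = d_within_blocks(R, labels, blocks)
shuffled = permute_within_blocks(labels, blocks, rng)
```

A permutation p-value would then be obtained by recomputing `d_within_blocks` for many such restricted shuffles, exactly as in the unrestricted case.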

15 Parametric “rotation” test for small n
Under the usual parametric assumptions, the distribution of the test statistic might be too complicated for a closed-form solution, so we look for a Monte Carlo version:
- The test statistic is traced back to a left-spherically distributed matrix (in particular, a matrix with iid multivariate normal rows with expectation zero), which, under $H_0$, is distributionally invariant under random orthogonal rotations.
- Use an unlimited number of random rotations instead of the restricted number of permutations.
⇒ “Rotation” test (cf. Langsrud, 2005; Läuter et al., 2005)

16 [Diagram] Data matrix X (rows 1, …, n_1, …, 1, …, n_K; columns 1, …, p) → distance/similarity matrix $R = (r_{ij})$, $r_{ij} = r(x_{(i)}, x_{(j)})$ → test statistic $d = d(R)$.

17 [Diagram, extended] Elements: data matrix X (rows 1, …, n_1, …, 1, …, n_K; columns 1, …, p); reduced data matrix X* (rows 1, 2, …, n−1; columns 1, …, p); distance/similarity matrix $R = (r_{ij})$, $r_{ij} = r(x_{(i)}, x_{(j)})$; test statistic $d = d(R)$.

18 [Diagram, extended] Elements as before, plus the “decorrelated” matrix $X^+$.
Random rotations: $X^+ := \Gamma X^+$ with $\Gamma = \Gamma^*(\Gamma^{*\top}\Gamma^*)^{-1/2}$, where $\Gamma^*$ is an $(n-1) \times (n-1)$ matrix of iid standard normal elements, drawn repeatedly.
R has to be invariant with respect to a constant vector shift in its arguments, $r(x_{(i)}, x_{(j)}) = r(x_{(i)} + a, x_{(j)} + a)$. This holds e.g. for the squared Euclidean distance, but no longer for Pearson's r!
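The following is a generic Monte Carlo rotation test in the spirit of Langsrud (2005), written as a sketch under stated assumptions rather than as the exact construction on the slides: it omits the “decorrelation” step, builds the reduced matrix from an orthonormal basis of the space orthogonal to the constant vector, and assumes a statistic of the form mean between-group minus mean within-group squared Euclidean distance. All function names are hypothetical and the data are simulated.

```python
import numpy as np

def sq_euclidean(X):
    """n x n matrix of squared Euclidean distances between the rows of X."""
    sq = (X ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * X @ X.T

def d_distance(D, labels):
    """Assumed statistic: mean between-group minus mean within-group distance."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    upper = np.triu(np.ones(D.shape, dtype=bool), k=1)
    return D[upper & ~same].mean() - D[upper & same].mean()

def random_rotation(m, rng):
    """Random orthogonal m x m matrix G (G'G)^(-1/2) from an iid standard normal G."""
    G = rng.standard_normal((m, m))
    eigval, eigvec = np.linalg.eigh(G.T @ G)
    return G @ (eigvec / np.sqrt(eigval)) @ eigvec.T

def rotation_test(X, labels, n_rot=999, seed=0):
    """Monte Carlo rotation p-value for the distance-based statistic."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Orthonormal basis H (n x (n-1)) of the space orthogonal to the constant vector.
    Q, _ = np.linalg.qr(np.column_stack([np.ones(n), np.eye(n)[:, : n - 1]]))
    H = Q[:, 1:]
    X_red = H.T @ X                                  # reduced (n-1) x p matrix, mean removed
    d_obs = d_distance(sq_euclidean(H @ X_red), labels)
    hits = 0
    for _ in range(n_rot):
        X_rot = H @ random_rotation(n - 1, rng) @ X_red   # rotate, then map back to n rows
        hits += int(d_distance(sq_euclidean(X_rot), labels) >= d_obs)
    return d_obs, (hits + 1) / (n_rot + 1)

rng = np.random.default_rng(2)
labels = np.repeat([0, 1], 4)
X = rng.standard_normal((8, 50)) + 2.0 * labels[:, None]   # simulated shift between groups
d_obs, p = rotation_test(X, labels, n_rot=499)

a = rng.standard_normal(50)                                # common shift vector
shift_ok = np.allclose(sq_euclidean(X), sq_euclidean(X + a))
pearson_ok = np.allclose(np.corrcoef(X), np.corrcoef(X + a))
```

The last three lines illustrate the invariance requirement: adding the same vector `a` to every sample leaves the squared Euclidean distances unchanged but alters the Pearson correlations. Because the number of rotations is not limited by the sample size, the attainable p-values are not restricted to a coarse grid as in the permutation test.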

19 Example data (4 groups: bulk soil, strawberry, potato, oilseed rape)
p-values from global test and with unadjusted pairwise comparisons

Groups     d      PC1    PC2    PC3    PC4    PC5    d_Eucl2  d_rot
1 – 2     .029   .116   .221   .454   .260   .129   .029     .024
1 – 3     .029   .038   .011   .042   .046   .150   .029     .001
1 – 4     .029   .023   .015   .031   .045   .055   .029     .004
2 – 3     .029   .264   .035   .100   .174   .360   .029     .017
2 – 4     .029   .392   .175   .135   .159   .016   .057     .043
3 – 4     .057   .442   .251   .056   .168   .062   .057     .051

all 4 (values as transcribed; fewer than eight survive, so the column assignment is not recoverable): <.001  .136  .016  .001  <.001

20 The same data with transformation x := ln(1+x)
p-values for global test and unadjusted pairwise tests

Groups     d      PC1    PC2    PC3    PC4    PC5    d_Eucl2  d_rot
B – S     .029   .098   .119   .288   .155   .290   .029     .012
B – P     .029   .003   .019   .036   .021   .068   .029     <.001
B – R     .029   .058   .001   .006   .002   .017   .029     .006
S – P     .029   .837   .104   .091   .253   .451   .029     .002
S – R     .029   .736   .245   .024   .107   .358   .029     .028
P – R     .029   .773   .151   .051   .156   .166   .086     .036

all groups (values as transcribed; column assignment not recoverable): d and PC1 to PC5: <.001  .116  .011  <.001; d_Eucl2 and d_rot: <.001

21 Simulation studies on robustness
- e.g. p independent components from an exponential distribution
- Others: uniform distribution: slightly anticonservative; sum of normal and one of the above: nearly exact

22 Summary
- Tests based on pairwise similarity or distance measures show high power in high-dimensional data.
- The permutation tests for the pairwise methods do not depend on normality assumptions and performed surprisingly well in many situations.
- The basic idea is not new (cf. Mantel, 1967), but might have lost attention, at least in the field of medical biometry.
- Extensions for other designs are possible to some degree.
- Similar (partly asymptotic) methods are available in the software “CANOCO” (Canonical Community Ordination) by ter Braak and Šmilauer (2002).
- The small number of possible permutations restricts application to very small samples. In this case the rotation test can help. It depends, however, on the parametric assumptions, so variables should be checked and, if necessary, transformed.

23 References
Aittokallio, T., Ojala, P., Nevalainen, T.J., Nevalainen, O. (2000). Analysis of similarity of electrophoretic patterns in mRNA differential display. Electrophoresis 21, 2947-2956.
Kropf, S., Heuer, H., Grüning, M., Smalla, K. (2004). Significance test for comparing complex microbial community fingerprints using pairwise similarity measures. Journal of Microbiological Methods 57/2, 187-195.
Kropf, S., Lux, A., Eszlinger, M., Heuer, H., Smalla, K. (2007). Comparison of independent samples of high-dimensional data by pairwise distance measures. Biometrical Journal 49, 230-241.
Langsrud, Ø. (2005). Rotation Tests. Statistics and Computing 15, 53-60.
Läuter, J., Glimm, E., Kropf, S. (1998). Multivariate Tests Based on Left-Spherically Distributed Linear Scores. Annals of Statistics 26, 1972-1988. Erratum: Annals of Statistics 27, 1441.
Läuter, J., Glimm, E., Eszlinger, M. (2005). Search for Relevant Sets of Variables in a High-Dimensional Setup Keeping the Familywise Error Rate. Submitted to Statistica Neerlandica.
Mantel, N. (1967). The Detection of Disease Clustering and a Generalized Regression Approach. Cancer Res. 27, 209-220.
ter Braak, C.J.F., Šmilauer, P. (2002). CANOCO Reference Manual and CanoDraw for Windows User's Guide: Software for Canonical Community Ordination (Version 4.5). Microcomputer Power, Ithaca NY, USA.

