Multivariate Tests Based on Pairwise Distance or Similarity Measures Siegfried Kropf Institute for Biometry and Medical Informatics Otto von Guericke University.

Presentation transcript:

Multivariate Tests Based on Pairwise Distance or Similarity Measures
Siegfried Kropf
Institute for Biometry and Medical Informatics, Otto von Guericke University Magdeburg, Germany
The 8th Tartu Conference on Multivariate Statistics / The 6th Conference on Multivariate Distributions with Fixed Marginals
Tartu, Estonia, June 2007

Contents
Introduction
Example data - microbial fingerprints
“Usual” way as multivariate test based on spherically distributed scores (PC scores)
Test based on pairwise similarity measures
Comparison of results in example data
Application to other data
Extensions of permutation test
Parametric “rotation” test for small n
Simulation studies on robustness
Summary

Introduction
Consider global multivariate comparisons between two or more populations of high-dimensional data:
  Gene expression data (all genes, known groups of genes)
  Neuroimaging
  Genetic fingerprints (e.g. microbial DNA in soil samples)
  …
Formal description: independent sample vectors
  $x_{kj} \sim N_p(\mu_k, \Sigma)$, $k = 1, 2, \dots, K$; $j = 1, \dots, n_k$, $p \gg n = n_1 + \dots + n_K$
or, more generally,
  $x_{kj} \sim F_k(x)$, $k = 1, 2, \dots, K$; $j = 1, \dots, n_k$, $p \gg n$
Wanted: a test for $H_0: \mu_1 = \dots = \mu_K$ or $F_1(x) = \dots = F_K(x) \;\; \forall x$

Example data - microbial fingerprints
Question: What is the impact of different (natural or genetically modified) plant cultures on the soil microbial population?
Extraction of bacterial samples, DNA parts amplified by PCR.
Several samples are investigated together in electrophoresis gels (e.g. denaturing gradient gel electrophoresis, DGGE).
Gels are scanned and analyzed with GelCompar, giving a vector of hundreds or thousands of greyscale values per lane.

[Figure: Denaturing gradient gel with fingerprints of bacterial communities from rhizosphere soil (lanes S1 to S3: strawberry, P1 to P4: potato, R1 to R4: oilseed rape) and unplanted soil (lanes B1 to B4). Lanes M1 to M3: standard bacterial mix. Lane order on the gel: M1, B1 to B4, S1 to S3, M2, P1 to P4, R1 to R4, M3.]

“Usual” way as multivariate test based on spherically distributed scores (PC scores)
Exact test based on spherically distributed scores: $PC_q$ test (Läuter et al., 1998)
1. Transformation of the raw data vectors into q-dimensional score vectors: $x_{kj} \to z_{kj} = D^{\top} x_{kj}$ ($k = 1, \dots, K$; $j = 1, \dots, n_k$) with a $(p \times q)$ matrix $D$ ($q \ll p$) from the eigenvalue problem (EVP) or, better, from the dual EVP
2. Multivariate test (here Wilks' $\Lambda$) with the q-dimensional scores $z_{kj}$
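The score transformation in step 1 can be sketched in NumPy. This is only an illustration under assumptions, not the authors' implementation: the function name `pc_scores_dual`, the toy data, and the choice to take D from the leading eigenvectors of the n x n Gram matrix of the overall-mean-centered data (one way to read "dual EVP") are all hypothetical. The subsequent Wilks' Λ test on the scores is not shown.

```python
import numpy as np

def pc_scores_dual(X, q):
    """Return q principal-component scores per sample via the n x n dual problem.

    X is an (n, p) data matrix with samples in rows; q must not exceed n - 1.
    """
    Xc = X - X.mean(axis=0)              # center with the overall mean (group labels unused)
    G = Xc @ Xc.T                        # n x n Gram matrix instead of the p x p matrix Xc^T Xc
    eigval, eigvec = np.linalg.eigh(G)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:q] # indices of the q largest eigenvalues
    lam, U = eigval[order], eigvec[:, order]
    D = Xc.T @ U / np.sqrt(lam)          # (p, q) coefficient matrix with orthonormal columns
    Z = Xc @ D                           # (n, q) matrix of centered score vectors
    return Z, D

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 500))           # toy data: n = 15 samples, p = 500 variables
Z, D = pc_scores_dual(X, q=3)
print(Z.shape, D.shape)
```

Working with the n x n matrix avoids an eigen-decomposition of a p x p matrix when p is much larger than n, which is presumably why the dual problem is preferred here.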

Test based on pairwise similarity measures
1. Calculate pairwise similarity measures between sample elements, e.g., Pearson’s r

(2. Investigate similarities by cluster analyses, supported by GelCompar)

3. Calculate the test statistic and use it as the basis for a permutation test: in each permutation step, the n = n_1 + … + n_K sample elements are randomly reallocated to the K groups of sizes n_1, …, n_K (simultaneous exchange of rows and columns in the correlation matrix).
The test can be carried out with all groups or pairwise. We used random permutations.
Problems occur with small samples because of the restricted number of permutations, e.g. with two samples of size 4 only a limited number of different permutations exists.
The permutation test in its basic form is a special case of the Mantel test (Mantel, 1967); for a similar application to electrophoresis data see Aittokallio et al. (2000).
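Steps 1 and 3 can be sketched as follows. The formula of the test statistic is not preserved in this transcript, so the sketch assumes a statistic of the form "mean within-group similarity minus mean between-group similarity"; that choice, the function names, and the simulated data are assumptions for illustration only.

```python
import numpy as np

def within_minus_between(R, labels):
    """Mean similarity of within-group pairs minus mean similarity of between-group pairs."""
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)           # every pair (i < j) exactly once
    same = labels[iu[0]] == labels[iu[1]]
    return R[iu][same].mean() - R[iu][~same].mean()

def permutation_test(X, labels, n_perm=999, seed=0):
    """Random-permutation p-value for the hypothesis of no group differences."""
    rng = np.random.default_rng(seed)
    R = np.corrcoef(X)                               # pairwise Pearson r between samples (rows)
    d_obs = within_minus_between(R, labels)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(labels))
        # Permuting rows and columns of R together reallocates samples to groups.
        d_perm = within_minus_between(R[np.ix_(perm, perm)], labels)
        hits += int(d_perm >= d_obs)
    return d_obs, (hits + 1) / (n_perm + 1)          # the observed allocation counts as one

rng = np.random.default_rng(42)
labels = np.repeat([0, 1, 2, 3], [4, 3, 4, 4])       # hypothetical groups of sizes 4, 3, 4, 4
common = rng.normal(size=300)                        # profile shared by all samples
group_profiles = rng.normal(size=(4, 300))           # simulated group-specific profiles
X = common + group_profiles[labels] + rng.normal(size=(15, 300))
d_obs, p_value = permutation_test(X, labels)
print(round(float(d_obs), 3), p_value)
```

Because the similarity matrix is computed once and only re-indexed, each permutation step is cheap even when p is in the thousands.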

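Comparison of results in example data
Table: p-values for the global test and unadjusted pairwise tests
Columns: d, PC_1, PC_2, PC_3, PC_4, PC_5
Rows: all groups, B – S, B – P, B – R, S – P, S – R, P – R
(Only the fragments "<" and "<.001" of the "all groups" row survive; the remaining values are not recoverable.)
The d version performs quite well, but is at its limits here: no Bonferroni adjustment is possible.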
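One reading of the remark about Bonferroni is that, with so few lanes per group, the smallest p-value a pairwise permutation test can reach already exceeds the Bonferroni-adjusted level. The arithmetic below is a sketch under assumptions: group sizes are taken from the lane labels of the example gel (B: 4, S: 3, P: 4, R: 4), the significance level 0.05 is assumed, and every allocation of lanes to the two groups is counted as distinct.

```python
from math import comb

group_sizes = {"B": 4, "S": 3, "P": 4, "R": 4}    # lanes per group in the example gel
alpha = 0.05                                       # assumed overall significance level
names = list(group_sizes)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
bonferroni_level = alpha / len(pairs)              # level per test for six pairwise comparisons

smallest_p = {}
for a, b in pairs:
    n_alloc = comb(group_sizes[a] + group_sizes[b], group_sizes[a])
    smallest_p[(a, b)] = 1 / n_alloc               # best case: observed allocation is the most extreme
    print(f"{a}-{b}: {n_alloc} allocations, smallest p = {smallest_p[(a, b)]:.4f}")
print(f"Bonferroni level for {len(pairs)} tests: {bonferroni_level:.4f}")
```

With 4 versus 4 lanes there are 70 allocations and with 4 versus 3 lanes there are 35, so no pairwise p-value can fall below 1/70, which is larger than 0.05/6.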

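The same data with transformation x := ln(1 + x)
Table: p-values for the global test and unadjusted pairwise tests
Columns: d, PC_1, PC_2, PC_3, PC_4, PC_5
Rows: all groups, B – S, B – P, B – R, S – P, S – R, P – R
(Only the fragments "<" and "<.001" of the "all groups" row survive; the remaining values are not recoverable.)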

Application to other data
Other microbiological fingerprints (DGGE), from soil of four different regions, four samples each
Gene expression analyses from microarrays
→ The permutation test based on pairwise correlation coefficients of sample elements performed very well and outperformed the PC test in the examples (Kropf et al., 2007).

Extensions of permutation test
The correlation-based test can be extended in different ways (Kropf et al., 2004):
Inclusion of block designs (e.g., use of several gels, where lanes may not be compared across different gels).
Comparison of dependent samples (e.g., the same soil samples analyzed with different types of gels).
Use of other distance or similarity measures instead of r (e.g., z-dot transformation of r, squared Euclidean distance, other distances for binary or ordinal data, …).
→ High flexibility for applications.
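Two of these extensions can be illustrated with small helpers. Both are hypothetical sketches: `squared_euclidean` swaps Pearson's r for a distance measure, and `permute_within_blocks` shows one possible way to respect a block design by exchanging samples only within the same gel. How the cited paper handles blocks in detail is not stated in the slides.

```python
import numpy as np

def squared_euclidean(X):
    """n x n matrix of squared Euclidean distances between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.maximum(D2, 0.0)            # clip tiny negative values caused by rounding

def permute_within_blocks(blocks, rng):
    """Permutation of sample indices that only exchanges samples sharing a block."""
    blocks = np.asarray(blocks)
    perm = np.arange(len(blocks))
    for b in np.unique(blocks):
        idx = np.flatnonzero(blocks == b)
        perm[idx] = rng.permutation(idx)  # shuffle positions inside block b only
    return perm

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 20))              # toy data: 8 lanes, 20 positions
D2 = squared_euclidean(X)
blocks = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # hypothetical assignment of lanes to two gels
perm = permute_within_blocks(blocks, rng)
print(D2.shape, perm)
```

Note that with a distance instead of a similarity, large values indicate dissimilar samples, so the direction of any "within versus between" statistic has to be reversed accordingly.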

Parametric “rotation” test for small n
Usual assumptions: [formula not preserved]
As the distribution of the test statistic might be too complicated for a “closed” solution, we are looking for a Monte Carlo version:
The test statistic is traced back to a left-spherically distributed matrix (in particular, a matrix of iid multivariate normal rows with expectation zero), which, under $H_0$, is distributionally invariant to random orthogonal rotations.
Use an unlimited number of random rotations instead of the restricted number of permutations.
→ “Rotation” test (cf. Langsrud, 2005; Läuter et al., 2005)
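A random orthogonal matrix for such a Monte Carlo scheme can be drawn in several ways. The sketch below uses one common construction, G (G^T G)^{-1/2} with G filled with iid standard normal entries; the function name and the dimension are illustrative, and the sketch does not implement the full rotation test.

```python
import numpy as np

def random_rotation(m, rng):
    """Random m x m orthogonal matrix built as G (G^T G)^(-1/2) from iid N(0, 1) entries."""
    G = rng.normal(size=(m, m))
    eigval, eigvec = np.linalg.eigh(G.T @ G)                 # G^T G is symmetric, almost surely positive definite
    inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T   # symmetric inverse square root
    return G @ inv_sqrt

rng = np.random.default_rng(0)
Q = random_rotation(5, rng)
print(np.allclose(Q.T @ Q, np.eye(5)))                       # checks orthogonality numerically

# Rotating a matrix with iid N(0, 1) rows changes the rows but not their joint distribution.
Y = rng.normal(size=(5, 3))
Y_rot = Q @ Y
print(np.allclose(Y_rot.T @ Y_rot, Y.T @ Y))                 # cross-product matrix is preserved
```

Unlike permutations, such rotations can be drawn in unlimited number regardless of the sample size, which is the point made on the slide.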

[Diagram: scheme of the rotation test, built up over three slides. Elements:]
data matrix X (rows: sample elements 1, …, n_1, …, 1, …, n_K; columns: variables 1, …, p)
reduced data matrix X* (rows 1, 2, …, n−1; columns 1, …, p)
“decorrelated” matrix $X^{+}$
random rotations, applied repeatedly: $X^{+} := \Gamma X^{+}$, $\Gamma = \Gamma^{*} (\Gamma^{*\top} \Gamma^{*})^{-1/2}$, with $\Gamma^{*}$ an $(n-1) \times (n-1)$ matrix of iid standard normal elements
distance/similarity matrix $R = (r_{ij})$, $r_{ij} = r(x_{(i)}, x_{(j)})$
test statistic $d = d(R)$
R has to be invariant with respect to a constant vector shift in its arguments, $r(x_{(i)}, x_{(j)}) = r(x_{(i)} + a, x_{(j)} + a)$; this holds e.g. for the squared Euclidean distance, but no longer for Pearson’s r!
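The shift-invariance requirement can be checked numerically. This small sketch uses simulated vectors and hypothetical helper names: it adds the same vector a to both arguments and compares each measure before and after the shift.

```python
import numpy as np

rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=50), rng.normal(size=50)
a = rng.normal(size=50)                      # arbitrary shift vector added to both arguments

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return float(np.sum((u - v) ** 2))

def pearson_r(u, v):
    """Pearson correlation between the components of two vectors."""
    return float(np.corrcoef(u, v)[0, 1])

dist_invariant = bool(np.isclose(sq_dist(x_i, x_j), sq_dist(x_i + a, x_j + a)))
r_invariant = bool(np.isclose(pearson_r(x_i, x_j), pearson_r(x_i + a, x_j + a)))
print(dist_invariant, r_invariant)
```

The distance depends only on the difference of its arguments, so the shift cancels, whereas the shared component a changes the correlation between the two shifted vectors.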

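Example data (4 groups: bulk soil, strawberry, potato, oilseed rape)
Table: p-values from the global test and from unadjusted pairwise comparisons
Columns: d, PC_1, PC_2, PC_3, PC_4, PC_5, d_Euk2, d_rot
Rows: all 4 groups, followed by the six pairwise comparisons
(Only two "<" fragments of the "all 4" row survive; the remaining values are not recoverable.)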

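The same data with transformation x := ln(1 + x)
Table: p-values for the global test and unadjusted pairwise tests
Columns: d, PC_1, PC_2, PC_3, PC_4, PC_5, d_Euk2, d_rot
Rows: all groups, B – S, B – P, B – R, S – P, S – R, P – R
(Only "<" and "<.001" fragments of the "all groups" row survive; the remaining values are not recoverable.)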

Simulation studies on robustness
e.g. p independent components from an exponential distribution [figure not preserved]
Others:
  uniform distribution: slightly anticonservative
  sum of a normal component and one of the above: nearly exact

Summary
Tests based on pairwise similarity or distance measures show high power in high-dimensional data.
The permutation tests for the pairwise methods do not depend on normality assumptions and performed surprisingly well in many situations.
The basic idea is not new (cf. Mantel, 1967), but might have lost attention, at least in the field of medical biometry.
Extensions for other designs are possible to some degree.
Similar (partly asymptotic) methods are available in the software “CANOCO” (Canonical Community Ordination) by ter Braak and Šmilauer (2002).
The small number of possible permutations restricts the application to very small samples. In this case the rotation test can help. It depends, however, on the parametric assumptions, so variables should be checked and, if necessary, transformed.

References
Aittokallio, T., Ojala, P., Nevalainen, T.J., Nevalainen, O. (2000). Analysis of similarity of electrophoretic patterns in mRNA differential display. Electrophoresis 21, 2947–
Kropf, S., Heuer, H., Grüning, M., Smalla, K. (2004). Significance test for comparing complex microbial community fingerprints using pairwise similarity measures. Journal of Microbiological Methods 57/2.
Kropf, S., Lux, A., Eszlinger, M., Heuer, H., Smalla, K. (2007). Comparison of independent samples of high-dimensional data by pairwise distance measures. Biometrical Journal 49.
Langsrud, Ø. (2005). Rotation Tests. Statistics and Computing 15.
Läuter, J., Glimm, E., Kropf, S. (1998). Multivariate Tests Based on Left-Spherically Distributed Linear Scores. Annals of Statistics 26. Erratum: Annals of Statistics 27.
Läuter, J., Glimm, E., Eszlinger, M. (2005). Search for Relevant Sets of Variables in a High-Dimensional Setup Keeping the Familywise Error Rate. Submitted to Statistica Neerlandica.
Mantel, N. (1967). The Detection of Disease Clustering and a Generalized Regression Approach. Cancer Res. 27.
ter Braak, C.J.F., Šmilauer, P. (2002). CANOCO Reference Manual and CanoDraw for Windows User’s Guide: Software for Canonical Community Ordination (Version 4.5). Microcomputer Power, Ithaca NY, USA.