Nonparametric tests
European Molecular Biology Laboratory, Predoc Bioinformatics Course, 17th Nov 2009
Tim Massingham


What is a nonparametric test?
Parametric: assume the data come from some family of distribution functions, indexed by parameters.
    Normal distribution: mean, variance
    Gamma distribution: shape, scale
    etc.
Non-parametric: no assumptions are made about the distribution. In practice this generally means looking only at the ranks of the data.
Most traditional tests assume a normal distribution.
Figure: gamma distributions with different parameters (shape, scale).

Robustness
A single observation can change the outcome of many tests.
Robust tests are resistant to outliers but require more data.
Example: 200 observations from normal distributions, x ~ normal(0,1), y ~ normal(1,3).
Pearson's correlation test (parametric):
    Correlation = (p-value = )
    Correlation = (p-value = )
    Correlation = (p-value = 5.81e-06)
    Correlation = (p-value = 6.539e-08)
Spearman's correlation test (non-parametric):
    Correlation = (p-value = )
    Correlation = (p-value = )
    Correlation = (p-value = 0.101)
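The effect of a single outlier on the two correlation tests can be sketched in R. This is a minimal illustration, not the slide's original analysis: the seed, the outlier at (100, 100), and reading normal(1,3) as mean 1 and standard deviation 3 are all assumptions.

```r
set.seed(42)                          # arbitrary seed, chosen for reproducibility
x <- rnorm(200, mean = 0, sd = 1)
y <- rnorm(200, mean = 1, sd = 3)     # assumption: the 3 is a standard deviation

# Append one extreme observation (hypothetical outlier)
x_out <- c(x, 100)
y_out <- c(y, 100)

pearson_clean    <- cor(x, y, method = "pearson")
pearson_outlier  <- cor(x_out, y_out, method = "pearson")
spearman_clean   <- cor(x, y, method = "spearman")
spearman_outlier <- cor(x_out, y_out, method = "spearman")

# p-values of the corresponding tests on the contaminated data
p_pearson  <- cor.test(x_out, y_out, method = "pearson")$p.value
p_spearman <- cor.test(x_out, y_out, method = "spearman")$p.value
```

Because Spearman's test only uses ranks, the outlier counts as one more "largest" point, while Pearson's correlation is dominated by it.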

Newcomb's speed of light data
Light path: Newcomb's lab (1878) to the Washington Monument (~12 µs later)
Standard test of all data: mean and confidence interval (width = 5.3)
Newcomb dropped the outlier: mean and confidence interval (width = 3.1)
Robust test (sign test for median): median and confidence interval (width = 2.5)

Efficiency of robust tests
Few results are available, mostly for large samples.
Percentage extra data for same tests:
    Using median rather than mean: 50% more data
    Wilcoxon test vs. t-test: 20% more data (no more than)
Asymptotic Relative Efficiency:
    Asymptotic: valid for large samples
    Relative efficiency: ratio of variances
Potvin and Roff (1993) Ecology 74:
    Requires less data!
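The "ratio of variances" idea can be checked by simulation. The sketch below assumes normally distributed data, a sample size of 100 and 5000 replicates (all arbitrary choices, not taken from the slides); for normal data the ratio tends to about pi/2 for large samples, which is in the same region as the slide's "50% more data".

```r
set.seed(1)      # arbitrary seed
n    <- 100      # observations per simulated sample (assumption)
reps <- 5000     # number of simulated samples (assumption)

# Each column holds the mean and the median of one simulated normal sample
sims <- replicate(reps, {
  x <- rnorm(n)
  c(mean = mean(x), median = median(x))
})

# Relative efficiency: how much more variable the median is than the mean
rel_eff <- var(sims["median", ]) / var(sims["mean", ])
rel_eff
```

A ratio of r means the median needs roughly r times as many observations as the mean to reach the same precision, under these assumptions.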

Kolmogorov test
AKA Kolmogorov-Smirnov test
Type of data: continuous
Parametric equivalent: none
Distribution of statistic: exact when there are no ties in the data
Does this data follow a specific distribution?
Are two sets of data from the same distribution?
Test statistic: the maximum difference between the cumulative distribution functions.
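As a sketch of what "maximum difference" means in the one-sample case, the statistic can be computed by hand and compared with ks.test. The simulated data and variable names are illustrative assumptions.

```r
set.seed(1)                 # arbitrary seed
x <- rnorm(50)              # hypothetical sample
n <- length(x)

# Hypothesised CDF evaluated at the sorted data
Fx <- pnorm(sort(x))

# The empirical CDF jumps from (i-1)/n to i/n at the i-th sorted point,
# so the largest gap must be checked on both sides of each jump
D_manual <- max(c((1:n) / n - Fx, Fx - (0:(n - 1)) / n))

D_r <- unname(ks.test(x, "pnorm")$statistic)
c(manual = D_manual, ks.test = D_r)
```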

Kolmogorov test
Why does it work?
The rank difference is constant under transformations that stretch and contract the x axis.
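This invariance can be illustrated for the two-sample test: applying the same increasing transformation to both samples leaves the statistic unchanged. The samples and the choice of exp() as the transformation are assumptions made for the sketch.

```r
set.seed(2)                       # arbitrary seed
a <- rnorm(40)                    # hypothetical sample 1
b <- rnorm(60, mean = 0.5)        # hypothetical sample 2

D_raw <- unname(ks.test(a, b)$statistic)

# exp() stretches and contracts the x axis but preserves the ordering
D_exp <- unname(ks.test(exp(a), exp(b))$statistic)

c(raw = D_raw, transformed = D_exp)
```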

Kolmogorov test
Is Studentized expression data normal?
    ks.test(stud_logexp, pnorm)

        One-sample Kolmogorov-Smirnov test
    data: stud_logexp
    D = , p-value < 2.2e-16
    alternative hypothesis: two-sided
Not valid when the null distribution has been fitted to the data, e.g. testing against a normal distribution whose mean and variance were fitted.
For testing whether data is normally distributed or not, the Shapiro-Wilk test is preferred. See shapiro.test in R.

Kolmogorov two-sample test
Are two sets of data from the same distribution?
Gene expression data from Arabidopsis thaliana:
    sprayed with 1.6mM Tween
    sprayed with water
    ks.test(logexp1,logexp2)

        Two-sample Kolmogorov-Smirnov test
    data: logexp1 and logexp2
    D = , p-value =
    alternative hypothesis: two-sided
Biggest deviations for low expression.

Sign test
Is the median of the data zero?
Is the median x? (Subtract x from the data and test against zero.)
Type of data: continuous
Parametric equivalent: Student's t-test (one sample)
Distribution of statistic: exact when there are no ties in the data
Each observation has a 50:50 chance of falling on either side of the median, so count them up and use a binomial test.
Figure: gene expression differences, split into those below zero (<0) and those above zero (>0).

Sign test
Gene expression differences: expect the difference in expression to be zero.
Discard differences of exactly zero, then count the differences below zero (<0) and above zero (>0).
    binom.test( c(10155,12334) )

        Exact binomial test
    data: c(10155, 12334)
    number of successes = 10155, number of trials = 22489, p-value < 2.2e-16
    alternative hypothesis: true probability of success is not equal to 0.5
    95 percent confidence interval:
    sample estimates:
    probability of success
The confidence interval is on the proportion, not the expression difference.
SIGN.test in the PASWR package is a more convenient way of doing a sign test and gives confidence intervals.
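The recipe above (subtract the hypothesised median, discard exact zeros, count signs, apply a binomial test) can be wrapped in a small helper. The function name sign_test and the simulated differences are hypothetical; only binom.test and the two counts come from the slide.

```r
# Hypothetical helper: sign test for "median equals m0"
sign_test <- function(d, m0 = 0) {
  d <- d - m0
  d <- d[d != 0]                          # discard differences of exactly zero
  binom.test(sum(d < 0), length(d), p = 0.5)
}

# The counts quoted on the slide, given as (successes, failures)
slide <- binom.test(c(10155, 12334))
slide$p.value

# Usage on simulated differences (assumption: true median is 0.3, not 0)
set.seed(3)
d   <- rnorm(200, mean = 0.3)
res <- sign_test(d)
res$p.value
```

As the slide notes, the confidence interval returned here is for the proportion of negative differences, not for the median itself.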

Wilcoxon Signed Rank test
Type of data: ordinal (interval for paired data)
Parametric equivalent: Student's t-test
Distribution of statistic: exact
Is the data symmetric about zero?
Is the data symmetric about x? (Subtract x and test against zero.)
This is a much stronger assumption than that of the sign test: the test rejects non-symmetric data, even when centred on its median.
    a <- rweibull(1000,1,1)
    wilcox.test( a-median(a) )
    p-value = 1.087e-05
Figure: the Weibull sample, median = 0.72.
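To make the role of ranks explicit, the signed rank statistic V can be computed by hand: rank the absolute values, then sum the ranks belonging to positive observations. The simulated data are an assumption for illustration.

```r
set.seed(4)                      # arbitrary seed
d <- rnorm(30, mean = 0.4)       # hypothetical differences, no ties or zeros

# Rank by magnitude, then add up the ranks of the positive values
V_manual <- sum(rank(abs(d))[d > 0])

V_r <- unname(wilcox.test(d)$statistic)
c(manual = V_manual, wilcox.test = V_r)
```

Under symmetry about zero, positive and negative values should share the ranks evenly, so V far from n(n+1)/4 is evidence against the null hypothesis.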

Wilcoxon Signed Rank test
Paired data:
    Same gene under two different conditions
    Measuring response (before and after)
    Paired control, e.g. sibling pairs
Special case when we do expect symmetry. Look at a pair X & Y:
    X = Intrinsic + Random X
    Y = Intrinsic + Random Y
    X - Y = Random X - Random Y
The random part is a random property, measurement error or natural variation.
The intrinsic part cancels, so the distribution of the difference is symmetric about zero.

Wilcoxon Signed Rank test
Have gene expression data in two matched Arabidopsis thaliana plants:
    one sprayed with 1.6mM Tween and left for one hour
    one sprayed with distilled water and left for one hour
The genes form matched pairs.
Figure panels: Water, Tween, Difference.

Wilcoxon Signed Rank test
    wilcox.test( lexp1, lexp2, paired=TRUE )

        Wilcoxon signed rank test with continuity correction
    data: lexp1 and lexp2
    V = , p-value < 2.2e-16
    alternative hypothesis: true location shift is not equal to 0

    wilcox.test( lexp1, lexp2, paired=TRUE, conf.int=TRUE)

        Wilcoxon signed rank test with continuity correction
    data: lexp1 and lexp2
    V = , p-value < 2.2e-16
    alternative hypothesis: true location shift is not equal to 0
    95 percent confidence interval:
    sample estimates:
    (pseudo)median

Wilcoxon Rank Sum Test
Also referred to as the Mann-Whitney or Mann-Whitney-Wilcoxon test
Type of data: ordinal
Parametric equivalent: two-sample Student's t-test
Distribution of statistic: exact
Do two samples have the same median?
Look at the same expression data but ignore the pairing.
    wilcox.test( lexp1, lexp2, conf.int=TRUE)

        Wilcoxon rank sum test with continuity correction
    data: lexp1 and lexp2
    W = , p-value =
    alternative hypothesis: true location shift is not equal to 0
    95 percent confidence interval:
    sample estimates:
    difference in location
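A sketch of where the W statistic reported by wilcox.test comes from: rank the pooled data, sum the ranks of the first sample, and subtract the smallest value that sum could take. The simulated samples are assumptions.

```r
set.seed(5)                       # arbitrary seed
x <- rnorm(25)                    # hypothetical sample 1
y <- rnorm(30, mean = 0.5)        # hypothetical sample 2

r  <- rank(c(x, y))               # ranks within the pooled data
nx <- length(x)

# Rank sum of the first sample, shifted so that its minimum is zero
W_manual <- sum(r[seq_len(nx)]) - nx * (nx + 1) / 2

W_r <- unname(wilcox.test(x, y)$statistic)
c(manual = W_manual, wilcox.test = W_r)
```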

Paired vs two-sample tests
Pairing can make a huge difference to the power of a test.
Look at a case where the variation in the intrinsic component is greater than the effect.
    wilcox.test(sample1,sample2)

        Wilcoxon rank sum test
    data: sample1 and sample2
    W = 4930, p-value =
    alternative hypothesis: true location shift is not equal to 0

    wilcox.test(sample1,sample2,paired=TRUE)

        Wilcoxon signed rank test
    data: sample1 and sample2
    V = 1609, p-value =
    alternative hypothesis: true location shift is not equal to 0
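A case like this one can be simulated. The sketch below does not reproduce the slide's sample1 and sample2; the sample size of 100, the effect of 1, and the standard deviations are assumptions chosen so that the intrinsic variation swamps the effect.

```r
set.seed(6)                                   # arbitrary seed
n <- 100
intrinsic <- rnorm(n, sd = 5)                 # shared by both members of a pair

sample1 <- intrinsic + rnorm(n, sd = 0.5)
sample2 <- intrinsic + 1 + rnorm(n, sd = 0.5) # assumed effect of size 1

# Ignoring the pairing: the effect is hidden by the intrinsic variation
p_unpaired <- wilcox.test(sample1, sample2)$p.value

# Using the pairing: the intrinsic part cancels in the differences
p_paired <- wilcox.test(sample1, sample2, paired = TRUE)$p.value

c(unpaired = p_unpaired, paired = p_paired)
```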

Kruskal-Wallis
Type of data: ordinal
Parametric equivalent: ANOVA
Distribution of statistic: approximate
What if we have several groups?
Arabidopsis gene expression data consisted of 6 experiments: 6 groups of expression data; do they have different medians?
    kruskal.test(gene_expression)

        Kruskal-Wallis rank sum test
    data: gene_expression
    Kruskal-Wallis chi-squared = , df = 5, p-value = 2.575e-11
For two samples, Kruskal-Wallis is equivalent to Wilcoxon Rank Sum.
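The two-sample equivalence can be checked numerically. It holds against the normal approximation of the rank sum test without continuity correction, which is why exact = FALSE and correct = FALSE are set below; the simulated data are assumptions.

```r
set.seed(7)                       # arbitrary seed
x <- rnorm(30)                    # hypothetical group 1
y <- rnorm(35, mean = 0.5)        # hypothetical group 2

p_kw <- kruskal.test(list(x, y))$p.value

# Normal approximation, no continuity correction, to match Kruskal-Wallis
p_w <- wilcox.test(x, y, exact = FALSE, correct = FALSE)$p.value

c(kruskal = p_kw, wilcoxon = p_w)
```

With R's defaults (exact p-value for small samples, or continuity correction), the two p-values would be close but not identical.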

Friedman test
Type of data: ordinal
Parametric equivalent: ANOVA with blocks
Distribution of statistic: approximate
Relation to the other tests (diagram of genes against groups G1, G2, ...):
    Paired observations: Wilcoxon Signed Rank test
    Many groups, with every unit observed in every group: Friedman test
    Many groups in distinct units: Kruskal-Wallis test

Friedman test
Classic example: wine tasting
Ask 4 women to rank 3 different wines; is one wine preferred?
    wine
           Merlot  Shiraz  Pinot Noir
    Agnes       1       2           3
    Clara       2       1           3
    Mona        1       3           2
    Pam         1       2           3

    friedman.test(wine)

        Friedman rank sum test
    data: wine
    Friedman chi-squared = 4.5, df = 2, p-value =
Flip the question: are judges ranking wines in a consistent manner?
                Agnes  Clara  Mona  Pam
    Merlot          1      2     1    1
    Shiraz          2      1     3    2
    Pinot Noir      3      3     2    3

    friedman.test(t(wine))

        Friedman rank sum test
    data: t(wine)
    Friedman chi-squared = , df = 3, p-value =
Expected, since we are forcing the judges to rank.
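The wine example is small enough to re-run. The sketch below rebuilds the ranking table as a matrix with judges as rows (blocks) and wines as columns (groups), which is the layout friedman.test expects for a matrix argument; the object names by_wine and by_judge are illustrative.

```r
wine <- matrix(c(1, 2, 3,
                 2, 1, 3,
                 1, 3, 2,
                 1, 2, 3),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("Agnes", "Clara", "Mona", "Pam"),
                               c("Merlot", "Shiraz", "Pinot Noir")))

# Is one wine preferred? Judges are blocks, wines are groups.
by_wine <- friedman.test(wine)
by_wine$statistic      # Friedman chi-squared, as on the slide
by_wine$p.value

# Flipped question: wines are blocks, judges are groups.
by_judge <- friedman.test(t(wine))
by_judge$parameter     # degrees of freedom
```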

Friedman test
Another look at the Arabidopsis data: look at the first 20 genes.
Figure: expression of each gene across Exp 1, Exp 2, Exp 3, Exp 4, Exp 5 and Exp 6.
Friedman Test p-value =
Kruskal-Wallis Test p-value =

Friedman test
Friedman Test p-value =
Friedman / Kruskal-Wallis: at least one experiment shows a difference. It does not say which experiment.
Pairwise Wilcoxon Signed Rank tests can locate the difference, but run into the multiple comparisons problem.
Tables: raw p-values and adjusted p-values for each pair of experiments among Exp 1 to Exp 6.
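One way to produce raw and adjusted pairwise p-values is sketched below. The slide does not say which adjustment was used, so Holm's method is an assumption, and the 20 x 6 matrix of simulated values merely stands in for the expression data.

```r
set.seed(8)                                    # arbitrary seed
# Hypothetical stand-in for 20 genes measured in 6 experiments
expr <- matrix(rnorm(20 * 6), nrow = 20,
               dimnames = list(NULL, paste("Exp", 1:6)))

pairs <- combn(ncol(expr), 2)                  # all 15 pairs of experiments

# Signed rank test for each pair, using genes as the matched units
raw_p <- apply(pairs, 2, function(ix)
  wilcox.test(expr[, ix[1]], expr[, ix[2]], paired = TRUE)$p.value)
names(raw_p) <- apply(pairs, 2, function(ix)
  paste(colnames(expr)[ix], collapse = " vs "))

# Adjust for having made 15 comparisons (assumption: Holm's method)
adj_p <- p.adjust(raw_p, method = "holm")

round(cbind(raw = raw_p, adjusted = adj_p), 3)
```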

Friedman test
Table: adjusted p-values from the Signed Rank test for each pair of experiments among Exp 1 to Exp 6.
Experiment map: we actually have three pairs of experiments.
    A: Exp 6 & Exp 1: with and without Tween, 1 hour
    B: Exp 2 & Exp 3: with and without Tween, 2.5 hours
    C: Exp 5 & Exp 4: with and without Tween, 1 hour (replicate of A)
The difference detected may not be a useful one.
But note: we looked at the first 20 genes; the full set has 22810.

Aside on blocking
Statistical lingo:
    Experiments are treatments
    Genes are blocks
The Friedman test assumes that all treatments are applied to all blocks: a balanced complete design.
Might not be able to do this:
    too expensive
    blocks only available in packs of fixed size
This gives an incomplete experimental design. Which treatments go with which blocks is a critical issue.
Talk to a statistician before you start.