Statistical Data Analysis 2011/2012 M. de Gunst Lecture 7.


Slide 1: Statistical Data Analysis 2011/2012, M. de Gunst, Lecture 7

Slide 2: Statistical Data Analysis: Introduction
Topics:
- Summarizing data
- Exploring distributions
- Bootstrap
- Robust methods
- Nonparametric tests (continued)
- Analysis of categorical data
- Multiple linear regression

Slide 3: Today's topic: Nonparametric methods for two-sample problems (Chapter 6: 6.3, 6.4)
6. Nonparametric methods (continued)
6.1. One sample: two nonparametric tests for location
6.2. Asymptotic efficiency
6.3. Two samples: nonparametric tests for equality of distributions
  6.3.1. Median test
  6.3.2. Wilcoxon two-sample test
  6.3.3. Kolmogorov-Smirnov two-sample test
  6.3.4. Permutation tests
  6.3.5. Asymptotic efficiency (read yourself)
6.4. Two samples: nonparametric tests for correlation
  6.4.1. Rank correlation test of Spearman
  6.4.2. Rank correlation test of Kendall
  6.4.3. Permutation tests

Slide 4: Nonparametric methods: Introduction (recap)
Nonparametric tests:
- No assumption of a parametric family for the underlying distribution of the data
- For problems with a large class of distributions belonging to H0
- The distribution of the test statistic is the same under every distribution belonging to H0
Why these tests?
- Robust w.r.t. confidence level: confidence level α for a large class of distributions
- More efficient than tests with more assumptions when those assumptions do not hold: fewer observations necessary for the same power

Slide 5: 6.3. Two-sample problem: equality of distributions (1)
Example. Data: thromboglobulin data of Raynaud patients without organ defects (x) and of patients with another auto-immune disease (z)
> mean(x) = 61.56    > mean(z) = 75.11
> median(x) = 54.25  > median(z) = 62.5
> sd(x) = 28.82      > sd(z) = 48.51
> length(x) = 32     > length(z) = 23
Is the distribution of x in some way smaller than that of z?

Slide 6: Two-sample problem: equality of distributions (2)
Example (continued). Is the distribution of x the same as that of z? How to investigate with a plot?
- Boxplots in one figure
- Better: empirical qq-plot of x and z
(In)equality of the distributions is not clear (see also Chapter 3), so investigate further with test(s).

Slide 7: Two-sample problem: equality of distributions (3)
Situation: X_1, ..., X_m realizations of independent random variables with unknown distribution F; Y_1, ..., Y_n realizations of independent random variables with unknown distribution G.
Are F and G the same? Which aspect? Location, spread, general shape, ...
Case 1. Paired observations, m = n
Case 2. Unpaired observations and two independent groups of random variables

Slide 8: Paired samples: equality of distributions
Case 1. Paired observations (X_1, Y_1), ..., (X_n, Y_n) with X_i ~ F and Y_i ~ G.
Main interest: difference in location of F and G.
Consider the differences D_i = X_i - Y_i → one sample.
Now investigate the location of the distribution of D_i with one-sample test(s).

Slide 9: Paired samples: 3 one-sample tests
Case 1. Paired observations (X_i, Y_i) with X_i ~ F, Y_i ~ G. F = G?
Test whether the location of the distribution of the differences D_i = X_i - Y_i equals 0 with one-sample test(s):
i) D_i normal: t-test
ii) X_i and Y_i dependent: sign test, Wilcoxon's signed rank test
iii) X_i and Y_i independent: Wilcoxon's signed rank test, because then under H0 symmetry around 0 is automatic
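The three one-sample options above can be sketched in code. The slides use R; the following is a minimal Python/scipy illustration on invented paired data (the arrays and the seed are hypothetical, not the course data).

```python
# Paired two-sample problem reduced to a one-sample problem on the
# differences d_i = x_i - y_i (sketch with made-up data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=20)
y = x + rng.normal(0.2, 1.0, size=20)   # paired with a small shift
d = x - y

# i) normality plausible: one-sample t-test on the differences
t_stat, t_p = stats.ttest_1samp(d, popmean=0.0)

# ii) sign test: number of positive differences is Binomial(n, 1/2) under H0
n_pos = int(np.sum(d > 0))
sign_p = stats.binomtest(n_pos, n=len(d), p=0.5).pvalue

# iii) Wilcoxon signed-rank test (uses symmetry of D around 0 under H0)
w_stat, w_p = stats.wilcoxon(d)
print(t_p, sign_p, w_p)
```

The sign test only uses the signs of the differences; the signed-rank test also uses their magnitudes, which is why it is typically more powerful when the symmetry assumption holds.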

Slide 10: Unpaired samples: equality of distributions
Case 2. Unpaired observations: X_1, ..., X_m ~ F and Y_1, ..., Y_n ~ G, two independent groups of random variables; m and n may be different.

Slide 11: Unpaired samples: t-test
Two-sample t-test
Assumptions: F normal with mean μ; G normal with mean ν; equal variances
Test statistic: T = (X̄_m - Ȳ_n) / (S sqrt(1/m + 1/n)) ~ t_{m+n-2} under H0, with S the pooled sample standard deviation
If the variances are not equal: adjusted denominator (Welch). Note: this is the default in R: ?t.test
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
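The pooled and Welch variants can be compared directly. A sketch in Python with scipy on made-up samples (sizes mirror the example data, values are invented); `equal_var=False` corresponds to R's default `var.equal = FALSE`.

```python
# Two-sample t-test: pooled-variance version vs. Welch's correction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(60.0, 30.0, size=32)   # hypothetical sample ~ F
z = rng.normal(75.0, 50.0, size=23)   # hypothetical sample ~ G

# equal_var=True  -> classical pooled t-test, df = m + n - 2
t_pooled, p_pooled = stats.ttest_ind(x, z, equal_var=True)

# equal_var=False -> Welch's adjusted denominator (R's t.test default)
t_welch, p_welch = stats.ttest_ind(x, z, equal_var=False)
print(p_pooled, p_welch)
```

With clearly unequal sd's, as in the example data, the Welch version is the safer default.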

Slide 12: 6.3.1. Unpaired samples: median test
Median test (nonparametric)
Assumptions: F, G continuous distributions
Test statistic: T = number of X_i above the median of the combined sample; under H0, T ~ Hyp(m+n, m, k), with k "half" of the total number of observations
Suited (efficient) for shift alternatives: H1: G = F(· - θ)
Does not use much information from the data. (NB: this is not Mood's median test.)

Slide 13: 6.3.2. Unpaired samples: Wilcoxon two-sample test (1)
Wilcoxon two-sample test, Wilcoxon rank sum test, or Mann-Whitney test (nonparametric)
Assumptions: F, G continuous distribution functions
Test statistic: W = sum of the ranks of X_1, ..., X_m in the combined sample of size N = m + n
With ties, or for large n and m: normal approximation
Especially suited (efficient) for shift alternatives: H1: G = F(· - θ)
Uses more information from the data

Slide 14: Unpaired samples: Wilcoxon two-sample test (2)
Wilcoxon rank sum test. Assumptions: F, G continuous distributions. Test statistic: W = sum of the ranks of X_1, ..., X_m in the combined sample of size N = m + n.
Equivalent test statistics used under the same name:
- the same statistic with the roles of the first and second sample switched
- U = W - m(m+1)/2, where W is the sum of the ranks of the first sample; this Mann-Whitney form is the one reported by R
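The equivalence W ↔ U can be checked numerically: for continuous data, U = W - m(m+1)/2 equals the number of pairs (i, j) with x_i > y_j. A Python sketch on invented data (names and sizes are hypothetical):

```python
# Rank-sum W versus Mann-Whitney U on the same data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=10)   # hypothetical first sample
y = rng.normal(0.3, 1.0, size=12)   # hypothetical second sample
m = len(x)

# W = sum of the ranks of the first sample in the combined sample
ranks = stats.rankdata(np.concatenate([x, y]))
w = ranks[:m].sum()

# U = W - m(m+1)/2 counts the pairs (i, j) with x_i > y_j
u = w - m * (m + 1) / 2
u_count = sum(xi > yj for xi in x for yj in y)
print(w, u, u_count)
```

Because the two statistics differ only by the constant m(m+1)/2, they define the same test; software just reports different numbers.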

Slide 15: 6.3.3. Unpaired samples: Kolmogorov-Smirnov test
Kolmogorov-Smirnov two-sample test (nonparametric)
Assumptions: F, G have continuous distribution functions
Test statistic: D = sup_t |F̂_m(t) - Ĝ_n(t)|, the maximal distance between the empirical distribution functions of the two samples (evaluated over the combined sample of size N = m + n)
Especially suited (efficient) for general alternatives: F and G need not have the same shape
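The supremum distance D can be computed by evaluating both empirical CDFs at the pooled data points. A Python/scipy sketch on invented data (same location, different spread, so a shift test would miss the difference):

```python
# Two-sample Kolmogorov-Smirnov statistic, via scipy and by hand.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=32)   # hypothetical sample ~ F
z = rng.normal(0.0, 2.0, size=23)   # hypothetical sample ~ G (wider spread)

# D = sup_t |F_m(t) - G_n(t)|: sensitive to any difference between F and G
d_stat, p_val = stats.ks_2samp(x, z)

# The supremum is attained at a pooled data point, so a finite max suffices
pooled = np.sort(np.concatenate([x, z]))
fm = np.searchsorted(np.sort(x), pooled, side="right") / len(x)
gn = np.searchsorted(np.sort(z), pooled, side="right") / len(z)
d_hand = np.max(np.abs(fm - gn))
print(d_stat, d_hand, p_val)
```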

Slide 16: 6.3.4. Unpaired samples: permutation tests
Permutation tests for unpaired data (nonparametric)
Assumptions: F, G have continuous distribution functions
Test statistic: any T that gives information about the difference between F and G, e.g. T = Med(X)_m - Med(Y)_n, etc.
Test conditionally on the ordered combined sample: the (right-sided) p-value is the fraction of permutations of the combined sample for which T is at least the observed value.

Slide 17: Unpaired samples: illustration (1)
Example. Data: thromboglobulin data of Raynaud patients without organ defects (x) and of patients with another auto-immune disease (z)
F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G (i.e. F(t) ≥ G(t) for all t, with strict inequality for some t)
Note: for the different tests this H1 becomes in R:
- t.test: difference in expectations is less than 0: > t.test(x, z, alternative = "less")
- median test: difference in location less than 0; compute yourself with 1 - phyper(18, 32, 23, 56/2)  # check where the numbers come from!!
- Mann-Whitney/Wilcoxon: difference in location less than 0: > wilcox.test(x, z, alternative = "less")
- Kolmogorov-Smirnov: CDF of X above that of Y: > ks.test(x, z, alternative = "greater")

Slide 18: Unpaired samples: illustration (2)
Example (continued). F distribution of data x; G distribution of data z. H0: F = G; H1: F stochastically smaller than G.
p-values:
- t.test: 0.12
- median test: 0.11
- Mann-Whitney/Wilcoxon: 0.20 (normal approximation was used, due to ties)
- Kolmogorov-Smirnov: 0.31 (R warning)
H0 is not rejected by any of these tests.
Note: we have performed all tests, but
- t.test is not a good candidate, because based on the plots the data are not likely to be normal;
- whether the distributions have the same shape, i.e. whether a shift alternative is a good choice, is not clear: the shapes look similar, but the sd's are quite different. If the shift alternative is appropriate, the median and Mann-Whitney tests are good in terms of power;
- Kolmogorov-Smirnov is a good test also for general types of alternatives. There are ties here and R does not know how to adjust for them, so consider the p-value an approximation.

Slide 19: 6.4. Paired samples: correlation
Paired observations (X_1, Y_1), ..., (X_n, Y_n) with X_i ~ F and Y_i ~ G. Only for paired observations: are X_i and Y_i correlated?
How to start the investigation? Make a scatter plot.
Measures of correlation:
- (Pearson's) sample correlation
- Kendall's rank correlation
- Spearman's rank correlation
All can be used for testing.

Slide 20: 6.4.1/6.4.2. Paired samples: tests for correlation
Only for paired observations (X_i, Y_i), X_i ~ F and Y_i ~ G for all i.
(Pearson's) (linear) correlation test. Assumptions: F normal, G normal. Test statistic: T = r sqrt(n-2) / sqrt(1 - r^2) ~ t_{n-2} under H0, with r the sample correlation.
Kendall's rank correlation test and Spearman's rank correlation test. Assumptions: F, G continuous distribution functions. Test statistics: Kendall's τ̂ and Spearman's ρ̂, respectively. Both are based on ranks: nonparametric.
R: cor.test(x, y, method = "pearson"/"kendall"/"spearman", ...)
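The three correlation tests parallel R's cor.test with its three method choices. A Python/scipy sketch on invented correlated pairs (data and seed are hypothetical):

```python
# Pearson's, Kendall's, and Spearman's correlation tests on the same pairs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=30)
y = 0.5 * x + rng.normal(0.0, 1.0, size=30)   # made-up correlated pairs

# Pearson's test assumes bivariate normality; under H0 the statistic
# r * sqrt(n-2) / sqrt(1 - r^2) follows a t_{n-2} distribution
r, p_pearson = stats.pearsonr(x, y)

# The rank-based alternatives only require continuous marginals
tau, p_kendall = stats.kendalltau(x, y)
rho, p_spearman = stats.spearmanr(x, y)
print(p_pearson, p_kendall, p_spearman)
```

Because the rank statistics are invariant under monotone transformations of x and y, they also detect monotone but nonlinear dependence that Pearson's r can understate.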

Slide 21: 6.4.3. Paired samples: permutation test for correlation
Only for paired observations (X_i, Y_i) for all i.
Permutation tests (nonparametric). Assumptions: F, G continuous distribution functions.
Test statistic: any T that gives information about the dependence between X_i and Y_i, e.g. Kendall's τ̂ or Spearman's ρ̂.
Test conditionally on the combined first and ordered second sample: the (right-sided) p-value is the fraction of permutations for which T is at least the observed value.
Conditional, so the results differ from the former tests with the same statistics.
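A paired permutation test breaks the pairing by shuffling only the second coordinates. A Python sketch using Kendall's tau as the statistic T, with an approximate (sampled) permutation p-value on invented data:

```python
# Right-sided permutation test for correlation, statistic T = Kendall's tau.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, size=15)
y = 0.6 * x + rng.normal(0.0, 1.0, size=15)   # made-up dependent pairs

t_obs = stats.kendalltau(x, y)[0]   # observed Kendall's tau

B = 2000
count = 0
for _ in range(B):
    # permute only y: under H0 (independence) all pairings are equally likely
    t_perm = stats.kendalltau(x, rng.permutation(y))[0]
    if t_perm >= t_obs:
        count += 1
p_right = count / B   # right-sided permutation p-value (approximation)
print(t_obs, p_right)
```

This is the sampled (bootstrap) approximation from the next slides; the exact conditional p-value would enumerate all n! pairings.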

Slide 22: Permutation tests (1)
1. Unpaired observations X_1, ..., X_m and Y_1, ..., Y_n
2. Paired observations (X_i, Y_i), m = n
If the exact permutation p-value is not computable: bootstrap approximation. Generate a large number B of randomly chosen permutations π and approximate the p-value by the corresponding fraction of the B permuted statistics:
1. Replace the full set of permutations of the combined sample by the B sampled ones.
2. Replace the full set of permutations of the second sample by the B sampled ones.

Slide 23: Permutation tests (2)
1. Unpaired observations X_1, ..., X_m and Y_1, ..., Y_n
2. Paired observations (X_i, Y_i), m = n
How to permute in both cases? Instead of a permutation π of 1, ..., m+n (case 1) or of 1, ..., n (case 2), it is easier to permute the data:
1. Permute (X_1, ..., X_m, Y_1, ..., Y_n) and make a new division into two samples of m and n observations.
2. Permute (Y_1, ..., Y_n), leave (X_1, ..., X_n) as it is, and make new pairs.

Slide 24: Permutation test: illustration for unpaired samples (1)
Example. Data: thromboglobulin data of Raynaud patients without organ defects (x) and of patients with another auto-immune disease (z).
F distribution of data x; G distribution of data z. H0: F = G; H1: F stochastically smaller than G.
If interested in specific characteristics: permutation test. Unpaired data: conditional test, given the sorted values of the combined sample.
Choose T; suppose H0 is rejected for small (large) values of T. Then B times, do:
- generate a random permutation of (x_1, ..., x_32, z_1, ..., z_23) and make a new division into two samples of 32 and 23 observations with the R function `sample`
- xperm = first 32 elements of the permuted data
- zperm = last 23 elements of the permuted data
- determine T(xperm, zperm)
Count the fraction of the B values T(xperm, zperm) smaller (larger) than T(x, z) of the thrombo data: this is the p-value.
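The loop described above can be written out directly. A Python sketch with invented stand-ins for the two samples (only the sizes 32 and 23 match the example; values, distribution, and seed are hypothetical), using the difference in means as T and the left-sided counting rule:

```python
# Sampled (bootstrap) permutation test for unpaired samples.
import numpy as np

rng = np.random.default_rng(7)
# made-up stand-ins for the two samples, sizes 32 and 23 as in the example
x = rng.gamma(4.0, 15.0, size=32)
z = rng.gamma(5.0, 15.0, size=23)

def t_stat(a, b):
    return np.mean(a) - np.mean(b)   # choose T; here: difference in means

t_obs = t_stat(x, z)
pooled = np.concatenate([x, z])

B = 5000
count = 0
for _ in range(B):
    perm = rng.permutation(pooled)        # plays the role of R's sample()
    xperm, zperm = perm[:32], perm[32:]   # new division into sizes 32 and 23
    if t_stat(xperm, zperm) <= t_obs:     # left-sided: count smaller values
        count += 1
p_left = count / B
print(t_obs, p_left)
```

Swapping `t_stat` for a difference in medians or sd's gives the other permutation tests reported on the next slide, with no other changes to the loop.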

Slide 25: Permutation test: illustration for unpaired samples (2)
Example (continued). F distribution of data x; G distribution of data z. H0: F = G; H1: F stochastically smaller than G.
Results of permutation tests (bootstrap approximations of left p-values):
- Difference in means, T = mean(X) - mean(Y): 0.096 (several runs: 0.107, 0.091, 0.114, ...)
- Difference in medians, T = median(X) - median(Y): 0.165 (several runs: 0.163, 0.19, 0.18, ...)
- Mann-Whitney, T = U-tilde: 0.183 (several runs: 0.195, 0.214, 0.201, ...) (around the value for the unconditional Mann-Whitney test)
- Difference in sd's, T = sd(X) - sd(Y): 0.045 (several runs: 0.043, 0.053, 0.059, ...)

Slide 26: Recap
6. Nonparametric methods (continued)
6.3. Two samples: nonparametric tests for equality of distributions
  6.3.1. Median test
  6.3.2. Wilcoxon two-sample test
  6.3.3. Kolmogorov-Smirnov two-sample test
  6.3.4. Permutation tests
  6.3.5. Asymptotic efficiency (read yourself)
6.4. Two samples: nonparametric tests for correlation
  6.4.1. Rank correlation test of Spearman
  6.4.2. Rank correlation test of Kendall
  6.4.3. Permutation tests

Slide 27: Nonparametric methods for two-sample problems
The end

