Statistical Data Analysis 2011/2012 M. de Gunst Lecture 3.


1 Statistical Data Analysis 2011/2012 M. de Gunst Lecture 3

2 Statistical Data Analysis: Introduction

Topics:
- Summarizing data
- Exploring distributions (continued)
- Bootstrap (first part)
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression

3 Today's topics: Exploring distributions (Chapter 3: 3.5.2, 3.5.3), Bootstrap (Chapter 4: 4.1, 4.2)

Exploring distributions
3.5. Tests for goodness of fit
3.5.1. Shapiro-Wilk test for normal distribution (last week)
3.5.2. Kolmogorov-Smirnov test for general distribution
3.5.3. Chi-square test for goodness of fit for general distribution

Bootstrap
4.1. Simulation (read yourself)
4.2. Bootstrap estimators for a distribution: parametric bootstrap estimators, empirical bootstrap estimators

4 3.5. Exploring distributions: reminder of testing

Ingredients of a test:
- Hypotheses H0 and H1
- Test statistic T
- Distribution of T under H0, and knowledge of how it is changed/shifted under H1
- Rule for when H0 will be rejected: a rejection rule based either on a critical region or on a p-value

How to perform a test:
1. Describe the above
2. Choose significance level α
3. Calculate and report the value t of T
4. Report whether t is in the critical region, or whether the p-value < α
5. Formulate the conclusion of the test: "H0 rejected" or "H0 not rejected"
6. If possible, translate the conclusion to the practical context

NB. When asked to perform a test, you have to do all 6 steps!

5 Tests for goodness of fit: for (one) general distribution

Situation: X1, ..., Xn independent realizations from an unknown distribution F.
Now: H0: F = F0, one specific distribution; H1: F ≠ F0.
Which statistic gives information about the distribution F?

6 3.5.2. Kolmogorov-Smirnov test (1)

X1, ..., Xn independent realizations from an unknown distribution F.
Idea: use the empirical distribution function F̂n.
Makes sense: n F̂n(x) is a random variable, ~ binom(n, F(x)), so that F̂n(x) → F(x) for n → ∞.
Then also under H0, F̂n(x) → F0(x) for n → ∞.
Base the test on the distance between F̂n and F0.

7 Kolmogorov-Smirnov test (2)

Test statistic: Dn = sup_x | F̂n(x) − F0(x) |.
Distribution of Dn under H0: the same for all continuous F0, so Dn is distribution free over the class of continuous distribution functions. Because of this, the K-S test is a nonparametric test.
When is H0 rejected? For large values of Dn.
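The supremum in Dn is attained at a data point, so the statistic can be computed directly from the order statistics. A minimal sketch (the sample x below is simulated for illustration, not the lecture's data):

```r
# D_n = sup_x |F^_n(x) - F_0(x)| computed by hand, F_0 = N(0,1)
set.seed(1)
x  <- rnorm(20)                 # illustrative sample
n  <- length(x)
xs <- sort(x)
F0 <- pnorm(xs)                 # F_0 at the order statistics
# F^_n jumps at each x_(i): compare F_0 with the value just after
# the jump (i/n) and just before it ((i-1)/n)
Dn <- max(pmax((1:n)/n - F0, F0 - (0:(n - 1))/n))
Dn
ks.test(x, pnorm)$statistic     # agrees with Dn computed above
```

This is exactly the statistic that `ks.test` reports for a one-sample test.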

8 Kolmogorov-Smirnov test (3)

Test statistic: Dn = sup_x | F̂n(x) − F0(x) |; p-values from tables or a computer package.
Note: the standard K-S test with these p-values is not suitable for a composite H0. Then use an adjusted K-S test with adjusted p-values.
Example: for "H0: F is normal" the adjusted test statistic for the K-S test is Dn^adj = sup_x | F̂n(x) − Φ((x − X̄)/S) |, with the estimated mean and standard deviation plugged in.
What is the difference? Additional stochasticity: the plugged-in parameter estimates themselves depend on the data.

9 Kolmogorov-Smirnov test (4)

Example (data: x). H0: F = N(0,1); H1: F ≠ N(0,1). Test statistic: Dn.

R:
> ks.test(x, pnorm)
        One-sample Kolmogorov-Smirnov test
data:  x
D = 0.1163, p-value = 0.4735
alternative hypothesis: two-sided

H0 rejected? No: the p-value 0.4735 is not small.

10 Kolmogorov-Smirnov test (5)

Example (data: y). H0: F is normal ← composite null hypothesis; H1: F is not normal. Test statistic: Dn^adj.

R:
> ks.test(y, pnorm)
D = 0.6922, p-value = 6.661e-16
Incorrect: this is a test for H0: F = N(0,1) vs H1: F ≠ N(0,1).

> mean(y)
[1] 3.62158
> sd(y)
[1] 3.043356
> ks.test(y, pnorm, mean=mean(y), sd=sd(y))
D = 0.1081, p-value = 0.5655
Incorrect: this is a test for H0: F = N(3.62158, (3.043356)²) vs H1: F ≠ N(3.62158, (3.043356)²). We have not used Dn^adj! The p-value should be 0.126 (next week).
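How an adjusted p-value for the composite null can be approximated is sketched below, anticipating next week: simulate the null distribution of Dn^adj by re-estimating (mean, sd) on every simulated sample (a Lilliefors-type correction). The sample y here is a simulated stand-in, not the lecture's data:

```r
# Adjusted K-S p-value for composite H_0: "F is normal", by simulation
set.seed(2)
y <- rnorm(59, mean = 3.6, sd = 3)                 # stand-in for the data
d.obs <- ks.test(y, pnorm, mean = mean(y), sd = sd(y))$statistic

B <- 1000
d.star <- replicate(B, {
  z <- rnorm(length(y), mean(y), sd(y))            # sample under fitted H_0
  # re-estimate the parameters on z, as was done on y
  ks.test(z, pnorm, mean = mean(z), sd = sd(z))$statistic
})
p.adj <- mean(d.star >= d.obs)                     # adjusted p-value
p.adj
```

The naive p-value from `ks.test` ignores that mean(y) and sd(y) were estimated; the simulated reference distribution accounts for that extra stochasticity.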

11 3.5.3. Chi-square test for goodness of fit (1)

X1, ..., Xn independent realizations from an unknown distribution F.
Idea: use the empirical distribution in a different way: divide the real line into intervals I1, ..., Ik and compare the number of data points in the intervals with the expected numbers in the intervals under H0.
Ni = number of observations in Ii
pi = probability of an observation in Ii under F0
Then n pi = expected number of observations in Ii under H0.
Test statistic: X² = Σ_{i=1}^{k} (Ni − n pi)² / (n pi).

12 Chi-square test for goodness of fit (2)

Test statistic: X² = Σ_{i=1}^{k} (Ni − n pi)² / (n pi).
Distribution of X² under H0: different for different F0, but for n → ∞ the distribution of X² under H0 is chi-square with k−1 degrees of freedom, the same for all F0.
For large enough n, X² is (approximately) distribution free, so the chi-square test is a nonparametric test.
When is H0 rejected? For large values of X².

13 Chi-square test for goodness of fit (3)

Test statistic: X² = Σ_{i=1}^{k} (Ni − n pi)² / (n pi).
How to choose the intervals I1, ..., Ik?
How many? More is better, but not too many. Rule of thumb: at least 5 observations expected in each interval under H0.
Where? About the same number expected in each interval under H0.
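Equal-probability breaks under H0 meet the "about the same number expected" rule by construction. A sketch of the whole computation for H0: F = N(4,9), on a simulated stand-in sample (the lecture's local chisquare() helper is not reproduced here; this shows what such a function might do):

```r
# Chi-square goodness-of-fit statistic with equal-probability intervals
set.seed(3)
y <- rnorm(100, mean = 4, sd = 3)            # stand-in sample
k <- 8
b <- qnorm((0:k)/k, mean = 4, sd = 3)        # breaks: -Inf, ..., Inf
N  <- table(cut(y, breaks = b))              # observed counts N_i
np <- length(y) * diff(pnorm(b, 4, 3))       # expected counts n*p_i (12.5 each)
X2 <- sum((N - np)^2 / np)                   # test statistic
p  <- pchisq(X2, df = k - 1, lower.tail = FALSE)
c(X2 = X2, p = p)
```

With these breaks every interval has expected count n/k = 12.5 ≥ 5, so the rule of thumb holds.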

14 Chi-square test for goodness of fit (4)

Example (data: y). H0: F = N(4,9); H1: F ≠ N(4,9). Test statistic: X².

R (chisquare is a local lecture function):
> chisquare(y, pnorm, k=8, lb=0, ub=16, mean=4, sd=3)
$chisquare
[1] 13.11222
$pr
[1] 0.06942085
$N
 (0,2]  (2,4]  (4,6]  (6,8] (8,10] (10,12] (12,14] (14,16]
    14     13      9      4      5       0       1       0
$np
[1] 8 12 12 8 3 0 0 0

# Expected numbers under H0 do not satisfy the rule of thumb.
Better: choose a suitable vector b of `breaks':
> chisquare(y, pnorm, breaks=b, mean=4, sd=3)

15 Chi-square test for goodness of fit (5)

Test statistic: X², under H0: χ²_{k−1}.
The standard chi-square test is not suitable for a composite H0. Then use an adjusted chi-square test with an adjusted chi-square distribution.
Example: for "H0: F is normal" the adjusted chi-square test statistic (with m estimated parameters plugged in; m = 2 for the normal) is under H0: χ²_{k−m−1}, but only for one specific type of estimators.

16 Recap

Exploring distributions
3.5. Tests for goodness of fit
3.5.2. Kolmogorov-Smirnov test for general distribution
3.5.3. Chi-square test for goodness of fit for general distribution

17 4. Bootstrap

18 Bootstrap: Introduction (1)

Example. Data: 59 melting temperatures of beewax; P = unknown true underlying distribution of the beewax data.
Estimator of the location of P? Tn = (sample) Mean.
Estimate of the location of P? tn = mean(beewax) = 63.589.
How accurate is the estimate? How good is the estimator? What is the distribution of Tn: broad or narrow?
Main question: how to estimate the unknown distribution of the estimator Tn. Notation: Q_P.

R:
> beewax
[1] 63.78 63.34 63.36 63.51 ....
> mean(beewax)
[1] 63.58881
> sd(beewax)
[1] 0.3472209
> var(beewax)
[1] 0.1205624

19 Bootstrap: Introduction (2)

(Continued) Simple case: assume P ~ N(μ, σ²); Tn = (sample) Mean. What is the distribution Q_P of Tn? We estimate: N(63.589, 0.121/59). How did we find this?
i) Estimator of P: N((sample) Mean, (sample) Variance)
ii) Estimate: N(63.589, 0.121)
iii) Q_P is the distribution of the Mean of 59 independent observations from P
iv) Estimator of Q_P: N((sample) Mean, (sample) Variance/59)
v) Estimate of Q_P: N(63.589, 0.121/59)

R:
> length(beewax)
[1] 59

20 Bootstrap: Introduction (3)

(Continued) Other case: assume P ~ N(μ, σ²); now Tn = (sample) Median. What is the distribution Q_P of Tn? How to proceed now?
i) Estimator of P: N((sample) Mean, (sample) Variance)
ii) Estimate: N(63.589, 0.121)
iii) Q_P is the distribution of the Median of 59 independent observations from P
iv) Estimator of Q_P: ?
v) Estimate of Q_P: ?
This is what the bootstrap is about: estimate the distribution Q_P of a function Tn of 59 independent observations from unknown P.

21 Bootstrap: Introduction (4)

(Continued) Yet another case: no assumption about P; Tn = (sample) Mean. What is the distribution Q_P of Tn? How to proceed now?
i) Estimator of P: ?
ii) Estimate: ?
iii) Q_P is the distribution of the Mean of 59 independent observations from P
iv) Estimator of Q_P: ?
v) Estimate of Q_P: ?
This is what the bootstrap is about: estimate the distribution Q_P of a function Tn of 59 independent observations from unknown P.

22 4.2. Bootstrap estimators for a distribution

This is what the bootstrap is about: estimate the distribution Q_P of a function Tn of n independent observations from unknown P.
Situation: realizations of X1, ..., Xn, independent, with unknown distribution P.
Goal: estimate the distribution of the estimator Tn = Tn(X1, ..., Xn).
Cases:
1. Assume P is some parametric distribution with unknown parameters
2. Assume nothing about P

23 Bootstrap estimators for a distribution; Case 1: parametric bootstrap estimator (1)

(Beewax; case 1) Case 1: assume P ~ N(μ, σ²); Tn = (sample) Median. What is the distribution Q_P of Tn? How to proceed?
i) Estimator of P: N(X̄, S²) = P̂
ii) Estimate: N(63.589, 0.121)
iii) Q_P is the distribution of the Median of 59 independent observations from P
iv) Estimator of Q_P: distribution of the Median of 59 independent observations from N(X̄, S²) = P̂
v) Estimate of Q_P: distribution of the Median of 59 independent observations from N(63.589, 0.121)
This distribution is unknown: use the computer to generate realizations from the estimate of Q_P. The empirical distribution of the generated set is the parametric bootstrap estimate of Q_P.

24 Bootstrap estimators for a distribution; Case 1: parametric bootstrap estimator (2)

(Continued; case 1) How to generate realizations from the estimate of Q_P, i.e. from the distribution of the Median of 59 independent observations from N(63.589, 0.121)?

# 1. Generate one bootstrap sample:
> xstar = rnorm(59, 63.589, sqrt(0.121))
# Check:
> xstar
[1] 63.84819 62.88915 63.71705 64.06793 .....
[57] 63.56481 64.03403 63.75276
# Note: xstar is of the same length as beewax.
# 2. Now compute one bootstrap value tstar from xstar:
> tstar = median(xstar)
> tstar
[1] 63.70498
# 3. Do 1 and 2 B times. The B values tstar are generated realizations
# from the estimate of Q_P.
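Steps 1 to 3 above can be done in one go with replicate. A sketch using the fitted values 63.589 and 0.121 from the slides:

```r
# B parametric bootstrap values of the median in one call
set.seed(4)
B <- 1000
tstar <- replicate(B, median(rnorm(59, 63.589, sqrt(0.121))))
# empirical distribution of tstar = parametric bootstrap estimate of Q_P;
# its sample variance estimates the variance of the median
var(tstar)
```

The sample variance of tstar should come out near the value 0.0038 reported on the next slide.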

25 Bootstrap estimators for a distribution; Case 1: parametric bootstrap estimator (3)

(Continued; case 1) The B values tstar are generated realizations from the estimate of Q_P, i.e. from the distribution of the Median of 59 independent observations from N(63.589, 0.121).
Recall: the empirical distribution of the generated set is the parametric bootstrap estimate of Q_P.
Also: the sample variance 0.0038 of the generated set is the parametric bootstrap estimate of the variance of Q_P.

26 Bootstrap estimators for a distribution; Case 2: empirical bootstrap estimator (1)

(Beewax; case 2) Case 2: assume nothing about P; Tn = (sample) Mean. What is the distribution Q_P of Tn? How to proceed?
i) Estimator of P: empirical distribution of the data = P̂n
ii) Estimate: empirical distribution of the beewax data
iii) Q_P is the distribution of the Mean of 59 independent observations from P
iv) Estimator of Q_P: distribution of the Mean of 59 independent observations from the empirical distribution of the data
v) Estimate of Q_P: distribution of the Mean of 59 independent observations from the empirical distribution of the beewax data
This distribution is unknown: use the computer to generate realizations from the estimate of Q_P. The empirical distribution of the generated set is the empirical bootstrap estimate of Q_P.

27 Bootstrap estimators for a distribution; Case 2: empirical bootstrap estimator (2)

(Continued; case 2) How to generate realizations from the estimate of Q_P, i.e. from the distribution of the Mean of 59 independent observations from the empirical distribution of the beewax data?

# 1. Generate one bootstrap sample:
> xstar = sample(beewax, replace = TRUE)
# Check:
> xstar
[1] 63.69 64.42 63.30 63.03 63.13 63.13 63.08 63.27 63.08 64.12 64.21 63.43 .....
# Note: xstar is of the same length as beewax and consists of values sampled
# (with replacement) from the set of beewax values.
# 2. Now compute one bootstrap value tstar from xstar:
> tstar = mean(xstar)
> tstar
[1] 63.60271
# 3. Do 1 and 2 B times. The B values tstar are generated realizations
# from the estimate of Q_P.
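Steps 1 to 3 again in one go. Since the beewax data themselves are not reproduced in this transcript, a simulated stand-in sample of the same size is used below:

```r
# B empirical bootstrap values of the mean, resampling with replacement
set.seed(5)
beewax.like <- rnorm(59, mean = 63.589, sd = 0.347)   # stand-in for beewax
B <- 1000
tstar <- replicate(B, mean(sample(beewax.like, replace = TRUE)))
# empirical bootstrap estimate of the variance of Q_P ...
var(tstar)
# ... comparable to the normal-theory estimate (sample variance)/n
var(beewax.like) / 59
```

As on slide 28, the two variance estimates should be close to each other for data like these.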

28 Bootstrap estimators for a distribution; Case 2: empirical bootstrap estimator (3)

(Continued; case 2) The B values tstar are generated realizations from the estimate of Q_P, i.e. from the distribution of the Mean of 59 independent observations from the empirical distribution of the beewax data.
Recall: the empirical distribution of this generated set is the empirical bootstrap estimate of Q_P.
Also: the sample variance 0.00193 of this generated set is the empirical bootstrap estimate of the variance of Q_P.
Note: this value is comparable to the value 0.121/59 = 0.0020 of the estimate of the variance of Q_P under a normality assumption for P.

29 Empirical bootstrap with R

# Can be done in one go with the local R function bootstrap:
> bootstrap = function(x, statistic, B = 100, ...) {
    # Returns a vector of B bootstrap values of a real-valued statistic.
    # statistic(x) should be an R function; further arguments of
    # statistic can be passed via ...
    # Resampling is done from the empirical distribution of x.
    y <- numeric(B)
    for (j in 1:B)
      y[j] <- statistic(sample(x, replace = TRUE), ...)
    y
  }
# Compute 1000 bootstrap values tstar:
> tstarvector = bootstrap(beewax, mean, B = 1000)

30 Bootstrap: two errors

Recall the goal: to estimate the distribution Q_P of a function Tn of n independent observations from unknown P.
Note: in the bootstrap estimation procedure two types of "errors" are made. Which ones? Given the data:
- First error: the estimate of Q_P is the distribution of a function Tn of n independent observations from an estimate of P, not from P itself.
- Second error: that distribution is in turn estimated by the empirical distribution of B computer-generated realizations from it.
How can we make these errors small? The size of the first error depends on the quality of the estimator of P. The size of the second error can be made small by taking B large.

31 Recap

Bootstrap
4.1. Simulation (read yourself)
4.2. Bootstrap estimators for a distribution: parametric bootstrap estimators, empirical bootstrap estimators

32 Exploring distributions / Bootstrap: The end

