Statistical Data Analysis 2011/2012 M. de Gunst Lecture 2.


1 Statistical Data Analysis 2011/2012 M. de Gunst Lecture 2

2 Statistical Data Analysis: Introduction. Topics: Summarizing data; Exploring distributions; Bootstrap; Robust methods; Nonparametric tests; Analysis of categorical data; Multiple linear regression.

3 Today’s topic: Exploring distributions (Chapter 3: 3.1-3.5.1). Introduction: goal of investigations; 3.1. Quantile function and location-scale families; 3.2. QQ-plots for one sample; 3.3. Symplots; 3.4. (Empirical) QQ-plots for two samples; 3.5. Goodness of fit tests; 3.5.1. Shapiro-Wilk test for normal distribution.

4 Exploring distributions: introduction (1). Data set: its empirical distribution; its underlying distribution.

5 Exploring distributions: introduction (2). Data set: values of a variable, measured on a sample from a population. Empirical distribution = distribution of the data; underlying distribution = distribution of the variable in the population. Goal: find the underlying distribution. The empirical distribution helps to determine the underlying distribution. A different sample from the same population has a different empirical distribution but the same underlying distribution.

6 Exploring distributions: introduction (3). Goal: find the underlying distribution. What do the graphical and numerical summaries tell us about the underlying distribution of a data set? And what not?

7 Exploring distributions: introduction (4). Goal: find the underlying distribution. This lecture’s questions. For one sample: Do the data originate from a specific distribution? Is the underlying distribution symmetric? For two samples: Do the samples have the same underlying distribution? Answers with graphs and tests.

8 3.1. Quantile function and location-scale families. F distribution function of X. α-quantile: $x_\alpha$ such that $F(x_\alpha) = \alpha$, if F strictly increasing. More general: quantile function of F: $F^{-1}(\alpha) = \inf\{x : F(x) \ge \alpha\}$. [Figure: a distribution function and two of its quantiles.]
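A minimal R sketch of the quantile function, assuming the standard normal as an example of a strictly increasing F and a binomial(10, 0.5) distribution as an example where F has jumps. The variable names are illustrative only; `qnorm`, `pnorm`, `qbinom` and `pbinom` are base R.

```r
# Strictly increasing F (standard normal): the quantile function is the ordinary inverse.
alpha <- c(0.25, 0.975)
x_alpha <- qnorm(alpha)             # F^{-1}(alpha)
back <- pnorm(x_alpha)              # F(F^{-1}(alpha)) gives alpha back

# Discrete F (binomial(10, 0.5)): F has jumps, so the generalized inverse is needed:
# F^{-1}(alpha) = smallest x with F(x) >= alpha.
q_bin <- qbinom(0.3, size = 10, prob = 0.5)
support <- 0:10
q_manual <- min(support[pbinom(support, size = 10, prob = 0.5) >= 0.3])
```

Both routes to the binomial quantile should agree, which illustrates that the "more general" definition is the one R's quantile functions implement for discrete distributions.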

9 Location-scale family (1). If X has distribution function F, then a + bX (with b > 0) has distribution function $F_{a,b}$ given by $F_{a,b}(x) = F\left(\frac{x-a}{b}\right)$. Location-scale family of F: $\{F_{a,b} : a \in \mathbb{R},\ b > 0\}$. Expectation and variance of distribution $F_{a,b}$?
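One way to answer the question on this slide, assuming that X has finite expectation $\mu$ and variance $\sigma^2$ (these two symbols are introduced here for illustration) and that b > 0:

```latex
\begin{align*}
F_{a,b}(x) &= P(a + bX \le x) = P\!\left(X \le \tfrac{x-a}{b}\right) = F\!\left(\tfrac{x-a}{b}\right),\\
\mathrm{E}(a + bX) &= a + b\,\mathrm{E}X = a + b\mu,\\
\mathrm{Var}(a + bX) &= b^{2}\,\mathrm{Var}X = b^{2}\sigma^{2}.
\end{align*}
```

So a shifts the expectation, and b scales the standard deviation by a factor b.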

10 Location-scale family (2). Quantiles of F and $F_{a,b}$ in the location-scale family of F have a linear relationship: $F_{a,b}^{-1}(\alpha) = a + b\,F^{-1}(\alpha)$. Example: quantiles of N(2,16) against quantiles of N(0,1).
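The example on this slide can be reproduced with a short R sketch. It assumes the usual convention that N(2,16) has variance 16, so standard deviation 4; the grid of probabilities `p` is an arbitrary choice.

```r
# Quantiles of N(2, 16) (mean 2, sd 4) against quantiles of N(0, 1).
p <- seq(0.01, 0.99, by = 0.01)
q_std <- qnorm(p)                      # quantiles of F = N(0, 1)
q_ls  <- qnorm(p, mean = 2, sd = 4)    # quantiles of F_{a,b} with a = 2, b = 4

plot(q_std, q_ls, xlab = "Quantiles of N(0,1)", ylab = "Quantiles of N(2,16)")
abline(a = 2, b = 4)                   # the points lie on the line y = 2 + 4x

line_fit <- coef(lm(q_ls ~ q_std))     # intercept = location a, slope = scale b
```

The fitted intercept and slope recover the location 2 and the scale 4.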

11 Different location-scale families. Quantiles of F and G from different location-scale families do not have a linear relationship. Example: quantiles of members of different location-scale families against each other.
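A small R sketch of such a comparison, assuming exp(1) and N(0,1) as the two families (the slide does not say which distributions were plotted). The deviation from the best-fitting straight line is used as a rough numerical summary of the curvature.

```r
# Quantiles of exp(1) against quantiles of N(0, 1): two different location-scale families.
p <- seq(0.01, 0.99, by = 0.01)
q_norm <- qnorm(p)
q_exp  <- qexp(p, rate = 1)

plot(q_norm, q_exp, xlab = "Quantiles of N(0,1)", ylab = "Quantiles of exp(1)")

# Largest departure from the least-squares line; it would be numerically zero
# if both distributions belonged to the same location-scale family.
max_dev <- max(abs(residuals(lm(q_exp ~ q_norm))))
```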

12 3.2. Plots for one sample: QQ-plots (1). So far: theoretical quantiles. For data $x_1, \dots, x_n$: empirical α-quantile $x_\alpha$: a fraction α of $x_1, \dots, x_n$ is ≤ $x_\alpha$. Empirical quantiles roughly correspond to theoretical quantiles of the underlying distribution. QQ-plot: plot of empirical quantiles against theoretical quantiles of a particular distribution. If the QQ-plot shows a roughly straight line, then the underlying distribution of the data belongs to the location-scale family of that distribution.
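A minimal R sketch of an empirical quantile, assuming the convention "smallest data value such that at least a fraction α of the data is ≤ it". Other conventions exist (base R's `quantile` offers nine types); the simulated data and the choice α = 0.25 are illustrative only.

```r
set.seed(1)
x <- rnorm(40)
alpha <- 0.25
n <- length(x)

# Smallest data value such that at least a fraction alpha of the data is <= it.
x_alpha <- sort(x)[ceiling(n * alpha)]

frac_below <- mean(x <= x_alpha)       # fraction of the data that is <= x_alpha

# type = 1 in quantile() is the inverse of the empirical distribution function.
same_as_type1 <- quantile(x, alpha, type = 1, names = FALSE)
```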

13 Plots for one sample: QQ-plots (2). QQ-plot: plot of empirical quantiles against theoretical quantiles of a particular distribution. If the QQ-plot shows a roughly straight line, then the underlying distribution of the data belongs to the location-scale family of that distribution. Example panels: 25 data from N(0,9); 100 data from N(0,9). R: qqnorm, qqexp, qqchisq, etc.
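A sketch of how panels like these could be produced with base R's `qqnorm`. It assumes N(0,9) denotes variance 9 (standard deviation 3); the seed and the use of a correlation as a summary of straightness are additions for illustration, not part of the lecture.

```r
set.seed(2011)
x25  <- rnorm(25,  mean = 0, sd = 3)   # N(0, 9): variance 9, so sd = 3
x100 <- rnorm(100, mean = 0, sd = 3)

op <- par(mfrow = c(1, 2))
qqnorm(x25,  main = "25 data from N(0,9)");  qqline(x25)
qqnorm(x100, main = "100 data from N(0,9)"); qqline(x100)
par(op)

# Straightness of the point cloud, summarised by the correlation in the QQ-plot.
qq25 <- qqnorm(x25, plot.it = FALSE)
r25 <- cor(qq25$x, qq25$y)
```

With only 25 observations the points scatter more around the line than with 100, even though both samples come from the same normal distribution.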

14 Plots for one sample: QQ-plots (3). If the QQ-plot shows a roughly straight line, then the underlying distribution of the data belongs to the location-scale family of the “distribution on x-axis”. Now: which specific distribution of this family do the data come from, or: which location and scale does the underlying distribution have? How to estimate expectation and standard deviation of the underlying distribution from the QQ-plot? Recall: intercept (a) and slope (b) of the best fitting line in the QQ-plot are the location and scale parameters w.r.t. the “distribution on x-axis”.
  - Estimate by eye intercept (a) and slope (b) of the best fitting line in the QQ-plot;
  - Express expectation (or standard deviation) of the underlying distribution in terms of the (known) expectation (or (known) standard deviation) of the “distribution on x-axis”, a and b.
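The two steps can be sketched in R as follows. Note the assumption: a least-squares line replaces the estimate "by eye", and the simulated data (true location 5, scale 2) are purely illustrative.

```r
set.seed(42)
x <- rnorm(100, mean = 5, sd = 2)       # illustrative data; true location 5, scale 2

qq <- qqnorm(x, plot.it = FALSE)        # qq$x: N(0,1) quantiles, qq$y: the data
fit <- coef(lm(qq$y ~ qq$x))            # least-squares line instead of "by eye"
a_hat <- unname(fit[1])                 # intercept
b_hat <- unname(fit[2])                 # slope

# N(0,1) has expectation 0 and standard deviation 1, so for the underlying distribution:
mu_hat    <- a_hat + b_hat * 0          # expectation = a + b * 0
sigma_hat <- b_hat * 1                  # standard deviation = b * 1
```

For a "distribution on x-axis" with expectation m and standard deviation s, the same reasoning gives a + b·m and b·s.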

15 Plots for one sample: QQ-plots (4). QQ-plot: plot of empirical quantiles against theoretical quantiles of a particular distribution. If the QQ-plot does not show a straight line, then the underlying distribution of the data belongs to the location-scale family of another distribution. Example panels: 25 data from N(0,9); 100 data from N(0,9). In this case: of one with heavier tails. How to see this?

16 3.3. Plots for one sample: symplot (1). Symmetry: X is symmetric around θ if X – θ and θ – X have the same distribution. If X is continuous, then its probability density is symmetric around θ. How to check? Histogram, stem-and-leaf plot; skewness parameter; difference between sample mean and sample median; quantile function → symplot.
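The numerical checks in this list can be sketched in R. The `skewness` helper below is a hypothetical one-liner using one common convention (third central moment divided by the cubed standard deviation, both with divisor n); it is not a base R function, and the simulated samples are illustrative.

```r
set.seed(3)
x_sym  <- rnorm(200)          # symmetric around 0
x_skew <- rexp(200)           # right-skewed

# Hypothetical helper: sample skewness as third central moment over sd^3.
skewness <- function(x) mean((x - mean(x))^3) / (mean((x - mean(x))^2))^1.5

skew_sym  <- skewness(x_sym)
skew_skew <- skewness(x_skew)
diff_sym  <- mean(x_sym)  - median(x_sym)
diff_skew <- mean(x_skew) - median(x_skew)
```

For the symmetric sample both summaries should be close to zero; for the exponential sample both should be clearly positive.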

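17 Plots for one sample: symplot (2). If X is symmetric around θ, then $F^{-1}(1-\alpha) - \theta = \theta - F^{-1}(\alpha)$. Thus: linear relationship between lower and upper theoretical quantiles. Analogously for data from a symmetric distribution: linear relationship between lower and upper empirical quantiles. Symplot: plot of points pairing lower and upper empirical quantiles (formula not preserved in the transcript). R: symplot. [Figure: density symmetric around θ, with area α below $F^{-1}(\alpha)$ and above $F^{-1}(1-\alpha)$.]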
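A hedged sketch of a symmetry plot in R. The `symplot` function named on the slide is not part of base R, so `symplot_sketch` below is a hypothetical stand-in. It assumes one commonly used definition, plotting the points (median − x_(i), x_(n−i+1) − median) for the lower half of the order statistics, which may differ from the definition used in the course.

```r
# Hypothetical helper, not the course's symplot: plots distances of the upper
# order statistics to the median against distances of the lower ones.
symplot_sketch <- function(x, ...) {
  xs <- sort(x)
  n <- length(xs)
  k <- floor(n / 2)
  med <- median(xs)
  lower <- med - xs[1:k]                # median - x_(i)
  upper <- xs[n:(n - k + 1)] - med      # x_(n-i+1) - median
  plot(lower, upper, xlab = "Distance below median",
       ylab = "Distance above median", ...)
  abline(0, 1)                          # symmetric data lie near the line y = x
  invisible(data.frame(lower = lower, upper = upper))
}

set.seed(17)
pts_norm <- symplot_sketch(rnorm(100), main = "N(0,1)")
pts_exp  <- symplot_sketch(rexp(100),  main = "exp(1)")
```

For the normal sample the points should scatter around the line y = x; for the exponential sample the upper distances grow much faster than the lower ones, so the points bend away from the line.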

18 Plots for one sample: symplot (3). If the data come from a symmetric distribution: linear relationship between lower and upper empirical quantiles. Three examples: N(0,1); exp(1); chisq(df=3).

19 Intermezzo: Scheme for exploring distributions

20 3.4. Plots for two samples: (empirical) QQ-plots (1). Do two samples have the same underlying distribution? How to answer this with empirical quantiles?

21 Plots for two samples: (empirical) QQ-plots (2). Plot for comparing the distributions of two samples $x_1, \dots, x_m$ and $y_1, \dots, y_n$. (Empirical) QQ-plot: if m = n, then plot the points $(x_{(i)}, y_{(i)})$, $i = 1, \dots, n$. If m < n, then plot the points (formula not preserved in the transcript). R: qqplot
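A short R sketch of a two-sample QQ-plot with base R's `qqplot`. The two simulated samples, their sizes and their parameters are assumptions for illustration. For unequal sample sizes, `qqplot` interpolates the order statistics of the larger sample, which may or may not match the exact rule on the slide.

```r
set.seed(5)
a <- rnorm(30, mean = 0, sd = 1)        # m = 30
b <- rnorm(80, mean = 1, sd = 2)        # n = 80, same family, other location and scale

qq <- qqplot(a, b, xlab = "Sample a", ylab = "Sample b")
abline(0, 1, lty = 2)                   # line y = x: identical distributions

# With unequal sizes, qqplot interpolates the order statistics of the larger
# sample, so the number of plotted points equals the smaller sample size.
n_points <- length(qq$x)
```

A roughly straight point cloud that does not follow the dashed line y = x suggests the same location-scale family with a different location or scale.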

22 Plots for two samples: (empirical) QQ-plots (3). Do the data in samples a and b have the same underlying distribution? We see a roughly straight line, but not the line y = x, so the underlying distributions of a and b come from the same location-scale family, but are not the same!

23 3.5. Tests for goodness of fit (1). Exploring distributions: now more formal methods: testing.
  Ingredients of a test:
  - Hypotheses $H_0$ and $H_1$
  - Test statistic T
  - Distribution of T under $H_0$, and knowledge of how it is changed/shifted under $H_1$
  - Rule for when $H_0$ will be rejected, based either on a critical region or on a p-value
  How to perform a test:
  1. Describe the 4 ingredients above
  2. Choose significance level α
  3. Calculate and report the value t of T
  4. Report whether t is in the critical region, or whether p-value < α
  5. Formulate the conclusion of the test: “$H_0$ rejected” or “$H_0$ not rejected”
  6. If possible, translate the conclusion to the practical context
  NB. When asked to perform a test, you have to do all 6 steps!

24 Tests for goodness of fit (2). Situation: $x_1, \dots, x_n$ independent realizations from an unknown distribution F. Test $H_0: F \in \mathcal{F}_0$ against $H_1: F \notin \mathcal{F}_0$; often $\mathcal{F}_0 = \{F_0\}$, or $\mathcal{F}_0$ is a location-scale family. Be cautious with too strong conclusions. When n is small, the power is small and the conclusion is not very reliable. When n is very, very large, $H_0$ is almost always rejected.
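The warning about sample size can be illustrated with a small simulation. This is a sketch under stated assumptions: it uses the Shapiro-Wilk test (`shapiro.test` in base R, treated in section 3.5.1) as the goodness of fit test, t-distributed data with 5 degrees of freedom as a mild deviation from normality, and arbitrary choices for the sample sizes, number of repetitions and seed.

```r
set.seed(123)
alpha <- 0.05

# Fraction of simulated samples of size n for which normality is rejected.
reject_rate <- function(n, reps = 200) {
  # Data from a t-distribution with 5 df: symmetric but heavier-tailed than normal.
  mean(replicate(reps, shapiro.test(rt(n, df = 5))$p.value < alpha))
}

power_small <- reject_rate(20)      # small n: the deviation is often missed
power_large <- reject_rate(2000)    # large n: the deviation is almost always detected
```

The same deviation from the null hypothesis leads to very different conclusions depending on n, which is why "H0 not rejected" for a small sample is weak evidence for normality.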

25 3.5.1. Goodness of fit tests: Shapiro-Wilk test (1). Test for the null hypothesis that the underlying distribution is normal: $x_1, \dots, x_n$ independent realizations from an unknown distribution F; now $\mathcal{F}_0$ is the location-scale family of normal distributions. Test statistic: $W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ for certain constants $a_i$ (values in (0,1]). Distribution under $H_0$ from tables or computer package. $H_0$ is rejected for “small” values of W. R: shapiro.test

26 Goodness of fit tests: Shapiro-Wilk test (2). R:
  > shapiro.test(beewax)
          Shapiro-Wilk normality test
  data:  beewax
  W = 0.9748, p-value = 0.2579
  Conclusion?
  > shapiro.test(rexp(50))
          Shapiro-Wilk normality test
  data:  rexp(50)
  W = 0.9353, p-value = 0.008848
  Conclusion? (“small”)
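A runnable sketch of the same kind of session. The beewax data are not reproduced here, so a simulated normal sample stands in for them; the sample sizes, the seed, the significance level 0.05 and the helper `decision` are all assumptions for illustration.

```r
set.seed(2012)
x_norm <- rnorm(50, mean = 10, sd = 2)   # simulated stand-in; not the beewax data
x_exp  <- rexp(200)                      # clearly non-normal sample

alpha <- 0.05
res_norm <- shapiro.test(x_norm)
res_exp  <- shapiro.test(x_exp)

# Decision rule based on the p-value: reject H0 (normality) if p-value < alpha.
decision <- function(res) {
  if (res$p.value < alpha) "H0 rejected" else "H0 not rejected"
}
decision(res_norm)
decision(res_exp)
```

The result objects also contain the value of W in `res_norm$statistic` and `res_exp$statistic`, which should be reported together with the conclusion.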

27 Recap of Chapter 3. Introduction: goal of investigations; 3.1. Quantile function and location-scale families; 3.2. QQ-plots for one sample; 3.3. Symplots; 3.4. (Empirical) QQ-plots for two samples; 3.5. Goodness of fit tests; 3.5.1. Shapiro-Wilk test for normal distribution.

28 Investigating distributions. The end.

