Biomedical Statistics: Testing for Normality and Symmetry
Teacher: Jang-Zern Tsai (蔡章仁)
Student: 邱瑋國


Outline
Graphical Tests
Analysis of Skewness and Kurtosis
Statistical Tests: chi-square test for normality, Kolmogorov-Smirnov, Shapiro-Wilk (original test), Shapiro-Wilk (expanded test)
Transformations

Graphical Tests: Histogram
A histogram can be used to check whether data is normally distributed. Example: determine whether the data in column B of Figure 1 are normally distributed using a histogram. In this case the histogram doesn't look particularly normal.
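As a minimal sketch of the same visual check outside Excel, here is a Python version; the random sample, mean, and standard deviation below are illustrative assumptions, not the data of Figure 1:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=3.7, scale=1.5, size=200)  # illustrative sample, not the Figure 1 data

# Plot a density-scaled histogram and overlay the normal curve implied
# by the sample mean and standard deviation for a visual comparison.
plt.hist(data, bins=15, density=True, alpha=0.6, edgecolor="black")
x = np.linspace(data.min(), data.max(), 200)
plt.plot(x, norm.pdf(x, data.mean(), data.std(ddof=1)))
plt.title("Histogram vs. fitted normal curve")
plt.show()
```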

QQ Plot
A PP plot (probability-probability plot) and a QQ plot (quantile-quantile plot) are two closely related graphical tests. Example: using a QQ plot, determine whether the data set {-5.2, -3.9, -2.1, 0.2, 1.1, 2.7, 4.9, 5.3} is normally distributed. We also standardize the original data so that it is easier to compare the standardized data with the standard normal approximation for each data point. Finally, we include a scatter diagram (the QQ plot) of the data vs. the standardized normal data.
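A sketch of this QQ plot in Python using scipy's probplot, which plots the sample quantiles against theoretical standard normal quantiles (essentially the scatter diagram described above):

```python
import matplotlib.pyplot as plt
from scipy import stats

data = [-5.2, -3.9, -2.1, 0.2, 1.1, 2.7, 4.9, 5.3]  # the data set from the example

# Points close to the reference line suggest approximate normality.
stats.probplot(data, dist="norm", plot=plt)
plt.title("QQ plot vs. the standard normal distribution")
plt.show()
```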

Real Statistics Excel Support The Real Statistics Resource Pack contains a supplemental data analysis tool which creates QQ plots automatically. In the current implementation, the data must be organized as a single column. We illustrate the use of the QQ Plot data analysis tool in the following example.

Box Plots
While box plots can't actually be used to test for normality, they can be useful for testing for symmetry, which is often a sufficient substitute for normality.

Analysis of Skewness and Kurtosis
Since the skewness and kurtosis of the normal distribution are zero, values for these two parameters should be close to zero for data to follow a normal distribution. A rough measure of the standard error of the skewness is √(6/n), where n is the sample size, and a rough measure of the standard error of the kurtosis is √(24/n). If the absolute value of the skewness for the data is more than twice its standard error, this indicates that the data are not symmetric, and therefore not normal. Similarly, if the absolute value of the kurtosis for the data is more than twice its standard error, this is also an indication that the data are not normal.
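A minimal Python sketch of this rule of thumb; the random sample is an illustrative assumption:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
data = rng.normal(size=100)  # illustrative sample
n = len(data)

se_skew = np.sqrt(6 / n)   # rough standard error of the skewness
se_kurt = np.sqrt(24 / n)  # rough standard error of the (excess) kurtosis

s = skew(data)
k = kurtosis(data)         # Fisher definition, so a normal distribution gives 0

# Flag non-normality when the statistic exceeds twice its standard error.
print(f"skewness = {s:.3f}, 2*SE = {2 * se_skew:.3f}, suspect: {abs(s) > 2 * se_skew}")
print(f"kurtosis = {k:.3f}, 2*SE = {2 * se_kurt:.3f}, suspect: {abs(k) > 2 * se_kurt}")
```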

Statistical Tests for Normality and Symmetry
Chi-square test for normality. The chi-square goodness of fit test can be used to test the hypothesis that data come from a normal distribution. In particular, we can use Theorem 2 of Goodness of Fit to test the null hypothesis H0: the data are normally distributed. We now perform the chi-square goodness of fit test. Since the observed and expected frequencies of the first and last intervals are less than 5, it is better to combine the 1st and 2nd intervals as well as the last and second-to-last intervals. The chi-square test statistic is 4.47, which is less than the critical value of 14.07, and so we conclude that there is a good fit. Note that df = number of intervals − 1 = 8 − 1 = 7, since the mean and standard deviation are given rather than estimated from the data.

We next test the null hypothesis that the data are normally distributed using the sample mean and variance (3.74 and 2.20 respectively, as seen in Figure 3) as estimates for the population mean and variance. As in Example 1, we combine the first two and the last two intervals so that all frequencies are at least 5. Once again we use a chi-square goodness of fit test based on 8 intervals, but this time, since the mean and variance are estimated parameters, per Theorem 3 of Goodness of Fit the degrees of freedom are reduced by 2: df = 8 − 1 − 2 = 5.
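A rough Python sketch of the chi-square normality test with estimated parameters; the generated sample and the equal-width 8-interval binning are illustrative assumptions (in practice, intervals with small expected counts would be combined as described above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=3.74, scale=np.sqrt(2.20), size=200)  # illustrative sample
mu, sigma = data.mean(), data.std(ddof=1)                   # estimated parameters

# Bin the data into 8 equal-width intervals and compute the expected
# counts for each interval under the fitted normal distribution.
edges = np.linspace(data.min(), data.max(), 9)
observed, _ = np.histogram(data, bins=edges)
expected = len(data) * np.diff(stats.norm.cdf(edges, mu, sigma))
expected *= observed.sum() / expected.sum()  # make the totals match

# ddof=2 subtracts two extra degrees of freedom for the estimated mean
# and variance, so df = 8 - 1 - 2 = 5 as in the text.
chi2, p = stats.chisquare(observed, expected, ddof=2)
print(f"chi-square = {chi2:.2f}, p-value = {p:.3f}")
```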

Kolmogorov-Smirnov Test
Definition 1: Let x1, …, xn be an ordered sample with x1 ≤ … ≤ xn and define Sn(x), the empirical distribution function, as follows:
Sn(x) = (number of sample elements xi ≤ x) / n
Now suppose that the sample comes from a population with cumulative distribution function F(x) and define Dn as follows:
Dn = sup over x of |F(x) − Sn(x)|
Observation: It can be shown that the distribution of Dn doesn't depend on F. Since Sn(x) depends on the sample chosen, Dn is a random variable. Our objective is to use Dn as a way of measuring how well Sn(x) estimates F(x). The distribution of Dn can be calculated, but for our purposes the important aspect of this distribution is its set of critical values. These can be found in the Kolmogorov-Smirnov Table.

If Dn,α is the critical value from the table, then P(Dn ≤ Dn,α) = 1 − α. Dn can be used to test the hypothesis that a random sample came from a population with a specific distribution function F(x). If Dn ≤ Dn,α, then the sample data is a good fit with F(x). Also, from the definition of Dn given above, it follows that Dn is the largest of the values |F(xi) − Sn(xi)| and |F(xi) − Sn(xi−1)| over all i (with Sn(x0) taken to be 0), which is how Dn is computed in practice.

Using the sample mean and standard deviation as estimates of the population parameters, we can now build the table that allows us to carry out the KS test. Dn is the largest value in column G. If the data is normally distributed, then the critical value Dn,α will be larger than Dn. From the Kolmogorov-Smirnov Table we see that Dn,α = D1000,.05 = 1.36 / SQRT(1000) ≈ .043. Since Dn is smaller than Dn,α, we conclude that the data is a good fit with the normal distribution.
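For comparison, a minimal Python sketch of the same one-sample KS test; the n = 1000 sample is an illustrative assumption. Note that when the mean and standard deviation are estimated from the same sample, the plain KS critical values are only approximate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(size=1000)  # illustrative sample with n = 1000

# Test the sample against a normal distribution whose mean and standard
# deviation are taken from the sample itself, as in the worked example.
mu, sigma = data.mean(), data.std(ddof=1)
Dn, p = stats.kstest(data, "norm", args=(mu, sigma))

crit = 1.36 / np.sqrt(len(data))  # approximate 5% critical value from the KS table
print(f"Dn = {Dn:.4f}, Dn,alpha = {crit:.4f}, p-value = {p:.3f}")
```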

Shapiro-Wilk Original Test
1. Rearrange the data in ascending order so that x1 ≤ … ≤ xn.
2. Calculate SS as follows: SS = Σ(xi − x̄)².
3. If n is even, let m = n/2, while if n is odd let m = (n−1)/2.
4. Calculate b as follows, taking the ai weights from Table 1 in the Shapiro-Wilk Table (based on the value of n): b = Σ from i = 1 to m of ai (x(n+1−i) − xi). Note that if n is odd, the median data value is not used in the calculation of b.
5. Calculate the test statistic W = b² ⁄ SS.
6. Find the value in Table 2 of the Shapiro-Wilk Table (for the given value of n) that is closest to W, interpolating if necessary. This is the probability that the data comes from a normal distribution.

We begin by sorting the data in column A using Data > Sort & Filter|Sort or the QSORT supplemental function, putting the results in column B. We next look up the coefficient values for n = 12 (the sample size) in Table 1 of the Shapiro-Wilk Table, putting these values in column E. Corresponding to each of these 6 coefficients a1, …, a6, we calculate the differences x12 − x1, x11 − x2, …, x7 − x6, where xi is the ith data element in sorted order. E.g., since x1 = 35 and x12 = 86, we place the difference 86 − 35 = 51 in the corresponding cell. Column I contains the products of the coefficient and difference values; e.g., cell I5 contains the formula =E5*H5. The sum of these products gives b.

We next calculate SS, and thus W = b² ⁄ SS. We now look up the row for n = 12 in Table 2 of the Shapiro-Wilk Table and find that our value of W lies between the entries for p = .50 and p = .90: the W value for .5 is .943 and the W value for .9 is .973. Interpolating between these values, we arrive at p-value = .87. Since p-value = .87 > .05 = α, we retain the null hypothesis that the data are normally distributed.
Real Statistics functions:
SHAPIRO(R1, False) = the Shapiro-Wilk test statistic W for the data in the range R1
SWTEST(R1, False) = p-value of the Shapiro-Wilk test on the data in R1
SWCoeff(n, j, False) = the jth coefficient for samples of size n
SWCoeff(R1, C1, False) = the coefficient corresponding to cell C1 within the sorted range R1
SWPROB(n, W) = p-value of the Shapiro-Wilk test for a sample of size n and test statistic W
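Outside Excel, a quick check can be done with scipy's shapiro function (scipy implements a later algorithmic version of the test, so its p-value may differ somewhat from the table interpolation above); in the sample below only x1 = 35 and x12 = 86 come from the text, the intermediate values are hypothetical:

```python
from scipy import stats

# Sorted sample of n = 12; only the endpoints 35 and 86 are from the example.
data = [35, 40, 43, 48, 52, 57, 61, 65, 70, 74, 79, 86]

W, p = stats.shapiro(data)
print(f"W = {W:.3f}, p-value = {p:.3f}")
if p > 0.05:
    print("Retain the null hypothesis of normality at alpha = .05")
```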

Shapiro-Wilk Expanded Test
The following version of the Shapiro-Wilk test handles samples between 12 and 5,000 elements, although samples of at least 20 elements are recommended. We also show how to handle samples with more than 5,000 elements. Assuming that the sample has n elements, perform the following steps:
1. Sort the data in ascending order, x1 ≤ … ≤ xn.
2. Define the values m1, …, mn by mi = NORMSINV((i − .375)/(n + .25)).
3. Let M = [mi] be the n × 1 column vector whose elements are these mi and let m = MᵀM = Σ mi².

If M is represented by the n × 1 range R1 in Excel, then =SUMSQ(R1) calculates the value m.
4. Set u = 1/√n and define the coefficients a1, …, an, where ai = mi/√ε for 2 < i < n − 1 (here ε is a normalizing constant and the end coefficients are given by polynomial approximations in u). It turns out that ai = −a(n−i+1) for all i and that AᵀA = Σ ai² = 1, where A = [ai] is the n × 1 column vector whose elements are the ai.
5. The W statistic is now defined by
W = (Σ ai xi)² / Σ(xi − x̄)²
Because of the above properties of the coefficients a1, …, an, it turns out that W is the square of the correlation coefficient between a1, …, an and x1, …, xn. Thus the values of W are always between 0 and 1.
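A minimal Python sketch of steps 1 through 5; as a simplification it uses the weights ai = mi/√m (the Shapiro-Francia variant) in place of the exact Royston end-coefficient polynomials, so the resulting W is an approximation:

```python
import numpy as np
from scipy.stats import norm

def w_statistic(sample):
    """Approximate W via steps 1-5, using the simplified weights
    a_i = m_i / sqrt(m) in place of the exact Royston coefficients."""
    x = np.sort(np.asarray(sample, dtype=float))  # step 1: sort ascending
    n = len(x)
    i = np.arange(1, n + 1)
    m_vec = norm.ppf((i - 0.375) / (n + 0.25))    # step 2: m_i = NORMSINV(...)
    m = np.sum(m_vec ** 2)                        # step 3: m = M'M
    a = m_vec / np.sqrt(m)                        # step 4 (simplified); sum(a^2) = 1
    b = np.sum(a * x)
    ss = np.sum((x - x.mean()) ** 2)
    return b ** 2 / ss                            # step 5: W = (sum a_i x_i)^2 / SS

rng = np.random.default_rng(4)
print(f"W = {w_statistic(rng.normal(size=50)):.4f}")
```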

6. It turns out that the statistic z = (ln(1 − W) − μ) / σ is approximately standard normally distributed, where μ and σ are functions of the sample size n (Royston's approximation). Thus we can test the statistic using the standard normal distribution. If the p-value ≤ α, then we reject the null hypothesis that the original data is normally distributed.
We carry out the calculations described above to get the results shown in Figure 1. Since the p-value > .05 = α, there are no grounds for rejecting the null hypothesis that the data is normally distributed.
Real Statistics functions:
SHAPIRO(R1) = the Shapiro-Wilk test statistic W for the data in the range R1 using the expanded method
SWTEST(R1) = p-value of the Shapiro-Wilk test on the data in R1 using the expanded method
SWCoeff(n, j) = the jth coefficient for samples of size n
SWCoeff(R1, C1) = the coefficient corresponding to cell C1 within the sorted range R1

Note that these functions can optionally take an additional argument b: SHAPIRO(R1, b), SWTEST(R1, b), SWCoeff(n, j, b) and SWCoeff(R1, C1, b). When omitted this argument defaults to True (i.e. the values for the expanded Shapiro-Wilk test as described above are used). If b is set to False then the values for the original Shapiro-Wilk test are used instead.

This time we use the supplemental functions described above to obtain the results shown in Figure 2, which give the value of W and the p-value directly. Since the p-value < .05 = α, we reject the hypothesis that the data is normally distributed. Note that we don't need to sort the data, and the data does not have to be arranged in a single column to use these formulas.

If for some reason we want to obtain the coefficients, we need to sort the data. This is done by highlighting the range G3:K14, entering =QSORT(A3:E14) and pressing Ctrl-Shift-Enter. The first coefficient is obtained by entering the formula =SWCoeff($G$3:$K$14,G3) in cell M3. If you highlight the range M3:Q14 and press Ctrl-R and Ctrl-D, the formula is copied across the range, producing all the coefficients.
Observation: If a sample is larger than 5,000 elements, you can randomly divide it into a number of approximately equal-sized smaller samples and then run the SW algorithm as described above on each sample to obtain the z score for each smaller sample. Suppose that there are k such samples with z scores z1, …, zk. Recall that if range R1 contains sample i, then zi = NORMSINV(SWTEST(R1)). The average z̄ of the z scores will be an approximation of the z value for the whole sample. The expected mean of z̄ is the average of the means of the zi, namely 0, and the standard deviation of z̄ should be the standard deviation of the zi divided by √k, namely 1/√k. Thus you should test z̄ / (1/√k) = z̄·√k using the standard normal distribution.
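A hedged Python sketch of this large-sample procedure; the function name and the n = 12,000 sample are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def big_sample_sw(data, max_size=5000, seed=0):
    """Split a sample larger than 5,000 elements into k roughly equal
    random subsamples, get a z score from each Shapiro-Wilk p-value,
    and test the combined z against the standard normal distribution."""
    shuffled = np.random.default_rng(seed).permutation(np.asarray(data, float))
    k = int(np.ceil(len(shuffled) / max_size))
    zs = []
    for chunk in np.array_split(shuffled, k):
        _, p = stats.shapiro(chunk)
        zs.append(stats.norm.ppf(p))  # z_i = NORMSINV(SWTEST(R1))
    z = np.mean(zs) * np.sqrt(k)      # z-bar / (1/sqrt(k))
    return z, stats.norm.cdf(z)       # small p-values indicate non-normality

rng = np.random.default_rng(5)
z, p = big_sample_sw(rng.normal(size=12000))
print(f"combined z = {z:.3f}, p-value = {p:.3f}")
```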

Transformations to Create Symmetry
When a sample is not normally distributed, and is not even symmetric, it can sometimes be useful to transform the data so that the transformed data is more normal, or at least roughly symmetric. We touch upon the subject in Transformations, and explore the concept a bit further in this section. When data is skewed to the right, transformations such as f(x) = log x (either base 10 or base e) and f(x) = √x will tend to correct some of the skew, since larger values are compressed. Neither of these transformations accepts negative numbers, so the transformations f(x) = log (x + a) or f(x) = √(x + a) may need to be used instead, where a is a constant sufficiently large that x + a is positive for all the data elements. We now show how to use a log transformation via an example.
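A small Python sketch of the shifted log transformation; the right-skewed sample is an illustrative assumption:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(6)
data = rng.lognormal(mean=1.0, sigma=0.8, size=200)  # illustrative right-skewed sample

# Shift by a constant a if any values are non-positive, then take logs.
a = max(0.0, 1e-9 - data.min())
transformed = np.log10(data + a)

print(f"skewness before: {skew(data):.2f}, after: {skew(transformed):.2f}")
```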

If we create a QQ plot as described in Graphical Tests for Normality and Symmetry, we see that the data is not very normal (Figure 2). We now apply a log transformation. We choose log base 10, although the result would be similar if we had chosen log base e.

THE END