Slide 1: STAT 651 Lecture 10. Copyright (c) Bani Mallick.

Slide 2: Topics in Lecture #10. Comparing two population means using rank tests. Comparing two population variances using Levene’s test. The effect of outliers.

Slide 3: Book Sections Covered in Lecture #10. Chapter 6.3 (Wilcoxon test). Page 368 (Levene’s test, although it is called Levine’s test there); this is slightly different from what SPSS does. The material on outliers is from my own notes.

Slide 4: Lecture 9 Review: Comparing Two Populations. The difference of sample means is $\bar{X}_1 - \bar{X}_2$. Its s.d. from repeated sampling is $\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$. You need reasonably large samples from BOTH populations.

Slide 5: Lecture 9 Review: Comparing Two Populations. If you can reasonably believe that the population sd’s are nearly equal, it is customary to pick the equal variance assumption and estimate the common standard deviation by $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$.

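Slide 6: Lecture 9 Review: Comparing Two Populations. The standard error of $\bar{X}_1 - \bar{X}_2$ is then the value $s_p\sqrt{1/n_1 + 1/n_2}$, where $s_p$ is the pooled estimate of the common standard deviation. The number of degrees of freedom is $n_1 + n_2 - 2$.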

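Slide 7: Lecture 9 Review: Comparing Two Populations. A $(1-\alpha)100\%$ CI for $\mu_1 - \mu_2$ is $\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}(n_1 + n_2 - 2)\, s_p\sqrt{1/n_1 + 1/n_2}$, where $s_p$ is the pooled standard deviation. Note how the sample sizes determine the CI length.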

Slide 8: Lecture 9 Review: Comparing Two Populations. The CI can of course be used to test hypotheses such as $H_0: \mu_1 = \mu_2$. This is the same as $H_0: \mu_1 - \mu_2 = 0$. So we just need to check whether 0 is in the interval, just as we have done before.

Slide 9: Lecture 9 Review: Comparing Two Populations. Generally, you should make your sample sizes nearly equal, or at least not wildly unequal. Consider a total sample size of 100: $\sqrt{1/n_1 + 1/n_2} \approx 1$ if $n_1 = 1$, $n_2 = 99$, and $= 0.20$ if $n_1 = 50$, $n_2 = 50$. Thus, in the former case, your CI would be 5 times longer!
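
The arithmetic on this slide can be checked with a short sketch. It assumes the quantity being compared is the factor sqrt(1/n1 + 1/n2) that multiplies the pooled standard deviation in the standard error; the function name `balance_factor` is made up for illustration.

```python
from math import sqrt

def balance_factor(n1, n2):
    """Factor sqrt(1/n1 + 1/n2) that scales the pooled SD in the standard error."""
    return sqrt(1 / n1 + 1 / n2)

# Two ways to split a total sample size of 100.
unbalanced = balance_factor(1, 99)   # wildly unequal split
balanced = balance_factor(50, 50)    # equal split

# How much longer the CI is under the unequal split (same pooled SD and t value).
ratio = unbalanced / balanced
print(round(unbalanced, 3), round(balanced, 3), round(ratio, 2))
```

The ratio comes out at about 5, which is where the "5 times longer" statement comes from.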

Slide 10: Lecture 9 Review: NHANES Comparison. Mean(Healthy) - Mean(Cancer): the 95% CI is from 0.0065 to 0.5223. [Diagram: a number line marking 0 = hypothesized value and the confidence interval, labeled 0.065 to 0.5223.]

Slide 11: Arsenic and Squamous Cell Skin Cancer. The question is whether arsenic ingestion is related to squamous cell carcinoma. We used the transformation X = log(0.005 + toe arsenic).

Slide 12: Arsenic and Squamous Cell Skin Cancer. [Figure with panels labeled Healthy and Cancer.]

Slide 13: Arsenic and Squamous Cell Skin Cancer.

Slide 14: Arsenic and Squamous Cell Skin Cancer, Healthy Cases.

Slide 15: Arsenic and Squamous Cell Skin Cancer, Cancer Cases.

Slide 16: Arsenic and Squamous Cell Skin Cancer. Healthy: s = 0.59, IQR = 0.69. Squamous: s = 0.62, IQR = 0.71. These statistics and the box plots indicate that the two populations do not have vastly different variability.

Slide 17: Arsenic and Squamous Cell Skin Cancer. Healthy: mean = -2.33, n = 215, se = 0.040. Squamous: mean = -2.3365, n = 140, se = 0.052. Mean difference = 0.020, se = 0.066. 95% CI = [-0.109, 0.149]. p = 0.76: what does this mean?
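
As a rough check of this slide, the interval and p-value can be recomputed from the reported mean difference and its standard error. This is only a sketch: it assumes the normal approximation to the t distribution, which is reasonable here because there are 215 + 140 - 2 = 353 degrees of freedom, and the variable names are mine.

```python
from statistics import NormalDist

diff = 0.020   # mean difference reported on the slide
se = 0.066     # standard error of the difference reported on the slide

std_normal = NormalDist()
z = std_normal.inv_cdf(0.975)   # about 1.96; stands in for the t critical value

# 95% confidence interval for the difference in population means.
ci_low = diff - z * se
ci_high = diff + z * se

# Two-sided p-value for the hypothesis that the difference is 0.
p_value = 2 * (1 - std_normal.cdf(abs(diff) / se))

print(round(ci_low, 3), round(ci_high, 3), round(p_value, 2))
```

The interval contains 0 and the p-value is large, so these data give no evidence of a difference between the two population means.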

Slide 18: Arsenic and Squamous Cell Skin Cancer. The graphs, statistics, CI, and p-value all tell us that not much seems to be going on!

Slide 19: Robust Inference via Rank Tests. Because sample means and standard deviations are sensitive to outliers, so too are comparisons of populations based on them. Rank tests form a robust alternative that can be used to check the results of t-statistic inferences. You are looking for major discrepancies, and then trying to explain them.

Slide 20: Robust Inference via Rank Tests. Rank tests are very easy to compute, and SPSS provides them. The test is typically called the Wilcoxon rank sum test. The algorithm is to assign ranks to each observation in the pooled data set, then apply a t-test to these ranks. It is robust because ranks can never get wild.

Slide 21: Robust Inference via Rank Tests. Here is how data are ranked:

Sample   Data                  Ranks
#1       -3, 7, 22, 45, 81     1, 2, 3, 6, 10
#2       28, 44, 50, 55, 56    4, 5, 7, 8, 9

Now run a t-test on the ranks.
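
The ranking step can be sketched in a few lines. The two samples are the ones from this slide's example (-3, 7, 22, 45, 81 and 28, 44, 50, 55, 56); the helper names `pooled_ranks` and `pooled_t` are illustrative, and the t statistic is the equal-variance version from the Lecture 9 review. Software such as SPSS refers the rank sum to its own reference distribution, so treat this as an illustration of the idea, not of what SPSS computes.

```python
from math import sqrt
from statistics import mean, variance

def pooled_ranks(x, y):
    """Rank the pooled sample (1 = smallest), averaging ranks over ties."""
    pooled = sorted(x + y)
    rank_of = {}
    for value in set(pooled):
        first = pooled.index(value) + 1      # first position of this value
        count = pooled.count(value)          # number of tied observations
        rank_of[value] = first + (count - 1) / 2
    return [rank_of[v] for v in x], [rank_of[v] for v in y]

def pooled_t(a, b):
    """Equal-variance two-sample t statistic."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))

sample1 = [-3, 7, 22, 45, 81]
sample2 = [28, 44, 50, 55, 56]

ranks1, ranks2 = pooled_ranks(sample1, sample2)
t_on_ranks = pooled_t(ranks1, ranks2)   # t-test applied to the ranks
print(ranks1, ranks2, round(t_on_ranks, 3))
```

Note that the value 81 only contributes the rank 10, no matter how far out it lies.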

Slide 22: Robust Inference via Rank Tests. The rank tests give the same answer no matter whether you take the raw data, their logarithms or their square roots. If you have data (raw or transformed) that pass the q-q plot check, then the Wilcoxon test and the t-test should have much the same p-values. In this case, you can use the latter to get CIs.
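
The reason the rank test does not care about the transformation is that logarithms and square roots preserve the ordering of positive data, so the ranks do not change. A minimal sketch, using made-up positive values and a simple `ranks` helper that assumes no ties:

```python
from math import log, sqrt

def ranks(values):
    """Ranks (1 = smallest) of a list with no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

# Hypothetical positive measurements, chosen only for illustration.
data = [7, 22, 45, 81, 28, 44, 50, 55, 56]

raw_ranks = ranks(data)
log_ranks = ranks([log(v) for v in data])
sqrt_ranks = ranks([sqrt(v) for v in data])

print(raw_ranks == log_ranks == sqrt_ranks)
```

Since the ranks are identical, any test computed from them is identical too.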

Slide 23: Robust Inference via Rank Tests. Differences between rank tests and t-tests generally occur for two reasons: outliers and very non-bell-shaped histograms.

Slide 24: Robust Inference via Rank Tests. In SPSS, you can get Wilcoxon rank sum tests (SPSS calls them Mann-Whitney U) as follows: “Analyze”, “Nonparametric Tests”, “2 independent samples”.

Slide 25: Robust Inference via Rank Tests. SPSS output for Toe Arsenic and for log(0.005 + Toe Arsenic). Note how the p-values are the same (= 0.468).

Test Statistics (Grouping Variable: Squamous Cancer Status)

                          Toe Arsenic    log(0.005 + Toe Arsenic)
Mann-Whitney U            14364.500      14364.500
Wilcoxon W                24234.500      24234.500
Z                         -0.725         -0.725
Asymp. Sig. (2-tailed)    0.468          0.468

Slide 26: Robust Inference via Rank Tests, NHANES. Saturated Fat p-values: t-test = 0.057, rank test = 0.014. Log(Saturated Fat) p-values: t-test = 0.012, rank test = 0.014. Note how the transformed data, which are more bell-shaped, agree more closely with the rank test!

Slide 27: Robust Inference via Rank Tests. An SPSS peculiarity: to do rank tests, the variable that categorizes the groups must be numeric. If yours is not, convert it to numbers and then give it value labels.

Slide 28: Inference for Equality of Variances. We have described situations in which comparing the variability of populations is desired. Ott and Longnecker (Chapter 7) give methods for comparing population variances. NEVER USE THESE METHODS. They are notoriously unreliable: they are affected by outliers, by histograms that are not perfectly bell-shaped, etc.

Slides 29-30: Inference for Equality of Variances. SPSS uses what is called Levene’s test. From the SPSS Help file (slightly edited): “Levene Test: For each case, it computes the absolute difference between the value of that case and its cell mean and performs a t-test on those absolute differences.” This is a reasonable test, although I prefer to use a rank test instead of the t-test.
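
The recipe in the Help file can be sketched directly: take absolute deviations from each group's mean, then compare the two sets of deviations with a t-test. The two groups below are invented for illustration, the helper names are mine, and the t statistic is the equal-variance version; this is a sketch of the idea, not a reproduction of SPSS output.

```python
from math import sqrt
from statistics import mean, variance

def abs_deviations(group):
    """Absolute difference between each value and its group (cell) mean."""
    center = mean(group)
    return [abs(v - center) for v in group]

def pooled_t(a, b):
    """Equal-variance two-sample t statistic."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Hypothetical data: same mean (11), very different spread.
group_a = [10.0, 12.0, 11.0, 13.0, 9.0]
group_b = [2.0, 20.0, 11.0, 25.0, -3.0]

dev_a = abs_deviations(group_a)
dev_b = abs_deviations(group_b)

# Levene-type statistic: t-test on the absolute deviations.
levene_t = pooled_t(dev_a, dev_b)
print(dev_a, dev_b, round(levene_t, 3))
```

For the rank-test version, the absolute deviations would be replaced by their ranks in the pooled set of deviations before the comparison is made.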

Slide 31: Inference for Equality of Variances. I suggest that you supplement the Levene test with a look at the IQR in boxplots. If you really need to understand scientifically the question of equality of variance, I suggest that you consult a bona fide statistician. I’ll now illustrate Levene’s test using NHANES (and this is the last time for these data).

Slide 32: Inference for Equality of Variances: Note the outlier.

Slide 33: Inference for Equality of Variances.

Slide 34: Inference for Equality of Variances. P-value of Levene’s test for Saturated Fat = 0.378. Same p-value, but without the outlier = 0.010. P-value for log(Saturated Fat) = 0.667. P-value of Levene’s test for Saturated Fat when using the rank test instead of the t-test = 0.039. P-value of Levene’s test for log(Saturated Fat) when using the rank test instead of the t-test = 0.665.

Slides 35-36: Inference for Equality of Variances. As you can see, the rank test version of Levene’s test gives answers much more in keeping with the box plots. The problem was clearly the outlier, so you can expect trouble with Levene’s test if there is a massive outlier. Remember, t-tests have trouble with outliers. The problem is that it’s a pain to compute the rank test version in SPSS. However, theory says that the rank test version is the better one, so in exams I’ll give it to you.

Slides 37-43: The Effect of an Outlier.
What will happen to the sample mean for the cancer cases if I remove the outlier? It will decrease.
What will happen to the sample standard error for the cancer cases if I remove the outlier? It will decrease.
What will happen to the difference between the sample mean for healthy cases and the sample mean for cancer cases if I delete the outlier? It will increase.
What will happen to the sample standard error of this difference if I remove the outlier? It will decrease.
Therefore, what will happen to the p-value if I delete the outlier? It will get smaller: with the outlier, p = 0.057; with the outlier removed, p = 0.002.
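
The same chain of reasoning can be tried on a toy data set. Everything below is invented for illustration (it is not the NHANES data): the cancer group has one large value playing the role of the outlier, and the helper names are mine.

```python
from math import sqrt
from statistics import mean, stdev, variance

def std_error(x):
    """Standard error of the sample mean."""
    return stdev(x) / sqrt(len(x))

def pooled_t(a, b):
    """Equal-variance two-sample t statistic."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Hypothetical data; the last cancer value (4.5) is the outlier.
healthy = [3.4, 3.1, 3.6, 3.3, 3.5, 3.2]
cancer_with = [2.9, 3.0, 2.8, 3.1, 2.7, 4.5]
cancer_without = cancer_with[:-1]

# Difference of means (healthy minus cancer), with and without the outlier.
diff_with = mean(healthy) - mean(cancer_with)
diff_without = mean(healthy) - mean(cancer_without)

# t statistics for the comparison, with and without the outlier.
t_with = pooled_t(healthy, cancer_with)
t_without = pooled_t(healthy, cancer_without)

print(round(std_error(cancer_with), 3), round(std_error(cancer_without), 3))
print(round(diff_with, 3), round(diff_without, 3))
print(round(t_with, 2), round(t_without, 2))
```

Removing the outlier lowers the cancer mean and its standard error, widens the difference, and gives a much larger t statistic, which corresponds to a smaller p-value.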
