Detecting Outliers


Slide 1 Detecting Outliers

Outliers are cases with an atypical score on a single variable (univariate outliers) or on a combination of variables (multivariate outliers). Outliers generally have a large impact on the solution: a single outlier can conceivably change the value or score we would predict for every other case in the study. Our concern with outliers is to answer the question of whether our analysis is more valid with the outlier case included or with it excluded. To answer this question, we must have methods for detecting and assessing outliers.

The method for detecting univariate outliers is to convert the scores on the variable to standard scores and scan for very large positive or negative standard scores. We normally apply this strategy to the analysis of a metric dependent variable. Multivariate outlier detection identifies unusual cases on the combined set of metric independent variables, using a multivariate distance measure analogous to the standard-score distance from the mean of the sample.

The decision to exclude or retain an outlier case is based on our understanding of the cause of the outlier and the impact it is having on the results. If the outlier is a data entry error or an obvious misstatement by a respondent, it probably should be excluded. If the outlier is an unusual but plausible value, it should be retained. We can improve our understanding of the impact of an outlier by running the analysis twice, once with the outlier included and again with it excluded.

Slide 2 1. Detecting Univariate Outliers

To detect univariate outliers, we convert our numeric variables to their standard-score equivalents. Outliers are the cases associated with large standard z-score values, e.g. smaller than -2.5 or larger than +2.5. Standardizing variables converts them to a standard-deviation unit of measurement, so that the distance from the mean for any case on any variable is expressed in comparable units.

The Descriptives procedure can create standard scores for our variables and add them to our data. SPSS names each z-score variable by preceding the original variable name with the letter z: the standard-score equivalent of x1 is zx1. To locate the outliers for each variable, we can either sort the data set by the z-score variable or use the SPSS Examine (Explore) procedure to print the highest and lowest values of the z-score variables to the output window.

The use of standard scores to detect outliers presumes that the variable is normally distributed. When a variable is not normally distributed, a boxplot may be more effective in identifying outliers. A boxplot identifies outliers using a somewhat different criterion: cases with values between 1.5 and 3 box lengths beyond the upper or lower edge of the box are flagged as outliers. The box length is the interquartile range, the difference between the values at the 25th and 75th percentiles.
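Slides 2 through 7 walk through this workflow in the SPSS menus. As a rough cross-check, both criteria can be sketched in Python with pandas; the data and the variable name x1 below are invented for illustration, and the ±2.5 cutoff matches the rule above.

```python
import pandas as pd

# Hypothetical sample: 19 typical values plus one extreme score (9.8).
df = pd.DataFrame({"x1": [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.0, 2.1, 1.7,
                          2.5, 2.2, 1.9, 2.0, 2.4, 2.1, 1.8, 2.3, 2.2, 9.8]})

# Standard scores: the equivalent of the SPSS Descriptives option
# "Save standardized values as variables", which would create zx1.
df["zx1"] = (df["x1"] - df["x1"].mean()) / df["x1"].std()

# Flag univariate outliers by the +/-2.5 standard-score criterion.
df["z_outlier"] = df["zx1"].abs() > 2.5

# Boxplot criterion: values more than 1.5 box lengths (IQRs)
# beyond the 25th or 75th percentile.
q1, q3 = df["x1"].quantile([0.25, 0.75])
iqr = q3 - q1
df["box_outlier"] = (df["x1"] < q1 - 1.5 * iqr) | (df["x1"] > q3 + 1.5 * iqr)

# Sorting by the z-score plays the role of the Explore procedure's
# table of highest and lowest values.
print(df.sort_values("zx1", ascending=False).head())
```

Both rules flag only the 9.8 case in this made-up sample; on real data the two criteria can disagree, which is the point of preferring the boxplot when the variable is not normal.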

Slide 3 Compute Standard Scores for the Metric Variables

Slide 4 The Standard Scores in the SPSS Data Editor

Slide 5 Use the Explore Procedure to Locate Large Standard Scores Indicating Outliers

Slide 6 Specify Outliers as the Desired Statistics

Slide 7 Extreme Values as Outliers

Slide 8 2. Detecting Multivariate Outliers

Standard scores measure the statistical distance of a data point from the mean of all cases, in standard-deviation units along the horizontal axis of a normal distribution plot. There is a similar measure of statistical distance in multidimensional space, known as Mahalanobis D² (D-squared). This statistic measures the distance of a case's vector of scores on the independent variables from the centroid (the multidimensional equivalent of a mean) of all cases. The larger the value of Mahalanobis D² for a case, and the smaller its corresponding probability value, the more likely the case is to be a multivariate outlier. The probability value enables us to make a decision about the statistical test of the null hypothesis that the vector of scores for a case is equal to the centroid of the distribution for all cases.

Mahalanobis D² can be computed in SPSS with the Regression procedure for a set of independent variables; the Save option will add the D² values to the data set. SPSS does not compute the probability of Mahalanobis D² directly. Mahalanobis D² is distributed as a chi-square statistic with degrees of freedom equal to the number of independent variables in the analysis. The SPSS cumulative density function computes the area under the chi-square curve from the left end of the distribution to the point corresponding to our statistic's value; the right-tail probability of obtaining a D² value this large is one minus that cumulative value.

We use the probability values to identify the cases that are most distant from, or different than, the other cases in the sample. We make our decision about omitting or including extreme cases by re-running the analysis without them and comparing the results we obtain with and without them, to determine whether our results are more representative with or without the extreme cases.
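The distance-and-probability computation described above can be sketched in Python with NumPy and SciPy. This is a minimal illustration, not the SPSS implementation; the data matrix X (two hypothetical metric independent variables, twenty cases, the last deliberately extreme) is invented.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical cases: columns are two metric independent variables.
X = np.array([
    [2.1, 30.0], [1.9, 28.0], [2.4, 35.0], [2.0, 31.0], [2.2, 29.0],
    [1.8, 33.0], [2.3, 32.0], [2.0, 34.0], [2.1, 27.0], [1.7, 30.0],
    [2.5, 31.0], [2.2, 29.0], [1.9, 32.0], [2.0, 28.0], [2.4, 33.0],
    [2.1, 30.0], [1.8, 31.0], [2.3, 34.0], [2.2, 29.0], [9.8, 30.0],
])

# Mahalanobis D-squared: distance of each case's score vector from the
# centroid, weighted by the inverse of the sample covariance matrix.
centroid = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - centroid
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Right-tail probability under chi-square with df = number of
# independent variables; this mirrors 1 - CDF.CHISQ(MAH_1, 2) in SPSS.
p = 1 - chi2.cdf(d2, df=X.shape[1])

# Cases with very small probabilities are candidate multivariate outliers.
outliers = np.where(p < 0.001)[0]
print(outliers)
```

With this made-up sample only the last case is flagged; the decision to drop it would still rest on re-running the analysis with and without it, as the slide notes.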

Slide 9 Request a Multiple Regression to Compute Mahalanobis Distance Statistics

Slide 10 Specify the Variables to Include in the Analysis

Slide 11 Add the Mahalanobis Distance Statistic to the Data Set

Slide 12 The Mahalanobis Distance Statistics in the Data Editor

Slide 13 Compute the Probability Values for the Mahalanobis D² Statistics

Slide 14 Sorting the Data Set to Locate Statistically Significant D² Scores

Slide 15 Highlight Cases with Statistically Significant Mahalanobis D² Scores
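The sort-and-flag step in slides 14 and 15 can be sketched with pandas, assuming the D² values (SPSS names the saved variable MAH_1) and their right-tail probabilities (here a hypothetical computed variable p_mah) have already been added to the data set. The p < .001 cutoff is a commonly used convention for judging a Mahalanobis D² statistically significant; the probability values below are invented but consistent with a chi-square distribution with 2 degrees of freedom.

```python
import pandas as pd

# Hypothetical data set: case IDs plus the saved Mahalanobis distances
# (MAH_1) and their computed right-tail probabilities (p_mah).
df = pd.DataFrame({
    "caseid": [1, 2, 3, 4, 5],
    "MAH_1":  [1.2, 0.8, 18.9, 2.5, 0.4],
    "p_mah":  [0.549, 0.670, 0.00008, 0.287, 0.819],
})

# Sort ascending by probability so the most extreme cases appear first,
# then flag the statistically significant ones at the .001 level.
df = df.sort_values("p_mah")
df["mv_outlier"] = df["p_mah"] < 0.001

# The case IDs of the multivariate outliers (slide 16's listing).
print(df.loc[df["mv_outlier"], "caseid"].tolist())
```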

Slide 16 The Case IDs for the Multivariate Outliers