Basics of Data Cleaning


1 Basics of Data Cleaning

2 Why Examine Your Data?
Gain a basic understanding of the data set
Ensure the statistical and theoretical underpinnings of a given multivariate technique are met
Identify concerns about the data:
Departures from distributional assumptions (e.g., normality)
Outliers
Missing data

3 Testing Assumptions
MV normality assumption: when it holds, the solution is better
Violations of MV normality:
Skewness (symmetry)
Kurtosis (peakedness)
Heteroscedasticity
Non-linearity

4 Negative Skew

5 Positive Skew

6 Kurtosis
Mesokurtic (normal peakedness)
Leptokurtic (more peaked than normal)
Platykurtic (flatter than normal)

7 Skewness & Kurtosis SPSS Syntax FREQUENCIES VARIABLES=age
/STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT /ORDER= ANALYSIS. Skewness = .354/.205 = 1.73 Kurtosis = -.266/.407 = -.654 Z values = Statistic Std Error Critical Values for z score .05  +/- 1.96 .01  +/- 2.58
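If a formal test is wanted alongside the z-ratio approach above, a hedged sketch using SPSS's EXAMINE command (the variable age is carried over from the syntax above; NPPLOT requests normal probability plots and the accompanying normality tests):

* Hedged sketch: histogram, normal Q-Q plot, and normality tests for age.
EXAMINE VARIABLES=age
  /PLOT HISTOGRAM NPPLOT
  /STATISTICS DESCRIPTIVES.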

8 Homoscedasticity
s²₁ = s²₂ = s²₃ = s²₄ = s²ₑ
When there are multiple groups, each group has similar levels of variance (similar standard deviations)
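One common SPSS check of this assumption is Levene's test of homogeneity of variance; a minimal sketch, assuming a hypothetical outcome score and grouping variable group:

* Hedged sketch: Levene's test of equal variances across groups (variable names are placeholders).
ONEWAY score BY group
  /STATISTICS=HOMOGENEITY.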

9 Linearity
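A quick visual check of linearity (and of heteroscedasticity) is a plot of standardized residuals against standardized predicted values; a hedged sketch with placeholder variables y and x:

* Hedged sketch: a curved or fanning pattern in this plot suggests non-linearity or heteroscedasticity.
REGRESSION
  /DEPENDENT=y
  /METHOD=ENTER x
  /SCATTERPLOT=(*ZRESID, *ZPRED).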

10 Testing the Assumption of Absence of Correlated Errors
Correlated errors mean there is an unmeasured variable affecting the analysis
The key is to identify the unmeasured variable and include it in the analysis
How often do we meet this assumption?

11 Data Cleaning
Examine:
Individual items/scales (e.g., reliability; see the sketch after this list)
Bivariate relationships
Multivariate relationships
Techniques to use:
Graphs → non-normality, heteroscedasticity
Frequencies → missing data, out-of-bounds values
Univariate outliers (+/- 3 SD from the mean)
Mahalanobis distance (critical value at .001)
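For the item/scale reliability check mentioned above, a minimal hedged sketch of SPSS's RELIABILITY command (the three item names are placeholders):

* Hedged sketch: Cronbach's alpha and item-total statistics for a hypothetical three-item scale.
RELIABILITY
  /VARIABLES=item1 item2 item3
  /MODEL=ALPHA
  /SUMMARY=TOTAL.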

12 Graphical Examination
Single variable (shape of distribution): histogram, stem-and-leaf plot
Relationships between two or more variables: scatterplot
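A hedged sketch of the corresponding SPSS commands (age, x, and y are placeholder variable names):

* Hedged sketch: histogram and stem-and-leaf plot for a single variable.
EXAMINE VARIABLES=age
  /PLOT HISTOGRAM STEMLEAF.
* Scatterplot for a pair of variables.
GRAPH
  /SCATTERPLOT(BIVAR)=x WITH y.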

13 Histogram

14 Scatterplot

15 Frequencies
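Frequency tables are the tool slide 11 recommends for spotting missing data and out-of-bounds values; a minimal hedged sketch (the user-missing code 999 and the variable names are assumptions):

* Hedged sketch: declare a user-missing code, then scan the frequency table for stray values.
MISSING VALUES age (999).
FREQUENCIES VARIABLES=age sex
  /ORDER=ANALYSIS.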

16 Outliers
Where do outliers come from?
Inclusion of subjects not part of the population (e.g., an ESL response to a vocabulary test)
Legitimate data points*
Extreme values of random error (X = T + e)
Error in observation
Error in data preparation

17 Univariate Outliers
Criterion: mean +/- 3 SD
Example: age. Out-of-range values are those above mean + 3 SD or below mean - 3 SD (here, below 4.53)
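One hedged way to flag such cases in SPSS: save standardized scores and list cases beyond +/- 3 SD (Zage follows SPSS's default Z-prefix naming for saved z-scores; id is a placeholder case identifier):

* Hedged sketch: save z-scores, then list cases more than 3 SD from the mean.
DESCRIPTIVES VARIABLES=age
  /SAVE.
TEMPORARY.
SELECT IF (ABS(Zage) > 3).
LIST VARIABLES=id age Zage.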

18 Univariate Outliers

19 Multivariate Outliers
Mahalanobis distance, SPSS syntax:
REGRESSION VARIABLES=case VAR1 VAR2
  /STATISTICS=COLLIN
  /DEPENDENT=case
  /METHOD=ENTER
  /RESIDUALS=OUTLIERS(MAHAL).
Critical values at .001 (chi-square with df = number of variables); a case with D greater than the critical value is a multivariate outlier:
two variables: 13.82
three variables: 16.27
four variables: 18.47
five variables: 20.52
six variables: 22.46
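An alternative hedged sketch saves the distances to the data file and converts them to chi-square probabilities directly (MAH_1 is the default name SPSS gives the saved distance; the df of 2 assumes two predictor variables):

* Hedged sketch: save Mahalanobis distances and compute their upper-tail chi-square probability.
REGRESSION
  /DEPENDENT=case
  /METHOD=ENTER VAR1 VAR2
  /SAVE MAHAL.
COMPUTE p_mahal = 1 - CDF.CHISQ(MAH_1, 2).
EXECUTE.
* Cases with p_mahal < .001 would be flagged as multivariate outliers.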

20 Approaches to Outliers
Leave them alone
Delete the entire case (listwise)
Delete only the relevant variables (pairwise)
Trim to the highest legitimate value (see the sketch after this list)
Mean substitution
Imputation
Bottom line: you can use any of the above as long as you tell the reader what you did and the reviewers accept that approach. Often the choice will be driven by the results of your analyses.
Most important point: ethics. You *must* tell the reader what approach you took.
The Orr article suggests why this matters: different outlier detection strategies yield different outliers.
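A hedged sketch of the trimming option in SPSS syntax (the variable name and cutoff values are purely illustrative):

* Hedged sketch: pull values above a chosen legitimate maximum down to that maximum.
RECODE age (76 THRU HIGHEST = 75).
EXECUTE.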

21 Effects of Outliers: r = .50 vs. r = .32

22 Effects of Outliers

23 Major Problems: Missing Data
Generalizability issues
Reduces power (smaller sample size)
Impacts accuracy of results
Accuracy = dispersion around the true score (can be under- or over-estimated)
Varies with the missing data technique (MDT) used

24 Dealing with Missing Data
Listwise deletion
Pairwise deletion
Mean substitution
Regression imputation
Hot-deck imputation
Multiple imputation
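A hedged sketch of mean substitution using SPSS's RMV (Replace Missing Values) command; the variable name is a placeholder, and the more sophisticated options in the list above are handled by other procedures (e.g., the Missing Values add-on) not shown here:

* Hedged sketch: fill missing values of age with the series mean, stored in a new variable.
RMV /age_m=SMEAN(age).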

25 Dealing with Missing Data
In order of accuracy (most to least):
1. Pairwise deletion
2. Listwise deletion
3. Regression imputation
4. Mean substitution
5. Hot-deck imputation

26 Dealing with Missing Data
MDT: Pros / Cons
Listwise deletion: Pros: easy to use; high accuracy. Cons: reduces sample size.
Pairwise deletion: Pros: highest accuracy. Cons: problematic in MV analyses (can yield a non-positive definite correlation matrix).
Mean substitution: Pros: saves data; preserves sample size. Cons: moderate accuracy; attenuation of findings.
Regression imputation (no error-term adjustment): Cons: difficult to use; cannot be used when all predictors are missing.
Hot-deck imputation: Cons: lots of bias and error.

27 Transformations
Distribution: best transformation to try
Moderate deviation from normality: square root
Substantial deviation from normality: log
Severe deviation (esp. J-shaped): inverse
Negative skew: "reflect" (mirror image), then transform
Interpretation of transformed variables?
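A hedged sketch of these transformations as SPSS COMPUTE statements (x is a placeholder variable; the +1 offsets and the reflection constant are illustrative and assume non-negative data with a known maximum):

* Hedged sketch: the constants below are illustrative only.
COMPUTE sqrt_x = SQRT(x).
COMPUTE log_x = LG10(x + 1).
COMPUTE inv_x = 1 / (x + 1).
* Reflect a negatively skewed variable (101 = largest observed value + 1), then transform.
COMPUTE refl_x = SQRT(101 - x).
EXECUTE.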

