CH2. Cleaning and Transforming Data

1 CH2. Cleaning and Transforming Data
Graphical methods
Missing data processes
Impact of missing data
Approaches available
Outliers
Underlying assumptions
Data transformation
Nonmetric data as metric data
Fall, 2010 Multivariate Analysis Lec 2

2 Graphical Examination
The shape of the distribution
  Histogram; stem-and-leaf diagram
The relationships among variables
  Bivariate relationships: scatterplot (matrix)
Examining group differences
  Boxplot
Multivariate profiles: rarely used

3 (Figure) Deviation from the normal distribution

4 (Figure) Correlation

5 (Figure) Outlier; three groups essentially equal; three groups substantially different

6 Missing Data
A fact of life
Missing data affect the generalizability of the results
Determine the reasons underlying the missing data -> select the appropriate course of action
The missing data process is usually unknown, so ask:
  Are the missing data randomly scattered throughout?
  How prevalent are they?
The action taken is similar to that for nonrespondents
Consider the potential biases and the sample size available
Few guidelines and remedies exist for the problem

7 The Reasons Leading to Missing Data
Explicitly expected: ignorable missing data
  The multivariate technique itself makes allowance for these missing data
  Nonsampled observations under probability sampling (the missing data process is random)
  Censored data, e.g., causes of death
Other types:
  Procedural factors
  Inapplicable responses
  Refusal to respond

8 Patterns of Missing Data
First, ascertain the degree of randomness present in the missing data
Missing at Random (MAR): the missing values of Y depend on X, but not on Y; the observed Y values do not necessarily represent a truly random sample of all Y values
Missing Completely at Random (MCAR): the observed values of Y are truly a random sample of all Y values

9 Diagnosing the Randomness
Assess the missing data process of a single variable Y by forming two groups: cases with missing data on Y and cases with valid data on Y
Dichotomized correlations (each variable recoded as valid vs. missing)
Statistical significance tests of the correlations provide a conservative estimate of the degree of randomness
  Nonsignificant: MCAR; otherwise, MAR
Overall test: compare the observed pattern of missing data with the pattern expected for a random missing data process
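The two-group comparison described on this slide can be sketched in plain Python. This is a minimal illustration rather than the lecture's own procedure: the helper names (`split_by_missingness`, `welch_t`), the choice of Welch's t statistic, and the toy data are all assumptions, and `None` stands for a missing value.

```python
import math
import statistics


def split_by_missingness(rows, y_key, x_key):
    """Split the valid X values into two groups: Y missing vs. Y valid."""
    x_y_missing = [r[x_key] for r in rows
                   if r[y_key] is None and r[x_key] is not None]
    x_y_valid = [r[x_key] for r in rows
                 if r[y_key] is not None and r[x_key] is not None]
    return x_y_missing, x_y_valid


def welch_t(group_a, group_b):
    """Welch's t statistic and its approximate degrees of freedom."""
    se2_a = statistics.variance(group_a) / len(group_a)
    se2_b = statistics.variance(group_b) / len(group_b)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    t = diff / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical data: Y tends to be missing when X is large.
rows = [
    {"x": 1.0, "y": 10.0}, {"x": 2.0, "y": 12.0}, {"x": 3.0, "y": 11.0},
    {"x": 4.0, "y": 13.0}, {"x": 5.0, "y": 14.0},
    {"x": 7.0, "y": None}, {"x": 8.0, "y": None}, {"x": 9.0, "y": None},
]

x_missing, x_valid = split_by_missingness(rows, "y", "x")
t_stat, df = welch_t(x_missing, x_valid)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

A large absolute t value, judged against the t distribution with the reported degrees of freedom, suggests that missingness on Y is related to X, so the data would not be MCAR.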

10 Approaches for Dealing with Missing Data
Complete case approach: use only the observations with complete data
  Appropriate only for MCAR
Delete variables
  When a nonrandom pattern of missing data is present
  No firm guidelines; the decision should be based on empirical and theoretical considerations
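Both remedies on this slide are simple filtering operations, which a short sketch may make concrete. The function names and the five-case data set are hypothetical, and `None` marks a missing value.

```python
def complete_cases(rows):
    """Keep only the cases (rows) that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def missing_share_by_variable(rows):
    """Proportion of cases with a missing value, for each variable."""
    n = len(rows)
    return {key: sum(r[key] is None for r in rows) / n for key in rows[0]}


def drop_variables(rows, names):
    """Return copies of the cases without the named variables."""
    return [{k: v for k, v in r.items() if k not in names} for r in rows]


# Hypothetical sample of five cases on three variables.
rows = [
    {"v1": 1.0, "v2": 2.0, "v3": 3.0},
    {"v1": None, "v2": 5.0, "v3": 6.0},
    {"v1": None, "v2": 8.0, "v3": 9.0},
    {"v1": 4.0, "v2": None, "v3": 1.0},
    {"v1": None, "v2": 2.0, "v3": 2.0},
]

shares = missing_share_by_variable(rows)
n_complete_before = len(complete_cases(rows))
n_complete_after = len(complete_cases(drop_variables(rows, {"v1"})))
print(shares, n_complete_before, n_complete_after)
```

In this toy sample, deleting the variable with the most missing data recovers several complete cases, which is the trade-off the slide asks the analyst to weigh.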

11 Imputation Methods
The process of estimating a missing value based on valid values of other variables and/or cases in the sample
The idea is both seductive and dangerous
Two approaches
  Use all the available information from a subset of cases to generalize to the entire sample
  Estimate replacement values for the missing data

12 All-available Approach
Estimate correlations from all available pairs, maximizing the pairwise information available in the sample
Relationships are represented by the correlations; appropriate only for MCAR
Problems
  Correlations may be “out of range” and inconsistent with other correlations
  Eigenvalues of the correlation matrix can become negative
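The inconsistency problem can be reproduced with a small, deliberately extreme example. The sketch below is an assumption-laden illustration: the data are invented so that each pair of variables is observed on a different subset of cases, and the helper names are hypothetical. For a 3 x 3 correlation matrix, a negative determinant implies at least one negative eigenvalue, so the determinant is used as a stdlib-only check.

```python
import math


def pairwise_corr(rows, i, j):
    """Pearson correlation using every case where both variables are valid."""
    pairs = [(r[i], r[j]) for r in rows
             if r[i] is not None and r[j] is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def det3_corr(r12, r13, r23):
    """Determinant of a 3 x 3 correlation matrix with unit diagonal."""
    return 1 + 2 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2


# Hypothetical cases (a, b, c); each pair is observed on different cases.
rows = [
    (1, 2, None), (2, 4, None), (3, 6, None), (4, 8, None),
    (1, None, 3), (2, None, 6), (3, None, 9), (4, None, 12),
    (None, 1, 8), (None, 2, 6), (None, 3, 4), (None, 4, 2),
]

r_ab = pairwise_corr(rows, 0, 1)
r_ac = pairwise_corr(rows, 0, 2)
r_bc = pairwise_corr(rows, 1, 2)
det = det3_corr(r_ab, r_ac, r_bc)
print(r_ab, r_ac, r_bc, det)
```

Each pairwise correlation is valid on its own, but no complete data set could produce all three together, and that shows up as a negative determinant.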

13 Replacement of Missing Data
Case substitution (by a nonsampled observation)
Mean substitution (based on all valid responses)
  Understates the true variance; distorts the actual distribution; depresses the observed correlation
Cold deck imputation: external source
Regression imputation: based on regression
  Reinforces the relationships already in the data; understates the variance; assumes substantial correlations with the other variables; the estimates are not constrained to the valid range of the variable
Multiple imputation: a combination of several methods
Model-based procedures
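The claim that mean substitution understates the variance is easy to check numerically. The following is a minimal sketch with invented values; `mean_substitute` is a hypothetical helper and `None` marks a missing response.

```python
import statistics


def mean_substitute(values):
    """Replace each missing value with the mean of the valid values."""
    valid = [v for v in values if v is not None]
    fill = statistics.mean(valid)
    return [fill if v is None else v for v in values]


# Hypothetical variable with two missing responses.
observed = [2.0, 4.0, None, 6.0, None, 8.0]
imputed = mean_substitute(observed)

var_valid = statistics.variance([v for v in observed if v is not None])
var_imputed = statistics.variance(imputed)
print(var_valid, var_imputed)
```

The imputed values sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the sample size, which is why the variance of the imputed variable is smaller than the variance of the valid responses.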

14 An Illustration of Missing Data Diagnosis
Number of cases without missing data
V1 and V3: the two most likely variables for deletion
V1-V9: metric variables
Total number of cases = 70

15 An Illustration of Missing Data Diagnosis
Delete just V1 = 32

16 An Illustration of Missing Data Diagnosis
Reduced sample: 64 cases (70 – 6 = 64)

17 Rules of Thumb 2–1: How Much Missing Data Is Too Much?
Missing data under 10% for an individual case or observation can generally be ignored, except when the missing data occurs in a specific nonrandom fashion (e.g., concentration in a specific set of questions, attrition at the end of the questionnaire, etc.).
The number of cases with no missing data must be sufficient for the selected analysis technique if replacement values will not be substituted (imputed) for the missing data.
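The 10% rule of thumb can be applied per case with a few lines of Python. This sketch is only an illustration: the function names, the 20-variable toy cases, and the decision to flag a case at exactly 10% are assumptions, and the code does not check the nonrandom-pattern exception mentioned in the rule.

```python
def missing_share(case):
    """Proportion of a case's values that are missing (None)."""
    return sum(v is None for v in case) / len(case)


def flag_cases(cases, threshold=0.10):
    """Indexes of cases whose missing share is not under the threshold."""
    return [i for i, case in enumerate(cases) if missing_share(case) >= threshold]


# Hypothetical cases measured on 20 variables each.
cases = [
    [1] * 20,                # nothing missing
    [1] * 19 + [None],       # 5% missing, can generally be ignored
    [None] * 3 + [1] * 17,   # 15% missing, needs attention
]

flagged = flag_cases(cases)
n_complete = sum(missing_share(case) == 0 for case in cases)
print(flagged, n_complete)
```

The count of complete cases is reported alongside the flags because the second part of the rule depends on it when no imputation is planned.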

18 Rules of Thumb 2–3
Under 10%: Any of the imputation methods can be applied when missing data is this low, although the complete case method has been shown to be the least preferred.
10 to 20%: The increased presence of missing data makes the all-available, hot deck case substitution, and regression methods most preferred for MCAR data, while model-based methods are necessary with MAR missing data processes.
Over 20%: If it is necessary to impute missing data when the level is over 20%, the preferred methods are the regression method for MCAR situations and model-based methods when MAR missing data occurs.
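These guidelines amount to a small decision table, which can be written down as a lookup function. The function name, the string labels, and the treatment of the exact 10% and 20% boundaries are assumptions made for illustration; the rule itself does not say which side the boundaries fall on.

```python
def preferred_methods(pct_missing, mechanism):
    """Map a missing-data level (in percent) and a mechanism to methods."""
    if mechanism not in ("MCAR", "MAR"):
        raise ValueError("mechanism must be 'MCAR' or 'MAR'")
    if pct_missing < 10:
        # Any imputation method is acceptable at this level.
        return ["any imputation method"]
    if pct_missing <= 20:
        if mechanism == "MCAR":
            return ["all-available", "hot deck case substitution", "regression"]
        return ["model-based"]
    # Over 20%: impute only if it is really necessary.
    return ["regression"] if mechanism == "MCAR" else ["model-based"]


print(preferred_methods(15, "MCAR"))
print(preferred_methods(25, "MAR"))
```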

19 Outliers
Observations with a unique combination of characteristics identifiable as distinctly different from the other observations
Beneficial and problematic outliers
Four classes
  Procedural error
  The result of an extraordinary event
  Extraordinary observations with no explanation
  Unique in their combination of values across the variables

20 (Figure) Observations that fall outside the ellipse more than 2 times

22 Testing the Assumptions of Multivariate Analysis
Why it is needed
  The complexity of the relationships
  The complexity of the analyses and of the results may mask the “signs” of assumption violations that are apparent in simpler univariate analyses
Tested twice
  First for the separate variables
  Second for the multivariate model variate

23 Normality
Normal distribution
  The benchmark for statistical methods
  Required to use the F and t statistics
Univariate normality does not imply multivariate normality
Multivariate normality implies univariate normality; whether it is required depends on the technique

24 Analysis and Test
Graphical analysis: normal probability plot
Statistical tests
  Skewness
  Kurtosis
Remedies
  Data transformations
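The skewness and kurtosis tests listed on this slide are often run as z statistics. The sketch below assumes the common large-sample approximations z = skewness / sqrt(6/N) and z = excess kurtosis / sqrt(24/N), which are not stated on the slide; the function name and the sample are hypothetical.

```python
import math


def skewness_kurtosis_z(values):
    """Moment-based skewness, excess kurtosis, and their approximate z values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("variance is zero; shape statistics are undefined")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return {
        "skewness": skew,
        "kurtosis": excess_kurt,
        "z_skewness": skew / math.sqrt(6 / n),
        "z_kurtosis": excess_kurt / math.sqrt(24 / n),
    }


# Hypothetical right-skewed sample of 40 observations.
sample = [1] * 20 + [2] * 10 + [3] * 5 + [10] * 3 + [25] * 2
result = skewness_kurtosis_z(sample)
print(result)
```

Under these approximations, an absolute z value beyond roughly 1.96 would be read as a departure from normality at the .05 level; with small samples the approximation is rough, so the normal probability plot remains a useful companion.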

26 Other Problems
Homoscedasticity: equal dispersion of variance
  Levene test; Box’s M test
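The Levene test named on this slide can be sketched from its definition: a one-way ANOVA F ratio computed on the absolute deviations of each observation from its group center. This minimal version uses the group mean as the center (other variants use the median), computes only the W statistic without a p-value, and relies on a hypothetical function name and toy groups.

```python
def levene_w(groups):
    """Levene's W statistic using absolute deviations from group means."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group mean.
    z_groups = []
    for g in groups:
        center = sum(g) / len(g)
        z_groups.append([abs(v - center) for v in g])
    z_means = [sum(z) / len(z) for z in z_groups]
    z_grand = sum(sum(z) for z in z_groups) / n_total
    between = sum(len(z) * (zm - z_grand) ** 2
                  for z, zm in zip(z_groups, z_means))
    within = sum((v - zm) ** 2
                 for z, zm in zip(z_groups, z_means) for v in z)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical groups: same spread vs. clearly different spread.
w_equal = levene_w([[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]])
w_unequal = levene_w([[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]])
print(w_equal, w_unequal)
```

W is compared with an F distribution on (k - 1, N - k) degrees of freedom; a large value indicates that the group variances differ.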

27 Other Problems (continued)
Linearity: a concern for correlation-based analyses
  Residual analysis
Absence of correlated errors
  Often due to the data collection process
Data transformation
  Typical transformations: squared, square root, logarithm, inverse, depending on the shape of the data
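How the typical transformations change the shape of a variable can be seen by comparing skewness before and after. This is a sketch on invented, strictly positive data; the `skewness` helper is hypothetical, and which transformation suits a real variable depends on its shape, as the slide notes.

```python
import math


def skewness(values):
    """Moment-based skewness of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positively skewed variable (all values greater than zero).
raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 50]

transformed = {
    "raw": raw,
    "square root": [math.sqrt(v) for v in raw],
    "logarithm": [math.log(v) for v in raw],
    "inverse": [1 / v for v in raw],
}

skews = {name: skewness(vals) for name, vals in transformed.items()}
for name, value in skews.items():
    print(f"{name:12s} skewness = {value:.2f}")
```

For this sample the square root and logarithm both pull in the long right tail, with the logarithm acting more strongly; the inverse also reverses the order of the values, which matters when interpreting the result.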

28 (Figure) For 2-9 (a), either variable can be squared to achieve linearity

33 Nonmetric Data with Dummy Variables
Many techniques require metric data
Use dummy (dichotomous) variables as replacement variables
With k categories, k – 1 dummy variables are needed
Indicator coding and effects coding: same results, with different interpretations of the dummy-variable coefficients
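The two coding schemes differ only in how the omitted (reference) category is represented, which a short sketch can show. The function names and the three-level example are hypothetical; both functions produce k - 1 columns for k categories.

```python
def indicator_coding(categories, reference):
    """Indicator (0/1) coding: the reference category is all zeros."""
    levels = [c for c in sorted(set(categories)) if c != reference]
    coded = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, coded


def effects_coding(categories, reference):
    """Effects coding: the reference category is -1 on every dummy."""
    levels = [c for c in sorted(set(categories)) if c != reference]
    coded = []
    for c in categories:
        if c == reference:
            coded.append([-1] * len(levels))
        else:
            coded.append([1 if c == level else 0 for level in levels])
    return levels, coded


# Hypothetical nonmetric variable with k = 3 categories.
values = ["low", "mid", "high", "mid"]

levels, ind = indicator_coding(values, reference="low")
_, eff = effects_coding(values, reference="low")
print(levels)
print(ind)
print(eff)
```

With indicator coding, each coefficient is a difference from the reference category; with effects coding, each coefficient is a difference from the mean across all categories, which is the change in interpretation the slide refers to.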
