
1 Pre-Processing & Item Analysis DeShon - 2005

2 Pre-Processing
- Method of pre-processing depends on the type of measurement instrument used
- General issues: responses within range? missing data, item directionality
- Scoring: transforming responses into numbers that are useful for the desired inference

3 Checking Response Range
- First step: make sure there are no observations outside the range of your measure
- If you use a 1-5 response scale, you can't have a response of 6
- Check with histograms and summary statistics (min, max)
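
A minimal sketch of this check in R, assuming a data frame named survey whose columns item1 through item10 should all lie between 1 and 5 (the data frame and column names are hypothetical):

  items <- survey[, paste0("item", 1:10)]
  summary(items)                                                # min and max for every item
  sapply(items, function(x) sum(x < 1 | x > 5, na.rm = TRUE))   # count of out-of-range responses per item
  hist(items$item1)                                             # quick visual check for a single item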

4 Reverse Scoring
- Used when combining multiple measures (e.g., items) into a composite
- All items should refer to the target trait in the same direction
- Formula: reversed score = (highest scale value + 1) - observed score
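
Applying that rule in R for a 1-5 scale, again with hypothetical column names:

  high <- 5
  survey$item2_r <- (high + 1) - survey$item2   # a 5 becomes 1, a 4 becomes 2, ..., a 1 becomes 5
  survey$item5_r <- (high + 1) - survey$item5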

5 Missing Data
- Huge issue in most behavioral research!
- Key issues:
  - Why are the data missing? Planned, missing randomly, response bias?
  - What is the best analytic strategy with missing data? Statistical power, biased results
- Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
- Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.

6 Causes of Missing Data
- Common in social research
- Nonresponse, loss to follow-up
- Lack of overlap between linked data sets
- Social processes (dropping out of school, graduation, etc.)
- Survey design ("skip patterns" between respondents)

7 Missing Data
- Step 1: Do everything ethically feasible to avoid missing data during data collection
- Step 2: Do everything ethically possible to recover missing data
- Step 3: Examine the amount and patterns of missing data
- Step 4: Use statistical models and methods that replace missing data or are unaffected by missing data

8 Missing Data Mechanisms
- Missing Completely at Random (MCAR)
- Missing at Random (MAR)
- Not Missing at Random (NMAR)
- Notation for the examples that follow: X = variable not subject to nonresponse (age); Y = variable subject to nonresponse (income)

9 Missing Completely at Random (MCAR)
- Probability of response is independent of X and Y
- Example: the probability that income is recorded is the same for all individuals, regardless of age or income

10 Missing at Random (MAR)
- Probability of response depends on X but not on Y
- Probability of missingness does not depend on unobserved information
- Example: the probability that income is recorded varies with age, but within a given age group it is unrelated to income

11 Not Missing at Random (NMAR)
- Probability of missingness does depend on unobserved information
- Example: the probability that income is recorded varies with income itself, and possibly with age as well
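
One way to see the three mechanisms is to simulate them. A sketch in R using the slides' age/income example; all names and parameter values are illustrative:

  set.seed(1)
  n <- 1000
  age    <- rnorm(n, 45, 10)
  income <- 20000 + 800 * age + rnorm(n, 0, 5000)

  # MCAR: every case has the same 20% chance of a missing income
  mcar <- ifelse(runif(n) < 0.20, NA, income)

  # MAR: missingness depends on age (observed), not on income itself
  p_mar <- plogis(-3 + 0.05 * age)
  mar   <- ifelse(runif(n) < p_mar, NA, income)

  # NMAR: missingness depends on the (unobserved) income value
  p_nmar <- plogis(-8 + 0.0001 * income)
  nmar   <- ifelse(runif(n) < p_nmar, NA, income)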

12 How Can You Tell?
- Look for patterns
- Run a logistic regression with your IVs predicting a dichotomous indicator (1 = missing; 0 = nonmissing)

Logistic regression coefficients:
                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     -5.058793    0.367083  -13.781   < 2e-16 ***
NEW.AGE          0.181625    0.007524   24.140   < 2e-16 ***
SEXMale         -0.847947    0.131475   -6.450  1.12e-10 ***
DVHHIN94         0.047828    0.026768    1.787    0.0740 .
DVSMKT94        -0.015131    0.031662   -0.478    0.6327
NEW.DVPP94 = 0   0.233188    0.226732    1.028    0.3037
NUMCHRON        -0.087992    0.048783   -1.804    0.0713 .
VISITS           0.012483    0.006563    1.902    0.0572 .
NEW.WT6         -0.043935    0.077407   -0.568    0.5703
NEW.DVBMI94     -0.015622    0.017299   -0.903    0.3665
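
The table above is standard glm() output from R. A call of roughly this form would produce it; the data frame name dat and the variable being flagged are placeholders, while the predictor names are taken from the output:

  dat$miss <- as.integer(is.na(dat$target))   # 1 = missing on the variable of interest, 0 = observed
  fit <- glm(miss ~ NEW.AGE + SEX + DVHHIN94 + DVSMKT94 + NEW.DVPP94 +
                    NUMCHRON + VISITS + NEW.WT6 + NEW.DVBMI94,
             family = binomial, data = dat)
  summary(fit)   # significant predictors of missingness (here, age and sex) argue against MCAR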

13 Missing Data Mechanisms
- If MAR or MCAR, the missing data mechanism is ignorable for full-information likelihood-based inferences
- If MCAR, the mechanism is also ignorable for sampling-based inferences (e.g., OLS regression)
- If NMAR, the mechanism is nonignorable, so any statistic could be biased

14 Missing Data Methods
- Always bad methods:
  - Listwise deletion
  - Pairwise deletion (a.k.a. available-case analysis)
  - Person or item mean replacement
- Often good methods:
  - Regression replacement
  - Full-information maximum likelihood (in SEM software; must have the full raw data set)
  - Multiple imputation

15 Listwise Deletion
- Assumes that the data are MCAR
- Only appropriate for small amounts of missing data
- Can lower power substantially
- Inefficient
- Now very rare
- Don't do it!

16 FIML - AMOS

17 Imputation-Based Procedures
- Missing values are filled in and the resulting "completed" data are analyzed
- Hot deck
- Mean imputation
- Regression imputation
- Some imputation procedures (e.g., Rubin's multiple imputation) are really model-based procedures

18 Mean Imputation
- Technique:
  - Calculate the mean over cases that have values for Y
  - Impute this mean where Y is missing
  - Ditto for X1, X2, etc.
- Implicit models: Y = μ_Y, X1 = μ_1, X2 = μ_2
- Problems:
  - Ignores relationships among X and Y
  - Underestimates covariances
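
A small R demonstration of the covariance shrinkage, with simulated x and y (all values illustrative):

  set.seed(2)
  x <- rnorm(500)
  y <- 0.6 * x + rnorm(500, 0, 0.8)
  y_mis <- y
  y_mis[sample(500, 150)] <- NA                        # 30% of y missing completely at random

  y_imp <- y_mis
  y_imp[is.na(y_imp)] <- mean(y_mis, na.rm = TRUE)     # mean imputation

  cov(x, y)       # close to the true covariance (about 0.6)
  cov(x, y_imp)   # noticeably smaller: imputed cases contribute nothing to the covariance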

19 Regression Imputation
- Technique and implicit models:
  - If Y is missing, impute the mean of cases with similar values for X1, X2: Y = β0 + β1·X1 + β2·X2
  - Likewise, if X2 is missing, impute the mean of cases with similar values for X1, Y: X2 = β0 + β1·X1 + β2·Y
  - If both Y and X2 are missing, impute means of cases with similar values for X1: Y = β0 + β1·X1 and X2 = β0 + β1·X1
- Problem: ignores random components (no ε), so it underestimates variances and standard errors
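
A sketch of deterministic regression imputation for a missing Y in R, assuming a data frame dat with y partly missing and x1, x2 complete (names hypothetical):

  fit  <- lm(y ~ x1 + x2, data = dat)                    # lm() drops the incomplete cases when fitting
  need <- is.na(dat$y)
  dat$y[need] <- predict(fit, newdata = dat[need, ])     # conditional means, no residual draw
  # Because no random error is added, the imputed y values are too "tidy",
  # so variances and standard errors computed from the completed data are understated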

20 Little and Rubin's Principles
- Imputations should be:
  - Conditioned on observed variables
  - Multivariate
  - Draws from a predictive distribution
- Single-imputation methods do not provide a means to correct standard errors for estimation error

21 Multiple Imputation
- Context: multiple regression (in general)
- Missing values are replaced with "plausible" substitutes based on a distribution or model
- Construct m > 1 simulated complete versions of the data set
- Analyze each of the m simulated complete data sets by standard methods
- Combine the m estimates; get confidence intervals using Rubin's rules (micombine)
- Advantage: sampling variability is taken into account by restoring error variance

22 Multiple Imputation (Rubin, 1987, 1996)
- Point estimate: the average of the m completed-data estimates
- Total variance: within-imputation variance plus (1 + 1/m) times the between-imputation variance across the imputations

23 Another View
- Flow: INCOMPLETE DATA → IMPUTATION → M IMPUTED DATA SETS → ANALYSIS → M ANALYSIS RESULTS → POOLING → FINAL RESULTS
- IMPUTATION: impute the missing entries of the incomplete data set M times, resulting in M complete data sets
- ANALYSIS: analyze each of the M completed data sets using weighted least squares
- POOLING: integrate the M analysis results into a final result; simple rules exist for combining the M analyses
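
In R, this impute → analyze → pool sequence is what the mice package automates. A minimal sketch, using the variable names from the SAS example later in the deck, an assumed data frame dat, and an ordinary least-squares model for illustration rather than the weighted least squares mentioned above:

  library(mice)
  imp   <- mice(dat, m = 5, seed = 123)                         # step 1: create 5 completed data sets
  fits  <- with(imp, lm(weight ~ maleness + years_over_20))     # step 2: analyze each one
  pooled <- pool(fits)                                          # step 3: combine with Rubin's rules
  summary(pooled)                                               # pooled estimates, standard errors, df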

24 How Many Imputations? (about 5 is typically enough)
- Relative efficiency of an estimate: (1 + γ/m)^(-1)
  - γ = fraction of missing information
  - m = number of imputations
- If 30% of the information is missing: 3 imputations → 91% efficiency, 5 imputations → 94%, 10 imputations → 97%
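
The percentages follow directly from the formula; a one-line check in R:

  relative_efficiency <- function(gamma, m) 1 / (1 + gamma / m)
  round(relative_efficiency(0.30, c(3, 5, 10)), 2)   # 0.91 0.94 0.97 with 30% missing information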

25 Imputation in SAS
- PROC MI
  - By default generates 5 imputed values for each missing value
  - Imputation method: MCMC (Markov chain Monte Carlo)
  - The EM algorithm determines starting values
  - MCMC repeatedly simulates the distribution of interest, from which the imputed values are drawn
  - Assumption: the data follow a multivariate normal distribution
- PROC REG
  - Fits five weighted linear regression models to the five complete data sets obtained from PROC MI (using the BY _Imputation_ statement)
- PROC MIANALYZE
  - Reads the parameter estimates and associated covariance matrix from the analyses performed on the multiply imputed data sets and derives valid statistics for the parameters

26 Example
- Case 1 is missing weight
  - Given case 1's sex and age, generate a plausible distribution for case 1's weight
  - At random, sample 5 (or more) plausible weights for case 1
  - Impute Y!
- For case 6, sample from the conditional distribution of age (use Y to impute X!)
- For case 7, sample from the conditional bivariate distribution of age and weight

27 Example

/* Impute: creates 5 completed data sets, stacked and indexed by _Imputation_ */
PROC MI DATA=missing_weight_age OUT=weight_age_mi;
  VAR years_over_20 weight maleness;
RUN;

/* Analyze: fit the regression within each imputed data set;
   OUTEST= and COVOUT save the estimates and covariances for PROC MIANALYZE */
PROC REG DATA=weight_age_mi OUTEST=parameters COVOUT;
  MODEL weight = maleness years_over_20;
  BY _Imputation_;
RUN;

28 Example – Standard Errors
- Total variance in b0 = variation due to sampling + variation due to imputation: Mean(s²_b0) + Var(b0)
- Actually, there is a correction factor of (1 + 1/M) for the number of imputations M (here M = 5)
- So the total variance in estimating β0 is Mean(s²_b0) + (1 + 1/M)·Var(b0) = 179.53 + (1.2)(511.59) = 793.44
- The standard error is √793.44 = 28.17
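
The same arithmetic, reproduced in R with the numbers above:

  within  <- 179.53            # Mean(s²_b0): mean of the five squared standard errors
  between <- 511.59            # Var(b0): variance of the five intercept estimates
  m <- 5
  total <- within + (1 + 1/m) * between   # 793.44
  sqrt(total)                             # 28.17, the pooled standard error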

29 Example

PROC MIANALYZE DATA=parameters;
  VAR intercept maleness years_over_20;
RUN;

Multiple Imputation Parameter Estimates
Parameter        Estimate      Std Error    95% Confidence Limits       DF
intercept      178.564526     28.168160     70.58553    286.5435    2.2804
maleness        67.110037     14.801696     21.52721    112.6929    3.1866
years_over_20   -0.960283      0.819559     -3.57294      1.6524    2.9910

Other software: www.stat.psu.edu/~jls/misoftwa.html

30 Item Analysis
- Relevant for tests with a right/wrong answer
- Score each item so that 1 = right and 0 = wrong
- Where do the answers come from? Rational analysis or empirical keying

31 Item Analysis
- Goal of item analysis:
  - Determine the extent to which an item is useful in differentiating individuals with respect to the focal construct
  - Improve the measure for future administrations

32 Item Analysis
- Item analysis provides information on the effectiveness of the individual items for future use

33 Typology of Item Analysis
- Classical (classical test theory)
- Item response theory: Rasch, IRT2 (two-parameter), IRT3 (three-parameter)

34 Item Analysis
- Classical analysis is the easiest and most widely used form of analysis
- The statistics can be computed by generic statistical packages (or by hand) and need no specialist software
- The item statistics apply only to that group of testees on that collection of items
- Sample dependent!

35 Classical Item Analysis
- Item difficulty
  - Proportion correct (1 = correct; 0 = wrong); the higher the proportion, the easier the item
  - In general, you need a wide range of item difficulties to cover the range of the trait being assessed
  - For a mastery test, item difficulties need to cluster around the cut score
  - Very easy (p = 1.0) or very hard (p = 0.0) items are useless
  - Most variance at p = .5
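
With items scored 1 = correct and 0 = wrong, difficulty is just the column mean. A short R sketch assuming a complete 0/1 matrix named responses (persons in rows, items in columns):

  difficulty <- colMeans(responses, na.rm = TRUE)
  sort(round(difficulty, 2))            # higher value = easier item
  difficulty * (1 - difficulty)         # item variance p(1 - p), largest at p = .5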

36 Classical Item Analysis
- Item discrimination – two methods:
  - Difference in proportion correct between high and low test-score groups (upper and lower 27%)
  - Item-total correlation (part of the output of Cronbach's alpha routines)
- No negative discriminators: check the key or drop the item
- Zero discriminators are not useful
- Item difficulty and discrimination are interdependent
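
Both discrimination indices are easy to compute by hand in R, using the same hypothetical responses matrix as above:

  total <- rowSums(responses)

  # Method 1: difference in proportion correct, upper vs. lower 27% of total scores
  hi <- responses[total >= quantile(total, 0.73), ]
  lo <- responses[total <= quantile(total, 0.27), ]
  disc_hl <- colMeans(hi) - colMeans(lo)

  # Method 2: corrected item-total correlation (item removed from its own total)
  disc_itc <- sapply(seq_len(ncol(responses)),
                     function(j) cor(responses[, j], total - responses[, j]))

Cronbach's alpha routines in general statistical packages report the item-total correlations as part of their standard output, so the second index usually comes free with a reliability analysis.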

37 Classical Item Analysis

