Missing Values C5.2 Data Screening. Missing Data Use the summary function to check out the missing data for your dataset. summary(notypos)


1 Missing Values C5.2 Data Screening

2 Missing Data Use the summary function to check out the missing data for your dataset. summary(notypos)
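
As a small illustration of what that check looks like, here is a sketch on made-up data. The toy `notypos` data frame below is a hypothetical stand-in for the course dataset, which is not shown in the slides.

```r
# Hypothetical toy data standing in for the slides' `notypos` dataset
notypos <- data.frame(
  q1 = c(1, 2, NA, 4, 5),
  q2 = c(3, NA, NA, 2, 1),
  q3 = c(5, 4, 3, 2, 1)
)

# summary() adds an "NA's" entry for each column that has missing values
print(summary(notypos))

# A more compact view: the number of NA values in each column
na_counts <- colSums(is.na(notypos))
print(na_counts)
```

Columns with no missing values simply show no "NA's" entry in the summary output.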

3 Missing Data Missing data is an important problem. First, ask yourself, “why is this data missing?” – Because you forgot to enter it? – Because there’s a typo? – Because people skipped one question? Or the whole end of the scale?

4 Missing Data Two Types of Missing Data: – MCAR – missing completely at random (you want this) – MNAR – missing not at random (eek!) There are ways to test for the type, but usually you can see it: – Randomly missing data appears all across your dataset. – If everyone missed question 7, that’s not random. – To look, click on the dataset or use the View() function.

5 Missing Data MCAR – probably caused by skipping a question or missing a trial. MNAR – may be the question that’s causing a problem. – For instance, what if you surveyed campus about alcohol abuse? What does it mean if everyone skips the same question?

6 Missing Data How much can I have? – Depends on your sample size – in large datasets <5% is ok. – Small samples = you may need to collect more data. Please note: there is a difference between “missing data” and “did not finish the experiment”.

7 Missing Data How do I check if it’s going to be a big deal? – Try running your analysis on the dataset with missing data versus the dataset with the missing data filled in. – In R that’s easy! Yeah! You just change out the name of the dataset you are using, since we are saving them separately as we go.

8 Missing Data Deleting people / variables You can exclude people “pairwise” or “listwise” – Pairwise – only excludes people when they have missing values for that analysis – Listwise – excludes them for all analyses Variables – if it’s just an extraneous variable (like GPA) you can just delete the variable
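
The difference between the two exclusion rules can be sketched in base R. The data frame `dat` and its columns are made up for illustration; `na.omit()` and the `use` argument of `cor()` are standard base R.

```r
# Hypothetical toy data; the column names are illustrative only
dat <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 4, NA, 8, 10, 12),
  z = c(6, 5, 4, NA, 2, 1)
)

# Listwise: drop every row with any NA, then use what is left for all analyses
listwise <- na.omit(dat)
n_listwise <- nrow(listwise)   # rows 3 and 4 are dropped

# Pairwise: each correlation uses every row that is complete for that pair
r_pairwise <- cor(dat, use = "pairwise.complete.obs")

# How many rows each pair of variables actually gets to use
n_pairwise <- crossprod(1 - is.na(dat))
print(n_pairwise)
```

Listwise deletion keeps the same people in every analysis, while pairwise deletion keeps more data at the cost of a different sample size per analysis.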

9 Missing Data What if you don’t want to delete people (using special people or can’t get others)? – Several estimation methods to “fill in” missing data

10 Missing Data Mean substitution – the old way to enter missing data – Conservative – doesn’t change the mean values used to find significant differences – Does change the variance, which may cause significance tests to change with a lot of missing data
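
A quick sketch of why mean substitution leaves the mean alone but shrinks the variance. The `scores` vector is made up for illustration.

```r
# Hypothetical scores with two missing values
scores <- c(2, 4, NA, 6, 8, NA, 10)

# Mean substitution: replace each NA with the mean of the observed values
filled <- scores
filled[is.na(filled)] <- mean(scores, na.rm = TRUE)

mean_observed <- mean(scores, na.rm = TRUE)
mean_filled   <- mean(filled)              # same as mean_observed
var_observed  <- var(scores, na.rm = TRUE)
var_filled    <- var(filled)               # smaller than var_observed

print(c(mean_observed, mean_filled, var_observed, var_filled))
```

Every filled-in value sits exactly on the mean, so it adds nothing to the spread while still increasing the sample size, which is what pulls the variance down.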

11 Missing Data Multiple imputation / expectation maximization – now considered the best at replacing missing data – Creates a set of expected values for each missing point – Using matrix algebra, the program estimates the probability of each value and picks the highest one

12 Missing Data DO NOT mean replace categorical variables – You can’t be 1.5 gender. – So, either leave them out OR pairwise eliminate them (aka eliminate only for the analysis they are used in).

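13 Missing Data Decision flow for your data: – Categorical / IVs -> STOP – Continuous / DVs: is it MNAR or MCAR? – MNAR -> STOP – MCAR: how much is missing? – More than 5% -> STOP – Less than 5% -> MICE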
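
The decision flow on this slide can be written as a small function. `impute_decision` and its argument names are hypothetical, not part of the course code, and treating exactly 5% as acceptable is an assumption since the slide only says "more" and "less".

```r
# Hypothetical helper that encodes the slide's decision flow
impute_decision <- function(is_continuous, is_mcar, pct_missing) {
  if (!is_continuous) return("STOP")     # categorical variables / IVs
  if (!is_mcar) return("STOP")           # MNAR data
  if (pct_missing > 5) return("STOP")    # too much missing to replace
  "MICE"                                 # continuous, MCAR, 5% or less (assumed cutoff)
}

print(impute_decision(is_continuous = TRUE,  is_mcar = TRUE,  pct_missing = 3))
print(impute_decision(is_continuous = TRUE,  is_mcar = TRUE,  pct_missing = 12))
print(impute_decision(is_continuous = FALSE, is_mcar = TRUE,  pct_missing = 3))
```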

14 Missing Data Figure out what you can replace. – First, figure out the percent missing by column. – Then, figure out the percent missing by row. Let’s write a function!

15 Missing Data Make up our own percent missing function.
percentmiss = function(x){ ##save the function; this line says make a new function
  sum(is.na(x)) / ##this line totals up the number of NA values
    length(x) * 100 ##divide by the length of the values, then multiply by 100 to get the percent
} ##close function
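
To check that the function does what we expect, here is a sketch that calls it on two made-up vectors. The input vectors are illustrative only.

```r
# Percent of values in x that are NA
percentmiss <- function(x) {
  sum(is.na(x)) / length(x) * 100
}

half_missing <- percentmiss(c(1, NA, 3, NA))   # 2 of 4 values are NA
none_missing <- percentmiss(c(1, 2, 3))        # no NA values

print(half_missing)
print(none_missing)
```

Because `is.na(x)` is a vector of TRUE/FALSE values, `mean(is.na(x)) * 100` gives the same answer in one step.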

16 Missing Data Let’s use apply to get percent missing by columns and rows. – apply(notypos, 2, percentmiss) ##columns – We will have to exclude several of these columns.

17 Missing Data Now, let’s use apply to get percent missing by rows – apply(notypos, 1, percentmiss) Too much info! Save the row percentages and tabulate them instead: – missing = apply(notypos, 1, percentmiss) – table(missing)
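
Here is a sketch of both apply calls on a tiny made-up dataset, so the column and row results can be checked by eye. The toy `notypos` data frame is a hypothetical stand-in for the course data.

```r
percentmiss <- function(x) {
  sum(is.na(x)) / length(x) * 100
}

# Hypothetical toy data: 5 people, 4 questions
notypos <- data.frame(
  q1 = c(1, NA, 3, 4, 5),
  q2 = c(2, NA, 3, 4, 5),
  q3 = c(1, NA, 3, NA, 5),
  q4 = c(1, 2, 3, 4, 5)
)

# Margin 2 = columns: percent missing for each question
by_column <- apply(notypos, 2, percentmiss)

# Margin 1 = rows: percent missing for each person
missing <- apply(notypos, 1, percentmiss)

# table() counts how many people fall at each percent-missing value
print(by_column)
print(table(missing))
```

The row version is the one to save as `missing`, since it has one value per person and can be used to pick rows.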

18 Missing Data Install the mice package. Load the mice library. Select only the data that you want to run mice on: – Eliminate bad rows. – Eliminate bad columns. – Bring them all back together

19 Missing Data
##subset out the bad rows
replacepeople = notypos[ missing < 6, ] ##note we are going to fudge a little bit
dontpeople = notypos[ missing >= 6, ]

20 Missing Data
##figure out the columns to exclude
replacecolumn = replacepeople[, -c(1, 3, 13)]
dontcolumn = replacepeople[, c(1,3,13)]
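
The row and column subsetting steps can be sketched end to end on made-up data. The toy dataset below has one id column and 20 items, so only column 1 is excluded here. The columns 1, 3, and 13 in the slides belong to the course dataset, which is not shown.

```r
percentmiss <- function(x) {
  sum(is.na(x)) / length(x) * 100
}

# Hypothetical toy data: 6 people, an id column plus 20 items (V1 to V20)
items <- as.data.frame(matrix(rep(1:5, length.out = 6 * 20), nrow = 6))
notypos <- cbind(id = paste0("p", 1:6), items)
notypos[2, "V3"] <- NA     # person 2 skipped one item (about 4.8% missing)
notypos[5, 6:15] <- NA     # person 5 skipped half the scale (about 47.6% missing)

# Percent missing per person, then split the rows
missing <- apply(notypos, 1, percentmiss)
replacepeople <- notypos[missing < 6, ]
dontpeople <- notypos[missing >= 6, ]

# Split the columns; drop = FALSE keeps a single column as a data frame
replacecolumn <- replacepeople[, -1]
dontcolumn <- replacepeople[, 1, drop = FALSE]

print(dim(replacecolumn))
print(dim(dontpeople))
```

With a single excluded column, `drop = FALSE` matters: without it R would return a plain vector instead of a data frame, which changes how cbind labels the column later.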

21 Missing Data Now run mice! Set a temporary place holder: – tempnomiss = mice(DATASET) – tempnomiss = mice(replacecolumn) This function figures out what and how to replace for you.

22 Missing Data Now, put the replaced data back into your dataset. – nomiss = complete(tempnomiss, 1) – General form: complete(the object you got back from mice, an imputation number from 1 to m); by default mice creates m = 5 imputed datasets – summary(nomiss)
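
A minimal sketch of the mice and complete steps, assuming the mice package is installed. It uses `nhanes`, a small example dataset that ships with mice, as a stand-in for `replacecolumn`. The `printFlag` and `seed` arguments are optional extras not used in the slides.

```r
# Only run if mice is installed; install.packages("mice") otherwise
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  # Stand-in for the slides' replacecolumn: a small dataset bundled with mice
  replacecolumn <- nhanes

  # Run the imputation quietly; seed makes the result repeatable
  tempnomiss <- mice(replacecolumn, printFlag = FALSE, seed = 123)

  # Pull out the first of the imputed datasets
  nomiss <- complete(tempnomiss, 1)

  # Count the NA values left after imputation
  remaining_na <- sum(is.na(nomiss))
  print(summary(nomiss))
}
```

The second argument to complete() picks which of the imputed datasets to use, so other values up to m give slightly different filled-in values.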

23 Missing Data Put everything back together – We want to take our replaced data – And add back in the columns we couldn’t replace (dontcolumn): filledin_none = cbind(dontcolumn, nomiss) – And add back in the rows we couldn’t replace (dontpeople): filledin_missing = rbind(dontpeople, filledin_none)
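
The recombining step can be sketched with tiny made-up pieces. The three data frames below are hypothetical stand-ins for `dontcolumn`, `nomiss`, and `dontpeople`. Note that `dontpeople` has its columns in a different order, as it would in the slides after cbind moves the excluded columns to the front.

```r
# Hypothetical stand-ins for the pieces built in the earlier steps
dontcolumn <- data.frame(id = c("p1", "p2", "p3"))            # columns we did not replace
nomiss     <- data.frame(q1 = c(1, 2, 3), q2 = c(4, 5, 6))    # data filled in by mice
dontpeople <- data.frame(q1 = NA_real_, id = "p4", q2 = 7)    # row we did not replace

# Columns back on: same people, in the same row order
filledin_none <- cbind(dontcolumn, nomiss)

# Rows back on: rbind matches data frame columns by name, not by position
filledin_missing <- rbind(dontpeople, filledin_none)

print(filledin_missing)
```

cbind assumes the rows of both pieces are still in the same order, so avoid sorting or filtering either piece between the split and the recombine.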

