# Stata’s mishandling of missing data

Missing data in logical and relational expressions: a problem and two solutions.


Stata’s conventions:

a. Relations (such as ‘>’) treat missing data as ‘positive infinity’; so relations are never undefined, they are simply true or false.

b. Logical operators (and ‘… if’) treat missing values as ‘true’; so logical expressions are never undefined, they are simply true or false.

(Strictly, (b) is an isolable subset of a more general rule: (c) logical operators treat all non-zero values as ‘true’. Rule (c), when detached from (a) and (b), may be eccentric but is not pernicious.)
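These conventions can be checked directly at the command line. The following is a minimal sketch; the particular numbers are illustrative and do not come from the slides.

```stata
* Convention (a): in relations, missing behaves as 'positive infinity'.
* Each of these three lines prints 1.
display (. > 5)
display (5 < .)
display (.a > .)

* Convention (b): logical operators treat missing as 'true'.
* These two lines print 1 ...
display (. & 1)
display (. | 0)
* ... and this one prints 0, because the negation of 'true' is 'false'.
display (!.)

* Rule (c): any non-zero value counts as 'true'; this prints 1.
display (-3 & 2)

* By contrast, arithmetic propagates missing; this prints a single dot.
display (1 + .)
```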

First criticism: Commands should do what they seem to do.

Response? Users should understand the conventions; it is then simple to test for missing data as appropriate. No big deal.

Responding, second criticism: The proffered prophylactic strategy does not scale well. It is messy and error-prone.

1. Normal truth table (‘.’ marks a missing, that is indeterminate, value)

| p | q | !p | p&q | p\|q |
|---|---|----|-----|------|
| 1 | 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 | 1 |
| 1 | . | 0 | . | 1 |
| 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | . | 1 | 0 | . |
| . | 1 | . | . | 1 |
| . | 0 | . | 0 | . |
| . | . | . | . | . |

1. Normal truth table, in Stata

| p | q | !p | p&q | p\|q |
|---|---|----|-----|------|
| 1 | 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 | 1 |
| 1 | . | 0 | 1 | 1 |
| 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | . | 1 | 0 | 1 |
| . | 1 | 0 | 1 | 1 |
| . | 0 | 0 | 0 | 1 |
| . | . | 0 | 1 | 1 |

1. Normal relation (‘.’ marks a missing value)

| p | q | p+q | p>q |
|---|---|-----|-----|
| 1 | 1 | 2 | 0 |
| 1 | 0 | 1 | 1 |
| 1 | . | . | . |
| 0 | 1 | 1 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | . | . | . |
| . | 1 | . | . |
| . | 0 | . | . |
| . | . | . | . |

1. Normal relation, in Stata

| p | q | p+q | p>q |
|---|---|-----|-----|
| 1 | 1 | 2 | 0 |
| 1 | 0 | 1 | 1 |
| 1 | . | . | 0 |
| 0 | 1 | 1 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | . | . | 0 |
| . | 1 | . | 1 |
| . | 0 | . | 1 |
| . | . | . | 0 |
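The ‘in Stata’ columns of the tables above can be reproduced on a nine-row grid of every combination of 1, 0 and missing. This is only a sketch: the variable names (`notp`, `pandq`, `porq`, `psumq`, `pgtq`) are made up for the illustration.

```stata
clear
input p q
1 1
1 0
1 .
0 1
0 0
0 .
. 1
. 0
. .
end

* Logical operators: missing is treated as 'true'
generate notp  = !p
generate pandq = p & q
generate porq  = p | q

* Arithmetic propagates missing; the relation treats it as +infinity
generate psumq = p + q
generate pgtq  = p > q

list, noobs separator(3)

* Spot checks against the tables (observation 3 is p=1, q=. and so on)
assert notp[7] == 0 & pandq[3] == 1 & porq[9] == 1
assert mi(psumq[3]) & pgtq[3] == 0 & pgtq[7] == 1
```

Only `psumq` ever comes out missing; the logical and relational columns are always 0 or 1, which is the behaviour the slides object to.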


2. Test for missing data?

A single relation is easily guarded:

- `… if (a>b)`
- `… if (a>b) & !mi(a,b)`

A compound condition is not. The obvious blanket guard

- `… if (a>b|c>d)`
- `… if (a>b|c>d) & !mi(a,b,c,d)`

fails: `… if (4>3|.>2) & !mi(4,3,.,2)` is false, although 4>3 alone is enough to make the disjunction true. Each relation needs its own guard:

- `… if ((a>b) & !mi(a,b)) | ((c>d) & !mi(c,d))`

This is correct, but messy.
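The difference between the three conditions shows up on a tiny dataset. The sketch below takes the slide's values (4, 3, ., 2) as its first observation and adds two further observations that are invented for the illustration.

```stata
clear
input a b c d
4 3 . 2
1 3 . 2
1 3 5 2
end

* Unguarded: '. > 2' counts as true, so every observation is selected
count if (a>b | c>d)
assert r(N) == 3

* Blanket guard: observation 1 is lost, although 4>3 settles the matter
count if (a>b | c>d) & !mi(a,b,c,d)
assert r(N) == 1

* Term-by-term guard: observations 1 and 3 are selected; observation 2,
* whose truth value is genuinely unknown, is not
count if ((a>b) & !mi(a,b)) | ((c>d) & !mi(c,d))
assert r(N) == 2
```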

2. Generating new variables: even messier

Consider:

`. generate v = p&q`

We want this to be:

- true when p&q is true
- false when p&q is false
- indeterminate when p&q is indeterminate

Stata suggests two commands:

- `. generate v = 0 if !(p&q)`
- `. replace v = 1 if p&q & !mi(p,q)`

or alternatively (when p and q are indicator variables):

- `. generate v = 0 if p==0 | q==0`
- `. replace v = 1 if p==1 & q==1`

Stata can manage with one command:

- `. generate v = p&q if !(p&q)|!mi(p,q)`

Alternatively, a single command with nested cond():

`. generate v = cond(p,cond(q,1,0,.),0,cond(q,.,0))`

This relies on the two forms of cond(): “cond(p,T,F,.)” versus “cond(p,T or ., F)”. The four-argument form has a separate branch for missing p; the three-argument form sends missing p down the ‘true’ branch.

Hard to ‘read’, but systematic.
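The two-command, one-command and nested-cond() formulations can be checked against one another on the full grid of true, false and missing values. A minimal sketch follows; the names `v1`, `v2` and `v3` are invented for the comparison.

```stata
clear
input p q
1 1
1 0
1 .
0 1
0 0
0 .
. 1
. 0
. .
end

* Two commands
generate v1 = 0 if !(p & q)
replace  v1 = 1 if (p & q) & !mi(p, q)

* One command with a guarding 'if'
generate v2 = p & q if !(p & q) | !mi(p, q)

* One command with nested cond()
generate v3 = cond(p, cond(q, 1, 0, .), 0, cond(q, ., 0))

* All three agree; note that Stata evaluates '. == .' as true
assert v1 == v2 & v2 == v3

* The result is false whenever either operand is known to be false,
* and missing only when the outcome really is indeterminate
assert v3[6] == 0 & v3[8] == 0 & mi(v3[3]) & mi(v3[7]) & mi(v3[9])

list, noobs separator(3)
```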

Users should be able to write

`. generate v = p&q`

and not tangle with the complexities of the preceding slides. (And in real life, ‘p’ and ‘q’ are themselves likely to be expressions, logical or relational, so Stata’s current missing-data tests become even hairier.)

Two solutions?

- Use my program validly, to validly specify recodes and conditionals
- Persuade Stata to validly specify recodes and conditionals

Validly

validly is a conventional Stata program but, having an adverbial name, appears to be a modifier of other commands.

validly generate has the functionality of generate but, in contrast to generate, gives the correct result when missing data are encountered within relational or logical expressions. Likewise validly replace. Likewise validly assert.

validly Stata_conditional_command executes the specified conditional_command but, in contrast to Stata’s execution of the ‘unwrapped’ command, gives the correct result when missing data are encountered within relational or logical expressions in the condition.

Validly - syntax

`validly [generate|gen|replace] newvar | varname = exp [if] [in] [,options]`

As generate or replace, but using valid functional forms for the expression(s). validly generate requires newvar; validly replace requires the varname of an existing variable.

`validly assert exp [if] [in] [,options]`

As assert, but using valid functional forms for the expression(s).

For other non-assignment conditional commands which use if, validly can act as a ‘wrapper’:

`validly any_conditional_command [,,validly_options]`

or (the same syntax, expressed differently)

`validly command parameters if [in] [weight][,command_options] [,,validly_options]`

validly replaces the conditional expression by a valid functional form, and executes the ‘wrapped’ command. (validly’s options appear after double commas, to differentiate them from the command’s options.)

Validly - strategy

validly takes the relevant expression(s), parses the relational and logical operators into RPN form, and from that builds, by iterative insertion into a macro, complex cond expression(s) [as in our earlier example] which can be executed.

For: it works; nested ‘conds’ were the only replicable strategy I could devise to handle missing data, given Stata’s conventions.

Against: the rebarbative results are computationally expensive.

Validly - examples


Proposal

- Stata’s relational operators should behave as do Stata’s algebraic operators with regard to missing data.
- Stata’s logical operators should follow the expected rules when encountering missing data. (Further, when evaluating the truth of an expression, ‘missing’ should not count as ‘true’.)

Arguments against ‘logic’

1. It is complex/confusing
2. It generates notable inconsistencies
3. It requires ‘several rules’

Arguments against ‘logic’ 1 - Complex/confusing?

‘All these statements can be made to work, but they are complicated and yield some surprising results (such as the drop/keep inconsistency shown [above]). We feel that most users, including ourselves, would find this more confusing than the system currently in place.’

Gould, W (2003) “Logical expressions and missing values” www.stata.com/support/faqs/data/values.html

The choice, remember, is between (on the current coding) having to write something like:

`. generate v = p|q if !mi(p,q) | (p & !mi(p)) | (q & !mi(q))`

or (on the proposed coding) being able to write:

`. generate v = p|q`

It is not entirely self-evident that the shorter is ‘more confusing’.
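The guarded form of p|q can be checked against a nested cond() written in the same spirit as the earlier p&q example. This is a sketch only: the cond() expression for ‘or’ is worked out here for the comparison and does not appear in the slides, and `v1` and `v2` are invented names.

```stata
clear
input p q
1 1
1 0
1 .
0 1
0 0
0 .
. 1
. 0
. .
end

* Guarded 'or', as required under Stata's current conventions
generate v1 = p|q if !mi(p,q) | (p & !mi(p)) | (q & !mi(q))

* The same three-valued 'or' as a nested cond():
* p known true -> 1; p false -> whatever q is; p missing -> 1 only if q known true
generate v2 = cond(p, 1, cond(q, 1, 0, .), cond(q, 1, ., .))

assert v1 == v2

* True whenever either operand is known to be true, even if the other is missing
assert v1[3] == 1 & v1[7] == 1
* Missing when the outcome is genuinely indeterminate
assert mi(v1[6]) & mi(v1[8]) & mi(v1[9])

list, noobs separator(3)
```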

Arguments against ‘logic’ 2 - Inconsistencies?

‘Changing to a three-valued logic might make some comparisons more what one might expect but will introduce inconsistencies elsewhere’.

The only example adduced (trailed by Gould as a ‘notable inconsistency’) is that, under the proposed rules, a command such as `keep if age>65` is no longer the same as `drop if age<=65`.

‘In the current system, … missing values are … treated as positive infinity. Once this fact is absorbed … drop and keep statements work as one would expect.’

But if a sample has three groups (those known to be over 65, those 65 or younger, and those for whom we lack age information) it is surely self-evident that dropping one group should not be the same as keeping one other.

Note: `keep if age>65` would only work as one would expect if one should expect that those in the sample lacking age information properly belong in the group of the retired.
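A three-observation sketch makes the point concrete. The ages are invented: one person known to be over 65, one known to be younger, and one whose age is missing.

```stata
clear
input age
70
60
.
end

* What 'keep if age>65' would retain: the 70-year-old AND the unknown
count if age > 65
assert r(N) == 2

* What 'drop if age<=65' would retain: the same two observations
count if !(age <= 65)
assert r(N) == 2

* Those actually known to be over 65
count if age > 65 & !mi(age)
assert r(N) == 1
```

The keep and drop forms agree with each other only because the observation with unknown age is silently counted among the over-65s.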

Arguments against ‘logic’ 3 - Several rules?

Under the proposal ‘you would have to remember several rules for how missing values were handled in different situations instead of just one rule’.

My proposal is that we adopt one rule: ‘missing values are treated as missing’.

In the current system, missing values are sometimes missing (as in algebra), sometimes invisible (as in max), sometimes infinity (sometimes even, when contrasting .a and .b, distinct infinities), and sometimes ‘true’.

Proposal reiterated

- Stata’s relational operators should behave as do Stata’s algebraic operators with regard to missing data.
- Stata’s logical operators should follow the expected rules when encountering missing data. (Further, when evaluating the truth of an expression, ‘missing’ should not count as ‘true’.)

End of Polemic

How many of these do what they seem to do?

- i. `… if age>50`
- ii. `… if unemployed`
- iii. `… if a==2 & b==2`
- iv. `… if a!=2 & b!=2`
- v. `… if !(a==2 & b==2) & !mi(a,b)`
- vi. `… if age>50 & !mi(age)`
- vii. `… if log(assets)>2 & !mi(assets)`
- viii. `… if a==2 & b==c`
- ix. `… if (a!=2 | b!=2) & !mi(a,b)`
- x. `… if assets/(inc - expend) > 100 & !mi(assets,inc,expend)`
- xi. `. gen v = a==2 | b==2`
- xii. `. gen v = (a==2 | b==2) & !mi(a,b)`

E.g., to handle

`. generate v = (a>b) & (c>d)`

we need something along the lines of:

- `. generate p = a>b if !mi(a,b)`
- `. generate q = c>d if !mi(c,d)`
- `. generate v = 0 if !(p&q)`
- `. replace v = 1 if (p&q) & !mi(p,q)`

Footnote on ‘max’

max(x1,x2,...,xn). Description: returns the maximum value of x1, x2, ..., xn. Unless all arguments are missing, missing values are ignored.

- max(2,10,.,7) = 10
- max(.,.,.) = .

Suppose you wished, within a marriage, the higher income (say IncF and IncM for female and male); you might expect

`. generate Highest = max(IncM,IncF)`

would do the trick?

But for women whose spouses (perhaps bashful tycoons or shamefaced paupers) refused to answer, we get the income of the woman as the purportedly known higher individual income. The analyst should regard the outcome for such observations as strictly unknown; else you could have true high-spending households whose ‘highest income’ might be very low (these bashful tycoons), distorting any subsequent analyses.

If the values of some variables in a set are unknown, it is misleading to report the maximum of the known as the known maximum.
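A short sketch with invented incomes shows the effect, together with one possible guard. The `if !mi(...)` guard and the name `HighestKnown` are suggestions made for this illustration, not something the slides prescribe.

```stata
clear
input IncM IncF
50 30
.  30
40 .
.  .
end

* max() ignores missing arguments unless all of them are missing
generate Highest = max(IncM, IncF)

* Guarded version: missing whenever either income is unknown
generate HighestKnown = max(IncM, IncF) if !mi(IncM, IncF)

list, noobs

* Observation 2: the wife's 30 is reported as the 'highest' income ...
assert Highest[2] == 30
* ... whereas the guarded version leaves the outcome unknown
assert mi(HighestKnown[2])
assert HighestKnown[1] == 50 & mi(Highest[4])
```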

Transition?

One consequent loss of functionality is the loss of the ability to test for specific missing data codes, as in ‘v==.a’. This could readily be handled by the introduction of a function mv(v) which would take one variable as its argument, and return a value in the range 1 to 27 corresponding to the extended missing data codes, and zero otherwise. Or Stata, as validly does, could scan for explicit ‘missing’ and parse separately.
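No mv() function exists in Stata; it is the proposal above. Its intended behaviour can nonetheless be emulated today with a loop over the 27 missing codes. The sketch below is hypothetical throughout: the variable name `mvcode` is invented, and the numbering (1 for ‘.’, 2 for ‘.a’, up to 27 for ‘.z’) is an assumption, since the slides do not fix the mapping.

```stata
clear
input v
5
.
.a
.z
end

* 0 for non-missing values, 1 for the system missing value '.'
generate byte mvcode = 0
replace mvcode = 1 if v == .

* 2 to 27 for the extended codes .a to .z (c(alpha) holds "a b c ... z")
local i = 2
foreach l in `c(alpha)' {
    replace mvcode = `i' if v == .`l'
    local ++i
}

list, noobs
assert mvcode[1] == 0 & mvcode[2] == 1 & mvcode[3] == 2 & mvcode[4] == 27
```

Note that the emulation itself depends on the equality tests (`v == .a`) that the proposal would retire, which is why a built-in function or separate parsing would be needed.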
