Variable selection in Regression modelling Simon Thornley.


1 Variable selection in Regression modelling Simon Thornley

2 Which variables should we adjust for and why? Caution: mass confusion

3 Statistics and causation. Statistics: assesses parameters of a distribution from samples; infers associations; estimates probabilities of past and future events... if experimental conditions remain the same. Causal analysis: infers probabilities under conditions that are changing, e.g. treatments or interventions.

4 Variable selection. Based on: the relationship with the outcome variable (p-value); the fit of data to model (likelihood, i.e. the joint probability of the data given the model). What about causal relationships between variables?

5 “I compute, therefore I am.”

6 Pearl and causation? Probability theory. Limits of probability theory: “What is the probability it rained if the grass is wet?” P(Rain | Grass wet). Causal approach: “What is the probability it rained if we make the grass wet?” P(Rain | do(Grass wet)) = P(Rain). Diagram: Rain → Grass wet.
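The difference between seeing and doing on this slide can be checked with a small simulation. The sketch below is illustrative only: the probabilities `P_RAIN` and `P_SPRINKLER`, and the sprinkler as a second cause of wet grass, are assumptions that do not come from the slides.

```python
import random

P_RAIN = 0.3       # assumed probability of rain (not from the slides)
P_SPRINKLER = 0.2  # assumed probability of another cause of wet grass

def draw(rng, do_wet=None):
    """Draw one (rain, wet) pair; do_wet overrides the natural mechanism."""
    rain = rng.random() < P_RAIN
    if do_wet is None:
        # Observational world: rain (or the sprinkler) makes the grass wet.
        wet = rain or rng.random() < P_SPRINKLER
    else:
        # Interventional world: we set the grass state ourselves,
        # which cuts the arrow Rain -> Grass wet.
        wet = do_wet
    return rain, wet

def p_rain_given_wet(n, rng, do_wet=None):
    """Estimate the proportion of rainy days among wet-grass days."""
    samples = [draw(rng, do_wet) for _ in range(n)]
    rain_when_wet = [rain for rain, wet in samples if wet]
    return sum(rain_when_wet) / len(rain_when_wet)

rng = random.Random(0)
p_seeing = p_rain_given_wet(200_000, rng)              # P(Rain | Grass wet)
p_doing = p_rain_given_wet(200_000, rng, do_wet=True)  # P(Rain | do(Grass wet))
print(f"P(Rain | wet)     ~ {p_seeing:.3f}")
print(f"P(Rain | do(wet)) ~ {p_doing:.3f}")
```

Under these assumed numbers, P(Rain | Grass wet) = 0.3 / 0.44, about 0.68, whereas P(Rain | do(Grass wet)) stays at P(Rain) = 0.3: wetting the grass ourselves tells us nothing about the weather.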

7 Simpson’s paradox, 1899. “Any statistical relationship between two variables may be reversed by including additional factors in the analysis.” Reverse regression, compare: men earn more than equally qualified women; men are more qualified than equally paid women. Which factors should be adjusted for?

8 A visual depiction

9 Gender bias at Berkeley?

10 A DAG explanation. Diagram nodes: Gender, Admission, Faculty competitiveness. Women tended to apply to competitive departments with low rates of admission, whereas men tended to apply to less-competitive departments with high rates of admission among the qualified applicants.
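The reversal described here can be reproduced with a toy table. The counts below are hypothetical, chosen only to show the pattern; they are not the Berkeley admission figures.

```python
# Hypothetical (applicants, admitted) counts; NOT the real Berkeley data.
counts = {
    "less competitive": {"men": (800, 480), "women": (100, 65)},
    "more competitive": {"men": (200, 40), "women": (900, 225)},
}

def rate(applied, admitted):
    """Admission rate as a proportion."""
    return admitted / applied

def overall_rate(sex):
    """Admission rate for one sex, pooled over all departments."""
    applied = sum(dept[sex][0] for dept in counts.values())
    admitted = sum(dept[sex][1] for dept in counts.values())
    return rate(applied, admitted)

# Rates within each department (stratified analysis).
by_dept = {
    name: {sex: rate(*dept[sex]) for sex in dept}
    for name, dept in counts.items()
}
# Rates ignoring department (crude analysis).
overall = {sex: overall_rate(sex) for sex in ("men", "women")}

for name, rates in by_dept.items():
    print(f"{name}: men {rates['men']:.2f}, women {rates['women']:.2f}")
print(f"overall: men {overall['men']:.2f}, women {overall['women']:.2f}")
```

In this made-up table women have the higher admission rate inside each department, yet the lower rate overall, because most women applied to the department that admits few applicants.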

11 DAG: A method for variable selection. Graphic: a picture of nodes (variables) and arcs or edges. Directed: causal effects shown. Acyclic: no arrows from descendants to ancestors. E: Exposure. D: Disease. S: Stratification factors.

12 DAG in the inferential process... Diagram labels: Data-generating model (DAG), M; Joint distribution; Data; Inference; Aspects of M, Q(M). How does the (unknown) natural process assign values to the variables in the analysis?

13 DAG Terminology. Path: a sequence of arrows connecting two variables, ignoring direction, e.g. a path joining E, S and D. A collider is a variable which has two or more arrows pointing to (colliding with) it, e.g. E → S ← D. A path is blocked if it contains a collider; otherwise it is unblocked (open). An unblocked path transmits associations along it, e.g. a chain (E → S → D) or a fork (E ← S → D).
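These definitions are mechanical enough to code. The sketch below assumes a DAG stored as a set of `(parent, child)` pairs and a path given as a list of node names; both representations and the function names are choices made for this illustration. It covers only the unconditional case, before any variable has been adjusted for.

```python
def colliders_on_path(path, edges):
    """Return the interior nodes of `path` that both neighbours point into.

    `edges` is a set of (parent, child) pairs; `path` is a list of nodes.
    """
    found = []
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        if (prev, node) in edges and (nxt, node) in edges:
            found.append(node)
    return found

def is_blocked(path, edges):
    """A path is blocked (with nothing conditioned on) if it has a collider."""
    return bool(colliders_on_path(path, edges))

chain = {("E", "S"), ("S", "D")}     # E -> S -> D
fork = {("S", "E"), ("S", "D")}      # E <- S -> D
collider = {("E", "S"), ("D", "S")}  # E -> S <- D

path = ["E", "S", "D"]
for name, edges in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name, "blocked" if is_blocked(path, edges) else "unblocked")
```

Conditioning changes the picture: adjusting for a collider opens the path, and adjusting for a non-collider on the path closes it. This sketch does not model that step.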

14 Descendant. Any node at the end of a directed path originating at E is called a descendant of E. Parents are defined similarly: the nodes with an arrow pointing directly into a given node. Assumptions: a missing line between two nodes is itself an assumption. Any node is independent of its non-descendants, given its parents.

15 Why use DAGs? They encode expert knowledge. They make assumptions about the research question explicit and open to debate. They link the causal model to the statistical model for causal inference. They make us think, “What could give rise to an observed association between E and Y?”

16 Explaining observed associations. E and D share a common cause (confounding). The association is induced by conditioning on a common effect of E and D (selection bias, collider). Diagram nodes: Exposure, Disease, Strata (such as hospitalisation).

17 Danger: controlling for colliders. Exposure: sugar. Outcome: fluoride. Collider: tooth decay. Among individuals with tooth decay, if we know someone was exposed to fluoride in the water, we are more likely to believe that their tooth decay is due to sugar. This is a spurious association.
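The spurious association can be computed exactly for a toy model. In the sketch below sugar and fluoride are independent, and both affect tooth decay; every probability is an assumption made up for illustration.

```python
from itertools import product

P_SUGAR = 0.5     # assumed prevalence of high sugar intake
P_FLUORIDE = 0.5  # assumed prevalence of fluoridated water
# Assumed P(decay = 1 | sugar, fluoride), keyed by (sugar, fluoride).
P_DECAY = {(0, 0): 0.30, (1, 0): 0.70, (0, 1): 0.05, (1, 1): 0.45}

def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x (0 or 1)."""
    return p if x else 1 - p

def joint():
    """Yield (sugar, fluoride, decay, probability) for every combination."""
    for s, f, d in product((0, 1), repeat=3):
        yield s, f, d, bern(P_SUGAR, s) * bern(P_FLUORIDE, f) * bern(P_DECAY[(s, f)], d)

def p_sugar(fluoride, decay=None):
    """P(sugar = 1 | fluoride), optionally also conditioning on decay."""
    rows = [(s, p) for s, f, d, p in joint()
            if f == fluoride and (decay is None or d == decay)]
    return sum(p for s, p in rows if s) / sum(p for _, p in rows)

print("Everyone:        ", p_sugar(1), "vs", p_sugar(0))
print("Tooth decay only:", p_sugar(1, decay=1), "vs", p_sugar(0, decay=1))
```

In the whole population fluoride says nothing about sugar (0.5 in both groups). Restricted to people with tooth decay, sugar intake is more likely among those exposed to fluoride (0.9 against 0.7 under these assumed numbers), which is the collider bias the slide warns about.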

18 Simple rules to choose confounders. Delete all arrows from E that point to any descendant. In the new graph, determine if there are any unblocked backdoor paths from E to D. A set of confounders S that blocks all such paths allows one to make the assumption that P(D=d | do(E=e), S=s) = P(D=d | E=e, S=s).
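A minimal sketch of what this identity buys, assuming a three-node DAG in which S causes both E and D, and E causes D. All probabilities are invented for illustration. Averaging the stratum-specific risks P(D=1 | E=e, S=s) over the distribution of S (standardisation) is one common way to use the identity; the slides do not prescribe a particular estimator.

```python
from itertools import product

P_S = 0.5                        # assumed P(S = 1)
P_E_GIVEN_S = {0: 0.2, 1: 0.8}   # assumed P(E = 1 | S)

def p_d_given(e, s):
    """Assumed P(D = 1 | E = e, S = s); the true effect of E is +0.2."""
    return 0.1 + 0.2 * e + 0.4 * s

def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x (0 or 1)."""
    return p if x else 1 - p

# Joint distribution over (s, e, d) implied by the assumed DAG.
JOINT = {
    (s, e, d): bern(P_S, s) * bern(P_E_GIVEN_S[s], e) * bern(p_d_given(e, s), d)
    for s, e, d in product((0, 1), repeat=3)
}
NAMES = ("s", "e", "d")

def prob(**fixed):
    """Marginal probability that the named variables take the given values."""
    return sum(p for key, p in JOINT.items()
               if all(key[NAMES.index(k)] == v for k, v in fixed.items()))

def crude_risk(e):
    """P(D = 1 | E = e), ignoring S."""
    return prob(e=e, d=1) / prob(e=e)

def standardised_risk(e):
    """Sum over s of P(D = 1 | E = e, S = s) * P(S = s)."""
    return sum(prob(s=s, e=e, d=1) / prob(s=s, e=e) * prob(s=s) for s in (0, 1))

crude_diff = crude_risk(1) - crude_risk(0)
adjusted_diff = standardised_risk(1) - standardised_risk(0)
print(f"crude risk difference:    {crude_diff:.2f}")
print(f"adjusted risk difference: {adjusted_diff:.2f}")
```

With these numbers the crude risk difference is 0.44, more than double the effect of 0.2 built into the model, and adjusting for S recovers 0.2.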

19 A worked example: Urate and CVD. Diagram labels: Exposure, Outcome, Smoking, Collider, Adjust.

20 Usual (washing machine) approach

21 Summary. Variable selection is complex. Need to consider causal paths. Adjustment can cause more harm than good. Don’t adjust for variables on the causal path. Adjust for variables that are likely to ‘cause’ both exposure and disease. Avoid adjustment for variables with many causes (colliders).

