Data analysis and causal inference – 1. Victor J. Schoenbach, PhD, Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill. www.unc.edu/epid600/ Principles of Epidemiology for Public Health (EPID600). 4/12/2011
The Physicist, the Chemist, and the Statistician. From “Science Jokes”, posted to Usenet groups by Joachim Verhagen (email@example.com); downloaded from Keith M. Gregg, firstname.lastname@example.org, www-leland.stanford.edu/~keithg/humor.shtml. “Three professors (a physicist, a chemist, and a statistician) are called in to see their dean. Just as they arrive the dean is called out of his office, leaving the three professors there. The professors see with alarm that there is a fire in the wastebasket.
“The physicist says, ‘I know what to do! We must cool down the materials until their temperature is lower than the ignition temperature and then the fire will go out.’
“The chemist says, ‘No! No! I know what to do! We must cut off the supply of oxygen so that the fire will go out due to lack of one of the reactants.’
“While the physicist and chemist debate what course to take, they both are alarmed to see the statistician running around the room starting other fires. They both scream, ‘What are you doing?’ To which the statistician replies, ‘Trying to get an adequate sample size.’”
Data management: Managing epidemiologic data is “mass production”. A systematic, organized, professional approach is critical for detecting and avoiding problems.
“You can never, never take anything for granted.” Noel Hinners, vice president for flight systems at Lockheed Martin Astronautics, whose engineering team reported measurements in English units that the Mars Climate Orbiter navigation team assumed were metric units.
Without the documentation, the data may be of little if any value (1995 NSFG): 00000000000003122222222402143041000 00000000000001144112131 070520310 00000000000003233112131 072331040 000000000000011163322227070350110 00000000000003133022221 02451121000 00000000000001111112131 02110041000 00000000000002111112131 07307131000 00000000000002122112131 01073041000
“Our data say nothing at all.” (Epidemiology guru Sander Greenland, Congress of Epidemiology 2001, Toronto) Data are observer notes, respondent answers, biochemical measurements, contents of medical records, machine-readable datasets, … What does one do with them?
Steps in data management: Design the data collection process; write down all data collection procedures; train and supervise data collectors; monitor all data collection activities; document all data collection experiences; keep track of, document, and safeguard data.
Data processing: Review, edit, and code data forms, documenting exceptions and actions; convert to electronic form; “clean” data (check for illegal or improbable values and combinations of values); prepare summaries.
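The cleaning step above (illegal values, improbable values, improbable combinations) can be sketched as a per-record check. This is a minimal illustration, not a procedure from the course: the field names (`age`, `sex`, `pregnant`), the legal codes, the age range, and the helper `check_record` are all hypothetical.

```python
def check_record(record):
    """Return a list of problems found in one keyed record (all values are strings)."""
    problems = []

    # Illegal value: age must consist of digits only
    age_text = record.get("age", "")
    if not age_text.isdigit():
        problems.append(f"age is not a number: {age_text!r}")
    # Improbable value: assumed plausible range of 0-110 years
    elif not 0 <= int(age_text) <= 110:
        problems.append(f"age is out of range: {age_text}")

    # Illegal value: sex must be one of the assumed legal codes
    if record.get("sex") not in {"M", "F"}:
        problems.append(f"sex has an illegal code: {record.get('sex')!r}")

    # Improbable combination: a pregnancy recorded for a male respondent
    if record.get("sex") == "M" and record.get("pregnant") == "Y":
        problems.append("pregnant=Y with sex=M")

    return problems


# Hypothetical keyed records
records = [
    {"id": "001", "age": "34", "sex": "F", "pregnant": "Y"},
    {"id": "002", "age": "3O", "sex": "M", "pregnant": "N"},  # letter O, not zero
    {"id": "003", "age": "140", "sex": "M", "pregnant": "Y"},
]

report = {r["id"]: check_record(r) for r in records}
for record_id, problems in report.items():
    print(record_id, problems or "ok")
```

Running such a check on every batch, and keeping the report with the data, also serves the documentation steps listed earlier.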
The case of the missing eights: Cancer Prevention Study II (N=1.2 million). Contractor keyed 20,000 forms/wk; checked weekly. A 28-item food frequency section had a peculiar pattern of missing values. Pulled the original questionnaires to check. Programmer checked code. Cause: “O” instead of “0”. Steven D. Stellman. Am J Epidemiol 1989;129(4):857-860
Can you find the data management error?
* get non-hispanic white population in county for 2000, first by adding
  ages 15-24, 25-34, 35-44, and 45-64, then by excluding ages 45-64;

CWHITES=CST00609+CST00610+CST00611+CST00612;
CWHITES2=CWHITES-CST00612;

* get non-hispanic black population in county;

CBLACKS=CST00616+CST00617+CST00618+CST00619;
CBLACKS2=CBLACKS-CST00619;

* get hispanic or latino population in county;

CHISPS=CST00623+CST00624+CST00625+CST00626;
CHISPS2=CHISPS-CST00626;
(continues on next slide)
Can you find the data management error?
CST00637 Female population white alone aged 15-24, 2000 – county
CST00638 Female population white alone aged 25-34, 2000 – county
CST00639 Female population white alone aged 35-44, 2000 – county
CST00640 Female population white alone aged 45-64, 2000 – county
CST00644 Female population black* alone aged 15-24, 2000 – county
CST00645 Female population black* alone aged 25-34, 2000 – county
CST00646 Female population black* alone aged 35-44, 2000 – county
CST00647 Female population black* alone aged 45-64, 2000 – county
CST00651 Female population Hispanic* aged 15-24, 2000 – county
CST00652 Female population Hispanic* aged 25-34, 2000 – county
CST00653 Female population Hispanic* aged 35-44, 2000 – county
CST00654 Female population Hispanic* aged 45-64, 2000 – county
* Full variable name: “black or African American”, “Hispanic or Latino”
(continues on next slide)
Can you find the data management error?
* get non-hispanic white female population in county;

CWFEMALES=CST00637+CST00638+CST00639+CST00640;
CWFEMALES2=CWFEMALES-CST00640;

* get non-hispanic black female population in county;

CBFEMALES=CST00644+CST00645+CST00646+CST00647;
CBFEMALES2=CBFEMALES-CST00646;

* get hispanic female population in county;

CHFEMALES=CST00651+CST00652+CST00653+CST00654;
CHFEMALES2=CHFEMALES-CST00654;
(continues on next slide)
Can you find the data management error? (answer) The statement CBFEMALES2=CBFEMALES-CST00646; subtracts CST00646 (black females aged 35-44) where the other “2” variables subtract the 45-64 age group. To exclude ages 45-64 it should subtract CST00647: CBFEMALES2=CBFEMALES-CST00647;
Data exploration: Examine the data (frequency distributions, cross-tabulations, scatterplots) and be alert for surprises and suspicious findings. Examine means and prevalence for factors of interest, overall and within interesting subgroups. Look at associations, prevalence ratios, relative risks, odds ratios, correlations.
Carry out focused data analysis: It is desirable to have a written analysis plan based on the research questions. Typically carry out “crude” analyses and analyses controlling for important variables. Methods of control: stratification, mathematical modeling.
[Figure: Distribution of U.S. household income, 2007 (CPS data). Axis: income in $1000s/year. Source: http://img55.imageshack.us/i/incomedistr07jo6.jpg/]
Stratified analysis: Divide the dataset into subsets according to relevant covariables (e.g., age, sex, smoking, …). Examine the estimates and associations within each subset (unless there are too many). Take averages across the subsets.
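The three steps of a stratified analysis can be sketched with a small worked example. The counts below are made up for illustration, and the Mantel-Haenszel summary risk ratio is used as one common way to "take averages across the subsets"; the slides do not name a specific weighting, so that choice is an assumption.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed divided by risk in the unexposed."""
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)


def mantel_haenszel_rr(strata):
    """Mantel-Haenszel summary risk ratio: a weighted average over strata."""
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n0 in strata.values():
        total = n1 + n0
        numerator += a * n0 / total    # exposed cases, weighted
        denominator += c * n1 / total  # unexposed cases, weighted
    return numerator / denominator


# Hypothetical counts per age stratum:
# (exposed cases, exposed total, unexposed cases, unexposed total)
strata = {
    "younger": (10, 100, 20, 400),
    "older": (80, 400, 10, 100),
}

# Step 2: estimate the association within each subset
stratum_rr = {name: risk_ratio(*counts) for name, counts in strata.items()}

# Crude analysis for comparison: collapse the strata by summing each count
crude_counts = [sum(column) for column in zip(*strata.values())]
crude_rr = risk_ratio(*crude_counts)

# Step 3: average across the subsets
summary_rr = mantel_haenszel_rr(strata)

print(stratum_rr, crude_rr, summary_rr)
```

With these made-up counts the risk ratio is about 2 within each age group and in the summary, while the crude ratio is about 3, because the exposed are concentrated in the older, higher-risk group.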
Mathematical modeling: Express the outcome as some mathematical function of the relevant covariables. “Fit” this function to the data, so that it models the relations in the data. Interpret the resulting model to draw inferences about associations.
Selecting a pattern to sew a pair of pants: Want one that fits the need. Can sew without a pattern, but it takes time and may not look good. Select a pattern that will be well received: Have you seen anyone wearing it? Has it been featured in magazines?
The strategy of statistical data analysis: Look for an available statistical model that will fit the situation (e.g., binomial, normal, chi-square, linear). Have others used it? Has it appeared in a methodology article?
The strategy of statistical data analysis: Summarize the data in terms of the statistical model: mean, standard deviation, other parameters.
But should always look at the data: Distributions can have the same mean and standard deviation but look very different (e.g., same mean: 55).
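A small illustration of the point above. The two samples are invented for this sketch (they are not the distributions pictured on the original slide) and were chosen so that both have mean 55 and the same population standard deviation:

```python
from collections import Counter
from statistics import mean, pstdev

# Two made-up samples of eight values each
peaked = [35, 55, 55, 55, 55, 55, 55, 75]     # most values at the center, two far out
two_humps = [45, 45, 45, 45, 65, 65, 65, 65]  # no values at the center at all


def text_histogram(values):
    """One line per distinct value, with one '#' per observation."""
    counts = Counter(values)
    return "\n".join(f"{v:>3} {'#' * counts[v]}" for v in sorted(counts))


for label, sample in (("peaked", peaked), ("two_humps", two_humps)):
    print(label, "mean:", mean(sample), "SD:", pstdev(sample))
    print(text_histogram(sample))
```

The summary statistics agree, but the histograms do not: one sample piles up at 55 and the other has no observation there.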
Regression models – conceptual. Example of an additive model: Risk of CHD = Risk from Age (“Age_risk”) + Risk from BP (“BP_risk”) + Risk from CHL (“CHL_risk”) + Risk from SMK (“SMK_risk”)
Propose the model: Risk of CHD = Age_risk + BP_risk + CHL_risk + SMK_risk, where Age_risk = Age in years × risk increase per year; BP_risk = BP in mmHg × risk increase per mmHg; CHL_risk = Cholesterol in mg/dL × risk increase per mg/dL; SMK_risk = Pack-years × risk increase per pack-year.
Fit the model – estimate the coefficients: $\mathrm{Risk} = \beta_0 + \beta_1\,\mathrm{Age} + \beta_2\,\mathrm{BP} + \beta_3\,\mathrm{CHL} + \beta_4\,\mathrm{SMK}$, where $\beta_0$ = baseline risk, $\beta_1$ = risk increase per year, $\beta_2$ = risk increase per mmHg, $\beta_3$ = risk increase per mg/dL, $\beta_4$ = risk increase per pack-year. Use the data and statistical techniques to estimate $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$.
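One way to see what "fitting" means is to generate data from known coefficients and check that an estimation routine recovers them. The sketch below uses ordinary least squares via the normal equations, which is an assumption made for illustration; the slides do not say which fitting technique is used, and risk models in practice are often fit on other scales. All covariate values and coefficients are invented.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_least_squares(rows, y):
    """Ordinary least squares: solve (X'X) beta = X'y, with an intercept."""
    x = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(x[0])
    xtx = [[sum(xi[j] * xi[k] for xi in x) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(x, y)) for j in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical "true" coefficients: beta0 (baseline), then Age, BP, CHL, SMK
true_beta = [0.01, 0.001, 0.0005, 0.0002, 0.002]

# Hypothetical covariates: (age in years, BP in mmHg, cholesterol in mg/dL, pack-years)
covariates = [
    (40, 120, 180, 0), (45, 130, 220, 10), (50, 115, 200, 0), (55, 140, 240, 20),
    (60, 150, 190, 5), (65, 135, 260, 30), (70, 160, 210, 15), (50, 125, 230, 40),
]

# Outcome generated from the additive model with no noise, so the fit can be checked
risk = [true_beta[0] + sum(b * v for b, v in zip(true_beta[1:], row))
        for row in covariates]

estimated_beta = fit_least_squares(covariates, risk)
for name, est in zip(["b0", "b1 Age", "b2 BP", "b3 CHL", "b4 SMK"], estimated_beta):
    print(f"{name}: {est:.6f}")
```

Because the outcome here has no noise, the estimates match the generating coefficients up to rounding; with real data they would carry sampling error, which is where the P-values and confidence intervals come in.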
P-values and power. P-value: “the probability of obtaining an interesting-looking sample from a boring population” (1 – specificity). Power: “the probability of obtaining an interesting-looking sample from an interesting population” (sensitivity).
The P-value: If my study observes 0.5 [e.g., ln(OR)]. [Figure: axis marked 0 (boring population) and 0.7 [ln(OR)] (interesting population)]
The P-value: If my study observes 0.5 [e.g., ln(OR)]. [Figure: axis marked 0 (boring population) and 0.7 (interesting population), with the P-value indicated]
The problem with the P-value: But the P-value does not tell me the probability that what I observed was due to chance. [Figure: axis marked 0 (boring population) and 0.7 (interesting population)]
If I study only boring populations. [Figure: distributions of samples from boring populations; axis marked 0]
If I study only interesting populations. [Figure: distributions of samples from interesting populations; axis marked 0 and 0.7]
Many boring populations. [Figure: axis marked 0 (boring populations) and 0.7 (interesting populations)]
Many interesting populations. [Figure: axis marked 0 (boring populations) and 0.7 (interesting populations)]
Do epidemiologists study boring populations? The probability that an interesting-looking sample arose by chance depends on how many boring populations there are. If we study 10 interesting populations and 100 boring populations with 90% power and a 5% significance level, we expect to obtain 9 interesting samples from the interesting populations and 5 from the boring populations.
P-values and predictive values. Results: 14 interesting samples, of which 5 came from boring populations. Probability that an interesting sample came from a boring population: 5/14 = 36%, not 5%! Analogous to positive predictive value.
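The arithmetic on these two slides can be written out so that the inputs are easy to vary. The function name and its parameters are illustrative; the numbers passed in are the ones from the slides (10 interesting and 100 boring populations, 90% power, 5% significance level).

```python
def interesting_sample_breakdown(n_interesting, n_boring, power, alpha):
    """Expected interesting-looking samples, split by the kind of population.

    Returns (from interesting populations, from boring populations,
    share of interesting-looking samples that came from boring populations).
    """
    from_interesting = n_interesting * power  # expected true positives
    from_boring = n_boring * alpha            # expected false positives
    share_from_boring = from_boring / (from_interesting + from_boring)
    return from_interesting, from_boring, share_from_boring


from_interesting, from_boring, share_from_boring = interesting_sample_breakdown(
    n_interesting=10, n_boring=100, power=0.90, alpha=0.05
)
print(f"{from_interesting:.0f} from interesting populations, "
      f"{from_boring:.0f} from boring populations, "
      f"{share_from_boring:.0%} of interesting samples came from boring populations")
```

Changing `n_boring` shows why the answer depends on how many boring populations are studied: with the same power and significance level, more boring populations means a larger share of interesting-looking samples are false positives.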
Analogy to positive predictive value
Meta-analysis: Literature reviews; systematic literature reviews. Every study is an observation from a population of possible studies. The set of studies that have been published may be a biased sample from that population.
What should guide data analysis: What are the research questions? Estimate means (e.g., cholesterol) and prevalences (e.g., HIV). Assess associations (e.g., Is blood lead associated with elevated blood pressure? Do prepaid health plans provide more preventative care? Do bednets protect against malaria?).
Association of helmet use with death in motorcycle crashes: a matched-pair cohort study (Daniel Norvell and Peter Cummings, AJE 2002;156:483-7). Data from the National Highway Traffic Safety Administration’s Fatality Analysis Reporting System. Exposure: helmet use. Outcome: death. Potential confounders: sex, seat position, age, state helmet law.
(continued) 9,222 driver-passenger pairs after exclusions. Relative risk of death for a helmeted rider was 0.65 (0.57-0.74); 0.61 adjusted for seat position. Examined effect measure modification by seat position and by type of crash.
When the proofreader takes a week off (12/29/2009, B5). [Chart dates as printed: Dec 22 23 24 25 28]
www.google.com/finance/historical?q=INDEXDJX:.DJI
Dec 2009   Close
28         10547.08
25         10520.10
24         10520.10
23         10466.44
22         10464.93
21         10414.14
18         10328.89
17         10308.26
I hope he’s having a good break! (12/31/2009, B6). [Chart dates as printed: Dec 23 24 25 28 29]
www.google.com/finance/historical?q=INDEXDJX:.DJI
Dec 2009   Close
29         10545.41
28         10547.08
25         10520.10
24         10520.10
23         10466.44
22         10464.93
21         10414.14
18         10328.89
17         10308.26
Thank you, Arigato, Asante, Dhanyavaad, Dumela, Gracias, Merci, Obrigado, Xie xie