Presentation on theme: "Statistical Methods for Health Intelligence Lecture 2: Perspectives, Data Types & Summaries Iain Buchan University of Manchester"— Presentation transcript:
Statistical Methods for Health Intelligence Lecture 2: Perspectives, Data Types & Summaries Iain Buchan University of Manchester firstname.lastname@example.org
Course Material 1: Basic Text Medical Statistics, 4th Ed Campbell, Machin & Walters Wiley 2007 Statistical knowledge level: Public health practitioner How are you getting on? Are you using any other learning materials?
Your Participation Today: questions about your reading Take notes on my comments Prepare to reproduce exercises in R
Course Material 2: R Statistics: An Introduction Using R Crawley, Wiley 2005 cran.r-project.org Reproduce each example in course text Prepare to submit R scripts for assessment
Course Material: Optional Probability and Random Variables: a beginner’s guide Stirzaker, Cambridge University Press 1999 Bad Science Goldacre, Fourth Estate Ltd, 2008
Define statistics –quantitative information about a topic Statistics –The measurement of uncertainty
The Statistical Movement Circa 1900: Galton, Pearson, Edgeworth and Yule establish Statistics as a discipline Early/mid 1900s: Fisher consolidates statistical methods and experimental philosophy
Think Whose perspective is Chapter 1? –Medical Statistician Why must the Informatician look wider? –May not have the luxury of study design –Data- vs. hypothesis-driven research –Maximise information validity & utility
Health Statistics 1600-1860 Observation Knowledge Reasoning Summarisation
Evidence Based Medicine Early/mid 1900s: Greenwood, Bradford-Hill & Doll push Statistics into medical research Mid-late 1900s: Cochrane pushes for the routine application of randomised clinical trials and leaves the evidence based medicine movement in his wake
Hypothesis-driven Research Problem → Question → Hypothesis → Design → Data collection → Data collation → Data analysis → Inference → Interpretation → Dissemination
Define Epidemiology –the study of the distribution and determinants of disease and health-related states in populations JM Last, 2000
Define Confounding factor –A factor associated with both exposure and outcome but not on the causal pathway about which the inference is being made –What confounded the water cancer vs. water fluoridation example in the book?
Sieving Associations
Association   Bias               Type       Explanation
C MI          Cause-effect       Real       Cause-effect
MI C          Reverse            Real       Effect-cause
C ? MI        Confounding        Real       Effect-effect
C MI          Random error       Spurious   Chance
C MI          Systematic error   Spurious   Bias
C = caffeine, MI = myocardial infarction (heart attack)
Disciplined approach to causal inference, Bradford-Hill: Criteria (temporality, strength, dose-response, consistency, plausibility, consideration of alternatives, open to experiment, specificity, coherence)
Hard to Make a Confident Causal Inference
– Plausible pathway to link outcome to exposure
– Same results if repeated in a different time, place or person
– Exposure precedes outcome
– Strong relationship ± dose effect
– Causal factor relates only to the outcome in question
– Outcome falls if risk factor removed...
Think What is the most important question a Statistician wants a medic to ask? –How might I be wrong? In designing my study In making an inference about an association In generalising my inference beyond the study population Statisticians are understandably conservative Informaticians must be carefully informative
Exhausted Epidemiology Platform The big public health problems e.g. Type 2 Diabetes have “complex webs of causes” Problem 1: Dwindling hits from tools to detect independent “causes” Problem 2: Knowledge can’t be managed by reading papers any more The “data-set” and structure extend beyond the study’s observations
Evidence limits showing Epidemiology has exhausted the big simple causes of ill health Many trials have weak external validity Public health interventions are largely unstudied Many patterns of ill health in society remain unexplained via conventional studies
Need Statistical Informatics Data Necessary Complexity of Models Human Resource
Define Statistical Data-types & Measurement Scales –Categorical Qualitative measuring Binary/Dichotomous Nominal > 2 categories, without order Ordinal (loose) –Nominal with order –Ordinal (ties = lack of measurement sensitivity) –Numerical Quantitative measuring Counts Continuous (any value in a range) –Interval (fixed and defined, meaningful mean difference) –Ratio (zero means something)
Caution Don’t treat ordered nominal data as interval! –Why? –Give examples? –Relate these to software requirements
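One way to see the problem is a minimal sketch with invented data: the categories, the two groups and the three codings below are all hypothetical. Every coding preserves the order mild < moderate < severe, yet the difference in group means changes sign depending on the arbitrary spacing between codes.

```python
# Hypothetical ordinal responses (pain severity) for two groups.
group_a = ["mild", "mild", "severe", "severe"]
group_b = ["moderate", "moderate", "moderate", "moderate"]

# Three codings that all preserve the order mild < moderate < severe.
# The spacing between the codes is arbitrary, which is the problem.
codings = {
    "1-2-3": {"mild": 1, "moderate": 2, "severe": 3},
    "1-2-10": {"mild": 1, "moderate": 2, "severe": 10},
    "1-9-10": {"mild": 1, "moderate": 9, "severe": 10},
}


def mean_score(responses, coding):
    """Mean of the numeric codes, i.e. treating the scale as interval."""
    scores = [coding[r] for r in responses]
    return sum(scores) / len(scores)


mean_gaps = {}
for name, coding in codings.items():
    gap = mean_score(group_a, coding) - mean_score(group_b, coding)
    mean_gaps[name] = gap
    print(f"coding {name}: mean(A) - mean(B) = {gap:+.1f}")
```

A software consequence, offered as a suggestion: store such variables as ordered categories rather than as bare numbers, so that interval arithmetic on them has to be requested deliberately.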
Programming Note Which has the greater information utility? Sex = 1|2 Sex = m|f Gender = m|f Male = 1|0 Gender_Male = 1|0 –Maximum information Minimum ambiguity Gender_Male = 1|0
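A minimal sketch of the recoding, assuming a codebook in which 1 = male and 2 = female. That mapping is an assumption made for illustration, and the need to assume it is exactly the ambiguity of "Sex = 1|2".

```python
# Hypothetical raw extract coded as Sex = 1|2.
raw_sex = [1, 2, 2, 1, 2]

# Assumed codebook: without it the raw codes cannot be interpreted.
SEX_CODEBOOK = {1: "m", 2: "f"}

# Self-describing indicator: the variable name states what 1 means.
gender_male = [1 if SEX_CODEBOOK[code] == "m" else 0 for code in raw_sex]

# The mean of a 0/1 indicator is the proportion with the feature present.
proportion_male = sum(gender_male) / len(gender_male)

print(gender_male, proportion_male)
```

Once the indicator is named Gender_Male, the codebook is no longer needed to read the data.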
Discuss Why categorise continuous data? –Meaningful thresholds (e.g. Hypertensive) –Compact summary / easy presentation –Easier analysis (good / bad?) –Avoid regression to the mean (homework)
Think What is audit? –A quality improvement process that seeks to improve a service through systematic review against explicit criteria and implementing change How does this differ from research? –Ethics –Constrained design What is a natural experiment? –Homework...
Summarise Binary Data: r/n Describe a proportion –r = outcome or feature present (numerator) –n = number of subjects observed (denominator) –p=r/n; RR = p1/p2; (A)RD = |p2-p1| Relative Risk (RR) abuse –Pill ↑ risk DVT by (RR =) 2 statistically significant clinically insignificant 2 women in 10,000 pill-years
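The summaries above can be sketched as follows. The counts are invented for illustration and are not data from the pill example; they are chosen only to show a relative risk of 2 sitting alongside a very small absolute difference.

```python
def proportion(r, n):
    """p = r/n: r subjects with the outcome out of n observed."""
    if n <= 0 or not 0 <= r <= n:
        raise ValueError("need 0 <= r <= n and n > 0")
    return r / n


def relative_risk(p1, p2):
    """RR = p1/p2."""
    return p1 / p2


def absolute_risk_difference(p1, p2):
    """(A)RD = |p2 - p1|."""
    return abs(p2 - p1)


# Hypothetical counts: 4 events in 10,000 exposed, 2 in 10,000 unexposed.
p1 = proportion(4, 10_000)
p2 = proportion(2, 10_000)

rr = relative_risk(p1, p2)
ard = absolute_risk_difference(p1, p2)

print(f"RR = {rr:.1f}, ARD = {ard * 10_000:.0f} per 10,000")
```

Reporting the absolute difference next to the ratio is what makes a "doubling of risk" interpretable.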
Summarise Binary Data: r/n~t Describe a rate –r = outcome/success/failure (numerator) –n = number of subjects observed (denominator) –t = time over which subjects observed –n*t = person time – why important? Some may drop out or be lost to follow-up –(incidence) rate IR = r/(n*t), IRR –IRR = IR1/IR2; IRD = |IR2-IR1|
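A minimal sketch of why person-time matters, using two invented cohorts. Each subject contributes only the time for which they were actually observed, so the denominator is the sum of individual follow-up times rather than the number of subjects multiplied by a fixed study length.

```python
# Hypothetical cohorts: (years of follow-up, event observed?) per subject.
# Unequal follow-up stands in for drop-out and loss to follow-up.
cohort_1 = [(5.0, False), (2.5, True), (1.0, False), (4.0, True), (3.5, False)]
cohort_2 = [(2.0, True), (2.0, False)]


def incidence_rate(records):
    """Events divided by total person-time at risk."""
    events = sum(1 for _, event in records if event)
    person_time = sum(years for years, _ in records)
    return events / person_time


ir1 = incidence_rate(cohort_1)   # 2 events over 16 person-years
ir2 = incidence_rate(cohort_2)   # 1 event over 4 person-years

irr = ir1 / ir2                  # incidence rate ratio
ird = abs(ir2 - ir1)             # incidence rate difference

print(f"IR1 = {ir1:.3f}, IR2 = {ir2:.3f} per person-year")
print(f"IRR = {irr:.2f}, IRD = {ird:.3f}")
```

Dividing the two events in the first cohort by its five subjects instead would give 0.4, a proportion that ignores how long each person was observed.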
Source: John Hacking & Iain Buchan, pre-publication 2009 Percentage excess deaths in North vs. South England
Summarise Binary Data: Crosstabs Variables C1-Ck – what is a crosstab? – Cross-tabulate categorical variables say disease registration by gender 2 by 2 r by c tables – Usually two way or two dimensional – Models may need higher dimensions say disease registration by gender by speciality Is a data cube the same? – Data Cube: A relational aggregation operator generalizing group-by, crosstab, and subtotals
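A two-way crosstab can be sketched with the standard library alone. The records below are invented, and the variable names (registration status, gender) are illustrative only.

```python
from collections import Counter

# Hypothetical records: (disease registration, gender) per person.
records = [
    ("registered", "f"),
    ("registered", "m"),
    ("not registered", "f"),
    ("registered", "f"),
    ("not registered", "m"),
    ("not registered", "m"),
]

# Count each combination of the two categorical variables.
counts = Counter(records)

row_labels = sorted({status for status, _ in records})
col_labels = sorted({gender for _, gender in records})

# r by c table as a list of rows; a Counter returns 0 for unseen cells.
table = [[counts[(row, col)] for col in col_labels] for row in row_labels]

print("".ljust(16) + "".join(col.rjust(4) for col in col_labels))
for row, cells in zip(row_labels, table):
    print(row.ljust(16) + "".join(str(cell).rjust(4) for cell in cells))
```

Adding a third variable to each tuple (say, speciality) gives a three-way count with the same approach, which is the step towards the higher-dimensional tables mentioned above.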
Contingency Table
Dimension 1: Exposure/Treatment/Category 1 (columns)
Dimension 2: Outcome/Status/Category 2 (rows)
                       Dimension 1 Present   Dimension 1 Absent
Dimension 2 Present             a                     b
Dimension 2 Absent              c                     d
Summarise Binary Data: Odds How do odds differ from risk/proportion/probability? – Ratio of occurrence to non-occurrence – Odds = p/(1-p) – OR = (a/c)/(b/d)=ad/bc – p=a/(a+c), so if a<
Caution
– If the odds ratio is interpreted as a relative risk it will always overstate any effect size: the odds ratio is smaller than the relative risk for odds ratios of less than one, and bigger than the relative risk for odds ratios of greater than one.
– The extent of overstatement increases as both the initial risk increases and the odds ratio departs from unity.
– However, serious divergence between the odds ratio and the relative risk occurs only with large effects on groups at high initial risk. Therefore qualitative judgments based on interpreting odds ratios as though they were relative risks are unlikely to be seriously in error.
– In studies which show reductions in risk (odds ratios of less than one), the odds ratio will never underestimate the relative risk by a greater percentage than the level of initial risk.
– In studies which show increases in risk (odds ratios of greater than one), the odds ratio will be no more than twice the relative risk so long as the odds ratio times the initial risk is less than 100%.
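The divergence can be checked numerically. This sketch takes odds as the ratio of occurrence to non-occurrence, p/(1-p), and assumes "initial risk" means the risk in the comparison (unexposed) group. Under that assumption the relative risk implied by an odds ratio is OR / (1 - p0 + p0*OR). The 2 by 2 cell counts are invented, with a and c taken as the exposed group.

```python
def odds(p):
    """Odds: occurrence relative to non-occurrence, p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return p / (1 - p)


def rr_from_or(odds_ratio, initial_risk):
    """Relative risk implied by an odds ratio at a given initial risk.

    Assumes initial_risk is the risk in the comparison group.
    """
    return odds_ratio / (1 - initial_risk + initial_risk * odds_ratio)


# Hypothetical 2 by 2 cells: a, c = exposed with/without outcome,
# b, d = unexposed with/without outcome.
a, b, c, d = 20, 10, 80, 90
odds_ratio = (a * d) / (b * c)
risk_exposed = a / (a + c)
risk_unexposed = b / (b + d)
relative_risk = risk_exposed / risk_unexposed

print(f"OR = {odds_ratio:.2f}, RR = {relative_risk:.2f}")

# How far OR departs from RR across initial risks and effect sizes.
divergence = {}
for initial_risk in (0.01, 0.10, 0.30, 0.50):
    for or_value in (0.5, 2.0, 4.0):
        rr = rr_from_or(or_value, initial_risk)
        divergence[(initial_risk, or_value)] = or_value / rr
        print(f"initial risk {initial_risk:.2f}, OR {or_value:.1f}: "
              f"RR {rr:.2f}, OR/RR {or_value / rr:.2f}")
```

At low initial risk the ratio OR/RR stays close to 1, and it moves away from 1 as the initial risk rises and the odds ratio departs from unity.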
Visualise Categorical Data When is a pie chart useful? – Seldom: arguably only in metaphor How do you add dimensions to a bar chart? – Cluster When is a 3D effect useful? – Not in 2D concepts! – Showing additional dimensions e.g. 2nd level cluster
What is arguably wrong with this visualisation?
Preparation for 15 Feb Read chapters 4,5,6 to understand natural distributions and sampling Return to chapter 3, run the examples in R and generate some alternative examples Prepare to show ideal visualisations and summaries with your R scripts