Presentation on theme: "CPH Exam Review Biostatistics"— Presentation transcript:
1 CPH Exam Review Biostatistics. Lisa Sullivan, PhD, Associate Dean for Education, Professor and Chair, Department of Biostatistics, Boston University School of Public Health
2 Outline and Goals. Overview of Biostatistics (Core Area); Terminology and Definitions; Practice Questions. An archived version of this review, along with the PPT file, will be available on the NBPHE website (www.nbphe.org) under Study Resources.
3 Biostatistics. Two Areas of Applied Biostatistics. Descriptive Statistics: summarize a sample selected from a population. Inferential Statistics: make inferences about population parameters based on sample statistics.
4 Variable Types. Dichotomous variables have 2 possible responses (e.g., Yes/No). Ordinal and categorical variables have more than two responses, which are ordered and unordered, respectively. Continuous (or measurement) variables can, in theory, assume any value between a theoretical minimum and maximum.
5 We want to study whether individuals over 45 years are at greater risk of diabetes than those younger than 45. What kind of variable is age? Options: Dichotomous; Ordinal; Categorical; Continuous
6 We are interested in assessing disparities in infant morbidity by race/ethnicity. What kind of variable is race/ethnicity? Options: Dichotomous; Ordinal; Categorical; Continuous
7 Numerical Summaries of Dichotomous, Categorical and Ordinal Variables. Frequency Distribution Table (n=50):
Health Status | Freq. | Rel. Freq. | Cumulative Freq. | Cumulative Rel. Freq.
Excellent | 19 | 38% | 19 | 38%
Very Good | 12 | 24% | 31 | 62%
Good | 9 | 18% | 40 | 80%
Fair | 6 | 12% | 46 | 92%
Poor | 4 | 8% | 50 | 100%
The cumulative columns apply to ordinal variables only.
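The frequency table for health status can be rebuilt from the raw counts. The following is a minimal Python sketch; the variable names are illustrative, and only the five counts come from the slide.

```python
from itertools import accumulate

# Counts from the health status slide (n = 50); categories are in ordinal order.
counts = {"Excellent": 19, "Very Good": 12, "Good": 9, "Fair": 6, "Poor": 4}

n = sum(counts.values())
cum_freq = list(accumulate(counts.values()))            # running totals
rel_freq = [100 * c / n for c in counts.values()]       # percent of sample
cum_rel_freq = [100 * c / n for c in cum_freq]          # running percent

for (label, c), rf, cf, crf in zip(counts.items(), rel_freq, cum_freq, cum_rel_freq):
    print(f"{label:<10} {c:>3} {rf:>4.0f}% {cf:>3} {crf:>4.0f}%")
```

Cumulative columns are only meaningful here because the categories have a natural order.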
10 Continuous Variables. Assume, in theory, any value between a theoretical minimum and maximum. Quantitative, measurement variables. Example: systolic blood pressure. Standard summary: n = 75, X̄ = 123.6, s = 19.4. Second sample: n = 75, X̄ = 128.1, s = 6.4.
11 Summarizing Location and Variability. When there are no outliers, the sample mean and standard deviation summarize location and variability. When there are outliers, the median and interquartile range (IQR) summarize location and variability, where IQR = Q3 - Q1. Outliers are values < Q1 - 1.5 IQR or > Q3 + 1.5 IQR.
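The outlier rule can be sketched in Python as follows. The blood pressure values are made up for illustration, and `iqr_outliers` is a hypothetical helper name. Quartile conventions differ slightly between textbooks and software; this sketch uses the default method of `statistics.quantiles`.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _median, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical systolic blood pressures with one extreme value.
sbp = [100, 110, 115, 120, 122, 125, 130, 135, 140, 210]
outliers = iqr_outliers(sbp)
print(outliers)
```

When this list is non-empty, the median and IQR are the safer summaries of location and variability.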
14 Comparing Samples with Box and Whisker Plots. [Figure: box-and-whisker plots of systolic blood pressure for samples 1 and 2]
15 What type of display is shown below? [Figure: Percent Patients by Disease Stage] Options: Frequency bar chart; Relative frequency bar chart; Frequency histogram; Relative frequency histogram
16 The distribution of SBP in men, 20-29 years, is shown below. What is the best summary of a typical value? Options: Mean; Median; Interquartile range; Standard deviation
17 When data are skewed, the mean is higher than the median. Options: True; False
18 The best summary of variability for the following continuous variable is: Options: Mean; Median; Interquartile range; Standard deviation
19 Numerical and Graphical Summaries. Dichotomous and categorical: frequencies and relative frequencies; bar charts (freq. or relative freq.). Ordinal: frequencies, relative frequencies, cumulative frequencies and cumulative relative frequencies; histograms (freq. or relative freq.). Continuous: n, X̄ and s, or median and IQR (if outliers); box-and-whisker plot.
20 What is the probability of selecting a male with optimal blood pressure? [Table: sex (Male, Female, Total) by blood pressure category (Optimal, Normal, Pre-Htn, Htn, Total)] Options: 20/25; 20/80; 20/150
21 What is the probability of selecting a patient with Pre-Htn or Htn? [Table: sex (Male, Female, Total) by blood pressure category (Optimal, Normal, Pre-Htn, Htn, Total)] Options: 95/150; 45/80; 55/150
22 What proportion of men have prevalent CVD? [Table: sex (Men, Women) by CVD status (CVD, Free of CVD)] Options: 35/80; 35/265; 35/300
23 What proportion of patients with CVD are men? [Table: sex (Men, Women) by CVD status (CVD, Free of CVD)] Options: 35/70035/8080/300
24 Are Family History and Current Status Independent? Example: Consider the following table, which cross-classifies subjects by their family history of CVD and current (prevalent) CVD status.
Family History | Current CVD: No | Current CVD: Yes
No | 215 | 25
Yes | 90 | 15
P(Current CVD | Family Hx) = 15/105 = 0.143. P(Current CVD | No Family Hx) = 25/240 = 0.104.
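Independence would require the conditional probabilities to equal the overall (marginal) probability of current CVD. The following is a minimal Python sketch using the counts from the slide; the dictionary layout and the helper name `p_cvd` are illustrative choices.

```python
# Counts from the family history by current CVD table.
counts = {
    "family_hx": {"cvd": 15, "no_cvd": 90},
    "no_family_hx": {"cvd": 25, "no_cvd": 215},
}

def p_cvd(rows):
    """Proportion with current CVD among the listed family-history groups."""
    cvd = sum(counts[r]["cvd"] for r in rows)
    total = sum(sum(counts[r].values()) for r in rows)
    return cvd / total

p_given_fh = p_cvd(["family_hx"])                    # 15/105
p_given_no_fh = p_cvd(["no_family_hx"])              # 25/240
p_overall = p_cvd(["family_hx", "no_family_hx"])     # 40/345

print(round(p_given_fh, 3), round(p_given_no_fh, 3), round(p_overall, 3))
```

Because the two conditional probabilities differ from each other and from the marginal probability, the sample proportions do not satisfy exact independence. Whether the difference exceeds chance variation is a separate question for a formal test.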
25 Are symptoms independent of disease? [Table: symptom status (Symptoms, No Symptoms) by disease status (Disease, No Disease, Total)] Options: No; Yes
26 Probability Models – Binomial Distribution. Two possible outcomes: success and failure. Replications of the process are independent. P(success) is constant for each replication. Mean = np, variance = np(1-p).
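The mean and variance formulas can be checked numerically against the binomial probabilities themselves. In this sketch, n = 10 and p = 0.3 are arbitrary illustrative values and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative values, not from the slides
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(mean, n * p)                   # both approximately 3.0
print(variance, n * p * (1 - p))     # both approximately 2.1
```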
27 Probability Models – Poisson Distribution. Two possible outcomes: success and failure. Replications of the process are independent. Often used to model counts (often of rare events). Mean = μ, variance = μ.
28 Probability Models – Normal Distribution. Model for continuous outcome. Mean = median = mode.
29 Normal Distribution. Properties of the Normal Distribution: i) The normal distribution is symmetric about the mean (i.e., P(X > μ) = P(X < μ) = 0.5). ii) The mean and variance (μ and σ²) completely characterize the normal distribution. iii) The mean = the median = the mode. iv) Approximately 68% of observations fall between the mean ± 1 sd, 95% between the mean ± 2 sd, and >99% between the mean ± 3 sd.
30 Normal Distribution. Body mass index (BMI) for men age 60 is normally distributed with a mean of 29 and standard deviation of 6. What is the probability that a male has BMI < 29? P(X<29) = 0.5
31 Normal Distribution. What is the probability that a male has BMI less than 30? P(X<30) = ?
32 Standard Normal Distribution Z: normal distribution with μ=0 and σ=1
33 Normal Distribution. Z = (30 - 29)/6 = 0.17, so P(X<30) = P(Z<0.17) = 0.5675, from a table of standard normal probabilities or a statistical computing package.
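The "statistical computing package" route can be sketched with Python's standard library. The table answer rounds Z to two decimals, so it differs slightly from the unrounded value.

```python
from statistics import NormalDist

bmi = NormalDist(mu=29, sigma=6)   # BMI model for men age 60, from the slide

p_exact = bmi.cdf(30)              # uses the unrounded Z = 1/6

z = round((30 - 29) / 6, 2)        # Z rounded to 0.17, as in a printed table
p_table = NormalDist().cdf(z)      # standard normal probability P(Z < 0.17)

print(round(p_table, 4), round(p_exact, 4))
```

The small gap between the two results comes only from rounding Z before the table lookup.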
34 Comparing Systolic Blood Pressure (SBP). Suppose for males age 50, SBP is approximately normally distributed with a mean of 108 and a standard deviation of 14. Suppose for females age 50, SBP is approximately normally distributed with a mean of 100 and a standard deviation of 8. If a male age 50 has an SBP = 140 and a female age 50 has an SBP = 120, who has the “relatively” higher SBP?
35 Normal Distribution. Z_M = (140 - 108) / 14 = 2.29; Z_F = (120 - 100) / 8 = 2.50. Which is more extreme?
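The comparison amounts to standardizing each value against its own distribution. A short Python sketch follows; `z_score` is an illustrative helper name.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_male = z_score(140, mean=108, sd=14)
z_female = z_score(120, mean=100, sd=8)

print(round(z_male, 2), round(z_female, 2))
```

The larger Z identifies the person whose SBP is further above the mean for their own group.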
36 Percentiles of the Normal Distribution. The kth percentile is defined as the score that holds k percent of the scores below it. E.g., the 90th percentile is the score that holds 90% of the scores below it. Q1 = 25th percentile, median = 50th percentile, Q3 = 75th percentile.
37 Percentiles. For the normal distribution, the following is used to compute percentiles: X = μ + Zσ, where μ = mean of the random variable X, σ = standard deviation, and Z = value from the standard normal distribution for the desired percentile (e.g., 95th, Z=1.645). 95th percentile of BMI for men: 29 + 1.645(6) = 38.9
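The percentile formula can be checked with the inverse normal CDF from Python's standard library, using the BMI parameters from the earlier slides (mean 29, standard deviation 6).

```python
from statistics import NormalDist

mu, sigma = 29, 6                       # BMI for men age 60

z_95 = NormalDist().inv_cdf(0.95)       # about 1.645
by_formula = mu + z_95 * sigma          # X = mu + Z * sigma
direct = NormalDist(mu, sigma).inv_cdf(0.95)

print(round(z_95, 3), round(by_formula, 1), round(direct, 1))
```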
38 Central Limit Theorem. (Non-normal) population with mean μ and standard deviation σ. Take samples of size n; as long as n is sufficiently large (usually n > 30 suffices), the distribution of the sample mean is approximately normal, so Z can be used to compute probabilities. Standard error of the sample mean: σ/√n.
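A small simulation can illustrate the theorem. The population (exponential with mean 1 and standard deviation 1), the sample size of 36, the number of replications and the seed are all arbitrary assumptions chosen for this sketch.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)                 # fixed seed so the run is repeatable

n, reps = 36, 5000             # sample size and number of simulated samples
# Skewed population: exponential with mean 1 and standard deviation 1.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

observed_se = stdev(sample_means)   # spread of the simulated sample means
theoretical_se = 1 / sqrt(n)        # sigma / sqrt(n) with sigma = 1

print(round(mean(sample_means), 3), round(observed_se, 3), round(theoretical_se, 3))
```

Even though the population is strongly skewed, the sample means cluster around the population mean with spread close to σ/√n.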
39 Statistical Inference. There are two broad areas of statistical inference: estimation and hypothesis testing. Estimation: the population parameter is unknown, and sample statistics are used to generate estimates. Hypothesis testing: a statement is made about the parameter, and sample statistics support or refute the statement.
40 What Analysis To Do When. Nature of primary outcome variable: continuous, dichotomous, categorical, time to event. Number of comparison groups: one, 2 independent, 2 matched or paired, > 2. Associations between variables: regression analysis.
41 Estimation. Process of determining likely values for an unknown population parameter. The point estimate is the best single-valued estimate for the parameter. The confidence interval is a range of values for the parameter: point estimate ± margin of error, i.e., point estimate ± t SE(point estimate).
42 Hypothesis Testing Procedures. 1. Set up null and research hypotheses, select α. 2. Select test statistic. 3. Set up decision rule. 4. Compute test statistic. 5. Draw conclusion & summarize significance (p-value).
43 P-values. P-values represent the exact significance of the data. Estimate p-values when rejecting H0 to summarize significance of the data (approximate with statistical tables, exact value with computing package). If p < α then reject H0.
44 Errors in Hypothesis Tests
Conclusion of Statistical Test | Do Not Reject H0 | Reject H0
H0 true | Correct | Type I error
H0 false | Type II error | Correct
45 Continuous Outcome: Confidence Interval for μ. Continuous outcome, 1 sample. n > 30: X̄ ± Z s/√n. n < 30: X̄ ± t s/√n (df = n-1). Example: 95% CI for mean waiting time at ED. Data: n=100, X̄=37.85 and s=9.5 mins. 37.85 ± 1.96(9.5/√100) gives (35.99 to 39.71). Statistical computing packages use t throughout.
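The ED waiting time interval can be reproduced with a short Python sketch. It uses the large-sample Z interval; `mean_ci` is a hypothetical helper name, and a package that uses t throughout would give a very slightly wider interval.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(xbar, s, n, level=0.95):
    """Large-sample (Z-based) confidence interval for a population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # 1.96 for a 95% interval
    margin = z * s / sqrt(n)                    # Z times the standard error
    return xbar - margin, xbar + margin

low, high = mean_ci(xbar=37.85, s=9.5, n=100)
print(round(low, 2), round(high, 2))
```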
46 New Scenario. Outcome is dichotomous: result of surgery (success, failure), cancer remission (yes/no). One study sample. Data: on each participant, measure outcome (yes/no); n, x = # positive responses, p̂ = x/n.
47 Dichotomous Outcome: Confidence Interval for p. Dichotomous outcome, 1 sample: p̂ ± Z √(p̂(1-p̂)/n). Example: In the Framingham Offspring Study (n=3532), 1219 patients were on antihypertensive medications. Generate 95% CI. p̂ = 1219/3532 = 0.345; 95% CI: (0.329, 0.361)
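The Framingham interval can be reproduced with the usual large-sample interval for a proportion. In this Python sketch, `proportion_ci` is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(x, n, level=0.95):
    """Large-sample confidence interval for a proportion: p_hat +/- Z * SE."""
    p_hat = x / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin

p_hat, low, high = proportion_ci(x=1219, n=3532)
print(round(p_hat, 3), round(low, 3), round(high, 3))
```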
48 One Sample Procedures – Comparisons with Historical/External Control. Continuous: H0: μ=μ0; H1: μ>μ0, <μ0, ≠μ0; Z statistic when n>30, t statistic when n<30. Dichotomous: H0: p=p0; H1: p>p0, <p0, ≠p0.
49 One Sample Procedures – Comparisons with Historical/External Control. Categorical or ordinal outcome: χ² goodness-of-fit test. H0: p1=p10, p2=p20, …, pk=pk0. H1: H0 is false.
50 New Scenario. Outcome is continuous: SBP, weight, cholesterol. Two independent study samples. Data: on each participant, identify group and measure outcome.
51 Two Independent Samples. Cohort study: set of subjects who meet study inclusion criteria → Group 1, Group 2 → Mean Group 1, Mean Group 2
52 Two Independent Samples. RCT: set of subjects who meet study eligibility criteria → Randomize → Treatment 1, Treatment 2 → Mean Trt 1, Mean Trt 2
53 Continuous Outcome: Confidence Interval for (μ1-μ2). Continuous outcome, 2 independent samples. Cases: n1>30 and n2>30; n1<30 or n2<30.
55 Hypothesis Testing for (μ1-μ2). Test statistic. Cases: n1>30 and n2>30; n1<30 or n2<30.
56 An RCT is planned to show the efficacy of a new drug vs. placebo to lower total cholesterol. What are the hypotheses? Options: H0: μP=μN, H1: μP>μN; H0: μP=μN, H1: μP<μN; H0: μP=μN, H1: μP≠μN
57 New Scenario. Outcome is dichotomous: result of surgery (success, failure), cancer remission (yes/no). Two independent study samples. Data: on each participant, identify group and measure outcome (yes/no).
77 A two-sided test for the equality of means produces p=0.20. Reject H0? Options: Yes; No; Maybe
78 Hypothesis Testing for More than 2 Means - Analysis of Variance. Continuous outcome, k independent samples, k > 2. H0: μ1=μ2=μ3=…=μk. H1: Means are not all equal. Test statistic: F is the ratio of between-group variation to within-group variation (error).
79 ANOVA Table
Source of Variation | Sums of Squares | df | Mean Squares | F
Between Treatments | SSB | k-1 | MSB = SSB/(k-1) | F = MSB/MSE
Error | SSE | N-k | MSE = SSE/(N-k) |
Total | SST | N-1 | |
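The entries of the ANOVA table can be computed directly from grouped data. The three small groups below are made-up values chosen so the arithmetic is easy to follow, and `one_way_anova` is a hypothetical helper name.

```python
from statistics import mean

def one_way_anova(groups):
    """Return SSB, SSE, their degrees of freedom, and F = MSB / MSE."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, big_n = len(groups), len(all_values)

    # Between-group variation: group means around the grand mean.
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) variation: observations around their group mean.
    sse = sum((v - mean(g)) ** 2 for g in groups for v in g)

    msb = ssb / (k - 1)
    mse = sse / (big_n - k)
    return {"SSB": ssb, "SSE": sse, "df_between": k - 1,
            "df_error": big_n - k, "F": msb / mse}

# Hypothetical data: three treatment groups of three observations each.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

The F value would then be compared with a critical value from the F distribution with (k-1, N-k) degrees of freedom.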
80 ANOVA. When the sample sizes are equal, the design is said to be balanced. Balanced designs give the greatest power and are more robust to violations of the normality assumption.
81 Extensions. Multiple comparison procedures: used to test for specific differences in means after rejecting equality of all means (e.g., Tukey, Scheffe). Higher-order ANOVA: tests for differences in means as a function of several factors.
82 Extensions. Repeated measures ANOVA: tests for differences in means when there are multiple measurements in the same participants (e.g., measures taken serially in time).
83 χ² Test of Independence. Dichotomous, ordinal or categorical outcome; 2 or more samples. H0: The distribution of the outcome is independent of the groups. H1: H0 is false. Test statistic: χ² = Σ (O - E)² / E, where O and E are the observed and expected cell frequencies.
84 χ² Test of Independence. Data organization (r by c table): is the distribution of the outcome different across (associated with) groups?
Group | Outcome 1 | Outcome 2 | Outcome 3
A | 20% | 40% |
B | 50% | 25% |
C | 90% | 5% |
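The test statistic sums (observed - expected)²/expected over the cells, with expected counts computed from the row and column totals. The Python sketch below applies it to the family history by current CVD counts from the earlier independence slide; `chi_square_independence` is an illustrative helper name. The cutoff 3.84 is the χ² critical value for df = 1 at α = 0.05.

```python
def chi_square_independence(table):
    """Chi-square statistic and df for an r-by-c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under H0
            chi2 += (observed - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Rows: no family history, family history. Columns: no current CVD, current CVD.
chi2, df = chi_square_independence([[215, 25], [90, 15]])
print(round(chi2, 2), df, chi2 > 3.84)
```

For these counts the statistic falls below 3.84, so H0 (independence) would not be rejected at the 5% level.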
86 In Framingham Heart Study, we want to assess risk factors for Impaired Glucose. Outcome = Glucose Category: Diabetes (glucose > 126), Impaired Fasting Glucose (glucose ), Normal Glucose. Risk factors: Sex; Age; BMI (normal weight, overweight, obese); Genetics
87 What test would be used to assess whether sex is associated with Glucose Category? Options: ANOVA; Chi-Square GOF; Chi-Square test of independence; Test for equality of means; Other
88 What test would be used to assess whether age is associated with Glucose Category? Options: ANOVA; Chi-Square GOF; Chi-Square test of independence; Test for equality of means; Other
89 What test would be used to assess whether BMI is associated with Glucose Category? Options: ANOVA; Chi-Square GOF; Chi-Square test of independence; Test for equality of means; Other
90 In Framingham Heart Study, we want to assess risk factors for Glucose Level. Consider a Secondary Outcome = Fasting Glucose Level. Risk factors: Sex; Age; BMI (normal weight, overweight, obese); Genetics
91 What test would be used to assess whether sex is associated with Glucose Level? Options: ANOVA; Chi-Square GOF; Chi-Square test of independence; Test for equality of means; Other
92 What test would be used to assess whether BMI is associated with Glucose Level? Options: ANOVA; Chi-Square GOF; Chi-Square test of independence; Test for equality of means; Other
93 What test would be used to assess whether age is associated with Glucose Level? Options: ANOVA; Chi-Square GOF; Chi-Square test of independence; Test for equality of means; Other
94 In Framingham Heart Study, we want to assess risk factors for Diabetes. Consider a Tertiary Outcome = Diabetes vs. No Diabetes. Risk factors: Sex; Age; BMI (normal weight, overweight, obese); Genetics
95 What test would be used to assess whether sex is associated with Diabetes? Options: ANOVA; Chi-Square GOF; Chi-Square test of independence; Test for equality of means; Other
96 What test would be used to assess whether BMI is associated with Diabetes? Options: ANOVA; Chi-Square GOF; Chi-Square test of independence; Test for equality of means; Other
97 What test would be used to assess whether age is associated with Diabetes? Options: ANOVA; Chi-Square GOF; Chi-Square test of independence; Test for equality of means; Other
98 Correlation. Correlation (r): measures the nature and strength of linear association between two variables at a time. Regression: equation that best describes the relationship between variables.
99 What is the most likely value of r for the data shown below?
100 What is the most likely value of r for the data shown below?
101 Simple Linear Regression. Y = dependent, outcome variable. X = independent, predictor variable. Ŷ = b0 + b1 x, where b0 is the Y-intercept and b1 is the slope.
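The intercept and slope are usually obtained by least squares. The Python sketch below uses the standard formulas b1 = Sxy/Sxx and b0 = Ȳ - b1 X̄; the five data points are invented for illustration and `least_squares_line` is a hypothetical helper name.

```python
from statistics import mean

def least_squares_line(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = y_bar - b1 * x_bar      # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical predictor and outcome values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)
print(round(b0, 2), round(b1, 2))
```

The slope is read as the estimated change in Y for a one-unit increase in X.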
102 Simple Linear Regression Assumptions: linear relationship between X and Y; independence of errors; homoscedasticity (constant variance) of the errors; normality of errors.
103 Multiple Linear Regression. Useful when we want to jointly examine the effect of several X variables on the outcome Y variable. Y = continuous outcome variable. X1, X2, …, Xp = set of independent or predictor variables.
104 Multiple Regression Analysis. The model is conditional: parameter estimates are conditioned on the other variables in the model. Perform the overall test of regression; if significant, examine individual predictors. Judge the relative importance of predictors by p-values (or standardized coefficients).
105 Multiple Regression Analysis. Predictors can be continuous, indicator variables (0/1) or a set of dummy variables. Dummy variables (for categorical predictors), e.g., Race: white, black, Hispanic. Black (1 if black, 0 otherwise); Hispanic (1 if Hispanic, 0 otherwise).
106 Definitions. Confounding: the distortion of the effect of a risk factor on an outcome. Effect modification: a different relationship between the risk factor and an outcome depending on the level of another variable.
107 Multiple Regression for SBP: Comparison of Parameter Estimates. [Table: b and p for Age, Male, BMI and BP Meds under simple models and under multiple regression; estimates not shown, p <.0001 for Age, BMI and BP Meds] Focus on the association between BP meds and SBP…
108 RCT of New Drug to Raise HDL: Example of Effect Modification
Group | N | Mean | Std Dev
Women, New drug | 40 | 38.88 | 3.97
Women, Placebo | 41 | 39.24 | 4.21
Men, New drug | 10 | 45.25 | 1.89
Men, Placebo | 9 | 39.06 | 2.22
109 Simple Logistic Regression. Outcome is dichotomous (binary). We model the probability p of having the disease.
110 Multiple Logistic Regression. Outcome is dichotomous (1=event, 0=non-event) and p=P(event). Outcome is modeled as log odds: ln(p/(1-p)) = b0 + b1X1 + … + bpXp.
111 Multiple Logistic Regression for Birth Defect (Y/N). [Table columns: Predictor, b, p, OR (95% CI for OR). Rows: Intercept; Smoke, 95% CI (0.34, 22.51); Age, 95% CI (1.02, 1.78)] Interpretation of OR for age: the odds of having a birth defect for the older of two mothers differing in age by one year are estimated to be 1.35 times higher after adjusting for smoking.
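An odds ratio is the exponentiated regression coefficient, and its confidence interval is symmetric on the log scale. The Python sketch below works backwards from the reported interval for age, (1.02, 1.78). The coefficient and standard error it recovers are back-calculated approximations under the assumption of a Wald-type 95% interval, not values taken from the slide.

```python
from math import exp, log

# Reported 95% CI for the age odds ratio.
ci_low, ci_high = 1.02, 1.78
z = 1.96  # assumes a Wald interval: exp(b +/- 1.96 * SE)

# On the log-odds scale the interval is b +/- z * SE, so b is its midpoint.
b = (log(ci_low) + log(ci_high)) / 2
se = (log(ci_high) - log(ci_low)) / (2 * z)

odds_ratio = exp(b)
rebuilt_ci = (exp(b - z * se), exp(b + z * se))

print(round(b, 3), round(se, 3), round(odds_ratio, 2))
```

The recovered odds ratio agrees with the 1.35 quoted in the interpretation. Because the interval for age excludes 1, the association is statistically significant at the 5% level, whereas the interval for smoking, (0.34, 22.51), includes 1.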
112 Survival Analysis. Outcome is the time to an event. An event could be heart attack, cancer remission or death. Measure whether each person has the event or not (Yes/No) and, if so, their time to event. Determine factors associated with longer survival.
113 Survival Analysis. Incomplete follow-up information (censoring): we measure follow-up time and not time to event; we know survival time > follow-up time. Log rank test to compare survival in two or more independent groups.
115 Comparing Survival Curves. H0: Two survival curves are equal. χ² test with df=1: reject H0 if χ² > 3.84. χ² = [value not shown]. Reject H0.
116 Cox Proportional Hazards Model. ln(h(t)/h0(t)) = b1X1 + b2X2 + … + bpXp. exp(bi) = hazard ratio. Model used to jointly assess effects of independent variables on outcome (time to an event).
117 Outcome = all-cause mortality, with Age and Sex as predictors. [Table: bi, p and HR for Age and Male Sex]
118 Sample Size Determination. Need a sufficient sample to ensure precision in analysis. Sample size is determined based on the type of planned analysis: CI or test of hypothesis.
119 Determining Sample Size for Confidence Interval Estimates. Goal is to estimate an unknown parameter using a confidence interval estimate. Plan a study to sample individuals, collect appropriate data and generate a CI estimate. How many individuals should we sample?
120 Determining Sample Size for Confidence Interval Estimates. Confidence intervals: point estimate ± margin of error. Determine n to ensure a small margin of error (precision), accounting for attrition! Must specify desired margin of error, confidence level and variability of parameter.
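For a confidence interval for a mean, setting the margin of error E = Zσ/√n and solving for n gives n = (Zσ/E)². The Python sketch below applies this with illustrative inputs: σ = 9.5 minutes borrowed from the ED waiting time example, a desired margin of 1 minute, and an assumed 10% attrition rate. `n_for_mean_ci` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist

def n_for_mean_ci(sigma, margin, level=0.95, attrition=0.0):
    """Participants to enroll so a CI for a mean has the desired margin of error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    n_complete = ceil((z * sigma / margin) ** 2)   # participants needed with data
    return ceil(n_complete / (1 - attrition))      # inflate for expected dropout

n_needed = n_for_mean_ci(sigma=9.5, margin=1.0)
n_enroll = n_for_mean_ci(sigma=9.5, margin=1.0, attrition=0.10)
print(n_needed, n_enroll)
```

Halving the margin of error roughly quadruples the required sample size, since n depends on 1/E².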
121 Determining Sample Size for Hypothesis Testing. How many participants are needed to ensure that there is a high probability of rejecting H0 when it is really false? Determine n to ensure high power (usually 80% or 90%), accounting for attrition! Must specify desired power, α and effect size (difference in parameter under H0 versus H1).
122 Determining Sample Size for Hypothesis Testing. β and power are related to the sample size, level of significance (α) and the effect size (difference in parameter of interest under H0 versus H1). Power is higher with larger α. Power is higher with larger effect size. Power is higher with larger sample size.
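These relationships can be seen in a standard sample size formula. The Python sketch below assumes a two-sided test comparing two independent means, with n per group = 2((Z₁₋α/₂ + Z₁₋β)/ES)² and ES = |μ1 - μ2|/σ. That formula is one common choice rather than something stated on the slide, and the effect sizes are illustrative.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Per-group n for a two-sided test comparing two independent means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)          # power = 1 - beta
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

base = n_per_group(0.5)                           # alpha = 0.05, power = 80%
larger_alpha = n_per_group(0.5, alpha=0.10)
larger_effect = n_per_group(1.0)
higher_power = n_per_group(0.5, power=0.90)
print(base, larger_alpha, larger_effect, higher_power)
```

A larger α or a larger effect size lowers the n needed to reach the same power, which is the same statement as power being higher for a fixed n. Asking for higher power raises the required n.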