Fundamental Concepts of Biostatistics Cathy Jenkins, MS Biostatistician II Lisa Kaltenbach, MS Biostatistician II April 17, 2007.

Prior to any analysis Define research question(s). Write out using no more than one sentence per question. Determine statistical analysis plan to address each research question. Analysis: Confounder Predictor Outcome

Population versus Sample Population: includes all possible observations of a particular type. Observations may be people, animals, places, or things Ex.: Men & women aged 18 and older; infants; penguins; bodies of water Sample: includes only some of the observations but selected in a way that gives every possible observation an equal chance of being observed. Ex.: Men & women aged 18 and older in the TennCare Database from years 1998-2005; infants with a primary care physician at Vanderbilt during years 1990-2000; all penguins living in the Nashville zoo since 1995; bodies of water included in Bay Delta & Tributaries Database

Population versus Sample [2] In most cases in clinical research, we want to generalize from information about our sample to information about a population.

Descriptive Statistics To describe characteristics of the sample Ex.: demographics, distributions, frequencies May want to describe data with numerical or graphical summary Characteristics of sample may be continuous or categorical variables

Continuous Variables Continuous: a variable that can take on any value within a range (Ex.: weight). Discrete Numeric: a variable whose set of possible values is a finite sequence of numbers (Ex.: pain scale 1 to 5). Numerical Summary: Often want to measure central tendency of data Sample mean: The sum of all of the observations divided by the number of observations. The mean is most informative when the data are roughly symmetric (for example, normally distributed); it is sensitive to skewness and outliers. Sample Median (50th Percentile): Order the observations from smallest to largest: If n is odd, then the median is the middle ordered observation. If n is even, then the median is the average of the two middle ordered observations.
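The mean and median rules above can be sketched in a few lines of Python. The weights are hypothetical values chosen for illustration, and `sample_median` is a name introduced here, not something from the presentation.

```python
from statistics import mean, median

# Hypothetical weights in kg (illustrative values, not from the presentation).
weights = [61.2, 70.5, 58.9, 82.3, 75.0, 66.4]

def sample_median(values):
    """Median by the rule on the slide: the middle ordered value if n is odd,
    the average of the two middle ordered values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Sample mean: sum of the observations divided by the number of observations.
sample_mean = sum(weights) / len(weights)

print("mean:", sample_mean)
print("median:", sample_median(weights))
```

The standard library functions `statistics.mean` and `statistics.median` follow the same definitions and can be used instead of the hand-written versions.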

Continuous Variables [2] Other common percentiles include quartiles (25th, 50th, and 75th percentiles) and deciles (10th, 20th, …, 90th percentiles). The p-th percentile is the value that p% of the data are less than or equal to. If p% of the data lie at or below the p-th percentile, it follows that (100 − p)% of the data lie at or above it. Ex: If the 85th percentile of household income is $60,000, then 85% of the households have incomes of $60,000 or less and the top 15% of households have incomes of $60,000 or more. Measures of Dispersion When measurements are collected there will be scatter, dispersion, or variability. Sources of dispersion Random error: Error due to chance Systematic Error: Wrong result due to bias Biological variability
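One way to turn the percentile definition into a computation is the nearest-rank rule: take the smallest observed value that has at least p% of the data at or below it. This is only one of several conventions (most statistical software interpolates between observations), and both the function name and the income figures below are hypothetical.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest observed value with at least p percent of the data at or below it."""
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the ordered data
    return ordered[rank - 1]

# Hypothetical household incomes in thousands of dollars (20 households).
incomes = [12, 18, 22, 25, 28, 31, 34, 37, 40, 42,
           45, 47, 50, 52, 55, 57, 60, 68, 80, 95]

p85 = percentile_nearest_rank(incomes, 85)
share_at_or_below = sum(v <= p85 for v in incomes) / len(incomes)
print("85th percentile:", p85)
print("share of households at or below it:", share_at_or_below)
```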

Continuous Variables [3] Minimum: Smallest observed value Maximum: Largest observed value Range: Difference between max and min (often reported as (min, max)) Interquartile Range (IQR): Difference between 75th and 25th percentiles (often reported as (25th, 75th)) Variance: The average of the squared deviations of the observations from their mean (for a sample, the sum of squared deviations is divided by n − 1). Standard Deviation: the square root of the variance Standard deviation measures spread about the mean and should be used only when the mean is chosen as the measure of center Has the same unit of measurement as the mean
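The dispersion measures above can be computed directly, as in this minimal sketch. The data are hypothetical, the variance uses the usual n − 1 divisor for a sample (an assumption about what the slide intends), and the quartiles come from `statistics.quantiles`, whose default interpolation method is only one of several in common use.

```python
import math
import statistics

# Hypothetical measurements (illustrative values only).
values = [4.1, 5.6, 3.8, 6.2, 5.0, 4.7, 5.9, 4.4]

n = len(values)
center = sum(values) / n

minimum, maximum = min(values), max(values)
value_range = maximum - minimum

# Sample variance: squared deviations from the mean, summed and divided by n - 1.
sample_variance = sum((v - center) ** 2 for v in values) / (n - 1)
sample_sd = math.sqrt(sample_variance)  # same unit of measurement as the mean

# Quartiles (25th, 50th, 75th percentiles) and the interquartile range.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print("range:", (minimum, maximum), "width:", value_range)
print("IQR:", (q1, q3), "width:", iqr)
print("variance:", sample_variance, "SD:", sample_sd)
```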

Continuous Variables [4] Graphical Summary:

Categorical Variables Categorical: a variable having only certain possible values (ex.: race). Binary: a categorical variable with only two possible values (ex.: gender). Ordinal: a categorical variable for which there is a definite ordering of the categories (ex.: severity of lower back pain ordered as none, mild, moderate, and severe). Numerical Summary: Frequency Distribution: A listing of distinct values for that characteristic and the number of observations having each value. Relative frequency: Proportion of the total number of observations that fall into each category. Cumulative frequency: Proportion of the total number of observations that fall into the current or previous categories listed (may be useful for ordinal variable).

Ibuprofen Use          Nonuser               Low Dose              High Dose
Frequency              4356 (67%)            1702 (26%)            438 (7%)
Relative frequency     4356/6498=.6703601    1702/6498=.2619267    438/6498=.06740536
Cumulative Frequency   4356                  6058                  6498
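A frequency distribution with relative and cumulative frequencies can be built as follows. This is a sketch on hypothetical ordinal data (back pain severity for 12 subjects), not the ibuprofen data; the cumulative column is reported as a proportion.

```python
from collections import Counter

# Ordinal categories listed in their natural order.
levels = ["none", "mild", "moderate", "severe"]

# Hypothetical observations, one per subject.
observations = ["mild", "none", "moderate", "mild", "severe", "none",
                "mild", "moderate", "none", "mild", "none", "moderate"]

counts = Counter(observations)
total = sum(counts.values())

table = []
running = 0
for level in levels:
    frequency = counts[level]
    running += frequency
    # (category, frequency, relative frequency, cumulative relative frequency)
    table.append((level, frequency, frequency / total, running / total))

for level, frequency, relative, cumulative in table:
    print(f"{level:<10}{frequency:>4}{relative:>8.2f}{cumulative:>8.2f}")
```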

Relationships between two variables Two variables measured on the same observations are associated if some values of the first variable tend to occur more often with some values of the second variable than with other values of that variable. Two continuous variables Ex.: Person’s weight and blood pressure Two categorical variables Ex.: gender and smoking status One continuous & one categorical variable Ex.: blood pressure and gender Keep in mind - relationship between two variables can be strongly influenced by other variables that are lurking in the background

Two Continuous Variables A scatterplot shows the relationship between two continuous variables measured on the same observations. The values of one variable appear on the x-axis, and the values of the other variable appear on the y-axis. Each observation appears as a point in the plot fixed by the values of both variables for that observation.

Two Continuous Variables [2] Graphical Summary: Scatterplot

Two Categorical Variables Numerical Summary: Cross-tabulation (or 2-way table) Ex.: Clinical Pregnancy by Age Groups

                       Age Group
Clinical Pregnancy     Age <35      35-37       Age >=38    Total
Yes                    136 (40%)    24 (7%)     17 (5%)     177 (53%)
No                     99 (29%)     47 (14%)    14 (4%)     160 (47%)
Total                  235 (70%)    71 (21%)    31 (9%)     337
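A cross-tabulation like the one above is a count of how often each pair of categories occurs together, plus row and column totals. The sketch below builds one from hypothetical subject-level records (15 made-up subjects, not the study data); percentages are of the grand total.

```python
from collections import Counter

# Hypothetical (outcome, age group) records, one per subject; not the study data.
records = (
    [("Yes", "<35")] * 4 + [("Yes", "35-37")] * 2 + [("Yes", ">=38")] * 1
    + [("No", "<35")] * 3 + [("No", "35-37")] * 3 + [("No", ">=38")] * 2
)
rows = ["Yes", "No"]
cols = ["<35", "35-37", ">=38"]

cell = Counter(records)  # count of each (row, column) combination
row_totals = {r: sum(cell[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(cell[(r, c)] for r in rows) for c in cols}
grand_total = len(records)

def fmt(count):
    """Count followed by its percentage of the grand total."""
    return f"{count} ({count / grand_total:.0%})"

print("Outcome", *cols, "Total", sep="\t")
for r in rows:
    print(r, *(fmt(cell[(r, c)]) for c in cols), fmt(row_totals[r]), sep="\t")
print("Total", *(fmt(col_totals[c]) for c in cols), grand_total, sep="\t")
```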

One Categorical and One Continuous variable Consider descriptive statistics of the continuous variable separately for different values of the categorical variable Ex: Descriptive statistics of birth weight by smoking status during pregnancy for mothers

                            Birth weight
Smoked during pregnancy     Mean       Standard Deviation    Frequency
Nonsmoker                   3055.0     752.4                 115
Smoker                      2773.2     660.1                 74
Total                       2944.7     729.0                 189
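Describing a continuous variable separately within each level of a categorical variable amounts to grouping the data and summarizing each group. A minimal sketch, using hypothetical birth weights in grams (the study's subject-level data are not given in the presentation):

```python
from statistics import mean, stdev

# Hypothetical birth weights in grams, grouped by smoking status.
birth_weight_g = {
    "Nonsmoker": [3100, 2950, 3400, 2800, 3250],
    "Smoker": [2700, 2900, 2600, 3050],
}

def describe(values):
    """Frequency, mean, and sample standard deviation of one group."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}

summary = {group: describe(values) for group, values in birth_weight_g.items()}
# Overall row: pool all observations regardless of group.
summary["Total"] = describe([v for values in birth_weight_g.values() for v in values])

for group, stats in summary.items():
    print(f"{group:<10} n={stats['n']:<3} mean={stats['mean']:.1f} sd={stats['sd']:.1f}")
```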

How to study the relationship between two different variables Quantify the relationship: Measure the strength of the relationship (linear, monotonic, …) between two continuous variables. Use hypothesis testing: Test theory to see if experimental results only reflect random chance. Fit model: Predict one measure of an individual from another.

Quantify the relationship Correlation coefficient (r): Quantitative summary of the strength of the relationship between two continuous variables. Pearson correlation: computed from the raw data; measures linear association. Spearman correlation: computed from the ranks of the raw data; measures monotonic association. Coefficient of determination (r²): Square of the correlation coefficient; the proportion of the variability in one variable that is explained by its linear relationship with the other. Correlation is not a cause and effect relationship but quantifies how well one variable predicts another.
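The two correlation coefficients differ only in what they are computed on: Pearson uses the raw values, Spearman uses their ranks. The sketch below implements both by hand on hypothetical weight and blood pressure values; the function names are introduced here for illustration, and tied values receive the average of their ranks.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from the raw data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical weights (kg) and systolic blood pressures (mmHg).
weight = [60, 65, 70, 80, 95]
systolic = [110, 118, 121, 130, 150]

r = pearson_r(weight, systolic)
rho = spearman_rho(weight, systolic)
r_squared = r ** 2  # square of the Pearson correlation
print("Pearson r:", r, "r squared:", r_squared, "Spearman rho:", rho)
```

Because blood pressure rises with every increase in weight in this made-up data, the Spearman correlation is 1 even though the relationship is not perfectly linear.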

Hypothesis testing [1] Define null hypothesis for question of interest that assumes the experimental results are due to chance alone. Perform statistical test to determine if we can reject or fail to reject the null hypothesis. NOTE: Absence of evidence is not evidence of absence. In other words, if our test results in a non-significant p-value, we do not “accept” the null hypothesis. Rather we fail to reject the null hypothesis. It could be that for the same experiment but a different sample we would obtain significant results. P-value: the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one actually observed.

Hypothesis testing [2] Categorical data Chi-square tests Both row and column variables are nominal Row variable nominal; column variable ordinal Both row and column variables are ordinal Tests whether distribution of frequencies differs across rows (groups) or whether there is any association between the row and column variables.
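As an illustration of the general chi-square test of association, the sketch below computes the statistic by hand for the counts in the Clinical Pregnancy by Age Group cross-tabulation shown earlier in the presentation. Applying this test to that table is a choice made here for illustration; the presentation does not report a test result for it. The p-value uses the closed form of the chi-square upper tail that holds only for 2 degrees of freedom.

```python
import math

# Observed counts: rows = clinical pregnancy (Yes, No),
# columns = age group (<35, 35-37, >=38).
observed = [
    [136, 24, 17],
    [99, 47, 14],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under no association: row total * column total / grand total.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# With exactly 2 degrees of freedom, P(X >= x) = exp(-x / 2).
if df != 2:
    raise ValueError("the closed-form p-value below is valid only for df = 2")
p_value = math.exp(-chi_square / 2)

print("chi-square:", round(chi_square, 3), "df:", df, "p-value:", round(p_value, 5))
```

For tables of other sizes the p-value would come from a chi-square distribution function in a statistics library rather than this shortcut.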

Hypothesis testing [3] Nominal row and column variables Example: Given data on the neighborhood in which a person lives and his political affiliation, you wish to test whether a person’s politics influences where he/she lives. H0: No association exists between a person’s political affiliation and the neighborhood in which he lives. HA: An association exists between a person’s political affiliation and the neighborhood in which he lives.

Hypothesis testing [4] Nominal row variables and ordinal column variables Example: Given data studying hours of headache pain relief (hours ranging from 0 – 6) using three different treatments – placebo, standard, and test treatment. H0: No association between hours of pain relief and treatment. HA: A shift in row mean hours of headache pain relief exists between the treatment groups.

Hypothesis testing [5] Ordinal row and column variables Example: Given data assessing how water additives (water, standard, super) affect the washability of clothes (low, medium, high). H0: No association between the water additive and the washability of the clothes. HA: There is a linear association between water additive and washability of clothes.

Hypothesis Testing [6] Continuous variables Parametric tests: Make assumptions about underlying distribution of data. 1-sample t-test H0: Mean of data is equal to some fixed value (defined by study question). 2-sample t-test H0: No difference in means between the two independent groups. Paired t-test H0: Mean of difference in paired data is equal to 0.
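The three t-tests share one idea: an observed mean (or difference in means) divided by its estimated standard error. The sketch below computes only the test statistics, on hypothetical data; the two-sample version assumes equal variances in the two groups (a pooled estimate), which is one common form of the test and an assumption made here. Turning a statistic into a p-value requires the t distribution with the appropriate degrees of freedom, which a statistics library would supply.

```python
import math
from statistics import mean, stdev, variance

def one_sample_t(values, mu0):
    """t statistic for H0: the population mean equals mu0 (df = n - 1)."""
    n = len(values)
    return (mean(values) - mu0) / (stdev(values) / math.sqrt(n))

def two_sample_t_pooled(a, b):
    """t statistic for H0: no difference in means between two independent
    groups, using a pooled variance (df = len(a) + len(b) - 2)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def paired_t(before, after):
    """t statistic for H0: the mean of the paired differences equals 0."""
    differences = [y - x for x, y in zip(before, after)]
    return one_sample_t(differences, 0.0)

# Hypothetical systolic blood pressures (mmHg).
before = [140, 152, 138, 147, 160]
after = [135, 150, 136, 140, 151]
group_a = [120, 132, 128, 141, 125, 130]
group_b = [135, 142, 129, 150, 138, 144]

print("one-sample t vs 130:", one_sample_t(group_a, 130))
print("two-sample t:", two_sample_t_pooled(group_a, group_b))
print("paired t:", paired_t(before, after))
```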

Hypothesis testing [7] Non-parametric tests: No assumptions about underlying distribution of data. Wilcoxon signed rank test – analogous to the one-sample or paired t-test. 1-sample H0: Median is equal to specified value (defined by study question). Paired H0: Median difference is equal to 0. Wilcoxon rank sum test – analogous to the two-sample t-test. H0: The distribution of the response variable is the same in the two independent groups.
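The Wilcoxon rank sum test replaces the observations with their ranks in the combined sample and asks whether one group's ranks are unusually high or low. The sketch below is a simplified version under stated assumptions: no tied values, a large-sample normal approximation for the p-value, and no continuity correction. Statistical software typically uses exact tables for small samples, so treat the p-value here as rough. The function name and data are hypothetical.

```python
import math
from statistics import NormalDist

def rank_sum_test(group_a, group_b):
    """Return (W, z, two-sided p) where W is the sum of group_a's ranks
    in the combined sample. Assumes no ties; normal approximation."""
    combined = sorted(group_a + group_b)
    if len(set(combined)) != len(combined):
        raise ValueError("this sketch does not handle tied values")
    rank_of = {value: position + 1 for position, value in enumerate(combined)}
    w = sum(rank_of[v] for v in group_a)

    na, nb = len(group_a), len(group_b)
    expected_w = na * (na + nb + 1) / 2       # mean of W under H0
    variance_w = na * nb * (na + nb + 1) / 12  # variance of W under H0
    z = (w - expected_w) / math.sqrt(variance_w)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p

# Hypothetical hours of pain relief in two independent treatment groups.
placebo = [0.5, 1.2, 1.9, 2.4, 0.8]
treatment = [2.1, 3.3, 4.0, 2.8, 5.1]

w, z, p = rank_sum_test(placebo, treatment)
print("W:", w, "z:", z, "approximate p:", p)
```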

Modeling [1]: ANOVA 1-way/2-way: extends 2-sample t-test (with 1 factor/2 factors) to n-groups -- compares mean of continuous variable across n-groups. H0: No difference in means between the n-groups. HA: At least one group has a different mean than the other (n-1) groups. Avoids problems with multiple comparisons. Tests whether between-group variability is large relative to within-group variability.
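For one-way ANOVA, the comparison of variability is summarized by the F statistic: the between-group mean square divided by the within-group mean square. A minimal sketch with hypothetical data and a hypothetical function name; the p-value would come from the F distribution with (k − 1, N − k) degrees of freedom, which is not computed here.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic for H0: all group means are equal."""
    k = len(groups)                      # number of groups
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: how far group means sit from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: scatter of observations around their group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical outcome values in three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print("F statistic:", one_way_anova_f(groups))
```

A large F means the group means differ by more than the scatter within groups would suggest; an F near or below 1 gives no evidence against the null hypothesis.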

Modeling [2]: Linear regression Continuous outcome. Assumes relationship between predictor(s) and outcome is linear. Observations assumed to be independent (i.e., only one observation per subject, no subjects that are related to each other, etc.) Number of predictors allowed in the model depends on the sample size. Rule of thumb: no more than n/10 predictors where n = # of subjects. Include confounders in the model for better parameter estimates. Output is a set of parameter estimates – can give information similar to that obtained from hypothesis testing. Allows the investigator to make inference based on the parameter estimates.
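With a single predictor, the parameter estimates are the intercept and slope of the least-squares line. The sketch below fits that line by hand on hypothetical data and also encodes the n/10 rule of thumb; both function names are introduced here for illustration. Models with several predictors and confounders would normally be fit with a statistics package.

```python
def fit_line(x, y):
    """Least-squares estimates (intercept, slope) for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def max_predictors(n_subjects):
    """Rule of thumb from the slide: no more than n/10 predictors."""
    return n_subjects // 10

# Hypothetical predictor and outcome values.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

intercept, slope = fit_line(x, y)
print("intercept:", intercept, "slope:", slope)
print("predictors allowed with 57 subjects:", max_predictors(57))
```

The slope is the estimated change in the outcome for a one-unit increase in the predictor, which is the kind of inference the parameter estimates support.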

Modeling [3]: Logistic regression Categorical outcome – typically binary. Number of predictors in model depends on several things: Each group has at least 10 subjects. Cell counts in a cross-tab table meet certain sample size criteria: 80% of expected counts are at least 5. All other expected counts are greater than 2, with virtually no 0 counts. Output is a set of parameter estimates and odds ratios calculated from these parameter estimates. Odds ratio: the odds of a certain event in the first group divided by the odds of that event in the second group; a way of comparing whether the event is equally likely in the two groups. OR = 1: The event is equally likely in both groups. OR > 1: The event is more likely in the first group. OR < 1: The event is less likely in the first group.
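An odds ratio can be computed directly from a 2x2 table of counts, and in logistic regression it is obtained by exponentiating a parameter estimate (which is on the log-odds scale). A minimal sketch with hypothetical counts and hypothetical function names:

```python
import math

def odds_ratio(events_1, non_events_1, events_2, non_events_2):
    """Odds of the event in group 1 divided by the odds in group 2."""
    odds_1 = events_1 / non_events_1
    odds_2 = events_2 / non_events_2
    return odds_1 / odds_2

def odds_ratio_from_coefficient(beta):
    """Odds ratio implied by a logistic regression coefficient (log-odds scale)."""
    return math.exp(beta)

# Hypothetical counts: 20 of 100 subjects in group 1 and 10 of 100 in group 2
# experienced the event.
or_value = odds_ratio(20, 80, 10, 90)
print("odds ratio:", or_value)

# The same comparison expressed as a coefficient and back again.
beta = math.log(or_value)
print("log odds ratio:", beta, "back-transformed:", odds_ratio_from_coefficient(beta))
```

Here the odds ratio is above 1, so the event is more likely in the first group; a coefficient of 0 corresponds to an odds ratio of 1.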

Conclusion Statistical analysis plan should be devised before collecting data. Use aims of study to decide the best way to study the relationship between variables of interest – correlations, hypothesis testing, modeling. Make use of the daily biostatistics clinics. Refer to http://biostat.mc.vanderbilt.edu for the clinic schedule. Click on the “Clinics” link (5th link from the top). Check to see if your department is a part of the collaboration plan for more intensive biostatistics support.
