Statistical Data Analyst/Research Specialist

Katheryne Downes, MPH Statistical Data Analyst/Research Specialist Office of Clinical Research/GME

A Little Study Design Terminology- Descriptive Studies
Case Study: Single patient is reviewed in detail.
Case Series: Similar to above- just expand the number to a small handful.
*Ecological Studies: Describes what’s going on at the population (summary) level. All data are collected at the same time- no individual data are collected.
*Cross-Sectional Studies: Similar in many ways to ecological, but examines individual level data instead of population level.
* These studies have potential for some weak analytic statistics.

A Little Study Design Terminology- Analytic Studies
1. *Cohort Studies: This study identifies people by their exposure status (yes/no) and follows them to determine if they developed the outcome (yes/no). Great study for unusual or rare exposures.
a. Retrospective: Past… E -> O?
b. Prospective: Present… E -> O?
* Cohort studies are sometimes used for purely descriptive purposes when we aren’t sure what phenomenon may occur.

A Little Study Design Terminology- Analytic Studies
2. Case-Control Studies: identify subjects by their disease/outcome status and then look backward to determine if they had the exposure of interest. E?----O
3. Randomized-Controlled Trials: Ah, yes… The Golden Child of research.
Randomize-- Expose -> Outcome?
Randomize-- Don’t Expose -> Outcome?

Study Design Quiz
A young epidemiologist (& budding statistician!) was assigned to investigate an outbreak of an unusual fungus in the lungs of patients undergoing bronchoscopy. There are about 15 patients and she’ll need to do a thorough review of the patients’ records to gather information to determine how these events may have taken place. (She’ll eventually spend DAYS in the medical record department and countless hours crawling through ventilation duct work and the hospital roof…but that’s another story…)
Is it:
A: A prospective cohort study
B: A case-control study
C: A cross-sectional study
D: A case-series
E: A study on crazy epidemiologists
Why did you select your answer?

Recap: 15 patients underwent bronchoscopy and ended up with really weird fungus growing in their lungs. In-depth review of charts.
A: A prospective cohort study NO. We’d need both exposed & unexposed groups for that.
B: A case-control study NO. We’d need both disease AND no disease groups.
C: A cross-sectional study NO. We’d need everyone that underwent bronchoscopy.
D: A case-series YES!!
E: A prospective study on crazy epidemiology interns NO. It would be a case study on crazy epidemiology interns.

Another Example…
A group of researchers is interested in whether VAP is associated with the use of a particular tube type. They begin by identifying all patients diagnosed with VAP in 2007 and also identify a similar group that did NOT develop VAP. They then look at the frequency of tube types among these two groups.
Is it:
A: A retrospective cohort study
B: A case-control study
C: A prospective cohort study
Why did you select your answer?

Recap: Is VAP associated with a certain ET tube type?
Start by looking at patients with and without VAP…then look at their tube type.
A: A retrospective cohort study NO. Pts need to be identified by exposure status in a cohort study.
B: A case-control study YES!!
C: A prospective cohort study NO. Again, pts would be identified by exposure status.

Almost done! (with this section, anyway)
A research group is interested in the impact of methadone use during pregnancy on baby outcomes. They have decided to follow a large group of pregnant women classified as either methadone users or non-users and will later gather information on GA, birthweight, Apgar Scores, etc.
Is it:
A: A prospective cohort study
B: A case series
C: An ecological study
D: A case-control study
Why did you choose your answer?

Recap: LOTS of pregnant women- some are on drugs. What happens to all the babies?
A: A prospective cohort study YES!!!
B: A case series NO. This is for small groups, unusual phenomena. Maybe a small group on extremely high doses?
C: An ecological study NO. We have individual level data here and we have a timeline.
D: A case-control study NO. That’s identification by outcome status- we’re identifying on exposure status.

We’ve now made it to… The Stats Section!
Questions So Far?

Basic Stats: Data Types
Categorical: the data have “categories” instead of numeric values. (ex: male/female, disease/no disease, red/orange/yellow) Dichotomous: Categorical variable with only two possible categories. Continuous: this means the variable can take on a range of possible values. (weight, bp, height, etc)

Categorical and Continuous Data
Remember… Categorical data: yes/no, male/female, disease/no disease Continuous data: weight, height, scores, blood values, etc.

Drill!
BMI: Continuous
Disease (Yes/No): Categorical
Temperature: Continuous
Test Score (1-10): Continuous

Drill!
Test (positive/negative): Categorical
Height: Continuous
Survival (months): Continuous
Gender: Categorical

Descriptive Statistics

Describing the data… Data: 2, 7, 7, 8, 9, 11, 15
Mode: most frequently occurring number (7)
Mean: average (59/7 ≈ 8.4)
Median: put numbers in order, middle number or average of two middle numbers (8) -AKA: 50th Percentile.

Drill!
Data: 1, 1, 3, 5, 6
Mean, Median, Mode?
Mean: 3.2
Median: 3
Mode: 1
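The drill can also be checked in a few lines of code. This is only a sketch using Python's standard `statistics` module; the helper name `central_tendency` is made up for illustration and is not part of the slides.

```python
from statistics import mean, median, multimode


def central_tendency(values):
    """Return the mean, median, and mode(s) of a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),    # middle value, or average of the two middle values
        "modes": multimode(values),  # a list, because a data set can have ties
    }


# The drill data from the slide.
drill = central_tendency([1, 1, 3, 5, 6])
print(drill)  # mean 3.2, median 3, modes [1]

# The earlier example data set.
example = central_tendency([2, 7, 7, 8, 9, 11, 15])
print(example)
```

`multimode` is used instead of `mode` so that a tie between two values is reported rather than hidden.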

Basic Stats: Descriptive Stats for continuous data
N, or n: We need to know how many people were in the sample. Results drawn from a sample with n=5 aren’t very likely to be reliable. However, a sample of n=100 will make you feel a little more comfortable. Central tendency: Mean, median, mode Variation: Standard deviation, variance, standard error
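As a rough illustration of the variation measures named above, here is a sketch that computes n, variance, standard deviation, and standard error for the small data set used earlier (2, 7, 7, 8, 9, 11, 15). It assumes the sample versions of these statistics (n - 1 in the denominator), and the function name `spread` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance


def spread(values):
    """Describe a continuous variable: n, mean, sample variance, SD, and SE of the mean."""
    n = len(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    return {
        "n": n,
        "mean": mean(values),
        "variance": variance(values),  # sample variance, i.e. SD squared
        "sd": sd,
        "se": sd / sqrt(n),            # standard error shrinks as n grows
    }


summary = spread([2, 7, 7, 8, 9, 11, 15])
print(summary)
```

The standard error line shows why a larger n is more comforting: the same SD divided by the square root of a bigger n gives a tighter estimate of the mean.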

Descriptive stats: Continuous
Normally distributed? Normal: mean, SD Not Normal: median, range or 95% CI * Special Case: Survival data are usually described with median and confidence interval.

The Empirical Rule How do we know if a distribution is “normal”??
-Visual Inspection (boxplots are very helpful) -Kolmogorov-Smirnov (sorry, no vodka involved) -Other tests

Basic Descriptive Stats for Categorical Data
Remember- you can’t take an average of yes/no (maybe?). (well, some people have tried to put that in papers…) So, how do we describe categorical data? N, or n Frequencies Percentages

Question: Best approximation of the actual value for non-normally distributed data? A: mean +/- standard error of the mean B: median +/- standard deviation C: median +/- confidence interval

Before we move onto statistical tests- some basic terminology…
Independent Variable: a predictor, a variable of interest Dependent Variable: the thing you’re trying to predict or the outcome of interest Ex: I’m conducting a study to determine whether administering antibiotic “x” approximately 12hrs before surgery reduces post-operative infection rates. What’s the independent variable? What’s the dependent variable?

Another Example… Does serum albumin level pre-surgery affect the 90 day survival of patients receiving an LVAD? What’s the dependent variable? What’s the independent variable?

Statistical Tests: Continuous
(Student’s) T-test: compares 2 groups on a continuous variable Paired t-test: compares 1 group, before and after on continuous variable ANOVA: Compares 3+ groups on a continuous variable *Post-hoc tests REQUIRED*
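To make the two-group t-test a little more concrete, here is a minimal sketch of the classic pooled-variance (Student's) t statistic. The two tiny groups are made-up numbers chosen only so the arithmetic is easy to follow, and the function name `student_t` is hypothetical. Turning the statistic into a p-value still needs a t table or a statistics package, which this sketch does not attempt.

```python
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    se_diff = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / se_diff, df


# Made-up illustrative data, not from any study.
t_stat, df = student_t([1, 2, 3], [4, 5, 6])
print(round(t_stat, 3), df)  # -3.674 4
```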

How the Guinness Brewery Changed History…
I know this has been a lot to digest so far- so, I make an effort to find those little things that make what you’re learning more interesting…. One of the tests we are going over is the t-test- which you often see called “Student’s t-test”. A lot of people don’t know the history of this test- and it’s actually quite entertaining…. “Student” was the pen-name of the statistician who developed the theory of the t-distribution. His real name was William Gosset. So, you might be wondering- for something as major as the discovery/creation of the t-distribution (just think about how often it’s currently used)- why on earth would you publish under a pen-name? Because you’d want to be given the credit, right? Well, it turns out that Gosset actually worked for the Guinness Brewery- he had developed the t-distribution as a cheap means of checking the quality of the beer brews. The Guinness Brewery didn’t want its competitors to know that it was using statistics to improve its beers, so they insisted that Gosset use a pen-name to publish the paper on the t-distribution, which appeared in Biometrika. Gosset then sent the tables he had created to R.A. Fisher (known for creation of the Fisher exact test)- and it was Fisher who then introduced the t-form of the equation that we are familiar with today….

Statistical Tests: Non-Parametric (the rebels!)
Mann-Whitney U: compares 2 groups on a continuous variable (non-parametric version of t-test) Wilcoxon Signed Ranks: compares 1 group, before and after on continuous variable (non-parametric version of paired t-test) Kruskal-Wallis: Compares 3+ groups on a continuous variable (non-parametric version of ANOVA) *Post-hoc tests REQUIRED*

Statistical tests: Categorical
Chi-Square: used with categorical data with expected cell values 5+ McNemar: paired proportions Fisher Exact: categorical data with expected cell values <5.

Details, Details….
If all expected cell values are “5” or higher, you can use the Chi-Square. If you have at least one cell with an expected value below “5”, you should use the Fisher Exact test.
Q: Umm, what’s a “cell?” Phospholipid bi-layers!?!?
A: In this case, a cell refers to the compartments of this 2x2 table:
          Group 1   Group 2
Males     5         4
Females   16        10
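The rule of thumb is stated in terms of expected cell values, which are not the same as the counts you observe. The sketch below shows one way to compute them (row total times column total, divided by the grand total) and apply the "5" cutoff. It assumes the 2x2 table above has Males/Females as rows and the two groups as columns; the function names are hypothetical.

```python
def expected_counts(table):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def suggested_test(table, cutoff=5):
    """Apply the rule of thumb: any expected count below the cutoff -> Fisher exact."""
    smallest = min(min(row) for row in expected_counts(table))
    return "Fisher exact" if smallest < cutoff else "Chi-square"


# Assumed layout: rows = Males, Females; columns = Group 1, Group 2.
observed = [[5, 4], [16, 10]]
print(expected_counts(observed))  # approximately [[5.4, 3.6], [15.6, 10.4]]
print(suggested_test(observed))   # Fisher exact
```

Here one expected count (3.6) falls below 5, so the rule of thumb points to the Fisher exact test.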

Analytic Statistics for Categorical variables
Q: But what about normality & that crazy Kolmo-whatchamacallit vodka test??!?! A: The tests for categorical variables don’t have any normality assumptions built in so your data can look as crazy as can be and you will be fine!

Drill ! A study is being conducted to evaluate the effectiveness of a new diet pill. There are two groups- one is receiving a placebo, the other the experimental drug. The outcome is BMI and is assumed to be normally distributed. What type of data? How would you summarize the data? What type of statistical test would you run? Continuous Mean +/- SD Two group t-test

Other Stats: Relative Risk/Odds Ratios
Relative Risk: Used in cohort studies when you have the incidence. (IR in exposed/IR in unexposed) [a/(a+b)] / [c/(c+d)]
Odds Ratio: Used in case-control studies to approximate relative risk. (ad/bc)
      D    ND
E     a    b
NE    c    d

Drill ! A group of patients are identified based off their exposure status to the H1N1 vaccine. They are being followed to determine whether they successfully develop antibodies to the novel virus. Q: What type of study is this? Q: What type of data does the outcome represent? Q: Name two statistical tests that could be used to evaluate this association. A: Prospective Cohort Study A: Categorical (yes/no for outcome) A: Chi-square (or fisher exact) and Relative Risk

Drill continued… So, here’s the data:
      O    NO   Total
E     90   10   100
NE    10   40   50
Calculate the RR !!!
RR = (90/100) / (10/50) = 4.5

Drill ! A group of patients are classified based on whether they have stomach cancer or not. They are then asked questions about their hot pepper consumption habits in the past 5 years (high consumption vs. low consumption). Q: What type of study is this? Q: What type of statistics could be used to evaluate the association? A: Case-Control Study A: Chi-Square/Fisher Exact or Odds Ratio

Drill continued… So, here’s the data:
      O    NO   Total
E     90   10   100
NE    10   40   50
Calculate the OR !!!
OR = (90*40) / (10*10) = 36
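Both drill answers can be reproduced with two short functions. This sketch assumes the usual 2x2 layout (a = exposed with outcome, b = exposed without, c = unexposed with outcome, d = unexposed without) and reads the unexposed row as 10 with the outcome and 40 without, which is what the 10/50 and 10*10 in the worked answers imply. The function names are hypothetical.

```python
def relative_risk(a, b, c, d):
    """Incidence in the exposed divided by incidence in the unexposed."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-product ratio ad/bc."""
    return (a * d) / (b * c)


# Drill table: exposed 90 with outcome / 10 without; unexposed 10 with / 40 without.
a, b, c, d = 90, 10, 10, 40
rr = relative_risk(a, b, c, d)
odds = odds_ratio(a, b, c, d)
print(round(rr, 2), round(odds, 2))  # 4.5 36.0
```

The same table gives very different numbers for the two measures because the outcome is common here; the odds ratio only approximates the relative risk when the outcome is rare.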

Regression Analysis?
Everything we’ve looked at so far is termed “univariate analysis” – meaning, we just look at the effect of ONE variable at a time, but what if there are a lot of different risk factors? What if they interact with each other? Regression analysis is used when we want to look at the complex interaction between different predictive variables on the outcome of interest. This analysis allows us to determine the effect of each variable on the outcome when ALL the others are controlled.

Regression Analysis… Regression Type is based on OUTCOME type (not predictor variables) Two Basic Types LOGISTIC Regression: Outcome is “dichotomous” LINEAR Regression: Outcome is “continuous” In both types of regression, you can enter BOTH continuous and categorical predictors.

Hypothesis Testing…
Null hypothesis: assumes that all the groups will behave similarly- no meaningful differences.
Alternate hypothesis: There IS a difference
One-sided: Group A is better than B
Two-sided: Group A is different than B
Note: This is the main type of hypothesis testing. There are some variations in which logic is flipped on its head: equivalence testing & non-inferiority testing are just two of them…

Hypothesis Testing
Test Result ↓ / Reality ->   No Difference          Difference
Fail to Reject Null          CORRECT                Type II Error (beta)
Reject Null                  Type I Error (alpha)   CORRECT

Hypothesis Testing
Type I Error: Incorrectly reject the null, alpha (0.05 or 0.01)
Type II Error: Incorrectly fail to reject the null, beta (1-beta = power) (power = 80%)
1: Sample size too small !!!
2: Observed difference was smaller than specified difference
P-value: probability of observing a result at least as extreme as the one you got if the null hypothesis were true (that is, by chance alone).

Drill ! Large randomized multicenter trial where no difference is seen. Why? A: Too strict inclusion criterion B: Too different populations because of different centers C: The clinical difference is smaller than the expected difference

Hypothesis Testing: 95% CI
95% CI: provides an estimate of the true value. In hypothesis testing, we’re looking for a certain value in the interval that corresponds to the null… Sooooo….in Relative Risk or Odds Ratios, we’re looking at the ratio of risks for two groups.
Q: If the risk is the same between the two groups, the ratio = ?
Q: What value are we looking for in the associated 95% CI?
A: Yes, we’re looking for the value of “1”. If that value is in the confidence interval, then “no difference” is in the range of true values and the result wouldn’t be significant.
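The "look for 1 in the interval" check can be sketched in code. The slides do not say how the interval is calculated, so this example assumes the common log-based approximation for a relative risk (standard error of ln(RR) = sqrt(1/a - 1/(a+b) + 1/c - 1/(c+d))) and applies it to the earlier drill table (90/10 exposed, 10/40 unexposed). The function name is hypothetical.

```python
from math import exp, log, sqrt


def rr_with_ci(a, b, c, d, z=1.96):
    """Relative risk with an approximate 95% CI built on the log scale."""
    rr = (a / (a + b)) / (c / (c + d))
    se_log_rr = sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper


rr, lower, upper = rr_with_ci(90, 10, 10, 40)
# The null value for a ratio is 1: if it sits inside the interval, not significant.
significant = not (lower <= 1 <= upper)
print(round(rr, 2), round(lower, 2), round(upper, 2), significant)
```

For this table the whole interval lies above 1, so the association would be called significant at the 0.05 level.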

Hypothesis Testing: 95% CI
What about a paired t-test? Q: What type of data is the test used for? Q: What’s the null value in this case? Q: So, what value are we looking for in the CI? A: Remember, this is generally used for before/after tests. So, if before = after, then after - before = 0. Therefore, we’re looking for a value of “0” in the CI. If we find it, the result is considered non-significant.

Trials and Studies RCT: Reduces bias, evens distribution of confounding factors, but sometimes can’t be used. Double Blind: doctor/patient don’t know what the patient is getting. Reduces observational bias. Cohort Study: Patients identified by exposure status and followed for outcome Case-Control Study: Patients identified by outcome status (case or control) and look back for exposures.

Drill ! For a cohort study, what type of ratio can be calculated?
A: Relative Risk For a case-control study, what type of ratio can be calculated? A: Odds Ratio

Drill ! What’s the formula for Relative Risk?
A: IR in exposed/IR in unexposed What’s the formula for Odds Ratio? A: ad/bc

Misc… Meta-Analysis: combines the data from several different studies. Often used when individual sample sizes are too small and underpowered. Be careful when the studies are too different from each other. Prevalence: # of current cases/total population Incidence: # of new cases/total population at risk

Test Diagnostics
        D       ND       Total
+       9 (a)   9 (b)    18
-       1 (c)   81 (d)   82
Total   10      90       100
Sensitivity: positive/all diseased a/(a+c) = 90%
Specificity: negative/all not diseased d/(b+d) = 90%
PPV: diseased/all positive a/(a+b) = 50%
NPV: no disease/all negative d/(c+d) = 98.8%
Accuracy: correct results/all (a+d)/(a+b+c+d) = 90%

Drill !
        D        ND       Total
+       10 (a)   15 (b)   25
-       10 (c)   30 (d)   40
Total   20       45       65
Calculate Sensitivity, Specificity, PPV, NPV, Accuracy.
Sensitivity: (10/20) 50%
Specificity: (30/45) 66.7%
PPV: (10/25) 40%
NPV: (30/40) 75%
Accuracy: ((10+30)/65) 61.5%
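All five measures come from the same four cells, so one small function covers both the worked example and the drill. This is a sketch using the a/b/c/d labels from the tables (a = true positives, b = false positives, c = false negatives, d = true negatives); the function name `diagnostics` is hypothetical.

```python
def diagnostics(a, b, c, d):
    """Test performance measures from a 2x2 table of test result vs. disease status."""
    return {
        "sensitivity": a / (a + c),           # positives among the diseased
        "specificity": d / (b + d),           # negatives among the non-diseased
        "ppv": a / (a + b),                   # diseased among the test-positives
        "npv": d / (c + d),                   # non-diseased among the test-negatives
        "accuracy": (a + d) / (a + b + c + d),
    }


worked_example = diagnostics(9, 9, 1, 81)
drill = diagnostics(10, 15, 10, 30)
for name, value in drill.items():
    print(f"{name}: {value:.1%}")
```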

Sample Question Prevalence of disease is 20%. Test is 80% sensitive and specific. What is the likelihood that a positive test is correct? First- What is this question asking for? A: Positive Predictive Value (PPV) So, we’re going to be reading across that first row… Second- How do we set up this table?

Steps to the answer…
Step 1: Draw the basic table with the correct orientation.
        D    ND
+
-

Steps to the answer…
Step 2: Begin with the prevalence they gave you…use EASY numbers “Prevalence of disease is 20%”
        D    ND   Total
+
-
Total   20   80   100

Steps to the answer…
Step 3: Use the other information to fill in the table… “Test is 80% sensitive and specific.”
        D    ND   Total
+       16   16   32
-       4    64   68
Total   20   80   100

Steps to answer the question…
Step 4: Answer the question! What are we looking for? PPV!
PPV = Disease/all positives a/(a+b) = 16/32 = 50%
        D    ND   Total
+       16   16   32
-       4    64   68
Total   20   80   100
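The four steps translate almost line for line into code. This sketch builds the same "easy numbers" table from a prevalence, sensitivity, and specificity and then reads PPV off the positive row; the function name and the default population of 100 are illustrative choices, not something the slides prescribe.

```python
def ppv_from_rates(prevalence, sensitivity, specificity, population=100):
    """Positive predictive value, built the same way as the table in the steps above."""
    diseased = prevalence * population              # Step 2: column totals from prevalence
    not_diseased = population - diseased
    true_positives = sensitivity * diseased         # Step 3: fill in the positive row
    false_positives = (1 - specificity) * not_diseased
    return true_positives / (true_positives + false_positives)  # Step 4: a/(a+b)


ppv = ppv_from_rates(prevalence=0.20, sensitivity=0.80, specificity=0.80)
print(f"{ppv:.0%}")  # 50%
```

The population size cancels out of the final ratio, which is why picking an easy number like 100 is safe.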

Questions

Thank you!
