# Clinical Statistics for Non-Statisticians – Part II

Kay M. Larholt, Sc.D. Vice President, Biometrics & Clinical Operations Abt Bio-Pharma Solutions

Topics: Review of Statistical Concepts; Hypothesis Testing; Power and Sample Size; Interim Analysis

Basic Statistical Concepts

Statistics
Per the American Heritage dictionary: “The mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling.” Two broad areas: Descriptive, the science of summarizing data; and Inferential, the science of interpreting data in order to make estimates, test hypotheses, make predictions, or reach decisions about the target population from the sample.

Introduction to Clinical Statistics
Statistics: the science of making decisions in the face of uncertainty. Probability: the mathematics of uncertainty. The probability of an event is a measure of how likely the event is to happen.

Sample versus Population

Descriptive Statistics for Continuous Variables
Measures of central tendency: mean, median, mode. Measures of dispersion: range, variance, standard deviation. Measures of relative standing: lower quartile (Q1), upper quartile (Q3), interquartile range (IQR).
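
As an illustration of the measures listed above, here is a minimal Python sketch using only the standard library. The blood pressure values are hypothetical and are not taken from the presentation.

```python
import statistics

# Hypothetical sample: systolic blood pressure (mmHg) for nine patients.
sbp = [118, 122, 125, 125, 130, 134, 138, 142, 160]

# Measures of central tendency
mean = statistics.mean(sbp)
median = statistics.median(sbp)
mode = statistics.mode(sbp)

# Measures of dispersion
data_range = max(sbp) - min(sbp)
variance = statistics.variance(sbp)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(sbp)      # sample standard deviation

# Measures of relative standing: the three quartile cut points
q1, q2, q3 = statistics.quantiles(sbp, n=4)
iqr = q3 - q1

print(f"mean={mean:.1f} median={median} mode={mode}")
print(f"range={data_range} variance={variance:.1f} sd={std_dev:.1f}")
print(f"Q1={q1} Q3={q3} IQR={iqr}")
```

Different software packages use slightly different rules for interpolating quartiles, so Q1 and Q3 can differ a little between tools on small samples.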

Basic Probability Concepts
Sample spaces and events Simple probability Joint probability

Probability is the numerical measure of the likelihood that an event will occur. Its value is between 0 and 1: 0 means the event is impossible and 1 means it is certain.

Computing Probabilities
The probability of an event E, assuming each of the outcomes in the sample space is equally likely to occur:

$$P(E) = \frac{\text{Number of event outcomes}}{\text{Total number of possible outcomes in the sample space}}$$
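
A small sketch of this counting rule, using a fair six-sided die as a hypothetical example (the die and the helper name `probability` are illustrative, not from the slides):

```python
from fractions import Fraction


def probability(event, sample_space):
    """P(E) when every outcome in the sample space is equally likely."""
    favourable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favourable), len(sample_space))


die = {1, 2, 3, 4, 5, 6}       # sample space: one roll of a fair die
even = {2, 4, 6}               # event: the roll is even

p_even = probability(even, die)
print(p_even)                  # 1/2
```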

Gaussian or Normal Distribution aka “Bell Curve”
The most important probability distribution in the statistical analysis of experimental data. Data from many different types of processes follow a “normal” distribution, for example heights of American women and returns from a diversified asset portfolio. Even when the data do not follow a normal distribution, the normal distribution often provides a good approximation.

Gaussian or Normal Distribution aka “Bell Curve”
The Normal Distribution is specified by two parameters: the mean, μ, and the standard deviation, σ.

Standard Normal Distribution
σ = 1, μ = 0

Characteristics of the Standard Normal Distribution
Mean µ of 0 and standard deviation σ of 1. It is symmetric about 0 (the mean, median and the mode are the same). The total area under the curve is equal to one. One half of the total area under the curve is on either side of zero.

Area in the Tails of Distribution
The total area under the curve that is more than 1.96 units away from zero is equal to 5%. Because the curve is symmetrical, there is 2.5% in each tail.

Normal Distribution: about 68% of observations lie within ± 1 standard deviation of the mean.
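
The tail and central areas quoted above can be checked numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal distribution

# Total area more than 1.96 units away from zero (both tails together)
area_beyond_1_96 = 2 * (1 - z.cdf(1.96))

# Area within one standard deviation of the mean
area_within_1_sd = z.cdf(1) - z.cdf(-1)

print(f"beyond +/-1.96: {area_beyond_1_96:.4f}")  # about 0.05
print(f"within +/-1 sd: {area_within_1_sd:.4f}")  # about 0.68
```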

Study Design

Sample versus Population
A population is a whole, and a sample is a fraction of the whole. A population is a collection of all the elements we are studying and about which we are trying to draw conclusions. A sample is a collection of some, but not all, of the elements of the population

Sample versus Population
To support generalizations, a sample needs to be representative of the larger population from which it is taken. In the ideal scientific world, the individuals for the sample would be randomly selected. This requires that each member of the population has an equal chance of being selected each time a selection is made.

Randomisation: guards against any use of judgement or systematic arrangements, i.e. avoids bias. Provides a basis for the standard methods of statistical analysis, such as significance tests. Assures that treatment groups are balanced (on average) in all regards, i.e. balance occurs for known prognostic variables and for unknown or unrecorded variables.
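
A minimal sketch of simple (unrestricted) randomisation, assuming two arms and a hypothetical list of patient identifiers. Real trials typically use blocked or stratified schemes run by an independent system; this only illustrates that allocation is left to chance rather than judgement.

```python
import random


def randomise(patient_ids, arms=("Treatment", "Control"), seed=None):
    """Assign each patient to an arm purely by chance."""
    rng = random.Random(seed)
    return {pid: rng.choice(arms) for pid in patient_ids}


# Hypothetical trial of 1000 patients; the seed makes the sketch reproducible.
allocation = randomise(range(1, 1001), seed=2024)
n_treatment = sum(1 for arm in allocation.values() if arm == "Treatment")
print(f"Treatment: {n_treatment}, Control: {len(allocation) - n_treatment}")
```

The two groups come out roughly, not exactly, equal in size, which is what "balanced on average" means for simple randomisation.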

Inferential statistics calculated from a clinical trial make an allowance for differences between patients, and this allowance will be correct on average if randomisation has been employed.

Hypothesis Testing

Hypothesis Testing. Steps in hypothesis testing: state the problem, define the endpoint, formulate the hypothesis, choose the statistical test, set the decision rule, calculate, decide, and interpret. Statistical significance: types of errors, p-value, one-tail vs. two-tail tests, confidence intervals.

Descriptive and inferential statistics
Descriptive statistics is devoted to the summarization and description of data (population or sample). Inferential statistics uses sample data to make an inference about a population.

Objectives and Hypotheses
Objectives are questions that the trial was designed to answer. Hypotheses are more specific than objectives and are amenable to explicit statistical evaluation.

Examples of Objectives
To determine the efficacy and safety of Product ABC in diabetic patients. To evaluate the efficacy of Product DEF in the prevention of disease XYZ. To demonstrate that images acquired with product GHI are comparable to images acquired with product JKL for the diagnosis of cancer.

How do you measure the objectives?
Endpoints need to be defined in order to measure the objectives of a study.

Endpoints: Examples
Primary Effectiveness Endpoint: percentage of patients requiring intervention due to pain, where an intervention is defined as a change in pain medication or early device removal.

Endpoints: Examples
Primary Endpoint: percentage of patients with a reduction in pain, defined as a reduction in the Brief Pain Inventory (BPI) worst pain score of ≥ 2 points at 4 weeks over baseline.

Endpoints: Examples Patient Survival
Proportion of patients surviving two years post-treatment. Average length of survival of patients post-treatment.

Objectives and Hypotheses
The primary outcome measure is of greatest importance in the study and is used for the sample size calculation. More than one primary outcome measure raises multiplicity issues.

Hypothesis Testing
Null Hypothesis (H0): the status quo; usually a hypothesis of no difference; the hypothesis to be questioned/disproved. Alternate Hypothesis (HA): the ultimate goal; usually a hypothesis of difference; the hypothesis of interest.

Decision Making: a 2×2 table of “Truth” against Decision, in which the two kinds of wrong decision are labelled Type I Error and Type II Error.

Decision Making

| Decision \ “Truth” | Not Suitable to be a Physician | Suitable to be a Physician |
| --- | --- | --- |
| Don’t Accept to Medical School | No Error | Type II Error |
| Accept to Medical School | Type I Error | No Error |

Decision Making

| Decision \ “Truth” | Not Suitable to be a Teacher | Suitable to be a Teacher |
| --- | --- | --- |
| Don’t Accept to Teacher Training School | No Error | Type II Error |
| Accept to Teacher Training School | Type I Error | No Error |

Decision Making: “Truth” (Cancer or Not Cancer) against Test result (Positive or Negative), with the two kinds of wrong test result labelled Type I Error and Type II Error.

Decision Making

| Decision \ “Truth” | New Therapy doesn’t work | New Therapy works |
| --- | --- | --- |
| Not Positive Clinical Trial | No Error | Type II Error |
| Positive Clinical Trial | Type I Error | No Error |

Hypothesis Testing

| Decision | H0 is True | H0 is False |
| --- | --- | --- |
| Fail to reject H0 | No Error | Type II Error (β) |
| Reject H0 | Type I Error (α) | No Error |

Type I Error – Society’s Risk. Type II Error – Sponsor’s Risk.

Two Possible Errors of Hypothesis Testing
A Type I Error occurs when we conclude from an experiment that a difference between groups exists when in truth it does not, i.e. rejecting H0 when H0 is in fact true. Investigators reject H0 and declare that a real effect exists only when a result this extreme would occur less than 5% of the time if H0 were true.

Two Possible Errors of Hypothesis Testing
A Type II Error occurs when we conclude that there is no difference between treatments when in truth there is a difference, i.e. failing to reject H0 when H0 is in fact false.

Two Possible Errors of Hypothesis Testing
In many circumstances a Type I error is regarded as more serious than a Type II error. Example: H0: innocent vs. H1: guilty. Type I error = declaring an innocent man guilty. Type II error = declaring a guilty man innocent. Under the presumption of innocence, a negative test result means "there is not enough evidence to convict" rather than "innocence".

Review of errors in hypothesis testing
One will never know whether one has committed either error unless data are available for the entire population. The only thing we are able to do is to assign α and β as the probabilities of making either type of error. It is important to keep in mind the difference between the truth and the decision that is being made as a result of the experiment.

Hypothesis testing
Null Hypothesis: no difference between Treatment and Control. Type I error (alpha, α; the level against which the p-value is compared): the probability of declaring a difference between treatment and control groups even though one does not exist (i.e. treatment does not truly differ from control). As this is “society’s risk” it is conventionally set at 0.05 (5%).

Hypothesis testing
Type II error (beta, β): the probability of not declaring a difference between treatment and control groups even though one does exist (i.e. treatment truly differs from control). 1 - β is the power of the study. Power is often set at 0.8 (80% power); however, many companies use 0.9 (90% power). Underpowered studies have less probability of showing a difference if one exists.
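
One way to see what α and power mean in practice is to simulate many trials. The sketch below is an illustration under stated assumptions, not the presenter's method: it assumes a normally distributed endpoint with known standard deviation, equal arm sizes, a two-sided z-test, and hypothetical values (64 patients per arm, a true difference of half a standard deviation).

```python
import random
from statistics import NormalDist, mean


def rejects_h0(treatment, control, sigma, alpha=0.05):
    """Two-sided z-test for a difference in means, sigma assumed known."""
    n = len(treatment)  # assumes both arms have the same size
    standard_error = sigma * (2 / n) ** 0.5
    z = (mean(treatment) - mean(control)) / standard_error
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def rejection_rate(true_diff, n_per_arm=64, sigma=1.0, n_trials=2000, seed=1):
    """Fraction of simulated trials in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_arm)]
        treatment = [rng.gauss(true_diff, sigma) for _ in range(n_per_arm)]
        rejections += rejects_h0(treatment, control, sigma)
    return rejections / n_trials


type_i_rate = rejection_rate(true_diff=0.0)  # H0 true: every rejection is a Type I error
power = rejection_rate(true_diff=0.5)        # H0 false: rejections are correct decisions

print(f"Estimated Type I error rate: {type_i_rate:.3f}")
print(f"Estimated power:             {power:.3f}")
```

Under these assumptions the first rate should land near 0.05 and the second near 0.80; one minus the second rate estimates β.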

Steps in Hypothesis Testing

1. Choose the null hypothesis (H0) that is to be tested.
2. Choose an alternative hypothesis (HA) that is of interest.
3. Select a test statistic and define the rejection region for deciding when to reject H0.
4. Draw a random sample by conducting a clinical trial.
5. Calculate the test statistic and its corresponding p-value.
6. Draw a conclusion according to the pre-determined rule specified in step 3.

Hypothesis Testing - How to test a hypothesis
Assume that we believe that we have a fair coin (an equal chance of getting H or T when we flip the coin). Test the hypothesis by carrying out an experiment.

Hypothesis Testing - How to test a hypothesis
Flip the coin 4 times, each time is H. What is the likelihood of getting 4 H if this is a fair coin?

Remember the Binomial Probability Function
Let X be the number of H. X ~ Binomial(n = 4, p = 0.5). In this case, we want x = 4:

$$P(X = 4) = \binom{4}{4}(0.5)^{4}(0.5)^{0} = 0.0625 = 6.25\%$$

There is a 6.25% probability of getting 4 H even if this is a completely fair coin. If we were to include 4 T then there would be a 12.5% probability of getting 4 H or 4 T with a fair coin.

What happens if we increase the sample size?
What is the probability of getting 10 H if you flip a fair coin 10 times? X ~ Binomial(n = 10, p = 0.5). In this case, we want x = 10:

$$P(X = 10) = \binom{10}{10}(0.5)^{10}(0.5)^{0} \approx 0.00098 = 0.098\%$$

There is a 0.098% probability of getting 10 H even if this is a completely fair coin. If we were to include 10 T then there would be a 0.2% probability of getting 10 H or 10 T with a fair coin tossed 10 times.
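
The two coin calculations can be reproduced with a short function for the binomial probability function. This is a sketch; the function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


p_4_heads = binomial_pmf(4, 4, 0.5)     # 0.0625, i.e. 6.25%
p_10_heads = binomial_pmf(10, 10, 0.5)  # 0.0009765625, i.e. about 0.098%

print(f"4 H in 4 flips:   {p_4_heads:.4%}")
print(f"10 H in 10 flips: {p_10_heads:.4%}")
```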

How does this fit in with our decision making?
We hypothesised that this was a fair coin (50% chance of H and 50% chance of T). We carried out our experiment, flipped the coin 4 times and got 4 H. We calculated the probability of getting a result like this under H0 (fair coin): 6.25%.

Test of Significance and p-value
Statistically significant: Conclusion that the results of a study are not likely to be due to chance alone. Clinical significance is unrelated to statistical significance

Test of Significance and p-value
The p-value is the probability of observing a relationship (e.g., between variables) or a difference (e.g., between means) at least as large as the one seen in the sample if, in the population from which the sample was drawn, no such relationship or difference exists. It is not the probability that a given result is wrong.
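
To connect the p-value to the coin experiment, the sketch below computes an exact two-sided p-value for a fair-coin null hypothesis by adding up the probabilities of every outcome that is no more likely than the one observed. That rule for "at least as extreme" is one common convention and is an assumption here; the function names are illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def two_sided_p_value(heads, flips, p_null=0.5):
    """Exact p-value: total probability of outcomes no more likely than the observed one."""
    observed = binomial_pmf(heads, flips, p_null)
    tolerance = observed * (1 + 1e-9)  # guards against floating-point ties
    return sum(
        binomial_pmf(k, flips, p_null)
        for k in range(flips + 1)
        if binomial_pmf(k, flips, p_null) <= tolerance
    )


p_4_of_4 = two_sided_p_value(4, 4)      # counts 4 H and 4 T
p_10_of_10 = two_sided_p_value(10, 10)  # counts 10 H and 10 T

alpha = 0.05
print(f"4/4 H:   p = {p_4_of_4:.4f}, reject H0: {p_4_of_4 < alpha}")
print(f"10/10 H: p = {p_10_of_10:.4f}, reject H0: {p_10_of_10 < alpha}")
```

With only 4 flips even the most extreme result cannot reach significance at the 5% level, whereas 10 heads in 10 flips does.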

Power and Sample Size Basic terms and concepts
Study parameters: design, confidence level, power, acceptable error, effect size, variability

One day there was a fire in a wastebasket in the Dean's office and in rushed a physicist, a chemist, and a statistician. The physicist immediately starts to work on how much energy would have to be removed from the fire to stop the combustion. The chemist works on which reagent would have to be added to the fire to prevent oxidation. While they are doing this, the statistician is setting fires to all the other wastebaskets in the office. "What are you doing?" they demanded. "Well to solve the problem, obviously you need a large sample size" the statistician replies.

Power Calculation – a guess masquerading as mathematics
Stephen Senn Statistical Issues In Drug Development

Sample versus population

Power: Power is the probability of finding an effect when an effect actually exists. Power = Probability {correctly reject H0} = 1 – P(Type II Error). To increase power we want to decrease the Type II error.

In our experiment with the coin we observed that changing the sample size from 4 to 10 changed the probability of a Type I error. Assuming the coin was a fair coin: if we had rejected the Fair Coin hypothesis when we got 4/4 H, the probability of a Type I error would have been 6.25%; if we rejected the Fair Coin hypothesis when we got 10/10 H, the probability of a Type I error was 0.098%.

Sample size depends on: Power = 1 – Type II error (β); Type I error, α; meaningful effect size, δ; variability, σ.

Sample Size Rules of Thumb
If variability (σ) increases, then n (sample size) increases. If effect size (δ) increases, then n decreases. If either α or β decreases, then n increases.
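
These rules of thumb can be seen in a standard textbook approximation for comparing two means, n per group = 2σ²(z₁₋α/₂ + z₁₋β)² / δ². The slides do not give a formula, so the sketch below is an assumption chosen for illustration (normal approximation, equal group sizes, two-sided test), and the input values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate patients per group to detect a mean difference delta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided Type I error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    return ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)


baseline = n_per_group(sigma=10, delta=5)                # 63 per group
more_variable = n_per_group(sigma=20, delta=5)           # larger sigma -> larger n
bigger_effect = n_per_group(sigma=10, delta=10)          # larger delta -> smaller n
stricter_alpha = n_per_group(sigma=10, delta=5, alpha=0.01)
higher_power = n_per_group(sigma=10, delta=5, power=0.90)

print(baseline, more_variable, bigger_effect, stricter_alpha, higher_power)
```

Because n depends on the ratio σ/δ squared, doubling the variability or halving the effect size roughly quadruples the required sample size.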

Effect Size Effect size is the biologically significant difference e.g. size of the effect produced by a treatment. It is the generic term to describe the magnitude of the relationship between an independent variable and a dependent variable. Statistical significance demonstrates that the observed effect is unlikely to have occurred by chance, whereas effect size addresses the magnitude of the effect. Usually the symbol δ is used to refer to effect size.

Estimating the effect size
Estimating δ is definitely one of the most challenging aspects of these calculations. Specifically, we are conducting the study because the knowledge regarding the treatment under study is incomplete. We end up making guesses about the δ for the proposed research based on knowledge that is by necessity incomplete.

Estimating the effect size
When Clinical asks the statistician how many subjects are needed, the cautious statistician replies that ‘in order to do this, one needs the result from a study that is well designed, has ample power and tests the same hypothesis’. Clinical, however, replies that if such a study were available, there would be no need to do the study.

Estimating the effect size
A similar study may have been done previously. A pilot study can be done to provide an initial estimate of the δ. A meta-analysis of prior research can be used to provide estimates.

Estimating the effect size
Specifying the Minimum Effect of Interest Specify the minimum δ that would be meaningful. Is a 2% increase meaningful? Depends on the context. Is a 50% reduction meaningful? 50% of what? Relative reduction or absolute reduction? 60% vs 30% or 60% vs 10%?

Type I error: In most clinical trial settings the established standard is Type I error = 5%, i.e. there is a 5% chance that the null hypothesis will be rejected even though it is true (i.e. no difference between treatments).

Hypothesis Testing – Normal Distribution

Two-tailed test: Usually a two-tailed test is performed, with the risk of making a Type I error set at α/2 in each tail. If the null hypothesis is H0: Trt A = Trt B and the alternative hypothesis is HA: Trt A ≠ Trt B, each of the two ways of making a Type I error is equally undesirable.

One-tailed test Sometimes an investigator is only interested in a difference between treatments in one direction. This is appropriate when the scientific reasoning behind the experiment leads to a prediction in one direction. However FDA will not allow you to do One Tailed Tests at α = 0.05 but will use α = even if the study is designed as a one-tailed test.
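
The practical difference between the two kinds of test shows up in the critical value of the test statistic. As a sketch for a normally distributed test statistic at an overall α of 0.05 (variable names are illustrative):

```python
from statistics import NormalDist

alpha = 0.05
standard_normal = NormalDist()

# Two-tailed test: alpha / 2 in each tail
z_two_tailed = standard_normal.inv_cdf(1 - alpha / 2)

# One-tailed test: all of alpha in a single tail
z_one_tailed = standard_normal.inv_cdf(1 - alpha)

print(f"two-tailed critical value: {z_two_tailed:.3f}")  # about 1.960
print(f"one-tailed critical value: {z_one_tailed:.3f}")  # about 1.645
```

At the same α, a one-tailed test needs less extreme data to reject H0 in the predicted direction, which is why the choice has to be justified in advance.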

Calculation of sample size
The calculation of sample size depends on the summary statistics chosen. The most common choices are: Treatment mean, e.g. average blood pressure, average cholesterol, average days in hospital. Treatment proportion, e.g. % of patients who die, recover, achieve some therapeutic goal or reach any defined state.

Sample Size Calculations
Janet Wittes – Sample Size Calculations for Randomized Controlled Trials. Epidemiologic Reviews Vol. 24, No. 1, 2002 “Most informed consent documents for randomized controlled trials implicitly or explicitly promise the prospective participant that the trial has a reasonable chance of answering a medically important question.”

In order to fulfill that promise a clinical trial must be sized appropriately, have high enough power and long enough follow-up. Too many trials are designed with over-optimistic assumptions about treatment effect, inappropriate assumptions about compliance or follow-up and inaccurate assumptions about the response in the control group.

Example
A study is designed to determine whether the rate in the active treatment arm is significantly higher than the rate in the control arm. Compute the sample size with the following assumptions: 80% power; rate of 35% in the active treatment arm; rate of 15% in the control treatment arm; level of significance of 5%.

How would the sample size change if we change study parameters?
| Treat | Cont | Total (Power = 80%) | Total (Power = 90%) |
| --- | --- | --- | --- |
| .35 | .15 | 160 | 208 |
| .35 | .20 | 298 | 392 |
| .30 | .15 | 262 | 344 |
| .30 | .20 | 622 | 820 |
| .25 | .15 | 532 | 702 |
| .25 | .20 | 2262 | 3002 |

80% Power vs 90% Power? 80% power for detecting a statistically meaningful difference is generally considered desirable; however, 90% or higher is preferable. Example: .35 vs. .15 at 80% power gives n=160. What if the rates are .34 vs. .16 instead? With n=160 patients the power is only 70%! Notice also that the difference between .30 and .20 is .1 and the difference between .25 and .15 is .1, but the sample size needed to detect the same difference (0.1) is different.
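
The pattern in these numbers can be explored with a simple normal-approximation formula for two proportions. This is a sketch under stated assumptions: it uses the unadjusted formula with equal arms and a two-sided 5% test, which is not necessarily the method behind the slide figures. It gives smaller sample sizes and higher power than the slides quote (for example 70 per arm, 140 in total, for .35 vs .15), presumably because the slides used a more conservative calculation such as a continuity-corrected or exact method. The function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(p_treat, p_control, alpha=0.05, power=0.80):
    """Unadjusted normal-approximation sample size per arm for two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_treat * (1 - p_treat) + p_control * (1 - p_control)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_treat - p_control) ** 2)


def approx_power(p_treat, p_control, n_arm, alpha=0.05):
    """Approximate power of a two-sided test with n_arm patients per arm."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    variance = p_treat * (1 - p_treat) + p_control * (1 - p_control)
    standard_error = sqrt(variance / n_arm)
    return NormalDist().cdf(abs(p_treat - p_control) / standard_error - z_alpha)


n_35_15 = n_per_arm(0.35, 0.15)  # 70 per arm with this formula
n_30_20 = n_per_arm(0.30, 0.20)  # same 0.10 difference, rates nearer 50%
n_25_15 = n_per_arm(0.25, 0.15)  # same 0.10 difference, rates further from 50%

power_planned = approx_power(0.35, 0.15, n_arm=80)
power_actual = approx_power(0.34, 0.16, n_arm=80)

print(n_35_15, n_30_20, n_25_15)
print(f"{power_planned:.2f} -> {power_actual:.2f}")
```

The qualitative behaviour matches the slides: a slightly smaller true difference costs a noticeable amount of power, and the same absolute difference needs more patients when the rates are closer to 50%, because the variance p(1 - p) is larger there.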

Power = success? Designing a study with 80% power does not imply that there is an 80% chance that the study will be a success. Other factors influence the success of a study: the treatment does not work; placebo is better than expected; there is too much variability in the data.

Implications of an under-powered study
The power of the study provides us with a probability of rejecting the Null Hypothesis if the Null Hypothesis is incorrect. If we under-power the study then we have put patients at risk with a reduced chance of being able to reject the null hypothesis.

Sample size determination
It is important to: identify the primary endpoint; explicitly formulate the hypothesis to be tested; explicitly formulate the statistical analysis of the endpoint; account for loss to follow-up, drop-outs and compliance.

Put science before statistics
Studies should be designed to meet scientific goals. Although sometimes resources, time constraints and financial reasons may be issues, try not to estimate the number of patients who can be recruited into a trial and then ask the statistician to justify the sample size by calculating the "detectable" difference implied by the number of recruitable patients.

Put science before statistics (cont.)
Clinical trials should be large enough to detect a clinically important difference between two treatments. The appropriate inputs to power/sample-size calculations are effect sizes that are deemed clinically important, based on careful considerations of the underlying scientific (not statistical) goals of the study. It is easy to get caught up in statistical significance; but statistical considerations are used to identify a plan that is effective in meeting scientific goals -- not the other way around.

Interim Analysis

What is interim analysis?
Interim analysis is analysis of the data at one or more time points prior to the official close of the study with the intention of possibly terminating the study early.

Interim Analysis A Phase III study was designed with n=600. Recruitment is going slowly and the CEO asks you to do an interim analysis after 300 patients to see if the study can be stopped with a significant result. What are the problems with this approach?

Why was the study designed with 600 patients if 300 would be enough?
Is there any new evidence from outside the study that something has changed that would mean that 300 patients are enough? What are the implications for the blinding, power and Type I error of the study?

Interim analysis You might need to continue a trial, even after you have accumulated substantial evidence that the new therapy is superior, because you need the extra data to accurately characterize side effects. Interim analyses should be pre-specified to be valid. The level of evidence that you need to stop a study early is higher than what is needed at the end of the study.

Interim analysis: Reasons for considering an interim analysis. In a study where you expect the new therapy to be better than placebo, you might want to stop the study as soon as you have enough evidence that the new therapy is better. Ethical reasons (you want to minimize the number of subjects getting the placebo). Economic reasons (you don't want to spend extra money after enough evidence has been accumulated).

Interim analysis
We need to be careful! There is no “free lunch” in statistics. If we carry out one or more interim analyses, the test at the end of the study cannot be carried out at the 0.05 level. You have to “spend” some of your α level at each interim analysis, leaving you with less at the end. This reduces the power at the final analysis unless you have designed the study appropriately.

Interim analysis The two classic approaches to interim analysis are: Pocock method and O'Brien-Fleming method.
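
The cost of an unplanned look can be illustrated by simulation. The sketch below assumes one interim analysis at exactly half the data, a normally distributed test statistic, and no true treatment effect. The boundary values (2.178 at both looks for Pocock; 2.797 then 1.977 for O'Brien-Fleming) are the commonly tabulated two-sided 5% critical values for two equally spaced looks; they are quoted from standard tables as an assumption and should be checked against a group sequential design reference before any real use.

```python
import random


def overall_type_i_error(interim_cutoff, final_cutoff, n_trials=20000, seed=7):
    """Simulated chance of a false positive at either look when H0 is true."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_trials):
        first_half = rng.gauss(0, 1)    # standardised evidence, first half of patients
        second_half = rng.gauss(0, 1)   # standardised evidence, second half
        z_interim = first_half
        z_final = (first_half + second_half) / 2 ** 0.5
        if abs(z_interim) > interim_cutoff or abs(z_final) > final_cutoff:
            false_positives += 1
    return false_positives / n_trials


naive = overall_type_i_error(1.96, 1.96)              # test at 0.05 twice
pocock = overall_type_i_error(2.178, 2.178)           # same stricter cutoff at both looks
obrien_fleming = overall_type_i_error(2.797, 1.977)   # very strict early, near 1.96 at the end

print(f"Unadjusted:      {naive:.3f}")
print(f"Pocock:          {pocock:.3f}")
print(f"O'Brien-Fleming: {obrien_fleming:.3f}")
```

Testing twice at the 0.05 level inflates the overall Type I error to roughly 8%, while both adjusted designs hold it near 5%. They differ in how they spend α: Pocock spends it evenly across looks, O'Brien-Fleming saves most of it for the final analysis.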

Interim Analysis Procedure: clear details in the protocol; identification of an independent team for a blinded study; reporting and statistical analysis plan; data management strategy (cleaning data, database lock, etc.).

New Approaches to Clinical Trial Design
Group Sequential Trial Design Adaptive Trials

What have we learnt? Statistics is all about a way of thinking
If you don’t have uncertainty you don’t need statistics. p-values are probability statements that tell you something about your experiment. The sample size of any study depends on the treatment effect you expect to see and the variability of the measurement in the sample.

What haven’t we learnt? All the detailed theory and formulae that back up everything we have discussed. How to be a statistician (for that you do have to go to graduate school). How to get the perfect answer each time we run a clinical trial: we are working with patients, not widgets, and human beings are incredibly complex.

References: ICH Guidelines E9, E3 and others. Stephen Senn, Statistical Issues in Drug Development, John Wiley & Sons, 1997. Janet Wittes, Sample Size Calculations for Randomized Controlled Trials, Epidemiologic Reviews, Vol. 24, No. 1, 2002.

Thank You ! kay.larholt@abtbiopharma.com