Designing an impact evaluation: Randomization, statistical power, and some more fun…
Designing a (simple) RCT in a couple of steps You want to evaluate the impact of something (a program, a technology, a piece of information, etc.) on an outcome. Example: Evaluate the impact of free school meals on pupils' schooling outcomes. You decide to do it through a randomized controlled trial. – Why? The questions that follow: – Type of randomization – What is most appropriate? – Unit of randomization – What do we need to think about? – Sample size > These are the things we will talk about now.
I. Where to start You have a HYPOTHESIS. Example: Free meals => increased school attendance => increased amount of schooling => improved test scores. Or could it go the other way? To test your hypothesis, you want to estimate the impact of a variable T on an outcome Y for an individual i. In a simple regression framework: $Y_i = \alpha_i + \beta T + \varepsilon_i$. How could you do this? – Compare schools with free meals to schools with no free meals? – Compare test scores before the free meal program was implemented to test scores after?
II. Randomization basics You decided to use a randomized design. Why? – Randomization removes the selection bias > Trick question: Does the sample need to be randomly sampled from the entire population? – Randomization solves the causal inference issue by providing a counterfactual = comparison group. While we can't observe $Y_i^T$ and $Y_i^C$ at the same time, we can measure the average treatment effect by computing the difference in mean outcome between two a priori comparable groups. We measure: $ATE = E[Y^T] - E[Y^C]$
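The difference-in-means idea can be illustrated with a small simulation. This is only a sketch with made-up numbers: the baseline distribution, the true effect of 2.0, and the function name `simulate_ate` are all assumptions chosen for illustration, not values from the presentation.

```python
import random
import statistics

def simulate_ate(n=10_000, true_effect=2.0, seed=0):
    """Randomly assign n units to T or C and return the difference in mean outcomes."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        baseline = rng.gauss(50.0, 10.0)  # hypothetical untreated outcome for unit i
        if rng.random() < 0.5:            # coin-flip assignment, independent of baseline
            treated.append(baseline + true_effect)  # we only observe the treated outcome
        else:
            control.append(baseline)                # we only observe the untreated outcome
    return statistics.mean(treated) - statistics.mean(control)

estimate = simulate_ate()
print(f"Estimated ATE: {estimate:.2f} (true effect used in the simulation: 2.00)")
```

Because assignment does not depend on the baseline outcome, the two groups are comparable on average, and the difference in means should land close to the effect built into the simulation.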
II. Randomization basics What to think of when deciding on your design? – Types of randomization / unit of randomization: block design, phase-in, encouragement design. Stratification? The decision should come from (1) your hypothesis, (2) your partner's implementation plans, (3) the type of intervention! Example: What would you do? Next step: How many units? = SAMPLE SIZE. Intuition --> Why do we need many observations?
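As one possible illustration of stratification, the sketch below randomizes schools within strata so that treatment and control are balanced on the stratifying variable. The school list, the urban/rural stratum, and the helper name `stratified_assignment` are hypothetical.

```python
import random
from collections import defaultdict

def stratified_assignment(units, stratum_of, seed=0):
    """Shuffle units within each stratum; the first half goes to T, the rest to C."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    assignment = {}
    for key in sorted(strata):          # fixed order keeps the result reproducible
        members = strata[key]
        rng.shuffle(members)
        half = len(members) // 2
        for i, unit in enumerate(members):
            assignment[unit] = "T" if i < half else "C"
    return assignment

# Hypothetical sample: 20 schools, each tagged urban or rural.
schools = [("school_%02d" % i, "urban" if i % 3 == 0 else "rural") for i in range(20)]
assignment = stratified_assignment(schools, stratum_of=lambda s: s[1])
for school, arm in sorted(assignment.items()):
    print(school[0], school[1], arm)
```

With an odd number of units in a stratum, this sketch places the extra unit in control; a real design might randomize that leftover unit instead.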
Remember, we're interested in Mean(T) - Mean(C). We measure scores in 1 treatment school and 1 control school > Can I say anything?
III. Sample size But how to pick the optimal size? -> It all depends on the minimum effect size you'd want to be able to detect. Note: Standardized effect sizes. POWER CALCULATIONS link minimum effect size to design. They depend on several factors: – The effect size you want – Your randomization choices – The baseline characteristics of your sample – The statistical power you want – The significance you want for your estimates We'll look into these factors one by one, starting from the end…
III. Power calculations (1) Hypothesis testing When trying to test a hypothesis, one actually tests the null hypothesis $H_0$ against the alternative hypothesis $H_a$, and tries to reject the null. $H_0$: Effect size = 0. $H_a$: Effect size ≠ 0. Two types of error are to be feared:
TRUTH \ YOUR CONCLUSION | Effective (reject $H_0$)      | No effect (can't reject $H_0$)
Effective               | (correct)                     | TYPE II ERROR (POWER)
No effect               | TYPE I ERROR (SIGNIFICANCE)   | (correct)
III. Power calculations (1) Significance SIGNIFICANCE = Probability that you'd conclude that T has an effect when in fact it doesn't. It tells you how confident you can be in your answer. (Denoted α) – Classical values: 1, 5, 10% – Hypothesis testing basically comes down to testing equality of means between T and C using a t-test. For the effect to be significant, the t-stat obtained must be greater than the critical t-value of the significance level wanted. Or again: $|\hat{\beta}| / SE(\hat{\beta})$ must be greater than or equal to $t_\alpha = 1.96$ (the two-sided critical value for α = 5% in large samples).
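A minimal sketch of the mean-comparison test described above, using an unequal-variance (Welch-style) two-sample t-statistic. The test scores are invented for illustration, and the 1.96 threshold is the large-sample 5% value; with samples this small, the exact critical value from a t-distribution would be somewhat larger.

```python
import math
import statistics

def t_statistic(treated, control):
    """Difference in means divided by its estimated standard error."""
    diff = statistics.mean(treated) - statistics.mean(control)
    se = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return diff / se

# Hypothetical test scores (not real data).
treated = [62, 58, 65, 70, 61, 66, 59, 64]
control = [55, 60, 52, 58, 57, 54, 61, 53]

t = t_statistic(treated, control)
significant = abs(t) >= 1.96  # large-sample 5% two-sided threshold
print(f"t = {t:.2f}, significant at 5% (large-sample rule): {significant}")
```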
III. Power calculations (2) Power POWER = Probability that, if a true effect of a given size exists, you will detect it (reject the null) for a given sample size. (Denoted κ) – Classical values: 80, 90% To achieve a power κ, it must be that: $\beta > (t_{1-\kappa} + t_\alpha)\, SE(\hat{\beta})$. Or graphically: [figure not reproduced in this transcript]. In short: To have a high chance to detect an effect, one needs enough power, which depends on the standard error of the estimate of β.
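To make the link between power and the standard error concrete, here is a rough sketch based on a normal approximation. It assumes a two-sided test and ignores the negligible chance of rejecting in the wrong tail; `approx_power` is a hypothetical helper name.

```python
from statistics import NormalDist

def approx_power(effect, se, alpha=0.05):
    """Approximate probability of rejecting H0 when the true effect is `effect`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)      # critical value, about 1.96 for alpha = 5%
    return z.cdf(effect / se - z_alpha)     # mass of the estimate's distribution beyond it

# Same true effect, increasingly precise estimates.
for se in (2.0, 1.0, 0.5):
    print(f"SE = {se:.1f} -> power = {approx_power(effect=2.0, se=se):.2f}")
```

Under this approximation, an effect about 2.8 standard errors away from zero yields roughly 80% power at the 5% level, which is where the familiar rule of thumb comes from.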
III. Power calculations (3) Standard error of the estimate Intuition = the higher the standard error, the less precise the estimate, the trickier it is to identify an effect, and the lower your power to detect it! – Demonstration: How does the spread of a variable affect the precision of a mean comparison test? We saw that power depends on the SE of the estimate of β. But what does this standard error depend on? – Standard deviation of the error (how heterogeneous the sample is) – The proportion of the population treated (randomization choices) – The sample size
III. Power calculations (4) Calculations We now have all the ingredients of the equation. The minimum detectable effect (MDE) is: $MDE = (t_{1-\kappa} + t_\alpha) \sqrt{\frac{1}{P(1-P)}} \sqrt{\frac{\sigma^2}{N}}$, where σ is the standard deviation of the error, P the proportion treated, and N the sample size. As you can see: – The higher the heterogeneity of the sample, the higher the MDE, – The lower N, the higher the MDE, – The higher the power you require, the higher the MDE (for a given N). In practice, power calculations correspond to playing with all these ingredients to find the optimal design to satisfy your MDE: – Optimal sample size? – Optimal portion treated?
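Playing with the ingredients can be sketched in a few lines. The code assumes the usual large-sample expression, in which the MDE is the sum of the two critical values times the standard error σ / sqrt(P(1-P)N), and it uses normal quantiles in place of t quantiles. The function names and default values are illustrative assumptions.

```python
import math
from statistics import NormalDist

def _multiplier(power, alpha):
    """Sum of the power and significance critical values (normal approximation)."""
    z = NormalDist()
    return z.inv_cdf(power) + z.inv_cdf(1 - alpha / 2)

def mde(n, sigma=1.0, p=0.5, power=0.80, alpha=0.05):
    """Minimum detectable effect for sample size n and treated share p."""
    return _multiplier(power, alpha) * sigma / math.sqrt(p * (1 - p) * n)

def required_n(target_mde, sigma=1.0, p=0.5, power=0.80, alpha=0.05):
    """Smallest n whose MDE does not exceed target_mde (the formula solved for N)."""
    raw = (_multiplier(power, alpha) * sigma / target_mde) ** 2 / (p * (1 - p))
    return math.ceil(raw)

print(f"MDE with N=1000, half treated: {mde(1000):.3f} standard deviations")
print(f"MDE with N=1000, 20% treated:  {mde(1000, p=0.2):.3f} standard deviations")
print(f"N needed for an MDE of 0.2 SD: {required_n(0.2)}")
```

With sigma set to 1, the MDE is expressed in standard deviations of the outcome, which matches the note on standardized effect sizes. For a fixed N, an even split between treatment and control minimizes the MDE in this formula.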
III. Power calculations (5) More complicated frameworks Several treatments? – What happens when there is more than one treatment? – It all depends on what you want to compare! Stratification? – Reduces the standard deviation of the error. Clustered (block) design? – When using clusters, the outcomes of the observations within a cluster can be correlated. What does this mean? – The intra-cluster correlation ρ, the portion of the total variance explained by between-cluster variance, implies an increase in the variance of the estimated effect. – Impact on MDE? – In short: the higher ρ, the higher the MDE (the increase can be large).
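The effect of clustering on the MDE can be approximated with the standard design effect, 1 + (m - 1)ρ, where m is the number of individuals per cluster. The sketch below assumes equal-sized clusters and takes the individually randomized MDE as an input; the numbers are illustrative only.

```python
import math

def design_effect(cluster_size, rho):
    """Variance inflation from randomizing equal-sized clusters instead of individuals."""
    return 1 + (cluster_size - 1) * rho

def clustered_mde(simple_mde, cluster_size, rho):
    """MDE under cluster randomization, given the MDE for the same N without clustering."""
    return simple_mde * math.sqrt(design_effect(cluster_size, rho))

# Hypothetical case: 40 pupils per school, individually randomized MDE of 0.20 SD.
for rho in (0.0, 0.05, 0.10, 0.30):
    print(f"rho = {rho:.2f} -> MDE = {clustered_mde(0.20, 40, rho):.2f} SD")
```

Even a modest intra-cluster correlation can more than double the MDE when clusters are large, which is why adding clusters usually buys more power than adding individuals within clusters.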
Summary When thinking of designing an experiment: 1. What is your hypothesis? 2. How many treatment groups? 3. What unit of randomization? 4. What is the minimum effect size of interest? 5. What optimal sample size considering power/budget? => Power calculations!