povertyactionlab.org Planning Sample Size for Randomized Evaluations Esther Duflo MIT and Poverty Action Lab

Planning Sample Size for Randomized Evaluations
General question: How large does the sample need to be to credibly detect a given effect size?
Relevant factors:
–Expected effect size
–Variability in the outcomes of interest
–Design of the experiment: stratification, control variables, baseline data, group vs. individual level randomization

Basic set up
At the end of an experiment, we will compare the outcome of interest in the treatment and the comparison groups.
We are interested in the difference: Mean in treatment − Mean in control = Effect size
This can also be obtained through a regression of Y on T: Y_i = α + β·T_i + ε_i, where β = Effect size.
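To make the equivalence concrete, here is a minimal sketch (not from the slides; it uses simulated data and only NumPy) showing that the OLS coefficient on the treatment dummy coincides with the raw difference in means:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated experiment: 1000 units, true effect of 2.0 on an outcome with SD 5.
n = 1000
T = rng.integers(0, 2, size=n)          # treatment dummy
Y = 10 + 2.0 * T + rng.normal(0, 5, n)  # outcome

diff_in_means = Y[T == 1].mean() - Y[T == 0].mean()

# OLS slope of Y on T (with an intercept): beta_hat = Cov(T, Y) / Var(T)
beta_hat = np.cov(T, Y)[0, 1] / np.var(T, ddof=1)

print(diff_in_means, beta_hat)  # the two estimates coincide (up to floating-point error)
```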

Estimation The sample average is an unbiased, consistent, and efficient estimator of the mean of Y in the population. The OLS estimate of beta in the previous equation is numerically equivalent to the difference in the sample averages.

Confidence Intervals
Since our experiment is only one draw from the population, the estimated effect size does not directly tell us how effective the treatment was for the population.
A 95% confidence interval for an effect size tells us that, for 95% of the random samples of treatment and control units that we could have drawn, an interval constructed in this way would contain the true effect size.
For large samples, an approximate 95% confidence interval is: estimated effect ± 1.96 × standard error of the estimate.

Confidence Intervals
Example 1:
–Women Pradhans have 7.13 years of education
–Men Pradhans have 9.92 years of education
–The difference is −2.79, with a standard error of 0.54
–The 95% confidence interval is [−3.84; −1.73]
Example 2:
–Women Pradhans have 2.45 children
–Men Pradhans have 2.50 children
–The difference is −0.05, with a standard error of 0.26
–The 95% confidence interval is [−0.55; 0.46]
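As a quick check of the arithmetic, the sketch below recomputes both intervals from the slide's estimates and standard errors using the large-sample rule of thumb (estimate ± 1.96 × standard error):

```python
# 95% confidence interval = estimate +/- 1.96 * standard error
def ci95(estimate, se):
    return (estimate - 1.96 * se, estimate + 1.96 * se)

# Example 1: difference in years of education (women - men Pradhans)
print(ci95(-2.79, 0.54))   # roughly (-3.85, -1.73): excludes zero

# Example 2: difference in number of children
print(ci95(-0.05, 0.26))   # roughly (-0.56, 0.46): includes zero
```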

Hypothesis testing
Often we are interested in testing the hypothesis that the effect size is equal to zero.
We want to test H_0: β = 0
Against H_a: β ≠ 0
(other possible alternatives: H_a: β > 0, H_a: β < 0).

Two types of mistakes
Type I error: Rejecting the null hypothesis H_0 when it is in fact true.
The significance level (or size) of a test is the probability of a type I error: α = P(Reject H_0 | H_0 is true)
Example 1: Female Pradhans' years of education is 7.13, and males' is 9.92 in our sample. Do female Pradhans have a different level of education, or the same? If I conclude they are different, how confident am I in that answer?
Example 2: Female Pradhans' number of children is 2.45, and males' is 2.50 in our sample. Do female Pradhans have a different number of children, or the same? If I conclude they are different, how confident am I in that answer?
Common levels of α: 0.05, 0.01, 0.10.

Two types of mistakes
Type II error: Failing to reject H_0 when it is in fact false.
The power of a test is one minus the probability of a type II error: Power = P(Reject H_0 | the effect size is not zero)
Example: If I run 100 experiments, in how many of them will I be able to reject the hypothesis that women and men have the same education at the 5% level?
Or: how likely is my experiment to fail to detect an effect when there is one?
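The "run 100 experiments" thought experiment can be made concrete with a small simulation. The sketch below is illustrative (the parameters, 100 units per arm and a true effect of 0.3 standard deviations, are assumptions, not from the slides); it estimates power as the share of simulated experiments that reject H_0 at the 5% level:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulated_power(n_per_arm=100, effect_sd=0.3, crit=1.96, reps=2000):
    """Fraction of simulated experiments that reject H0: effect = 0 (two-sided, 5% level)."""
    rejections = 0
    for _ in range(reps):
        control = rng.normal(0.0, 1.0, n_per_arm)
        treated = rng.normal(effect_sd, 1.0, n_per_arm)   # true effect in SD units
        diff = treated.mean() - control.mean()
        se = np.sqrt(treated.var(ddof=1) / n_per_arm + control.var(ddof=1) / n_per_arm)
        if abs(diff / se) >= crit:
            rejections += 1
    return rejections / reps

print(simulated_power())   # roughly 0.56 for these assumed parameters
```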

Testing equality of means
We use the t-statistic: t = (Mean in treatment − Mean in control) / (standard error of the difference).
If |t| ≥ 1.96, we reject the hypothesis of equality at the 5% level.
If |t| < 1.96, we fail to reject the hypothesis of equality at the 5% level.
Example of Pradhans' education:
–Difference: 7.85
–Standard error: 1.45
–t ≈ 5.4, so we definitely reject equality at the 5% level.
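Applied to the education example above, the test is just a ratio and a comparison to 1.96 (a two-line sketch using the slide's numbers):

```python
diff, se = 7.85, 1.45            # difference and standard error from the slide
t = diff / se                    # t-statistic, about 5.4
print(t, abs(t) >= 1.96)         # well above 1.96, so reject equality at the 5% level
```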

Calculating Power
When planning an evaluation, with some preliminary research we can calculate the minimum sample we need in order to:
–Test a pre-specified hypothesis (e.g. the treatment effect is 0)
–At a pre-specified significance level (e.g. 0.05)
–For a pre-specified effect size (e.g. 0.2 standard deviations of the outcome of interest)
–While achieving a given power
A power of 80% tells us that, in 80% of the experiments of this sample size conducted in this population, if H_0 is in fact false (e.g. the treatment effect is not zero), we will be able to reject it.
The larger the sample, the larger the power.
Commonly used power levels: 80%, 90%.

Ingredients for a power calculation in a simple study (what we need, and where we get it):
–Significance level: conventionally set at 5%
–The mean and the variance of the outcome in the comparison group: from a small survey in the same or a similar population
–The effect size that we want to detect: what is the smallest effect that should prompt a policy response? Rationale: if the effect is any smaller than this, then it is not interesting to distinguish it from zero.

Picking an effect size
What is the smallest effect that would justify adopting the program?
–Cost of this program vs. the benefits it brings
–Cost of this program vs. the alternative uses of the money
If the effect is smaller than that, it might as well be zero: we are not interested in proving that a very small effect is different from zero.
In contrast, any effect larger than that would justify adopting this program: we want to be able to distinguish it from zero.

Standardized Effect Sizes
How large an effect you can detect with a given sample depends on how variable the outcome is.
–Example: If all children have very similar learning levels without the program, a very small impact will be easy to detect.
The standardized effect size is the effect size divided by the standard deviation of the outcome:
–δ = effect size / standard deviation of the outcome
Commonly cited benchmarks (Cohen's conventions): δ = 0.2 (small), 0.5 (medium), 0.8 (large).
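Under the normal approximation, the standardized effect size, the significance level, and the desired power pin down the required sample size in closed form. The sketch below is illustrative, not the slides' own calculation (OD and similar tools use slightly more exact formulas, so their answers differ a little); it assumes a two-sided test, equal-sized arms, and individual-level randomization:

```python
from scipy.stats import norm

def n_per_arm(delta, alpha=0.05, power=0.80):
    """Sample size per arm to detect a standardized effect size delta
    with a two-sided test at significance level alpha (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_power = norm.ppf(power)           # about 0.84 for power = 0.80
    return 2 * (z_alpha + z_power) ** 2 / delta ** 2

print(round(n_per_arm(0.20)))   # about 392 units per arm for a "small" effect (round up in practice)
print(round(n_per_arm(0.50)))   # about 63 units per arm for a "medium" effect
```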

Power Calculations using the OD (Optimal Design) software
Choose "Power vs. number of clusters" in the "cluster randomized trials" menu.

Cluster Size
Choose clusters with 1 unit each (i.e., each individual is its own cluster, which corresponds to individual-level randomization).

Choose Significance Level and Treatment Effect
Pick α
–Normally you pick 0.05
Pick δ
–Can experiment with 0.20
You obtain the resulting graph showing power as a function of sample size.

Power and Sample Size
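The curve OD draws can be approximated by turning the sample-size formula around and computing power as a function of sample size. A sketch under the same assumptions as before (two-sided α = 0.05, standardized effect δ = 0.20, equal-sized arms):

```python
from scipy.stats import norm

def power_two_means(n_per_arm, delta=0.20, alpha=0.05):
    """Approximate power of a two-sided test comparing two means (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf(delta * (n_per_arm / 2) ** 0.5 - z_alpha)

for n in (50, 100, 200, 400, 800):
    print(n, round(power_two_means(n), 2))
# power rises from about 0.17 at 50 per arm to about 0.81 at 400 and 0.98 at 800 per arm
```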

The Design factors that influence power
–Clustered design
–Availability of a Baseline
–Availability of Control Variables, and Stratification
–The type of hypothesis that is being tested

Clustered Design
Cluster randomized trials are experiments in which social units or clusters, rather than individuals, are randomly allocated to intervention groups.
Examples (program: unit of randomization):
–PROGRESA: village
–Gender reservations: panchayat
–Flipcharts, deworming: school
–Iron supplementation: family

Reasons for adopting cluster randomization
Need to minimize or remove contamination
–Example: In the deworming program, the school was chosen as the unit because worms are contagious.
Basic feasibility considerations
–Example: The PROGRESA program would not have been politically feasible if some families in a village had been included and not others.
Only natural choice
–Example: Any education intervention that affects an entire classroom (e.g. flipcharts, teacher training).

Impact of Clustering
The outcomes for all the individuals within a unit may be correlated:
–All villagers are exposed to the same weather
–All Panchayats share a common history
–All students share a schoolmaster
–The program affects all students at the same time
–The members of a village interact with each other
We call ρ the correlation between units within the same cluster.

Implications For Design and Analysis
Analysis: The standard errors will need to be adjusted to take into account the fact that the observations within a cluster are correlated.
The adjustment factor (design effect) for clusters of size m, with intra-cluster correlation ρ, is D = 1 + (m − 1)·ρ
Design: We need to take clustering into account when planning sample size.
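A short sketch of the design-effect adjustment: D = 1 + (m − 1)·ρ is the factor by which the variance (and hence the required sample size) is inflated relative to an individually randomized sample of the same total size. The grid below regenerates the kind of multipliers shown in the next slide's table; the specific ρ and m values are illustrative, not the original table entries:

```python
def design_effect(m, rho):
    """Variance inflation factor for clusters of size m with intra-cluster correlation rho."""
    return 1 + (m - 1) * rho

# How much larger the sample must be, relative to individual randomization,
# for a few cluster sizes and intra-cluster correlations:
for rho in (0.01, 0.05, 0.10):
    for m in (10, 50, 100):
        print(f"rho={rho:.2f}, m={m:>3}: D={design_effect(m, rho):.2f}")
```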

Example of group effect multipliers: a table of the design effect D for various combinations of intraclass correlation ρ and randomized group size m.

Implications
We now need to consider ρ when choosing a sample size (as well as the other factors).
It is extremely important to randomize an adequate number of groups. The number of individuals within groups often matters less.
We have to simulate various scenarios to see whether it is better to add more clusters or more units per cluster.

Choosing the number of clusters with a known number of units
Example: Randomization of a treatment at the classroom level with 20 students per class:
–Choose the other options as before
–Set the number of students per cluster (e.g. 20)
–Set ρ

Power Against Number of Clusters with 20 Students per Cluster

Choosing the number of clusters when we can choose the number of units
To choose how many Panchayats to survey and how many villages per Panchayat in order to detect whether water improvements are significantly different for women and men:
–Mean in unreserved GPs: 14.7
–Standard deviation: 19
–ρ = 0.07
–We want to detect at least a 30% increase, so we set δ = 0.24
–We look for a power of 80%

Power Against Number of Clusters with 5 Villages Per Panchayat

Minimum Number of GPs, Fixing Villages per GP
We search for the minimum number of GPs we need if we survey 1 village per GP:
–Answer: 553
We search for the minimum number of GPs if we survey 2, 3, 4, etc. villages per GP.
For each combination, we calculate the number of villages we will need to survey, and the budget.
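A back-of-the-envelope version of this search, combining the closed-form sample-size formula with the design effect. This is a sketch, not OD's calculation: the slide's 553 GPs for 1 village per GP comes from OD's more exact method, whereas the normal approximation below gives roughly 546.

```python
import math
from scipy.stats import norm

def total_clusters(delta, rho, m, alpha=0.05, power=0.80):
    """Total number of clusters (both arms) needed to detect a standardized
    effect delta with m units per cluster and intra-cluster correlation rho."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n_individual = 4 * z ** 2 / delta ** 2              # total units, ignoring clustering
    n_clustered = n_individual * (1 + (m - 1) * rho)    # inflate by the design effect
    return math.ceil(n_clustered / m)

# Water example from the slides: delta = 0.24, rho = 0.07
for m in (1, 2, 3, 4, 5):
    gps = total_clusters(0.24, 0.07, m)
    print(f"{m} village(s) per GP: about {gps} GPs, {gps * m} villages surveyed")
```

Multiplying each combination by assumed per-GP and per-village survey costs then gives the budget comparison described above.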

Number of Clusters for 80% power

What sample do we need?

The Design factors that influence power
–Clustered design
–Availability of a Baseline
–Availability of Control Variables, and Stratification
–The type of hypothesis that is being tested

Availability of a Baseline
A baseline has two main uses:
–It lets us check whether the control and treatment groups were the same or different before the treatment.
–It reduces the sample size needed, but requires a survey before starting the intervention: typically the evaluation costs go up and the intervention costs go down.
To compute power with a baseline:
–You need to know the correlation between two subsequent measurements of the outcome (for example, pre- and post-test scores in a school).
–The stronger the correlation, the bigger the gain.
–Very big gains for very persistent outcomes such as test scores.
Using OD:
–The pre-test score will be used as a covariate; r² is its (squared) correlation with the outcome over time.
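The gain from a baseline can be sketched with the same sample-size formula: a covariate that explains a share R² of the endline variance (for example, a pre-test whose correlation with the post-test is r, so R² ≈ r²) scales the required sample size by roughly (1 − R²). The values below are illustrative assumptions:

```python
from scipy.stats import norm

def n_per_arm(delta, r2=0.0, alpha=0.05, power=0.80):
    """Sample size per arm, scaled down by the variance explained by a baseline covariate."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * z ** 2 * (1 - r2) / delta ** 2

for r in (0.0, 0.5, 0.8):          # correlation between baseline and endline outcome
    print(f"r = {r:.1f}: about {round(n_per_arm(0.20, r2=r**2))} units per arm")
# a very persistent outcome (r = 0.8) cuts the required sample to roughly a third
```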

Stratified Samples
Stratification will reduce the sample size needed to achieve a given power (you saw this in real time in the Balsakhi exercise).
The reason is that it reduces the variance of the outcome of interest within each stratum (and hence increases the standardized effect size for any given effect size) and the correlation of units within clusters.
Example: if you randomize within school and grade which class is treated and which class is control:
–The variance of test scores goes down because age is controlled for.
–The within-cluster correlation goes down because the "common principal effect" disappears.
Common stratification variables:
–Baseline values of the outcomes, when possible
–Variables along which we expect the treatment effect to differ across subgroups

Common Designs that Influence Power
–Clustered design
–Availability of a Baseline
–Availability of Control Variables, and Stratification
–The type of hypothesis that is being tested

Power in Stratified Samples with OD
We use a lower value for the standardized effect size.
We use a lower value for ρ (this is now the residual correlation within each site).
We need to specify the variability in effect size across sites. Common values: 0.01; 0.05.

The Design factors that influence power
–Clustered design
–Availability of a Baseline
–Availability of Control Variables, and Stratification
–The type of hypothesis that is being tested

The Hypothesis that is being tested
Are you interested in the difference between two treatments as well as the difference between treatment and control?
Are you interested in the interaction between the treatments?
Are you interested in testing whether the effect is different in different subpopulations?
Does your design involve only partial compliance (e.g. an encouragement design)?

Conclusions
Power calculations involve some guesswork.
They also involve some pilot testing before the proper experiment begins.
However, they can save you a lot of heartache if you know what you are getting into!