Experimental designs for Impact Evaluation: part II


1 Experimental designs for Impact Evaluation: part II
Héctor Lamadrid, Center for Population Health Research, National Institute of Public Health, Mexico. We will now focus on practical issues related to experimental designs.

2 In this second presentation we will give attention to:
Factors that influence the statistical power and required sample size of the evaluation study. Other important analytical considerations. We will look at issues related to statistical power and other considerations that arise when conducting an experiment in the context of programme evaluation.

3 What is statistical power?
Statistical power is the probability of rejecting the null hypothesis when it is in fact false. In other words, it is the probability that we find a statistically significant programme impact when an effect actually exists. It is the complement of the type II error (failing to reject a false null hypothesis). The usual definition of "acceptable" power is 0.80 or 0.90. Power is the capacity of a statistical test to detect an association that really exists; in our case, it is the probability that we find a programme impact given that there really is an impact.

4 Factors that influence power
SAMPLE SIZE Significance level The variance of the outcome Proportion of subjects assigned to the treatment Effect size Sample size influences power directly: the larger the sample, the greater the power. Other important factors are the significance level, the variance of the outcome variable, the proportion of subjects or units assigned to the treatment, and the effect size (the true magnitude of the impact).

5 Factors that influence power: significance
Significance level (α): maximum acceptable probability of rejecting the null hypothesis when it is actually true (type I error). Usually α = 0.05.

6 Power and significance of a test
This graph illustrates the relationship between alpha and power. If we move the critical value to the right (alpha decreases), power will also decrease, and vice versa.

7 Factors that influence power: Proportion of treated subjects
Remember that in the context of an experimental design we could estimate the impact by fitting the model… Remember that b1 is the programme impact; the variance of its estimate is given by… The proportion of treated subjects influences the variance of the impact estimate and thus the power. The relationship between power and the proportion treated is inverted-U shaped, with maximum power reached when the proportion of treated subjects is 50%, all other things being equal. A frequently asked question is whether it is "mandatory" to have a "50-50" distribution of treated and untreated subjects: it is not; estimates can be obtained even with unequal proportions of treated and untreated subjects, they will just have less power (but this can be compensated for with a larger sample!).
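As a sketch of the standard setup behind these slides (the exact notation of the original formulas is not reproduced here; this assumes a simple two-arm comparison with homoskedastic errors):

Y_i = \beta_0 + \beta_1 T_i + \varepsilon_i, \qquad \operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{N \, P(1-P)}

where T_i is the treatment dummy, N the total sample size, P the proportion of treated units, and \sigma^2 the residual variance of the outcome.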

8 Looking at this equation one can see that the variance of the estimator is at its minimum when half the population is receiving the programme (where P is the proportion treated). In regression analysis, the variance of the beta coefficient is inversely proportional to the variance of the independent variable X. Looking at the formula above you can see why a balanced distribution of treated and untreated translates into more power: the maximum variance of the programme variable P (a dummy variable) is reached at 50% treated (because the variance of a dummy variable is p(1-p), where p is the proportion of 1's).

9 Effect size To achieve a power k the effect size should be…
Where b is the effect size, t1-k and ta represent the t-values corresponding to the desired power and significance levels, respectively, and SE(b) is the standard error of the effect estimate. The larger the effect size, the easier it will be to detect it; small effects will translate into less power, all other things being equal.
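A sketch of the relation behind this slide (notation assumed, following the conventional power calculation): to achieve power k at significance level α, the effect must satisfy

b \;\ge\; (t_{1-k} + t_{\alpha}) \cdot \operatorname{SE}(\hat{b})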

10 Effect size Minimum detectable effect size
Larger sample sizes translate into a smaller MDE… Therefore, if the true effect is small, we will require large sample sizes to achieve "decent" power. Reworking the equations in the previous slides, we come to realise that evaluation studies have a minimum detectable effect size, given a sample size, a proportion of treated, and specified power and significance levels. This minimum detectable effect size is a statistical quantity that does not necessarily correspond to a useful or meaningful effect size in biological or sociological terms (for example, an increase of 0.5 cm in average children's height may be statistically significant but not meaningful in biological terms). We should plan our evaluation in a way that ensures that the MDE will be meaningful; this is normally done by taking into account the stated goals of the programme in terms of the outcome of interest.
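One common way to write the minimum detectable effect for an individually randomised design (a sketch following the Duflo et al. toolkit cited below; notation assumed) is

MDE = (t_{1-k} + t_{\alpha}) \sqrt{\frac{1}{P(1-P)}} \sqrt{\frac{\sigma^2}{N}}

which makes explicit that the MDE shrinks with the square root of the sample size N and grows as the allocation P moves away from 0.5.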

11 Proportion of treated Optimal treatment allocation is 50% (50% treated, 50% untreated), but… If there is a budget constraint (as usual), the optimal treated group would be smaller. The optimal allocation for treatment in the presence of a budget constraint will be… Although, statistically speaking, the optimal proportion of treated is 50%, sometimes budget constraints make this unfeasible. The optimal allocation, in financial terms, will vary according to the comparative cost of adding either more treated or more untreated units to the sample.

12 Proportion of treated Where P is the proportion of treated units, Cc is the cost per comparison unit and Ct is the cost per treated unit… This formula, suggested in Duflo et al. (Duflo, E., R. Glennerster, and M. Kremer (2006), "Using Randomization in Development Economics Research: A Toolkit"), gives the optimal treatment allocation considering costs.
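A sketch of the standard result (notation assumed, matching the slide's Cc and Ct): under a fixed budget, the variance of the impact estimate is minimised when

\frac{P}{1-P} = \sqrt{\frac{C_c}{C_t}}

so the cheaper arm receives the larger share of the sample.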

13 Special Situations Grouped errors
This occurs when subjects are not independent, e.g. PROGRESA: randomisation occurred at the village level (although individual information is available), so outcomes may be correlated within the village. This causes a design effect which increases the variance of the estimators; thus the minimum detectable effect will be larger. Practical advice: it is better to increase the number of clusters than the number of individuals within the clusters! We will now cover other practical issues concerning experimental designs. Grouped errors (as opposed to independent, identically distributed errors) occur when we have cluster sampling. This causes a design effect that must be taken into account in the analysis and that translates into larger standard errors of the impact estimates (estimates are comparatively less precise). Power therefore decreases in comparison to an analysis performed on a simple random sample. A question that often arises is whether, in this case, one should compensate by increasing the number of clusters or the number of units within each cluster. The answer is that it is much better to have more clusters, even with a comparatively small within-cluster sample size; the reason is that units (i.e. individuals or households) tend to be highly correlated within clusters (such as villages), so adding more units would not increase the explanatory power of our models (it would be adding "more of the same thing").
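A standard approximation for this penalty (a sketch, assuming equal cluster sizes m and intra-cluster correlation \rho):

DEFF = 1 + (m - 1)\,\rho, \qquad MDE_{clustered} = \sqrt{DEFF} \times MDE_{simple\ random\ sample}

so, once \rho is appreciable, adding clusters reduces the MDE much faster than adding individuals within clusters.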

14 Imperfect compliance We talked about how sometimes the randomised assignment only influences the probability that someone is treated; that is, compliance may not be perfect. The same situation occurs in clinical trials: not everybody takes their medicine! Also, some people in the control group may actually receive the treatment. These possibilities should be taken into account when doing the analysis (instrumental-variables models can be a solution). One issue that often arises is that people sometimes do not comply with the programme: some people will not take up the programme even though they are assigned to it, and the opposite may happen as well. This can be problematic if the reasons for not complying (or for receiving the treatment while assigned to the control group) are somehow correlated with the potential outcome, and it can lead to bias in the impact estimates. This can be handled with instrumental variables.
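A sketch of the usual instrumental-variables (Wald) estimator in this setting, with the random assignment Z as the instrument for the treatment actually received T (notation assumed):

\hat{\beta}_{IV} = \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[T \mid Z=1] - E[T \mid Z=0]}

that is, the intention-to-treat effect rescaled by the difference in take-up between the two assignment groups.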

15 Power and sample size: what we need to know
Remember what you need to know: the mean and variance of the outcome in the absence of the experiment (from previous research, a baseline survey, or an educated guess); for clustered designs, the intra-class correlation; and the magnitude of the expected impact. In summary, information on the quantities listed here is crucial for a good power calculation.
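As an illustration only (a minimal sketch with hypothetical numbers; statsmodels' TTestIndPower is just one of several tools for this), the required sample size per arm of a two-arm individually randomised design could be computed as:

# Hypothetical power calculation for a two-arm, individually randomised design.
from statsmodels.stats.power import TTestIndPower

expected_impact = 0.5                        # assumed programme impact, in outcome units
outcome_sd = 2.0                             # assumed standard deviation of the outcome
effect_size = expected_impact / outcome_sd   # standardised effect (Cohen's d = 0.25)

analysis = TTestIndPower()
n_per_arm = analysis.solve_power(effect_size=effect_size,
                                 alpha=0.05,   # significance level
                                 power=0.80,   # desired power
                                 ratio=1.0)    # equal allocation (P = 0.5)
print(f"Required sample size per arm: {n_per_arm:.0f}")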

16 Analytical Issues: Control for Covariates
In a randomised evaluation, controlling for baseline covariates does not change the expected value of the effect estimator, but… it may decrease its variance (which is good!). If, however, you control for covariates that do not really affect the outcome, you will increase the variance rather than reduce it. Please DO NOT control for covariates affected by the treatment (bias will arise)! We'll now cover other analytical considerations of experimental designs. A frequently asked question relates to adjustment for covariates: although in terms of bias adjustment will not make much difference (confounding is already controlled by the random assignment of the treatment), adjustment for covariates will decrease the uncertainty in the estimates, including of course our impact estimate, making it more precise. However, one must be careful to include only covariates that are true determinants of the outcome (this comment is valid for all statistical models), and not irrelevant variables, as the latter introduce noise into the estimates and decrease precision. The guide for including only the necessary covariates will be, as always, our conceptual framework.
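A minimal simulated sketch of this point (all names and numbers are hypothetical): the coefficient on the treatment dummy has the same expected value with or without the baseline covariate, but its standard error is smaller when a genuine determinant of the outcome is included.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)                                   # baseline covariate (true determinant of y)
treat = rng.integers(0, 2, size=n)                       # random assignment
y = 1.0 + 0.5 * treat + 2.0 * x + rng.normal(size=n)     # true impact = 0.5
df = pd.DataFrame({"y": y, "treat": treat, "x": x})

unadjusted = smf.ols("y ~ treat", data=df).fit()
adjusted = smf.ols("y ~ treat + x", data=df).fit()
print(unadjusted.bse["treat"], adjusted.bse["treat"])    # adjusted standard error is smaller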

17 Analytical Issues: Control for Covariates
A usual variable to adjust for is the outcome level at baseline. This is not recommended. Variables to adjust for must be selected BEFORE the analysis to avoid specification searching (data mining). Adjusting for baseline levels of the outcome is a frequent practice, although we cannot recommend it. The reason is that the baseline outcome will be, almost by definition, endogenous (it will share unobservables with follow-up levels of the same outcome, and this endogeneity can be transmitted to other estimates in the model). A better way to control for baseline differences in the outcome is the use of difference-in-differences models, which will be covered later. It is strongly recommended that the variables to adjust for are selected beforehand, on theoretical grounds; this avoids suspicion of "specification searching", the (unethical) practice of looking for the "right" set of covariates, often to achieve the desired result.

18 Analytical Issues: Stratification
Randomization could be done within strata (blocks) of the population. Stratification ensures that the variables defining these strata are equally distributed across treatment assignment groups. The most extreme case is when pairs of units are matched on observable characteristics and one unit in each pair is randomly assigned to treatment (e.g. Mexico's People's Health Insurance). This procedure tends to diminish the variance of the impact estimators and allows evaluators to look for heterogeneous effects. Stratification, or block randomization, is generally regarded as good practice as it increases the precision of the impact estimate.
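A minimal sketch of randomization within strata (all column names and stratum labels are hypothetical):

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
units = pd.DataFrame({
    "unit_id": range(12),
    "stratum": ["rural"] * 6 + ["urban"] * 6,   # blocking variable
})

def randomize_block(block):
    # Assign exactly half of each stratum to treatment, at random.
    n = len(block)
    return rng.permutation([1] * (n // 2) + [0] * (n - n // 2))

units["treat"] = units.groupby("stratum")["unit_id"].transform(randomize_block)
print(units.groupby("stratum")["treat"].mean())   # exactly 0.5 within each stratum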

19 Other Practical Issues
Level of randomisation: individual, community, school. Possible spillover: the randomisation level should capture this effect. Example: deworming medication. Even though this issue depends on the context, group randomisation is generally preferable. Another interesting discussion deals with the level of randomisation. As general advice, it is preferable to randomise in a way that minimises spillovers. For example, suppose we randomly allocate some households within the same village to receive treatment for intestinal worms: it is likely that at least some families will share the medication with neighbours or relatives who were meant to be controls, causing a spillover and an underestimation of the treatment effect. A better way to randomise would be to randomly assign entire villages to receive the treatment, which largely reduces the chance of spillover to control villages.

20 Cross-cutting designs
This allows us to compare different components of a programme. Suppose a programme has components A and B; randomise to obtain the comparison groups: pure control (no treatment), just A, just B, and A & B. This is useful to find out which components of the programme need to be scaled up or modified. If the programme to be evaluated has several components, groups of units can be randomised to receive one component, the other, or both. As long as one group is unexposed to either component, the separate effect of each component can be evaluated, as well as potential synergistic effects (interactions).
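A sketch of the corresponding analysis model (notation assumed), with dummies A and B for the two components:

Y = \beta_0 + \beta_1 A + \beta_2 B + \beta_3 (A \times B) + \varepsilon

where \beta_1 and \beta_2 capture the separate effects of each component and \beta_3 their interaction (synergy).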

21 Baseline surveys In theory, unnecessary in randomised designs; however… Controlling for baseline variables may reduce variation in the outcome, the required sample size, and cost. It also makes it possible to examine interactions with baseline variables. In theory, baseline surveys are unnecessary in experiments, because variables are expected to be balanced between groups. In practice, though, baseline surveys are almost always done. One reason is that they allow you to check balance instead of trusting to luck; another is that they allow you to control for baseline covariates to increase the precision of the estimates, as we have previously explained.

22 Use of administrative data
Information collected by programme implementers can sometimes be used instead of information collected ad hoc for the evaluation. It is important that the quality of the administrative data is the same in the treated and untreated groups. Administrative data are often useful; their main limitations are poor data quality and the scarcity of data on covariates.

23 More analytical Issues
What if compliance with the programme is not perfect? Do not CUT the sample! This would introduce the selection bias we were trying to avoid by randomising. Intention-to-treat analysis is still useful. Instrumental-variables regression could be helpful (we will discuss this technique in a couple of days). People often do not fully comply with programmes; this can introduce bias if the reasons for not complying are related to the outcome (causing endogeneity). Instrumental variables, using the random assignment as the instrument, can be a powerful solution.
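A minimal simulated sketch (all numbers hypothetical) of how the intention-to-treat effect and the instrumental-variables (Wald) estimate relate under imperfect compliance:

import numpy as np

rng = np.random.default_rng(2)
n = 5000
z = rng.integers(0, 2, size=n)                          # random assignment (instrument)
takeup_prob = np.where(z == 1, 0.8, 0.1)                # imperfect compliance in both arms
t = (rng.random(n) < takeup_prob).astype(int)           # treatment actually received
y = 1.0 + 2.0 * t + rng.normal(size=n)                  # true effect of treatment = 2.0

itt = y[z == 1].mean() - y[z == 0].mean()               # intention-to-treat effect
first_stage = t[z == 1].mean() - t[z == 0].mean()       # difference in take-up
iv_estimate = itt / first_stage                         # Wald / IV estimate
print(itt, iv_estimate)                                 # roughly 1.4 and 2.0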

24 Attrition Random attrition (losses to follow-up causing missing data) only reduces statistical power. Frequently, however, attrition is not random. Example: those who benefit less from a programme abandon the study, leading to an overestimation of the programme effect. Attrition rates that are evenly balanced between the treatment and control groups are reassuring, but… it is possible that those who are lost are different in each group, still causing bias. Thus, it is difficult to solve… Attrition is another potential problem, especially if the reason for the loss is related to the potential outcome of the subjects. For example, people who are not benefitting from a cash transfer programme may be prone to migrate, leaving behind only those who are receiving a benefit; this situation would lead to an overestimation of the impact of the programme. This is difficult to solve; the best approach is to avoid losses altogether. If this is not possible, and especially if attrition is large, Heckman selection models might be helpful.

25 Attrition Possible solutions:
Tracking individuals even if they leave the programme. A random sample of those lost to follow-up is good. If this is not possible: identify systematic differences, in terms of baseline variables, between those lost and those who remained, and adjust for those differences in the statistical analysis; approaches such as Heckman's selection models can be tried.

26 Heterogeneous impacts
Ideally, researchers must specify before the experiment whether heterogeneous impacts are to be tested in some subgroups of the population. Sometimes apparently unexpected heterogeneous effects appear. Another issue to bear in mind is the possibility of heterogeneous impacts. Ideally, the possibility of heterogeneous impacts is considered from the evaluation plan onwards; however, sometimes unexpected effects may appear and be reported. This must be clearly stated in evaluation reports to avoid suspicion of specification searching.

27 Normally this should not be reported as part of the main evaluation (suspicion of data mining).
It is recommended that these results be clearly reported as ex post findings and that further evaluations, designed specifically to confirm these hypotheses, be suggested.


