2 Let’s Look at Our Example Research Question
How do UF COP pharmacy students who only watch videostreamed lectures differ from those who attend class lectures (and also have access to videostreamed lectures) in terms of learning outcomes?
Population: Who do you want these study results to generalize to?
3 Population
The group you wish to generalize to is your population. There are two types:
  Theoretical population: in our example, this would be all pharmacy students in the US.
  Accessible population: in our example, this would be all COP pharmacy students.
4 Sampling
Target population or sampling frame: all members of the accessible population from whom you can draw your sample.
Sample: the group of people you select to be in the study; a subgroup of the target population. This is not necessarily the group that is actually in your study.
5 Sampling
How you select your sample: sampling strategies
  Probability sampling: simple random sampling, stratified sampling, multistage cluster sampling
  Nonprobability sampling: convenience sampling, snowball sampling
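To make the difference between two of the probability strategies concrete, here is a minimal sketch. The sampling frame, the campus labels, and the function names are all hypothetical and only serve as illustration; they are not taken from the study.

```python
import random
from collections import defaultdict

def simple_random_sample(frame, n, seed=0):
    """Draw n units from the sampling frame; every unit is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def stratified_sample(frame, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum (proportionate allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical frame: 300 students spread evenly over three made-up campuses.
campuses = ["campus_A", "campus_B", "campus_C"]
frame = [(f"student{i}", campuses[i % 3]) for i in range(300)]

srs = simple_random_sample(frame, 30)
strat = stratified_sample(frame, stratum_of=lambda s: s[1], fraction=0.10)
```

A simple random sample of 30 may over- or under-represent a campus by chance, while the stratified draw takes 10 students from each campus by construction.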
6 Sample Size
Select as large a sample as possible from your population. With a large sample there is less potential for the sample to differ from the population.
Sampling error: the difference between the sample estimate and the true population value (example: exam score).
7 Sample Size
Sample size formulas and tables can be used. Factors that are considered include confidence in the statistical test and sampling error.
See Appendix B in Creswell (pg 630):
  Sampling error formula: used to determine sample size for a survey.
  Power analysis formula: used to determine group size in an experimental study.
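As a rough illustration of the two kinds of formulas, the sketch below uses common textbook normal approximations. These are assumptions for illustration and are not necessarily the exact formulas or tables in Creswell's Appendix B; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def survey_sample_size(margin_of_error, confidence=0.95, proportion=0.5):
    """Sample size for estimating a proportion within +/- margin_of_error.

    Assumes a large population and simple random sampling.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2)

def group_size_for_power(effect_size_d, alpha=0.05, power=0.80):
    """Per-group size for a two-sided, two-group mean comparison.

    Normal approximation; effect_size_d is the standardized mean difference.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size_d ** 2)

n_survey = survey_sample_size(0.05)   # 5-point margin, 95% confidence
n_group = group_size_for_power(0.5)   # medium effect, 80% power
```

Under these assumptions the survey needs 385 respondents and the experiment needs 63 participants per group; exact t-based tables give slightly larger group sizes.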
8 Back to Our Example
How do UF COP pharmacy students who only watch videostreamed lectures differ from those who attend class lectures (and also have access to videostreamed lectures) in terms of learning outcomes?
What is our theoretical population?
What is our accessible population?
What sampling strategy should we use?
9 Important Concept: Random sampling vs. random assignment
We have talked about random sampling in this session. Random sampling is not the same as random assignment.
Random sampling is used to select individuals from the population who will be in the sample.
Random assignment is used in an experimental design to assign individuals to groups.
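The two steps can be sketched separately in code. The population size, sample size, and group names below are hypothetical values chosen for illustration.

```python
import random

rng = random.Random(42)

# Hypothetical accessible population of 500 students.
population = [f"student{i}" for i in range(500)]

# Random sampling: decides WHO is in the study.
sample = rng.sample(population, 60)

# Random assignment: decides WHICH GROUP each sampled person joins.
shuffled = sample[:]
rng.shuffle(shuffled)
video_only = shuffled[:30]
class_plus_video = shuffled[30:]
```

Random sampling supports generalizing from the sample to the population; random assignment supports comparing the two groups. A study can use either one without the other.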
10 VALIDITY
Lou Ann Cooper, PhD
Director of Program Evaluation and Medical Education Research
University of Florida College of Medicine
11 INTRODUCTION
Both research and evaluation include:
  Design: how the study is conducted
  Instruments: how data is collected
  Analysis of the data to make inferences about the effect of a treatment or intervention
Each of these components can be affected by bias.
Bias attributable to the investigator, the sample, the method, or the instrument may not be completely avoidable in every instance, but scientists want to know the possible sources of bias and how bias is likely to influence evidence. The presence of bias, while often not avoidable and indeed inherent to certain study designs, can limit, to varying degrees, the relevance and applicability of a given study.
It is important to address the issue of bias early in the design phase, to ensure appropriate design for the study hypothesis and to outline procedures for data collection and analysis. Whereas some bias (e.g., confounding) can be adjusted or corrected for in the statistical analysis, much cannot and may render a study's results invalid.
When faced with a claim that something is true, scientists respond by asking what evidence supports it. But scientific evidence can be biased in how the data are interpreted, in the recording or reporting of the data, or even in the choice of what data to consider in the first place. One safeguard against undetected bias in an area of study is to have many different investigators or groups of investigators working in it.
12 INTRODUCTION
Two types of error in research:
  Random error, due to random variation in participants’ responses at measurement. Inferential statistics, i.e., the p-value and 95% confidence interval, quantify random error and allow us to draw conclusions based on research data.
  Systematic error, or bias.
13 BIAS: DEFINITION
Deviations of results (or inferences) from the truth, or processes leading to such deviation.
Any trend in the selection of subjects, data collection, analysis, interpretation, publication or review of data that can lead to conclusions that are systematically different from the truth.
Systematic deviation from the truth that distorts the results of research.
Bias is a form of systematic error that can affect scientific investigations and distort the measurement process. A biased study loses validity in relation to the degree of the bias. While some study designs are more prone to bias, its presence is universal. It is difficult or even impossible to completely eliminate bias. In the process of attempting to do so, new bias may be introduced or a study may be rendered less generalizable. Therefore, the goals are to minimize bias and for both investigators and readers to comprehend its residual effects, limiting misinterpretation and misuse of data. Numerous forms of bias have been described, and the terminology can be confusing, overlapping, and specific to a medical specialty.
14 BIAS
Bias is a form of systematic error that can affect scientific investigations and distort the measurement process.
Bias is primarily a function of study design and execution, not of results, and should be addressed early in the study planning stages.
Not all bias can be controlled or eliminated; attempting to do so may limit usefulness and generalizability.
Awareness of the presence of bias will allow more meaningful scrutiny of the results and conclusions.
A biased study loses validity, and bias is a common reason for invalid research.
15 POTENTIAL BIASES IN RESEARCH AND EVALUATION
Study design: issues related to internal validity and external validity
Instrument design: issues related to construct validity
Data analysis: issues related to statistical conclusion validity
16 VALIDITY
Validity is discussed and applied based on two complementary conceptualizations in education and psychology:
  Test validity: the degree to which a test measures what it was designed to measure.
  Experimental validity: the degree to which a study supports the intended conclusion drawn from the results.
17 FOUR TYPES OF VALIDITY QUESTIONS
  External: Can we generalize to other persons, places, times?
  Construct: Can we generalize to the constructs?
  Internal: Is the relationship causal?
  Conclusion: Is there a relationship between cause and effect?
18 CONCLUSION VALIDITY
Conclusion validity is the degree to which conclusions we reach about relationships are reasonable, credible or believable. It is relevant for both quantitative and qualitative research studies. Is there a relationship in your data or not?
In evaluating any experiment, three decisions about covariation have to be made with the sample data on hand:
  1. Is the study sensitive enough to permit reasonable statements about covariation? (Statistical power/sample size)
  2. If it is sensitive enough, is there any reasonable evidence from which to infer that the presumed cause and effect covary? (Given a specified alpha level and the obtained variances)
  3. If there is such evidence, how strongly do the two variables covary? (Effect sizes)
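For the third question, one widely used effect size for two groups is the standardized mean difference (Cohen's d). The sketch below is one possible illustration; the exam scores are made-up numbers, not study data.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Made-up exam scores for two hypothetical groups of five students.
class_scores = [78, 85, 90, 72, 88]
video_scores = [70, 75, 80, 68, 77]

d = cohens_d(class_scores, video_scores)
```

Unlike a p-value, d does not grow just because the sample is larger, which is why it answers "how strongly" rather than "whether".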
19 STATISTICAL CONCLUSION VALIDITY
Basing conclusions on proper use of statistics:
  Reliability of measures
  Reliability of implementation
  Type I errors and statistical significance
  Type II errors and statistical power
  Fallacies of aggregation
Correlational studies are plagued with causal ambiguity: correlation does not imply causation.
20 STATISTICAL CONCLUSION VALIDITY
  Interaction and non-linearity
  Random irrelevancies in the experimental setting
  Random heterogeneity of respondents
Random irrelevancies: features of an experimental setting other than the treatment will undoubtedly affect scores on the dependent variable and will inflate error variance. Control by choosing settings free from extraneous sources of variation or by choosing experimental procedures that force participants’ attention on the treatment and lower the salience of environmental variables. This is very difficult to do. Measure the anticipated sources of extraneous variance that are common to all the treatment groups as validly as possible. Monitor and measure variables that add to the error variance so they can be included and controlled for in the statistical analysis.
Random heterogeneity of participants: respondents in any
21 VIOLATED ASSUMPTIONS OF STATISTICAL TESTS
The particular assumptions of a statistical test must be met if the results of the analysis are to be meaningfully interpreted.
  Levels of measurement
  Example: Analysis of Variance (ANOVA)
The particular assumptions of a chosen statistical test have to be known and, when possible, tested in the data at hand.
22 LEVELS OF MEASUREMENT
A hierarchy is implied in the idea of level of measurement. At lower levels, assumptions tend to be less restrictive and data analyses tend to be less sensitive. At each level up the hierarchy, the current level includes all of the qualities of the one below it and adds something new. In general, it is desirable to have a higher level of measurement (interval or ratio) rather than a lower one (nominal or ordinal).
  Nominal: categorical; values uniquely name the attribute; no ordering implied
  Ordinal: attributes can be rank-ordered; distances between values do not have meaning
  Interval: the interval between values is interpretable; averages can be computed
  Ratio: zero point is meaningful; a meaningful fraction/ratio can be constructed
23 STATISTICAL ANALYSIS AND LEVEL OF MEASUREMENT: ANALYSIS OF VARIANCE ASSUMPTIONS
  Independence of cases: each person contributes only one score to the analysis.
  Normality: in each of the groups, the data are continuous and normally distributed. The Kolmogorov-Smirnov or the Shapiro-Wilk test may be used to check normality.
  Equal variances (homoscedasticity): the variance of data in the groups should be the same. Levene's test for homogeneity of variances is typically used to check homoscedasticity.
If an assumption is violated, alternatives include the Kruskal-Wallis test, a nonparametric alternative that does not rely on an assumption of normality, or a method such as Welch’s ANOVA that is robust to violations of the equal variances assumption.
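To show what the ANOVA F statistic compares, here is a minimal pure-Python sketch with a crude variance-ratio screen. The data are made up, the function names are hypothetical, and the variance ratio is only a rough look at the equal-variances assumption; it is not Levene's test, and in practice a statistics package would supply the formal tests and p-values.

```python
from statistics import mean, variance

def one_way_anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def variance_ratio(groups):
    """Largest group variance divided by the smallest (1.0 means equal)."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)

# Made-up scores for three hypothetical groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
ratio = variance_ratio(groups)
```

Because the within-group mean square sits in the denominator, unequal or inflated group variances directly change F, which is why the assumptions above matter for interpretation.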
24 RELIABILITY
Measures (tests and scales) of low reliability, conceptualized as stability or test-retest, cannot be depended upon to register true change. Unreliability inflates standard errors of estimates, and standard errors play a crucial role in inferring statistical differences between groups. Some ways to control for unreliability are:
  Using longer tests in which items have high intercorrelations.
  Using more aggregated units, i.e., groups instead of individuals, since a group mean will be more stable.
  Where justified, using corrections for attenuation.
Reliability of treatment implementation: the way a treatment is implemented may differ from one person to another if different persons are responsible for implementing the treatment. There may also be differences from occasion to occasion even when the same researcher implements the treatment. When treatments/procedures are not administered in a standard fashion, this lack of standardization both within and between persons will inflate error variance and decrease the chance of obtaining true differences.
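Two of these remedies have standard classical test theory formulas, sketched below as an illustration. The Spearman-Brown formula assumes the added items behave like the existing ones, and the reliability values in the example are made up.

```python
import math

def spearman_brown(reliability, length_factor):
    """Predicted reliability after lengthening a test by length_factor.

    Assumes the added items are parallel to the existing items.
    """
    return (length_factor * reliability
            / (1 + (length_factor - 1) * reliability))

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated correlation between true scores, given the observed
    correlation r_xy and the reliabilities r_xx and r_yy of the two measures."""
    return r_xy / math.sqrt(r_xx * r_yy)

doubled = spearman_brown(0.60, 2)                      # double the test length
disattenuated = correct_for_attenuation(0.40, 0.80, 0.50)
```

With these made-up values, doubling the test raises predicted reliability from 0.60 to 0.75, and an observed correlation of 0.40 corresponds to an estimated true-score correlation of about 0.63.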
25 STATISTICAL DECISION VS. TRUE POPULATION STATUS
We know the probability of a Type I error because we set alpha; by convention this is usually .05. The quantity 1 – alpha is the probability of making a correct decision when the null hypothesis is true, also referred to as the confidence level. The probability of making a Type II error is beta. Power is the probability of correctly rejecting the null hypothesis, that is, the probability of rejecting the null hypothesis when it is really false. Power is another way of talking about Type II errors: it is defined in terms of beta as 1 – beta.
26 TYPE I ERRORS AND STATISTICAL SIGNIFICANCE
A Type I error is made when a researcher concludes that there is a relationship and there really isn’t (false positive).
If the researcher rejects H0 because p ≤ .05, ask:
  If data are from a random sample, is the significance level appropriate?
  Are significance tests applied to a priori hypotheses?
Fishing and the error rate problem: the probability of making at least one Type I error in a given experiment increases with the number of comparisons made in that experiment. Example: analysis of scales item by item.
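The error rate problem can be quantified with a short sketch. It assumes the comparisons are independent, which item-by-item analyses of one scale usually are not, so treat the numbers as an illustration. The Bonferroni adjustment shown is one common remedy and is not named in the slides.

```python
def familywise_error_rate(alpha, comparisons):
    """Chance of at least one Type I error across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

def bonferroni_alpha(alpha, comparisons):
    """Per-comparison alpha that keeps the familywise rate at or below alpha."""
    return alpha / comparisons

# Example: testing each item of a 20-item scale separately at alpha = .05.
fwer_20 = familywise_error_rate(0.05, 20)
adjusted = bonferroni_alpha(0.05, 20)
```

With 20 independent comparisons at alpha = .05, the chance of at least one false positive is about 64%; testing each at .0025 brings the familywise rate back under .05.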
27 TYPE II ERRORS AND STATISTICAL POWER
A Type II error is made when a researcher concludes that there is not a relationship and there really is (false negative).
If the researcher fails to reject H0 because p > .05, ask:
  Has the researcher used statistical procedures of adequate power?
  Does failure to reject H0 merely reflect a small sample size?
The lower the power of the statistical test, the lower the likelihood of detecting an effect that does in fact exist.
28 FACTORS THAT INFLUENCE POWER AND STATISTICAL INFERENCE
  Alpha level
  Effect size
  Directional vs. non-directional test
  Sample size
  Unreliable measures
  Violating the assumptions of a statistical test
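The first four factors can be explored with a rough power calculation. This sketch uses a normal approximation for a two-group mean comparison and ignores the negligible opposite tail in the two-sided case; the function name and the example values are assumptions for illustration.

```python
from statistics import NormalDist

def approximate_power(effect_size_d, n_per_group, alpha=0.05, two_sided=True):
    """Approximate power to detect a standardized mean difference."""
    z = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_crit = z.inv_cdf(1 - tail)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size_d * (n_per_group / 2) ** 0.5
    return 1 - z.cdf(z_crit - shift)

baseline = approximate_power(0.5, 64)                       # roughly 0.8
bigger_sample = approximate_power(0.5, 128)
smaller_effect = approximate_power(0.2, 64)
stricter_alpha = approximate_power(0.5, 64, alpha=0.01)
directional = approximate_power(0.5, 64, two_sided=False)
```

Power rises with a larger sample, a larger effect, a more lenient alpha, or a directional test. Unreliable measures act by shrinking the observable effect size.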
29 RANDOM IRRELEVANCIES
Features of the experimental setting other than the treatment affect scores on the dependent variable.
Controlled by choosing settings free from extraneous sources of variation.
Measure anticipated sources of variance to include in the statistical analysis.
30 RANDOM HETEROGENEITY OF RESPONDENTS
Participants can differ on factors that are correlated with the major dependent variables.
Certain respondents will be more affected by the treatment than others.
Minimized by:
  Blocking variables and covariates
  Within-subjects designs
31 STRATEGIES TO REDUCE ERROR TERMS
  Subjects as own control
  Homogeneous samples
  Pretest measures on the same scales used for measuring the effect
  Matching on variables correlated with the post-test
  Effects of other variables correlated with the post-test used as covariates
  Increase the reliability of the dependent variable measures
32 STRATEGIES TO REDUCE ERROR TERMS
Estimates of the desired magnitude of a treatment effect should be elicited before research begins.
The absolute magnitude of the treatment effect should be presented so readers can infer whether a statistically reliable effect is practically significant.
33 INTERNAL VALIDITY
Internal validity has to do with defending against sources of bias arising in a research design.
To what degree is the study designed such that we can infer that the educational intervention caused the measured effect?
An internally valid study will minimize the influence of extraneous variables.
Example: Did participation in a series of webinars on TB in children change the practice of physicians?
Internal validity has to do with the true causes of the outcomes observed in your study. According to Campbell and Stanley, internal validity is the basic minimum without which an experiment is uninterpretable. Strong internal validity means that you have not only reliable measures of your independent and dependent variables but also a strong justification that causally links your independent variables to your dependent variables. At the same time, you are able to rule out extraneous variables or alternative, often unanticipated, causes for your dependent variables. Internal validity is about causal control.
34 THREATS TO INTERNAL VALIDITY
History, maturation, testing, instrumentation, statistical regression, mortality, selection, interactions with selection
35 INTERNAL VALIDITY: THREATS IN SINGLE GROUP REPEATED MEASURES DESIGNS
  History
  Maturation
  Testing
  Instrumentation
  Mortality
  Regression
36 THREATS TO INTERNAL VALIDITY: HISTORY
The observed effects may be due to or be confounded with nontreatment events occurring between the pretest and the post-test.
History is a threat to conclusions drawn from longitudinal studies.
Greater time period between measurements = more risk of a history effect.
History is not a threat in cross-sectional designs conducted at one point in time.
Some kind of event occurred during the study period, and it is reactions to this event that caused the observed outcomes. Sometimes this is a medical event or a political or historical event. In laboratory research the history effect is controlled by isolating respondents from outside influences or by choosing dependent variables that could not plausibly have been affected by outside forces. Unfortunately these techniques are not available to applied researchers.
37 THREATS TO INTERNAL VALIDITY: MATURATION
Invalid inferences may be made when the maturation of participants between measurements has an effect and this maturation is not the research interest.
Internal (physical or psychological) changes in participants unrelated to the independent variable: older, wiser, stronger, more experienced.
Maturation effects are especially important for studies using children and youth. For example, some studies have found that most college students pull out of a depression within six months even if they receive no treatment.
38 THREATS TO INTERNAL VALIDITY: TESTING
Reactivity as a result of testing.
The effects of taking a test on the outcomes of a second test: practice, learning.
Improved scores on the second administration of a test can be expected even in the absence of intervention due to familiarity.
Giving the pretest may itself affect the outcomes of the second test. Part of a student’s performance on assessment tests depends on familiarity with the format, and it has been shown that IQ tests taken a second time result in a 3 – 5 point increase from the first time. In the social sciences the process of measuring may change that which is being measured. The reactive effect occurs when the testing process itself leads to the change in behavior rather than being a passive record of behavior.
39 THREATS TO INTERNAL VALIDITY: INSTRUMENTATION
Changes in instruments, observers or scorers which may produce changes in outcomes.
Observers/raters, through experience, become more adept at their task.
Ceiling and floor effects.
Longitudinal studies.
40 THREATS TO INTERNAL VALIDITY: STATISTICAL REGRESSION
Test-retest scores tend to drift systematically to the mean rather than remain stable or become more extreme.
Regression effects may obscure treatment effects or developmental changes.
Most problematic when participants are selected because they are extreme on the classification variable of interest.
Regression towards the mean is especially likely when you study extreme groups. Persons with extreme scores will often fall back to the average, or regress to the mean, on a second administration.
Statistical regression:
  Operates to increase gain scores among low pretest scorers, since this group’s pretest scores are more likely to have been depressed by error (students scoring at the bottom of the class typically improve their scores at least a little when they retake the test).
  Operates to decrease change scores among persons with high pretest scores, since their pretest scores are likely to have been inflated by error (students with perfect scores may miss an item the second time around).
  Does not affect difference scores among scorers at the center of the pretest distribution, since this group is likely to contain as many units whose pretest scores are inflated by error as units whose pretest scores are deflated by it.
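The three patterns above can be reproduced with a small simulation in which no treatment is given at all. The score distribution and error sizes are arbitrary assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(1)
n = 5000

# Each simulated student has a stable true score; both tests add fresh error.
true_scores = [rng.gauss(70, 10) for _ in range(n)]
pretest = [t + rng.gauss(0, 8) for t in true_scores]
posttest = [t + rng.gauss(0, 8) for t in true_scores]  # no treatment given

# Rank students by pretest score and pick the bottom, middle and top 10%.
ranked = sorted(range(n), key=lambda i: pretest[i])
low, middle, high = ranked[:500], ranked[2250:2750], ranked[-500:]

def mean_gain(indices):
    """Average posttest-minus-pretest change for the chosen students."""
    return mean(posttest[i] - pretest[i] for i in indices)

gain_low, gain_mid, gain_high = mean_gain(low), mean_gain(middle), mean_gain(high)
```

The lowest pretest scorers gain on average, the highest lose, and the middle group barely changes, even though nothing happened between the two tests. A program given only to the low scorers would appear to work.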
41 THREATS TO INTERNAL VALIDITY: MORTALITY
Differences in drop-out rates/attrition across conditions of the experiment.
Makes “before” and “after” samples not comparable.
This selection artifact may become operative in spite of random assignment.
Major threat in longitudinal studies.
When subjects discontinue the study and this occurs more in certain conditions than others, we do not know how to causally interpret the results because we don’t know how subjects who discontinued participation differed from those who completed it.
42 INTERNAL VALIDITY: MULTIPLE GROUP THREATS
Selection
Interactions with selection:
  Selection-History
  Selection-Maturation
  Selection-Testing
  Selection-Instrumentation
  Selection-Mortality
  Selection-Regression
Selection bias is a threat to internal validity that can occur when nonrandom procedures are used to assign participants to treatments/groups or when random assignment fails to balance out differences among subjects across the different conditions of the experiment. When subjects can select their own treatments, we do not know whether the intervention or a pre-existing factor of the subject caused the outcomes we observed.
Selection interactions are a family of threats to internal validity produced when a selection threat combines with one or more of the other threats to internal validity. When a selection threat is already present, other threats can affect some experimental groups but not others.
Selection-history can result from the various treatment groups coming from different settings such that each group experiences a unique local history that might affect outcome variables.
Selection-maturation results when experimental groups are maturing at different speeds.
Selection-instrumentation occurs when different groups score at different mean positions on a test whose intervals are not equal. Examples are differential ceiling or floor effects, as when an instrument cannot register any more true gain in one of the groups or when more scores from one group than another are clustered at the lower end of the scale.
43 THREATS IN DESIGNS WITH GROUPS: SOCIAL INTERACTION THREATS
  Compensatory equalization of treatments
  Compensatory rivalry
  Resentful demoralization
  Treatment imitation or diffusion
  Unintended treatments
44 EXTERNAL VALIDITY
The extent to which the results of a study can be generalized.
  Population validity: generalizations related to other groups of people.
  Ecological validity: generalizations related to other settings, times, contexts, etc.
External validity addresses the ability to generalize your study to other people and other situations. To have strong external validity you need a probability sample of subjects or respondents drawn using chance methods from a clearly defined population. Ideally you will have a good sample of groups and a sample of measurements and situations.
External validity is the degree to which the conclusions in your study would hold for other persons in other places and at other times. Questions raised include:
  Are findings using a scale to measure the construct of interest consistent across samples?
  To what population does the researcher wish to generalize his/her conclusions?
  Is there something unique about the study’s participants, the place where they live, the setting involved, or the times of the study that would prevent generalization?
When a sample of observations is non-random in unknown ways, the likelihood of external validity is low. This is the case when we use convenience samples.
Ecological validity is not to be confused with the ecological fallacy. Rather, ecological validity has to do with whether subjects are studied in their natural environment or, say, in a lab setting. Ecological validity is the extent to which the results of an experiment can be generalized from the set of environmental conditions created by the researcher to other environmental conditions (settings and conditions).
Explicit description of the experimental treatment (not sufficiently described for others to replicate): If the researcher fails to adequately describe how he or she conducted a study, it is difficult to determine whether the results are applicable to other settings.
45 THREATS TO EXTERNAL VALIDITY
  Pre-test treatment interaction
  Multiple treatment interference
  Interaction of selection and treatment
  Interaction of setting and treatment
  Interaction of history and treatment
  Experimenter effects
Pretest sensitization (pretest sets the stage): A treatment might only work if a pretest is given. Because they have taken a pretest, the subjects may be more sensitive to the treatment. Had they not taken a pretest, the treatment would not have worked.
Multiple-treatment interference (catalyst effect): If a researcher were to apply several treatments, it is difficult to determine how well each of the treatments would work individually. It might be that only the combination of the treatments is effective.
Interaction of selection and treatment: People who agree to participate in a particular study may differ substantially from those who refuse, and so results obtained on the former may not be generalizable to the latter.
Interaction of setting and treatment: Results obtained in one setting may not be obtained in another.
Interaction of history and treatment effect (...to everything there is a time...): Not only should researchers be cautious about generalizing to other populations, they should also be cautious about generalizing to a different time period. As time passes, the conditions under which treatments work change. Also, causal relationships obtained on a particular day (e.g., 9/11 as an extreme example) may not hold up under more mundane circumstances.
Experimenter effect (it only works with this experimenter): The treatment might have worked because of the person implementing it. Given a different person, the treatment might not work at all. Also there are demand effects in which subjects follow orders or cooperate in ways they would be unlikely to do in their daily lives.
46 THREATS TO EXTERNAL VALIDITY
Reactive arrangements:
  Artificial environment
  Hawthorne effect
  Halo effect
  John Henry effect
  Placebo effect
  Participant-researcher interaction
  Novelty effect
Reactivity refers to changes in the subjects’ behavior simply because they know they are being studied. Examples: faking good and social desirability.
Artificial environment (lab experiments with human subjects): Removing subjects from their environment may lead them to display different behavior from their “true” behavior. The greater the difference in cues and influences between the natural environment and the measurement setting, the greater the potential bias.
Hawthorne effect (experimenter expectation; attention causes differences): Subjects perform differently because they know they are being studied. Do the expectations or actions of the investigator contaminate the outcomes? The effect is named after famous studies at Western Electric’s Hawthorne plant, where work productivity improvements were found to reflect researcher attention, not interventions like better lighting. "...External validity of the experiment is jeopardized because the findings might not generalize to a situation in which researchers or others who were involved in the research are not present"
Halo effect: a variation of the Hawthorne effect which occurs when participants know they are part of the experimental group and their belief that they are part of a special group pushes them to improve performance. Halo effect is also used to describe rater effects associated with overrater or underrater error.
John Henry effect: another variation, the opposite of the halo effect. It takes its name from the legendary railroad steel driver who exhausted himself to death. Participants know they are part of the control group and make an extra effort to improve performance.
Placebo effect: participants receiving treatment believe the treatment will have an effect (use a single-blind approach).
Participant-researcher interaction effect: gender issues, age, etc.
Novelty and disruption effect (anything different makes a difference): A treatment may work because it is novel and the subjects respond to the uniqueness rather than the actual treatment. The opposite may also occur: the treatment may not work because it is unique, but given time for the subjects to adjust to it, it might have worked.
47 SELECTING A RESEARCH DESIGN
Lou Ann Cooper, PhD
Director of Program Evaluation and Medical Education Research
University of Florida College of Medicine
48 What If…
We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)…
Our inferences are based on general expectations of what data would have been had X not occurred.
49 PRE-EXPERIMENTAL DESIGNS: One Group Posttest Design
X O
X = Implementation of the treatment
O = Measurement of the participants in the experimental group
Also referred to as ‘One Shot Case Study’.
The first three designs are weak but frequently used designs in the social sciences. While they may be useful for generating ideas, they generally do not permit us to make causal inferences because they fail to rule out a number of plausible alternative interpretations.
The first is the one group, posttest only design. It involves making observations only on persons who have received an intervention and only after they have received it.
Deficiencies:
  1. Lack of pretest leaves us unable to easily infer that the treatment is related to any kind of change.
  2. Lack of a control group. Without this it is difficult to conceptualize the relevant threats and measure them.
50 What If…
We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)…
Pre-experimental: One Group Posttest Design
Our inferences are based on general expectations of what data would have been had X not occurred. Almost all threats are present.
For Discussion: What are the threats to validity?
51 SOURCES OF INVALIDITY
Internal: History (–), Maturation, Testing, Instrumentation, Regression, Mortality, Selection, Selection Interactions
External: Interaction of Testing and X, Interaction of Selection and X (–), Reactive Arrangements, Multiple X Interference
History, maturation, mortality and selection are all threats to internal validity. The interaction of selection and the experimental variable is a threat to external validity.
52 What If…
We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)…
Pre-experimental: One Group Posttest Design
Our inferences are based on general expectations of what data would have been had X not occurred. Almost all threats are present.
53 PRE-EXPERIMENTAL DESIGNS: Comparison Group Posttest Design
X O (treatment group)
  O (comparison group)
Also called Static Group Comparison or ex post facto research. No pretest observations.
Sometimes a treatment is implemented before the researcher can plan for it, and the research design is worked out after the treatment has begun. The most obvious flaw is the absence of pretests, which leads to the possibility that any posttest differences between the groups can be attributed either to a treatment effect or to selection differences between the different groups. The plausibility of selection differences in research with nonequivalent groups usually renders this design uninterpretable.
54 What If…
We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)…
Pre-experimental: One Group Posttest Design
Our inferences are based on general expectations of what data would have been had X not occurred. Almost all threats are present.
For Discussion: What if we compare test scores for these students with last year’s scores (assume last year had no streaming video)?
55 SOURCES OF INVALIDITY
Internal: History +, Maturation ?, Testing, Instrumentation, Regression, Mortality –, Selection, Selection Interactions
External: Interaction of Testing and X, Interaction of Selection and X –, Reactive Arrangements, Multiple X Interference
Threats to validity include:
Selection: the groups selected may actually be different prior to any treatment.
Mortality: the difference between O1 and O2 may be due to subjects dropping out of a specific experimental group, which would cause the groups to be unequal.
Interaction of selection and maturation.
Interaction of selection and the experimental variable.
56 What If…
We gave 150 pharmacy students (all at distance campuses) access to streaming video and measured their performance on a written exam (which measures achievement of learning outcomes) both before and after the intervention?
History: Other things could change during “X”.
  Example: The streaming server was down during the study interval, which made it difficult for students to access the videos whenever they wanted.
Maturation: Do students get more mature as learners as they “settle down” during the semester?
Testing: What if the pre-test caused the students to think more about the content as they watched the videos?
Instrumentation: What if the exam has short-answer questions that have to be hand-graded, and the faculty member is exhausted at the end of the semester and is therefore less accurate in grading?
Statistical regression: What about regression to the mean?
57 PRE-EXPERIMENTAL DESIGNS
One Group Pretest/Posttest Design
O X O
Not a true experiment
Because participants serve as their own control, results may be less biased
Use of this design is widespread, but there are several weaknesses:
History: observed change, or gain scores on a knowledge test, may be due to other events that influence the posttest measure. To rule this out, the researcher has to make the case either that such events are implausible in the context of the particular study or that they are plausible but did not operate.
Regression toward the mean: the most common regression artifact arises when a special program is given only to those with extreme scores on the pretest. This produces a spurious improvement.
Maturation.
Testing: exposure to an outcome measure at one time can affect performance at another. In education, the pretest can be an impetus for students to learn the correct answers to items, thus improving their performance on the posttest regardless of the impact of the intervention.
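The regression artifact described above can be illustrated with a small simulation. This is only a sketch under assumed numbers: the score model (stable ability plus independent test-day noise), the means and standard deviations, and the "lowest 10%" selection rule are all hypothetical and are not taken from the study in this example. No treatment effect is simulated, yet the selected group still "improves".

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical model: observed score = stable ability + independent test-day noise.
ability = [rng.gauss(70, 10) for _ in range(5000)]
pretest = [a + rng.gauss(0, 8) for a in ability]
posttest = [a + rng.gauss(0, 8) for a in ability]  # no treatment effect at all

# A "special program" is given only to roughly the lowest-scoring 10% on the pretest.
cutoff = sorted(pretest)[len(pretest) // 10]
selected = [i for i, score in enumerate(pretest) if score <= cutoff]

pre_mean = statistics.mean(pretest[i] for i in selected)
post_mean = statistics.mean(posttest[i] for i in selected)
print(f"selected group, pretest mean:  {pre_mean:.1f}")
print(f"selected group, posttest mean: {post_mean:.1f}")
print(f"everyone, pretest vs posttest: {statistics.mean(pretest):.1f} vs {statistics.mean(posttest):.1f}")
```

The selected group's posttest mean moves back toward the overall mean even though nothing was done to them, which is why a one-group design cannot separate this artifact from a real gain.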
58 What If…
We gave 150 pharmacy students (all at distance campuses) access to streaming video and measured their performance on a written exam (which measures achievement of learning outcomes) both before and after the intervention?
Candidate explanations to consider: history, maturation, testing, instrumentation, and statistical regression.
For Discussion: What are the threats to validity (what are the plausible hypotheses that could explain any difference)?
59 SOURCES OF INVALIDITY
Internal: History –, Maturation, Testing, Instrumentation, Regression ?, Mortality +, Selection, Selection Interactions
External: Interaction of Testing and X –, Interaction of Selection and X, Reactive Arrangements ?, Multiple X Interference
Uncontrolled threats to validity include:
History: between O1 and O2 many events may have occurred apart from X to produce the differences in outcomes. The longer the time lapse between observations, the more likely it is that history becomes a threat.
Maturation: between O1 and O2 students may have grown older, or their internal states may have changed.
Testing: the process of measuring may change what is being measured (reactive effects of testing).
Instrumentation
Statistical regression
Interaction of selection and maturation
Interaction of testing and the experimental variable
Interaction of selection and the experimental variable
60 What If…
We could randomize all 300 pharmacy students to the following groups:
Group 1: access only streaming video
Group 2: attend lectures
For each group, administer both a pre-test and a post-test
Experimental Design: “Pretest-Posttest Control Group Design”
61 TRUE EXPERIMENTAL DESIGNS
Pretest/Posttest Design with Control Group and Random Assignment
R O X O
R O   O
Measurement of pre-existing differences
Controls most threats to internal validity
In a true experiment, whether laboratory, field or simulation, subjects are randomly assigned to treatment groups. It is this randomization that makes true experiments so strong in internal validity and typically allows us to make relatively strong inferences about causality.
Random assignment means that, on average, all your treatment groups are about the same at the beginning of a study.
That said, without a control or comparison group your study is open to criticism and alternative causal explanations.
This study design also has two other important features: a pretest and a posttest. The pretest allows you to double-check that your participants are pretty much alike at the beginning of the study. Because you have both pretest and posttest, you can also assess the level of change.
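As a rough illustration of what random assignment means in practice, the sketch below shuffles a roster and deals students out to groups in turn. The function name `randomly_assign`, the group labels, the student IDs and the seed are all hypothetical choices made for this example; they are not part of the study described here.

```python
import random

def randomly_assign(student_ids, groups=("video_only", "lecture"), seed=None):
    """Shuffle the students, then deal them out to the groups in turn."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    assignment = {group: [] for group in groups}
    for position, student in enumerate(shuffled):
        assignment[groups[position % len(groups)]].append(student)
    return assignment

# Hypothetical roster of 300 students.
students = [f"S{n:03d}" for n in range(1, 301)]
assignment = randomly_assign(students, seed=2024)
print({group: len(members) for group, members in assignment.items()})
# {'video_only': 150, 'lecture': 150}
```

Passing four group labels instead of two would split the same roster into four groups of 75, which is the allocation a four-group design would need.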
62 What If…
We could randomize all 300 pharmacy students to the following groups:
Group 1: access only streaming video
Group 2: attend lectures
For each group, administer both a pre-test and a post-test
Experimental Design: “Pretest-Posttest Control Group Design”
R O X O
R O   O
All threats to internal validity are controlled
For Discussion: What are the threats to validity (what are the plausible hypotheses that could explain any difference)?
63 SOURCES OF INVALIDITY
Internal: History +, Maturation, Testing, Instrumentation, Regression, Mortality, Selection, Selection Interactions
External: Interaction of Testing and X –, Interaction of Selection and X ?, Reactive Arrangements, Multiple X Interference
This standard design may pose some threats to validity.
History: controlled, in that the general history events which may have contributed to the O1 and O2 effects would also produce the O3 and O4 effects. This is true if both groups are tested simultaneously. Intrasession history must be taken into consideration.
Maturation and testing: both should be manifested equally in treatment and control groups.
Instrumentation: controlled where conditions control for intrasession history, especially where fixed tests are used. When observers or interviewers are used, there is a potential for problems. There should be enough raters/observers for random assignment to condition. Blinding observers to the purpose of the experiment will also help.
Regression: controlled, as far as mean differences are concerned, regardless of the extremity of scores or characteristics, if the treatment and control groups are randomly assigned from the same extreme pool. If this occurs, the groups should regress similarly.
Selection: controlled by random assignment.
Mortality: said to be controlled for in this design, but unless the mortality rate is equal in treatment and control groups it is not possible to state with certainty that mortality did not contribute to the experimental results.
Interaction of testing and X: because the interaction between taking a pretest and the treatment itself may affect the results of the experimental group, it is desirable to use a design that does not use a pretest.
Interaction of selection and X: although selection is controlled for by randomly assigning subjects to groups, there remains a possibility that the effects demonstrated hold true only for the population from which the experimental and control groups were selected.
Reactive arrangements: this refers to the artificiality of the experimental setting and the subject’s knowledge that he or she is participating in an experiment.
Simply being measured or observed during the pretest may sensitize some subjects, and they will behave differently as a result. In addition, a pretest may interact with an experimental treatment to heighten the effect of the intervention beyond what it would ordinarily have been.
64 What If…
We could randomize all 300 pharmacy students to the following groups:
Group 1: access only streaming video and post-test
Group 2: attend lectures and post-test
Experimental Design: “Post-test only control group”
65 TRUE EXPERIMENTAL DESIGNS
Posttest Only Control Group
R X O
R   O
The most obvious limitation is the absence of pretests: baseline equivalence cannot be checked and change cannot be measured directly. Unlike the static group comparison, however, the groups here are formed by random assignment (R), so they are equivalent on average and posttest differences are much less plausibly attributed to selection differences between the groups.
66 What If…
We could randomize all 300 pharmacy students to the following groups:
Group 1: access only streaming video and post-test
Group 2: attend lectures and post-test
Experimental Design: “Post-test only control group”
For Discussion: What have we lost by not using a pre-test (as compared to the randomized pre-test and post-test experimental design)?
67 SOURCES OF INVALIDITY
Internal: History +, Maturation, Testing, Instrumentation, Regression, Mortality, Selection, Selection Interactions
External: Interaction of Testing and X +, Interaction of Selection and X ?, Reactive Arrangements, Multiple X Interference
68 What If…
We could randomize all 300 pharmacy students to the following groups:
Group 1: pre-test, access only streaming video, and post-test
Group 2: pre-test, attend lectures, and post-test
Group 3: access only streaming video and post-test only
Group 4: attend lectures and post-test only
Experimental Design: “Solomon 4-Group Design”
69 TRUE EXPERIMENTAL DESIGNS
Solomon Four Group Comparison
R O X O
R O   O
R   X O
R     O
One of the strongest experimental designs with respect to internal validity is the Solomon Four Group Design. In this design there are four randomized groups of subjects. The first group receives a pretest, the experimental treatment and a posttest. The second group receives a pretest and a posttest but no treatment (or a different treatment). The third group is identical to the first group except that it does not receive the pretest, and the fourth group receives the posttest only.
More expensive: requires more subjects and resources.
70 What If…
We could randomize all 300 pharmacy students to the following groups:
Group 1: pre-test, access only streaming video, and post-test
Group 2: pre-test, attend lectures, and post-test
Group 3: access only streaming video and post-test only
Group 4: attend lectures and post-test only
Experimental Design: “Solomon 4-Group Design”
For Discussion: What have we gained by having 4 groups (especially groups 3 and 4)?
71 What If…
It is NOT feasible to use randomization. What if we were to have the following groups:
Group 1 (all distant campuses): access only streaming video
Group 2 (GNV campus): attend lectures
For each group, administer both a pre-test and a post-test
Quasi-Experimental Design: “Nonequivalent control group”
O X O
O   O
Controls: History, Maturation, Testing, Instrumentation, Selection, Mortality
Regression is ????
Interaction of Selection and Maturation is negative
External Validity is weak.
72 QUASI-EXPERIMENTAL DESIGNS
Nonequivalent Control Group
O X O
O   O
Pre-existing differences can be measured
Controls some threats to validity
A true experiment is simply not always possible, yet investigators still want to make causal statements.
If your study has different levels of treatment and people or groups are assigned to those treatments without random assignment, you have a quasi-experiment.
Some treatment groups are initially formed on the basis of performance or of variables that are not experimentally induced. Self-selection into treatment groups is also common in quasi-experiments such as program evaluation.
The difficulty in quasi-experiments is finding out just how similar the groups were at the beginning, before any treatment began. Sometimes, in fact, if groups are created on the basis of dissimilarities such as ability, we know the groups are different at the beginning. If we have prior information about the individuals or groups in the different treatments, we may at least try to institute statistical controls for those variables:
Background information: grades, test scores, personality tests collected before the study ever began.
Some kind of pretest measures: demographic information such as educational level, occupation, income.
Supplemental information from other people.
There is still the possibility that the groups differ on variables that we haven’t measured, such as motivation or confidence. It could be these differences, rather than a difference in ability, that were the true causes of the observed differences.
The control group is nonequivalent because we did not use random assignment, and therefore we cannot assume that the groups are, on average, the same or equivalent to begin with.
73 What If…
It is NOT feasible to use randomization. What if we were to have the following groups:
Group 1 (all distant campuses): access only streaming video
Group 2 (GNV campus): attend lectures
For each group, administer both a pre-test and a post-test
Quasi-Experimental Design: “Nonequivalent control group”
O X O
O   O
Controls: History, Maturation, Testing, Instrumentation, Selection, Mortality
Regression is ????
Interaction of Selection and Maturation is negative
External Validity is weak.
For Discussion: What have we “lost” by not randomizing?
74 SOURCES OF INVALIDITY
Internal: History –, Maturation, Testing, Instrumentation, Regression, Mortality, Selection, Selection Interactions
External: Interaction of Testing and X, Interaction of Selection and X –, Reactive Arrangements, Multiple X Interference
75 QUASI-EXPERIMENTAL DESIGNS
Time Series
O O O O X O O O O
In a time series design you have several observations over time. While you may have some type of experimental intervention, often nature does the experimenting for you. There is a series of pre-intervention measures, which you continue to collect after the intervention. In this design the data are often extant, especially for the observations before the intervention. In all likelihood you won’t have a control group, let alone a randomized control group. This makes it very difficult to tease out specifically what it was about the intervention that caused the observed outcomes.
76 QUASI-EXPERIMENTAL DESIGNS
Counterbalanced Design
O X1 O X2 O X3
O X3 O X1 O X2
O X2 O X3 O X1
Multiple treatments and tests for all groups.
Number of groups = number of treatments
Can be employed when pretest is not possible and intact groups must be used.
Exposure to one treatment may influence subsequent treatments (order effects).
In our example:
O video O lecture O
O lecture O video O
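One common way to build counterbalanced orders is a Latin square, in which every treatment appears once in each group's sequence and once in each position. The sketch below generates such orders by cyclic rotation; the function name `latin_square_orders` is hypothetical, and rotation is only one of several valid ways to construct the square.

```python
def latin_square_orders(treatments):
    """Return one treatment order per group, built by cyclic rotation."""
    k = len(treatments)
    return [
        [treatments[(col - row) % k] for col in range(k)]
        for row in range(k)
    ]

orders = latin_square_orders(["X1", "X2", "X3"])
for group, order in enumerate(orders, start=1):
    print(f"Group {group}: " + " -> ".join(order))
# Group 1: X1 -> X2 -> X3
# Group 2: X3 -> X1 -> X2
# Group 3: X2 -> X3 -> X1
```

With two treatments, `latin_square_orders(["video", "lecture"])` gives the two orders used in the streaming-video example: video then lecture, and lecture then video.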
78 CLASSIC VIEW OF TEST VALIDITY
Traditional triarchic view of validity:
  Content
  Criterion
    Concurrent
    Predictive
  Construct
Tests were described as “valid” or “invalid”
Reliability was considered a separate test trait
79 MODERN VIEW OF VALIDITY
Scientific evidence needed to support test score interpretation
Standards for Educational & Psychological Testing (1999)
Cronbach, Messick, Kane
Some theory, key concepts, examples
Reliability as part of validity
80 VALIDITY: DEFINITIONS
“A proposition deserves some degree of trust only when it has survived serious attempts to falsify it.” (Cronbach, 1980)
According to the Standards, validity refers to “the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores.”
“Validity is an integrative summary.” (Messick, 1995)
“Validation is the process of building an argument supporting interpretation of test scores.” (Kane, 1992)
Over the last half century, the concept of validation has evolved from establishing correlation with a dependent variable to the idea that researchers must validate each scale, test, or instrument measuring a construct, and do so in multiple ways which, taken together, form the whole of what validity is.
81 WHAT IS A CONSTRUCT?
Constructs are psychological attributes, hypothetical concepts
A defensible construct has:
  A theoretical basis
  Clear operational definitions involving measurable indicators
  Demonstrated relationships to other constructs or observable phenomena
A construct should be differentiated from related theoretical constructs as well as from methodological irrelevancies
Construct validity is about the correspondence between your concepts and the actual measurements that you use. This means we must have clear conceptual definitions of our variables.
82 THREATS TO CONSTRUCT VALIDITY (Cook & Campbell)
Inadequate preoperational explication of constructs
Mono-operation bias
Mono-method bias
Interaction of different treatments
Interaction of testing and treatment
Restricted generalizability across constructs
Inadequate preoperational explication of constructs: A precise explication of constructs is vital for the linkage between treatments and outcomes. Some possible solutions:
  think through your concepts better
  use methods (e.g., concept mapping) to articulate your concepts
  get experts to critique your operationalizations
Mono-operation bias pertains to the independent variable, cause, program or treatment in your study. If you only use a single version of a program in a single place at a single point in time, you may not be capturing the full breadth of the concept of the program.
Mono-method bias refers to your measures or observations, not to your programs or causes. With only a single version of a self-esteem measure, you can't provide much evidence that you're really measuring self-esteem. Solution: try to implement multiple measures of key constructs and try to demonstrate (perhaps through a pilot or side study) that the measures you use behave as you theoretically expect them to.
Interaction of different treatments occurs when the group in your study is also likely to be involved simultaneously in several other programs designed to have similar effects. Can you really label the program effect as a consequence of your program?
Interaction of testing and treatment: Does testing or measurement itself make the groups more sensitive or receptive to the treatment? If it does, then the testing is in effect a part of the treatment; it is inseparable from the effect of the treatment. This is a labeling issue (and, hence, a concern of construct validity) because you want to use the label "program" to refer to the program alone, but in fact it includes the testing.
Restricted generalizability across constructs: This is the "unintended consequences" threat to construct validity. It reminds us that we have to be careful about whether our observed effects (Treatment X is effective) would generalize to other potential outcomes.
83 THREATS TO CONSTRUCT VALIDITY (Cook & Campbell)
Confounding constructs
Confounding levels of constructs
Hypothesis guessing within experimental conditions
Evaluation apprehension
Researcher expectancies
Confounding constructs and levels of constructs: Like the other construct validity threats, this is essentially a labeling issue: your label is not a good description of what you implemented.
Hypothesis guessing: Most people don't just participate passively in a research project. They are trying to figure out what the study is about. They are "guessing" at what the real purpose of the study is, and they are likely to base their behavior on what they guess, not just on your treatment.
Evaluation apprehension: Many people are anxious about being evaluated. If their apprehension (and not your program conditions) makes them perform poorly, then you certainly can't label that as a treatment effect. Another form of evaluation apprehension concerns the human tendency to want to "look good" or "look smart" and so on. If, in their desire to look good, participants perform better (and not as a result of your program!), then you would be wrong to label this as a treatment effect. In both cases, the apprehension becomes confounded with the treatment itself and you have to be careful about how you label the outcomes.
Researcher expectancies: The researcher can bias the results of a study in countless ways, both consciously and unconsciously. Sometimes the researcher can communicate what the desired outcome for a study might be (and participants' desire to "look good" leads them to react that way). For instance, the researcher might look pleased when participants give a desired answer. If this is what causes the response, it would be wrong to label the response as a treatment effect.
84 SOURCES OF VALIDITY EVIDENCE
1. Test Content
   Task Representation
   Construct Domain
2. Response Process: Item Psychometrics
3. Internal Structure: Test Psychometrics
4. Relationships with Other Variables: Correlations
   Test-Criterion Relationships
   Convergent and Divergent Data
5. Consequences of Testing: Social context
Standards for Educational and Psychological Testing, 1999
85 ASPECTS OF VALIDITY: CONTENT
Content validity refers to how well elements of the test or scale relate to the content domain.
Content relevance.
Content representativeness.
Content coverage.
Systematic analysis of what the test is intended to measure.
Technical quality.
Construct-irrelevant variance
86 SOURCES OF VALIDITY EVIDENCE: TEST CONTENT
Detailed understanding of the content sampled by the instrument and its relationship to the content domain
Content-related evidence is often established during the planning stages of an assessment or scale.
Content-related validity studies
  Exact sampling plan, table of specifications, blueprint
  Representativeness of items/prompts → Domain
  Appropriate content for instructional objectives
  Cognitive level of items
  Match to instructional objectives
Review by panel of experts.
  Content expertise of item/prompt writers
  Expertise of content reviewers
  Quality of items/prompts, sensitivity review
87 ASPECTS OF VALIDITY: RESPONSE PROCESSES
Emphasis is on the role of theory.
Tasks sample domain processes as well as content.
Accuracy in combining scores from different item formats or subscales.
Quality control: scanning, assignment of grades, score reports.
88 SOURCES OF VALIDITY EVIDENCE: RESPONSE PROCESSES
Fit of student responses to hypothesized construct?
Basic quality control information: accuracy of item responses, recording, data handling, scoring
Statistical evidence that items/tasks measure the intended construct
  Achievement items measure intended content and not other content
  Ability items predict targeted achievement outcome
  Ability items fail to predict a non-related ability or achievement outcome
89 SOURCES OF EVIDENCE: RESPONSE PROCESSES
Debrief examinees regarding solution processes.
“Think-aloud” during pilot testing.
Subscore/subscale analyses, i.e., correlation patterns among part scores.
Accurate and understandable interpretations of scores for examinees.
90 SOURCES OF VALIDITY EVIDENCE: INTERNAL STRUCTURE
Statistical evidence of the hypothesized relationship between test item scores and the construct
Reliability
  Test scale reliability
  Rater reliability
  Generalizability
Item analysis data
  Item difficulty and discrimination
  MCQ option function analysis
  Inter-item correlations
Scale factor structure
Dimensionality studies
Differential item functioning (DIF) studies
91 ASPECTS OF VALIDITY: EXTERNAL
Can the test results be evaluated by objective criteria?
Correlations with other relevant variables
Test-criterion correlations
  Concurrent or predictive
MTMM matrix
  Convergent correlations
  Divergent (discriminant) correlations
92 SOURCES OF VALIDITY EVIDENCE: RELATIONSHIPS TO OTHER VARIABLES
Statistical evidence of the hypothesized relationship between test scores and the construct
Criterion-related validity studies
Correlations between test scores/subscores and other measures
Convergent-Divergent studies
MTMM
93 RELATIONSHIPS WITH OTHER VARIABLES
Predictive validity: A variation of concurrent validity where the criterion is in the future.
The classic example is determining whether students who score high on an admissions test such as the MCAT earn higher preclinical GPAs.
94 RELATIONSHIPS WITH OTHER VARIABLES
Convergent validity: Assessed by the correlation among the items which make up the scale (internal consistency), by the correlation of the given scale with measures of the same construct using instruments proposed by other researchers, and by the correlation of relationships involving the given scale across samples or across methods.
95 RELATIONSHIPS WITH OTHER VARIABLES
Criterion (concurrent) validity: The correlation between scale or instrument measurement items and known, accepted standard measures or criteria.
Do the proposed measures for a given concept exhibit generally the same direction and magnitude of correlation with other variables as measures of that concept already accepted in this area of research?
96 RELATIONSHIPS WITH OTHER VARIABLES
Divergent (discriminant) validity: The indicators of different constructs should not be so highly correlated as to lead us to conclude that they measure the same thing. This would happen if there is definitional overlap between two constructs.
97 MULTI-TRAIT MULTI-METHOD MTMM MATRIX
Mono-operation and/or mono-method biases: use of a single data gathering method or a single indicator for a concept may result in bias.
Multi-trait/multi-method validation uses multiple indicators per concept and gathers data for each indicator by multiple methods or multiple sources.
Evidence for construct validity, especially when developing a new scale, is often established through the use of a multi-trait, multi-method matrix. At least two constructs are measured. Each construct is measured in at least two different ways, and the types of measures are repeated across constructs. Typically, under conditions of high construct validity, correlations are high for the same construct (or trait) across a host of different measures. Correlations are low across constructs that are different but measured using the same general technique.
Under low construct validity the reverse holds. Correlations are high across traits using the same method but low for the same trait measured in different ways.
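The correlation pattern expected under high construct validity can be sketched with simulated data. Everything in this example is hypothetical: the two traits "A" and "B", the two methods "survey" and "observer", the noise level, and the sample size. For simplicity it also leaves out shared method variance, so it shows only the convergent versus discriminant contrast, not a full MTMM analysis.

```python
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(11)
n = 400

# Hypothetical data: two unrelated traits, each measured by two methods.
traits = {
    "A": [rng.gauss(0, 1) for _ in range(n)],
    "B": [rng.gauss(0, 1) for _ in range(n)],
}
measures = {
    (trait, method): [score + rng.gauss(0, 0.6) for score in scores]
    for trait, scores in traits.items()
    for method in ("survey", "observer")
}

convergent, discriminant = [], []
for (key1, x), (key2, y) in combinations(measures.items(), 2):
    r = pearson(x, y)
    if key1[0] == key2[0]:
        convergent.append(r)    # same trait, different method
    else:
        discriminant.append(r)  # different traits
    print(key1, key2, round(r, 2))
```

In this setup the same-trait correlations come out high and the different-trait correlations come out near zero, which is the pattern described for high construct validity.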
98 MULTI-TRAIT MULTI-METHOD MTMM MATRIX
Validity of index of learning styles scores: multitrait-multimethod comparison with three cognitive learning style instruments. Cook DA; Smith AJ. Medical Education, 2006; 40:
ILS = Index of Learning Styles
  Active-reflective
  Sensing-intuitive
  Visual-verbal
  Sequential-global
LSTI = Learning Style Type Indicator
  Extrovert-introvert
  Sensing-intuition
  Thinking-feeling
  Judging-perceiving
100 RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY
Neither is a property of a test or scale.
Reliability is important validity evidence.
Without reliability, there can be no validity. Reliability is necessary, but not sufficient, for validity.
The purpose of an instrument dictates what type of reliability is important and the sources of validity evidence necessary to support the desired inferences.
101 SOURCES OF VALIDITY EVIDENCE: CONSEQUENCES
Evidence of the effects of tests on students, instruction, schools, society
Consequential validity
  Social consequences of assessment
  Effects of passing-failing tests
    Economic costs of failure
    Costs to society of false positive/false negative decisions
  Effects of tests on instruction/learning
    Intended vs. unintended
102 RELIABILITY AND INSTRUMENTATION
Lou Ann Cooper, PhD
Director of Program Evaluation and Medical Education Research
University of Florida College of Medicine
103 TYPES OF RELIABILITY
Different types of assessments require different kinds of reliability
Written MCQ/Likert-scale items
  Scale reliability
  Internal consistency
Written Constructed Response and Essays
  Inter-rater agreement
  Generalizability theory
In order to make causal assessments in your research, you must first have reliable measures: stable and/or repeated measures. If the random error variation in your measurements is so large that there is almost no stability, you can’t explain anything. Picture an IQ test where an individual’s scores ranged from moronic to genius level. No one would place any faith in the results of such a test.
Reliability is required to make statements of validity. However, reliable measures could be biased and hence “untrue” measures of a phenomenon, or confounded with other factors such as an acquiescence response set. Picture a scale that always weighs 5 pounds heavy. It could be very reliable, but it is not a valid assessment of a person’s true weight.
105 ROUGH GUIDELINES FOR RELIABILITY
The higher the better!
Depends on purpose of test
Very high-stakes: > 0.90 (Licensure exams)
Moderate stakes: at least ~ (Classroom test, Medical school OSCE)
Low stakes: > 0.60 (Quiz, test for feedback only)
106 INCREASING RELIABILITY
Written tests
  Use objectively scored formats
  At least MCQs
  MCQs that differentiate between high and low scorers
Performance exams
  At least 7-12 cases
  Well trained standardized patients and/or other raters
  Monitoring and quality control
Observational Exams
  Many independent raters (7-11)
  Standard checklists/rating scales
  Timely ratings
107 SCALE DEVELOPMENT
1. Identify the primary purpose for which scores will be used. Validity is the most important consideration. Validity is not a property of an instrument. The inferences to be made determine the type of items you will write.
2. Specify the important aspects of the construct to be measured.
108 SCALE DEVELOPMENT
3. Initial pool of items.
4. Expert review (content validity)
5. Preliminary item ‘tryout’
6. Statistical properties of the items
   Item analysis
   Reliability estimate
   Dimensionality
109 ITEM ANALYSIS
Item ‘difficulty’: item variance, frequencies
Inter-item covariances/correlations
Item discrimination: an item that discriminates well correlates with the total score.
Cronbach’s coefficient alpha
Factor Analysis, Multidimensional Scaling
IRT
Structural aspect of validity.
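A minimal sketch of these item statistics, using only the Python standard library. The response matrix is made up for illustration, the function names are hypothetical, and the item-total correlation is computed in its "corrected" form (each item against the total of the remaining items), which is one common convention rather than the only one.

```python
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def cronbach_alpha(rows):
    """Coefficient alpha; rows holds one list of item scores per respondent."""
    k = len(rows[0])
    item_variances = [statistics.variance(column) for column in zip(*rows)]
    total_variance = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

def item_statistics(rows):
    """Mean, SD, and corrected item-total correlation for every item."""
    totals = [sum(row) for row in rows]
    results = []
    for number, column in enumerate(zip(*rows), start=1):
        # Total score without the item itself, so the item is not correlated with itself.
        rest = [total - score for total, score in zip(totals, column)]
        results.append({
            "item": number,
            "mean": statistics.mean(column),
            "sd": statistics.stdev(column),
            "item_rest_r": pearson(column, rest),
        })
    return results

# Hypothetical responses: 6 respondents x 4 items on a 5-point scale.
responses = [
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 3, 3, 3],
    [1, 2, 2, 2],
    [4, 4, 5, 3],
]

for row in item_statistics(responses):
    print(row)
print(f"alpha = {cronbach_alpha(responses):.2f}")  # alpha = 0.94
```

With only six respondents these numbers are purely illustrative; real item analyses need far larger samples.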
110 NEED TO EVALUATE SCALE
Jarvis & Petty (1996)
Hypothesis: Individuals differ in the extent to which they engage in evaluative responding.
Subjects were undergraduate psychology students.
Comprehensive reliability and validity studies.
Concluded the scale was ‘unidimensional’.
Scale construction is an iterative process whereby items are generated, tested and analyzed, and then either retained for further testing, revised or deleted. Through this process, Jarvis and Petty reduced the pool of 46 items they started with to a highly reliable 16-item scale (coefficient alpha = .87).
The statistical criteria they used were:
  Item-total correlation (point biserial correlation) greater than .30.
  Average inter-item correlation greater than .20.
  Item mean > 2 and < 4 on a 5-point scale.
  SD ≥ 1.
Then they began a rigorous investigation of the factor structure of the scale.
  Started with EFA
  Four CFA models
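The four screening rules listed above can be written as a simple filter. This is only a sketch: the item names and summary statistics are invented for illustration, the function name is hypothetical, and it assumes the average inter-item correlation is computed per item (each item's mean correlation with the other items), which the slide does not specify.

```python
def meets_retention_criteria(item):
    """Apply the four screening rules to one item's summary statistics."""
    return (
        item["item_total_r"] > 0.30
        and item["mean_inter_item_r"] > 0.20
        and 2 < item["mean"] < 4
        and item["sd"] >= 1
    )

# Hypothetical summary statistics for three candidate items (5-point scale).
candidates = [
    {"name": "item_a", "item_total_r": 0.52, "mean_inter_item_r": 0.31, "mean": 3.1, "sd": 1.2},
    {"name": "item_b", "item_total_r": 0.18, "mean_inter_item_r": 0.09, "mean": 2.9, "sd": 1.1},
    {"name": "item_c", "item_total_r": 0.45, "mean_inter_item_r": 0.27, "mean": 4.4, "sd": 0.7},
]

retained = [item["name"] for item in candidates if meets_retention_criteria(item)]
print(retained)  # ['item_a']
```

Here item_b fails on the correlation rules and item_c fails on the mean and SD rules, so only item_a would go forward to the next round of testing.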
111 REFERENCES
Cook, T. D. & Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings.
Downing, S. M. Threats to the validity of locally developed multiple-choice tests in medical education: construct-irrelevant variance and construct underrepresentation. Adv in Health Sci Educ 2002; 7:
Downing, S. M. Validity: On the meaningful interpretation of assessment data. Med Educ 2003; 37:
Messick, S. (1989). Validity. In Educational Measurement, 3rd Ed. R. L. Linn, Ed.
Downing, S. M. Reliability: On the reproducibility of assessment data. Med Educ 2004; 38: