POPULATIONS AND SAMPLING


1 POPULATIONS AND SAMPLING

2 Let’s Look at our Example Research Question
How do UF COP pharmacy students who only watch videostreamed lectures differ from those who attend class lectures (and also have access to videostreamed lectures) in terms of learning outcomes? Population: Who do you want these study results to generalize to?

3 Population The group you wish to generalize to is your population.
There are two types:
Theoretical population: in our example, this would be all pharmacy students in the US.
Accessible population: in our example, this would be all COP pharmacy students.

4 Sampling Target population (or sampling frame): everyone in the accessible population from whom you can draw your sample. Sample: the group of people you select to be in the study, a subgroup of the target population. This is not necessarily the group that actually ends up in your study.

5 Sampling How you select your sample: Sampling Strategies
Probability sampling: simple random sampling, stratified sampling, multistage cluster sampling.
Nonprobability sampling: convenience sampling, snowball sampling.
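As a rough sketch of how two of these strategies differ in practice, the following Python fragment draws a simple random sample and a stratified sample from a hypothetical roster of 300 students (the campus names and sizes are invented for illustration):

import random

random.seed(1)
# Hypothetical sampling frame: 300 students across three campuses
frame = [{"id": i, "campus": random.choice(["GNV", "JAX", "ORL"])} for i in range(300)]

# Simple random sampling: every student has an equal chance of selection
srs = random.sample(frame, 30)

# Stratified sampling: draw 10 students from each campus stratum
strata = {}
for student in frame:
    strata.setdefault(student["campus"], []).append(student)
stratified = [s for campus in strata.values() for s in random.sample(campus, 10)]

Stratifying guarantees each campus is represented in a fixed proportion, which simple random sampling does not.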

6 Sample Size Select as large a sample as possible from your population.
With a large sample there is less risk that the sample differs from the population. Sampling error: the difference between the sample estimate and the true population value (example: an exam score).

7 Sample Size Sample size formulas/tables can be used. Factors considered include confidence in the statistical test and sampling error. See Appendix B in Creswell (pg 630). Sampling error formula: used to determine sample size for a survey. Power analysis formula: used to determine group size in an experimental study.
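As a concrete illustration, one widely used sampling-error formula for estimating a proportion (a standard textbook result, not necessarily the exact formula in Creswell's appendix) gives the survey sample size needed for a chosen margin of error:

import math

z = 1.96   # z-value for 95% confidence
p = 0.5    # assumed population proportion; 0.5 is the most conservative choice
e = 0.05   # desired margin of error (sampling error)

n = math.ceil((z**2 * p * (1 - p)) / e**2)
print(n)   # 385 respondents for +/-5% error at 95% confidence

Halving the margin of error to 2.5% roughly quadruples the required sample size, which is why precision is expensive.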

8 Back to Our Example How do UF COP pharmacy students who only watch videostreamed lectures differ from those who attend class lectures (and also have access to videostreamed lectures) in terms of learning outcomes? What is our theoretical population? What is our accessible population? What sampling strategy should we use?

9 Important Concept Random sampling vs random assignment
We have talked about random sampling in this session. Random sampling is not the same as random assignment.  Random sampling is used to select individuals from the population who will be in the sample.  Random assignment is used in an experimental design to assign individuals to groups.
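A minimal sketch of the distinction (hypothetical population of 1,000; standard library only): random sampling decides who is studied, random assignment decides which group they end up in.

import random

random.seed(42)
population = list(range(1000))   # the accessible population

# Random sampling: select who will be in the study
sample = random.sample(population, 60)

# Random assignment: split the sample into treatment and control groups
random.shuffle(sample)
treatment, control = sample[:30], sample[30:]

Random sampling supports generalization to the population; random assignment supports causal comparison between the groups.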

10 VALIDITY
Lou Ann Cooper, PhD Director of Program Evaluation and Medical Education Research University of Florida College of Medicine

11 INTRODUCTION Both research and evaluation include:
Design: how the study is conducted. Instruments: how data are collected. Analysis of the data to make inferences about the effect of a treatment or intervention.
Each of these components can be affected by bias. Bias attributable to the investigator, the sample, the method, or the instrument may not be completely avoidable in every instance, but scientists want to know the possible sources of bias and how bias is likely to influence evidence. The presence of bias, while often not avoidable and indeed inherent to certain study designs, can limit, to varying degrees, the relevance and applicability of a given study.
It is important to address the issue of bias early in the design phase, to ensure a design appropriate to the study hypothesis and to outline procedures for data collection and analysis. Whereas some bias (e.g., confounding) can be adjusted or corrected for in the statistical analysis, much cannot and may render a study's results invalid.
When faced with a claim that something is true, scientists respond by asking what evidence supports it. But scientific evidence can be biased in how the data are interpreted, in the recording or reporting of the data, or even in the choice of what data to consider in the first place. One safeguard against undetected bias in an area of study is to have many different investigators or groups of investigators working in it.

12 INTRODUCTION Two types of error in research
Random error: due to random variation in participants' responses at measurement. Inferential statistics, i.e., the p-value and 95% confidence interval, measure random error and allow us to draw conclusions based on research data. Systematic error, or bias.

13 BIAS: DEFINITION Deviations of results (or inferences) from the truth, or processes leading to such deviation. Any trend in the selection of subjects, data collection, analysis, interpretation, publication or review of data that can lead to conclusions that are systematically different from the truth. Systematic deviation from the truth that distorts the results of research.
Bias is a form of systematic error that can affect scientific investigations and distort the measurement process. A biased study loses validity in relation to the degree of the bias. While some study designs are more prone to bias, its presence is universal. It is difficult or even impossible to completely eliminate bias. In the process of attempting to do so, new bias may be introduced or a study may be rendered less generalizable. Therefore, the goals are to minimize bias and for both investigators and readers to comprehend its residual effects, limiting misinterpretation and misuse of data. Numerous forms of bias have been described, and the terminology can be confusing, overlapping, and specific to a medical specialty.

14 BIAS Bias is a form of systematic error that can affect scientific investigations and distort the measurement process. Bias is primarily a function of study design and execution, not of results, and should be addressed early in the study planning stages. Not all bias can be controlled or eliminated; attempting to do so may limit usefulness and generalizability. Awareness of the presence of bias will allow more meaningful scrutiny of the results and conclusions. A biased study loses validity and is a common reason for invalid research.

15 POTENTIAL BIASES IN RESEARCH AND EVALUATION
Study Design: issues related to internal validity; issues related to external validity. Instrument Design: issues related to construct validity. Data Analysis: issues related to statistical conclusion validity.

16 VALIDITY Validity is discussed and applied based on two complementary conceptualizations in education and psychology: Test validity: the degree to which a test measures what it was designed to measure. Experimental validity: the degree to which a study supports the intended conclusion drawn from the results.

17 FOUR TYPES OF VALIDITY QUESTIONS
Conclusion: Is there a relationship between cause and effect? Internal: Is the relationship causal? Construct: Can we generalize to the constructs? External: Can we generalize to other persons, places, times?

18 CONCLUSION VALIDITY Conclusion validity is the degree to which conclusions we reach about relationships are reasonable, credible or believable. Relevant for both quantitative and qualitative research studies. Is there a relationship in your data or not? In evaluating any experiment, three decisions about covariation have to be made with the sample data on hand:
1. Is the study sensitive enough to permit reasonable statements about covariation? Statistical power/sample size.
2. If it is sensitive enough, is there any reasonable evidence from which to infer that the presumed cause and effect covary? Given a specified alpha level and the obtained variances.
3. If there is such evidence, how strongly do the two variables covary? Effect sizes (a sketch follows).
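For the third decision, a standardized effect size such as Cohen's d quantifies how strongly the two variables covary. A minimal sketch with made-up exam scores (the numbers are purely illustrative):

from statistics import mean, stdev

video = [78, 85, 90, 72, 88, 81, 79, 84]     # hypothetical exam scores, videostream group
lecture = [70, 75, 82, 68, 77, 74, 71, 80]   # hypothetical exam scores, lecture group

# Cohen's d: mean difference divided by the pooled standard deviation
n1, n2 = len(video), len(lecture)
s_pooled = (((n1 - 1) * stdev(video) ** 2 + (n2 - 1) * stdev(lecture) ** 2)
            / (n1 + n2 - 2)) ** 0.5
d = (mean(video) - mean(lecture)) / s_pooled
print(round(d, 2))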

19 STATISTICAL CONCLUSION VALIDITY
Basing conclusions on proper use of statistics Reliability of measures Reliability of implementation Type I Errors and Statistical Significance Type II Errors and Statistical Power Fallacies of Aggregation Correlational studies are plagued with causal ambiguity – correlation does not imply causation.

20 STATISTICAL CONCLUSION VALIDITY
Interaction and non-linearity. Random irrelevancies in the experimental setting. Random heterogeneity of respondents.
Random irrelevancies of an experimental setting other than the treatment will undoubtedly affect scores on the dependent variable and will inflate error variance. Control by choosing settings free from extraneous sources of variation, or by choosing experimental procedures that focus participants' attention on the treatment and lower the salience of environmental variables; this is very difficult to do. Measure the anticipated sources of extraneous variance common to all the treatment groups as validly as possible, and monitor and measure variables that add to the error variance so they can be included and controlled for in the statistical analysis.
Random heterogeneity of participants: respondents in any treatment group may differ on factors correlated with the dependent variables, which likewise inflates error variance.

21 VIOLATED ASSUMPTIONS OF STATISTICAL TESTS
The particular assumptions of a statistical test must be met if the results of the analysis are to be meaningfully interpreted. Levels of measurement. Example: Analysis of Variance (ANOVA). The assumptions of the chosen test have to be known and, when possible, tested in the data at hand.

22 LEVELS OF MEASUREMENT A hierarchy is implied in the idea of level of measurement. At lower levels, assumptions tend to be less restrictive and data analyses tend to be less sensitive. At each level up the hierarchy, the current level includes all of the qualities of the one below it and adds something new. In general, it is desirable to have a higher level of measurement (interval or ratio) rather than a lower one (nominal or ordinal).
Nominal: categorical; values uniquely name the attribute; no ordering implied.
Ordinal: attributes can be rank-ordered; distances between values do not have meaning.
Interval: the interval between values is interpretable; averages can be computed.
Ratio: the zero point is meaningful; a meaningful fraction/ratio can be constructed.

23 STATISTICAL ANALYSIS AND LEVEL OF MEASUREMENT
ANALYSIS OF VARIANCE ASSUMPTIONS
Independence of cases: each person contributes only one score to the analysis.
Normality: in each of the groups, the data are continuous and normally distributed. The Kolmogorov-Smirnov or the Shapiro-Wilk test may be used to check normality.
Equal variances (homoscedasticity): the variance of the data should be the same in each group. Levene's test for homogeneity of variances is typically used to check this.
If assumptions are violated, use the Kruskal-Wallis test, a nonparametric alternative that does not rely on an assumption of normality, or a method such as Welch's ANOVA that is robust to violations of the equal-variances assumption. A sketch of these checks follows.
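A sketch of these checks with SciPy (made-up scores for three groups; assumes scipy is installed). Each test named on the slide has a direct scipy.stats equivalent:

from scipy import stats

g1 = [82, 75, 90, 68, 77, 85, 79]
g2 = [70, 64, 72, 61, 69, 75, 66]
g3 = [88, 92, 85, 90, 79, 94, 87]

# Shapiro-Wilk: check each group for normality
for g in (g1, g2, g3):
    print("Shapiro-Wilk p =", stats.shapiro(g).pvalue)

# Levene's test: homogeneity of variances across groups
print("Levene p =", stats.levene(g1, g2, g3).pvalue)

# If the assumptions hold, one-way ANOVA; otherwise fall back to Kruskal-Wallis
print("ANOVA p =", stats.f_oneway(g1, g2, g3).pvalue)
print("Kruskal-Wallis p =", stats.kruskal(g1, g2, g3).pvalue)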

24 RELIABILITY Measures (tests and scales) of low reliability may not register true changes. Reliability of treatment implementation: when treatments/procedures are not administered in a standard fashion, error variance is increased and the chance of detecting true differences decreases.
Measures of low reliability (conceptualized as stability or test-retest reliability) cannot be depended upon to register true changes. Unreliability inflates standard errors of estimates, and standard errors play a crucial role in inferring statistical differences between groups. Some ways to control for unreliability are: using longer tests in which items have high intercorrelations; using more aggregated units, i.e., groups instead of individuals, since a group mean will be more stable; and, where justified, using corrections for attenuation.
Reliability of treatment implementation: the way a treatment is implemented may differ from one person to another if different persons are responsible for implementing the treatment. There may also be differences from occasion to occasion even when the same researcher implements the treatment. Lack of standardization both within and between persons will inflate error variance and decrease the chance of detecting true differences.

25 STATISTICAL DECISION vs. TRUE POPULATION STATUS
Reject H0 when H0 is true: Type I error (alpha). Reject H0 when H0 is false: correct decision (power, 1 - beta).
Fail to reject H0 when H0 is true: correct decision (1 - alpha). Fail to reject H0 when H0 is false: Type II error (beta).
We know the probability of a Type I error because we set alpha (by convention, usually .05); 1 - alpha is the probability of making a correct decision when the null hypothesis is true, also referred to as the confidence level. The probability of making a Type II error is beta. Power is the probability of correctly rejecting the null hypothesis, that is, the probability of rejecting the null hypothesis when it is really false. Power is another way of talking about Type II errors: it is defined in terms of beta as 1 - beta.

26 TYPE I ERRORS AND STATISTICAL SIGNIFICANCE
A Type I error is made when a researcher concludes that there is a relationship when there really isn't (false positive). If the researcher rejects H0 because p ≤ .05, ask: If data are from a random sample, is the significance level appropriate? Are significance tests applied to a priori hypotheses? Fishing and the error rate problem: the probability of making a Type I error on a particular comparison in a given experiment increases with the number of comparisons to be made in that experiment. Example: analysis of scales item by item. A sketch follows.
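The error rate compounds quickly. This sketch computes the familywise error rate for a set of independent comparisons and applies the simple Bonferroni remedy (the number of comparisons is illustrative):

alpha = 0.05
m = 20   # number of item-by-item comparisons

# Probability of at least one Type I error across m independent tests
familywise = 1 - (1 - alpha) ** m
print(round(familywise, 2))   # about 0.64 with 20 comparisons

# Bonferroni correction: test each comparison at alpha / m
print(alpha / m)              # 0.0025 per comparison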

27 TYPE II ERRORS AND STATISTICAL POWER
A Type II error is made when a researcher concludes that there is not a relationship and there really is (False negative) If the researcher fails to reject H0 because p > .05, ask: Has the researcher used statistical procedures of adequate power? Does failure to reject H0 merely reflect a small sample size? The lower the power of the statistical test, the lower the likelihood of capturing an effect which does in fact exist.

28 FACTORS THAT INFLUENCE POWER AND STATISTICAL INFERENCE
Alpha level Effect size Directional vs. Non-directional test Sample size Unreliable measures Violating the assumptions of a statistical test
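A sketch of how these factors trade off, using the power routines in statsmodels (assuming that package is available; the effect size and targets are illustrative):

from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# Sample size per group needed to detect a medium effect (d = 0.5)
# with alpha = .05 and power = .80, two-sided test
n = power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                               alternative="two-sided")
print(round(n))       # about 64 per group

# A directional (one-sided) test needs fewer subjects for the same power
n_dir = power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                   alternative="larger")
print(round(n_dir))   # about 50 per group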

29 RANDOM IRRELEVANCIES Features of the experimental setting other than the treatment affect scores on the dependent variable Controlled by choosing settings free from extraneous sources of variation Measure anticipated sources of variance to include in the statistical analysis

30 RANDOM HETEROGENEITY OF RESPONDENTS
Participants can differ on factors that are correlated with the major dependent variables Certain respondents will be more affected by the treatment than others Minimized by Blocking variables and covariates Within subjects designs

31 STRATEGIES TO REDUCE ERROR TERMS
Subjects as own control Homogeneous samples Pretest measures on the same scales used for measuring the effect Matching on variables correlated with the post-test Effects of other variables correlated with the post-test used as covariates Increase the reliability of the dependent variable measures

32 STRATEGIES TO REDUCE ERROR TERMS
Estimates of the desired magnitude of a treatment effect should be elicited before research begins Absolute magnitude of the treatment effect should be presented so readers can infer whether a statistically reliable effect is practically significant.

33 INTERNAL VALIDITY Internal validity has to do with defending against sources of bias arising in a research design. To what degree is the study designed such that we can infer that the educational intervention caused the measured effect? An internally valid study will minimize the influence of extraneous variables. Example: Did participation in a series of Webinars on TB in children change the practice of physicians?
Internal validity has to do with the true causes of the outcomes observed in your study. According to Campbell and Stanley, internal validity is the basic minimum without which an experiment is uninterpretable. Strong internal validity means that you not only have reliable measures of your independent and dependent variables but a strong justification that causally links your independent variables to your dependent variables. At the same time, you are able to rule out extraneous variables or alternative, often unanticipated, causes for your dependent variables. Internal validity is about causal control.

34 THREATS TO INTERNAL VALIDITY
THREATS TO INTERNAL VALIDITY: History, Maturation, Testing, Instrumentation, Statistical Regression, Selection, Mortality, Interactions with Selection.

35 INTERNAL VALIDITY: THREATS IN SINGLE GROUP REPEATED MEASURES DESIGNS
History Maturation Testing Instrumentation Mortality Regression

36 THREATS TO INTERNAL VALIDITY HISTORY
The observed effects may be due to, or be confounded with, nontreatment events occurring between the pretest and the post-test. History is a threat to conclusions drawn from longitudinal studies: the greater the time period between measurements, the greater the risk of a history effect. History is not a threat in cross-sectional designs conducted at one point in time. Some kind of event occurred during the study period, and it is reactions to these events that caused the observed outcomes; sometimes this is a medical event or a political or historical event. In laboratory research the history effect is controlled by isolating respondents from outside influences or by choosing dependent variables that could not plausibly have been affected by outside forces. Unfortunately, these techniques are not available to applied researchers.

37 THREATS TO INTERNAL VALIDITY MATURATION
Invalid inferences may be made when the maturation of participants between measurements has an effect and this maturation is not the research interest. Internal (physical or psychological) changes in participants unrelated to the independent variable – older, wiser, stronger, more experienced. Maturation effects are especially important for studies using children and youth. For example some studies have found that most college students pull out of a depression within six months even if they receive no treatment.

38 THREATS TO INTERNAL VALIDITY TESTING
Reactivity as a result of testing: the effects of taking a test on the outcomes of a second test. Practice. Learning. Improved scores on the second administration of a test can be expected even in the absence of intervention, due to familiarity. Giving the pretest itself may affect the outcomes of the second test. Part of a student's performance on assessment tests depends on familiarity with the format, and it has been shown that IQ tests taken a second time result in a 3-5 point increase over the first time. In the social sciences, the process of measuring may change that which is being measured. The reactive effect occurs when the testing process itself leads to the change in behavior, rather than being a passive record of behavior.

39 THREATS TO INTERNAL VALIDITY INSTRUMENTATION
Changes in instruments, observers or scorers which may produce changes in outcomes Observers/raters, through experience, become more adept at their task Ceiling and floor effects Longitudinal studies

40 THREATS TO INTERNAL VALIDITY STATISTICAL REGRESSION
Test-retest scores tend to drift systematically toward the mean rather than remain stable or become more extreme. Regression effects may obscure treatment effects or developmental changes. Most problematic when participants are selected because they are extreme on the classification variable of interest. Regression toward the mean is especially likely when you study extreme groups: persons with extreme scores will often fall back to the average, or regress to the mean, on a second administration. Statistical regression:
Operates to increase gain scores among low pretest scorers, since this group's pretest scores are more likely to have been depressed by error (students scoring at the bottom of the class typically improve their scores at least a little when they retake the test).
Operates to decrease change scores among persons with high pretest scores, since their pretest scores are likely to have been inflated by error (students with perfect scores may miss an item the second time around).
Does not affect difference scores among scorers at the center of the pretest distribution, since this group is likely to contain as many units whose pretest scores are inflated by error as units whose pretest scores are deflated by it.
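A small simulation makes the artifact visible: select the lowest pretest scorers from made-up data and their retest mean drifts upward with no treatment at all (all numbers invented for illustration):

import random

random.seed(0)
N = 1000
true_ability = [random.gauss(70, 8) for _ in range(N)]

def administer_test(abilities):
    # Observed score = true ability + random measurement error
    return [a + random.gauss(0, 5) for a in abilities]

pretest = administer_test(true_ability)
posttest = administer_test(true_ability)

# Select the 100 lowest pretest scorers (an "extreme group")
low = sorted(range(N), key=lambda i: pretest[i])[:100]
print(sum(pretest[i] for i in low) / 100)    # depressed partly by error
print(sum(posttest[i] for i in low) / 100)   # regresses up toward the mean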

41 THREATS TO INTERNAL VALIDITY MORTALITY
Differences in drop-out rates/attrition across conditions of the experiment Makes “before” and “after” samples not comparable This selection artifact may become operative in spite of random assignment Major threat in longitudinal studies When subjects discontinue the study and this occurs more in certain conditions than others, we do not know how to causally interpret the results because we don’t know how subjects who discontinued participation differed from those who completed it.

42 INTERNAL VALIDITY: MULTIPLE GROUP THREATS
Selection. Interactions with Selection: Selection-History, Selection-Maturation, Selection-Testing, Selection-Instrumentation, Selection-Mortality, Selection-Regression.
Selection bias is a threat to internal validity that can occur when nonrandom procedures are used to assign participants to treatments/groups, or when random assignment fails to balance out differences among subjects across the different conditions of the experiment. When subjects can select their own treatments, we do not know whether the intervention or a pre-existing factor of the subject caused the outcomes we observed. Selection interactions are a family of threats to internal validity produced when a selection threat combines with one or more of the other threats to internal validity. When a selection threat is already present, other threats can affect some experimental groups but not others. Selection-history can result from the various treatment groups coming from different settings, such that each group experiences a unique local history that might affect outcome variables. Selection-maturation results when experimental groups are maturing at different speeds. Selection-instrumentation occurs when different groups score at different mean positions on a test whose intervals are not equal; differential ceiling or floor effects arise when an instrument cannot register any more true gain in one of the groups, or when more scores from one group than another are clustered at the lower end of the scale.

43 THREATS IN DESIGNS WITH GROUPS: SOCIAL INTERACTION THREATS
Compensatory equalization of treatments Compensatory rivalry Resentful demoralization Treatment imitation or diffusion Unintended treatments

44 EXTERNAL VALIDITY The extent to which the results of a study can be generalized. Population validity: generalizations related to other groups of people. Ecological validity: generalizations related to other settings, times, contexts, etc.
External validity addresses the ability to generalize your study to other people and other situations. To have strong external validity you need a probability sample of subjects or respondents drawn using chance methods from a clearly defined population. Ideally you will have a good sample of groups and a sample of measurements and situations. External validity is the degree to which the conclusions in your study would hold for other persons in other places and at other times. Questions raised include: Are findings using a scale to measure the construct of interest consistent across samples? To what population does the researcher wish to generalize his/her conclusions? Is there something unique about the study's participants, the place where they live, the setting involved, or the times of the study that would prevent generalization? When a sample of observations is non-random in unknown ways, the likelihood of external validity is low; this is the case when we use convenience samples.
Ecological validity is not to be confused with the ecological fallacy. Rather, ecological validity has to do with whether subjects are studied in their natural environment or, say, in a lab setting: the extent to which the results of an experiment can be generalized from the set of environmental conditions created by the researcher to other environmental conditions (settings and conditions). Explicit description of the experimental treatment also matters: if the researcher fails to adequately describe how he or she conducted a study (not sufficiently described for others to replicate), it is difficult to determine whether the results are applicable to other settings.

45 THREATS TO EXTERNAL VALIDITY
Pre-test treatment interaction. Multiple treatment interference. Interaction of selection and treatment. Interaction of setting and treatment. Interaction of history and treatment. Experimenter effects.
Pretest sensitization (the pretest sets the stage): a treatment might only work if a pretest is given. Because they have taken a pretest, the subjects may be more sensitive to the treatment; had they not taken a pretest, the treatment would not have worked.
Multiple-treatment interference (catalyst effect): if a researcher were to apply several treatments, it is difficult to determine how well each of the treatments would work individually. It might be that only the combination of the treatments is effective.
Interaction of selection and treatment: people who agree to participate in a particular study may differ substantially from those who refuse, so results obtained on the former may not be generalizable to the latter.
Interaction of setting and treatment: results obtained in one setting may not be obtained in another.
Interaction of history and treatment (...to everything there is a time...): researchers should be cautious about generalizing not only to other populations but also to other time periods. As time passes, the conditions under which treatments work change. Also, causal relationships obtained on a particular day (e.g., 9/11 as an extreme example) may not hold up under more mundane circumstances.
Experimenter effect (it only works with this experimenter): the treatment might have worked because of the person implementing it; given a different person, the treatment might not work at all. There are also demand effects, in which subjects follow orders or cooperate in ways they would be unlikely to do in their daily lives.

46 THREATS TO EXTERNAL VALIDITY
Reactive arrangements. Artificial environment. Hawthorne effect. Halo effect. John Henry effect. Placebo effect. Participant-researcher interaction. Novelty effect.
Reactivity refers to changes in the subjects' behavior simply because they know they are being studied: faking good and social desirability.
Artificial environment (lab experiments with human subjects): removing subjects from their environment may lead them to display behavior different from their "true" behavior. The greater the difference in cues and influences between the natural environment and the measurement setting, the greater the potential bias.
Hawthorne effect (experimenter expectation): do the expectations or actions of the investigator contaminate the outcomes? Named after famous studies at Western Electric's Hawthorne plant, where work productivity improvements were found to reflect researcher attention, not interventions like better lighting.
Hawthorne effect (attention causes differences): subjects perform differently because they know they are being studied. "...External validity of the experiment is jeopardized because the findings might not generalize to a situation in which researchers or others who were involved in the research are not present."
The halo effect is a variation of the Hawthorne effect that occurs when participants know they are part of the experimental group, and their belief that they are part of a special group pushes them to improve performance. Halo effect is also used to describe rater effects associated with overrater or underrater error.
Another variation is the opposite of the halo effect: the John Henry effect, which takes its name from the legendary railroad steel driver who exhausted himself to death. Participants know they are part of the control group and make an extra effort to improve performance.
Placebo effect: participants receiving treatment believe the treatment will have an effect (use a single-blind approach).
Participant-researcher interaction effect: gender issues, age, etc.
Novelty and disruption effect (anything different makes a difference): a treatment may work because it is novel and the subjects respond to the uniqueness rather than the actual treatment. The opposite may also occur: the treatment may not work because it is unique, but given time for the subjects to adjust to it, it might have worked.

47 SELECTING A RESEARCH DESIGN
Lou Ann Cooper, PhD Director of Program Evaluation and Medical Education Research University of Florida College of Medicine

48 What If…. We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)… Our inferences are based on general expectations of what data would have been had X not occurred.

49 PRE-EXPERIMENTAL DESIGNS
One Group Posttest Design: X O. X = implementation of the treatment; O = measurement of the participants in the experimental group. Also referred to as a "One-Shot Case Study."
The first three designs are weak but frequently used designs in the social sciences. While they may be useful for generating ideas, they generally do not permit us to make causal inferences because they fail to rule out a number of plausible alternative interpretations. The first is the one group, posttest-only design. It involves making observations only on persons who have received an intervention, and only after they have received it. Deficiencies: 1. Lack of a pretest leaves us unable to easily infer that the treatment is related to any kind of change. 2. Lack of a control group, without which it is difficult to conceptualize the relevant threats and measure them.

50 What If…. We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)… Pre-experimental: One Group Posttest Design. Our inferences are based on general expectations of what data would have been had X not occurred. Most all threats are present. For Discussion: What are the threats to validity?

51 SOURCES OF INVALIDITY
[Table rating each threat for this design: + = controlled, – = not controlled, ? = possible concern.]
Internal: History (–), Maturation, Testing, Instrumentation, Regression, Mortality, Selection, Selection Interactions.
External: Interaction of Testing and X, Interaction of Selection and X, Reactive Arrangements, Multiple X Interference.
History, maturation, mortality and selection are all threats to internal validity. The interaction of selection and the experimental variable is a threat to external validity.

52 What If…. We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)… Pre-experimental: One Group Posttest Design. Our inferences are based on general expectations of what data would have been had X not occurred. Most all threats are present.

53 PRE-EXPERIMENTAL DESIGNS
Comparison Group Posttest Design:
X O
    O
Static group comparison; ex post facto research; no pretest observations. Sometimes a treatment is implemented before the researcher can plan for it, and the research design is worked out after the treatment has begun. The most obvious flaw is the absence of pretests, which means that any posttest differences between the groups can be attributed either to a treatment effect or to selection differences between the groups. The plausibility of selection differences in research with nonequivalent groups usually renders this design uninterpretable.

54 What If…. We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)… Pre-experimental: One Group Posttest Design. Our inferences are based on general expectations of what data would have been had X not occurred. Most all threats are present. For Discussion: What if we compare test scores for these students with last year's scores (assume last year had no streaming video)?

55 SOURCES OF INVALIDITY
[Table rating each threat for this design: + = controlled, – = not controlled, ? = possible concern.]
Internal: History (+), Maturation (?), Testing, Instrumentation, Regression, Mortality, Selection, Selection Interactions.
External: Interaction of Testing and X, Interaction of Selection and X, Reactive Arrangements, Multiple X Interference.
Threats to validity include: Selection – the groups selected may actually differ prior to any treatment. Mortality – the difference between O1 and O2 may be due to the dropout rate of subjects from a specific experimental group, which would cause the groups to be unequal. Interaction of selection and maturation. Interaction of selection and the experimental variable.

56 What If…. We gave 150 pharmacy students (all are distance campus) access to streaming video and measured their performance on a written exam both before and after the intervention (measures achievement of learning outcomes)…
History: other things could change during "X". Example: the streaming server was down during the study interval, making it difficult for students to access the videos whenever they wanted.
Maturation: do students mature as learners as they "settle down" during the semester?
Testing: what if the pre-test caused the students to think more about the content as they watched the videos?
Instrumentation: what if the exam has short-answer questions that have to be hand-graded, and the faculty member is exhausted at the end of the semester and therefore less accurate in grading?
Statistical regression: what about regression to the mean?

57 PRE-EXPERIMENTAL DESIGNS
One Group Pretest/Posttest Design: O X O. Not a true experiment. Because participants serve as their own control, results may be less biased. Use of this design is widespread, but there are several weaknesses:
Observed change, or gain scores on a knowledge test, may be due to history: other events that influence the posttest measure. To rule this out, the researcher has to make the case either that this is implausible in the context of a particular study or that such events are plausible but did not operate.
Regression toward the mean: the most common regression artifact arises when a special program is given only to those with extreme scores on the pretest, producing a spurious improvement.
Maturation.
Testing: exposure to an outcome measure at one time can affect performance at another. In education, the pretest can be an impetus for a student to learn the correct answers to items, thus improving their performance on the posttest regardless of the impact of the intervention.

58 What If…. We gave 150 pharmacy students (all are distance campus) access to streaming video and measured their performance on a written exam both before and after the intervention (measures achievement of learning outcomes)…
History: other things could change during "X". Example: the streaming server was down during the study interval, making it difficult for students to access the videos whenever they wanted.
Maturation: do students mature as learners as they "settle down" during the semester?
Testing: what if the pre-test caused the students to think more about the content as they watched the videos?
Instrumentation: what if the exam has short-answer questions that have to be hand-graded, and the faculty member is exhausted at the end of the semester and therefore less accurate in grading?
Statistical regression: what about regression to the mean?
For Discussion: What are the threats to validity (what are the plausible hypotheses that could explain any difference)?

59 SOURCES OF INVALIDITY
[Table rating each threat for this design: + = controlled, – = not controlled, ? = possible concern.]
Internal: History (–), Maturation, Testing, Instrumentation, Regression (?), Mortality (+), Selection, Selection Interactions.
External: Interaction of Testing and X, Interaction of Selection and X, Reactive Arrangements (?), Multiple X Interference.
Uncontrolled threats to validity include: History – between O1 and O2 many events may have occurred apart from X to produce the differences in outcomes; the longer the time lapse between observations, the more likely history becomes a threat. Maturation – between O1 and O2 students may have grown older or internal states may have changed. Testing – the process of measuring may change what is being measured (reactive effects of testing). Instrumentation. Statistical regression. Interaction of selection and maturation. Interaction of testing and the experimental variable. Interaction of selection and the experimental variable.

60 What If…. We could randomize all 300 pharmacy students to the following groups: Group 1: access only streaming video. Group 2: attend lectures. For each group, administer both a pre-test and a post-test. Experimental Design: "Pretest-Posttest Control Group Design"

61 TRUE EXPERIMENTAL DESIGNS
Pretest/Posttest Design with Control Group and Random Assignment:
R O X O
R O    O
Measurement of pre-existing differences. Controls most threats to internal validity. In a true experiment, whether laboratory, field or simulation, subjects are randomly assigned to treatment groups. It is this randomization that makes true experiments so strong in internal validity and typically allows us to make relatively strong inferences about causality. Random assignment means that, on average, all your treatment groups are about the same at the beginning of a study. That said, without a control or comparison group your study is open to criticism and alternative causal explanations. This design also has two other important features: a pretest and a posttest. The pretest allows you to double-check that your participants are pretty much alike at the beginning of the study. Because you have both pretest and posttest, you can also assess the level of change.

62 What If…. We could randomize all 300 pharmacy students to the following groups: Group 1: access only streaming video. Group 2: attend lectures. For each group, administer both a pre-test and a post-test. Experimental Design: "Pretest-Posttest Control Group Design"
R O X O
R O    O
All threats to internal validity are controlled. For Discussion: What are the threats to validity (what are the plausible hypotheses that could explain any difference)?

63 SOURCES OF INVALIDITY
[Table rating each threat for this design: + = controlled, – = not controlled, ? = possible concern.]
Internal: History (+), Maturation, Testing, Instrumentation, Regression, Mortality, Selection, Selection Interactions.
External: Interaction of Testing and X, Interaction of Selection and X (?), Reactive Arrangements, Multiple X Interference.
This standard design may still pose some threats to validity. History is controlled in that the general history events which may have contributed to the O1 and O2 effects would also produce the O3 and O4 effects; this is true if both groups are tested simultaneously, and intrasession history must be taken into consideration. Maturation and testing should both be manifested equally in treatment and control groups. Instrumentation is controlled where conditions control for intrasession history, especially where fixed tests are used; when observers or interviewers are used there is potential for problems, so there should be enough raters/observers for random assignment to condition, and blinding observers to the purpose of the experiment will also help. Regression is controlled, regardless of the extremity of scores or characteristics, if the treatment and control groups are randomly assigned from the same extreme pool; if so, the groups should regress similarly. Selection is controlled by random assignment. Mortality is said to be controlled in this design, but unless the mortality rate is equal in treatment and control groups it is not possible to state with certainty that mortality did not contribute to experimental results. Interaction of testing and X: because the interaction between taking a pretest and the treatment itself may affect the results of the experimental group, it may be desirable to use a design that does not use a pretest. Interaction of selection and X: although selection is controlled by randomly assigning subjects to groups, there remains a possibility that the effects demonstrated hold true only for the population from which the experimental and control groups were selected. Reactive arrangements: this refers to the artificiality of the experimental setting and the subject's knowledge that he is participating in an experiment. Simply being measured or observed during the pretest may sensitize some subjects, and they will behave differently as a result. In addition, a pretest may interact with an experimental treatment to heighten the effect of the experimental intervention more than it would have ordinarily.

64 What If…. We could randomize all 300 pharmacy students to the following groups: Group 1: access only streaming video and post-test. Group 2: attend lectures and post-test. Experimental Design: "Post-test only control group"

65 TRUE EXPERIMENTAL DESIGNS
Posttest Only Control Group:
R X O
R    O
Because subjects are randomly assigned, the groups can be assumed equivalent at the outset even without a pretest, so posttest differences can reasonably be attributed to the treatment. The absence of a pretest does mean there is no direct check of baseline equivalence and no measure of change; it is in nonequivalent-group research without random assignment, as in the static group comparison, that the absence of pretests usually renders posttest differences uninterpretable.

66 What If…. We could randomize all 300 pharmacy students to the following groups: Group 1: access only streaming video and post-test. Group 2: attend lectures and post-test. Experimental Design: "Post-test only control group". For Discussion: What have we lost by not using a pre-test (as compared to the experimental randomized pre-test and post-test design)?

67 SOURCES OF INVALIDITY
[Table rating each threat for this design: + = controlled, – = not controlled, ? = possible concern.]
Internal: History (+), Maturation, Testing, Instrumentation, Regression, Mortality, Selection, Selection Interactions.
External: Interaction of Testing and X (+), Interaction of Selection and X (?), Reactive Arrangements, Multiple X Interference.

68 What If…. We could randomize all 300 pharmacy students to the following groups: Group 1: pre-test, access only streaming video, and post-test. Group 2: pre-test, attend lectures, and post-test. Group 3: access only streaming video and post-test only. Group 4: attend lectures and post-test only. Experimental Design: "Solomon 4-Group Design"

69 TRUE EXPERIMENTAL DESIGNS
Solomon Four-Group Comparison:
R O X O
R O    O
R    X O
R       O
One of the strongest experimental designs with respect to internal validity is the Solomon Four-Group Design. There are four randomized groups of subjects. The first group receives a pretest, the experimental treatment and a posttest. The second group receives a pretest and a posttest but no treatment (or a different treatment). The third group is identical to the first except that it does not receive the pretest, and the fourth group receives the posttest only. More expensive: it requires more subjects and resources.

70 What If…. We could randomize all 300 pharmacy students to the following groups: Group 1: pre-test, access only streaming video, and post-test. Group 2: pre-test, attend lectures, and post-test. Group 3: access only streaming video and post-test only. Group 4: attend lectures and post-test only. For Discussion: What have we gained by having 4 groups (especially groups 3 and 4)? Experimental Design: "Solomon 4-Group Design"

71 What If…. It is NOT feasible to use randomization. What if we were to have the following groups: Group 1 (all distant campuses): access only streaming video. Group 2 (GNV campus): attend lectures. For each group, administer both a pre-test and a post-test. Quasi-Experimental Design: "Nonequivalent control group"
O X O
O    O
Controls: History, Maturation, Testing, Instrumentation, Selection, Mortality. Regression: ? Interaction of Selection and Maturation: not controlled (–). External validity is weak.

72 QUASI-EXPERIMENTAL DESIGNS
Nonequivalent Control Group:
O X O
O    O
Pre-existing differences can be measured. Controls some threats to validity. A true experiment is simply not always possible, yet investigators still want to make causal statements. If your study has different levels of treatments and people or groups are assigned to those treatments without random assignment, you have a quasi-experiment. Some treatment groups are initially formed on the basis of performance or other variables that are not experimentally induced. Self-selection into treatment groups is also common in quasi-experiments such as program evaluation. The difficulty in quasi-experiments is trying to find out just how similar the groups were at the beginning, before any treatment began. Sometimes, in fact, groups are created on the basis of dissimilarities such as ability, so we know the groups are different at the beginning. If we have prior information about the individuals who comprise the groups in the different treatments, we may at least try to institute statistical controls for those variables: background information (grades, test scores, personality tests) collected before the study ever began; some kind of pretest measures; demographic information such as educational level, occupation, income; supplemental information from other people. There is still the possibility that the groups may differ on variables that we haven't measured, such as differences in motivation or confidence. It could be these differences, instead of a difference in ability, that were the true causes of observed differences. The control group is nonequivalent because we did not use random assignment, and therefore we cannot assume that on average the groups are the same, or equivalent, to begin with.

73 What If…. It is NOT feasible to use randomization. What if we were to have the following groups: Group 1 (all distant campuses): access only streaming video. Group 2 (GNV campus): attend lectures. For each group, administer both a pre-test and a post-test. Quasi-Experimental Design: "Nonequivalent control group"
O X O
O    O
Controls: History, Maturation, Testing, Instrumentation, Selection, Mortality. Regression: ? Interaction of Selection and Maturation: not controlled (–). External validity is weak. For Discussion: What have we "lost" by not randomizing?

74 SOURCES OF INVALIDITY
[Table rating each threat for this design: + = controlled, – = not controlled, ? = possible concern.]
Internal: History (–), Maturation, Testing, Instrumentation, Regression, Mortality, Selection, Selection Interactions.
External: Interaction of Testing and X, Interaction of Selection and X, Reactive Arrangements, Multiple X Interference.

75 QUASI-EXPERIMENTAL DESIGNS
Time Series: O O O O X O O O O. In a time series design you have several observations over time. While you may have some type of experimental intervention, often nature does the experimenting for you: there is a series of pre-intervention measures, which you continue following after the intervention. In this design the data are often extant, especially for the observations before the intervention. In all likelihood you won't have a control group, let alone a randomized control group. This makes it very difficult to tease out specifically what it was about the intervention that caused the observed outcomes.

76 QUASI-EXPERIMENTAL DESIGNS
Counterbalanced Design:
O X1 O X2 O X3 O
O X3 O X1 O X2 O
O X2 O X3 O X1 O
Multiple treatments and tests for all groups. Number of groups = number of treatments. Can be employed when a pretest is not possible and intact groups must be used. Exposure to one treatment may influence subsequent treatments: order effects. In our example:
O video O lecture O
O lecture O video O

77 MEASUREMENT VALIDITY: SOURCES OF EVIDENCE

78 CLASSIC VIEW OF TEST VALIDITY
Traditional triarchic view of validity: Content; Criterion (Concurrent, Predictive); Construct. Tests were described as "valid" or "invalid". Reliability was considered a separate test trait.

79 MODERN VIEW OF VALIDITY
Scientific evidence needed to support test score interpretation Standards for Educational & Psychological Testing (1999) Cronbach, Messick, Kane Some theory, key concepts, examples Reliability as part of validity

80 VALIDITY: DEFINITIONS
“A proposition deserves some degree of trust only when it has survived serious attempts to falsify it.” (Cronbach, 1980) According to the Standards, validity refers to “the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores.” “Validity is an integrative summary.” (Messick, 1995) “Validation is the process of building an argument supporting interpretation of test scores.” (Kane, 1992) Over the last half century, the concept of validation has evolved from establishing correlation with a dependent variable to the idea that researchers must validate each scale, test, or instrument measuring a construct and do so in multiple ways which taken together form the whole of what validity is.

81 WHAT IS A CONSTRUCT? Constructs are psychological attributes, hypothetical concepts A defensible construct has A theoretical basis Clear operational definitions involving measurable indicators Demonstrated relationships to other constructs or observable phenomena A construct should be differentiated from related theoretical constructs as well as from methodological irrelevancies Construct validity is about correspondence between your concepts and the actual measurements that you use. This makes clear that we must have clear conceptual definitions of our variables.

82 THREATS TO CONSTRUCT VALIDITY (Cook & Campbell)
Inadequate preoperational explication of constructs. Mono-operation bias. Mono-method bias. Interaction of different treatments. Interaction of testing and treatment. Restricted generalizability across constructs.
Inadequate preoperational explication of constructs: a precise explication of constructs is vital for the linkage between treatments and outcomes. Some possible solutions: think through your concepts better; use methods (e.g., concept mapping) to articulate your concepts; get experts to critique your operationalizations.
Mono-operation bias pertains to the independent variable, cause, program or treatment in your study. If you only use a single version of a program in a single place at a single point in time, you may not be capturing the full breadth of the concept of the program.
Mono-method bias refers to your measures or observations, not to your programs or causes. With only a single version of a self-esteem measure, you can't provide much evidence that you're really measuring self-esteem. Solution: try to implement multiple measures of key constructs and try to demonstrate (perhaps through a pilot or side study) that the measures you use behave as you theoretically expect them to.
Interaction of different treatments occurs when the group in your study is also likely to be involved simultaneously in several other programs designed to have similar effects. Can you really label the program effect as a consequence of your program?
Interaction of testing and treatment: does testing or measurement itself make the groups more sensitive or receptive to the treatment? If it does, then the testing is in effect a part of the treatment, inseparable from the effect of the treatment. This is a labeling issue (and, hence, a concern of construct validity) because you want to use the label "program" to refer to the program alone, but in fact it includes the testing.
Restricted generalizability across constructs is the "unintended consequences" threat to construct validity. It reminds us that we have to be careful about whether our observed effects (Treatment X is effective) would generalize to other potential outcomes.

83 THREATS TO CONSTRUCT VALIDITY (Cook & Campbell)
Confounding constructs. Confounding levels of constructs. Hypothesis guessing within experimental conditions. Evaluation apprehension. Researcher expectancies.
Confounding constructs and levels of constructs: like the other construct validity threats, this is essentially a labeling issue; your label is not a good description for what you implemented.
Hypothesis guessing: most people don't just participate passively in a research project. They are trying to figure out what the study is about, "guessing" at its real purpose, and they are likely to base their behavior on what they guess, not just on your treatment.
Evaluation apprehension: many people are anxious about being evaluated. If their apprehension makes them perform poorly (and not your program conditions), then you certainly can't label that as a treatment effect. Another form of evaluation apprehension concerns the human tendency to want to "look good" or "look smart"; if, in their desire to look good, participants perform better (and not as a result of your program), then you would be wrong to label this as a treatment effect. In both cases, the apprehension becomes confounded with the treatment itself, and you have to be careful about how you label the outcomes.
Researcher expectancies: the researcher can bias the results of a study in countless ways, both consciously and unconsciously. Sometimes the researcher can communicate what the desired outcome for a study might be (and participants' desire to "look good" leads them to react that way); for instance, the researcher might look pleased when participants give a desired answer. If this is what causes the response, it would be wrong to label the response as a treatment effect.

84 SOURCES OF VALIDITY EVIDENCE
1. Test Content: task representation; construct domain.
2. Response Process: item psychometrics.
3. Internal Structure: test psychometrics.
4. Relationships with Other Variables: correlations; test-criterion relationships; convergent and divergent data.
5. Consequences of Testing: social context.
Standards for Educational and Psychological Testing, 1999

85 ASPECTS OF VALIDITY: CONTENT
Content validity refers to how well elements of the test or scale relate to the content domain. Content relevance. Content representativeness. Content coverage. Systematic analysis of what the test is intended to measure. Technical quality. Construct irrelevant variance

86 SOURCES OF VALIDITY EVIDENCE: TEST CONTENT
Detailed understanding of the content sampled by the instrument and its relationship to the content domain. Content-related evidence is often established during the planning stages of an assessment or scale. Content-related validity studies: exact sampling plan, table of specifications, blueprint; representativeness of items/prompts → domain; appropriate content for instructional objectives; cognitive level of items; match to instructional objectives; review by panel of experts; content expertise of item/prompt writers; expertise of content reviewers; quality of items/prompts, sensitivity review.

87 ASPECTS OF VALIDITY: RESPONSE PROCESSES
Emphasis is on the role of theory. Tasks sample domain processes as well as content. Accuracy in combining scores from different item formats or subscales. Quality control – scanning, assignment of grades, score reports.

88 SOURCES OF VALIDITY EVIDENCE: RESPONSE PROCESSES
Fit of student responses to hypothesized construct? Basic quality control information – accuracy of item responses, recording, data handling, scoring Statistical evidence that items/tasks measure the intended construct Achievement items measure intended content and not other content Ability items predict targeted achievement outcome Ability items fail to predict a non-related ability or achievement outcome

89 SOURCES OF EVIDENCE: RESPONSE PROCESSES
Debrief examinees regarding solution processes. “Think-aloud” during pilot testing. Subscore/subscale analyses- i.e., correlation patterns among part scores. Accurate and understandable interpretations of scores for examinees.

90 SOURCES OF VALIDITY EVIDENCE: INTERNAL STRUCTURE
Statistical evidence of the hypothesized relationship between test item scores and the construct. Reliability: test scale reliability, rater reliability, generalizability. Item analysis data: item difficulty and discrimination, MCQ option function analysis, inter-item correlations. Scale factor structure: dimensionality studies, differential item functioning (DIF) studies. A sketch follows.
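As one concrete piece of internal-structure evidence, here is a minimal Cronbach's alpha computation (made-up responses from 6 examinees on 4 items; standard library only):

from statistics import pvariance

# Hypothetical responses: each row is one item, each column one examinee
items = [
    [4, 5, 4, 3, 5, 4],
    [3, 5, 4, 2, 5, 3],
    [4, 4, 5, 3, 4, 4],
    [5, 5, 4, 3, 5, 4],
]

k = len(items)
totals = [sum(resp) for resp in zip(*items)]          # each examinee's total score
item_variance_sum = sum(pvariance(item) for item in items)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)
alpha = (k / (k - 1)) * (1 - item_variance_sum / pvariance(totals))
print(round(alpha, 2))   # about 0.87 for these made-up data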

91 ASPECTS OF VALIDITY: EXTERNAL
Can the test results be evaluated by objective criteria? Correlations with other relevant variables. Test-criterion correlations: concurrent or predictive. MTMM matrix: convergent correlations; divergent (discriminant) correlations.

92 SOURCES OF VALIDITY EVIDENCE: RELATIONSHIPS TO OTHER VARIABLES
Statistical evidence of the hypothesized relationship between test scores and the construct:
Criterion-related validity studies: correlations between test scores/subscores and other measures
Convergent-divergent studies: MTMM

93 RELATIONSHIPS WITH OTHER VARIABLES
Predictive validity: a variation of criterion-related validity in which the criterion is measured in the future. The classic example is determining whether students who score high on an admissions test such as the MCAT earn higher preclinical GPAs.
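The test-criterion correlation itself is just a Pearson r between the test score and the later criterion. A minimal sketch (the scores below are hypothetical, not data from any study):

```python
import numpy as np

# Hypothetical admissions test scores and later preclinical GPAs
mcat = np.array([505, 512, 498, 520, 515, 508])
gpa = np.array([3.1, 3.4, 2.9, 3.8, 3.6, 3.3])

# Predictive-validity coefficient: correlation of the test with the future criterion
r = np.corrcoef(mcat, gpa)[0, 1]
print(f"test-criterion r = {r:.2f}")
```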

94 RELATIONSHIPS WITH OTHER VARIABLES
Convergent validity: assessed by the correlation among the items that make up the scale (internal consistency), by the correlation of the given scale with measures of the same construct using instruments proposed by other researchers, and by the correlation of relationships involving the given scale across samples or across methods.

95 RELATIONSHIPS WITH OTHER VARIABLES
Criterion (concurrent) validity: the correlation between scale or instrument measurements and accepted standard measures or criteria. Do the proposed measures for a given concept exhibit generally the same direction and magnitude of correlation with other variables as measures of that concept already accepted in this area of research?

96 RELATIONSHIPS WITH OTHER VARIABLES
Divergent (discriminant) validity: the indicators of different constructs should not be so highly correlated as to lead us to conclude that they measure the same thing. This can happen when there is definitional overlap between two constructs.

97 MULTI-TRAIT MULTI-METHOD MTMM MATRIX
Mono-operation and/or mono-method biases: use of a single indicator for a concept or a single data-gathering method may result in bias. Multi-trait/multi-method validation uses multiple indicators per concept and gathers data for each indicator by multiple methods or from multiple sources.
Evidence for construct validity, especially when developing a new scale, is often established through the use of a multi-trait, multi-method matrix. At least two constructs are measured, each construct is measured in at least two different ways, and the types of measures are repeated across constructs. Under conditions of high construct validity, correlations are high for the same construct (or trait) across different measures, and low across constructs that are different but measured using the same general technique. Under low construct validity the reverse holds: correlations are high across traits using the same method but low for the same trait measured in different ways. The sketch below illustrates this pattern.
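A minimal sketch of the expected MTMM pattern, using simulated (entirely hypothetical) scores for two traits each measured by two methods:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical true trait levels for two constructs, A and B
trait_a = rng.normal(size=n)
trait_b = rng.normal(size=n)

# Each trait measured by two different methods (independent measurement error)
scores = np.column_stack([
    trait_a + rng.normal(scale=0.5, size=n),  # A, method 1
    trait_a + rng.normal(scale=0.5, size=n),  # A, method 2
    trait_b + rng.normal(scale=0.5, size=n),  # B, method 1
    trait_b + rng.normal(scale=0.5, size=n),  # B, method 2
])

mtmm = np.corrcoef(scores, rowvar=False)  # 4x4 MTMM correlation matrix

# Convergent evidence: same trait, different methods -- should be high
print("A1-A2 (monotrait-heteromethod):", round(mtmm[0, 1], 2))
# Discriminant evidence: different traits, same method -- should be low
print("A1-B1 (heterotrait-monomethod):", round(mtmm[0, 2], 2))
```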

98 MULTI-TRAIT MULTI-METHOD MTMM MATRIX
Validity of Index of Learning Styles scores: multitrait-multimethod comparison with three cognitive learning style instruments. Cook DA, Smith AJ. Medical Education, 2006; 40.
ILS = Index of Learning Styles (active-reflective, sensing-intuitive, visual-verbal, sequential-global)
LSTI = Learning Style Type Indicator (extrovert-introvert, sensing-intuition, thinking-feeling, judging-perceiving)

99 MULTI-TRAIT MULTI-METHOD MTMM MATRIX
Validity of Index of Learning Styles scores: multitrait-multimethod comparison with three cognitive learning style instruments. Cook DA, Smith AJ. Medical Education, 2006; 40.
ILS = Index of Learning Styles; LSTI = Learning Style Type Indicator

100 RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY
Neither is a property of a test or scale. Reliability is important validity evidence. Without reliability, there can be no validity. Reliability is necessary, but not sufficient for validity. Purpose of an instrument dictates what type of reliability is important and the sources of validity evidence necessary to support the desired inferences.

101 SOURCES OF VALIDITY EVIDENCE: CONSEQUENCES
Evidence of the effects of tests on students, instruction, schools, and society (consequential validity).
Social consequences of assessment: effects of passing/failing tests, economic costs of failure, costs to society of false positive/false negative decisions
Effects of tests on instruction/learning: intended vs. unintended

102 RELIABILITY AND INSTRUMENTATION
Lou Ann Cooper, PhD Director of Program Evaluation and Medical Education Research University of Florida College of Medicine

103 TYPES OF RELIABILITY
Different types of assessments require different kinds of reliability:
Written MCQ/Likert-scale items: scale reliability, internal consistency
Written constructed response and essays: inter-rater agreement, generalizability theory
In order to make causal assessments in your research, you must first have reliable measures: stable and/or repeated measures. If the random error variation in your measurements is so large that there is almost no stability, you can't explain anything. Picture an IQ test where an individual's scores ranged from moronic to genius level; no one would place any faith in the results of such a test. Reliability is required to make statements of validity. However, reliable measures could still be biased, and hence "untrue" measures of a phenomenon, or confounded with other factors such as an acquiescence response set. Picture a scale that always weighs 5 pounds heavy: it could be very reliable, but it is not a valid assessment of a person's true weight.

104 TYPES OF RELIABILITY
Oral exams: rater reliability, generalizability theory
Observational assessments: inter-rater agreement (a sketch of one common agreement index follows)
Performance exams (OSCEs)
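The slides do not name a specific agreement statistic; one widely used index for two raters' categorical ratings is Cohen's kappa, which corrects raw percent agreement for chance. A minimal sketch with hypothetical ratings:

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Cohen's kappa: chance-corrected agreement for two raters' categorical ratings."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    categories = np.union1d(r1, r2)
    observed = np.mean(r1 == r2)  # raw proportion of agreement
    # Expected chance agreement, from each rater's marginal proportions
    expected = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (observed - expected) / (1 - expected)

# Two raters scoring the same 10 hypothetical OSCE performances
rater1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
rater2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
print(round(cohens_kappa(rater1, rater2), 2))
```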

105 ROUGH GUIDELINES FOR RELIABILITY
The higher the better! Depends on the purpose of the test:
Very high stakes: > 0.90 (licensure exams)
Moderate stakes: at least ~ (classroom test, medical school OSCE)
Low stakes: > 0.60 (quiz, test for feedback only)

106 INCREASING RELIABILITY
Written tests:
  Use objectively scored formats
  A sufficient number of MCQs
  MCQs that differentiate between high and low scorers
Performance exams:
  At least 7-12 cases
  Well-trained standardized patients and/or other raters
  Monitoring and quality control
Observational exams:
  Many independent raters (7-11)
  Standard checklists/rating scales
  Timely ratings
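The slides do not give the formula, but the usual tool for projecting how adding items or cases changes reliability is the Spearman-Brown prophecy formula. A minimal sketch, with a hypothetical starting reliability:

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`
    (e.g., 2.0 = doubling the number of comparable items or cases)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical example: a 6-case OSCE with reliability 0.55, doubled to 12 cases
print(round(spearman_brown(0.55, 2.0), 2))  # ~0.71
```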

107 SCALE DEVELOPMENT
1. Identify the primary purpose for which scores will be used.
  Validity is the most important consideration.
  Validity is not a property of an instrument.
  The inferences to be made determine the type of items you will write.
2. Specify the important aspects of the construct to be measured.

108 SCALE DEVELOPMENT
3. Develop an initial pool of items.
4. Expert review (content validity).
5. Preliminary item tryout.
6. Evaluate the statistical properties of the items: item analysis, reliability estimate, dimensionality.

109 ITEM ANALYSIS
Item "difficulty": item variance, frequencies
Inter-item covariances/correlations
Item discrimination: an item that discriminates well correlates with the total score (see the sketch below)
Cronbach's coefficient alpha
Factor analysis, multidimensional scaling, IRT
Structural aspect of validity
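A minimal sketch of the two core item statistics for dichotomously scored items; the corrected item-total correlation is one common discrimination index, and the small response matrix is hypothetical:

```python
import numpy as np

def item_statistics(X):
    """Item difficulty and corrected item-total discrimination
    for an (n_examinees, n_items) matrix of 0/1 item scores."""
    X = np.asarray(X, dtype=float)
    total = X.sum(axis=1)
    difficulty = X.mean(axis=0)  # proportion of examinees answering each item correctly
    # Corrected item-total correlation: each item vs. the sum of the *other* items,
    # so an item's own score does not inflate its discrimination index
    discrimination = np.array([
        np.corrcoef(X[:, j], total - X[:, j])[0, 1]
        for j in range(X.shape[1])
    ])
    return difficulty, discrimination

# Hypothetical 0/1-scored responses: 5 examinees x 4 items
X = [[1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 0, 1]]
diff, disc = item_statistics(X)
print(diff.round(2), disc.round(2))
```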

110 NEED TO EVALUATE SCALE Jarvis & Petty (1996)
Hypothesis: individuals differ in the extent to which they engage in evaluative responding. Subjects were undergraduate psychology students. Comprehensive reliability and validity studies concluded the scale was unidimensional.
Scale construction is an iterative process whereby items are generated, tested, and analyzed, and then either retained for further testing, revised, or deleted. Through this process, Jarvis and Petty reduced their initial pool of 46 items to a highly reliable 16-item scale (coefficient alpha = .87). The statistical criteria they used were: item-total (point-biserial) correlation greater than .30; average inter-item correlation greater than .20; item mean > 2 and < 4 on a 5-point scale; SD ≥ 1. They then began a rigorous investigation of the factor structure of the scale, starting with exploratory factor analysis (EFA) and testing four confirmatory factor analysis (CFA) models. A sketch of how such retention criteria can be applied follows.
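A minimal sketch of applying those retention criteria to a response matrix (thresholds are the ones listed above; the function and data shape are illustrative assumptions, not Jarvis and Petty's code):

```python
import numpy as np

def retain_items(X, min_item_total=0.30, min_inter_item=0.20,
                 mean_lo=2.0, mean_hi=4.0, min_sd=1.0):
    """Flag items on a 5-point scale that satisfy all four retention criteria.

    X: (n_respondents, n_items) matrix of item responses.
    Returns a boolean array, True where an item passes every criterion.
    """
    X = np.asarray(X, dtype=float)
    n_items = X.shape[1]
    total = X.sum(axis=1)
    R = np.corrcoef(X, rowvar=False)  # inter-item correlation matrix
    keep = np.zeros(n_items, dtype=bool)
    for j in range(n_items):
        item_total_r = np.corrcoef(X[:, j], total - X[:, j])[0, 1]
        mean_r_others = np.delete(R[j], j).mean()  # average r with the other items
        keep[j] = (item_total_r > min_item_total
                   and mean_r_others > min_inter_item
                   and mean_lo < X[:, j].mean() < mean_hi
                   and X[:, j].std(ddof=1) >= min_sd)
    return keep
```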

111 REFERENCES
Cook, T.D. & Campbell, D.T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings.
Downing, S.M. (2002). Threats to the validity of locally developed multiple-choice tests in medical education: construct-irrelevant variance and construct underrepresentation. Adv in Health Sci Educ, 7.
Downing, S.M. (2003). Validity: On the meaningful interpretation of assessment data. Med Educ, 37.
Downing, S.M. (2004). Reliability: On the reproducibility of assessment data. Med Educ, 38.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational Measurement (3rd ed.).

