
1 1 POPULATIONS AND SAMPLING

2 2 Let’s Look at our Example Research Question How do UF COP pharmacy students who only watch videostreamed lectures differ from those who attend class lectures (and also have access to videostreamed lectures) in terms of learning outcomes? Population: Who do you want these study results to generalize to?

3 3 Population  The group you wish to generalize to is your population.  There are two types: –Theoretical population In our example, this would be all pharmacy students in the US –Accessible population In our example, this would be all COP pharmacy students

4 4 Sampling  Target population (or sampling frame): everyone in the accessible population from whom you can draw your sample.  Sample: The group of people you select to be in the study.  A subgroup of the target population.  This is not necessarily the group that actually ends up in your study.

5 5 Sampling How you select your sample:

6 6 Sample Size Select as large a sample as possible from your population.  A large sample reduces the potential error that the sample differs from the population.  Sampling error: The difference between the sample estimate and the true population value (example: exam score).

7 7 Sample Size  Sample size formulas/tables can be used. Factors that are considered include:  Confidence in the statistical test  Sampling error  See Appendix B in Creswell (pg 630)  Sampling error formula – used to determine sample size for a survey  Power analysis formula – used to determine group size in an experimental study.
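
A minimal sketch of the sampling-error approach for a survey, assuming the standard formula for estimating a proportion (the function name and numbers are illustrative, not taken from Creswell's appendix):

```python
import math

def survey_sample_size(margin_of_error=0.05, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion within a given
    sampling error at 95% confidence (z = 1.96); p = 0.5 gives the
    most conservative (largest) answer."""
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

print(survey_sample_size(0.05))  # ~385 respondents for +/-5% sampling error
print(survey_sample_size(0.03))  # ~1068 respondents for +/-3% sampling error
```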

8 8 Back to Our Example  What is our theoretical population?  What is our accessible population?  What sampling strategy should we use?  How do UF COP pharmacy students who only watch videostreamed lectures differ from those who attend class lectures (and also have access to videostreamed lectures) in terms of learning outcomes?

9 9 Important Concept Random sampling vs random assignment  We have talked about random sampling in this session.  Random sampling is not the same as random assignment.  Random sampling is used to select individuals from the population who will be in the sample.  Random assignment is used in an experimental design to assign individuals to groups.
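
A minimal sketch of the distinction in Python (the roster and group sizes are hypothetical):

```python
import random

# Accessible population: all COP pharmacy students (hypothetical roster)
population = [f"student_{i}" for i in range(1, 301)]

# Random SAMPLING: selects who gets into the study from the population
sample = random.sample(population, 60)

# Random ASSIGNMENT: splits the selected participants into treatment groups
random.shuffle(sample)
video_group, lecture_group = sample[:30], sample[30:]
print(len(video_group), len(lecture_group))  # 30 30
```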

10 VALIDITY Lou Ann Cooper, PhD Director of Program Evaluation and Medical Education Research University of Florida College of Medicine

11 11 INTRODUCTION  Both research and evaluation include:  Design – how the study is conducted  Instruments – how data is collected  Analysis of the data to make inferences about the effect of a treatment or intervention.  Each of these components can be affected by bias.

12 12 INTRODUCTION  Two types of error in research  Random error due to random variation in participants’ responses at measurement. Inferential statistics, i.e. the p-value and 95% confidence interval, measure random error and allow us to draw conclusions based on research data.  Systematic error or bias.

13 13 BIAS: DEFINITION  Deviations of results (or inferences) from the truth, or processes leading to such deviation. Any trend in the selection of subjects, data collection, analysis, interpretation, publication or review of data that can lead to conclusions that are systematically different from the truth.  Systematic deviation from the truth that distorts the results of research.

14 14 BIAS  Bias is a form of systematic error that can affect scientific investigations and distort the measurement process.  Bias is primarily a function of study design and execution, not of results, and should be addressed early in the study planning stages.  Not all bias can be controlled or eliminated; attempting to do so may limit usefulness and generalizability.  Awareness of the presence of bias will allow more meaningful scrutiny of the results and conclusions.  A biased study loses validity and is a common reason for invalid research.

15 15 POTENTIAL BIASES IN RESEARCH AND EVALUATION  Study Design  Issues related to Internal validity  Issues related to External validity  Instrument Design  Issues related to Construct validity  Data Analysis  Issues related to Statistical Conclusion validity

16 16 VALIDITY Validity is discussed and applied based on two complementary conceptualizations in education and psychology:  Test validity: the degree to which a test measures what it was designed to measure.  Experimental validity: the degree to which a study supports the intended conclusion drawn from the results.

17 17 FOUR TYPES OF VALIDITY QUESTIONS Conclusion: Is there a relationship between cause and effect? Internal: Is the relationship causal? Construct: Can we generalize to the constructs? External: Can we generalize to other persons, places, times?

18 18 CONCLUSION VALIDITY  Conclusion validity is the degree to which conclusions we reach about relationships are reasonable, credible or believable.  Relevant for both quantitative and qualitative research studies.  Is there a relationship in your data or not?

19 19 STATISTICAL CONCLUSION VALIDITY  Basing conclusions on proper use of statistics  Reliability of measures  Reliability of implementation  Type I Errors and Statistical Significance  Type II Errors and Statistical Power  Fallacies of Aggregation

20 20 STATISTICAL CONCLUSION VALIDITY  Interaction and non-linearity  Random irrelevancies in the experimental setting  Random heterogeneity of respondents

21 21 VIOLATED ASSUMPTIONS OF STATISTICAL TESTS  The particular assumptions of a statistical test must be met if the results of the analysis are to be meaningfully interpreted.  Levels of measurement.  Example: Analysis of Variance (ANOVA)

22 22 LEVELS OF MEASUREMENT  A hierarchy is implied in the idea of level of measurement.  At lower levels, assumptions tend to be less restrictive and data analyses tend to be less sensitive.  In general, it is desirable to have a higher level of measurement (interval or ratio) rather than a lower one (nominal or ordinal).

23 23 STATISTICAL ANALYSIS AND LEVEL OF MEASUREMENT ANALYSIS OF VARIANCE ASSUMPTIONS  Independence of cases.  Normality. In each of the groups, the data are continuous and normally distributed.  Equal variances or homoscedasticity. The variance of data in groups should be the same.  The Kruskal-Wallis test is a nonparametric alternative which does not rely on an assumption of normality.
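
A minimal sketch of these checks in Python with SciPy (the three groups are simulated exam scores, not study data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
video = rng.normal(75, 10, 50)     # simulated exam scores, three groups
lecture = rng.normal(78, 10, 50)
hybrid = rng.normal(76, 10, 50)

# Check homogeneity of variance (Levene's test is robust to non-normality)
print(stats.levene(video, lecture, hybrid))

# One-way ANOVA: assumes independence, normality within groups, equal variances
print(stats.f_oneway(video, lecture, hybrid))

# Kruskal-Wallis: nonparametric alternative when normality is doubtful
print(stats.kruskal(video, lecture, hybrid))
```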

24 24 RELIABILITY  Measures (tests and scales) of low reliability may not register true changes.  Reliability of treatment implementation – when treatments/procedures are not administered in a standard fashion, error variance is increased and the chance of obtaining true differences will decrease.

25 25 STATISTICAL DECISION vs. TRUE POPULATION STATUS: rejecting H0 when H0 is actually true is a Type I error; failing to reject H0 when H0 is actually false is a Type II error; the other two outcomes are correct decisions.

26 26 TYPE I ERRORS AND STATISTICAL SIGNIFICANCE  A Type I error is made when a researcher concludes that there is a relationship when there really isn’t (false positive)  If the researcher rejects H0 because p ≤ .05, ask:  If data are from a random sample, is the significance level appropriate?  Are significance tests applied to a priori hypotheses?  Fishing and the error rate problem

27 27 TYPE II ERRORS AND STATISTICAL POWER  A Type II error is made when a researcher concludes that there is not a relationship when there really is (false negative)  If the researcher fails to reject H0 because p > .05, ask:  Has the researcher used statistical procedures of adequate power?  Does failure to reject H0 merely reflect a small sample size?

28 28 FACTORS THAT INFLUENCE POWER AND STATISTICAL INFERENCE  Alpha level  Effect size  Directional vs. Non-directional test  Sample size  Unreliable measures  Violating the assumptions of a statistical test
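
A small Monte Carlo sketch of how two of these factors, sample size and effect size, drive power (all numbers are hypothetical):

```python
import numpy as np
from scipy import stats

def estimated_power(n_per_group, effect_size, alpha=0.05, reps=2000, seed=1):
    """Estimate two-sample t-test power: the fraction of simulated
    experiments in which a true difference is detected at the given alpha."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect_size, 1.0, n_per_group)
        if stats.ttest_ind(control, treated).pvalue <= alpha:
            hits += 1
    return hits / reps

print(estimated_power(20, 0.5))  # roughly 0.3: underpowered
print(estimated_power(64, 0.5))  # roughly 0.8: a common target for power
```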

29 29 RANDOM IRRELEVANCIES  Features of the experimental setting other than the treatment affect scores on the dependent variable  Controlled by choosing settings free from extraneous sources of variation  Measure anticipated sources of variance to include in the statistical analysis

30 30 RANDOM HETEROGENEITY OF RESPONDENTS  Participants can differ on factors that are correlated with the major dependent variables  Certain respondents will be more affected by the treatment than others  Minimized by  Blocking variables and covariates  Within subjects designs

31 31 STRATEGIES TO REDUCE ERROR TERMS  Subjects as own control  Homogeneous samples  Pretest measures on the same scales used for measuring the effect  Matching on variables correlated with the post-test  Effects of other variables correlated with the post-test used as covariates  Increase the reliability of the dependent variable measures

32 32 STRATEGIES TO REDUCE ERROR TERMS  Estimates of the desired magnitude of a treatment effect should be elicited before research begins  Absolute magnitude of the treatment effect should be presented so readers can infer whether a statistically reliable effect is practically significant.

33 33 INTERNAL VALIDITY  Internal validity has to do with defending against sources of bias arising in a research design.  To what degree is the study designed such that we can infer that the educational intervention caused the measured effect?  An internally valid study will minimize the influence of extraneous variables. Example: Did participation in a series of Webinars on TB in children change the practice of physicians?

34 THREATS TO INTERNAL VALIDITY: History, Maturation, Testing, Instrumentation, Statistical Regression, Selection, Interactions with Selection, Mortality

35 35 INTERNAL VALIDITY: THREATS IN SINGLE GROUP REPEATED MEASURES DESIGNS  History  Maturation  Testing  Instrumentation  Mortality  Regression

36 36 THREATS TO INTERNAL VALIDITY HISTORY  The observed effects may be due to or be confounded with nontreatment events occurring between the pretest and the post-test  History is a threat to conclusions drawn from longitudinal studies  Greater time period between measurements = more risk of a history effect  History is not a threat in cross sectional designs conducted at one point in time

37 37 THREATS TO INTERNAL VALIDITY MATURATION  Invalid inferences may be made when the maturation of participants between measurements has an effect and this maturation is not the research interest.  Internal (physical or psychological) changes in participants unrelated to the independent variable – older, wiser, stronger, more experienced.

38 38 THREATS TO INTERNAL VALIDITY TESTING  Reactivity as a result of testing  The effects of taking a test on the outcomes of a second test  Practice  Learning  Improved scores on the second administration of a test can be expected even in the absence of intervention due to familiarity

39 39 THREATS TO INTERNAL VALIDITY INSTRUMENTATION  Changes in instruments, observers or scorers which may produce changes in outcomes  Observers/raters, through experience, become more adept at their task  Ceiling and floor effects  Longitudinal studies

40 40 THREATS TO INTERNAL VALIDITY STATISTICAL REGRESSION  Test-retest scores tend to drift systematically to the mean rather than remain stable or become more extreme  Regression effects may obscure treatment effects or developmental changes  Most problematic when participants are selected because they are extreme on the classification variable of interest

41 41 THREATS TO INTERNAL VALIDITY MORTALITY  Differences in drop-out rates/attrition across conditions of the experiment  Makes “before” and “after” samples not comparable  This selection artifact may become operative in spite of random assignment  Major threat in longitudinal studies

42 42 INTERNAL VALIDITY: MULTIPLE GROUP THREATS  Selection  Interactions with Selection  Selection-History  Selection-Maturation  Selection-Testing  Selection-Instrumentation  Selection-Mortality  Selection-Regression

43 43 THREATS IN DESIGNS WITH GROUPS: SOCIAL INTERACTION THREATS  Compensatory equalization of treatments  Compensatory rivalry  Resentful demoralization  Treatment imitation or diffusion  Unintended treatments

44 44 EXTERNAL VALIDITY  The extent to which the results of a study can be generalized  Population validity – generalizations related to other groups of people  Ecological validity – generalizations related to other settings, times, contexts, etc.

45 45 THREATS TO EXTERNAL VALIDITY  Pre-test treatment interaction  Multiple treatment interference  Interaction of selection and treatment  Interaction of setting and treatment  Interaction of history and treatment  Experimenter effects

46 46 THREATS TO EXTERNAL VALIDITY  Reactive arrangements  Artificial environment  Hawthorne effect ◊ Halo effect ◊ John Henry effect  Placebo effect  Participant-researcher interaction  Novelty effect

47 SELECTING A RESEARCH DESIGN Lou Ann Cooper, PhD Director of Program Evaluation and Medical Education Research University of Florida College of Medicine

48 48 What If…. We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)…..

49 49 PRE-EXPERIMENTAL DESIGNS One Group Posttest Design X O X = Implementation of the treatment O = Measurement of the participants in the experimental group Also referred to as ‘One Shot Case Study’

50 50 What If…. We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)….. For Discussion: What are the threats to validity?

51 51 SOURCES OF INVALIDITY (– = definite weakness; + = factor controlled; ? = possible concern; blank = factor not relevant) Internal: History –, Maturation –, Testing, Instrumentation, Regression, Mortality –, Selection –, Selection Interactions. External: Interaction of Testing and X, Interaction of Selection and X –, Reactive Arrangements, Multiple X Interference

52 52 What If…. We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)…..

53 53 PRE-EXPERIMENTAL DESIGNS Comparison Group Posttest Design: X O / O  Static Group Comparison  Ex post facto research  No pretest observations

54 54 What If…. We gave 150 pharmacy students (all are distance campus) access to streaming video and then measured their performance on a written exam (measures achievement of learning outcomes)….. For Discussion: What if we compare test scores for these students with last year’s scores (assume last year had no streaming video)?

55 55 SOURCES OF INVALIDITY Internal: History +, Maturation ?, Testing +, Instrumentation +, Regression +, Mortality –, Selection –, Selection Interactions –. External: Interaction of Testing and X, Interaction of Selection and X –, Reactive Arrangements, Multiple X Interference

56 56 What If…. We gave 150 pharmacy students (all are distance campus) access to streaming video and measured their performance on a written exam both before and after the intervention (measures achievement of learning outcomes)…..

57 57 PRE-EXPERIMENTAL DESIGNS One Group Pretest/Posttest Design O X O  Not a true experiment  Because participants serve as their own control, results may be less biased

58 58 What If…. We gave 150 pharmacy students (all are distance campus) access to streaming video and measured their performance on a written exam both before and after the intervention (measures achievement of learning outcomes)….. For Discussion: What are the threats to validity? (What plausible hypotheses could explain any difference?)

59 59 SOURCES OF INVALIDITY Internal: History –, Maturation –, Testing –, Instrumentation –, Regression ?, Mortality +, Selection +, Selection Interactions –. External: Interaction of Testing and X –, Interaction of Selection and X –, Reactive Arrangements ?, Multiple X Interference

60 60 What If…. We could randomize all 300 pharmacy students to the following groups: Group 1: access only streaming video Group 2: attend lectures For each group, administer both a pre-test and a post-test

61 61 TRUE EXPERIMENTAL DESIGNS Pretest/Posttest Design with Control Group and Random Assignment: R O X O / R O O  Measurement of pre-existing differences  Controls most threats to internal validity

62 62 What If…. We could randomize all 300 pharmacy students to the following groups: Group 1: access only streaming video Group 2: attend lectures For each group, administer both a pre-test and a post-test For Discussion: What are the threats to validity? (What plausible hypotheses could explain any difference?)

63 63 SOURCES OF INVALIDITY Internal: History +, Maturation +, Testing +, Instrumentation +, Regression +, Mortality +, Selection +, Selection Interactions +. External: Interaction of Testing and X –, Interaction of Selection and X ?, Reactive Arrangements ?, Multiple X Interference

64 64 What If…. We could randomize all 300 pharmacy students to the following groups: Group 1: access only streaming video and post-test Group 2: attend lectures and post-test

65 65 TRUE EXPERIMENTAL DESIGNS Posttest Only Control Group: R X O / R O

66 66 What If…. We could randomize all 300 pharmacy students to the following groups: Group 1: access only streaming video and post-test Group 2: attend lectures and post-test For Discussion: What have we lost by not using a pre-test? (as compared to the experimental randomized pre-test and post-test design)

67 67 SOURCES OF INVALIDITY Internal: History +, Maturation +, Testing +, Instrumentation +, Regression +, Mortality +, Selection +, Selection Interactions +. External: Interaction of Testing and X +, Interaction of Selection and X ?, Reactive Arrangements ?, Multiple X Interference

68 68 What If…. We could randomize all 300 pharmacy students to the following groups: Group 1: pre-test, access only streaming video, and post-test Group 2: pre-test, attend lectures, and post-test Group 3: access only streaming video and post-test only Group 4: attend lectures and post-test only

69 69 TRUE EXPERIMENTAL DESIGNS Solomon Four Group Comparison: R O X O / R O O / R X O / R O

70 70 What If…. We could randomize all 300 pharmacy students to the following groups: Group 1: pre-test, access only streaming video, and post-test Group 2: pre-test, attend lectures, and post-test Group 3: access only streaming video and post-test only Group 4: attend lectures and post-test only For Discussion: What have we gained by having four groups (especially groups 3 and 4)?

71 71 What If…. It is NOT feasible to use randomization. What if we were to have the following groups: Group 1 (all distance campuses): access only streaming video Group 2 (GNV campus): attend lectures For each group, administer both a pre-test and a post-test

72 72 QUASI-EXPERIMENTAL DESIGNS Nonequivalent Control Group O X O O  Pre-existing differences can be measured  Controls some threats to validity

73 73 What If…. It is NOT feasible to use randomization. What if we were to have the following groups: Group 1 (all distance campuses): access only streaming video Group 2 (GNV campus): attend lectures For each group, administer both a pre-test and a post-test For Discussion: What have we “lost” by not randomizing?

74 74 SOURCES OF INVALIDITY Internal: History –, Maturation –, Testing, Instrumentation, Regression, Mortality –, Selection –, Selection Interactions. External: Interaction of Testing and X, Interaction of Selection and X –, Reactive Arrangements, Multiple X Interference

75 75 QUASI-EXPERIMENTAL DESIGNS Time Series O O O O X O O O O

76 76 QUASI-EXPERIMENTAL DESIGNS Counterbalanced Design: O X1 O X2 O X3 / O X3 O X1 O X2 / O X2 O X3 O X1

77 MEASUREMENT VALIDITY: SOURCES OF EVIDENCE

78 78 CLASSIC VIEW OF TEST VALIDITY Traditional triarchic view of validity  Content  Criterion  Concurrent  Predictive  Construct  Tests were described as “valid” or “invalid”  Reliability was considered a separate test trait

79 79 MODERN VIEW OF VALIDITY  Scientific evidence needed to support test score interpretation  Standards for Educational & Psychological Testing (1999)  Cronbach, Messick, Kane  Some theory, key concepts, examples  Reliability as part of validity

80 80 VALIDITY: DEFINITIONS “A proposition deserves some degree of trust only when it has survived serious attempts to falsify it.” (Cronbach, 1980) According to the Standards, validity refers to “the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores.” “Validity is an integrative summary.” (Messick, 1995) “Validation is the process of building an argument supporting interpretation of test scores.” (Kane, 1992)

81 81 WHAT IS A CONSTRUCT?  Constructs are psychological attributes, hypothetical concepts  A defensible construct has  A theoretical basis  Clear operational definitions involving measurable indicators  Demonstrated relationships to other constructs or observable phenomena  A construct should be differentiated from related theoretical constructs as well as from methodological irrelevancies

82 82 THREATS TO CONSTRUCT VALIDITY (Cook & Campbell)  Inadequate preoperational explication of constructs  Mono-operation bias  Mono-method bias  Interaction of different treatments  Interaction of testing and treatment  Restricted generalizability across constructs

83 83 THREATS TO CONSTRUCT VALIDITY (Cook & Campbell)  Confounding constructs  Confounding levels of constructs  Hypothesis guessing within experimental conditions  Evaluation apprehension  Researcher expectancies

84 84 SOURCES OF VALIDITY EVIDENCE 1. Test Content – Task Representation, Construct Domain 2. Response Process – Item Psychometrics 3. Internal Structure – Test Psychometrics 4. Relationships with Other Variables – Correlations: Test-Criterion Relationships, Convergent and Divergent Data 5. Consequences of Testing – Social Context (Standards for Educational and Psychological Testing, 1999)

85 85 ASPECTS OF VALIDITY: CONTENT  Content validity refers to how well elements of the test or scale relate to the content domain.  Content relevance.  Content representativeness.  Content coverage.  Systematic analysis of what the test is intended to measure.  Technical quality.  Construct irrelevant variance

86 86 SOURCES OF VALIDITY EVIDENCE: TEST CONTENT Detailed understanding of the content sampled by the instrument and its relationship to content domain  Content-related evidence is often established during the planning stages of an assessment or scale.  Content-related validity studies  Exact sampling plan, table of specifications, blueprint  Representativeness of items/prompts →Domain  Appropriate content for instructional objectives ◊ Cognitive level of items ◊ Match to instructional objectives  Review by panel of experts.  Content expertise of item/prompt writers  Expertise of content reviewers  Quality of items/prompts, sensitivity review

87 87 ASPECTS OF VALIDITY: RESPONSE PROCESSES  Emphasis is on the role of theory.  Tasks sample domain processes as well as content.  Accuracy in combining scores from different item formats or subscales.  Quality control – scanning, assignment of grades, score reports.

88 88 SOURCES OF VALIDITY EVIDENCE: RESPONSE PROCESSES Fit of student responses to hypothesized construct?  Basic quality control information – accuracy of item responses, recording, data handling, scoring  Statistical evidence that items/tasks measure the intended construct  Achievement items measure intended content and not other content  Ability items predict targeted achievement outcome  Ability items fail to predict a non-related ability or achievement outcome

89 89 SOURCES OF EVIDENCE: RESPONSE PROCESSES  Debrief examinees regarding solution processes.  “Think-aloud” during pilot testing.  Subscore/subscale analyses- i.e., correlation patterns among part scores.  Accurate and understandable interpretations of scores for examinees.

90 90 SOURCES OF VALIDITY EVIDENCE: INTERNAL STRUCTURE Statistical evidence of the hypothesized relationship between test item scores and the construct  Reliability  Test scale reliability  Rater reliability  Generalizability  Item analysis data  Item difficulty and discrimination  MCQ option function analysis  Inter-item correlations  Scale factor structure  Dimensionality studies  Differential item functioning (DIF) studies

91 91 ASPECTS OF VALIDITY: EXTERNAL Can the test results be evaluated by objective criteria?  Correlations with other relevant variables  Test-criterion correlations  Concurrent or predictive  MTMM matrix  Convergent correlations  Divergent (discriminant) correlations

92 92 SOURCES OF VALIDITY EVIDENCE: RELATIONSHIPS TO OTHER VARIABLES Statistical evidence of the hypothesized relationship between test scores and the construct  Criterion-related validity studies  Correlations between test scores/subscores and other measures  Convergent-Divergent studies  MTMM

93 93 RELATIONSHIPS WITH OTHER VARIABLES Predictive validity: A form of criterion-related validity in which the criterion is measured in the future. The classic example is determining whether students who score high on an admissions test such as the MCAT earn higher preclinical GPAs.

94 94 RELATIONSHIPS WITH OTHER VARIABLES Convergent validity: Assessed by the correlation among the items that make up the scale (internal consistency), by the correlation of the given scale with measures of the same construct using instruments proposed by other researchers, and by the consistency of relationships involving the given scale across samples or across methods.

95 95 RELATIONSHIPS WITH OTHER VARIABLES Criterion (concurrent) validity: correlation between scale or instrument measurement items and known accepted standard measures or criteria. Do the proposed measures for a given concept exhibit generally the same direction and magnitude of correlation with other variables as measures of that concept already accepted in this area of research?
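
In practice this usually reduces to a correlation coefficient; a minimal sketch with made-up scores:

```python
from scipy import stats

new_scale = [72, 65, 80, 58, 90, 77, 61, 83]   # scores on the proposed measure
criterion = [70, 60, 78, 55, 88, 74, 66, 85]   # scores on an accepted standard measure

r, p = stats.pearsonr(new_scale, criterion)
print(f"criterion-related validity coefficient: r = {r:.2f} (p = {p:.3f})")
```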

96 96 RELATIONSHIPS WITH OTHER VARIABLES Divergent (discriminant) validity: The indicators of different constructs should not be so highly correlated as to lead us to conclude that they measure the same thing. This would happen if there is definitional overlap between two constructs.

97 97 MULTI-TRAIT MULTI-METHOD MTMM MATRIX  Mono-operation and/or mono-method biases – use of a single indicator for a concept or a single data gathering method may result in bias  Multi-trait/Multi-method validation uses multiple indicators per concept and gathers data for each indicator by multiple methods or multiple sources.

98 98 MULTI-TRAIT MULTI-METHOD MTMM MATRIX Example: Cook DA, Smith AJ. Validity of Index of Learning Styles scores: multitrait–multimethod comparison with three cognitive learning style instruments. Medical Education 2006; 40. (ILS = Index of Learning Styles; LSTI = Learning Style Type Indicator. Dimensions compared: active–reflective, sensing–intuitive, visual–verbal, sequential–global, extrovert–introvert, sensing–intuition, thinking–feeling, judging–perceiving.)

99 99 MULTI-TRAIT MULTI-METHOD MTMM MATRIX [MTMM correlation matrix from Cook & Smith, Medical Education 2006; 40. ILS = Index of Learning Styles; LSTI = Learning Style Type Indicator]

100 100 RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY  Neither is a property of a test or scale.  Reliability is important validity evidence.  Without reliability, there can be no validity. Reliability is necessary, but not sufficient for validity.  Purpose of an instrument dictates what type of reliability is important and the sources of validity evidence necessary to support the desired inferences.

101 101 SOURCES OF VALIDITY EVIDENCE: CONSEQUENCES Evidence of the effects of tests on students, instruction, schools, society  Consequential validity  Social consequences of assessment  Effects of passing-failing tests  Economic costs of failure  Costs to society of false positive/false negative decisions  Effects of tests on instruction/learning  Intended vs. unintended

102 RELIABILITY AND INSTRUMENTATION Lou Ann Cooper, PhD Director of Program Evaluation and Medical Education Research University of Florida College of Medicine

103 103 TYPES OF RELIABILITY Different types of assessments require different kinds of reliability  Written MCQ/Likert-scale items  Scale reliability  Internal consistency  Written Constructed Response and Essays  Inter-rater agreement  Generalizability theory

104 104 TYPES OF RELIABILITY  Oral Exams  Rater reliability  Generalizability Theory  Observational Assessments  Rater reliability  Inter-rater agreement  Generalizability Theory  Performance Exams (OSCEs)  Rater reliability  Generalizability Theory

105 105 ROUGH GUIDELINES FOR RELIABILITY  The higher the better!  Depends on purpose of test  Very high-stakes: > 0.90 (Licensure exams)  Moderate stakes: at least ~ 0.75 (Classroom test, Medical school OSCE)  Low stakes: > 0.60 (Quiz, test for feedback only)

106 106 INCREASING RELIABILITY  Written tests  Use objectively scored formats  At least MCQs  MCQs that differentiate between high and low scorers  Performance exams  At least 7-12 cases  Well trained standardized patients and/or other raters  Monitoring and quality control  Observational Exams  Many independent raters (7-11)  Standard checklists/rating scales  Timely ratings

107 107 SCALE DEVELOPMENT 1. Identify the primary purpose for which scores will be used.  Validity is the most important consideration.  Validity is not a property of an instrument.  Inferences to be made determine the type of items you will write. 2. Specify the important aspects of the construct to be measured.

108 108 SCALE DEVELOPMENT 3. Initial pool of items. 4. Expert review (content validity) 5. Preliminary item ‘tryout’ 6. Statistical properties of the items  Item analysis  Reliability estimate  Dimensionality

109 109 ITEM ANALYSIS  Item ‘difficulty’ – item variance, frequencies  Inter-item covariances/correlations  Item discrimination – an item that discriminates well correlates with the total score.  Cronbach’s coefficient alpha  Factor Analysis – Multidimensional Scaling  IRT  Structural aspect of validity.
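
A minimal sketch of item difficulty, item-total discrimination, and Cronbach's alpha on a toy 0/1-scored response matrix (all data are made up):

```python
import numpy as np

# Rows = examinees, columns = items; 1 = correct, 0 = incorrect (toy data)
X = np.array([[1, 1, 0, 1],
              [1, 0, 0, 1],
              [1, 1, 1, 1],
              [0, 0, 0, 1],
              [1, 1, 0, 0]], dtype=float)

difficulty = X.mean(axis=0)          # proportion answering each item correctly
total = X.sum(axis=1)                # total score per examinee

# Discrimination: item-total correlation (uncorrected, for illustration only)
discrimination = [np.corrcoef(X[:, j], total)[0, 1] for j in range(X.shape[1])]

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total)
k = X.shape[1]
alpha = k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum() / total.var(ddof=1))

print(difficulty, np.round(discrimination, 2), round(alpha, 2))
```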

110 110 NEED TO EVALUATE SCALE  Jarvis & Petty (1996)  Hypothesis: Individuals differ in the extent to which they engage in evaluative responding.  Subjects were undergraduate psychology students.  Comprehensive reliability and validity studies.  Concluded the scale was ‘unidimensional’.

111 111 REFERENCES
Cook, T. D., & Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings.
Downing, S. M. (2002). Threats to the validity of locally developed multiple-choice tests in medical education: construct-irrelevant variance and construct underrepresentation. Adv in Health Sci Educ, 7.
Downing, S. M. (2003). Validity: On the meaningful interpretation of assessment data. Med Educ, 37.
Downing, S. M. (2004). Reliability: On the reproducibility of assessment data. Med Educ, 38.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed.).

