Chapter 6. The Research Consumer Evaluates Measurement Reliability and Validity
Evidence that Matters: Reliable Measurements Evidence that matters is collected from reliable, valid, responsive, and interpretable measures (methods of collecting information) of participant characteristics and program process, outcomes, impact, and costs. For research findings to count, they must come from measures that can consistently and accurately detect changes in program participants' knowledge, attitudes, and behavior. A reliable measure is a consistent one. A measure of quality of life, for example, is reliable if, on average, it produces the same information from the same people today and two weeks from now.
Reliability, Reproducibility, and Precision A reliable measure is reproducible and precise: Each time it is used it produces the same value. A beam scale can measure body weight precisely, but a questionnaire about good citizenship is likely to produce values that vary from person to person and even from time to time. A measure (e.g., of good citizenship) cannot be perfectly precise if its underlying concept is imprecise (e.g., because of differing definitions of good citizenship). This imprecision is the gateway to random (chance) error. Error comes from three sources: variability in the measure itself, variability in the respondents, and variability in the observer.
Reliability Types Test-retest reliability A measure has test-retest reliability if the correlation (or reliability coefficient) between scores obtained at different times is high. Internal Consistency Reliability Internal consistency is an indicator of the cohesion of the items in a single measure. All items in an internally consistent measure actually assess the same idea or concept. Consider, for example, a test of two items. The first says "You almost always feel like smoking." The second says "You almost never feel like smoking." If a person agrees with the first and disagrees with the second, the test has internal consistency.
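The test-retest reliability coefficient described above is commonly computed as a Pearson correlation between the two administrations. A minimal sketch, using hypothetical quality-of-life scores for five people (all names and numbers below are illustrative, not from the text):

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical quality-of-life scores from the same 5 people,
# measured today (time 1) and two weeks later (time 2).
time1 = [72, 85, 60, 90, 78]
time2 = [70, 88, 62, 91, 75]

r = pearson_r(time1, time2)
print(f"test-retest reliability coefficient: {r:.2f}")
```

A coefficient near 1.0, as here, indicates that the measure ranks people almost identically on both occasions.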
Reliability Types (Continued) Split-half Reliability To estimate split-half reliability, the researcher divides a measure into two equal halves (say, by putting all odd-numbered questions in the first half and all even-numbered questions in the second half). Then the researcher calculates the correlation between the two halves. Alternate-form Reliability Refers to the extent to which two instruments measure the same concepts at the same level of difficulty.
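The split-half procedure above can be sketched in a few lines. The response data are hypothetical, and the final Spearman-Brown adjustment (which estimates the reliability of the full-length test from the half-test correlation) is a standard step in practice that goes beyond what the text states:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical item scores (rows = 5 respondents, columns = 6 questions).
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 4],
    [5, 4, 5, 5, 5, 5],
    [1, 2, 1, 2, 1, 1],
]

# Total score on odd-numbered questions (1, 3, 5) vs even-numbered (2, 4, 6).
odd_totals = [r[0] + r[2] + r[4] for r in responses]
even_totals = [r[1] + r[3] + r[5] for r in responses]

half_r = pearson_r(odd_totals, even_totals)
# Spearman-Brown correction: estimate full-test reliability
# from the correlation between the two halves.
split_half = 2 * half_r / (1 + half_r)
print(f"split-half reliability: {split_half:.2f}")
```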
Reliability Types (Continued) Intra-rater reliability Refers to the extent to which an individual's observations are consistent over time. If you score the quality of 10 evaluation reports at time 1, for example, and then re-score them 2 weeks later, your intra-rater reliability will be perfect if the two sets of scores are in perfect agreement.
Reliability Types (Continued) Inter-rater reliability Refers to the extent to which two or more observers or measurements agree with one another. Suppose you and a co-worker score the quality of 10 evaluation reports. If you and your co-worker have identical scores for each of the 10 reports, your inter-rater reliability will be perfect. A commonly used method for determining the agreement between observations and observers results in a statistic called kappa.
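The kappa statistic mentioned above (Cohen's kappa for two raters) corrects raw agreement for the agreement expected by chance alone. A minimal sketch with hypothetical "good"/"poor" ratings of the 10 reports by two raters:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater1)
    # Observed proportion of agreement.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance-expected agreement from each rater's marginal category counts.
    c1, c2 = Counter(rater1), Counter(rater2)
    categories = set(rater1) | set(rater2)
    expected = sum(c1[c] * c2[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical quality ratings of 10 evaluation reports.
you = ["good", "good", "poor", "good", "poor",
       "good", "good", "poor", "good", "good"]
coworker = ["good", "good", "poor", "good", "good",
            "good", "good", "poor", "good", "good"]

print(f"kappa = {cohens_kappa(you, coworker):.2f}")
```

Here the raters agree on 9 of 10 reports, but because most reports are rated "good" by both, some of that agreement is expected by chance, so kappa comes out well below 0.9.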
Measurement Validity Validity refers to the degree to which a measure assesses what it is supposed to measure. Measurement validity is not the same thing as the internal and external validity we discussed in connection with research design. Measurement validity refers to the extent to which a measure or instrument provides data that accurately represent the concepts of interest.
Validity Types Content validity Refers to the extent to which a measure thoroughly and appropriately assesses the skills or characteristics it is intended to measure. Face validity Refers to how a measure appears on the surface: Does it seem to cover all the important domains and ask all the needed questions? Face validity is established by experts in the field who are asked to review a measure and comment on its coverage. Face validity is the weakest type because it does not have theoretical or research support.
Validity Types (Continued) Predictive validity Predictive validity refers to the extent to which a measure forecasts future performance. A graduate school entry examination that predicts who will do well in graduate school (as measured, for example, by grades) has predictive validity. Concurrent validity Concurrent validity is demonstrated when two measures agree with one another, or a new measure compares favorably with one that is already considered valid. Construct validity Construct validity is established experimentally to demonstrate that a measure distinguishes between people who do and do not have certain characteristics. To demonstrate construct validity for a measure of competent teaching, you need proof that teachers who do well on the measure are competent whereas teachers who do poorly are incompetent.
Sensitivity and Specificity Sensitivity and specificity are two terms that are used in connection with screening and diagnostic tests and measures to detect "disease." Sensitivity refers to the proportion of people with disease who have a positive test result. A sensitive measure will correctly detect disease among people who have the disease. A sensitive measure is a valid measure. What happens when people with the disease get a negative test result anyway, as sometimes happens? That is called a false negative. Insensitive, invalid measures miss true cases and so lead to false negatives.
Sensitivity and Specificity (Continued) Specificity Specificity refers to the proportion of people without disease who have a negative test result. Measures with poor specificity lead to false positives. They invalidly classify people as having a disease when in fact they do not.
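The two proportions defined above can be computed directly from a set of screening results. A minimal sketch with hypothetical counts (45 true positives, 5 false negatives, 90 true negatives, 10 false positives):

```python
def sensitivity_specificity(results):
    """Compute sensitivity and specificity from
    (test_positive, has_disease) pairs."""
    tp = sum(test and disease for test, disease in results)
    fn = sum(not test and disease for test, disease in results)
    tn = sum(not test and not disease for test, disease in results)
    fp = sum(test and not disease for test, disease in results)
    # Sensitivity: proportion of people WITH disease who test positive.
    sensitivity = tp / (tp + fn)
    # Specificity: proportion of people WITHOUT disease who test negative.
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical screening results: (test_positive, truly_has_disease).
results = (
    [(True, True)] * 45 +    # true positives
    [(False, True)] * 5 +    # false negatives (sensitivity errors)
    [(False, False)] * 90 +  # true negatives
    [(True, False)] * 10     # false positives (specificity errors)
)

sens, spec = sensitivity_specificity(results)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With these counts, sensitivity is 45/50 and specificity is 90/100, so the test correctly identifies 90% of diseased people and 90% of disease-free people.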