Assessing the Assessment
– Reliability: Am I measuring something? (Test-retest, interobserver agreement, parallel forms, split-half/internal consistency)
– Validity: Am I measuring what I think I am measuring? (Content, criterion, construct)
– Reliability is a necessary prerequisite for validity.
Reliability
– Reliability refers to the consistency of a measure across time, versions, raters, and so on.
– A reliable test has little measurement error.
– Observed Score = True Score + Error
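The relation Observed Score = True Score + Error can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the slides: normally distributed true scores and errors, a true-score SD of 10, an error SD of 5, and the hypothetical helper names `simulate_scores` and `pearson_r`. Under those assumptions the share of observed variance due to true scores is 100 / (100 + 25) = 0.8, and the test-retest correlation should land near the same value.

```python
import random
import statistics


def simulate_scores(n_people=5000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate two administrations of a test under Observed = True + Error."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_people)]
    # Each administration adds fresh, independent error to the same true score.
    time1 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    time2 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return true_scores, time1, time2


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


true_scores, time1, time2 = simulate_scores()

# Reliability as the proportion of observed variance that is true-score variance.
variance_ratio = statistics.variance(true_scores) / statistics.variance(time1)

# Test-retest reliability: correlation between the two administrations.
retest_r = pearson_r(time1, time2)

print(round(variance_ratio, 2), round(retest_r, 2))
```

Raising `error_sd` in the sketch lowers both numbers, which is the sense in which a reliable test has little measurement error.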
Reliability
– True score: the true or perfectly accurate value (e.g., the actual time)
– Often a hypothetical value in psychology, estimated from multiple measurements
– Aggregation: averaging a number of imprecise measurements to increase reliability
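How much aggregation helps can be sketched with the Spearman-Brown formula, which the slides do not name; it is brought in here as an assumption. It presumes the measurements being averaged are parallel (equal reliability, independent errors). The function name `spearman_brown` is hypothetical.

```python
def spearman_brown(single_reliability, k):
    """Projected reliability of the average of k parallel measurements.

    Assumes each measurement has the same reliability and independent error.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    r = single_reliability
    return k * r / (1 + (k - 1) * r)


# Averaging more measurements of modest reliability raises the projected reliability.
for k in (1, 2, 4, 8):
    print(k, round(spearman_brown(0.5, k), 3))
```

With a single-measurement reliability of 0.5, the projected value rises with each doubling of `k` but with diminishing returns, so aggregation helps most when the individual measurements are noisy.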
Reliability
– Test-retest: administer the same measure at two points in time
– Interobserver agreement: multiple observers/judges/raters/scorers rate the same target
– Parallel forms: compare alternate forms of the same test
– Split-half reliability: split the test into two halves and compare scores across halves
– Coefficient alpha: average of all possible split-half reliabilities
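Coefficient alpha is usually computed from item and total-score variances rather than by enumerating every split. The sketch below uses that common computational formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The function name `cronbach_alpha` and the response data are hypothetical and exist only for illustration.

```python
import statistics


def cronbach_alpha(rows):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    items = list(zip(*rows))  # transpose: one tuple of scores per item
    item_variances = [statistics.variance(item) for item in items]
    # Total score per respondent; this variance must be non-zero.
    total_variance = statistics.variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 5 people x 4 items on a 1-5 scale.
responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Items that rise and fall together across respondents push alpha toward 1; items that are unrelated to each other push it toward 0.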
Validity
– Is the test measuring what I think it is?
– This requires empirical demonstration
– There are three types of validity: content validity, criterion validity, construct validity
Validity - Content Validity
– A test has content validity if it adequately covers the area of content it is supposed to cover.
– Difficult to examine statistically
– Content validity typically must be built in at the beginning
– Course exams are the best examples
Validity - Criterion Validity
– For criterion validity, tests are evaluated against some criterion
– Often called predictive validity
– Most at issue for tests employed to make decisions: selection of students, parole decisions, jobs
Criterion Validity - Concurrent
– Concurrent validity: does my measure correlate highly with an established measure?
– Can my measurement instrument predict a criterion that occurs at the same point in time?
– Can my measure (i.e., my operationalization) distinguish between two groups that it should be able to distinguish between?
Criterion Validity - Predictive
– Can my measure predict future behavior?
– If yes, it has predictive validity (a type of criterion validity)
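In practice, criterion validity is commonly summarized as the correlation between the measure and the criterion. For concurrent validity the criterion is collected at the same time; for predictive validity it is collected later. A minimal sketch follows. The scores, the variable names, and the helper `pearson_r` are all hypothetical and do not come from any study cited here.

```python
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: a test score at admission and a criterion measured later.
admission_scores = [150, 155, 160, 165, 170]
later_gpa = [3.0, 3.2, 3.1, 3.6, 3.8]

# The correlation between predictor and criterion serves as the validity coefficient.
validity_coefficient = pearson_r(admission_scores, later_gpa)
print(round(validity_coefficient, 2))
```

Five cases are far too few for a real validation study; the point is only that the same calculation serves both concurrent and predictive validity, and the timing of the criterion is what differs.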
Predictive Validity of the GRE (Graduate Record Examination)
Kuncel, N.R., Hezlett, S.A., & Ones, D.S. (2001). A comprehensive meta-analysis of the predictive validity of the Graduate Record Examinations: Implications for graduate school student selection and performance. Psychological Bulletin, 127, 162-181.
– Originally designed to measure “basic developed abilities relevant to performance in graduate studies”
– Used often and heavily in decisions about admissions
– Verbal measure: analogy, antonym, sentence completion, reading comprehension
– Quantitative measure: quantitative, quantitative comparison, data interpretation
– Analytic measure: analytical and logical reasoning
– Subject test: acquired knowledge in a particular area
Predictive Validity of the GRE
– Want to establish the predictive validity of the GRE
– What will my criterion of graduate school performance be?
– Use several indicators of “performance” (these are the criteria): graduate GPA, 1st-year graduate GPA, comprehensive exam scores, publication citation counts, faculty ratings
Summary
– All areas of the GRE were found to be valid predictors of graduate GPA (GGPA), 1st-year GGPA, faculty ratings, and comprehensive exam scores.
– GRE subject tests were consistently better predictors of the criteria than the quantitative or verbal tests, and also better than undergraduate GPA (UGPA).
Construct Validity
– Most important type of validity
– “If this were a measure of …, what would it look like?”
– Depends heavily on theory: how is this construct related to other constructs?
– Requires broad thinking
– In validating my construct, I am validating my theory
Steps to establish construct validity
1. Establish convergent correlations: measures of constructs that theoretically should be related to each other are, in fact, observed to be related to each other (that is, you should be able to show a correspondence or convergence between similar constructs).
2. Establish divergent correlations: measures of constructs that theoretically should not be related to each other are, in fact, observed to not be related to each other (that is, you should be able to discriminate between dissimilar constructs).
3. Build the nomological net.
Convergent Validity
– Measures that should be related are related
– These 4 items are converging on the same thing (don’t know for sure that it is “self-esteem” yet)
Divergent Validity
– Self-esteem measures do not correlate with locus of control measures
– These measures seem to be tapping different things
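The convergent and divergent checks amount to inspecting a pattern of correlations: high between measures of the same construct, near zero between measures of different constructs. The sketch below uses made-up scores for two self-esteem measures and one locus-of-control measure. The data, the variable names, and the helper `pearson_r` are hypothetical and chosen only to show the pattern.

```python
import statistics
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical scores for 8 respondents on three measures.
measures = {
    "self_esteem_a": [1, 2, 3, 4, 5, 6, 7, 8],
    "self_esteem_b": [2, 1, 4, 3, 6, 5, 8, 7],
    "locus_of_control": [5, 4, 4, 5, 5, 4, 5, 4],
}

# Correlate every pair of measures.
correlations = {
    (a, b): pearson_r(measures[a], measures[b])
    for a, b in combinations(measures, 2)
}

for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```

In this made-up data the two self-esteem measures correlate strongly (convergent evidence) while each correlates only weakly with locus of control (divergent evidence). Where to draw the line between "high" and "low" is a judgment that depends on theory and on the measures involved.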
Nomological Network
– Must develop a “lawful network” for your measure in order to establish construct validity.
– Includes a theoretical framework, an empirical framework, and observables
Childhood Psychopathy Scale
Lynam, D.R. (1997). Pursuing the psychopath: Capturing the fledgling psychopath in a nomological net. Journal of Abnormal Psychology, 106, 425-438.
– “The construct of psychopathy and attendant personality information might profitably be used at the childhood level to identify a more homogeneous group of antisocial children.”
Psychopathy
“The [psychopath] is unfamiliar with the primary facts or data of what might be called personal values and is altogether incapable of understanding such matters. It is impossible for him to take even a slight interest in the tragedy or joy or the striving of humanity as presented in serious literature or art. He is also indifferent to all these matters in life itself. Beauty and ugliness, except in a very superficial sense, goodness, evil, love, horror, and humour have no actual meaning, no power to move him. He is, furthermore, lacking in the ability to see that others are moved. It is as though he were colour-blind, despite his sharp intelligence, to this aspect of human existence. It cannot be explained to him because there is nothing in his orbit of awareness that can bridge the gap with comparison. He can repeat the words and say glibly that he understands, and there is no way for him to realize that he does not understand” (Cleckley, 1941, p. 90, quoted in Hare, 1993, pp. 27-28).
Developed Childhood Psychopathy Scale
– Principles of rational scale construction
– Working from the Psychopathy Checklist-Revised (PCL-R), identified mother-reported items that assessed PCL-R constructs
– Operationalized 13 of the 20 PCL-R constructs as 3- to 4-item scales: glibness, untruthfulness, manipulation, lack of guilt, poverty of affect, callousness, parasitic lifestyle, behavioral dyscontrol, lack of planning, impulsiveness, unreliability, failure to accept responsibility, criminal versatility