Content-based Interpretations of Test Scores. Michael Kane, National Conference of Bar Examiners. Maryland Assessment Research Center for Education Success.

Presentation transcript:

1 Content-based Interpretations of Test Scores. Michael Kane, National Conference of Bar Examiners. Maryland Assessment Research Center for Education Success, October 2008.

2 Overview
- Argument-based framework for validation
- Three content-based interpretations: observable attributes, operationally defined attributes, traits
- Limitations of content-based validity evidence (“begging the question”)

3 Validation To validate test score interpretations and uses is to evaluate the plausibility of the interpretations and the appropriateness of the uses. Validation is therefore contingent; the evidence relevant to validation depends on the proposed interpretations and uses.

4 Argument-based Framework for Validation

5 Interpretations/Uses of Scores In order to evaluate an interpretation, it is necessary to specify what it claims. What inferences are being drawn? What rules of inference are being relied on? What supporting assumptions are being made? The format used to specify the interpretations and uses is not important; that they be specified is essential.

6 Argument-based Approach to Validation The interpretive argument specifies the interpretations and uses of the test performances in terms of the inferences and assumptions used to get from a person’s test performance to the conclusions and decisions based on the test results. The validity argument provides a critical evaluation of the interpretive argument.

7 Toulmin’s Model of Inference
Datum → [Warrant] → so, {Qualifier}, Claim
The warrant is supported by Backing; the claim holds unless Exceptions (conditions of rebuttal) apply.
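To make the roles concrete, here is a minimal sketch (not from the slides) that records a Toulmin-style inference as a small data structure; the bar-exam values, cut score, and field names are hypothetical.

```python
# A minimal sketch of Toulmin's inference structure as a data record.
# The bar-exam example values below are hypothetical and only illustrate the roles.
from dataclasses import dataclass, field

@dataclass
class ToulminInference:
    datum: str        # the observed starting point
    warrant: str      # the rule licensing the step from datum to claim
    backing: str      # evidence supporting the warrant
    qualifier: str    # strength of the conclusion ("presumably", "probably")
    claim: str        # the conclusion drawn
    exceptions: list = field(default_factory=list)  # conditions of rebuttal

example = ToulminInference(
    datum="The candidate's observed score is 280.",
    warrant="Scores at or above the cut score indicate minimal competence.",
    backing="Standard-setting study and generalizability evidence.",
    qualifier="presumably",
    claim="The candidate is minimally competent to practice.",
    exceptions=["irregular test administration", "scoring error"],
)
print(example.claim)
```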

8 Warrants as Generic Inferences

9 Characteristics of the Interpretive Argument
- “Informal”: involves substantive inferences and assumptions, not just logical or statistical ones.
- “Presumptive”: does not prove the conclusions, but develops a presumption in favor of them.
- “Tentative”: conclusions are uncertain.
- “Defeasible”: can be overturned in particular cases.

10 Criteria for Validating/Evaluating Interpretive Arguments
- Clarity of the interpretive argument
- Coherence of the interpretive argument
- Plausibility of inferences
- Plausibility of assumptions

11 Asking the Right Questions An essential step in validation is the clear, explicit, and complete specification of the proposed interpretations and uses of test scores. In the absence of a clear and complete understanding of the proposed interpretations and uses, validators literally do not know what they are doing. To evaluate/validate the claims based on test scores, it is important to know what is being claimed.

12 Three Distinct Content-based Interpretations
- Observable attributes
- Operationally defined attributes
- Traits

13 A Family of Content-based Interpretations Observable attributes, operationally defined attributes, and traits form a cluster of closely related attributes that derive much of their meaning from content domains. These attributes are interesting in themselves, and they illustrate the dependence of validation on the details of the proposed interpretations and uses.

14 Observable Attributes Some kind of behavior is of interest. A target domain (TD) of possible observations (often large and somewhat fuzzy) is specified. The target score (TS), the person’s expected value over the TD, is taken to be the value of the observable attribute (OA) for that person. Because it is not generally possible to make all of the observations in the TD, the TS has to be estimated using samples from the TD.
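As a sketch in notation that is not from the slides: for person p, letting X_{pi} denote the score on a single sampled observation i,

```latex
\mathrm{TS}_p \;=\; E_{\mathrm{TD}}\!\left[X_{p}\right],
\qquad
\widehat{\mathrm{TS}}_p \;=\; \frac{1}{n}\sum_{i=1}^{n} X_{pi},
```

with the n observations sampled from the TD.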

15 Possible Observations Observable attributes are dispositions. They report a tendency to respond in some way to some kind of stimulus, or to perform in some way when given a task. Each possible observation in the TD involves some task or stimulus, some conditions of observation, some context, some response, and a categorization of the response (e.g., good, adequate, marginal, inadequate).
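A minimal sketch of what one such possible observation might record; the field names and example values are hypothetical.

```python
# A sketch of a single "possible observation" in a target domain;
# field names and values are hypothetical.
from dataclasses import dataclass

@dataclass
class Observation:
    task: str         # the task or stimulus presented
    conditions: str   # conditions of observation (e.g., timed, untimed)
    context: str      # setting in which the performance occurs
    response: str     # what the person actually did
    category: str     # categorization of the response

obs = Observation(
    task="free throw attempt",
    conditions="standard distance, practice session",
    context="indoor court",
    response="shot made",
    category="adequate",
)
print(obs)
```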

16 Notes on OAs OAs are “observable” in the sense that they are expected values over (very large) domains (or sets) of potential observations. They are inductive summaries. OAs do not require an explanation for the observations, and they do not assume any latent trait that accounts for the observations. But they do not rule out explanations in terms of theories, latent traits, etc. Rather, they invite explanation.

17 What Shapes TDs?
Why do we include some observations and not others in the TD?
- Practical needs: performances involved in a job, sport, or other activity.
- Theoretical context: performances serve the same role or are accounted for in the same way by a theory.
- Experience: performances seem to hang together.
However, once the TD is specified, it defines the observable attribute.

18 Examples of Observable Attributes
- Performance in shooting free throws in basketball
- Performance in responding appropriately to written materials in English
- Performance in a job
- Performance in a trade or profession
- Tendency to respond in some way to some kind of stimulus

19 Measuring Observable Attributes Typically, it is not feasible to draw random or representative samples from the TD. Rather, a measurement procedure is defined in terms of a subset of the TD, from which we can draw random or representative samples. I will refer to this subset of the TD defining the measurement procedure as the universe of generalization (UG) for the procedure. I will refer to a person’s expected value over the UG as the person’s universe score (US).

20 Interpretive Arguments for OAs
- Evaluation: from observations to an observed score (OS)
- Generalization: from the observed score (OS) to a universe score (US)
- Extrapolation: from the universe score (US) to the target score (TS)
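In the same sketched notation (again, not the slides’ own), the universe of generalization is a subset of the target domain, and the three inferences move from the observed score toward the target score:

```latex
\mathrm{UG} \subseteq \mathrm{TD}, \qquad
\mathrm{US}_p = E_{\mathrm{UG}}[X_p], \qquad
\mathrm{TS}_p = E_{\mathrm{TD}}[X_p];
\qquad
\text{observations} \xrightarrow{\;\text{evaluation}\;} \mathrm{OS}_p
\xrightarrow{\;\text{generalization}\;} \mathrm{US}_p
\xrightarrow{\;\text{extrapolation}\;} \mathrm{TS}_p .
```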

21 Validity Arguments for OAs
- Evaluation: expert judgment supporting the scoring rule
- Generalization: generalizability study
- Extrapolation: criterion-related studies; analyses of relationships between UG performances and TD performances
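As an illustration of the generalizability-study backing for the generalization inference, here is a minimal sketch of a one-facet (persons by tasks) G study. The score matrix, function name, and design are hypothetical and deliberately simplified; this is not the procedure described in the presentation.

```python
# A sketch of a one-facet (persons x tasks) generalizability study:
# estimate variance components and a generalizability coefficient,
# i.e., how well observed scores generalize to universe scores.
import numpy as np

def g_study(scores: np.ndarray) -> dict:
    """Variance components and G coefficient for a crossed p x t design."""
    n_p, n_t = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    task_means = scores.mean(axis=0)

    ss_p = n_t * np.sum((person_means - grand) ** 2)
    ss_t = n_p * np.sum((task_means - grand) ** 2)
    ss_total = np.sum((scores - grand) ** 2)
    ms_p = ss_p / (n_p - 1)
    ms_res = (ss_total - ss_p - ss_t) / ((n_p - 1) * (n_t - 1))
    ms_t = ss_t / (n_t - 1)

    var_res = ms_res                        # person-by-task interaction + error
    var_p = max((ms_p - ms_res) / n_t, 0)   # universe-score (person) variance
    var_t = max((ms_t - ms_res) / n_p, 0)   # task difficulty variance

    # Generalizability coefficient for relative decisions based on n_t tasks.
    g_coef = var_p / (var_p + var_res / n_t)
    return {"var_person": var_p, "var_task": var_t,
            "var_residual": var_res, "g_coefficient": g_coef}

# Hypothetical scores for 5 examinees (rows) on 4 tasks (columns).
scores = np.array([[3, 4, 3, 4],
                   [2, 2, 3, 2],
                   [4, 5, 4, 5],
                   [1, 2, 1, 2],
                   [3, 3, 4, 3]], dtype=float)
print(g_study(scores))
```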

22 Operational Definitions In some cases, OAs may be defined in terms of a domain from which it is possible to draw random or representative samples, and the attribute can be operationally defined in terms of a measurement procedure. For such operationally defined attributes (ODAs), there is no extrapolation to a broader domain, and therefore no need for evidence supporting extrapolation. So validation is much easier for an ODA than it is for a broadly defined OA.
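In the sketched notation, an ODA is the special case in which the universe of generalization exhausts the target domain, so the universe score and the target score coincide:

```latex
\mathrm{UG} = \mathrm{TD} \;\Longrightarrow\; \mathrm{US}_p = \mathrm{TS}_p ,
```

and the interpretive argument ends at the generalization inference.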

23 Interpretive Arguments for Operationally Defined Attributes
- Evaluation: from observations to an observed score (OS)
- Generalization: from the observed score (OS) to a universe score (US)

24 Uses of Operational Definitions An operationally-defined attribute is interpreted in terms of expected test performance. Any inferences about non-test performances will generally require specific criterion-related evidence. An ODA can also be used as an indicator for a theoretical construct, but this use requires construct-related validity evidence.

25 Traits Trait definitions incorporate target domains of possible observations, but add assumptions about underlying causal traits that account for performance in the target domain. As a result, trait interpretations are much richer than the interpretations of observable attributes or operationally defined attributes.

26 Trait Language 1 A trait is a disposition to behave or perform in some way in response to some kinds of stimuli or tasks, under some range of circumstances. Much of the meaning of the trait is given by the domain of observations over which the disposition is defined, but trait interpretations also assume, at least implicitly, that some underlying or latent attribute accounts for the observed regularities in performance (Loevinger, 1957).

27 Trait Language 2 Messick defined a trait as “a relatively stable characteristic of a person... which is consistently manifested to some degree when relevant, despite considerable variation in the range of settings and circumstances” (Messick, 1989, p. 15). Trait language tends to be implicitly causal, but no specific mechanisms describe how the trait influences performance or behavior.

28 Traits One can think of a trait as an observable attribute with an added dimension: the underlying latent attribute that accounts for the observed performances. Alternatively, one can think of a latent “trait” (e.g., anxiety, quantitative aptitude) and then specify a corresponding target domain of possible observations. Either way, we have a target domain and an underlying latent trait.

29 Interpretive Arguments for Traits
- Evaluation: from observations to an observed score (OS)
- Generalization: from the observed score (OS) to a universe score (US)
- Extrapolation: from the universe score (US) to the target score (TS)
- Explanation/Implications: from the target score (TS) to the latent trait and to any implications associated with the trait

30 Validating Trait Interpretations
Validation requires backing for the scoring and generalization inferences, and typically for an extrapolation inference. In addition, validation calls for backing for any additional inferences associated with the trait claims:
- Unidimensionality
- Agreement with theory (as in Cronbach and Meehl, 1955)
- Relationship to other variables
- Fit to an IRT model (a sketch of one such check appears below)
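As an illustration of the last kind of backing, here is a minimal sketch of a fit check against a simple Rasch (one-parameter IRT) model. The estimation is a crude logit approximation rather than a full maximum-likelihood fit, and the response matrix is hypothetical; it only illustrates the idea of comparing observed responses with model expectations.

```python
# A rough check of fit to a Rasch (1-parameter logistic IRT) model for
# binary item responses. Crude logit estimates, hypothetical data.
import numpy as np

def rasch_prob(theta: np.ndarray, b: np.ndarray) -> np.ndarray:
    """P(correct) for each person-item pair under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def crude_rasch_fit(responses: np.ndarray) -> np.ndarray:
    """Mean squared standardized residual per item (an outfit-like statistic)."""
    # Crude logit estimates (clipped away from 0 and 1 to avoid infinities).
    p_person = np.clip(responses.mean(axis=1), 0.05, 0.95)
    p_item = np.clip(responses.mean(axis=0), 0.05, 0.95)
    theta = np.log(p_person / (1 - p_person))   # person ability
    b = -np.log(p_item / (1 - p_item))          # item difficulty
    expected = rasch_prob(theta, b)
    z = (responses - expected) / np.sqrt(expected * (1 - expected))
    # Values far above 1 suggest misfit for that item.
    return (z ** 2).mean(axis=0)

# Hypothetical responses: 6 examinees x 4 items (1 = correct, 0 = incorrect).
responses = np.array([[1, 1, 0, 0],
                      [1, 0, 0, 0],
                      [1, 1, 1, 0],
                      [1, 1, 1, 1],
                      [0, 1, 0, 0],
                      [1, 1, 1, 0]], dtype=float)
print(crude_rasch_fit(responses))
```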

31 Limitations of Content-based Validity Evidence

32 Criticisms of the Content Model
Content-based judgments about content relevance and representativeness are typically made during test development and have a confirmationist bias.
Messick (1989) saw content-validity evidence as playing a minor role in validation because it doesn’t apply directly to “inferences to be made from test scores” (p. 17).
Cronbach (1971, p. 452) maintained that “judgments about content validity should be restricted to the operational, externally observable side of testing. Judgments about the subject’s internal processes state hypotheses, and these require empirical construct validation” (italics in original).

33 Judgment

34 Confirmationist Bias and the Stages of Validation
Development stage: creating the test and the interpretive argument
- Done by test developers
- Tends to be confirmationist
- Most content-related evidence is collected
Appraisal stage: challenging the interpretive argument

35 The Begging-the-question Fallacy Begging the question occurs when a large part of the question at issue is simply taken for granted, or “begged”. In the weakest applications of content-validity models, content judgments are used to justify very expansive interpretations (e.g., in terms of traits or theoretical constructs) and uses (e.g., accountability).

36 To validate the interpretations and uses of test scores is to evaluate all of the claims being made.