Validity and Reliability


1 Validity and Reliability
Margaret Wu

2 Validity
According to the APA Standards for Educational and Psychological Testing, validity is the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests. To establish validity:
- We need to accumulate evidence for appropriate interpretations of scores.
- That is, we evaluate the use of test scores, not the test itself.
- A test may be valid for one kind of use, but not valid for another kind of use.

3 Sources of Validity Evidence
- Evidence based on test content
- Evidence based on response processes
- Evidence based on internal structure
- Evidence based on relations to other variables
- Evidence based on consequences of testing

4 Evidence based on Test Content
- Content coverage – representation of the domain
- Check the assessment framework and test blueprints:
  - The assessment framework defines the test domain.
  - The test blueprint specifies the coverage of subdomains.
- Check against the intended use of the test:
  - Are subdomain scores reported?
  - Are inferences made about individual student performance?
  - Are inferences made about group performance?
  - Are inferences made about trends over time?
- Check that the test blueprint matches the curriculum, and that the test items match the blueprint (a toy coverage check is sketched below).
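As a rough illustration of the last check, here is a minimal Python sketch that compares a blueprint's planned subdomain counts against the items actually written. All subdomain names, item ids, and counts are invented for illustration.

```python
# Toy check of blueprint coverage: compare planned item counts per
# subdomain against the subdomain tags of the drafted items.
# All names and numbers below are illustrative assumptions.
from collections import Counter

blueprint = {"Number": 15, "Algebra": 10, "Geometry": 8, "Statistics": 7}

# Hypothetical mapping from item id to the subdomain it was written for.
items = {1: "Number", 2: "Number", 3: "Algebra", 4: "Geometry"}

written = Counter(items.values())
for subdomain, planned in blueprint.items():
    actual = written.get(subdomain, 0)
    if actual != planned:
        print(f"{subdomain}: blueprint specifies {planned} items, {actual} written")
```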

5 Evidence based on Response Processes
- Response processes of test takers, for example:
  - Questionnaires: social-desirability answers versus students' "real" responses
  - Items testing reasoning: is reasoning needed, or perhaps a memorised algorithm?
- Evidence can be gathered through:
  - Theoretical evidence: test blueprints include the cognitive demand required of students
  - Empirical evidence: cognitive labs (think-aloud procedures); pre-tests; response times
- Students with disability
- Judges'/raters' judging processes
- Test administration procedures: testing time; test security; testing environment

6 Evidence based on Internal Structure
- Good/poor discriminating items (a minimal check is sketched below)
- Expected item difficulty order matches the empirical order
- Item inter-relationships; unidimensionality
- Test reliability
- Presence of differential item functioning
- Size of measurement error and sampling error, in relation to the validity of using the test scores
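One common internal-structure check is item discrimination, often estimated as the correlation between an item score and the rest-of-test score. The sketch below generates Rasch-like data purely for illustration; the 0.2 flag threshold is an assumed rule of thumb, not something specified in the slides.

```python
# Sketch of an item-discrimination check: correlate each item's score
# with the rest-of-test score and flag weak items. The simulated data
# and the 0.2 threshold are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items = 500, 40
ability = rng.normal(size=(n_students, 1))
difficulty = rng.normal(size=(1, n_items))
prob = 1 / (1 + np.exp(-(ability - difficulty)))        # Rasch-like model
responses = (rng.random((n_students, n_items)) < prob).astype(int)

for j in range(n_items):
    rest = responses.sum(axis=1) - responses[:, j]      # total excluding item j
    r = np.corrcoef(responses[:, j], rest)[0, 1]
    if r < 0.2:                                         # flag poor discriminators
        print(f"item {j}: item-rest correlation {r:.2f} looks low")
```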

7 Evidence based on relations to other variables
Predictive validity Concurrent validity Relationship to other tests Relationship to group variables, e.g., gender, demographic Are the relationships consistent with what’s expected given the construct of the test?

8 Evidence based on consequences of testing
Intended and unintended consequences of test use. Effect of group differences in test scores on employment selection Narrowing of school curriculum to exclude learning objectives not assessed. What are the benefits of using the test scores? Are there any detrimental effects from administering the test?

9 Reliability Reliability refers to how consistent would the test scores be should “similar” tests be administered.

10 Possible Grade 5 Mathematics Item Pool – Many questions can be asked
- (a) 16 × 10 =
- (a) The temperature was 7°. It fell by 4°. The temperature was then
- (c) (-11) + (+3) =
- (j) ÷ 1000 =
- A class starts at 10:30. The class is 40 minutes long. What time does the class finish?
- Each apple weighs around 160 grams. How many apples together will weigh close to half a kilogram?
- David can answer 60% of the items (if we had the opportunity to administer all items).
- 40 questions are sampled from the large item pool.
- David's scores on the NAPLAN 2008, 2009, and 2010 tests: 25/40, 28/40, and 20/40.
- David's test scores on similar NAPLAN tests will have a range of 10 score points (e.g., between 20/40 and 30/40).
- The difference between David's test score and his "true" score is called measurement error.
- The correlation between students' scores on two "parallel" tests is called reliability.
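A toy simulation of this example, assuming (as a simplification) that each 40-item test behaves like a random draw of items that David answers correctly with probability 0.6:

```python
# Toy simulation of the slide's example: David can answer 60% of the
# pool; each "parallel" test samples 40 items. Treating each test score
# as Binomial(40, 0.6) is an assumed simplification.
import numpy as np

rng = np.random.default_rng(1)
true_prop, n_items, n_tests = 0.60, 40, 10_000
scores = rng.binomial(n_items, true_prop, size=n_tests)

print("true score:", true_prop * n_items)                 # 24 out of 40
print("spread (SD):", scores.std().round(2))              # about 3.1 points
print("middle 95% of scores:", np.percentile(scores, [2.5, 97.5]))
```

Under these assumptions, roughly 95% of parallel-test scores fall between about 18/40 and 30/40, consistent with the 10-point range quoted on the slide.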

11 Margin of error in measuring student performance
- One test collects only a small sample of performance.
- The possible variation in scores is called measurement error.
- Note that measurement error does not refer to mistakes made in the assessment (e.g., wrong scoring or an incorrect question).
- Is the measurement error too large in NAPLAN?

12 How big an error size is acceptable?
The answer is: it depends. Consider an example, the effectiveness of a weight-loss program:
- We expect a loss of 0.5 kg after one week.
- The measurement scale is accurate to 1 kg.
- This is not good enough for measuring individual change.
- It is OK for measuring group change, if the group size is "large" (a toy simulation follows).
The key is to assess whether the measurement error is too large in comparison to the magnitude that we want to measure.
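A minimal simulation of the weight-loss example, assuming the scale's error behaves like noise with a standard deviation of about 1 kg (an assumed stand-in for "accurate to 1 kg"):

```python
# The same 0.5 kg true change measured for one person versus groups of
# 25 and 400. The normal error model with SD 1 kg is an assumption.
import numpy as np

rng = np.random.default_rng(2)
true_change, scale_sd = 0.5, 1.0

for n in (1, 25, 400):
    measured = true_change + rng.normal(0, scale_sd, size=n)
    se = scale_sd / np.sqrt(n)          # standard error of the group mean
    print(f"n={n:4d}: mean change {measured.mean():+.2f} kg, standard error {se:.2f} kg")
```

For one person, the 1 kg error swamps the 0.5 kg effect; for a group of 400, the standard error of the mean is 0.05 kg, small relative to the effect being measured.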

13 Magnitudes of measurement error
As a rough guide, a lower bound for the measurement error (on the logit scale) is sqrt(4/N), where N is the number of items in the test. One justification: a dichotomously scored item contributes at most 1/4 to the test information, so the standard error of an ability estimate is at least 1/sqrt(N/4) = sqrt(4/N).
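Evaluating this rule of thumb for a few test lengths:

```python
# Lower bound sqrt(4/N) on the measurement error (in logits) for tests
# of N dichotomously scored items.
import math

for n in (20, 30, 40, 80):
    print(f"{n:3d} items: measurement error >= {math.sqrt(4 / n):.2f} logits")
```

Even an 80-item test cannot have a measurement error below about 0.22 logits on this bound.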

14 On the NAPLAN scale…

15 Summary of measuring individuals
The main message is that a one-off test does not provide very accurate information at the individual student level, beyond an indicative sense of whether a student is below average, average, or above average. If a single test of 30 items is ever used for high-stakes purposes, such as selection into colleges or awarding certificates, we should be very wary of the results.

16 Reliability and Measurement Error
- Reliability = Var(T) / Var(X): the variance of the true scores divided by the variance of the observed scores.
- Equivalently, Reliability = (Var(X) - Var(E)) / Var(X): the variance of the observed scores minus the error variance (the square of the measurement error), divided by the variance of the observed scores.
- sqrt(Var(E)) is the measurement error, i.e., the standard error of measurement (a small simulation below checks that these formulas agree).
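A small sketch, assuming normally distributed true scores and errors (all distributions invented for illustration), checking that the two formulas and the parallel-forms correlation give the same answer:

```python
# Simulate true scores T and two parallel observed scores X = T + E,
# then compare Var(T)/Var(X), (Var(X)-Var(E))/Var(X), and corr(X1, X2).
# The score distributions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
T = rng.normal(50, 8, size=n)          # true scores, Var(T) = 64
E1 = rng.normal(0, 4, size=n)          # measurement error, SEM = 4
E2 = rng.normal(0, 4, size=n)
X1, X2 = T + E1, T + E2                # two "parallel" test scores

print("Var(T)/Var(X):         ", round(T.var() / X1.var(), 3))
print("(Var(X)-Var(E))/Var(X):", round((X1.var() - E1.var()) / X1.var(), 3))
print("corr(X1, X2):          ", round(np.corrcoef(X1, X2)[0, 1], 3))
# All three estimates are close to 64 / (64 + 16) = 0.8.
```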

