Presentation on theme: "Chapter 4 – Reliability Observed Scores and True Scores Error"— Presentation transcript:
1 Chapter 4 – Reliability. Outline: Observed Scores and True Scores; Error; How We Deal with Sources of Error (domain sampling – test items; time sampling – test occasions; internal consistency – traits); Reliability in Observational Studies; Using Reliability Information; What To Do about Low Reliability.
2 Chapter 4 – Reliability. Measurement of human ability and knowledge is challenging because: (1) ability is not directly observable, so we infer ability from behavior; (2) all behaviors are influenced by many variables, only a few of which matter to us. Whatever is said about ability here applies to knowledge as well. Because ability is not directly observable, we have to infer true values for things like intelligence or aptitude. Because there are many influences on behavior, the ones we’re not interested in can obscure the ones we are interested in, and that makes our inferences problematic.
3 Observed Scores. O = T + e, where O = observed score, T = true score, and e = error. This means that observed scores combine ability (or knowledge) and error. The more error there is, the less reliable the test is.
4 Reliability – the basics. (1) A true score on a test does not change with repeated testing. (2) A true score would be obtained if there were no error of measurement. (3) We assume that errors are random (equally likely to increase or decrease any test result). Note on (1): in other words, you don’t learn or acquire skill (related to the material being tested) from the test itself. That assumption is, of course, questionable, since in some courses the testing experience teaches through integration.
5 Reliability – the basics. Because errors are random, if we test one person many times, the errors will tend to cancel each other out (positive errors cancel negative errors). The mean of many observed scores for one person will approach that person’s true score.
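The cancelling-out claim can be illustrated with a small simulation. This is only a sketch under stated assumptions: errors are drawn from a normal distribution, the true score (70) and error standard deviation (5) are invented values, and the function name is hypothetical.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_tests, seed=0):
    """Simulate O = T + e for one person tested n_tests times."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Each observed score is the constant true score plus a random error.
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_tests)]


# Assumed values, chosen only for illustration.
scores = simulate_observed_scores(true_score=70.0, error_sd=5.0, n_tests=10_000)
mean_observed = statistics.fmean(scores)

print(round(scores[0], 1))       # a single observed score can miss T by several points
print(round(mean_observed, 2))   # the mean of many observed scores lies close to T
```

Under these assumptions, the spread of the simulated scores around their mean is also close to the assumed error standard deviation.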
6 Reliability – the basics. Example: measuring Sarah’s spelling ability for English words. We can’t ask her to spell every word in the OED (Oxford English Dictionary); that would take far too long for our measurement purpose. So we ask Sarah to spell a subset of English words, and the % correct estimates her true English spelling skill. But which words should be in our subset?
7 Estimating Sarah’s spelling ability… Suppose we choose 20 words randomly. What if, by chance, we get a lot of very easy words (cat, tree, chair, stand…)? Or, by chance, we get a lot of very difficult words (desiccate, arteriosclerosis, numismatics)?
8 Estimating Sarah’s spelling ability… Sarah’s observed score varies as the difficulty of the random sets of words varies. But presumably her true score (her actual spelling ability) remains constant. Note: this is only true when we select the words randomly. If we deliberately choose an easy set of words & a hard set, the difference in Sarah’s scores reflects both sampling error and her ability.
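The effect of random word selection can be sketched with a toy simulation. Everything here is an assumption made for illustration: a hypothetical domain of 1,000 words, of which Sarah can spell 700 (a true score of 70%), and repeated random 20-word tests drawn from that domain.

```python
import random
import statistics


def simulate_spelling_tests(domain_size=1000, n_known=700, test_length=20,
                            n_tests=2000, seed=1):
    """Return percent-correct scores on many random word samples."""
    rng = random.Random(seed)
    # 1 = a word Sarah can spell, 0 = a word she cannot (hypothetical domain).
    domain = [1] * n_known + [0] * (domain_size - n_known)
    scores = []
    for _ in range(n_tests):
        words = rng.sample(domain, test_length)  # one random 20-word test
        scores.append(100.0 * sum(words) / test_length)
    return scores


spelling_scores = simulate_spelling_tests()
print(min(spelling_scores), max(spelling_scores))  # observed scores differ across tests
print(statistics.fmean(spelling_scores))           # the average stays near the true 70%
```

Her ability never changes in this simulation; only the sample of words does, so all of the variation in observed scores is domain sampling error.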
9 Reliability – the basics. Other things can produce error in our measurement. E.g., on the first day that we test Sarah she’s tired, but on the second day she’s rested. This would lead to different scores on the two days.
10 Estimating Sarah’s spelling ability… Conclusion: O = T + e, but e1 ≠ e2 ≠ e3 … The variation in Sarah’s scores is produced by measurement error. How can we measure such effects, that is, how can we measure reliability? Note: we are assuming that over the period of testing, T remains constant. This may not be true, for example if Sarah is 8 years old and we test her in September and the following June. But it should be true if she is 21 and we test her today and tomorrow.
11 Reliability – the basics. In what follows, we consider various sources of error in measurement. Different ways of measuring reliability are sensitive to different sources of error.
12 How do we deal with sources of error? Error due to test items: domain sampling error.
13 How do we deal with sources of error? Error due to test items. Error due to testing occasions: time sampling error.
14 How do we deal with sources of error? Error due to test items. Error due to testing occasions. Error due to testing multiple traits: internal consistency error.
15 Domain Sampling error. A knowledge base or skill set containing many items is to be tested, e.g., the chemical properties of foods. We can’t test the entire set of items, so we select a sample of items. That produces domain sampling error, as in Sarah’s spelling test.
16 Domain Sampling error. There is a “domain” of knowledge to be tested. A person’s score may vary depending upon what is included or excluded from the test.
17 Domain Sampling error. Smaller sets of items may not test the entire knowledge base. Larger sets of items should do a better job of covering the whole knowledge base. As a result, the reliability of a test increases with the number of items on that test. $r_{1t} = \sqrt{\bar{r}_{1j}}$: the reliability of a score is given by the square root of the average correlation between the test that produced it and all other randomly parallel tests from the domain. As the number of items on a test increases, it tests a more complete range of items and so should produce a result more correlated with the results of other tests.
18 Domain Sampling error. Parallel Forms Reliability: choose 2 different sets of test items; these 2 sets give you “parallel forms” of the test. Across all people tested, if the correlation between scores on the 2 parallel forms is low, then we probably have domain sampling error. E.g., 2 different random sets of 20 words to test Sarah’s spelling ability. Solution when we have domain sampling error of measurement: develop new sets of items, with a better method. If both forms of a test are given on the same day, a given person’s scores could differ because of (i) form and (ii) random error. If the two forms are given on different days, time sampling is another source of error. This technique is not used much: professors don’t want the extra work of making and using two forms of the same test, and students don’t want to write two forms of one test either.
19 Time Sampling error. Test-retest Reliability: the person taking a test might be having a very good or very bad day, due to fatigue, emotional state, preparedness, etc. Give the same test repeatedly & check correlations among scores. High correlations indicate stability, that is, less influence of bad or good days. Test-retest correlations will decline with time, so there is not one but an infinite number of such correlations. The interval over which the correlation was measured should always be provided in a report.
20 Time Sampling error. The test-retest approach is only useful for traits, that is, characteristics that don’t change over time. Not all low test-retest correlations imply a weak test. Sometimes, the characteristic being measured varies with time (as in learning).
21 Time Sampling error. The interval over which the correlation is measured matters. E.g., for young children, use a very short period (< 1 month, in general). In general, the interval should not be > 6 months. Over longer periods of time, changes in performance are likely to reflect cumulative changes (development) rather than random error, and to be seen in broad cognitive performance rather than just in narrow test performance.
22 Time sampling error. Test-retest approach advantage: easy to evaluate, using correlation. Disadvantage: carryover & practice effects. Carryover effects: the first testing session influences scores on the second session (e.g., the 1st testing session might spur interest in the topic). Practice effects: the carryover effect involves learning of some kind. The interval between the 1st and 2nd tests is crucial: a short interval means more carryover effects; a long interval means more effects of other kinds.
23 Internal Consistency error. Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes. Would you expect much correlation between scores on the two parts? No, because the two ‘skills’ are unrelated. It is quite possible to be good at social psychology and bad (or good) at mental imagery.
24 Internal Consistency Approach. A low correlation between scores on 2 halves of a test suggests that the test is tapping two different abilities or traits. A good test has high correlations between scores on its two halves. But how should we divide the test in two to check that correlation?
25 Internal Consistency error. Three measures: the split-half method, the Kuder-Richardson formula, and Cronbach’s alpha. All of these assess the extent to which items on a given test measure the same ability or trait.
26 Split-half Reliability. After testing, divide the test items into halves A & B that are scored separately. Check for correlation of results for A with results for B. There are various ways of dividing a test into two: randomly, first half vs. second half, odd-even…
27 Split-half Reliability – a problem. Each half-test is smaller than the whole. Smaller tests have lower reliability (domain sampling error). So, we shouldn’t use the raw split-half reliability to assess reliability for the whole test.
28 Split-half reliability – a problem. We correct the reliability estimate using the Spearman-Brown formula: $r_e = \dfrac{2 r_c}{1 + r_c}$, where $r_e$ = estimated reliability for the whole test and $r_c$ = computed reliability (correlation between scores on the two halves A and B).
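The split-half procedure and the Spearman-Brown correction can be sketched in a few lines. This is a minimal sketch assuming an odd-even split and a small made-up response matrix (one row per person, one column per item, 1 = correct); the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined if either list has no variance


def spearman_brown(r_half):
    """Estimated whole-test reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """Odd-even split, correlate the two halves, then apply the correction."""
    half_a = [sum(row[0::2]) for row in item_scores]  # items 1, 3, 5, ...
    half_b = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(half_a, half_b))


# Made-up data: 4 people, 4 right/wrong items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(split_half_reliability(responses))
```

For this toy matrix the half-test correlation is 9/11 (about 0.82) and the corrected estimate is about 0.90, which shows the correction raising the raw split-half value.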
29 Kuder-Richardson 20. Kuder & Richardson (1937): an internal-consistency measure that doesn’t require arbitrary splitting of the test into 2 halves. KR-20 avoids problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves. Note: KR-20 is for situations where test items are dichotomous (e.g., right or wrong).
30 Kuder-Richardson 20. The formula contains two basic terms. The first is a measure of all the variance in the whole set of test results. “Variance” is a technical term: a measure of how much, on average, the scores in the data set differ from each other.
31 Kuder-Richardson 20. The formula contains two basic terms. The second is “item variance”. When items measure the same trait, they co-vary (the same people get them right or wrong). More co-variance means that item variance makes up a smaller share of the total variance, and KR-20 is larger. Don’t worry about the KR-20 formula!
32 Internal Consistency – Cronbach’s α. KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false). Cronbach’s α (alpha) generalizes KR-20 to tests with multiple response categories. α is a more generally useful measure of internal consistency than KR-20.
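For readers who do want to see the two terms at work, the sketch below uses the standard textbook forms of KR-20 and Cronbach's α, which the slides do not spell out, so treat the formulas as an assumption: α = k/(k − 1) × (1 − sum of item variances / variance of total scores), with each item variance equal to p(1 − p) for right/wrong items. Population variances are used throughout, and the data are made up.

```python
import statistics


def cronbach_alpha(item_scores):
    """item_scores: one row per person, one column per item."""
    k = len(item_scores[0])                      # number of items
    columns = list(zip(*item_scores))            # scores regrouped by item
    sum_item_var = sum(statistics.pvariance(col) for col in columns)
    total_var = statistics.pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


def kr20(item_scores):
    """KR-20 for items scored 1 or 0; item variance is p * (1 - p)."""
    n = len(item_scores)
    k = len(item_scores[0])
    sum_pq = 0.0
    for col in zip(*item_scores):
        p = sum(col) / n                         # proportion answering correctly
        sum_pq += p * (1 - p)
    total_var = statistics.pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_pq / total_var)


# Made-up data: 5 people, 4 right/wrong items.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(kr20(answers), cronbach_alpha(answers))
```

On 1/0 data the two functions agree (about 0.8 here), which is the sense in which α generalizes KR-20; only `cronbach_alpha` also accepts items with more than two score values.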
33 Review: How do we deal with sources of error?
Approach       | Measures                           | Issues
Test-Retest    | Stability of scores                | Carryover
Parallel Forms | Equivalence & stability            | Effort
Split-half     | Equivalence & internal consistency | Shortened test
KR-20 & α      | Equivalence & internal consistency | Difficult to calculate
Test-Retest: same test given twice with an interval between testings. Parallel Forms: equivalent tests given with an interval between testings. Internal Consistency: one test given at one time.
34 Reliability in Observational Studies. Some psychologists collect data by observing behavior rather than by testing. This approach requires time sampling, leading to sampling error. Further error is due to observer failures and inter-observer differences.
35 Reliability in Observational Studies. Deal with the possibility of failure in the single-observer situation by having more than 1 observer. Deal with inter-observer differences using inter-rater reliability and the Kappa statistic.
36 Reliability in Observational Studies. Inter-rater reliability: % agreement between 2 or more observers. Problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess! This means that % agreement may over-estimate inter-rater reliability.
37 Reliability in Observational Studies. Kappa Statistic (Cohen, 1960): estimates actual inter-rater agreement as a proportion of potential inter-rater agreement after correction for chance. Somewhat controversial: some researchers say that Kappa should not be used to assess the amount of agreement between observers, only whether the observers are independent of each other in their ratings.
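Percent agreement and Cohen's kappa for two observers can be compared directly. The sketch below assumes the usual two-rater form, kappa = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's category proportions; the ratings are invented.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of cases on which the two raters give the same code."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement: both raters pick the same category independently.
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e equals 1


# Made-up codes from two observers for ten observation periods.
rater_a = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "yes", "no"]
rater_b = ["yes", "yes", "yes", "no", "no", "no", "no", "yes", "yes", "no"]

print(percent_agreement(rater_a, rater_b))
print(cohen_kappa(rater_a, rater_b))
```

Here the observers agree on 80% of cases, but half of that agreement would be expected by guessing, so kappa comes out at about 0.6.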
38 Using Reliability Information. Standard error of measurement (SEM): estimates the extent to which a test score misrepresents a true score. $SEM = S\sqrt{1 - r}$, where S = standard deviation of the test scores and r = reliability coefficient. A small SEM means the test score is probably very close to the true score; a large SEM means the test score may be quite distant from the true score.
39 Standard Error of Measurement. We use the SEM to compute a confidence interval for a particular test score. The interval is centered on the test score. We have confidence that the true score falls in this interval. E.g., 95% of the time the true score will fall within 1.96 SEM either way of the test (observed) score. (Interval estimators vs. point estimators.)
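The SEM and the 95% interval can be computed as follows. This sketch assumes the standard formula SEM = S × sqrt(1 − r) and uses invented values (S = 15, r = 0.91, observed score = 100); the function names are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = S * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval(observed, sd, reliability, z=1.96):
    """Interval centered on the observed score; z = 1.96 gives 95%."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


# Assumed values, chosen only for illustration.
sem = standard_error_of_measurement(sd=15.0, reliability=0.91)
low, high = confidence_interval(observed=100.0, sd=15.0, reliability=0.91)
print(round(sem, 2), round(low, 2), round(high, 2))
```

With these values the SEM is 4.5, so the 95% interval runs from about 91.2 to 108.8. A less reliable test with the same S gives a wider interval.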
40 Standard Error of Measurement. A simple way to think of the SEM: suppose we gave one student the same test over and over. Suppose, too, that no learning took place between tests and the student did not memorize questions. The standard deviation of the resulting set of test scores (for this one student) would be the standard error of measurement.
41 What to do about low reliability. Increase the number of items. To find how many you need, use the Spearman-Brown formula. Using more items may introduce new sources of error, such as fatigue and boredom.
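The number of items needed can be sketched with the general Spearman-Brown formula. It is assumed here in its usual form: predicted reliability = n × r / (1 + (n − 1) × r), where n is the factor by which the test is lengthened. Solving for n gives n = r_desired × (1 − r_current) / (r_current × (1 − r_desired)). The example values are made up.

```python
import math


def predicted_reliability(r_current, length_factor):
    """Reliability expected after changing test length by length_factor."""
    return length_factor * r_current / (1 + (length_factor - 1) * r_current)


def items_needed(r_current, r_desired, n_items_current):
    """Total items needed to reach r_desired, rounded up to a whole item."""
    factor = r_desired * (1 - r_current) / (r_current * (1 - r_desired))
    return math.ceil(factor * n_items_current)


# Assumed example: a 20-item test with reliability 0.60, target 0.80.
n_new = items_needed(r_current=0.60, r_desired=0.80, n_items_current=20)
print(n_new)
print(predicted_reliability(0.60, n_new / 20))
```

With a length factor of 2 the general formula reduces to the split-half correction on slide 28. The calculation assumes the added items are comparable to the existing ones and ignores the fatigue and boredom effects noted on this slide.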
42 What to do about low reliability. Discriminability analysis: find the correlations between each item and the whole test. Delete items with low correlations.
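A discriminability analysis can be sketched as an item-total correlation for each item. The cutoff of 0.2 used to flag items is an arbitrary assumption for illustration, as are the data and function names; the total score here includes the item itself, as the slide describes.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either list has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def item_total_correlations(item_scores):
    """Correlation of each item with the total test score."""
    totals = [sum(row) for row in item_scores]
    return [pearson(list(col), totals) for col in zip(*item_scores)]


def flag_weak_items(item_scores, cutoff=0.2):
    """Indexes (0-based) of items whose item-total correlation is below cutoff."""
    correlations = item_total_correlations(item_scores)
    return [i for i, r in enumerate(correlations) if r < cutoff]


# Made-up data: 5 people, 4 items; the last item runs against the others.
item_data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
print([round(r, 2) for r in item_total_correlations(item_data)])
print(flag_weak_items(item_data))
```

In this toy data set the first three items correlate strongly with the total, while the fourth correlates negatively and is the only one flagged for deletion.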