1 Reliability in Scales Reliability is a question of consistency: do we get the same numbers on repeated measurements? Low reliability: reaction time High reliability: measuring weight Psychological tests fall in between
2 Reliability & Measurement error Measures are not perfectly reliable because they have error The “built in accuracy” of the scale Pokemon wristwatch vs. USN atomic clock We can express this as: X = T + e X = your measurement T = the “True score” e = the error involved in measuring it (+ or -)
3 Example: the effect of e Imagine we have someone with a “true” int score of 100. If your int scale has a large e, then your measurements will vary a lot (say from 60 all the way to 130) If your scale has a small e, your measurements will vary a little (say from 90 to 110)
4 Measurements as distributions Think of e as the variance in a distribution of measurements X, with the true score T as its mean Small e - scores clustered close to true score Large e - scores all over the place! (hard to say what the true score is)
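The X = T + e idea can be sketched as a small simulation. This is only an illustration: the normally distributed error, the error sizes (3 and 15) and the function name `simulate_measurements` are assumptions, not part of the slides.

```python
import random
import statistics

def simulate_measurements(true_score, error_sd, n=1000, seed=0):
    """Draw n measurements X = T + e, assuming e ~ Normal(0, error_sd)."""
    rng = random.Random(seed)
    return [true_score + rng.gauss(0, error_sd) for _ in range(n)]

# Same true score T = 100, measured with a small-e and a large-e scale.
small_e = simulate_measurements(true_score=100, error_sd=3)
large_e = simulate_measurements(true_score=100, error_sd=15)

for label, scores in [("small e", small_e), ("large e", large_e)]:
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores)
    print(f"{label}: mean = {mean:.1f}, standard deviation = {spread:.1f}")
```

Both sets of scores centre near the true score, but the large-e scores are spread much more widely, so any single measurement says less about T.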
5 More on the error Measures with a large e are dodgy (hides the true score) We can reduce the size of e, but not eliminate it completely Measuring reliability is measuring the impact of e
6 Different forms of reliability Reliability (“effect of e”) can be very hard to conceptualise To help, we break it up into 2 subclasses Temporal stability If I measure you today and tomorrow, do I get the same result? Internal consistency Are all the questions in the test measuring the same thing?
7 Temporal stability The big idea: If I test you now, and then I test you tomorrow, I should get the same result Why have it? Can’t measure changes otherwise! Tells us that we can trust results (small time error) Tells us that there is no learning effect
8 Measuring temporal stability How can we measure if a test is temporally stable? The problem: we have 2 sets of scores. We need to see if they are the same Solution: Use a correlation. If the two sets are strongly related, then they are basically the same
9 Example: Correlations & Stability Imagine a test with ten questions, and a person does it twice (on Monday and Wednesday): M: W: Are these scores the same? (r = 0.897)
10 Example: correlations & stability Now imagine a crappy scale: M: W: Are these scores basically the same (r = 0.211)
11 Different approaches to stability There are two main ways of testing temporal stability Test-retest method: give the same test to the same people Alternate forms: give a highly similar test to the same people
12 Test-retest method Method: 1. Select a group of people 2. Give them your test 3. Get them to come back later 4. Give them the test again 5. Correlate the results to see if they agree
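The last step of the method can be sketched as a Pearson correlation between the two sessions. The scores below are made up for illustration, and `pearson_r` is a hand-written helper rather than a library call.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 scores)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical total scores for the same eight people, tested twice.
first_session = [12, 15, 9, 20, 17, 11, 14, 18]
second_session = [13, 14, 10, 19, 18, 10, 15, 17]

r = pearson_r(first_session, second_session)
print(f"test-retest r = {r:.3f}")
print("temporally stable" if r > 0.85 else "not temporally stable")
```

The two lists must be in the same person order, since the correlation pairs each person's first score with their own second score.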
13 Things to note It must be the same people We want to know that if client X returns, we can measure that person again The amount of time between tests depends on your requirements The correlation value must be very high - above 0.85
14 Why it works We get 2 results from each person to compare this means we can rely on the test to work for the same people We use a lot of people in our assessment this means that we can rely on the test, regardless of who our client is The correlation tells us the degree to which the 2 tests agree (r² is the % they agree)
15 The learning effect What if you have a test where learning/practice can affect your score? Eg. class test The Test-retest method will always yield poor correlations people will always score higher marks the second time around This will make it look as if temporal stability is poor!
16 Alternate forms reliability Answer: do test-retest, but don’t use the same test twice Use a highly similar test In order for this to work, both forms must be equally difficult The more similar, the better
17 Making alternate forms of a test Simple to ensure both forms are equally difficult Make twice as many questions as you will want in the test Randomly divide them up into two halves Each half is a test! The random division ensures both forms are equally difficult
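A minimal sketch of the random division, assuming the item pool is just a list of question labels (the labels and the function name `make_alternate_forms` are hypothetical):

```python
import random

def make_alternate_forms(item_pool, seed=None):
    """Randomly divide an even-sized item pool into two equally long forms."""
    if len(item_pool) % 2 != 0:
        raise ValueError("the pool needs an even number of items")
    rng = random.Random(seed)
    shuffled = list(item_pool)   # copy, so the original pool is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Twice as many questions as one form needs: 20 items for two 10-item forms.
item_pool = [f"Q{i}" for i in range(1, 21)]
form_a, form_b = make_alternate_forms(item_pool, seed=42)
print("Form A:", form_a)
print("Form B:", form_b)
```

With a small pool, a random split balances difficulty only on average, so comparing the mean score on each form is a reasonable extra check.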
18 The procedure: alternate forms Once you have your 2 forms: Collect a sample of people Give them the first form of the test wait a while Give them the second form of the test Correlate the results If the correlation is high (>.85), you have stability
19 Which to use: alternate forms or test-retest? If you are measuring something which can be learned/perfected by practice - alternate forms If not, you could choose Test-retest is preferable, removes confound about difficulty In many cases, you don’t really know if learning is an issue Alternate forms is “safer”, but poorer statistically
20 What if you don’t have temporal stability? Temporal stability is not required for all tests Most important for tests which work longitudinally Very important if you want to track changes over time Excludes all “once off” tests (eg. aptitude tests)
21 Internal consistency A different type of reliability The big idea: Are all the questions in my test tapping into the same thing? (or, are some questions irrelevant) All tests require this property
22 Why it’s important Imagine we have an arithmetic ability test, with 4 questions: 1. What is 5 x 3 2. What is ... 3. What is the capital of the Ukraine 4. What is 5 x 2 + 3
23 Why it’s important Item 3 does not contribute to measuring arithmetic Someone who is a maths wiz (should get 4/4) might only get 3/4 A complete maths idiot (should get 0/4) could get 1/4 It does not belong in this test! If we include it in our total, it will confuse us Items such as this become “third variables”
24 How do we know if an item belongs? We need to figure out if a particular item is testing the same thing as the others We can correlate the item’s scores with the scores of some other item we do know belongs High correlation (above 0.85) - it tests the same thing Low correlation (below 0.85) - it measures something else
25 Our example again Some people who know maths, will also know geography But not everyone! Correlate Q1 to Q3 - it will be weak Those who know arithmetic will know how to do the other items Correlate Q1 to Q2 or Q4, all will give a high correlation
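The item check can be sketched by correlating Q1 with each other item. The right/wrong answers below are invented for ten hypothetical people, and `pearson_r` is a hand-written helper; with so few people the correlations only illustrate the pattern and are not expected to reach the 0.85 cut-off.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when an item has no variance")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical answers (1 = correct, 0 = wrong); each list holds one item's
# scores for the same ten people, in the same order.
answers = {
    "Q1": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # arithmetic
    "Q2": [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],  # arithmetic
    "Q3": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # capital-city question
    "Q4": [1, 1, 1, 1, 1, 0, 0, 0, 1, 0],  # arithmetic
}

# Correlate the item we trust (Q1) with every other item.
item_r = {item: pearson_r(answers["Q1"], answers[item]) for item in ("Q2", "Q3", "Q4")}
for item, value in item_r.items():
    print(f"Q1 vs {item}: r = {value:.2f}")
```

In this made-up data Q3 shows the weakest relationship with Q1, which is the signal that it may not belong in the test.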
26 Doing it for real Problem: how do we know which items are suspect? Any item could be at fault Not always obvious Solution - check them all Split half method Cronbach’s Alpha
27 Split half approach Basic idea: check one half of the test against the other half If first half correlates well to the other half, then they are tapping into the same thing Problem to overcome: each half of the test must be the same difficulty
28 Split half - procedure Give a bunch of people your test Decide on how to split the test in half Correlate the halves If the correlation is high (above 0.85), the test is reliable
29 Where to split? Problem: how do we split the test? First 10 Q vs last 10? Odd numbered Q vs Even numbered Q? Any method is acceptable, as long as the halves are of equivalent difficulty How do you show that? Not by correlation - paradox! (low r could be difficulty or reliability!)
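An odd/even split can be sketched as follows. The six people and their 1-5 item scores are hypothetical, and the helper names `pearson_r` and `split_half_r` are made up for this sketch; the code assumes the two halves are of equivalent difficulty, which it does not check.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

def split_half_r(score_rows):
    """Correlate each person's total on odd-numbered items with their total on even-numbered items."""
    odd_totals = [sum(row[0::2]) for row in score_rows]   # Q1, Q3, Q5, ...
    even_totals = [sum(row[1::2]) for row in score_rows]  # Q2, Q4, Q6, ...
    return pearson_r(odd_totals, even_totals)

# Hypothetical data: one row per person, one column per item (scored 1 to 5).
scores = [
    [5, 4, 5, 4, 5, 5],
    [4, 4, 3, 4, 4, 3],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [1, 2, 1, 1, 2, 1],
    [4, 5, 4, 5, 4, 4],
]

r_halves = split_half_r(scores)
print(f"split-half r = {r_halves:.3f}")
print("reliable" if r_halves > 0.85 else "not reliable")
```

A first-half versus last-half split would only change the two slices inside `split_half_r`.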
30 Cronbach’s coefficient A major problem with split-half approach How do you know that inside a half there aren’t a few bad items? Catches most, but not all Solution: Select another half to split at But: if you have the same number of bad items in each half, they balance out - hidden!
31 The splitting headache Imagine you have a few bad items, evenly spread in the test: (Black bars are bad items) If you use a first 3/ last 3 split, you end up with one bad item in each half, so they are balanced out (hidden) If you use an even/odd split, they are balanced out as well (hidden) How do you split?
32 A solution to splitting Remember: we don’t know which the bad ones are Can’t make bizarre splits to work around them Solution: brute force! Work out the correlations between every possible split, and average them out!
33 Cronbach’s α Not to be confused with α (prob of Type I error) from significance tests! Works out the correlation between each half and each other half, and averages them out Impossible for bad items to “hide” by balancing out
34 Interpreting Cronbach’s α Gives numbers between 0 and 1 Needs to be very high (above 0.9) It is a measure of homogeneity of the test If your test is designed to measure more than one thing, the score will be low
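In practice the coefficient is usually computed from item variances rather than by enumerating every split. The sketch below uses that standard variance-based formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), which is a swapped-in shortcut and not the brute-force averaging described on the slides. The data and the function name `cronbach_alpha` are hypothetical.

```python
import statistics

def cronbach_alpha(score_rows):
    """Cronbach's alpha from a people-by-items table of scores."""
    k = len(score_rows[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_columns = list(zip(*score_rows))  # one tuple of scores per item
    item_variances = [statistics.variance(col) for col in item_columns]
    total_variance = statistics.variance([sum(row) for row in score_rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: one row per person, one column per item (scored 1 to 5).
scores = [
    [5, 4, 5, 4, 5, 5],
    [4, 4, 3, 4, 4, 3],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [1, 2, 1, 1, 2, 1],
    [4, 5, 4, 5, 4, 4],
]

alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha = {alpha:.3f}")
print("homogeneous enough" if alpha > 0.9 else "items may be measuring different things")
```

Because every item enters the calculation on its own, bad items cannot cancel each other out the way they can inside a single split.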
35 Other forms of reliability Kuder-Richardson formula 20 (KR20) Like Cronbach’s alpha, but specialized for correct/incorrect type answers Inter-scorer reliability for judgement tests to what degree do several judges agree on the answer expressed as a correlation
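For correct/incorrect items, KR20 can be sketched with its usual textbook formula, KR20 = k/(k-1) × (1 − sum of p×q / variance of total scores), where p is the proportion of people answering an item correctly and q = 1 − p. The answers below are invented, the function name `kr20` is hypothetical, and the population variance of the totals is used here as an assumption (some texts use the sample variance instead).

```python
import statistics

def kr20(answer_rows):
    """Kuder-Richardson formula 20 for a people-by-items table of 0/1 answers."""
    k = len(answer_rows[0])  # number of items
    n = len(answer_rows)     # number of people
    if k < 2:
        raise ValueError("KR20 needs at least two items")
    # p = proportion correct per item, q = 1 - p
    proportions = [sum(col) / n for col in zip(*answer_rows)]
    sum_pq = sum(p * (1 - p) for p in proportions)
    total_variance = statistics.pvariance([sum(row) for row in answer_rows])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)

# Hypothetical answers: one row per person, 1 = correct, 0 = incorrect.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

kr20_value = kr20(answers)
print(f"KR20 = {kr20_value:.3f}")
```

The result is read the same way as Cronbach's alpha: the higher the value, the more the items hang together.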