Chapter 4 – Reliability

Presentation transcript:

1 Chapter 4 – Reliability
1. Observed Scores and True Scores
2. Error
3. How We Deal with Sources of Error:
   A. Domain sampling – test items
   B. Time sampling – test occasions
   C. Internal consistency – traits
4. Reliability in Observational Studies
5. Using Reliability Information
6. What To Do about Low Reliability

2 Chapter 4 – Reliability
Measurement of human ability and knowledge is challenging because:
ability is not directly observable – we infer ability from behavior
all behaviors are influenced by many variables, only a few of which matter to us

3 Observed Scores
O = T + e
where O = observed score, T = true score, e = error

4 Reliability – the basics
1. A true score on a test does not change with repeated testing.
2. A true score would be obtained if there were no error of measurement.
3. We assume that errors are random (equally likely to increase or decrease any test result).

5 Reliability – the basics
Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors).
The mean of many observed scores for one person will be the person's true score.
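A minimal sketch in Python (not part of the original slides) of this idea: the true score of 80, the error SD of 5, and the 1,000 repeated testings are invented illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
true_score = 80.0                                    # hypothetical true score
errors = rng.normal(loc=0.0, scale=5.0, size=1000)   # random errors with mean zero
observed = true_score + errors                       # O = T + e for 1000 testings

print(observed[:3])      # individual observed scores bounce around 80
print(observed.mean())   # the mean of many observed scores is close to the true score
```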

6 Reliability – the basics
Example: measuring Sarah's ability to spell English words. We can't ask her to spell every word in the OED, so…
Ask Sarah to spell a subset of English words.
Her % correct estimates her true English spelling skill.
But which words should be in our subset?

7 Estimating Sarah’s spelling ability… Suppose we choose 20 words randomly… What if, by chance, we get a lot of very easy words – cat, tree, chair, stand… Or, by chance, we get a lot of very difficult words – desiccate, arteriosclerosis, numismatics

8 Estimating Sarah’s spelling ability… Sarah’s observed score varies as the difficulty of the random sets of words varies, but presumably her actual spelling ability remains constant.

9 Reliability – the basics
Other things can produce error in our measurement. E.g., on the first day that we test Sarah she’s tired, but on the second day she’s rested…

10 Estimating Sarah’s spelling ability…
Conclusion: O = T + e, but e₁ ≠ e₂ ≠ e₃ …
The variation in Sarah’s scores is produced by measurement error.
How can we measure such effects – how can we measure reliability?

11 Reliability – the basics In what follows, we consider various sources of error in measurement. Different ways of measuring reliability are sensitive to different sources of error.

12 How do we deal with sources of error?
Error due to test items → Domain sampling error

13 How do we deal with sources of error?
Error due to test items → Domain sampling error
Error due to testing occasions → Time sampling error

14 How do we deal with sources of error?
Error due to test items → Domain sampling error
Error due to testing occasions → Time sampling error
Error due to testing multiple traits → Internal consistency error

15 Domain Sampling error
A knowledge base or skill set containing many items is to be tested, e.g., chemical properties of foods.
We can’t test the entire set of items, so we sample items.
That produces sampling error, as in Sarah’s spelling test.

16 Domain Sampling error
Smaller sets of items may not test the entire knowledge base.
A person’s score may vary depending upon what is included in or excluded from the test.
Reliability increases with the number of items on a test.

17 Domain Sampling error
Parallel Forms Reliability: choose 2 different sets of test items.
Across all people tested, if the correlation between scores on the 2 item sets is low, then we probably have domain sampling error.

18 Time Sampling error
Test-retest Reliability
A person taking the test might be having a very good or very bad day – due to fatigue, emotional state, preparedness, etc.
Give the same test repeatedly & check correlations among scores.
High correlations indicate stability – less influence of bad or good days.
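A small illustrative sketch (not from the slides), assuming we have paired scores for the same people on two occasions; the numbers are invented, and the test-retest coefficient is simply the Pearson correlation between the two sets of scores.

```python
import numpy as np

scores_time1 = np.array([12, 18, 25, 30, 22, 15, 28, 20])   # first testing occasion
scores_time2 = np.array([14, 17, 27, 29, 20, 16, 27, 21])   # second testing occasion

r_test_retest = np.corrcoef(scores_time1, scores_time2)[0, 1]
print(round(r_test_retest, 2))   # a high correlation suggests scores are stable over time
```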

19 Time sampling error
Advantage: easy to evaluate, using correlation.
Disadvantage: carryover & practice effects.

20 Internal Consistency error
Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.
Would you expect much correlation between scores on the two parts?
No – because the two ‘skills’ are unrelated.

21 Internal Consistency Approach
A low correlation between scores on 2 halves of a test suggests that the test is tapping two different abilities or traits.
A good test has high correlations between scores on its two halves.
But how should we divide the test in two to check that correlation?

22 Internal Consistency error
Split-half method
Kuder-Richardson formula
Cronbach’s alpha
All of these assess the extent to which items on a given test measure the same ability or trait.

23 Split-half Reliability
After testing, divide test items into halves A & B that are scored separately.
Check for correlation of results for A with results for B.
Various ways of dividing the test into two – randomly, first half vs. second half, odd-even…

24 Split-half Reliability – a problem
Each half-test is smaller than the whole.
Smaller tests have lower reliability (domain sampling error).
So, we shouldn’t use the raw split-half reliability to assess reliability for the whole test.

25 Split-half reliability – a problem
We correct the reliability estimate using the Spearman-Brown formula:
r_e = 2r_c / (1 + r_c)
r_e = estimated reliability for the whole test
r_c = computed reliability (correlation between scores on the two halves A and B)
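A brief sketch of the split-half procedure with the Spearman-Brown correction, using an invented 0/1 item matrix and an odd-even split (one of several possible splits).

```python
import numpy as np

scores = np.array([            # rows = people, columns = items (1 = correct, 0 = wrong)
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
])

half_a = scores[:, 0::2].sum(axis=1)   # score on odd-numbered items
half_b = scores[:, 1::2].sum(axis=1)   # score on even-numbered items

r_c = np.corrcoef(half_a, half_b)[0, 1]   # computed (half-test) reliability
r_e = 2 * r_c / (1 + r_c)                 # Spearman-Brown estimate for the full test
print(round(r_c, 2), round(r_e, 2))
```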

26 Kuder-Richardson 20
Kuder & Richardson (1937): an internal-consistency measure that doesn’t require arbitrary splitting of the test into 2 halves.
KR-20 avoids problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves.

27 Kuder-Richardson 20 The formula contains two basic terms: 1. a measure of all the variance in the whole set of test results.

28 Kuder-Richardson 20 The formula contains two basic terms: 2. “item variance” – when items measure the same trait, they co-vary (the same people get them right or wrong). The more the items co-vary, the smaller the summed item variances are relative to the total variance.

29 Internal Consistency – Cronbach’s α
KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
Cronbach’s α (alpha) generalizes KR-20 to tests with multiple response categories.
α is a more generally useful measure of internal consistency than KR-20.
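A short sketch of Cronbach’s α on an invented item-score matrix; with 0/1 (right/wrong) items like these, α is numerically equivalent to KR-20.

```python
import numpy as np

items = np.array([             # rows = people, columns = items
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
], dtype=float)

k = items.shape[1]                          # number of items
item_vars = items.var(axis=0, ddof=1)       # variance of each item
total_var = items.sum(axis=1).var(ddof=1)   # variance of total test scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))   # higher alpha -> items tend to measure the same trait
```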

30 Review: How do we deal with sources of error?
Approach       | Measures                           | Issues
Test-Retest    | Stability of scores                | Carryover
Parallel Forms | Equivalence & stability            | Effort
Split-half     | Equivalence & internal consistency | Shortened test
KR-20 & α      | Equivalence & internal consistency | Difficult to calculate

31 Reliability in Observational Studies
Some psychologists collect data by observing behavior rather than by testing.
This approach requires time sampling, leading to sampling error.
Further error due to:
- observer failures
- inter-observer differences

32 Reliability in Observational Studies
Deal with the possibility of failure in the single-observer situation by having more than 1 observer.
Deal with inter-observer differences using:
- Inter-rater reliability
- Kappa statistic

33 Reliability in Observational Studies
Inter-rater reliability: % agreement between 2 or more observers.
- Problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
- This means that % agreement may over-estimate inter-rater reliability.
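A tiny simulation (invented scenario) of this point: two raters who guess at random in a 2-choice task still agree about half the time, so raw % agreement overstates their reliability.

```python
import numpy as np

rng = np.random.default_rng(1)
rater1 = rng.integers(0, 2, size=10_000)   # rater 1 guesses yes/no at random
rater2 = rng.integers(0, 2, size=10_000)   # rater 2 guesses independently

print((rater1 == rater2).mean())   # roughly 0.50 agreement from pure guessing
```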

34 Reliability in Observational Studies
Kappa Statistic (Cohen, 1960) estimates actual inter-rater agreement as a proportion of potential inter-rater agreement after correction for chance.
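A minimal sketch of the kappa computation for two raters, using invented category labels; the chance-agreement term is built from each rater's own category proportions.

```python
import numpy as np

rater1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])   # rater 1's judgments
rater2 = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])   # rater 2's judgments

p_observed = (rater1 == rater2).mean()   # raw proportion of agreement

# chance agreement: probability both raters pick the same category by accident
p_chance = sum(
    (rater1 == c).mean() * (rater2 == c).mean()
    for c in np.unique(np.concatenate([rater1, rater2]))
)

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 2))   # agreement beyond chance, as a proportion of possible agreement
```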

35 Using Reliability Information
Standard error of measurement (SEM) estimates the extent to which a test score misrepresents a true score.
SEM = S√(1 – r), where S = standard deviation of the test scores and r = reliability of the test.

36 Standard Error of Measurement
We use the SEM to compute a confidence interval for a particular test score.
The interval is centered on the test score.
We have confidence that the true score falls in this interval.
E.g., 95% of the time the true score will fall within 1.96 SEM either way of the test (observed) score.
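A small worked sketch of this interval; the test SD (10), reliability (0.84), and observed score (75) are invented values used only to show the arithmetic.

```python
import math

s, r, observed = 10.0, 0.84, 75.0   # invented SD, reliability, and observed score

sem = s * math.sqrt(1 - r)                              # SEM = S * sqrt(1 - r)
lower, upper = observed - 1.96 * sem, observed + 1.96 * sem
print(round(sem, 1), (round(lower, 1), round(upper, 1)))   # 4.0 and (67.2, 82.8)
```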

37 What to do about low reliability
Increase the number of items.
To find how many you need, use the Spearman-Brown formula.
Using more items may introduce new sources of error, such as fatigue and boredom.
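A sketch of the general Spearman-Brown formula used "in reverse" to estimate how much longer a test would have to be to reach a target reliability; the current reliability (0.60), target (0.80), and current length (20 items) are invented values.

```python
r_current, r_target, current_items = 0.60, 0.80, 20   # invented example values

# lengthening factor n: r_target = n*r_current / (1 + (n - 1)*r_current), solved for n
n = (r_target * (1 - r_current)) / (r_current * (1 - r_target))
print(round(n, 2), round(n * current_items))   # need roughly n times as many items
```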

38 What to do about low reliability
Discriminability analysis:
Find correlations between each item and the whole test.
Delete items with low correlations.
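A simple sketch of this item-total approach on invented data; the 0.2 cutoff for flagging an item is an arbitrary illustrative threshold, not a rule from the slides.

```python
import numpy as np

items = np.array([             # rows = people, columns = items
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [1, 0, 1, 1, 1, 0],
], dtype=float)

total = items.sum(axis=1)      # whole-test score for each person
for j in range(items.shape[1]):
    r_item_total = np.corrcoef(items[:, j], total)[0, 1]   # item-total correlation
    flag = "  <- candidate for deletion" if r_item_total < 0.2 else ""
    print(f"item {j}: r = {r_item_total:.2f}{flag}")
```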