
Introduction to Measurement Theory
Liu Xiaoling, Department of Psychology, ECNU
Email: xlliu@psy.ecnu.edu.cn

Chapter 5 Reliability
§1 Theory of Reliability
Interpretation of Reliability: Reliability refers to the degree of consistency or reproducibility of measurements (or test scores).

Acceptable Reliability Coefficients for Types of Tests
Ability or aptitude tests, achievement tests: .90 and above
Personality, interest, values, and attitude tests: .80 and above

EXAMPLES
Stanford-Binet Fifth Edition: full-scale IQ (23 age ranges), .97-.98; test-retest reliability coefficients for verbal and nonverbal subtests, from the high .7's to the low .9's.
WISC-IV: split-half reliability for full-scale IQ, .97.
WAIS-III: average split-half reliability for full-scale IQ, .98; .97 for verbal IQ; .94 for performance IQ.
Thurstone's attitude scales: .80-.90.
Rosenberg's Self-Esteem Scale (1965): α = .77-.88; test-retest, .85.

Errors: Inconsistent and Inaccurate Effects
Measurement error refers to the inconsistent and inaccurate effects caused by variable factors that are unrelated to the objective of measurement.
Three types: random, systematic, and sampling error.

Random Error
An error due to chance alone, randomly distributed around the objective value. Random errors reduce both the consistency and the accuracy of test scores.

Systematic Error
An error in the data that is regular and repeatable, arising from improper collection or statistical treatment of the data. Systematic errors do not result in inconsistent measurement, but they do cause inaccuracy.

Sampling Error
Deviations of the summary values yielded by samples from the values yielded by the entire population.

Classical True Score Theory
Basic formula: X = T + E, where
X is an individual's observed score,
T is the individual's true score,
E is a random error score (error of measurement).
Founders: Charles Spearman (1904, 1907, 1913); J. P. Guilford (1936).

CONCEPTION: True Score
CTT assumes that each person has a true score that would be obtained if there were no errors in measurement.
Interpretation: the average of all the observed scores obtained over an infinite number of repeated testings with the same test.

Table 5.1 One Measurement's Data
Observed Score X   True Score T   Error E
       12              10            2
       19              20           -1
       27              30           -3
       41              40            1
       51              50            1
Sum:  150             150            0
Variance: 203.2       200.0          3.2
Note that 203.2 = 200 + 3.2, i.e., Sx² = St² + Se²; the error SD is √3.2 ≈ 1.8.

Three Principles
1. The mean of the error scores for a population of examinees is zero.
2. The correlation between true and error scores for a population of examinees is zero.
3. The correlation between error scores from two independent testings is zero.

Reliability Coefficient
The reliability coefficient can be defined as the correlation between scores on parallel test forms.
Mathematical definition: the reliability coefficient is the ratio of true score variance to observed score variance:
rtt = St² / Sx²    (5.1)

Since Sx² = St² + Se², the reliability coefficient can also be written as
rtt = 1 - Se² / Sx²    (5.2)
As Se increases, rtt decreases if St does not vary.
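As an illustration (not from the original slides), a minimal Python simulation under the CTT assumptions shows the variance decomposition at work; the population parameters (true score SD 10, error SD 5) are arbitrary choices:

```python
import random
import statistics

random.seed(1)

# Simulate a population under classical true score theory: X = T + E.
N = 100_000
true_scores = [random.gauss(50, 10) for _ in range(N)]   # T
errors = [random.gauss(0, 5) for _ in range(N)]          # E, mean zero
observed = [t + e for t, e in zip(true_scores, errors)]  # X = T + E

var_t = statistics.pvariance(true_scores)
var_e = statistics.pvariance(errors)
var_x = statistics.pvariance(observed)

# Reliability as the ratio of true score variance to observed score variance.
r_tt = var_t / var_x
print(f"var(T)={var_t:.1f}  var(E)={var_e:.1f}  var(X)={var_x:.1f}")
print(f"reliability = var(T)/var(X) = {r_tt:.3f}")   # ~ 100/125 = .80
```

Computing 1 - var_e/var_x instead gives essentially the same value, consistent with formula 5.2.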

§2 Sources of Random Errors
Sources from tests
Sources from test administration and scoring
Sources from examinees

Sources from Tests
Item sampling lacks representativeness.
Item format is improper.
Item difficulty is too high or too low.
The meaning of sentences is unclear.
The test time limit is too short.

Sources from Test Administration and Scoring
Test conditions are negative.
The examiner affects examinees' performance.
Unexpected disturbances occur.
Scoring is not objective; counting is inaccurate.

Sources from Examinees
Motivation for the test
Negative emotions (e.g., anxiety)
Health
Learning, development, and education
Experience with tests

§3 Estimating the Reliability Coefficient
Test-retest reliability (coefficient of stability)
Alternate-forms reliability (coefficient of equivalence)
Coefficients of internal consistency
Scorer reliability (inter-rater consistency)

Test-Retest Reliability
Also called the coefficient of stability: the correlation between test scores obtained by administering the same form of a test on two separate occasions to the same examinee group.
TEST → interval → RETEST (the same examinees)

Review: correlation. Figure 5.1 Scatter Plots for Two Variates

Formula for estimating reliability: the Pearson product-moment correlation coefficient
rxy = (N ΣXY - ΣX ΣY) / √[(N ΣX² - (ΣX)²)(N ΣY² - (ΣY)²)]    (5.3)
where X is the test score, Y is the retest score, and N is the sample size.

Application Example
A subjective wellbeing scale was administered to 10 high school students, and half a year later the same scale was administered to them again. Estimate the reliability of the scale.
Table 5.2 Test Scores
Examinee:  1   2   3   4   5   6   7   8   9  10
Test:     16  15  13  13  11  10  10   9   8   7
Retest:   16  16  14  12  11   9  11   8   6   7

Answer: Applying formula 5.3 to the data in Table 5.2 yields rxy ≈ .97.
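For readers who want to reproduce the answer, here is a small Python sketch (the helper name pearson_r is mine) applying formula 5.3 to the Table 5.2 data:

```python
# Test-retest data from Table 5.2 (10 students, subjective wellbeing scale).
x = [16, 15, 13, 13, 11, 10, 10, 9, 8, 7]   # first administration
y = [16, 16, 14, 12, 11, 9, 11, 8, 6, 7]    # retest half a year later

def pearson_r(x, y):
    """Pearson product-moment correlation (formula 5.3)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

print(f"test-retest reliability r = {pearson_r(x, y):.2f}")   # 0.97
```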

Transform of formula 5.3:
rxy = (ΣXY / N - Mx·My) / (Sx·Sy)    (5.4)
where Mx is the mean of the first test scores, My is the mean of the retest scores, Sx is the standard deviation of the first test scores, and Sy is the SD of the retest scores.

Quality of test-retest reliability: estimates the consistency of test scores across a time interval.
Sources of error:
Stability of the trait measured.
Individual differences in development, education, learning, training, memory, etc.
Unexpected disturbances during test administration.

Alternate-Forms Reliability
Also called equivalent- or parallel-forms reliability: the correlation between the test scores obtained by separately administering alternate (equivalent) forms of the test to the same examinees on one occasion.
FORM Ⅰ → (immediately) → FORM Ⅱ (the same examinees)

Application Example
Two alternate forms of a creative ability test were administered to ten seventh-grade students one morning. Table 5.3 shows the results. Estimate the reliability of this test.
Table 5.3 Scores on Two Forms
Examinee:  1   2   3   4   5   6   7   8   9  10
Form A:   20  19  19  18  17  16  14  13  12  10
Form B:   20  20  18  16  15  17  12  11  13   9

Answer: Using formula 5.4 (or, equivalently, 5.3), rxy ≈ .94.

Exercise 1
Use formulas 5.3 and 5.4 independently to estimate the reliability coefficient for the data in the following table.
Examinee:  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Test A:   16  15  13  13  11  10  10  10  10   9   9   8   6   7   6
Test B:   15  15  14  14  12  11  12  10  10  10   9   9   9   3   7

How can we eliminate the effect of the order in which the forms are administered?
Method (counterbalancing):
First, divide the examinees into two parallel groups.
Second, group one receives Form Ⅰ of the test, and group two receives Form Ⅱ.
Third, after a short interval, group one receives Form Ⅱ, and group two receives Form Ⅰ.
Finally, compute the correlation between all examinees' scores on the two forms of the test.

Sources of Error
Whether the two forms of the test are truly parallel or equivalent, e.g., consistency in content sampling, item format, number of items, item difficulty, and the SDs and means of the two forms.
Fluctuations in the individual examinee's state, including emotions, test motivation, health, etc.
Other unexpected disturbances.

Coefficient of Stability and Equivalence
The correlation between the two sets of observed scores when two alternate test forms are administered on two separate occasions to the same examinees.
FORM Ⅰ → interval → FORM Ⅱ (the same examinees)

Coefficients of Internal Consistency
When examinees perform consistently across items within a test, the test is said to have item homogeneity. The internal consistency coefficient is an index of both item content homogeneity and item quality.

Quality: requires only one administration of a single form of the test.
Error sources:
Content sampling.
Fluctuations in the individual examinee's state, including emotions, test motivation, health, etc.

Split-Half Reliability
Procedure: the test developer administers the test to a group of examinees; then divides the items into two subtests, each half the length of the original test; and computes the correlation between the two halves of the test.

Methods of dividing the test into two parallel halves:
1. Assign all odd-numbered items to half 1 and all even-numbered items to half 2.
2. Rank-order the items by difficulty level based on the examinees' responses; then apply method 1.
3. Randomly assign items to the two halves.
4. Assign items to the half-test forms so that the forms are "matched" in content.

Table 5.4 Illustrative Data for Split-Half Reliability Estimation (1 = pass, 0 = fail)
Examinee   Items: 1 2 3 4 5 6 7 8   Xo  Xe  Total Xt
1                 0 0 0 0 0 0 0 0    0   0     0
2                 1 0 0 0 0 0 0 0    1   0     1
3                 1 0 1 0 0 0 0 0    2   0     2
4                 1 1 0 0 1 0 0 0    2   1     3
5                 0 1 0 1 0 0 1 0    1   2     3
6                 1 1 1 0 1 0 1 0    4   1     5
7                 1 1 1 1 1 1 0 0    3   3     6
8                 1 1 1 1 1 1 0 0    3   3     6
9                 1 1 1 1 0 1 0 1    2   4     6
10                1 1 1 1 1 1 1 1    4   4     8
pi:              .8 .7 .6 .5 .5 .4 .3 .2
pi·qi:          .16 .21 .24 .25 .25 .24 .21 .16    (Σ pi·qi = 1.72)
St² = 6.0

Employ formula 5.3 to compute rhh.
Attention: this rhh actually gives the reliability of only a half-test. That is, it underestimates the reliability coefficient for the full-length test.

Employ the Spearman-Brown formula to correct rhh:
rtt = 2·rhh / (1 + rhh)    (5.5)
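A sketch of the full split-half computation in Python, using the half-test scores Xo and Xe from Table 5.4 (variable names are illustrative):

```python
# Split-half reliability for Table 5.4: odd items form half 1, even items
# form half 2; correlate the halves, then apply Spearman-Brown (formula 5.5).
xo = [0, 1, 2, 2, 1, 4, 3, 3, 2, 4]   # Xo: odd-item half-test scores
xe = [0, 0, 0, 1, 2, 1, 3, 3, 4, 4]   # Xe: even-item half-test scores

n = len(xo)
mo, me = sum(xo) / n, sum(xe) / n
sxy = sum((a - mo) * (b - me) for a, b in zip(xo, xe))
sxx = sum((a - mo) ** 2 for a in xo)
syy = sum((b - me) ** 2 for b in xe)
r_hh = sxy / (sxx * syy) ** 0.5        # half-test correlation

r_tt = 2 * r_hh / (1 + r_hh)           # Spearman-Brown corrected estimate
print(f"r_hh = {r_hh:.2f}, corrected r_tt = {r_tt:.2f}")   # .54 -> .70
```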

Spearman-Brown general formula:
rnn = n·r / (1 + (n - 1)·r)    (5.6)
where rnn is the estimated coefficient, r is the obtained coefficient, and n is the number of times the test is lengthened or shortened.

Kuder-Richardson Reliability (Kuder & Richardson): Kuder-Richardson formula 20 (KR20), for dichotomously scored items:
KR20 = [k / (k - 1)] · [1 - (Σ pi·qi) / St²]    (5.7)
where k is the number of items, St² is the total test variance, pi is the proportion of examinees who pass each item, and qi = 1 - pi is the proportion of examinees who do not pass each item.

Employing the data in Table 5.4: KR20 = (8/7)·(1 - 1.72/6.0) ≈ .82.
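The same result can be checked in Python; the sketch below recomputes St², Σpi·qi, and KR20 from the Table 5.4 item responses:

```python
# KR20 (formula 5.7) from the Table 5.4 item responses (1 = pass, 0 = fail).
items = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]  # rows = examinees, columns = items

k, n = len(items[0]), len(items)
totals = [sum(row) for row in items]
mean_t = sum(totals) / n
s_t2 = sum((t - mean_t) ** 2 for t in totals) / n      # population variance
p = [sum(row[j] for row in items) / n for j in range(k)]
sum_pq = sum(pi * (1 - pi) for pi in p)                # sum of pi * qi

kr20 = (k / (k - 1)) * (1 - sum_pq / s_t2)
print(f"St^2 = {s_t2:.1f}, sum(pq) = {sum_pq:.2f}, KR20 = {kr20:.2f}")
# St^2 = 6.0, sum(pq) = 1.72, KR20 = 0.82
```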

Coefficient Alpha (α) (Cronbach, 1951)
α = [k / (k - 1)] · [1 - (Σ si²) / St²]    (5.8)
where St² is the total test variance and si² is the variance of item i.
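A minimal Python sketch of formula 5.8 (the function name and the four-examinee demo data are invented for illustration); on dichotomous items it reproduces the KR20 value above:

```python
def cronbach_alpha(scores):
    """Coefficient alpha (formula 5.8). `scores` is a list of rows, one
    row of item scores per examinee; items may be polytomously scored."""
    n, k = len(scores), len(scores[0])

    def pvar(values):                      # population variance
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    item_vars = [pvar([row[j] for row in scores]) for j in range(k)]
    total_var = pvar([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Example: three 0-10 essay items scored for four examinees (made-up data).
print(round(cronbach_alpha([[8, 7, 9], [5, 6, 4], [9, 9, 10], [3, 2, 4]]), 2))
```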

Exercise 2
Suppose that examinees have been tested on four essay items on which possible scores range from 0 to 10 points, with item variances s1², s2², s3², and s4². If the total score variance is 100, estimate the reliability of the test.

Scorer Reliability (Inter-Rater Consistency)
When a sample of test items is independently scored by two or more scorers or raters, each examinee receives several test scores. There is therefore a need to measure the consistency of the scores over different scorers.

Methods
1. The correlation between the two sets of scores from two scorers (Pearson correlation; Spearman rank correlation).
2. The Kendall coefficient of concordance (for two or more scorers).

Kendall Coefficient of Concordance
W = 12·S / [K²·(N³ - N)]    (5.9)
where K is the number of scorers, N is the number of examinees, Ri is the sum of ranks for each examinee over all scorers, and S is the sum of squared deviations of the Ri about their mean K(N + 1)/2.

Table 5.5 Ranks of 6 Examinees' Essays by 3 Raters
Examinee:  1  2  3  4  5  6
Rater 1:   2  4  1  5  6  3
Rater 2:   3  4  1  5  6  2
Rater 3:   3  5  1  4  6  2
Employ formula 5.9 to compute the scorer reliability. Key: .95
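The key can be verified with a short Python sketch of formula 5.9 applied to the Table 5.5 ranks:

```python
# Kendall's coefficient of concordance (formula 5.9) for Table 5.5.
ranks = [
    [2, 4, 1, 5, 6, 3],   # rater 1's ranks for examinees 1..6
    [3, 4, 1, 5, 6, 2],   # rater 2
    [3, 5, 1, 4, 6, 2],   # rater 3
]
K = len(ranks)                                      # number of raters
N = len(ranks[0])                                   # number of examinees
R = [sum(r[i] for r in ranks) for i in range(N)]    # rank sum per examinee
R_bar = K * (N + 1) / 2                             # mean rank sum
S = sum((ri - R_bar) ** 2 for ri in R)
W = 12 * S / (K ** 2 * (N ** 3 - N))
print(f"Kendall W = {W:.2f}")                       # 0.95
```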

Summing Up
Table 5.6 Sources of Error Variance in Relation to Reliability Coefficients
Type of Reliability Coefficient         Error Variance
Test-retest                             Time sampling
Alternate-form                          Content sampling
Stability and equivalence coefficient   Time and content sampling
Split-half, KR20, and α coefficient     Content sampling and content heterogeneity
Scorer                                  Inter-scorer differences

§4 Factors That Affect Reliability Coefficients
Group homogeneity
Test length
Test difficulty

Group Homogeneity
The magnitude of the reliability coefficient depends on variation among individuals in both their true scores and error scores. When the group is homogeneous, the score range is restricted; consequently the true score variance is restricted, and the reliability coefficient is low.

Figure 5.2 Scatter Plots for Two Variates
Thus, the homogeneity of the examinee group is an important consideration in test development and test selection.

Predicting how reliability is altered when sample variance is altered:
rnew = 1 - (Sold² / Snew²)·(1 - rold)    (5.10)
where rnew is the predicted reliability estimate for the new sample, Snew² is the variance of the new sample, Sold² is the variance of the original sample, and rold is the reliability estimate for the original sample.

Exercise 3
Suppose a memory test has been administered to all middle school students in one city; the standard deviation of the test scores is 20 and the reliability coefficient is .90. If the standard deviation of the test scores for grade-two students alone is 10, predict the reliability coefficient for the grade-two students.
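A quick check of Exercise 3 in Python, assuming formula 5.10 as reconstructed above:

```python
# Exercise 3: predict reliability when score variance shrinks (formula 5.10).
s_old, r_old = 20.0, 0.90      # all middle school students
s_new = 10.0                   # grade-two students only
r_new = 1 - (s_old ** 2 / s_new ** 2) * (1 - r_old)
print(f"predicted reliability = {r_new:.2f}")   # 0.60
```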

Test Length
Which test seems more reliable?
Test A: 1+1=
Test B: 1+1=  2+2=  3+1=  4+2=  3+2=  5+3=  4+4=  1+6=  2+8=  4+5=  3+9=  2+7=
Conclusion: reliability is higher for the test with more items (all based on the same content).

Using the transform of the Spearman-Brown general formula to determine the length of the test:
n = [rdesired·(1 - robtained)] / [robtained·(1 - rdesired)]    (5.11)

Exercise 4
A language test has 10 items and a reliability coefficient of .50. To raise its reliability to .80, how many items should the test developer add to the test?
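A quick Python check of Exercise 4 using formula 5.11: the lengthening factor works out to n = 4, so the test needs 40 items in total, i.e., 30 more.

```python
# Exercise 4: how many times longer must the test be (formula 5.11)?
r_now, r_target, items_now = 0.50, 0.80, 10
n = (r_target * (1 - r_now)) / (r_now * (1 - r_target))   # lengthening factor
print(f"n = {n:.0f}, total items = {items_now * n:.0f}, "
      f"items to add = {items_now * (n - 1):.0f}")   # n = 4, 40 total, add 30
```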

Test Difficulty
When a test is too hard or too easy for a group of examinees, the result is a restricted score range, and a low reliability coefficient is likely.

§5 Standard Error of Measurement
Interpretation: theoretically, each examinee's personal distribution of possible observed scores around the examinee's true score has a standard deviation. When these individual error standard deviations are averaged for the group, the result is the standard error of measurement, denoted SE.

Figure 5.3 Approximately Normal Distribution of Observed Scores for Repeated Testing of One Examinee (from Introduction to Measurement Theory, M. J. Allen & W. M. Yen, p. 89, 2002)

Figure 5.4 Hypothetical Illustration of Different Examinees' Distributions of Observed Scores Around Their True Scores

Computation Formula
SE = Sx·√(1 - rtt)    (5.12)

Using SE to Estimate an Examinee's True Score
Create a confidence interval around the examinee's observed score:
X ± z·SE (e.g., X ± 1.96·SE for a 95% confidence interval).

Example
Given that the standard deviation of an intelligence scale is 15 and rtt is .95, if one examinee's IQ is 120, create a confidence interval for his/her true score.
Answer: SE = 15·√(1 - .95) ≈ 3.35. Then the 95% confidence interval is 120 ± 1.96 × 3.35, i.e., approximately 113.4 to 126.6.
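The example can be reproduced in Python (the 1.96 multiplier assumes a 95% normal confidence interval):

```python
# SEM and 95% confidence interval for the IQ example (formula 5.12).
s_x, r_tt, iq = 15.0, 0.95, 120
sem = s_x * (1 - r_tt) ** 0.5              # 15 * sqrt(.05) ~ 3.35
lo, hi = iq - 1.96 * sem, iq + 1.96 * sem
print(f"SE = {sem:.2f}, 95% CI = [{lo:.1f}, {hi:.1f}]")   # [113.4, 126.6]
```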