© McGraw-Hill Higher Education. All rights reserved. Chapter 3 Reliability and Objectivity

Chapter 3 Outline
Selecting a Criterion Score
Types of Reliability
Reliability Theory
Estimating Reliability: Intraclass R
Spearman-Brown Prophecy Formula
Standard Error of Measurement
Objectivity
Reliability of Criterion-referenced Tests
Reliability of Difference Scores

Objectivity (Interrater Reliability)
Agreement of competent judges about the value of a measure.

Reliability
Dependability of scores; consistency; the degree to which a test is free from measurement error.

Selecting a Criterion Score
Criterion score: the measure used to indicate a person's ability; can be based on the mean score or the best score.
Mean score: the average of all trials; usually a more reliable estimate of a person's true ability.
Best score: the optimal score a person achieves on any one trial; may be used when the criterion score is intended to indicate maximum possible performance.

Potential Methods to Select a Criterion Score
1. Mean of all trials.
2. Best score of all trials.
3. Mean of selected trials, based on the trials on which the group scored best.
4. Mean of selected trials, based on the trials on which the individual scored best (i.e., omit outliers).
The appropriate method depends on the situation.
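To make the first two methods concrete, here is a minimal sketch in Python; the trial scores and variable names are made up for illustration and are not from the text.

```python
# Hypothetical trial scores for one person (higher = better); not from the text.
trials = [14.2, 15.1, 14.8, 15.4]

mean_score = sum(trials) / len(trials)   # method 1: mean of all trials
best_score = max(trials)                 # method 2: best score of all trials

print(f"Mean of all trials: {mean_score:.2f}")   # 14.88
print(f"Best score:         {best_score:.2f}")   # 15.40
```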

Norm-referenced Test
Designed to reflect individual differences.

In a Norm-referenced Framework
Reliability is the ability to detect reliable differences between subjects.

Types of Reliability
Stability
Internal Consistency

Stability (Test-retest) Reliability
Each subject is measured with the same instrument on two or more different days.
Scores are then correlated.
An intraclass correlation should be used.

Internal Consistency Reliability
A consistent rate of scoring throughout a test or from trial to trial.
All trials are administered in a single day.
Trial scores are then correlated; an intraclass correlation should be used.

Sources of Measurement Error
Lack of agreement among raters (i.e., objectivity).
Lack of consistent performance by the person tested.
Failure of the instrument to measure consistently.
Failure of the tester to follow standardized procedures.

Reliability Theory
X = T + E (observed score = true score + error)
σ²X = σ²t + σ²e (observed score variance = true score variance + error variance)
Reliability = σ²t ÷ σ²X
Reliability = (σ²X - σ²e) ÷ σ²X
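A short numeric illustration of this decomposition, using made-up variance components; it only shows that the two equivalent forms of the reliability ratio agree.

```python
# Hypothetical variance components (arbitrary values for illustration).
true_variance  = 36.0   # sigma^2_t
error_variance = 4.0    # sigma^2_e
observed_variance = true_variance + error_variance   # sigma^2_X = 40.0

reliability = true_variance / observed_variance
# Equivalent form: (observed variance - error variance) / observed variance
reliability_alt = (observed_variance - error_variance) / observed_variance

print(reliability, reliability_alt)   # 0.9 0.9
```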

Reliability depends on:
Decreasing measurement error.
Detecting individual differences among people (the ability to discriminate among different ability levels).

Reliability
Ranges from 0 to 1.00.
When R = 0, there is no reliability.
When R = 1, there is maximum reliability.

Reliability from Intraclass R
ANOVA is used to partition the variance of a set of scores.
Parts of the variance are used to calculate the intraclass R.

Estimating Reliability
Intraclass correlation from one-way ANOVA:
R = (MSA - MSW) ÷ MSA
MSA = mean square among subjects (also called between subjects)
MSW = mean square within subjects
Mean square = variance estimate
This R represents the reliability of the mean test score for each person.
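A minimal sketch of the one-way calculation in Python, assuming a small made-up matrix of subjects × trials; the data are illustrative and are not from the text or from SPSS output.

```python
import numpy as np

# Hypothetical data: rows = subjects, columns = trials (made up for illustration).
scores = np.array([
    [10, 11, 10],
    [14, 15, 15],
    [ 8,  9,  8],
    [12, 12, 13],
], dtype=float)

n, k = scores.shape
grand_mean = scores.mean()
subject_means = scores.mean(axis=1)

# One-way ANOVA with subjects as the classification factor.
ss_among  = k * np.sum((subject_means - grand_mean) ** 2)
ss_within = np.sum((scores - subject_means[:, None]) ** 2)

ms_among  = ss_among / (n - 1)          # MSA
ms_within = ss_within / (n * (k - 1))   # MSW

R = (ms_among - ms_within) / ms_among   # reliability of the mean of k trials
print(f"R = {R:.3f}")
```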

Sample SPSS One-way Reliability Analysis

Estimating Reliability
Intraclass correlation from two-way ANOVA:
R = (MSA - MSR) ÷ MSA
MSA = mean square among subjects (also called between subjects)
MSR = mean square residual
Mean square = variance estimate
Used when trial-to-trial variance is not considered measurement error (e.g., with a Likert-type scale).
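A corresponding sketch for the two-way case, again with made-up data; here the residual mean square is obtained by subtracting the subject and trial sums of squares from the total.

```python
import numpy as np

# Hypothetical data: rows = subjects, columns = trials (made up for illustration).
scores = np.array([
    [10, 11, 10],
    [14, 15, 15],
    [ 8,  9,  8],
    [12, 12, 13],
], dtype=float)

n, k = scores.shape
grand_mean    = scores.mean()
subject_means = scores.mean(axis=1)
trial_means   = scores.mean(axis=0)

ss_total    = np.sum((scores - grand_mean) ** 2)
ss_subjects = k * np.sum((subject_means - grand_mean) ** 2)
ss_trials   = n * np.sum((trial_means - grand_mean) ** 2)
ss_residual = ss_total - ss_subjects - ss_trials

ms_subjects = ss_subjects / (n - 1)               # MSA
ms_residual = ss_residual / ((n - 1) * (k - 1))   # MSR

R = (ms_subjects - ms_residual) / ms_subjects
print(f"R = {R:.3f}")
```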

Sample SPSS Two-way Reliability Analysis

What is acceptable reliability? It depends on:
age
gender
experience of the people tested
the size of reliability coefficients others have obtained
the number of days or trials
whether a stability or an internal consistency coefficient is used

What is acceptable reliability?
Most physical measures are stable from day to day; expect test-retest Rxx between .80 and .95.
Expect lower Rxx for tests with an accuracy component (e.g., .70).
For written tests, want Rxx > .70.
For psychological instruments, want Rxx > .70.
Critical issue: the time interval between the two test sessions for stability reliability estimates; 1 to 3 days apart is usually appropriate for physical measures.

Factors Affecting Reliability
Type of test: maximum-effort tests, expect Rxx ≥ .80; accuracy-type tests, expect Rxx ≥ .70; psychological inventories, expect Rxx ≥ .70.
Range of ability: Rxx is higher for heterogeneous groups than for homogeneous groups.
Test length: the longer the test, the higher the Rxx.

Factors Affecting Reliability
Scoring accuracy: the person administering the test must be competent.
Test difficulty: the test must discriminate among ability levels.
Test environment, organization, and instructions: conditions favorable to good performance, with subjects motivated to do well, ready to be tested, and knowing what to expect.

Factors Affecting Reliability
Fatigue decreases Rxx.
Practice trials increase Rxx.

Coefficient Alpha
Also known as Cronbach's alpha.
Most widely used with attitude instruments.
The same as the two-way intraclass R obtained through ANOVA.
An estimate of the Rxx of a criterion score that is the sum of trial scores collected in one day.

Coefficient Alpha
Rα = [K / (K - 1)] × [(S²x - ΣS²trials) / S²x]
K = number of trials or items
S²x = variance of the criterion score (the sum of all trials)
ΣS²trials = sum of the variances of all trials
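A minimal sketch of coefficient alpha computed from a hypothetical people × trials matrix; the data are made up for illustration.

```python
import numpy as np

# Hypothetical data: rows = people, columns = trials/items (made up for illustration).
scores = np.array([
    [3, 4, 3],
    [5, 5, 4],
    [2, 3, 2],
    [4, 4, 5],
], dtype=float)

k = scores.shape[1]                       # number of trials/items
criterion = scores.sum(axis=1)            # criterion score = sum of trials
s2_x = criterion.var(ddof=1)              # variance of the criterion score
s2_trials = scores.var(axis=0, ddof=1)    # variance of each trial

alpha = (k / (k - 1)) * (s2_x - s2_trials.sum()) / s2_x
print(f"alpha = {alpha:.3f}")   # 0.900 for these made-up scores
```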

Kuder-Richardson (KR)
An estimate of internal consistency reliability obtained by determining how all items on a test relate to the total test.
KR formulas 20 and 21 are typically used to estimate the Rxx of knowledge tests.
Used with dichotomous items (scored as right or wrong).
KR20 = coefficient alpha for dichotomous items.

KR20
KR20 = [K / (K - 1)] × [(S²x - Σpq) / S²x]
K = number of trials or items
S²x = variance of total scores
p = proportion answering the item right
q = proportion answering the item wrong
Σpq = sum of the pq products for all K items
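A minimal sketch of KR20 computed from hypothetical dichotomous item responses (1 = right, 0 = wrong); the response matrix is made up for illustration.

```python
import numpy as np

# Hypothetical item responses: rows = people, columns = items (made up for illustration).
items = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
], dtype=float)

k = items.shape[1]                 # number of items
total = items.sum(axis=1)          # each person's total score
s2_x = total.var(ddof=1)           # variance of total scores

p = items.mean(axis=0)             # proportion answering each item right
q = 1 - p                          # proportion answering each item wrong
sum_pq = np.sum(p * q)

kr20 = (k / (k - 1)) * (s2_x - sum_pq) / s2_x
print(f"KR20 = {kr20:.2f}")
```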

KR20 Example
A four-item test has Mean = 2.45 and SD = 1.2, so S²x = 1.44, and the item pq products sum to Σpq = 0.684. What is KR20?
KR20 = (4/3) × (1.44 - 0.684) / 1.44
KR20 = .70

KR21
If all test items are assumed to be equally difficult, KR20 simplifies to KR21:
KR21 = [(K × S²) - (Mean × (K - Mean))] ÷ [(K - 1) × S²]
K = number of trials or items
S² = variance of the test
Mean = mean of the test
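A small sketch applying KR21 to summary statistics alone; the 50-item test, mean, and variance below are made up for illustration.

```python
# KR21 from summary statistics (assumes all items are equally difficult).
k = 50          # number of items (hypothetical)
mean = 36.0     # mean test score (hypothetical)
s2 = 64.0       # variance of test scores, i.e., SD = 8 (hypothetical)

kr21 = (k * s2 - mean * (k - mean)) / ((k - 1) * s2)
print(f"KR21 = {kr21:.2f}")   # 0.86 for these made-up numbers
```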

Equivalence Reliability (Parallel Forms)
Two equivalent forms of a test are administered to the same subjects.
Scores on the two forms are then correlated.

Spearman-Brown Prophecy Formula
Used to estimate the rxx of a test that is changed in length:
rkk = (k × r11) ÷ [1 + (k - 1) × r11]
k = the number of times the test is changed in length = (number of trials wanted) ÷ (number of trials on hand)
r11 = reliability of the test you are starting with
The Spearman-Brown formula gives an estimate of the maximum reliability that can be expected (an upper-bound estimate).
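A small sketch of the prophecy formula as a Python function; the example values (r11 = .60, test doubled in length) are made up for illustration.

```python
def spearman_brown(r11: float, k: float) -> float:
    """Estimated reliability when a test is changed to k times its current length."""
    return (k * r11) / (1 + (k - 1) * r11)

# Hypothetical example: a test with r11 = .60 doubled in length (k = 2).
print(f"{spearman_brown(0.60, 2):.2f}")   # 0.75
```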

Standard Error of Measurement (SEM)
The degree to which you expect a test score to vary due to measurement error.
The standard deviation of a test score.
SEM = Sx × √(1 - Rxx)
Sx = standard deviation of the group
Rxx = reliability coefficient
A small SEM indicates high reliability.

SEM example (written test): Sx = 5, Rxx = .88
SEM = 5 × √(1 - .88) = 5 × √.12 = 1.73
Confidence intervals: 68% = X ± 1.00 (SEM); 95% = X ± 1.96 (SEM)
If X = 23: 23 ± 1.73 gives 21.27 to 24.73
We are 68% confident the true score is between 21.27 and 24.73.
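A short check of the worked example above in Python; the observed score X = 23 is inferred from the interval endpoints shown, so treat it as an assumption.

```python
import math

# Numbers from the written-test example above (Sx = 5, Rxx = .88, observed X = 23 assumed).
s_x = 5.0
r_xx = 0.88
x = 23.0

sem = s_x * math.sqrt(1 - r_xx)              # 5 * sqrt(.12) = 1.73
lo68, hi68 = x - sem, x + sem                # 68% confidence interval
lo95, hi95 = x - 1.96 * sem, x + 1.96 * sem  # 95% confidence interval

print(f"SEM = {sem:.2f}")
print(f"68% CI: {lo68:.2f} to {hi68:.2f}")   # 21.27 to 24.73
print(f"95% CI: {lo95:.2f} to {hi95:.2f}")   # 19.61 to 26.39
```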

Objectivity (Rater Reliability)
The degree of agreement between raters.
Depends on the clarity of the scoring system and the degree to which the judge can assign scores accurately.
If a test is highly objective, objectivity is obvious and rarely calculated.
As subjectivity increases, the test developer should report an estimate of objectivity.

Two Types of Objectivity
Intrajudge objectivity: consistency in scoring when a test user scores the same test two or more times.
Interjudge objectivity: consistency between two or more independent judgments of the same performance.
Objectivity is calculated like reliability, but with judges' scores substituted for trials.

Criterion-referenced Test
A test used to classify a person as proficient or nonproficient (pass or fail).

In a Criterion-referenced Framework
Reliability is defined as consistency of classification.

Reliability of Criterion-referenced Test Scores
To estimate reliability, a double-classification (contingency) table is formed.

Contingency Table (Double-classification Table)
                 Day 2
              Pass   Fail
Day 1  Pass     A      B
       Fail     C      D

Proportion of Agreement (Pa)
The most popular way to estimate the Rxx of a criterion-referenced test.
Pa = (A + D) ÷ (A + B + C + D)
Pa does not take into account that some consistent classifications could happen by chance.

Example for calculating Pa
                 Day 2
              Pass   Fail
Day 1  Pass    45     12
       Fail     8     35

Pa = (A + D) ÷ (A + B + C + D)
Pa = (45 + 35) ÷ (45 + 12 + 8 + 35)
Pa = 80 ÷ 100 = .80
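A one-line check of the Pa calculation from the table cells above, as a Python sketch.

```python
# Cells from the contingency table above: A = 45, B = 12, C = 8, D = 35.
A, B, C, D = 45, 12, 8, 35

p_a = (A + D) / (A + B + C + D)
print(f"Pa = {p_a:.2f}")   # 0.80
```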

Kappa Coefficient (K)
An estimate of criterion-referenced Rxx with a correction for chance agreements.
K = (Pa - Pc) ÷ (1 - Pc)
Pa = proportion of agreement
Pc = proportion of agreement expected by chance
Pc = [(A + B)(A + C) + (C + D)(B + D)] ÷ (A + B + C + D)²

Example for calculating K
Using the same contingency table as above: A = 45, B = 12, C = 8, D = 35.

K = (Pa - Pc) ÷ (1 - Pc)
Pa = .80

Pc = [(A + B)(A + C) + (C + D)(B + D)] ÷ (A + B + C + D)²
Pc = [(45 + 12)(45 + 8) + (8 + 35)(12 + 35)] ÷ 100²
Pc = [(57)(53) + (43)(47)] ÷ 10,000 = 5,042 ÷ 10,000
Pc = .5042

Kappa (K)
K = (Pa - Pc) ÷ (1 - Pc)
K = (.80 - .5042) ÷ (1 - .5042)
K = .2958 ÷ .4958 = .597
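A short Python check of the kappa calculation, reproducing Pc and K from the same table cells.

```python
# Same contingency table as above: A = 45, B = 12, C = 8, D = 35.
A, B, C, D = 45, 12, 8, 35
n = A + B + C + D

p_a = (A + D) / n
p_c = ((A + B) * (A + C) + (C + D) * (B + D)) / n ** 2   # chance agreement

kappa = (p_a - p_c) / (1 - p_c)
print(f"Pc = {p_c:.4f}, kappa = {kappa:.3f}")   # Pc = 0.5042, kappa = 0.597
```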

Modified Kappa (Kq)
Kq may be more appropriate than K when the proportion of people passing a criterion-referenced test is not predetermined.
Most situations in exercise science do not predetermine the number of people who will pass.

Modified Kappa (Kq)
Kq = (Pa - 1/q) ÷ (1 - 1/q)
q = number of classification categories (for pass-fail, q = 2)
Kq = (.80 - .50) ÷ (1 - .50)
Kq = .60
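And a quick check of modified kappa for the same example (pass-fail classification, so q = 2).

```python
# Modified kappa for the example above (Pa = .80, two classification categories).
p_a = 0.80
q = 2

kq = (p_a - 1 / q) / (1 - 1 / q)
print(f"Kq = {kq:.2f}")   # 0.60
```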

Modified Kappa
Interpreted the same way as K.
When the proportion of masters = .50, Kq = K; otherwise, Kq > K.

Interpretation of Rxx for Criterion-referenced Tests
Pa (Proportion of Agreement): affected by chance classifications; values of Pa below .50 are unacceptable; Pa should be > .80 in most situations.
K and Kq (Kappa and Modified Kappa): interpretable range 0.0 to 1.0; minimum acceptable value = .60.

When reporting results, report both indices of Rxx.

Formative Evaluation of Chapter Objectives
Define and differentiate between reliability and objectivity for norm-referenced tests.
Identify factors that influence the reliability and objectivity of norm-referenced test scores.
Identify factors that influence the reliability of criterion-referenced test scores.
Select a reliable criterion score based on measurement theory.
