CHAPTER 7 Procedures for Estimating Reliability


1 CHAPTER 7 Procedures for Estimating Reliability

2 *TYPES OF RELIABILITY
Test-Retest (2 administrations)
  What it is: a measure of stability.
  How you do it: administer the same test/measure at two different times to the same group of participants.
  What the reliability coefficient looks like: r test1.test2 (e.g., an IQ test).
Parallel/Alternate (Interitem/Equivalent) Forms (2 administrations)
  What it is: a measure of equivalence.
  How you do it: administer two different forms of the same test to the same group of participants.
  What the reliability coefficient looks like: r testA.testB (e.g., a statistics test).
Test-Retest with Alternate Forms (2 administrations)
  What it is: a measure of stability and equivalence.
  How you do it: on Monday, administer form A to the first half of the group and form B to the second half; on Friday, administer form B to the first half of the group and form A to the second half.
  What the reliability coefficient looks like: r testA.testB.
Inter-Rater (1 administration)
  What it is: a measure of agreement.
  How you do it: have two raters rate behaviors and then determine the amount of agreement between them.
  What the reliability coefficient looks like: percentage of agreement.
Internal Consistency (1 administration)
  What it is: a measure of how consistently each item measures the same underlying construct.
  How you do it: correlate performance on each item with overall performance across participants.
  What the reliability coefficient looks like: Cronbach's Alpha Method, Kuder-Richardson Method, Split-Half Method, Hoyt's Method.

3 Procedures for Estimating/Calculating Reliability
Procedures Requiring 2 Test Administrations
Procedures Requiring 1 Test Administration

4 Procedures for Estimating Reliability
*Procedures Requiring Two (2) Test Administrations
1. The Test-Retest Reliability Method measures stability.
2. The Parallel (Alternate/Equivalent) Forms Reliability Method measures equivalence.
3. The Test-Retest with Alternate Forms Reliability Method measures stability and equivalence.

5 Procedures Requiring 2 Test Administrations
1. Test-Retest Reliability Method
The same test is administered to the same group of participants at two different times, and the two sets of scores are then correlated with each other. The correlation coefficient (r) between the two sets of scores is called the coefficient of stability.
The problem with this method is time sampling: factors related to time become sources of measurement error, e.g., changes in exam conditions such as noise, the weather, illness, fatigue, worry, or mood change.

6 How to Measure Test-Retest Reliability
Class IQ Scores
Table columns: Students, X (first time), Y (second time)
Students: John, Jo, Mary, Kathy, David
r first time.second time = coefficient of stability
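The coefficient of stability is an ordinary Pearson correlation between the two administrations. The sketch below shows the calculation on made-up IQ scores for the five students; the function name `pearson_r` and every score are illustrative assumptions, not data from the slide.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores for John, Jo, Mary, Kathy, David (assumed values)
first_time = [125, 110, 130, 122, 115]
second_time = [120, 112, 128, 120, 120]

stability = pearson_r(first_time, second_time)
print(f"coefficient of stability r = {stability:.2f}")
```

The same function would give a coefficient of equivalence if the two lists held Form A and Form B scores instead of first and second administrations.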

7 Procedures Requiring 2 Test Administrations
2. Parallel (Alternate) Forms Reliability Method
Different forms of the same test are given to the same group of participants, and the two sets of scores are then correlated. The correlation coefficient (r) between the two sets of scores is called the coefficient of equivalence.

8 How to Measure Parallel Forms Reliability
Class Test Scores
Table columns: Students, X (Form A), Y (Form B)
Students: John, Jo, Mary, Kathy, David
r formA.formB = coefficient of equivalence

9 Procedures Requiring 2 Test Administrations
3. Test-Retest with Alternate Forms Reliability Method
It is a combination of the test-retest and alternate forms reliability methods.
On Monday, you administer form A to the first half of the group and form B to the second half.
On Friday, you administer form B to the first half of the group and form A to the second half.
The correlation coefficient (r) between the two sets of scores is called the coefficient of stability and equivalence.

10 Procedures Requiring 1 Test Administration
A. Internal Consistency Reliability (ICR)
ICR examines the unidimensional nature of a set of items in a test. It tells us how unified the items are in a test or an assessment; this is called "item homogeneity."
Ex. If we administer a 100-item personality test, we want the items to relate to one another and to reflect the same construct (personality). We want them to have item homogeneity.

11 Procedures for Estimating Reliability
*Procedures Requiring One (1) Test Administration
A. Internal Consistency Reliability
B. Inter-Rater Reliability

12 A. Internal Consistency Reliability (ICR)
*4 different ways to measure ICR:
1. Guttman Split-Half Reliability Method (same as the Spearman-Brown Prophecy Formula)
2. Cronbach's Alpha Method
3. Kuder-Richardson Method
4. Hoyt's Method
They are different statistical procedures for calculating the reliability of a test.

13 Procedures Requiring 1 Test Administration
A. Internal Consistency Reliability (ICR)
1. Guttman Split-Half Reliability Method (most popular), usually used for dichotomously scored exams.
First, administer a test; then divide the test items into 2 subtests (there are four popular methods for doing so); then find the correlation between the 2 subtests and place it in the formula.

14 1. Split-Half Reliability Method

15 1. Split-Half Reliability Method
*The 4 popular methods are:
1. Assign all odd-numbered items to form 1 and all even-numbered items to form 2.
2. Rank-order the items by difficulty level (p-values) based on the responses of the examinees; then assign items with odd-numbered ranks to form 1 and those with even-numbered ranks to form 2.

16 1. Split-Half Reliability Method
The four popular methods (continued):
3. Randomly assign items to the two half-test forms.
4. Assign items to half-test forms so that the forms are "matched" in content, e.g., if there are 6 items on reliability, each half gets 3.

17 1. Split-Half Reliability Method
A high split-half reliability coefficient (e.g., >0.90) indicates a homogeneous test.

18 1. Split-Half Reliability Method
*Use the split-half reliability method to calculate the reliability estimate of a test whose two halves have a reliability coefficient (correlation) of 0.25.
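One way to work this exercise, assuming the intended formula is the Spearman-Brown correction for two halves, r_full = 2r / (1 + r), where r is the correlation between the half-tests. The function name below is an illustrative choice.

```python
def split_half_reliability(r_halves):
    """Spearman-Brown step-up for two half-tests: 2r / (1 + r)."""
    return 2 * r_halves / (1 + r_halves)

# Correlation between the two halves, as given in the exercise
estimate = split_half_reliability(0.25)
print(round(estimate, 2))  # 0.4
```

Under that assumption, a half-test correlation of 0.25 steps up to a full-test reliability estimate of 0.40.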

19 1. Split-Half Reliability Method

20 1. Split-Half Reliability Method
A=X and B=Y

21 Procedures Requiring 1 Test Administration
A. Internal Consistency Reliability (ICR)
2. Cronbach's Alpha Method: used for a wide range of scoring, including both non-dichotomously and dichotomously scored exams.
Cronbach's α is a preferred statistic.
Lee Cronbach

22 Procedures Requiring 1 Test Administration

23 Cronbach's α for composite tests: K is the number of tests/subtests.

24 A. Internal Consistency Reliability (ICR)
2. *Cronbach's Alpha Method (coefficient α is a preferred statistic)
Ex. Suppose that the examinees are tested on 4 essay items and the maximum score for each is 10 points. The variances for the items are as follows: σ²1=9, σ²2=4.8, σ²3=10.2, and σ²4=16. If the total score variance is σ²x=100, use Cronbach's Alpha Method to calculate the internal consistency of this test.
A high α coefficient (e.g., >0.90) indicates a homogeneous test.
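The exercise can be checked with the standard form of coefficient alpha, α = (k / (k − 1)) · (1 − Σσ²i / σ²x). The sketch below plugs in the four item variances and the total score variance from the example; the function name is an illustrative choice.

```python
def cronbach_alpha(item_variances, total_variance):
    """Coefficient alpha from item variances and total-score variance."""
    k = len(item_variances)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

item_variances = [9, 4.8, 10.2, 16]   # from the example: sum is 40
total_variance = 100                  # total score variance from the example

alpha = cronbach_alpha(item_variances, total_variance)
print(round(alpha, 2))  # 0.8
```

Here (4/3) · (1 − 40/100) = 0.80, which falls below the >0.90 benchmark mentioned on the slide.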

26 Cronbach's Alpha Method

28 Procedures Requiring 1 Test Administration
A. Internal Consistency Reliability (ICR)
3. Kuder-Richardson Method
*The Kuder-Richardson Formula 20 (KR-20) is a measure of internal consistency reliability for measures with dichotomous choices. It is analogous to Cronbach's α, except that Cronbach's α is also used for non-dichotomous tests. For a dichotomous item, the item variance is σ²i = pq.
A high KR-20 coefficient (e.g., >0.90) indicates a homogeneous test.
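A minimal sketch of KR-20, assuming the usual form KR-20 = (k / (k − 1)) · (1 − Σpq / σ²x), where p is the proportion answering an item correctly and q = 1 − p. The response matrix is made up for illustration, and the population variance (dividing by N) is assumed for the total scores so that it matches pq.

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 for a persons-by-items matrix of 0/1 scores."""
    n_people = len(responses)
    n_items = len(responses[0])
    # Sum of item variances, using pq for each dichotomous item
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        sum_pq += p * (1 - p)
    total_scores = [sum(row) for row in responses]
    total_variance = pvariance(total_scores)  # population variance (assumption)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)

# Hypothetical data: 5 examinees (rows) by 4 items (columns)
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 3))  # 0.8
```

For this matrix Σpq = 0.8 and the total score variance is 2, so KR-20 = (4/3) · (1 − 0.4) = 0.80.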

29 Procedures Requiring 1 Test Administration

30 Procedures Requiring 1 Test Administration

31 3. *Kuder-Richardson Method (KR-20 and KR-21): see Table 7.1 or the data on p. 136.

32 Variance=square of standard deviation=

33 Procedures Requiring 1 Test Administration
A. Internal Consistency Reliability (ICR)
3. Kuder-Richardson Method (KR-21)
It is used only with dichotomously scored items. It does not require computing each item variance; only the total test score variance σ²X is computed once for the whole test (see Table 7.1 for the standard deviation and variance for all items).
It assumes all items are equal in difficulty.
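A minimal sketch of KR-21, assuming the common form KR-21 = (k / (k − 1)) · (1 − M(k − M) / (k · σ²x)), where M is the mean total score and k the number of items. The total scores are made up for illustration, and the population variance is an assumption.

```python
from statistics import mean, pvariance

def kr21(n_items, total_scores):
    """KR-21 from the number of items and the examinees' total scores."""
    m = mean(total_scores)
    var_x = pvariance(total_scores)  # population variance (assumption)
    return (n_items / (n_items - 1)) * (1 - (m * (n_items - m)) / (n_items * var_x))

# Hypothetical total scores of 5 examinees on a 4-item test
total_scores = [4, 3, 2, 1, 0]
print(round(kr21(4, total_scores), 3))  # 0.667
```

Only the mean and variance of the total scores are needed, which is why no per-item variances appear. When item difficulties actually differ, KR-21 comes out lower than KR-20 on the same data.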

34 Procedures Requiring 1 Test Administration

35 Procedures Requiring 1 Test Administration
A. Internal Consistency Reliability (ICR)
4. *Hoyt's (1941) Method
Hoyt used ANOVA to obtain the variances, or mean squares (MS), needed to calculate Hoyt's coefficient.
MS = σ² = S² = variance

36 Procedures Requiring 1 Test Administration

37 4. *Hoyt's (1941) Method
MS person: MS within
MS items: MS between
MS residual has its own calculation; it is not equal to MS total.
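A minimal sketch of Hoyt's coefficient, assuming the usual form r = (MS persons − MS residual) / MS persons from a two-way persons-by-items ANOVA without replication. The score matrix is made up for illustration, and the function name is an illustrative choice.

```python
def hoyt_reliability(scores):
    """Hoyt's coefficient from a persons-by-items score matrix."""
    n = len(scores)        # number of persons
    k = len(scores[0])     # number of items
    grand = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for the two-way layout
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_residual = ss_total - ss_persons - ss_items

    # Mean squares: each sum of squares divided by its degrees of freedom
    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return (ms_persons - ms_residual) / ms_persons

# Hypothetical data: 5 examinees (rows) by 4 dichotomous items (columns)
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(hoyt_reliability(scores), 3))  # 0.8
```

For this matrix MS persons = 0.625 and MS residual = 0.125, so the coefficient is 0.80. Note that MS residual is computed from its own sum of squares and degrees of freedom rather than taken from the total.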

38 Procedures Requiring 1 Test Administration
B. Inter-Rater Reliability
It is a measure of consistency from rater to rater, that is, a measure of agreement between the raters.

39 Procedures Requiring 1 Test Administration
B. Inter-Rater Reliability
Table columns: Items, Rater 1, Rater 2
First do the r rater1.rater2 then, X

40 Procedures Requiring 1 Test Administration
B. Inter-Rater Reliability
More than 2 raters: Raters 1, 2, and 3
Calculate r for 1 & 2 = .6
Calculate r for 1 & 3 = .7
Calculate r for 2 & 3 = .8
µ = .7; .7 x 100 = 70%
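The three-rater example averages the pairwise correlations and expresses the mean as a percentage. A short sketch of that arithmetic, using the three values from the slide (the dictionary layout is an illustrative choice):

```python
from statistics import mean

# Pairwise correlations between raters, as given in the example
pairwise_r = {
    ("rater1", "rater2"): 0.6,
    ("rater1", "rater3"): 0.7,
    ("rater2", "rater3"): 0.8,
}

mean_r = mean(pairwise_r.values())
print(f"mean r = {mean_r:.2f} -> {mean_r * 100:.0f}%")  # mean r = 0.70 -> 70%
```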

41 *Factors that Affect Reliability Coefficients
1. Group homogeneity
2. Test length
3. Time limit

42 *Factors that Affect Reliability Coefficients
1. Group homogeneity: If a sample of examinees is highly homogeneous on the construct being measured, the reliability estimate will be lower than if the sample were more heterogeneous.
2. Test length: Longer tests are more reliable than shorter tests. The effect of changing test length can be estimated by using the Spearman-Brown Prophecy Formula.
3. Time limit: This applies when a test has a rigid time limit. Some examinees finish but others do not, which artificially inflates the test reliability coefficient.
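The test-length effect can be illustrated with the general Spearman-Brown Prophecy Formula, r_new = n·r / (1 + (n − 1)·r), where n is the factor by which the test is lengthened. The starting reliability of 0.60 is a made-up value, and the function name is an illustrative choice.

```python
def spearman_brown(r_current, length_factor):
    """Predicted reliability after changing test length by length_factor."""
    return (length_factor * r_current) / (1 + (length_factor - 1) * r_current)

r_current = 0.60  # hypothetical reliability of the existing test

doubled = spearman_brown(r_current, 2)    # test made twice as long
halved = spearman_brown(r_current, 0.5)   # test cut to half its length
print(round(doubled, 2), round(halved, 2))  # 0.75 0.43
```

The formula assumes the added items are comparable to the existing ones; under that assumption, doubling the test raises the estimate and halving it lowers the estimate.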

43 Reporting Reliability Data
According to the Standards for Educational and Psychological Testing:
1. Results of different reliability studies should be reported to take into account the different sources of measurement error that are most relevant to score use.
2. The standard error of measurement and score bands for different confidence intervals should accompany each reliability estimate.
3. Reliability and standard error estimates should be reported for subtest scores as well as the total test score.

44 Reporting Reliability Data
4. Procedures and samples used in reliability studies should be sufficiently described to permit users to determine the similarity between the conditions of the reliability study and their local situations.
5. When a test is normally used for a particular population of examinees (e.g., those within a grade level or those who have a particular handicap), reliability estimates and the standard error of measurement should be reported separately for such specialized populations.

45 Reporting Reliability Data
6. When test scores are used primarily for describing or comparing group performance, reliability and the standard error of measurement for aggregated observations should be reported.
7. If standard errors of measurement are estimated by using a model such as the binomial model, this should be clearly indicated; users will probably assume that the classical standard error of measurement is being reported. A binomial model is characterized by trials that end in either success (heads) or failure (tails).

46 CHAPTER 8 Introduction to Generalizability Theory, Cronbach (1963)

47 CHAPTER 8 Introduction to Generalizability Theory, Cronbach (1963)
Generalizability is another way to calculate the reliability of a test by using ANOVA.
Generalizability refers to the degree to which a particular set of measurements of an examinee generalizes to a more extensive set of measurements of that examinee (just like conducting inferential research).

48 Introduction to Generalizability
Generalizability Coefficient
FYI, in Classical True Score Theory, reliability was defined as the ratio of true score variance to observed score variance: Reliability = T/(T+E).
Also, an examinee's true score is defined as the average (mean) of a large number of strictly parallel measurements, and the true score variance σ²T is defined as the variance of these averages.
Reliability coefficient: ρ X1X2 = σ²T / σ²X

49 Introduction to Generalizability
*Generalizability Coefficient
In generalizability theory, an examinee's Universe Score is defined as the average (mean) of the measurements in the universe of generalization. The Universe Score plays the same role as the true score in classical test theory.

50 Introduction to Generalizability
*Generalizability Coefficient
The Generalizability Coefficient, ρ, is defined as the ratio of Universe Score Variance (σ²μ) to Expected Observed Score Variance (eσ²X).
Generalizability Coefficient = ρ = σ²μ / eσ²X
Ex. If Expected Observed Score Variance = eσ²X = 10 and Universe Score Variance = σ²μ = 5, then the Generalizability Coefficient is 5/10 = 0.5.
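The ratio in this example is simple enough to check directly. A small sketch using the two variances from the slide (the function name and the input check are illustrative additions):

```python
def generalizability_coefficient(universe_score_variance, expected_observed_variance):
    """Ratio of universe score variance to expected observed score variance."""
    if expected_observed_variance <= 0:
        raise ValueError("expected observed score variance must be positive")
    return universe_score_variance / expected_observed_variance

# Values from the example: universe score variance 5, expected observed variance 10
print(generalizability_coefficient(5, 10))  # 0.5
```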

51 Introduction to Generalizability
Key Terms
Universe: A universe is a set of measurement conditions that is more extensive than the conditions under which the sample measurements were obtained.
Ex. If you took the Test Construction exam here at CAU, then the universe (of generalization) is when you take the Test Construction exam at several other universities:
  University  Score
  CAU         85
  FIU         90
  FAU         84
  NSU         80
  UM          88
μ = 85.40 is called the Universe Score.
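The Universe Score in this example is just the mean of the five exam scores. A quick check of the arithmetic, using the scores listed on the slide:

```python
from statistics import mean

# Test Construction exam scores from the example, keyed by university
scores = {"CAU": 85, "FIU": 90, "FAU": 84, "NSU": 80, "UM": 88}

universe_score = mean(scores.values())
print(universe_score)  # 85.4
```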

52 Introduction to Generalizability
Key Terms
Universe Score: It is the same as the true score in Classical Test Theory. It is the average (mean) of the measurements in the universe of generalization.
Ex. The mean of your scores on the Test Construction exams you took at other universities is your Universe Score (see previous slide).

53 Introduction to Generalizability
Key Term
Facets: A facet is a part or aspect of something; here, a facet is a set of measurement conditions.
Ex. Next slide

54 Introduction to Generalizability
*Facets: Example
If two supervisors want to rate the performance of factory workers under three workloads (heavy, medium, and light), how many sets of measurement conditions (facets) will we have? See next slide.


56 Introduction to Generalizability
Facets:
The two sets of measurement conditions, or the two facets, are:
1. The supervisors (one and two)
2. The workloads (heavy, medium, and light)
Ex. 2 next slide

57 Introduction to Generalizability
Facets:
*A researcher is measuring students' compositional writing on four occasions. On each occasion, each student writes compositions on two different topics. All compositions are graded by three different raters. How many facets does this design involve? See next slide.

58 Introduction to Generalizability
Facets:
*A researcher is measuring students' compositional writing on four occasions. On each occasion, each student writes compositions on two different topics. All compositions are graded by three different raters.
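Reading the design as described, each distinct set of measurement conditions is one facet, which gives three facets here: occasions, topics, and raters. The sketch below enumerates the conditions, assuming the design is fully crossed (every rater grades every composition from every occasion and topic); the labels are illustrative.

```python
from itertools import product

# One entry per facet, with the number of conditions taken from the example
facets = {
    "occasion": [1, 2, 3, 4],
    "topic": ["topic A", "topic B"],
    "rater": ["rater 1", "rater 2", "rater 3"],
}

# Every combination of conditions under which one student is measured
conditions = list(product(*facets.values()))

print(len(facets))      # 3 facets
print(len(conditions))  # 24 measurements per student (4 x 2 x 3)
```

The same enumeration applied to the supervisor example (2 supervisors, 3 workloads) would give two facets and six rating conditions per worker.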

59 Introduction to Generalizability
Key Term
*Universe of Generalization: The universe of generalization is all of the measurement conditions for the second set of measurements, or "universe," such as fatigue, room temperature, specifications, etc.
Ex. All of the conditions under which you took your Test Construction exams at other universities.

60 Introduction to Generalizability
Generalizability theory distinguishes between Generalizability Studies (G-Studies) and Decision Studies (D-Studies).
*G-Studies: G-Studies are concerned with the extent to which a sample of measurements generalizes to a universe of measurements. It is the study of generalizability procedures.

61 Generalizability Studies (G-Studies) and Decision Studies (D-Studies)
*D-Studies: D-Studies provide data for making decisions about examinees. They are about the adequacy of measurement.
Ex. Next slide

62 Generalizability Studies (G-Studies) and Decision Studies (D-Studies)
Ex. Suppose we use an achievement test to test 2000 children from public schools and 2000 children from private schools. If we want to know whether this test is equally reliable for both types of schools, then we are dealing with a G-Study (quality of measurement).
Ex. We can generalize a test to two different school populations, e.g., CAU and FIU doctoral students taking the EPPP exam.

63 Generalizability Studies (G-Studies) and Decision Studies (D-Studies)
However, if we want to compare the means of these different types of schools (using data) and draw a conclusion about differences in the adequacy of the two educational systems, then we are dealing with a D-Study.
Ex. Compare the means of CAU and FIU doctoral students who took the EPPP exam.

64 Introduction to Generalizability
*Generalizability Designs: There are 4 different generalizability designs with different generalizability theory.
_ stands for examinee
+ stands for rater or examiner

65 Generalizability Designs:
1. _ _ _ _ _ _ _ _ _ _ +  One rater rates each one of the examinees.
2. _ _ _ _ _ _ _ _ _ _  A group of raters rates each one of the examinees.
3. _ _ _ _ _ _ _ _ _ _  One rater rates only one examinee.
4. _ _ _ _ _ _ _ _ _ _  Each examinee is rated by a different group of raters (most expensive).

