Reliability Lesson Six

Similar presentations
Consistency in testing

Topics: Quality of Measurements
Reliability.
Reliability Definition: The stability or consistency of a test. Assumption: True score = obtained score +/- error Domain Sampling Model Item Domain Test.
© McGraw-Hill Higher Education. All rights reserved. Chapter 3 Reliability and Objectivity.
Chapter 5 Reliability Robert J. Drummond and Karyn Dayle Jones Assessment Procedures for Counselors and Helping Professionals, 6th edition Copyright ©2006.
The Department of Psychology
© 2006 The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill Validity and Reliability Chapter Eight.
Chapter 4 – Reliability Observed Scores and True Scores Error
1 Reliability in Scales Reliability is a question of consistency do we get the same numbers on repeated measurements? Low reliability: reaction time High.
Assessment Procedures for Counselors and Helping Professionals, 7e © 2010 Pearson Education, Inc. All rights reserved. Chapter 5 Reliability.
VALIDITY AND RELIABILITY
Lesson Six Reliability.
1 Reliability Introduction to Communication Research School of Communication Studies James Madison University Dr. Michael Smilowitz.
A description of the ways a researcher will observe and measure a variable, so called because it specifies the operations that will be taken into account.
Reliability for Teachers Kansas State Department of Education ASSESSMENT LITERACY PROJECT1 Reliability = Consistency.
What is a Good Test Validity: Does test measure what it is supposed to measure? Reliability: Are the results consistent? Objectivity: Can two or more.
Biomedical Statistics Final Report: Reliability. Student: 劉佩昀. Instructor: 蔡章仁.
Can you do it again? Reliability and Other Desired Characteristics. Linn and Gronlund, Chap. 5.
Lesson Seven Reliability. Contents: Definition of reliability; Indication of reliability: Reliability coefficient.
BASIC CONSIDERATIONS in Test Design. Meeting (Pertemuan) 16. Course (Matakuliah): >/ >. Year (Tahun): >.
Basic Issues in Language Assessment. 袁韻璧, Department of English, Fu Jen Catholic University. Contents: Introduction: the relationship between teaching and testing.
FOUNDATIONS OF NURSING RESEARCH Sixth Edition CHAPTER Copyright ©2012 by Pearson Education, Inc. All rights reserved. Foundations of Nursing Research,
Validity, Reliability, & Sampling
Research Methods in MIS
Reliability of Selection Measures. Reliability Defined The degree of dependability, consistency, or stability of scores on measures used in selection.
Principles of language testing
Questions to check whether or not the test is well designed: 1. How do you know if a test is effective? 2. Can it be given within appropriate administrative.
Validity and Reliability Neither Valid nor Reliable Reliable but not Valid Valid & Reliable Fairly Valid but not very Reliable Think in terms of ‘the purpose.
Classroom Assessment Reliability. Classroom Assessment Reliability Reliability = Assessment Consistency. –Consistency within teachers across students.
Measurement and Data Quality
Validity and Reliability
Foundations of Educational Measurement
Educational Research: Competencies for Analysis and Application, 9th edition. Gay, Mills, & Airasian © 2009 Pearson Education, Inc. All rights reserved.
MEASUREMENT CHARACTERISTICS Error & Confidence Reliability, Validity, & Usability.
Copyright © 2012 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 14 Measurement and Data Quality.
McMillan Educational Research: Fundamentals for the Consumer, 6e © 2012 Pearson Education, Inc. All rights reserved. Educational Research: Fundamentals.
LECTURE 06B BEGINS HERE THIS IS WHERE MATERIAL FOR EXAM 3 BEGINS.
Reliability Chapter 3.  Every observed score is a combination of true score and error Obs. = T + E  Reliability = Classical Test Theory.
Chap. 2 Principles of Language Assessment
1 Chapter 4 – Reliability 1. Observed Scores and True Scores 2. Error 3. How We Deal with Sources of Error: A. Domain sampling – test items B. Time sampling.
EDU 8603 Day 6. What do the following numbers mean?
Week 5 Lecture 4. Lecture’s objectives  Understand the principles of language assessment.  Use language assessment principles to evaluate existing tests.
Data Collection and Reliability All this data, but can I really count on it??
Chapter 2: Behavioral Variability and Research Variability and Research 1. Behavioral science involves the study of variability in behavior how and why.
RELIABILITY Prepared by Marina Gvozdeva, Elena Onoprienko, Yulia Polshina, Nadezhda Shablikova.
Copyright © 2008 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 17 Assessing Measurement Quality in Quantitative Studies.
1 LANGUAGE TEST RELIABILITY. 2 What Is Reliability? Refers to a quality of test scores, and has to do with the consistency of measures across different.
Reliability n Consistent n Dependable n Replicable n Stable.
©2011 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Reliability performance on language tests is also affected by factors other than communicative language ability. (1) test method facets They are systematic.
Reliability: Introduction. Reliability Session 1.Definitions & Basic Concepts of Reliability 2.Theoretical Approaches 3.Empirical Assessments of Reliability.
Measurement Experiment - effect of IV on DV. Independent Variable (2 or more levels) MANIPULATED a) situational - features in the environment b) task.
Chapter 6 - Standardized Measurement and Assessment
Imagine… A hundred students are taking a 100-item test at 3 o'clock on a Tuesday afternoon. The test is neither difficult nor easy. So, not ALL get.
RELIABILITY BY DONNA MARGARET. WHAT IS RELIABILITY?  Does this test consistently measure what it’s supposed to measure?  The more similar the scores,
PRINCIPLES OF LANGUAGE ASSESSMENT Riko Arfiyantama Ratnawati Olivia.
Reliability When a Measurement Procedure yields consistent scores when the phenomenon being measured is not changing. Degree to which scores are free of.
Reliability EDUC 307. Reliability  How consistent is our measurement?  the reliability of assessments tells the consistency of observations.  Two or.
Copyright © 2014 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 11 Measurement and Data Quality.
1 Measurement Error All systematic effects acting to bias recorded results: -- Unclear Questions -- Ambiguous Questions -- Unclear Instructions -- Socially-acceptable.
Classical Test Theory Margaret Wu.
Reliability & Validity
RELIABILITY IN TESTING
PSY 614 Instructor: Emily Bullock, Ph.D.
The extent to which an experiment, test or any measuring procedure shows the same result on repeated trials.
The first test of validity
Psychological Measurement: Reliability and the Properties of Random Errors The last two lectures were concerned with some basics of psychological measurement:
Chapter 8 VALIDITY AND RELIABILITY
Presentation transcript:

Reliability Lesson Six. What is reliability? When we talk about reliability, we are talking about results or scores: a reliable test gives consistent results. If I give this class a quiz today and then give the same group the quiz again, it will yield similar results; whoever ranks number one the first time will rank number one the next time too. Note that we are talking about the test results themselves; we are not concerned here with the test content.

Case. Imagine that a hundred students take a 100-item test at three o'clock one Thursday afternoon. The test is neither impossibly difficult nor ridiculously easy for these students, so they do not all get zero or a perfect score of 100. Now, what if in fact they had not taken the test on the Thursday but had taken it at three o'clock the previous afternoon? Would we expect each student to have got exactly the same score on the Wednesday as they actually did on the Thursday? The answer must be no, even if we assume that the test is excellent, that the conditions of administration are almost identical, that the scoring calls for no judgment on the part of the scorers and is carried out with perfect care, and that no learning or forgetting has taken place during the one-day interval. Human beings are simply not like that; they do not behave in exactly the same way on every occasion, even when the circumstances seem identical. What we have to do is construct, administer, and score tests in such a way that the scores actually obtained on a particular occasion are likely to be very similar to those which would have been obtained if the test had been administered to the same students, with the same ability, at a different time. The more similar the scores would have been, the more reliable the test is said to be.

Contents
- Definition of reliability
- Factors contributing to unreliability
- Types of reliability
- Indication of reliability: the reliability coefficient
- Ways of obtaining a reliability coefficient: alternate/parallel forms; test-retest; split-half & KR-21/KR-20
- Two ways of testing reliability
- How to make tests more reliable
- Online video: http://www.le.ac.uk/education/testing/ilta/faqs/main.html

Definition of Reliability (1). “The consistency of measures across different times, test forms, raters, and other characteristics of the measurement context” (Bachman, 1990, p. 24). If you give the same test to the same testees on two different occasions, the test should yield similar results: across different times, the results should be much the same. Across test forms that are equivalent (parallel) forms, you should likewise get similar results. Reliability also concerns the raters and how the test is given: scoring and administration both affect reliability.

Definition of Reliability (2). A reliable test is consistent and dependable: scores are consistent and reproducible. Reliability is the accuracy or precision with which a test measures something; that is, the consistency, dependability, or stability of test results. Again, we are talking about test results, not test content.

Factors Contributing to Unreliability. X = T + E (observed score = true score + error score). Reliability is concerned with freedom from nonsystematic fluctuation: fluctuations in the student, the scoring, the test administration, and the test itself. What kinds of factors influence test reliability? The test takers (the students); the scoring (the raters); how the test is given (test administration); and problems with the test itself (e.g., ambiguous items). In reality, we have to acknowledge that testing involves error. X is the test result, or score, and it consists of two parts: the true score (the test taker's real ability) and the error score. Take multiple choice, for example: you could close your eyes, pick (a) as the answer, and guess right. That is not your own ability, so it counts toward the error score. So in any testing situation there is nonsystematic fluctuation. Error also comes from the test takers themselves: you know the right answers, yet you do not perform as well as you normally would (carelessness, an off day, staying up late).
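As a compact summary of this slide's notation, classical test theory decomposes the observed score and defines the reliability coefficient as the share of observed-score variance due to true scores. This is the standard formulation (it assumes true score and error are uncorrelated, which is implied rather than spelled out on the slide):

```latex
X = T + E, \qquad
\sigma^2_X = \sigma^2_T + \sigma^2_E, \qquad
r = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}
```

On this reading, a perfectly reliable test (r = 1) is one with no error variance at all.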

Types of Reliability
- Student- (or person-) related reliability
- Rater- (or scorer-) related reliability: intra-rater reliability and inter-rater reliability
- Test administration reliability
- Test (or instrument-related) reliability

The first has to do with the test takers themselves. The second concerns the raters (some people say scorers): intra-rater means one and the same rater; inter-rater means two or more raters. Test administration reliability concerns the way the test is given: the classroom situation, the timing, and so on. The last has to do with the test itself.

Student-Related Reliability (1). The source of the error score comes from the test takers: temporary illness, fatigue, anxiety, other physical or psychological factors, and test-wiseness (i.e., strategies for efficient test taking). Let us take these one by one; "students" here means test takers. Fatigue: too tired, for example from staying up late. Anxiety: too nervous, for example in an oral interview, where the words will not come out. Test-wiseness covers test-taking strategies, for example: How do you distribute your time? If you have to finish the test in one hour, spread your time evenly across the parts. In multiple choice, how do you find the correct answer among four choices? Eliminate the unlikely ones. Another strategy is to stick to the passage and not read too much into it, which would otherwise lead you in the wrong direction. Strategies like these are called test-wiseness.

Student-Related Reliability (2). Principles: assess on several occasions; assess when the person is prepared and best able to perform well; ensure that the person understands what is expected (e.g., instructions are clear). These are principles to follow, since the teacher cannot control students' behavior: give more tests and average the results, so students get more chances and fluctuations even out, instead of letting the final exam carry the whole grade. And make sure students understand what is expected, for instance by giving oral instructions.

Rater (or Scorer) Reliability (1). Fluctuations: including human error, subjectivity, and bias. Principles: use experienced, trained raters; use more than one rater; raters should carry out their assessments independently. Human error here means errors on the rater's part, such as carelessness that leads to wrong grading. Compositions and oral tests are hard to score objectively. Raters need to be independent and not be influenced by other raters.

Rater Reliability (2). Two kinds of rater reliability: intra-rater reliability and inter-rater reliability. Even a single rater has to apply consistent scoring criteria, as the next slide shows.

Intra-Rater Reliability. Fluctuations include: unclear scoring criteria, fatigue, bias toward particular good or bad students, and simple carelessness. Even a single rater needs clear scoring criteria: no one is totally objective. We are human beings, and we make mistakes.

Inter-Rater Reliability (1). Fluctuations include: lack of attention to the scoring criteria, inexperience, inattention, and preconceived biases. When two or more raters are involved in a scoring situation, we need to calculate inter-rater reliability.

Inter-Rater Reliability (2). Used with subjective tests when two or more independent raters are involved in scoring. Train the raters before scoring (e.g., the TWE, or departmental oral and composition tests for recommended students).

Inter-Rater Reliability (3). Compare the scores given to the same testees by different raters. If r is high, there is inter-rater reliability. A statistical formula helps you obtain r: if we want to know whether there is inter-rater reliability, we need to do the calculation and get r, and that r is the evidence we use to convince people. If the raters were not trained, the test may well lack inter-rater reliability.
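As a minimal sketch (the slide does not name a formula, but the Pearson correlation is the usual choice here), inter-rater reliability can be estimated by correlating two raters' scores for the same testees; the scores below are invented for illustration:

```python
import numpy as np

# Hypothetical scores given to the same ten testees by two independent raters.
rater_a = np.array([78, 85, 62, 90, 71, 88, 65, 80, 74, 93])
rater_b = np.array([75, 88, 60, 92, 70, 85, 68, 78, 76, 90])

# np.corrcoef returns a 2x2 correlation matrix; the [0, 1] entry is r.
r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"inter-rater reliability r = {r:.2f}")  # high r -> raters agree
```

The same computation applies to the test-retest and parallel-forms designs later in the lesson; only the source of the two score sets changes.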

Test Administration Reliability. Sources of fluctuation: street noise (e.g., during a listening comprehension test), photocopying variations, lighting, variations in temperature, the condition of desks and chairs, and the monitors. When I give a test, such factors can affect your performance: a noisy street outside; a lousy photocopy you cannot read clearly, which costs you extra time; broken lights, or a room so hot you are sweating; and so on.

Test Reliability. Measurement errors can come from the test itself: the test is too long; the test has a time limit; the test format allows guessing; the test has ambiguous items; items have more than one correct answer. A long test is generally more reliable than a short one because it includes more samples of the test taker's ability: compare a one-question test with a 100-question test, where missing a single item barely changes your score. But if the test is too long, say 200 items, you will get tired, and there are more chances to make mistakes. A test with a tight time limit causes problems as well. As for guessing: suppose on a 100-item test a test taker knows 70 answers and guesses on the other 30; if 25 of those guesses are right, he scores 95, which probably reflects good test-wiseness rather than ability. Multiple choice is good for inter-rater reliability, but it increases the guessing factor.

Ways of Enhancing Reliability. General strategies: consider the possible sources of unreliability, then reduce or average out nonsystematic fluctuations in raters, persons, test administration, and instruments.

How to Make Tests More Reliable? (1) Take enough samples of behavior. Try to avoid ambiguous items. Provide clear and explicit instructions. Ensure tests are well laid out and perfectly legible. Provide uniform and non-distracting conditions of administration. Try to use objective tests.

How to Make Tests More Reliable? (2) Try to use direct tests. Have independent, trained raters. Provide a detailed scoring key. Try to identify the test takers by number, not by name. Try to have multiple independent scorings in subjective tests (Hughes, 1989, pp. 36-42).

Reliability Coefficient (r). The coefficient quantifies the reliability of a test, which allows us to compare the reliability of different tests. 0 ≤ r ≤ 1 (the ideal r = 1, which means the test gives precisely the same results for particular testees regardless of when it happens to be administered). If r = 1, the test is 100% reliable. A good achievement test has r ≥ .90; if r < .70, you should not use the test. When r = 0, the results are not reliable at all: do not use such test results in any way, since something is terribly wrong with the test. In other words, the bigger r is, the more reliable the test; the perfect r is 1, and in practice r should fall between .70 and 1. Note that this r refers to the test itself; it does not take the test takers or the raters into account.

How to Get a Reliability Coefficient

Type of Reliability: How to Measure
- Stability (Test-Retest): Give the same assessment twice, separated by days, weeks, or months. Reliability is stated as the correlation between scores at Time 1 and Time 2.
- Alternate Form: Create two forms of the same test (vary the items slightly). Reliability is stated as the correlation between scores on Form 1 and Form 2.
- Internal Consistency (Alpha, α): Compare one half of the test to the other half, or use methods such as Kuder-Richardson Formula 20 (KR-20) or Cronbach's Alpha.

How to Get a Reliability Coefficient
- Two forms, two administrations: alternate/parallel forms
- One form, two administrations: test-retest
- One form, one administration (internal consistency): split-half (Spearman-Brown procedure), KR-21, KR-20

The theory: if we give the test to the same people several times, the results should be consistent. With two parallel forms given to the same test takers on different occasions, a reliable test yields similar results both times. With one form and two administrations, you use the exact same test twice, as in a pre-test and post-test. With one form and one administration, the test is given once and divided into two parts, so each test taker gets two sets of scores, and we correlate those two scores; this is called internal consistency, and it comes in three types.

Alternate/Parallel Forms. Two forms, two administrations: equivalent forms (i.e., different items testing the same thing) taken by the same test takers on different days. If r is high, the test is said to have good reliability. This is the most stringent method: the two forms, also called parallel forms, measure the same ability, and the results from the two administrations are correlated. How would you like this method as a test taker? It is boring to take a test twice, and if you are not highly motivated you will perform differently. The method is ideal in principle, but it is not easy for a teacher to prepare two parallel forms: it takes time the teacher may not have. So teachers tend to prefer test-retest.

Test-Retest. One form, two administrations: the same test is administered to the same testees with a short time lag, and r is then calculated. Appropriate for highly speeded tests. Beware the learning effect: items you answered wrong the first time, you may answer correctly the second time, say four months later. The lag cannot be too short either, or the test takers may remember the items; one or two weeks is a better time lag. The remaining problem is the test takers: they still have to take the test twice. Hence the third method, called split-half.

Split-Half (Spearman-Brown Procedure). One test, one administration: split the test into halves (e.g., odd-numbered vs. even-numbered questions) to form two sets of scores, e.g., Q1, Q3, Q5 as the first half and Q2, Q4, Q6 as the second half. Also called internal consistency. The procedure was designed by Spearman and Brown.
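A minimal sketch of the split-half idea, using a made-up 0/1 (wrong/right) response matrix: score the odd and even items separately for each test taker, then correlate the two half scores.

```python
import numpy as np

# Hypothetical responses: rows = 6 test takers, columns = items Q1..Q6 (1 = right).
responses = np.array([
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 0],
])

odd_half = responses[:, 0::2].sum(axis=1)   # scores on Q1, Q3, Q5
even_half = responses[:, 1::2].sum(axis=1)  # scores on Q2, Q4, Q6

# Correlation between the two half-test scores.
r_half = np.corrcoef(odd_half, even_half)[0, 1]
print(f"half-test correlation r = {r_half:.2f}")
```

As the next slide explains, this r still has to be adjusted upward, because each half is only half as long as the real test.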

Split-Half (2). Note that this r isn't yet the reliability of the test. There is a mathematical relationship between test length and reliability: the longer the test, the more reliable it is. The Spearman-Brown prophecy formula: rel_total = nr / (1 + (n - 1)r), where n is the factor by which the test is lengthened. E.g., if the correlation between the two halves is r = .6, the reliability of the full test is .75; lengthening the test to three times the half-length gives r = .82. Why the adjustment? Because the r was computed on halves: a 100-item test was cut into two 50-item halves, and a shorter test is less reliable than the full-length one, so the half-test r understates the full test's reliability. (You do not need to memorize the formula.)
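A quick check of the slide's numbers, treating n as the lengthening factor relative to the half-test:

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability when a test is lengthened n times,
    given the correlation r between comparable parts."""
    return n * r / (1 + (n - 1) * r)

r_half = 0.6
print(round(spearman_brown(r_half, 2), 2))  # 0.75: the full test, twice the half-length
print(round(spearman_brown(r_half, 3), 2))  # 0.82: three times the half-length
```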

Kuder-Richardson Formula 21 (KR-21):

KR-21 = [k / (k - 1)] * {1 - [x̄(1 - x̄/k)] / s²}

where k = the number of items, x̄ = the mean, and s = the standard deviation (for the formula, see Bailey, p. 100). The standard deviation describes the spread-outness of a set of scores (the deviations of scores from the mean); s ≥ 0, and the larger s is, the more spread out the scores. E.g., take two sets of scores, (5, 4, 3) and (7, 4, 1): which group behaves more similarly overall? The important concept here is the standard deviation, represented by s. Say you have 20 people in a group, giving 20 scores: using the mean as the center, each score lies at some distance from the mean, i.e., it deviates from the mean. How does the SD work? Three test takers average 4, and another group taking the same test also averages 4; you still cannot say the groups behave similarly, because the SD of the second group is bigger than that of the first. Since the SDs differ, the groups do not behave the same: with, say, 50 scores, the first group would be crowded around the middle, not spread out like the second. (See p. 102: the first set has SD 1, the second SD 3.) With small numbers you can tell at a glance; with large numbers you have to calculate.
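A minimal sketch of KR-21 in code, using the formula above; the scores are invented, and the sample variance (ddof=1) is an assumption, since the slide does not specify which variance to use:

```python
import numpy as np

def kr21(scores: np.ndarray, k: int) -> float:
    """KR-21 from total scores alone: k items, mean x_bar, variance s^2."""
    x_bar = scores.mean()
    s2 = scores.var(ddof=1)  # sample variance; some texts use the population variance
    return (k / (k - 1)) * (1 - (x_bar * (1 - x_bar / k)) / s2)

# Hypothetical total scores of ten test takers on a 20-item test.
scores = np.array([15, 12, 18, 9, 14, 16, 11, 17, 13, 10])
print(f"KR-21 = {kr21(scores, k=20):.2f}")
```

KR-21 needs only the mean and SD of total scores, which is why it is quick but, as the next slide notes, conservative.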

Kuder-Richardson Formula 20 (KR-20):

KR-20 = [k / (k - 1)] * [1 - (Σpq / s²)]

where p = item difficulty (the proportion of people who got an item right), q = 1 - p (the proportion who got it wrong), and s² = the variance of the total scores. It is another way to calculate reliability: KR-21 is the conservative one, so KR-20 might come out higher.
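A sketch of KR-20 from a full 0/1 response matrix (reusing the made-up matrix from the split-half example; in practice you would use real item data):

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """KR-20 from a matrix of 0/1 responses (rows = test takers, cols = items)."""
    k = responses.shape[1]
    p = responses.mean(axis=0)               # proportion correct per item
    q = 1 - p                                # proportion wrong per item
    s2 = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / s2)

# Hypothetical responses: 6 test takers x 6 items.
responses = np.array([
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 0],
])
print(f"KR-20 = {kr20(responses):.2f}")
```

Unlike KR-21, KR-20 uses each item's difficulty p, so it needs the item-level responses, not just the totals.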

Ways of Testing Reliability. Two ways:
1. Examine the amount of variation: the Standard Error of Measurement (SEM); the smaller, the better.
2. Calculate the reliability coefficient r; the bigger, the better.

What r should you expect? Bigger is better; that is one way. The other way is to examine the amount of variation by calculating the SEM, where smaller is better. The SEM has to do with interpreting an individual test taker's score.
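The slide stops short of a formula, but the standard relationship between the SEM, the score standard deviation s, and the reliability coefficient r is SEM = s * sqrt(1 - r); a small sketch with invented numbers:

```python
import math

def sem(s: float, r: float) -> float:
    """Standard Error of Measurement: SEM = s * sqrt(1 - r)."""
    return s * math.sqrt(1 - r)

# Hypothetical values: score SD of 10 and reliability of .91.
print(round(sem(10, 0.91), 1))  # 3.0
```

This makes the two indices move together: the higher the reliability r, the smaller the SEM, and an individual's true score most likely lies within about one SEM of the observed score.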