Reliability
A measure is reliable if it gives the same information every time it is used. Reliability is assessed by a number, typically a correlation between two sets of scores.

Reliability
Measurement of human ability and knowledge is challenging because:
- ability is not directly observable; we infer ability from behavior
- all behaviors are influenced by many variables, only a few of which matter to us

Observed Scores
O = T + e
- O = observed score
- T = true score
- e = error

Reliability – the basics
1. A true score on a test does not change with repeated testing.
2. A true score would be obtained if there were no error of measurement.
3. We assume that errors are random (equally likely to increase or decrease any test result).

Reliability – the basics
Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors). The mean of many observed scores for one person will be that person's true score.
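
A minimal simulation of this idea (not from the slides; the numbers are illustrative) shows the errors canceling:

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 75.0                     # the unobservable T
errors = rng.normal(0, 5, size=1000)  # random e, equally likely + or -
observed = true_score + errors        # O = T + e, for 1000 testings

print(observed[:3])      # individual observed scores bounce around
print(observed.mean())   # close to 75: across many testings the errors cancel
```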

Reliability – the basics
Example: we want to measure Sarah's ability to spell English words. We can't ask her to spell every word in the dictionary, so we ask her to spell a subset of English words. Her percentage correct estimates her true English spelling skill. But which words should be in our subset?

Estimating Sarah's spelling ability…
Suppose we choose 20 words randomly. Then, by chance, we may get a lot of very easy words (cat, tree, chair, stand…) or, by chance, a lot of very difficult words (desiccate, arteriosclerosis, numismatics).

Estimating Sarah's spelling ability…
Sarah's observed score will vary with the difficulty of the random sets of words we choose, but presumably her actual spelling ability remains constant.

Reliability – the basics
Other things can also produce error in our measurement. For example, on the first day that we test Sarah she's tired, but on the second day she's rested…

Estimating Sarah's spelling ability…
Conclusion: O = T + e, but e₁ ≠ e₂ ≠ e₃ …; the error differs on each occasion. The variation in Sarah's scores is produced by measurement error. How can we measure such effects? That is, how can we measure reliability?

Reliability – the basics In what follows, we consider various sources of error in measurement. Different ways of measuring reliability are sensitive to different sources of error.

How do we deal with sources of error?
Error due to test items → domain sampling error

Domain Sampling error
A knowledge base or skill set containing many items is to be tested (e.g., the chemical properties of foods). We can't test the entire set of items, so we sample items. That produces sampling error, as in Sarah's spelling test.

Domain Sampling error
Smaller sets of items may not test the entire knowledge base. A person's score may vary depending upon what is included in or excluded from the test. Reliability increases with the number of items on a test.

Domain Sampling error
Parallel Forms Reliability: choose two different sets of test items. Across all people tested, if the correlation between scores on the two sets is low, then we probably have domain sampling error.
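
A sketch of the computation (not from the slides; the data are simulated under the assumption that both forms tap the same underlying ability):

```python
import numpy as np

rng = np.random.default_rng(1)
n_people = 200

ability = rng.normal(50, 10, n_people)           # true spelling ability
form_a = ability + rng.normal(0, 4, n_people)    # form A, with its own error
form_b = ability + rng.normal(0, 4, n_people)    # form B, with its own error

# Parallel-forms reliability: correlate the two forms across people.
r_ab = np.corrcoef(form_a, form_b)[0, 1]
print(f"parallel-forms reliability = {r_ab:.2f}")  # high if both forms sample the domain well
```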

How do we deal with sources of error?
Error due to test items → domain sampling error
Error due to testing occasions → time sampling error

Time Sampling error
Test-retest reliability:
- The person taking the test might be having a very good or very bad day, due to fatigue, emotional state, preparedness, etc.
- Give the same test repeatedly and check the correlations among scores.
- High correlations indicate stability, i.e. less influence of bad or good days.
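
A hypothetical illustration (simulated data, not from the slides): the more the occasion sways scores, the lower the test-retest correlation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
trait = rng.normal(100, 15, n)        # stable true scores

# Small occasion-to-occasion error: scores are stable across days.
day1 = trait + rng.normal(0, 4, n)
day2 = trait + rng.normal(0, 4, n)
print(np.corrcoef(day1, day2)[0, 1])  # high test-retest correlation

# Large good-day/bad-day swings: same trait, much lower correlation.
day1 = trait + rng.normal(0, 20, n)
day2 = trait + rng.normal(0, 20, n)
print(np.corrcoef(day1, day2)[0, 1])
```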

Time sampling error
Advantage: easy to evaluate, using correlation.
Disadvantage: carryover and practice effects.

How do we deal with sources of error?
Error due to test items → domain sampling error
Error due to testing occasions → time sampling error
Error due to testing multiple traits → internal consistency error

Internal consistency approach
Suppose a test includes both (1) items on social psychology and (2) items requiring mental rotation of abstract visual shapes. Would you expect much correlation between scores on the two parts? No, because the two 'skills' are unrelated.

Internal consistency approach
A low correlation between scores on two halves of a test suggests that the test is tapping two different abilities or traits. In such a case, the two halves of the test give information about two different, uncorrelated traits.

Internal consistency approach
So we assess internal consistency by dividing the test into two halves and computing the correlation between scores on those two halves for the people who took the test. But how should we divide the test into halves to check the correlation?

Internal consistency approach
- Split-half method
- Kuder-Richardson formula
- Cronbach's alpha
All of these assess the extent to which items on a given test measure the same ability or trait.

Split-half Reliability
After testing, divide the test items into halves A and B that are scored separately. Compute the correlation of results for A with results for B. There are various ways of dividing the test into two: randomly, first half vs. second half, odd vs. even…
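
A minimal odd-even sketch (simulated data; the Spearman-Brown step-up at the end is standard practice for the "shortened test" problem, though the slide doesn't name it):

```python
import numpy as np

rng = np.random.default_rng(3)
n_people, n_items = 100, 20

# Toy data from a one-parameter (Rasch-style) model so items share variance.
ability = rng.normal(0, 1, (n_people, 1))
difficulty = rng.normal(0, 1, (1, n_items))
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
scores = (rng.random((n_people, n_items)) < p_correct).astype(int)

# Odd-even split: score each half separately, then correlate.
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Each half is only half as long as the full test; the Spearman-Brown
# step-up corrects the correlation for that shortening.
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, corrected = {r_full:.2f}")
```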

Kuder-Richardson 20
Kuder & Richardson (1937): an internal-consistency measure that doesn't require arbitrarily splitting the test into two halves. KR-20 avoids the problems associated with splitting by simultaneously considering all possible ways of splitting a test into two halves.
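
The standard formula is KR-20 = (k / (k − 1)) · (1 − Σ pⱼqⱼ / σ²), where k is the number of items, pⱼ is the proportion passing item j, qⱼ = 1 − pⱼ, and σ² is the variance of total scores. A direct translation:

```python
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """KR-20 for a people-by-items matrix of 0/1 item scores."""
    k = scores.shape[1]
    p = scores.mean(axis=0)                     # proportion passing each item
    q = 1 - p
    var_total = scores.sum(axis=1).var(ddof=1)  # variance of people's total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / var_total)
```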

Internal Consistency – Cronbach's α
KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false). Cronbach's α (alpha) generalizes KR-20 to tests with multiple response categories. α is a more generally useful measure of internal consistency than KR-20.
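
A sketch of the usual formula, α = (k / (k − 1)) · (1 − Σ σ²ⱼ / σ²), which substitutes item variances σ²ⱼ for the pⱼqⱼ terms of KR-20 (for 0/1 items the two coincide, up to the variance estimator used):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a people-by-items matrix (any numeric item scale)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()  # sum of per-item variances
    var_total = scores.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / var_total)
```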

Review: How do we deal with sources of error?

Approach | Measures | Issues
Test-Retest | Stability of scores | Carryover
Parallel Forms | Equivalence & stability | Effort
Split-half | Equivalence & internal consistency | Shortened test
KR-20 & α | Equivalence & internal consistency | Difficult to calculate

Reliability in Observational Studies
Some psychologists collect data by observing behavior rather than by testing. This approach requires time sampling, leading to sampling error. There is further error due to:
- observer failures
- inter-observer differences

Reliability in Observational Studies
Deal with the possibility of failure in the single-observer situation by having more than one observer. Deal with inter-observer differences using:
- inter-rater reliability
- the kappa statistic
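
Cohen's kappa corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A small sketch with made-up ratings:

```python
import numpy as np

def cohens_kappa(rater1, rater2) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    p_observed = (r1 == r2).mean()                      # raw agreement
    p_chance = sum((r1 == c).mean() * (r2 == c).mean()  # chance agreement from
                   for c in np.union1d(r1, r2))         # the raters' marginals
    return (p_observed - p_chance) / (1 - p_chance)

# Two observers coding ten behaviors into categories "A" and "B":
print(cohens_kappa(list("AABABBABAA"), list("AABBBBABAA")))  # 0.8
```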

Validity We distinguish between the validity of a measure of some psychological process or state and the validity of a conclusion. Here, we focus on validity of measures. A subsequent lecture will consider the validity of conclusions.

Theory: A influences B
Prediction: A → B
Operationalization: A = a, B = b
Measurement of b
We'll consider the validity of the theory and prediction phases in a few weeks; today we look at the validity of the operationalization and measurement phases.

Validity
A measure is valid if it measures what you think it measures. We traditionally distinguish between four types of validity:
- face
- content
- construct
- criterion

Four types of validity
Face: the test appears to measure what it is supposed to measure (not formally recognized as a type of validity).

Four types of validity
Face
Construct: the measure captures the theoretical construct it is supposed to measure.

Four types of validity
Face
Construct
Content: the measure samples the range of behavior covered by the construct.

Four types of validity
Face
Construct
Content
Criterion: results relate closely to those produced by other measures of the same construct, and do not relate to those produced by measures of other constructs.

Review (last week & this week)
We're not really interested in things that stay the same. We're interested in variation, but only systematic variation, not random variation:
- systematic variation can be explained
- random variation can't

Quick Review
Some variation in performance is random and some is systematic. The scientist's tasks are to separate the systematic variation from the random, and then to build models of the systematic variation.

Quick Review We choose a measurement scale. We prefer either ratio or interval scales, when we can get them. We try to maximize both the reliability and the validity of our measurements using that scale.

Review questions
- Which would you expect to be easier to assess: reliability or validity?
- Why do we have tools and machines to measure some things for us (such as rulers, scales, and money)?
- What are some analogues for rulers and scales, used when we measure psychological constructs?