
1 Lecture 3: Reliability and validity of scales Reliability: –internal consistency –test-retest –inter- and intra-rater –alternate form Validity: –content, criterion, and construct validity –responsiveness

2 Multi-item scales Measure constructs without a gold standard –e.g., depression, satisfaction, quality of life Items are intended to sample the content of the underlying construct Items summarized in various ways: –sum or average of responses to individual items –item weighting or other algorithm –profiles/sub-scale scores
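The summarizing options on this slide can be sketched as a small scoring helper. This is an illustration only: the function name and defaults are mine, not taken from any particular instrument.

```python
def scale_score(responses, weights=None):
    """Summarize one respondent's item responses into a scale score.

    Default: unweighted mean of the item responses (a simple additive
    scale). If per-item weights are supplied, a weighted sum is
    returned instead (one form of "item weighting or other algorithm").
    """
    if weights is not None:
        # item-weighting: weighted sum of responses
        return sum(w * r for w, r in zip(weights, responses))
    # additive scale: average of the item responses
    return sum(responses) / len(responses)
```

For example, `scale_score([1, 2, 3])` returns the mean 2.0, while `scale_score([1, 2, 3], weights=[1, 1, 2])` returns the weighted sum 9. Sub-scale profiles would simply apply the same helper to subsets of items.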

3 Example: Reliability and validity of a measure of severity of delirium Source: McCusker et al, Internat Psychogeriatrics 1998; 10: Delirium - acute confusion Common in older hospitalized patients Diagnosis of delirium is based on the following symptoms: –acute onset, fluctuations –inattention, disorganized thinking –altered consciousness, disorientation –memory impairment, perceptual disturbances –psychomotor agitation or retardation

4 Requirements of new scale Administered by interviewer at bedside Not using patient chart (to maintain blinding) Brief (avoid patient burden) Responsive to within-patient changes over time

5 Delirium Index (DI) Assesses severity of 7 symptoms of delirium (excl. acute onset, fluctuations, sleep disorder): –inattention, disorganized thinking –altered consciousness, disorientation –memory impairment, perceptual disturbances –psychomotor agitation or retardation

6 Administration and scoring Administered in conjunction with first 5 questions of Mini-Mental State Exam (MMSE) Each symptom rated on 4-point scale: 0 = absent 1 = mild 2 = moderate 3 = severe Operational definition of each symptom

7 Scoring Score is sum of 7 item scores Scoring of symptoms that could not be assessed: –patient non-responsive - coded as “severe” for items 1,2,4,5 –coding instructions provided for questions 3, 6, 7 –patient refuses - questions 1, 2, 4, 5 scores replaced by score of item 3
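The scoring rules above can be sketched as follows. This is a hedged illustration of the rules as stated on the slide: the function name is hypothetical, and the slide's separate coding instructions for items 3, 6, and 7 are not reproduced here.

```python
SEVERE = 3  # each DI item is rated 0 (absent) to 3 (severe)

def delirium_index(items, nonresponsive=False):
    """Sum the 7 DI item scores.

    `items` maps item number (1-7) to a 0-3 score, or None when the
    patient refused that item. Items 3, 6, and 7 have their own coding
    instructions (not reproduced here) and are assumed already scored.
    """
    scores = dict(items)
    if nonresponsive:
        # non-responsive patient: items 1, 2, 4, 5 coded as "severe"
        for i in (1, 2, 4, 5):
            scores[i] = SEVERE
    for i in (1, 2, 4, 5):
        if scores[i] is None:
            # refusal: score replaced by the score of item 3
            scores[i] = scores[3]
    return sum(scores.values())
```

The total therefore ranges from 0 to 21 (7 items, 0-3 each), with the substitution rules keeping the score computable despite missing items.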

8 Reliability Internal consistency Test-retest reliability Inter-rater and intra-rater reliability

9 Internal consistency Relevant to additive scales (that sum or average items) Split-half reliability: –correlation between scores on arbitrary half of measure with scores on other half Coefficient alpha (Cronbach) –estimates split half correlation for all possible combinations of dividing the scale
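Coefficient alpha can be computed directly from the item variances and the variance of the total score; a minimal sketch (function name mine):

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's coefficient alpha for an additive scale.

    `items` is a list of k lists, each holding one item's scores
    across the n respondents.
    """
    k = len(items)
    # total score per respondent (sum across items)
    totals = [sum(scores) for scores in zip(*items)]
    # sum of the individual item variances
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))
```

When every item carries the same information (e.g., two identical items), alpha equals 1.0; as items become less inter-correlated, alpha falls, which matches its interpretation as the average of all possible split-half correlations.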

10 Internal consistency of DI Cronbach’s alpha (overall) = 0.74 After exclusion of perceptual disturbance: 0.82 In sub-groups of patients: –delirium and dementia: 0.69, 0.79 –delirium alone: 0.67, 0.79 –dementia alone: 0.55, 0.59 –neither: 0.44, 0.52

11 Test-retest reliability (stability) Scale is repeated –short-term for constructs that fluctuate, 2 weeks often used to reduce effects of memory and true change –long-term for constructs that should not fluctuate (e.g., personality traits) Correlation between 2 scores is computed Also important to look at systematic increase or decrease in score

12 Test-retest reliability of DI Delirium is marked by fluctuations Variability over time is expected

13 Mean within-patient standard deviation in DI score during 1st week in hospital

14 Inter- and intra-rater reliability Inter-rater reliability For scales requiring rater skill, judgment 2 or more independent raters of same event Intra-rater reliability Independent rating by same observer of same event

15 Measures of inter- and intra-rater reliability: categorical data Percent agreement –can be used for dichotomous and polychotomous scales –limitation: value is affected by prevalence - higher if prevalence is very low or very high Kappa statistic –takes chance agreement into account –defined as fraction of observed agreement not due to chance

16 Kappa statistic Kappa = [p(obs) - p(exp)] / [1 - p(exp)] p(obs): proportion of observed agreement p(exp): proportion of agreement expected by chance
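The formula can be computed directly from a square agreement table; a minimal sketch (the function name is mine, not from the lecture):

```python
def cohen_kappa(table):
    """Cohen's kappa from a square agreement table.

    table[i][j] = number of subjects rated category i by rater 1
    and category j by rater 2.
    """
    n = sum(sum(row) for row in table)
    # observed agreement: proportion on the diagonal
    p_obs = sum(table[i][i] for i in range(len(table))) / n
    # chance-expected agreement from the marginal proportions
    row_m = [sum(row) / n for row in table]
    col_m = [sum(table[i][j] for i in range(len(table))) / n
             for j in range(len(table[0]))]
    p_exp = sum(r * c for r, c in zip(row_m, col_m))
    return (p_obs - p_exp) / (1 - p_exp)
```

For the table [[20, 5], [10, 15]] (n = 50), observed agreement is 0.70 and chance agreement is 0.50, so kappa = (0.70 - 0.50) / (1 - 0.50) = 0.40: only 40% of the non-chance agreement is achieved, even though raw agreement is 70%.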

17

18 Interpretation of kappa Various suggested interpretations Example: Fleiss (1981) excellent: 0.75 and above fair to good: 0.40 to 0.75 poor: less than 0.40 Limitations –depends on prevalence (see Szklo & Nieto) –do not use as only measure of agreement

19 Measures of inter- and intra-rater reliability: continuous data Measures of correlation –Correlation graph (scatter diagram) –Correlation coefficients Measures of pairwise comparison

20 Correlation coefficients Pearson’s r –assesses linear association, not systematic differences between 2 sets of observations –sensitive to range of values, especially outliers Spearman r –ordinal or rank order correlation –less influenced by outliers –doesn’t assess systematic differences
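The outlier-sensitivity contrast on this slide can be seen in a small sketch: Spearman's coefficient is simply Pearson's r computed on the ranks. Function names are mine, and tie handling (midranks) is omitted for brevity.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_r(x, y):
    """Spearman rank correlation: Pearson's r on the ranks.

    Assumes no tied values, so no midrank correction is applied.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    return pearson_r(ranks(x), ranks(y))
```

With x = [1, 2, 3, 4, 100] and y = [1, 2, 3, 4, 5], the single outlier pulls Pearson's r well below 1, while Spearman's coefficient stays at 1.0 because the ordering is perfectly monotone. Neither coefficient detects a systematic shift between raters (e.g., one rater always scoring 2 points higher still yields r = 1).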

21 Correlation coefficients Intra-class correlation coefficient (ICC) –Estimate of total measurement variability due to between-individuals (vs error variance) –Equivalent to kappa and same range of values –Reflects true agreement, including systematic differences –Affected by range of values - if less variation between individuals, ICC will be lower
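ICC comes in several variants; as one illustration, the one-way random-effects ICC(1,1) can be sketched from the standard ANOVA decomposition. This is an assumption on my part — the lecture does not specify which ICC form was used for the DI — and it assumes complete data (every subject rated k times).

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    `ratings` is a list of per-subject lists, each holding the k
    ratings of that subject.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(r) for r in ratings) / (n * k)
    subj_means = [sum(r) / k for r in ratings]
    # between-subjects mean square
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    # within-subjects (error) mean square
    msw = sum((x - m) ** 2
              for r, m in zip(ratings, subj_means)
              for x in r) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

With perfect agreement the within-subject variance is zero and the ICC is 1.0. Note how the slide's last point shows up here: shrinking the between-subject spread shrinks MSB and hence the ICC, even if rater disagreement (MSW) is unchanged.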

22 Inter-rater reliability of DI Intraclass correlation coefficient (ICC): n = 26 patients (39 pairs of ratings) ICC = 0.98 (SD 0.06)

23 Alternate form reliability Agreement between alternate forms of same instrument: –longer vs shorter version –alternate method of administration: face-to-face vs telephone subject vs proxy (see Magaziner paper)

24 Validity Content and face validity Criterion validity: concurrent and predictive Construct validity

25 Validity Depends on purpose: –screening: discrimination –outcome of treatment: responsive, sensitivity to change –prognosis: predictive validity

26 Content and face validity Judgment of “experts” and/or members of target population Does measure adequately sample domain being measured? Does it appear to measure what it is intended to measure? (eyeball test)

27 Content validity of DI Based on Confusion Assessment Method (CAM) –based on accepted diagnostic criteria (DSM) –widely used

28 Criterion validity Criterion (“gold” standard) Concurrent criterion validity –e.g., screening test vs diagnostic test Predictive criterion validity –e.g., cancer staging test vs 5-year survival

29 Criterion validity of DI Correlation between psychiatrist-scored DI (based only on patient observation) and Delirium Rating Scale (using all available information) –original scale –adjusted scale, omitting 4 items not assessed by DI

30 Criterion validity of DI: results Spearman correlation coefficient (and 95% CI) between DI and adjusted DRS (using multiple observations): –at one point in time: 0.84 (0.75, 0.89) –within-subject change over time: 0.71 (0.53, 0.82)

31 Delirium severity and survival Proportional hazards regression of delirium severity in delirium cohort Mean of 1st 2 DI scores Results –significant interaction: DI predicted survival in patients with delirium alone, not in those with dementia

32 Construct validity Is the theoretical construct underlying the measure valid? Development and testing of hypotheses Requires multiple data sources and investigations: –Convergent validity: measure is correlated with other measures of similar constructs –discriminant validity: measure is not correlated with measures of different constructs

33 Construct validity (cont) Multitrait-multimethod: –convergent validity: measure is correlated with other measures of similar constructs –discriminant validity: measure is not correlated with measures of different constructs Factorial method: –factor analysis or principal components analysis to identify underlying dimensions

34 Spearman correlation coefficients between Delirium Index and 3 baseline measures of current status

35 Spearman correlation coefficients between Delirium Index and 3 baseline measures of prior status

36 Responsiveness of measures Ability to detect clinically important change over time or differences between treatments Requirement of evaluative measures

37 Some sources of bias in scales “Response sets” –Social desirability –Acquiescent

38 Social desirability Tendency to give answers to questions that are perceived to be more socially desirable than the true answer Different from deliberate distortion (“faking good”) Depends on: –Individual characteristics (age, sex, cultural background) –Specific question

39 Social desirability Measures of social desirability (SD) –SD scales (e.g., Jackson SD scale, Crowne & Marlowe SD scale) –individual tendency to SD bias Prevention –phrasing of questions –questionnaire mode –training of interviewers

40 Acquiescent response set Tendency to agree with Likert-type questions Can be prevented by mix of positively and negatively-phrased questions, e.g.: –My health care is just about perfect –There are serious problems with my health care
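Mixing positively and negatively phrased items requires reverse-scoring the negative ones before summing; a minimal sketch for a Likert item (the function name and default 1-5 bounds are mine, not fixed by the lecture):

```python
def reverse_score(response, low=1, high=5):
    """Flip a Likert response so a negatively phrased item points in
    the same direction as the positively phrased items."""
    return high + low - response
```

On a 1-5 agreement scale, "strongly agree" (5) with "There are serious problems with my health care" reverse-scores to 1, matching "strongly disagree" with "My health care is just about perfect". A respondent who acquiesces to everything then no longer inflates the total.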

41 Measurement of Quality of life (QoL) Definition –“individuals’ perception of their position in life in the context of the culture and value systems in which they live and in relation to their goals, expectations, standards, and concerns” (WHO QOL group, 1995) Domains –physical, psychological, level of independence, social relationships, environment, and spirituality/religion/personal beliefs

42 Health-related quality of life (HRQoL) Dimensions of QoL related to health Related terms: –health status –functional status Usually includes: –physical health/function –mental health/function –social health/function

43 Evaluative HRQoL instruments Purpose –evaluate within-individual change over time Reliability: –responsiveness Construct validity: –correlations of changes in measures during period of time, consistent with theoretically derived predictions

44 Discriminative HRQoL instruments Purpose –evaluate differences between individuals at point in time Reliability: –reproducibility Construct validity: –correlations between measures at point in time, consistent with theoretically derived predictions

45 How is HRQoL measured? Mode –Interviewer: face-to-face or telephone –Self-completed Completed by –self –proxy/surrogate

46 Types of HRQoL measures Generic (global) –Health profiles –Utility measures Specific

47 Generic vs specific Generic –comparisons across populations and problems –robust and generalizable –measurement properties better understood Disease-specific –shorter –more relevant and appropriate –sensitive to change

48 Appropriateness Purpose: –describe health of population –evaluate effects of interventions (change over time) –compare groups at point in time –predict outcomes Areas of function covered Level of health Generic/global or specific