1 Lecture 3: Reliability and validity of scales Reliability: –internal consistency –test-retest –inter- and intra-rater –alternate form Validity: –content, criterion, and construct validity –responsiveness

2 Multi-item scales Measure constructs without a gold standard –e.g., depression, satisfaction, quality of life Items are intended to sample the content of the underlying construct Items summarized in various ways: –sum or average of responses to individual items –item weighting or other algorithm –profiles/sub-scale scores

3 Example: Reliability and validity of a measure of severity of delirium Source: McCusker et al, Internat Psychogeriatrics 1998; 10:421-33 Delirium - acute confusion Common in older hospitalized patients Diagnosis of delirium is based on the following symptoms: –acute onset, fluctuations –inattention, disorganized thinking –altered consciousness, disorientation –memory impairment, perceptual disturbances –psychomotor agitation or retardation

4 Requirements of new scale Administered by interviewer at bedside Not using patient chart (to maintain blinding) Brief (avoid patient burden) Responsive to within-patient changes over time

5 Delirium Index (DI) Assesses severity of 7 symptoms of delirium (excl. acute onset, fluctuations, sleep disorder): –inattention, disorganized thinking –altered consciousness, disorientation –memory impairment, perceptual disturbances –psychomotor agitation or retardation

6 Administration and scoring Administered in conjunction with first 5 questions of Mini-Mental State Exam (MMSE) Each symptom rated on 4-point scale: 0 = absent 1 = mild 2 = moderate 3 = severe Operational definition of each symptom

7 Scoring Score is sum of 7 item scores Scoring of symptoms that could not be assessed: –patient non-responsive - coded as “severe” for items 1, 2, 4, 5 –coding instructions provided for questions 3, 6, 7 –patient refuses - questions 1, 2, 4, 5 scores replaced by score of item 3
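The scoring rules on this slide can be sketched in code. This is a minimal illustration, not the published scoring algorithm: the function name, the status labels, and the use of None for an unassessed item are all assumptions, and the coding instructions for items 3, 6, and 7 are not given in the slides, so the sketch raises an error rather than guessing them.

```python
from typing import Optional, Sequence

SEVERE = 3  # top of the 0-3 rating scale described on the previous slide


def di_score(items: Sequence[Optional[int]], status: str = "assessed") -> int:
    """Sum seven item ratings (0-3); items[0] holds item 1.

    status is a hypothetical label: "assessed", "non_responsive", or "refused".
    None marks an item that could not be assessed.
    """
    if len(items) != 7:
        raise ValueError("expected 7 item ratings")
    filled = list(items)
    for idx in (0, 1, 3, 4):  # items 1, 2, 4, 5
        if filled[idx] is None:
            if status == "non_responsive":
                filled[idx] = SEVERE      # coded as "severe"
            elif status == "refused":
                filled[idx] = filled[2]   # replaced by the score of item 3
    if any(v is None for v in filled):
        # Rules for items 3, 6, 7 are not stated in the slides.
        raise ValueError("unassessed item with no coding rule in this sketch")
    if any(not 0 <= v <= 3 for v in filled):
        raise ValueError("ratings must be between 0 and 3")
    return sum(filled)


complete = di_score([1, 2, 0, 1, 1, 0, 2])
non_responsive = di_score([None, None, 2, None, None, 1, 0], "non_responsive")
refused = di_score([None, None, 2, None, None, 1, 0], "refused")
```

With these made-up ratings, the non-responsive patient scores higher than the refusing patient because four items are set to 3 rather than to the item 3 rating.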

8 Reliability Internal consistency Test-retest reliability Inter-rater and intra-rater reliability

9 Internal consistency Relevant to additive scales (that sum or average items) Split-half reliability: –correlation between scores on an arbitrary half of the measure and scores on the other half Coefficient alpha (Cronbach): –estimates the average split-half correlation over all possible ways of dividing the scale
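As a rough illustration of coefficient alpha, the sketch below uses the standard variance form, alpha = k/(k-1) * (1 - sum of item variances / variance of total score), where k is the number of items. The function name and the example data are invented for illustration and are not from the Delirium Index study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (all the same length)."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up ratings: 4 respondents, 3 items each.
consistent = [[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3]]
noisy = [[0, 1, 0], [1, 0, 2], [2, 3, 1], [3, 2, 3]]

alpha_consistent = cronbach_alpha(consistent)  # items move together perfectly
alpha_noisy = cronbach_alpha(noisy)            # items agree less closely
```

When every item ranks respondents identically, alpha reaches 1; as items agree less, the item variances make up a larger share of the total-score variance and alpha falls.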

10 Internal consistency of DI Cronbach’s alpha (overall) = 0.74 After exclusion of perceptual disturbance: 0.82 In sub-groups of patients: –delirium and dementia: 0.69, 0.79 –delirium alone: 0.67, 0.79 –dementia alone: 0.55, 0.59 –neither: 0.44, 0.52

11 Test-retest reliability (stability) Scale is repeated –short-term for constructs that fluctuate, 2 weeks often used to reduce effects of memory and true change –long-term for constructs that should not fluctuate (e.g., personality traits) Correlation between 2 scores is computed Also important to look at systematic increase or decrease in score

12 Test-retest reliability of DI Delirium is marked by fluctuations Variability over time is expected

13 Mean within-patient standard deviation in DI score during 1st week in hospital

14 Inter- and intra-rater reliability Inter-rater reliability For scales requiring rater skill, judgment 2 or more independent raters of same event Intra-rater reliability Independent rating by same observer of same event

15 Measures of inter- and intra-rater reliability: categorical data Percent agreement –can be used for di- and polychotomous scales –limitation: value is affected by prevalence - higher if very low or very high prevalence Kappa statistic –takes chance agreement into account –defined as fraction of observed agreement not due to chance

16 Kappa statistic Kappa = [p(obs) - p(exp)] / [1 - p(exp)] p(obs): proportion of observed agreement p(exp): proportion of agreement expected by chance
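A worked example may help here. The sketch below computes percent agreement and kappa from a square agreement table (rows = rater 1, columns = rater 2), with expected agreement taken from the marginal totals. The function name and both tables are invented for illustration; the second table has the same percent agreement as the first but a much lower prevalence of positive ratings.

```python
def cohen_kappa(table):
    """table[i][j]: number of subjects rated category i by rater 1, j by rater 2."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_obs = sum(table[i][i] for i in range(k)) / n
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: product of the two raters' marginal proportions.
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return p_obs, (p_obs - p_exp) / (1 - p_exp)


# Made-up 2x2 tables, 100 subjects each: [[both yes, yes/no], [no/yes, both no]]
balanced = [[20, 5], [10, 65]]
rare = [[1, 5], [10, 84]]

agree_balanced, kappa_balanced = cohen_kappa(balanced)
agree_rare, kappa_rare = cohen_kappa(rare)
```

Both tables show 85% agreement, yet kappa is 0.625 for the first and about 0.04 for the second, because almost all of the agreement in the low-prevalence table is expected by chance.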

18 Interpretation of kappa Various suggested interpretations Example: Fleiss (1981) excellent: 0.75 and above fair to good: 0.40 - 0.74 poor: less than 0.40 Limitations –depends on prevalence (see Szklo & Nieto) –do not use as only measure of agreement

19 Measures of inter- and intra-rater reliability: continuous data Measures of correlation –Correlation graph (scatter diagram) –Correlation coefficients Measures of pairwise comparison

20 Correlation coefficients Pearson’s r –assesses linear association, not systematic differences between 2 sets of observations –sensitive to range of values, especially outliers Spearman r –ordinal or rank order correlation –less influenced by outliers –doesn’t assess systematic differences
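The point that correlation coefficients ignore systematic differences can be shown with a small sketch. The data are invented: the second rater scores every subject exactly 2 points higher than the first. The helper functions are written out by hand here to keep the example self-contained.

```python
from statistics import mean


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            out[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman_r(x, y):
    # Spearman's coefficient is Pearson's r computed on the ranks.
    return pearson_r(ranks(x), ranks(y))


rater_a = [3, 5, 8, 10, 14]
rater_b = [a + 2 for a in rater_a]  # constant upward shift of 2 points

r_pearson = pearson_r(rater_a, rater_b)
r_spearman = spearman_r(rater_a, rater_b)
mean_difference = mean(b - a for a, b in zip(rater_a, rater_b))
```

Both coefficients equal 1 even though the raters never give the same score, which is why the mean difference between ratings should be examined alongside the correlation.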

21 Correlation coefficients Intra-class correlation coefficient (ICC) –Estimates the proportion of total measurement variability that is due to differences between individuals (vs error variance) –Analogous to kappa for continuous data, with the same range of values –Reflects true agreement, including systematic differences –Affected by range of values - if less variation between individuals, ICC will be lower
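The slides do not say which form of the ICC is meant. As one possible sketch, the code below uses the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1), computed from the mean squares of a subjects-by-raters table. The function name and the ratings are invented for illustration.

```python
from statistics import mean


def icc_2_1(ratings):
    """ratings: one row per subject, one column per rater."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Made-up ratings of 5 subjects by 2 raters.
identical = [[3, 3], [5, 5], [8, 8], [10, 10], [14, 14]]
shifted = [[3, 5], [5, 7], [8, 10], [10, 12], [14, 16]]  # rater 2 always 2 higher

icc_identical = icc_2_1(identical)
icc_shifted = icc_2_1(shifted)
```

For the shifted ratings a Pearson correlation would be 1, but this ICC is about 0.90, because the between-rater mean square enters the denominator and penalizes the systematic difference.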

22 Inter-rater reliability of DI Intraclass correlation coefficient (ICC): n = 26 patients (39 pairs of ratings) ICC = 0.98 (SD 0.06)

23 Alternate form reliability Agreement between alternate forms of same instrument: –longer vs shorter version –alternate method of administration: face-to-face vs telephone; subject vs proxy (see Magaziner paper)

24 Validity Content and face validity Criterion validity: concurrent and predictive Construct validity

25 Validity Depends on purpose: –screening: discrimination –outcome of treatment: responsive, sensitivity to change –prognosis: predictive validity

26 Content and face validity Judgment of “experts” and/or members of target population Does measure adequately sample domain being measured? Does it appear to measure what it is intended to measure? (eyeball test)

27 Content validity of DI Based on Confusion Assessment Method (CAM) –based on accepted diagnostic criteria (DSM) –widely used

28 Criterion validity Criterion (“gold” standard) Concurrent criterion validity –e.g., screening test vs diagnostic test Predictive criterion validity –e.g., cancer staging test vs 5-year survival

29 Criterion validity of DI Correlation between psychiatrist-scored DI (based only on patient observation) and Delirium Rating Scale (using all available information) –original scale –adjusted scale, omitting 4 items not assessed by DI

30 Criterion validity of DI: results Spearman correlation coefficient (and 95% CI) between DI and adjusted DRS (using multiple observations): –at one point in time: 0.84 (0.75, 0.89) –within-subject change over time: 0.71 (0.53, 0.82)

31 Delirium severity and survival Proportional hazards regression of delirium severity in delirium cohort Mean of 1st 2 DI scores Results –significant interaction: DI predicted survival in patients with delirium alone, not in those with dementia

32 Construct validity Is the theoretical construct underlying the measure valid? Development and testing of hypotheses Requires multiple data sources and investigations: –Convergent validity: measure is correlated with other measures of similar constructs –discriminant validity: measure is not correlated with measures of different constructs

33 Construct validity (cont) Multitrait-multimethod: –Convergent validity: measure is correlated with other measures of similar constructs –Discriminant validity: measure is not correlated with measures of different constructs Factorial method: –factor analysis or principal components analysis to identify underlying dimensions

34 Spearman correlation coefficients between Delirium Index and 3 baseline measures of current status

35 Spearman correlation coefficients between Delirium Index and 3 baseline measures of prior status

36 Responsiveness of measures Ability to detect clinically important change over time or differences between treatments Requirement of evaluative measures

37 Some sources of bias in scales “Response sets” –Social desirability –Acquiescent

38 Social desirability Tendency to give answers to questions that are perceived to be more socially desirable than the true answer Different from deliberate distortion (“faking good”) Depends on: –Individual characteristics (age, sex, cultural background) –Specific question

39 Social desirability Measures of social desirability (SD) –SD scales (e.g., Jackson SD scale, Crowne & Marlowe SD scale) –individual tendency to SD bias Prevention –phrasing of questions –questionnaire mode –training of interviewers

40 Acquiescent response set Tendency to agree with Likert-type questions Can be prevented by mix of positively and negatively-phrased questions, e.g.: –My health care is just about perfect –There are serious problems with my health care

41 Measurement of Quality of life (QoL) Definition –“individuals’ perception of their position in life in the context of the culture and value systems in which they live and in relation to their goals, expectations, standards, and concerns” (WHO QOL group, 1995) Domains –physical, psychological, level of independence, social relationships, environment, and spirituality/religion/personal beliefs

42 Health-related quality of life (HRQoL) Dimensions of QoL related to health Related terms: –health status –functional status Usually includes: –physical health/function –mental health/function –social health/function

43 Evaluative HRQoL instruments Purpose –evaluate within-individual change over time Reliability: –responsiveness Construct validity: –correlations of changes in measures during period of time, consistent with theoretically derived predictions

44 Discriminative HRQoL instruments Purpose –evaluate differences between individuals at point in time Reliability: –reproducibility Construct validity: –correlations between measures at point in time, consistent with theoretically derived predictions

45 How is HRQoL measured? Mode –Interviewer: face-to-face, telephone –Self-completed Completed by –self –proxy/surrogate

46 Types of HRQoL measures Generic (global) –Health profiles –Utility measures Specific

47 Generic vs specific Generic –comparisons across populations and problems –robust and generalizable –measurement properties better understood Disease-specific –shorter –more relevant and appropriate –sensitive to change

48 Appropriateness Purpose: –describe health of population –evaluate effects of interventions (change over time) –compare groups at point in time –predict outcomes Areas of function covered Level of health Generic/global or specific
