
1
Patient reported outcome measures and the Rasch model
Helen Parsons

2
Patient reported outcome measures
◦ Quick overview
◦ Analysis problems
Rasch models
◦ Simple Rasch formulation
◦ Rasch extensions: polytomous data
Application of the Rasch model
◦ Using the Oxford Hip Score
◦ Model fit criteria
◦ DIF checking
Summary

3
Outcome measures are widespread, with patient reported outcome measures (PROMs) increasingly used
They try to capture some latent trait of the respondent
◦ i.e. a trait that is difficult to measure directly, such as "quality of life" or "anxiety"
Often in a self-report questionnaire format
◦ e.g. EQ-5D
Some outcome measures are reported by clinicians
◦ e.g. HoNOS
Some incorporate clinical findings as well as questionnaire data
◦ e.g. DAS28

4
Outcome measures have a variety of uses
◦ One-off assessment as a diagnostic tool
◦ Comparative assessment, such as measuring the outcome before and after an intervention
◦ Longitudinal analysis
◦ The NHS records and publishes the aggregated results from 4 PROMs as part of its quality assurance process

5
As PROMs tend to be in a questionnaire format, results are often reported as a "total score"
◦ i.e. a sum of ordinal scores
The distributions are often not "nice"
◦ Not normal
◦ Bi-modal
◦ Floor and ceiling effects
Analysis usually assumes linear relationships
◦ That is, that moving from 4/10 to 5/10 is the same clinical gain as moving from 9/10 to 10/10

6
Example of PROM baseline data
Here a low score denotes good function
Most patients are on higher values
The tail is abruptly cut off on the RHS
◦ Patients can have worse function than, but score the same as, others
Data from Nick: OHS from the WAT trial (ref: slide 15)

7
Part of Item Response Theory
Introduced by Georg Rasch (1901–1980)
◦ Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests.
Used in psychometrics, so it was created to describe a participant's ability as measured by item difficulties
◦ Ability: the 'latent trait' of the participant, e.g. the "maths ability" of a student
◦ Difficulty: the levels of the latent trait the question can discriminate between, e.g. "easy" items identify poor students whilst "hard" items show the difference between good and excellent students

8
Given a data matrix of (binary) scores of n persons (S_1, S_2, …, S_n) on a fixed set of k items (I_1, I_2, …, I_k) that measure the same latent trait, θ:
◦ Each subject S_v has a person parameter θ_v denoting their position on the latent trait (ability)
◦ Each item I_i has an item parameter β_i denoting its difficulty

9
Let:
◦ β represent the vector of item parameters
◦ θ represent the vector of person parameters
◦ X be the n × k data matrix with elements x_vi equal to 0 or 1
Then:
P(x_vi = 1) = exp(θ_v − β_i) / (1 + exp(θ_v − β_i))
Also assume:
◦ Independence of answers between persons (no group work, no cheating!)
◦ A person's answers are stochastically independent: all depend on ability only, with no person subgroups
◦ The latent trait is uni-dimensional, i.e. a scale can be used to assess "shame" but not "anxiety and depression" together
Rasch Models: Foundations, recent developments and applications. Fischer and Molenaar. Springer 1995.
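The response probability above can be sketched in a few lines of Python; the function name `rasch_prob` is my own, not taken from any Rasch package.

```python
import math

# Minimal sketch of the Rasch item response function: the probability that
# a person of ability theta gives a positive answer to an item of
# difficulty beta.
def rasch_prob(theta, beta):
    return math.exp(theta - beta) / (1 + math.exp(theta - beta))
```

Note that the probability is exactly 0.5 whenever ability equals difficulty, and for a fixed item it increases with ability.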

10
As ability varies, the item response follows a logistic relationship
The probability of a positive answer is 0.5 when person ability equals item difficulty
For a given difficulty, larger abilities give a greater chance of affirming the item
◦ i.e. better students score more!
A plot of this relationship against ability is called an "Item Characteristic Curve" (ICC)
Notice that the latent dimension is rescaled to centre on zero and is measured in logistic units

11
A logistic model better captures a finite scale
Gives information on both persons and items
Model parameters are simple to obtain
◦ The total score is sufficient for calculating the person parameter
◦ The item score across persons is sufficient for calculating the item parameter
Extensions include
◦ Polytomous data
◦ 2- and 3-parameter IRT models: the 2nd parameter adds a "discrimination" (slope) parameter; the 3rd allows for "guessing"
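As a toy illustration of how simply the parameters can be obtained, here is a joint maximum-likelihood fit by gradient ascent. This is a sketch of the general idea only, not the algorithm eRm uses (eRm fits by conditional ML), and the step size and iteration count are illustrative assumptions. The sufficiency of the total score is visible in the gradient: each person's update depends only on their raw total, so persons with equal totals receive identical ability estimates.

```python
import math

def fit_rasch(X, iters=3000, lr=0.05):
    """Joint ML fit of the dichotomous Rasch model by gradient ascent.
    X is a persons-by-items matrix of 0/1 scores; rows or columns that are
    all 0 or all 1 have no finite ML estimate and must be excluded."""
    n, k = len(X), len(X[0])
    theta = [0.0] * n  # person abilities
    beta = [0.0] * k   # item difficulties
    for _ in range(iters):
        # model-expected score for every (person, item) cell
        p = [[1 / (1 + math.exp(-(theta[v] - beta[i]))) for i in range(k)]
             for v in range(n)]
        # person gradient = raw total minus expected total, so the raw
        # total is all the person-level information the model uses
        for v in range(n):
            theta[v] += lr * sum(X[v][i] - p[v][i] for i in range(k))
        for i in range(k):
            beta[i] -= lr * sum(X[v][i] - p[v][i] for v in range(n))
        # fix the scale: centre item difficulties on zero (shifting both
        # vectors leaves every theta - beta difference unchanged)
        m = sum(beta) / k
        beta = [b - m for b in beta]
        theta = [t - m for t in theta]
    return theta, beta
```

On any such data set, persons with the same raw total end up with the same estimated ability, and items answered correctly by fewer people end up with larger difficulties.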

12
The Rasch model uses a pass/fail score
However, what happens when someone passes only part of an item?
◦ E.g. exam marking: questions with multiple marks available
◦ E.g. surveys: Likert-format questions
Two model variants:
Partial Credit Models — each item is allowed a different number of thresholds, each at a separate difficulty
Rating Scale Models — items all have the same number of thresholds at identical difficulty levels
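The partial credit model's category probabilities can be sketched as follows (the function name and threshold values are mine, not from the talk): the chance of landing in category x depends on the cumulative sum of (ability − threshold) over the steps taken.

```python
import math

# Minimal sketch of the partial credit model for one item.
# 'thresholds' holds the step difficulties beta_i1..beta_im.
def pcm_probs(theta, thresholds):
    """Return [P(X = 0), ..., P(X = m)] for a person of ability theta."""
    # cumulative sums of (theta - threshold); the empty sum for X = 0 is 0
    logits = [0.0]
    for t in thresholds:
        logits.append(logits[-1] + (theta - t))
    denom = sum(math.exp(l) for l in logits)
    return [math.exp(l) / denom for l in logits]
```

With a single threshold this collapses to the dichotomous Rasch model: `pcm_probs(0.0, [0.0])` gives `[0.5, 0.5]`. The rating scale model is the special case where every item shares one common set of threshold spacings.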

13
Rating scale model v. partial credit model
The same data (an eRm package example) was used to create each model
[Plots for Questions 1, 2 and 3 under each model]
◦ RSM items only shift left and right
◦ PCM items change shape as well as shift

14
Several payware packages available
◦ WINSTEPS
◦ RUMM2020
Freeware becoming available
◦ Several R packages now released: eRm is used throughout this talk; ltm and psych also have Rasch implementations
Growing literature base
◦ But introductory books and courses are hard to find!

15
Assesses hip function
Designed to assess patients undergoing hip replacement surgery
◦ Patient reported measure
◦ 12 questions; patients choose the statement (out of 5 possible) which best reflects their situation
◦ Here, each item is marked 0–4 and the total score summed
Minimum of 0 indicates 'perfect' function
Maximum of 48 (12 × 4) indicates worst function
Dawson J, Fitzpatrick R, Carr A, Murray D. Questionnaire on the perceptions of patients about total hip replacement. J Bone Joint Surg Br 1996 Mar;78(2).

16
Data from the WAT trial
◦ The Warwick Arthroplasty Trial
◦ 126 participants at baseline
◦ 2 intervention groups: hip replacement v. resurfacing
Analysed using the partial credit model
◦ Where categories were not all used, the remaining categories were renumbered, starting from 0
Data available from the same cohort longitudinally
Costa ML, Achten J, Parsons N, Edlin RP, Foguet P, Prakash U, Griffin DR. A Randomised Controlled Trial of Total Hip Arthroplasty Versus Resurfacing Arthroplasty in the Treatment of Young Patients with Arthritis of the Hip Joint. BMJ 2012; 344:e2147.
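The renumbering step above can be sketched as follows; this is a hypothetical helper for illustration, not code from the trial analysis.

```python
# Map the observed ordinal categories for one item onto 0, 1, 2, ...
# in order, skipping any categories that no respondent used.
def renumber(scores):
    levels = sorted(set(scores))
    lookup = {level: rank for rank, level in enumerate(levels)}
    return [lookup[s] for s in scores]
```

For example, if no respondent used categories 1 and 3, `renumber([0, 2, 4, 2])` returns `[0, 1, 2, 1]`, preserving the ordering while closing the gaps.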

17
[Figure: baseline data — distribution of abilities against item difficulties, showing mean difficulty and category thresholds; items in red indicate non-sequential categories]

18
Question 9 has the lowest mean item parameter
◦ Indicating best function
◦ "Have you been limping when walking?"
Question 2 has the highest mean item parameter
◦ Indicating worst function
◦ "Have you had any trouble washing and drying yourself?"
Question 8 covers the widest set of difficulties
◦ Most discriminating item
◦ "After a meal (sat at a table), how painful has it been for you to stand up from a chair?"
4 questions have non-sequential thresholds
◦ Why does this happen?

19
[Figure: category probability curves for a non-sequential item (Question 5) and a sequential item (Question 11); thresholds (0|1, 1|2, 2|3, 3|4) occur where adjacent curves cross]

20
Non-sequential categories result from
◦ Underused categories
◦ Unexpected scoring patterns
Could suggest problems with the item
Fixed by
◦ Removal of the item
◦ Combining categories
[Table: OHS Q5 score cross-tabulated against banded total score]

21
Can associate scores and abilities
◦ Monotonically increasing relationship
Clear that a score increase of 1 is associated with different increases in ability
◦ "Bigger" loss of function for low scorers
◦ The middle of the score scale gives similar abilities
[Figure: ability against baseline score]

22
[Figure: total score distribution v. ability distribution — abilities centred about zero, with a heavy tail]

23
Have several models at different time points
◦ Could use the baseline model throughout
◦ Could use a new model at each time point
Have two treatment groups
◦ A and B
Four follow-up points post intervention
[Figure: baseline data — ability by treatment group]

24
[Figure: raw scores and calculated abilities, at baseline and at 6 weeks]
Using intention-to-treat groups
No significant differences between groups

25
[Figure: scores and predicted abilities at 6 weeks]
No significant differences between groups

26
[Figure: scores and predicted abilities at 12 months; differences between 12-month and baseline scores, and between 12-month (predicted) and baseline (calculated) abilities]
No significant differences in either rating
This was the primary outcome of the trial

27
Very different to baseline
◦ Question 4 now easiest (was Q9)
◦ Question 3 now hardest (was Q2)
◦ Double the number of reversed scales (8)
Suggests that patient function has changed greatly
[Figure: item map, with the baseline model shown for comparison]

28
Notice the wide range of abilities
◦ Some patients now "recovered"
◦ Some patients still with low function
Similar to the baseline model
◦ Q9 easiest
◦ Q8 most discriminatory
◦ Q2 second most difficult

29
[Figure: item maps for the baseline, 6-week and 12-month models]

30
[Figure: abilities using the baseline model v. abilities using the 6-week model]
◦ A scale calibrated from the 6-week data collection allows comparison of items
◦ A scale calibrated from the baseline data collection allows comparison of persons

31
Because at baseline no responders used the lowest two categories, the model did not have the full range of scores
◦ Q1: "How would you describe the pain you usually had from your hip?"
This resulted in missing values at other collection points
◦ At 6 weeks: 7 with no score, 14 missing in total
◦ At 12 months: 3 with no score, 9 missing in total
Would need "calibration" data
◦ From a "healthy" population?
◦ From all time points?
The Rasch model excludes maximum and minimum scores from the model
◦ These can be calculated post hoc

32
Fit statistics are not standardised across software, so it's hard to get a clear picture
◦ Names, formulae and boundaries all differ
◦ There doesn't appear to be a standard approach
Using WINSTEPS nomenclature here
◦ As the manual is available online
◦ But this is still work in progress!
◦ It is not clear which implementation the eRm package uses

33
Chi-squared statistics
◦ Observed v. model-expected
Mean square residuals (MSQ)
t-statistics
◦ A transformation of the MSQ
◦ Not certain where useful cut-offs are
Two versions of each type
◦ Infit (information-weighted, emphasising responses targeted at the person's ability)
◦ Outfit (unweighted, over the whole sample)
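As a rough sketch of the WINSTEPS-style definitions for the dichotomous case (the helper names are mine, and cut-offs are deliberately omitted): both statistics are means of squared residuals, with outfit averaging the standardised residuals equally and infit weighting each response by its binomial variance (its information).

```python
import math

def expected(theta, beta):
    """Model-expected score under the dichotomous Rasch model."""
    return 1 / (1 + math.exp(-(theta - beta)))

def item_fit(xs, thetas, beta):
    """Outfit and infit MSQ for one item of difficulty beta, given the
    observed 0/1 scores xs and the person abilities thetas."""
    e = [expected(t, beta) for t in thetas]
    var = [p * (1 - p) for p in e]  # binomial variance of each response
    # outfit: plain mean of squared standardised residuals
    outfit = sum((x - p) ** 2 / w for x, p, w in zip(xs, e, var)) / len(xs)
    # infit: squared residuals weighted by their variance (information)
    infit = sum((x - p) ** 2 for x, p in zip(xs, e)) / sum(var)
    return outfit, infit
```

Values near 1 suggest adequate fit; values well above 1 indicate unexpected responses, with outfit in particular inflated by surprises far from a person's ability (e.g. a low-ability person affirming a hard item).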

34
Sample size dependence varies by statistic
Most are defined in terms of the standard Rasch model only
Person-fit statistics are also available
◦ Similar approach
When a misfitting item is removed, the whole model must be recalculated
◦ Which then finds new poorly fitting items, etc.
◦ Over half of all items were removed in an ME data set
May be problems due to the instrument not being designed for Rasch analysis
◦ Subscales are a major problem
Smith et al. Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology 2008, 8:33.

35
[Table: fit statistics for each question (Q1–Q12) — chi-squared, df, p-value, outfit MSQ, infit MSQ, outfit t, infit t]

36
Other problems to consider:
◦ Lots of variability in item parameters
[Figure: item parameters with 95% CIs — the CIs for the ability thresholds overlap]

37
[Figure: person abilities with a 95% CI added for each person]
Often there are misfitting persons, but how to deal with this has not been looked into to date

38
The Rasch model requires that item difficulty does not change between groups
E.g. a shoulder function questionnaire asks about the ability to brush and style hair
◦ If (on average) women spend more effort on more elaborate hairstyles, it would not be surprising to see that women with the same level of function find doing their hair more difficult
Differential item functioning (DIF) checking tests whether this is the case

39
[Figure: item parameters for Group A v. Group B, with one item flagged "maybe something here"]
Overall differences using Andersen's LR test: no difference (p = 0.645)
However: 6 questions were excluded, as not all thresholds were used by both groups

40
Rasch models
◦ Give an alternative analysis approach for ordinal and binary scales
◦ Less "bodging" of assumptions!
◦ Give information on questions as well as respondents
◦ Are the 1-parameter case of item response theory
Rasch models could potentially be used in PROM analysis
◦ They have potential applications in the validation and construction of new PROMs

41
When is it a good fit?
◦ Still working on model fit statistics
Then assess person-fit statistics
◦ Does it matter at all?
How do you compare different populations?
◦ Is a calibration population the best way to go?
◦ How can you find a clinically meaningful change?
How does item information affect the analysis?
◦ Is it useful?!
Thanks for listening!
