1
Psychometric Properties of Quantitative Measures
In this lecture I will be talking about the properties of an assessment to consider when you are selecting a measure for a research study or for a client. My goal is that at the end of this presentation you will be able to understand a journal article that reviews the quality of an assessment. The information in this podcast is covered in Chapter 12 of your textbook.
Reliability and Validity of Test Measures
2
Deciding what test to use
Psychometric Properties
Ease of Administration
Training Needed
Cost
Appropriateness for Your Population
Of course, when you are selecting a measure there are many things to think about besides psychometric properties such as reliability and validity. Say you have a client who is six months old and very cranky; she is probably not going to cooperate with a two-hour standardized assessment. You will need to use pragmatic reasoning to consider how many supplies are needed to give the test, how much time you have, what kind of training will be needed, and the cost of the test materials. Finally, you will want to make sure that the assessment was designed for your particular client. What population was targeted to obtain norming data? What construct is the test designed to measure, and will you be using it for that purpose? For example, if you want to measure performance skills, are you using a self-report tool that was designed to be an occupational profile? You don't want to use the OCAIRS to assess ADLs in clients who are unreliable self-reporters; perhaps the FIM would be more appropriate in that case. Dr. Murray will be covering more on these types of considerations in Clinical Eval. This lecture will primarily address reliability and validity of quantitative test measures. We will talk about qualitative measures another time.
3
Definitions
Reliability: consistent, reproducible, dependable
Validity: measures what it says it measures
Reliability means consistent responses under given conditions; validity means truthfulness, measuring height rather than weight. Together these are sometimes referred to as psychometric properties.
4
Reliability
Measurement Error
Reliability Coefficients
Types of Reliability: Test-Retest, Rater (Intra and Inter), Internal Consistency
There are different types of reliability.
5
Measurement Error
Observed score = true score + error
Measurement error = observed score – true score
Reliability estimates measurement error
But first we need to talk about measurement error, because it contributes to poor reliability (and to validity to some degree). Let's say you want to measure the height of your son. You stand him up against the wall with a book on his head and make a mark on the wall. But suppose he wiggles a little bit, or slouches, or maybe you tilt the book a little; have you measured his true height? No, probably not. You have an observed score, which is what you write down in your overly involved scrapbook, right? But really what you have is his true height plus any error that was made in the measurement. So when we conduct reliability tests of a measure, we are actually exploring how much error in measurement typically occurs with that particular test.
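To make this concrete, here is a minimal sketch (not part of the lecture; all numbers are invented) that simulates observed scores as a true score plus random error:

```python
import numpy as np

rng = np.random.default_rng(0)

true_height = 112.0                    # the child's "true" height in cm (made up)
errors = rng.normal(0, 0.5, size=10)   # random measurement error, SD = 0.5 cm
observed = true_height + errors        # what actually gets written down

print(observed.round(1))   # ten slightly different readings of the same height
print(observed.mean())     # averaging several trials cancels much of the random error
```

This is also why taking multiple grip-strength readings and averaging them helps: the random component tends to cancel out.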
6
Sources of Measurement Error
Systematic: consistently wrong by the same amount
Random: chance; the rater, the measuring instrument, variability in what you are measuring
There are two types of errors, systematic and random. My scale at home is an example of systematic error: it always tells me I am 5 pounds heavier than I am. Really this is more of a problem with validity, isn't it? Because it is reliable; it's just reliably wrong. Random error is more troublesome to reliability and can come from odd places: client fatigue, therapist inattention, or lack of training. Maybe the test itself is too vague and people interpret the questions differently. I've had problems with dynamometers that need re-calibration all the time, for whatever reason. This is why many therapists take multiple measures of grip strength: to get an average that adjusts for incorrect extremes. Also, the more variable the outcome you are trying to measure, the more difficult it is to establish the reliability of a measure (if my weight goes up and down, it is harder to know whether the scale is off or whether it is just my weight changing).
7
Reliability Coefficients
Reliability coefficient = true score variance / (true score variance + error variance)
As error decreases, the coefficient increases
The coefficient ranges from .00 to 1.00:
< .50 poor
.50 to .75 moderate
> .75 good
Reliability coefficients are how researchers report on the reliability of an instrument they have developed. All reliability coefficients have the same basic equation: true score variance (how much my real weight fluctuates) divided by how much my real weight fluctuates plus the amount of error that typically occurs when I weigh myself. So you can see that if little error occurs, I should have a number that is close to 1. When you are reading reviews of a test instrument, how do you know what to look for? Estimates of what counts as good or poor reliability depend on what you are measuring: 5 degrees is an acceptable error for shoulder ROM but not in finger joints. How precise does the measure need to be? Is it used for description or for decision making/diagnosis? These cutoffs are very general estimates.
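As an illustration of the coefficient formula, here is a hedged sketch; the true scores and errors are simulated, so the result only shows how the ratio behaves:

```python
import numpy as np

rng = np.random.default_rng(1)

true_scores = rng.normal(70, 10, size=200)   # true weights vary across people
error = rng.normal(0, 3, size=200)           # random measurement error
observed = true_scores + error

reliability = true_scores.var() / (true_scores.var() + error.var())
print(round(reliability, 2))   # approaches 1.00 as error variance shrinks
```

With an error SD of 3 against a true-score SD of 10, the coefficient lands around .92, in the "good" range above.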
8
Types of Reliability: Test-Retest
Get the same results every time you use the test
Intervals between testing: long enough to avoid fatigue or remembering the answers, but not so long that natural maturation occurs
Intraclass correlation coefficient (ICC) or Pearson r
You need to consider the stability of the variable you are measuring and your purpose (my pain level will change quickly; my IQ will not). You can test a more unstable variable with a shorter interval; if you want to establish the ability of a test to measure IQ over time, wait a year. Expect correlations around .6 for a longer interval and higher correlations for shorter intervals.
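A quick illustration of the test-retest idea, using made-up scores for eight clients. A full ICC requires an ANOVA-style variance decomposition (packages such as pingouin provide one), so this sketch shows only the simpler Pearson r:

```python
import numpy as np

# Scores for the same 8 clients on two occasions (invented numbers)
time1 = np.array([12, 15, 11, 18, 14, 16, 13, 17])
time2 = np.array([13, 14, 11, 19, 15, 16, 12, 18])

r = np.corrcoef(time1, time2)[0, 1]
print(round(r, 2))   # a high r suggests scores were stable across the interval
```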
9
Another Way to Test-Retest
Alternate forms: different versions covering the same content (SAT, GRE)
The Pearson r correlation coefficient is used (.8 or higher)
We want our standardized tests to be very consistent. You don't want to get a harder certification test than someone else! So correlation studies are conducted on different versions of the same test.
10
Types of Reliability: Internal Consistency
Are all questions measuring the same thing?
Split-half: correlation of two halves of the same test (odds and evens), stepped up with the Spearman-Brown prophecy formula
Cronbach's alpha: essentially an average of all the possible split-half reliabilities; can be used on multiple-choice items. When used on dichotomous scores, it is called the Kuder-Richardson 20 (KR-20).
Internal consistency measures how similar test items are to each other. For example, a test on history shouldn't be asking math questions. This type of reliability is concerned only with whether the items measure the same thing, not with specifying what that thing is; validity would be concerned with whether the test measures what it says it measures. The Spearman-Brown tells you whether the two halves are measuring the same thing; Cronbach's alpha says that all the items on the test are measuring the same thing. Less commonly, you may see the KR-20 in journal articles when the assessment has either-or, yes-or-no items.
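Both statistics are simple to compute. The following sketch (with invented example scores) implements Cronbach's alpha and the Spearman-Brown step-up, assuming item scores are arranged as a people-by-items array:

```python
import numpy as np

def cronbach_alpha(items):
    """items: (n_people, n_items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()    # sum of individual item variances
    total_var = items.sum(axis=1).var(ddof=1)      # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

def spearman_brown(r_half):
    """Step a split-half correlation up to full-test length."""
    return 2 * r_half / (1 + r_half)

# 5 people answering 4 items on a 1-5 scale (invented data)
scores = np.array([[4, 5, 4, 5],
                   [2, 1, 2, 2],
                   [3, 3, 4, 3],
                   [5, 4, 5, 5],
                   [1, 2, 1, 1]])
print(round(cronbach_alpha(scores), 2))   # ~.97: items track each other closely
print(round(spearman_brown(0.70), 2))     # a .70 half-test correlation becomes .82
```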
11
Types of Reliability: Raters
Intra-rater: stability of one rater across trials
Inter-rater: consistency between two raters
Use the ICC for both types
In both types, we expect to get the same results each time under the same circumstances. Intra-rater: are you giving the FIM the same way each time? Inter-rater: are you and another OT giving the FIM the same way? The most accurate statistical test for rater reliability is the intraclass correlation coefficient, although you may see kappa, Spearman rho, and Pearson r in some cases.
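Here is a small sketch of Cohen's kappa, one of the rater statistics mentioned above, using invented ratings from two hypothetical OTs. Kappa corrects raw percent agreement for the agreement expected by chance:

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Agreement between two raters on categorical scores, corrected for chance."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    p_obs = np.mean(r1 == r2)                  # raw percent agreement
    cats = np.union1d(r1, r2)                  # every category either rater used
    p_chance = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in cats)
    return (p_obs - p_chance) / (1 - p_chance)

# Two OTs scoring the same 10 clients on a 3-level item (invented data)
ot1 = [1, 2, 3, 2, 1, 3, 2, 2, 1, 3]
ot2 = [1, 2, 3, 2, 2, 3, 2, 1, 1, 3]
print(round(cohens_kappa(ot1, ot2), 2))   # ~.70 here: good agreement beyond chance
```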
12
Validity
Validity vs. Reliability
Types of Validity
The other key psychometric construct related to test quality is its validity, or whether or not it is actually measuring what it says it is, and not something else.
13
Validity vs. Reliability
[Figure: four targets labeled "not valid, not reliable"; "valid(?), not reliable"; "not valid, reliable"; and "valid, reliable"]
Which one of these targets represents a reliable and valid test? The first target demonstrates a test that is neither valid nor reliable: the results are all over the place and all of them are wrong. The next one looks more valid (closer to the mark) but is still unreliable. The third target is like my scale at home: it is reliably wrong all the time. The last target is both valid and reliable. The point is that a test can be reliable but not valid, but it can't be valid if it is not reliable.
14
Generalizability
External validity: the test is valid if used with the intended population
The test is valid if used in the appropriate context and as directed for its given purpose
Generalizability, or external validity, is the ability to use a test with a certain population. You can't give a test made for Japanese college students to English-speaking third graders and expect a valid result. An example from the clinic: you can't use Waddell's signs of nonorganic back pain with fibromyalgia clients. Waddell's signs were developed to detect malingering. One item is axial loading: if I push down on your head, the idea is that this shouldn't increase your back pain. The problem is that we don't know that much about back pain! When you do this with a person who also has fibromyalgia (a CNS dysfunction), it does increase pain. So this is not a valid test with FM clients, although it will reliably increase pain.
15
Face Validity
Appears to test what it is supposed to measure
Weakest form; acceptable for ROM, length, observation of ADLs
An example of this is the first version of the Beck Depression Inventory: it certainly appeared to measure depression, but it also measured a response to chronic illness, so someone who was sick could score high on depression even if they weren't that depressed. Face validity is more useful for observation-based assessments.
16
Content Validity
Covers the entire range of the variable and reflects the relative importance of each part
Based on expert opinion; needs to be free of cultural bias
Test of function: 20 questions on brushing your teeth, 1 question each on mobility, bathing, dressing
VAS vs. McGill Pain Questionnaire
Content validity is concerned with the assessment of a single construct that has been operationally defined. Every aspect of the construct (such as ADL performance) should be given an appropriate amount of coverage, and nothing important should be left out. We don't want 20 questions on brushing your teeth and nothing on eating. Because content validity is often established by expert opinion of how well a test measures a construct (also defined by an expert), there is a high risk of cultural bias due to unexamined assumptions. For example, one of the questions used to test children applying for magnet schools in Shreveport was, "What is a casserole dish?" This assumes that intelligent people know what a casserole dish is and that most families have one, which is certainly not the case for all cultures. Another example would be testing a child on math word problems in a non-native language: the content you are actually testing, at least partially, is language fluency, not math. An example of not measuring the entire construct is the VAS vs. the McGill: the VAS asks only about pain intensity, whereas the McGill covers intensity, quality, location, and duration.
17
Criterion-related Validity
Target test compared to a gold standard
Concurrent: the target test is taken at the same time as another test with established validity
Predictive: examines whether the target test can predict a criterion variable
Criterion validity is obtained when the new test produces an outcome that correlates with evidence of the construct under investigation. For example, if I develop a test of ADL function, I expect that the results of the test will correlate with evidence of ADL function in my client (duh). This is the most objective type of validity, and there are two types: concurrent, where my test scores correlate with an established gold standard (such as the FIM), and predictive, where my test scores predict ADL performance in real life. An example of concurrent validity is an autism screen compared to the longer version. Examples of predictive validity are using the GRE to predict GPA (it doesn't), or using high scores on a test of racial bias to predict discrimination in referrals to healthcare services.
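As a rough illustration of predictive validity, this sketch (all numbers invented) correlates a test score with a later criterion and fits a simple prediction line:

```python
import numpy as np

test = np.array([150, 160, 145, 170, 155, 165, 148, 158])   # admission test scores
gpa = np.array([3.1, 3.4, 2.9, 3.6, 3.2, 3.0, 3.0, 3.3])    # criterion measured later

r = np.corrcoef(test, gpa)[0, 1]
slope, intercept = np.polyfit(test, gpa, 1)     # least-squares prediction line
print(round(r, 2))                              # the validity coefficient
print(round(slope * 162 + intercept, 2))        # predicted GPA for a score of 162
```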
18
Construct Validity
Ability of a test to measure a construct
Based on a theoretical framework
What would you include in a test of "wellness"?
Construct validity refers to how accurately your test reflects your specific definition of what you are measuring. If an OT creates a test that measures quality of life, how might it be different from a test of the same construct designed from a medical model? What you select to put on a test reflects your philosophical viewpoint of that variable or construct; in this case, sense of well-being vs. absence of disease.
19
Ways to establish construct validity
Known groups
Convergent comparison
Divergent comparison
Factor analysis
There are different ways to establish construct validity. Known groups: test people known to have the trait (to establish validity for a test of racial bias, test members of the KKK). Convergent: the test positively correlates with a test of the same construct (the Beck and the Hamilton). Divergent: the test negatively correlates with a test of a different construct (egalitarianism and racism). A factor analysis determines which factors are parts of a construct (ADL: bathing, dressing, grooming, etc.). The test items within a sub-group should be positively correlated with each other but not with other sub-groups (or factors); see the sketch below.
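For the factor-analysis idea, here is a minimal sketch assuming scikit-learn is available. The item data are simulated so that two groups of items each share a latent factor, which the analysis should recover:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n = 300
# Three "bathing" items share one latent factor; three "dressing" items share another
bathing = rng.normal(size=(n, 1)) + rng.normal(0, 0.3, size=(n, 3))
dressing = rng.normal(size=(n, 1)) + rng.normal(0, 0.3, size=(n, 3))
X = np.hstack([bathing, dressing])   # (people x items) score matrix

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)
print(fa.components_.round(2))   # loadings: each item group loads on its own factor
```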
20
Remember
The reliability and validity of a test measurement are not the same thing as the reliability and validity of a research design.
21
Where do I find this information?
lsustudent.pbworks.com/Psy Assessments
Journals
Books
Test Manuals