Measurement Error: Whatever measurement we might make with regard to some psychological construct, we do so with some amount of error. Any observed score for an individual is their true score with error added in. There are different types of “error”, but here we are concerned with a measure’s inability to capture the true response for an individual. Observed score = True score + Error of measurement.
Reliability: Reliability refers to a measure’s ability to capture an individual’s true score, i.e. to distinguish accurately one person from another. While a reliable measure will be consistent, consistency can be seen as a by-product of reliability. In a case of perfect consistency (everyone scores the same and gets the same score repeatedly), reliability coefficients could not be calculated, because there would be no variance or covariance from which to compute a correlation. The error in our analyses is due to individual differences, but also to the measure not being perfectly reliable.
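The decomposition Observed = True + Error, and the point about perfect consistency, can be sketched with simulated scores. This is a minimal illustration, not part of the original slides: the true-score SD of 15, the error SD of 5, and the helper name `pearson` are all assumptions, and reliability is computed here in the classical test theory sense as the share of observed-score variance that is true-score variance.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation; returns None when either variable has no variance."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # no variance, so no correlation can be computed
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 5000

# Synthetic data: assumed true-score SD of 15 and error SD of 5
true_scores = [rng.gauss(100, 15) for _ in range(n)]
observed = [t + rng.gauss(0, 5) for t in true_scores]  # observed = true + error

# Share of observed variance that is true-score variance
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)

# Perfect consistency: everyone gets the same score on both occasions
constant = [100.0] * n
no_variance_result = pearson(constant, constant)
```

Under these assumed SDs the ratio should land near 225 / 250, while the constant-score case returns `None` because there is nothing to correlate.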
Reliability: The criteria of reliability are test-retest and test components (internal consistency). Test-retest reliability is the consistency of measurement for individuals over time: they score similarly, e.g., today and 6 months from now. Issues include memory (if the two administrations are too close in time, the correlation between scores may reflect memory of item responses rather than the true score captured) and chance covariation (in a sample, any two variables will almost always show some non-zero correlation). Reliability is also not constant across subsets of a population. General IQ scores show good reliability, but IQ scores for college students are less reliable because of restriction of range: there are fewer individual differences.
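A rough sketch of test-retest reliability and the restriction-of-range issue, using synthetic scores rather than real IQ data. The SDs, the cutoff of 115 used to mimic a selected subgroup, and all variable names are assumptions made for illustration only.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 5000

# Synthetic population: same true score, independent error at each occasion
true_scores = [rng.gauss(100, 15) for _ in range(n)]
time1 = [t + rng.gauss(0, 5) for t in true_scores]
time2 = [t + rng.gauss(0, 5) for t in true_scores]

# Test-retest reliability in the full population
r_full = pearson(time1, time2)

# Restricted subgroup: only people scoring above a hypothetical cutoff at time 1
selected = [(a, b) for a, b in zip(time1, time2) if a > 115]
r_restricted = pearson([a for a, _ in selected], [b for _, b in selected])
```

The measure and its error are identical in both cases; only the spread of individual differences changes, and the test-retest correlation in the selected subgroup comes out lower than in the full population.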
Internal Consistency: We can get a sort of average correlation among items to assess the reliability of a measure. As one would intuitively assume, having more measures of something is better than having few: adding items that correlate with one another will increase the test’s reliability.
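One common way to formalize this is the standardized form of coefficient alpha, which depends only on the number of items k and the mean inter-item correlation. The slides do not name a specific statistic, so treating "a sort of average correlation" as standardized alpha is an assumption here, as are the function name and the example mean correlation of 0.3.

```python
def standardized_alpha(k, mean_r):
    """Standardized coefficient alpha for k items with mean inter-item correlation mean_r."""
    if k < 2:
        raise ValueError("need at least two items")
    return k * mean_r / (1 + (k - 1) * mean_r)

# Same assumed mean inter-item correlation (0.3), increasing test length
alpha_by_length = {k: round(standardized_alpha(k, 0.3), 3) for k in (2, 5, 10, 20)}
print(alpha_by_length)  # {2: 0.462, 5: 0.682, 10: 0.811, 20: 0.896}
```

With the average correlation held fixed, reliability rises as items are added, with diminishing returns at longer test lengths.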
What’s good reliability? While we have conventions, it depends on the context. As mentioned, the reliability of a measure may differ across groups of people. What we may need to do is compare its reliability to that of measures already in place and deemed ‘good’, and obtain interval estimates to assess the uncertainty in our reliability estimate. Note also that reliability estimates are biased upward and so are a bit optimistic. Also, many of our techniques do not take the reliability of our measures into account, and poor reliability can result in lower statistical power, i.e. an increase in type II error, though technically increasing reliability can potentially also lower power.
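For a reliability estimate that is a Pearson correlation (such as a test-retest coefficient), one simple interval estimate uses the Fisher z transformation. This is only a sketch under the usual bivariate-normal assumption; it does not apply directly to other coefficients such as alpha, and the function name and example values (r = 0.80, n = 30 and n = 300) are hypothetical.

```python
import math

def fisher_interval(r, n, z_crit=1.96):
    """Approximate 95% interval for a Pearson correlation via the Fisher z transform."""
    if n <= 3 or not -1 < r < 1:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r)                         # transform r to the z scale
    half_width = z_crit / math.sqrt(n - 3)    # standard error is 1 / sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Same point estimate, different sample sizes
small_sample = fisher_interval(0.80, 30)
large_sample = fisher_interval(0.80, 300)
```

The same estimate of 0.80 is far less certain with 30 people than with 300, which is why the point estimate alone can mislead when comparing measures.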
Replication and Reliability: While reliability implies replicability, assessing reliability does not provide a probability of replication. Note also that statistical significance is not a measure of reliability or replicability. Replication is perhaps not conducted as much as it should be in psychology, for a number of reasons: practical concerns, lack of publishing outlets, etc. Furthermore, knowing that our estimates are themselves biased and variable, we might even think that in many cases we should not expect consistent research findings. In psychology, many people spend a lot of time debating the merits of some theory, citing cases where it did or did not replicate. However, the lack of replication could be due to low power, low reliability, problem data, incorrectly carrying out the experiment, etc. In other words, the finding may have failed to replicate because of methodology, not because the theory was wrong.
Factors affecting the utility of replications: “You can’t step in the same river twice!” (Heraclitus). When: later replications do not provide as much new information as earlier ones; however, they can contribute greatly to the overall assessment of an effect (meta-analysis). How: there is no perfect replication (different people are involved, the time it takes to conduct differs, etc.). Doing an ‘exact’ replication gives us more confidence in the original finding (should it hold), but may not offer much in the way of generalization. Example: doing a gender difference study at UNT over and over. Does it work for non-college folk? People outside of Texas?
Factors affecting the utility of replications: By whom: it is well known that those with a vested interest in some idea tend to find confirming evidence more often than those without one. Replications by others are still being done by people with an interest in that research topic, and so may have a ‘precorrelation’ inherent in their attempt. Direct: correlation of attributes of the persons involved. Indirect: correlation of the data to be obtained. The gist: we can’t have truly independent replication attempts, but must strive to minimize bias. The more independent replication attempts are, the more informative they will be.
Validity: Validity refers to the question of whether our measurements are actually hitting on the construct we think they are. While we can obtain specific statistics for reliability (even for different types of it), validity is more of a global assessment based on the evidence available. We can have reliable measurements that are invalid. Classic example: the scale that is consistent and able to distinguish one person from the next, but is actually off by 5 pounds.
Validity Criteria in Psychological Testing: content validity; criterion validity (concurrent, predictive); construct-related validity (convergent, discriminant). Content validity: the items represent the kinds of material (or content areas) they are supposed to represent. Are the questions worth a flip, in the sense that they cover all domains of a given construct? E.g. job satisfaction = salary, relationship with boss, relationship with coworkers, etc.
Validity Criteria in Psychological Testing: Criterion validity: the degree to which the measure correlates with various outcomes. Does some new personality measure correlate with the Big 5? Concurrent: the criterion is in the present, e.g. a measure of ADHD and current scholastic behavioral problems. Predictive: the criterion is in the future, e.g. SAT and college GPA.
Validity Criteria in Psychological Testing: Construct-related validity: how much is it an actual measure of the construct of interest? Convergent: the measure correlates well with other measures of the construct, e.g. a depression scale correlates well with other depression scales. Discriminant: the measure is distinguished from related but distinct constructs, e.g. a depression scale should not behave like a stress scale.
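A minimal sketch of how convergent and discriminant evidence might be checked with correlations. Everything here is synthetic and hypothetical: the scale names, the assumed latent correlation of about 0.4 between depression and stress, and the error SD of 0.5 are illustrative choices, not results from any real instrument.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2)  # fixed seed so the sketch is repeatable
n = 2000

# Synthetic latent constructs: stress is related to depression but distinct from it
depression = [rng.gauss(0, 1) for _ in range(n)]
stress = [0.4 * d + (1 - 0.4 ** 2) ** 0.5 * rng.gauss(0, 1) for d in depression]

def measure(latent, error_sd=0.5):
    """Hypothetical scale: the latent value plus independent measurement error."""
    return [v + rng.gauss(0, error_sd) for v in latent]

dep_scale_new = measure(depression)   # the scale being validated
dep_scale_old = measure(depression)   # an established depression scale
stress_scale = measure(stress)        # a scale for the related construct

convergent_r = pearson(dep_scale_new, dep_scale_old)
discriminant_r = pearson(dep_scale_new, stress_scale)
```

The pattern to look for is a clearly higher correlation with the other depression scale than with the stress scale; how large the gap needs to be remains a judgment based on the evidence available.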
Validity Criteria in Experimentation: Statistical conclusion validity: is there a relationship between X and Y that could support a causal claim? Correlation is our starting point (i.e. correlation isn’t causation, but it is a prerequisite for it). Related to this is the question of whether the study was sufficiently sensitive to pick up on the correlation. Internal validity: has the study been conducted so as to rule out other effects which were controllable, such as poor instruments or experimenter bias? External validity: will the relationship be seen in other settings? Construct validity: same concerns as before. Example: is reaction time an appropriate measure of learning?
Summary: Reliability and validity are key concerns in psychological research. Part of the problem in psychology is the lack of reliable measures of the things we are interested in. Assuming that our measures are valid to begin with, we must always press for more reliable ones if we are to progress scientifically. This means letting go of supposed ‘standards’ when they are no longer as useful, and looking for ways to improve current ones.