2 Measurement Error
- Whatever measurement we might make with regard to some psychological construct, we do so with some amount of error
- Any observed score for an individual is their true score with error added in
- There are different types of "error", but here we are concerned with a measure's inability to capture the true response for an individual
- Observed score = True score + Error of measurement
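The true-score model above can be sketched with a small simulation; the trait mean of 100 and the error spread are hypothetical numbers chosen only for illustration:

```python
import random
import statistics

random.seed(1)

# Classical test theory: each observed score is a true score plus random error.
true_scores = [random.gauss(100, 15) for _ in range(1000)]  # hypothetical trait levels
errors = [random.gauss(0, 5) for _ in range(1000)]          # measurement error, mean zero
observed = [t + e for t, e in zip(true_scores, errors)]

# Error averages to ~0, so observed scores are unbiased, but they are noisier
# than the true scores: var(Observed) = var(True) + var(Error).
print(statistics.mean(observed))   # close to 100
print(statistics.stdev(observed))  # larger than the true-score SD of 15
```

The extra spread in the observed scores is exactly the measurement error the rest of these slides are about.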
3 Reliability
- Reliability refers to a measure's ability to capture an individual's true score, i.e. to distinguish one person from another accurately
- While a reliable measure will be consistent, consistency can actually be seen as a by-product of reliability; in a case of perfect consistency (everyone scores the same and gets the same score repeatedly), reliability coefficients could not be calculated
  - No variance/covariance to give a correlation
- The error in our analyses is due to individual differences, but also to the measure not being perfectly reliable
4 Reliability
Criteria of reliability:
- Test-retest reliability
- Issues
- Test components (internal consistency)

Test-retest reliability
- Consistency of measurement for individuals over time
- Do they score similarly, e.g., today and 6 months from now?

Issues
- Memory
  - If too close in time, the correlation between scores is due to memory of item responses rather than the true score being captured
- Chance covariation
  - Any two variables will always have a non-zero correlation
- Reliability is not constant across subsets of a population
  - General IQ scores: good reliability
  - IQ scores for college students: less reliable
  - Restriction of range: fewer individual differences
5 Internal Consistency
- We can get a sort of average correlation among items to assess the reliability of some measure1
- As one would most likely intuitively assume, having more measures of something is better than having few
- Having more items which correlate with one another will increase the test's reliability

1. Cronbach's alpha is a function of the number of items and the average correlation among them
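The footnote's relationship can be written out. For standardized items, Cronbach's alpha is alpha = k·r̄ / (1 + (k − 1)·r̄), where k is the number of items and r̄ the average inter-item correlation; the r̄ = 0.3 below is an arbitrary illustrative value:

```python
# Standardized Cronbach's alpha from the number of items (k) and the
# average inter-item correlation (r_bar).
def cronbach_alpha(k: int, r_bar: float) -> float:
    return k * r_bar / (1 + (k - 1) * r_bar)

# Holding the average correlation fixed, adding items raises reliability.
for k in (5, 10, 20):
    print(k, round(cronbach_alpha(k, 0.3), 2))  # 0.68, 0.81, 0.90
```

This is why a 20-item scale of modestly correlated items can be quite reliable even though no single item is.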
6 What's good reliability?
- While we have conventions, it really kind of depends
- As mentioned, the reliability of a measure may differ for different groups of people
- What we may need to do is compare reliability to those measures already in place and deemed 'good', as well as get interval estimates to convey the uncertainty in our reliability estimate
- Note also that reliability estimates are biased upward and so are a bit optimistic
- Also, many of our techniques do not take the reliability of our measures into account, and poor reliability can result in lower statistical power, i.e. an increase in Type II error
  - Though technically, increasing reliability can potentially also lower power1

1. We don't really want to go there, but you can see the related notes on the 5710 page. It's at the end of the 'Experimental design' notes.
7 Replication and Reliability
- While reliability implies replicability, assessing reliability does not provide a probability of replication
- Note also that statistical significance is not a measure of reliability or replicability1
- Replication is perhaps not conducted as much as it should be in psychology, for a number of reasons
  - Practical concerns, lack of publishing outlets, etc.
- Furthermore, knowing our estimates are biased and variable themselves, we might even think that in many cases we would not expect consistent research findings
- In psychology, many people spend a lot of time debating back and forth about the merits of some theory, citing cases where it did or did not replicate
  - However, the lack of replication could be due to low power, low reliability, problem data, incorrectly carrying out the experiment, etc.
  - In other words, it didn't replicate because of the methodology, not because the theory was wrong

1. Often you may see people use the term 'statistically reliable' to mean statistically significant. Please don't. It is grossly misleading terminology and, taken literally, usually wrong.
8 Factors affecting the utility of replications
"You can't step in the same river twice!" (Heraclitus1)

When
- Later replications are not providing as much information; however, they can contribute greatly to the overall assessment of an effect
  - Meta-analysis

How
- There is no perfect replication (different people involved, the time it takes to conduct, etc.)
- Doing an 'exact' replication gives us more confidence in the original finding (should it hold), but may not offer much in the way of generalization
  - Example: doing a gender difference study at UNT over and over. Does it work for non-college folk? People outside of Texas?

1. His student Cratylus said that you couldn't step in the river even once.
9 Factors affecting the utility of replications
By whom
- It is well known that those with a vested interest in some idea tend to find confirming evidence more often than those who don't
- Replications by others are still being done by those with an interest in that research topic, and so may have a 'precorrelation' inherent in the attempt
  - Direct: correlation of attributes of the persons involved
  - Indirect: correlation of the data to be obtained
- Gist: we can't have truly independent replication attempts, but we must strive to minimize bias
  - The more independent replication attempts are, the more informative they will be
10 Validity
- Validity refers to the question of whether our measurements are actually hitting on the construct we think they are
- While we can obtain specific statistics for reliability (even different types), validity is more of a global assessment based on the evidence available
- We can have reliable measurements that are invalid
  - Classic example: a scale which is consistent and able to distinguish one person from the next, but is actually off by 5 pounds
11 Validity Criteria in Psychological Testing
- Content validity
- Criterion validity
  - Concurrent
  - Predictive
- Construct-related validity
  - Convergent
  - Discriminant

Content validity
- Items represent the kinds of material (or content areas) they are supposed to represent
- Are the questions worth a flip, in the sense that they cover all domains of a given construct?
  - E.g. job satisfaction = salary, relationship w/ boss, relationship w/ coworkers, etc.
12 Validity Criteria in Psychological Testing
Criterion validity
- The degree to which the measure correlates with various outcomes
  - Does some new personality measure correlate with the Big 5?
- Concurrent
  - Criterion is in the present
  - A measure of ADHD and current scholastic behavioral problems
- Predictive
  - Criterion is in the future
  - SAT and college GPA
13 Validity Criteria in Psychological Testing
Construct-related validity
- How much is it an actual measure of the construct of interest?
- Convergent
  - Correlates well with other measures of the construct
  - A depression scale correlates well with other depression scales
- Discriminant
  - Is distinguished from related but distinct constructs
  - Depression scale != stress scale
14 Validity Criteria in Experimentation
- Statistical conclusion validity
  - Is there a causal relationship between X and Y?
  - Correlation is our starting point (i.e. correlation isn't causation, but it does lead to it)
  - Related to this is the question of whether the study was sufficiently sensitive to pick up on the correlation
- Internal validity
  - Has the study been conducted so as to rule out other effects which were controllable?
  - Poor instruments, experimenter bias
- External validity
  - Will the relationship be seen in other settings?
- Construct validity
  - Same concerns as before
  - Ex: Is reaction time an appropriate measure of learning?
15 Summary
- Reliability and validity are key concerns in psychological research
- Part of the problem in psychology is the lack of reliable measures of the things we are interested in1
- Assuming that they are valid to begin with, we must always press for more reliable measures if we are to progress scientifically
- This means letting go of supposed 'standards' when they are no longer as useful, and looking for ways to improve current ones

1. Always, always, always find the reliability estimates for the measure you are about to use. The information is readily available in original and related articles (always go to the original though), the Mental Measurements Yearbook, etc. At least give yourself a chance to do a decent study by using reliable measures.