Last week’s question What is a DEVIATION IQ? In your answer, explain how the deviation IQ differs from the IQ as defined by Stern. What is the advantage of a deviation IQ?

The deviation IQ Develop reliable measures, which must have a normal distribution and be consistent. Gather normative data for each half-decade of adult life. Let X be a person’s score on the original scale. Standardise X by subtracting the mean for the age-group and dividing by the SD. The result is the standard score z.

The deviation IQ Standard scores z can easily be converted to IQs by multiplying them by 15 and adding 100. So if z = 1, the deviation IQ is 100 + 1 × 15 = 115.
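The two steps (standardise within the age-group, then rescale to mean 100 and SD 15) can be sketched in a few lines of Python. The function names and the age-group norms used in the example (mean 50, SD 10) are hypothetical, chosen only so that z comes out as 1.

```python
def z_score(raw, group_mean, group_sd):
    """Standardise a raw score against the norms for the person's age-group."""
    return (raw - group_mean) / group_sd


def deviation_iq(raw, group_mean, group_sd, iq_mean=100, iq_sd=15):
    """Rescale the standard score to a distribution with mean 100 and SD 15."""
    return iq_mean + iq_sd * z_score(raw, group_mean, group_sd)


# Hypothetical norms: age-group mean 50, SD 10. A raw score of 60 gives z = 1.
print(deviation_iq(raw=60, group_mean=50, group_sd=10))  # 115.0
```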

Explanation of the deviation IQ
From standard scores z, we can transform to another normal distribution with whatever mean and standard deviation we like. We want IQ to have a mean of 100 and an SD of 15.

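Using z X (IQ) = mean + SD × z = 100 + 15z, so any statement about X can be restated as a statement about z.
The probability that X (IQ) lies between approximately 70 and 130 is ALSO the probability that z lies between −1.96 and +1.96, which is 0.95. The probability that X (IQ) is at least approximately 130 is ALSO the probability that z is at least +1.96, which is 0.025.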

The deviation IQ A deviation IQ locates an individual in a normal distribution with a mean of 100 and an SD of 15. An IQ of 130 means that the person is two standard deviations above the mean. He or she has scored at roughly the 97.5th percentile – in about the top 2.5% OF THEIR AGE-GROUP.
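These probabilities can be checked with the normal distribution in Python's standard library. This is only a sketch; it shows that the slide figures are convenient roundings: exactly two SDs above the mean corresponds to about the 97.7th percentile, while the 97.5th percentile (z = 1.96) corresponds to an IQ of about 129.4.

```python
from statistics import NormalDist

# Deviation IQ scale: normal distribution with mean 100 and SD 15.
iq = NormalDist(mu=100, sigma=15)

p_between_70_and_130 = iq.cdf(130) - iq.cdf(70)  # about 0.954
p_at_least_130 = 1 - iq.cdf(130)                 # about 0.023
iq_at_975th_percentile = iq.inv_cdf(0.975)       # about 129.4 (z = 1.96)

print(round(p_between_70_and_130, 3))
print(round(p_at_least_130, 3))
print(round(iq_at_975th_percentile, 1))
```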

Two important differences
There are two important differences between the deviation IQ and IQ as Stern originally defined it. First, the reference population comprises the scores of people in the age band within which the participant’s age falls. Second, the deviation IQ isn’t a QUOTIENT at all: the notion of mental age has been abandoned, and we merely have a measure that expresses the person’s percentile standing within a reference population of similar age to the participant.

Short question What, in the context of mental testing, is meant by the RELIABILITY and VALIDITY of a test? Can a test be valid without being reliable? Can a test be reliable without being valid? Describe two approaches to the measurement of reliability, explaining the advantages and disadvantages of each.

Answer Short definitions of reliability and validity.
A test cannot be valid without being reliable – otherwise we have an elastic tape measure, a rubber ruler. But reliability does not ensure validity. Describe, say, the test-retest and parallel forms approaches. Take time to describe their advantages and disadvantages.

Lecture 3 RELIABILITY AND VALIDITY

Methods of determining reliability
Test-retest. Parallel forms (or equivalent forms). Split-half.

Reliability A test is said to be reliable if, given that the scores display the necessary variability and DISTRIBUTION SHAPE, individuals retain their relative standing in the distribution from occasion to occasion of testing, and when tested by different administrators. A reliable test thus gives CONSISTENT RESULTS, in that a child scores at similar PERCENTILES from occasion to occasion.

Caution It is insufficient merely to say that consistency ensures reliability. Children will score consistently either on a test that is too difficult (scoring zero each time) or one that is too easy (scoring full marks each time). SCORES ON THE TEST MUST HAVE A DISTRIBUTION THAT TRULY REFLECTS NATURAL VARIABILITY IN THE CHARACTERISTIC YOU ARE TRYING TO MEASURE.

Reliability … A reliable test produces CONSISTENT results.
By ‘consistency’, I mean that if, on the first occasion of testing, John scored at the 70th percentile, he would, if tested on other occasions, score at similar percentile levels. His scores when tested on subsequent occasions are indicated by the dashed lines at the 69th, the 68th, the 73rd and the 72nd percentiles. He’s always somewhere near the 70th percentile. This is a RELIABLE test.

Unreliability … An unreliable test gives INCONSISTENT results.
John scores at the 70th percentile on the first occasion. On subsequent occasions, however, he scores at the 20th, the 40th, the 45th, the 75th and the 60th percentiles (not necessarily in that order). A test showing this sort of inconsistency is an UNRELIABLE test.

Components of a test score

Good variance, bad variance
We need VARIANCE to achieve the right distribution and spread people’s scores. But there’s GOOD VARIANCE and BAD VARIANCE. Good variance is determined by variation in the TRUE COMPONENT of people’s scores. Bad variance is variation in the random or ERROR component.

Reliability and variance
When we determine test-retest, parallel forms or split-half reliability, we are estimating the proportion of the variance of the TOTAL SCORES that is variance in their TRUE COMPONENTS. The upper figure might represent a reliability of .8, the lower figure a reliability of .3.
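Read as a formula, reliability is true variance divided by total variance, where total variance is true variance plus error variance. The sketch below assumes that the true and error components are uncorrelated, so that their variances simply add. The variance figures are invented to reproduce the reliabilities of .8 and .3 mentioned above.

```python
def reliability(true_variance, error_variance):
    """Proportion of total score variance that is variance in the true components.

    Assumes true and error components are uncorrelated, so their variances add.
    """
    return true_variance / (true_variance + error_variance)


# Invented figures: mostly 'good' variance versus mostly 'bad' variance.
print(reliability(true_variance=144, error_variance=36))  # 0.8
print(reliability(true_variance=30, error_variance=70))   # 0.3
```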

Improving reliability
Four of the most important ways of improving reliability are: Uniformity of testing procedure. Consistency of scoring. Having a test of sufficient length. Item analysis.

WAIS Block design test The participant is presented with nine blocks (Kohs blocks) marked as in the upper left picture. The participant is asked to assemble the blocks so that their upper surfaces reproduce patterns such as that shown in the lower left picture. The patterns become increasingly complex.

Kohs Blocks test The Kohs Blocks test is widely used in clinical practice. The advantages of the test are that no knowledge is required and that norms are available. It also appears to tap abilities that are very vulnerable to brain injuries.

1. Uniformity of procedure
Undue variation in the manner in which the tester presents the participant with the items in the test will increase BAD VARIANCE. The tester must handle the materials smoothly, and follow the instructions in the test manual. The tester must adopt an appropriate professional manner. The tester must handle questions from the participant diplomatically, while at the same time giving no information away. This requires skill.

1. Procedure … The manual that accompanies the test contains detailed instructions on how the items are to be presented. In the Block Design test, the tester doesn’t simply scatter the blocks over the table, point to a pattern in the book and say: ‘Produce that pattern’. ‘In laying out the blocks for the subject to use, the examiner should make sure that a variety of surfaces face up, that only one out of the four blocks has the red/white side facing up, and only three when nine blocks are used’ (Manual, p.72). The tester, in fact, is given a DETAILED SCRIPT, with instructions for EVERY CONTINGENCY.

1. Procedure … At first, the tester actually assembles the blocks and the participant reproduces the pattern with another set of blocks. Later, the participant uses the blocks to reproduce a set of patterns in a booklet. ‘To prevent the subject from looking at the side of the block design instead of at the top, construct the model so that the subject is required to look down on it’. This precaution doesn’t always work, because, ‘Occasionally, a subject will try to duplicate the examiner’s model exactly, including the sides. If this occurs, tell the subject that only the top needs to be duplicated’.

2. Scoring There are very precise instructions for scoring the participant’s attempts. For example, the pattern must be in the same ORIENTATION as the one shown either by the model the experimenter has constructed (Design 1) or in the booklet containing Designs 2 to 9. The test is TIMED strictly with a stopwatch.

Reducing bad variance All these strictures are intended to ensure that any variation in participants’ scores arises from GENUINE INDIVIDUAL DIFFERENCES IN ABILITY, rather than differences in the manner in which the test is administered by different testers, or inconsistent methods of scoring.

TRAINING IS REQUIRED! To give someone a psychological test is by no means as easy as it sounds. The BPS is doing everything it can to ensure that those using psychometric tests have undergone THE CORRECT TRAINING. The normative data for the WAIS have been collected by following exactly the procedures described in the WAIS manual. DEVIATING EITHER FROM THE PRESCRIBED PROCEDURES OR SCORING WILL PRODUCE A SCORE THAT CANNOT BE INTERPRETED IN COMPARISON WITH THE NORMS.

3. Length of the test Last week, we saw that, other things being equal, a test with more items is more reliable than a test with fewer items. Suppose that a test consists of 30 items. Suppose also that the average correlation between pairs of these items is only .2. The individual items aren’t very reliable. But the TOTAL score on the test (the sum of the scores on all the items) will be MUCH MORE RELIABLE THAN .2.

The Spearman-Brown formula
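The Spearman-Brown formula gives the reliability of a total score on k items whose average inter-item correlation is r: reliability = kr / (1 + (k − 1)r). The number of items is 30. The average correlation between pairs of items is .2. Substituting in the formula, we find that the reliability of the TOTAL SCORE on all 30 items is (30 × .2) / (1 + 29 × .2) = 6 / 6.8 = .88.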
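The substitution can be reproduced with a short function. This is a minimal sketch of the standard Spearman-Brown formula; the function name is an illustrative choice. The loop shows how the reliability of the total score climbs as items are added, with the average inter-item correlation held at .2.

```python
def spearman_brown(r, k):
    """Reliability of a total score on k items with average inter-item correlation r."""
    return k * r / (1 + (k - 1) * r)


# The lecture's example: 30 items, average inter-item correlation .2.
print(round(spearman_brown(r=0.2, k=30), 2))  # 0.88

# Reliability rises with test length when r stays at .2.
for k in (1, 5, 10, 30, 60):
    print(k, round(spearman_brown(0.2, k), 2))
```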

Split-half reliability
We shall put the Spearman-Brown formula to work immediately. The split-half method of determining reliability underestimates the true reliability because, essentially, we are calculating the correlation between two tests, each of which is half the original length. Suppose that our split-half reliability estimate is .65. The Spearman-Brown formula can be used to improve the estimate.

Split-half reliability …
In split-half reliability, we have two ‘items’ (the two half-test totals). The correlation between them is .65. Substituting in the Spearman-Brown formula with two items, we find that the reliability estimate is now (2 × .65) / (1 + .65) = .79.
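The whole split-half procedure can be sketched as follows. The lecture does not say how the test is divided, so the odd/even split used here is an assumption, and the small response matrix is invented purely for illustration (1 = correct, 0 = incorrect).

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def step_up(r_half):
    """Spearman-Brown correction for two 'items' (the two half-test totals)."""
    return 2 * r_half / (1 + r_half)


def split_half(responses):
    """Correlate odd-item and even-item totals, then apply the correction."""
    odd_totals = [sum(row[0::2]) for row in responses]
    even_totals = [sum(row[1::2]) for row in responses]
    r_half = pearson(odd_totals, even_totals)
    return r_half, step_up(r_half)


# The lecture's figure: a half-test correlation of .65 steps up to .79.
print(round(step_up(0.65), 2))  # 0.79

# Invented responses: five candidates, six items.
RESPONSES = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
r_half, corrected = split_half(RESPONSES)
print(round(r_half, 2), round(corrected, 2))
```

The corrected estimate is always higher than a positive half-test correlation, which is the sense in which the uncorrected split-half figure underestimates reliability.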

4. Item analysis We apply this technique to the results of all our multiple-choice examinations. We rank the candidates according to their TOTAL SCORES into three ability bands: A, B and C, representing the top, middle and bottom total scores, respectively. We find the percentages of the candidates in bands A, B and C who answered the target item correctly.

Item analysis (continued)
In order of magnitude, the success rate on the target item for top group A should be highest. Next should come the success rate for group B. Finally, the weakest group should show the lowest success rate. For example, we might find that question 26 was correctly answered by 90% of those in A, by 50% of those in B and by 33% of those in C. The distribution on the right indicates either a clerical error or a miswording of the question.

Item analysis … The presence of a ‘rogue’ item in the multiple-choice test reduces its reliability, because the item does not correlate positively with the total score on the test, as it should do. Amending the item will increase the reliability of the multiple-choice test.
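The banding procedure can be sketched in Python. Several details here are assumptions rather than the procedure actually used for the examinations: the bands are equal thirds, ties in success rate are tolerated, and the candidate data are invented. The function and variable names are hypothetical.

```python
def band_success_rates(total_scores, item_correct):
    """Success rate on the target item in the top (A), middle (B) and bottom (C) bands.

    total_scores: each candidate's total score on the test.
    item_correct: 1 if that candidate answered the target item correctly, else 0.
    Needs at least three candidates; bands are (roughly) equal thirds.
    """
    ranked = sorted(zip(total_scores, item_correct), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    third = n // 3
    bands = {"A": ranked[:third], "B": ranked[third:n - third], "C": ranked[n - third:]}
    return {name: sum(c for _, c in members) / len(members) for name, members in bands.items()}


def is_rogue(rates):
    """Flag an item whose success rate does not fall from band A to B to C."""
    return not (rates["A"] >= rates["B"] >= rates["C"])


# Invented data: nine candidates with distinct total scores.
totals = [90, 80, 70, 60, 50, 40, 30, 20, 10]
good_item = [1, 1, 1, 1, 0, 1, 0, 0, 0]   # stronger candidates succeed more often
rogue_item = [0, 0, 1, 1, 1, 0, 1, 1, 1]  # weaker candidates succeed more often

print(is_rogue(band_success_rates(totals, good_item)))   # False
print(is_rogue(band_success_rates(totals, rogue_item)))  # True
```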

Validity A test is said to be VALID if it measures what it is supposed to measure.

Relationship between reliability and validity
Reliability is essential for validity. An elastic tape measure is a useless measuring instrument and cannot be valid. But RELIABILITY IS INSUFFICIENT FOR VALIDITY. A vocabulary test, for example, may be highly reliable, but it would have no validity as a measure of mathematical or musical ability.

Different kinds of validity
In contrast with reliability, THERE ARE MANY DIFFERENT KINDS OF VALIDITY. In his Dictionary of Psychology, Reber (1985) gives more than 25 definitions of validity! Some kinds of validity utilise the Pearson correlation; but others do not.

Approaches to validity
Face validity. Content validity. Predictive or criterion validity. Construct validity. There are many other kinds of validity, but these four are the most important and subsume most of the others.

1. Face validity A test has FACE VALIDITY if it APPEARS to measure what it is supposed to measure. The impression of face validity can be formed on the basis of a superficial inspection of the test. One person’s impression, however, may not agree with that of another observer, who might argue that some topics or items were under-represented.

Problems with face validity
Too subjective. Ignores the necessity for ADEQUATE SAMPLING of the DOMAIN OF CONTENT. A test with high face validity may signally fail to meet the criteria for more objective definitions of validity.

2. Content validity A test is said to have CONTENT VALIDITY if it can be shown to have sampled adequately from the CONTENT DOMAIN. Content validity, unlike face validity, is determined systematically and with as much objectivity as possible.

Establishing content validity
A PANEL OF EXPERTS is required. The panel must be properly briefed on the purpose of the test. The panel’s pool of expertise must cover the entire domain of the topic. The manufacturers of the test must use professional item-writers, skilled in the wording of questions. The panel must be made aware of the specifications that guided the item-writers. The nature of the mistakes that people make on the items must be subject to a thorough ERROR ANALYSIS. The result of such an error analysis may suggest a more sensitive range of questions about some topics.

3. Criterion or predictive validity
A test is said to have PREDICTIVE or CRITERION validity if performance on the test correlates substantially with performance on some criterion. If a test measures scholastic aptitude, test performance should correlate substantially with the actual school grades children achieve at a later point. A test supposedly measuring leadership qualities should correlate substantially with a person’s performance in situations requiring real leadership. SELF-REPORTS are one thing; BEHAVIOUR is another.

Predictive validity (continued)
At a large American university, students are tested at entry on a battery of aptitude tests: verbal, numerical and so on. After a year’s study, each student receives a mark for his or her academic performance, which is known as a GRADE POINT AVERAGE (GPA). The predictive validity of such an aptitude test is the Pearson correlation between performance on the test at matriculation and the GPA achieved a year later.

Predictive validity (continued)
Intelligence and aptitude tests do correlate positively and significantly with later school and university performance. At school level, the correlation tends to be quite high (between .5 and .6); whereas at university level, its value, though still statistically significant, tends to be substantially lower (between .2 and .3).

Threats to criterion or predictive validity
Criterion contamination. Restriction of the range of the predictor variable. Unreliability in either measure (leading to ATTENUATION).

1. Criterion contamination
The predictor variable must not have any input into the criterion variable, because that would automatically produce a positive association. Even if the aptitude score does not contribute to the GPA, staff making the academic assessments should be kept in ignorance of a student’s earlier score on the aptitude test. Test results should be treated as highly confidential; otherwise your research data can become contaminated by, for example, HALO EFFECTS.

2. Restriction of range Typically, aptitude scores have low criterion validity as predictors of GPA: r ≈ +0.3. But only the upper part of the aptitude range is found within the student population. Even if aptitude is strongly correlated with academic performance in the general population (a narrowly elliptical scatterplot), a correlation based on only a section of the scatterplot (the more circular cloud to the right of the vertical line) will be CONSIDERABLY LOWER.

Restriction of range … This is why intelligence tests are poor predictors of university performance. Students are selected from a narrow ability range. Other factors, such as efficiency, motivation, commitment, degree of organisation and health, play more important roles in determining degree results.
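A small simulation illustrates the effect. Every figure in it is an assumption made for illustration: a population correlation of .6 between aptitude and performance, 20,000 simulated people, and selection of roughly the top 20% on aptitude (z above 0.84). None of these values comes from real data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(rho=0.6, n=20000, cutoff=0.84, seed=1):
    """Return (correlation in the whole population, correlation among those selected)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        aptitude = rng.gauss(0, 1)
        # Performance shares a correlation of rho with aptitude in the population.
        performance = rho * aptitude + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        pairs.append((aptitude, performance))
    full = pearson(*zip(*pairs))
    # Only those above the aptitude cutoff are admitted.
    selected = [(a, p) for a, p in pairs if a > cutoff]
    restricted = pearson(*zip(*selected))
    return full, restricted


full_r, restricted_r = simulate()
print(round(full_r, 2), round(restricted_r, 2))
```

Under these assumptions the correlation within the selected group comes out well below the population value, even though nothing about the test itself has changed.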

3. Attenuation Should either or both variables be unreliable, the value of the Pearson correlation will be reduced. This is analogous to the ‘elasticity’ in that hypothetical tape measure, which leads to unreliable measurements. If the measurements have low reliability, PREDICTIVE VALIDITY HAS NO OPPORTUNITY TO EMERGE. Note carefully, however, that HAVING HIGHLY RELIABLE MEASURES DOES NOT IMPLY THAT THE PEARSON CORRELATION WILL NECESSARILY BE HIGH. The tests may be measuring quite different things.

Theoretical range of r Because of error of measurement, the theoretical ceiling of plus or minus 1 for a Pearson correlation can never be reached.

Effect of unreliability upon maximum value of r
The theoretical ceiling for values of r (+1) is lowered for a correlation between unreliable measures: the maximum possible correlation is the square root of the product of the two reliabilities. The formula shows that if the reliabilities of Test 1 and Test 2 are, respectively, .9 and .4, the correlation can be no higher than √(.9 × .4) = √.36 = .6.
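The ceiling can be computed directly. This is a minimal sketch; the function name is an illustrative choice.

```python
import math


def max_correlation(reliability_1, reliability_2):
    """Upper limit on the correlation between two measures with the given reliabilities."""
    return math.sqrt(reliability_1 * reliability_2)


# The lecture's example: reliabilities of .9 and .4.
print(round(max_correlation(0.9, 0.4), 2))  # 0.6

# Even with one perfectly reliable measure, the other still caps the correlation.
print(round(max_correlation(1.0, 0.4), 2))  # 0.63
```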

Concurrent validity So far, in considering predictive or criterion validity, we have been concerned with the use of a test to predict FUTURE performance. We can, however, also correlate scores on, say, an aptitude test with CURRENT academic performance. Here, the researcher must be especially careful to avoid criterion contamination: the criterion activity must be independently assessed. The assessors must not know the results of the aptitude test.

Summary A score on a test is meaningless unless you have NORMS with which the score can be compared. But these norms have been gathered by TRAINED PERSONNEL, who have followed exactly the precise instructions in the manual. Uniformity of testing procedure increases the RELIABILITY of the test.

Summary … Other factors affecting the reliability of a test are adherence to a standard system of SCORING, the LENGTH of the test and the statistical performance of the individual ITEMS within the test. RELIABILITY is an essential pre-requisite for VALIDITY.

Summary …
There are many definitions of VALIDITY. I have identified FOUR main types: FACE VALIDITY, CONTENT VALIDITY, PREDICTIVE VALIDITY and CONSTRUCT VALIDITY. In my final lecture, I shall focus upon construct validity.

Practice question What, in the context of mental testing, is meant by the term VALIDITY? Describe two approaches to validity. What are the possible threats to the determination of criterion validity, both in its predictive and concurrent applications?
