Practice question What was Stern’s definition of an IQ? In your answer, explain the concept of mental age. What was the major drawback with the measure that Stern proposed?

Points about short answers
Write in COMPLETE SENTENCES: phrases or notes are insufficient. When you are defining a term, avoid using the term itself in its own definition: “A test is reliable when it can be relied upon …”. Decide upon the most important points and get those down first. If any time remains, add more detail.

Main points The three essential ingredients of your answer are as follows:
1. Explanation of MENTAL AGE.
2. STERN’S DEFINITION. Define the IQ verbally, as well as giving the formula.
3. The PROBLEM (psychometric mental age doesn’t increase beyond 15, which is problematic for the measurement of adult intelligence).
Anything else you may be able to add is a luxury. Remember, this question hasn’t asked for a solution to the problem, so you don’t have to give one. You need only show the examiner that you understand WHY there’s a problem.

Mental age A person’s MENTAL AGE is the chronological age of the children who typically perform at that person’s level. So if a 25-year-old man performs at the level of a typical 9-year-old, his CHRONOLOGICAL AGE is 25, but his MENTAL AGE is 9.

The Intelligence Quotient (IQ)
In 1912, the German psychologist Stern proposed the INTELLIGENCE QUOTIENT (IQ), which he defined as the ratio of a person’s mental age to their chronological age, multiplied by 100. The formula is: IQ = (mental age ÷ chronological age) × 100.

The problem with mental age
The problem is that, since mental age does not increase beyond 15 years, a person’s IQ as defined by Stern will progressively diminish with each year that passes, EVEN IF THE PERSON CONTINUES TO PERFORM AT EXACTLY THE SAME LEVEL.
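The arithmetic behind this drawback can be sketched in a few lines. The function name `stern_iq` and the constant `MENTAL_AGE_CEILING` are illustrative names, not part of the lecture; the ceiling of 15 is the figure quoted above, and the adult in the loop is hypothetical.

```python
MENTAL_AGE_CEILING = 15  # age beyond which measured mental age stops rising (figure from the text)


def stern_iq(mental_age: float, chronological_age: float) -> float:
    """Stern's ratio IQ: mental age divided by chronological age, times 100."""
    return 100 * mental_age / chronological_age


# The 25-year-old man from the mental age example, whose mental age is 9.
print(stern_iq(9, 25))  # 36.0

# A hypothetical adult who keeps performing at the mental-age ceiling.
for age in (15, 20, 30, 60):
    print(age, stern_iq(MENTAL_AGE_CEILING, age))
# 15 100.0
# 20 75.0
# 30 50.0
# 60 25.0
```

The performance level never changes in the loop, yet the quotient falls from 100 to 25 simply because the denominator keeps growing.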

Lecture 2 RELIABILITY

The first intelligence test
Last week, I described the construction of the first intelligence test by Binet and Simon. It’s not easy to construct a psychological test: it took Binet and Simon several years to do it. For example, we saw that, if we are to have a really useful test, we need NORMATIVE DATA, or NORMS. The norms provide us with the comparison we need to assess the performance of the child we are testing. This week, I am going to look more closely at two essential characteristics of a good test.

Two essential qualities
For a test to be useful, it must have two essential qualities: it must be RELIABLE and it must be VALID. I shall briefly consider the second of these properties first.

Validity A test is a MEASURING INSTRUMENT.
A test is said to be VALID if it measures what it is supposed to measure. Applicants for a post in senior management may be given a psychometric test of leadership capability. But do a candidate’s responses really indicate his or her suitability for the post? This is a question about the VALIDITY of a test.

Validity… When we say the test is VALID, we mean that a person’s responses to the questions in the test really do tell us something about how that person would perform in a real situation requiring managerial capability.

Are intelligence tests valid?
Binet assembled items that differentiated between typical children at various ages. He was trying to measure general scholastic aptitude. But are children who achieve a certain mental age on his test really capable of learning school subjects at a level typical of children of that chronological age?

Age norms: Reproducing a figure from memory
A 5-year-old can copy a square from memory, but not a diamond or a cylinder. An 8-year-old can copy a square and a diamond, but not a cylinder. An 11-year-old can copy all three figures.

Validity … According to Binet, the child who can draw the cylinder from memory has a greater scholastic aptitude than a child who cannot. But is that true? Is a child’s ‘mental age’, as measured by Binet’s test, really a measure of scholastic aptitude? Is the child who can draw the cylinder really better at school subjects (such as French, geometry or chemistry) than a child who cannot? These are questions about the VALIDITY of a psychological test. Does it measure the hypothetical quality that it is supposed to measure?

Reliability If a test is to be VALID, it must, in the first place, be RELIABLE. A RELIABLE test is one that gives CONSISTENT RESULTS if taken by the same participants on different occasions or when they are tested by different examiners.

Reliability … A reliable test produces CONSISTENT results.
If John scored at the 70th percentile on the first occasion of testing, he would, if tested on other occasions, score at similar percentile levels. His scores when tested on subsequent occasions are indicated by the dashed lines at the 69th, the 68th, the 73rd and the 72nd percentiles. He’s always somewhere near the 70th percentile. This is a RELIABLE test.

Unreliability … An unreliable test gives INCONSISTENT results.
John scores at the 70th percentile on the first occasion. On subsequent occasions, however, he scores at the 20th, the 40th, the 45th, the 75th and the 60th percentiles (not necessarily in that order). A test showing this sort of inconsistency is an UNRELIABLE test.
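One crude way to see the contrast between the two Johns is to compare how far his percentile ranks wander. The sketch below uses the percentiles quoted above; the helper `percentile_range` is an illustrative name, and the range is used only as a simple stand-in for consistency, not as a formal reliability coefficient.

```python
def percentile_range(percentiles):
    """Gap between the highest and lowest percentile rank obtained."""
    return max(percentiles) - min(percentiles)


# John's percentile ranks on the reliable test (first occasion plus four retests).
reliable_test = [70, 69, 68, 73, 72]

# John's percentile ranks on the unreliable test (first occasion plus five retests).
unreliable_test = [70, 20, 40, 45, 75, 60]

print(percentile_range(reliable_test))    # 5
print(percentile_range(unreliable_test))  # 55
```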

An elastic tape measure
Suppose you were to try to measure a set of objects with an elastic tape measure and repeat this operation on the same objects on several occasions. Each time such a tape measure was used, it would be stretched to a different extent, EVEN WHEN YOU WERE MEASURING THE SAME OBJECT. So the dimensions of the same objects would be recorded as having different values on different occasions. Our hypothetical tape measure would be useless. AN UNRELIABLE TEST IS LIKE AN ELASTIC TAPE MEASURE.

The right distribution
Implicit in this concept of consistency is the assumption that we have a measure that SPREADS PEOPLE OUT. The assumption is that what we are measuring is a VARIABLE. If so, it must have a proper DISTRIBUTION and people’s scores on our test must reflect this. A normal distribution is ideal.

Consistency without reliability
It is insufficient to say that a test is reliable if people get the same scores on different occasions. When a test is too difficult, people will score similarly on different occasions (at or near 0), but the scores do not have a satisfactory distribution. This is known as a FLOOR EFFECT. The same problem obtains when a test is too easy: this is known as a CEILING EFFECT. These are not truly reliable tests, because the scores do not differentiate among those tested. THE SCORES MUST HAVE AN APPROPRIATE DISTRIBUTION.
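A small sketch may help show why identical scores are not enough. When everyone scores 0 on both occasions, the scores have no variance, so a correlation between the two occasions cannot even be computed. The function `pearson_r` below is a hand-written illustration (returning `None` for the undefined case is a choice made for this sketch), and the scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation; returns None when either variable has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # no spread in the scores, so the correlation is undefined
    return sxy / sqrt(sxx * syy)


# Hypothetical floor effect: the test is so hard that all five people score 0 twice.
first_occasion = [0, 0, 0, 0, 0]
second_occasion = [0, 0, 0, 0, 0]

print(pearson_r(first_occasion, second_occasion))  # None
```

The scores are perfectly repeatable, but they say nothing about anyone's relative standing.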

Definition of reliability
The scores must have a distribution that DIFFERENTIATES among those tested. A test is said to be reliable if, given that the scores display the necessary variability and DISTRIBUTION, individuals retain their relative standing in the distribution from occasion to occasion of testing, and when tested by different administrators. A child should score at similar PERCENTILES from occasion to occasion. A reliable test thus gives CONSISTENT RESULTS.

Relationship between reliability and validity
Reliability and validity are two DIFFERENT PROPERTIES. Reliability, however, is a NECESSARY condition for validity – otherwise you have an elastic tape measure. Binet’s age norms attest to the reliability of his intelligence test. But there remains the question of whether mental age really reflects scholastic aptitude. That is a question about VALIDITY. Reliability is not a SUFFICIENT condition for validity. I shall return to validity later on.

Measuring the reliability of a test
There are several ways of measuring reliability. They all make use of the Pearson correlation. There are situations in which some are applicable but not others. There is no ‘best’ method for all purposes.

Composite or aggregate scores
An important consideration in the determination of reliability is whether the test leads to a single score or the final ‘score’ is actually an aggregate of scores on several different items. The DIGIT SPAN test produces a single score. Each person attempts to reproduce successively larger lists of digits until a maximum is reached. An INTELLIGENCE TEST, which contains many items, produces an aggregate score. Most personality tests yield aggregate scores.

Personality tests Many tests of personality have several subsections, each of which measures a distinct aspect of personality. So an overall aggregate score may be a sum of several scores which are themselves aggregates of scores on the items in the various subsections of the test. Cattell’s personality test produces scores on 16 subscales, each supposedly measuring one of 16 personality factors.

Short-term or working memory
Brain damage often results in impairment of various memory functions, both short-term and long-term. Short-term retention of verbal and nonverbal material is thought to depend on different functions. In Alan Baddeley’s theory of working memory, verbal working memory is served by the PHONOLOGICAL LOOP; nonverbal working memory is served by the VISUO-SPATIAL SKETCHPAD, and perhaps by the CENTRAL EXECUTIVE as well.

The digit span and Corsi Blocks tests
The DIGIT SPAN test is one measure of verbal working memory. The CORSI BLOCKS test is a measure of non-verbal working memory. Both tests are widely used in the clinical context, when doctors or psychologists are testing for loss of memory function in brain-damaged patients. Until recently, the Corsi test was the more widely used measure of nonverbal working memory.

The Corsi Blocks test The tester and the patient sit opposite one another at a table. On the table is a board about the size of a chessboard, upon which some wooden cubes are fixed in a haphazard arrangement. On the tester’s side, the blocks are numbered, so that they can be touched in predefined sequences. The tester taps a sequence of the cubes. The patient is asked to tap the same cubes in the same order.

The Corsi span The tester taps the blocks in progressively longer sequences, until the patient cannot reproduce the sequence of taps. The Corsi span is the longest sequence the patient can reproduce. The entire procedure results in a single score, the Corsi span.

The Visual Patterns test
Arguably, the Corsi Blocks test taps both visual storage and SPATIAL MEMORY, which has a nonvisual element consisting of memories for felt body position. The VISUAL PATTERNS TEST (VPT) is intended to tap purely VISUAL nonverbal working memory, excluding the nonvisual spatial element.

Visual patterns You can build up increasingly complex patterns by increasing the size of the grid. The lower pattern is, of course, much more difficult to reproduce from memory than would be one in a smaller grid.

Obtaining the visual span
The patient is shown a grid, some of whose squares are blackened. After a fixed inspection period, the grid is removed and the patient is asked to reproduce the pattern by pencilling crosses in the corresponding squares of a blank grid of the same size as the original. Increasingly large patterned grids are presented, until the patient can no longer reproduce the exact pattern of the black and white squares. The patient’s VISUAL SPAN is the size of the largest pattern he or she is able to reproduce.

Age norms Norms are available for both the Corsi Blocks and Visual Patterns tests. Both Corsi Blocks and Visual Patterns spans decrease noticeably with age. To assess whether someone in their seventies has sustained cognitive impairment, that person’s score must be related to the distribution of scores of people in that age group.

Methods of determining reliability
1. Test-retest. 2. Parallel (or equivalent) forms. 3. Split-half.

1. Test-retest reliability
Give the test to a large number of people. Give the test again to the same people. You will have a bivariate data set comprising the scores of each person on the two tests. Calculate the Pearson correlation r between Score on the FIRST occasion and Score on the SECOND occasion. The value of r should be at least .75.

2. Parallel forms. Construct two equivalent forms of the same test, Form A and Form B. Ensure that people score at similar levels on the two forms. This has been done with the Visual Patterns test. Test each of a large sample of people with both Form A and Form B. Let Variable A contain their scores on Form A of the test; Variable B contains their scores on Form B of the test. This is a bivariate data set. Calculate the Pearson correlation between A and B. The correlation should be at least .75.

3. Split-half reliability: Requirements
The test must consist of several questions and yield a composite score. The items must be divisible into two EQUIVALENT sub-groups, as by taking the ODD-NUMBERED and EVEN-NUMBERED items. There shouldn’t be any systematic difference in the difficulty or nature of the items in the two sub-groups.

The method Each person can now be given two totals:
1. A total on the odd-numbered items. 2. A total on the even-numbered items. You will now have a bivariate data set comprising the Odd and Even totals achieved by all the people tested. Calculate the Pearson correlation to determine the split-half reliability. The value of r should be at least .75.

An example Suppose that a test comprises ten items, each item being marked 0 or 1, for a wrong and a right answer, respectively. The score a person finally gets on the test is an aggregate of the ones and zeros over all ten items. A person’s total score, therefore, can vary from 0 to 10.

Obtaining the odd and even totals
In the table, the bottom two rows contain the Odd and Even half-totals for three people. Fred did best, Joe did worst and Mary’s score is intermediate.

The split-half reliability
Each of the people tested now has two ‘scores’: an ‘even’ score and an ‘odd’ score. We have a bivariate data set and we can calculate a Pearson correlation. The value of this correlation is 0.86. The SPLIT-HALF measure of reliability is the Pearson correlation between the Odd and Even subtotals.

The scatterplot The scatterplot is indicative of the assumed linear relationship between scores on the odd and even items in the test. The split-half reliability is 0.86.

Test-retest: Disadvantages
On a test of attitudes or prejudice, memory for previous answers would make the test seem more reliable than it really is. The shorter the interval between the first and second testing, the stronger the memory effect is likely to be. There is therefore uncertainty about how long to make the interval between the two sessions. The test-retest method may OVERSTATE a test’s reliability.

Parallel forms method: Advantages
The reliability sample is tested twice in one session, once with Form A and once with Form B. So there can be no uncertainty about the length of the interval between testing sessions. Different items are used for Form A and Form B, greatly reducing the possibility of practice or memory effects.

Parallel forms: Disadvantages
There could still be some practice effect, because the items in Form A and Form B will be similar. If two tests are given during a single session, it would be wise to vary the order of presentation of Forms A and B (i.e. counterbalance the order) among the participants. The parallel forms method produces LOWER ESTIMATES of reliability than does the test-retest method. This is because you are SAMPLING from a larger pool of possible items; and sampling implies SAMPLING ERROR. The VARIANCE of the scores is increased, which REDUCES the correlation.

The split-half method: Advantages
You need only test your reliability sample once. You don’t need to go to the trouble of constructing Form A and Form B of the test and establishing that they are equivalent.

Split-half reliability: Disadvantages
The method isn’t workable with tests like the Corsi or the Visual Patterns, which produce a single score. Essentially, you are producing two tests, each of which is half as long as the original one. Unfortunately, SHORTER TESTS ARE LESS RELIABLE THAN LONGER TESTS. The split-half method produces LOWER ESTIMATES of reliability than either the test-retest or parallel forms methods. In fact the method produces the LOWEST reliability estimates of the three methods I have described.

Effect of the length of a test
Longer tests are more reliable than shorter ones. For example, a vocabulary test with 50 items is more reliable than one with 10 items. This is because the words in a test are a SAMPLE from a much larger pool of possible words. Sampling entails SAMPLING ERROR or VARIABILITY. We have seen that the statistics of small samples vary more than the statistics of large samples. In the same way, a person’s score on a short vocabulary test would show more variation than scores on a longer test. This is RANDOM variation and reduces the reliability of the test.

True and random error components
A person’s vocabulary score consists of a TRUE component (true relative size of vocabulary) and a RANDOM component. The random (or ERROR) component is contributed to by the element of luck in the selection of words for the test.

A score’s components

Good variance, bad variance
We need variance to achieve the right distribution and spread people’s scores. But there’s good variance and bad variance. Good variance is determined by variation in the true component of people’s scores. Bad variance is variation in the random or error component.
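The distinction between good and bad variance can be written compactly in the notation of classical test theory. This is a standard textbook formulation rather than something stated in the lecture; the symbols X (observed score), T (true component) and E (error component) are introduced here for illustration, and the second line assumes that T and E are uncorrelated.

```latex
\begin{align*}
X &= T + E \\
\sigma_X^2 &= \sigma_T^2 + \sigma_E^2 \\
\text{reliability} &= \frac{\sigma_T^2}{\sigma_X^2}
  = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
\end{align*}
```

On this reading, good variance is the true-score variance and bad variance is the error variance: the larger the share of the total variance that comes from the true component, the closer the reliability is to 1.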

Longer tests Tests with more items produce scores with relatively greater true components and relatively less error. Increase the true component of the total score by HAVING MORE ITEMS IN YOUR TEST.

Low average inter-item correlation
A test consists of 30 items. The average correlation among the various pairs of items is only 0.2. The test yields a total score, which is the sum of the participant’s scores on all the items. What is the reliability of the test?

The Spearman-Brown formula
The number of items is 30. The average correlation between pairs of items is 0.2. Substituting in the formula, we find that the reliability of the TOTAL SCORE on all 30 items is 0.88.
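The substitution can be checked with a short sketch. It assumes the usual form of the Spearman-Brown formula, reliability = k × r / (1 + (k − 1) × r), where k is the number of items and r is the average inter-item correlation; that form reproduces the 0.88 quoted above. The function name `spearman_brown` and the comparison lengths of 10 and 50 items are illustrative.

```python
def spearman_brown(n_items: int, mean_inter_item_r: float) -> float:
    """Reliability of a total score over n_items, given the average inter-item correlation."""
    return n_items * mean_inter_item_r / (1 + (n_items - 1) * mean_inter_item_r)


# The worked example from the text: 30 items, average inter-item correlation 0.2.
print(round(spearman_brown(30, 0.2), 2))  # 0.88

# The same items-per-test comparison for shorter and longer tests.
for k in (10, 30, 50):
    print(k, round(spearman_brown(k, 0.2), 2))
# 10 0.71
# 30 0.88
# 50 0.93
```

Even with a low average correlation between items, the total score becomes more reliable as items are added, which is the point made earlier about the length of a test.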

Summary Reliability, in the technical sense of the term, was defined.
Three methods of determining reliability were described: (1) the TEST-RETEST method; (2) the PARALLEL FORMS method; (3) the SPLIT-HALF method. Each method has its own advantages, disadvantages and applicability.

Summary … Reliability is a necessary, but not a sufficient, condition for validity. The reliability of intelligence tests, field dependence tests (Rod-and-frame, Embedded Figures) and personality tests (Introversion-Extraversion, Neuroticism-Stability) is very high, often .9 or greater. This fact in itself, however, does not demonstrate the VALIDITY of these tests. Next week, I shall turn to the ways in which psychometricians attempt to validate their tests.

Short question What, in the context of mental testing, is meant by the RELIABILITY and VALIDITY of a test? Can a test be valid without being reliable? Can a test be reliable without being valid? Describe two approaches to the measurement of reliability, explaining the advantages and disadvantages of each.

Practice question What is a DEVIATION IQ? In your answer, explain how a deviation IQ differs from IQ as defined by Stern. What is the advantage of a deviation IQ?
