Tests of Cognitive Intelligence

2 Common Characteristics of Individual Intelligence Tests
Individually administered
Administration requires advanced training
Tests cover a wide range of age and ability
Examiner must establish rapport
Immediate scoring of items
Usually requires about one hour
Allows opportunity for observation

3 Two Main Individually Administered Intelligence Tests
Stanford-Binet: Grew out of an effort to identify intellectually limited children so they could be removed from the regular classroom and placed in special education.
Wechsler Scales: Developed in response to the perceived shortcomings of the Stanford-Binet.

4 Binet’s Principles of Test Construction
Binet wanted tasks to measure judgment, attention, and reasoning. He was guided by two major concepts: age differentiation and general mental ability.
Age differentiation: Binet searched for tasks that could be completed by two-thirds to three-fourths of the children in a particular age group, by fewer younger children, and by more older children.
General mental ability: Binet measured only the total product of the various tasks. He judged the value of a task in terms of its correlation with the combined result of all other tasks.
Intelligence testing as we know it began with the decision of a French minister of public instruction at the turn of the 20th century. The minister wanted to create a process for identifying intellectually limited children so they could be removed from the regular classroom and put in special education.

5 Early Binet Scales
1905: 30 items ordered by difficulty. The test lacked:
adequate measuring units to express results (only used idiot, imbecile, and moron)
adequate normative data (only used 50 subjects)
evidence of validity
1908: Grouped items according to age level rather than simply by increasing difficulty. Introduced the concept of mental age. Increased the norm group to 203. Criticized because it produced only one score, almost exclusively related to verbal, language, and reading ability.

6 1916 Stanford-Binet Intelligence Scale
Lewis Terman increased the size of the standardization sample, though it included only white, native-Californian children. Introduced the intelligence quotient (IQ) concept to show subjects’ rate of mental development: IQ = (MA/CA) x 100, where MA is mental age and CA is chronological age. However, the scale had a maximum mental age, so a maximum chronological age had to be set as well; it was set at 16.
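The ratio IQ formula on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `ratio_iq` and the example ages are assumptions, and the cap of 16 on chronological age is taken from the slide.

```python
def ratio_iq(mental_age, chronological_age, max_chronological_age=16.0):
    """Ratio IQ = (MA / CA) x 100, with CA capped at the scale's maximum.

    Ages are in years. The cap of 16 follows the slide; the function
    itself is an illustrative sketch, not part of any test manual.
    """
    if mental_age < 0 or chronological_age <= 0:
        raise ValueError("ages must be positive")
    capped_ca = min(chronological_age, max_chronological_age)
    return 100.0 * mental_age / capped_ca


# A child of 8 performing like a typical 10-year-old
print(ratio_iq(10, 8))   # 125.0
# An adult of 30 is treated as having a chronological age of 16
print(ratio_iq(16, 30))  # 100.0
```

Without the cap, an adult's ratio IQ would keep falling with age even though measured mental age stops increasing.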

7 1937 Scale
Extended the age range down to 2 and up to 22 years, 10 months.
Scoring standards and instructions were improved.
Several performance items were added.
Standardization sample improved to include 3,184 subjects from 11 states. Subjects were selected according to their fathers’ occupations. Still, the sample included only whites and mainly those from urban areas.
Developed an alternate form.

8 Problems with 1937 Form
Reliability was higher for older subjects than for younger ones, and higher for those in the lower IQ ranges than for those in the higher ranges. Scores were most unstable for young children with high IQs. Each age group also had a different standard deviation, which made interpretation difficult (think of different-sized SDs and the interpretation of the normal curve). For example, a 6-year-old would need an IQ of 125 to be 2 SD above the mean (implying an SD of 12.5), whereas a 12-year-old, where the SD is 20, would need an IQ of 140 to be 2 SD above the mean.
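The arithmetic behind the 6-year-old and 12-year-old example can be checked with a short Python sketch. The SD of 20 for 12-year-olds comes from the slide; the SD of 12.5 for 6-year-olds is inferred from the slide's figure of 125 (100 + 2 x 12.5). The function names are illustrative assumptions.

```python
def score_at_sd(k, sd, mean=100.0):
    """IQ score that lies k standard deviations above the mean."""
    return mean + k * sd


def sd_units(score, sd, mean=100.0):
    """How many standard deviations a score lies above the mean."""
    return (score - mean) / sd


# IQ needed to be 2 SD above the mean in each age group
print(score_at_sd(2, 12.5))  # 125.0 (6-year-olds, SD inferred as 12.5)
print(score_at_sd(2, 20.0))  # 140.0 (12-year-olds, SD of 20)

# The same IQ of 125 means different things at different ages
print(sd_units(125, 12.5))   # 2.0
print(sd_units(125, 20.0))   # 1.25
```

The last two lines show the interpretive problem: an identical IQ sits at a different point on the normal curve depending on the age group.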

9 1960 Stanford-Binet
Used Binet’s principles to redo the scale.
Solved the problem of differential variation in IQ by using the deviation IQ concept. Set the mean at 100 with an SD of 16. Scores at one age level could now be compared with scores at another.
No new normative sample was collected in 1960, but a renorming in 1972 used 2,100 children and included non-whites.
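A minimal sketch of the deviation IQ idea, assuming the score is obtained by standardizing a raw score within its own age group and rescaling to a mean of 100 and SD of 16. The age-group means and SDs in the example are made-up values for illustration, and actual test manuals use norm tables rather than this direct formula.

```python
def deviation_iq(raw_score, age_group_mean, age_group_sd, mean=100.0, sd=16.0):
    """Rescale a raw score to the deviation IQ metric (M=100, SD=16).

    age_group_mean and age_group_sd describe raw scores in the examinee's
    own age group; the values used below are hypothetical.
    """
    if age_group_sd <= 0:
        raise ValueError("age_group_sd must be positive")
    z = (raw_score - age_group_mean) / age_group_sd
    return mean + sd * z


# Two children, each 2 SD above the mean of their own age group,
# receive the same deviation IQ despite different raw-score spreads.
print(deviation_iq(60, 50, 5))    # 132.0
print(deviation_iq(90, 70, 10))   # 132.0
```

Because every age group is placed on the same mean and SD, a given IQ has the same relative standing at any age.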

10 Modern Binet Scale
Totally revised in 1986 by Thorndike et al.
Used Thurstone’s multidimensional model (1938): g is made up of crystallized abilities (verbal and quantitative reasoning), fluid-analytic abilities (abstract-visual reasoning), and short-term memory.
Used IRT (the Rasch model) to determine the proper order of the items.
Used a routing test (Vocabulary) in an attempt to adapt testing to the specific ability level of each examinee without computer adaptive testing.
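For readers unfamiliar with the Rasch model, the sketch below shows its standard form, in which the probability of passing an item depends only on the difference between examinee ability and item difficulty, and then orders items by difficulty. The item names and difficulty values are hypothetical; estimating difficulties from response data, which is the real work of an IRT analysis, is not shown.

```python
import math


def rasch_probability(ability, difficulty):
    """Rasch model: P(pass) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def order_items_by_difficulty(difficulties):
    """Return item names sorted from easiest to hardest."""
    return sorted(difficulties, key=difficulties.get)


# Hypothetical item difficulties on the logit scale
items = {"item_a": 0.5, "item_b": -1.2, "item_c": 1.8}

print(order_items_by_difficulty(items))  # ['item_b', 'item_a', 'item_c']
print(rasch_probability(0.0, 0.0))       # 0.5
```

An examinee whose ability equals an item's difficulty has an even chance of passing it, which is why difficulty estimates give a natural ordering of items.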

11 Structure of the SB-IV
Verbal Reasoning included the vocabulary test, comprehension test, absurdities test, and verbal relations test.
Abstract-Visual Reasoning included the pattern analysis test, copying test, matrices test, and paper-folding and cutting test.
Quantitative Reasoning included the quantitative test, number series test, and equation-building test.
Short-term Memory included bead memory, memory for sentences, memory for digits, and memory for objects.
Composite included all areas combined.
To decide where to begin each test, use scores on the vocabulary test in conjunction with chronological age. The test becomes adaptive.
Basal age is established for each test: the lowest level where 2 consecutive items of approximately equal difficulty are passed.
Ceiling for each test is where at least 3 out of 4 items are missed.
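The basal and ceiling rules as worded on this slide can be sketched as a scan over an examinee's item responses. This is a simplified illustration under stated assumptions: responses are a single list ordered from easiest to hardest, the basal is the first pair of consecutive passes, and the ceiling is the first run of 4 consecutive items with at least 3 misses. The actual administration rules in the test manual are defined in terms of test levels and may differ.

```python
def find_basal(responses):
    """Index of the first item in the lowest pair of consecutive passes.

    responses: list of booleans (True = passed), ordered by difficulty.
    Returns None if no such pair exists.
    """
    for i in range(len(responses) - 1):
        if responses[i] and responses[i + 1]:
            return i
    return None


def find_ceiling(responses, window=4, min_misses=3):
    """Index where the first window with at least min_misses misses starts.

    Returns None if the examinee never reaches a ceiling.
    """
    for i in range(len(responses) - window + 1):
        misses = sum(1 for passed in responses[i:i + window] if not passed)
        if misses >= min_misses:
            return i
    return None


# Hypothetical response pattern, easiest item first
responses = [True, True, True, False, True, False, False, False]

print(find_basal(responses))    # 0
print(find_ceiling(responses))  # 3
```

Testing between the basal and the ceiling is what keeps administration time near an hour: items well below or above the examinee's level are never given.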

12 Psychometric properties of SB-IV
Standardization sample has subjects in 47 states and DC. Sample stratified based on the 1980 census: geographic region, community size, ethnic group, age, and gender.
Internal consistency reliability is .98 for the composite and for area scores. Some individual test reliabilities are lower; the lowest is .73 for memory for objects.
Test-retest reliabilities for the composite score were .91 and .90 for 5- and 8-year-olds.
Factor analysis supports the structure of the test.
Correlations with other IQ tests are generally in the .70s and .80s.
Developed “testlets” of 2 items of the same difficulty; successive pairs were to differ in difficulty by a constant amount based on Rasch scaling. Because of the limited item pool and subject sample, some items were misplaced. This was discovered only after normative data were collected. RLT (R. L. Thorndike) rescored all 5,000 tests by hand and ran a Rasch analysis on the new data. He found a disturbingly large number of scoring errors.

13 Wechsler Scales
David Wechsler worked at New York’s Bellevue Hospital. He wasn’t happy with the Stanford-Binet, with its focus on children and its production of a single score. In 1939, he created the Wechsler-Bellevue, later called the WAIS. In 1949, he created the children’s version, the WISC. In 1967, he added the WPPSI for preschool and primary-age children.

14 Structure of the WAIS
The WAIS yields separate verbal and performance IQs. The WAIS-III also has four index scores: verbal comprehension, working memory, perceptual organization, and processing speed.

15 Verbal and Performance Tests on the WAIS
Verbal: Vocabulary, Similarities, Arithmetic, Digit Span, Information, Comprehension, Letter-Number Sequencing
Performance: Picture Completion, Digit Symbol-Coding, Block Design, Matrix Reasoning, Picture Arrangement, Symbol Search, Object Assembly

16 Scales and Norms for the WAIS
Determine the raw score for each subtest.
Convert raw scores to standard scores, called scaled scores (M=10, SD=3). There are conversions for 13 age groups. This method of conversion obscures any differences in performance by age.
Subtest scaled scores are added, then converted to WAIS-III composite scores.
Three composite scores: verbal, performance, and full scale, each with M=100, SD=15.
Four index scores: verbal comprehension, perceptual organization, working memory, and processing speed.
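The two-step conversion described above can be sketched as follows. This is a simplified illustration that assumes linear standardization; the actual WAIS conversions use age-specific lookup tables from the manual. All norm means and SDs in the example, the function names, and the 1 to 19 clamp on scaled scores are assumptions made for the sketch.

```python
def to_scaled_score(raw, norm_mean, norm_sd):
    """Convert a raw subtest score to the scaled-score metric (M=10, SD=3).

    norm_mean and norm_sd describe raw scores in the examinee's age group
    (hypothetical values here). The result is clamped to 1..19 as an
    assumption of this sketch.
    """
    z = (raw - norm_mean) / norm_sd
    return max(1, min(19, round(10 + 3 * z)))


def composite_score(scaled_scores, sum_mean, sum_sd):
    """Convert a sum of scaled scores to the composite metric (M=100, SD=15).

    sum_mean and sum_sd describe the sum of scaled scores in the norm
    group (hypothetical values here).
    """
    z = (sum(scaled_scores) - sum_mean) / sum_sd
    return round(100 + 15 * z)


# One subtest: raw score 1 SD above the age-group mean
print(to_scaled_score(30, norm_mean=24, norm_sd=6))  # 13

# Six hypothetical scaled scores summed and converted to a composite
print(composite_score([13, 10, 7, 12, 11, 10], sum_mean=60, sum_sd=12))  # 104
```

Because each raw score is standardized within its own age group, a scaled score of 10 always means average for that age, which is why the conversion hides age-related differences in raw performance.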

17 Standardization of the WAIS
Standardized on a stratified sample of 2,450 adults representative of the US population. There were 200 cases per age group, except for smaller numbers in the two oldest groups. It is still difficult to know the effects of self-selection, since participants had to be invited and had to agree to be included.

18 Reliability of the WAIS
Internal consistency and test-retest reliabilities are about .95 or higher for full scale and verbal scores. They are about .90 for performance scores and for three other index scores: perceptual organization, working memory, and processing speed. Internal consistency reliabilities for the subtests range from the upper .70s to the low .90s; test-retest reliability is about .83. Generally, performance reliabilities are lower than verbal reliabilities on the subtests.

19 Validity of the WAIS
There is a great deal of information on criterion-related and construct validity. Factor analyses support the use of 4 index scores. Comparison studies show the pattern of WAIS-III scores for many special groups, e.g., people with Alzheimer’s disease, Parkinson’s disease, learning disabilities, or brain injury. It is the most widely used test today.

20 WISC-III
The most popular test for assessing the intellectual ability of children ages 6 years, 0 months to 16 years, 11 months. Similar in structure to the WAIS, with easier items. Both tests yield verbal, performance, and full scale IQs and 4 index scores. Most of the subtests are the same.
Differences: Freedom from Distractibility replaces Working Memory. The subtest composition of the Verbal Comprehension and Perceptual Organization index scores is slightly different. The WISC-III doesn’t have a Letter-Number Sequencing or Matrix Reasoning subtest. The WISC-III does have a Mazes subtest.

21 Psychometric Properties of the WISC-III
Standardization program involved 2,200 cases selected to represent the US population of children aged 6-16. Composite scores generally have internal consistency reliabilities in the mid-.90s and test-retest reliabilities around .90. Subtest reliabilities are generally in the mid-.80s. Object Assembly and Mazes are problematic, with reliabilities in the .60s.

22 Group Differences in IQ
Psychological tests are designed to measure differences among people. Test scores that demonstrate differences among people may suggest that people are not created with the same basic abilities. Biggest problem: some ethnic groups obtain lower average scores on some psychological tests. On average, African Americans score 15 points lower than whites on IQ tests. The dispute is not whether differences occur but why they occur: environment vs. biology.

23 Problems with Biology Argument
IQ scores are improving (called the Flynn effect), more so for African Americans than for whites.
The construct of race has no biological meaning, based on evidence from studies in population genetics, the human genome, and physical anthropology.
Victimization by stereotyping could affect test performance and grades. When students were told they were taking a test of intellectual ability, white students scored significantly higher than African American students. When told it was a test of problem solving unrelated to ability, the groups performed equivalently. In another study, even completing a demographic questionnaire that asked about race suppressed performance.

24 Criticisms Related to Content Validity
Looking at specific items, it was thought that they might be biased because some children wouldn’t have had the opportunity to learn the material.
Members of ethnic groups might answer some items differently but still correctly.
Scores are affected by language skills inculcated as part of a white, middle-class upbringing that is foreign to inner-city children.
The Johns Hopkins Child Development Study concluded that some children may be giving an appropriate response for the subculture they are familiar with, but that response may not be given credit.
Example: What would you do if you were sent to buy a loaf of bread and the grocer said he didn’t have any more? Correct answer: go to another store. When examiners probed for the reasons behind the incorrect response (go home), they found a number of “intelligent” reasons for it:
No other stores in the area
Not allowed to go further without permission
Used family credit at that store to shop and needed to go home for money to shop elsewhere
Sattler (1979b) criticized the study because (1) there wasn’t any control group that wasn’t from the inner city and (2) there may have been serious rater bias. Also, there was no evidence that the liberal scoring system enhanced the criterion-related validity of the test. Would higher scores also be more meaningful scores?

25 Responses to Content Validity Criticisms
Test developers are indifferent to the opportunities people have to learn the information on the tests. The meaning they assign to the tests comes from correlations of test scores with other variables.
Some evidence suggests that the linguistic bias in standardized tests does not cause the observed differences (Scheuneman, 1987).
Elimination of biased items from a test didn’t change the test scores (Bianchini, 1976).
Researchers couldn’t find classes of items most likely to be missed by minority group members (Wild et al., 1989).
The Stanford-Binet was administered to 100 Head Start children. Half took the standard version and half took a version in African American dialect. The dialect version showed only a 1-point increase in scores.

26 Other Ways of Thinking About Differences
Differences in test scores may reflect patterns of problem solving that characterize different subcultures (e.g., MBTI).
R. D. Goldman (1973) proposed the differential process theory, which maintains that different strategies may lead to effective solutions for many types of tasks. Strategies mediate abilities and performance.
African American students typically score higher on verbal than on quantitative tests. They may structure getting through school differently, developing their verbal skills rather than their quantitative abilities. This may also reflect differences in the opportunity to learn proper quantitative skills.
A variety of studies have shown differences in information processing for different groups; e.g., Native American and Hispanic groups tend to do better on visual reasoning than on verbal reasoning tasks (Suzuki & Valencia, 1997).

27 Criterion Issues
Most standardized tests are evaluated against other standardized tests. The criterion may be the same test dressed up differently, or both tests may be measuring test-wiseness.
IQ tests may be correlated with achievement tests, but achievement may be moderated by opportunity to learn.
Goldman and Hartig (1976) found scores on the WISC to be unrelated to teacher ratings of classroom performance for minority children, but significantly related for non-minority children.
Majority and minority children grow up in different social environments. Perhaps test scores accurately reflect the effects of social and economic inequality.

