3. Procedure to elaborate measuring tools Psychometrics. 2011/12. Group A (English)

1 3. Procedure to elaborate measuring tools Psychometrics. 2011/12. Group A (English)

2 Most measuring tools involve performing a task, observing behavior, or self-report. In general, they share the following features: each is a systematic technique in which the task, instructions, responses, administration procedure, scoring and interpretation are planned in advance and are the same for all subjects to whom it is applied.

3 Each contains only a sample of the subjects' behavior, representative of the possible observable behaviors that are empirical manifestations of the characteristic the tool measures. Scores are interpreted by reference to norms or standards. That is, a score acquires meaning when we compare one subject with others and see his/her relative position, or when we compare it with an external criterion that serves as a standard for the subject's performance.

4 The aim is to make inferences and predictions about behaviors that are different from, or more general than, those observed during the administration of the test.

5 Different kinds of tools Test: – General term for any psychological measurement tool, used especially for tools that measure cognitive variables (skills, knowledge, performance, etc.). – The subjects' responses to each item are either correct or incorrect. – The total score on the test is calculated by adding up all the correct answers (as a direct or weighted sum).

6 Scale: – Usually refers to tools designed to measure non-cognitive variables: attitudes, interests, preferences, etc. – On a scale of graded, ordered categories, each subject chooses the category that best represents his/her position on what the test is measuring. There are no right or wrong answers. – The total score is the sum of the scores assigned to the categories chosen by the subject.

7 Questionnaire: – Usually made up of items that are not necessarily related to each other, whose response options are not ordered or graded, which can be interpreted individually and for which there are no right or wrong answers. – Usually includes varied questions to obtain more information about the subject and his/her environment (age, profession, education level, opinion on the topic under discussion, etc.). – Typical in survey research.

8 Inventory: – Usually refers to tools developed to measure personality variables. – The subjects' responses are neither right nor wrong; they indicate agreement or disagreement with the statements in the items.

9 Process of constructing a test
Determine the purpose of the test:
– What we want to measure
– Whom we are going to measure
– What the reason for the measurement is
Specify the characteristics of the test:
– What the content will be
– Which kind of items we want to include
– How many items
– Psychometric characteristics
Drafting of the items:
– Choice items
– Construction items
Critical review of the items by a group of experts:
– Which items we are going to select

10 Preparing the pilot test:
– Instructions for the administration
– Presentation format
– Format for recording the answers
Implementation of the pilot test:
– Individual or collective
– Paper and pencil or computer
– By mail, by computerized interview, by phone, etc.
Correction of the pilot test and assignment of scores to the subjects:
– In tests made up of choice items
– In tests made up of construction items

11 THE PURPOSE OF THE TEST

12 The variable under study Psychological variables are not directly observable: they are constructs. Constructs (unobservable theoretical variables) are manifested through a series of behaviors that can be observed directly and can therefore be measured. To be considered manifestations of the construct, these behaviors have to be more or less uniform and constant over time and across a variety of situations. All aspects of the construct should be reflected in the items of the test.

13 Ex. Verbal skill: a construct evidenced by knowing the meaning of a large number of words and by choosing the most appropriate word in a given context. If the construct is well defined, it will be easier to determine which behaviors are representative of it, and from them we will specify the test content.

14 Population to which it is addressed A test that evaluates a characteristic in a child population is different from one that evaluates the same construct in an adult population. The item content, the wording, the length of the test and the conditions of administration and completion, for example, will be different.

15 Intended use We have to consider the reason for using the test and which decisions will be taken from the scores obtained by subjects. One test can be used to make different decisions. Building a test to detect gifted children is not the same as building one to detect children with difficulties. In the first case, the items should be very difficult, so that only the most able children can answer them correctly. In the second case, the items should be very easy, so that only children with difficulties answer them incorrectly.

16 Functions: – Selection – Classification – Diagnosis – Certification – Guidance / advice – Description / information

17 SPECIFICATION OF TEST CHARACTERISTICS

18 Content Once the construct is clearly defined, the specification of the content begins by determining the domain of behaviors, i.e., the set of behaviors through which the construct is manifested. Once we have done that, we can make decisions about the content of the test. It is very important that the construct is clearly defined. This can be done in various ways, such as: – Content analysis of the construct – Review of published research – Task analysis – Direct observation – Expert opinion – Review of intervention programs

19 As you gather more information about the content to be measured, the definition of the construct will be refined, and vice versa. You can use a matrix of content specifications to calculate and distribute the number of test items: starting from an initial number of items, they are allocated according to the content areas, the processes to be evaluated or any other variable you want to consider.
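As a rough illustration of how a matrix of content specifications can drive the allocation of items, the sketch below distributes a fixed number of items over the cells of a content-by-process matrix in proportion to weights. The content areas, processes, weights and the `allocate_items` helper are all hypothetical, chosen only for the example; the slides do not prescribe a particular allocation rule.

```python
def allocate_items(total_items, weights):
    """Distribute total_items across cells in proportion to their weights.

    Uses largest-remainder rounding so the counts add up exactly.
    """
    weight_sum = sum(weights.values())
    exact = {cell: total_items * w / weight_sum for cell, w in weights.items()}
    counts = {cell: int(x) for cell, x in exact.items()}
    leftover = total_items - sum(counts.values())
    # Hand the remaining items to the cells with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for cell in by_remainder[:leftover]:
        counts[cell] += 1
    return counts


# Hypothetical specification matrix: (content area, process) -> relative weight.
blueprint = {
    ("vocabulary", "recall"): 3,
    ("vocabulary", "application"): 2,
    ("word choice in context", "recall"): 1,
    ("word choice in context", "application"): 4,
}
plan = allocate_items(40, blueprint)
for cell, n in plan.items():
    print(cell, n)
```

Starting the pilot with a larger `total_items` than the final version needs leaves room to discard weak items without emptying any cell.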

20 The measurement of any attribute always involves three stages (Thorndike & Hagen, 1989): – Identify and define the quality or attribute to be measured. – Determine the set of operations through which that attribute can be expressed and perceived. – Establish a set of procedures or definitions to convert the observations into quantitative statements of degree or amount.

21 Format of the items Select the type of items that will be used to construct the test. Choice items: closed-response items, which require subjects to respond by choosing one or more of the alternatives offered. 1. Two alternatives (true/false; yes/no; correct/incorrect). – Often used to measure cognitive variables (skill, ability, knowledge and performance tests). – Advantage: quick and easy to use. Disadvantage: subjects who do not know the answer and respond at random have a 50% chance of choosing the correct answer.

22 2. Multiple choice. It consists of the statement and the response alternatives (one is the correct or most appropriate one and the others are "distractors"). – 3-5 alternatives, to reduce the possibility that subjects choose the correct alternative by chance. – Used to measure cognitive variables, mainly in knowledge and performance tests. – Advantage: easy to administer, edit and score. Disadvantage: more difficult to construct than two-alternative items.

23 3. Matching (pairing). The subject matches the elements of two columns according to the instructions given in the statement. – Suitable for measuring cognitive variables, especially knowledge. 4. ‘Cloze’ or completion format. Subjects are given, for example, a sentence with missing words together with a word list that includes the missing words.

24 5. Rating scales. We present a statement and several response alternatives arranged in a graded series of categories along a continuum. The subject must choose, from the proposed alternatives, the one that best reflects his/her personal attitude toward the statement. – Used to measure non-cognitive variables (attitudes, interests, personality, etc.). – Advantage: subjects can express their position precisely. Disadvantage: the meaning of the response alternatives is not the same for all subjects.

25 6. Checklists. Rating format in which subjects give their opinion on the facts presented in the statement. – The listed options are not ordered; they are independent of one another. – Sometimes it is possible to choose more than one option. – Typical format of the questionnaire.

26 Construction items (open response): the subject has to produce his/her own response. – They allow us to evaluate not only the subjects' level of knowledge and the way they structure it, but also their higher-order cognitive skills. 1. Short answer. 2. Long answer or essay.

27 Test length There is no single solution, because many factors have to be taken into account (target population, time constraints, purpose of the test, etc.). It is recommended that the pilot test include more items than will be used in the final version. The matrix of content specifications may be useful here.

28 Psychometric characteristics of the items In the framework of classical test theory (CTT), an item is easy or difficult for a given population depending on the probability that subjects answer it correctly. If this probability is high, the item is easy, and vice versa. An item has a high degree of homogeneity with the rest of the items in the test when they all measure the same thing. An item is discriminating to the extent that it differentiates between subjects who obtain extreme scores on the test.
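A minimal sketch of two of these indices, assuming dichotomously scored items (1 = correct, 0 = incorrect). Difficulty is estimated as the proportion of correct answers, and discrimination as the difference in that proportion between the highest and lowest scorers. The response matrix is invented for illustration, and the 27% cut-off for the extreme groups is a common convention that the slides do not state.

```python
def item_difficulty(responses, item):
    """Proportion of subjects who answered the item correctly (1 = correct)."""
    return sum(r[item] for r in responses) / len(responses)


def item_discrimination(responses, item, fraction=0.27):
    """Proportion correct among the top scorers minus that among the bottom scorers."""
    ranked = sorted(responses, key=sum)            # order subjects by total score
    n = max(1, round(len(ranked) * fraction))      # size of each extreme group
    low, high = ranked[:n], ranked[-n:]
    p_high = sum(r[item] for r in high) / n
    p_low = sum(r[item] for r in low) / n
    return p_high - p_low


# Hypothetical data: 8 subjects (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
for item in range(4):
    print(item, item_difficulty(responses, item), item_discrimination(responses, item))
```

A high difficulty value here means an easy item, since it is the proportion of subjects who get it right.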

29 Regarding the difficulty of the items: – Speed tests: items should be very easy to solve. The difficulty lies in the limited time available to answer, and this is the factor that allows us to discriminate between subjects. – Maximum performance tests: primarily used to evaluate academic performance and to measure abilities and skills. The items have varying degrees of difficulty. – Typical performance tests: tests of personality, attitudes, etc. It makes no sense to talk about difficulty, since there are no right or wrong answers.

30 DRAFTING OF ITEMS

31 General recommendations Avoid ambiguous statements. The meaning of the words used should be clear to all subjects, because their responses would hardly be comparable if each subject could interpret the statement in a different way. Ex. Religiosity. Prefer short, direct and precise statements. Ex. What are your career goals for the next few years?

32 Avoid statements that elicit a biased response (one that is more likely to be chosen regardless of the subject's opinion). Ex. An item in which the subject has to admit to some kind of socially unacceptable behavior may lead the subject not to express his/her true opinion. Express a single idea in the statement (avoid double questions). Ex. Are you in favor of reducing alcohol consumption among young people and raising taxes on alcoholic drinks? Avoid double negatives. Ex. Do you think it is possible or impossible that man landed on the moon?

33 Recommendations for choice items Be sure that the statement is unambiguously true or false. Ex. of a statement to avoid: Dalí was the greatest painter of the twentieth century. Do not use statements that are universally true or false. Avoid words in the statement that might lead to the right answer even when subjects do not know it (such as always or never). Distribute the items whose statement is true randomly throughout the test, to avoid response patterns that subjects can recognize.

34 Recommendations for multiple-choice items Ensure that the statement of the item formulates the problem clearly. Include most of the text in the statement to avoid unnecessary repetition in the response options. Ensure that the ‘distractors’ (incorrect alternatives) are plausible. Avoid response options such as 'None of the above' or 'All of the above'. Include only one correct option unless you clearly indicate otherwise. Make all alternatives uniform in length and similar in grammatical construction. Randomize the position of the correct alternative. Make all options seem equally attractive. Ensure that each alternative is grammatically consistent with the item statement.

35 Response bias When choosing items, take into account the possibility of response bias, especially in affective tests (personality, interests, attitudes, etc.): – Acquiescence. Tendency to systematically agree (or disagree) with the statement of the item regardless of its content. – Social desirability. Tendency to respond to the item in a socially acceptable way. – Indecision. Tendency to select the neutral option. – Extreme response. Tendency to choose the extreme categories.

36 CRITICAL REVIEW BY A GROUP OF EXPERTS

37 It is preferable that the experts have not been involved in developing the items, so that they can judge not only whether the items fit the content, but also the clarity of the wording, whether the items meet the format standards, etc. After reviewing the items and removing or correcting those that are not suitable, you can construct the preliminary (pilot) test.

38 PREPARATION OF THE PILOT TEST

39 Administration instructions Each type of test requires specific instructions, but some recommendations are common to all: Do not use threatening language. Ex. of what to avoid: This test will allow us to know how smart you are. In maximum performance tests (e.g., skill tests), explain that the items are of varying difficulty; this reduces anxiety. In speed tests, explain that time is limited and that very few people will be able to complete the test.

40 Provide one or more items as examples. The instructions should explain how to allocate time and what to do when the subject does not know the answer to an item. Instructions should encourage subjects to answer all the questions, because scores tend to decrease when many answers are left blank. Explain how to mark the choices.

41 Presentation format and recording the responses The presentation format should be clear and readable for all subjects, to prevent inadvertent mistakes such as marking the wrong answer box. Request identification data from subjects at the beginning of the test. Then present the instructions, followed by the items: – In tests that measure cognitive variables (knowledge, skills, etc.), order the items by level of difficulty. Do not put hard questions first. – In tests that measure non-cognitive variables, be careful not to place sensitive questions at the beginning.

42 When a test includes items of various formats, items of the same format should be grouped together. Also group items that refer to the same topic.

43 IMPLEMENTATION OF THE PILOT TEST

44 Decide on the method of administration and select a sample of subjects belonging to the same population as those for whom the test was designed. Method of administration: – Collective or individual. – Oral (in person or by phone): for young children or subjects with language difficulties. – Paper and pencil. – By computer: saves time and gives greater standardization of the administration conditions. – By mail.

45 CORRECTION OF THE PILOT TEST AND ALLOCATION OF SCORES TO SUBJECTS.

46 In tests composed of choice items In a cognitive test: – Since there are correct and incorrect answers, we check whether the subjects' responses match the answer key. One point is given for each correct answer. – Final score: normally, the sum of correct answers. – Given the influence of guessing and of personality patterns (more or less risk-taking), either insist that subjects do not leave any item unanswered, or use a procedure to control the effect of chance on the final score. It is preferable to use a correction formula for this control.

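47 A correction formula can be applied in two ways: 1. Penalizing mistakes. It is assumed that when the subject makes a mistake he/she did not know the right answer and that all the alternatives of the item were equally attractive to him/her. With A correct answers, E errors and k alternatives per item, the corrected score is X_c = A - E / (k - 1).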

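48 2. Crediting items not answered. It is assumed that the subject has answered only the questions he/she knew and that, therefore, there are no mistakes. With A correct answers, O omitted items and k alternatives per item, the corrected score is X_c = A + O / k.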

49 The first procedure is advisable, because with the second one scores would be overestimated. Ex. Two students each know 10 of the 20 questions in a true-false exam. Student 1 answers only the 10 he/she knows; student 2 takes the risk and answers all the questions (responding at random to the other 10, he/she gets 5 right). – Student 1: obtains 10 points. – Student 2: obtains 15 points.

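50 Correction by procedure 1: student 1 obtains 10 - 0/(2 - 1) = 10 and student 2 obtains 15 - 5/(2 - 1) = 10. Correction by procedure 2: student 1 obtains 10 + 10/2 = 15 and student 2 obtains 15 + 0/2 = 15.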

51 When a test is composed of items with different numbers of alternatives, the correction for guessing has to be applied in parts to obtain each subject's final score. – Items are grouped according to their number of alternatives and the subject's score is calculated within each group. – Final score: sum of the partial scores.
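The correction in parts can be sketched as follows, assuming the standard penalty formula for guessing (correct answers minus errors divided by the number of alternatives minus one) is applied within each group of items. The `corrected_score` helper and the counts in `groups` are hypothetical.

```python
def corrected_score(groups):
    """Sum the guessing-corrected scores of several groups of items.

    Each group is a tuple (correct, errors, k), where k is the number of
    alternatives shared by the items in that group. Omitted items do not
    enter the penalty formula.
    """
    return sum(correct - errors / (k - 1) for correct, errors, k in groups)


# Hypothetical test: 10 true/false items plus 20 four-alternative items.
groups = [
    (7, 3, 2),    # true/false part: 7 - 3/1 = 4
    (14, 6, 4),   # four-alternative part: 14 - 6/3 = 12
]
total = corrected_score(groups)
print(total)
```

With a single group the function reduces to the plain penalty formula, so the risk-taking student of the true-false example (15 right, 5 wrong) comes out at `corrected_score([(15, 5, 2)])`, i.e. 10.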

52 In non-cognitive tests: Since there are no correct or incorrect answers, each response alternative of an item is assigned a different numerical value, which implies a prior scaling of the items (stimuli) according to the degree to which they express the construct. – Correction of the test and assignment of scores to subjects: add the numerical values assigned to the response alternatives chosen by the subject. – The numerical values must be properly assigned to each response category of each item.

53 When, for example, a rating-scale format is used, the direction of the continuum of the variable being measured must be very clear. If it is an attitudinal variable, we must know which ends of the continuum mark a favorable and an unfavorable attitude. Ex. Depression: which end marks the absence of depression and which the maximum degree. Then decide which end of the continuum receives the highest numerical value, and make sure that all items follow the same assignment rule.
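One common way to keep every item on the same assignment rule is to reverse the numerical values of items worded in the opposite direction before summing. The sketch below assumes a 5-category rating scale scored 1 to 5; the item names, the responses and the `score_scale` helper are invented for illustration.

```python
def score_scale(responses, reversed_items, n_categories=5):
    """Total score after flipping reverse-worded items to a common direction."""
    total = 0
    for item, value in responses.items():
        if item in reversed_items:
            value = n_categories + 1 - value   # 1<->5, 2<->4, 3 stays 3
        total += value
    return total


# Hypothetical depression scale: the highest value should mean more depression.
responses = {"feel_sad": 4, "sleep_badly": 5, "enjoy_things": 2}
reversed_items = {"enjoy_things"}   # agreeing with this item means less depression
total = score_scale(responses, reversed_items)
print(total)
```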

54 SUMMARY OF STEPS Process to develop a measurement tool (Crocker & Algina, 1986): – Delineation of the objective. – Definition of the construct: inductive or deductive process. – Description of the construct components: they can range from very specific or one-dimensional to very general or multidimensional. – Instrument design. – Drafting of items: clarity, no ambiguity, brevity. – Analysis of item quality: descriptive and statistical information. – Reliability: stability of test scores and internal consistency. – Validity: adequacy of the inferences made from scores on the test. – Development of rules for administration, interpretation and norms (scaling).

55 PART II. MEASUREMENT OF ATTITUDES

56 Psychological measurement tools Tests measure cognitive variables: skills, performance, knowledge, etc. Scales, questionnaires or inventories measure non-cognitive variables: personality, attitudes, interests, values, opinions, etc. – This part covers the main techniques for developing scales to measure attitudes (they can be adapted to measure interests, values...).

57 One characteristic that differentiates attitude scales from interest or value scales is that in attitude scales all the items must refer to the same variable, while in interest or value scales the items can refer to numerous activities (specific activities when interests are measured and broad categories when values are measured).

58 Thurstone scaling model

59 Thurstone developed procedures for constructing scales on a psychological continuum that allow the stimuli to be located without any operation on a physical continuum. We distinguish between: – Construction of the scale. Objective: to scale the stimuli (e.g., items) along a psychological continuum, assigning a scale value to each one. – Application. Once the scale is constructed, we have a set of items that constitute the pilot test, each with an assigned scale value representing the degree to which the attribute (the psychological variable being scaled) is present.

60 The phases for developing a scale are basically the same as those we saw for developing a test, but one must be added: the 'judges' test', in which scale values (scores) are assigned to each of the items (stimuli) that make up the test. Of all the procedures used by Thurstone, the most widely used is that of 'apparently equal intervals' (also known as equal-appearing intervals).

61 Basic assumptions of the model It is based on: – The differences between subjects when perceiving stimuli. – The limitations of the subject in perceiving the difference in magnitude between two stimuli. ASSUMPTIONS: A. There is a psychological or subjective continuum along which the attribute under study varies. B. Each stimulus to be scaled, when presented to a subject for evaluation, triggers a subjective process in the subject (called the 'discriminant process') through which the subject assigns it a value, also subjective, on the psychological continuum.

62 C. If the stimulus is presented repeatedly, it does not always trigger the same discriminant process in the subject, so the subjective value assigned may change. D. If each stimulus is presented a very large number of times, we can form a distribution of the subjective values assigned to it and assume that this distribution is normal. E. The mean of this distribution (called the discriminant distribution) is the value of the stimulus on the psychological continuum (called the scale value of the stimulus). The standard deviation is called the discriminant dispersion and gives us an idea of the ambiguity the stimulus raises for the subject.

63 The greater the standard deviation, the greater the variation in the values that the subject has assigned to the stimulus, and vice versa. F. If we present several stimuli, each one will produce a different discriminant distribution (with its own mean and standard deviation). G. The model holds whether: – A single subject issues numerous judgments about each of the stimuli. – A sample of subjects each issues a single judgment about each stimulus. The sample of subjects used to assign scale values to stimuli is known as the sample of judges or experts.

64 The Law of Comparative Judgment: binary (paired) comparison method Subjects' task: to compare each of the stimuli presented directly with every other one and to say, for each pair, which stimulus is preferred in the direction of the attribute being measured. Each judge produces a discriminant process: he/she assigns a subjective value to each stimulus and, when comparing them, there will be a difference between the subjective values assigned to each. This is the 'discriminant difference'.

65 The results of the judgments made by the judges on each pair of stimuli are arranged in matrices (of frequencies, proportions and typical scores, i.e. z-scores). The mean of the typical scores assigned by the judges to each stimulus (through discriminant processes) is the best estimate of its scale value.

66 Example: a study of Spanish attitudes toward marriage. The scale is constructed by the method of binary comparisons. 6 items are used to form all possible pairs (6 x 5 / 2 = 15). They are presented to 100 subjects (judges). Task: choose, within each pair, the item whose statement shows a more favorable attitude toward marriage. Once collected, the data are arranged in a matrix of frequencies.
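The pairing and tallying step can be sketched in Python. The 15 pairs follow directly from the 6 items; the three judgments used to fill the matrix are invented, and `frequency_matrix` is a hypothetical helper that follows the convention that a cell counts the judges who preferred the column item over the row item.

```python
from itertools import combinations

items = [1, 2, 3, 4, 5, 6]
pairs = list(combinations(items, 2))   # every unordered pair, n(n - 1) / 2 in total
print(len(pairs))


def frequency_matrix(items, judgments):
    """freq[row][col] = number of judges who preferred `col` over `row`."""
    freq = {row: {col: 0 for col in items if col != row} for row in items}
    for (a, b), chosen in judgments:
        rejected = b if chosen == a else a
        freq[rejected][chosen] += 1
    return freq


# Hypothetical judgments: three judges compare items 1 and 2.
judgments = [((1, 2), 1), ((1, 2), 1), ((1, 2), 2)]
freq = frequency_matrix(items, judgments)
print(freq[2][1], freq[1][2])
```

Because every judge chooses exactly one item of each pair, the two symmetric cells of a pair always add up to the number of judges.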

67 Matrix of observed frequencies
Stimulus    1    2    3    4    5    6
1         ---   70   65   45   40   80
2          30  ---   60   70   30   70
3          35   40  ---   60   30   60
4          55   30   40  ---   55   75
5          60   70   70   45  ---   65
6          20   30   40   25   35  ---
Sum       200  240  275  245  190  350

68 Each cell gives the number of judges who considered that the stimulus in the column shows a more favorable attitude toward marriage than the stimulus in the row. Item 6 is the one that, in the opinion of the judges, shows the most favorable attitude toward marriage (highest column sum); item 5 shows the most unfavorable one.

69 With these data we can construct an ordinal scale of the stimuli. We know the order of the items with respect to the degree of attitude they express, but we cannot know the distances between them. Reorder the rows and columns so that the stimuli follow the order established by the judges.

70 Ordered frequency matrix
Stimulus    6    3    4    2    1    5
6         ---   40   25   30   20   35
3          60  ---   60   40   35   30
4          75   40  ---   30   55   55
2          70   60   70  ---   30   30
1          80   65   45   70  ---   40
5          65   70   45   70   60  ---
Sum       350  275  245  240  200  190
The sum of each pair of symmetric elements in the matrix equals the number of judges.

71 Matrix of proportions
Stimulus     6     3     4     2     1     5
6          ---  0.40  0.25  0.30  0.20  0.35
3         0.60   ---  0.60  0.40  0.35  0.30
4         0.75  0.40   ---  0.30  0.55  0.55
2         0.70  0.60  0.70   ---  0.30  0.30
1         0.80  0.65  0.45  0.70   ---  0.40
5         0.65  0.70  0.45  0.70  0.60   ---
Sum       3.50  2.75  2.45  2.40  2.00  1.90
From the ordered frequency matrix we obtain the matrix of proportions by dividing each element by the number of judges (100).

72 From the matrix of proportions we obtain the matrix of typical scores (z-scores): using the normal curve table, we find the typical score that corresponds to each proportion. For the cells on the diagonal, since stimuli were not compared with themselves, we can assume that, had the comparison been made, 50% of subjects would have opted for the stimulus in the row and 50% for the one in the column. In typical scores, 0 is the value that separates the two halves of subjects.

73 Typical score matrix
Stimulus            6      3      4      2      1      5
6                0.00  -0.25  -0.67  -0.52  -0.84  -0.39
3                0.25   0.00   0.25  -0.25  -0.39  -0.52
4                0.67  -0.25   0.00  -0.52   0.13   0.13
2                0.52   0.25   0.52   0.00  -0.52  -0.52
1                0.84   0.39  -0.13   0.52   0.00  -0.25
5                0.39   0.52  -0.13   0.52   0.25   0.00
(a) ΣZ_kj        2.67   0.66  -0.16  -0.25  -1.37  -1.55
(b) ΣZ_kj / N    0.45   0.11  -0.03  -0.04  -0.23  -0.26
(a) Sum of the typical scores of the column. (b) Estimate of the scale values of the 6 stimuli (N = 6), since the best estimate we can make of them is the mean of the typical scores. The sum of all scale values must be equal to 0.
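The chain from frequencies to proportions, typical scores and scale values can be sketched with the Python standard library, using `statistics.NormalDist().inv_cdf` in place of the normal curve table. The frequencies are those of the ordered matrix of the example; because the code does not round the typical scores to two decimals, its results can differ from the table in the second decimal place.

```python
from statistics import NormalDist

order = [6, 3, 4, 2, 1, 5]   # stimuli in the order established by the judges
# freq[row][col]: judges (out of 100) who rated the column item as more
# favorable than the row item; None marks the empty diagonal.
freq = [
    [None, 40, 25, 30, 20, 35],
    [60, None, 60, 40, 35, 30],
    [75, 40, None, 30, 55, 55],
    [70, 60, 70, None, 30, 30],
    [80, 65, 45, 70, None, 40],
    [65, 70, 45, 70, 60, None],
]
n_judges = 100
inv = NormalDist().inv_cdf   # proportion -> typical (z) score

# Diagonal cells are treated as a 50/50 split, i.e. z = 0.
z = [[0.0 if f is None else inv(f / n_judges) for f in row] for row in freq]

# Scale value of each stimulus = mean of the typical scores in its column.
scale = {item: sum(row[j] for row in z) / len(order) for j, item in enumerate(order)}
for item, value in scale.items():
    print(item, round(value, 2))
```

Note that `inv_cdf` raises an error for proportions of exactly 0 or 1, so a pair on which all judges agree would need special handling.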

74 Disadvantage: negative values. You can apply a linear transformation that moves the origin of the scale to the lowest scale value. In the example, this is the value of item 5 (-0.26). If we assign the value 0 to the stimulus with scale value -0.26, what we have done is add a constant equal to the absolute value of its scale value (0.26). To maintain the same distances between stimuli in the two scales, we have to add that constant to the scale values of the rest of the stimuli.

75 New scale, (b) + 0.26:
Item      5     1     2     4     3     6
Value  0.00  0.03  0.22  0.23  0.37  0.71
This is a subjective, unidimensional, interval scale. Item 5 expresses the most unfavorable attitude toward marriage, while item 6 expresses the most favorable. Now you can see the distances between items (e.g., items 2 and 4 are more similar to each other than items 1 and 2). For example, items 5 and 6 could be "Marriage restricts the freedom of the couple" and "Marriage is the basis of the family", respectively.
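The change of origin is a single addition per item. The sketch below starts from the rounded scale values of the example and checks that distances between items are unchanged by the shift; the variable names are arbitrary.

```python
# Scale values from the example (mean typical score per item).
scale = {6: 0.45, 3: 0.11, 4: -0.03, 2: -0.04, 1: -0.23, 5: -0.26}

shift = -min(scale.values())   # constant that moves the lowest value to 0
shifted = {item: round(value + shift, 2) for item, value in scale.items()}

for item in sorted(shifted, key=shifted.get):
    print(item, shifted[item])

# Distances between items are the same on both scales.
distance_before = scale[4] - scale[2]
distance_after = shifted[4] - shifted[2]
```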

76 The law of categorical judgment In addition to the general assumptions of the model, we must assume that the psychological continuum of each judge (along which the different stimuli are to be placed) can be divided into a series of ordered categories. The judge must assign each stimulus presented to one of the categories, depending on the degree of the attribute that he/she believes the stimulus has. Continuing with the previous example, the judges' task would now be to evaluate each item and assign it to a particular category according to how favorable or unfavorable an attitude they think it expresses. There are 3 procedures: rank ordering, successive intervals and apparently equal intervals.

77 Procedure of apparently equal intervals Each judge must imagine a scale divided into a series of ordered categories (e.g. 11), from the category that expresses the most negative attitude (category 1, at one end) to the one that expresses the most positive attitude (category 11, at the other end). In the center is the category corresponding to the neutral point of the attitude continuum (category 6). The intermediate categories between those 3 points are assumed to be equally spaced. If the first category has the value 1, its limits are 0.5 and 1.5, and so on for the others.

78 Example The following 2 items have been evaluated by 300 judges on a scale of 11 categories. – Item 5: Marriage affects the freedom of the couple. – Item 6: Marriage is the foundation of the family.
Category    1    2    3    4    5    6    7    8    9   10   11
Item 5     50  100   60   40   25   15   10    0    0    0    0
Item 6      0    0    0    0   10   15   25   40   60  100   50
Fa (5)     50  150  210  250  275  290  300  300  300  300  300
Fa (6)      0    0    0    0   10   25   50   90  150  250  300
Fa = cumulative frequency.

79 To find the scale value of each stimulus, calculate the median of its distribution, for which we first obtain the cumulative frequencies of each item (Fa): Md = L_i + I x (N/2 - F_b) / f_d, where L_i = lower limit of the interval of the category containing the median; I = interval width (in this procedure, 1); f_d = number of judges who placed the item in the category containing the median; N/2 = 50% of the sample of judges; F_b = number of judges who placed the item in categories below the one containing the median.

80 As the sample is 300, 50% = 150. For each item we look in Fa for the category that leaves 150 judges above and below. – For item 5 it is category 2; for item 6 it is category 9. Scale value of item 5: 1.5 + 1 x (150 - 50) / 100 = 2.5. Scale value of item 6: 8.5 + 1 x (150 - 90) / 60 = 9.5.
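The median calculation can be sketched as a small function, assuming the usual interpolation formula for grouped data (lower limit of the median category plus the interval width times (N/2 - F_b) / f_d). `grouped_median` is a hypothetical helper; the frequencies are those of the example.

```python
def grouped_median(freqs, width=1.0, first_category=1):
    """Median of a frequency distribution over equally wide ordered categories."""
    half = sum(freqs) / 2            # N / 2
    below = 0                        # F_b: judges in categories below the current one
    for offset, f in enumerate(freqs):
        if f > 0 and below + f >= half:
            # L_i: lower limit of the category that contains the median
            lower = first_category + offset * width - width / 2
            return lower + width * (half - below) / f
        below += f
    raise ValueError("empty distribution")


item5 = [50, 100, 60, 40, 25, 15, 10, 0, 0, 0, 0]
item6 = [0, 0, 0, 0, 10, 15, 25, 40, 60, 100, 50]
print(grouped_median(item5), grouped_median(item6))
```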

81 The two items are located close to opposite ends of the continuum. To form the scale we select the items on which the judges have shown most agreement. As a measure of the degree of agreement we can use the ambiguity coefficient (the distance between the first and third quartiles).

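82 Applying the same interpolation to the quartiles (N/4 = 75 and 3N/4 = 225 judges): For item 5: Q1 = 1.5 + (75 - 50) / 100 = 1.75 and Q3 = 3.5 + (225 - 210) / 40 = 3.875, so the coefficient is 3.875 - 1.75 = 2.125. For item 6: Q1 = 7.5 + (75 - 50) / 40 = 8.125 and Q3 = 9.5 + (225 - 150) / 100 = 10.25, so the coefficient is 10.25 - 8.125 = 2.125.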

83 If the coefficient is higher than 2, the item is considered ambiguous and should be removed from the scale. For neutral items, i.e. those whose scale value lies between 5.5 and 6.5 on an 11-category scale (or around the central point of the scale, whatever the number of categories), the ambiguity coefficient can be as high as 3. In the example, the coefficients of items 5 and 6 are greater than 2, so they should be eliminated; but since they are so close to 2, they could also be kept.
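A sketch of the selection rule, assuming the quartiles are interpolated within categories in the same way as the median. The helpers `grouped_quantile`, `ambiguity` and `keep_item` are hypothetical names; the thresholds of 2 and 3 are the ones given above, applied strictly.

```python
def grouped_quantile(freqs, q, first_category=1):
    """q-th quantile (0 < q < 1) over unit-width categories, by interpolation."""
    target = q * sum(freqs)
    below = 0
    for offset, f in enumerate(freqs):
        if f > 0 and below + f >= target:
            return first_category + offset - 0.5 + (target - below) / f
        below += f
    raise ValueError("empty distribution")


def ambiguity(freqs):
    """Ambiguity coefficient: distance between the third and first quartiles."""
    return grouped_quantile(freqs, 0.75) - grouped_quantile(freqs, 0.25)


def keep_item(freqs, n_categories=11):
    """Apply the cut-off of 2, relaxed to 3 for items near the neutral point."""
    scale_value = grouped_quantile(freqs, 0.5)
    centre = (n_categories + 1) / 2
    limit = 3 if abs(scale_value - centre) <= 0.5 else 2
    return ambiguity(freqs) <= limit


item5 = [50, 100, 60, 40, 25, 15, 10, 0, 0, 0, 0]
item6 = [0, 0, 0, 0, 10, 15, 25, 40, 60, 100, 50]
print(ambiguity(item5), ambiguity(item6), keep_item(item5), keep_item(item6))
```

Applied strictly, the rule rejects both items; keeping them because they are close to the cut-off is a judgment call left to the scale constructor.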

84 The Likert technique

85 Objective: to develop a scale simpler than Thurstone's but equally reliable. It is the summative model most commonly used to measure individual differences in psychological traits. It assumes that as the amount of the trait expressed by a subject increases or decreases, so does his/her score on the item. Advantages: easy to construct, very reliable, can be adapted to measure any kind of attitude.

86 Characteristics: – Assumes an ordinal level of measurement. – Measures a single dimension. – Operation: the subject is placed on the attitude variable, from the most favorable to the most unfavorable position. His/her value is the sum of the scores obtained on the different items. – It assumes that the more favorable a subject's attitude toward what is being measured, the greater the probability that he/she chooses, in each item, the category that indicates that position. – Items should allow subjects to make value judgments rather than factual judgments (subjects should express what they think should be, not what actually is).

87 Example The family should spend more time together. – A) completely agree – B) agree – C) indifferent – D) disagree – E) completely disagree When assigning scores to the alternatives, the researcher must ensure that the highest value indicates the most positive attitude toward what is being measured.

