**Methods of Standard Setting**

2
**Introduction**

All standard setting methods involve expert judgemental decision making at some level... (Jaeger, 1979)

There is no such thing as a true standard, but there is a theoretical cut-score that would be set by a judge if he or she totally understood the process, the test, the content, and the policy and had a true score on the test in mind as the standard. The question is whether the standard setting method can recover the theoretical cut-score, assuming a judge performed every task consistently and without error (Reckase, 2000).

Many different terms are used in the measurement literature to refer to performance standards: “passing scores”, “cut scores”, “cutoff scores”, “performance levels”, “achievement levels”, “mastery levels”, “proficiency levels”, “thresholds” and “standards” (Hambleton, 2001).

3
**The importance of standard-setting**

The cut-score:

- is crucial for all participants in testing;
- must be reasoned and fair.

It is necessary to use methods that make this possible with mathematical precision.

4
**Interpretation of mass-testing results**

Common solution: setting cut-scores and dividing examinees into groups according to their ability level.

- Test participants need to compare themselves with other examinees in order to estimate their level of mastery of the material correctly and adequately.
- Policy-makers are interested in the overall level of educational achievement, which could reflect the real situation in the schools and classes of a region.

5
**Why is it important to establish reasonable and fair cut-scores?**

1. Those who conduct testing bear professional and ethical responsibility for the results they provide.
2. The interpretation of the results should be understandable to any audience and should not provoke obvious disagreement.
3. The interpretation of the results should reflect the real situation and be informative for policy-makers.
4. The interpretation of the results should not be ambiguous: the examinees in one group should really differ in ability level from the examinees in another group.

6
**Classification of Standard-Setting Methods**

- Norm-referenced
- Criterion-referenced
  - Test-centered
  - Examinee-centered

7
**Test-centered Examinee-centered**

The most commonly used classification scheme today is the one suggested by Jaeger (1989), who splits the standard setting methods into two large groups.

Test-centered:

- Angoff
- Ebel
- Nedelsky
- Jaeger
- Objective Standard Setting
- Bookmark
- Etc.

Examinee-centered:

- Method of Contrasting Groups
- Method of Borderline Group
- Etc.

8
Test-centered method Angoff

9
**Angoff method: one of the most widely and frequently used methods**

Two variants: traditional and modified.

10
**Procedure of standard setting (traditional Angoff method)**

- Experts rate the probability that a barely or minimally satisfactory (qualified) person would answer each test item correctly.
- These probabilities are averaged across judges (raters) for each item, and the sum of the item averages over the whole test is the cutoff score.
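The calculation can be sketched as follows. This is a minimal illustration, assuming each judge supplies one probability per item and that the raw cutoff score is the sum over items of the judge-averaged probabilities. The function name `angoff_cut_score` and the ratings are hypothetical, not data from the presentation.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Return a raw cutoff score from Angoff ratings.

    ratings[j][i] is the probability (0..1) that judge j assigns to a
    minimally competent examinee answering item i correctly.
    """
    n_items = len(ratings[0])
    if any(len(judge) != n_items for judge in ratings):
        raise ValueError("every judge must rate every item")
    # Average each item's probability across judges, then sum over items.
    item_means = [mean(judge[i] for judge in ratings) for i in range(n_items)]
    return sum(item_means)


# Hypothetical ratings: 3 judges, 4 items.
ratings = [
    [0.6, 0.7, 0.5, 0.8],
    [0.5, 0.8, 0.4, 0.9],
    [0.7, 0.6, 0.6, 0.7],
]
cut = angoff_cut_score(ratings)
print(f"Angoff cutoff score: {cut:.2f} out of {len(ratings[0])} items")
```

Summing each judge's probabilities first and then averaging the judges' totals gives the same number, so the order of the two steps does not matter.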

11
**Advantages and disadvantages**

Advantages (+):

- Transparency and clarity
- Simplicity
- Flexibility

Disadvantages (-):

- One round of rating
- Variable values (fluctuating rated probabilities)

Open question (?):

- Objectivity of decisions about the probability of a correct answer by a minimally competent examinee

12
Test-centered method Ebel

13
**Procedure of Standard Setting**

2 rounds. Experts independently classify test items by:

- I. Level of difficulty: easy, medium, hard
- II. Level of relevance: essential, important, acceptable, questionable

14
For each judge, every item is then assigned to one cell of a 3 × 4 grid defined by the three difficulty categories and the four relevance categories. In the example below, A is the number of items in a category, B is the % of correctly performed items, and each cell lists A, B, A×B:

| Category | Expert №3 | Expert №4 | Expert №5 |
|---|---|---|---|
| Essential, easy | 11, 60, 660 | 10, 70, 700 | 13, 75, 975 |
| Essential, medium | 1, 25, 3 | | |
| Essential, hard | … | | |
| … | | | |
| Questionable, easy | | | |
| Questionable, medium | | | |
| Questionable, hard | | | |
| Mean | 25.1 | 26.7 | 35 |

Mean for all experts: 28

Cut-score: 12

15
**How to calculate a cut-score**

1. Each item is assigned to one of the 12 cells based on the expert’s ratings.
2. Judges indicate the percentage of items within each of the 12 cells that a student should answer correctly in order to be judged minimally competent.
3. The percent-passing judgment for a cell is multiplied by the number of items in that cell.
4. These products are summed over all 12 cells to get an overall passing score for a judge.
5. The passing scores are averaged over judges to get the composite passing score.
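The Ebel calculation can be sketched as follows. This is a minimal illustration under stated assumptions: each judge's grid is a dictionary keyed by (relevance, difficulty), and the summed products are divided by 100 so the result is expressed as a raw score (number of items). The function names and all numbers are hypothetical.

```python
def ebel_judge_score(cells):
    """Passing score for one judge.

    cells maps (relevance, difficulty) -> (n_items, percent_correct),
    where percent_correct is the judged % of items in the cell that a
    minimally competent student should answer correctly.
    """
    # Multiply item count by percent for each cell and sum over cells.
    total = sum(n_items * percent for n_items, percent in cells.values())
    # Assumption: divide by 100 to convert percent-items to a raw score.
    return total / 100.0


def ebel_cut_score(judges):
    """Composite passing score: the mean of the judges' passing scores."""
    scores = [ebel_judge_score(cells) for cells in judges]
    return sum(scores) / len(scores)


# Hypothetical grids for two judges (empty cells are simply omitted).
judge_a = {
    ("essential", "easy"): (10, 80),
    ("important", "medium"): (6, 50),
    ("acceptable", "hard"): (4, 25),
}
judge_b = {
    ("essential", "easy"): (10, 90),
    ("important", "medium"): (6, 50),
    ("acceptable", "hard"): (4, 50),
}
composite = ebel_cut_score([judge_a, judge_b])
print(f"Composite Ebel passing score: {composite:.1f} items")
```

Because each judge classifies the items independently, the item counts per cell may differ between judges; only the total number of items stays the same.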

16
**Advantages and disadvantages**

Advantages (+):

- Can be used with different types of items (not only multiple-choice).

Disadvantages (-):

- It may be challenging for standard setting participants to keep the two dimensions of difficulty and relevance distinct, because those dimensions may, in some situations, be highly correlated.
- There is a validity concern related to judgments about item relevance: the inclusion of items judged to be of questionable relevance appears on its face to weaken the validity evidence supporting a defensible interpretation of the total test scores.

17
Test-centered method Nedelsky

18
**General concept**

Nedelsky proposed considering the characteristics and performance of a hypothetical borderline examinee that he referred to as the “F-D student”.

Responses (distractors) which the lowest D-student should be able to reject as incorrect, and which therefore should be attractive to [failing students] are called F-responses… Students who possess just enough knowledge to eliminate F-responses and must choose among the remaining responses at random are called F-D students.

19
**Procedure of Standard Setting**

- The experts independently determine the F-responses, i.e. the options that minimally competent examinees would be able to eliminate as incorrect.
- The number of remaining options determines the probability that the candidate answers the question correctly: 1 plausible option = 100%, 2 = 50%, 3 = 33%, 4 = 25%, and 5 = 20% probability of a correct answer.

20
**An example**

Participants judged that, for a certain five-option item, borderline examinees would be expected to rule out two of the options as incorrect, leaving them to choose from the remaining three options. The Nedelsky rating for this item would be 1/3 = 0.33.

Repeating the judgment process for each item gives a number of Nedelsky values equal to the number of items in the test (n). The sum of the n values can be used directly as a raw cut score. For example, a 50-item test consisting entirely of items with Nedelsky ratings of 0.33 would yield a recommended passing score of 16.5 (i.e., 50 × 0.33 = 16.5).
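The worked example can be reproduced with a short sketch. It assumes each item is described by its number of options and the number of options a borderline examinee can eliminate; the function names are hypothetical. The code uses the unrounded value 1/3, so the 50-item total comes out at about 16.67 rather than the 16.5 obtained with the rounded rating 0.33.

```python
def nedelsky_rating(n_options, n_eliminated):
    """Probability of a correct answer when guessing at random among
    the options that remain after the F-responses are eliminated."""
    remaining = n_options - n_eliminated
    if remaining < 1:
        raise ValueError("the correct option can never be eliminated")
    return 1 / remaining


def nedelsky_cut_score(items):
    """Raw cut score: the sum of the Nedelsky ratings over all items.

    items is a list of (n_options, n_eliminated) pairs.
    """
    return sum(nedelsky_rating(n, e) for n, e in items)


# The example from the text: 50 five-option items, two options ruled out.
items = [(5, 2)] * 50
cut = nedelsky_cut_score(items)
print(f"Rating per item: {nedelsky_rating(5, 2):.2f}")
print(f"Recommended raw cut score: {cut:.2f}")
```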

21
**Advantages and disadvantages**

Advantages (+):

- The Nedelsky method has been used for many years to set passing thresholds. It has probably stayed popular because the procedure is clear to experts: they can decide quickly which responses a minimally competent examinee would be able to eliminate as incorrect.
- It can be used without pilot testing the test beforehand.

Disadvantages (-):

- It can be used only with multiple-choice items.
- Raters tend not to assign probabilities of 1.00 (i.e., to judge that a borderline examinee could rule out all incorrect response options). This tends to create a downward bias in item ratings (i.e., a rating of .50 is assigned to an item instead of 1.00), with the overall result being a somewhat lower passing score than the participants may have intended to recommend, and somewhat lower passing scores compared to other methods.

22
**Test-centered (based on Item-Response Theory)**

Bookmark

23
**Essential materials**

- Directions to Bookmark participants
- Ordered item booklet
- Booklet guideline
- Student exemplar papers
- Scoring guide

24
**Standard Setting: basic steps of the procedure**

- Round I: Experts work in small groups, and all the essential material is introduced to them. Experts are informed about the number of cut-scores they need to establish.
- Round II: Overview of the cut-scores established by every expert, then a repetition of the same procedure as in the first step.
- Round III: Presentation of the percentage of students falling into each performance level and of each median cut-score from Round 2. After discussion, individual judgments.

25
**Round 1**

The main goals are to familiarize panelists with the ordered item booklet, set initial bookmarks, and then discuss the placements. Panelists are asked to discuss and determine the content that students should master for placement into a given performance level. They express their independent judgments of the cut-scores by simply placing a bookmark between the items judged to represent a cut-point. One bookmark is placed for each of the required cut-points. Items preceding the participant's bookmark reflect content that all students at the given performance level are expected to know and be able to perform successfully with a probability of at least 0.67 or 0.50.

26
**Round 2**

The first activity in Round 2 involves having each member place bookmarks in his/her ordered item booklet where each of the other panelists in their small group made their bookmark placement. For a group of 6 people, each panelist’s ordered booklet will have 6 bookmarks for each cut point. Discussions are then focused on the items between the first and last bookmarks for each performance level. Upon completion of this discussion, the panelists then independently reset their bookmarks. The median of the Round 2 bookmarks for each cut point is taken as that group’s recommendation for that cut-point.
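The group recommendation from Round 2 can be sketched as follows. This is a minimal illustration, assuming each bookmark is recorded as a page number in the ordered item booklet; the function name and the placements are hypothetical. Converting a bookmark position into a scale score requires the item parameters of the IRT model in use and is not shown here.

```python
from statistics import median


def group_cut_points(bookmarks):
    """Median bookmark placement per cut point.

    bookmarks[p][c] is the booklet page where panelist p placed the
    bookmark for cut point c.
    """
    n_cuts = len(bookmarks[0])
    if any(len(panelist) != n_cuts for panelist in bookmarks):
        raise ValueError("every panelist must place every bookmark")
    return [median(p[c] for p in bookmarks) for c in range(n_cuts)]


# Hypothetical Round 2 placements: 6 panelists, 2 cut points.
round2 = [
    [12, 30],
    [14, 28],
    [15, 33],
    [11, 31],
    [14, 29],
    [16, 35],
]
recommendation = group_cut_points(round2)
print("Group recommendation (booklet pages):", recommendation)
```

With an even number of panelists the median falls between the two middle placements, so the recommendation can be a half page (here the second cut point).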

27
**Round 3**

The percentage of students falling into each performance level is presented, given each group’s median cut-score from Round 2. With this information on how students actually performed, the panelists discuss the bookmarks in the large group and then make their Round 3 independent judgments of where to place the bookmarks. The median for the large group is considered to be the final cut-point for a given performance level.
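The impact data shown to panelists in Round 3 can be sketched as follows. This is a minimal illustration, assuming that a student whose score equals a cut-score is placed in the higher level; the function name, the scores, and the cut-scores are hypothetical.

```python
from bisect import bisect_right


def percent_per_level(scores, cut_scores):
    """Percentage of students in each performance level.

    With k cut-scores there are k + 1 levels, ordered from lowest to
    highest. Assumption: a score equal to a cut-score counts as
    reaching the higher level.
    """
    cuts = sorted(cut_scores)
    counts = [0] * (len(cuts) + 1)
    for score in scores:
        # bisect_right returns how many cut-scores are <= score,
        # which is exactly the index of the student's level.
        counts[bisect_right(cuts, score)] += 1
    return [100 * count / len(scores) for count in counts]


# Hypothetical student scores and median cut-scores from Round 2.
scores = [5, 8, 12, 12, 15, 18, 20, 22, 25, 30]
cut_scores = [12, 20]
impact = percent_per_level(scores, cut_scores)
print("Percent of students per level:", impact)
```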

28
**Method of contrasting groups**

Examinee-centered Method of contrasting groups

29
**Method of contrasting groups**

- The procedure includes testing two groups of examinees: competent and non-competent.
- The test-score distributions of the examinees classified into each category are compared.
- The cut-score is set at the point where the two distributions intersect.
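One way to locate the cut-score numerically is sketched below. Note the substitution: instead of reading the intersection point off a plot of the two distributions, this sketch scans the observed scores and picks the cut that misclassifies the fewest examinees, which is a common stand-in for the intersection. The function name and the scores are hypothetical.

```python
def contrasting_groups_cut(competent, non_competent):
    """Return (cut_score, misclassified) for two groups of test scores.

    An examinee passes when score >= cut_score. Misclassified
    examinees are competent ones who fail plus non-competent ones who
    pass. Ties are resolved in favour of the lowest cut-score.
    """
    candidates = sorted(set(competent) | set(non_competent))
    best_cut, best_errors = None, None
    for cut in candidates:
        false_fail = sum(1 for s in competent if s < cut)
        false_pass = sum(1 for s in non_competent if s >= cut)
        errors = false_fail + false_pass
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut, best_errors


# Hypothetical scores for the two groups.
non_competent = [8, 10, 11, 12, 13, 14, 15, 17]
competent = [13, 15, 16, 17, 18, 19, 20, 22]
cut, errors = contrasting_groups_cut(competent, non_competent)
print(f"Cut-score: {cut}, misclassified examinees: {errors}")
```

Treating both kinds of misclassification as equally costly is a design choice; weighting false passes and false fails differently would shift the cut-score.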

31
**Advantages and disadvantages**

Advantages (+):

- Can be used with any item type.

Disadvantages (-):

- The objectivity of classifying students as competent or non-competent is open to doubt.

32
**Thank you for your attention**

Your questions?

