Standard Setting

**What is Standard Setting?**

It’s a judgmental process in which qualified experts (usually mostly or all teachers) determine “How much is enough?” It uses an established set of activities designed to lead the panelists through the process in a consistent and systematic way.

So what does this mean exactly? Let’s look at an example. But first…

**A Disclaimer**

Note that the hypothetical (i.e., fake) math test presented on the following slides has clearly never been seen by Measured Progress’s excellent content experts, editors, marketing folks, high-up decision-makers, etc., none of whom would ever let me get away with this for a variety of excellent reasons. However, for purposes of illustration…

**For Example**

Let’s say we have a test, the Measured Progress Math Test, consisting of:

- 40 multiple-choice (MC) items,
- 6 one-point short-answer (SA) items, and
- 1 four-point constructed-response (CR) item,

for a total of 50 points.
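The point total can be checked with a few lines of Python. This is a minimal sketch; the `blueprint` structure is a made-up name, and it assumes each MC item is worth one point, which the 50-point total implies but the slide does not state outright.

```python
# Hypothetical blueprint for the Math Test:
# (item type, number of items, points per item).
# Assumption: each MC item is worth 1 point.
blueprint = [
    ("MC", 40, 1),
    ("SA", 6, 1),
    ("CR", 1, 4),
]

total_items = sum(count for _, count, _ in blueprint)
total_points = sum(count * points for _, count, points in blueprint)

print(total_items, total_points)
```

Note that the test has 47 items but 50 possible points, because the single CR item contributes four score points.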

**Math Test, cont.**

- We need to use scores on the Math Test to meet federal accountability requirements as defined by NCLB.
- There are lots of different types of test “scores”: raw scores (i.e., number right), scaled scores, θ scores (which you’ll learn about next week when Mike talks about Item Response Theory), etc.
- NCLB requires that we report the percentage of students who “meet standards.”

**Math Test, cont.**

To do this, we need to establish cut points to define our four performance levels (PLs):

- Advanced (A)
- Proficient (P)
- Below Proficient (BP)
- Failing (F)

Three cut points separate the levels: cut point 3 (P vs. A), cut point 2 (BP vs. P), and cut point 1 (F vs. BP).

**Math Test, cont.**

To help us with this task, we have general performance level descriptors (PLDs) that tell us what it means to be in each of the four PLs (with apologies to one of our major contracts):

- A: Students demonstrate in-depth understanding and can solve complex problems
- P: Students demonstrate solid understanding and can solve routine problems
- BP: Students demonstrate partial understanding and can solve some simple problems
- F: Students demonstrate minimal understanding and cannot solve problems
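Once cut points exist, turning a score into a performance level is mechanical. The sketch below assumes raw-score cuts with made-up values (20, 32, 44) and the convention that a score at or above a cut falls into the higher level; neither the values nor the convention comes from the slides.

```python
from bisect import bisect_right

LEVELS = ["F", "BP", "P", "A"]

# Hypothetical raw-score cut points: the minimum score needed
# to be classified as BP, P, and A, respectively.
CUTS = [20, 32, 44]

def performance_level(raw_score, cuts=CUTS):
    """Return the performance level for a raw score on the 50-point test."""
    # bisect_right counts how many cuts the score meets or exceeds.
    return LEVELS[bisect_right(cuts, raw_score)]

for score in (19, 20, 31, 32, 44, 50):
    print(score, performance_level(score))
```

Everything that follows in standard setting is about choosing the three numbers in `CUTS` defensibly.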

**Standard Setting for the Math Test**

The goal is to “operationalize” these general performance level descriptors:

- What does “in-depth understanding” or “solid understanding” or “partial understanding” mean?
- What distinguishes “complex,” “routine,” and “simple” problems?
- What specific skills correspond to each of these general performance level descriptors?
- And, finally, what does this translate into in terms of performance (i.e., scores) on the Math Test?

All of these questions illustrate why standard setting is a judgmental process and why we need content experts (teachers) to set standards.

So, standard setting is simply an established process that takes panelists through a series of systematic steps:

- Becoming very familiar with the test and what it measures
- Defining in specific terms what it means to be “Advanced,” “Proficient,” or “Below Proficient,” and coming to consensus about those definitions as a group
- “Operationalizing” those definitions by determining the test scores that indicate a student has demonstrated the necessary knowledge, skills, and abilities (KSAs) to be classified as (for example) Proficient

There are a variety of methods that can be used to accomplish these goals, and a myriad of variations on these methods. So how do we decide which method to use?

**Selecting a standard-setting method**

As mentioned on the previous slide, there are lots and lots of standard-setting methods out there:

- Bookmark (and Modified Bookmark)
- Angoff (and Modified Angoff)
- Body of Work
- Analytic Judgment
- Dominant Profile
- Contrasting Groups
- and so on…

Choosing among these options is a matter of one or more of the following:

- Previous history
- Policy directive or recommendation
- Appropriateness for item types, for example:
  - Bookmark works well for tests that contain mostly MC items
  - Body of Work works well for tests that include a lot of CR items
  - etc.

No method is right or wrong; the task is to select the most appropriate one for a given situation. Different methods yield different results.

For our test, the Math Test, we’re going to use the Bookmark Method, because the test consists primarily of MC items (along with a few SA items and 1 CR item). However, we also need to set standards for the Math Test-Alt. The Math Test-Alt consists entirely of polytomous items, so we’re going to use the Body of Work Method.

Bookmark review:

- Best for tests with mostly __ items
- A test-centered method, i.e., rating decisions are based primarily on the test items rather than samples of student work
- Uses an Ordered Item Booklet (OIB) and an Item Map
- Panelists make their ratings by placing bookmarks in the OIB

**Bookmark vs. Body of Work, cont.**

Ordered Item Booklet:

- Each page in the booklet is a single item (or a single score point for polytomous items)
- Items are presented in order from the easiest item on the test to the hardest
- Total number of ordered items = total possible raw score on the test
- Order is based on actual student performance (determined using IRT – come back next week)
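The way an OIB is assembled can be sketched as follows. The item labels and difficulty locations are invented for illustration (real locations would come from the IRT calibration), and `build_oib` is a hypothetical helper name. The example uses a tiny four-item test so the ordering is easy to follow.

```python
# Each entry is one OIB page: (item label, score point, difficulty location).
# A one-point item contributes one page; the four-point CR item contributes
# four pages, one per score point. All values are made up.
pages = [
    ("MC-07", 1, 0.35),
    ("CR-01", 1, -1.10),
    ("MC-01", 1, -2.40),
    ("CR-01", 2, 0.90),
    ("SA-03", 1, -0.20),
    ("CR-01", 3, 1.60),
    ("CR-01", 4, 2.30),
]

# Maximum points per item on this toy test.
max_points = {"MC-01": 1, "MC-07": 1, "SA-03": 1, "CR-01": 4}

def build_oib(pages):
    """Sort pages from easiest to hardest and number them 1..N."""
    ordered = sorted(pages, key=lambda page: page[2])
    return [(number, item, point, location)
            for number, (item, point, location) in enumerate(ordered, start=1)]

oib = build_oib(pages)
for page in oib:
    print(page)

# The number of ordered items equals the total possible raw score.
print(len(oib) == sum(max_points.values()))
```

Because each score point of a polytomous item gets its own page, the pages for a single CR item can be scattered throughout the booklet.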

**Ordered Item #1 (easiest item on the test)**

2 + 2 =

- 4*
- 5
- 2.2
- 22

**Ordered Item #2* (second easiest item on the test)**

a) Won needs 54 tiles for a new kitchen floor. The tiles come in boxes of 8 tiles each. How many boxes of tiles will Won need to cover the floor?

b) One box of tiles costs $ Estimate the amount Won will have to spend. Explain how you got your answer.

Score point 1: Student response contains minimal evidence of understanding of number sense and operations

*With apologies to one of our major contracts

Ordered Item #2 Example of a 1-point response:

**Ordered Item #XX (XXth easiest item on the test)**

a) Won needs 54 tiles for a new kitchen floor. The tiles come in boxes of 8 tiles each. How many boxes of tiles will Won need to cover the floor?

b) One box of tiles costs $ Estimate the amount Won will have to spend. Explain how you got your answer.

Score point 2: Student response contains fair evidence of understanding of number sense and operations

Ordered Item #XX Example of a 2-point response:

**Sample Item Map**

| OI # | What KSAs does the student need to answer this question? | Why is this question more difficult than the previous one? |
|---|---|---|
| 1 | | |
| 2 | | |
| 3 | | |
| … | | |
| 50 | | |

Body of Work review:

- Best for tests with primarily __ items
- A student-centered method, i.e., rating decisions are based on intact samples of actual student work
- Sets of student work are presented in order from the lowest-scoring body of work (BOW) to the highest-scoring
- Panelists make their ratings by sorting the BOWs into four piles (corresponding to the PLs)

**Standard-Setting Process**

Prior to the Meeting

**Creating Performance Level Descriptors**

Sometimes (if not often), the PLDs are not much more specific than the ones presented earlier for the Math Test. The less specificity they have going into standard setting, the more work panelists must do to establish an understanding of the definitions for each PL.

**Selecting Panelists**

- Usually aim for 10 to 20 panelists per group, however…
  - Panelists are expensive
  - Panelists are sometimes hard to come by
- As a result, group sizes are often closer to 10 than 20, and are sometimes even smaller
- Panels usually consist mostly of teachers, but can also include administrators, parents, business or community leaders, legislators, etc.
- Panelists should be chosen to be representative of all important stakeholder groups in terms of ethnicity, gender, geographic location, rural vs. urban area, etc.

**Training Facilitators**

Standards for each grade/content combination are set by a separate panel, so it’s important that the group facilitators follow the process consistently. Prior to the meeting, a Facilitator’s Script is prepared and a training meeting is held to make sure facilitators have a common understanding of the process.

During the Meeting

**Orientation/Training**

The standard-setting meeting (regardless of the method being used) starts with an orientation session that is attended by all panelists. The session includes background information about the assessment as well as an overview of standard setting and the process the panelists will be going through. After the opening session, panelists break up into their grade/content area groups; each group is in a separate room.

**Taking the Test/Reviewing Test Materials**

Once in their grade/content area groups, the panelists for the Math Test will start by taking the test. For the Math Test-Alt, there isn’t really a test in the same sense, so the panelists will review the test materials. This step ensures that panelists are very familiar with the test content and with what students who take the test experience.

**Review PLDs**

Here is where the panelists determine what it means to be “Below Proficient,” “Proficient,” or “Advanced.” They review the PLDs that are provided to them, and they discuss the specific KSAs students must demonstrate in order to fall into each category. Often, they will create bulleted lists for each level that are then posted on chart paper for them to refer to as they do the rating process. It is critical that panelists come to consensus about these definitions.

**Completing the Item Map (Bookmark)**

For the Math Test, the panelists will then review the ordered item booklet and fill in the item map. On the item map, for each ordered item, they will write the KSAs required to successfully complete that item and why it is more difficult than the one before (remember, the items are presented in order by difficulty). This will help them tie the items back to the PLDs they worked on in the previous step, which will help them when it comes time to place their bookmarks.

For the Math Test, panelists start with the lowest cut and, working their way through the OIB, ask themselves “Would a student who’s just barely over the line into ‘Below Proficient’ have at least a 2/3 chance of getting this item right?” For OI #1, the answer will probably be yes (although not necessarily); as the items get harder, at some point, the answer will change to no. This is where the panelist places his/her bookmark.

What do you mean, at some point the answer will change to no? Is it really that clear-cut?

No, of course not. There will be gray areas.

And what’s this whole “2/3 chance” business?

I’m glad you asked that question. It reflects the probabilistic nature of the IRT model that’s used to estimate the difficulty of the items and order them in the OIB.
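To make the 2/3 criterion concrete, here is a small sketch that assumes a two-parameter logistic (2PL) model for a one-point item. The slides do not say which IRT model is actually used, so the model, the function names, and the parameter names `a` (discrimination) and `b` (difficulty) are all illustrative assumptions.

```python
import math

def p_correct(theta, a, b):
    """2PL probability that a student at ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def rp_location(a, b, rp=2 / 3):
    """Ability at which the probability of a correct answer equals rp.

    Solving rp = 1 / (1 + exp(-a * (theta - b))) for theta gives
    theta = b + ln(rp / (1 - rp)) / a.
    """
    return b + math.log(rp / (1 - rp)) / a

# Hypothetical item: discrimination 1.2, difficulty 0.5.
theta_67 = rp_location(a=1.2, b=0.5)
print(theta_67, p_correct(theta_67, a=1.2, b=0.5))
```

Under this assumed model, with `a = 1` the 2/3 location sits ln 2 (about 0.69) above the item's difficulty parameter. A student located exactly at an item's 2/3 point is the student for whom the panelist's answer flips from yes to no.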

Once the panelists have placed the bookmark for the first cut (F vs. BP), they will repeat the process for the middle cut (BP vs. P) and, finally, the top cut (P vs. A).

For the Math Test-Alt, the panelists start with the first BOW in the pile (the lowest-scoring BOW), compare the KSAs the student has demonstrated to the PLDs, and decide which PL matches that student’s performance best. For the first BOW, the answer will probably be F (but not necessarily). They will work their way through the entire set of BOWs and classify each one into one of the four piles.

For both methods, ratings are done in three rounds (although that isn’t the case for all standard settings…).

**Round 1**

- Work is done individually, without any consultation with other panelists
- Once the panelists have completed their Round 1 ratings, they fill in the Round 1 rating form
- R&A analyzes the results and calculates the group average cut points

**Round 2**

- Panelists discuss the Round 1 results, including the average cut points, and share their rationale for how they did their ratings
- Once the Round 2 discussions are complete, panelists fill in the Round 2 rating form
- R&A again analyzes the results and calculates the group average cut points and impact data: the percentage of students who would fall into each of the PLs based on the Round 2 average cuts

**Round 3**

- Round 3 is very similar to Round 2, except that now the panelists have the impact data to consider as part of their discussions
- Panelists are cautioned against basing their decisions solely on the impact data
- Once the Round 3 discussions are complete, panelists fill in the Round 3 rating form
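The impact data that panelists see going into Round 3 can be sketched as a simple tabulation. The student scores and the cut values below are invented, the cuts are assumed to be on the raw-score scale, and `impact_data` is a hypothetical helper name.

```python
from bisect import bisect_right
from collections import Counter

LEVELS = ["F", "BP", "P", "A"]

def impact_data(raw_scores, cuts):
    """Percentage of students in each performance level for a set of cuts.

    cuts holds the minimum raw score for BP, P, and A, in that order.
    """
    counts = Counter(LEVELS[bisect_right(cuts, score)] for score in raw_scores)
    return {level: 100.0 * counts[level] / len(raw_scores) for level in LEVELS}

# Made-up scores from the last administration and made-up average cuts.
scores = [12, 18, 22, 25, 29, 33, 35, 38, 41, 47]
print(impact_data(scores, cuts=[20, 32, 44]))
# {'F': 20.0, 'BP': 30.0, 'P': 40.0, 'A': 10.0}
```

Rerunning the same tabulation with slightly different cuts shows panelists how sensitive the percentages are to their bookmark placements, which is one reason they are cautioned not to work backward from the percentages.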

ID _____________

| PL | Ordered Item Numbers: First | Ordered Item Numbers: Last |
|---|---|---|
| F | ___ | ___ |
| BP | ___ | ___ |
| P | ___ | ___ |
| A | ___ | ___ |

**Math Test-Alt Round 1 Rating Form**

| BOW | F | BP | P | A |
|---|---|---|---|---|
| 1 | | | | |
| 2 | | | | |
| 3 | | | | |
| 4 | | | | |
| etc. | | | | |

**Some Technical Arcana**

- Group average cut points for the Bookmark Method are determined using the IRT-based difficulty values. (Remember those? That’s what we used to order the items.)
- Group average cut points for the BOW Method are calculated using logistic regression
- Impact data are based on actual student performance on the test the last time it was administered
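Both calculations can be sketched in a few lines, with heavy caveats. For Bookmark, the sketch assumes each panelist's cut is the difficulty location of the bookmarked page and that the group cut is the mean of those locations. For BOW, it assumes one common approach: fit a logistic regression of "rated at or above the level" on the BOW's total score, and take the score where the fitted probability is 0.5. The slides confirm only that difficulty values and logistic regression are used; the function names, data, and exact conventions here are illustrative.

```python
import math
from statistics import mean

def bookmark_group_cut(bookmarks, locations):
    """Mean difficulty location of the pages the panelists bookmarked."""
    return mean(locations[page] for page in bookmarks)

def bow_cut(scores, at_or_above, iterations=25):
    """Score at which a fitted logistic curve crosses probability 0.5.

    Fits P(at or above) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson,
    where x is the BOW score centered on the mean score.
    """
    center = mean(scores)
    xs = [s - center for s in scores]  # centering keeps the fit stable
    b0 = b1 = 0.0
    for _ in range(iterations):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Gradient of the log-likelihood.
        g0 = sum(y - p for y, p in zip(at_or_above, ps))
        g1 = sum(x * (y - p) for x, y, p in zip(xs, at_or_above, ps))
        # Entries of the 2x2 information matrix.
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return center - b0 / b1

# Made-up difficulty locations for a 10-page OIB and five panelists' bookmarks.
locations = {1: -2.4, 2: -1.8, 3: -1.1, 4: -0.6, 5: -0.2,
             6: 0.3, 7: 0.9, 8: 1.4, 9: 1.9, 10: 2.5}
print(bookmark_group_cut([3, 4, 4, 5, 4], locations))

# Made-up BOW scores and whether each BOW was rated Proficient or above.
bow_scores = [10, 12, 14, 16, 18, 20, 22, 24]
rated_p_or_above = [0, 0, 0, 1, 0, 1, 1, 1]
print(bow_cut(bow_scores, rated_p_or_above))
```

The BOW ratings in the example deliberately overlap (a 16 rated Proficient, an 18 rated below), since logistic regression needs some disagreement near the cut to produce a finite estimate. In practice, the ratings from all panelists would be pooled before fitting.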

After the Meeting

- R&A calculates the Round 3 cut points and presents the results to the client for approval
- Sometimes, R&A may recommend adjustments to the cut points
  - This happens most commonly when standards are set in multiple grades and it is undesirable to have standards that vary substantially across grade levels
  - In this case, we might smooth the results
- Panelists are told at the beginning of the process that their cuts will be recommendations, and the final results may differ somewhat

The results of the panelists’ evaluation forms are compiled and reviewed. On rare occasions, this review may identify an individual panelist whose ratings should be excluded from the results.

The final step in the standard-setting process is to write up the process used and the results in a standard-setting report.

