2. What is Standard Setting?
  It’s a judgmental process in which qualified experts (usually mostly or all teachers) determine “How much is enough?”
  It uses an established set of activities designed to lead the panelists through the process in a consistent and systematic way.
3. What is Standard Setting?
  So what does this mean exactly?
  Let’s look at an example.
  But first…
4. A Disclaimer
  Note that the hypothetical (i.e., fake) math test presented on the following slides has clearly never been seen by Measured Progress’s excellent content experts, editors, marketing folks, high-up decision-makers, etc., none of whom would ever let me get away with this for a variety of excellent reasons.
  However, for purposes of illustration…
5. For Example
  Let’s say we have a test, the Measured Progress Math Test, consisting of:
    40 multiple-choice (MC) items,
    6 one-point short-answer (SA) items, and
    1 four-point constructed-response (CR) item,
  for a total of 50 points.
6. Math Test, cont.
  We need to use scores on the Math Test to meet federal accountability requirements as defined by NCLB.
  There are lots of different types of test “scores”: raw scores (i.e., number right), scaled scores, θ scores (which you’ll learn about next week when Mike talks about Item Response Theory), etc.
  NCLB requires that we report the percentage of students who “meet standards.”
7. Math Test, cont.
  To do this, we need to establish cut points to define our four performance levels:
    Advanced (A)
      Cut point 3
    Proficient (P)
      Cut point 2
    Below Proficient (BP)
      Cut point 1
    Failing (F)
8. Math Test, cont.
  To help us with this task, we have general performance level descriptors that tell us what it means to be in each of the four performance levels (PLs), with apologies to one of our major contracts:
    A: Students demonstrate in-depth understanding and can solve complex problems
    P: Students demonstrate solid understanding and can solve routine problems
    BP: Students demonstrate partial understanding and can solve some simple problems
    F: Students demonstrate minimal understanding and cannot solve problems
9. Standard Setting for the Math Test
  The goal is to “operationalize” these general performance level descriptors:
    What does “in-depth understanding” or “solid understanding” or “partial understanding” mean?
    What distinguishes “complex,” “routine,” and “simple” problems?
    What specific skills correspond to each of these general performance level descriptors?
    And, finally, what does this translate into in terms of performance (i.e., scores) on the Math Test?
  All of these questions illustrate why standard setting is a judgmental process and why we need content experts (teachers) to set standards.
10. What is Standard Setting?
  So, standard setting is simply an established process that takes panelists through a series of systematic steps:
    Becoming very familiar with the test and what it measures
    Defining in specific terms what it means to be “Advanced,” “Proficient,” or “Below Proficient,” and coming to consensus about those definitions as a group
    “Operationalizing” those definitions by determining the test scores that indicate a student has demonstrated the necessary knowledge, skills, and abilities (KSAs) to be classified as (for example) Proficient
11. What is Standard Setting?
  There are a variety of methods that can be used to accomplish these goals, and a myriad of variations on these methods.
  So how do we decide which method to use?
12. Selecting a standard-setting method
  As mentioned on the previous slide, there are lots and lots of standard-setting methods out there:
    Bookmark (and Modified Bookmark)
    Angoff (and Modified Angoff)
    Body of Work
    Analytic Judgment
    Dominant Profile
    Contrasting Groups
    and so on…
13. Selecting a method, cont.
  Choosing among these options is a matter of one or more of the following:
    Previous history
    Policy directive or recommendation
    Appropriateness for item types, for example:
      Bookmark works well for tests that contain mostly MC items
      Body of Work works well for tests that include a lot of CR items
      etc.
  No method is right or wrong; the task is to select the most appropriate one for a given situation.
  Different methods yield different results.
14. Selecting a method, cont.
  For our test, the Math Test, we’re going to use the Bookmark Method, because the test consists primarily of MC items, but also includes a few SA items and 1 CR item.
  However, we also need to set standards for the Math Test-Alt. The Math Test-Alt consists entirely of polytomous items, so we’re going to use the Body of Work Method.
15. Bookmark vs. Body of Work
  Bookmark:
    Review: Best for tests with mostly __ items
    A test-centered method, i.e., rating decisions are based primarily on the test items rather than samples of student work
    Uses an Ordered Item Booklet (OIB) and an Item Map
    Panelists make their ratings by placing bookmarks in the OIB
16. Bookmark vs. Body of Work, cont.
  Ordered Item Booklet:
    Each page in the booklet is a single item (or a single score point for polytomous items)
    Items are presented in order from the easiest item on the test to the hardest
    Total number of ordered items = total possible raw score on the test
    Order is based on actual student performance (determined using IRT; come back next week)
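The way an ordered item booklet is assembled can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Item` class, the `build_oib` function, and all difficulty locations are hypothetical, and the IRT calibration that would actually produce the locations is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    name: str
    # One difficulty location per score point (hypothetical values that would
    # normally come from an IRT calibration). An MC or one-point SA item has
    # one location; a four-point CR item has four.
    locations: List[float]


def build_oib(items: List[Item]) -> List[Tuple[str, int, float]]:
    """Return OIB pages as (item name, score point, location), easiest first."""
    pages = [
        (item.name, point, loc)
        for item in items
        for point, loc in enumerate(item.locations, start=1)
    ]
    return sorted(pages, key=lambda page: page[2])


# Toy test: 2 MC items, 1 SA item, and one 4-point CR item (7 points total).
items = [
    Item("MC1", [-1.2]),
    Item("MC2", [0.3]),
    Item("SA1", [0.8]),
    Item("CR1", [-0.5, 0.1, 0.9, 1.7]),
]
oib = build_oib(items)
max_raw_score = sum(len(item.locations) for item in items)
```

Because every score point gets its own page, `len(oib)` equals `max_raw_score`. The same count for the Math Test gives 40 + 6 + 4 = 50 pages.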
17. Ordered Item #1 (easiest item on the test)
  2+2=
  4*52.222
18. Ordered Item #2* (second easiest item on the test)
  a) Won needs 54 tiles for a new kitchen floor. The tiles come in boxes of 8 tiles each. How many boxes of tiles will Won need to cover the floor?
  b) One box of tiles costs $
     Estimate the amount Won will have to spend.
  Explain how you got your answer.
  Score point 1: Student response contains minimal evidence of understanding of number sense and operations
  *With apologies to one of our major contracts
20. Ordered Item #XX (XXth easiest item on the test)
  a) Won needs 54 tiles for a new kitchen floor. The tiles come in boxes of 8 tiles each. How many boxes of tiles will Won need to cover the floor?
  b) One box of tiles costs $
     Estimate the amount Won will have to spend.
  Explain how you got your answer.
  Score point 2: Student response contains fair evidence of understanding of number sense and operations
22. Sample Item Map
  Columns: OI # | What KSAs does the student need to answer this question? | Why is this question more difficult than the previous one?
  Rows: 1, 2, 3, … 50 (one row per ordered item)
23. Bookmark vs. Body of Work, cont.
  Body of Work (BOW):
    Review: Best for tests with primarily __ items
    A student-centered method, i.e., rating decisions are based on intact samples of actual student work
    Sets of student work are presented in order from the lowest-scoring BOW to the highest-scoring
    Panelists make their ratings by sorting the BOWs into four piles (corresponding to the PLs)
25. Creating Performance Level Descriptors
  Sometimes (if not often), the performance level descriptors (PLDs) are not much more specific than the ones presented earlier for the Math Test.
  The less specific the PLDs are going into standard setting, the more work panelists must do to establish an understanding of the definitions for each PL.
26. Selecting Panelists
  Usually aim for 10 to 20 panelists per group; however…
    Panelists are expensive
    Panelists are sometimes hard to come by
  As a result, group sizes are often closer to 10 than 20, and are sometimes even smaller.
  Panels usually consist mostly of teachers, but can also include administrators, parents, business or community leaders, legislators, etc.
  Panelists should be chosen to be representative of all important stakeholder groups in terms of ethnicity, gender, geographic location, rural vs. urban area, etc.
27. Training Facilitators
  Standards for each grade/content combination are set by a separate panel, so it’s important that the group facilitators follow the process consistently.
  Prior to the meeting, a Facilitator’s Script is prepared and a training meeting is held to make sure facilitators have a common understanding of the process.
29. Orientation/Training
  The standard-setting meeting (regardless of the method being used) starts with an orientation session that is attended by all panelists.
  The session includes background information about the assessment as well as an overview of standard setting and the process the panelists will be going through.
  After the opening session, panelists break up into their grade/content area groups; each group is in a separate room.
30. Taking the Test/Reviewing Test Materials
  Once in their grade/content area groups, the panelists for the Math Test will start by taking the test.
  For the Math Test-Alt, there isn’t really a test in the same sense, so the panelists will review the test materials.
  This step ensures that panelists are very familiar with the test content and with what students who take the test experience.
31. Review PLDs
  Here is where the panelists determine what it means to be “Below Proficient,” “Proficient,” or “Advanced.”
  They review the PLDs that are provided to them, and they discuss the specific KSAs students must demonstrate in order to fall into each category.
  Often, they will create bulleted lists for each level that are then posted on chart paper for them to refer to during the rating process.
  It is critical that panelists come to consensus about these definitions.
32. Completing the Item Map (Bookmark)
  For the Math Test, the panelists will then review the ordered item booklet and fill in the item map.
  On the item map, for each ordered item, they will write the KSAs required to successfully complete that item and why it is more difficult than the one before (remember, the items are presented in order by difficulty).
  This will help them tie the items back to the PLDs they worked on in the previous step, which will help them when it comes time to place their bookmarks.
33. Rating Process - Bookmark
  For the Math Test, panelists start with the lowest cut and, working their way through the OIB, ask themselves “Would a student who’s just barely over the line into ‘Below Proficient’ have at least a 2/3 chance of getting this item right?”
  For OI #1, the answer will probably be yes (although not necessarily); as the items get harder, at some point, the answer will change to no. This is where the panelist places his/her bookmark.
34. Wait a minute – not so fast!
  What do you mean at some point the answer will change to no? Is it really that clear-cut?
    No, of course not. There will be gray areas.
  And what’s this whole “2/3 chance” business?
    I’m glad you asked that question. It reflects the probabilistic nature of the IRT model that’s used to estimate the difficulty of the items and order them in the OIB.
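To make the “2/3 chance” concrete, here is a minimal sketch under an assumed model. The slides do not say which IRT model is used, so the two-parameter logistic (2PL) form below, without the 1.7 scaling constant, is an assumption, and the function names are hypothetical. Under that model, setting the probability of a correct response to 2/3 and solving for ability gives θ = b + ln(2)/a.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response (assumed model, no 1.7 scaling)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def rp_location(a: float, b: float, rp: float = 2.0 / 3.0) -> float:
    """Ability at which the probability of a correct response equals rp."""
    return b + math.log(rp / (1.0 - rp)) / a


# Hypothetical item with discrimination a = 1.0 and difficulty b = 0.0.
theta_67 = rp_location(a=1.0, b=0.0)
```

A student located at `theta_67` has exactly a 2/3 chance of answering this item correctly under the assumed model, which is the kind of location that could be used to order pages in an OIB.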
35. Rating Process – Bookmark, cont.
  Once the panelists have placed the bookmark for the first cut (F vs. BP), they will repeat the process for the middle cut (BP vs. P) and, finally, the top cut (P vs. A).
36. Rating Process: Body of Work
  For the Math Test-Alt, the panelists start with the first BOW in the pile (the lowest-scoring BOW), compare the KSAs the student has demonstrated to the PLDs, and decide which PL matches that student’s performance best.
  For the first BOW, the answer will probably be F (but not necessarily).
  They will work their way through the entire set of BOWs and classify each one into one of the four piles.
37. Rating Process: Both Methods
  For both methods, ratings are done in three rounds (although that isn’t the case for all standard settings…).
38. Round 1
  Work is done individually, without any consultation with other panelists.
  Once the panelists have completed their Round 1 ratings, they fill in the Round 1 rating form.
  R&A analyzes the results and calculates the group average cut points.
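One plausible way to compute a group average cut point from Bookmark ratings is sketched below. Everything specific here is an assumption made for illustration: the page locations, the panelist IDs, the use of the last page answered “yes” as each panelist’s cut, and the use of a simple mean. The actual R&A procedure may differ.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical RP 2/3 locations for a 10-page ordered item booklet, easiest first.
oib_locations: List[float] = [-2.0, -1.5, -1.1, -0.6, -0.2, 0.1, 0.5, 0.9, 1.4, 2.0]

# Hypothetical Round 1 ratings: for each panelist, the ordered item number of
# the last page answered "yes" for the lowest cut (F vs. BP).
last_yes_page: Dict[str, int] = {"P01": 3, "P02": 4, "P03": 3, "P04": 5}


def group_average_cut(ratings: Dict[str, int], locations: List[float]) -> float:
    """Average the locations of the pages each panelist marked (1-indexed)."""
    return mean(locations[page - 1] for page in ratings.values())


cut_theta = group_average_cut(last_yes_page, oib_locations)
```

Averaging on the difficulty scale rather than averaging page numbers means that two panelists who differ by one page contribute a difference proportional to how far apart those pages are in difficulty.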
39. Round 2
  Panelists discuss the Round 1 results, including the average cut points, and share their rationale for how they did their ratings.
  Once the Round 2 discussions are complete, panelists fill in the Round 2 rating form.
  R&A again analyzes the results and calculates the group average cut points and impact data: the percentage of students who would fall into each of the PLs based on the Round 2 average cuts.
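Impact data are straightforward to compute once cut points exist. The sketch below assumes the cuts are expressed as minimum raw scores and that a score equal to a cut falls in the higher level; both conventions, along with the scores and cut values, are hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from typing import Dict, List

LEVELS = ["F", "BP", "P", "A"]


def classify(score: int, cuts: List[int]) -> str:
    """Map a raw score to a performance level.

    cuts holds the minimum raw score for BP, P, and A, in ascending order
    (an assumed convention: a score equal to a cut falls in the higher level).
    """
    return LEVELS[bisect_right(cuts, score)]


def impact(scores: List[int], cuts: List[int]) -> Dict[str, float]:
    """Percentage of students falling into each performance level."""
    counts = Counter(classify(s, cuts) for s in scores)
    return {level: 100.0 * counts[level] / len(scores) for level in LEVELS}


# Hypothetical raw scores on the 50-point Math Test and hypothetical cuts.
scores = [12, 18, 22, 25, 29, 31, 35, 38, 41, 47]
cuts = [20, 30, 42]
impact_data = impact(scores, cuts)
```

With these made-up numbers, 20% of students fall in F, 30% in BP, 40% in P, and 10% in A, so 50% would “meet standards” if that is defined as Proficient or above.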
40. Round 3
  Round 3 is very similar to Round 2, except that now the panelists have the impact data to consider as part of their discussions.
  Panelists are cautioned against basing their decisions solely on the impact data.
  Once the Round 3 discussions are complete, panelists fill in the Round 3 rating form.
41. Math Test Round 1 Rating Form
  ID _____________
  Ordered Item Numbers, by performance level:
    F: First ___ Last ___
    BP: First ___ Last ___
    P: First ___ Last ___
    A: First ___ Last ___
42. Math Test-Alt Round 1 Rating Form
  Columns: BOW | F | BP | P | A
  Rows: 1, 2, 3, 4, etc. (one row per BOW)
43. Some Technical Arcana
  Group average cut points for the Bookmark Method are determined using the IRT-based difficulty values. (Remember those? That’s what we used to order the items.)
  Group average cut points for the BOW Method are calculated using logistic regression.
  Impact data are based on actual student performance on the test the last time it was administered.
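The slides say only that logistic regression is used for the BOW cut points, so the following is a sketch of one common setup rather than the actual R&A procedure. It assumes the predictor is the BOW total score, the outcome is whether a panelist placed the BOW at or above a given level, and the cut is the score at which the fitted probability equals 0.5. The data and the `fit_logistic` helper are hypothetical.

```python
import math
from typing import List, Tuple


def fit_logistic(x: List[float], y: List[int], iters: int = 25) -> Tuple[float, float]:
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient with respect to b0
            g1 += (yi - p) * xi     # gradient with respect to b1
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical BOW total scores and whether each was rated BP or higher (1) or F (0).
bow_scores = [4, 6, 8, 10, 12, 14, 16, 18]
at_or_above_bp = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(bow_scores, at_or_above_bp)
cut_score = -b0 / b1  # score at which the fitted probability is 0.5
```

The toy ratings overlap (a BOW scoring 10 was rated BP while one scoring 12 was rated F), which is what keeps the fit well defined; perfectly separated ratings would need a different treatment.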
44. Evaluation
  After completing Round 3 of the ratings, panelists are asked to complete an evaluation of the standard-setting process.
46. Calculating Final Cut Points
  R&A calculates the Round 3 cut points and presents the results to the client for approval.
  Sometimes, R&A may recommend adjustments to the cut points.
    This happens most commonly when standards are set in multiple grades and it is undesirable to have standards that vary substantially across grade levels.
    In this case, we might smooth the results.
  Panelists are told at the beginning of the process that their cuts will be recommendations, and the final results may differ somewhat.
47. Analyzing the Results of the Evaluation
  The results of the panelists’ evaluation forms are compiled and reviewed.
  On rare occasions, this review may identify an individual panelist whose ratings should be excluded from the results.
48. Standard-Setting Report
  The final step in the standard-setting process is to write up the process used and the results in a standard-setting report.