# Interpretation: How to Use Psychometrics. A Different Format Previous talks were generally about one topic Today’s presentation: Where does this stuff.


Interpretation: How to Use Psychometrics

A Different Format Previous talks were generally about one topic Today’s presentation: Where does this stuff come up at MP, outside of the psychos? A little bit of info on several different things

The goals Understand various psychometric analyses as they arise in day-to-day work See which stats are used in different applications Answer questions

Topics Covered Things you’d find in a key verification file –Classical stats (p-values, point-biserials) Things you’d find at a form pulling –IRT stats (TCC’s, TIF’s) Things you’d find in a technical manual –All sorts of info A question you’d hear at a standard setting –IRT

1. Key Verification Files Purpose: To check the correctness of answer keys (MC items) A list of items whose stats are unusual or merit further investigation Items identified based on their p-values and/or point-biserials

P-value: The proportion of students answering an item correctly –“How easy is the item?” Point-biserial: The correlation between item score and total score –“If you do well on the item, do you tend to do well on the test?”
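Both statistics can be computed directly from a scored response matrix. The following is a minimal sketch, assuming 0/1 item scores and following the slide's definition literally (the total score includes the item itself); the function name and the data are made up for illustration.

```python
import numpy as np

def item_stats(scores):
    """Return (p_values, point_biserials) for a students-by-items 0/1 matrix."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)          # each student's total score
    p_values = scores.mean(axis=0)      # proportion correct per item
    # Correlation of each item column with the total score (item included in total)
    point_biserials = np.array(
        [np.corrcoef(scores[:, j], total)[0, 1] for j in range(scores.shape[1])]
    )
    return p_values, point_biserials

# Made-up responses: 6 students (rows), 3 items (columns)
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
p_values, point_biserials = item_stats(scores)
print("p-values:", p_values)
print("point-biserials:", point_biserials)
```

An item that every student answers the same way has no variance, so its correlation is undefined; this sketch does not handle that case.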

When might we be alarmed? Not many kids are picking the right answer –The p-value is low (less than .25) Low-performing kids are doing better on the item than high-performing kids –The point-biserial is low (less than .15) And/or an incorrect answer choice has strange stats

Distractor Stats Distractor p-value: The proportion of students picking the distractor (say, choice C when the correct answer is B) –“How popular is choice C?” –Flag item if distractor p-value is higher than .3 Distractor point-biserial: The correlation between picking the distractor and total test score –“If you picked C, how well did you tend to do on the test?” –Flag item if distractor PBS is positive
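The flagging rules from the last two slides can be sketched for a single MC item as follows. The thresholds (.25, .15, .3, positive distractor PBS) come from the slides; the function names and the data are hypothetical, and the sketch assumes every option is picked by at least one student but not by all of them.

```python
import numpy as np

def option_stats(choices, totals):
    """Return {option: (p_value, point_biserial)} for one MC item."""
    choices = np.asarray(choices)
    totals = np.asarray(totals, dtype=float)
    stats = {}
    for option in sorted(set(choices.tolist())):
        picked = (choices == option).astype(float)   # 1 if the student picked it
        stats[option] = (picked.mean(), np.corrcoef(picked, totals)[0, 1])
    return stats

def flag_reasons(stats, key):
    """Apply the key-verification rules; return a list of reasons to review."""
    reasons = []
    for option, (p, pbs) in stats.items():
        if option == key:
            if p < 0.25:
                reasons.append("low p-value")
            if pbs < 0.15:
                reasons.append("low point-biserial")
        else:
            if p > 0.30:
                reasons.append(f"popular distractor {option}")
            if pbs > 0:
                reasons.append(f"positive point-biserial for distractor {option}")
    return reasons

# Made-up data: 10 students, key = D, but the strongest students pick C
choices = ["C", "C", "C", "C", "C", "C", "D", "A", "B", "A"]
totals = [30, 28, 27, 25, 24, 22, 20, 12, 10, 9]
stats = option_stats(choices, totals)
reasons = flag_reasons(stats, key="D")
print(reasons)
```

A flagged item is only a candidate for review, not evidence that the key is wrong.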

An Operational Example A recent item had the following stats: –Key = D –P-value = 0.10 –Point-biserial = -0.02 –P-value for “C” = 0.60 –Point-biserial for “C” = 0.20 So the key was wrong? Nope

How Can That Happen? An example: What is the definition of the word travesty? A: Mockery B: Injustice C: Bellybutton D: Some even stupider answer than “bellybutton” Actual definition: “Any grotesque or debased likeness or imitation” The correct answer is “A”, but “travesty of justice” threw off the high-performing students

To sum up… Psychometrics can help us identify items whose keys need to be checked Stats used: –P-values –Point-biserials –Distractor p-values and point-biserials P-values & point-biserials should be relatively high, distractor values should be relatively low The key usually turns out to be right, but that’s OK

2. Form Pulling Context: We are choosing items for next year’s exam Clients like to look at psychometric info when picking items (e.g., MCAS) We know the stats ahead of time because items were field-tested Relevant stats: Test Characteristic Curves (TCC’s), raw score cut points, Test Information Functions (TIF’s)

This stuff relates to Item Response Theory (IRT) TCC is a plot that tells you the expected raw score for each value of ability (denoted theta) As ability increases, expected raw score increases

Example of a TCC: 5 Items

Raw Score Cut Points Suppose test has 4 performance levels: Below Basic, Basic, Proficient, Advanced How many points do you need in order to reach the Basic level? Proficient? Advanced? Example: Test goes from 0 to 72. Need 35 to reach Basic; 51 to reach Proficient; 63 to reach Advanced Standard Setting often tells us theta cut points; clients want to know raw score cuts
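The example cuts translate directly into a lookup. A minimal sketch, assuming a raw score at or above a cut reaches that level (as in "need 35 to reach Basic"); the function name is illustrative.

```python
from bisect import bisect_right

LEVELS = ["Below Basic", "Basic", "Proficient", "Advanced"]
RAW_CUTS = [35, 51, 63]  # minimum raw scores for Basic, Proficient, Advanced

def performance_level(raw_score):
    """Return the performance level for a raw score on the 0-72 example test."""
    # bisect_right counts how many cuts are <= raw_score,
    # which is the number of levels the student has reached.
    return LEVELS[bisect_right(RAW_CUTS, raw_score)]

print(performance_level(34))  # Below Basic
print(performance_level(35))  # Basic
print(performance_level(63))  # Advanced
```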

Using the TCC to find a cut point Suppose theta cut is 0.4 Find expected raw score at 0.4 using the TCC. It is 3.3 Cut is placed between 3 and 4
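The same lookup can be sketched in code. This assumes a two-parameter logistic (2PL) model, which the slides do not specify, and uses made-up item parameters, so the expected score will not reproduce the 3.3 on the slide.

```python
import math

def prob_correct(theta, a, b):
    """2PL probability of a correct answer (a = discrimination, b = difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Expected raw score at theta: the sum of the item probabilities."""
    return sum(prob_correct(theta, a, b) for a, b in items)

# Hypothetical (a, b) parameters for a 5-item test of 1-point items
items = [(1.0, -1.5), (1.2, -0.5), (0.8, 0.0), (1.0, 0.5), (1.5, 1.0)]

theta_cut = 0.4
expected = tcc(theta_cut, items)
# The raw-score cut falls between the two integers that bracket the expected score
lower, upper = math.floor(expected), math.ceil(expected)
print(f"expected raw score at theta={theta_cut}: {expected:.2f}")
print(f"cut is placed between {lower} and {upper}")
```

Because each item probability increases with theta, the summed curve also increases, which is why a theta cut maps to a single raw-score cut.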

Test Information Functions TIF’s tell us the test precision at each level of ability The higher the curve, the more precision Easy items give us precision for low values of theta. Similarly: –Hard items give precision at high values –Medium items give precision at medium values
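A short sketch of why easy items give precision at low theta. It assumes a 2PL model, where an item's information is a² · P · (1 − P) and peaks at theta = b; the model choice and the item parameters are assumptions, not taken from the slides.

```python
import math

def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def tif(theta, items):
    """Test information is the sum of the item informations."""
    return sum(item_information(theta, a, b) for a, b in items)

# Hypothetical (a, b) parameters
easy_items = [(1.0, -1.5), (1.2, -1.0)]
hard_items = [(1.0, 1.5), (1.2, 1.0)]

for theta in (-1.5, 0.0, 1.5):
    print(theta, round(tif(theta, easy_items), 3), round(tif(theta, hard_items), 3))
```

In this model, information also grows with the discrimination parameter, which is closely related to the point-biserial, so swapping in items with higher or lower discrimination raises or lowers the curve near those items' difficulty.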

Example of a TIF

Why does the client care? It is often desired that next year’s forms are similar to this year’s forms Make sure tests are correct difficulty (TCC, RS cut points) & precision (TIF) Match TCC’s, cut points, TIF’s of the two years

Why should the forms be similar? Theoretically, we should be able to account for differences through equating (Liz) However, want the student experience to be similar from year to year Don’t want to give easy test to Class of ’07, hard test to Class of ’08 Don’t want to make this year’s test less precise than last year’s

Example: 2007 MCAS, Grade 10 Math Proposed 2007 TCC was lower than last year’s Solution: Replace some hard items with easy items

Example, Continued Proposed 2007 TIF had less info at low abilities, more info at high abilities Solution: –Replace some hard items with easy items –Use hard items with lower PBS, easy items with higher PBS

Example, Continued Proposed 2007 raw score cuts lower than 2006 raw score cuts Solution: Replace some hard items with easy items

Raw score cuts:

| Old | Proposed |
| --- | --- |
| 20 | 15 |
| 33 | 28 |
| 45 | 41 |

To sum up… Item Response Theory is useful in form pulling TCC’s, raw score cuts, TIF’s are often examined –Proposed values should be similar to current year’s –Tests shouldn’t be too easy or hard –Tests should be informative but not too informative It’s helpful to know how we can change these things based on item stats

3. Technical Manuals Things in Technical Manuals vary from program to program Often see some of the following: –P-values and point-biserials (thanks Louis!) –Test reliabilities (thanks Louis!) –TCC’s and TIF’s (thanks Mike!) –DIF (thanks Won!) –Standard Setting (thanks Liz and Abdullah!) –Equating (thanks in advance Liz!) –Inter-rater reliability (thanks for nothing!) –Decision consistency and accuracy (ditto)

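Technical Manuals: P-Values & Point-Biserials You’ll often see a table like this:

| Grade | Subject | Stat | ALL | MC | OR |
| --- | --- | --- | --- | --- | --- |
| 3 | MAT | Diff | 0.67 (0.15) | 0.7 (0.13) | 0.61 (0.16) |
| 3 | MAT | Disc | 0.44 (0.08) | 0.43 (0.07) | 0.47 (0.1) |
| 3 | MAT | N | 142 | 89 | 53 |
| 3 | REA | Diff | 0.67 (0.15) | 0.71 (0.13) | 0.52 (0.11) |
| 3 | REA | Disc | 0.48 (0.1) | 0.45 (0.09) | 0.6 (0.05) |
| 3 | REA | N | 85 | 70 | 15 |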

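Technical Manuals: Reliabilities (and other stats) Louis said: Reliability is the correlation between scores on parallel forms. Higher reliability → greater consistency You’ll often see a table like this:

| Grade | Subject | N | Points | Min | Max | Mean | S.D. | Rel. (α) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | MAT | 32219 | 65 | 0 |  | 40.341 | 13.693 | 0.934 |
| 3 | REA | 32087 | 52 | 0 |  | 31.446 | 10.869 | 0.895 |
| 4 | MAT | 32673 | 65 | 0 |  | 39.628 | 13.043 | 0.925 |
| 4 | REA | 32527 | 52 | 0 |  | 33.452 | 9.112 | 0.891 |
| 5 | MAT | 33532 | 66 | 0 |  | 31.546 | 13.68 | 0.917 |
| 5 | REA | 33402 | 52 | 0 |  | 29.153 | 8.64 | 0.876 |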
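The α in the last column is coefficient alpha, which estimates reliability from a single administration instead of requiring two parallel forms. A minimal sketch of the standard formula, α = k/(k−1) · (1 − Σ item variances / total-score variance), on made-up data; the function name is illustrative.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a students-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                               # number of items
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Made-up responses: 5 students, 4 items
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 3))  # 0.8
```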

Technical Manuals: TCC’s and TIF’s Give TCC, TIF of each grade / content area

Technical Manuals: DIF Won said: An item has DIF if the probability of getting the item right is dependent on group membership (e.g., gender, ethnic group) Measured Progress uses a method called the Standardized P-Difference Comparing groups –Male-Female –White-Black –White-Hispanic Minimum 200 examinees in each group

DIF, Continued –A: [-0.05 ~ 0.05] negligible –B: [-0.1 ~ -0.05) and (0.05 ~ 0.1] low –C: outside the [-0.1 ~ 0.1] high
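A sketch of a standardized p-difference for one item, using the commonly stated focal-group-weighted form (Dorans and Kulick): examinees are matched on total score, the focal-minus-reference difference in proportion correct is taken at each score level, and the differences are averaged with weights equal to the focal-group count at that level. Whether the operational computation matches this exactly is an assumption; the function names and data are made up, and the 200-examinee minimum is not enforced here.

```python
from collections import defaultdict

def standardized_p_difference(records):
    """records: iterable of (group, matching_score, item_score),
    with group in {"ref", "focal"} and item_score 0 or 1."""
    # For each matching score, track [n_examinees, n_correct] per group
    cells = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for group, score, correct in records:
        cells[score][group][0] += 1
        cells[score][group][1] += correct
    numerator, denominator = 0.0, 0
    for cell in cells.values():
        n_ref, c_ref = cell["ref"]
        n_foc, c_foc = cell["focal"]
        if n_ref == 0 or n_foc == 0:
            continue  # skip score levels where one group is absent
        numerator += n_foc * (c_foc / n_foc - c_ref / n_ref)
        denominator += n_foc
    return numerator / denominator

def dif_category(value):
    """A/B/C classification using the ranges on the slide."""
    if -0.05 <= value <= 0.05:
        return "A"   # negligible
    if -0.1 <= value <= 0.1:
        return "B"   # low
    return "C"       # high

# Made-up data: two matching-score levels
records = (
    [("ref", 1, 1)] * 2 + [("ref", 1, 0)] * 2        # ref, level 1: 2 of 4 correct
    + [("focal", 1, 1)] * 1 + [("focal", 1, 0)] * 1  # focal, level 1: 1 of 2
    + [("ref", 2, 1)] * 3 + [("ref", 2, 0)] * 1      # ref, level 2: 3 of 4
    + [("focal", 2, 1)] * 1 + [("focal", 2, 0)] * 1  # focal, level 2: 1 of 2
)
value = standardized_p_difference(records)
print(value, dif_category(value))
```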

DIF, Continued You may see a table like this:

Technical Manuals: Standard Setting & Equating Liz and Abdullah discussed Standard Setting In technical manuals, you’ll often see: –Report / summary of standard setting process Info about panelists (how many, who they are) What method was used (e.g., bookmark / Body of Work) Cut points Info about panelist evaluations Equating: Come next week and find out!

Inter-rater reliability When constructed-response items are rated by multiple scorers, how well do raters agree? The more agreement, the better Exact agreement: What % of the time do they give the same score? Adjacent agreement: What % of the time are they off by 1?

Reading Open Response

| Agreement | Exact | Adjacent | > 1 |
| --- | --- | --- | --- |
| Percentage | 69.3 | 27.4 | 3.3 |
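The three percentages come from comparing two raters' scores response by response. A minimal sketch with made-up scores; the function name is illustrative.

```python
def agreement_rates(rater1, rater2):
    """Return (exact, adjacent, beyond) agreement as percentages."""
    diffs = [abs(a - b) for a, b in zip(rater1, rater2)]
    n = len(diffs)
    exact = 100 * sum(d == 0 for d in diffs) / n      # same score
    adjacent = 100 * sum(d == 1 for d in diffs) / n   # off by exactly 1
    beyond = 100 * sum(d > 1 for d in diffs) / n      # off by more than 1
    return exact, adjacent, beyond

# Made-up scores from two raters on the same 10 responses (0-4 scale)
rater1 = [4, 3, 2, 2, 1, 0, 3, 4, 2, 1]
rater2 = [4, 3, 3, 2, 1, 2, 3, 3, 2, 1]
exact, adjacent, beyond = agreement_rates(rater1, rater2)
print(exact, adjacent, beyond)  # 70.0 20.0 10.0
```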

Decision Accuracy and Consistency: Introduction For most programs, four achievement levels, e.g., Below Basic, Basic, Proficient, Advanced Decision accuracy: degree to which observed categorizations match true categorizations Decision consistency: degree to which observed categorizations match those of a parallel form

Intuitive examples of accuracy TRUE LEVEL: Proficient OBSERVED LEVEL: Proficient DIAGNOSIS: ACCURATE (GOOD) TRUE LEVEL: Proficient OBSERVED LEVEL: Below Basic DIAGNOSIS: INACCURATE (BAD). False negative TRUE LEVEL: Basic OBSERVED LEVEL: Advanced DIAGNOSIS: INACCURATE (BAD). False positive

Intuitive examples of consistency OBSERVED LEVEL, Form 1: Basic OBSERVED LEVEL, Form 2: Basic DIAGNOSIS: CONSISTENT (GOOD) OBSERVED LEVEL, Form 1: Basic OBSERVED LEVEL, Form 2: Advanced DIAGNOSIS: INCONSISTENT (BAD)

Decision Accuracy and Consistency: Introduction Livingston and Lewis (1995) proposed method of estimating decision accuracy/consistency For most programs, many stats are computed. We will give an example of each The stats are all based on joint distributions A joint distribution gives the proportion of times that 2 things both happen. –What proportion of students are truly Basic and are observed as Below Basic?

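Joint Distribution: True/Observed Achievement Levels Overall accuracy: 0.7484

| True Status | Observed BB | Observed B | Observed P | Observed A | Total |
| --- | --- | --- | --- | --- | --- |
| BB | 0.0706 | 0.0176 | 0.0007 | 0.0000 | 0.0889 |
| B | 0.0320 | 0.1058 | 0.0436 | 0.0000 | 0.1814 |
| P | 0.0014 | 0.0532 | 0.4726 | 0.0734 | 0.6007 |
| A | 0.0000 |  | 0.0296 | 0.0993 | 0.1290 |
| Total | 0.1041 | 0.1766 | 0.5466 | 0.1728 | 1.0000 |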

Joint Distribution: Observed/Observed Achievement Levels Overall consistency: 0.6574

| Observed Status: Form 1 | Form 2 BB | Form 2 B | Form 2 P | Form 2 A | Total |
| --- | --- | --- | --- | --- | --- |
| BB | 0.0673 | 0.0310 | 0.0058 | 0.0000 | 0.1041 |
| B | 0.0310 | 0.0820 | 0.0632 | 0.0003 | 0.1766 |
| P | 0.0058 | 0.0632 | 0.4066 | 0.0709 | 0.5466 |
| A | 0.0000 | 0.0003 | 0.0709 | 0.1015 | 0.1728 |
| Total | 0.1041 | 0.1766 | 0.5466 | 0.1728 | 1.0000 |

Indices Conditional upon Level Proportion of students correctly classified, given true level Proportion of students consistently classified by parallel form, given observed level

| Level | Accuracy | Consistency |
| --- | --- | --- |
| BB | 0.7945 | 0.6466 |
| B | 0.5831 | 0.4645 |
| P | 0.7868 | 0.7439 |
| A | 0.7702 | 0.5876 |
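The overall and conditional indices follow from the two joint distributions: the overall index is the sum of the diagonal, and each conditional index is a diagonal cell divided by its row sum. The sketch below uses the cell values from the two joint-distribution slides; the one cell that did not survive in the transcript (true A, observed B) is assumed to be 0.0000, and row sums are taken from the cells, so results agree with the slide values only to about three decimals.

```python
import numpy as np

# Rows = true level, columns = observed level, order BB, B, P, A.
# The (true A, observed B) cell is ASSUMED to be 0.0000.
true_by_observed = np.array([
    [0.0706, 0.0176, 0.0007, 0.0000],
    [0.0320, 0.1058, 0.0436, 0.0000],
    [0.0014, 0.0532, 0.4726, 0.0734],
    [0.0000, 0.0000, 0.0296, 0.0993],
])

# Rows = observed level on Form 1, columns = observed level on Form 2.
form1_by_form2 = np.array([
    [0.0673, 0.0310, 0.0058, 0.0000],
    [0.0310, 0.0820, 0.0632, 0.0003],
    [0.0058, 0.0632, 0.4066, 0.0709],
    [0.0000, 0.0003, 0.0709, 0.1015],
])

def overall_index(joint):
    """Proportion of students on the diagonal (same level both ways)."""
    return np.trace(joint)

def conditional_index(joint):
    """Diagonal cell divided by its row sum, one value per level."""
    return np.diag(joint) / joint.sum(axis=1)

print("overall accuracy:", round(overall_index(true_by_observed), 4))
print("overall consistency:", round(overall_index(form1_by_form2), 4))
print("conditional accuracy:", conditional_index(true_by_observed).round(4))
print("conditional consistency:", conditional_index(form1_by_form2).round(4))
```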

Indices at Cut Points Accuracy & consistency at specified cut point Accuracy: What is the chance that a student is classified on the “correct side” of a cut point? Consistency: What is the chance that a student is classified on the same side of a cut point twice?

| Cut | Accuracy | False Positive | False Negative | Consistency |
| --- | --- | --- | --- | --- |
| BB:B | 0.9483 | 0.0183 | 0.0334 | 0.9264 |
| B:P | 0.9011 | 0.0443 | 0.0546 | 0.8612 |
| P:A | 0.8969 | 0.0734 | 0.0296 | 0.8575 |
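The cut-point indices collapse the four levels into "below the cut" and "at or above the cut". A sketch using the true/observed joint distribution from the earlier slide, again assuming the missing (true A, observed B) cell is 0.0000; the function name is illustrative. The same function applied to the Form 1/Form 2 table gives consistency at each cut as its first return value.

```python
import numpy as np

# Rows = true level, columns = observed level, order BB, B, P, A.
# The (true A, observed B) cell is ASSUMED to be 0.0000.
joint = np.array([
    [0.0706, 0.0176, 0.0007, 0.0000],
    [0.0320, 0.1058, 0.0436, 0.0000],
    [0.0014, 0.0532, 0.4726, 0.0734],
    [0.0000, 0.0000, 0.0296, 0.0993],
])

def cut_indices(joint, k):
    """Indices for the cut between level k-1 and level k (k = 1, 2, 3)."""
    false_positive = joint[:k, k:].sum()   # truly below the cut, observed above it
    false_negative = joint[k:, :k].sum()   # truly above the cut, observed below it
    # Students classified on the correct side of the cut
    accuracy = joint[:k, :k].sum() + joint[k:, k:].sum()
    return accuracy, false_positive, false_negative

for k, label in enumerate(["BB:B", "B:P", "P:A"], start=1):
    accuracy, false_positive, false_negative = cut_indices(joint, k)
    print(label, round(accuracy, 4), round(false_positive, 4), round(false_negative, 4))
```

The false positive and false negative rates reproduce the table; accuracy can differ in the fourth decimal because the printed cells are rounded.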

To sum up… Lots of stuff in technical manuals Both classical test theory material (p-values, point-biserials, reliabilities) & IRT material (TCC’s, TIF’s, equating) are important to understand Hopefully, these seminars have helped familiarize you with their contents

4. Standard Setting Comes up all the time outside Psychoville Should be a perfect topic for this talk, but… Liz and Abdullah already did a wonderful job

4. Standard Setting Standard Setting is the process of recommending cut scores between achievement levels –Advanced (A) –Proficient (P) –Below Proficient (BP) –Failing (F) Three cut points (cut point 1, cut point 2, cut point 3) separate the four levels Focus on one FAQ in bookmark: –How do we determine the arrangement of items in the ordered item booklets?

Brief Review of Bookmark Each panelist makes use of the ordered item booklet Items in the OIB are presented from easiest to hardest. One page per MC item Panelists’ job is to place bookmark in OIB for each cut For a given cut, where do panelists place a bookmark? –Where they think borderline students would no longer have a 2/3 chance (or better) of a correct answer Abdullah said: cut points are derived from bookmark placements

A Very Frequently-Asked Question First, a FMC: “You messed up the order of the items!” Then, the FAQ: “Well, how did you determine the order?” Important: Order is based on actual student performance We use the concept of IRT

Two MC items: Which is easier? Easier item Harder item

Depending on IRT Model this issue can become quite complex

An Intuitive Explanation An easy item: An item that even low-ability students get right a high proportion of the time That is, students with small theta values tend to get it right Which item has the smallest theta value corresponding to a high probability of a correct answer? How high a probability? Use 2/3 for consistency IN SUM: “Easiest” item is the one with the smallest theta corresponding to p = 2/3 –“Hardest” has largest theta corresponding to p = 2/3

Use the 2/3 Criterion Easiest to hardest: Orange, green, red, purple, blue Thetas: -0.6 0.2 0.3 0.8 1.2
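A sketch of the 2/3 ordering, assuming a 2PL model (the slides do not say which model is used). Under 2PL the theta giving probability p is b + ln(p / (1 − p)) / a, so for p = 2/3 it is b + ln(2) / a. The item names and parameters are hypothetical and are chosen to show that ordering by the 2/3 theta can differ from ordering by the difficulty parameter b alone.

```python
import math

def theta_at(p, a, b):
    """Theta at which a 2PL item has probability p of a correct answer."""
    return b + math.log(p / (1 - p)) / a

# Hypothetical items: name -> (a, b)
items = {"X": (0.5, 0.0), "Y": (2.0, 0.5)}

response_probability = 2 / 3
thetas = {name: theta_at(response_probability, a, b) for name, (a, b) in items.items()}
ordered = sorted(items, key=lambda name: thetas[name])

for name in ordered:
    print(name, round(thetas[name], 3))
```

Item X has the lower b, yet item Y comes first in the booklet because its higher discrimination brings it to a 2/3 chance at a lower theta.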

How about polytomous items? A polytomous item is one that has more than 2 possible scores –MC items are dichotomous (0/1), not polytomous –Example of polytomous: OR item scored 0,1,2,3,4 Such an OR item is in the OIB four times, once for each score point 1,2,3,4 Where do you put this item’s 4 pages in the OIB?

Incorporating polytomous items Just as with dichotomous items, we use IRT What theta do you need to have a 2/3 chance of getting a 1 or better? 2 or better? 3 or better? 4? The theta must increase as the score increases Suppose the results are: -0.4, 0.4, 0.6, 1.8 The four score points (1, 2, 3, 4) are then placed among the MC items by theta MC items, easiest to hardest: Orange, green, red, purple, blue Thetas: -0.6 0.2 0.3 0.8 1.2
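With the thetas from the slides, building the ordered item booklet is a sort. The sketch below takes the 2/3 thetas as given and does not fit an IRT model; the page labels for the OR item are made up.

```python
# 2/3 thetas for the five MC items, from the earlier slide
mc_items = {"Orange": -0.6, "Green": 0.2, "Red": 0.3, "Purple": 0.8, "Blue": 1.2}
# 2/3 thetas for reaching score 1, 2, 3, 4 or better on the OR item
or_thetas = [-0.4, 0.4, 0.6, 1.8]

pages = list(mc_items.items())
pages += [(f"OR item, score {score}", theta)
          for score, theta in enumerate(or_thetas, start=1)]
pages.sort(key=lambda page: page[1])   # easiest (lowest theta) first

booklet = [name for name, _ in pages]
for name, theta in pages:
    print(f"{theta:5.1f}  {name}")
```

The OR item's four pages end up spread through the booklet, with score 1 near the front and score 4 as the last page.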
