Presentation on theme: "Hands-on Automated Scoring"— Presentation transcript:

1 Hands-on Automated Scoring
Sue Lottridge and Carlo Morales, Pacific Metrics; Mark Shermis, University of Houston – Clear Lake

2 Workshop Goals
Big-picture steps of how to train an engine and evaluate engine performance
Hands-on experience with automated scoring across the training and validation pipeline

3 Agenda
8:30-10:00: Presentations
  Overview of engine training
  Data handling
  Evaluation criteria
  Engine fundamentals
  Considerations
10:00-10:15: Break
10:15-11:30: Hands-on engine training
  Walk-through demo with data

4 Engine Training Steps
Discovery: Item materials, scoring rules, data
Specifications: Create datasets, identify models, analysis procedures
Analysis: Build models, pick final model, score and analyze
Report: Purpose, methods, results

5 Discovery
Item Materials: Presentation information/tools, item, rubric, passage
Scoring Rules: Relevant range-finding decisions, adjudication rules, read-behind procedures, choice of score on which to train
Data: Data definitions, data (ID, responses, human-assigned scores, other), data summaries (validation), review of raw data and summaries

6 Define Specifications
Create Sets: Train set (2/3) and number of folds, held-out test set (1/3)
Build Models: Preprocessing parameters, feature extraction parameters, scoring parameters
Analysis: H1-H2-Engine statistics for train and test sets, data deliverables, archiving
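A minimal sketch of the set creation described above, assuming scikit-learn and a pandas DataFrame with hypothetical columns (a response text column and a human score column named "h1_score"); a stratified split keeps every score point represented in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input file with ID, response text, and human-assigned scores.
df = pd.read_csv("responses.csv")

# Hold out 1/3 of the responses for the final evaluation; stratify on the
# human score so the train and test sets have similar score distributions.
train_df, test_df = train_test_split(
    df, test_size=1/3, stratify=df["h1_score"], random_state=42
)
print(len(train_df), len(test_df))
```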

7 Implementation Steps
Build Models: Build data sets, build models on the train set, score "folds" based on the trained models, set cuts as needed
Finalize Model: Determine the best-performing model, store model parameters, review with client if requested
Score: Score the held-out test set, create the data deliverable, conduct evaluation analyses, archive

8 Generate Report
Intro: Reason for study, description of items/rubrics/passages, scoring model design
Methods: Description of data, sample allocations across sets, number of models employed and the final model, analyses and any criteria used to evaluate
Results: Tables of results for each dataset (train, test), text description of results including scoring issues, recommendations, data deliverables

9 Evaluating the Performance of an Automated Scoring Engine
An industry-wide standard should be used; the current standard is from Williamson, Xi, and Breyer (2012). Note that other factors may weigh into the decision:
Identification of aberrant responses
Concern for scores at critical decision points (near cut scores)
Quality of scoring across the rubric
Use of AS in the operational program (e.g., monitoring, sole read, hybrid)

10 Evaluating Engine Performance
Metrics: Mean, SD, standardized mean difference (SMD); exact, adjacent, and non-adjacent agreement; kappa, quadratic weighted kappa (QWK), correlations
Evaluation criteria (Williamson, Xi, & Breyer, 2012):
QWK exceeds the minimal .70 threshold
Correlation exceeds the .70 threshold
QWK degradation (human-engine relative to human-human) not to exceed .10
SMD values not to exceed .15
Exact agreement degradation not to exceed 5%
One can also evaluate performance for targeted subgroups.
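For concreteness, here is a minimal sketch of these metrics, assuming two equal-length arrays of integer scores (human and engine); the SMD uses the pooled-SD form commonly applied in this context.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement_stats(human, engine):
    """Agreement and distributional statistics for two score vectors."""
    h, e = np.asarray(human, dtype=float), np.asarray(engine, dtype=float)
    diff = np.abs(h - e)
    pooled_sd = np.sqrt((h.std(ddof=1) ** 2 + e.std(ddof=1) ** 2) / 2)
    return {
        "exact": float(np.mean(diff == 0)),
        "adjacent": float(np.mean(diff == 1)),
        "non_adjacent": float(np.mean(diff > 1)),
        "kappa": cohen_kappa_score(human, engine),
        "qwk": cohen_kappa_score(human, engine, weights="quadratic"),
        "pearson_r": float(np.corrcoef(h, e)[0, 1]),
        "smd": float((e.mean() - h.mean()) / pooled_sd),  # standardized mean difference
    }

print(agreement_stats([0, 1, 2, 1, 0, 2], [0, 1, 2, 2, 0, 1]))
```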

11 Data
Item, rubric, training papers, and ancillary materials
Scoring decisions and human scores available
Responses with a set of human scores
Deciding on the score to use for prediction and the score to use for evaluation

12 Statistical Considerations
Total N: Do we have enough responses to train and validate an engine? What proportion should be used for each sample?
N at each score point: Do we have enough responses at each score point to produce a reliable prediction? Is there a bimodal distribution, suggesting a potential issue with the rubric?
Mean/SD: Is the item of hard, easy, or medium difficulty?
Exact agreement: Does it seem too low or too high, given the rubric scale? Is it dominated by one or more score points?
Non-adjacent agreement: Is it in the usual range (e.g., 1-5%)? If high, there is potentially a scoring issue.
Quadratic weighted kappa: Is it > .70?
Correlation: Is it > .70?
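A minimal sketch of these data checks, assuming pandas and a hypothetical "h1_score" column of human scores; the sparse-score-point threshold is illustrative only.

```python
import pandas as pd

# Hypothetical file and column names; substitute your own.
scores = pd.read_csv("responses.csv")["h1_score"]

print("Total N:", len(scores))
print("N at each score point:")
print(scores.value_counts().sort_index())        # look for sparse or bimodal patterns
print(f"Mean: {scores.mean():.2f}  SD: {scores.std():.2f}")

# Flag score points that may be too thin to model reliably (threshold is illustrative).
counts = scores.value_counts()
sparse = counts[counts < 30]
if not sparse.empty:
    print("Score points with fewer than 30 responses:", sparse.to_dict())
```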

13 Other Considerations
Once the samples are divided into training and validation sets, conduct a review of the training papers:
Response length – anything unusual?
Uncommon responses
Range of responses within each score point
Need for manual changes to responses

14 Sample Human Rater Statistics
Score distribution (N = 416):
Score   H1      H2      SOR
0       22.6%   23.8%
1       44%     41.3%
2       33.4%   34.9%
Mean    1.11
SD      0.74    0.76

Agreement (N = 416):
              H1-H2   H1-SOR   H2-SOR
Exact         81.2%   100%
Adjacent      18.5%   0%
Non-Adjacent  0.2%
Kappa         0.71    1
QWK           0.83
Pearson r
Spearman r

15 Sample Handling
Training sample vs. held-out validation sample
Proportion (often 67% train, 33% test)
Training sample is used for building the model
Validation sample is used for evaluating the model (never touch it until the model is finalized)
K-fold cross-validation
Enables evaluation of multiple models with a realistic estimate of performance
Very powerful when used with grid search to build and evaluate competing models
Once the model is finalized, train on the entire sample (more data = better model)

16 K-Fold Cross-Validation
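A minimal k-fold cross-validation sketch using scikit-learn, combined with the grid search over competing models described on the previous slide; the `texts` and `scores` training-sample variables, the pipeline, and the parameter grid are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Score every fold with quadratic weighted kappa.
qwk = make_scorer(cohen_kappa_score, weights="quadratic")

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

grid = GridSearchCV(
    pipe,
    param_grid={"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1.0, 10.0]},
    scoring=qwk,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    refit=True,   # refit the best model on the entire training sample
)
# grid.fit(texts, scores)   # texts/scores: training-sample responses and human scores
# print(grid.best_params_, grid.best_score_)
# Only after this model is finalized should the held-out test set be scored.
```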

17 The Modeling Pipeline
Preprocessing: Spell correction/term replacement, stopword removal, lemmatization, punctuation handling, case handling
Feature Extraction: Term vectors (n-grams), base counts, response characteristics
Scoring: Regression/classification, parametric vs. non-parametric models, choice of score to predict, fit indices
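A minimal end-to-end sketch of the three stages, assuming a simple bag-of-n-grams model with scikit-learn; the preprocessing rules, toy responses, and regression choice are illustrative, not those of any particular engine.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

def preprocess(text):
    """Preprocessing: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Feature extraction: unigram/bigram term vectors plus a response characteristic (length).
vectorizer = CountVectorizer(preprocessor=preprocess, ngram_range=(1, 2))

def extract_features(texts):
    ngrams = vectorizer.transform(texts).toarray()
    lengths = np.array([[len(t.split())] for t in texts])
    return np.hstack([ngrams, lengths])

# Scoring: a regression model predicting the human score (toy data for illustration).
texts = ["the mitochondria makes energy", "plants use sunlight", "i dont know"]
scores = [2, 1, 0]
vectorizer.fit(texts)
model = Ridge().fit(extract_features(texts), scores)
print(np.round(model.predict(extract_features(["plants make energy from sunlight"])), 2))
```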

18 Other Considerations
Protect against over-fitting
Score modeling choices: balance agreement with the score distribution; quadratic weighted kappa is the best single metric for this, but pay attention to both
Manage aberrant responses: usually through other means, but also through engine flagging
Fairness issues: how should they be handled?
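One simple way to check for over-fit, assuming a fitted scikit-learn style `model` and the train/held-out splits from earlier (all names here are illustrative): compare QWK on the training data with QWK on the held-out data.

```python
from sklearn.metrics import cohen_kappa_score

def overfit_gap(model, X_train, y_train, X_test, y_test):
    """Return train QWK, held-out QWK, and their difference."""
    qwk_train = cohen_kappa_score(y_train, model.predict(X_train), weights="quadratic")
    qwk_test = cohen_kappa_score(y_test, model.predict(X_test), weights="quadratic")
    # A large drop from train to held-out QWK suggests the model memorized the
    # training responses rather than learning generalizable features.
    return qwk_train, qwk_test, qwk_train - qwk_test
```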

19 15-minute break!

20 Hands-on Automated Scoring
Demo of a machine scoring tool. A few notes:
It does not use the CRASE engine; it is a simplified scoring tool to demonstrate AS methods
It allows the user to upload data, review statistics on responses, conduct limited preprocessing, feature extraction, and model building, visualize the modeling process, and view results
We have data from the Automated Student Assessment Prize (ASAP) for constructed-response scoring for you to use

21 Toolkit Demo
Use your assigned link
Example item: ASAP data
Walk-through of functions:
File upload
Train/test allocation proportions
Sample view
Describe the PCA/LDA graphic and the results below it
Observe changes with preprocessing
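A rough sketch of the kind of projection behind the PCA/LDA graphic, assuming scikit-learn and a handful of toy responses; TruncatedSVD stands in for PCA on sparse TF-IDF features, and LinearDiscriminantAnalysis could be applied to the reduced features for the LDA view.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy responses and human scores, for illustration only.
texts = ["cells divide by mitosis", "mitosis splits one cell into two",
         "i like turtles", "the answer is mitosis", "no idea", "cells split in two"]
scores = [2, 2, 0, 1, 0, 1]

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)  # PCA-style projection

# Plotting these coordinates colored by human score gives the cluster view the
# toolkit displays; LinearDiscriminantAnalysis(n_components=1).fit_transform(coords, scores)
# would give the LDA variant.
for text, s, (x, y) in zip(texts, scores, coords):
    print(f"score={s}  ({x:+.2f}, {y:+.2f})  {text}")
```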

22 Task 1
"Play" with the various parameters and observe the performance of the engine, varying:
1. Response processing parameters
2. N-gram size
3. Principal components
What was your measure of quality? What was your approach to modeling? What seemed to produce the best results? Did anything reduce scoring quality?
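One possible way to work through Task 1 programmatically, assuming `texts` and `scores` from the training set; the parameter values swept here are illustrative.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Use quadratic weighted kappa as the measure of quality.
qwk = make_scorer(cohen_kappa_score, weights="quadratic")

def sweep(texts, scores):
    """Vary n-gram size and number of components; report cross-validated QWK."""
    for ngram_max in (1, 2, 3):
        for n_components in (10, 50, 100):
            pipe = Pipeline([
                ("tfidf", TfidfVectorizer(ngram_range=(1, ngram_max))),
                ("svd", TruncatedSVD(n_components=n_components, random_state=0)),
                ("clf", LogisticRegression(max_iter=1000)),
            ])
            score = cross_val_score(pipe, texts, scores, scoring=qwk, cv=5).mean()
            print(f"ngrams=(1,{ngram_max}) components={n_components}  mean QWK={score:.3f}")
```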

23 Task 2
Do the same as before, but tune using the automation feature:
1. Response processing parameters
2. N-gram size
How did the results change across the various methods? Does one score prediction model seem to work better than another?

24 Task 3
One of:
Determine the impact of overfitting your model, OR
Find a way to cheat your model!

25 Thank you!

