Tuned Models of Peer Assessment in MOOCs. Jonathan Huang, Chuong Do, Daphne Koller, Zhenghao Chen, Andrew Ng, Chris Piech (Stanford, Coursera)
How can we efficiently grade 10,000 students? A variety of assessments
The Assessment Spectrum. Short response (e.g., multiple choice): easy to automate, but limited ability to ask expressive questions or require creativity. Long response (e.g., essay questions, coding assignments, proofs): hard to grade automatically, but can assign complex assignments and provide complex feedback.
Stanford/Coursera’s HCI course: video lectures with embedded questions, weekly quizzes, and open-ended assignments.
Calibrated peer assessment: 1) Calibration, 2) Assess 5 Peers, 3) Self-Assess. [Figure label: ✓ staff-graded.] A similar process is also used in Mathematical Thinking, Programming Python, Listening to World Music, Fantasy and Science Fiction, Sociology, Social network analysis, ... Slide credit: Chinmay Kulkarni (http://hci.stanford.edu/research/assess/). Image credit: Debbie Morrison (http://onlinelearninginsights.wordpress.com/). [Russell, ’05; Kulkarni et al., ’13]
Peer Grading Desiderata: (1) highly reliable/accurate assessment; (2) reduced workload for both students and course staff; (3) scalability (to, say, tens of thousands of students). Our work: (1) a statistical model for estimating and correcting for grader reliability/bias; (2) a simple method for reducing grader workload; (3) a scalable estimation algorithm that easily handles MOOC-sized courses.
How to decide if a grader is good. [Figure: submissions and graders, with example peer grades of 100%, 30%, 50%, 55%, 56%, 54%.] Who should we trust? Idea: look at the other submissions graded by these graders! We need to reason with all submissions and peer grades jointly!
Model PG1: modeling grader bias and reliability. Variables: true score of student u; grader reliability of student v; grader bias of student v; student v’s assessment of student u (observed). Related models in the literature: crowdsourcing [Whitehill et al. (’09), Bachrach et al. (’12), Kamar et al. (’12)]; anthropology [Batchelder & Romney (’88)]; peer assessment [Goldin & Ashley (’11), Goldin (’12)].
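The equations on this slide did not survive in the transcript. As a minimal sketch, one generative model consistent with the four labeled quantities could be written as below, assuming Gaussian scores and biases and a Gamma-distributed reliability. The symbols $s_u$, $b_v$, $\tau_v$, $z_u^v$ and all hyperparameters ($\mu_0$, $\gamma_0$, $\eta_0$, $\alpha_0$, $\beta_0$) are notation chosen here for illustration, not taken from the slide.

```latex
\begin{align*}
s_u    &\sim \mathcal{N}(\mu_0,\; 1/\gamma_0)      && \text{true score of student } u \\
b_v    &\sim \mathcal{N}(0,\; 1/\eta_0)            && \text{grader bias of student } v \\
\tau_v &\sim \mathrm{Gamma}(\alpha_0,\; \beta_0)   && \text{grader reliability (precision) of student } v \\
z_u^v  &\sim \mathcal{N}(s_u + b_v,\; 1/\tau_v)    && \text{student } v\text{'s observed assessment of student } u
\end{align*}
```

Under this reading, a biased grader shifts every grade they give by $b_v$, while an unreliable grader (small $\tau_v$) gives noisier grades around that shifted value.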
Correlating bias variables across assignments. [Figure: biases estimated from assignment T plotted against biases at assignment T+1.]
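A rough way to reproduce this kind of comparison is sketched below. It is not the model-based bias estimate from the talk: as a stand-in, it assumes a grader's bias can be approximated by their mean deviation from each submission's median peer grade, and then computes a Pearson correlation between two assignments. The function names and the `(grader, submission, score)` tuple layout are hypothetical.

```python
import math
from statistics import mean, median

def estimate_biases(grades):
    """Crude bias per grader: mean of (score - median peer score of that submission).

    grades: list of (grader, submission, score) tuples (assumed layout).
    """
    by_submission = {}
    for _, sub, score in grades:
        by_submission.setdefault(sub, []).append(score)
    consensus = {sub: median(scores) for sub, scores in by_submission.items()}

    residuals = {}
    for grader, sub, score in grades:
        residuals.setdefault(grader, []).append(score - consensus[sub])
    return {grader: mean(r) for grader, r in residuals.items()}

def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either list is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def bias_correlation(grades_t, grades_t1):
    """Correlate crude biases of graders who appear in both assignments."""
    b_t, b_t1 = estimate_biases(grades_t), estimate_biases(grades_t1)
    common = sorted(set(b_t) & set(b_t1))
    return pearson([b_t[g] for g in common], [b_t1[g] for g in common])
```

A correlation well above zero would suggest that a grader's tendency to grade high or low persists from one assignment to the next.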
Model PG2: temporal coherence. Variables: true score of student u; grader reliability of student v; grader bias of student v; student v’s assessment of student u (observed). Grader bias at homework T depends on bias at homework T-1.
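The slide's equation is missing from the transcript. One simple way to express "bias at homework T depends on bias at T-1" is a Gaussian random walk on the bias, sketched below. Here $b_v^{(T)}$ denotes the bias of grader $v$ at homework $T$ and $\omega$ is a precision controlling how quickly bias can drift; both symbols are assumptions made for this sketch.

```latex
\[
b_v^{(T)} \sim \mathcal{N}\!\left(b_v^{(T-1)},\; 1/\omega\right)
\]
```

A large $\omega$ would tie each grader's bias closely to its previous value, while a small $\omega$ would let the bias be re-estimated almost independently on each homework.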
Model PG3: coupled grader score and reliability. Variables: true score of student u; grader bias of student v; student v’s assessment of student u (observed). Your reliability as a grader depends on your ability! Approximate inference: Gibbs sampling (EM and variational methods were also implemented for a subset of the models). Running time: ~5 minutes for HCI 1. Note: PG3 cannot be Gibbs sampled in “closed form”.
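To make the Gibbs sampling step concrete, here is a minimal sketch for a simpler bias-plus-reliability model (not PG3, which the slide notes has no closed-form Gibbs updates). It assumes Normal priors on true scores and biases, a Gamma prior on each grader's precision, and Normally distributed grades centered at score plus bias; under those assumptions every conditional is conjugate. The function name `gibbs_pg1`, the data layout, and all hyperparameter defaults are hypothetical and untuned.

```python
import random
from collections import defaultdict

def gibbs_pg1(grades, mu0=70.0, gamma0=0.01, eta0=0.04, alpha0=2.0, beta0=50.0,
              n_iter=600, burn_in=200, seed=0):
    """Gibbs sampler sketch. grades: list of (grader, submission, score).

    Assumed model: s_u ~ N(mu0, 1/gamma0), b_v ~ N(0, 1/eta0),
    tau_v ~ Gamma(alpha0, rate=beta0), z ~ N(s_u + b_v, 1/tau_v).
    Returns (posterior-mean scores, posterior-mean biases).
    """
    rng = random.Random(seed)
    by_sub, by_grader = defaultdict(list), defaultdict(list)
    for v, u, z in grades:
        by_sub[u].append((v, z))
        by_grader[v].append((u, z))

    # Initialise scores at the mean peer grade, biases at 0, precision at prior mean.
    s = {u: sum(z for _, z in obs) / len(obs) for u, obs in by_sub.items()}
    b = {v: 0.0 for v in by_grader}
    tau = {v: alpha0 / beta0 for v in by_grader}
    s_sum, b_sum, kept = dict.fromkeys(s, 0.0), dict.fromkeys(b, 0.0), 0

    for it in range(n_iter):
        # True scores: Normal conditional, precision-weighted over graders.
        for u, obs in by_sub.items():
            prec = gamma0 + sum(tau[v] for v, _ in obs)
            m = (gamma0 * mu0 + sum(tau[v] * (z - b[v]) for v, z in obs)) / prec
            s[u] = rng.gauss(m, prec ** -0.5)
        for v, obs in by_grader.items():
            # Bias: Normal conditional given current scores and precision.
            prec = eta0 + len(obs) * tau[v]
            m = tau[v] * sum(z - s[u] for u, z in obs) / prec
            b[v] = rng.gauss(m, prec ** -0.5)
            # Reliability: Gamma conditional (gammavariate takes a scale = 1/rate).
            sq = sum((z - s[u] - b[v]) ** 2 for u, z in obs)
            tau[v] = rng.gammavariate(alpha0 + len(obs) / 2, 1.0 / (beta0 + sq / 2))
        if it >= burn_in:
            kept += 1
            for u in s:
                s_sum[u] += s[u]
            for v in b:
                b_sum[v] += b[v]

    return ({u: t / kept for u, t in s_sum.items()},
            {v: t / kept for v, t in b_sum.items()})
```

Calling `gibbs_pg1(grades)` returns a score estimate per submission and a bias estimate per grader. Because every submission's score depends on the biases of all its graders, and every grader's bias depends on all the submissions they graded, the sampler reasons over all peer grades jointly.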
Incentives: scoring rules can impact student behavior. Model PG3 gives higher homework scores to students who are accurate graders! Model PG3 gives high-scoring graders more “sway” in computing a submission’s final score. This improves prediction accuracy and encourages students to grade better. See [Dasgupta & Ghosh, ’13] for a theoretical look at this problem.
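The "sway" idea can be illustrated with a precision-weighted average in which a grader's weight grows with their own homework score. This is only a sketch: it assumes the weight is a linear function `theta0 + theta1 * own_score`, ignores grader bias and any prior on the true score, and uses hypothetical names and untuned coefficients.

```python
def weighted_score(peer_grades, grader_scores, theta0=0.01, theta1=0.001):
    """Weighted average of peer grades, weighting each grader by an assumed
    reliability that is linear in the grader's own homework score.

    peer_grades:   list of (grader, grade_given) for one submission
    grader_scores: dict grader -> that grader's own homework score
    theta0/theta1: hypothetical coefficients (must give positive weights)
    """
    weights = [theta0 + theta1 * grader_scores[g] for g, _ in peer_grades]
    total = sum(w * z for w, (_, z) in zip(weights, peer_grades))
    return total / sum(weights)

# A high-scoring grader (own score 100) and a low-scoring grader (own score 0)
# disagree; the result is pulled toward the high-scoring grader's grade.
example = weighted_score([("a", 90.0), ("b", 50.0)], {"a": 100.0, "b": 0.0})
```

With these illustrative coefficients the result lies much closer to 90 than to the unweighted mean of 70.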
Prediction accuracy. [Figure panels: baseline (median) prediction accuracy; model PG3 prediction accuracy.] 33% reduction in RMSE. Only 3% of submissions land farther than 10% from ground truth.
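For reference, the metrics on this slide can be computed as in the sketch below: a median-of-peer-grades baseline, RMSE against ground truth, the relative RMSE reduction, and the fraction of submissions that land farther than a tolerance from ground truth. The dictionary layout and function names are assumptions for illustration.

```python
import math
from statistics import median

def median_baseline(peer_grades):
    """peer_grades: dict submission -> list of peer scores. Returns median per submission."""
    return {u: median(scores) for u, scores in peer_grades.items()}

def rmse(predicted, truth):
    """Root mean squared error over submissions present in both dicts."""
    common = [u for u in truth if u in predicted]
    return math.sqrt(sum((predicted[u] - truth[u]) ** 2 for u in common) / len(common))

def relative_rmse_reduction(baseline_rmse, model_rmse):
    """E.g. 0.33 means the model's RMSE is 33% lower than the baseline's."""
    return (baseline_rmse - model_rmse) / baseline_rmse

def fraction_beyond(predicted, truth, tolerance):
    """Fraction of submissions whose prediction is more than `tolerance` from truth."""
    common = [u for u in truth if u in predicted]
    return sum(abs(predicted[u] - truth[u]) > tolerance for u in common) / len(common)
```

Here `truth` stands for staff-assigned ground-truth grades, and `tolerance` is in the same units as the grades (for example, 10 percentage points).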
Prediction accuracy, all models (HCI 1 and HCI 2). PG3 typically outperforms the other models. An improved rubric made baseline grading in HCI 2 more accurate than in HCI 1. Despite the improved rubric in HCI 2, the simplest model (PG1 with just bias) outperforms baseline grading on all metrics. Just modeling bias (constant reliability) captures ~95% of the improvement in RMSE.
Meaningful confidence estimates. [Figure: experiments where confidence fell between .90 and .95.] When our model is 90% confident that its prediction is within K% of the true grade, then over 90% of the time in experiments we are indeed within K% (i.e., our model is conservative). We can use confidence estimates to tell when a submission needs to be seen by more graders!
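A calibration check of this kind can be sketched as follows. It assumes each submission comes with posterior samples of its score (for example, from a sampler), takes the prediction to be the sample mean, defines confidence as the fraction of samples within K of that mean, and then measures how often the prediction really is within K of the true grade among submissions whose confidence falls in a given band. Names and data layout are hypothetical.

```python
def confidence_within(samples, k):
    """Fraction of posterior samples within k of the posterior mean."""
    center = sum(samples) / len(samples)
    return sum(abs(x - center) <= k for x in samples) / len(samples)

def empirical_coverage(records, k, lo=0.90, hi=0.95):
    """records: list of (posterior_samples, true_grade).

    Among submissions whose model confidence is in [lo, hi), return the
    fraction whose posterior-mean prediction is within k of the true grade.
    Returns None if no submission falls in the band.
    """
    hits = total = 0
    for samples, truth in records:
        if lo <= confidence_within(samples, k) < hi:
            total += 1
            center = sum(samples) / len(samples)
            hits += abs(center - truth) <= k
    return hits / total if total else None
```

If the returned coverage is at or above the band's lower edge, the confidence estimates are conservative in the sense described on the slide.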
How many graders do you need? Some submissions need more graders! Some grader assignments can be reallocated! Note: This is quite an overconservative estimate (as in the last slide)
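As a back-of-the-envelope illustration of why the required number of graders varies, the sketch below assumes a much simpler setting than the talk's models: unbiased graders who all share one known precision, and a Gaussian posterior on the score. Under those assumptions, the posterior precision after n grades is `prior_precision + n * grader_precision`, and one can solve for the smallest n that makes the score lie within K of the estimate with a target confidence. All names and parameters are hypothetical.

```python
import math
from statistics import NormalDist

def graders_needed(k, confidence, grader_precision, prior_precision=0.0):
    """Smallest number of graders such that a Gaussian posterior puts at least
    `confidence` probability within +/- k of its mean.

    Assumes unbiased graders with a common, known precision (1 / variance).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided quantile
    required_precision = (z / k) ** 2               # need posterior sd <= k / z
    n = (required_precision - prior_precision) / grader_precision
    return max(0, math.ceil(n))

# Graders with noise sd 5 versus sd 10, target: within 5 points with 90% confidence.
reliable = graders_needed(k=5, confidence=0.90, grader_precision=1 / 25)
noisy = graders_needed(k=5, confidence=0.90, grader_precision=1 / 100)
```

In this toy setting, submissions that happen to draw noisier graders need several times as many grades, which is the intuition behind reallocating grader assignments.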
Understanding graders in the context of the MOOC. Question: what factors influence how well a student will grade? [Figure panels, labeled with mean and standard deviation: “easiest” submissions to grade; “harder” submissions to grade.] Better-scoring graders grade better.
[Figure: residual (# standard deviations from mean) given grader and gradee scores; axes: grader grade (z-score) and gradee grade (z-score); scale runs from grade deflation to grade inflation.] The best students tend to downgrade the worst submissions. The worst students tend to inflate the best submissions.
How much time should you spend on grading? “sweet spot of grading”: ~ 20 minutes
What your peers say about you! [Figure panels: best submissions; worst submissions.]
Commenting styles in HCI. [Figure axes: sentiment polarity; feedback length (words); residual (z-score).] On average, comments vary from neutral to positive, with few highly negative comments. Students have more to say about weaknesses than strong points.
Student engagement and peer grading. Task: predict whether a student will complete the last homework. [Figure: ROC curves (true positive rate vs. false positive rate) for four feature sets: just grade, just bias, just reliability, all features (AUC = 0.97605).]
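The AUC reported for the ROC curves can be computed without plotting, using the rank interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a generic implementation of that definition, not the classifier or features used in the talk.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    labels: iterable of truthy/falsy values (e.g. completed last homework or not)
    scores: predicted scores, higher meaning more likely positive
    """
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise form is quadratic in the number of examples, which is fine for illustration; a sort-based version would be preferable for course-scale data.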
Takeaways: Peer grading is an easy and practical way to grade open-ended assignments at scale. Reasoning jointly over all submissions and accounting for bias/reliability can significantly improve current peer grading in MOOCs. Grading performance can tell us about other learning factors such as student engagement or performance. Real-world deployment: our system was used in HCI 3!