Karthik Raman and Thorsten Joachims, Cornell University


MOOCs promise to revolutionize education with unprecedented access, but they require rethinking conventional evaluation logistics:
- Small-scale classes (10-15 students): instructors evaluate students themselves.
- Medium-scale classes: TAs take over the grading process.
- MOOCs: ??
MCQs and other auto-graded questions are not a good test of understanding and fall short of conventional testing. They also limit the kinds of courses that can be offered.
(Image courtesy: PHDComics)

Students grade each other (anonymously)! This overcomes a limitation of instructor/TA evaluation: the number of "graders" scales with the number of students.
- Cardinal peer grading [Piech et al. '13] requires cardinal labels for each assignment.
- Each peer grader g provides a cardinal score for every assignment d they grade, e.g., on a Likert scale or as a letter grade (8/10, B+, ...).

Students are not trained graders, so the feedback process needs to be simple.
- There is broad evidence that ordinal feedback is easier to provide and more reliable than cardinal feedback: "Project X is better than Project Y" vs. "Project X is a B+".
- Grading scales may be non-linear, which is problematic for cardinal grades.
[Figure: comparisons of dollar amounts ($45,000, $90,000, ...) illustrating a non-linear scale.]

- Each grader g provides an ordering σ(g) of the assignments they grade, e.g., P > Q > R.
- GOAL: Determine the true ordering of all assignments and the graders' reliabilities.
[Figure: several partial orderings over assignments P, Q, R, ... merged into one overall ranking.]
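A minimal sketch of how the input to ordinal peer grading can be represented (variable names are illustrative, not from the paper): each grader maps to the ordering, best to worst, of the subset of assignments they reviewed.

```python
# Hypothetical input representation for ordinal peer grading:
# each grader submits an ordering (best -> worst) over only the
# assignments they were asked to review.
peer_orderings = {
    "grader_1": ["P", "Q", "R"],   # grader_1 ranks P above Q above R
    "grader_2": ["Q", "P"],        # graders see different, partial subsets
    "grader_3": ["R", "P", "Q"],
}

# Goal: recover one ordering over all assignments
# (and, ideally, a reliability estimate per grader).
all_assignments = sorted({d for sigma in peer_orderings.values() for d in sigma})
print(all_assignments)  # ['P', 'Q', 'R']
```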

- Rank aggregation / metasearch: given input rankings, determine an overall ranking. Performance is usually top-heavy (optimize the top k results); estimating grader reliability is not a goal; ties and data sparsity are not an issue.
- Social choice / voting systems: aggregate preferences from individuals, but do not model reliabilities, are again top-heavy, and typically assume complete rankings.
- Crowdsourcing: estimating worker (grader) reliability is a key component, but the scale of items (assignments) is much larger than in a typical crowdsourcing task, and it also tends to be top-heavy.

Mallows model (MAL):
- GENERATIVE MODEL: P(σ | σ*) ∝ exp(-d(σ, σ*)), where d(σ, σ*) is the Kendall-Tau distance between the two orderings (the number of differing pairs). For example, with true ordering σ* = (P, Q, R), the unnormalized probabilities are:
    (P,Q,R): x
    (P,R,Q), (Q,P,R): x/e
    (R,P,Q), (Q,R,P): x/e^2
    (R,Q,P): x/e^3
- OPTIMIZATION: NP-hard; a greedy algorithm provides a good approximation (see the sketch below).
- WITH GRADER RELIABILITY: each grader g gets their own reliability parameter in the model (formula on slide).
- A variant with a score-weighted objective (MALS) is also used.
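A minimal sketch, not the authors' implementation, of the two ingredients above: the Kendall-Tau distance that defines the Mallows likelihood, and a simple greedy aggregation heuristic (repeatedly output the item ranked above the most remaining items across the input orderings; the exact greedy procedure used in the paper may differ).

```python
import math
from itertools import combinations

def kendall_tau(sigma_a, sigma_b):
    """Number of item pairs (common to both orderings) ranked in opposite order."""
    pos_a = {d: i for i, d in enumerate(sigma_a)}
    pos_b = {d: i for i, d in enumerate(sigma_b)}
    common = [d for d in sigma_a if d in pos_b]
    return sum(
        1
        for x, y in combinations(common, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

def mallows_prob(sigma, sigma_star, theta=1.0):
    """Unnormalized Mallows probability: exp(-theta * d_KT(sigma, sigma*))."""
    return math.exp(-theta * kendall_tau(sigma, sigma_star))

def greedy_aggregate(orderings):
    """Greedy heuristic: repeatedly output the item that is ranked above
    the most remaining items across all input orderings."""
    items = {d for sigma in orderings for d in sigma}
    result = []
    while items:
        def wins(d):
            return sum(
                1
                for sigma in orderings
                for other in items
                if other != d and d in sigma and other in sigma
                and sigma.index(d) < sigma.index(other)
            )
        best = max(items, key=wins)
        result.append(best)
        items.remove(best)
    return result

orderings = [["P", "Q", "R"], ["Q", "P"], ["R", "P", "Q"]]
print(greedy_aggregate(orderings))  # a full ordering over P, Q, R (ties broken arbitrarily)
```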

Bradley-Terry model (BT):
- GENERATIVE MODEL: decomposes an ordering into pairwise preferences, using a logistic distribution over (true) score differences: P(d > d') = 1 / (1 + exp(-(s_d - s_d'))).
- OPTIMIZATION: alternating minimization to compute MLE scores (and grader reliabilities) using an SGD subroutine (see the sketch below).
- GRADER RELIABILITY: each grader g gets a reliability parameter that enters the pairwise probability (formula on slide).
- Variants studied include the Plackett-Luce model (PL) and the Thurstone model (THUR).
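A minimal sketch of Bradley-Terry score fitting by stochastic gradient ascent on the pairwise logistic likelihood, assuming unit grader reliabilities; the paper's procedure additionally alternates this step with a grader-reliability update. Function and variable names are illustrative.

```python
import math

def bt_fit_scores(pairwise_prefs, n_items, lr=0.1, epochs=200):
    """SGD-style MLE for Bradley-Terry scores.
    pairwise_prefs: list of (winner, loser) index pairs extracted from
    the graders' orderings (every ordered pair in each ordering)."""
    s = [0.0] * n_items
    for _ in range(epochs):
        for w, l in pairwise_prefs:
            # P(w beats l) = sigmoid(s_w - s_l)
            p = 1.0 / (1.0 + math.exp(-(s[w] - s[l])))
            # gradient of the log-likelihood: +(1 - p) for s_w, -(1 - p) for s_l
            s[w] += lr * (1.0 - p)
            s[l] -= lr * (1.0 - p)
    return s

# Example: items 0, 1, 2; most pairwise preferences favor 0 over 1 over 2.
prefs = [(0, 1), (0, 2), (1, 2), (0, 1), (1, 2), (2, 0)]
scores = bt_fit_scores(prefs, n_items=3)
ranking = sorted(range(3), key=lambda d: -scores[d])
print(ranking)  # expected ranking: [0, 1, 2]
```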

Data collected in a classroom during Fall 2013:
- The first large real-world evaluation of machine-learning-based peer-grading techniques.
- Two stages: Project Posters (PO) and Final Reports (FR).
- Students provided cardinal grades on a 10-point scale: 10 = Perfect, 8 = Good, 5 = Borderline, 3 = Deficient.
- Conventional grading (TA and instructor grading) was also performed.

Statistic                  Poster   Report
Number of Assignments      42       44
Number of Peer Reviewers
Total Peer Reviews
Number of TA Reviewers     7        9
Total TA Reviews           78       88

- Evaluation uses the Kendall-Tau error measure (lower is better).
- The ordinal methods are as good as the cardinal methods, despite using less information.
- TAs had an error of 22.0 ± 16.0 (Posters) and 22.2 ± 6.8 (Report).
[Chart: Kendall-Tau error of the different methods, with a cardinal baseline for comparison.]
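A minimal sketch of one common way to compute such a Kendall-Tau error, as the percentage of item pairs that the two orderings disagree on; the paper's exact normalization may differ.

```python
from itertools import combinations

def kendall_tau_error(predicted, reference):
    """Percentage of item pairs ordered differently in the two rankings
    (0 = identical orderings, 100 = completely reversed)."""
    pos_p = {d: i for i, d in enumerate(predicted)}
    pos_r = {d: i for i, d in enumerate(reference)}
    common = [d for d in predicted if d in pos_r]
    pairs = list(combinations(common, 2))
    if not pairs:
        return 0.0
    discordant = sum(
        1 for x, y in pairs
        if (pos_p[x] - pos_p[y]) * (pos_r[x] - pos_r[y]) < 0
    )
    return 100.0 * discordant / len(pairs)

print(kendall_tau_error(["P", "Q", "R"], ["P", "R", "Q"]))  # 33.33...
```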

- Lazy graders were added to the data: can we identify them?
- The ordinal methods do significantly better than the cardinal methods and simple heuristics.
- Results are better for posters due to more data.
[Chart: lazy-grader identification, with a cardinal baseline for comparison.]
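One simple way to act on estimated grader reliabilities, sketched with hypothetical names and values: flag the graders with the lowest estimated reliability as candidate lazy graders. This is an illustration only, not the detection rule used in the paper.

```python
def flag_lazy_graders(reliability, k=2):
    """Given estimated per-grader reliabilities (e.g., from a fitted model),
    return the k graders with the lowest reliability as likely lazy graders."""
    return sorted(reliability, key=reliability.get)[:k]

# Example with hypothetical reliability estimates:
print(flag_lazy_graders({"g1": 4.2, "g2": 0.3, "g3": 2.8, "g4": 0.1}))
# ['g4', 'g2']
```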


- Showed the benefits of ordinal peer grading for large classes.
- Adapted rank aggregation methods to perform ordinal peer grading.
- Using data from an actual classroom, peer grading was found to be a viable alternative to TA grading.
- Students found it helpful and valuable.

- CODE is available at peergrading.org, where a peer evaluation web service is also offered.
- DATA is also available.



- Self-consistent.

- Results are consistent even when sampling graders.
- Scales well (in both performance and time) with the number of reviewers and the ordering size.

Assume a generative model (NCS) for graders: grader g's grade for assignment d is normally distributed,
    y_d(g) ~ N(s_d + b_g, 1 / η_g),
where s_d is the true score of assignment d, b_g is grader g's bias, and η_g is grader g's reliability (precision). Estimation can use MLE, Gibbs sampling, or variational inference (see the sketch below).
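A minimal sketch of fitting this kind of cardinal model with simple alternating updates for scores, biases, and precisions. The slide mentions MLE, Gibbs sampling, or variational inference; the actual procedure (and its identifiability constraints) may differ, and all names here are illustrative.

```python
from collections import defaultdict

def fit_ncs(grades, n_iters=50):
    """grades: list of (grader, assignment, cardinal_score) triples.
    Alternating updates for true scores s_d, grader biases b_g, and
    grader precisions eta_g under y ~ N(s_d + b_g, 1/eta_g)."""
    graders = {g for g, _, _ in grades}
    assignments = {d for _, d, _ in grades}
    s = {d: 0.0 for d in assignments}
    b = {g: 0.0 for g in graders}
    eta = {g: 1.0 for g in graders}

    for _ in range(n_iters):
        # Update true scores: precision-weighted average of debiased grades.
        num, den = defaultdict(float), defaultdict(float)
        for g, d, y in grades:
            num[d] += eta[g] * (y - b[g])
            den[d] += eta[g]
        for d in assignments:
            s[d] = num[d] / den[d]
        # Update biases: average residual per grader.
        resid = defaultdict(list)
        for g, d, y in grades:
            resid[g].append(y - s[d])
        for g in graders:
            b[g] = sum(resid[g]) / len(resid[g])
        # Update precisions: inverse residual variance (with a floor for stability).
        for g in graders:
            var = sum((r - b[g]) ** 2 for r in resid[g]) / len(resid[g])
            eta[g] = 1.0 / max(var, 1e-3)
    return s, b, eta

grades = [("g1", "P", 8), ("g1", "Q", 5), ("g2", "P", 9), ("g2", "Q", 6), ("g2", "R", 3)]
s, b, eta = fit_ncs(grades)
print(sorted(s, key=s.get, reverse=True))  # assignments ordered best-first
```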

Proposed/adapted different rank-aggregation methods for the OPG problem:
- Ordering-based distributions: Mallows model (MAL) and score-weighted Mallows (MALS).
- Pairwise-preference-based distributions: Bradley-Terry model (BT) and Thurstone model (THUR).
- Plackett-Luce model (PL): an extension of BT for orderings.

- Inter-TA Kendall-Tau error: 47.5 ± 21.0 (Posters) and 34.0 ± 13.8 (Report).
[Chart: comparison with a cardinal baseline.]