1 Predicting Consensus Ranking in Crowdsourced Setting Xi Chen Mentors: Paul Bennett and Eric Horvitz Collaborator: Kevyn Collins-Thompson Machine Learning Department Carnegie Mellon University

2 Judgment • Judgment is important in many domains, e.g., search, ads, and games. • Judgment types: absolute vs. relative. • Absolute judgments are noisy and show less agreement. • Relative judgments have much higher agreement. • Relative judgments are faster per judgment. • Judgment in crowdsourced settings: judgments are provided by multiple annotators. Example: absolute ratings A: 5, B: 3, C: 1 correspond to the pair-wise comparisons A > B, A > C, B > C. [Carterette et al., ECIR 2008]
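To make the contrast concrete, here is a minimal Python sketch (illustrative names only) that derives the relative judgments shown on the slide from the absolute ratings A: 5, B: 3, C: 1:

    from itertools import combinations

    # Absolute judgments from the slide: item -> rating.
    absolute = {"A": 5, "B": 3, "C": 1}

    # Each pair of items yields one relative judgment: the higher-rated item wins.
    pairs = [(i, j) if absolute[i] > absolute[j] else (j, i)
             for i, j in combinations(absolute, 2)]

    print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')], i.e. A > B, A > C, B > C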

3 Challenges for Consensus from the Crowd • Not all annotators are equally reliable. This is true in many settings but is exacerbated in the crowd setting. Approach: model annotators' reliability and incorporate it into a well-studied probabilistic ranking-aggregation framework. • Reliability can depend on how well the annotator fits the task: an annotator's reliability may depend on characteristics of both the annotator and the task. Approach: introduce pooling across tasks and incorporate a representation of task features into the model. • Maximum quality for minimum cost: we must choose not only which objects to label but also who should label them. This is active learning with no oracle. Approach: formalize it as an exploitation-exploration tradeoff, compute online over all annotator-object pairs, and derive a constant-time Bayesian online update.

4 Ranking Aggregation • Permutation-based methods: Mallows model (Mallows, 1957); CPS (Qin et al., 2010). Computationally very expensive; rely on approximation heuristics. • Score-based methods: learn a real-valued score for each object: Bradley-Terry model (Bradley and Terry, 1952); Plackett-Luce (Luce, 1959; Plackett, 1975); Thurstone (Thurstone, 1927); low-rank approximation (Jiang et al., 2009; Gleich et al., 2011).
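For reference, the score-based models above attach a latent score s_i to each object i. In standard form (not specific to this talk), the Bradley-Terry pairwise probability and the Plackett-Luce probability of a full ranking \pi over n objects are

    P_{\mathrm{BT}}(i \succ j) = \frac{e^{s_i}}{e^{s_i} + e^{s_j}},
    \qquad
    P_{\mathrm{PL}}(\pi) = \prod_{r=1}^{n} \frac{e^{s_{\pi(r)}}}{\sum_{r'=r}^{n} e^{s_{\pi(r')}}}.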

5 Bradley-Terry Model • Pairwise judgments from three annotators: {A > B, A > C, C > D}; {B > D, C > A, E > D}; {A > B, B > C, C > A}. • Aggregated consensus ranking: A > B > C > E > D.
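As background, a plain maximum-likelihood Bradley-Terry fit pools all comparisons equally, with no reliability weighting, so it need not reproduce the slide's consensus exactly. The sketch below (a minimal illustration, not the talk's aggregation code; a small Gaussian prior is added so the estimate stays finite) fits scores by gradient ascent and prints the induced ranking:

    import math

    # Pooled pairwise judgments from the slide, as (winner, loser).
    pairs = [("A", "B"), ("A", "C"), ("C", "D"),
             ("B", "D"), ("C", "A"), ("E", "D"),
             ("A", "B"), ("B", "C"), ("C", "A")]

    items = sorted({x for p in pairs for x in p})
    scores = {x: 0.0 for x in items}

    def p_win(si, sj):
        # Bradley-Terry: P(i beats j) = e^si / (e^si + e^sj).
        return 1.0 / (1.0 + math.exp(sj - si))

    lr, reg = 0.1, 0.1   # step size and L2 regularization (keeps the estimate bounded)
    for _ in range(2000):
        grad = {x: -reg * scores[x] for x in items}
        for w, l in pairs:
            p = p_win(scores[w], scores[l])
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        for x in items:
            scores[x] += lr * grad[x]

    print(sorted(items, key=lambda x: -scores[x]))  # items ordered by fitted score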

6 Roadmap • Modeling Annotators' Reliability for Probabilistic Ranking • Active Learning with an Exploitation-Exploration Tradeoff • Ranking in a Multitask Setting

7 Model the Reliability of Annotators • Types of annotators: (1) perfect annotator, (2) random annotator, (3) malicious annotator. • Most annotators are good but imperfect; we need to quantify their reliability. • Model the reliability of each annotator.

8 Interpretation and CrowdBT • Perfect annotator • Random annotator • Malicious annotator • Maximize the log-likelihood.
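Written out (notation mine, following the standard CrowdBT formulation), annotator k's probability of reporting i > j mixes the Bradley-Terry probability with its reverse according to a per-annotator reliability \eta_k:

    P_k(i \succ j) \;=\; \eta_k \frac{e^{s_i}}{e^{s_i}+e^{s_j}} \;+\; (1-\eta_k)\frac{e^{s_j}}{e^{s_i}+e^{s_j}},
    \qquad
    \max_{s,\,\eta} \;\sum_k \sum_{(i \succ j) \in D_k} \log P_k(i \succ j),

so a perfect annotator corresponds to \eta_k = 1, a random annotator to \eta_k = 0.5, and a malicious annotator to \eta_k = 0.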

9 Effect of Annotators' Average Reliability • Setup: 100 objects (scores 1 to 100), 100 annotators, 400 pairs, each pair labeled by 10 annotators; reliability initialized from performance on 5 golden pairs [W. Yih, 09]. • Conclusions: (1) when average reliability > 0.5, CrowdBT works well with all-one initialization; (2) when average reliability <= 0.5, CrowdBT works well with a rough estimate of reliability.
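The simulation setup on this slide can be mocked up in a few lines. The sketch below is only one plausible reading of the protocol; the reliability distribution and the way labels are flipped are assumptions, not stated on the slide:

    import random

    random.seed(0)
    n_objects, n_annotators, n_pairs, labels_per_pair = 100, 100, 400, 10

    true_score = {i: i + 1 for i in range(n_objects)}                      # scores 1..100
    reliability = [random.uniform(0.3, 1.0) for _ in range(n_annotators)]  # assumed range

    pairs = [tuple(random.sample(range(n_objects), 2)) for _ in range(n_pairs)]

    data = []  # (annotator, winner, loser)
    for i, j in pairs:
        for k in random.sample(range(n_annotators), labels_per_pair):
            truth_i_wins = true_score[i] > true_score[j]
            follows_truth = random.random() < reliability[k]  # correct answer w.p. reliability
            winner, loser = (i, j) if truth_i_wins == follows_truth else (j, i)
            data.append((k, winner, loser))

    print(len(data))  # 400 pairs x 10 annotators = 4000 labeled comparisons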

10 Roadmap • Modeling Annotators' Reliability for Probabilistic Ranking: (1) extend the Bradley-Terry model to incorporate annotators' reliability; (2) distinguish different types of annotators and automatically recover from errors introduced by malicious ones; (3) average reliability vs. initialization strategy. • Active Learning with an Exploitation-Exploration Tradeoff • Ranking in a Multitask Setting

11 Consensus Ranking for Different Tasks

12 Picture This Dataset • Statistics: 35 queries (features extracted using an ODP classifier), 44,419 players, 483,477 pairs of images [Bennett et al., 2009]. • For each query and player, there are only 2.79 labeled pairs on average. • Goal: explore the commonality among tasks.

13 Roadmap • Modeling Annotators' Reliability for Probabilistic Ranking: (1) extend the Bradley-Terry model to incorporate annotators' reliability; (2) distinguish different types of annotators and automatically recover from errors introduced by malicious ones; (3) average reliability vs. initialization strategy. • Active Learning with an Exploitation-Exploration Tradeoff • Ranking in a Multitask Setting: (1) explore the commonality of annotators across tasks; (2) utilize task features in modeling annotators' reliability.

14 Active Learning • Classical active learning: an oracle provides the correct answer; optimally select the next sample (pictures from [Settles 2009]). • Active learning in the crowd: optimally select the next sample and decide whom to assign it to; assign uncertain samples to good annotators, and assign certain samples to test annotators' reliability. • Exploitation-exploration tradeoff: (a) exploit labels for uncertain samples from annotators with known reliability; (b) explore annotators' reliability to discover good annotators (a sketch of one possible selection score follows below). • Computational challenge: online learning algorithms.
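One possible instantiation of such a selection score is sketched below. It is illustrative only (the talk derives its own criterion): it combines the model's uncertainty about a pair with the remaining uncertainty about an annotator's reliability.

    import math

    def selection_score(mu_i, var_i, mu_j, var_j, alpha_k, beta_k, c=1.0):
        # Illustrative explore-exploit score for assigning pair (i, j) to annotator k.
        # mu/var are Gaussian posteriors over object scores; (alpha_k, beta_k) is the
        # Beta posterior over the annotator's reliability; c weights exploration.

        # Exploitation: entropy of the predicted comparison outcome
        # (a logistic approximation inflated by the score uncertainty).
        p = 1.0 / (1.0 + math.exp(-(mu_i - mu_j) / math.sqrt(1.0 + var_i + var_j)))
        pair_uncertainty = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

        # Exploration: variance of the Beta posterior over reliability.
        s = alpha_k + beta_k
        annotator_uncertainty = (alpha_k * beta_k) / (s * s * (s + 1.0))

        return pair_uncertainty + c * annotator_uncertainty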

15 Online Learning: Bayesian Modeling • Assign priors. • Bayesian inference: likelihood, prior, posterior, approximation. • Two approximations: (1) independence across factors; (2) Gaussian + Beta distributions.
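Spelled out (a sketch of the modeling choices named on the slide; the exact parameterization is an assumption), the priors and the per-comparison update are

    s_i \sim \mathcal{N}(\mu_i, \sigma_i^2), \qquad \eta_k \sim \mathrm{Beta}(\alpha_k, \beta_k),

    p(s_i, s_j, \eta_k \mid i \succ_k j) \;\propto\; P_k(i \succ j)\,
    \mathcal{N}(s_i; \mu_i, \sigma_i^2)\, \mathcal{N}(s_j; \mu_j, \sigma_j^2)\,
    \mathrm{Beta}(\eta_k; \alpha_k, \beta_k),

and the updated posterior is again approximated by independent Gaussian and Beta factors whose parameters are set by matching first and second moments.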

16 Active Learning

17 Active Learning: Simulated Study (plot contrasting too much exploration with too little exploration)

18 Area under the active learning curve

19 Bayesian Inference • Many choices: MCMC, variational inference, expectation propagation. • How do we get constant-time inference for each pair? Moment matching! • How do we estimate the required first- and second-order moments? Using the technique from [Weng et al., 2011] (Stein's Lemma). • Result: a constant-time update!
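The closed-form updates in the talk come from Stein's lemma; as a stand-in that shows the same idea, the sketch below does the moment matching numerically with a fixed sample budget, so each update still has constant cost. It is an illustration, not the talk's derivation:

    import math, random

    random.seed(1)

    def crowd_bt_likelihood(si, sj, eta):
        # P(annotator reports i > j) under the reliability-weighted Bradley-Terry model.
        p = 1.0 / (1.0 + math.exp(sj - si))
        return eta * p + (1.0 - eta) * (1.0 - p)

    def moment_match_update(mu_i, var_i, mu_j, var_j, alpha, beta, n=20000):
        # Observe "annotator k says i > j"; refit independent Gaussian/Beta factors
        # by matching the first two posterior moments (Monte Carlo).
        w_sum = m1_i = m2_i = m1_j = m2_j = m1_e = m2_e = 0.0
        for _ in range(n):
            si = random.gauss(mu_i, math.sqrt(var_i))
            sj = random.gauss(mu_j, math.sqrt(var_j))
            eta = random.betavariate(alpha, beta)
            w = crowd_bt_likelihood(si, sj, eta)   # prior samples weighted by the likelihood
            w_sum += w
            m1_i += w * si; m2_i += w * si * si
            m1_j += w * sj; m2_j += w * sj * sj
            m1_e += w * eta; m2_e += w * eta * eta
        mu_i_new, mu_j_new, e_mean = m1_i / w_sum, m1_j / w_sum, m1_e / w_sum
        var_i_new = m2_i / w_sum - mu_i_new ** 2
        var_j_new = m2_j / w_sum - mu_j_new ** 2
        e_var = m2_e / w_sum - e_mean ** 2
        common = e_mean * (1.0 - e_mean) / e_var - 1.0   # Beta moment matching
        return mu_i_new, var_i_new, mu_j_new, var_j_new, e_mean * common, (1.0 - e_mean) * common

    print(moment_match_update(0.0, 1.0, 0.0, 1.0, 4.0, 1.0))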

20 Stein's Lemma [Woodroofe, 1989; Weng et al., 2011]
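For X \sim \mathcal{N}(\mu, \sigma^2) and a differentiable h with \mathbb{E}\,|h'(X)| < \infty, Stein's lemma states

    \mathbb{E}\!\left[ h(X)\,(X - \mu) \right] \;=\; \sigma^2\, \mathbb{E}\!\left[ h'(X) \right],

which is what lets the first and second posterior moments in the preceding update be written in closed form.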

21 Bayesian Inference • Update behavior by annotator type: perfect annotator, random annotator, malicious annotator; random vs. perfect-or-malicious cases.

22 Reading Difficulty Dataset • Dataset: 491 articles with difficulty levels 1 to 12; 624 annotators; 12,728 comparisons (provided by Kevyn Collins-Thompson). • Comparisons needed to reach a given ratio of the best performance (0.6843), across four selection strategies (one of them random): 98%: 1850 / 1400 / 3650 / 7250; 95%: 700 / 850 / 2450 / 5350; 90%: 400 / 450 / 850 / 2150.

23 Reading Difficulty Dataset • Dataset: 491 articles with difficulty levels 1 to 12; 624 annotators; 12,728 comparisons (provided by Kevyn Collins-Thompson).

24 Roadmap • Modeling Annotators' Reliability for Probabilistic Ranking: (1) extend the Bradley-Terry model to incorporate annotators' reliability; (2) distinguish different types of annotators and automatically recover from errors introduced by malicious ones; (3) average reliability vs. initialization strategy. • Active Learning with an Exploitation-Exploration Tradeoff: (1) active learning in the crowd requires selecting both the sample and the annotator; (2) exploitation-exploration tradeoff; (3) efficient online Bayesian updates. • Ranking in a Multitask Setting: (1) explore the commonality of annotators across tasks; (2) utilize task features in modeling annotators' reliability.

25 Conclusions and Future Work • Probabilistic ranking in a crowdsourced setting: extended the classical Bradley-Terry model to model the reliability of annotators; the same techniques can be applied to other models (e.g., Thurstone). • Saving cost: active learning in the crowd with an explicitly modeled exploitation-exploration tradeoff, supported by efficient online learning. • Future work: active learning in batch mode; value of information: how much effort should be spent testing the performance of annotators?

