Vote Calibration in Community Question-Answering Systems Bee-Chung Chen (LinkedIn), Anirban Dasgupta (Yahoo! Labs), Xuanhui Wang (Facebook), Jie Yang (Google) SIGIR 2012 This work was conducted when all authors were affiliated with Yahoo! 1
Why Present This Paper? Vote bias exists in many social media platforms. This paper tackles a problem in a relatively old context, CQA, from a new perspective: crowdsourced identification of quality content. 2
Outline Motivation Related Work Data Set Vote Calibration Model Exploratory Analysis Features Experimental Results Conclusion 3
Community Question Answering Crowdsourced alternative to search engines for providing information 4
Community Question Answering Commercial spam: can mostly be tackled by conventional machine learning. Low-quality content: difficult for machines to detect! Solution: crowdsourced identification of quality content. 5
Voting Mechanism Content quality User expertise 6
Votes in Yahoo! Answers The asker votes for the best answer; if the asker does not vote within a certain period, other users in the community vote. Thumb-up or thumb-down votes on each individual answer. However… are users’ votes always unbiased? 7
Potential Bias Voting more positively for friends’ answers; using votes to show appreciation instead of identifying high-quality content; gaming the system to obtain high status (multiple accounts voting for one another); for questions about opinions, voting for answers that share the same opinion; … 8
Potential Bias Trained human editors judged answers based on a set of well-defined guidelines. Raw user votes have low correlation with the editorial judgments. 9
Motivation Propose the problem of vote calibration in CQA systems. Based on exploratory data analysis, identify a variety of potential factors that bias the votes. Develop a model for vote calibration based on supervised learning, using a content-agnostic approach. 10
Related Work Predicting the user-voted best answer – Assumption: readily available user-voted best answers are ground truth. Predicting editorial judgments – User votes are used as features, but calibration of each individual vote has not been studied. Content-agnostic user expertise estimation. 11
Dataset Editorial data – Sample questions and answers from Yahoo! Answers – Each answer receives a quality grade according to a pre-determined set of editorial guidelines: excellent, good, fair, bad – 21,525 editorially judged answers on 7,372 questions 12
Dataset The distribution of editorial grades for best answers is not very different from that for non-best answers: low correlation between users’ best-answer votes and answer quality. A significant percentage (>70%) of best answers are not even good, and many non-best answers are actually good or excellent. 13
Dataset Numeric quality scores: excellent = 1, good = 0.5, fair = 0, bad = -0.5. Voting data: 1.3M questions, 7.0M answers, 0.5M asker best-answer votes, 2.1M community best-answer votes, 9.1M thumb-up/down votes. 14
Vote Calibration Model 15
Vote Calibration Model Three types of votes – Asker votes: best-answer votes by the asker (+1 for the best answer, -1 for the other answers) – CBA votes: community best-answer votes (+1 from a voter for the answer they vote best, -1 from that voter for the other answers) – Thumb votes: thumb-up and thumb-down (+1 for thumb-up, -1 for thumb-down) 16
Average Vote of an Answer Average of the calibrated type-t votes on an answer, smoothed by pseudo votes that act as a prior 17
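The equation on this slide did not survive extraction; the following is a hedged reconstruction from its labels ("pseudo votes", "prior", "calibrated type-t votes"), where the pseudo-vote count $n_0$, the prior value $\mu_0$, and the calibrated-vote symbol $\hat{v}$ are assumed names, not notation confirmed by the slide:

```latex
\bar{v}_t(j) \;=\; \frac{n_0\,\mu_0 \;+\; \sum_{i \in V_t(j)} \hat{v}_t(i,j)}{n_0 \;+\; |V_t(j)|}
```

Here $V_t(j)$ is the set of users who cast a type-$t$ vote on answer $j$, and $\hat{v}_t(i,j)$ is user $i$'s calibrated type-$t$ vote; the $n_0$ pseudo votes with value $\mu_0$ pull the average toward the prior when an answer has few votes.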
Average Vote of an Answerer/User 18
Quality Prediction Function Quality prediction: a weighted sum of the answer-level and user-level average vote values of all types on an answer. Calibrated vote aggregation model: bias term + answer-level terms + user-level terms 19
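The model equation on this slide was also lost to extraction; a hedged reconstruction consistent with the slide's labels (bias term, answer-level, user-level), where $b$, $\alpha_t$, $\beta_t$, and the author notation $a_j$ are assumed symbols:

```latex
\hat{q}(j) \;=\; b \;+\; \sum_{t} \alpha_t\, \bar{v}_t(j) \;+\; \sum_{t} \beta_t\, \bar{u}_t(a_j)
```

where $b$ is the bias term, $\bar{v}_t(j)$ is the answer-level average of calibrated type-$t$ votes on answer $j$, and $\bar{u}_t(a_j)$ is the corresponding user-level average for $j$'s author $a_j$; the weights $\alpha_t$ and $\beta_t$ are learned per vote type.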
Training Algorithm Determine model parameters by minimizing a loss function against the editorial quality scores, using gradient descent. 20
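A minimal sketch of this training step, assuming squared loss between the predicted quality and the editorial score and a plain linear parameterization; the learning rate, epoch count, and function name are illustrative choices, not the paper's:

```python
import numpy as np

def fit_weights(X, y, lr=0.1, epochs=2000):
    """Fit a linear quality predictor q(j) = b + w . x_j by gradient
    descent on mean squared loss. Each row x_j stacks the answer-level
    and user-level average vote values for one answer; y holds the
    editorial quality scores (e.g. excellent = 1, bad = -0.5)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        pred = X @ w + b              # predicted quality scores
        grad = pred - y               # d(loss)/d(pred) for squared loss
        w -= lr * (X.T @ grad) / n    # gradient step on the weights
        b -= lr * grad.mean()         # gradient step on the bias term
    return w, b
```

On a toy dataset generated as y = 0.5·x + 0.1, the routine recovers the weight and bias, which is enough to illustrate the mechanics of the update loop.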
Self Voting Self votes contribute 33% of all CBA votes. Among users who cast at least 20 votes, the percentage of self votes exceeds 40%. 21
Vote Spread and Reciprocity 22
Interaction Bias A chi-squared statistic and a randomization test show that past interactions could be useful features for vote calibration. 23
Features Voter features 24
Features Relation features 25
Feature Transformation For each count feature C, consider log(1 + C) as an additional feature. For each ratio feature R, include a quadratic term R². 26
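The transformation above can be sketched in a few lines; the function name and the concatenated layout of the output vector are my assumptions for illustration:

```python
import numpy as np

def expand_features(counts, ratios):
    """Augment raw features as the slide describes: for each count
    feature C append log(1 + C); for each ratio feature R append R**2.
    Returns one flat vector: [counts, log1p(counts), ratios, ratios**2]."""
    counts = np.asarray(counts, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    return np.concatenate([counts, np.log1p(counts), ratios, ratios ** 2])
```

The log transform compresses heavy-tailed count features (e.g. number of votes cast), while the quadratic term lets a linear model capture simple curvature in ratio features.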
Experimental Results User-level expert ranking – how well we rank users based on the predicted user-level scores. Answer ranking – how well we rank answers based on the predicted answer-level scores. 27
Experimental Results 28
Comparison of Calibration Models 29
Impact on Heavy Users 30
Conclusion Introduced the vote calibration problem for CQA. Proposed a set of features to capture bias by analyzing potential biases in users’ voting behavior. Supervised calibrated models outperform their non-calibrated versions. 31
Thanks Q & A 32