Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution
John Le - CrowdFlower, Andy Edmonds - eBay, Vaughn Hester - CrowdFlower, Lukas Biewald - CrowdFlower

Similar presentations
Arnd Christian König Venkatesh Ganti Rares Vernica Microsoft Research Entity Categorization Over Large Document Collections.

CrowdER - Crowdsourcing Entity Resolution
Information Extraction Lecture 4 – Named Entity Recognition II CIS, LMU München Winter Semester Dr. Alexander Fraser, CIS.
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Online Max-Margin Weight Learning for Markov Logic Networks Tuyen N. Huynh and Raymond J. Mooney Machine Learning Group Department of Computer Science.
Fast Algorithms For Hierarchical Range Histogram Constructions
Applying Crowd Sourcing and Workflow in Social Conflict Detection By: Reshmi De, Bhargabi Chakrabarti 28/03/13.
Rethinking Grammatical Error Detection and Evaluation with the Amazon Mechanical Turk Joel Tetreault[Educational Testing Service] Elena Filatova[Fordham.
1 Learning User Interaction Models for Predicting Web Search Result Preferences Eugene Agichtein Eric Brill Susan Dumais Robert Ragno Microsoft Research.
Exploring the Neighborhood with Dora to Expedite Software Maintenance Emily Hill, Lori Pollock, K. Vijay-Shanker University of Delaware.
Search Engines and Information Retrieval
Evaluating Evaluation Measure Stability Authors: Chris Buckley, Ellen M. Voorhees Presenters: Burcu Dal, Esra Akbaş.
Retrieval Evaluation: Precision and Recall. Introduction Evaluation of implementations in computer science often is in terms of time and space complexity.
1 CS 430: Information Discovery Lecture 20 The User in the Loop.
Retrieval Evaluation. Introduction Evaluation of implementations in computer science often is in terms of time and space complexity. With large document.
The Relevance Model  A distribution over terms, given information need I, (Lavrenko and Croft 2001). For term r, P(I) can be dropped w/o affecting the.
CrowdSearch: Exploiting Crowds for Accurate Real-Time Image Search on Mobile Phones Original work by Yan, Kumar & Ganesan Presented by Tim Calloway.
Beyond datasets: Learning in a fully-labeled real world Thesis proposal Alexander Sorokin.
Slide Image Retrieval: A Preliminary Study Guo Min Liew and Min-Yen Kan National University of Singapore Web IR / NLP Group (WING)
Search Engines and Information Retrieval Chapter 1.
TREC 2009 Review Lanbo Zhang. 7 tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR.
Christopher Harris Informatics Program The University of Iowa Workshop on Crowdsourcing for Search and Data Mining (CSDM 2011) Hong Kong, Feb. 9, 2011.
Minimal Test Collections for Retrieval Evaluation B. Carterette, J. Allan, R. Sitaraman University of Massachusetts Amherst SIGIR2006.
Evaluation Experiments and Experience from the Perspective of Interactive Information Retrieval Ross Wilkinson Mingfang Wu ICT Centre CSIRO, Australia.
 An important problem in sponsored search advertising is keyword generation, which bridges the gap between the keywords bidded by advertisers and queried.
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
Optimizing Plurality for Human Intelligence Tasks Luyi Mo University of Hong Kong Joint work with Reynold Cheng, Ben Kao, Xuan Yang, Chenghui Ren, Siyu.
An Analysis of Assessor Behavior in Crowdsourced Preference Judgments Dongqing Zhu and Ben Carterette University of Delaware.
Implicit An Agent-Based Recommendation System for Web Search Presented by Shaun McQuaker Presentation based on paper Implicit:
Improving Web Search Ranking by Incorporating User Behavior Information Eugene Agichtein Eric Brill Susan Dumais Microsoft Research.
Trust-Aware Optimal Crowdsourcing With Budget Constraint Xiangyang Liu 1, He He 2, and John S. Baras 1 1 Institute for Systems Research and Department.
« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » Proceedings of the 30th annual international ACM SIGIR, Amsterdam 2007. A.
Bug Localization with Machine Learning Techniques Wujie Zheng
Applying the KISS Principle with Prior-Art Patent Search Walid Magdy Gareth Jones Dublin City University CLEF-IP, 22 Sep 2010.
April 14, 2003 Hang Cui, Ji-Rong Wen and Tat-Seng Chua Hierarchical Indexing and Flexible Element Retrieval for Structured Document Hang Cui School of.
1 Automating Slot Filling Validation to Assist Human Assessment Suzanne Tamang and Heng Ji Computer Science Department and Linguistics Department, Queens.
Giorgos Giannopoulos (IMIS/”Athena” R.C and NTU Athens, Greece) Theodore Dalamagas (IMIS/”Athena” R.C., Greece) Timos Sellis (IMIS/”Athena” R.C and NTU.
A Novel Local Patch Framework for Fixing Supervised Learning Models Yilei Wang 1, Bingzheng Wei 2, Jun Yan 2, Yang Hu 2, Zhi-Hong Deng 1, Zheng Chen 2.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
CrowdSearch: Exploiting Crowds for Accurate Real-Time Image Search on Mobile Phones Original work by Tingxin Yan, Vikas Kumar, Deepak Ganesan Presented.
Group Work vs. Cooperative Learning
Improving Search Results Quality by Customizing Summary Lengths Michael Kaisser ★, Marti Hearst  and John B. Lowe ★ University of Edinburgh,  UC Berkeley,
Collecting High Quality Overlapping Labels at Low Cost Grace Hui Yang Language Technologies Institute Carnegie Mellon University Anton Mityagin Krysta.
CoCQA : Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity Orientation Baoli Li, Yandong Liu, and Eugene Agichtein.
Diversifying Search Results Rakesh Agrawal Sreenivas Gollapudi Search Labs Microsoft Research Alan Halverson Samuel.
Performance Measures. Why to Conduct Performance Evaluation? 2 n Evaluation is the key to building effective & efficient IR (information retrieval) systems.
David Ackerman, Associate VP Crystal Butler, Research Associate.
Comparing Document Segmentation for Passage Retrieval in Question Answering Jorg Tiedemann University of Groningen presented by: Moy’awiah Al-Shannaq
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
1 Evaluating High Accuracy Retrieval Techniques Chirag Shah,W. Bruce Croft Center for Intelligent Information Retrieval Department of Computer Science.
Improved Video Categorization from Text Metadata and User Comments ACM SIGIR 2011:Research and development in Information Retrieval - Katja Filippova -
ASSOCIATIVE BROWSING Evaluating 1 Jinyoung Kim / W. Bruce Croft / David Smith for Personal Information.
The Loquacious ( 愛說話 ) User: A Document-Independent Source of Terms for Query Expansion Diane Kelly et al. University of North Carolina at Chapel Hill.
PERSONALIZED DIVERSIFICATION OF SEARCH RESULTS Date: 2013/04/15 Author: David Vallet, Pablo Castells Source: SIGIR’12 Advisor: Dr.Jia-ling, Koh Speaker:
Feature Selection Poonam Buch. 2 The Problem  The success of machine learning algorithms is usually dependent on the quality of data they operate on.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
A Framework to Predict the Quality of Answers with Non-Textual Features Jiwoon Jeon, W. Bruce Croft(University of Massachusetts-Amherst) Joon Ho Lee (Soongsil.
Relevant Document Distribution Estimation Method for Resource Selection Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University
Using Blog Properties to Improve Retrieval Gilad Mishne (ICWSM 2007)
Evaluation of IR Systems
FRM: Modeling Sponsored Search Log with Full Relational Model
Modern Information Retrieval
Query Caching in Agent-based Distributed Information Retrieval
Q4 Measuring Effectiveness
Adverse Event Post mentions
COMPUTER NETWORKS PRESENTATION
Marketing Experiments I
Presentation transcript:

Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution John Le - CrowdFlower Andy Edmonds - eBay Vaughn Hester - CrowdFlower Lukas Biewald - CrowdFlower

Background/Motivation
- Human judgments for search relevance evaluation and training
- Quality control in crowdsourcing
- Observed worker regression to the mean over previous months

Our Techniques for Quality Control
- Training data = training questions: questions to which we already know the answer
- Dynamic learning for quality control
- An initial training period
- Per-HIT screening questions
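The training questions act as gold data: a worker's answers to them are checked against the known labels, both during the initial training period and on screening questions mixed into each HIT, and workers whose gold accuracy drops too low can be excluded. Below is a minimal Python sketch of that bookkeeping; the class name, the 0.7 accuracy threshold, and the minimum number of gold questions are illustrative assumptions, not the platform's actual parameters.

```python
from collections import defaultdict

class GoldQuestionTracker:
    """Track per-worker accuracy on training (gold) questions."""

    def __init__(self, accuracy_threshold=0.7, min_gold_seen=4):
        self.correct = defaultdict(int)   # worker_id -> gold answered correctly
        self.seen = defaultdict(int)      # worker_id -> gold answered overall
        self.accuracy_threshold = accuracy_threshold
        self.min_gold_seen = min_gold_seen

    def record(self, worker_id, answer, gold_answer):
        # Update the worker's record after they answer a training question.
        self.seen[worker_id] += 1
        if answer == gold_answer:
            self.correct[worker_id] += 1

    def is_trusted(self, worker_id):
        # A worker counts as trusted once they have seen enough gold questions
        # and their accuracy on them stays at or above the threshold.
        seen = self.seen[worker_id]
        if seen < self.min_gold_seen:
            return False
        return self.correct[worker_id] / seen >= self.accuracy_threshold

tracker = GoldQuestionTracker()
tracker.record("worker_42", "Matching", "Matching")
print(tracker.is_trusted("worker_42"))  # False until enough gold questions are seen
```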

Contributions
- Question explored: does training data setup and distribution affect worker output and final results?
- Why it matters: quality control is paramount
- Quantifying and understanding the effect of training data

The Experiment: AMT
- Mechanical Turk with the CrowdFlower platform
- 25 results per HIT
- 20 cents per HIT
- No Turk qualifications required
- HIT title: "Judge approximately 25 search results for relevance"

Judgment Dataset
- Source: a major online retailer's internal product search projects
- 256 queries with 5 products associated with each query = 1,280 search results to judge
- Example queries: "epiphone guitar", "sofa", and "yamaha a100"

Experimental Manipulation

Judge training question answer distribution skews (by experiment):

Experiment      1        2        3        4        5
Matching        72.7%    58%      45.3%    34.7%    12.7%
Not Matching    8%       23.3%    47.3%    56%      84%
Off Topic       19.3%    18%      7.3%     9.3%     3.3%
Spam            0%       0.7%     0%       0.7%     0%

Underlying distribution skew:

Matching    Not Matching    Off Topic    Spam
14.5%       82.67%          2.5%         0.33%

Experimental Control
- Workers were round-robined into the five simultaneously running experiments
- Only one HIT was visible on Mechanical Turk
- Workers who left and returned were sent back to the same experiment
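A sketch of this control logic, assuming a simple in-memory router: new workers are dealt out round-robin across the five conditions, and a returning worker is looked up and routed back to the condition they saw before. The itertools-based implementation is an assumption for illustration, not the authors' code.

```python
import itertools

class ExperimentRouter:
    def __init__(self, n_conditions=5):
        self._rotation = itertools.cycle(range(1, n_conditions + 1))
        self._assignments = {}  # worker_id -> condition

    def assign(self, worker_id):
        # Sticky routing: a worker who leaves and comes back gets the same condition.
        if worker_id not in self._assignments:
            self._assignments[worker_id] = next(self._rotation)
        return self._assignments[worker_id]

router = ExperimentRouter()
print(router.assign("A1"), router.assign("A2"), router.assign("A1"))  # 1 2 1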

Results
1. Worker participation
2. Mean worker performance
3. Aggregate majority vote accuracy
Performance measures: precision and recall
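The later slides report precision and recall with "Not Matching" treated as the positive class, so a worked sketch of those two measures may help; the example labels below are made up.

```python
def precision_recall(predicted, actual, positive="Not Matching"):
    # Precision: of everything labeled positive, how much really was positive.
    # Recall: of everything really positive, how much was labeled positive.
    tp = sum(1 for p, a in zip(predicted, actual) if p == positive and a == positive)
    fp = sum(1 for p, a in zip(predicted, actual) if p == positive and a != positive)
    fn = sum(1 for p, a in zip(predicted, actual) if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

pred = ["Not Matching", "Matching", "Not Matching"]
gold = ["Not Matching", "Not Matching", "Not Matching"]
print(precision_recall(pred, gold))  # (1.0, 0.666...)
```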

Worker Participation

The slide reports, per experiment, how many workers came to the task, did training, passed training, and failed training, along with the pass rate:

Experiment        1      2      3        4      5
Percent Passed    73%    72%    92.6%    74%    80.9%

(Experiment 1 is the Matching-skewed condition; Experiment 5 is the Not Matching-skewed condition.)

Mean Worker Performance

Per-experiment mean worker metrics reported: Accuracy (overall), Precision (Not Matching), and Recall (Not Matching), again ranging from the Matching-skewed condition (Experiment 1) to the Not Matching-skewed condition (Experiment 5).

Aggregate Majority Vote Accuracy: Trusted Workers

(Chart: majority-vote accuracy per experiment, compared against the underlying distribution skew.)
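The aggregation behind these accuracy numbers is a majority vote over the labels given by trusted workers. A minimal sketch, assuming one label per trusted worker per query-product unit and arbitrary tie-breaking:

```python
from collections import Counter

def majority_vote(judgments):
    """judgments: dict mapping a unit id to the list of trusted-worker labels."""
    return {unit: Counter(labels).most_common(1)[0][0]
            for unit, labels in judgments.items()}

votes = {("sofa", "item_123"): ["Matching", "Matching", "Not Matching"]}
print(majority_vote(votes))  # {('sofa', 'item_123'): 'Matching'}
```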

Aggregate Majority Vote Performance Measures

Precision and recall of the aggregate majority vote, per experiment (1-5), from the Matching-skewed to the Not Matching-skewed condition.

Discussion and Limitations
- Maximizing the entropy of the training question answer distribution minimizes the signal workers can perceive about the expected answers
- This matters most when the underlying distribution is skewed
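To make the entropy point concrete: a near-uniform training distribution has high entropy and leaks little about the expected answer, while the skewed underlying distribution has much lower entropy. The sketch below computes Shannon entropy for Experiment 3's training skew and for the underlying skew, using the values from the Experimental Manipulation table; the entropy computation itself is standard and not taken from the paper.

```python
import math

def entropy(probs):
    # Shannon entropy in bits; zero-probability classes contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

experiment_3 = [0.453, 0.473, 0.073, 0.0]      # closest to balanced over Matching/Not Matching
underlying   = [0.145, 0.8267, 0.025, 0.0033]  # heavily skewed toward Not Matching
print(round(entropy(experiment_3), 3), round(entropy(underlying), 3))
```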

Future Work
- Optimal judgment task design and metrics
- Quality control enhancements
- Separate validation and ongoing training
- Long-term worker performance optimizations
- Incorporation of active learning
- IR performance metric analysis

Acknowledgements We thank Riddick Jiang for compiling the dataset for this project. We thank Brian Johnson (eBay), James Rubinstein (eBay), Aaron Shaw (Berkeley), Alex Sorokin (CrowdFlower), Chris Van Pelt (CrowdFlower) and Meili Zhong (PayPal) for their assistance with the paper.

QUESTIONS? Thanks!