Crowdsourcing using Mechanical Turk Quality Management and Scalability Panos Ipeirotis – New York University.


1 Crowdsourcing using Mechanical Turk: Quality Management and Scalability
Panos Ipeirotis – New York University

2 Panos Ipeirotis - Introduction
- New York University, Stern School of Business
- “A Computer Scientist in a Business School”

3 Example: Build an “Adult Web Site” Classifier
- Need a large number of hand-labeled sites
- Get people to look at sites and classify them as:
  - G (general audience)
  - PG (parental guidance)
  - R (restricted)
  - X (porn)
Cost/Speed Statistics
- Undergrad intern: 200 websites/hr, cost: $15/hr

4 Amazon Mechanical Turk: Paid Crowdsourcing

5

6 Example: Build an “Adult Web Site” Classifier
- Need a large number of hand-labeled sites
- Get people to look at sites and classify them as:
  - G (general audience)
  - PG (parental guidance)
  - R (restricted)
  - X (porn)
Cost/Speed Statistics
- Undergrad intern: 200 websites/hr, cost: $15/hr
- MTurk: 2500 websites/hr, cost: $12/hr

7 Bad news: Spammers!
Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience)

8 Improve Data Quality through Repeated Labeling
- Get multiple, redundant labels using multiple workers
- Pick the correct label based on majority vote
- Probability of correctness increases with number of workers
- Probability of correctness increases with quality of workers
Example: 1 worker: 70% correct; 11 workers: 93% correct
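The effect of redundancy on the slide above can be checked with a short calculation. This is a minimal sketch, assuming a binary task and workers who answer independently with the same accuracy (neither assumption is stated on the slide; the function name is illustrative). Under these assumptions, 11 workers at 70% accuracy give a majority that is correct about 92% of the time, close to the slide's 93% figure.

```python
from math import comb


def majority_correct_probability(n_workers: int, accuracy: float) -> float:
    """Probability that a strict majority of n independent workers is correct.

    Assumes a binary task and an odd number of workers (no ties).
    """
    if n_workers % 2 == 0:
        raise ValueError("use an odd number of workers to avoid ties")
    needed = n_workers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_workers, k) * accuracy**k * (1 - accuracy) ** (n_workers - k)
        for k in range(needed, n_workers + 1)
    )


for n in (1, 3, 5, 11):
    print(n, round(majority_correct_probability(n, 0.7), 3))
```

Raising either `n_workers` or `accuracy` increases the result, which matches the last two bullets of the slide.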

9 Using redundant votes, we can infer worker quality
- Look at our spammer friend ATAMRO447HWJQ together with 9 other workers
- We can compute error rates for each worker
Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…
Error rates for ATAMRO447HWJQ:
  P[X → X] = 0.847%    P[X → G] = 99.153%
  P[G → X] = 0.053%    P[G → G] = 99.947%
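One simple way to get per-worker error rates from redundant labels is to treat the majority label of each site as a stand-in for the truth and count how each worker's answers compare to it. The sketch below does that; it is an illustration under that assumption, not necessarily the estimation method behind the numbers on the slide, and the worker IDs, site IDs, and function names are made up.

```python
from collections import Counter, defaultdict


def majority_labels(labels):
    """labels: list of (worker_id, item_id, label). Returns item_id -> majority label."""
    votes = defaultdict(Counter)
    for _worker, item, label in labels:
        votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


def worker_confusion(labels, classes):
    """Estimate P[true -> assigned] per worker, using majority labels as proxy truth."""
    truth = majority_labels(labels)
    counts = defaultdict(lambda: {t: Counter() for t in classes})
    for worker, item, label in labels:
        counts[worker][truth[item]][label] += 1
    rates = {}
    for worker, per_true in counts.items():
        rates[worker] = {}
        for t in classes:
            total = sum(per_true[t].values())
            rates[worker][t] = {
                a: (per_true[t][a] / total if total else 0.0) for a in classes
            }
    return rates


# Toy data: two careful workers and one who always answers G.
labels = [
    ("w1", "s1", "X"), ("w2", "s1", "X"), ("spam", "s1", "G"),
    ("w1", "s2", "X"), ("w2", "s2", "X"), ("spam", "s2", "G"),
    ("w1", "s3", "G"), ("w2", "s3", "G"), ("spam", "s3", "G"),
]
rates = worker_confusion(labels, ["G", "X"])
print(rates["spam"])
```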

10 Rejecting spammers and Benefits
Random answers error rate = 50%
Average error rate for ATAMRO447HWJQ: 49.6%
  P[X → X] = 0.847%    P[X → G] = 99.153%
  P[G → X] = 0.053%    P[G → G] = 99.947%
Action: REJECT and BLOCK
Results:
- Over time you block all spammers
- Spammers learn to avoid your HITs
- You can decrease redundancy, as quality of workers is higher
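The 49.6% figure is the plain average of the two per-class error rates in the table. A small sketch of that arithmetic follows; the comparison against a random baseline uses a tolerance of 0.05, which is a hypothetical choice for illustration and does not come from the slides.

```python
def average_error_rate(confusion):
    """Mean of per-class error rates; confusion[true][assigned] is a probability."""
    per_class = [1.0 - row[true] for true, row in confusion.items()]
    return sum(per_class) / len(per_class)


# Error rates for ATAMRO447HWJQ, copied from the slide.
atamro = {
    "X": {"X": 0.00847, "G": 0.99153},
    "G": {"X": 0.00053, "G": 0.99947},
}

rate = average_error_rate(atamro)
random_baseline = 1.0 - 1.0 / len(atamro)  # 50% for two classes
# Hypothetical rule: flag workers whose error rate is within 0.05 of random.
looks_like_spammer = abs(rate - random_baseline) < 0.05

print(f"{rate:.1%}", looks_like_spammer)
```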

11 Too much theory? Demo and Open source implementation available at:
Input:
- Labels from Mechanical Turk
- Some “gold” data (optional)
- Cost of incorrect labelings (e.g., X → G costlier than G → X)
Output:
- Corrected labels
- Worker error rates
- Ranking of workers according to their quality
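To see why the cost of incorrect labelings matters as an input, here is a minimal sketch of picking a label by expected cost rather than by plain probability. The posterior and the cost values are invented for illustration, and the function is not the API of the tool mentioned on the slide.

```python
def min_cost_label(posterior, cost):
    """Pick the label with the lowest expected misclassification cost.

    posterior[true] is the probability of each true class for one site;
    cost[true][assigned] is the cost of assigning a label given the truth.
    """
    def expected_cost(assigned):
        return sum(p * cost[true][assigned] for true, p in posterior.items())

    return min(posterior, key=expected_cost)


posterior = {"G": 0.8, "X": 0.2}  # hypothetical belief about one site

symmetric = {"G": {"G": 0, "X": 1}, "X": {"G": 1, "X": 0}}
# Hypothetical asymmetric costs: labeling an X site as G is 10x worse.
asymmetric = {"G": {"G": 0, "X": 1}, "X": {"G": 10, "X": 0}}

print(min_cost_label(posterior, symmetric))
print(min_cost_label(posterior, asymmetric))
```

With symmetric costs the most probable class wins; once X → G mistakes are made expensive, the same site is labeled X even though G is more likely.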

12

13

14 How to handle free-form answers?
- Q: “My task does not have discrete answers….”
- A: Break into two HITs:
  - “Create” HIT
  - “Vote” HIT
- Vote HIT controls quality of Creation HIT
- Redundancy controls quality of Voting HIT
- Catch: If “creation” very good, in voting workers just vote “yes”
  - Solution: Add some random noise (e.g. misspell)
Creation HIT (e.g. transcribe caption) → Voting HIT: Correct or not?
Example: Collect URLs

15 But my free-form is not just right or wrong…
- “Create” HIT
- “Improve” HIT
- “Compare” HIT
TurkIt toolkit:
Creation HIT (e.g. describe the image) → Improve HIT (e.g. improve description) → Compare HIT (voting): Which is better?
Describe this

16
version 1: A parial view of a pocket calculator together with some coins and a pen.
version 2: A view of personal items a calculator, and some gold and copper coins, and a round tip pen, these are all pocket and wallet sized item used for business, writting, calculating prices or solving math problems and purchasing items.
version 3: A close-up photograph of the following items: A CASIO multi-function calculator. A ball point pen, uncapped. Various coins, apparently European, both copper and gold. Seems to be a theme illustration for a brochure or document cover treating finance, probably personal finance.
version 4: …Various British coins; two of £1 value, three of 20p value and one of 1p value. …
version 8: “A close-up photograph of the following items: A CASIO multi-function, solar powered scientific calculator. A blue ball point pen with a blue rubber grip and the tip extended. Six British coins; two of £1 value, three of 20p value and one of 1p value. Seems to be a theme illustration for a brochure or document cover treating finance - probably personal finance.”

17 Future: Break big tasks into simple ones and build a workflow
- Running experiment: Crowdsource big tasks (e.g., tourist guide)
  - Identify sights worth checking out (one tip per worker); vote and rank
  - Brief tips for each monument (one tip per worker); vote and rank
  - Aggregate tips in meaningful summary; iterate to improve…
My Boss is a Robot (mybossisarobot.com)
Nikki Kittur (Carnegie Mellon) + Jim Giles (New Scientist)

18 Thank you! Questions?
“A Computer Scientist in a Business School”

19 Correcting biases
- Classifying sites as G, PG, R, X
- Sometimes workers are careful but biased
- Classifies G → P and P → R
- Average error rate: too high
Is she a spammer?
Error Rates for CEO of company detecting offensive content (and parent):
  P[G → G] = 20.0%    P[G → P] = 80.0%    P[G → R] = 0.0%      P[G → X] = 0.0%
  P[P → G] = 0.0%     P[P → P] = 0.0%     P[P → R] = 100.0%    P[P → X] = 0.0%
  P[R → G] = 0.0%     P[R → P] = 0.0%     P[R → R] = 100.0%    P[R → X] = 0.0%
  P[X → G] = 0.0%     P[X → P] = 0.0%     P[X → R] = 0.0%      P[X → X] = 100.0%

20 Correcting biases
- For ATLJIK76YH1TF, we simply need to “reverse the errors” (technical details omitted) and separate error and bias
- True error-rate ~ 9%
Error Rates for Worker: ATLJIK76YH1TF
  P[G → G] = 20.0%    P[G → P] = 80.0%    P[G → R] = 0.0%      P[G → X] = 0.0%
  P[P → G] = 0.0%     P[P → P] = 0.0%     P[P → R] = 100.0%    P[P → X] = 0.0%
  P[R → G] = 0.0%     P[R → P] = 0.0%     P[R → R] = 100.0%    P[R → X] = 0.0%
  P[X → G] = 0.0%     P[X → P] = 0.0%     P[X → R] = 0.0%      P[X → X] = 100.0%
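The slide omits the technical details, so the following is only one plausible reading of “reverse the errors”: use Bayes' rule to turn the label a biased worker assigned into a probability distribution over the true class. The uniform prior is an assumption, and this sketch does not attempt to reproduce the ~9% figure, which depends on the omitted details.

```python
def posterior_true_class(confusion, prior, assigned):
    """P[true class | worker assigned this label], via Bayes' rule.

    confusion[true][assigned] is the worker's error-rate table.
    """
    joint = {t: prior[t] * confusion[t][assigned] for t in prior}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("this worker never assigns that label")
    return {t: v / total for t, v in joint.items()}


# Error rates for ATLJIK76YH1TF, copied from the slide (P stands for PG).
worker = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}
uniform = {c: 0.25 for c in worker}  # assumed prior, not from the slides

says_p = posterior_true_class(worker, uniform, "P")
says_r = posterior_true_class(worker, uniform, "R")
print(says_p)
print(says_r)
```

Under these assumptions, a “P” from this worker points unambiguously to a G site, so the label is informative despite being nominally wrong; an “R” remains ambiguous between P and R. That distinction between systematic bias and genuine uncertainty is what the slide's bullet describes.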

21 Scaling Crowdsourcing: Use Machine Learning
- Human labor is expensive, even when paying cents
- Need to scale crowdsourcing
- Basic idea: Build a machine learning model and use it instead of humans
Diagram: Data from existing crowdsourced answers → Automatic Model (through machine learning); New Case → Automatic Model → Automatic Answer

22 Tradeoffs for Automatic Models: Effect of Noise
- Get more data
- Improve model accuracy
- Improve data quality
- Improve classification
Example Case: Porn or not?
[Figure: one curve each for Data Quality = 50%, 60%, 80%, 100%]

23 Scaling Crowdsourcing: Iterative training
- Use machine when confident, humans otherwise
- Retrain with new human input → improve model → reduce need for humans
Diagram: Data from existing crowdsourced answers → Automatic Model (through machine learning); New Case → Automatic Model; Confident → Automatic Answer; Not confident → Get human(s) to answer
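The routing rule on this slide can be sketched in a few lines. Everything named here is hypothetical: `model` is any callable returning a label and a confidence, `ask_human` stands in for posting a HIT, and the 0.9 threshold is an arbitrary illustration. The human answers are collected so they can be fed back into retraining.

```python
def label_stream(cases, model, ask_human, threshold=0.9):
    """Use the model when it is confident, otherwise fall back to a human.

    Returns the answers plus the new human-labeled examples for retraining.
    """
    answers, new_training = {}, []
    for case in cases:
        label, confidence = model(case)
        if confidence >= threshold:
            answers[case] = (label, "machine")
        else:
            human_label = ask_human(case)
            answers[case] = (human_label, "human")
            new_training.append((case, human_label))
    return answers, new_training


# Toy stand-ins for a trained classifier and a crowd worker.
def toy_model(url):
    return ("X", 0.95) if "xxx" in url else ("G", 0.6)


def toy_human(url):
    return "G"


answers, new_training = label_stream(
    ["xxx-videos.example", "news.example"], toy_model, toy_human
)
print(answers)
print(new_training)
```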

24 Tradeoffs for Automatic Models: Effect of Noise
- Get more data
- Improve model accuracy
- Improve data quality
- Improve classification
Example Case: Porn or not?
[Figure: one curve each for Data Quality = 50%, 60%, 80%, 100%]

25 Scaling Crowdsourcing: Iterative training, with noise
- Use machine when confident, humans otherwise
- Ask as many humans as necessary to ensure quality
Diagram: Data from existing crowdsourced answers → Automatic Model (through machine learning); New Case → Automatic Model; Confident → Automatic Answer; Not confident → Get human(s) to answer; Confident for quality? / Not confident for quality? → ask more humans
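“Ask as many humans as necessary” can be read as a stopping rule: keep requesting labels until the belief about the true class is strong enough. The sketch below assumes a binary G/X task, independent workers with a known shared accuracy, and a 95% target; all of these values and names are assumptions for illustration rather than details from the slides.

```python
def label_until_confident(votes, accuracy=0.7, prior_x=0.5, target=0.95, max_votes=15):
    """Consume worker votes ('X' or 'G') until the posterior reaches the target.

    Returns (label, confidence, number_of_votes_used).
    """
    p_x = prior_x
    used = 0
    for vote in votes:
        if max(p_x, 1 - p_x) >= target or used >= max_votes:
            break
        # Likelihood of this vote under each possible true class.
        like_x = accuracy if vote == "X" else 1 - accuracy
        like_g = 1 - accuracy if vote == "X" else accuracy
        p_x = p_x * like_x / (p_x * like_x + (1 - p_x) * like_g)
        used += 1
    label = "X" if p_x >= 0.5 else "G"
    return label, max(p_x, 1 - p_x), used


# With no prior information, four agreeing 70%-accurate workers are needed.
print(label_until_confident(["X"] * 10))
# If a model already leans strongly towards X, one confirming vote suffices.
print(label_until_confident(["X"] * 10, prior_x=0.9))
```

Seeding `prior_x` with the automatic model's own probability is one way to combine the machine and the crowd: the less sure the model is, the more human votes the item consumes.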

