Crowdsourcing for NLP Using Amazon Mechanical Turk and CrowdFlower Matteo Negri and Yashar Mehdad.

Crowdsourcing
Wikipedia: Crowdsourcing is the act of outsourcing tasks, traditionally performed by an employee or contractor, to a large group of people or community (a crowd), through an open call.

Crowdsourcing services
Web and logo design: 99designs (>72,000 designers, from $150)
Brand names: namethis ($99 for the best 3 names after a 48-hour contest/voting session)
Business innovation: Chaordix (engage the crowd via the web to “submit, discuss, refine and rank ideas…”)
Advertising: Poptent (“connects video creators with Top Brands…”)
Software & usability testing: uTest (>18,000 professionals to test Web, mobile, gaming and desktop apps)
Brainstorming / feedback: kluster (“brainstorming ideas from trusted people”)
Product redesign: redesignme (“…actively seeks out badly-designed products…users are then invited to complete design challenges”)
…
Data cleansing & entry / content creation: Amazon’s Mechanical Turk, CrowdFlower

MTurk & CF
MTurk (www.mturk.com) launched in 2005
  – Directly accessible only to US requesters
  – >500,000 workers from >100 countries
CF (www.crowdflower.com) launched in 2007
  – Channel to MTurk accessible to non-US requesters
Worker demographics (Ipeirotis, 2010. New demographics of Mechanical Turk):
  US (46.8%)                 India (34%)
  68% women                  70% male
  ~40% 20-30 years           ~65% 20-30 years
  35% Bachelor's degree      53% Bachelor's degree
  ~45% $25-60K/yr            55% <$10K/yr
  ~35% “to kill time”        ~30% “primary source of income”
  – ~25% work 4-8 hours per week
  – ~36% complete 20/100 HITs (i.e. work units) per week
  – ~60% earn less than $10 per week

MTurk & CF
Basic unit of work: "Human Intelligence Task" (HIT)
  – Simple, repetitive, hard-to-automate tasks
  – Prices from $0.01 to $10 (the end of unsupervised learning?)
Requester
  – Prepays the money
  – Publishes HITs
  – Gets results
Worker (aka “turker”)
  – Completes the HITs
  – Gets paid
[Diagram: Requester, HITs, Workers, Completed HITs]

Sample HITs from MTurk (July 2, 2010)
  – Transcribe this audio into text (audio length: 1h3'41’’). $13.37
  – Visit the given website and complete the short survey. About 5 minutes to complete. $1.00
  – Tweet a specified message on your valid Twitter account with at least 200 followers. $1.00
  – Share Your Room Painting Project (photo + description). $1.00
  – Sell me your old college/university writing assignments and summaries (400+ words). I am looking for original writing done about university-level topics & readings. $0.50
  – Share a 16th birthday party idea. 300+ words. $0.50
  – Click a link to a website, enter your zip code, click submit to test (Takes 10 Seconds). $0.50
  – Provide on my website quality improvement tip for Singers and aspiring vocalist looking for vocal training tips. $0.40
  – How good is your Refrigerator model? Share your experience! $0.25
  – Tell us a true, interesting story from your life about acne, pimples, zits, etc., like products you tried, bad dates, embarrassing moments, etc. $0.10
  – Download and rate my free Android App. $0.01
  – Adult/inappropriate video identification. You will view or scrub this video and decide if it contains adult material. $0.01
199,799 HITs

Sample NLP HITs 1
Corpus collection
  – Given a topic, prepare a brief speech expressing your true opinion on the topic. Next, prepare a second brief speech expressing the opposite of your opinion.
Word Sense Disambiguation
  – Given a text passage containing a target word w, select w’s most appropriate sense from a list.
Word similarity
  – Assign numeric judgments of word similarity for 30 word pairs on a scale of [0,10].
Textual Entailment
  – Given two sentences, choose whether the second sentence can be inferred from the first.
Answer quality evaluation
  – Given a question-answer pair, rate the following 13 statements on a scale of 1 to 5: “This answer provides enough information for the question”, “this is an easy to read answer”, …
Sentiment/polarity/bias classification
  – Given a list of short headlines, assign numeric judgments in the interval [0,100] rating the headline for six emotions (anger, disgust, fear, joy, sadness, surprise) and a single numeric rating in the interval [-100,100] to denote the overall positive or negative valence of the emotional content of the headline.

Sample NLP HITs 2
Machine Translation evaluation
  – Given a source text, rank each of the 5 translations from Best to Worst.
Speech transcription
  – Listen to the utterance by using the audio player embedded in the task web page, and transcribe every audible word. You can replay the audio as many times as necessary to produce a satisfactory transcript.
Temporal ordering of events
  – Given a verb event pair, make a binary choice on whether the event described by the first verb occurs before or after the second.
Relation extraction
  – Given a text passage with two highlighted terms, indicate if one of the following relations holds between them: …

Sample NLP HITs 3
Word alignment
  – Link words in the source sentence to one or more target words or the empty word.
[Slide labels: JavaScript, API]

Popular, simple, fast, cheap,… … BUT tricky!!!
How to design HITs?
  – …to attract turkers
  – …to collect reliable data
  – …to boost speed
How to price HITs?
How to ensure quality control?
  – …to weed out untrustworthy workers
  – …to weed out spammers/cheaters
  – …to avoid wasting money

A bunch of hints
Keep your HIT simple and concise
  – Difficult tasks = low agreement, few reliable results, slow progress
Try different settings before launching a big job
  – Different definitions of your HIT
  – Different payment amounts
Make cheating a hard task
  – Make successful completion with random clicks impossible
  – Use a gold standard
  – Use regional qualifications
  – Define your HIT in the appropriate language
  – Transform texts into images

The importance of gold data 1
Using a gold standard is optional, but REMEMBER THAT:
You are going to pay only for successfully completed HITs!!!
  – MTurk: +10% over the price of successfully completed HITs
  – CF: +30% (!)
You need a criterion to discriminate between successfully and unsuccessfully completed HITs
  – No criterion = ALL results are good (and paid!)
HIT: Transcribe this audio into text (audio length: 1h3'41’’). $13.37
Result: “Agfdagfa ah ah ah!”
Valid result without gold standard!!!

The importance of gold data 2
No criterion = ALL results are good (and paid!)…another example
HIT: given two English words A and B, decide if they can be synonyms or not
Data to be annotated:
  A       B         Synonyms?
  car     book      -
  volume  loudness  -
  volume  book      -
  volume  mass      -
  crab    shrimp    -

The importance of gold data 2
No criterion = ALL results are good (and paid!)…another example
HIT: given two English words A and B, decide if they can be synonyms or not
  A       B         Synonyms?
  car     book      YES
  volume  loudness  NO
  volume  book      NO
  volume  mass      NO
  crab    shrimp    YES
Valid results without gold standard!!!

Adding gold units 1
Sometimes it’s easy: gold units can be merged with the required annotations
HIT: given two English words A and B, decide if they can be synonyms or not
Data to be annotated, with gold units (car/automobile, volume/table) merged in:
  A       B           Synonyms?  GOLD
  car     automobile  -          YES
  car     book        -          -
  volume  loudness    -          -
  volume  book        -          -
  volume  table       -          NO
  volume  mass        -          -
  crab    shrimp      -          -

Adding gold units 1
Sometimes it’s easy: gold units can be merged with the required annotations
HIT: given two English words A and B, decide if they can be synonyms or not
Worker answers, with gold units (car/automobile, volume/table) merged in:
  A       B           Synonyms?  GOLD
  car     automobile  YES        YES
  car     book        NO         -
  volume  loudness    YES        -
  volume  book        YES        -
  volume  table       YES        NO
  volume  mass        YES        -
  crab    shrimp      YES        -
Worker #67911: Judgments made: 7; Gold seen: 2 / Missed: 1; Trust: 50%
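The trust figure reported for Worker #67911 (2 gold units seen, 1 missed, 50%) is consistent with trust being the share of seen gold units answered correctly. A minimal Python sketch under that assumption follows; the function name and data layout are illustrative and are not CrowdFlower's API, and the worker's answer on the volume/table gold unit is assumed to be the missed one.

```python
def gold_trust(answers, gold):
    """Return (seen, missed, trust) for one worker.

    answers: dict mapping (A, B) word pairs to the worker's "YES"/"NO".
    gold:    dict mapping gold-unit pairs to the known correct answer.
    Trust is ASSUMED to be correct gold answers / gold units seen.
    """
    seen = [pair for pair in gold if pair in answers]
    missed = [pair for pair in seen if answers[pair] != gold[pair]]
    trust = (len(seen) - len(missed)) / len(seen) if seen else 0.0
    return len(seen), len(missed), trust


# Gold units from the slide.
gold = {("car", "automobile"): "YES", ("volume", "table"): "NO"}

# Answers of a worker who clicks YES almost everywhere.
answers = {
    ("car", "automobile"): "YES",
    ("car", "book"): "NO",
    ("volume", "loudness"): "YES",
    ("volume", "book"): "YES",
    ("volume", "table"): "YES",  # assumed answer; disagrees with gold NO
    ("volume", "mass"): "YES",
    ("crab", "shrimp"): "YES",
}

seen, missed, trust = gold_trust(answers, gold)
print(f"Gold seen: {seen} / Missed: {missed} / Trust: {trust:.0%}")
```

Only the two gold rows influence the score; the other five judgments are the annotations actually being collected.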

Adding gold units 2
Sometimes it’s harder: gold units cannot be directly merged with the required annotations
HIT 1: translate the given English sentence into Spanish
HIT 2: summarize a 300-word story
HIT 3: given a list of headlines, assign a numeric rating in the interval [-100,100] to denote the overall positive or negative valence of the emotional content of the headline
One valid output vs. multiple valid outputs
Known output vs. unknown output
Data annotation vs. survey/content creation

Adding gold units 2
Sometimes it’s harder: gold units cannot be directly merged with the required annotations
HIT 1: translate the given English sentence into Spanish
PROBLEM: Since there is no ONE single good translation, we cannot directly check the quality of turkers’ work through comparison with a gold reference translation

Adding gold units 2
Sometimes it’s harder: gold units cannot be directly merged with the required annotations
Possible solution: a 2-step HIT (validation over gold units + translation)
HIT 1: translate the given English sentence into Spanish
  – HIT 1.0: given two sentences, S1 in English and S2 in Spanish, decide if S2 is a correct translation of S1.
  – HIT 1.1: translate the given English sentence S3 into Spanish.
Gold units:
  S1: 2002 Olympic Winter games took place in Salt Lake.
  S2: 2002 Juegos Olímpicos de Invierno tendrá lugar en Salt Lake.
  Correct?: -
  Gold: NO
Data to be collected:
  S3: A variety of mercy killing is when a patient is removed from a life support system with legal approval.
  Translation: -
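One way to read the 2-step HIT is that the gold validation step acts as a gate on the free-text step: a translation is kept only when the same response answered the paired gold unit correctly. The Python sketch below illustrates that reading; the class, field names, and placeholder translations are hypothetical and do not come from the slides or from any platform API.

```python
from dataclasses import dataclass


@dataclass
class TwoStepResponse:
    worker_id: str
    validation_answer: str  # worker's YES/NO on the gold pair (HIT 1.0)
    gold_answer: str        # known correct answer for that gold pair
    translation: str        # worker's free-text translation (HIT 1.1)


def keep_trusted_translations(responses):
    """Keep non-empty translations whose gold validation was answered correctly."""
    return [
        r.translation
        for r in responses
        if r.validation_answer == r.gold_answer and r.translation.strip()
    ]


# Placeholder responses; the translation strings are stand-ins, not real data.
responses = [
    TwoStepResponse("w1", "NO", "NO", "<translation by w1>"),
    TwoStepResponse("w2", "YES", "NO", "<translation by w2>"),  # failed gold
    TwoStepResponse("w3", "NO", "NO", "   "),                   # empty output
]

kept = keep_trusted_translations(responses)
print(kept)
```

The gate does not prove that a kept translation is good; it only removes output from responses that failed a check with a known answer.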

AMT vs. CF
                                           AMT    CF
  Regional qualification                   ✔      ✔
  Accessible to international requesters   ✗      ✔
  Multiple channels for job distribution   ✗      ✔
  Built-in gold standard qualification     ✗      ✔
  Trustability qualification               ✔      ✗
  Qualification certificate                ✔      ✗
  Selection of good workers on your job    ✔      ✗
  Charge on successfully completed HITs    +10%   +30%

Next steps
1. Creation/publication of a job
  – A simple task: word similarity
2. Monitoring your job

Terminology
Unit (HIT) – Basic task given to each worker.
Assignment – Number of units each worker will do at a time.
Judgment – Completion of an assignment by an individual worker.
Job – Your published assignments waiting for judgment.
  – Cost = # Assignments * # Judgments per assignment * Pay per assignment
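The cost formula above can be sketched in a few lines of Python. The optional `fee_rate` parameter is an addition for illustration: it applies the platform charge quoted elsewhere in these slides (+10% for MTurk, +30% for CF) on top of the worker payments. The example figures (100 assignments, 5 judgments, $0.02) are made up.

```python
def job_cost(assignments, judgments_per_assignment, pay_per_assignment, fee_rate=0.0):
    """Cost = # Assignments * # Judgments per assignment * Pay per assignment.

    fee_rate is a hypothetical extra parameter: the platform charge as a
    fraction of worker payments (0.10 for MTurk, 0.30 for CF per the slides).
    """
    base = assignments * judgments_per_assignment * pay_per_assignment
    return base * (1 + fee_rate)


# Made-up example job: 100 assignments, 5 judgments each, $0.02 per assignment.
base = job_cost(100, 5, 0.02)
with_mturk_fee = job_cost(100, 5, 0.02, fee_rate=0.10)
with_cf_fee = job_cost(100, 5, 0.02, fee_rate=0.30)

print(f"worker payments: ${base:.2f}")
print(f"with MTurk fee:  ${with_mturk_fee:.2f}")
print(f"with CF fee:     ${with_cf_fee:.2f}")
```

Because the fee applies to successfully completed HITs, this is an upper bound when some judgments are rejected.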

Creating a new job: word similarity
Task: Given a sentence containing a term t, choose among a list of 3 terms t1, t2, t3 the most similar to t.
Note: One valid output → simple gold standard creation!
  – gold units can be easily merged with the required annotations
  – 1-step HIT
HIT: select from a list of terms the most similar to the one extracted from the given sentence
  Sentence                                             T       T1       T2       T3        Gold    Most Similar
  He was reading a book while waiting for his guests.  book    hat      volume   cat       volume
  they left the harbor during the night                harbor  seaport  airport  mountain  -
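Data like the two units above could be prepared for upload as a CSV file with one row per unit and a gold column that is filled only for gold units. The sketch below uses Python's standard `csv` module; the column names mirror the slide's table, but the exact layout the platform expects is an assumption, and `find_bad_gold` is a hypothetical sanity check.

```python
import csv
import io

FIELDS = ["sentence", "t", "t1", "t2", "t3", "gold"]

# The two units shown on the slide; an empty gold cell marks a non-gold unit.
units = [
    {"sentence": "He was reading a book while waiting for his guests.",
     "t": "book", "t1": "hat", "t2": "volume", "t3": "cat", "gold": "volume"},
    {"sentence": "they left the harbor during the night",
     "t": "harbor", "t1": "seaport", "t2": "airport", "t3": "mountain", "gold": ""},
]


def units_to_csv(rows):
    """Serialize units to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def find_bad_gold(rows):
    """Return gold units whose gold answer is not one of the three candidates."""
    return [
        r for r in rows
        if r["gold"] and r["gold"] not in (r["t1"], r["t2"], r["t3"])
    ]


csv_text = units_to_csv(units)
print(csv_text)
print("bad gold units:", find_bad_gold(units))
```

Checking gold answers against the candidate list before upload avoids gold units that no worker could ever pass.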

A closer look at

Creating a job

1: upload data

2: define your HIT

3: calibration (optional)

4: ordering
NOTE: MTurk charges +10% over the price of successfully completed HITs; CF charges +30% (!)
Gambit: payments company for social games! Players are paid with “chips” for taking simple, online jobs…

Checking a job (progress and results)

Summary page

Preview

Workers
NOTE: Only workers who have seen at least 4 gold units and have >= 70% Trust are paid (and their work is retained)!
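The payment rule in this note can be expressed as a small predicate. This is a sketch, not the platform's implementation: the two thresholds come from the note, while computing trust as correct gold answers over gold units seen is an assumption, and the function name is hypothetical.

```python
MIN_GOLD_SEEN = 4   # from the note: at least 4 gold units seen
MIN_TRUST = 0.70    # from the note: Trust >= 70%


def is_paid_and_retained(gold_seen, gold_missed):
    """Apply the slide's rule to one worker's gold statistics.

    Trust is ASSUMED to be (gold_seen - gold_missed) / gold_seen.
    """
    if gold_seen < MIN_GOLD_SEEN:
        return False
    trust = (gold_seen - gold_missed) / gold_seen
    return trust >= MIN_TRUST


# Hypothetical workers: (gold seen, gold missed).
workers = {"A": (2, 1), "B": (4, 1), "C": (10, 2), "D": (8, 4)}
for worker_id, (seen, missed) in workers.items():
    status = "paid" if is_paid_and_retained(seen, missed) else "not paid"
    print(worker_id, status)
```

A worker with too few gold units is excluded regardless of accuracy, which is one reason to spread gold units densely through a job.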

A trustable worker

Issues
How to design HITs?
  – …to attract turkers
  – …to collect reliable data
  – …to boost speed
How to price HITs?
What can we do with a low budget?
Quality control, cheating/spam detection
Experts vs. non-experts (correlation between the two groups, what to expect from non-experts)

A recent experience…

Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-day Rush
Monolingual pair:
  T: Wolfgang Amadeus Mozart was born in Salzburg.
  H: Mozart was born in Austria.
Cross-lingual pair:
  T: Wolfgang Amadeus Mozart was born in Salzburg.
  H: Mozart nació en Austria.
NAACL 2010 Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk
Joint work with Yashar Mehdad

[Diagram: Monolingual TE Corpus (PASCAL RTE3) → Translation HIT → Translated T-H pairs → Validation HIT → Validated T-H pairs → CLTE Corpus (English-Spanish)]

Naïve methodology
DAY 1, $24
No qualification mechanisms
Very fast and cheap:
  – $12 for 800 translations in 1 hour
  – $12 for 5*800 validations in ~6 hours
Poor quality of the results (61% rejections)
Need for gold standard units!
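The day-1 figures imply very low per-unit prices. The short Python sketch below works them out, reading "5*800 validations" as 5 judgments for each of the 800 translations; that reading, and treating each $12 as the whole amount spent on that step, are assumptions.

```python
def cost_per_unit(total_dollars, units):
    """Average amount paid per collected unit."""
    return total_dollars / units


NUM_PAIRS = 800            # T-H pairs to translate
VALIDATIONS_PER_PAIR = 5   # assumed reading of "5*800 validations"

translation_cost = cost_per_unit(12.0, NUM_PAIRS)
validation_cost = cost_per_unit(12.0, VALIDATIONS_PER_PAIR * NUM_PAIRS)

print(f"per translation: ${translation_cost:.3f}")
print(f"per validation:  ${validation_cost:.3f}")
```

That works out to 1.5 cents per translation and 0.3 cents per validation judgment, which helps explain both the speed and the 61% rejection rate.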

Improving validation
DAYS 2-7, $58
Gold units (50 positive/negative examples)
Task definition in Spanish
Better results…still at low cost
  – 97% accuracy on 20% of the retained translations
  – +25% in the validation costs
Considerable increase in duration
  – 4 days for the first iteration (many rejected judgments, automatic pausing mechanism in CF)
Need for qualification mechanisms! More money to boost speed!

Improving translation
DAYS 8-10, $99.75
Gold units (validity check)
Regional qualification, as in MTurk (upon request)
Payment increase
Better results…
  – Fewer rejections (45%)
  – Automatic pausing avoided
Faster procedure
  – Doubling the payment halved the completion time

Summary
800 English pairs (RTE3 Development Set)
426 validated English/Spanish pairs in our CLTE Corpus
$99.75 spent to define a reliable and fast procedure
  – translation/validation cycles
  – non-redundant acquisitions
  – systematic use of gold units
  – simple binary decisions
Cost-effective solution
  – $30 to create the full corpus of 800 pairs
Some limitations found in the CrowdFlower service
  – lack of regional qualification (only available upon request)
  – lack of other qualification mechanisms
  – automatic pausing mechanisms

MTurk & CF
MTurk (www.mturk.com) launched in 2005
  – Directly accessible only to US requesters
  – Workers from >100 countries
  – >500,000 workers
      ~47% from US (34% from India)
      ~68% women
      ~52% 22-40 years
      ~70% to spend free time fruitfully (~15% for “primary” income purposes)
      ~25% for 4-8 hours per week
      ~60% earning less than $10 per week
      >50% with college education
CF (www.crowdflower.com) launched in 2007
  – Channel to MTurk accessible to non-US requesters
Ipeirotis, 2010. New demographics of Mechanical Turk.
