Maxine Eskenazi, Language Technologies Institute, Carnegie Mellon University
What is the problem?
◦ How to ensure that crowdsourcing results are reliable
The solutions:
◦ Testing the equipment
◦ Framing the task
◦ Testing the workers
◦ Training the workers
◦ Assessing the work
Crowdsourcing is a great resource!
◦ You have large amounts of data to process
◦ It’s faster and cheaper while maintaining high quality
But you can make it say what you want
◦ Example: looking for sentences that include a well-pronounced example of the word “table”
  Leading question: “Do you agree that the word ‘table’ was said in this sentence?”
  Neutral request: “Please annotate this sentence”
You can get results that are meaningless
But you can get great results if you are careful!
Testing the equipment - for those who will listen to something (to annotate, for example)
◦ Ask them to use a headset and then ask them to click yes if they can hear something
  Relying on worker self-assessment is nice, but not very reliable
◦ Play something to them and ask them to write down what they heard
  Compare what they wrote to what was played (you have already transcribed it)
  If they still can’t hear, give them feedback on how to connect the headset
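The comparison between what the worker typed and the known transcript could be automated along these lines. This is a minimal sketch, not the method used in the talk: the function names, the word-error-rate measure and the 0.2 threshold are all assumptions chosen for illustration.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split into words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def word_error_rate(reference: str, typed: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(typed)
    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def passes_hearing_check(reference: str, typed: str, max_wer: float = 0.2) -> bool:
    """True if the typed text is close enough to the known transcript.

    max_wer is an assumed tolerance for small typos, not a recommended value.
    """
    return word_error_rate(reference, typed) <= max_wer
```

A worker who fails the check would be shown the headset instructions and offered another try, rather than being rejected outright.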
Testing the equipment - for those who will record something
◦ Ask them to speak into the microphone, then play it back to them and ask them if they heard something
  Relying on worker self-assessment has sometimes worked in this case
◦ Ask them to read something from the screen, then use a speech recognizer to align what they said with what they read
  MIT has the WAMI toolkit for this, and there are others as well
◦ Have another worker listen to what they said and annotate it, then compare that annotation to the text
  This may take too much time
Framing the task - workers need to know what the task is and how to do it
◦ Write a description of the task and instructions on what to do
  Get others to read that description and follow your instructions (in a sandbox)
  Revise and try out again
◦ Give examples and counterexamples
  Give at least two to three of each
◦ Become a worker and try others’ tasks yourself!
  You understand issues better when you put yourself in their shoes
Framing the task - VERY IMPORTANT
◦ Keep the cognitive load as low as possible!
  Break one complex task into several tasks
◦ Example: instead of “label the words you hear as well as the non-words, parts of words and pauses”, you would ask
  “label the words you hear”
  then, in a separate task, “label the non-words, like lipsmacks, you hear”
  in a separate task, “label the parts of words, like restarts, you hear”
  in a separate task, “label where the pauses are”
Framing the task
◦ Another example: Interspeech 2013, 25th anniversary
  Statistics on the past 25 years, in 18 categories
  Examples: total number of papers, total number of different authors
  2 harder-to-define categories, for example total number of cohorts of authors
  1500 attendees were quizzed
  The crowd gave the right answer, or came close to it, on the first 16 categories, and came nowhere close on the last 2
Framing the task
◦ Workers will choose the task they want to work on for several reasons:
  How much they can make per hour
    Measure how much time it takes to complete one task, then calculate how much you should pay so that workers make at least minimum wage
  How can you make the task go faster?
    Put all of one task on one page: no scrolling saves their time
    Example: ten sentences to annotate plus the instructions
    Let them minimize the instructions if they want
    Change the font size and the space between sentences to get it all on the screen at the same time
    Eliminate any unnecessary keystrokes
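The pay calculation is simple arithmetic once the time per task has been measured, for example in a sandbox trial. The sketch below assumes a hypothetical helper name and takes the hourly target as a parameter, since the applicable minimum wage depends on where the workers are.

```python
import math


def min_cents_per_task(seconds_per_task: float, hourly_wage: float) -> int:
    """Smallest per-task payment, in whole cents, that reaches the hourly target.

    seconds_per_task: measured average time to finish one task.
    hourly_wage: target hourly earnings in currency units (an input you choose).
    """
    tasks_per_hour = 3600 / seconds_per_task
    # Round up so that the worker never falls below the target.
    return math.ceil(hourly_wage * 100 / tasks_per_hour)


# Illustrative numbers only: a 90-second task and a 7.25 per hour target.
cents = min_cents_per_task(90, 7.25)
```

Rounding up rather than to the nearest cent keeps slower-than-average workers closer to the target as well.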
Framing the task
◦ What it will be used for
  You make your task more appealing when you tell people why you want them to do it
  Example from our work: “We are asking you to simplify some sentences. They are taken from everyday documents like driver license applications. This is so that we can automatically simplify everyday documents.”
◦ How nice it looks
  A subliminal detail that has been shown to be effective
Testing the workers - why?
◦ Do not assume they are native speakers of X: test them!
  Just because you have geolocation, that does not mean the person fluently speaks the language of that country
◦ Do not assume that all speakers of Y can write down what they hear: test them!
◦ Not everyone is honest, and there are bots
Testing the workers - how?
◦ To test for speakers of X, you could ask them to translate (type in) something from English into the target language
  Make sure that there is some word or expression that Google Translate or another machine translation system would get wrong
  You have already translated this sentence by hand
  Compare the worker’s translation with yours
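One way to act on that comparison is to look for the trap expression in the worker's text. The sketch below is an assumption about how this might be scripted: the function name and the three outcome labels are hypothetical, and both the correct rendering and the machine-style rendering have to be supplied by you after checking what the translation system actually produces.

```python
def classify_translation(worker_text: str, correct_phrase: str, mt_phrase: str) -> str:
    """Classify a worker's translation using a trap expression.

    correct_phrase: how your hand translation renders the trap expression.
    mt_phrase: the wrong rendering you observed from a machine translation system.
    Returns "machine", "pass", or "review" (labels invented for this sketch).
    """
    text = " ".join(worker_text.lower().split())  # collapse whitespace
    if mt_phrase.lower() in text:
        return "machine"
    if correct_phrase.lower() in text:
        return "pass"
    return "review"  # neither rendering found: a human should look at it


# Illustrative English-to-French trap built on the idiom "raining cats and dogs".
result = classify_translation(
    "Il pleut des cordes aujourd'hui.",
    correct_phrase="il pleut des cordes",
    mt_phrase="il pleut des chats et des chiens",
)
```

A substring check is deliberately strict; the "review" outcome exists because a fluent speaker may choose a valid rendering that differs from yours.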
Testing the workers - how?
◦ Give a new worker three items to do
  Say you want them to listen to a sentence and annotate it
  Give them three sentences to annotate
  Compare their annotation with the hand annotation you have already done
◦ Getting good work often requires some human expert work to establish a “gold standard” ahead of time!
  If you have lots of data, the investment is worth it, but it may not be for small datasets
Training the workers
◦ The pretesting you have done should serve as training for most tasks
◦ You could give more specific feedback if there is something they are doing that can be corrected
  Example: you asked for annotation that ends with a $, and one worker is not adding that $ but is annotating well. Just send that person a message to add the $. And keep the worker.
Training the workers
◦ You can put up a small number of tasks to start
  Say 100 tasks (for example, 100 utterances to annotate)
  Check whether the tasks are being done correctly
  Check whether each worker is doing the work correctly
◦ Revise your task if the workers in general are not doing well
◦ Or notify a worker who is not doing as well as the others: they risk not being paid and may want to abandon your tasks
Assessing the work
◦ There are three places where you can assess work:
  Before starting the task
    See training and testing
  While tasks are still live
    This is the best place to get rid of bots and cheaters
  After tasks are done (post-processing)
Assessing the work during the task
◦ Compare work to the “gold standard”
  Create a dataset of human-expert-labelled items (about 10 percent of the total items to be processed)
  In every ten items, put in 1 gold standard item
  Compare worker output to that item
◦ Compare one worker’s output to that of others (inter-worker)
  Majority wins, so have an odd number of workers for each task
◦ Compare one worker’s output to their own work (intra-worker)
  Give the worker the same item every 20 or 30 items and compare their answers on that item for consistency
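Two of these checks, mixing gold items into the task stream and taking the majority label, might be sketched as follows. The function names and the list-based data layout are assumptions made for illustration; a real crowdsourcing platform will have its own way of batching items.

```python
import random
from collections import Counter


def interleave_gold(items, gold_items, every=10, seed=0):
    """Return a task list in which each run of `every` tasks holds one gold item.

    Each gold item is placed at a random position within its run so that
    workers cannot predict where it falls.
    """
    if every < 2:
        raise ValueError("every must be at least 2")
    rng = random.Random(seed)
    gold = list(gold_items)
    block = every - 1  # regular items per run
    tasks = []
    for start in range(0, len(items), block):
        chunk = list(items[start:start + block])
        if gold:
            chunk.insert(rng.randrange(len(chunk) + 1), gold.pop(0))
        tasks.extend(chunk)
    return tasks


def majority_label(labels):
    """Most common label, or None if there are no labels or the top two tie."""
    if not labels:
        return None
    top = Counter(labels).most_common(2)
    if len(top) > 1 and top[0][1] == top[1][1]:
        return None
    return top[0][0]
```

With a yes/no question, an odd number of workers rules out ties; with more than two possible labels a tie can still occur, which is why `majority_label` returns `None` in that case.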
Assessing the work during the task
◦ Another thing to watch out for is bots and cheaters
  Bots: their creators model the task
  Cheaters: they get through the task as quickly as possible
  While you would pay a poor worker, you should refuse to pay a bot or someone who you are sure is a cheater
◦ For cheaters, look at how much time it took to do each item
  Too fast? It’s a cheater
◦ Give a series of multiple-choice items
  If a worker consistently answers B, they are either a bot or a cheater
◦ Put up small groups of tasks with different names
  The tasks will be finished too quickly for a bot to be created (for a model of your task to be made)
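The timing check and the repeated-answer check can be combined into one screening function. This is a minimal sketch under stated assumptions: the names are hypothetical, and the thresholds (2 seconds, 90 percent, at least 10 answers) are placeholders that would have to be set from how long the task takes an honest worker.

```python
from collections import Counter
from statistics import median


def same_answer_share(answers):
    """Fraction of answers equal to the worker's most frequent choice."""
    if not answers:
        return 0.0
    return Counter(answers).most_common(1)[0][1] / len(answers)


def flag_worker(seconds_per_item, answers, min_seconds=2.0, max_share=0.9, min_answers=10):
    """Return a list of reasons to look more closely at a worker.

    All thresholds are illustrative assumptions, not recommended values.
    """
    reasons = []
    # The median is less sensitive than the mean to one long pause.
    if seconds_per_item and median(seconds_per_item) < min_seconds:
        reasons.append("too fast")
    if len(answers) >= min_answers and same_answer_share(answers) >= max_share:
        reasons.append("repeated answer")
    return reasons
```

A flag is a prompt to inspect the work, since a legitimately skewed set of items can also produce many identical answers.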
Assessing the work - after the task, on all of the data at once
◦ Gold standard
  Pull out the gold standard you created and compare the work that you have collected to it
◦ Intra-worker comparison
  Does a worker consistently agree with the crowd?
  Ask the worker if they are confident in their answer; if they consistently say no, do not use their work
  Note that consulting the workers often brings in good feedback!
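Once all of the data is in, the gold standard comparison reduces to a per-worker accuracy over the gold items. The sketch below assumes responses are stored as (worker, item, label) triples and gold labels as a dictionary; both layouts and the function name are assumptions for illustration.

```python
def accuracy_by_worker(responses, gold):
    """Share of gold items each worker labelled correctly.

    responses: iterable of (worker_id, item_id, label) triples.
    gold: dict mapping gold item_id to the expert label.
    Workers who saw no gold items are left out of the result.
    """
    correct, seen = {}, {}
    for worker, item, label in responses:
        if item not in gold:
            continue  # regular item, nothing to compare against
        seen[worker] = seen.get(worker, 0) + 1
        correct[worker] = correct.get(worker, 0) + (label == gold[item])
    return {worker: correct[worker] / seen[worker] for worker in seen}


# Made-up identifiers and labels, for illustration only.
scores = accuracy_by_worker(
    [("w1", "g1", "yes"), ("w1", "g2", "no"), ("w2", "g1", "no"), ("w2", "x9", "yes")],
    {"g1": "yes", "g2": "no"},
)
```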
Assessing the work - after the task, on all of the data at once
◦ Inter-worker comparison
  In the same way that you would compare the work of one worker to the gold standard, you can compare the work of one worker to another
  Look for a worker who does not agree with all of the others (odd numbers of workers again)
  No gold standard is needed for this, so your expert might need to label less data
◦ Assess the work of one crowd by another
  Ask one crowd to do the task
  Give the same task to another crowd, showing the first crowd’s work, for example:
    “Please correct the following”
    “Does this text match what was said?” (yes/no, or change what was wrong)
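The inter-worker comparison can be expressed as each worker's rate of agreement with the majority label on the items they shared with others. As before, this is a sketch under assumptions: the (worker, item, label) layout and the function name are hypothetical, and items whose top two labels tie are simply skipped.

```python
from collections import Counter, defaultdict


def agreement_with_majority(responses):
    """Share of items on which each worker matched the strict majority label.

    responses: iterable of (worker_id, item_id, label) triples.
    """
    by_item = defaultdict(dict)
    for worker, item, label in responses:
        by_item[item][worker] = label

    agree, total = Counter(), Counter()
    for labels in by_item.values():
        top = Counter(labels.values()).most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            continue  # tie: no majority to compare against
        winner = top[0][0]
        for worker, label in labels.items():
            total[worker] += 1
            agree[worker] += label == winner
    return {worker: agree[worker] / total[worker] for worker in total}
```

A worker whose rate is far below everyone else's is the one who "does not agree with all of the others". Note that each worker's own vote counts toward the majority here, so with very few workers per item the rates are somewhat optimistic.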
We have seen ways to ensure that what you get is high quality and makes sense
◦ Equipment can be tested reliably
◦ Instructions and all of the setup that ensures the task makes sense can be tested
◦ Workers can be pretested and trained
◦ Bots and cheaters can be eliminated
◦ The work can be assessed before, during or after the task is completed
Too much information?
◦ These slides will be up on my website
◦ Google for “Maxine Eskenazi Research”