Rethinking Grammatical Error Detection and Evaluation with the Amazon Mechanical Turk. Joel Tetreault [Educational Testing Service], Elena Filatova [Fordham University], Martin Chodorow [Hunter College of CUNY].

Presentation transcript:

Rethinking Grammatical Error Detection and Evaluation with the Amazon Mechanical Turk
Joel Tetreault [Educational Testing Service], Elena Filatova [Fordham University], Martin Chodorow [Hunter College of CUNY]

Preposition Selection Task / Preposition Error Detection Task

Error Detection in 2010
Much work on system design for detecting ELL errors:
◦ Articles [Han et al. '06; Nagata et al. '06]
◦ Prepositions [Gamon et al. '08]
◦ Collocations [Futagi et al. '08]
Two issues that the field needs to address:
1. Annotation
2. Evaluation

Annotation Issues
Annotation typically relies on one rater
◦ Cost
◦ Time
But in some tasks (such as usage), multiple raters are probably desirable
◦ Relying on a single rater can skew system precision by as much as 10% [Tetreault & Chodorow '08: T&C08]
◦ Context can license multiple acceptable usages
Aggregating multiple judgments can address this, but is still costly

Evaluation Issues
To date, none of the preposition or article systems have been evaluated on the same corpus!
◦ Mostly due to the proprietary nature of the corpora (TOEFL, CLC, etc.)
Learner corpora can vary widely:
◦ L1 of the writer?
◦ Proficiency level of the writer?
◦ ESL or EFL?

Evaluation Issues
A system that performs at 50% on one corpus may actually perform at 80% on another
The inability to compare systems makes it difficult for the grammatical error detection field to move forward

Goal
Amazon Mechanical Turk (AMT)
◦ A fast and cheap source of untrained raters
1. Show AMT to be an effective alternative to trained raters on the tasks of preposition selection and ESL error annotation
2. Propose a new evaluation method that allows two systems evaluated on different corpora to be compared

Outline
1. Amazon Mechanical Turk
2. Preposition Selection Experiment
3. Preposition Error Detection Experiment
4. Rethinking Grammatical Error Detection

Amazon Mechanical Turk
AMT: a service composed of requesters (companies, researchers, etc.) and Turkers (untrained workers)
Requesters post tasks, or HITs, for workers to complete, with each HIT usually costing a few cents

Amazon Mechanical Turk
AMT has been found to be cheaper and faster than expert raters for NLP tasks:
◦ WSD, emotion detection [Snow et al. '08]
◦ MT [Callison-Burch '09]
◦ Speech transcription [Novotney et al. '10]

AMT Experiments
For the preposition selection and preposition error detection tasks:
1. Can untrained raters be as effective as trained raters?
2. If so, how many untrained raters are required to match the performance of trained raters?

Preposition Selection Task
Task: how well can a system (or human) predict which preposition the writer used in well-formed text?
◦ A "fill in the blank" task
Previous research shows that expert raters can achieve 78% agreement with the writer [T&C08]
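To make the task format concrete, here is a minimal, hypothetical representation of one selection item; the sentence and the option set are invented for illustration and do not come from the slides:

```python
# Hypothetical "fill in the blank" item as it might be shown to a Turker
# (or fed to a system); the writer's choice is hidden from the rater.
item = {
    "sentence": "He has lived ___ New York for ten years.",
    "options": ["in", "on", "at", "of", "for", "with", "to", "by"],
    "writer_choice": "in",   # used only for scoring
}
```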

Preposition Selection Task
194 preposition contexts from MS Encarta, previously rated by experts in the T&C08 study
Requested 10 Turker judgments per HIT
For each sample size, randomly selected that many Turker responses per item and compared the majority preposition with the writer's choice
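A minimal sketch of this resampling procedure (not from the original slides); it assumes each item stores the writer's preposition alongside the list of prepositions the Turkers chose:

```python
import random
from collections import Counter

def majority_vote_accuracy(items, sample_size, n_trials=100, seed=0):
    """Estimate agreement with the writer when only `sample_size` Turker
    judgments per item are used, averaged over repeated random subsamples.

    `items` is a list of (writer_preposition, turker_judgments) pairs, where
    turker_judgments is the list of prepositions chosen by the Turkers.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_trials):
        correct = 0
        for writer_prep, judgments in items:
            sample = rng.sample(judgments, sample_size)
            majority_prep, _ = Counter(sample).most_common(1)[0]
            correct += int(majority_prep == writer_prep)
        accuracies.append(correct / len(items))
    return sum(accuracies) / len(accuracies)

# Hypothetical data: (writer's preposition, 10 Turker choices) per context.
items = [("of", ["of"] * 8 + ["for", "in"]),
         ("on", ["on"] * 6 + ["in"] * 4)]
print(majority_vote_accuracy(items, sample_size=3))
```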

Preposition Selection Results
The majority vote of 3 Turker responses agrees with the writer more often than 1 expert rater does

Preposition Error Detection
Turkers were presented with a TOEFL sentence and asked to judge the target preposition as correct, incorrect, or ungrammatical
152 sentences, with 20 Turker judgments each
Kappa was computed between three trained raters
Since no gold standard exists, we try to match the kappa of the trained raters
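A sketch of how one might search for the smallest Turker sample whose inter-rater agreement matches that of the trained raters. Fleiss' kappa is used here as a stand-in for whatever agreement statistic the study used, and the data layout and target kappa value are hypothetical:

```python
import random
from collections import Counter

CATEGORIES = ["correct", "incorrect", "ungrammatical"]

def fleiss_kappa(judgment_sets):
    """Fleiss' kappa for a list of per-sentence judgment lists, each
    containing the same number of ratings."""
    n = len(judgment_sets[0])                 # raters per sentence
    N = len(judgment_sets)                    # number of sentences
    totals = Counter(j for js in judgment_sets for j in js)
    p_cat = {c: totals[c] / (N * n) for c in CATEGORIES}
    per_item = []
    for js in judgment_sets:
        counts = Counter(js)
        per_item.append((sum(c * c for c in counts.values()) - n) / (n * (n - 1)))
    P_bar = sum(per_item) / N
    P_e = sum(p * p for p in p_cat.values())
    return (P_bar - P_e) / (1 - P_e)

def smallest_matching_sample(all_judgments, target_kappa, max_k=20,
                             n_trials=50, seed=0):
    """Smallest number of randomly sampled Turkers whose mean kappa over
    `n_trials` resamples reaches the trained raters' kappa."""
    rng = random.Random(seed)
    for k in range(2, max_k + 1):
        kappas = [fleiss_kappa([rng.sample(js, k) for js in all_judgments])
                  for _ in range(n_trials)]
        if sum(kappas) / n_trials >= target_kappa:
            return k
    return None

# Hypothetical usage: `judgments` would hold 152 lists of 20 labels each,
# and `target_kappa` the kappa measured between the three trained raters.
# k = smallest_matching_sample(judgments, target_kappa=0.6)
```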

Error Annotation Task
13 Turker responses were needed to match the reliability of the trained raters

Rethinking Evaluation
Prior evaluation rests on the assumption that all prepositions are of equal difficulty
However, some contexts are easier to judge than others:
Easy:
◦ "It depends of the price of the car"
◦ "The only key of success is hard work."
Hard:
◦ "Everybody feels curiosity with that kind of thing."
◦ "I am impressed that I had a 100 score in the test of history."
◦ "Approximately 1 million people visited the museum in Argentina in this year."

Rethinking Evaluation
The difficulty of the cases can skew performance figures and system comparisons
If System X performs at 80% on corpus A, and System Y performs at 80% on corpus B…
◦ …one of the two systems is probably better, namely the one evaluated on the corpus with fewer easy cases
◦ But we need difficulty ratings to determine this
[Figure: proportion of easy vs. hard cases in each corpus, e.g. 33% easy cases in one corpus vs. 66% in the other]

Rethinking Evaluation
Group prepositions into "difficulty bins" based on AMT agreement
◦ 90% bin: 90% of the Turkers agree on the rating for the preposition (strong agreement)
◦ 50% bin: Turkers are split on the rating for the preposition (low agreement)
Run the system on each bin separately and report the results per bin
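A rough sketch of this binning-and-scoring idea; the bin boundaries and the data layout are illustrative assumptions, not taken from the slides:

```python
from collections import Counter, defaultdict

def agreement_bin(judgments, bin_edges=(0.9, 0.7, 0.5)):
    """Bin an item by the fraction of Turkers who agree on the majority
    label (0.9 = strong agreement, 0.5 = Turkers are roughly split)."""
    top_count = Counter(judgments).most_common(1)[0][1]
    agreement = top_count / len(judgments)
    for edge in bin_edges:                 # edges listed from high to low
        if agreement >= edge:
            return edge
    return bin_edges[-1]                   # anything lower falls in the lowest bin

def per_bin_results(items):
    """`items` is a list of (system_was_correct, turker_judgments) pairs.
    Returns system accuracy within each agreement bin."""
    correct, total = defaultdict(int), defaultdict(int)
    for was_correct, judgments in items:
        b = agreement_bin(judgments)
        total[b] += 1
        correct[b] += int(was_correct)
    return {b: correct[b] / total[b] for b in sorted(total, reverse=True)}
```

Two systems evaluated on different corpora could then be compared bin by bin, as long as both corpora were annotated with the same AMT scheme.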

Conclusions
Annotation: showed that AMT is cheaper and faster than expert raters
◦ Selection task: 3 Turkers > 1 expert
◦ Error detection task: 13 Turkers > 3 experts
Evaluation: AMT can be used to circumvent the issue of having to share corpora: just compare bins!
◦ But both systems must use the same AMT annotation scheme

Task        Total HITs   Judgments per HIT   Cost    Total cost   # of Turkers   Time
Selection   194          10                  $0.02   ?            ?              ? hrs
Error       152          20                  $0.02   ?            ?              ? hrs

Future Work
Experiment with guidelines, qualifications, and weighting to reduce the number of Turkers required
Experiment with other error types (e.g., articles, collocations) to determine how many Turkers are required for optimal annotation