Applying Automated Metrics to Speech Translation Dialogs
Sherri Condon, Jon Phillips, Christy Doran, John Aberdeen, Dan Parvaz, Beatrice Oshika, Greg Sanders, and Craig Schlenoff
LREC 2008
© 2008 The MITRE Corporation. All rights reserved.

DARPA TRANSTAC: Speech Translation for Tactical Communication
- DARPA objective: rapidly develop and field two-way translation systems for spontaneous communication in real-world tactical situations
- Example exchange: English speaker: "How many men did you see?" Iraqi Arabic speaker: "There were four men"
- System components: Speech Recognition, Machine Translation, Speech Synthesis

Evaluation of Speech Translation
- Few precedents for speech translation evaluation compared to machine translation of text
- High-level human judgments
  – CMU (Gates et al., 1996)
  – Verbmobil (Nübel, 1997)
  – Binary or ternary ratings combine assessments of accuracy and fluency
- Humans score abstract semantic representations
  – Interlingua Interchange Format (Levin et al., 2000)
  – Predicate-argument structures (Belvin et al., 2004)
  – Fine-grained, low-level assessments

Automated Metrics
- High correlation with human judgments for translation of text, but dialog is different from text
  – Relies on context rather than explicitness
  – Variability: contractions, sentence fragments
  – Utterance length: TIDES average is 30 words/sentence
- Studies have primarily involved translation into English and other European languages, but Arabic is different from Western languages
  – Highly inflected
  – Variability: orthography, dialect, register, word order

TRANSTAC Evaluations
- Directed by NIST with support from MITRE (see Weiss et al. for details)
- Live evaluations
  – Military users
  – Iraqi Arabic bilinguals (the English speaker is masked)
  – Structured interactions (information is specified)
- Offline evaluations
  – Recorded dialogs held out from training data
  – Military users and Iraqi Arabic bilinguals
  – Spontaneous interactions elicited by scenario prompts

TRANSTAC Measures
- Live evaluations
  – Global binary judgments of 'high level concepts'
  – Speech input was or was not adequately communicated
- Offline evaluations
  – Automated measures (a scoring sketch follows below)
    - WER for speech recognition
    - BLEU for translation
    - TER for translation
    - METEOR for translation
  – Likert-style human judgments for a sample of offline data
  – Low-level concept analysis for a sample of offline data
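For orientation, the following is a minimal sketch of how the offline automated measures above can be computed with common open-source scorers (nltk for BLEU, jiwer for WER). These are stand-ins for illustration, not the NIST tooling used in the TRANSTAC evaluations; TER and METEOR require their own scorers and are omitted.

```python
# Illustrative scoring sketch; nltk and jiwer are assumed to be installed.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
import jiwer

refs = ["how many men did you see"]   # reference transcripts / translations
hyps = ["how many man did you see"]   # system outputs

# WER for speech recognition output against the reference transcript
print("WER:", jiwer.wer(refs, hyps))

# BLEU for translation output against one (or more) reference translations
bleu = corpus_bleu([[r.split()] for r in refs],
                   [h.split() for h in hyps],
                   smoothing_function=SmoothingFunction().method1)
print("BLEU:", bleu)
```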

Issues for Offline Evaluation
- Initial focus was similarity to live inputs
  – Scripted dialogs are not natural
  – Wizard methods are resource intensive
- Training data differs from actual use of the device
  – Disfluencies
  – Utterance lengths
  – No ability to repeat and rephrase
  – No dialog management ("I don't understand", "Please try to say that another way")
- Same speakers in both training and test sets

Training Data Unlike Actual Device Use
- then %AH how is the water in the area what's the -- what's the quality how does it taste %AH is there %AH %breath sufficient supply?
- the -- the first thing when it comes to %AH comes to fractures is you always look for %breath %AH fractures of the skull or of the spinal column %breath because these need to be these need to be treated differently than all other fractures.
- and then if in the end we find tha- -- that %AH -- that he may be telling us the truth we'll give him that stuff back.
- would you show me what part of the -- %AH %AH roughly how far up and down the street this %breath %UM this water covers when it backs up?

Selection Process
- Initial selection of representative dialogs (Appen); see the sketch below
  – Percentage of word tokens and types that occur in other scenarios: mid-range (87-91% in January)
  – Number of times a word in the dialog appears in the entire corpus: the average over all words is maximized
  – All scenarios are represented, roughly proportionately
  – A variety of speakers and genders is represented
- Criteria for selecting dialogues for the test set
  – Gender, speaker, and scenario distribution
  – Exclude dialogs with weak content or other issues, such as excessive disfluencies and utterances directed to the interpreter ("Greet him", "Tell him we are busy")
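As a rough illustration of the representativeness criteria above, here is a hypothetical sketch; the function and argument names are invented for this example and are not from the original selection tooling.

```python
from collections import Counter

def representativeness(dialog_tokens, other_scenario_types, corpus_counts):
    """Hypothetical sketch of the selection criteria described above."""
    types = set(dialog_tokens)
    # Share of the dialog's tokens and types that also occur in other scenarios
    token_overlap = sum(t in other_scenario_types for t in dialog_tokens) / len(dialog_tokens)
    type_overlap = len(types & other_scenario_types) / len(types)
    # Average corpus frequency of the dialog's words (to be maximized)
    avg_corpus_freq = sum(corpus_counts[t] for t in dialog_tokens) / len(dialog_tokens)
    return token_overlap, type_overlap, avg_corpus_freq

# Toy example
counts = Counter("how many men did you see there were four men".split())
print(representativeness("how many men did you see".split(),
                         {"there", "were", "four", "men", "did"}, counts))
```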

July 2007 Offline Data
- About 400 utterances for each translation direction
  – From 45 dialogues using 20 scenarios
  – Drawn from the entire set held back from data collected in 2007
- Two selection methods applied to the held-out data (200 utterances each)
  – Random: select every nth utterance
  – Hand: select fluent utterances (1 dialogue per scenario)
- 5 Iraqi Arabic dialogues selected for rerecording
  – About 140 utterances for each language
  – Selected from the same dialogues used for hand selection

Human Judgments
- High-level adequacy judgments (Likert-style)
  – Completely Adequate
  – Tending Adequate
  – Tending Inadequate
  – Inadequate
  – Score is the proportion judged Completely Adequate or Tending Adequate
- Low-level concept judgments
  – Each content word (c-word) in the source language is a concept
  – Translation score is based on insertion, deletion, and substitution errors
  – The DARPA score is reported as an odds ratio
  – For comparison to the automated metrics here, it is given as total correct c-words / (total correct c-words + total errors); see the sketch below
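The low-level concept score as defined above reduces to a simple proportion; the short sketch below spells it out. The function name and example counts are illustrative, not taken from the evaluation data.

```python
def concept_score(correct, insertions, deletions, substitutions):
    """Proportion of correct c-words out of correct c-words plus all errors."""
    errors = insertions + deletions + substitutions
    return correct / (correct + errors)

# Illustrative counts: 80 correct c-words, 5 insertions, 10 deletions, 5 substitutions
print(concept_score(80, 5, 10, 5))  # 0.8
```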

[Chart] Measures for Iraqi Arabic to English: automated metrics and human judgments for TRANSTAC systems A-E

[Chart] Measures for English to Iraqi Arabic: automated metrics and human judgments for TRANSTAC systems A-E

[Chart] Directional Asymmetries in Measures: BLEU scores and human adequacy judgments, English to Arabic vs. Arabic to English

Normalization for Automated Scoring
- Normalization for WER has become standard
  – NIST normalizes reference transcriptions and system outputs
  – Contractions, hyphens to spaces, reduced forms (wanna)
  – Partial matching on fragments
  – GLM mappings
- Normalization for BLEU scoring is not standard (see the sketch below)
  – Yet BLEU depends on matching n-grams
  – METEOR's stemming addresses some of the variation
- Output can communicate meaning in spite of inflectional errors
  – "two book", "him are my brother", "they is there"
- English-Arabic translation introduces much variation
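A minimal sketch of English-side normalization along the lines described above; the contraction mappings are illustrative examples standing in for a GLM, not NIST's actual tables.

```python
# Illustrative GLM-style mappings (not NIST's actual GLM)
GLM = {"doesn't": "does not", "wanna": "want to", "what's": "what is"}

def normalize_english(text):
    text = text.lower()
    text = text.replace("-", " ")  # hyphens (and "--" fragment markers) to spaces
    tokens = [GLM.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize_english("What's the -- what's the quality"))
# -> "what is the what is the quality"
```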

Orthographic Variation: Arabic
- Short vowel / shadda inclusions: جَمهُورِيَّة, جمهورية
- Variation in including explicit nunation: أحيانا, أحياناً
- Omission of the hamza: شي, شيء
- Misplacement of the seat of the hamza: الطوارئ or الطوارىء
- Variation where the taa marbuta should be used: بالجمجمة, بالجمجمه
- Confusion between yaa and alif maksura: شي, شى
- Initial alif with or without hamza/madda/wasla: اسم, إسم
- Variation in the spelling of Iraqi words: وياي, ويايا

Data Normalization
Two types of normalization were applied to both ASR/MT system outputs and references (see the sketch below):
1. Rule-based: simple diacritic normalization
   – e.g., آ, أ, إ => ا
2. GLM-based: lexical substitution
   – e.g., doesn't => does not
   – e.g., ﺂﺑﺍی => ﺂﺒﻫﺍی
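Below is a minimal sketch of the rule-based side of this normalization, covering the diacritic and character variation listed on the previous slide; the exact rule set used in the evaluation is not reproduced here, and the GLM substitutions would come from a lookup table like the English example above.

```python
import re

def normalize_arabic(text):
    """Rule-based Arabic normalization sketch (assumed rules, for illustration)."""
    text = re.sub(r"[\u064B-\u0652]", "", text)  # strip tanween, short vowels, shadda, sukun
    text = re.sub("[آأإ]", "ا", text)            # unify alef variants: آ, أ, إ => ا
    text = text.replace("ى", "ي")                # alif maksura => yaa
    text = text.replace("ة", "ه")                # taa marbuta => haa
    return text

print(normalize_arabic("جَمهُورِيَّة") == normalize_arabic("جمهورية"))  # True
```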

[Table] Normalization for English to Arabic Text: BLEU scores under Norm0, Norm1, and Norm2, with per-system averages (*CS = statistical MT version of CR, which is rule-based)

[Table] Normalization for Arabic to English Text: BLEU scores under Norm0, Norm1, and Norm2, with per-system averages

Summary
- For Iraqi Arabic to English MT, there is good agreement on the relative scores among all the automated measures and human judgments of the same data
- For English to Iraqi Arabic MT, there is fairly good agreement among the automated measures, but relative scores are less similar to human judgments of the same data
- Automated MT metrics exhibit a strong directional asymmetry, with Arabic to English scoring higher than English to Arabic, in spite of much lower WER for English
- Human judgments exhibit the opposite asymmetry
- Normalization improves BLEU scores

Future Work
- More Arabic normalization, beginning with function words orthographically attached to the following word
- Explore ways to overcome Arabic morphological variation without perfect analyses
  – Arabic WordNet?
- Resampling to test for significance and stability of scores (a resampling sketch follows below)
- Systematic contrast of live inputs and training data
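One common way to realize the resampling item above is bootstrap resampling over segments; the sketch below is a generic illustration with invented names, not a description of how TRANSTAC scores were actually resampled.

```python
import random

def bootstrap_interval(segments, score_fn, n_samples=1000, alpha=0.05):
    """Bootstrap a (1 - alpha) confidence interval for a corpus-level score_fn."""
    scores = []
    for _ in range(n_samples):
        resample = [random.choice(segments) for _ in segments]
        scores.append(score_fn(resample))
    scores.sort()
    lo = scores[int(alpha / 2 * n_samples)]
    hi = scores[int((1 - alpha / 2) * n_samples) - 1]
    return lo, hi

# Toy usage: interval for the mean of per-segment scores
print(bootstrap_interval([0.2, 0.5, 0.7, 0.9, 0.4], lambda s: sum(s) / len(s)))
```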

Rerecorded Scenarios
- Scripted from dialogs held back for training
  – New speakers were recorded reading the scripts
  – Based on the 5 dialogs used for hand selection
- Dialogues were edited minimally (see the sketch below)
  – Disfluencies, false starts, and fillers removed from the transcripts
  – A few entire utterances deleted
  – Instances of قل له "tell him" removed
- Scripts recorded at DLI
  – 138 English utterances, 141 Iraqi Arabic utterances
  – 89 English and 80 Arabic utterances have corresponding utterances in the hand-selected and randomly selected sets
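For concreteness, here is a sketch of the kind of transcript clean-up described above, keyed to the filler markup visible in the training-data examples (%AH, %UM, %breath) and the "--" false-start marker; the actual editing of the rerecording scripts was done by hand, so this is only an illustration.

```python
import re

def strip_fillers(utterance):
    utterance = re.sub(r"%\w+", "", utterance)       # remove fillers: %AH, %UM, %breath, ...
    utterance = re.sub(r"\S+-\s+--", "", utterance)  # drop false starts like "tha- --"
    utterance = utterance.replace("--", "")          # drop remaining restart markers
    return re.sub(r"\s+", " ", utterance).strip()

print(strip_fillers("and then if in the end we find tha- -- that %AH -- that he may be telling us the truth"))
# -> "and then if in the end we find that that he may be telling us the truth"
```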

[Table] WER, original vs. rerecorded utterances: English offline, English rerecorded, Arabic offline, Arabic rerecorded, with averages

[Table] English to Iraqi Arabic BLEU scores, original vs. rerecorded utterances, with per-system averages (*E2 = statistical MT version of E, which is rule-based)

[Table] Iraqi Arabic to English BLEU scores, original vs. rerecorded utterances, with per-system averages