June 2004 DARPA TIDES MT Workshop: Measuring Confidence Intervals for MT Evaluation Metrics. Ying Zhang, Stephan Vogel, Language Technologies Institute, Carnegie Mellon University

Presentation transcript:

June 2004 DARPA TIDES MT Workshop
Measuring Confidence Intervals for MT Evaluation Metrics
Ying Zhang, Stephan Vogel
Language Technologies Institute, Carnegie Mellon University

MT Evaluation Metrics
Human Evaluations (LDC)
  – Fluency and Adequacy
Automatic Evaluation Metrics
  – mWER: edit distance between the hypothesis and the closest reference translation
  – mPER: position-independent error rate
  – BLEU
  – Modified BLEU
  – NIST

Measuring the Confidence Intervals
One score per test set
How accurate is this score?
To measure the confidence interval, a population of scores is required
Building many test sets with multiple human reference translations is expensive
Bootstrapping (Efron 1986)
  – Introduced in 1979 as a computer-based method for estimating the standard error of a statistical estimate
  – Resampling: creating an artificial population by sampling with replacement
  – Proposed by Franz Och (2003) to measure the confidence intervals for automatic MT evaluation metrics
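The resampling idea on this slide can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption that is not in the slides: the test-set score is taken to be the mean of hypothetical per-sentence scores. The function names `bootstrap_scores` and `percentile_interval` are made up for this sketch.

```python
import random

def bootstrap_scores(sentence_scores, B=2000, seed=0):
    """Build B artificial test sets by sampling sentences with replacement
    and return the score of each one (assumption: score = mean of
    per-sentence scores)."""
    rng = random.Random(seed)
    n = len(sentence_scores)
    scores = []
    for _ in range(B):
        sample = [sentence_scores[rng.randrange(n)] for _ in range(n)]
        scores.append(sum(sample) / n)
    return scores

def percentile_interval(scores, alpha=0.05):
    """Read a (1 - alpha) interval off the sorted bootstrap scores."""
    ordered = sorted(scores)
    B = len(ordered)
    lo = ordered[int(B * alpha / 2)]
    hi = ordered[min(B - 1, int(B * (1 - alpha / 2)))]
    return lo, hi

if __name__ == "__main__":
    toy = [0.2, 0.4, 0.1, 0.9, 0.5, 0.7, 0.3, 0.6]  # made-up per-sentence scores
    print(percentile_interval(bootstrap_scores(toy)))
```

With `alpha=0.10` and `B=1000`, the interval endpoints are the sorted values at positions 50 and 950.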

A Schematic of the Bootstrapping Process [figure]

An Efficient Implementation
Translate and evaluate on 2,000 test sets?
  – No way!
Resample the n-gram precision information for the sentences
  – Most MT systems are context independent at the sentence level
  – MT evaluation metrics are based on information collected for each test sentence
  – E.g. for BLEU and NIST: RefLen: ClosestRefLen 56 1-gram:
  – Similar for human judgment and other MT metrics
Approximation for NIST information gain
Scripts available at: torial.htm
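A minimal sketch of this shortcut for BLEU, assuming each sentence has already been reduced to its sufficient statistics. The dictionary keys (`hyp_len`, `closest_ref_len`, `matched`, `total`) and function names are hypothetical and do not come from the authors' scripts; the BLEU formula is the standard geometric mean of n-gram precisions with a brevity penalty.

```python
import math
import random

def bleu_from_stats(sentence_stats, max_n=4):
    """Corpus BLEU from per-sentence counts, without re-translating."""
    hyp_len = sum(s["hyp_len"] for s in sentence_stats)
    ref_len = sum(s["closest_ref_len"] for s in sentence_stats)
    if hyp_len == 0:
        return 0.0
    log_prec = 0.0
    for n in range(max_n):
        matched = sum(s["matched"][n] for s in sentence_stats)
        total = sum(s["total"][n] for s in sentence_stats)
        if matched == 0 or total == 0:
            return 0.0  # unsmoothed BLEU is zero if any precision is zero
        log_prec += math.log(matched / total) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(log_prec)

def bootstrap_bleu(sentence_stats, B=2000, seed=0):
    """Resample the stored statistics with replacement B times and
    rescore each artificial test set."""
    rng = random.Random(seed)
    n = len(sentence_stats)
    return [
        bleu_from_stats([sentence_stats[rng.randrange(n)] for _ in range(n)])
        for _ in range(B)
    ]
```

Each replicate costs only a pass over stored counts, which is what makes thousands of artificial test sets affordable.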

Confidence Intervals
7 MT systems from the June 2002 evaluation
Observations:
  – Relative confidence interval: NIST < M-Bleu < Bleu
  – I.e. NIST scores have more discriminative power than BLEU

Are Two MT Systems Different?
Comparing two MT systems' performance
  – Using a method similar to that for a single system
  – E.g. Diff(Sys1-Sys2): Median= [ , ]
  – If the confidence interval of the difference includes 0, the two systems are not significantly different
  – M-Bleu and NIST have more discriminative power than Bleu
  – Automatic metrics have pretty high correlations with the human ranking
  – Human judges like system E (Syntactic system) more than B (Statistical system), but automatic metrics do not
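One way to read the comparison above is as a paired bootstrap: both systems are scored on the same resampled sentences, and the interval is taken over the score difference. The sketch below assumes hypothetical per-sentence scores whose mean is the test-set score; the function names are illustrative only.

```python
import random

def bootstrap_difference(scores_sys1, scores_sys2, B=2000, alpha=0.05, seed=0):
    """Return the median and (1 - alpha) interval of Diff(Sys1 - Sys2),
    resampling the same sentence indices for both systems."""
    if len(scores_sys1) != len(scores_sys2):
        raise ValueError("both systems must be scored on the same test set")
    rng = random.Random(seed)
    n = len(scores_sys1)
    diffs = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        s1 = sum(scores_sys1[i] for i in idx) / n
        s2 = sum(scores_sys2[i] for i in idx) / n
        diffs.append(s1 - s2)
    diffs.sort()
    median = diffs[B // 2]
    lo = diffs[int(B * alpha / 2)]
    hi = diffs[min(B - 1, int(B * (1 - alpha / 2)))]
    return median, (lo, hi)

def significantly_different(interval):
    """The systems differ significantly only if the interval excludes 0."""
    lo, hi = interval
    return not (lo <= 0.0 <= hi)
```

Pairing the samples matters: sentences that are hard for one system tend to be hard for the other, so resampling the systems independently would overstate the variance of the difference.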

How much testing data is needed? [figure]

How much testing data is needed?
NIST scores increase steadily with the growing test set size
The distance between the scores of the different systems remains stable when using 40% or more of the test set
The confidence intervals become narrower for larger test sets
* System A (Bootstrap Size B=2000)

How many reference translations are sufficient?
Confidence intervals become narrower with more reference translations
Roughly equivalent interval widths: [100%](1-ref) ~ [80-90%](2-ref) ~ [70-80%](3-ref) ~ [60-70%](4-ref), where the bracketed figure is the share of the test data used
One additional reference translation compensates for 10-15% of the testing data
* System A (Bootstrap Size B=2000)

Bootstrap-t interval vs. normal/t interval
Normal distribution / t-distribution
Student's t-interval (when n is small)
Bootstrap-t interval
  – For each bootstrap sample, calculate
  – The alpha-th percentile is estimated by the value, such that
  – Bootstrap-t interval is
  – E.g. if B=1000, the 50th largest value and the 950th largest value give the bootstrap-t interval
Assuming that
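The formulas on this slide did not survive in the transcript. For orientation, the textbook definitions from Efron and Tibshirani are given below; the notation is theirs and may differ from what the slide showed. Here \hat{\theta} is the score on the original test set, \hat{\theta}^*(b) the score on the b-th bootstrap sample, and \widehat{se} a standard error estimate.

```latex
% Normal-theory interval with coverage 1 - 2\alpha
\left[\, \hat{\theta} - z^{(1-\alpha)}\,\widehat{se},\;\;
         \hat{\theta} - z^{(\alpha)}\,\widehat{se} \,\right]

% Bootstrap-t: studentized statistic for each bootstrap sample b = 1, \dots, B
Z^*(b) = \frac{\hat{\theta}^*(b) - \hat{\theta}}{\widehat{se}^*(b)}

% The \alpha-th percentile \hat{t}^{(\alpha)} is the value such that
\frac{\#\{\, Z^*(b) \le \hat{t}^{(\alpha)} \,\}}{B} = \alpha

% Bootstrap-t interval with coverage 1 - 2\alpha
\left[\, \hat{\theta} - \hat{t}^{(1-\alpha)}\,\widehat{se},\;\;
         \hat{\theta} - \hat{t}^{(\alpha)}\,\widehat{se} \,\right]
```

With B=1000 and \alpha=0.05, \hat{t}^{(\alpha)} and \hat{t}^{(1-\alpha)} are the 50th and 950th ordered values of Z^*(b), which matches the example on the slide.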

Bootstrap-t interval vs. Normal/t interval (Cont.)
Bootstrap-t intervals assume no distribution, but
  – They can give erratic results
  – They can be heavily influenced by a few outlying data points
When B is large, the bootstrap sample scores are pretty close to a normal distribution
Assuming a normal distribution gives more reliable intervals, e.g. for the BLEU relative confidence interval (B=500)
  – STDEV=0.27 for bootstrap-t interval
  – STDEV=0.14 for normal/student-t interval

The Number of Bootstrap Replications B
The ideal bootstrap estimate of the confidence interval takes B → ∞
Computational time increases linearly with B
The greater the B, the smaller the standard deviation of the estimated confidence intervals. E.g. for BLEU's relative confidence interval
  – STDEV = 0.60 when B=100
  – STDEV = 0.27 when B=500
Two rules of thumb:
  – Even a small B, say B=100, is usually informative
  – B>1000 gives quite satisfactory results
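The effect of B on the stability of the interval can be checked empirically by repeating the whole bootstrap several times and measuring how much the interval width varies. The sketch below is an illustration only: it assumes the score is the mean of hypothetical per-sentence scores and takes "relative confidence interval" to mean the interval width divided by the point estimate, which the transcript does not define.

```python
import random
import statistics

def relative_interval_width(sentence_scores, B, rng, alpha=0.05):
    """Width of the percentile interval divided by the point estimate
    (assumed definition of the relative confidence interval)."""
    n = len(sentence_scores)
    means = sorted(
        sum(sentence_scores[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(B)
    )
    lo = means[int(B * alpha / 2)]
    hi = means[min(B - 1, int(B * (1 - alpha / 2)))]
    return (hi - lo) / (sum(sentence_scores) / n)

def width_stdev(sentence_scores, B, repeats=30, seed=0):
    """Standard deviation of the relative width over repeated bootstrap runs."""
    rng = random.Random(seed)
    widths = [relative_interval_width(sentence_scores, B, rng) for _ in range(repeats)]
    return statistics.stdev(widths)

if __name__ == "__main__":
    data_rng = random.Random(42)
    toy = [data_rng.random() for _ in range(50)]  # made-up per-sentence scores
    for B in (100, 500, 1000):
        print(B, width_stdev(toy, B))
```

The absolute numbers will not match the STDEV values on the slide, which come from real BLEU data; only the downward trend with growing B is expected to carry over.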

References
Efron, B. and R. Tibshirani: 1986, 'Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy', Statistical Science 1.
F. Och: 2003, 'Minimum Error Rate Training in Statistical Machine Translation', In Proc. of ACL, Sapporo, Japan.
M. Bisani and H. Ney: 2004, 'Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation', In Proc. of ICASSP, Montreal, Canada, Vol. 1.
G. Leusch, N. Ueffing, H. Ney: 2003, 'A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation', In Proc. of 9th MT Summit, New Orleans, LA.
I. Dan Melamed, Ryan Green and Joseph P. Turian: 2003, 'Precision and Recall of Machine Translation', In Proc. of NAACL/HLT 2003, Edmonton, Canada.
King M., Popescu-Belis A. & Hovy E.: 2003, 'FEMTI: creating and using a framework for MT evaluation', In Proc. of 9th Machine Translation Summit, New Orleans, LA, USA.
S. Nießen, F.J. Och, G. Leusch, H. Ney: 2000, 'An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research', In Proc. of LREC 2000, Athens, Greece.
NIST Report: 2002, 'Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics'.
Papineni, Kishore & Roukos, Salim et al.: 2002, 'BLEU: A Method for Automatic Evaluation of Machine Translation', In Proc. of the 20th ACL.
Ying Zhang, Stephan Vogel, Alex Waibel: 2004, 'Interpreting BLEU/NIST scores: How much improvement do we need to have a better system?', In Proc. of LREC 2004, Lisbon, Portugal.