Confidence Estimation for Machine Translation (J. Blatz et al., COLING 2004). SSLI MTRG, 11/17/2004, Takahiro Shinozaki.

Abstract A detailed study of confidence estimation (CE) for machine translation: various machine learning methods, CE for sentences and for words, and different definitions of correctness. Experiments use the NIST 2003 Chinese-to-English MT evaluation data.

1 Introduction CE can improve the usability of NLP-based systems, but CE techniques are not well studied in machine translation. This work investigates sentence-level and word-level CE.

2 Background Strong vs. weak CE. Strong CE requires correctness probabilities; weak CE requires only a binary accept/reject decision, obtained by comparing a CE score against a threshold, and the score is not necessarily a probability.
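A minimal sketch of the distinction in Python (function names and the example threshold are illustrative, not from the paper):

```python
def weak_ce_decision(score: float, threshold: float) -> bool:
    """Weak CE: only an accept/reject decision is required; the score
    compared against the threshold need not be a probability."""
    return score >= threshold

def strong_ce_estimate(p_correct: float) -> float:
    """Strong CE: the module must output a genuine correctness
    probability P(c=1|x)."""
    assert 0.0 <= p_correct <= 1.0, "strong CE outputs must be probabilities"
    return p_correct

# A strong CE output can always be turned into a weak CE decision:
# weak_ce_decision(strong_ce_estimate(0.8), threshold=0.5) -> True
```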

2 Background With or without a CE layer. A system may produce confidence directly (no distinct CE layer), or its output may be passed to a separate CE module (naïve Bayes, neural network, SVM, etc.). A distinct CE layer requires a training corpus, but it is powerful and modular.

3 Experimental Setting Source sentences are translated by the ISI Alignment Template MT system into N-best lists of hypotheses. Each hypothesis is labeled correct or not by comparison with reference sentences, and the data are split into train, validation, and test sets.

3.1 Corpora Chinese-to-English evaluation sets from the NIST MT competitions and a multi-reference corpus from the LDC.

3.2 CE Techniques Data: a collection of pairs (x, c), where x is a feature vector and c is a correctness label. Weak CE: x → score, e.g. x → MLP → score (regressing an MT evaluation score). Strong CE: x → naïve Bayes → P(c=1|x), or x → MLP → P(c=1|x).

3.2 Naïve Bayes (NB) Features x_1, ..., x_D are assumed statistically independent given the class C; the conditional distributions are smoothed with absolute discounting.
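A sketch of what such a classifier could look like for discretized features; the exact discounting scheme and the function names are assumptions for illustration, not the paper's implementation:

```python
import numpy as np
from collections import defaultdict

def train_nb(X, c, discount=0.5):
    """Naive Bayes with absolute discounting.
    X: (N, D) array of discretised feature values, c: (N,) 0/1 labels.
    Counts of seen feature values are reduced by `discount` and the freed
    mass is spread uniformly over unseen values of that feature."""
    N, D = X.shape
    priors = np.array([np.mean(c == 0), np.mean(c == 1)])
    vocab = [set(X[:, d]) for d in range(D)]
    tables = [defaultdict(dict), defaultdict(dict)]
    for k in (0, 1):
        Xk = X[c == k]
        for d in range(D):
            values, counts = np.unique(Xk[:, d], return_counts=True)
            seen = dict(zip(values, counts))
            unseen = [v for v in vocab[d] if v not in seen]
            freed = discount * len(seen)
            for v in vocab[d]:
                if v in seen:
                    tables[k][d][v] = (seen[v] - discount) / len(Xk)
                elif unseen:
                    tables[k][d][v] = freed / (len(Xk) * len(unseen))
    return priors, tables

def p_correct(x, priors, tables, floor=1e-10):
    """P(c=1 | x) under the feature-independence assumption."""
    log_post = np.log(priors)
    for k in (0, 1):
        for d, v in enumerate(x):
            log_post[k] += np.log(tables[k][d].get(v, floor))
    post = np.exp(log_post - log_post.max())
    return post[1] / post.sum()
```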

3.2 Multi-Layer Perceptron A non-linear mapping of the input features built from linear transformation layers and non-linear transfer functions. Parameter estimation: for weak CE (regression) the target is an MT evaluation score and training minimizes a squared-error loss; for strong CE (classification) the target is the binary correct/incorrect class and training minimizes the negative log-likelihood.
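A sketch of the two training regimes, using scikit-learn MLPs in place of the paper's own implementation; the data below are random placeholders and the hidden-layer size and iteration counts are illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor, MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 91))          # 91 sentence-level features (placeholder)
nist_train = rng.normal(size=200)             # MT evaluation scores (placeholder)
correct_train = rng.integers(0, 2, size=200)  # binary correctness labels (placeholder)
X_test = rng.normal(size=(50, 91))

# Weak CE (regression): target is an MT evaluation score,
# trained by minimizing a squared-error loss.
weak = MLPRegressor(hidden_layer_sizes=(20,), max_iter=500).fit(X_train, nist_train)
scores = weak.predict(X_test)

# Strong CE (classification): target is the binary correct/incorrect class,
# trained by minimizing the negative log-likelihood.
strong = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500).fit(X_train, correct_train)
p_correct = strong.predict_proba(X_test)[:, 1]
```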

3.3 Metrics for Evaluation Strong CE metric, which evaluates the probability estimates: normalized cross entropy (NCE). Weak CE metrics, which evaluate discriminability: classification error rate (CER) and receiver operating characteristic (ROC).

3.3 Normalized Cross Entropy The cross entropy (average negative log-likelihood) of the probabilities estimated by the CE module is compared against the cross entropy of the empirical correctness probability obtained from the test set; NCE measures the relative improvement.
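A sketch of how NCE can be computed, following the common definition as the relative reduction in cross entropy over a baseline that always predicts the empirical correct rate; this exact formula is assumed rather than quoted from the paper:

```python
import numpy as np

def normalized_cross_entropy(p_hat, c):
    """p_hat: estimated P(c=1|x) from the CE module, c: 0/1 labels.
    Assumes both classes occur in the test set. Higher NCE is better;
    0 means no improvement over always predicting the empirical rate."""
    p_hat = np.clip(np.asarray(p_hat, dtype=float), 1e-12, 1 - 1e-12)
    c = np.asarray(c, dtype=float)
    p_c = c.mean()  # empirical correct rate
    h_base = -(p_c * np.log(p_c) + (1 - p_c) * np.log(1 - p_c))
    h_model = -np.mean(c * np.log(p_hat) + (1 - c) * np.log(1 - p_hat))
    return (h_base - h_model) / h_base
```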

3.3 Classification Error Rate CER: the ratio of samples with a wrong binary (correct/incorrect) prediction. The decision threshold is optimized on the test set for the sentence-level experiments and on the validation set for the word-level experiments, and results are compared against a baseline classifier.
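A minimal sketch of CER and of picking the threshold on a held-out set; the brute-force search over observed score values is an assumption for illustration:

```python
import numpy as np

def cer(scores, c, threshold):
    """Fraction of examples whose thresholded prediction disagrees
    with the true correct/incorrect label."""
    pred = np.asarray(scores) >= threshold
    return float(np.mean(pred != np.asarray(c, dtype=bool)))

def best_threshold(scores, c):
    """Threshold minimizing CER on the set used for optimization
    (test set for sentence level, validation set for word level)."""
    return min(np.unique(scores), key=lambda t: cer(scores, c, t))
```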

3.3 Receiver Operating Characteristic (ROC) Counting predictions against facts gives a = correct and predicted correct, b = correct but predicted incorrect, c = incorrect but predicted correct, d = incorrect and predicted incorrect. The ROC curve plots the correct-accept ratio, a/(a+b), against the correct-reject ratio, d/(c+d), over the range [0, 1] as the decision threshold varies. A random confidence score traces the diagonal; better classifiers bow toward the upper-right corner, and IROC (the area under the ROC curve) summarizes discriminability.
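A sketch of the ROC sweep and of IROC as the area under the resulting curve; the trapezoidal integration is an implementation assumption:

```python
import numpy as np

def roc_points(scores, c):
    """Sweep the threshold and collect (correct-reject, correct-accept) pairs.
    correct-accept = fraction of truly correct examples that are accepted,
    correct-reject = fraction of truly incorrect examples that are rejected."""
    scores = np.asarray(scores, dtype=float)
    c = np.asarray(c, dtype=bool)
    points = []
    for t in np.unique(np.concatenate(([-np.inf], scores, [np.inf]))):
        accept = scores >= t
        ca = accept[c].mean() if c.any() else 0.0
        cr = (~accept)[~c].mean() if (~c).any() else 0.0
        points.append((float(cr), float(ca)))
    return sorted(points)

def iroc(scores, c):
    """Area under the ROC curve (trapezoidal rule); larger is better."""
    xs, ys = zip(*roc_points(scores, c))
    return float(np.trapz(ys, xs))
```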

4 Sentence-Level Experiments MT evaluation measures: WERg (normalized word error rate) and NIST (sentence-level NIST score). "Correctness" is defined by thresholding WERg or NIST, with threshold values chosen so that either 5% or 30% of the examples count as "correct".
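A sketch of how such labels can be derived; choosing the threshold as an empirical quantile so that a fixed fraction of examples comes out "correct" is an assumption about the procedure, and the variable names in the comments are placeholders:

```python
import numpy as np

def correctness_labels(metric_values, fraction_correct, higher_is_better):
    """Label roughly the best `fraction_correct` of examples as correct by
    thresholding a sentence-level metric (WERg: lower is better,
    NIST: higher is better)."""
    v = np.asarray(metric_values, dtype=float)
    if higher_is_better:
        return v >= np.quantile(v, 1.0 - fraction_correct)
    return v <= np.quantile(v, fraction_correct)

# The paper's two operating points, e.g.:
# labels_5  = correctness_labels(nist_scores, 0.05, higher_is_better=True)
# labels_30 = correctness_labels(werg_scores, 0.30, higher_is_better=False)
```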

4.1 Features A total of 91 sentence-level features: base-model-intrinsic (outputs of the 12 feature functions of the maximum-entropy base system, pruning statistics); N-best list (rank, score ratio to the best hypothesis, etc.); source sentence (length, n-gram frequency statistics, etc.); target sentence (LM scores, parenthesis matching, etc.); and source/target correspondence (IBM Model 1 probabilities, semantic similarity, etc.).

4.2 MLP Experiments MLPs are trained on all features for the four problem settings (NIST vs. WERg correctness, 5% vs. 30% "correct"). The classification (strong CE) models are better than the regression (weak CE) models, and performance is better than the baseline CER (Table 2).

4.3 Feature Comparison The contributions of individual features and of feature groups are compared. Groups: All (all features), Base (base-model scores), BD (base-model dependent), BI (base-model independent), S (features of the source sentence), T (features of the target sentence), ST (features of both source and target sentences).

4.3 Feature Comparison (results) Under the NIST 30% condition (Table 3, Figure 1): the All and Base feature sets are compared; base-model-dependent features beat base-model-independent ones (BD > BI); target-side features beat combined and source-only ones (T > ST > S); and having a distinct CE layer beats having none.

5 Word-Level Experiments Definitions of word correctness. A word is correct if it: (Pos) occurs at exactly the same position as in the reference; (WER) is aligned to the reference word in the WER alignment; (PER) occurs anywhere in the reference. A "best" reference translation is selected from the multiple references. The resulting ratios of "correct" words: Pos (15%) < WER (43%) < PER (64%).
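A sketch of the position-based and PER-style checks against a single reference; the WER-based notion additionally requires a Levenshtein alignment, which is omitted here, as is the selection among multiple references:

```python
def word_correctness(hyp, ref):
    """hyp, ref: lists of tokens.
    Pos: the hypothesis word matches the reference word at the same position.
    PER: the hypothesis word occurs anywhere in the reference."""
    pos = [i < len(ref) and hyp[i] == ref[i] for i in range(len(hyp))]
    ref_words = set(ref)
    per = [w in ref_words for w in hyp]
    return pos, per

# Example: word_correctness("the cat sat".split(), "the dog sat".split())
# -> pos = [True, False, True], per = [True, False, True]
```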

5.1 Features A total of 17 word-level features: SMT-model-based features (2): identity of the alignment template, and whether or not the word was translated by a rule; IBM Model 1 (1): averaged word translation probability; word posterior and related measures (3x3): relative frequency, rank-weighted frequency, and word posterior probability, each with "any", source-position, and target-position matching (WPP-any, WPP-source, WPP-target for the posterior variant); target-language-based features (3+2): semantic features from WordNet, a syntax check, and the number of occurrences in the sentence.
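A sketch of the "any" variant of the word posterior probability computed from an N-best list; normalizing the translation-model scores into hypothesis posteriors this way is an assumption, and the source- and target-position variants and the frequency-based measures differ in how hypotheses are matched and weighted:

```python
import numpy as np
from collections import defaultdict

def wpp_any(nbest, log_scores):
    """nbest: list of hypotheses (each a list of target tokens),
    log_scores: their translation-model log-scores.
    A word's posterior is the total normalized probability mass of the
    hypotheses that contain it, regardless of position."""
    log_scores = np.asarray(log_scores, dtype=float)
    weights = np.exp(log_scores - log_scores.max())
    weights /= weights.sum()
    posterior = defaultdict(float)
    for hyp, w in zip(nbest, weights):
        for word in set(hyp):
            posterior[word] += w
    return dict(posterior)
```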

5.2 Performance of Single Features Experimental setting: naïve Bayes classifier, PER-based correctness (Table 4). WPP-any gives the best results (WPP-any > Model 1 > WPP-source); combining the top three features beats any single feature, but using all features gives no further gain.

5.3 Comparison of Different Models Naïve Bayes and MLPs with different numbers of hidden units (MLP0, MLP5, MLP10, MLP20) are compared, using all features and PER-based correctness (Figure 2). Naïve Bayes is outperformed by MLPs with hidden units (naïve Bayes < MLP5).

5.4 Comparison of Word Error Measures Experimental settings: MLP20 with all features (Table 5). PER-based correctness is the easiest to learn.

6 Conclusion A separate CE layer is useful. Features derived from the base model are better than external ones, N-best-based features are valuable, and target-based features are more valuable than features that do not use the target. MLPs with hidden units are better than naïve Bayes.