1
Confidence Estimation for Machine Translation. J. Blatz et al., COLING 2004. SSLI MTRG, 11/17/2004, Takahiro Shinozaki
2
Abstract: Detailed study of confidence estimation (CE) for machine translation; various machine learning methods; CE for sentences and for words; different definitions of correctness; experiments on the NIST 2003 Chinese-to-English MT evaluation
3
1 Introduction: CE can improve the usability of NLP-based systems; CE techniques are not well studied in machine translation; this work investigates sentence-level and word-level CE
4
2 Background: Strong vs. weak CE. In both cases a CE score is compared with a threshold to give a binary output. Strong CE: the score must be a correctness probability. Weak CE: only binary classification is required; the score is not necessarily a probability
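The distinction between the two settings can be illustrated with a small sketch. The function names and the default threshold of 0.5 are illustrative assumptions, not taken from the slides.

```python
def weak_ce_decision(score: float, threshold: float) -> bool:
    """Weak CE: any real-valued score is thresholded into accept/reject."""
    return score >= threshold


def strong_ce_decision(p_correct: float, threshold: float = 0.5) -> bool:
    """Strong CE: the score is interpreted as P(correct | x), so it must lie in [0, 1].

    The default threshold of 0.5 is an assumption made for this sketch.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("strong CE requires a probability in [0, 1]")
    return p_correct >= threshold
```

A weak CE score such as a log-likelihood of -3.2 is acceptable input to the first function, while the second rejects anything that is not a valid probability.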
5
2 Background: Has a CE layer or not. No distinct CE layer vs. distinct CE layer (NLP system followed by a CE module: naïve Bayes, NN, SVM, etc.). A distinct CE layer requires a training corpus; it is powerful and modular
6
3 Experimental Setting: Input sentences (Src) are passed to the translation system (ISI Alignment Template MT system), which produces N-best hypotheses (Hyp); data are split into Train, Validation, and Test; reference sentences give the label C (Correct or Not)
7
3.1 Corpora: Chinese-to-English; evaluation sets from NIST MT competitions; multi-reference corpus from LDC
8
3.2 CE Techniques: Data: a collection of pairs (x, c), where x is a feature vector and c is the correctness. Weak CE: x → score; x → MLP → score (regressing the MT evaluation score). Strong CE: x → naïve Bayes → P(c=1|x); x → MLP → P(c=1|x)
9
3.2 Naïve Bayes (NB): Assume features are statistically independent given the class; apply absolute discounting. Graphical model: class C with children x_1, x_2, ..., x_D
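A minimal sketch of a naïve Bayes confidence estimator with absolute discounting follows. It assumes that features have already been discretized into bin indices and that the discounted mass is spread uniformly over unseen bins; the discount value b = 0.5 and all function names are assumptions made for illustration, not details given on the slide.

```python
import math
from collections import Counter


def discounted_probs(counts, n_values, b=0.5):
    """Absolute discounting: subtract b from every nonzero count and
    share the freed probability mass equally among unseen values."""
    total = sum(counts.values())
    seen = sum(1 for v in range(n_values) if counts[v] > 0)
    unseen = n_values - seen
    if unseen == 0:
        # Nothing to redistribute to, so fall back to relative frequencies.
        return [counts[v] / total for v in range(n_values)]
    freed = b * seen / total
    return [(counts[v] - b) / total if counts[v] > 0 else freed / unseen
            for v in range(n_values)]


def train_nb(samples, n_values):
    """samples: list of (x, c), x a tuple of bin indices, c in {0, 1}.
    Assumes both classes occur in the training data."""
    n_features = len(samples[0][0])
    class_counts = Counter(c for _, c in samples)
    model = {"prior": {}, "cond": {}}
    for c in (0, 1):
        model["prior"][c] = class_counts[c] / len(samples)
        model["cond"][c] = [
            discounted_probs(Counter(x[d] for x, cls in samples if cls == c), n_values)
            for d in range(n_features)
        ]
    return model


def posterior_correct(model, x):
    """P(c=1 | x) under the feature-independence assumption."""
    log_joint = {
        c: math.log(model["prior"][c])
        + sum(math.log(model["cond"][c][d][v]) for d, v in enumerate(x))
        for c in (0, 1)
    }
    m = max(log_joint.values())
    z = sum(math.exp(value - m) for value in log_joint.values())
    return math.exp(log_joint[1] - m) / z
```

Because every bin keeps nonzero probability after discounting, the posterior never hits a zero-probability feature value at test time.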
10
3.2 Multi Layer Perceptron: Non-linear mapping of input features (linear transformation layers; non-linear transfer functions). Parameter estimation: weak CE (regression): target is the MT evaluation score, minimizing a squared error loss; strong CE (classification): target is the binary correct/incorrect class, minimizing negative log likelihood
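The two training objectives can be sketched on top of a single-hidden-layer forward pass. The tanh transfer function, the sigmoid output for classification, and all names below are assumptions chosen for illustration; the slide does not state which transfer functions were used.

```python
import math


def mlp_forward(x, W1, b1, w2, b2):
    """One hidden layer: linear transformation, tanh, then a linear output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def squared_error(output, target_score):
    """Weak CE (regression): the target is the MT evaluation score."""
    return (output - target_score) ** 2


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def negative_log_likelihood(output, c):
    """Strong CE (classification): the target c is 1 (correct) or 0 (incorrect).
    The output is squashed to P(c=1 | x) with a sigmoid (an assumption)."""
    p = sigmoid(output)
    return -math.log(p) if c == 1 else -math.log(1.0 - p)
```

The same network body serves both settings; only the target and the loss change.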
11
3.3 Metrics for Evaluation: Strong CE metric (evaluates the probability distribution): normalized cross entropy (NCE). Weak CE metrics (evaluate discriminability): classification error rate (CER); receiver operating characteristic (ROC)
12
3.3 Normalized Cross Entropy: Cross entropy (negative log-likelihood) is computed from the probability estimated by the CE module; normalized cross entropy (NCE) normalizes it using the empirical probability obtained from the test set
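The formulas themselves did not survive on this slide. The sketch below uses the common definition NCE = (NLL_b - NLL) / NLL_b, where NLL is the negative log-likelihood of the CE module's probabilities and NLL_b is the negative log-likelihood obtained from the empirical class priors alone. Treat this definition and the function name as assumptions to check against the paper.

```python
import math


def normalized_cross_entropy(probs_correct, labels):
    """probs_correct[i] is the estimated P(c=1 | x_i); labels[i] is 0 or 1.

    Assumes both classes occur in labels and no probability is exactly
    0 or 1 for a sample of the opposite class.
    """
    n = len(labels)
    n1 = sum(labels)
    n0 = n - n1
    # Negative log-likelihood of the CE module's probabilities.
    nll = -sum(math.log(p if c == 1 else 1.0 - p)
               for p, c in zip(probs_correct, labels))
    # Baseline: every sample gets the empirical class prior.
    nll_b = -(n1 * math.log(n1 / n) + n0 * math.log(n0 / n))
    return (nll_b - nll) / nll_b
```

Under this definition a model that only reproduces the class prior scores 0, and values closer to 1 indicate better probability estimates.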
13
3.3 Classification Error Rate: CER: ratio of samples with a wrong binary (correct/incorrect) prediction. Threshold optimization: sentence-level experiments use the test set; word-level experiments use the validation set. Baseline: assign every sample to the most frequent class
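A small sketch of CER with threshold optimization is given below. The exhaustive search over observed score values is an assumed procedure, and the baseline is taken to be the error of always predicting the most frequent class; function names are illustrative.

```python
def cer(scores, labels, threshold):
    """Fraction of samples whose thresholded prediction disagrees with the label."""
    errors = sum((s >= threshold) != bool(c) for s, c in zip(scores, labels))
    return errors / len(labels)


def best_threshold(scores, labels):
    """Try every observed score (plus +inf, i.e. reject everything)
    and keep the threshold with the lowest CER."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: cer(scores, labels, t))


def baseline_cer(labels):
    """Error rate of always predicting the most frequent class (assumed baseline)."""
    n1 = sum(labels)
    return min(n1, len(labels) - n1) / len(labels)
```

Tuning the threshold on the same set that is scored, as in the sentence-level experiments, gives an optimistic CER; tuning on a validation set, as in the word-level experiments, avoids that.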
14
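3.3 Receiver Operating Characteristic: 2x2 table of prediction (Correct/Incorrect) against fact (Correct/Incorrect) with cell counts a, b, c, d. ROC curve: correct-reject ratio plotted against correct-accept ratio, shown together with the curve of a random classifier. IROC: area under the ROC curve; larger is better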
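The ROC quantities can be sketched as follows, taking the correct-accept ratio to be the fraction of correct items that are accepted and the correct-reject ratio to be the fraction of incorrect items that are rejected. The threshold sweep and the trapezoidal area are assumptions made for illustration.

```python
def roc_points(scores, labels):
    """Return (correct_reject_ratio, correct_accept_ratio) pairs,
    one per threshold, ordered by increasing threshold."""
    n_correct = sum(labels)
    n_incorrect = len(labels) - n_correct
    points = []
    for t in sorted(set(scores)) + [float("inf")]:
        accepted_correct = sum(1 for s, c in zip(scores, labels) if s >= t and c == 1)
        rejected_incorrect = sum(1 for s, c in zip(scores, labels) if s < t and c == 0)
        points.append((rejected_incorrect / n_incorrect,
                       accepted_correct / n_correct))
    return points


def iroc(scores, labels):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = roc_points(scores, labels)
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A score that separates the two classes perfectly gives an IROC of 1, and a score that ranks them in exactly the wrong order gives 0.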
15
4 Sentence Level Experiments: MT evaluation measures: WERg (normalized word error rate); NIST (sentence-level NIST score). “Correctness” definition: thresholding WERg or thresholding NIST. Threshold value: set to give 5% “correct” examples or 30% “correct” examples
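One way to turn a sentence-level metric into binary labels with a fixed share of “correct” examples is sketched below. It assumes the threshold is chosen purely by rank, with ties broken arbitrarily; the function name and the example values are hypothetical.

```python
def label_top_fraction(metric_values, fraction, higher_is_better=True):
    """Mark the best `fraction` of sentences as correct (1), the rest as incorrect (0).

    Use higher_is_better=True for a score such as NIST and
    higher_is_better=False for an error rate such as WERg.
    """
    n_correct = round(fraction * len(metric_values))
    order = sorted(range(len(metric_values)),
                   key=lambda i: metric_values[i],
                   reverse=higher_is_better)
    labels = [0] * len(metric_values)
    for i in order[:n_correct]:
        labels[i] = 1
    return labels


# Hypothetical WERg values for ten sentences; lower is better.
werg = [0.1, 0.5, 0.3, 0.9, 0.7, 0.2, 0.4, 0.6, 0.8, 1.0]
labels_30 = label_top_fraction(werg, 0.30, higher_is_better=False)
```

With the 5% setting the “correct” class is rare, so the majority-class baseline is already strong and a classifier has little room to improve on it.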
16
4.1 Features: Total of 91 sentence-level features. Base-model-intrinsic: output from 12 functions for the maximum-entropy-based base system; pruning statistics. N-best list: rank, score ratio to the best, etc. Source sentence: length, n-gram frequency statistics, etc. Target sentence: LM scores, parenthesis matching, etc. Source/target correspondence: IBM Model 1 probabilities, semantic similarity, etc.
17
4.2 MLP Experiments: MLPs are trained on all features for the four problem settings; classification models are better than regression models; performance is better than the baseline. Table 2 (CER; N: NIST, W: WERg): columns Strong CE (Classification), Weak CE (Regression), BASE; surviving entries: N/A, 3.21, 32.5, 5.65, 32.5
18
4.3 Feature Comparison: Compare contributions of features, both individual features and groups of features. Groups: All: all features; Base: base model scores; BD: base-model dependent; BI: base-model independent; S: apply to source sentence; T: apply to target sentence; ST: apply to source and target sentence
19
4.3 Feature Comparison (results): Exp. condition: NIST 30%. Base All; BD > BI; T > ST > S; CE layer > no CE layer. Table 3; Figure 1 (curves: ALL, Base, BD, BI, S, T, ST)
20
5 Word Level Experiments: Definition of word correctness. A word is correct if: Pos: it occurs at exactly the same position as in the reference; WER: it is aligned to the reference; PER: it occurs in the reference. Select a “best” transcript from multiple references. Ratio of “correct” words: Pos (15%) < WER (43%) < PER (64%)
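The three word-correctness definitions can be sketched against a single reference as follows. The WER variant assumes a standard Levenshtein alignment in which a word counts as correct when it is aligned to an identical reference word; the PER variant ignores word counts. Both are simplifications made for illustration.

```python
def pos_correct(hyp, ref):
    """Pos: the word sits at exactly the same position as in the reference."""
    return [i < len(ref) and w == ref[i] for i, w in enumerate(hyp)]


def per_correct(hyp, ref):
    """PER: the word occurs anywhere in the reference."""
    ref_words = set(ref)
    return [w in ref_words for w in hyp]


def wer_correct(hyp, ref):
    """WER: the word is matched to an identical reference word
    in a minimum edit distance alignment."""
    n, m = len(hyp), len(ref)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]),
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)
    # Trace back one optimal alignment, preferring the diagonal step.
    correct = [False] * n
    i, j = n, m
    while i > 0 and j > 0:
        if dist[i][j] == dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            correct[i - 1] = hyp[i - 1] == ref[j - 1]
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return correct


hyp = "the cat on mat sat".split()
ref = "the cat sat on mat".split()
```

On this toy pair the share of correct words grows from Pos (2 of 5) to WER (4 of 5) to PER (5 of 5), the same ordering as the ratios reported on the slide.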
21
5.1 Features: Total of 17 features. SMT model based features (2): identity of alignment template; whether or not translated by a rule. IBM Model 1 (1): averaged word translation probability. Word posterior and related measures (3x3): three measures (relative frequency, rank-weighted frequency, word posterior probability), each computed for three position conditions (any, source, target); the word posterior probability variants are WPP-any, WPP-source, and WPP-target. Target language based features (3+2): semantic features by WordNet; syntax check, number of occurrences in the sentence
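A rough sketch of N-best based word features follows. It assumes that hypothesis posteriors come from normalizing exponentiated model log scores over the N-best list, that the "any" condition means the word appears anywhere in a hypothesis, and that the "target" condition means it appears at the same target position. These readings, and all function names, are assumptions; the source-position variant is omitted because it needs word alignments.

```python
import math


def hypothesis_posteriors(log_scores):
    """Normalize exponentiated log scores over the N-best list."""
    m = max(log_scores)
    weights = [math.exp(s - m) for s in log_scores]
    z = sum(weights)
    return [w / z for w in weights]


def wpp_any(word, nbest, posteriors):
    """Posterior mass of hypotheses containing the word at any position."""
    return sum(p for hyp, p in zip(nbest, posteriors) if word in hyp)


def wpp_target(word, position, nbest, posteriors):
    """Posterior mass of hypotheses with the word at the same target position."""
    return sum(p for hyp, p in zip(nbest, posteriors)
               if position < len(hyp) and hyp[position] == word)


def relative_frequency_any(word, nbest):
    """Fraction of N-best hypotheses that contain the word."""
    return sum(word in hyp for hyp in nbest) / len(nbest)


# Hypothetical three-entry N-best list with equal scores.
nbest = [["a", "b"], ["b", "a"], ["a", "c"]]
post = hypothesis_posteriors([0.0, 0.0, 0.0])
```

The "any" condition is the most permissive of the three, so for a given word its value is never smaller than the position-specific one.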
22
5.2 Performance of Single Features: Experimental setting: naïve Bayes classifier; PER-based correctness. Table 4: WPP-any gives the best results; WPP-any > model1 > WPP-source; Top3 > any of the single features; no gain for ALL
23
5.3 Comparison of Different Models: Naïve Bayes and MLPs with different numbers of hidden units; all features, PER-based correctness. Models compared: naïve Bayes, MLP0, MLP5, MLP10, MLP20; naïve Bayes < MLP5. Figure 2
24
5.4 Comparison of Word Error Measures: Experimental settings: MLP20, all features. PER is the easiest to learn. Table 5
25
6 Conclusion: A separate CE layer is useful; features derived from the base model are better than external ones; N-best based features are valuable; target-based features are more valuable than those that are not; MLPs with hidden units are better than naïve Bayes