1 Confidence Estimation for Machine Translation (J. Blatz et al., Coling 04). SSLI MTRG, 11/17/2004, Takahiro Shinozaki

2 Abstract
- Detailed study of CE for machine translation
- Various machine learning methods
- CE for sentences and for words
- Different definitions of correctness
- Experiments: NIST 2003 Chinese-to-English MT evaluation

3 1 Introduction
- CE can improve the usability of NLP-based systems
- CE techniques are not well studied in machine translation
- This work investigates sentence- and word-level CE

4 2 Background: Strong vs. weak CE
- Weak CE: requires only a binary correct/incorrect classification; the CE score is turned into a binary output by thresholding and need not be a probability
- Strong CE: requires genuine correctness probabilities, which can then also be thresholded into a binary output
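
A minimal sketch of the strong/weak distinction, assuming a simple thresholding scheme; the threshold value is illustrative only and would in practice be tuned on held-out data.

```python
# Minimal sketch of weak vs. strong CE. The threshold values are
# illustrative choices, not the paper's settings.
def weak_ce_decision(score: float, threshold: float) -> bool:
    """Weak CE: any confidence score can be thresholded into a
    binary correct/incorrect decision; it need not be a probability."""
    return score >= threshold

def strong_ce_decision(p_correct: float, threshold: float = 0.5) -> bool:
    """Strong CE: the score is a genuine probability P(c=1|x), so it can be
    thresholded like above but also used wherever a probability is needed."""
    return p_correct >= threshold
```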

5 2 Background: with or without a distinct CE layer
- No distinct CE layer: confidence comes directly from the NLP system itself
- Distinct CE layer: a separate CE module (Naïve Bayes, NN, SVM, etc.) sits on top of the NLP system; it requires a training corpus, but is powerful and modular

6 3 Experimental Setting
[Pipeline diagram: source input sentences are translated by the ISI Alignment Template MT system into N-best hypotheses; hypotheses are compared against reference sentences to decide correct or not, and the labeled data are split into train, validation, and test sets.]

7 3.1 Corpora
- Chinese-to-English
- Evaluation sets from NIST MT competitions
- Multi-reference corpus from LDC

8 3.2 CE Techniques
- Data: a collection of pairs (x, c), where x is a feature vector and c is the correctness label
- Weak CE: x -> score, e.g. x -> MLP -> score (regressing the MT evaluation score)
- Strong CE: x -> naïve Bayes -> P(c=1|x), or x -> MLP -> P(c=1|x)

9 3.2 Naïve Bayes (NB)
- Assume features are statistically independent given the class
- Apply absolute discounting
[Graphical model: class node C with feature nodes x_1, x_2, ..., x_D as children]
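
A minimal Python sketch of a naïve Bayes confidence estimator over discrete (e.g. binned) features with absolute discounting. The binning of features and the discount value are illustrative assumptions, not the paper's exact configuration.

```python
import math
from collections import defaultdict

# P(c|x) ~ P(c) * prod_d P(x_d | c), with absolute discounting to smooth
# each per-feature conditional distribution.
def train_naive_bayes(data, discount=0.5):
    """data: list of (features, c), features a tuple of discrete values,
    c in {0, 1}. Returns a function x -> P(c=1|x)."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))  # (dim, c) -> value -> count
    vocab = defaultdict(set)                             # dim -> observed values

    for feats, c in data:
        class_counts[c] += 1
        for d, v in enumerate(feats):
            feat_counts[(d, c)][v] += 1
            vocab[d].add(v)

    def cond_log_prob(d, v, c):
        counts = feat_counts[(d, c)]
        total = sum(counts.values())
        seen = len(counts)
        unseen = max(len(vocab[d]) - seen, 1)
        if v in counts:
            p = (counts[v] - discount) / total           # discounted seen mass
        else:
            p = (discount * seen) / (total * unseen)     # redistributed mass
        return math.log(max(p, 1e-12))

    def p_correct(feats):
        n = sum(class_counts.values())
        log_score = {}
        for c in class_counts:
            s = math.log(class_counts[c] / n)
            for d, v in enumerate(feats):
                s += cond_log_prob(d, v, c)
            log_score[c] = s
        m = max(log_score.values())
        z = sum(math.exp(s - m) for s in log_score.values())
        return math.exp(log_score.get(1, -1e9) - m) / z

    return p_correct
```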

10 3.2 Multi-Layer Perceptron (MLP)
- Non-linear mapping of input features: linear transformation layers followed by non-linear transfer functions
- Parameter estimation
  - Weak CE (regression): target is the MT evaluation score, minimizing a squared-error loss
  - Strong CE (classification): target is the binary correct/incorrect class, minimizing the negative log-likelihood
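
A minimal sketch of the two MLP training regimes using scikit-learn, which is not the toolkit used in the paper; the toy feature matrix, score and label arrays, and the hidden-layer size are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor, MLPClassifier

# Toy stand-ins for the sentence-level data described on the earlier slides.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 91))             # 91 sentence-level features (toy data)
nist_scores = rng.normal(size=1000)         # stand-in MT evaluation scores
correct_labels = (nist_scores > 0.5).astype(int)

# Weak CE: regress the MT evaluation score (squared-error loss).
weak_ce = MLPRegressor(hidden_layer_sizes=(20,), max_iter=500).fit(X, nist_scores)

# Strong CE: classify correct/incorrect (negative log-likelihood loss).
strong_ce = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500).fit(X, correct_labels)

scores = weak_ce.predict(X[:5])                 # confidence scores (weak CE)
probs = strong_ce.predict_proba(X[:5])[:, 1]    # P(c=1|x) (strong CE)
```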

11 3.3 Metrics for Evaluation
- Strong CE metric (evaluates the probability estimates): normalized cross entropy (NCE)
- Weak CE metrics (evaluate discriminability): classification error rate (CER), receiver operating characteristic (ROC)

12 3.3 Normalized Cross Entropy
- Cross entropy: negative log-likelihood of the test labels under the estimated probabilities
- Normalized cross entropy (NCE): cross entropy normalized against a baseline computed from the empirical probability of correctness
- Estimated probability: output of the CE module; empirical probability: obtained from the test set
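
The slide's formulas did not survive extraction; the following is a common formulation of NCE (as in NIST-style confidence evaluations) and may not match the paper's notation exactly. Here $\hat p(c_i{=}1\mid x_i)$ is the probability estimated by the CE module and $p$ is the empirical fraction of correct examples in the test set.

```latex
% Cross entropy of the CE module's estimates over N test examples
% (c_i = 1 if example i is correct, 0 otherwise):
H = -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\,c_i \log \hat p(c_i{=}1\mid x_i)
      + (1-c_i)\log\bigl(1-\hat p(c_i{=}1\mid x_i)\bigr)\Bigr]

% Baseline cross entropy from the empirical correctness rate p:
H_{\max} = -\,p \log p - (1-p)\log(1-p)

% Normalized cross entropy (higher is better; 0 = no better than baseline):
\mathrm{NCE} = \frac{H_{\max} - H}{H_{\max}}
```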

13 3.3 Classification Error Rate
- CER: ratio of samples with a wrong binary (correct/incorrect) prediction
- Threshold optimization: on the test set for the sentence-level experiments, on the validation set for the word-level experiments
- Baseline: classify every sample into the most frequent class
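
A minimal sketch of CER with threshold optimization and the majority-class baseline; the grid of candidate thresholds is an illustrative choice.

```python
import numpy as np

def cer(scores, labels, threshold):
    """Fraction of samples whose thresholded prediction disagrees with the label."""
    pred = (np.asarray(scores) >= threshold).astype(int)
    return float(np.mean(pred != np.asarray(labels)))

def best_threshold(scores, labels, n_grid=101):
    """Pick the threshold with lowest CER on held-out data (test set for the
    sentence-level experiments, validation set for the word-level ones)."""
    scores = np.asarray(scores)
    grid = np.linspace(scores.min(), scores.max(), n_grid)
    return min(grid, key=lambda t: cer(scores, labels, t))

def baseline_cer(labels):
    """Baseline: always predict the most frequent class."""
    p_correct = float(np.mean(labels))
    return min(p_correct, 1.0 - p_correct)
```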

14 3.3 Receiver Operating Characteristic (ROC)
Confusion counts (prediction vs. fact):

                        Fact: Correct   Fact: Incorrect
  Predicted correct           a               b
  Predicted incorrect         c               d

[ROC figure: the ROC curve plots the correct-accept ratio against the correct-reject ratio as the decision threshold varies; a random classifier traces a straight diagonal line, and better classifiers bow away from it. IROC is the area under the ROC curve.]
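
A minimal sketch of computing the ROC points and IROC. The axis definitions are assumptions: correct-accept ratio = fraction of truly correct examples that are accepted, correct-reject ratio = fraction of truly incorrect examples that are rejected, sweeping the acceptance threshold over the score list.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (correct-reject ratio, correct-accept ratio) points, sweeping
    the threshold from strictest to loosest."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    n_correct = labels.sum()
    n_incorrect = len(labels) - n_correct
    points, acc_correct, acc_incorrect = [], 0, 0
    for lab in labels:                       # accept one more example per step
        acc_correct += lab
        acc_incorrect += 1 - lab
        car = acc_correct / n_correct
        crr = 1.0 - acc_incorrect / n_incorrect
        points.append((crr, car))
    return points

def iroc(points):
    """Area under the ROC curve via the trapezoid rule."""
    xs, ys = zip(*sorted(points))
    return float(np.trapz(ys, xs))
```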

15 4 Sentence-Level Experiments
- MT evaluation measures: WERg (normalized word error rate), NIST (sentence-level NIST score)
- "Correctness" definition: thresholding WERg or thresholding NIST
- Threshold values chosen so that either 5% or 30% of the examples count as "correct"
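
A minimal sketch of deriving correctness labels by thresholding an MT evaluation measure so that a chosen fraction of sentences is labeled "correct". Using percentiles for this is an assumption about how the thresholds were realized; note that low WERg is good while high NIST is good.

```python
import numpy as np

def label_by_nist(nist_scores, correct_fraction=0.30):
    """Label the top correct_fraction of sentences (by NIST score) as correct."""
    threshold = np.quantile(nist_scores, 1.0 - correct_fraction)
    return (np.asarray(nist_scores) >= threshold).astype(int)

def label_by_werg(werg_scores, correct_fraction=0.05):
    """Label the lowest correct_fraction of sentences (by WERg) as correct."""
    threshold = np.quantile(werg_scores, correct_fraction)
    return (np.asarray(werg_scores) <= threshold).astype(int)
```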

16 4.1 Features
Total of 91 sentence-level features:
- Base-model-intrinsic: output of the 12 feature functions of the maximum-entropy-based base system, pruning statistics
- N-best list: rank, score ratio to the best hypothesis, etc.
- Source sentence: length, n-gram frequency statistics, etc.
- Target sentence: LM scores, parenthesis matching, etc.
- Source/target correspondence: IBM Model 1 probabilities, semantic similarity, etc.

17 4.2 MLP Experiments
- MLPs are trained on all features for the four problem settings
- Classification models are better than regression models
- Performance is better than the baseline
[Table 2: CER of strong CE (classification) and weak CE (regression) MLPs vs. the baseline, under the NIST (N) and WERg (W) correctness definitions]

18 4.3 Feature Comparison
Compare the contributions of individual features and of feature groups:
- All: all features
- Base: base-model scores
- BD: base-model dependent
- BI: base-model independent
- S: apply to the source sentence
- T: apply to the target sentence
- ST: apply to both source and target sentences

19 4.3 Feature Comparison (results)
Experimental condition: NIST 30%
- BD > BI
- T > ST > S
- CE layer > no CE layer
[Table 3 / Figure 1: results for the All, Base, BD, BI, S, T, and ST feature groups]

20 5 Word-Level Experiments
Definition of word correctness. A word is correct if:
- Pos: it occurs at exactly the same position as in the reference
- WER: it is aligned to a reference word (via the WER alignment)
- PER: it occurs anywhere in the reference
A "best" reference is selected from the multiple references. Ratio of "correct" words: Pos (15%) < WER (43%) < PER (64%)
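
A minimal sketch of labeling hypothesis words as correct under the Pos and PER criteria against a single chosen reference (selection of the "best" reference from multiple ones is omitted). A WER-based label would additionally require a Levenshtein alignment, which this sketch skips; the multiplicity handling in the PER variant is an assumption.

```python
from collections import Counter

def pos_labels(hyp_words, ref_words):
    """Pos: correct iff the word matches the reference at the same position."""
    return [i < len(ref_words) and w == ref_words[i]
            for i, w in enumerate(hyp_words)]

def per_labels(hyp_words, ref_words):
    """PER: correct iff the word occurs in the reference (each reference word
    licenses at most one hypothesis occurrence -- an illustrative assumption)."""
    remaining = Counter(ref_words)
    labels = []
    for w in hyp_words:
        if remaining[w] > 0:
            labels.append(True)
            remaining[w] -= 1
        else:
            labels.append(False)
    return labels

hyp = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
print(pos_labels(hyp, ref))  # position-exact matches
print(per_labels(hyp, ref))  # bag-of-words matches
```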

21 5.1 Features
Total of 17 word-level features:
- SMT-model-based features (2): identity of the alignment template, whether or not the word was translated by a rule
- IBM Model 1 (1): averaged word translation probability
- Word posterior and related measures (3x3): relative frequency, rank-weighted frequency, and word posterior probability, each in an any / source / target variant (the posterior variants are WPP-any, WPP-source, WPP-target)
- Target-language-based features (3+2): semantic features from WordNet; syntax check, number of occurrences in the sentence
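
A minimal sketch of an "any position" word posterior probability over an N-best list: the posterior mass of all hypotheses containing the word, normalized over the whole list. This is a simplified reading of WPP-any; the paper's exact weighting and position handling may differ.

```python
import math

def wpp_any(word, nbest):
    """nbest: list of (hypothesis_words, log_score) pairs."""
    max_log = max(s for _, s in nbest)
    weights = [math.exp(s - max_log) for _, s in nbest]   # unnormalized posteriors
    total = sum(weights)
    mass = sum(w for (hyp, _), w in zip(nbest, weights) if word in hyp)
    return mass / total

nbest = [("the cat sat".split(), -2.0),
         ("a cat sat".split(), -2.5),
         ("the dog sat".split(), -4.0)]
print(round(wpp_any("cat", nbest), 3))
```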

22 5.2 Performance of Single Features
Experimental setting: Naïve Bayes classifier, PER-based correctness (Table 4)
- WPP-any gives the best results: WPP-any > model 1 > WPP-source
- The top-3 combination beats any single feature
- No further gain from using ALL features

23 5.3 Comparison of Different Models
- Naïve Bayes vs. MLPs with different numbers of hidden units
- All features, PER-based correctness
- Naïve Bayes < MLP5
[Figure 2: results for Naïve Bayes, MLP0, MLP5, MLP10, MLP20]

24 5.4 Comparison of Word Error Measures
- Experimental setting: MLP20, all features (Table 5)
- PER is the easiest to learn

25 6 Conclusion
- A separate CE layer is useful
- Features derived from the base model are better than external ones
- N-best-based features are valuable
- Target-based features are more valuable than those that are not
- MLPs with hidden units are better than naïve Bayes

