Presentation on theme: "Adaptable Automatic Evaluation Metrics for Machine Translation. Lucian Vlad Lita, joint work with Alon Lavie and Monica Rogati."— Presentation transcript:

1 Adaptable Automatic Evaluation Metrics for Machine Translation  Lucian Vlad Lita  joint work with Alon Lavie and Monica Rogati

2 Outline  BLEU and ROUGE metric families  BLANC – a family of adaptable metrics: all common skip n-grams, local n-gram model, overall model  Experiments and results  Conclusions  Future work  References

3 Automatic Evaluation Metrics  Manual human judgments  Edit distance (WER)  Word overlap (PER)  Metrics based on n-grams: n-gram precision (BLEU), weighted n-grams (NIST), longest common subsequence (Rouge-L), skip 2-grams (pairs of ordered words – Rouge-S)  Integrate additional knowledge – synonyms, stemming (METEOR)  [Figure: translation quality(candidate | reference) vs. time]

4 Automatic Evaluation Metrics  Manual human judgments  Machine translation (MT) evaluation metrics: manually created estimators of quality; improvements often shown on the same data; rigid notion of quality; based on existing judgment guidelines  Goal: a trainable evaluation metric  [Figure: translation quality(candidate | reference) vs. time]

5 Goal: Trainable MT Metric  Build on the features used by established metrics (BLEU, ROUGE)  Extendable – additional features/processing  Correlate well with human judgments  Trainable models: different notions of "translation quality" (e.g. computer consumption vs. human consumption); different features will be more important for different languages and domains

6 The WER Metric  Transform the reference (human) translation R into the candidate (machine) translation C using the Levenshtein (edit) distance  Word Error Rate = (# of word insertions, deletions, and substitutions) / (# words in R)  R: the students asked the professor  C: the students talk professor
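A minimal Python sketch of the WER computation on this slide, using the standard word-level Levenshtein distance; the sentence pair is the slide's example, but the code is a generic textbook implementation, not the authors' tooling.

```python
def word_error_rate(reference, candidate):
    """WER = (insertions + deletions + substitutions) / len(reference),
    computed with the standard Levenshtein (edit) distance over words."""
    r, c = reference.split(), candidate.split()
    # dp[i][j] = edit distance between r[:i] and c[:j]
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # delete all reference words
    for j in range(len(c) + 1):
        dp[0][j] = j                      # insert all candidate words
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            sub = 0 if r[i - 1] == c[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(r)][len(c)] / len(r)

# Slide example: one substitution (asked -> talk) and one deletion (the)
# -> WER = 2 / 5 = 0.4
print(word_error_rate("the students asked the professor",
                      "the students talk professor"))
```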

7 The PER Metric  Word overlap between the candidate (machine) translation C and the reference (human) translation R – bag of words  Position-Independent Error Rate = ( Σ_w | count of w in R – count of w in C | ) / (# words in R)  R: the students asked the professor  C: the students talk professor
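A small sketch of the bag-of-words computation behind PER. It sums the count differences over the union of both vocabularies, a slight generalization of the slide's sum over w in C; again this is an illustration, not the authors' code.

```python
from collections import Counter

def position_independent_error_rate(reference, candidate):
    """PER as sketched on the slide: bag-of-words count differences,
    normalized by the reference length (word order is ignored)."""
    r, c = reference.split(), candidate.split()
    ref_counts, cand_counts = Counter(r), Counter(c)
    # sum of |count in R - count in C| over words occurring on either side
    diff = sum(abs(ref_counts[w] - cand_counts[w])
               for w in set(ref_counts) | set(cand_counts))
    return diff / len(r)

# Slide example: one "the" and "asked" missing from C, "talk" spurious
# -> (1 + 1 + 1) / 5 = 0.6
print(position_independent_error_rate("the students asked the professor",
                                       "the students talk professor"))
```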

8 The BLEU Metric  Contiguous n-gram overlap between the reference (human) translation R and the candidate (machine) translation C  Modified n-gram precisions: 1-gram precision = 3/4, 2-gram precision = 1/3, …  BLEU = ( Π_{i=1..n} P_i-gram )^(1/n) × (brevity penalty)  R: the students asked the professor  C: the students talk professor
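The following sketch reproduces the slide's modified (clipped) n-gram precisions and the geometric-mean combination for a single reference; the simplified brevity penalty and the default of n = 2 are my choices for illustration, not part of the talk.

```python
import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in the reference."""
    ref_ngrams = Counter(ngrams(reference.split(), n))
    cand_ngrams = Counter(ngrams(candidate.split(), n))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    return overlap / max(sum(cand_ngrams.values()), 1)

def bleu(reference, candidate, max_n=2):
    """Geometric mean of modified precisions times the brevity penalty."""
    r_len, c_len = len(reference.split()), len(candidate.split())
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    precisions = [modified_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

R = "the students asked the professor"
C = "the students talk professor"
print(modified_precision(R, C, 1), modified_precision(R, C, 2))  # 0.75, 0.333...
print(bleu(R, C))
```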

9 The BLEU Metric  BLEU is the most established evaluation metric in MT  Basic feature: contiguous n-grams of all sizes  Computes modified precision  Uses a simple formula to combine all precision scores  Bigram precision is “as important” as unigram precision  Brevity penalty – quasi recall

10 The Rouge-L Metric  Longest common subsequence (LCS) of the candidate (machine) translation C and the reference (human) translation R  R: the students asked the professor  C: the students talk professor  LCS = 3 ("the students … professor")  Precision = LCS(C,R) / # words in C  Recall = LCS(C,R) / # words in R  Rouge-L = harmonic mean(Precision, Recall) = 2PR / (P+R)
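A minimal sketch of Rouge-L using the standard LCS dynamic program over words; the example is the slide's, the implementation is generic.

```python
def lcs_length(reference, candidate):
    """Length of the longest common subsequence of two word sequences."""
    r, c = reference.split(), candidate.split()
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            if r[i - 1] == c[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(r)][len(c)]

def rouge_l(reference, candidate):
    lcs = lcs_length(reference, candidate)
    precision = lcs / len(candidate.split())
    recall = lcs / len(reference.split())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)   # harmonic mean

R = "the students asked the professor"
C = "the students talk professor"
print(lcs_length(R, C), rouge_l(R, C))   # LCS = 3 ("the students ... professor")
```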

11 The Rouge-S Metric  Skip 2-gram overlap of the candidate (machine) translation C and the reference (human) translation R  R: the students asked the professor  C: the students talk professor  Skip2(C) = 6 { "the students", "the talk", "the professor", "students talk", "students professor", "talk professor" }  Skip2(C,R) = 3 { "the students", "the professor", "students professor" }

12 The Rouge-S Metric  Skip 2-gram overlap of the candidate (machine) translation C and the reference (human) translation R  R: the students asked the professor  C: the students talk professor  Precision = Skip2(C,R) / (|C| choose 2)  Recall = Skip2(C,R) / (|R| choose 2)  Rouge-S = harmonic mean(Precision, Recall)
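A small sketch of the skip 2-gram computation on these two slides. Collecting pairs into a set collapses duplicate word pairs, which reproduces the slide's Skip2 counts for this example; ROUGE implementations differ in how repeated pairs are handled, so treat this as an illustration only.

```python
from itertools import combinations
from math import comb

def skip_bigrams(words):
    """All ordered word pairs, regardless of the gap between them."""
    return set(combinations(words, 2))    # set() ignores duplicate pairs

def rouge_s(reference, candidate):
    r, c = reference.split(), candidate.split()
    matches = len(skip_bigrams(r) & skip_bigrams(c))
    precision = matches / comb(len(c), 2)
    recall = matches / comb(len(r), 2)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

R = "the students asked the professor"
C = "the students talk professor"
print(len(skip_bigrams(C)), len(skip_bigrams(R) & skip_bigrams(C)))  # 6 and 3
print(rouge_s(R, C))
```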

13 The ROUGE Metrics  Rouge-L – basic feature: the longest common subsequence (LCS), i.e. the size of the longest common skip n-gram; weighted LCS variant  Rouge-S – basic feature: skip bigrams; skip bigram gap size is irrelevant; limited to n-grams of size 2  Both use the harmonic mean (F1-measure) to combine precision and recall

14 Is BLEU Trainable?  Can we assign/learn the relative importance of P2 vs. P3?  Simplest model: regression – train/test on past MT output [C,R]  Inputs: P1, P2, P3, … and the brevity penalty  (P1, P2, P3, bp) → HJ fluency score  BLEU = ( Π_{i=1..n} P_i-gram )^(1/n) × (brevity penalty)
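A minimal sketch of the regression idea on this slide: fit weights over the per-order precisions and the brevity penalty against human fluency judgments. The feature values and judgment scores below are made-up placeholders, not data from the talk.

```python
import numpy as np

# Hypothetical training data: one row per (candidate, reference) pair,
# columns are P1, P2, P3 and the brevity penalty; y holds human fluency scores.
X = np.array([
    [0.80, 0.55, 0.40, 1.00],
    [0.60, 0.30, 0.15, 0.90],
    [0.75, 0.50, 0.35, 0.95],
    [0.40, 0.10, 0.05, 0.85],
    [0.65, 0.35, 0.20, 1.00],
    [0.55, 0.25, 0.10, 0.80],
])
y = np.array([4.2, 2.8, 3.9, 1.7, 3.1, 2.4])

# Least-squares fit of  y ~ w0 + w1*P1 + w2*P2 + w3*P3 + w4*bp ;
# the learned weights play the role of the relative importance of each order.
features = np.hstack([np.ones((len(X), 1)), X])
weights, *_ = np.linalg.lstsq(features, y, rcond=None)
print(weights)

# Scoring a new translation is just a dot product with its feature vector.
new_features = np.array([1.0, 0.70, 0.45, 0.30, 0.92])
print(new_features @ weights)
```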

15 Is Rouge Trainable?  Simple regression on: the size of the longest common skip n-gram; the number of common skip 2-grams  Second-order parameters (dependencies) – the model is no longer linear in its inputs: window size (for computational reasons); an F-measure parameter (replacing the brevity penalty)  Potential models: iterative methods; hill climbing?  Non-linear model: (bp, |LCS|, Skip2, F-measure parameter, window size) → HJ fluency score

16 The BLANC Metric Family  Generalization of established evaluation metrics – n-gram features used by BLEU and ROUGE  Trainable parameters: skip n-gram contiguity in C; relative importance of n (i.e. bigrams vs. trigrams); precision-recall balance  Adaptability to different translation quality criteria, languages, domains  Allows additional processing/features (e.g. METEOR matching)

17 All Common Skip N-grams  C: the one pure student brought the necessary condiments  R: the new student brought the food  [Figure: lattice of matched word occurrences – the(0,0), student(2,3), brought(3,4), the(4,5), plus the crossing matches the(0,5) and the(4,0) – with a table of common skip n-gram counts: # 1-grams: 4, # 2-grams: 6, # 3-grams: 4, # 4-grams: 1]

18 All Common Skip N-grams  C: the one pure student brought the necessary condiments  R: the new student brought the food  [Figure: the same lattice with counts replaced by scores – score(1-grams), score(2-grams), score(3-grams), score(4-grams) – where each cell holds a pairwise term such as score(the(0,0), student(2,3))]

19 All Common Skip N-grams  Algorithms literature: all common subsequences  Listing vs. counting subsequences  Interested in counting the # of common subsequences of size 1, 2, 3 …  Replace counting with a sum of scores over all n-grams of the same size: Score(w1 … wi, wi+1 … wn) = Score(w1 … wi) ⊗ Score(wi+1 … wn)  BLANC_i(C,R) = f(common i-grams of C and R)
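One way to make the counting step concrete is the textbook dynamic program below, which counts common subsequences (skip n-grams, as matched pairs of increasing index sequences) of each size between a candidate and a reference. This is a generic sketch of the counting idea, not the BLANC implementation; in particular its treatment of repeated words can differ slightly from the convention used in the talk's figure.

```python
def common_skip_ngram_counts(candidate, reference, max_n):
    """count[k-1] = number of common skip k-grams between C and R, where a
    common skip k-gram is a matched pair of increasing index sequences."""
    c, r = candidate.split(), reference.split()
    # dp[k][i][j] = number of common skip k-grams within c[:i] and r[:j]
    dp = [[[0] * (len(r) + 1) for _ in range(len(c) + 1)]
          for _ in range(max_n + 1)]
    for i in range(len(c) + 1):
        for j in range(len(r) + 1):
            dp[0][i][j] = 1                     # the empty subsequence
    for k in range(1, max_n + 1):
        for i in range(1, len(c) + 1):
            for j in range(1, len(r) + 1):
                # inclusion-exclusion over matchings not pairing c[i-1], r[j-1]
                dp[k][i][j] = (dp[k][i - 1][j] + dp[k][i][j - 1]
                               - dp[k][i - 1][j - 1])
                if c[i - 1] == r[j - 1]:
                    # matchings whose last matched pair is (c[i-1], r[j-1])
                    dp[k][i][j] += dp[k - 1][i - 1][j - 1]
    return [dp[k][len(c)][len(r)] for k in range(1, max_n + 1)]

C = "the one pure student brought the necessary condiments"
R = "the new student brought the food"
print(common_skip_ngram_counts(C, R, 4))   # [6, 6, 4, 1] under this convention
```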

20 Modeling Gap Size Importance skip 3-grams … the ____ ____ ____ ____ student ____ ____ has … … the ____ student has … … the student has …

21 Modeling Gap Size Importance  Model the importance of the skip n-gram gap size as an exponential function with one parameter (α)  Special cases: gap size doesn't matter (Rouge-S): α = 0; no gaps are allowed (BLEU): α = a large number  C: … the __ __ __ __ student __ __ has …

22 Modeling Candidate-Reference Gap Difference skip 3-gram match C 1 : … the ____ ____ ____ ____ student ____ ____ has … R: … the ____ student has … C 2 : … the student has …

23 Modeling Candidate-Reference Gap Difference  Model the importance of the gap size difference between the candidate and reference translations as an exponential function with one parameter (β)  Special cases: gap size differences do not matter: β = 0; skip 2-gram overlap (Rouge-S): α = 0, β = 0, n = 2; largest skip n-gram (Rouge-L): α = 0, β = 0, n = LCS  C: … the __ __ __ __ student __ __ has …  R: … the __ student has …

24 Skip N-gram Model  Incorporate the simple scores into an exponential model: skip n-gram gap size; candidate-reference gap size difference (see the sketch below)  Possible to incorporate higher-level features: partial skip n-gram matching (e.g. synonyms, stemming) – "the __ students" vs. "the __ pupils", "the __ students" vs. "the __ student"; from word classing to syntax – e.g. should score("students __ __ professor") differ from score("the __ __ of")?
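A rough sketch of what such a local exponential score could look like, using the two gap features from the preceding slides (the symbols α and β are reconstructed from the garbled transcript; the exact feature set and parameterization in BLANC may differ).

```python
import math

def skip_ngram_score(cand_gaps, ref_gaps, alpha, beta):
    """Exponential score for one matched skip n-gram:
    features are the total gap size in the candidate and the per-position
    difference between candidate and reference gaps, combined as
    exp(-(alpha * f_gap + beta * f_gap_diff))."""
    f_gap = sum(cand_gaps)                                    # total gap in C
    f_gap_diff = sum(abs(g_c - g_r) for g_c, g_r in zip(cand_gaps, ref_gaps))
    return math.exp(-(alpha * f_gap + beta * f_gap_diff))

# "the __ __ __ __ student __ __ has" matched against "the __ student has":
cand_gaps = [4, 2]      # gaps between consecutive matched words in C
ref_gaps = [1, 0]       # gaps between the same words in R
print(skip_ngram_score(cand_gaps, ref_gaps, alpha=0.0, beta=0.0))   # 1.0: Rouge-S-like
print(skip_ngram_score(cand_gaps, ref_gaps, alpha=0.5, beta=0.5))   # heavily penalized
print(skip_ngram_score(cand_gaps, ref_gaps, alpha=50.0, beta=0.0))  # ~0: contiguous-only, BLEU-like
```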

25 BLANC Overview  [Pipeline diagram] Candidates + References → Find All Common Skip N-grams → Compute Skip N-gram Pair Features, e^(-Σ_i λ_i f_i(sn)) → Combine All Common Skip N-gram Scores (global parameters: precision/recall, f(skip n-gram size)) → Compute Correlation Coefficient (Pearson, Spearman) against the Criterion (adequacy, fluency, f(adequacy, fluency), other) → Trained Metric

26 Incorporating Global Features  Compute BLANC precision and recall for each n-gram size i  Global exponential model based on: n-gram size i; BLANC_i(C,R), i = 1..n; an F-measure parameter for each size i; average reference segment size; other scores (i.e. BLEU, ROUGE-L, ROUGE-S) …  Train for the average human judgment vs. train for the best overall correlation (as the error function)
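The combination step might be sketched like this: turn per-size precision/recall into an F-measure with a tunable balance, then combine sizes with learned weights. The weighting scheme and all numbers below are placeholders of mine, not the exact global model from the paper.

```python
import math

def f_measure(precision, recall, beta_f):
    """Weighted harmonic mean; beta_f > 1 favors recall, < 1 favors precision."""
    if precision == 0 or recall == 0:
        return 0.0
    return (1 + beta_f ** 2) * precision * recall / (beta_f ** 2 * precision + recall)

def blanc_global(per_size_pr, beta_f, size_weights):
    """Combine per-n-gram-size F-measures with exponential size weights
    (a simple stand-in for the global exponential model on the slide)."""
    scores = [f_measure(p, r, beta_f) for p, r in per_size_pr]
    weights = [math.exp(w) for w in size_weights]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total

# Hypothetical per-size (precision, recall) for sizes 1..3 and learned weights.
per_size_pr = [(0.75, 0.60), (0.45, 0.35), (0.20, 0.15)]
print(blanc_global(per_size_pr, beta_f=1.0, size_weights=[0.8, 0.3, -0.5]))
```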

27 Experiment Setup  TIDES evaluation data: Arabic → English, 2003 and 2004  Training and test sentences separated by year  Optimized: n-gram contiguity; difference in gap size (C vs. R); balance between precision and recall  Correlation measured with the Pearson correlation coefficient  Compared BLANC to BLEU and ROUGE  Trained BLANC for fluency vs. adequacy, system level vs. sentence level
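As a concrete reference point for the correlation step, this sketch computes the Pearson coefficient between metric scores and human judgments at the sentence level, and at the system level by first averaging per system. All scores below are invented placeholders, not the TIDES data.

```python
import numpy as np

# Hypothetical sentence-level metric scores and human adequacy judgments.
metric_scores = np.array([0.42, 0.55, 0.31, 0.68, 0.47, 0.73])
human_scores  = np.array([3.0, 3.5, 2.0, 4.0, 3.0, 4.5])
system_ids    = np.array([0, 0, 1, 1, 2, 2])   # which MT system produced each sentence

# Sentence-level correlation.
print(np.corrcoef(metric_scores, human_scores)[0, 1])

# System-level correlation: average per system, then correlate the averages.
sys_metric = [metric_scores[system_ids == s].mean() for s in np.unique(system_ids)]
sys_human  = [human_scores[system_ids == s].mean() for s in np.unique(system_ids)]
print(np.corrcoef(sys_metric, sys_human)[0, 1])
```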

28 Tides 2003 Arabic Evaluation  Pearson [-1,1] correlation with human judgments at system level and sentence level

            System Level           Sentence Level
Method      Adequacy   Fluency     Adequacy   Fluency
BLEU        0.950      0.934       0.382      0.286
NIST        0.962      0.939       0.439      0.304
Rouge-L     0.974      0.926       0.440      0.328
Rouge-S     0.949      0.935       0.360      0.328
BLANC       0.988      0.979       0.492      0.391

29 Tides 2004 Arabic Evaluation  Pearson [-1,1] correlation with human judgments at system level and sentence level

            System Level           Sentence Level
Method      Adequacy   Fluency     Adequacy   Fluency
BLEU        0.978      0.994       0.446      0.337
NIST        0.987      0.952       0.529      0.358
Rouge-L     0.981      0.985       0.538      0.412
Rouge-S     0.937      0.980       0.367      0.408
BLANC       0.982      0.994       0.565      0.438

30 Advantages of BLANC  Consistently good performance  Candidate evaluation is fast  Adaptable to: fluency and adequacy; languages and domains  Helps train MT systems for specific tasks, e.g. information extraction, information retrieval  Model complexity  Can be optimized for specific MT system performance levels

31 Disadvantages of BLANC  Training data vs. number of parameters  Model complexity  Guarantees of the training process

32 Conclusions  Move towards learning evaluation metrics: quality criteria – e.g. fluency, adequacy; correlation coefficients – e.g. Pearson, Spearman; languages – e.g. English, Arabic, Chinese  BLANC – a family of trainable evaluation metrics that consistently performs well in evaluating machine translation output

33 Future Work  Recently obtained a two-year NSF grant  Try different models and improve the training mechanism for BLANC: is a local exponential model the best choice? is a global exponential model the best choice? explore different training methods  Integrate additional features  Apply BLANC to other tasks (e.g. summarization)

34 References  Leusch, Ueffing, Vilar and Ney, “Preprocessing and Normalization for Automatic Evaluation of Machine Translation.” IEEMTS Workshop, ACL 2005  Lin and Och, “Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics”, ACL 2004  Lita, Lavie and Rogati, “BLANC: Learning Evaluation Metrics for MT”, HLT-EMNLP 2005  Papineni, Roukos, Ward and Zhu, “BLEU: A Method for Automatic Evaluation of Machine Translation”, IBM Report 2002  Akiba, Imamura and Sumita, “Using Multiple Edit Distances to Automatically Rank Machine Translation Output”, MT Summit VIII 2001  Su, Wu and Chang, “A new Quantitative Quality Measure for a Machine Translation System”, COLING 1992

35 Thank you

36 Acronyms, acronyms …  Official: Broad Learning Adaptation for Numeric Criteria  Inspiration: white light contains light of all frequencies  Fun: Building on Legacy Acronym Naming Conventions  Bleu, Rouge, Orange, Pourpre … Blanc?

