1 (title slide)

2
 Linguistic information is seamlessly combined with statistical information in translation systems to produce perfect translations
 We are moving in that direction:
 ◦ Morphology
 ◦ Syntax
 ◦ Semantics (SRL): (Wu & Fung 2009), (Liu & Gildea 2010), (Aziz et al. 2011)
Meanwhile…

3
 Linguistic information to evaluate MT quality
 ◦ Based on reference translations
 Linguistic information to estimate MT quality
 ◦ Using machine learning
 Linguistic information to detect errors in MT
 ◦ Automatic post-editing

4
 Handle variations in MT (words and structure) wrt the reference, or identify differences between MT and reference:
 METEOR (Denkowski & Lavie 2011): words and phrases
 (Giménez & Màrquez 2010): matching of lexical, syntactic, semantic and discourse units
 (Lo & Wu 2011): SRL and manual matching of 'who' did 'what' to 'whom', etc.
 (Rios et al. 2011): automatic SRL with automatic (inexact) matching of predicates and arguments

5 MT evaluation
 Essentially: matching of linguistic units (a minimal sketch follows)
 Similar to n-gram matching metrics, but the units are not only words
 Metrics based on lexical units perform better
 Issues:
 ◦ Lack of (good) resources for certain languages
 ◦ Unreliable processing of incorrect translations
 ◦ Sparsity at sentence level, depending on the actual features, e.g. matching of named entities
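To make the unit-matching idea concrete, here is a minimal sketch, not any specific metric from this talk: an F1 score over multisets of linguistic units, which reduces to unigram matching when the units are words and generalises to lemmas, named entities or SRL roles.

```python
from collections import Counter

def unit_f1(hyp_units, ref_units):
    """F1 over multisets of linguistic units (words, lemmas, SRL roles...)."""
    hyp, ref = Counter(hyp_units), Counter(ref_units)
    overlap = sum((hyp & ref).values())          # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Units can be plain words (BLEU-1-like) or richer annotations:
hyp = ["the", "cat", "sat"]
ref = ["the", "cat", "sat", "down"]
print(unit_f1(hyp, ref))   # 0.857...
```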

6
 Goal: given the output of an MT system for a given input, provide an estimate of its quality
 Uses:
 ◦ Filter out bad-quality translations from post-editing
 ◦ Select "perfect" translations for publishing
 ◦ Flag unreliable translations to readers who only know the target language
 ◦ Select the best translation for a given input when multiple MT/TM systems are available

7
 NOT standard MT evaluation:
 ◦ Reference translations are NOT available
 ◦ Estimation for unseen translations
 My approach:
 ◦ Translation unit: sentence
 ◦ Independent from the MT system

8
1. Define the aspect of quality to estimate and how to represent it
2. Identify and extract features that explain that aspect of quality
3. Collect examples of translations with different levels of quality and annotate them
4. Learn a model to predict quality scores for new translations and evaluate it
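A hypothetical skeleton of these four steps; all names here are illustrative, not from the talk.

```python
def build_qe_model(pairs, annotate, extract_features, learner):
    """pairs: (source, translation) tuples; returns a fitted quality model."""
    # Step 3: collect examples and annotate them with a quality score
    scores = [annotate(src, mt) for src, mt in pairs]
    # Step 2: extract features that (hopefully) explain that quality aspect
    X = [extract_features(src, mt) for src, mt in pairs]
    # Step 4: learn a model mapping features to scores (evaluation not shown)
    model = learner()
    model.fit(X, scores)
    return model
```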

9
[Architecture diagram: the source text yields complexity indicators, the MT system yields confidence indicators, the translation yields fluency indicators, and the source-translation pair yields adequacy indicators; together they predict quality]
Features can be shallow or linguistically motivated

10
 (S/T/S-T) Sentence length
 (S/T) Language model
 (S/T) Token-type ratio
 (S) Readability metrics: Flesch, etc.
 (S) Average number of possible translations per word
 (S) % of n-grams belonging to different frequency quartiles of a source-language corpus
 (T) Untranslated/OOV words
 (T) Mismatching brackets, quotation marks
 (S-T) Preservation of punctuation
 (S-T) Word alignment score
 etc.
These do well for estimating general quality wrt post-editing needs, but are not enough for other aspects of quality… (a sketch of a few such features follows)
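A minimal sketch of a few of the shallow indicators above, assuming whitespace tokenisation; the OOV proxy used here (target tokens copied verbatim from the source) is an illustrative assumption, not the talk's exact definition.

```python
import string

def shallow_features(src, mt):
    src_toks, mt_toks = src.split(), mt.split()
    src_set = set(src_toks)
    n_src, n_mt = max(len(src_toks), 1), max(len(mt_toks), 1)
    return {
        # (S/T/S-T) sentence lengths and their ratio
        "src_len": len(src_toks),
        "mt_len": len(mt_toks),
        "len_ratio": len(mt_toks) / n_src,
        # (S/T) token-type ratio
        "src_ttr": len(set(src_toks)) / n_src,
        "mt_ttr": len(set(mt_toks)) / n_mt,
        # (T) untranslated/OOV proxy: target tokens also present in the source
        "untranslated": sum(t in src_set for t in mt_toks),
        # (T) mismatching brackets and quotation marks
        "bracket_mismatch": abs(mt.count("(") - mt.count(")")),
        "quote_mismatch": mt.count('"') % 2,
        # (S-T) preservation of punctuation
        "punct_diff": abs(sum(c in string.punctuation for c in src)
                          - sum(c in string.punctuation for c in mt)),
    }
```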

11 Count-based
 (S/T/S-T) Content/non-content words
 (S/T/S-T) Nouns/verbs/… NP/VP/…
 (S/T/S-T) Deictics (references)
 (S/T/S-T) Discourse markers (references)
 (S/T/S-T) Named entities
 (S/T/S-T) Zero subjects
 (S/T/S-T) Pronominal subjects
 (S/T/S-T) Negation indicators
 (T) Subject-verb / adjective-noun agreement
 (T) Language model of POS
 (T) Grammar checking (dangling words)
 (T) Coherence
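A sketch of some of these count-based features using spaCy, purely as an illustration: the talk does not prescribe a toolkit, and the dependency labels used here ("nsubj", "neg") are spaCy's English conventions.

```python
# Needs: pip install spacy; python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def count_features(text):
    doc = nlp(text)
    content_pos = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}
    return {
        "content_words": sum(t.pos_ in content_pos for t in doc),
        "nouns": sum(t.pos_ == "NOUN" for t in doc),
        "verbs": sum(t.pos_ == "VERB" for t in doc),
        "named_entities": len(doc.ents),
        "pronominal_subjects": sum(
            t.dep_ == "nsubj" and t.pos_ == "PRON" for t in doc
        ),
        "negations": sum(t.dep_ == "neg" for t in doc),
    }
```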

12 Alignment-based
 (S-T) Correct translation of pronouns
 (S-T) Matching of dependency relations
 (S-T) Matching of named entities
 (S-T) Alignment of parse trees
 (S-T) Alignment of predicates & arguments
 etc.
Some features are language-dependent; others need language-dependent resources but apply to most languages, e.g. an LM of POS tags
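One way such a contrastive feature could look, sketched under the assumption that word alignments are available as (source index, target index) pairs from an aligner such as GIZA++ or fast_align: the fraction of source named-entity tokens aligned into target named entities.

```python
def ne_alignment_recall(src_ne_idx, tgt_ne_idx, alignment):
    """src_ne_idx/tgt_ne_idx: sets of token positions inside named entities."""
    if not src_ne_idx:
        return 1.0          # nothing to preserve
    matched = {s for s, t in alignment if s in src_ne_idx and t in tgt_ne_idx}
    return len(matched) / len(src_ne_idx)

# Example: source NE tokens at positions {2, 3}; only position 2 is
# aligned into a target NE, so the feature value is 0.5.
print(ne_alignment_recall({2, 3}, {4}, {(0, 0), (2, 4), (3, 5)}))
```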

13
 Count-based feature representation:
 ◦ Source/target only: count or proportion
 ◦ Contrastive features (S-T): very important, but not a simple matching of linguistic units
  Alignment may not be possible (e.g. clauses/phrases)
  Force the same linguistic phenomena in S and T?
  Verbs translated as nouns (Vs as Ns)
How to model different linguistic phenomena? (S = linguistic unit in source; T = linguistic unit in target …)

14
 Count-based feature representation:
 ◦ Monotonicity of features
 ◦ Sparsity: is 0-0 as good as 10-10?
 Our representation: precision and recall (sketch below)
 ◦ Does not rely on alignment
 ◦ Upper bound = 1 (also holds for S,T = 0)
 ◦ Lower bound = 0
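One plausible reading of this precision/recall representation (an assumption; the slide does not give the exact formulas): treat the smaller of the two counts as the matched mass and define the 0-0 case as perfect, which yields the stated bounds.

```python
def count_precision_recall(s, t):
    """Precision/recall over counts s, t of a linguistic unit in source/target."""
    if s == 0 and t == 0:
        return 1.0, 1.0          # upper bound also holds for S,T = 0
    matched = min(s, t)
    precision = matched / t if t else 0.0
    recall = matched / s if s else 0.0
    return precision, recall

print(count_precision_recall(3, 5))   # (0.6, 1.0)
print(count_precision_recall(0, 0))   # (1.0, 1.0)
```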

15
 S-T: (Pighin & Màrquez 2011): learn the expected projection of SRL from source to target
 S-T: (Xiong et al. 2010)
 ◦ Target LM of words and POS tags, dangling words (Link Grammar parser), word posterior probabilities
 S-T: (Bach et al. 2011)
 ◦ Sequences of words and POS tags, context, dependency structures, alignment info
Fine-grained, so they need a lot of training data: 72K sentences, 2.2M words and their manual correction (!)

16
 Estimating post-editing effort
 ◦ Human scores (1-4): how much post-editing effort?
  1: requires complete retranslation
  2: a lot of post-editing needed, but quicker than retranslation
  3: a little post-editing needed
  4: fit for purpose
 Estimating adequacy
 ◦ Human scores (1-4): to what degree does the translation convey the meaning of the original text?
  1: completely inadequate
  2: poorly adequate
  3: fairly adequate
  4: highly adequate

17
 Machine learning algorithm: SVM for regression
 Evaluation: Root Mean Square Error (RMSE)
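A minimal sketch of this setup with scikit-learn (a toolkit assumption; the slide names only the algorithm and the metric), using placeholder data in place of real QE features and scores.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

# X: one row of QE features per sentence; y: 1-4 quality scores
X = np.random.rand(200, 10)              # placeholder feature matrix
y = np.random.uniform(1, 4, size=200)    # placeholder quality scores

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# SVM for regression, as on the slide
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_tr, y_tr)

# Root Mean Square Error on held-out data
pred = model.predict(X_te)
rmse = np.sqrt(np.mean((pred - y_te) ** 2))
print(f"RMSE: {rmse:.3f}")
```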

18
 English-Spanish Europarl data
 ◦ 4 SMT systems
 4 sets of 4,000 {source, translation, score} triples
 Quality score: 1-4 post-editing effort
 Features: 96 shallow versus 169 shallow + linguistic

19
 Distribution of post-editing effort scores:

Score          MT1    MT2    MT3    MT4
1              4%     9%     10%    73%
2              25%    36%    39%    21%
3              54%    40%    43%    6%
4              17%    10%    9%     0%
Avg. quality   –      –      –      –
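The slide's own "Avg. quality" values did not survive in the transcript. If that row is the expected value of the 1-4 score under each distribution (an assumption), it can be recomputed from the percentages above:

```python
dist = {  # score -> share per system (MT1, MT2, MT3, MT4)
    1: (0.04, 0.09, 0.10, 0.73),
    2: (0.25, 0.36, 0.39, 0.21),
    3: (0.54, 0.40, 0.43, 0.06),
    4: (0.17, 0.10, 0.09, 0.00),
}
avgs = [sum(score * p[i] for score, p in dist.items()) for i in range(4)]
print([round(a, 2) for a in avgs])   # [2.84, 2.41, 2.53, 1.33]
```

The percentages for MT2 and MT3 do not sum exactly to 100% (presumably rounding on the slide), so these averages are approximate.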

20
 RMSE:

Languages   MT System   All features   No ling. features
en-es       MT1         –              –
en-es       MT2         –              –
en-es       MT3         –              –
en-es       MT4         –              –

Deviation of 17-22%

21
MT:  The student still has claimed to take the exam at the end of the year - although she has not chosen course.
SRC: A estudante ainda tem pretensão de prestar vestibular no fim do ano – embora não tenha escolhido o curso
REF: The student still has the intention to take the exam at the end of the year – although she has not chosen the course.

22
 Arabic-English newswire data (GALE)
 ◦ 2 SMT systems (Rosetta team)
 2 sets of 2,585 {source, translation, score} triples
 Quality score: 1-4 adequacy
 Features: 82 shallow versus 122 shallow + linguistic

23
 Distribution of adequacy scores:

Score          MT1    MT2
1              2%     2.3%
2              20%    23%
3              45%    46%
4              33%    28.7%
Avg. quality   –      –

24
 RMSE:

Languages   MT System   All features   No ling. features
ar-en       MT1         –              –
ar-en       MT2         –              –

Deviation of 14-26%

25
 Best performing:
 ◦ Length (words, content words, etc.)
  Absolute numbers are better than proportions
 ◦ Language model / corpus frequency
 ◦ Ambiguity of source words
 Shallow features are better than linguistic features
 ◦ Except for one adequacy estimation system
 Source/target features are better than contrastive features (shallow and linguistic)
 ◦ Absolute numbers are better than proportions

26
 Issues:
 ◦ Feature representation
 ◦ Sparsity
 ◦ Need for deeper features for adequacy estimation
 ◦ Annotation:
  1-4 post-editing effort: could be more objective
  1-4 adequacy: can we isolate adequacy from fluency?
 ◦ Language dependency
 ◦ Reliability of resources
  Low-quality translations
 ◦ Availability of resources

27
 General vs specific errors
 Bottom-up approach: word-based CE
 ◦ (Xiong et al. 2010)
  Word posterior probability, dangling words (Link Grammar parser), target word & POS patterns
 ◦ (Bach et al. 2011)
  Dependency relations, word and POS patterns, e.g. relating target words to patterns of POS tags in the source

28
◦ (Bach et al. 2011): best features are source-based

29
 Top-down approach (ongoing work)
 ◦ Corpus-based analysis: generalize errors into categories
 ◦ Portuguese-English
 ◦ 150 sentences (2 domains, 2 MT systems)
 ◦ RBMT: more systematic errors

Linguistic indicators          Europarl MT1   News MT1   Europarl MT2   News MT2
Inflectional error             –              –          –              –
Incorrect voice                –              –          –              –
Mistranslated pronoun          –              –          –              –
Missing pronoun                –              –          –              –
Incorrect subject-verb order   –              –          –              –

~700 errors / 150 sentences; 42 error categories: a few rules per category…

30
 It is possible to estimate the quality of MT systems wrt post-editing needs using shallow, language- and system-independent features
 Adequacy estimation is a harder problem
 ◦ Need more complex linguistic features…
 Linguistic features are relevant:
 ◦ Directly useful for error detection (word-level CE)
 ◦ Directly useful for automatic post-editing
 ◦ But… for sentence-level CE:
  Issues with sparsity
  Issues with representation: length bias

31 Lucia Specia

32
 Aziz, W., Rios, M. and Specia, L. (2011). Shallow Semantic Trees for SMT. WMT.
 Denkowski, M. and Lavie, A. (2011). Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. WMT.
 Giménez, J. and Màrquez, L. (2010). Linguistic Measures for Automatic Machine Translation Evaluation. Machine Translation, 24(3-4).
 Hardmeier, C. (2011). Improving Machine Translation Quality Prediction with Syntactic Tree Kernels. EAMT.
 Liu, D. and Gildea, D. (2010). Semantic role features for machine translation. 23rd International Conference on Computational Linguistics (COLING).
 Pado, S., Galley, M., Jurafsky, D. and Manning, C. (2009). Robust Machine Translation Evaluation with Entailment Features. ACL.

33
 Pighin, D. and Màrquez, L. (2011). Automatic Projection of Semantic Structures: an Application to Pairwise Translation Ranking. SSST-5.
 Tatsumi, M. and Roturier, J. (2010). Source Text Characteristics and Technical and Temporal Post-Editing Effort: What is Their Relationship? 2nd JEC Workshop.
 Wu, D. and Fung, P. (2009). Semantic roles for SMT: a hybrid two-pass model. HLT/NAACL.
 Xiong, D., Zhang, M. and Li, H. (2010). Error Detection for SMT Using Linguistic Features. ACL.

34
 Best features (Pearson's correlation) (S3 en-es): (table not shown)

35
 Filtering out bad translations: 1-2 (S3 en-es)
 ◦ Average human scores in the top n translations: (chart not shown)

36
 QE x MT metrics: Pearson's correlation (S3 en-es)

37
◦ QE score x MT metrics: Pearson's correlation across MT systems:

Test set   Training set   Pearson QE and human
S3 en-es   S1 en-es       0.478
S3 en-es   S2 en-es       0.517
S3 en-es   S3 en-es       0.542
S3 en-es   S4 en-es       0.423
S2 en-es   S1 en-es       0.531
S2 en-es   S2 en-es       0.562
S2 en-es   S3 en-es       0.547
S2 en-es   S4 en-es       –

38
 SMT model global score and internal features
 ◦ Distortion count, phrase probability, …
 ◦ % of search nodes aborted, pruned, recombined, …
 Language model using the n-best list as corpus
 Distance to the centre hypothesis in the n-best list
 Relative frequency of the words in the translation in the n-best list
 Ratio of the SMT model score of the top translation to the sum of the scores of all hypotheses in the n-best list, …
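A sketch of the last two n-best-list features above; the input format (a list of (tokens, model score) pairs) and the assumption that scores are positive (e.g. probabilities rather than log-probabilities) are illustrative.

```python
from collections import Counter

def nbest_features(nbest):
    """nbest: list of (hypothesis_tokens, model_score), best hypothesis first."""
    top_tokens, top_score = nbest[0]
    # Ratio of the top model score to the sum over all hypotheses
    score_ratio = top_score / sum(score for _, score in nbest)
    # Average relative frequency in the n-best list of the top translation's words
    counts = Counter(tok for toks, _ in nbest for tok in toks)
    total = sum(counts.values())
    rel_freqs = [counts[t] / total for t in top_tokens]
    avg_rel_freq = sum(rel_freqs) / max(len(rel_freqs), 1)
    return {"score_ratio": score_ratio, "avg_rel_freq": avg_rel_freq}

nbest = [(["the", "cat", "sat"], 0.5),
         (["a", "cat", "sat"], 0.3),
         (["the", "cat", "sits"], 0.2)]
print(nbest_features(nbest))   # score_ratio = 0.5
```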

39
 Best performing:
 ◦ Length (words, content words, etc.)
  Absolute numbers are better than proportions
 ◦ Language model / corpus frequency
 ◦ Ambiguity of source words
 Shallow features are better than linguistic features
 ◦ Except for one adequacy estimation system
 Source/target features are better than contrastive features (shallow and linguistic)
 ◦ Absolute numbers are better than proportions

Languages   MT System   All features   No ling. features   All features abs.
en-es       MT1         –              –                   –
en-es       MT2         –              –                   –
en-es       MT3         –              –                   –
en-es       MT4         –              –                   –

