Czech-English Word Alignment Ondřej Bojar Magdalena Prokopová

Ondřej Bojar (obo@cuni.cz), Magdalena Prokopová (magda.prokopova@gmail.com)
Institute of Formal and Applied Linguistics, ÚFAL MFF, Charles University in Prague

Full text, acknowledgement and the list of references in the proceedings of LREC 2006.

Motivation

Steps in statistical machine translation:

Sentence-parallel corpus -> Automatic word alignment -> Word-to-word alignments -> Phrase extraction -> Phrase table (~ Translation dictionary of multi-word expressions)

Motivation to manually annotate word alignment:
- to create evaluation data for automatic alignment methods
- to learn more about inter-annotator agreement and the limits of the task

Manual Annotation

Two annotators independently annotated 515 sentences using 3 main connection types:
- the word has no counterpart (null)
- the words can be possibly linked (possible)
- the words are translations of each other (sure)

Additionally, some segments could have been marked as phrasal translations: whole phrases correspond, but not the individual words (phrasal).

Types of connections used to compare annotations:

                               Possible, Sure, Phrasal   Connection of Any Type
Annotator   A1                          15,476                   15,399
            A2                          16,631                   16,246
Mismatch    A1 but not A2                2,343                    1,146
            A2 but not A1                3,498                    1,714
Relative mismatch                       18.2 %                    9.0 %

Mismatch rate relatively high, but it reduces to a half if the differences in connection type are disregarded.

Most Frequent Problematic Cases
- Verbs and their belongings, including the negative particle.
- English articles in cases where the rule “connect to the Czech governing noun” cannot be clearly applied.
- Punctuation: commas are used more frequently in Czech, the dollar symbol ($) is almost always translated and thus rarely repeated in Czech.

Top Ten Problematic Words and POSes

    Problematic Words            Problematic Parts of Speech
    English       Czech          English       Czech
    361  to       319  ,         679  IN       1348  N
    259  the      271  se        519  DT       1283  V
    159  of       146  v         510  NN        661  R
    143  a        112  na        386  PRP       505  P
    124  ,         74  o         361  TO        448  Z
    107  be        61  že        327  VB        398  A
     99  it        55  .         310  JJ        280  D
     95  that      47  a         245  RB        192  J
     84  in        41  bude      216  NNP        59  C
     80  by        37  k         199  VBN        22  T
     …             …             …               …

English Penn Treebank Tag-Set: IN - Preposition or subordinating conjunction, DT - Determiner, NN - Noun, common, singular or mass, PRP - Pronoun, personal, TO - to, VB - Verb, base form, JJ - Adjective, NNP - Noun, proper, singular, VBN - Verb, past participle.
Czech Tag-Set: N - Noun, V - Verb, R - Preposition, P - Pronoun, Z - Punctuation, sentence border, A - Adjective, D - Adverb, J - Conjunction, C - Number, T - Particle.

Automatic Word Alignment

GIZA++ (Och and Ney, 2003) automatically creates asymmetric alignments (1 source word connected to n target words).
- Used twice (Cs->En and En->Cs).
- The two guessed alignments can be merged using union, intersection or possibly other techniques.

The test set for GIZA++ was created by merging the two human annotations:
- both annotators mark a sure connection -> required connection
- one of the annotators chooses sure connection and the other any other connection type -> required connection
- at least one of the annotators chooses any connection type -> allowed connection
- otherwise -> connection not allowed

Evaluation metrics: Precision penalizes superfluous connections (connections generated automatically but not even allowed), recall penalizes forgotten required connections. Alignment-error rate (AER) is a combination of precision and recall.

Preprocessing of the input text such as lemmatization significantly reduces data sparseness (see the table Details about the PCEDT below) and helps to achieve better alignments:

    Baseline (raw input text)                 Zisk     se    vyšvihl     na    117  milionů  dolarů
    Lemmas                                    zisk     se-1  vyšvihnout  na-1  117  milion   dolar
    Lemmas + Numbers                          zisk     se-1  vyšvihnout  na-1  NUM  milion   dolar
    Lemmas + Singletons Backed off with POS   zisk     se-1  VERB        na-1  117  milion   dolar
    Gloss                                     Revenue  refl  soared      to    117  million  dollar

Results of automatic word alignment:

                                              Intersection (1-1)       Union (n-n)
                                              Prec   Rec    AER      Prec   Rec    AER
    Baseline                                  97.4   57.6   27.4     65.9   86.7   25.5
    Lemmas                                    97.9   75.0   15.0     77.1   89.8   17.2
    Lemmas + Numbers                          97.9   75.2   14.8     77.5   89.9   17.0
    Lemmas + Singletons backed off with POS   97.4   75.8   14.6     77.8   88.5   17.4

Where GIZA++ Fails, Humans Were Often in Trouble, Too

The following table displays the percentage of tokens where there was a match (OK) or mismatch (Problems) in the respective languages:
- Humans: two human annotations compared against each other.
- GIZA++: GIZA++ compared against golden alignments (i.e. merged human annotations).

                              Baseline         Improved
    Humans      GIZA++        en      cs       en      cs
    Problems    Problems      14.3    15.5     14.3    15.5
    Problems    OK             0.1              0.2     0.1
    OK          Problems      38.6    35.7     25.2    25.0
    OK          OK            46.9    48.7     60.4    59.4

- Out of all the positions where GIZA++ failed, 38% were problematic for humans.
- The improvement thanks to lemmatization is not observed on words that are difficult for humans anyway.

Details about the Prague Czech-English Dependency Treebank

Source: Wall Street Journal section of the Penn Treebank. Translated sentence-by-sentence to Czech.

                                                Czech      English
    Sentences                                        21,141
    Running Words                               475,719    494,349
    Running Words without Punctuation           404,523    439,304
    Baseline              Vocabulary             57,085     30,770
                          Singletons             31,458     14,637
    Lemmas                Vocabulary             28,007     25,000
                          Singletons             13,009     11,873
    Lemmas + Singletons   Vocabulary             15,041     13,150
                          Singletons                  122
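The evaluation setup described above (merging two directional alignments, merging two human annotations into a golden standard, and scoring) can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the authors' code: links are represented as (English index, Czech index) pairs, a link missing from an annotator's mapping is treated as "not marked", required links are also counted as allowed, and AER is computed with the usual Och and Ney definition, 1 - (|A∩S| + |A∩P|) / (|A| + |S|), which the poster only summarizes as "a combination of precision and recall". All function names and the toy data are hypothetical.

```python
from typing import Dict, Set, Tuple

Link = Tuple[int, int]  # (English token index, Czech token index)


def symmetrize(en_cs: Set[Link], cs_en: Set[Link], method: str) -> Set[Link]:
    """Merge two directional alignments by intersection (1-1) or union (n-n)."""
    if method == "intersection":
        return en_cs & cs_en
    if method == "union":
        return en_cs | cs_en
    raise ValueError(f"unknown method: {method}")


def merge_annotations(
    a1: Dict[Link, str], a2: Dict[Link, str]
) -> Tuple[Set[Link], Set[Link]]:
    """Merge two human annotations into (required, allowed) link sets.

    Each annotation maps a link to "sure", "possible" or "phrasal".
    Assumption: a link absent from a mapping was not marked by that annotator.
    """
    required: Set[Link] = set()
    allowed: Set[Link] = set()
    for link in set(a1) | set(a2):
        t1, t2 = a1.get(link), a2.get(link)
        # Required: both marked the link and at least one of them chose "sure".
        if t1 is not None and t2 is not None and "sure" in (t1, t2):
            required.add(link)
        # Allowed: at least one annotator marked the link with any type.
        # Assumption: required links are kept in the allowed set as well.
        allowed.add(link)
    return required, allowed


def evaluate(
    predicted: Set[Link], required: Set[Link], allowed: Set[Link]
) -> Tuple[float, float, float]:
    """Return (precision, recall, AER) of predicted links against the golden sets."""
    if not predicted or not required:
        raise ValueError("predicted and required must be non-empty")
    hit_allowed = len(predicted & allowed)
    hit_required = len(predicted & required)
    precision = hit_allowed / len(predicted)  # penalizes links that are not allowed
    recall = hit_required / len(required)     # penalizes missed required links
    aer = 1.0 - (hit_required + hit_allowed) / (len(predicted) + len(required))
    return precision, recall, aer


# Toy data (hypothetical): two directional outputs and two annotations.
en_cs = {(0, 0), (0, 1), (1, 2)}
cs_en = {(0, 0), (1, 2), (2, 2)}
a1 = {(0, 0): "sure", (1, 2): "sure", (0, 1): "possible"}
a2 = {(0, 0): "sure", (1, 2): "possible", (2, 3): "phrasal"}

required, allowed = merge_annotations(a1, a2)
for method in ("intersection", "union"):
    p, r, aer = evaluate(symmetrize(en_cs, cs_en, method), required, allowed)
    print(f"{method}: precision={p:.3f} recall={r:.3f} AER={aer:.3f}")
```

On this toy input the intersection keeps only links proposed in both directions, so it tends toward high precision, while the union keeps every proposed link and tends toward high recall, the same trade-off visible in the results table.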
