
Anastasiou 1 Idioms in EBMT Idiom Processing within the EBMT System METIS-II Dimitra Anastasiou Institut für Angewandte Informationsforschung.


1 Idioms in EBMT: Idiom Processing within the EBMT System METIS-II
Dimitra Anastasiou (dimitra@d-anastasiou.com)
Institut für Angewandte Informationsforschung (IAI), Saarland University, Germany
School of Computing, Dublin City University, Dublin, 15th October 2008

2 Aim-Methods
Aim: Enhancement of the translation quality of idiomatic expressions (idiomatic VPs in particular) within the German-to-English EBMT system METIS-II.
Resources:
- Bilingual idiom dictionary
- Monolingual corpus
- Syntactic rules according to the German topological field model

3 Outlook
- EBMT: statistical or rule-based MT?
- Interpretation of idioms
- Topological field model
- Treatment of idioms by MT
- METIS-II idiom resources
- Translation process of METIS-II
- Evaluation of METIS-II

4 EBMT: Statistical or Rule-Based MT?
Two tendencies of EBMT:
1) Combinations of EBMT with rule-based MT (RBMT) as hybrid systems [Sumita et al., 1990];
2) Pure EBMT systems [Sato & Nagao, 1990].
EBMT lies between RBMT and statistical MT (SMT) [Carl & Way, 2005].
Reason: The transfer between SL and TL is always guided by translation examples, even if the replacement and/or modification of the sub-sequences is completely rule- or data-based.

5 Outlook
- EBMT: statistical or rule-based MT?
- Interpretation of idioms
  - Identification by MT
  - Semantics
  - Syntax
  - Grammatical and lexical variants
- Topological field model
- Treatment of idioms by MT
- METIS-II idiom resources
- Translation process of METIS-II
- Evaluation of METIS-II

6 Interpretation of Idioms
Diverse terms (and, accordingly, definitions): idiom, semi-idiom, (cranberry) collocation, idiomatic/figurative/fixed/periphrastic phrase/expression, phraseologism, (dead) metaphor, etc.
Irregularity of idioms depends on:
- Fixedness of constituents [Moon, 1998; Trawinski et al., 2008];
- Degree of compositionality;
- Syntactic opaqueness: kick the bucket – die [Jackendoff, 1997; Gazdar et al., 1985];
- Poetic marking of the form, e.g. klipp und klar (clear as daylight), mit Rat und Tat (help and advice).

7 Idiom Identification by MT
jmdn. mit Argusaugen beobachten
sb.-with-Argus eyes-observe
watch sb. like a hawk
Er beobachtete den Mann, der die Bank betrat, mit Argusaugen.
He was watching the man, who entered the bank, like a hawk.
To identify the idiom, an MT system has to recognize:
- The contiguous parts of the idiom (mit Argusaugen);
- The discontinuous parts of the idiom (beobachten) in any of its inflected forms;
- The syntactic requirements of the idiom;
- The clause boundaries (an idiom usually lies within one clause).
More information can be found in Volk (1998).

8 Semantics (Degree of Compositionality)
1) Non-compositional: cranberry/unique constituents, e.g. as happy as a sandboy, on tenterhooks. A recent study on cranberry expressions in English and German is that of Trawinski et al. (2008);
2) Partially compositional: light-verb (support-verb) constructions (SVCs), e.g. außer Betrieb gehen – go out of service, außer Betrieb sein – be out of order. A recent study on German PP-verb SVCs is that of Krenn (2008);
3) Strictly compositional: collocations, e.g. Maßnahmen ergreifen – take measures.

9 Syntax
- Syntactic categories of idioms
- Realization of idioms with respect to syntactic gaps:
  - Continuous (without gaps)
  - Discontinuous (with gaps)

10 Syntactic Categories
- Noun phrase (NP): pink slip
- Prepositional phrase (PP): by hook or crook
- Combination NP-PP: danger for life and health
- Adjective: prim and proper
- Verbal phrase (iVP):
  - NP-Verb: kick the bucket
  - PP-Verb: fall on deaf ears
  - NP-PP-Verb: throw out the baby with the bath water
- Proverb: less is sometimes more
- Saying: gimme a break

11 Grammatical Variants (1)
- Number: pull up stakes / *pull up stake. Exception: keep tabs on sb/sth / keep a tab on sb/sth
- Case: auf die Strasse gehen / *auf der Strasse gehen (take to the streets)
- Determiner: play a role / *play the role
- Possessive: in Verbindung treten / *in Pos.Pron. Verbindung treten (contact); Pos.Pron. Ohr leihen / *das Ohr leihen (listen)

12 Grammatical Variants (2)
Negation:
- eine Rolle spielen (play a role)
- keine Rolle spielen (no-role-play)
- *nicht eine Rolle spielen (not-a-role-play)
- auf keinen grünen Zweig kommen (never get anywhere)
- nicht/nie auf einen grünen Zweig kommen
Passivization: The more syntactically opaque an idiom is, the less likely it is to undergo passivization.
- opaque: [kick] [the bucket] – die; *The bucket was kicked by him (only literal meaning)
- transparent: [spill] [the beans] – [tell] [a secret]; The beans were spilled by him

13 Lexical Variants
Substitution:
- kick the bucket / *kick the pail
- hit the sack / *hit the hay
Modifiers:
- Adjective: keep tabs on / keep close tabs on
- Adverb: noch grün hinter den Ohren sein / noch absolut grün hinter den Ohren sein (be half-baked)

14 Outlook
- EBMT: statistical or rule-based MT?
- Interpretation of idioms
- Topological field model
  - Realization of idioms
  - Discontinuous patterns
- Treatment of idioms by MT
- METIS-II idiom resources
- Translation process of METIS-II
- Evaluation of METIS-II

15 Topological Field Model for German
German clauses are divided into five fields; each field can be occupied by a certain number and kind of constituents [Drach, 1963; DUDEN, 1998; Dürscheid, 2000]:
- pre-field (PF): only one constituent;
- left bracket (LB): finite verb (modal/auxiliary);
- middle field (MF): many constituents, in free order;
- right bracket (RB): non-finite verb (infinitive/participle form);
- post-field (PF): subclause(s).

16 Realization of Idioms
Continuous form: ( iNP_MF | iPP_MF | [iNP_MF iPP_MF] ) iV_RB
Er will nicht bei den Argumenten ständig den Bock (iNP_MF) zum Gärtner (iPP_MF) machen (iV_RB)!
He-wants-not-during-the-arguments-always-the-billy goat-to-the-gardener-make!
He does not always want to set the fox to keep the geese during the argumentation!
Discontinuous form: iV_LB (Adverb)*_MF ( iNP_MF | iPP_MF | [iNP_MF iPP_MF] )
Er macht (iV_LB) oft (Adverb) den Bock (iNP_MF) zum Gärtner (iPP_MF).
He-makes-often-the-billy goat-to-the-gardener.
He often sets the fox to keep the geese.

17 Discontinuous Patterns
den Bock zum Gärtner machen (set the fox to keep the geese)
- Er macht (iV_LB) oft (Adverb) den Bock (iNP_MF) zum Gärtner (iPP_MF).
- Er hat den Bock (iNP_MF) zum Gärtner (iPP_MF) oft gemacht (iV_RB).
- ?Den Bock (iNP_PF) zum Gärtner (iPP_PF) hat er oft gemacht (iV_RB).
- ?Den Bock (iNP_PF) hat er oft zum Gärtner (iPP_MF) gemacht (iV_RB).

18 Outlook
- EBMT: statistical or rule-based MT?
- Interpretation of idioms
- Topological field model
- Treatment of idioms by MT
  - Idioms suitable for EBMT
- METIS-II idiom resources
- Translation process of METIS-II
- Evaluation of METIS-II

19 Treatment of Idioms by MT
- Bar-Hillel (1952): The only way for a machine to treat idioms is - not to have idioms!
- The Power Translator Pro user manual (2000) warns the user to avoid inputting sentences containing idioms.
- Power Translator Pro, SYSTRAN, and T1 Langenscheidt cannot identify discontinuous idioms.

20 Idioms Suitable for EBMT
Idiomatic expressions are not suitable for rule-based MT (RBMT), but are suitable for EBMT. The translation of an idiomatic expression can only be used to translate the same idiomatic expression; it cannot be used to translate a similar expression (Sumita et al., 1990: 210).
By contrast, Nomiyama (1992) emphasizes the disadvantage of EBMT systems that use only thesauri to define a general semantic distance: this results in over-generalization, which is a major problem in translating idiomatic expressions.
Related work: Santos (1990), Wehrli (1998), Ryu et al. (1999), and Gangadharaiah & Balakrishnan (2006).

21 Outlook
- EBMT: statistical or rule-based MT?
- Interpretation of idioms
- Topological field model
- Treatment of idioms by MT
- METIS-II idiom resources
  - Idiom lexicon
  - German corpus (annotation, statistical analysis)
  - Syntactic rules
- Translation process of METIS-II
- Evaluation of METIS-II

22 Idiom Resources
- Bilingual idiom dictionary of 871 entries
- Monolingual German corpus of 486 sentences
- Syntactic rules according to the German topological field model

23 METIS-II Project
- Hybrid MT system (EBMT, RBMT, SMT);
- Time span: 2004-2007;
- SLs: Dutch, German, Greek, Spanish;
- TL: British English;
- Based on pattern matching;
- Sources:
  - Huge monolingual TL corpus (BNC);
  - Bilingual dictionaries;
  - Tokenizer;
  - PoS tagger, chunker, lemmatizer;
  - Manually constructed matching rules.

24 Idiom Dictionary
871 entries
Entry example:
{de=den_Bock_zum_Gärtner_machen, mde={c=verb}, en=set_the_fox_to_keep_the_geese, men={c=verb}}.
- 826 entries with equal PoS in both languages: 598 verbs/VPs, 163 interjections, 37 NPs, 28 PPs
- 45 entries with different PoS (verb/VP-interjection)
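The entry format shown on the slide can be read into a nested mapping. The following is only a minimal sketch, not the METIS-II loader: the function names are hypothetical, and it assumes that `de`/`en` hold the German and English forms, that `mde`/`men` hold their attributes (with `c` presumably the category), and that underscores stand for spaces.

```python
import re

ENTRY = ("{de=den_Bock_zum_Gärtner_machen, mde={c=verb}, "
         "en=set_the_fox_to_keep_the_geese, men={c=verb}}.")

# A key, then either a braced sub-record or a plain value up to the next comma.
PAIR = re.compile(r"(\w+)=(\{[^{}]*\}|[^,{}]+)")


def _parse_body(body):
    """Parse 'k=v, k={...}' pairs into a dict (one nesting level, as in the slide)."""
    result = {}
    for key, value in PAIR.findall(body):
        value = value.strip()
        if value.startswith("{"):
            result[key] = _parse_body(value[1:-1])
        else:
            # Assumption: underscores encode spaces between words.
            result[key] = value.replace("_", " ")
    return result


def parse_entry(text):
    """Parse one dictionary entry of the form '{...}.' into a dict."""
    body = text.strip().rstrip(".").strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("entry must be enclosed in braces")
    return _parse_body(body[1:-1])


entry = parse_entry(ENTRY)
print(entry["de"], "->", entry["en"])
```

Splitting the German side on spaces then gives the constituent words that a matcher would look for in the input sentence.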

25 Idiom Corpus
Three corpus resources:
- Europarl (EP): 80 MWEs, 63 cont. (79%), 17 disc. (21%)
- Mixture of data sets (MDS), i.e. manually constructed examples (IAI) and real examples (Internet): 275 MWEs, 205 cont. (75%), 70 disc. (25%)
- DWDS (Digital lexicon of the German language in the 20th century): 131 MWEs, 91 cont. (69%), 40 disc. (31%)

26 Annotation of Idioms in the German Corpus
Continuous form:
Er will nicht bei den Argumenten ständig den Bock zum Gärtner machen.
He does not always want to set the fox to keep the geese during the argumentation.
Discontinuous form:
Er macht oft den Bock zum Gärtner.
He often sets the fox to keep the geese.

27 Statistical Analysis of iVPs: Syntactic Patterns
Continuous form patterns      EP corpus   MDS corpus   DWDS corpus
NP-V                          8           65           15
PP-V                          29          106          60
NP-PP-V                       4           2            16
Discontinuous form patterns   EP corpus   MDS corpus   DWDS corpus
V-NP                          1           8            13
V-PP                          16          25           18
V-NP-PP                       -           22           9

28 Syntactic Rule for Continuous Idioms
Er will nicht bei den Argumenten ständig den Bock zum Gärtner machen!
En Bloc Pattern =
A: match=yes, last idioms word=no, [den Bock, zum Gärtner]
B: match=yes, last idioms word=yes, [machen]
C: mark_as_continuous_iVP.
where
A: idiom constituents from the first up to the one before the last
B: last idiom constituent
C: command to identify/match as continuous
No alien element between A and B!
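The en bloc rule amounts to finding the idiom constituents as one gap-free run in the sentence. A minimal sketch follows; it assumes the sentence has already been chunked and lemmatized into a list of strings, and the function name and chunk representation are illustrative, not taken from METIS-II.

```python
from typing import List, Optional, Tuple


def find_continuous_idiom(chunks: List[str],
                          idiom: List[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first gap-free occurrence of `idiom`, or None."""
    n = len(idiom)
    if n == 0:
        return None
    for start in range(len(chunks) - n + 1):
        # A: every constituent before the last one matches;
        # B: the last constituent matches; no alien element in between.
        if chunks[start:start + n] == idiom:
            return start, start + n  # C: mark this span as a continuous iVP
    return None


# Hand-lemmatized chunks for the slide's example sentence (an assumption).
sentence = ["er", "wollen", "nicht", "bei den Argumenten", "ständig",
            "den Bock", "zum Gärtner", "machen"]
idiom = ["den Bock", "zum Gärtner", "machen"]
print(find_continuous_idiom(sentence, idiom))
```

A sentence in which an adverb or any other element intervenes between the constituents yields `None` here and would be left to the discontinuous rules.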

29 Syntactic Rule for Discontinuous Idioms
Er macht (iV_LB) oft (Adverb) den Bock (iNP_MF) zum Gärtner (iPP_MF).
Discontinuous Pattern_LBMF =
A: match=yes, field=LB, c=verb, [macht]
B: [match=no, field=MF]*, [oft]
C: match=yes, field=MF, [den Bock, zum Gärtner]
D: mark_as_discontinuous_iVP.
where
A: idiom verb in the left bracket
B: arbitrarily many elements
C: matched idiom constituents
D: command to identify/match as discontinuous
Alien element(s) between A and C!
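The LB-MF pattern can be sketched as a scan over field-labelled chunks: find the idiom verb in the left bracket, skip non-matching middle-field elements, then require the remaining constituents in the middle field. This is an illustrative sketch only; the `Chunk` type, the field labels as strings, and the hand-assigned lemmas are assumptions, not the METIS-II rule engine.

```python
from typing import Dict, List, NamedTuple, Optional


class Chunk(NamedTuple):
    lemma: str
    field: str  # topological field: "PF", "LB", "MF" or "RB"


def match_lbmf(chunks: List[Chunk], verb: str,
               rest: List[str]) -> Optional[Dict[str, object]]:
    """Match idiom verb in LB, then alien MF elements, then `rest` in MF."""
    if not rest:
        return None
    for i, chunk in enumerate(chunks):
        # A: the idiom verb must sit in the left bracket.
        if chunk.field != "LB" or chunk.lemma != verb:
            continue
        # B: skip arbitrarily many non-matching middle-field elements.
        j = i + 1
        while (j < len(chunks) and chunks[j].field == "MF"
               and chunks[j].lemma != rest[0]):
            j += 1
        # C: the remaining idiom constituents, all in the middle field.
        window = chunks[j:j + len(rest)]
        if ([c.lemma for c in window] == rest
                and all(c.field == "MF" for c in window)):
            # D: report the positions of the discontinuous iVP.
            return {"verb": i,
                    "alien": list(range(i + 1, j)),
                    "idiom": list(range(j, j + len(rest)))}
    return None


clause = [Chunk("er", "PF"), Chunk("machen", "LB"), Chunk("oft", "MF"),
          Chunk("den Bock", "MF"), Chunk("zum Gärtner", "MF")]
print(match_lbmf(clause, "machen", ["den Bock", "zum Gärtner"]))
```

Other discontinuous orders, such as those with constituents in the pre-field, would each need a rule of their own under this scheme.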

30 Outlook
- History of EBMT
- Interpretation of idioms
- Topological field model
- Treatment of idioms by MT
- METIS-II idiom resources
- Translation process of METIS-II
  - METIS-II Idiom Matching Process
- Evaluation of METIS-II

31 METIS-II Translation Process
1) SL analysis (tokenization, PoS-tagging, lemmatization, and chunking or shallow parsing);
2) SL-to-TL matching, using:
   i) The bilingual idiom dictionary;
   ii) The syntactic matching rules.
3) TL generation (the main TL resource, BNC, is used as a data set of examples). The token generator is described in Carl & Schütz (2005).

32 METIS-II Idiom Matching Process
Users:
- Store an idiom in the bilingual dictionary;
- Load the syntactic matching rules;
- Enter an input sentence/corpus.
System:
- The system reads the sentence word by word;
- If the idiom is continuous and in the same form as stored in the dictionary, it is translated correctly straight away;
- If the idiom is discontinuous, the system reads the syntactic matching rules (rule by rule) until it finds the appropriate one, which is then applied.
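The control flow described above (dictionary form first, then the rules one by one) can be sketched as follows. Everything here is hypothetical scaffolding: the function names, the single `verb_first` rule, and the simplified word-level lemmas are assumptions for illustration, and the field-based conditions of the actual rules are left out.

```python
from typing import Callable, Dict, List, Optional, Tuple

Rule = Callable[[List[str], List[str]], bool]


def contains_run(seq: List[str], run: List[str]) -> bool:
    """True if `run` occurs in `seq` without gaps."""
    return any(seq[k:k + len(run)] == run
               for k in range(len(seq) - len(run) + 1))


def verb_first_rule(lemmas: List[str], parts: List[str]) -> bool:
    """Idiom verb first, the other idiom words as a gap-free run later on."""
    verb, rest = parts[-1], parts[:-1]
    if verb not in lemmas:
        return False
    return contains_run(lemmas[lemmas.index(verb) + 1:], rest)


def match_idiom(lemmas: List[str], dictionary: Dict[str, str],
                rules: List[Tuple[str, Rule]]) -> Optional[Tuple[str, str]]:
    """Return (translation, how it was matched) for the first idiom found."""
    for de_idiom, en_idiom in dictionary.items():
        parts = de_idiom.split()
        # Continuous idiom in its dictionary form: translate directly.
        if contains_run(lemmas, parts):
            return en_idiom, "continuous"
        # Otherwise try the matching rules one by one.
        for name, rule in rules:
            if rule(lemmas, parts):
                return en_idiom, name
    return None


DICTIONARY = {"den Bock zum Gärtner machen": "set the fox to keep the geese"}
RULES = [("verb_first", verb_first_rule)]

# Simplified lemmas for "Er macht oft den Bock zum Gärtner." (an assumption).
print(match_idiom(["er", "machen", "oft", "den", "Bock", "zum", "Gärtner"],
                  DICTIONARY, RULES))
```

Trying the dictionary form before any rule mirrors the slide's ordering, in which only discontinuous idioms require the rule set.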

33 Outlook
- History of EBMT
- Interpretation of idioms
- Topological field model
- Treatment of idioms by MT
- METIS-II idiom resources
- Translation process of METIS-II
- Evaluation of METIS-II
  - For continuous idioms
  - For discontinuous idioms

34 Evaluation of METIS-II
- Hit: correct matching/correct translation
- Miss: no matching/reuse of German input
- Noise: false matching/literal translation
Measures: precision, recall, and f-score.
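The formulas for the three measures did not survive in this transcript. The sketch below uses the standard definitions in terms of hits, misses, and noise, with the balanced F1 as f-score; this is an assumption, and the evaluation in the talk may have been computed differently. The counts in the example are made up for illustration.

```python
from typing import Tuple


def evaluate(hits: int, misses: int, noise: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f-score) from hit/miss/noise counts."""
    # Precision: share of the system's matches that are correct.
    precision = hits / (hits + noise) if hits + noise else 0.0
    # Recall: share of the idioms in the input that were correctly matched.
    recall = hits / (hits + misses) if hits + misses else 0.0
    # F-score: harmonic mean of precision and recall (assumed balanced F1).
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from the METIS-II evaluation.
precision, recall, f_score = evaluate(hits=8, misses=2, noise=2)
print(f"precision={precision:.2f} recall={recall:.2f} f-score={f_score:.2f}")
```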

35 Evaluation Results for Continuous iVPs
Corpus                                                    Recall   Precision   f-score
Europarl Corpus                                           98.3%    96.8%
Manually constructed examples and examples from the Web   99%      96.2%       97.4%
DWDS                                                      98.9%    96.7%       97.4%
Only two values survive for the Europarl row; they are listed in source order.

36 Evaluation Results for Discontinuous iVPs
Corpus                                                    Recall   Precision   f-score
Europarl Corpus                                           88.2%    78.9%       83.2%
Manually constructed examples and examples from the Web   95.7%    84.8%       88.8%
DWDS                                                      92.5%    90.2%       90.6%

37 Conclusion
- Continuous idioms: more than 95% recall and precision.
- Discontinuous idioms: roughly 90% recall or more and roughly 80% precision or more.
The evaluation figures for continuous idioms are higher on all measures than those for discontinuous idioms. This is attributed to the fact that discontinuous idioms are more difficult to identify because their constituents are spread through the sentence.

38 Thank you for your attention!
Dimitra Anastasiou
www.d-anastasiou.com
dimitra@d-anastasiou.de

39 References (1)
Bar-Hillel, Y., (1952), The Treatment of Idioms by a Translating Machine, presented at the Conference on Mechanical Translation at Massachusetts Institute of Technology, June 1952.
Brown, R. D., (1999), Adding Linguistic Knowledge to a Lexical Example-based Translation System, in: 8th TMI 1999, Chester, England, 22-32.
Carl, M.; Schütz, J., (2005), A Reversible Lemmatizer/Token-generator for English, in: EBMT Workshop 2005, MT Summit X, Phuket, Thailand.
Drach, E., (1963), Grundgedanken der deutschen Satzlehre, Darmstadt: Wissenschaftliche Buchgesellschaft.
DUDEN Redaktion, (1998), Grammatik der deutschen Gegenwartssprache, Mannheim.
Dürscheid, C., (2000), Syntax: Grundlagen und Theorien, Wiesbaden.
Gangadharaiah, R.; Balakrishnan, N., (2006), Application of Linguistic Rules to Generalized Example Based Machine Translation for Indian Languages, in: Proceedings of the First National Symposium on Modeling and Shallow Parsing of Indian Languages (MSPIL), Mumbai, India.
Gazdar, G.; Klein, E.; Pullum, G.; Sag, I., (1985), Generalized Phrase Structure Grammar, Oxford: Basil Blackwell.
Jackendoff, R., (1997), The Architecture of the Language Faculty, Cambridge, Mass.: MIT Press.
Krenn, B., (2008), Description of evaluation resource – German PP-verb data, in: MWE Workshop 2008, at LREC Conference, 7-11.

40 References (2)
Moon, R., (1998), Fixed Expressions and Idioms in English: A Corpus-based Approach, Oxford, England: Clarendon Press.
Ryu, B. R.; Kim, Y. K.; Yuh, S. H.; Park, S. K., (1999), FromTo K/E: A Korean English Machine Translation system based on idiom recognition and fail softening, in: MT Summit VII, Singapore, 469-475.
Santos, D., (1990), Lexical gaps and idioms in Machine Translation, in: Karlgren, H. (Ed.), 13th COLING 1990, Helsinki, Finland, 330-335.
Sumita, E.; Iida, H.; Kohyama, H., (1990), Translating with Examples: A New Approach to Machine Translation, in: 3rd TMI 1990, Texas, USA, 203-212.
Trawinski, B.; Sailer, M.; Soehn, J. P.; Lemnitzer, L.; Richter, F., (2008), Cranberry Expressions in English and German, in: MWE Workshop 2008, at LREC Conference, 35-39.
Volk, M., (1998), The Automatic Translation of Idioms. Machine Translation vs. Translation Memory Systems, in: Weber, N. (Ed.), Machine Translation: Theory, Applications, and Evaluation. An assessment of the state of the art, St. Augustin: Gardez-Verlag.
Wehrli, E., (1998), Translating Idioms, in: 17th COLING 1998, Vol. 2, 1388-1392.

