
1 CS4025: Machine Translation
Computing Science, University of Aberdeen
• Background, how languages differ
• MT Techniques
• Controlled languages
For more info: J&M, chap. 21 in 1st ed., 25 in 2nd. Also extra notes.

2 Machine Translation
• Automatically translate texts between languages (e.g., English to Japanese)
  » Or assist human translators?
• One of the oldest dreams of NLP, AI, and CS (first system in 1954).

3 Varieties of Machine Translation
Translating from a source language to a target language.
• (FA)MT – (Fully Automatic) Machine Translation
• HAMT – Human-Aided MT (aid before or after)
• MAHT – Machine-Aided Human Translation

4 Brief History of MT
• Serious but naïve work in the 1950s
• 1966 ALPAC report (speed, cost, accuracy) terminated most research funding
• “Underground” MT systems developed into products (e.g. SYSTRAN) in the 1970s
• More MT products emerged in the 1980s and 1990s, though still relatively simple
• MT now in everyday widespread use (e.g. for web pages), in spite of its problems

5 Translation is Hard: Language differences
Lexical
• Meanings assigned to a word
  » to know a person
  » to know a fact
• Boundaries on a scale
  » friend vs acquaintance
• Preferences
  » sibling vs brother vs elder brother
• Gaps
  » Japanese has no word for privacy

6 Overlaps between word senses (Eng/Fr)

7 Syntactic differences
• Morphology vs word-order
  » English: John saw Jane
  » Russian: John[+subject] saw Jane[+object]
• Which word orders
  » English: a cheap car
  » French: a car cheap
• Argument order (e.g. VSO/SVO/SOV languages)
  » English: John likes apples
  » Spanish: apples gustar John

8 Pragmatic differences
• Zero pronouns
  » Bake [] for 20 minutes
• Extra distinctions
  » Relative-status markers in Japanese
• Cultural knowledge
  » mu -> curtains of her bed, not just curtains

9 Translating from Chinese to English…
• Source (pinyin): dai yu zi zai chuang shang gan nian bao chai you ting jian chuang wai zhu shao xiang ye zhe shang, yu sheng xi li, qing han tou mu, bu jue you di xia lei lai.
• Word-by-word gloss: Dai-yu alone on bed top think-of-with-gratitude Bao-chai again listen to window outside bamboo tip plantain leaf of on-top rain sound sigh drop clear cold penetrate curtain not feeling again fall down tears come
• Human translation: As she lay there alone, Dai-yu’s thoughts turned to Bao-chai… Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.

10 Perfect Translation needs World Knowledge
Example: Translating “it” into a language which associates grammatical gender with nouns requires identifying the antecedent:
  » A hollow cylinder … rests on a surface … and an object is suspended so that it …

English    German     Gender      Pronoun
Surface    Flaeche    Feminine    sie
Cylinder   Zylinder   Masculine   er
Object     Objekt     Neuter      es
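
The pronoun choice on this slide can be sketched as a lookup from antecedent to gender to pronoun. This is a minimal illustration only: the function name `translate_it` is hypothetical, and the antecedent is passed in by hand, because finding it is exactly the part that needs world knowledge.

```python
# Noun table from the slide: English noun -> (German noun, grammatical gender).
GERMAN_NOUNS = {
    "surface": ("Flaeche", "feminine"),
    "cylinder": ("Zylinder", "masculine"),
    "object": ("Objekt", "neuter"),
}

# German third-person singular pronoun for each gender (from the slide).
PRONOUN_BY_GENDER = {"feminine": "sie", "masculine": "er", "neuter": "es"}


def translate_it(antecedent: str) -> str:
    """Return the German pronoun for English 'it', given its antecedent.

    The antecedent must already be known; this sketch does not resolve it.
    """
    _german_noun, gender = GERMAN_NOUNS[antecedent]
    return PRONOUN_BY_GENDER[gender]


if __name__ == "__main__":
    for noun in ("surface", "cylinder", "object"):
        print(f"it = the {noun} -> {translate_it(noun)}")
```

The same English sentence therefore has three possible German renderings of “it”, and only the resolved antecedent decides between them.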

11 Approaches to MT

12 Direct Translation
No intermediate representation. Possibly morphological analysis and simple reordering principles.
• Input: [Japanese text]
• After word-by-word translation
  » I give PAST pen on desk John to
• After word-order, det rewrite rules
  » I give PAST the pen on the desk to John
• After morphology
  » I gave the pen on the desk to John
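
The last two stages of this example can be sketched in a few lines. This is a toy under stated assumptions: it starts from the word-by-word gloss, since the Japanese input is not given on the slide, and the word lists and rules (`COMMON_NOUNS`, `POSTPOSITIONS`, `IRREGULAR_PAST`) are hand-picked for this one sentence rather than taken from any real system.

```python
# Hand-picked toy resources for this single example (assumptions, not real data).
COMMON_NOUNS = {"pen", "desk"}        # nouns that get a determiner
POSTPOSITIONS = {"to"}                # markers that follow their noun in the gloss
IRREGULAR_PAST = {"give": "gave"}     # irregular past-tense forms


def reorder_and_add_determiners(tokens):
    """Word-order and determiner rewrite rules."""
    out = []
    for tok in tokens:
        if tok in POSTPOSITIONS and out:
            # Move the postposition in front of the word it followed.
            out.insert(len(out) - 1, tok)
        elif tok in COMMON_NOUNS:
            out.extend(["the", tok])
        else:
            out.append(tok)
    return out


def apply_morphology(tokens):
    """Fuse a verb with a following PAST marker into an inflected form."""
    out = []
    for tok in tokens:
        if tok == "PAST" and out:
            verb = out.pop()
            out.append(IRREGULAR_PAST.get(verb, verb + "ed"))
        else:
            out.append(tok)
    return out


if __name__ == "__main__":
    gloss = "I give PAST pen on desk John to".split()
    reordered = reorder_and_add_determiners(gloss)
    print(" ".join(reordered))
    print(" ".join(apply_morphology(reordered)))
```

Every rule here refers to specific words or word classes of one language pair, which is why a direct system has to be rebuilt for each new pair.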

13 Direct Translation - Issues
• Completely tied to a language pair
  » Complete new system for each pair
• Problems dealing with ambiguity: Example (Russian-English)
  » My trebuem mira
  » We require world (direct translation)
  » We want peace (correct translation)
• Don’t need complex NLP
  » used in cheap translators
• Useful as a “default translation” if more complex techniques fail

14 Structural Transfer
• Three steps
  » parse input text (reusable)
  » rewrite parse tree into parse tree of new language (specific to language pair)
    – English NP -> Det Adj N becomes
    – French NP -> Det N Adj
  » generate output text (reusable)
• More in next lecture
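
The middle step, rewriting the parse tree, can be sketched on the NP rule from this slide. This is a minimal illustration: the tree encoding (a `(label, children)` tuple, with a plain string for a leaf word) and the function names are assumptions, and the words are left untranslated so that only the reordering is visible.

```python
def transfer(tree):
    """Rewrite English 'NP -> Det Adj N' into French-style 'NP -> Det N Adj'."""
    label, children = tree
    if isinstance(children, str):
        # Leaf node: (part of speech, word). Nothing to reorder.
        return tree
    children = [transfer(child) for child in children]
    if label == "NP" and [child[0] for child in children] == ["Det", "Adj", "N"]:
        det, adj, noun = children
        children = [det, noun, adj]
    return (label, children)


def leaves(tree):
    """Read the words off a tree, left to right (a trivial 'generation' step)."""
    label, children = tree
    if isinstance(children, str):
        return [children]
    return [word for child in children for word in leaves(child)]


if __name__ == "__main__":
    english_np = ("NP", [("Det", "a"), ("Adj", "cheap"), ("N", "car")])
    print(" ".join(leaves(transfer(english_np))))
```

The output order matches the “a car cheap” gloss used earlier for French; a real system would also translate the words and would need a parser to produce the input tree.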

15 Structural Transfer - Issues
• Most popular approach (?)
  » Used in Systran (Altavista translator)
• n*(n-1) transfer components needed for translation between n languages
• Good for syntax, less good for words, pragmatics
  » supplement with other techniques, such as statistical translation of individual words?

16 Interlingua Approach
• Two steps
  » full analysis of input text, into a meaning (interlingua)
    – e.g., know into KnowFact or KnowPerson
  » full generation of output text, from meaning
• Can’t be done except in a small domain
• Preserving ambiguity
  » if target language uses same word for KnowFact or KnowPerson, no need to disambiguate know
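
The “preserving ambiguity” point can be sketched as a check on the target lexicon. This is an illustration under assumptions: the concept names come from the slide, the French entries are the usual savoir/connaitre pair, and the second lexicon is a made-up language that uses one word for both senses.

```python
# Interlingua concepts that an English word may map to (from the slide).
SENSES = {"know": {"KnowFact", "KnowPerson"}}

# Target lexicons: concept -> target word.
FRENCH = {"KnowFact": "savoir", "KnowPerson": "connaitre"}
# Made-up target language with a single word for both senses (hypothetical).
ONE_WORD_LANGUAGE = {"KnowFact": "kna", "KnowPerson": "kna"}


def needs_disambiguation(word, target_lexicon):
    """True if the word's senses are expressed by different target words."""
    target_words = {target_lexicon[sense] for sense in SENSES[word]}
    return len(target_words) > 1


if __name__ == "__main__":
    print(needs_disambiguation("know", FRENCH))
    print(needs_disambiguation("know", ONE_WORD_LANGUAGE))
```

When the check is false, the analysis step can leave the sense of the word unresolved without affecting the output.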

17 Interlingua Approach - Issues
• Interlingua must contain all aspects of meaning needed for all the languages (e.g. gender for Spanish cats)
• Interlingua must reflect all the different views on how the world is made up (e.g. Japanese “yasai” refers mostly to vegetables, but also mint but not carrots)
• For this to work, the domain must be restricted and the languages similar
• Translation between n languages only needs n analysis components and n generation components
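
The component counts can be compared directly: n*(n-1) transfer components for structural transfer against n analysis plus n generation components for an interlingua. A small sketch (the function names are illustrative only):

```python
def transfer_components(n: int) -> int:
    """One transfer component per ordered language pair."""
    return n * (n - 1)


def interlingua_components(n: int) -> int:
    """One analysis and one generation component per language."""
    return 2 * n


if __name__ == "__main__":
    for n in (2, 3, 5, 10):
        print(n, transfer_components(n), interlingua_components(n))
```

The two counts are equal at n = 3; for any larger n the interlingua needs fewer components, and the gap grows quadratically (90 against 20 for ten languages).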

18 Statistical Approach
• Noisy channel model for speech recognition: look for the sentence Sent that maximises P(Sig|Sent)*P(Sent)
• MT: look for the translation Sent that maximises P(Input|Sent)*P(Sent)
  » faithfulness*fluency??
  » P(Sent) - estimated using bigrams/trigrams
  » P(Input|Sent) - estimated by analysing a corpus of human-translated texts
    – e.g., how often is know translated as savoir (know fact) and how often as connaitre (know person)
    – Also model reordering, insertions, deletions
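
The maximisation of P(Input|Sent)*P(Sent) can be sketched on the earlier Russian example “My trebuem mira”. Everything numeric here is invented for illustration: the word-translation and bigram probabilities are toy values, not corpus estimates, and the faithfulness model assumes a one-to-one word alignment with no reordering, insertion, or deletion.

```python
import math

# Toy P(source word | target word); values invented for illustration.
TRANSLATION = {
    ("my", "we"): 0.9,
    ("trebuem", "require"): 0.5,
    ("trebuem", "want"): 0.4,
    ("mira", "world"): 0.5,
    ("mira", "peace"): 0.5,
}

# Toy bigram P(word | previous word); values invented for illustration.
BIGRAM = {
    ("<s>", "we"): 0.2,
    ("we", "require"): 0.01,
    ("we", "want"): 0.05,
    ("require", "world"): 0.001,
    ("require", "peace"): 0.01,
    ("want", "world"): 0.002,
    ("want", "peace"): 0.02,
}

FLOOR = 1e-6  # probability assigned to anything unseen


def log_faithfulness(source, target):
    """log P(Input|Sent), assuming word i translates word i."""
    return sum(math.log(TRANSLATION.get((s, t), FLOOR)) for s, t in zip(source, target))


def log_fluency(target):
    """log P(Sent) under the toy bigram model."""
    total, previous = 0.0, "<s>"
    for word in target:
        total += math.log(BIGRAM.get((previous, word), FLOOR))
        previous = word
    return total


def best_translation(source, candidates):
    """Pick the candidate maximising faithfulness * fluency (in log space)."""
    return max(candidates, key=lambda t: log_faithfulness(source, t) + log_fluency(t))


if __name__ == "__main__":
    source = ["my", "trebuem", "mira"]
    candidates = [
        ["we", v, o] for v in ("require", "want") for o in ("world", "peace")
    ]
    print(" ".join(best_translation(source, candidates)))
```

With these toy numbers the faithfulness term alone slightly prefers “require”, and it is the fluency term that tips the choice to “we want peace”.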

19 Statistical Approach - Issues
• P(Input|Sent)
  » Very hard to model situations where translation reorders material, even if this has a simple syntactic description
  » How “faithful” is a proposed output sentence to the original input text?
  » Less clear what this means once we go beyond translating individual words
  » Combine with direct techniques?

20 MT Performance
• Translating 100 sentences is trivial, the problems are all in the scaling-up.
  » Good dictionaries are key.
• Three uses
  » Fully automatic rough translation
    – like Altavista/Systran Babelfish
  » Draft translations which a human post-edits (humans can post-edit quickly as long as less than 20% of words need to be changed)
  » Tools for translators (MAHT)

21 Another approach to HAMT: Controlled Languages
• A controlled (simplified, basic) English is a subset of full English.
  » Limited vocabulary: repair but not fix
  » Limited syntax: I ate but not I have eaten
• Mainly used for technical documents
• Originally intended to make manuals easier for non-native speakers
• MT works much better if input is Controlled English

22 AECMA Simplified English
• (Emerging) standard for commercial aerospace industry.
• Designed by academic linguists as well as practitioners (technical authors).

23 AECMA: vocabulary
• Fixed vocabulary (2000 words?) with additions limited to specific areas (e.g., company names).
• Goal is “each word means only one thing”, and “each concept is expressed by only one word”.
• No ambiguity, no synonyms.
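
A fixed vocabulary makes checking straightforward: any word outside the approved list is flagged. The sketch below is an assumption-laden toy. The approved list and the suggestion table are tiny hypothetical stand-ins (using the earlier “repair but not fix” example), not the AECMA word list, and it ignores the part-of-speech restrictions that the real rules also impose.

```python
import re

# Hypothetical miniature approved vocabulary (not the real AECMA list).
APPROVED = {"repair", "the", "pump", "set", "switch", "to", "input"}

# Hypothetical suggestions for some unapproved words.
PREFERRED = {"fix": "repair"}


def check_vocabulary(sentence):
    """Return (unapproved word, suggested replacement or None) pairs."""
    problems = []
    for word in re.findall(r"[a-z]+", sentence.lower()):
        if word not in APPROVED:
            problems.append((word, PREFERRED.get(word)))
    return problems


if __name__ == "__main__":
    print(check_vocabulary("Fix the pump"))
    print(check_vocabulary("Repair the pump"))
```

Because each concept has only one approved word, the checker can often suggest the replacement directly.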

24 Example words
• Above: only use to indicate physical position
  » Legal: The wing is above the wheel
  » Illegal: The engine temperature is above normal
  » Legal: The engine temperature is more than normal
• Test: use as noun only
  » Legal: the system test
  » Illegal: Test the circuit.
  » Legal: Do a test on the circuit.

25 AECMA: Syntax
Rule: Forbid “unusual” English syntax
• Ex: only simple past, present, future tenses
  » Illegal: Any other information is to be ignored
  » Legal: Ignore any other information
• Ex: No gerunds
  » Illegal: Changing the light is dangerous.
  » Legal: It is dangerous to change the light.

26 AECMA: Syntax Examples (2)
• Only two noun-noun modifiers
  » Illegal: The aircraft door attachment bolt
  » Legal: The attachment bolt of the aircraft door
• Verbs and determiners must be included
  » Illegal: Rotary switch to INPUT
  » Legal: Set the rotary switch to INPUT

27 AECMA: Stylistic Rules
• Sentences should be 20 words or less
• Paragraphs should be 6 sentences or less.
• Start warnings with a command
  » Illegal: The oil used in the engine contains toxic additives which may be absorbed through the skin.
  » Legal: Do not get the oil on your skin. It is poisonous.
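
The two length limits are easy to check mechanically. The sketch below covers only those two rules and relies on a naive sentence splitter (split after ., ! or ?), which is an assumption that will fail on abbreviations. The “start warnings with a command” rule is not checked, since it needs syntactic analysis.

```python
import re

MAX_WORDS_PER_SENTENCE = 20
MAX_SENTENCES_PER_PARAGRAPH = 6


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def check_style(paragraph):
    """Return a list of length-rule violations for one paragraph."""
    problems = []
    sentences = split_sentences(paragraph)
    if len(sentences) > MAX_SENTENCES_PER_PARAGRAPH:
        problems.append(
            f"paragraph has {len(sentences)} sentences "
            f"(limit {MAX_SENTENCES_PER_PARAGRAPH})"
        )
    for sentence in sentences:
        word_count = len(sentence.split())
        if word_count > MAX_WORDS_PER_SENTENCE:
            problems.append(
                f"sentence has {word_count} words "
                f"(limit {MAX_WORDS_PER_SENTENCE}): {sentence}"
            )
    return problems


if __name__ == "__main__":
    print(check_style("Do not get the oil on your skin. It is poisonous."))
```

Note that the illegal warning on this slide is only 16 words long, so it passes the length check; it breaks the command rule, not the length rule.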

28 Controlled-Language MT
• Much easier
  » No problems disambiguating words
  » Hard syntax is forbidden
  » May also prohibit/restrict pronouns
• Authors must write in CE
  » CE conformance checkers
• Lot of commercial interest

