Semantics in Statistical Machine Translation. Jan Odijk, MA-Rotation Lecture, Utrecht, March 10, 2011


1 Semantics in Statistical Machine Translation
Jan Odijk
MA-Rotation Lecture
Utrecht, March 10, 2011

2 Overview
Machine Translation (MT)
Rule-based MT
Statistical MT
Hybrid MT

3 MT: What is it?
Input: text in source language
Output: text in target language that is a translation of the input text

4 MT: What is it?
[Diagram: three translation routes, from most to least abstract]
Interlingua
Analyzed input → transfer → Analyzed output
Input → direct translation → Output

5 MT: System Types
Direct:
– Earliest systems (1950s)
   Direct word-to-word translation
– Recent statistical MT systems
Transfer
– Almost all research and commercial systems <= 1990
Interlingual

6 MT: System Types
Interlingual
– A few research systems in the 1980s
   Rosetta (Philips), based on Montague Grammar
   – Semantic derivation trees of attuned grammars
   Distributed Language Translation (DLT, BSO)
   – (enriched) Esperanto
   Sometimes logical representations
Hybrid Interlingual/Transfer
– Transfer for lexicons; IL for rules

7 Rule-Based Systems
Most systems
– explicit source language grammar
– parser yields analysis of source language input
– transfer component turns it into target language structure
– no explicit grammar of target language (except morphology)

8 Rule-Based Systems
Some systems (Eurotra)
– explicit source and target language grammar
   sometimes reversible
– parser yields analysis of source language input
– transfer component turns it into target language structure
– generation of translation by target language grammar

9 Rule-Based Systems
Some systems (Rosetta, DLT)
– explicit source and target language grammar
   in some cases reversible
– parser yields interlingual representation
– generation of translation by target language grammar from interlingual representation

10 MT: Is it difficult?
FAHQT: Fully Automatic High Quality Translation
– Fully Automatic: no human intervention
– High Quality: close or equal to human translation
Even acceptable quality is difficult to achieve

11 MT: Problems
Ambiguity
– Real
   Cannot be resolved by grammar
   Is much higher than a human can imagine!
   Requires world knowledge modeling or statistics
– Temporary
   Is resolved by the grammar but requires large computational resources
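To give a feel for why ambiguity is "much higher than a human can imagine", here is a minimal sketch that counts how many binary bracketings a sentence of a given length can have. It assumes, purely for illustration, a grammar that licenses every binary tree over the words; real grammars allow fewer analyses, so this is an upper-bound style illustration and not a measurement from the lecture. The function name is hypothetical.

```python
from math import comb


def binary_bracketings(n_words: int) -> int:
    """Number of distinct binary trees over n_words leaves.

    This is the Catalan number C(n_words - 1). It assumes (hypothetically)
    that the grammar allows every possible binary bracketing.
    """
    if n_words < 1:
        raise ValueError("need at least one word")
    k = n_words - 1
    return comb(2 * k, k) // (k + 1)


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, "words:", binary_bracketings(n), "bracketings")
```

Under this assumption a 5-word sentence already has 14 bracketings and a 10-word sentence has 4862, which is why a system cannot simply enumerate all analyses.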

12 MT: Problems
Computational Complexity
– Most rule-based systems: context-free base (O(n^3)) plus extensions (O(?))
– Require large computational resources
– Require large memory resources
– Sentences with length > 20 hardly processable
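The O(n^3) figure for a context-free base can be seen in a chart recognizer of the CKY type: three nested loops over span length, start position and split point. The sketch below is only an illustration; the toy grammar and lexicon are invented here (they are not from the lecture), and the grammar is assumed to be in Chomsky normal form.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (invented for this sketch).
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICON = {
    "he": {"NP"},
    "wears": {"V"},
    "a": {"Det"},
    "suit": {"N"},
}


def cky_recognize(words, start="S"):
    """Return True if the toy grammar derives `words` from `start`."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the categories that span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):          # O(n) span lengths
        for i in range(n - span + 1):     # O(n) start positions
            j = i + span
            for k in range(i + 1, j):     # O(n) split points
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY_RULES.get((left, right), set())
    return start in chart[0][n]


if __name__ == "__main__":
    print(cky_recognize("he wears a suit".split()))
    print(cky_recognize("suit a wears he".split()))
```

The extensions mentioned on the slide (the "O(?)" part) are not modeled here; the sketch covers only the context-free base.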

13 MT: Problems
Complexity of language
– Many different construction types
– All interacting with each other
– Full coverage is hard to achieve → often fall back on robustness measures
– For many constructions proper analysis is not known
– Theoretical linguistics is not going to help because of focus on explanatory adequacy

14 MT: Problems
Divergences between languages
– Lexical categorial:
   zich ergeren vs. (be) annoyed (Verb-Adj)
   hij zwemt graag vs. he likes to swim
– Phrasal categorial
   I expect her to leave
   – ik verwacht dat zij vertrekt
   She is likely to come
   – het is waarschijnlijk dat zij komt

15 Conflational Divergences:
prepositional complements
– houden van vs. love
existential er vs. Ø
– er passeerde een auto vs. a car passed
verbal particles
– blow (something) up vs. volar

16 Conflational Divergences:
reflexive verbs
– zich scheren vs. shave
composed vs. simple tense forms
– he will do it vs. lo hará
split negatives vs. composed negatives
– he does not see anyone vs. hij ziet niemand

17 Functional Divergences:
I like these apples
– me gustan estas manzanas
se venden manzanas aquí
– hier verkoopt men appels
er werd door de toeschouwers gejuicht
– the spectators were cheering

18 Divergences: MWEs
semi-fixed MWEs
– nuclear power plant vs. kerncentrale
flexible idioms
– de plaat poetsen vs. bolt
– de pijp uit gaan vs. to kick the bucket

19 Divergences: MWEs
semi-idioms (collocations)
– zware shag vs. strong tobacco
semi-idioms (support verbs)
– aandacht besteden aan vs. pay attention to

20 MT: Why is it so difficult?
Language Competence vs. Language Use
– Earlier research systems implemented idealized reality
– But not actually occurring language use
– In some cases focus on theoretically interesting difficult constructions (that do occur in reality)
   But other constructions are more important to deal with in practical systems

21 MT: Why is it so difficult?
Large and rich lexicons
– Existing human-oriented dictionaries are not suited as such
– All information must be available in a formalized way
– Much more information is needed than in a traditional dictionary

22 MT: Why is it so difficult?
Multi-word Expressions (MWEs)
– Appear in current dictionaries only in a very informal way
– No standards on how to represent them lexically
– Many different types requiring different treatment in the grammar
– Huge numbers!!
– Domain- and company-specific terminology often consists of MWEs

23 MT: Why is it so difficult?
All systems must make approximations:
– Ignore certain ambiguities to begin with
– Use only limited amount of relevant information
– Cut off analysis when there are too many alternatives

24 Statistical MT
Derives MT-system automatically
– From statistics taken from
   Aligned parallel corpora (→ translation model)
   Monolingual target language corpora (→ language model)
Being worked on since the early 1990s
Paradigm originates in speech recognition (and there, in turn, in noisy channel models)
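In the noisy channel formulation, the system picks the target sentence e that maximizes P(e) · P(f | e) for a source sentence f, where P(e) comes from the language model and P(f | e) from the translation model. The sketch below shows that combination on the "Hij draagt schoenen" example from the Google Translate slide. All probabilities are invented toy numbers, not estimates from any corpus, and the function names are hypothetical; real systems search over a huge hypothesis space instead of a fixed candidate list.

```python
import math

# Invented toy language model: bigram probabilities P(next | previous).
BIGRAM_PROBS = {
    ("<s>", "he"): 0.2,
    ("he", "wears"): 0.05,
    ("he", "carries"): 0.05,
    ("wears", "shoes"): 0.1,
    ("carries", "shoes"): 0.01,
    ("shoes", "</s>"): 0.3,
}

# Invented toy translation model: P(source sentence | target sentence).
# Both candidates explain the Dutch input equally well.
TRANSLATION_PROBS = {
    ("hij draagt schoenen", "he wears shoes"): 0.4,
    ("hij draagt schoenen", "he carries shoes"): 0.4,
}

FLOOR = 1e-6  # assumed floor probability for unseen events


def lm_log_prob(sentence):
    """log P(e) under the toy bigram language model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log(BIGRAM_PROBS.get(pair, FLOOR))
        for pair in zip(tokens, tokens[1:])
    )


def decode(source, candidates):
    """Return argmax over candidates of log P(e) + log P(source | e)."""
    def score(target):
        tm = TRANSLATION_PROBS.get((source, target), FLOOR)
        return lm_log_prob(target) + math.log(tm)
    return max(candidates, key=score)


if __name__ == "__main__":
    print(decode("hij draagt schoenen",
                 ["he carries shoes", "he wears shoes"]))
```

With these toy numbers the translation model cannot choose between "wears" and "carries"; the language model decides, because "wears shoes" is given a higher probability than "carries shoes".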

25 MT: Can we make it possible?
Plus:
– No or very limited grammar development
– Includes language and world knowledge automatically (but implicitly)
– Based on actually occurring data
– Currently many experimental and commercial systems
Minus:
– Requires large aligned parallel corpora
– Clearly has problems with longer span dependencies

26 Statistical MT
Google Translate (statistical MT)
Hij draagt een pak. → √ He wears a suit.
Hij draagt schoenen. → √ He wears shoes.
Hij draagt bruine schoenen en een pak. → √ He wears a suit and brown shoes. (!!)
Hij draagt het pakket → √ He carries the package
Hij heeft een pak aan. → * He has a suit.
Voert uw bedrijf sloten uit? → * Does your company locks out?

27 Hybrid MT:
Can we somehow combine the strengths of rule-based approaches and the statistical approaches
– And avoid their disadvantages?
Active Research area
– Several projects

28 Hybrid MT
Euromatrix, esp. “the Euromatrix”
– Lists data and tools for European language pairs
– Goals
   Translation systems for all pairs of EU languages
   Organization, analysis and interpretation of a competitive annual international evaluation of machine translation
   The provision of open source machine translation technology including research tools, software and data
   A systematically compiled and constantly updated detailed survey of the state of MT technology for all EU language pairs
   Efficient inclusion of linguistic knowledge into statistical machine translation
   The development and testing of hybrid architectures for the integration of rule-based and statistical approaches
Successor project EuromatrixPlus

29 Hybrid MT
PACO-MT
Investigates hybrid approach to MT
– Rule-based and statistical
– Uses existing parser for source language analysis
– Uses statistical n-gram language models for generation
– Uses statistical approach to transfer
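As a rough illustration of what "n-gram language models for generation" can mean, the sketch below trains a bigram model with add-one smoothing on a three-sentence corpus and uses it to pick the most fluent order for a bag of target words. This is not PACO-MT's actual method; the corpus, the smoothing choice and all function names are assumptions made for this example, and trying every permutation is only feasible for very short inputs.

```python
import math
from collections import Counter
from itertools import permutations

# Tiny invented training corpus (illustration only).
CORPUS = ["he wears a suit", "he wears brown shoes", "she wears a hat"]


def train_bigram_counts(sentences):
    """Count bigram contexts and bigrams, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def log_prob(words, contexts, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a word sequence."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def best_order(words, contexts, bigrams, vocab_size):
    """Return the permutation of `words` the language model scores highest."""
    return max(
        permutations(words),
        key=lambda order: log_prob(order, contexts, bigrams, vocab_size),
    )


CONTEXTS, BIGRAMS = train_bigram_counts(CORPUS)
# Vocabulary of possible next tokens: all words plus the end marker.
VOCAB_SIZE = len({w for s in CORPUS for w in s.split()} | {"</s>"})

if __name__ == "__main__":
    order = best_order(["suit", "a", "wears", "he"],
                       CONTEXTS, BIGRAMS, VOCAB_SIZE)
    print(" ".join(order))
```

In a hybrid setup, the candidate orders would come from the transfer step and not from brute-force permutation; the language model's role of ranking candidates by fluency stays the same.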

30 Hybrid MT
META-NET (EU-funding)
– Building a community with shared vision and strategic research agenda
– Building META-SHARE, an open resource exchange facility
– Building bridges to neighbouring technology fields
   Bringing more Semantics into Translation
   Optimising the Division of Labour in Hybrid MT
   Exploiting the Context for Translation
   Empirical Base for Machine Translation

31 Hybrid MT
Bringing more Semantics into Translation
– Charles University Prague (Jan Hajic)
– FBK-Irst, Trento (Marcello Federico)
– UiL-OTS, Utrecht (Christer Samuelsson)
   currently orienting ourselves and trying to determine a concrete topic for investigation

32 Hybrid MT: Semantics
Possible Topics:
– Lexical semantics and their resources / Word Sense Disambiguation
– Knowledge representations
– Multiword expressions
– Syntactic and semantic dependencies / Semantic Role Labeling
– Discourse structure
– Co-reference resolution
– Recognizing Textual Entailment and MT Evaluation

33 Semantics Resources
Lexical Semantics
– Resources: WordNet, EuroWordNet, BalkaNet, WordNets for several languages
– Knowledge Repositories: OpenCyc, Wikipedia, DBpedia
MWE Lexica: SAID, DUELME

34 Semantics Resources
CoNLL 2009 Shared Task on syntactic and semantic dependencies
– training and development data
– evaluation data
Penn Discourse TreeBank

35 Hybrid MT
Tools:
SRL and Semantic Parsing: SWIRL, ASSERT, SENNA, C&C (all for Eng), tools developed at Lund University (for Eng and Chn)

36 Semantics Resources
Tools:
Co-Reference and Anaphora Resolution:
– BART (Eng)
– COREA (Dut)
NER:
– BIOS (Eng)

