AMTEXT: Extraction-based MT for Arabic


1 AMTEXT: Extraction-based MT for Arabic
Faculty: Alon Lavie, Jaime Carbonell
Students and Staff: Laura Kieras, Peter Jansen
Informant: Loubna El Abadi

2 Background and Objectives
Full MT of text is problematic:
  Requires large amounts of resources, long development time
  Quality of output varies
Analysts often are looking for limited concrete information within the text → full MT may not be necessary
Alternative: rather than full MT followed by extraction, first extract and then translate only extracted information
Text Extraction technology has made much progress in past decade [TIPSTER, TREC, EELD]
Research Question: Can Extraction-based MT result in improved accuracy and utility of information for analysts?
Nov 14, 2003 ITIC Site Visit

3 Extraction-based MT
“Traditional” Approach:
  Develop information extraction capability for the source language
  Runtime Extractor produces a template of extracted feature-value information
  If desired, English Generator can render the information in the form of text
Drawback: Adapting extraction technology to a new foreign language is difficult
  Requires significant expertise in the foreign language
  Significant amounts of human development time
  Not clear that it is an attractive solution

4 AMTEXT Approach
Attempt to leverage our work on automatic learning of MT transfer rules
Develop an elicitation corpus specifically designed for targeted extraction patterns
Learn generalized transfer rules for targeted extraction patterns from elicitation corpus
Acquire high-accuracy Named-Entity translation lexicon + limited translation lexicon for targeted vocabulary
Runtime: use partial parser + transfer rules to translate only the matched portions of SL text

5 AMTEXT Extraction-based MT
[Architecture diagram]
Learning: Word-aligned elicited data → Learning Module → Transfer Rules
Run Time Transfer System: Source Text → Partial Parser → Transfer Engine → Extracted Target Text
Resources used at run time: Transfer Rules, NE Translation Lexicon, Word Translation Lexicon
Example transfer rule:
  S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]
  ((X1::Y1) (X4::Y4) (X5::Y5))
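To make the rule notation concrete, the following is a minimal Python sketch of what a transfer engine might do with one such rule once a source span has been matched. It is only an illustration under stated assumptions: `apply_rule`, the list encoding of the rule, and the toy lexicon entries are hypothetical and are not taken from the AMTEXT system. The `(Xn::Ym)` constraints are read as 1-based links between source and target positions.

```python
def apply_rule(source_tokens, rule_source, rule_target, alignments, lexicon):
    """Translate one matched span with one transfer rule.

    alignments holds 1-based (x, y) pairs, read as (Xx::Yy) in the rule.
    Aligned positions are slots whose fillers are translated via `lexicon`;
    all other source positions must equal the rule's literal words.
    """
    if len(source_tokens) != len(rule_source):
        raise ValueError("span length does not match the rule's source side")
    slots = {x for x, _ in alignments}
    for pos, (token, expected) in enumerate(zip(source_tokens, rule_source), start=1):
        if pos not in slots and token != expected:
            raise ValueError(f"literal mismatch at X{pos}: {token!r} != {expected!r}")
    output = list(rule_target)
    for x, y in alignments:
        filler = source_tokens[x - 1]
        # Assumption: fillers missing from the lexicon pass through untranslated.
        output[y - 1] = lexicon.get(filler, filler)
    return " ".join(output)


# Rule from the slide: S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]
RULE_SOURCE = ["NE-P", "pagash", "et", "NE-P", "TE"]
RULE_TARGET = ["NE-P", "met", "with", "NE-P", "TE"]
ALIGNMENTS = [(1, 1), (4, 4), (5, 5)]

# Toy stand-in for the NE and word translation lexicons (illustrative entries).
LEXICON = {"sharon": "Sharon", "bush": "Bush", "hayom": "today"}

print(apply_rule(["sharon", "pagash", "et", "bush", "hayom"],
                 RULE_SOURCE, RULE_TARGET, ALIGNMENTS, LEXICON))
# prints: Sharon met with Bush today
```

The verb and preposition are emitted directly from the rule's target side, while only the slot fillers go through lexicon lookup.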

6 Elicitation Example

7 Elicitation Example

8 Elicitation Example

9 Elicitation Example

10 Learning Transfer Rules
Different notion of rule generalization than in our full XFER approach
Generalize from examples to NEs that play specific roles in target extraction pattern
Verbs and function words may not be generalized
Example:
  Sharon will meet with Bush today
  sharon yipagesh &im bush hayom
Goal Rule:
  S::S [NE-P yipagesh &im NE-P TE] -> [NE-P will meet with NE-P TE]
  ((X1::Y1) (X4::Y5) (X5::Y6))
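One way to picture this kind of generalization is the Python sketch below. It is not the project's learning module: `generalize` is a hypothetical helper that assumes the word alignment and the target-side NE/TE labels are already given, and that each labeled target word links to exactly one source word. Labeled tokens become slots on both sides; verbs and function words stay lexicalized.

```python
def generalize(src_tokens, tgt_tokens, alignment, tgt_labels):
    """Build a rule string from one word-aligned example.

    alignment: 1-based (source_pos, target_pos) word links.
    tgt_labels: {target_pos: label} for the tokens to generalize.
    """
    src_side = list(src_tokens)
    tgt_side = list(tgt_tokens)
    slots = []
    for tgt_pos, label in sorted(tgt_labels.items()):
        links = [s for s, t in alignment if t == tgt_pos]
        if len(links) != 1:
            raise ValueError(f"Y{tgt_pos} needs exactly one aligned source word")
        src_pos = links[0]
        # Replace the word by its role label on both sides of the rule.
        src_side[src_pos - 1] = label
        tgt_side[tgt_pos - 1] = label
        slots.append((src_pos, tgt_pos))
    constraints = " ".join(f"(X{s}::Y{t})" for s, t in sorted(slots))
    return f"S::S [{' '.join(src_side)}] -> [{' '.join(tgt_side)}] ({constraints})"


rule = generalize(
    src_tokens=["sharon", "yipagesh", "&im", "bush", "hayom"],
    tgt_tokens=["Sharon", "will", "meet", "with", "Bush", "today"],
    # Assumed word alignment for the slide's example pair.
    alignment=[(1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (5, 6)],
    tgt_labels={1: "NE-P", 5: "NE-P", 6: "TE"},
)
print(rule)
```

With these inputs the function reproduces the goal rule from the slide, including the `(X4::Y5)` and `(X5::Y6)` offsets caused by the extra target word "will".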

11 Acquisition of Named Entity Translation Lexicon
Utilize Fei Huang’s work on building Named Entity Translation Lexicons based on transliteration models
NE Lexicon will be split into meaningful sub-categories: PNs, Organizations, Locations, etc.
NE translation lexicon augmented with NEs from elicited data
Goal: High coverage and high accuracy identification of NEs that play a part in the transfer rules

12 Named Entity Translation Lexicon
English-Arabic lexicon from Fei:
  Trained on TIDES Newswire Data
  7522 entries sorted by transliteration score
Example:
  # XXX # # Israel # AsrAAyl
  # XXX # # Kabul # kAbwl
  # XXX # # Paris # bArys
  # XXX # # Afghanistan # AfgAnstAn
  # XXX # # Pakistan # bAkstAn
  # XXX # # Moscow # mwskw
  # XXX # # Arafat # ErfAt
  # XXX # # Beirut # byrwt
  # XXX # # Russia # rwsyA
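A lexicon in this shape could be loaded for Arabic-to-English lookup roughly as in the Python sketch below. The file format is not documented in the slides, so the parsing is an assumption: the last two `#`-separated fields are taken to be the English name and the romanized Arabic form, the `XXX` field is ignored, and the list is assumed to be sorted best-first so that the first entry for a given Arabic form wins. `parse_entry` and `load_lexicon` are hypothetical names.

```python
def parse_entry(line):
    """Return (arabic, english) from one '#'-separated lexicon line."""
    fields = [f.strip() for f in line.split("#") if f.strip()]
    if len(fields) < 2:
        raise ValueError(f"cannot parse lexicon line: {line!r}")
    # Assumption: the final two fields are the English and Arabic forms.
    english, arabic = fields[-2], fields[-1]
    return arabic, english


def load_lexicon(lines):
    """Build an Arabic -> English dict, keeping the first entry per key."""
    lexicon = {}
    for line in lines:
        if not line.strip():
            continue
        arabic, english = parse_entry(line)
        # Assumption: entries are sorted best-first by transliteration score.
        lexicon.setdefault(arabic, english)
    return lexicon


SAMPLE = [
    "# XXX # # Israel # AsrAAyl",
    "# XXX # # Kabul # kAbwl",
    "# XXX # # Paris # bArys",
    "# XXX # # Arafat # ErfAt",
]
ne_lexicon = load_lexicon(SAMPLE)
print(ne_lexicon["bArys"])  # prints: Paris
```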

13 Named Entity Identification
NE IdentiFinder for English:
  Available from BBN
  Will be used for identifying English NEs within elicited data → Arabic NEs from word alignments
NE IdentiFinder for Arabic:
  Requested from BBN, so far no response
  Will use if available, can manage without it (naïve identification based on NE translation lexicon)
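The projection of English NEs onto the source side through word alignments might look like the Python sketch below. The English spans are written by hand here; they stand in for tagger output and do not reflect IdentiFinder's actual output format. `project_entities` is a hypothetical helper, and the sentence pair is the romanized example used elsewhere in these slides.

```python
def project_entities(english_spans, alignment, source_tokens):
    """Project English NE spans onto source tokens via word alignment.

    english_spans: (start, end, label) with 0-based, end-exclusive indices.
    alignment: 0-based (source_index, english_index) word links.
    Returns (start, end, label, source_text) for each projectable span.
    """
    projected = []
    for start, end, label in english_spans:
        src_idx = sorted({s for s, e in alignment if start <= e < end})
        if not src_idx:
            continue  # no aligned source word, so nothing to project
        lo, hi = src_idx[0], src_idx[-1] + 1
        projected.append((lo, hi, label, " ".join(source_tokens[lo:hi])))
    return projected


source = ["sharon", "yipagesh", "&im", "bush", "hayom"]
# English side: Sharon will meet with Bush today
links = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
spans = [(0, 1, "PERSON"), (4, 5, "PERSON")]  # hand-written stand-in for tagger output

print(project_entities(spans, links, source))
# prints: [(0, 1, 'PERSON', 'sharon'), (3, 4, 'PERSON', 'bush')]
```

Taking the smallest covering source span is a simplification; discontinuous alignments would need a more careful treatment.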

14 Acquisition of Limited Word Translation Lexicon
Vocabulary of interest is limited based on specific actions and objects that are of interest → scopeable on the English side
Elicitation corpus serves as a high-quality initial source for extracting this translation lexicon
Statistical word-to-word translation dictionary from SMT or EBMT can be used as a source for expanding coverage on the foreign language side
Experiment if time/resources permit with incorporating expanded vocabulary into transfer rules
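A simple way to pull such a lexicon out of a word-aligned elicitation corpus is sketched below in Python. This is an assumption about how the extraction could work, not the project's procedure: `extract_lexicon` is a hypothetical helper that counts alignment links whose English side falls in the scoped vocabulary and keeps the most frequent English word for each source word.

```python
from collections import Counter, defaultdict


def extract_lexicon(aligned_corpus, english_vocab):
    """Return {source_word: most frequent aligned English word in vocab}.

    aligned_corpus: iterable of (source_tokens, english_tokens, alignment),
    where alignment holds 0-based (source_index, english_index) links.
    """
    counts = defaultdict(Counter)
    for src_tokens, eng_tokens, alignment in aligned_corpus:
        for s, e in alignment:
            english = eng_tokens[e].lower()
            if english in english_vocab:
                counts[src_tokens[s]][english] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}


corpus = [
    (
        ["sharon", "yipagesh", "&im", "bush", "hayom"],
        ["Sharon", "will", "meet", "with", "Bush", "today"],
        [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4), (4, 5)],
    ),
]
# The vocabulary of interest is scoped on the English side.
vocab = {"meet", "with", "today"}

print(extract_lexicon(corpus, vocab))
```

Names such as "Sharon" are left out because they are not in the scoped vocabulary; they would come from the NE translation lexicon instead.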

15 Partial Parsing
Input: Full text in the foreign language
Output: Translation of extracted/matched text
Goal: Extract by effectively matching transfer rules with the full text
  Identify/parse NEs and words in restricted vocabulary
  Identify transfer-rule (source-side) patterns
  Handle expected high levels of ambiguity
Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
NE-P NE-P NE-P TE
Sharon will meet with Bush today
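The matching step and the ambiguity it creates can be illustrated with the Python sketch below. It is a naive stand-in for a partial parser, under these assumptions: tokens arrive already labeled (`NE-P`, `TE`, or `None`), punctuation has been dropped, a pattern element matches either a label or a literal word, and up to `max_gap` tokens may be skipped between consecutive elements. `find_matches` and `max_gap` are hypothetical names.

```python
def find_matches(tagged, pattern, max_gap=4):
    """Return every list of token indices that realizes `pattern` in order.

    tagged: list of (token, label) pairs; label is None for plain words.
    """
    results = []

    def extend(pat_pos, start, chosen):
        if pat_pos == len(pattern):
            results.append(list(chosen))
            return
        # The first element may start anywhere; later elements may skip
        # at most max_gap tokens after the previously matched one.
        end = len(tagged) if not chosen else min(len(tagged), start + max_gap + 1)
        for i in range(start, end):
            token, label = tagged[i]
            if pattern[pat_pos] == label or pattern[pat_pos] == token:
                extend(pat_pos + 1, i + 1, chosen + [i])

    extend(0, 0, [])
    return results


sentence = [
    ("sharon", "NE-P"), ("meluve", None), ("b-sar", None), ("ha-xuc", None),
    ("shalom", "NE-P"), ("yipagesh", None), ("im", None),
    ("bush", "NE-P"), ("hayom", "TE"),
]
pattern = ["NE-P", "yipagesh", "im", "NE-P", "TE"]

for match in find_matches(sentence, pattern):
    print([sentence[i][0] for i in match])
```

The sketch returns two candidates, one with "sharon" as the first NE-P and one with "shalom". Choosing the intended reading (Sharon) needs additional evidence that this sketch does not model.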

16 Scope of Pilot System
Arabic-to-English
Newswire text (available from TIDES)
Limited set of actions: (X meet Y) (X attend Y) (X hold Y) (X kill Y) (X announce Y)…
Limited translation patterns: <subj-NE> <verb> <obj> <LOC>* <TE>*
Limited vocabulary

17 Evaluation Plan
Compare AMTEXT approach to full-text Arabic-to-English SMT, on a limited task of translation of relations within the scope of coverage
Establish a test set for evaluation
Define an appropriate metric: Precision/Recall/F1 of relations and entities
Compare performance
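For reference, Precision/Recall/F1 over extracted relations can be computed as in the Python sketch below. The slides leave the exact metric to be defined, so the scoring here is an assumption: relations are represented as (action, X, Y) tuples and a prediction counts as correct only on exact match with a gold tuple. `prf1` and the sample relations are illustrative.

```python
def prf1(predicted, gold):
    """Return (precision, recall, f1) for two collections of relation tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold_relations = {("meet", "Sharon", "Bush"), ("announce", "Arafat", "ceasefire")}
system_relations = {("meet", "Sharon", "Bush"), ("meet", "Shalom", "Bush")}

print(prf1(system_relations, gold_relations))
# prints: (0.5, 0.5, 0.5)
```

The same function can score entities by passing sets of entity strings instead of relation tuples.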

18 Current Status
Initial small elicitation corpus translated and aligned
Extraction of elicitation phrases from Penn-TB in advanced stages
Identifying scope of coverage: relations, actions, translation patterns
Preliminary NE translation lexicon available

19 Work Plan
Creation of full elicitation corpus: Nov-03
Translation/alignment of elicitation corpus: Nov/Dec-03
Install and integrate BBN English IdentiFinder: Dec-03
Acquire initial NE translation lexicon: Dec-03
Acquire initial word translation lexicon: Dec-03
Develop and integrate partial parser: Dec-03/Feb-04
Modify Transfer Engine for AMTEXT configuration: Dec-03/Jan-04
Integration of preliminary complete system: Feb-04
Design of evaluation: Feb-04
System testing and modifications: Feb/Apr-04
Test-set evaluation: Apr-04

