Urdu-to-English Stat-XFER system for NIST MT Eval 2008


1 Urdu-to-English Stat-XFER system for NIST MT Eval 2008
Resources:
“Constrained” training condition: only parallel training data released by NIST/LDC on DVD
  We believe this is the LDC “language pack” for Urdu developed under NSA funding
  100K words?
  Includes our recent 3,100-sentence Elicitation Corpus (~10K words?)
Annotations to the training data can be added and distributed to all participants
  Most important for us: English parse trees
  Any other annotations? POS tags?

2 Training Resources Implications:
No morphological analyzer on the Urdu side
  Full-form to full-form MT system?
  Collapse entries on the English side only and plug in an English morphology generator?
Is manual transfer rule development allowed under the training rules?
  Only rules that are extracted automatically from the parallel data?
  Human filtering of automatically extracted rules?
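One way to picture the "collapse entries on the English side" option is the minimal sketch below. It is an illustration under stated assumptions, not the system's actual procedure: the lemma table `TOY_LEMMAS`, the function `collapse_english_side`, and the placeholder Urdu tokens (`u_form_1`, `u_form_2`) are all hypothetical, and a real setup would use a proper English morphological analyzer and generator.

```python
from collections import defaultdict

# Hypothetical toy lemma table; stands in for a real English analyzer.
TOY_LEMMAS = {"houses": "house", "house": "house", "went": "go", "goes": "go"}


def collapse_english_side(entries, lemmas=TOY_LEMMAS):
    """Merge lexicon entries whose English sides share a lemma.

    entries: iterable of (urdu_full_form, english_full_form, count).
    The Urdu side stays full-form (no Urdu analyzer is assumed);
    only the English side is reduced to a lemma, so an English
    morphology generator would re-inflect it at output time.
    """
    collapsed = defaultdict(int)
    for urdu_form, english_form, count in entries:
        # Unknown English forms are kept unchanged.
        lemma = lemmas.get(english_form, english_form)
        collapsed[(urdu_form, lemma)] += count
    return dict(collapsed)


# Placeholder Urdu tokens; not real data.
entries = [
    ("u_form_1", "house", 3),
    ("u_form_1", "houses", 2),
    ("u_form_2", "went", 4),
]
collapsed = collapse_english_side(entries)
```

In this toy case the two `u_form_1` entries merge into one entry for the lemma `house`, which reduces sparsity on the English side while leaving the Urdu side untouched.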

3 Building the Urdu-to-English system
Main “pieces” that we need:
  Bilingual word-to-word translation lexicon
  Base NP translation tables
  XFER grammar rules
  English language model

4 Bilingual Translation Lexicon and Base NPs
Extracted from the parallel corpus
Bootstrap with GIZA++ word alignment (IBM-1 or IBM-3)
Filtering of the resulting word-to-word translations
  Combining alignments from both directions
  Applying various threshold methods
New iterative method with extraction of base NPs (Vamshi):
  Need English parses for the entire parallel corpus
  We will have accurate “gold” parses for the EC portion of the corpus
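The filtering step above can be sketched roughly as follows. This is a minimal illustration, assuming that "combining alignments from both directions" means keeping only links found in both runs (intersection) and that "threshold methods" means a minimum link count plus a minimum relative frequency. The slides do not specify either choice. The function names, the parameters `min_count` and `min_prob`, and the placeholder tokens are hypothetical, and the alignment links are assumed to have already been read from the aligner's output into index pairs.

```python
from collections import Counter, defaultdict


def intersect_alignments(src_to_tgt, tgt_to_src):
    """Keep only links proposed by both alignment directions.

    src_to_tgt: set of (src_idx, tgt_idx) links from one run.
    tgt_to_src: set of (tgt_idx, src_idx) links from the reverse run.
    """
    flipped = {(s, t) for (t, s) in tgt_to_src}
    return src_to_tgt & flipped


def build_lexicon(sentence_pairs, alignments, min_count=2, min_prob=0.1):
    """Collect word pairs from aligned links and filter by thresholds.

    sentence_pairs: list of (src_tokens, tgt_tokens).
    alignments: list of sets of (src_idx, tgt_idx), one per sentence pair.
    Returns {src_word: {tgt_word: relative_frequency}}.
    """
    pair_counts = Counter()
    src_counts = Counter()
    for (src, tgt), links in zip(sentence_pairs, alignments):
        for s, t in links:
            pair_counts[(src[s], tgt[t])] += 1
            src_counts[src[s]] += 1

    lexicon = defaultdict(dict)
    for (src_word, tgt_word), count in pair_counts.items():
        prob = count / src_counts[src_word]
        if count >= min_count and prob >= min_prob:
            lexicon[src_word][tgt_word] = prob
    return dict(lexicon)


# Toy data with placeholder source tokens; not real corpus data.
links = intersect_alignments({(0, 0), (1, 1), (1, 0)}, {(0, 0), (1, 1)})
pairs = [(["u1", "u2"], ["house", "big"])] * 2
lexicon = build_lexicon(pairs, [links, links], min_count=2, min_prob=0.5)
```

Intersection trades recall for precision, which suits a lexicon that will be filtered further; a union or a heuristic in between would keep more candidate pairs at the cost of more noise.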

