Vamshi Ambati 14 Sept 2007 Student Research Symposium

Noun Phrase driven Bootstrapping of Word Alignment for Syntax Projection

Agenda
Rule Learning for MT
Syntax Projection Task
Word Alignment Task
Bootstrapping Word Alignment
Experiment and Results

Machine Translation for Resource-poor Languages
A major portion of human languages are ‘resource-poor’:
Little parallel corpus paired with major languages
Little monolingual corpus
Few annotation tools
Few grammarians
Few bilingual speakers
Machine translation in such a scenario is extremely difficult

Machine Translation for Resource-poor Languages
AVENUE [Lavie et al. ’03] –

Rule Learning for MT
NP::NP [PP NP] -> [NP PP]
(
  (X1::Y2)
  (X2::Y1)
  (X0 = X2)
  ((Y1 NUM) = (X2 NUM))
  ((Y1 NUM) = (Y2 NUM))
  ((Y1 PERS) = (Y2 PERS))
  (Y0 = Y1)
)
PP::PP [ADVP NP POSTP] -> [ADVP PREP NP]
  (X1::Y1)
  (X2::Y3)
  (X3::Y2)
  (X0 = X3)
  (Y0 = Y2)
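A minimal sketch of how the constituent alignments in such rules (for example X1::Y2, X2::Y1) drive reordering. It ignores the unification constraints such as ((Y1 NUM) = (X2 NUM)), and the names `TransferRule` and `apply_rule` are hypothetical, introduced only for illustration; this is not the AVENUE implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRule:
    """Hypothetical container for one rule; unification constraints are omitted."""
    source_type: str
    target_type: str
    source_rhs: tuple
    target_rhs: tuple
    alignments: tuple  # (x, y) pairs, 1-based, read as Xx::Yy


def apply_rule(rule, translated_children):
    """Place the already-translated source constituents into target order.

    translated_children[k] is the translation of source constituent X(k+1).
    Target slots without an alignment stay None.
    """
    if len(translated_children) != len(rule.source_rhs):
        raise ValueError("one translated child per source constituent is required")
    target = [None] * len(rule.target_rhs)
    for x, y in rule.alignments:
        target[y - 1] = translated_children[x - 1]
    return target


# The two rules from the slide, alignments only.
NP_RULE = TransferRule("NP", "NP", ("PP", "NP"), ("NP", "PP"), ((1, 2), (2, 1)))
PP_RULE = TransferRule(
    "PP", "PP", ("ADVP", "NP", "POSTP"), ("ADVP", "PREP", "NP"),
    ((1, 1), (2, 3), (3, 2)),
)

print(apply_rule(NP_RULE, ["<pp>", "<np>"]))
print(apply_rule(PP_RULE, ["<advp>", "<np>", "<prep>"]))
```

The placeholder strings stand in for translated sub-constituents; a real transfer engine would also check the agreement constraints before applying a rule.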

How can such rules be learnt?
Given annotated data for the target language (TL), there are creative ways to do this
Nothing is more valuable than annotated data
But these are “resource-poor” languages
Can we look from the ‘source side’ and transfer the annotation?

Syntax Projection: Named Entity Projection
English: [Rama] ate an apple
Hindi, before projection: rAma ne ek apple khaya
Hindi, after projection: [rAma] ne ek apple khaya

Syntax Projection: Base NP Projection
English: [Rama] ate [an apple]
Hindi, before projection: rAma ne ek apple khaya
Hindi, after projection: [rAma] ne [ek apple] khaya

Syntax Projection: Constituent Phrase Projection
rAma ne ek apple khaya

Rule Learning Goal
English: Rama ate an apple
Hindi: rAma ne apple khaya
S::S [NP NP VP] -> [NP VP NP]
S::S [NP NP ‘khaya’] -> [NP ‘ate’ NP]
S::S [‘rAma’ ‘ne’ NP ‘khaya’] -> [‘Rama’ ‘ate’ NP]

Word Alignment Task
Training data
Source language: f, the source sentence (Hindi), with word positions j = 1, 2, ..., J
Target language: e, the target sentence (English), with word positions i = 1, 2, ..., I

Word Alignment Models
IBM1 – lexical probabilities only
IBM2 – lexicon plus absolute position
IBM3 – plus fertilities
IBM4 – inverted relative position alignment
IBM5 – non-deficient version of Model 4
HMM – lexicon plus relative position
[Brown et al. 1993, Vogel et al. 1996, Och et al. 1999]
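To make the simplest entry in this list concrete, here is a minimal sketch of IBM Model 1 training with EM, using lexical probabilities t(f | e) only. The toy bitext is invented for illustration and is not from the talk's corpus; `train_ibm1` and `viterbi_align` are hypothetical helper names, and the experiments in this talk use Model 4, not this code.

```python
from collections import defaultdict


def train_ibm1(bitext, iterations=10):
    """EM for IBM Model 1. bitext is a list of (f_tokens, e_tokens) pairs.

    Returns a dict-like table t where t[(f, e)] approximates t(f | e).
    """
    f_vocab = {f for fs, _ in bitext for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in bitext:
            es_null = ["NULL"] + list(es)  # allow alignment to the empty word
            for f in fs:
                z = sum(t[(f, e)] for e in es_null)
                for e in es_null:
                    c = t[(f, e)] / z      # expected count (E-step)
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per English word.
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t


def viterbi_align(fs, es, t):
    """Most probable Model 1 alignment: each f_j picks its best e_i (0-based)."""
    return [(j, max(range(len(es)), key=lambda i: t[(fs[j], es[i])]))
            for j in range(len(fs))]


# Invented toy bitext (Hindi-like tokens, English glosses).
TOY_BITEXT = [
    (["lAla", "apple"], ["red", "apple"]),
    (["lAla", "kitAba"], ["red", "book"]),
    (["nayI", "kitAba"], ["new", "book"]),
]

table = train_ibm1(TOY_BITEXT)
print(viterbi_align(["lAla", "kitAba"], ["red", "book"], table))
```

The higher models in the list add position and fertility parameters on top of this lexical table.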

Our Approach
Better syntax projection requires better word alignment
Our hypothesis: word alignment can be improved using syntax projection
Project base NPs to the TL and obtain a clean NP table
Perform a constrained alignment on the parallel corpus using the NP table

Why Base NPs?
NPs are semantically and syntactically cohesive across languages
NPs show minimal categorial divergence compared with other phrase types
NPs are building blocks of a sentence, and translating them well improves MT quality [Philipp Koehn, PhD thesis 2003]

Constrained Alignment
[PESA: Phrase Pair Extraction as Sentence Splitting, Vogel ’05 ]

Ex:
Rama ate [an apple]
rAma ne [ek apple] khaya

Ex:
Rama ate [an apple]
rAma ne khaya [ek apple]

NP based Bootstrapping: Algorithm
Word align ‘S’ and ‘T’ using IBM-4
Extract NPs on the source side using the parse
Extract their translations by harvesting the Viterbi alignment
Calculate features for the NP pairs
Prune based on thresholds
Perform constrained word alignment and lexicon extraction using the NP table

Corpus
English:
There are quite a large number of Malayalees living here .
is found in the west coast of Great Nicobar called the Magapod Island .
Plotemy calls them ' Nagadip ' , a Hindu name for naked island
Hindi:
malayAlama logoM kI bahuwa badZI saMKyA hE .
xvIpa ke paScimI wata para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .
plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .

Source side Parsed
There are quite a large number of Malayalees living here .
is found in the west coast of Great Nicobar called the Magapod Island .
Plotemy calls them ' Nagadip ' , a Hindu name for naked island
(S1 (S (NP (EX There)) (VP (AUX are) (NP (NP (PDT quite) (DT a) (JJ large) (NN number)) (PP (IN of) (NP (NP (NNS Malayalees)) (VP (VBG living) (ADVP (RB here))))))) (. .)))
(S1 (S (VP (AUX is) (VP (VBN found) (PP (IN in) (NP (NP (DT the) (JJ west) (NN coast)) (PP (IN of) (NP (NNP Great) (NNP Nicobar))) (VP (VBN called) (S (NP (DT the) (NNP Magapod) (NNP Island)))))))) (. .)))
(S1 (S (NP (NNP Plotemy)) (VP (VBZ calls) (SBAR (S (NP (PRP them)) (VP (POS ') (NP (NP (NNP Nagadip) (POS ')) (, ,) (NP (NP (DT a) (NNP Hindu) (NN name)) (PP (IN for) (NP (JJ naked) (NN island))))))))) (. .)))
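A minimal sketch of pulling base NPs out of a bracketed parse like the ones above. It assumes a base NP is an NP node with no NP anywhere below it; the talk does not state its definition, but on the first parse this assumption yields the same source NPs that the extraction step lists for sentence 1. The function names are hypothetical.

```python
def parse_tree(text):
    """Parse a Penn-style bracketing into (label, children); leaves are strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def leaves(node):
    """All words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]


def has_np_below(node):
    """True if any descendant of node is labelled NP."""
    return any(
        not isinstance(child, str) and (child[0] == "NP" or has_np_below(child))
        for child in node[1]
    )


def base_nps(node):
    """Word strings of every NP that contains no other NP."""
    if isinstance(node, str):
        return []
    if node[0] == "NP" and not has_np_below(node):
        return [" ".join(leaves(node))]
    return [np for child in node[1] for np in base_nps(child)]


# First parse from the slide.
PARSE_1 = (
    "(S1 (S (NP (EX There)) (VP (AUX are) (NP (NP (PDT quite) (DT a) (JJ large) "
    "(NN number)) (PP (IN of) (NP (NP (NNS Malayalees)) (VP (VBG living) "
    "(ADVP (RB here))))))) (. .)))"
)

print(base_nps(parse_tree(PARSE_1)))
```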

NP based Bootstrapping

Aligned Corpus
;;Sentence id = 1
SL:There are quite a large number of Malayalees living here .
TL:malayAlama logoM kI bahuwa badZI saMKyA hE .
Alignment:((1,8),(2,7),(11,8),(3,1),(4,8),(5,5),(6,6),(7,3),(8,1),(9,1),(10,1))
;;Sentence id = 2
SL:is found in the west coast of Great Nicobar called the Magapod Island .
TL:xvIpa ke paScimI wata para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .
Alignment:((1,12),(2,10),(11,2),(12,7),(13,1),(14,13),(3,9),(4,2),(5,4),(6,4),(7,2),(8,7),(9,1),(10,11))

Extract Source NPs
NP:1:There :.
NP:1:quite a large number :.
NP:1:Malayalees :
NP:2:the west coast :
NP:2:Great Nicobar :
NP:2:the Magapod Island :
NP:6:Plotemy :
NP:6:them :
NP:6:Nagadip ' :
NP:6:a Hindu name :
NP:6:naked island :

Extract NP translation Pairs
NP:1:There :.
NP:1:quite a large number :malayAlama logoM kI bahuwa badZI saMKyA hE .
NP:1:Malayalees :malayAlama
NP:2:the west coast :ke paScimI wata
NP:2:Great Nicobar :xvIpa ke paScimI wata para sWiwa mEgApOda
NP:2:the Magapod Island :xvIpa ke paScimI wata para sWiwa mEgApOda
NP:6:Plotemy :plotemI
NP:6:them :inheM
NP:6:Nagadip ' :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke
NP:6:a Hindu name :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .
NP:6:naked island :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna
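A minimal sketch of harvesting an NP translation from a Viterbi alignment. It assumes the translation is the contiguous target span from the smallest to the largest target position linked to any word of the source NP; the talk does not spell out the rule, but this assumption reproduces the pairs listed for sentence 1 from its published alignment. `harvest_translation` is a hypothetical name.

```python
def harvest_translation(src_span, alignment, tgt_tokens):
    """Return the target span covered by the links of a source NP.

    src_span: (start, end), 1-based inclusive source word positions.
    alignment: iterable of (source_position, target_position), 1-based.
    Returns None when no word of the NP is aligned.
    """
    start, end = src_span
    tgt_positions = [t for s, t in alignment if start <= s <= end]
    if not tgt_positions:
        return None
    return " ".join(tgt_tokens[min(tgt_positions) - 1:max(tgt_positions)])


# Sentence 1 and its alignment, copied from the aligned corpus slide.
TGT_1 = "malayAlama logoM kI bahuwa badZI saMKyA hE .".split()
ALIGNMENT_1 = [(1, 8), (2, 7), (11, 8), (3, 1), (4, 8), (5, 5),
               (6, 6), (7, 3), (8, 1), (9, 1), (10, 1)]

print(harvest_translation((1, 1), ALIGNMENT_1, TGT_1))   # There
print(harvest_translation((3, 6), ALIGNMENT_1, TGT_1))   # quite a large number
print(harvest_translation((8, 8), ALIGNMENT_1, TGT_1))   # Malayalees
```

A single stray link (here, "a" linked to the final ".") stretches the span to the whole sentence, which is why the harvested table needs pruning.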

Feature Extraction for NP Pairs
Features
Source length in words
Target length in words
Absolute length difference
Frequency of the source base NP
Frequency of the target base NP
Frequency of the S-T pair
Source-to-target probability
Target-to-source probability

Calculate Features of NP pairs
NP:1:There:.:1:1:0:2449:2318:258: :
NP:1:quite a large number:malayAlama logoM kI bahuwa badZI saMKyA hE .:4:8:4:3:1:1: e-13:0
NP:1:Malayalees:malayAlama:1:1:0:3:2:1: :
NP:2:the west coast:ke paScimI wata:3:3:0:15:2:1: e-06: e-11
NP:2:Great Nicobar:xvIpa ke paScimI wata para sWiwa mEgApOda:2:7:5:6:2:1: e-05:0
NP:2:the Magapod Island:xvIpa ke paScimI wata para sWiwa mEgApOda:3:7:4:1:2:1: e-06:0
NP:6:Plotemy:plotemI:1:1:0:1:1:1:1:1
NP:6:them:inheM:1:1:0:2153:27:16: :0
NP:6:Nagadip ':plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke:2:15:13:1:1:1: e-05:0
NP:6:a Hindu name:plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .:3:20:17:1:1:1: e-12:0
NP:6:naked island:plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna:2:13:11:1:1:1: e-06:0
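A minimal sketch of computing the length and frequency features for harvested NP pairs. The two probability features are left out because the talk does not say how they are computed. The function name, the dictionary keys, and the small input list are hypothetical; only the first pair's lengths are taken from the records above.

```python
from collections import Counter


def np_pair_features(pairs):
    """Length and frequency features for (source_np, target_np) occurrences.

    pairs holds one entry per extracted occurrence, so repeated pairs
    raise the frequency counts.
    """
    src_freq = Counter(src for src, _ in pairs)
    tgt_freq = Counter(tgt for _, tgt in pairs)
    pair_freq = Counter(pairs)
    features = {}
    for src, tgt in pair_freq:
        src_len, tgt_len = len(src.split()), len(tgt.split())
        features[(src, tgt)] = {
            "src_len": src_len,
            "tgt_len": tgt_len,
            "len_diff": abs(src_len - tgt_len),
            "src_freq": src_freq[src],
            "tgt_freq": tgt_freq[tgt],
            "pair_freq": pair_freq[(src, tgt)],
        }
    return features


# Illustrative occurrences; the repeated pair is invented to show counting.
EXAMPLE_PAIRS = [
    ("quite a large number", "malayAlama logoM kI bahuwa badZI saMKyA hE ."),
    ("Malayalees", "malayAlama"),
    ("them", "inheM"),
    ("them", "inheM"),
]

for pair, feats in np_pair_features(EXAMPLE_PAIRS).items():
    print(pair, feats)
```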

Prune based on manual thresholds
NP:1:There:.
NP:1:Malayalees:malayAlama
NP:2:the west coast:ke paScimI wata
NP:6:Plotemy:plotemI
NP:6:them:inheM
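A minimal sketch of threshold pruning. The talk only says the thresholds were set manually, so the single threshold used here (maximum absolute length difference of 1) is an assumption, picked because it happens to reproduce the pruned list above from the length differences in the feature records. `prune_np_table` and `MAX_LEN_DIFF` are hypothetical names.

```python
MAX_LEN_DIFF = 1  # assumed value, not taken from the talk


def prune_np_table(candidates, max_len_diff=MAX_LEN_DIFF):
    """Keep (source_np, len_diff) candidates whose length difference is small."""
    return [src for src, len_diff in candidates if len_diff <= max_len_diff]


# Source NPs with the absolute length difference from the feature records.
CANDIDATES = [
    ("There", 0),
    ("quite a large number", 4),
    ("Malayalees", 0),
    ("the west coast", 0),
    ("Great Nicobar", 5),
    ("the Magapod Island", 4),
    ("Plotemy", 0),
    ("them", 0),
    ("Nagadip '", 13),
    ("a Hindu name", 17),
    ("naked island", 11),
]

print(prune_np_table(CANDIDATES))
```

The actual pruning presumably combined several of the features (frequencies and probabilities as well as lengths).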

Word align ‘S’ and ‘T’ using IBM-4
Extract NPs on the source side using the parse
Extract their translations by harvesting the Viterbi alignment
Calculate features for the NP pairs
Prune based on thresholds
Perform constrained word alignment and lexicon extraction using the NP table (Folding)

Constrained Alignment: NP Folding
English, with the folded NPs in parentheses:
NP are quite a large number of NP living here . (There; Malayalees)
is found in NP of Great Nicobar called the Magapod Island . (the west coast)
NP calls NP ' Nagadip ' , a Hindu name for naked island . (Plotemy; them)
Hindi, with the folded NPs in parentheses:
NP logoM kI bahuwa badZI saMKyA NP (malayAlama; .)
xvIpa NP para sWiwa mEgApOda xvIpa meM pAyA jAwA hE . (ke paScimI wata)
NP NP nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE . (plotemI; them)
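A minimal sketch of the folding step: every occurrence of a phrase from the NP table is collapsed into a single placeholder token before re-alignment. Matching longest phrases first and using one shared `NP` placeholder are assumptions; the talk does not say how overlapping phrases or paired placeholders are handled. `fold_nps` is a hypothetical name.

```python
def fold_nps(tokens, phrases, placeholder="NP"):
    """Replace each occurrence of a known phrase with one placeholder token."""
    # Try longer phrases first so a long NP is not split by a shorter one.
    phrase_tokens = sorted((p.split() for p in phrases), key=len, reverse=True)
    folded = []
    i = 0
    while i < len(tokens):
        for phrase in phrase_tokens:
            if phrase and tokens[i:i + len(phrase)] == phrase:
                folded.append(placeholder)
                i += len(phrase)
                break
        else:
            # No phrase starts here; keep the word as it is.
            folded.append(tokens[i])
            i += 1
    return folded


english = "There are quite a large number of Malayalees living here .".split()
hindi = "malayAlama logoM kI bahuwa badZI saMKyA hE .".split()

print(" ".join(fold_nps(english, ["There", "Malayalees"])))
print(" ".join(fold_nps(hindi, ["malayAlama", "."])))
```

After folding, the NP pairs from the table fix the NP-to-NP links, and only the remaining words need to be aligned.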

Experiments
English-Hindi (resource constrained)
English-Hindi

Word Alignment Experiments
Training: 5000 sentences
Testing: 200 sentences
Human-extracted NP table: 21,736

Word Alignment Results (5k corpus)
Experiments with 5k training corpus and 200 test sentences
Experiment         Prec   Rec    F-measure  AER
Model 4            40.20  30.96  34.98      65.12
Model 4 + NPs (1)  41.33  32.12  36.15      63.85
Model 4 + NPs (2)  41.23  32.26  36.02      63.88
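A minimal sketch of the four scores in this table, using the standard definitions of precision, recall, F-measure and alignment error rate (AER) over sure and possible gold links. Whether the talk's evaluation distinguished sure from possible links is not stated, so the function treats all gold links as sure by default; the link sets in the example are invented.

```python
def alignment_scores(predicted, sure, possible=None):
    """Precision, recall, F-measure and AER for one set of alignment links.

    Links are (source_position, target_position) pairs. If no separate
    possible set is given, the sure links double as the possible links.
    """
    predicted, sure = set(predicted), set(sure)
    possible = sure if possible is None else set(possible) | sure
    if not predicted or not sure:
        raise ValueError("predicted and sure link sets must be non-empty")
    hits_sure = len(predicted & sure)
    hits_possible = len(predicted & possible)
    precision = hits_possible / len(predicted)
    recall = hits_sure / len(sure)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    aer = 1.0 - (hits_sure + hits_possible) / (len(predicted) + len(sure))
    return precision, recall, f_measure, aer


# Invented example: 3 predicted links, 4 gold links, 2 in common.
PREDICTED = {(1, 1), (2, 2), (3, 4)}
GOLD = {(1, 1), (2, 2), (3, 3), (4, 4)}

print(alignment_scores(PREDICTED, GOLD))
```

When every gold link is sure, AER reduces to one minus the F-measure.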

NP Projection Results (5k)
Evaluation: NP_Table harvested from 5k test bed corpus
Iteration  Identified  Accuracy (on gold std)  After pruning  Accuracy after pruning
NPs (1)    21693       33% (7200)              4708           61% (2619)
NPs (2)    21690       33.12% (7200)           4819           60.8% (2601)

Word Alignment Experiments
Training: 50K Eng-Hin corpus
Testing: 200 Eng-Hin aligned sentences
Human-extracted NP table: 21,736

Word Alignment Results (55k corpus)
Experiments with 55k training corpus and 200 test sentences
Experiment         Prec   Rec    F-measure  AER
Model 4            48.16  45.91  46.75      53.76
Model 4 + NPs (1)  48.19  46.50  46.83      53.17
Model 4 + NPs (2)  48.14  46.49  46.82      53.18

NP Projection Results (55k)
Evaluation: NP_Table created by human alignment
Iteration  Identified  Precision (on gold std)  After pruning  Precision after pruning
NPs (1)    306366      38% (8906)               88352          58.2% (4236)
NPs (2)    306294      37% (8124)               93165          59.16%

From here..
Improvements
Machine Translation
Reliable NP Projection
Hierarchical Word Alignment
Machine Translation Rule Learning
Refined probabilistic translation lexicon
Clean, linguistically motivated phrase table with probabilities

Questions ?

Thanks !
