Accurate Parallel Fragment Extraction from Quasi-Comparable Corpora using Alignment Model and Translation Lexicon
Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
Graduate School of Informatics, Kyoto University
IJCNLP2013 (2013/10/17)

Outline
Background
Related Work
Proposed Method
Experiments
Conclusion

Bilingual Corpora [Fung+ 2004]
Type | Definition | Example
Parallel | Sentence-aligned bilingual corpora | Europarl
Noisy Parallel | Bilingual translations of documents | Patent family
Comparable | Topic-aligned bilingual documents | Wikipedia
Quasi-Comparable | Very-non-parallel bilingual documents | this study
Parallel corpora are scarce.
Parallel sentences can be extracted from noisy parallel and comparable corpora.
Quasi-comparable corpora are more widely available, but they contain few parallel sentences.

Parallel Fragments
In quasi-comparable corpora, comparable sentences may contain parallel fragments.
Parallel fragments are also helpful for SMT.
We aim to accurately extract parallel fragments from comparable sentences.
Zh: 应用 / 铅 / 离子 / 选择 / 电极 / 电位 / 滴定 / 法 / 测定 / 甘草 / 及 / 其 / 制品 / 中 / 的 / 甘草 / 酸
(Applying lead ion selective electrode potentiometric titration method to determine licorice and its products 's glycyrrhizic acid)
Ja: ＜ / 原 / 報 / ＞ / 鉛 / イオン / 選択 / 性 / 電極を / 用いる / 混合 / 試料 / 中 / の /…/ と / 電位 / 差 / 滴定 / 法 / の / 比較
(lead ion selective electrode used mixed sample 's … and potentiometric titration method 's comparison)

Parallel Sub-sentential Fragment Extraction [Munteanu+ 2006]
1. Extract a translation lexicon from a parallel corpus
2. Apply a lexicon filter to comparable sentences in two directions independently
– Assign initial scores according to the lexicon
– Smooth the scores to gain new knowledge that does not exist in the lexicon
3. Extract sub-sentential (not exactly parallel) fragments

Lexicon Filter on Ja-to-Zh Direction
(Figure: the example Zh-Ja sentence pair from the Parallel Fragments slide, with the lexicon filter applied in the Ja-to-Zh direction.)

Lexicon Filter on Zh-to-Ja Direction
(Figure: the example Zh-Ja sentence pair from the Parallel Fragments slide, with the lexicon filter applied in the Zh-to-Ja direction.)

System Overview
(Figure: pipeline diagram with steps (1) to (5). Labels: source corpora, target corpora, SMT, translated sentences, IR: top N results, classifier, comparable sentences, alignment, parallel fragment candidates, lexicon filter, parallel fragments, parallel corpus.)
Use an alignment model to locate the source and target fragment candidates simultaneously
Use a more accurate lexicon filter

Parallel Fragment Candidate Detection by Alignment
Candidates are the longest monotonic, non-NULL aligned fragments of more than 3 tokens.
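The slide gives only the criteria, not a procedure. The following is a minimal sketch of one possible reading, assuming each source token has at most one alignment link and that "monotonic" means target positions strictly increase. The function name `candidate_spans` and the input representation are hypothetical, not taken from the authors' system.

```python
from typing import List, Optional, Tuple

def candidate_spans(src_to_tgt: List[Optional[int]],
                    min_len: int = 4) -> List[Tuple[int, int]]:
    """Return (start, end) source spans, end exclusive.

    src_to_tgt[i] is the target position aligned to source token i,
    or None for a NULL alignment (assumed one-to-one representation).
    A span is a maximal run in which every token is aligned and the
    target positions strictly increase; runs shorter than min_len
    ("more than 3 tokens" means at least 4) are dropped.
    """
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, tgt in enumerate(src_to_tgt):
        if tgt is None:
            # A NULL alignment ends the current run.
            if start is not None and i - start >= min_len:
                spans.append((start, i))
            start = None
        elif start is None:
            start = i
        elif tgt <= src_to_tgt[i - 1]:
            # Monotonicity is broken: close the run and restart here.
            if i - start >= min_len:
                spans.append((start, i))
            start = i
    if start is not None and len(src_to_tgt) - start >= min_len:
        spans.append((start, len(src_to_tgt)))
    return spans

alignment = [None, 2, 3, 4, 5, None, 1, 0, 7, 8, 9, 10]
spans = candidate_spans(alignment)
```

With this toy alignment the first run is cut by a NULL link and the second by a reordering, which are the two conditions named on the slide.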

Lexicon Filter − Assign Initial Scores
Assign scores in two directions to the aligned word pairs in the candidates according to the translation lexicon.
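As a rough illustration of two-direction scoring, the sketch below looks up each aligned pair in two toy lexicons. The lexicon layout, the function name `initial_scores`, the score values, and the fallback of -1.0 for pairs missing from the lexicon are all assumptions made for the example; the slides do not specify them.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon layout: (word, translation) -> association score.
Lexicon = Dict[Tuple[str, str], float]

def initial_scores(aligned_pairs: List[Tuple[str, str]],
                   lex_zh2ja: Lexicon,
                   lex_ja2zh: Lexicon,
                   missing: float = -1.0) -> Tuple[List[float], List[float]]:
    """Score every aligned (zh, ja) pair once per direction.

    Pairs absent from a lexicon receive `missing` (an assumed
    negative default, not a value given in the slides).
    """
    zh2ja = [lex_zh2ja.get((zh, ja), missing) for zh, ja in aligned_pairs]
    ja2zh = [lex_ja2zh.get((ja, zh), missing) for zh, ja in aligned_pairs]
    return zh2ja, ja2zh

# Tokens come from the example sentence pair; the scores are invented.
pairs = [("铅", "鉛"), ("离子", "イオン"), ("选择", "選択"), ("电极", "電極を")]
lex_zh2ja = {("铅", "鉛"): 0.9, ("离子", "イオン"): 0.8, ("选择", "選択"): 0.7}
lex_ja2zh = {("鉛", "铅"): 0.9, ("イオン", "离子"): 0.6}
zh2ja, ja2zh = initial_scores(pairs, lex_zh2ja, lex_ja2zh)
```

Keeping the two directions as separate lists makes it easy to require agreement between them in a later step.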

Lexicon Filter − Score Smoothing
Only smooth a word with a negative score when both the left and right words around it have positive scores.
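The condition above can be sketched as follows. The slide states when a word is smoothed but not what value it receives, so replacing the score with the mean of its two neighbours is an assumption, as is the function name `smooth_scores`.

```python
from typing import List

def smooth_scores(scores: List[float]) -> List[float]:
    """Smooth a negative score only if both neighbours are positive.

    The replacement value (mean of the two neighbours) is an assumed
    choice. Neighbours are read from the original list, so one
    smoothed value never triggers the smoothing of another.
    """
    smoothed = list(scores)
    for i in range(1, len(scores) - 1):
        if scores[i] < 0 and scores[i - 1] > 0 and scores[i + 1] > 0:
            smoothed[i] = (scores[i - 1] + scores[i + 1]) / 2
    return smoothed

result = smooth_scores([0.5, -1.0, 1.5, -1.0, -1.0, 0.25])
```

In the toy input, only the isolated negative at position 1 is smoothed; the two adjacent negatives at positions 3 and 4 each have a negative neighbour and stay unchanged.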

Fragment Extraction
Extract fragments of more than 3 tokens with continuous positive scores in both directions.
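A minimal sketch of this extraction rule, assuming the two directions' scores are stored as equal-length lists indexed by position in the candidate. The function name `extract_fragments` and this data layout are hypothetical.

```python
from typing import List, Optional, Tuple

def extract_fragments(fwd: List[float], bwd: List[float],
                      min_len: int = 4) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, where both directions
    are positive at every position and the span has at least min_len
    tokens ("more than 3 tokens" means at least 4)."""
    if len(fwd) != len(bwd):
        raise ValueError("score lists must have the same length")
    fragments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, (f, b) in enumerate(zip(fwd, bwd)):
        if f > 0 and b > 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                fragments.append((start, i))
            start = None
    if start is not None and len(fwd) - start >= min_len:
        fragments.append((start, len(fwd)))
    return fragments

fwd = [1, 1, 1, 1, -1, 1, 1, 1, 1, 1]
bwd = [1, 1, 1, 1, 1, 1, -1, 1, 1, 1]
fragments = extract_fragments(fwd, bwd)
```

Here the run at positions 7 to 9 is positive in both directions but has only 3 tokens, so it is discarded.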

Outline
Background
Related Work
Proposed Method
Experiments
– Parallel Fragment Extraction
– Translation
Conclusion

Experimental settings (Parallel Fragment Extraction 1/2)
Parallel corpus: Zh-Ja abstract corpus (680k sentences, scientific domain)
Quasi-Comparable Corpora
– Chinese corpora: CNKI (90k articles, 420k sentences, chemistry domain)
– Japanese corpora: CiNii (880k articles, 5M sentences, scientific domain)
Comparable sentences: 30k chemistry domain sentences were extracted

Experimental settings (Parallel Fragment Extraction 2/2)
Alignment: GIZA++ with symmetrization heuristics
– Only: only use the extracted comparable sentences
– External: together with 11k chemistry domain data in the parallel corpus
Translation lexicon
– IBM Model 1 [Brown+ 1993]
– Log-Likelihood-Ratio (LLR) [Munteanu+ 2006]
– Sub-corpora sampling lexicon (SampLEX) [Vulic+ 2012]
Compare with [Munteanu+ 2006]
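For readers unfamiliar with the LLR lexicon option, the sketch below computes the generic log-likelihood-ratio (G²) statistic over a 2x2 co-occurrence table for a word pair. This is the textbook statistic only; how the cited work turns it into positive and negative lexicon scores is not shown in the slides and is not reproduced here. The function name `llr` and the count layout are assumptions.

```python
import math

def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: sentence pairs containing both words
    k12: pairs with the source word only
    k21: pairs with the target word only
    k22: pairs with neither word
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            k = cells[i][j]
            if k > 0:  # a zero cell contributes nothing to the sum
                total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total

independent = llr(10, 10, 10, 10)   # no association between the words
associated = llr(20, 0, 0, 20)      # the words always co-occur
```

A table in which the two words occur independently gives a value of zero, and the value grows as the co-occurrence departs from independence.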

Results
Method | # fragments | Avg size (Zh/Ja) | Accuracy
[Munteanu+ 2006] | 28.4k | 20.36/21.39 | (1%)
Only (IBM Model 1) | 18.9k | 4.03/4.14 | 80%
Only (LLR) | 18.3k | 4.00/4.14 | 89%
Only (SampLEX) | 18.4k | 3.96/4.05 | 87%
External (IBM Model 1) | 28.7k | 4.18/4.33 | 81%
External (LLR) | 26.9k | 4.17/4.33 | 85%
External (SampLEX) | 28.0k | 4.11/4.23 | 82%
※ Accuracy: manual evaluation of 100 fragments, based on exact match

Experimental Settings (Translation)
Baseline: Zh-Ja paper abstract corpus (680k with 11k chemistry domain sentences)
Tuning: 368 sentences of chemistry domain
Testing: 367 sentences of chemistry domain
Decoder: Moses
Language model: 5-gram language model trained on the Ja side of the parallel corpus using SRILM
Compare MT performance by appending the extracted fragments to the baseline training data

BLEU-4 for Different Systems
(Figure: BLEU-4 scores of the compared systems; the chart is not preserved in this transcript.)
※ “*” denotes that the result is significantly better than “Baseline” at p < 0.05
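The slides do not say how the scores were computed. For reference, the standard definition of BLEU with n-grams up to length 4 and uniform weights is given below, where $p_n$ is the modified n-gram precision, $c$ the total length of the system output, and $r$ the reference length. Whether the reported scores follow exactly this form is an assumption.

```latex
\mathrm{BLEU\text{-}4} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{4} \frac{1}{4} \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
\exp\!\left(1 - \dfrac{r}{c}\right) & \text{if } c \le r
\end{cases}
```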

Conclusion
We proposed an accurate parallel fragment extraction system using an alignment model and a translation lexicon
Future Work
– A method to deal with ordering
– Parallel corpus independent method
– Try other language pairs and domains

Thank you for your attention!

Examples of Extracted Fragment Pairs
ID | Zh Fragment | Ja Fragment
1 | 直接甲醇燃料电池 | 直接メタノール燃料電池
2 | Ｘ射线光电子能谱（ＸＰＳ） | Ｘ線光電子分光法（ＸＰＳ）
3 | （ＯＨ）２４（Ｈ２Ｏ）１２］
4 | 的原生质体融合 | のプロトプラスト融合
5 | 分子动力学（ＭＤ）模拟了 | 分子動力学（ＭＤ）シミュレーションを
6 | 扫描电子显微镜（ＳＥＭ）、透射电子显微镜（ＴＥＭ） | 型電子顕微鏡（ＳＥＭ），透過型電子顕微鏡（ＴＥＭ）
7 | 证明了本算法的 | から本アルゴリズムの
8 | Ｘ射线粉末衍射 | Ｘ線回折分析
※ Noise is written in red font
Most noise is due to the noisy translation lexicon (Examples 5-7)
Score smoothing also produces some noise (Example 8)
