
Slide 1: Pre-processing of Bilingual Corpora for Mandarin-English EBMT
MT Summit VIII, 2001
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Ying Zhang, Ralf Brown, Robert Frederking, Alon Lavie (http://www.cs.cmu.edu/~joy)

Slide 2: Background
The Example-Based Machine Translation system, EBMT (Brown 96; Brown 99):
–A shallow-match system
–Extracts a statistical dictionary from the bitext
–Word-level alignment
–The dictionary and glossary are used to fill the gaps
–Uses a target-language trigram model to generate the "best" translation (Hogan & Frederking 1998); a sketch of this selection step follows.
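A minimal sketch of how a target-language trigram model can rank candidate translations, as in the "best translation" step above. The probability table, padding symbols, and candidate format are illustrative assumptions, not the system's actual implementation.

```python
import math

def trigram_logprob(sentence, trigram_prob, floor=1e-6):
    """Sum log P(w3 | w1, w2) over the sentence, padded with sentence markers."""
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    score = 0.0
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        # Unseen trigrams get a small floor probability (a stand-in for real smoothing).
        score += math.log(trigram_prob.get((w1, w2, w3), floor))
    return score

def best_translation(candidates, trigram_prob):
    """Return the candidate translation the trigram model scores highest."""
    return max(candidates, key=lambda c: trigram_logprob(c, trigram_prob))
```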

Slide 3: Data Used
Hong Kong Legal Code:
–Chinese: 23 MB
–English: 37.8 MB
Hong Kong News (after cleaning): 7,622 documents
–Dev-test: 1,331,915 bytes, 4,992 sentence pairs
–Final-test: 1,329,764 bytes, 4,866 sentence pairs
–Training: 25,720,755 bytes, 95,752 sentence pairs
Corpus cleaning:
–Converted from Big5 to GB
–Divided into a training set (90%), a dev-test set (5%), and a final-test set (5%); a sketch of such a split follows
–Sentence-level alignment using the Church & Gale method (by ISI)
–Cleaned
–Converted two-byte Chinese characters to their cognates
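An illustrative sketch of the 90% / 5% / 5% split described above, operating on already-aligned sentence pairs. The shuffling, seed, and pair representation are assumptions for the example; the paper does not describe how the split was drawn.

```python
import random

def split_corpus(pairs, train=0.90, dev=0.05, seed=0):
    """Shuffle aligned sentence pairs and split into training / dev-test / final-test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * train)
    n_dev = int(len(pairs) * dev)
    return (pairs[:n_train],                  # training set (~90%)
            pairs[n_train:n_train + n_dev],   # dev-test set (~5%)
            pairs[n_train + n_dev:])          # final-test set (~5%)
```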

Slide 4: Chinese Segmentation
Our EBMT system is word-based.
Written Chinese has no spaces between words.

Slide 5: Chinese Segmentation (2)
Why not just use characters?
–The mismatch between Chinese and English would be worse.

Slide 6: Chinese Segmentation (3)
The segmentation problem:
–Given a sentence with no spaces, break it into words.
Segmentation approaches:
–Statistical approaches
–Dictionary-based approaches
–Combinations of a dictionary and linguistic knowledge
As a baseline we used forward/backward maximum match with LDC's frequency dictionary (a sketch follows):
–It suffered from the dictionary's incomplete coverage of the corpus.
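A minimal forward-maximum-match segmenter of the kind used as the baseline, assuming a word set derived from a frequency dictionary such as LDC's; the dictionary contents and the maximum word length are placeholders.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedily take the longest dictionary word at each position;
    fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + j]
            if j == 1 or candidate in dictionary:
                words.append(candidate)
                i += j
                break
    return words
```

The backward variant mirrors this, scanning from the end of the sentence; comparing the two outputs is a common way to flag ambiguous spans. Any word absent from the dictionary degrades into single characters, which is the coverage problem noted above.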

Slide 7: Goal
Extract Chinese terms from the corpus and add them to the frequency dictionary used for segmentation.
Results of pre-processing:
–A segmented/bracketed bilingual corpus
–A statistical dictionary

Slide 8: Definitions
Definitions of Chinese words are vague; the definitions used in this paper are:
–Chinese characters: the smallest unit in written Chinese is a character, represented by 2 bytes in GB-2312 code.
–Chinese words: a word in natural language is the smallest reusable unit that can be used in isolation.
–Chinese phrases: we define a Chinese phrase as a sequence of Chinese words; each word in the phrase keeps the same meaning it has when it appears by itself.
–Terms: a term is a meaningful constituent; it can be either a word or a phrase.

Slide 9: Tokenization Techniques (1)
A collocation measure for two adjacent terms w1 and w2, where VMI(w1:w2) is a variant of average mutual information. (The formula appears only as an image on the slide; a hedged reconstruction follows.)
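Since the slide's formula image did not survive the transcript, here is the standard average mutual information between two adjacent terms, of which the paper's VMI is described as a variant; the exact variant used in the paper may differ.

```latex
% Standard average mutual information over occurrence/non-occurrence of the
% adjacent terms w_1 and w_2; the paper's VMI(w_1:w_2) is described only as a
% variant of this, so its exact form may differ.
\mathrm{AMI}(w_1; w_2) =
  \sum_{x \in \{w_1,\, \neg w_1\}} \; \sum_{y \in \{w_2,\, \neg w_2\}}
  P(x, y)\, \log \frac{P(x, y)}{P(x)\, P(y)}
```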

Slide 10: Tokenization Techniques (2)
Dual thresholds for segmenting (illustrated as a figure on the slide; a hedged sketch of the idea follows).
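A hedged sketch of one plausible dual-threshold decision rule on adjacent terms: join the pair when the collocation score is above the upper threshold, keep the boundary when it is below the lower threshold, and leave the case undecided in between. The rule and threshold handling here are assumptions; the paper's exact procedure may differ.

```python
def dual_threshold_decision(score, t_high, t_low):
    """Decide what to do with a pair of adjacent terms given its collocation score."""
    if score >= t_high:
        return "join"        # treat the two terms as one token
    if score <= t_low:
        return "split"       # keep the word boundary
    return "undecided"       # needs other evidence, e.g. the dictionary feedback
```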

Slide 11: Tokenization Procedure
Tokenizing at the character level cannot produce a highly accurate segmentation:
–The cross-boundary problem.
Instead, tokenize on the corpus already segmented with LDC's segmenter.

Slide 12: Feedback from the Statistical Dictionary
Monolingual tokenization may lead to over-segmentation.
The statistical dictionary was built from the segmented corpus.
Use the results from the statistical dictionary to adjust the segmentation (a sketch follows).
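A hedged sketch of the dictionary-feedback idea: between candidate segmentations of the same Chinese span, prefer the one whose tokens the bilingual statistical dictionary translates most confidently. The scoring function and the dictionary format (token to confidence score) are illustrative assumptions, not the paper's actual adjustment procedure.

```python
def dict_score(tokens, stat_dict):
    """Average translation confidence of the tokens (0.0 for unknown tokens)."""
    return sum(stat_dict.get(t, 0.0) for t in tokens) / max(len(tokens), 1)

def adjust_segmentation(candidates, stat_dict):
    """Pick the candidate segmentation with the best dictionary support."""
    return max(candidates, key=lambda tokens: dict_score(tokens, stat_dict))
```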

Slide 13: Flowchart of Pre-processing
(Shown as a figure on the slide.)

Slide 14: Results
With proper settings of the two thresholds:
–The average length of Chinese terms increased by 60% (10% for English).
–The statistical dictionary gained a 30% increase in coverage (at the same precision).
–A small boost in overall EBMT performance, on both automatic evaluation metrics and human evaluations.

Slide 15: Ongoing and Future Work
–Adding word-clustering and grammar-induction features
–Improving the sub-sentential alignment model by using the bilingual collocation information
–Changing the thresholds dynamically according to the current segmentation

Slide 16: References (partial)
–Ralf D. Brown. 1996. Example-Based Machine Translation in the PanGloss System. In Proceedings of the Sixteenth International Conference on Computational Linguistics, pages 169-174, Copenhagen, Denmark. http://www.cs.cmu.edu/~ralf/papers.html
–Ralf D. Brown. 1997. Automated Dictionary Extraction for "Knowledge-Free" Example-Based Translation. In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation, pages 111-118, Santa Fe, July 23-25, 1997.
–Ralf D. Brown. 1999. Adding Linguistic Knowledge to a Lexical Example-Based Translation System. In Proceedings of the Eighth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-99), pages 22-32, Chester, England, August. http://www.cs.cmu.edu/~ralf/papers.html
–Ralf D. Brown. 2000. Automated Generalization of Translation Examples. In Proceedings of the Eighteenth International Conference on Computational Linguistics (COLING-2000), pages 125-131.
–Tom Emerson. "Segmentation of Chinese Text". In #38, Volume 12, Issue 2 of MultiLingual Computing & Technology, published by MultiLingual Computing, Inc.
–Christopher Hogan and Robert E. Frederking. 1998. An Evaluation of the Multi-engine MT Architecture. In Machine Translation and the Information Soup: Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA '98), volume 1529 of Lecture Notes in Artificial Intelligence, pages 113-123. Springer-Verlag, Berlin, October.

Slide 17: The End
Questions and comments?

