MT Summit VIII, 2001
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Pre-processing of Bilingual Corpora for Mandarin-English EBMT
Ying Zhang, Ralf Brown, Robert Frederking, Alon Lavie

Background
The Example-Based Machine Translation system, EBMT (Brown 96; Brown 99):
– A shallow match system
– Extracts a statistical dictionary from the bitext
– Word-level alignment
– Dictionary and glossary are used to fill the gaps
– Uses a target-language trigram model to generate the “best” translation (Hogan & Frederking 1998)

Data Used
Hong Kong Legal Code:
– Chinese: 23 MB
– English: 37.8 MB
Hong Kong News (after cleaning): 7,622 documents
– Dev-test: 1,331,915 bytes, 4,992 sentence pairs
– Final-test: 1,329,764 bytes, 4,866 sentence pairs
– Training: 25,720,755 bytes, 95,752 sentence pairs
Corpus Cleaning
– Converted from Big5 to GB
– Divided into training set (90%), dev-test set (5%) and test set (5%)
– Sentence-level alignment, using the Church & Gale method (by ISI)
– Cleaned
– Converted two-byte Chinese characters to their cognates
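The 90% / 5% / 5% division can be sketched as follows. This is only an illustration: the slides do not say how documents were assigned to each set, so the random shuffle, the `seed` parameter and the function name `split_corpus` are all assumptions.

```python
import random


def split_corpus(doc_ids, train=0.90, dev=0.05, seed=0):
    """Split document ids into (training, dev-test, final-test) lists.

    Assumption: a seeded random shuffle at the document level; the
    slides give only the proportions, not the assignment procedure.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train)
    n_dev = round(len(ids) * dev)
    # Whatever is left after training and dev-test becomes the final test set.
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]
```

Applied to the 7,622 Hong Kong News documents, such a split would give 6,860 training, 381 dev-test and 381 final-test documents.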

Chinese Segmentation
Our EBMT system is word-based.
Written Chinese has no spaces between words.

Chinese Segmentation (2)
Why not just use characters?
– The mismatch between Chinese and English would be worse

Chinese Segmentation (3)
Segmentation problem:
– Given a sentence with no spaces, break it into words.
Segmentation approaches:
– Statistical approach
– Dictionary-based approach
– Combination of dictionary and linguistic knowledge
We used forward/backward maximum match with LDC’s frequency dictionary as the baseline.
– It suffered from the dictionary’s incomplete coverage of the corpus
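A minimal sketch of forward/backward maximum matching, to make the baseline concrete. It is not the authors' implementation: the plain-set `lexicon` (word frequencies are ignored), the `max_len` limit, and the rule for choosing between the two directions (fewest words, then fewest single-character words) are all assumptions made for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, taking the longest lexicon entry at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Scan right to left, taking the longest lexicon entry ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.insert(0, candidate)
                j -= size
                break
    return words


def segment(text, lexicon, max_len=4):
    """Run both directions and keep one result.

    Assumed tie-break: fewer words first, then fewer single-character
    words; the forward result wins if both criteria are equal.
    """
    candidates = [
        forward_max_match(text, lexicon, max_len),
        backward_max_match(text, lexicon, max_len),
    ]
    return min(candidates, key=lambda ws: (len(ws), sum(len(w) == 1 for w in ws)))
```

With the toy lexicon {"研究", "研究生", "生命", "起源"}, the forward pass splits "研究生命起源" into 研究生 / 命 / 起源 while the backward pass gives 研究 / 生命 / 起源. The stranded single character 命 shows how a greedy match can go wrong, and any term missing from the lexicon falls back to single characters in the same way, which is the coverage problem noted above.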

Goal
Extract Chinese terms from the corpus and add them to the frequency dictionary for segmentation.
Result of pre-processing:
– A segmented/bracketed bilingual corpus
– A statistical dictionary

Definitions
Vague definitions of Chinese words
Definitions used in this paper:
– Chinese characters: The smallest unit in written Chinese is a character, which is represented by 2 bytes in GB-2312 code.
– Chinese words: A word in natural language is the smallest reusable unit which can be used in isolation.
– Chinese phrases: We define a Chinese phrase as a sequence of Chinese words in which each word has the same meaning as when it appears by itself.
– Terms: A term is a meaningful constituent. It can be either a word or a phrase.

Tokenization Techniques (1)
Collocation measure for two adjacent terms, w1 and w2, where VMI(w1:w2) is a variant of average mutual information.
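The formula for the VMI variant is not preserved in this transcript. For orientation only, the sketch below computes the standard average mutual information of an adjacent pair, i.e. the sum of P(x,y) · log2(P(x,y) / (P(x)·P(y))) over the four cells "first term is / is not w1" and "second term is / is not w2". It is not the authors' VMI, and the function name and the token-list input are assumptions.

```python
import math


def average_mutual_information(tokens, w1, w2):
    """Standard average mutual information (in bits) for the adjacent pair (w1, w2).

    Probabilities are relative frequencies over all adjacent pairs in `tokens`.
    This is the textbook measure, not the VMI variant named on the slide.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    if n == 0:
        return 0.0
    c12 = sum(1 for a, b in bigrams if a == w1 and b == w2)
    c1 = sum(1 for a, _ in bigrams if a == w1)  # w1 in first position
    c2 = sum(1 for _, b in bigrams if b == w2)  # w2 in second position
    # Each cell: (joint count, left marginal count, right marginal count).
    cells = [
        (c12, c1, c2),
        (c1 - c12, c1, n - c2),
        (c2 - c12, n - c1, c2),
        (n - c1 - c2 + c12, n - c1, n - c2),
    ]
    total = 0.0
    for joint, left, right in cells:
        if joint > 0:  # empty cells contribute nothing
            total += (joint / n) * math.log2(joint * n / (left * right))
    return total
```

A pair whose members always occur together scores higher than a pair that co-occurs only occasionally, which is what makes a measure of this kind usable for deciding whether two adjacent terms should be joined.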

Tokenization Techniques (2)
Dual-threshold for segmenting

Tokenization Procedure
Tokenizing at the character level cannot produce a highly accurate segmentation.
– Cross-boundary problem
Instead, tokenize the corpus after it has been segmented with LDC’s segmenter.

Feedback from Statistical Dictionary
Monolingual tokenization may lead to over-segmentation.
The statistical dictionary was built from the segmented corpus.
The results of the statistical dictionary are used to adjust the segmentation.

Flowchart of Pre-processing

Results
With proper parameters for the two thresholds:
– Average length of Chinese terms increased by 60%, 10% for English
– Statistical dictionary gained a 30% increase in coverage (with the same precision)
– Small boost in overall EBMT performance
Automatic evaluation metrics
Human evaluations

Ongoing and Future Work
Adding word-clustering and grammar induction features
Improving the sub-sentential alignment model by utilizing the bilingual collocation information
Changing the threshold dynamically according to the current segmentation

References (partial)
Ralf D. Brown. 1996. Example-Based Machine Translation in the PanGloss System. In Proceedings of the Sixteenth International Conference on Computational Linguistics, Copenhagen, Denmark.
Ralf D. Brown. 1997. Automated Dictionary Extraction for “Knowledge-Free” Example-Based Translation. In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation, Santa Fe, July 23-25, 1997.
Ralf D. Brown. 1999. Adding Linguistic Knowledge to a Lexical Example-Based Translation System. In Proceedings of the Eighth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-99), pages 22-32, Chester, England, August.
Ralf D. Brown. 2000. Automated Generalization of Translation Examples. In Proceedings of the Eighteenth International Conference on Computational Linguistics (COLING-2000).
Tom Emerson. “Segmentation of Chinese Text”. In MultiLingual Computing & Technology, #38, Volume 12, Issue 2. MultiLingual Computing, Inc.
Christopher Hogan and Robert E. Frederking. 1998. An Evaluation of the Multi-engine MT Architecture. In Machine Translation and the Information Soup: Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA ’98), volume 1529 of Lecture Notes in Artificial Intelligence. Springer-Verlag, Berlin, October.

The End
Questions and Comments?