The current status of Chinese-English EBMT: where are we now? Joy (Ying Zhang), Ralf Brown, Robert Frederking, Erik Peterson, Aug 2001.


Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Topics
Overview of this project
– Rapid-deployment machine translation system between Chinese and English
For HLT 2001 (Jun 00 - Jan 01)
– Augment the segmenter with new words found in the corpus
For MT-Summit VIII paper (Jan 01 - May 01)
– Two-threshold method used in tokenization code to find new words in the corpus
For PI meeting (Jun 01 - Jul 01)
– Accurate ablation experiments
– Named entities added to the training
– Multi-corpora experiment
After PI meeting (Aug 01)
– Study of results reported for PI meeting
– Review of evaluation methods
– Type-token relations
Plan for future research

Overview of Ch-En EBMT
Adapting EBMT to Chinese
Corpus used
– Hong Kong legal code (from LDC)
– Hong Kong news articles (from LDC)
In this project:
– Robert Frederking, Ralf Brown, Joy, Erik Peterson, Stephan Vogel, Alon Lavie, Lori Levin

Corpus Cleaning
Convert from Big5 to GB
Divided into training set (90%), dev-test (5%) and test set (5%)
Sentence-level alignment, using the Gale & Church method (by ISI)
Cleaned
Convert two-byte Chinese characters to their cognates
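
The cleaning steps above can be illustrated with a minimal Python sketch. It rests on assumptions that the slides do not confirm: that "two-byte characters to their cognates" means mapping full-width letters, digits and punctuation to their ASCII counterparts, and that the 90/5/5 split was drawn by shuffling sentence pairs. The names `to_halfwidth` and `split_corpus` are hypothetical.

```python
import random

# Assumption: "cognates" are the ASCII counterparts of full-width forms.
# Unicode full-width forms U+FF01..U+FF5E sit exactly 0xFEE0 above U+0021..U+007E.
FULLWIDTH_OFFSET = 0xFEE0


def to_halfwidth(text):
    """Map full-width letters, digits and punctuation to ASCII."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:     # full-width ASCII variants
            out.append(chr(code - FULLWIDTH_OFFSET))
        else:                              # ordinary Chinese characters pass through
            out.append(ch)
    return "".join(out)


def split_corpus(pairs, train=0.90, dev=0.05, seed=0):
    """Shuffle sentence pairs and split them into train / dev-test / test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)     # assumption: random split at sentence level
    n_train = int(len(pairs) * train)
    n_dev = int(len(pairs) * dev)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_dev],
            pairs[n_train + n_dev:])
```

A split by whole documents rather than by sentence pairs would be an equally plausible reading of the slide; only the proportions are given.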

Corpus Statistics
Hong Kong Legal Code:
– Chinese: 23 MB
– English: 37.8 MB
Hong Kong News (after cleaning)
– 7,622 documents
– Dev-test: 1,331,915 bytes, 4,992 sentence pairs
– Final-test: 1,329,764 bytes, 4,866 sentence pairs
– Training: 25,720,755 bytes, 95,752 sentence pairs
Vocabulary size under LDC segmenter
– Dev-test: 8,529 types, 134,749 tokens
– Final-test: 8,511 types, 135,372 tokens
– Training: 20,451 types, 2,600,095 tokens

Chinese Segmentation
There are no spaces between Chinese words in written Chinese.
The segmentation problem: given a sentence with no spaces, break it into words.

Vague Definition of Words
In English, a word might be “a group of letters having meaning, separated by spaces in the sentence”. This doesn’t work for Chinese.
Is the word a single Chinese character? Not necessarily.
Is the word the smallest set of characters that can have meaning by themselves? Maybe.
Is the word the longest set of characters that can have meaning by themselves? Perhaps.

Our Definition of Words/Phrases/Terms
Chinese Characters
– The smallest unit in written Chinese is a character, which is represented by 2 bytes in GB-2312 code.
Chinese Words
– A word in natural language is the smallest reusable unit which can be used in isolation.
Chinese Phrases
– We define a Chinese phrase as a sequence of Chinese words. Each word in the phrase has the same meaning as when it appears by itself.
Terms
– A term is a meaningful constituent. It can be either a word or a phrase.
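
The two-byte character unit can be made concrete with a small sketch that walks a raw GB-2312 byte string and cuts it into characters. It assumes the common EUC-CN byte layout, in which both bytes of a Chinese character are 0xA1 or above and ASCII stays single-byte; the function name `split_gb2312` is hypothetical.

```python
def split_gb2312(raw):
    """Split GB-2312 (EUC-CN) bytes into a list of decoded characters."""
    chars = []
    i = 0
    while i < len(raw):
        # A lead byte >= 0xA1 starts a two-byte character; anything else is ASCII.
        width = 2 if raw[i] >= 0xA1 else 1
        chars.append(raw[i:i + width].decode("gb2312"))
        i += width
    return chars
```

For example, `split_gb2312("中文AB".encode("gb2312"))` yields four characters from six bytes.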

Complicated Constructions
Some constructions can cause problems for segmentation:
– Transliterated foreign words and names: Chinese characters are used for the sound of English names. The meaning of each character is irrelevant and cannot be relied on. Each Chinese-speaking region will often transliterate the same name differently.

Complicated Constructions (2)
– Abbreviations: in Chinese, abbreviations are formed by taking a character from each word in the phrase being abbreviated.
– Virtually any phrase can be abbreviated by taking one character from each component, and these characters usually have no independent relation to each other.

Complicated Constructions (3)
– Chinese names
Name = surname (generally one character) + given name (one or two characters)
About 100 common surnames, but the number of given names is huge
The complication for NLP: the same characters in names can be used in “regular” words, just as “Bill Brown” can be a name in English.

Complicated Constructions (4)
– Chinese numbers: similar to English, there are several ways to write numbers in Chinese.

Segmenter Approaches
– Statistical approaches
Idea: build collocation models for Chinese characters, such as a first-order HMM. Place a space where two characters rarely co-occur.
Cons:
– Data sparseness
– Cross boundary
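
The idea of placing a space where two characters rarely co-occur can be sketched with plain adjacent-character counts. This is a deliberate simplification, not the first-order HMM the slide mentions; the function names and the `min_count` threshold are assumptions made for illustration.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count adjacent character pairs in unsegmented training sentences."""
    counts = Counter()
    for sentence in sentences:
        for a, b in zip(sentence, sentence[1:]):
            counts[a + b] += 1
    return counts


def segment_by_cooccurrence(sentence, counts, min_count=2):
    """Cut between two characters whose pair count is below min_count."""
    if not sentence:
        return []
    words = [sentence[0]]
    for prev, cur in zip(sentence, sentence[1:]):
        if counts[prev + cur] >= min_count:
            words[-1] += cur       # frequent pair: keep in the same word
        else:
            words.append(cur)      # rare pair: start a new word
    return words
```

With a tiny training set, unseen pairs always force a cut, which is the data sparseness problem listed above.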

Segmenter (2)
– Dictionary-based approaches
Idea: use a dictionary to find the words in the sentence
Forward maximum match, backward maximum match, or both directions
Cons:
– The size and quality of the dictionary are of great importance: new words, named entities
– Maximum (greedy) match may cause mis-segmentations
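
Forward and backward maximum match can be sketched as follows. This is a minimal illustration under assumptions, not the project's segmenter: the dictionary is a plain set of words, `max_len` caps the candidate length, and unknown single characters are emitted as one-character words.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in dictionary:   # single char is the fallback
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(sentence, dictionary, max_len=4):
    """Greedy right-to-left segmentation: take the longest dictionary word."""
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = sentence[j - size:j]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                j -= size
                break
    words.reverse()                                    # restore sentence order
    return words
```

On the string 研究生命起源 with the dictionary {研究, 研究生, 生命, 起源}, the forward pass greedily takes 研究生 and strands 命, while the backward pass recovers 研究 / 生命 / 起源. Such disagreements between the two directions are one way to spot the greedy mis-segmentations listed above.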

Segmenter (3)
– A combination of dictionary and linguistic knowledge
Idea: use morphology, POS, grammar and heuristics to aid disambiguation
Pros: high accuracy (possible)
Cons:
– Requires a dictionary with POS and word frequency
– Computationally expensive

Segmenter (4)
We first used LDC’s segmenter.
Currently we are using a forward/backward maximum match segmenter as the baseline. The word frequency dictionary is from LDC.
Word frequency dictionary from LDC: entries

For HLT 2001
Ying Zhang, Ralf D. Brown, and Robert E. Frederking. "Adapting an Example-Based Translation System to Chinese". To appear in Proceedings of Human Language Technology Conference 2001 (HLT-2001).

For MT-Summit VIII
Ying Zhang, Ralf D. Brown, Robert E. Frederking and Alon Lavie. "Pre-processing of Bilingual Corpora for Mandarin-English EBMT". Accepted in MT Summit VIII (Santiago de Compostela, Spain, Sep. 2001).
Two-threshold for tokenization

For MT-Summit VIII (2)

For PI Meeting (1)
Baseline System
Full System
Baseline + Named-Entity
Multi-corpora System

For PI Meeting (2)
Baseline System

For PI Meeting (3)
Full System

For PI Meeting (4)
Named-Entity

For PI Meeting (5)
Multi-Corpora Experiment
– Motivation
– Corpus clustering
– Experiment

Evaluation Issues
Automatic measures
– EBMT source match
– EBMT source coverage
– EBMT target coverage
– MEMT (EBMT+DICT) unigram coverage
– MEMT (EBMT+DICT) PER
Human evaluations
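
The slides do not define PER. Assuming it stands for position-independent error rate, one common formulation treats the hypothesis and the reference as bags of words and ignores word order. The sketch below follows that formulation; it is not necessarily the exact measure used in these experiments, and the function names are hypothetical.

```python
from collections import Counter


def bag_matches(hypothesis, reference):
    """Number of word tokens shared by the two bags (clipped counts)."""
    return sum((Counter(hypothesis) & Counter(reference)).values())


def position_independent_error_rate(hypothesis, reference):
    """PER = (max(|ref|, |hyp|) - matches) / |ref|, ignoring word order."""
    if not reference:
        raise ValueError("reference must not be empty")
    matches = bag_matches(hypothesis, reference)
    return (max(len(reference), len(hypothesis)) - matches) / len(reference)
```

A hypothesis that contains every reference word in a scrambled order scores 0 under this measure, which is why it complements rather than replaces human evaluation.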

Evaluation Issues (2)
Human evaluations
– 4-5 graders each time
– 6 categories

Evaluation Issues (3)

After PI Meeting (0)
Study of results reported in PI meeting
– The quality of named entities (cleaned by Erik)
– Performance difference of EBMT while changing the average length of Chinese word tokens (by changing segmentation)
– How to evaluate the performance of the system
Experiment of G-EBMT
– Word clustering

After PI Meeting (1)
Changing the average length of Chinese tokens
– No bracketing on English
– Use a subset of LDC’s frequency dictionary for segmentation
– Study the performance of the EBMT system at different average Chinese token lengths

After PI Meeting (2)

After PI Meeting (3)
Avg. Token Len. vs. StatDict Recall

After PI Meeting (4)
Avg. Token Len. vs. Source word match

After PI Meeting (5)
Avg. Token Len. vs. Source Coverage

After PI Meeting (6)
Avg. Token Len. vs.

After PI Meeting (7)
Avg. Token Len. vs. Src/Tgt Coverage of EBMT in MEMT

After PI Meeting (8)
Avg. Token Len. vs. Translation Unigram Coverage

After PI Meeting (9)
Avg. Token Len. vs. Hypothesis Len (length of translation)
The reference translation’s length is 1163 words

After PI Meeting (10)
Avg. Token Len. vs. PER

After PI Meeting (11)
Type-Token curve for Chinese
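
A type-token curve plots the number of distinct words (types) seen so far against the number of running words (tokens) read. The sketch below shows one way to compute the points of such a curve from a segmented corpus; the function name and the `step` sampling parameter are assumptions, and the plotted data from the slide is not reproduced here.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens_read, types_seen) points, sampled every `step` tokens."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            curve.append((n, len(seen)))
    return curve
```

Because the Chinese token stream depends on the segmenter, the same text gives a different curve under each segmentation, which is what makes the curve useful alongside the average token length experiments.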

After PI Meeting (12)
Type-Token curve of Chinese and English

Future Research Plan
Generalized EBMT
– Word clustering
– Grammar induction
Using machine learning to optimize the parameters used in MEMT
Better alignment model: integrating segmentation, bracketing and alignment

New Alignment Model (1)
Using both monolingual and bilingual collocation information to segment and align the corpus

New Alignment Model (2)

New Alignment Model (3)

New Alignment Model (4)

References
Tom Emerson, “Segmentation of Chinese Text”. MultiLingual Computing & Technology, #38, Volume 12, Issue 2. MultiLingual Computing, Inc.