
1 The Current Status of Chinese-English EBMT: Where Are We Now? Joy (Ying Zhang), Ralf Brown, Robert Frederking, Erik Peterson. Aug 2001

2 Topics
Overview of this project
– Rapidly deployable machine translation system between Chinese and English
For HLT 2001 (Jun 00 - Jan 01)
– Augmented the segmenter with new words found in the corpus
For the MT-Summit VIII paper (Jan 01 - May 01)
– Two-threshold method used in the tokenization code to find new words in the corpus
For the PI meeting (Jun 01 - Jul 01)
– Accurate ablation experiments
– Named entities added to the training data
– Multi-corpora experiment
After the PI meeting (Aug 01)
– Study of the results reported at the PI meeting
– Review of evaluation methods
– Type-token relations
Plan for future research

3 Overview of Chinese-English EBMT
Adapting EBMT to Chinese
Corpus used:
– Hong Kong legal code (from LDC)
– Hong Kong news articles (from LDC)
In this project: Robert Frederking, Ralf Brown, Joy, Erik Peterson, Stephan Vogel, Alon Lavie, Lori Levin

4 Corpus Cleaning
– Converted from Big5 to GB encoding
– Divided into a training set (90%), a dev-test set (5%), and a final test set (5%)
– Sentence-level alignment using the Gale & Church method (done by ISI)
– Cleaned the text
– Converted two-byte (full-width) characters to their one-byte ASCII cognates
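As a concrete illustration of the two encoding steps, here is a minimal Python sketch, assuming Python 3's built-in big5 and gb2312 codecs; the replacement-based error handling is my choice, not necessarily what the project did.

```python
def big5_to_gb(raw: bytes) -> bytes:
    """Convert Big5-encoded bytes to GB2312 via Unicode.
    Characters with no GB2312 equivalent are replaced, not dropped."""
    return raw.decode("big5", errors="replace").encode("gb2312", errors="replace")

def halfwidth(text: str) -> str:
    """Map full-width (two-byte) ASCII cognates to their one-byte forms,
    e.g. 'Ａ１' -> 'A1'."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # ideographic space -> ASCII space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:   # full-width '!'..'~' block
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)
```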

5 Corpus Statistics
Hong Kong Legal Code:
– Chinese: 23 MB
– English: 37.8 MB
Hong Kong News (after cleaning): 7,622 documents
– Dev-test: 1,331,915 bytes, 4,992 sentence pairs
– Final-test: 1,329,764 bytes, 4,866 sentence pairs
– Training: 25,720,755 bytes, 95,752 sentence pairs
Vocabulary size under the LDC segmenter:
– Dev-test: 8,529 types, 134,749 tokens
– Final-test: 8,511 types, 135,372 tokens
– Training: 20,451 types, 2,600,095 tokens

6 Chinese Segmentation
Written Chinese has no spaces between words. The segmentation problem: given a sentence with no spaces, break it into words.

7 Vague Definition of Words
In English, a word might be "a group of letters, having meaning, separated by spaces in the sentence". That definition does not work for Chinese.
Is a word a single Chinese character? Not necessarily.
Is a word the smallest set of characters that can have meaning by themselves? Maybe.
Is a word the longest set of characters that can have meaning by themselves? Perhaps.

8 Our Definition of Words/Phrases/Terms
Chinese characters
– The smallest unit in written Chinese is the character, represented by 2 bytes in the GB-2312 encoding.
Chinese words
– A word in natural language is the smallest reusable unit that can be used in isolation.
Chinese phrases
– We define a Chinese phrase as a sequence of Chinese words in which each word keeps the same meaning it has when it appears by itself.
Terms
– A term is a meaningful constituent; it can be either a word or a phrase.

9 Complicated Constructions
Some constructions cause problems for segmentation:
– Transliterated foreign words and names: Chinese characters are used for the sounds of foreign names, so the meaning of each character is irrelevant and cannot be relied on. Each Chinese-speaking region will often transliterate the same name differently.

10 Complicated Constructions (2)
– Abbreviations: Chinese abbreviations are formed by taking a character from each word of the phrase being abbreviated. Virtually any phrase can be abbreviated this way, and the resulting characters usually have no independent relation to each other.

11 Complicated Constructions (3)
– Chinese names: Name = surname (generally one character) + given name (one or two characters). There are about 100 common surnames, but the number of given names is huge. The complication for NLP: the same characters used in names also appear in "regular" words, just as "Bill Brown" in English is made of ordinary words.

12 Complicated Constructions (4)
– Chinese numbers: As in English, there are several ways to write numbers in Chinese.
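For illustration, the sketch below normalizes one common written form (unit-based numerals such as 一百二十三) to an integer; it is a hedged toy, not the project's code, and digit-string forms such as 二零零一 would need separate handling.

```python
DIGITS = {'零': 0, '一': 1, '二': 2, '三': 3, '四': 4,
          '五': 5, '六': 6, '七': 7, '八': 8, '九': 9}
UNITS = {'十': 10, '百': 100, '千': 1000, '万': 10000}

def chinese_to_int(s: str) -> int:
    """Parse unit-based Chinese numerals, e.g. '一百二十三' -> 123."""
    total, num = 0, 0
    for ch in s:
        if ch in DIGITS:
            num = DIGITS[ch]
        elif ch in UNITS:
            unit = UNITS[ch]
            if unit == 10000:            # 万 scales everything read so far
                total = (total + num) * unit
            else:                        # a bare unit implies a leading 1 (十五 -> 15)
                total += (num or 1) * unit
            num = 0
    return total + num
```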

13 Segmenter Approaches
Statistical approaches:
– Idea: build collocation models over Chinese characters, such as a first-order HMM, and place a boundary where two characters rarely co-occur.
– Cons: data sparseness; collocations that cross word boundaries.
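A minimal sketch of the co-occurrence idea, using pointwise mutual information between adjacent characters rather than the HMM the slide mentions; the threshold of 0.0 is an arbitrary assumption.

```python
import math
from collections import Counter

def train_pmi(corpus_lines):
    """Count character unigrams and adjacent bigrams, return a PMI function."""
    uni, bi = Counter(), Counter()
    for line in corpus_lines:
        uni.update(line)
        bi.update(zip(line, line[1:]))
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    def pmi(a, b):
        if bi[(a, b)] == 0:
            return float("-inf")
        return math.log((bi[(a, b)] / n_bi) / ((uni[a] / n_uni) * (uni[b] / n_uni)))
    return pmi

def segment_by_pmi(line, pmi, threshold=0.0):
    """Insert a word boundary wherever adjacent characters rarely co-occur."""
    if not line:
        return []
    words, cur = [], line[0]
    for a, b in zip(line, line[1:]):
        if pmi(a, b) < threshold:
            words.append(cur)
            cur = b
        else:
            cur += b
    words.append(cur)
    return words
```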

14 Segmenter (2)
Dictionary-based approaches:
– Idea: use a dictionary to find the words in the sentence, with forward maximum match, backward maximum match, or both directions.
– Cons: the size and quality of the dictionary are of great importance (new words, named entities); maximum (greedy) match may cause mis-segmentations.
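A sketch of forward maximum match, assuming a plain set of dictionary entries and an 8-character cap on word length (both assumptions); backward maximum match is the mirror image, scanning right to left.

```python
def forward_max_match(sentence: str, dictionary: set, max_len: int = 8) -> list:
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary entry that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in dictionary or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words
```

A classic example of the dictionary's importance: 发展中国家 ("developing countries") segments correctly as 发展中 / 国家 only if 发展中 ("developing") is in the dictionary; otherwise greedy matching yields 发展 / 中国 / 家 ("develop / China / household").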

15 Segmenter (3)
A combination of dictionary and linguistic knowledge:
– Idea: use morphology, POS, grammar, and heuristics to aid disambiguation.
– Pros: potentially high accuracy.
– Cons: requires a dictionary with POS tags and word frequencies; computationally expensive.

16 Segmenter (4)
We first used LDC's segmenter. Currently we use a forward/backward maximum match segmenter as the baseline. The word-frequency dictionary is from LDC (43,959 entries).

17 For HLT 2001
Ying Zhang, Ralf D. Brown, and Robert E. Frederking. "Adapting an Example-Based Translation System to Chinese". To appear in Proceedings of the Human Language Technology Conference 2001 (HLT-2001).

18 For MT-Summit VIII
Ying Zhang, Ralf D. Brown, Robert E. Frederking, and Alon Lavie. "Pre-processing of Bilingual Corpora for Mandarin-English EBMT". Accepted at MT Summit VIII (Santiago de Compostela, Spain, Sep. 2001).
Two thresholds for tokenization.
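The deck does not reproduce the paper's exact criteria, so the following is only a plausible reading of a two-threshold filter, with hypothetical threshold values: candidate character n-grams not in the dictionary are accepted as new words above a high frequency threshold, set aside for review between the two thresholds, and discarded below the low one.

```python
from collections import Counter

def find_new_words(lines, dictionary, low=5, high=50, max_n=4):
    """Hypothetical two-threshold filter over out-of-dictionary n-grams.
    `low`, `high`, and `max_n` are placeholder values, not the paper's."""
    counts = Counter()
    for line in lines:
        for n in range(2, max_n + 1):
            for i in range(len(line) - n + 1):
                gram = line[i:i + n]
                if gram not in dictionary:
                    counts[gram] += 1
    accepted = {g for g, c in counts.items() if c >= high}
    review = {g for g, c in counts.items() if low <= c < high}
    return accepted, review
```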

19 For MT-Summit VIII (2)

20 For PI Meeting (1)
– Baseline system
– Full system
– Baseline + named entities
– Multi-corpora system

21 For PI Meeting (2) Baseline System

22 For PI Meeting (3) Full System

23 For PI Meeting (4) Named-Entity

24 For PI Meeting (5)
Multi-corpora experiment:
– Motivation
– Corpus clustering
– Experiment

25 Evaluation Issues
Automatic measures:
– EBMT source match
– EBMT source coverage
– EBMT target coverage
– MEMT (EBMT+DICT) unigram coverage
– MEMT (EBMT+DICT) PER
Human evaluations
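Of these, PER (position-independent error rate) is the easiest to pin down in code. Below is one common formulation that compares the hypothesis and reference as bags of words, ignoring order; treat it as a sketch, since the deck does not specify which variant was used.

```python
from collections import Counter

def per(reference: list, hypothesis: list) -> float:
    """Position-independent error rate over word lists.
    Order is ignored: matched words are counted via bag intersection,
    and surplus hypothesis words are penalized as insertions."""
    matches = sum((Counter(reference) & Counter(hypothesis)).values())
    return 1.0 - (matches - max(0, len(hypothesis) - len(reference))) / len(reference)
```

For example, per("the law was passed".split(), "passed the law".split()) counts three matches and one missing word, giving 0.25.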

26 Evaluation Issues (2)
Human evaluations:
– 4-5 graders each time
– 6 categories

27 Evaluation Issues (3)

28 After PI Meeting (0)
Study of the results reported at the PI meeting (http://pizza.is.cs.cmu.edu/research/internal/ebmt/tokenLen/index.htm):
– The quality of the named entities (cleaned by Erik)
– How EBMT performance changes with the average length of a Chinese word token (varied by changing the segmentation)
– How to evaluate the performance of the system
Experiment with G-EBMT:
– Word clustering

29 After PI Meeting (1)
Changing the average length of a Chinese token:
– No bracketing on the English side
– Use a subset of LDC's frequency dictionary for segmentation
– Study the performance of the EBMT system at different average Chinese token lengths
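A sketch of how such a sweep could be run, reusing the forward_max_match sketch above; freq_sorted_entries and dev_lines are hypothetical names for the LDC dictionary sorted by frequency and the dev-test sentences.

```python
def avg_token_length(lines, segment) -> float:
    """Average characters per token produced by a segmentation function."""
    n_tokens = n_chars = 0
    for line in lines:
        for tok in segment(line):
            n_tokens += 1
            n_chars += len(tok)
    return n_chars / n_tokens

# Illustrative sweep: smaller dictionary subsets force more single-character
# tokens, lowering the average token length (subset sizes are placeholders).
for k in (5000, 10000, 20000, 43959):
    subset = set(freq_sorted_entries[:k])
    print(k, avg_token_length(dev_lines, lambda s: forward_max_match(s, subset)))
```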

30 After PI Meeting (2)

31 After PI Meeting (3) Avg. Token Len. vs. StatDict Recall

32 After PI Meeting (4) Avg. Token Len. vs. Source Word Match

33 After PI Meeting (5) Avg. Token Len. vs. Source Coverage

34 After PI Meeting (6) Avg. Token Len. vs.

35 After PI Meeting (7) Avg. Token Len. vs. Src/Tgt Coverage of EBMT in MEMT

36 After PI Meeting (8) Avg. Token Len. vs. Translation Unigram Coverage

37 After PI Meeting (9) Avg. Token Len. vs. Hypothesis Length (length of the translation). The reference translation's length is 1,163 words.

38 After PI Meeting (10) Avg. Token Len. vs. PER

39 After PI Meeting (11) Type-Token Curve for Chinese
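The type-token curve plots vocabulary growth (distinct types) against corpus size (running tokens). A minimal sketch for producing the points, with the sampling step as an arbitrary choice:

```python
def type_token_curve(tokens, step=10000):
    """Cumulative (token count, type count) points for a type-token plot."""
    seen, points = set(), []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen)))
    if not points or points[-1][0] != len(tokens):
        points.append((len(tokens), len(seen)))
    return points
```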

40 After PI Meeting (12) Type-Token Curve of Chinese and English

41 Future Research Plan
Generalized EBMT:
– Word clustering
– Grammar induction
Using machine learning to optimize the parameters used in MEMT
A better alignment model: integrating segmentation, bracketing, and alignment

42 New Alignment Model (1)
Using both monolingual and bilingual collocation information to segment and align the corpus.

43 New Alignment Model (2)

44 New Alignment Model (3)

45 New Alignment Model (4)

46 References
Tom Emerson. "Segmentation of Chinese Text". MultiLingual Computing & Technology, #38 (Volume 12, Issue 2). MultiLingual Computing, Inc.

