Progress in Chinese EBMT for LingWear Ying Zhang (Joy) Language Technologies Institute Carnegie Mellon University Sep.

Presentation transcript:

Progress in Chinese EBMT for LingWear Ying Zhang (Joy) Language Technologies Institute Carnegie Mellon University Sep 08, 2000

Introduction: LingWear; Multi-engine Machine Translation; EBMT corpus; Chinese EBMT; Segmentation; Re-ordering

Tasks in Project 1. Data Collection: corpus, glossary 2. Data Preprocessing: convert code, segmentation for Chinese, bracketing English, align bilingual corpus

Tasks in Project (Cont.) 3. Indexing glossary 4. Building dictionary 5. Building corpus 6. Creating statistical dictionary

Data Collection (Corpus) Hong Kong bilingual legal code collected by LDC (the Linguistic Data Consortium). 24 Chinese files in Big5; 24 English files. (A small portion of the English is not a corresponding translation of the Chinese source.) Average size: 1.5 MB/file for English; 1.0 MB/file for Chinese, 10,000 lines each, >400,000 Chinese characters. Total corpus: 37.8 MB English, 23 MB Chinese.

Data Collection (Corpus) Cont. Each paragraph in the corpus is a line. Id tag ( ) added by LDC There are English definitions for legal terms

Data Collection (Corpus) Cont. To consolidate and amend the law relating to the construction, application and interpretation of laws, to make general provisions with regard thereto, to define terms and expressions used in laws and public documents, to make general provision with regard to public officers, public contracts and civil and criminal proceedings and for purposes and for matters incidental thereto or connected therewith. [31 December 1966] L.N. 88 of 1966 PART I SHORT TITLE AND APPLICATION This Ordinance may be cited as the Interpretation and General Clauses Ordinance. Remarks: Amendments retroactively made - see 26 of 1998 s. 2 (1)Save where the contrary intention appears either from this Ordinance or from the context of any other Ordinance or instrument, the provisions of this Ordinance shall apply to this Ordinance and to any other Ordinance in force, whether such other Ordinance came or comes into operation before or after the commencement of this Ordinance, and to any instrument made or issued under or by virtue of any such Ordinance.

Data Collection (Glossary) Cont. Sources: the LDC Chinese-English dictionary (seems to be a combination of several printed dictionaries); a punctuation dictionary (by Joy); definitions from the corpus.

Data Preprocess: Convert code. Coding system: there are two main coding schemes for Chinese: Big5 (Hong Kong, Taiwan, Southeast Asia) and GB2312/GBK (Mainland China). Tool: NJStar Universal Converter. Problems: HKSCS (Hong Kong Supplementary Character Set).
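The code conversion step can be illustrated with Python's built-in codecs. This is only a sketch under stated assumptions, not the NJStar tool used in the project: the function name `big5_to_gbk` is hypothetical, and the `big5hkscs` codec is chosen here so that HKSCS characters decode instead of raising an error. Note that re-encoding changes only the byte representation; it does not map traditional characters to simplified ones.

```python
def big5_to_gbk(data: bytes, source_encoding: str = "big5hkscs") -> bytes:
    """Re-encode Big5 (optionally with HKSCS extensions) bytes as GBK.

    Hypothetical helper for illustration. Characters that GBK cannot
    represent (for example some HKSCS additions) are replaced with '?'.
    """
    text = data.decode(source_encoding)          # bytes -> Unicode string
    return text.encode("gbk", errors="replace")  # Unicode string -> GBK bytes


if __name__ == "__main__":
    sample = "香港法例".encode("big5")
    converted = big5_to_gbk(sample)
    print(converted.decode("gbk"))
```

Using `errors="replace"` keeps a long batch conversion running; a stricter pipeline might log the offending characters instead.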

Data Preprocess (Cont.) Segmentation for Chinese. Why does Chinese need to be segmented? Because Chinese is written without any spaces between words, word segmentation is a particularly important issue for Chinese language processing. e.g.

Data Preprocess (Cont.) Segmenter: LDC Segmenter. Based on a word frequency dictionary, it uses dynamic programming to find the path with the highest product of word probabilities; the next word is selected from the longest phrase. Errors: Miss-segmentation: there is no such word in the frequency dictionary, so the segmenter simply splits the text into single characters. Incorrect-segmentation:

Data Preprocess (Cont.) Miss-segmentation is much more common than incorrect-segmentation. e.g. In a sample of 6960 words, the LDC Segmenter miss-segmented 57 words (100 cases, 1.43%) and incorrectly segmented 9 words (10 cases, 0.143%). The reason is that the dictionary used by the segmenter does not have entries for words in the legal domain.
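The dynamic-programming idea described for the segmenter can be sketched as follows. This is an illustrative reimplementation, not the LDC Segmenter's code: the toy dictionary, the `max_len` limit and the `unknown_prob` fallback are all assumptions. It also reproduces the miss-segmentation behaviour noted on the slide, since a character sequence missing from the dictionary falls back to single characters.

```python
import math


def segment(text, word_prob, max_len=4, unknown_prob=1e-8):
    """Split text into the word sequence with the highest probability product.

    word_prob maps known words to probabilities (hypothetical toy values).
    Unknown single characters get unknown_prob so every position is reachable.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best log-probability of text[:i]
    back = [0] * (n + 1)          # start index of the last word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob:
                p = word_prob[word]
            elif end - start == 1:
                p = unknown_prob  # out-of-dictionary character
            else:
                continue
            score = best[start] + math.log(p)  # sum of logs = log of product
            if score > best[end]:
                best[end] = score
                back[end] = start
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


if __name__ == "__main__":
    toy = {"香港": 0.02, "法例": 0.01, "香": 0.001, "港": 0.001,
           "法": 0.002, "例": 0.001}
    print(segment("香港法例", toy))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.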

Segmenter Improvement. Longer chunks are better for EBMT. Improve the Chinese segmenter by extracting ‘words’ from the corpus and adding them to the segmenter's dictionary. To find the corresponding translations for the segmented Chinese ‘words’, the English corpus needs to be ‘bracketed’ into phrases.

Example of Improvement

Basic Ideas. Search for patterns that appear in the corpus as candidates for words. Refine the patterns and create words.

Challenges. Memory concerns: if all patterns are kept in memory until the end of the scan, memory requirements explode. Length of the patterns to be searched (what about a word with 7 characters?). Whether a pattern is a ‘word’: distinguish patterns that are not words; construct longer words from patterns. Performance: speed.

Solutions. Memory concerns: “sliding window”: dump the patterns to a file dynamically. Scan only patterns of length 2, 3, 4 (2, 3, 4, 5 for English). Whether a pattern is a ‘word’: use mutual information to decide whether a pattern is a word. Merge shorter patterns into a longer “word” if the shorter patterns have the same number of appearances and appear in the same range.

Assumptions used in sliding-window 1. Assumption 1: Localization: a word appears more frequently in a certain region rather than being distributed evenly across the whole corpus.

Assumptions used in sliding-window 2. Assumption 2: If a pattern is going to appear again, it should appear within a range related to the average distance between its previous appearances. ExpectationRange = 30 * averageDistance
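The pattern scan restricted to lengths 2, 3 and 4 can be sketched as a plain n-gram count. This is a simplified illustration: the function name `count_patterns` is hypothetical, and it keeps every pattern in memory, so it omits the sliding-window dumping that the slides introduce to bound memory use.

```python
from collections import Counter


def count_patterns(clauses, lengths=(2, 3, 4)):
    """Count every character pattern of the given lengths in each clause.

    clauses is an iterable of strings; patterns never cross clause
    boundaries. For English, lengths=(2, 3, 4, 5) over tuples of words
    would be the analogous setting.
    """
    counts = Counter()
    for clause in clauses:
        for n in lengths:
            for i in range(len(clause) - n + 1):
                counts[clause[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    for pattern, times in sorted(count_patterns(["abcd", "abc"]).items()):
        print(pattern, times)
```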

Sliding-window
For every 50 clauses { check whether each pattern can be dumped }
Check_patterns_if_it’s_a_would_be_word {
    if (isAWordFinal($_[0], $thisWord)) { recycleMem; return 0; }
    else {
        if ($distance == 0) {    # appears in only one clause so far
            if ($scanRange < $rangeLimit) { return 1; }
            else { recycleMem; }
        }
        else {
            if ($notAppearRange > (($appearRange / $times) * $niceRate)) { recycleMem; return 0; }
            else { return 1; }
        }
    }
}
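One possible reading of the sliding-window check, written as a small Python function. The parameter names mirror the variables on the slide, but the function itself is a sketch under assumptions: the default `nice_rate=30` is taken from the ExpectationRange formula, and the branch where the scan range limit is exceeded is assumed to mean "dump the pattern", since the slide gives no explicit return value there.

```python
def keep_in_memory(is_final_word, distance, scan_range, range_limit,
                   not_appear_range, appear_range, times, nice_rate=30):
    """Return True to keep a pattern in memory, False to dump it to file.

    Hypothetical rendering of the slide's pseudocode; times must be > 0.
    """
    if is_final_word:
        return False  # already settled as a word: write it out, free memory
    if distance == 0:
        # Seen in only one clause so far: wait until the scan range limit.
        return scan_range < range_limit  # assumption: dump once past the limit
    average_distance = appear_range / times
    expectation_range = nice_rate * average_distance
    # Keep the pattern only while another appearance is still expected.
    return not_appear_range <= expectation_range


if __name__ == "__main__":
    print(keep_in_memory(False, 5, 0, 50, 100, 20, 4))
    print(keep_in_memory(False, 5, 0, 50, 200, 20, 4))
```

In the example, the average distance is 20 / 4 = 5, so the expectation range is 150 clauses; a gap of 100 keeps the pattern and a gap of 200 dumps it.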

Refine Patterns for Words. Step 1: Add up the info for the same pattern (it may have been dumped several times because of the sliding window). Step 2: Choose the longest pattern among patterns that have the same info (number of appearances and range). e.g. ab7390 abc7390 abcd7390: choose ‘abcd’ and give up ‘ab’, ‘abc’.
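Step 2 can be sketched as a filter that drops any pattern contained in a longer pattern with identical info. The function name, the data layout (a dict from pattern to a `(times, range)` tuple) and the numbers in the example are hypothetical.

```python
def keep_longest(patterns):
    """Drop patterns covered by a longer pattern with the same info.

    patterns maps each pattern string to a (times, range) tuple.
    Returns the surviving patterns in sorted order.
    """
    kept = []
    for pattern, info in patterns.items():
        covered = any(
            len(other) > len(pattern)
            and pattern in other
            and patterns[other] == info
            for other in patterns
        )
        if not covered:
            kept.append(pattern)
    return sorted(kept)


if __name__ == "__main__":
    example = {"ab": (12, 40), "abc": (12, 40), "abcd": (12, 40),
               "bc": (15, 40)}
    print(keep_longest(example))
```

Here ‘bc’ survives because it appears more often than ‘abcd’, which suggests it also occurs outside that longer pattern.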

Refine Patterns for Words (cont.) Step3: Split words according to “mutual info” e.g. For word like Abc, the “mutual info” is
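The formula from this slide did not survive in the transcript. As a stand-in, the sketch below uses standard pointwise mutual information between the two halves of a candidate split; this is an assumption, not necessarily the measure used in the project, and the function name and counts are hypothetical.

```python
import math


def pmi(count_whole, count_left, count_right, total):
    """Pointwise mutual information (in bits) of a split left + right.

    count_whole: occurrences of the full pattern, e.g. 'abc'
    count_left / count_right: occurrences of the parts, e.g. 'ab' and 'c'
    total: number of positions counted in the corpus
    """
    p_whole = count_whole / total
    p_left = count_left / total
    p_right = count_right / total
    return math.log2(p_whole / (p_left * p_right))


if __name__ == "__main__":
    # The parts almost always occur together: high score, keep as one word.
    print(pmi(50, 50, 50, 1000))
    # The parts co-occur about as often as chance predicts: split candidate.
    print(pmi(10, 100, 100, 1000))
```

A high value means the parts occur together far more often than chance, which argues for keeping the pattern as one word; a value near zero argues for splitting it.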

Refine Patterns for Words (cont.) Step 4: Construct longer words. As only patterns of length 2, 3, 4 are extracted, longer words need to be constructed from the 4-character patterns. Step 5: Add the words to the segmenter’s dictionary.

Evaluation. Word extraction: on average, the new-words file is 20K for each 2M of corpus; about 1,700 Chinese words found. Running on Oslo (dual 296 MHz UltraSPARC processors, 512 MB RAM) for HK00 (1.1M): pattern extraction program runs for 5:46 minutes; memory used: 3456K; pattern file is 967K. Word refinement: running time: 00:13; memory used: 6952K; new-word file: 21K.

Evaluation (cont.) Evaluated on HK00 (first 5 pages) Total Chinese characters: 2172 Original Segmenter: miss-segmentation: 120 cases (5.5%) incorrect-segmentation: 5 cases (0.23%) Improved Segmenter: miss-segmentation: 38 cases (1.75%) incorrect-segmentation: 7 cases (0.32%)
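The percentages on this slide follow directly from the case counts and the 2172-character total. The short check below recomputes them and adds the relative reduction in miss-segmentation cases, a figure derived here rather than stated on the slide.

```python
def rate(cases, total_chars):
    """Error cases as a percentage of the total character count."""
    return 100.0 * cases / total_chars


TOTAL_CHARS = 2172

original_miss = rate(120, TOTAL_CHARS)    # original segmenter
improved_miss = rate(38, TOTAL_CHARS)     # improved segmenter
original_wrong = rate(5, TOTAL_CHARS)
improved_wrong = rate(7, TOTAL_CHARS)

# Share of miss-segmentation cases removed by the improved dictionary.
relative_reduction = 100.0 * (120 - 38) / 120

if __name__ == "__main__":
    print(f"miss-segmentation: {original_miss:.2f}% -> {improved_miss:.2f}%")
    print(f"incorrect-segmentation: {original_wrong:.2f}% -> {improved_wrong:.2f}%")
    print(f"relative reduction in miss-segmentation: {relative_reduction:.1f}%")
```

Miss-segmentation cases drop by roughly two thirds, while incorrect-segmentation rises slightly from 5 to 7 cases.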

Bracketer for English. Uses the same algorithm as for Chinese. English is easier than Chinese (esp. for refinement). Underscores are used to concatenate English words into phrases, e.g. joint_creditors joint_estate journalistic_material judge_by judge_of judgment_creditor judgment_debtor judgment_debtors
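Once phrases have been extracted, applying them to the English text amounts to joining matched word sequences with underscores. The sketch below uses greedy longest-match lookup, which is an assumption about how matches are chosen; the function name `bracket` and the phrase set are hypothetical.

```python
def bracket(tokens, phrases, max_len=5):
    """Join known multi-word phrases with underscores, longest match first.

    tokens: list of words; phrases: set of word tuples (length >= 2).
    max_len=5 mirrors the longest English pattern length scanned.
    """
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here: emit the single word unchanged.
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    known = {("judgment", "creditor"), ("joint", "estate")}
    sentence = "the judgment creditor and the joint estate".split()
    print(" ".join(bracket(sentence, known)))
```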

Creating aligned bilingual corpus After the segmentation of Chinese and bracketing of English:

Creating Statistical Dict. Ralf’s program can generate a statistical bilingual dictionary for words based on the bilingual corpus. With the bracketed English corpus, this program can now generate a bilingual dictionary for phrases as well. In this dictionary, there are entries generated for the bracketed English phrases; the other 7680 entries are for words or phrases from the LDC dictionary.

Conclusion. By improving the Chinese segmenter and the English bracketer, the quality of the EBMT system has been improved.

Problems and future work. As there is no deep analysis of the semantic information of words, some of the words generated are not real words: e.g. Adjust the parameters of the Chinese segmenter and the English bracketer so that they find more coherent patterns.

Problems for EBMT. Purify the glossary and add preference information to word entries; the improved Chinese segmenter and English bracketer need to be augmented to provide more accurate segmentations; re-order the translation in English; modify the language model for better translation.

Thank you! Questions and comments?

Enjoy your weekend