Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.


Presentation transcript:

Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
A Technical Word and Term Translation Aid using Noisy Parallel Corpora across Language Groups
Pascale Fung, Kathleen McKeown
Machine Translation, 1997
Advisor: Dr. Hsu
Student: Sheng-Hsuan Wang, Department of Information Management

Outline
• Motivation
• Objective
• Introduction
• Related work
• Noisy parallel corpora across language groups
• Algorithm overview
• Experiments
• Conclusion

Motivation
• Technical term translation is a difficult task.
  ─ Translators are often unfamiliar with domain-specific terminology.
  ─ Such terminology is not adequately covered by printed dictionaries.
  ─ Extracting terms from noisy parallel corpora is especially difficult.
• Examples:
  ─ Hong Kong Governor / 香港總督
  ─ Basic Law / 基本法
  ─ Green Paper / 綠皮書

Objective
• This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups.

Introduction
• Technical terms
  ─ often cannot be translated on a word-by-word basis.
  ─ The individual words of a term may have many possible translations.
  ─ Example: Governor can be 總督, 主管 (top manager), 總裁 (chief) or 州長 (of a State), but Hong Kong Governor is 香港總督.
• Domain-specific terms
  ─ Basic Law / 基本法
  ─ Green Paper / 綠皮書

Introduction
• An algorithm for translating technical terms, given a noisy parallel corpus as input
  ─ Notion: a word and its translation will not occur at exactly the same positions in each half of the corpus, but the distances between instances of the same word will be similar across languages.
  ─ Method: find word correlations first, then build technical term translations from them, using the dynamic time warping algorithm and reliable anchor points.

Related work
• Sentence alignment
• Segment alignment
• Word and term translation
• Word alignment
• Phrase translation

Sentence alignment
• Two main approaches
  ─ Text-based: uses lexical information (a dictionary). Paired lexical indicators across the languages are used to find matching sentences.
  ─ Length-based: uses the total number of characters (or words). It assumes that translated sentences in the parallel corpus are of approximately the same length, or of lengths related by a constant ratio.

Segment alignment
• Church (1993) shows that a text can be aligned by using delimiters.
• Segment alignment is more appropriate than sentence alignment for noisy corpora.
• The problem is finding reliable anchor points that can be used for Asian/Romance language pairs.

Word and term translation
• Some algorithms used for alignment produce a small bilingual lexicon.
• Some others use sentence-aligned parallel text.
• Most of the following algorithms require clean, sentence-aligned parallel text as input.

Word alignment
• [Brown et al. 1990, Brown et al. 1993]
• [Gale & Church 1991]
• [Dagan et al. 1993]
• [Wu & Xia 1994]
• Various filtering techniques are used to improve the matching.

Phrase translation
• [Kupiec 1993]
• [Smadja & McKeown 1993]
• [Dagan & Church 1994]
• All the work described in this section assumes a clean, parallel corpus as input.

Noisy parallel corpora across language groups
• Previous approaches lack robustness
  ─ against structural noise in parallel corpora.
  ─ against language pairs that do not share etymological roots.
• Problems that remain
  ─ Bilingual texts that are translations of each other but are not translated sentence by sentence.
  ─ Language robustness.

Noisy parallel corpora across language groups
• Two noisy parallel corpora
  ─ English version of the AWK manual and its Japanese translation.
  ─ Parts of the HKUST English-Chinese Bilingual Corpora.

Algorithm overview
• Treat the domain word translation problem as a pattern matching problem.
  ─ Each word shares some common features with its counterpart in the translated text.
  ─ The goal is to find the best representations of these features and the best ways to match them.

Algorithm overview
[Flow diagram; the text recovered from the figure is listed below.]
• Input: corpus (English, Chinese).
• Tag the English text and form an English word list.
• Tokenize the Japanese and Chinese texts, and form a word list.
• Steps 1 – 4 on the corpus: 1. Primary lexicon 2. Anchor points for alignment 3. Align the text 4. Secondary lexicon
• Compile non-linear segment boundaries with high-frequency word pairs.
• 6. Compile the bilingual word lexicon.
• 7. Suggest a word list for each technical term to the translator.

Extracting technical terms from English text
• To find domain-specific terms, the English part of the corpus is tagged with a modified POS tagger.
  ─ Noun phrases that are most likely to be technical terms are extracted.
  ─ Translations are then sought only for words that are part of these terms.

Tokenization of Chinese and Japanese texts
• The Chinese text is tokenized with a statistically augmented dictionary-based tokenizer, which is able to recognize frequent domain words.
  ─ Example: 基本法 / Basic Law
• The Japanese text is tokenized by JUMAN without domain word augmentation.

A rough word pair based alignment
• Treat translation as a pattern matching task.
• The task is to find a representation and a similarity measurement that can find word pairs to serve as anchor points.

Dynamic Recency Vectors
• Governor
  ─ Word position vector of length 212.
  ─ Recency vector: the differences between successive word positions.
• 總督
  ─ Word position vector of length 254.
  ─ Recency vector: the differences between successive word positions.
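The two representations on this slide can be illustrated with a minimal sketch. It assumes positions are counted in tokens from the start of the text (the slides do not say which unit is used), and the function names `position_vector` and `recency_vector` are hypothetical.

```python
def position_vector(tokens, word):
    """Return the token offsets at which `word` occurs (assumed unit: tokens)."""
    return [i for i, token in enumerate(tokens) if token == word]


def recency_vector(positions):
    """Return the gaps between successive occurrences of a word."""
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


if __name__ == "__main__":
    tokens = "the governor said the governor will meet the governor".split()
    positions = position_vector(tokens, "governor")
    print(positions)                  # [1, 4, 8]
    print(recency_vector(positions))  # [3, 4]
```

A word that occurs n times has a position vector of length n and a recency vector of length n - 1, so two words with different frequencies yield vectors of different lengths. That is why a matching method that tolerates unequal lengths is needed.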

Recency vector signals
[Figure: plots of the recency vector signals for Governor.ch, Governor.en, Bill.ch and President.en]

Matching Recency Vectors
• Dynamic time warping (DTW)
  ─ Takes two vectors of lengths N and M and finds an optimal path through the N by M trellis, from (1,1) to (N,M).
  ─ Example pair: Governor / 總督

DTW algorithm
• Initialization
  ─ Costs are initialized according to the recency vector values.

DTW algorithm
• Recursion
  ─ Accumulates the cost of the DTW path.

DTW algorithm
• Termination
  ─ The final cost of the DTW path is normalized by the length of the path.

DTW algorithm
• Path reconstruction
  ─ Reconstruct the DTW path and obtain the points on the path.
  ─ These points are used later for finding anchor points and eliminating noise.
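The four steps on the DTW slides (initialization, recursion, termination, path reconstruction) can be sketched with a textbook DTW. This is an illustration under stated assumptions, not the paper's exact formulation: the local cost is taken to be the absolute difference of recency values, the allowed moves are the three standard ones (down, right, diagonal), and indices are 0-based, so the path runs from (0, 0) to (N-1, M-1).

```python
def dtw(v1, v2):
    """Return (normalized cost, path) for two non-empty numeric vectors."""
    if not v1 or not v2:
        raise ValueError("both vectors must be non-empty")
    n, m = len(v1), len(v2)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]

    # Initialization: the first cell holds the local cost of the first values.
    cost[0][0] = abs(v1[0] - v2[0])

    # Recursion: accumulate the cheapest way of reaching each cell.
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            best, prev = min(candidates)
            cost[i][j] = best + abs(v1[i] - v2[j])
            back[i][j] = prev

    # Path reconstruction: follow the back-pointers from the last cell.
    path = [(n - 1, m - 1)]
    while path[-1] != (0, 0):
        i, j = path[-1]
        path.append(back[i][j])
    path.reverse()

    # Termination: normalize the accumulated cost by the path length.
    return cost[n - 1][m - 1] / len(path), path


if __name__ == "__main__":
    score, path = dtw([1, 2, 3], [1, 2, 2, 3])
    print(score, path)
```

Normalizing by path length keeps scores comparable between word pairs whose vectors have different lengths, which matters when the lowest score is used to pick a translation.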

DTW algorithm
• For each word vector in language A, the word vector in language B with the lowest DTW score is taken to be its translation.
• The bilingual word pairs obtained from the above stages are thresholded, and the more reliable pairs are stored as the primary bilingual lexicon.
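The selection and thresholding step can be sketched as follows. The sketch assumes the DTW scores have already been computed and stored in a nested dictionary; the name `primary_lexicon`, the data layout, the threshold value and the example scores are all hypothetical.

```python
def primary_lexicon(scores, threshold):
    """Keep, for each source word, its lowest-scoring candidate if reliable.

    scores: {source_word: {target_word: dtw_score}} (assumed layout)
    threshold: maximum DTW score accepted as reliable (assumed parameter)
    """
    lexicon = {}
    for source, candidates in scores.items():
        if not candidates:
            continue
        # Lower DTW score means the two recency vectors match better.
        target = min(candidates, key=candidates.get)
        if candidates[target] <= threshold:
            lexicon[source] = target
    return lexicon


if __name__ == "__main__":
    example_scores = {  # made-up scores, for illustration only
        "Governor": {"總督": 0.2, "法案": 0.9},
        "Bill": {"總督": 0.8, "法案": 0.7},
    }
    print(primary_lexicon(example_scores, threshold=0.5))  # {'Governor': '總督'}
```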

Statistical filters
• To reduce the computational cost, constraints are incorporated to filter the set of candidate pairs:
  ─ Starting point constraint, i.e., position constraint.
  ─ Length constraint, i.e., frequency constraint.
  ─ Mean/standard deviation constraint.
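One way to read the three filters is as cheap checks run before DTW is attempted on a pair. The sketch below is an assumption-heavy illustration: the slides give no formulas, so the exact tests and every threshold (`max_start_gap`, `max_freq_ratio`, `max_stat_dist`) are hypothetical.

```python
from math import hypot
from statistics import mean, pstdev


def passes_filters(pos1, pos2, text_len1, text_len2,
                   max_start_gap=0.5, max_freq_ratio=2.0, max_stat_dist=5.0):
    """Return True if a candidate pair survives all three (assumed) filters.

    pos1, pos2: position vectors of the two words.
    text_len1, text_len2: lengths of the two texts, in the same unit.
    """
    if len(pos1) < 2 or len(pos2) < 2:
        return False  # no recency vector can be formed

    # Starting point (position) constraint: first occurrences should fall at
    # roughly the same relative place in each text.
    if abs(pos1[0] / text_len1 - pos2[0] / text_len2) > max_start_gap:
        return False

    # Length (frequency) constraint: the two words should occur a similar
    # number of times.
    f1, f2 = len(pos1), len(pos2)
    if max(f1, f2) / min(f1, f2) > max_freq_ratio:
        return False

    # Mean/standard deviation constraint: the recency vectors should have
    # similar summary statistics.
    r1 = [b - a for a, b in zip(pos1, pos1[1:])]
    r2 = [b - a for a, b in zip(pos2, pos2[1:])]
    distance = hypot(mean(r1) - mean(r2), pstdev(r1) - pstdev(r2))
    return distance <= max_stat_dist


if __name__ == "__main__":
    print(passes_filters([0, 10, 20, 30], [1, 12, 22, 31], 40, 40))
```

Each check costs time linear in the vector length, while DTW is quadratic, so discarding unlikely pairs first saves most of the work.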

Finding anchor points and eliminating noise
• The primary lexicon is used for aligning the segments in the corpus.
  ─ Anchor points are found on the DTW paths; they divide the texts into multiple aligned segments, which are used to build the secondary lexicon.
• An anchor point (i,j) is kept only if it satisfies all of the following:
  ─ slope constraint
  ─ continuity constraint
  ─ window size constraint
  ─ offset constraint

Finding anchor points and eliminating noise
[Figure: text alignment paths for the AWK and HKUST corpora, shown for all word pairs and after filtering]

Finding bilingual word pair matches
• Obtaining the secondary and final bilingual word lexicon requires:
  ─ A non-linear K segment binary vector representation for each word.
  ─ A similarity measure to compute word pair correlations.

Non-Linear K segments
• The anchor points (i,j), where i is a position in text1 and j is a position in text2, divide a bilingual corpus into k+1 non-linear segments.
• The algorithm then proceeds to obtain a secondary bilingual lexicon, considering words of both high and low frequency.
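The segmentation step can be illustrated with a short sketch. It assumes the anchor points have already been filtered so that they increase in both coordinates, and that positions are token offsets; the function name `split_by_anchors` is hypothetical.

```python
def split_by_anchors(tokens1, tokens2, anchors):
    """Cut both texts at k anchor points (i, j), giving k + 1 segment pairs.

    anchors: list of (i, j) with i an offset in tokens1 and j an offset in
    tokens2, assumed to increase monotonically in both coordinates.
    """
    segments = []
    prev_i = prev_j = 0
    for i, j in sorted(anchors):
        segments.append((tokens1[prev_i:i], tokens2[prev_j:j]))
        prev_i, prev_j = i, j
    # The final segment runs from the last anchor to the end of each text.
    segments.append((tokens1[prev_i:], tokens2[prev_j:]))
    return segments


if __name__ == "__main__":
    text1 = list("abcdef")
    text2 = list("uvwxyz12")
    for seg1, seg2 in split_by_anchors(text1, text2, [(2, 3), (4, 6)]):
        print(seg1, seg2)
```

The segments are "non-linear" in the sense that paired segments need not have the same length in both texts: in the usage example the second text is cut into pieces of 3, 3 and 2 tokens against 2, 2 and 2 in the first.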

Non-Linear segment binary vectors
• The co-occurrences of a candidate pair of words across the bilingual corpus are used to compute the correlation between the two words.
• Pr(w_s, w_t): the probability of w_s and w_t occurring in the same place in the corpus.
• Binary vector where the i-th bit is set to 1 if both words are found in the i-th segment.
  ─ Example for "governor" over K segments: 1 0 … 1

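Non-Linear segment binary vectors
[Contingency table over segments: source word present (T) or absent (F) against target word present (T) or absent (F); a is the count for the cell where both are T.]
  ─ If the source and target words are good translations of one another, then a should be large.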
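The binary vectors and the contingency counts can be sketched as below. The sketch assumes each word gets its own binary vector over the aligned segments, and that a counts the segments in which both words occur; the names b, c and d for the remaining cells, and both function names, are assumptions for illustration.

```python
def binary_vector(word, segments):
    """Return one bit per segment: 1 if `word` occurs in that segment."""
    return [1 if word in segment else 0 for segment in segments]


def contingency(source_bits, target_bits):
    """Count segments by presence of the source and target word.

    Returns (a, b, c, d):
      a: both words present       b: source only
      c: target only              d: neither word present
    """
    a = b = c = d = 0
    for s, t in zip(source_bits, target_bits):
        if s and t:
            a += 1
        elif s:
            b += 1
        elif t:
            c += 1
        else:
            d += 1
    return a, b, c, d


if __name__ == "__main__":
    english = [["governor", "said"], ["bill"], ["governor"], ["law"]]
    chinese = [["總督"], ["法案"], ["總督", "說"], ["基本法"]]
    vs = binary_vector("governor", english)
    vt = binary_vector("總督", chinese)
    print(vs, vt, contingency(vs, vt))  # [1, 0, 1, 0] [1, 0, 1, 0] (2, 0, 0, 2)
```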

Binary vector correlation measure
• Similarity measure: weighted mutual information
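The formula on this slide did not survive in the transcript. A form of weighted mutual information that is consistent with segment co-occurrence counts is W(w_s, w_t) = Pr(w_s=1, w_t=1) · log2( Pr(w_s=1, w_t=1) / (Pr(w_s=1) · Pr(w_t=1)) ). The sketch below assumes that form and assumes counts a (both words present), b (source only), c (target only) and d (neither); it may differ in detail from the measure used in the paper.

```python
from math import log2


def weighted_mutual_information(a, b, c, d):
    """Weighted mutual information from segment co-occurrence counts.

    a: segments containing both words   b: source word only
    c: target word only                 d: neither word
    (Assumed definition; see the hedged formula in the text above.)
    """
    total = a + b + c + d
    if total == 0 or a == 0:
        return 0.0  # no co-occurrence, so no evidence of correlation
    p_both = a / total
    p_source = (a + b) / total
    p_target = (a + c) / total
    return p_both * log2(p_both / (p_source * p_target))


if __name__ == "__main__":
    print(weighted_mutual_information(2, 0, 0, 2))  # 0.5
    print(weighted_mutual_information(1, 1, 1, 1))  # 0.0
```

Weighting the log ratio by the joint probability favours pairs that co-occur often, so a rare pair that happens to coincide in a single segment does not outrank a frequent, consistently co-occurring pair.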

Word translation results

Term translations from word groups

Term translation aid result

Conclusion
• A technique to align noisy parallel corpora by segments and to extract a bilingual word lexicon from them.
  ─ The sentence alignment step is replaced with a rough segment alignment.
  ─ It works without sentence boundary information and in the presence of noise.
  ─ Highly reliable anchor points, found using DTW, serve as segment delimiters.

Personal opinion
• Valuable idea
  ─ Treat the domain word translation problem as a pattern matching problem.
• Contribution
  ─ Language robustness and noisy parallel corpora.
• Drawback
  ─ Too long and too complex.