A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department.


A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department University of California - Riverside

Overview
  Pattern matching method for compiling a bilingual lexicon of nouns
  Information tagging of languages
  Word frequency and position information for low and high frequency words is represented in two forms for pattern matching
  Anchor points and noise elimination techniques are introduced
  Compilation of domain-specific noun phrases

Bilingual lexicon compilation without sentence alignment
  Automatically compiling a bilingual lexicon of nouns and proper nouns can contribute significantly to breaking bottlenecks in:
    – Machine translation
    – Machine-aided translation
  Domain-specific terms are hard to translate because they often do not appear in dictionaries

Algorithm Abstract
  1. Tag the English half of the parallel text
  2. Compute the positional difference vector of each word
  3. Match pairs of positional difference vectors, giving scores
  4. Select a primary lexicon using the scores
  5. Find anchor points using the primary lexicon
  6. Compute a position binary vector for each word using the anchor points
  7. Match binary vectors to yield a secondary lexicon

Problems with finding high frequency bilingual word pairs
  Sentence alignment between languages is not exact
  Chunks of text may appear in one language but not the other
  Dynamic Time Warping (DTW) techniques may be used for pattern recognition but may be too slow

Positional difference signals Sliding window

Solution to finding high frequency bilingual word pairs
  Tagging to identify nouns
    – Nouns tend to have consistent translations
  Positional difference vectors
    – A word and its translated counterpart usually show some correspondence in frequency and position, but the correspondence may not be linear
  Matching positional difference vectors
    – DTW was found to be a good way to match word vectors that are shifted or warped forms of each other
  Statistical filters
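To make the DTW matching step concrete, here is a minimal sketch of a textbook DTW distance between two numeric vectors. The function name `dtw_distance`, the absolute-difference local cost, and the unconstrained warping path are assumptions chosen for illustration; the slides do not say which cost or path constraints the method uses.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences.

    Illustrative assumptions: local cost is |a[i] - b[j]| and the warping
    path is unconstrained (no window or slope limits).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched again (stretch b)
                cost[i][j - 1],      # b[j-1] is matched again (stretch a)
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]
```

A vector and a stretched copy of it, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, get distance 0 under this sketch, which is the property that makes DTW tolerant of shifted or warped forms. The table has len(a) × len(b) cells, which is consistent with the concern that DTW may be too slow without pre-filtering.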

Positional difference vectors
  Each word is represented as a binary variable
    – 1 = word/phrase match
    – 0 = word/phrase non-match
  Corpora represented as a bit vector/string
    – Example:
      Noun: water
      Text: “I like to drink water. Water is …”
      Bit String: “ …”
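A small sketch of the two representations, under stated assumptions: the sample sentence, the simple regex tokenizer, and the function names are hypothetical and not taken from the slides, and the positional difference vector is assumed here to hold the gaps between successive occurrences of the word.

```python
import re

def tokenize(text):
    """Lowercase word tokenizer (an illustrative stand-in for real tagging)."""
    return re.findall(r"[a-z']+", text.lower())

def bit_vector(text, noun):
    """1 where the token matches `noun`, 0 elsewhere."""
    target = noun.lower()
    return [1 if tok == target else 0 for tok in tokenize(text)]

def positions(text, noun):
    """0-based token positions at which `noun` occurs."""
    return [i for i, bit in enumerate(bit_vector(text, noun)) if bit]

def positional_differences(pos):
    """Gaps between successive occurrences (assumed definition)."""
    return [later - earlier for earlier, later in zip(pos, pos[1:])]

# Hypothetical sample text, not the slide's own (elided) example.
sample = "I like to drink water. Water is good. We all need water."
bits = bit_vector(sample, "water")
pos = positions(sample, "water")
diffs = positional_differences(pos)
```

For this sample, `pos` is `[4, 5, 11]` and `diffs` is `[1, 6]`. The difference vector depends only on the spacing between occurrences, so a constant offset between the two texts leaves it unchanged.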

Statistical filters
  To improve computation speed, use Euclidean distance to measure pairs
    – If the distance is higher than a certain threshold, filter the pair out
  Look only at the Euclidean distance between the means and standard deviations
    – Low frequency words are not considered
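The filter can be sketched as follows. The function names, the `threshold` value, and the `min_count` cutoff for low frequency words are hypothetical; the slides give no concrete numbers, and the use of the population standard deviation is also an assumption.

```python
import math
from statistics import mean, pstdev

def summary(vec):
    """(mean, standard deviation) of a positional difference vector."""
    return mean(vec), pstdev(vec)

def passes_filter(vec_a, vec_b, threshold, min_count=2):
    """Keep a candidate pair only if its summary statistics are close.

    Assumptions: `threshold` and `min_count` are illustrative parameters;
    vectors shorter than `min_count` are treated as low frequency words
    and rejected outright.
    """
    if len(vec_a) < min_count or len(vec_b) < min_count:
        return False
    mean_a, std_a = summary(vec_a)
    mean_b, std_b = summary(vec_b)
    # Euclidean distance between the two (mean, std) points.
    return math.hypot(mean_a - mean_b, std_a - std_b) <= threshold
```

Only pairs that pass this cheap test would then go on to the more expensive DTW comparison.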

Finding low frequency bilingual word pairs
  Secondary lexicons need to be computed
  Find anchor points on the DTW paths, which divide the texts into multiple aligned segments
  Anchor points are more reliable than tracking all of the words in a given text
  Eliminate noise by keeping highly reliable points and discarding the rest
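As an illustration of how anchor points divide a text into segments, the sketch below maps word positions to segment indices. The anchor positions, the word positions, and the convention that an anchor starts a new segment are all assumptions made for the example; the slides do not specify them.

```python
from bisect import bisect_right

def segment_of(position, anchors):
    """Index of the segment containing `position`.

    `anchors` is a sorted list of anchor positions in one text. Assumed
    convention: a position equal to an anchor belongs to the segment that
    starts at that anchor.
    """
    return bisect_right(anchors, position)

def segments_with_word(word_positions, anchors):
    """Sorted, de-duplicated segment indices in which a word occurs."""
    return sorted({segment_of(p, anchors) for p in word_positions})

# Hypothetical anchors and word positions for one half of a corpus.
english_anchors = [100, 250, 400]
english_positions = [12, 130, 140, 405]
segments = segments_with_word(english_positions, english_anchors)
```

Here `segments` is `[0, 1, 3]`. Applying the same mapping to each language with its own anchor list yields segment indices that can be compared across the two texts.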

Dynamic time warping path
  The line can be thought of as a text alignment path
  Its departure from the diagonal illustrates that the texts of this corpus are not identical and not linearly aligned

DTW path reconstruction & anchor points obtained

Unsupervised algorithm
  The constraints in the conditions below are chosen roughly in proportion to the corpus size so that the filtered picture looks close to a clean, diagonal line
  If chosen then supervised scenario

Finding low frequency bilingual word pairs
  Many nouns and proper nouns were not translated in the previous stages of the algorithm
    – Frequency too low
  Non-linear segment binary vectors
    – Represent positional and frequency information of low frequency words by a binary vector for fast matching
    – Segments are smaller than the entire text
    – Example: the lexicon for the word “prosperity”
      Position vectors:
        – English:
        – Chinese:
      Segments in which each occurs:
        – English: i = 20, 27, 41, 47, 193, 321, 360
        – Chinese: i = 14, 29, 41, 47, 193, 275, 321, 360
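The segment lists from the “prosperity” example can be turned into binary vectors as in this sketch. The total number of segments (`NUM_SEGMENTS = 400`) and the 0-based indexing are assumptions, since the slide gives only the segment indices.

```python
def segment_vector(segments_with_word, num_segments):
    """Binary vector with a 1 for every segment that contains the word."""
    present = set(segments_with_word)
    return [1 if i in present else 0 for i in range(num_segments)]

# Segment indices from the slide's "prosperity" example.
english_segments = [20, 27, 41, 47, 193, 321, 360]
chinese_segments = [14, 29, 41, 47, 193, 275, 321, 360]

NUM_SEGMENTS = 400  # hypothetical total; not stated in the slides

v_en = segment_vector(english_segments, NUM_SEGMENTS)
v_zh = segment_vector(chinese_segments, NUM_SEGMENTS)

# Segments in which both the English and the Chinese word occur.
shared = [i for i in range(NUM_SEGMENTS) if v_en[i] and v_zh[i]]
```

With these inputs the two vectors agree on segments 41, 47, 193, 321 and 360, that is, five of the seven English occurrences.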

Finding low frequency bilingual word pairs
  Binary vector correlation measure
    – Confidence measure:
    – Example: From previous …
  Equation:
    m = mutual information score
    t = confidence measure
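The slide's own equation did not survive in this transcript. As a hedged illustration only, the sketch below uses the common pointwise mutual information and t-score formulations over two binary segment vectors; that these are the exact formulas behind the slide's m and t is an assumption, as is the function name `correlation_scores`.

```python
import math

def correlation_scores(v1, v2):
    """Return (m, t) for two equal-length binary vectors.

    Assumed formulations (not confirmed by the slides):
      m = log2( P(w1, w2) / (P(w1) * P(w2)) )
      t = (P(w1, w2) - P(w1) * P(w2)) / sqrt(P(w1, w2) / N)
    where probabilities are relative frequencies over the N segments.
    Returns None if the words never co-occur or either never occurs.
    """
    if len(v1) != len(v2) or not v1:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(v1)
    p1 = sum(v1) / n
    p2 = sum(v2) / n
    p12 = sum(1 for a, b in zip(v1, v2) if a and b) / n
    if p12 == 0 or p1 == 0 or p2 == 0:
        return None
    m = math.log2(p12 / (p1 * p2))
    t = (p12 - p1 * p2) / math.sqrt(p12 / n)
    return m, t
```

In this reading, m measures how much more often the two words share a segment than chance would predict, while t discounts scores that rest on very few co-occurrences.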

Methods/Results
  Evaluation by three human judges (E1 - E3)
    – E1 = Cantonese
    – E2 = Mandarin
    – E3 = Both languages
    – Accuracy:
      Algorithm: 73.1%
      Human judges: 66.0 – 87.5%

Results (Cont.)
  Finding Chinese words
  Compound noun translations
  Slang
  Collocations, e.g. houses & housing project
  Proper names, e.g. Benjamin Arai
  Tagging errors caused translation mistakes
  Many mistakes due to insufficient data

Summary
  The algorithm bypasses the sentence alignment step to find a bilingual lexicon of nouns and proper nouns
  Compared to other word alignment algorithms, it does not need a priori information

Future work
  The automated searching of valid lexicon matches has great potential for language translation
    – Noun and proper noun matching using subsets of bit vectors
    – Noun and proper noun filtering for translation gaps
  Automated noun phrase and compound word identification has potential for increasing lexicon matching accuracy
  Increase overall text translation accuracy without human intervention

Any questions?