Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.

Improving Named Entity Translation Combining Phonetic and Semantic Similarities
Fei Huang, Stephan Vogel, Alex Waibel
Language Technologies Institute, School of Computer Science, CMU
NAACL 2004

Introduction
- In the 2001 C-E translation evaluation test data, 20% of NEs are not included in the 50K LDC C-E translation lexicon.
- Most previous studies focused only on phonetic information.
- Some NEs are not translated phonetically (e.g. “南懷仁, Ferdinand Verbiest”).
- This work combines phonetic similarity (transliteration) with semantic similarity (context) to cover these non-transliterated NEs.
- Source language: Chinese. Target language: English.

Surface String Transliteration
- Training data: LDC C-E dictionary
- Bootstrapped unsupervised learning of transliteration probabilities between pinyin and English letters:
  - Pre-processing: romanize each Chinese word into pinyin.
  - 0th iteration: use edit distance to generate mappings between Chinese and English word pairs.
  - Use the 3,000 word translations with the smallest edit distance from the 0th iteration to estimate new transliteration probabilities.
  - Regenerate the translation mappings using the new transliteration probabilities.
  - In each iteration, add a further 500 pairs with the lowest transliteration cost to the existing NE pair list and re-estimate the transliteration probabilities.
  - Repeat until adding more NE pairs no longer improves extraction accuracy.
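The bootstrapping loop above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the fixed gap cost, the probability floor, and the fixed number of rounds (used here in place of the accuracy-based stopping test) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

GAP_COST = 1.0  # assumed fixed cost for an insertion or deletion


def unit_cost(s, t):
    """0th-iteration substitution cost: plain edit distance."""
    return 0.0 if s == t else 1.0


def align(src, tgt, sub_cost=unit_cost):
    """Weighted edit distance; returns (cost, aligned letter pairs)."""
    n, m = len(src), len(tgt)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * GAP_COST
    for j in range(1, m + 1):
        dist[0][j] = j * GAP_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]),
                dist[i - 1][j] + GAP_COST,
                dist[i][j - 1] + GAP_COST,
            )
    # Backtrace, keeping only the letter-to-letter (diagonal) steps.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dist[i][j] == dist[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]):
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + GAP_COST:
            i -= 1
        else:
            j -= 1
    return dist[n][m], pairs[::-1]


def estimate_cost(name_pairs, sub_cost=unit_cost, floor=1e-3):
    """Estimate p(english letter | pinyin letter) from aligned pairs."""
    counts = defaultdict(Counter)
    for src, tgt in name_pairs:
        for s, t in align(src, tgt, sub_cost)[1]:
            counts[s][t] += 1
    probs = {
        s: {t: c / sum(ts.values()) for t, c in ts.items()}
        for s, ts in counts.items()
    }

    def learned_cost(s, t):
        # Negative log probability, floored for unseen letter pairs.
        return -math.log(max(probs.get(s, {}).get(t, 0.0), floor))

    return learned_cost, probs


def bootstrap(candidates, seed_size=3000, step=500, rounds=3):
    """Re-rank (pinyin, english) pairs and re-estimate on a growing list."""
    cost, probs, size = unit_cost, {}, seed_size
    for _ in range(rounds):
        ranked = sorted(
            candidates,
            key=lambda p: align(p[0], p[1], cost)[0] / max(len(p[0]), len(p[1]), 1),
        )
        cost, probs = estimate_cost(ranked[:size], cost)
        size += step
    return cost, probs
```

In this sketch the first round ranks pairs by length-normalized edit distance, and later rounds rank them with the learned letter costs, so the training list grows by `step` pairs per round.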

Contextual Semantic Similarity
- Training data: a subset of the English Xinhua News corpus
- Context vector selection:
  - POS
    - Phi-square:
    - Weight of POS:
  - Distance
    - Weight of location:
  - Weight vector:
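The slide names a phi-square statistic for POS tags, but its formula did not survive in the transcript. As a reminder of the standard definition only, the sketch below computes phi-square from a 2x2 contingency table. Which counts the authors place in the four cells is not stated here, so the cell meanings given in the comments are an assumption.

```python
def phi_square(a, b, c, d):
    """Standard phi-square for a 2x2 contingency table.

    Assumed (hypothetical) cell meanings for a given POS tag:
      a: context words with this tag that co-occur with an NE
      b: context words with this tag that do not
      c: context words with other tags that co-occur with an NE
      d: context words with other tags that do not
    """
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: no association can be measured
    return (a * d - b * c) ** 2 / denominator
```

The value ranges from 0 (no association) to 1 (perfect association), which makes it usable as a weight without further scaling.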

Semantic Similarity between Context Vectors
- Semantic similarity: P(v_f | v_e) is computed with a modified IBM translation Model 2 [Brown et al. 1993], where
  - I: the length of the source vector
  - J: the length of the target vector
  - p(e|f): the word translation probability, estimated from a C-E aligned corpus with IBM Model 1
- P(v_e | v_f) is estimated in a similar way.
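The exact modified Model 2 formula is not preserved in the transcript. As a rough illustration of the idea of translating one context vector into another, the sketch below uses the plain IBM Model 1 form with uniform alignment instead. The function name, the `(target, source)` key layout of the lexicon, and the probability floor are assumptions for this example only.

```python
import math


def vector_translation_logprob(src_vec, tgt_vec, lex_prob, floor=1e-6):
    """log P(tgt_vec | src_vec) under a Model 1 style uniform alignment.

    lex_prob maps (target_word, source_word) to a word translation
    probability, such as one estimated from an aligned corpus.
    """
    if not src_vec or not tgt_vec:
        return float("-inf")
    total = 0.0
    for t in tgt_vec:
        # Average the lexical probability over all source positions.
        avg = sum(lex_prob.get((t, s), 0.0) for s in src_vec) / len(src_vec)
        total += math.log(max(avg, floor))
    return total
```

Calling the function once in each direction, with a lexicon for each direction, gives the two conditional scores that the slide says are estimated in a similar way.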

Cross-lingual Retrieval for NE translations

- English NEs in the retrieved text are automatically tagged by IdentiFinder™ from BBN (Bikel et al., 1997).
- Overall similarity score:
- The NE pairs with the highest overall similarity scores are considered translations.
- Because an NE can be translated in several different ways, and typos occur at times, the most frequent of the top NE hypotheses with similar spelling is chosen as the translation.
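The final selection step, choosing the most frequent of the similarly spelled top hypotheses, might look like the sketch below. The spelling-similarity measure (`difflib.SequenceMatcher`) and the 0.8 threshold are assumptions picked for illustration. The slides do not say how similar spelling is decided.

```python
from difflib import SequenceMatcher


def pick_translation(hypotheses, frequency, min_ratio=0.8):
    """Choose a translation from scored English NE hypotheses.

    hypotheses: list of (name, overall_similarity_score)
    frequency:  dict mapping name to its count in the retrieved text
    """
    best_name, _ = max(hypotheses, key=lambda h: h[1])
    # Keep hypotheses spelled like the top-scoring one.
    similar = [
        name
        for name, _ in hypotheses
        if SequenceMatcher(None, name.lower(), best_name.lower()).ratio() >= min_ratio
    ]
    # Among those spelling variants, prefer the most frequent.
    return max(similar, key=lambda name: frequency.get(name, 0))
```

A misspelled variant that happens to score highest is then replaced by its more frequent, correctly spelled neighbor, while frequent but unrelated names are ignored.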

Cross-lingual Retrieval for NE translations: Sentence-based or Document-based?
- Test data: Chinese newswire documents; 114 Chinese NEs are selected and translated manually.
- Indexed corpus: 963,478 English documents from the Xinhua News Agency
- Retrieval model: TF-IDF
- The top 1000 results are regarded as the relevant text.
- The recall of document-based indexing is better (70% compared with 60%).
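A TF-IDF retrieval step of this kind can be sketched as follows. The whitespace tokenizer, the smoothed IDF, and the cosine scoring are assumptions made for the example, since the slides only name the retrieval model. The query stands in for whatever English terms are derived from the Chinese NE's context.

```python
import math
from collections import Counter


def build_index(docs):
    """Return (idf, tf-idf vectors) for a list of English documents."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}  # assumed smoothing
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized
    ]
    return idf, vectors


def retrieve(query, idf, vectors, top_k=1000):
    """Rank documents by cosine similarity and keep the top_k matches."""
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    q_norm = math.sqrt(sum(w * w for w in q.values()))

    def cosine(d):
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        norm = q_norm * math.sqrt(sum(w * w for w in d.values()))
        return dot / norm if norm else 0.0

    scored = sorted(((cosine(d), i) for i, d in enumerate(vectors)), reverse=True)
    return [i for score, i in scored[:top_k] if score > 0]
```

Indexing whole documents or single sentences only changes what is passed in as `docs`, which is the comparison the slide reports.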

Experiment Results
- Test dataset: NIST 2002 Machine Translation Evaluation test data
  - 100 Chinese documents, 878 sentences, words
  - 2,469 NEs are automatically tagged (PER: 20%, LOC: 60%, ORG: 20%).
  - Only PER and LOC are evaluated.
  - Among the 1,898 tagged PERs and LOCs, 338 are true NEs that are not covered by the LDC lexicon.
- Baseline system: the CMU statistical MT system (Vogel et al., 2003)

Experiment Results