Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations Wai Lam Ruizhang Huang Pik-Shan Cheung Department of Systems.

Presentation transcript:

Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations
Wai Lam, Ruizhang Huang, Pik-Shan Cheung
Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong
SIGIR /09/09

2 Abstract
• A novel named entity matching model that considers both semantic and phonetic clues.
• The matching model is formulated as an optimization problem: bipartite weighted graph matching.
• Three learning algorithms are investigated for obtaining the similarity information of basic phoneme units from training examples.

3 Introduction
• Systems that rely on bilingual dictionaries encounter difficulties:
  The OOV (out-of-vocabulary) problem: new or unseen terms.
  Queries submitted for news search often consist of named entities or proper nouns.
• Earlier approaches:
  Automatic identification of word translations from unrelated English and German corpora.
  A method called Convec was developed to generate a bilingual lexicon from a comparable corpus.
  Mining term translations from Web anchor text (2002).
  Mining parallel documents from parallel Web sites (1999, SIGIR).
  A similarity-based backward transliteration approach (2002).
• Consider phonetic information.

4 Named Entity Matching Model (2.1) - Problem Nature -
• Given a pair of named entities that are translations of each other, the task is to find which parts of the entities match, that is, to compute the similarity between two named entities written in two languages. Note that this is a different problem from cross-language transliteration.
• Examples:
  University of Akron → 阿克倫大學
  Palo Alto Chamber of Commerce → 帕洛阿爾商會
• Two issues:
  Tokenization: 阿克倫 | 大學 and 帕洛阿爾 | 商會
  Partial matching

5 Named Entity Matching Model (2.2) - Matching Model Investigation -
• Inputs: an English entity E and a Chinese entity C.
• Bilingual dictionary: from the Linguistic Data Consortium (LDC).
• Three learning algorithms for phoneme units.
• A weight is associated with each word segment.

6 Named Entity Matching Model (2.3) - Tokenization -
• Consider a pair: an English entity E and a Chinese entity C.
  Each English term t_j is looked up in the bilingual dictionary.
  The Chinese entity is scanned to find the word segment that maximally matches a dictionary translation.
  A degree of matching is computed for each candidate segment; the segment is treated as a separate token if its degree of matching reaches or exceeds a certain threshold.
• Adjacent terms that are not involved in any dictionary mapping are grouped together, e.g. 帕洛 阿爾 → 帕洛阿爾
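The tokenization steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the transcript does not give the degree-of-matching formula, so the sketch assumes it is the length of the longest shared substring divided by the length of the dictionary translation. The function names, the toy dictionary, and the threshold of 0.5 are hypothetical.

```python
from itertools import groupby


def longest_shared_substring(translation, chinese):
    """Return (start, length) of the longest substring of `chinese` found in `translation`."""
    best_start, best_len = 0, 0
    for i in range(len(chinese)):
        for j in range(i + 1, len(chinese) + 1):
            if j - i > best_len and chinese[i:j] in translation:
                best_start, best_len = i, j - i
    return best_start, best_len


def tokenize_chinese(english_terms, chinese, dictionary, threshold=0.5):
    """Split `chinese` into dictionary-matched segments and grouped leftovers."""
    # segment[k] is the id of the matched segment covering character k, or None.
    segment = [None] * len(chinese)
    next_id = 0
    for term in english_terms:
        for translation in dictionary.get(term.lower(), []):
            start, length = longest_shared_substring(translation, chinese)
            if length == 0:
                continue
            degree = length / len(translation)  # assumed degree of matching
            span = range(start, start + length)
            if degree >= threshold and all(segment[k] is None for k in span):
                for k in span:
                    segment[k] = next_id
                next_id += 1
                break
    # Adjacent characters with no dictionary mapping share the key None,
    # so groupby merges them into a single token.
    return ["".join(chinese[k] for k in group)
            for _, group in groupby(range(len(chinese)), key=lambda k: segment[k])]


# Toy dictionary, invented for illustration only.
toy_dictionary = {"university": ["大學"], "chamber": ["商會"]}
tokens = tokenize_chinese(["University", "of", "Akron"], "阿克倫大學", toy_dictionary)
```

With the toy dictionary, the unmatched characters 阿克倫 end up as one token that is left for phonetic matching, while 大學 is split off by the dictionary lookup.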

7 Named Entity Matching Model (2.4) Hybrid Semantic and Phonetic Matching Algorithm – 1/4
• Let the English entity E be represented as a sequence of tokens, and the Chinese entity C be represented as a sequence of tokens.
• Construct an undirected bipartite weighted graph with vertex set V and edge set L.
  The vertex set is V = V_E ∪ V_C, where V_E = {e_1, …, e_m} and V_C = {c_1, …, c_n}.
  If a semantic or phonetic mapping is found between an English token e_i and a Chinese token c_j, there is an edge between them.

8 Named Entity Matching Model (2.4) Hybrid Semantic and Phonetic Matching Algorithm – 2/4
• Let the edge weight be ω(e_i, c_j).
  For a semantic mapping, ω(e_i, c_j) is set to a constant.
  For a phonetic mapping, ω(e_i, c_j) ∈ (0, 1] (described below).
• After the edges and weights of the graph have been constructed:
  The matching problem reduces to finding a set of edges such that the total weight is maximized and each token is mapped to only a single token on the other side.
  This requirement can be formulated as a bipartite weighted graph matching problem.

9 Named Entity Matching Model (2.4) Hybrid Semantic and Phonetic Matching Algorithm – 3/4
• Formal description of the problem: This is an NP-complete problem.
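The requirement stated on the previous slide (maximize the total weight while mapping each token to at most one token on the other side) can be written as a 0-1 program. The following is a sketch, not the slide's own formula: the indicator variables x_{ij} are notation assumed here, and ω(e_i, c_j) denotes the edge weight between English token e_i and Chinese token c_j.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i=1}^{m}\sum_{j=1}^{n} \omega(e_i, c_j)\, x_{ij} \\
\text{s.t.}\quad & \sum_{j=1}^{n} x_{ij} \le 1 \quad (i = 1,\dots,m), \\
& \sum_{i=1}^{m} x_{ij} \le 1 \quad (j = 1,\dots,n), \\
& x_{ij} \in \{0, 1\}, \qquad x_{ij} = 0 \ \text{if there is no edge between } e_i \text{ and } c_j .
\end{aligned}
```

Here x_{ij} = 1 means that e_i is matched to c_j; the two inequality families encode the single-token constraint on each side of the graph.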

10 Named Entity Matching Model (2.4) Hybrid Semantic and Phonetic Matching Algorithm – 4/4
• The maximum-weight assignment problem is reformulated as a minimum-cost assignment problem, which the Hungarian algorithm can solve efficiently.
  Step 1: remove tokens that have no edges.
  Step 2: add dummy vertices.
  Step 3: add dummy edges with weight zero.
  Step 4: transform each edge weight ω(e_i, c_j) into the cost α − ω(e_i, c_j), where α = …
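The reduction above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the constant used in Step 4 is not given in the transcript, so the sketch assumes it equals the largest edge weight; the function names and the example weights are hypothetical. Step 1 is left implicit because tokens without edges simply receive zero-weight dummy edges.

```python
def hungarian(cost):
    """Minimum-cost perfect assignment on a square matrix; returns row -> column."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (n + 1)   # column potentials (1-based)
    p = [0] * (n + 1)     # p[j] = row currently assigned to column j (0 = none)
    way = [0] * (n + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment


def best_matching(weights, m, n):
    """weights: {(i, j): weight} for m English and n Chinese tokens."""
    size = max(m, n)                            # Step 2: pad with dummy vertices
    alpha = max(weights.values(), default=0.0)  # assumed constant for Step 4
    cost = [[alpha - weights.get((i, j), 0.0)   # Steps 3-4: zero-weight dummy
             for j in range(size)]              # edges, then weight -> cost
            for i in range(size)]
    assignment = hungarian(cost)
    # Keep only pairs that correspond to real edges of the original graph.
    return sorted((i, j) for i, j in enumerate(assignment) if (i, j) in weights)


# Hypothetical weights: e0 = "University", e1 = "Akron"; c0 = 阿克倫, c1 = 大學.
example = {(0, 1): 1.0, (1, 0): 0.8, (1, 1): 0.1}
pairs = best_matching(example, m=2, n=2)
```

In the example the semantic edge University/大學 and the phonetic edge Akron/阿克倫 are selected together, because their combined weight exceeds that of any other one-to-one matching.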

11 Phonetic Matching Model: Generating Phonetic Representation
• The similarity of two terms is based on pronunciation.
• Phonetic generation procedure:
  English terms: using PRONLEX, a resource provided by LDC. For example: "father" → "faDR". A letter-to-phoneme tagging lexicon and a set of transformation rules are used. 458 basic phoneme units.
  Chinese terms: using Pin-Yin symbols. For example: "港" → "gang3". 791 basic phoneme units.
  Cantonese terms: using Jyut-Ping symbols. For example: "爸" → "baa1". 1139 basic phoneme units.

12 Phonetic Matching Model: Phonetic Matching Algorithm
• Given an English term and a Chinese term:
  Calculating the similarity score requires a phoneme pronunciation similarity (PPS) table.
  English-Mandarin: 348,831 entries
  English-Cantonese: 502,299 entries
  In particular, the number of entries for
  English-Mandarin: 35,077 entries
  English-Cantonese: 39,981 entries

13 Phonetic Matching Model: Phonetic Matching Algorithm
• Suppose:
  An English term, A, is represented by a sequence of basic phoneme units.
  A Chinese term, B, is represented by a sequence of basic phoneme units.
  Let S_{i,j} be the optimal longest common subsequence similarity score, which is defined by a recursive formula.
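A common way to compute such a score is a weighted longest-common-subsequence recurrence, sketched below in Python. The exact recurrence is not in the transcript, so this form is an assumption: S[i][j] is the maximum of S[i-1][j], S[i][j-1], and S[i-1][j-1] plus the PPS score of the i-th English unit and the j-th Chinese unit. The phoneme units and PPS values in the example are invented for illustration.

```python
def lcs_similarity(a, b, pps):
    """Weighted LCS score of phoneme sequences a and b.

    pps maps (english_unit, chinese_unit) to a similarity score;
    pairs missing from the table are treated as having score 0.
    """
    # S[i][j] is the best score for the first i units of a and first j units of b.
    S = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            pair_score = pps.get((a[i - 1], b[j - 1]), 0.0)
            S[i][j] = max(S[i - 1][j],                    # skip a unit of a
                          S[i][j - 1],                    # skip a unit of b
                          S[i - 1][j - 1] + pair_score)   # align the two units
    return S[len(a)][len(b)]


# Invented units and scores, for illustration only.
toy_pps = {("p", "p"): 1.0, ("l", "l"): 1.0, ("ow", "uo"): 0.5}
score = lcs_similarity(["p", "ae", "l", "ow"], ["p", "l", "uo"], toy_pps)
```

In the example the alignment p/p, l/l, ow/uo is chosen and the unit "ae" is skipped, giving 1.0 + 1.0 + 0.5.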

14 Learning Phonetic Similarity: The Widrow-Hoff Algorithm
• The Widrow-Hoff algorithm (learning the PPS table):
  Y_k: the similarity score of the k-th training example.
  Z_k: the target, 1 for a positive training example and 0 for a negative example.
  U_{k,i,j}: a binary variable indicating whether the k-th example involves the phoneme unit pair i (English) and j (Chinese).
  V_{i,j}: the score in the PPS table V, where i and j refer to a specific English and Chinese phoneme unit.
  m_a (English) and m_b (Chinese): the numbers of phoneme units.
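The update rule itself is not in the transcript. As a rough sketch, the textbook Widrow-Hoff (least-mean-squares) step can be written with the symbols above, assuming the prediction is the linear score Y_k = Σ U_{k,i,j} V_{i,j} and the update is V_{i,j} ← V_{i,j} − 2η (Y_k − Z_k) U_{k,i,j}. The learning rate η, the function name, and the example values are hypothetical.

```python
def widrow_hoff_step(V, active_pairs, z, eta=0.1):
    """Apply one Widrow-Hoff update to the PPS table V in place.

    V: dict mapping (english_unit, chinese_unit) to a score.
    active_pairs: set of pairs with U_k = 1 for this training example.
    z: target, 1 for a positive example and 0 for a negative one.
    Returns the prediction y computed before the update.
    """
    # Assumed linear prediction: sum of the table entries the example involves.
    y = sum(V.get(pair, 0.0) for pair in active_pairs)
    for pair in active_pairs:
        # Gradient step on the squared error (y - z)^2 with respect to V[pair].
        V[pair] = V.get(pair, 0.0) - 2.0 * eta * (y - z)
    return y


# Invented starting table, for illustration only.
table = {("p", "p"): 0.5, ("ow", "uo"): 0.2}
prediction = widrow_hoff_step(table, {("p", "p")}, z=1, eta=0.1)
```

A positive example whose score is below the target raises the entries it involves, and a negative example with a positive score lowers them; entries not involved in the example are left unchanged.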

15 Learning Phonetic Similarity: The Exponentiated-Gradient Algorithm
• EG requires that the elements of V are nonnegative and sum to 1.
• Each element of V is divided by max_{i,j}(V_{i,j}).
• In the update rule, κ > 0 is the learning rate and Ψ is a normalization term, which is the sum of the updated V_{i,j}.
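The update formula did not survive in the transcript. The sketch below uses the standard exponentiated-gradient form for squared loss, which is an assumption here: each entry is multiplied by exp(−2κ (Y_k − Z_k) U_{k,i,j}) and the table is then divided by Ψ, the sum of the updated entries. The function name and example values are hypothetical.

```python
import math


def eg_step(V, active_pairs, z, kappa=0.5):
    """Return a new PPS table after one exponentiated-gradient update.

    V: dict of nonnegative scores that sum to 1.
    active_pairs: set of pairs with U_k = 1 for this training example.
    z: target, 1 for a positive example and 0 for a negative one.
    """
    # Assumed linear prediction over the entries the example involves.
    y = sum(V[pair] for pair in active_pairs if pair in V)
    factor = math.exp(-2.0 * kappa * (y - z))
    # Multiplicative update: only entries involved in the example change.
    updated = {pair: (value * factor if pair in active_pairs else value)
               for pair, value in V.items()}
    psi = sum(updated.values())  # normalization term
    return {pair: value / psi for pair, value in updated.items()}


# Invented starting table, for illustration only.
start = {("p", "p"): 0.5, ("ow", "uo"): 0.5}
after = eg_step(start, {("p", "p")}, z=1, kappa=0.5)
```

Because the update is multiplicative and followed by normalization, entries stay nonnegative and keep summing to 1, which matches the requirement stated on the slide.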

16 Learning Phonetic Similarity: The Genetic Algorithm
• Objective function
• Initial population
• Fitness function
• Selection
• Crossover
• Mutation

17 Experiments on Named Entity Matching Model
• 20,000 Chinese-English person name pairs are used as training data.
• 2,000 person name pairs, different from the training data, are used to evaluate the learning performance.
• The average reciprocal rank (ARR) is used to measure the performance:
  Manual: 0.78
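As a brief sketch of the metric, assuming the usual definition of ARR as the mean of 1/rank of the correct translation over all test queries (the slide's own formula is not in the transcript), with a contribution of 0 when the correct translation is not retrieved:

```python
def average_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries.

    ranks: one entry per test query, the 1-based rank of the correct
    translation in the returned list, or None if it was not returned.
    """
    if not ranks:
        raise ValueError("at least one query is required")
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks for four test queries.
arr = average_reciprocal_rank([1, 2, 4, None])
```

An ARR of 1.0 means the correct translation is always ranked first; lower values mean it tends to appear further down the list.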

18 Experiments on Named Entity Matching Model
• The performance of the overall named entity matching model is evaluated on 1,000 named entities from the same corpus.

19 Mining New Entity Translations From News
• Bilingual comparable news: online daily Web news stories are used to discover new, unseen named entities.

20 Mining New Entity Translations From News

21 Experiments on Mining New Translations

22 Experiments on Mining New Translations

23 Conclusions
• A novel named entity matching model:
  Considers both semantic and phonetic clues.
  Three learning algorithms for training the phonetic similarity information.
• The flexible and comprehensive hybrid model can handle named entity matching.
• Bilingual comparable news: online daily Web news stories are used to discover new, unseen named entities.