Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations Wai Lam Ruizhang Huang Pik-Shan Cheung Department of Systems.

Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations
Wai Lam, Ruizhang Huang, Pik-Shan Cheung
Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong
{wlam,rzhuang,pscheung}@se.cuhk.edu.hk
SIGIR 04, 2004/09/09

2 Abstract
- A novel named entity matching model that considers both semantic and phonetic clues.
- The matching model is formulated as an optimization problem: bipartite weighted graph matching.
- Three learning algorithms are investigated for obtaining the similarity information of basic phoneme units from training examples.

3 Introduction
- Systems that rely on bilingual dictionaries encounter difficulties: the OOV (out-of-vocabulary) problem, i.e. new or unseen terms.
- Submitted queries for news search consist of named entities or proper nouns (1997).
- Automatic identification of word translations from unrelated English and German corpora (1999).
- A method called Convec was developed to generate a bilingual lexicon from a comparable corpus (1998).
- Mining term translations from Web anchor text (2002).
- Mining parallel documents from parallel Web sites (SIGIR 1999).
- A similarity-based backward transliteration approach (2002).
- Consider phonetic information.

4 Named Entity Matching Model (2.1) - Problem Nature -
- Given a pair of named entities that are translations of each other, the task is to find which parts of the entities match.
- The goal is to compute the similarity between two given named entities written in two languages.
- Note that this is a different problem from cross-language transliteration.
- Examples:
  University of Akron ↔ 阿克倫大學
  Palo Alto Chamber of Commerce ↔ 帕洛阿爾商會
- Two issues, illustrated by 阿克倫大學 and 帕洛阿爾商會: tokenization and partial matching.

5 Named Entity Matching Model (2.2) - Matching Model Investigation -
- Inputs: an English entity E and a Chinese entity C.
- Bilingual dictionary: Linguistic Data Consortium (LDC).
- Three learning algorithms for phoneme units.
- A weight is associated with each word segment.

6 Named Entity Matching Model (2.3) - Tokenization -
- Consider a pair: an English entity E and a Chinese entity C.
- Each English term $t_j$ is looked up in the bilingual dictionary.
- The Chinese entity is scanned to find the word segment that maximally matches a dictionary translation.
- A degree of matching is computed. If it reaches or exceeds a certain threshold, the matched term and segment are treated as separate tokens.
- Adjacent terms that are not involved in any dictionary mapping are grouped together. Example: 帕洛阿爾 → 帕洛阿爾
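The tokenization steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide does not give the degree-of-matching formula, so the sketch assumes it is the length of the longest substring shared by a dictionary translation and the Chinese entity, divided by the length of that translation. The function names, the toy dictionary, and the threshold value of 0.5 are all hypothetical.

```python
from itertools import groupby


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best = ""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > len(best):
                    best = a[i - cur[j]:i]
        prev = cur
    return best


def tokenize(english_terms, chinese, dictionary, threshold=0.5):
    """Split both entities into tokens; returns (english, chinese, pairs)."""
    # Dictionary lookup: keep the best-matching Chinese segment per term.
    matched = {}
    for idx, term in enumerate(english_terms):
        best_seg, best_deg = "", 0.0
        for translation in dictionary.get(term.lower(), []):
            seg = longest_common_substring(translation, chinese)
            deg = len(seg) / len(translation)  # ASSUMED degree of matching
            if deg > best_deg:
                best_seg, best_deg = seg, deg
        if best_seg and best_deg >= threshold:
            matched[idx] = best_seg

    # English side: matched terms stand alone, adjacent unmatched terms group.
    english_tokens, group = [], []
    for idx, term in enumerate(english_terms):
        if idx in matched:
            if group:
                english_tokens.append(" ".join(group))
                group = []
            english_tokens.append(term)
        else:
            group.append(term)
    if group:
        english_tokens.append(" ".join(group))

    # Chinese side: label matched characters, then group runs of equal labels.
    labels = [None] * len(chinese)
    for idx, seg in matched.items():
        start = chinese.find(seg)
        for k in range(start, start + len(seg)):
            labels[k] = idx
    chinese_tokens = ["".join(ch for ch, _ in grp)
                      for _, grp in groupby(zip(chinese, labels), key=lambda p: p[1])]
    pairs = [(english_terms[i], seg) for i, seg in matched.items()]
    return english_tokens, chinese_tokens, pairs


toy_dictionary = {"university": ["大學"]}  # hypothetical one-entry dictionary
eng, chi, sem = tokenize(["University", "of", "Akron"], "阿克倫大學", toy_dictionary)
print(eng, chi, sem)
# ['University', 'of Akron'] ['阿克倫', '大學'] [('University', '大學')]
```

The grouped leftovers ("of Akron" and 阿克倫) are the tokens that would later be compared phonetically. Overlapping dictionary matches are not handled in this sketch.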

7 Named Entity Matching Model (2.4) Hybrid Semantic and Phonetic Matching Algorithm - 1/4
- Let the English entity E be represented as tokens $e_1, \ldots, e_m$ and the Chinese entity C as tokens $c_1, \ldots, c_n$.
- Let G be an undirected bipartite weighted graph with vertex set V and edge set L.
- The vertex set is $V = V_E \cup V_C$, where $V_E = \{e_1, \ldots, e_m\}$ and $V_C = \{c_1, \ldots, c_n\}$.
- If a mapping is found, semantically or phonetically, between an English token $e_i$ and a Chinese token $c_j$, there is an edge between them.

8 Named Entity Matching Model (2.4) Hybrid Semantic and Phonetic Matching Algorithm - 2/4
- Let the edge weight be $w(e_i, c_j)$.
- For a semantic mapping, $w(e_i, c_j)$ is set to a constant.
- For a phonetic mapping, $w(e_i, c_j) \in (0, 1]$ (described below).
- After the edges and weights of the graph have been constructed, the matching problem reduces to finding a set of edges such that the total weight is maximized and each token is mapped to at most a single token on the other side.
- This requirement can be formulated as a bipartite weighted graph matching problem.

9 Named Entity Matching Model (2.4) Hybrid Semantic and Phonetic Matching Algorithm - 3/4
- Formal description of the problem.
- This is an NP-complete problem.
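The formula itself is not present in this transcript. One way to write the requirement stated on the previous slide (maximize total weight, with each token mapped to at most one token on the other side) is the integer program below. It is a reconstruction under assumptions rather than the slide's original formula: $w(e_i, c_j)$ stands for the edge weight between English token $e_i$ and Chinese token $c_j$, and the indicator variables $x_{ij}$ are introduced here for illustration.

```latex
\begin{aligned}
\max_{x} \quad & \sum_{i=1}^{m} \sum_{j=1}^{n} w(e_i, c_j)\, x_{ij} \\
\text{s.t.} \quad & \sum_{j=1}^{n} x_{ij} \le 1 \quad \text{for } i = 1, \ldots, m, \\
& \sum_{i=1}^{m} x_{ij} \le 1 \quad \text{for } j = 1, \ldots, n, \\
& x_{ij} \in \{0, 1\}, \qquad x_{ij} = 0 \text{ if there is no edge between } e_i \text{ and } c_j .
\end{aligned}
```

Integer programs are NP-complete in general. This particular one has the structure of an assignment problem, and assignment problems can be solved in polynomial time.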

10 Named Entity Matching Model (2.4) Hybrid Semantic and Phonetic Matching Algorithm - 4/4
- The maximum-cost assignment problem is reformulated as a minimum-cost assignment problem, which the Hungarian search algorithm can solve efficiently.
- Step 1: remove tokens that have no edges.
- Step 2: add dummy vertices.
- Step 3: add dummy edges with weight zero.
- Step 4: transform each edge weight $w(e_i, c_j)$ into the cost $\lambda - w(e_i, c_j)$, where $\lambda$ is a constant defined on the original slide.
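The four steps above can be sketched in pure Python. This is an illustrative sketch under assumptions, not the authors' code. The constant in step 4 is not recoverable from the slide, so it is assumed here to be the maximum edge weight. Step 1 is folded into the final filter, which drops any pairing that has no real edge. `hungarian` is a standard O(n^3) potentials-based implementation for a square cost matrix, and `match_tokens` is a hypothetical wrapper name.

```python
def hungarian(cost):
    """Minimum-cost assignment for a square matrix; returns column per row."""
    n = len(cost)
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (n + 1)   # column potentials
    p = [0] * (n + 1)     # p[j] = row currently assigned to column j (1-based)
    way = [0] * (n + 1)
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while True:  # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assignment = [-1] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment


def match_tokens(weights, m, n):
    """weights: {(i, j): w} for the edges between m English and n Chinese tokens."""
    size = max(m, n)                      # steps 2-3: pad with dummy vertices
    w = [[0.0] * size for _ in range(size)]   # missing edges get weight zero
    for (i, j), value in weights.items():
        w[i][j] = value
    top = max(weights.values(), default=0.0)  # ASSUMED constant for step 4
    cost = [[top - w[i][j] for j in range(size)] for i in range(size)]
    assignment = hungarian(cost)
    # Keep only pairings backed by a real edge (dummy pairings are dropped).
    return sorted((i, j) for i, j in enumerate(assignment) if (i, j) in weights)


edges = {(0, 1): 1.0, (1, 0): 0.8, (1, 1): 0.3}  # hypothetical weights
print(match_tokens(edges, 2, 2))  # [(0, 1), (1, 0)]
```

Subtracting every weight from the same constant turns the largest weights into the smallest costs, so minimizing total cost on the padded square matrix selects the same edges as maximizing total weight.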

11 Phonetic Matching Model: Generating Phonetic Representation
- The similarity of two terms is based on their pronunciation.
- Phonetic generation procedure:
  - English terms: using PRONLEX, a resource provided by LDC. Example: "father" → "faDR". A letter-to-phoneme tagging lexicon and a set of transformation rules are used. 458 basic phoneme units.
  - Chinese terms: using Pin-Yin symbols. Example: "港" → "gang3". 791 basic phoneme units.
  - Cantonese terms: using Jyut-Ping symbols. Example: "爸" → "baa1". 1139 basic phoneme units.

12 Phonetic Matching Model: Phonetic Matching Algorithm
- Given an English term and a Chinese term, calculating the similarity score requires a phoneme pronunciation similarity (PPS) table.
- Table size:
  - English-Mandarin: 348,831 entries
  - English-Cantonese: 502,299 entries
- In particular, the number of entries for:
  - English-Mandarin: 35,077 entries
  - English-Cantonese: 39,981 entries

13 Phonetic Matching Model: Phonetic Matching Algorithm
- Suppose an English term A and a Chinese term B are each represented by a sequence of basic phoneme units.
- Let $S_{i,j}$ be the optimal longest common subsequence similarity score, defined by a recursive formula.
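The recursive formula is not present in this transcript. A common way to define a similarity-weighted longest common subsequence score is sketched below. The recursion is an assumption, not the slide's formula: $S_{i,j} = \max(S_{i-1,j},\ S_{i,j-1},\ S_{i-1,j-1} + V_{a_i,b_j})$ with $S_{0,j} = S_{i,0} = 0$, where $V_{a_i,b_j}$ is the PPS table score for the pair of units. The function name and the toy PPS values are hypothetical, and any normalization of the final score is omitted.

```python
def phonetic_similarity(a_units, b_units, pps):
    """Weighted LCS score between two phoneme-unit sequences.

    pps maps (english_unit, chinese_unit) to a similarity score;
    pairs missing from the table count as 0 (an assumption).
    """
    rows, cols = len(a_units) + 1, len(b_units) + 1
    S = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair_score = pps.get((a_units[i - 1], b_units[j - 1]), 0.0)
            S[i][j] = max(S[i - 1][j],                    # skip an English unit
                          S[i][j - 1],                    # skip a Chinese unit
                          S[i - 1][j - 1] + pair_score)   # align the two units
    return S[-1][-1]


# Toy PPS table with made-up scores and made-up unit symbols.
toy_pps = {("f", "f"): 1.0, ("a", "a"): 0.9, ("D", "d"): 0.5}
score = phonetic_similarity(["f", "a", "D", "R"], ["f", "a", "d"], toy_pps)
print(round(score, 2))  # 2.4
```

With all PPS scores set to 1 for identical units and 0 otherwise, this reduces to the ordinary longest common subsequence length.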

14 Learning Phonetic Similarity: The Widrow-Hoff Algorithm
- The Widrow-Hoff algorithm is used to learn the PPS table.
- $Y_k$: the similarity score of training example k.
- $Z_k$: 1 for a positive training example, 0 for a negative example.
- $U_{k,i,j}$: a binary variable indicating that example k involves the phoneme unit pair of English unit i and Chinese unit j.
- $V_{i,j}$: the score in PPS table V, where i and j refer to a specific English and Chinese phoneme unit.
- $m_a$ (English) and $m_b$ (Chinese): the number of phoneme units.
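The update formula is not present in this transcript. The textbook Widrow-Hoff (least mean squares) rule, written with the symbols defined above, is $V_{i,j} \leftarrow V_{i,j} - 2\eta\,(Y_k - Z_k)\,U_{k,i,j}$. The sketch below applies that rule under two assumptions that the slide does not confirm: the score $Y_k$ is the sum of the table entries of the unit pairs involved in example k, and $\eta$ is a hypothetical learning rate.

```python
def widrow_hoff_step(V, pairs, z, eta=0.1):
    """One Widrow-Hoff update of the PPS table V (a dict), in place.

    pairs: the set of (i, j) unit pairs with U[k][i][j] == 1 for this example.
    z:     the label Z_k (1 for a positive example, 0 for a negative one).
    Returns the score Y_k computed before the update.
    """
    y = sum(V.get(p, 0.0) for p in pairs)   # ASSUMED linear score
    for p in pairs:                          # only involved entries change
        V[p] = V.get(p, 0.0) - 2 * eta * (y - z)
    return y


table = {}
for _ in range(2):                           # present the same positive pair twice
    widrow_hoff_step(table, {("f", "f")}, z=1)
print(round(table[("f", "f")], 2))  # 0.36
```

Each positive example pulls the entries it involves up toward 1, and each negative example pushes them down toward 0, in proportion to the current error.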

15 Learning Phonetic Similarity: The Exponentiated-Gradient Algorithm
- EG requires that the elements of V are nonnegative and sum to 1.
- Each element of V is divided by $\max_{i,j}(V_{i,j})$.
- The update rule uses a learning rate $\kappa > 0$ and a normalization expression $\Psi$, which is the sum of the updated $V_{i,j}$.
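The update formula is not present in this transcript. The textbook exponentiated-gradient rule multiplies each entry by an exponential of the negative gradient and then renormalizes: $V_{i,j} \leftarrow V_{i,j}\,\exp(-2\kappa\,(Y_k - Z_k)\,U_{k,i,j}) / \Psi$. The sketch below follows that rule. It assumes the score $Y_k$ is the sum of the table entries involved in example k, and that $Z_k$ and $U_{k,i,j}$ are the training label and binary indicator used for the Widrow-Hoff algorithm. None of these details are confirmed by the slide.

```python
import math


def eg_step(V, pairs, z, kappa=0.1):
    """One exponentiated-gradient update; returns a new normalized table.

    V:     dict of nonnegative scores that sum to 1.
    pairs: the set of (i, j) unit pairs involved in this example (U == 1).
    z:     the label Z_k (1 positive, 0 negative).
    """
    y = sum(V[p] for p in pairs)                      # ASSUMED linear score
    factor = math.exp(-2 * kappa * (y - z))           # multiplicative step
    updated = {p: (s * factor if p in pairs else s) for p, s in V.items()}
    psi = sum(updated.values())                       # normalization term
    return {p: s / psi for p, s in updated.items()}


table = {("f", "f"): 0.5, ("f", "b"): 0.5}            # hypothetical entries
table = eg_step(table, {("f", "f")}, z=1)
print(table[("f", "f")] > table[("f", "b")])  # True
```

Because the update is multiplicative and followed by division by $\Psi$, the entries stay nonnegative and keep summing to 1, which is the constraint EG requires.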

16 Learning Phonetic Similarity: The Genetic Algorithm
- Objective function
- Initial population
- Fitness function
- Selection
- Crossover
- Mutation

17 Experiments on Named Entity Matching Model
- 20,000 Chinese-English person name pairs are used as training data.
- 2,000 person name pairs, different from the training data, are used to evaluate the learning performance.
- The average reciprocal rank (ARR) is used to measure the performance.
- Manual: 0.78
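For reference, average reciprocal rank is usually computed as the mean of 1/rank over all test queries, where rank is the 1-based position of the correct translation in the ranked candidate list. The short sketch below shows that calculation. The function name and the example ranks are hypothetical, and the slide does not say how queries with no correct candidate are scored.

```python
def average_reciprocal_rank(ranks):
    """Mean of 1/rank, where each rank is the 1-based position of the correct answer."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three hypothetical test names whose correct translations ranked 1st, 2nd and 4th.
print(round(average_reciprocal_rank([1, 2, 4]), 4))  # 0.5833
```

An ARR of 1.0 means the correct translation is always ranked first, so higher is better.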

18 Experiments on Named Entity Matching Model
- The performance of the overall named entity matching model is evaluated.
- 1,000 named entities from the same corpus are used.

19 Mining New Entity Translations From News
- Bilingual comparable news: online daily Web news stories.
- Goal: to discover new, unseen named entities.

20 Mining New Entity Translations From News

21 Experiments on Mining New Translations

22 Experiments on Mining New Translations

23 Conclusions
- A novel named entity matching model that considers both semantic and phonetic clues.
- Three learning algorithms for training the phonetic similarity information.
- Flexible and comprehensive: the hybrid model can handle named entity matching.
- Bilingual comparable news (online daily Web news stories) is used to discover new, unseen named entities.
