Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)
Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations
Advisor: Dr. Hsu
Graduate: Kuo-min Wang
Authors: Wai Lam, Ruizhang Huang, Pik-Shan Cheung
2004 ACM

Outline
- Motivation
- Objective
- Introduction
- Named entity matching model
- Phonetic matching model
- Learning phonetic similarity
- Experiments on named entity matching model
- Mining new entity translations from news
- Experiments on mining new translations
- Conclusions
- Personal Opinion

Motivation
Many existing systems that handle cross-language documents rely on bilingual dictionaries. These systems use a fixed dictionary throughout the process, so only terms that exist in the dictionary can be handled.

Objective
We propose a novel named entity matching model that considers both semantic and phonetic clues. We also develop a mining framework for discovering new, unseen named entity translations from online daily web news.

Introduction
Many existing systems that handle cross-language documents have difficulty processing new or unseen terms, which are especially common among named entities.
The proposed approach exploits similarity at the phoneme level.
We investigate three learning algorithms for obtaining the similarity information of basic phoneme units from a set of training data.

Introduction (cont.)
The mining framework identifies comparable news in different languages using an unsupervised learning technique and an existing bilingual dictionary.
A major advantage of the proposed approach is that it analyzes both semantic and phonetic information and formulates the problem as a number of optimization models.

Named entity matching model
The objective of our named entity matching model is to compute the similarity between two given named entities written in two languages.
[Figure: named entity matching model. Labels: input a pair of entities; tokenization process (1. looked up, 2. generated, 3. scanning, 4. group adjacent terms not involved in the dictionary mapping); LDC; a set of Chinese entity translations; hybrid semantic and phonetic matching; phonetic matching model; find matched Chinese entity.]

Named entity matching model (cont.)
Problem nature: given a pair of named entities that are translations of each other, it is common for part of the entity to be matched on semantic clues and the remaining part on phonetic clues.
Example: English entity “University of Akron”, Chinese entity “阿克倫大學”.
Semantic clues: the term “University” matches “大學”.
Phonetic clues: the term “Akron” matches “阿克倫”.

Named entity matching model (cont.)
Matching model investigation
An English entity E is represented by terms [expression lost in extraction] and a Chinese entity C is represented by Chinese characters [expression lost in extraction].
Let the matched word segments be represented as [expression lost in extraction].
Let the phonetically matched word segments be represented as [expression lost in extraction].

Named entity matching model (cont.)
The objective is to find a set of mappings between English terms and Chinese word segments such that the total weight is maximized.
[Figure: example mapping between the English terms E and the Chinese entity C.]

Named entity matching model (cont.)
Tokenization
If the degree of this maximal matching reaches or exceeds a certain threshold, the word segment is treated as a separate token.
Example: “Commerce” matches the term “商”, with degree p = length(商) / length(商業) = 1/2 = 0.5.
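
The degree in the example above can be reproduced with a small sketch. It assumes, based only on the "商 (1) / 商業 (2)" example, that the degree is the length of the longest part of the dictionary translation found in the Chinese entity, divided by the length of the full translation. The function names are hypothetical and not taken from the paper.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest substring of `a` that also occurs in `b`."""
    best = ""
    for i in range(len(a)):
        for j in range(i + 1, len(a) + 1):
            if j - i > len(best) and a[i:j] in b:
                best = a[i:j]
    return best


def matching_degree(translation: str, entity: str) -> float:
    """Assumed definition: matched length / full dictionary translation length."""
    if not translation:
        return 0.0
    return len(longest_common_substring(translation, entity)) / len(translation)


# "Commerce" has the dictionary translation 商業; only 商 occurs in the entity.
degree = matching_degree("商業", "帕洛阿爾托商會")
print(degree)
```

A threshold on this value would then decide whether "商" becomes its own token; the slides do not state the threshold.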

Named entity matching model (cont.)
Tokenization
Group adjacent terms that are not involved in the dictionary mapping.
Example: 帕洛阿爾托商會
“Palo” and “Alto” are grouped into “Palo Alto”; “帕洛阿爾托” is a single token.
Chinese tokens: “帕洛阿爾托”, “商”, “會”
English tokens: “Palo Alto”, “Chamber”, “of”, “Commerce”
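
The grouping step can be sketched as follows. The sketch assumes that the set of dictionary-mapped terms is already known ("Chamber" and "Commerce" on the English side, "商" and "會" on the Chinese side are assumptions chosen to reproduce the token lists in the example), and `group_unmapped` is a hypothetical helper name.

```python
def group_unmapped(terms, mapped, sep=" "):
    """Merge runs of adjacent terms that have no dictionary mapping."""
    tokens, run = [], []
    for term in terms:
        if term in mapped:
            if run:  # close the current run of unmapped terms
                tokens.append(sep.join(run))
                run = []
            tokens.append(term)
        else:
            run.append(term)
    if run:
        tokens.append(sep.join(run))
    return tokens


english = group_unmapped(
    ["Palo", "Alto", "Chamber", "of", "Commerce"],
    mapped={"Chamber", "Commerce"},
)
chinese = group_unmapped(list("帕洛阿爾托商會"), mapped={"商", "會"}, sep="")
print(english)
print(chinese)
```

Note that "of" stays a separate token because its neighbors on both sides are mapped terms.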

Named entity matching model (cont.)
Hybrid Semantic and Phonetic Matching Algorithm
We can formulate the matching problem via an undirected bipartite weighted graph with vertex set V and edge set L.
Let the English entity, E, be represented as tokens $e_1, \dots, e_m$ and the Chinese entity, C, be represented as tokens $c_1, \dots, c_n$.
V is set to $V_E \cup V_C$, where $V_E = \{e_1, \dots, e_m\}$ and $V_C = \{c_1, \dots, c_n\}$.
Edge construction process: start with the semantic mapping, then consider the phonetic mapping between tokens.

Named entity matching model (cont.)
Hybrid Semantic and Phonetic Matching Algorithm
The edge construction process: first, consider the semantic mapping as described in the tokenization process. Next, consider the phonetic mapping between tokens.
[Figure: bipartite graph linking English tokens and Chinese tokens, with edge weight $u(e_i, c_j)$.]
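
To make the maximum-weight mapping concrete, here is a minimal brute-force sketch over a small weight matrix whose entry `u[i][j]` plays the role of $u(e_i, c_j)$. The slides do not say which matching algorithm the authors use, so exhaustive search is only a stand-in that is practical for the handful of tokens in a named entity. The weights below are made up for illustration, and a weight of 0 is assumed to mean "no edge".

```python
from itertools import permutations


def best_matching(u):
    """Return (total_weight, pairs) of a maximum-weight one-to-one mapping.

    u[i][j] is the weight between English token i and Chinese token j.
    """
    m, n = len(u), len(u[0])
    if m > n:
        # Transpose so that the rows are the smaller side, then map back.
        total, pairs = best_matching([list(col) for col in zip(*u)])
        return total, sorted((i, j) for j, i in pairs)
    best_total, best_pairs = 0.0, []
    for cols in permutations(range(n), m):
        pairs = [(i, j) for i, j in enumerate(cols) if u[i][j] > 0]
        total = sum(u[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs


english_tokens = ["University", "of", "Akron"]
chinese_tokens = ["阿克倫", "大學"]
# Hypothetical weights: 1.0 for the semantic edge, 0.8 for the phonetic edge.
u = [
    [0.0, 1.0],  # University
    [0.0, 0.0],  # of
    [0.8, 0.0],  # Akron
]
total, pairs = best_matching(u)
for i, j in pairs:
    print(english_tokens[i], "<->", chinese_tokens[j])
```

For larger token sets a polynomial-time assignment solver would be the usual choice; the brute force here is only meant to show the objective.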

Phonetic matching model
Generate a phonetic representation for each term, using the Pin-Yin table for Mandarin (北京話) and the Jyut-Ping table for Cantonese (廣東話).
Generate basic phoneme units, which are compared through the English-Mandarin PPS table and the English-Cantonese PPS table.
Examples: “港” = “gang3”; “爸” = “baa1”; “Beckham” → “bE kx m”; “貝克漢姆” → “bei ke kx m”.
A basic phoneme unit consists of a consonant followed by a vowel. If there is no consonant-vowel pattern, we extract the consonant. If there is no consonant, the vowel is extracted.
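
The three extraction rules can be sketched as a greedy left-to-right scan. This is a simplification under stated assumptions: each ASCII letter is treated as one phoneme, the vowel set is just `aeiou`, and tone digits are assumed to have been stripped beforehand. The real phonetic alphabets used in the slides (for example "bE kx m") are richer than this.

```python
VOWELS = set("aeiou")  # assumed vowel inventory, for illustration only


def basic_phoneme_units(phonemes: str) -> list[str]:
    """Split a phoneme string into consonant+vowel units where possible,
    otherwise into single consonants or single vowels."""
    units, i = [], 0
    while i < len(phonemes):
        current = phonemes[i]
        following = phonemes[i + 1] if i + 1 < len(phonemes) else ""
        if current not in VOWELS and following in VOWELS:
            units.append(current + following)  # consonant followed by vowel
            i += 2
        else:
            units.append(current)  # lone consonant or lone vowel
            i += 1
    return units


print(basic_phoneme_units("akron"))
print(basic_phoneme_units("akelun"))
```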

Phonetic matching model (cont.)
Phonetic Matching Algorithm
We prepare a phoneme pronunciation similarity (PPS) table capturing the pronunciation similarity value between each possible English-Chinese basic phoneme unit pair.
Suppose an English term, A, is represented by a basic phoneme unit sequence [expression lost in extraction], and a Chinese term, B, is represented by a basic phoneme unit sequence [expression lost in extraction].
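
The scoring formula for A and B did not survive in this transcript, so the following is only one plausible way to use a PPS table, not the authors' algorithm. It assumes an order-preserving alignment of the two unit sequences that maximizes the summed PPS values, normalized by the longer sequence length. The table entries are invented for illustration.

```python
def phonetic_similarity(a_units, b_units, pps):
    """Best order-preserving alignment score, divided by the longer length.

    pps maps (english_unit, chinese_unit) to a similarity value;
    missing pairs are assumed to have similarity 0.
    """
    m, n = len(a_units), len(b_units)
    if m == 0 or n == 0:
        return 0.0
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = pps.get((a_units[i - 1], b_units[j - 1]), 0.0)
            best[i][j] = max(
                best[i - 1][j],             # skip an English unit
                best[i][j - 1],             # skip a Chinese unit
                best[i - 1][j - 1] + pair,  # align the two units
            )
    return best[m][n] / max(m, n)


# Hypothetical PPS entries for "Akron" vs. a romanization of 阿克倫.
pps_table = {("a", "a"): 1.0, ("k", "ke"): 0.8, ("ro", "lu"): 0.6, ("n", "n"): 1.0}
score = phonetic_similarity(["a", "k", "ro", "n"], ["a", "ke", "lu", "n"], pps_table)
print(round(score, 2))
```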

Learning phonetic similarity
We investigate several learning algorithms for obtaining the similarity values in the PPS table using a set of training data.
The goal is to obtain V such that this similarity score is as high as possible for each correct name pair.

Learning phonetic similarity (cont.)
The Widrow-Hoff Algorithm
Consider the difference between the computed similarity score $Y_k$ and the actual one $Z_k$ for the k-th name pair.
Training terminates when the performance of the latest trained PPS table has not improved for three full iterations.
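
For reference, the textbook Widrow-Hoff (least mean squares) rule moves each weight against the gradient of the squared error $(Y_k - Z_k)^2$. The sketch below applies that generic rule; it assumes the score is linear in the table entries, with `v` holding the PPS values and `x` counting how often each phoneme-unit pair is used for the k-th name pair. The learning rate `eta` and this encoding are assumptions, and the update actually used in the paper may differ.

```python
def widrow_hoff_step(v, x, z, eta=0.1):
    """One generic Widrow-Hoff update: v <- v - 2 * eta * (y - z) * x."""
    y = sum(vi * xi for vi, xi in zip(v, x))  # computed score Y_k
    return [vi - 2 * eta * (y - z) * xi for vi, xi in zip(v, x)]


v = [0.5, 0.5]   # two hypothetical PPS table entries
x = [1.0, 0.0]   # only the first entry is used by this name pair
updated = widrow_hoff_step(v, x, z=1.0)
print(updated)
```

Only the entries that contribute to the score are changed, and they move toward reducing the gap between $Y_k$ and $Z_k$.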

Learning phonetic similarity (cont.)
The Exponentiated-Gradient Algorithm
It processes one training name pair at a time and updates the PPS table entries immediately.
Let [expression lost in extraction]. We define [expression lost in extraction] as: [expression lost in extraction].
The updating formula is given by: [formula lost in extraction].
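
Since the update formula is missing here, the sketch below shows the standard exponentiated-gradient update for squared loss as a point of reference: each weight is multiplied by an exponential factor and the weights are renormalized to sum to one. Whether the paper normalizes in this way, and the encoding of `v` and `x` (PPS entries and pair counts), are assumptions.

```python
import math


def exponentiated_gradient_step(v, x, z, eta=0.1):
    """One standard EG update with normalization (weights sum to 1)."""
    y = sum(vi * xi for vi, xi in zip(v, x))  # computed score
    factors = [math.exp(-2 * eta * (y - z) * xi) for xi in x]
    unnormalized = [vi * fi for vi, fi in zip(v, factors)]
    total = sum(unnormalized)
    return [ui / total for ui in unnormalized]


v = [0.5, 0.5]   # hypothetical PPS entries, starting uniform
x = [1.0, 0.0]   # only the first entry is used by this name pair
updated = exponentiated_gradient_step(v, x, z=1.0)
print(updated)
```

In contrast to the additive Widrow-Hoff step, the change here is multiplicative, so weights stay positive.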

Learning phonetic similarity (cont.)
The Genetic Algorithm
One way to view the learning problem is to formulate it as an optimization problem as follows: [formulation lost in extraction].
Each gene in a chromosome corresponds to a particular element in the table.
[Figure: chromosome shown as bit strings, e.g. 0100101001 1010011010 …]

Experiments on named entity matching model (cont.)
The first set of experiments evaluates the phonetic similarity learning.
The second set of experiments evaluates the performance of the overall named entity matching model.
The average reciprocal rank (ARR) is used to measure the performance as follows: [formula lost in extraction].
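
The ARR formula is not preserved in this transcript. The usual definition averages the reciprocal of the rank at which the correct translation appears, over all test entities, and the sketch below follows that definition. The ranks are made-up values, and 1-based ranking is assumed.

```python
def average_reciprocal_rank(ranks):
    """Mean of 1/rank over all test entities (ranks are 1-based)."""
    if not ranks:
        raise ValueError("at least one rank is required")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the correct Chinese translation for three English entities.
arr = average_reciprocal_rank([1, 2, 4])
print(round(arr, 4))
```

An ARR of 1.0 means the correct translation is always ranked first; lower values mean it tends to appear further down the list.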

Mining new entity translations from news
[Figure: system architecture.]

Mining new entity translations from news (cont.)
News Preprocessing
Let S be a news story. The story representation comprises four components: the people name component $R_p(S)$, the place name component $R_l(S)$, the organization name component $R_o(S)$, and the content term component $R_c(S)$.

Mining new entity translations from news (cont.)
Gloss Translation
For each Chinese term, we look up its English translation in a bilingual lexicon. The translated English terms replace the original Chinese terms in the story representation. Term weights are computed so that more likely translations receive higher weights.

Mining new entity translations from news (cont.)
Event Discovery
An event is also represented by a four-dimensional vector similar to the story representation. Nearest neighbor clustering is used for processing the stories. We use a kind of cosine similarity measure to compute the similarity between an event and a story.
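
A rough sketch of event discovery under stated assumptions: each story and event holds four term-weight bags (people, place, organization, content), the event-story similarity is the plain average of the four component cosines, and a story joins its most similar event when the similarity reaches a threshold, otherwise it starts a new event. The slides only say "a kind of cosine similarity", so the averaging, the threshold value, and the way an event absorbs a story are all illustrative choices.

```python
import math
from collections import Counter

COMPONENTS = ("people", "place", "organization", "content")


def make_story(people=(), place=(), organization=(), content=()):
    """Build the four-component representation from term lists."""
    return {
        "people": Counter(people),
        "place": Counter(place),
        "organization": Counter(organization),
        "content": Counter(content),
    }


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity(event, story):
    """Assumed combination: unweighted mean of the component cosines."""
    return sum(cosine(event[c], story[c]) for c in COMPONENTS) / len(COMPONENTS)


def discover_events(stories, threshold=0.3):
    """Nearest neighbor clustering; returns story indices grouped by event."""
    events = []
    for index, story in enumerate(stories):
        scored = [(similarity(ev["rep"], story), k) for k, ev in enumerate(events)]
        best_sim, best_k = max(scored, default=(0.0, -1))
        if best_k >= 0 and best_sim >= threshold:
            for c in COMPONENTS:
                events[best_k]["rep"][c].update(story[c])  # add term counts
            events[best_k]["members"].append(index)
        else:
            rep = {c: Counter(story[c]) for c in COMPONENTS}
            events.append({"rep": rep, "members": [index]})
    return [ev["members"] for ev in events]


stories = [
    make_story(people=["beckham"], place=["madrid"], content=["football", "transfer"]),
    make_story(people=["beckham"], place=["madrid"], content=["football", "match"]),
    make_story(people=["greenspan"], place=["washington"], content=["interest", "rate"]),
]
print(discover_events(stories))
```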

Mining new entity translations from news (cont.)
Named entity cognate generation
The candidate weight is designed to reflect the importance of the name in the corresponding event.
For a particular English cognate G, the cognate weight, $n(p_l)$, of each people name $p_l$ in the English named entity cognate is calculated as follows: [formula lost in extraction].

Mining new entity translations from news (cont.)
Entity Matching
The matching makes use of the named entity matching model as well as the cognate weight.
For a given Chinese name, the corresponding English names in the cognate are returned, sorted in descending order of their final similarity scores.

Experiments on mining new translations

Conclusions
We have developed a novel named entity matching model that considers both semantic and phonetic information.
The experimental results show that our hybrid model can handle named entity matching in a more flexible and comprehensive way.
We have also applied our named entity matching model to mining new, unseen named entity translations. Translations not found in the dictionary can be discovered effectively.

Personal Opinion
…
