Cluster-specific Named Entity Transliteration Fei Huang HLT/EMNLP 2005.

Introduction NEs may originate from various languages. If we use one generalized model to transliterate names of every origin, we might miss some important language-specific features. For example: 藍道夫 Randolph; 桑多夫 Sundov. Here the same final character 夫 corresponds to "ph" in one name and to "v" in the other. No previous study had dealt with this issue.

Introduction Ideally, we should construct a transliteration model and a language model for each origin. However, some origins lack enough name translation pairs for reliable model training. So, a cluster-specific NE transliteration framework is proposed, motivated by the observation that several origins from the same language family may share similar transliteration patterns.

Name Clustering A set of NEs from origin i: from which the following model is trained: The distance between origin i and origin j can be symmetrically defined as: where, assuming name pairs are generated independently,

Name Clustering These origins are clustered with group-average agglomerative clustering. The distance between clusters is defined as the average distance between all origin pairs in each cluster. Set each origin as a single cluster. Recursively merge the closest cluster pair into one cluster until an optimal number of clusters is formed.
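The merging procedure on this slide can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name, the `dist` callback and the fixed `num_clusters` stopping point are assumptions, and "average distance between all origin pairs" is read here as the average over cross-cluster pairs.

```python
from itertools import combinations

def group_average_clustering(origins, dist, num_clusters):
    """Merge the closest cluster pair until num_clusters remain.

    origins: list of hashable origin labels.
    dist: function (origin_a, origin_b) -> float, assumed symmetric.
    """
    clusters = [frozenset([o]) for o in origins]  # each origin starts alone

    def cluster_distance(a, b):
        # Average distance over all cross-cluster origin pairs (assumption).
        return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > num_clusters:
        # Find the closest pair of clusters and replace it by its union.
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters
```

In the framework described here the stopping point is not fixed in advance but chosen by held-out perplexity, so in practice one would run the loop for each candidate `num_clusters`.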

Name Clustering Among all possible cluster configurations, we select the optimal cluster number based on the model perplexity. Given a held-out data set L, a list of name translation pairs from different origins, the probability of generating L from a cluster configuration Θ_ω is the product of the probabilities of generating each name pair from its most likely origin cluster. We calculate the language model perplexity for each configuration, and the model configuration with the smallest perplexity is selected.
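A small sketch of this selection step is given below. The slide's exact perplexity formula is not preserved in the transcript, so the normalisation used here (exponentiated negative average log-probability per name pair) is an assumption, as are all function names and the input layout.

```python
import math

def config_log_prob(pair_scores):
    """pair_scores: one list per held-out name pair, holding that pair's
    log-probability under each cluster of the configuration."""
    # Each pair is scored by its most likely cluster; the log of the
    # product over pairs is the sum of these maxima.
    return sum(max(scores) for scores in pair_scores)

def perplexity(pair_scores):
    # Assumption: normalise by the number of name pairs.
    return math.exp(-config_log_prob(pair_scores) / len(pair_scores))

def select_configuration(configs):
    """configs: dict mapping a configuration name to its pair_scores."""
    return min(configs, key=lambda name: perplexity(configs[name]))
```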

Name Clustering

Name Classification A Bayesian classifier based on N-gram source character language models is trained. Names are assigned to the cluster with the highest LM probability.
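The classification rule can be illustrated with a toy character N-gram language model per cluster. This is only a sketch: the class name, the add-one smoothing, the `^`/`$` padding symbols and the uniform cluster prior are assumptions, since the slide does not specify them.

```python
import math
from collections import Counter

class CharNgramLM:
    """Character N-gram LM with add-one smoothing (smoothing is an assumption)."""

    def __init__(self, names, n=2):
        self.n = n
        self.ngrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for name in names:
            padded = "^" * (n - 1) + name + "$"  # start padding and end marker
            self.vocab.update(padded)
            for i in range(n - 1, len(padded)):
                self.ngrams[padded[i - n + 1 : i + 1]] += 1
                self.contexts[padded[i - n + 1 : i]] += 1

    def log_prob(self, name):
        padded = "^" * (self.n - 1) + name + "$"
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            gram = padded[i - self.n + 1 : i + 1]
            total += math.log((self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + v))
        return total

def classify(name, models):
    """models: dict mapping cluster id -> CharNgramLM. With a uniform prior,
    the Bayesian decision reduces to the highest LM probability."""
    return max(models, key=lambda cluster: models[cluster].log_prob(name))
```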

Transliteration Model Before applying a transliteration model, we must first find the transliteration units, because transliteration is not a one-to-one mapping between characters. Source transliteration phrases are identified with a character collocation likelihood ratio test (Manning and Schütze 1999), in which c_1, c_2 and c_12 are the frequencies of f_1, f_2 and the bigram f_1^f_2, and N is the total number of characters.
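Since the slide's formula is not preserved in this transcript, the sketch below uses the standard binomial likelihood ratio for collocations as given in Manning and Schütze, returning -2 log λ. Whether the presentation uses exactly this form is an assumption, and the function names are illustrative.

```python
import math

def _xlog(count, prob):
    # count * log(prob), with the convention 0 * log(0) = 0
    return 0.0 if count == 0 else count * math.log(prob)

def _log_l(k, n, x):
    # log of the binomial likelihood x^k * (1 - x)^(n - k)
    return _xlog(k, x) + _xlog(n - k, 1.0 - x)

def likelihood_ratio(c1, c2, c12, n):
    """-2 log lambda for the adjacent character pair (f1, f2).

    c1, c2: frequencies of f1 and f2; c12: frequency of the bigram f1 f2;
    n: total number of characters. Assumes 0 < c2 < n and 0 < c1 < n.
    """
    p = c2 / n                  # P(f2), independence hypothesis
    p1 = c12 / c1               # P(f2 | f1)
    p2 = (c2 - c12) / (n - c1)  # P(f2 | not f1)
    log_lambda = (
        _log_l(c12, c1, p) + _log_l(c2 - c12, n - c1, p)
        - _log_l(c12, c1, p1) - _log_l(c2 - c12, n - c1, p2)
    )
    return -2.0 * log_lambda
```

Larger values indicate that f_2 follows f_1 far more (or less) often than independence would predict, which is what makes the pair a candidate transliteration phrase.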

Transliteration Model The likelihood ratios for any adjacent source character pairs are calculated. Those pairs whose ratios are higher than a predefined threshold are selected. Adjacent character bigrams with one character overlap can be recursively concatenated to form longer source transliteration phrases. All these phrases and single characters are combined to construct a cluster-specific phrase segmentation vocabulary list, T.
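One way to read the concatenation step is sketched below: within each name, a maximal run of adjacent selected bigrams is joined into one longer phrase. This interpretation, the function name and the input format (a set of bigrams that already passed the threshold) are assumptions.

```python
def build_vocabulary(names, selected_bigrams):
    """Build a phrase segmentation vocabulary T from single characters,
    selected bigrams, and longer phrases formed by chaining bigrams."""
    vocab = set()
    for name in names:
        vocab.update(name)  # every single character is a unit
        start = 0
        for i in range(1, len(name) + 1):
            # Extend the run while the next adjacent bigram is selected.
            if i < len(name) and name[i - 1 : i + 1] in selected_bigrams:
                continue
            if i - start >= 2:
                vocab.add(name[start:i])  # maximal chained phrase
            start = i
    return vocab | set(selected_bigrams)
```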

Transliteration Model For each name pair in that cluster:
1. Segment the Chinese character sequence into a source transliteration unit sequence based on maximum string matching using T.
2. Convert the Chinese characters into their romanized form, pinyin, then align the pinyin with English letters via phonetic string matching, as described in (Huang et al., 2003).
3. Identify the initial phrase alignment path based on the character alignment path.
4. Apply a beam search around the initial phrase alignment path, searching for the optimal alignment, i.e. the one that minimizes the overall phrase alignment cost. The alignment cost D is defined as the linear interpolation of the phonetic transliteration cost and the semantic translation cost.
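Step 1 can be illustrated with a greedy forward longest-match segmenter. The slide says only "maximum string matching", so the forward, greedy variant and the function name are assumptions.

```python
def max_match_segment(name, vocab):
    """Segment name into units from vocab, always taking the longest match.

    Characters missing from vocab fall back to single-character units.
    """
    max_len = max(map(len, vocab), default=1)
    units, i = [], 0
    while i < len(name):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(name) - i), 0, -1):
            piece = name[i : i + size]
            if size == 1 or piece in vocab:
                units.append(piece)
                i += size
                break
    return units
```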

Transliteration Model

A source NE is segmented into a sequence of transliteration units, and each source unit is associated with a set of target candidate translations with corresponding probabilities. A transliteration lattice is constructed to generate all transliteration hypotheses, among which the one with the minimum transliteration and language model costs is selected as the final hypothesis.
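The lattice search can be sketched as a Viterbi-style dynamic program. This is a simplified illustration: the bigram-style `lm_cost` callback, the use of negative log-probability as transliteration cost, and the data layout are all assumptions rather than details taken from the presentation.

```python
import math

def decode(lattice, lm_cost):
    """Return the minimum-cost target sequence.

    lattice: list with one non-empty dict {candidate: probability} per source unit.
    lm_cost: function (previous_candidate_or_None, candidate) -> float cost.
    """
    # best[c] = (total cost, hypothesis) over all paths ending in candidate c
    best = {None: (0.0, [])}
    for candidates in lattice:
        step = {}
        for cand, prob in candidates.items():
            trans_cost = -math.log(prob)  # transliteration cost (assumption)
            cost, path = min(
                ((prev_cost + lm_cost(prev, cand), prev_path)
                 for prev, (prev_cost, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[cand] = (cost + trans_cost, path + [cand])
        best = step
    return min(best.values(), key=lambda item: item[0])[1]
```

Keeping only the best path per ending candidate is exact when the language model cost depends on the previous unit alone; a longer-context model would need a larger state.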

Experiment Settings 62K Chinese-English person name translation pairs provided by the LDC are selected for experiments. They are divided into three parts: system training (90%), development (5%) and testing (5%). In the development and test data, names from each cluster follow the same distribution as in the training data.
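A split that keeps the per-cluster distribution the same across the three parts can be obtained by splitting each cluster separately. The sketch below is one hypothetical way to do it; the function name, the `cluster_of` callback and the fixed seed are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(pairs, cluster_of, fractions=(0.9, 0.05, 0.05), seed=0):
    """Split name pairs into train/dev/test, cluster by cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for pair in pairs:
        by_cluster[cluster_of(pair)].append(pair)
    train, dev, test = [], [], []
    for members in by_cluster.values():
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_dev = round(len(members) * fractions[1])
        train += members[:n_train]
        dev += members[n_train : n_train + n_dev]
        test += members[n_train + n_dev :]  # remainder goes to test
    return train, dev, test
```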

NE Classification Evaluation 45 cluster-specific N-gram source character language models are trained. The classification accuracy is evaluated on a held-out test set with 3K NE pairs, and different values of N are tried. Some classification errors are due to the inherent uncertainty of some names, e.g. “駱家輝 (Gary Locke)”.

NE Transliteration Evaluation
