An Information Theoretic Approach to Bilingual Word Clustering Manaal Faruqui & Chris Dyer Language Technologies Institute SCS, CMU.


1 An Information Theoretic Approach to Bilingual Word Clustering Manaal Faruqui & Chris Dyer Language Technologies Institute SCS, CMU

2 Word Clustering Grouping of words that captures syntactic, semantic, and distributional regularities. Example clusters: {Iran, USA, India, Paris, London}, {11, 13.4, 22,000, 100}, {play, laugh, eat, run, fight}, {good, nice, better, awesome, cool}

3 Bilingual Word Clustering
What? Clustering words of two languages simultaneously, inducing a dependence between the two clusterings.
Why? To obtain better clusterings (hypothesis).
How? By using cross-lingual information.

4 Bilingual Word Clustering Assumption: Aligned words convey information about their respective clusters

5 Bilingual Word Clustering Existing: monolingual models. Proposed: monolingual + bilingual hints.

6 Related Work
Bilingual word clustering (Och, 1999): language-model-based objective for the monolingual component; word-alignment count-based similarity function for the bilingual component.
Linguistic structure transfer (Täckström et al., 2012): maximize the correspondence between clusters of aligned words; alternate optimization of the monolingual and bilingual objectives; clustering of only the top 1 million words.
Also related: POS tagging (Snyder & Barzilay, 2010), word sense disambiguation (Diab, 2003), bilingual graph-based projections (Das and Petrov, 2011).

7 Monolingual Objective (Brown et al., 1992)
P(S; C) = P(c1) · P(w1|c1) · P(c2|c1) · P(w2|c2) · …
H(S; C) = E_C [ −log P(S; C) ]
Maximize the likelihood of the word sequence given the clustering, i.e., minimize the entropy (surprisal) of the word sequence given the clustering. [Diagram: HMM with cluster states c1 … c4 emitting words w1 … w4.]
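The class-based sequence likelihood above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the toy probability tables in the usage example are my own.

```python
import math

def monolingual_log_likelihood(words, cluster_of, p_emit, p_trans, p_start):
    """log P(S; C) for a class-based bigram model (Brown et al., 1992 style).

    cluster_of: word -> cluster id
    p_emit:     (word, cluster) -> P(word | cluster)
    p_trans:    (prev_cluster, cluster) -> P(cluster | prev_cluster)
    p_start:    cluster -> P(cluster) for the first position
    """
    c_prev = cluster_of[words[0]]
    ll = math.log(p_start[c_prev]) + math.log(p_emit[(words[0], c_prev)])
    for w in words[1:]:
        c = cluster_of[w]
        ll += math.log(p_trans[(c_prev, c)]) + math.log(p_emit[(w, c)])
        c_prev = c
    return ll

# Toy example: two clusters, deterministic transitions.
cluster_of = {"the": 0, "dog": 1, "cat": 1}
p_start = {0: 1.0}
p_trans = {(0, 1): 1.0}
p_emit = {("the", 0): 1.0, ("dog", 1): 0.5, ("cat", 1): 0.5}
ll = monolingual_log_likelihood(["the", "dog"], cluster_of, p_emit, p_trans, p_start)
```

Minimizing H(S; C) then amounts to searching over cluster assignments for the one that makes this log-likelihood as high as possible.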

8 Bilingual Objective Maximize the information we know about one clustering given another. [Diagram: word alignments linking clusters of Language 1 and Language 2.]

9 Bilingual Objective Minimize the entropy of one clustering given the other. [Diagram: word alignments linking clusters of Language 1 and Language 2.]

10 Bilingual Objective For aligned words x in clustering C and y in clustering D, the association between C_x and D_y can be written as: p(C_x|D_y) + p(D_y|C_x), where p(D_y|C_x) = a / (a + b), with a the number of alignment edges from C_x into D_y and b the number of alignment edges from C_x into other clusters. [Diagram: alignment edges a from C_x to D_y and b from C_x to D_z.]

11 Bilingual Objective: Aligned Variation of Information
Thus, for the two clusterings:
AVI (C, D) = E_(i,j) [ −log p(C_i|D_j) − log p(D_j|C_i) ]
AVI captures the mutual information content of the two clusterings and has distance metric properties:
Non-negative: AVI (C, D) ≥ 0
Symmetric: AVI (C, D) = AVI (D, C)
Triangle inequality: AVI (C, E) ≤ AVI (C, D) + AVI (D, E)
Identity of indiscernibles: AVI (C, D) = 0 iff C ≅ D
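AVI can be sketched directly from alignment edge counts, estimating the conditionals p(C_i|D_j) and p(D_j|C_i) empirically. A minimal sketch with my own helper name; the paper does not prescribe this implementation:

```python
import math
from collections import Counter

def avi(aligned_pairs):
    """Aligned Variation of Information over aligned word pairs.

    aligned_pairs: list of (C_i, D_j) cluster labels, one per alignment edge.
    AVI(C, D) = E_(i,j)[ -log p(C_i|D_j) - log p(D_j|C_i) ],
    with the conditionals estimated from alignment edge counts.
    """
    joint = Counter(aligned_pairs)
    c_marg = Counter(c for c, _ in aligned_pairs)   # edges per cluster in C
    d_marg = Counter(d for _, d in aligned_pairs)   # edges per cluster in D
    n = len(aligned_pairs)
    total = 0.0
    for (c, d), k in joint.items():
        p_c_given_d = k / d_marg[d]
        p_d_given_c = k / c_marg[c]
        total += k * (-math.log(p_c_given_d) - math.log(p_d_given_c))
    return total / n
```

When the two clusterings agree perfectly on every aligned pair, both conditionals are 1 and AVI is 0, matching the identity-of-indiscernibles property above.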

12 Joint Objective
α [ H (C) + H (D) ] + β AVI (C, D)
The monolingual term captures word sequence information; the bilingual term captures cross-lingual information. α, β are the weights of the monolingual and bilingual objectives, respectively.

13 Inference We want to do MAP inference on the factor graph. [Diagram: factor graphs for monolingual, bilingual, and combined monolingual & bilingual word clustering.]

14 Inference: Optimization
Finding the optimal solution is a hard combinatorial problem (Och, 1995).
Greedy hill-climbing word exchange (Martin et al., 1995): transfer each word to the cluster giving the maximum improvement.
Initialization: round-robin assignment based on frequency.
Termination: number of words exchanged < 0.1% of (vocab 1 + vocab 2), after at least 5 complete iterations.
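The word-exchange loop can be sketched as follows. This is a simplified, inefficient sketch with a generic objective callback (the real system would plug in the joint objective above and update it incrementally); all names are mine:

```python
def word_exchange(vocab, assign, num_clusters, objective, min_iters=5, tol=0.001):
    """Greedy hill-climbing word exchange (in the spirit of Martin et al., 1995).

    assign:    dict word -> cluster id, modified in place
    objective: callable(assign) -> float, lower is better
    Terminates once fewer than tol * |vocab| words move in a pass,
    after at least min_iters complete passes.
    """
    it = 0
    while True:
        moved = 0
        for w in vocab:
            orig = assign[w]
            best_c, best_val = orig, objective(assign)
            for c in range(num_clusters):
                if c == orig:
                    continue
                assign[w] = c  # tentatively move w and rescore
                val = objective(assign)
                if val < best_val:
                    best_c, best_val = c, val
            assign[w] = best_c  # keep the best move (possibly no move)
            if best_c != orig:
                moved += 1
        it += 1
        if it >= min_iters and moved < tol * len(vocab):
            return assign
```

Calling `objective` from scratch for every candidate move is what makes the naive version expensive; practical implementations update entropy terms incrementally per exchange.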

15 Evaluation: Named Entity Recognition (NER)
Core information extraction task, very sensitive to word representations.
Word clusters are useful for downstream tasks (Turian et al., 2010) and can be directly used as features for NER: English (Finkel & Manning, 2009), German (Faruqui & Padó, 2010).

16 Data and Tools
German NER training & test data: CoNLL 2003 (220,000 and 55,000 tokens, respectively).
Corpora for clustering: WIT-3 (Cettolo et al., 2012), a collection of TED talks; {Arabic, English, French, Korean, Turkish}–German pairs, with around 1.5 million German tokens for each pair.
Stanford NER for training (Finkel and Manning, 2009), with built-in functionality to use word clusters for generalization.
cdec for unsupervised word alignments (Dyer et al., 2013).

17 Experiments
Joint objective: α [ H (C) + H (D) ] + β AVI (C, D). Baseline: no clusters.
1. Bilingual information only: α = 0, β = 1; objective AVI (C, D).
2. Monolingual information only: α = 1, β = 0; objective H (C) + H (D).
3. Monolingual + bilingual information: α = 1, β = 0.1; objective H (C) + H (D) + 0.1 AVI (C, D).

18 Alignment Edge Filtering
Word alignments are not perfect. We filter out the alignment edge between two words (x, y) if:
2b / ((a + b + c) + (b + d)) ≤ η
where b is the number of alignment links between x and y, a + b + c the total alignment links of x, and b + d the total alignment links of y.
Tuned η for different language pairs: English 0.1, French 0.1, Arabic 0.3, Turkish 0.5, Korean 0.7.
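The filtering criterion is a Dice-style association score over link counts and can be sketched as below. The function and parameter names are my own, not from the paper:

```python
def keep_edge(count_xy, count_x, count_y, eta):
    """Keep the alignment edge (x, y) unless its Dice-style association
    score 2 * count(x, y) / (count(x) + count(y)) is at or below eta.

    count_xy: number of alignment links between x and y
    count_x:  total alignment links of x
    count_y:  total alignment links of y
    """
    return 2.0 * count_xy / (count_x + count_y) > eta

# Per-language thresholds from the slide (applied to each pair with German).
ETA = {"English": 0.1, "French": 0.1, "Arabic": 0.3, "Turkish": 0.5, "Korean": 0.7}
```

A frequently co-aligned pair such as `keep_edge(5, 10, 10, ETA["English"])` survives, while a word pair linked only once among many alignments is dropped.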

19 Results F1 scores of German NER trained using different word clusters, on the training set. [Chart not transcribed.]

20 Results F1 scores of German NER trained using different word clusters, on the test set. [Chart not transcribed.]

21 Ongoing Work: Multilingual Word Clustering [Diagram: extending from monolingual and bilingual to multilingual word clustering.]

22 Ongoing Work Current settings: parallel data only; monolingual data (language 1) + parallel data; monolingual data (language 1) + parallel data + monolingual data (language 2).

23 Conclusion
Novel information-theoretic model for bilingual clustering; the bilingual objective has an intuitive meaning.
Joint optimization of the monolingual + bilingual objective improves clustering quality over monolingual clustering.
Extendable to any number of languages, incorporating both monolingual and parallel data.

24 Thank You!


