An Information Theoretic Approach to Bilingual Word Clustering
Manaal Faruqui & Chris Dyer
Language Technologies Institute, SCS, CMU

Word Clustering
Grouping of words capturing syntactic, semantic and distributional regularities.
[Figure: example clusters — locations: Iran, USA, India, Paris, London; verbs: play, laugh, eat, run, fight; adjectives: good, nice, better, awesome, cool; numbers: 100, …]

Bilingual Word Clustering
What? Clustering the words of two languages simultaneously, inducing a dependence between the two clusterings.
Why? To obtain better clusterings (hypothesis).
How? By using cross-lingual information.

Bilingual Word Clustering
Assumption: aligned words convey information about their respective clusters.

Bilingual Word Clustering
Existing: monolingual models. Proposed: monolingual + bilingual hints.

Related Work
- Bilingual word clustering (Och, 1999): language-model-based objective for the monolingual component; word-alignment count-based similarity function for the bilingual component.
- Linguistic structure transfer (Täckström et al., 2012): maximize the correspondence between clusters of aligned words; alternate optimization of the monolingual and bilingual objectives; clustering of only the top 1 million words.
- POS tagging (Snyder & Barzilay, 2010)
- Word sense disambiguation (Diab, 2003)
- Bilingual graph-based projections (Das and Petrov, 2011)

Monolingual Objective (Brown et al., 1992)
P(S; C) = P(c_1) P(w_1|c_1) * P(c_2|c_1) P(w_2|c_2) * ...
H(S; C) = E[ -log P(S; C) ]
[Figure: class-bigram HMM over clusters c_1 … c_4 emitting words w_1 … w_4]
Maximize the likelihood of the word sequence given the clustering C; equivalently, minimize the entropy (surprisal) of the word sequence given the clustering.
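A minimal sketch (not from the slides) of how this objective can be computed for a fixed hard clustering, with all probabilities estimated as relative frequencies; function and variable names are illustrative:

```python
import math
from collections import Counter

def monolingual_log_likelihood(corpus, cluster_of):
    """log P(S; C) under the Brown-style class-bigram model:
    P(S; C) = P(c_1) P(w_1|c_1) * prod_{i>1} P(c_i|c_{i-1}) P(w_i|c_i).
    `corpus` is a list of token lists; `cluster_of` maps word -> cluster id."""
    cluster_count, bigram_count, word_count = Counter(), Counter(), Counter()
    for sent in corpus:
        prev = None
        for w in sent:
            c = cluster_of[w]
            cluster_count[c] += 1
            word_count[w] += 1
            if prev is not None:
                bigram_count[(prev, c)] += 1
            prev = c
    total = sum(cluster_count.values())
    ll = 0.0
    for sent in corpus:
        prev = None
        for w in sent:
            c = cluster_of[w]
            # Emission P(w|c): each word belongs to exactly one cluster.
            ll += math.log(word_count[w] / cluster_count[c])
            # Transition P(c|c_prev), or the unigram P(c) for the first word.
            if prev is None:
                ll += math.log(cluster_count[c] / total)
            else:
                ll += math.log(bigram_count[(prev, c)] / cluster_count[prev])
            prev = c
    return ll  # H(S; C) is the per-token average of -ll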

Bilingual Objective
Maximize the information we know about one clustering given the other.
[Figure: word alignments between Language 1 and Language 2]

Bilingual Objective
Minimize the entropy of one clustering given the other.
[Figure: word alignments between Language 1 and Language 2]

Bilingual Objective
For aligned words x in clustering C and y in clustering D, the association between the clusters C_x and D_y can be written as:
p(C_x | D_y) + p(D_y | C_x)
where p(D_y | C_x) = a / (a + b), with a the number of alignment edges from C_x into D_y and b the number of edges from C_x into other clusters.
[Figure: alignment edges between clusters C_x, D_y and D_z, labeled a and b]

Bilingual Objective: Aligned Variation of Information (AVI)
Thus, for the two clusterings:
AVI(C, D) = E_{(i,j)}[ -log p(C_i | D_j) - log p(D_j | C_i) ]
AVI captures the mutual information content of the two clusterings and has distance metric properties:
- Non-negativity: AVI(C, D) ≥ 0
- Symmetry: AVI(C, D) = AVI(D, C)
- Triangle inequality: AVI(C, E) ≤ AVI(C, D) + AVI(D, E)
- Identity of indiscernibles: AVI(C, D) = 0 iff C ≅ D
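A minimal sketch (not from the slides) of AVI estimated empirically over alignment edges, with conditional probabilities as relative frequencies; names are illustrative:

```python
import math
from collections import Counter

def aligned_variation_of_information(alignments, cluster_of_c, cluster_of_d):
    """AVI(C, D) = E_{(i,j)}[ -log p(C_i|D_j) - log p(D_j|C_i) ],
    with the expectation taken over aligned word pairs (x, y)."""
    joint, c_marginal, d_marginal = Counter(), Counter(), Counter()
    for x, y in alignments:
        cx, dy = cluster_of_c[x], cluster_of_d[y]
        joint[(cx, dy)] += 1
        c_marginal[cx] += 1
        d_marginal[dy] += 1
    n = len(alignments)
    avi = 0.0
    for (cx, dy), count in joint.items():
        p_c_given_d = count / d_marginal[dy]  # p(C_x | D_y)
        p_d_given_c = count / c_marginal[cx]  # p(D_y | C_x)
        avi += (count / n) * (-math.log(p_c_given_d) - math.log(p_d_given_c))
    return avi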

Joint Objective
α [ H(C) + H(D) ] + β AVI(C, D)
The first (monolingual) term captures word-sequence information; the second (bilingual) term captures cross-lingual information. α and β are the weights of the monolingual and bilingual objectives respectively.
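Combining the two earlier sketches gives the joint objective; this is again an illustrative sketch, not the authors' implementation, and assumes H(C) and H(D) are taken as per-token cross-entropies:

```python
def num_tokens(corpus):
    return sum(len(sent) for sent in corpus)

def joint_objective(corpus_c, corpus_d, alignments,
                    cluster_of_c, cluster_of_d, alpha=1.0, beta=0.1):
    """alpha * [H(C) + H(D)] + beta * AVI(C, D); lower is better."""
    h_c = -monolingual_log_likelihood(corpus_c, cluster_of_c) / num_tokens(corpus_c)
    h_d = -monolingual_log_likelihood(corpus_d, cluster_of_d) / num_tokens(corpus_d)
    avi = aligned_variation_of_information(alignments, cluster_of_c, cluster_of_d)
    return alpha * (h_c + h_d) + beta * avi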

Inference
We want to do MAP inference on the factor graph.
[Figure: factor graphs for monolingual, bilingual, and monolingual & bilingual word clustering]

Inference: Optimization
- Finding the optimal solution is a hard combinatorial problem (Och, 1995).
- Greedy hill-climbing word exchange (Martin et al., 1995): transfer each word to the cluster with the maximum improvement (see the sketch below).
- Initialization: round-robin based on frequency.
- Termination: number of words exchanged < 0.1% of (vocab_1 + vocab_2), after at least 5 complete iterations.
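A minimal sketch (not from the slides) of one pass of word exchange. `objective` is any callable scoring a full clustering (e.g., a closure over the joint objective above); recomputing it from scratch per candidate move is slow, and a real implementation would update counts incrementally:

```python
def word_exchange_pass(vocab, clusters, cluster_of, objective):
    """Move each word to the cluster that most lowers the objective;
    returns the number of words moved in this pass."""
    moved = 0
    for w in vocab:
        original = cluster_of[w]
        best_cluster, best_score = original, objective(cluster_of)
        for c in clusters:
            if c == original:
                continue
            cluster_of[w] = c  # tentative move
            score = objective(cluster_of)
            if score < best_score:
                best_cluster, best_score = c, score
        cluster_of[w] = best_cluster
        if best_cluster != original:
            moved += 1
    return moved

# Termination per the slides: run at least 5 full passes and stop once
# moved < 0.001 * (len(vocab_1) + len(vocab_2)).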

Evaluation: Named Entity Recognition (NER)
- Core information extraction task, very sensitive to word representations.
- Word clusters are useful for downstream tasks (Turian et al., 2010).
- Clusters can be used directly as features for NER (see the sketch below): English (Finkel & Manning, 2009), German (Faruqui & Padó, 2010).
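A minimal sketch (not from the slides) of the standard way clusters enter an NER model: the cluster id becomes one more categorical feature per token. Feature names are illustrative, not the Stanford NER feature set:

```python
def token_features(sentence, i, cluster_of):
    """Per-token features for a CRF-style NER tagger; the cluster features
    let the model generalize from frequent words to rare ones."""
    w = sentence[i]
    feats = {
        "word": w.lower(),
        "is_capitalized": w[0].isupper(),
        "cluster": cluster_of.get(w, "<UNK>"),
    }
    if i > 0:
        feats["prev_cluster"] = cluster_of.get(sentence[i - 1], "<UNK>")
    return feats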

Data and Tools
- German NER training & test data: CoNLL 2003 (…,000 and 55,000 tokens resp.).
- Corpora for clustering: WIT-3 (Cettolo et al., 2012), a collection of TED talks; {Arabic, English, French, Korean, Turkish}–German, with around 1.5 million German tokens for each pair.
- Stanford NER for training (Finkel & Manning, 2009), with built-in functionality to use word clusters for generalization.
- cdec for unsupervised word alignments (Dyer et al., 2013).

Experiments
Joint objective: α [ H(C) + H(D) ] + β AVI(C, D). Baseline: no clusters.
1. Bilingual information only: α = 0, β = 1. Objective: AVI(C, D).
2. Monolingual information only: α = 1, β = 0. Objective: H(C) + H(D).
3. Monolingual + bilingual information: α = 1, β = 0.1. Objective: H(C) + H(D) + 0.1 AVI(C, D).

Alignment Edge Filtering
Word alignments are not perfect. We filter out the alignment edge between two words (x, y) if:
2b / ( (a + b + c) + (b + d) ) ≤ η
where b is the number of alignment links between x and y, (a + b + c) the total number of links incident on x, and (b + d) the total incident on y.
Trained η for different language pairs:
- English: 0.1
- French: 0.1
- Arabic: 0.3
- Turkish: 0.5
- Korean: 0.7
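A minimal sketch (not from the slides) of this filter, assuming the reading above: the score is a Dice-style coefficient between the link count of (x, y) and the words' total link counts. Names are illustrative:

```python
from collections import Counter

def filter_alignment_edges(alignments, eta):
    """Keep edge (x, y) only if 2b / (deg(x) + deg(y)) > eta, where
    b = #links(x, y), deg(w) = total alignment links incident on w."""
    edge_count, x_degree, y_degree = Counter(), Counter(), Counter()
    for x, y in alignments:
        edge_count[(x, y)] += 1
        x_degree[x] += 1
        y_degree[y] += 1
    kept = set()
    for (x, y), b in edge_count.items():
        if 2.0 * b / (x_degree[x] + y_degree[y]) > eta:
            kept.add((x, y))
    return [pair for pair in alignments if pair in kept]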

Results
[Figure: F1 scores of German NER trained using different word clusters, on the training set]

Results
[Figure: F1 scores of German NER trained using different word clusters, on the test set]

Ongoing Work
[Figure: extending monolingual and bilingual word clustering to multilingual word clustering]

Ongoing Work
Current work: parallel data only → monolingual data 1 + parallel data → monolingual data 1 + parallel data + monolingual data 2.

Conclusion
- Novel information-theoretic model for bilingual word clustering.
- The bilingual objective has an intuitive meaning.
- Joint optimization of the monolingual + bilingual objective.
- Improvement in clustering quality over monolingual clustering.
- Extendable to any number of languages, incorporating both monolingual and parallel data.

Thank You!