Multilinguality to the Rescue Manaal Faruqui & Chris Dyer Language Technologies Institute SCS, CMU.

1 Multilinguality to the Rescue Manaal Faruqui & Chris Dyer Language Technologies Institute SCS, CMU

2 Multilinguality Using more than one language at a time Image source: https://buffy.eecs.berkeley.edu/PHP/resabs/images/2006//101268-1.png

3 Multilinguality Why? Bank Images: http://www.realestategolfodulce.com/, http://thetrustadvisor.com/ बैंक तट Cross-lingual Word Sense Disambiguation (Diab and Resnik, 2002)

4 Multilinguality Why? Bilingual Word Clustering (Faruqui & Dyer, 2013)

5 Multilinguality Why? Bilingual Word Clustering (Faruqui & Dyer, 2013)

6 Multilinguality Using data from other languages Direct: Assume foreign = original language Indirect: Extract information from foreign language

7 Direct Information Transfer NLP System Language 1 data Language 2 data Output

8 Direct Information Transfer Why would it work? Works for specific tasks like NER. Many NEs retain their “orthographic” form across languages that use the same “alphabet” (English, German, French, Spanish; Hindi, Marathi, Bihari), especially proper nouns. Names of locations: USA, London, New York, Pittsburgh. Names of people: Obama, William, Roger.

9 Direct Information Transfer Barack Obama hat 2012 mit dieser Strategie die Präsidentschaftswahlen gewonnen. The Obama administration has poured billions of dollars into expanding the reach of the Internet. Pour finir, en défendant les bonus et en tentant de faire dérailler les nouvelles règles prudentielles, ce démocrate s'est mis à dos Barack Obama. ... sagte Jimmy Wales dem Wall Street Journal in einem Interview in Hongkong. Mads Refslund, executive chef at Acme, forages in the overgrown spaces and hidden markets of Hongkong for regional delicacies. Les sacs de luxe, nouvelle monnaie d'échange à Hongkong.
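
The point of the example sentences on this slide can be checked mechanically. The following is a minimal sketch, not the authors' method: it assumes that capitalization is a usable (if crude) signal for names, and the helper `capitalized_tokens` is a hypothetical name introduced only for illustration. It takes the German, English and French sentences quoted above and finds the capitalized tokens that appear unchanged in all three.

```python
def capitalized_tokens(sentence):
    """Return the set of capitalized tokens, with edge punctuation stripped."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in sentence.split())
    return {tok for tok in tokens if tok and tok[0].isupper()}


# The three sentences from the slide that mention the same person.
sentences = {
    "de": "Barack Obama hat 2012 mit dieser Strategie die "
          "Präsidentschaftswahlen gewonnen.",
    "en": "The Obama administration has poured billions of dollars into "
          "expanding the reach of the Internet.",
    "fr": "Pour finir, en défendant les bonus et en tentant de faire "
          "dérailler les nouvelles règles prudentielles, ce démocrate "
          "s'est mis à dos Barack Obama.",
}

# Tokens whose surface form is identical in all three languages.
shared = set.intersection(*(capitalized_tokens(s) for s in sentences.values()))
print(shared)  # {'Obama'}
```

Sentence-initial words ("The", "Pour") and German common nouns ("Strategie") are also capitalized, but they drop out of the intersection because they do not recur across languages, whereas the name does.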

10 Direct Information Transfer Semantic Generalization Deutschland (100) Ostdeutschland (5) Westdeutschland (0) LOC

11 Direct Information Transfer How? NER System Language 1 Training data Language 2 Word Clusters NE-tagged Text Input
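
One way to picture how word clusters help an NER system generalize is as an extra feature shared by all words in a cluster. The sketch below is an illustration under assumptions, not the Stanford NER feature set: the cluster IDs (`c17`, `c42`), the feature strings and the function `token_features` are all hypothetical. It reuses the words from the previous slide, where Westdeutschland presumably has no occurrences in the tagged training data but sits in the same cluster as Deutschland.

```python
# Hypothetical cluster IDs, standing in for the output of a word clustering
# tool run on unlabeled text (possibly from several languages).
word_to_cluster = {
    "Deutschland": "c17",
    "Ostdeutschland": "c17",
    "Westdeutschland": "c17",
    "gewonnen": "c42",
}


def token_features(word, clusters):
    """Return a lexical feature, plus a cluster feature if the word is clustered."""
    features = ["word=" + word.lower()]
    cluster_id = clusters.get(word)
    if cluster_id is not None:
        features.append("cluster=" + cluster_id)
    return features


seen = token_features("Deutschland", word_to_cluster)        # frequent in training
unseen = token_features("Westdeutschland", word_to_cluster)  # absent from training

# The lexical features differ, but the cluster feature is shared, so whatever
# weight the tagger learned for it on "Deutschland" also applies here.
shared = set(seen) & set(unseen)
print(shared)  # {'cluster=c17'}
```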

12 Evaluation Tools: Stanford NER for training (Finkel and Manning, 2009), with built-in functionality to use word clusters for generalization; word clustering software (distributional + morphological) (Clark, 2003). Data: NER training data: German, English: CoNLL 2003; Dutch, Spanish: CoNLL 2002. Generalization data: WMT-2012 news commentary, 200 million tokens; English, German, French, Spanish, Czech.

13 Results

14

15 Improvement in F1 scores by NE type

16 Quick Takeaways Multilingual data can be put to use for monolingual benefits. The amount of help depends on how “orthographically” similar the two languages are.

17 Indirect Information Transfer NLP System Language 1 data + Language 2 data Output

18 Vector Space Word Models Image: http://www.emeraldinsight.com

19 Vector Space Models Image: http://d1avok0lzls2w.cloudfront.net/

20 Vector Space Models Monolingual Word Vectors 1 + Monolingual Word Vectors 2 Better Monolingual Word Vectors 1 ??

21 Indirect Information Transfer Canonical Correlation Analysis (diagram: two matrices, n × d1 and n × d2, are each mapped to an n × k matrix)

22 Canonical Correlation Analysis (diagram: x (n × d1) * wx (d1 × k) and y (n × d2) * wy (d2 × k) each give an n × k matrix)
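
The projection pictured on this slide can be sketched in a few lines of NumPy. This is a textbook formulation of CCA (whitening followed by an SVD), written here for illustration; the talk itself lists a Matlab toolkit, so this is a stand-in and not the authors' code. The function name `cca`, the small ridge term `reg` and the synthetic data are assumptions of this sketch. The names `x`, `y`, `wx`, `wy`, `n`, `d1`, `d2` and `k` follow the slide.

```python
import numpy as np


def cca(x, y, k, reg=1e-8):
    """Return wx (d1 x k), wy (d2 x k) and the top-k canonical correlations."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    # Covariance blocks; the ridge term is an assumption for numerical stability.
    cxx = x.T @ x / (n - 1) + reg * np.eye(x.shape[1])
    cyy = y.T @ y / (n - 1) + reg * np.eye(y.shape[1])
    cxy = x.T @ y / (n - 1)
    # Cholesky factors: cxx = lx lx^T, cyy = ly ly^T.
    lx = np.linalg.cholesky(cxx)
    ly = np.linalg.cholesky(cyy)
    # Whitened cross-covariance t = lx^{-1} cxy ly^{-T}.
    t = np.linalg.solve(lx, cxy)
    t = np.linalg.solve(ly, t.T).T
    u, s, vt = np.linalg.svd(t, full_matrices=False)
    # Map the singular vectors back to the original coordinates.
    wx = np.linalg.solve(lx.T, u[:, :k])
    wy = np.linalg.solve(ly.T, vt.T[:, :k])
    return wx, wy, s[:k]


# Synthetic stand-ins for two sets of word vectors sharing a 2-d latent signal.
rng = np.random.default_rng(0)
n, d1, d2, k = 500, 5, 4, 2
latent = rng.normal(size=(n, 2))
x = latent @ rng.normal(size=(2, d1)) + 0.1 * rng.normal(size=(n, d1))
y = latent @ rng.normal(size=(2, d2)) + 0.1 * rng.normal(size=(n, d2))

wx, wy, corr = cca(x, y, k)
x_proj = (x - x.mean(axis=0)) @ wx  # n x k
y_proj = (y - y.mean(axis=0)) @ wy  # n x k
print(x_proj.shape, y_proj.shape, corr)
```

The columns of `x_proj` and `y_proj` come out ordered by decreasing correlation, which is what makes it possible to keep only the leading dimensions later on.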

23 Indirect Information Transfer Word Vectors in Language 1 + Word Vectors in Language 2 Obtain 1-to-1 mapping using word alignments
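
The 1-to-1 mapping step can be illustrated as follows. The slide does not say how ties or many-to-many alignments are resolved, so this sketch makes an assumption: each language 2 word is paired with the language 1 word it is aligned to most often. The function names, the toy alignment links and the toy vectors are all hypothetical. The output is a pair of row-aligned lists, which is the shape of input that CCA expects.

```python
from collections import Counter, defaultdict


def one_to_one_mapping(aligned_pairs):
    """Map each language-2 word to its most frequently aligned language-1 word."""
    counts = defaultdict(Counter)
    for word_l1, word_l2 in aligned_pairs:
        counts[word_l2][word_l1] += 1
    return {w2: c.most_common(1)[0][0] for w2, c in counts.items()}


def paired_rows(mapping, vectors_l1, vectors_l2):
    """Build row-aligned vector lists for the word pairs present in both models."""
    pairs = [(w1, w2) for w2, w1 in sorted(mapping.items())
             if w1 in vectors_l1 and w2 in vectors_l2]
    x = [vectors_l1[w1] for w1, _ in pairs]
    y = [vectors_l2[w2] for _, w2 in pairs]
    return pairs, x, y


# Toy alignment links (language 1 word, language 2 word), e.g. read off
# word-aligned parallel sentences. Invented for illustration.
links = [("bank", "Bank"), ("bank", "Bank"), ("shore", "Ufer"),
         ("bank", "Ufer"), ("shore", "Ufer")]
mapping = one_to_one_mapping(links)

# Toy vectors; the two languages need not have the same dimensionality.
vectors_l1 = {"bank": [0.9, 0.1, 0.0], "shore": [0.1, 0.8, 0.2]}
vectors_l2 = {"Bank": [0.7, 0.3], "Ufer": [0.2, 0.9]}

pairs, x, y = paired_rows(mapping, vectors_l1, vectors_l2)
print(pairs)  # [('bank', 'Bank'), ('shore', 'Ufer')]
```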

24 Experiments Task: Word Pair Reranking: rank a list of word pairs according to semantic similarity. Datasets: WS-353 (353 word pairs), RG-65 (65 noun pairs). Truncation: the correlation may introduce noise, so keep only the top k% of correlated dimensions.
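
A sketch of how such an evaluation is commonly run: score each word pair by cosine similarity, then compare the model's ranking with the human ranking. The slide does not name the agreement measure, so Spearman's rank correlation is an assumption here, as is the reading of truncation as "keep the leading fraction of dimensions, assuming they are sorted by correlation". The vectors and human scores below are invented toy values, not taken from WS-353 or RG-65.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def truncate(vector, fraction):
    """Keep the leading fraction of dimensions (assumed sorted by correlation)."""
    keep = max(1, math.ceil(fraction * len(vector)))
    return vector[:keep]


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for m in range(i, j + 1):
            result[order[m]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(a, b):
    """Spearman's rho: the Pearson correlation of the two rank lists."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((p - ma) * (q - mb) for p, q in zip(ra, rb))
    var_a = sum((p - ma) ** 2 for p in ra)
    var_b = sum((q - mb) ** 2 for q in rb)
    return cov / math.sqrt(var_a * var_b)


# Toy projected vectors and toy human similarity scores (both invented).
vectors = {
    "car": [1.0, 0.0, 0.2, 0.0],
    "automobile": [0.9, 0.1, 0.0, 0.3],
    "coast": [0.0, 1.0, 0.0, 0.5],
    "shore": [0.3, 0.8, 0.4, 0.0],
}
gold = [("car", "automobile", 9.0), ("coast", "shore", 6.0), ("car", "shore", 2.0)]


def evaluate(fraction):
    """Spearman's rho between model and human scores at a truncation level."""
    model = [cosine(truncate(vectors[a], fraction), truncate(vectors[b], fraction))
             for a, b, _ in gold]
    return spearman(model, [score for _, _, score in gold])


rho_full = evaluate(1.0)
rho_half = evaluate(0.5)
print(rho_full, rho_half)
```

Sweeping `fraction` over a grid and keeping the value with the best correlation on held-out pairs is one way the "top k%" setting could be chosen.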

25 Evaluation Tools: Word vectors: RNNLM Toolkit (Mikolov, 2009); word alignments: cdec (Dyer et al., 2013); CCA: Matlab Toolkit. Data: Word vector monolingual training data: WMT news commentary 2011, 2012; English, French, Spanish, German. Word alignment data: WMT news commentary 2010, 09, 08, 07, 06; {French, Spanish, German} - English.

26 Results

27

28 Original English Vectors

29 German Projected on English

30 Conclusion Word vector quality can be improved using multilingual data, at least for lexical semantic tasks. The amount of help provided by these languages depends on how similar they are to each other. A task like NER can use data from multiple languages in a simple framework.

31 Thank You!

