COMPARABLE CORPORA AND ITS APPLICATION
Presented by Srijit Dutt (10305056), Janardhan Singh (10305067), Ashutosh Nirala (10305906), Brijesh Bhatt (10405301)

Guide: Dr. Pushpak Bhattacharyya

Outline
- Motivation
- Comparable Corpora (Non-parallel Corpora)
- Basic Architecture
- Geometrical View
- Improvements

Motivation
- A corpus is the holy grail of NLP
- Bilingual dictionary generation
- Parallel corpora
  - One-to-one correspondence in content
  - Parallel corpora are rare
  - Resource-constrained language pairs (e.g. Punjabi-Spanish)
- Monolingual corpora are readily available
  - World Wide Web (non-parallel corpus)
- Techniques are needed to work with non-parallel corpora

Non-parallel corpora
- Characteristics
  - No parallel sentences
  - No parallel paragraphs
  - Fewer overlapping terms and words
- Four dimensions
  - Author
  - Domain
  - Topic
  - Time
Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997

Comparable Corpora
[Figure: example page from OneIndia.in]

Comparable Corpora
[Figure: example page from Navbharat Times]

Postulates for non-parallel corpora
- Basic postulate (Fung et al., 1997): if a domain-specific term A is related to another term B in some text T, then its counterpart A' is related to B' in some other text T'.
[Figure: terms A to E in text T and their counterparts A' to E' in text T']
Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997

Using non-parallel corpora
- Basic postulate (Fung et al., 1997): if A is less associated with E, then A' is less associated with E'.
Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997

Using non-parallel corpora
- Basic postulate (Fung et al., 1997): given a large set of words, a word is associated with only some of them.
Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997

Using non-parallel corpora
- Basic postulate (Fung et al., 1997): if A is closely associated with words B and C to varying degrees, then A' is closely associated with B' and C' to the same varying degrees.
Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997

Histogram (Debenture)
[Figure: histograms for Corpus 1 and Corpus 2; x-axis: seed words, y-axis: frequency]

Histogram (Administration)
[Figure: histograms for Corpus 1 and Corpus 2; x-axis: seed words, y-axis: frequency]

Co-occurrence Relation
Known seed words of both languages (online dictionary):

| English | Hindi |
|---|---|
| book | किताब |
| library | पुस्तकालय |
| knowledge | ज्ञान |
| school | पाठशाला |

Co-occurrence matrix
[Figure: source-language matrix. Rows: words in the corpus; columns: base lexicon (dictionary) words Book, Knowledge, Tree, Library; each row is the co-occurrence vector of one source-language word, e.g. (1, 0, 1). The target-language matrix uses the base lexicon words किताब, ज्ञान, पुस्तकालय, पेड़.]
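The matrix view can be sketched in a few lines: each word is described by its co-occurrence with the seed lexicon entries, and a source vector is compared with the vectors of target candidates. This is a minimal sketch under stated assumptions: the toy sentences, the helper names `cooc_vector` and `best_translation`, sentence-level binary co-occurrence and cosine matching are all illustrative choices, not the presenters' implementation.

```python
import math

# Seed lexicon (English -> Hindi); its order fixes the vector dimensions.
SEED = [("book", "किताब"), ("knowledge", "ज्ञान"), ("library", "पुस्तकालय"), ("tree", "पेड़")]


def cooc_vector(word, sentences, seeds):
    """Binary vector: 1 if `word` shares at least one sentence with the seed word."""
    return [
        1 if any(word in s and seed in s for s in sentences) else 0
        for seed in seeds
    ]


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_translation(word, src_sents, tgt_sents, candidates):
    """Return the target candidate whose seed co-occurrence vector is most similar."""
    src_vec = cooc_vector(word, src_sents, [en for en, _ in SEED])
    scored = [
        (cosine(src_vec, cooc_vector(c, tgt_sents, [hi for _, hi in SEED])), c)
        for c in candidates
    ]
    return max(scored)[1]


# Hypothetical toy corpora, each sentence given as a set of tokens.
english = [{"student", "book", "knowledge"}, {"student", "library"}, {"forest", "tree"}]
hindi = [{"विद्यार्थी", "किताब", "ज्ञान"}, {"विद्यार्थी", "पुस्तकालय"}, {"जंगल", "पेड़"}]

print(best_translation("student", english, hindi, ["विद्यार्थी", "जंगल"]))
```

Because both vectors are indexed by the same lexicon entries, no sentence alignment between the two corpora is needed.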

Improvements on Basic Architecture
- Co-occurrence counts
- Similarity measure
- Window size
  - Is it the same for all words?
- Dictionary
  - Polysemous words and synonyms
  - What if a dictionary is not available?

Context vector
[Figure: token sequence A X B A B X X B A B X A with window size 3; words A and B occur in the dictionary, X is any word; table of word vs. co-occurrence count]
Source: Automatic Identification of Word Translations, Rapp, 1999
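A window-based count for the token sequence in the figure might look like the sketch below. It assumes a symmetric window of `window` tokens on each side of the target word; the function name `window_counts` is hypothetical, and the slide does not say whether the window is symmetric.

```python
from collections import Counter


def window_counts(tokens, target, seeds, window=3):
    """Count seed words within `window` tokens to the left or right of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            # Skip the target position itself; only dictionary (seed) words count.
            if j != i and tokens[j] in seeds:
                counts[tokens[j]] += 1
    return counts


tokens = "A X B A B X X B A B X A".split()
print(window_counts(tokens, "X", {"A", "B"}, window=3))
```

The resulting counts form the context vector of X, one dimension per dictionary word.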

Co-occurrence Counts
- Mutual information (Church et al., 1989)
- Conditional probability (Fung et al., 1996)
- Chi-square test (Dunning et al., 1993)
- Log-likelihood ratio (Rapp, 1998)
- TF-IDF (Fung et al., 1997)

Conditional Probability
- k11 = co-occurrence frequency of word w_s and word w_t
- k12 = corpus frequency of word w_s - k11
- k21 = corpus frequency of word w_t - k11
- k22 = size of corpus (no. of tokens) - corpus frequency of w_s - corpus frequency of w_t
- Marginal and joint probabilities are estimated from these counts
Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997
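The four counts can be turned into probability estimates as in the sketch below. The cell definitions follow the slide; normalising by the sum of the four cells and the function names `contingency` and `probabilities` are assumptions, since the slide's own equations are not preserved.

```python
def contingency(cooc, freq_s, freq_t, corpus_size):
    """Return (k11, k12, k21, k22) as defined on the slide."""
    k11 = cooc
    k12 = freq_s - k11
    k21 = freq_t - k11
    k22 = corpus_size - freq_s - freq_t
    return k11, k12, k21, k22


def probabilities(k11, k12, k21, k22):
    """Joint, marginal and conditional estimates (assumed normaliser: sum of cells)."""
    n = k11 + k12 + k21 + k22
    p_joint = k11 / n                 # P(w_s, w_t)
    p_s = (k11 + k12) / n             # P(w_s)
    p_t = (k11 + k21) / n             # P(w_t)
    p_t_given_s = k11 / (k11 + k12)   # P(w_t | w_s)
    return p_joint, p_s, p_t, p_t_given_s


# Hypothetical counts for illustration only.
table = contingency(cooc=20, freq_s=50, freq_t=80, corpus_size=1000)
print(table, probabilities(*table))
```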

Co-occurrence Counts
- Mutual information
- TF-IDF
Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997
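For orientation, the usual textbook forms of the two weights are given below. They are an assumption: the equations on this slide are not preserved, and the variants used by Fung et al. may differ in detail. Here $P(\cdot)$ are corpus estimates, $\mathrm{tf}$ is the co-occurrence count of seed word $w$ with the word being described, $N$ is the number of contexts, and $\mathrm{df}(w)$ is the number of contexts containing $w$.

```latex
\begin{align}
  I(w_s, w_t) &= \log_2 \frac{P(w_s, w_t)}{P(w_s)\,P(w_t)} \\
  \text{tf-idf}(w) &= \mathrm{tf}(w) \cdot \log \frac{N}{\mathrm{df}(w)}
\end{align}
```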

Co-occurrence Counts
- Log-likelihood ratio, where:
  - k11 = co-occurrence frequency of word w_s and word w_t
  - k12 = corpus frequency of word w_s - k11
  - k21 = corpus frequency of word w_t - k11
  - k22 = size of corpus (no. of tokens) - corpus frequency of w_s - corpus frequency of w_t
Source: Automatic Identification of Word Translations, Rapp, 1999
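A log-likelihood score over the same 2x2 table can be sketched as follows. It uses the standard G² form for contingency tables, which is an assumption because the slide's equation is not preserved; the function name `log_likelihood` is hypothetical.

```python
import math


def log_likelihood(k11, k12, k21, k22):
    """G^2 = 2 * sum k_ij * ln(k_ij * N / (row_i * col_j)); empty cells contribute 0."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    return 2.0 * sum(
        k * math.log(k * n / (rows[i] * cols[j])) for k, i, j in cells if k > 0
    )


# Hypothetical counts: the score is 0 for independent words and grows with association.
print(log_likelihood(10, 10, 10, 10), log_likelihood(20, 30, 60, 870))
```

Unlike raw counts, the score stays small for frequent word pairs that co-occur only as often as chance predicts.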

Similarity Measures
- Cosine similarity
- Jaccard similarity
- Euclidean (L2) distance
- Manhattan (L1, city-block) distance
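The four measures can be written compactly for plain Python lists, as sketched below. The weighted form of Jaccard (sum of minima over sum of maxima, for non-negative weights) is one common generalisation and is an assumption here; the slide does not say which variant is meant.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(u, v):
    """Weighted Jaccard for non-negative vectors: sum of minima over sum of maxima."""
    top = sum(min(a, b) for a, b in zip(u, v))
    bottom = sum(max(a, b) for a, b in zip(u, v))
    return top / bottom if bottom else 0.0


def euclidean(u, v):
    """L2 distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """L1 (city-block) distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


u, v = [1, 0, 1], [1, 1, 0]
print(cosine(u, v), jaccard(u, v), euclidean(u, v), manhattan(u, v))
```

Cosine and Jaccard are similarities (higher is better), while the L2 and L1 measures are distances (lower is better), so candidate rankings must be sorted accordingly.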

Window Size
- What is the ideal context size?
- Should the same window size be used for every word?
  - "amount": more frequent
  - "debenture": less frequent

Dependency Tree
[Figure: dependency tree example]
Source: Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences, N. Garera et al., 2009

Modeling context using dependency tree
- The four position vectors are mapped as follows:
  - -1: immediate parent
  - +1: immediate child
  - -2: grandparent
  - +2: grandchild
Source: Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences, N. Garera et al., 2009
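The position mapping can be illustrated with a small sketch that collects the four context sets from a head-index array. The representation of the parse (`heads[k]` is the index of token k's head, -1 for the root), the example parse and the function name `dependency_context` are assumptions for illustration, not taken from Garera et al.

```python
def dependency_context(tokens, heads, i):
    """Collect parent (-1), grandparent (-2), children (+1) and grandchildren (+2) of token i."""

    def children(k):
        return [c for c, h in enumerate(heads) if h == k]

    ctx = {-2: [], -1: [], 1: [], 2: []}
    parent = heads[i]
    if parent != -1:
        ctx[-1].append(tokens[parent])
        if heads[parent] != -1:
            ctx[-2].append(tokens[heads[parent]])
    for c in children(i):
        ctx[1].append(tokens[c])
        ctx[2].extend(tokens[g] for g in children(c))
    return ctx


# Hypothetical parse of "the big dog chased a cat"; "chased" is the root.
tokens = ["the", "big", "dog", "chased", "a", "cat"]
heads = [2, 2, 3, -1, 5, 3]
print(dependency_context(tokens, heads, 1))  # context of "big"
print(dependency_context(tokens, heads, 3))  # context of "chased"
```

Each of the four sets would then be accumulated into its own co-occurrence vector, in place of the single adjacency window.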

Context vector vs. dependency parsing
[Figure: comparison of context-vector and dependency-based contexts]
Source: Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences, N. Garera et al., 2009

Dependency Tree
- Context is captured better by dependency information than by adjacent words.
- Long-distance dependencies capture associated words.
- Parent and child relationships carry over between languages with different word orders.
- Higher accuracy.
Source: Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences, N. Garera et al., 2009

Dictionary as seed word list (issues)
- Multiple translations
- Polysemous words
- Words in one text may not be present in the other
- A word may not appear in its dictionary form
Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997

Geometrical View
[Figure: geometric view of translation]
Source: A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora, E. Gaussier, J. M. Renders, I. Matveeva, C. Goutte, H. Dejean, 2004

Geometrical View (Extended Approach)
[Figure: geometric view of translation, extended approach]
Source: A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora, E. Gaussier, J. M. Renders, I. Matveeva, C. Goutte, H. Dejean, 2004

Translation without dictionary
- What if a dictionary is not available?
- Find a language for which dictionaries are available.
- Use that language as an intermediate language between the source and target languages.

Use of pivot language
- A bilingual lexicon between the source and target languages is unavailable.
- Use a pivot language for which bilingual lexicons are available.
- What if Y is polysemous?
[Figure: X (source language) → Y (pivot language) → Z (target language)]
Source: Bilingual Lexicon Generation Using Non-Aligned Signatures, Shezaf et al., 2010

Use of pivot language
- Source: Hindi; pivot: English
- X = प्रकाश, Y = light
- Lexicons are intransitive, so composing them through the pivot results in noisy translations.
[Figure: X (source language) → Y (pivot language) → Z (target language)]
Source: Bilingual Lexicon Generation Using Non-Aligned Signatures, Shezaf et al., 2010
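The noise introduced by a polysemous pivot word can be seen in a small sketch that composes two lexicons. The Hindi-English entry is the slide's example; the choice of Spanish as target language, the English-Spanish entries and the function name `pivot_translate` are hypothetical.

```python
def pivot_translate(src_to_pivot, pivot_to_tgt, word):
    """Compose two lexicons through the pivot; returns every reachable candidate."""
    candidates = set()
    for pivot_word in src_to_pivot.get(word, []):
        candidates.update(pivot_to_tgt.get(pivot_word, []))
    return candidates


hi_en = {"प्रकाश": ["light"]}
# Hypothetical entries: "light" as illumination (luz) and as "not heavy" (ligero).
en_es = {"light": ["luz", "ligero"]}

print(pivot_translate(hi_en, en_es, "प्रकाश"))
```

Both senses of the pivot word reach the target side, although only one of them translates the source word, which is why the composed lexicon needs filtering.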

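Corpus to handle intransitivity
- C1: source corpus
- C2: target corpus
- S(X) = signature of X
- Z1, Z2 = target signatures
- NAS(s, t) = non-aligned signature score of source word s and target word t
- Z = winning signature
[Figure: X is mapped through the pivot to the candidates {Z1, Z2}; S(X) is taken from C1 and the candidate signatures from C2]
Source: Bilingual Lexicon Generation Using Non-Aligned Signatures, Shezaf et al., 2010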
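One plausible reading of the signature comparison is sketched below: a candidate scores highly when many words in the source signature have a lexicon translation inside the candidate's signature. The scoring formula is an assumption (the slide's NAS equation is not preserved), and the signatures, lexicon entries and function names `nas_score` and `pick_winner` are hypothetical.

```python
def nas_score(src_signature, tgt_signature, lexicon):
    """Fraction of source-signature words with a lexicon translation in the target signature."""
    if not src_signature:
        return 0.0
    tgt = set(tgt_signature)
    hits = sum(1 for w in src_signature if set(lexicon.get(w, [])) & tgt)
    return hits / len(src_signature)


def pick_winner(src_signature, candidates, lexicon):
    """Return the candidate whose signature scores highest against S(X)."""
    return max(candidates, key=lambda z: nas_score(src_signature, candidates[z], lexicon))


# Hypothetical data: signature of X = प्रकाश and a noisy Hindi -> Spanish lexicon.
s_x = ["सूर्य", "दीपक", "अंधेरा"]
lexicon = {"सूर्य": ["sol"], "दीपक": ["lámpara"], "अंधेरा": ["oscuridad"]}
candidates = {
    "luz": ["sol", "lámpara", "sombra"],
    "ligero": ["peso", "rápido", "sol"],
}

print(pick_winner(s_x, candidates, lexicon))
```

No alignment between C1 and C2 is required, only signatures computed independently in each corpus.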

Limitation of Context-based Approach
- Uses only the lexical context around translation candidates.
- Words may appear in similar contexts without being translations of each other, which leads to false translations.
- Example: using a Chinese-English comparable corpus and the definition of Fung (1995):

| No | Word | Context heterogeneity vector |
|---|---|---|
| 1 | 经济学 (economics) | (0.185, 0.006) |
| 2 | economics | (0.101, 0.013) |
| 3 | medicine | (0.113, 0.028) |

- The distance between vectors 1 and 2 (0.084) is greater than the distance between vectors 1 and 3 (0.075), so the wrong candidate, medicine, is closer.
- Does not use rich syntactic information, only a bag of words.
Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009
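The two quoted distances are consistent with a plain Euclidean distance between the vectors. This is inferred from the numbers, not stated on the slide; the worked arithmetic is:

```latex
\begin{align}
  d(v_1, v_2) &= \sqrt{(0.185 - 0.101)^2 + (0.006 - 0.013)^2}
               = \sqrt{0.007056 + 0.000049} \approx 0.084 \\
  d(v_1, v_3) &= \sqrt{(0.185 - 0.113)^2 + (0.006 - 0.028)^2}
               = \sqrt{0.005184 + 0.000484} \approx 0.075
\end{align}
```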

Dependency Heterogeneity
- Dependency heterogeneity phenomenon: a word in the source language shares similar heads and modifiers with its translation in the target language, whether or not they occur in similar contexts.
- Uses rich syntactic information.
- Examples:
  - big (MOD) brown (MOD) dog (HEAD)
  - bird (MOD) song (HEAD)
  - song (MOD) bird (HEAD)
Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

Does it work?

Frequently used modifiers:

| 经济学 (economics) | economics | medicine |
|---|---|---|
| 微观 / micro | keynesian | physiology |
| 宏观 / macro | new | Chinese |
| 计量 / computation | institutional | traditional |
| 新 / new | positive | biology |
| 政治 / politics | classical | internal |
| 大学 / university | labor | science |
| 古典派 / classicists | development | clinical |
| 发展 / development | engineering | veterinary |
| 理论 / theory | finance | western |
| 实证 / demonstration | international | agriculture |

Frequently used heads:

| 经济学 (economics) | economics | medicine |
|---|---|---|
| 是 / is | is | is |
| 均衡 / average | has | tends |
| 毕业 / graduate | was | include |
| 承认 / admit | emphasizes | moved |
| 能 / can | non-rivaled | means |
| 分化 / split | became | requires |
| 剩下 / leave | assume | includes |
| 比 / compare | relies | were |
| 成为 / become | can | has |
| 偏重 / emphasize | replaces | may |

Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

Comparable Corpora Preprocessing
- Focus: Chinese-English bilingual dictionary extraction for single nouns
- Raw corpora: Chinese and English pages from Wikipedia with inter-language links
- Processing pipeline:
  - Morphological analyzer
  - POS tagger
  - MaltParser to obtain syntactic dependencies
- Refinements to obtain the preprocessed corpora:
  1. Stemming of translation candidates
  2. Removal of stop words
  3. Removal of sentences with more than k (= 30) words
Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

Dependency Heterogeneity Vector Calculation
- NMOD (noun modifier), SUB (subject) and OBJ (object) are the dependency labels produced by MaltParser.
- No bilingual dictionary is needed.
Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

Bilingual Dictionary Extraction (contd.)

| Word | Dependency heterogeneity vector |
|---|---|
| 经济学 (economics) | (0.398, 0.677, 0.733, 0.471) |
| economics | (0.466, 0.500, 0.625, 0.432) |
| medicine | (0.748, 0.524, 0.542, 0.220) |

- With this method, the distances D_H between the vectors are:
  - D_H(经济学, economics) = 0.222
  - D_H(经济学, medicine) = 0.496
Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009
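The quoted D_H values can be reproduced from the vectors, to within rounding of the published components, if D_H is taken to be the Euclidean distance. That form is inferred from the numbers and is an assumption; the names `d_h` and `rank_candidates` are hypothetical.

```python
import math

# Dependency heterogeneity vectors copied from the slide.
VECTORS = {
    "经济学": (0.398, 0.677, 0.733, 0.471),
    "economics": (0.466, 0.500, 0.625, 0.432),
    "medicine": (0.748, 0.524, 0.542, 0.220),
}


def d_h(a, b):
    """Euclidean distance between two heterogeneity vectors (assumed form of D_H)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(VECTORS[a], VECTORS[b])))


def rank_candidates(source, candidates):
    """Sort translation candidates by increasing distance from the source word."""
    return sorted(candidates, key=lambda c: d_h(source, c))


print(rank_candidates("经济学", ["medicine", "economics"]))
print(round(d_h("经济学", "economics"), 3), round(d_h("经济学", "medicine"), 3))
```

With these vectors the correct translation is the nearer candidate, the opposite of the ordering obtained from the two-dimensional context heterogeneity vectors.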

Results of Bilingual Dictionary Extraction
- Performed on 250 Chinese/English single-noun pairs
- Average accuracy:
  - Top 5: Context 0.132; Dependency 0.208 (↑ 57.58%); only-mod 0.156; only-head 0.176; only-NMOD 0.200
  - Top 10: Context 0.296; Dependency 0.380 (↑ 28.38%); ablation settings 0.336 and 0.364 (only two of the three values are listed)
- Ablation settings:
  - only-mod: (H_NMODMod)
  - only-head: (H_NMODHead, H_SUBHead, H_OBJHead)
  - only-NMOD: (H_NMODHead, H_NMODMod)
Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

Result

| Paper | Method | Corpus | Accuracy |
|---|---|---|---|
| Fung et al. 1996 | Best candidate | English/Japanese | 29% |
| Rapp 1998 | 100 test words | English/French | 72% |
| Gaussier et al. | Avg. precision | English/French | 44% |
| Morin et al. 2007 | Top 20 | French/Japanese | 42% |
| Yu et al. 2009 | Top 10 | English/Chinese | 38% |

Conclusion
- Use of non-parallel corpora is inevitable and reduces the effort of developing parallel corpora.
- Modern techniques achieve accuracy of up to about 70% with non-parallel corpora.
- Polysemy and sense disambiguation remain major challenges.
- Different implementations are difficult to compare because the languages and corpora they use differ.

References
- Fung, P.; McKeown, K. (1997). Finding terminology translations from non-parallel corpora. Proceedings of the 5th Annual Workshop on Very Large Corpora, Hong Kong, 192-202.
- Fung, P.; Yee, L. Y. (1998). An IR approach for translating new words from nonparallel, comparable texts. Proceedings of COLING-ACL 1998, Montreal, Vol. 1, 414-420.
- Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. Proceedings of ACL-99, College Park, USA, 1-17.

- Gaussier, E.; Renders, J.-M.; Matveeva, I.; Goutte, C.; Dejean, H. (2004). A geometric view on bilingual lexicon extraction from comparable corpora. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 527-534.
- Robitaille, X.; Sasaki, Y.; Tonoike, M.; Sato, S.; Utsuro, T. (2006). Compiling French-Japanese terminologies from the Web. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics.
- Morin, E.; Daille, B.; Takeuchi, K.; Kageura, K. (2007). Bilingual terminology mining: using brain, not brawn comparable corpora. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 664-671.

- Garera, N.; Callison-Burch, C.; Yarowsky, D. (2009). Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. Proceedings of the Conference on Natural Language Learning (CoNLL), Boulder, Colorado.
- Yu, K.; Tsujii, J. (2009). Extracting bilingual dictionary from comparable corpora with dependency heterogeneity. Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT 2009).
- Shezaf, D.; Rappoport, A. (2010). Bilingual lexicon generation using non-aligned signatures. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 98-107.

Thank you
Questions?
