Comparable Corpora Kashyap Popat (113050023), Rahul Sharnagat (11305R013)



Outline  Motivation  Introduction: Comparable Corpora  Types of corpora  Methods to extract information from comparable corpora  Bilingual dictionary  Parallel sentences  Conclusion

Motivation  Corpus: the most basic requirement in statistical NLP  Large amounts of bilingual text are available on the web  Bilingual dictionary generation  One-to-one correspondence between words  Parallel corpus generation  One-to-one correspondence between sentences  A very rare resource for some language pairs (e.g., Hindi–Chinese)

Comparable corpora [7]  “A comparable corpus is one which selects similar texts in more than one language or variety. There is as yet no agreement on the nature of the similarity, because there are very few examples of comparable corpora.” (definition by EAGLES)  Characteristics of comparable corpora  No parallel sentences  No parallel paragraphs  Fewer overlapping terms and words

Spectrum of Corpora  Unrelated corpora → Comparable corpora → Parallel corpora → Transcription (sentence-by-sentence aligned)

An example of a comparable corpus

Applications of comparable corpora  Generating bilingual lexical entries (dictionary)  Creating parallel corpora

Generating bilingual lexical entries

Basic postulates [1]  Words with productive contexts in one language translate to words with productive contexts in the second language, e.g., table  मेज़  Words with rigid contexts translate into words with rigid contexts, e.g., haemoglobin  रक्ताणु  Correlation between co-occurrence patterns in the two languages Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus, Fung, 1995

Co-occurrence patterns [4]  If a term A co-occurs with another term B in some text T, then its translation A′ also co-occurs with B′ (the translation of B) in some other text T′ (Figure: texts T and T′ containing A, B and their translations A′, B′) Automatic Identification of Word Translations from Unrelated English and German Corpora. R. Rapp, 1999

Co-occurrence Histogram [2]  Co-occurrence counts for the word ‘debenture’ (histogram of words vs. counts) Finding terminology translations from non-parallel corpora. Fung, 1997

Basic Approach [3]  Calculate the co-occurrence matrix for all the words in the source language L1 and the target language L2  The word order of the L1 matrix is permuted until the resulting pattern is most similar to that of the L2 matrix Identifying word translations in nonparallel texts, Rapp, R., 1995

English co-occurrence matrix  L1 matrix, with rows/columns ordered: 1 Book, 2 Garden, 3 Plant, 4 School, 5 Sky, 6 Teacher

Hindi co-occurrence matrix  L2 matrix, with rows/columns ordered: 1 आकाश, 2 पाठशाला, 3 शिक्षक, 4 बगीचा, 5 किताब, 6 पौधा

Hindi co-occurrence matrix  L2 matrix after permutation, with rows/columns ordered: किताब (5), बगीचा (4), पौधा (6), पाठशाला (2), आकाश (1), शिक्षक (3)

Result  Comparing the order of the words in the L1 matrix and the permuted L2 matrix:
L1 index | Word in L1 | Word in L2 | L2 index
1 | Book | किताब | 5
2 | Garden | बगीचा | 4
3 | Plant | पौधा | 6
4 | School | पाठशाला | 2
5 | Sky | आकाश | 1
6 | Teacher | शिक्षक | 3
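The permutation search above can be sketched in a few lines. The two 3×3 co-occurrence matrices below are invented purely for illustration; a brute-force search over permutations is only feasible for such toy vocabularies, which is exactly why the approach is expensive in practice.

```python
# A toy sketch of the matrix-permutation idea from Rapp (1995).
# C1 and C2 are made-up co-occurrence matrices for 3 source and
# 3 target words; all data here is illustrative.
from itertools import permutations

import numpy as np

C1 = np.array([[0, 5, 1],
               [5, 0, 2],
               [1, 2, 0]])          # source language, word order fixed
C2 = np.array([[0, 2, 5],
               [2, 0, 1],
               [5, 1, 0]])          # target language, unknown word order

def best_permutation(C1, C2):
    """Try every reordering of C2's words; keep the one closest to C1."""
    n = C1.shape[0]
    best, best_cost = None, float("inf")
    for p in permutations(range(n)):
        p = list(p)
        # L1 distance between C1 and the reordered C2
        cost = np.abs(C1 - C2[np.ix_(p, p)]).sum()
        if cost < best_cost:
            best, best_cost = p, cost
    return best, best_cost

perm, cost = best_permutation(C1, C2)
print(perm, cost)  # perm[i] = index of the target word aligned with source word i
```

For the toy matrices the best permutation reproduces C1 exactly (cost 0), mirroring how the permuted Hindi matrix lines up with the English one in the slides.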

Problems  Permuting the co-occurrence matrix is expensive  Size of the vector = # of unique terms in the language

A New Method [2]  Dictionary entries are used as seed words to generate correlation matrices  Algorithm:  A bilingual list of known translation pairs (seed words) is given  Step 1: For every word ‘e’ in L1, find its correlation vector (M1) with every word of L1 in the seed words  Step 2: For every word ‘c’ in L2, find its correlation vector (M2) with every word of L2 in the seed words  Step 3: Compute correlation(M1, M2); if it is high, ‘e’ and ‘c’ are considered a translation pair Finding terminology translations from non-parallel corpora. Fung, 1997

Co-occurrence  (Figure: an English–Hindi seed word list (Garden/बगीचा, Plant/पौधा, Sky/आकाश, Flower/फूल) with their co-occurrence entries)

Crux of the Algorithm  Two main steps:  Formation of the co-occurrence matrix  Measuring the similarity between vectors  Different methods are possible for each of the two steps  Advantage: the vector size reduces to the # of unique words in the seed list
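The two steps can be sketched as follows. The seed pairs, the tiny corpora, and the candidate pair Flower/फूल are all toy data invented for illustration; sentence-level co-occurrence and cosine similarity stand in for whichever window and measure one actually chooses.

```python
# A minimal sketch of the seed-word method (Fung, 1997): represent each
# candidate word by its co-occurrence vector over a small seed lexicon,
# then compare vectors across languages.
import numpy as np

seed_pairs = [("garden", "बगीचा"), ("plant", "पौधा"), ("sky", "आकाश")]

def cooc_vector(word, corpus_sentences, seed_side):
    """Count how often `word` co-occurs (same sentence) with each seed word."""
    vec = np.zeros(len(seed_side))
    for sent in corpus_sentences:
        if word in sent:
            for i, s in enumerate(seed_side):
                if s in sent:
                    vec[i] += 1
    return vec

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return 0.0 if denom == 0 else float(u @ v / denom)

# Toy monolingual corpora (lists of tokenized sentences).
en_corpus = [["the", "flower", "in", "the", "garden"],
             ["a", "plant", "and", "a", "flower"],
             ["the", "sky", "is", "blue"]]
hi_corpus = [["बगीचा", "में", "फूल"],
             ["पौधा", "और", "फूल"],
             ["आकाश", "नीला", "है"]]

v_en = cooc_vector("flower", en_corpus, [e for e, h in seed_pairs])
v_hi = cooc_vector("फूल", hi_corpus, [h for e, h in seed_pairs])
print(cosine(v_en, v_hi))  # a high score suggests a translation pair
```

Note how both vectors live in the same 3-dimensional "seed space", which is what makes the cross-language comparison possible without permuting full matrices.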

Improvements  Window size for the co-occurrence calculation [2]  Should it be the same for all words?  Co-occurrence counts  Similarity measure Finding terminology translations from non-parallel corpora. Fung, 1997

Co-occurrence count  Mutual Information (Church & Hanks, 1989)  Conditional Probability (Rapp, 1996)  Chi-Square Test (Dunning, 1993)  Log-likelihood Ratio (Dunning, 1993)  TF-IDF (Fung et al 1998) Automatic Identification of Word Translations from Unrelated English and German Corpora, R. Rapp.,1999

Mutual Information [2] (1/2) k11 = # of segments where both ws and wt occur k12 = # of segments where only ws occurs k21 = # of segments where only wt occurs k22 = # of segments where neither word occurs  Segments: sentences, paragraphs, or string groups delimited by anchor points Finding terminology translations from non-parallel corpora. Fung, 1997

Mutual Information [2] (2/2)  Weighted mutual information
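The weighted mutual information score (whose formula image is lost from the slide) can be sketched from the contingency counts k11..k22 defined above. This follows the standard pointwise-MI definition weighted by the joint probability; the exact weighting in Fung (1997) may differ in detail.

```python
# Sketch of weighted mutual information from segment-level contingency
# counts (assumed form: joint probability times pointwise MI).
import math

def weighted_mi(k11, k12, k21, k22):
    """k11..k22 are the contingency counts for (ws, wt) defined above."""
    n = k11 + k12 + k21 + k22
    p_joint = k11 / n                  # P(ws, wt)
    p_s = (k11 + k12) / n              # P(ws)
    p_t = (k11 + k21) / n              # P(wt)
    if p_joint == 0:
        return 0.0
    return p_joint * math.log2(p_joint / (p_s * p_t))

print(weighted_mi(10, 5, 5, 80))  # positive: the words co-occur more than chance
```

The joint-probability weight damps the score for very rare pairs, which plain pointwise MI is known to overrate.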

Similarity Measures(1/2)  Cosine similarity (Fung and McKeown,1997)  Jaccard similarity (Grefenstette,1994)  Dice similarity (Rapp, 1999)  L1 norm / City block distance (Jones & Furnas, 1987)  L2 norm / Euclidean distance (Fung, 1997) Automatic Identification of Word Translations from Unrelated English and German Corpora, R. Rapp.,1999

Similarity Measures (2/2)  L1 norm / city block distance  L2 norm / Euclidean distance  Cosine similarity  Jaccard similarity
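The formula images for these measures are lost from the slide; the standard definitions can be sketched directly over co-occurrence vectors (the sample vectors u and v are illustrative, and Jaccard is taken over the sets of nonzero dimensions, one common convention for count vectors).

```python
# Standard similarity/distance measures used to compare co-occurrence vectors.
import numpy as np

def l1(u, v):        # L1 norm / city block distance
    return float(np.abs(u - v).sum())

def l2(u, v):        # L2 norm / Euclidean distance
    return float(np.sqrt(((u - v) ** 2).sum()))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def jaccard(u, v):   # over the sets of nonzero dimensions
    a, b = set(np.nonzero(u)[0]), set(np.nonzero(v)[0])
    return len(a & b) / len(a | b)

u = np.array([1.0, 2.0, 0.0])
v = np.array([1.0, 0.0, 2.0])
print(l1(u, v), l2(u, v), cosine(u, v), jaccard(u, v))
```

Note that L1/L2 are distances (lower = more similar) while cosine and Jaccard are similarities (higher = more similar), so they must be ranked in opposite directions.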

Problems with the approach [5]  Coverage: only a few corpus words are covered by the dictionary  Synonymy / polysemy: several entries may have the same meaning (synonymy), or one entry may have several meanings (polysemy)  Similarities w.r.t. synonyms should not be treated as independent  Improvements in the form of geometric approaches  Project the co-occurrence vectors of the source and target words onto the dictionary entries  Measure the similarity between the projected vectors A geometric view on bilingual lexicon extraction from comparable corpora. Gaussier, et al., 2004

Results
Paper | Approach | Method | Corpus | Accuracy
Fung et al. 1996 | Word list based | Best candidate | English/Japanese | 29%
Fung et al. 1996 | Word list based | Top 20 candidate output | English/Japanese | 50.9%
Gaussier et al. | Geometric | Avg. precision | English/French | 44%
R. Rapp et al. | Word list based | 100 test words | English/French | 72%

Generating parallel corpora

Generating Parallel Corpora  Involves aligning the sentences in the comparable corpora to form a parallel corpus  Ways to do this:  Dictionary matching  Statistical methods

Ways to do alignment  Dictionary matching  If the words in two given sentences are translations of each other, the sentences are most likely translations of each other  The process is very slow  Accuracy is high, but it cannot be applied to a large corpus  Statistical methods  To predict the alignment, these methods use the distribution of sentence lengths in the corpus, measured either in words (Brown, 1991) or characters (Gale and Church, 1991)  Make no use of any lexical resources  Fast and accurate

Length based statistical approach  Preprocessing  Segment the text into tokens  Combine the tokens into groups (i.e., sentences)  Find anchor points  Find points in the corpus where we are sure that start and end points in one language align to start and end points in the other language  Finding these points requires analysis of the corpus

Example  Brown et al. already had anchors in their corpus  They used the Canadian parliament proceedings (‘Hansards’) as a parallel corpus  Each proceeding starts with a comment: the time of the proceeding, who was giving the speech, etc.  This information provides the anchor points. Sample text from Aligning Sentences in Parallel Corpora, P. Brown, Jennifer Lai and Robert Mercer, 1991

Aligning anchor points  Anchor points are not always perfect  Some may be missing  Some may be garbled  To find the alignment between these anchors, a dynamic programming technique is used  We find an alignment of the major anchors in the two corpora with the least total cost
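The least-total-cost alignment of two noisy anchor sequences can be sketched with edit-distance-style dynamic programming. The cost values here are illustrative (match = 0, a skipped or garbled anchor = 1), not the ones used by Brown et al.

```python
# Sketch: align two (possibly noisy) anchor-label sequences with
# dynamic programming, minimizing the total cost.
def align_anchors(a, b, skip_cost=1):
    """Minimum-cost alignment of anchor lists a, b (edit-distance style)."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * skip_cost               # anchors missing on one side
    for j in range(1, m + 1):
        d[0][j] = j * skip_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # pairing two mismatched anchors counts as two skips
            match = d[i-1][j-1] + (0 if a[i-1] == b[j-1] else 2 * skip_cost)
            d[i][j] = min(match, d[i-1][j] + skip_cost, d[i][j-1] + skip_cost)
    return d[n][m]

# One anchor missing on the second side costs exactly one skip.
print(align_anchors(["s1", "s2", "s3"], ["s1", "s3"]))  # 1
```

The same table, with back-pointers added, recovers which anchors pair up, giving the reliable segment boundaries between which sentence beads are then aligned.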

Beads  At a higher level, the corpus can be viewed as a sequence of sentence lengths occasionally separated by paragraph markers  Each of these groupings is called a bead  A bead is a type of sentence grouping Sample text from Aligning Sentences in Parallel Corpora, P. Brown, Jennifer Lai and Robert Mercer, 1991

Beads (examples of beads)
Bead type | Content
e | One English sentence
f | One French sentence
ef | One English and one French sentence
eef | Two English and one French sentence
eff | One English and two French sentences
¶e | One English paragraph
¶f | One French paragraph
¶e¶f | One English and one French paragraph

Problem Formulation  Sentences between the anchor points are generated by two random processes 1. Producing a sequence of beads 2. Choosing the length of the sentence(s) in each bead  Bead generation can be modeled using a two-state Markov model  One sentence can align to zero, one, or two sentences on the other side  This allows any of the eight beads shown in the previous table  Assumptions:

Modeling length of sentence  Model the probability of the length of a sentence given its bead  Assumptions are made:  e-beads and f-beads: the probability of le or lf is the same as the probability of le or lf in the whole corpus  ef-bead:  English sentence: length le with probability Pr(le)  French sentence: the log ratio of French to English sentence length, r = log(lf / le), is normally distributed with mean µ and variance σ²

Contd.  eef-bead:  English sentences: drawn from Pr(le)  French sentence: r is distributed according to the same normal distribution  eff-bead:  English sentence: drawn from Pr(le)  French sentences: r = log((lf1 + lf2) / le) is distributed according to the same normal distribution  Given the sum of the lengths of the two French sentences, the probability of a particular pair lf1 and lf2 is proportional to Pr(lf1) Pr(lf2)
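The ef-bead length model above can be turned into a simple match cost: the negative log-density of r = log(lf / le) under the assumed normal distribution. The parameter values below are illustrative, not the ones Brown et al. estimated.

```python
# Sketch of the length-based cost for an ef bead (one English sentence
# of length le, one French sentence of length lf), assuming
# r = log(lf / le) ~ N(mu, sigma2) as in the model above.
import math

MU, SIGMA2 = 0.07, 0.06   # illustrative parameters, not estimated values

def ef_cost(le, lf, mu=MU, sigma2=SIGMA2):
    """Negative log-density of r under N(mu, sigma2).
    Lower cost = more plausible that the two sentences align."""
    r = math.log(lf / le)
    return 0.5 * math.log(2 * math.pi * sigma2) + (r - mu) ** 2 / (2 * sigma2)

# Sentences of similar length should score better than a mismatched pair.
print(ef_cost(20, 22) < ef_cost(20, 80))  # True
```

A dynamic program over bead sequences then sums these per-bead costs (plus the bead-type probabilities from the Markov model) and picks the cheapest segmentation.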

Parameter Estimation  The parameters of the Markov model are estimated using the EM algorithm  The following results were obtained Sample text from Aligning Sentences in Parallel Corpora, P. Brown, Jennifer Lai and Robert Mercer, 1991

Results  In a random sample of 1000 sentences, only 6 were not translations of each other  Brown et al. have also studied the effect of anchor points  According to them,  with paragraph markers but no anchor points, a 2.0% error rate is expected  with anchor points but no paragraph markers, a 2.3% error rate is expected  with neither anchor points nor paragraph markers, a 3.2% error rate is expected

Conclusion  Comparable corpora can be used to generate bilingual dictionaries and parallel corpora  Generating a bilingual dictionary  Polysemy and sense disambiguation remain major challenges  Generating parallel corpora  Given the anchor points, the aligner is likely to give good results  The experiments were very specific to the corpora used; it is hard to generalize the accuracy  Sentence pairs whose lengths make them highly likely to align, but which are completely wrong translations, might confuse the aligner

References
1. Fung, P. (1995). Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus. In Proceedings of the 3rd Annual Workshop on Very Large Corpora, Boston, Massachusetts.
2. Fung, P. and McKeown, K. (1997). Finding terminology translations from non-parallel corpora. In Proceedings of the 5th Annual Workshop on Very Large Corpora, Hong Kong.
3. Rapp, R. (1995). Identifying word translations in nonparallel texts. In Proceedings of the 33rd Meeting of the Association for Computational Linguistics, Cambridge, Massachusetts.
4. Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. In Proceedings of ACL-99, College Park, USA.

5. Gaussier, E., Renders, J.-M., Matveeva, I., Goutte, C., and Dejean, H. (2004). A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 527–534, Barcelona, Spain.
6. Brown, P. F., Lai, J. C., and Mercer, R. L. (1991). Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL '91).