Word and Phrase Alignment Presenters: Marta Tatu Mithun Balakrishna

Translating Collocations for Bilingual Lexicons: A Statistical Approach Frank Smadja, Kathleen R. McKeown and Vasileios Hatzivassiloglou CL-1996

3 Overview – Champollion
- Translates collocations from English into French using an aligned corpus (Hansards)
- The translation is constructed incrementally, adding one word at a time
- Correlation method: the Dice coefficient
- Accuracy between 65% and 78%

4 The Similarity Measure
- Dice coefficient (Dice, 1945):
  Dice(X, Y) = 2 p(X, Y) / (p(X) + p(Y))
  where p(X, Y), p(X), and p(Y) are the joint and marginal probabilities of X and Y
- If the probabilities are estimated using maximum likelihood, then
  Dice(X, Y) = 2 f_XY / (f_X + f_Y)
  where f_X, f_Y, and f_XY are the absolute frequencies of appearance of “1”s for X and Y
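A minimal Python sketch of the maximum-likelihood estimate above (the function name and toy numbers are ours, not the paper's):

```python
def dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Dice coefficient from absolute frequencies (the ML estimate above)."""
    return 2.0 * f_xy / (f_x + f_y) if (f_x + f_y) else 0.0

# Toy numbers: a candidate pair co-occurring in 40 aligned regions,
# with the two words appearing 60 and 50 times overall.
print(dice(40, 60, 50))  # 0.7272...
```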

5 Algorithm - Preprocessing
- Source and target language sentences must be aligned (Gale and Church 1991)
- A list of collocations to be translated must be provided (Xtract, Smadja 1993)

6 Algorithm 1/3
1. Champollion identifies a set S of k words highly correlated with the source collocation
   - The target collocation is in the powerset of S
   - These words have a Dice measure ≥ T_d (= 0.10) and appear ≥ T_f (= 5) times
2. Form all pairs of words from S
3. Evaluate the correlation between each pair and the source collocation (Dice)

7 Algorithm 2/3
4. Keep the pairs that score above the threshold T_d
5. Construct 3-word elements containing one of the highly correlated pairs plus a member of S
6. …
7. Until, for some n ≤ k, no n-word element scores above the threshold

8 Algorithm 3/3
8. Champollion selects the best translation among the top candidates
9. In case of ties, the longer collocation is preferred
10. Determine whether the selected translation is a single word, a flexible collocation, or a rigid collocation; for multiword translations: are the words used consistently in the same order and at the same distance?
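A rough, self-contained sketch of the candidate growth in steps 2-9; the Dice-scoring function and the toy scores below are hypothetical stand-ins for counting against the aligned corpus:

```python
from itertools import combinations

def champollion_search(S, dice_score, t_d=0.10):
    """Sketch of the incremental search on slides 6-8 (not the original code).
    S is the set of highly correlated words from step 1; dice_score(group)
    is a caller-supplied correlation function."""
    survivors = [frozenset(p) for p in combinations(S, 2)
                 if dice_score(frozenset(p)) >= t_d]           # steps 2-4
    candidates = list(survivors)
    while survivors:                                           # steps 5-7
        grown = {g | {w} for g in survivors for w in S if w not in g}
        survivors = [g for g in grown if dice_score(g) >= t_d]
        candidates.extend(survivors)
    # steps 8-9: highest score wins; ties go to the longer collocation
    return max(candidates, key=lambda g: (dice_score(g), len(g)), default=None)

# Toy scores: pretend "premier ministre" is the strongly correlated pair.
scores = {frozenset({"premier", "ministre"}): 0.80,
          frozenset({"premier", "ministre", "le"}): 0.04}
print(champollion_search({"premier", "ministre", "le"},
                         lambda g: scores.get(g, 0.0)))
# frozenset({'premier', 'ministre'})
```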

9 Experimental Setup
- DB1 = 3.5 × 10^6 words (8 months of 1986)
- DB2 = 8.5 × 10^6 words (1986 and 1987)
- C1 = 300 collocations of mid-range frequency from DB1
- C2 = 300 collocations from 1987
- C3 = 300 collocations from 1988
- Judgments by three fluent bilingual speakers
- Caveat: Canadian French vs. continental French

10 Results

11 Future Work
- Translating the closed-class words
- Tools for the target language
- Separating corpus-dependent translations from general ones
- Handling low-frequency collocations
- Analysis of the effects of the thresholds
- Incorporating the length of the translation into the score
- Using nonparallel corpora

12 Comments

A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Pascal Fung ACL-1995

14 Goal of the Paper
- Create a bilingual lexicon of nouns and proper nouns
- From unaligned, noisy parallel texts of Asian/Indo-European language pairs
- Using a pattern matching method

15 Introduction
- Previous research relies on sentence-aligned, parallel texts
- Alignment is not always practical:
  - unclear sentence boundaries in corpora
  - noisy text segments present in only one language
- Two main steps:
  - find a small bilingual primary lexicon
  - compute a better secondary lexicon from the partially aligned texts

16 Algorithm
1. Tag the English half of the parallel text
   - The English part is tagged with a modified POS tagger
   - Translations are found for nouns, plural nouns, and proper nouns only (they have consistent translations over the entire text)

17 Algorithm
2. Positional difference vectors
   - A word and its translated counterpart correspond in their frequency and in their positions
   - The correspondence need not be linear
   - Calculation: given p, the position vector of a word, the positional difference vector V is
     V[i−1] = p[i] − p[i−1]
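The two vectors are easy to state in code; a small illustrative sketch (the example text is ours):

```python
def position_vector(word, tokens):
    """Positions (1-based) at which `word` occurs in the text."""
    return [i + 1 for i, tok in enumerate(tokens) if tok == word]

def positional_difference(p):
    """V[i-1] = p[i] - p[i-1], as defined on the slide."""
    return [p[i] - p[i - 1] for i in range(1, len(p))]

text = "the treaty says the treaty binds the parties".split()
p = position_vector("the", text)    # [1, 4, 7]
print(positional_difference(p))     # [3, 3]
```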

18 Algorithm

19 Algorithm
3. Match pairs of positional difference vectors, giving scores
   - Dynamic Time Warping (Fung & McKeown, 1994) for non-identical vectors
   - Trace the correspondence between all points in V1 and V2
   - No penalty for deletions and insertions
   - Statistical filters

20 Dynamic Time Warping
- Given V1 and V2, which point in V1 corresponds to which point in V2?
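A minimal dynamic-time-warping sketch for two positional difference vectors. The absolute-difference cost and the penalty scheme here are the textbook choices, not necessarily the paper's (the previous slide notes that insertions and deletions carry no penalty in the original):

```python
def dtw(v1, v2):
    """Textbook DTW: total cost of the cheapest monotonic alignment
    between the points of v1 and v2."""
    n, m = len(v1), len(v2)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(v1[i - 1] - v2[j - 1])
            D[i][j] = cost + min(D[i - 1][j],       # insertion
                                 D[i][j - 1],       # deletion
                                 D[i - 1][j - 1])   # match
    return D[n][m]

print(dtw([3, 3, 5], [3, 4, 5]))  # 1.0
```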

21 Algorithm

22 Algorithm
5. Finding anchor points and eliminating noise
   - Run DTW on every selected word pair
   - Obtain the DTW score and the DTW path
   - Plot the DTW paths of all such word pairs
   - Keep the highly reliable points and discard the rest
   - A point (i, j) is noise if …

23 Algorithm

24 Algorithm
6. Finding low-frequency bilingual word pairs
   - Segment the text non-linearly and build binary vectors:
     V_1[i] = 1 if the word occurs in the i-th segment
   - Score candidates with a binary vector correlation measure
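A small sketch of the segment binary vectors, with Dice as one plausible binary correlation measure (the paper's exact measure may differ; the toy segments are ours):

```python
def segment_vector(word, segments):
    """V[i] = 1 if the word occurs in the i-th segment, else 0."""
    return [1 if word in seg else 0 for seg in segments]

def binary_dice(v1, v2):
    """Dice over binary segment vectors, one plausible correlation measure."""
    total = sum(v1) + sum(v2)
    return 2.0 * sum(a & b for a, b in zip(v1, v2)) / total if total else 0.0

eng = [{"governor"}, {"council"}, {"governor", "bill"}]  # toy segments
chi = [{"zongdu"}, {"huiyi"}, {"zongdu"}]
print(binary_dice(segment_vector("governor", eng),
                  segment_vector("zongdu", chi)))        # 1.0
```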

25 Results

26 Comments

Automated Dictionary Extraction for “Knowledge-Free” Example-Based Translation Ralf D. Brown TMI-1997

28 Goal of the Paper
- Extract a bilingual dictionary using an aligned bilingual corpus
- Perform tests to compare the performance of PanEBMT using:
  - the Collins Spanish-English dictionary + WordNet English root/synonym list
  - various automatically extracted bilingual dictionaries

29 Introduction

30 Extracting Bilingual Dictionary
- Extracted from the corpus using a correspondence table and a threshold scheme
- Correspondence table:
  - a two-dimensional array indexed by source-language words and by target-language words
  - the entries for the cross-product of the words in each sentence pair are incremented
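A sketch of the counting step (whether repeated words within a sentence increment an entry once or per occurrence is our guess; types are counted once here):

```python
from collections import defaultdict

def correspondence_table(sentence_pairs):
    """Increment the entry for every (source word, target word) pair in the
    cross-product of each aligned sentence pair."""
    table = defaultdict(int)
    for src, tgt in sentence_pairs:
        for s in set(src):
            for t in set(tgt):
                table[s, t] += 1
    return table

pairs = [("el tratado entra".split(), "the treaty enters".split())]
print(correspondence_table(pairs)["tratado", "treaty"])  # 1
```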

31 Extracting Bilingual Dictionary
- The counts are biased toward language pairs with similar word order
- Threshold settings:
  - a step function: unreachably high for co-occurrence counts below MIN, constant otherwise
  - a sliding scale: starts at 1.0 for a co-occurrence count of 1 and slides smoothly down to the MIN threshold value

32 Extracting Bilingual Dictionary
- Filtering:
  - symmetric threshold test
  - asymmetric threshold test
- Any element of the correspondence table that fails both tests is set to zero
- The non-zero elements are added to the dictionary
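A sketch combining the sliding-scale threshold with the filtering step; the MIN value, the linear interpolation, and the exact form of the directional tests are our assumptions, not the paper's definitions:

```python
MIN_T = 0.2  # illustrative MIN threshold value

def sliding_threshold(cooc, max_cooc=20):
    """Sliding scale from slide 31: 1.0 at a single co-occurrence, sliding
    smoothly (here: linearly) down to MIN_T."""
    frac = (min(cooc, max_cooc) - 1) / (max_cooc - 1)
    return 1.0 - frac * (1.0 - MIN_T)

def filter_table(table, src_counts, tgt_counts):
    """Keep an entry if it passes at least one directional test (slide 32);
    the survivors become dictionary entries."""
    return {(s, t): n for (s, t), n in table.items()
            if n / src_counts[s] >= sliding_threshold(n)
            or n / tgt_counts[t] >= sliding_threshold(n)}

# Toy: "tratado" seen 10 times, "treaty" 9 times, co-occurring 9 times.
print(filter_table({("tratado", "treaty"): 9},
                   {"tratado": 10}, {"treaty": 9}))  # entry survives
```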

33 Extracting Bilingual Dictionary - Results

34 Extracting Bilingual Dictionary - Errors
- High-frequency terms are error-ridden
- Second pass:
  - short-list the high-frequency words (all words which appear in at least 20% of the source sentences)
  - short-list the sentence pairs containing exactly one or two high-frequency words
- Result: zero error for 7 of the 16 high-frequency words
- Merge with the results from the first pass

35 Experimental Setup
- Manually created tokenization: 47 equivalence classes, 880 words and translations of each word
- Two test texts:
  - 275 UN corpus sentences (in-domain)
  - 253 newswire sentences (out-of-domain)

36 Results

37 Comments

Extracting Paraphrases from a Parallel Corpus Regina Barzilay and Kathleen R. McKeown ACL-2001

39 Overview
- A corpus-based unsupervised learning algorithm for paraphrase extraction
- Lexical paraphrases, single- and multi-word: (refuse, say no)
- Morpho-syntactic paraphrases: (king’s son, son of the king), (start to talk, start talking)
- Hypothesis: phrases that appear in similar contexts are paraphrases

40 Data
- Multiple English translations of literary texts written by foreign authors
- Madame Bovary, Fairy Tales, Twenty Thousand Leagues Under the Sea, etc.
- 11 translations in total

41 Preprocessing
- Sentence alignment
  - translations of the same source contain a number of identical words: on average, 42% of the words in corresponding sentences are identical
  - dynamic programming (Gale & Church, 1991)
  - 94.5% correct alignments (on 127 sentences)
- POS tagger and chunker → NP and VP chunks

42 Algorithm – Bootstrapping
- Co-training method: DLCoTrain (Collins & Singer, 1999)
- Similar contexts surrounding two phrases → paraphrase
- Good paraphrase-predicting contexts → new paraphrases
1. Analyze the contexts surrounding identical words in aligned sentence pairs
2. Use these contexts to learn new paraphrases

43 Feature Extraction
- Paraphrase features
  - lexical: the tokens of each phrase in the paraphrase pair
  - syntactic: POS tags
- Contextual features: the left and right syntactic contexts surrounding the paraphrase (POS n-grams)
  - tried to comfort her → left_1 = “VB_1 TO_2”, right_1 = “PRP$_3”
  - tried to console her → left_2 = “VB_1 TO_2”, right_2 = “PRP$_3”
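A small sketch of extracting such POS contexts around a phrase (the window size k and the function name are illustrative):

```python
def context_features(pos_tags, start, end, k=2):
    """POS n-gram contexts of a phrase spanning tokens [start, end):
    k tags to the left and k tags to the right."""
    left = tuple(pos_tags[max(0, start - k):start])
    right = tuple(pos_tags[end:end + k])
    return left, right

tags = ["VB", "TO", "VB", "PRP$"]     # tried to comfort her
print(context_features(tags, 2, 3))   # (('VB', 'TO'), ('PRP$',))
```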

44 Algorithm
- Initialization
  - identical words are the seeds (positive paraphrasing examples)
  - negative examples are created by pairing each word with all the other words in the sentence
- Training the context classifier
  - record the contexts of length ≤ 3 around positive and negative paraphrases
  - identify the strong predictors based on their strength and frequency

45 Algorithm
  - keep the k = 10 most frequent contexts with a strength > 95%
- Training the paraphrasing classifier
  - using the previously extracted context rules, derive new pairs of paraphrases
- Stop when no more paraphrases are discovered
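A drastically simplified, self-contained version of the idea: accept as paraphrase candidates any differing word pair whose immediate contexts match across two aligned translations. The real system iterates the co-training loop above over learned POS n-gram rules; plain word contexts are used here only to keep the sketch short:

```python
def naive_paraphrases(s1, s2):
    """One-shot toy version of the bootstrapping idea on slides 42-45."""
    def ctx(sent, i):  # (left word, right word) around position i
        return (sent[i - 1] if i > 0 else "<s>",
                sent[i + 1] if i + 1 < len(sent) else "</s>")
    return {(w1, w2)
            for i, w1 in enumerate(s1)
            for j, w2 in enumerate(s2)
            if w1 != w2 and ctx(s1, i) == ctx(s2, j)}

s1 = "he tried to comfort her".split()
s2 = "he tried to console her".split()
print(naive_paraphrases(s1, s2))  # {('comfort', 'console')}
```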

46 Results
- 9,483 paraphrases and 25 morpho-syntactic rules extracted
- On a sample of 500: 86.5% correct paraphrases without context, 91.6% with context
- 69% recall, evaluated on 50 sentences

47 Future Work
- Extract paraphrases from comparable corpora (news reports about the same event)
- Improve the context representation

48 Comments

49 Thank You!