Catch the Link! Combining Clues for Word Alignment Jörg Tiedemann Uppsala University

Outline
- Background
  - What do we want?
  - What do we have?
  - What do we need?
- Clue Alignment
  - What is a clue?
  - How do we find clues?
  - How do we use clues?
  - What do we get?

What do we want?
[Diagram: a source text and its translations (translation 1, translation 2) are fed to a sentence aligner, producing a parallel corpus; a word aligner then produces token links and type links, yielding an aligned corpus. The whole process should run automatically and be language independent.]

What do we have?
- tokeniser (ca 99%)
- POS tagger (ca 96%)
- lemmatiser (ca 99%)
- shallow parser (ca 92%), parser (> 80%)
- sentence aligner (ca 96%)
- word aligner
  - 75% precision
  - 45% recall

What’s the problem with Word Alignment?
Word alignment challenges:
- non-linear mapping
- grammatical/lexical differences
- translation gaps
- translation extensions
- idiomatic expressions
- multi-word equivalences

Examples:
(1) Our Hasid is in his late twenties.
(2) Vår chassid är bortåt de trettio.
(Saul Bellow, “To Jerusalem and back: a personal account”)

(1) I take the middle seat, which I dislike, but I am not really put out.
(2) Jag tar mittplatsen, vilket jag inte tycker om, men det gör mig inte så mycket.
(Saul Bellow, “To Jerusalem and back: a personal account”)

(1) Armén kommer att reformeras och effektiviseras.
(2) The army will be reorganized with the aim of making it more effective.
(The Declarations of the Swedish Government, 1988)

(1) Neutralitetspolitiken stöds av ett starkt försvar till värn för vårt oberoende.
(2) Our policy of neutrality is underpinned by a strong defence.
(The Declarations of the Swedish Government, 1988)

(1) Alsop says, "I have a horror of the bad American practice of choosing up sides in other people's politics,..."
(2) Alsop förklarar: "Jag fasar för den amerikanska ovanan att välja sida i andra människors politik,...”
(Saul Bellow, “To Jerusalem and back: a personal account”)

So what? What are the real problems?
Word alignment:
- uses simple, fixed tokenisation
- fails to identify appropriate translation units
- ignores contextual dependencies
- ignores relevant linguistic information
- uses poor morphological analyses

What do we need?
- flexible tokenisation
- possible multi-word units
- linguistic tools for several languages
- integration of linguistic knowledge
- combination of knowledge resources
- alignment in context

Let’s go! Clue Alignment!
- finding clues
- combining clues
- aligning words

Word Alignment Clues
[Example: “The United Nations conference has started today.” / “Idag började FN-konferensen.”, annotated with POS tags (DT NNP NNP NN VBZ VBN RB, …) and chunks (NP VP ADVP / ADVP VC NP); one clue links “conference” with “konferensen”.]

Word Alignment Clues  Def.: A word alignment clue C i (s,t) is a probability which indicates an association between two lexical items, s and t, from parallel texts. zDef.: A lexical item is a set of words with associated features attached to it.

How do we find clues? (1)
Clues can be estimated from association scores:
  C_i(s,t) = w_i * A_i(s,t)
- co-occurrence
  - Dice coefficient: A_1(s,t) = Dice(s,t)
  - Mutual information: A_2(s,t) = I(s;t)
- string similarity
  - longest common subsequence ratio: A_3(s,t) = LCSR(s,t)
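
For concreteness, a small Python sketch (not from the presentation) of two of the association scores named above, Dice and LCSR, turned into weighted clues; the uniform weight of 0.5 corresponds to the uniform normalisation value quoted later on the results slide.

    def dice(cooc, freq_s, freq_t):
        """Dice coefficient from co-occurrence counts: 2*f(s,t) / (f(s) + f(t))."""
        return 2.0 * cooc / (freq_s + freq_t) if (freq_s + freq_t) else 0.0

    def lcsr(s, t):
        """Longest common subsequence ratio: |LCS(s,t)| / max(|s|, |t|)."""
        m, n = len(s), len(t)
        table = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                table[i + 1][j + 1] = (table[i][j] + 1 if s[i] == t[j]
                                       else max(table[i][j + 1], table[i + 1][j]))
        return table[m][n] / max(m, n) if max(m, n) else 0.0

    def weighted_clue(assoc, weight=0.5):
        """C_i(s,t) = w_i * A_i(s,t)."""
        return weight * assoc

    print(lcsr("conference", "fn-konferensen"))  # about 0.57, as in the later example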

How do we find clues? (2)
Clues can be estimated from training data:
  C_i(s,t) = w_i * P(f_t | f_s) ≈ w_i * freq(f_t, f_s) / freq(f_s)
where f_s, f_t are features of s and t, e.g.
- part-of-speech sequences of s, t
- phrase category (NP, VP etc.), syntactic function
- word position
- context features
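
Again as an illustration only, a sketch of how such a feature-based clue could be estimated from already-linked training pairs; the function and feature names are assumptions, not the original implementation.

    from collections import Counter

    def learn_feature_clue(linked_feature_pairs, weight=0.5):
        """Estimate C_i(s,t) = w_i * P(f_t | f_s), approximated by relative
        frequencies freq(f_t, f_s) / freq(f_s) over linked training pairs."""
        pairs = list(linked_feature_pairs)
        pair_freq = Counter(pairs)
        src_freq = Counter(f_s for f_s, _ in pairs)

        def clue(f_s, f_t):
            if not src_freq.get(f_s):
                return 0.0
            return weight * pair_freq[(f_s, f_t)] / src_freq[f_s]

        return clue

    # Hypothetical POS-sequence features taken from previously linked word pairs:
    pos_clue = learn_feature_clue([("NNP NNP", "PM"), ("NNP NNP", "PM"), ("NNP NNP", "NN")])
    print(pos_clue("NNP NNP", "PM"))  # 0.5 * 2/3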

How do we use clues? (1)
- Clues are simply sets of association measures.
- The crucial point: we have to combine them!
If C_i(s,t) = P(a_i), define the total clue as
  C_all(s,t) = P(A) = P(a_1 ∪ a_2 ∪ ... ∪ a_n)
Clues are not mutually exclusive!
  P(a_1 ∪ a_2) = P(a_1) + P(a_2) - P(a_1 ∩ a_2)
Assume independence!
  P(a_1 ∩ a_2) = P(a_1) * P(a_2)
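
Under the independence assumption, the union over n clues reduces to C_all(s,t) = 1 - Π_i (1 - C_i(s,t)). A short sketch (mine, not the author’s code) that reproduces the numbers on the next slide:

    def combine_clues(scores):
        """Probabilistic disjunction of independent clues:
        P(a1 ∪ ... ∪ an) = 1 - (1 - P(a1)) * ... * (1 - P(an))."""
        remainder = 1.0
        for p in scores:
            remainder *= (1.0 - p)
        return 1.0 - remainder

    print(combine_clues([0.4, 0.3]))        # ≈ 0.58   United / FN-konferensen
    print(combine_clues([0.4, 0.5, 0.29]))  # ≈ 0.787  Nations / FN-konferensen
    print(combine_clues([0.5, 0.57]))       # ≈ 0.785  conference / FN-konferensen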

How do we use clues? (2)
- Clues can refer to any set of tokens from source and target language segments:
  - overlaps
  - inclusions
- Def.: A clue shares its indication with all member tokens!
  - this allows clue combinations at the level of single tokens

Clue overlaps - an example
  The United Nations conference has started today. / Idag började FN-konferensen.

Clue 1 (co-occurrence):
  United Nations      -> FN-konferensen  0.4
  Nations conference  -> FN-konferensen  0.5
  United              -> FN-konferensen  0.3

Clue 2 (string similarity):
  conference          -> FN-konferensen  0.57
  Nations             -> FN-konferensen  0.29

Clue_all:
  United              -> FN-konferensen  0.58
  Nations             -> FN-konferensen  0.787
  conference          -> FN-konferensen  0.785

The Clue Matrix
[Matrix of combined clue values between the tokens of “The United Nations conference has started today” and “Idag började FN-konferensen”, built from the clues listed on the slide:]

Clue 1 (co-occurrence):
  The United Nations  -> FN-konferensen  0.5
  United Nations      -> FN-konferensen  0.4
  has                 -> började         0.2
  started             -> började         0.6
  started today       -> idag            0.3
  Nations conference  -> började

Clue 2 (string similarity):
  conference          -> FN-konferensen  0.57
  Nations             -> FN-konferensen  0.29
  today               -> idag            0.4

Clue Alignment (1)
General principles:
- combine all clues and fill the matrix
- highest score = best link
- allow overlapping links only if
  - there is no better link for both tokens
  - the tokens are next to each other
- links which overlap at one point form a link cluster

Clue Alignment (2)
The alignment procedure:
1. find the best link
2. remove the best link (set its value to 0)
3. check for overlaps
   - accept: add to the set of link clusters
   - dismiss otherwise
4. continue with 1 until no more links are found (or all values are below a certain threshold)
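
The following is a simplified sketch of this greedy procedure in Python, assuming the clue matrix is a dict of combined scores indexed by (source position, target position); the adjacency condition for accepting overlapping links is my reading of the previous slide, not a faithful reimplementation.

    def clue_align(clue_matrix, threshold=0.4):
        """Greedy clue alignment: repeatedly take the best remaining link,
        accept it if it does not overlap an existing link, or if it overlaps
        only with adjacent tokens (extending a link cluster)."""
        scores = dict(clue_matrix)               # (src_pos, trg_pos) -> combined clue score
        links = set()
        while scores:
            best = max(scores, key=scores.get)   # 1. find the best link
            if scores[best] < threshold:         # stop when all values fall below the threshold
                break
            del scores[best]                     # 2. remove the best link
            s, t = best
            # 3. check for overlaps with links accepted so far
            overlapping = [(ls, lt) for ls, lt in links if ls == s or lt == t]
            if all(abs(ls - s) <= 1 and abs(lt - t) <= 1 for ls, lt in overlapping):
                links.add(best)                  # accept: extends or starts a link cluster
            # otherwise dismiss and continue with the next best link (step 4)
        return links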

Clue Alignment (3)
[Step-by-step trace for “The United Nations conference has started today” / “Idag började FN-konferensen”:]
  Best link: Nations -> FN-konferensen (0.787)
    link clusters: {Nations -> FN-konferensen}
  Best link: started -> började
    link clusters: {Nations -> FN-konferensen}, {started -> började}
  Best link: United -> FN-konferensen (0.7)
    link clusters: {United Nations -> FN-konferensen}, {started -> började}
  Best link: today -> idag (0.58)
    link clusters: {United Nations -> FN-konferensen}, {started -> började}, {today -> idag}
  Best link: conference -> FN-konferensen (0.57)
    link clusters: {United Nations conference -> FN-konferensen}, {started -> började}, {today -> idag}
  Best link: The -> FN-konferensen (0.5)
    link clusters: {The United Nations conference -> FN-konferensen}, {started -> började}, {today -> idag}
  Best link: has -> började (0.2)
    link clusters: {The United Nations conference -> FN-konferensen}, {has started -> började}, {today -> idag}

Bootstrapping
- again: clues can be estimated from training data
- self-training: use available links as training data
- goal: learn new clues for the next step
- risk: increased noise (lower precision)

Learning Clues
- POS clue
  - assumption: word pairs with certain POS tags are more likely to be translations of each other than other word pairs
  - features: POS-tag sequences
- position clue
  - assumption: translations are relatively close to each other (especially in related languages)
  - features: relative word positions
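
To make the two learned clues concrete, here is a rough sketch of how their features could be read off the links found in a previous pass (the bootstrapping step above); the feature encodings, in particular the position buckets, are my assumptions.

    def pos_feature(link, src_tags, trg_tags):
        """POS clue feature: the POS-tag sequences of the linked source and target tokens."""
        src_positions, trg_positions = link
        return (" ".join(src_tags[i] for i in src_positions),
                " ".join(trg_tags[j] for j in trg_positions))

    def position_feature(link, src_len, trg_len, buckets=10):
        """Position clue feature: bucketed relative positions of the linked tokens."""
        src_positions, trg_positions = link
        return (int(buckets * min(src_positions) / src_len),
                int(buckets * min(trg_positions) / trg_len))

    # Feature pairs collected over all links of the previous alignment pass can be
    # fed to learn_feature_clue() from the earlier sketch to obtain the new clues
    # used in the next bootstrapping iteration.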

So much for the theory! Results?!
The setup - corpus and basic tools:
- Saul Bellow’s “To Jerusalem and back: a personal account”, English/Swedish, about 170,000 words
- English POS tagger (Grok), trained on Brown, PTB
- English shallow parser (Grok), trained on PTB
- English stemmer, suffix truncation
- Swedish POS tagger (TnT), trained on SUC
- Swedish CFG parser (Megyesi), rule-based
- Swedish lemmatiser, database taken from SUC

Results!?! … not yet
- basic clues:
  - Dice coefficient (≥ 0.3)
  - LCSR (0.4), ≥ 3 characters/string
- learned clues:
  - POS clue
  - position clue
- clue alignment threshold = 0.4
- uniform normalisation (0.5)

Results!!! Come on!
Preliminary results (… work in progress …)
- Evaluation: 500 random samples have been linked manually (gold standard)
- Metrics: precision_PWA & recall_PWA (Ahrenberg et al., 2000)

Give me more numbers!
- The impact of parsing. How much do we gain?
- Alignment results with n-grams, (shallow) parsing, and both:

One more thing.
- Stemming, lemmatisation and all that …
- Do we need morphological analyses for Swedish and English?

Conclusions
- Combining clues helps to find links
- Linguistic knowledge helps:
  - POS tags are valuable clues
  - word position gives hints for related languages
  - parsing helps with the segmentation problem
  - lemmatisation gives higher recall
- We need more experiments, tests with other language pairs, more/other clues
- Recall & precision are still low

POS clues - examples
[Table of learned POS clues with columns score / source tag sequence / target tag sequence; the rows are only partially recoverable. English (Penn) tags mentioned include VBZ, WRB, VBP, RB, VBD, DT, NNP, NN, PRP, NNS, VB, RBR and JJ; Swedish (SUC) tags include RH0S, RG0S, RGCS and AQP0SNDS; the two preserved scores are 0.6 and 0.5.]

Position clues - examples
[Table of learned position clues with columns score / position mapping (e.g. mappings onto target positions such as 6, 7, 8); the individual values were not preserved in the transcript.]

Open Questions
- Normalisation! How do we estimate the w_i's?
- Non-contiguous phrases: why not allow long-distance clusters?
- Independence assumption: what is the impact of dependencies?
- Alignment clues: what is a bad clue, what is a good one? Contextual clues?

Clue alignment - example
[Clue alignment matrix for a Swedish–English sentence pair; the English side reads “amused, my wife asks why I ordered the kosher lunch”, while the Swedish tokens (min, fru, undrar, road, varför, jag, beställde, koscherlunch, …) and the matrix cells were not preserved in the transcript.]

Alignment - examples
  the Middle East -> Mellersta Östern
  afford -> kosta på
  at least -> åtminstone
  an American satellite -> en satellit
  common sense -> sunda förnuftet
  Jerusalem area -> Jerusalemområdet
  kosher lunch -> koscherlunch
  leftist anti-Semitism -> vänsterantisemitism
  left-wing intellectuals -> vänsterintellektuella
  literary history -> litteraturhistoriska
  manuscript collection -> handskriftsamling
  Marine orchestra -> marinkårsorkester
  marionette theater -> marionetteatern
  mathematical colleagues -> matematikkolleger
  mental character -> mentalitet
  far too -> alldeles

Alignment - examples
  a banquet -> en bankett
  a battlefield -> ett slagfält
  a day -> dagen
  the Arab states -> arabstaterna
  the Arab world -> arabvärlden
  the baggage carousel -> bagagekarusellen
  the Communist dictatorships -> kommunistdiktaturerna
  The Fatah terrorists -> Al Fatah-terroristerna
  the defense minister -> försvarsministern
  the defense minister -> försvarsminister
  the daughter -> dotter
  the first President -> förste president

Alignment - examples
  American imperial interests -> amerikanska imperialistintressenas
  Chicago schools -> Chicagos skolor
  decidedly anti-Semitic -> avgjort antisemitiska
  his identity -> sin identitet
  his interest -> sitt intresse
  his interviewer -> hans intervjuare
  militant Islam -> militanta muhammedanismen
  no longer -> inte längre
  sophisticated arms -> avancerade vapen
  still clearly -> uppenbarligen ännu
  dozen Russian -> dussin ryska
  exceedingly intelligent -> utomordentligt intelligent
  few drinks -> några drinkar
  goyish democracy -> gojernas demokrati
  industrialized countries -> industrialiserade länderna
  has become -> har blivit

Gold standard - MWUs
  link: Secretary of State -> Utrikesminister
  link type: regular
  unit type: multi -> single
  source text: Secretary of State Henry Kissinger has won the Middle Eastern struggle by drawing Egypt into the American camp.
  target text: Utrikesminister Henry Kissinger har vunnit slaget om Mellanöstern genom att dra in Egypten i det amerikanska lägret.

Gold standard - fuzzy links
  link: unrelated -> inte tillhör hans släkt
  link type: fuzzy
  unit type: single -> multi
  source text: And though he is not permitted to sit beside women unrelated to him or to look at them or to communicate with them in any manner (all of which probably saves him a great deal of trouble), he seems a good-hearted young man and he is visibly enjoying himself.
  target text: Och fastän han inte får sitta bredvid kvinnor som inte tillhör hans släkt eller se på dem eller meddela sig med dem på något sätt (alltsammans saker som utan tvivel besparar honom en mängd bekymmer) verkar han vara en godhjärtad ung man, och han ser ut att trivas gott.

Gold standard - null links
  link: do -> (null)
  link type: null
  unit type: single -> null
  source text: "How is it that you do not know English?"
  target text: "Hur kommer det sig att ni inte talar engelska?"

Gold standard - morphology
  link: the masses -> massorna
  link type: regular
  unit type: multi -> single
  source text: Arafat was unable to complete the classic guerrilla pattern and bring the masses into the struggle.
  target text: Arafat har inte kunnat fullborda det klassiska gerillamönstret och föra in massorna i kampen.

Evaluation metrics
- C_src – number of overlapping source tokens in (partially) correct link proposals; C_src = 0 for incorrect link proposals
- C_trg – number of overlapping target tokens in (partially) correct link proposals; C_trg = 0 for incorrect link proposals
- S_src – number of source tokens proposed by the system
- S_trg – number of target tokens proposed by the system
- G_src – number of source tokens in the gold standard
- G_trg – number of target tokens in the gold standard
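
The precision_PWA and recall_PWA formulas themselves are not spelled out on the slide; a plausible reconstruction from these six counts (following the spirit of Ahrenberg et al., 2000, but not guaranteed to match their exact definition) is:

    def precision_pwa(c_src, c_trg, s_src, s_trg):
        """Token overlap of the proposed links, relative to what the system proposed."""
        return (c_src + c_trg) / (s_src + s_trg) if (s_src + s_trg) else 0.0

    def recall_pwa(c_src, c_trg, g_src, g_trg):
        """Token overlap of the proposed links, relative to the gold standard."""
        return (c_src + c_trg) / (g_src + g_trg) if (g_src + g_trg) else 0.0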

Evaluation metrics - example

Corpus markup (Swedish)
[Markup example for the sentence “Det är som ett besök i barndomen”; the markup itself was not preserved in the transcript.]

Corpus markup (English)
[Markup example for the sentence “It is my childhood revisited.”; the markup itself was not preserved in the transcript.]

… is that all?
- How good are the new clues?
- Alignment results with learned clues only (neither LCSR nor Dice):