Information Extraction Lecture 12 – Multilingual Extraction CIS, LMU München Winter Semester 2014-2015 Dr. Alexander Fraser, CIS.

Outline
Up until today: basics of information extraction, primarily based on named entities and relation extraction
Last lecture was on sentiment analysis
There are some other tasks associated with the basic idea of information extraction
Two important tasks are terminology extraction and bilingual dictionary extraction
I will talk very briefly about terminology extraction (one slide) and then focus on bilingual dictionary extraction

Terminology Extraction
Terminology extraction tries to find words or sequences of words which have a domain-specific meaning
For instance, "rotator blade" refers to a specialized concept in helicopters or wind turbines
To do terminology extraction, we need domain-specific corpora
Terminology extraction is often broken down into two phases:
1. First, a very large list of candidate types is built by extracting all tokens from the domain-specific corpus that match a linguistic pattern (such as noun phrases)
2. Then statistical tests are used to determine whether the presence of a candidate in the domain-specific corpus implies that it is domain-specific terminology
The challenge here is to separate terminology from general language
– A "blue helicopter" is not a technical term, it is a helicopter which is blue
– "rotator blade" is a technical term
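A minimal sketch of the two phases, under loud assumptions: adjacent word pairs stand in for the linguistic pattern (a real system would use a noun phrase pattern over part-of-speech tags), and a smoothed relative-frequency ratio against a general-language corpus stands in for a proper statistical test such as a log-likelihood ratio. The function names and the smoothing constant are hypothetical.

```python
from collections import Counter


def bigrams(tokens):
    """Phase 1 stand-in: every adjacent word pair is a candidate term."""
    return list(zip(tokens, tokens[1:]))


def term_scores(domain_tokens, general_tokens, smoothing=1.0):
    """Phase 2 stand-in: ratio of smoothed relative frequencies.

    A score well above 1 means the candidate is much more frequent in the
    domain-specific corpus than in general language.
    """
    dom = Counter(bigrams(domain_tokens))
    gen = Counter(bigrams(general_tokens))
    dom_total = sum(dom.values())
    gen_total = sum(gen.values())
    vocab = len(set(dom) | set(gen))
    scores = {}
    for cand, count in dom.items():
        p_dom = (count + smoothing) / (dom_total + smoothing * vocab)
        p_gen = (gen[cand] + smoothing) / (gen_total + smoothing * vocab)
        scores[cand] = p_dom / p_gen
    return scores


# Toy corpora, invented for illustration only.
domain = "the rotator blade of the blue helicopter and the rotator blade".split()
general = "the blue helicopter and the blue car of the man".split()
scores = term_scores(domain, general)
```

On these toy corpora, "rotator blade" scores above 1 and "blue helicopter" scores below 1, which is the separation between terminology and general language that the second phase is after.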

Bilingual Dictionaries
Extracting bilingual information
– Easiest to extract if we have a parallel corpus
– This consists of text in one language and the translation of the text in another language
– Given such a resource, we can extract bilingual dictionaries
Mostly used for machine translation, cross-lingual retrieval and other natural language processing applications
But also useful for human lexicographers and linguists

Parallel corpus
Example from DE-News (8/1/1996)
English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform
English: The discussion around the envisaged major tax reform continues.
German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an.
English: The FDP economics expert, Graf Lambsdorff, today came out in favor of advancing the enactment of significant parts of the overhaul, currently planned for 1999.
German: Der FDP-Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus, wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen.
Modified from Dorr, Monz

Availability of parallel corpora
European documents
– 24 languages of the EU (Croatian is the most recent one)
– For two European languages (e.g., English and German), European documents such as the proceedings of the European Parliament are often used
United Nations documents
– Official UN languages: Arabic, Chinese, English, French, Russian, Spanish
– For any two of the 6 United Nations languages, we can obtain large amounts of parallel UN documents
For other language pairs (e.g., German and Russian), it can be problematic to get parallel data

Alex Fraser, IMS Stuttgart, AMTA 2006: Overview of Statistical MT
Most statistical machine translation research has focused on a few high-resource languages (European, Chinese, Japanese, Arabic).
[Figure: Approximate Parallel Text Available (with English)]
– French, Arabic, Chinese (~200M words)
– Various Western European languages (German, Spanish, Finnish, …): parliamentary proceedings, govt documents (~30M words)
– Bible/Koran/Book of Mormon/Dianetics (~1M words)
– Nothing/Univ. Decl. of Human Rights (~1K words)
Languages shown further down the scale: Uzbek, Serbian, Kasem, Chechen, Tamil, Pwo
Modified from Schafer & Smith

Document alignment
In the collections we have mentioned, the document alignment is given
– We know which documents contain the proceedings of the UN General Assembly from Monday June 1st at 9am in all 6 languages
It is also possible to find parallel web documents using cross-lingual information retrieval techniques
Once we have the document alignment, we first need to "sentence align" the parallel documents

Sentence alignment
If document D_e is a translation of document D_f, how do we find the translation for each sentence?
The n-th sentence in D_e is not necessarily the translation of the n-th sentence in D_f
In addition to 1:1 alignments, there are also 1:0, 0:1, 1:n, and n:1 alignments
In European Parliament proceedings, approximately 90% of the sentence alignments are 1:1
Modified from Dorr, Monz

Sentence alignment
There are several sentence alignment algorithms:
– Align (Gale & Church): Aligns sentences based on their character length (shorter sentences tend to have shorter translations than longer sentences). Works well
– Char-align (Church): Aligns based on shared character sequences. Works fine for similar languages or technical domains
– K-Vec (Fung & Church): Induces a translation lexicon from the parallel texts based on the distribution of foreign-English word pairs
– Cognates (Melamed): Uses positions of cognates (including punctuation)
– Length + Lexicon (Moore; Braune and Fraser): Two passes, high accuracy, freely available
Modified from Dorr, Monz
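The length-based idea can be sketched with dynamic programming. This is a simplified illustration in the spirit of length-based alignment, not the Gale & Church algorithm itself: the cost is the absolute difference in character length plus a fixed penalty for each bead that is not 1:1, whereas the published method uses a probabilistic model of length ratios. The function name and both penalty values are assumptions.

```python
def align_by_length(src, tgt, skip_penalty=10, merge_penalty=5):
    """Align two lists of sentences using character lengths only.

    Returns a list of beads (source_indices, target_indices).
    Supported bead types: 1:1, 1:0, 0:1, 2:1 and 1:2.
    """
    beads = [(1, 1, 0), (1, 0, skip_penalty), (0, 1, skip_penalty),
             (2, 1, merge_penalty), (1, 2, merge_penalty)]
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                new_cost = cost[i][j] + abs(len_src - len_tgt) + penalty
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the best path.
    result = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    result.reverse()
    return result


# Toy example, invented for illustration: the second English sentence
# is split into two German sentences.
english = ["Hello world.", "This is a long sentence about tax reform.", "Bye."]
german = ["Hallo Welt.", "Dies ist ein langer Satz", "ueber die Steuerreform.", "Tschuess."]
alignment = align_by_length(english, german)
```

In this toy example the cheapest path pairs the long English sentence with the two German sentences (a 1:2 bead) and aligns the rest 1:1.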

Word alignments
Given a parallel sentence pair, we can link (align) words or phrases that are translations of each other
Modified from Dorr, Monz

Word alignment is annotation of minimal translational correspondences
Annotated in the context in which they occur
Not idealized translations!
(solid blue lines annotated by a bilingual expert)

Automatic word alignments are typically generated using a model called IBM Model 4
No linguistic knowledge
No correct alignments are supplied to the system
Unsupervised learning
(red dashed line = automatically generated hypothesis)
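To make the unsupervised learning concrete, here is a minimal sketch of expectation maximization for IBM Model 1, a much simpler relative of Model 4 (Model 1 ignores word positions and fertility, and this sketch also omits the NULL word). It is swapped in purely for illustration; the function names and the three-sentence toy corpus are assumptions.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(e, f)] = P(e | f) from sentence pairs with EM.

    pairs: list of (foreign_tokens, english_tokens).
    No alignments are supplied; they are inferred as fractional counts.
    """
    english_vocab = {e for _, es in pairs for e in es}
    t = defaultdict(lambda: 1.0 / len(english_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each English word over the foreign words.
        for fs, es in pairs:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: renormalize the fractional counts.
        t = defaultdict(float)
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t


def best_links(fs, es, t):
    """Link each English position to its most probable foreign position."""
    return [(i, max(range(len(fs)), key=lambda j: t[(e, fs[j])]))
            for i, e in enumerate(es)]


# Toy corpus for illustration only.
corpus = [("das Haus".split(), "the house".split()),
          ("das Buch".split(), "the book".split()),
          ("ein Buch".split(), "a book".split())]
t = train_model1(corpus)
links = best_links(["das", "Haus"], ["the", "house"], t)
```

Even though no sentence pair says which word translates which, the co-occurrence pattern across sentences pushes probability toward "das"/"the" and "Haus"/"house".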

Uses of Word Alignment
Multilingual
– Machine Translation
– Cross-Lingual Information Retrieval
– Translingual Coding (Annotation Projection)
– Document/Sentence Alignment
– Extraction of Parallel Sentences from Comparable Corpora
Monolingual
– Paraphrasing
– Query Expansion for Monolingual Information Retrieval
– Summarization
– Grammar Induction

Extracting Word-to-Word Dictionaries
Given a word-aligned corpus, we can extract word-to-word dictionaries
For example, we look at all links to "das"
If there are 1000 links to "das", and 700 of them are from "the", then the pair "das"/"the" gets a score of 70%
Example from Koehn 2008
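The counting step can be sketched as follows. The data layout (token lists plus a set of index pairs per sentence) and the function name are assumptions for illustration; the score is the relative frequency described above.

```python
from collections import Counter, defaultdict


def extract_dictionary(aligned_corpus):
    """Build a word-to-word dictionary from a word-aligned corpus.

    aligned_corpus: iterable of (src_tokens, tgt_tokens, links), where
    links is a set of (src_index, tgt_index) pairs.
    Returns {src_word: {tgt_word: share of all links to src_word}}.
    """
    link_counts = defaultdict(Counter)
    for src, tgt, links in aligned_corpus:
        for i, j in links:
            link_counts[src[i]][tgt[j]] += 1
    dictionary = {}
    for src_word, counter in link_counts.items():
        total = sum(counter.values())
        dictionary[src_word] = {w: c / total for w, c in counter.items()}
    return dictionary


# Synthetic corpus: 10 links to "das", 7 of them from "the".
toy = ([(["das"], ["the"], {(0, 0)})] * 7
       + [(["das"], ["that"], {(0, 0)})] * 2
       + [(["das"], ["this"], {(0, 0)})])
entries = extract_dictionary(toy)
```

With 7 of 10 links going to "the", `entries["das"]["the"]` comes out as 0.7, mirroring the 700 out of 1000 example.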

Word-to-word dictionaries are useful
– For example, they are used to translate queries in cross-lingual retrieval
– Given the query "das Haus", the two query words are translated independently (we use all translations and the scores)
However, they are too simple to capture larger units of meaning: they link exactly one token to one token

"Phrase" dictionaries
Consider the links of two words that are next to each other in the source language
The links to these two words are often next to each other in the target language too
If this is true, we can extract a larger unit, relating two words in the source language to two words in the target language
We call these "phrases"
– WARNING: we may extract linguistic phrases, but much of what we extract is not a linguistic phrase!

Slide from Koehn 2008
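A minimal sketch of extracting such units from one word-aligned sentence pair, generalized from two words to spans of up to `max_len` words. It uses the usual consistency check (no word inside the target span may be linked to a source word outside the source span) and, as a simplification, does not extend phrases over unaligned words. The function name and data layout are assumptions.

```python
def extract_phrases(src, tgt, links, max_len=3):
    """Return the set of (source phrase, target phrase) pairs that are
    consistent with the word alignment.

    links: set of (src_index, tgt_index) pairs.
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            tgt_pos = [j for i, j in links if s_start <= i <= s_end]
            if not tgt_pos:
                continue
            t_start, t_end = min(tgt_pos), max(tgt_pos)
            if t_end - t_start + 1 > max_len:
                continue
            # Reject the span if a target word inside it is linked to a
            # source word outside the source span.
            inconsistent = any(
                t_start <= j <= t_end and not (s_start <= i <= s_end)
                for i, j in links
            )
            if inconsistent:
                continue
            pairs.add((" ".join(src[s_start:s_end + 1]),
                       " ".join(tgt[t_start:t_end + 1])))
    return pairs


# Monotone toy example: neighbouring links give two-word phrases.
mono = extract_phrases(["das", "Haus", "ist", "klein"],
                       ["the", "house", "is", "small"],
                       {(0, 0), (1, 1), (2, 2), (3, 3)}, max_len=2)

# Toy example where "hat ... gesehen" both link to "saw".
split = extract_phrases(["hat", "es", "gesehen"], ["saw", "it"],
                        {(0, 0), (1, 1), (2, 0)})
```

In the second example neither "hat" nor "gesehen" can be extracted alone, because "saw" is also linked to the other word; only "es | it" and the full "hat es gesehen | saw it" survive the consistency check.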

Using phrase dictionaries
The dictionaries we extract like this are the key technology behind statistical machine translation systems
– Google Translate, for instance, uses phrase dictionaries for many language pairs
There are further generalizations of this idea
– We can introduce gaps in the phrases
  Like: "hat GAP gemacht | did GAP"
  The gaps are processed recursively
– We can label the rules (and gaps) with syntactic constituents to try to control what goes inside the gap
  Like: S/S -> "NP hat es gesehen | NP saw it"

Slide from Koehn 2008

Decoding
Goal: find the best target translation of a source sentence
Involves search
– Find maximum probability path in a dynamically generated search graph
Generate English string, from left to right, by covering parts of Foreign string
– Generating English string left to right allows scoring with the n-gram language model
Here is an example of one path
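A heavily simplified sketch of this search, assuming monotone decoding (source phrases are covered strictly left to right, with no reordering), a bigram language model, and no pruning. Real decoders also allow reordering and use beam search. The phrase table, the language model values, and all names below are toy assumptions.

```python
import math


def decode_monotone(src, phrase_table, bigram_logprob, max_len=3):
    """Find the best-scoring translation under monotone phrase coverage.

    chart[i] maps the last English word produced so far to the best
    (score, english_words) covering the first i source words. Keeping only
    the last word is enough because the language model is a bigram model.
    """
    chart = [dict() for _ in range(len(src) + 1)]
    chart[0]["<s>"] = (0.0, [])
    for i in range(len(src)):
        for last, (score, words) in chart[i].items():
            for k in range(i + 1, min(i + max_len, len(src)) + 1):
                for tgt, tm_logprob in phrase_table.get(tuple(src[i:k]), []):
                    new_score = score + tm_logprob
                    prev = last
                    # Score the new English words left to right.
                    for w in tgt:
                        new_score += bigram_logprob(prev, w)
                        prev = w
                    best = chart[k].get(prev)
                    if best is None or new_score > best[0]:
                        chart[k][prev] = (new_score, words + list(tgt))
    if not chart[-1]:
        return None
    return max(chart[-1].values(), key=lambda item: item[0])


# Toy models, invented for illustration only.
toy_table = {
    ("das",): [(("the",), math.log(0.7)), (("that",), math.log(0.3))],
    ("Haus",): [(("house",), math.log(0.8)), (("home",), math.log(0.2))],
    ("das", "Haus"): [(("the", "house"), math.log(0.5))],
}


def toy_lm(prev, word):
    likely = {("<s>", "the"), ("the", "house")}
    return math.log(0.5) if (prev, word) in likely else math.log(0.1)


best_score, best_words = decode_monotone(["das", "Haus"], toy_table, toy_lm)
```

Each entry in `chart` corresponds to a node in the search graph, and extending an entry with a phrase corresponds to following one edge of a path like the one in the example.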

Slide from Koehn 2008

More on Statistical Machine Translation
I teach a course on Statistical Machine Translation; I am not sure when it will be offered next
Other resources:
– Philipp Koehn's book
– Kevin Knight's tutorial on word alignment is long, but it is good!

Extracting Multilingual Information
Word-aligned parallel corpora are one valuable source of bilingual information
Other interesting multilingual extraction tasks include:
– Translating words such as names between scripts ("transliteration")
– Extracting the translations of technical terminology from comparable corpora
– Extracting parallel sentences (or smaller units) from comparable corpora
– Projecting linguistic annotation (such as syntactic treebank annotation) from one language to another

Slide sources
The slides today are mostly from Philipp Koehn's course Statistical Machine Translation and original slides I created (but see also attributions on individual slides)

Thank you for your attention!