Bilingual Lexical Acquisition From Comparable Corpora Andrea Mulloni.

The Scenario
- Lexicons are the bread and butter of many NLP areas
- So far, techniques for automatically acquiring bilingual lexical data have relied mainly on parallel corpora
- Availability of parallel corpora is limited
- Purely statistical approaches applied to comparable corpora have failed to produce consistent results
- Results must hold for a wide range of text types
- Need for a robust and extensible method

The Motivation
Investigate further exploitation of comparable corpora for lexical acquisition purposes by:
- Employing a hybrid approach (statistical + rule-based)
- Drawing upon recent developments in the area of monolingual lexical acquisition (WSD, named entity recognition, term extraction)
New methodology for bilingual lexicon acquisition from comparable corpora (BLACC)

The Present – Methodology (1)
[Diagram]
- Preprocessing, applied to each language (L1 and L2): tokenization, POS tagging, term extraction, lemmatization
- Statistical methods: co-occurrence similarity, context heterogeneity, other
- Rule-based methods: cognate matching, ????
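The slide names cognate matching as a rule-based method but does not say how it is computed. The following is only a minimal sketch, assuming a plain string-similarity score from Python's standard `difflib` module stands in for cognate matching; the function names `cognate_score` and `top_candidates` and the example words are hypothetical and do not come from the presentation.

```python
from difflib import SequenceMatcher


def cognate_score(a: str, b: str) -> float:
    """Return a similarity in [0, 1] based on shared character blocks.

    Assumption: orthographic similarity is used as a stand-in for
    cognate matching; the slides do not specify the actual rules.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def top_candidates(source_term, target_terms, n=5):
    """Rank target-language terms by similarity to the source term."""
    scored = [(t, cognate_score(source_term, t)) for t in target_terms]
    # Highest score first; keep only the n best candidates.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


if __name__ == "__main__":
    # Hypothetical English term and Italian candidate terms.
    for term, score in top_candidates("information",
                                      ["informazione", "casa", "tavolo"]):
        print(f"{term}\t{score:.2f}")
```

A score of this kind only captures surface resemblance, so it would be expected to help mainly for related languages that share a script.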

The Present – Methodology (2)
[Diagram]
- Statistical methods: top 5 translation candidates
- Rule-based methods: top 5 translation candidates
- Comparison = reranking (weights stats/rule-based to be defined): top 5 translation candidates
- Lexicon L1 – L2
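The comparison = reranking step can be illustrated with a small sketch. The slide states that the weights between statistical and rule-based evidence are still to be defined, so everything below is an assumption: a simple weighted sum, placeholder weights of 0.5 each, scores already normalized to [0, 1], and hypothetical names (`rerank`, `stat_scores`, `rule_scores`).

```python
def rerank(stat_scores, rule_scores, w_stat=0.5, w_rule=0.5, n=5):
    """Merge two candidate lists into one ranked top-n list.

    stat_scores / rule_scores map a candidate translation to a score
    in [0, 1]. A candidate missing from one list contributes 0.0 for
    that method. The linear combination and the default weights are
    illustrative assumptions, not the presenter's definition.
    """
    candidates = set(stat_scores) | set(rule_scores)
    combined = {
        c: w_stat * stat_scores.get(c, 0.0) + w_rule * rule_scores.get(c, 0.0)
        for c in candidates
    }
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    # Hypothetical scores for candidate translations of one source term.
    stat = {"casa": 0.9, "cosa": 0.4}
    rule = {"cosa": 0.8, "caso": 0.7}
    for candidate, score in rerank(stat, rule):
        print(f"{candidate}\t{score:.2f}")
```

Changing `w_stat` and `w_rule` shifts how much each method influences the final list, which is the parameter the slide leaves open.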

The Past – Previous Work

Lexical Acquisition From Parallel Corpora
Statistical
- Co-Occurrence Frequencies + Length or Positional Statistics: Dagan et al. (1993), Kupiec (1993), Smadja & McKeown (1994), Kumano & Hirakawa (1994), Wu & Xia (1994)
- LA for Machine Translation: Sato & Nagao (1990), Brown et al. (1993), Melamed (1997)
- Concordancing: Gale & Church (1991), Catizone et al. (1993)
- Tools for Translators: Melamed (1996)

Lexical Acquisition From Comparable Corpora
Statistical
- Use of Multilingual Thesauri: Dejean et al. (2002)
- Co-occurrence Assumption: Fung & Church (1994), Rapp (1995), Rapp (1997), Fung & McKeown (1997), Fung & Yee (1998)
- Positional Difference Vector: Fung & McKeown (1994)
- Context Heterogeneity: Fung (1995)
Rule-Based
- Cognates: Bourigault (1992), Ananiadou (1994), Jacquemin & Royaute (1994), Dagan & Church (1995), Oueslati et al. (1996), Koehn & Knight (2002)
- Context and Semantic Information: Lauriston (1996), Dubuc & Lauriston (1997)

The Future – Way Ahead
- Consider modeling procedures on the basis of a parallel corpus
- Implement the possibility of exploiting already available tagged corpora
- Analyse the possible application of clustering procedures to reduce polysemy
- Investigate the etymological issue (closest common root) to bridge the gap between distant languages