Alinea: a language-independent tool for bi-text processing. Jean-Louis Duchet, Olivier Kraif. Ispra 2005 – JRC Workshop.


Introduction
Outline: Introduction, Application, Features, I/O formats, Language specificities
– Alinea is an alignment tool that uses language-independent techniques.
– Alinea has obtained good results on closely related language pairs: EN, FR, ES, IT, …
-> Is it possible to use it for languages that are further apart?
-> What kind of tuning is involved when dealing with a new language pair?
-> What kind of language-specific knowledge could be used to improve the results?

A corpus-based bilingual dictionary
– Corpus being scanned: Ismail Kadare's works, published in both languages in Paris (Ed. Fayard); other sources: IIRCA (International Initiative for a Reference Corpus of Albanian)
– Indexing is used to retrieve word forms not yet recorded in dictionaries
– Concordancing is used to enlarge the phraseological content of the dictionary
– Aligned concordancing is used to correlate word senses in context in the two languages

Dictionary in the making: sample

Items not yet recorded
Examples under the letter DH:
– dhimbje (var.)
– dhimbsur (var.)
– dhimbsuri (var.)
– dhjavolos (loanword)
– dhjetra
– dhrahmi
– dhëmbëjashtë (comp.)
Why?
– variants
– foreign loanwords
– local colour terms
– compounds

Specific features of the language pair
– The Albanian « phonetic principle »: Albanian spelling respells foreign words as they are pronounced: shofer/chauffeur, konti/comte, including proper nouns: Nju-Jork/New York, Ballkan/Balkans
– The French graphemic preservation principle: French keeps the original spelling of foreign names: Gjergj Balsha, Gjin Bue Shpata

French-Albanian stoplist
Stoplist based on most frequent words:
as    mes    plot  ta
ç'    midis  por   tate
dot   mu     re    tri
fort  na     ri    ti
je    ne     sa    tua
le    ose    se    tue
ma    para   si    vend
me    pas    sot
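
A frequency-based stoplist of this kind can be approximated in a few lines. The sketch below is only an illustration: the function name `frequency_stoplist`, the tokenising regular expression and the `size` parameter are assumptions, not part of Alinea.

```python
import re
from collections import Counter

def frequency_stoplist(text, size=30):
    """Return the `size` most frequent word forms of `text` (hypothetical helper).

    Tokens are runs of letters, optionally followed by an apostrophe,
    so that elided forms such as "ç'" are kept as a single token.
    """
    tokens = re.findall(r"[^\W\d_]+['’]?", text.lower())
    return [word for word, _ in Counter(tokens).most_common(size)]

# Toy usage on a tiny Albanian-like sample (not real corpus data).
sample = "të dhe të në dhe të mal"
print(frequency_stoplist(sample, size=2))
```

In practice the candidate list would still be reviewed by hand, since frequent content words should stay out of the stoplist.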

Albanian alphabetical order
A, B, C, Ç, D, DH, E, Ë, F, G, GJ, H, I, J, K, L, LL, M, N, NJ, O, P, Q, R, RR, S, SH, T, TH, U, V, X, XH, Y, Z, ZH
– 36 letters: 29 consonants, 7 vowels
– 9 digraphs and 2 letters with diacritics count as separate graphemic units
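
Because the digraphs count as letters, a plain code-point sort puts Albanian entries in the wrong order (dhomë would come before dje). The sketch below builds a sort key from the 36-letter alphabet above. The name `albanian_key` and the greedy reading of every d+h, g+j, … sequence as a digraph are assumptions made for illustration; a real dictionary tool may need exceptions.

```python
import unicodedata

ALPHABET = [
    "a", "b", "c", "ç", "d", "dh", "e", "ë", "f", "g", "gj", "h", "i", "j",
    "k", "l", "ll", "m", "n", "nj", "o", "p", "q", "r", "rr", "s", "sh", "t",
    "th", "u", "v", "x", "xh", "y", "z", "zh",
]
RANK = {letter: position for position, letter in enumerate(ALPHABET)}

def albanian_key(word):
    """Sort key: list of letter ranks, reading digraphs greedily (assumption)."""
    w = unicodedata.normalize("NFC", word.lower())
    key, i = [], 0
    while i < len(w):
        pair = w[i:i + 2]
        if len(pair) == 2 and pair in RANK:   # digraph such as "dh" or "zh"
            key.append(RANK[pair])
            i += 2
        else:                                  # single letter; unknown characters sort last
            key.append(RANK.get(w[i], len(ALPHABET)))
            i += 1
    return key

words = ["dy", "dhomë", "zhurmë", "dje", "zog", "çaj", "cak"]
print(sorted(words, key=albanian_key))
```

With this key, all words in D precede all words in DH, and Z precedes ZH, as the alphabet requires.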

Alinea features
Aligning in three steps:
– Anchor point extraction
– Full sentence alignment
– Lexical correspondence extraction

Alinea features
Step 1: Anchor point extraction
– Relies on identical strings (transparent words, Fr. transfuges): numbers, proper nouns and other such strings.
– Implements a "safest clues first" heuristic within an iterative framework
– Usually yields precision close to 100% and recall over 10%.
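
The idea of Step 1 can be sketched as follows. This is not Alinea's implementation: the function names, the choice of clues (digits and capitalised tokens), the restriction to strings found in exactly one sentence on each side, and the greedy "most shared strings first" ordering are all assumptions chosen to keep the example short.

```python
import re
from collections import Counter, defaultdict

def clue_tokens(sentence):
    """Numbers and capitalised words: strings likely to survive translation unchanged."""
    return {t for t in re.findall(r"\w+", sentence) if t.isdigit() or t[0].isupper()}

def index_unique(sentences):
    """Map each clue token to its sentence index, keeping tokens seen in one sentence only."""
    where = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for token in clue_tokens(sentence):
            where[token].add(i)
    return {t: next(iter(idx)) for t, idx in where.items() if len(idx) == 1}

def anchor_points(src, tgt):
    """Greedy anchor selection: pairs backed by the most shared strings are accepted first."""
    src_idx, tgt_idx = index_unique(src), index_unique(tgt)
    votes = Counter((src_idx[t], tgt_idx[t]) for t in src_idx.keys() & tgt_idx.keys())
    anchors = []
    for (i, j), _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0])):
        # Accept a pair only if it does not cross any anchor already accepted.
        if all((i - a) * (j - b) > 0 for a, b in anchors):
            anchors.append((i, j))
    return sorted(anchors)

src = ["In 1999 Anna moved to Paris.", "She was happy.", "Then Boris called in 2003."]
tgt = ["Anna a déménagé à Paris en 1999.", "Elle était heureuse.", "Puis Boris a appelé en 2003."]
print(anchor_points(src, tgt))
```

Sentences without any shared string get no anchor, which is consistent with high precision and low recall at this stage; the remaining sentences are left to the full alignment step.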

Alinea features
– After identical strings, cognate pairs can be used to supply further anchor points.
French: Il y avait plusieurs années qu ' on avait planté de tels écriteaux un peu partout, non seulement dans les possessions de notre seigneur, le comte Stres des Gjika, ou Stres Gjikondi, mais aussi plus loin, au - delà des frontières de l ' État d ' Arberie, dans les autres contrées des Balkans.
Albanian: Ka shumë vite që kësi pllakash janë venë kudo dhe jo vetëm në viset e kryezotit tonë, kontit Stres të Gjikëve, ose Stres Gjikondit, siç e thërresin shkurt, por edhe më tutje, madje edhe përtej kufijve të shtetit të Arbrit, në pjesët e tjera të gadishullit.
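
A very simple cognate test can be tried on the names in this sentence pair. The sketch below uses a shared four-character prefix after removing diacritics; this criterion, the `prefix_len` parameter and the function names are assumptions for illustration and are not claimed to be the measure Alinea uses.

```python
import unicodedata

def fold(word):
    """Lower-case and strip diacritics so that ë/e or é/e compare as equal."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def is_cognate(a, b, prefix_len=4):
    """Prefix heuristic: short words must be identical, longer ones share a prefix."""
    a, b = fold(a), fold(b)
    if len(a) < prefix_len or len(b) < prefix_len:
        return a == b
    return a[:prefix_len] == b[:prefix_len]

def cognate_pairs(src_tokens, tgt_tokens):
    """All (source, target) token pairs that pass the prefix test."""
    return [(s, t) for s in src_tokens for t in tgt_tokens if is_cognate(s, t)]

french = ["comte", "Stres", "Gjika", "Gjikondi", "Arberie"]
albanian = ["kontit", "Stres", "Gjikëve", "Gjikondit", "Arbrit"]
for pair in cognate_pairs(french, albanian):
    print(pair)
```

The test links Gjika with Gjikëve and Gjikondi with Gjikondit despite the case endings, but it misses comte/kontit, where the phonetic respelling changes the first letter, and it also proposes Gjika/Gjikondit. A graded similarity score and position constraints would be needed to handle such cases.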

Alinea features
Step 2: Full alignment computation
– Extracts a sequence of sentence groupings: (1-0) (0-1) (1-1) (1-2) (2-1) (1-3) (3-1) …
– Uses a combination of various clues:
  sentence lengths (Gale & Church, 1992)
  cognateness (Simard, 1992)
  word-to-word correspondences (requires training on a large corpus)
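
The length clue alone is enough to illustrate how a sequence of groupings is computed by dynamic programming. The sketch below is a deliberately simplified stand-in, not the Gale & Church probabilistic model and not Alinea's scoring: the grouping penalties in `BEADS`, the `length_cost` formula and the `ratio` parameter are hypothetical, and only five grouping types are considered.

```python
import math

# Grouping types (source count, target count) with hypothetical fixed penalties.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}

def length_cost(len_s, len_t, ratio=1.0):
    """Penalty that grows with the gap between observed and expected target length."""
    expected = len_s * ratio
    return abs(len_t - expected) / math.sqrt(max(expected + len_t, 1))

def align(src, tgt, ratio=1.0):
    """Return the cheapest sequence of groupings as (source indices, target indices)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])   # lengths in characters
                len_t = sum(len(t) for t in tgt[j:nj])
                total = cost[i][j] + penalty + length_cost(len_s, len_t, ratio)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    groups, i, j = [], n, m
    while (i, j) != (0, 0):                              # trace the best path back
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return groups[::-1]

src = ["a" * 10, "b" * 20, "c" * 4]
tgt = ["A" * 10, "B" * 10, "B" * 10, "C" * 4]
print(align(src, tgt))
```

In a fuller system the cell cost would combine this length term with cognate and lexical clues, and the search would be restricted to the corridor between consecutive anchor points.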

Alinea features
Step 3: Lexical correspondence extraction
– Extracts word-to-word correspondences (except for words in the stoplist)
– Requires a large amount of parallel text (> words) in order to compute reliable statistics
– Takes into account a combination of clues:
  word positions
  cognateness
  distributions across the training corpus
-> Has achieved more than 90% precision and recall on a literary corpus (Kraif & Chen, Coling 2004)
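
The distributional clue can be illustrated with a plain co-occurrence score over aligned sentence pairs. The sketch below uses the Dice coefficient, which is a generic association measure swapped in for brevity; it is not the null hypothesis approach of Kraif & Chen (2004). The function name `dice_lexicon`, the `min_score` threshold and the toy sentence pairs are assumptions.

```python
from collections import Counter
from itertools import product

def dice_lexicon(pairs, stoplist=frozenset(), min_score=0.5):
    """Score word pairs by Dice coefficient over aligned sentence pairs."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in pairs:
        src_words = {w for w in src_sent.lower().split() if w not in stoplist}
        tgt_words = {w for w in tgt_sent.lower().split() if w not in stoplist}
        src_freq.update(src_words)                  # sentence-level frequencies
        tgt_freq.update(tgt_words)
        joint.update(product(src_words, tgt_words))  # co-occurrence counts
    scores = {}
    for (s, t), n in joint.items():
        score = 2 * n / (src_freq[s] + tgt_freq[t])
        if score >= min_score:
            scores[(s, t)] = score
    return scores

# Toy aligned pairs, invented for the example.
pairs = [
    ("le chat dort", "macja fle"),
    ("le chien dort", "qeni fle"),
    ("le chat mange", "macja ha"),
]
lexicon = dice_lexicon(pairs, stoplist={"le"})
for (s, t), score in sorted(lexicon.items(), key=lambda kv: -kv[1]):
    print(s, t, round(score, 2))
```

With only three sentence pairs several wrong pairs still score well, which shows why a large training corpus and additional clues (positions, cognateness) are needed for reliable statistics.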

3 steps
I. Anchor points
II. Full alignment
III. Lexical correspondences

Bi-text browsing and editing

Input / output formats
Input files
– raw texts (ISO Latin-1, UTF-8)
– cesAna texts with sentence segmentation
– XML-tagged texts
– cesAlign
Output files
– KWIC
– aligned raw texts
– cesAlign
– HTML

Alinea features
Bilingual concordancer
– Implements queries using XML tags and regular expressions at token level.
– Example (using tagged corpora): searching for the verb être used as an auxiliary and followed by a past participle (French passé composé): <>?
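
The kind of token-level query described here can be imitated with an ordinary regular expression over tagged tokens. Everything in the sketch below is hypothetical: the `form|lemma|pos` encoding, the `VPP` tag for past participles and the function names are invented for the example and do not reproduce Alinea's query syntax.

```python
import re

def encode(tokens):
    """Serialise (form, lemma, pos) triples as space-separated 'form|lemma|pos' items."""
    return " ".join(f"{form}|{lemma}|{pos}" for form, lemma, pos in tokens)

# A token whose lemma is "être", immediately followed by a token tagged VPP.
PATTERN = re.compile(r"(?<!\S)(\S+)\|être\|\S+ (\S+)\|\S+\|VPP(?!\S)")

def find_etre_participle(tokens):
    """Return (auxiliary form, participle form) pairs found in a tagged sentence."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(encode(tokens))]

sentence = [
    ("elle", "elle", "PRO"),
    ("est", "être", "V"),
    ("partie", "partir", "VPP"),
    ("hier", "hier", "ADV"),
]
print(find_etre_participle(sentence))
```

In a bilingual concordancer each hit would then be displayed next to the target sentence it is aligned with.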

Alinea features

Language-specific knowledge
Minimal tuning
– for each language pair: the average sentence length ratio
Language-specific knowledge is optional
– stoplists to eliminate function words and false friends (faux-amis)
– occurrence/co-occurrence statistics for lexical correspondence extraction
– forthcoming: bilingual lexicon
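
The one tuning value mentioned here, the average sentence length ratio, can be estimated from a small sample of sentence pairs already known to be aligned. The sketch assumes lengths are counted in characters and that the ratio is target over source; the function name `length_ratio` is hypothetical.

```python
def length_ratio(aligned_pairs):
    """Average target/source length ratio, in characters, over aligned sentence pairs."""
    src_total = sum(len(src) for src, _ in aligned_pairs)
    tgt_total = sum(len(tgt) for _, tgt in aligned_pairs)
    if src_total == 0:
        raise ValueError("the sample contains no source text")
    return tgt_total / src_total

sample = [("abcd", "ab"), ("abcdef", "abc")]
print(length_ratio(sample))
```

A value estimated this way could serve as the expected-length factor when comparing sentence lengths for a new language pair.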

References about Alinea
– Kraif O., Chen B. (2004) Combining clues for lexical level aligning using the Null hypothesis approach, in Proceedings of Coling 2004, Geneva, August 2004, pp
– Kraif O. (2001) Exploitation des cognats dans les systèmes d’alignement bi-textuel : architecture et évaluation, TAL 42:3, ATALA, Paris, pp
– Kraif O. (2001) Constitution et exploitation de bi-textes pour l’Aide à la traduction, PhD dissertation, dir. by Henri Zinglé, Université de Nice Sophia Antipolis,
– Kraif O. (2000) Evaluation of statistical measures for automatic extraction of French-English bilingual lexicons, in Proceedings of Comlex 2000, Patras, Greece, September 2000, pp
Alinea is distributed freely for research purposes. Please contact :