Alinea: a language-independent tool for bi-text processing
Jean-Louis Duchet, Olivier Kraif
Ispra 2005 – JRC Workshop

Introduction
Alinea is an alignment tool that uses language-independent techniques.
Alinea has obtained good results on closely related language pairs: EN, FR, ES, IT, …
-> Is it possible to use it for languages that are further apart?
-> What kind of tuning is involved when dealing with a new language pair?
-> What kind of language-specific knowledge could be used to improve the results?
Outline: Introduction, Application, Features, I/O formats, Language specificities

A corpus-based bilingual dictionary
–Corpus being scanned: Ismail Kadare's works, published in both languages (French and Albanian) in Paris (Ed. Fayard); other sources: IIRCA (International Initiative for a Reference Corpus of Albanian)
–Indexing is used to retrieve word forms NOT yet recorded in dictionaries
–Concordancing is used to enlarge the phraseological content of the dictionary
–Aligned concordancing is used to correlate word senses in context in the two languages

Dictionary in the making: sample

Items not yet recorded
Examples under the letter DH:
–dhimbje (var.)
–dhimbsur (var.)
–dhimbsuri (var.)
–dhjavolos (loanword)
–dhjetra
–dhrahmi
–dhëmbëjashtë (comp.)
Why?
–variants
–foreign loanwords
–local colour terms
–compounds
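The indexing idea (retrieving word forms that no dictionary records yet) can be sketched as a filter over a corpus index. This is a minimal illustration, not Alinea's own indexer: the function name `unrecorded_forms`, the toy "sentences" (bare word strings, not corpus text) and the one-entry dictionary are all assumptions made for the example.

```python
import re
from collections import defaultdict

def unrecorded_forms(sentences, recorded):
    """Map each word form absent from `recorded` to the sentence numbers where it occurs."""
    index = defaultdict(list)
    for number, sentence in enumerate(sentences):
        # [^\W\d_]+ matches runs of letters, including ç and ë
        for form in re.findall(r"[^\W\d_]+", sentence.lower()):
            if form not in recorded:
                index[form].append(number)
    return dict(index)

# Toy data: bare word strings standing in for corpus sentences (hypothetical).
sentences = ["dhrahmi dhe dhimbje", "dhimbje"]
recorded = {"dhe"}  # hypothetical list of forms already in the dictionary
print(unrecorded_forms(sentences, recorded))
```

Keeping the sentence numbers alongside each form makes it easy to jump from a candidate entry to its contexts, which is what the concordancing step then exploits.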

Specific features of the language pair
–The Albanian "phonetic principle": Albanian script respells foreign words (shofer/chauffeur, konti/comte), including proper nouns (Nju-Jork/New York, Ballkan/Balkans)
–The French graphemic preservation principle: French keeps the original spelling of Albanian names (Gjergj Balsha, Gjin Bue Shpata)

French-Albanian stoplist
Stoplist based on most frequent words

as    mes    plot   ta
ç'    midis  por    tate
dot   mu     re     tri
fort  na     ri     ti
je    ne     sa     tua
le    ose    se     tue
ma    para   si     vend
me    pas    sot
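A frequency-based stoplist of this kind can be drafted with a simple word count. The sketch below is an assumption about how such a list might be produced, not the procedure actually used for the slide; `frequency_stoplist` and the sample sentence are invented for illustration, and a real list would still need manual review.

```python
import re
from collections import Counter

def frequency_stoplist(text, size):
    """Return the `size` most frequent word forms of `text`, most frequent first."""
    # Letters followed by an optional apostrophe, so elided forms such as ç' stay whole.
    tokens = re.findall(r"[^\W\d_]+'?", text.lower())
    return [form for form, _ in Counter(tokens).most_common(size)]

# Toy French input (hypothetical); a real stoplist would be computed on the whole corpus.
print(frequency_stoplist("le chat et le chien et le rat", 2))
```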

Albanian alphabetical order
A, B, C, Ç, D, DH, E, Ë, F, G, GJ, H, I, J, K, L, LL, M, N, NJ, O, P, Q, R, RR, S, SH, T, TH, U, V, X, XH, Y, Z, ZH
36 letters: 29 consonants and 7 vowels. The 9 digraphs and the 2 letters with diacritics each count as a separate graphemic unit.
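Because digraphs count as single letters, a plain character-by-character sort puts Albanian words in the wrong order (DH must follow every D-word). A minimal sketch of a sort key follows; `albanian_key` is a name chosen here, and the greedy digraph matching is a simplifying assumption that can misread two separate letters meeting at a compound boundary.

```python
import unicodedata

ALPHABET = ["a", "b", "c", "ç", "d", "dh", "e", "ë", "f", "g", "gj", "h",
            "i", "j", "k", "l", "ll", "m", "n", "nj", "o", "p", "q", "r",
            "rr", "s", "sh", "t", "th", "u", "v", "x", "xh", "y", "z", "zh"]
RANK = {letter: position for position, letter in enumerate(ALPHABET)}
DIGRAPHS = {letter for letter in ALPHABET if len(letter) == 2}

def albanian_key(word):
    """Turn a word into a list of letter ranks, reading digraphs greedily."""
    # NFC makes sure ç and ë are single precomposed characters.
    text = unicodedata.normalize("NFC", word).lower()
    key, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the alphabet sort after every letter.
            key.append(RANK.get(text[i], len(ALPHABET)))
            i += 1
    return key

print(sorted(["dhoma", "dy", "dua"], key=albanian_key))
```

With this key, `dua` and `dy` precede `dhoma`, whereas Python's default string sort would put `dhoma` first.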

Alinea features
Aligning in three steps
–Anchor point extraction
–Full sentence alignment
–Lexical correspondence extraction

Alinea features
Step 1 : Anchor point extraction
–Relies on identical chains, i.e. identical character strings (transparent words, Fr. transfuges): numbers, proper nouns and other such strings
–Implements a "safest clues first" heuristic within an iterative framework
–Usually yields precision close to 100% and recall over 10%
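The identical-chain idea can be illustrated with a much simplified, single-pass sketch. It is not Alinea's iterative "safest clues first" procedure: it only keeps numbers and capitalised tokens that occur in exactly one sentence on each side, then retains a monotonic sequence of sentence pairs. The function name and the toy sentences (invented, not taken from the corpus) are assumptions.

```python
import re
from collections import defaultdict

def identical_chain_anchors(src_sents, tgt_sents):
    """Return (source index, target index) pairs linked by an unambiguous identical token."""
    def index(sentences):
        positions = defaultdict(set)
        for number, sentence in enumerate(sentences):
            for token in re.findall(r"\w+", sentence):
                # Crude filter for "transparent" tokens: numbers and capitalised words.
                if token[0].isdigit() or token[0].isupper():
                    positions[token].add(number)
        return positions

    src_index, tgt_index = index(src_sents), index(tgt_sents)
    candidates = set()
    for token, src_pos in src_index.items():
        tgt_pos = tgt_index.get(token)
        # Safest clue: the token occurs in exactly one sentence on each side.
        if tgt_pos and len(src_pos) == 1 and len(tgt_pos) == 1:
            candidates.add((next(iter(src_pos)), next(iter(tgt_pos))))

    anchors = []
    for i, j in sorted(candidates):
        # Keep only pairs that preserve the order of both texts.
        if not anchors or (i > anchors[-1][0] and j > anchors[-1][1]):
            anchors.append((i, j))
    return anchors

# Invented toy sentences, for illustration only.
french = ["Le comte Stres arriva en 1380.", "Il pleuvait.", "Doruntine était revenue."]
albanian = ["Konti Stres mbërriti më 1380.", "Binte shi.", "Doruntine ishte kthyer."]
print(identical_chain_anchors(french, albanian))
```

The middle sentence receives no anchor, which mirrors the slide's point: anchors are precise but cover only a fraction of the sentences, leaving the rest to the full alignment step.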

Alinea features
–After identical chains, cognate pairs can be used to supply further anchor points
FR: Il y avait plusieurs années qu ' on avait planté de tels écriteaux un peu partout, non seulement dans les possessions de notre seigneur, le comte Stres des Gjika, ou Stres Gjikondi, mais aussi plus loin, au - delà des frontières de l ' État d ' Arberie, dans les autres contrées des Balkans.
SQ: Ka shumë vite që kësi pllakash janë venë kudo dhe jo vetëm në viset e kryezotit tonë, kontit Stres të Gjikëve, ose Stres Gjikondit, siç e thërresin shkurt, por edhe më tutje, madje edhe përtej kufijve të shtetit të Arbrit, në pjesët e tjera të gadishullit.
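One common way to score how alike two word forms are is the longest-common-subsequence ratio. The sketch below uses that measure as a stand-in, since the slides do not say which cognateness formula Alinea applies; the function names are chosen here for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def cognateness(a, b):
    """LCS length divided by the length of the longer word (0.0 to 1.0)."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

# Word pairs taken from the example sentences above.
for french, albanian in [("Gjikondi", "Gjikondit"), ("Gjika", "Gjikëve"), ("comte", "kontit")]:
    print(french, albanian, round(cognateness(french, albanian), 2))
```

Inflected proper nouns score high, while a phonetically respelled pair such as comte/kontit scores low, which suggests why purely graphemic clues are weaker for this language pair.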

Alinea features
Step 2 : Full alignment computation
–Extracts a sequence of sentence groupings: (1-0) (0-1) (1-1) (1-2) (2-1) (1-3) (3-1) …
–Uses a combination of various clues:
  sentence lengths (Gale & Church, 1992)
  cognateness (Simard, 1992)
  word-to-word correspondences (requires training on a large corpus)
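The sentence-length clue can be illustrated with a small dynamic-programming sketch. It is inspired by length-based alignment but is not the Gale & Church probabilistic model, nor Alinea's combination of clues: the cost function, the penalties and the restricted set of groupings are all hand-picked assumptions.

```python
import math

# Allowed groupings (source sentences, target sentences) with arbitrary penalties.
GROUPINGS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (1, 2): 1.0, (2, 1): 1.0}

def length_cost(src_len, tgt_len, ratio):
    """Scaled gap between the observed target length and the expected one."""
    return abs(tgt_len - src_len * ratio) / math.sqrt(src_len + tgt_len + 1)

def align_by_length(src_lens, tgt_lens, ratio=1.0):
    """Return the cheapest sequence of groupings as lists of sentence indices."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in GROUPINGS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + penalty + length_cost(
                    sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), ratio)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (di, dj)
    # Walk back from the end of both texts to recover the groupings.
    groups, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        groups.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    groups.reverse()
    return groups

# Toy sentence lengths in characters (hypothetical).
print(align_by_length([20, 50, 30], [21, 24, 27, 29]))
```

On the toy lengths, the second source sentence is paired with two target sentences, a (1-2) grouping. Anchor points from step 1 could be used to split the texts into smaller segments before running such a search.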

Alinea features
Step 3 : Lexical correspondence extraction
–Extracts word-to-word correspondences (except for words in the stoplist)
–Requires a large amount of parallel text (>500 000 words) in order to compute reliable statistics
–Takes into account a combination of clues:
  word positions
  cognateness
  distributions across the training corpus
-> Has obtained over 90% precision and recall on a literary corpus (Kraif & Chen, Coling 2004)
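The distributional clue can be illustrated with a co-occurrence score computed over aligned sentence pairs. The sketch uses the Dice coefficient as a simple stand-in; it is not the null-hypothesis method of Kraif & Chen (2004), and the function name, threshold and toy English-French data are assumptions.

```python
from collections import Counter
from itertools import product

def dice_lexicon(pairs, stoplist=frozenset(), min_score=0.5):
    """For each source word, return its best-scoring target word and the Dice score."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        # Count each word once per sentence, skipping stoplist entries.
        src_words = {w for w in src.lower().split() if w not in stoplist}
        tgt_words = {w for w in tgt.lower().split() if w not in stoplist}
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        joint.update(product(src_words, tgt_words))

    best = {}
    for (s, t), together in joint.items():
        score = 2 * together / (src_freq[s] + tgt_freq[t])
        if score >= min_score and score > best.get(s, (None, 0.0))[1]:
            best[s] = (t, score)
    return best

# Toy aligned pairs (hypothetical); real statistics need a large corpus.
pairs = [("the house", "la maison"), ("the dog", "le chien"), ("red house", "maison rouge")]
print(dice_lexicon(pairs, stoplist={"the", "la", "le"}))
```

On three sentence pairs the scores are of course meaningless as statistics, which is the point of the >500 000-word requirement stated above.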

3 steps
I. Anchor points
II. Full alignment
III. Lexical correspondences

Bi-text browsing and editing

Input / output format
Input files
–raw texts (ISO Latin-1, UTF-8)
–cesAna texts with sentence segmentation
–XML-tagged texts
–cesAlign
Output files
–KWIC
–aligned raw texts
–cesAlign
–HTML

Alinea features
Bilingual concordancer
–Implements queries using XML tags and regular expressions at token level
–Example (using tagged corpora): searching for the verb être used as an auxiliary followed by a past participle (French passé composé): <>?
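The kind of token-level search described here can be approximated in a few lines. This sketch does not reproduce Alinea's query syntax: the token representation as (form, lemma, tag) triples, the function name and the tag `VER:pper` (assumed here to mark past participles in the tagged corpus) are all assumptions.

```python
import re

def find_aux_participle(tokens, aux_lemma="être", participle_tag=r"VER:pper"):
    """Return (auxiliary form, participle form) pairs for adjacent matching tokens."""
    tag_pattern = re.compile(participle_tag)
    hits = []
    for i in range(len(tokens) - 1):
        form, lemma, _ = tokens[i]
        next_form, _, next_tag = tokens[i + 1]
        # The lemma test catches every inflected form of the auxiliary.
        if lemma == aux_lemma and tag_pattern.fullmatch(next_tag):
            hits.append((form, next_form))
    return hits

# Toy tagged sentence "elle est venue hier" (hypothetical annotation).
tokens = [("elle", "elle", "PRO:PER"), ("est", "être", "VER:pres"),
          ("venue", "venir", "VER:pper"), ("hier", "hier", "ADV")]
print(find_aux_participle(tokens))
```

The sketch only looks at immediately adjacent tokens, so an intervening adverb would be missed; a fuller query would allow optional tokens between the two.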

Language specific knowledge
Minimal tuning
–language pair -> average sentence length ratio
Language-specific knowledge is optional
–stoplists to eliminate function words and false friends (faux-amis)
–occurrence/co-occurrence statistics for lexical correspondence extraction
–forthcoming: bilingual lexicon
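The one tuning parameter named here, the average sentence length ratio, could be estimated from a small sample of already aligned sentences. The sketch below computes it as total target characters over total source characters, which is one possible definition among several; the function name and the sample are assumptions.

```python
def length_ratio(aligned_pairs):
    """Ratio of total target length to total source length, in characters."""
    src_total = sum(len(src) for src, _ in aligned_pairs)
    tgt_total = sum(len(tgt) for _, tgt in aligned_pairs)
    if src_total == 0:
        raise ValueError("the source side of the sample is empty")
    return tgt_total / src_total

# Toy sample (hypothetical): the target side is half as long as the source side.
print(length_ratio([("abcd", "ab"), ("abcdef", "abc")]))
```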

References about Alinea
–Kraif O., Chen B. (2004) Combining clues for lexical level aligning using the Null hypothesis approach, in Proceedings of Coling 2004, Geneva, August 2004, pp. 1261-1264.
–Kraif O. (2001) Exploitation des cognats dans les systèmes d’alignement bi-textuel : architecture et évaluation, TAL 42:3, ATALA, Paris, pp. 833-867.
–Kraif O. (2001) Constitution et exploitation de bi-textes pour l’Aide à la traduction, PhD dissertation, dir. by Henri Zinglé, Université de Nice Sophia Antipolis, http://www.u-grenoble3.fr/kraif
–Kraif O. (2000) Evaluation of statistical measures for automatic extraction of French-English bilingual lexicons, in Proceedings of Comlex 2000, Patras, Greece, 22-23 September 2000, pp. 134-144.
Alinea is distributed freely for research purposes. Please contact: kraif@u-grenoble3.fr
