Ch. Boitet, GETA, CLIPS. ACIDCA 2000, Monastir, 22-24/3/2000. Handling texts and corpuses in Ariane-G5, a complete environment for multilingual MT

Presentation transcript:

1 Handling texts and corpuses in Ariane-G5, a complete environment for multilingual MT
  Christian Boitet, GETA, CLIPS, IMAG, Grenoble
  ACIDCA 2000, Monastir, 21-24/3/2000

2 Outline
  Introduction
  Multilingual MT-R (for revisors): linguistic methodology & basic software
  Goals and linguistic methodology
  Ariane-G5, an MT shell for building multilingual MT-R systems
  What has been and is done with Ariane-G5: MT-R, MT-A (for authors), MT of speech
  Representation of input documents
  Structuration of corpuses
  Functionalities during processing

3 MULTILINGUAL MT-R: GOALS AND LINGUISTIC METHODOLOGY
  Produce RAW translation GOOD ENOUGH to be revised
  Specialize to SUBLANGUAGES and use MULTILEVEL TRANSFER (semantic + traces)
  HEURISTIC PROGRAMMING

4 MULTILINGUAL MT-R: BASIC DIAGRAM

5 Ariane-G5 ( ): structure

6 Ariane-G5: 2 specialized DB
  DB of lingware components
    Declaration of variables (= typed attributes), templates…
    Dictionaries
    Grammars (rules = transitions of abstract automata)
  DB of texts
    Corpuses
    Source texts
    Intermediate results
    Translations (± revisions)
  relative to variants =>

7 What has been and is done with Ariane-G5
  MT-R (for revisors)
    Large, operational systems: RU>FR, FR>EN
    Prototypes: EN>MY, TH, FR
    Lots of mockups
  MT-A (for authors)
    LIDIA mockups: FR>DE, EN, RU (adding CH)
  MT of speech (for task-oriented dialogues)
    CSTAR demo system (EN, DE, KR, IT, FR, JP)

8 MT-R examples of translation (1)
  French-English in aeronautics (before human revision)

9 MT-R examples of translation (2)

10 MT-A example of a disambiguation dialogue
  Le capitaine a rapporté des tasses et des assiettes bleues > The captain has brought back blue bowls and plates / bowls and blue plates
  Question 1
    O des tasses bleues et des assiettes bleues
    O des assiettes bleues et des tasses
  Question 2
    O capitaine de marine
    O capitaine d'aviation
    O capitaine d'artillerie
    O capitaine d'infanterie
    O capitaine de cavalerie
    O …

11 Interaction in source for the quality MT for all
  Example scenario: multilingual (UNL) tool
  [Diagram labels: tool; nicknames + language preferences; enconversion server; analysis server; interactive disambiguation server; decoding server; deconversion servers; addressees' servers]

12 Other future possibility: production of multilingual self-explaining documents

13 Speech Translation: advantages of an Interchange Format
  N target languages for the cost of one analysis
  Translating into one's language from N source languages with one generation
  Using the same generation to backgenerate
  [Diagram labels: Analysis into IF; IF; Backgeneration]
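
The cost argument on this slide can be made concrete with a small count. The sketch below assumes the usual comparison between one translation module per ordered language pair and one analyzer plus one generator per language around the IF; the function names are illustrative and do not come from the presentation.

```python
def direct_pair_modules(n_languages: int) -> int:
    """One translation module per ordered (source, target) pair."""
    return n_languages * (n_languages - 1)


def interchange_modules(n_languages: int) -> int:
    """One analyzer into IF plus one generator from IF per language."""
    return 2 * n_languages


if __name__ == "__main__":
    # Six languages, as in the CSTAR demo system (EN, DE, KR, IT, FR, JP).
    for n in (2, 3, 6):
        print(n, direct_pair_modules(n), interchange_modules(n))
```

Under these assumptions the two counts are equal for three languages, and the IF setup needs fewer modules from four languages onward.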

14 Interchange Format: example
  la semaine du 12 nous avons des chambres simples et doubles disponibles
  (English: "the week of the 12th we have single and double rooms available")
  give-information+availability+room(room-type=(single ; double), time=(week, md12))
  Dialogue act: give-information
  Concepts: +availability+room
  Arguments: (room-type=(single ; double), time=(week, md12))
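
The three-part reading of the IF expression (dialogue act, concepts, arguments) can be sketched as a small parser. This is only an illustration built from the single example on the slide: the grammar assumed here (parts joined by "+", then one parenthesized, comma-separated list of name=value arguments) is an assumption, and `parse_if` and `split_top_level` are hypothetical names, not part of the CSTAR software.

```python
def split_top_level(text: str, sep: str = ",") -> list:
    """Split text on sep, ignoring separators nested inside parentheses."""
    items, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        items.append(tail)
    return items


def parse_if(expr: str):
    """Return (dialogue_act, concepts, arguments) for one IF expression."""
    head, paren, rest = expr.strip().partition("(")
    dialogue_act, *concepts = head.strip().split("+")
    arguments = {}
    if paren:
        if not rest.endswith(")"):
            raise ValueError("unbalanced parentheses in IF expression")
        # Drop the closing parenthesis of the argument list, then split it.
        for item in split_top_level(rest[:-1]):
            name, _, value = item.partition("=")
            arguments[name.strip()] = value.strip()
    return dialogue_act, concepts, arguments


if __name__ == "__main__":
    example = ("give-information+availability+room"
               "(room-type=(single ; double), time=(week, md12))")
    print(parse_if(example))
```

Argument values are kept as raw strings here; interpreting "(single ; double)" as a set of alternatives is left out because the slide does not define that notation.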

15 Interface of CLIPS++ CSTAR-II demonstrator
  [Screen labels: Recognition; IF; Backgeneration (to check understanding); Generation]

16 Hardware architecture of the CLIPS++ CSTAR-II demonstrator
  [Diagram labels: F IF; Montpellier; Grenoble; RNIS (ISDN); Reco; Ethernet; Control, IF F; Synthesis; VCIU]

17 Steps in translating a text
  Build its hierarchical structure
    Chapters, sections, paragraphs, [sentences]
  Segment into translation units
    According to current length parameter [min..max]
  Translate each segment
    Adding segment results to text results for desired phases
  Revise (manually) the whole translations, keep the revisions
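
The segmentation step can be sketched as a greedy grouping of sentences into translation units bounded by a [min..max] length parameter. How Ariane-G5 actually measures length and applies the bounds is not stated on the slide; the sketch assumes length is counted in words, and `segment_units` is a hypothetical name.

```python
def segment_units(sentences, min_len, max_len):
    """Group consecutive sentences into translation units.

    A unit is closed as soon as it holds at least min_len words, and a
    sentence that would push the unit past max_len starts a new unit.
    A single sentence longer than max_len becomes a unit on its own.
    """
    units, current, size = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and size + n_words > max_len:
            units.append(current)
            current, size = [], 0
        current.append(sentence)
        size += n_words
        if size >= min_len:
            units.append(current)
            current, size = [], 0
    if current:
        units.append(current)
    return units


if __name__ == "__main__":
    text = ["One two three.", "Four five six seven.", "Eight.", "Nine ten."]
    for unit in segment_units(text, min_len=5, max_len=8):
        print(unit)
```

Keeping sentence boundaries intact inside each unit is a design choice of this sketch, so that segment results can later be appended to the text results in order.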

18 Representations of input documents
  3 main questions:
    how to represent the writing system,
    separate formatting tags from the text or not,
    how to handle non-textual elements (figures, icons, or formulas) contained in utterances
  Transliterations of textual elements
  Keeping formatting tags in the texts
  Non-textual elements

19 Transliterations of textual elements
  Facilitate string-matching operations
  Diminish the size of dictionaries
  Represent diacritics
  Make some processing easier for some tools
    kataba > ktb$aaa, katub > ktb$au- or ktb$-ua
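
The kataba > ktb$aaa example suggests a transliteration that writes the consonantal root first and the vowel pattern after a "$" separator. The sketch below reproduces the two forms ktb$aaa and ktb$au- under strong assumptions: the input is already romanized, the only vowels are a, i, u, each consonant is followed by at most one vowel, and "-" marks a missing vowel. The function name `root_and_pattern` is hypothetical, and the alternative form ktb$-ua is not covered.

```python
VOWELS = set("aiu")  # assumed short-vowel inventory of the romanization


def root_and_pattern(word: str) -> str:
    """Rewrite a romanized word as <consonants>$<vowel after each consonant>."""
    root, pattern = [], []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch in VOWELS:
            raise ValueError(f"vowel {ch!r} without a preceding consonant")
        root.append(ch)
        if i + 1 < len(word) and word[i + 1] in VOWELS:
            pattern.append(word[i + 1])
            i += 2
        else:
            pattern.append("-")  # no vowel after this consonant
            i += 1
    return "".join(root) + "$" + "".join(pattern)


if __name__ == "__main__":
    for w in ("kataba", "katub"):
        print(w, ">", root_and_pattern(w))
```

With this layout, all forms of one root share a common prefix, which fits the stated goals of easier string matching and smaller dictionaries.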

20 Transliterations of textual elements (2)
  Represent writing systems using non-Roman characters
    "мать" (mother) > "MATQ" and not "MAT6"
    (Today theme Kyoto dest go.) > KYOU WA KYOUTO E IKI MASU.

21 Keeping formatting tags in the texts
  If the translation units get larger, almost all tags become inside tags
  Tags often have a linguistic role
    For example, a sentence may contain a bullet list or a numbered list, which are normally linguistically homogeneous.

22 Non-textual elements
  Formulas, figures, icons, brand names, anchors, links… are often best replaced by tags or special occurrences
  The situation may be recursive (text inside figures)
  *IF x² + 5y > 3, x+y IS CONVENIENT.
  *IF, IS CONVENIENT.
  *IF $$R-1, $$E-2 IS CONVENIENT.
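
Replacing non-textual elements by special occurrences such as $$R-1 and $$E-2 can be sketched as a substitution that keeps a table for restoring the elements after translation. The sketch assumes the positions and kinds of the elements are already known, that the letter in the tag is a one-letter kind code, and that numbering runs over all elements of the segment, as the slide's example suggests; `replace_elements` and `restore_elements` are hypothetical names.

```python
import re


def replace_elements(text, spans):
    """Replace each (start, end, kind) span by a tag like $$R-1.

    Returns the tagged text and a table mapping each tag to the
    original element.
    """
    pieces, table, pos = [], {}, 0
    for number, (start, end, kind) in enumerate(sorted(spans), start=1):
        tag = f"$${kind}-{number}"
        table[tag] = text[start:end]
        pieces.append(text[pos:start])
        pieces.append(tag)
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces), table


def restore_elements(text, table):
    """Put the original elements back in place of their tags."""
    return re.sub(r"\$\$[A-Z]-\d+", lambda m: table[m.group(0)], text)


if __name__ == "__main__":
    source = "*IF x^2+5y>3, x+y IS CONVENIENT."
    relation, expression = "x^2+5y>3", "x+y"
    i = source.index(relation)
    j = source.index(expression, i + len(relation))
    spans = [(i, i + len(relation), "R"), (j, j + len(expression), "E")]
    tagged, table = replace_elements(source, spans)
    print(tagged)
    print(restore_elements(tagged, table))
```

Because the tags survive translation as ordinary occurrences, the same table can be applied to the target text; recursive cases (text inside figures) would need the replaced element to be processed as a text of its own, which this sketch does not attempt.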

23 Structuration of corpuses
  Motivations for corpuses
  Segmentation and structuration
  Representation of texts, intermediate results, translations and revisions

24 Motivations for corpuses
  Corpus = collection of texts sharing some factual characteristics:
    natural language
    transliteration and method for handling formatting information and non-textual elements
    segmentation method
    structuration method
    some management information:
      source (journal/volume, book/chapter…)
      usage destination (send back, postedit, tests…)

25 Segmentation and structuration
  "segmentation" = input texts > words, sentences… (best done by the morphological analyzer) & units of translation
  "structuration" = segmentation > higher level units (paragraphs, sections, etc.) + production of a corresponding tree structure
  In Ariane-G5, up to 7 hierarchical separators for a given corpus
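
Structuration with hierarchical separators can be sketched as a recursive split that yields a nested tree. Only the limit of 7 separators per corpus comes from the slide; the separator strings, the nested-list tree shape, and the name `structure` are assumptions made for illustration.

```python
MAX_SEPARATORS = 7  # limit stated for a given corpus in Ariane-G5


def structure(text, separators):
    """Build a nested list from text, splitting on separators in order.

    separators are listed from the highest level (e.g. chapter) down to
    the lowest; leaves of the tree are stripped text fragments.
    """
    if len(separators) > MAX_SEPARATORS:
        raise ValueError("at most 7 hierarchical separators per corpus")
    if not separators:
        return text.strip()
    first, rest = separators[0], separators[1:]
    parts = [part for part in text.split(first) if part.strip()]
    return [structure(part, rest) for part in parts]


if __name__ == "__main__":
    # Hypothetical separators: form feed between sections,
    # blank line between paragraphs.
    sample = "First paragraph.\n\nSecond paragraph.\fNew section."
    print(structure(sample, ["\f", "\n\n"]))
```

The depth of the resulting tree equals the number of separators declared for the corpus, so every text of the corpus has the same hierarchical shape.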

26 Representation of texts, intermediate results, translations and revisions
  Corpus = list of text files + descriptor
  Text = (transliterated) text + descriptor (+ non-textual elements replaced by tags or special occurrences)
  Intermediate result = list of decorated trees + descriptor (lingware variant + interval processed)
  Translation = (transliterated) text + descriptor (transliterated form may reduce morph. gen. size)
  Revision = (transliterated) text + descriptor (usually another, more natural transliteration)
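
The object kinds listed on this slide can be sketched as plain data classes. Only the pairing "content + descriptor" and the two descriptor items named for intermediate results come from the slide; all class and field names, and the use of a dictionary for descriptors, are assumptions of this sketch rather than the Ariane-G5 storage format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Text:
    content: str  # transliterated text, non-textual elements already tagged
    descriptor: Dict[str, Any] = field(default_factory=dict)


@dataclass
class IntermediateResult:
    trees: List[Any]            # decorated trees, one per translation unit
    lingware_variant: str       # which lingware variant produced the trees
    interval: Tuple[int, int]   # first and last unit processed


@dataclass
class Translation:
    content: str  # transliterated target text
    descriptor: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Revision:
    content: str  # usually in another, more natural transliteration
    descriptor: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Corpus:
    texts: List[Text] = field(default_factory=list)
    descriptor: Dict[str, Any] = field(default_factory=dict)


if __name__ == "__main__":
    corpus = Corpus(descriptor={"language": "FR"})
    corpus.texts.append(Text("*IF $$R-1, $$E-2 IS CONVENIENT."))
    print(corpus)
```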

27 Functionalities during processing
  Ensuring coherence between lingware and results
  Stopping & restarting processing of a text
  Reusing intermediate results
    recovery from interruptions
    debugging
    multitarget translation (analysis 2/3 of translation time)
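
Reusing intermediate results while keeping them coherent with the lingware can be sketched as a store that records, with each result, the lingware version that produced it. This is a minimal illustration, not the Ariane-G5 mechanism: the version string, the key made of text identifier and phase, and the name `ResultStore` are all assumptions.

```python
class ResultStore:
    """Keep one intermediate result per (text, phase), tagged with the
    lingware version that produced it."""

    def __init__(self):
        self._results = {}

    def get_or_run(self, text_id, phase, lingware_version, run):
        """Return the stored result if it matches the current lingware,
        otherwise call run() and store what it returns."""
        key = (text_id, phase)
        stored = self._results.get(key)
        if stored is not None and stored[0] == lingware_version:
            return stored[1]
        result = run()
        self._results[key] = (lingware_version, result)
        return result


if __name__ == "__main__":
    calls = []

    def analyze():
        calls.append("analysis")
        return ["decorated tree"]

    store = ResultStore()
    # Two target languages share one analysis of the same source text.
    for target in ("EN", "DE"):
        trees = store.get_or_run("text-1", "analysis", "v1", analyze)
        print(target, trees)
    print(len(calls))
```

Since the slide puts analysis at about two thirds of translation time, running it once per source text and reusing the result is where multitarget translation saves most.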

28 Conclusion and perspectives (1)
  Text & corpus handling in complete MT systems is quite complex and interesting…
    handling texts and corpuses is not a straightforward problem
    it raises many interesting technological and scientific issues

29 Conclusion and perspectives (2)
  but more is coming: Synergy MT systems <> TA (Translation Aids)
  unification of the representations of texts in both worlds:
    MT: revised texts structured as input texts
      => the text data base will become a kind of multilevel translation memory (texts, translations/revisions, intermediate results)
    TA: translation memories from "bags" to structured translation memories (keeping the sequential context)
    both: multiple-layer translation memories
      lemmatized forms
      "concrete" syntactic trees & "abstract" logico-semantic trees
      formatting tags

30 Conclusion and perspectives (3)
  Structuration may be used to "distribute the work" to MT and TA by segmenting according to the "best engine"
    some sublanguages are good for MT, bad for TA
      weather bulletins
    others are good for TA, bad for MT
      weather-related warnings, slightly modified versions of already translated documents
    and others are best kept for specialists
      fine-tuned legal sentences

