1 COMPILING A FRENCH-SLOVENIAN PARALLEL CORPUS
Adriana Mezeg
University of Ljubljana, Department of Translation Studies

2 Presentation plan
1. Introduction
2. Part I: Corpus design and development
3. Part II: Case study
4. Conclusion
UCCTS 2010, Edge Hill University, Ormskirk

3 Introduction
- Definition of a parallel corpus: a collection of electronic texts originally written in a language A alongside their translations into a language B (Baker 1995), built according to explicit design criteria for a specific purpose (Atkins 1992)
- Situation in Slovenia and motives behind the compilation of a French-Slovenian corpus
- Objectives

4 Part I: Corpus design and development
- Medium, genre, size
- Collecting texts, getting permission from copyright holders
- Pre-processing
- Alignment
- Annotation
- Some statistics

5 Medium, genre, size
- Medium: written/spoken
- Availability of French-Slovene written parallel texts: EU documents, legal and administrative texts, promotional texts, journalistic texts, literature
- Genre choice
- Envisaged size: 1 million words per language

6 Collecting texts, getting permission from copyright holders
- Journalistic subcorpus: 300 articles from Le Monde diplomatique and Le Monde diplomatique v slovenščini (1 164 074 words); copyright permission granted
- Literary subcorpus: 12 contemporary French novels and their Slovene translations (1 302 911 words); problems getting copyright permission from French publishing houses

7 Pre-processing
- Getting/producing texts in electronic form
- Cleaning the texts (removal of tables, graphics, pictures, footnotes etc.)
- Converting the texts into text-only ANSI files (problems with the character č); Unicode UTF-8
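The problem with č can be illustrated with a short sketch. The slide does not say which "ANSI" code page was involved, so this assumes a Western European one (cp1252), which has no č, as opposed to the Central European code page (cp1250) and UTF-8, which do. The function name and the sample word are hypothetical.

```python
# Sketch only: which code page the "ANSI" files used is an assumption.

def to_utf8_bytes(raw: bytes, source_encoding: str = "cp1250") -> bytes:
    """Decode bytes from a legacy code page and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


word = "slovenščini"

# cp1252 (Western European) cannot represent č at all.
try:
    word.encode("cp1252")
    western_ok = True
except UnicodeEncodeError:
    western_ok = False

# cp1250 (Central European) can, but reading those bytes as cp1252 garbles č.
legacy = word.encode("cp1250")
misread = legacy.decode("cp1252")

# Decoding with the right code page and saving as UTF-8 keeps the text intact.
recovered = to_utf8_bytes(legacy).decode("utf-8")

print(western_ok, misread, recovered)
```

Saving everything as UTF-8 sidesteps the question of code pages, since French and Slovene characters can then share one file.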

8 Alignment
- Commercial automatic alignment tools: WinAlign (Trados), Atril Déjà Vu, ParaConc
- Automatic sentence alignment using Michael Barlow’s ParaConc
- Because the number of segments differs between source and target (2-1, 1-2, 1-0), manual correction was needed
- Example of (semi-)automatic alignment: ParaConc (→ handout)

9 Annotation
- FraSloK is a morphosyntactically tagged corpus: each word in the corpus is assigned a grammatical tag corresponding to the word class to which it belongs (a part-of-speech tag)
- POS taggers for French: TreeTagger, MeLT tagger (→ handout)
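As a rough illustration of what a tagged corpus makes possible, the sketch below reads tagger output in a three-column, tab-separated layout (word form, tag, lemma) and looks for "en" followed by a present participle. The sample lines are invented, and the tag names PRP and VER:ppre are assumed from TreeTagger's French tagset; the handout, not this sketch, documents the actual output used for FraSloK.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str
    pos: str
    lemma: str


# Hypothetical tagger output: form <TAB> tag <TAB> lemma, one token per line.
SAMPLE = (
    "En\tPRP\ten\n"
    "confiant\tVER:ppre\tconfier\n"
    "le\tDET:ART\tle\n"
    "travail\tNOM\ttravail"
)


def parse_tagged(text: str) -> List[Token]:
    """Turn tab-separated tagger output into Token tuples, skipping odd lines."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            tokens.append(Token(*parts))
    return tokens


def gerund_starts(tokens: List[Token]) -> List[int]:
    """Return the indices where 'en' (assumed tag PRP) precedes a present participle."""
    return [
        i
        for i in range(len(tokens) - 1)
        if tokens[i].form.lower() == "en"
        and tokens[i].pos == "PRP"
        and tokens[i + 1].pos == "VER:ppre"
    ]


tokens = parse_tagged(SAMPLE)
print(gerund_starts(tokens))
```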

10 Some statistics
                           LMD (journalistic subcorpus)   LIT (literary subcorpus)   Total/language
French part                637 297                        701 715                    1 339 012
Slovene part               526 777                        601 196                    1 127 973
Total/subcorpus (tokens)   1 164 074                      1 302 911
Total/corpus (tokens)      2 466 985
Graph 1: Size of the French-Slovenian corpus (FraSloK) and its subcorpora.

11 Graph 2: General statistics for the French and Slovene journalistic subcorpus.
Category                           French subcorpus (LMD)   Slovene subcorpus (LMD)
tokens                             637 297                  526 777
types                              38 994                   63 514
type/token ratio                   6,21                     12,27
standardised TTR                   46,89                    60,30
mean word length                   4,98                     5,55
sentences                          25 420                   24 002
average sentence length in words   24,71                    21,56

12 Graph 3: General statistics for the French and Slovene literary subcorpus.
Category                           French subcorpus (LIT)   Slovene subcorpus (LIT)
tokens                             701 715                  601 196
types                              41 976                   68 919
type/token ratio                   5,99                     11,47
standardised TTR                   47,77                    58,61
mean word length                   4,54                     4,82
sentences                          42 350                   42 151
average sentence length in words   16,55                    14,25
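The categories in these tables can be made concrete with a small sketch. It uses naive regular-expression tokenisation and sentence splitting on a made-up two-sentence sample, so it will not reproduce the figures above, which come from whatever tool the author used. The standardised TTR, which averages the ratio over fixed-size chunks of text, is left out.

```python
import re


def basic_stats(text: str) -> dict:
    """Compute simple corpus statistics with naive tokenisation (a sketch only)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())
    types = set(tokens)
    return {
        "tokens": len(tokens),
        "types": len(types),
        # Type/token ratio expressed as a percentage.
        "ttr": 100 * len(types) / len(tokens),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
        "avg_sentence_length": len(tokens) / len(sentences),
    }


# Hypothetical sample text, not taken from the corpus.
stats = basic_stats("Le chat dort. Le chien dort aussi.")
for name, value in stats.items():
    print(name, round(value, 2))
```

A simple type/token ratio falls as a text gets longer, which is one reason a standardised version is reported alongside it when subcorpora of different sizes are compared.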

13 Part II: Case study
Translation of French detached constructions into Slovenian
- What are detached constructions?
- Problem: due to specific syntactic and semantic characteristics of French initial detached constructions, their translation into Slovene is problematic
- Hypothesis: explicitation (Vinay and Darbelnet 1958, Blum-Kulka 1986)

14 Example: gerundive (en + participle) detached constructions
- Semi-automatic extraction from ParaConc
- Syntactic patterns based on part-of-speech tags and regular expressions
- Example: En(\W \w+){0,1}(\W)? \w+
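To show how such a pattern behaves, here is a minimal sketch that runs the expression from the slide, copied verbatim, through Python's `re` module. ParaConc's regular-expression dialect may differ from Python's, so this is an approximation. The first three test sentences are shortened from the examples later in the presentation; the fourth is invented to show that "Enfin" is not picked up.

```python
import re

# Pattern copied from the slide; capital "En" targets sentence-initial position.
PATTERN = re.compile(r"En(\W \w+){0,1}(\W)? \w+")

sentences = [
    "En confiant le sale travail à l'Ethiopie, l'exécutif américain a pris le risque.",
    "En souriant, il arrachait sa tunique.",
    "Puis, en secouant sa torpeur, jeta d'une voix rauque.",
    "Enfin, il arrachait sa tunique.",  # hypothetical counter-example
]

hits = []
for sentence in sentences:
    # match() anchors the search at the start of the sentence.
    found = PATTERN.match(sentence)
    if found:
        hits.append(found.group(0))

print(hits)
```

The pattern only narrows down candidates: anything it returns still has to be checked by hand, which is why the extraction is described as semi-automatic.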

15 Results of the extraction
- Journalistic subcorpus: 90 occurrences out of 96 correct (94%)
- Literary subcorpus: 157 occurrences out of 160 correct (98%)
- After automatic extraction for all the syntactic patterns and manual elimination of unsuitable examples: 391 French initial DC having a gerund as a base (JC: 134 occ., 34%; LC: 257 occ., 66%)
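The percentages on this slide follow directly from the raw counts. A quick check, with hypothetical variable names:

```python
def pct(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Share of automatically extracted hits that were correct.
journalistic_correct = pct(90, 96)
literary_correct = pct(157, 160)

# Distribution of the 391 retained constructions across the two subcorpora.
total_dc = 134 + 257
journalistic_share = pct(134, total_dc)
literary_share = pct(257, total_dc)

print(journalistic_correct, literary_correct, total_dc,
      journalistic_share, literary_share)
```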

16 Distribution of translation strategies for detached constructions with a gerund as a base

17 Examples
(1) En confiant le “ sale travail ” à l’Ethiopie, l’exécutif américain a pris le risque de ranimer des braises mal éteintes dans la région. (LMD, November 2007)
[In entrusting the "dirty work" to Ethiopia, the American executive risked rekindling badly extinguished embers in the region.]
Ko je ameriški izvajalec zaupal »umazano delo« Etiopiji, je tvegal, da se bo v regiji razpihala žerjavica, ki še ni dobro ugasnila. (→ Subordination)
[When the American executive entrusted the "dirty work" to Ethiopia, he risked …]

(2) Puis, en secouant sa torpeur, jeta d'une voix rauque : – Qu'est-ce que tu veux qu'on en fasse ? (Andreï Makine, The French Testament)
[Then, getting out of the numbness, said pointedly: - What do you want us to do about it?]
Potem se je zdrznil iz odrevenelosti in rezko odvrnil: - Kaj pa bi rada, da naredim? (→ Coordination)
[Then he got out of the numbness and said pointedly: - What do you want me to do?]

18 Examples (continued)
(3) En m'emmenant trois jours en week-end avec son trésorier et ses dobermans, le directeur de la chaîne a cru me faire passer à jamais le goût de la gaudriole. (Marie Darrieussecq, Pig Tales. A Novel of Lust and Transformation, 1996)
[By taking me for three days to his country house with his treasurer and his Dobermanns, the director of the chain thought that I would repress for ever my desire for hanky-panky.]
Direktor me je peljal na vikend z blagajnikom in s svojimi tremi dobermani, bil je prepričan, da me bo razuzdanost za vedno minila. (→ Other sentence relations, namely juxtaposition (absence of linking elements))
[The director took me to his country house with the treasurer and his three Dobermanns, he was convinced that I would get over for ever my hanky-panky.]

(4) En murmurant des supplications confuses, il m'écrasa sous son corps. (Shan Sa, Empress, 2003)
[By whispering confused pleas, he crushed me with his body.]
Šepetajoč zmedene prošnje me je sploščil s svojo težo. (→ Detached construction)
[(By) whispering confused pleas he crushed me with his body.]

(5) En souriant, il arrachait sa tunique, dénouait son pantalon de soie et dénudait son corps vigoureux. (Shan Sa, Empress, 2003)
[By smiling, he tore off his tunic, undid his silk trousers and stripped naked his vigorous body.]
Z nasmehom je s sebe strgal tuniko, si odvezal svilene hlače in razgalil svoje krepko telo. (→ Adverbial phrase, namely adjunct of manner)
[With a smile he tore off his tunic, undid his silk trousers and revealed his vigorous body.]

19 Conclusion
- Summing-up
- Future perspectives
  - Corpus enlargement, improvement, enrichment
  - Public access (→ SPOOK project, http://lojze.lugos.si/spook/korpus.html)
  - Further use

20 THANK YOU

