Introducing COMPARA: The Portuguese-English Parallel Corpus. Ana Frankenberg-Garcia (ISLA, Lisbon) & Diana Santos (SINTEF, Oslo)
What is COMPARA? A collection of texts originally written in Portuguese and English, aligned with their respective English and Portuguese translations, and held in machine-readable, searchable form.
Main Features: free WWW access; made for people who are not necessarily corpus-literate as well as for experienced corpus users; open-ended.
Text Selection: all varieties of Portuguese and English; date of publication not restricted; more than one translation (target text, TT) per source text (ST) possible. Phase I: published fiction. Phase II: other genres.
Copyright permissions: 60 text pairs; native authors and translators from Angola, Brazil, Mozambique, Portugal, South Africa, UK and USA; 33 authors and 31 translators; some interesting text combinations.
Texts available in November 2000
Text pairs: 5; Tokens: 113,190; Types: 19,828
                Portuguese    English
Source texts    4             1
Translations    1             4
Tokens          58,568        64,608
Types           10,965        64,608
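The token and type figures above distinguish running words (tokens) from distinct word forms (types). The following is a minimal sketch of that distinction, assuming a simple lowercased, regex-based tokenizer; the function name `count_tokens_and_types` is hypothetical, and COMPARA's own tokenization (AC/DC project corpus tools) may well count differently.

```python
import re


def count_tokens_and_types(text: str) -> tuple[int, int]:
    """Return (number of tokens, number of types) for a text.

    Assumption: a token is any run of word characters, compared
    case-insensitively. This is an illustrative rule, not COMPARA's.
    """
    tokens = re.findall(r"\w+", text.lower())
    return len(tokens), len(set(tokens))


sample = "The cat saw the other cat."
n_tokens, n_types = count_tokens_and_types(sample)
print(f"tokens={n_tokens} types={n_types}")  # tokens=6 types=4
```

Here "the" and "cat" each occur twice, so six tokens collapse into four types.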
Encoding aims and options: to provide accurate examples of how sentences have been translated from Portuguese into English and from English into Portuguese; to provide “co-textualised” examples of how words and phrases have been translated; no attempt to preserve the texts in a format that allows exact future replication; generally TEI-inspired, but not TEI-bound.
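As a rough illustration of what a "co-textualised" example could look like, the sketch below searches the source side of a few sentence pairs and returns each hit together with its aligned translation. The sentence pairs are invented for this example and are not taken from COMPARA; `search_source` is a hypothetical helper, not part of any COMPARA interface.

```python
import re

# Invented sample pairs for illustration only; not taken from COMPARA.
ALIGNED_PAIRS = [
    ("Ela comprou um livro.", "She bought a book."),
    ("O livro era caro.", "The book was expensive."),
    ("Ele saiu cedo.", "He left early."),
]


def search_source(pairs, word):
    """Return (source, translation) pairs whose source contains `word`.

    Matching is on whole words and ignores case.
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]


for src, tgt in search_source(ALIGNED_PAIRS, "livro"):
    print(src, "=>", tgt)
```

Each result shows the search word inside its full source sentence next to the sentence that translates it, rather than as an isolated word pair.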
From print to Web
1. Scanning and OCR.
2. OCR revision; minus non-translational material; plus,,,
3. Manual paragraph alignment.
4. Automatic sentence separation and tokenization (AC/DC project corpus tools) and automatic sentence alignment (IMS-CWB Easy-Align).
5. Manual alignment revision and markup.
6. Automatic alignment markup and final IMS-CWB encoding.
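To give a feel for the automatic sentence separation in step 4, here is a deliberately naive sketch that splits a paragraph after sentence-final punctuation. It is not the AC/DC project tooling; the function name `split_sentences` and the splitting rule are assumptions made for illustration, and the rule would mis-split abbreviations such as "Dr." (one reason a manual revision step is useful).

```python
import re


def split_sentences(paragraph: str) -> list[str]:
    """Split a paragraph into sentences.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


for sentence in split_sentences("He left. She stayed! Why?"):
    print(sentence)
```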
Alignment
1 source text sentence = 1 alignment unit
Sentence (s)    ST : TT
s preserved     1 : 1
s split         1 : 2
s joined        1 :
s deleted       1 : 0
s added         1[+0] : 1[+1]
s reordered     A, B, C : A, C, B
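One way to picture "1 source text sentence = 1 alignment unit" is as a record holding one source sentence and the target sentences aligned to it. The sketch below is an assumption about how such a unit might be modelled, not COMPARA's actual encoding; the names `AlignmentUnit` and `classify` are hypothetical. It only covers the three cases that follow from counting target sentences (preserved, split, deleted); joined, added and reordered sentences would need extra information and are not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class AlignmentUnit:
    """One source sentence plus the target sentences aligned to it."""
    source: str
    targets: list[str] = field(default_factory=list)


def classify(unit: AlignmentUnit) -> str:
    """Label a unit by how many target sentences it maps to."""
    n = len(unit.targets)
    if n == 0:
        return "deleted"    # 1 : 0
    if n == 1:
        return "preserved"  # 1 : 1
    return "split"          # 1 : 2 (or more)


units = [
    AlignmentUnit("A.", ["A'."]),
    AlignmentUnit("B, and C.", ["B'.", "C'."]),
    AlignmentUnit("D.", []),
]
for unit in units:
    print(unit.source, "->", classify(unit))
```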
Target Users: Portuguese learners of English; English learners of Portuguese; students and teachers of translation; professional translators; bilingual dictionary makers; developers of machine translation software; anyone else interested in translation language and in the similarities and differences between Portuguese and English.