Using corpora in contrastive and translation studies

1 Using corpora in contrastive and translation studies
Corpus Linguistics Richard Xiao

2 Aims of this session
Lecture
- Corpora in contrastive and translation studies
- Use of comparable and parallel corpora
- Case study: Translation universals, do they really exist?
Lab session
- CUC ParaConc and Babel parallel corpus
Closing
- Shedding of valedictory tears

3 Types of corpora: Some distinctions
Monolingual versus multilingual corpora Parallel versus comparable corpora Comparable versus comparative corpora

4 Monolingual versus multilingual corpora
Monolingual corpora A corpus that only involves one language Multilingual corpora A corpus that involves texts of more than one language A corpus covering two languages is conventionally known as ‘bilingual’ Multilingual corpora, in a narrow sense, must involve more than two languages ‘Multilingual’ and ‘bilingual’ are often used interchangeably Parallel and comparable corpora

5 Parallel versus comparable corpora
Terminological confusion centres around the terms For some scholars (e.g. Aijmer and Altenberg 1996; Granger 1996: 38) Corpora composed of source texts in one language and their translations in another language (or other languages) are ‘translation corpora’ while those comprising different components sampled from different native languages using comparable sampling techniques are called ‘parallel corpora’ For many others (e.g. Baker 1993: 248, 1995, 1999; Barlow 1995, 2000: 110; Hunston 2002: 15; McEnery and Wilson 1996: 57; McEnery, Xiao and Tono 2006) Corpora of the first type are labelled ‘parallel corpora’ while those of the latter type are ‘comparable corpora’

6 Parallel versus comparable corpora
In classifying corpora, the criteria used must be consistent and logical ways of doing things… - We can say a corpus is a translation or a non-translation corpus if the criterion of corpus content is used - But if we choose to define corpus types by the criterion of corpus form, we must use the criterion consistently - We can say a corpus is parallel if the corpus contains source texts and translations in parallel, or it is a comparable corpus if its components or subcorpora are comparable by applying the same sampling techniques and similar balance and coverage - It is simply inconsistent and illogical to refer to corpora of the first type as translation corpora by the criterion of content while referring to corpora of the latter type as parallel corpora by the criterion of form!

7 Multilingual vs. monolingual comparable corpora
A common practice in TS is to compare a corpus of translated texts (translational corpus) with a corpus consisting of comparably sampled non-translated texts in the same language The two sub-corpora form a monolingual comparable corpus for translation research, as opposed to a multilingual comparable corpus composed of comparable texts for different languages for cross-linguistic contrast

8 Comparative corpora
- Corpora containing different regional varieties of the same language are not comparable corpora
- E.g. the International Corpus of English (ICE), the Brown family of corpora
- All corpora, as a resource for linguistic research, have ‘always been pre-eminently suited for comparative studies’ (Aarts 1998: ix), either intralingually or interlingually
- Corpora of this kind are comparative corpora

9 Use of parallel & comparable corpora
Parallel and comparable corpora “offer specific uses and possibilities” for contrastive and translation studies (Aijmer & Altenberg 1996: 12) - they give new insights into the languages compared – insights that are not likely to be gained from the study of monolingual corpora; - they can be used for a range of comparative purposes and increase our knowledge of language-specific, typological and cultural differences, as well as of universal features; - they illuminate differences between source texts and translations, and between native and non-native texts; - they can be used for a number of practical applications, e.g. in lexicography, language teaching and translation.

10 Use of parallel & comparable corpora
Used primarily for translation and contrastive studies The two types of corpora have their own characteristics, and serve different purposes Parallel corpora are useful in translation studies, but they alone serve as a poor basis for cross-linguistic contrast, because translations cannot avoid the effect of translationese Comparable corpora are well suited for contrastive research, but are less useful in translation studies

11 Using corpora in translation studies
Parallel corpora Useful in exploring how an idea in one language is conveyed in another language, thus providing indirect evidence to the study of translation processes Indispensable for building statistical or example-based machine translation (EBMT) systems, and for the development of bilingual lexicons and translation memories Parallel concordancing is a useful tool for translators Comparable corpora Useful in improving the translator’s understanding of the subject field and improving the quality of translation in terms of fluency, correct term choice and idiomatic expressions in the chosen subject field Can also be used to build terminology banks
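To make the idea of parallel concordancing concrete, here is a minimal sketch in Python. It assumes the corpus is already sentence-aligned and held in memory as (source, target) pairs; the function name `parallel_concordance` and the sentence pairs are invented for illustration and do not come from ParaConc or any real corpus.

```python
def parallel_concordance(pairs, query):
    """Return aligned (source, target) pairs whose source side contains the query.

    `pairs` is assumed to be a list of already sentence-aligned tuples.
    Matching is a plain case-insensitive substring test, far simpler than
    what a real concordancer offers.
    """
    q = query.lower()
    return [(src, tgt) for src, tgt in pairs if q in src.lower()]


# Toy, invented sentence pairs (hypothetical data, for illustration only)
pairs = [
    ("However, the plan failed.", "然而，计划失败了。"),
    ("The plan was simple.", "计划很简单。"),
    ("She left early.", "她很早就离开了。"),
]

hits = parallel_concordance(pairs, "plan")
for src, tgt in hits:
    print(src, "|||", tgt)
```

Showing each hit next to its aligned translation is what lets a translator see how an expression has been rendered in context.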

12 Using corpora in translation studies
Translational corpora Provide primary evidence in product-oriented Translation Studies, and in studies of translation universals If corpora of this kind are encoded with sociolinguistic and cultural parameters, they can also be used to study the sociocultural environment of translations (e.g. functions of translation in DTS) Monolingual corpora (source / target language ) Raising the translator’s linguistic and cultural awareness in general Providing a useful and effective reference tool for translators In combination with a parallel corpus to form a so-called ‘translation evaluation corpus’ that helps translator trainers or critics to evaluate translations more effectively and objectively

13 Corpus-based translation studies
Laviosa (1998a): “the corpus-based approach is evolving, through theoretical elaboration and empirical realisation, into a coherent, composite and rich paradigm that addresses a variety of issues pertaining to theory, description, and the practice of translation.”
- Hypotheses that translation universals can be tested by corpus data (Baker 1993, 1995)
- Rapid development of corpus linguistics, esp. multilingual corpus research in the early 1990s
- Increasing interest in Descriptive Translation Studies (Toury 1995)
Tymoczko (1998): “Corpus Translation Studies is central to the way that Translation Studies as a discipline will remain vital and move forward.”
Meta 43/4 (1998); Kenny (2001); Laviosa (2002); Granger et al. (eds.) (2003); Olohan (2004); Mauranen et al. (eds.) (2004); Kruger & Munday (eds.) (2011); Hu (2011); Wang (2011); Xiao (2012)

14 The Holmes-Toury map
[Diagram with branches labelled: Applied Translation Studies, Descriptive Translation Studies, Theoretical Translation Studies]

15 Applied Translation Studies
Three major contributions of corpora Corpus-assisted translating Bowker (1998: 631): ‘corpus-assisted translations are of a higher quality with respect to subject field understanding, correct term choice and idiomatic expressions.’ Corpus-aided translation teaching and training Bernardini (1997): ‘large corpora concordancing’ (LCC) can help students to develop ‘awareness’, ‘reflectiveness’ and ‘resourcefulness’, which are said to be the skills that distinguish a translator from those unskilled amateurs Development of translation tools Corpora, and especially aligned parallel corpora, are essential for the development of translation technology such as machine translation (MT) systems, and computer-aided translation (CAT) tools

16 Descriptive Translation Studies
Characterized by its emphasis on the study of translation per se It is to answer the question of ‘why a translator translates in this way’ instead of ‘how to translate’ Baker (1993) predicted that the availability of large corpora of both source and translated texts, together with the development of the corpus-based approach, would enable translation scholars to uncover the nature of translation as a mediated communicative event

17 Descriptive Translation Studies
Three focuses (Holmes 1972/1988) Translation as a product Concerned with describing translation as a product by comparing corpora of translated and non-translational native texts in the target language Attempting to uncover evidence to support or reject the so-called translation universal hypotheses Translation as a process Aims at revealing the thought processes that take place in the mind of the translator while she or he is translating One possible way for corpus-based DTS is to investigate the written transcripts of these recordings off-line, which is known as Think-Aloud Protocols (or TAPs) Translation as product providing indirect evidence to translation as process The function of translation The study of contexts rather than texts: function or impact of a translation Relatively few function-oriented studies that are corpus-based

18 Theoretical Translation Studies
Aims ‘to establish general principles by means of which these phenomena can be explained and predicted’ (Holmes 1988: 71) Closely related to, and often reliant on the empirical findings produced by Descriptive Translation Studies One good battleground of using DTS findings to pursue general theory of translation is the hypothesis of so-called translation universals (TUs) and its related sub-hypotheses Sometimes referred to as the inherent features of translational language, or ‘translationese’

19 TU: A focus of CBTS
An important area of corpus-based TS over the past decade
Baker (1993, 1996); Chesterman (2004); Kenny (1998, 1999, 2000, 2001); Laviosa (1998b); Mauranen & Kujamäki (2004); McEnery & Xiao (2002, 2007); Olohan (2004); Olohan & Baker (2000); Øverås (1998); Pym (2005); Xiao & Yue (2008); Xiao & Dai (2010); Xiao (2010, 2011, 2012)
The Translational English Corpus (TEC) Manual Software

20 Features of translated English
Laviosa (1998b): Four core patterns of lexical use in translational English - A relatively low proportion of lexical words over function words - A relatively high proportion of high-frequency words over low-frequency words - A relatively great repetition of the most frequent words - Less variety in most frequently used words

21 Features of translated English
Beyond the lexical level Simplification: “tendency to simplify the language used in translation” (Baker 1996: ) simpler language than target native language lexically / syntactically / stylistically Normalization: “tendency to exaggerate features of the target language and to conform to its typical patterns” (Baker 1996: 183) more “normal” than the target native language Explicitation: tendency in translations to “spell things out rather than leave them implicit” (Baker 1996: 180) more frequent use of conjunctions, and increased cohesion in translated text Sanitization: translated texts are “somewhat ‘sanitized’ versions of the original” (Kenny 1998: 515) Lost or reduced connotational meaning in translation “TU hypotheses”

22 TU: A target of debate
Is translational language different from target native language?
- Translational language is at best an unrepresentative special variant of the target language because translations cannot possibly avoid the effect of translationese
- e.g. Baker 1993; Gellerstam 1996; Hartmann 1985; Laviosa 1997; McEnery & Wilson 2001; McEnery & Xiao 2002, 2007; Teubert 1996

23 TU: A target of debate
Are the features uncovered on the basis of translational English generalizable to other translated languages?
- Existing evidence has largely come from translational English and related European languages
- If such features are to be generalized as “translation universals”, the language pairs involved must not be restricted to English and closely related languages
- Cheong’s (2006) study of English-Korean translation contradicts even the least controversial explicitation hypothesis
- Evidence from “genetically” distinct language pairs such as English and Chinese is undoubtedly more convincing, if not indispensable

24 The ZCTC corpus
- Created with the explicit aim of studying the features of translated Chinese
- A translational counterpart of the Lancaster Corpus of Mandarin Chinese (LCMC), a one-million-word balanced corpus of native Chinese (McEnery & Xiao 2004)
- Five hundred 2,000-word text samples taken proportionally from fifteen written text categories published in China in the 1990s

25 LCMC / ZCTC corpus design

26 ZCTC vs. LCMC

27 Corpus markup and annotation
CES-compliant XML CES: Tokenization and POS tagging ICTCLAS2008: A precision rate of 98.54% for tokenization Paragraph, sentence, word token Encoded in Unicode (UTF-8)
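As an illustration of what paragraph, sentence and word-token markup makes possible, the sketch below reads a tiny XML fragment and computes sentence lengths in tokens. The element names (`body`, `p`, `s`, `w`) and the `POS` attribute are assumptions made for this example and may not match the actual ZCTC/LCMC files; the sample sentences and tags are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element and attribute names are assumed, not
# taken from the real corpus files.
sample = """<body>
  <p>
    <s n="1"><w POS="r">我</w><w POS="v">喜欢</w><w POS="n">语言学</w><w POS="w">。</w></s>
    <s n="2"><w POS="r">他</w><w POS="d">也</w><w POS="v">喜欢</w><w POS="w">。</w></s>
  </p>
</body>"""

root = ET.fromstring(sample)

# One list of (token, tag) pairs per sentence element
sentences = [
    [(w.text, w.get("POS")) for w in s.iter("w")]
    for s in root.iter("s")
]

# Sentence length counted in tokens; punctuation tokens are included here
lengths = [len(sentence) for sentence in sentences]
mean_len = sum(lengths) / len(lengths)
print(lengths, mean_len)
```

Whether punctuation counts towards sentence length is a design decision that should be applied identically to both corpora being compared.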

28 Core patterns of lexical use
Do the core patterns of lexical use Laviosa (1998b) observes in translational English also apply in translated Chinese? Same criteria and parameters as in Laviosa (1998b) Lexical density Frequency profiles Mean sentence length

29 Lexical density
The Stubbs-style lexical density: the ratio between the number of lexical words (i.e. content words) and the total number of words (Stubbs 1986: 33; 1996: 172)
- Measure of informational load
- Adopted in Laviosa (1998b)
Lexical density measured by TTR or Standardized TTR (STTR) (Scott 2004)
- Measure of lexical variability
- Commonly used in Corpus Linguistics
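The two measures can be sketched in a few lines of Python. This is an illustrative sketch, not the procedure used in the study: the set of tags counted as lexical and the toy sentence are assumptions, and `sttr` follows the common description of standardized TTR as the mean type-token ratio over consecutive fixed-size chunks (1,000 tokens by default), ignoring any trailing partial chunk.

```python
def lexical_density(tagged_tokens, lexical_tags):
    """Stubbs-style lexical density: lexical words / all words."""
    lexical = sum(1 for _, tag in tagged_tokens if tag in lexical_tags)
    return lexical / len(tagged_tokens)


def sttr(tokens, chunk_size=1000):
    """Standardized TTR (%): mean TTR over consecutive full chunks."""
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return 100 * sum(ratios) / len(ratios)


# Toy example; the tag names and the lexical tag set are assumptions
tagged = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
          ("on", "PREP"), ("the", "DET"), ("mat", "NOUN")]
lexical_tags = {"NOUN", "VERB", "ADJ", "ADV"}

ld = lexical_density(tagged, lexical_tags)            # 3 lexical words out of 6
ttr_value = sttr([w for w, _ in tagged], chunk_size=6)  # 5 types in 6 tokens
print(ld, ttr_value)
```

Standardizing by chunk keeps the measure comparable across texts of different lengths, since raw TTR falls as a text gets longer.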

30 Stubbs-style lexical density
Mean LD is significantly greater in native than translational corpus (66.93% vs %, t = -4.94, p<0.001) In addition, the native Chinese corpus displays a greater LD score in all of the 15 genres – and significant for nearly all genres (except for M) Translations make more frequent use of function words
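The slide does not say which variant of the t-test was used. As one possible reading, the sketch below computes Welch's two-sample t statistic from per-text lexical density scores using only the standard library. The scores are invented, and the argument order (translational first, native second) is an assumption chosen so that a higher native mean yields a negative t.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Invented per-text lexical density scores (hypothetical data)
translational = [0.60, 0.62, 0.61, 0.63]
native = [0.66, 0.68, 0.67, 0.69]

t = welch_t(translational, native)
print(round(t, 3))
```

Turning t into a p-value additionally requires the degrees of freedom and the t distribution, which a statistics package would normally supply.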

31 Standardized TTR
- Mean STTR is slightly greater in native than translation corpus (46.58 vs ): not significant (t = , p=0.571)
- The differences in most genres are also marginal
- Greater STTR scores can be found in both native (e.g. A) and translated (C) Chinese genres

32 Lexical-function ratio ≈ Stubbs LD
Mean ratio between lexical and function words is significantly greater in native than translational corpus (2.08 vs 1.64, t = -4.88, p<0.001) Also, native Chinese has a greater ratio in all genres, and the differences are significant in nearly all genres (except for M) In line with Laviosa’s (1998b) initial hypothesis that translational language has a relatively low proportion of lexical words over function words
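The approximate equivalence in the slide title follows from the definitions, assuming every token in a text is counted as either lexical ($L$ tokens) or functional ($F$ tokens):

```latex
\[
  \mathrm{LD} = \frac{L}{L+F}, \qquad
  r = \frac{L}{F}
  \quad\Longrightarrow\quad
  r = \frac{\mathrm{LD}}{1-\mathrm{LD}}, \qquad
  \mathrm{LD} = \frac{r}{1+r}.
\]
```

The identity holds for a single text. Means taken over many texts need not satisfy it exactly: a lexical density of 66.93% corresponds to $r \approx 2.02$, close to but not identical with the reported mean ratio of 2.08.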

33 Frequency profiles of LCMC/ZCTC
Laviosa’s (1998b) ‘list head’ or ‘high frequency words’ Wordlist items which individually account for at least 0.10% of the total number of tokens in a corpus The same criterion for high frequency words in this study to ensure comparability

34 Frequency profiles
- The numbers of high frequency words are very similar in the two corpora
- High frequency words account for a considerably greater proportion of tokens in the translational corpus (40.47% vs %)
- High frequency words display a much greater repetition rate in translated Chinese ( vs )
- Also the ratio between high- and low-frequency words is greater in the translational corpus ( vs )
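A frequency profile of this kind can be sketched as follows. The 0.10% cut-off comes from the slides; the formulas for repetition rate (high-frequency tokens per high-frequency type) and for the high/low ratio (high-frequency tokens over all remaining tokens) are assumptions about how these measures are operationalized, and the token list is a toy example that uses a much higher threshold so the split is visible.

```python
from collections import Counter


def frequency_profile(tokens, threshold=0.001):
    """Split a word list into high- and low-frequency items.

    A type is 'high frequency' if it accounts for at least `threshold`
    (0.10% by default) of all tokens.
    """
    counts = Counter(tokens)
    total = len(tokens)
    high = {w: c for w, c in counts.items() if c / total >= threshold}
    high_tokens = sum(high.values())
    low_tokens = total - high_tokens
    return {
        "high_types": len(high),
        "high_token_share": high_tokens / total,
        # Assumed definition: tokens per high-frequency type
        "repetition_rate": high_tokens / len(high) if high else 0.0,
        # Assumed definition: high-frequency tokens over all other tokens
        "high_low_ratio": high_tokens / low_tokens if low_tokens else float("inf"),
    }


# Toy token list; threshold raised to 20% purely for illustration
tokens = "的 的 的 的 了 了 了 书 猫 狗".split()
profile = frequency_profile(tokens, threshold=0.2)
print(profile)
```

Applying one threshold to both corpora is what keeps the two profiles comparable.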

35 Mean sentence length vs. simplification
Conflicting observations of mean sentence length as an indicator of simplification (e.g. Laviosa 1998b vs. Malmkjaer 1997) The native Chinese corpus (LCMC) shows a marginally greater mean sentence length: not significant (t = , p = 0.17) Mean sentence length is sensitive to genre variation and may not be reliable as an indicator of simplification in translational Chinese (Mean sentence segment length)

36 Lexical use in translational Chinese
Summary - Analysis of lexical density and frequency profiles shows that the four core patterns of lexical use in translational English are essentially also applicable in translated Chinese - But mean sentence length is less reliable as an indicator of simplification in translational Chinese

37 Explicitation: Connectives as a device?
Perhaps the most studied topic in TU research and also the least controversial hypothesis Chen (2006) Connectives are a device for explicitation in English-Chinese translation of popular science books Xiao and Yue (2008) Connectives are significantly more frequent in translational than native Chinese fiction Question Can we generalize this finding from these specific genres to Mandarin Chinese in general?

38 Conjunctions in ZCTC and LCMC
Mean frequency of conjunctions is significantly greater in translational than native corpus and instances per 10,000 tokens, LL= for 1 d.f., p<0.001 In addition, genre-based distribution shows that most genres covered in the corpora display a significantly more frequent use of conjunctions in translational Chinese in spite of some genre-based subtleties (e.g. F, J)
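The comparison above combines a normalized frequency (instances per 10,000 tokens) with a log-likelihood (LL) test. Below is a minimal sketch of both, using the two-corpus log-likelihood formula commonly used in corpus linguistics; the counts and corpus sizes are invented and are not the ZCTC/LCMC figures. With 1 d.f., an LL value above 10.83 corresponds to p < 0.001.

```python
from math import log


def per_10k(freq, corpus_size):
    """Normalized frequency: instances per 10,000 tokens."""
    return freq / corpus_size * 10_000


def log_likelihood(a, b, c, d):
    """Two-corpus log-likelihood.

    a, b: observed frequencies in corpus 1 and corpus 2
    c, d: total token counts of corpus 1 and corpus 2
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * log(observed / expected)
    return 2 * ll


# Invented counts (hypothetical data): 300 vs 200 conjunctions in two
# corpora of 100,000 tokens each
ll = log_likelihood(300, 200, 100_000, 100_000)
print(per_10k(300, 100_000), per_10k(200, 100_000), round(ll, 2))
```

Normalizing makes the raw counts comparable across corpora of different sizes, while the LL score indicates whether the difference is larger than chance would suggest.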

39 Conjunctions of different frequency bands
More conjunction types of high frequency bands (0.01% or above) are used in translational corpus There are an equal number of conjunction types (56 types) of medium frequency band (0.005%) in translational and native corpora Beyond this balance point, the native corpus displays a greater number of conjunction types of low frequency band (0.001% or below) In line with observations about high vs. low frequency words

40 Conjunctions of different styles
A closer comparison of the lists of frequent conjunctions (0.001% or above) in the respective corpora also sheds some new light on the simplification hypothesis
- There are 91 and 99 types of frequent conjunctions in the two corpora; 86 items overlap in the two lists
- Conjunctions on the translational but not the native list are all informal, colloquial and simple, and usually have more formal alternatives (e.g. 虽然 for 虽说, 总之 for 总的来说)
- Conjunctions on the native but not the translational list are typically formal, literate and archaic (e.g. 故、可见、进而、加之、固然、继而、非但、然、而后)
- These results provide evidence for the simplification hypothesis but against the normalization hypothesis

41 Conclusions
- Results based on two comparable Chinese corpora suggest that the core patterns of lexical use in translational English are generally also applicable in translated Chinese
- Beyond the lexical level
- Mean sentence length is sensitive to genre variation and may not be reliable as an indicator of simplification
- A comparison of frequent conjunctions in native and translated Chinese shows that simpler forms tend to be used in translations
- In spite of some genre-based subtleties, conjunctions are more frequently used in translational Chinese, which provides evidence in favour of the explicitation hypothesis
- Corpus Translation Studies is a promising area of research
《语料库翻译文库:英汉翻译中的汉语译文语料库研究》,上海交通大学出版社,2012

42 CUC ParaConc Software demo…

43 Shedding of valedictory tears
