The Translational English Corpus: A practical approach to corpus building.


1 The Translational English Corpus: A practical approach to corpus building

2 Outline TEC and new developments – EDT Corpus – Humanities Corpus Corpus design – Representativeness – Balance – Size Corpus building – Identifying material – Scanning/Converting texts – Tagging & Annotation

3 A corpus of contemporary English translations: written texts translated into English from a variety of source languages http://www.llc.manchester.ac.uk/ctis/research/english-corpus/

4

5 Number of books in each language for fiction and (auto)biography

6 Set of software tools for the investigation of a wide range of issues to do with the language of translated texts. Header File: contains meta‐data such as the title of the text, author, publisher, etc. Text File: contains the actual data to be analysed – Sub-corpus Selection: Allows you to select particular text files or groups of text files to search. – Sort Tool: Allows you to sort concordances to the left or right, and specify the number of words between the search keywords. – Corpus Tree Viewer: Allows you to “grow” a tree for various keywords. The size of the text reflects frequency of occurrence in the corpus.
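
To make the idea of sorting concordances to the left or right concrete, here is a minimal keyword-in-context sketch. It is not the TEC Tools implementation: the function names, the tokenising pattern and the sample sentence are all assumptions made for illustration.

```python
import re

def concordance(text, keyword, width=3):
    """Return (left_context, hit, right_context) for every match of keyword."""
    # Hypothetical tokeniser: runs of word characters, with an optional apostrophe part.
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

def sort_right(hits, position=1):
    """Sort hits by the word `position` places to the right of the keyword."""
    def key(hit):
        right = hit[2]
        return right[position - 1].lower() if len(right) >= position else ""
    return sorted(hits, key=key)

def sort_left(hits, position=1):
    """Sort hits by the word `position` places to the left of the keyword."""
    def key(hit):
        left = hit[0]
        return left[-position].lower() if len(left) >= position else ""
    return sorted(hits, key=key)

sample = "She said the house was old. He said a word. They said nothing at all."
hits = concordance(sample, "said", width=2)
for left, hit, right in sort_right(hits):
    print(" ".join(left).rjust(12), hit, " ".join(right))
```

Sorting on the first word to the right groups lines by what follows the keyword, while sorting to the left groups them by what precedes it; the `position` argument stands in for choosing a word further from the keyword.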

7 An electronic database of all material (to be) included in the TEC for the subcorpora of fiction and (auto)biography. The entry for each book includes not only most of the information that is included in the header file, but also images of the covers of the books.

8 A corpus of discourses on translation for the investigation of the way in which translation/translators are conceptualised in society at different historical periods. No time, language or genre restriction: any material is included as long as it is written in English. Two types of material – Peritextual: material that accompanies the translation, e.g. prefaces, introductions, afterwords, etc. – Epitextual: published material (broadsheet and mainstream newspapers, literary magazines, etc.) Link with TEC

9 A corpus of translations into English of works by theorists in the humanities, e.g. philosophers, sociologists, literary theorists, etc. Temporality: translations date from 1900 onwards, but the source texts do not have a time restriction. * Multiple translations of the same book.

10

11 What is a corpus? ‘A collection of texts held in machine-readable form and capable of being analysed automatically or semi-automatically’ (Baker 1995)… ….and has certain characteristics: – Representativeness – Balance – Size

12 “a corpus is thought to be representative of the language variety it is supposed to represent, if the findings based on its contents can be generalised to the said language variety” (Leech 1991). A corpus may focus on a particular genre/language/author/translator, etc. Decisions about criteria for selection of texts

13 TEC Design Material: English translations (whole texts) Genres: Fiction, (auto)biography, in-flight magazines, news articles Time of publication: Late 80s onwards Place of publication: UK and USA

14 “a balanced corpus covers a wide range of texts which are supposed to be representative of the language variety under question” (McEnery et al. 2006). Also, ‘internal’ balance, e.g. – Gender balance – Source language balance – Genre balance
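
One rough way to inspect ‘internal’ balance is to tally the header metadata of the texts along each dimension. The sketch below assumes each header has been read into a dictionary; the field names and the four sample records are invented for illustration and are not taken from TEC.

```python
from collections import Counter

# Hypothetical header records; real header files hold more metadata than this.
headers = [
    {"genre": "fiction", "source_language": "French", "translator_gender": "female"},
    {"genre": "fiction", "source_language": "Spanish", "translator_gender": "male"},
    {"genre": "fiction", "source_language": "French", "translator_gender": "female"},
    {"genre": "biography", "source_language": "German", "translator_gender": "male"},
]

def distribution(records, field):
    """Return the share of records taking each value of `field`."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

for field in ("genre", "source_language", "translator_gender"):
    print(field, distribution(headers, field))
```

A skewed share on one dimension (here, three of the four sample texts are fiction) is a prompt to look for more material, or at least something to report alongside the results.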

15

16 A corpus needs to be adequate for the purposes for which it is intended. A bigger corpus is not necessarily more useful than a smaller one. Factors that affect corpus size: – Purpose of the corpus – Availability of data – Copyright

17 Research questions (purpose of the corpus) – Specialised corpora and corpora intended for morphosyntactic studies tend to be smaller than general corpora and corpora intended for lexical studies. Static corpora are also smaller than dynamic ones. Availability of data – The availability of suitable data (especially in machine-readable form), as well as the ease with which they can be identified, may affect the size of a corpus.

18 Copyright – Copyright clearance can impede corpus development as well as the accessibility and availability of a corpus to a wide audience. – Copyright law varies internationally. – Fair dealing: no permission needed for short extracts not exceeding 400 words for prose (or a total of 800 words in a series of extracts, none exceeding 300 words). – Out of copyright material: author’s / translator’s lifetime + 70 years (UK). – If you’re in doubt, seek permission! (McEnery et al. 2006)
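
The fair-dealing thresholds quoted above can be turned into a quick pre-check before extracts are added to a corpus. This is only a sketch of the figures as stated on the slide (400 words for a single prose extract; 800 words in total for a series, none exceeding 300); the function names and the whitespace-based word count are assumptions, and it is no substitute for seeking permission when in doubt.

```python
def word_count(extract):
    """Count whitespace-separated words (a rough approximation)."""
    return len(extract.split())

def within_quoted_limits(extracts):
    """Check a list of prose extracts against the thresholds quoted on the slide."""
    counts = [word_count(e) for e in extracts]
    if not counts:
        return True
    if len(counts) == 1:
        # Single extract: no more than 400 words.
        return counts[0] <= 400
    # Series of extracts: at most 800 words in total, none over 300.
    return sum(counts) <= 800 and max(counts) <= 300

print(within_quoted_limits(["word " * 350]))
print(within_quoted_limits(["word " * 300] * 3))
```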

19 We're delighted to learn of your interesting project, and pleased to grant you general permission to use all book reviews and blogs on our site. We'll be grateful if you can include a link to the site in the pieces you use. ….We don't feel comfortable posting the entirety of both titles to your database, but would be willing to make half of both books available to your research center…We typically charge a fee of $150 per title for use of such a large portion. …University Press is pleased to grant you non-exclusive, English language, world rights to reprint limits of fair use (under 300 words)…

20

21 Possible sources – Publishers’ websites – Search engines, e.g. Farrar, Straus and Giroux, NYTimes – Publishing houses specialising in translation – Databases – National databases, e.g. Three Percent, LTI Korea – Internet, archives, etc. Problems – Search engine not well-designed, e.g. The Telegraph – Need for specific material – In some cases, not indicated whether it is a translation or not – For reviews: not always related to translation

22 Scanning Flat-bed scanner – Document feeder Paper and print quality Scanner settings: Resolution and Colour vs Greyscale OCR (Optical Character Recognition) Process Language support Accuracy Font type Document format Text File Spelling errors Character recognition errors (e.g. Tm instead of I’m) Save as .txt file
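
Recurring character recognition errors such as “Tm” for “I'm” can be corrected in bulk once they have been spotted. The sketch below assumes a hand-built table of known misreadings; the table and the function name are illustrative, and any such list would need to be compiled by proofreading the actual OCR output.

```python
import re

# Hypothetical table of misreadings found while proofreading OCR output.
CORRECTIONS = {
    "Tm": "I'm",
}

def fix_ocr(text, corrections=CORRECTIONS):
    """Replace whole-word, case-sensitive OCR misreadings with their corrections."""
    for wrong, right in corrections.items():
        # \b keeps the match to whole words, so "ATM" or "Tmesis" are left alone.
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text

print(fix_ocr("Tm sure the scan is fine."))
```

Matching whole words only, and case-sensitively, keeps the replacement from touching legitimate words that merely contain the same letters.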

23 Adds value to a corpus, makes it easier to extract information and prepares texts to be used with corpus software. Factors that affect the extent of tagging/annotation (Olohan 2004): Purpose of the corpus Corpus software Accessibility of the corpus Technical expertise of the researcher

24

25

26 POS (Part-of-Speech) Tagging – Marks up a word in a corpus as corresponding to a particular part of speech, based on both its definition and its context. E.g. John_NP0 loves_VVZ Mary_NP0 ._. Lemmatisation – Reduces the inflectional variants of words to their respective lemmas, i.e. as they appear in a dictionary. E.g. is, are, am -> BE Parsing – Marks the syntactic structure of each sentence. E.g. (S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))
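
Once a text carries annotation in the word_TAG format, it can be read back into (word, tag) pairs with very little code. The sketch below assumes tokens are separated by spaces and that the tag follows the last underscore; the tiny lemma table is a stand-in for a real lemmatiser and only covers the forms used as examples here.

```python
def parse_tagged(line):
    """Split 'word_TAG' items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last underscore, so '._.' becomes ('.', '.').
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs

# Hypothetical lookup table; a real lemmatiser would use a full lexicon or rules.
LEMMAS = {"is": "BE", "are": "BE", "am": "BE"}

def lemmatise(word):
    """Return the lemma for a known form, otherwise the word unchanged."""
    return LEMMAS.get(word.lower(), word)

pairs = parse_tagged("John_NP0 loves_VVZ Mary_NP0 ._.")
print(pairs)
print([lemmatise(w) for w in ("is", "are", "am")])
```

With the pairs in hand, a search can be restricted by tag, for example collecting only the words tagged as proper nouns.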

27

28 Develop and use your own software Use existing corpus tools – TEC Tools: For more information about how to use TEC Tools with local corpora, you can download the tutorial from the TEC webpage. – WordSmith Tools: A collection of corpus linguistics tools – ParaConc: A bilingual or multilingual concordancer – ….

29 “When a corpus is created, a compromise has often to be reached between ideal design criteria and practical constraints. However, while opportunistic choices may be justified, the limitations and distortions they introduce in the makeup of a corpus should not be forgotten when evaluating the results” (Zanettin 2011).

30 TEC website http://www.llc.manchester.ac.uk/ctis/research/english-corpus/ TEC Email Address tec@manchester.ac.uk

31 Baker, Mona (1995) ‘Corpora in Translation Studies: An overview and some suggestions for future research’, Target 7(2): 223-243. Leech, Geoffrey (1991) ‘The state of the Art in Corpus Linguistics’, in Karin Aijmer and Bengt Altenberg (eds) English Corpus Linguistics: Linguistic studies in honour of Jan Svartvik, London: Longman, pp. 8-29. McEnery, Tony, Richard Xiao and Yukio Tono (2006) Corpus-based Language Studies, London and New York: Routledge. Olohan, Maeve (2004) Introducing Corpora in Translation Studies, London and New York: Routledge. Zanettin, Federico (2011) ‘Translation and Corpus Design’, SYNAPS 26:14-23.

