What is a corpus?* A corpus is defined in terms of form and purpose. The word corpus is used to describe a collection of examples of language collected for linguistic study. It can also describe collections of texts stored and accessed electronically (Hunston 2002). Corpus planning and design serve a linguistic purpose. It is on this basis that texts are selected and stored, so that they can be studied quantitatively and qualitatively. *Ref. text: Hunston, S. Corpora in Applied Linguistics, 2002.
What are corpora used for? Corpora are often used for language teaching and learning. They give information about how a language works. They also help calculate the relative frequency of different features. Exploring corpora can help students to observe nuances of usage and to make comparisons between languages. Corpora are also used to investigate cultural attitudes expressed through language. NB a corpus will not give information about whether something is possible or not, only whether it is frequent or not!
Using corpora in translation Corpora are also used in translation. Comparable corpora make it possible to compare the use of apparent equivalents. Parallel corpora show how words and phrases have been translated in the past. General corpora can be used to establish norms of frequency and usage.
What can a corpus do? Corpus access software is used to rearrange the information which has been stored so that observations of various kinds can be made. It is not the corpus which gives new information about language. It is the software which gives new perspectives on what is already familiar. Software packages process the data to show frequency, phraseology and collocation.
Frequency Corpus processing allows words to be compared by means of frequency lists. Grammar words are more frequent than lexical words, which is why they appear at the top of the list. Frequency lists can be useful for identifying differences between corpora. But comparisons can be made only if the corpora are comparable, i.e. if their length is approximately the same.
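As an illustration of the slide above, here is a minimal sketch of a frequency list in Python. The tokenizer (a simple regular expression), the function names and the sample sentence are all assumptions made for the example, not part of the presentation. Alongside the raw count it reports the relative frequency (count divided by the total number of tokens).

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text):
    """Return (word, raw count, relative frequency) tuples, most frequent first."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, n, n / total) for word, n in counts.most_common()]

# Hypothetical sample text; a real corpus would be far larger.
sample = "The cat sat on the mat and the dog sat on the rug."

for word, n, rel in frequency_list(sample)[:3]:
    print(f"{word:<5} {n:>3} {rel:.3f}")
```

Even in this tiny sample the grammar word "the" comes first in the list.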
Concordance The most common way to access a corpus is through a concordancing program. Concordance lines bring together instances of use of words or phrases, so that regularities in use can be observed. Concordances also help to understand how nouns or adjectives are used.
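A concordancing program can be approximated in a few lines of Python. The sketch below is only an illustration: the function name, the context width and the sample text are assumptions, and real concordancers offer sorting and much richer search options.

```python
import re

def concordance(text, keyword, width=20):
    """Return keyword-in-context lines: left context, keyword, right context."""
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

# Hypothetical sample text.
sample = ("She shed tears at the news. The snake shed its skin. "
          "New evidence shed light on the case.")

for line in concordance(sample, "shed", width=15):
    print(line)
```

Because the keyword is aligned in a central column, the words to its left and right can be scanned quickly for regularities.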
Collocation Collocation is the tendency of words to co-occur. The collocates of a given word are those words which often occur in conjunction with it. Collocation can indicate pairs of lexical items, or the association between a lexical word and its frequent grammatical environment. In the latter case, the term used is colligation.
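One simple way to find candidate collocates is to count the words that appear within a few positions of a node word. The sketch below assumes a span of two words on either side, a basic regular-expression tokenizer and an invented sample text. It counts raw co-occurrences only, whereas corpus software usually also applies a statistical measure.

```python
from collections import Counter
import re

def collocates(text, node, span=2):
    """Count the words occurring within `span` tokens of each instance of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Tokens to the left and to the right of the node, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Hypothetical sample text.
sample = ("She shed tears at the news. He shed tears too. "
          "New evidence shed light on the case.")

print(collocates(sample, "shed").most_common(3))
```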
Types of corpora A corpus is designed for a particular purpose. Consequently, the type of corpus depends on its purpose: specialized corpus; general corpus; comparable corpora; parallel corpora; learner corpus; historical or diachronic corpus; monitor corpus.
Specialized corpus: a corpus of texts of a particular type (editorials, academic articles, lectures, essays, etc.). Specialized corpora reflect the type of language a researcher wants to explore. You may also restrict the corpus to a time frame, a social setting or a given topic. General corpus: a corpus of texts of many types, of written or spoken language, or of both. A general corpus is usually much larger than a specialized corpus. Since it can be used to produce reference materials, it is sometimes called a reference corpus.
Comparable corpora: two or more corpora in different languages, or in different varieties of a language. They are designed to contain the same proportions of text types (e.g. newspaper texts, essays, novels, conversations, etc.). They can be used by translators and learners to identify differences and equivalences in each language. Parallel corpora: two or more corpora in different languages, containing translated texts, or texts produced simultaneously in two or more languages (e.g. EU texts). They can be used by translators and learners to find potential equivalents in each language, and to investigate differences between languages.
Learner corpus: a collection of texts produced by learners of a language. It is used to identify differences among learners, the frequency and types of mistakes, etc. Historical or diachronic corpus: a corpus of texts from different periods of time. It helps to trace the development of a language over time. Monitor corpus: a corpus used to track current changes in a language. It rapidly increases in size, since it is added to annually, monthly, daily, etc. The proportion of text types has to remain constant, so that each year is comparable with every other.
The use of corpora is not limited to identifying, quantifying and analyzing keywords. Concordance lines offer many instances of use of words or phrases, so that the user can observe regularities in use by means of several examples of the same word or phrase in its natural context. Calculating collocation means finding the statistical tendency of words to co-occur, and collocations can also highlight metaphorical uses. A good example is the word shed, which collocates with light, tears, blood, pounds, confidence, hair, skin and labour. In these contexts shed is a verb, and its Italian equivalent varies from one collocate to another.
Shed light: fare/gettare luce
Shed tears: spargere lacrime
Shed blood: spargere/versare sangue
Shed pounds: perdere chili/peso
Shed skin: perdere/mutare la pelle (fare la muda)
Shed confidence: ispirare fiducia
Shed hair: perdere il pelo
Shed labour: disfarsi della manodopera (licenziare)
Tokens: the term is used to indicate the words which are counted in a corpus or in a given text. But many of these words occur more than once. So, if we count each repeated item once only, the total number changes: the result is the number of types. A given text, for instance, may have 250 tokens but 194 types (articles, repeated nouns, etc. are counted once only). Hapax legomena, or hapaxes, are those words which occur only once. We may also have words which occur in two (or more) different forms: friend and friends, for instance. These are two word-forms which belong to the same lemma. The same holds for go, goes, going, went, gone: five word-forms which belong to the same lemma, go. This implies that when a lemma is used as a keyword, all its different word-forms have to be searched for.
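The distinction between tokens, types, hapaxes and lemmas can be made concrete with a short Python sketch. The sample sentence and the small lemma table are invented for the example. A real study would rely on a full lemmatizer or lemma list.

```python
from collections import Counter
import re

# Hypothetical mini lemma table mapping word-forms to their lemma.
LEMMAS = {
    "goes": "go", "going": "go", "went": "go", "gone": "go",
    "friends": "friend",
}

def profile(text):
    """Count tokens, types, hapaxes and lemma frequencies in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),    # every running word
        "types": len(counts),     # each distinct word-form counted once
        "hapaxes": sorted(w for w, n in counts.items() if n == 1),
        # Word-forms missing from the table are treated as their own lemma.
        "lemmas": Counter(LEMMAS.get(t, t) for t in tokens),
    }

# Hypothetical sample sentence.
sample = "My friend goes where my friends went, and I go where they have gone."
result = profile(sample)

print(result["tokens"], result["types"], len(result["hapaxes"]))
print(result["lemmas"]["go"], result["lemmas"]["friend"])
```

Searching on the lemma go finds four word-forms in the sample (goes, went, go, gone), whereas searching on the word-form go alone finds one.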
Usually word-forms are considered to belong to the same lemma when they belong to the same word-class (verb, noun, adjective, etc.). Tagging usually refers to the addition of a code to each word in a corpus, to indicate the part of speech. Automatic tagging is possible, but not fully accurate. Tagging is useful when you want to look at different word categories. For instance, the noun work can be considered separately from the verb.
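The sketch below shows how a tagged corpus lets the noun work be separated from the verb. It assumes the text has already been tagged in a word_TAG format. Both that format and the tag labels (NOUN, VERB, etc.) are assumptions made for the example, since tagsets differ from corpus to corpus.

```python
from collections import Counter

def parse_tagged(tagged_text):
    """Split whitespace-separated 'word_TAG' items into (word, tag) pairs."""
    return [tuple(item.rsplit("_", 1)) for item in tagged_text.split()]

def count_by_tag(pairs, word):
    """Count how often `word` occurs with each part-of-speech tag."""
    return Counter(tag for w, tag in pairs if w.lower() == word)

# Hypothetical hand-tagged sample sentence.
tagged = ("They_PRON work_VERB hard_ADV but_CONJ "
          "the_DET work_NOUN is_VERB dull_ADJ")

pairs = parse_tagged(tagged)
print(count_by_tag(pairs, "work"))
```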
Corpus parsing is the analysis of a text into its constituents, for instance clauses and groups. This allows you to analyse the different structures in a corpus. Just like tagging, parsing can be done automatically, though the output is not very accurate. Manual editing is often necessary.