McEnery, T., Xiao, R. and Y. Tono. 2006. Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp. 13-21)
A2.1 Introduction Representativeness is an essential feature of a corpus; it distinguishes a corpus from an archive (a random collection of texts). Sampling is unavoidable: it is impossible to analyse every sentence in a language. Representativeness is achieved through balance and sampling.
A2.2 Representativeness in CL Leech (1991): a corpus is representative if findings based on its contents can be generalized to the language variety it represents. Biber (1993:243), on how this is achieved: representativeness refers to the extent to which a sample includes the full range of variability in a population (= language variety).
A2.2 Representativeness in CL The representativeness of most corpora is to a great extent determined by two factors: the range of genres included in a corpus, i.e. balance (Unit A2.4) and how the text chunks for each genre are selected, i.e. sampling (Unit A2.5)
A2.2 Representativeness in CL A corpus is typically designed to study distributions, i.e. the full range of environments in which a lexical or grammatical form can occur.
A2.2 Representativeness in CL The criteria used to select texts for a corpus are in principle external. The external vs. internal criteria correspond to Biber’s (1993:243) situational (= irrespective of the distribution of linguistic features) vs. linguistic (= taking into account the distribution of linguistic features) perspectives.
A2.2 Representativeness in CL Biber: situationally (externally) defined text categories are genres or registers (more on that in Unit A10.4). Biber: linguistically (internally) defined text categories are text types (Unit B1). NB: these terms are usually used interchangeably in the literature.
A2.2 Representativeness in CL In addition to text selection criteria, Hunston (2002) suggests another aspect of representativeness: change over time. Thus a corpus has to be regularly updated to remain representative.
A2.3 The representativeness of general and specialized corpora There are two broad types of corpora in terms of the range of text categories represented in the corpus: General and Specialized corpora
A2.3 The representativeness of general corpora General corpora are compiled to answer questions about the vocabulary, grammar or discourse structure of a language, i.e. they provide an overall description of it (e.g. the BNC, which represents modern British English as a whole).
A2.3 The representativeness of general corpora Representativeness of a general corpus is measured by the range of genres it includes. A general corpus is designed to be balanced by containing texts from different genres and domains of use, including spoken and written, private and public.
A2.3 The representativeness of specialized corpora Specialized corpora tend to be domain-specific (e.g. medicine or law) or genre-specific (e.g. newspaper text or academic prose). Specialized corpora are designed with particular research projects in mind, e.g. training corpora, dialect, regional, non-standard and learners’ corpora.
A2.3 The representativeness of specialized corpora Representativeness of a specialized corpus, at the lexical level at least, is measured by the degree of ‘closure’ or ‘saturation’ of the corpus.
A2.3 The representativeness of specialized corpora Closure/saturation for a particular linguistic feature (e.g. size of lexicon) of a variety of language (e.g. computer manuals) means that the feature appears to be finite or is subject to very limited variation.
A2.3 The representativeness of specialized corpora To measure the saturation of a corpus, the corpus is first divided into segments of equal size based on its tokens (recall what type/token is).
A2.3 The representativeness of specialized corpora The corpus is said to be saturated at the lexical level if each addition of a new segment yields approximately the same number of new lexical items as the previous segment, i.e. when the curve of lexical growth flattens out.
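The saturation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' own tool: the function names, the `window` and `tolerance` parameters, and the toy token list are all assumptions made for the example, and tokens are taken as given (no real tokenizer is used). The check reads the slide's criterion literally: the last few segments contribute about the same number of new types.

```python
def new_types_per_segment(tokens, segment_size):
    """Count the previously unseen types contributed by each full segment."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    seen = set()
    counts = []
    # Use full segments only, so that all segments have the same token count.
    for start in range(0, len(tokens) - segment_size + 1, segment_size):
        segment = tokens[start:start + segment_size]
        new_types = set(segment) - seen
        counts.append(len(new_types))
        seen |= new_types
    return counts


def looks_saturated(counts, window=3, tolerance=0.1):
    """Hypothetical check: the last `window` segments add about the same
    number of new types (spread within `tolerance` of the largest count)."""
    if len(counts) < window:
        return False
    recent = counts[-window:]
    spread = max(recent) - min(recent)
    return spread <= tolerance * max(1, max(recent))


# Toy "corpus" of 18 tokens, split into three segments of 6 tokens each.
tokens = ("open the file close the file "
          "open the menu close the menu "
          "save the file save the menu").split()
counts = new_types_per_segment(tokens, segment_size=6)  # [4, 1, 1]
```

On this toy data the first segment contributes four new types and each later segment only one, so `looks_saturated(counts, window=2)` holds while `window=3` does not. With a real corpus the segment size and tolerance would need to be chosen to suit the material.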
A2.4 Balance Representativeness of a corpus, especially a general one, depends on the range of text categories included in the corpus. The acceptable balance is determined by the corpus's intended uses. The British National Corpus, for example, was designed to be representative of British English as a whole and not just of one particular genre, subject field or register.
The BNC: a scanned page from McEnery et al., pp. 17-18
A2.5 SAMPLING Corpus representativeness and balance are closely associated with sampling. Because we cannot exhaustively describe natural language, we need to sample it in order to achieve a balance and representativeness which match our research questions.
A2.5 SAMPLING A sample is assumed to be representative if what we find for the sample also holds for the general population (= the entire set of items from which samples can be drawn)
A2.5 SAMPLING TECHNIQUES A basic sampling method is simple random sampling: all sampling units are numbered and units are then chosen by drawing random numbers. Simple random sampling may generate a sample that does not include relatively rare items, even though they can be of interest to researchers.
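A short Python sketch can make the point about rare items concrete. It numbers the sampling units, draws random numbers, and also computes how likely a sample is to contain no rare unit at all. The population below (98 news texts and 2 sermons) is invented for illustration, and the function names are assumptions of this sketch.

```python
import math
import random


def simple_random_sample(units, k, seed=None):
    """Draw k distinct units by numbering them and choosing random numbers."""
    rng = random.Random(seed)  # the seed only makes the illustration repeatable
    chosen_numbers = rng.sample(range(len(units)), k)
    return [units[i] for i in sorted(chosen_numbers)]


def prob_no_rare_unit(population_size, rare_count, k):
    """Probability that a simple random sample of size k has no rare unit."""
    return math.comb(population_size - rare_count, k) / math.comb(population_size, k)


# Hypothetical population: 98 news texts and 2 sermons (the rare category).
population = [("news", i) for i in range(98)] + [("sermon", i) for i in range(2)]
sample = simple_random_sample(population, k=10, seed=1)
genres_in_sample = {genre for genre, _ in sample}
p_miss = prob_no_rare_unit(len(population), rare_count=2, k=10)
```

For this invented population, `p_miss` is about 0.81: roughly four out of five samples of ten texts would contain no sermon at all.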
A2.5 SAMPLE SIZE Problematic questions: With written language, should we sample full texts or text chunks? If only text chunks are to be sampled, then which: initial, middle or end chunks? Full-text samples are certainly useful, yet they present copyright issues.
A2.5 SAMPLING On the whole, it is advisable to sample text segments. According to Biber (1993:252), frequent linguistic features are quite stable in their distribution and hence short chunks (e.g. 2,000 running words) are usually sufficient; rare features, however, are more varied in their distribution and thus require larger samples.
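Taking a fixed-length chunk from the initial, middle or end part of a text might look as follows. This is only a sketch: the function name, the `position` labels, and the decision to return a short text whole are assumptions of this example, and the text is assumed to be already split into running words.

```python
def sample_chunk(tokens, size=2000, position="middle"):
    """Return a chunk of `size` running words from the chosen part of a text."""
    if size <= 0:
        raise ValueError("size must be positive")
    if len(tokens) <= size:
        # Assumption: texts shorter than the chunk size are kept whole.
        return list(tokens)
    if position == "initial":
        start = 0
    elif position == "middle":
        start = (len(tokens) - size) // 2
    elif position == "end":
        start = len(tokens) - size
    else:
        raise ValueError("position must be 'initial', 'middle' or 'end'")
    return list(tokens[start:start + size])


# Toy text of 10 "words", sampled in chunks of 4.
words = list(range(10))
initial = sample_chunk(words, size=4, position="initial")  # [0, 1, 2, 3]
middle = sample_chunk(words, size=4, position="middle")    # [3, 4, 5, 6]
end = sample_chunk(words, size=4, position="end")          # [6, 7, 8, 9]
```

For a study of rare features, `size` would simply be set larger, in line with Biber's observation that such features need bigger samples.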
A2.5 SAMPLING Another sampling issue is the proportion and number of samples for each text category. Such proportions can be difficult to determine objectively, so the representativeness of a corpus should be viewed as a statement of belief rather than fact.
TERMS AND CONCEPTS Distribution: the full range of environments in which a lexical or grammatical form can occur. Population: the entire set of items from which samples can be drawn. Genre: a type of discourse that occurs in a particular setting, has distinctive patterns of organization and has a particular communicative function.
TERMS AND CONCEPTS Register: the specific lexical and grammatical choices made by speakers depending on the situational context, the participants of a conversation and the function of the language in the discourse (cf. Halliday 1989:44). Text type: a vague category that can be used loosely to mean almost anything (e.g. descriptive, narrative, expository, argumentative).