Corpus design and types of corpora

Name: Corpus design and types of corpora
Uploaded: 2017-08-17T09:55:41+00:00
Duration: PTM30S17
Channel: Zachary Jensen
Description: Corpus design and types of corpora

Corpus design and types of corpora
Corpus Linguistics Richard Xiao

Outline of the session
Corpus design issues
  Corpus representativeness
  Corpus balance
  Sampling
  Corpus size
Types of corpora
Introducing some well-known English corpora of different types

Representativeness
A corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety
A corpus is different from a random collection of texts or an archive
Representativeness is a defining feature of a corpus
As language is infinite but a corpus has to be finite in size, we sample and proportionally include a wide range of text types to ensure maximum balance and representativeness

Some definitions …
“generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type” (Leech 1992: 116)
“…selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language” (Sinclair 1996)
“A well-organized collection of data” (McEnery 2003)
“gathered according to explicit design criteria” (Tognini-Bonelli 2001: 2)
“built according to explicit design criteria for a specific purpose” (Atkins et al 1992)
texts selected and put together “in a principled way” (Johansson 1998: 3)

What is representativeness?
“A corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety” (Leech 1991)
Representativeness refers to the extent to which a sample includes the full range of variability in a population (Biber 1993)

What is representativeness?
Representativeness is a fluid concept closely related to your research questions
If you want a corpus which is representative of general English, a corpus representative of newspapers will not do
If you want a corpus representative of newspapers, a corpus representative of The Times will not do

Two types of representativeness
The representativeness of general corpora and of (domain- or genre-specific) specialized corpora is achieved and measured in different ways
General corpora
  Balance: the range of genres included in a corpus and their proportions
  Sampling: how the text chunks for each genre are selected
Specialized corpora
  Degree of closure/saturation: closure/saturation for a particular linguistic feature (e.g. size of lexicon) of a variety of language (e.g. computer manuals) means that the feature appears to be finite or is subject to very limited variation beyond a certain point, i.e. the curve of lexical growth flattens out
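The idea of a flattening lexical growth curve can be illustrated with a minimal sketch. It assumes a corpus that is already tokenized on whitespace; the function names and the toy "computer manual" text are hypothetical and not taken from the slides. The sketch counts how many distinct word types have been seen after every `step` tokens; shrinking gains per interval suggest the lexicon is approaching closure.

```python
def vocabulary_growth(tokens, step):
    """Return (tokens_seen, types_seen) pairs, recorded every `step` tokens
    and once more at the end of the token list."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


def new_types_per_interval(curve):
    """Number of previously unseen types added in each successive interval."""
    gains = []
    previous = 0
    for _, types_seen in curve:
        gains.append(types_seen - previous)
        previous = types_seen
    return gains


# Hypothetical toy text standing in for a specialized corpus.
toy_tokens = (
    "press the power button . press the reset button . "
    "press the power button again ."
).split()

curve = vocabulary_growth(toy_tokens, step=5)
gains = new_types_per_interval(curve)
print(curve)
print(gains)
```

On a real specialized corpus the same counts would be taken over much larger intervals; how small the gains must become before closure is claimed remains a judgement call.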

Why should we care about representativeness?
Reader of corpus-based studies (assessment)
  To interpret the results of corpus research with caution, considering whether the corpus data and the method used in the study were appropriate
Corpus user (assessment)
  Important to “know your corpus”
  To decide whether a given corpus is appropriate for their specific research question
  To make appropriate claims on the basis of such a corpus
Corpus creator (assessment?)
  To make their corpus as representative as possible of the language (variety) it is claimed to represent
  To document design criteria explicitly and make the documentation available to corpus users

Criteria for text selection
The criteria used to select texts for a corpus are principally external
The external vs. internal distinction corresponds to Biber’s (1993: 243) situational vs. linguistic perspectives
  External criteria are defined situationally, irrespective of the distribution of linguistic features
  Internal criteria are defined linguistically, taking into account the distribution of such features
It is circular to use internal criteria like the distribution of words or grammatical features as the primary parameters for the selection of corpus data
  If the distribution of linguistic features is pre-determined when the corpus is designed, there is no point in analyzing such a corpus to discover naturally occurring linguistic feature distributions
  Such a corpus is problematic as it is skewed by design

Time?
If a corpus is not regularly updated, it rapidly becomes unrepresentative (Hunston 2002)
The relevance of permanence in corpus design depends on how we view a corpus: as a static or a dynamic language model
  Static model: sample corpora (nearly all existing corpora, e.g. BNC, LOB/FLOB)
  Dynamic model: monitor corpora (e.g. Bank of English)

Tips
“Criteria for determining the structure of a corpus should be small in number, clearly separate from each other, and efficient as a group in delineating a corpus that is representative of the language or variety under examination.” (Sinclair 2005)

Corpus balance
A balanced corpus covers a wide range of text categories which are supposed to be representative of the language (variety) under consideration
The proportions of the different kinds of text it contains should correspond with informed and intuitive judgements
There is no scientific measure for balance, only a best estimate
The acceptable balance is determined by the intended use, i.e. your research questions

The BNC model
Generally accepted as being a balanced corpus
Has been followed in the construction of a number of corpora
4,124 texts (including transcripts of recordings)
ca. 100 million words: 90% written + 10% spoken
Three criteria for Written
  Domain: the content type (i.e. subject field)
  Time: the period of text production
  Medium: the type of text publication (books, periodicals etc.)
Two criteria for Spoken
  Demographic: informal conversations by speakers selected by age group, sex, social class and geographical region
  Context-governed: formal encounters such as meetings, lectures and radio broadcasts recorded in 4 broad context categories

Written BNC

Spoken BNC

BNC vs. balance
The design criteria of the BNC illustrate the notion of corpus balance/representativeness very well
“In selecting texts for inclusion in the corpus, account was taken of both production, by sampling a wide variety of distinct types of material, and reception, by selecting instances of those types which have a wide distribution. Thus, having chosen to sample such things as popular novels, or technical writing, best-seller lists and library circulation statistics were consulted to select particular examples of them.” (Aston and Burnard 1998: 28)

Pragmatics in corpus design
“Most general corpora of today are badly balanced because they do not have nearly enough spoken language in them; estimates of the optimal proportion of spoken language range from 50% - the neutral option - to 90%, following a guess that most people experience many times as much speech as writing” (Sinclair 2005)
The written BNC is nine times as large as the spoken BNC
Is speech less frequent or less important than writing?

Pragmatics in corpus design
Absolutely not!
  …but writing typically has a larger audience than speech
  …also collection of spoken data costs 10 times as much as for written data
  …it takes 10 hours to transcribe one hour of recording
Pragmatic considerations also mean that balance is a more important issue for a static sample corpus than for a dynamic monitor corpus
  As a monitor corpus is frequently updated, it is usually “impossible to maintain a corpus that also includes text of many different types, as some of them are just too expensive or time consuming to collect on a regular basis.” (Hunston 2002: 30-31)

Corpus balance: Some tips
“The corpus builder should retain, as target notions, representativeness and balance. While these are not precisely definable and attainable goals, they must be used to guide the design of a corpus and the selection of its components.” (Sinclair 2005)
“It would be short-sighted indeed to wait until one can scientifically balance a corpus before starting to use one, and hasty to dismiss the results of corpus analysis as ‘unreliable’ or ‘irrelevant’ because the corpus used cannot be proved to be ‘balanced’.” (Atkins et al 1992: 6)

Sampling in corpus creation
Language is infinite, but a corpus is finite in size, so sampling is inescapable in corpus building
  “Some of the first considerations in constructing a corpus concern the overall design: for example, the kinds of texts included, the number of texts, the selection of particular texts, the selection of text samples from within texts, and the length of text samples. Each of these involves a sampling decision, either conscious or not.” (Biber 1993)
Population (language/variety) vs. sample (corpus)
  The aim of sampling “is to secure a sample which, subject to limitations of size, will reproduce the characteristics of the population, especially those of immediate interest, as closely as possible” (Yates 1965: 9)
  A sample is a scaled-down version of a larger population
  A sample is representative if what we find for the sample also holds for the general population
Corpus representativeness and balance rely heavily on sampling
  A corpus is a sample of a given population (language or language variety)

Sampling in corpus creation
Sampling unit
  For written text, it could be a book (chapter), periodical or newspaper (article)
Sampling frame
  A list of sampling units
Population
  Languages, language, or language variety under consideration
  The assembly of all sampling units, which can be defined in terms of
    Language production (demographic: speakers and writers)
    Language reception (demographic: audience and readers)
    Language as a product (registers and genres)

Examples of Brown and LOB
Brown
  Population: Written English text published in the United States in 1961
  Sampling frame: A list of the collection of books and periodicals in the Brown University Library and the Providence Athenaeum
  Sampling unit: each book/periodical within the sampling frame
LOB
  Population: Written English text published in the UK around 1961
  Sampling frame: The British National Bibliography Cumulated Subject Index 1960–1964 (for books) and Willing’s Press Guide 1961 (for periodicals)
  Sampling unit: each book/periodical within the sampling frame

Sampling techniques
Simple random sampling
  All sampling units within the sampling frame are numbered and the sample is chosen by use of a table of random numbers
  The chance of an item being selected correlates positively with its frequency in the population, so rare features may not be included
Stratified random sampling
  The population is divided into relatively homogeneous groups (i.e. strata), and each stratum is then sampled at random
  Never less representative than simple random sampling

Stratified random sampling
The whole population for the Brown/LOB corpus was divided into 15 text categories and samples were then drawn from each category at random
In demographic sampling for collecting spoken data, individuals (sampling units) in the population are first divided into different groups on the basis of demographic variables such as speaker/writer age, sex and social class, and then samples are taken at random from each group
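The difference between simple and stratified random sampling can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for Brown or LOB: the sampling frame, the category labels and the per-category quotas are all hypothetical, and a seeded pseudo-random generator stands in for the table of random numbers.

```python
import random


def simple_random_sample(frame, n, seed=0):
    """Draw n sampling units from the whole frame, ignoring categories."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def stratified_random_sample(frame, stratum_of, per_stratum, seed=0):
    """Group the frame into strata, then sample at random within each one.

    stratum_of  -- function mapping a sampling unit to its stratum label
    per_stratum -- dict mapping stratum label to the number of units wanted
    """
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = {}
    for label, units in strata.items():
        k = min(per_stratum.get(label, 0), len(units))
        sample[label] = rng.sample(units, k)
    return sample


# Hypothetical sampling frame: (title, text category) pairs.
frame = (
    [(f"press_{i}", "press") for i in range(8)]
    + [(f"fiction_{i}", "fiction") for i in range(3)]
    + [("learned_0", "learned")]
)

simple = simple_random_sample(frame, 4)
stratified = stratified_random_sample(
    frame,
    stratum_of=lambda unit: unit[1],
    per_stratum={"press": 2, "fiction": 1, "learned": 1},
)
print(simple)
print(stratified)
```

With simple random sampling the single "learned" text may or may not be drawn; the stratified version always includes it, which is the sense in which rare categories are protected.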

Size of samples
Full texts or text segments?
  “Samples of language for a corpus should wherever possible consist of entire documents or transcriptions of complete speech events” (Sinclair 2005)
    Good for studying textual organization
  A full-text corpus may be inappropriate or problematic
    Peculiarity of an individual style or topic may occasionally show through
    There are copyright issues in including full texts
  Frequent linguistic features are quite stable in their distributions and hence short text chunks (e.g. 2,000 running words) are usually sufficient
Text initial, middle or end chunks?
  Text initial, middle, and end samples must be taken in a balanced way
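One way to take initial, middle and end chunks in a balanced way is to rotate the chunk position across texts. The sketch below assumes each text is already a list of tokens; the function names and the simple rotation scheme are hypothetical illustrations rather than the method of any particular corpus.

```python
def extract_chunk(tokens, size, position):
    """Return a contiguous chunk of `size` tokens from the start, middle or
    end of a text; texts shorter than `size` are returned whole."""
    if len(tokens) <= size:
        return list(tokens)
    if position == "initial":
        start = 0
    elif position == "middle":
        start = (len(tokens) - size) // 2
    elif position == "end":
        start = len(tokens) - size
    else:
        raise ValueError(f"unknown position: {position}")
    return list(tokens[start:start + size])


def balanced_positions(n_texts):
    """Cycle through initial, middle and end so each is used about equally."""
    cycle = ("initial", "middle", "end")
    return [cycle[i % len(cycle)] for i in range(n_texts)]


# Toy "text" of ten tokens, with a chunk size of four instead of 2,000.
toy_text = list(range(10))
chunks = {pos: extract_chunk(toy_text, 4, pos) for pos in ("initial", "middle", "end")}
positions = balanced_positions(5)
print(chunks)
print(positions)
```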

Proportion of samples
In stratified random sampling, how many samples should be taken for each category?
  The numbers of samples across text categories should be proportional to their frequencies and/or weights in the target population in order for the resulting corpus to be considered as representative
  Difficult to determine objectively: it comes down to a well-informed and intuitive guess
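Once weights have been decided, turning them into whole numbers of samples per category is simple arithmetic. The sketch below uses a largest-remainder rounding rule so that the allocations add up exactly to the target; the rule, the category names and the weights are assumptions made for illustration, and the weights themselves still have to come from informed judgement.

```python
def allocate_samples(weights, total):
    """Split `total` samples across categories in proportion to `weights`,
    using largest-remainder rounding so the counts sum to `total`."""
    weight_sum = sum(weights.values())
    exact = {cat: total * w / weight_sum for cat, w in weights.items()}
    allocation = {cat: int(share) for cat, share in exact.items()}
    leftover = total - sum(allocation.values())
    # Hand the remaining samples to the categories with the largest fractions.
    by_remainder = sorted(exact, key=lambda cat: exact[cat] - allocation[cat], reverse=True)
    for cat in by_remainder[:leftover]:
        allocation[cat] += 1
    return allocation


# Hypothetical weights for three text categories.
allocation = allocate_samples({"press": 3, "fiction": 2, "learned": 1}, total=10)
print(allocation)
```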

Proportion of genres in Brown
Constant sample size: ca. 2,000 words

“Relatively speaking…”
Any claim of corpus representativeness and balance must be interpreted in relative terms
  There is no objective way to balance a corpus or to measure its representativeness
  Any claim for representativeness is an act of faith rather than a statement of fact
Corpus balance and representativeness are fluid concepts
  The research question that one has in mind when building/choosing a corpus determines what an acceptable balance is for the corpus one should use and whether it is suitably representative
Corpus balance is also influenced by practical considerations
  How easily can data of different types be collected?

Corpus size
How large should a corpus be?
There is no easy answer to this question
  Krishnamurthy (2001): “Size matters.”
  Leech (1991): “Size is not all-important.”
The size of the corpus needed depends upon the purpose for which it is intended as well as a number of practical considerations
  The kind of query that is anticipated from users
    Are you studying common or rare linguistic features?
  The methodology they use to study the data
    How much work can be done by the machine and how much has to be done by hand?
  For corpus creators, also the source of data
    Are the data in electronic form readily available at a reasonable cost?
    Can copyright permissions be granted easily, if at all?

Corpus size
Corpus size increases with the development of technology
  1960s-70s: Brown and LOB: one million words
  1980s: The Birmingham/Cobuild corpora: 20 M words
  1990s: The British National Corpus: 100 M words
  Early 21st century: The Bank of English: 645 M words

Corpus size
Is a large corpus really what you want?
The size of the corpus needed to explore a research question depends on the frequency and distribution of the linguistic features under consideration, i.e. on your research question
  Corpora for lexical studies are usually much larger than those for grammatical studies
  Specialized corpora serve a very different yet important purpose from large multi-million-word corpora
Corpora that need extensive manual annotation or analysis are necessarily small
Many corpus tools set a ceiling on the number of concordances that can be extracted
The optimum size of a corpus is determined by the research question the corpus is intended to address as well as practical considerations
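The link between feature frequency and corpus size can be made concrete with a back-of-the-envelope calculation. The sketch assumes a feature occurs at a roughly constant rate per million words, which real data only approximate; the rate and the target number of hits are hypothetical.

```python
import math


def expected_hits(rate_per_million, corpus_size):
    """Expected number of occurrences of a feature in a corpus of given size."""
    return rate_per_million * corpus_size / 1_000_000


def corpus_size_needed(rate_per_million, min_hits):
    """Smallest corpus size (in words) expected to yield at least min_hits."""
    return math.ceil(min_hits * 1_000_000 / rate_per_million)


# A hypothetical rare item occurring about 5 times per million words.
hits_in_1m = expected_hits(5, 1_000_000)
hits_in_100m = expected_hits(5, 100_000_000)
size_for_100_hits = corpus_size_needed(5, 100)
print(hits_in_1m, hits_in_100m, size_for_100_hits)
```

A one-million-word corpus would be expected to yield only a handful of examples of such an item, whereas a frequent grammatical feature would be amply attested at that size.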

Exploring existing English corpora
To learn how corpora can be classified
To learn about design decisions in creating different kinds of corpora
To become familiar with a range of well-known and influential corpora
Corpus survey: “Well-known and influential corpora”

Types of corpora, different uses
General/reference vs. specialized corpora
Written vs. spoken corpora
Synchronic vs. diachronic corpora
Monolingual vs. multilingual corpora
Comparable vs. parallel corpora
Native vs. learner corpora
Developmental vs. learner/interlanguage corpora
Raw vs. annotated corpora
Static/sample vs. dynamic/monitor corpora
…

Monitor corpora
Constantly updated and growing in size
Advantages
  Much larger corpus size
  Often contain full text
  Always up-to-date
  Often only admit new material which has new features not already present in the corpus
  Used to track changes across different periods of time
  Monitor corpora could be a series of static corpora
Disadvantages
  No attempt to balance the corpus
  Text availability can become an issue (e.g. copyrights)
  Confusing to indicate specific corpus version (token number)
  Cannot easily compare results obtained from corpora of different sizes
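Raw counts from corpora of different sizes are not directly comparable. One common remedy, not named in the slides and offered here only as an assumption, is to normalize counts to a common base such as occurrences per million words. The counts and corpus sizes below are hypothetical.

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw frequency to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size


# Hypothetical counts of the same word in a small and a large corpus.
small_corpus_rate = per_million(200, 1_000_000)
large_corpus_rate = per_million(9_000, 100_000_000)
print(small_corpus_rate, large_corpus_rate)
```

Here the larger raw count turns out to be the lower relative frequency. For a monitor corpus the size, and hence the corpus version, still has to be recorded alongside each count.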

Some well-known English corpora
The British National Corpus (BNC)
The Bank of English (BoE)
BYU American English corpus
Corpora of the Brown family (Brown, LOB, FLOB, Frown)
ICE corpora (GB, EA, HK, Singapore, Philippines, New Zealand etc.)
London-Lund corpus of spoken English
SBCSAE
The Helsinki Diachronic Corpus of English Texts (8th - 18th century, ca. 1.5 million words)
The International Corpus of Learner English (ICLE)
MICASE

The BNC
First and best-known national corpus (sample corpus)
100 M word balanced corpus of written (90%) and spoken (10%) British English in current use in the early 1990s
Rich metadata encoded for language variation studies
POS tagged
Accessing the BNC
  BYU-BNC
  BNC Online
  Lancaster BNCWeb CQP edition
  BNC Baby
  Sketch Engine
  BNC PIE

The BoE
Best-known monitor corpus
645 M words (and still growing) of present-day English
  75% written and 25% spoken
  70% BrE, 20% AmE and 10% other English varieties
Particularly useful for lexical and lexicographic studies, e.g. tracking new words, new uses or meanings of old words, and words falling out of use
Access to the BoE
  A 56 M word sampler

Corpus of Contemporary American English (COCA)
385+ M words of American English
  20 M words per year
  Equally divided among spoken, fiction, popular magazines, newspapers, and academic texts
Updated every 6-9 months
Useful for studying variation across genres and over time
Free online access

Corpora of the Brown family
Brown: Written AmE in 1961
LOB: Written BrE in 1961
FLOB: Written BrE in 1991
Frown: Written AmE in 1991
Common corpus design
  One M words each
  500 samples (ca. 2,000 words each)
  Same proportions from the same 15 text categories
Useful for synchronic and diachronic comparison of BrE and AmE
Further information
  ICAME CD
  Extended Brown family (access account to be applied for)

The ICE corpora
20 balanced corpora of one M words each
  E.g. Britain, Ireland, US, Canada, Hong Kong, Singapore, India, the Philippines, East Africa
Common corpus design
  500 samples (ca. 2,000 words each)
  60% spoken + 40% written
  12 genres
Designed for the synchronic study of “world Englishes”
More information

The London-Lund Corpus
First electronic corpus of spontaneous language
A corpus of spoken British English comprising 100 texts, each of 5,000 words, totaling half a million running words
Both dialogue (e.g. face-to-face conversations, telephone conversations, and public discussion) and monologue (both spontaneous and prepared)
Speaker information (gender, age, occupation)
Annotated with prosodic information
Further information

SBCSAE
Based on hundreds of recordings of spontaneous speech from all over the United States
Representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds
Each of the 60 transcripts is time stamped and accompanied by a digital audio file
Free download

Helsinki Corpus of English Texts
Best-known historical corpus
1.5 million words of English in 400 text samples dating from the 8th to 18th centuries
Divided into three periods (Old, Middle, and Early Modern English) and 11 sub-periods
Socio-historical variation and a wide range of text types for each specific period
Allows researchers to go beyond simply dating and reporting language change by combining diachronic, sociolinguistic and genre studies
Further information
  Oxford Text Archive

The ICLE corpus
First and best-known learner English corpus
Comprising argumentative essays written by advanced learners of English (i.e. university students of English as a foreign language (EFL) in their 3rd or 4th year of study)
Over 2.5 million words in 3,640 texts ranging from 500 to 1,000 words in length
11 L1 backgrounds and still expanding with 8 additional L1s
Useful in investigating the interlanguage of foreign language learners
Further information

MICASE
ca. 1.8 M words in 152 transcripts of nearly 200 hours of recordings of 1,571 speakers
Focusing on contemporary university speech within the domain of the University of Michigan
Encoded with speaker information (age, academic role, language status)
Free online search or transcript download
