Intro to corpus linguistics: Data-Driven Grammar
John Corbett & Wendy Anderson
This session
  Corpora and grammatical theories
  The development of data-driven grammars
  To tag or not to tag?
  Exploring grammar without tags
  Exploring grammar with tags
  A few caveats
Key concepts in grammar
  Saussure: langue vs parole; synchronic vs diachronic focus; syntagmatic vs paradigmatic relations
  Boas: linguistic anthropology/fieldwork
  Bloomfield: Immediate Constituent Analysis
  Fries: importance of distribution over meaning
  Chomsky: competence vs performance
  Halliday: meaning, metafunctions of language
  Sinclair: data-driven grammar
Why use real data (and lots of it)?
  Naturalistic language
  Access to range of varieties & styles (linguist not limited to own knowledge of language)
  If corpus is well designed, we can generalise from it
  Patterns emerge which cannot otherwise be detected
Early data-driven grammars “The materials which furnished the linguistic evidence for the analysis and the discussions […] were primarily some fifty hours of mechanically recorded conversations on a great range of topics – conversations in which the participants were entirely unaware that their speech was being recorded. These mechanical records were transcribed for convenient study, and roughly indexed so as to facilitate reference to the original discs recording the actual speech.” (Fries 1952/57, p.3)
Early data-driven grammars The Survey of English Usage, from 1959 Quirk et al. (1985) A Comprehensive Grammar of the English Language.
Towards corpus-informed grammars Biber, Johansson, Leech, Conrad and Finegan (1999) Longman Grammar of Spoken and Written English.
“Through study of authentic texts, corpus-based analysis of grammatical structure can uncover characteristics that were previously unsuspected. For example, we have found that speakers in conversation use a number of relatively complex and sophisticated grammatical constructions, contradicting the widely held belief that conversation is grammatically simple.” Introduction to Longman Grammar of Spoken and Written English (p.7)
Cambridge International Corpus Over 1 billion words (and expanding) Cambridge Grammar of English (Carter and McCarthy, 2006)
Tackling grammar with a corpus
  Do you have access to a ‘raw’ or a tagged corpus?
  Are you restricted to searching for individual words or phrases?
  Can you also search for Parts of Speech (PoS)?
Working with words/phrases
  “He is not”
  a. He’s not
  b. He isn’t
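A search of this kind needs no tags, only pattern matching over the raw text. The following is a minimal sketch, assuming a plain-text corpus held in a Python string; the sample sentences, the variant labels and the function name `count_variants` are invented for illustration and are not taken from any of the corpora discussed here.

```python
import re
from collections import Counter

# Toy sample text (invented for illustration; not drawn from any corpus).
SAMPLE = (
    "He's not coming today. She isn't sure why. He is not happy about it. "
    "He isn't answering, and she's not surprised. He's not at home."
)

# One regular expression per variant. [’'] accepts curly and straight
# apostrophes, since corpora differ in how they encode them.
VARIANTS = {
    "full": r"\b(?:he|she) is not\b",
    "'s not": r"\b(?:he|she)[’']s not\b",
    "isn't": r"\b(?:he|she) isn[’']t\b",
}

def count_variants(text):
    """Count case-insensitive matches of each variant in a raw text."""
    counts = Counter()
    for label, pattern in VARIANTS.items():
        counts[label] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

for label, n in count_variants(SAMPLE).items():
    print(f"{label:8} {n}")
```

A real query would also have to decide how to treat other subjects (it, they, full noun phrases) and how the corpus tokenises contractions, which this sketch ignores.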
He isn’t vs He’s not
US English
Data from Cambridge International Corpus, reported in McCarthy, McCarten, Sandiford (2005)
  form           frequency
  a. He’s not    704
  b. He isn’t    18
     She’s not   476
     She isn’t   15
He isn’t vs He’s not
UK English
Data from British National Corpus (http://corpus.byu.edu/bnc)
  form           frequency
  a. He’s not    1894
  b. He isn’t    372
     She’s not   1019
     She isn’t   167
He isn’t vs He’s not
  US
    He’s not / She’s not forms are chosen in 97-98% of instances in this corpus
    Isn’t forms are chosen in 2-3% of instances
  UK
    He’s not / She’s not forms are chosen in 84-86% of instances in this corpus
    Isn’t forms are chosen in 14-16% of instances
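The percentages follow directly from the raw frequencies on the two preceding slides. As a quick check, the sketch below recomputes the share of ’s not forms for each pronoun and variety; the counts are those reported above, while the names `COUNTS` and `share_s_not` are illustrative only.

```python
# Raw frequencies as reported on the slides: (’s not, isn’t) per pronoun.
# US: Cambridge International Corpus data via McCarthy, McCarten and
# Sandiford (2005); UK: British National Corpus.
COUNTS = {
    "US": {"He": (704, 18), "She": (476, 15)},
    "UK": {"He": (1894, 372), "She": (1019, 167)},
}

def share_s_not(s_not, isnt):
    """Percentage of ’s not forms among all contracted negatives."""
    return 100 * s_not / (s_not + isnt)

for variety, rows in COUNTS.items():
    for pronoun, (s_not, isnt) in rows.items():
        pct = share_s_not(s_not, isnt)
        print(f"{variety} {pronoun}: 's not {pct:.1f}%, isn't {100 - pct:.1f}%")
```

To one decimal place the ’s not shares are 97.5% and 96.9% (US) and 83.6% and 85.9% (UK), which round to the ranges quoted on the slide.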
He isn’t vs He’s not
In conversation…
  People use ’s not and ’re not after pronouns.
    She’s not strict.
    They’re not nice.
  Isn’t and aren’t often follow nouns.
    My boss isn’t strict.
    My co-workers aren’t nice.
(Touchstone, McCarthy, McCarten and Sandiford 2005: 25)
Automatic parsing
The perils of automatic parsing
  CLAWS claims 96-97% accuracy in its automatic part-of-speech tagging of texts, depending on the texts analysed.
  How worrying is the remaining 3-4% of inaccurate tags?
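One way to weigh that 3-4% is to convert it into absolute numbers. The sketch below is plain arithmetic under a stated assumption: the corpus size of 100 million words is chosen for illustration (roughly the size usually cited for the BNC), and the function name `expected_errors` is invented here.

```python
def expected_errors(tokens, accuracy):
    """Expected number of wrongly tagged tokens at a given per-token accuracy."""
    return round(tokens * (1 - accuracy))

# Assumption for illustration: a corpus of 100 million words.
CORPUS_SIZE = 100_000_000

for accuracy in (0.96, 0.97):
    errors = expected_errors(CORPUS_SIZE, accuracy)
    print(f"accuracy {accuracy:.0%}: about {errors:,} mis-tagged words")
```

At this scale even a small error rate amounts to millions of tokens, and since errors are unlikely to be spread evenly, rare or ambiguous forms may be affected more than the headline figure suggests.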
Working with POS: The BNC Tagset
Pattern Grammar Susan Hunston and Gill Francis (2000) Pattern Grammar
Using a parsed corpus to discover patterns: V + over + wh-
V + over + wh-
Pattern grammar: V over + Wh-clause
     heels while critics argue over the niceties of translation
  st look at each other bicker over the poisoning of a dog. If th
    be convinced to compromise over the structure of the competiti
    hat in mind, as they fight over whether to support a rescue pa
    er bridal showers and fuss over her trousseau. That’s her busi
    the countryside. To haggle over prices and engage in sharp
     so ill-bred as to quarrel over the last cucumber sandwich. ‘N
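With a tagged corpus, a pattern such as V + over + wh- can be retrieved by combining a tag test with word tests. The following is a minimal sketch, assuming tokens stored in word_TAG form and lexical-verb tags that begin with "VV"; the tagged sample is invented, its tags only loosely follow the BNC tagset, and `find_v_over_wh` and `WH_WORDS` are hypothetical names, not part of any corpus tool.

```python
# Toy tagged text in word_TAG form (invented for illustration).
TAGGED = (
    "They_PNP fight_VVB over_PRP whether_CJS to_TO0 support_VVI it_PNP ._PUN "
    "Critics_NN2 argue_VVB over_PRP the_AT0 niceties_NN2 ._PUN "
    "We_PNP puzzled_VVD over_PRP what_DTQ she_PNP meant_VVD ._PUN"
)

# Wh-words are matched lexically here rather than by tag (an assumption
# made to keep the sketch independent of tagset details).
WH_WORDS = {"what", "when", "where", "whether", "which", "who", "why", "how"}

def find_v_over_wh(tagged, width=3):
    """Return (left context, match, right context) for verb + over + wh-word."""
    tokens = [tok.rsplit("_", 1) for tok in tagged.split()]
    hits = []
    for i in range(len(tokens) - 2):
        (w1, t1), (w2, _), (w3, _) = tokens[i:i + 3]
        if t1.startswith("VV") and w2.lower() == "over" and w3.lower() in WH_WORDS:
            left = " ".join(w for w, _ in tokens[max(0, i - width):i])
            right = " ".join(w for w, _ in tokens[i + 3:i + 3 + width])
            hits.append((left, f"{w1} {w2} {w3}", right))
    return hits

for left, match, right in find_v_over_wh(TAGGED):
    print(f"{left:>20}  {match}  {right}")
```

Dropping the wh-word test would also return lines such as "argue over the niceties", which is how the noun-group cases in the concordance above would be captured.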
Some caveats about corpora
  Corpora do not provide ‘negative evidence’
  Corpora are not exhaustive, and may be skewed – focus on the typical
  Corpora do not provide explanations – this is the linguist’s job!
  Not all corpora are suitable for a particular research question
  But corpora allow us to study language as it is actually used
Data-driven grammar – take home messages…
  Corpora allow us to take account of naturalistic data
  Focus on parole (or ‘performance’)
  A return to the social as opposed to the cognitive