Intro to corpus linguistics: Data Driven Grammar

1 Intro to corpus linguistics: Data Driven Grammar
John Corbett & Wendy Anderson

2 This session
Corpora and grammatical theories
The development of data-driven grammars
To tag or not to tag?
Exploring grammar without tags
Exploring grammar with tags
A few caveats

3 Key concepts in grammar
Saussure: langue vs parole; synchronic vs diachronic focus; syntagmatic vs paradigmatic relations
Boas: linguistic anthropology/fieldwork
Bloomfield: Immediate Constituent Analysis
Fries: importance of distribution over meaning
Chomsky: competence vs performance
Halliday: meaning, metafunctions of language
Sinclair: data-driven grammar

4 Why use real data (and lots of it)?
Naturalistic language
Access to range of varieties & styles (linguist not limited to own knowledge of language)
If corpus is well designed, we can generalise from it
Patterns emerge which cannot otherwise be detected

5 Early data-driven grammars
“The materials which furnished the linguistic evidence for the analysis and the discussions […] were primarily some fifty hours of mechanically recorded conversations on a great range of topics – conversations in which the participants were entirely unaware that their speech was being recorded. These mechanical records were transcribed for convenient study, and roughly indexed so as to facilitate reference to the original discs recording the actual speech.” (Fries 1952/57, p.3)

6 Early data-driven grammars
The Survey of English Usage, from 1959
Quirk et al. (1985) A Comprehensive Grammar of the English Language.

7 Towards corpus-informed grammars
Biber, Johansson, Leech, Conrad and Finegan (1999) Longman Grammar of Spoken and Written English.

8 “Through study of authentic texts, corpus-based analysis of grammatical structure can uncover characteristics that were previously unsuspected. For example, we have found that speakers in conversation use a number of relatively complex and sophisticated grammatical constructions, contradicting the widely held belief that conversation is grammatically simple.”
Introduction to Longman Grammar of Spoken and Written English (p.7)

9 Cambridge International Corpus
Over 1 billion words (and expanding)
Cambridge Grammar of English (Carter and McCarthy, 2006)

10 Tackling grammar with a corpus
Do you have access to a ‘raw’ or a tagged corpus?
Are you restricted to searching for individual words or phrases?
Can you also search for Parts of Speech (PoS)?

11 Working with words/phrases
“He is not”
a. He’s not
b. He isn’t
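A word/phrase search of this kind needs nothing more than a raw (untagged) corpus. The following Python sketch shows one possible way to count the competing forms. The function name `count_variants` and the sample sentence are invented for illustration and are not taken from the slides or from any corpus tool.

```python
import re

def count_variants(text, variants):
    """Count case-insensitive, whole-phrase occurrences of each variant."""
    # Treat curly and straight apostrophes as the same character.
    normalised = text.replace("\u2019", "'")
    counts = {}
    for variant in variants:
        key = variant.replace("\u2019", "'")
        # The lookarounds stop "he's not" from matching inside "She's not".
        pattern = r"(?<!\w)" + re.escape(key) + r"(?!\w)"
        counts[key] = len(re.findall(pattern, normalised, flags=re.IGNORECASE))
    return counts

# Invented sample text standing in for a real corpus file.
sample = "He's not here. He isn\u2019t coming. She said he\u2019s not well, and he is not happy."
counts = count_variants(sample, ["he is not", "he's not", "he isn't"])
print(counts)
```

On a real corpus the same function would be applied to the text of each file, and the totals compared across varieties.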

12 He isn’t vs He’s not US English
Data from Cambridge International Corpus, reported in McCarthy, McCarten, Sandiford (2005)
form           frequency
a. He’s not    704
b. He isn’t    18
   She’s not   476
   She isn’t   15

13 He isn’t vs He’s not UK English
Data from British National Corpus
form           frequency
a. He’s not    1894
b. He isn’t    372
   She’s not   1019
   She isn’t   167

14 He isn’t vs He’s not
US
He’s not / She’s not forms are chosen in 97-98% of instances in this corpus
Isn’t forms are chosen in 2-3% of instances
UK
He’s not / She’s not forms are chosen in 84-86% of instances in this corpus
Isn’t forms are chosen in 14-16% of instances
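These percentages follow from the frequency counts on the two preceding slides. The short Python sketch below recomputes them; the names `FREQUENCIES` and `percent_s_not` are my own, and only the counts come from the slides.

```python
# Frequencies copied from the two preceding slides: ('s not count, isn't count).
FREQUENCIES = {
    "US": {"he": (704, 18), "she": (476, 15)},
    "UK": {"he": (1894, 372), "she": (1019, 167)},
}

def percent_s_not(s_not, isnt):
    """Share of 's not forms among the two contracted negatives, in percent."""
    return 100 * s_not / (s_not + isnt)

shares = {
    (variety, pronoun): percent_s_not(*pair)
    for variety, by_pronoun in FREQUENCIES.items()
    for pronoun, pair in by_pronoun.items()
}

for (variety, pronoun), share in shares.items():
    print(f"{variety} {pronoun}: 's not {share:.1f}%, isn't {100 - share:.1f}%")
```

Rounded to whole numbers, the results fall within the 97-98% (US) and 84-86% (UK) ranges quoted on the slide.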

15 He isn’t vs He’s not
In conversation…
People use ’s not and ’re not after pronouns.
She’s not strict. They’re not nice.
Isn’t and aren’t often follow nouns.
My boss isn’t strict. My co-workers aren’t nice.
Touchstone, McCarthy, McCarten and Sandiford 2005: 25

16 Automatic parsing

17 The perils of automatic parsing
CLAWS, a part-of-speech tagger, claims 96-97% accuracy in its automatic tagging of texts, depending on the texts analysed. How worrying is that inaccurate 3-4%?
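One way to get a feel for the question is to turn the accuracy figure into absolute numbers. The Python sketch below is a back-of-the-envelope estimate only. It assumes a corpus of roughly 100 million words (about the size of the British National Corpus), an illustrative sentence length of 20 words, and that errors fall independently on each word, which real tagging errors do not. The function names are my own.

```python
def expected_tag_errors(n_tokens, accuracy):
    """Expected number of wrongly tagged tokens at a given per-token accuracy."""
    return round(n_tokens * (1 - accuracy))

def chance_sentence_fully_correct(sentence_length, accuracy):
    """Chance that every token in a sentence is tagged correctly.

    Simplifying assumption: errors are independent from token to token.
    """
    return accuracy ** sentence_length

CORPUS_SIZE = 100_000_000  # assumption: roughly the size of the BNC
SENTENCE_LENGTH = 20       # hypothetical sentence length, for illustration only

for accuracy in (0.96, 0.97):
    errors = expected_tag_errors(CORPUS_SIZE, accuracy)
    clean = chance_sentence_fully_correct(SENTENCE_LENGTH, accuracy)
    print(f"accuracy {accuracy:.0%}: about {errors:,} wrong tags; "
          f"chance of an error-free {SENTENCE_LENGTH}-word sentence: {clean:.0%}")
```

Under these assumptions a 3-4% error rate means millions of wrong tags across the corpus, and roughly half of all 20-word sentences contain at least one.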

18 Working with POS: The BNC Tagset

19 Pattern Grammar
Susan Hunston and Gill Francis (2000) Pattern Grammar

20 Using a parsed corpus to discover patterns: V + over + wh-

21 V + over + wh-

22 Pattern grammar V over + Wh-clause
   heels while critics argue over the niceties of translation
st look at each other bicker over the poisoning of a dog. If th
  be convinced to compromise over the structure of the competiti
  hat in mind, as they fight over whether to support a rescue pa
  er bridal showers and fuss over her trousseau. That’s her busi
  the countryside. To haggle over prices and engage in sharp
   so ill-bred as to quarrel over the last cucumber sandwich. ‘N
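With a tagged corpus, a pattern such as V + over + wh- can be searched for directly instead of being read off concordance lines. The Python sketch below is a minimal illustration, not the method used by Hunston and Francis. It assumes tokens written as `word_TAG`, treats any tag beginning with `VV` as a lexical verb (loosely following BNC-style tags), and recognises wh-words from a fixed word list. The sample is hand-tagged by me and may not match what an automatic tagger would assign.

```python
# Word list used to recognise wh-words by form rather than by tag.
WH_WORDS = {"what", "when", "where", "whether", "which",
            "who", "whom", "whose", "why", "how"}

def find_v_over_wh(tagged_text):
    """Return (verb, 'over', wh-word) triples found in word_TAG text."""
    # Split each token into [word, tag]; tokens without a tag are skipped.
    tokens = [tok.rsplit("_", 1) for tok in tagged_text.split() if "_" in tok]
    hits = []
    for i in range(len(tokens) - 2):
        (w1, t1), (w2, _), (w3, _) = tokens[i:i + 3]
        # Assumption: tags starting with "VV" mark lexical verbs.
        if t1.startswith("VV") and w2.lower() == "over" and w3.lower() in WH_WORDS:
            hits.append((w1, w2, w3))
    return hits

# Hand-tagged sample based on two of the concordance lines above.
sample = ("they_PNP fight_VVB over_PRP whether_CJS to_TO0 support_VVI "
          "a_AT0 rescue_NN1 To_TO0 haggle_VVI over_PRP prices_NN2")
hits = find_v_over_wh(sample)
print(hits)
```

Only the first clause is returned: "haggle over prices" has a verb followed by over, but a noun rather than a wh-word comes next.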

23 Some caveats about corpora
Corpora do not provide ‘negative evidence’
Corpora are not exhaustive, and may be skewed – focus on the typical
Corpora do not provide explanations – this is the linguist’s job!
Not all corpora are suitable for a particular research question
But corpora allow us to study language as it is actually used

24 Data-driven grammar – take home messages…
Corpora allow us to take account of naturalistic data
Focus on parole (or ‘performance’)
A return to the social as opposed to the cognitive

