Intro to corpus linguistics: Data Driven Grammar


John Corbett & Wendy Anderson

This session
  Corpora and grammatical theories
  The development of data-driven grammars
  To tag or not to tag?
  Exploring grammar without tags
  Exploring grammar with tags
  A few caveats

Key concepts in grammar
  Saussure: langue vs parole; synchronic vs diachronic focus; syntagmatic vs paradigmatic relations
  Boas: linguistic anthropology/fieldwork
  Bloomfield: Immediate Constituent Analysis
  Fries: importance of distribution over meaning
  Chomsky: competence vs performance
  Halliday: meaning, metafunctions of language
  Sinclair: data-driven grammar

Why use real data (and lots of it)?
  Naturalistic language
  Access to range of varieties & styles (linguist not limited to own knowledge of language)
  If corpus is well designed, we can generalise from it
  Patterns emerge which cannot otherwise be detected

Early data-driven grammars
  “The materials which furnished the linguistic evidence for the analysis and the discussions […] were primarily some fifty hours of mechanically recorded conversations on a great range of topics – conversations in which the participants were entirely unaware that their speech was being recorded. These mechanical records were transcribed for convenient study, and roughly indexed so as to facilitate reference to the original discs recording the actual speech.” (Fries 1952/57, p.3)

Early data-driven grammars
  The Survey of English Usage, from 1959
  Quirk et al. (1985) A Comprehensive Grammar of the English Language.

Towards corpus-informed grammars
  Biber, Johansson, Leech, Conrad and Finegan (1999) Longman Grammar of Spoken and Written English.

“Through study of authentic texts, corpus-based analysis of grammatical structure can uncover characteristics that were previously unsuspected. For example, we have found that speakers in conversation use a number of relatively complex and sophisticated grammatical constructions, contradicting the widely held belief that conversation is grammatically simple.” Introduction to Longman Grammar of Spoken and Written English (p.7)

Cambridge International Corpus
  Over 1 billion words (and expanding)
  Cambridge Grammar of English (Carter and McCarthy, 2006)

Tackling grammar with a corpus
  Do you have access to a ‘raw’ or a tagged corpus?
  Are you restricted to searching for individual words or phrases?
  Can you also search for Parts of Speech (PoS)?

Working with words/phrases
  “He is not” has two contracted forms:
  a. He’s not
  b. He isn’t

He isn’t vs He’s not: US English
  Data from Cambridge International Corpus, reported in McCarthy, McCarten, Sandiford (2005)

  form           frequency
  a. He’s not    704
  b. He isn’t    18
  She’s not      476
  She isn’t      15

He isn’t vs He’s not: UK English
  Data from British National Corpus (http://corpus.byu.edu/bnc)

  form           frequency
  a. He’s not    1894
  b. He isn’t    372
  She’s not      1019
  She isn’t      167

He isn’t vs He’s not
  US
    He’s not / She’s not forms are chosen in 97-98% of instances in this corpus
    Isn’t forms are chosen in 2-3% of instances
  UK
    He’s not / She’s not forms are chosen in 84-86% of instances in this corpus
    Isn’t forms are chosen in 14-16% of instances
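The percentages can be recomputed from the reported frequencies. The sketch below is only a check on the arithmetic; the dictionary layout and the function name `share_of_s_not` are illustrative choices, not part of the original slides.

```python
# Frequencies as reported on the slides: ('s not count, isn't count)
FREQUENCIES = {
    "US": {"He": (704, 18), "She": (476, 15)},      # Cambridge International Corpus
    "UK": {"He": (1894, 372), "She": (1019, 167)},  # British National Corpus
}


def share_of_s_not(s_not: int, isnt: int) -> float:
    """Percentage of instances in which the 's not form is chosen."""
    return 100 * s_not / (s_not + isnt)


for variety, rows in FREQUENCIES.items():
    for pronoun, (s_not, isnt) in rows.items():
        pct = share_of_s_not(s_not, isnt)
        print(f"{variety} {pronoun}: 's not {pct:.1f}%, isn't {100 - pct:.1f}%")
```

The shares come out at roughly 97-98% for the US data and 84-86% for the UK data, in line with the ranges given above.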

He isn’t vs He’s not
  In conversation…
  People use ’s not and ’re not after pronouns.
    She’s not strict.
    They’re not nice.
  Isn’t and aren’t often follow nouns.
    My boss isn’t strict.
    My co-workers aren’t nice.
  (Touchstone, McCarthy, McCarten and Sandiford 2005: 25)
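A word/phrase search of this kind can be approximated on an untagged corpus with regular expressions. The sketch below uses the four example sentences from the slide as input. The pronoun list and the names `classify`, `VERB_CONTRACTED` and `NOT_CONTRACTED` are assumptions made for illustration, and any subject that is not in the pronoun list is simply labelled "other", since a raw corpus cannot tell us that it is a noun.

```python
import re

# Small closed set of subject pronouns (an illustrative assumption)
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}

# Example sentences taken from the slide
SENTENCES = [
    "She's not strict.",
    "They're not nice.",
    "My boss isn't strict.",
    "My co-workers aren't nice.",
]

# 's not / 're not: the verb is contracted onto the subject
VERB_CONTRACTED = re.compile(r"\b([\w-]+)'(?:s|re) not\b", re.IGNORECASE)
# isn't / aren't: "not" is contracted onto the verb
NOT_CONTRACTED = re.compile(r"\b([\w-]+) (?:is|are)n't\b", re.IGNORECASE)


def classify(sentence):
    """Return (form, subject type) for the first negated 'be' found, or None."""
    text = sentence.replace("\u2019", "'")  # normalise curly apostrophes
    for form, pattern in (("'s not / 're not", VERB_CONTRACTED),
                          ("isn't / aren't", NOT_CONTRACTED)):
        match = pattern.search(text)
        if match:
            subject = match.group(1).lower()
            return form, "pronoun" if subject in PRONOUNS else "other"
    return None


for sentence in SENTENCES:
    print(sentence, "->", classify(sentence))
```

On a real corpus the counts per (form, subject type) pair would then be tallied; with a tagged corpus the "other" category could be replaced by a proper noun tag.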

Automatic parsing

The perils of automatic parsing
  CLAWS claims 96-97% accuracy in its automatic part-of-speech tagging of texts, depending on the texts analysed.
  How worrying is that inaccurate 3-4%?
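One rough way to put the 3-4% in perspective is to convert it into an absolute number of tokens. The sketch below assumes a corpus of 100 million tokens (roughly the size of the British National Corpus); that figure and the function name `expected_tag_errors` are assumptions, not taken from the slides.

```python
def expected_tag_errors(corpus_tokens: int, accuracy_percent: int) -> int:
    """Number of tokens expected to carry a wrong tag at a given accuracy."""
    return corpus_tokens * (100 - accuracy_percent) // 100


CORPUS_TOKENS = 100_000_000  # assumption: a corpus of roughly BNC size

for accuracy in (96, 97):
    errors = expected_tag_errors(CORPUS_TOKENS, accuracy)
    print(f"{accuracy}% accuracy -> about {errors:,} mis-tagged tokens")
```

The calculation treats errors as evenly spread, which is a simplification: in practice they are likely to cluster on ambiguous words, so some searches will be affected far more than others.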

Working with POS: The BNC Tagset

Pattern Grammar
  Susan Hunston and Gill Francis (2000) Pattern Grammar

Using a parsed corpus to discover patterns: V + over + wh-

V + over + wh-

Pattern grammar: V over + Wh-clause
  heels while critics argue over the niceties of translation
  st look at each other bicker over the poisoning of a dog. If th
  be convinced to compromise over the structure of the competiti
  hat in mind, as they fight over whether to support a rescue pa
  er bridal showers and fuss over her trousseau. That’s her busi
  the countryside. To haggle over prices and engage in sharp
  so ill-bred as to quarrel over the last cucumber sandwich. ‘N
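Without tags, a pattern such as V + over + wh- can only be approximated by searching for the word "over" followed by a wh-word. The sketch below runs such a search over the concordance fragments from the slide (truncated exactly as they appear there). The wh-word list and the names `OVER_WH` and `find_over_wh` are illustrative assumptions, and the "verb" slot is simply whatever word precedes "over", since no part-of-speech information is available.

```python
import re

# Concordance fragments copied from the slide (truncated as in the original)
LINES = [
    "heels while critics argue over the niceties of translation",
    "st look at each other bicker over the poisoning of a dog. If th",
    "be convinced to compromise over the structure of the competiti",
    "hat in mind, as they fight over whether to support a rescue pa",
    "er bridal showers and fuss over her trousseau. That's her busi",
    "the countryside. To haggle over prices and engage in sharp",
    "so ill-bred as to quarrel over the last cucumber sandwich.",
]

# Illustrative list of wh-words that can introduce a wh-clause
WH_WORDS = ("whether", "what", "who", "whom", "whose",
            "which", "when", "where", "why", "how")

OVER_WH = re.compile(
    r"\b(?P<verb>\w+)\s+(?P<node>over)\s+(?P<wh>" + "|".join(WH_WORDS) + r")\b",
    re.IGNORECASE,
)


def find_over_wh(lines, width=30):
    """Return (word before 'over', wh-word, KWIC line) for each matching line."""
    hits = []
    for line in lines:
        match = OVER_WH.search(line)
        if match:
            left = line[: match.start("node")].rstrip()
            right = line[match.end("node"):].lstrip()
            # Align the node word in a fixed-width key-word-in-context display
            kwic = f"{left[-width:]:>{width}}  {match.group('node')}  {right[:width]}"
            hits.append((match.group("verb"), match.group("wh"), kwic))
    return hits


for verb, wh, kwic in find_over_wh(LINES):
    print(kwic)
```

Only the "fight over whether" line satisfies the wh- restriction; the remaining lines show "over" followed by a noun phrase. With a tagged corpus the first slot could be restricted to verb tags instead of accepting any word.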

Some caveats about corpora
  Corpora do not provide ‘negative evidence’
  Corpora are not exhaustive, and may be skewed – focus on the typical
  Corpora do not provide explanations – this is the linguist’s job!
  Not all corpora are suitable for a particular research question
  But corpora allow us to study language as it is actually used

Data-driven grammar – take home messages…
  Corpora allow us to take account of naturalistic data
  Focus on parole (or ‘performance’)
  A return to the social as opposed to the cognitive