
Presentation transcript:

1/26 Corpus Linguistics

2/26 Varieties of English
Relevance of corpus linguistics to this course
  – Previously, studies of stylistics were largely informal and subjective
  – Using computers to look at larger amounts of data allows us to be more formal and objective
  – “Corpus linguistics” basically provides a “mindset” (and some procedures) for doing this

3/26 What is a corpus?
Corpus (pl. corpora) = ‘body’
Collection of written text or transcribed speech
Usually but not necessarily purposefully collected
Usually but not necessarily structured
Usually but not necessarily annotated
(Usually stored on and accessible via computer)
Corpus ~ text archive

4/26 “Purposefully collected”
Text samples collected to meet a specific need
Corpus may be quite focused, e.g. a corpus of newswire texts, or may be more general
Issue of balance often important
  – Demographic features (age, sex, location, social class of writer/reader)
  – Different styles and genres

5/26 “Structured”
Overall corpus is divided into sections defined by parameters
Again, balance will ensure that different genres or demographic features are equally represented
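A structured corpus can be pictured as a set of texts that each carry metadata parameters. The sketch below is a minimal illustration under that assumption: the three texts, the parameter names (`genre`, `mode`) and the helper functions are all invented for the example and are not the BNC's actual scheme. It shows how the share of each section could be checked when judging balance.

```python
from collections import Counter

# Hypothetical mini-corpus: each text carries metadata parameters.
# The parameter names and values are illustrative, not the BNC's scheme.
corpus = [
    {"genre": "news", "mode": "written",
     "text": "Shares fell sharply in early trading today"},
    {"genre": "fiction", "mode": "written",
     "text": "She closed the door and waited in the dark"},
    {"genre": "conversation", "mode": "spoken",
     "text": "well I mean you know it was fine really"},
]


def words_per_section(texts, parameter):
    """Total the word count of each section defined by one metadata parameter."""
    totals = Counter()
    for item in texts:
        totals[item[parameter]] += len(item["text"].split())
    return totals


def section_shares(texts, parameter):
    """Return each section's share of the whole corpus as a proportion."""
    totals = words_per_section(texts, parameter)
    grand_total = sum(totals.values())
    return {section: count / grand_total for section, count in totals.items()}


print(words_per_section(corpus, "mode"))
print(section_shares(corpus, "genre"))
```

Splitting on whitespace is a deliberately crude word count; a real corpus would use a proper tokeniser.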

6/26 Parameters in the BNC (written portion)

7/26 Genre distinctions in the BNC (written portion)

8/26 Parameters in BNC (spoken part)

9/26 Parameters in BNC (spoken part) cont

10/26 “Annotated”
Not just plain text
Most corpora are at least “POS tagged”
  – Each word has its part of speech (POS) identified
  – POS tags contain quite rich information, e.g. not just “verb” but including some morphological information
  – Tags also disambiguate, e.g. between book (N/V), if possible
Some may also have other information indicated
  – Structural information resulting from a parse
  – Word sense distinctions for same-POS homonyms
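As a rough illustration of POS tagging, the sketch below stores a few hand-tagged tokens as (word, tag) pairs and looks up the tags that one word form received. The sentences, the tag names and the `tags_for` helper are assumptions made for the example; they are not the tagset of any particular corpus.

```python
from collections import Counter

# Hand-tagged toy tokens as (word, tag) pairs. The tag names are
# illustrative placeholders, not the tagset of any real corpus.
tagged = [
    ("I", "PRON"), ("read", "VERB-PAST"), ("a", "DET"), ("book", "NOUN-SG"),
    ("Please", "ADV"), ("book", "VERB-BASE"), ("two", "NUM"), ("seats", "NOUN-PL"),
]


def tags_for(word, tokens):
    """Count which tags a word form received, ignoring case."""
    return Counter(tag for w, tag in tokens if w.lower() == word.lower())


print(tags_for("book", tagged))
```

The tags carry more than the bare word class (number on nouns, tense on verbs), and the two occurrences of book end up with different tags, which is the disambiguation the slide describes.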

11/26

12/26

13/26 What is corpus linguistics?
Not a branch of linguistics, like socio~, psycho~, …
Not a theory of linguistics
A set of tools and methods (and a philosophy) to support linguistic investigation across all branches of the subject

14/26 Evidence in linguistics
Real attested usage as linguistic evidence
Contrasts with introspective approach previously typical
Relates to the competence~performance (langue~parole) distinction
Corpus linguists often more interested in trends than rules (probabilities rather than certainties)
Famous stories of corpus evidence contradicting widely-held assumptions about language use

15/26 Activities in corpus linguistics
Design and compilation of corpora
Development of tools for corpus analysis
Descriptive linguists using corpora to analyze lexical and grammatical behaviour of language, e.g. for lexicography, and of course stylistics
Exploiting corpora in applied linguistics: language teaching, translation

16/26 History of Corpus Linguistics
Textual study has always included an element of counting and cataloguing, despite impracticalities, notably concordances of Shakespeare, the Bible, etc.
Arrival of computers in the 1950s of course changed everything
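A concordance lists every occurrence of a word together with its surrounding context. Compiling one by hand is laborious, but it takes only a few lines of code. The sketch below is a minimal keyword-in-context (KWIC) illustration; the function name `kwic`, its `width` parameter and the sample text are choices made for this example, not a reference to any particular tool.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, hit, right) triples: `width` tokens either side of each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


text = "to be or not to be that is the question"
for left, hit, right in kwic(text.split(), "be", width=2):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>10} [{hit}] {right}")
```

Aligning the keyword in a column is what makes recurring patterns to its left and right easy to spot by eye.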

17/26 Brown corpus
First modern computer-readable corpus
W.N. Francis and H. Kučera, Brown University, Providence, RI
One million words of American English texts printed in 1961
Sampled from 15 different text categories
Used as model for other corpora, including …

18/26 LOB corpus
Compiled by researchers in Lancaster, Oslo and Bergen
One million words of British English texts printed in 1961
Sampled from same 15 text categories as Brown corpus
All texts ≤ 2,000 words long
Kolhapur corpus of Indian English compiled in 1978 to same specification

19/26 The London-Lund Corpus of Spoken English (LLC)
First corpus of transcribed spoken language
Part of Survey of Spoken English at Lund University under the direction of J. Svartvik
500,000 words of spoken British English recorded from 1953 to 1987
Different categories, such as spontaneous conversation, spontaneous commentary, spontaneous and prepared oration

20/26 COBUILD
1m-word corpus too small for many applications
1980: Collins instigated collection of a 20m-word corpus to support lexicographers writing a new learners’ dictionary (John Sinclair); COBUILD stands for Collins Birmingham University International Language Database
Now expanded to Bank of English corpus, 320m words and growing

21/26 BNC (1995)
100m word collection of written and spoken text from (already dated in some respects!)
Carefully designed and balanced
Corpus is closed (finite, synchronic)
All text tagged to high quality
Lots of tools available for exploration
Nice online interface (available on campus)

22/26 What can you do with a corpus?
Many things, but just some examples:
Investigate behaviour of words and how they relate to genre, mode, sex of speaker/hearer
Prove (or disprove) supposed trends with quantitative data

23/26 Example 1: swearing
Women and men swear (and use taboo words) differently
Data (from BNC spoken part) shows
  – Women and men use different swear words
  – They use them for different effect (men use them to disparage, women use them to intensify)
  – Their use changes depending on the sex of the listener(s): women swear more in single-sex groups; men don’t swear more in mixed-sex groups than amongst themselves

24/26 Example 2.1: Near synonyms
Subtle differences in the meaning of near synonyms can be distinguished by looking at the words they collocate with
  – “You shall know a word by the company it keeps” (Firth)
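Looking at the company a word keeps can be sketched as counting the words that occur within a small window around it. The example below is a minimal illustration under that assumption: the `collocates` function, the window size and the sentence are all invented for the example, and the sentence is not corpus data. Real collocation studies usually go further and apply an association measure rather than raw counts.

```python
from collections import Counter


def collocates(tokens, node, window=2):
    """Count words occurring within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# Invented sentence for illustration only; not corpus data.
tokens = "the frail old man held the fragile glass vase".split()

print(collocates(tokens, "frail"))
print(collocates(tokens, "fragile"))
```

Run over a large corpus, the two count tables could then be compared to see which neighbours each near synonym prefers.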

25/26 frail vs fragile

26/26 Example 2.2: Near synonyms
In addition, near synonyms can be shown to be favoured depending on genre, e.g. big vs large
Table columns: Category, big, large (frequency per million words)
Categories:
  – Spoken conversation
  – Other spoken material
  – Newspapers
  – Fiction and verse
  – Other published written material
  – Unpublished written material
  – Non-academic prose and biography
  – Academic prose
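Frequencies are reported per million words because the sections of a corpus differ in size, so raw counts cannot be compared directly. The sketch below shows the normalisation; the function name `per_million` and the counts and section sizes are made up for illustration and are not the values from the slide.

```python
def per_million(raw_count, section_size):
    """Normalise a raw count to occurrences per million words."""
    return raw_count * 1_000_000 / section_size


# Invented figures: the same raw count in sections of different size.
small_section = per_million(150, 500_000)
large_section = per_million(150, 3_000_000)

print(small_section, large_section)
```

The same raw count of 150 corresponds to a much higher rate in the smaller section, which is why comparisons across genres use the normalised figure.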