Corpus 01 Introduction Historical Review

Corpus Linguistics
- Linguists need evidence for theories.
- Evidence can come from intuition or introspection, from experimentation or elicitation, or from observation of spoken or written texts.
- Focus: performance rather than competence; working from observation to theory rather than from theory to observation.
- Scope: text as the domain of study and as the source of evidence for linguistic description and argumentation.
- Methodology: quantification of linguistic description.

Differences between corpus linguistics and other approaches to linguistics
- Richness of the evidence
- Confidence in generalizability
- Validity and reliability

Corpus Linguistic Activities
1. Design and compilation of corpora: collection of texts; preparation and storage for later analysis.
2. Development of tools for the analysis of corpora (computational linguistics).
3. Use of computerized corpora to describe the lexicon and grammar of languages: the probabilistic aspect of corpus-based description, i.e. the study of how often a particular form is used.
4. Applications: language learning and teaching, natural language processing.
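The idea of studying "how often a particular form is used" can be illustrated with a minimal sketch. The tokenizer, the function name `word_frequencies` and the sample sentence are all assumptions made for illustration; a real study would use a proper corpus and a more careful tokenizer.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lower-cased word form occurs in `text`."""
    # Naive tokenizer (an assumption): runs of letters, with an optional internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

# Invented toy text, not drawn from any real corpus.
sample = "The cat sat on the mat. The dog sat too."
freqs = word_frequencies(sample)
total = sum(freqs.values())

# Print the three most frequent forms with raw and relative frequency.
for form, count in freqs.most_common(3):
    print(form, count, round(count / total, 2))
```

Relative frequency (count divided by corpus size) is what makes counts comparable across corpora of different sizes.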

Function of corpus linguistics
- The point is not that it is a faster way of describing language, but that it may reveal facts we might never have thought of seeking.
- Example: Altenberg’s study of amplifier collocation in English (1991a). Frequent maximizers such as quite tend to collocate with non-scalar words (quite obviously), while absolutely has a greater tendency than other maximizers to collocate with negatives (absolutely not).
- Statistical distribution of linguistic items
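A collocation study of this kind starts from counts of which words occur next to a given node word. The sketch below is only an illustration of that counting step: the function name `right_collocates` is hypothetical, and the toy text is invented, not data from Altenberg's study.

```python
import re
from collections import Counter

def right_collocates(text, node):
    """Count the words that immediately follow `node` in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Pair every token with its right-hand neighbour.
    pairs = zip(tokens, tokens[1:])
    return Counter(nxt for word, nxt in pairs if word == node)

# Invented toy text, for illustration only.
toy = ("That is quite obviously wrong. Absolutely not. "
       "It was quite right, but absolutely not fair. Quite obviously so.")

print(right_collocates(toy, "quite"))
print(right_collocates(toy, "absolutely"))
```

On a real corpus, the raw counts would normally be compared against the overall frequency of each collocate before drawing conclusions.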

Topics of corpus linguistics
- Annotating corpora
- Tagging of parts of speech and of the senses of polysemous word forms
- Improved automatic parsing
- Identification of collocations
- Phraseological units and discourse structure
- Text categorization
- Research methodology
- Applications in lexicography, syntactic description, translation, speech and handwriting recognition, and language teaching

Pre-electronic corpora
- Biblical and literary studies
- Lexicography
- Dialect
- Language education
- Grammatical

Biblical and literary studies
- Alexander Cruden (1736): Concordance of the Authorized Version of the Bible (Fig. 2.1)
- Similar works on Shakespeare

Lexicography
- Samuel Johnson (1755): Dictionary of the English Language. Based on a corpus of sentences from writers of the first reputation.
- James Murray: OED (1928). Based on a corpus of the canon of literary written English.
- Noah Webster (1828): An American Dictionary of the English Language

Dialect
- Wright ( ): The English Dialect Dictionary
- Ellis (1889): The Existing Phonology of English Dialects

Language Education
- Thorndike (1921): word frequency list based on a corpus of 4.5 million words from 41 different sources

Grammatical
- Jespersen ( )
- Kruisinga ( )
- Poutsma ( )
- Fries (1940): American English Grammar. Based on a corpus of letters written to the US Government by persons of different educational and social backgrounds; describes social class differences in usage.
- Fries (1952): The Structure of English. Based on a 250,000-word corpus of recorded telephone conversations.
- Randolph Quirk: Survey of English Usage (1968). 200 samples of 5,000 words each, giving a 1,000,000-word corpus representative of spoken and written English, used to describe the grammar and usage of educated adult native speakers of British English.

Types of electronic corpora
- General corpora (core corpora): a text base for linguistic analysis, used to seek answers to particular questions about vocabulary, grammar and discourse structure. Balanced, containing texts from different genres and domains, in speech and writing, private and public.
- Specialized corpora: designed with particular projects in mind, e.g.
  - a corpus for the compilation of a modern dictionary
  - Carterette & Jones (1974): child language development
  - Zhu (1989): English used in petroleum geology exploration, drilling and refining
  - people disagreeing with each other in radio interviews
  - teachers’ directives in high school classrooms
- Training corpora and test corpora for language models and language processing (Leech 1992)
- Dialect corpora, regional corpora, non-standard corpora, learners’ corpora
- Full-text corpus: complete texts, for stylistic or discourse studies; 200-word samples may not be able to capture the internal structural characteristics of full texts.
- Raw corpus: tagging, parsing, concordance

Major electronic corpora: first generation
Brown Corpus (1961): Brown University Standard Corpus of Present-Day American English
Significance:
1. The first computer corpus.
2. Compiled in the face of massive indifference: the prevailing view was that the task of linguistic research is not to record but to describe, whereas corpora are statistically based, with a probabilistic model of competence derived from linguistic performance.
Structure: Table 2.2, p. 24
Features:
- Widely selected categories of written English: both formal and informal written English are taken into account.
- Texts selected by a method that makes the corpus reasonably representative of current printed American English.
- Established coding conventions for abbreviations, formulae, quotations and punctuation.
- Number of characters per line: 70.
- Grammatically tagged: each word is assigned to one of over 80 tags.

Major electronic corpora: first generation
Lancaster-Oslo/Bergen Corpus (1970): LOB Corpus
- A British counterpart to the Brown Corpus
- 500 texts of about 2,000 words each, published in 1961
- Same categories as the Brown Corpus
- Differences from the Brown Corpus:
  - coding: sentence-initial markers, abbreviations
  - a partly analyzed version
  - versions for more platforms
  - more grammatical tags
  - key word in context (KWIC) concordance
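A key word in context (KWIC) concordance lists every occurrence of a search word with a few words of context on either side. The following is a minimal sketch of the idea, not the format of the actual LOB concordance; the function name `kwic`, the context width and the sample text are assumptions.

```python
def kwic(tokens, keyword, width=3):
    """Return one line per occurrence of `keyword`, with `width` words of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the key words line up in a column.
            lines.append(f"{left:>25}  [{tok}]  {right}")
    return lines

# Invented sample text, split naively on whitespace.
text = "a corpus is a collection of texts and a corpus can be searched"
for line in kwic(text.split(), "corpus"):
    print(line)
```

Aligning the key word in a fixed column is what lets recurring patterns to its left and right be seen at a glance.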

Major electronic corpora: first generation
Other first-generation corpora:
- Indian English: the Kolhapur Corpus of Indian English (1988). Collected materials of 1978.
- New Zealand and Australia: the Wellington Corpus of Written New Zealand English and the Australian Corpus of English (1986). Collected texts in 1961.

Problems
1. A size of one million words is restrictive and too small.
2. It is difficult to find interesting differences between regional varieties: the differences are sometimes not in structure but in the frequency with which a structure is used.
3. Additional words: each sample ends at the first sentence ending after 2,000 words. Thus the Brown Corpus actually contains 1,014,312 words and LOB 1,006,825. In word counting, the LOB concordance size is 1,123,380.
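The sampling rule behind the "additional words" can be sketched as follows. This is an illustration under stated assumptions, not the compilers' actual procedure: the function name `cut_sample` is hypothetical, words are split on whitespace, and a sentence end is taken to be any word ending in a full stop, question mark or exclamation mark.

```python
def cut_sample(text, target=2000):
    """Cut `text` at the first sentence ending at or after `target` words."""
    words = text.split()
    for i, word in enumerate(words):
        # Strip closing quotes and brackets before checking for sentence-final punctuation.
        core = word.rstrip("\"')]")
        if i + 1 >= target and core.endswith((".", "!", "?")):
            return words[:i + 1]
    return words  # no sentence end found at or after the target

# Toy example with a target of 5 words instead of 2,000.
toy = "One two three. Four five six seven. Eight nine."
sample = cut_sample(toy, target=5)
print(len(sample), sample[-1])
```

Because every sample runs on to the end of its sentence, each one is slightly longer than the target. Assuming 500 samples per corpus, the totals above correspond to an average overshoot of about 29 words per sample in Brown (14,312 / 500) and about 14 in LOB (6,825 / 500).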

London-Lund Corpus (LLC)
- The spoken part of the SEU, which was half written and half spoken.
- SEU original: 87 texts making up 435,000 words, plus 13 more texts. The total is 100 texts of 5,000 words each, i.e. 500,000 words.
- Features: less detailed prosodic analysis, marking:
  - tone units
  - onset
  - location of nuclei
  - direction of nuclear tones
  - boosters
  - degree of pause
  - degree of stress
  - speaker identity
  - simultaneous talk
  - contextual comment
  - incomprehensible words

Corpora for special purposes
- Algeo (1988): a corpus of 5 million words from the 18th century to the present, for studying Briticisms in the English language.
- American Heritage Intermediate Corpus (1969): 5.09 million words in 10,043 samples of 500 words each, taken from publications widely read among American schoolchildren aged 7 to 15.
  - Categories: reading, English and grammar, composition, literature, mathematics, social studies, spelling, science, music, art, religion, home economics, library fiction, non-fiction, reference and magazines.
  - Words are not lemmatized.
  - One of the first computer-based databases for lexicographical purposes.

Other special corpora
The Nijmegen Corpus
- Date: early 1970s
- Goal: grammatical description of British English
- Size: 132,000 words
- Composition: 20,000-word extracts from each of 6 authors (120,000 words), plus 12,000 words of transcribed sports commentary
- Categories: written, mainly literary, English; sports commentary
- Sample span: 1962-1968
- Analysis: a large set of labeled trees or phrase markers

Other special corpora
TOSCA (Tools for Syntactic Corpus Analysis) Corpus
- Later than the Nijmegen Corpus
- Size: 75 samples of 20,000 words = 1,500,000 words
- Categories: various fiction and nonfiction genres in written British English
- Span: 1976-1986
- Composition: 45 samples from 21 nonfiction genres, including (auto)biography, history, literary criticism, politics, women’s studies, chemistry, economics and physics; 30 samples from 9 fiction genres, including horror, humor, love and romance, and general fiction

Other special corpora
Hong Kong University of Science and Technology
- Size: 1,000,000 words of computer science English
- Composition: 2,000-word samples from 166 English-language textbooks used in computer science courses in the early 1990s
- Goal: to assist the teaching of English to computer science students

Other special corpora
Jiao Tong University Corpus for English in Science and Technology (JDEST)
- Date: 1980s
- Size: 1,000,000 words from written English texts in the physical sciences, engineering and technology
- Goal: to facilitate lexical analysis of particular registers, e.g. counts of high-frequency words

Other special corpora
Guangzhou Petroleum English Corpus (GPEC)
- Size: 411,000 words from 700 texts from the petroleum industry, in written American and British English of the mid-1980s
- Goal: the same as JDEST

Second generation mega-corpora
COBUILD (Collins Birmingham University International Language Database)
- 25% from spoken texts
- Reflects broadly general rather than technical language
- Current usage, from 1960 on
- Naturally occurring texts
- Prose included, but not poetry
- Contribution: a commercial research and development project for dictionaries, grammars and language teaching courses

Longman Corpus Network
Three major corpora:
- LLELC: the Longman/Lancaster English Language Corpus
- LSC: Longman Spoken Corpus
- LCLE: Longman Corpus of Learners’ English

British National Corpus (BNC)
- 100 million words of contemporary spoken and written British English
- Structure: Table 2.3, p. 51
- Automatic word-class tagging with CLAWS

Issues in corpus design and compilation
- Static or dynamic
- Representativeness and balance
- Size
- Written and spoken
- Extralinguistic variables: text origin, participants, medium, genre, style, factuality, topic, date of publication, authorship (age, gender, nationality), audience
- Storage
- Text capture: keyboarding, CD-ROM or electronic version, scanning (software, quality of printing)
- Spoken text: transcribing (conventions for transcribing prosodic phenomena: ICE project)
- Markup: marks for tagging (Standard Generalized Markup Language, SGML), p. 84

Organizations and professional associations
- Descriptive linguistics: the International Computer Archive of Modern English (ICAME)
  - ICAME CD-ROM: the Brown, LOB, Kolhapur, London-Lund and Helsinki corpora
  - Software: WordCruncher, TACT and Free Text Browser
- Bibliographic overview: Humanities Computing Yearbook
- Computational linguistics: the Association for Computational Linguistics (ACL)
- Literary studies: the Association for Computers and the Humanities (ACH); the Association for Literary and Linguistic Computing (ALLC)