Corpus 01 Introduction Historical Review

Corpus Linguistics: Linguists need evidence for theories. Evidence can come from intuition or introspection, from experimentation or elicitation, or from observation of spoken or written texts. Focus: on performance rather than competence, and on moving from observation to theory rather than from theory to observation. Scope: text as the domain of study and as the source of evidence for linguistic description and argumentation. Methodologies: quantification of linguistic description.

Difference between corpus linguistics and other linguistics: richness of the evidence; confidence in generalizability; validity and reliability.

Corpus Linguistic Activities: 1. Design and compilation of corpora: collection of texts, and their preparation and storage for later analysis. 2. Development of tools for the analysis of corpora (computational linguistics). 3. Use of computerized corpora to describe the lexicon and grammar of languages: the probabilistic aspect of corpus-based description, i.e. the study of how often a particular form is used. 4. Applications in language learning and teaching and in natural language processing.
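The idea of studying "how often a particular form is used" can be illustrated with a small frequency list. This is only a minimal sketch: the function name frequency_list, the crude regular-expression tokenizer and the sample sentence are assumptions made for illustration, not part of the slides or of any particular corpus tool.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count, share) tuples, most frequent first."""
    # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, n, n / total) for word, n in counts.most_common()]


# Invented sample text, used only to show the call.
sample = "The cat sat on the mat. The dog sat too."
for word, n, share in frequency_list(sample)[:3]:
    print(f"{word}\t{n}\t{share:.2f}")
```

On a real corpus the share column is what makes counts comparable across corpora of different sizes.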

Function of corpus linguistics: Not that it is a faster way of describing language, but that it may reveal facts we might never have thought of seeking, e.g. Altenberg’s study of amplifier collocation in English (1991a): frequent maximizers such as quite tend to collocate with non-scalar words (quite obviously), while absolutely has a greater tendency than other maximizers to collocate with negatives (absolutely not). Such findings concern the statistical distribution of linguistic items.
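A collocation count of the kind behind such findings can be sketched as follows. This is an illustrative assumption, not Altenberg's actual procedure: the helper right_collocates only counts the word immediately to the right of a node word, and the toy text is invented to echo the two examples on the slide.

```python
import re
from collections import Counter


def right_collocates(text, node):
    """Count the words that immediately follow `node` in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    following = Counter()
    for current, nxt in zip(tokens, tokens[1:]):
        if current == node:
            following[nxt] += 1
    return following


# Invented toy text; a real study would use a large corpus.
toy = ("That is absolutely not true. It is quite obviously wrong. "
       "Absolutely not, she said. He was quite obviously tired.")
print(right_collocates(toy, "absolutely").most_common(1))
print(right_collocates(toy, "quite").most_common(1))
```

Raw counts like these are only a starting point; judging whether a pairing is a genuine tendency requires comparing it with how often each word occurs overall.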

Topics of corpus linguistics: annotating corpora; tagging of parts of speech and of the senses of polysemous word forms; improved automatic parsing; identification of collocations; phraseological units and discourse structure; text categorization; research methodology; applications in lexicography, syntactic description, translation, speech and handwriting recognition, and language teaching.

Pre-electronic corpora: biblical and literary studies; lexicography; dialect; language education; grammatical.

Biblical and literary studies: Alexander Cruden (1736): Concordance of the Authorized Version of the Bible (Fig. 2.1). Similar works on Shakespeare.

Lexicography: Samuel Johnson (1755): Dictionary of the English Language, based on a corpus of sentences from writers of the first reputation. James Murray: OED (1928), based on a corpus of the canon of literary written English. Noah Webster (1828): An American Dictionary of the English Language.

Dialect: Wright (1898-1905): The English Dialect Dictionary. Ellis (1889): The Existing Phonology of English Dialects.

Language Education: Thorndike (1921): word frequency list based on a corpus of 4.5 million words from 41 different sources.

Grammatical: Jespersen (1909-49); Kruisinga (1931-32); Poutsma (1926-29); Fries (1940): American English Grammar, based on a corpus of letters to the US Government by persons of different educational and social backgrounds, used to describe social class differences in usage.

Grammatical: Fries, The Structure of English (1952): 250,000-word corpus of recorded telephone conversations. Randolph Quirk: Survey of English Usage (1968): 200 samples of 5,000 words each, giving a 1,000,000-word corpus representative of spoken and written English, used to describe the grammar and usage of educated adult native speakers of British English.

Types of electronic corpora: General corpora (core corpora): a text base for linguistic analysis, used to seek answers to particular questions about vocabulary, grammar and discourse structure. Balanced: containing texts from different genres and domains, in speech and writing, private and public.

Types of electronic corpora: Specialized corpora: designed with particular projects in mind, e.g. a corpus for the compilation of a modern dictionary. Carterette & Jones (1974): child language development. Zhu (1989): English used in petroleum geology exploration, drilling and refining. Other examples: people disagreeing with each other in radio interviews; teachers’ directives in high school classrooms.

Types of electronic corpora: Leech (1992): training corpora and test corpora for language models and language processing. Also: dialect corpora; regional corpora; non-standard corpora; learners’ corpora.

Types of electronic corpora: Full text corpus: complete texts, for stylistic or discourse studies; 200-word samples may not be able to capture the internal structural characteristics of full texts. Raw corpus: tagging, parsing, concordance.

Major electronic corpora: first generation: Brown Corpus (1961): Brown University Standard Corpus of Present-Day American English. Significance: 1. The first computer corpus. 2. Compiled in the face of massive indifference: the prevailing view was that linguistic research is not to record but to describe, whereas corpora are statistically based, with a probabilistic model of competence derived from linguistic performance. Structure: Table 2.2, p. 24.

Major electronic corpora: first generation: Features: widely selected categories of written English, taking both formal and informal writing into account; selected by a method that makes it reasonably representative of current printed American English; established coding conventions for abbreviations, formulae, quotations and punctuation; number of characters per line: 70; grammatically tagged, with each word assigned to one of over 80 tags.

Major electronic corpora: first generation: Lancaster-Oslo/Bergen Corpus (1970): LOB Corpus. A British counterpart to the Brown Corpus: 500 texts of about 2,000 words each, published in 1961, in the same categories as the Brown Corpus. Differences from the Brown Corpus: coding: sentence-initial markers;

Major electronic corpora: first generation: (differences from the Brown Corpus, continued) abbreviations; a partly analyzed version; versions for more platforms; more grammatical tags; key word in context (KWIC) concordance.
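A key word in context (KWIC) concordance lists every occurrence of a word with a few words of context on either side, aligned on the keyword. The sketch below shows the general technique only; the function name kwic, the whitespace tokenization and the context width are assumptions for illustration and do not reproduce the format of the LOB concordance.

```python
def kwic(text, keyword, width=3):
    """Return one aligned line per occurrence of `keyword` in `text`."""
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if token.lower().strip(".,;:!?\"'") == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up.
            lines.append(f"{left:>25}  {token}  {right}")
    return lines


# Invented example sentence.
for line in kwic("The corpus is large. A corpus can be small.", "corpus", width=2):
    print(line)
```

Aligning the keyword in a column is the point of the format: it lets the reader scan the left and right contexts for recurring patterns.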

Major electronic corpora: first generation: Other first generation corpora: Indian English: the Kolhapur Corpus of Indian English (1988), which collected materials of 1978. New Zealand and Australia: the Wellington Corpus of Written New Zealand English and the Australian Corpus of English (1986), which collected texts in 1961.

Problems: 1. A size of one million words is restrictive: it is too small. 2. It is difficult to find interesting differences between regional varieties: the differences are sometimes not in structure but in the frequency with which a structure is used. 3. Additional words: each sample ends at the first sentence ending after 2,000 words. Thus the actual size of the Brown Corpus is 1,014,312 words and that of LOB is 1,006,825. By the word count of the LOB concordance, the size is 1,123,380.
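The sampling rule in point 3 explains why the totals exceed one million: every sample runs on past the target until a sentence ends. A minimal sketch of such a rule follows, assuming whitespace tokenization and treating a token ending in ., ! or ? as a sentence end; the function name cut_sample and these simplifications are illustrative assumptions, not the compilers' actual procedure.

```python
def cut_sample(text, target=2000):
    """Return the tokens up to the first sentence end at or after `target` words."""
    tokens = text.split()
    for i, token in enumerate(tokens):
        # Strip closing quotes and brackets before checking for a sentence end.
        ends_sentence = token.rstrip("\"')]").endswith((".", "!", "?"))
        if i + 1 >= target and ends_sentence:
            return tokens[:i + 1]
    return tokens  # no sentence end found after the target


# Invented text with a tiny target, just to show the overshoot.
text = "one two three. four five six seven. eight nine."
print(len(cut_sample(text, target=4)))

# Average overshoot per sample implied by the totals quoted above,
# assuming 500 samples in each corpus.
print((1_014_312 - 1_000_000) / 500)  # Brown Corpus
print((1_006_825 - 1_000_000) / 500)  # LOB Corpus
```

With the tiny target of 4 the sample stops at "seven.", the seventh word, which is the same overshoot effect at a small scale.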

London-Lund Corpus (LLC): the spoken part of the SEU, which was half written and half spoken. SEU original: 87 texts, which make up 435,000 words, plus 13 more texts. The total is 100 texts of 5,000 words each, which makes up 500,000 words. Features: less detailed prosodic analysis.

London-Lund Corpus (LLC): tone units; onset; location of nuclei; direction of nuclear tones.

London-Lund Corpus (LLC): boosters; degree of pause; degree of stress; speaker identity; simultaneous talk; contextual comment; incomprehensible words.

Corpora for special purposes: Algeo (1988): a corpus of 5 million words from the 18th century to the present, for studying Briticisms in the English language. American Heritage Intermediate Corpus (1969): 5.09 million words from 10,043 samples, each 500 words long, taken from the publications most widely read among American schoolchildren aged 7 to 15.

Corpora for special purposes: Categories: reading, English and grammar, composition, literature, mathematics, social studies, spelling, science, music, art, religion, home economics, library fiction, non-fiction, reference and magazines. Words are not lemmatized. One of the first computer-based databases for lexicographical purposes.

Other special corpora: The Nijmegen Corpus: early 70s. Goal: grammatical description of British English. Size: 132,000 words.

Other special corpora: Composition: 20,000-word extracts from each of 6 authors (120,000 words) plus 12,000 words of transcribed sports commentary. Categories: written, mainly literary, English; sports commentary. Sample span: 1962-1968. Analysis: a large set of labeled trees or phrase markers.

Other special corpora: TOSCA (Tools for Syntactic Corpus Analysis) Corpus: later than the Nijmegen Corpus. Size: 20,000 × 75 = 1,500,000 words.

Other special corpora: Categories: various fiction and nonfiction genres in written British English. Span: 1976-1986. Composition: 45 samples from 21 nonfiction genres, including (auto)biography, history, literary criticism, politics, women’s studies, chemistry, economics and physics; 30 samples from 9 fiction genres, including horror, humor, love and romance, and general fiction.

Other special corpora: Hong Kong University of Science and Technology: Size: 1,000,000 words of computer science English. 2,000-word samples from 166 English-language textbooks used in computer science courses in the early 90s. Goal: to assist the teaching of English to computer science students.

Other special corpora: Jiao Tong University Corpus for English in Science and Technology (JTEST): 1980s. 1,000,000 words from written English texts in the physical sciences, engineering and technology. Goal: to facilitate lexical analysis of particular registers, e.g. counts of high-frequency words.

Other special corpora: Guangzhou Petroleum English Corpus (GPEC): 411,000 words from 700 texts from the petroleum industry, in written American and British English of the mid 1980s. Goal: the same as JTEST.

Second generation mega-corpora: COBUILD (Collins Birmingham University International Language Database): 25% from spoken texts; reflects broadly general rather than technical language.

Second generation mega-corpora: current usage from 1960 on; naturally occurring texts; prose included but not poetry. Contributions: a commercial research and development project for dictionaries, grammars and language teaching courses.

Longman Corpus Network: three major corpora: LLELC: the Longman/Lancaster English Language Corpus; LSC: the Longman Spoken Corpus; LCLE: the Longman Corpus of Learners’ English.

British National Corpus (BNC): 100 million words of contemporary spoken and written British English. Structure: Table 2.3, p. 51. Automatic word-class tagging with CLAWS.

Issues in corpus design and compilation: static or dynamic; representativeness and balance; size; written and spoken.

Issues in corpus design and compilation: Extralinguistic variables: text origin, participants, medium, genre, style, factuality, topic, date of publication, authorship (age, gender, nationality), audience. Storage. Text capture: keyboarding, CD-ROM or electronic version, scanning (software, quality of printing). Spoken text: transcribing (conventions for transcribing prosodic phenomena: ICE project). Markup: marks for tagging (Standard Generalized Markup Language, SGML), p. 84.

Organizations and professional associations: Descriptive linguistics: the International Computer Archive of Modern English (ICAME). ICAME CD-ROM: the Brown, LOB, Kolhapur, London-Lund and Helsinki corpora, and software: WordCruncher, TACT and Free Text Browser.

Organizations and professional associations: Bibliographic overview: Humanities Computing Yearbook. Computational linguistics: the Association for Computational Linguistics (ACL). Literary studies: the Association for Computers and the Humanities (ACH); the Association for Literary and Linguistic Computing (ALLC).
