
1 Corpus Linguistics and Corpora

2 Corpus
Corpus, plural corpora: a collection of linguistic data, either compiled as written texts or as a transcription of recorded speech. The main purpose of a corpus is to verify a hypothesis about language - for example, to determine how the usage of a particular sound, word, or syntactic construction varies.

3 Corpus Linguistics
Corpus linguistics deals with the principles and practice of using corpora in language study. A computer corpus is a large body of machine-readable texts.
(cf. Crystal, David. 1992. An Encyclopedic Dictionary of Language and Languages. Oxford, 85)

4 Corpus
CORPUS (13c: from Latin corpus body. The plural is usually corpora) (1) A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse...
(cf. McArthur, Tom. 1992. "Corpus", The Oxford Companion to the English Language. Oxford, 265-266)

5 Chomsky 1957
"Any natural corpus will be skewed. Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [of language based on the corpus] would be no more than a mere list."
Syntactic Structures. The Hague, 159

6 Fillmore 1992
"I have two main observations to make. The first is that I don't think there can be any corpora, however large, that contain information about all of the areas of English lexicon and grammar that I want to explore; all that I have seen are inadequate.

7 Fillmore 1992
The second observation is that every corpus that I've had a chance to examine, however small, has taught me facts that I couldn't imagine finding out about in any other way."
"Corpus linguistics" or "Computer-aided armchair linguistics", in: Svartvik, Jan (ed.). Directions in Corpus Linguistics. Berlin/New York, 35.

8 Types of corpus
Monolingual corpora - in which the texts are all in the same language.
Parallel and/or aligned corpora - in which originals and translations are aligned so that both texts are synchronized to appear on the screen together and it is easy to see how the translator has translated the original.

9 Types of corpus
Comparable corpora - in which a selection of original texts has been made in two or more languages dealing with the same subject or genre.
Concurrent corpora - a term used to describe texts taken from newspapers on the same subject on approximately the same dates.

10 Types of corpus
Specialized corpora - texts on specialized subjects. The principal use for these corpora is the extraction of terminology and complementary explanatory material - definitions, explanations, semantic relations, etc.

11 Types of corpus
'Do-it-yourself' corpora - a term coined by those of us using small specialized corpora for the purpose of teaching translation or language.
Disposable corpora - the same as 'do-it-yourself' corpora, but taking into account that such corpora need to be disposed of after use so that their users do not get into trouble with copyright restrictions.
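
The parallel and/or aligned corpora described above pair each original sentence with its translation. A minimal sketch of that idea follows, assuming the alignment already exists as a list of sentence pairs. The three EN/PT pairs and the function name `parallel_concordance` are made up for illustration and are not taken from any real corpus or tool.

```python
import re

# Hypothetical, hand-made aligned pairs (original, translation).
aligned = [
    ("The house is old.", "A casa é velha."),
    ("She reads the book.", "Ela lê o livro."),
    ("The book is on the table.", "O livro está na mesa."),
]

def parallel_concordance(pairs, keyword):
    """Return every aligned pair whose original contains `keyword` as a whole word."""
    keyword = keyword.lower()
    hits = []
    for source, target in pairs:
        words = re.findall(r"\w+", source.lower())
        if keyword in words:
            hits.append((source, target))
    return hits

for source, target in parallel_concordance(aligned, "book"):
    print(source, "|", target)
```

Showing the two sides of each hit together is what makes it easy to see how the translator handled the original.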

12 How do you search a corpus?
Concordancing
Sentence level – see BNC http://www.natcorp.ox.ac.uk
COMPARA – parallel concordance http://www.linguateca.pt/COMPARA
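
Concordancing can be illustrated with a small keyword-in-context (KWIC) routine. This is only a sketch under simple assumptions: the text is a plain string, tokens are runs of word characters, and the function name `kwic` and the sample text are invented for the example. It is not how the BNC or COMPARA interfaces are implemented.

```python
import re

def kwic(text, keyword, width=3):
    """List each hit of `keyword` with up to `width` tokens of context on either side."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

sample = "The corpus is large. A small corpus can still teach us facts."
for line in kwic(sample, "corpus"):
    print(line)
```

Each output line centres one occurrence of the search word, which is the basic display a concordancer gives.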

13 The Survey of English Usage
1960s - Randolph Quirk et al. launched the Survey of English Usage (SEU) "with the aim of collecting a large and stylistically varied corpus as the basis for a systematic description of spoken and written English"

14 The Survey of English Usage
Brown, Lancaster-Oslo/Bergen (LOB) and London-Lund Corpus of Spoken English
See ICAME - International Computer Archive of Modern and Medieval English at the Norwegian Computing Centre for the Humanities at http://gandalf.aksis.uib.no/icame.html

15 The Survey of English Usage
Today at University of London at http://www.ucl.ac.uk/english-usage/
ICE - the International Corpus of English
Download the sampler of this corpus fully tagged and analysed from http://www.ucl.ac.uk/english-usage/ice-gb/sampler/form.htm

16 Quality versus quantity
Small but fully analyzed and tagged corpora - e.g. early corpora and ICE (1 million words)
British National Corpus – 100 million words
Other corpora
  Bank of English - 450 million words
The Internet

17 Corpora, lexicography & terminology
Lexicography BEFORE corpora
  Emphasis on etymology
  Complex definitions
  Usage based on intuitions of lexicographers
Terminology BEFORE corpora
  Standardization > one word = one concept, rigid definitions
  Paper dictionaries/glossaries

18 Corpora, lexicography & terminology
Lexicography & terminology AFTER corpora
  Emphasis on modern usage in context
  Simple definitions
  Usage based on evidence in texts
  Emphasis on establishing REAL rather than IDEAL usage

19 COBUILD project
Begun in 1969
Collins, the well-known dictionary publisher, and the University of Birmingham – led by John Sinclair
A pioneering project
Objective > to collect texts for a corpus of contemporary texts from which to extract information on modern English usage
Work proceeded during the 70s and 80s - see Sinclair (Ed.) 1987

20 COBUILD > Bank of English
Present site for COBUILD > Bank of English http://www.titania.bham.ac.uk/docs/about.htm

21 British National Corpus (BNC) - original
Oxford University Computing Service at http://www.natcorp.ox.ac.uk/
This is completely free – but you only get up to 50 results

22 Brigham Young University (BYU)
http://corpus.byu.edu/
Note:
  Corpus of American English
  BNC
  TIME corpus
  Corpus de Português
  Corpus de Español

23 Brigham Young University (BYU)
PLEASE NOTE: You will need to create a username and password to use this – but it costs nothing

24 BNC – CQP version
Lancaster University http://bncweb.lancs.ac.uk/bncwebSignup/
PLEASE NOTE: You will need to create a username and password to use this – but it costs nothing

25 Other large monolingual corpora
Portuguese > CETEMPUBLICO http://www.linguateca.pt/cetempublico/
Spanish > Real Academia
German > Mannheimer corpus

26 Using corpora to study syntax
For example:
  whether certain nouns occur more often in the singular than in the plural
  how pronouns are used in different languages
  which verbs favour certain forms of tense, aspect or mood
  how adjectives combine with nouns
  where adjuncts occur in sentences
  etc.
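
The first question in the list, whether a noun is more frequent in the singular or the plural, comes down to counting word forms. The sketch below assumes an untagged plain-text sample and that the caller supplies both forms; the function name `singular_plural_counts` and the sample sentences are invented for illustration. A real study would normally rely on a tagged or lemmatized corpus instead.

```python
import re
from collections import Counter

def singular_plural_counts(text, singular, plural):
    """Count how often the given singular and plural forms occur in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return counts[singular], counts[plural]

sample = (
    "One corpus is small. Two corpora are larger. "
    "Every corpus teaches us something."
)
print(singular_plural_counts(sample, "corpus", "corpora"))
```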

27 Monolingual corpora
General language corpora are useful for studying:
  Words in context
  Problems of COLLOCATION
  Relative usage of synonyms
  Syntactic structures
  Sentence structure
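
Collocation can be explored by counting which words appear next to a chosen node word. The following is a rough sketch that uses raw co-occurrence counts within a fixed window. The function name `collocates`, the window size, and the sample sentence are assumptions made for the example, and no statistical association measure is applied.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count the words found within `window` tokens of each occurrence of `node`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

sample = "strong tea and strong coffee but powerful computers and strong tea"
strong = collocates(sample, "strong", window=1)
powerful = collocates(sample, "powerful", window=1)
print(strong["tea"], powerful["tea"])
```

Comparing the neighbours of two near-synonyms in this way, here "strong" and "powerful", is one simple route into the relative usage of synonyms.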

28 Parallel Corpora - multilingual
European Commission - Multilingual http://ec.europa.eu/
EUROPARL - Multilingual http://www.statmt.org/europarl/
ELDA http://www.elda.org/sommaire.php

29 Parallel Corpora
COMPARA EN/PT http://www.linguateca.pt/compara

30 Corpógrafo - LINGUATECA
An on-line suite of tools we have developed for:
  Construction of corpora
  Semi-automatic extraction of terminology
  Construction of terminology databases
  Terminology & corpora research
  Research into information retrieval and knowledge engineering

31 CORPÓGRAFO
http://www.linguateca.pt/corpografo
FREE!
On-line!
For individual research

32 Bibliography
ICAME site at http://helmer.aksis.uib.no/icame.html
BIBER, D., CONRAD, S. & REPPEN, R. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.
BIBER, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad & Edward Finegan. 1999. Longman Grammar of Spoken and Written English. Harlow: Pearson Education Ltd.

33 Bibliography
HOEY, Michael. 1991. Patterns of Lexis in Text. Oxford: Oxford University Press. ISBN 0 19 437142 5.
MCENERY, Tony & WILSON, Andrew. 2001. Corpus Linguistics. 2nd Edition. Edinburgh: Edinburgh University Press.
OAKES, Michael P. 1998. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press. ISBN 0 7486 0817 6.
SINCLAIR, John (ed.). 1987. Looking Up - An account of the COBUILD project in lexical computing. Collins COBUILD. Collins ELT: London and Glasgow.
STUBBS, Michael. 1996. Text and Corpus Analysis: Computer-assisted Studies of Language and Culture. Oxford: Blackwell Publications Ltd. ISBN 0-631-19512-2 (pbk).

