
What is a corpus?*
A corpus is defined in terms of both form and purpose. The word corpus is used to describe a collection of examples of language gathered for linguistic study. It can also describe collections of texts stored and accessed electronically (Hunston, 2002). Corpus planning and design always serve some linguistic purpose. It is on this basis that texts are selected and stored, so that they can be studied quantitatively and qualitatively.
*Ref. text: Hunston, S. (2002). Corpora in Applied Linguistics.

What are corpora used for?
Corpora are often used for language teaching and learning. They give information about how a language works, and they make it possible to calculate the relative frequency of different features. Exploring corpora can help students to observe nuances of usage and to make comparisons between languages. Corpora are also used to investigate cultural attitudes expressed through language. NB: a corpus will not tell you whether something is possible or not, only whether it is frequent or not!

Using corpora in translation
Corpora are also used in translation. Comparable corpora allow the use of apparent equivalents to be compared. Parallel corpora show how words and phrases have been translated in the past. General corpora can be used to establish norms of frequency and usage.

What can a corpus do?
Corpus access software is used to rearrange the information which has been stored, so that observations of various kinds can be made. It is not the corpus itself which gives new information about language; it is the software which gives new perspectives on what is already familiar. Software packages process the data to show frequency, phraseology and collocation.

Frequency
Corpus processing allows words to be compared by means of frequency lists. Grammar words are more frequent than lexical words, which is why they are found at the top of the list. Frequency lists can be useful for identifying differences between corpora, but comparisons can be made only if the corpora are comparable, i.e. if their length is approximately the same.
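As a rough illustration of how a frequency list is built, the sketch below counts the words in a short invented sample and also reports each count per 1,000 tokens. The function names, the regular-expression tokenizer and the sample sentence are assumptions made for this example; they are not taken from any particular corpus package.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text):
    """Return (word, raw count, count per 1,000 tokens) tuples, most frequent first."""
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter(tokens)
    return [(word, n, 1000 * n / total) for word, n in counts.most_common()]

# Invented sample text, used only to show the shape of the output.
sample = "The cat sat on the mat. The dog sat on the cat."
for word, n, per_thousand in frequency_list(sample)[:3]:
    print(f"{word:<5} {n:>3} {per_thousand:8.1f}")
```

In this sample the grammar word "the" heads the list, which matches the observation above. The per-1,000 figure is one simple way of expressing a relative frequency.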

Concordance
The most common way to access a corpus is through a concordancing program. Concordance lines bring together instances of use of a word or phrase, so that regularities in use can be observed. Concordances also help in understanding how nouns or adjectives are used.
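A minimal sketch of what a concordancer does is given below: it finds every occurrence of a keyword and prints it with a few words of context on either side (a "keyword in context" display). The function name `concordance`, the `width` parameter and the sample sentences are hypothetical; real concordancing programs offer far more options.

```python
import re

def concordance(text, keyword, width=3):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keyword lines up in a column.
            lines.append(f"{left:>25}  [{token}]  {right}")
    return lines

# Invented sample sentences, used only for illustration.
sample = ("New research may shed light on the problem. "
          "She did not shed tears over it. "
          "The firm plans to shed jobs next year.")
for line in concordance(sample, "shed"):
    print(line)
```

Because the keyword is aligned in a single column, the words immediately to its right (light, tears, jobs) can be scanned at a glance, which is how regularities in use become visible.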

Collocation
Collocation is the tendency of words to co-occur. The collocates of a given word are those words which often occur in conjunction with it. Collocation can indicate pairs of lexical items, or the association between a lexical word and its frequent grammatical environment. In the latter case, the term used is colligation.
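The idea of collocates can be sketched by counting which words appear within a small window around a node word. The function name `collocates`, the `span` parameter and the sample text are assumptions for this example. For simplicity the window ignores sentence boundaries, and the result is raw co-occurrence counts rather than a statistical measure.

```python
from collections import Counter
import re

def collocates(text, node, span=2):
    """Count the words occurring within `span` tokens either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Tokens before and after the node; the window may cross sentence boundaries.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text, used only for illustration.
sample = ("They shed tears at the news. He shed tears of joy. "
          "The study shed light on the causes.")
print(collocates(sample, "shed").most_common(3))
```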

Types of corpora
A corpus is designed for a particular purpose. Consequently, the type of corpus depends on its purpose:
Specialized corpus
General corpus
Comparable corpora
Parallel corpora
Learner corpus
Historical or diachronic corpus
Monitor corpus

Specialized corpus: a corpus of texts of a particular type (editorials, academic articles, lectures, essays, etc.). Specialized corpora reflect the type of language a researcher wants to explore. The corpus may also be restricted to a time frame, a social setting or a given topic.

General corpus: a corpus of texts of many types, of written or spoken language, or of both. A general corpus is usually much larger than a specialized corpus. Since it can be used to produce reference materials, it is sometimes called a reference corpus.

Comparable corpora: two or more corpora in different languages, or in different varieties of a language. They are designed to contain the same proportion of each text type (e.g. newspaper texts, essays, novels, conversations). They can be used by translators and learners to identify differences and equivalences in each language.

Parallel corpora: two or more corpora in different languages, containing translated texts, or texts produced simultaneously in two or more languages (e.g. EU texts). They can be used by translators and learners to find potential equivalents in each language, and to investigate differences between languages.

Learner corpus: a collection of texts produced by learners of a language. It is used to identify differences among learners, the frequency and type of mistakes, etc.

Historical or diachronic corpus: a corpus of texts from different periods of time. It helps to trace the development of a language over time.

Monitor corpus: a corpus used to track current changes in a language. It grows rapidly in size, since new texts are added annually, monthly, daily, etc. The proportion of text types has to remain constant, so that each year is comparable with every other.

The use of corpora is not limited to identifying, quantifying and analyzing keywords. Concordance lines offer many instances of use of a word or phrase, so that the user can observe regularities in use through several examples of the same word or phrase in its natural context. Calculating collocation means finding the statistical tendency of words to co-occur, and collocations can also highlight metaphorical uses. A good example is the set of collocates of the word shed: light, tears, blood, pounds, confidence, hair, skin, labour. In these contexts shed is a verb, and its Italian equivalent varies with the collocate.

English            Italian equivalent
Shed light         fare/gettare luce
Shed tears         spargere lacrime
Shed blood         spargere/versare sangue
Shed pounds        perdere chili/peso
Shed skin          perdere/mutare la pelle (fare la muda)
Shed confidence    ispirare fiducia
Shed hair          perdere il pelo
Shed labour        disfarsi della manodopera (licenziare)
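The "statistical tendency of words to co-occur" can be quantified in several ways. The slides do not name a measure, so the sketch below uses pointwise mutual information (PMI) purely as an assumed example: it compares how often two words occur next to each other with how often they would do so if they were independent. The function name `pmi` and the sample text are hypothetical.

```python
import math
import re
from collections import Counter

def pmi(text, w1, w2):
    """Pointwise mutual information (base 2) of w1 immediately followed by w2."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs
    pair = bigrams[(w1, w2)]
    if pair == 0:
        return float("-inf")  # the pair never occurs in this text
    p_pair = pair / (n - 1)   # there are n - 1 adjacent pairs in n tokens
    p1 = unigrams[w1] / n
    p2 = unigrams[w2] / n
    return math.log2(p_pair / (p1 * p2))

# Invented sample text, far too small for a meaningful score.
sample = "He shed light on it and she shed tears."
print(round(pmi(sample, "shed", "light"), 2))
```

A positive score means the pair occurs more often than chance would predict. On a sample this small the figure is not meaningful; the measure only becomes informative on a corpus of realistic size.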

Key terms
Type
Token
Hapax
Lemma
Word-form
Tagging
Parsing
Annotate

Tokens: the term is used for the running words which are counted in a corpus or in a given text. Many of these words occur more than once, so if we count each repeated item once only, the total changes: this second figure is the number of types. A given text may have, for instance, 250 tokens but 194 types (articles, repeated nouns, etc. are counted once only).

Hapax legomena, or hapaxes, are those words which occur only once.

We may also have words which occur in two (or more) different forms: friend and friends, for instance. These are two word-forms which belong to the same lemma. The same holds for go, goes, going, went, gone: five word-forms which belong to the same lemma, go. This implies that when a lemma is used as a keyword, all its different word-forms have to be searched for.
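The distinctions between tokens, types, hapaxes and lemmas can be made concrete with a short sketch. The function name `corpus_counts`, the tiny hand-made `LEMMAS` table and the sample text are all hypothetical. A real lemmatizer would use a full lexicon or morphological rules, not a five-entry dictionary.

```python
from collections import Counter
import re

# Hypothetical, hand-made lemma table covering only the examples in the text.
LEMMAS = {"goes": "go", "going": "go", "went": "go", "gone": "go",
          "friends": "friend"}

def corpus_counts(text):
    """Count tokens and types, list the hapaxes, and count occurrences per lemma."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),                      # every running word
        "types": len(counts),                       # each distinct word-form once
        "hapaxes": sorted(w for w, n in counts.items() if n == 1),
        "lemmas": Counter(LEMMAS.get(t, t) for t in tokens),
    }

# Invented sample text, used only for illustration.
sample = "My friend goes home. My friends went home. I go too."
stats = corpus_counts(sample)
print(stats["tokens"], stats["types"], stats["hapaxes"])
print(stats["lemmas"]["go"], stats["lemmas"]["friend"])
```

Here goes, went and go are three different types, but they are counted together under the single lemma go.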

Usually word-forms are considered to belong to the same lemma when they belong to the same word-class (verb, noun, adjective, etc.).

Tagging usually refers to the addition of a code to each word in a corpus to indicate its part of speech. Automatic tagging is possible, but not fully accurate. Tagging is useful when you want to look at different word categories. For instance, the noun work can be considered separately from the verb work.
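To show why tags are useful, the sketch below stores a sentence as (word, tag) pairs and retrieves the noun work separately from the verb work. The tags were assigned by hand for this example, and the labels (NOUN, VERB, etc.) are illustrative and do not belong to a specific tagset. No automatic tagger is involved, and the function name `find_tagged` is hypothetical.

```python
# Hypothetical hand-tagged sentence: each token is a (word, tag) pair.
# The tag labels are illustrative and do not follow a particular tagset.
tagged = [
    ("They", "PRON"), ("work", "VERB"), ("hard", "ADV"), ("because", "CONJ"),
    ("the", "DET"), ("work", "NOUN"), ("is", "VERB"), ("urgent", "ADJ"),
]

def find_tagged(tagged_tokens, word, tag):
    """Return the positions at which `word` occurs with the given tag."""
    return [i for i, (w, t) in enumerate(tagged_tokens)
            if w.lower() == word.lower() and t == tag]

print(find_tagged(tagged, "work", "NOUN"))
print(find_tagged(tagged, "work", "VERB"))
```

A search on the untagged text would return both occurrences of work together; the tags are what allow the two word-classes to be separated.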

Corpus parsing is the analysis of a text into its constituents, for instance clauses and groups. This allows you to analyse the different structures in a corpus. Like tagging, parsing can be done automatically, though the output is not very accurate, and manual editing is often necessary.