
Dr. Radhika Mamidi Corpus. What is a Corpus? a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically.




1 Dr. Radhika Mamidi Corpus

2 What is a Corpus?
A corpus (plural corpora), or text corpus, is a large and structured set of texts (now usually electronically stored and processed).
It is used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules.

3 Corpus
A corpus is:
- Spoken (transcribed) or written
- In any language
- Usually naturally occurring
- Stored electronically
- Searched using dedicated software
The data is processed using the techniques of frequency, phraseology and collocations.
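To make the frequency and collocation techniques concrete, here is a minimal sketch using only the Python standard library. The sample text, the function names (`tokenize`, `collocates`) and the window size are assumptions made for illustration; real corpus tools use far larger texts and more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def collocates(tokens, node, window=2):
    """Count the words that occur within `window` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

# Invented sample text standing in for a corpus (assumption, not real corpus data).
sample = ("I offered him a cup of tea. She made a nice cup of tea. "
          "He drank a cup of coffee.")
tokens = tokenize(sample)

freq = Counter(tokens)                  # word frequency list
print(freq.most_common(5))
print(collocates(tokens, "cup").most_common(5))
```

Raw co-occurrence counts like these favour very common words such as "a" and "of"; corpus software usually adds a statistical association measure on top, which this sketch leaves out.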

4 Some notable text corpora
English language:
- Bank of English
- British National Corpus [BNC]
- Brown Corpus
- Lancaster Oslo Bergen [LOB]
- International Corpus of English
- Oxford English Corpus
- Scottish Corpus of Texts & Speech

5 Types of Corpora
A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).
- Specialized Corpus
- General Corpus
- Learner's Corpus
- Pedagogic Corpus
- Diachronic Corpus
- Monitor Corpus
- Comparable Corpus
- Parallel Corpus
Monolingual / Bilingual

6 Concordance lines
- Concordance lines show every instance of the word you have asked for (or a sample of these), with a few words before and after.
- They can be sorted to put together similar co-texts.
- They encourage observation of recurring patterns (‘samenesses’).
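The idea of extracting and sorting concordance lines can be sketched in a few lines of Python. This is a minimal illustration, not the behaviour of any particular concordancer: the function names, the context width and the sample text are all assumptions.

```python
import re

def concordance(text, phrase, width=30):
    """Return (left, match, right) context triples for each occurrence of `phrase`."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

def sort_by_left_word(lines):
    """Sort lines by the word immediately before the match, grouping similar co-texts."""
    def key(line):
        words = line[0].split()
        return words[-1].lower() if words else ""
    return sorted(lines, key=key)

def format_line(line, width=30):
    """Right-align the left context so that the search phrase lines up in a column."""
    left, match, right = line
    return f"{left.strip():>{width}}  {match}  {right.strip()}"

# Invented sample text (assumption); a real concordancer would search a whole corpus.
text = ("This is not my cup of tea. I will have a nice cup of tea with him. "
        "She is not everyone's cup of tea. He had a quick cup of tea.")

sorted_lines = sort_by_left_word(concordance(text, "cup of tea"))
for line in sorted_lines:
    print(format_line(line))
```

Sorting on the word to the left brings lines such as "my cup of tea" and "everyone's cup of tea" next to each other, which is what makes recurring patterns easy to spot.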

7 Example: Cup of tea (1)
and we'll discuss it over a cup of tea." He handed back the ID and
in and ask me to make her a cup of tea. When I refuse she'll say
she wouldn't even accept a cup of tea because she didn't have the
where to begin. I offered him a cup of tea and he blurted out: `I will
trolley approaches and a cup of tea is set down on her locker. T
play quietly, while I have a cup of tea, I'll cook you some chips f
on the sofa enjoying a nice cup of tea. Since I've done all this on
much. I will have a nice cup of tea with him before the game and
into my uniform, have a quick cup of tea, and then get breakfast read
for a quick, or not so quick, cup of tea to return the compliment in
a chat about old times over a cup of tea and a biscuit." Eurosta
She sat down and picked up the cup of tea I'd poured for her. She dran
I sure as hell remembered the cup of tea because I mean because it a

8 Example: Cup of tea (2)
hours on motorways is not my cup of tea, but I do like visiting new p
her. `No, really, she's not my cup of tea. But the powerful deputy edi
lecturers were more my cup of tea than homicidally tanked-up l
of Ruby -- she's not everyone's cup of tea. By the way, I understand yo
marketing may not be everyone's cup of tea. There's an old advertising
two This is much more Linda's cup of tea: a three-bedroom, brand-new
play. This won't be everybody's cup of tea; but you'd be hard pushed to
which are not everybody's cup of tea. And the annual management c
catching. But if Leo isn't your cup of tea, you might like AMERICAN BE
have been here. It's more your cup of tea, as it were, with its High C
was saying. `Not quite your cup of tea, isn't that what you say?" S

9 Uses of Concordance lines
Concordance lines make recurrences of pattern apparent.
They encourage us to see that:
- Pattern and meaning are associated
- Many words and phrases occur in a restricted set of contexts
They encourage us to make unexpected connections between items.

10 Use of Corpus: Language teaching
Language teaching: The most frequent words with their most frequent senses are taught; grammar patterns are studied using the concordance lines; and the difference between easily confused pairs is noted.
E.g. interested and interesting
- interested is used in the phrase ‘interested in’, and the pattern ‘someone is interested in something’ is more frequent.
- interesting is nearly always used before a noun, and the pattern ‘an interesting thing’ is more frequent.
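One way to check such a claim about a confused pair is to count which word follows each form. The sketch below does this with the standard library; the sample sentences are invented for illustration and the function name `next_word_counts` is an assumption. It only counts the following word and does not verify that the word is a noun, which would need part-of-speech information.

```python
import re
from collections import Counter

def next_word_counts(text, target):
    """Count the words that immediately follow `target` in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target)

# Invented example sentences (assumption); a teacher would query a real corpus.
sample = ("She is interested in music. He was interested in the job. "
          "It was an interesting book. They had an interesting idea. "
          "I am interested in history.")

print(next_word_counts(sample, "interested"))
print(next_word_counts(sample, "interesting"))
```

On this toy sample every instance of "interested" is followed by "in", while "interesting" is followed by different nouns, mirroring the two patterns described above.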

11 Use of Corpus: Dictionary making
- New words, phrases and collocations, new meanings of old words, and real examples are added using a corpus.
- Frequency plays an important role in making the entries for each headword.
- The 1st and 2nd editions of the Longman Dictionary of Contemporary English were written without using a corpus; the 3rd edition (1995) was written using one.
- In this edition you will find example sentences from the corpus as well as a larger number of senses.
Example:
- New words: internet
- New meanings: file, folder, save
- New shades of meaning: ‘I know’

12 Use of Corpus: Translation
Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.
Aligned parallel corpora are useful for translators to study source-language (SL) and target-language (TL) equivalents.
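A sentence-aligned parallel corpus can be pictured as a list of (SL sentence, TL sentence) pairs. The sketch below is a minimal illustration under that assumption: the English-French pairs are invented, and `find_aligned` is a hypothetical helper, not part of any corpus tool.

```python
import re

# Invented sentence-aligned pairs: (source-language sentence, target-language sentence).
pairs = [
    ("I drink tea.", "Je bois du thé."),
    ("The tea is hot.", "Le thé est chaud."),
    ("She reads a book.", "Elle lit un livre."),
]

def find_aligned(pairs, sl_word):
    """Return the aligned pairs whose source sentence contains `sl_word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(sl_word) + r"\b", re.IGNORECASE)
    return [(sl, tl) for sl, tl in pairs if pattern.search(sl)]

# Show every source sentence containing "tea" next to its aligned translation.
for sl, tl in find_aligned(pairs, "tea"):
    print(f"{sl:<20} | {tl}")
```

Reading the two columns side by side lets a translator see how the same SL word is rendered in different TL contexts.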

13 Use of Corpus: Natural Language Processing
- To make corpora more useful for linguistic research, they are often subjected to a process known as annotation.
- An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags.
- Other types of annotation: syntactic, semantic and discourse.
- Such annotated corpora are used to build NLP tools like POS taggers, syntactic parsers or semantic analyzers.
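As an illustration of what POS annotation looks like in practice, the sketch below reads one sentence stored in a plain-text `word/TAG` layout and queries it by tag. The layout, the tagged sentence and the function names are assumptions chosen for the example; the tag labels follow the Universal POS tag set, and other corpora use different tag sets and file formats.

```python
from collections import Counter

def parse_tagged(line):
    """Split a 'word/TAG word/TAG ...' line into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last slash, so a word containing '/' still parses.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

def words_with_tag(pairs, tag):
    """Return the words carrying the given part-of-speech tag, in text order."""
    return [word for word, t in pairs if t == tag]

# Hand-tagged illustrative sentence (assumption, not taken from a real corpus).
tagged = "She/PRON poured/VERB a/DET cup/NOUN of/ADP tea/NOUN ./PUNCT"

pairs = parse_tagged(tagged)
tag_counts = Counter(tag for _, tag in pairs)
print(tag_counts)
print(words_with_tag(pairs, "NOUN"))
```

Once tags are attached to every token, searches such as "all nouns" or "every verb followed by a determiner" become simple filters over the pairs.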

