Searching corpora.


Review: word lists. Form: grammatical words are the most frequent. The lower the frequency, the more word types share it. For example, there will be more words (types) occurring 4 times in a corpus than words occurring 5 times.
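The two points in this review can be checked on any text. The following is a minimal sketch, assuming a very simple tokenizer (lowercased runs of letters and apostrophes); the function names and the sample sentence are made up for illustration and are not from the lecture.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_list(text):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()


def frequency_spectrum(text):
    """Map each frequency to the number of word types that have it."""
    type_counts = Counter(tokenize(text))
    return Counter(type_counts.values())


# Toy text, invented for illustration only.
sample = "the cat saw the dog and the dog saw a bird"
print(word_list(sample)[:3])
print(frequency_spectrum(sample))
```

Even in this tiny sample the grammatical word "the" tops the list, and more types occur once than occur twice.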

Uses of wordlists: Determine frequency bands for English: the most frequent 1000 words, an academic “band”, etc. Such frequency bands can be used in course planning and to judge the difficulty of a text.

Uses of wordlists: Determine the core vocabulary for a particular topic, such as academic English or business English. Produce a wordlist for a business English corpus and use a stoplist (to remove grammatical words). Do a corpus comparison.

Corpus comparison table
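As a rough illustration of how a stoplist and a corpus comparison fit together, here is a minimal sketch. It assumes a tiny hand-made stoplist and ranks words by a simple ratio of relative frequencies; this ratio is only one possible measure and is not necessarily the one behind the table in the slides. All names and the toy data are hypothetical.

```python
from collections import Counter

# Illustrative stoplist only; a real one would be much longer.
STOPLIST = {"the", "a", "of", "and", "to", "in", "is"}


def content_counts(tokens, stoplist=STOPLIST):
    """Count tokens after removing grammatical words on the stoplist."""
    return Counter(t for t in tokens if t not in stoplist)


def compare(target_tokens, reference_tokens, stoplist=STOPLIST):
    """Rank target-corpus words by relative frequency against a reference."""
    target = content_counts(target_tokens, stoplist)
    reference = content_counts(reference_tokens, stoplist)
    t_total = sum(target.values())
    r_total = sum(reference.values())
    rows = []
    for word, count in target.items():
        t_rel = count / t_total
        # Add 1 to the reference count so unseen words do not divide by zero.
        r_rel = (reference[word] + 1) / (r_total + 1)
        rows.append((word, count, reference[word], t_rel / r_rel))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Toy corpora, invented for illustration only.
business = "the profit of the company and the profit margin".split()
general = "the dog and the cat saw the company of birds".split()
for word, t_count, r_count, ratio in compare(business, general):
    print(f"{word:10} {t_count:3} {r_count:3} {ratio:.2f}")
```

Words at the top of the ranking are relatively more frequent in the target corpus than in the reference corpus, which is the kind of evidence used to pick out topic-specific vocabulary.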

How to search: using concordance software. Corpora are large and computer-aided analysis is necessary. Concordance software is the most common type of text analysis software. Concordance software produces frequency lists from a corpus and allows searches for words or phrases. As I mentioned earlier, a corpus may consist of millions of words, which means that we need some help in manipulating the data. We cannot look through such large amounts of data by hand; we need computer-aided analysis. Concordance software is the most common type of text analysis software. (I like to design and create text analysis software, but the basic ideas were around even before computers existed.) Typically concordance software will produce a word frequency list from a corpus; I won’t say more about that unless there are some questions about it at the end. And concordance software allows searches for words or phrases.
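The core of a word search can be sketched in a few lines. This is a minimal key-word-in-context (KWIC) sketch, assuming the corpus has already been split into tokens; the function names, the context width and the sample sentence are illustrative assumptions, not a description of any particular concordancer.

```python
def concordance(tokens, node, width=4):
    """Return (left context, node, right context) for every hit of node."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, token, right))
    return lines


def show(lines):
    """Print concordance lines with the node word aligned in a column."""
    for left, node, right in lines:
        print(f"{' '.join(left):>30}  {node}  {' '.join(right)}")


# Toy text, invented for illustration only.
text = "one thing to do is the thing that puzzles me".split()
show(concordance(text, "thing", width=2))
```

Aligning the node word in one column is what lets repeated patterns to its left and right stand out visually.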

Searching a corpus

Info easily obtained from corpora: word lists and word frequency (What are the most common words in a corpus?); word combinations (collocations) such as high risk, low maintenance, deep passion; grammatical constructions (How common is the passive? How often does the passive occur with a by-phrase? He was seen by the doctor.); What kinds of texts/genres are associated with particular collocations/constructions? word lists --

Basic procedure of corpus analysis: very simple. Searching for words and phrases; sorting the results; obtaining frequency information; performing the analysis (Sinclair reading). This simple procedure covers the basics of corpus analysis. It involves searching for words and phrases, sorting the results, and obtaining frequency information. You can treat a corpus as a reference like a dictionary or grammar book. Text analysis software will allow you to query the corpus and find out about usage. As I said, those are the basics of corpus analysis. There are more sophisticated searches and frequency analyses but they are variations of what you have seen. Working with corpora tends to change the way that people think about language. Let me give you a sense of that change.

Results - sorted: We have collected all the instances of thing and we could just look through them all, but there are quite a lot of them. We can get a clearer idea of whether puzzles follows thing often by sorting the results in alphabetical order of the word following the key word. Let us do a 1st Right Search and scroll down the list until we locate words starting with P. We can see that puzzles is not common. We can also look at the one thing pattern, by sorting 1st left. There will be many instances of one thing and so we can sort these instances. Let us sort primarily 1st left and secondarily 1st Right. From these results you can see that repeated patterns stand out visually. You can see that the phrases one thing to and one thing is are common.
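The sorting described here amounts to ordering concordance lines by the word in a given position. Below is a minimal sketch, assuming each line is stored as a (left context, node, right context) tuple; the helper names and the three concordance lines are invented for illustration.

```python
def first_left(line):
    """Word immediately before the node (empty string if there is none)."""
    left, _, _ = line
    return left[-1].lower() if left else ""


def first_right(line):
    """Word immediately after the node (empty string if there is none)."""
    _, _, right = line
    return right[0].lower() if right else ""


def sort_lines(lines, primary=first_right, secondary=None):
    """Sort concordance lines by a primary and optional secondary position."""
    if secondary is None:
        return sorted(lines, key=primary)
    return sorted(lines, key=lambda line: (primary(line), secondary(line)))


# Made-up concordance lines for the node word "thing".
lines = [
    (["the", "one"], "thing", ["to", "do"]),
    (["the", "only"], "thing", ["that", "puzzles"]),
    (["but", "one"], "thing", ["is", "clear"]),
]
by_left_then_right = sort_lines(lines, first_left, first_right)
for left, node, right in by_left_then_right:
    print(" ".join(left), node, " ".join(right))
```

Sorting primarily 1st left and secondarily 1st right groups all the one thing lines together and orders them by the following word, so one thing is and one thing to end up side by side.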

Results - surrounding words: We are interested in the words that occur with thing and we can make the software count all the surrounding words and give us a table of results. The display shows all the totals for the words preceding thing; those words are given in the column 1st left. Looking at the list we can see the word one. We call the common associated words collocates. Thus the word one is a collocate of thing. We are also interested in the words following thing. These are listed in the column 1st Right. We see that one thing is quite common, but thing followed by puzzles is not so common. If we look at the results in some detail we find that one thing to do and one thing is are frequent phrases.
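The table of surrounding words can be sketched as two counters, one for the 1st left position and one for the 1st right position. This is a minimal sketch under the assumption that the corpus is a list of lowercased tokens; the function name and the toy sentence are hypothetical.

```python
from collections import Counter


def collocate_table(tokens, node):
    """Count the words immediately left and right of each hit of node."""
    left, right = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1
    return left, right


# Toy text, invented for illustration only.
tokens = "one thing to do and one thing is sure the thing puzzles me".split()
left, right = collocate_table(tokens, "thing")
print("1st left: ", left.most_common())
print("1st right:", right.most_common())
```

In the toy data, one appears twice in the 1st left column while puzzles appears only once in the 1st right column, mirroring the kind of contrast the results table shows.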

Collocates and collocations? The collocates of a word are the words that frequently co-occur with it. If office is the node word, then post, head and take are examples of collocates. head office is a collocation.

Why are collocations a big deal? There are several answers: knowing a language involves knowing lots and lots of collocations; it is unclear how to deal with collocations within grammar; collocates provide clues to the meaning or connotations of a word, e.g., husbands and wives.

husband and wife

Corpus data: collocates of high

Corpus data: evidence of extensive multi-word units. Various names: collocations, fixed expressions, chunks, pre-fabricated units (prefabs), lexical bundles.

Corpus data. thing: sort of thing, kind of thing, the thing to do, the thing is. change: change in attitude, change of attitude, change of heart, change in policy, change over time, time for a change, pace of change, rate of change, subject to change.

Issues in interpreting concordance lines (Hunston): Nature of the search term: word, lemma, phrase. Some unwanted concordance lines may be deleted. Sorting (in different ways) brings out patterns in the data. Look first at the words surrounding the search term. In some cases, a larger co-text must be examined.

Issues in interpreting concordance lines (Hunston): Typically, a large number of lines are retrieved. One technique is to look closely at a few concordance lines and try to draw some generalisations. Then look at another set of concordance lines to see if your generalisation holds. What is typical? What is central (prototypical)?

Techniques in interpreting concordance lines: Examine the frequent collocates: what are the larger (formal) patterns? Examine or apply part-of-speech categories: do patterns emerge if POS data is taken into account? Check for semantic categories: colour terms, hedges, …

Corpus view of language: Researchers who work with corpora tend to come to similar views about the nature of language. More attention is paid to the words (lexis and phraseology). This has led to lexical approaches to language teaching (Willis, Lewis, McCarthy).

Words and grammar: Language is traditionally divided into the lexicon (words) and grammar (a set of rules). But where, for example, does the phrase “sort of thing” or “the thing to do” fit within this view of language? Traditionally language is analysed in terms of words (the lexicon) and a set of rules (the grammar). We saw that one thing to do or one thing to V is quite common in English. Is this pattern a part of the lexicon or a part of grammar? It is hard to tell: it seems too large and too variable to be a part of the lexicon. On the other hand, it doesn’t seem to be a good example of a grammatical rule. It involves words and it is not very general. Researchers who work with corpora see that language contains a lot of these semi-lexical, semi-grammatical patterns. Let me give one more example.

View of language: the corpus view is that language/grammar is a vast network of lexical/grammatical relations: collocations (high standards, high on drugs); verb-object co-selection (lose - job, find - employment, made - redundant); constructions such as the passive (or BE + adjective/participle).

Representing a corpus-based grammar: Quite difficult. Use schemas to represent words, collocations and constructions: [post office], [N N], [the thing to V], [SUBJ Vmanner POSS way PATH].
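One way to make the bracket notation concrete is to treat a schema as a sequence of fixed words and open slots. The sketch below is a simplification under stated assumptions: any element starting with a capital letter (V, N, SUBJ, Vmanner) is treated as a slot that matches any single token, whereas a faithful version would need part-of-speech information to check that the filler really is a verb or noun. The function names and the toy sentence are hypothetical.

```python
def parse_schema(schema):
    """Turn '[the thing to V]' into ['the', 'thing', 'to', 'V']."""
    return schema.strip("[]").split()


def is_slot(element):
    """Assumption: elements starting with a capital letter are open slots."""
    return element[0].isupper()


def find_schema(tokens, schema):
    """Return every token window that fits the schema.

    Fixed words must match exactly (ignoring case); slots match any token.
    """
    pattern = parse_schema(schema)
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(is_slot(p) or p == w.lower() for p, w in zip(pattern, window)):
            hits.append(window)
    return hits


# Toy text, invented for illustration only.
tokens = "the thing to do is the thing to remember".split()
print(find_schema(tokens, "[the thing to V]"))
```

A fully fixed schema such as [post office] and a fully abstract one such as [N N] are handled by the same function, which reflects the idea that words, collocations and constructions sit on one continuum.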

Schemas: Related to schema theory in reading. Schemas have a form and a meaning and they are linked to form a network. [change of heart]: meaning. Abstract schema [N of N]: abstract meaning.