Searching corpora.

1 Searching corpora

2 Review Word lists Form: grammatical words are most frequent
The lower the frequency, the more word types share it; that is, the number of types grows as rank increases. For example, a corpus will contain more words (types) that occur 4 times than words that occur 5 times.
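The point about types per frequency can be checked on any text by counting how many word types occur exactly n times. The following is a minimal sketch, not the procedure of any particular concordance package; the `tokenize` and `frequency_spectrum` helpers and the sample sentence are illustrative assumptions.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_spectrum(tokens):
    """Map each frequency n to the number of word types occurring exactly n times."""
    word_freq = Counter(tokens)          # word -> number of occurrences
    return Counter(word_freq.values())   # frequency -> number of types

# Made-up sample text; a real corpus would be far larger.
sample = "the cat sat on the mat and the dog sat on the log"
spectrum = frequency_spectrum(tokenize(sample))
for freq, n_types in sorted(spectrum.items()):
    print(f"{n_types} type(s) occur {freq} time(s)")
```

Even in this tiny sample, most types occur once and only one type (the grammatical word the) occurs four times.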

3 Uses of wordlists
Determine frequency bands for English: the most frequent 1000 words, an academic “band”, etc. Such frequency bands can be used in course planning and to judge the difficulty of a text.
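One way to picture frequency bands is to cut a ranked wordlist into blocks of 1000 and ask what share of a text falls into each block. The sketch below assumes a reference list already sorted by frequency; the function names, the tiny word list and the band size of 2 are hypothetical and only there to keep the example short.

```python
from collections import Counter

def frequency_bands(ranked_words, band_size=1000):
    """Map each word to a 1-based band number according to its rank."""
    return {word: rank // band_size + 1 for rank, word in enumerate(ranked_words)}

def band_coverage(text_tokens, bands):
    """Return the share of the text's tokens that falls into each band.

    Tokens missing from the reference list are counted under the key None.
    """
    total = len(text_tokens)
    if total == 0:
        return {}
    counts = Counter(bands.get(tok) for tok in text_tokens)
    return {band: n / total for band, n in counts.items()}

# Hypothetical ranked list (most frequent first) with a band size of 2.
ranked = ["the", "of", "and", "to", "analysis", "corpus"]
bands = frequency_bands(ranked, band_size=2)
text = ["the", "analysis", "of", "the", "corpus", "data"]
print(band_coverage(text, bands))
```

A text with a large share of tokens in the later bands, or outside the list altogether, is likely to be harder for learners.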

4 Uses of wordlists
Determine the core vocabulary for a particular topic, such as academic English or business English. Produce a wordlist for a business English corpus and use a stoplist (to remove grammatical words). Do a corpus comparison.
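The stoplist and comparison steps can be sketched as follows. This is a rough illustration under stated assumptions: the stoplist is a tiny made-up one, and the comparison uses a plain ratio of relative frequencies with add-one smoothing, which is only one simple choice (keyness statistics such as log-likelihood are often used instead).

```python
from collections import Counter

# Tiny illustrative stoplist; real stoplists are much longer.
STOPLIST = {"the", "a", "of", "to", "and", "in", "is"}

def wordlist(tokens, stoplist=frozenset()):
    """Frequency list, most frequent first, with stoplist words removed."""
    return Counter(t for t in tokens if t not in stoplist).most_common()

def compare(target, reference, stoplist=frozenset()):
    """Rank words by how much more frequent they are in target than in reference."""
    t_freq, r_freq = Counter(target), Counter(reference)
    scores = {}
    for word, n in t_freq.items():
        if word in stoplist:
            continue
        t_rel = n / len(target)
        # Add-one smoothing so words absent from the reference do not divide by zero.
        r_rel = (r_freq[word] + 1) / (len(reference) + 1)
        scores[word] = t_rel / r_rel
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up miniature "corpora".
business = "the market is up and the profit of the company is up".split()
general = "the cat is on the mat and the dog is in the garden".split()
print(wordlist(business, STOPLIST))
print(compare(business, general, STOPLIST)[:3])
```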

5 Corpus comparison table

6 How to search: Using concordance software
Corpora are large, so computer-aided analysis is necessary. Concordance software is the most common type of text analysis software. It produces frequency lists from a corpus and allows searches for words or phrases.
As I mentioned earlier, a corpus may consist of millions of words, which means that we need some help in manipulating the data. We cannot look through such large amounts of data unaided; we need computer-aided analysis. Concordance software is the most common type of text analysis software. (I like to design and create text analysis software, but the basic ideas were around even before computers existed.) Typically, concordance software will produce a word frequency list from a corpus; I won’t say more about that unless there are questions at the end. Concordance software also allows searches for words or phrases.
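The core of a concordance search is a key-word-in-context (KWIC) listing: every occurrence of the search word shown with a few words on either side. Here is a minimal sketch of that idea, assuming simple whitespace-free tokens; the function names and the `width` parameter are illustrative and do not reproduce any specific concordance program.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, width=4):
    """Return (left context, node, right context) for each occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def show(lines):
    """Print the lines with the node word aligned in a single column."""
    for left, node, right in lines:
        print(f"{' '.join(left):>30}  {node}  {' '.join(right)}")

# Made-up sample text.
text = "One thing is clear. The thing to do is wait."
show(concordance(tokenize(text), "thing", width=3))
```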

7 Searching a corpus

8 Info easily obtained from corpora
Word lists and word frequency (What are the most common words in a corpus?). Word combinations (collocations) such as high risk, low maintenance, deep passion. Grammatical constructions (How common is the passive? How often does the passive occur with a by-phrase, as in He was seen by the doctor?). What kinds of texts/genres are associated with particular collocations/constructions?

9 Basic procedure of corpus analysis
Very simple: searching for words and phrases; sorting the results; obtaining frequency information; performing the analysis (Sinclair reading).
This simple procedure covers the basics of corpus analysis. It involves searching for words and phrases, sorting the results, and obtaining frequency information. You can treat a corpus as a reference, like a dictionary or grammar book. Text analysis software will allow you to query the corpus and find out about usage. As I said, those are the basics of corpus analysis. There are more sophisticated searches and frequency analyses, but they are variations of what you have seen. Working with corpora tends to change the way that people think about language. Let me give you a sense of that change.

10 Results - sorted
We have collected all the instances of thing. We could just look through them all, but there are quite a lot of them. We can get a clearer idea of whether puzzles often follows thing by sorting the results in alphabetical order of the word following the key word. Let us do a 1st Right sort and scroll down the list until we reach the words starting with P. We can see that puzzles is not common. We can also look at the one thing pattern by sorting 1st left. There will be many instances of one thing, so we can sort these instances in turn. Let us sort primarily 1st left and secondarily 1st Right. In these results the repeated patterns stand out visually: the phrases one thing to and one thing is are common.
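Sorting concordance lines amounts to sorting on the word immediately to the left or right of the key word, optionally with a second key to break ties. The sketch below assumes each line is stored as a (left words, node, right words) tuple; the helper names and the four sample lines are invented for illustration.

```python
def first_left(line):
    """Word immediately before the node, or '' if there is none."""
    left, _node, _right = line
    return left[-1] if left else ""

def first_right(line):
    """Word immediately after the node, or '' if there is none."""
    _left, _node, right = line
    return right[0] if right else ""

def sort_lines(lines, *keys):
    """Sort lines by the given key functions, primary key first."""
    return sorted(lines, key=lambda line: tuple(k(line) for k in keys))

# Invented concordance lines for the node word "thing".
lines = [
    (["the", "only"], "thing", ["that", "matters"]),
    (["just", "one"], "thing", ["to", "do"]),
    (["for", "one"], "thing", ["is", "certain"]),
    (["the", "whole"], "thing", ["puzzles", "me"]),
]

# Primary sort on the 1st left word, secondary sort on the 1st right word.
for left, node, right in sort_lines(lines, first_left, first_right):
    print(f"{' '.join(left):>12}  {node}  {' '.join(right)}")
```

With the two-key sort, the one thing is and one thing to lines end up next to each other, which is how repeated patterns become visible.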

11 Results - surrounding words
We are interested in the words that occur with thing and we can make the software count all the surrounding words and give us a table of results. The display shows all the totals for the words preceding thing -- those words are given in the column 1st left. Looking at the list we can see the word one. We call the common associated words collocates. Thus the word one is a collocate of thing. We are also interested in the words following thing. These are listed in the column 1st Right. We see that one thing is quite common, but thing followed by puzzles is not so common. If we look at the results in some detail we find that one thing to do and one thing is are frequent phrases.
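The table of surrounding words described here can be approximated by counting, for every occurrence of the node word, the word one position to its left and one position to its right. This is a simplified sketch: the function name, the dictionary keys and the sample tokens are assumptions, and real software usually reports several positions on each side.

```python
from collections import Counter

def positional_collocates(tokens, node):
    """Count the words directly before and directly after each occurrence of node."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1
    return {"1st_left": left, "1st_right": right}

# Made-up token sequence.
tokens = "one thing is clear the thing to do is one thing to avoid".split()
table = positional_collocates(tokens, "thing")
print("1st left: ", table["1st_left"].most_common())
print("1st right:", table["1st_right"].most_common())
```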

12 Collocates and collocations?
The collocates of a word are the words that frequently co-occur with it. If office is the node word, then post, head and take are examples of collocates; head office is a collocation.

13 Why are collocations a big deal?
There are several answers: knowing a language involves knowing lots and lots of collocations; it is unclear how to deal with collocations within grammar; collocates provide clues to the meaning or connotations of a word, e.g., husbands and wives.

14 husband and wife

15 Corpus data: collocates of high

16 Corpus data evidence of extensive multi-word units
Various names: collocations, fixed expressions, chunks, pre-fabricated units (prefabs), lexical bundles.

17 Corpus data
thing: sort of thing, kind of thing, the thing to do, the thing is
change: change in attitude, change of attitude, change of heart, change in policy, change over time, time for a change, pace of change, rate of change, subject to change
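Multi-word units like these can be surfaced by counting the word sequences (n-grams) that contain the word of interest. The following is a minimal sketch under simplifying assumptions: it ignores sentence boundaries, and the function name and sample tokens are illustrative only.

```python
from collections import Counter

def ngrams_with(tokens, node, n=3):
    """Count every n-word sequence in tokens that contains the node word."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if node in gram:
            counts[" ".join(gram)] += 1
    return counts

# Made-up token sequence.
tokens = "a change of heart and a change of plan".split()
for phrase, freq in ngrams_with(tokens, "change", n=3).most_common():
    print(freq, phrase)
```

On a real corpus, the sequences at the top of such a list are candidates for fixed expressions, to be checked against the concordance lines.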

18 Issues in interpreting concordance lines (Hunston)
Nature of the search term: word, lemma, phrase. Some unwanted concordance lines may be deleted. Sorting (in different ways) brings out patterns in the data. Look first at the words surrounding the search term. In some cases, a larger co-text must be examined.

19 Issues in interpreting concordance lines (Hunston)
Typically, a large number of lines are retrieved. One technique is to look closely at a few concordance lines and try to draw some generalisations. Then look at another set of concordance lines to see if your generalisation holds. What is typical? What is central (prototypical)?

20 Techniques in interpreting concordance lines
Examine the frequent collocates: what are the larger (formal) patterns? Examine or apply part-of-speech categories: do patterns emerge if POS data is taken into account? Check for semantic categories: colour terms, hedges, …

21 Corpus view of language
Researchers who work with corpora tend to come to similar views about the nature of language. More attention is paid to the words (lexis and phraseology). This has led to lexical approaches to language teaching (Willis, Lewis, McCarthy).

22 Words and grammar
Language is traditionally divided into the lexicon (words) and grammar (a set of rules). But where, for example, does the phrase “sort of thing” or “the thing to do” fit within this view of language?
Traditionally language is analysed in terms of words (the lexicon) and a set of rules (the grammar). We saw that one thing to do, or one thing to V, is quite common in English. Is this pattern a part of the lexicon or a part of grammar? It is hard to tell: it seems too large and too variable to be a part of the lexicon. On the other hand, it doesn’t seem to be a good example of a grammatical rule. It involves words and it is not very general. Researchers who work with corpora see that language contains a lot of these semi-lexical, semi-grammatical patterns. Let me give one more example.

23 View of language
Corpus view: language/grammar is a vast network of lexical/grammatical relations. Collocations (high standards, high on drugs). Verb – object co-selection (lose – job, find – employment, made – redundant). Constructions: passive (or BE + adjective/participle).

24 Representing a corpus-based grammar
Quite difficult. Use schemas to represent words, collocations and constructions: [post office], [N N], [the thing to V], [SUBJ Vmanner POSS way PATH].

25 Schemas
Related to schema theory in reading. Schemas have a form and a meaning, and they are linked to form a network. [change of heart] -- meaning. Abstract schema [N of N] -- abstract meaning.

