LIN 3098 – Corpus Linguistics Lecture 5 Albert Gatt.

In this lecture…
- Corpora and the lexicon
  - uses of corpora in lexicography
- Counting words
  - lemmatisation and other issues
  - types versus tokens
  - word frequency distributions in corpora

Part 1 Corpora and lexicography

Why corpora are useful
- Lexicographic work has long relied on contextual cues to identify meanings.
  - e.g. Samuel Johnson used examples from literature to exemplify uses of a word.
- Corpora make this procedure much easier, not only to provide examples but also:
  - to actually identify the meanings of a word given its context
  - definitions of word meanings should therefore be more precise, if based on large amounts of data

Specific applications
- Grammatical alternations of words
  - E.g. verb diathesis alternations: Atkins and Levin (1995) found that verbs such as quiver and quake have both intransitive and transitive uses (see Lecture 1).
  - E.g. uses of prepositions such as on, with…
- Regional variations in word use
  - relies on corpora which include gender/region/dialect/date information

Specific applications - II
- Identification of occurrences of a specific homograph, e.g. house (verb)
  - examination of the contexts in which it occurs
  - relies on POS tagging
- Keeping track of changes in a language through a monitor corpus
- Identifying how common a word is, through frequency counts
  - many dictionaries now include such information
  - this shall be our starting point

Part 2 Counting words in corpora: types versus tokens

Running example
- Throughout this lecture, reference is made to data from a corpus of Maltese texts:
  - ca. 51,000 words
  - all from Maltese-language newspapers
  - various topics and article types

How to count words: types versus tokens
- token = any occurrence of a word in the corpus (repeated occurrences are each counted)
- type = a distinct word in the corpus (all occurrences of the same word are grouped together as representatives of a single type)
- Example: I spoke to the chap who spoke to the child
  - 10 tokens
  - 7 types (I, spoke, to, the, chap, who, child)
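The example sentence can be counted mechanically. The following is a minimal sketch, assuming naive whitespace tokenisation and no case folding (both are simplifications, and the variable names are illustrative only):

```python
from collections import Counter

sentence = "I spoke to the chap who spoke to the child"

# Naive whitespace tokenisation: every whitespace-separated string is a token.
tokens = sentence.split()

# Counter groups identical tokens together, so its keys are the types.
type_counts = Counter(tokens)

n_tokens = len(tokens)
n_types = len(type_counts)

print(n_tokens)             # 10
print(n_types)              # 7
print(type_counts["spoke"])  # 2
```

A set would be enough to count types; a Counter is used here because the per-type frequencies are needed later in the lecture.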

More on types and tokens
- The number of tokens in the corpus is an estimate of overall corpus size
  - Maltese corpus: ca. 51,000 tokens
- The number of types is an estimate of vocabulary size
  - gives an idea of the lexical richness of the corpus
  - Maltese corpus: 8193 types

Type/token ratio
- A (rough!) way of measuring the amount of variation in the vocabulary of the corpus.
- Roughly, it can be interpreted as the “rate at which new types are introduced, as a function of the number of tokens”.

Difficult decisions - I
- Do we distinguish upper- and lower-case words?
  - is New in New York the same as new in new car?
  - but what of New in New cars are expensive? (sentence-initial capitals)
  - in practice, it’s not straightforward to distinguish the two accurately, but it can be done

Difficult decisions - II
- What about morphological variants?
  - man – men: one type or two?
  - go – went: one type or two?
- If we map all morphological (inflectional) variants to a single type (lemmatisation), our counts will be cleaner.
  - depends on the availability of automated methods to do this
- Maltese also presents problems with variants of the definite article (ir-, is-, ix- etc.)
  - ir-raġel (DEF-man): one token or two?

Difficult decisions - III
- Do numbers count?
  - e.g. is 1,500 a word?
  - may artificially inflate frequency counts
  - one approach is to treat all numbers as tokens of a single type “NUMBER” or “###”
- Punctuation can compromise frequency counts
  - a computer will treat “woman!” as different from “woman”
  - punctuation needs to be stripped
  - problematic for languages that rely on non-alphabetic symbols: Maltese ‘l (“to”) vs l- (“the”)
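The decisions on the last three slides can be made explicit in a small normalisation step. This is only a sketch of one possible policy, not the procedure used for the Maltese corpus: the function names, the choice to lowercase everything, and the exact set of stripped characters are all assumptions. Hyphens and apostrophes are deliberately left alone so that forms like l- and 'l survive.

```python
import re

# Assumption: a "number" is a digit string with optional , or . separators.
NUMBER_RE = re.compile(r"^\d+([.,]\d+)*$")

# Assumption: only these characters are stripped from token edges.
# Hyphens and apostrophes are kept because they can be part of a word.
EDGE_PUNCTUATION = '!?.,;:"()[]'

def normalise(raw_token):
    """Return a normalised token, or None if nothing is left."""
    token = raw_token.strip(EDGE_PUNCTUATION)
    if not token:
        return None
    if NUMBER_RE.match(token):
        return "NUMBER"  # all numbers become tokens of a single type
    # Lowercasing conflates New (York) with new (car): a known trade-off.
    return token.lower()

def tokenise(text):
    """Whitespace-split the text and normalise each piece."""
    normalised = (normalise(piece) for piece in text.split())
    return [token for token in normalised if token is not None]

print(tokenise("New cars cost 1,500 euros!"))
# ['new', 'cars', 'cost', 'NUMBER', 'euros']
```

Lowercasing is the bluntest answer to the capitalisation question; a more careful approach would only fold case sentence-initially, which needs sentence boundary detection.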

Part 2 Representing word frequencies

Raw frequency lists (data from Maltese)
- A simple list, pairing each word with its frequency

word                        frequency
aħħar (“last”)              97
jkun (“be.IMPERF.3SG”)      96
ukoll (“also”)              93
bħala (“as”)                91
dak (“that.SGM”)            86
tat- (“of.DEF”)             86

Frequency ranks
- Word counts can get very big.
  - the most frequent word in the Maltese corpus occurs 2195 times (and the corpus is small)
- Raw frequency lists can be hard to process.
- Useful to represent words in terms of rank:
  - count the words
  - sort by frequency (most frequent first)
  - assign a rank to the words: rank 1 = most frequent, rank 2 = next most frequent, …

Rank-frequency list example (data from Maltese)

rank    frequency
1       2195
2       2080
3       1277
4       1264

- rank = rank of the type, according to frequency
- frequency = number of times the type occurs
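The count, sort, rank procedure can be sketched in a few lines. The function name and the alphabetical tie-breaking are assumptions made here for illustration; the lecture does not say how tied frequencies are ordered. The toy input is the earlier English example sentence, lowercased, not the Maltese data.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically (an assumption)
    # so that the output is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

tokens = "i spoke to the chap who spoke to the child".split()
table = rank_frequency(tokens)

for rank, word, freq in table:
    print(rank, word, freq)
```

Note that tied words still receive distinct ranks under this scheme, which matches the list on the slide where ranks simply run 1, 2, 3, 4.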

Frequency spectrum (data from Maltese)
- A representation that shows, for each frequency value, the number of different types that occur with that frequency.

frequency   types
1           4382
2           1253
3           661
4           356
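A frequency spectrum is a count of counts, which the following sketch makes concrete. The function name is illustrative, and the input is again the toy English sentence rather than the Maltese corpus.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency value to the number of types with that frequency."""
    type_counts = Counter(tokens)              # type -> frequency
    spectrum = Counter(type_counts.values())   # frequency -> number of types
    return dict(sorted(spectrum.items()))

tokens = "i spoke to the chap who spoke to the child".split()
spectrum = frequency_spectrum(tokens)

print(spectrum)  # {1: 4, 2: 3}
```

Here four types occur once (i, chap, who, child) and three occur twice (spoke, to, the). Summing frequency times number of types over the spectrum recovers the token count.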

Normalised frequency counts
- A raw frequency for a word isn’t necessarily informative.
  - E.g. it is difficult to compare the frequency of a word across corpora of different sizes.
- We often take a “normalised” count.
  - typically, divide the raw frequency by the corpus size (no. of tokens) and multiply by some constant, such as 10,000 or 1,000,000
  - this gives the “frequency of the word per million” (or per 10,000) rather than a raw count.
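As a worked example, here is a minimal sketch of a per-million normalisation, read as raw count divided by corpus size, scaled by a constant. The function name and parameter names are assumptions; the figures are the ones quoted earlier for the Maltese corpus, where the corpus size is only approximate.

```python
def normalised_frequency(raw_count, corpus_size, base=1_000_000):
    """Frequency per `base` tokens, e.g. per million by default."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * base / corpus_size

# Most frequent word in the Maltese corpus: 2195 occurrences in ca. 51,000 tokens.
per_million = normalised_frequency(2195, 51_000)
print(round(per_million))  # 43039

# The same idea with a smaller constant: 5 occurrences in 1,000 tokens.
print(normalised_frequency(5, 1_000, base=10_000))  # 50.0
```

Normalised counts from very small corpora should be read with care: scaling 51,000 tokens up to a million does not make the estimate any more reliable.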

Type/token ratio revisited
- TTR = (no. of types)/(no. of tokens)
- Another way of estimating the “vocabulary richness” of a corpus, instead of just looking at vocabulary size.
- E.g. if a corpus consists of 1000 words (tokens), and there are 400 types, then the TTR is 400/1000 = 40%

Type/token ratio
- Ratio varies enormously depending on corpus size!
- If the corpus is 1000 words, it’s easy to see a TTR of, say, 40%.
- With 4 million words, it’s more likely to be in the region of 2%.
- Reasons:
  - vocabulary size grows with corpus size
  - but large corpora will contain a lot of types that occur many times, so tokens accumulate faster than new types

Standardised type/token ratio
- One way to account for TTR variations due to corpus size is to compute an average TTR for chunks of a constant size.
  - Example: compute the TTR for every 1000 words of running text
  - then take an average over all the 1000-word chunks
- This is the approach used, for example, in WordSmith.
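The chunk-and-average idea can be sketched as follows. This is an illustration of the general technique, not a reimplementation of WordSmith: the function names are hypothetical, and discarding the incomplete final chunk is an assumption made here.

```python
def type_token_ratio(tokens):
    """Plain TTR: number of distinct tokens divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def standardised_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive, non-overlapping chunks of chunk_size tokens.

    Assumption: a trailing chunk shorter than chunk_size is discarded.
    """
    n_full_chunks = len(tokens) // chunk_size
    if n_full_chunks == 0:
        raise ValueError("need at least chunk_size tokens")
    ratios = [
        type_token_ratio(tokens[i * chunk_size:(i + 1) * chunk_size])
        for i in range(n_full_chunks)
    ]
    return sum(ratios) / len(ratios)

# Tiny demonstration with chunks of 2 tokens:
# [a, b] -> 1.0, [a, a] -> 0.5, [c, d] -> 1.0, trailing [e] is dropped.
demo = ["a", "b", "a", "a", "c", "d", "e"]
print(standardised_ttr(demo, chunk_size=2))
```

Because every chunk has the same length, the resulting figure can be compared across corpora of different sizes, which the plain TTR cannot.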

Part 3 Frequency distributions, or “few giants, many midgets”

Non-linguistic case study
- Suppose we are interested in measuring people’s height.
  - population = adult, male/female, European
  - sample: N people from the relevant population
  - measure the height of each person in the sample
- Results:
  - person 1: 1.6 m
  - person 2: 1.5 m
  - …

Measures of central tendency
- Given the height of individuals in our sample, we can calculate some summary statistics:
  - mean (“average”): sum of all heights in the sample, divided by N
  - mode: most frequent value
  - median: the middle value when the heights are sorted
- What are your expectations?

The data (example)

person  height (cm)
1       135
2       159
3       160
4
5       180

- Mean: 158.8 cm
  - This is the expected value in the long run.
  - If our sample is good, we would expect that most people would have a height at or around the mean.
- Mode: 160 cm
- Median: 160 cm
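The three summary statistics can be checked with Python's standard `statistics` module. One caveat: the fourth height is missing from the table as transcribed. The sketch below assumes it is 160 cm, because that is the only value for which five heights give a mean of 158.8 cm (and it is consistent with the stated mode and median); treat that value as an inference, not as source data.

```python
import statistics

# Heights in cm. The fourth value (160) is an ASSUMPTION inferred from the
# stated mean of 158.8 cm over five people; it is not given in the table.
heights_cm = [135, 159, 160, 160, 180]

mean_height = statistics.mean(heights_cm)
median_height = statistics.median(heights_cm)
mode_height = statistics.mode(heights_cm)

print(mean_height)    # 158.8
print(median_height)  # 160
print(mode_height)    # 160
```

For data like this, the three measures land close together, which is the point the lecture goes on to contrast with word frequencies.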

Plotting height/frequency
- Observations:
  1. Extreme values are less frequent.
  2. Most people fall at or near the mean.
  3. The mode is approximately the same as the mean.
  4. Bell-shaped curve (“normal” distribution)

Plotting height/frequency
- This shape characterises the normal distribution: a “bell curve”.
- Quite typical for a lot of data sampled from humans (but not all data)

What about language?
- Typical observations about word frequencies in corpora:
  1. there are a few words with extremely high frequency
  2. there are many more words with extremely low frequency
  3. the mean is not a good indicator: most words will have an actual frequency that is very far above or below the mean

A closer look at the Maltese data
- Out of ca. 51,000 tokens:
  - 8016 tokens belong to just the 5 most frequent types (the types at ranks 1–5)
  - ca. 15% of our corpus is made up of only 5 different words!
- Out of 8193 types:
  - 4382 are hapax legomena, occurring only once (bottom ranks)
  - 1253 occur only twice
  - …
- In this data, the mean won’t tell us very much.
  - it hides huge variations!
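The arithmetic behind these observations is easy to reproduce. The sketch below uses only the figures quoted on the slide; the variable names are illustrative, and since the token count is given as approximate ("ca. 51,000"), the results are approximate too.

```python
# Figures quoted on the slide; the token count is approximate.
n_tokens = 51_000
n_types = 8193
top5_tokens = 8016   # tokens belonging to the 5 most frequent types
hapax_types = 4382   # types occurring exactly once

top5_share = top5_tokens / n_tokens    # share of the corpus covered by 5 types
hapax_share = hapax_types / n_types    # share of the vocabulary that is hapax
mean_frequency = n_tokens / n_types    # average occurrences per type

print(f"top-5 share of tokens: {top5_share:.1%}")
print(f"hapax share of types:  {hapax_share:.1%}")
print(f"mean frequency:        {mean_frequency:.2f}")
```

The mean comes out at roughly 6 occurrences per type, yet more than half of all types occur exactly once, so the median frequency is 1. That gap is what "the mean hides huge variations" amounts to.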

Ranks and frequencies (Maltese)

rank    frequency
1       2195
2       2080
3       1277
…
2298    1
2299    1
…

- Among top ranks, frequency drops very dramatically
- Among bottom ranks, frequency drops very gradually

General observations
- In corpora:
  - there are always a few very high-frequency words, and many low-frequency words
  - among the top ranks, frequency differences are big
  - among the bottom ranks, frequency differences are very small

So what are the high-frequency words?
- Top-ranked words in the Maltese data: li (“that”), l- (DEF), il- (DEF), u (“and”), ta’ (“of”), tal- (“of the”)
- Bottom-ranked words:
  - żona (“zone”) f = 1
  - yankee f = 1
  - żwieten (“Zejtun residents”) f = 1
  - xortih (“luck.POSS-3SGM”) f = 1
  - widnejhom (“ear.POSS-3PL”) f = 1

Zipf’s law
- George K. Zipf (1902–1950) established a mathematical model for describing frequency data:
  - Frequency decreases with rank.
  - More precisely, frequency is inversely proportional to rank.
- We can plot this in a chart:
  - Y-axis = frequency
  - X-axis = rank
  - each dot on the chart represents the lexical item (type) at a given rank
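Written out, "inversely proportional to rank" takes the following form. The symbols are notation introduced here for illustration, not taken from the slides: f(r) is the frequency of the type at rank r, and C is a corpus-dependent constant.

```latex
f(r) \propto \frac{1}{r}
\qquad\text{i.e.}\qquad
f(r) = \frac{C}{r}
\qquad\Longrightarrow\qquad
\log f(r) = \log C - \log r
```

The last form shows why rank/frequency plots are often drawn on logarithmic axes: under this idealised model the points fall on a straight line with slope -1. Real corpora follow this only approximately, especially at the very top and very bottom ranks.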

How Zipf’s law pans out (Maltese data)
[Figure: rank/frequency plot. Annotations: a few high-frequency, low-rank words; hundreds of low-frequency, high-rank words]

Zipf’s law cross-linguistically
- Empirical work has shown that the Zipfian distribution is observable:
  - independent of the language
  - irrespective of corpus size (for reasonably large corpora)
- The bigger your corpus:
  - the bigger your vocabulary size (no. of types)
  - the more words of frequency 1 (hapax legomena)
- Why?

Some reasons
- If words were chosen completely at random, every word would be equally likely.
  - Our plot would be completely flat: all words at all ranks would have the same frequency.
- Language is absolutely non-random: the occurrence of words is governed by:
  - syntax
  - author/speaker intentions
  - ...
- Some words are the basic “skeleton” for our sentences. They are the most frequent.

Implications
- Traditional measures of central tendency (mean etc.) are not very useful.
- No two corpora can be directly compared if they are of different sizes:
  - vocabulary size increases with corpus size
  - most of the vocabulary is made up of hapax legomena
  - most of the corpus size (no. of tokens) is made up of a few, very frequent types, typically function words.

Summary
- We’ve introduced some of the uses of corpora for lexicography.
- Focused today on word frequencies, especially Zipf’s law
  - looked at some of the implications
- Next up: collocations and why they’re useful

References
- Baroni, M. (2007). Distributions in text. In A. Lüdeling and M. Kytö (eds.), Corpus linguistics: An international handbook. Berlin: Mouton de Gruyter.
