LIN 3098 – Corpus Linguistics Lecture 5 Albert Gatt.


In this lecture…
• Corpora and the lexicon
  - uses of corpora in lexicography
• Counting words
  - lemmatisation and other issues
  - types versus tokens
  - word frequency distributions in corpora

Part 1 Corpora and lexicography

Why corpora are useful
• Lexicographic work has long relied on contextual cues to identify meanings.
  - e.g. Samuel Johnson used examples from literature to exemplify uses of a word.
• Corpora make this procedure much easier:
  - not only to provide examples, but also to identify the meanings of a word from its contexts
  - definitions of word meanings should therefore be more precise if they are based on large amounts of data

Specific applications
• Grammatical alternations of words
  - e.g. verb diathesis alternations: Atkins and Levin (1995) found that verbs such as quiver and quake have both intransitive and transitive uses (see Lecture 1)
  - e.g. uses of prepositions such as on, with…
• Regional variations in word use
  - relying on corpora which include gender/region/dialect/date information

Specific applications - II
• Identification of occurrences of a specific homograph, e.g. house (Verb)
  - examination of the contexts in which it occurs
  - relies on POS tagging
• Keeping track of changes in a language through a monitor corpus
• Identifying how common a word is, through frequency counts
  - many dictionaries now include such information
  - this shall be our starting point

Part 2 Counting words in corpora: types versus tokens

Running example
• Throughout this lecture, reference is made to data from a corpus of Maltese texts:
  - ca. 51,000 words
  - all from Maltese-language newspapers
  - various topics and article types

How to count words: types versus tokens
• token = any occurrence of a word in the corpus (a word that occurs more than once is counted each time)
• type = a distinct word in the corpus (all occurrences of the same word are grouped together as representatives of a single type)
• Example: I spoke to the chap who spoke to the child
  - 10 tokens
  - 7 types (I, spoke, to, the, chap, who, child)
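The type/token distinction can be checked mechanically. The following is a minimal sketch, assuming that words are simply separated by whitespace; the function name count_types_and_tokens is made up for illustration.

```python
from collections import Counter

def count_types_and_tokens(text):
    """Return (number of tokens, number of types) for a whitespace-separated text."""
    tokens = text.split()        # every running word counts as one token
    types = Counter(tokens)      # one entry per distinct word form
    return len(tokens), len(types)

n_tokens, n_types = count_types_and_tokens("I spoke to the chap who spoke to the child")
print(n_tokens, n_types)  # 10 7
```

Splitting on whitespace is the crudest possible tokenisation; the "difficult decisions" that follow show why real counts need more care.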

More on types and tokens
• The number of tokens in the corpus is an estimate of overall corpus size
  - Maltese corpus: 51,000 tokens
• The number of types is an estimate of vocabulary size
  - gives an idea of the lexical richness of the corpus
  - Maltese corpus: 8193 types

Type/token ratio
• A (rough!) way of measuring the amount of variation in the vocabulary of the corpus.
• Roughly, it can be interpreted as the “rate at which new types are introduced, as a function of number of tokens”.

Difficult decisions - I
• Do we distinguish upper- and lower-case words?
  - is New in New York the same as new in new car?
  - but what of New in New cars are expensive? (sentence-initial capitals)
  - in practice, it’s not straightforward to distinguish the two accurately, but it can be done

Difficult decisions - II
• What about morphological variants?
  - man – men: one type or two?
  - go – went: one type or two?
• If we map all morphological (inflectional) variants to a single type (lemmatisation), our counts will be cleaner.
  - depends on the availability of automated methods to do this
• Maltese also presents problems with variants of the definite article (ir-, is-, ix- etc.)
  - ir-raġel (DEF-man): one token or two?

Difficult decisions - III
• Do numbers count?
  - e.g. is 1,500 a word?
  - may artificially inflate frequency counts
  - one approach is to treat all numbers as tokens of a single type, “NUMBER” or “###”
• Punctuation can compromise frequency counts
  - a computer will treat “woman!” as different from “woman”
  - punctuation needs to be stripped
  - problematic for languages that rely on non-alphabetic symbols: Maltese ‘l (“to”) vs l- (“the”)
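Some of these decisions can be made explicit in a small normalisation step. This is only a sketch under stated assumptions: the function normalise, the regular expression for numbers and the choice to fold everything to lower case are illustrative, not a recommended pipeline.

```python
import re
import string

# Assumption: a "number" is a digit followed by digits, commas or full stops.
NUMBER_RE = re.compile(r"^\d[\d.,]*$")

def normalise(token):
    """Map a raw token to the type it should be counted under (None = discard)."""
    # Strip punctuation (ASCII plus curly quotes) from both ends only.
    token = token.strip(string.punctuation + "“”‘’")
    if not token:
        return None              # the token was pure punctuation
    if NUMBER_RE.match(token):
        return "NUMBER"          # all numbers become tokens of one type
    return token.lower()         # case-folding: New and new become one type

print(normalise("woman!"))   # woman
print(normalise("1,500"))    # NUMBER
print(normalise("l-"), normalise("‘l"))  # l l
```

The last line shows the problem raised for Maltese: naive stripping maps both ‘l and l- to the same string, so a language-specific rule would be needed to keep them apart.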

Part 2 Representing word frequencies

Raw frequency lists (data from Maltese)
• A simple list, pairing each word with its frequency

word                        frequency
aħħar (“last”)              97
jkun (“be.IMPERF.3SG”)      96
ukoll (“also”)              93
bħala (“as”)                91
dak (“that.SGM”)            86
tat- (“of.DEF”)             86

Frequency ranks
• Word counts can get very big.
  - the most frequent word in the Maltese corpus occurs 2195 times (and the corpus is small)
• Raw frequency lists can be hard to process.
• Useful to represent words in terms of rank:
  - count the words
  - sort by frequency (most frequent first)
  - assign a rank to the words: rank 1 = most frequent, rank 2 = next most frequent, …
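The three steps (count, sort, assign ranks) can be sketched as follows. The function name rank_frequency_list is made up for illustration, and words with equal frequency simply receive consecutive ranks here, which is one possible convention among several.

```python
from collections import Counter

def rank_frequency_list(tokens):
    """Return a list of (rank, word, frequency), most frequent word first."""
    counts = Counter(tokens)          # step 1: count the words
    ranked = counts.most_common()     # step 2: sort by frequency, descending
    # step 3: assign ranks starting from 1
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

tokens = "I spoke to the chap who spoke to the child".split()
for rank, word, freq in rank_frequency_list(tokens):
    print(rank, word, freq)
```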

Rank-frequency list example (data from Maltese)
[Table not preserved. Columns: rank (rank of the type, according to frequency); frequency (number of times the type occurs).]

Frequency spectrum (data from Maltese)
• A representation that shows, for each frequency value, the number of different types that occur with that frequency.
[Table not preserved. Columns: frequency; types.]
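A frequency spectrum can be derived from the word counts by counting the counts. A minimal sketch, with the hypothetical function name frequency_spectrum and the earlier example sentence as toy data:

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency value to the number of types with that frequency."""
    type_counts = Counter(tokens)              # word -> frequency
    spectrum = Counter(type_counts.values())   # frequency -> number of types
    return dict(sorted(spectrum.items()))

tokens = "I spoke to the chap who spoke to the child".split()
print(frequency_spectrum(tokens))  # {1: 4, 2: 3}
```

In the toy sentence, four types occur once and three types occur twice.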

Normalised frequency counts
• A raw frequency for a word isn’t necessarily informative.
  - e.g. it is difficult to compare the frequency of a word across corpora of different sizes
• We often take a “normalised” count.
  - typically, divide the raw frequency by the corpus size and multiply by some constant, such as 10,000 or 1,000,000
  - with 1,000,000 this gives the “frequency of the word per million” words, rather than a raw count
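Normalisation can be sketched as a relative frequency scaled to a fixed base. The function name per_million and its parameters are illustrative assumptions; the figures in the usage line are the top frequency (2195) and approximate size (51,000 tokens) of the Maltese corpus mentioned in this lecture.

```python
def per_million(raw_frequency, corpus_size, base=1_000_000):
    """Scale a raw count to a frequency per `base` tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_frequency / corpus_size * base

# Most frequent Maltese word: 2195 occurrences in ca. 51,000 tokens,
# i.e. roughly 43,000 occurrences per million tokens.
print(round(per_million(2195, 51_000)))
```

Counts normalised to the same base can then be compared across corpora of different sizes.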

Type/token ratio revisited
• TTR = (no. of types)/(no. of tokens)
• Another way of estimating the “vocabulary richness” of a corpus, instead of just looking at vocabulary size.
• E.g. if a corpus consists of 1000 words, and there are 400 types, then the TTR is 400/1000 = 40%.
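As a minimal sketch, the ratio can be computed directly from a token list. The function name type_token_ratio is made up for illustration, and the input is assumed to be already tokenised.

```python
def type_token_ratio(tokens):
    """Number of distinct types divided by number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a TTR for an empty text")
    return len(set(tokens)) / len(tokens)

tokens = "I spoke to the chap who spoke to the child".split()
print(type_token_ratio(tokens))  # 0.7
```

The toy sentence has 7 types and 10 tokens, hence a TTR of 70%.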

Type/token ratio
• The ratio varies enormously depending on corpus size!
• If the corpus is 1000 words, it’s easy to see a TTR of, say, 40%.
• With 4 million words, it’s more likely to be in the region of 2%.
• Reasons:
  - vocabulary size grows with corpus size
  - but large corpora contain a lot of tokens of types that have already occurred many times, so the token count grows faster than the type count

Standardised type/token ratio
• One way to account for TTR variations due to corpus size is to compute an average TTR for chunks of a constant size.
  - Example: compute the TTR for every 1000 words of running text
  - then take the average over all the chunks
• This is the approach used, for example, in WordSmith.
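The chunk-and-average idea can be sketched as below. This is an illustration of the general procedure, not a reproduction of WordSmith; in particular, discarding a final chunk shorter than chunk_size is an assumption of this sketch, and the function name standardised_ttr is made up.

```python
def standardised_ttr(tokens, chunk_size=1000):
    """Average the TTR over consecutive, non-overlapping chunks of equal size."""
    ratios = []
    # Only full chunks are used; a shorter trailing chunk is ignored (assumption).
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)

# Toy example with chunks of 3 tokens: TTRs are 2/3 and 3/3, the final "e" is ignored.
print(standardised_ttr(["a", "b", "a", "a", "c", "d", "e"], chunk_size=3))
```

Because every chunk has the same length, the resulting average no longer shrinks just because the corpus is large.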

Part 3 Frequency distributions, or “few giants, many midgets”

Non-linguistic case study
• Suppose we are interested in measuring people’s height.
  - population = adult, male/female, European
  - sample: N people from the relevant population
  - measure the height of each person in the sample
• Results:
  - person 1: 1.6 m
  - person 2: 1.5 m
  - …

Measures of central tendency
• Given the heights of the individuals in our sample, we can calculate some summary statistics:
  - mean (“average”): sum of all heights in the sample, divided by N
  - mode: the most frequent value
  - median: the middle value when the heights are sorted
• What are your expectations?
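The three statistics are available in Python’s standard statistics module. The sketch below uses a small, purely hypothetical sample of heights in cm (not the data from the lecture); the function name summarise is also made up.

```python
import statistics

def summarise(values):
    """Return (mean, median, mode) of a list of numbers."""
    return (
        statistics.mean(values),    # sum of the values divided by N
        statistics.median(values),  # middle value of the sorted sample
        statistics.mode(values),    # most frequent value
    )

heights_cm = [150, 155, 160, 160, 165]  # hypothetical sample, for illustration only
print(summarise(heights_cm))  # (158, 160, 160)
```

In a sample like this one, all three measures land close together.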

The data (example)
• Mean: 158.8 cm
  - This is the expected value in the long run.
  - If our sample is good, we would expect most people to have a height at or around the mean.
• Mode: 160 cm
• Median: 160 cm
[Figure not preserved; axis label: height.]

Plotting height/frequency
Observations:
1. Extreme values are less frequent.
2. Most people fall at or near the mean.
3. The mode is approximately the same as the mean.
4. Bell-shaped curve (“normal” distribution).

Plotting height/frequency
• This shape characterises the Normal Distribution.
  - a “bell curve”
  - quite typical for a lot of data sampled from humans (but not all data)

What about language?
• Typical observations about word frequencies in corpora:
1. there are a few words with extremely high frequency
2. there are many more words with extremely low frequency
3. the mean is not a good indicator: most words will have an actual value that is very far above or below the mean

A closer look at the Maltese data
• Out of 51,000 tokens:
  - 8016 tokens belong to just the 5 most frequent types (the types at ranks 1–5)
  - ca. 15% of our corpus size is made up of only 5 different words!
• Out of 8193 types:
  - 4382 are hapax legomena, occurring only once (bottom ranks)
  - 1253 occur only twice
  - …
• In this data, the mean won’t tell us very much.
  - it hides huge variations!
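A quick calculation with the figures quoted on this slide shows how little the mean says. The variable names are illustrative, and the token total is the approximate figure of 51,000.

```python
tokens_total = 51_000   # approximate corpus size
types_total = 8193      # vocabulary size
top5_tokens = 8016      # tokens belonging to the 5 most frequent types
hapaxes = 4382          # types occurring exactly once

top5_share = top5_tokens / tokens_total       # share of the corpus taken by 5 words
hapax_share = hapaxes / types_total           # share of the vocabulary seen once
mean_frequency = tokens_total / types_total   # average number of tokens per type

print(f"top 5 types cover {top5_share:.1%} of the tokens")
print(f"hapax legomena make up {hapax_share:.1%} of the types")
print(f"mean frequency per type: {mean_frequency:.1f}")
```

The mean comes out at roughly 6 occurrences per type, yet more than half of the types occur only once and the most frequent word occurs 2195 times.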

Ranks and frequencies (Maltese)
[Table not preserved.]
• Among top ranks, frequency drops very dramatically.
• Among bottom ranks, frequency drops very gradually.

General observations
• In corpora:
  - there are always a few very high-frequency words, and many low-frequency words
  - among the top ranks, frequency differences are big
  - among the bottom ranks, frequency differences are very small

So what are the high-frequency words?
• Top-ranked words in the Maltese data:
  - li (“that”), l- (DEF), il- (DEF), u (“and”), ta’ (“of”), tal- (“of the”)
• Bottom-ranked words:
  - żona (“zone”), f = 1
  - yankee, f = 1
  - żwieten (“Zejtun residents”), f = 1
  - xortih (“luck.POSS-3SGM”), f = 1
  - widnejhom (“ear.POSS-3PL”), f = 1

Zipf’s law
• George K. Zipf (1902 – 1950) established a mathematical model for describing frequency data:
  - Frequency decreases with rank.
  - More precisely, frequency is inversely proportional to rank.
• We can plot this in a chart:
  - Y-axis = frequency
  - X-axis = rank
  - each dot on the chart represents the lexical item (type) at a given rank
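“Inversely proportional to rank” means that frequency × rank stays roughly constant. The sketch below computes the idealised prediction only, assuming an exponent of exactly 1 and taking the frequency at rank 1 as the constant; real corpora deviate from this, and the function name is made up for illustration.

```python
def zipf_expected_frequency(rank, top_frequency):
    """Idealised Zipfian frequency at a given rank: f(r) = f(1) / r."""
    if rank < 1:
        raise ValueError("ranks start at 1")
    return top_frequency / rank

# 2195 is the frequency of the most frequent word in the Maltese corpus.
for rank in (1, 2, 3, 10, 100):
    print(rank, zipf_expected_frequency(rank, 2195))
```

The predicted values fall steeply over the first few ranks and then flatten out, which is the shape described for the top and bottom ranks of the data.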

How Zipf’s law pans out (Maltese data)
[Figure not preserved: plot of frequency against rank. Annotations: a few high-frequency, low-rank words; hundreds of low-frequency, high-rank words.]

Zipf’s law cross-linguistically
• Empirical work has shown that the Zipfian distribution is observable:
  - independent of the language
  - irrespective of corpus size (for reasonably large corpora)
• The bigger your corpus:
  - the bigger your vocabulary size (no. of types)
  - the more words of frequency 1 (hapax legomena)
• Why?

Some reasons
• If words were completely random, every word would be equally likely.
  - Our plot would be completely flat: all words at all ranks would have the same frequency.
• Language is absolutely non-random: the occurrence of words is governed by:
  - syntax
  - author/speaker intentions
  - ...
• Some words are the basic “skeleton” for our sentences. They are the most frequent.

Implications
• Traditional measures of central tendency (mean etc.) are not very useful.
• No two corpora can be directly compared if they are of different sizes:
  - vocabulary size increases with corpus size
  - most of the vocabulary is made up of hapax legomena
  - most of the corpus size (no. of tokens) is made up of a few, very frequent types, typically function words

Summary
• We’ve introduced some of the uses of corpora for lexicography.
• Focused today on word frequencies, especially Zipf’s law
  - looked at some of the implications
• Next up: collocations and why they’re useful

References  Baroni, M. (2007). Distributions in text. In A. Lüdeling and M. Kytö (eds.), Corpus linguistics: An international handbook. Berlin: Mouton de Gruyter.