Basic statistics for corpus linguistics

1 Basic statistics for corpus linguistics

2 Types of studies in CL
What kind of x occurs in y? For example: with which adjectives are politicians described in different newspapers and magazines? Are a and b different? For example: do the frequencies of negative and positive adjectives differ across newspapers and magazines? Is there a correlation between x and y? For example: is it true that the more right-wing a magazine is, the more frequently it uses negative adjectives for left-wing politicians? Usually you need a benchmark: a reference corpus, the frequency expected under a random distribution, or two distributions that you compare.

3 What do you need for a quantitative corpus study?
A defined corpus in which you can count the frequencies of the objects of interest; the frequencies themselves; a benchmark; a test of statistical significance (H0, the null hypothesis, plus some tool for the calculations: an online calculator, SPSS, R, Excel).

4 Descriptive statistics
Describes how the values in your data are distributed: the lowest and highest values; the mean; the mode (the most frequent value); the median (half of the values are smaller than or equal to it and half are larger than or equal to it); the first and third quartiles (25% of the data lie below the first quartile and 75% lie below the third).
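The measures listed on this slide can be computed with Python's standard `statistics` module. This is a minimal sketch; the `counts` data (adjectives counted in ten articles) are invented for illustration and do not come from the presentation.

```python
import statistics

# Hypothetical data: number of negative adjectives found in each of ten articles.
counts = [2, 3, 3, 4, 5, 5, 5, 7, 9, 12]

summary = {
    "min": min(counts),                    # lowest value
    "max": max(counts),                    # highest value
    "mean": statistics.mean(counts),       # arithmetic mean
    "mode": statistics.mode(counts),       # most frequent value
    "median": statistics.median(counts),   # middle of the sorted values
}

# With n=4, quantiles() returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = statistics.quantiles(counts, n=4, method="inclusive")
summary["q1"] = q1
summary["q3"] = q3

for name, value in summary.items():
    print(f"{name}: {value}")
```

Different tools interpolate quartiles slightly differently, so small discrepancies between this sketch and SPSS, R or Excel are to be expected.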

5 Types of distribution Skewed, symmetric, normal distribution

6 Inferential statistics
Testing statistical significance, collocational strength etc. Think of what you want to test. Try to find a suitable method. Check whether your data fulfil the conditions of the method and think of potential problems. Conduct the test. Interpret the results statistically (usually you can find guidelines in the source where you looked up the test). Think about what the statistical result tells you about the language. How far can you extrapolate from your data?

7 Normalized frequencies
In order to compare data obtained from samples of different sizes you must normalize them: raw_freq / corpus_size * x, where x is the base of normalization (usually 100, 1000, ...). The result reads as "occurrences per x words".
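The normalization formula can be sketched as a small function. The function name, the default base of 1000 and the two corpora below are assumptions made for illustration only.

```python
def normalized_frequency(raw_freq: int, corpus_size: int, base: int = 1000) -> float:
    """Return the frequency per `base` words (raw_freq / corpus_size * base)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # Multiplying before dividing is mathematically the same as the slide's
    # formula and avoids a small floating-point rounding error.
    return raw_freq * base / corpus_size


# Hypothetical corpora of different sizes.
freq_a = normalized_frequency(150, 50_000)    # corpus A: 150 hits in 50,000 words
freq_b = normalized_frequency(400, 200_000)   # corpus B: 400 hits in 200,000 words

print(freq_a, freq_b)
```

Corpus B has more raw hits, but once both counts are expressed per 1000 words the item turns out to be relatively more frequent in corpus A.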

8 Types of variables In order to conduct a quantitative study you need numbers, but there are different types of numbers: .pdf

9 Null hypothesis A hypothesis that you try to reject in your study
Instead of asking whether a and b are different, you ask: if a and b were not correlated / not different / drawn from the same distribution, how probable would data like yours be? The p-value is that probability: the chance of observing a difference or correlation at least as strong as the one in your data if the null hypothesis were true. It is not the probability that the null hypothesis holds. You need to decide in advance which threshold is satisfactory: 0.05, 0.01, 0.001? p=0.05 means that, if a and b were really unrelated, a result at least this strong would turn up by chance in 5% of samples; it does not mean that there is a 95% chance that they are correlated. In linguistics 0.05 is the usual standard, but in medicine it might not be satisfactory: would you rely on a treatment whose apparent effect is the kind of result that chance alone produces in 5 out of 100 trials?
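One way to make the idea of a p-value concrete is a permutation test, which is not mentioned in the presentation and is used here only as an illustration. The sketch below shuffles the group labels many times, which is what the world would look like under the null hypothesis, and counts how often the shuffled difference in means is at least as large as the observed one. The function name, the number of permutations and the two data sets are assumptions.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in means by permutation."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are arbitrary, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # The +1 terms count the observed arrangement itself and keep p above 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical normalized frequencies of negative adjectives in two newspapers.
paper_a = [1, 2, 3, 4, 5, 6, 7, 8]
paper_b = [21, 22, 23, 24, 25, 26, 27, 28]

p_different = permutation_p_value(paper_a, paper_b)
p_identical = permutation_p_value([3, 3, 3, 3], [3, 3, 3, 3])
print(p_different, p_identical)
```

A small p-value for the two newspapers says that such a large difference would rarely arise if the labels were arbitrary; it does not give the probability that the null hypothesis is true.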

10 Useful statistical methods
Tests of statistical significance: the chi-square test, for observations on big data sets that are fairly evenly distributed (the expected value in each cell should be at least 5); the log-likelihood (LL) test, the preferred test; Fisher's exact test, for observations on small data sets. Collocation statistics: mutual information (MI) and the z-score. The higher the MI score, the stronger the link between two items; an MI score of 3.0 or higher is usually taken to indicate a collocation; the closer the score is to 0, the more likely it is that the co-occurrence is random; a negative MI score indicates that the two items tend to avoid each other. Testing correlation: simple linear regression. More complicated methods: multiple linear regression, multifactorial analysis, clustering.
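The MI score can be sketched as the base-2 logarithm of the observed co-occurrence frequency divided by the frequency expected if the two words were independent. This is the basic pointwise formulation; some corpus tools additionally adjust the expected frequency for the size of the collocation window, which is ignored here. The function name and all frequencies below are invented for illustration.

```python
import math


def mutual_information(f_xy: int, f_x: int, f_y: int, corpus_size: int) -> float:
    """MI = log2(observed / expected), with expected = f_x * f_y / corpus_size."""
    if min(f_xy, f_x, f_y, corpus_size) <= 0:
        raise ValueError("all frequencies and the corpus size must be positive")
    expected = f_x * f_y / corpus_size
    return math.log2(f_xy / expected)


# Hypothetical counts in a 1000-word corpus: word x occurs 20 times, word y 50 times,
# so one co-occurrence is expected by chance.
mi_attract = mutual_information(f_xy=4, f_x=20, f_y=50, corpus_size=1000)  # observed > expected
mi_random = mutual_information(f_xy=1, f_x=20, f_y=50, corpus_size=1000)   # observed = expected
mi_avoid = mutual_information(f_xy=1, f_x=40, f_y=50, corpus_size=1000)    # observed < expected

print(mi_attract, mi_random, mi_avoid)
```

The three calls illustrate the three cases on the slide: a positive score, a score of 0, and a negative score.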

11 Tools http://ucrel.lancs.ac.uk/llwizard.html
+ project.org/web/packages/languageR/languageR.pdf SPSS Calculating manually, e.g. with the help of an Excel table
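As an alternative to a spreadsheet, the log-likelihood comparison of one item's frequency in two corpora can be calculated by hand in a few lines. The sketch uses the common two-term formula LL = 2 * (a * ln(a / E1) + b * ln(b / E2)), where a and b are the observed frequencies and E1 and E2 the frequencies expected if the item were equally frequent in both corpora. The function name and the example counts are assumptions.

```python
import math


def log_likelihood(freq_1: int, size_1: int, freq_2: int, size_2: int) -> float:
    """Log-likelihood for one item's frequency in two corpora of given sizes."""
    total_freq = freq_1 + freq_2
    total_size = size_1 + size_2
    # Expected frequencies if the item were spread evenly over both corpora.
    expected_1 = size_1 * total_freq / total_size
    expected_2 = size_2 * total_freq / total_size

    ll = 0.0
    for observed, expected in ((freq_1, expected_1), (freq_2, expected_2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Hypothetical counts: 150 hits in 50,000 words versus 400 hits in 200,000 words.
ll_different = log_likelihood(150, 50_000, 400, 200_000)
# Same relative frequency in both corpora.
ll_same = log_likelihood(10, 1000, 20, 2000)

print(ll_different, ll_same)
```

With one degree of freedom, an LL value above 3.84 corresponds to p < 0.05, so the first comparison would count as significant at that level and the second would not.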

12 Useful literature tml linguistics/grammar-and-syntax/analyzing-linguistic-data-practical- introduction-statistics-using-r

