Statistics for Linguistics Students Michaelmas 2004 Week 1 Bettina Braun.


1 Statistics for Linguistics Students Michaelmas 2004 Week 1 Bettina Braun

2 Why calculate statistics? Describe and summarise the data E.g. examination results (out of 100): 22 98 40 45 16 31 77 78 55 45 61 91 87 45 54 66 75 87 88 49 64 76 58 61 … Average mark? Spread of scores? Lowest and highest marks? Comparison with other results (e.g. from last year)?

3 Population vs. Sample Population: total universe of all possible observations. Populations can be finite or infinite, real or theoretical –The IQ of all adult men in Britain –The outcome of an infinite number of flips of a coin Descriptive statistics of a population are called parameters

4 Population vs. Sample (cont’d) Sample: subset of observations drawn from a given population –The IQ scores of 100 adult men in Britain –The outcome of 50 flips of a coin Descriptive statistics from a sample are called statistics Note: In experimental research it is important to draw a representative, random sample that is not biased
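As a rough illustration of drawing an unbiased random sample, the sketch below picks 100 observations from a simulated population. The population itself (10,000 made-up IQ-like scores with mean 100 and sd 16) and the fixed seed are assumptions for the example, not data from the lecture.

```python
import random

# Hypothetical population: 10,000 simulated IQ-like scores (not real data).
rng = random.Random(2004)  # fixed seed so the example is repeatable
population = [round(rng.gauss(100, 16)) for _ in range(10_000)]

# Simple random sample without replacement: every observation
# has the same chance of being selected.
sample = rng.sample(population, k=100)

# The mean of the population is a parameter; the mean of the sample is a statistic.
population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(population_mean, sample_mean)
```

The two means will usually be close but not identical, which is why the slides keep parameters and statistics apart.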

5 Histograms: Frequency distribution of each event Data: Tutorial1.sav
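A frequency distribution can be sketched in a few lines. The marks below are hypothetical (the contents of Tutorial1.sav are not available here), and the bin width of 10 is an arbitrary choice.

```python
from collections import Counter

# Hypothetical marks out of 100 (not the contents of Tutorial1.sav).
marks = [22, 45, 45, 55, 61, 61, 66, 77, 78, 98]

# Count how many marks fall into each bin of width 10 (20-29, 30-39, ...).
bins = Counter((mark // 10) * 10 for mark in marks)

# Print a crude text histogram, one row per non-empty bin.
for lower in sorted(bins):
    print(f"{lower}-{lower + 9}: {'#' * bins[lower]}")
```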

6 Central tendency: mode and median Mode: Most frequent mark (Note: there may be multiple modes) Median: score from the middle of the list when ordered from lowest to highest. Cuts data into halves (doesn’t take account of values of all scores but only of the scores in middle position).
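A minimal sketch of both measures using Python's standard `statistics` module. The marks are hypothetical and were chosen so that there are two modes.

```python
from statistics import median, multimode

# Hypothetical exam marks (not the lecture's data set).
marks = [45, 61, 45, 87, 54, 66, 75, 61]

# multimode returns every value tied for the highest frequency.
modes = multimode(marks)

# median orders the scores; with an even number of scores it
# averages the two values in the middle positions.
middle = median(marks)

print(modes, middle)
```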

7 Central tendency: mean Mean: sum of scores divided by the number of scores Note on notation: Greek letters are often used for population parameters, Roman letters for statistics (properties of a sample)
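The slide's own formula did not survive in the transcript. In the conventional notation (an assumption here, with N the population size and n the sample size), the population mean and the sample mean would be written as:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```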

8 Comparing measures of “central tendency” Mode: –Quick if we have a frequency distribution –Possible with categorical data Median: –Good estimate if we have abnormally large or small values (e.g. max aircraft speed of 450km/h, 480km/h, 500km/h, 530km/h, 600km/h, and 1100km/h) –Only influenced by values in the middle of ordered data Mean: –Every score is taken into account –Some interesting properties → Most widely used
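The aircraft-speed example can be checked directly. This sketch uses the six speeds from the slide and shows how the single very large value pulls the mean upwards while leaving the median near the bulk of the data.

```python
from statistics import mean, median

# Maximum aircraft speeds in km/h, as listed on the slide.
speeds = [450, 480, 500, 530, 600, 1100]

mean_speed = mean(speeds)      # every score counts, including 1100
median_speed = median(speeds)  # only the two middle scores matter

print(mean_speed, median_speed)
```

Here the mean is larger than five of the six speeds, whereas the median lies between the third and fourth.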

9 Types of variables Interval (scale): differences between consecutive numbers are equal (e.g. time, speed, distances). Precise measurements Ordinal: assignment of ranks that represent position along some ordered dimension (e.g. ranking people with respect to their speed, 1 = fastest, 4 = slowest). No equal intervals Categorical (nominal): numbers used as category labels (e.g. brown = 1, blue = 2, green = 3) Question: on which type of data can we calculate a meaningful “central tendency”?

10 Spread of distributions: why?

11 Spread of distributions: range and quartiles Small spread is often desirable as it indicates a high proportion of identical scores Large spread indicates large differences between individual scores Range: difference between highest and lowest score, a rather crude measure Quartiles: cut the ordered data into quarters (second quartile = median)
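A small sketch of range and quartiles on hypothetical scores. It relies on `statistics.quantiles` with its default "exclusive" method; other packages use slightly different quartile conventions, so their values may differ a little on the same data.

```python
from statistics import median, quantiles

# Hypothetical scores (not the lecture's data set).
scores = [22, 40, 45, 55, 61, 77, 98]

# Range: highest score minus lowest score.
score_range = max(scores) - min(scores)

# n=4 asks for the three cut points that split the ordered data into quarters.
q1, q2, q3 = quantiles(scores, n=4)

# Interquartile range: the spread of the middle half of the data.
iqr = q3 - q1

print(score_range, q1, q2, q3, iqr)
```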

12 Median, quartiles, and outliers Boxplot labels, from top to bottom: * Extreme value (more than 3 box lengths above or below the box) o Outlier (more than 1.5 box lengths above or below the box) Largest value which is not an outlier Upper quartile Median Lower quartile Smallest value which is not an outlier Box length = interquartile range tutorial1.sav: simple bp, sep. var
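The 1.5 and 3 box-length rules can be sketched as below. The function name `classify` and the data are hypothetical, and the quartiles come from `statistics.quantiles`, which may not match exactly the hinges a statistics package uses for its boxplots.

```python
from statistics import quantiles

def classify(values):
    """Label values lying more than 1.5 or 3 box lengths outside the box."""
    q1, _, q3 = quantiles(values, n=4)
    box_length = q3 - q1  # interquartile range
    labels = {}
    for value in values:
        # Distance beyond the nearer box edge; negative when inside the box.
        distance = max(q1 - value, value - q3)
        if distance > 3 * box_length:
            labels[value] = "extreme"
        elif distance > 1.5 * box_length:
            labels[value] = "outlier"
    return labels

# Hypothetical data: quartiles are 12 and 18, so the box length is 6.
data = [2, 10, 12, 13, 14, 15, 16, 17, 18, 29, 40]
flags = classify(data)
print(flags)
```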

13 Spread of the population: variance measures Variance: mean of the squared deviations from the mean (sum of squared deviations divided by the number of scores) Standard deviation: square root of variance
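A minimal sketch of the population variance and standard deviation, computed step by step and then checked against the standard library. The scores are hypothetical. Note that when a sample is used to estimate the population variance, the sum of squares is conventionally divided by n - 1 instead (`statistics.variance`).

```python
from math import sqrt
from statistics import pstdev, pvariance, variance

# Hypothetical scores, treated here as a complete population.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(scores) / len(scores)

# Sum of squared deviations from the mean.
sum_of_squares = sum((x - mean) ** 2 for x in scores)

# Population variance divides by N; standard deviation is its square root.
pop_var = sum_of_squares / len(scores)
pop_sd = sqrt(pop_var)

# Sample estimate divides by n - 1 and is therefore slightly larger.
sample_var = variance(scores)

print(pop_var, pop_sd, sample_var)
```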

14 Normal distribution (Gaussian distribution) Example: IQ scores, mean=100, sd=16 Mean = Median = Mode

15 Skewed distributions and measures of central tendency

16 Bimodal distributions

17 Normal distribution (Gaussian distribution) Example: IQ scores, mean=100, sd=16 Mean = Median = Mode

18 z-scores Z-score: deviation of given score from the mean in terms of standard deviations
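The definition translates into one line of arithmetic. The sketch below applies it to the IQ example from the earlier slide (mean 100, sd 16); the function name `z_score` and the two test scores are just illustrative choices.

```python
def z_score(score, mean, sd):
    """Deviation of a score from the mean, in units of standard deviations."""
    return (score - mean) / sd

# IQ scale from the slides: mean = 100, sd = 16.
z_high = z_score(132, mean=100, sd=16)  # two standard deviations above the mean
z_low = z_score(84, mean=100, sd=16)    # one standard deviation below the mean

print(z_high, z_low)
```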

19 How likely is a given event? Example: time to utter a particular sentence: mean = 3.45s and sd = 0.84s Questions: –What proportion of the population of utterance times will fall below 3s? –What proportion would lie between 3s and 4s? –What is the time value below which we will find 1% of the data?
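One way to approach the three questions, assuming that utterance times are normally distributed and that 3.45s is the mean (neither is stated explicitly on the slide), is to use `statistics.NormalDist` from the Python standard library in place of a z-table:

```python
from statistics import NormalDist

# Assumption: utterance times follow a normal distribution
# with mean 3.45 s and standard deviation 0.84 s.
utterance = NormalDist(mu=3.45, sigma=0.84)

# Proportion of utterance times below 3 s.
below_3s = utterance.cdf(3.0)

# Proportion between 3 s and 4 s: difference of two cumulative proportions.
between_3s_and_4s = utterance.cdf(4.0) - utterance.cdf(3.0)

# Time below which 1% of the data fall (the first percentile).
first_percentile = utterance.inv_cdf(0.01)

print(round(below_3s, 3), round(between_3s_and_4s, 3), round(first_percentile, 2))
```

Under these assumptions roughly 30% of times fall below 3s, roughly 45% lie between 3s and 4s, and the 1% cut-off is close to 1.5s. The same answers follow from converting 3s and 4s to z-scores and looking them up in a standard normal table.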

