Statistical Methods in Computer Science Data 2: Central Tendency & Variability Ido Dagan.


1 Statistical Methods in Computer Science Data 2: Central Tendency & Variability Ido Dagan

2 Empirical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan 2 Frequency Distributions and Scales

3 Characteristics of Distributions: shape, central tendency, variability. [Figure panels: Different Central Tendency; Different Variability]

4 This Lesson. Examine measures of central tendency: mode (nominal), median (ordinal), mean (numerical). Examine measures of variability (dispersion): entropy (nominal); variance and standard deviation (numerical); standard scores (z-scores).

5 Centrality/Variability Measures and Scales

6 The Mode (Mo; Hebrew: השכיח). The mode of a variable is its most frequent value: Mo = argmax_x f(x). For a categorical variable: the category that appeared most often. For grouped data: the midpoint of the most frequent interval, under the assumption that values are evenly distributed within the interval.
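As a minimal sketch of Mo = argmax f(x) for a categorical variable, the frequencies can be tallied and every value tied for the top count returned. The function name `modes` and the `colors` data are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency (the mode or modes)."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Hypothetical nominal data
colors = ["red", "blue", "red", "green", "blue", "red"]
print(modes(colors))
```

Returning a list rather than a single value keeps the multi-modal case visible instead of silently picking one of the tied values.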

7 Finding the Mode: Example 1. The collection of values that a variable X took during the measurement. [Figure not reproduced.] Mo = ? Depends on the grouping.

8 Finding the Mode: Example 2. The mode of a grouped frequency distribution depends on the grouping. [Figure not reproduced; surviving labels: 86, 88, 87.]
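The dependence on grouping can be illustrated with a small sketch that bins the same scores with two interval widths and reports the midpoint of the most frequent interval each time. The scores, the starting real limit of 79.5, and the function name `grouped_mode` are assumptions made for this example; they are not the data from the slide's figure.

```python
import math
from collections import Counter

def grouped_mode(values, lower_limit, width):
    """Midpoint of the most frequent interval.

    Intervals are [lower_limit + k*width, lower_limit + (k+1)*width).
    If several intervals tie, the first one encountered is used.
    """
    bins = Counter(math.floor((v - lower_limit) / width) for v in values)
    best_bin, _ = bins.most_common(1)[0]
    return lower_limit + (best_bin + 0.5) * width

# Hypothetical integer scores
scores = [80, 82, 85, 85, 86, 88, 88, 88, 89, 90, 93]
print(grouped_mode(scores, 79.5, 5))  # intervals of width 5
print(grouped_mode(scores, 79.5, 2))  # intervals of width 2
```

With these assumed scores, the width-5 grouping gives a mode of 87.0 and the width-2 grouping gives 88.5, even though the underlying data are identical.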

9-14 The Median (Mdn; Hebrew: החציון). The median of a variable is its 50th percentile, P50: the point below which 50% of all measurements fall. It requires ordering, so it applies only to ordinal and numerical scales. Examples: 0, 8, 8, 11, 15, 16, 20 ==> Mdn = 11. 12, 14, 15, 18, 19, 20 ==> Mdn = 16.5 (halfway between 15 and 18). 5, 7, 8, 8, 8, 8 ==> Mdn = ? One method: halfway between the first and second 8, so Mdn = 8. Another (for real limits): use linear interpolation as we did for intervals. Half of the 6 scores (3) must fall below the median; 2 of them fall below 7.5, so 1 of the four 8's is needed: Mdn = 7.5 + (¼ * 1.0) = 7.75, where 7.5 is the real limit between 7 and 8, ¼ is 1 of the four 8's, and 1.0 is the width of the interval containing the 8's (real limits).
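The two methods can be sketched side by side. This is an illustration under assumptions: the function names are invented here, and the interpolation version assumes scores measured on a grid of width `unit` (1.0 for integers), so that each value v occupies the real limits v - unit/2 to v + unit/2.

```python
from collections import Counter

def median_midpoint(values):
    """Middle score, or halfway between the two middle scores."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

def median_real_limits(values, unit=1.0):
    """Linear interpolation inside the real limits of the value holding P50."""
    counts = Counter(values)
    target = len(values) / 2          # number of scores that must fall below
    below = 0                         # scores below the current value
    for value in sorted(counts):
        count = counts[value]
        if below + count >= target:
            lower_limit = value - unit / 2
            return lower_limit + (target - below) / count * unit
        below += count
    raise ValueError("median of empty data")

print(median_midpoint([5, 7, 8, 8, 8, 8]))     # first method
print(median_real_limits([5, 7, 8, 8, 8, 8]))  # 7.5 + (1/4 * 1.0)
```

The interpolation variant is aimed at tied scores on a unit grid, as in the 5, 7, 8, 8, 8, 8 example; for data with gaps between neighbouring values the two methods can disagree.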

15 The Arithmetic Mean (Hebrew: ממוצע חשבוני). Arithmetic mean (mean, for short): the sum of all scores divided by the number of scores. "Average" is colloquial and not precisely defined when used, so we avoid the term.

16 Properties of Central Tendency Measures. Mode: relatively unstable between samples. Problematic in grouped distributions. There can be more than one: distributions with more than one mode are sometimes called multi-modal. For uniform distributions, all values are possible modes. Typically used only on nominal data.

17 Properties of Central Tendency Measures. Mean: responsive to the exact value of each score. Only for interval and ratio scales. Takes the total of the scores into account, so it does not ignore any value. The sum of deviations from the mean is always zero: sum over i of (x_i - mean) = 0. Because of this, it is sensitive to outliers (the presence or absence of scores at extreme values). Stable between samples, and the basis for many other statistical measures.
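The zero-sum property of deviations can be checked numerically. The sketch below reuses the first median example as data; the helper name `mean` is an assumption, and the comparison uses a tolerance because floating-point sums are rarely exactly zero.

```python
import math

def mean(values):
    """Arithmetic mean: total of the scores divided by their count."""
    return sum(values) / len(values)

scores = [0, 8, 8, 11, 15, 16, 20]
m = mean(scores)
deviations = [x - m for x in scores]

# Positive and negative deviations cancel out, up to rounding error.
print(math.isclose(sum(deviations), 0.0, abs_tol=1e-9))
```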

18 Properties of Central Tendency Measures. Median: robust to extreme values. Cares only about ordering, not the magnitude of intervals. Often used with skewed distributions. [Figure labels: Mo, Mdn, Mean]
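A quick way to see this robustness is to replace the largest score with an extreme value and compare how the two measures react. The data and the outlier value of 200 are assumptions chosen for illustration.

```python
import statistics

base = [0, 8, 8, 11, 15, 16, 20]
with_outlier = base[:-1] + [200]   # replace the largest score with an extreme one

# The median only depends on the ordering, so it does not move.
print(statistics.median(base), statistics.median(with_outlier))

# The mean uses every magnitude, so it is pulled toward the outlier.
print(statistics.mean(base), statistics.mean(with_outlier))
```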

19-20 Properties of Central Tendency Measures. Contrasting mode, median, and mean. [Figure labels: Mo, Mdn, Mean]

21 Dispersion and Variability. Mode, median, and mean give only central tendencies. [Figure labels: Mo, Mdn, Mean] We also need to measure the spread of the distribution.

22 Dispersion as Ranges. Range: max(X) - min(X). Semi-interquartile range: half the width of the range that contains the middle 50% of the scores, (Q3 - Q1) / 2.
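Both range measures can be sketched with the standard library. Quartiles have several competing definitions; this example assumes the "inclusive" method of `statistics.quantiles`, which may differ slightly from the percentile procedure used in the course. The function names and the scores are illustrative.

```python
import statistics

def value_range(values):
    """Range: max(X) - min(X)."""
    return max(values) - min(values)

def semi_interquartile_range(values):
    """Half the distance between the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / 2

# Hypothetical scores
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(value_range(scores))
print(semi_interquartile_range(scores))
```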

23 Dispersion as Deviation. Look at dispersion as a function of the central tendency (mean). We know the sum of deviations from the mean is zero. But what if we look at the sum of absolute deviations? A smaller sum indicates tighter clustering of the distribution around the mean.
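To make the idea concrete, here is a small sketch comparing two made-up distributions that share the same mean but differ in spread. The function name and both data sets are assumptions for illustration.

```python
def sum_abs_deviations(values):
    """Sum of absolute deviations from the mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values)

tight = [9, 10, 10, 11]    # mean 10, clustered around the mean
spread = [2, 6, 14, 18]    # mean 10, widely dispersed

print(sum_abs_deviations(tight))
print(sum_abs_deviations(spread))
```

The tightly clustered data give a much smaller sum than the dispersed data, even though both have a mean of 10.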

24 Variance. Statisticians prefer an alternative to absolute values: the sum of squares, shorthand for the sum of squared deviations from the mean. Normalizing it by the size of the sample gives the variance of the distribution.
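The two steps, sum of squares and then normalization, can be sketched as follows. This assumes the population form that divides by N, which matches "normalizing for the size of the sample" as stated here; the function names and the data are illustrative.

```python
import statistics

def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

def variance(values):
    """Sum of squares normalized by the number of scores (population form)."""
    return sum_of_squares(values) / len(values)

# Hypothetical scores with mean 5
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sum_of_squares(data))
print(variance(data))
print(statistics.pvariance(data))  # standard-library population variance
```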

25 Standard Deviation (std.). The square root of the variance. Robust to sampling variation: it does not change very much with new samples of the population. Perhaps the most common measure of dispersion. The std. is defined here for a population; the corresponding measure for a sample is a bit different. We ignore this for now and return to it later.
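A minimal sketch of the population standard deviation follows, alongside the standard library's two variants. The remark that the sample version divides by N - 1 is a general fact about `statistics.stdev`, added here as context rather than taken from the slide; the data are hypothetical.

```python
import math
import statistics

def std_dev(values):
    """Population standard deviation: square root of the variance (divide by N)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(std_dev(data))
print(statistics.pstdev(data))  # population form, divides by N
print(statistics.stdev(data))   # sample form, divides by N - 1, slightly larger
```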

26 Standard Scores. Mean, median, etc. behave predictably under constant translations: adding V to each value is the same as adding V to the central tendency measures. We may also need to compare distributions that differ in range. For instance, which is better: a score of 50 when the mean is 60, or a score of 60 when the mean is 80? We can compute z-scores of the raw scores.

27 z Scores. Key idea: express all values in units of standard deviation, z = (x - mean) / std. This allows comparison of values from different distributions, but only if the shapes of the distributions are similar.
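The score comparison from the standard-scores discussion can be worked through once standard deviations are known. The lecture gives only the means (60 and 80), so the standard deviations of 5 and 20 below are assumptions chosen purely for illustration, as are the function names.

```python
import math

def z_score(x, mean, std):
    """Distance of x from the mean, in units of standard deviation."""
    return (x - mean) / std

def z_scores(values):
    """z-scores of every value, using the population standard deviation."""
    m = sum(values) / len(values)
    std = math.sqrt(sum((x - m) ** 2 for x in values) / len(values))
    return [(x - m) / std for x in values]

# Means are from the example; the standard deviations are assumed.
print(z_score(50, mean=60, std=5))    # 2 standard deviations below the mean
print(z_score(60, mean=80, std=20))   # 1 standard deviation below the mean

print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
```

Under these assumed spreads, the score of 60 is relatively better even though it lies further below its mean in raw points. With different standard deviations the conclusion could reverse, which is the reason for standardizing.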

28 Measuring Dispersion in Nominal Scales: Entropy. H(X) = - sum over X of r_X * log2(r_X), where r_X is the relative frequency (rel f) of the value X. An entropy of 0 means that all values are the same, i.e., rel f = 1.0 for some value X. Entropy grows as values become more dispersed; e.g., an entropy of 1 means all scores are split evenly between two values. Entropy is maximal when r_X = 1/N for all N values X, i.e., for a uniform distribution.
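The three cases described above (entropy 0, entropy 1, and the uniform maximum) can be reproduced with a short sketch. It assumes base-2 logarithms, which is what makes an even two-way split come out as exactly 1; the function name and the toy data are illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """Entropy in bits, computed from relative frequencies.

    Written as sum of r * log2(1/r), which equals -sum of r * log2(r).
    """
    n = len(values)
    rel_freqs = [count / n for count in Counter(values).values()]
    return sum(r * math.log2(1 / r) for r in rel_freqs)

print(entropy(["a", "a", "a", "a"]))   # all values the same
print(entropy(["a", "b", "a", "b"]))   # split evenly between two values
print(entropy(["a", "b", "c", "d"]))   # uniform over four values
```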

29 Normalizing Entropy. We can normalize by dividing by the maximal entropy given N. This allows comparing the entropy of distributions of different sizes.
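A sketch of the normalization follows, assuming that N is the number of possible values (categories) and that the maximal entropy is therefore log2(N), reached by the uniform distribution. The parameter name `n_values` and the function names are invented for this example.

```python
import math
from collections import Counter

def entropy(values):
    """Entropy in bits, computed from relative frequencies."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

def normalized_entropy(values, n_values):
    """Entropy divided by its maximum, log2(n_values); result lies in [0, 1]."""
    if n_values < 2:
        raise ValueError("normalization needs at least two possible values")
    return entropy(values) / math.log2(n_values)

# Same even two-way split, measured against 2 and against 4 possible values.
print(normalized_entropy(["a", "b", "a", "b"], n_values=2))
print(normalized_entropy(["a", "b", "a", "b"], n_values=4))
```

Passing `n_values` explicitly, rather than counting the distinct values observed, lets categories that never occurred in the sample still count toward the maximum.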

