Statistical Methods in Computer Science Data 2: Central Tendency & Variability Ido Dagan.


Empirical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan

This Lesson
  Examine measures of central tendency:
    Mode (nominal)
    Median (ordinal)
    Mean (numerical)
  Examine measures of variability (dispersion):
    Entropy (nominal)
    Variance and standard deviation (numerical)
    Standard scores (z-scores)

The Mode (Mo) (Hebrew: השכיח)
  The mode of a variable is its most frequent value: Mo = argmax_x f(x), where f(x) is the frequency of the value x.
  For a categorical variable: the category that appears most often.
  For grouped data: the midpoint of the most frequent interval, under the assumption that values are evenly distributed within the interval.
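The two cases above (categorical and grouped data) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `modes` and `grouped_mode`, the `(lower, upper, frequency)` interval format, and the sample data are all assumptions made for the example.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency.

    A list is returned because a distribution can have more than one mode.
    """
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

def grouped_mode(intervals):
    """Return the midpoint of the most frequent interval.

    `intervals` is a list of (lower, upper, frequency) tuples (assumed format).
    Using the midpoint assumes values are evenly spread inside the interval.
    """
    lower, upper, _ = max(intervals, key=lambda interval: interval[2])
    return (lower + upper) / 2

# Hypothetical nominal data: "red" appears most often.
colors = ["red", "blue", "red", "green", "blue", "red"]
print(modes(colors))

# Hypothetical grouped data: the interval 10-20 is the most frequent.
bins = [(0, 10, 3), (10, 20, 7), (20, 30, 5)]
print(grouped_mode(bins))
```

Returning a list rather than a single value keeps ties visible instead of silently picking one of several modes.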

The Median (Mdn) (Hebrew: החציון)
  The median of a variable is its 50th percentile, P50: the point below which 50% of all measurements fall.
  Requires ordering, so it applies only to ordinal and numerical scales.
  Examples:
    0, 8, 8, 11, 15, 16, 20 ==> Mdn = 11
    12, 14, 15, 18, 19, 20 ==> Mdn = 16.5 (halfway between 15 and 18)
    5, 7, 8, 8, 8, 8 ==> Mdn = ?
      One method: take the point halfway between the first and second 8, so Mdn = 8.
      Another (using real limits): use linear interpolation, as we did for intervals, so Mdn = 7.75:
        7.75 = 7.5 + (¼ * 1.0)
        7.5: the lower real limit of 8, i.e., the boundary between 7 and 8
        ¼: the median needs 1 of the four 8's (two of the six scores already lie below 7.5, and three must lie below the median)
        1.0: the width of the interval containing the 8's (real limits 7.5 to 8.5)
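Both methods for the median can be written out as a short Python sketch. The function names are hypothetical, and `median_real_limits` assumes every score stands for an interval of width `width` centred on the score (unit width for integer scores); that assumption follows the real-limits reading of the slide, not a general-purpose definition.

```python
def median_midpoint(values):
    """Middle score, or the point halfway between the two middle scores."""
    xs = sorted(values)
    if not xs:
        raise ValueError("median of empty data")
    mid = len(xs) // 2
    if len(xs) % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

def median_real_limits(values, width=1.0):
    """Median by linear interpolation inside the real limits of a score.

    Assumes each score v covers the interval [v - width/2, v + width/2].
    """
    xs = sorted(values)
    if not xs:
        raise ValueError("median of empty data")
    target = len(xs) / 2          # number of scores that must lie below the median
    below = 0                     # scores counted so far
    for v in sorted(set(xs)):
        freq = xs.count(v)
        if below + freq >= target:
            lower_limit = v - width / 2
            # Take the needed fraction of this score's interval.
            return lower_limit + (target - below) / freq * width
        below += freq

print(median_midpoint([0, 8, 8, 11, 15, 16, 20]))    # 11
print(median_midpoint([12, 14, 15, 18, 19, 20]))     # 16.5
print(median_midpoint([5, 7, 8, 8, 8, 8]))           # 8.0
print(median_real_limits([5, 7, 8, 8, 8, 8]))        # 7.75
```

For the last data set the loop reaches the score 8 with two scores already counted, so it needs one of the four 8's and returns 7.5 + (1/4) * 1.0.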

The Arithmetic Mean (Hebrew: ממוצע חשבוני)
  Arithmetic mean (mean, for short): the sum of all scores divided by the number of scores.
  "Average" is colloquial: it is not precisely defined when used, so we avoid the term.

Properties of Central Tendency Measures: Mode
  Relatively unstable between samples.
  Problematic in grouped distributions (the mode depends on how the intervals are chosen).
  There can be more than one: distributions with more than one mode are sometimes called multi-modal.
  For uniform distributions, every value is a possible mode.
  Typically used only on nominal data.

Properties of Central Tendency Measures: Mean
  Responsive to the exact value of each score.
    Applies only to interval and ratio scales.
  Takes the total of the scores into account: does not ignore any value.
    The sum of deviations from the mean is always zero.
  Because every score counts, the mean is sensitive to outliers.
    It changes with the presence or absence of scores at extreme values.
  Stable between samples, and the basis for many other statistical measures.
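The zero-sum property of deviations from the mean can be checked directly. The sketch below reuses the first data set from the median examples and uses `fractions.Fraction` so that the check is exact rather than subject to floating-point rounding; the variable names are illustrative.

```python
from fractions import Fraction

scores = [0, 8, 8, 11, 15, 16, 20]           # data set from the median examples

mean = Fraction(sum(scores), len(scores))    # exact arithmetic mean, 78/7
deviations = [x - mean for x in scores]      # signed distance of each score from the mean

print(mean)
print(sum(deviations))                       # positive and negative deviations cancel
```

The cancellation holds for any data set, because summing (x - mean) over N scores gives the total minus N times the mean, and N times the mean is the total.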

Properties of Central Tendency Measures: Median
  Robust to extreme values.
  Cares only about ordering, not the magnitude of intervals.
  Often used with skewed distributions.
  [Figure: distribution with Mo, Mdn, and Mean marked]
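A small experiment illustrates the contrast between the median and the mean under an extreme value. It uses Python's standard `statistics` module; the "with outlier" data set is a hypothetical variation of the second median example, with the largest score replaced by 200.

```python
import statistics

original = [12, 14, 15, 18, 19, 20]        # data set from the median examples
with_outlier = [12, 14, 15, 18, 19, 200]   # hypothetical: largest score replaced by an extreme value

for name, data in [("original", original), ("with outlier", with_outlier)]:
    print(name,
          "mean =", round(statistics.mean(data), 2),
          "median =", statistics.median(data))
```

The median stays at 16.5 in both cases because only the ordering of the middle scores matters, while the mean is pulled strongly toward the extreme value.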

Dispersion and Variability
  Mode, median, mean: these give only central tendencies.
  [Figure: distribution with Mo, Mdn, and Mean marked]
  We need to measure the spread of the distribution.

Dispersion as Deviation
  Look at dispersion as a function of the central tendency (the mean).
  We know that the sum of deviations from the mean is zero.
  But what if we look at the sum of absolute deviations?
  A smaller sum indicates tighter clustering of the distribution around the mean.
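The idea of summing absolute deviations can be sketched as follows. The two data sets are hypothetical and were chosen to share the same mean (10) and the same size, so that only their spread differs; the function name is an assumption for the example.

```python
def sum_abs_deviations(values):
    """Sum of absolute deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values)

tight = [9, 10, 10, 11]     # hypothetical: clustered around the mean of 10
spread = [0, 5, 15, 20]     # hypothetical: same mean, much wider spread

print(sum_abs_deviations(tight))
print(sum_abs_deviations(spread))
```

Comparing raw sums is only meaningful for data sets of the same size; dividing by the number of scores gives the mean absolute deviation, which removes that restriction.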

Variance
  Statisticians prefer a different way of removing the signs of the deviations than taking absolute values: squaring them.
  Sum of squares: shorthand for the sum of squared deviations from the mean.
  Then normalize for the size of the sample: divide the sum of squares by the number of scores.
  The result is called the variance of the distribution.
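The steps above (squared deviations, their sum, division by the number of scores) can be sketched as below. The data set and function names are hypothetical. The divisor is the number of scores N, matching the population form used in this lesson; `statistics.pvariance` from the standard library computes the same quantity and is used only as a cross-check.

```python
import statistics

def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def variance(values):
    """Sum of squares normalized by the number of scores (population form)."""
    return sum_of_squares(values) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical scores, mean = 5

print(sum_of_squares(data))          # 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16
print(variance(data))
print(statistics.pvariance(data))    # library cross-check
```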

Standard Deviation (std.)
  The square root of the variance.
  Robust to sampling variation: does not change very much with new samples of the population.
  Perhaps the most common measure of dispersion.
  The std as defined here is for a population; the estimate computed from a sample is slightly different.
    We ignore this for now and return to it later.

Standard Scores
  Mean, median, etc. respond predictably to constant translations.
    Adding V to each value adds V to the central tendency measures.
  We may also need to compare distributions that differ in range.
    For instance, which is better: a score of 50 when the mean is 60, or a score of 60 when the mean is 80?
  We can compute z-scores of the raw scores.

z Scores
  Key idea: express all values in units of standard deviation.
    The z-score of a value is its deviation from the mean, divided by the standard deviation.
  This allows comparison of values from different distributions,
    but only if the shapes of the distributions are similar.
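The comparison of a score of 50 (mean 60) with a score of 60 (mean 80) can be sketched with z-scores. The slides give no standard deviations for that example, so the values 10 and 40 below are assumptions chosen purely for illustration; with other values the conclusion could differ. `statistics.pstdev` is the standard library's population standard deviation.

```python
import statistics

def z_score(x, mean, std):
    """Deviation of x from the mean, expressed in units of standard deviation."""
    return (x - mean) / std

# Means are from the example; the standard deviations are ASSUMED for illustration.
z_a = z_score(50, mean=60, std=10)   # 1 std below the mean
z_b = z_score(60, mean=80, std=40)   # 0.5 std below the mean
print(z_a, z_b)

def z_scores(values):
    """z-scores of a whole data set, using the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical scores: mean 5, std 2
print(z_scores(data))
```

Under these assumed standard deviations the score of 60 is relatively better, since it lies only half a standard deviation below its mean.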

Measuring Dispersion in Nominal Scales: Entropy
  $H = -\sum_{X} r_X \log_2 r_X$, where r_X is the relative frequency (rel f) of the value X.
  Entropy of 0 means that all scores have the same value: rel f = 1.0 for some value X.
  Entropy grows as the scores become more dispersed.
    E.g., entropy of 1 means that the scores are split evenly between two values.
  Entropy is maximal when r_X = 1/N for all N values X, i.e., for a uniform distribution.
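The behaviour described above can be reproduced with a short entropy function. This is a sketch assuming base-2 logarithms, which is the base consistent with "entropy of 1 means the scores are split evenly between two values"; the function name and the sample data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Base-2 entropy of the relative frequencies of the observed values."""
    n = len(values)
    rel_freqs = [count / n for count in Counter(values).values()]
    return sum(-r * log2(r) for r in rel_freqs)

print(entropy(["a", "a", "a", "a"]))    # all scores identical
print(entropy(["a", "b", "a", "b"]))    # even split between two values
print(entropy(["a", "b", "c", "d"]))    # uniform over four values
```

Only values that actually occur contribute to the sum, so the undefined term for a relative frequency of zero never arises.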

Normalizing Entropy
  We can normalize entropy by dividing it by the maximal entropy for N values, which is log2 N (the entropy of the uniform distribution).
  This allows comparing the entropy of distributions with different numbers of values.
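Normalization can be sketched as dividing the entropy by log2 N. In the sketch below, N defaults to the number of distinct values observed in the data, and the optional `num_values` parameter (a hypothetical name) lets the caller state how many values are possible; which of the two the lesson intends by N is an assumption here. Base-2 logarithms are assumed throughout.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Base-2 entropy of the relative frequencies of the observed values."""
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())

def normalized_entropy(values, num_values=None):
    """Entropy divided by the maximal entropy log2(N), giving a result in [0, 1].

    num_values: number of possible values N; defaults to the number of
    distinct values observed (an assumption of this sketch).
    """
    n_values = num_values if num_values is not None else len(set(values))
    if n_values <= 1:
        return 0.0                      # a single possible value has no dispersion
    return entropy(values) / log2(n_values)

two = ["a", "b", "a", "b"]              # hypothetical: uniform over 2 values
four = ["a", "b", "c", "d"]             # hypothetical: uniform over 4 values
print(entropy(two), entropy(four))                        # raw entropies differ
print(normalized_entropy(two), normalized_entropy(four))  # both maximal after normalizing
print(normalized_entropy(two, num_values=4))              # same data, 4 possible values
```

The raw entropies of the two uniform data sets differ only because they have different numbers of values; after normalization both reach the maximum of 1.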
