Descriptive Statistics (Part 2)


1 Descriptive Statistics (Part 2)
Lecture 4 Justin Kern September 5 and 7, 2017

2 Mean
Interpretation: What can we say about a data set based on the mean?
Example: In Fall 2016, the average final grade for this course was 83. Are the following claims definitely true?
  In Fall 2016, some students received greater than 72. YES
  In Fall 2016, all students received at least 83. NO
  In Fall 2016, some students received 83 or higher.
  This semester, some students will receive at least 83.
  In Fall 2016, at least one student received exactly 83.
  In Fall 2016, the number of students who received 83 or higher was equal to the number of students who received 83 or lower.

3 Comparisons of Mean, Median & Mode
Measures of Central Tendency
Mean
  Only one possible value for a data set
  Very sensitive to extreme values or any change in data
Median
  Not sensitive to extreme values in data or replacement of values at extremes
  More sensitive to removal or addition of values in data set
Mode
  None, one, or more possible values for data set
  Not sensitive to extreme values
  Not sensitive to replacement of values at extremes (unless mode is an extreme value)
  Slightly sensitive to removal or addition of values in data set

4 Comparisons of Mean, Median & Mode
Sensitivity to changes in data
  57, 63, 75, 75, 75, 75, 92: Mean = 73.1, Median = 75, Mode = 75
  57, 63, 75, 75, 75, 75, 96: Mean = 73.7, Median = 75, Mode = 75
  57, 63, 75, 75, 75, 75: Mean = 70, Median = 75, Mode = 75
Note that a small change in just one number in the data set affects the mean, but not the median or mode, even though a small change in one data point probably reflects nothing of real-world significance. Note also that removing one high value had a sizeable effect on the mean but no effect on the median or mode.
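The comparison on this slide can be reproduced with Python's standard `statistics` module. This is a minimal sketch: the helper name `summarize` and the dictionary labels are illustrative, and the third data set is assumed to be the first one with the 92 removed, which matches the stated mean of 70.

```python
from statistics import mean, median, multimode

# The three data sets compared on the slide.
datasets = {
    "original": [57, 63, 75, 75, 75, 75, 92],
    "92 changed to 96": [57, 63, 75, 75, 75, 75, 96],
    "92 removed": [57, 63, 75, 75, 75, 75],
}

def summarize(values):
    """Return (mean rounded to 1 decimal, median, list of modes)."""
    return round(mean(values), 1), median(values), multimode(values)

for label, values in datasets.items():
    m, med, modes = summarize(values)
    print(f"{label}: mean={m}, median={med}, mode={modes}")
```

Only the mean moves across the three rows; the median and mode stay at 75.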

5 Comparisons of Mean, Median & Mode
Sensitivity to changes in data
  57, 63, 63, 63, 75, 75, 92: Mean = 69.7, Median = 63, Mode = 63
  47, 63, 63, 63, 75, 75, 92: Mean = 68.3, Median = 63, Mode = 63
  63, 63, 63, 75, 75, 92: Mean = 71.8, Median = 69, Mode = 63
The mean fluctuates, but relatively little. The median changed quite a bit in response to the removal of one value, whereas it did not respond at all to a sizeable change in a value.

6 Comparisons of Mean, Median & Mode
Distributions
  Symmetrical: Mean = Median
  Positively Skewed: Mean > Median
  Negatively Skewed: Mean < Median
Show a bimodal distribution on the board.

7 Which measure to use?
Depends on the situation and what you are interested in.
Mean is the most widely used statistic
  Familiar to most people
  Has the most desirable statistical properties
  Can be misleading (very sensitive to extreme values or outliers)
Example: think about mean household income. The median is more indicative of the typical American family, because a few extremely wealthy households pull the mean upward.

8 Measures of Spread
When describing a distribution of scores, a measure that describes the center of the distribution (or most common value or values) is key. It is also important to know how the values in the distribution are dispersed around the center.
Common terms:
  Measures of Dispersion
  Measures of Spread
  Variability

9 The Range
Range: The range is the difference between the largest value (L) and the smallest value (S) in a data set.
  Range = L - S
Example: Suppose we have five scores: 1, 2, 3, 4, 5. What is the range? 5 - 1 = 4
This is a very simple measure of spread that only accounts for two values in the data set. It is not very useful for data with outliers.
Example: 1, 2, 3, 4, 100. The range here is 100 - 1 = 99.
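As a small sketch of the definition above, the range can be computed from the largest and smallest values alone. The function name `value_range` is just an illustrative choice.

```python
def value_range(values):
    """Range = largest value (L) minus smallest value (S)."""
    return max(values) - min(values)

print(value_range([1, 2, 3, 4, 5]))    # 4
print(value_range([1, 2, 3, 4, 100]))  # 99
```

Changing a single value from 5 to 100 moves the range from 4 to 99, which is why the range is a poor summary when outliers are present.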

10 Percentiles (and Quartiles)
Percentile: The value below which a given percentage of observations in a group of observations fall. Percentiles divide the data into 100 equal parts.
Example: the 20th percentile is the value below which 20 percent of the observations may be found.
Quartile: A measure that divides the data into four equal parts.
Special Percentiles:
  Minimum = 0th percentile
  Lower Quartile = 25th percentile ($Q_1$)
  Median = 50th percentile ($Q_2$)
  Upper Quartile = 75th percentile ($Q_3$)
  Maximum = 100th percentile
Percentiles are often confusing and not intuitive. Calculating them can be a pain. Try not to get too frustrated if you're struggling with this one.

11 Quartiles (Five-number Summary)
  Minimum: Smallest value of data set
  Lower Quartile ($Q_1$): Median of lower half of data set (left of median)
  Median ($Q_2$): Middle value of data set
  Upper Quartile ($Q_3$): Median of upper half of data set (right of median)
  Maximum: Largest value of data set

12 Interquartile Range
Interquartile range: The difference between the upper and lower quartiles.
  $IQR = Q_3 - Q_1$
It gives a measure of the spread within the middle 50% of a distribution. Its corresponding measure of central tendency is the median.
Because the IQR excludes the top and bottom 25% of scores, outliers have little influence on it. On the other hand, its calculation implicitly only uses 50% of all values. The other half are left out.

13 Five Number Summary
Odd number of values: 55, 58, 69, 79, 82, 88, 94
  Min = 55, LQ = 58, Med = 79, UQ = 88, Max = 94
Even number of values: 55, 58, 69, 79, 82, 88, 94, 97
  Min = 55
  LQ = (58+69)/2 = 63.5
  Med = 80.5
  UQ = (88+94)/2 = 91
  Max = 97
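The "median of each half" rule used here can be sketched in Python as follows. The function name `five_number_summary` is illustrative, and the sketch assumes the convention from the slides: for an odd number of values, the median itself is left out of both halves. Other textbooks and software use slightly different quartile conventions, so results may differ from library functions.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Assumes at least two values. For an odd count, the middle value
    is excluded from both the lower and the upper half.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                  # left of the median
    upper = data[half + 1:] if n % 2 else data[half:]    # right of the median
    return min(data), median(lower), median(data), median(upper), max(data)

odd = five_number_summary([55, 58, 69, 79, 82, 88, 94])
even = five_number_summary([55, 58, 69, 79, 82, 88, 94, 97])
print(odd)                 # (55, 58, 79, 88, 94)
print(even)                # (55, 63.5, 80.5, 91.0, 97)
print(even[3] - even[1])   # IQR = Q3 - Q1 = 27.5
```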

14 Box and Whisker Plot
Plot of the five number summary
[Figure: box-and-whisker plot with Min = 55, LQ = 58, Med = 79, UQ = 88, Max = 94]

15 Variance
Variance: The average squared distance of observations from the mean. Its corresponding measure of central tendency is the mean.
Standard deviation: The square root of the variance.
Variance is in squared units, so it can be hard to interpret. Standard deviation is in the original units of the measurements.
These measure how much the data deviate from the mean. They use all the data in their calculations.
Draw distributions with large and small variances on the board.

16 Variance
Two versions depending on the context of data
Population: all subjects of interest
  Every single person in the United States
  Entire UCM student body
  Calculation: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{SS}{N}$
  Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample: subset of population
  100 random people in the US
  100 random students from UCM
  Calculation: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{SS}{n-1}$
  Sample standard deviation: $s = \sqrt{s^2}$
Point out what a deviation is. Make note that SS stands for sum of squares. Make a slight allusion to why $s^2$ is calculated with n-1 instead of n.
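The two versions differ only in the denominator applied to the sum of squares (SS). The sketch below makes that explicit. The function names are illustrative, and the five scores 1, 2, 3, 4, 5 are borrowed from the range example purely as test data.

```python
import math

def sum_of_squares(values, center):
    """SS: sum of squared deviations of each value from `center`."""
    return sum((x - center) ** 2 for x in values)

def population_variance(values):
    """sigma^2 = SS / N, treating `values` as the whole population."""
    n = len(values)
    mu = sum(values) / n
    return sum_of_squares(values, mu) / n

def sample_variance(values):
    """s^2 = SS / (n - 1), treating `values` as a sample (needs n >= 2)."""
    n = len(values)
    xbar = sum(values) / n
    return sum_of_squares(values, xbar) / (n - 1)

scores = [1, 2, 3, 4, 5]
print(population_variance(scores))                       # 2.0
print(sample_variance(scores))                           # 2.5
print(round(math.sqrt(population_variance(scores)), 2))  # 1.41
print(round(math.sqrt(sample_variance(scores)), 2))      # 1.58
```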

17 Population Standard Deviation (σ)
Measure of how spread out the values are in a data set consisting of the entire population of interest (i.e., how much do the population values deviate from the mean?)
Step 1: Calculate the deviation (distance) of each value ($x_i$) from the population mean (µ). Here µ = 3.
  (1-3) = -2
  (2-3) = -1
  (3-3) = 0
  (4-3) = 1
  (5-3) = 2
Step 2: Square each deviation, then add them up.
  $(-2)^2 + (-1)^2 + (0)^2 + (1)^2 + (2)^2 = 10$
Step 3: Divide the sum of squared deviations by the number of observations (N).
  10/5 = 2 (population variance = $\sigma^2$)
Step 4: Take the square root of the "average squared deviation" (population variance).
  $\sqrt{2} = 1.41 = \sigma$
You square these values because otherwise they will add up to 0 BY DEFINITION. Show why that is.
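One way to show why the unsquared deviations always add up to zero is the short derivation below. It uses only the definition of the population mean, $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, which implies $\sum_{i=1}^{N} x_i = N\mu$.

```latex
\sum_{i=1}^{N} (x_i - \mu)
  = \sum_{i=1}^{N} x_i - \sum_{i=1}^{N} \mu
  = N\mu - N\mu
  = 0
```

For the five values above, this is (-2) + (-1) + 0 + 1 + 2 = 0. The same argument holds for a sample with $\bar{x}$ in place of $\mu$.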

18 Formulas
Population Standard Deviation
  $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
  $\sigma = \sqrt{\frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5}} = 1.41$

19 Sample Standard Deviation (s)
Measure of how spread out the values are in a data set consisting of a sample of the population of interest (i.e., how much do the sample values deviate from the mean?)
Step 1: Calculate the deviation (distance) of each value ($x_i$) from the sample mean ($\bar{x}$). Here $\bar{x} = 3$.
  (1-3) = -2
  (2-3) = -1
  (3-3) = 0
  (4-3) = 1
  (5-3) = 2
Step 2: Square each deviation, then add them up.
  $(-2)^2 + (-1)^2 + (0)^2 + (1)^2 + (2)^2 = 10$
Step 3: Divide the sum of squared deviations by the number of observations minus 1 (n-1).
  10/(5-1) = 10/4 = 2.5 (sample variance = $s^2$)
Step 4: Take the square root of the "corrected average squared deviation" (sample variance).
  $\sqrt{2.5} = 1.58 = s$

20 Formulas
Sample Standard Deviation
  $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
  $s = \sqrt{\frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5-1}} = 1.58$
Write the computational formula/proof on the board.

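21 Simplified Formula
Using the rules for summations, a simpler formula for the standard deviation can be found.
  $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
  $= \sqrt{\frac{\sum_{i=1}^{n} (x_i^2 - 2 x_i \bar{x} + \bar{x}^2)}{n-1}}$
  $= \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} 2 \bar{x} x_i + \sum_{i=1}^{n} \bar{x}^2}{n-1}}$
  $= \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - 2 \bar{x} \sum_{i=1}^{n} x_i + n \bar{x}^2}{n-1}}$
  $= \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - 2 n \bar{x}^2 + n \bar{x}^2}{n-1}}$
  $= \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}{n-1}}$
For the scores 1, 2, 3, 4, 5 with $\bar{x} = 3$:
  $s = \sqrt{\frac{(1^2 + 2^2 + 3^2 + 4^2 + 5^2) - 5 \times 3^2}{5-1}} = 1.58$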
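A quick numerical check that the definitional and simplified formulas agree is sketched below. The function names are illustrative, and the data are the five scores 1, 2, 3, 4, 5 used in the worked examples.

```python
import math

def sd_definitional(values):
    """s from squared deviations: sqrt(sum((x - xbar)^2) / (n - 1))."""
    n = len(values)
    xbar = sum(values) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))

def sd_computational(values):
    """s from the simplified form: sqrt((sum(x^2) - n * xbar^2) / (n - 1))."""
    n = len(values)
    xbar = sum(values) / n
    return math.sqrt((sum(x * x for x in values) - n * xbar ** 2) / (n - 1))

scores = [1, 2, 3, 4, 5]
print(round(sd_definitional(scores), 2))   # 1.58
print(round(sd_computational(scores), 2))  # 1.58
```

The simplified form is convenient by hand. In floating-point arithmetic it can lose precision when the mean is large relative to the spread, so the definitional form is usually the safer choice in code.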

22 Another Example
Suppose we have a sample of 6 scores on a quiz: 5, 10, 3, 7, 2, 3
  What is the mode for these data? 3
  What is the median for these data? 4
  What is the mean for these data? 5
  What is the sample variance for these data? $\frac{46}{5} = 9.2$
  What is the sample standard deviation for these data? $\sqrt{9.2} \approx 3.033$
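These answers can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n-1 (sample) denominator. This is only a verification sketch; the variable name `scores` is an illustrative choice.

```python
from statistics import mean, median, mode, variance, stdev

scores = [5, 10, 3, 7, 2, 3]

print(mode(scores))             # 3
print(median(scores))           # 4.0
print(mean(scores))             # 5
print(variance(scores))         # 9.2  (sample variance, SS / (n - 1))
print(round(stdev(scores), 3))  # 3.033
```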

23 Population versus Sample SD
The sample SD seeks to estimate the population SD, in the same way that the sample mean seeks to estimate the population mean.
The distribution of the sample mean is such that it may incorrectly estimate the population mean, but it is no more likely to overestimate than it is to underestimate.
  It is unbiased.
A sample variance computed with n in the denominator, on the other hand, will consistently underestimate the population variance, so we divide the sum of squares by a smaller number (n-1 instead of n) to correct for this underestimation error.
  Without the correction, the sample variance and sample SD would be biased estimators.
  With the correction, the sample variance is unbiased. The sample SD is still very slightly biased, but much less so.
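The underestimation can be seen exactly, without simulation, by listing every possible sample from a tiny population. The sketch below rests on assumptions that are not in the slides: the population is the five scores 1, 2, 3, 4, 5, samples have size 3, and they are drawn with replacement so that every ordered sample is equally likely. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5]

def variance(values, denominator):
    """Sum of squared deviations from the mean, divided by `denominator`."""
    center = Fraction(sum(values), len(values))
    ss = sum((x - center) ** 2 for x in values)
    return ss / denominator

def average_sample_variance(sample_size, correction):
    """Average the variance over every equally likely sample.

    correction=0 divides by n; correction=1 divides by n - 1.
    """
    samples = list(product(population, repeat=sample_size))
    total = sum(variance(s, sample_size - correction) for s in samples)
    return total / len(samples)

pop_var = variance(population, len(population))
biased = average_sample_variance(3, correction=0)
unbiased = average_sample_variance(3, correction=1)

print(pop_var)   # 2
print(biased)    # 4/3, too small on average
print(unbiased)  # 2, matches the population variance
```

Dividing by n gives 4/3 on average, below the true variance of 2, while dividing by n-1 recovers 2 exactly.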

24 Sample Standard Deviation of Grouped Data
Estimating s from grouped data (histogram)

  Midpoint   Frequency
  2.5        1
  7.5        1
  12.5       4
  17.5       7
  22.5       6
  Total      19

Step 1: Calculate estimated sample mean
  $\bar{x} = \frac{2.5 \cdot 1 + 7.5 \cdot 1 + 12.5 \cdot 4 + 17.5 \cdot 7 + 22.5 \cdot 6}{19} = 16.7$
Step 2: Calculate estimated sample standard deviation (treat each midpoint as $x_i$ and weight each squared deviation from $\bar{x}$ by the corresponding frequency)
  $s = \sqrt{\frac{1(2.5-16.7)^2 + 1(7.5-16.7)^2 + 4(12.5-16.7)^2 + 7(17.5-16.7)^2 + 6(22.5-16.7)^2}{19-1}} = 5.6$
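The two steps translate directly into code. In the sketch below the function names are illustrative, and the frequency for the midpoint 7.5 is taken to be 1, as it appears in the formulas for the mean and standard deviation.

```python
import math

midpoints = [2.5, 7.5, 12.5, 17.5, 22.5]
frequencies = [1, 1, 4, 7, 6]

def grouped_mean(midpoints, frequencies):
    """Step 1: frequency-weighted mean of the class midpoints."""
    n = sum(frequencies)
    return sum(m * f for m, f in zip(midpoints, frequencies)) / n

def grouped_sd(midpoints, frequencies):
    """Step 2: weight each squared deviation by its frequency, divide by n - 1."""
    n = sum(frequencies)
    xbar = grouped_mean(midpoints, frequencies)
    ss = sum(f * (m - xbar) ** 2 for m, f in zip(midpoints, frequencies))
    return math.sqrt(ss / (n - 1))

print(round(grouped_mean(midpoints, frequencies), 1))  # 16.7
print(round(grouped_sd(midpoints, frequencies), 1))    # 5.6
```

These are estimates: every observation in a class is treated as if it sat exactly at the class midpoint.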

25 Transformations
What happens to the mean and variance if you add a constant to each value in the data set?
  New mean is the old mean + constant: $\bar{x}_{new} = \bar{x}_{old} + c$
  New variance is equal to the old variance: $s_{new}^2 = s_{old}^2$
Example: start with ($\bar{x} = 3$, $s^2 = 2.5$) and add the constant c = 1 to get ($\bar{x} = 4$, $s^2 = 2.5$)
  $\bar{x}_{new} = 3 + 1 = 4$
  $s_{new}^2 = s_{old}^2 = 2.5$
Show calculation on the board.

26 Transformations
Adding a constant to a data set simply shifts the distribution.
The center changes but the spread remains the same.
[Figure: distribution shifted by +1. Before: $\bar{x} = 3$, $s = 1.58$. After: $\bar{x} = 4$, $s = 1.58$]

27 Transformations
What happens to the mean and variance if you multiply each value in the data set by a constant?
  New mean is the old mean * constant: $\bar{x}_{new} = \bar{x}_{old} \cdot b$
  New variance is the old variance * constant$^2$: $s_{new}^2 = s_{old}^2 \cdot b^2$
Example: start with ($\bar{x} = 3$, $s^2 = 2.5$) and multiply each of the five values by the constant b = 2 to get ($\bar{x} = 6$, $s^2 = 10$)
  $\bar{x}_{new} = 3 \cdot 2 = 6$
  $s_{new}^2 = 2.5 \cdot 4 = 10$
Show calculation on the board. So what is the standard deviation? How is that modified?

28 Transformations
What happens to the mean and variance if you first multiply each value in the data set by a constant, then add another constant to each value?
  New mean is the old mean * first constant + second constant: $\bar{x}_{new} = \bar{x}_{old} \cdot b + c$
  New variance is the old variance * first constant$^2$: $s_{new}^2 = s_{old}^2 \cdot b^2$
Example: start with ($\bar{x} = 3$, $s^2 = 2.5$), multiply each of the five values by the first constant b = 2, then add the second constant c = 1 to get ($\bar{x} = 7$, $s^2 = 10$)
  $\bar{x}_{new} = 3 \cdot 2 + 1 = 7$
  $s_{new}^2 = 2.5 \cdot 4 = 10$
Show calculation on the board. What happens to the standard deviation?
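A sketch of the board calculation, using the sample definitions of $\bar{x}$ and $s^2$ and writing each transformed value as $b x_i + c$:

```latex
\bar{x}_{new} = \frac{1}{n} \sum_{i=1}^{n} (b x_i + c)
              = b \cdot \frac{1}{n} \sum_{i=1}^{n} x_i + c
              = b \bar{x}_{old} + c

s_{new}^2 = \frac{\sum_{i=1}^{n} \left( (b x_i + c) - (b \bar{x}_{old} + c) \right)^2}{n-1}
          = \frac{b^2 \sum_{i=1}^{n} (x_i - \bar{x}_{old})^2}{n-1}
          = b^2 s_{old}^2

s_{new} = \sqrt{b^2 s_{old}^2} = |b| \, s_{old}
```

The added constant c cancels inside every deviation, which is why it shifts the mean but leaves the variance and standard deviation unchanged.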

29 Another Example
Suppose we have our sample of 6 quiz scores (out of 10 points) again: 5, 10, 3, 7, 2, 3
Our previous mean/variance/standard deviation was: $\bar{x} = 5$, $s^2 = 9.2$, $s = \sqrt{9.2} = 3.033$
Suppose that we rescaled the original quiz scores to be worth 100 points and after rescaling decided to also give everyone 5 points of extra credit. What would the new mean, variance, and standard deviation be?
  $\bar{x}_{new} = 10 \bar{x}_{old} + 5 = 10(5) + 5 = 55$
  $s_{new}^2 = 10^2 s_{old}^2 = 100 s_{old}^2 = 100(9.2) = 920$
  $s_{new} = \sqrt{s_{new}^2} = \sqrt{100 s_{old}^2} = 10 s_{old} = 10(3.033) = 30.33$
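The rescaling rules can be checked by transforming the raw scores and recomputing everything from scratch. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
import math
from statistics import mean, variance, stdev

scores = [5, 10, 3, 7, 2, 3]

# Rescale from 10 points to 100 points (multiply by 10), then add 5 extra credit.
rescaled = [10 * x + 5 for x in scores]

print(mean(rescaled))             # 55
print(variance(rescaled))         # 920
print(round(stdev(rescaled), 2))  # 30.33

# The shortcut rules give the same numbers without touching the raw data.
print(10 * mean(scores) + 5)      # 55
```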

