
Describing Distributions with Numbers.  Measuring Center ◦ Mean ◦ Median  Measuring Spread ◦ Quartiles ◦ Five Number Summary ◦ Standard deviation 


1 Describing Distributions with Numbers

2  Measuring Center ◦ Mean ◦ Median  Measuring Spread ◦ Quartiles ◦ Five Number Summary ◦ Standard deviation  Boxplots

3 Measuring Center: The Mean The most common measure of center is the arithmetic average, or mean. To find the mean $\bar{x}$ (pronounced “x-bar”) of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, x_3, \ldots, x_n$, their mean is: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$ In more compact notation: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
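The definition of the mean can be sketched in a few lines of Python. The function name `mean` and the sample values are illustrative assumptions, not data from the slides.

```python
def mean(values):
    """Arithmetic mean: add the observations and divide by how many there are."""
    if not values:
        raise ValueError("mean requires at least one observation")
    return sum(values) / len(values)


# Hypothetical observations, chosen only for illustration.
sample = [2, 4, 4, 6, 9]
x_bar = mean(sample)
print(x_bar)  # 5.0
```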

4

5  Mean highway mileage for the 19 2-seaters: ◦ Average: 25.8 miles/gallon  Issue here: Honda Insight 68 miles/gallon! ◦ Exclude it, the mean mileage: only 23.4 mpg ◦ What does this say about the mean?

6  Problem: Mean can be easily influenced by outliers. It is NOT a resistant measure of center.  Median  Median is the midpoint of a distribution. ◦ Resistant or robust measure of center. ◦ i.e. not sensitive to extreme observations
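A small sketch of what "resistant" means in practice, using Python's standard `statistics` module. The mileage values are hypothetical (they are not the 2-seater data from the slides); the last value stands in for an extreme observation like the Honda Insight.

```python
from statistics import mean, median

# Hypothetical highway mileages (mpg); the last value is an extreme observation.
mileages = [22, 23, 24, 25, 26, 68]
without_extreme = mileages[:-1]

# Dropping one extreme value moves the mean a lot and the median very little.
mean_shift = mean(mileages) - mean(without_extreme)
median_shift = median(mileages) - median(without_extreme)

print(round(mean_shift, 2))  # 7.33
print(median_shift)          # 0.5
```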

7  In a symmetric distribution, mean = median  In a skewed distribution, the mean is farther out in the long tail than the median.  Example: house prices are usually right skewed ◦ The mean price of existing houses sold in 2014 in West Lafayette was 231,000. (Mean chases the right tail) ◦ The median price of these houses was only 169,900.

8 Measuring Center: The Median Because the mean cannot resist the influence of extreme observations, it is not a resistant measure of center. Another common measure of center is the median. The median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger. To find the median of a distribution: 1. Arrange all observations from smallest to largest. 2. If the number of observations n is odd, the median M is the center observation in the ordered list. 3. If the number of observations n is even, the median M is the average of the two center observations in the ordered list.
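The three steps for finding the median translate directly into code. This is a minimal sketch; the function name and the two example data sets (the integers 1 to 9 and 1 to 10) are assumptions for illustration.

```python
def median(values):
    """Median following the three steps: sort, then take the center (or average the two centers)."""
    ordered = sorted(values)              # step 1: smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one observation")
    mid = n // 2
    if n % 2 == 1:                        # step 2: n odd, take the center observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # step 3: n even, average the two centers


# Hypothetical data sets with an odd and an even number of observations.
print(median(range(1, 10)))   # 5
print(median(range(1, 11)))   # 5.5
```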

9  Quartiles: Divide the data into four parts (together with the median)  pth percentile – p percent of the observations fall at or below it. ◦ Median – 50th percentile ◦ First Quartile (Q1) – 25th percentile (median of the lower half of data) ◦ Third Quartile (Q3) – 75th percentile (median of the upper half of data)  The median and the two quartiles break the data into four 25% pieces.

10 Trick: The median is always at position (n+1)/2 in the ordered data. Example: Data: (n+1)/2 = 5, so the median is at the 5th position. Median = 5 Example: Data: (n+1)/2 = 5.5, so the median is at the 5.5th position. Median = the average of 5 and 6 = 5.5

11 Example:Data: Median = 5 = “ Q2 ” Q1 is the median of the lower half = Q3 is the median of the upper half = (ignore the median when counting) Example:Data: Median = 5.5 Q1 = Q3 =

12  5 numbers ◦ Minimum ◦ Q1 ◦ Median ◦ Q3 ◦ Maximum
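A sketch of the five-number summary using the rule from the slides: Q1 and Q3 are the medians of the lower and upper halves, and the median itself is ignored when n is odd. Other software may use slightly different quartile conventions, so results can differ from this hand method. Function names and data are illustrative assumptions.

```python
def median_of_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]           # observations below the median position
    upper = ordered[half + n % 2:]   # skips the median itself when n is odd
    return (ordered[0], median_of_sorted(lower), median_of_sorted(ordered),
            median_of_sorted(upper), ordered[-1])


# Hypothetical data: the integers 1..9 and 1..10.
print(five_number_summary(range(1, 10)))  # (1, 2.5, 5, 7.5, 9)
print(five_number_summary(range(1, 11)))  # (1, 3, 5.5, 8, 10)
```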

13 Example: Data: Example: Data:

14 Boxplots The median and quartiles divide the distribution roughly into quarters. This leads to a new way to display quantitative data, the boxplot. How to Make a Boxplot  Draw and label a number line that includes the range of the distribution.  Draw a central box from Q1 to Q3.  Note the median M inside the box.  Extend lines (whiskers) from the box out to the minimum and maximum values that are not outliers.

15

16 Find the five-number summary and make a boxplot. Numbers of home runs that Hank Aaron hit in each of his 23 years in the Major Leagues:

17  Interquartile Range (IQR) = Q3 – Q1  An observation is a suspected outlier IF it is: ◦ greater than Q3 + 1.5*IQR OR ◦ less than Q1 – 1.5*IQR

18  Are there any outliers?

19  Find the five-number summary: Min, Q1, Median, Q3, Max  Are there any outliers?  IQR = Q3 – Q1 = 200 – 54.5 = 145.5  Multiply by 1.5: 145.5*1.5 = 218.25 ◦ Add to Q3: 200 + 218.25 = 418.25  Anything higher is a high outlier  7 obs. ◦ Subtract from Q1: 54.5 – 218.25 = –163.75  Anything lower is a low outlier  no obs.
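The 1.5*IQR rule can be checked with a short sketch. The quartiles Q1 = 54.5 and Q3 = 200 come from the slide; the function names and the list of observations are hypothetical, since the underlying data set is not reproduced in this transcript.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the k*IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def suspected_outliers(values, q1, q3):
    """Observations strictly below the lower fence or strictly above the upper fence."""
    low, high = outlier_fences(q1, q3)
    return [x for x in values if x < low or x > high]


# Quartiles from the slide.
low, high = outlier_fences(54.5, 200)
print(low, high)  # -163.75 418.25

# Hypothetical observations, for illustration only.
observations = [10, 60, 150, 200, 450, 600]
print(suspected_outliers(observations, 54.5, 200))  # [450, 600]
```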

20  Seven high outliers circled…  Find and circle the eighth outlier.

21  Has outliers as dots or stars.  The line extends only to the first non-outlier.

22  Deviation: $x_i - \bar{x}$  Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$  Standard Deviation: $s = \sqrt{s^2}$

23 DATA points: Mean = 1600 Finding the standard deviation by hand: ◦ Find the deviations from the mean: Deviation1 = 1792 – 1600 = 192 Deviation2 = 1666 – 1600 = 66 … Deviation7 = 1439 – 1600 = -161 ◦ Square the deviations. ◦ Add them up and divide the sum by n – 1 = 6; this gives you $s^2$. ◦ Take the square root: Standard Deviation = s =
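The by-hand procedure (deviations, squares, divide by n – 1, square root) can be sketched as follows. The data here are hypothetical and are not the seven data points from the slide; the result is compared against `statistics.stdev`, which also uses the n – 1 divisor.

```python
import math
from statistics import stdev


def sample_std(values):
    """Sample standard deviation computed step by step, dividing by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    deviations = [x - x_bar for x in values]   # deviations from the mean
    squared = [d ** 2 for d in deviations]     # square the deviations
    variance = sum(squared) / (n - 1)          # s^2
    return math.sqrt(variance)                 # s


# Hypothetical observations: mean 3, squared deviations sum to 10, s^2 = 2.5.
data = [1, 2, 3, 4, 5]
s = sample_std(data)
print(round(s, 4))  # 1.5811
```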

24  Standard deviation is always non-negative  s = 0 when there is no spread  s has the same units as the original observations  s is not resistant to presence of outliers ◦ 5-number summary usually better describes a skewed distribution or a distribution with outliers.  s is used when we use the mean ◦ Mean and standard deviation are usually used for reasonably symmetric distributions without outliers.

25 Numbers of home runs that Hank Aaron hit in each of his 23 years in the Major Leagues:

26  x_new = a + b*x_old  Common conversions ◦ Distance: 100 km is equivalent to 62 miles  x_miles = 0.62*x_km ◦ Weight: 1 ounce is equivalent to 28.35 grams  x_g = 28.35*x_oz ◦ Temperature:  x_C = -160/9 + (5/9)*x_F

27 Example: weights of newly hatched pythons (table columns: Python, Weight in oz, Weight in g)  Linear transformations do not change the shape of a distribution  However, they change center and spread

28  Ounces ◦ Mean weight = (1.13+ … +1.16)/5 = 1.12 oz ◦ Standard deviation =  Grams ◦ Mean weight =(32+ … +33)/5 = 31.8 g  or 1.12 * = 31.8 ◦ Standard deviation = 2.38  or * = 2.38

29  Multiplying each observation by a positive number b multiplies both measures of center (mean and median) and measures of spread (IQR and standard deviation) by b.  Adding the same number a to each observation adds a to measures of center and to quartiles and other percentiles but does not change measures of spread (IQR and standard deviation)

30  Your Transformation: x new = a + b*x old  mean new = a + b*mean  median new = a + b*median  stdev new = |b|*stdev  IQR new = |b|*IQR |b|= absolute value of b (value without sign)
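The transformation rules can be verified numerically. The sketch below applies x_new = a + b*x_old with a negative b to show why the spread rules use |b|; the data and the values of a and b are arbitrary assumptions.

```python
import math
from statistics import mean, median, stdev


def transform(values, a, b):
    """Apply the linear transformation x_new = a + b * x_old to every observation."""
    return [a + b * x for x in values]


# Hypothetical observations and coefficients.
old = [1.0, 2.0, 4.0, 7.0, 11.0]
a, b = 3.0, -2.0
new = transform(old, a, b)

# Center follows a + b * (old center); spread is scaled by |b| and ignores a.
print(mean(new), a + b * mean(old))      # -7.0 -7.0
print(median(new), a + b * median(old))  # -5.0 -5.0
print(math.isclose(stdev(new), abs(b) * stdev(old)))  # True
```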

31  Winter temperature recorded in Fahrenheit ◦ mean = 20 ◦ stdev = 10 ◦ median = 22 ◦ IQR = 11  Convert into Celsius: ◦ mean = -160/9 + 5/9 * 20 = -6.67 C ◦ stdev = 5/9 * 10 = 5.56 ◦ median = ◦ IQR =
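The Fahrenheit-to-Celsius conversion of the four summaries can be worked through as below, using a = -160/9 and b = 5/9 from the slide. The function name `summaries_to_celsius` is a hypothetical helper for illustration.

```python
def summaries_to_celsius(mean_f, stdev_f, median_f, iqr_f):
    """Convert Fahrenheit summaries to Celsius with x_new = a + b * x_old."""
    a, b = -160 / 9, 5 / 9
    return {
        "mean": a + b * mean_f,       # center: shifted and scaled
        "stdev": abs(b) * stdev_f,    # spread: scaled only
        "median": a + b * median_f,   # center: shifted and scaled
        "IQR": abs(b) * iqr_f,        # spread: scaled only
    }


celsius = summaries_to_celsius(mean_f=20, stdev_f=10, median_f=22, iqr_f=11)
for name, value in celsius.items():
    print(name, round(value, 2))
# mean -6.67
# stdev 5.56
# median -5.56
# IQR 6.11
```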

32  “ proc univariate ” procedure generates all the descriptive summaries.  For the time being, draw boxplots by hand from the 5-number summary ◦ Optional: proc boxplot.  See plot.doc

33  Measures of location: Mean, Median, Quartiles  Measures of spread: stdev, IQR  Mean, stdev ◦ affected by extreme observations  Median, IQR ◦ robust to extreme observations  Five number summary and boxplot  Linear Transformations

