Chapter 3 Describing Distributions with Numbers

Presentation on theme: "Chapter 3 Describing Distributions with Numbers"— Presentation transcript:

1 Chapter 3 Describing Distributions with Numbers
Overview: Measures of Center; Measures of Variation; Measures of Relative Standing; Exploratory Data Analysis (EDA)

2 Thinking Challenge
... employees cite low pay: most workers earn only $20,000.
... President claims average pay is $70,000!
11 total employees; total salaries are $770,000. The mode is $20,000 (union argument). The median is $30,000. The mean is $70,000 (president's argument). Different measures are used!
[Figure: salary levels $400,000, $70,000, $50,000, $30,000, and $20,000.]

3 Numerical Data Properties
Central Tendency (Center), Location (Position): concerned with where values are concentrated.
Variation (Dispersion, Spread): concerned with the extent to which values vary.
Shape: concerned with the extent to which values are symmetrically distributed.

4 Numerical Data Properties and Measures
Measures of Center: Mean, Median, Mode
Measures of Variation: Range, Variance, Standard Deviation, Interquartile Range
Shape: Symmetric, Skew

5 Mean
Measure of the center or central tendency.
Most common measure.
Affected by extreme values (‘outliers’).

6 Example Raw Data:

7 Median
Measure of the center or central tendency.
Middle value in an ordered sequence: if n is odd, the middle value of the sequence; if n is even, the average of the 2 middle values.
Position of the median in the ordered sequence: (n + 1)/2.
Not affected by extreme values.
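The odd/even rule can be sketched in a few lines of Python. This is a minimal illustration and not part of the original slides: the function name `median_of` is made up here, and the six values are the raw data used in this chapter's mode and quartile examples.

```python
def median_of(values):
    """Return the median using the odd/even rule described on the slide."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value of the ordered sequence.
        return ordered[middle]
    # Even n: the average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


raw = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
print(median_of(raw))      # even n: average of 7.7 and 8.9
print(median_of(raw[:5]))  # odd n: middle of the first five raw values
```

Sorting inside the function means the caller can pass the raw data in any order.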

8 Median of a Data Set

9 Median Odd-Sized Sample
Raw Data: Ordered: Position: Median = 22.6

10 Median Even-Sized Sample
Raw Data: Ordered: Position: Median =

11 Mode
Measure of the center or central tendency.
Value that occurs most often.
Not affected by extreme values.
May be no mode or several modes.
May be used for numerical and categorical data.

12 Mode

13 Mode Example
No Mode Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
One Mode Raw Data:
More Than 1 Mode Raw Data:
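The three cases (no mode, one mode, several modes) can be sketched as follows. The function name `modes_of` is hypothetical, and the sketch assumes the convention that a data set in which every value occurs exactly once has no mode. Only the first data set comes from the slide; the others are invented for illustration.

```python
from collections import Counter


def modes_of(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs once, so (by the assumed convention) there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes_of([10.3, 4.9, 8.9, 11.7, 6.3, 7.7]))  # slide data: no mode
print(modes_of([1, 2, 2, 3]))                      # hypothetical: one mode
print(modes_of([1, 1, 2, 2, 3]))                   # hypothetical: two modes
print(modes_of(["red", "blue", "red"]))            # categorical data also works
```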

14 Mean versus Median

15 Selecting an Appropriate Measure of Center
A student takes four exams in a biology class. His grades are 88, 40, 95, and 100. If asked for his grade in the class, which measure of center is the student likely to report?
The National Association of REALTORS publishes data on resale prices of U.S. homes. Which measure of center is most appropriate for such resale prices?
In the 2003 Boston Marathon, there were two categories of official finishers: male and female, of which there were 10,737 and 6,309, respectively. Which measure of center should be used here?
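For the first question, the two numerical candidates can be computed directly from the grades given. This is a small sketch using Python's standard `statistics` module; it only shows the numbers and leaves the choice of measure to the reader.

```python
from statistics import mean, median

grades = [88, 40, 95, 100]  # exam grades from the slide

print(mean(grades))    # 80.75, pulled down by the single low grade of 40
print(median(grades))  # 91.5, the average of the two middle grades 88 and 95
```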

16 Population Mean - Sample Mean
Possible interpretations for the mean of a data set

17 Notation for Sample Mean

18 Notation
Σ denotes the sum of a set of values.
x is the variable usually used to represent the individual data values.
n represents the number of values in a sample.
N represents the number of values in a population.
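With this notation, the sample mean and the population mean are conventionally written as below. The formula images for the mean slides are not part of this transcript, so this is the standard textbook form rather than a copy of the slides.

```latex
\bar{x} = \frac{\sum x}{n} \quad \text{(sample mean)}, \qquad
\mu = \frac{\sum x}{N} \quad \text{(population mean)}
```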

19 Notation for Population Mean
Notation used for a sample and for the population.
Shape: concerned with the extent to which values are symmetrically distributed.
Kurtosis: the extent to which a distribution is peaked (flatter or taller). For example, a distribution could be more peaked than a normal distribution (and still be ‘bell-shaped’). If the kurtosis values are negative, the distribution is less peaked than a normal distribution.
Skew: the extent to which a distribution is symmetric or has a tail. Values are 0 for a normal distribution. If the values are negative, the distribution is negatively skewed (left-skewed).

20 Best Measure of Center

21 Measuring Spread or Variation

22 Range
Measure of spread, variation or dispersion.
Difference between largest and smallest observations.
Ignores how data are distributed.
[Figure: two data sets plotted on axes marked 7, 8, 9, 10.]
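A short sketch of why the range ignores how the data are distributed. Both data sets below are hypothetical and were chosen to share the same extremes; `data_range` is an illustrative name.

```python
def data_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


# Hypothetical data sets: same extremes, very different distributions.
evenly_spread = [7, 8, 8, 9, 9, 10]
piled_up_low = [7, 7, 7, 7, 7, 10]

print(data_range(evenly_spread))  # 3
print(data_range(piled_up_low))   # 3
```

The two results are identical even though one data set is concentrated at a single value.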

23 Quartiles and Boxplots

24 Quartiles
Measure of spread, variation or dispersion.
Split ordered data set into 4 quarters of 25% each: Min, Q1, Q2, Q3, Max.

25 How To Calculate the Quartiles

26 Quartile (Q2) Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
Position: 1 2 3 4 5 6
Q2 = 8.3

27 Quartile (Q1) Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
Position: 1 2 3 4 5 6
Q1 = 6.3

28 Quartile (Q3) Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
Position: 1 2 3 4 5 6
Q3 = 10.3
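One way to reproduce the three results above is the median-of-halves rule: Q2 is the median, Q1 is the median of the lower half, and Q3 is the median of the upper half. The slide that states the author's own rule is empty in this transcript, so the rule used here is an assumption. It matches the values shown, but other quartile conventions can give different answers on other data. The function name `quartiles` is illustrative.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) by the median-of-halves rule (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]      # values below the median position
    upper = ordered[n - half:]  # values above the median position
    # For odd n the middle value itself is left out of both halves.
    return median(lower), median(ordered), median(upper)


q1, q2, q3 = quartiles([10.3, 4.9, 8.9, 11.7, 6.3, 7.7])
print(q1, q2, q3)
```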

29 Notice that
Q1 (first quartile) separates the bottom 25% of sorted values from the top 75%.
Q2 (second quartile) is the same as the median; it separates the bottom 50% of sorted values from the top 50%.
Q3 (third quartile) separates the bottom 75% of sorted values from the top 25%.

30 Percentiles
Just as there are three quartiles separating data into four parts, there are 99 percentiles, denoted P1, P2, ..., P99, which partition the data into 100 groups. The kth percentile, Pk, is the value for which k% of all observations are below that value. For instance, Q1 = P25, Q2 = P50, and Q3 = P75.

31 Finding the Percentile of a Given Score
The following formula gives the percentile that a given score represents. Notice that the data set must be ordered. Round the result to the nearest integer.
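A common textbook form of this calculation, consistent with the definition of Pk, is: percentile of x = (number of values less than x) / (total number of values) × 100, rounded to the nearest integer. The sketch below assumes that form, because the slide's own formula is not in the transcript. The function name and the sample ages are hypothetical.

```python
def percentile_of_score(values, score):
    """Percent of observations strictly below `score`, rounded to an integer."""
    if not values:
        raise ValueError("data set is empty")
    below = sum(1 for v in values if v < score)
    return round(100 * below / len(values))


ages = [21, 24, 26, 26, 29, 30, 33, 35, 41, 60]  # hypothetical, not the slide's data
print(percentile_of_score(ages, 30))  # 5 of the 10 values are below 30
```

Counting values below the score does not require sorted input in code. Ordering matters mainly when the count is done by hand.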

32 Example: Ages of Best Actresses
[Tables: Original Data and Sorted Data.]
Interpretation: The age of 30 years is the 34th percentile, that is, P34 = 30.

33 Converting from the kth Percentile to the Corresponding Data Value
then ask the question,

34

35

36 Example: Ages of Best Actresses
Refer to the sorted ages of Best Actresses given below to find the value of the 20th percentile, P20. P20 is the value for which 20% of all observations are below that value.
[Tables: Original Data and Sorted Data.]

37 Example: Ages of Best Actresses
Refer to the sorted ages of Best Actresses given below to find the value of the 20th percentile, P20.
[Tables: Original Data and Sorted Data.]

38 Example: Ages of Best Actresses
Refer to the sorted ages of Best Actresses given below to find the value of the 75th percentile, P75.
[Tables: Original Data and Sorted Data.]
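The conversion procedure and the actress ages are not included in this transcript. As a hedged illustration, the sketch below uses one widely taught locator procedure: compute L = (k/100)·n. If L is a whole number, average the Lth and (L+1)th sorted values. Otherwise round L up and take that value. Whether this is exactly the author's procedure is an assumption, and the data are hypothetical.

```python
import math


def value_at_percentile(values, k):
    """Data value at the k-th percentile using a locator L = (k / 100) * n."""
    if not values or not 0 < k < 100:
        raise ValueError("need data and 0 < k < 100")
    ordered = sorted(values)
    n = len(ordered)
    locator = k * n / 100
    if locator == int(locator):
        # Whole-number locator: average the L-th and (L + 1)-th values (1-based).
        i = int(locator)
        return (ordered[i - 1] + ordered[i]) / 2
    # Otherwise round the locator up and take that value (1-based position).
    return ordered[math.ceil(locator) - 1]


ages = [21, 24, 26, 26, 29, 30, 33, 35, 41, 60]  # hypothetical data
print(value_at_percentile(ages, 20))  # L = 2, so average the 2nd and 3rd values
print(value_at_percentile(ages, 75))  # L = 7.5, so take the 8th value
```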

39 The Interquartile Range (IQR)
Measure of spread, variation or dispersion.
Also called midspread.
Difference between third and first quartiles: IQR = Q3 − Q1.
Spread in middle 50%.
Not affected by extreme values.

40 The Interquartile Range IQR
Preferred measure of variation when the median is used as the measure of center. Like the median, the interquartile range is a resistant measure.

41 Outliers

42 The Five-number Summary

43 Example: Supermarket Spending

44 The Five-number Summary is:
Min = $3, Q1 = $19, M = Q2 = $28, Q3 = $45, Max = $93
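The supermarket spending data are not in this transcript, so the sketch below computes a five-number summary for the six raw values used in the quartile examples instead. It assumes the median-of-halves rule for Q1 and Q3, which agrees with the quartile values shown in those examples. `five_number_summary` is an illustrative name.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    q1 = median(ordered[:half])      # median of the lower half
    q3 = median(ordered[n - half:])  # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]


data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]  # raw data from the quartile examples
low, q1, m, q3, high = five_number_summary(data)
print(low, q1, m, q3, high)
print("IQR =", round(q3 - q1, 2))
```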

45 Boxplot
[Figure: boxplot marking Min, Q1, Median, Q3, and Max.]

46 Example
20 customer satisfaction ratings:
Q1 = (7+8)/2 = 7.5
M = (8+8)/2 = 8
Q3 = (9+9)/2 = 9
IQR = Q3 − Q1 = 9 − 7.5 = 1.5

47 Boxplot

48 Boxplot

49 Distribution shapes and boxplots

50 Modified Boxplots Some statistical packages provide modified boxplots which represent outliers as special points. A modified boxplot is constructed with these specifications: A special symbol (such as an asterisk) is used to identify outliers. The solid horizontal line extends only as far as the minimum data value that is not an outlier and the maximum data value that is not an outlier.
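The transcript does not state which rule flags a value as an outlier. The sketch below assumes the common convention that a value is an outlier if it lies more than 1.5 × IQR below Q1 or above Q3, and it uses median-of-halves quartiles. Both choices are assumptions, and the ratings are hypothetical.

```python
from statistics import median


def modified_boxplot_parts(values):
    """Quartiles, outliers, and whisker ends under the assumed 1.5 * IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1, q3 = median(ordered[:half]), median(ordered[n - half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    # Whiskers stop at the most extreme values that are not outliers.
    return {"q1": q1, "q3": q3,
            "whiskers": (inside[0], inside[-1]),
            "outliers": outliers}


ratings = [1, 5, 6, 6, 7, 7, 8, 8, 9, 10]  # hypothetical data
parts = modified_boxplot_parts(ratings)
print(parts)
```

Here the value 1 falls below the lower fence, so it would be drawn as a special point and the lower whisker would stop at 5.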

51 Example

52 Variance and Standard Deviation
Measures of spread, variation or dispersion.
Most common measures.
Consider how data are distributed.
Show variation about mean.

53 Sample Variance and Sample Standard Deviation
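The formulas for this slide are not in the transcript. The standard definitions, which agree with the later remarks about dividing by n − 1, are:

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{s^2}
```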

54 Properties of the Standard Deviation
The idea behind the variance and the standard deviation as measures of spread is as follows: The deviations xi − x̄ display the spread of the values xi about their mean x̄. Some of these deviations will be positive and some negative because some of the observations fall on each side of the mean. In fact, the sum of the deviations of the observations from their mean will always be zero.

55 Properties of the Standard Deviation
Squaring the deviations makes them all positive, so that observations far from the mean in either direction have large positive squared deviations. The variance is the average squared deviation. Therefore both s² and s will be large if the observations are widely spread about their mean, and small if the observations are all close to the mean.

56 Properties of the Standard Deviation
s measures spread about the mean and should be used only when the mean is chosen as the measure of center. s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger. s is not resistant. A few outliers can make s very large.
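These properties can be checked numerically. The sketch below computes the deviations and s by hand, dividing by n − 1, on small hypothetical data sets. `deviations_and_s` is an illustrative name.

```python
from math import sqrt


def deviations_and_s(values):
    """Return (deviations from the mean, sample standard deviation s)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    deviations = [x - x_bar for x in values]
    variance = sum(d * d for d in deviations) / (n - 1)  # divide by n - 1
    return deviations, sqrt(variance)


devs, s = deviations_and_s([2, 4, 4, 6, 9])  # hypothetical data with mean 5
print(sum(devs))  # the deviations sum to zero
print(s)          # square root of 7, roughly 2.65

_, s_same = deviations_and_s([3, 3, 3])            # all values equal: no spread
_, s_outlier = deviations_and_s([2, 4, 4, 6, 90])  # same data with one outlier
print(s_same, s_outlier)
```

Replacing 9 with 90 changes only one observation, yet s grows by more than a factor of ten, which illustrates that s is not resistant.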

57 Example - Metabolic Rate
A person’s metabolic rate is the rate at which the body consumes energy. This rate is important in studies of weight gain, dieting, and exercise. Here are the metabolic rates of 7 men who took part in a study of dieting. The units are calories per 24 hours. These are the same calories used to describe the energy content of foods. Notice that

58 Example - Metabolic Rate
The table shows the observations xi , their deviations from the mean and the square of these deviations.

59 Example - Metabolic Rate
The figure plots these data as dots on the calorie scale, with their mean marked by an asterisk (∗). The arrows mark two of the deviations from the mean.
[Figure: metabolic rates for seven men, with the mean (∗) and the deviations of two observations from the mean.]

60 Choosing measures of center and spread
How do we choose between the five-number summary and x̄ and s to describe the center and spread of a distribution? Because the two sides of a strongly skewed distribution have different spreads, no single number such as s describes the spread well. The five-number summary, with its two quartiles and two extremes, does a better job.

61 Choosing a summary
The five-number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers. Use x̄ and s only for reasonably symmetric distributions that are free of outliers.

62 Remarks The idea of the variance is straightforward: it is the average of the squares of the deviations of the observations from their mean. The details we have just presented, however, raise some questions. Why do we square the deviations? Why do we emphasize the standard deviation rather than the variance? Why do we average by dividing by n −1 rather than n in calculating the variance?

63 Remarks Why do we square the deviations? Why not just average the distances of the observations from their mean? There are two reasons, neither of them obvious. First, the sum of the squared deviations of any set of observations from their mean is the smallest that the sum of squared deviations from any number can possibly be. This is not true of the unsquared distances. So squared deviations point to the mean as center in a way that distances do not.
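The first reason can be illustrated numerically. The sketch below scans candidate centers on a grid and finds which one minimizes the sum of squared deviations and which one minimizes the sum of unsquared distances. The data set is hypothetical, with mean 4 and median 3. For unsquared distances the minimizer is the median, not the mean.

```python
def sum_squared(values, center):
    """Sum of squared deviations of the values from `center`."""
    return sum((x - center) ** 2 for x in values)


def sum_absolute(values, center):
    """Sum of unsquared distances of the values from `center`."""
    return sum(abs(x - center) for x in values)


data = [1, 2, 3, 4, 10]                      # hypothetical: mean 4, median 3
candidates = [c / 10 for c in range(0, 101)]  # 0.0, 0.1, ..., 10.0

best_squared = min(candidates, key=lambda c: sum_squared(data, c))
best_absolute = min(candidates, key=lambda c: sum_absolute(data, c))

print(best_squared)   # the mean minimizes the squared deviations
print(best_absolute)  # the median minimizes the unsquared distances
```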

64 Remarks Second, the standard deviation turns out to be the natural measure of spread for a particularly important class of symmetric unimodal distributions, the normal distributions. We will meet the normal distributions in a later section. We commented earlier that the usefulness of many statistical procedures is tied to distributions of particular shapes. This is distinctly true of the standard deviation.

65 Remarks
Why do we emphasize the standard deviation rather than the variance? One reason is that s, not s², is the natural measure of spread for normal distributions. There is also a more general reason to prefer s to s². Because the variance involves squaring the deviations, it does not have the same unit of measurement as the original observations. The variance of the metabolic rates, for example, is measured in squared calories. Taking the square root remedies this. The standard deviation s measures spread about the mean in the original scale.

66 Remarks Why do we average by dividing by n −1 rather than n in calculating the variance? Because the sum of the deviations is always zero, the last deviation can be found once we know the other n − 1. So we are not averaging n unrelated numbers. Only n−1 of the squared deviations can vary freely, and we average by dividing the total by n −1. The number n − 1 is called the degrees of freedom of the variance or standard deviation. Many calculators offer a choice between dividing by n and dividing by n − 1, so be sure to use n − 1.
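Software offers the same choice as calculators. As a sketch, Python's standard `statistics` module provides `variance` (divides by n − 1) and `pvariance` (divides by n). The data are hypothetical. The last lines illustrate the degrees-of-freedom argument: the final deviation is fixed by the other n − 1.

```python
from statistics import mean, pvariance, variance

data = [2.0, 4.0, 4.0, 6.0, 9.0]  # hypothetical sample with mean 5

print(variance(data))   # sum of squared deviations divided by n - 1
print(pvariance(data))  # same sum divided by n

# Degrees of freedom: because the deviations sum to zero, the last one
# can be recovered from the first n - 1.
x_bar = mean(data)
deviations = [x - x_bar for x in data]
recovered_last = -sum(deviations[:-1])
print(recovered_last, deviations[-1])
```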

67 Population Standard Deviation

68 Standardized Variables
We can associate with any variable x a new variable z, called the standardized version of x or the standardized variable, defined as follows: z = (x − µ)/σ, where µ and σ are the mean and standard deviation of x.

69 Example Consider a simple variable x, namely, one with possible observations shown in the first row of following table.

70 Example - Continued
a. Determine the standardized version of x.
b. Find the observed value of z corresponding to an observed value of x of 5.
c. Obtain all possible observations of z.
d. Find the mean and standard deviation of z.
e. Obtain dotplots of the distributions of both x and z. Interpret the results.

71 Example - Continued
a. Determine the standardized version of x.
Using the definitions of µ and σ we find that the mean and standard deviation of the variable x are µ = 3 and σ = 2. Therefore, the standardized version of x is z = (x − 3)/2.

72 Example - Continued
b. Find the observed value of z corresponding to an observed value of x of 5.
The observed value of z corresponding to an observed value of x of 5 is z = (5 − 3)/2 = 1.

73 Example - Continued
c. Obtain all possible observations of z.
Applying the formula z = (x − 3)/2 to each observation of the variable x shown in the first row of the table, we obtain each observation of the standardized variable z shown in the second row.

74 Example - Continued
d. Find the mean and standard deviation of z.
From the second row of the table we get a mean of 0 and a standard deviation of 1, as for every standardized variable.

75 Example - Continued e. Obtain dotplots of the distributions of both x and z. Interpret the results. The dotplots of the distributions of x and z are
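The table of observations is not included in this transcript. As a sketch, the code below uses hypothetical observations chosen so that µ = 3 and σ = 2, as stated in part (a), and then standardizes them. The population standard deviation (`pstdev`) is used because x is treated here as the whole population.

```python
from statistics import mean, pstdev

x = [-1, 3, 3, 3, 5, 5]  # hypothetical observations with mu = 3 and sigma = 2

mu = mean(x)
sigma = pstdev(x)
z = [(value - mu) / sigma for value in x]  # z = (x - mu) / sigma

print(mu, sigma)
print(z)
print(mean(z), pstdev(z))  # a standardized variable has mean 0 and sd 1
```

Standardizing shifts and rescales the values but keeps their relative positions, so dotplots of x and z would have the same shape.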

76 Standard Scores or z-Scores
An important concept associated with standardized variables is that of the z-score, or standard score, which we now define.

77 Standard Scores or z-Scores
The standard score, or z-score, represents the number of standard deviations that a data value, x, falls from the mean, µ. That is, z = (x − µ)/σ.

78 Empirical Rule (68-95-99.7%)
For data with a (symmetric) bell-shaped distribution, the standard deviation has the following characteristics.
About 68% of the data lie within one standard deviation of the mean.
About 95% of the data lie within two standard deviations of the mean.
About 99.7% of the data lie within three standard deviations of the mean.

79 Empirical Rule (68-95-99.7%)
[Figure: bell curve over z from −4 to 4. 68% within 1 standard deviation, 95% within 2 standard deviations, 99.7% within 3 standard deviations. Band areas on each side of the mean, from the center outward: 34%, 13.5%, 2.35%.]

80 Empirical Rule (68-95-99.7%)

81 Interpreting z-Scores
Ordinary values: z-score between -2 and 2
Unusual values: z-score < -2 or z-score > 2

82 Using the Empirical Rule
The mean value of homes on a street is $125 thousand with a standard deviation of $5 thousand. The data set has a bell-shaped distribution. Estimate the percent of homes between $120 and $130 thousand.
[Figure: bell curve with axis from 105 to 145 in steps of 5; the central 68% lies between µ − σ = 120 and µ + σ = 130.]
About 68% of the houses have a value between $120 and $130 thousand.
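The percentages in the empirical rule are rounded areas under a normal curve. The sketch below checks them with `statistics.NormalDist` from the Python standard library and then applies the same idea to the home values. Treating the home values as exactly normal is a modeling assumption. The slide only says the data are bell-shaped.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

# Share of a normal distribution within k standard deviations of the mean.
coverage = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.4f}")

# Home values: mean 125, standard deviation 5 (in thousands of dollars).
homes = NormalDist(mu=125, sigma=5)
share_120_to_130 = homes.cdf(130) - homes.cdf(120)
print(f"between 120 and 130: {share_120_to_130:.4f}")
```

The interval from 120 to 130 is exactly µ − σ to µ + σ, so its share equals the one-standard-deviation coverage.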

83 Standard Scores – Example 1
The weight data for the 2003 U.S. Women’s World Cup soccer team is given in the fourth column of the following table.

84 Standard Scores – Example 1
So, in this case, the standardized variable is
a. Find and interpret the z-score of Tiffany Roberts’s weight of 51 kg.
b. Find and interpret the z-score of Cindy Parlow’s weight of 70 kg.
c. Construct a graph showing the results obtained in parts (a) and (b).

85 Standard Scores – Example 1
a. The z-score for Tiffany’s weight of 51 kg is −2.36, which means that Tiffany’s weight is 2.36 standard deviations below the mean.
b. The z-score for Cindy’s weight of 70 kg is 1.52, which means that Cindy’s weight is 1.52 standard deviations above the mean.

86 Standard Scores – Example 1
c. In the figure, we marked Tiffany’s weight of 51 kg with a color dot and Cindy’s weight of 70 kg with a black dot. Additionally, we located the mean, µ = kg, and measured intervals equal in length to the standard deviation, σ = 4.9 kg.

87 Dotplot for the weight data for the Women’s World Cup soccer team

88 Standard Scores – Example 2
John received a 75 on a test whose class mean was 73.2 with a standard deviation of 4.5. Samantha received a 68.6 on a test whose class mean was 65 with a standard deviation of 3.9. Which student had the better test score?
John’s z-score: z = (75 − 73.2)/4.5 = 0.4
Samantha’s z-score: z = (68.6 − 65)/3.9 ≈ 0.92
John’s score was 0.4 standard deviations higher than the mean, while Samantha’s score was 0.92 standard deviations higher than the mean. Relative to their classes, Samantha’s test score was better than John’s.
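The comparison can be reproduced with a small helper. The value 4.5 for the standard deviation of John's class is the one that reproduces the stated z-score of 0.4, so treat it as an assumption. `z_score` is an illustrative name.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# 4.5 is assumed: it is the value consistent with the stated z-score of 0.4.
john = z_score(75, 73.2, 4.5)
samantha = z_score(68.6, 65, 3.9)

print(round(john, 2), round(samantha, 2))
print("Samantha" if samantha > john else "John", "scored better relative to the class")
```

Comparing z-scores rather than raw scores is what makes results from two different tests comparable.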

89 Shape

90 Skewness

