Presentation is loading. Please wait.

Presentation is loading. Please wait.

Introduction We learned from last chapter that histogram can be used to summarize large amounts of data. We learned from last chapter that histogram can.

Similar presentations


Presentation on theme: "Introduction We learned from last chapter that histogram can be used to summarize large amounts of data. We learned from last chapter that histogram can."— Presentation transcript:

1 Introduction We learned from last chapter that histogram can be used to summarize large amounts of data. We learned from last chapter that histogram can be used to summarize large amounts of data. In general, a histogram can be summarized by the center and the spread. In general, a histogram can be summarized by the center and the spread. In statistics, the average is often used to find the center. The standard deviation measures spread around the average. We are going to learn about these two new things in this chapter. In statistics, the average is often used to find the center. The standard deviation measures spread around the average. We are going to learn about these two new things in this chapter. Note: sometimes we use median to find the center, and the interquartile range is another measure of spread. Note: sometimes we use median to find the center, and the interquartile range is another measure of spread.

2 Two sketched histograms A histogram center at 50 with little spread out. A histogram center at 50 with more spread out.

3 Things do not always work out so well There are two peaks in this histogram. Center and spread out can not summarize the histogram in this case.

4 Average The average of a list of numbers equals their sum, divided by how many there are.

5 Example Look at the list 9, 1, 2, 2, 0. Look at the list 9, 1, 2, 2, 0. It has 5 entries. It has 5 entries. The average is (9+1+2+2+0)/5 = 14/5 = 2.8 The average is (9+1+2+2+0)/5 = 14/5 = 2.8

6 The HANES The HANES is the Health and Nutrition Examination Survey. The HANES is the Health and Nutrition Examination Survey. The objective is to get baseline data about: The objective is to get baseline data about: Demographic variables: age, education, and income. Demographic variables: age, education, and income. Physiological variables: height, weight, blood pressure, and serum cholesterol levels. Physiological variables: height, weight, blood pressure, and serum cholesterol levels. Dietary habits. Dietary habits. Prevalence of diseases. Prevalence of diseases.

7 HANES2 vs HANES5 The HANES2 sample was taken during the period 1976-1980. The HANES2 sample was taken during the period 1976-1980. The average height of the men was 5 feet 9 inches, and their average weight was 171 pounds. The average height of the men was 5 feet 9 inches, and their average weight was 171 pounds. The average height of the women was 5 feet 3.5 inches, and their average weight was 146 pounds. The average height of the women was 5 feet 3.5 inches, and their average weight was 146 pounds. The HANES5 was done in 2003-2004. The HANES5 was done in 2003-2004. Average heights went up by a fraction of an inch, while weights went up by nearly 20 pounds----both for men and women. Average heights went up by a fraction of an inch, while weights went up by nearly 20 pounds----both for men and women. This could become a serious public-health problem, because excess weight is associated with many diseases, including heart disease, cancer, and diabetes. This could become a serious public-health problem, because excess weight is associated with many diseases, including heart disease, cancer, and diabetes.

8 HANES2 vs HANES5

9 Advantage and shortcoming The average is a powerful way of summarizing data. (e.g. Many histograms of other data are compressed into the four curves in HANES.) The average is a powerful way of summarizing data. (e.g. Many histograms of other data are compressed into the four curves in HANES.) But this compression is achieved only by smoothing away individual differences. But this compression is achieved only by smoothing away individual differences. For instance, in HANES5, the average height of the men age 18-24 was 5 feet 10 inches. But 15% of them were taller than 6 feet 1 inch; another 15% were shorter than 5 feet 6 inches. This diversity is hidden by the average. For instance, in HANES5, the average height of the men age 18-24 was 5 feet 10 inches. But 15% of them were taller than 6 feet 1 inch; another 15% were shorter than 5 feet 6 inches. This diversity is hidden by the average.

10 Cross-sectional vs longitudinal In a cross-sectional study, different subjects are compared to each other at one point in time. (e.g. The HANES is a cross-sectional study.) In a cross-sectional study, different subjects are compared to each other at one point in time. (e.g. The HANES is a cross-sectional study.) In a longitudinal study, subjects are followed over time, and compared with themselves at different points in time. In a longitudinal study, subjects are followed over time, and compared with themselves at different points in time. In the HANES2, the average height of men appears to decrease after age 20, dropping about two inches in 50 years. Similarly for women. Could we conclude that an average person got shorter at this rate? In the HANES2, the average height of men appears to decrease after age 20, dropping about two inches in 50 years. Similarly for women. Could we conclude that an average person got shorter at this rate? Not really. Because the HANES is a cross-sectional study: the people in the group of age 18-24 are completely different from those in the group of age 65-74. Not really. Because the HANES is a cross-sectional study: the people in the group of age 18-24 are completely different from those in the group of age 65-74. The reason is due to so-called secular trend: over time, Americans have been getting taller. The first group was born around 50 years later than the second group. The reason is due to so-called secular trend: over time, Americans have been getting taller. The first group was born around 50 years later than the second group.

11 Remark If a study draws conclusions about the effects of age, find out whether the data are cross-sectional or longitudinal. If a study draws conclusions about the effects of age, find out whether the data are cross-sectional or longitudinal.

12 The average in a histogram We study the relationship between the average (or the median) and the histogram.

13 Example The average weight was 164 pounds. It was not 50% above average and 50% below in weight. In fact, it was only 41% above average, and 59% below. (i.e. Only 41% to the right of the line, and 59% to the left.) The line with half and half (50% : 50%), is the median. The median of a histogram is the value with half the area to the left and half to the right.

14 The location of the average Let us look at more examples to see why the average does not always lie at the middle of the histogram. If we first look at the list 1, 2, 2, 3. The histogram for this list is symmetric about the value 2. If the histogram is symmetric around a value, then that value equals the average. So the average here is 2. Furthermore, it is half and half about the average line. (i.e. Half the area lies to the left of the average, and half to the right.) So, if the histogram is symmetric, then the median equals to the average. Here the median is 2.

15 The location of the average What if we increase the value 3 on the list 1, 2, 2, 3? Say, increase to 5 or 7. This will destroy the symmetry of the histogram. The average for each histogram is marked with an arrow. We can see this arrow shifts to the right following the rectangle. The histogram will “balance” at the average. A small area far away from the average can “balance” a large area close to the average. This is because areas are weighted by their distance from the “balance” point. A histogram balances when “supported” at the average.

16 Median vs Average However, recalling the definition of the median, we find that, for all three histograms in the previous slide, the median is 2. (With 50% to the left and 50% to the right.) However, recalling the definition of the median, we find that, for all three histograms in the previous slide, the median is 2. (With 50% to the left and 50% to the right.) For the second and third histograms, the area to the right of the median is far away by comparing with the area to the left. For the second and third histograms, the area to the right of the median is far away by comparing with the area to the left. More generally, the average is to the right of the median whenever the histogram has a long right-hand tail, and things will be opposite if the histogram has a long left- hand tail. More generally, the average is to the right of the median whenever the histogram has a long right-hand tail, and things will be opposite if the histogram has a long left- hand tail.

17 Median vs Average In the weight histogram, the average was 164 pounds, and the median was 155 pounds. So the long right-hand tail is what made the average bigger than the median. In the weight histogram, the average was 164 pounds, and the median was 155 pounds. So the long right-hand tail is what made the average bigger than the median. When dealing with long-tailed distributions, statisticians might use the median rather than the average, if the average pays too much attention to the extreme tail of the distribution. (We will return to this point later.)

18 The root-mean-square

19 The formula of r.m.s. Suppose we are given a list of numbers, we may compute its average. We can also compute another statistical quantity the root-mean-square (r.m.s.). Suppose we are given a list of numbers, we may compute its average. We can also compute another statistical quantity the root-mean-square (r.m.s.). By the phrase “root-mean-square”, we may know how to do the arithmetic, if we keep in mind to read it backwars: By the phrase “root-mean-square”, we may know how to do the arithmetic, if we keep in mind to read it backwars: Square all the entries. Square all the entries. Take the mean (average) of the quares. Take the mean (average) of the quares. Take the root of the mean. Take the root of the mean. This can be expressed as an equation: This can be expressed as an equation:

20 Example Find the average and the r.m.s. size of the list 0, 5, -8, 7, -3. Find the average and the r.m.s. size of the list 0, 5, -8, 7, -3. Solution. Solution. The average of the list is (0 + 5 + (-8) + 7 + (-3)) / 5 = 1 / 5 = 0.2 The average of the list is (0 + 5 + (-8) + 7 + (-3)) / 5 = 1 / 5 = 0.2 The r.m.s. size = √((0² + 5² + (-8)² + 7² + (-3)²) / 5) = √(147 / 5) = √29.4 ≈ 5.4 The r.m.s. size = √((0² + 5² + (-8)² + 7² + (-3)²) / 5) = √(147 / 5) = √29.4 ≈ 5.4

21 Remark One may try to calculate the average neglecting signs of the list. (i.e. Compute the average of the absolute value of the entries in the list.) One may try to calculate the average neglecting signs of the list. (i.e. Compute the average of the absolute value of the entries in the list.) In our previous example, the average neglecting signs is In our previous example, the average neglecting signs is (0 + 5 + 8 + 7 +3) / 5 = 23 / 5 = 4.6 < 5.4 (The r.m.s. size.) (0 + 5 + 8 + 7 +3) / 5 = 23 / 5 = 4.6 < 5.4 (The r.m.s. size.) It turns out that the r.m.s. size is always a little bigger than the average neglecting signs----except the trivial case: all the entries are the same. It turns out that the r.m.s. size is always a little bigger than the average neglecting signs----except the trivial case: all the entries are the same. Statisticians prefer the r.m.s. to measure the overall size for the list, because the r.m.s. fits in better with the algebra than the average neglecting signs. Statisticians prefer the r.m.s. to measure the overall size for the list, because the r.m.s. fits in better with the algebra than the average neglecting signs.

22 The standard deviation The standard deviation (SD) measures the spread of a list of numbers around the average.

23 The formula of SD Suppose we are given a list of numbers. First, we calculate the average of the list. Suppose we are given a list of numbers. First, we calculate the average of the list. Then we take the entries one at a time, compute each deviation from the average: Then we take the entries one at a time, compute each deviation from the average: Deviation from average = entry – average. Deviation from average = entry – average. These deviations form a new list. These deviations form a new list. The SD is the r.m.s. of the list of deviations. The SD is the r.m.s. of the list of deviations. This can be expressed as an equation: This can be expressed as an equation: SD = r.m.s. of deviations from average. SD = r.m.s. of deviations from average.

24 Example Find the SD of the list 20, 10, 15, 15. Find the SD of the list 20, 10, 15, 15. Solution. Solution. The first step is to find the average: (20 + 10 + 15 + 15) / 4 = 60 / 4 = 15. The first step is to find the average: (20 + 10 + 15 + 15) / 4 = 60 / 4 = 15. The second step is to find the deviations: 20 – 15 = 5, 10 – 15 = -5, 15 – 15 = 0, 15 – 15 = 0. So the list of deviations is 5, -5, 0, 0. The second step is to find the deviations: 20 – 15 = 5, 10 – 15 = -5, 15 – 15 = 0, 15 – 15 = 0. So the list of deviations is 5, -5, 0, 0. The last step is to find the r.m.s. of the new list: SD = √((5²+(-5)²+0²+0²) / 4) = √((25+25+0+0) / 4) = √(50 / 4) = √12.5 ≈ 3.5 The last step is to find the r.m.s. of the new list: SD = √((5²+(-5)²+0²+0²) / 4) = √((25+25+0+0) / 4) = √(50 / 4) = √12.5 ≈ 3.5

25 Remark The SD comes out in the same units as the data. The SD comes out in the same units as the data. For example, suppose heights are measured in inches. Then the SD also shares the same units in inches as data. For example, suppose heights are measured in inches. Then the SD also shares the same units in inches as data. This is because the squaring step changes the units to inches squared, while the square root returns the result to the original units. This is because the squaring step changes the units to inches squared, while the square root returns the result to the original units. Do not confuse the SD with r.m.s., the SD is the r.m.s. of the list of deviations, not of the original numbers. Do not confuse the SD with r.m.s., the SD is the r.m.s. of the list of deviations, not of the original numbers.

26 An alternative formula of SD There is an alternative way to compute the SD, which is more efficient in some cases: There is an alternative way to compute the SD, which is more efficient in some cases: SD = √(ave of (entries²) – (ave of entries)²). SD = √(ave of (entries²) – (ave of entries)²). For example, if we look at the list 1, 2, 3, 4, 5. For example, if we look at the list 1, 2, 3, 4, 5. By our original formula, the average of the list is (1+2+3+4+5)/5=3. Then the deviations are 1 – 3 = -2, 2 – 3 = -1, 3 – 3= 0, 4 – 3 = 1, 5 – 3 = 2. So the new list is -2, -1, 0, 1, 2. Then the SD is √(((-2)²+(-1)²+0²+1²+2²)/5)=√((4+1+0+1+4)/5)=√(10/5)=√2≈1.4 By our original formula, the average of the list is (1+2+3+4+5)/5=3. Then the deviations are 1 – 3 = -2, 2 – 3 = -1, 3 – 3= 0, 4 – 3 = 1, 5 – 3 = 2. So the new list is -2, -1, 0, 1, 2. Then the SD is √(((-2)²+(-1)²+0²+1²+2²)/5)=√((4+1+0+1+4)/5)=√(10/5)=√2≈1.4 By above formula, the list of entries² is 1, 4, 9, 16, 25, and the original average is 3. So the average of list of entries² is (1+4+9+16+25)/5=11, and the square of the original average is 9. Therefore, SD = √(11- 9) = √2 ≈ 1.4 By above formula, the list of entries² is 1, 4, 9, 16, 25, and the original average is 3. So the average of list of entries² is (1+4+9+16+25)/5=11, and the square of the original average is 9. Therefore, SD = √(11- 9) = √2 ≈ 1.4

27 Using a statistical calculator Most statistical calculator produce not the SD, but the slightly larger number SD+ (which will be introduced later). Most statistical calculator produce not the SD, but the slightly larger number SD+ (which will be introduced later). To find out what your machine is doing, put in the list -1, 1. If it gives you 1, it is working out the SD. If it gives you 1.41…, it is working out the SD+. To find out what your machine is doing, put in the list -1, 1. If it gives you 1, it is working out the SD. If it gives you 1.41…, it is working out the SD+. If you are getting SD+, and you want the SD, you have to multiply by a conversion factor. This depends on the number of entries on the list. If you are getting SD+, and you want the SD, you have to multiply by a conversion factor. This depends on the number of entries on the list. For instance, with 10 entries, the conversion factor is √(9/10). With 20 entries, it is √(19/20). For instance, with 10 entries, the conversion factor is √(9/10). With 20 entries, it is √(19/20). In general, we have the formula: In general, we have the formula:

28 The properties of the SD The SD measures the size of deviations from the average: it is a sort of average deviation. The SD measures the size of deviations from the average: it is a sort of average deviation. The SD says how far away numbers on a list are from their average. Most entries on the list will be somewhere around the SD away from the average. Very few will be more than two or three SDs away. The SD says how far away numbers on a list are from their average. Most entries on the list will be somewhere around the SD away from the average. Very few will be more than two or three SDs away.

29 Quantitative version To many data sets, roughly 68% of the entries on a list are within one SD of the average, the other 32% are further away. To many data sets, roughly 68% of the entries on a list are within one SD of the average, the other 32% are further away. Roughly 95% are within two SDs of the average, the other 5% are further away. Roughly 95% are within two SDs of the average, the other 5% are further away. Note: This is correct for many lists, but not all.

30 Back to the HANES5 sample The average height of women age 18 and over (age 18+) was about 63.5 inches, and the SD was close to 3 inches. The average height of women age 18 and over (age 18+) was about 63.5 inches, and the SD was close to 3 inches. By the property of SD, the SD of 3 inches says that, many of the women differed from the average height by within 3 inches. (i.e. Many of them lied in the interval between 60.5 inches and 66.5 inches.) By the property of SD, the SD of 3 inches says that, many of the women differed from the average height by within 3 inches. (i.e. Many of them lied in the interval between 60.5 inches and 66.5 inches.) Also, few women differed from the average height by more than 6 inches. (i.e. Only a few number of them lied outside the interval between 57.5 inches and 69.5 inches.) Also, few women differed from the average height by more than 6 inches. (i.e. Only a few number of them lied outside the interval between 57.5 inches and 69.5 inches.)

31 The histogram for the heights of women age 18+ The average is marked by a vertical line, and the region within one SD of the average is shaded. The average is marked by a vertical line, and the region within one SD of the average is shaded. The area of the shaded region is about 72%. So about 72% of the women differed from the average height by one SD or less. The area of the shaded region is about 72%. So about 72% of the women differed from the average height by one SD or less.

32 The histogram for the heights of women age 18+ Now the region within two SDs of average is shaded. Now the region within two SDs of average is shaded. The area of the shaded region is about 97%. So about 97% of the women differed from the average height by two SDs or less. The area of the shaded region is about 97%. So about 97% of the women differed from the average height by two SDs or less.

33 Summary of the data About 72% of the women differed from average by one SD or less. About 72% of the women differed from average by one SD or less. About 97% of the women differed from average by two SDs or less. About 97% of the women differed from average by two SDs or less. In fact, in the sample, there was only one woman who was more than three SDs away from the average, and none were more than four SDs away. In fact, in the sample, there was only one woman who was more than three SDs away from the average, and none were more than four SDs away. Remark: For this data set, the 68%-95% rule works quite well. (Later in the course, we will talk about where the 68% and 95% come from.) Remark: For this data set, the 68%-95% rule works quite well. (Later in the course, we will talk about where the 68% and 95% come from.)

34 Summary A list of numbers (usually a data set) can be summarized by its average and standard deviation. A list of numbers (usually a data set) can be summarized by its average and standard deviation. Average = sum of entries / number of entries. Average = sum of entries / number of entries. The average locates the “center” of a histogram (at the “balance” point). The average locates the “center” of a histogram (at the “balance” point). The median is another way to locate the center of a histogram, with half the area to the left and half to the right. The median is another way to locate the center of a histogram, with half the area to the left and half to the right. The r.m.s. of a list measures how big the entries are (the overall size of the list). The r.m.s. of a list measures how big the entries are (the overall size of the list). The r.m.s. of a list = √(ave. of (entries²)). The r.m.s. of a list = √(ave. of (entries²)). The SD measures distance from the average. And SD = r.m.s. of the deviations from the average. The SD measures distance from the average. And SD = r.m.s. of the deviations from the average. Roughly 68% are within on SD around the average, and about 95% are within two SDs. This works for many lists, but not all. Roughly 68% are within on SD around the average, and about 95% are within two SDs. This works for many lists, but not all. If a study draws conclusions about the effects of age, find out whether the data are cross-sectional or longitudinal. If a study draws conclusions about the effects of age, find out whether the data are cross-sectional or longitudinal.


Download ppt "Introduction We learned from last chapter that histogram can be used to summarize large amounts of data. We learned from last chapter that histogram can."

Similar presentations


Ads by Google