Introduction A histogram is a graph that summarizes data.


1 Introduction A histogram is a graph that summarizes data.

2 How to read histograms A histogram does not need a vertical scale, but it does have a horizontal scale.

3 Example: The incomes of 50,000 American families in 1973.

4 Basic properties The histogram: a set of blocks. Horizontal scale: shows income in thousands of dollars. Class intervals: the bottom edges of the blocks. (e.g. The first block covers the range from $0 to $1,000; the second covers from $1,000 to $2,000; and so on until the last, from $25,000 to $50,000.) Areas of the blocks: represent percentages. (i.e. The area of a block is proportional to the number of families with incomes in the corresponding class interval.)

5 How do the blocks work? The percentage of families that earned between $10,000 and $15,000: the block looks like 1/4 of the total area, so it is about 25%. Compare families with incomes between $10,000 and $15,000 to families with incomes between $15,000 and $25,000: the areas are about the same, so the percentages are about the same. Families with incomes under $7,000: about 25%.

6 Histogram with a vertical scale supplied Example: the vertical scale is in percent per thousand dollars.

7 Drawing a histogram

8 Distribution table: shows the percentage in each class interval. Endpoint convention: the left endpoint is included and the right endpoint is excluded. For example: $0 is included and $1,000 is excluded in the first line. $1,000 is included in the next interval.
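The endpoint convention can be sketched in a few lines of Python. This is only an illustration: the function name `class_interval` is made up here, and the list of edges holds just the first two class intervals named in the slides ($0 to $1,000 and $1,000 to $2,000).

```python
from bisect import bisect_right

def class_interval(value, edges):
    """Return the (left, right) class interval containing value.

    The left endpoint is included and the right endpoint is excluded.
    """
    if value < edges[0] or value >= edges[-1]:
        raise ValueError("value lies outside the class intervals")
    # bisect_right counts the edges that are <= value, so an income
    # exactly on an edge falls into the interval that starts there.
    i = bisect_right(edges, value) - 1
    return edges[i], edges[i + 1]

# Illustrative edges in dollars: only the first two class intervals.
edges = [0, 1000, 2000]

print(class_interval(0, edges))       # $0 belongs to the first interval
print(class_interval(999.99, edges))  # still the first interval
print(class_interval(1000, edges))    # $1,000 belongs to the next interval
```

An income of exactly $1,000 lands in the second interval, which is what "left endpoint included, right endpoint excluded" means.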

9 Put down a horizontal axis Warning: this is a wrong example! The scaling is wrong: the length from 7 to 10 should be three times as long as the length from 6 to 7. We also need to label the axis.

10 Draw the blocks Warning: this is a wrong example! It is tempting, but wrong, to make the heights of the blocks equal to the percents in the table. e.g. The graph seems to state that there were many more families with incomes over $25,000 than under $7,000. (The reason is that some class intervals are longer than others, which makes their blocks too big.)

11 Use a common unit Since the lengths of the class intervals are different, we use a common unit. In our example, the common unit will be the thousand-dollar sub-interval. e.g. The class interval from $7,000 to $10,000 contains three of these sub-intervals: $7,000 to $8,000, $8,000 to $9,000, and $9,000 to $10,000. From the table, 15% of the families were in the whole class interval, so each thousand-dollar sub-interval (the unit) will have 5% (NOT 15%). Therefore, the height of the block will be 5.

12 Use a common unit e.g. Look at the class interval from $10,000 to $15,000. This contains five of the thousand-dollar sub-intervals. From the table, 26% of the families were in the whole class interval, so each unit sub-interval will have 26/5 = 5.2 (%). Then the height of the block will be 5.2 percent per thousand dollars (the unit of the vertical scale), NOT 26.

13 Divide the percentage by the length So, as we saw in the example, to figure out the height of a block over a class interval, divide the percentage by the length of the interval. This way, the area of the block equals the percentage. The histogram represents the distribution as if the percentage were spread evenly over each class interval. Often, this is a good first approximation.
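The rule "divide the percentage by the length" can be checked with a small Python sketch. The helper name `block_height` is hypothetical, and the table holds only the two rows quoted in the slides (15% for $7,000 to $10,000 and 26% for $10,000 to $15,000), not the full distribution table.

```python
def block_height(percent, left, right):
    """Height of a histogram block, in percent per horizontal unit."""
    width = right - left
    if width <= 0:
        raise ValueError("class interval must have positive length")
    return percent / width

# (left, right, percent), with incomes in thousands of dollars.
# Only the two rows quoted in the slides are used here.
rows = [(7, 10, 15.0), (10, 15, 26.0)]

heights = [block_height(percent, left, right) for left, right, percent in rows]

for (left, right, percent), height in zip(rows, heights):
    width = right - left
    print(f"{left}-{right}: {percent}% over {width} units "
          f"gives {height} percent per thousand dollars")
```

Multiplying each height by the length of its interval gives back the percentage in the table, which is why the area of the block equals the percentage.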

14 The Density Scale A vertical scale that shows percentage per horizontal unit.

15 Example

16 Educational level: years of schooling completed. (Kindergarten is not included.) The vertical scale represents the density: its units are percent per year. Pay attention to the endpoint convention: e.g. the class interval 8-9 years represents all the people who finished eighth grade but not ninth grade, including people who dropped out part way through ninth grade. (People who finished ninth grade are included in the next class interval.) For example, the height of the block over the class interval 13-16 years is 6% per year: i.e. 6% finished the first year of college, another 6% finished the second, and another 6% finished the third.

17 The height of a block represents crowding: percentage per horizontal unit. The histogram is highest over the interval 12-13 years, so the crowding is greatest there. (These are people with high-school degrees.) Two other peaks: 8-9 years (finishing middle school) and 16-17 years (finishing college). The 3 peaks show how people tend to stop their schooling at one of the three possible graduations rather than dropping out in between.

18 Height vs Area Area represents percent: if one block covers a larger area than another, it represents a larger percent of the cases. Height shows the crowding: in our example, if people lined up on the horizontal axis, some of the class intervals would be more crowded than others. Keep the notion of crowding separate from the number in an interval: e.g. compare the class intervals 8-9 years and 9-12 years. The first block is more crowded since it is taller, but the second block has many more people since it has a larger area.

19 How the density scale works Calculating the area of a block: e.g. the block over the interval 9-12 years (the people who got through their first year of high school but didn’t graduate). The height is about 4% per year (each of the three one-year sub-intervals 9-10, 10-11, and 11-12 holds nearly 4% of the people), so the whole block must hold nearly 3 x 4% = 12% of the people.

20 More Examples Solution. The height of the block is 2% per thousand dollars. The width is 25 – 15 = 10 (thousands of dollars). So the answer is 10 x 2% = 20%.
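The two area calculations above (the schooling block over 9-12 years and the income block over 15 to 25 thousand dollars) follow the same pattern, height times width. A minimal Python sketch, with the helper name `block_percent` chosen here purely for illustration:

```python
def block_percent(height, left, right):
    """Percentage of cases in a block: density height times interval length."""
    return height * (right - left)

# Schooling: about 4% per year over the interval 9-12 years.
schooling = block_percent(4.0, 9, 12)

# Income: 2% per thousand dollars over the interval 15-25 (thousands of dollars).
income = block_percent(2.0, 15, 25)

print(schooling, income)
```

The units work out because "percent per year" times "years" (or "percent per thousand dollars" times "thousands of dollars") leaves plain percent.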

21 More Examples Solution. The histogram is almost a triangle. Its height is about 4% per pound, and its base is 200 lb – 100 lb = 100 lb. So the area is ½ x 100 lb x 4% per lb = 200%. This is impossible: the total area should be only 100%.
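The same sanity check can be written out in Python. This is a sketch with hypothetical names (`triangle_area`, `largest_possible_peak`); it also works out how tall a triangular histogram on this base could be at most, which follows from setting the area equal to 100%.

```python
def triangle_area(base, peak_height):
    """Area of a triangle, used as a rough area for a triangular histogram."""
    return 0.5 * base * peak_height

base_lb = 200 - 100  # base of the triangle, in pounds

# Area implied by the sketch: 4% per pound at the peak.
sketched_area = triangle_area(base_lb, 4.0)

# A triangular histogram has total area 100%, so its peak satisfies
# 0.5 * base * peak = 100, i.e. peak = 200 / base.
largest_possible_peak = 2 * 100.0 / base_lb

print(sketched_area, largest_possible_peak)
```

With a 100 lb base, a triangular histogram cannot peak higher than 2% per pound, so a peak of 4% per pound signals a mistake in the vertical scale.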

22 Properties of the density scale With the density scale on the vertical axis, the areas of the blocks come out in percent. The area under the histogram over an interval equals the percentage of cases in that interval. The total area under the histogram is 100%.

23 Variables A variable is a characteristic which changes from person to person in a study.

24 Examples How old are you? Variable: age. How many people are there in your family? Variable: family size. What is your family’s total income? Variable: family income. Are you married? Variable: marital status. Do you have a job? Variable: employment status.

25 Quantitative and Qualitative Variables whose values are numbers: quantitative variables. e.g. Age, family size, and family income. Variables whose values are descriptive words or phrases: qualitative variables. e.g. Marital status (single, married, widowed, divorced, separated) and employment status (employed, unemployed, not in the labor force).

26 Discrete and Continuous If the values of a quantitative variable can only differ by fixed amounts, it is called a discrete variable. e.g. Family size is discrete. If the values can differ by arbitrarily small amounts, it is called a continuous variable. e.g. Age is continuous: the difference in age between two people can be arbitrarily small: a year, a month, a day, an hour,…

27

28 To plot a histogram We start with a distribution table. Collect the raw data: a list of cases. Choose class intervals: it is common to start with 10 or 15 class intervals. (With too many or too few intervals, the histogram will not be informative.) For a continuous variable, we use an endpoint convention. For a discrete variable, we center the class intervals at the possible values.
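Going from raw data to a distribution table might look like the following Python sketch. Everything named here is an assumption for illustration: the function `distribution_table`, the made-up list of ages, and the three class intervals (far fewer than the 10 or 15 suggested above, to keep the example short). The left endpoint of each interval is included and the right endpoint excluded.

```python
from bisect import bisect_right

def distribution_table(values, edges):
    """Return rows (left, right, percent) for the given class-interval edges.

    Left endpoints are included, right endpoints are excluded.
    """
    counts = [0] * (len(edges) - 1)
    for value in values:
        if value < edges[0] or value >= edges[-1]:
            raise ValueError(f"{value} lies outside the class intervals")
        counts[bisect_right(edges, value) - 1] += 1
    total = len(values)
    return [
        (edges[i], edges[i + 1], 100.0 * count / total)
        for i, count in enumerate(counts)
    ]

# Made-up raw data (ages in years) and made-up class intervals.
ages = [0.5, 3, 3, 7.2, 10, 10, 12.5, 19.9]
edges = [0, 5, 10, 20]

table = distribution_table(ages, edges)
for left, right, percent in table:
    print(f"{left}-{right}: {percent}%")
```

Note that the two ages of exactly 10 are counted in the interval 10-20, not 5-10, because of the endpoint convention.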

29 Example: family size The figure above shows how to select the class intervals for a discrete variable. e.g. A family cannot have 2.5 members, so there is no problem with endpoints.
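For a discrete variable, centering the class intervals can be sketched as follows. The assumptions are stated loudly: the family sizes are made-up data, each interval is taken to run half a unit either side of a possible value (so its length is 1), and the names `centered_intervals` and `discrete_distribution` are hypothetical.

```python
from collections import Counter

def centered_intervals(possible_values):
    """Class intervals of length 1 centered at each possible value."""
    return [(value - 0.5, value + 0.5) for value in possible_values]

def discrete_distribution(values):
    """Percent of cases at each observed value, in increasing order of value."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * counts[value] / total for value in sorted(counts)}

# Made-up family sizes, for illustration only.
family_sizes = [2, 2, 3, 4, 4, 4, 5, 2]

intervals = centered_intervals(sorted(set(family_sizes)))
percents = discrete_distribution(family_sizes)

for (left, right), (size, percent) in zip(intervals, percents.items()):
    print(f"size {size}: interval {left} to {right}, {percent}%")
```

Because each interval has length 1, the height of each block (percent per person) is numerically the same as the percentage at that value, and no observation can ever fall on an endpoint.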

30 Example: family size The bars seem to stop at 8. This is because there are so few families with 9 or more people.

31 Controlling for a variable In the 1960s, many women began using oral contraceptives. The pill alters the body’s hormone balance, so it is important to see what the side effects are. Kaiser Clinic in Walnut Creek, California ran an observational study on this question. One issue was the effect of the pill on blood pressure.

32 The observational study Investigators compared the results for two groups of women: users who take the pill (the treatment group); non-users who don’t take the pill (the control group). Blood pressure tends to go up with age, and the non-users were on the whole older than the users. (About 70% of the non-users were over 30, compared to 50% of the users.) So the effect of age is confounded with the effect of the pill. It is therefore necessary to make a separate comparison for each age group: this controls for the variable age. Here we only look at the women aged 25-34.
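Controlling for age amounts to restricting the comparison to one age group at a time. The Python sketch below shows the idea on a handful of made-up records; none of the numbers come from the study, and the function name `mean_pressure_by_group` and the record layout (age, uses_pill, systolic pressure) are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def mean_pressure_by_group(records, age_band):
    """Mean systolic pressure of users and non-users within one age band.

    records: iterable of (age, uses_pill, systolic_pressure).
    age_band: (youngest, oldest), both ends included.
    """
    youngest, oldest = age_band
    groups = defaultdict(list)
    for age, uses_pill, pressure in records:
        if youngest <= age <= oldest:
            groups[uses_pill].append(pressure)
    return {uses_pill: mean(values) for uses_pill, values in groups.items()}

# Made-up records, NOT data from the study.
records = [
    (26, True, 120), (30, True, 126),
    (28, False, 116), (33, False, 118),
    (45, False, 140),  # outside the 25-34 band, so ignored below
    (22, True, 112),   # outside the 25-34 band, so ignored below
]

result = mean_pressure_by_group(records, (25, 34))
print("users:", result[True], "non-users:", result[False])
```

The study compared whole histograms rather than means; the mean is used here only to keep the sketch short.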

33 The effect of the pill The systolic blood pressures of 1,747 users and 3,040 non-users (age: 25-34). The histogram for the users is compared to the histogram for the non-users, and to the histogram for the non-users shifted to the right by 5 mm (millimeters).

34 Conclusion Figure on the left: the 2 histograms have very similar shapes. But the users’ histogram is higher to the right of 120 mm and lower to the left. If 5 mm were added to the blood pressure of each non-user, this would shift their histogram 5 mm to the right, as shown in the figure on the right. Figure on the right: the 2 histograms match up quite well. So the conclusion is that using the pill adds about 5 mm to the blood pressure of each woman.

35 Remark The conclusion must be treated with caution, because this is an observational study, not a controlled experiment. Observational studies can be misleading about cause-and-effect relationships. There could be some factor other than the pill or age that affects the blood pressures.

36 Summary A histogram consists of a set of blocks, and the area of each block represents the percentage of cases in the corresponding class interval. The total area is 100%. The height of a block represents the crowding in that class interval. It equals the area divided by the length of the interval. A variable is a characteristic of the subjects in a study. It can be either qualitative or quantitative. A quantitative variable can be either discrete or continuous.

37 Reading materials Pages 47-49 of the textbook. (Will be uploaded on the webpage of the course.)

