1
1. Displaying data with graphs
The Practice of Statistics in the Life Sciences Second Edition © 2012 W.H. Freeman and Company
2
Objectives (PSLS Chapter 1)
Picturing Distributions with Graphs
Individuals and variables
Two types of data: categorical and quantitative
Ways to chart categorical data: bar graphs and pie charts
Ways to chart quantitative data: histograms, dotplots, and stemplots
Interpreting histograms
Graphing time series: time plots
3
Individuals and variables
Individuals are the objects described in a set of data. Individuals may be people, animals, plants, or things. Examples: freshmen, newborns, golden retrievers, fields of corn, cells.
A variable is any property that characterizes an individual. A variable can take different values for different individuals. Examples: age, gender, blood pressure, blood type, leaf length, flower color.
4
Two types of variables
A variable can be either:
Quantitative: some quantity assessed or measured for each individual. We can then report the average of all individuals. Examples: age (in years), blood pressure (in mm Hg), leaf length (in cm).
Categorical: some characteristic describing each individual. We can then report the count or proportion of individuals with that characteristic. Examples: gender (male, female), blood type (A, B, AB, O), flower color (white, yellow, red).
5
How do you decide if a variable is categorical or quantitative?
Ask:
What are the n individuals examined (in the sample or population)?
What is being recorded about those n individuals?
Is that a number (quantitative) or a statement (categorical)?

Individuals studied | Diagnosis     | Age at death
Patient A           | Heart disease | 56
Patient B           | Stroke        | 70
Patient C           |               | 75
Patient D           | Lung cancer   | 60
Patient E           |               | 80
Patient F           | Accident      | 73
Patient G           | Diabetes      | 69

Diagnosis: each individual is given a description.
Age at death: each individual is given a meaningful number.
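As a rough illustration of the distinction, the sketch below takes the ages at death from the table above (quantitative, so an average makes sense) and the diagnoses that are listed (categorical, so counts and proportions make sense). The variable names are made up for this example, and the two patients without a listed diagnosis are simply left out of the count.

```python
from collections import Counter
from statistics import mean

# Age at death for Patients A-G, from the table above (quantitative).
ages = [56, 70, 75, 60, 80, 73, 69]
average_age = mean(ages)

# Diagnoses as listed in the table (categorical); blank entries are omitted.
diagnoses = ["Heart disease", "Stroke", "Lung cancer", "Accident", "Diabetes"]
counts = Counter(diagnoses)
proportions = {d: c / len(diagnoses) for d, c in counts.items()}

print(average_age)            # a number summarizes a quantitative variable
print(proportions["Stroke"])  # a proportion summarizes a categorical one
```

Averaging the diagnoses would be meaningless, which is a quick practical check that the variable is categorical.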
6
Who/what are the individuals?
A study examined the condition of deer after a particularly nasty winter. Sex and condition (good and poor) of a random sample of 61 deer are noted. Data from such a study could appear in either of these two formats: raw data or a frequency table.
[Figure: the same data shown as raw data and as a frequency table]
Who/what are the individuals? What are the variables, and are they quantitative or categorical?
Individuals: the 61 deer
Variables: sex (categorical) and condition (categorical)
Note: "Count" is NOT the variable studied; it is a summary statistic for the data set.
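The link between the two formats can be sketched in a few lines: the frequency table is just a tally of the raw records. The six records below are invented for illustration (the slide's actual 61 observations are not reproduced here), and the names `raw_data` and `frequency_table` are assumptions of this sketch.

```python
from collections import Counter

# Hypothetical raw data: one (sex, condition) record per deer.
raw_data = [
    ("female", "good"), ("female", "poor"), ("male", "good"),
    ("male", "poor"), ("female", "good"), ("male", "poor"),
]

# The frequency table counts how many individuals share each combination.
frequency_table = Counter(raw_data)

for (sex, condition), count in sorted(frequency_table.items()):
    print(f"{sex:<8}{condition:<6}{count}")
```

The counts add up to the number of individuals, which is why "Count" is a summary of the data rather than a variable recorded on each deer.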
7
Ways to chart categorical data
Most common ways to graph categorical data: Bar graphs Each characteristic, or level, is represented by a bar. The height of a bar represents either the count of individuals with that characteristic, the frequency, or the percent of individuals with that characteristic, the relative frequency. Pie charts A pie chart can only represent how one categorical variable breaks down into its components. Each characteristic is represented by a slice, and the size of a slice represents what percent of the whole is made up by that characteristic. When a variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.).
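To make the two possible bar heights concrete, here is a minimal sketch that computes the frequency and the relative frequency of each category and draws a crude text bar graph. The flower colors are a made-up sample, and `text_bar_graph` is a hypothetical helper written for this example.

```python
from collections import Counter

# Hypothetical sample of flower colors (one value per individual).
colors = ["white", "yellow", "red", "white", "white", "yellow", "red", "white"]

counts = Counter(colors)                                  # frequency
n = len(colors)
percents = {c: 100 * k / n for c, k in counts.items()}    # relative frequency

def text_bar_graph(counts):
    """Return one line per category, with one '#' per individual."""
    return [f"{category:<7}{'#' * k} ({k})" for category, k in counts.most_common()]

for line in text_bar_graph(counts):
    print(line)
```

Here the bars are ordered from most to least common, but since the variable is categorical any other order would be just as valid.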
8
Bar graph only: Do you like…?
[Table: yes/no answers from 6 subjects for carrots, peas, and spinach. Percent who like: carrots 67%, peas 50%, spinach 33%.]
The first table shows the answers to three separate questions. We can use one bar graph to display all the results at once. We couldn't use a pie chart, because there is more than one categorical variable here.

Bar graph or pie chart: Which one do you prefer?
[Table: the preferred vegetable (peas, carrots, or spinach) for each of 6 subjects. Percent who prefer: 50%, 33%, 17%.]
The second table shows the answers to just one question. We can display the results either in a bar graph or in a pie chart.
9
Who/what are the individuals?
Percent of current marijuana users in each of four age groups: USA, 2004
Who/what are the individuals? What are the variables, and are they quantitative or categorical? What type of graph is this? Could these data be represented in a pie chart?
The individuals are Americans 12 years or older. The variables are age group (categorical) and marijuana use (categorical). This is a bar graph. We could not represent these data in a pie chart because we have asked about marijuana use in 4 separate age groups (a pie chart could only display the results for each age group separately).
10
Common ways to chart quantitative data
Histograms: This is a summary graph for a single variable. Histograms are useful to understand the pattern of variability in the data, especially for large data sets.
Dotplots and stemplots: These are graphs of the raw data. They are useful to describe the pattern of variability in the data, especially for small data sets.
Line graphs (time plots): Use them when there is a meaningful sequence, like time. The line connecting the points helps emphasize any change over time.
Other graphs display numerical summaries (see chapter 2).
11
Making a stemplot
1) Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which is that remaining final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
2) Write the stems in a vertical column with the smallest value at the top, and draw a vertical line at the right of this column.
3) Write each leaf in the row to the right of its stem, in increasing order out from the stem.
Stemplots can be "split" to give greater detail. Variables with a wide range might be rounded or truncated before plotting, to keep the stemplot more manageable.
Original data: 9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70
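The procedure above can be sketched in code using the original data listed on this slide. This is one possible implementation, not the textbook's: the function name `stemplot` is hypothetical, it assumes non-negative integers, and it keeps stems that have no leaves so that gaps in the data stay visible.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # stem: all but the last digit; leaf: last digit
        leaves[stem].append(str(leaf))
    stems = range(min(leaves), max(leaves) + 1)   # include stems with no leaves
    return [f"{stem} | {''.join(leaves[stem])}" for stem in stems]

data = [9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70]
for row in stemplot(data):
    print(row)
```

Because the values are sorted before they are split, the leaves on each row come out in increasing order automatically.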
12
Step 2: Assign the values to stems and leaves
How fast do skin wounds heal? Biologists measured the rate at which new cells closed a razor cut made in the skin of an anesthetized newt. The data come from 18 newts, measured in micrometers per hour.
[Figure: the 18 measurements and the resulting stemplot]
Step 1: Sort the data
Step 2: Assign the values to stems and leaves
13
Making a dotplot
1) Create a single axis representing the quantitative variable's range.
2) Represent each data point as a dot positioned according to its numerical value.
3) When two or more data points have the same value, stack them up.
Software sometimes stacks two data points with similar (but not identical) values. This is equivalent to rounding the data before plotting.
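A text-only version of these steps might look like the following sketch. The data are invented, the function name `dotplot` is hypothetical, and the code assumes integer observations so that each value gets its own column on the axis.

```python
from collections import Counter

def dotplot(values):
    """Return text rows for a dotplot: stacked dots above a single axis."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    rows = []
    # Build the stacks from the top level down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(" o " if counts[x] >= level else "   " for x in axis)
        rows.append(row.rstrip())
    rows.append("".join(f"{x:^3}" for x in axis).rstrip())
    return rows

data = [2, 3, 3, 4, 4, 4, 6]   # hypothetical integer observations
for row in dotplot(data):
    print(row)
```

Rounding non-integer data before calling the function would reproduce the stacking of similar values that some software does.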
14
Making a histogram 1) The range of values that the quantitative variable takes is divided into equal-size intervals, or classes. This makes up the horizontal axis. 2) The vertical axis represents either the frequency (counts) or the relative frequency (percents of total). 3) For each class on the horizontal axis, draw a column. The height of the column represents the count (or percent) of data points that fall in that class interval.
15
Guinea pig survival time (in days) after inoculation with a pathogen (n = 72)
Let’s build a histogram with classes of size 50, starting at zero (zero is included in the first class). The first class is 0 to 50, with 0 included but 50 excluded (obviously, no value can belong to more than one interval). Only two guinea pigs fall in this interval: The guinea pigs with survival times 43 and 45, respectively.
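The class rule described here (lower bound included, upper bound excluded) can be sketched as follows. Only the survival times 43 and 45 come from the text; the remaining values are made up to show what happens at class boundaries, and `class_counts` is a hypothetical helper.

```python
def class_counts(values, width, start=0):
    """Count values per class [lower, lower + width), keyed by lower bound."""
    counts = {}
    for v in values:
        k = int((v - start) // width)       # index of the class containing v
        lower = start + k * width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

# 43 and 45 are from the text; the other survival times are made up.
survival_days = [43, 45, 50, 53, 99, 100, 102, 149]
counts = class_counts(survival_days, width=50)
for lower, count in counts.items():
    print(f"{lower:>3} to {lower + 50:<3} {'*' * count}")
```

A value of exactly 50 lands in the class starting at 50, not the class starting at 0, so no observation is counted twice.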
16
Here are two histograms of the same data set, using the same choice of classes. The histogram on the top displays the relative frequency (percent of guinea pigs), whereas the histogram on the bottom displays the frequency (count of guinea pigs). Notice that the shape of the two histograms is identical: every count is divided by the same total, so only the vertical scale changes.
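A short sketch shows why the two versions must have the same shape. The class counts below are hypothetical (chosen only so that they total 72); they are not the actual guinea pig counts.

```python
# Hypothetical class counts (frequency), keyed by the lower bound of each class.
frequency = {0: 2, 50: 30, 100: 22, 150: 10, 200: 8}
n = sum(frequency.values())

# Relative frequency: each count divided by the same total, as a percent.
relative = {lower: 100 * count / n for lower, count in frequency.items()}

# Every bar is rescaled by the same factor (100 / n), so the ratios between
# bar heights, and therefore the shape, do not change.
tallest_by_count = max(frequency, key=frequency.get)
tallest_by_percent = max(relative, key=relative.get)
print(tallest_by_count == tallest_by_percent)
```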
17
Choosing the classes for a histogram
It is an iterative process: try and try again.
Not too many classes with either 0 or 1 counts
Not so summarized that you lose all the information
Not so detailed that it is no longer a summary
Try starting with 5 to 10 classes, then refine your class choice. (There isn't a unique or "perfect" solution.)
Statistical Applets: One Variable Statistical Calculator
18
Interpreting histograms
We look for the overall pattern and for striking deviations from that pattern. We describe the histogram's shape, center, spread, and possibly outliers.
Shape: unimodal, bimodal, multimodal; symmetric, skewed; irregular
Center: look for the approximate center of mass
Spread: range of values taken by the variable
Outliers: individuals that do not fit the overall pattern
19
Most common unimodal distribution shapes
Symmetric distribution: The left and right sides are approximately mirror images of each other.
Right skew: The right side (side with larger values) extends much farther out than the left side.
Left skew: The left side (side with smaller values) extends much farther out than the right side.
All these are unimodal, or single-peaked. The side of a skew is the side of the tail of the distribution.
20
Describe the shape of these histograms.
Top (young women with anorexia): Symmetric Middle (adolescent girls): Skewed to the right Bottom (fish/perch): Complex, bimodal distribution Not all distributions have a simple shape (especially with few observations).
21
Outliers
An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.
Alaska and Florida have unusual percents of elderly in their population.
A large gap in the distribution is typically a sign of an outlier. Outliers might be interesting or they might be mistakes. See the discussion in Chapter 2 for more in-depth coverage.
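One informal way to follow the "large gap" hint is to sort the data and look at the widest gap between neighboring values. The sketch below does only that; it is not a formal outlier rule, the percents are invented for illustration, and `largest_gap` is a hypothetical helper.

```python
def largest_gap(values):
    """Return (gap, low, high) for the widest gap between neighboring sorted values."""
    ordered = sorted(values)
    return max((b - a, a, b) for a, b in zip(ordered, ordered[1:]))

# Hypothetical percents of elderly residents for a handful of states.
percents = [5.5, 10.9, 11.4, 12.1, 12.4, 12.8, 13.3, 13.6, 18.3]
gap, low, high = largest_gap(percents)
print(f"Widest gap: {gap:.1f} points, between {low} and {high}")
```

A wide gap only flags a value for a closer look; deciding whether it is interesting or a mistake still requires knowing where the data came from.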
22
Graphing time series
Data collected over time are displayed in a time plot, with time on the horizontal axis and the variable of interest on the vertical axis. We look for a possible trend (a clear overall pattern) and possible cyclical variations (variations with some regularity over time).
[Figure: monthly atmospheric CO2 levels recorded at the Mauna Loa, Hawaii, observatory (March 1958 to August 2009)]
This graph shows clear evidence of both:
Trend: We notice a clear upward pattern indicating a gradual increase in monthly CO2 level.
Cyclical variations: We observe many month-to-month variations, but these variations are quite regular. There are as many peaks as there are years in the data set, so these are seasonal variations; the yearly peak in CO2 level is in late winter to early spring (the first data point, in March 1958, is near the annual maximum, whereas the last data point, in August 2009, is already on the downward part of the cycle).
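One common way to separate a trend from a seasonal cycle, which is an addition here and not part of the original slides, is a 12-month moving average: averaging over a full year cancels a cycle that repeats yearly and leaves the trend. The series below is synthetic (it is not the Mauna Loa data), and its starting level, slope, and cycle size are arbitrary assumptions.

```python
import math

def moving_average(values, window=12):
    """Average each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Synthetic monthly series (not the Mauna Loa data): a steady rise of
# 0.1 per month plus a seasonal cycle that repeats every 12 months.
series = [315 + 0.1 * m + 3 * math.sin(2 * math.pi * m / 12) for m in range(48)]

trend = moving_average(series)          # the seasonal cycle averages out
steps = [b - a for a, b in zip(trend, trend[1:])]
print(round(min(steps), 3), round(max(steps), 3))
```

The month-to-month change in the smoothed series is essentially constant, which is the upward trend with the seasonal variation removed.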
23
Describe these two graphs.
Top: Shark River water salinity in the Everglades National Park, over a seven-day period in the fall of 2009. This is a time plot of water salinity (quantitative). Note that data collection is done every 15 minutes. We see no evidence of an overall trend over such a short period. A cyclical pattern is obvious: salinity peaks twice a day, corresponding with the tides bringing in ocean water from the Gulf of Mexico (the data were collected near the mouth of the river).
Bottom: Time plot of C-section delivery rates in the United States. Note that the data are annual rates. There is a clear upward trend indicating a gradual increase in the proportion of births delivered using C-section. Shorter (maybe more problematic) pregnancies have a higher rate overall, but the trend is visible in all three groups. We find no evidence of cyclical variation.
24
Scales matter
How you stretch the axes and choose your scales can give a different impression. A picture is worth a thousand words, BUT there is nothing like hard numbers. Look at the scales.
[Figure: death rates from cancer (U.S., 1945 to 1995)]