Numerical Methods for Describing Data Distributions

Chapter 3 Numerical Methods for Describing Data Distributions Created by Kathy Fritz

What was the class average? What were the high and low scores?
Suppose that you have just received your score on an exam in one of your classes. What would you want to know about the distribution of scores for this exam? What was the class average? You want to know a “typical” exam score, or a number that best describes the entire set of scores. What were the high and low scores? You want to know about the variability of the data set. Measures of center describe where the data distribution is located along the number line. A measure of center provides information about what is “typical”. Measures of spread describe how much variability there is in a data distribution. A measure of spread provides information about how much individual values tend to differ from one another.

The stress of the final years of medical training can contribute to depression and burnout. The authors of the paper “Rates of Medication Errors Among Depressed and Burnt Out Residents” (British Medical Journal [2008]: 488) studied 24 residents in pediatrics. Medical records of patients treated by these residents during a fixed time period were examined for errors in ordering or administering medications. The accompanying dotplot displays the total number of medication errors for each of the 24 residents. Which is more appropriate as the “typical” value of this data set? Explain. Mean = 1 or Median = 0

Choosing Appropriate Measures for Describing Center and Spread
If the shape of the data distribution is approximately symmetric, describe center and spread using the mean and standard deviation.
If the shape of the data distribution is skewed or has outliers, describe center and spread using the median and interquartile range.

Describing Center and Spread For Data Distributions That Are Approximately Symmetric
Mean Standard Deviation

Mean The sample mean is the arithmetic average of the values in a data set. It is denoted by the symbol x̄ (pronounced x-bar). Some notation: x = the variable of interest, n = the sample size, and x1, x2, …, xn are the individual observations in the data set. Formula: x̄ = (x1 + x2 + … + xn) / n. The population mean, μ (the Greek letter mu), is the arithmetic average of all the x values in an entire population.
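As a minimal sketch of the formula for the sample mean, the function below adds the observations and divides by the sample size. The name sample_mean and the six exam scores are assumptions chosen for illustration; they are not taken from the slides.

```python
def sample_mean(values):
    """Return the arithmetic average x-bar = (x1 + ... + xn) / n."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one observation")
    return sum(values) / n


# Hypothetical exam scores, used only for illustration.
scores = [50, 60, 70, 80, 90, 100]
print(sample_mean(scores))  # 450 / 6 = 75.0
```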

Measuring Variability
Consider the three sets of six exam scores displayed below: Each data set has a mean exam score of 75. Does that completely describe these data sets?

largest observation – smallest observation
Range The simplest numeric measure of variability is range. Range = largest observation – smallest observation What are the ranges of the three data sets?

The sum of the deviations from the mean will always equal zero!
The most widely used measures of variability are based on how far each observation deviates from the sample mean. These differences are called deviations from the mean: deviation = data value – mean. Be sure to subtract in this order so that data values above the mean have a positive deviation from the mean and data values below the mean have a negative deviation from the mean. For example, with Mean = 75, one data value has Deviation = -25 and another has Deviation = 5. What does a negative deviation tell you about how the data value compares to the mean? If we found each of the deviations from the mean (-25, -15, -5, 5, 15, 25), what is the sum of these deviations? The sum of the deviations from the mean will always equal zero!
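The zero-sum property can be checked with a short sketch. The six scores below are an assumption: they are simply the values that give a mean of 75 and the deviations -25, -15, -5, 5, 15, 25 quoted on the slide.

```python
# Assumed scores: chosen so the mean is 75 and the deviations match the slide.
scores = [50, 60, 70, 80, 90, 100]

mean = sum(scores) / len(scores)
# Subtract the mean from each value (value - mean), so values above
# the mean get a positive deviation and values below get a negative one.
deviations = [x - mean for x in scores]

print(deviations)       # [-25.0, -15.0, -5.0, 5.0, 15.0, 25.0]
print(sum(deviations))  # 0.0
```

With data that are not round numbers, floating-point arithmetic can leave a tiny nonzero remainder, so a tolerance check is safer than testing for exact equality with zero.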

Variance and Standard Deviation
Suppose that we are interested in finding the “typical” or average deviation from the mean. Can we just calculate the arithmetic average of the deviations from the mean? Why or why not? Since the sum of the deviations from the mean is always zero, you cannot just add the deviations and then divide by the number of deviations. What do you do? To calculate the “typical” or average deviation from the mean, we must first square each deviation. Then none of the squared deviations is negative. The deviations from the mean were -25, -15, -5, 5, 15, and 25. The squares of these deviations from the mean are 625, 225, 25, 25, 225, and 625. Now we can average these. Wait a minute . . . If the data values represented the entire population, then we would divide by the number of data values (n). However, more often than not, the data values represent a sample from the population and we divide by (n – 1). Why? If the spread of the population were from 50 to 100, samples would rarely have the same spread. The samples would tend to have a smaller spread (less variability). By dividing by the smaller number n – 1, we get a better estimate of the true “typical” deviation from the mean. The sample variance, s², is the sum of the squared deviations divided by n – 1, and the sample standard deviation, s, is the square root of the sample variance.

The standard deviation is a more natural measure of variability than the variance because it is expressed in the same units as the original data values.

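The data values deviate from the mean of 75, on average, by about 18.7 units (the squared deviations sum to 1750, so the sample variance is 1750 / 5 = 350 and the standard deviation is the square root of 350).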
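The calculation can be sketched as follows. The function name variance, its sample flag, and the six scores (reconstructed from the deviations quoted on the slides) are assumptions made for illustration.

```python
import math


def variance(values, sample=True):
    """Average squared deviation from the mean.

    Divides by n - 1 for a sample and by n for an entire population.
    """
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor


# Assumed scores whose deviations from 75 are -25, -15, -5, 5, 15, 25.
scores = [50, 60, 70, 80, 90, 100]

s2 = variance(scores)   # 1750 / 5 = 350.0
s = math.sqrt(s2)       # about 18.71
print(s2, round(s, 2))
```

Calling variance(scores, sample=False) divides by 6 instead of 5 and gives a smaller value, which is the population version described on the slide.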

Notation to remember
                      Population    Sample
Mean                  μ             x̄
Variance              σ²            s²
Standard deviation    σ             s

Putting it Together Step 1: Select appropriate measures of center and spread (look at the shape of the distribution). Step 2: Compute the values of the selected measures. Step 3: Interpret the values in context.

Since the data distribution is approximately symmetric with no outliers, we should use the mean and standard deviation as the measures of center and variability. The sample mean is the average number of attempts the bats in the sample made to drink from the smooth, metal surface. The sample standard deviation is relatively large compared to the values in the data set, indicating a lot of variability from bat to bat in number of drinking attempts.

Describing Center and Spread For Data Distributions That Are Skewed or Have Outliers
Median Interquartile Range

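Median The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list). Then, if n is odd, the median is the single middle value; if n is even, the median is the average of the two middle values.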
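A minimal sketch of the usual rule for the sample median: order the observations, take the single middle value when n is odd, and average the two middle values when n is even. The function name sample_median and the example values are hypothetical.

```python
def sample_median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)   # repeated values stay in the list
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([7, 3, 5]))     # 5
print(sample_median([7, 3, 5, 9]))  # (5 + 7) / 2 = 6.0
```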

The sample mean can be sensitive to even a single outlier.
Forty students were enrolled in a statistical reasoning course at a California college. The instructor made course materials, grades, and lecture notes available to students on a class web site. Course management software kept track of how often each student accessed any of these web pages. The data set below (in order from smallest to largest) is the number of times each of the 40 students had accessed the class web page during the first month. 3 4 5 7 8 12 13 14 16 18 19 20 21 22 23 26 36 37 42 84 331 Because n = 40 is even, the median is the average of the two middle values, which split the data set into two equal halves with 20 values above and 20 values below. Median = 13 Why is the sample mean so much larger than the sample median? The sample mean can be sensitive to even a single outlier. The median is quite insensitive to outliers.
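The effect of a single outlier can be illustrated with a small sketch. The counts below are hypothetical (they are not the 40 class values); only the extreme value 331 is borrowed from the example. The functions mean and median come from Python's standard statistics module.

```python
from statistics import mean, median

# Hypothetical visit counts, chosen only for illustration.
visits = [3, 4, 5, 7, 8]
with_outlier = visits + [331]   # add one extreme value

print(mean(visits), median(visits))              # mean 5.4, median 5
print(mean(with_outlier), median(with_outlier))  # mean jumps to about 59.7, median only to 6.0
```

One extreme value multiplies the mean by more than ten, while the median moves by a single unit.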

Interquartile Range The sample standard deviation, s, can also be greatly affected by the presence of even one outlier. The interquartile range is a measure of variability that is resistant to the effects of outliers. The interquartile range (iqr) is based on quantities called quartiles, which divide the data set into four equal parts (quarters). Lower quartile (Q1) = the median of the lower half of the data. Upper quartile (Q3) = the median of the upper half of the data. If n is odd, the median of the entire data set is excluded from both halves when computing quartiles. iqr = Q3 – Q1

How many data values are in the first quarter of the data set?
The interquartile range also measures how spread out the middle half of the data set is. If the interquartile range is small, then the middle half of the data set is tightly clustered together. If the interquartile range is large, then the middle half of the data set is more spread out. Recall the website data set: 3 4 5 7 8 12 13 14 16 18 19 20 21 22 23 26 36 37 42 84 331 The lower quartile (Q1) is the median of the lower 20 data values: Q1 = 4.5. The median is 13. The upper quartile (Q3) is the median of the upper 20 data values: Q3 = 20.5. How many data values are in the first quarter of the data set? . . . in the second quarter of the data set? . . . in the third quarter of the data set? . . . in the fourth quarter of the data set? The interquartile range (iqr) is the difference between the upper and lower quartiles. iqr = 20.5 – 4.5 = 16 Since the interquartile range of 16 is relatively large, the middle half of the data set for the number of visits to the class website is fairly spread out.
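The quartile rule used here (median of the lower half and median of the upper half, leaving out the overall median when n is odd) can be sketched as below. The function name quartiles and the example values are hypothetical. Statistical software often uses other quartile conventions, so library results may differ slightly from this rule.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]    # smallest n // 2 values
    upper = ordered[-half:]   # largest n // 2 values (middle value skipped if n is odd)
    return median(lower), median(upper)


# Hypothetical data, used only for illustration.
data = [2, 4, 6, 8, 10, 12, 14, 16]
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)  # 5.0 13.0 8.0
```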

Putting it Together The Chronicle of Higher Education (Almanac Issue, ) published the accompanying data on the percentage of the population with a bachelor’s degree or graduate degree in 2007 for each of the 50 U.S. states and the District of Columbia. The data distribution is shown in the histogram below. Step 1: Select We will use the median and interquartile range as measures of center and spread for this data distribution since it is skewed and has an outlier.

Putting it Together Step 2: Calculations median = 26 iqr = 6
Step 3: Interpret The median for this data set is 26. For half the states, the percentage of the population with a bachelor’s or graduate degree is 26% or less. For the other half of the states, 26% or more of the population have a bachelor's or graduate degree. The interquartile range of 6 indicates that the middle half of the data is spread out over an interval of 6 percentage points.

Boxplots General Boxplots Modified Boxplots

A boxplot is a graph of the five-number summary.
The five-number summary consists of the following: the smallest observation in the data set (minimum), the lower quartile (Q1), the median, the upper quartile (Q3), and the largest observation in the data set (maximum). A boxplot is a graph of the five-number summary.
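A five-number summary can be computed with a short sketch like the one below, which assumes the quartiles are the medians of the lower and upper halves of the ordered data. The function name five_number_summary and the scores are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two observations")
    half = len(ordered) // 2
    q1 = median(ordered[:half])    # median of the lower half
    q3 = median(ordered[-half:])   # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]


# Hypothetical scores, used only for illustration.
print(five_number_summary([50, 60, 70, 80, 90, 100]))  # (50, 60, 75.0, 90, 100)
```

These five values are exactly what is needed to place the box, the median line, and the whisker ends of a boxplot.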

Boxplots
When to use: univariate numerical data.
How to construct:
1. Compute the values in the five-number summary.
2. Draw a horizontal line and add an appropriate scale.
3. Draw a box above the line that extends from the lower quartile (Q1) to the upper quartile (Q3).
4. Draw a line segment inside the box at the location of the median.
5. Draw two line segments, called whiskers, which extend from the box to the smallest observation and from the box to the largest observation.
What to look for: center, spread, and shape of the data distribution, and whether there are any unusual features.

Draw a box from the lower quartile to the upper quartile.
The authors of the paper “Striatal Volume Predicts Level of Video Game Skill Acquisition” (Cerebral Cortex [2010]: ) studied a number of factors that affect performance in a complex video game. One factor was practice strategy. Forty college students who had never played the game Space Fortress were assigned at random to one of two groups: 1) told to improve total score or 2) told to improve a different aspect, such as speed. The data distribution for the first group (improving total score) is shown in the dotplot below, along with the median and the lower and upper quartiles. We already have a horizontal line and an appropriate scale. Draw a box from the lower quartile to the upper quartile. Draw a line inside the box for the median. Draw line segments for the whiskers.

Comparative Boxplots A comparative boxplot is two or more boxplots drawn on the same numerical scale. Recall the video game study. There were two groups: 1) told to improve total score or 2) told to improve a different aspect, such as speed. The improvement scores for the first group are more consistent than the improvement scores for the second group. Both distributions are approximately symmetric. However, the median improvement score for the second group is much larger than the median improvement score for the first group.

Outliers An observation is an outlier if it is . . .
greater than upper quartile + 1.5(iqr) or less than lower quartile – 1.5(iqr). A modified boxplot is a boxplot that shows outliers.

Modified boxplots
How to construct:
1. Compute the values in the five-number summary.
2. Draw a horizontal line and add an appropriate scale.
3. Draw a box above the line that extends from the lower quartile (Q1) to the upper quartile (Q3).
4. Draw a line segment inside the box at the location of the median.
5. Determine if there are any outliers in the data set.
6. Add whiskers that extend from the box to the smallest observation that is not an outlier and to the largest observation that is not an outlier.
7. If there are outliers, add dots to the plot to indicate the positions of the outliers.

Big Mac prices in U.S. dollars for 44 different countries were given in the article “Big Mac Index 2010”. The following 44 Big Mac prices are arranged in order from the lowest price (Ukraine) to the highest price (Norway). 1.84 1.86 1.90 1.95 2.17 2.19 2.28 2.33 2.34 2.45 2.46 2.50 2.51 2.60 2.62 2.67 2.71 2.80 2.82 2.99 3.08 3.33 3.34 3.43 3.48 3.54 3.56 3.59 3.67 3.73 3.74 3.83 3.84 3.86 3.89 4.00 4.33 4.39 4.90 4.91 6.19 6.56 7.20 Compute the five-number summary. Smallest observation = 1.84 Lower quartile = 2.455 Median = 3.205 (the average of the two middle values, 3.08 and 3.33) Upper quartile = 3.835 Largest observation = 7.20 Check if there are any outliers. iqr = 3.835 – 2.455 = 1.38 Lower cutoff: 2.455 – 1.5(1.38) = 0.385 Upper cutoff: 3.835 + 1.5(1.38) = 5.905 There are no outliers on the lower end of the data set, but there are three outliers on the upper end: 6.19, 6.56, 7.20.
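The outlier check can be sketched in a few lines. The prices are the values as they appear in the list above, and the quartiles are taken as stated on the slide (2.455 and 3.835) rather than recomputed; the variable names are assumptions for illustration.

```python
# Big Mac prices as listed above.
prices = [
    1.84, 1.86, 1.90, 1.95, 2.17, 2.19, 2.28, 2.33, 2.34, 2.45,
    2.46, 2.50, 2.51, 2.60, 2.62, 2.67, 2.71, 2.80, 2.82, 2.99,
    3.08, 3.33, 3.34, 3.43, 3.48, 3.54, 3.56, 3.59, 3.67, 3.73,
    3.74, 3.83, 3.84, 3.86, 3.89, 4.00, 4.33, 4.39, 4.90, 4.91,
    6.19, 6.56, 7.20,
]

# Quartiles as stated on the slide (not recomputed here).
q1, q3 = 2.455, 3.835
iqr = q3 - q1                      # about 1.38

lower_cutoff = q1 - 1.5 * iqr      # about 0.385
upper_cutoff = q3 + 1.5 * iqr      # about 5.905

outliers = [p for p in prices if p < lower_cutoff or p > upper_cutoff]
non_outliers = [p for p in prices if lower_cutoff <= p <= upper_cutoff]

print(outliers)                                # [6.19, 6.56, 7.2]
print(min(non_outliers), max(non_outliers))    # whisker ends: 1.84 4.91
```

In a modified boxplot the whiskers would therefore stop at 1.84 and 4.91, with the three outliers drawn as separate dots.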

Big Mac Prices Continued . . .
Smallest observation = 1.84 Lower quartile = 2.455 Median = 3.205 Upper quartile = 3.835 Largest observation = 7.20 Draw a horizontal line with an appropriate scale. Draw the box from Q1 to Q3 and add a line at the median. Draw the whiskers from the box to the smallest and largest observations that are not outliers. Add dots for the outliers. Interpret the graph. The typical or median price for a Big Mac is $3.205 and the interquartile range of the prices is $1.38. There is quite a bit of variability in the Big Mac prices. There are three outliers at $6.19, $6.56, and $7.20. The distribution of prices is skewed right. How does the mean price of Big Macs compare to the median price of $3.205?

Discuss the similarities and differences.
The salaries of NBA players published on the web site hoopshype.com were used to construct the comparative boxplot of salary data for five teams. Discuss the similarities and differences. See page 198 for more information.

Measures of Relative Standing
z-scores Percentiles

z-scores When you obtain your score after taking a test, you probably want to know how it compares to the scores of others. Is your score above or below the mean, and by how much? Does your score place you among the top 5% of the class or only among the top 25%? Answering these questions involves measuring the position, or relative standing, of a particular value in a data set. One measure of relative standing is a z-score. The z-score corresponding to a particular data value is z-score = (data value – mean) / (standard deviation). The z-score tells you how many standard deviations the data value is from the mean. The process of subtracting the mean and then dividing by the standard deviation is sometimes referred to as standardization. A z-score is one example of a standardized score.

What do these z-scores mean?
z-score = -2.3: The data value is 2.3 standard deviations below the mean. z-score = 1.8: The data value is 1.8 standard deviations above the mean.

Which is the better offer?
Suppose that two graduating seniors, one a marketing major and one an accounting major, are comparing job offers. The accounting major has an offer for $45,000 per year, and the marketing major has an offer for $43,000 per year. Accounting: mean = 46,000, standard deviation = 1,500. Marketing: mean = 42,500, standard deviation = 1,000. Which is the better offer? Z-scores help us to compare these two offers. Accounting z-score = (45,000 – 46,000) / 1,500 = -0.67. Marketing z-score = (43,000 – 42,500) / 1,000 = 0.50. Relative to their distributions, the marketing offer is actually more attractive than the accounting offer, even though the marketing major may not be happy about the lower dollar amount!
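The comparison of the two offers can be sketched with a small helper that standardizes a value. The function name z_score is an assumption; the salaries, means, and standard deviations are the ones given in the example.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


accounting = z_score(45_000, 46_000, 1_500)   # about -0.67
marketing = z_score(43_000, 42_500, 1_000)    # 0.5

print(round(accounting, 2), round(marketing, 2))
print("marketing" if marketing > accounting else "accounting")  # offer with the higher z-score
```

The accounting offer sits below the mean for its field and the marketing offer sits above the mean for its field, which is why the smaller dollar amount is the better relative offer.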

Empirical Rule The z-score is particularly useful when the data distribution is mound shaped and approximately symmetric. If the data distribution is mound shaped and approximately symmetric, then . . . Approximately 68% of the observations are within 1 standard deviation of the mean. Approximately 95% of the observations are within 2 standard deviations of the mean. Approximately 99.7% of the observations are within 3 standard deviations of the mean.

Empirical Rule This illustrates the percentages given by the Empirical Rule.

Remember the Empirical Rule is only an approximation of the actual percentages, but it is a good estimate as long as the data distribution is mound shaped and approximately symmetric.
Within 1 standard deviation of the mean: actual 72.1%, Empirical Rule approximately 68%
Within 2 standard deviations of the mean: actual 96.2%, Empirical Rule approximately 95%
Within 3 standard deviations of the mean: actual 99.2%, Empirical Rule approximately 99.7%
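The Empirical Rule percentages match what a normal curve would give, which is one common model for a mound-shaped, symmetric distribution. As a sketch under that assumption (the slides do not name the normal distribution here), the standard library class statistics.NormalDist can compute the share of a normal distribution within k standard deviations of its mean; the helper name share_within is hypothetical.

```python
from statistics import NormalDist


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist()   # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))   # roughly 68.3, 95.4, 99.7
```

Real data sets, like the one in the comparison above, will usually differ from these values by a few percentage points.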

Percentiles For a number r between 0 and 100, the rth percentile is a value such that r percent of the observations fall AT or BELOW that value. This diagram illustrates the 90th percentile.
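The definition can be checked numerically: a value is at the rth percentile when r percent of the observations fall at or below it. The helper below is a minimal sketch; its name percent_at_or_below and the data are hypothetical.

```python
def percent_at_or_below(values, cutoff):
    """Percentage of observations that are less than or equal to cutoff."""
    if not values:
        raise ValueError("need at least one observation")
    count = sum(1 for x in values if x <= cutoff)
    return 100 * count / len(values)


# Hypothetical data: the whole numbers 1 through 20.
data = list(range(1, 21))
print(percent_at_or_below(data, 18))   # 90.0, so 18 is at the 90th percentile
```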

What value of head circumference is at the 75th percentile? 37.0 cm
In addition to weight and length, head circumference is another measure of health in newborn babies. The National Center for Health Statistics reports the following summary values for head circumference (in cm) at birth for boys.
Percentile:               5     10    25    50    75    90    95
Head circumference (cm):  32.2  33.2  34.5  35.8  37.0  38.2  38.6
What value of head circumference is at the 75th percentile? 37.0 cm What is the median value of head circumference? 35.8 cm

Common Mistakes

Avoid these Common Mistakes
Watch out for categorical data that look numerical! Often, categorical data is coded numerically. For example, gender might be coded as 0 = female and 1 = male, but this does not make gender a numerical variable. Categorical data CANNOT be summarized using the mean and standard deviation or the median and interquartile range.

Measures of center don’t tell all. Although measures of center, such as the mean and the median, do give you a sense of what might be a typical value for a variable, this is only one characteristic of a data set. Without additional information about variability and distribution shape, you don’t really know much about the behavior of the variable.

Data distributions with different shapes can have the same mean and standard deviation. For example, consider the following two histograms: Both histograms have the same mean of 10 and standard deviation of 2, but VERY different shapes.

Both the mean and the standard deviation are sensitive to extreme values in a data set, especially if the sample size is small. If the data distribution is markedly skewed or if the data set has outliers, the median and interquartile range are a better choice for describing center and spread.

Measures of center and measures of variability describe values of a variable, not frequencies in a frequency distribution or heights of bars in a histogram. For example, consider the following two frequency distributions and histograms: Distribution A has a larger standard deviation even though the frequencies are equal.

Be careful with boxplots based on small sample sizes. Boxplots convey information about center, variability, and shape, but interpreting shape information is problematic when the sample size is small.

Not all distributions are mound shaped. Using the Empirical Rule in situations where you are not convinced that the data distribution is mound shaped and approximately symmetric can lead to incorrect statements.

Watch for outliers! Unusual observations in a data set often provide important information about the variable under study, so it is important to consider outliers in addition to describing what is typical. Outliers can also be problematic because the values of some summaries are influenced by outliers and because some methods for drawing conclusions from data are not appropriate if the data set has outliers.
