2. Numerical descriptors

Presentation on theme: "2. Numerical descriptors"— Presentation transcript:

1 2. Numerical descriptors
The Practice of Statistics in the Life Sciences Third Edition © 2014 W.H. Freeman and Company

2 Objectives (PSLS Chapter 2)
Describing distributions with numbers
- Measure of center: mean and median
- Measure of spread: quartiles and standard deviation
- The five-number summary and boxplots
- IQR and outliers
- Dealing with outliers
- Choosing among summary statistics
- Organizing a statistical problem

3 Measure of center: the mean
The mean, or arithmetic average: to calculate the mean of a data set, add all values, then divide by the number of individuals, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. It is the “center of mass” of the distribution.

4 Measure of center: the median
The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.
1) Sort the observations from smallest to largest (n = number of observations).
2) The location of the median is (n + 1)/2 in the sorted list.
- If n is odd, the median is the value of the center observation.
- If n is even, the median is the mean of the two center observations.
Examples (data sorted from small to large; in the slide's table, the grey column has the ranks and the orange column the data points):
- n = 25: (n + 1)/2 = 13, so the median is observation #13, median = 3.4
- n = 24: (n + 1)/2 = 12.5, so the median is the mean of observations #12 and #13, median = 3.35
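The location rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the book; the function name `median` and the sample values are made up for the example.

```python
def median(values):
    """Median via the (n + 1)/2 location rule on the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        # n odd: location (n + 1)/2 is a whole number, the center observation
        return ordered[mid]
    # n even: location (n + 1)/2 falls between two ranks, so average them
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [7, 1, 5, 3, 9]   # hypothetical data, n = 5
even_sample = [7, 1, 5, 3]     # hypothetical data, n = 4
print(median(odd_sample))      # center observation of 1, 3, 5, 7, 9
print(median(even_sample))     # mean of the two center observations 3 and 5
```

Sorting first matters: the rule gives a location in the ordered list, not in the order the data were recorded.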

5 Comparing the mean and the median
The median is a measure of center that is resistant to skew and outliers. The mean is not. The mean and the median are (approximately) the same only if the distribution is symmetric. In a skewed distribution the mean is pulled toward the long tail: below the median for a left skew, above it for a right skew.
The mean is not resistant to skew and outliers because the mean is computed using ALL the numerical values in the data set. The median only requires finding the middle value, and thus is not directly affected by values on the edges of the distribution.
[Figure: positions of the mean and median for a symmetric distribution, a left-skewed distribution, and a right-skewed distribution.]
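A small numerical check can make the idea of resistance concrete. The sketch below uses Python's standard `statistics` module; the data values are invented for illustration and do not come from the book.

```python
from statistics import mean, median

# Hypothetical, roughly symmetric data set
typical = [2, 3, 3, 4, 5, 5, 6]
# Same data with one extreme high value added
with_outlier = typical + [60]

mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)

print("mean:  ", mean(typical), "->", mean(with_outlier))
print("median:", median(typical), "->", median(with_outlier))
print("shift in mean:", mean_shift, " shift in median:", median_shift)
```

In this made-up example the single extreme value moves the mean much more than the median, which is the behavior the slide describes.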

6 The median laughter group size
A study of freely forming groups in bars all over Europe recorded the group size (number of individuals in the group) of all 501 groups in the study that were naturally laughing.
The median laughter group size is A) 2 B) 2.5 C) 3 D) 3.5 E) 4
Answer: the median laughter group size is 2. There are 501 groups, so the median is at location (501 + 1)/2 = 251 in the ordered list, and that group is in the first column (group size 2).
The average laughter group size is A) smaller than the median. B) about the same as the median. C) larger than the median.
Answer: the mean is larger than the median. The data show a strong right skew, which pulls the mean up but does not affect the median. [Note that the mean is about 2.72.]

7 Measure of spread: quartiles
The first quartile, Q1, is the median of the values below the median in the sorted data set. The third quartile, Q3, is the median of the values above the median in the sorted data set.
In the example: Q1 = first quartile = 2.2, M = median = 3.4, Q3 = third quartile = 4.35.
Different technology platforms may use slightly different definitions for the quartiles, so don’t be surprised if you get a different answer when using technology, or even from one software package to another.

8 How fast do skin wounds heal?
The skin healing rates of 18 newts were measured in micrometers per hour. What are the median and the quartiles of the sorted data?
With n = 18, the location of the median is (n + 1)/2 = 9.5. So, the median is the midpoint of values #9 and #10 in the sorted list: 26 and 27, respectively. Therefore, median = 26.5 micrometers per hour.
The first quartile is the median of the points below the median, so points #1 through #9; this corresponds to location #5. Therefore, Q1 = 22 micrometers per hour.
The third quartile is the median of the points above the median in the sorted list, so points #10 through #18; this corresponds to location #14. Therefore, Q3 = 33 micrometers per hour.
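The quartile procedure just described can be sketched in Python. The data table did not survive in this transcript, so the 18 values below are an assumption: illustrative numbers chosen to agree with the answers stated on the slide (values #9 and #10 are 26 and 27, value #5 is 22, value #14 is 33). The function names are hypothetical.

```python
def median_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves definition."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]           # points below the median location
    upper = ordered[half + n % 2:]   # points above it (center point skipped if n is odd)
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)


# Assumed illustrative healing rates (micrometers per hour), n = 18
rates = [29, 27, 34, 40, 22, 28, 14, 35, 26, 35, 12, 30, 23, 18, 11, 22, 23, 33]
q1, m, q3 = quartiles(rates)
print(q1, m, q3)
```

Library routines for quantiles often interpolate between ranks, so they may return slightly different quartiles than this median-of-halves version.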

9 Measure of spread: standard deviation
The standard deviation is used to describe the variation around the mean. It measures spread by looking at how far the observations are from their mean. To get the standard deviation of a SAMPLE of data:
1) Calculate the variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
2) Take the square root to get the standard deviation $s = \sqrt{s^2}$
Although variance is a useful measure of spread, its units are units squared. The standard deviation (square root of the variance) is more intuitive, because it has the same units as the raw data and the mean.
The following is for your information only, and is not discussed in the book. Why do we divide by n – 1 instead of n? We are dividing by the number of independent pieces of information that go into the estimate of a parameter. This number is called the degrees of freedom (df), and it is equal to the number of independent scores that go into the estimate minus the number of parameters estimated as intermediate steps in the estimation of the parameter itself.
But why the term “degrees of freedom”? When we calculate the variance of a random sample, we must first calculate the mean of that sample and then compute the sum of the squared deviations from that mean. While there will be n such squared deviations, only (n – 1) of them are, in fact, free to assume any value whatsoever. This is because the final squared deviation from the mean must include the one value of X such that the sum of all the Xs divided by n will equal the obtained mean of the sample. All of the other (n – 1) squared deviations from the mean can, theoretically, have any values whatsoever. For these reasons, the sample variance is said to have only (n – 1) degrees of freedom.
Learn how to obtain the standard deviation of a sample using technology.

10 A person’s metabolic rate is the rate at which the body consumes energy.
Find the mean and standard deviation for the metabolic rates of a sample of 7 men (in kilocalories, Cal, per 24 hours).
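One way to carry out this calculation with technology is sketched below in Python. The data table is missing from this transcript, so the seven values are an assumption used only for illustration, and the function name `sample_sd` is hypothetical. The computation follows the two steps for a sample: variance with an n – 1 divisor, then the square root.

```python
from math import sqrt
from statistics import stdev


def sample_sd(values):
    """Sample standard deviation: square root of the variance with n - 1 divisor."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(values) / n
    variance = sum((x - xbar) ** 2 for x in values) / (n - 1)
    return sqrt(variance)


# Assumed illustrative metabolic rates (Cal per 24 hours), n = 7
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

xbar = sum(rates) / len(rates)
s = sample_sd(rates)
print(f"mean = {xbar:.1f} Cal, s = {s:.2f} Cal")

# The standard library's stdev also uses the n - 1 divisor
print(f"statistics.stdev gives {stdev(rates):.2f}")
```

Both results are in the same units as the data (Cal per 24 hours), which is why the standard deviation is easier to interpret than the variance.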

11 Center and spread in boxplots
Five-number summary: min, Q1, M, Q3, max. In the example: min = 0.6, Q1 = 2.2, median = 3.4, Q3 = 4.35, max = 6.1. Boxplots are sometimes also called “box-and-whiskers” plots.
[Figure: boxplot of the data with the five-number summary marked.]

12 IQR and suspected outliers
The interquartile range (IQR) is the distance between the first and third quartiles (the length of the box in the boxplot): IQR = Q3 – Q1.
An outlier is an individual value that falls outside the overall pattern. How far outside the overall pattern does a value have to fall to be considered a suspected outlier?
- Suspected low outlier: any value < Q1 – 1.5 × IQR
- Suspected high outlier: any value > Q3 + 1.5 × IQR

13 Distance to Q3: 7.9 – 4.35 = 3.55; interquartile range: Q3 – Q1 = 4.35 – 2.2 = 2.15
Individual #25 has a survival of 7.9 years, which is 3.55 years above the third quartile (Q3 = 4.35). This is more than 1.5 × IQR = 1.5 × 2.15 = 3.225 years, so individual #25 is a suspected outlier. Some software programs create “modified boxplots,” in which suspected outliers (according to the 1.5 × IQR rule) are displayed by a star or asterisk. The “whiskers” then only extend to the next value in the sorted list.
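The 1.5 × IQR rule can be sketched as a short Python check, using the quartiles from this example (Q1 = 2.2, Q3 = 4.35) and the values 7.9 and 6.1 from the slides. The function names are hypothetical, chosen for this illustration.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (low, high) cutoffs of the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_suspected_outlier(x, q1, q3, k=1.5):
    """True if x lies below the low cutoff or above the high cutoff."""
    low, high = outlier_fences(q1, q3, k)
    return x < low or x > high


q1, q3 = 2.2, 4.35                       # quartiles from the survival example
low, high = outlier_fences(q1, q3)
print(f"cutoffs: {low:.3f} and {high:.3f}")
print(is_suspected_outlier(7.9, q1, q3))  # individual #25
print(is_suspected_outlier(6.1, q1, q3))  # the max shown in the earlier boxplot
```

Comparing a value with Q3 + 1.5 × IQR is the same test as asking whether its distance above Q3 exceeds 1.5 × IQR, which is how the slide phrases it.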

14 Unusual individual or typo?
Anonymous class survey: weight (lbs) and height (in) were used to compute BMI. A modified boxplot helps distinguish between points that are part of a skewed pattern and the presence of an outlier. Three of the values are close to the rest of the pattern and appear to simply be part of the skew. The largest value is clearly an outlier, far from the rest of the data.
Height  Weight  Sex   BMI
60      230     Male  44.9
Unusual individual or typo?
- The height of 60 in is the shortest for men.
- The weight of 230 lbs is almost the heaviest.

15 Dealing with outliers
What should you do if you find outliers in your data? It depends in part on what kind of outliers they are:
- Human error in recording information
- Human error in experimentation or data collection
- Unexplainable but apparently legitimate wild observations
  - Are you interested in ALL individuals?
  - Are you interested only in typical individuals?
Don’t discard outliers just to make your data look better, and don’t act as if they did not exist. Refer to the Chapter 2 discussion on the treatment of outliers.

16 Choosing among summary statistics
Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don’t have outliers. In that case, plot the mean and use the standard deviation for error bars. Otherwise, use the median and the five-number summary, which can be plotted as a boxplot.
[Figure: the same data shown as a boxplot and as mean ± s.d.]

17 Deep-sea sediments
Phytopigment concentrations in deep-sea sediments collected worldwide show a very strong right skew. Two summary values are reported in grams per square meter of bottom surface (one of them is 0.015). Which of the two values is the mean and which is the median? Which would be a better summary statistic for these data?
We know that the mean is not a robust measure of center, and that it is influenced by skew and outliers. Because the data are strongly right-skewed, we expect the mean to be larger than the median. Therefore, the larger of the two values is the mean and the smaller is the median. Given the skewed nature of these data, the median would probably be a better summary statistic (depending on intended use).

18 What summary statistics would you use for each of these two variables?
Researchers grafted human cancerous cells onto 20 healthy adult mice. Then 10 of the mice were injected with tumor-specific antibodies (anti-CD47) while the other 10 mice were not (IgG). The raw data table records two variables for each mouse. What summary statistics would you use for each of them?
- Presence of metastases: This is a categorical variable. Compute the count of mice with metastases (10 versus 1) or the proportion of mice with metastases (1 versus 0.1) for each group.
- Number of metastases: This is a quantitative variable. Compute the mean and standard deviation of the number of metastases for each group (2.4 and 0.97 versus 0.1 and 0.32). The five-number summary can be computed as well but, with just 10 values in each group, it would not summarize the data much.

19 Organizing a statistical problem
1. State: What is the practical question, in the context of a real-world setting?
2. Plan: What specific statistical operations does this problem call for?
3. Solve: Make the graphs and carry out the calculations needed for this problem.
4. Conclude: Give your practical conclusion in the real-world setting.

