# Lecture 4 Chapter 2. Numerical descriptors


Objectives (PSLS Chapter 2)
Describing distributions with numbers
- Measure of center: mean and median (Meas. Cent. Award)
- Measure of spread: quartiles, standard deviation, IQR (Meas. Var. Award)
- The five-number summary and boxplots (SUMS Award)
- Dealing with outliers (Outliers Award)
- Choosing among summary statistics (All Numeric Awards)
- Organizing a statistical problem (Foundational)

Measure of center: the mean
The mean, or arithmetic average. To calculate the average (mean) of a data set, add all values, then divide by the number of individuals. It is the “center of mass.”

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Here n is the sample size and x is the variable.
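As a quick illustration of "add all values, then divide by the number of individuals," here is a minimal Python sketch. The function name and the sample values are made up for illustration and do not come from the lecture.

```python
def mean(values):
    """Arithmetic mean: the sum of all values divided by how many there are."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical data, for illustration only
sample = [2.0, 3.0, 4.0, 7.0]
print(mean(sample))  # (2 + 3 + 4 + 7) / 4 = 4.0
```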

Measure of center: the median
The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.

1) Sort the observations from smallest to largest (n = number of observations).
2) The location of the median is (n + 1)/2 in the sorted list.

If n is odd, the median is the value of the center observation. If n is even, the median is the mean of the two center observations.

Example (the data are sorted from small to large; in the slide's table the grey column has the ranks and the orange column the data points):
n = 25: (n + 1)/2 = 13, so the median is the 13th value. Median = 3.4
n = 24: (n + 1)/2 = 12.5, so the median is the mean of the 12th and 13th values. Median = 3.35
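The (n + 1)/2 location rule can be sketched in Python as follows. This is only one way to write it; the function name and the two small data sets are hypothetical, chosen to show the odd and even cases.

```python
def median(values):
    """Median found with the (n + 1)/2 location rule on the sorted data."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2  # 0-based index of the (upper) center value
    if n % 2 == 1:
        # n odd: location (n + 1)/2 is a whole number, the center observation
        return data[mid]
    # n even: location (n + 1)/2 ends in .5, so average the two center values
    return (data[mid - 1] + data[mid]) / 2


# Hypothetical data, for illustration only
print(median([7, 1, 5, 3, 9]))  # sorted: 1 3 5 7 9, location 3, median 5
print(median([7, 1, 5, 3]))     # sorted: 1 3 5 7, location 2.5, median 4.0
```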

Comparing the mean and the median
The median is a measure of center that is resistant to skew and outliers. The mean is not.

[Figure: mean and median for a symmetric distribution, and for left-skewed and right-skewed distributions]

The mean and the median are (approximately) the same only if the distribution is symmetric. In a skewed distribution, the mean is pulled toward the long tail.

The mean is not resistant to skew and outliers because it is computed using ALL the numerical values in the data set. The median only requires finding the middle value, and thus is not directly affected by values on the edges of the distribution.
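A small numerical experiment makes the resistance of the median concrete. The sketch below uses Python's standard `statistics` module; both data sets are invented for illustration, and the second simply replaces the largest value of the first with an extreme one.

```python
import statistics

# Hypothetical data, for illustration only
symmetric = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]  # same data, largest value replaced by an outlier

# Symmetric data: the mean and the median agree (both equal 3)
print(statistics.mean(symmetric), statistics.median(symmetric))

# With the outlier: the mean jumps to 12, the median stays at 3
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```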

Measure of spread: quartiles
The first quartile, Q1, is the median of the values below the median in the sorted data set. The third quartile, Q3, is the median of the values above the median in the sorted data set.

In the example: Q1 = first quartile = 2.2, M = median = 3.4, Q3 = third quartile = 4.35.

You should know that different technology platforms may use slightly different definitions for the quartiles. So don’t be surprised if you get a different answer when using technology, or even from one software package to another.

How fast do skin wounds heal?
Here are the skin healing rate data from 18 newts, measured in micrometers per hour. What are the median and the quartiles of the sorted data?

With n = 18, the location of the median is (n + 1)/2 = 9.5. So the median is the midpoint of values #9 and #10 in the sorted list: 26 and 27, respectively. Therefore, median = 26.5 micrometers per hour.

The first quartile is the median of the points below the median, so points #1 through #9; this corresponds to location #5. Therefore, Q1 = 22 micrometers per hour.

The third quartile is the median of the points above the median in the sorted list, so points #10 through #18; this corresponds to location #14. Therefore, Q3 = 33 micrometers per hour.
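The newt calculation can be reproduced with a short Python sketch of the "median of each half" rule. The function names are illustrative. The data list did not survive in this transcript, so the 18 values below are stand-ins chosen only to be consistent with the stated results (values #9 and #10 are 26 and 27, value #5 is 22, value #14 is 33). They should not be read as the original measurements.

```python
def median_sorted(data):
    """Median of an already sorted, non-empty list."""
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule from the lecture.

    When n is odd, the median itself is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = data[:half]                                   # values below the median
    upper = data[half:] if n % 2 == 0 else data[half + 1:]  # values above it
    return median_sorted(lower), median_sorted(data), median_sorted(upper)


# Stand-in values (assumption): consistent with the slide's results only
newts = [11, 12, 14, 18, 22, 22, 23, 23, 26,
         27, 28, 29, 30, 33, 34, 35, 35, 40]
q1, m, q3 = quartiles(newts)
print(q1, m, q3)  # 22 26.5 33
```

Library routines such as `statistics.quantiles` use interpolation-based definitions, so they may return slightly different quartiles for the same data.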

Measure of spread: standard deviation
The standard deviation is used to describe the variation around the mean. To get the standard deviation of a SAMPLE of data:

1) Calculate the variance s²:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

2) Take the square root to get the standard deviation s:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

Standard deviation measures spread by looking at how far the observations are from their mean. Although variance is a useful measure of spread, its units are units squared. The standard deviation (square root of the variance) is more intuitive, because it has the same units as the raw data and the mean.

The following is for your information only, and is not discussed in the book. Why do we divide by n − 1 instead of n? We are dividing by the number of independent pieces of information that go into the estimate of a parameter. This number is called the degrees of freedom (df), and it is equal to the number of independent scores that go into the estimate minus the number of parameters estimated as intermediate steps in the estimation of the parameter itself.

But why the term “degrees of freedom”? When we calculate the variance of a random sample, we must first calculate the mean of that sample and then compute the sum of the squared deviations from that mean. While there are n such squared deviations, only (n − 1) of them are, in fact, free to assume any value whatsoever. Once the mean is fixed, the final value of X is determined by the other (n − 1) values, because the sum of all the Xs divided by n must equal the obtained mean of the sample. All of the other (n − 1) squared deviations from the mean can, theoretically, have any values whatsoever. For these reasons, the sample variance is said to have only (n − 1) degrees of freedom.

Learn how to obtain the standard deviation of a sample using a spreadsheet.

A person’s metabolic rate is the rate at which the body consumes energy.
Find the mean and standard deviation for the metabolic rates of a sample of 7 men (in kilocalories, Cal, per 24 hours).

Center and spread in boxplots
Five-number summary: min, Q1, M, Q3, max. Boxplots are sometimes also called “box-and-whiskers” plots.

In the boxplot shown: min = 0.6, Q1 = 2.2, median = 3.4, Q3 = 4.35, max = 6.1.

Boxplots and skewed data
Boxplots show symmetry or skew.

[Figure: boxplots for a symmetric and a right-skewed distribution]

IQR and outliers
The interquartile range (IQR) is the distance between the first and third quartiles (the length of the box in the boxplot): IQR = Q3 – Q1.

An outlier is an individual value that falls outside the overall pattern. How far outside the overall pattern does a value have to fall to be considered a suspected outlier?

Suspected low outlier: any value < Q1 – 1.5 × IQR
Suspected high outlier: any value > Q3 + 1.5 × IQR

Some software programs create “modified boxplots,” in which suspected outliers (according to the 1.5 × IQR rule) are displayed by a star or asterisk. The “whiskers” then only extend to the next value in the sorted list that is not a suspected outlier.

Example: Q1 = 2.2 and Q3 = 4.35, so the interquartile range is Q3 – Q1 = 2.15. Individual #25 has a survival of 7.9 years, which is 7.9 – 4.35 = 3.55 years above the third quartile. This is more than 1.5 × IQR = 3.225 years, so individual #25 is a suspected outlier.
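The 1.5 × IQR rule is easy to check by hand or in code. The Python sketch below uses hypothetical function names and the quartiles quoted in the example (Q1 = 2.2, Q3 = 4.35) to test whether a survival of 7.9 years is a suspected outlier.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (low, high) cutoffs: Q1 - k * IQR and Q3 + k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_suspected_outlier(x, q1, q3, k=1.5):
    """True if x lies below the low cutoff or above the high cutoff."""
    low, high = outlier_fences(q1, q3, k)
    return x < low or x > high


# Quartiles taken from the example in the lecture
q1, q3 = 2.2, 4.35
low, high = outlier_fences(q1, q3)
print(round(low, 3), round(high, 3))       # cutoffs near -1.025 and 7.575
print(is_suspected_outlier(7.9, q1, q3))   # True: 7.9 is above the high cutoff
print(is_suspected_outlier(6.1, q1, q3))   # False: 6.1 is inside the cutoffs
```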

Dealing with outliers: Baldi and Moore’s Suggestions
What should you do if you find outliers in your data? It depends in part on what kind of outliers they are:
- Human error in recording information
- Human error in experimentation or data collection
- Unexplainable but apparently legitimate wild observations

Are you interested in ALL individuals? Are you interested only in typical individuals?

Learn: does the outlier tell you something interesting about biology?

Don’t discard outliers just to make your data look better, and don’t act as if they did not exist. Refer to the Chapter 2 discussion on the treatment of outliers.

Choosing among summary statistics: B & M
Because the mean is not resistant to outliers or skew, it is often used to describe distributions that are fairly symmetrical and don’t have outliers. In that case, plot the mean and use the standard deviation for error bars. Otherwise, use the median and the five-number summary, which can be plotted as a boxplot.

Describe a distribution with its S.U.M.S. (shape, unusual points, middle, and spread).

[Figure: a boxplot compared with a mean ± s.d. plot]

Deep-sea sediments. Phytopigment concentrations in deep-sea sediments collected worldwide show a very strong right skew. Two summary values are given: 0.015 and [value missing in transcript] grams per square meter of bottom surface. Which of these two values is the mean and which is the median? Which would be a better summary statistic for these data?

We know that the mean is not a robust measure of center, and that it is influenced by skew and outliers. Because the data are strongly right-skewed, we expect the mean to be larger than the median. Therefore, the larger of the two values is the mean, and the smaller is the median. Given the skewed nature of these data, the median would probably be a better summary statistic (depending on intended use).
