Presentation on theme: "Objectives 1.2 Describing distributions with numbers"— Presentation transcript:
1 Objectives 1.2 Describing distributions with numbers. Measures of center: mean, median. Mean versus median. Measures of spread: quartiles, standard deviation. Five-number summary and boxplot. Choosing among summary statistics. Changing the unit of measurement.
2 Numerical descriptions of distributions Describe the shape, center, and spread of a distribution… Center: mean, median and mode. Spread: range, IQR, standard deviation (SD). We treat these as aids to understanding the distribution of the variable at hand… The mean is often called the "average" and is in fact the arithmetic average ("add all the values and divide by the number of observations").
3 Measure of center: sample mean: Example 1 The mean or arithmetic average. To calculate the average, or mean, add all values, then divide by the number of individuals. It is the “center of mass.” The sum of the heights is 301.2 inches; divided by 5 women, the mean is 301.2/5 = 60.24 inches. Most of you know what a mean, or common arithmetic average, is. You should know how to calculate the mean both by hand and using your calculator. See Dr. Baldi.
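A minimal Python sketch of the same arithmetic, assuming only the two numbers given on the slide (the sum 301.2 and n = 5). The individual heights are not listed, so the helper name `sample_mean` and the list in the second call are illustrative and are not the class data.

```python
def sample_mean(values):
    """Add all values, then divide by the number of observations."""
    return sum(values) / len(values)

# From the slide: the five heights sum to 301.2 inches.
total_height = 301.2
n_women = 5
mean_height = total_height / n_women
print(round(mean_height, 2))  # 60.24

# Hypothetical data, only to show the function applied to a list.
print(sample_mean([2, 3, 5, 1]))  # 2.75
```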
4 Measure of center: sample mean: Example 2 [Table: woman (i) and height (x) for i = 1 to 25, with the column sum Σ in the last row.] Mathematical notation (sample mean): $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$. There is some standard math notation for referring to the mean and the numbers used to calculate it. We number the individuals using the letter i; here i goes from 1 to 25. The total number is n, or 25. We refer to the variable height, associated with each individual, using x. It doesn’t have to be i or x, but it usually is. I will always make it clear to you what is what, as in the column headings here. The x’s get numbered to match the individual. The mean, x bar, is the sum of the individual heights, the x sub i’s, divided by the total number of individuals n. A shorthand way to write the same equation uses the summation symbol, which means to sum the x values, or heights, as i goes from 1 to n: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Mean height is about 5’4”. Learn right away how to get the mean using your calculators.
5 Your numerical summary must be meaningful! Height of 25 women in a class. The distribution of women’s heights appears coherent and symmetrical. The mean is a good numerical summary.
6 The Median (M) is often called the "middle" value and is the value at the midpoint of the observations when they are ranked from smallest to largest value. Steps to get the median:
Arrange the data from smallest to largest.
If n is odd, the median is the single observation in the center (at position (n+1)/2 in the ordering).
If n is even, the median is the average of the two middle observations (at positions n/2 and n/2 + 1; the position (n+1)/2 falls in between them).
E.g. 1: 5, 1, 7, 4, 3
E.g. 2: 5, 1, 7, 4, 3, 8
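The steps above can be sketched in Python as follows. The function name `median` is our own choice for this illustration; the two calls use the two example data sets from the slide.

```python
def median(values):
    """Median following the slide's steps: sort, then take the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single observation at position (n + 1) / 2 (1-based).
        return ordered[mid]
    # Even n: average of the observations at positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([5, 1, 7, 4, 3]))     # 4
print(median([5, 1, 7, 4, 3, 8]))  # 4.5
```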
7 Measure of center: the median Note: for a median, 50% of the data are less than it and 50% of the data are bigger than it. Example 1: with the data listed below, what are the mean and median? 2, 3, 5, 1. Example 2: with the data listed below, what are the mean and median? 2, 3, 5, 1, 100. Example 3: with the data listed below, what are the mean and median? -100, 2, 3, 5, 1, 100. Question: What can we conclude from the examples above? The mean is sensitive to outliers; the median is robust to outliers.
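One way to check the three examples is with Python's standard `statistics` module. This is only a sketch; the dictionary names `examples` and `summary` are ours.

```python
from statistics import mean, median

examples = {
    "Example 1": [2, 3, 5, 1],
    "Example 2": [2, 3, 5, 1, 100],
    "Example 3": [-100, 2, 3, 5, 1, 100],
}

# Map each example to its (mean, median) pair.
summary = {name: (mean(data), median(data)) for name, data in examples.items()}
for name, (m, med) in summary.items():
    print(f"{name}: mean = {m:.2f}, median = {med}")
```

Adding the single value 100 moves the mean from 2.75 to 22.2, while the median only moves from 2.5 to 3. With outliers on both sides (Example 3) the median is back at 2.5 and the mean is about 1.83.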
8 Measure of center: the median The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger. 1. Sort observations by size (n = number of observations). 2.a. If n is odd, the median is observation (n+1)/2 down the list. With n = 25, (n+1)/2 = 26/2 = 13, and Median = 3.4. 2.b. If n is even, the median is the mean of the two middle observations. With n = 24, n/2 = 12, and Median = (observation 12 + observation 13)/2 = 3.35.
9 Mean and median of a distribution with outliers [Figure: percent of people dying, shown without the outliers and with the outliers.] The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2). The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).
10 Impact of skewed data Mean and median of a symmetric distribution (Disease X): mean and median are the same. … and of a right-skewed distribution (Multiple myeloma): the mean is pulled toward the skew.
11 We can describe the shape, center and spread of a density curve in the same way we describe data. For example, the median of a density curve is the “equal-areas” point: the point on the horizontal axis that divides the area under the density curve into two equal parts (0.5 each). The mean of the density curve is the balance point: the point on the horizontal axis where the curve would balance if it were made of a solid material. (See figures 1.24b and 1.25 below)
12 Skewness: The mean is pulled toward the skew. [Figure: three density curves with mean, median, and mode marked. SYMMETRIC: Mode = Mean = Median. SKEWED LEFT (negatively). SKEWED RIGHT (positively).]
13 Measure of spread: the quartiles Spread: percentiles, quartiles (Q1 and Q3), IQR, 5-number summary (and boxplots), range, standard deviation. The pth percentile of a variable is a data value such that p% of the values of the variable fall at or below it. The lower (Q1) and upper (Q3) quartiles are special percentiles dividing the data into quarters (fourths). Get them by finding the medians of the lower and upper halves of the data. IQR = interquartile range = Q3 - Q1 = spread of the middle 50% of the data. The IQR is used with the so-called 1.5*IQR criterion for outliers - know this!
14 Examples to find 5-# summary and Boxplot Eg1: Dataset: 3, 2, 1, 5, 6. Find the Median, Q1, Q3 and IQR. Find the 5-# summary. Draw a Boxplot for Eg1. Eg2: Dataset: 3, 2, 1, 5, 6, 8. Find the Median, Q1, Q3 and IQR. Find the 5-# summary. Draw a Boxplot for Eg2.
16 Measure of spread: the quartiles The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (it is the median of the lower half of the sorted data, excluding M). The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (it is the median of the upper half of the sorted data, excluding M). Q1 = first quartile = 2.2; M = median = 3.4; Q3 = third quartile = 4.35.
18 Five-number summary and boxplot Five-number summary: min, Q1, M, Q3, max. [Boxplot] Smallest = min = 0.6; Q1 = first quartile = 2.2; M = median = 3.4; Q3 = third quartile = 4.35; Largest = max = 6.1.
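A short Python sketch of this quartile rule (medians of the lower and upper halves, excluding M when n is odd), applied to the Eg1 and Eg2 data sets from slide 14. The function name `five_number_summary` is ours, and other software may use a different quartile convention and give slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max), with quartiles as medians of the halves.

    For odd n the overall median M is excluded from both halves.
    Assumes at least two observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]       # observations below the median position
    upper = ordered[n - half:]   # observations above the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

eg1 = five_number_summary([3, 2, 1, 5, 6])
eg2 = five_number_summary([3, 2, 1, 5, 6, 8])
print(eg1, "IQR =", eg1[3] - eg1[1])
print(eg2, "IQR =", eg2[3] - eg2[1])
```

For Eg1 this gives min 1, Q1 1.5, M 3, Q3 5.5, max 6; for Eg2 it gives min 1, Q1 2, M 4, Q3 6, max 8. The IQR is 4 in both cases.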
19 Boxplots for skewed data Comparing box plots for a normal and a right-skewed distribution. Boxplots remain true to the data and depict clearly symmetry or skew.
20 5-number summary: min, Q1, median, Q3, max. When plotted, the 5-number summary is a boxplot. We can also do a modified boxplot to show outliers (mild and extreme). Boxplots have less detail than histograms and are often used for comparing distributions… e.g., Fig. 1.17, p.47 and below...
21 Suspected outliers: how to detect outliers Outliers are troublesome data points, and it is important to be able to identify them. One way to raise the flag for a suspected outlier is to compare the distance from the suspicious data point to the nearest quartile (Q1 or Q3). We then compare this distance to the interquartile range (distance between Q1 and Q3). We call an observation a suspected outlier if it falls more than 1.5 times the size of the interquartile range (IQR) below the first quartile or above the third quartile. This is called the “1.5 * IQR rule for outliers.” Add in one other thing we know, the spread (the largest and smallest values), and make a box plot. Now, why would you want to make one of these?
22 Modified Boxplot Modified boxplot (helps detect outliers). Calculate 1.5*IQR, then the bounds Q1 − 1.5*IQR and Q3 + 1.5*IQR. Draw box and line (similar to before). Draw whiskers to the minimum and maximum observations within (Q1 − 1.5*IQR, Q3 + 1.5*IQR). Observations outside this range should be plotted separately as dots.
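The procedure can be sketched in Python as below. This is an illustration under assumptions: the function name `modified_boxplot_stats` and the data set are hypothetical (not from the slides), quartiles are taken as medians of the lower and upper halves, and values exactly on a bound are treated as inside.

```python
from statistics import median

def modified_boxplot_stats(values):
    """Quartiles, 1.5*IQR bounds, whisker ends, and suspected outliers."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[n - half:])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations still inside the bounds.
    inside = [x for x in ordered if lower_bound <= x <= upper_bound]
    outliers = [x for x in ordered if x < lower_bound or x > upper_bound]
    return {
        "Q1": q1, "M": median(ordered), "Q3": q3, "IQR": iqr,
        "bounds": (lower_bound, upper_bound),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }

# Hypothetical data with one unusually large value.
stats = modified_boxplot_stats([1, 2, 3, 4, 5, 6, 7, 30])
print(stats)
```

Here Q1 = 2.5 and Q3 = 6.5, so the bounds are −3.5 and 12.5. The whiskers run from 1 to 7, and 30 is plotted separately as a suspected outlier.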
23 Modified Boxplot Question 1: Are there any suspected outliers? Question 2: If yes, then find the following values: calculate 1.5*IQR; lower bound = Q1 − 1.5*IQR; upper bound = Q3 + 1.5*IQR; find Min* = min within lower/upper bounds; find Max* = max within lower/upper bounds. Question 3: Can we verify any outliers? Question 4: Now draw the modified boxplot: draw Min* and Max*, Q1, Med, Q3. All observations outside this range should be plotted separately as dots. [Figure labels: Q1 = 2.2, Q3 = 4.35.]
24 Modified Boxplot Distance to Q3: 7.9 − 4.35 = 3.55. Interquartile range: Q3 − Q1 = 4.35 − 2.2 = 2.15. Individual #25 has a value of 7.9 years, which is 3.55 years above the third quartile. This is more than 1.5 * IQR = 3.225 years. Thus, individual #25 is an outlier by our 1.5 * IQR rule.
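The arithmetic on this slide can be checked with a few lines of Python. The variable names are ours; the numbers Q1 = 2.2, Q3 = 4.35, and 7.9 come from the slide. Results are rounded for printing because of floating-point representation.

```python
q1, q3 = 2.2, 4.35   # quartiles from the slide
value = 7.9          # individual #25

iqr = q3 - q1
threshold = 1.5 * iqr
distance_to_q3 = value - q3
is_suspected_outlier = distance_to_q3 > threshold

print(round(iqr, 3), round(threshold, 3), round(distance_to_q3, 3), is_suspected_outlier)
```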
25 Measure of spread: the standard deviation The standard deviation “s” is used to describe the variation around the mean. Like the mean, it is not resistant to skew or outliers. 1. First calculate the variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. 2. Then take the square root to get the standard deviation: $s = \sqrt{s^2}$.
Boxplots are used to show the spread around a median. They can be used no matter what the distribution, and are a good way to contrast variables having different distributions. But if your distribution is symmetrical, you can use the mean as the center of your distribution, and you can use a different (and more common) measure of spread around the mean: the standard deviation. The standard deviation measures spread by looking at how far the observations are from their mean.
Go through the calculation. This is the women’s height data again. First, n is again the number of observations. From this we calculate the degrees of freedom, which is just n − 1. Come back to this in a second. Take each difference from the mean, square it so all are positive, and add them up. Then divide not by the number of observations but by n − 1 = df. Although variance is a useful measure of spread, its units are units squared. Height squared is not intuitive. So we like to take the square root and use that number, the SD, which has the same units as the mean.
Now, as to why we divide by n − 1 instead of n. When we got the mean it was easy to imagine intuitively why we divided by n. But actually, what we are doing even there is dividing by the number of independent pieces of information that go into the estimate of a parameter. This number is called the degrees of freedom (df), and it is equal to the number of independent scores that go into the estimate minus the number of parameters estimated as intermediate steps in the estimation of the parameter itself. For example, if the variance, s², is to be estimated from a random sample of n independent scores, then the degrees of freedom is equal to the number of independent scores (n) minus the number of parameters estimated as intermediate steps (here, we have estimated the mean) and is therefore equal to n − 1.
But why the term “degrees of freedom”? When we calculate the s² of a random sample, we must first calculate the mean of that sample and then compute the sum of the several squared deviations from that mean. While there will be n such squared deviations, only (n − 1) of them are, in fact, free to assume any value whatsoever. This is because the final squared deviation from the mean must include the one value of x such that the sum of all the x’s divided by n will equal the obtained mean of the sample. All of the other (n − 1) squared deviations from the mean can, theoretically, have any values whatsoever. For these reasons, the statistic s² is said to have only (n − 1) degrees of freedom.
I know this is hard to understand. I don’t expect you to understand it completely. But in a second I will come back to it to show you the effect of dividing by n − 1 rather than n, and perhaps that will make it easier to accept. [Figure label: mean ± 1 s.d.]
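The degrees-of-freedom argument can be made concrete with a small Python sketch. The sample below is hypothetical and chosen only so the numbers are easy to follow. It shows that the deviations from the sample mean sum to zero, so the last deviation is fixed by the other n − 1, and it compares dividing by n − 1 with dividing by n.

```python
from statistics import mean

# Hypothetical sample, used only for illustration.
sample = [2, 4, 6, 12]
n = len(sample)
x_bar = mean(sample)

deviations = [x - x_bar for x in sample]
# Deviations from the sample mean always sum to zero ...
total_deviation = sum(deviations)
# ... so the last deviation is determined by the other n - 1 of them.
last_from_others = -sum(deviations[:-1])

sum_sq = sum(d ** 2 for d in deviations)
variance_n_minus_1 = sum_sq / (n - 1)   # sample variance s^2
variance_n = sum_sq / n                 # dividing by n gives a smaller value

print(total_deviation, last_from_others == deviations[-1])
print(variance_n_minus_1, variance_n)
```

With this sample the sum of squared deviations is 56, so dividing by n − 1 = 3 gives about 18.67, while dividing by n = 4 gives 14.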
26 Example 1: to calculate sample SD For data: 1, 2, 3, 4, 5. Q: Find the sample variance and sample SD. Calculations: Mean = 3. Sum of squared deviations from mean = 10. Degrees of freedom (df) = (n − 1) = 4. s² = sample variance = 10/4 = 2.5. s = sample standard deviation = √2.5 = 1.58. Make sure to know how to get the standard deviation using your calculator.
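The same calculation, step by step, as a Python sketch. The variable names are ours. The last line cross-checks the hand calculation against `statistics.stdev`, which also uses the n − 1 divisor.

```python
import math
from statistics import stdev

data = [1, 2, 3, 4, 5]
n = len(data)
x_bar = sum(data) / n                              # mean = 3.0
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)   # 4 + 1 + 0 + 1 + 4 = 10.0
df = n - 1                                         # degrees of freedom = 4
variance = sum_sq_dev / df                         # s^2 = 2.5
sd = math.sqrt(variance)                           # s is about 1.58

print(x_bar, sum_sq_dev, df, variance, round(sd, 2))
assert math.isclose(sd, stdev(data))
```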
27 Example 2: Calculate by hand the sample SD for the following data set: 3, 4, 5, 8. 1. First calculate the variance s². 2. Then take the square root to get the standard deviation s. Make sure to know how to get the standard deviation using your calculator.
28 How to use calculator to find statistics… To find the sample mean, sample SD, and 5-# summary, we can use the calculator as follows: STAT → EDIT → choose 1: Edit… and input your data into L1; then STAT → CALC → choose 1: 1-Var Stats → ENTER → ENTER. Read your outputs carefully. Note: X-bar means sample mean; Sx means sample SD; n means sample size. Q: Find the sample mean, sample SD, and 5-# summary for the following data: Example 1: Data are: 3, 4, 5, 8. Example 2: Data are: 1, 3, 5, 6, 7, 8.
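For readers without the calculator at hand, here is a rough Python counterpart. It is a sketch and does not reproduce the calculator's internals: the function name `one_var_stats` and the dictionary keys are our own labels, and the quartiles are assumed to follow the rule used on these slides (medians of the lower and upper halves, excluding M when n is odd).

```python
from statistics import mean, median, stdev

def one_var_stats(values):
    """Sample mean, sample SD (n - 1 divisor), n, and five-number summary."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return {
        "x_bar": mean(ordered),
        "Sx": stdev(ordered),
        "n": n,
        "min": ordered[0],
        "Q1": median(ordered[:half]),
        "Med": median(ordered),
        "Q3": median(ordered[n - half:]),
        "max": ordered[-1],
    }

example1 = one_var_stats([3, 4, 5, 8])
example2 = one_var_stats([1, 3, 5, 6, 7, 8])
print(example1)
print(example2)
```

For Example 1 this gives a mean of 5, Sx of about 2.16, and the five-number summary 3, 3.5, 4.5, 6.5, 8. For Example 2 it gives a mean of 5, Sx of about 2.61, and the five-number summary 1, 3, 5.5, 7, 8.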
30 How to perform data analysis: ALWAYS PLOT DATA BEFORE DECIDING ON A NUMERICAL SUMMARY. How to choose summary statistics? Use the 5-number summary rather than the mean and s.d. for skewed data; use the mean and s.d. for symmetric data.