Describing Data from One Variable

Copyright © 2010 by Hawkes Learning Systems/Quant Systems, Inc. All rights reserved. Chapter 4 Describing Data from One Variable

Sections a Measures of Location HAWKES LEARNING SYSTEMS math courseware specialists Ch 4. Describing Data From One Variable 4.1 Measures of Location Objectives: To calculate the mean, median, and mode. To determine the most appropriate measure of center.

Section 4.1 Measures of Location Measures of Location: If we think about a data set as a group of data values that cluster around some central value, then the central value provides a focal point for the set, a location of sorts. Unfortunately, the notion of central value is a vague concept, which is as much defined by the way it is measured as by the notion itself. There are several statistical measures that are used to define the notion of center: the arithmetic mean, trimmed mean, median, and mode.

Section 4.1 Measures of Location The Arithmetic Mean: Suppose there are n observations in a data set, consisting of the observations $x_1, x_2, \ldots, x_n$; then the arithmetic mean is $\frac{x_1 + x_2 + \cdots + x_n}{n}$. The mean is what we typically call the “average” of a data set. To calculate the mean, simply add all the values and divide by the total number of values in the data set. The mean should only be used for quantitative data. Outliers have a dramatic effect on the mean value.

Section 4.1 Measures of Location The Arithmetic Mean: If we use mathematical notation, the formula can be simplified to $\frac{\sum x_i}{n}$, where $x_i$ is the $i$th data value in the data set and $\Sigma$ (pronounced sigma) is a mathematical notation for adding values. There are two symbols that are associated with the mean: the sample mean $\bar{x} = \frac{\sum x_i}{n}$ and the population mean $\mu = \frac{\sum x_i}{N}$. Here $n$ refers to the size of the sample and $N$ refers to the size of the population. Otherwise, the calculations are made in precisely the same way.

Section 4.1 Measures of Location Example: Calculate the sample mean of the following heights in inches. 63, 68, 71, 67, 63, 72, 66, 67, 70 Solution: $\bar{x} = \frac{63 + 68 + 71 + 67 + 63 + 72 + 66 + 67 + 70}{9} = \frac{607}{9} \approx 67.4$ inches. When calculating the mean, round to one more decimal place than what is given in the data.
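The calculation in this example can be checked with a few lines of Python. This is only a sketch: the function name `arithmetic_mean` is an illustrative choice, not something from the slides.

```python
def arithmetic_mean(values):
    """Add all the values and divide by how many there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Heights in inches, taken from the example above.
heights = [63, 68, 71, 67, 63, 72, 66, 67, 70]
mean_height = arithmetic_mean(heights)

# Round to one more decimal place than the data, as the slide advises.
print(round(mean_height, 1))
```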

Section 4.1 Measures of Location Deviation: Given some point A and a data point x, then x – A represents how far x deviates from A. This difference is also called a deviation. The table below shows the deviations from the mean for the following sample data values: 4, 10, 7, 15. The mean of this data set is 9.
Data Value xi    Deviation from the mean (xi – 9)
4                –5
10               1
7                –2
15               6
Notice that the sum of the deviations is zero. This illustrates why the mean is a measure of central tendency. If we calculate the deviations about any other value, the sum of the deviations will not equal zero.

Section 4.1 Measures of Location The Median: The median of a set of data values is the middle value in an ordered array. The same number of values is on either side of the median value. To find the median:
1. Arrange the data in ascending order.
2. Count the number of values in the data.
3. If the count is odd, the median is the middle value in the data. If the count is even, the median is the sum of the two middle values in the data divided by two.
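The odd/even rule for the median can be sketched in Python as follows. The function name `median` and the two short data sets are illustrative choices; the sets have seven and eight values, so both branches of the rule are exercised.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if the count is even."""
    ordered = sorted(values)          # arrange the data in ascending order
    n = len(ordered)                  # count the number of values
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                    # count is odd: take the middle value
        return ordered[mid]
    # count is even: sum of the two middle values divided by two
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 5, 0, 1, 9, 2, 7]))      # seven values (odd count)
print(median([3, 5, 0, 1, 9, 2, 7, 6]))   # eight values (even count)
```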

Section 4.1 Measures of Location Example: Calculate the median of the following sets of data. Solution: Solution:

Section 4.1 Measures of Location The Trimmed Mean: The trimmed mean ignores an equal percentage of the highest and lowest values in calculating the mean. To calculate the 10% trimmed mean:
1. Arrange the data in ascending order.
2. Delete the lowest 10% of the values.
3. Delete the highest 10% of the values.
4. Calculate the arithmetic mean of the remaining 80% of the values.
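The trimming steps can be sketched in Python. Two things here are assumptions rather than part of the slides: the data set is made up for illustration, and when the trim percentage does not correspond to a whole number of observations the sketch simply truncates (the slides do not say how to handle that case).

```python
def trimmed_mean(values, percent=10):
    """Mean after dropping `percent` percent of the values from each end."""
    ordered = sorted(values)                # arrange in ascending order
    n = len(ordered)
    k = int(n * percent / 100)              # values to drop per end (truncation is an assumption)
    kept = ordered[k:n - k]                 # delete the lowest and highest k values
    if not kept:
        raise ValueError("trimming removed every observation")
    return sum(kept) / len(kept)


# Hypothetical data with one large outlier (50).
data = [2, 4, 5, 6, 7, 8, 9, 10, 12, 50]
print(sum(data) / len(data))    # ordinary mean, pulled upward by the outlier
print(trimmed_mean(data, 10))   # 10% trimmed mean, drops 2 and 50
```

With ten observations, 10% from each end is exactly one value, so the outlier no longer influences the result.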

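Section 4.1 Measures of Location Example: Consider the following data: 16 18 20 21 23 23 24 32 36 42 mean = 25.5 median = 23 Find the 10% trimmed mean. Solution: Since there are 10 observations, removing the highest 10% and lowest 10% means only removing one observation from each end of the data. The trimmed mean is the mean of the remaining eight values: $\frac{18 + 20 + 21 + 23 + 23 + 24 + 32 + 36}{8} = \frac{197}{8} \approx 24.6$.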

Section 4.1 Measures of Location Resistant Measures: Statistical measures which are not affected by outliers are said to be resistant. The mean is not a resistant measure. The trimmed mean is a resistant measure.

Section 4.1 Measures of Location The Mode: The mode of a data set is the most frequently occurring value. The mode is the only measure of centralness that can be applied to nominal data. When a data set has two modes it is said to be bimodal. When the data set has more than two modes it is said to be multimodal.
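A minimal Python sketch of finding the mode (or modes) is shown below. The function name `modes` is illustrative, and the convention of reporting no mode when every value appears only once is an assumption about how ties at a count of one should be treated. The heights are the ones used in the sample mean example; the color list is made up to show that the same function works on nominal data.

```python
from collections import Counter


def modes(values):
    """Return the most frequently occurring value(s), or [] if every value occurs once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return []                      # each value appears only once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


heights = [63, 68, 71, 67, 63, 72, 66, 67, 70]
print(modes(heights))                          # two modes, so the data set is bimodal
print(modes(["red", "blue", "red", "green"]))  # hypothetical nominal data
print(modes([1, 2, 3]))                        # no repeated value
```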

Section 4.1 Measures of Location Example: Calculate the mode of each data set. Solution: There are two modes: 63 and 67. The data set is bimodal. Solution: 51 occurs three times. The mode is 51. Solution: Each value appears only once. There is no mode.

Section 4.1 Measures of Location The Relationship between the Mean and Median: The shape of the data determines how the mean, median, and mode are related. For a bell-shaped distribution, the mean, median, and mode are identical.

Section 4.1 Measures of Location Skewed Distributions: Not all data produce distributions which follow a bell-shaped curve. If the distribution of the data has a long tail to the right, it is said to be skewed to the right, or positively skewed. Conversely, if the distribution has a long tail on the left, the data is said to be skewed to the left, or negatively skewed. If the data is positively skewed, the median will be smaller than the mean. If the data is negatively skewed, the median will be larger than the mean.

Section 4.2 Selecting a Measure of Location Selecting a Measure of Location: The objective of using descriptive statistics is to provide measures which convey useful summary information about the data. When selecting a statistic to represent the central value of a data set, the first question involves what type of data is being analyzed. The arithmetic mean is frequently, but not always, the most reasonable measure of centralness.

Section 4.2 Selecting a Measure of Location Selecting a Measure of Location: [Table: applicable level of measurement (qualitative: nominal, ordinal; quantitative: interval, ratio) for each measure of location (mean, median, mode, t-mean).] [Table: sensitivity to outliers, from not sensitive to very sensitive, for each measure of location (mean, median, mode, t-mean).]

Section 4.2 Selecting a Measure of Location Selecting a Measure of Location: The mean and median are the same value when the data is symmetrical. When the data is nominal or ordinal, the mean should not be calculated. When the data is at least interval and there are no outliers, the mean is a reasonable choice. When the data is ordinal, the median is the best choice. The median is a good measure of central tendency since it is not sensitive to outliers. The median can be applied to all levels of measurement except nominal. The mode can be applied to all levels of data, but is not very useful for quantitative data. If the data is nominal, there is only one choice, the mode.

Section 4.2 Selecting a Measure of Location Time Series Data and Measures of Centralness: The graph below shows the average gas price over a number of years. In this non-stationary time series, the central value of the process is trending upward. One way to capture this movement is with a moving average.

Section 4.2 Selecting a Measure of Location Moving Average: A moving average is obtained by adding consecutive observations for a number of periods and dividing the result by the number of periods included in the average. The table below shows the average US gas price from 1991 to 2002 along with the 2 and 3 period moving averages.
Year   Average US Gas Price   2 Period Moving Average   3 Period Moving Average
1991   1.09
1992   1.10                   1.095
1993   1.07                   1.085                     1.087
1994   1.08                   1.075                     1.083
1995   1.11                   1.095                     1.087
1996   1.21                   1.160                     1.133
1997   1.18                   1.195                     1.167
1998   1.01                   1.095                     1.133
1999   1.14                   1.075                     1.110
2000   1.49                   1.315                     1.213
2001   1.38                   1.435                     1.337
2002   1.34                   1.360                     1.403
The two-period moving average for 1992 averages the time series in 1991 and 1992: $\frac{1.09 + 1.10}{2} = 1.095$.
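The moving-average definition can be sketched in Python using the gas prices from the table. The function name `moving_average` and the pairing of each average with the last year of its window are illustrative choices that follow the 1992 example.

```python
def moving_average(series, periods):
    """Average each run of `periods` consecutive observations."""
    if periods < 1 or periods > len(series):
        raise ValueError("periods must be between 1 and the series length")
    return [
        sum(series[i - periods + 1:i + 1]) / periods
        for i in range(periods - 1, len(series))
    ]


years = list(range(1991, 2003))
prices = [1.09, 1.10, 1.07, 1.08, 1.11, 1.21, 1.18, 1.01, 1.14, 1.49, 1.38, 1.34]

ma2 = moving_average(prices, 2)   # first entry belongs to 1992
ma3 = moving_average(prices, 3)   # first entry belongs to 1993

for year, value in zip(years[1:], ma2):
    print(year, round(value, 3))
```

A longer window smooths the series more, at the cost of reacting more slowly to changes such as the jump in 2000.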

Section 4.2 Selecting a Measure of Location Moving Average: The chart below displays the time series and the two and three-period moving averages. Notice that both of the averages follow the time series quite closely.

Sections b Measures of Dispersion Ch 4. Describing Data From One Variable 4.3 Measures of Dispersion Objective: To compute the range, variance, and standard deviation.

Section 4.3 Measures of Dispersion Measuring Variation: Many of the good measures of variation use the concept of deviation from the mean. If the mean is a focal point or base, use it as a common basis from which to measure variation. The distance that a point is from its mean is called the deviation from the mean. The sum of the positive deviations equals the sum of the absolute values of the negative deviations. The deviations will always sum to zero. Many of the variability measures average the deviations in some form.

Section 4.3 Measures of Dispersion Example: A data set and its deviations from the mean are calculated in the table below. Notice that the sum of the deviations is zero. Data set: 3, 12, 20, 15, 0 Mean = 10
Data Value   data – mean = deviation
3            3 – 10 = –7
12           12 – 10 = 2
20           20 – 10 = 10
15           15 – 10 = 5
0            0 – 10 = –10

Section 4.3 Measures of Dispersion Mean Absolute Deviation: The sample mean absolute deviation (MAD) is $\text{MAD} = \frac{\sum |x_i - \bar{x}|}{n}$. It computes the average distance from the mean of a data set. If data set A has a larger mean absolute deviation than data set B, then it is reasonable to believe that data set A has more variability than data set B. The MAD is an intuitive measure of variation, but its theoretical development has been hampered by the difficulty that absolute values pose to calculus. It is sensitive to outliers and is not a resistant measure.

Section 4.3 Measures of Dispersion Example: Suppose six people participated in a 1000 meter run. Their times, measured in minutes, are given below: 4, 10, 9, 11, 9, 7. The mean time is 8.333 minutes. Calculate the mean absolute deviation.
Time in min.   Deviation              Absolute Deviation   % of total
4              4 – 8.333 = –4.333     4.333                38.23
10             10 – 8.333 = 1.667     1.667                14.71
9              9 – 8.333 = 0.667      0.667                5.88
11             11 – 8.333 = 2.667     2.667                23.53
9              9 – 8.333 = 0.667      0.667                5.88
7              7 – 8.333 = –1.333     1.333                11.77
Total                                 11.334               100.00
The mean absolute deviation is the total absolute deviation divided by the number of observations: $\frac{11.334}{6} \approx 1.889$ minutes.
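The mean absolute deviation for the running times can be reproduced with a short Python sketch. The function name `mean_absolute_deviation` is an illustrative choice.

```python
def mean_absolute_deviation(values):
    """Average distance of the values from their mean."""
    if not values:
        raise ValueError("the MAD of an empty data set is undefined")
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


times = [4, 10, 9, 11, 9, 7]      # minutes, from the example
mad = mean_absolute_deviation(times)
print(round(mad, 3))
```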

Section 4.3 Measures of Dispersion Variance and Standard Deviation: Standard deviation and variance are the most common measures of variability. The standard deviation and variance also provide numerical measures of how the data varies around the mean. If the data is tightly packed around the mean, the standard deviation and variance will be relatively small. If the data is widely dispersed about the mean, the standard deviation and variance will be relatively large.

Section 4.3 Measures of Dispersion Variance: The variance of a data set containing the complete set of population data is given by $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ and is called the population variance. The variance of a data set containing sample data is given by $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ and is called the sample variance.

Section 4.3 Measures of Dispersion Example: Given the following times in minutes of 6 persons running the 1000 meter course, compute the sample variance: 4, 10, 9, 11, 9, 7. The sample mean is 8.333.
Data   Deviations             Squared Deviations   % of total
4      4 – 8.333 = –4.333     18.7749              59.93
10     10 – 8.333 = 1.667     2.7789               8.87
9      9 – 8.333 = 0.667      0.4449               1.42
11     11 – 8.333 = 2.667     7.1129               22.70
9      9 – 8.333 = 0.667      0.4449               1.42
7      7 – 8.333 = –1.333     1.7769               5.67
Total                         31.33                100.00
The sample variance is the total of the squared deviations divided by one less than the number of observations: $\frac{31.33}{6 - 1} \approx 6.27$.
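As a cross-check on the table, Python's standard `statistics` module provides both the sample and the population versions of the variance and the standard deviation. The sketch below applies them to the same six running times; the variable names are illustrative.

```python
import statistics

times = [4, 10, 9, 11, 9, 7]                     # minutes, from the example

sample_variance = statistics.variance(times)     # divides by n - 1
population_variance = statistics.pvariance(times)  # divides by N
sample_sd = statistics.stdev(times)              # square root of the sample variance
population_sd = statistics.pstdev(times)         # square root of the population variance

print(round(sample_variance, 2), round(sample_sd, 2))
print(round(population_variance, 2), round(population_sd, 2))
```

The sample figures are slightly larger than the population figures because the same sum of squared deviations is divided by 5 rather than 6.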

Section 4.3 Measures of Dispersion Standard Deviation: The standard deviation is the square root of the variance. There are two measures of variance, so there will be two standard deviations. The sample standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$. The population standard deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$. It is important to remember the symbols above since standard deviation is a fundamental statistical concept.

Section 4.3 Measures of Dispersion Standard Deviation: Standard deviation is the square root of the average squared deviation. It can also be used to measure how far the data values are from the mean. Relatively few data values will be more than two deviation units from the mean. Like the variance, the standard deviation is sensitive to outliers. The presence of outliers tarnishes the interpretation of the standard deviation as a typical deviation.

Section 4.3 Measures of Dispersion Range: The range is the difference between the largest and smallest data values. Example: Calculate the range of the following data set. 4, 6, 16, 9, 24, 8, 0, 12, 1 Solution: The largest value is 24 and the smallest value is 0. Range = 24 – 0 = 24.

Section 4.4 Measures of Relative Position Objectives: Determine the percentiles and locations of specific data points. Find the quartiles of the data. Determine the z-score as a measure of relative position.

Section 4.4 Measures of Relative Position Pth Percentile: Given a set of data x1, x2,…,xn, the Pth percentile is a value say, X, such that at least P percent of the data is less than or equal to X and at least (100 – P) percent of the data is greater than or equal to X. The most often used measure of relative position is the percentile.

Section 4.4 Measures of Relative Position Pth Percentile: To determine the Pth percentile:
1. Form an ordered array by placing the data in order from smallest to largest.
2. To find the location of the Pth percentile in the ordered array, let $\ell = n \cdot \frac{P}{100}$, where n is the number of observations in the ordered data.
3. If $\ell$ is not an integer, then round $\ell$ up to the next greatest integer and take the data value in that location. If $\ell$ is an integer, then average the data value in the $\ell$th location with the data value in the $(\ell + 1)$th location.
Remember, $\ell$ is not the percentile; $\ell$ is the location of the percentile in the ordered array.

Section 4.4 Measures of Relative Position Determining the Pth Percentile Flow Chart: Arrange the data in ascending order. To find the Pth percentile in the ordered data, calculate $\ell = n \cdot \frac{P}{100}$, where n is the number of observations in the ordered data. Is $\ell$ an integer? Yes: Average the data value in the $\ell$th location with the data value in the $(\ell + 1)$th location. No: Round $\ell$ up to the next greatest integer. Find the data value in the $\ell$th location.
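The flow chart translates fairly directly into code. The sketch below is one way to write it in Python; the function name `percentile_value` is hypothetical, and restricting P to values strictly between 0 and 100 is an assumption made so that the location always falls inside the ordered array. Note that statistical libraries often use different interpolation conventions, so their results may not match this rule exactly.

```python
import math


def percentile_value(data, p):
    """Pth percentile using the location rule: location = n * P / 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)                 # arrange the data in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("the data set is empty")
    location = n * p / 100
    if location == int(location):
        # Integer location: average that value with the next one (1-based positions).
        loc = int(location)
        return (ordered[loc - 1] + ordered[loc]) / 2
    # Otherwise round up and take the value at that position.
    return ordered[math.ceil(location) - 1]


print(percentile_value([3, 5, 0, 1, 9, 2, 7], 50))      # location 3.5, rounds up to 4
print(percentile_value([3, 5, 0, 1, 9, 2, 7, 6], 50))   # location 4, average of 4th and 5th
```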

Section 4.4 Measures of Relative Position Example: Find the 50th percentile for the following data set. 3, 5, 0, 1, 9, 2, 7 Solution: The ordered array is 0, 1, 2, 3, 5, 7, 9. With n = 7 observations, the location is 7(50/100) = 3.5. Since the location is not an integer, the value is rounded up to 4. Thus, the fourth observation in the ordered array would be the median. The median value (which is the 50th percentile) equals 3.

Section 4.4 Measures of Relative Position Example: Find the 50th percentile for the following data set. 3, 5, 0, 1, 9, 2, 7, 6 Solution: The ordered array is 0, 1, 2, 3, 5, 6, 7, 9. With n = 8 observations, the location is 8(50/100) = 4. Since the location is an integer, we average the 4th value and the 5th value of the ordered array: (3 + 5)/2 = 4. The 50th percentile for this data set is 4.

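Section 4.4 Measures of Relative Position Percentile: The percentile of some data value x is given by: percentile of x = $\frac{\text{number of data values less than or equal to } x}{\text{total number of data values}} \cdot 100$.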

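Section 4.4 Measures of Relative Position Example: Find the percentile of 45 for the following data set. 67, 45, 63, 58, 35, 54, 27, 66, 21, 48 Solution: The ordered array is 21, 27, 35, 45, 48, 54, 58, 63, 66, 67. The values less than or equal to 45 are 21, 27, 35, and 45, so the number of values less than or equal to 45 is 4. With 10 data values in total, the percentile of 45 is (4/10)(100) = 40, so 45 is at the 40th percentile.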
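The percentile of a given data value is a simple count, sketched here in Python with the data from the example. The function name `percentile_of` is an illustrative choice.

```python
def percentile_of(data, x):
    """Percent of the data values that are less than or equal to x."""
    if not data:
        raise ValueError("the data set is empty")
    at_or_below = sum(1 for value in data if value <= x)
    return 100 * at_or_below / len(data)


values = [67, 45, 63, 58, 35, 54, 27, 66, 21, 48]
print(percentile_of(values, 45))
```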

Section 4.4 Measures of Relative Position Quartiles: The 25th, 50th, and 75th percentiles are known as quartiles and are denoted as Q1, Q2, and Q3. Quartiles serve as markers to divide the data. Q1 separates the lowest 25 percent. Q2 represents the median (50th percentile). Q3 marks the beginning of the top 25 percent of the data. Since quartiles are nothing more than percentiles, we construct them in the same way as before.

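Section 4.4 Measures of Relative Position Example: Find Q1, Q2, and Q3 for the following data set of test scores. 50, 50, 62, 75, 77, 82, 86, 87, 88, 88 Solution: There are n = 10 observations, already in ascending order. For Q1, the location is 10(25/100) = 2.5, which rounds up to 3, so Q1 = 62. For Q2, the location is 10(50/100) = 5, an integer, so Q2 is the average of the 5th and 6th values: (77 + 82)/2 = 79.5. For Q3, the location is 10(75/100) = 7.5, which rounds up to 8, so Q3 = 87.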

Section 4.4 Measures of Relative Position Interquartile Range: The interquartile range, which describes the range of the middle fifty percent of the data, is given by Interquartile range = Q3 – Q1. For the previous example the interquartile range is 87 – 62 = 25. A data point is considered an outlier if it is more than 1.5 times the interquartile range above the 75th percentile or more than 1.5 times the interquartile range below the 25th percentile.

Section 4.4 Measures of Relative Position Box Plots: An important use of quartiles is the construction of box plots. A box plot is a graphical summary of data that looks like a box. It provides an alternative to the histogram for displaying data, summarizing the central tendency, the spread, the skewness, and the potential existence of outliers in the data. Below is a box plot of the test scores data set. The plot is constructed from five summary measures: the smallest data value, the 25th percentile, the median, the 75th percentile, and the largest data value.

Section 4.4 Measures of Relative Position Example: Find the outliers in this new data set of test scores. 12, 50, 62, 75, 77, 82, 86, 87, 88, 126 Q1 = 62, Q2 = 79.5, Q3 = 87, and interquartile range = 25 Solution: An outlier is larger than the 75th percentile + 1.5 times the interquartile range = 87 + 1.5(25) = 124.5, or smaller than the 25th percentile – 1.5 times the interquartile range = 62 – 1.5(25) = 24.5. The outliers of this data set are 12 and 126.
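The outlier check in this example can be sketched in Python. The helper `percentile_value` repeats the location rule for percentiles (location = n · P / 100, round up if it is not an integer, otherwise average with the next value); both function names are hypothetical and the 1.5 multiplier is the one stated for the interquartile range rule.

```python
import math


def percentile_value(data, p):
    """Pth percentile (0 < p < 100) using the location rule n * P / 100."""
    ordered = sorted(data)
    n = len(ordered)
    location = n * p / 100
    if location == int(location):
        loc = int(location)
        return (ordered[loc - 1] + ordered[loc]) / 2
    return ordered[math.ceil(location) - 1]


def iqr_outliers(data, multiplier=1.5):
    """Return (outliers, lower fence, upper fence) based on the interquartile range."""
    q1 = percentile_value(data, 25)
    q3 = percentile_value(data, 75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return outliers, lower, upper


scores = [12, 50, 62, 75, 77, 82, 86, 87, 88, 126]
outliers, lower, upper = iqr_outliers(scores)
print(outliers, lower, upper)
```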

Section 4.4 Measures of Relative Position Z-Scores: The z-score transforms the data value into the number of standard deviations that value is from the mean: $z = \frac{x - \mu}{\sigma}$ for population data, or $z = \frac{x - \bar{x}}{s}$ for sample data. Remember: Describing a value by the number of standard deviations it lies from the mean is a fundamental concept of statistics. It is used as a standardization technique. If the z-score is negative, the value is less than the mean. If the z-score is positive, the value is greater than the mean. The z-score is a unit-free measure.

Section 4.4 Measures of Relative Position Example: Suppose you scored an 86 on your biology test and a 94 on your psychology test. The mean and standard deviation of the two tests are given below.
Course       Mean   Standard Deviation
Biology      74     10
Psychology   82     11
What are the z-scores for your two tests? On which test did you perform relatively better? Solution: The z-score for the biology test is (86 – 74)/10 = 1.2. The z-score for the psychology test is (94 – 82)/11 ≈ 1.09. Even though the raw score on the psychology test is larger than the raw score on the biology test, the performance on the biology test was slightly better.
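The comparison of the two test scores can be sketched in Python. The function name `z_score` is an illustrative choice; the means and standard deviations are the ones given in the example.

```python
def z_score(value, mean, standard_deviation):
    """Number of standard deviations that `value` lies from the mean."""
    if standard_deviation <= 0:
        raise ValueError("the standard deviation must be positive")
    return (value - mean) / standard_deviation


z_biology = z_score(86, 74, 10)
z_psychology = z_score(94, 82, 11)

print(round(z_biology, 2), round(z_psychology, 2))
print("biology" if z_biology > z_psychology else "psychology")
```

Both scores are 12 points above their class means, but the psychology scores are more spread out, so the same 12 points count for slightly less there.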

Sections Applying the Standard Deviation Objectives: To calculate the coefficient of variation and use it to compare the variation of different data sets. To calculate the mean, variance, and standard deviation of grouped data. To use the empirical rule and Chebyshev’s Theorem to describe the variability of data.

Section 4.5 Using the Standard Deviation Empirical Rule: If the distribution is bell-shaped: One sigma rule: about 68% of the data should lie within one standard deviation of the mean. A deviation of more than one sigma is to be expected once every three observations. Two sigma rule: about 95% of the data should lie within two standard deviations of the mean. A deviation of more than two sigma is to be expected about once every twenty observations. Three sigma rule: about 99.7% of the data should lie within three standard deviations of the mean. A deviation of more than three sigma is to be expected about once every 333 observations, slightly less than 0.3% of the time.

Section 4.5 Using the Standard Deviation Chebyshev’s Theorem: The proportion of any data set lying within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where $k > 1$. $k$ = 2: At least $1 - \frac{1}{2^2} = \frac{3}{4}$ (or 75%) of the data values lie within 2 standard deviations of the mean, for any data set. $k$ = 3: At least $1 - \frac{1}{3^2} = \frac{8}{9}$ (or 88.9%) of the data values lie within 3 standard deviations of the mean, for any data set.
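The bound in Chebyshev's Theorem is easy to tabulate and to compare against an actual data set. In the Python sketch below, the function names are hypothetical, the data are the six running times used in the dispersion examples, and the data set is treated as a population (so the population standard deviation is used), which is an assumption made for illustration.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum proportion of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def proportion_within(data, k):
    """Observed proportion of values within k population standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(1 for x in data if abs(x - mean) <= k * sd) / len(data)


times = [4, 10, 9, 11, 9, 7]
for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 3), proportion_within(times, k))
```

The observed proportion can never fall below the bound; for bell-shaped data it is usually well above it, which is why the empirical rule gives larger percentages.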

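Section 4.8 The Coefficient of Variation Coefficient of Variation: The coefficient of variation compares the variation in data sets by expressing the standard deviation as a percentage of the mean. For sample data: $CV = \frac{s}{\bar{x}} \cdot 100\%$. For a population: $CV = \frac{\sigma}{\mu} \cdot 100\%$. The coefficient of variation standardizes the variation measure.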
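A minimal Python sketch of the sample coefficient of variation follows. The function name is hypothetical, and the second data set (the same running times converted to seconds) is made up to show that the measure does not depend on the units.

```python
import statistics


def sample_coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the sample mean."""
    mean = statistics.mean(data)
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined when the mean is zero")
    return statistics.stdev(data) / mean * 100


times_minutes = [4, 10, 9, 11, 9, 7]
times_seconds = [60 * t for t in times_minutes]   # hypothetical unit change

print(round(sample_coefficient_of_variation(times_minutes), 2))
print(round(sample_coefficient_of_variation(times_seconds), 2))
```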

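Section 4.9 Analyzing Grouped Data Finding the Mean of Grouped Data: Finding the mean of grouped data involves finding the midpoint of each of the classes in the frequency distribution and then weighting each of these midpoints by the number of observations in the class. Let $M_i$ be the midpoint of the $i$th class and $f_i$ the number of observations in the $i$th class, and let $N = \sum f_i$ (for a population) or $n = \sum f_i$ (for a sample) be the total number of observations. For a population the mean of grouped data is given by $\mu = \frac{\sum f_i M_i}{N}$. If the grouped data represent sample observations the mean is given by $\bar{x} = \frac{\sum f_i M_i}{n}$.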

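Section 4.9 Analyzing Grouped Data Finding the Variance of Grouped Data: Let $M_i$ be the midpoint of the $i$th class and $f_i$ the number of observations in the $i$th class, and let $N = \sum f_i$ (for a population) or $n = \sum f_i$ (for a sample) be the total number of observations. The population variance of grouped data is given by the expression $\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$. The sample variance is given by $s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$.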
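The grouped-data calculations can be sketched in Python as below. Everything specific here is an assumption for illustration: the frequency distribution is made up, each class is represented as a (lower limit, upper limit, frequency) tuple, and the function names are hypothetical.

```python
def grouped_mean(classes):
    """Mean of grouped data: class midpoints weighted by class frequencies."""
    total = sum(freq for _, _, freq in classes)
    weighted = sum(freq * (low + high) / 2 for low, high, freq in classes)
    return weighted / total


def grouped_variance(classes, sample=True):
    """Variance of grouped data; divides by n - 1 for a sample, N for a population."""
    total = sum(freq for _, _, freq in classes)
    mean = grouped_mean(classes)
    squared = sum(freq * ((low + high) / 2 - mean) ** 2 for low, high, freq in classes)
    return squared / (total - 1 if sample else total)


# Hypothetical frequency distribution: (lower limit, upper limit, frequency).
classes = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]

print(grouped_mean(classes))
print(grouped_variance(classes, sample=False))
print(round(grouped_variance(classes, sample=True), 3))
```

Because every observation in a class is replaced by the class midpoint, these results approximate the values that would be computed from the raw data.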

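Section 4.10 Proportions Proportions: A proportion measures the fraction of a group that possesses some characteristic. To calculate the proportion, simply count the number in the group that possess the characteristic and divide the count by the number in the group. Let $x$ be the number in the group that possess the characteristic, and let $N$ (for a population) or $n$ (for a sample) be the number in the group. The population proportion is $p = \frac{x}{N}$ and the sample proportion is $\hat{p} = \frac{x}{n}$. The symbol $\hat{p}$ is pronounced p-hat.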

Section 4.10 Proportions Example: Suppose your statistics class is composed of 48 students of which 4 are left-handed. What proportion of the class is left-handed? What proportion of the class is right-handed? Solution: The proportion of left-handed students is $\frac{4}{48} \approx .083$. Then .083 is the proportion of people in the class that are left-handed. The remaining 48 – 4 = 44 students are right-handed, and $\frac{44}{48} \approx .917$. Then .917 is the proportion of people in the class that are right-handed.
