Numerical Methods for Describing Data

Chapter 4 Numerical Methods for Describing Data

Describing the Center of a Data Set
Section 4.1 Describing the Center of a Data Set

Population characteristic—a fixed value about a population that is typically unknown
Suppose we want to know the MEAN length of all the fish in Lake Sam Rayburn . . . Is this a value that is known? Can we find it out? At any given point in time, how many values are there for the mean length of fish in the lake?

Statistic—a value calculated from a sample
Suppose we want to know the MEAN length of all the fish in Lake Sam Rayburn. What can we do to estimate this unknown population characteristic?

Measures of Central Tendency
mode: the observation that occurs most often
Can be more than one mode
If all values occur only once, there is no mode
Not used as often as mean & median
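The rules for the mode can be sketched in a few lines of Python. This is only an illustration: the function name `modes` and the small data sets are made up for the example and are not from the slides.

```python
from collections import Counter

def modes(values):
    """Return all modes in sorted order; return [] when no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs only once: there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 3, 3, 5, 7, 7]))  # two values tie for most frequent
print(modes([1, 2, 3]))           # no repeated value
```

Returning a list makes the "more than one mode" and "no mode" cases explicit instead of forcing a single answer.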

Measures of Central Tendency
The mean of a set of numerical observations is just the familiar arithmetic average: the sum of the observations divided by the number of observations.

Important Notations x = the variable for which we have sample data
n = the number of observations in the sample (the sample size) x1 = the first observation in the sample x2 = the second observation in the sample… xn = the nth (last) observation in the sample

Battery Life Example We might have a sample consisting of n = 4 observations on x = battery lifetime (in hours): x1 = 5.9, x2 = 7.3, x3 = 6.6, x4 = 5.7 x1 is just the first observation in the data set and not necessarily the smallest observation xn is the last observation but not necessarily the largest

More Notation The sum of x1, x2, …, xn can be denoted by x1 + x2 + … + xn, but writing this out could be a daunting task for a large sample. The Greek letter Σ (pronounced sigma) is traditionally used in mathematics to denote summation. In particular, Σx denotes the sum of all the x values in the data set under consideration.

Sample Mean The sample mean of a numerical sample x1, x2, x3, …, xn, denoted x̄, is x̄ = (x1 + x2 + … + xn) / n = Σx / n
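As a quick check of the definition, the sample mean of the four battery lifetimes from the Battery Life Example can be computed directly. The variable names are chosen for this sketch only.

```python
# Battery lifetimes (in hours) from the Battery Life Example: n = 4 observations.
battery_life = [5.9, 7.3, 6.6, 5.7]

n = len(battery_life)          # sample size
x_bar = sum(battery_life) / n  # sum of the observations divided by n

print(f"n = {n}, sample mean = {x_bar:.3f} hours")
```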

Fancytown Example During a two-week period, 10 houses were sold in Fancytown. Calculate the sample mean. The average (or mean) price for this sample of 10 houses in Fancytown is $291,000.

Lowtown Example During a two-week period, 10 houses were sold in Lowtown. Calculate the sample mean. The average (or mean) price for this sample of 10 houses in Lowtown is $295,000. The dotplot for this sample shows an outlier.

Reflections on the Sample Mean Calculations
Looking at the dotplots of the samples for Fancytown and Lowtown we can see that the mean, $295,000, appears to accurately represent the “center” of the data for Fancytown, but it is not representative of the Lowtown data. Clearly, the mean can be greatly affected by the presence of even a single outlier.

Describing the Center of a Data Set with the arithmetic mean
The population mean, denoted by µ, is the average of all x values in the entire population.

Important Note The value of x̄ varies from sample to sample.
There is only one value for µ.

Drawback with the Mean One potential drawback to the mean as a measure of center for a data set is that its value can be greatly affected by the presence of even a single outlier (an unusually large or small observation) in the data set.

Describing the Center of a Data Set with the median
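The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list). Then the sample median is the single middle value if n is odd, or the average of the two middle values if n is even.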

Population Median The population median is the middle value in the ordered list consisting of all population observations. The population median plays the same role for the population as the sample median plays for the sample.

The Median The stability of the median in the presence of outliers is what justifies its use as a measure of center in some situations. Income distributions are commonly summarized by reporting the median rather than the mean, because otherwise a few very high salaries could result in a mean that is not representative of a typical salary.

Median Calculation Consider the Fancytown data. Calculate the median house value for Fancytown. First, we put the data in numerical increasing order to get: 225, , , ,000 291, , , ,000 311, ,000

Median Calculation
Since there is an even number of data values, the median is the mean of the two values in the middle: median = (291,000 + 299,000) / 2 = $295,000
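The odd and even cases of the median can be sketched as a small function. The name `sample_median` and the short test lists are illustrative assumptions; only the two middle Fancytown values (291,000 and 299,000) come from the slides.

```python
def sample_median(values):
    """Median of a sample: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)  # repeated values stay in the list
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(sample_median([7, 3, 9]))         # odd n: the single middle value
print(sample_median([7, 3, 9, 5]))      # even n: average of the two middle values
print(sample_median([291000, 299000]))  # the two middle Fancytown values
```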

Another Median Calculation
Consider the Lowtown data. Calculate the median house value for Lowtown. We put the data in numerical increasing order to get: 93, , , ,000 100, , , ,000 122, ,000,000

Imagine a ruler with pennies placed at 3”, 4”, 5”, 6”, 8” and 10”.
To balance the ruler on your finger, you would need to place your finger at the mean of 6. The mean is the balance point of a distribution

Comparing the Sample Mean & Sample Median

Comparing the Sample Mean & Sample Median
Notice from the preceding pictures that the median splits the area in the distribution in half and the mean is the point of balance. Typically, when a distribution is skewed positively, the mean is larger than the median; when a distribution is skewed negatively, the mean is smaller than the median; and when a distribution is symmetric, the mean and the median are equal.

Mean vs. Median In a skewed distribution, the mean is pulled in the direction of the skewness. In a symmetrical distribution, you should report the mean! In a skewed distribution, the median should be reported as the measure of center!

The Trimmed Mean A trimmed mean is computed by first ordering the data values from smallest to largest, deleting a selected number of values from each end of the ordered list, and finally computing the mean of the remaining values. The trimming percentage is the percentage of values deleted from each end of the ordered list.

The Trimmed Mean Purpose is to remove outliers from a data set
To calculate a trimmed mean:
1. Multiply the percent to trim by n
2. Truncate that many observations from BOTH ends of the distribution (when listed in order)
3. Calculate the mean with the shortened data set

So remove one observation from each side!
Find the mean of the following set of data, then find the 10% trimmed mean. Mean = 23.8. Since 10%(10) = 1, remove one observation from each side!
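The trimming steps can be sketched as follows. The data values of the slide's example are not reproduced in this transcript, so the list below is hypothetical, and `trimmed_mean` is a name invented for the sketch.

```python
def trimmed_mean(values, trim_percent):
    """Mean after deleting trim_percent% of the observations from EACH end."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(trim_percent * n // 100)  # number to remove from each end, truncated
    kept = ordered[k:n - k]
    if not kept:
        raise ValueError("trimming percentage removes every observation")
    return sum(kept) / len(kept)

# Hypothetical sample of n = 10 values with one unusually large observation.
data = [10, 12, 13, 14, 15, 16, 17, 18, 20, 65]

print(sum(data) / len(data))   # ordinary mean, pulled up by the value 65
print(trimmed_mean(data, 10))  # 10% trimmed mean drops 10 and 65
```

With n = 10 and 10% trimming, one observation is removed from each end, matching the calculation 10%(10) = 1.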

FancyTown Trimmed Mean
Calculate the 10% trimmed mean for FancyTown

Summary of Trimmed Means
A trimmed mean with a small to moderate trimming percentage (between 5% and 25%) is less affected by outliers than the mean, but it is not as insensitive as the median.

Is the median affected by extreme values? NO
Is the mean affected by extreme values? YES

Sample Proportion for Categorical Data
The sample proportion of successes, denoted by p̂, is p̂ = (number of S’s in the sample) / n, where S is the label used for the response designated as success. The population proportion of successes is denoted by p.

Tampering with Automobile Antipollution Equipment Example
The use of antipollution equipment on automobiles has substantially improved air quality in certain areas. Unfortunately, many car owners have tampered with smog control devices to improve performance. Suppose that a sample of 15 cars is selected and that each car is classified as S or F, according to whether or not tampering has taken place. The resulting data are: S F S S S F F S S F S S S F F. If we treat S as a success, the sample proportion (of successes) is: p̂ = 9/15 = 0.60

Example Tampering with Automobile Antipollution Equipment
That is, 60% of the sample responses are S’s. In 60% of the cars sampled, there has been tampering with the air pollution control devices.
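The 60% figure can be verified by counting the S responses in the sample. This is a small sketch; the variable names are chosen for illustration.

```python
# The 15 responses from the tampering example (S = tampering has taken place).
responses = "S F S S S F F S S F S S S F F".split()

n = len(responses)
successes = responses.count("S")
p_hat = successes / n  # sample proportion of successes

print(f"{successes} successes out of {n}: sample proportion = {p_hat:.2f}")
```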

Describing Variability in a Data Set
Section 4.2 Describing Variability in a Data Set

Why is the study of variability important?
Does this can of soda contain exactly 12 ounces?
There is variability in virtually everything
Allows us to distinguish between usual & unusual values
Reporting only a measure of center doesn’t provide a complete picture of the distribution.

Notice that these three data sets all have the same mean and median (at 45), but they have very different amounts of variability.

Describing Variability
The simplest numerical measure of the variability of a numerical data set is the range, which is defined to be the difference between the largest and smallest data values. range = maximum - minimum

Calculating Range Calculate the range for each data set from the previous example: The first two data sets have a range of 50 (70-20) but the third data set has a much smaller range of 10.

Describing Variability
The n deviations from the sample mean are the differences: x1 - x̄, x2 - x̄, …, xn - x̄. Note: The sum of all of the deviations from the sample mean will be equal to 0, except possibly for the effects of rounding the numbers. This means that the average deviation from the mean is always 0 and cannot be used as a measure of variability.

Calculating Deviations from the Sample Mean
Suppose we caught a sample of 6 fish from the lake with the following lengths: 3”, 4”, 5”, 6”, 8”, 10” Calculate the deviations from the sample mean. What must we find first?

Now find how each observation deviates from the mean.
The mean is considered the balance point of the distribution because it “balances” the positive and negative deviations. Each difference x - x̄ is the deviation from the mean.
x      x - x̄
3      3 - 6 = -3
4      -2
5      -1
6      0
8      2
10     4
Sum    0
What is the sum of the deviations from the mean? 0. Will this sum always equal zero? YES
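The deviations for the six fish lengths can be reproduced with a short sketch; the variable names are illustrative.

```python
# Fish lengths (inches) from the sample of 6 fish.
lengths = [3, 4, 5, 6, 8, 10]

x_bar = sum(lengths) / len(lengths)        # sample mean: find this first
deviations = [x - x_bar for x in lengths]  # x - x_bar for each observation

print(x_bar)
print(deviations)
print(sum(deviations))  # positive and negative deviations cancel
```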

Notes on Deviations A particular deviation is positive if the x value exceeds x̄ and negative if the x value is less than x̄. In general, the greater the amount of variability in the sample, the larger the magnitudes (ignoring the signs) of the deviations.

Measures of Variability
What can we do to the deviations so that we could find an average? Suppose we caught a sample of 6 fish from the lake with the following lengths: 3”, 4”, 5”, 6”, 8”, 10”. The mean length is 6 inches. Recall that we calculated the deviations from the mean. What was the sum of these deviations? Can we find an average deviation? Another measure of the variability in a data set uses the deviations from the mean (x - x̄). The estimated average of the squared deviations is called the variance. The sample variance is denoted by s²: s² = Σ(x - x̄)² / (n - 1), where n - 1 is the degrees of freedom. The population variance is denoted by σ².

Degrees of freedom will be revisited in Chapter 8.
The customary way to prevent negative and positive deviations from counteracting one another is to square them before combining. When calculating sample variance, we use degrees of freedom (n – 1) in the denominator instead of n because this tends to produce better estimates. See page 189 for more information. Suppose that everyone in the class caught a sample of 6 fish from the lake. Would each of our samples contain the same fish? Would our mean lengths be the same? The samples would also have different ranges!

Remember the sample of 6 fish that we caught from the lake . . .
Find the variance of the length of the fish. First square the deviations. (Finding the average of the deviations themselves would always give 0!)
x      x - x̄    (x - x̄)²
3      -3       9
4      -2       4
5      -1       1
6      0        0
8      2        4
10     4        16
Sum    0        34
What is the sum of the deviations squared? 34. Divide this by 5: s² = 34 / 5 = 6.8

Sample Standard Deviation
The sample standard deviation, denoted s, is the positive square root of the sample variance. The population standard deviation is denoted by σ.

Sample Variance A large amount of variability in the sample is indicated by a relatively large value of s² or s, whereas a value of s² or s close to 0 indicates a small amount of variability. For most statistical purposes, s is the desired quantity, but s² must be computed first. The most commonly used measures of center and variability are the mean and standard deviation, respectively.

Measures of Variability
Calculate the standard deviation for the fish sample. s² = 6.8 inches², so s = √6.8 ≈ 2.608 inches. The fish in our sample deviate from the mean of 6 inches by an average of about 2.608 inches.
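The variance and standard deviation of the fish sample can be checked with a few lines of Python. This sketch spells out the n - 1 denominator by hand; the variable names are illustrative.

```python
import math

lengths = [3, 4, 5, 6, 8, 10]  # fish lengths in inches
n = len(lengths)
x_bar = sum(lengths) / n

# Sum of squared deviations, divided by the degrees of freedom (n - 1).
sum_sq_dev = sum((x - x_bar) ** 2 for x in lengths)
variance = sum_sq_dev / (n - 1)
std_dev = math.sqrt(variance)  # positive square root of the sample variance

print(f"sum of squared deviations = {sum_sq_dev}")
print(f"s^2 = {variance:.1f} square inches, s = {std_dev:.3f} inches")
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which also use the n - 1 denominator.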

Apple Weight Example A sample of 10 Macintosh apples were randomly selected and weighed (in ounces). Calculate the standard deviation of the sample.

Interquartile Range Interquartile range (iqr): the range of the middle half of the data. What advantage does the interquartile range have over the standard deviation? The iqr is resistant to extreme values.

iqr The iqr is based on quantities called quartiles.
The lower quartile separates the bottom 25% of the data set from the upper 75%, and the upper quartile separates the top 25% of the data set from the bottom 75%.

Quartiles

Finding Quartiles The quartiles for sample data are obtained by dividing the n ordered observations into a lower half and an upper half: if n is odd, the median is excluded from both halves.

Quartiles and the Interquartile Range
Lower Quartile (Q1) = median of the lower half of the data set. Upper Quartile (Q3) = median of the upper half of the data set. The interquartile range (iqr), a resistant measure of variability is given by iqr = upper quartile – lower quartile = Q3 – Q1 Note: If n is odd, the median is excluded from both the lower and upper halves of the data.

Quartiles and IQR Example
A sample of 15 students with part time jobs were randomly selected and the number of hours worked last week was recorded. Find the interquartile range for this set of data. 19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15 The data is put in increasing order to get 2, 4, 7, 8, 9, 10, 10, 10, 11, 12, 12, 14, 15, 19, 25

Quartiles and IQR Example
With 15 data values, the median is the 8th value. Specifically, the median is 10.
Lower half: 2, 4, 7, 8, 9, 10, 10
Median: 10
Upper half: 11, 12, 12, 14, 15, 19, 25
Lower quartile (Q1) = 8
Upper quartile (Q3) = 14
iqr = 14 – 8 = 6
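The quartile rule used here (split the ordered data into halves, excluding the median when n is odd) can be sketched as below. The function name `quartiles` is an assumption for this example. Other software may use different quartile conventions and give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (median excluded if n is odd)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)

# Hours worked last week by the 15 students in the example.
hours = [19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15]

q1, q3 = quartiles(hours)
iqr = q3 - q1
print(f"median = {median(hours)}, Q1 = {q1}, Q3 = {q3}, iqr = {iqr}")
```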

Find the interquartile range for this set of data.
The Chronicle of Higher Education ( issue) published the accompanying data on the percentage of the population with a bachelor’s or higher degree in 2007 for each of the 50 states and the District of Columbia. Find the interquartile range for this set of data.

First put the data in order & find the median (26). Find the lower quartile (Q1) by finding the median of the lower half (24). Find the upper quartile (Q3) by finding the median of the upper half (30). iqr = 30 – 24 = 6

Quartiles and iqr The resistant nature of the interquartile range follows from the fact that up to 25% of the smallest sample observations and up to 25% of the largest sample observations can be made more extreme without affecting the value of the interquartile range.

Special Note on Rounding
Protection against adverse rounding effects can almost always be achieved by using four digits of decimal accuracy.

Summarizing a Data Set: Boxplots
Section 4.3 Summarizing a Data Set: Boxplots

Boxplots A boxplot is a picture that conveys information about the most important features of a data set: center, spread, extent of skewness, and presence of outliers.

Boxplots What are some advantages of boxplots?
ease of construction
convenient handling of outliers
construction is not subjective (like histograms)
used with medium or large size data sets (n > 10)
useful for comparative displays

Boxplots When to Use: Univariate numerical data
The five-number summary is the smallest observation, first quartile, median, third quartile, and largest observation.
Use for moderate to large data sets. Don’t use with data sets of n < 10.
How to construct a Skeleton Boxplot:
Calculate the five-number summary
Draw a horizontal (or vertical) scale
Construct a rectangular box from the lower quartile (Q1) to the upper quartile (Q3)
Draw a line inside the rectangle at the median value
Draw lines from the lower quartile to the smallest observation and from the upper quartile to the largest observation
To describe: comment on the center, spread, and shape of the distribution and whether there are any unusual features.

Draw lines for the whiskers Draw a line for the median
Remember the data on the percentage of the population with a bachelor’s or higher degree in 2007 for each of the 50 states and the District of Columbia.
First draw a scale
Draw a box from Q1 to Q3
Draw lines for the whiskers
Draw a line for the median

Outliers An observation is an outlier if it is more than 1.5 iqr away from the closest end of the box (less than the lower quartile minus 1.5 iqr or more than the upper quartile plus 1.5 iqr). An outlier is extreme if it is more than 3 iqr from the closest end of the box, and it is mild otherwise.

Modified Boxplots A modified boxplot represents mild outliers by shaded circles and extreme outliers by open circles. Whiskers extend on each end to the most extreme observations that are not outliers.

Draw lines for the whiskers Place a solid dot for the outlier
Remember the data on the percentage of the population with a bachelor’s or higher degree in 2007 for each of the 50 states and the District of Columbia. First, draw the scale, box and the line for the median. Next calculate the fences for outliers: 24 - 1.5(6) = 15, 30 + 1.5(6) = 39, 30 + 3(6) = 48. There is one outlier at the upper end of the distribution, but none at the lower end. Is it extreme? At 47% it lies between 39 and 48, so it is a mild outlier. Draw lines for the whiskers. Place a solid dot for the outlier. To describe: The distribution of percent of the population with a bachelor’s degree or higher for the U.S. states and District of Columbia is positively skewed with an outlier at 47%. The median percentage is at 26% with a range of 30%.
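The fence calculations can be sketched as a small classifier. The function name `classify` and its return labels are assumptions made for this illustration; the quartiles 24 and 30 and the value 47 come from the example above.

```python
def classify(value, q1, q3):
    """Label an observation using the 1.5 iqr and 3 iqr fences."""
    iqr = q3 - q1
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme outlier"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"

q1, q3 = 24, 30  # quartiles of the bachelor's degree data
iqr = q3 - q1
print("inner fences:", q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print("upper outer fence:", q3 + 3 * iqr)
print("47% is:", classify(47, q1, q3))
```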

Symmetrical boxplots
Approximately symmetrical boxplot: Notice that the range of the lower half and the range of the upper half of this distribution are approximately equal, so we can say that it is approximately symmetrical.
Skewed boxplot: However, the ranges of the two halves of this distribution are definitely different sizes, so it would be skewed in the direction of the longest side.
Notice that all 3 boxplots are identical, but their corresponding histograms are very different. Can you determine the number of modes from a boxplot?

Discuss the similarities and differences.
The salaries of NBA players published on the web site hoopshype.com were used to construct the comparative boxplot of salary data for five teams. Discuss the similarities and differences. See page 198 for more information.

Modified Boxplot Example
Consider the ages of the 79 students from the classroom data set from Chapter 3. Create a modified boxplot for the data below.
iqr = 22 – 19 = 3
Lower quartile – 3 iqr = 19 – 9 = 10
Lower quartile – 1.5 iqr = 19 – 4.5 = 14.5
Upper quartile + 1.5 iqr = 22 + 4.5 = 26.5
Upper quartile + 3 iqr = 22 + 9 = 31
[Figure: the ordered ages with the lower quartile, median, upper quartile, moderate outliers, and extreme outliers marked]

Here is the modified boxplot for the student age data. [Figure: modified boxplot; the whiskers end at the smallest and largest data values that aren’t outliers, and the mild and extreme outliers are marked individually]

Here is the same boxplot reproduced with a vertical orientation. [Figure: vertical boxplot, axis from 15 to 50]

Comparative Boxplot Example
By putting boxplots of two separate groups or subgroups side by side we can compare their distributional behaviors. Describe the similarities and differences between the two groups. The distributional patterns of female and male student weights have similar shapes, although the females are roughly 20 pounds lighter (as a group). [Figure: comparative boxplot of Student Weight by Gender (Females, Males)]

Comparative Boxplot Example

Section 4.4 Interpreting Center and Variability: Chebyshev’s Rule, the Empirical Rule, and z Scores

Interpreting Center & Variability
Chebyshev’s Rule: The percentage of observations that are within k standard deviations of the mean is at least 100(1 - 1/k²)%, where k > 1. If k = 2, then at least 75% of the observations are within 2 standard deviations of the mean. This rule can be used with any distribution, no matter its shape!

Interpreting Variability Chebyshev’s Rule
For specific values of k, Chebyshev’s Rule reads:
At least 75% of the observations are within 2 standard deviations of the mean.
At least 89% of the observations are within 3 standard deviations of the mean.
At least 90% of the observations are within 3.16 standard deviations of the mean.
At least 94% of the observations are within 4 standard deviations of the mean.
At least 96% of the observations are within 5 standard deviations of the mean.
At least 99% of the observations are within 10 standard deviations of the mean.

For a sample of families with one preschool child, it was reported that the mean child care time per week was approximately 36 hours with a standard deviation of approximately 12 hours. Using Chebyshev’s rule, at least 75% of the sample observations must be between 12 and 60 hours (within 2 standard deviations of the mean). At most, what percent of the observations are greater than 72 hours? At least 89% of the observations are between 0 & 72 hours. Since time can’t be negative, at most 11% of the observations are above 72 hours.
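The child care calculation can be reproduced with a short sketch of Chebyshev's lower bound, 100(1 - 1/k²)%. The function names are invented for this illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule is only informative for k > 1")
    return 100 * (1 - 1 / k ** 2)

def interval(mean, sd, k):
    """Endpoints of the interval mean - k*sd to mean + k*sd."""
    return (mean - k * sd, mean + k * sd)

mean, sd = 36, 12  # child care hours per week from the example
for k in (2, 3):
    low, high = interval(mean, sd, k)
    print(f"at least {chebyshev_lower_bound(k):.1f}% between {low} and {high} hours")
```

For k = 3 the bound is about 88.9%, which is why at most about 11% of the observations can lie above 72 hours.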

Example - Chebyshev’s Rule
Consider the student age data Color code: within 1 standard deviation of the mean within 2 standard deviations of the mean within 3 standard deviations of the mean within 4 standard deviations of the mean within 5 standard deviations of the mean

Example - Chebyshev’s Rule
Summarizing the student age data
Interval                                        Chebyshev’s    Actual
within 1 standard deviation of the mean         ≥ 0%           72/79 = 91.1%
within 2 standard deviations of the mean        ≥ 75%          75/79 = 94.9%
within 3 standard deviations of the mean        ≥ 88.8%        76/79 = 96.2%
within 4 standard deviations of the mean        ≥ 93.8%        77/79 = 97.5%
within 5 standard deviations of the mean        ≥ 96.0%        79/79 = 100%
Notice that Chebyshev gives very conservative lower bounds and the values aren’t very close to the actual percentages.

What’s my area? Input the following command into a graphing calculator in order to graph a normal curve with a mean of 20 and standard deviation of 3: Y1 = normalpdf(X,20,3) (Window x: [10,30] y: [0,0.2]) Use the command 2nd trace, 7 to find the area under the curve for the following (round to 4 decimal places):
Lower limit: 17 Upper limit: 23 Area: ____________________
Lower limit: 14 Upper limit: 26 Area: ____________________
Lower limit: 11 Upper limit: 29 Area: ____________________

What pattern do you notice?
What’s my area? Graph a normal curve with a mean of 50 and standard deviation of 5. Y1 = normalpdf(X,50,5) (x: [30,70] y: [0,0.1]) Find the area under the curve for the following:
Lower limit: 45 Upper limit: 55 Area: ________
Lower limit: 40 Upper limit: 60 Area: ________
Lower limit: 35 Upper limit: 65 Area: ________
What pattern do you notice?
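For readers without a graphing calculator, the same areas can be found with Python's standard-library `statistics.NormalDist` (Python 3.8 or later). This is a substitute for the calculator steps, not the method used on the slides, and `area_between` is a helper name invented for the sketch.

```python
from statistics import NormalDist

def area_between(mean, sd, lower, upper):
    """Area under the normal curve with the given mean and sd between two limits."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)

# Both exercises: limits at 1, 2 and 3 standard deviations on each side of the mean.
for mean, sd in [(20, 3), (50, 5)]:
    for k in (1, 2, 3):
        area = area_between(mean, sd, mean - k * sd, mean + k * sd)
        print(f"mean={mean}, sd={sd}, within {k} sd: {area:.4f}")
```

The areas depend only on how many standard deviations the limits are from the mean, which is the pattern the exercise asks about.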

Chebyshev’s Rule Chebyshev’s Rule states that at least 75% of the observations in a data set are within 2 standard deviations of the mean. However, in many data sets substantially more than 75% of the values satisfy this condition.

Interpreting Center & Variability
Empirical Rule:
Approximately 68% of the observations are within 1 standard deviation of the mean
Approximately 95% of the observations are within 2 standard deviations of the mean
Approximately 99.7% of the observations are within 3 standard deviations of the mean
Can ONLY be used with distributions that are mound shaped!

The height of male students at PWSH is approximately normally distributed with a mean of 71 inches and standard deviation of 2.5 inches. What percent of the male students are
a) Shorter than 66 inches? About 2.5%
b) Taller than 73.5 inches? About 16%
c) Between 66 & 73.5 inches? About 81.5%
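The three answers follow from the Empirical Rule percentages and the symmetry of the normal curve. The arithmetic can be sketched as below; the dictionary and function names are illustrative assumptions.

```python
# Empirical Rule: approximate percent of observations within k standard deviations.
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}

def tail_percent(k):
    """Approximate percent beyond k sd on ONE side (the curve is symmetric)."""
    return (100 - WITHIN[k]) / 2

mean, sd = 71, 2.5
k_low = int((mean - 66) / sd)     # 66 inches is 2 sd below the mean
k_high = int((73.5 - mean) / sd)  # 73.5 inches is 1 sd above the mean

shorter = tail_percent(k_low)     # a) shorter than 66 inches
taller = tail_percent(k_high)     # b) taller than 73.5 inches
between = 100 - shorter - taller  # c) between 66 and 73.5 inches

print(shorter, taller, between)
```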

Empirical Rule vs. Chebyshev’s Rule
The Empirical Rule makes “approximately” instead of “at least” statements, and the percentages for k = 1, 2, and 3 standard deviations are much higher than those allowed by Chebyshev’s Rule.

Empirical Rule vs. Chebyshev’s Rule
In contrast to Chebyshev’s Rule, dividing the percentages in half is permissible because a normal curve is symmetric.

Empirical Rule Another reminder!!
The Empirical Rule can only be used if the histogram of values in a data set is reasonably symmetric and unimodal (specifically, is reasonably approximated by a normal curve).

Empirical Rule It is unusual to see an observation from a normally distributed population that is farther than 2 standard deviations from the mean (only 5%), and it is very surprising to see one that is more than 3 standard deviations away.

z Scores The z score is how many standard deviations the observation is from the mean. A positive z score indicates the observation is above the mean and a negative z score indicates the observation is below the mean. The z score corresponding to a particular observation in a data set is calculated as: z score = (observation - mean) / standard deviation

What do these z scores mean?
-2.3: 2.3 standard deviations below the mean
1.8: 1.8 standard deviations above the mean
-4.3: 4.3 standard deviations below the mean

z Scores Computing the z score is often referred to as standardization and the z score is called a standardized score.

Sally is taking two different math achievement tests with different means and standard deviations. The mean score on test A was 56 with a standard deviation of 3.5, while the mean score on test B was 65 with a standard deviation of Sally scored a 62 on test A and a 69 on test B. On which test did Sally score the best? z-score on test A z-score on test B She did better on test A.
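Standardization can be sketched as a one-line function. Only test A is computed here, because the standard deviation for test B is missing from the text above. The function name `z_score` is an assumption for this example.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations the observation lies from the mean."""
    return (observation - mean) / sd

# Test A: mean 56, standard deviation 3.5, Sally's score 62.
z_a = z_score(62, 56, 3.5)
print(f"z score on test A = {z_a:.2f}")
```

The same call with test B's mean, standard deviation, and Sally's score of 69 gives the second z score; the larger of the two z scores identifies the better relative performance.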

Measures of Relative Standing
percentiles: the rth percentile is a value in the data set where r percent of the observations fall AT or BELOW that value.

In addition to weight and length, head circumference is another measure of health in newborn babies. The National Center for Health Statistics reports the following summary values for head circumference (in cm) at birth for boys.
Percentile                 5      10     25     50     75     90     95
Head circumference (cm)    32.2   33.2   34.5   35.8   37.0   38.2   38.6
What percent of newborn boys had head circumferences greater than 37.0 cm? 25%
10% of newborn babies have head circumferences bigger than what value? 38.2 cm
