 Introduction to Summary Statistics

Introduction to Summary Statistics
Introduction to Engineering Design Unit 3 – Measurement and Statistics Introduction to Summary Statistics

Introduction to Summary Statistics
Statistics: The collection, evaluation, and interpretation of data. Statistical analysis of measurements can help verify the quality of a design or process.

Introduction to Summary Statistics
Central Tendency: the “center” of a distribution (mean, median, mode). Variation: the spread of values around the center (range, standard deviation, interquartile range). Distribution: a summary of the frequency of values (frequency tables, histograms, normal distribution). [click] The average value of a variable (like rainfall depth) is one type of statistic that indicates central tendency; it gives an indication of the center of the data. The median and mode are two other indications of central tendency. But often we need more detail on how much a quantity can vary. [click] Statistical dispersion is the variability or spread of data. The range, standard deviation, and interquartile range are indications of dispersion. [click] Even more detail about a variable can be shown by a frequency distribution, which summarizes how the data values are distributed throughout the range of values. Frequency distributions can be shown by frequency tables, histograms, box plots, or other summaries.

Introduction to Summary Statistics
Mean Central Tendency The mean is the sum of the values of a set of data divided by the number of values in that data set. The mean is the most frequently used measure of central tendency. It is strongly influenced by outliers which are very large or very small data values that do not seem to fit with the majority of data.

Introduction to Summary Statistics
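Mean Central Tendency $\mu = \frac{\sum x_i}{N}$ μ = mean xi = individual data value ( x1, x2, x3, …) N = number of data values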

Introduction to Summary Statistics
Mean Central Tendency Data Set: Sum of the values = 243 Number of values = 11 Mean = 243 / 11 ≈ 22.09
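The mean calculation can be sketched in a few lines of Python. This is only an illustration: the function name `mean` is an assumption made here, the first check reuses the totals quoted on the slide (sum 243, count 11), and the second borrows the data array from the standard deviation example in this presentation.

```python
def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Totals quoted on the slide: sum of the values = 243, number of values = 11.
print(round(243 / 11, 2))    # 22.09

# Data array from the standard deviation example in this presentation.
data = [2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63]
print(round(mean(data), 2))  # 47.64
```

Note how the two small values (2 and 5) pull the second mean well below most of the data, which is the outlier sensitivity described earlier.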

Introduction to Summary Statistics
Mode Central Tendency Measure of central tendency The most frequently occurring value in a set of data is the mode Symbol is M Data Set:

Introduction to Summary Statistics
Mode Central Tendency The most frequently occurring value in a set of data is the mode Data Set: There are two occurrences of 21. [click] Every other data value has only one occurrence. Therefore, 21 is the mode. [click] Mode = M = 21

Introduction to Summary Statistics
Mode Central Tendency The most frequently occurring value in a set of data is the mode. Bimodal Data Set: Two numbers of equal frequency stand out Multimodal Data Set: If more than two numbers of equal frequency stand out

Introduction to Summary Statistics
Mode Central Tendency Determine the mode of 48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55. Mode = 63. Determine the mode of 48, 63, 62, 59, 58, 2, 63, 5, 60, 59, 55. Mode = 63 & 59 (bimodal). Determine the mode of 48, 63, 62, 59, 48, 2, 63, 5, 60, 59, 55. Mode = 63, 59, & 48 (multimodal).
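One way to check these three answers is to count occurrences and keep every value that ties for the highest count. The sketch below is a minimal illustration; the helper name `modes` is an assumption, and it returns a list so that bimodal and multimodal sets are reported in full.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, largest first."""
    counts = Counter(values)
    highest = max(counts.values())  # raises ValueError for an empty data set
    return sorted((v for v, c in counts.items() if c == highest), reverse=True)


print(modes([48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55]))  # [63]
print(modes([48, 63, 62, 59, 58, 2, 63, 5, 60, 59, 55]))  # [63, 59]
print(modes([48, 63, 62, 59, 48, 2, 63, 5, 60, 59, 55]))  # [63, 59, 48]
```

If every value occurs exactly once, this helper returns all of them, so a caller may want to treat that case as "no mode".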

Median Central Tendency
Introduction to Summary Statistics Median Central Tendency Measure of central tendency. The median is the value that occurs in the middle of a set of data that has been arranged in numerical order. Symbol is x̃, pronounced “x-tilde”. The median divides the data into two sets which contain an equal number of data values.

Median Central Tendency
Introduction to Summary Statistics Median Central Tendency The median is the value that occurs in the middle of a set of data that has been arranged in numerical order. Data Set:

Median Central Tendency
Introduction to Summary Statistics Median Central Tendency For a data set that contains an odd number of values, the median is the single middle value, so it is always one of the data values. Data Set:

Median Central Tendency
Introduction to Summary Statistics Median Central Tendency For a data set that contains an even number of values, the two middle values are averaged with the result being the Median. Data Set:
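The odd and even cases can be sketched as follows. The function name `median` is illustrative, the odd-sized set is the data array from the standard deviation example in this presentation, and the even-sized set is a hypothetical one made by dropping its last value.

```python
def median(values):
    """Return the middle value of the sorted data (or the average of the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle value.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_set = [2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63]  # 11 values
print(median(odd_set))   # 58

even_set = odd_set[:-1]  # hypothetical 10-value set
print(median(even_set))  # 56.5
```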

Introduction to Summary Statistics
Range Variation Measure of data variation. The range is the difference between the largest and smallest values that occur in a set of data. Symbol is R Data Set: Range = R = 44 – 3 = 41
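A short sketch of the range calculation follows. The name `data_range` is an assumption (chosen to avoid shadowing Python's built-in `range`). Only the largest and smallest values matter, so the first check uses just the two extremes quoted above.

```python
def data_range(values):
    """Return the difference between the largest and smallest values."""
    return max(values) - min(values)


# Extremes quoted on the slide: largest = 44, smallest = 3.
print(data_range([3, 44]))  # 41

# Data array from the standard deviation example in this presentation.
print(data_range([2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63]))  # 61
```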

Standard Deviation Variation
Introduction to Summary Statistics Standard Deviation Variation Measure of data variation. The standard deviation is a measure of the spread of data values. A larger standard deviation indicates a wider spread in data values

Standard Deviation Variation
Introduction to Summary Statistics Standard Deviation Variation $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$ σ = standard deviation xi = individual data value ( x1, x2, x3, …) μ = mean N = size of population

Standard Deviation Variation
Introduction to Summary Statistics Standard Deviation Variation Procedure: 1. Calculate the mean, μ. 2. Subtract the mean from each value and then square each difference. 3. Sum all squared differences. 4. Divide the summation by the size of the population (number of data values), N. 5. Calculate the square root of the result. Note that this is the formula for the population standard deviation, which statisticians distinguish from the sample standard deviation. This formula provides the standard deviation of the data set used in the calculation. We will later differentiate between the population standard deviation and the sample standard deviation.

Introduction to Summary Statistics
Standard Deviation Calculate the standard deviation for the data array 2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63 1. Calculate the mean: 524 / 11 ≈ 47.64. 2. Subtract the mean from each data value and square each difference: (xi – 47.64)² for each of the 11 data values. Since we are given a finite data set, and are not told otherwise, we assume we have a data value for each member of the entire population. We use the POPULATION standard deviation. Find the mean of the data. [3 clicks] Find the difference between the mean and each data value and square each difference. [many clicks]

Standard Deviation Variation
Introduction to Summary Statistics Standard Deviation Variation 3. Sum all squared differences: 5,024.55. 4. Divide the summation by the number of data values: 5,024.55 / 11 ≈ 456.78. 5. Calculate the square root of the result: √456.78 ≈ 21.4. 3. Sum the squared differences. [click] 4. Divide the sum by the number of data values. [click] 5. Take the square root of the quotient. [click]
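The five-step procedure for this data array can be traced in Python. This is a sketch rather than a reference implementation: the function name `population_std_dev` and the intermediate variable names are assumptions made for illustration.

```python
import math


def population_std_dev(values):
    """Population standard deviation, following the five-step procedure."""
    n = len(values)
    mu = sum(values) / n                             # 1. calculate the mean
    squared_diffs = [(x - mu) ** 2 for x in values]  # 2. square each difference from the mean
    total = sum(squared_diffs)                       # 3. sum the squared differences
    variance = total / n                             # 4. divide by the number of data values
    return math.sqrt(variance)                       # 5. take the square root


data = [2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63]
mu = sum(data) / len(data)
print(round(mu, 2))                                 # 47.64
print(round(sum((x - mu) ** 2 for x in data), 2))   # 5024.55
print(round(population_std_dev(data), 1))           # 21.4
```

Python's standard library offers `statistics.pstdev` for the same quantity, which can serve as a cross-check.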

A Note about Standard Deviation
Introduction to Summary Statistics A Note about Standard Deviation Two distinct calculations Population Standard Deviation The measure of the spread of data within a population. Used when you have a data value for every member of the entire population of interest. Sample Standard Deviation An estimate of the spread of data within a larger population. Used when you do not have a data value for every member of the entire population of interest. Uses a subset (sample) of the data to generalize the results to the larger population. [click] We have just calculated the population standard deviation of a data set. For that calculation, we were calculating the standard deviation of the data set only – the data set was the entire population. [click] If you are given data for only a portion of the population of interest and would like to estimate the standard deviation for the larger population, you would use the sample standard deviation formula.

A Note about Standard Deviation
Introduction to Summary Statistics A Note about Standard Deviation Population Standard Deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$ σ = population standard deviation xi = individual data value ( x1, x2, x3, …) μ = population mean N = size of population Sample Standard Deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ s = sample standard deviation xi = individual data value ( x1, x2, x3, …) x̄ = sample mean n = size of sample [click] So, for instance, if you are asked to measure the height of each student in your class, and then are asked to find the standard deviation of those heights, you would use the POPULATION standard deviation. You would have a data value for every member of the population: the students in your class. We call it POPULATION standard deviation because the value is based on the entire population. [click] However, if you are asked to estimate the height of all of the high school students in your county (and you believe that your class provides a good representation on which to base that estimate), you would use the SAMPLE standard deviation. In this case, you would have only a sample (subset) of the heights of the entire population, since your class is a subset of the county high school population. We call this the SAMPLE standard deviation because the value is based on a sample of the entire population. [click] Notice that the main difference between the two formulas is the denominator. The population standard deviation uses N, the population size. The sample standard deviation uses n – 1, which is one less than the size of the sample used in the calculation.

Sample Standard Deviation Variation
Introduction to Summary Statistics Sample Standard Deviation Variation So, the procedure to find the sample standard deviation is basically the same procedure we used to find the population standard deviation. Notice that here we use the sample mean and the n – 1 denominator.

Sample Mean Central Tendency
Introduction to Summary Statistics Sample Mean Central Tendency Essentially the same calculation as population mean The sample mean is simply the mean of the data values in the sample. This calculation is essentially no different than the mean calculation presented earlier. We are just using a different variable to represent the number of data values. [click] Here we use lower case n to represent the number of data values in the sample, as opposed to upper case N which represents the size of the larger population. The sample is always smaller than the population that the sample represents. But when you have all of the data values for a population (n = N) you are really calculating the population mean.

Sample Standard Deviation
Introduction to Summary Statistics Sample Standard Deviation Estimate the standard deviation for a population for which the following data is a sample. 2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63 1. Calculate the sample mean: 524 / 11 ≈ 47.64. 2. Subtract the sample mean from each data value and square the difference: (xi – 47.64)² for each of the 11 data values. In this case, the data is a sample of a larger population for which we would like to estimate the standard deviation. Therefore, we use the sample standard deviation formula. Find the sample mean of the data. [3 clicks] Although we now call it a sample mean because it is based on a sample of the population, the mean is the same as we calculated previously. Find the difference between the sample mean and each data value and square each difference. [many clicks] Again, because the data is the same (although now considered just a sample of the larger population) and the mean is the same as previously calculated, this step yields the same result as the corresponding step in the previous population standard deviation calculation.

Sample Standard Deviation Variation
Introduction to Summary Statistics Sample Standard Deviation Variation 3. Sum all squared differences: 5,024.55. 4. Divide the summation by the number of sample data values minus one: 5,024.55 / 10 ≈ 502.45. 5. Calculate the square root of the result: √502.45 ≈ 22.4. 3. Sum the squared differences. [click] Again, this step yields the same result as the corresponding step in the population standard deviation calculation. 4. Divide the sum by one less than the number of data values in the sample (the sample size minus one). [click] Here we use n – 1 = 10 in the denominator. 5. Take the square root of the quotient. [click] So the standard deviation of the larger population predicted by the sample standard deviation formula is approximately 22.4, slightly different from the population standard deviation of the data set (which was 21.4).
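The only change from the population calculation is the n – 1 denominator, which the following sketch makes explicit. The function name `sample_std_dev` is an assumption; `statistics.stdev` and `statistics.pstdev` are the standard-library functions for the sample and population standard deviations and are used here only as a cross-check.

```python
import math
import statistics


def sample_std_dev(values):
    """Sample standard deviation: divide the summed squared differences by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two data values are required")
    sample_mean = sum(values) / n
    total = sum((x - sample_mean) ** 2 for x in values)
    return math.sqrt(total / (n - 1))


data = [2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63]
print(round(sample_std_dev(data), 1))      # 22.4
print(round(statistics.stdev(data), 1))    # 22.4 (sample formula, n - 1)
print(round(statistics.pstdev(data), 1))   # 21.4 (population formula, N)
```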

A Note about Standard Deviation
Introduction to Summary Statistics A Note about Standard Deviation Population Standard Deviation Sample σ = population standard deviation xi = individual data value ( x1, x2, x3, …) μ = population mean N = size of population The two different formulas for standard deviation can be confusing and are often misapplied. However, it is important to note that if you compare the population standard deviation to the estimated standard deviation of that same population provided by a sample standard deviation, the sample standard deviation generally tends to get closer and closer to the population standard deviation as the sample size increases. [click] In mathematical terms we say, as the sample size approaches the population size (n → N), the sample standard deviation approaches the population standard deviation (s → σ). We use the arrow to represent “approaches”. In other words, the larger the sample, the better the estimate of the population standard deviation. As n → N, s → σ

A Note about Standard Deviation
Introduction to Summary Statistics A Note about Standard Deviation Population Standard Deviation Sample Given the ACT score of every student in your class, use the population standard deviation formula to find the standard deviation of ACT scores in the class. σ = population standard deviation xi = individual data value ( x1, x2, x3, …) μ = population mean N = size of population So, for example, suppose you are interested in the spread of the ACT scores of the students in your class. If you have the ACT score of every student in your class you would use the population standard deviation formula to determine the standard deviation. [click]

A Note about Standard Deviation
Introduction to Summary Statistics A Note about Standard Deviation Population Standard Deviation Sample Given the ACT scores of every student in your class, use the sample standard deviation formula to estimate the standard deviation of the ACT scores of all students at your school. σ = population standard deviation xi = individual data value ( x1, x2, x3, …) μ = population mean N = size of population However, if you wanted to estimate the spread of ACT scores of all of the students in your high school using only the scores of students from your class (and you felt that your class is a good representation of the students that attend your school), you would use the sample standard deviation formula to estimate the standard deviation of the larger population, and you might get fairly close to the actual standard deviation for the entire school. [click]

Histogram Distribution
Introduction to Summary Statistics Histogram Distribution A histogram is a common data distribution chart that is used to show the frequency with which specific values, or values within ranges, occur in a set of data. An engineer might use a histogram to show the variation of a dimension that exists among a group of parts that are intended to be identical.

Histogram Distribution
Introduction to Summary Statistics Histogram Distribution Large sets of data are often divided into a limited number of groups. These groups are called class intervals. Class Intervals: -16 to -6, -5 to 5, 6 to 16

Histogram Distribution
Introduction to Summary Statistics Histogram Distribution The number of data elements in each class interval is shown by the frequency, which occurs along the Y-axis of the graph. [Histogram: Frequency (Y-axis) versus class intervals -16 to -6, -5 to 5, 6 to 16 (X-axis)]

Histogram Distribution
Introduction to Summary Statistics Histogram Distribution Example Data: 1, 7, 15, 4, 8, 8, 5, 12, 10 Ordered: 1, 4, 5, 7, 8, 8, 10, 12, 15 Let’s create a histogram to represent this data set. [click] For this example, we will break the data into ranges of 1 to 5, 6 to 10, and 11 to 15. So, we will place labels on the x-axis to indicate these ranges. [click] Let’s reorder the data to make it easier to divide into the ranges. [click] [click] [Histogram axes: Frequency (y-axis) versus ranges 1 to 5, 6 to 10, 11 to 15 (x-axis)]

Histogram Distribution
Introduction to Summary Statistics Histogram Distribution The height of each bar in the chart indicates the number of data elements, or frequency of occurrence, within each range. Data: 1, 4, 5, 7, 8, 8, 10, 12, 15 The height of each bar in a histogram indicates the number of data elements, or frequency of occurrence, within each range. Now, looking at the data, you can see there are three data values in the set that are in the range 1 to 5. [click] There are four data values in the range 6 to 10. [click] And there are two data values in the range 11 to 15. [click] Note that you can always determine the number of data points in a data set from the histogram by adding the frequencies from all of the data ranges. In this case we add three, four, and two to get a total of nine data points. [Histogram: Frequency (y-axis) versus ranges 1 to 5, 6 to 10, 11 to 15 (x-axis)]
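Counting the values in each range is easy to sketch in code. The helper name `class_frequencies` and the text rendering with `#` characters are illustrative assumptions; the data and the three ranges are the ones used in this example.

```python
def class_frequencies(values, intervals):
    """Count how many values fall in each inclusive (low, high) class interval."""
    return [sum(1 for v in values if low <= v <= high) for low, high in intervals]


data = [1, 7, 15, 4, 8, 8, 5, 12, 10]
intervals = [(1, 5), (6, 10), (11, 15)]

frequencies = class_frequencies(data, intervals)
print(frequencies)       # [3, 4, 2]
print(sum(frequencies))  # 9, the number of data points

# A rough text histogram: one '#' per data value in each range.
for (low, high), count in zip(intervals, frequencies):
    print(f"{low:>2} to {high:<2} | " + "#" * count)
```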

Histogram Distribution
Introduction to Summary Statistics Histogram Distribution This histogram represents the side length of twenty-seven small wooden cubes. The minimum measurement was in. [click] and the maximum length measurement was inches. Each data value between the min and max falls within one of the classes on the horizontal (x-) axis. Note that in this case, each value indicated on the horizontal axis actually represents a class interval - a range of values. For instance, represents the values ≤ x < Due to the precision of the measurement device, the data are recorded to the thousandth (e.g ) and will therefore correspond to one of the values shown on the axis. The bars are drawn to show the number of cubes that have a side length within each interval – the frequency of occurrence. For instance, of the twenty-seven cubes, two have a side length of in. (and therefore fall within the ≤ x < class interval). [click] And four have a measured length of in. (and fall within the ≤ x < class interval). [click] MINIMUM = in. MAXIMUM = in. Class Intervals

Introduction to Summary Statistics
Dot Plot Distribution Data: 3, -1, -3, 3, 2, 1, -1, -1, 2, 1, 1, -1, -2, 1, 2, 1, -2, -4 Another way to represent data is a dot plot. A dot plot is similar to a histogram in that it shows the frequency of occurrence of data values. To represent the data in the table, we place a dot for each data point directly above the data value. [click] So there is one dot for each data point. [click many times] [Dot plot: horizontal axis from -6 to 6]

Introduction to Summary Statistics
Dot Plot Distribution Data: 3, -1, -3, 3, 2, 1, -1, -1, 2, 1, 1, -1, -2, 1, 2, 1, -2, -4 Dot plots are easily changed to histograms by replacing the dots with bars of the appropriate height indicating the frequency. [click] [Histogram: Frequency (vertical axis) versus data value from -6 to 6 (horizontal axis)]

Normal Distribution Distribution
Introduction to Summary Statistics Normal Distribution Distribution “Is the data distribution normal?” Translation: Is the histogram/dot plot bell-shaped? Does the greatest frequency of the data values occur at about the mean value? Does the curve decrease on both sides away from the mean? Is the curve symmetric about the mean?

Normal Distribution Distribution
Introduction to Summary Statistics Normal Distribution Distribution Bell-shaped curve. In this example the data values are fairly evenly distributed about the mean. Approximately half of the values that are not mean values are less than the mean and approximately half are greater than the mean. And the frequency of occurrence decreases as the value of the data point moves farther away from the mean. The data appears to form a bell-shaped curve. [click] This data set looks to be normally distributed. [Figure: Frequency (vertical axis) versus Data Elements from -6 to 6 (horizontal axis)]

Normal Distribution Distribution
Introduction to Summary Statistics Normal Distribution Distribution Does the greatest frequency of the data values occur at about the mean value? The highest frequency of values in this example occurs at zero, which is the mean value of the data set. [Figure: Frequency (vertical axis) versus Data Elements from -6 to 6 (horizontal axis), with the mean value marked]

Normal Distribution Distribution
Introduction to Summary Statistics Normal Distribution Distribution Does the curve decrease on both sides away from the mean? The highest frequency of values in this example occurs at zero, which is the mean value of the data set. [Figure: Frequency (vertical axis) versus Data Elements from -6 to 6 (horizontal axis), with the mean value marked]

Normal Distribution Distribution
Introduction to Summary Statistics Normal Distribution Distribution Is the curve symmetric about the mean? The highest frequency of values in this example occurs at zero, which is the mean value of the data set. [Figure: Frequency (vertical axis) versus Data Elements from -6 to 6 (horizontal axis), with the mean value marked]

What if things are not equal?
Introduction to Summary Statistics What if things are not equal? Although a normal distribution is the most common probability distribution in statistics and science, some quantities are not normally distributed. A visual analysis can help you decide if a normal distribution is a good representation for your data (although mathematical tests are usually necessary). If the data is skewed you should not assume a normal distribution of data. This histogram shows that the data is skewed to the right, that is, there is a longer “tail” to the right. Histogram Interpretation: Skewed (Non-Normal) Right

Introduction to Summary Statistics
Normal Distribution Distribution If the data are normally distributed: 68% of the observations fall within 1 standard deviation of the mean. 95% of the observations fall within 2 standard deviations of the mean. 99.7% of the observations fall within 3 standard deviations of the mean. Many quantities tend to follow a normal distribution – heights of people, test scores, errors in measurement, etc. Given normally distributed data, 68% of the data values should fall within 1 standard deviation of the mean, 95% should fall within 2 standard deviations of the mean and 99.7 % should fall within 3 standard deviations of the mean. Of course, with small samples/populations, these percentages may not hold exactly true because the number of values will not allow divisions to this precision.
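The 68%, 95%, and 99.7% figures can be checked against the theoretical normal curve. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the helper name `fraction_within` is an illustrative assumption.

```python
from statistics import NormalDist


def fraction_within(k):
    """Fraction of a normal distribution that lies within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k) * 100:.1f}%")
# within 1 standard deviation(s): 68.3%
# within 2 standard deviation(s): 95.4%
# within 3 standard deviation(s): 99.7%
```

The percentages quoted on the slide are these values rounded for easy recall; real data sets, especially small ones, will only approximate them.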

Normal Distribution Example
Introduction to Summary Statistics Normal Distribution Example Data from a sample of a larger population Let’s assume that this data was gathered from a sample taken from a larger population. Assume that we are interested in finding statistics for the larger population. The sample data values are fairly evenly distributed about the mean. Approximately half of the values that are not mean values are less than the mean and approximately half are greater than the mean. And, the frequency of occurrence decreases as the value of the data point moves farther away from the mean. The data appears to form a bell shaped curve. [click] This data set looks to be normally distributed. The mean is 0.08. The sample standard deviation formula is used to estimate the standard deviation of the larger population and is found to be 1.77.

Normal Distribution Distribution
Introduction to Summary Statistics Normal Distribution Distribution 68%: mean – s = 0.08 – 1.77 = -1.69 and mean + s = 0.08 + 1.77 = 1.85. Since the data appears to be normally distributed, we can estimate that approximately 68% of the population data will fall within one standard deviation of the mean. [4 clicks] That is, about two thirds of the data will be between -1.69 and 1.85. [Figure: distribution of Data Elements with the interval from mean – s to mean + s marked]

Normal Distribution Distribution
Introduction to Summary Statistics Normal Distribution Distribution 95%: mean – 2s = 0.08 – 3.54 = -3.46 and mean + 2s = 0.08 + 3.54 = 3.62. And, again, because the data is assumed to be normally distributed, we can estimate that 95% of the population data will fall within 2 standard deviations of the mean. That is, approximately 95% of the data will fall between -3.46 and 3.62. [Figure: distribution of Data Elements with the interval from mean – 2s to mean + 2s marked]
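The interval arithmetic for this example is simple enough to sketch directly. The function name `interval_around_mean` is an assumption made for illustration; the mean (0.08) and sample standard deviation (1.77) are the values stated for this data set.

```python
def interval_around_mean(mean, std_dev, k):
    """Return (lower, upper) bounds lying k standard deviations on either side of the mean."""
    return (round(mean - k * std_dev, 2), round(mean + k * std_dev, 2))


mean, s = 0.08, 1.77

print(interval_around_mean(mean, s, 1))  # (-1.69, 1.85), about 68% of the data
print(interval_around_mean(mean, s, 2))  # (-3.46, 3.62), about 95% of the data
```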