Numerical Statistics Given a set of data (numbers and a context) we are interested in how to describe the entire set without listing all the elements.


1 Numerical Statistics Given a set of data (numbers and a context) we are interested in how to describe the entire set without listing all the elements. Two important characteristics of the set are its center and how “spread out” it is.

2 Measures of Center The mode of a set of data is the observation, or observations, that occur most often. The mean is the sum of all values divided by the number of observations. The median is the middle observation after the observations are ordered from smallest to largest.

3 Median Order the data smallest to largest.
If there is an odd number (n) of observations in a data set, the median is the value in position (n+1)/2 of the ordered list. If there is an even number of observations, take the average (mean) of the two middle values. The Excel command Median can also be used.

4 Mode(s) of a data set The mode of a set of data is the value that occurs most often. Sometimes this value is unique, but other times it is not. The mode of the set {1, 2, 2, 3} is 2, while the set {1, 2, 3} has three modes: 1, 2, and 3. The Excel command Mode can be used to calculate the mode of a set of numbers, but it reports only one mode, not multiple modes.
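The three measures of center can be checked quickly outside Excel. The following is a minimal sketch using Python's standard `statistics` module (Python 3.8 or later is assumed for `multimode`); the variable names are illustrative only.

```python
from statistics import mean, median, multimode

data = [1, 2, 2, 3]

center_mean = mean(data)        # sum of all values / number of observations
center_median = median(data)    # averages the two middle values when n is even
center_modes = multimode(data)  # list of every value tied for most frequent

print(center_mean, center_median, center_modes)

# A set with no repeated values: every value counts as a mode.
no_repeat_modes = multimode([1, 2, 3])
print(no_repeat_modes)
```

Unlike a function that returns a single mode, `multimode` returns all tied values, which matches the definition of mode used on this slide.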

5 Geometric Descriptions
Given a histogram (or probability density curve), the modes are the highest points on the curve. The median is the point at which the area under the curve to its right equals the area under the curve to its left. The mean is the point at which a cut-out of the curve would balance.
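One way to see the balance-point description numerically: the signed distances from the mean cancel exactly, while the signed distances from the median generally do not. This is a small sketch on an illustrative data set, not part of the original slides.

```python
from statistics import median

data = [1, 2, 6, 8, 9]

mean_value = sum(data) / len(data)
median_value = median(data)

# Signed distances from a candidate balance point; they sum to zero
# (up to floating-point rounding) only when that point is the mean.
net_about_mean = sum(x - mean_value for x in data)
net_about_median = sum(x - median_value for x in data)

print(mean_value, net_about_mean)
print(median_value, net_about_median)
```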

6 The Mean is the Balance point

7 Measures of Spread: Range
The simplest way to describe how spread out a data set is to calculate the largest value minus the smallest value. The result is called the range of the data set. For example, {1, 2, 6, 8, 9} has a range of 9-1=8.

8 Percentiles The p-th percentile of a set of data is a value such that at most p% of the data points are less than it and at most (100-p)% of the data points are greater than it.

9 Percentiles as area

10 For example, consider the 8 numbers 1, -1, 2, -2, 3, -3, 4, -4.
To find the 30th percentile, we first order the data smallest to largest: -4, -3, -2, -1, 1, 2, 3, 4. Next, we calculate the product of the sample size (8) and the desired proportion 0.3 (30%). The result is 8 * 0.3 = 2.4. Since 2.4 is not an integer, we round up to the smallest integer larger than 2.4, which is 3. Therefore the third number from the left in the ordered list, -2, is the 30th percentile.

11 Calculation of Percentiles
To calculate the p-th percentile of a set of data:
1. Order the data smallest to largest.
2. Multiply the size of the sample by the desired percentage represented as a decimal. For example, if the data set contains 50 points, to find the 30th percentile we multiply 50 * .30 = 15.
3. If this number (sample size times percentage) is not an integer, round it up to the next integer and find the corresponding data value from the ordered list. If this number is an integer, locate the corresponding data value from the ordered list and take the average of this data value with the next larger data value on the ordered list.
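The three steps above can be sketched in Python as follows. The function name `percentile` is hypothetical, and the sketch assumes p is a whole-number percentage strictly between 0 and 100 so that the integer test is exact.

```python
def percentile(values, p):
    """p-th percentile using the round-up / average rule described above.

    Assumes p is an integer with 0 < p < 100 (an assumption of this sketch).
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)                 # step 1: order the data
    n = len(ordered)
    whole, remainder = divmod(n * p, 100)    # step 2: n * (p / 100), kept exact
    if remainder != 0:
        # Not an integer: round up to position whole + 1 (1-based),
        # which is index `whole` in a 0-based list.
        return ordered[whole]
    # Integer: average the value at that position with the next larger one.
    return (ordered[whole - 1] + ordered[whole]) / 2


sample = [1, -1, 2, -2, 3, -3, 4, -4]
print(percentile(sample, 30))
```

Working with `divmod(n * p, 100)` instead of multiplying by a decimal avoids floating-point rounding deciding whether the product counts as an integer.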

12 Percentiles in Excel Excel calculates percentiles in a non-standard fashion by treating the data as continuous rather than discrete. The main difference is that the percentiles calculated by Excel are rarely points in the data set, whereas the method previously described always results in the percentile being one of the data points or halfway between two adjacent points. We therefore recommend not using the percentile worksheet function in Excel.

13 Quartiles The 25th, 50th, and 75th percentiles are also known as the first, second and third quartiles of the data set. To calculate these values we recommend using Excel to find the median, which divides the data in half, and then finding the median of the lower half and the median of the upper half.

14 Quartiles For example, consider the data set 1, 2, 3, 4, 5, 6, 7, 8. The median is 4.5; this is the second quartile. Taking the median of the lower four numbers gives the first quartile, 2.5. Taking the median of the upper four numbers gives the third quartile, 6.5. Note these are close to, but not equal to, the values returned by the Excel worksheet function quartile (which is a special case of Excel's percentile function).
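The median-of-halves approach can be sketched as below. The function name `quartiles` is hypothetical. The slides only show an even-sized example, so the handling of odd-sized sets (leaving the middle value out of both halves) is an assumption of this sketch.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) by taking the median of each half.

    Assumption: for an odd number of observations the middle value is
    excluded from both halves. Requires at least two observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # smallest half of the data
    upper = ordered[n - half:]    # largest half of the data
    return median(lower), median(ordered), median(upper)


print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))
```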

15 Measuring Spread: Interquartile Range
The interquartile range is the distance between the first and third quartiles. Note that this is the range of the middle half of the data. This measure is mainly used to describe data sets that have a few scattered points far away from a central group. For example, the set below has a range of 127-1=126, suggesting the data has more spread than is actually the case. {1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 127}
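To compare the two measures on the set above, here is a minimal sketch that computes both the range and the interquartile range, with quartiles taken as the medians of the lower and upper halves of the ordered data. Variable names are illustrative.

```python
from statistics import median

data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 127]

ordered = sorted(data)
half = len(ordered) // 2

q1 = median(ordered[:half])                 # median of the lower half
q3 = median(ordered[len(ordered) - half:])  # median of the upper half

data_range = ordered[-1] - ordered[0]       # largest minus smallest
iqr = q3 - q1                               # spread of the middle half

print("range:", data_range)
print("interquartile range:", iqr)
```

The single value 127 dominates the range but has no effect on the interquartile range.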

16 Quartiles and Box Plots (Five number summaries)
A box plot is a graphical representation showing the lowest value, the first, second and third quartiles, and the largest value. These five numbers are represented by drawing a box from the first to the third quartile, with a vertical line at the median, and then extending “whiskers” to the smallest and largest values.

17 Box Plot

18 Box Plots in Excel Excel does not have a built-in function to construct box plots, but there is a worksheet called boxplot.xls that you can use to construct box plots from Excel data. This worksheet makes use of one added convention: separate treatment of outliers. As we have seen, some data sets have scattered points that are far from the center of the data. If these points lie more than 1.5 times the interquartile range beyond the first or third quartile, they are considered “outliers” and are shown as single points unconnected to the box on a boxplot.
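The outlier convention can be sketched in Python. This is not the logic of boxplot.xls itself, which is not shown in the slides; it assumes the common convention that the 1.5 * interquartile range distance is measured from the first and third quartiles, and the function name `find_outliers` is hypothetical.

```python
from statistics import median

def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Assumption: distances are measured below Q1 and above Q3, with
    quartiles computed as medians of the lower and upper halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    low_fence = q1 - k * iqr     # anything below this is an outlier
    high_fence = q3 + k * iqr    # anything above this is an outlier
    return [x for x in ordered if x < low_fence or x > high_fence]


data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 127]
print(find_outliers(data))
```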

19 Results from Boxplot.xls
[Figure: box plot output from boxplot.xls. Axis ticks run from 2.67 to 26.7 in steps of 2.67; labels: "No Outliers", "Outliers".]

20 Measures of Spread: Population Variance
To calculate the variance of a population, first calculate the population mean, $\mu$, then measure the distance between each observation and the mean, $x_i - \mu$, square these distances, $(x_i - \mu)^2$, then finally calculate the average of these squared distances by summing them and dividing by the number of observations $N$: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$.
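The steps above translate directly into a short calculation. This is a minimal sketch with a hypothetical function name, cross-checked against `statistics.pvariance` from the Python standard library.

```python
from statistics import pvariance

def population_variance(values):
    """Average of the squared distances from the population mean."""
    n = len(values)
    mu = sum(values) / n                                 # population mean
    squared_distances = [(x - mu) ** 2 for x in values]  # (x_i - mu)^2
    return sum(squared_distances) / n                    # divide by N


data = [1, 2, 6, 8, 9]
print(population_variance(data))
print(pvariance(data))   # standard-library result for comparison
```

For {1, 2, 6, 8, 9} the mean is 5.2, the squared distances sum to 50.8, and the population variance is 50.8 / 5 = 10.16.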

21 Standard deviation The population standard deviation, $\sigma$, is calculated by taking the square root of the population variance, $\sigma^2$. In Excel, use the command =stdevp(reference cells); note that =stdev(reference cells) divides by n-1 rather than n and so returns the sample standard deviation instead. As a rough rule of thumb, the standard deviation is about one fourth to one sixth of the range. Although it is tedious to calculate by hand, we shall see that it is by far the most important measure of spread of a data set.
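As a small illustration, the population standard deviation can be computed as the square root of the population variance and compared with the sample version, which divides by n-1 instead of n. The sketch below uses Python's standard library, where `pstdev` is the population form and `stdev` is the sample form; the data set is illustrative.

```python
import math
from statistics import pstdev, stdev

data = [1, 2, 6, 8, 9]

n = len(data)
mu = sum(data) / n
population_var = sum((x - mu) ** 2 for x in data) / n  # divide by N
sigma = math.sqrt(population_var)                      # population standard deviation

print("population standard deviation:", sigma)
print("library population value:", pstdev(data))
print("library sample value (n - 1):", stdev(data))
```

The sample value is always at least as large as the population value for the same data, so it matters which of the two a spreadsheet function returns.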

