Measures of Location Statistics of location Statistics of dispersion

INFERENTIAL STATISTICS & DESCRIPTIVE STATISTICS

Statistics of location: summarise a central point
Statistics of dispersion: summarise the distribution around that central point

When a set of data has been collected, the first thing we want to do is summarise it. This can be done with frequency distributions, as discussed in the previous chapter on data types. Often, however, we want a numerical summary of the data. These numerical summaries are called descriptive statistics, and they fall into two categories: statistics of location and statistics of dispersion. Statistics of location summarise the central point of the data along a number line, and statistics of dispersion summarise how the observations are distributed about that central point.

Recall that we use descriptive statistics to summarise the important characteristics of a data set, and inferential statistics to generalise about a larger population from what we observe in a smaller sample of that population. In this chapter we will discuss several measures of location, also known as measures of central tendency, and then look at how data are dispersed around them.

Measures of Location ARITHMETIC MEAN
Sum all observations, then divide by the number of observations.

For a sample: $\bar{X} = \frac{\sum X}{n}$
For a population: $\mu = \frac{\sum X}{N}$

The first measure of location we will examine is the arithmetic mean, known to lay people simply as "the average". It represents the centre of the observations in a sample frequency distribution. Calculating the mean is very simple: sum all the observations, then divide by the number of observations.

If X is the letter we use to denote our sample variable, then X with a bar over it represents the sample mean of all our sample observations. Remember that we use different notation for a sample than for a population: Greek letters for population parameters, and Latin letters for sample statistics. The sample mean is therefore written $\bar{X}$ ("X bar"), and the population mean $\mu$ (mu).

Because the mean is so simple to calculate, and because it has other properties that are useful in inferential statistics, it is the most commonly reported statistic of location. One problem with the mean, though, is that it is strongly influenced by extreme values.
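The calculation described above can be sketched in a few lines of Python. The sleep values and the helper name `arithmetic_mean` are made up for illustration and are not the data behind the slide; the second call shows how a single extreme observation pulls the mean upwards.

```python
# Hypothetical nightly hours of sleep for five people (not the slide's data).
hours = [6.5, 7.0, 7.5, 8.0, 6.0]


def arithmetic_mean(values):
    """Sum all observations, then divide by the number of observations."""
    return sum(values) / len(values)


sample_mean = arithmetic_mean(hours)

# Adding one extreme observation (16 hours) shifts the mean noticeably.
mean_with_outlier = arithmetic_mean(hours + [16.0])

print(sample_mean, mean_with_outlier)
```

With these assumed values the mean moves from 7.0 to 8.5 hours, even though five of the six people slept 8 hours or less.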

Measures of Location
[Figure: bar chart of No. of People against Nightly Hours of Sleep (3.5 to 11.5 hours), with the sample mean marked at $\bar{X} = 7.07$.]

In this example, we have added all the observation values (the number of hours each person slept) and divided by the number of people in the sample, giving an average of 7.07 hours of sleep per night for this particular sample group.

Measures of Location MEDIAN
The value that has an equal number of observations (n) on either side.

[Figure: frequency distribution of Frequency (1 to 5) against Score (33 to 40). For N = 15 the median is the eighth score = 37.]

The second measure we will examine is called the median. This is defined as the value that has an equal number of observations on either side of it. It divides the frequency distribution in half, relative to the number of observations.

Measures of Location MEDIAN
The value that has an equal number of observations (n) on either side.

[Figure: frequency distribution of Frequency (1 to 5) against Score (33 to 40). For N = 16 the median is the average of the eighth and ninth scores = 37.5.]

If there is an even number of observations, then no single observation has as many observations above it as below it. In this case, the median is calculated by averaging the middle two observations. In the example here, the average of the eighth and ninth observations was calculated. Note also that, unlike the mean, the median is unaffected by a few very large or very small values.
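A minimal sketch of the odd and even cases follows. The score lists are hypothetical (the slide's exact frequencies are not recoverable from the transcript); they were chosen only so that the results match the slide values of 37 and 37.5. The function name `median_of` is likewise an illustration.

```python
def median_of(values):
    """Middle value of the sorted data; average the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical scores, N = 15: the eighth score is the median.
scores_odd = [33, 34, 35, 36, 36, 37, 37, 37, 38, 38, 39, 39, 39, 40, 40]
# One more observation gives N = 16: average the eighth and ninth scores.
scores_even = scores_odd + [40]

median_odd = median_of(scores_odd)
median_even = median_of(scores_even)

# Replacing the largest score with an extreme value leaves the median unchanged.
median_with_outlier = median_of(scores_odd[:-1] + [400])

print(median_odd, median_even, median_with_outlier)
```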

Measures of Location MODE
The most frequently occurring score value.
Corresponds to the highest point on the frequency distribution.

[Figure: frequency distribution of Frequency (1 to 5) against Score (33 to 45). For a given sample N = 16, the mode = 39.]

The last measure of location that we will examine is the mode. This is simply the most common observation in the data. If there are two most common values, the distribution is said to be bimodal and it has two distinct peaks. The mode is not often used, because it contains very little usable information, and you seldom see it reported in the scientific literature. It can, however, be interesting to report the number of modes detected in a population or sample if there is more than one.
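As a rough sketch, the mode can be found by counting how often each value occurs. The helper `modes` and both data sets are assumptions made for illustration; the function returns every value that shares the highest count, so a bimodal sample yields two values.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


# Hypothetical scores with a single most common value.
unimodal = [33, 35, 37, 39, 39, 39, 41]
# Hypothetical scores with two equally common values (bimodal).
bimodal = [34, 34, 34, 36, 38, 40, 40, 40]

print(modes(unimodal))
print(modes(bimodal))
print(len(modes(bimodal)), "modes detected")
```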

Measures of Location Measures of central tendency: Summary

Mode
  Advantages: quick and easy to compute; useful for nominal data
  Disadvantages: poor sampling stability
Median
  Advantages: not affected by extreme scores
  Disadvantages: somewhat poor sampling stability
Mean
  Advantages: sampling stability; related to variance
  Disadvantages: inappropriate for discrete data; affected by skewed distributions

The mean is by far the most commonly reported statistic. It is easy to work with mathematically, but its disadvantage is that it is greatly affected by extreme values. The median is less commonly reported; its advantage is that it is not greatly affected by outliers. The mode is rarely used, since it does not convey much information about the set of data.

Measures of Location DISPERSION
These are measures of how the observations are distributed around the mean.

Besides measures of location such as the mean, we also want to measure how the observations are distributed around the mean. In other words, do most of the observations lie close to the mean, or are they spread far from it? This characteristic is called the dispersion, and there are several ways in which it can be calculated. As usual, most of the time we will be dealing with only a sample of the population, so what we calculate is a sample statistic that estimates the corresponding population parameter of dispersion.

Measures of Location DISPERSION: Range
The first of these measures that we will examine is the range. This is simply the lowest value subtracted from the highest value. It is therefore greatly affected by any outliers, and it gives very little specific information about how the observations cluster around the mean. It is a poor estimator of dispersion in the population and is seldom used. If it is reported, it should be reported together with other measures of dispersion.
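The sensitivity of the range to outliers is easy to see in a small sketch. The scores and the helper name `value_range` below are hypothetical and serve only as an illustration.

```python
def value_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)


# Hypothetical scores that cluster tightly.
scores = [33, 35, 36, 37, 37, 38, 39, 40]
base_range = value_range(scores)

# A single outlier changes the range completely, although the
# other eight observations are exactly where they were.
range_with_outlier = value_range(scores + [95])

print(base_range, range_with_outlier)
```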

Measures of Location DISPERSION: Variance

Name        Score   Deviation
Amy          10       -40
Theo         20       -30
Max          30       -20
Henry        40       -10
Leticia      50         0
Charlotte    60        10
Pedro        70        20
Tricia       80        30
Lulu         90        40
SUM                     0

mean = 50

To see how "deviant" the distribution is relative to another, we could sum these deviation scores. But this would leave us with a big fat zero.

The second of these measures that we will examine is the variance. This measure describes the dispersion of the data about an estimate of central tendency such as the mean. If the data points are all close to the mean, then variability is low. If data points are dispersed widely around the mean, then variability is high.

If we want an estimate of dispersion about the mean, the first thing to do is to take each data point in the sample and subtract the mean. This quantifies the distance of each point from the mean. To get an overall picture of variability, we could sum these deviations. Unfortunately, they always add up to zero, because the negative and positive deviations cancel.
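The cancellation can be checked directly with the nine scores from the slide. This is only a sketch; the variable names are chosen for illustration.

```python
# Scores from the slide (Amy through Lulu).
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]

mean = sum(scores) / len(scores)

# Deviation of each score from the mean: negative below it, positive above it.
deviations = [x - mean for x in scores]

# The negative and positive deviations cancel exactly.
total_deviation = sum(deviations)

print(mean, deviations, total_deviation)
```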

Measures of Location DISPERSION: Variance

Name        Score   Deviation   Sq. of deviation
Amy          10       -40          1600
Theo         20       -30           900
Max          30       -20           400
Henry        40       -10           100
Leticia      50         0             0
Charlotte    60        10           100
Pedro        70        20           400
Tricia       80        30           900
Lulu         90        40          1600
SUM                     0          6000

So we use squared deviations from the mean, which are then summed. This is the sum of squares (SS):

$SS = \sum (X - \bar{X})^2$

To calculate the variance, therefore, we must first calculate the squared deviation of each observation from the mean, and then sum these values. The result is the sample sum of squares, commonly abbreviated to SS. This is a very important term and will be used often. It is our first estimate of variability.
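A short sketch reproduces the SS of 6000 from the slide's scores. The function name `sum_of_squares` is an assumption made for this example.

```python
def sum_of_squares(values):
    """Sum of squared deviations of each observation from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


# Scores from the slide (Amy through Lulu).
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]

ss = sum_of_squares(scores)
print(ss)
```

Squaring makes every term non-negative, so the deviations can no longer cancel, and SS is zero only when every observation equals the mean.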

Measures of Location DISPERSION: Variance

We take the "average" squared deviation from the mean and call it VARIANCE.

For a sample: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
(dividing by n - 1 corrects for the fact that the sample variance tends to underestimate the population variance)

For a population: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

Next, we divide the sample sum of squares by the sample size minus one to get the sample variance, which we denote $s^2$ (s squared). If we wanted the population variance, we would divide the population sum of squares by the size of the population; this is denoted $\sigma^2$ (sigma squared). Measuring the whole population is often impossible, though, and the best estimate of the population variance is the sample SS divided by the sample size minus one, as already described.

The variance is also referred to as the "mean square". Dividing the sample SS by n - 1 yields an unbiased estimator of the population variance, and the term (n - 1) is called the degrees of freedom. Because the sum of squares can range from zero to infinity, so can the variance. A variance can never be negative.
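The two divisors can be compared on the nine scores used on the earlier variance slides (SS = 6000). This is a sketch; the helper names are illustrative, and the cross-check uses `statistics.variance` and `statistics.pvariance` from the Python standard library.

```python
import statistics


def sum_of_squares(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def sample_variance(values):
    """SS divided by the degrees of freedom, n - 1."""
    return sum_of_squares(values) / (len(values) - 1)


def population_variance(values):
    """SS divided by N; appropriate only if the data are the whole population."""
    return sum_of_squares(values) / len(values)


scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]

s_squared = sample_variance(scores)          # 6000 / 8
sigma_squared = population_variance(scores)  # 6000 / 9

# Cross-check against the standard library.
assert abs(s_squared - statistics.variance(scores)) < 1e-9
assert abs(sigma_squared - statistics.pvariance(scores)) < 1e-9

print(s_squared, round(sigma_squared, 2))
```

Dividing by n - 1 always gives the larger of the two values, which is the correction for underestimation mentioned above.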

Measures of Location DISPERSION: Standard deviation
The standard deviation is the square root of the variance.
The standard deviation measures spread in the original units of measurement, while the variance does so in units squared.
Variance is good for inferential stats. Standard deviation is nice for descriptive stats.

The sample variance is an excellent estimate of variability, but it is expressed in the square of the original units of the data, which can be difficult to interpret. For example, if your data are in grams, the variance has the unit "square grams", and who knows what that is? The solution is simple: take the square root of the sample variance, and call the result the standard deviation. Since the sample variance is $s^2$, the standard deviation is symbolised as $s$.
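As a small sketch, taking the square root of the sample variance of the nine scores from the earlier variance slides (10 to 90, sample variance 750) returns a spread in the original score units. The check against `statistics.stdev` is only a sanity test.

```python
import math
import statistics

scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# Sample variance: in squared score units.
s_squared = statistics.variance(scores)

# Standard deviation: back in the original score units.
s = math.sqrt(s_squared)

assert abs(s - statistics.stdev(scores)) < 1e-9
print(s_squared, round(s, 2))
```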

Measures of Location DISPERSION
[Figure: two bar charts of # of People against Scores (20 to 100). Both data sets have N = 28 and $\bar{X} = 50$. First data set: $s^2 = 140.74$, $s = 11.86$. Second data set: $s = 23.57$.]

Here we have two sets of data, with the distributions represented graphically as bar charts. They each have the same number of observations and the same sample mean, but the distribution of the data in each set is clearly different, and one would not know that by looking at the sample mean alone. By calculating the sample variance, and then the standard deviation, we see immediately that the two sets of data differ in terms of dispersion.
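The same point can be illustrated with two small made-up data sets (not the data behind the charts): both have a mean of 50, but their standard deviations differ widely.

```python
import statistics

# Two hypothetical samples with the same size and the same mean.
narrow = [40, 45, 50, 55, 60]
wide = [10, 30, 50, 70, 90]

mean_narrow = statistics.mean(narrow)
mean_wide = statistics.mean(wide)

# The standard deviations reveal the difference the means hide.
sd_narrow = statistics.stdev(narrow)
sd_wide = statistics.stdev(wide)

print(mean_narrow, mean_wide)
print(round(sd_narrow, 2), round(sd_wide, 2))
```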

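Measures of Location DISPERSION

Mean
  For a sample: $\bar{X} = \frac{\sum X}{n}$
  For a population: $\mu = \frac{\sum X}{N}$
Variance
  For a sample: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
  For a population: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Standard Deviation
  For a sample: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
  For a population: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$

Remember that the SS, variance, and standard deviation, when computed from a sample, are all statistics: they are estimates of population parameters. We generally use the sample formulas when working with data, because we are almost always working at the sample level and seldom, if ever, at the population level. We do, however, need to be aware of the formulas for the population parameters.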

Measures of Location DISPERSION

The Standard Error, or Standard Error of the Mean, is an estimate of the standard deviation of the sampling distribution of means, based on the data from one or more random samples (e.g. 15 students each compile data sets of the heights of 20 people).

Numerically, it is equal to the square root of the quantity obtained when s squared is divided by the size of the sample:

$s_{\bar{X}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$

Up until now, we have discussed the frequency distribution, or dispersion, of the values of data. However, we are also often concerned with the frequency distribution, or dispersion, of statistics. For example, suppose every person in this class went outside, stopped 20 random people and asked how tall they were. If there were 15 people in this class, and they each had a sample of n = 20, there would be 20 x 15 = 300 data points. We could construct a frequency distribution of the values of those 300 data points.

Alternatively, we could have each person in the class calculate the mean of their own set of data points. We would then have 15 different means, and we could prepare a frequency distribution of the values of the means; in other words, how often did each value of the mean occur? Since the mean is a statistic, this would be a frequency distribution, or dispersion, of the values of a sample statistic. Since this is a mouthful, we use a much shorter term: sampling distribution. A sampling distribution is therefore a frequency distribution of the values of a sample statistic.

Next, we could calculate the standard deviation of our 15 means. We would use the formula we have used up until now, except that our mean would be a "mean of the means", and each X value would be one of the 15 means. This standard deviation is symbolised as $s_{\bar{X}}$ (s with subscript X bar). The standard deviation of the values of a statistic is called the standard error. In our particular example, we have calculated the standard error of the mean.
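The classroom example can be sketched as a small simulation. Everything numeric here is an assumption made for illustration: heights are drawn from a normal distribution with mean 170 cm and standard deviation 10 cm, and the seed is fixed only so the run is repeatable. The code computes the standard deviation of the 15 sample means directly, and also the formula-based estimate $s / \sqrt{n}$ from a single sample.

```python
import math
import random
import statistics


def standard_error(sample):
    """Estimate the standard error of the mean as s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


rng = random.Random(42)  # fixed seed, for repeatability only

# 15 students each measure 20 people (hypothetical heights in cm).
samples = [[rng.gauss(170, 10) for _ in range(20)] for _ in range(15)]

# One mean per student: 15 values of the sample statistic.
means = [statistics.mean(sample) for sample in samples]

# Standard deviation of the 15 means: the spread of the sampling distribution.
sd_of_means = statistics.stdev(means)

# Formula-based estimate from the first student's sample alone.
se_from_one_sample = standard_error(samples[0])

print(round(sd_of_means, 2), round(se_from_one_sample, 2))
```

The two printed numbers estimate the same quantity in different ways, so they should be of similar size without being identical; with only 15 means and one sample of 20, both are fairly noisy.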
