Measures of Dispersion

What is Dispersion?
Dispersion refers to the way in which quantitative data values are spread out in a dataset. The most powerful dispersion statistics calculate the spread of the data values around the arithmetic mean and are called measures of deviation. The various measures of deviation are built from the arithmetic differences between each data value and the arithmetic mean of the dataset.

Why bother with measuring deviation?
Consider the following datasets:
3, 3, 3, 3, 3
1, 1, 1, 2, 10
First we calculate their arithmetic means using:
\[ \bar{x} = \frac{\sum x}{n} \]
\[ \bar{x} = \frac{3+3+3+3+3}{5} = 3 \qquad \bar{x} = \frac{1+1+1+2+10}{5} = 3 \]
Are they the same? According to the mean they are.

Then we calculate their standard deviations using:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
For the first dataset:
\[ s = \sqrt{\frac{(3-3)^2 + (3-3)^2 + (3-3)^2 + (3-3)^2 + (3-3)^2}{4}} = 0 \]
For the second dataset:
\[ s = \sqrt{\frac{(1-3)^2 + (1-3)^2 + (1-3)^2 + (2-3)^2 + (10-3)^2}{4}} = \sqrt{15.5} \approx 3.94 \]
Same means, very different standard deviations. So are the datasets the same, or not?
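The comparison of the two small datasets can be checked with a short sketch. It assumes Python's standard `statistics` module; the variable names `flat` and `skewed` are labels chosen here, not part of the slides.

```python
import statistics

# The two five-value datasets from the slide (labels are illustrative).
flat = [3, 3, 3, 3, 3]
skewed = [1, 1, 1, 2, 10]

for name, data in (("flat", flat), ("skewed", skewed)):
    mean = statistics.mean(data)
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    s = statistics.stdev(data)
    print(f"{name}: mean = {mean}, s = {s:.2f}")
```

Both datasets report a mean of 3, but only the standard deviation separates them.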

Measures of Dispersion and Deviation
The Range (a measure of dispersion): the difference between the lowest value (called MIN) and the highest value (called MAX) in a dataset.
The Standard Deviation (a measure of deviation): measures the average difference between a data value and the arithmetic mean of all data values.
The Variance (a measure of deviation): averages the squared differences between each data value and the arithmetic mean of the dataset. It is the standard deviation squared.

The Range (Range = MAX-MIN)

The Range
The range describes the span of your dataset, from the minimum value (MIN) to the maximum value (MAX), using:
Range = MAX − MIN
It is used as a measure of data dispersion, NOT deviation, because deviation implies a difference between your data values and something, e.g. the arithmetic mean.
The range is used in finding histogram (or bar chart) classes.

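        Dataset #1        Dataset #2        Dataset #3
        $45,000.00        $80,000.00        $80,000.00
        $45,000.00        $45,000.00        $45,000.00
        $45,000.00        $45,000.00        $45,000.00
        $43,000.00        $43,000.00        $43,000.00
        $41,000.00        $41,000.00        $41,000.00
        $40,000.00        $40,000.00        $40,000.00
        $37,000.00        $37,000.00        $37,000.00
        $35,000.00        $35,000.00        $1.00
Sum     $331,000.00       $366,000.00       $331,001.00
n       8                 8                 8
Mean    $41,375.00        $45,750.00        $41,375.13
Median  $42,000.00        $42,000.00        $42,000.00
Mode    $45,000.00        $45,000.00        $45,000.00
MAX     $45,000.00        $80,000.00        $80,000.00
MIN     $35,000.00        $35,000.00        $1.00
Range   $10,000.00        $45,000.00        $79,999.00
Even the range is telling us more about the data than the central tendency measures alone do. Compare dataset #1 with #3.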
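As a rough check on the summary rows, the sketch below recomputes them. The eight values listed for each dataset are an assumption: repeated values are inferred from the reported sums, medians and ranges. The helper `data_range` is a name made up for this example.

```python
import statistics

# Assumed values: duplicates are inferred from the reported sums and ranges.
datasets = {
    "Dataset #1": [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #2": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #3": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1],
}


def data_range(values):
    """Range = MAX - MIN (hypothetical helper for this sketch)."""
    return max(values) - min(values)


for name, values in datasets.items():
    print(
        f"{name}: sum = {sum(values):,}, "
        f"mean = {statistics.mean(values):,.2f}, "
        f"median = {statistics.median(values):,.2f}, "
        f"range = {data_range(values):,}"
    )
```

Datasets #1 and #3 have almost identical means and the same median, yet their ranges differ by nearly $70,000.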

The Standard Deviation (s)

The Standard Deviation
The standard deviation measures the average difference between a data value and the arithmetic mean of all data values. It is given by:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
Where:
s is the sample standard deviation
x is a value in the dataset
\(\bar{x}\) is the arithmetic mean of the dataset
n is the number of values in the dataset
If you’re wondering how the \(\sum (x - \bar{x})^2\) term works, it is saying: subtract the mean from each data value, square each difference, then add up all these squared values. It is not saying subtract the mean from all data values, sum the differences, then square that sum; you’d get zero.
The standard deviation and the variance are related insofar as s is the square root of the variance (or the variance is s²). s is the most widely used measure of deviation, though it should always be used in conjunction with the variance.

Interpreting the Standard Deviation Formula
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
Subtract the arithmetic mean \(\bar{x}\) from each data value x and sum the differences:
\[ \sum (x - \bar{x}) \]
But this returns a set of plus and minus differences that add to zero. So to remove the signs we square each difference and sum the squared differences:
\[ \sum (x - \bar{x})^2 \]
Then we divide by n − 1 and take the square root to return to the magnitude of the original values:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
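The steps in the formula can be walked through one at a time. This is a minimal sketch using the 1, 1, 1, 2, 10 dataset from earlier; the variable names are chosen for illustration.

```python
import math

data = [1, 1, 1, 2, 10]
n = len(data)
mean = sum(data) / n

# Step 1: plain differences from the mean cancel out.
deviations = [x - mean for x in data]
sum_dev = sum(deviations)

# Step 2: squaring removes the signs, so the sum no longer cancels.
sum_sq = sum(d ** 2 for d in deviations)

# Step 3: divide by n - 1 (the sample variance), then take the square root.
variance = sum_sq / (n - 1)
s = math.sqrt(variance)

print(f"sum of differences         = {sum_dev}")
print(f"sum of squared differences = {sum_sq}")
print(f"variance = {variance}, s = {s:.2f}")
```

The first sum is zero, which is why the squaring step is needed before anything useful can be averaged.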

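A reminder of the effect of squaring: it emphasizes higher values.
#:  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
#²: 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400
The first row is an arithmetic progression. The second row is a quadratic progression: the gap between successive squares widens at every step.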

Why Squares and Roots?
This is a list of numbers, x.
x     x − mean   (x − mean)²   √(x − mean)²
1     -9.5       90.25         9.5
2     -8.5       72.25         8.5
3     -7.5       56.25         7.5
4     -6.5       42.25         6.5
5     -5.5       30.25         5.5
6     -4.5       20.25         4.5
7     -3.5       12.25         3.5
8     -2.5       6.25          2.5
9     -1.5       2.25          1.5
10    -0.5       0.25          0.5
11    0.5        0.25          0.5
12    1.5        2.25          1.5
13    2.5        6.25          2.5
14    3.5        12.25         3.5
15    4.5        20.25         4.5
16    5.5        30.25         5.5
17    6.5        42.25         6.5
18    7.5        56.25         7.5
19    8.5        72.25         8.5
20    9.5        90.25         9.5
Mean of x = 10.5; sum of (x − mean) = 0.0
The difference \(x - \bar{x}\) produces negative numbers and a sum of zero, but the square of a number is always positive, and differences between squares increase more rapidly than differences between the original numbers. Taking the square root of the squared values then returns them to their original magnitudes, and also removes the sign.
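The columns of the table above can be regenerated for the numbers 1 to 20. This is a small sketch, with column names chosen here, showing that the square root of a squared difference is simply its absolute value.

```python
import math

xs = list(range(1, 21))
mean = sum(xs) / len(xs)

deviations = [x - mean for x in xs]
squared = [d * d for d in deviations]
roots = [math.sqrt(q) for q in squared]

print(f"{'x':>3} {'x-mean':>8} {'squared':>8} {'sqrt':>6}")
for x, d, q, r in zip(xs, deviations, squared, roots):
    print(f"{x:>3} {d:>8.1f} {q:>8.2f} {r:>6.1f}")

print(f"mean = {mean}, sum of (x - mean) = {sum(deviations)}")
```

The signed column sums to zero, while the last column keeps every magnitude and drops every sign.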

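        Dataset #1        Dataset #2        Dataset #3
        $45,000.00        $80,000.00        $80,000.00
        $45,000.00        $45,000.00        $45,000.00
        $45,000.00        $45,000.00        $45,000.00
        $43,000.00        $43,000.00        $43,000.00
        $41,000.00        $41,000.00        $41,000.00
        $40,000.00        $40,000.00        $40,000.00
        $37,000.00        $37,000.00        $37,000.00
        $35,000.00        $35,000.00        $1.00
Sum     $331,000.00       $366,000.00       $331,001.00
n       8                 8                 8
Mean    $41,375.00        $45,750.00        $41,375.13
Median  $42,000.00        $42,000.00        $42,000.00
Mode    $45,000.00        $45,000.00        $45,000.00
MAX     $45,000.00        $80,000.00        $80,000.00
MIN     $35,000.00        $35,000.00        $1.00
Range   $10,000.00        $45,000.00        $79,999.00
s       $3,852.18         $14,290.36        $21,559.86
Low s means that the data are clustered around the mean (the distribution looks ‘peaked’, loosely leptokurtic).
High s means that the data are spread out around the mean (the distribution looks ‘flat’, loosely platykurtic).
REMEMBER: s values do not indicate skewness. They do indicate how peaked or flat the distribution looks at a given scale, although kurtosis proper is measured by its own statistic.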

Standard deviation calculations the hard way

Review Slide
Standard Deviation and the ‘Shape’ of Data
[Figure: three frequency curves around \(\bar{x}\), labelled ‘small’, ‘normal’ and ‘large’ standard deviation.]
This ‘peakedness’ of the distribution is called kurtosis. Use the kurtosis statistic to test for normality.

The Variance (s²)

The Variance
The variance averages the squared differences between each data value and the arithmetic mean of the dataset. It is given by:
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
Where:
s² is the sample variance
x is a value in the dataset
\(\bar{x}\) is the arithmetic mean of the dataset
n is the number of values in the dataset
Since it uses the arithmetic mean, it is subject to the same effect of extreme values, except much more so because of the effect of squaring.

Interpreting the Variance Formula
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
Subtract the arithmetic mean \(\bar{x}\) from each data value x and sum the differences:
\[ \sum (x - \bar{x}) \]
But this returns a set of plus and minus differences that add to zero. So to remove the signs we square each difference and sum the squared differences:
\[ \sum (x - \bar{x})^2 \]
Dividing that sum by n − 1 gives the variance.

Variance and SD Compared
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \qquad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
By squaring the differences you remove the negative signs and exaggerate the more extreme differences to make them more obvious for analysis.
By taking the square root you return the differences to their original magnitude, but the signs are removed so the differences no longer sum to zero.
In comparing the two, when s is small, the difference between the variance (s²) and s is smaller than when s is large. That’s what happens when you square numbers.

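        Dataset #1        Dataset #2        Dataset #3
        $45,000.00        $80,000.00        $80,000.00
        $45,000.00        $45,000.00        $45,000.00
        $45,000.00        $45,000.00        $45,000.00
        $43,000.00        $43,000.00        $43,000.00
        $41,000.00        $41,000.00        $41,000.00
        $40,000.00        $40,000.00        $40,000.00
        $37,000.00        $37,000.00        $37,000.00
        $35,000.00        $35,000.00        $1.00
Sum     $331,000.00       $366,000.00       $331,001.00
n       8                 8                 8
Mean    $41,375.00        $45,750.00        $41,375.13
Median  $42,000.00        $42,000.00        $42,000.00
Mode    $45,000.00        $45,000.00        $45,000.00
MAX     $45,000.00        $80,000.00        $80,000.00
MIN     $35,000.00        $35,000.00        $1.00
Range   $10,000.00        $45,000.00        $79,999.00
s       $3,852.18         $14,290.36        $21,559.86
s²      $14,839,285.71    $204,214,285.71   $464,827,464.41
Note that the highest s is 5.6 times the lowest, whereas the highest s² is 31 times the lowest. This is the effect of squaring extreme values.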
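The 5.6 and 31 ratios can be reproduced with the sketch below. As before, the eight values per dataset are an assumption inferred from the reported sums, medians and ranges, and the dictionary names are chosen for this example.

```python
import statistics

# Assumed values: duplicates are inferred from the reported summary rows.
datasets = {
    "Dataset #1": [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #2": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #3": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1],
}

# Sample statistics: both functions divide by n - 1.
s = {name: statistics.stdev(v) for name, v in datasets.items()}
s2 = {name: statistics.variance(v) for name, v in datasets.items()}

ratio_s = max(s.values()) / min(s.values())
ratio_s2 = max(s2.values()) / min(s2.values())

for name in datasets:
    print(f"{name}: s = {s[name]:,.2f}, s^2 = {s2[name]:,.2f}")
print(f"highest s / lowest s     = {ratio_s:.1f}")
print(f"highest s^2 / lowest s^2 = {ratio_s2:.0f}")
```

Because the variance ratio is the square of the standard deviation ratio, any gap between datasets looks far larger in s² than in s.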

N and n − 1
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \qquad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
Why do the sample standard deviation and sample variance formulas have n − 1 as the denominator?
Because n − 1 gives a more conservative estimate of deviation by increasing the standard deviation and variance values. The differences are measured from the sample mean, which sits closer to the sample's own values than the population mean does, so dividing by n would tend to understate the spread of the population.
If you have a larger standard deviation or variance, you have a higher standard to pass in making your case. Why? Because if you are testing to see if a data value is 1.96 s away from the mean of its dataset, then a larger s means the data value has to meet a stricter test, i.e. it has to be further from the mean.
What it’s saying is that if you want to find out if, for example, a value in a dataset is different from the others, then that value has to be 1.96 s from the mean of that dataset. Thus if s is larger, the value has to be further away to be significantly different. Never mind right now why the number 1.96 comes up. We’ll deal with it in a couple of weeks.

Sample versus population: n − 1 versus N
Sample size (n)   Numerator ∑(x − x̄)²   Biased estimate (dividing by N)   Unbiased estimate (dividing by n − 1)   Difference   Difference (%)
10                500                   √(500/10) = 7.07                  √(500/(10−1)) = 7.45                    .38          5.0%
100               500                   √(500/100) = 2.24                 √(500/(100−1)) = 2.25                   .01          0.4%
1000              500                   √(500/1000) = 0.7071              √(500/(1000−1)) = 0.7075                .0004        0.056%
Both estimates are of the population standard deviation. Source: After Salkind, page 40.
Note:
1. With n − 1 the standard deviation is higher.
2. The larger the sample, the smaller the effect of n − 1.
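The two estimates can be recomputed from the fixed numerator of 500. In this sketch the function name `sd_estimates` is made up for illustration, and the differences are computed from unrounded values, so they may not match a table built from rounded figures digit for digit.

```python
import math


def sd_estimates(numerator, n):
    """Return (dividing by N, dividing by n - 1) for a given sum of squares."""
    biased = math.sqrt(numerator / n)
    unbiased = math.sqrt(numerator / (n - 1))
    return biased, unbiased


for n in (10, 100, 1000):
    biased, unbiased = sd_estimates(500, n)
    print(
        f"n = {n:>4}: N -> {biased:.4f}, n-1 -> {unbiased:.4f}, "
        f"difference = {unbiased - biased:.4f}"
    )
```

The n − 1 estimate is always the larger of the two, and the gap shrinks quickly as the sample grows.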

Interpreting Variance & Standard Deviation
s gives the average difference between each data value and the mean of a dataset, and s² squares it and so exaggerates it.
The larger s and s² are, the more spread out the data values are and the larger the differences between them.
If s and s² are equal to zero then there are no differences between your data values.
The standard deviation and the variance each require an arithmetic mean to work, not the median or the mode. Therefore they require the same rigour as the mean and are sensitive to extreme values as well, especially the variance.

The Coefficient of Variation (Cv)

Calculating the Coefficient Of Variation
The equation for the sample coefficient of variation is:
\[ C_v = \frac{s}{\bar{x}} \times 100 \]
And, for the population:
\[ C_v = \frac{\sigma}{\mu} \times 100 \]

Interpreting The Coefficient Of Variation
The coefficient of variation expresses the standard deviation as a percentage of the mean.
It allows easy comparison of standard deviations with one another.

Interpreting The Coefficient Of Variation
By way of example: compare an s of $2,400 on a per capita average income of $55,000 against an s of $300 on a per capita average income of $2,000. How to interpret?
Here the coefficients of variation are 4.4% and 15%, indicating a much wider range of variability in the poorer nation, that is, a much wider gap between rich and poor.
Case in point: the coefficient of variation for global GNI is 108.9%! This indicates an extraordinary gap between rich and poor nations.
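The income comparison works out as follows. This is a minimal sketch; the function name `coefficient_of_variation` and the labels `richer` and `poorer` are chosen for this example.

```python
def coefficient_of_variation(s, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * s / mean


richer = coefficient_of_variation(2400, 55000)
poorer = coefficient_of_variation(300, 2000)

print(f"richer nation: Cv = {richer:.1f}%")
print(f"poorer nation: Cv = {poorer:.1f}%")
```

Although $300 is a much smaller standard deviation than $2,400 in absolute terms, it is a far larger share of its mean.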

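        Dataset #1        Dataset #2        Dataset #3
        $45,000.00        $80,000.00        $80,000.00
        $45,000.00        $45,000.00        $45,000.00
        $45,000.00        $45,000.00        $45,000.00
        $43,000.00        $43,000.00        $43,000.00
        $41,000.00        $41,000.00        $41,000.00
        $40,000.00        $40,000.00        $40,000.00
        $37,000.00        $37,000.00        $37,000.00
        $35,000.00        $35,000.00        $1.00
Sum     $331,000.00       $366,000.00       $331,001.00
n       8                 8                 8
Mean    $41,375.00        $45,750.00        $41,375.13
Median  $42,000.00        $42,000.00        $42,000.00
Mode    $45,000.00        $45,000.00        $45,000.00
MAX     $45,000.00        $80,000.00        $80,000.00
MIN     $35,000.00        $35,000.00        $1.00
Range   $10,000.00        $45,000.00        $79,999.00
s       $3,852.18         $14,290.36        $21,559.86
s²      $14,839,285.71    $204,214,285.71   $464,827,464.41
Cv      9.31%             31.24%            52.11%
Note that the highest Cv is 5.6 times the lowest, indicating that dataset #3 is considerably more variable than dataset #1. The effect of the two extreme values is evident.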

Summary Stats So Far
Arithmetic mean and standard deviation are fundamental to statistics.
They form the heart of descriptive statistics.
They are the essential building blocks of all other statistical methods. Look for them as elements in future formulas.
Other measures of dispersion have their roles and are more robust, but are not as powerful.

All Geography students are deviants.

All Geography students are above average deviants. mg!