Presentation on theme: "Descriptive Statistics Used to describe the basic features of the data in any quantitative study. Both graphical displays and descriptive summary statistics."— Presentation transcript:
Descriptive Statistics Used to describe the basic features of the data in any quantitative study. Both graphical displays and descriptive summary statistics provide the basis of nearly any quantitative analysis of data.
Descriptive Statistics The purpose of descriptive statistics is to organize and summarize data so that the data are more readily comprehended. That is, descriptive statistics describe distributions with numbers.
The Process of Becoming Familiar with the Data ???
The Process of Becoming Familiar with the Data –Screening for valid values –Missing data –Value labels –Levels of measurement –Center –Spread –Shape –Rank or relative position –Association
Background Information Types of variables –Qualitative –Quantitative Scales or Levels of Measurement –Nominal – Names the category; a qualitative variable therefore represents a nominal scale. –Ordinal – Values can be ordered and reflect differing degrees or amounts of the characteristic being studied, but differences between values are not interpretable. –Interval – Values can be ordered and differences between values are interpretable, but zero is arbitrary, so ratios are not meaningful. –Ratio – Zero is a meaningful value, so ratios make sense.
Examples of Levels of Measurement Nominal – Numbers assigned to sports figures, gender, party affiliation. Ordinal – Numbers assigned to educational attainment, rank in population. Interval – Temperature: there is a zero, but where it falls depends on the scale used (for example Celsius or Fahrenheit). It is not an absolute zero, so a temperature of 100 is not twice as hot as a temperature of 50. Ratio – Has an absolute zero: weight, count of the number of people, height, distance, elapsed time.
Why is knowing the level of measurement important? It helps you decide how to interpret the data from that variable. It also helps you decide which statistical analysis is appropriate for the values assigned. http://www.socialresearchmethods.net/selstat/ssstart.htm
Central Tendency Central tendency refers to measuring the center or average. Only the notation for the mean is standard. The most common measures of central tendency are: –Arithmetic Mean or Mean – $\bar{x}$ – the sum of the observations divided by the number of observations –Mode – Mo – the item that occurs with the greatest frequency –Median – Mdn – the middle score when the observations are arranged in order of magnitude, so that an equal number of scores fall below and above it.
Examples of these measures Mean of 2, 3, 6, 7, 3, 5, 10: (2 + 3 + 6 + 7 + 3 + 5 + 10) / 7 = 36 / 7 ≈ 5.14. Mode of 2, 3, 6, 7, 3, 5, 10: 3, the only value that occurs twice. Median of 2, 3, 6, 7, 3, 5, 10: first the data are ordered: 2, 3, 3, 5, 6, 7, 10. The middle value is 5, therefore that is the median.
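The three measures above can be checked with a few lines of Python. This is only a sketch using the standard library's `statistics` module; the variable names are chosen for illustration and are not part of the original slides.

```python
from statistics import mean, median, mode

# Data set from the slide
scores = [2, 3, 6, 7, 3, 5, 10]

mean_score = mean(scores)      # 36 / 7
median_score = median(scores)  # middle value after sorting
mode_score = mode(scores)      # most frequent value

print(round(mean_score, 2), median_score, mode_score)
```

Note that `median` sorts the data internally, so the ordering step shown on the slide does not have to be done by hand.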
Some Important Points About These Measures The mode is the only measure of central tendency that can be used for nominal data. The median is resistant: it is largely unaffected by extreme observations.
Some Important Points About These Measures Mean or Average is affected by extremely small or large values. We say that it is sensitive or nonresistant to the influence of extreme observations. The mean is the balance point of the distribution. In symmetric distributions the mean and median are close together.
More important points In skewed data the mean is pulled toward the tail of the distribution. The median is not necessarily preferred over the mean just because it is resistant. However, if the data are known to be strongly skewed, then the median is preferable. Finally, the mean is usually the measure of central tendency of choice because it is stable from sample to sample.
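A small illustration of resistance, sketched in Python. The second data set is hypothetical: it is the earlier example with the largest value changed from 10 to 100, an extreme value invented here to make the point.

```python
from statistics import mean, median

original = [2, 3, 3, 5, 6, 7, 10]
# Hypothetical variant: the largest value is replaced by an extreme one
with_extreme = [2, 3, 3, 5, 6, 7, 100]

mean_before, mean_after = mean(original), mean(with_extreme)
median_before, median_after = median(original), median(with_extreme)

# The mean is pulled toward the extreme value; the median does not move
print(round(mean_before, 2), mean_after)
print(median_before, median_after)
```

The mean more than triples while the median stays at 5, which is what "nonresistant" and "resistant" mean in practice.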
Measuring Spread or Variability There are several measures of variability. These measures give an added dimension to the data. More information about the data is better than less. Example: A test was given in two classes and the average in one class was 97 and the average in the other was 94. Was the second test more difficult? Was it easier to get an A in the first class than the other? Not necessarily to both questions. The spread of the test grades might help answer the questions. Say that the spread of grades in the first test was 85 – 100 and in the second test the spread was 92 – 96.
Measures of Variability, Spread or Dispersion Range – Difference between the highest and lowest items in a distribution. This measure is not responsive to each item in the distribution, because it depends only on the two extremes. Quartiles Q1 and Q3 – Medians of the parts of the distribution to the left and right of the median. Interquartile range – The IQR is the distance between Q1 and Q3, that is, Q3 – Q1. The IQR is used to find outliers. The rule is that an item more than 1.5 times the IQR below Q1 or above Q3 is considered an outlier.
The Five-Number Summary A convenient and quick way to graph the data and give some preliminary descriptive statistics is to determine the five-number summary. Besides the median and the quartiles, we need two additional bits of information: the minimum and the maximum. Example: The data set in a previous slide was: 2, 3, 3, 5, 6, 7, 10. The median is 5. Q1 and Q3 are 3 and 7 respectively. The minimum is 2. The maximum is 10.
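The five-number summary and the 1.5 times IQR rule can be sketched in Python as follows. The function names `five_number_summary` and `iqr_outliers` are made up for this example. The quartiles are computed the way the slides describe them, as medians of the values to the left and right of the median; statistical software often uses other quartile conventions and may give slightly different values on other data sets.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Q1 and Q3 are the medians of the lower and upper halves; for an odd
    number of items the overall median is left out of both halves.
    Assumes at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

def iqr_outliers(values):
    """Return the items more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [2, 3, 3, 5, 6, 7, 10]
summary = five_number_summary(data)
outliers = iqr_outliers(data)
print(summary)
print(outliers)
```

For this data set the IQR is 7 – 3 = 4, so the cutoffs are 3 – 6 = –3 and 7 + 6 = 13, and no item falls outside them.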
Deviations from the mean Another way to measure spread is to measure the deviations from the mean or average. For our example the average is 5.14, so the deviations are 2 – 5.14, 3 – 5.14, 3 – 5.14, 5 – 5.14, 6 – 5.14, 7 – 5.14, 10 – 5.14. So, they are: -3.14, -2.14, -2.14, -0.14, 0.86, 1.86, 4.86. Notice that they add up to zero, apart from a small rounding error. Each deviation tells you something about the spread, but since their sum is always zero it cannot serve as a summary, so the deviations are squared and the squares are added.
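The deviations can be reproduced in Python. This is a minimal sketch with illustrative variable names; it uses the unrounded mean, so the deviations sum to zero up to floating-point error rather than to the 0.02 that the rounded values give.

```python
data = [2, 3, 3, 5, 6, 7, 10]

mean_value = sum(data) / len(data)           # 36 / 7
deviations = [x - mean_value for x in data]  # each item minus the mean

sum_of_deviations = sum(deviations)               # zero, up to rounding
sum_of_squares = sum(d ** 2 for d in deviations)  # sum of squared deviations

print([round(d, 2) for d in deviations])
print(round(sum_of_squares, 2))
```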
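Deviations from the mean continued Simply dividing the sum of the squared deviations by the number of sample items would give us the average squared deviation from the mean. However, we will find out that dividing by (number of items – 1) instead gives us an unbiased estimator of the population variance. So the formula for the sample variance becomes: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $n$ is the number of items and $\bar{x}$ is the mean.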
Standard Deviation A more useful and popular statistic is the standard deviation. Its units are the same as those of the items in the data set. Fortunately, it does not involve another formula: by taking the square root of the variance we have the standard deviation. Like the mean, the standard deviation is nonresistant to extreme values. The formula then is: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
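As a worked example, the variance and standard deviation of the earlier data set can be computed by hand in Python and compared with the standard library. This is a sketch; the variable names are illustrative, and `statistics.variance` and `statistics.stdev` are the library's sample versions, which divide by (number of items – 1).

```python
import math
from statistics import stdev, variance

data = [2, 3, 3, 5, 6, 7, 10]
n = len(data)

mean_value = sum(data) / n
sum_of_squares = sum((x - mean_value) ** 2 for x in data)

# Dividing by n gives the plain average of the squared deviations
average_squared_deviation = sum_of_squares / n
# Dividing by n - 1 gives the sample variance (the unbiased estimator)
sample_variance = sum_of_squares / (n - 1)
# The standard deviation is the square root of the variance
sample_sd = math.sqrt(sample_variance)

print(round(average_squared_deviation, 2))
print(round(sample_variance, 2), round(variance(data), 2))
print(round(sample_sd, 2), round(stdev(data), 2))
```

The standard deviation comes out in the same units as the data, which is why it is usually easier to interpret than the variance.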
Class Demos –Outliers Demo Data –Teacher Stress Data –Key for Teacher Stress Data