Histograms and Distributions

Name: Histograms and Distributions
Uploaded: 2017-11-04T06:05:59+00:00
Duration: PTM36S40
Channel: Morgan Powers
Description: Histograms and Distributions

Histograms and Distributions

Suppose you want to know whether athletes have faster reflexes than non-athletes. To get as close to an answer as possible, you decide to run an experiment: using a web-based program, you measure the reaction times of 25 athletes and 25 non-athletes under controlled conditions.

Frequency refers to how often a particular value appears in the data. [Table: reaction time (ms) vs. frequency for 230, 231, 232, ... 237, etc.; for example, 230 ms appears once and 233 ms appears twice.]

A histogram is a plot of frequency. [Figure: histogram, frequency vs. time (ms)] This is a weak attempt at making an informative histogram. Why?

It would be more informative to place the data into intervals called bins. You choose the appropriate bin size. [Table: data grouped into bins] The bins shown have an interval of 10.

If the bin intervals are too small, the histogram will be too spread out. [Figure: histogram, frequency vs. time (ms)] The bins shown have an interval of 1.

If the bin intervals are too large, the information will be too clumped. [Figure: histogram, frequency vs. time (ms)] The bins shown have an interval of 80.

Let’s go back to a bin interval of 10 and look at the resulting histogram…

[Figure: histogram, frequency vs. time (ms)] This is a decent choice. Remember that all intervals must have the same size.
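The effect of bin size can be sketched in a few lines of Python. The reaction times below are made up for illustration (they are not the slide's data), and `bin_counts` is a hypothetical helper name; the point is only that every bin has the same width and that the width changes how informative the picture is.

```python
from collections import Counter

# Hypothetical reaction times in ms (illustrative only, not the slide's data).
reaction_times = [231, 233, 233, 248, 252, 259, 261, 264, 270, 275, 288, 301]

def bin_counts(values, bin_width):
    """Count how many values fall into each equal-width bin.

    Each bin is labelled by its lower edge, e.g. 230 for 230-239 when
    bin_width is 10.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

counts_10 = bin_counts(reaction_times, 10)

# Print a crude text histogram: one '#' per measurement in the bin.
for start, freq in counts_10.items():
    print(f"{start}-{start + 9} ms: {'#' * freq}")
```

Calling `bin_counts(reaction_times, 1)` or `bin_counts(reaction_times, 100)` on the same list shows the two failure modes described above: almost every bin holds a single value, or nearly everything lands in one bin.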

SAMPLE SIZE: Currently the sample size is only 25 students in the non-athlete group. Let’s see what happens to our histogram as more data is collected (as the sample size increases). [Figure: histogram, frequency vs. time (ms)]

SAMPLE SIZE: The sample size is now 73 students. Let’s compare the before and after histograms. [Table: non-athlete counts per bin: 1, 2, 6, 12, 17, 15, 9, 4; sample size 73] [Figure: two panels, Histogram (before) and Histogram (after), frequency vs. time (ms)]

We can imagine that our intervals are infinitely small and our sample size is infinitely large, which results in a smooth curve. [Figure: smooth curve, frequency vs. time (ms)]

This curve is known as a Normal Distribution or Bell-Shaped Curve. The area under the curve over a given range represents the probability of getting a data point in that range. [Figure: normal curve, frequency vs. time (ms)]

For example, the probability of your next measurement being between 261 and 341 ms is near 100%. Likewise, the probability of your next measurement being between 261 and 300 ms is around 50%, as this is half the area under the curve. [Figure: normal curve, frequency vs. time (ms)]

What is the probability of your next data measurement being ms? Near ZERO, since this is only a tiny fraction of the area under the curve. [Figure: normal curve, frequency vs. time (ms)]

Descriptive Statistics

Measures of Central Tendency 1. The MEAN: This should be something you can already compute for a data set. Sum the numbers and divide by how many numbers you have. It can be expressed mathematically as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where x is the random variable that you are measuring and n is the number of measurements you have made.

Measures of Central Tendency 2. The MEDIAN: This is simply the value in a data set that separates the higher half of a sample from the lower half. Reaction Time (ms): 265, 273, 286, 291, 293, 300, 330. For example, in this sample the value that separates the higher and lower halves of the data is 291 ms, which is the median. Just arrange the data from highest to lowest or vice versa and find the central number.

Measures of Central Tendency 2. The MEDIAN: This is simply the value in a data set that separates the higher half of a sample from the lower half. Reaction Time (ms): 265, 273, 286, 292, 293, 300. What if there is an even number of data points, as in this sample? Just average the two central measurements. In this case you average 286 and 292 to get a median of 289 ms.
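As a quick check, the two samples listed on the median slides can be run through Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
from statistics import median

# The two reaction-time samples (ms) listed on the median slides.
odd_sample = [265, 273, 286, 291, 293, 300, 330]   # 7 values
even_sample = [265, 273, 286, 292, 293, 300]       # 6 values

# Odd count: the middle value. Even count: the average of the two middle values.
odd_median = median(odd_sample)
even_median = median(even_sample)

print(odd_median)   # 291
print(even_median)  # 289.0
```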

Careful with the MEAN and MEDIAN For example, a college boasts that the average starting salary of last year’s graduating class was $362,000 per year. This sounds quite impressive. However, what they did not tell you was that the class size was 30 students, of which 29 started at $30,000 a year and one was a first-round draft pick in the NFL making approximately $10,000,000 per year. Such a data point ($10,000,000 per year) can be considered an outlier, which is a data point much higher or lower than the rest of the data points. An outlier can also be seen in the histogram of our athlete data; perhaps the person blinked while the reaction time was being measured. [Figure: histogram with an outlier, frequency vs. time (ms)]

Careful with the MEAN and MEDIAN What is the median of the salary data set (29 graduates at $30,000 a year and one at approximately $10,000,000)? $30,000. The median is far less sensitive to outliers than the mean.
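The salary example is easy to reproduce. The sketch below assumes 29 graduates at $30,000 plus the one draft pick at $10,000,000, which is the split that reproduces the quoted average of about $362,000 for a class of 30.

```python
from statistics import mean, median

# Assumed split: 29 graduates at $30,000 and one at $10,000,000 (class of 30).
salaries = [30_000] * 29 + [10_000_000]

mean_salary = mean(salaries)      # pulled far upward by the single outlier
median_salary = median(salaries)  # unaffected by the outlier

print(f"mean:   ${mean_salary:,.0f}")
print(f"median: ${median_salary:,.0f}")
```

One extreme value moves the mean by more than a factor of ten, while the median stays at the salary almost everyone actually earned.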

Careful with the MEAN and MEDIAN So should we be focusing on the median more than the mean? No. Generally speaking, once outliers have been dealt with, the mean is TYPICALLY a far more accurate measure of central tendency than the median. To convince yourself, try this exercise from Seeing Statistics: "The median is more resistant to extreme, misleading data values so it would seem to be the clear choice. However, we also need to consider accuracy. Is the median or the mean more likely to be close to the true value? To evaluate the relative accuracy of the median and the mean, let's consider how they do when we know the true center of the data. Suppose that the only possible scores are the whole numbers between 0 and 100. The center of these 101 numbers, whether we use the median or the mean, is 50. What if we were to select five numbers randomly from this set of 101 and calculate the median and mean of those five numbers? Would the median or the mean be closer to what we know is the true value of 50?" The mean will win ~78% of the time.
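The exercise can be tried as a small simulation. This is a sketch under stated assumptions: numbers are drawn with replacement, ties count as a loss for the mean, and `mean_win_fraction` is a hypothetical helper name. The exact percentage depends on those choices and on the random seed, so it need not match the figure quoted above.

```python
import random
from statistics import median

def mean_win_fraction(trials=20_000, sample_size=5, seed=0):
    """Fraction of trials in which the sample mean lands closer to 50
    than the sample median does.

    Assumptions: whole numbers 0..100 drawn with replacement; a tie is
    not counted as a win for the mean.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        sample = [rng.randint(0, 100) for _ in range(sample_size)]
        sample_mean = sum(sample) / sample_size
        if abs(sample_mean - 50) < abs(median(sample) - 50):
            wins += 1
    return wins / trials

fraction = mean_win_fraction()
print(f"Mean was closer than the median in {fraction:.0%} of trials")
```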

Measures of Spread 1. The RANGE: This is simply the length of the smallest interval containing all of the data. Reaction Time (ms): 265, 273, 286, 292, 293, 300. For example, the range of this data would be 265 ms to 300 ms (a length of 35 ms). However, the range suffers from the same drawback as the mean, and even more so when it comes to describing data: once again, outliers.

Measures of Spread 1. The RANGE: This is simply the length of the smallest interval containing all of the data. Reaction Time (ms): 265, 273, 286, 292, 293, 300, 734. Calculate the range now with the addition of one new measurement that happens to be an outlier: 265 ms to 734 ms (a length of 469 ms). The range is more sensitive to outliers than the mean because, with a large sample size, the effect of a single outlier on the mean is diluted.
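A short sketch using the samples from the two range slides shows how a single outlier stretches the range; `data_range` is an illustrative helper name.

```python
def data_range(values):
    """Return (smallest, largest, length of the interval between them)."""
    low, high = min(values), max(values)
    return low, high, high - low

# Reaction times (ms) from the range slides.
sample = [265, 273, 286, 292, 293, 300]
with_outlier = sample + [734]

range_before = data_range(sample)
range_after = data_range(with_outlier)

print(range_before)  # (265, 300, 35)
print(range_after)   # (265, 734, 469)
```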

Measures of Spread 2. The INTERQUARTILE RANGE: The interquartile (between quarters) range is one way around the outlier issue. This value is calculated by first splitting the data into four sections (quarters) from low to high, with the same number of data points in each section. [Figure: data split into four quarters] The interquartile range is the range between the number that defines the upper end of Quarter 1 (Q1) and the number that defines the upper end of Quarter 3 (Q3). Let’s look at an example.

Measures of Spread 2. The INTERQUARTILE RANGE: Calculate the interquartile range of this data. [Table: sorted reaction times]
A. Find the median: 268 ms, the 13th value.
B. Now find the median of the first half of the data, excluding the 13th value: ( ) / 2 = 231 ms = Q1.
C. Find the median of the second half of the data, excluding the 13th value: ( ) / 2 = 292 ms = Q3.
D. The interquartile range is 231 ms to 292 ms. It is also sometimes stated as Q3 – Q1, which would be 61 ms in this case.

Measures of Spread 2. The INTERQUARTILE RANGE: If you start with an even number of data points, then split the data in half and find the median of each half. [Table: sorted reaction times] In this case one would split the data between values 12 and 13.
A. The median of the top half of the list (the lower values) is 231 ms again = Q1.
B. The median of the bottom half of the list (the higher values) is ( ) / 2 = 289 ms = Q3.
C. The interquartile range is 231 ms to 289 ms. It is also sometimes stated as Q3 – Q1, which would be 58 ms in this case.
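The median-of-halves procedure described on these slides can be sketched as follows. The data values are invented for illustration (the slide's table did not survive), and `quartiles` is a hypothetical helper. Other quartile conventions exist and can give slightly different answers.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) using the median-of-halves method from the slides.

    Odd number of points: the overall median is excluded from both halves.
    Even number of points: the data is simply split down the middle.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower), median(upper)

# Hypothetical reaction times in ms (illustrative only).
odd_data = [210, 225, 231, 240, 268, 275, 290, 294, 310]   # 9 values
even_data = [210, 225, 231, 240, 275, 290, 294, 310]       # 8 values

q1_odd, q3_odd = quartiles(odd_data)
q1_even, q3_even = quartiles(even_data)

print(f"odd:  Q1={q1_odd}, Q3={q3_odd}, IQR={q3_odd - q1_odd}")
print(f"even: Q1={q1_even}, Q3={q3_even}, IQR={q3_even - q1_even}")
```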

Measures of Spread 3. The STANDARD DEVIATION (σ or s) The standard deviation is a value describing the distance from the mean, in BOTH directions, that encompasses about 68% of your data when the data is normally distributed. Therefore, σ is a direct measure of the spread of your data. Let’s look at a quick example.

Measures of Spread 3. The STANDARD DEVIATION (σ or s) This histogram shows blood pressure data for a large sampling of adult males. [Figure: histogram of blood pressure (mmHg)] The mean is around 82 mmHg and σ is around 10 mmHg. What does this mean? It means that 68% of the data points fall within 82 +/- 10 mmHg (between 72 and 92 mmHg).

Measures of Spread 3. The STANDARD DEVIATION (σ or s) Therefore, the more spread out your data is… …the greater the value of σ.

Measures of Spread 3. The STANDARD DEVIATION (σ or s) To take it a step further, two standard deviations away from the mean on both sides (+/- 2σ) will encompass… 95% of the data. Likewise, +/- 3σ will encompass 99.7% of the data.
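The 68%, 95% and 99.7% figures can be checked against an idealized normal curve. The sketch below uses the mean (82 mmHg) and σ (10 mmHg) from the blood pressure slide and assumes the data is exactly normal, which real data only approximates.

```python
from statistics import NormalDist

# Idealized normal curve with the blood pressure slide's mean and sigma.
MEAN, SIGMA = 82, 10
bp = NormalDist(mu=MEAN, sigma=SIGMA)

# Share of the area under the curve within k standard deviations of the mean.
shares = {}
for k in (1, 2, 3):
    low, high = MEAN - k * SIGMA, MEAN + k * SIGMA
    shares[k] = bp.cdf(high) - bp.cdf(low)
    print(f"+/- {k} sigma ({low} to {high} mmHg): {shares[k]:.1%}")
```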

Measures of Spread 3. The STANDARD DEVIATION (σ or s) How does one calculate the Standard Deviation (σ)? Let’s go back to our athlete/non-athlete reaction time data to see how this is done starting with the non-athlete sample…

Measures of Spread 3. The STANDARD DEVIATION (σ or s) Where should we begin? By calculating the mean ($\bar{X}$): 278.5 ms. Now what? (Think about what σ tells us.) It describes the spread of the data (or the width of the normal distribution / bell-shaped curve). Therefore, it is only logical to find how far each of your data points is from the mean.

Measures of Spread 3. The STANDARD DEVIATION (σ or s) [Table columns: $X$, $X - \bar{X}$] $\bar{X}$ = 278.5 ms. $X - \bar{X}$ is the difference between each measured value and the mean. Now we are starting to get an idea of how spread out the data is around the mean, which is what σ is all about.

Measures of Spread 3. The STANDARD DEVIATION (σ or s) [Table columns: $X$, $X - \bar{X}$, $(X - \bar{X})^2$] The next step is to square all of the differences: $(X - \bar{X})^2$.

Measures of Spread 3. The STANDARD DEVIATION (σ or s) [Table columns: $X$, $X - \bar{X}$, $(X - \bar{X})^2$] Then you, for the most part, average the squares: $\sum (X - \bar{X})^2 / (n - 1)$. The reason one uses n - 1 is to account for sample size. If n is large, you are essentially dividing by n and averaging. If n is small, like a sample size of n = 3, then n - 1 makes a large difference in the resulting estimate of σ.

Measures of Spread 3. The STANDARD DEVIATION (σ or s) [Table columns: $X$, $X - \bar{X}$, $(X - \bar{X})^2$] Then you essentially average the squares: $\sum (X - \bar{X})^2 / (n - 1) = 998.2$. This number is known as the variance and is directly related to the spread of your data.

Measures of Spread 3. The STANDARD DEVIATION (σ or s) [Table columns: $X$, $X - \bar{X}$, $(X - \bar{X})^2$] One more step to get σ. Take the square root of the “average” to get back to the original units (ms): $\sqrt{\sum (X - \bar{X})^2 / (n - 1)} = 31.6$. This is the standard deviation (σ). What does this number mean?

Measures of Spread 3. The STANDARD DEVIATION (σ or s) It means that ACCORDING TO THE CURRENT DATA, 68% of future data collected should fall within 279 +/- 31.6 ms. Read the emphasized text above over and over, as your stats are only as good as your data. Use common sense.

Measures of Spread 3. The STANDARD DEVIATION (σ or s) Standard deviation formula (what we just did): $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
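The steps just walked through (mean, differences, squares, divide by n - 1, square root) can be sketched in Python. The five reaction times are invented so the arithmetic is easy to follow by hand; they are not the class data. The result is compared with `statistics.stdev`, which uses the same n - 1 convention.

```python
import math
from statistics import stdev

# Hypothetical reaction times in ms (illustrative only).
data = [250, 260, 270, 280, 290]
n = len(data)

# Step 1: the mean.
x_bar = sum(data) / n

# Step 2: how far each value is from the mean.
differences = [x - x_bar for x in data]

# Step 3: square the differences.
squares = [d ** 2 for d in differences]

# Step 4: "average" the squares, dividing by n - 1. This is the variance.
variance = sum(squares) / (n - 1)

# Step 5: take the square root to get back to ms.
sd = math.sqrt(variance)

print(f"mean = {x_bar}, variance = {variance}, sd = {sd:.1f}")
print(f"statistics.stdev gives {stdev(data):.1f}")
```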

Measures of Spread 3. The STANDARD DEVIATION (σ or s) Your turn, athlete data… Found in research methods folder (athlete-reaction-time-data.xls) and on edmodo


Measures of Spread 3. The STANDARD DEVIATION (σ or s) Summary of current data:
Group          Mean +/- σ (ms)
Athletes       264 +/- 30.6
Non-athletes   279 +/- 31.6
What does it mean?

Measures of Spread 3. The STANDARD DEVIATION (σ or s) The significance of the standard deviation: [Figure: two data sets, blue and red, with the same mean] The graph shows two data sets having the SAME mean. What is different, then? The blue data set has a greater spread and therefore a larger σ. Which data set would you prefer (if you had a choice)? The red one, as there is less noise / variability. Variability is an inevitable limitation in the methods we use to observe nature. It is your job to make as precise a measurement as possible, thereby limiting the variability.

Measures of Spread 3. The STANDARD DEVIATION (σ or s) Compare the histograms of non-athletes to athletes. [Figure: two histograms, Non-athletes and Athletes] Better yet, overlay the histograms…

Compare the histograms of non-athletes to athletes: athletes 264 +/- 30.6 ms, non-athletes 279 +/- 31.6 ms (mean +/- σ). [Figure: overlaid histograms of Non-athletes and Athletes; x-axis: Reaction time (ms); y-axis: Number of students (frequency); P = .11323] Q: Is there really a difference between these two groups? What should we do?

Measures of Spread 3. The STANDARD DEVIATION (σ or s) Collect more data (a larger sample size), which is really the only option at this point. [Table: counts per bin for non-athletes and athletes; sample size 73 non-athletes, 77 athletes] [Figure: histograms of Non-athletes and Athletes]

Measures of Spread 3. The STANDARD DEVIATION (σ or s) [Figure: two panels of overlaid histograms; x-axis: Reaction time (ms); y-axis: Number of students (frequency)]
Sample size                      Athletes, mean +/- σ (ms)   Non-athletes, mean +/- σ (ms)
25 in each group (N=50)          264 +/- 30.6                279 +/- 31.6
73 non-athletes, 77 athletes     251 +/- 30.8                298 +/- 28.5

Measures of Spread 3. The STANDARD DEVIATION (σ or s) What you should notice is that the means changed dramatically and the two groups are beginning to separate, indicating that there may actually be a difference. There is no substitute for carefully collected, high-quality data and a large sample size.

Measures of Spread 3. The STANDARD DEVIATION (σ or s) Let’s go back to the small sample size data (mean +/- σ: athletes 264 +/- 30.6 ms, non-athletes 279 +/- 31.6 ms). [Figure: overlaid histograms; x-axis: Reaction time (ms); y-axis: Number of students (frequency)] How can we determine if there is a significant difference between these two groups?

The t-test assesses whether the means of two groups are statistically different from each other.

$t = \frac{\bar{X}_1 - \bar{X}_2}{SE}$, where SE = standard error of the difference between the two means.

Therefore, the t-value is related to how different the means are and how broad your data is. A high t-value is obviously what you hope for. Calculate the t-score.

Degrees of freedom (df) is the sum of the people in both groups minus 2: df = 25 + 25 - 2 = 48.
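A minimal sketch of the t-score calculation for the small-sample data (25 per group) follows. The slide's formula for the standard error did not survive, so the code assumes the common form SE = sqrt(s1²/n1 + s2²/n2); with equal group sizes this gives the same value as the pooled-variance form. `t_score` is a hypothetical helper name, and the inputs are the rounded summary values from the slides, so the result may differ slightly from one computed on the raw data.

```python
import math

def t_score(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """t = difference of the means / standard error of the difference.

    Assumed standard error: sqrt(sd_a^2 / n_a + sd_b^2 / n_b).
    """
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error

# Summary values from the slides: non-athletes 279 +/- 31.6, athletes 264 +/- 30.6.
n_non_athletes, n_athletes = 25, 25
t = t_score(279, 31.6, n_non_athletes, 264, 30.6, n_athletes)

# Degrees of freedom: people in both groups minus 2.
df = n_non_athletes + n_athletes - 2

print(f"t = {t:.3f}, df = {df}")
```

The t-value and df are the two numbers you then take to a t-table to look up the p-value.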

The null hypothesis vs the hypothesis
1. The hypothesis: Athletes will have a quicker reaction time than non-athletes.
2. The null hypothesis: The null hypothesis always states that there is no relationship between the two groups; here, that there is no difference in reaction time between athletes and non-athletes.

The p-value
1. The p-value is a number between 0 and 1.
2. It is the probability (hence the p-value) of getting a difference at least as large as the one in your data if the null hypothesis were true, that is, if there were really no difference between the groups.
3. Therefore, the smaller the p-value, the harder it is to explain your data by chance alone. Note that 1 minus the p-value is not the probability that there is a difference between the two groups.
4. In order for the data to support the hypothesis, must the p-value be high or low? The p-value should be low (<0.05), which says that there is less than a 5% chance of seeing a difference this large if there were no real difference between the two groups.

Statistical Significance When the p-value is less than 0.05, we say that the data is statistically significant, and there may be a real difference between the two groups. Be warned that just because p is less than 0.05 between two groups doesn’t mean that there is actually a difference. For example, if we find p < 0.05 for the reaction time experiment, it doesn’t mean that there is a definite difference between athletes and non-athletes. It only means that there is a difference in our data, but our data might be flawed, or there is not enough data yet (sample size too small), or we measured the data improperly, or the sampling wasn’t random, or the experiment was garbage, etc. Doubt is the greatest tool of any scientist (person).

How is the p-value determined? The p-value is found by using a standard t-table in combination with the t-value and the degrees of freedom previously determined. [Table: standard t-table]

Now you determine the p-value for your data.

1. Begin by choosing the dependent variable, grade for example. Since the t-test can only look at two groups at a time and there are four grades, we need to perform all the possible combinations (there was apparently only one 9th grader, so the sample size is too low to look at this grade):
10th vs 11th
10th vs 12th
11th vs 12th
We would also want to know if the mean of each group is significantly different from the actual value:
Actual value vs 10th
Actual value vs 11th
Actual value vs 12th
This needs to be done twice, once for the line estimation and once for the dots estimation!
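If it helps to keep track of the comparisons, the list above can be generated rather than typed out. This is only an illustrative sketch; the labels and the names `pairwise`, `vs_actual` and `all_tests` are assumptions, not part of the assignment.

```python
from itertools import combinations

grades = ["10th", "11th", "12th"]           # 9th grade dropped: only one student
estimations = ["line", "dots"]              # the analysis is done once for each

# Every grade against every other grade, without repeats.
pairwise = [f"{a} vs {b}" for a, b in combinations(grades, 2)]

# Every grade against the actual value.
vs_actual = [f"Actual value vs {g}" for g in grades]

# The full checklist: each comparison for each estimation task.
all_tests = [(e, c) for e in estimations for c in pairwise + vs_actual]

for estimation, comparison in all_tests:
    print(f"{estimation}: {comparison}")
```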

These are the tables you need to fill out:

Grade   Mean   SD   Variance
10th
11th
12th

Grades           Difference of means   Variability of groups   T-score   P-value
10th vs actual
11th vs actual
12th vs actual
10th vs 11th
10th vs 12th
11th vs 12th

Write a conclusion based on your analysis. Remember, just because p < 0.05, it doesn’t necessarily mean your hypothesis is supported!
