1 Outliers Outliers are data points that are not like many of the other points, or values. Here we learn about some tools to detect them.

Presentation transcript:

1 Outliers Outliers are data points that are not like many of the other points, or values. Here we learn about some tools to detect them.

2 [Figure: number line of age with the sample mean marked.] I want to use an example here to introduce some ideas. Say a sample of data has been taken and age was one of the variables. Here I have a number line with age as the variable. Let’s say the mean, or average, age in the sample is 22. Let’s also say the sample standard deviation works out to be 2.

3 Would you like a cookie? Parker, what a silly question! I ask it because a cookie means 1 cookie. If you like the cookie, maybe you can have more. Well, in stats we say a standard deviation. This means 1 standard deviation. The standard deviation is calculated from the data. In the example on age I said the standard deviation was 2. In another data set a standard deviation might be 4. The standard deviation potentially changes from data set to data set. In our example, again I say the standard deviation was 2. Now, for another silly idea.

4 Digress Say you have a board 60 inches long but you only want it to be 48 inches long. How many inches do you cut off? 60 – 48 = 12. How many feet do you cut off? 12 inches / 12 inches per foot = 1 foot. We do something close to this. The problem is that a standard deviation is not exactly like a foot. A foot is always 12 inches. A standard deviation can change from problem to problem.

5 Z-score Imagine a little guy with a thick accent comes up to you at the ball game and says, “what is z-score?” You might say, “3 to 2, we’re up.” This has nothing to do with stats, but was fun to type. Think about our age example where the mean was 22 and the standard deviation was 2. A z-score tells us how far a data point is from the mean, but in standard deviation units (in feet, not inches). Example: 25 is 3 from the mean, but since the standard deviation is 2, 25 is only 1.5 standard deviations from the mean.

6 z scores 19 is also 1.5 standard deviations from the mean. Note 19 is on the low side and 25 is on the high side of the mean. In general, to get a z-score for a data point do this: 1) Take the point minus the mean. 2) Divide the result by the standard deviation. In notation form, $z = (x_i - \bar{x})/s$, where $x_i$ is the data point, $\bar{x}$ is the sample mean, and $s$ is the sample standard deviation.

7 [Figure: number line of age with the sample mean marked.] Age 25 is (25 – 22)/2 = 1.5 standard deviations above the mean. Age 19 has z = (19 – 22)/2 = –1.5, so it is 1.5 standard deviations below the mean. A z that calculates out with a negative sign signifies the value is below the mean; a positive z signifies the value is above the mean.
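Here is a small Python sketch of the two-step recipe, using the mean of 22 and standard deviation of 2 from the age example. The function name `z_score` is just a name picked for illustration.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies from the mean."""
    # Step 1: take the point minus the mean.
    distance = x - mean
    # Step 2: divide the result by the standard deviation.
    return distance / sd


mean_age = 22  # sample mean from the age example
sd_age = 2     # sample standard deviation from the age example

for age in (25, 19, 22):
    print(age, z_score(age, mean_age, sd_age))
# 25 1.5
# 19 -1.5
# 22 0.0
```

The sign carries the direction (below or above the mean) and the size carries the distance in standard deviation units.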

8 More notes about the z-score: 1) When a data point equals the sample mean, z = 0. 2) Z is a measure of relative location: standard deviations from the mean. Examples of two data sets: Data set a – mean = 22, standard deviation = 2. Data set b – mean = 41, standard deviation = 3. The values 24 (in data set a) and 44 (in data set b) are similar in that both have z = 1, meaning they are both 1 standard deviation above their mean.
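To see relative location in action, here is a short sketch comparing a value from each data set. It assumes a standard deviation of 3 for data set b, which is the value that puts 44 exactly one standard deviation above a mean of 41. The dictionary layout is only for illustration.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / sd


data_set_a = {"mean": 22, "sd": 2}
data_set_b = {"mean": 41, "sd": 3}  # sd = 3 is an assumption, see the note above

z_a = z_score(24, **data_set_a)
z_b = z_score(44, **data_set_b)
print(z_a, z_b)                   # 1.0 1.0
print(z_score(22, **data_set_a))  # 0.0, a point equal to the mean has z = 0
```

The raw values 24 and 44 look nothing alike, but their z-scores say they sit in the same relative spot in their own data sets.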

9 Chebyshev’s Theorem At least $1 - 1/z^2$ of the data values must be within z standard deviations of the mean, where z is any value greater than 1. Calculation example: For z = 2, we have 1 – (1/4) = 3/4, or 0.75, or 75%. For z = 3, we have 1 – (1/9) = 8/9 ≈ 0.89, or 89%. Note that the statement of the theorem uses the word within. This means that we can be on either side of the mean. Example: At least what percent of scores are within 2.4 standard deviations of the mean, when the mean is 70 and the standard deviation is 5? $1 - 1/2.4^2 \approx 0.826$, or 82.6%. So at least 82.6% of the scores are between 70 – 2.4(5) = 58 and 70 + 2.4(5) = 82.
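The Chebyshev calculations above can be sketched in a few lines of Python. The function name `chebyshev_lower_bound` is made up for this illustration; the mean of 70 and standard deviation of 5 come from the example on the slide.

```python
def chebyshev_lower_bound(z):
    """Smallest share of data guaranteed to lie within z standard deviations."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem needs z greater than 1")
    return 1 - 1 / z**2


for z in (2, 3, 2.4):
    print(z, round(chebyshev_lower_bound(z), 3))
# 2 0.75
# 3 0.889
# 2.4 0.826

# The interval that goes with the z = 2.4 example (mean 70, standard deviation 5).
mean, sd, z = 70, 5, 2.4
low, high = mean - z * sd, mean + z * sd
print(round(low, 1), round(high, 1))
```

Keep in mind this is an "at least" statement. The actual share of data inside the interval can be a good deal higher.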

10 Empirical rule based on the normal distribution Chevy Chase’s theorem (OK Chebyshev’s) was general. When data follow a normal distribution, a more specific rule applies. Thought experiment: Imagine you put your eyes just above the top of a table. I then drop sugar in a small stream, with a steady hand, onto the table. The sugar will start to pile up like this picture below. Why does the sugar pile up like this? I have no idea! But it seems normal!

11 I have drawn here a histogram and I have put on top of it a bell shaped curve. In the case where you can put a bell shaped curve on top of a histogram to approximate the distribution of the variable, then the variable is called normal.

12 The 68-95-99.7 rule This rule allows us to say approximately 68% of the people in the data set have a value on the variable within 1 standard deviation of the mean. Approximately 95% have a value within 2 standard deviations of the mean, and approximately 99.7% have a value within 3 standard deviations of the mean. Let’s look at this idea again in the context of an example. Say we asked a whole bunch of people how many ounces of Mt. Dew they consume each year. Say the responses follow a normal distribution with mean = 5480 and standard deviation = 480.
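If you want to check where the 68, 95 and 99.7 come from, here is a sketch that uses `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later). The helper name `share_within` is invented for this illustration.

```python
from statistics import NormalDist

# The standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The exact shares are roughly 68.3%, 95.4% and 99.7%, which is why the rule is stated with the word "approximately".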

13 The rule again [Figure: normal curve over ounces of Mt. Dew per year, with the percentage bands of the rule marked.]

14 The rule So, by the rule we know that about 68% of the people in the data set consume between 5000 and 5960 ounces of Mt. Dew (ozs of MD) per year. By the rule we know that about 95% of the people in the data set consume between 4520 and 6440 ounces of Mt. Dew per year.
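The interval endpoints are just the mean plus or minus 1, 2 or 3 standard deviations. Here is a sketch with the Mt. Dew numbers (mean 5480, standard deviation 480); the function name `interval` is chosen only for this example.

```python
mean_oz = 5480  # mean ounces of Mt. Dew per year, from the example
sd_oz = 480     # standard deviation, from the example


def interval(mean, sd, k):
    """Return the (low, high) values k standard deviations around the mean."""
    return (mean - k * sd, mean + k * sd)


for k, share in ((1, "68%"), (2, "95%"), (3, "99.7%")):
    low, high = interval(mean_oz, sd_oz, k)
    print(f"about {share} fall between {low} and {high} ounces")
# about 68% fall between 5000 and 5960 ounces
# about 95% fall between 4520 and 6440 ounces
# about 99.7% fall between 4040 and 6920 ounces
```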

15 Outlier – A data point that has a z less than –3 or greater than +3 is likely to be an outlier. This means the data point is really not like the other points. Maybe the point should not be included in the statistical analysis. Chebyshev’s Theorem and the empirical rule tell us about where most of the data should be. In this sense these rules can also assist us in thinking about data points that really do not belong. Say I want to take 72 minus 43, but I type 72 minus 34 into my calculator. I would get 38, but I really wanted to get 29. The 38 is off by 9 from what I wanted. Accounting folks know that if you are off by 9, or a multiple of 9, you should check to see if you made a transposition error. In stats, the first thing we should do with an outlier is check the records and make sure a data entry error was not made. If it was an entry error – fix it! Otherwise, you may want to consider disregarding that data point.
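Here is a rough sketch of both checks from this slide: flagging points with z below –3 or above +3, and testing whether a difference is a multiple of 9. The list of ages is hypothetical (made up for this illustration), the mean of 22 and standard deviation of 2 are taken as given from the age example, and the function names are invented.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / sd


def flag_outliers(values, mean, sd, cutoff=3):
    """Return the values whose z-score is below -cutoff or above +cutoff."""
    return [x for x in values if abs(z_score(x, mean, sd)) > cutoff]


def may_be_transposition(entered, intended):
    """Swapping two neighboring digits changes a number by a multiple of 9."""
    difference = abs(entered - intended)
    return difference != 0 and difference % 9 == 0


# Hypothetical ages; the 92 might be a 29 with its digits swapped.
ages = [19, 22, 25, 28, 92]

print(flag_outliers(ages, mean=22, sd=2))  # [92]
print(may_be_transposition(38, 29))        # True, the calculator example
print(may_be_transposition(92, 29))        # True, worth checking the records
```

A flag from code like this is only a prompt to go look at the records. It does not decide on its own whether the point belongs in the analysis.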