
1 Lecture 2 Dustin Lueker

2
• Center of the data
  ◦ Mean
  ◦ Median
  ◦ Mode
• Dispersion of the data (sometimes referred to as spread)
  ◦ Variance, standard deviation
  ◦ Interquartile range
  ◦ Range
STA 291 Winter 09/10 Lecture 2

3
• Mean
  ◦ Arithmetic average
• Median
  ◦ Midpoint of the observations when they are arranged in order, smallest to largest
• Mode
  ◦ Most frequently occurring value

4
• Sample size n
• Observations $x_1, x_2, \ldots, x_n$
• Sample mean "x-bar": $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

5
• Population size N
• Observations $x_1, x_2, \ldots, x_N$
• Population mean "mu": $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
• Note: this is for a finite population of size N

6
• The mean requires numerical values
  ◦ Only appropriate for quantitative data
  ◦ Does not make sense to compute the mean for nominal variables
  ◦ Can be calculated for ordinal variables, but this does not always make sense
    ▪ Be careful when using the mean on ordinal variables
    ▪ Example: "Weather" on an ordinal scale, with Sun=1, Partly Cloudy=2, Cloudy=3, Rain=4, Thunderstorm=5, gives a mean (average) weather of 2.8
    ▪ Another example: "GPA = 3.8" is also a mean of observations measured on an ordinal scale

7
• The mean is the center of gravity for the data set
• The sum of the distances of the values above the mean from the mean is equal to the sum of the distances of the values below the mean from the mean
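The balance property can be checked numerically. The sketch below is only an illustration: the helper name `deviations_balance` is hypothetical, and the data set {7, 12, 11, 18} is borrowed from the mean example in this lecture.

```python
def deviations_balance(values):
    """Return the mean, the total distance of values above it,
    and the total distance of values below it."""
    mean = sum(values) / len(values)
    above = sum(x - mean for x in values if x > mean)
    below = sum(mean - x for x in values if x < mean)
    return mean, above, below


mean, above, below = deviations_balance([7, 12, 11, 18])
print(mean, above, below)  # 12.0 6.0 6.0
```

The two totals agree (6 on each side), which is what "center of gravity" means here: the deviations from the mean cancel out.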

8
• Mean
  ◦ Sum of observations divided by the number of observations
• Example
  ◦ {7, 12, 11, 18}
  ◦ Mean = (7 + 12 + 11 + 18)/4 = 48/4 = 12

9
• The mean is highly influenced by outliers
  ◦ Outliers are data points that are far from the rest of the data
• Not representative of a typical observation if the distribution of the data is highly skewed
  ◦ Example
    ▪ Monthly income for five people: 1,000; 2,000; 3,000; 4,000; 100,000
    ▪ Average monthly income = 110,000/5 = 22,000
    ▪ Not representative of a typical observation

10
• Median: the measurement that falls in the middle of the ordered sample
• When the sample size n is odd, there is a middle value
  ◦ It has the ordered index (n+1)/2
    ▪ The ordered index is where that value falls when the sample is listed from smallest to largest
    ▪ An index of 2 means the second smallest value
  ◦ Example
    ▪ 1.7, 4.6, 5.7, 6.1, 8.3
    ▪ n = 5, (n+1)/2 = 6/2 = 3, index = 3
    ▪ Median = 3rd smallest observation = 5.7

11
• When the sample size n is even, average the two middle values
  ◦ Example
    ▪ 3, 5, 6, 9
    ▪ n = 4, (n+1)/2 = 5/2 = 2.5, index = 2.5
    ▪ Median = midpoint between the 2nd and 3rd smallest observations = (5+6)/2 = 5.5
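The ordered-index rule for both the odd and the even case can be sketched in a few lines. This is a minimal illustration, not a library routine; the function name `median_by_index` is an assumption made for this example.

```python
def median_by_index(sample):
    """Median via the 1-based ordered index (n + 1) / 2."""
    ordered = sorted(sample)
    n = len(ordered)
    index = (n + 1) / 2            # 1-based position in the ordered sample
    lower = ordered[int(index) - 1]
    if index == int(index):        # odd n: a single middle value exists
        return lower
    upper = ordered[int(index)]    # even n: average the two middle values
    return (lower + upper) / 2


print(median_by_index([1.7, 4.6, 5.7, 6.1, 8.3]))  # 5.7
print(median_by_index([3, 5, 6, 9]))               # 5.5
```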

12
• For skewed distributions, the median is often a more appropriate measure of central tendency than the mean
• The median usually better describes a "typical value" when the sample distribution is highly skewed
• Example
  ◦ Monthly income for five people: 1,000; 2,000; 3,000; 4,000; 100,000
  ◦ Median monthly income: 3,000
    ▪ Does this better describe a "typical value" in the data set than the mean of 22,000?
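The income example can be reproduced with Python's standard `statistics` module. This is a small sketch; the variable names are chosen for illustration only.

```python
import statistics

# Monthly income for five people; the last value is the outlier.
incomes = [1_000, 2_000, 3_000, 4_000, 100_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
print(mean_income, median_income)  # 22000 3000

# Dropping the outlier moves the mean a long way, the median only slightly.
without_outlier = incomes[:-1]
print(statistics.mean(without_outlier), statistics.median(without_outlier))
```

Only one of the five incomes is anywhere near 22,000, while 3,000 sits in the middle of the other four, which is why the median reads as the more "typical" value here.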

13
• Summary
  ◦ Mean: arithmetic average
  ◦ Median: midpoint of the observations when they are arranged in increasing order
  ◦ Mode: most frequent value
• Notation: subscripted variables
  ◦ n = number of units in the sample
  ◦ N = number of units in the population
  ◦ x = variable to be measured
  ◦ $x_i$ = measurement of the i-th unit

14
• The trimmed mean is a compromise between the median and the mean
  ◦ Calculating the trimmed mean
    ▪ Order the data from smallest to largest
    ▪ Delete a selected number of values from each end of the ordered list
    ▪ Find the mean of the remaining values
  ◦ The trimming percentage is the percentage of values that have been deleted from each end of the ordered list
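The three steps above can be sketched as follows. The function name `trimmed_mean` and its `trim_count` parameter (the number of values deleted from each end) are assumptions made for this illustration; the income data is reused from the earlier example.

```python
def trimmed_mean(values, trim_count):
    """Mean after deleting `trim_count` values from each end of the ordered list."""
    if trim_count < 0 or 2 * trim_count >= len(values):
        raise ValueError("trim_count must leave at least one value")
    ordered = sorted(values)                           # step 1: order the data
    kept = ordered[trim_count:len(ordered) - trim_count]  # step 2: trim both ends
    return sum(kept) / len(kept)                       # step 3: mean of the rest


incomes = [1_000, 2_000, 3_000, 4_000, 100_000]
# Deleting 1 of 5 values from each end is a 20% trimming percentage.
print(trimmed_mean(incomes, 1))  # 3000.0
print(trimmed_mean(incomes, 0))  # 22000.0, the ordinary mean
```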

15
• Example: Highest Degree Completed

Highest Degree                                          Frequency   Percentage
Not a high school graduate                              38,012      21.4
High school only                                        65,291      36.8
Some college, no degree                                 33,191      18.7
Associate, Bachelor, Master, Doctorate, Professional    41,124      23.2
Total                                                   177,618     100

16
• n = 177,618
• (n+1)/2 = 88,809.5
• Median = midpoint between the 88,809th smallest and 88,810th smallest observations
  ◦ Both are in the category "High school only"
• The mean would not make sense here since the variable is only ordinal
• Median
  ◦ Can be used for interval data and for ordinal data
  ◦ Cannot be used for nominal data because the observations cannot be ordered on a scale
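One way to locate the median category is to walk the cumulative frequencies until the middle positions are reached. The sketch below assumes the frequency table from the degree example; the function name `median_categories` is hypothetical.

```python
import math


def median_categories(freq_table):
    """Return the categories holding the two middle ordered positions.

    `freq_table` is a list of (category, count) pairs in their ordinal order.
    For odd n both positions coincide.
    """
    n = sum(count for _, count in freq_table)
    index = (n + 1) / 2
    positions = (math.floor(index), math.ceil(index))

    def category_at(position):
        cumulative = 0
        for category, count in freq_table:
            cumulative += count
            if position <= cumulative:
                return category
        raise ValueError("position exceeds sample size")

    return tuple(category_at(p) for p in positions)


degrees = [
    ("Not a high school graduate", 38_012),
    ("High school only", 65_291),
    ("Some college, no degree", 33_191),
    ("Associate, Bachelor, Master, Doctorate, Professional", 41_124),
]
print(median_categories(degrees))
# ('High school only', 'High school only')
```

The first category covers positions 1 to 38,012 and the second covers 38,013 to 103,303, so both the 88,809th and 88,810th observations fall in "High school only".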

17
• Mean
  ◦ Interval data with an approximately symmetric distribution
• Median
  ◦ Interval data
  ◦ Ordinal data
• The mean is sensitive to outliers; the median is not

18
• Symmetric distribution
  ◦ Mean = Median
• Skewed distribution
  ◦ The mean lies farther in the direction in which the distribution is skewed

19
• Disadvantage of the median
  ◦ Insensitive to changes within the lower or upper half of the data
  ◦ Example: both samples have median 3, although their means are 3 and 41.2
    ▪ 1, 2, 3, 4, 5
    ▪ 1, 2, 3, 100, 100
  ◦ Sometimes, the mean is more informative even when the distribution is skewed

20
• Keeneland Sales

21
• Range: difference between the largest and smallest observation
  ◦ Very much affected by outliers
    ▪ A misrecorded observation may lead to an outlier, and affect the range
• The range does not always reveal different variation about the mean

22
• Sample 1
  ◦ Smallest observation: 112
  ◦ Largest observation: 797
  ◦ Range = 797 - 112 = 685
• Sample 2
  ◦ Smallest observation: 15033
  ◦ Largest observation: 16125
  ◦ Range = 16125 - 15033 = 1092

23
• The p-th percentile ($L_p$) is a number such that p% of the observations take values below it, and (100-p)% take values above it
  ◦ 50th percentile = median
  ◦ 25th percentile = lower quartile
  ◦ 75th percentile = upper quartile
• The index of $L_p$
  ◦ (n+1)p/100

24
• 25th percentile
  ◦ Lower quartile
  ◦ Q1
  ◦ (Approximately) the median of the observations below the median
• 75th percentile
  ◦ Upper quartile
  ◦ Q3
  ◦ (Approximately) the median of the observations above the median

25
• Find the 25th percentile of this data set
  ◦ {3, 7, 12, 13, 15, 19, 24}

26
• Interpolation: use when the index is not a whole number
• Go to the closest whole index below, then move the decimal fraction of the distance toward the next value
• If the index is found to be 5.4, go to the 5th value, then add 0.4 of the difference between the 5th value and the 6th value
  ◦ In essence we are going to the 5.4th value

27
• Find the 40th percentile of the same data set
  ◦ {3, 7, 12, 13, 15, 19, 24}
• Must use interpolation
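Both percentile exercises follow from the index rule (n+1)p/100 combined with interpolation. The sketch below is one possible implementation under that rule; the function name `percentile` is hypothetical, and clamping to the smallest or largest value when the index falls outside 1..n is an assumption that the slides do not address. Note that statistical software often uses other percentile conventions, so results may differ slightly elsewhere.

```python
def percentile(sample, p):
    """p-th percentile using the index (n + 1) * p / 100 with linear interpolation."""
    ordered = sorted(sample)
    n = len(ordered)
    index = (n + 1) * p / 100      # 1-based, possibly fractional
    whole = int(index)             # closest whole index below
    fraction = index - whole       # decimal part of the index
    if whole < 1:                  # assumption: clamp below the first value
        return ordered[0]
    if whole >= n:                 # assumption: clamp above the last value
        return ordered[-1]
    lower = ordered[whole - 1]
    upper = ordered[whole]
    return lower + fraction * (upper - lower)


data = [3, 7, 12, 13, 15, 19, 24]
print(percentile(data, 25))            # index 2, the 2nd smallest value
print(round(percentile(data, 40), 2))  # index 3.2, between the 3rd and 4th values
```

For the 25th percentile the index is (7+1)(25)/100 = 2, giving 7. For the 40th percentile the index is 3.2, so the result is 12 + 0.2(13 - 12) = 12.2, up to floating-point rounding.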

28
• Five Number Summary
  ◦ Minimum
  ◦ Lower quartile
  ◦ Median
  ◦ Upper quartile
  ◦ Maximum
• Example
  ◦ Minimum = 4
  ◦ Q1 = 256
  ◦ Median = 530
  ◦ Q3 = 1105
  ◦ Maximum = 320,000
• What does this suggest about the shape of the distribution?

29
• The Interquartile Range (IQR) is the difference between the upper and lower quartiles
  ◦ IQR = Q3 - Q1
  ◦ IQR = range of values that contains the middle 50% of the data
  ◦ IQR increases as variability increases
• Murder Rate Data
  ◦ Q1 = 3.9
  ◦ Q3 = 10.3
  ◦ IQR = 10.3 - 3.9 = 6.4

30
• A box plot displays the five number summary (and more) graphically
• It consists of
  ◦ A box that contains the central 50% of the distribution (from lower quartile to upper quartile)
  ◦ A line within the box that marks the median
  ◦ Whiskers that extend to the maximum and minimum values
• This assumes there are no outliers in the data set

31
• An observation is an outlier if it falls
  ◦ more than 1.5 IQR above the upper quartile, or
  ◦ more than 1.5 IQR below the lower quartile

32
• Whiskers only extend to the most extreme observations within 1.5 IQR beyond the quartiles
• If an observation is an outlier, it is marked by an x, +, or some other identifier

33
• Values
  ◦ Min = 148
  ◦ Q1 = 158
  ◦ Median = Q2 = 162
  ◦ Q3 = 182
  ◦ Max = 204
• Create a box plot
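Before drawing the box plot it helps to check whether the whiskers can run all the way to the minimum and maximum. The sketch below applies the 1.5 IQR rule to the five values given; the function name `outlier_fences` is an assumption made for this example.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) under the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


minimum, q1, median, q3, maximum = 148, 158, 162, 182, 204
lower_fence, upper_fence = outlier_fences(q1, q3)
print(lower_fence, upper_fence)  # 122.0 218.0

# Neither extreme lies beyond a fence, so the whiskers reach the min and max.
has_outliers = minimum < lower_fence or maximum > upper_fence
print(has_outliers)  # False
```

With IQR = 24 the fences are 122 and 218. Both 148 and 204 lie inside them, so no observation is flagged as an outlier.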

34
• On right-skewed distributions, the minimum, Q1, and median will be "bunched up", while Q3 and the maximum will be farther away.
• For left-skewed distributions, the "mirror" is true: the maximum, Q3, and the median will be relatively close compared to the corresponding distances to Q1 and the minimum.
• Symmetric distributions?

35
• Statistics that describe variability
  ◦ Two distributions may have the same mean and/or median but different variability
    ▪ Mean and median only describe a typical value, but not the spread of the data
  ◦ Range
  ◦ Variance
  ◦ Standard deviation
  ◦ Interquartile range
• All of these can be computed for the sample or population

36
• The deviation of the i-th observation $x_i$ from the sample mean $\bar{x}$ is the difference between them, $x_i - \bar{x}$
  ◦ The sum of all deviations is zero: $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
  ◦ Therefore, we use either the sum of the absolute deviations or the sum of the squared deviations as a measure of variation

37
• The variance of n observations is the sum of the squared deviations, divided by n-1: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

38
1. Calculate the mean
2. For each observation, calculate the deviation
3. For each observation, calculate the squared deviation
4. Add up all the squared deviations
5. Divide the result by (n-1), or by N if you are finding the population variance
(To get the standard deviation, take the square root of the result)

39
Observation   Mean   Deviation   Squared Deviation
1             5      -4          16
3             5      -2          4
4             5      -1          1
7             5      2           4
10            5      5           25

Sum of the Squared Deviations = 50
n-1 = 4
Sum of the Squared Deviations / (n-1) = 50/4 = 12.5
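The five-step recipe can be followed line by line in code for the observations 1, 3, 4, 7, 10. This is a sketch for illustration; the function name `sample_variance` is hypothetical, and the result is cross-checked against the standard library's `statistics.variance`, which also divides by n-1.

```python
import math
import statistics


def sample_variance(observations):
    """Sample variance computed step by step."""
    n = len(observations)
    mean = sum(observations) / n                   # 1. mean
    deviations = [x - mean for x in observations]  # 2. deviations
    squared = [d ** 2 for d in deviations]         # 3. squared deviations
    total = sum(squared)                           # 4. sum of squared deviations
    return total / (n - 1)                         # 5. divide by n - 1


data = [1, 3, 4, 7, 10]
variance = sample_variance(data)
std_dev = math.sqrt(variance)  # square root gives the standard deviation
print(variance)                # 12.5
print(round(std_dev, 3))       # 3.536
print(statistics.variance(data))
```

The mean is 5, the squared deviations are 16, 4, 1, 4, 25, and their sum of 50 divided by 4 gives 12.5.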

40
• The variance is approximately the average of the squared deviations
  ◦ "Average squared distance from the mean"
• Unit
  ◦ Square of the unit for the original data
• Difficult to interpret
  ◦ Solution
    ▪ Take the square root of the variance, and the unit is the same as for the original data
    ▪ This is the standard deviation

41
• s ≥ 0
  ◦ s = 0 only when all observations are the same
• If data is collected for the whole population instead of a sample, then n-1 is replaced by N
• s is sensitive to outliers

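42
• Sample
  ◦ Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
  ◦ Standard deviation: $s = \sqrt{s^2}$
• Population
  ◦ Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
  ◦ Standard deviation: $\sigma = \sqrt{\sigma^2}$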

43
• Population mean and population standard deviation are denoted by the Greek letters μ (mu) and σ (sigma)
  ◦ They are unknown constants that we would like to estimate
• Sample mean and sample standard deviation are denoted by $\bar{x}$ and s
  ◦ They are random variables, because their values vary according to the random sample that has been selected

44
• If the data is approximately symmetric and bell-shaped, then
  ◦ About 68% of the observations are within one standard deviation of the mean
  ◦ About 95% of the observations are within two standard deviations of the mean
  ◦ About 99.7% of the observations are within three standard deviations of the mean

46
• SAT scores are scaled so that they have an approximately bell-shaped distribution with a mean of 500 and a standard deviation of 100
  ◦ About 68% of the scores are between 400 and 600
  ◦ About 95% of the scores are between 300 and 700
  ◦ If you have a score above 700, you are in the top 2.5%
    ▪ What percentile would this be?
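The SAT numbers follow directly from the 68/95/99.7 rule. The sketch below computes the intervals for a mean of 500 and standard deviation of 100; the function name `empirical_rule_intervals` is an assumption, and the percentages are the approximate bell-shaped values quoted in the lecture, not exact normal probabilities.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals within 1, 2, and 3 standard deviations of the mean."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


intervals = empirical_rule_intervals(500, 100)
print(intervals[1])  # (400, 600), about 68% of scores
print(intervals[2])  # (300, 700), about 95% of scores
print(intervals[3])  # (200, 800), about 99.7% of scores

# About 5% of scores fall outside (300, 700); by symmetry half of those are above 700.
share_above_700 = (100 - 95) / 2
print(share_above_700)        # 2.5
print(100 - share_above_700)  # 97.5, the approximate percentile of a score of 700
```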

47
• According to the National Association of Home Builders, the U.S. nationwide median selling price of homes sold in 1995 was $118,000
  ◦ Would you expect the mean to be larger, smaller, or equal to $118,000?
  ◦ Which of the following is the most plausible value for the standard deviation? (a) -15,000, (b) 1,000, (c) 45,000, (d) 1,000,000

48
• Experiment
  ◦ Any activity from which an outcome, measurement, or other such result is obtained
• Random (or Chance) Experiment
  ◦ An experiment with the property that the outcome cannot be predicted with certainty
• Outcome
  ◦ Any possible result of an experiment
• Sample Space
  ◦ Collection of all possible outcomes of an experiment
• Event
  ◦ A specific collection of outcomes
• Simple Event
  ◦ An event consisting of exactly one outcome

49
• Examples
  ◦ Experiment
    1. Flip a coin
    2. Flip a coin 3 times
    3. Roll a die
    4. Draw a SRS of size 50 from a population
  ◦ Sample Space
    1.
    2.
    3.
    4.
  ◦ Event
    1.
    2.
    3.
    4.

50
• Let A denote an event
• Complement of an event A
  ◦ Denoted by $A^C$: all the outcomes in the sample space S that do not belong to the event A
  ◦ $P(A^C) = 1 - P(A)$
• Example
  ◦ If someone completes 64% of his passes, then what percentage is incomplete?

51
• Let A and B denote two events
• Union of A and B
  ◦ A ∪ B
  ◦ All the outcomes in S that belong to at least one of A or B
• Intersection of A and B
  ◦ A ∩ B
  ◦ All the outcomes in S that belong to both A and B

52
• Let A and B be two events in a sample space S
  ◦ P(A∪B) = P(A) + P(B) - P(A∩B)

53
• Let A and B be two events in a sample space S
  ◦ P(A∪B) = P(A) + P(B) - P(A∩B)
• At State U, all first-year students must take chemistry and math. Suppose 15% fail chemistry, 12% fail math, and 5% fail both. If a first-year student is selected at random, what is the probability that the student failed at least one course?
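The State U question is a direct application of the addition rule, with A = "fails chemistry" and B = "fails math". The sketch below uses exact fractions to avoid floating-point noise; the function name `probability_of_union` is hypothetical.

```python
from fractions import Fraction


def probability_of_union(p_a, p_b, p_a_and_b):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b


p_chem = Fraction(15, 100)  # P(fails chemistry)
p_math = Fraction(12, 100)  # P(fails math)
p_both = Fraction(5, 100)   # P(fails both)

p_at_least_one = probability_of_union(p_chem, p_math, p_both)
print(p_at_least_one, float(p_at_least_one))  # 11/50 0.22
```

Subtracting P(A∩B) matters because students who fail both courses are counted once in the 15% and again in the 12%.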

54
• Let A and B denote two events
• A and B are disjoint (mutually exclusive) events if there are no outcomes common to both A and B
  ◦ A∩B = Ø
    ▪ Ø = empty set or null set
• Let A and B be two disjoint events in a sample space S
  ◦ P(A∪B) = P(A) + P(B)

55
• The probability of an event occurring is nothing more than a value between 0 and 1
  ◦ 0 implies the event will never occur
  ◦ 1 implies the event will always occur
• How do we go about figuring out probabilities?

56
• Assigning probabilities can be difficult
• Different approaches to assigning probabilities to events
  ◦ Subjective
  ◦ Objective
    ▪ Equally likely outcomes (classical approach)
    ▪ Relative frequency

57
• Subjective probability relies on a person to make a judgment on how likely an event is to occur
  ◦ Events of interest are usually events that cannot be replicated easily or cannot be modeled with the equally likely outcomes approach
    ▪ As such, these values will most likely vary from person to person
• The only rule for a subjective probability is that the probability of the event must be a value in the interval [0,1]

58
• The equally likely approach usually relies on symmetry to assign probabilities to events
  ◦ As such, previous research or experiments are not needed to determine the probabilities
    ▪ Suppose that an experiment has only n outcomes
    ▪ The equally likely approach to probability assigns a probability of 1/n to each of the outcomes
    ▪ Further, if an event A is made up of m outcomes, then P(A) = m/n

59
• Selecting a simple random sample of 2 individuals
  ◦ Each pair has an equal probability of being selected
• Rolling a fair die
  ◦ Probability of rolling a "4" is 1/6
    ▪ This does not mean that whenever you roll the die 6 times, you always get exactly one "4"
  ◦ Probability of rolling an even number
    ▪ 2, 4, and 6 are all even, so we have 3 possible outcomes in the event we want to examine
    ▪ Thus the probability of rolling an even number is 3/6 = 1/2
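The rule P(A) = m/n can be sketched by counting outcomes. The function name `classical_probability` is an assumption made for illustration, and the calculation is only valid when every outcome in the sample space is equally likely, as with a fair die.

```python
from fractions import Fraction


def classical_probability(event, sample_space):
    """P(A) = m / n for equally likely outcomes."""
    outcomes = set(sample_space)
    favorable = set(event) & outcomes  # ignore anything outside the sample space
    return Fraction(len(favorable), len(outcomes))


die = {1, 2, 3, 4, 5, 6}
print(classical_probability({4}, die))        # 1/6
print(classical_probability({2, 4, 6}, die))  # 1/2
```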

60
• The relative frequency approach borrows from calculus' concept of the limit
  ◦ We cannot repeat an experiment infinitely many times, so instead we use a "large" n
    ▪ Process
      1. Repeat an experiment n times
      2. Record the number of times an event A occurs, and denote this value by a
      3. Calculate the value of a/n

61
• "Large" n?
  ◦ Law of Large Numbers
    ▪ As the number of repetitions of a random experiment increases, the chance that the relative frequency of occurrence for an event will differ from the true probability of the event by more than any small number approaches 0
    ▪ Doing a large number of repetitions allows us to accurately approximate the true probabilities using the results of our repetitions
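A small simulation can illustrate the relative frequency process and the Law of Large Numbers. The sketch below simulates rolls of a fair die and tracks the event "roll a 4"; the function name `relative_frequency`, the fixed seed, and the chosen values of n are all assumptions made so the example is repeatable.

```python
import random


def relative_frequency(n, seed=0):
    """Simulate n rolls of a fair die and return a / n for the event 'roll a 4'."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    a = sum(1 for _ in range(n) if rng.randint(1, 6) == 4)
    return a / n


true_probability = 1 / 6
for n in (10, 100, 10_000, 100_000):
    freq = relative_frequency(n)
    print(n, freq, abs(freq - true_probability))
```

For small n the relative frequency can be far from 1/6, while for large n it is very likely to be close. The law says this in terms of probability: a large gap becomes increasingly unlikely, not impossible.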

